Abstract
Evapotranspiration is a key variable for hydrologic, climatic, agricultural, and environmental studies. Because direct measurement methods that are economically and technically easy to implement are not available, evapotranspiration is estimated primarily through empirical and regression models and machine learning algorithms that incorporate conventional meteorological variables. While the FAO-56 Penman-Monteith equation has been recognized worldwide as the most accurate equation to estimate the reference evapotranspiration (ETo), the number of required climatic variables makes its application questionable for regions with limited ground-based climate data. This note provides a summary of empirical and semi-empirical equations linked to their data requirements and the problems associated with these models (transferability and data quality), an overview of regression models, the potential of machine learning algorithms in regression tasks, trends in reference evapotranspiration studies, and some recommendations on the topics that future research should address to further improve the performance and generalization of the available models. The terminology used in this note is kept consistent across both the theoretical and practical fields of evapotranspiration, where it is often dispersed in the academic literature. The goal of this note is to provide some perspective to stimulate discussion.
HIGHLIGHTS
An overview of trends in ETo studies is presented.
The main limitation of FAO-56 Penman-Monteith is the large number of meteorological variables required.
There is a wide variety of empirical equations for ETo estimation.
The application of machine learning algorithms is increasing due to their high performance for ETo estimation.
Some aspects of ETo estimation methods are discussed and recommended.
INTRODUCTION
An overwhelming volume of scientific literature is available on evaporation, transpiration, and evapotranspiration (ET), and a large number of academic articles on ET estimation methods using empirical and regression models and machine learning algorithms have been published. Based on the Scopus (Elsevier) journal database (for the period 1800–2021), the number of online access documents (articles, book chapters, conference proceedings, reports, dissertations, etc.) containing the keyword ‘reference evapotranspiration’ amounts to 72,899. A search in this database using the keywords ‘estimation models’ and ‘reference evapotranspiration’ yields 31,427 results in the period 1966–2021, while the combination of the keywords ‘machine learning’ and ‘reference evapotranspiration’ yields 3,128 documents in the period 1971–2021. The keywords ‘hybrid data-driven machine learning techniques’ and ‘reference evapotranspiration’ return a total of 557 documents in the period 1998–2021. This simple analysis indicates that the scientific community still devotes considerable effort to improving and calibrating measuring techniques, empirical and model estimation methods, and artificial intelligence-based methods to measure and estimate ET at different time frequencies and spatial scales. In addition, a clear shift from the classical approaches to estimate reference evapotranspiration (ETo) using empirical equations and models to artificial intelligence-based methods is noticeable. Figure 1 shows the temporal evolution of the number of ETo articles published in the past decade in the core collection of the Web of Science (WoS) with ‘reference evapotranspiration’ in the title.
In the past decade, the total number of publications reached 804, and the annual output of ETo studies increased remarkably, almost tripling by 2020, with a mean production of 79 articles per year. This rising trend in the number of publications reflects a growing interest among scientists in ETo studies.
When faced with determining ET in the context of a project, the question arises of which method to apply. There is so much literature on ET that it is practically impossible to propose even a partial review. Because analysis of the literature on this subject is time consuming and costly, the idea arose to develop a communication that can be used as a guide in selecting the most suitable approach for a given study. This note is based on a detailed analysis of the literature published in the last decade and available in the WoS journal database. It is expected that the synthesized information will be a useful tool for water and climate researchers and practitioners when ETo is required. The goal of this note is not to arrive at any particular truth, but rather to stimulate lively discussion.
SOME GENERAL CONCEPTS
LIMITATIONS OF THE MEASUREMENT OF WEATHER VARIABLES
Currently, the distributed global network for eddy covariance flux measurements, ‘FLUXNET’ (www.fluxnet.org), is key to generating micrometeorological data (e.g., ET) for most of the terrestrial regions and biomes of the world with different climatology. However, the network density is very low in the Global South (which is roughly defined by latitude) and, unfortunately, in many remote regions a weather station has never been installed. Therefore, to integrate water resources research and management in those areas, approaches such as the FAO-56 Penman-Monteith are still needed.
The main limitation in calculating ETo with the FAO-56 Penman-Monteith method is that the full set of climatic variables needed is not measured at many weather stations worldwide. The quality with which the weather data are measured is another problem. The meteorological data obtained by different weather instruments/sensors are not free from flaws such as lack of reliability (solar radiation) and intermittent errors and questionable quality (relative humidity and wind speed). Temperature is the variable that is least prone to faulty sensor readings and is widely and easily available in many regions of the world.
To apply the FAO-56 Penman-Monteith equation under limited data conditions, classically missing solar radiation (Rs), relative humidity (RH), or wind speed (u), some guidelines have been established by Allen et al. (1998). Solar radiation and wind speed values from nearby weather stations with similar topography and climatic conditions can be used when local values are missing. As a second option, solar radiation can be calculated using the Hargreaves radiation formula as a function of the minimum and maximum temperature. When relative humidity data are lacking, the actual vapor pressure (ea) can be estimated by assuming that the dew point temperature (Tdew) is equal to the minimum temperature (Tmin), and when wind speed data are missing, the FAO-56 Penman-Monteith equation can be applied using the global average wind speed value of 2 m s−1.
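The three fallbacks described above can be written out as a minimal sketch. The formulas are those of FAO-56 (Allen et al. 1998) as recalled here and should be checked against the original before use. The function names, the default adjustment coefficient k_rs = 0.16 (a value commonly quoted for interior locations, with about 0.19 for coastal ones), and the constant name DEFAULT_WIND_SPEED are illustrative choices that do not appear in this note.

```python
import math

GSC = 0.0820  # solar constant in MJ m-2 min-1, as used in FAO-56
DEFAULT_WIND_SPEED = 2.0  # m s-1, fallback when wind speed is missing


def extraterrestrial_radiation(lat_deg, day_of_year):
    """Daily extraterrestrial radiation Ra in MJ m-2 day-1.

    Valid for non-polar latitudes, where the sunset hour angle exists.
    """
    phi = math.radians(lat_deg)
    angle = 2.0 * math.pi * day_of_year / 365.0
    dr = 1.0 + 0.033 * math.cos(angle)        # inverse relative Earth-Sun distance
    delta = 0.409 * math.sin(angle - 1.39)    # solar declination (rad)
    omega_s = math.acos(-math.tan(phi) * math.tan(delta))  # sunset hour angle
    return (24.0 * 60.0 / math.pi) * GSC * dr * (
        omega_s * math.sin(phi) * math.sin(delta)
        + math.cos(phi) * math.cos(delta) * math.sin(omega_s)
    )


def hargreaves_radiation(tmax, tmin, ra, k_rs=0.16):
    """Solar radiation Rs (same units as ra) from the daily temperature range."""
    return k_rs * math.sqrt(tmax - tmin) * ra


def saturation_vapour_pressure(temp_c):
    """Saturation vapour pressure in kPa at air temperature temp_c (deg C)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def actual_vapour_pressure_from_tmin(tmin):
    """Actual vapour pressure ea in kPa, assuming Tdew equals Tmin."""
    return saturation_vapour_pressure(tmin)
```

For a site at 20° S on day 246 of the year, extraterrestrial_radiation(-20.0, 246) returns about 32.2 MJ m−2 day−1, and actual_vapour_pressure_from_tmin(15.0) returns about 1.705 kPa.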
One should be aware of the fact that uncertainties of field measurement and meteorological variables could be large, primarily associated with instrument calibration, installation, operation, and maintenance. To overcome this issue, guidelines for quality control of weather data have been established. Meek & Hatfield (1994) and Allen (1996) developed screening rules and instructions guiding the decision when data/sensors should be scrutinized.
Continuous monitoring of weather stations is another issue. Worldwide, many stations have been abandoned or disassembled, while many FLUXNET stations have been installed during the past decade. Overall, the lack of spatial and temporal (long-term) weather data and the uncertainty of data quality are common and limit the application of the FAO-56 Penman-Monteith equation.
THE NEED FOR EMPIRICAL MODELS
Due to the lack of lysimeters and fully equipped weather stations to estimate ETo using the FAO-56 Penman-Monteith method, the application of empirical equations requiring fewer weather variables is pivotal for hydrological, ecohydrological, and biometeorological studies and applications. A large body of literature on empirical equations for the estimation of ETo is available. Based on their data requirements, the available equations can be subdivided into the following groups: temperature-, radiation-, and mass transfer-based methods (see Supplementary Table). Most ETo equations have been developed specifically for definite atmospheric conditions and for different temporal scales such as hourly, daily, or monthly. Hupet & Vanclooster (2001) demonstrated that low temporal sampling resolutions of meteorological variables (time-aggregation effect) tend to lead to overestimation of ETo. This highlights the paramount importance of using finer-scale monitoring resolutions. Some academic efforts were directed at adapting empirical equations from low to finer resolutions. For example, Pereira & Pruitt (2004) and Chang et al. (2019) attempted to modify the original monthly Thornthwaite temperature-based equation to estimate daily ETo.
The Supplementary Table lists the data requirement of each empirical and semi-empirical equation. In this way, the table serves as a guide for users to identify the methods that they can apply given the availability of weather data.
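As an illustration of how little input a temperature-based method needs, the following sketch implements the widely used Hargreaves-Samani equation, which requires only daily maximum and minimum temperature plus extraterrestrial radiation (itself a function of latitude and date). It is given here as an example of the temperature-based group; whether and in what form it appears in the Supplementary Table is not verified here, and the function and argument names are illustrative.

```python
import math


def hargreaves_samani_eto(tmax, tmin, ra_mj):
    """Daily ETo in mm/day from the Hargreaves-Samani equation.

    tmax, tmin: daily maximum and minimum air temperature (deg C)
    ra_mj: extraterrestrial radiation in MJ m-2 day-1
    """
    tmean = (tmax + tmin) / 2.0
    ra_mm = 0.408 * ra_mj  # convert MJ m-2 day-1 to equivalent evaporation, mm/day
    return 0.0023 * (tmean + 17.8) * math.sqrt(tmax - tmin) * ra_mm
```

With tmax = 30 °C, tmin = 20 °C, and ra_mj = 32.2 MJ m−2 day−1, the function returns roughly 4.1 mm/day.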
TRANSFERABILITY OF EMPIRICAL MODELS
As shown in the Supplementary Table, different empirical models to estimate ETo were developed using meteorological variables from weather stations at surface level, and they intrinsically assume the local conditions under which they were formulated. Some models work well in areas with similar climatological and environmental conditions. When such approaches are tested in other climatic conditions, their performance might be poor. This makes the transferability of models (i.e., their use beyond the spatial and temporal bounds of their underlying data) to other areas or time periods uncertain. Except for the FAO-56 Penman-Monteith method, the transfer of ETo models across geographic locations has failed, and the development of transferable models remains elusive. Empirical models will always have to be calibrated to the local conditions where they are applied. Models require a tradeoff between prediction bias and variance (homogenization versus non-transferability), and it is evident that for application and decision making (e.g., irrigation systems, catchment water balance), preference ought to be given to estimation models with high accuracy.
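One common way to carry out such a local calibration, shown here only as a hedged sketch, is to regress a reference ETo series on the output of the empirical equation and use the fitted line to correct the empirical estimates. The linear form, the function names, and the numbers below are assumptions made for illustration; the data are synthetic and do not come from any station.

```python
def fit_linear_calibration(eto_empirical, eto_reference):
    """Least-squares fit of eto_reference = a + b * eto_empirical."""
    n = len(eto_empirical)
    mean_x = sum(eto_empirical) / n
    mean_y = sum(eto_reference) / n
    sxx = sum((x - mean_x) ** 2 for x in eto_empirical)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(eto_empirical, eto_reference)
    )
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def apply_calibration(eto_empirical, a, b):
    """Correct empirical ETo estimates with the fitted coefficients."""
    return [a + b * x for x in eto_empirical]


# Synthetic example: the empirical equation overestimates the reference.
eto_emp = [2.0, 3.0, 4.0, 5.0, 6.0]          # mm/day, made-up values
eto_ref = [0.5 + 0.8 * x for x in eto_emp]   # mm/day, made-up reference
a, b = fit_linear_calibration(eto_emp, eto_ref)
eto_calibrated = apply_calibration(eto_emp, a, b)
```

The coefficients a and b obtained this way are valid only for the site and period used to fit them, which is exactly the transferability problem discussed above.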
THE ROLE OF REGRESSION MODELS AND MACHINE LEARNING ALGORITHMS
In recent decades, rapid advances in the application of regression models and machine learning (ML) have been made, and the scientific community has adopted these techniques for different purposes in the hydrology field (Lange & Sippel 2020). As found by Jing et al. (2019), there is a large and growing field of implementation of evolutionary computational models for ETo estimates. Regression models, in general terms, are methods that use observation records to quantify the relationship between a target variable (also called the dependent variable) and a set of independent variables (also called covariates). The following are classic examples of regression models: multiple linear regression, Bayesian regression, robust regression, and multivariate adaptive regression splines. ML, depending on the underlying algorithm that is used, can perform supervised and unsupervised learning and then build statistical models, determining trends and patterns, for data analysis and forecasting. ML algorithms (e.g., artificial neural network (ANN), support vector machine (SVM), and adaptive neuro-fuzzy inference system (ANFIS)) are able to learn implicitly from the input data and provide accurate predictions without having been specifically programmed for that task. A brief description of the most widely used regression models is given in the following.
The most widely applied regression model is multiple linear regression (MLR), which is a statistical approach for modelling the linear relationship between explanatory (independent) and response (dependent) variables. The main assumption in MLR is that the relationship between the dependent and independent variables is linear. It also assumes that there is no significant correlation between the independent variables. MLR can be considered an extension of simple ordinary least-squares (OLS) regression because it involves more than one explanatory variable (Eberly 2007).
Multivariate adaptive regression splines (MARS) is a non-parametric nonlinear regression model that explains the dependence of the response variable on one or more explanatory variables. Non-parametric modeling does not fit one single global function; instead, it fits several simple functions, usually low-order polynomials, each defined on a sub-region of the domain (parametric adjustment per section), or sets a simple function for each value of the variable (global setting). MARS is preferred because it can approximate complex nonlinear relationships from the data without postulating a hypothesis about the type of nonlinearity present. The model-building algorithm incorporates mechanisms that select the relevant explanatory variables, and the resulting model is comparatively easy to interpret and apply. Finally, the estimation of its parameters is computationally efficient and rapid (Friedman 1991).
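A minimal MLR sketch, written without external libraries, solves the normal equations for an intercept and one coefficient per explanatory variable. The helper names, the choice of temperature and solar radiation as predictors, and the synthetic data are assumptions made purely for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mlr(features, target):
    """Return [intercept, b1, b2, ...] minimising the squared error."""
    rows = [[1.0] + list(f) for f in features]  # prepend intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, target)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic data generated from ETo = 0.2 + 0.10 * T + 0.15 * Rs (made up).
features = [(15.0, 10.0), (20.0, 18.0), (25.0, 14.0),
            (30.0, 22.0), (18.0, 25.0), (27.0, 12.0)]
target = [0.2 + 0.10 * t + 0.15 * rs for t, rs in features]
coef = fit_mlr(features, target)
```

Because the synthetic target is exactly linear in the two predictors, the fit recovers the generating coefficients; with strongly correlated predictors the normal equations become ill-conditioned, which is the practical face of the no-correlation assumption mentioned above.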
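The idea of simple functions defined on sub-regions of the domain can be made concrete with the hinge functions that MARS models are built from. The sketch below only evaluates a model of that form; it does not implement the forward and backward selection steps of the MARS algorithm, and the knot and coefficients are hypothetical values chosen for illustration.

```python
def hinge(value):
    """Hinge basis function: zero below zero, linear above."""
    return max(0.0, value)


def mars_like_prediction(temp_c, knot=20.0, intercept=3.0,
                         slope_above=0.25, slope_below=-0.10):
    """Piecewise-linear response with a single knot (hypothetical values).

    Each hinge term is active on only one side of the knot, so the model
    has a different slope in each sub-region of the temperature domain.
    """
    return (intercept
            + slope_above * hinge(temp_c - knot)
            + slope_below * hinge(knot - temp_c))
```

At the knot (20 °C) the prediction equals the intercept, above it the response rises with slope 0.25, and below it the response changes with slope 0.10 per degree of distance from the knot.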
The basis of robust regression (RB) consists of assigning a weight to each data point, to counter OLS estimates which are extremely sensitive to outliers. Weighting is done automatically and iteratively through a process called ‘iteratively reweighted least squares’. In the first iteration, each point is assigned equal weight, and model coefficients are estimated using OLS. At subsequent iterations, weights are recomputed so that points farther from model predictions in the previous iteration are given a lower weight. Model coefficients are then recomputed using weighted least squares. The process continues until the values of the coefficient estimates converge within a specified tolerance (Khoshravesh et al. 2017).
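The iterative reweighting described above can be sketched for a single explanatory variable as follows. The Huber weight function, the tuning constant 1.345, and the scale estimate based on the median absolute residual are common choices assumed here; the text does not specify which weighting scheme was used, and the data are synthetic.

```python
def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])


def irls_line_fit(x, y, tuning=1.345, tol=1e-8, max_iter=50):
    """Iteratively reweighted least squares with Huber weights (assumed)."""
    weights = [1.0] * len(x)  # first iteration: equal weights, i.e. OLS
    intercept, slope = weighted_line_fit(x, y, weights)
    for _ in range(max_iter):
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        scale = median([abs(r) for r in residuals]) / 0.6745
        if scale == 0.0:
            break
        limit = tuning * scale
        # Points far from the current fit receive a lower weight.
        weights = [1.0 if abs(r) <= limit else limit / abs(r) for r in residuals]
        new_intercept, new_slope = weighted_line_fit(x, y, weights)
        converged = (abs(new_intercept - intercept) < tol
                     and abs(new_slope - slope) < tol)
        intercept, slope = new_intercept, new_slope
        if converged:
            break
    return intercept, slope


# Synthetic data around y = 1 + 0.5 x with one gross outlier at the end.
x = [float(i) for i in range(10)]
y = [1.0 + 0.5 * xi + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[9] = 20.0
ols_intercept, ols_slope = weighted_line_fit(x, y, [1.0] * len(x))
robust_intercept, robust_slope = irls_line_fit(x, y)
```

In this example the single outlier pulls the OLS slope well away from the generating value of 0.5, while the reweighted fit stays closer to it.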
Bayesian regression (BR), in simple terms, treats the unknown parameter θ as a random variable with probability distribution π(θ) (called the prior distribution) and infers it from the data y=(y1,y2,…,yn) using a statistical model described by a density function [l(y|θ)], called the likelihood function. The prior distribution expresses the beliefs about the parameter before examining the data. Given the observed data y, the beliefs about θ are updated by combining information from the prior distribution and the data through Bayes' theorem, which yields the posterior distribution, π(θ|y). The relative weight of the prior and the data in the posterior depends on their variances: (1) if the variance of the prior is smaller than that of the sample data, a higher weight is assigned to the prior, and (2) if the variance of the prior is larger than that of the sample data, a higher weight is assigned to the sample data (Khoshravesh et al. 2017).
Following the brief outline of the most frequently used regression models in the previous paragraphs, a brief description of the most popular ML algorithms used for prediction is given below. ANNs are considered a computational tool that emulates the function of neural networks in biological systems. ANNs extract the relationship between the inputs and outputs of a process without explicitly knowing the physical nature of the problem; the signal is transmitted through the network until an output is obtained. The ANN-based modelling procedure is, in general, divided into training, validation, and testing. The architecture of an ANN has an input layer (where data are introduced to the ANN), one or more hidden layers (where data are processed), and an output layer (where the results for the given inputs are provided). The advantage of the neural method lies in the possibility of improving the performance criteria by modifying the network architecture (Lange & Sippel 2020).
SVM is widely used for classification and regression problems in ML. For classification problems, SVM separates the data by class with a separating boundary (called a hyperplane) and, unlike in regression, creates a safety margin on both sides of the hyperplane (maximizing the margin). For regression problems, SVM finds the linear regression function that best approximates the output vector within an error tolerance. The advantage of SVM models is their flexibility in defining how much error is acceptable and in yielding an appropriate line (or hyperplane in higher dimensions) that fits the data (Kecman 2005).
Finally, random forest (RF) is a popular and effective algorithm based on model aggregation ideas for several tasks such as classification, regression, and forecasting. RF works by constructing a large number of relatively uncorrelated decision trees from bootstrap samples that operate as an ensemble, and it also involves selecting a subset of input features (columns or variables) at each split point in the construction of the trees. In classification, each individual tree in the RF returns a class prediction, and the class with the most votes becomes the model prediction; in regression tasks such as ETo estimation, the predictions of the individual trees are averaged (Breiman 2001).
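The variance-based weighting can be illustrated with the simplest conjugate case: a normal prior on a mean and normally distributed observations with known variance. This is a standard textbook update used here as an assumption; it is not necessarily the model used by the cited study, and the function and argument names are illustrative.

```python
def normal_posterior(prior_mean, prior_var, sample_mean, sample_var, n):
    """Posterior mean and variance of a normal mean with known data variance.

    prior_var: variance of the prior distribution of the parameter
    sample_var: variance of a single observation
    n: number of observations summarised by sample_mean
    """
    prior_precision = 1.0 / prior_var
    data_precision = n / sample_var
    post_var = 1.0 / (prior_precision + data_precision)
    # The posterior mean is a precision-weighted average, so whichever
    # source has the smaller variance receives the larger weight.
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * sample_mean)
    return post_mean, post_var
```

For example, with a prior mean of 4 and prior variance of 1, and four observations with mean 6 and variance 4, the prior and the data carry equal precision, so the posterior mean falls halfway at 5 with a variance of 0.5.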
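The layered architecture can be sketched as a single forward pass through a network with three inputs, one hidden layer of two units, and one output. The weights below are hypothetical placeholders rather than trained values, the tanh activation is an assumed choice, and training, validation, and testing are not shown.

```python
import math


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate one input vector through a one-hidden-layer network."""
    # Hidden layer: weighted sum of the inputs followed by a tanh activation.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Output layer: linear combination of the hidden activations.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical, untrained parameters for three (scaled) weather inputs.
HIDDEN_WEIGHTS = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.2]]
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [1.5, -0.7]
OUTPUT_BIAS = 3.0

prediction = forward([0.2, -0.1, 0.4], HIDDEN_WEIGHTS, HIDDEN_BIASES,
                     OUTPUT_WEIGHTS, OUTPUT_BIAS)
```

Changing the number of hidden units or layers only changes the shapes of the weight lists, which is what modifying the network architecture amounts to in practice.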
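The error tolerance in SVM regression is usually expressed through an epsilon-insensitive loss, in which deviations smaller than epsilon cost nothing. The short sketch below shows only this loss, not the full optimisation; the function name and the default tolerance of 0.1 are illustrative assumptions.

```python
def epsilon_insensitive_loss(observed, predicted, epsilon=0.1):
    """Loss that ignores errors up to epsilon and grows linearly beyond it."""
    return max(0.0, abs(observed - predicted) - epsilon)


def total_loss(observed_values, predicted_values, epsilon=0.1):
    """Sum of epsilon-insensitive losses over a series of estimates."""
    return sum(
        epsilon_insensitive_loss(o, p, epsilon)
        for o, p in zip(observed_values, predicted_values)
    )
```

With epsilon set to 0.1 mm/day, an estimate that is off by 0.05 mm/day incurs no loss, while one that is off by 0.5 mm/day incurs a loss of 0.4.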
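The two sources of randomness, bootstrap sampling and random feature selection at the split, can be sketched with a deliberately simplified forest whose trees have a single split (stumps) and whose predictions are averaged, as is done for regression. The depth-one trees, the feature subset of size one, and the synthetic data are simplifying assumptions for illustration; practical implementations grow much deeper trees.

```python
import random


def fit_stump(rows, targets, feature):
    """Best single split on one feature, minimising the squared error."""
    best = None
    values = sorted(set(r[feature] for r in rows))
    for lo, hi in zip(values, values[1:]):
        threshold = 0.5 * (lo + hi)
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # the feature is constant in this bootstrap sample
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=50, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(n_features)             # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees (regression)."""
    preds = [left if row[f] <= thr else right for f, thr, left, right in forest]
    return sum(preds) / len(preds)


# Synthetic data: ETo depends on temperature only; wind is uninformative.
rows = [(10.0 + i, 1.0 + (i % 4) * 0.5) for i in range(20)]
targets = [2.0 if temp < 20.0 else 5.0 for temp, _ in rows]
forest = fit_forest(rows, targets)
cool_day = predict_forest(forest, (12.0, 2.0))
warm_day = predict_forest(forest, (28.0, 2.0))
```

Trees that happen to split on the uninformative wind feature contribute the same value to both predictions, so the difference between the cool-day and warm-day estimates comes entirely from the trees that split on temperature.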
The application of any of the models will depend on the objective to be achieved, the relationship between the variables in the dataset, and also on the capacity and expertise of the user who develops and implements the model. Some of the main advantages and disadvantages of these models are presented in Table 1.
Estimation model | Advantages | Disadvantages |
---|---|---|
MLR | Adequate for small datasets; simple to understand and interpret | The linear assumption; sensitive to outliers |
RB | Improves performance when the dataset presents heteroscedasticity and outliers | When the underlying assumptions of the classic method (OLS) are true, RB performs worse than the classic method |
BR | Fast data processing | Less accurate when collinearity exists |
MARS | Fast data processing; simple to understand and interpret; flexible enough to capture the shape of functions | The high degree of flexibility can result in overfitting |
ANN | Powerful to identify complex non-linear relationships | Large datasets are required to achieve good performance; overfitting |
SVM | Powerful to identify complex non-linear relationships; robust to outliers | Requires considerable processing time; the performance depends on the selection of the kernel function; risk of overfitting |
RF | Powerful to identify complex non-linear relationships; harder to overfit | Poor performance with small datasets; the number of decision trees must be set; low model interpretability |
MLR, multiple linear regression; RB, robust regression; BR, Bayesian regression; MARS, multivariate adaptive regression splines; ANN, artificial neural networks; SVM, support vector machine; RF, random forests.
ESTIMATION MODELS: SOME CAVEATS
In the past decade, the assessment of the performance of empirical equations, ML algorithms, and regression models for ETo estimation has considerably increased in the academic literature (e.g., Table 2). From these studies, the following facts can be highlighted: (1) most studies used the FAO-56 Penman-Monteith model as the reference for performance assessment; (2) studies applied original, modified, and locally adapted equations; (3) the ranking of the performance of the different models was heterogeneous between studies, which is mainly related to the geographic location; (4) most of the ETo models that have been developed are site specific; (5) combinations of several input variables were tested to identify the ML models with the least number of weather variables, which were found to be superior to empirical equations under all climatic conditions; and (6) most of the regression models also demonstrated high performance.
Source | Number of empirical models* | Machine learning models | Regression models | Country (environment type) |
---|---|---|---|---|
Landeras et al. (2008) | 11 | ANN | – | Spain (subatlantic environment) |
Tabari et al. (2012) | 13 | SVM, ANFIS | MLR, MNLR | Iran (semi-arid environment) |
Tabari et al. (2013) | 32 | – | – | Iran (humid environment) |
Khoshravesh et al. (2017) | 1 | – | MFP, RB, BR | Iran (arid environment) |
Mehdizadeh et al. (2017) | 17 | GEP, SVM | MARS | Iran (mainly arid and semi-arid environment) |
Djaman et al. (2019) | 35 | – | – | New Mexico – USA (semi-arid environment) |
Farzanpour et al. (2019) | 21 | – | – | Iran (semi-arid environment) |
Muhammad et al. (2019) | 31 | – | – | Peninsular Malaysia (tropical environment) |
Celestin et al. (2020) | 33 | – | – | Hexi Corridor – China (arid environment) |
Chen et al. (2020) | 8 | ANN, SVM, RF | – | Northeast Plain – China (subtropical monsoon environment) |
dos Santos Farias et al. (2020) | 4 | ANN, SVM | CB, SW | Brazil (humid and semi-arid environment) |
Ferreira & da Cunha (2020) | 1 | ANN, RF, XGBoost | – | Brazil (sub-humid environment) |
Pinos et al. (2020) | 22 | ANN | MARS | Ecuador (super-humid environment) |
Tikhamarine et al. (2020) | 7 | ANN, SVM | – | Algeria (Mediterranean environment) |
ANFIS, adaptive neuro-fuzzy inference system; ANN, artificial neural networks; BR, Bayesian regression; CB, cubist regression; GEP, gene expression programming; MARS, multivariate adaptive regression splines; MFP, multivariate fractional polynomial; MLR, multiple linear regression; MNLR, multiple non-linear regression; RB, robust regression; RF, random forest; SVM, support vector machine models; SW, stepwise regression; XGBoost, extreme gradient boosting. Asterisk (*) means that the FAO-56 Penman-Monteith model is included.
Do you know a water or climate scientist who denies the importance of precise estimation of ET? It is well known that the FAO-56 Penman-Monteith model performs well in estimating ETo and serves as a good proxy for lysimeter or eddy covariance measurements; however, the model is not free of errors. In the absence of lysimeter or eddy covariance data, the question arises whether the FAO-56 Penman-Monteith method is a valid reference. A ‘double bias’ in estimation model studies can be expected: one between the FAO-56 Penman-Monteith model and the lysimeter/eddy covariance measurements, and a second between the estimation model and the FAO-56 Penman-Monteith model. Why are accurate estimations so important? The under- or overestimation of ETo at the aggregated scale (e.g., Pinos et al. 2020) can have consequences for water management, which can introduce water supply problems for agriculture and human consumption, and for water-cycle modeling.
Why can models not be generalized? Empirical equations require local calibration, which is site-specific and thereby prevents generalization, because high variance exists between locations. Regression models are developed to be site-specific and are sensitive to changes in the ranges and dynamics of the input data, as occur when they are moved to other locations. ML models cannot be transferred because the main data processing is implicitly developed as a black box (i.e., the internal working is unknown). To complicate things even more, almost no studies publish the data used in their analysis in open access repositories; therefore, their results cannot be validated or reused by the scientific community for new studies covering large scales such as regions, countries, or biomes.
CONCLUDING OBSERVATIONS
In summary, the past few decades have seen an explosion of research on ETo estimation methods; however, scientific progress has remained somewhat stagnant. This note gives a brief explanation of the main components and assumptions of estimation models and presents an extended compilation of relevant existing models. In the absence of actual field measurements, empirical models are helpful to estimate ETo using routine meteorological variables. Yet these models require local calibration. To minimize the effort required to calibrate the empirical models to local conditions, studies should be oriented toward deriving estimation approaches that cover large spatial scales such as regions, countries, or biomes. To achieve this goal, it is recommended that studies publish the weather data used in open access repositories. In this way, the many studies usually conducted at a local scale can be generalized to larger areas of interest. ML algorithms are increasingly applied to estimate ETo as a function of weather variables. Since ML performs as a black box, future studies should be directed at making these relatively accurate models transferable. At present, the great bulk of studies compare the performance of ETo estimation models against the FAO-56 Penman-Monteith equation, and only in a limited number of studies are model results compared with lysimeter or eddy covariance measurements. The quality of input data and of direct measurements plays a key role in the estimation and validation, respectively, of empirical, regression, and machine learning models. Furthermore, heterogeneity in the ranking of ETo models, based on their performance, can lead to ambiguous interpretations.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.