Random forest predictive model development with uncertainty analysis capability for the estimation of evapotranspiration in an arid oasis region

This study evaluates the potential of the random forest (RF) predictive model for simulating daily reference evapotranspiration (ET0) at two stations located in the arid oasis area of northwestern China. To construct an accurate RF-based predictive model, ET0 is estimated from appropriate combinations of model inputs comprising maximum air temperature (Tmax), minimum air temperature (Tmin), sunshine duration (Sun), wind speed (U2), and relative humidity (Rh). The outputs of the RF models are tested against ET0 calculated using the Penman–Monteith FAO 56 (PMF-56) equation. The results showed that the RF model is an effective way to predict ET0 in the arid oasis area with limited data. Apart from air temperature, Rh was the most influential factor on the behavior of ET0 in the study area. Moreover, an uncertainty analysis based on the Monte Carlo method was carried out to verify the reliability of the results; it showed that the RF model had a low uncertainty and can be used successfully to simulate ET0. The study shows RF to be a sound modeling approach for the prediction of ET0 in arid areas where reliable weather data sets are available but relatively limited.

• A random forest model is designed for the estimation of evapotranspiration in an arid oasis region.
• The Monte Carlo method is carried out to analyze the uncertainty of the simulation results.

(Tao et al. ). In arid oasis conditions, crops are a material basis on which human beings depend for survival, as well as an ecological protection barrier. Knowledge of crop-water demand is an important practical consideration for improving water-use efficiency (Benli et al. ). Because evapotranspiration (ET) is a primary source of water loss, its accurate evaluation can provide valuable information for water balance, irrigation system design, and water resources management (Torres et al. ; Wen et al. ). This is especially true for arid regions, such as northwest China, where population growth, expansion of agriculture, and other socioeconomic activities are significantly constraining the available water resources.
Because observation data are lacking, the precise estimation of ET has led to the need for another comprehensive concept called reference evapotranspiration (ET0) (Abdullah et al. ). ET0 can be measured directly using lysimeters, which provide accurate measurements; however, their application is limited by cost and complexity (Ferreira et al. ), which increases the need for data-based methods to predict ET0. Several conventional empirical models, such as the Hargreaves, Priestley-Taylor, and Ritchie equations, have been developed to estimate ET0 from meteorological data.
Because the PMF-56 equation takes into account moisture availability, mass transfer, and the energy required for the process (Granata ), the Food and Agriculture Organization of the United Nations (FAO) has recommended it as the sole standard equation for computing ET0. It is commonly used to validate other models and has been accepted in many regions across the world. The PMF-56 equation can be applied in a wide range of environments and climate conditions owing to its good precision and stability (Huang et al. ). However, some restrictions remain in its application: for example, it is difficult to obtain all of the meteorological data it requires, particularly in developing countries, where meteorological stations are few and weather records can be scarce (Abdullah et al. ). In this context, an alternative data-driven model that requires only easily available input variables is needed.
Because ET0 depends on several interacting meteorological factors, such as temperature, humidity, wind speed, and radiation, it is difficult for an ordinary formula to express all of the related physical processes (Yassin et al. ; Yin et al. ). In this context, artificial intelligence or data-driven models are considered efficient tools for handling non-linear relationships between independent and dependent variables. In the past few decades, artificial intelligence models, including the artificial neural network (ANN), extreme learning machine (ELM), and support vector machine (SVM), have been used extensively for prediction and forecasting (Kisi &  The results revealed that ELM is a simple yet efficient algorithm and superior to the other two methods. Tabari et al. () evaluated the performance of SVM, adaptive neuro-fuzzy inference system (ANFIS), multiple linear regression (MLR), and multiple non-linear regression (MNLR) for estimating ET0 using six input vectors of climatic data in a semiarid highland environment in Iran. The results showed that the SVM and ANFIS models predicted ET0 better than the regression and climate-based models. Kisi & Cimen () used the SVM approach to model ET0 at three stations in central California. The results were compared with those of empirical models and an ANN model and revealed that the SVM method could be employed successfully in simulating the ET0 process. These models have demonstrated a promising ability to predict ET0 in many parts of the world, but some deficiencies remain. ANN models easily become stuck in a local minimum, and the optimization process is easily influenced by the choice of initial point. SVM and many ELM models are kernel-based machine learning methods, and their generalization ability depends largely on the choice of kernel function.
Random forest (RF) is another emerging machine learning technique and a natural non-linear modeling tool, whose advantages are good tolerance of outliers and noise and a low tendency to over-fit. However, one notable issue is that the uncertainty of model estimates is ignored by most studies, and no study so far has reported adding uncertainty analysis to the prediction of ET0. For this reason, an uncertainty analysis is conducted in this paper to assess the precision of the RF model.

PMF-56 equation
As the standard method for estimating ET0, the PMF-56 equation proposed by Allen et al. () was used as the RF target output to train and test the model in this paper:

$$ ET_{0\text{-PMF-56}} = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T + 273}\,U_2\,(e_s - e_a)}{\Delta + \gamma\,(1 + 0.34\,U_2)} \quad (1) $$

where ET0-PMF-56 is the reference evapotranspiration (mm day−1); Rn is the net radiation at the crop surface (MJ m−2 day−1); G is the soil heat flux (MJ m−2 day−1); γ is the psychrometric constant (kPa °C−1); T is the mean daily air temperature at 2 m height (°C); U2 is the mean daily wind speed at 2 m height (m s−1); es is the saturation vapor pressure (kPa); ea is the actual vapor pressure (kPa); es − ea is the saturation vapor pressure deficit (kPa); and Δ is the slope of the saturation vapor pressure curve (kPa °C−1).

This paper uses the RF regression algorithm, whose calculation process is as follows.
First, k training samples (Θ1, Θ2, …, Θk) are randomly generated from the total training sample using the bootstrap sampling method, and the corresponding k decision trees are constructed.
Second, at each node of a decision tree, m features are randomly selected from the M features as the candidate splitting set for the current node, and the feature among these m that gives the minimum node impurity is chosen for the split. Each decision tree is grown to the largest extent possible, without pruning.
Third, for new data x, the prediction of a single decision tree is obtained as the average of the observations in the leaf node ℓ(x, Θ) into which x falls. If an observation Xi belongs to the leaf node ℓ(x, Θ), its weight ωi(x, Θ) is set as in Equation (2); otherwise the weight is 0:

$$ \omega_i(x,\Theta) = \frac{\mathbf{1}\{X_i \in \ell(x,\Theta)\}}{\#\{j : X_j \in \ell(x,\Theta)\}} \quad (2) $$

where the sum of weights equals 1.
Fourth, the prediction of a single decision tree, obtained as the weighted average of the observations Yi (i = 1, 2, …, n) of the dependent variable, is defined as:

$$ \hat{\mu}(x,\Theta) = \sum_{i=1}^{n} \omega_i(x,\Theta)\,Y_i \quad (3) $$

Finally, given the decision tree weights ωi(x, Θt) (t = 1, 2, …, k), the weight of each observation is given by Equation (4):

$$ \omega_i(x) = \frac{1}{k}\sum_{t=1}^{k} \omega_i(x,\Theta_t) \quad (4) $$

Thus, the final predicted value of RF regression (RFR) is:

$$ \hat{\mu}(x) = \sum_{i=1}^{n} \omega_i(x)\,Y_i \quad (5) $$

The flowchart of RF for regression is shown as follows.
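As a complement to the flowchart, the steps above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this study: the function names (`grow`, `leaf_of`, `forest_weights`, `fit_forest`), the settings k = 25 and m = 1, and the six data rows are all hypothetical and chosen only for illustration. The sketch grows unpruned trees on bootstrap samples, selects m random features at each node, and forms the forest prediction as a weighted average of the training observations.

```python
import random

def sse(y, rows):
    """Sum of squared errors around the mean: the node impurity used here."""
    mean = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - mean) ** 2 for i in rows)

def grow(X, y, rows, m, rng):
    """Grow one unpruned regression tree on the bootstrap sample `rows`.
    A leaf is a list of training-row indices; an internal node is a tuple."""
    if len({y[i] for i in rows}) == 1:
        return rows                                  # pure leaf
    best = None
    for f in rng.sample(range(len(X[0])), m):        # m of the M features
        values = sorted({X[i][f] for i in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [i for i in rows if X[i][f] <= thr]
            right = [i for i in rows if X[i][f] > thr]
            score = sse(y, left) + sse(y, right)
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:
        return rows                                  # sampled features are constant
    _, f, thr, left, right = best
    return (f, thr, grow(X, y, left, m, rng), grow(X, y, right, m, rng))

def leaf_of(tree, x):
    """Follow the splits until the leaf containing x is reached."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree

def forest_weights(trees, x, n):
    """Average the per-tree leaf weights over all trees."""
    w = [0.0] * n
    for tree in trees:
        leaf = leaf_of(tree, x)
        for i in leaf:                               # bootstrap duplicates count repeatedly
            w[i] += 1.0 / (len(leaf) * len(trees))
    return w

def fit_forest(X, y, k=25, m=1, seed=0):
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(k):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        trees.append(grow(X, y, rows, m, rng))
    return trees

# Invented toy data (not from the study): columns stand for Tmax and Tmin.
X = [[20.0, 5.0], [25.0, 8.0], [30.0, 12.0], [35.0, 15.0], [28.0, 10.0], [22.0, 6.0]]
y = [2.1, 3.0, 4.2, 5.5, 3.8, 2.4]

trees = fit_forest(X, y)
x_new = [27.0, 9.0]
w = forest_weights(trees, x_new, len(X))
prediction = sum(wi * yi for wi, yi in zip(w, y))
tree_means = [sum(y[i] for i in leaf_of(t, x_new)) / len(leaf_of(t, x_new))
              for t in trees]
```

In this sketch the weighted-average prediction coincides with the mean of the individual tree predictions, which is the equivalence that the weighting scheme expresses.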
In addition, index importance assessment is a promi-

Uncertainty analysis
Uncertainty analysis by Monte Carlo simulation is used to evaluate the final models. The input parameter uncertainty considered in this paper relates to the precision and representativeness of the input data used for prediction (Antanasijević et al. ). In this method, each input parameter is described by a probability distribution; a single simulation consists of generating a random input set that respects this distribution, running the model, and obtaining the output (Noori et al. ). In the present work, we randomly resample the input data set without replacement 1,000 times, keeping the ratio between the training and validation sets unchanged (Dehghani et al. ; Gao et al. ). Finally, the 95% confidence intervals are determined by finding the 2.5th (XL) and 97.5th (XU) percentiles of the cumulative distribution of the 1,000 outputs. The percentage of observed values that lie within the 95% confidence interval is used as a metric of the robustness of the final model: the higher the percentage, the more robust the model, and vice versa. The 95% prediction uncertainties (95PPU) are represented as:

$$ \text{Bracketed by 95PPU} = \frac{N}{n} \times 100 \quad (6) $$

where n is the number of observed data points and N is the number of PMF-56 ET0 values that fall between the corresponding XL and XU. The 'Bracketed by 95PPU' is 100 when all of the PMF-56 ET0 values lie within the range from XL to XU.
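The resampling procedure described above can be sketched as follows. This is a hedged illustration only: a one-variable least-squares line stands in for the RF model, the training ratio of 0.75 is an assumption (the ratio used in the study is not stated here), predicting at a fixed set of evaluation inputs in every run is also an assumption, and the data and function names (`fit_line`, `percentile`, `monte_carlo_bands`) are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares y = a + b*x (a stand-in model, not RF)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def monte_carlo_bands(xs, ys, x_eval, runs=1000, train_ratio=0.75, seed=0):
    """Repeat the train/validation split `runs` times and return the
    2.5th and 97.5th percentile of the predictions at each evaluation point."""
    rng = random.Random(seed)
    n_train = int(len(xs) * train_ratio)         # ratio kept fixed across runs
    idx = list(range(len(xs)))
    outputs = [[] for _ in x_eval]
    for _ in range(runs):
        rng.shuffle(idx)                         # draw without replacement
        train = idx[:n_train]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for j, x in enumerate(x_eval):
            outputs[j].append(a + b * x)
    lower = [percentile(o, 2.5) for o in outputs]
    upper = [percentile(o, 97.5) for o in outputs]
    return lower, upper

# Invented toy data: mean air temperature (°C) against ET0 (mm/day).
temps = [12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0, 33.0]
et0 = [1.9, 2.6, 3.1, 3.9, 4.4, 5.3, 5.8, 6.7]
x_eval = [14.0, 22.0, 31.0]
lower, upper = monte_carlo_bands(temps, et0, x_eval)
```

The returned `lower` and `upper` lists play the role of XL and XU for each evaluation point.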
In addition, the d-factor (Ghorbani et al. ) is used to compute the average width of the confidence interval, and can be evaluated according to Equation (7):

$$ d\text{-factor} = \frac{\bar{d}_x}{\sigma_x} \quad (7) $$

where $\bar{d}_x$ is the average distance between the upper (97.5th) and lower (2.5th) bands and σx is the standard deviation of the observed data. Better results have a d-factor close to 0.
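Both uncertainty indicators can be computed from the percentile bands in a few lines. The sketch below assumes the sample standard deviation for σx (the text does not say whether the sample or population form is used); the function names and the numbers are hypothetical and serve only as an illustration.

```python
from statistics import stdev

def bracketed_by_95ppu(observed, lower, upper):
    """Percentage of observed values lying inside [lower, upper]."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return 100.0 * inside / len(observed)

def d_factor(observed, lower, upper):
    """Average band width divided by the standard deviation of the observations."""
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
    return mean_width / stdev(observed)   # assumption: sample standard deviation

# Invented example values (mm/day), not results from the study.
observed = [2.0, 3.0, 4.0, 5.0]
lower = [1.8, 2.9, 4.1, 4.6]
upper = [2.2, 3.3, 4.5, 5.2]

coverage = bracketed_by_95ppu(observed, lower, upper)   # three of four values are bracketed
width_ratio = d_factor(observed, lower, upper)
```

A narrow band with high coverage is the desirable combination: coverage close to 100 together with a d-factor close to 0.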

CASE STUDY

Observation data and statistical analysis
The weather data for this study were obtained from two sites in Zhangye (100 17   

To eliminate the influence of dimension, the input and output data were normalized to a mean of 0 and a variance of 1 before the models were run, using the following equation:

$$ \chi_{new} = \frac{\chi - \mu}{\sigma} \quad (8) $$

where χnew is the normalized dimensionless data, χ is the original data, μ is the mean of the data, and σ is the standard deviation.
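The normalization step can be sketched as below. The population standard deviation is assumed here because it makes the variance of the normalized series exactly 1; the function name `standardize` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)        # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

# Invented daily maximum temperatures (°C), for illustration only.
tmax = [18.5, 22.0, 25.3, 30.1, 27.4]
tmax_norm = standardize(tmax)
```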

Models' performance criteria
For the assessment of the performances of the RF model, where Ep(i) and and n is the number of data. In terms of these metrics, the model is a perfect fit when r = 1, RMSE = 0, MAE = 0, and NS = 1.

shown that the RF3 model performed slightly better than the RF2 and RF4 models in terms of the four statistical indicators, which indicates that PMF-56 ET0 was readily influenced by U2. This was also confirmed by the RF5 and RF7 models, in which U2 was added to the inputs (Tables 3 and 4). Relative to the RF6 model, the RF5 and RF7 models increased the r and NS values by 0.6% and 1.1%, and 0.3% and 0.5%, respectively, and decreased the RMSE and MAE values by 8.4% and 6.5%, and 3.9% and 6.5%, respectively, demonstrating their superiority over the RF6 model. This comparison revealed that including U2 improved the accuracy of the model significantly. Accordingly, adding U2 was found to be more influential than Sun and Rh on ET0 simulation, the same outcome as obtained by Traore et al. () and Karimaldini et al. (). It is observed that the input scenarios listed in Table 2 Tables 3 and 4. It was also observed that the maximum and minimum PMF-56 ET0 were not fitted very well, especially the peaks in the first few models.
Because of the importance of PMF-56 ET0 in irrigation, agricultural water use, and water resources planning and management, the estimation of total PMF-56 ET0 by the different input combinations of the RF model was also considered in this paper. The total ET0 amounts calculated by the PMF-56 and RF models in the testing phase are given in Table 5. It is notable that all models estimated the total PMF-56 ET0 quite well, with small relative errors (all less than 3.5%) at both sites; in particular, the RF1 model at Zhangye station, whose only input parameters were Tmax and Tmin, had a relative error of −0.2%. In addition, noting the fact that the RF8 model  where restricted data are available.
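The four performance criteria used in this section (r, RMSE, MAE, and NS) and the relative error of the total ET0 can be computed as in the sketch below. The formulas are the commonly used definitions of these statistics and are assumed, not confirmed, to match those of this study; the sign convention of the relative error (predicted minus observed) is likewise an assumption, and the function names and data are hypothetical.

```python
import math

def performance(obs, pred):
    """Return r, RMSE, MAE and the Nash-Sutcliffe coefficient (NS)."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    sq_err = sum((p - o) ** 2 for o, p in zip(obs, pred))
    return {
        "r": cov / math.sqrt(var_o * var_p),
        "RMSE": math.sqrt(sq_err / n),
        "MAE": sum(abs(p - o) for o, p in zip(obs, pred)) / n,
        "NS": 1.0 - sq_err / var_o,
    }

def relative_error_of_total(obs, pred):
    """Relative error (%) of the summed prediction against the summed reference."""
    return 100.0 * (sum(pred) - sum(obs)) / sum(obs)

# Invented daily ET0 values (mm/day): PMF-56 reference and a model estimate.
pmf56 = [2.1, 3.4, 4.8, 5.2, 3.9]
model = [2.3, 3.2, 4.6, 5.5, 3.8]

scores = performance(pmf56, model)
total_error = relative_error_of_total(pmf56, model)
perfect = performance(pmf56, pmf56)
```

Evaluating a series against itself, as in `perfect`, reproduces the perfect-fit values stated in the text (r = 1, RMSE = 0, MAE = 0, NS = 1).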

Evaluation of the importance of variables
Index importance assessment is an advantage of the RF model, which can directly produce a ranking of all of the weather parameters. As shown in Figure 7,

Uncertainty analysis
Monte Carlo simulations were used to corroborate the applicability of the RF models in modeling  Figures 3-6), which further illustrates the differences between the total ET0 amounts computed by the PMF-56 and RF models shown in Table 5. Despite some errors, and taking into account the discussion in the section on the evaluation of the importance of variables, we find that the RF model with only Tmax and Tmin as inputs is still an appropriate technique for simulating daily PMF-56 ET0 in arid conditions.

CONCLUSIONS
Water resources play an essential role in arid environments, so new modeling and water assessment methods are crucial for sustaining water resources and for developing strategies for water quality and use. ET0 is a vital parameter in water resources calculation, regional water resources management, and irrigation planning. This research examined the performance of the RF model in predicting PMF-56 ET0 using different combinations of daily climatic data, including maximum air temperature (Tmax), minimum air temperature (Tmin), sunshine duration (Sun), wind speed (U2), and relative humidity (Rh), for the Zhangye and Gaotai stations in an arid region of northwest China. It was found that the precision of the models improved when Sun, U2, and Rh were each added to the temperature-based model.