Comparison of random forests and other statistical methods for the prediction of lake water level: a case study of the Poyang Lake in China

Modeling of hydrological time series is essential for sustainable development and management of lake water resources. This study aims to develop an efficient model for forecasting lake water level variations, exemplified by the Poyang Lake (China) case study. A random forests (RF) model was first applied and compared with artificial neural networks, support vector regression, and a linear model. Three scenarios were adopted to investigate the effect of time lag and previous water levels as model inputs for real-time forecasting. Variable importance was then analyzed to evaluate the influence of each predictor on water level variations. Results indicated that the RF model exhibits the best performance for daily forecasting in terms of root mean square error (RMSE) and coefficient of determination (R²). Moreover, the highest accuracy was achieved using discharge series at 4-day-ahead and the average water level over the previous week as model inputs, with an average RMSE of 0.25 m for five stations within the lake. In addition, the previous water level was the most efficient predictor for water level forecasting, followed by discharge from the Yangtze River. Based on the performance of the soft computing methods, RF can be calibrated to provide information or simulation scenarios for water management and decision-making.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).
doi: 10.2166/nh.2016.264
s://iwaponline.com/hr/article-pdf/47/S1/69/366925/nh047s10069.pdf
Bing Li, Guishan Yang (corresponding author), Rongrong Wan, Xue Dai, Yanhui Zhang
Key Laboratory of Watershed Geographic Sciences, Nanjing Institute of Geography and Limnology, Chinese Academy of Sciences, Nanjing 210008, China
E-mail: gsyang@niglas.ac.cn
Bing Li, Xue Dai
University of Chinese Academy of Sciences, Beijing 100049, China


INTRODUCTION
[...] integrates precipitation, discharge from tributaries, topography, and so on. The variations become even more complex when the lake interacts with a large river (e.g., the interaction of the Poyang Lake with the Yangtze River).
Reliable and accurate forecasting of lake water level has always been a challenge for hydrologists and water resource managers.
In recent decades, numerous forecasting techniques, including physically based hydrodynamic models (e.g., CHAM, MIKE21, and EFDC), time series analysis (e.g., auto-regressive moving average and auto-regressive integrated moving average) [...] In particular, physically based hydrodynamic models exhibit the best performance in forecasting water level. However, these models require detailed terrain data as well as complex boundary conditions and parameters as input; they are also computationally expensive and limited to restricted durations (Li et al.). Time series analysis is more complex and less reliable than the neural network model (Altunkaynak). In addition, time series analysis does not consider the nonstationary and nonlinear characteristics of the data structure (Kumar & Maity). The nonlinear and complex behavior of model variables makes it difficult to quantify accurately the uncertainty associated with the predictions, which often misleads water resource managers during decision-making (Aqil et al.; Mustafa et al.).
Soft computing methods are capable of capturing complex nonlinear relationships between inputs and outputs without the need for explicit knowledge of the physical process, and they also avoid the creation of extremely complex models in the rare cases when all information is available (Trichakis et al.). Soft computing methods, particularly ANN and SVR, have been successfully applied to solve nonlinear problems in hydrological series simulations, such as groundwater level forecasting (Daliakopoulos et al.; Yoon et al.; Gholami et al.), rainfall prediction (Chau & Wu), and surface water level/discharge forecasting (Altunkaynak; Callegari et al.).
The random forests (RF) model was proposed as a new soft computing method by Breiman. RF handles nonlinear and non-Gaussian data well, is amenable to model interpretation, and is free of over-fitting problems as the number of trees increases. Furthermore, RF provides a measure of the relative importance of descriptors, which can be further utilized in variable selection (Genuer et al.). In the past few years, RF has been employed to simulate suspended sediment concentration and soil organic carbon stocks (Francke et al.; Were et al.). However, soft computing methods with different algorithms may have different levels of adaptability for diverse problems. For example, Yoon et al. found that ANN performs better than SVM in the model training and testing stages when predicting groundwater level in a coastal aquifer. Rodriguez-Galiano et al. found that the RF method performs better than ANN and SVM in predicting and mapping mineral prospectivity. Were et al. concluded that RF has the highest tendency for overestimation, and that SVR is the best model for predicting soil organic carbon stocks. Nevertheless, few studies have compared the adaptability and accuracy of different soft computing methods for hydrological series forecasting, especially for highly nonlinear water level forecasting. In the present work, the RF model was first utilized for forecasting water level fluctuations and then compared with the commonly used ANN, SVR, and a linear model (LM) in terms of accuracy.
Poyang Lake, the largest freshwater lake in China, is fed by five main tributaries and is connected to the Yangtze River, whose blocking effect (even intrusion) greatly affects water level variations in the lake. In recent decades, intensified global climate change and anthropogenic activities have greatly altered Poyang Lake's water regime (Guo et al.), with more frequent occurrence of floods and droughts, which show a trend of sharp transformation (Guo et al.; Li & Zhang). A dam has been proposed in the downstream area of the lake to alleviate severe droughts and flood risk in Poyang Lake (Huang et al.) [...]
Chau & Wu found notable differences at 1-, 2-, and 3-day-ahead horizons by using partial autocorrelation for daily rainfall prediction with an ANN model.
Cross-correlation has been used to determine lag times of precipitation and discharge (Yoon et al.; Li et al.). The trial-and-error method was also utilized to obtain the most sensitive time lag (Hipni et al.). Therefore, an accurate water level forecasting model that considers the previous hydrological status and the time lag effect is required to provide suggestions for the development and management of water resources. Such a model can also help identify the main factors that influence water levels in Poyang Lake.
The specific objectives of this paper were: (1) to determine the most accurate model by comparing RF with ANN, SVR, and the LM while incorporating discharge from lake catchment tributaries and the Yangtze River, the time lag effect, and the previous hydrological status for water level forecasting; and (2) to explore the relative importance of each predictor for different water level stations within the lake. The proposed model provides a useful tool for water resource management and for identifying the major influencing factors for lake water level fluctuations.

Data collection
As shown in Figure 1, Jiujiang station is the station on the Yangtze River that most closely represents the river's influence on water level variations within the lake. However, because discharge data at Jiujiang station are missing prior to 1988, Hankou station was chosen as a substitute in the model, as it is strongly correlated with Jiujiang station (correlation coefficient = 0.995). The data applied in this study include the following: [...] Figure 1 shows the locations of these hydrological stations.
Soft computing methods

RF model
[...] could be achieved with a higher number of trees (Were et al.).
In addition, the importance of each predictor is measured by the increase in mean squared error (MSE) when the predictors are excluded one by one from the RF models. The relative importance of each predictor is determined from 100 runs of the RF models and normalized to 100% to provide a simple basis for comparison among stations. In this paper, 500 parameter sets, each comprising ntree, mtry, and nodesize, were tried for the RF model, and the one with the highest accuracy was selected.
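The averaging and normalization of predictor importance described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function name `normalize_importance`, the predictor names, and the toy scores are hypothetical, and the sketch assumes that each run yields one non-negative increase-in-MSE score per predictor.

```python
from statistics import mean


def normalize_importance(runs):
    """Average per-predictor importance over runs and rescale to sum to 100%.

    runs: list of dicts mapping predictor name -> increase in MSE for one run.
    Assumption: scores are non-negative and have a positive total.
    """
    names = list(runs[0])
    # Mean increase in MSE for each predictor across all runs.
    avg = {name: mean(run[name] for run in runs) for name in names}
    total = sum(avg.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    # Rescale so that the relative importances add up to 100%.
    return {name: 100.0 * value / total for name, value in avg.items()}


# Hypothetical scores from two runs (the study used 100 runs).
toy_runs = [
    {"wl7": 0.30, "hankou": 0.20, "tributary": 0.10},
    {"wl7": 0.34, "hankou": 0.16, "tributary": 0.10},
]
relative = normalize_importance(toy_runs)
```

Because the values are rescaled to a common total, the relative importance of a predictor can be compared across stations even when the absolute errors differ.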

SVR model
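SVR is a forecasting model based on structural risk minimization [...] (Figure 3). Given the training data $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ and $y_i$ are the input and output data, respectively, the goal of ε-SVR is to determine a function $f(x)$ that has at most ε deviation from the observed outputs $y_i$ and that is as flat as possible (Smola & Schölkopf). The formula of the RBF kernel is:
$$K(x, x_j) = \exp\left(-\gamma \lVert x - x_j \rVert^2\right)$$
where γ is a parameter and vector $x_j$ is the input of the training data. The unknown vector $w$ is determined to minimize the function:
$$\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right)$$
where $C$ (cost) > 0 controls the trade-off between the flatness of $f(x)$ and the extent to which deviations greater than ε are tolerated, and $\xi_i$ and $\xi_i^*$ are slack variables that measure those deviations.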
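The two ingredients of ε-SVR named above, the RBF kernel and the tolerance of deviations up to ε, can be illustrated with a short Python sketch. It is not the implementation used in the study (which ran in R); the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel: exp(-gamma * ||x_i - x_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Deviations within epsilon cost nothing; larger ones cost the excess."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Identical inputs give the maximum kernel value; distant inputs decay to 0.
k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)

# A 0.05 m error is inside a 0.1 m tube; a 0.5 m error exceeds it by 0.4 m.
loss_inside = epsilon_insensitive_loss(12.0, 12.05, epsilon=0.1)
loss_outside = epsilon_insensitive_loss(12.0, 12.5, epsilon=0.1)
```

A larger γ makes the kernel more local, and a larger ε widens the band of errors that are ignored, which is why both need calibration together with C.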
An internationally recognized uniform method for SVM parameter optimization has not been established. This study adopted the most commonly used method, in which γ, C, and ε are calibrated within a given range by grid search in the R statistical environment. Similarly, 500 parameter sets were tried and the set with the best performance was selected.

ANN model
[...] in the output layer. The input is normalized by subtracting the column mean from each column of the data set and dividing by the column standard deviation. Furthermore, a trial-and-error method based on the performance value from the training stage was applied to determine the optimal number of hidden nodes (size), learning rate, and momentum. Similar to the RF model, 500 parameter sets were generated and tested to determine the optimal parameter set. The set with the fewest hidden neurons giving the best performance was chosen.

Input-output scenarios
[...] units of the water level. R² measures the degree of collinearity between the observed and predicted values. It also describes the proportion of the total variance in the observed data that can be explained by the model. The NSCE is a popular index to assess the predictive power of hydrological models. In this work, the NSCE was used for evaluating the sensitivity of each parameter set. The best forecast models approach an RMSE value of 0 and R² and NSCE values of 1. The formulas for RMSE, R², and NSCE are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^2}$$
$$R^2 = \frac{\left[\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(\tilde{y}_i - \bar{\tilde{y}}\right)\right]^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \sum_{i=1}^{n}\left(\tilde{y}_i - \bar{\tilde{y}}\right)^2}$$
$$\mathrm{NSCE} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \tilde{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $y_i$ is the observed water level, $\tilde{y}_i$ is the forecasted water level, $\bar{y}$ indicates the average observed water level, $\bar{\tilde{y}}$ is the average forecasted water level, and $n$ is the number of observations.
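The three evaluation measures (RMSE, R², and NSCE) can be computed as in the sketch below. It assumes that R² is the squared Pearson correlation between observed and forecasted levels, which matches its description as a measure of collinearity; the function names and the toy water levels are hypothetical.

```python
import math


def rmse(obs, pred):
    """Root mean square error, in the units of the water level."""
    n = len(obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and forecasted values."""
    mean_o = sum(obs) / len(obs)
    mean_p = sum(pred) / len(pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


def nsce(obs, pred):
    """Nash-Sutcliffe coefficient: 1 minus error variance over observed variance."""
    mean_o = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / sum((o - mean_o) ** 2 for o in obs)


# Hypothetical observed and forecasted water levels (m).
observed = [10.0, 12.0, 14.0, 16.0]
forecast = [10.5, 11.5, 14.5, 15.5]
scores = (rmse(observed, forecast), r_squared(observed, forecast), nsce(observed, forecast))
```

R² ignores systematic bias because it is invariant to shifting or rescaling the forecasts, whereas RMSE and NSCE penalize it, so the three measures are best read together.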

RESULTS AND DISCUSSION
Comparison of the models for daily water level forecasting
Figure 5 shows the relationship between measured water levels and those predicted by the RF, SVR, ANN, and LM [...] Therefore, RF was chosen to further establish the daily water level forecasting model for the three scenarios.

Effect of time lag on daily water level forecasting
In this subsection, a series of time lags is selected to predict daily water level using RF and five-fold cross-validation under scenario 2 for each station. For simplicity, only the performance during the testing stage is displayed (Table 2).
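As an illustration of the five-fold cross-validation used to compare candidate time lags, the following Python sketch splits a record into five folds and returns the training and testing indices for each. The text does not state how the folds were drawn, so the contiguous, unshuffled split is an assumption, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds.

    Assumption: folds are contiguous blocks in time order, with any
    remainder spread over the first folds.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        stop = start + size
        test = list(range(start, stop))
        # Every sample outside the current test block is used for training.
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Twelve hypothetical daily samples split into five folds.
folds = list(k_fold_indices(12, k=5))
```

Each sample is tested exactly once, so the testing-stage scores for a given time lag can be averaged over the five folds before lags are compared.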

Effect of input scenarios on daily water level forecasting
Scenario 3 produced the best results among the three scenarios for all five hydrological stations (Table 3). For the training stage, the values of R² are close to 1 for all five stations. Although scenario 3 has the highest R², only a slight difference is observed among the three scenarios.
By contrast, RMSE decreased from scenario 1 (0.29 m on average) to scenario 3 (0.14 m on average) [...] (Table 3). In addition, for the five hydrological stations within Poyang Lake, the R² values slightly decreased from north to south (i.e., with longer distance from the Yangtze River) (Table 2). Kangshan has the lowest forecasting precision, which indicates that the discharge of the Yangtze River greatly affects the water level within the lake and that its effect gradually decreases in the upstream direction. Similar results were obtained by Li et al. using the back-propagation neural network.
Thus, scenario 3 is considered the best among the three scenarios for all five water level stations. In other words, the RF algorithm and five-fold cross-validation comprise the best model to forecast the daily water level in Poyang Lake, when the inputs include the 4-day time lag of the Yangtze River, the daily discharge of the tributaries, and the previous water level within the lake.
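The scenario 3 inputs can be assembled as in the sketch below. It rests on several assumptions that are not spelled out in the text: the Yangtze River discharge is taken 4 days before the forecast day, the previous water level is the mean over the 7 preceding days, and the tributaries are represented by a single discharge series. The function name `scenario3_rows` and the toy series are hypothetical.

```python
def scenario3_rows(yangtze_q, tributary_q, water_level, lag=4, window=7):
    """Build (features, target) rows for each forecast day t.

    features: Yangtze discharge on day t - lag, tributary discharge on day t,
    and the mean lake water level over the `window` days before day t.
    target: lake water level on day t.
    """
    n = len(water_level)
    if not (len(yangtze_q) == len(tributary_q) == n):
        raise ValueError("all series must have the same length")
    rows = []
    # Start at the first day for which both the lag and the window exist.
    for t in range(max(lag, window), n):
        prev_mean = sum(water_level[t - window:t]) / window
        features = [yangtze_q[t - lag], tributary_q[t], prev_mean]
        rows.append((features, water_level[t]))
    return rows


# Ten days of hypothetical discharge (m3/s) and water level (m) series.
yangtze = [100.0 + day for day in range(10)]
tributary = [10.0 * day for day in range(10)]
level = [float(day) for day in range(10)]
rows = scenario3_rows(yangtze, tributary, level)
```

The first `max(lag, window)` days are dropped because their lagged discharge or previous-week average is not available, which slightly shortens the usable record.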

Source of uncertainty
RF, SVR, and ANN models for forecasting water level fluctuations were compared based on continuously measured data that were quality-controlled by the Hydrological Bureau of Jiangxi Province. The models were calibrated with five-fold cross-validation in order to reduce the uncertainty. In addition, the training/testing data set represents relatively real-time hydrological processes in lake water level fluctuations, which incorporated the influence of lake catchment [...]
Figure 7 also shows a remarkable difference among the prediction performance of the RF, ANN, and SVR models, with a best-to-worst order of RF, ANN, and SVR (P < 0.01).

Relative importance of the predictor variables
In this subsection, models are established for the five hydrological stations within Poyang Lake using the RF model, and five-fold cross-validation under scenario 3.
As shown in Figure 8, for each station the previous lake water level (wl7) is the most important predictor for Poyang Lake, with a mean relative importance of 18%. Moreover, the effect of discharge from Hankou station is also considerable (mean relative importance of 12.4%), especially for the Hukou and Xingzi water level stations in the Poyang Lake-Yangtze River waterway. This is mainly because the water level variation in the lake is driven by both the lake inflow and the Yangtze River, whose blocking (even intrusion) and pulling effects on the outflows from Poyang Lake greatly influence the inter-intra [...]

CONCLUSIONS
This study aimed to determine the most efficient model by [...] In addition, variable importance analysis was implemented for each water level station using the most accurate RF model and scenario 3. Results indicated that the previous water level was the most efficient predictor for water level forecasting. Moreover, the discharge from the Yangtze River also has a fundamental effect on water level variations.
Nevertheless, meteorological factors are not included in this study, thereby unavoidably introducing uncertainty to real-time water level forecasting. Future work should fully consider the complex hydrological and hydrodynamic processes of Poyang Lake.