Abstract

Obtaining accurate runoff prediction results and quantifying the uncertainty of the forecasting are critical to the planning and management of water resources. However, the strong randomness of runoff makes it difficult to predict. In this study, a hybrid model based on XGBoost (XGB) and Gaussian process regression (GPR) with Bayesian optimization algorithm (BOA) is proposed for runoff probabilistic forecasting. XGB is first used to obtain point prediction results, which can guarantee the accuracy of forecast. Then, GPR is constructed to obtain runoff probability prediction results. To make the model show better performance, the hyper-parameters of the model are optimized by BOA. Finally, the proposed hybrid model XGB-GPR-BOA is applied to four runoff prediction cases in the Yangtze River Basin, China and compared with eight state-of-the-art runoff prediction methods from three aspects: point prediction accuracy, interval prediction suitability and probability prediction comprehensive performance. The experimental results show that the proposed model can obtain high-precision point prediction, appropriate prediction interval and reliable probabilistic prediction results on the runoff prediction problems.

HIGHLIGHTS

  • A novel deep learning method XGBoost is applied to predict runoff.

  • A new hybrid model is proposed to quantify the uncertainty of prediction.

  • The maximum information coefficient is used to select feature inputs.

  • Bayesian optimization algorithm is used to optimize the hyper-parameters of the model.

INTRODUCTION

With the increasing shortage of energy, hydropower has received worldwide attention as a clean renewable energy (Liu et al. 2018a). However, flood disasters can pose risks to the safety of the reservoir and also cause economic benefits (Liu et al. 2018b). Therefore, hydrological probabilistic forecasting is very important for the operation and management of the reservoir, which contains two aspects: accurately predicting runoff and quantifying the uncertainty of the forecasting.

Runoff prediction methods are usually divided into two categories: process-driven method and data-driven method (Wen et al. 2019). The process-driven model is based on the hydrological concept and focuses on the description of the physical mechanism of runoff yield and concentration, such as the Xin'anjiang hydrological model (Fang et al. 2017) and numerical weather prediction (Wu & Lin 2017). These models have high prediction accuracy and are highly interpretable, but their data collection is difficult and the solution is time-consuming (Wu & Lin 2017). The data-driven model evolves runoff by mining the information contained in time series (Behera et al. 2006). The time-series model is a commonly used method for runoff prediction, mainly including auto-regressive model (AR), moving average model (MA), auto-regressive moving average model (ARMA) and their variants (Papacharalampous et al. 2018). These models are based on data stationarity assumptions, so their prediction accuracy is limited because of the strong nonlinearity of runoff (Mauricio 1995). To deal with the nonlinearity, many machine learning methods are used to predict runoff. In data-driven methods, many machine learning methods are used to predict runoff. A hybrid framework based on support vector regression (SVR) and series decomposition is proposed for monthly runoff forecasting (Luo et al. 2019). Artificial neural network (ANN) is used to characterize the nonlinearity of runoff and predict it, combining with empirical mode decomposition to improve the forecast performance (Tan et al. 2018). With the rapid development of deep learning methods in recent years, they are gradually being used to predict runoff or other time series, such as long short-term memory (LSTM) network (Zhang et al. 2019a) and convolutional neural network (CNN) (Le Callet et al. 2006). XGBoost (XGB; Chen & Guestrin 2016) is a novel deep learning method that is popular in various data mining competitions and wins competitions, such as Kaggle and KDDCup. However, this excellent model has not been used for hydrological forecasting. XGB belongs to the family of gradient boosting tree model. The model well known in their family has gradient boosting decision tree (GBDT) (Rao et al. 2019), classification and regression tree (CART) (Yang et al. 2016). As the latest variant of this family, the prediction performance of XGB is better than GBDT and CART. Another new variant light gradient boosting machine (LightGBM, LGB) (Deng et al. 2018), its training time is lower than XGB but the prediction accuracy is still not as good as XGB. Therefore, XGB is applied to predict runoff in this paper, which can ensure the prediction accuracy.

All of the above methods are deterministic prediction models, which cannot quantify forecast uncertainty. Constructing the upper and lower bounds corresponding to a certain confidence level is one of the approaches, called interval prediction (Khosravi et al. 2011). More comprehensively, the Bayesian method can quantify uncertainty by obtaining the probability density function of predictions (Wang et al. 2009). Gaussian process regression (GPR) is employed for monthly streamflow forecasting, which has the advantage of high reliability (Sun et al. 2014). Quantile regression (QR) is also the type of runoff probability prediction method, which can obtain the conditional quantile of prediction (Zhang et al. 2019b). The probability density function can also be obtained by further combining the kernel density estimation (KDE) method. In this study, the GPR model is used to quantify forecast uncertainty.

To make the forecasting model show better performance, the hyper-parameter optimization is carried out for each one. Commonly used hyper-parameter optimization algorithms are grid search algorithm (GSA) (Kong et al. 2017), random search algorithm (RSA) (Al-Muhammed & Abu Zitar 2018) and Bayesian optimization algorithm (BOA) (He et al. 2019). GSA is essentially an exhaustive method and its search efficiency is low. The loss function of RSA is prone to fluctuations, and the search process has no direction. BOA can estimate the distribution between the hyper-parameters and the loss so that the search process has a direction and the efficiency is high. Therefore, BOA is used to optimize the hyper-parameters of the hybrid model in this paper.

In this paper, a probabilistic forecasting hybrid model called XGB-GPR-BOA is proposed to predict runoff and quantify the uncertainty of prediction. The main contributions are outlined as follows:

  1. A novel deep learning method XGB is applied to predict runoff, which can ensure the prediction accuracy.

  2. A new hybrid model combined XGB and GPR is proposed to obtain runoff probabilistic prediction results.

  3. To make the hybrid model show better performance, BOA is used to optimize hyper-parameters.

  4. The proposed hybrid model XGB-GPR-BOA is applied to four runoff prediction cases in the Yangtze River Basin, China and compared with eight state-of-the-art runoff prediction methods. The rest of the paper is organized as follows: In Section 2, the implementation details of the hybrid model XGB-GPR-BOA are introduced. In Section 3, the evaluation metrics of prediction performance are explained. In Section 4, the proposed model is applied to solve the practical prediction problem and performed the comparison experiments. In Section 5, the work of this paper is summarized and the conclusions are given.

METHODS

XGBoost

XGB is a novel gradient tree boosting algorithm, which has the advantages of strong ability to overcome overfitting and high prediction accuracy (Chen & Guestrin 2016). Suppose is a given training set with n examples and m features, where are features and is label.

Model structure of XGB

XGB uses K additive functions to predict the label:
formula
(1)
where is the prediction value and is the kth regression tree weak model. K weak models are integrated into a strong model .

The tree ensemble model is shown in Figure 1. K different regression trees divide the samples into different leaves according to their own split conditions. The final prediction for a given example is the sum of predictions from each tree. It can be seen from the diagram that the prediction of a tree is determined by the structure of the tree and the weight of the leaves. The structure of the tree is actually these leaves, represented by the symbol q. The weight of the leaves is represented by the symbol w.

Figure 1

Tree ensemble model.

Figure 1

Tree ensemble model.

Model training of XGB

Training an XGB model is to solve the number of regression tree (), the structure of each regression tree (), and the weight of each leaf (). The purpose of training is to minimize the following regularized loss function:
formula
(2)
where is a loss function that measures the difference between the prediction and the label . is a penalty that measures the complexity of the model, which helps to avoid overfitting, such as L1 regularization and L2 regularization (Chen & Guestrin 2016).
The number of regression tree ()
Formally, represents the prediction of the ith instance at the (k − 1)th iteration. Next, we need to add to minimize the following objective:
formula
(3)

The new regression tree is greedily added if it can significantly improve the model according to Equation (2). The final number of regression tree is the K value.

2.1.2.2. Leaf weights ()
Each regression tree () corresponds to an independent tree structure () and leaf weights (). Solving means solving and . The objective is approximated by second-order Taylor expansion as follows:
formula
(4)
where and are first- and second-order gradient statistics on the loss function, respectively. The second-order gradient statistic on the loss function is used in XGB, which is the biggest difference from other gradient tree boosting models and it is also the reason for its high prediction accuracy.
Removing the constant terms, the optimization objective can be simplified as follows:
formula
(5)
Define as the instance set of jth leaf in the kth regression tree. The objective can be rewritten as follows. This process will be easier to understand in conjunction with Figure 1.
formula
(6)
where T is the number of leaves. is a function describing the complexity of the jth leaf model of the kth regression tree. and are commonly used formulas of , which corresponds to L2 regularization and L1 regularization, respectively. is the penalty coefficient. In the following derivation, L2 regularization is taken as an example, and the other form of is similar.
Since is the sum of T independent quadratic functions, for a fixed structure , let , the optimal weight and optimal objective can be computed as follows:
formula
(7)
formula
(8)
Tree structure ()
Solving the tree structure () is actually determining the split conditions () and the instance set () on the leaf. Branches are greedily increased by score gain. Assume that and are the instance sets of the left and right nodes after split and . The score gain is computed as follows:
formula
(9)
where s is the score gain. The maximum of score gain is the point of split. If the maximum of score gain is less than zero, it means that the current node is not split.

Gaussian process regression

The point prediction results of XGB are taken as the feature inputs, and the real runoff results are used as the labels to construct the GPR (Sun et al. 2014).

A regression model with noisy is assumed as follows:
formula
(10)
where Y is the label and X is the feature input. The noisy .
Then the prior distribution of the label Y, the joint prior distribution of the label Y and the prediction can be obtained as follows:
formula
(11)
formula
(12)
where is a symmetric positive definite covariance matrix, whose elements measure the correlation between and through a kernel function . is the covariance matrix between the validation set and training set X. is the covariance matrix of the validation set itself. is an n-dimensional unit matrix.
Through the knowledge of probability theory and linear algebra, the posterior distribution of the prediction can be obtained as follows:
formula
(13)
formula
(14)
formula
(15)
Therefore, the point prediction result of runoff is , and the interval prediction result corresponding to the 95% confidence level is . The probability density function of ith predicted value is as follows:
formula
(16)

Bayesian optimization algorithm

To make the model perform better, BOA is used to optimize hyper-parameters (He et al. 2019). The prediction framework flowchart of XGB-GPR-BOA is shown in Figure 2.

Figure 2

The prediction framework flowchart of XGB-GPR-BOA.

Figure 2

The prediction framework flowchart of XGB-GPR-BOA.

Taking minimizing the loss function as an example, the hyper-parameter optimization problem can be given as follows:
formula
(17)
where H is the range of all hyper-parameters. is the loss of prediction model under the hyper-parameter combination h, and calculating it is a time-consuming step. is the optimal hyper-parameter combination.

The optimization steps of the hyper-parameter are as follows.

Step 1: A small number of hyper-parameter combinations are randomly initialized in the definition domain H, and each combination is input into the model (XGB-GPR) proposed in this paper. The corresponding loss function is further calculated to construct an initial data set D.

Step 2: Training a probabilistic regression model M on the data set D, the probability distribution function of the objective function (loss function) l is obtained by trained M. M is an existing probabilistic regression model, such as random forest and tree parzen estimators (He et al. 2019).

Step 3: Defining the acquisition function S, using the current probability distribution function as a cheap surrogate for the expensive objective l, the new location is obtained by minimizing S. Common forms of acquisition function are the probability of improvement, excepted improvement and entropy search (He et al. 2019).

Step 4: Calculate the loss function of new location , add it into the data set D, repeat steps 2 and 3 until the total number of iterations is reached and output the final parameter combination as the optimal parameter combination .

EVALUATION METRICS

Evaluation metric of point prediction

To evaluate the accuracy of point prediction, root-mean-square error (RMSE) (Zhang et al. 2020), mean absolute percentage error (MAPE) (Zhang et al. 2019c) and coefficient of determination () are used to evaluate the deviation between the predictions and observations.

Root-mean-square error

The smaller the RMSE, the higher the point prediction accuracy. Its formula is as follows:
formula
(18)
where and are the prediction and observation, respectively. is the size of validation set.

3.1.2. Mean absolute percentage error

The smaller the MAPE, the higher the point prediction accuracy. Its formula is as follows:
formula
(19)

Coefficient of determination

The closer the value of is to 1, the higher the point prediction accuracy. Its formula is as follows:
formula
(20)
where is the mean of observations.

Evaluation metric of interval prediction

To evaluate the suitability of interval prediction, coverage probability (CP) (Li & Jin 2018) and mean width percentage (MWP) (Li & Jin 2018) are used in this paper.

Coverage probability

CPα is defined as the probability that the observation falls within the prediction interval under the confidence level of α. Its formula is as follows:
formula
(21)
where is the number of samples whose observations fall within the prediction interval.

Mean width percentage

MWPα is used to measure the prediction interval width, whose formula is as follows:
formula
(22)
where and are the upper and lower limits of the prediction interval.

Suitability metric

The ideal prediction interval should have high CPα and low MWPα; therefore, the comprehensive metric of interval prediction is defined as MCα. The smaller the MCα, the more suitable the prediction interval. Its formula is as follows:
formula
(23)

Evaluation metric of probability prediction

To evaluate the comprehensive performance of probability prediction, continuous ranked probability score (CRPS) is used in this paper (Hersbach 2000). The smaller the CRPS, the better the comprehensive performance of probability prediction. The formula of CRPS is as follows:
formula
(24)
formula
(25)
formula
(26)
where is the probability density function of and is its cumulative distribution function. is the Heaviside function.

CASE INTRODUCTION

To verify the performance of the proposed model XGB-GPR-BOA, four datasets with different hydrologic stations and different lengths are gathered for experiments. Statistical information of four datasets is shown in Table 1. The data of these datasets are from Yichang Station, Gaochang Station or Binshan Station, which are located in the Yangtze River Basin of China. The study area is shown in Figure 3. In the table, T, Ta and Tv represent the size of total dataset, training set and validation set, respectively.

Table 1

Statistical information of four datasets

DatasetsStationTimeTTaTvMin.MeanMax.SD
Unit1 period = 1 dayPeriodm3/sm3/sm3/s
Dataset 1 Yichang 1 January 2000–31 December 2004 1,827 1,096 731 2,950 13,323 58,400 9,974 
Dataset 2 Yichang 1 January 2007–31 December 2011 1,826 1,096 730 3,292 12,369 66,488 9,310 
Dataset 3 Gaochang 1 January 2004–31 December 2010 2,557 1,461 1,096 515 2,470 15,900 1,908 
Dataset 4 Binshan 1 January 2001–31 December 2007 2,556 1,461 1,095 1,180 4,695 22,100 3,856 
DatasetsStationTimeTTaTvMin.MeanMax.SD
Unit1 period = 1 dayPeriodm3/sm3/sm3/s
Dataset 1 Yichang 1 January 2000–31 December 2004 1,827 1,096 731 2,950 13,323 58,400 9,974 
Dataset 2 Yichang 1 January 2007–31 December 2011 1,826 1,096 730 3,292 12,369 66,488 9,310 
Dataset 3 Gaochang 1 January 2004–31 December 2010 2,557 1,461 1,096 515 2,470 15,900 1,908 
Dataset 4 Binshan 1 January 2001–31 December 2007 2,556 1,461 1,095 1,180 4,695 22,100 3,856 
Figure 3

The study area.

Figure 3

The study area.

RESULTS AND DISCUSSION

Experimental design and parameter settings

To verify the performance of the proposed method, LGB (Deng et al. 2018), Gradient Boosting Regression Tree (GBR) (Rao et al. 2019), LSTM network (Zhang et al. 2019a), CNN (Le Callet et al. 2006), ANN (Tan et al. 2018), SVR (Luo et al. 2019), QR (Zhang et al. 2019b) and GPR (Sun et al. 2014) are compared with XGB. Using the framework proposed in this paper, combined with GPR, these models are further transformed to obtain the probabilistic forecasting model: XGB-GPR, LGB-GPR, GBR-GPR, LSTM-GPR, CNN-GPR, ANN-GPR and SVR-GPR. QR and GPR are all probabilistic forecasting models. The prediction results obtained by QR are a series of conditional quantiles of runoff, and it is necessary to combine the KDE (He & Li 2018) to obtain the probability distribution function, called QR-KDE. These models are realized by Python.

For the fairness of comparison, the nine models use the same set of feature inputs, and the super-parameters of these models are optimized by BOA. The detailed hyper-parameters of nine models are shown in Table 2. The optimal prediction results of these models are taken as the final results.

Table 2

The hyper-parameters of the nine models

ModelHyper-parameterSearch range
XGB Maximum depth of a tree [1, 10]; Integer 
Learning rate [0.001, 0.3]; Real 
Penalty coefficient of L1 regularization [0, 1]; Real 
Penalty coefficient of L2 regularization [0, 1]; Real 
Subsample ratio of the training instances [0.7, 1]; Real 
Subsample ratio of columns [0.7, 1]; Real 
Subsample ratio of columns for each level [0.7, 1]; Real 
Subsample ratio of columns for each node [0.7, 1]; Real 
LGB Maximum depth of a tree [1, 10]; Integer 
Maximum number of leaves in one tree [20, 200, 5]; Integer 
Learning rate [0.001, 0.3]; Real 
Penalty coefficient of L1 regularization [0, 1]; Real 
Penalty coefficient of L2 regularization [0, 1]; Real 
Minimal number of data in one leaf [10, 200, 5]; Integer 
Minimal sum hessian in one leaf [0.0001, 0.005]; Real 
Bagging fraction [0.1, 1]; Real 
Frequency for bagging [20, 60, 5]; Integer 
Feature fraction [0.7, 1]; Real 
GBR Maximum depth of a tree [1, 10]; Integer 
Learning rate [0.001, 0.3]; Real 
The number of boosting stages to perform [50, 200, 10]; Integer 
Subsample ratio of the training instances [0.7, 1]; Real 
The minimum number of samples required to split an internal node [10, 200, 5]; Integer 
The minimum number of samples required to be at a leaf node [10, 200, 5]; Integer 
LSTM The number of hidden layer nodes [64, 32, 16, 8, 4, 2, 1] 
Drop out rate [0.01, 0.2]; Real 
Batch size [64, 32, 16, 8] 
Epochs 50 
Optimizer Adam 
CNN The number of hidden layer nodes [64, 32, 16, 8, 4, 2, 1] 
Kernel size 
Strides 
Activation [‘relu’, ‘tanh’, ‘sigmoid’] 
Batch size [64, 32, 16, 8] 
Epochs 50 
Optimizer Adam 
ANN The number of hidden layer nodes [64, 32, 16, 8, 4, 2, 1] 
Activation [‘relu’, ‘tanh’, ‘sigmoid’] 
Batch size [64, 32, 16, 8] 
Epochs 50 
Optimizer Adam 
SVR Kernel function [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’] 
QR Kernel function [‘epa’, ‘cos’, ‘gau’, ‘par’] 
GPR Kernel function [‘C’, ‘RBF’, ‘W’, ‘M’, ‘RQ’] 
ModelHyper-parameterSearch range
XGB Maximum depth of a tree [1, 10]; Integer 
Learning rate [0.001, 0.3]; Real 
Penalty coefficient of L1 regularization [0, 1]; Real 
Penalty coefficient of L2 regularization [0, 1]; Real 
Subsample ratio of the training instances [0.7, 1]; Real 
Subsample ratio of columns [0.7, 1]; Real 
Subsample ratio of columns for each level [0.7, 1]; Real 
Subsample ratio of columns for each node [0.7, 1]; Real 
LGB Maximum depth of a tree [1, 10]; Integer 
Maximum number of leaves in one tree [20, 200, 5]; Integer 
Learning rate [0.001, 0.3]; Real 
Penalty coefficient of L1 regularization [0, 1]; Real 
Penalty coefficient of L2 regularization [0, 1]; Real 
Minimal number of data in one leaf [10, 200, 5]; Integer 
Minimal sum hessian in one leaf [0.0001, 0.005]; Real 
Bagging fraction [0.1, 1]; Real 
Frequency for bagging [20, 60, 5]; Integer 
Feature fraction [0.7, 1]; Real 
GBR Maximum depth of a tree [1, 10]; Integer 
Learning rate [0.001, 0.3]; Real 
The number of boosting stages to perform [50, 200, 10]; Integer 
Subsample ratio of the training instances [0.7, 1]; Real 
The minimum number of samples required to split an internal node [10, 200, 5]; Integer 
The minimum number of samples required to be at a leaf node [10, 200, 5]; Integer 
LSTM The number of hidden layer nodes [64, 32, 16, 8, 4, 2, 1] 
Drop out rate [0.01, 0.2]; Real 
Batch size [64, 32, 16, 8] 
Epochs 50 
Optimizer Adam 
CNN The number of hidden layer nodes [64, 32, 16, 8, 4, 2, 1] 
Kernel size 
Strides 
Activation [‘relu’, ‘tanh’, ‘sigmoid’] 
Batch size [64, 32, 16, 8] 
Epochs 50 
Optimizer Adam 
ANN The number of hidden layer nodes [64, 32, 16, 8, 4, 2, 1] 
Activation [‘relu’, ‘tanh’, ‘sigmoid’] 
Batch size [64, 32, 16, 8] 
Epochs 50 
Optimizer Adam 
SVR Kernel function [‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’] 
QR Kernel function [‘epa’, ‘cos’, ‘gau’, ‘par’] 
GPR Kernel function [‘C’, ‘RBF’, ‘W’, ‘M’, ‘RQ’] 

In this case, there are five tasks that need to be completed.

Task I: Analyze the correlation factors of runoff and construct feature inputs.

Task II: Verify the convergence of BOA and prediction model to ensure the prediction results obtained by each model are optimal.

Task III: Compare different models from point prediction accuracy, interval prediction suitability, and probability prediction comprehensive performance.

Task IV: Analyze hyper-parameter sensitivity and provide suggestions for hyper-parameter optimization.

Task I: construct feature inputs

To analyze the correlation factors of the runoff and improve the prediction accuracy of the model, historical runoff data are selected as alternative feature. The maximal information coefficient (MIC) values between historical runoff and current runoff are calculated in Table 3. The feature represents the runoff of the previous period, called period feature. The feature represents the runoff of the day last year, called year feature. Period features with MIC values greater than 0.85 and year features with MIC values greater 0.8 are filled with gray.

Table 3

MIC between historical runoff and current runoff

 
 

Therefore, the observations (labels) for the four datasets are all , and the feature inputs for the four datasets are , , and , respectively.

Task II: verify convergence

To ensure that the prediction results obtained by each model are optimal, the XGB model of Dataset 4 is taken as an example to verify the convergence of the super-parameter optimization model BOA and the prediction model. Convergence curves of BOA and XGB are shown in Figure 4. The number of iterations of BOA and XGB models is set to 100 and 500, respectively. The loss function is measured by the metric RMSE, and it is calculated using the normalized runoff. In the BOA model, the loss of each epoch varies widely, but the optimal loss continues to decrease throughout the iteration and converges at approximately 65th epoch. The loss of XGB is declining. After 200 epochs, the loss varies very little. Therefore, the model can converge whether it is optimizing the hyper-parameter or training the prediction model.

Figure 4

Convergence curves of BOA and XGB on Dataset 4.

Figure 4

Convergence curves of BOA and XGB on Dataset 4.

Task III: compare different models

To fully verify the performance of the model proposed in this paper, the comparison is made from three aspects: point prediction, interval prediction and probability prediction.

Point prediction comparison

Point prediction comparison is to verify the prediction accuracy of XGB-GPR-BOA. Point prediction metrics of nine models on four datasets are shown in Table 4. The best and second-best metrics are highlighted with dark and light gray backgrounds, respectively. Taking Dataset 1 as an example, RMSE, MAPE and R2 of XGB are 1,847 m3/s, 8.09% and 0.965, respectively. Compared with metrics of other models, RMSE and MAPE of XGB are significantly smaller, and R2 of XGB is much higher. There are similar results in other datasets. These metrics indicate that the runoff prediction results obtained by XGB are the most accurate.

Table 4

Point prediction metrics of nine models on four datasets

 
 

Interval prediction comparison

Interval prediction comparison is to verify the suitability of interval obtained by XGB-GPR-BOA. Interval prediction metrics of the nine models on four datasets are shown in Table 5. The best and second best metrics are highlighted with dark and light gray background, respectively. Taking Dataset 1 as an example, since these models use almost the same mechanism to quantify the uncertainty of the forecast, the interval widths of the nine models are relatively close, both around 0.85. In the case where the interval widths are close, an interval with higher CP is more suitable. Since the point prediction accuracy of XGB is higher than other models, the CP of XGB-GPR is higher than other models. Therefore, MC95% of XGB-GPR is 0.846, which is the smallest of the nine models, indicating that the interval obtained by XGB-GPR is most appropriate. There are similar results in other datasets.

Table 5

Interval prediction metrics of nine models on four datasets

 
 

To more vividly compare the interval prediction suitability, the interval prediction results on four datasets are shown in Figure 5. In each figure, the upper part is the interval prediction results of XGB-GPR, and the lower part is a comparison of the interval prediction metrics. It can be clearly seen from the figure that the blue curve and the red curve are very close, and most points of red curve are located in the gray interval, which indicates that the prediction results obtained by XGB-GPR have high accuracy and coverage. Meanwhile, the same conclusion can be drawn from the metric comparison histogram: the interval obtained by XGB-GPR is most appropriate among nine models.

Figure 5

Interval prediction results on four datasets. CP, coverage probability; MWP, mean width percentage; MC, suitability metric. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/nh.2021.161.

Figure 5

Interval prediction results on four datasets. CP, coverage probability; MWP, mean width percentage; MC, suitability metric. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/nh.2021.161.

Probability prediction comparison

Probability prediction comparison is to verify the comprehensive performance of probability density function obtained by XGB-GPR-BOA. The probability prediction metrics (CRPS) on four datasets are shown in Table 6. The best and second-best metrics are highlighted with dark and light gray backgrounds, respectively. The CRPS values of XGB-GPR on four datasets are 874, 1,145, 204 and 262, which are the smallest among nine models, indicating that the comprehensive performance of probability density function obtained by XGB-GPR is the best. It is also consistent with the results of point prediction and interval prediction.

Table 6

Probability prediction metrics (CRPS) of nine models on four datasets

 
 

Probability density functions obtained by XGB-GPR of eight periods of equidistant sampling on Dataset 3 are shown in Figure 6. In general, these curves are very full, and no curve is excessively high or low, wide or narrow, indicating that probability density functions obtained by XGB-GPR are suitable. In period 1, 157, 783 and 1,096, the observation lines are near the center of the curve, which show these points' prediction accuracy is high. In periods 313, 470, 626 and 939, the observation lines are far from the center. In probability prediction results of validation set, some observations are close to the center and other observations are off-center, which indicates that the probabilistic forecast is reliable. If all points are at the center or far from the center, we may not be convinced of these probabilistic forecast results.

Figure 6

Probability density function obtained by XGB-GPR on Dataset 3.

Figure 6

Probability density function obtained by XGB-GPR on Dataset 3.

Task IV: analyze hyper-parameter sensitivity

It is necessary to analyze the change process of super-parameters when BOA is used to optimize the super-parameters of XGB, and provide the suggestions for practical application tuning. The scatter plot of the hyper-parameter and loss is shown in Figure 7. The eight scatter plots are the relationship plot between maximum depth of a tree (P1), learning rate (P2), penalty coefficient of L1 regularization (P3), penalty coefficient of L2 regularization (P4), subsample ratio of the training instances (P5), subsample ratio of columns (P6), subsample ratio of columns for each level (P7) and subsample ratio of columns for each node (P8) and loss.

Figure 7

Hyper-parameter sensitivity analysis of XGB-GPR on Dataset 4.

Figure 7

Hyper-parameter sensitivity analysis of XGB-GPR on Dataset 4.

Some parameter optimization results can be obtained by analyzing these scatter plots:

  1. None of the scatter plots exhibit a distinct linear or nonlinear relationship. Most of the scatter shapes are nearly horizontal, indicating that the performance of XGB is not extremely sensitive to any of the hyper-parameters. When XGB is optimized by BOA, all parameters are combined to affect performance.

  2. Hyper-parameter P1 is one of the most important core parameters in XGB. In theory, increasing P1 will improve the performance of the model, but too large P1 will make the model complex, consume more memory resources and more likely to overfitting. In the first subgraph, P1 under optimal loss is evenly distributed from 1 to 10, which indicates that the model performance is not sensitive to the P1.

  3. Hyper-parameter P2 is also one of the most important core parameters in XGB. As can be seen from the second subgraph, a small learning rate will cause the XGB model to fail to converge. The learning rate under the optimal loss is mainly concentrated between 0.05 and 0.1; therefore, it is recommended to limit learning rate to this range when applied to the actual situation.

  4. P3 and P4 are parameters that avoid overfitting by regularization. P3 and P4 points less than 0.5 are more than points greater than 0.5, so it is recommended to limit P3 and P4 to below 0.5.

  5. P5–P8 are parameters that avoid overfitting by subsampling. Similar to results (3) and (4), it is recommended that P5 is between 0.7 and 0.9, P6 is between 0.7 and 1.0, P7 is between 0.8 and 1.0 and P8 is between 0.8 and 1.0.

The most important parameter in the GPR of XGB-GPR is the selection of kernel functions. The losses of different kernel functions on Dataset 4 are listed in Table 7. C, W, M, RBF and RQ are constant kernel, white kernel, matern kernel, radial basis function kernel and rational quadratic kernel, respectively. The RQ corresponds to the least loss; therefore, the recommended kernel function is RQ in the practical application.

Table 7

Losses of different kernel functions on Dataset 4

KernelCWMRBFRQ
Loss (RMSE) 0.170 0.222 0.196 0.379 0.025 
KernelCWMRBFRQ
Loss (RMSE) 0.170 0.222 0.196 0.379 0.025 

Discussion

The advantages of the forecast model proposed in this study are mainly in two aspects: (1) probabilistic forecast results can quantify forecast uncertainty and provide decision-makers with richer information; (2) model hyper-parameters are screened by BOA to ensure forecast accuracy. The disadvantage of the model is that the prediction result is a one-step ahead forecasting result, and multiple models need to be trained in a multi-step ahead furcating scenario. In application, we can use historical data to realize runoff scroll forecasting in future coming years.

To clarify the knowledge gap between the existing research and our research, the differences are presented as follows:

  1. Existing studies focus on the deterministic prediction of runoff, and the model proposed in our research is the deep learning probabilistic model.

  2. In some studies, the model super-parameters were not selected or selected manually, while in our model, the super-parameters were obtained through BOA.

CONCLUSIONS AND FUTURE WORKS

In this study, a probabilistic forecasting hybrid model called XGB-GPR-BOA is proposed to predict runoff and quantify the uncertainty of prediction. At first, XGB, as a novel gradient tree boosting algorithm, is applied to predict runoff, which can ensure the prediction accuracy. However, XGB can only obtain point prediction results, unable to quantify the uncertainty of the forecast. Then, it is assumed that the runoff of each period obeys a Gaussian distribution. The combination of XGB and GPR makes it possible to quantify the uncertainty of the forecast. Finally, in order to make the model show better performance, the MIC is used to filter the feature input and the BOA is used to optimize the hyper-parameter of the model.

The hybrid model XGB-GPR-BOA is applied to predict runoff for four actual cases in the Yangtze River Basin of China. The nine state-of-the-art models use seven evaluation metrics (RMSE, MAPE, R2, CPα, MWPα, MCα and CRPS) to verify from three aspects: point prediction accuracy, interval prediction suitability and probability prediction comprehensive performance.

The main conclusions of this study are summarized as follows:

  1. XGB-GPR-BOA can obtain high-precision point prediction, appropriate prediction interval and high-performance probabilistic prediction results.

  2. The optimal hyper-parameters of the model obtained by BOA are conducive to improving the prediction accuracy.

ACKNOWLEDGEMENTS

The authors want to thank Yongqi Liu and Shaoqian Pei for their help in the fieldwork.

AUTHOR CONTRIBUTIONS

H.B. and H.Q. conceptualized the study; G.L. did the data curation; C.L. investigated the study; B.L. formulated the study methodology; H.Q. provided resources; Z. Z. provided software; Z.Z. provided supervision; H.Q. provided validation; B.L. did the visualization; writing of original draft was done by H.B. and writing of review & editing was done by H.Q. and Z.Z.

FUNDING

This work is supported by the National Natural Science Foundation of China (No. 51979113), and special thanks are given to the anonymous reviewers and editors for their constructive comments.

CONFLICTS OF INTEREST

The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

REFERENCES

Behera
P. K.
Adams
B. J.
Li
J. Y.
2006
Runoff quality analysis of urban catchments with analytical probabilistic models
.
J. Water Resour. Plann. Manage.
132
(
1
),
4
14
.
Chen
T.
Guestrin
C.
2016
XGBoost: a scalable tree boosting system
.
ACM
785
794
.
Deng
L.
Pan
J.
Xu
X.
Yang
W.
Liu
C.
Liu
H.
2018
PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine
.
BMC Bioinf.
19
,
145
522
.
Fang
Y.
Zhang
X.
Corbari
C.
Mancini
M.
Niu
G.
Zeng
W.
2017
Improving the Xin'anjiang hydrological model based on mass–energy balance
.
Hydrol. Earth Syst. Sci.
21
,
3359
3375
.
Khosravi
A.
Nahavandi
S.
Creighton
D.
Atiya
A. F.
2011
Lower upper bound estimation method for construction of neural network-based prediction intervals
.
IEEE Trans. Neural Netw. Learn. Syst.
22
,
337
346
.
Le Callet
P.
Viard-Gaudin
C.
Barba
D.
2006
A convolutional neural network approach for objective video quality assessment
.
IEEE Trans. Neural Netw. Learn. Syst.
17
,
1316
1327
.
Luo
X.
Yuan
X.
Zhu
S.
Xu
Z.
Meng
L.
Peng
J.
2019
A hybrid support vector regression framework for streamflow forecast
.
J. Hydrol.
568
,
184
193
.
Papacharalampous
G.
Tyralis
H.
Koutsoyiannis
D.
2018
Predictability of monthly temperature and precipitation using automatic time series forecasting methods
.
Acta Geophys.
66
(
4
),
807
831
.
Rao
H.
Shi
X.
Rodrigue
A. K.
Feng
J.
Xia
Y.
Elhoseny
M.
Yuan
X.
Gu
L.
2019
Feature selection based on artificial bee colony and gradient boosting decision tree
.
Appl. Soft Comput.
74
,
634
642
.
Tan
Q.
Lei
X.
Wang
X.
Wang
H.
Wen
X.
Ji
Y.
Kang
A.
2018
An adaptive middle and long-term runoff forecast model using EEMD-ANN hybrid approach
.
J. Hydrol.
567
,
767
780
.
Wang
Q. J.
Robertson
D. E.
Chiew
F. H. S.
2009
A Bayesian joint probability modeling approach for seasonal forecasting of streamflows at multiple sites
.
Water Resour. Res.
45
,
W5407
.
Zhang
Z.
Qin
H.
Liu
Y.
Yao
L.
Yu
X.
Lu
J.
Jiang
Z.
Feng
Z.
2019b
Wind speed forecasting based on quantile regression minimal gated memory network and kernel density estimation
.
Energy Convers. Manage.
196
,
1395
1409
.
Zhang
Z.
Qin
H.
Liu
Y.
Wang
Y.
Yao
L.
Li
Q.
Li
J.
Pei
S.
2019c
Long short-term memory network based on neighborhood gates for processing complex causality in wind speed prediction
.
Energy Convers. Manage.
192
,
37
51
.
Zhang
Z.
Qin
H.
Li
J.
Liu
Y.
Yao
L.
Wang
Y.
Wang
C.
Pei
S.
Zhou
J.
2020
Short-term optimal operation of wind-solar-hydro hybrid system considering uncertainties
.
Energy Convers. Manage.
205
,
112405
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).