A comparative study of wavelet and empirical mode decomposition-based GPR models for river discharge relationship modeling at consecutive hydrometric stations

The river stage – discharge relationship has an important impact on modeling, planning, and management of river basins and water resources. In this study, the capability of the Gaussian Process Regressions (GPR) kernel-based approach was assessed in predicting the daily river stage – discharge (RSD) relationship. Three successive hydrometric stations of the Housatonic River were considered, and based on the ﬂ ow characteristics during the period of 2002 – 2006 several models were developed and tested via GPR. To enhance the applied model ef ﬁ ciency, two pre-processing techniques, namely Wavelet Transform (WT) and Ensemble Empirical Mode Decomposition (EEMD), were used. Also, two states of the RSD modeling were investigated. In state 1, each station ’ s own data was used and in state 2, the upstream stations ’ datasets were used as input to model the RSD downstream of the river. The single and integrated model results showed that the integrated WT- and EEMD-GPR models resulted in more accurate outcomes. Data processing enhanced the models ’ capability between 25% and 40%. The results showed that the RSD modeling in state 1 led to better results; however, when the station ’ s own data were not available the integrated methods could be applied successfully for the RSD modeling using the previous stations ’ data.


GRAPHICAL ABSTRACT INTRODUCTION
Accurate prediction of river stage-discharge is of great importance for the utilization and management of sustainable water resources. Reliable river discharge prediction is particularly important for hydrological operations and hydro-environmental management (Hasanpour Kashani et al. ). According to Petersen-Øverleir (), both the stage and the discharge of a river vary depending on the magnitude of rainfall intensity that the river basin receives.
To obtain a continuous record of discharge, the stage is recorded and the discharge is computed from a correlation of stage and measured discharge. This correlation, or training (calibration), is known as the stage-discharge relationship (Baiamonte & Ferro ). Accurate information about the flow rates of rivers is important for a variety of hydrologic applications such as water and sediment bed material load estimation, water resource planning, operation and development, and hydraulic and hydrologic modeling (Wahlin & Clemmens ).
However, collecting data for discharge on a continuous basis is costly, especially during large flood events. An alternative approach is to convert records of water stages into discharges using a stage-discharge relationship. So far, various and complex relationships have been proposed to predict river stage-discharge. However, due to the complexity and nonlinearity of the process and validated models, it is difficult to simulate the accurate amount of discharge carried by rivers.
It is said that the stage-discharge relationship is not a simple, unique relation (Habib & Meselhe ). Therefore, it is necessary to use other methods with higher efficiency. A number of researchers have developed several prediction models using traditional statistical regression methods and artificial intelligence techniques with this aim. The existing regression methods often do not show the desired accuracy due to the uncertainties of the stage-discharge relationship process. On the other hand, numerical models require a lot of data, which is time-consuming and costly. As the number of variables and their interactions increase, the accuracy of operation predicting will decrease. However, meta-model approaches can easily switch from linear to nonlinear separation. These techniques has been successfully used in a number of diverse fields including water resources (e.g., Anmala (Babovic ; Meshgi et al. ). In the past few decades several artificial intelligence methods (e.g., Artificial Neural Networks (ANNs), Neuro-Fuzzy models ( has been used recently. Unlike wavelet decomposition, empirical mode decomposition extracts the data oscillatory-mode components without a priori determining the basis functions or level of decomposition (Labate et al. ). A time series in the Discrete Wavelet Transform (DWT) breaks down into a series of linearly independent detail signals and one approximation signal by using a specific wavelet function. DWT can detect sudden signal changes well, because it transforms original time series data into two types of wavelet coefficients: approximation and detail. EMD has been proven to be a quite effective method for extracting signals from data generated in noisy nonlinear and non-stationary processes. Research has proved that proper data pre-processing by using WT and EMD can lead models to adequately illustrate the real specifications of the basic system.
By considering the aforementioned issues, river discharge is very important in the hydrological cycle and hydro-environmental management and plays a basic role in designing and operating irrigation schemes on a watershed scale. One of the most important problems in analysis and management of river discharge is the modeling/simulation. Drawing upon the knowledge of the previous data-mining models, this paper aimed to enhance the quality of estimation of river stage-discharge using a kernel-based model (GER) and two data-processing methods under two scenarios. In the first scenario, the intended parameter of each station was estimated using the station's own data, and in the second scenario, the pre- training of the GPR method is fast, has high accuracy, and the probability of the occurrence of data overtraining in this method is less. Also, WT and EEMD are pre-processing methods that give remarkable vision to the physical form of the data by representing information in both time and frequency domains. Therefore, in this study, the input data were decomposed into sub-series by WT and EEMD.
Then, these subseries were used as inputs in the GPR method. The daily river stage-discharge data of the Housatonic River in the period of 2002-2006 were used to test the developed models.
In general, due to the complex nature of the river stagedischarge process, the present study tried to develop a unique strategy to predict it precisely. The proposed conceptual strategy has some advantages to cover the shortcomings. First, based on quadratic optimization of the convex function in GPR, it can easily switch from linear to nonlinear separation by nonlinear mapping using so-called kernel functions. Therefore, it is capable of being applied to prediction of the river discharge process. On the other hand, EEMD and WT were employed as a hybrid pre-processing approach conjugated to GPR. WT along with EEMD decomposes a non-stationary signal into a given quantity of stationary sub-signals. Then, AI methods can be combined with WT and EEMD to improve the preciseness of the prediction. Also, by applying sensitivity analysis, the most effective parameters in the modeling process can be selected. It is expected that the integrated hybrid techniques outperform the single GPR method and improve the modeling efficiency.

Study area and used data
The Housatonic River is a river, approximately 149 miles (240 km) long, in western Massachusetts and western Connecticut in the United States. It flows south to southeast, and drains about 1,950 square miles (5,100 km 2 ) of southwestern Connecticut into Long Island Sound. Its watershed is just to the west of the watershed of the lower Connecticut River. In this study, the datasets of three successive stations were used in which the distance between stations was approximately 50 km. Daily discharge and stage data of the river are available for the three selected hydrometric stations on the Housatonic River at the USGS website (https://waterdata.usgs.gov/nwis/sw). Table 1

Pre-processing approaches
One of the most popular approaches in time series processing is the wavelet transform (WT) (Farajzadeh & Alizadeh ). The WT uses the flexible window function (mother wavelet) in signal processing. The flexible window function can be changed over time according to the signal shape and compactness (Mehr et al. ). After using WT, the signal will decompose into two approximation (largescale or low-frequency) and detailed (small-scale) components. An illustration of a three-level WT is shown in Figure 2. In the first level, the original signal (x) is decomposed into the two components of approximation (cA1) and detailed (cD1). In the second level, cA1 is again decomposed to approximation (cA2) and detailed (cD2) components. Finally, in the third level, cA2 is decomposed to cA3 approximation and cD3 detailed components. The sum of all detailed subseries and approximation series obtained from the third level will be the original signal (i.e. x ¼ cD1 þ cD2 þ cD3 þ cA3). The other approach for time series processing is empirical mode decomposition (EMD). The EMD method is an effective self-adaptive dyadic filter bank which is applied to white noise (a random signal which has equal intensity at different frequencies). By applying this method, each signal can be decomposed into a number of inherent mode functions (IMFs) which can be used to process nonlinear and non-stationary signals. One of the advantages of this method is the ability to determine the instantaneous frequency of the signal. At each step of the signal decomposition into its frequency components, the high-frequency components are separated first and this process must continue until the component with the lowest frequency remains. EEMD is developed based on EMD. In fact, using the noise-added data analysis method the EEMD solves the EMD mode mixing issue (Wu & Huang ). The EEMD algorithm can be described as: (1) for a given signal x(t), random white noise is added to the signal, (2) the noise-added signal is decomposed using EMD for obtaining IMF series, (3) steps 1 and 2 are repeated until the number of additions of white noise is greater than or equal to the number of trials, (4) for obtaining the ensemble IMF, the average of the sum of all IMFs is computed (I j (t)), and (5) the original signal is formed as x(t) ¼ ∑ n j¼1 I j (t). For selecting the most effective IMFs and using them as inputs in the modeling process, their energy values can be calculated and the IMFs with higher energy can be used as inputs (see Lei et al. () for more details).

Kernel-based approach
Kernel-based (KB) approaches are new methods which are used for classification and regression purposes. One of the important kernel-based approaches is Gaussian Process Regression (GPR), which works based on the different kernel types such as polynomial, normalized polynomial, radial basis function (RBF) and Pearson functions. Such a model is capable of adapting itself to predict any variable of interest via sufficient inputs. The GPR can model nonlinear decision boundaries, and there are several kernels to choose from. It is also fairly robust against overfitting, especially in high-dimensional space. As an intrinsic extension of the Gaussian distribution, Gaussian Processes (GP) bear their properties with mean and covariance as a vector and matrix respectively (MacKay ). A GP is defined as a collection of random variables, any limited number of which has a joint multivariate Gaussian distribution. The appropriate selection of kernel type is the most important step in the GPR due to its direct impact on training and classification precision. In fact, the GPR method is memory-intensive, and trickier to tune due to the importance of picking the right kernel. In kernel-based models we will be able to predict the proper behavior of the system, although we will not be able to characterize its intrinsic structure and behavior (Babovic ).

Performance criteria
In the current study, the proposed model's efficiency was assessed via correlation coefficient (R), determination coefficient (DC), and root mean square error (RMSE) criteria as follows: where L is the decomposition level and N is the number of time series. Using Equation (2) the value of parameter L was obtained as 3, however, due to the seasonality of the process, L ¼ 5 was used. Several combinations of discharge and stage values were used as inputs of the models to predict daily discharge. The developed input combinations for river discharge prediction are shown in Table 2. For all defined combinations, t demonstrates the time step (daily step), is stage values at time t À 1, and Q(t À 1) and Q(t À 2) represent discharge values at time (t À 1) and (t À 2), respectively. The output is discharge values at current time step Q(t). In Figure 3 the considered modeling process is shown.

GPR model development
It should be noted that each artificial intelligence method has its own parameters, and for achieving the desired results, the optimized amount of these parameters should be determined.
State 2 Model Input variable Model Input variable  Therefore, the RBF kernel function was used as the core tool of the GPR, which was applied for the rest of the models.

Modeling based on raw data
For evaluating the river discharge in the selected stations, several models were developed based on the river discharge and stage data. The models were analyzed with the GPR model to carry out the river discharge prediction. Table 3 and Figure 5 show the results of the developed GPR models. From the obtained results of the statistical parameters (RMSE, R, and DC), it can be stated that in the first state the model (IV) with input parameters of Q(t À 1), Q(t À 2) performed better than the others. Based on the results of models (I) and (II), it can be seen that in estimation of the RSD, using previous stage data led to more accurate results than using previous discharge data. In the second state, for station 2, the model I(III) with input parameters of Q 1 (t À 1), H 1 (t À 1) was selected as the superior model and for station 3, the model H(IV) with parameters of Q 2 (t À 1), Q 2 (t À 2) led to better predictions. In the stations' relationship investigation, it seems that using the previous stations' discharge data yielded better predictions than the previous stations' stage data. A comparison between the results of the two considered states showed that modeling based on each station's own data led to more desirable results. However, using the previous stations' data in the modeling process led to relatively accurate results; therefore, via the GPR approach the previous stations' data could be applied when the stations' own data were unavailable. Modeling based on pre-processing data In this section, the effect of time series pre-processing on increasing model accuracy was investigated. The input time series were decomposed using the WT and EEMD methods. To decompose the time series by WT, a mother wavelet which is more similar to the signal should be selected. In this study, the Daubechies (db2 and db4) and symlet (sym2 and sym4) mother wavelets were trained and it was found that the db4 mother wavelet led to better outcomes (see Figure 6). Therefore, the db4 mother wavelet was used for time series decomposition. Also, the decomposition level was selected as five (L ¼ 5), which means five detail and one approximation subseries would be obtained (see Figure 7(a)). Figure 7 illustrates the decomposed time series of station 1 using db4 and L ¼ 5. In the second step, data was decomposed via EEMD. The principle of EEMD is the decomposition of a signal to different IMFs and one residual signal. The sum of these signals will be the same original signal. The formation of IMFs is based on subtracting the basic function from the original signal. This process continues until the residual signal remains almost constant. In the EEMD, two critical parameters directly affect the decomposition performance of the algorithm.   averaging effects on decomposition regarding these parameters can be found in Wu & Huang ().
In this study, according to Figure 7(b), time series were decomposed into ten IMFs and one residual signal. Then, the obtained subseries were used as inputs of the GPR model to predict the river discharge. The results of the integrated pre-processing models are listed in Table 4 and shown in Figure 8. According to the results presented in Tables 3 and 4, it can be deduced that data pre-processing improved significantly the results accuracy and the applied integrated models were more accurate than the single GPR model. In fact, the use of WT and EEMD led to an improvement in the outcomes. The integrated pre-processing methods increased the model accuracy approximately between 25% and 40%.
According to the results, it can be seen that between the two pre-processing methods the EEMD had higher RMSE error criteria in comparison with the WT. Therefore, it can be stated that the WT method performed more successfully than the EEMD method in enhancing the prediction accuracy.
Also, it was found that model (IV) with input parameters of Q(t À 1), Q(t À 2) in modeling based on each station's own data performed better than the others. In this state adding the H(t À 1) variable to the input combinations enhanced the model accuracy. In the second state, for station 2, model I(III) with input parameters of Q 1 (t À 1), H 1 (t À 1) and for station 3, model H(IV) with parameters of Q 2 (t À 1), Q 2 (t À 2) were superior models. From the results, it can be seen that discharge modeling based on the station's own data resulted in more desirable outcomes.
The RMSE error criterion was used to graphically compare the performance of the single and integrated GPR models. The results are shown in Figure 9(a). As can be seen, in the river discharge modeling the distribution ranges of the RMSE were smaller for the integrated models and the WT-GPR model led to more accurate results.
In the next step, for evaluating the impact of each independent parameter, sensitivity analysis was performed for the best model of each state. In this regard, the input variables of the best models were omitted one by one and the WT-GPR model, which had higher efficiency compared with the GPR and EEMD-GPR models, was rerun. Based on the results from Figure 9(b), it can be deduced that parameter Q(t À 1) had the most significant impact on daily river discharge prediction.
Validation of proposed best models using Arkansas  Table 5 and shown in Figure 10.
As can be seen from Table 5, the integrated models led to the desired accuracy and the efficiency of WT-GPR was better than that of EEMD-GPR and GPR. Based on the results, it can be deduced that discharge modeling based on the station's own data led to more desirable outcomes.

CONCLUSION
The accurate prediction of river discharge is an important factor in improving water resource management. This study assessed the capability of time series pre-processing methods for river daily discharge modeling. In this regard, in the first step, the raw time series (without any data processing) were imposed on the GPR model. Then, the time series were decomposed to several subseries using WT and EEMD and these subseries were used as inputs. According to the results, it was found that the use of both WT and EEMD pre-processing methods increased the models' accuracy. The applied pre-processing methods enhanced the GPR model performance approximately between 25% and 40%. It was observed that in estimation of the river discharge, using the previous data of flow discharge led to more accurate results. It showed that modeling based on the each stations' own data led to more desirable results. However, using the integrated GPR approaches, the previous stations' data could be used in the river daily discharge modeling when the station's own data was unavailable. Based on the obtained results, it was observed that in state 1, the model with input parameters of Q(t À 1), Q(t À 2) and in state 2, the model with parameters of Q 1 (t À 1), H 1 (t À 1) for station 2 and the model with parameters of Q 2 (t À 1), Q 2 (t À 2) for station 3 performed more successfully. The efficiency of the applied methodology was verified using Arkansas River datasets and results showed that the applied methods were successful in the modeling process. In general, the results showed that the integration of the GPR model with pre-processing models could be a suitable solution for more accurate prediction of hydrological variables such as river daily discharge.

DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.