Abstract
Hydrological runoff prediction is vital for water resource management. The non-linear and non-stationary nature of runoff series and the complex hydrological features of large-scale basins make runoff difficult to predict. Long short-term memory (LSTM) is effective for runoff prediction but unstable for large-scale basins. This study develops three hybrid models that combine two-stage decomposition with LSTM, in which wavelet transformation (WT) is combined with complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), variational mode decomposition (VMD), or local mean decomposition (LMD), to predict the daily runoff of the Pearl River in China. The results indicate that CEEMDAN has broader applicability as a signal decomposition method for runoff series preprocessing, while VMD extracts high-runoff characteristics more easily. VMD–WT–LSTM is appropriate for predicting high and median runoff, whereas CEEMDAN–WT–LSTM is better for low runoff and for high and median runoff with less violent fluctuations. These hybrid models provide satisfactory predictions in terms of the NSE and R2 indicators, and 97.2% of indicators fall within the acceptable range for high-runoff predictions. The hybrid models outperform traditional and standalone models for high runoff, but none of the decomposition methods in this research can identify low-runoff sub-sequences. This study provides runoff prediction methods that require less data and processing time, and these methods are promising alternatives for daily runoff prediction in large-scale basins.
HIGHLIGHTS
Data-driven models are the trend in runoff prediction.
Signal decomposition technology can decompose the runoff features.
Wavelet transformation can decompose highly fluctuating runoffs.
The LSTM-based hybrid model is suitable for river runoff prediction.
Signal decomposition-LSTM model outperforms the standalone model.
INTRODUCTION
Runoff plays a critical role in efficient water resource management and effective flood warning (Yaseen et al. 2019), and it is also a vital driving force for the transport of hazardous materials in river basins (Young & Liu 2015). Accurate runoff information enables us to better understand the hydrological environment, provide early warning of flood events, plan water resources, control sediment, and protect ecosystems (Min et al. 2023). Precipitation and local hydrological environments are the main factors affecting the runoff and seasonal changes of rivers (Sun et al. 2022). As a result, a combination of hydrological parameters and mathematical models can predict river runoff effectively. Runoff prediction has become one of the best ways to obtain valuable runoff data alongside on-site monitoring, and it plays a crucial role in the scheduling and management of water resources (Chen et al. 2021).
Two types of runoff modeling approaches have been developed and applied in recent years: physical-based and data-driven models (Fidal & Kjeldsen 2020; Yin et al. 2022). The physical-based models, such as the Soil and Water Assessment Tool (SWAT) and the MIKE Système Hydrologique Européen model (MIKE SHE) (Devi et al. 2015; Xiang et al. 2018), cover the major processes in the hydrologic cycle and include process models for evapotranspiration, overland flow, unsaturated flow, groundwater flow, and channel flow and their interactions. Their explicit internal processes aid in understanding the underlying rules governing runoff. However, these models are subject to many simplifying assumptions (Duan et al. 1992), and they need wide-ranging and abundant parameters, making the establishment and calibration process extremely time-consuming (Bajirao et al. 2021). In addition, physical-based models have a high demand for various meteorological, underlying surface, and hydrological data, and obtaining these data is also a major challenge. By comparison, data-driven models, mainly using machine learning methods, can establish a functional relationship between input parameters and output results for accurate forecasting without a clear physical mechanism (Cao et al. 2019) and without the need for complex parameters or lengthy establishment and calibration. Given the non-linearity, high uncertainty, and spatiotemporal variability of runoff data, data-driven models are considered an optimal approach for minimizing or overcoming these issues in runoff prediction (He et al. 2014; Hao et al. 2023). Time series neural networks represented by the long short-term memory (LSTM) model are the most widely used machine learning data-driven models, and they have obvious advantages in processing large data (Ren et al. 2022).
The LSTM architecture is an improved recurrent neural network (RNN) model that overcomes the problem of vanishing or exploding gradients (Gao et al. 2020), and it can be trained for sequence generation by analyzing real data sequences one step at a time and predicting what will occur next. Due to its superior ability to capture the correlation of time series, it has been increasingly used by researchers in runoff prediction applications. The LSTM-based methods have been successfully used in the runoff prediction in various time series worldwide (Xiang et al. 2020) and have indicated their capability to simulate low-runoff conditions and their outflow curve for the peak operation period.
Runoff prediction in large-scale basins is now being developed to satisfy water management plans (Goudarzi et al. 2020). The main problem faced by this research is that the data fluctuate over a large range in time; for example, in large basins, the daily runoff in different months can differ by a factor of dozens (Ren et al. 2022). However, a single LSTM model has limited capacity to accurately distinguish and identify these features. Recent studies indicated that deep learning models with data preprocessing perform better when predicting time series (Tang et al. 2018). The characteristic factors of runoff data with long or short time series can be extracted and expressed using appropriate time–frequency decomposition technology to preprocess data (Mosavi et al. 2018; Xie et al. 2019). Treating the runoff data as signal data and applying a signal decomposition method helps to handle the large amount of data and its fluctuation (Jamei et al. 2020). Several comparisons between standalone models, namely models that predict directly without data preprocessing, and hybrid models revealed that the data preprocessing technique can enhance the performance of standalone models (Xie et al. 2019). Wavelet transform (WT) is frequently applied in the field of hydrology, mainly for processing and analyzing hydrological data (Gharbia et al. 2022). WT can be used to analyze rainfall trends and stream and river sediments (Gao et al. 2021). Hydrological sequence data processed through wavelet decomposition usually yield better model performance than unprocessed data (Ahmadi et al. 2021). Currently, WT is one of the most popular preprocessing methods for short-term runoff prediction models (Kaveh et al. 2021).
There is substantial evidence that the performance of forecast models can be enhanced by utilizing signal decomposition techniques to produce cleaner signals as model inputs (La Rosa Lama & Sánchez 2020). However, it should be noted that some papers report that using wavelet analysis as a preprocessing method improves modeling performance, while others indicate that it deteriorates modeling results (Sachindra et al. 2019). The performance of different wavelet types depends on the original input signal, and the behavior of input signals for different hydrological processes depends on the climate and hydrological characteristics of the watershed region (Bajirao et al. 2021). Moreover, hydrological processes are highly dynamic and stochastic (Liu et al. 2022). Especially in large-scale basins with high runoff, runoff dynamics are more intense and more difficult to predict. In addition, the current mainstream signal decomposition methods are primarily employed for runoff simulation in middle and small river basins, and application cases of hybrid models combined with signal decomposition in large-scale basins are still lacking. Considering the difficulty in selecting wavelet bases for wavelet decomposition and the complexity of high runoff fluctuations, adaptive signal decompositions such as empirical mode decomposition (EMD) (Feng et al. 2022), variational mode decomposition (VMD) (Seo et al. 2018), and local mean decomposition (LMD) (Peng et al. 2021) are first used to preliminarily decompose the runoff series and obtain components with uniform high-frequency features.
Then, components with unified high-frequency features are decomposed using WT for secondary decomposition, which may avoid the problem of wavelet base mismatch and solve the problem of the difficulty in capturing runoff features in large-scale basins.
This study aims to build hybrid models that combine multiple time–frequency decomposition technologies with LSTM and to predict the runoff in a large-scale basin. Various signal decomposition technologies, including CEEMDAN/VMD/LMD-WT, are set out; the advantages and disadvantages of each hybrid model are explored, and the models are compared to identify the best-performing one. In addition, other published runoff prediction results for this basin were collected for comparison to assess the performance of the hybrid runoff prediction models in large-scale basins.
STUDY AREA AND METHODS
Study area
Methods building
Various methods are commonly used for data decomposition, including WT (Hadi & Tombul 2018), variational mode decomposition (VMD) (Zounemat-Kermani et al. 2019), EMD (Liu et al. 2016), LMD, and complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN). Among them, WT has been applied successfully in various cases and is especially suitable for preprocessing daily runoff data before they are entered into deep learning models (Alizadeh et al. 2018). As for the other data decomposition methods, EMD has been shown to suffer from end-point effects and over-enveloping (Sankaran & Reddy 2016). CEEMDAN, on the other hand, requires fewer calculations and gives more accurate reconstruction results than EMD (Ren et al. 2015). In addition, VMD largely alleviates the mode mixing issue that often appears in EMD, namely the situation in which one sub-component contains more than one sub-signal with clear differences or several sub-components have similar characteristics (Naik et al. 2018). Thus, EMD is excluded from the comparison in this study.
The combination of wavelet decomposition and neural networks has been considered one of the best methods for traffic prediction. However, in large basins, the fluctuation of temporal data is more severe. Therefore, a layer of data decomposition is added before WT to better analyze the fluctuation of high-frequency data. CEEMDAN, VMD, and LMD are adaptive, while the parameters of WT are set manually. Different stations have different runoff fluctuation characteristics, so the manually set parameters differ greatly between stations. Therefore, CEEMDAN, VMD, and LMD are selected for the first adaptive decomposition to obtain components with relatively fixed frequency, and WT is then used to decompose the high-frequency components. Finally, the two-stage preprocessing is combined with LSTM to form the CEEMDAN/VMD/LMD-WT-LSTM hybrid models for runoff prediction.
Data signals decomposition
CEEMDAN: CEEMDAN decomposition is an algorithm developed for analyzing and processing non-linear and non-stationary signals. It is suitable for signals with higher noise levels and for situations that require more stable decomposition results. This study uses PyEMD to build the CEEMDAN method (https://github.com/vrcarva/ewtpy). CEEMDAN is self-adaptive; hence, the original runoff time series is automatically divided into several intrinsic mode functions (IMFs) representing different frequency and variation characteristics (Torres et al. 2011). CEEMDAN automatically decomposes the time series to a certain extent and then stops. The high-frequency IMFs, which are difficult to predict, are further decomposed using the stationary WT, and the decomposition results are then predicted using the LSTM. The low-frequency IMFs, which are relatively predictable, are predicted directly using the LSTM. The details are provided in Part S1, Supplementary information.
VMD: VMD was developed to extract non-linear trends and harmonics from complex signals (He et al. 2019a). It is suitable for the decomposition of non-linear and non-stationary signals and can effectively capture their time–frequency characteristics. Recently, VMD has been applied in multiple fields, such as forecasting economic and financial time series (Lahmiri 2016), stock price indices (Niu et al. 2020), and sunspot time series (Li et al. 2018). Thus, VMD is a potential method for runoff time series processing. Since VMD is not self-adaptive, its parameters must be defined manually. In this study, the tolerance of the convergence criterion is set to 0.000006 after repeated testing. For the decomposition number, the number of CEEMDAN decompositions is taken as a benchmark. Starting from this benchmark value, the decomposition number for the VMD method is increased until the center frequency changes little compared with that of the next decomposition. The original runoff time series is divided into several variational mode (VM) components representing different frequency and variation characteristics. As with CEEMDAN, these VMs are divided into two parts, one for WT preprocessing and one for direct prediction. The details are provided in Part S2, Supplementary information.
LMD: LMD was proposed as a frequency analysis method (Smith 2005). It is suitable for decomposing signals with local frequency changes, as it can capture the local characteristics of the signal. Similar to CEEMDAN, LMD is a self-adaptive signal decomposition method. The original runoff time series is automatically divided into several product functions (PFs) representing different frequency and variation characteristics. With PyLMD, the parameters for LMD can be set automatically, and the method performs adaptive decomposition based on the characteristics of the signal. These PFs are also divided into two parts and handled in the same way as for CEEMDAN and VMD. The details are provided in Part S3, Supplementary information.
WT: In this study, the stationary wavelet transform (SWT) is used to further process the high-frequency sub-components produced by the first-stage signal decomposition. After testing many cases, coif3 is selected as the wavelet basis, and the high-frequency sub-components are decomposed with the three-scale stationary wavelet packet decomposition method. PyWavelets, an open-source WT package for Python that combines a simple high-level interface with low-level C and Cython performance, decomposes the time series through its built-in SWT method. A single waveform is first decomposed into high-frequency and low-frequency waveforms, and these waveforms are then decomposed again. With the three-scale stationary wavelet packet decomposition, each waveform is decomposed into a total of eight components: four high-frequency wavelet components (detail components, referred to as digital components in this study) and four low-frequency wavelet components (approximate components), which are predicted in the next stage. The details are provided in Part S4, Supplementary information.
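The tree structure of a three-scale stationary wavelet packet decomposition can be illustrated with a minimal, dependency-free sketch. It is not the PyWavelets-based implementation of this study: it assumes a simple Haar averaging filter in place of coif3 and periodic boundary handling, and the function names and sample values are hypothetical. It only shows how one band is split into 2, 4, and finally 8 bands of the same length as the input.

```python
def haar_stationary_step(signal, dilation):
    """Split one band into low- and high-frequency bands without downsampling."""
    n = len(signal)
    low = [(signal[i] + signal[(i + dilation) % n]) / 2 for i in range(n)]
    high = [(signal[i] - signal[(i + dilation) % n]) / 2 for i in range(n)]
    return low, high


def stationary_packet(signal, levels=3):
    """Return the 2**levels leaf bands of an undecimated Haar packet tree."""
    bands = [list(signal)]
    for level in range(levels):
        dilation = 2 ** level  # filter taps are spread further apart at each level
        next_bands = []
        for band in bands:
            low, high = haar_stationary_step(band, dilation)
            next_bands.extend([low, high])
        bands = next_bands
    return bands


# Hypothetical high-frequency sub-component (illustrative values only).
component = [3.0, 7.0, 1.0, 8.0, 2.0, 9.0, 4.0, 6.0,
             5.0, 0.0, 7.0, 3.0, 8.0, 1.0, 6.0, 2.0]
bands = stationary_packet(component, levels=3)

# With this averaging filter the eight bands add back up to the input.
rebuilt = [sum(band[i] for band in bands) for i in range(len(component))]
```

The additive reconstruction in the last line is a property of this simplified filter; with coif3 the inverse transform provided by the wavelet library would be used instead.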
LSTM method
For runoff prediction, the major input variables are rainfall and runoff. Hence, an LSTM model with two input parameters is used to predict the runoff time series. As the input has only two dimensions, the complexity of the model does not need to be very high. Through the trial-and-error method (Liang et al. 2018), a double-layer LSTM with 36 neurons is found to be the most accurate structure. However, it is difficult to determine how long a history of data should be used to predict the next step of runoff. After many experiments in this study, a 30-day input window is found to be relatively accurate, and the runoff of the 31st day is predicted, as shown in Figure S2, Supplementary information. Note that 30 days is a parameter for the Pearl River Basin in this study, and a more suitable window length needs to be explored for other basins. Runoff sub-sequences and rainfall series are normalized before being used as input variables. There are 4,382 days in 12 years (2006–2017) and ideally 4,322 instances at one hydrological station. Accordingly, the model is trained using 75% of the data (3,258 days, 2006–2014) and validated using 25% of the data (1,064 days, 2015–2017). Before running the LSTM, the learning rate needs to be set. Since the decomposition sub-sequences fluctuate more than the wavelet components, a learning rate of 0.0000001 is used for decomposition sub-sequences, and a learning rate of 0.00000001 is used for wavelet components. The mean squared error (MSE) between prediction and observation is used as the loss function for training assessment and parameter calibration. In each training run, the MSE is calculated between the predicted values of a single component sequence and the corresponding component decomposed from the original data.
An MSE value of 0 is optimal, and the gradient of each parameter is calculated through backpropagation to continuously optimize the model. Training is finished once the MSE is relatively stable. After many tests, the number of epochs is set to 60. In this study, the LSTM model is developed and executed using PyTorch. A global random seed is set so that parameter initialization within the model generates the same random numbers in different runs. Detailed survey and data processing methods are summarized in Part S5 and Figure S1, Supplementary information.
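The windowing described above (30 days of rainfall and runoff as input, the 31st day of runoff as target) can be sketched as follows. This is a minimal illustration, not the code used in this study: the function names are hypothetical, and the split uses a simple chronological fraction, whereas the study splits by calendar year (2006–2014 for training, 2015–2017 for validation).

```python
def make_windows(runoff, rainfall, window=30):
    """Pair each `window`-day history of (runoff, rainfall) with the next day's runoff."""
    if len(runoff) != len(rainfall):
        raise ValueError("runoff and rainfall must have the same length")
    samples = []
    for end in range(window, len(runoff)):
        history = [(runoff[t], rainfall[t]) for t in range(end - window, end)]
        samples.append((history, runoff[end]))
    return samples


def chronological_split(samples, train_fraction=0.75):
    """Keep time order: the earliest samples train the model, the latest validate it."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]
```

Keeping the split chronological avoids training on days that come after the validation period.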
Hybrid method for runoff time series prediction
Step 1: Data decomposition. There are two steps in this stage. First, the signal decomposition methods (CEEMDAN, VMD, and LMD) are used to decompose the original runoff time series into several sub-components (IMFs, VMs, PFs) representing different oscillation frequencies. Second, the SWT is applied to further decompose the high-frequency sub-components from the previous step into wavelet components. Hereafter, the bands obtained from data decomposition are collectively referred to as sub-sequences, with sub-components denoting the bands decomposed in the first step and wavelet components denoting the bands decomposed in the second step.
Step 2: Data preprocessing. All data are subdivided into training and validation sets, and then each 30-day series part is taken as an input data series length to predict the 31st day. These data are normalized into [0, 1] using Min-Max Normalization. Detailed survey and data processing methods are summarized in Part S6, Supplementary information.
Step 3: Sub-sequence prediction. LSTM models are applied to predict each sequence, including the remaining CEEMDAN/VMD/LMD sub-components (IMFs, VMs, or PFs) and the stationary WT frequency wavelet components. In the prediction, the historic value time series [t − 29, t] is considered an input, while the data at time t + 1 are considered as an output ranging from 0 to 1. Thereafter, the true runoff data are obtained based on inverse normalization as shown in Part S6, Supplementary information.
Step 4: Data reconstruction. Firstly, the inverse stationary WT is used to reconstruct high- and low-frequency wavelet components as high-frequency sub-components (IMFs, VMs, PFs) of CEEMDAN/VMD/LMD. Secondly, by summing reconstructed sub-components (IMFs, VMs, or PFs) from wavelet components and predicted sub-components based on LSTM, the new runoff time series, including the runoff prediction time step, is finally predicted.
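The normalization in Step 2, the inverse normalization in Step 3, and the summation in Step 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the choice of which data the minimum and maximum are taken from is left to the caller, since the exact procedure is given in Part S6.

```python
def fit_min_max(values):
    """Return the (min, max) pair used to scale a series into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be scaled to [0, 1]")
    return lo, hi


def normalize(values, lo, hi):
    """Min-Max Normalization into [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Inverse normalization: map model outputs back to runoff units."""
    return [s * (hi - lo) + lo for s in scaled]


def reconstruct(component_predictions):
    """Sum the predicted sub-components time step by time step."""
    return [sum(step) for step in zip(*component_predictions)]
```

For example, `reconstruct([[1.0, 2.0], [3.0, 4.0]])` adds the two component series element by element to give the predicted runoff at each time step.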
In this study, the daily runoff for 38 hydrological stations of the Pearl River is predicted using the hybrid model. To show the prediction results more directly, these hydrological stations are divided into three groups based on mean runoff volume: group 1 – low runoff, group 2 – medium runoff, group 3 – high runoff. The geometric interval, a statistical classification method based on the law of numerical statistical distribution that minimizes the sum of squares within groups, is used for runoff grouping. The runoff values and their division results are shown in Table 1.
Station ID | Mean flow (m3/s) | NSE (CEEMDAN) | NSE (VMD) | NSE (LMD) | R2 (CEEMDAN) | R2 (VMD) | R2 (LMD) | Group |
---|---|---|---|---|---|---|---|---|
1 | 6,748.35 | 0.614 | 0.923 | 0.608 | 0.721 | 0.925 | 0.609 | 3 |
2 | 6,204.72 | 0.889 | 0.985 | 0.613 | 0.889 | 0.989 | 0.640 | 3 |
3 | 5,245.53 | 0.891 | 0.965 | 0.661 | 0.891 | 0.966 | 0.686 | 3 |
4 | 3,786.70 | 0.760 | 0.977 | 0.503 | 0.782 | 0.979 | 0.559 | 3 |
5 | 1,791.81 | 0.636 | 0.973 | 0.807 | 0.789 | 0.975 | 0.808 | 3 |
6 | 1,618.92 | 0.910 | 0.969 | 0.815 | 0.911 | 0.969 | 0.815 | 3 |
7 | 1,340.77 | 0.916 | 0.973 | 0.664 | 0.919 | 0.975 | 0.666 | 3 |
8 | 1,333.73 | 0.824 | 0.939 | 0.702 | 0.827 | 0.942 | 0.703 | 3 |
9 | 1,300.33 | 0.929 | 0.976 | 0.778 | 0.930 | 0.979 | 0.780 | 3 |
10 | 1,259.45 | 0.744 | 0.958 | 0.460 | 0.748 | 0.961 | 0.474 | 3 |
11 | 1,136.71 | 0.830 | 0.952 | 0.681 | 0.832 | 0.955 | 0.686 | 3 |
12 | 1,071.47 | 0.929 | 0.969 | 0.315 | 0.930 | 0.972 | 0.580 | 3 |
13 | 773.80 | 0.813 | 0.937 | 0.638 | 0.815 | 0.940 | 0.644 | 2 |
14 | 671.44 | 0.679 | 0.919 | −0.064 | 0.682 | 0.925 | 0.347 | 2 |
15 | 649.43 | 0.850 | 0.955 | 0.713 | 0.853 | 0.958 | 0.725 | 2 |
16 | 503.64 | 0.874 | 0.966 | 0.729 | 0.885 | 0.969 | 0.731 | 2 |
17 | 495.97 | 0.896 | 0.958 | 0.822 | 0.904 | 0.962 | 0.825 | 2 |
18 | 466.86 | 0.791 | 0.927 | 0.756 | 0.798 | 0.928 | 0.757 | 2 |
19 | 410.57 | 0.887 | 0.914 | 0.792 | 0.890 | 0.916 | 0.793 | 2 |
20 | 406.13 | 0.749 | 0.904 | 0.506 | 0.749 | 0.918 | 0.545 | 2 |
21 | 362.92 | 0.713 | 0.901 | 0.433 | 0.760 | 0.914 | 0.461 | 2 |
22 | 293.80 | 0.677 | 0.889 | 0.299 | 0.679 | 0.897 | 0.451 | 2 |
23 | 228.79 | 0.772 | 0.789 | 0.357 | 0.788 | 0.801 | 0.554 | 2 |
24 | 216.93 | 0.811 | 0.782 | 0.665 | 0.824 | 0.785 | 0.682 | 2 |
25 | 214.26 | 0.712 | 0.815 | 0.555 | 0.715 | 0.816 | 0.557 | 2 |
26 | 189.10 | 0.772 | 0.866 | 0.682 | 0.777 | 0.870 | 0.683 | 2 |
27 | 146.69 | 0.703 | 0.701 | 0.707 | 0.703 | 0.751 | 0.719 | 2 |
28 | 140.21 | 0.658 | 0.797 | −0.169 | 0.659 | 0.803 | 0.300 | 1 |
29 | 106.59 | 0.682 | 0.740 | −0.775 | 0.683 | 0.743 | 0.175 | 1 |
30 | 106.08 | 0.618 | 0.598 | 0.621 | 0.647 | 0.665 | 0.632 | 1 |
31 | 98.03 | 0.748 | 0.740 | 0.669 | 0.749 | 0.741 | 0.692 | 1 |
32 | 82.30 | 0.524 | 0.623 | 0.512 | 0.555 | 0.631 | 0.526 | 1 |
33 | 67.67 | 0.674 | 0.447 | 0.507 | 0.678 | 0.458 | 0.526 | 1 |
34 | 45.36 | 0.350 | 0.050 | 0.353 | 0.363 | 0.292 | 0.386 | 1 |
35 | 45.12 | 0.394 | 0.393 | 0.410 | 0.407 | 0.411 | 0.430 | 1 |
36 | 37.27 | 0.499 | 0.113 | −0.320 | 0.512 | 0.160 | 0.191 | 1 |
37 | 24.79 | 0.466 | 0.093 | 0.431 | 0.496 | 0.334 | 0.451 | 1 |
38 | 13.17 | 0.379 | −0.296 | 0.278 | 0.435 | 0.095 | 0.330 | 1 |
Model performance evaluation
The performance of the three hybrid models in this study is evaluated using the Nash–Sutcliffe efficiency coefficient (NSE) and the coefficient of determination (R2). The NSE is one of the most regularly used criteria for assessing runoff prediction, and the closer the value is to 1, the more accurate the prediction (Kumar et al. 2016). An NSE value greater than 0.6 indicates an outstanding prediction, and a value less than 0.4 indicates an unacceptable prediction (Nash & Sutcliffe 1970). Detailed survey and data processing methods are summarized in Part S7, Supplementary information.
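The two indicators can be computed as in the minimal sketch below. The NSE follows its standard definition; for R2, the squared Pearson correlation between observed and simulated runoff is assumed here, since the exact formula used in this study is given in Part S7. The function names are hypothetical.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variance."""
    mean_obs = sum(observed) / len(observed)
    error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series has zero variance")
    return 1 - error / variance


def r_squared(observed, simulated):
    """Squared Pearson correlation (assumed definition of R2)."""
    mean_obs = sum(observed) / len(observed)
    mean_sim = sum(simulated) / len(simulated)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(observed, simulated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_sim = sum((s - mean_sim) ** 2 for s in simulated)
    if var_obs == 0 or var_sim == 0:
        raise ValueError("both series must have non-zero variance")
    return cov ** 2 / (var_obs * var_sim)
```

Under this definition a prediction with a constant bias still has R2 equal to 1 while its NSE drops below 1, which is why the two indicators are reported side by side.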
RESULTS
Sequence decomposition of runoff
The decomposition process of VMD differs from those of CEEMDAN and LMD. CEEMDAN and LMD separate the high-frequency band from the original runoff time series first, and the amplitudes of IMFs and PFs decrease as the decomposition proceeds. VMD separates the low-frequency band from the original runoff time series first, and the amplitudes of VMs also decrease as the decomposition proceeds. VMs fluctuate more slowly than IMFs and PFs when the amplitude is large. Thus, at the same frequency, the VMs have a smaller amplitude and a more stable vibration than IMFs and PFs. In addition, the last separated IMF and PF are monotone, with only a single extreme point. By contrast, the first VM has many extreme points, and the lowest-frequency VM still fluctuates regularly. Thus, VMD does not completely decompose the band and is inferior to CEEMDAN and LMD in decomposition depth.
For a more detailed elaboration, stations 7, 20, and 33 were selected as the representations of three station groups since they are the median of corresponding groups; the results are illustrated in Figure 3. Figure 3(a) shows that, for CEEMDAN, the original runoff time series is decomposed into 11–12 IMFs which indicate more significant variation characteristics. The IMFs with low-frequency are more helpful for models to estimate future value. Station 7 series exhibits less decomposition than stations 20 and 33, demonstrating that large runoff is helpful for CEEMDAN to capture the variation characteristics. As shown in Figure 3(b), VM frequency increases from the first to the last sub-components. The decomposition number is 8–10, less than CEEMDAN; therefore, VMD performs more efficiently for single decomposition. In Figure 3(c), multiple extreme vibrations exist in high-frequency PFs; the maximum and minimum values of PF1 are considerably greater than IMF1 and VM1. It is difficult to capture the features of band changes with such violent fluctuations. In addition, the last PF (PF9) is irregular overall, reducing model prediction accuracy.
In general, CEEMDAN is suitable for capturing the data features of different amounts of runoff, while the VMD method makes it easier to extract high-runoff data. Although LMD has a balanced ability to extract various high and low runoff data, it is inferior to CEEMDAN in the same amount of runoff. In the simulation of runoff in large-scale watersheds, the CEEMDAN method has more general applicability in the preprocessing of runoff.
WT of runoff
The high-frequency band data obtained after signal decomposition remain challenging to predict due to their violent fluctuations. Therefore, WT was applied to decompose the two bands with the highest frequency (the last two bands of IMFs and PFs and the first two bands of VMs). Stations 7, 20, and 33 are again used as representative stations for describing the method's features, as shown in Figure S3, Supplementary information. Each station has two band groups after processing by each method, and each band group has eight bands. The first and second band groups are decomposed from the first and second high-frequency sequences of the first decomposition stage, respectively. The first four bands represent the digital components of a single decomposition result, whereas the latter four bands represent the approximate components. The approximate components represent the low-frequency part of the IMFs/VMs/PFs, and the digital components represent the high-frequency part.
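Selecting the two highest-frequency bands can be sketched independently of the order in which a decomposition method returns its components. The sketch below is an assumption for illustration, not the selection procedure of this study: it uses the rate at which a mean-centered band changes sign as a rough frequency proxy, and the function names are hypothetical.

```python
def zero_crossing_rate(series):
    """Fraction of consecutive pairs where the mean-centered series changes sign."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    return crossings / (len(series) - 1)


def split_by_frequency(components, n_high=2):
    """Return indices of the n_high fastest-oscillating bands and of the rest."""
    ranked = sorted(
        range(len(components)),
        key=lambda i: zero_crossing_rate(components[i]),
        reverse=True,
    )
    high = sorted(ranked[:n_high])  # sent to the wavelet stage
    low = sorted(ranked[n_high:])   # predicted directly
    return high, low
```

The bands listed in `high` would go through the wavelet stage before prediction, while those in `low` would be passed to the LSTM directly.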
In terms of amplitude, the band amplitudes obtained by VMs decomposition are the smallest at station 7. The maximum and minimum amplitudes of VMs decomposition are 600 and 12.5, respectively, less than IMFs (5,000 and 300) and PFs (50,000 and 750) decompositions. Maximum and minimum amplitudes of VMs decomposition are 1,500 and 15, respectively, at station 22. These values are similarly lower than IMFs (10,000 and 400) and PFs (50,000 and 1,000). The band with excessively high amplitude is unpredictable. Thus, the decomposed wavelet components from VMs are more predictable than from IMFs and PFs at stations with a high and median runoff. However, the average runoff at station 22 is lower than at station 7. It is reasonable to speculate that a decrease in runoff reduces the performance of WT, causing a sudden peak in the band. It will make bands more unpredictable. VMs decomposition also yields the smallest band amplitudes for station 33. There is only one band of high-information VMs, with a value of 1,000. The other bands are mostly in single digits and carry little information. These bands with single digits offer a high prediction accuracy. However, the amount of information stored inside is minimal, and its contribution to the final result is limited.
Regarding fluctuation, the two approximate components from the IMFs are not prominent. At stations 7 and 22, the values around the 200th day change sharply, whereas those around the 500th, 1,000th, 1,500th, and 2,000th days do not change. This shows that the first and second IMFs contain few low-frequency components and that most of the low-frequency information is stored in the other IMFs. This makes the wavelet components easy to predict but raises the complexity of predicting the other IMFs. The amplitude of the VM at station 7 is the largest, but it is only 1,000; hence, VMs are simple to estimate because of their small amplitude. The digital components of the PFs fluctuate most violently, reaching up to 5,000, 3,000, and 5,000 at stations 7, 22, and 33, respectively. These fluctuations are entirely unpredictable. In general, at stations with high runoff, WT is best suited to processing VMs. The LMD-WT method is inferior to VMD-WT and CEEMDAN-WT for forecasting runoff under high-variability conditions.
Prediction performance for hybrid models
Before running the models, a single random seed was set for the initialization of the three models to ensure consistency in the initial parameters of the LSTM model. Training data (from 2006 to 2014) were used to calibrate the parameters. Figures S4 and S5 and Table S1, Supplementary information, illustrate the performance of the three hybrid models during the training period. The average NSE and R2 are 0.690 and 0.705 for the CEEMDAN-based model, 0.764 and 0.787 for the VMD-based model, and 0.375 and 0.529 for the LMD-based model. For the CEEMDAN-based model, 94.7% of all predictions are satisfactory, an outstanding performance. The VMD-based hybrid model follows, with 86.8% of all NSE and R2 values in the acceptable range. The LMD-based model performs worst, although 63.2% of its NSE values and 71.1% of its R2 values are still within the acceptable range.
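For reference, the two indicators reported above can be computed as in the following minimal sketch. It uses the standard Nash-Sutcliffe efficiency and assumes that R2 denotes the squared Pearson correlation coefficient between observed and simulated runoff. The helper `share_acceptable` and its `threshold` argument are hypothetical; the acceptance threshold is not stated in this section, so it must be supplied by the caller.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - sum((o - s)^2) / sum((o - mean(o))^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def r_squared(observed, simulated):
    """Squared Pearson correlation (assumed definition of R2).

    Undefined (division by zero) if either series is constant.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_sim = sum(simulated) / n
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(observed, simulated))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_sim = sum((s - mean_sim) ** 2 for s in simulated)
    return cov * cov / (var_obs * var_sim)


def share_acceptable(scores, threshold):
    """Fraction of scores strictly above `threshold` (threshold is caller-supplied)."""
    return sum(score > threshold for score in scores) / len(scores)


# Synthetic example (not data from the study): a simulation with a constant bias.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 3.0, 4.0, 5.0]
print(nse(obs, sim), r_squared(obs, sim))
```

The example shows why both indicators are reported: a constant bias leaves the correlation-based R2 at its maximum while lowering NSE, so NSE is the stricter measure of agreement in magnitude.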
The simulation performance of the three hybrid models varies significantly with runoff level, as shown in Figure 4. In group 3, 97.2% of all NSE and R2 values fall within the acceptable range. VMD–WT–LSTM performs best, followed by CEEMDAN–WT–LSTM and LMD-WT-LSTM. The lower limit of VMD–WT–LSTM (NSE = 0.923, R2 = 0.929) is nearly identical to the upper limit of CEEMDAN–WT–LSTM (NSE = 0.929, R2 = 0.930), and both are significantly superior to LMD-WT-LSTM (NSE = 0.815, R2 = 0.815). In group 2, the performance of VMD–WT–LSTM and CEEMDAN–WT–LSTM declines slightly, but the CEEMDAN–WT–LSTM performance indicators are more stable. Both methods extract the features of medium and high flows well, but their simulation performance differs with runoff fluctuation. As shown in Figure 5 and Table 1, VMD–WT–LSTM performance is almost directly proportional to the amount of runoff at the station, whereas the performance of CEEMDAN–WT–LSTM is irregular. For example, station 1 has the highest runoff and the largest difference between minimum, medium, and average runoff, yet its NSE and R2 under CEEMDAN–WT–LSTM are much lower than those of stations 2 and 3. Several other stations with large runoff changes, such as stations 4, 5, 10, and 14, have comparable simulation results. As stated previously, LMD-WT-LSTM is poor at extracting high-frequency bands, resulting in poor predictions at large-runoff stations; for example, the predictions around the 900th day at stations 7 and 20 diverge significantly from the observations (Figure S6, Supplementary information). Thus, LMD-WT-LSTM is inadequate for processing runoff with excessively erratic fluctuations, unlike the two other approaches, which can handle extreme values. In group 1, only CEEMDAN–WT–LSTM has an average NSE greater than 0.5. In general, stations whose average runoff exceeds a certain scale (about 140 m3/s) are well suited to simulation with the hybrid models.
VMD–WT–LSTM is excellent at identifying and predicting fluctuations, so it is suitable for data with high (mean runoff greater than 1,071 m3/s) and median (mean runoff greater than 140 m3/s and lower than 1,071 m3/s) runoff. Its performance improves as the runoff volume increases. CEEMDAN–WT–LSTM is more appropriate for predicting low runoff (mean runoff less than 140 m3/s) and high and median runoff with relatively mild fluctuations. LMD-WT-LSTM cannot handle outliers. It is suitable only for medium and high runoff bands with stable fluctuations and is generally inferior to CEEMDAN–WT–LSTM and VMD–WT–LSTM on the same runoff batch.
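The runoff classes and model recommendations above can be summarized as a small decision rule. The sketch below is illustrative only: the thresholds of 140 and 1,071 m3/s come from the text, whereas the function names, the boolean `violent_fluctuations` flag, and the assignment of values exactly on a threshold to the lower class are assumptions made for the example.

```python
HIGH_THRESHOLD = 1071.0  # m3/s, mean runoff above this is "high" (from the text)
LOW_THRESHOLD = 140.0    # m3/s, mean runoff below this is "low" (from the text)


def runoff_class(mean_runoff):
    """Classify a station by mean daily runoff in m3/s.

    Assumption: values exactly on a threshold fall into the lower class.
    """
    if mean_runoff > HIGH_THRESHOLD:
        return "high"
    if mean_runoff > LOW_THRESHOLD:
        return "median"
    return "low"


def suggested_model(mean_runoff, violent_fluctuations=True):
    """Return the hybrid model the text recommends for a station.

    `violent_fluctuations` is a hypothetical flag; the text gives no
    quantitative criterion for how violent the fluctuations must be.
    """
    if runoff_class(mean_runoff) == "low" or not violent_fluctuations:
        return "CEEMDAN-WT-LSTM"
    return "VMD-WT-LSTM"


print(suggested_model(2000.0))                              # high runoff
print(suggested_model(500.0, violent_fluctuations=False))   # median, mild fluctuations
print(suggested_model(80.0))                                # low runoff
```

LMD-WT-LSTM is never returned because the text describes it as generally inferior to the other two models on the same runoff batch.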
DISCUSSION
The standalone LSTM (Table S2 and Figure S7, Supplementary information) is often worse than the CEEMDAN–WT–LSTM and VMD–WT–LSTM models and better than the LMD-WT-LSTM model in predicting large and medium runoff. This result is consistent with Sun et al. (2019) (Table S3, Supplementary information), who applied single ANN, LSTM, and autoregressive (AR) models and decomposition-based hybrid models to estimate the runoff at two stations in the Pearl River and found that the hybrid models performed better overall. Previous studies indicate that decomposed sub-sequences are easier to predict, as shown in Table S3, Supplementary information. Wang et al. (2015) reported that ensemble empirical mode decomposition (EEMD) can effectively increase forecasting accuracy for the Dahuofang reservoir, with improvements of 437.78% and 135.01% in the validation stage for the coefficient of correlation (R) and NSEC, respectively. Zhang et al. (2021) indicated that the non-autoregressive (NAR) model with CEEMDAN decomposition has better prediction ability than the single model. Daily runoff prediction is often seen as a more complex process because of its significant variability. He et al. (2019a, 2019b) demonstrated that the VMD-DNN model is a promising new method for daily runoff forecasting. Sun et al. (2019) also showed that the hybrid model based on a single decomposition (WT-ANN) at Longchuan station (R = 0.843) performs better than the single model (R = 0.834). For comparable hydrological stations, the CEEMDAN–WT–LSTM and VMD–WT–LSTM models in this study have R values of 0.907 and 0.886, showing superior performance. Similar LSTM or single-decomposition hybrid runoff prediction models, listed in Table S3, Supplementary information, provided good prediction results but performed below the hybrid models in this study.
Although first-stage decomposition can successfully identify sub-sequences with fluctuation characteristics, the high-frequency sub-sequences are still challenging to predict. Hence, the two-stage decomposition method helps extract the series characteristics hidden in the high-frequency series (Chen et al. 2021). Overall, incorporating signal decomposition into deep learning models can produce accurate runoff predictions that can serve as a valuable reference for hydrological runoff prediction research worldwide.
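The overall structure of a two-stage "decompose, predict, recombine" workflow can be sketched as follows. This is a structural illustration under strong simplifying assumptions, not the method used in this study: a moving-average split stands in for both the first-stage decomposition (CEEMDAN, VMD, or LMD) and the second-stage WT, a persistence forecast stands in for the per-component LSTM, and summing the component forecasts is assumed as the recombination step. All function names and window sizes are hypothetical.

```python
def moving_average(series, window):
    """Trailing moving average; early points average over the values available so far."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def split(series, window):
    """Additive split into (high-frequency, low-frequency) parts.

    Stand-in for a real decomposition; high + low reproduces the input.
    """
    low = moving_average(series, window)
    high = [x - l for x, l in zip(series, low)]
    return high, low


def two_stage_components(series, first_window=8, second_window=3):
    """Stage 1 splits the series; stage 2 re-splits only the high-frequency part."""
    high, low = split(series, first_window)                # stand-in for CEEMDAN/VMD/LMD
    high_detail, high_approx = split(high, second_window)  # stand-in for WT
    return [high_detail, high_approx, low]


def persistence_forecast(component):
    """Placeholder for the per-component LSTM: repeat the last observed value."""
    return component[-1]


def hybrid_forecast(series):
    """Predict each component separately, then sum the component forecasts."""
    return sum(persistence_forecast(c) for c in two_stage_components(series))


# Synthetic series (not data from the study).
runoff = [float((i * 7) % 11) for i in range(30)]
print(hybrid_forecast(runoff))
```

Because the stand-in decomposition is additive, the components sum back to the original series at every time step, so the quality of the final forecast depends entirely on how well each component can be predicted, which is the motivation for decomposing the hard-to-predict high-frequency part a second time.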
Traditional physically based hydrological models, such as SWAT, have been the most important tools for runoff prediction and calibration. Table S3, Supplementary information, summarizes the performance of such models for various rivers alongside the hybrid models, including rivers with a basin area or runoff level similar to those in this study. For the Marol, Roodan, Fox, and Luohe rivers, whose basin areas are comparable to Longchuan (7,691 km2), the NSEs of the SWAT model for runoff prediction are 0.740, 0.680, 0.460–0.670, and 0.540 (Bekele & Knapp 2010; Suliman et al. 2015; Himanshu et al. 2018; Zhang et al. 2021), significantly lower than those of the CEEMDAN–WT–LSTM method (Table 1 and Table S3, Supplementary information). Similarly, the reported NSE and R for daily runoff prediction in the Mara River (13,750 km2), Xiaohong River (4,417 km2), and Yass River (1,597 km2) are inferior to those in this study (Dessu & Melesse 2013; Saha et al. 2014; Li et al. 2020). Thus, the hybrid models outperform the traditional ones in both large and small basins, and the proposed hybrid models perform better at similar stations. In addition, SWAT requires several meteorological and surface parameters, including daily precipitation, temperature, wind speed, solar radiation, and relative humidity, and simulation for large basins needs a long running time (Zhang et al. 2016). In contrast, the hybrid models in this study require only two inputs, daily precipitation and runoff, and their running time for all basin stations is within 10 minutes. In terms of accuracy, efficiency, and simplicity, LSTM-based two-stage decomposition is better than traditional methods.
The effectiveness and practicability of the hybrid models were verified in this study and in various other attempts, but a few shortcomings remain. First, the prediction of low-runoff time series is poor because none of the decomposition methods used in this research can identify sub-sequences with sufficient change characteristics when the runoff is extremely low. Presumably, small runoff that does not exhibit a usable time-scale feature is more susceptible to other influences, such as solar radiation. Hence, obtaining additional data types and including more potential impact factors in the hybrid model is a feasible way to predict such runoff with a small error. Second, finer data can lead to better prediction performance. This study uses daily runoff data for prediction. Prediction accuracy would be enhanced by data with a higher temporal resolution, such as 6- or 12-h data; however, data at such a resolution are challenging to obtain. The relationship between data quality and model accuracy must be investigated further. Despite some uncertainty, the LSTM model based on two-stage signal decomposition is recommended for runoff forecasting in large basins.
CONCLUSION
This study proposes three hybrid models for predicting daily runoff time series using two-stage decomposition. Based on daily runoff and rainfall data from the Pearl River Basin in China, two statistical performance indicators (R2 and NSEC) are applied to evaluate the models. The results demonstrate that signal decomposition methods can effectively reduce the difficulty of prediction and that two-stage decomposition separates high- and low-frequency components more finely, resulting in more accurate predictions. VMD–WT–LSTM is appropriate for predicting high and median runoff, whereas CEEMDAN–WT–LSTM is more suitable for low runoff and for high and median runoff with less violent fluctuations. The predictions of LMD-WT-LSTM are generally less accurate than those of the other two models. The hybrid models require less data and processing time and are easier to apply than traditional models. However, when the runoff is extremely low, none of the decomposition methods used in this research can identify sub-sequences with sufficient change characteristics. Overall, the presented hybrid methods, based on two-stage signal decomposition and deep learning, are promising alternatives for daily runoff prediction in large-scale basins, especially for high runoff.
ACKNOWLEDGEMENTS
The authors would like to acknowledge the financial support from the Guangdong Natural Science Funds for Distinguished Young Scholars (2022B1515020088), National Natural Science Foundation of China (42077429 and U1701242), Guangdong Provincial Key Laboratory of Chemical Pollution and Environmental Safety (2019B030301008), Guangdong Basic and Applied Basic Research Foundation (2021A1515011641), and Guangzhou Basic and Applied Basic Research Foundation (202102021233).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.