ABSTRACT
The main goal of this study is to improve the precision and reliability of monthly runoff forecasts in the complex Navrood watershed in northern Iran. A defining feature of the study is the use of a waveform matching algorithm to select the mother wavelet, a critical component of wavelet analysis; this departs from the selection techniques established in hydrological research. Model performance is assessed with the Technique for Order of Preference by Similarity to the Ideal Solution (TOPSIS), which ranks the models by their closeness to the ideal solution. The proposed methodology integrates the multiresolution analysis (MRA) based on the maximal overlap discrete wavelet transform (MODWT) with a random forest (RF) model and is referred to as MRA–RF. In the testing period, this hybrid model attains a Nash–Sutcliffe efficiency (NSE) of 0.94 and a mean absolute error (MAE) of 0.36 m³/s. The uncertainty analysis yields a p-factor of 73.5% and a d-factor of 37.9% in the training period.
HIGHLIGHTS
Using an appropriate lag time for the input data of ML methods in monthly runoff forecasting.
Combining ML methods with wavelet preprocessing methods, stepwise regression, and PCA.
Using the waveform matching algorithm to find the optimal mother wavelet.
Introducing RF–MODWTMRA as the top model and uncertainty analysis of modeling.
INTRODUCTION
Floods are among the major natural hazards and cause widespread destructive impacts on people, the environment, and property. Heavy rains, insufficient drainage, climate change, and improper management of water resources are regarded as the most frequent causes of floods. A high-precision evaluation of river runoff can therefore play a significant role in flood management policies. Because ground-truth monitoring stations are scarce and conventional flood warning systems are costly, it is necessary to develop machine learning (ML) models that accurately predict river runoff using innovative and feasible approaches.
Wang et al. (2023) used a decomposition method to reduce input uncertainty and artificial neural networks (ANNs) and least-squares support vector machines (LSSVMs) to predict monthly stream discharge. Also, Ibrahim et al. (2022) reviewed studies that predicted flow discharge with ML methods. They reported that a number of studies such as Tikhamarine et al. (2019, 2020); Fathian et al. (2019); Niu et al. (2018); Zhang et al. (2018); and Choong et al. (2017) applied different ML methods for streamflow prediction.
In recent years, pre-processing techniques such as wavelet transform (WT) have been broadly used in hydrological studies to improve the performance of ML models (Aussem & Murtagh 1997; Labat et al. 2000; Cannas et al. 2006; Nourani et al. 2009, 2011, 2019; Liu et al. 2014; Farajpanah et al. 2020; Nalley et al. 2020; Adib et al. 2021; Azarpira & Shahabi 2021; Abebe et al. 2022; Ahmadi et al. 2022; Esmaeili-Gisavandani et al. 2022; Syed et al. 2023).
Several algorithms have been established to choose the optimal mother wavelet. The first is Gabor's method, which applies a relatively difficult approach to find the appropriate mother wavelet for each data set (Burrus et al. 1998). The second is the energy matching algorithm, in which the Fourier transform (FT) is applied to the raw signal to identify the frequency ranges with dominant signal energy; the signal is then decomposed using different mother wavelets, and the signal energy is calculated within those frequency ranges. Based on Parseval's theorem, the most appropriate mother wavelet is the one that gives the closest match between these two calculated signal energies (Burrus et al. 1998). The third is the entropy matching algorithm, in which, as in the energy matching algorithm, the high-entropy ranges are identified through the FT; within these ranges, the wavelet whose entropy is most compatible with that of the FT is considered the most desirable mother wavelet (Passoni et al. 2005). The fourth is the shape matching algorithm, in which the optimal mother wavelet is the one most similar in shape to the partially known source wavelet. Each wavelet applied to the same signal produces results that differ from those of the others; hence, visual shape matching is usually applied to obtain the most suitable mother wavelet.
The efficiency of different mother wavelets was investigated for measuring the timing of multiunit bursts in surface electromyograms, and db2 was selected as the wavelet most similar to the corresponding signal (Flanders 2002). Ahadi & Bakhtiar (2010) examined the signatures of acoustic emission leakage signals by visual inspection. They found that the Gaussian mother wavelet most closely resembled the real signal and that spectrograms generated with suitable wavelets could be applied to leak detection. Tang et al. (2010) used the Morlet wavelet for denoising oscillation signals in wind turbines. Their results showed that the Morlet wavelet corresponded most closely to the mechanical impulse signal: the more closely the wavelet resembled the mechanical characteristics of the impulse response, the higher the coefficients obtained for the related impulses. However, it was difficult to perform a visual shape matching of the signal to the mother wavelet.
Two criteria, the information extraction of the signal and the distribution error, were proposed to obtain a proper wavelet for the image correction process, and the results revealed the superiority of bior1.3 (Zhang et al. 2005). For partial discharge detection, db4 was chosen as the most appropriate wavelet because it had the maximum cross-correlation with the ultrahigh-frequency signal (Yang et al. 2004). Mao et al. (2021) showed that ANNs can simulate monthly runoff more accurately than long short-term memory (LSTM) neural networks, while LSTM performs better than ANNs for simulating daily runoff. Pal & Talukdar (2020) showed that hybrid wavelet ANNs can predict flow discharge better than ML methods such as support vector machines (SVMs) and random forest (RF).
To the authors' knowledge, only a few studies have addressed the selection of a mother wavelet, and these were in fields other than hydrology (Ngui et al. 2013; Jang et al. 2021). In hydrology, no study has combined the waveform matching algorithm for selecting the appropriate mother wavelet, principal component analysis (PCA) for reducing the dimensions of the data, and uncertainty analysis together with the technique for order of preference by similarity to the ideal solution (TOPSIS) for selecting the best model. This study attempts to provide a framework based on ML models for the accurate estimation of watershed runoff. In most studies, the selection of the optimal mother wavelet depends on its performance in modeling and in recognizing the desired features. In this study, river runoff is predicted using mother wavelets coupled with climatic variables and hybrid models comprising the adaptive neuro-fuzzy inference system (ANFIS), LSSVM, the group method of data handling (GMDH), multivariate adaptive regression splines (MARS), and RF. The uncertainty of the best-performing models is then analyzed. The study aims to develop a preprocessing technique that selects an appropriate mother wavelet function for each input parameter of the modeling process.
MATERIALS AND METHODS
Study area
Input data to modeling
In this study, three weather stations, Nav, Khalian, and Kharjgil, provided monthly data sets of precipitation (PCP), maximum and minimum temperatures, evaporation, relative humidity (RH), sun hours (SH) per month (hours/month), and runoff from 1989 to 2017. The spatial average of the data was obtained using Thiessen polygons. The runoff data for the Navrood River were extracted from the Kharjgil station. During the study period, the average annual PCP was 1,256 mm; the average maximum and minimum temperatures were 19.9 and 11 °C; the average annual evaporation was 751 mm; the average annual RH was 82.4%; the average SH per month was 104 h; and the average runoff was 4.4 m³/s.
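The Thiessen spatial average mentioned above amounts to an area-weighted mean of the station values. A minimal sketch follows; the polygon areas and precipitation values are hypothetical placeholders, not values from the study, and the function name is chosen only for illustration.

```python
import numpy as np

def thiessen_average(station_values, polygon_areas):
    """Area-weighted mean of station values (Thiessen polygon weights).

    station_values: shape (n_stations,) or (n_stations, n_months)
    polygon_areas:  shape (n_stations,), area of each station's polygon
    """
    values = np.asarray(station_values, dtype=float)
    areas = np.asarray(polygon_areas, dtype=float)
    weights = areas / areas.sum()
    # Contract the station axis of `values` against the weights.
    return np.tensordot(weights, values, axes=(0, 0))

# Hypothetical polygon areas (km^2) for three stations; NOT from the study.
areas_km2 = [90.0, 60.0, 120.0]
# Hypothetical monthly precipitation (mm): rows are stations, columns are months.
pcp_mm = [[100.0, 80.0], [120.0, 60.0], [90.0, 70.0]]
basin_pcp = thiessen_average(pcp_mm, areas_km2)
```

Stations whose polygons cover a larger share of the watershed contribute proportionally more to the basin-average series.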
To complete data at stations with gaps in their time series, this study employed the double mass method. For this purpose, data from the station that exhibits the highest correlation with the data of the considered station are utilized (typically, this station is the one closest to the considered station). It is worth noting that the amount of missing meteorological data (precipitation and temperature data) in this study was negligible (<1% of the data) and river flow data were complete.
One of the most important steps in the modeling process is to provide a suitable combination of input variables to the model. Therefore, first, the mutual correlation between input and output variables was calculated. The values of correlation with Q are 0.33, −0.34, −0.36, −0.34, 0.32, and −0.39 for RH, SH in a month, maximum and minimum temperatures (Tmax, Tmin), precipitation (PCP) and evaporation (Eva), respectively. These correlations are significant at the 0.01 level (two-tailed).
Therefore, the input variables have interdependent effects, and these effects are considered in the modeling process using correlation calculations, selection of an appropriate mother wavelet, stepwise regression, and PCA. These processes have helped to improve the accuracy of river flow forecasting and achieve better results.
The correlation between Qt and Qt−1 and the correlation between Qt and meteorological data such as precipitation and temperature are nearly equal. Therefore, to determine Qt, all of these variables should be considered. This watershed has a small size, such that in winter, there is a high correlation between precipitation and river flow in the same month, and in spring, when river flow is mostly due to snowmelt, a high correlation between temperature and river flow is observed in the same month. Due to the small size of the watershed, the correlation between river flow in two consecutive months decreases.
The Navrood watershed is small, and the correlation between flow discharge and the meteorological parameters (PCP, EVA, RH, SH, Tmin, and Tmax) of previous months is low. Therefore, the flow discharge in each month is a function of the meteorological parameters in the same month and the flow discharge in the previous month.
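The lag analysis described above can be illustrated with a short sketch that computes the Pearson correlation between Qt and a lagged predictor. The series below are made-up toy values, not the Navrood record, and the helper name is an assumption.

```python
import numpy as np

def lagged_correlation(x, y, lag):
    """Pearson correlation between y[t] and x[t - lag], for lag >= 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag > 0:
        # Align x[t - lag] with y[t] by trimming both ends.
        x, y = x[:-lag], y[lag:]
    return float(np.corrcoef(x, y)[0, 1])

# Toy monthly series (illustrative values only).
q = np.array([4.0, 6.5, 7.0, 5.0, 3.0, 2.0, 1.5, 1.8, 2.5, 3.5, 4.2, 5.0])
pcp = np.array([90, 140, 150, 100, 60, 40, 30, 45, 70, 95, 110, 120], dtype=float)

r_same_month = lagged_correlation(pcp, q, lag=0)  # PCP_t versus Q_t
r_prev_month = lagged_correlation(pcp, q, lag=1)  # PCP_{t-1} versus Q_t
r_auto = lagged_correlation(q, q, lag=1)          # Q_{t-1} versus Q_t
```

Comparing the same-month and previous-month correlations for each variable is one way to decide which lags enter the model.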
First, the wavelet preprocessing technique was employed to eliminate the deterministic trend in the time series. In recent years, such transformations have been widely used in hydrology (Sang 2013; Nourani et al. 2014; Farajpanah et al. 2020).
DATA PREPROCESSING
Waveform matching algorithm
Waveform matching is a method to select a mother wavelet that closely resembles the shape of the signal being analyzed. This leads to more accurate analysis and better results.
The waveform matching algorithm works based on the following steps:
Identify the signal you want to analyze.
Pick a set of common wavelets.
Compare the shapes of these wavelets to the signal's shape using visual inspection and calculations like correlation.
Choose the wavelet that most closely matches the signal's shape.
Analyze the signal using the chosen wavelet.
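The steps above can be sketched in code under simplifying assumptions. Here the candidate set consists of three analytically defined shapes (Haar, Mexican hat, and a real Morlet), which stand in for the discrete families used in the study, and similarity is measured as the largest absolute Pearson correlation between a candidate shape and any equally long window of the signal. The function names, the window width, and the synthetic signal are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def candidate_shapes(width=32):
    """Hypothetical candidate wavelet shapes sampled on `width` points."""
    t = np.linspace(-4.0, 4.0, width)
    return {
        "haar": np.where(t < 0, 1.0, -1.0),
        "mexican_hat": (1.0 - t**2) * np.exp(-t**2 / 2.0),
        "morlet": np.cos(5.0 * t) * np.exp(-t**2 / 2.0),
    }

def best_match_score(signal, template):
    """Largest absolute Pearson correlation between the template and any
    equally long window of the signal."""
    signal = np.asarray(signal, dtype=float)
    n = len(template)
    best = 0.0
    for start in range(len(signal) - n + 1):
        window = signal[start:start + n]
        if window.std() == 0.0:
            continue  # flat window: correlation is undefined
        r = np.corrcoef(window, template)[0, 1]
        best = max(best, abs(r))
    return best

def select_mother_wavelet(signal, shapes):
    """Return the name of the best-matching shape and all scores."""
    scores = {name: best_match_score(signal, w) for name, w in shapes.items()}
    return max(scores, key=scores.get), scores

# Synthetic signal containing a scaled Mexican-hat bump (illustration only).
shapes = candidate_shapes(width=32)
signal = np.zeros(128)
signal[40:72] += 2.0 * shapes["mexican_hat"]
best, scores = select_mother_wavelet(signal, shapes)
```

In practice the candidate shapes would come from the wavelet families under test, and the correlation score would complement rather than replace visual inspection.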
Waveform matching offers several advantages:
Increased accuracy due to a better match between the signal and the wavelet.
Reduced errors in the analysis.
More applicable to signals with some prior knowledge.
Traditionally, choosing a mother wavelet relied on experience or using well-known wavelets by default. Waveform matching provides a more data-driven approach for selecting the optimal wavelet.
This study used waveform matching to select the best wavelet for analyzing river flow data. Waveform matching showed better performance compared with other methods due to:
Higher accuracy in matching the river flow signal.
Better results in modeling monthly river flow.
Simplicity and ease of implementation.
Generalizability for various types of signals.
The study also highlights that different input parameters might require different mother wavelets due to their unique characteristics. Selecting an optimal wavelet for each parameter leads to:
Better capturing of the unique characteristics of each parameter.
More accurate extraction of important features from the data.
Increased accuracy in modeling and prediction of the phenomenon.
WT and the corresponding algorithm to choose a suitable wavelet
The main idea behind the WT is to overcome the disadvantages of the FT in dealing with time–frequency resolution and the analysis of non-stationary signals. The wavelet family includes a wide range of basis functions (known as ‘mother wavelets’), each with its own filtering characteristics. In this study, a family of transforms coupled with the corresponding ML models was employed to compete for superiority, the optimal condition being the one that extracts the maximum possible decomposed information from the raw signal.
Wavelet name | Correlation, d1 (Qt−1) | Correlation, d2 (Qt−1) | Correlation, d3 (Qt−1) | Correlation, a3 (Qt−1)
---|---|---|---|---
bior3.1 | −0.17 | 0.26 | 0.14 | 0.19
db2 | −0.17 | 0.21 | 0.39 | 0.45
sym14 | −0.26 | 0.24 | 0.45 | 0.43
For finding the best mother wavelet function, the LSSVM method is coupled with the different mother wavelet functions. These mother wavelet functions are: 21 Daubechies, 19 Symlets, 5 Coiflets, 15 Biorthogonal, 15 Reverse biorthogonal, 5 Fejér–Korovkin, and discrete Meyer (dmey) mother wavelet functions. The written code for these mother wavelet functions in MATLAB is illustrated in the Supplementary material.
These mother wavelet functions were used to forecast monthly runoff. The simulation was repeated 10 times, and the results were averaged and compared (Table 2). The best mother wavelet function from each family, that is, the one with the highest R-value and the lowest values of the ratio of the root mean square error to the standard deviation of the observations (RSR) and of the mean absolute error (MAE), is shown in Tables 1 and 2 and Figure 3. In this study, the daily time series of Qt was also considered. It was observed that the mother wavelet functions that conform most closely to the shape of the daily time series of Qt provide the maximum possible information when decomposing the time series. The superior mother wavelet functions for the daily time series were the same as those for the monthly time series. The shape of the daily time series, the shapes of the mother wavelet functions with the highest similarity (blue) and the lowest similarity (red) to the time series shape, and the correlations between Qt and the components of Qt−1 decomposed by these wavelets are shown in the Supplementary material.
Wavelet | No. of runs | R (train) | RSR (train) | MAE (train) | R (test) | RSR (test) | MAE (test)
---|---|---|---|---|---|---|---
bior3.1 | 10 | 0.61 | 0.79 | 1.38 | 0.65 | 0.77 | 1.24
db2 | 10 | 0.63 | 0.78 | 1.36 | 0.68 | 0.75 | 1.20
sym14 | 10 | 0.75 | 0.66 | 1.14 | 0.73 | 0.69 | 1.16
As indicated in Table 1, sym14 outperformed the two other mother wavelets, bior3.1 and db2, in analyzing the time series: the correlation of its approximation and detail components with the target is generally higher than for the other mother wavelets. Table 2 also demonstrates the superiority of sym14 in simulating the monthly river runoff. This indicates that sym14, being the most similar in shape to the original time series, extracted the maximum possible information from it.
The mother wavelet determines how well the similarity between the wavelet function and the given time series can be detected. Since not all types of mother wavelets are equally effective, a given wavelet function may provide accurate results for one or more specific signals while being unsuitable for others. Therefore, choosing the appropriate mother wavelet is a crucial step in wavelet hybrid modeling.
The main point is that several input data may be needed to model a phenomenon. In this study, the monthly runoff of the Navrood River is a function of the meteorological parameters in the same month and the runoff of the previous month (Equation (1)).
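Equation (1) is referenced above but is not reproduced in this text. Based solely on the verbal description, its functional form can be sketched as follows; the symbol f and the subscript notation are assumptions rather than the authors' own notation:

```latex
Q_t = f\left(\mathrm{PCP}_t,\ T_{\max,t},\ T_{\min,t},\ \mathrm{Eva}_t,\ \mathrm{RH}_t,\ \mathrm{SH}_t,\ Q_{t-1}\right)
```

Here the subscript t denotes the current month and t−1 the previous month.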
Since the discrete wavelet transform (DWT) has been widely used to analyze hydrological time series, two further WT methods that generalize the DWT are also incorporated in this study: the maximal overlap discrete wavelet transform (MODWT) and the MODWT-based multiresolution analysis (MODWTMRA).
DWT, MODWT, and MODWTMRA decompose a signal but differ in their strengths and weaknesses.
DWT: efficient (uses smaller filters) and widely used, but may lose information (especially at high frequencies).
MODWT: preserves information better (due to maximum overlap filters) and is good for non-stationary signals, but requires more computation.
MODWTMRA: enables detailed analysis with multiresolution, but can be very demanding on computational resources and storage space.
The best choice depends on your specific needs: signal type, size, desired accuracy, and available computing power.
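To make the difference from the decimated DWT concrete, the following is a minimal sketch of a MODWT using the Haar filter with circular boundary handling. It is an illustration only: the study used other mother wavelets (such as sym14) in MATLAB, and the function name and toy series are assumptions.

```python
import numpy as np

def haar_modwt(x, levels):
    """Haar MODWT with circular boundary handling.

    Returns (details, approx): details[j-1] holds the level-j wavelet
    coefficients and approx the scaling coefficients at the last level.
    Every output has the same length as x (no downsampling).
    """
    v = np.asarray(x, dtype=float)
    details = []
    for j in range(1, levels + 1):
        shift = 2 ** (j - 1)          # the filter is dilated at each level
        v_lag = np.roll(v, shift)     # v_lag[t] = v[t - shift], circular
        details.append((v - v_lag) / 2.0)
        v = (v + v_lag) / 2.0
    return details, v

# Toy monthly series (illustrative values only).
x = np.array([4.0, 6.5, 7.0, 5.0, 3.0, 2.0, 1.5, 1.8, 2.5, 3.5, 4.2, 5.0])
details, approx = haar_modwt(x, levels=3)
```

Unlike the DWT, each level keeps one coefficient per time step, and the squared coefficients sum to the energy of the original series. For the Haar filter the levels also happen to add back to the series; in general, that additive property belongs to the MRA components rather than to the raw MODWT coefficients.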
Stepwise regression method
In this method, the independent variables x1, x2, …, xn are introduced into the related equation based on some predefined criteria. A variable in the equation may be replaced by a new variable in the equation or discarded from the equation altogether (Thompson 1995).
In stepwise regression, the criteria that determine how a variable is entered, replaced, or eliminated are as follows:
The stepwise procedure with F-test
Replacing with F-test
The stepwise procedure with multiple correlation coefficient R2
Swapping with R2
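As an illustration of the first criterion, the sketch below implements forward selection with a partial F-test. It only adds variables and omits the replacement and elimination steps; the entry threshold `f_enter = 4.0`, the function names, and the synthetic data are assumptions chosen for the example, not settings reported in the study.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus `cols`."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_stepwise(X, y, f_enter=4.0):
    """Add, one at a time, the candidate with the largest partial F statistic
    until no candidate exceeds `f_enter` (an assumed threshold)."""
    n, p = X.shape
    selected = []
    current = rss(X, y, selected)
    while len(selected) < p:
        best_f, best_c, best_rss = 0.0, None, None
        for c in range(p):
            if c in selected:
                continue
            new = rss(X, y, selected + [c])
            dof = n - (len(selected) + 2)  # intercept + selected + candidate
            f_stat = (current - new) / (new / dof)
            if f_stat > best_f:
                best_f, best_c, best_rss = f_stat, c, new
        if best_c is None or best_f < f_enter:
            break
        selected.append(best_c)
        current = best_rss
    return selected

# Synthetic example: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
selected = forward_stepwise(X, y)
```

In the study's setting, the columns of `X` would be the wavelet sub-series (approximation and detail components) of the input variables.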
Principal component analysis
PCA provides new coordinates that preserve the directions of highest variance in the data. The purpose of the method is therefore to decompose a multivariate data set into components that represent the maximum possible variation. After the components are sorted by variance, those with the lowest variance can be ignored. In this way, the approach reduces the dimensions of the data while preserving the part that contains the most useful information. The primary variables are transformed into components that are not correlated with each other, each new component being a linear combination of the original variables. In other words, the method seeks linear combinations of the m indices x1, x2, …, xm that produce independent (uncorrelated) indices z1, z2, …, zl, with l < m. The steps of the PCA method can be summarized as follows: (1) data averaging and normalizing, (2) establishing a covariance matrix of the normalized data, (3) calculating the eigenvalues and eigenvectors of the covariance matrix, (4) arranging the eigenvectors in descending order of the corresponding eigenvalues, and (5) selecting an appropriate set of eigenvectors (Abdi & Williams 2010).
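The five steps listed above can be sketched with NumPy as follows. This is a minimal illustration on synthetic data; the function name and the choice of sample standard deviation for normalization are assumptions.

```python
import numpy as np

def pca(X, n_components):
    """PCA via eigen-decomposition of the covariance matrix."""
    Xc = np.asarray(X, dtype=float)
    # (1) centre and normalise each variable
    Xc = (Xc - Xc.mean(axis=0)) / Xc.std(axis=0, ddof=1)
    # (2) covariance matrix of the normalised data
    cov = np.cov(Xc, rowvar=False)
    # (3) eigenvalues and eigenvectors (eigh: symmetric matrix)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # (4) sort in descending order of eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # (5) keep the leading eigenvectors and project the data onto them
    scores = Xc @ eigvecs[:, :n_components]
    explained = eigvals / eigvals.sum()
    return scores, explained

# Synthetic data with two strongly correlated columns (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)
scores, explained = pca(X, n_components=2)
```

The returned `explained` vector gives the fraction of total variance carried by each component, and the retained score columns are mutually uncorrelated.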
This study investigated whether using PCA (a dimensionality reduction technique) to simplify the data would affect the accuracy of predicting monthly river flow. Performance criteria such as the mean squared error and the coefficient of determination (R2), together with sensitivity analysis, were used to compare models trained on PCA-reduced data with models trained on the original data. The results showed that PCA can significantly reduce the complexity (dimensions) of the data without harming the prediction accuracy. This is attributed to the fact that the principal components selected by PCA captured a substantial portion of the data variance, so that little critical information was lost during the reduction. Overall, the study confirms that PCA is a valuable tool for simplifying data analysis in tasks like river flow prediction, while still retaining the key information needed for accurate results.
In this study, preventing overfitting was a crucial aspect of model tuning. For this purpose, various methods and techniques were employed.
(a) Hybrid wavelet–ML models: hybrid models utilize wavelet decomposition to separate input signals into different frequency components. This allows the models to focus on each component individually, reducing model complexity and overfitting.
(b) Feature selection technique: feature selection, which involves identifying and retaining important features while eliminating irrelevant ones, is an effective approach to overfitting prevention. In this study, stepwise regression and PCA were employed for feature selection and dimensionality reduction.
(c) Model and parameter tuning: the various models used in this study have multiple parameters that require careful tuning. Optimization algorithms and trial-and-error methods were employed to optimize these parameters.
(d) Independent testing: to evaluate the models definitively, independent data not used in the training process was utilized. This assesses model performance in real-world scenarios and prevents overfitting.
(e) Model comparison: the results of different models were compared to select the one with the lowest error and least overfitting.
Uncertainty analysis
In this study, the uncertainty has been quantified using two criteria, the p-factor and the d-factor. The p-factor denotes the percentage of the observed data bracketed by the 95% prediction uncertainty (95PPU) band, so its optimal value is 100%. Because each observation has its own 95PPU range, whose width is specific to that observation, a second criterion is used: the d-factor, the ratio of the average width of the uncertainty band over all observations (the difference between the 97.5th and 2.5th percentiles of the simulated values) to the standard deviation of the observed data. Its desired value is zero, so a small d-factor means that the model uncertainty is low. However, reducing the d-factor also tends to reduce the p-factor, which is not desirable, so a proper balance between the two criteria is needed. In runoff forecasting, a p-factor of at least 70% and an average uncertainty bandwidth of at most 1.5 are considered acceptable (Abbaspour et al. 2015).
In this study, the p-factor and d-factor methods are used to assess uncertainty in monthly flow modeling. These two methods are statistical tools employed to examine the agreement between the predicted and observed values of monthly flow. The goal of these two methods is to evaluate the performance of monthly flow models and ensure that the model accurately predicts changes in flow.
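The two criteria can be computed from an ensemble of model runs as sketched below. The function name, the use of the sample standard deviation, and the toy ensemble are assumptions made for illustration.

```python
import numpy as np

def uncertainty_factors(observed, ensemble):
    """p-factor, d-factor, and mean 95PPU bandwidth.

    observed: shape (n_obs,)
    ensemble: shape (n_runs, n_obs), simulated values from repeated runs
    """
    obs = np.asarray(observed, dtype=float)
    sims = np.asarray(ensemble, dtype=float)
    lower = np.percentile(sims, 2.5, axis=0)    # lower 95PPU bound
    upper = np.percentile(sims, 97.5, axis=0)   # upper 95PPU bound
    inside = (obs >= lower) & (obs <= upper)
    p_factor = inside.mean()                    # fraction of bracketed observations
    mean_width = (upper - lower).mean()
    d_factor = mean_width / obs.std(ddof=1)     # sample std is an assumption
    return float(p_factor), float(d_factor), float(mean_width)

# Toy ensemble: 41 runs spread evenly within +/-1 of each observation.
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ensemble = obs + np.linspace(-1.0, 1.0, 41)[:, None]
p_factor, d_factor, mean_width = uncertainty_factors(obs, ensemble)
```

Multiplying the returned p-factor by 100 gives the percentage form used in the text.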
Performance criteria
In this study, the performance criteria, namely the correlation coefficient (R), Kling–Gupta efficiency (KGE), Nash–Sutcliffe efficiency (NSE), the ratio of the root mean square error (RMSE) to the standard deviation of the observations (RSR), and MAE, have been employed to quantify both the accuracy and the efficiency of the developed models. In the following, a brief description of these criteria has been presented.
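For reference, the five criteria can be computed as in the sketch below, which uses their commonly cited definitions (NSE as one minus the ratio of squared errors to the variance of the observations, RSR as the square root of that ratio, and the 2009 form of KGE based on correlation, variability ratio, and bias ratio). Whether the study used exactly these variants is an assumption.

```python
import numpy as np

def performance_metrics(observed, simulated):
    """R, NSE, KGE, RSR, and MAE for paired observed and simulated series."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    err = sim - obs
    r = np.corrcoef(obs, sim)[0, 1]
    # Ratio of squared errors to the squared deviations of the observations
    sse_ratio = np.sum(err**2) / np.sum((obs - obs.mean())**2)
    nse = 1.0 - sse_ratio
    rsr = np.sqrt(sse_ratio)                    # RMSE / std of observations
    alpha = sim.std() / obs.std()               # variability ratio
    beta = sim.mean() / obs.mean()              # bias ratio
    kge = 1.0 - np.sqrt((r - 1.0)**2 + (alpha - 1.0)**2 + (beta - 1.0)**2)
    mae = np.mean(np.abs(err))
    return {"R": float(r), "NSE": float(nse), "KGE": float(kge),
            "RSR": float(rsr), "MAE": float(mae)}

# Toy check: a simulation with a constant bias of +1.
obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
biased = performance_metrics(obs, obs + 1.0)
perfect = performance_metrics(obs, obs)
```

With these definitions RSR equals the square root of (1 − NSE), so the two criteria always rank models in the same order.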
RESULTS AND DISCUSSION
The proper mother wavelet using the waveform matching algorithm
Modeling of the decomposed time series using stepwise regression
The selection of input parameters is one of the most important steps in building a successful ML model. The data collection process should not be costly, and a main criterion is to reduce the size of the network while preserving high performance. In this study, the input parameters were selected based on Pearson's correlation coefficient at the 99% confidence level and the stepwise regression method. The time series of the hydroclimatic variables were first decomposed by the most suitable mother wavelet; then, different combinations of the resulting sub-series (approximation and detail components) were prepared via stepwise regression to obtain an optimal model configuration.
The PCA results
The third step of the procedure employed in this study (Figure 4) is the PCA technique to acquire the principal components of the corresponding variables, which accordingly leads to reducing the dimensions of the input data. PCA is often used for dimensionality reduction, where the goal is to reduce the number of variables while preserving the most important information. By selecting the top k principal components, which capture the most variance in the data, we can represent the data set in a lower-dimensional space. In this study, the PCA algorithm has been implemented on the results of the wavelet-based (DWT, MODWT, and MODWTMRA) models.
As depicted in Figure 7(a)–7(c), the first principal component exhibits a significantly larger range compared with the second principal component. This indicates that the first component holds higher importance than the second component. It is obvious from this figure that the first component has more dispersion than the second component. To determine the optimal number of components that can effectively represent the original data, we can plot the variance explained by each component, as well as the cumulative variance. This analysis helps us assess the amount of variance captured by each component and identify the point where adding more components does not significantly contribute to the overall variance explained. Figure 7(d)–7(f) illustrates the variance of each component that collectively represents nearly 95% of the total variance in the original data.
As observed in Figure 7(d)–7(f), the variance changes more slowly across the components of the DWT model than in the other models. The first four components account for approximately two-thirds of the total variance, so they are selected and the remaining components are disregarded. Conversely, both the MODWTMRA and MODWT models display a noticeable drop from the first to the second component; in these models, the first component explains ∼40% of the total variance. For the MODWTMRA and MODWT models, the first three and the first two components, respectively, capture two-thirds of the total variance. Consequently, the first three components are chosen for the MODWTMRA model and the first two for the MODWT model.
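The selection rule applied above, keeping the smallest number of components whose cumulative variance reaches about two-thirds of the total, can be sketched as follows. The eigenvalues are hypothetical and the function name is an assumption.

```python
import numpy as np

def components_for_variance(eigenvalues, target=2.0 / 3.0):
    """Smallest number of leading components whose cumulative explained
    variance reaches `target`."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    # First index where the cumulative share is at least the target
    k = int(np.searchsorted(cumulative, target) + 1)
    return min(k, len(vals)), cumulative

# Hypothetical eigenvalues (NOT from the study).
k, cumulative = components_for_variance([4.0, 3.0, 2.0, 1.0])
```

For these hypothetical eigenvalues, the cumulative shares are 0.4, 0.7, 0.9, and 1.0, so two components are retained.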
Comparison of the developed hybrid models
The simulation results of the models, comprising RF, MARS, GMDH, LSSVM, and ANFIS, coupled with DWT, MODWT, and MODWTMRA are presented in the Supplementary material.
As expected, during training, the performance of the models is relatively better than testing. In the simulation process, it is very important to avoid overfitting the model. The models were ranked based on the TOPSIS method for both training and test data (see the Supplementary material).
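The TOPSIS ranking can be sketched as below, assuming vector normalization and equal criterion weights; the weighting actually used in the study is not stated here, and the decision matrix contains hypothetical values for three unnamed models.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Closeness of each alternative (row) to the ideal solution.

    matrix:  shape (n_alternatives, n_criteria)
    weights: shape (n_criteria,), relative importance of each criterion
    benefit: shape (n_criteria,), True if larger is better, False if smaller
    """
    M = np.asarray(matrix, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    benefit = np.asarray(benefit, dtype=bool)
    V = w * M / np.linalg.norm(M, axis=0)        # normalise columns, then weight
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    anti = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_plus = np.linalg.norm(V - ideal, axis=1)   # distance to the ideal solution
    d_minus = np.linalg.norm(V - anti, axis=1)   # distance to the anti-ideal
    return d_minus / (d_plus + d_minus)

# Hypothetical decision matrix: columns are R (benefit) and RSR (cost).
criteria_values = [[0.98, 0.20],   # model A
                   [0.90, 0.40],   # model B
                   [0.80, 0.60]]   # model C
scores = topsis(criteria_values, weights=[1.0, 1.0], benefit=[True, False])
ranking = np.argsort(scores)[::-1]  # indices from best to worst
```

A score of 1 means that an alternative coincides with the ideal solution on every criterion, and a score of 0 means that it coincides with the anti-ideal solution.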
The results revealed that the RF–MRA model outperformed the others, with R = 0.98, KGE = 0.88, NSE = 0.96, RSR = 0.20, MAE = 0.36, and score = 0.9910 during the training period and R = 0.98, KGE = 0.84, NSE = 0.94, RSR = 0.24, MAE = 0.36, and score = 0.9805 during the testing period. This performance is attributed to the structure of RF, a combination of decision trees, and was achieved without overfitting during the simulation process.
In other words, RF is an ensemble learning method for classification and regression. It builds many decision trees during training and outputs either the majority class (classification) or the average of the individual trees' predictions (regression).
Meanwhile, for both the training and testing periods, the three highest scores were assigned to the same models: RF–MRA, RF–DWT, and MARS–DWT. The worst performances were those of ANFIS–MODWT in the training period and LSSVM–MODWT in the testing period. The comparison of preprocessing techniques also showed a distinct order of preference for each ML model. For the RF model, RF–MRA performed best (score = 0.9910), followed by RF–DWT (score = 0.9791) and RF–MODWT (score = 0.9410) during training; during testing, the order was again RF–MRA (score = 0.9805), RF–DWT (score = 0.9527), and RF–MODWT (score = 0.7866). For MARS, the ranking was MARS–DWT, MARS–MRA, and MARS–MODWT in both the training and testing periods.
For GMDH, the ranking was GMDH–DWT, GMDH–MRA, and GMDH–MODWT in both the training and testing periods. For the ANFIS model, ANFIS–MRA outperformed ANFIS–DWT and ANFIS–MODWT during the training period, whereas during the testing period ANFIS–DWT performed best, followed by ANFIS–MRA and ANFIS–MODWT. A similar ranking was observed for the wavelet-based LSSVM models. Thus, for the RF, MARS, and GMDH models, the preprocessing techniques (DWT, MODWT, and MODWTMRA) kept the same ranks in the training and testing periods, while for ANFIS and LSSVM the order of the wavelet techniques differed between the two periods. It can therefore be concluded that no single wavelet technique gave the best performance for all ML models, owing to the inherent differences among the studied models.
Uncertainty quantification in the developed hybrid models
The uncertainty analysis is discussed in terms of the p-factor and the d-factor. In this study, the 50 model executions with the best statistical criteria were used to calculate the 95PPU for each observed value, which had a great impact on the resulting p-factor and d-factor values. Table 3 presents the uncertainty analysis for the top two hybrid models during the training and testing periods. A comparison between the simulated and observed values, including the 95PPU range, for the top two models during the training and testing periods is shown in the Supplementary material.
Criteria | RF–MODWTMRA (train data) | RF–MODWTMRA (test data) | MARS–DWT (train data) | MARS–DWT (test data)
---|---|---|---|---
p-factor | 0.735 | 0.713 | 0.808 | 0.752
d-factor | 0.379 | 0.493 | 0.919 | 1.010
Average bandwidth | 0.92 | 0.92 | 2.23 | 1.89
From Table 3, it is concluded that for MODWTMRA–RF, the p-factor of 73.5% (more than 70%), the d-factor of 37.9%, and the average bandwidth of 0.92 (<1.5) during the training period are acceptable values for runoff forecasting. In addition, the diagrams reveal that the uncertainty interval (95PPU) for the minimum runoff is much wider than that for the maximum runoff, meaning that the minimum runoff has a higher uncertainty than the maximum runoff. In contrast, the uncertainty range is violated more often for the maximum runoff than for the minimum runoff. In fact, for the minimum runoff, the wider uncertainty interval leads to a higher d-factor, whereas for the maximum runoff the more frequent violations of the band lead to a lower p-factor. During the testing period of MODWTMRA–RF, the p-factor, d-factor, and average uncertainty bandwidth are 71.3% (more than 70%), 49.3%, and 0.92 (<1.5), respectively. In addition, the diagram indicates that the uncertainty band has almost the same distribution for the maximum and minimum runoff.
During the training of DWT–MARS, the p-factor is 80.8%, the d-factor is 91.9%, and the average width of the uncertainty band is 2.23. Compared with RF, the p-factor is higher, which shows that a larger percentage of the observed data falls within the 95PPU band; the d-factor is also significantly higher, because the average width of the uncertainty band is large relative to the standard deviation of the observed data. A relatively wide uncertainty band for nearly all observed data is the reason for the high values of both the p-factor and the d-factor. As with RF, the uncertainty of the minimum data is greater than that of the maximum data because of the wide uncertainty band.
In general, uncertainties in the modeling process may arise from simplifying assumptions, watershed processes that were not included in the simulation, data limitations, and the uncertainty and quality of the input data, all of which lead to prediction errors. A similar pattern was observed during the testing of DWT–MARS. Despite the satisfactory p-factor (75.2%), the average width of the uncertainty band exceeds the standard deviation of the observed data, which makes the d-factor slightly greater than one (1.010). Compared with the training period, the average width of the uncertainty band decreased to 1.89 (a reduction of about 15%), but it still exceeds the standard deviation of the observed data. In addition, the uncertainty band is wider for the minimum values than for the maximum ones, which indicates the higher uncertainty of the minimum values. It should be mentioned that the results are based on the top 50 model executions during both the training and testing periods. Retaining fewer top models might yield seemingly more acceptable results; however, such models might not generalize properly during validation, leading to overfitting.
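The p-factor and d-factor reported in Table 3 can be illustrated with a minimal sketch. It assumes the commonly used definitions: the 95PPU is bounded by the 2.5th and 97.5th percentiles of the ensemble at each time step, the p-factor is the fraction of observations inside that band, and the d-factor is the mean band width divided by the standard deviation of the observations. The function names, the linear interpolation used for the percentiles, and the choice of the population standard deviation are assumptions of this sketch, not details taken from the study.

```python
from statistics import mean, pstdev


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def uncertainty_factors(observed, ensemble):
    """Return (p_factor, d_factor, mean_band_width).

    observed: list of T observed runoff values.
    ensemble: list of model runs, each a list of T simulated values
              (hypothetical layout, e.g. the 50 retained executions).
    """
    lower, upper = [], []
    for t in range(len(observed)):
        at_t = [run[t] for run in ensemble]
        lower.append(percentile(at_t, 2.5))   # lower 95PPU bound (assumed)
        upper.append(percentile(at_t, 97.5))  # upper 95PPU bound (assumed)

    # p-factor: share of observations bracketed by the band
    inside = sum(lo <= obs <= up for obs, lo, up in zip(observed, lower, upper))
    p_factor = inside / len(observed)

    # d-factor: mean band width relative to the spread of the observations
    band = mean(up - lo for lo, up in zip(lower, upper))
    d_factor = band / pstdev(observed)  # population std is an assumption
    return p_factor, d_factor, band


if __name__ == "__main__":
    obs = [1.0, 2.0, 3.0, 4.0]
    runs = [
        [0.5, 1.5, 3.5, 3.0],
        [1.0, 2.0, 4.0, 4.0],
        [1.5, 2.5, 4.5, 5.0],
    ]
    print(uncertainty_factors(obs, runs))
```

Under these definitions, a d-factor above one, as in the DWT–MARS testing period, simply means that the average band is wider than the standard deviation of the observations.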
Climate change can affect the performance of ML-based precipitation and runoff models, either directly or indirectly, through changes in precipitation patterns, temperature, and runoff. To address these impacts, precipitation and runoff models should be updated so that they predict such changes more accurately. This includes using new climate data, improving modeling algorithms, and adjusting model parameters to current and future climate conditions. It is noteworthy that Lotfirad et al. (2023) investigated the effect of climate change on runoff in watersheds of the Hyrcanian region, including the Navrood watershed, and reported that the rainfall–runoff model performed adequately for future periods under climate change.
The proposed models in this study have demonstrated high prediction accuracy, flexibility, generalization capability, stability, and reliability, making them suitable for various applications:
Water resource management: the proposed models can be highly effective in water resource planning and management. Accurate and reliable information about future river flows can aid in better decision-making regarding water allocation, flood management, and agricultural planning.
Flood forecasting: the proposed models can serve as valuable tools for flood forecasting. By accurately predicting future river flows, proactive measures can be taken to mitigate flood damage.
Agricultural planning: in agriculture, knowing future water flows is crucial for planning crop planting and harvesting. The proposed models can assist farmers and agricultural managers in developing better water management plans based on accurate forecasts.
CONCLUSION
In this study, data-driven ML models were coupled with preprocessing methods, comprising the WT, a stepwise algorithm, and PCA, to develop accurate hybrid models for predicting river runoff. The waveform matching algorithm was applied for the first time in this study to select the optimal mother wavelet for data preprocessing. Unlike conventional methods, which assess the suitability of a wavelet only after the modeling process has been carried out, this method identifies the optimal wavelet before modeling begins by matching the shape of the original time series with that of the candidate wavelet. Once the raw time series was decomposed into its sub-series (approximation and details), a stepwise method was used to test different combinations of sub-series and identify the most efficient one. The output of the stepwise method was passed to a complementary preprocessing technique, PCA, to reduce the dimensionality of the input data. By changing the coordinates of the input data, PCA retains only the components that convey the most information. The PCA results revealed that, for the DWT, approximately the first four components explained 75% of the total variance; for the MODWTMRA and MODWT, the first three and the first two components, respectively, explained 75% of the total variance. The preprocessed input data were then fed to the ML models (RF, MARS, GMDH, LSSVM, and ANFIS) for simulation. For the training and testing periods, the statistical criteria MAE, RSR, NSE, KGE, and R were used to compare model performance. The MODWTMRA–RF model outperformed the other ML models in both periods. The final step was an uncertainty analysis based on the p-factor and d-factor.
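The statistical criteria named above (MAE, RSR, NSE, KGE, and R) can be sketched as follows. The formulas are the widely used textbook definitions, with KGE in its original three-component form (correlation, variability ratio, bias ratio); whether the study used exactly these variants is an assumption, and the function name `evaluate` is hypothetical.

```python
from math import sqrt
from statistics import mean


def evaluate(observed, simulated):
    """Return MAE, RSR, NSE, KGE, and R for paired observed/simulated series."""
    n = len(observed)
    o_mean, s_mean = mean(observed), mean(simulated)

    # Sums of squares shared by several criteria
    sq_err = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    o_dev = sum((o - o_mean) ** 2 for o in observed)
    s_dev = sum((s - s_mean) ** 2 for s in simulated)
    cov = sum((o - o_mean) * (s - s_mean) for o, s in zip(observed, simulated))

    mae = sum(abs(o - s) for o, s in zip(observed, simulated)) / n
    nse = 1.0 - sq_err / o_dev           # Nash-Sutcliffe efficiency
    rsr = sqrt(sq_err) / sqrt(o_dev)     # RMSE over std of observations
    r = cov / sqrt(o_dev * s_dev)        # Pearson correlation coefficient
    alpha = sqrt(s_dev / o_dev)          # variability ratio
    beta = s_mean / o_mean               # bias ratio
    kge = 1.0 - sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return {"MAE": mae, "RSR": rsr, "NSE": nse, "KGE": kge, "R": r}


if __name__ == "__main__":
    obs = [1.0, 2.0, 3.0, 4.0]
    sim = [2.0, 3.0, 4.0, 5.0]  # toy series with a constant bias of +1
    print(evaluate(obs, sim))
```

With these definitions, RSR squared equals 1 − NSE, so the two criteria rank models identically; KGE and R add information about bias and timing that NSE alone does not separate.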
In this study, each model was executed 50 times, and the best-performing models were extracted based on the statistical criteria. For the training period, the MODWTMRA–RF model showed the best performance, with p-factor = 73.5% (more than 70%), d-factor = 37.9%, and an average uncertainty bandwidth of 0.92 (<1.5), representing confident forecasting of the river runoff. For the DWT–MARS model in the same period, the p-factor, d-factor, and average width of the uncertainty band were 80.8% (more than 70%), 91.9%, and 2.23, respectively. The combination of ML models with preprocessing techniques, such as wavelets, has been discussed in many studies. Nourani et al. (2019) integrated the DWT with the M5 tree model to predict daily and monthly river runoff in Iran and Australia; db4 was introduced as the most appropriate mother wavelet for the precipitation–runoff simulation. Freire et al. (2019) coupled an ANN model with 54 mother wavelets from the DWT to predict daily runoff and selected the Meyer wavelet as the most suitable mother wavelet. Farajpanah et al. (2020) incorporated eight ML models and their combinations with the DWT to estimate the daily runoff of the Navrood River and identified sym7 as the optimal mother wavelet for the simulation. Wang et al. (2022) combined five ML models with the DWT to evaluate the monthly runoff of two rivers in the United States and selected db4 as the best mother wavelet. A noteworthy point is that none of these studies used a specific framework to choose the mother wavelet; the choice was judged only by how much it improved the modeling results.
However, in the current study, the waveform matching algorithm, an innovative method in hydroclimatic studies, was used to identify the most appropriate mother wavelet, and sym14 was determined to be the optimal mother wavelet for predicting monthly runoff.
ETHICAL APPROVAL
The manuscript is an original work with its own merit, has not been previously published in whole or in part, and is not being considered for publication elsewhere.
CONSENT TO PARTICIPATE
The authors have read the final manuscript, have approved the submission to the journal, and have accepted full responsibilities pertaining to the manuscript's delivery and contents.
CONSENT TO PUBLISH
The authors agree to publish this manuscript upon acceptance.
AUTHORS’ CONTRIBUTIONS
The authors declare that they contributed to the preparation of this manuscript.
FUNDING
The authors did not receive support from any organization for the submitted work.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.