Scientific methods for collecting and regulating surface water are critical for supplying water while avoiding damage, and principled planning for optimal runoff use depends on accurate runoff prediction. In recent years, there has been an increase in the use of machine learning approaches to model rainfall-runoff. In this study, the accuracy of four rainfall-runoff modeling approaches is evaluated: support vector machine (SVM), gene expression programming (GEP), wavelet-SVM (WSVM), and wavelet-GEP (WGEP). Python is used to run the simulation. The research area is the Yellow River Basin in central China, and the Tang-Nai-Hai hydrometric station in the west of the region has been selected. The train state data ranges from 1950 to 2000, while the test state data ranges from 2000 to 2020. The analysis treats rainy and non-rainy days separately. The WGEP simulation performed best, with a Nash-Sutcliffe efficiency (NSE) of 0.98, while the WSVM, GEP, and SVM simulations were less accurate, with NSEs of 0.94, 0.89, and 0.77, respectively. As a result, combining the base methods with the wavelet transform improved simulation accuracy, with the WGEP method achieving the highest accuracy.

  • The accuracy of machine learning based approaches for rainfall-runoff modeling is evaluated in the Yellow River Basin in China.

  • Wavelet transform as a pre-processing tool is combined with Gene Expression Programming and Support Vector Machine.

  • The combination of Wavelet and Gene Expression Programming (WGEP) performed best.

Graphical Abstract


One of the most important ways for most of the world to get water is through runoff. Recent years have seen a rise in the importance of scientific techniques for gathering and managing surface water to provide water and prevent damage (Nourani et al. 2019a; Molajou et al. 2021; Yavari et al. 2022; Zhang et al. 2022). Therefore, it is critical to obtain a more precise estimate of rainfall-runoff in catchments. Knowing how much runoff catchments can naturally produce is one of the most crucial aspects of planning for optimal runoff. Therefore, in water science and engineering, it is essential to study hydrology and simulate rain and water runoff in catchments (Ehteram et al. 2019; Afshar et al. 2021; Chen et al. 2021). Making precise predictions of flow rate and how it changes throughout the year is also crucial for planning and managing surface water resources because only a finite amount of fresh water can be withdrawn. Therefore, it is crucial to investigate how well forecasting models function (Feng et al. 2020).

The primary objective of a simulation model is to predict the operation of a complex system and investigate the impact of modifications on the system's performance (Azizi & Nejatian 2022). Most hydrological prediction models are categorized as physical and mathematical models. In runoff simulation, mathematical models are categorized as theoretical, conceptual, and experimental models. Machine learning has recently expanded into various scientific fields, including water resources (Alizadeh et al. 2020). Machine learning is the scientific study of algorithms and statistical models used by computer systems that rely on patterns and inferences rather than explicit instructions to perform tasks (Liu et al. 2021a). There is a wide range of machine learning research that can be conducted. Theoretically, researchers aim to develop new learning methods and assess their viability and quality. Alternatively, some researchers attempt to apply machine learning techniques to new problems. This spectrum is not discrete, and both approaches are present in the researchers’ work. Machine learning reduces operational expenses and accelerates data analysis (Ramezanizadeh et al. 2019; Ahmadi et al. 2020). The performance of an algorithm for machine learning can be evaluated by separating data into train and test sets (Afan et al. 2020). Data segmentation can be applied to classification and regression problems. This method applies to any supervised learning algorithm in general. The data separation procedure divides a data set into two or three subsets (Liu et al. 2021b). The significance of data in machine learning is attributable to two factors: (a) the model requires data to learn, and (b) the model must be measured using data because it may not have been able to extract the data's information effectively. The selection of appropriate and effective input variables is an important and influential step in ensuring the accuracy of models based on data processing. 
Various techniques for determining model input variables rely on data processing (Liu et al. 2021a; Molajou et al. 2021).

Rainfall-runoff models can be categorized according to how they express the process or solve the governing equations (Peng et al. 2020). Some characteristics of hydrological variables cannot be observed in the time domain; they can, however, be observed and evaluated after being transferred to other domains (such as the Fourier, frequency, wavelet, Z-transform, and Laplace domains). Studying them in domains other than time therefore allows for a more thorough and accurate investigation. The Fourier transform is one of these mathematical transformations; it is a mathematical operator that produces an alternative representation of a signal. Despite its many capabilities, the Fourier transform has limitations, including the inability to represent a signal in the time and frequency domains simultaneously, which particularly affects the spectral analysis of non-stationary signals (Komasi & Sharghi 2016; Sharghi et al. 2018). The wavelet transform, in contrast, is highly effective for dealing with non-stationary time series (signals). Using a collection of wavelet functions, the wavelet transform decomposes the time series into a series of coefficients. These functions are generated by translating and scaling a fundamental wavelet function known as the mother wavelet or analyzing wavelet.

Wavelets’ simultaneous localisation in the time and frequency domains is one of their key benefits. The use of fast wavelet transform makes wavelets computationally very efficient, which is their second major benefit. Wavelets have the great advantage of being able to separate the fine details in a signal. Despite being a robust tool for image and signal processing, the wavelet transform has three significant drawbacks: shift sensitivity, poor directionality, and lack of phase information.

As a novel strategy, the current study investigates the accuracy of four rainfall-runoff modeling approaches: support vector machine (SVM), gene expression programming (GEP), wavelet-SVM (WSVM), and wavelet-GEP (WGEP). Python is used to run the simulations.

Study area

After the Yangtze, the Yellow River is China's second-longest river. The Yellow River Basin is located between 96 and 119 degrees east and 32 and 42 degrees north. Bayan Har Mountain, where the Yellow River rises, is located in Qinghai province. With a drainage area of 795,000 km2 and an overall length of 5,464 km, this river flows through Qinghai, Sichuan, Gansu, Ningxia, Shaanxi, Henan, and Shandong before entering the sea on the coast of Shandong in Dongying City. As one of the largest hydrometric stations in the Yellow River Basin, Tang-Nai-Hai contributes approximately 35% of total annual discharge compared to its 15% area (Figure 1) (Zhang et al. 2021). From 1950 to 2020, monthly discharge data for the Yellow River were obtained from the Tang-Nai-Hai hydrometric station.
Figure 1

The Yellow River Basin and study area characteristics.


Simulation of rainfall-runoff process

Because the number of rainy days in each year is far smaller than the number of non-rainy days, runoff from the basin is generated by two different mechanisms. During rainfall and for a few days afterwards, runoff is mainly in the form of floods with high discharge and short duration. On most days of the year, when there is no rain, the outflow is base flow with low discharge and long duration. Therefore, this study presents a two-part rainfall-runoff model consisting of a model for rainy days and a model for non-rainy days. Based on the average annual rainfall and the average rainfall on each rainy day in the study region, days with total precipitation of more than 2 mm were classified as rainy days, while days with total precipitation of 2 mm or less were classified as non-rainy days. The data from the hydrometric station were categorized accordingly. On rainy days, when the flow is predominantly flood flow, the variables that affect the flow rate at the hydrometric station are the current day's precipitation, the previous day's precipitation, and the precipitation from two days earlier. Because of the study area's relatively short concentration time and rapid response, the previous day's precipitation affects the flow at the hydrometric station more than the precipitation from two days earlier. On non-rainy days, when the flow at the hydrometric station is base flow, the input variables are the flow rates of the previous day and of two days earlier; the non-rainy-day part of the model therefore uses only these two lagged flow rates as inputs.
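The day classification and input selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_samples`, the list-based data layout, and the sample values are hypothetical and do not come from the study; only the 2 mm threshold and the choice of lagged inputs follow the text.

```python
RAINY_THRESHOLD_MM = 2.0  # threshold stated in the text


def build_samples(precip, flow, threshold=RAINY_THRESHOLD_MM):
    """Split a daily record into rainy-day and non-rainy-day samples.

    precip and flow are equal-length lists of daily precipitation (mm)
    and discharge. Each sample is an (inputs, target) pair, where the
    target is the discharge on day t.
    """
    if len(precip) != len(flow):
        raise ValueError("precip and flow must have the same length")
    rainy, non_rainy = [], []
    # Start at t = 2 so that two lagged values are always available.
    for t in range(2, len(precip)):
        if precip[t] > threshold:
            # Rainy day: precipitation of day t, t-1 and t-2.
            inputs = (precip[t], precip[t - 1], precip[t - 2])
            rainy.append((inputs, flow[t]))
        else:
            # Non-rainy day: discharge of day t-1 and t-2.
            inputs = (flow[t - 1], flow[t - 2])
            non_rainy.append((inputs, flow[t]))
    return rainy, non_rainy


# Hypothetical five-day record, for illustration only.
precip = [0.0, 5.0, 1.0, 3.5, 0.0]
flow = [10.0, 30.0, 22.0, 28.0, 18.0]
rainy, non_rainy = build_samples(precip, flow)
```

Each of the two sample lists can then be passed to its own model, which mirrors the two-part structure of the rainy-day and non-rainy-day models.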

For the management of water resources, accurate runoff estimation is crucial. For many real-world uses involving conservation, environmental disposal, and water resource management, runoff simulation and forecast in basins is essential. Nevertheless, due to the spatial-temporal variability and interactions of underlying climatic and physiographic variables, the rainfall-runoff process is a complicated, non-linear, and dynamic hydrological phenomenon to predict. In order to figure out how well different machine learning methods estimate runoff in the study area, simulations were done using SVM, GEP, WSVM, and WGEP, and the results were compared with the values that were measured at the hydrometric station. Thus, data from 1950 to 2000 were utilized for the train state, and data from 2000 to 2020 were used for the test state. Figure 2 shows the overall simulation process.
Figure 2

The general process of simulation.


In recent years, the field of water resources management has been confronted with significant obstacles due to climate change and changes in land use. Consequently, it is required to understand the rainfall-runoff amount and its changes over time. Understanding each mathematical model's benefits, drawbacks, and accuracy is crucial for accurately predicting the amount of rainfall-runoff. This study aims to evaluate the predictive accuracy of models SVM, GEP, WSVM, and WGEP.

In the current study, SVM and GEP methods were utilized in conjunction with wavelet transform. Therefore, it appears necessary to examine the theoretical foundations of these methods.

Support vector machine (SVM)

SVM has become popular in modeling studies because of its advantages over artificial neural networks (ANNs). It appears to be a powerful alternative that overcomes some of the basic weaknesses of ANNs while retaining their strengths. The SVM model is also able to give more accurate predictions, particularly when the data set is not uniformly distributed (Ayubi Rad & Ayubirad 2017). SVM was developed as a binary classification method based on supervised learning. It can be applied to classification problems (where the output is a class label) and regression problems (where the output is a numerical value). The method classifies data by locating the optimal hyperplane that separates all data in one class from those in the other (Hearst et al. 1998). The optimal hyperplane is the one with the largest margin between the two classes. The margin is the widest slab parallel to the separating hyperplane that contains no data points. The points closest to the separating hyperplane are the support vectors.

The vectors xi for 1 ≤ i ≤ Nx are the inputs, and the binary labels yi ∈ {−1, 1} are the corresponding outputs (Nx is the sample size). Φ(xi) is the image of xi in the feature space, and K(xi,xj) = Φ(xi).Φ(xj) is a kernel function that denotes the inner product in the feature space.

The SVM optimization problem for the soft-margin case, subject to yi(w.Φ(xi) + b) ≥ 1 − ξi, ξi ≥ 0, is defined by the following equation.
$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N_x}\xi_i$$
(1)
where w is the normal vector of the separating hyperplane in the feature space, and C > 0 is a regularization parameter that controls the penalty for misclassification. Using Lagrange multipliers αi, the dual form of Equation (1) is obtained.
$$\max_{\alpha}\ \sum_{i=1}^{N_x}\alpha_i - \frac{1}{2}\sum_{i=1}^{N_x}\sum_{j=1}^{N_x}\alpha_i\alpha_j y_i y_j K(x_i, x_j)\quad \text{subject to}\quad \sum_{i=1}^{N_x}\alpha_i y_i = 0$$
(2)
where 0 ≤ αi ≤ C. This quadratic optimization problem can be solved efficiently with algorithms such as sequential minimal optimization. During optimization, many of the αi tend to zero. The remaining xi, for which αi > 0, are called support vectors.

For simplicity, it is assumed that all data that are not support vectors have been deleted. Therefore, Nx represents the number of support vectors, and αi > 0 for all i.

Consequently, the normal vector w of the separating hyperplane can be calculated as follows (Asefa et al. 2006):
$$w = \sum_{i=1}^{N_x}\alpha_i y_i \Phi(x_i)$$
(3)
Because Φ(xi) is defined only implicitly, w exists only in the feature space and cannot be calculated directly. Instead, the classification f(q) of a new sample vector q can only be obtained by evaluating the kernel function of q with each support vector (Equation (4)); the sign of f(q) gives the predicted class.
$$f(q) = \sum_{i=1}^{N_x}\alpha_i y_i K(q, x_i) + d$$
(4)

In this regard, the parameter d is the offset of the hyperplane along its normal vector (written b in the margin constraint above), which is determined during the training process.
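As a minimal sketch of the kernel-based classification described for Equation (4), the following Python snippet evaluates a decision function of the standard form f(q) = Σ αi yi K(q, xi) + d with a polynomial kernel. The support vectors, multipliers αi, labels, and offset d are made-up illustrative values, not fitted results from this study, and the kernel constant `coef0` is an assumption.

```python
def polynomial_kernel(u, v, degree=3, coef0=1.0):
    """Polynomial kernel K(u, v) = (u . v + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (dot + coef0) ** degree


def decision_function(q, support_vectors, alphas, labels, d,
                      kernel=polynomial_kernel):
    """Evaluate f(q) as a weighted sum of kernel values plus the offset d."""
    weighted = sum(a * y * kernel(q, x)
                   for a, y, x in zip(alphas, labels, support_vectors))
    return weighted + d


# Hypothetical trained model: two support vectors with opposite labels.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.5, 0.5]   # satisfies sum(alpha_i * y_i) = 0
labels = [1, -1]
d = 0.0

score_a = decision_function((2.0, 0.0), support_vectors, alphas, labels, d)
score_b = decision_function((0.0, 2.0), support_vectors, alphas, labels, d)
class_a = 1 if score_a >= 0 else -1
class_b = 1 if score_b >= 0 else -1
```

Note that the normal vector w never appears: only kernel evaluations against the support vectors are needed, which is the point made in the text.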

The proper selection of the kernel function is a crucial step in determining the performance of support vector machines for the most accurate classification (Bray & Han 2004). This function is chosen based on the characteristics of the data being analyzed. There are both linear and nonlinear kernels. A polynomial kernel of degree three and a linear kernel have been utilized in this study. In the support vector machine method, the model must first be trained using training data, as with any supervised learning model. After training, the system is capable of classifying new samples. Obtaining the most accurate model requires trying a variety of kernels and choosing suitable parameters for each kernel. Figure 3 shows the SVM model's structure.
Figure 3

Structure of the SVM model.


Gene expression programming (GEP) with GeneXproTools

Following the evolution of intelligent models, gene expression programming (GEP) has emerged as an evolutionary computation method (Kargar et al. 2019). It is inspired by natural evolution. The advantage of GEP over other methods is that only the building blocks (input variables, target, and set of functions) are defined first. The optimal model structure and coefficients are then determined during the evolutionary process. In other words, this method can optimize both the model structure and its components. The five-step process for solving a problem using GEP is as follows:

  • Select the terminal set, which consists of the problem's independent variables and the system state variables. This step also involves selecting the fitness function, which typically employs the root mean square error (RMSE).

  • Select a set of functions, including mathematical operators, test functions, and Boolean functions.

  • Select the model accuracy index, which measures how well the model solves the specific problem.

  • Set the control components: program execution is controlled by numerical parameter values and qualitative variables.

  • Adjust the remaining settings: the number of data in the training and test sections, the chromosome configuration (the size of the head and the number of genes), and the choice of the linking function, which can be addition, subtraction, multiplication, or division.

GeneXproTools was used in the present study to develop and implement the GEP models. The structure of the GEP model is shown in Figure 4.
Figure 4

Structure of the GEP model.
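To make the role of the terminal set, function set, and fitness function more concrete, the sketch below decodes and evaluates a single GEP gene written in Karva notation, the breadth-first encoding commonly used in GEP, and scores it with RMSE. It is an illustrative pure-Python sketch, not the GeneXproTools or geppy implementation; the symbols `a` and `b`, the example gene, and the sample data are hypothetical.

```python
import math
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is zero."""
    return a / b if b != 0 else 1.0


# Function set: symbol -> (callable, number of arguments).
FUNCTIONS = {
    "+": (operator.add, 2),
    "-": (operator.sub, 2),
    "*": (operator.mul, 2),
    "/": (protected_div, 2),
}


def evaluate_gene(gene, terminals):
    """Evaluate a gene (list of symbols in breadth-first order)."""
    arity = [FUNCTIONS[s][1] if s in FUNCTIONS else 0 for s in gene]
    # Find how many leading symbols are actually expressed.
    n_used, i = 1, 0
    while i < n_used:
        n_used += arity[i]
        i += 1
    # Assign children to each node in breadth-first order.
    children, next_child = [], 1
    for i in range(n_used):
        children.append(range(next_child, next_child + arity[i]))
        next_child += arity[i]
    # Evaluate from the last expressed symbol back to the root.
    values = [0.0] * n_used
    for i in reversed(range(n_used)):
        symbol = gene[i]
        if symbol in FUNCTIONS:
            func = FUNCTIONS[symbol][0]
            values[i] = func(*(values[c] for c in children[i]))
        else:
            values[i] = terminals[symbol]
    return values[0]


def rmse_fitness(gene, samples):
    """RMSE of the gene's output over (terminals, target) samples."""
    errors = [(evaluate_gene(gene, t) - y) ** 2 for t, y in samples]
    return math.sqrt(sum(errors) / len(errors))


# Head of length 3 and tail of length 4; encodes (a * b) + a.
gene = ["+", "*", "a", "a", "b", "b", "a"]
result = evaluate_gene(gene, {"a": 2.0, "b": 3.0})
samples = [({"a": 1.0, "b": 1.0}, 2.0), ({"a": 2.0, "b": 3.0}, 8.0)]
fitness = rmse_fitness(gene, samples)
```

In a full GEP run, a population of such genes would be mutated and recombined, with the RMSE fitness deciding which candidates survive; the unused tail symbols are what keep every mutated gene decodable.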


Wavelet transform

The wavelet transform is an advanced form of the Fourier series. It complements the Fourier series by dealing with the signal more effectively. The transform can extract time and frequency information from a signal that varies with time. The primary objective of signal processing is to extract information about a phenomenon by comparing data collected from its behavior with a set of known functions (Nourani et al. 2019b). For a continuous signal, this comparison is performed using the inner product of the signal x(t) and the functions Φn(t).
$$\langle x, \Phi_n\rangle = \int_{-\infty}^{\infty} x(t)\,\Phi_n^{*}(t)\,dt$$
(5)
where the character ∗ represents the complex conjugate, and the inner product indicates the similarity of the signal x(t) and the function Φn(t). The greater the similarity of the signal x(t) to the function Φn(t), the greater the result of the inner product. If the functions Φn(t) are orthogonal, the signal x(t) is retrieved as follows:
$$x(t) = \sum_{n}\langle x, \Phi_n\rangle\,\Phi_n(t)$$
(6)
where Φn(t) are orthogonal basis functions with ⟨Φn, Φm⟩ = δnm for all n and m. Here, δnm is the Kronecker delta function. Equation (6) is called the inverse transform.
Unlike the Fourier transform, the wavelet transform uses a different method to decompose the signal into its components. The wavelet transform (WT) of the signal x(t) is defined by the inner product.
$$WT(\tau, b) = \int_{-\infty}^{\infty} x(t)\, g_{\tau b}^{*}(t)\, dt$$
(7)
The character ∗ represents the complex conjugate, and gτb(t) is obtained from the mother wavelet, g(t).
$$g_{\tau b}(t) = \frac{1}{\sqrt{b}}\, g\!\left(\frac{t-\tau}{b}\right)$$
(8)
where τ and b are the translation and scale (dilation) parameters, respectively. The wavelet coefficient WT(τ,b) indicates the degree of correlation between the signal and the local part of the wavelet. Large values of WT(τ,b) indicate that the signal under study has a major component at the frequency belonging to the given scale. In this case, the scaled wavelet will be similar to the signal under study. By changing the values of the parameters τ and b, the analysis can be shifted in time and scale. The mother wavelet should have a mean of zero. As a result, the original signal can be calculated from the wavelet transform coefficients WT(τ,b) by inverting the wavelet transform as follows.
$$x(t) = \frac{1}{C_g}\int_{-\infty}^{\infty}\int_{0}^{\infty} WT(\tau, b)\, g_{\tau b}(t)\, \frac{db\, d\tau}{b^{2}}$$
(9)
where,
$$C_g = \int_{-\infty}^{\infty}\frac{\lvert G(\omega)\rvert^{2}}{\lvert\omega\rvert}\, d\omega < \infty$$
(10)

G(ω) is the Fourier transform of g(t). This condition is known as the admissibility condition. The invertibility of the transform and the ability to reconstruct the signal x(t) from its continuous wavelet transform depend on this condition. Once the base wavelet is selected, it can be used to analyze the data.

Developed WSVM and WGEP models

The wavelet transform can decompose data or a given function into various frequency components and is used to interpret each of these components according to the resolution corresponding to its scale. Unlike frequency spectrum analysis, wavelet analysis includes time information. Thus, it can be used to analyze a spectrum's temporal characteristics and facilitates the analysis of transient and irregular wave states. The wavelet transform was used in this study to decompose the main rainfall-runoff time series into sub-time series components. The sub-time series were then used as inputs to build SVM and GEP models with the original input time series’ details and approximations. The measured Yellow River discharge time series were transformed into several multi-frequency time series. The data was used to predict rainfall-runoff using SVM or GEP after decomposing the rainfall-runoff time series into different levels. Because each of these series plays a unique role in the primary time series, their contributions to the primary time series differ. Figure 5 summarizes the WSVM and WGEP models.
Figure 5

Structure of the proposed WSVM and WGEP (I represents the rainfall, R represents the runoff and di represents the decomposed series).
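The decomposition of a series into an approximation and several detail sub-series can be illustrated with a small discrete wavelet transform. The sketch below uses the Haar wavelet purely as an assumption, because the mother wavelet and decomposition level used in the study are not stated in this section; the function names and the sample series are hypothetical.

```python
import math


def haar_step(series):
    """One level of the Haar transform: returns (approximation, detail)."""
    if len(series) % 2:
        raise ValueError("series length must be even")
    s = math.sqrt(2.0)
    pairs = [(series[i], series[i + 1]) for i in range(0, len(series), 2)]
    approx = [(x0 + x1) / s for x0, x1 in pairs]
    detail = [(x0 - x1) / s for x0, x1 in pairs]
    return approx, detail


def haar_decompose(series, levels):
    """Return [a_L, d_L, ..., d_1] for an L-level decomposition."""
    details = []
    approx = list(series)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return [approx] + details[::-1]


def haar_inverse_step(approx, detail):
    """Invert one Haar level."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def haar_reconstruct(coeffs):
    """Rebuild the original series from [a_L, d_L, ..., d_1]."""
    approx = coeffs[0]
    for detail in coeffs[1:]:
        approx = haar_inverse_step(approx, detail)
    return approx


# Hypothetical discharge values, for illustration only.
series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coeffs = haar_decompose(series, levels=2)
rebuilt = haar_reconstruct(coeffs)
```

In a hybrid model of the kind shown in Figure 5, each sub-series in `coeffs` would serve as a candidate input to the SVM or GEP model in place of the raw series.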


Efficiency criteria

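In order to compare and evaluate the performance of the studied models, the Nash-Sutcliffe efficiency (NSE), root mean square error (RMSE), and normalized root mean square error (NRMSE) have been used, together with the mean absolute percentage error (MAPE) and the coefficient of determination (R2) reported in Tables 1–4.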
formula
(11)
formula
(12)
formula
(13)
formula
(14)
formula
(15)
where N is the number of data, Oi and Pi are the observed and simulated values, respectively, and Ō and P̄ are the averages of the observed and simulated values.
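The efficiency criteria can be computed as in the following sketch, which uses the commonly cited textbook definitions of each measure. Two points are assumptions rather than statements from the study: NRMSE is normalized here by the range of the observed values (normalization by the mean is also common), and R2 is taken as the squared Pearson correlation between observed and simulated values. The sample data are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations."""
    o_bar = mean(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, sim))
    den = sum((o - o_bar) ** 2 for o in obs)
    return 1.0 - num / den


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, sim)) / len(obs))


def nrmse(obs, sim):
    """RMSE normalized by the observed range (an assumption)."""
    return rmse(obs, sim) / (max(obs) - min(obs))


def mape(obs, sim):
    """Mean absolute percentage error, in percent."""
    return 100.0 / len(obs) * sum(abs((o - p) / o) for o, p in zip(obs, sim))


def r_squared(obs, sim):
    """Squared Pearson correlation between observed and simulated."""
    o_bar, p_bar = mean(obs), mean(sim)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, sim))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in sim)
    return cov ** 2 / (var_o * var_p)


# Hypothetical observed and simulated discharges.
observed = [2.0, 4.0, 6.0, 8.0]
simulated = [2.5, 3.5, 6.5, 7.5]
scores = {
    "NSE": nse(observed, simulated),
    "RMSE": rmse(observed, simulated),
    "NRMSE": nrmse(observed, simulated),
    "MAPE": mape(observed, simulated),
    "R2": r_squared(observed, simulated),
}
```

An NSE of 1 indicates a perfect match, while an NSE of 0 means the model is no better than predicting the observed mean.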

Analysis of variance (ANOVA)

Analysis of variance (ANOVA) is one of the most widely used techniques in hypothesis testing and statistical research. This method is used to investigate the differences between several statistical samples (Kumar Karn & Harada 2002). By quantifying the dispersion of the data, this approach allows the variation among various groups to be examined and the equality of their means to be assessed. In regression models, the appropriateness of the model can also be determined by comparing the total variance to the model variance and error variance. The test of multiple means uses analysis of variance to divide the total variation into variation between and within groups. If the total scatter is denoted by the total sum of squares (SST), the between-group scatter by the between sum of squares (SSB), and the within-group scatter by the within sum of squared errors (SSE), Equation (16) is as follows:
$$SST = SSB + SSE$$
(16)
The components of Equation (16) are calculated as follows:
$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{x}\right)^{2}$$
(17)
$$SSB = \sum_{j=1}^{k} n_j\left(\bar{x}_j - \bar{x}\right)^{2}$$
(18)
$$SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij} - \bar{x}_j\right)^{2}$$
(19)
where xij is the individual observation, x̄j is the sample mean of the jth treatment or group, x̄ is the overall sample mean, k is the number of treatments or independent comparison groups, and nj is the number of observations in the jth group.
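The decomposition of the total sum of squares into between-group and within-group parts can be checked numerically. The sketch below is a minimal one-way example using the standard definitions of SST, SSB, SSE, and the F statistic; the function name `one_way_anova` and the three groups of values are hypothetical, and the study's own two-way analysis is not reproduced here.

```python
def one_way_anova(groups):
    """Return (SST, SSB, SSE, F) for a list of groups of observations."""
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Total scatter around the overall mean.
    sst = sum((x - grand_mean) ** 2 for x in all_values)
    # Between-group scatter, weighted by group size.
    ssb = sum(len(group) * (m - grand_mean) ** 2
              for group, m in zip(groups, group_means))
    # Within-group scatter around each group mean.
    sse = sum((x - m) ** 2
              for group, m in zip(groups, group_means) for x in group)

    # F statistic: ratio of between-group to within-group mean squares.
    f_value = (ssb / (k - 1)) / (sse / (n_total - k))
    return sst, ssb, sse, f_value


# Hypothetical data: three groups of three observations each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
sst, ssb, sse, f_value = one_way_anova(groups)
```

The P-value then follows from comparing the F statistic with the F distribution with k − 1 and N − k degrees of freedom, which is omitted here to keep the example free of external dependencies.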

When a single factor influences the populations, one-way ANOVA is used to build the analysis of variance table for the test comparing the means of multiple populations. If more than one factor is involved, the analysis of variance model becomes more complex and can also reveal the effect of each factor on the others. In that case, the sources of dispersion are distinguished based on two variables and the interaction between them, and the variance analysis table is referred to as two-way ANOVA. We considered calculated probabilities of <0.05 to be significant and <0.001 to be highly significant.

In this section, the results of each method are presented separately. These results are based on the simulated values and the values observed at the Tang-Nai-Hai hydrometric station from 2000 to 2020. The final step is to compare the results of the methods with each other. In this study, 70% of the data is used for training, and 30% is used for testing. The goal is to predict the monthly runoff from observational data covering 1950 to 2020. Using the scikit-learn library, the train and test states were selected randomly. The GEP algorithm was implemented with geppy, an evolutionary algorithm framework developed in Python.
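A random 70/30 division of the samples can be sketched as follows. The study reports using scikit-learn for this step; the snippet below is only a dependency-free stand-in that illustrates the idea, and the function name `random_split`, the seed, and the placeholder samples are hypothetical.

```python
import random


def random_split(samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into train and test sets."""
    indices = list(range(len(samples)))
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


# Placeholder samples standing in for (inputs, target) pairs.
samples = list(range(20))
train, test = random_split(samples)
```

Fixing the seed matters when several models are compared, because every model should be trained and tested on the same partition of the data.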

Results of the SVM method

Table 1 shows the efficiency criteria and two-way ANOVA test values for the train and test states in the SVM method. According to the table, the SVM method was more accurate on the non-rainy-day data (higher R2 values). In addition, the train state is more accurate than the test state. The ANOVA test revealed that the results of the SVM method are highly significant in the train state for non-rainy days (P-value < 0.001).

Table 1

Efficiency criteria and two-way ANOVA test of SVM method in train and test states

| Data type | State | NSE | NRMSE | MAPE | R2 | F-value (ANOVA) | P-value (ANOVA) |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.72 | 0.06 | 3.61 | 0.79 | 3.49 | 0.74 |
| Rainy days | Test | 0.65 | 0.12 | 3.29 | 0.74 | 4.38 | 0.8 |
| Non-rainy days | Train | 0.77 | 0.03 | 2.37 | 0.82 | 2.35 | <0.001 |
| Non-rainy days | Test | 0.71 | 0.07 | 3.19 | 0.79 | 2.07 | 0.99 |

Figure 6 compares the observed and simulated hydrographs for the train and test states using the SVM approach.
Figure 6

Time series of observed and simulated discharge for SVM method. (a) Train state. (b) One-year random sample for train state. (c) Test state. (d) One-year random sample for test state.

Figure 6 shows that data from 1950 to 2000 were used for the train state, while data from 2000 to 2020 were used for the test state. The results for 1995–1996 and 2015–2016 are presented separately to make the predictions in the train and test states easier to assess. According to these findings, the SVM model is less accurate at predicting extreme points. Scatter plots for observed and simulated results in train and test states for the SVM method are shown in Figure 7.
Figure 7

Scatter plot of observed and simulated results for SVM method. (a) Train state. (b) Test state.


Given that the train state encompasses the years 1950–2000, it has more points than the test state. It is evident from the scatterplots that the train state is more accurate than the test state.

Results of the GEP method

The values of efficiency criteria and two-way ANOVA test for the train and test states in the GEP method are presented in Table 2. The accuracy of the GEP model in the train state is greater than in the test state, as shown in Table 2. This is likely due to allocating 70% of the data to the train state. In addition, the GEP model is more accurate at predicting runoff on non-rainy days than on rainy days because the runoff from the basin during rainy days is a transient stream with a high discharge, whereas, during non-rainy days, it is a steady state with a low discharge.

Table 2

Efficiency criteria and two-way ANOVA test of GEP method in train and test states

| Data type | State | NSE | NRMSE | MAPE | R2 | F-value (ANOVA) | P-value (ANOVA) |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.85 | 0.03 | 2.06 | 0.89 | 2.11 | 1.00 |
| Rainy days | Test | 0.72 | 0.10 | 3.89 | 0.78 | 1.78 | 1.00 |
| Non-rainy days | Train | 0.89 | 0.01 | 1.59 | 0.91 | 2.33 | 0.35 |
| Non-rainy days | Test | 0.77 | 0.06 | 1.26 | 0.82 | 2.16 | 0.17 |

Figure 8 compares the observed and simulated hydrographs for the train and test states using the GEP approach.
Figure 8

Time series of observed and simulated discharge for GEP method. (a) Train state. (b) One-year random sample for train state. (c) Test state. (d) One-year random sample for test state.

According to the results, the GEP model is less accurate at predicting relative extreme points. Scatter plots for observed and simulated results in train and test states for the GEP method are shown in Figure 9.
Figure 9

Scatter plot of observed and simulated results for GEP method. (a) Train state. (b) Test state.


The results show that the train state is more accurate than the test state.

Results of WSVM method

The values of efficiency criteria and two-way ANOVA test for the train and test states in the WSVM method are presented in Table 3. According to Table 3, the accuracy is greater on non-rainy days than on rainy days. Furthermore, the train state is more accurate than the test state. The ANOVA test revealed that the WSVM method is highly significant on non-rainy days (P-value < 0.001).

Table 3

Efficiency criteria and two-way ANOVA test of WSVM method in train and test states

| Data type | State | NSE | NRMSE | MAPE | R2 | F-value (ANOVA) | P-value (ANOVA) |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.83 | 0.03 | 0.57 | 0.88 | 2.69 | 0.011 |
| Rainy days | Test | 0.78 | 0.08 | 0.65 | 0.83 | 2.07 | 0.034 |
| Non-rainy days | Train | 0.94 | 0.009 | 0.40 | 0.96 | 1.19 | <0.001 |
| Non-rainy days | Test | 0.88 | 0.05 | 0.32 | 0.91 | 2.16 | <0.001 |

Figure 10 compares the observed and simulated hydrographs for the train and test states using the WSVM approach.
Figure 10

Time series of observed and simulated discharge for WSVM method. (a) Train state. (b) One-year random sample for train state. (c) Test state. (d) One-year random sample for test state.

Figure 10 shows that the WSVM model is less accurate at predicting extreme points. It also shows that the simulated and observed hydrographs are closer in the train state, which means that the train state is more accurate than the test state. Scatter plots for observed and simulated results in train and test states for the WSVM method are shown in Figure 11.
Figure 11

Scatter plot of observed and simulated results for WSVM method. (a) Train state. (b) Test state.


The scatterplots show that the train state is more accurate than the test state.

Results of the WGEP method

The values of efficiency criteria and two-way ANOVA test for the train and test states in the WGEP method are presented in Table 4. According to Table 4, the model is more accurate on non-rainy days than on rainy days. Results from the ANOVA test show that the WGEP model is highly significant on non-rainy days (P-value < 0.001).

Table 4

Efficiency criteria and two-way ANOVA test of WGEP method in train and test states

| Data type | State | NSE | NRMSE | MAPE | R2 | F-value (ANOVA) | P-value (ANOVA) |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.95 | 0.008 | 1.74 | 0.97 | 1.04 | 0.006 |
| Rainy days | Test | 0.89 | 0.04 | 2.18 | 0.92 | 1.28 | 0.021 |
| Non-rainy days | Train | 0.98 | 0.006 | 0.31 | 0.99 | 0.70 | <0.001 |
| Non-rainy days | Test | 0.93 | 0.03 | 0.34 | 0.95 | 0.96 | <0.001 |

Figure 12 compares the observed and simulated hydrographs for the train and test states using the WGEP approach.
Figure 12

Time series of observed and simulated discharge for WGEP method. (a) Train state. (b) One-year random sample for train state. (c) Test state. (d) One-year random sample for test state.

Figure 12 shows the train state, based on data from 1950 to 2000, and the test state, based on data from 2000 to 2020. The results for 1995–1996 and 2015–2016 are shown separately to illustrate in more detail how well the train and test states are predicted. The simulated and observed lines are more consistent in the train state, so the train state is more accurate than the test state. Figure 13 displays scatter plots of the observed and simulated results for the WGEP approach in the train and test states.
Figure 13

Scatter plot of observed and simulated results for WGEP method. (a) Train state. (b) Test state.


Because it covers the years 1950–2000, the train state contains more data points than the test state. The scatter plots make it clear that the train state provides more accurate results than the test state.

According to the hydrograph diagrams for each method, there is a good match between the simulated and observed results. All models investigated in this paper estimated extreme points less accurately than other points. The results showed that the GEP model is more accurate than the SVM model, with NSEs of 0.89 and 0.77 for non-rainy days and 0.85 and 0.72 for rainy days, respectively. The use of hybrid methods also improved simulation accuracy. A comparison of the WSVM and WGEP models showed that the WGEP model is more accurate than the WSVM model, with NSEs of 0.98 and 0.94 for non-rainy days and 0.95 and 0.83 for rainy days, respectively.
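
The comparison above can be reproduced with simple arithmetic on the reported NSE values. The short Python sketch below uses only the figures quoted in the text; the helper names `wavelet_gain` and `ranking` are hypothetical and introduced purely for illustration.

```python
# NSE values as reported in the text for each method and data type.
reported_nse = {
    "non-rainy": {"SVM": 0.77, "GEP": 0.89, "WSVM": 0.94, "WGEP": 0.98},
    "rainy": {"SVM": 0.72, "GEP": 0.85, "WSVM": 0.83, "WGEP": 0.95},
}

# Each base method and its wavelet-based hybrid.
hybrid_of = {"SVM": "WSVM", "GEP": "WGEP"}


def wavelet_gain(nse_by_method):
    """NSE increase of each hybrid over its base method, rounded to 2 decimals."""
    return {
        base: round(nse_by_method[hybrid] - nse_by_method[base], 2)
        for base, hybrid in hybrid_of.items()
    }


def ranking(nse_by_method):
    """Method names sorted from highest to lowest NSE."""
    return sorted(nse_by_method, key=nse_by_method.get, reverse=True)


for data_type, nse in reported_nse.items():
    print(data_type, "gain:", wavelet_gain(nse), "ranking:", ranking(nse))
```

With these figures, the wavelet hybrid raises the NSE of both base methods for both data types, and WGEP ranks first in each case.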

According to the scatter plots, combining the SVM and GEP methods with wavelet resulted in closer and more strongly correlated simulated and observed results. The results obtained in the train state also have a higher correlation than those obtained in the test state. In addition, the scatter plot for each method revealed that the maximum points of the train state were greater than the maximum points of the test state. The ANOVA test revealed that the results of the WSVM and WGEP methods were highly significant on non-rainy days (P-value < 0.001). The SVM method is also highly significant in the train state and on non-rainy days (P-value < 0.001). Finally, model accuracy is better on non-rainy days than on rainy days, as shown by the higher R-squared values.
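
To illustrate what wavelet pre-processing does to a discharge series before it is passed to a model such as SVM or GEP, the following is a minimal sketch of a one-level discrete wavelet transform. The Haar wavelet and the single decomposition level are assumptions chosen for brevity, since this section does not state which mother wavelet or level was used, and the sample series is hypothetical.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = range(0, len(signal), 2)
    # Approximation: scaled pairwise sums (low-frequency trend).
    approx = [(signal[i] + signal[i + 1]) / scale for i in pairs]
    # Detail: scaled pairwise differences (high-frequency fluctuation).
    detail = [(signal[i] - signal[i + 1]) / scale for i in pairs]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original series."""
    scale = math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / scale)
        signal.append((a - d) / scale)
    return signal


# Hypothetical discharge values, for illustration only.
discharge = [4.0, 6.0, 10.0, 12.0]
approx, detail = haar_dwt(discharge)
restored = haar_idwt(approx, detail)
print("approximation:", approx)
print("detail:", detail)
print("restored:", restored)
```

In a hybrid model, the approximation and detail sub-series would typically serve as separate model inputs in place of the raw series.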

The findings suggest that continuous rainfall-runoff modeling in catchments with data-based models should use a two-criteria model that treats rainy and non-rainy days separately. Future studies could compare the one-criterion and two-criteria models. The accuracy of other runoff prediction methods and models could also be evaluated and compared with the results of the current study.
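
A two-criteria evaluation of this kind can be sketched as follows: the series is partitioned into rainy and non-rainy samples, and the NSE is computed for each group separately. The precipitation threshold of zero used to define a rainy sample is an assumption, the function names are hypothetical, and the data are invented for illustration.

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency of a simulated series against observations."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den


def split_by_rain(precip, observed, simulated, threshold=0.0):
    """Group (observed, simulated) pairs into rainy and non-rainy samples.

    Assumption: a sample is rainy when precipitation exceeds `threshold`.
    """
    groups = {"rainy": ([], []), "non-rainy": ([], [])}
    for p, o, s in zip(precip, observed, simulated):
        key = "rainy" if p > threshold else "non-rainy"
        groups[key][0].append(o)
        groups[key][1].append(s)
    return groups


# Hypothetical precipitation and discharge values, for illustration only.
precip = [0.0, 5.0, 0.0, 12.0, 0.0, 3.0]
observed = [10.0, 30.0, 12.0, 50.0, 11.0, 25.0]
simulated = [10.5, 28.0, 12.0, 45.0, 11.0, 27.0]

groups = split_by_rain(precip, observed, simulated)
nse_by_type = {key: nse(obs, sim) for key, (obs, sim) in groups.items()}
for key, value in nse_by_type.items():
    print(f"{key}: NSE = {value:.3f}")
```

A one-criterion evaluation would instead call `nse(observed, simulated)` on the full series, which is the comparison suggested for future work.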

Machine learning is now used extensively in many scientific fields, including water management. This study compares the accuracy of the SVM, GEP, WSVM, and WGEP methods for estimating the hydrograph time series in the Yellow River Basin. In addition to other required study area parameters, monthly precipitation, temperature, and discharge values from 1950 to 2020 are considered. The period 1950–2000 is used as the train state and the period 2000–2020 as the test state. The results showed that the GEP model is more accurate than the SVM model, with NSEs of 0.89 and 0.77 for non-rainy days and 0.85 and 0.72 for rainy days, respectively. The use of hybrid methods also improved simulation accuracy. A comparison of the WSVM and WGEP models showed that the WGEP model is more accurate than the WSVM model, with NSEs of 0.98 and 0.94 for non-rainy days and 0.95 and 0.83 for rainy days, respectively. Considering the efficiency criteria tables and the observed and simulated hydrographs, it can be concluded that the WGEP method yields the most accurate results and the SVM method the least accurate. A comparison of the observed data from the Tang-Nai-Hai hydrometric station with the simulated results revealed an acceptable level of agreement.

Certain limitations remain to be addressed in future studies. First, more data from various basins could be used to evaluate the efficiency of the proposed methods. Second, the rainfall-runoff process is affected by many aspects of human society and the natural environment, and it therefore exhibits complex, dynamic, and evolving behavior. Hence, a single prediction model may fail to provide satisfactory outcomes in certain cases.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Afan H. A., Allawi M. F., El-Shafie A., Yaseen Z. M., Ahmed A. N., Malek M. A., Koting S. B., Salih S. Q., Mohtar W. H. M. W. & Lai S. H. 2020 Input attributes optimization using the feasibility of genetic nature inspired algorithm: application of river flow forecasting. Scientific Reports 10 (1), 1–15.
Ahmadi M. H., Mohseni-Gharyehsafa B., Ghazvini M., Goodarzi M., Jilte R. D. & Kumar R. 2020 Comparing various machine learning approaches in modeling the dynamic viscosity of CuO/water nanofluid. Journal of Thermal Analysis and Calorimetry 139 (4), 2585–2599.
Alizadeh F., Gharamaleki A. F., Jalilzadeh M. & Akhoundzadeh A. 2020 Prediction of river stage-discharge process based on a conceptual model using EEMD-WT-LSSVM approach. Water Resources 47 (1), 41–53.
Asefa T., Kemblowski M., McKee M. & Khalil A. 2006 Multi-time scale stream flow predictions: The support vector machines approach. Journal of Hydrology 318 (1–4), 7–16.
Bray M. & Han D. 2004 Identification of support vector machines for runoff modelling. Journal of Hydroinformatics 6 (4), 265–280.
Chen W., Zheng M., Gao Q., Deng C., Ma Y. & Ji G. 2021 Simulation of surface runoff control effect by permeable pavement. Water Science and Technology 83 (4), 948–960.
Ehteram M., Afan H. A., Dianatikhah M., Ahmed A. N., Ming Fai C., Hossain M. S., Allawi M. F. & Elshafie A. 2019 Assessing the predictability of an improved ANFIS model for monthly streamflow using lagged climate indices as predictors. Water 11 (6), 1130.
Hearst M. A., Dumais S. T., Osuna E., Platt J. & Scholkopf B. 1998 Support vector machines. IEEE Intelligent Systems and Their Applications 13 (4), 18–28.
Kargar K., Safari M. J. S., Mohammadi M. & Samadianfard S. 2019 Sediment transport modeling in open channels using neuro-fuzzy and gene expression programming techniques. Water Science and Technology 79 (12), 2318–2327.
Liu Q., Wang D., Zhang Y. & Wang L. 2021a Flood simulation analysis of the Biliu River Basin based on the MIKE model. Complexity. https://doi.org/10.1155/2021/8827046.
Liu Z., Li Q., Zhou J., Jiao W. & Wang X. 2021b Runoff prediction using a novel hybrid ANFIS model based on variable screening. Water Resources Management 35 (9), 2921–2940.
Molajou A., Nourani V., Afshar A., Khosravi M. & Brysiewicz A. 2021 Optimal design and feature selection by genetic algorithm for emotional artificial neural network (EANN) in rainfall-runoff modeling. Water Resources Management 35 (8), 2369–2384.
Nourani V., Molajou A., Najafi H. & Danandeh Mehr A. 2019a Emotional ANN (EANN): a new generation of neural networks for hydrological modeling in IoT. In: Al-Turjman, F. (ed.) Artificial Intelligence in IoT. Transactions on Computational Science and Computational Intelligence. Springer, Cham, Switzerland, pp. 45–61. https://doi.org/10.1007/978-3-030-04110-6_3.
Nourani V., Molajou A., Tajbakhsh A. D. & Najafi H. 2019b A wavelet based data mining technique for suspended sediment load modeling. Water Resources Management 33 (5), 1769–1784.
Peng J., Zhong X., Yu L. & Wang Q. 2020 Simulating rainfall runoff and assessing low impact development (LID) facilities in sponge airport. Water Science and Technology 82 (5), 918–926.
Ramezanizadeh M., Ahmadi M. H., Nazari M. A., Sadeghzadeh M. & Chen L. 2019 A review on the utilized machine learning approaches for modeling the dynamic viscosity of nanofluids. Renewable and Sustainable Energy Reviews 114, 109345.
Sharghi E., Nourani V., Najafi H. & Molajou A. 2018 Emotional ANN (EANN) and wavelet-ANN (WANN) approaches for Markovian and seasonal based modeling of rainfall-runoff process. Water Resources Management 32 (10), 3441–3456.
Yavari F., Salehi Neyshabouri S. A., Yazdi J., Molajou A. & Brysiewicz A. 2022 A novel framework for urban flood damage assessment. Water Resources Management 36 (6), 1991–2011.
Zhang L., Hu C., Jian S., Wu Q., Ran G. & Xu Y. 2021 Identifying dominant component of runoff yield processes: a case study in a sub-basin of the middle Yellow River. Hydrology Research 52 (5), 1033–1047.
Zhang W., Tao K., Sun H. & Che W. 2022 Influence of urban runoff pollutant first flush strength on bioretention pollutant removal performance. Water Science & Technology 86 (6), 1478–1495.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).