## Abstract

Scientific methods for collecting and regulating surface water are critical for supplying water while avoiding damage. Principled planning for optimal runoff use depends on accurate runoff prediction. In recent years, machine learning approaches have been used increasingly to model rainfall-runoff. This study evaluates the accuracy of four rainfall-runoff modeling approaches: support vector machine (SVM), gene expression programming (GEP), wavelet-SVM (WSVM), and wavelet-GEP (WGEP). The simulations are run in Python. The study area is the Yellow River Basin in central China, where the Tang-Nai-Hai hydrometric station in the west of the region was selected. The train state data range from 1950 to 2000, while the test state data range from 2000 to 2020. The analysis treats rainy and non-rainy days separately. The WGEP simulation performed best, with a Nash-Sutcliffe efficiency (NSE) of 0.98, followed by the WSVM, GEP, and SVM simulations, with NSEs of 0.94, 0.89, and 0.77, respectively. Combining the base methods with the wavelet transform therefore improved simulation accuracy, with the WGEP method being the most accurate.

## HIGHLIGHTS

The accuracy of machine learning based approaches for rainfall-runoff modeling is evaluated in the Yellow River Basin in China.

Wavelet transform as a pre-processing tool is combined with Gene Expression Programming and Support Vector Machine.

The combination of Wavelet and Gene Expression Programming (WGEP) performed best.

## INTRODUCTION

One of the most important ways for most of the world to get water is through runoff. Recent years have seen a rise in the importance of scientific techniques for gathering and managing surface water to provide water and prevent damage (Nourani *et al.* 2019a; Molajou *et al.* 2021; Yavari *et al.* 2022; Zhang *et al.* 2022). Therefore, it is critical to obtain a more precise estimate of rainfall-runoff in catchments. Knowing how much runoff catchments can naturally produce is one of the most crucial aspects of planning for optimal runoff. Therefore, in water science and engineering, it is essential to study hydrology and simulate rain and water runoff in catchments (Ehteram *et al.* 2019; Afshar *et al.* 2021; Chen *et al.* 2021). Making precise predictions of flow rate and how it changes throughout the year is also crucial for planning and managing surface water resources because only a finite amount of fresh water can be withdrawn. Therefore, it is crucial to investigate how well forecasting models function (Feng *et al.* 2020).

The primary objective of a simulation model is to predict the operation of a complex system and investigate the impact of modifications on the system's performance (Azizi & Nejatian 2022). Most hydrological prediction models are categorized as physical and mathematical models. In runoff simulation, mathematical models are categorized as theoretical, conceptual, and experimental models. Machine learning has recently expanded into various scientific fields, including water resources (Alizadeh *et al.* 2020). Machine learning is the scientific study of algorithms and statistical models used by computer systems that rely on patterns and inferences rather than explicit instructions to perform tasks (Liu *et al.* 2021a). There is a wide range of machine learning research that can be conducted. Theoretically, researchers aim to develop new learning methods and assess their viability and quality. Alternatively, some researchers attempt to apply machine learning techniques to new problems. This spectrum is not discrete, and both approaches are present in the researchers’ work. Machine learning reduces operational expenses and accelerates data analysis (Ramezanizadeh *et al.* 2019; Ahmadi *et al.* 2020). The performance of an algorithm for machine learning can be evaluated by separating data into train and test sets (Afan *et al.* 2020). Data segmentation can be applied to classification and regression problems. This method applies to any supervised learning algorithm in general. The data separation procedure divides a data set into two or three subsets (Liu *et al.* 2021b). The significance of data in machine learning is attributable to two factors: (a) the model requires data to learn, and (b) the model must be measured using data because it may not have been able to extract the data's information effectively. The selection of appropriate and effective input variables is an important and influential step in ensuring the accuracy of models based on data processing. 
Various techniques for determining model input variables rely on data processing (Liu *et al.* 2021a; Molajou *et al.* 2021).

Rainfall-runoff models can be categorized according to how they express the process or solve the governing equations (Peng *et al.* 2020). Some characteristics and properties of hydrological variables cannot be observed in the time dimension; therefore, they can be observed and evaluated by transferring them to other spaces (such as Fourier series, frequency, wavelet, Z transform, Laplace, etc.). Therefore, studying them in spaces other than time will allow for a more thorough and accurate investigation. The Fourier transform is one of these mathematical transformations; it is a mathematical operator that produces multiple representations of a signal. Despite its many capabilities, the Fourier transform has limitations, including the inability to display the signal in the frequency and time domains, particularly for unstable signals, and spectral analysis of the signal (Komasi & Sharghi 2016; Sharghi *et al.* 2018). Therefore, the wavelet transform is highly effective for dealing with non-stationary time series (signals). Using a collection of wavelet functions, the wavelet transform decomposes the time series into a series of coefficients. These functions are generated by relocating a fundamental wavelet function known as the mother wavelet or analyzing wavelet.

Wavelets’ simultaneous localisation in the time and frequency domains is one of their key benefits. The use of fast wavelet transform makes wavelets computationally very efficient, which is their second major benefit. Wavelets have the great advantage of being able to separate the fine details in a signal. Despite being a robust tool for image and signal processing, the wavelet transform has three significant drawbacks: shift sensitivity, poor directionality, and lack of phase information.

As a novel strategy, the current study investigates the accuracy of rainfall-runoff modeling approaches such as support vector machine (SVM), gene expression programming (GEP), wavelet-SVM (WSVM), and wavelet-GEP (WGEP). Python is used to run the simulations.

## MATERIALS AND METHODS

### Study area

*et al.* 2021). From 1950 to 2020, monthly discharge data for the Yellow River were obtained from the Tang-Nai-Hai hydrometric station.

### Simulation of rainfall-runoff process

Because the number of rainy days in each year is far smaller than the number of non-rainy days, runoff from the basin is produced by two different mechanisms. During rainfall and for a few days afterwards, runoff is mainly in the form of floods with high discharge and low continuity. On most days of the year, when there is no rain, the outflow is base flow with low flow rates and high continuity. Therefore, this study presents a two-criteria rainfall-runoff model consisting of one model for rainy days and one model for non-rainy days. Based on the average annual rainfall and the average rainfall on each rainy day in the study region, days with a total precipitation of more than 2 mm were classified as rainy days, while days with a total precipitation of 2 mm or less were classified as non-rainy days. The data from the hydrometric station were categorized accordingly. On rainy days, when the flow is predominantly flood flow, the variables that affect the flow rate at the hydrometric station are the current day's precipitation, the previous day's precipitation, and the precipitation from two days earlier. Owing to the study area's relatively short concentration time and rapid response, the previous day's precipitation affects the flow at the hydrometric station more strongly than the precipitation from two days earlier. On non-rainy days, when the flow at the hydrometric station is base flow, the input variables are the flow rates at the station one and two days earlier.
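The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `build_samples`, the list-based data layout, and the sample values are all hypothetical; only the 2 mm threshold and the choice of lagged inputs come from the text.

```python
RAIN_THRESHOLD_MM = 2.0  # rainy-day threshold stated in the text


def build_samples(precip, flow, threshold=RAIN_THRESHOLD_MM):
    """Split time steps into rainy and non-rainy samples with lagged inputs.

    precip and flow are equal-length series (hypothetical data layout).
    Rainy inputs:     P(t), P(t-1), P(t-2)
    Non-rainy inputs: Q(t-1), Q(t-2)
    The target in both cases is Q(t).
    """
    if len(precip) != len(flow):
        raise ValueError("precip and flow must have the same length")
    rainy, non_rainy = [], []
    for t in range(2, len(flow)):  # two lags are needed, so start at t = 2
        if precip[t] > threshold:
            inputs = (precip[t], precip[t - 1], precip[t - 2])
            rainy.append((inputs, flow[t]))
        else:
            inputs = (flow[t - 1], flow[t - 2])
            non_rainy.append((inputs, flow[t]))
    return rainy, non_rainy


# Made-up values, for illustration only.
precip = [0.0, 1.5, 6.0, 2.0, 0.0, 3.2]
flow = [10.0, 10.2, 18.0, 14.0, 11.0, 13.5]
rainy, non_rainy = build_samples(precip, flow)
```

Each subset can then be passed to its own model, which is the sense in which the model is described as two-criteria.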

In recent years, the field of water resources management has been confronted with significant obstacles due to climate change and changes in land use. Consequently, it is required to understand the rainfall-runoff amount and its changes over time. Understanding each mathematical model's benefits, drawbacks, and accuracy is crucial for accurately predicting the amount of rainfall-runoff. This study aims to evaluate the predictive accuracy of models SVM, GEP, WSVM, and WGEP.

In the current study, SVM and GEP methods were utilized in conjunction with wavelet transform. Therefore, it appears necessary to examine the theoretical foundations of these methods.

### Support vector machine (SVM)

The SVM is popular in modeling studies because of its advantages over the artificial neural network (ANN). It appears to be a powerful alternative that can overcome some of the basic weaknesses of ANNs while retaining their strengths. The SVM model can also give more accurate predictions, particularly when the data set is not distributed uniformly (Ayubi Rad & Ayubirad 2017). SVM was developed as a binary classification method with supervised learning. It can be applied to classification problems (where the output is a class label) and regression problems (where the output is a numerical value). The method classifies data by locating the optimal hyperplane that separates all data in one class from those in the other (Hearst *et al.* 1998). The optimal hyperplane is the one with the largest margin between the two classes. The margin is the widest band parallel to the separating hyperplane that contains no data points. The points closest to the separating hyperplane are the support vectors.

The vectors $x_i$, for $1 \le i \le N_x$, are taken as the inputs, and the binary labels $y_i \in \{-1, 1\}$ are taken as the corresponding outputs ($N_x$ is the sample size). $\Phi(x_i)$ denotes the corresponding vectors in the feature space, and $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ is a kernel function that denotes the inner product in the feature space.

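The separating hyperplane is obtained by solving the following optimization problem, subject to $y_i(w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N_x} \xi_i \tag{1}$$

where $w$ is the normal vector of the separating hyperplane in the feature space, and $C > 0$ is a regularization parameter that controls the penalty for incorrect classification. Using Lagrange multipliers, the dual form of Equation (1) is obtained:

$$\max_{\alpha} \; \sum_{i=1}^{N_x} \alpha_i - \frac{1}{2} \sum_{i=1}^{N_x} \sum_{j=1}^{N_x} \alpha_i \alpha_j y_i y_j K(x_i, x_j), \qquad \sum_{i=1}^{N_x} \alpha_i y_i = 0 \tag{2}$$

where $0 \le \alpha_i \le C$. This quadratic optimization problem can be solved efficiently with algorithms such as sequential minimal optimization. During optimization, many of the $\alpha_i$ tend to zero. The remaining $x_i$, those with $\alpha_i > 0$, are called support vectors.

For simplicity, it is assumed that all non-support-vector data have been deleted. Therefore, $N_x$ represents the number of support vectors, and $\alpha_i > 0$ for all $i$.

The normal vector of the hyperplane is then given by (*et al.* 2006):

$$w = \sum_{i=1}^{N_x} \alpha_i y_i \Phi(x_i) \tag{3}$$

Because $\Phi(x_i)$ is defined only implicitly, $w$ exists only in the feature space and cannot be calculated directly. Instead, the classification $f(q)$ of a new sample vector $q$ can only be obtained by evaluating the kernel function of $q$ with each support vector (Equation (4)):

$$f(q) = \operatorname{sign}\left( \sum_{i=1}^{N_x} \alpha_i y_i K(x_i, q) + b \right) \tag{4}$$

In this regard, the parameter $b$ is the offset of the hyperplane along its normal vector, which is determined during the training process.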
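To make the kernel-based classification of a new sample concrete, the following minimal sketch evaluates the decision rule of Equation (4) in plain Python. It assumes a radial basis function kernel, which is one common choice and is not stated in the text; the support vectors, labels, multipliers, and offset are hypothetical hand-picked values, not the result of training.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2); the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def classify(q, support_vectors, labels, alphas, offset, kernel=rbf_kernel):
    """Sign of the kernel expansion over the support vectors plus the offset."""
    score = sum(
        alpha * y * kernel(x, q)
        for x, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score + offset >= 0 else -1


# Hypothetical, hand-picked values (not fitted to any data).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]  # both positive, and sum(alpha_i * y_i) = 0
offset = 0.0

near_positive = classify((1.8, 2.1), support_vectors, labels, alphas, offset)
near_negative = classify((0.1, -0.2), support_vectors, labels, alphas, offset)
```

Only kernel evaluations between the query and the support vectors are needed, so the feature map itself is never computed explicitly.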

### Gene expression programming (GEP) by GeneXproTools

Following the evolution of intelligent models, gene expression programming (GEP) emerged as an evolutionary algorithm (Kargar *et al.* 2019). It is inspired by natural evolution. The advantage of GEP over other methods is that the structure (input variables, objective, and set of functions) is defined first. The optimal model structure and coefficients are then determined during the process. In other words, this method can optimize both the model structure and its components. The five-step process for solving a problem using GEP is as follows:

1. Select the terminal set, which consists of the problem's independent variables and system state variables. This step also involves selecting the fitness function, which typically employs the root mean square error (RMSE).
2. Select the set of functions, including mathematical operators, test functions, and Boolean functions.
3. Select the model accuracy index, which measures how well the model solves the specific problem.
4. Set the control components: program execution is controlled by numerical component values and qualitative variables.
5. Set the remaining run parameters: the number of data in the training and test sections, the head size and number of genes of the chromosomes, and the linking operator, which can be addition, subtraction, multiplication, or division.
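The steps above can be illustrated with a minimal sketch of how a single GEP gene is decoded and scored. It assumes the usual breadth-first (Karva) reading of a gene, a function set restricted to three binary operators, single-character terminal names, and RMSE as the fitness measure. It does not use geppy or GeneXproTools, and the gene strings and samples are hypothetical.

```python
import math
import operator

# Function set (all of arity 2); terminals are single-character variable names.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def decode(k_expr):
    """Map each coding position to the positions of its children, breadth-first."""
    children = {}
    next_free = 1  # position 0 is the root
    for i, symbol in enumerate(k_expr):
        if i >= next_free:
            break  # the remaining symbols are non-coding
        arity = 2 if symbol in FUNCTIONS else 0
        children[i] = list(range(next_free, next_free + arity))
        next_free += arity
    return children


def evaluate(k_expr, variables):
    """Evaluate a valid gene for one set of terminal values."""
    children = decode(k_expr)

    def value(i):
        symbol = k_expr[i]
        if symbol in FUNCTIONS:
            left, right = children[i]
            return FUNCTIONS[symbol](value(left), value(right))
        return variables[symbol]

    return value(0)


def rmse_fitness(k_expr, samples):
    """Root mean square error of a gene over (inputs, target) pairs."""
    squared = [(evaluate(k_expr, x) - y) ** 2 for x, y in samples]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical samples generated from the relation a * b + a.
samples = [({"a": 1.0, "b": 2.0}, 3.0), ({"a": 2.0, "b": 3.0}, 8.0)]
exact = rmse_fitness("+*aab", samples)   # decodes to (a * b) + a
rough = rmse_fitness("+abab", samples)   # decodes to a + b; the tail is unused
```

An evolutionary run would apply selection and genetic operators to a population of such genes and keep those with the lowest RMSE.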

### Wavelet transform

*et al.* 2019b). This comparison is performed for a continuous signal using the inner product of the signal $x(t)$ and the functions $\Phi_n(t)$:

$$\langle x, \Phi_n \rangle = \int_{-\infty}^{\infty} x(t)\, \Phi_n^{*}(t)\, dt$$

where the character $*$ denotes the complex conjugate. The inner product indicates the similarity of the signal $x(t)$ and the function $\Phi_n(t)$: the greater the similarity of the signal $x(t)$ to the function $\Phi_n(t)$, the greater the result of the inner product. If the functions $\Phi_n(t)$ are orthogonal, the signal $x(t)$ is retrieved as follows:

$$x(t) = \sum_{n} \langle x, \Phi_n \rangle\, \Phi_n(t)$$

where the $\Phi_n(t)$ are orthogonal basis functions with $\langle \Phi_m, \Phi_n \rangle = \delta_{mn}$. Here, $\delta_{mn}$ is the Kronecker delta function. This relation is called the inverse transform.

The continuous wavelet transform of the signal $x(t)$ is defined as

$$WT(\tau, b) = \frac{1}{\sqrt{b}} \int_{-\infty}^{\infty} x(t)\, g^{*}\!\left(\frac{t - \tau}{b}\right) dt$$

in which the wavelet $g_{\tau b}(t)$ is obtained from the mother wavelet, $g(t)$:

$$g_{\tau b}(t) = \frac{1}{\sqrt{b}}\, g\!\left(\frac{t - \tau}{b}\right)$$

where $\tau$ and $b$ are the translation and scale (dilation) parameters, respectively. The wavelet coefficient $WT(\tau, b)$ indicates the degree of correlation between the signal and the local part of the wavelet. Large values of $WT(\tau, b)$ indicate that the signal under study has a major frequency component belonging to the given scale. In this case, the scaled wavelet will be similar to the signal under study. By changing the values of the parameters $\tau$ and $b$, the analysis window can be changed in terms of time and scale. The mother wavelet should have a mean of zero. As a result, the original signal can be calculated from the wavelet transform coefficients $WT(\tau, b)$ by inverting the wavelet transform as follows:

$$x(t) = \frac{1}{C_g} \int_{-\infty}^{\infty} \int_{0}^{\infty} \frac{1}{b^{2}}\, WT(\tau, b)\, g_{\tau b}(t)\, db\, d\tau$$

where

$$C_g = \int_{-\infty}^{\infty} \frac{\lvert G(\omega) \rvert^{2}}{\lvert \omega \rvert}\, d\omega < \infty$$

and $G(\omega)$ is the Fourier transform of $g(t)$. This condition is known as the admissibility condition. The reversibility of the transform and the ability to reconstruct the signal $x(t)$ from its continuous wavelet transform depend on this condition. Once the base wavelet is selected, it can be used to analyze the data.
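As an illustration of how a wavelet coefficient measures local similarity, the sketch below approximates $WT(\tau, b)$ by a Riemann sum on a uniform time grid. The Mexican hat function is used as the mother wavelet purely as an assumption, since the text does not name the wavelet; the $1/\sqrt{b}$ normalization is the conventional one, and the test signal is synthetic.

```python
import math


def mexican_hat(t):
    """Real, zero-mean mother wavelet (an assumed choice for this sketch)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_coefficient(signal, times, tau, b, mother=mexican_hat):
    """Approximate WT(tau, b) by a Riemann sum on a uniform time grid.

    For a real-valued wavelet the complex conjugate can be dropped.
    """
    dt = times[1] - times[0]
    total = sum(mother((t - tau) / b) * x for t, x in zip(times, signal))
    return total * dt / math.sqrt(b)


# Synthetic signal: one wavelet-shaped pulse centred at t = 5.
dt = 0.05
times = [i * dt for i in range(201)]  # 0 to 10
signal = [mexican_hat(t - 5.0) for t in times]

c_match = cwt_coefficient(signal, times, tau=5.0, b=1.0)  # wavelet on the pulse
c_off = cwt_coefficient(signal, times, tau=1.0, b=1.0)    # wavelet shifted away
```

The coefficient is largest when the translated wavelet lines up with the pulse and drops when it is shifted away, which is the correlation behaviour described above.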

### Developed WSVM and WGEP models

### Efficiency criteria

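The efficiency criteria used in this study are the NSE, the NRMSE, the MAPE, and $R^{2}$. In these criteria, $O_i$ and $P_i$ are the observed and simulated values, respectively, and $\bar{O}$ and $\bar{P}$ are the averages of the observed and simulated values.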
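The four criteria reported in the results tables can be computed as in the following sketch. The formulas are the commonly used definitions and are assumptions here, because the original equations are not available in this text. In particular, the NRMSE is normalized by the range of the observed values (other normalizations exist), the MAPE is expressed in percent, and $R^{2}$ is taken as the squared Pearson correlation. The sample values are made up.

```python
import math


def mean(values):
    return sum(values) / len(values)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    o_bar = mean(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, sim))
    den = sum((o - o_bar) ** 2 for o in obs)
    return 1.0 - num / den


def nrmse(obs, sim):
    """RMSE divided by the observed range (assumed normalization)."""
    rmse = math.sqrt(mean([(o - p) ** 2 for o, p in zip(obs, sim)]))
    return rmse / (max(obs) - min(obs))


def mape(obs, sim):
    """Mean absolute percentage error in percent; observed values must be non-zero."""
    return 100.0 * mean([abs((o - p) / o) for o, p in zip(obs, sim)])


def r_squared(obs, sim):
    """Squared Pearson correlation between observed and simulated values."""
    o_bar, p_bar = mean(obs), mean(sim)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, sim))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in sim)
    return cov ** 2 / (var_o * var_p)


# Made-up values, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
simulated = [2.5, 3.5, 6.5, 7.5]
scores = {
    "NSE": nse(observed, simulated),
    "NRMSE": nrmse(observed, simulated),
    "MAPE": mape(observed, simulated),
    "R2": r_squared(observed, simulated),
}
```

An NSE of 1 corresponds to a perfect match, while an NSE of 0 means the model is no better than predicting the observed mean.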

### Analysis of variance (ANOVA)

In the ANOVA test statistic, $X_{ij}$ is the individual observation, $\bar{X}_j$ is the sample mean of the $j$th treatment or group, $\bar{X}$ is the overall sample mean, $k$ is the number of treatments or independent comparison groups, and $n_j$ is the total number of observations or total sample size.

When a single factor influences the populations, one-way ANOVA is used to build the analysis of variance table for the test that compares the means of several populations. If more than one factor is involved, the analysis of variance model becomes more complex and can also reveal the effect of each factor on the others. In that case, the sources of dispersion are separated according to the two variables and the interaction between them, and the variance analysis table is referred to as two-way ANOVA. We considered calculated probabilities of <0.05 to be significant and <0.001 to be highly significant.
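As a worked illustration of the one-way case, the sketch below computes the standard F statistic as the ratio of the between-group mean square to the within-group mean square. This is the textbook formula rather than a reproduction of the authors' computation, and the groups are made-up values; it covers only the one-way test, not the two-way test reported in the results.

```python
def one_way_anova_f(groups):
    """F = (SS_between / (k - 1)) / (SS_within / (N - k))."""
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Made-up groups: means 2, 3, 4 with identical spread.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_value = one_way_anova_f(groups)
```

Turning the F value into a P-value requires the F distribution with $k - 1$ and $N - k$ degrees of freedom; in Python, `scipy.stats.f_oneway` returns both the statistic and the P-value.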

## RESULTS AND DISCUSSION

In this section, the results of each method are presented separately. These results are based on the values simulated and observed at the Tang-Nai-Hai hydrometric station from 2000 to 2020. The results are then compared with each other. In this study, 70% of the data are used for training and 30% for testing. The goal is to predict the monthly runoff from the observational data for 1950 to 2020. The train and test sets were selected randomly using the scikit-learn library. The GEP algorithm was implemented with geppy, an evolutionary algorithm framework developed in Python.
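A random 70/30 split of this kind can be sketched with the Python standard library alone. The function name `random_split`, the seed, and the sample count are hypothetical; in scikit-learn the same idea is available through `train_test_split` with `test_size=0.3`.

```python
import random


def random_split(samples, train_fraction=0.7, seed=42):
    """Shuffle sample indices reproducibly and cut them into train and test sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # local generator, fixed seed
    n_train = round(train_fraction * len(samples))
    train = [samples[i] for i in indices[:n_train]]
    test = [samples[i] for i in indices[n_train:]]
    return train, test


# Hypothetical data set of 100 samples, identified here only by index.
samples = list(range(100))
train, test = random_split(samples)
```

Fixing the seed makes the split repeatable, so every model is trained and tested on the same subsets.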

### Results of the SVM method

Table 1 shows the efficiency criteria and two-way ANOVA test values for the train and test states in the SVM method. According to the table, the SVM method was more accurate on non-rainy days than on rainy days (higher R-squared values). In addition, the train state is more accurate than the test state. The ANOVA test revealed that the results of the SVM method are highly significant in the train state of non-rainy days (*P*-value < 0.001).

Table 1. Efficiency criteria (NSE, NRMSE, MAPE, R²) and ANOVA test values (F-value, P-value) for the SVM method

| Data type | State | NSE | NRMSE | MAPE | R² | F-value | P-value |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.72 | 0.06 | 3.61 | 0.79 | 3.49 | 0.74 |
| Rainy days | Test | 0.65 | 0.12 | 3.29 | 0.74 | 4.38 | 0.8 |
| Non-rainy days | Train | 0.77 | 0.03 | 2.37 | 0.82 | 2.35 | <0.001 |
| Non-rainy days | Test | 0.71 | 0.07 | 3.19 | 0.79 | 2.07 | 0.99 |

Because the train state encompasses the years 1950–2000, it contains more points than the test state. The scatterplots show that the train state is more accurate than the test state.

### Results of the GEP method

The values of efficiency criteria and two-way ANOVA test for the train and test states in the GEP method are presented in Table 2. The accuracy of the GEP model in the train state is greater than in the test state, as shown in Table 2. This is likely due to allocating 70% of the data to the train state. In addition, the GEP model is more accurate at predicting runoff on non-rainy days than on rainy days because the runoff from the basin during rainy days is a transient stream with a high discharge, whereas, during non-rainy days, it is a steady state with a low discharge.

Table 2. Efficiency criteria (NSE, NRMSE, MAPE, R²) and ANOVA test values (F-value, P-value) for the GEP method

| Data type | State | NSE | NRMSE | MAPE | R² | F-value | P-value |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.85 | 0.03 | 2.06 | 0.89 | 2.11 | 1.00 |
| Rainy days | Test | 0.72 | 0.10 | 3.89 | 0.78 | 1.78 | 1.00 |
| Non-rainy days | Train | 0.89 | 0.01 | 1.59 | 0.91 | 2.33 | 0.35 |
| Non-rainy days | Test | 0.77 | 0.06 | 1.26 | 0.82 | 2.16 | 0.17 |

The results show that the train state is more accurate than the test state.

### Results of WSVM method

The values of the efficiency criteria and the two-way ANOVA test for the train and test states in the WSVM method are presented in Table 3. According to Table 3, the model is more accurate on non-rainy days than on rainy days. Furthermore, the train state is more accurate than the test state. The ANOVA test revealed that the WSVM method is highly significant on non-rainy days (*P*-value < 0.001).

Table 3. Efficiency criteria (NSE, NRMSE, MAPE, R²) and ANOVA test values (F-value, P-value) for the WSVM method

| Data type | State | NSE | NRMSE | MAPE | R² | F-value | P-value |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.83 | 0.03 | 0.57 | 0.88 | 2.69 | 0.011 |
| Rainy days | Test | 0.78 | 0.08 | 0.65 | 0.83 | 2.07 | 0.034 |
| Non-rainy days | Train | 0.94 | 0.009 | 0.40 | 0.96 | 1.19 | <0.001 |
| Non-rainy days | Test | 0.88 | 0.05 | 0.32 | 0.91 | 2.16 | <0.001 |

The scatterplots show that the train state is more accurate than the test state.

### Results of the WGEP method

The values of the efficiency criteria and the two-way ANOVA test for the train and test states in the WGEP method are presented in Table 4. According to Table 4, the model is more accurate on non-rainy days than on rainy days. The ANOVA test shows that the WGEP model is highly significant on non-rainy days (*P*-value < 0.001).

Table 4. Efficiency criteria (NSE, NRMSE, MAPE, R²) and ANOVA test values (F-value, P-value) for the WGEP method

| Data type | State | NSE | NRMSE | MAPE | R² | F-value | P-value |
|---|---|---|---|---|---|---|---|
| Rainy days | Train | 0.95 | 0.008 | 1.74 | 0.97 | 1.04 | 0.006 |
| Rainy days | Test | 0.89 | 0.04 | 2.18 | 0.92 | 1.28 | 0.021 |
| Non-rainy days | Train | 0.98 | 0.006 | 0.31 | 0.99 | 0.70 | <0.001 |
| Non-rainy days | Test | 0.93 | 0.03 | 0.34 | 0.95 | 0.96 | <0.001 |

Because it covers the years 1950–2000, the train state contains more data points than the test state. The scatterplots make it clear that the train state provides more accurate results than the test state.

According to the hydrograph diagrams for each method, there is a good match between the simulated and observed results. All models investigated in this paper were less accurate in estimating relative extreme points than in estimating other points. The results showed that the GEP model is more accurate than the SVM model, with NSEs of 0.89 and 0.77 for non-rainy days and 0.85 and 0.72 for rainy days, respectively. The use of hybrid methods also improved simulation accuracy. A comparison of the WSVM and WGEP models showed that the WGEP model is more accurate than the WSVM model, with NSEs of 0.98 and 0.94 for non-rainy days and 0.95 and 0.83 for rainy days, respectively.

According to the scatter plot diagrams, combining the SVM and GEP methods with the wavelet transform has resulted in closer and more strongly correlated simulated and observed results. Furthermore, the results obtained in the train state have a higher correlation than those obtained in the test state. In addition, the scatter plot for each method revealed that the maximum points of the train state were more significant than the maximum points of the test state. The ANOVA test revealed that the results of the WSVM and WGEP methods were highly significant on non-rainy days (*P*-value < 0.001). The SVM method is also highly significant in the train state on non-rainy days (*P*-value < 0.001). Finally, the models are more accurate on non-rainy days than on rainy days (higher R-squared values).

The findings recommend that continuous rainfall-runoff modeling in catchments using data-based models be done using a two-criteria model that accounts for both rainy and non-rainy days. For future studies, comparing the one-criterion and two-criterion models is also suggested. It is also suggested that the accuracy of other runoff predicting methods and models be evaluated and compared with the results of the current study.

## CONCLUSIONS

Today, machine learning is utilized extensively in numerous scientific fields, including water management. This study compares the accuracy of the SVM, GEP, WSVM, and WGEP methods for estimating the hydrograph time series in the Yellow River Basin. In addition to other required study area parameters, monthly precipitation, temperature, and discharge values from 1950 to 2020 are considered. The period 1950–2000 is considered the train state, while the period 2000–2020 is considered the test state. The results showed that the GEP model is more accurate than the SVM model, with NSEs of 0.89 and 0.77 for non-rainy days and 0.85 and 0.72 for rainy days, respectively. The use of hybrid methods also improved simulation accuracy. A comparison of the WSVM and WGEP models showed that the WGEP model is more accurate than the WSVM model, with NSEs of 0.98 and 0.94 for non-rainy days and 0.95 and 0.83 for rainy days, respectively. Considering the values in the tables of efficiency criteria and the observed and simulated hydrographs, it can be concluded that the WGEP method yields the most accurate results and the SVM method yields the least accurate results. A comparison of the observed data from the Tang-Nai-Hai hydrometric station with the simulated results revealed an acceptable level of agreement.

Certain limitations remain to be addressed in future studies. In future research, more data from various basins can be used to evaluate the efficiency of the proposed methods. In addition, the rainfall-runoff process is affected by many aspects of human society and the natural environment, and therefore shows complex, dynamic, evolving characteristics. Hence, a single prediction model may fail to provide satisfactory outcomes in certain cases.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.