## Abstract

Due to the complex nature of the river stage-discharge process, the present study developed a unique strategy to predict it precisely. The proposed conceptual strategy has several advantages that cover the shortcomings of earlier approaches. First, it uses one model instead of several models to predict multiple points instead of one point. On the one hand, the constructed model was inspired by physically based models (to include time-space attributes of the catchment). On the other hand, the ensemble empirical mode decomposition algorithm (EEMD), wavelet transform (WT), and mutual information (MI) were employed as a hybrid pre-processing approach conjugated with a support vector machine. To this end, a conceptual strategy (multi-station model) was developed to forecast the Souris River discharge more accurately. The strategy used herein was capable of covering the uncertainties and complexities of river discharge modeling. First, a classic model along with WT was applied to predict the 1-day-ahead river discharge for each single station. Then, DWT-EEMD and MI-based feature selection were applied to the decomposed subseries to be employed in the conceptual models. In the proposed feature selection method, uninformative subseries were deleted to achieve better performance. The results confirmed the efficiency of the proposed WT-EEMD-MI approach in improving the accuracy of different modeling strategies.

## INTRODUCTION

The collection of continuous discharge measurements is a very difficult mission and, therefore, a stage-discharge (*Q-H*) relationship is commonly used to estimate stream discharges from measured stage values. Rating curves are widely used to determine the *Q-H* relationship, although they are not always able to provide sufficiently accurate results. A *Q-H* rating curve is a relationship between stream stage (water level) and discharge for a particular section of a stream. Such a relationship is usually established by a regression-based analysis, and the curves are generally expressed in the form of a power equation (Kisi & Cobenar 2009; Farhan *et al.* 2020). However, the *Q-H* relationship may not be capable enough under different river conditions. In other words, to obtain precise results, various researchers suggested applying AI-based models, which have been proved to provide more accurate outcomes in comparison to the *Q-H* relationship (Potter *et al.* 2010; Ghorbani *et al.* 2016).

Artificial intelligence (AI) has been successfully applied in a number of diverse fields including water resources (e.g., Coulibaly *et al.* 1999; Anmala *et al.* 2000; Tokar & Markus 2000; Chou 2013; Singh *et al.* 2015), and many AI-based approaches are driven by knowledge discovery (Babovic 2009; Meshgi *et al.* 2015). Likewise, machine learning approaches are remarkable AI methods that have been implemented in the river modeling field in recent years (e.g., Yu *et al.* 2004; Azamathulla *et al.* 2009; Chen *et al.* 2010; Nourani *et al.* 2016). Also, some researchers have reviewed state-of-the-art applications of AI and pre-processing approaches to categorize different studies and their advantages (Labat 2005; Abrahart *et al.* 2012; Sang 2013).

Wavelet transform (WT) and ensemble empirical mode decomposition (EEMD) are pre-processing methods that give remarkable insight into the physical form of the data by representing information in both time and frequency domains (Roushangar & Alizadeh 2018). EEMD has proven to be a quite effective method for extracting signals from data generated in noisy, nonlinear, and non-stationary processes (Huang *et al.* 1998). In the WT, a time series is broken down into a series of linearly independent detail signals and one approximation signal by using a specific wavelet function. Mallat (1998) presented a complete theory for wavelet multi-resolution signal decomposition (also referred to as the pyramid decomposition algorithm). Studies have confirmed that proper data pre-processing using WT and EEMD can lead models to adequately represent the real characteristics of the underlying system. WT along with EEMD decomposes a non-stationary signal into a given number of stationary sub-signals. AI methods can then be combined with WT and EEMD to improve the precision of the prediction. In the present research, feature extraction and learning from historical data are the keys to improving forecasting accuracy. For this reason, much of the actual effort is put into the processes of feature extraction, data pre-processing, and transformation. Maheswaran & Khosa (2013) also used the wavelet technique to improve nonlinear model forecasting performance in the Krishna River. Bai *et al.* (2015) developed a multiscale deep feature learning (MDFL) method with hybrid models to deal with daily reservoir inflow forecasting. Precise hybrid models have been employed in recent years to forecast hydrological and hydrogeological processes (e.g., Partal 2009; Tiwari & Chatterjee 2010; Wang *et al.* 2015).

In the modeling process based on AI approaches, some of the input variables might present correlation or noise, or have no significant relationship with target variables, and they are generally not equally informative. Therefore, one important step is to determine dominant input variables that are independent, informative, and effective. Mutual information (MI), as a nonlinear measure of information content, can be useful in facilitating the selection of effective inputs among large numbers of WT-based or WT-EEMD subseries (Sannasiraja *et al.* 2004; Sun *et al.* 2010; Roushangar *et al.* 2016). Shannon entropy-based measures are applied in this study to extract dominant inputs of the proposed models for discharge-stage modeling. Since entropy measures arbitrary dependencies between random variables, it is suitable for complex classification tasks where methods based on linear relations are prone to mistakes. Entropy can be regarded as a measure of information, disorder, chaos, and uncertainty (Shannon 1948; Roushangar & Alizadeh 2018). Meyer *et al.* (2008) presented an original filter approach for effective feature selection in microarray data characterized by a large number of input variables and few samples.

In most previous studies of AI-based hydrological modeling techniques, temporal variables (as historical time series) of the process were used as model inputs to predict runoff (Anmala *et al.* 2000). On the other hand, in some studies, AI-based and conceptual models have been combined to estimate river flow (Tokar & Markus 2000; Ashu & Sanaga 2006; Chen & Adams 2006; Corzo *et al.* 2009; Roushangar *et al.* 2018). These models, which combine conceptual hydrological models and AI-based models, usually include two parts: a conceptual model for the runoff of every sub-catchment and an AI-based model to combine the runoff from each sub-catchment. Nourani *et al.* (2013) developed a geomorphological model to handle Eel River watershed runoff. The quality of the results supported the beneficial application of the approach in spatio-temporal modeling of hydrological processes. Multi-station modeling is a conceptual approach whose capability has been confirmed and, in some cases, imposing geomorphological features on the model improved the results (Zhang & Govindaraju 2003; Sarangi *et al.* 2005).

The present study aims to predict the daily river stage-discharge process and enhance the capability of modeling scenarios by using hybrid WT-EEMD-MI and machine learning models. To achieve this goal, the capability of WT-EEMD-MI based multi-station model was investigated and improvements were studied, and results were compared with classic rating curve (RC).

## MATERIALS AND METHODS

### Study area and used data

The Souris River is a river in central North America. It is about 700 km in length and drains about 61,100 km^{2}. It rises in the Yellow Grass Marshes north of Weyburn, Saskatchewan. It wanders south through North Dakota beyond Minot to its most southern point in the city of Velva, and then back north into Manitoba. The river passes through the communities of Melita, Hartney, Souris, and Wawanesa and on to its confluence with the Assiniboine River at Treesbank, about 40 km south east of Brandon. The main tributaries which flow into the Souris in Manitoba are the Antler River, the Gainsborough and Plum Creeks. Much of its drainage basin is fertile silt and clay deposited by former glacial Lake Souris. The channel capacity of the river in Manitoba varies from about 4.2 m^{3}/s near the border, to about 40 m^{3}/s through Melita, to about 31 m^{3}/s near Lauder and 48 m^{3}/s near Hartney. North of Hartney the capacity increases to more than 85 m^{3}/s. The drop between the border and Hartney is only about 9.5 cm/km. Two large dams in Saskatchewan, Rafferty Dam and Alameda Dam, were built, in part, to reduce flood peaks on the Souris River. Table 1 demonstrates the characteristics of the Souris River and used data.

The selected stations (Sherwood, Foxholm, Minot, Verendry, Bantry, and Westhope), shown in Figure 1, have continuous records, which makes them suitable for the proposed models. Discharge-stage data range from 1 January 2012 to 1 January 2013 at a daily scale. The dataset was separated into two parts: the first division, including 70% of the total data, served as the calibration dataset, and the rest was considered the verification dataset. The historical daily discharge data were available for six stations on the Souris River at the USGS website (http://co.water.usgs.gov/sediment).

### Rating curves (RC)

A *Q-H* rating curve generally represents a functional relationship of the form given in Equation (1):

*Q* = *aH*^{b}  (1)

where *Q* is the stream discharge and *H* is the water level. The values of *a* and *b* for a particular stream are determined from data via a linear regression between log *H* and log *Q*. A major limitation of this approach is that it is not able to consider the hysteresis effect. In this study, the values of *a* and *b* are computed by using the least squares method.
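The log-log least-squares fit described above can be sketched as follows (the stage and discharge values are synthetic illustrations, not Souris River data):

```python
import numpy as np

def fit_rating_curve(H, Q):
    """Return (a, b) of Q = a * H**b fitted by linear regression in log space."""
    logH, logQ = np.log10(H), np.log10(Q)
    b, log_a = np.polyfit(logH, logQ, 1)  # slope = b, intercept = log10(a)
    return 10 ** log_a, b

# Synthetic stage (m) and discharge (m^3/s) pairs following an exact power law
H = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
Q = 4.0 * H ** 1.7
a, b = fit_rating_curve(H, Q)
print(round(a, 3), round(b, 3))  # recovers a = 4.0, b = 1.7
```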

### Support vector machine (SVM)

Support vector machine (SVM) is a supervised learning model with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. An SVM model represents the samples as points in space, mapped so that the samples of the separate categories are divided by a clear gap that is as wide as possible. As an effective classification method, SVM was proposed on the basis of statistical learning theory (SLT). SVM is a linear machine working in the high-dimensional feature space formed by the nonlinear mapping of the *n*-dimensional input vector *x* into a *K*-dimensional feature space (*K* > *n*). The SVM network applies one hidden layer of nonlinear neurons, one linear output neuron, and a specialized learning procedure leading to the global minimum of the error function and excellent generalization ability of the trained network.
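A minimal sketch of SVM regression with an RBF kernel, here using scikit-learn's `SVR`; the data and parameter values are illustrative, not those tuned in this study:

```python
import numpy as np
from sklearn.svm import SVR

# Fit an epsilon-SVR with RBF kernel to a smooth synthetic signal
X = np.linspace(0, 4 * np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

model = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.01)
model.fit(X, y)
pred = model.predict(X)
rmse = float(np.sqrt(np.mean((pred - y) ** 2)))
print(rmse)  # small training error on this smooth signal
```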

### Alternative regression approaches

In this study, in order to measure the efficiency of the SVM's performance, the following regression models were selected for comparison of the outcomes.

#### Auto regressive integrated moving average with exogenous input (ARIMAX)

Auto regressive integrated moving average with exogenous input (ARIMAX) is a classic time series forecasting model and therefore was used as a comparison model to evaluate the efficiency of SVM. Although a linear model like ARIMAX is unlikely to model a complex nonlinear hydrological process, it is still a commonly used method in practice and, as such, is useful as a comparison model (Adamowski & Chan 2011). In this research, the ARIMAX (p, d, q) model was applied in which p, q, and d refer to orders of autoregressive and moving average components and the number of differencing operations, respectively. Temporal features were used as exogenous inputs to predict future discharge as the output.
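As a simplified illustration of how lagged endogenous and exogenous terms enter such a model, the sketch below fits a plain ARX model by ordinary least squares; a full ARIMAX adds differencing and moving-average terms. The data and orders are assumptions for the example:

```python
import numpy as np

def fit_arx(y, x, p=2):
    """Fit y(t) = sum_i a_i * y(t-i) + b * x(t) + c by least squares."""
    rows = [np.concatenate(([1.0, x[t]], y[t - p:t][::-1]))
            for t in range(p, len(y))]
    A = np.array(rows)
    coeffs, *_ = np.linalg.lstsq(A, y[p:], rcond=None)
    return coeffs  # [c, b, a_1, ..., a_p]

rng = np.random.default_rng(1)
x = rng.normal(size=300)          # exogenous input series
y = np.zeros(300)
for t in range(2, 300):           # known noise-free generating process
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.5 * x[t]

coeffs = fit_arx(y, x, p=2)
print(np.round(coeffs, 3))  # ≈ [0, 0.5, 0.6, -0.2]
```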

#### Extreme learning machine (ELM)

Extreme learning machine (ELM), which is an algorithm developed for single hidden layer feed forward neural networks (SLFNs), can be thousands of times faster than traditional feed forward network learning algorithms like back-propagation algorithm, while obtaining a better generalization performance (Huang *et al.* 2004). Different from traditional learning algorithms, ELM tends to reach the smallest training error and the smallest norm of the weights. ELM, with higher scalability and less computational complexity, not only unifies different popular learning algorithms, but also provides a unified solution to different practical applications (e.g., regression, binary, and multiclass classifications).
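The closed-form training that makes ELM fast can be sketched as follows (the layer size and data are illustrative assumptions):

```python
import numpy as np

def elm_train(X, y, n_hidden=50, seed=0):
    """ELM: random hidden weights, output weights solved by pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights (fixed)
    b = rng.normal(size=n_hidden)                 # random biases (fixed)
    H = np.tanh(X @ W + b)                        # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

X = np.linspace(-1, 1, 200).reshape(-1, 1)
y = np.sin(3 * X).ravel()
W, b, beta = elm_train(X, y, n_hidden=50)
rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - y) ** 2)))
print(rmse)  # small for this smooth target
```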

#### Adaptive neuro-fuzzy inference system (ANFIS)

The ANFIS is a fuzzy Sugeno model put in the framework of adaptive systems to facilitate learning and adaptation (Jang 1993). Such framework makes the ANFIS modeling more systematic and less reliant on expert knowledge. To present the ANFIS architecture, two fuzzy if-then rules based on a first-order Sugeno model are considered:

*Rule 1: If x is A _{1} and y is B _{1}, then f _{1} = p _{1}x + q _{1}y + r _{1}*

*Rule 2: If x is A _{2} and y is B _{2}, then f _{2} = p _{2}x + q _{2}y + r _{2}*

where *x* and *y* are the inputs, *A _{i}* and *B _{i}* are the fuzzy sets, *f _{i}* are the outputs within the fuzzy region specified by the fuzzy rule, and *p _{i}*, *q _{i}*, and *r _{i}* are the design parameters that are determined during the training process. The output of the ANFIS is calculated by employing the consequent parameters found in the forward pass. The detailed algorithm and mathematical background of the hybrid-learning algorithm can be found in Jang (1993).
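The rule evaluation above can be worked through numerically: each rule's firing strength is the product of its membership values, and the output is the firing-strength-weighted average of the linear consequents. The Gaussian membership functions and parameter values below are illustrative, not trained ANFIS values:

```python
import numpy as np

def gauss(v, c, s):
    """Gaussian membership function with center c and width s."""
    return np.exp(-((v - c) ** 2) / (2 * s ** 2))

def sugeno_two_rules(x, y):
    # Firing strengths: product of membership values for each rule's premises
    w1 = gauss(x, c=0.0, s=1.0) * gauss(y, c=0.0, s=1.0)   # A1, B1
    w2 = gauss(x, c=2.0, s=1.0) * gauss(y, c=2.0, s=1.0)   # A2, B2
    # First-order consequents f_i = p_i * x + q_i * y + r_i (illustrative values)
    f1 = 1.0 * x + 0.5 * y + 0.1
    f2 = -0.5 * x + 1.0 * y + 0.2
    # Weighted-average defuzzification
    return (w1 * f1 + w2 * f2) / (w1 + w2)

print(round(sugeno_two_rules(0.0, 0.0), 4))
```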

#### Artificial neural network-radial basis function (ANN-RBF)

Radial basis function (RBF) neural networks are based on supervised learning. RBF networks were independently proposed by many researchers and are a popular alternative to the multi-layer perceptron (MLP). RBF networks are also good at modeling nonlinear data, can be trained in one stage rather than the iterative process used in MLP, and learn the given application quickly (Venkatesan & Anitha 2006). The structure of an RBF neural network is similar to that of an MLP: it consists of layers of neurons. The main distinction is that RBF has a hidden layer containing nodes called RBF units. Each RBF has two key parameters that describe the location of the function's center and its deviation or width. The hidden unit measures the distance between an input data vector and the center of its RBF. The RBF has its peak when the distance between its center and the input data vector is zero and declines gradually as this distance increases. An RBF network has only a single hidden layer and only two sets of weights: one connecting the hidden layer to the input layer and the other connecting the hidden layer to the output layer. The weights connecting to the input layer contain the parameters of the basis functions. The weights connecting the hidden layer to the output layer are used to form linear combinations of the activations of the basis functions (hidden units) to generate the network outputs. Since only the hidden units are nonlinear, the outputs of the hidden layer may be combined linearly, so processing is rapid. The output of the network is derived as described in Foody (2010).
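A minimal sketch of such an RBF network, with grid-placed Gaussian centers and linear output weights solved by least squares (center placement and widths are illustrative assumptions; in practice centers may be chosen by clustering):

```python
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian RBF activations of every input against every center."""
    d = X[:, None, :] - centers[None, :, :]        # pairwise differences
    return np.exp(-np.sum(d ** 2, axis=2) / (2 * width ** 2))

X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = np.cos(2 * np.pi * X).ravel()
centers = np.linspace(-1, 1, 15).reshape(-1, 1)    # grid-placed RBF centers

Phi = rbf_design(X, centers, width=0.2)            # hidden-layer outputs
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # linear output weights
rmse = float(np.sqrt(np.mean((Phi @ w - y) ** 2)))
print(rmse)  # small fitting error on this smooth target
```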

### Concept of modeling via AI-based approaches

With respect to the reviewed research, most of the deficiencies in the application of AI-based models are related to their input–output structure. A simple AI model might not be able to handle the natural uncertainty of specific hydrological processes completely. For instance, an ad hoc AI model is not designed to predict the variable of interest at an arbitrary point along the river. Therefore, it is necessary to design models which have semi-distributed model characteristics. To this end, different scenarios in the river discharge modeling context were proposed in this research. Multi-station modeling via SVM as a conceptual model was used in this study, as discussed below. Figure 2 shows the modeling flowchart of this study.

On the other hand, discharge data include a broad range of values and fluctuations and have to be pre-processed. As demonstrated in Figure 2, it is proposed to use the WT to decompose the original time series into two components: the approximation components and the detail components. A secondary decomposition was performed by applying the EEMD to the detail components obtained from WT, producing new subseries of intrinsic mode functions (IMFs). The goal of this step is to further decrease the non-stationarity of the detail components captured from WT. Although decomposing twice might lead to the required outcome, the proliferation of subseries might degrade the SVM performance. For this reason, mutual information was used to select the dominant subseries. Such feature selection could increase the performance of the SVM.

### Multi-station modeling of river discharge

The multi-station modeling of river discharge used in this study was designed to catch the nonlinearity, uncertainty, and complexity of the river discharge. For this purpose, the multi-station model was established according to the observed discharge time series. In other words, the model inputs consisted of the temporal features that include *Q-H* processes from the last days. A schematic of the proposed model is shown in Figure 3.

Since the model is in the multi-station form, the input matrix was determined in a way to cover the spatio-temporal variation of the river discharge uncertainties by using temporal features of the hydrometric stations. Therefore, multiple inputs were set in a way that all temporal features could establish a unit matrix. The input variables comprised different sets of antecedent and current gauge *Q-H* values of all the stations to forecast the discharge values (*Q _{i}*(*t*), *i* = 1, 2, 3, …, *n*, where *n* is the number of hydrometric stations). The multi-station model presents a single model which is capable enough to be employed instead of several models within the watershed. The proposed model has unique temporal characteristics which could be applied to forecast the entire river discharge using only one model. Selection of an appropriate input combination for the model is the most important step in the modeling process (Nourani *et al.* 2013). In the multi-station model, the input variables of the model are selected among the temporal features of the river to forecast the discharge values for each station (*Q _{i}*(*t* + 1)) by assuming the Markovian property of the process. In addition, the proposed model is able to predict runoff at stations with a lack of data, or at any desired point within the river, due to considering the temporal features of the sub-basins as the model inputs. A MATLAB code was developed to calculate the model.

### Proposed pre-processing approach

#### Discrete wavelet transform (DWT)

*et al.* 1998; Nourani *et al.* 2013; Farajzadeh & Alizadeh 2018). While the general theory behind WT is quite analogous to that of the short-time Fourier transform (STFT), WT allows for a completely flexible window function (called the mother wavelet), which can be changed over time based on the shape and compactness of the signal. Given this property, WT can be used to analyze the time–frequency characteristics of any kind of time series. In recent years, WT has been widely used for the analysis of many hydro-meteorological time series (Nourani *et al.* 2013). As the mother wavelet moves across the time series during the WT process, it generates several coefficients that represent the similarity between the time series and the mother wavelet (at any specific scale). There are two main types of WT: continuous and discrete. Use of the continuous wavelet transform (CWT) can generate a large number of (often unnecessary) coefficients, making its use and interpretation more complicated. On the other hand, the discrete wavelet transform (DWT) method simplifies the transformation process while still providing a very effective and precise analysis, since DWT is normally based on the dyadic calculation. DWT coefficients can be calculated using the dyadically discretized mother wavelet:

*ψ _{a,b}*(*t*) = 2^{−a/2} *ψ*(2^{−a}*t* − *b*)

where 2^{a} represents the dyadic scale of the DWT. Applying DWT to a time series decomposes that time series into two ancillary time series shape components, called the approximation (A) and detail (D) components. Component A comprises the large-scale, low-frequency component of the time series, while component D represents the small-scale, high-frequency component.
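One DWT level can be sketched with the Haar wavelet, splitting a signal into its A and D components (Haar is chosen only for brevity; the study itself uses db2, db4, Meyer, and coif2 mother wavelets, e.g., via a wavelet library):

```python
import numpy as np

def haar_dwt_level(x):
    """One Haar DWT level: approximation (A) and detail (D) coefficients."""
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # large-scale, low-frequency content
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # small-scale, high-frequency content
    return a, d

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_dwt_level(x)
print(a)  # approximation component
print(d)  # detail component
```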

#### Ensemble empirical mode decomposition (EEMD)

EEMD was proposed to solve the mode mixing issue of empirical mode decomposition (EMD) which specifies the true intrinsic mode function (IMF) as the mean of an ensemble of trials (Wu & Huang 2009). Each trial consists of the decomposition results of the signal plus a white noise of finite amplitude. This new method is based on the insight from recent studies of the statistical properties of white noise (Wu & Huang 2004), which showed that the EMD method is an effective self-adaptive dyadic filter bank when applied to the white noise. On the other hand, studies demonstrated that noise could help data analysis in the EMD method. All these investigations promote the advent of the EEMD method.

The principle of the EEMD algorithm is the following: the added white noise would populate the whole time–frequency space uniformly with the constituting components of different scales. When a signal is added to this uniformly distributed white noise background, the components in different scales of the signal are automatically projected onto proper scales of reference established by the white noise in the background. Because each of the noise-added decompositions consists of the signal and the added white noise, each individual trial may certainly produce very noisy results, but the noise in each trial is different in separate trials (Wang *et al.* 2013; Roushangar & Alizadeh 2018). Thus, it can be decreased or even completely canceled out in the ensemble mean of enough trials. The ensemble mean is treated as the true answer because the only persistent part is the signal as more and more trials are added in the ensemble.
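The noise-cancellation principle behind the ensemble mean can be demonstrated directly: the white noise added in each trial differs from trial to trial, so its contribution shrinks (roughly as 1/√n) in the ensemble mean while the underlying signal persists. This is a sketch of the principle only, not a full EEMD implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t)          # the persistent part

def ensemble_mean(n_trials, noise_std=0.4):
    """Average of n_trials noise-added copies of the signal."""
    trials = [signal + rng.normal(0, noise_std, t.size) for _ in range(n_trials)]
    return np.mean(trials, axis=0)

# Residual noise in the ensemble mean for small vs large ensembles
err_small = float(np.sqrt(np.mean((ensemble_mean(5) - signal) ** 2)))
err_large = float(np.sqrt(np.mean((ensemble_mean(500) - signal) ** 2)))
print(err_small, err_large)  # residual noise shrinks with ensemble size
```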

#### Shannon entropy and mutual information (MI) and selecting dominant subseries

Shannon (1948) introduced a measure (*H*), now called the Shannon entropy or information content, for a discrete random variable *X* with sample size *N* (bin number), which takes values *x*_{1}, *x*_{2}, …, *x*_{N} with probabilities *p*_{1}, *p*_{2}, …, *p*_{N}, respectively, as (Shannon 1948; Roushangar *et al.* 2020):

*H*(*X*) = −Σ_{i=1}^{N} *p _{i}* log *p _{i}*

The mutual information between two variables *A* and *B* is then given by ( *et al.* 2000):

MI(*A*, *B*) = *H*(*A*) + *H*(*B*) − *H*(*A*, *B*)

with *H*(*A*) and *H*(*B*) being the entropy of *A* and *B*, respectively, and *H*(*A*, *B*) their joint entropy as:

*H*(*A*, *B*) = −Σ_{a,b} *p*(*a*, *b*) log *p*(*a*, *b*)
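A histogram-based estimate of these quantities, of the kind usable for ranking decomposed subseries against the target, can be sketched as follows (the bin count and data are illustrative assumptions):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a, b, bins=16):
    """MI(A, B) = H(A) + H(B) - H(A, B) from a joint histogram."""
    p_ab, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = p_ab / p_ab.sum()
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    return entropy(p_a) + entropy(p_b) - entropy(p_ab.ravel())

rng = np.random.default_rng(7)
x = rng.normal(size=5000)                  # candidate input subseries
y_dep = x + 0.1 * rng.normal(size=5000)   # target strongly related to x
y_ind = rng.normal(size=5000)             # target unrelated to x
print(mutual_information(x, y_dep), mutual_information(x, y_ind))
# the dependent pair scores much higher, so it would be kept as a feature
```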

#### Efficiency criteria

In this study, the total data were separated into calibration and verification sets. Two different criteria were selected to evaluate the performance of the proposed forecasting methods: the root mean square error (RMSE) and the determination coefficient (DC). The RMSE and DC were applied to exhibit discrepancies between predicted and observed values (Adamowski & Chan 2011). The RMSE was used to quantify the modeling accuracy by generating a positive value: it grows from zero for perfect forecasts to large positive values as the differences between forecasts and observations become increasingly large. Obviously, a high value for DC (up to 1) and a small value for RMSE indicate high efficiency of the model. Legates & McCabe (1999) indicated that a hydrological model can be sufficiently evaluated by these statistics. These measures are not oversensitive to extreme values (outliers) and are sensitive to additive and proportional differences between model predictions and observations. Therefore, correlation-based measures (e.g., the DC statistic) can indicate whether a model is a good predictor (Legates & McCabe 1999).

RMSE is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data are around the line of best fit. RMSE is commonly used in climatology, forecasting, and regression analysis to verify experimental results. On the other hand, DC is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. Most researchers use these statistics since they provide thorough statistical information about models (Legates & McCabe 1999; Roushangar & Alizadeh 2018).

Input and output variables are normalized by scaling between 0 and 1 to eliminate their dimensions and to ensure that all variables receive equal attention during calibration of the model. Therefore, RMSE values are to be transformed between 0 and 1 (Nourani *et al.* 2016).
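The two criteria, computed on min-max normalized data so that RMSE also falls between 0 and 1, can be sketched as follows (DC is written here in the Nash-Sutcliffe form common in hydrological modeling; the series values are illustrative):

```python
import numpy as np

def normalize(x):
    """Min-max scaling to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def rmse(obs, pred):
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def dc(obs, pred):
    """Determination coefficient, Nash-Sutcliffe form."""
    return float(1 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2))

obs = normalize(np.array([12.0, 15.0, 14.0, 20.0, 25.0, 22.0, 18.0]))
pred = obs + np.array([0.02, -0.01, 0.03, -0.02, 0.01, 0.00, -0.03])
print(rmse(obs, pred), dc(obs, pred))  # small RMSE, DC close to 1
```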

## RESULTS AND DISCUSSION

Results of modeling are demonstrated in Figure 4(c). Based on these results, it was observed that the outcomes needed to be strengthened in terms of DC, RMSE, and MAE. The results of modeling using SVM are discussed in the next section.

### Optimizing of SVM parameters

In this study, SVM was used to predict the discharge values for hydrometric stations of the Souris River. Correct use of SVM requires selecting an appropriate kernel function and tuning the related hyper-parameters. A number of kernels have been discussed by researchers; however, studies suggest the effectiveness of the radial basis function (RBF) kernel for SVM in the majority of civil engineering applications (Gill *et al.* 2006; Goel & Pal 2009). For a fair comparison of results, an RBF kernel was utilized with SVM in this study, wherein *γ* is the kernel-specific parameter and *C* is the margin (regularization) parameter. The tuned parameters were optimized via a grid search procedure; therefore, we call the model optimized SVM (OSVM).

The RBF kernel function was selected as the core tool of OSVM and used for all the rest of the models. The OSVM needs the choice of three crucial variables, namely *ɛ*, *C*, and *γ*, in which *γ* is a variable of the RBF kernel function. In order to set the above-mentioned variables, a systematic grid search over the variables employing cross-validation on the training stage was performed (OSVM procedure). For this grid search, a normal range of variable settings was considered. Initially, adjusted values of *ɛ* and *C* for a given *γ* were obtained, and then *γ* was altered. Performance optimization procedures were applied to find the best values. Figure 5 illustrates the statistical parameter values for various gamma values of the SVM model (fed with the input configuration). Accordingly, optimal parameters were found for the rest of the models.
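The grid-search idea can be sketched as follows; for brevity, an RBF-kernel ridge regressor stands in for the full ε-SVM, and the candidate (γ, C) grids are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_predict(Xtr, ytr, Xte, gamma, C):
    """Kernel ridge regression (ridge strength 1/C) as an SVM stand-in."""
    K = rbf_kernel(Xtr, Xtr, gamma)
    alpha = np.linalg.solve(K + np.eye(len(Xtr)) / C, ytr)
    return rbf_kernel(Xte, Xtr, gamma) @ alpha

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 6, 120)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=120)
tr, te = np.arange(0, 120, 2), np.arange(1, 120, 2)   # hold-out split

# Score every (gamma, C) candidate on the hold-out set; keep the best pair
best = min(((gamma, C) for gamma in [0.01, 0.1, 1.0, 10.0]
            for C in [0.1, 1.0, 10.0, 100.0]),
           key=lambda gc: np.mean(
               (fit_predict(X[tr], y[tr], X[te], *gc) - y[te]) ** 2))
print(best)  # the (gamma, C) pair with the lowest validation error
```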

A data-driven model's performance mostly depends on its structure. Hence, in this study, a conceptual model was developed to predict the *Q-H* process of the Souris River using only one model. First, different types of classic and wavelet-based modeling strategies based on temporal features (i.e., stage-discharge time series) were developed to assess the connection between discharge and river stage over the Souris River. Next, a unique multi-station model was constructed for the hydrometric stations.

### Results of multi-station model

The forecasting model can be expressed as:

*Q*(*t* + 1) = *f*(*x*(*t* + 1)) + *ɛ*_{t+1}

where *Q*(*t* + 1) denotes the river discharge in any cross section of the river system, *ɛ*_{t+1} is the random error, and *x*(*t* + 1) is the input vector, which may include many variables such as antecedent flow data and other relevant variables at various time lags, but in the present application consists of combinations of antecedent daily *Q-H* records. It is evident that the training datasets should cover all the characteristics of the problem in order to achieve effective estimation. One of the most important steps in developing a satisfactory forecasting model is the selection of the input variables. Hence, different combinations of the antecedent *Q-H* are used to construct the appropriate input structure for all scenarios.

In order to determine input variables for AI-based models, the input data were extracted based on the principle underlying the RC, in which different model structures were selected by a combination of stage and/or discharge variables including *H _{t}*, *H _{t−1}*, *H _{t−2}*, …, *H _{t−7}*, representing stage values at times (*t*), (*t* − 1), (*t* − 2), …, (*t* − 7), and *Q _{t−1}*, *Q _{t−2}*, …, *Q _{t−7}*, representing discharge values at times (*t* − 1), (*t* − 2), …, (*t* − 7). Models 1–3 extract information from stage values alone; Models 4–6 extract information from discharge values alone with a structure similar to autocorrelation; and Models 7–10 extract information from both stage and discharge values. While it is clear that the RC is not the only technique, Table 2 shows the possible different choices of dependent variables for each modeling technique; these are required to have optimized structures that avoid over-fitting. In the present study, only the best structures are used to report the performance of the proposed models. The performance measures outlined in 'Efficiency criteria' were used to select the particular model structure, but other techniques were also used to assist the choice, e.g., scatter diagrams or plotting the difference between individual model performances. Based on the correlation analysis and modeling outcomes, Combs. 1, 2, 8, and 10 had better performance, as explained below.
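Assembling one such candidate combination into an input matrix with a 1-day-ahead target can be sketched as follows (the lag choices and series are illustrative, not Table 2's exact combinations):

```python
import numpy as np

def build_inputs(Q, H, q_lags, h_lags):
    """Stack lagged H and Q values as features; target is Q one day ahead."""
    start = max(q_lags + h_lags, default=0)
    rows, targets = [], []
    for t in range(start, len(Q) - 1):
        feats = [H[t - k] for k in h_lags] + [Q[t - k] for k in q_lags]
        rows.append(feats)
        targets.append(Q[t + 1])            # 1-day-ahead discharge
    return np.array(rows), np.array(targets)

Q = np.arange(10.0, 30.0)                   # synthetic daily discharge
H = np.arange(1.0, 21.0) / 10               # synthetic daily stage
# e.g., a combination using H(t), H(t-1) and Q(t-1), Q(t-2):
X, y = build_inputs(Q, H, q_lags=[1, 2], h_lags=[0, 1])
print(X.shape, y.shape)
```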

*t*demonstrates the time step (daily step),

*Q*represents discharge features and

*H*represents river stage features. The output consists of sole variable, i.e., Q at future time steps (Q(t + 1)). A SVM through RBF kernel was trained for all input combinations to predict 1-day-ahead discharge employing normalized discharge and river stage features for hydrometric stations. In order to determine the SVM results, the outcomes based on DC values as overall results have been demonstrated for different input combinations in Figure 6. According to Figure 6, the Combs. 1 and 2 did not perform adequately in the calibration and verification steps. It can be inferred that target discharge could not correlate with river stage-discharge values of 1 or 2 days ago. Among the Combs. 3 and 4, Comb. 4 could predict the discharge relatively accurately. It could be concluded that

*Q*(

*t*

*+*1) established a good relation with

*Q*and

*H*in previous time steps. Comb. 3 also employed 1e and 2 days ago values of river stage-discharge features. Although results showed improvement using Comb. 4 in comparison to other antecedent-based combinations, nevertheless, generally results are not assured. Towards recovering the efficiency of classic modeling strategy, WT was conjugated to the modeling process in this step. The WT was employed to decompose discharge and river stage time series. Using a trial-and-error process to obtain the optimum decomposition level eventuated to select level 5 (six subseries, one approximation and five details). In order to have a comprehensive overview on decomposition level, initially the following formula (which offers a minimum level of decomposition) was employed (Aussem

*et al.* 1998):

*L* = int[log(*N*)]

where *L* is the decomposition level and *N* is the number of time series data points. This empirical equation was derived for fully autoregressive signals and considers only the time series length, not the seasonal behavior of the process. To handle the seasonality of the process, a decomposition level of 5 (2⁵ = 32 days; approximately the monthly mode) was selected as the optimum level. In this way, seasonality up to one month could be handled by the model. Due to the proportional relationship between the discharge and river stage values, the two signals were assumed to share the same seasonality, and both time series were decomposed at the same level (i.e., level 5). Using the Daubechies-2 (db2), Daubechies-4 (db4), Meyer, and coif2 mother wavelets, both the discharge and river stage time series were decomposed. To prevent probable over-training and related issues, an attempt was made to select only the dominant subseries from the decomposed time series. In this study, feature selection was performed using the MI approach. MI, as a supervised measure, was computed between the input and target values. The input matrix was then established from the overall ranking of the subseries according to their MI values. The results are presented in Figure 6. According to Figure 6, db(4,5) outperformed the classic modeling (antecedent inputs) and the other wavelet-based inputs. Figure 7 presents scatterplots of observed against predicted values (best structures in the verification period) for the Foxholm station.
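The level-selection step can be made concrete with the sketch below, which applies the minimum-level formula and then performs a multilevel dyadic decomposition. The base-10 reading of log(*N*) and the use of the simple Haar filter (instead of db2/db4/Meyer/coif2, purely to keep the example dependency-free) are assumptions.

```python
# Sketch: choosing the decomposition level and producing one approximation
# plus L detail subseries. The Haar filter and the base-10 reading of the
# Aussem et al. (1998) formula are assumptions for illustration only.
import numpy as np

def min_decomposition_level(n):
    """L = int[log(N)]: minimum level from the series length alone."""
    return int(np.log10(n))

def haar_dwt_multilevel(x, level):
    """Return [approximation, detail_L, ..., detail_1] (dyadic pyramid)."""
    a, details = np.asarray(x, float), []
    for _ in range(level):
        if len(a) % 2:                      # pad to even length
            a = np.append(a, a[-1])
        approx = (a[::2] + a[1::2]) / np.sqrt(2)
        detail = (a[::2] - a[1::2]) / np.sqrt(2)
        details.append(detail)
        a = approx
    return [a] + details[::-1]

n = 3650                                    # ~10 years of daily records
print(min_decomposition_level(n))           # minimum level -> 3
subseries = haar_dwt_multilevel(np.sin(2 * np.pi * np.arange(n) / 32), 5)
print(len(subseries))                       # 1 approximation + 5 details -> 6
```

The jump from the formula's minimum level (3 for a daily record of this length) to level 5 is exactly the seasonality argument made in the text: level 5 reaches the ~32-day (monthly) mode.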

*Q-H* time series, took advantage of spatially varying parameters as model inputs. The geomorphological variables used in the multi-station model were, however, incompatible in dimension. To avoid possible conflicts arising from incorrect dimensionality in the resulting formulations, dimensionless values were used. This is a prevalent scientific practice in which units of measurement are effectively removed through the introduction of dimensionless ratios (Nourani *et al.* 2013). In this way, the dimensionless geomorphological parameters were applied in the model. Therefore, the model construction for each sub-basin could be represented as:

In Equation (8), *i* refers to the sub-basin number (*i* *=* 1, 2, 3, …, 6), *H_{i}(t − α)* is the river stage value with *α* days of lag time at the *i*th sub-basin, *A_{i}* is the drainage area of each sub-basin, and *A_{t}* is the total watershed area. To develop the multi-station model for the entire watershed, the data of all six stations were fed into the SVM framework. Since the *Q-H* values at the daily time scale are strongly affected by the recent conditions of the watershed, only values with up to 2 days of lag (i.e., *α* *=* 2) were used in the multi-station model. With these considerations in the SVM modeling, the multi-station model was calibrated and verified.
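A minimal sketch of how the pooled, dimensionless multi-station input matrix might be assembled is given below. The 2-day lag follows the text, while the six drainage areas and the random stage-discharge series are invented purely for illustration.

```python
# Sketch: one pooled input matrix for the multi-station model. Each row holds
# a station's lagged Q and H values (alpha = 2 days, as in the text) plus the
# dimensionless drainage-area ratio A_i / A_t. Area values are invented.
import numpy as np

def station_rows(q, h, area_ratio, alpha=2):
    """Rows: [Q(t-1), Q(t-2), H(t-1), H(t-2), A_i/A_t] -> target Q(t+1)."""
    X, y = [], []
    for t in range(alpha, len(q) - 1):
        X.append([q[t - 1], q[t - 2], h[t - 1], h[t - 2], area_ratio])
        y.append(q[t + 1])
    return np.array(X), np.array(y)

rng = np.random.default_rng(1)
areas = np.array([310.0, 450.0, 120.0, 800.0, 260.0, 530.0])  # hypothetical km^2
total = areas.sum()
X_all, y_all = [], []
for a in areas:                              # six stations, pooled into one model
    q = rng.random(100)                      # stand-in discharge series
    h = rng.random(100)                      # stand-in stage series
    X, y = station_rows(q, h, a / total)
    X_all.append(X)
    y_all.append(y)
X_all, y_all = np.vstack(X_all), np.concatenate(y_all)
print(X_all.shape)                           # (6 * 97, 5)
```

Because every row carries its own area ratio, a single SVM trained on this matrix can serve all stations at once, which is the point of the multi-station strategy.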

In order to find the most efficient structure of the proposed multi-station model, a sensitivity analysis was performed. To this end, the six best input combinations were considered to create the input matrix (Figure 8). Figure 8 shows the multi-station model's performance for the different input combinations. To isolate the contribution of the physical parameters, Comb. (1) was designed using only the temporal variables (i.e., discharge-gauge stage data).

According to the obtained results, imposing the dimensionless physical parameters along with the temporal dataset increased the modeling accuracy (Combs. (2)–(6)). Generally, Combs. (2)–(6) provided higher efficiency than Comb. (1). Conjugating WT with the temporal features in Comb. 5 gave a much sharper representation of those features, which led Comb. 5 to a better outcome. Building on the improvement in Comb. 5, the WT-EEMD-MI pre-processing approach was applied to find more dominant subseries. For this purpose, *details 1* and *2* of *db(4,5)* were further decomposed using EEMD, and MI was used to select the appropriate IMFs from the decomposed subseries. According to Comb. 6, *imfs 8* and *9* from *detail subseries 1* and *imfs 6* and *7* from *detail subseries 2* were selected via the proposed pre-processing approach. Figure 9 presents the decomposed time series of the discharge values by db(4,5)-EEMD-MI for the multi-station form, and Figure 10 shows the time series predicted by the proposed model (Comb. 6). Comparing the results obtained for Combs. (1)–(6) (Figure 7), it could be inferred that Comb. (6) was more efficient than the others. The input structure of this combination consisted of two temporal variables. In Figures 8 and 9, *Qd1* denotes *detail subseries 1*, and *Qd2(imf 6, 7)* denotes *detail subseries 2* decomposed via EEMD with *imfs 6* and *7* selected via MI; the remaining labels follow the same convention.

The proposed pre-processing approach comprised the WT, EEMD, and MI methods. WT decomposed the time series into subseries that grow in a dyadic form. A further breakdown was performed via EEMD, yielding subseries with more stationary properties and less noise. Since decomposing at two levels produced a large number of subseries (about 40–50), feeding them all into the SVM might have caused over-training and other issues. Therefore, MI was used to reduce the number of subseries, omit noise, and select the dominant subseries. Using WT-EEMD-MI improved the results by about 3–19% on average in comparison with the other models.
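The MI screening step can be sketched as follows, using a plain histogram estimator of mutual information. The binning scheme and the number of retained subseries are assumptions, and the candidate subseries here are random stand-ins rather than real wavelet/EEMD outputs.

```python
# Sketch: mutual-information screening of candidate subseries. A histogram
# MI estimator (binning choice is an assumption) ranks each candidate
# against the target; only the top-ranked subseries are kept, shrinking the
# ~40-50 candidates to a handful before SVM training.
import numpy as np

def mutual_info(x, y, bins=16):
    """MI in nats from a 2-D histogram estimate of p(x, y)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                  # avoid log(0) terms
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def select_dominant(candidates, target, keep=4):
    scores = [mutual_info(c, target) for c in candidates]
    order = np.argsort(scores)[::-1]              # highest MI first
    return list(order[:keep])

rng = np.random.default_rng(2)
target = np.sin(np.linspace(0, 20, 500))
candidates = [rng.standard_normal(500) for _ in range(40)]  # noise stand-ins
candidates[7] = target + 0.1 * rng.standard_normal(500)     # one informative
print(select_dominant(candidates, target)[0])               # -> 7
```

The informative subseries rises to the top of the ranking while the uninformative ones fall below the cut, which is exactly the pruning effect described above.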

### Other machine learning methods

ARIMAX, ELM, ANFIS, and ANN-RBF are other notable forecasting approaches, and they were employed as benchmark models to quantify the benefit of the proposed hybrid models via OSVM. The performance of such acknowledged models is a good criterion for assessing the OSVMs' capability as a nonlinear prediction approach. The structures of ARIMAX and ANN-RBF were optimized by a trial-and-error process; ELM, on the other hand, is a self-tuning regression tool. In a similar manner, and using the selected input combinations, the ANFIS model was also used for prediction of the river discharge. In ANFIS modeling, two substantial points must be considered: first, the ANFIS architecture (e.g., the number and type of MFs) and, second, the calibration procedure (e.g., the selection of the iteration number and efficiency criteria). The Gaussian membership function (gaussmf) was selected because of its accurate results. Figure 11 demonstrates the overall results of the OSVMs under the different scenarios compared with ARIMAX, SVM, ELM, ANFIS, and ANN-RBF. The best input in classic modeling for OSVM (Comb. 10) was used for the selected tools to allow a fair comparison. From the results in Figure 11, it is concluded that the OSVMs led to superior performance in comparison to the other regression tools. It was also concluded that ELM and ANFIS performed next best, in that order, while ANN-RBF performed relatively weakly and ARIMAX was the weakest among all approaches.
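The benchmark comparisons above rest on the DC and RMSE scores (plus MAE in the conclusions). A minimal sketch of these criteria is given below, assuming DC denotes the Nash-Sutcliffe-type determination coefficient; the paper's own 'Performance criteria' section is the authoritative definition.

```python
# Sketch of the comparison metrics. DC is computed here as the
# Nash-Sutcliffe-style determination coefficient, a common reading in this
# literature -- an assumption, since the defining section is not shown here.
import numpy as np

def dc(obs, sim):
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - ((obs - sim) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()

def rmse(obs, sim):
    return float(np.sqrt(np.mean((np.asarray(obs) - np.asarray(sim)) ** 2)))

def mae(obs, sim):
    return float(np.mean(np.abs(np.asarray(obs) - np.asarray(sim))))

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy observed discharges
sim = np.array([1.1, 1.9, 3.2, 3.8, 5.1])   # toy model predictions
print(round(dc(obs, sim), 3), round(rmse(obs, sim), 3), round(mae(obs, sim), 3))
```

DC approaches 1 for a perfect model and can go negative when the model is worse than the observed mean, which is why it is the headline score in Figures 6, 8, and 11.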

## CONCLUSIONS

In this research, a two-stage conceptual model was proposed and verified to reveal its quantitative and qualitative capabilities in predicting the river *Q-H* process. In the first stage, a novel DWT-EEMD-based pre-processing approach was used to capture signals with more stationary properties. Application of the model led to interesting results, which can be presented in two categories: the proposed hydrological models and the proposed pre-processing approach. A conceptual strategy, namely a multi-station model, was proposed to study the *Q-H* relation of the river. All of the proposed models have unique capabilities, as well as some deficiencies to be discussed.

Rating curve: the RC is a physically based model used to extract information from river stage datasets. The best outcome was obtained for the Verendry station (DC = 0.81, RMSE = 0.042, and MAE = 0.034), and the worst for the Westhope station (DC = 0.69, RMSE = 0.046, and MAE = 0.036).

Multi-station model: the proposed model is in continuous, multi-station form, and since it took advantage of watershed morphology data (the upstream drainage areas of the hydrometric stations), it behaved like a spatio-temporal model that could identify and predict the discharge values more accurately when combined with the proposed WT-EEMD-MI pre-processing approach. The multi-station model could handle the uncertainty and nonlinearity of the Souris River by exploring the interaction of the stations through the spatial and temporal variables, and its structure (input data and form of prediction) makes it applicable to predicting discharge values at an arbitrary point of the watershed (DC = 0.89 and RMSE = 0.18 in the verification period).

WT-EEMD-MI: the proposed pre-processing approach consisted of the wavelet transform, ensemble empirical mode decomposition, and mutual information approaches. WT decomposes the time series into subseries that grow in a dyadic form. The subseries carry the seasonal properties, which strengthens any AI-based model in predicting the time series more accurately. A further breakdown via EEMD yielded subseries with more stationary properties and less noise. Since decomposing at two levels produced a large number of subseries (about 40–50), feeding them all into the SVM might lead to over-training or other issues. Therefore, MI was used to select the dominant subseries, decrease the number of subseries, and omit noise. Using WT-EEMD-MI improved the results (by about 3–19%) in comparison with the other models.

This study also demonstrated the advantages of a new conceptual model. The multi-station model, which used multiple-point measurements, was designed to estimate the river discharge in multi-station form by training only one SVM model instead of several SVMs. This strategy was capable of forecasting at any point of interest in a river or watershed by including the temporal properties of the case study.

Developing such a model could be very useful in terms of both accuracy and the reduced time required for model development. For future studies, it is suggested that hydraulic properties of the *Q-H* process, such as the Froude number, Reynolds number, and Manning coefficient, be applied to verify these variables' capability to increase the accuracy of the multi-station model.

## ACKNOWLEDGEMENTS

The authors would like to express their gratitude to the East Azerbaijan Regional Water Company for data preparation and support of the present research. The authors whose names are listed in the paper have no affiliation with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.