Abstract
In this study, the least square support vector machines (LS-SVM) method was used to predict the longitudinal dispersion coefficient (DL) in natural streams, and its performance was compared with that of empirical equations on several datasets. Three datasets of field data, comprising hydraulic and geometrical characteristics of different rivers with differing statistical characteristics, were used to evaluate the LS-SVM and 15 empirical equations. The models were compared using the root mean square error (RMSE), standard error (SE), mean bias error (MBE), discrepancy ratio (DR), Nash-Sutcliffe efficiency (NSE) and coefficient of determination (R2). The results demonstrated that the LS-SVM method predicts DL well in the different datasets, with RMSE = 58–82 m2 s−1, SE = 24–39 m2 s−1, MBE = −1.95 to 2.6 m2 s−1, DR = 0.08–0.13, R2 = 0.76–0.88, and NSE = 0.75–0.87, as compared with previous empirical equations. It can be concluded that the proposed LS-SVM model can be successfully applied to predict DL for a wide range of river characteristics.
HIGHLIGHTS
Least square support vector machines and 15 empirical equations were selected to predict longitudinal dispersion coefficient in natural streams.
Experimental datasets from streams around the world, consisting of the depth, width, mean velocity, shear velocity, and longitudinal dispersion coefficient, were used.
Comprehensive statistical analysis was performed to evaluate the applied model accuracy.
ABBREVIATIONS
Measured longitudinal dispersion
Predicted longitudinal dispersion
- C
Cross-sectional average concentration
- DL
Longitudinal dispersion coefficient
- DR
Discrepancy ratio
- g
Acceleration due to gravity
- G1-65
Group 1 data including 65 data sets
- G2-116
Group 2 data including 116 data sets
- G3-188
Group 3 data including 188 data sets
- H
Mean cross-sectional depth
- LIU
- LS-SVM
Least square support vector machines
- MAE
Mean absolute error
- MBE
Mean bias error
- N
Number of observations
- NSE
Nash-Sutcliffe efficiency
- POMGGP
Pareto-optimal-multigene genetic programming
- R
Hydraulic radius
- R2
Coefficient of determination
- RMSE
Root mean square error
- S
Slope of the total energy line in downstream direction
- t
Time of observation
- U
Mean longitudinal velocity
- U*
Bottom shear friction velocity
- x
Longitudinal distance
INTRODUCTION
To date, many analytical and numerical solutions have been developed for the advection-dispersion equation (ADE, Equation (1)) with different boundary conditions. It was generally found that the value of DL can be estimated from the pollutant concentration profile, the stream velocity profile, or channel and flow parameters.
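For reference, the one-dimensional ADE is commonly written in the form below, using the symbols of the abbreviation list (C, t, x, U and DL). This is the standard textbook form and is only assumed to correspond to Equation (1) of this paper.

```latex
\frac{\partial C}{\partial t} + U \frac{\partial C}{\partial x} = D_L \frac{\partial^2 C}{\partial x^2}
```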
In general, studies on the prediction of the longitudinal dispersion coefficient can be divided into four main categories: tracer experiments, empirical equations, artificial intelligence (AI), and combinations of machine learning and evolutionary algorithms.
Prediction of DL using tracer experiments has been performed by many researchers (Palancar et al. 2003; Seo & Baek 2004; Disley et al. 2015). Despite their high accuracy in predicting DL, these methods have many limitations, such as non-uniform flow characteristics along the river and large variations of velocity and concentration across the width and depth of the flow; they are also costly and time-consuming in field and experimental studies.
In the six decades since Elder (1959) presented the first empirical equation to predict DL, many researchers have proposed methods for highly accurate prediction of DL in natural streams. These methods include analytical, numerical and data-driven methods. In addition, a great deal of research over the last decade has addressed machine learning approaches to predict DL.
Fischer (1968) proposed the routing approach using an analytical solution of Equation (1) by numerical integration. Subsequently, numerical solutions were applied for the prediction of DL (Ramezani et al. 2019).
Since the publication of Fischer (1975), new and highly accurate equations have been introduced for determining DL (Deng et al. 2002; Kashefipour & Falconer 2002; Sahay & Dutta 2009; Etemad-Shahidi & Taghipour 2012; Sattar & Gharabaghi 2015; Haghiabi 2016; Wang & Huai 2016; Alizadeh et al. 2017b).
In this study, a complete set of empirical equations for predicting DL was selected from the literature (Table 1), and their results were compared with those of the least square support vector machines (LS-SVM) method.
Empirical equations for prediction of DL
Model | Developed equation | #dataset | Abbreviation |
---|---|---|---|
Elder (1959) | ![]() | – | EL |
Fischer (1975) | ![]() | – | FI |
Liu (1977) | ![]() | – | LIU |
Seo & Cheong (1998) | ![]() | 59 | SC |
Deng et al. (2001) | ![]() | – | DE |
Kashefipour & Falconer (2002) | ![]() | 81 | KF |
Sahay & Dutta (2009) | ![]() | 65 | SD |
Etemad-Shahidi & Taghipour (2012) | ![]() | 149 | ET |
Li et al. (2013) | ![]() | 65 | LI |
Zeng & Huai (2014) | ![]() | 116 | ZH |
Disley et al. (2015) | ![]() | 56 | DI |
Sattar & Gharabaghi (2015) | ![]() | 150 | SG |
Wang et al. (2017) | ![]() | 116 | WA |
Alizadeh et al. (2017a) | ![]() | 164 | AL |
Riahi-Madvar et al. (2019) | ![]() | 503 | RI |
Note: B, H, U, U*, g, Fr, and DL are, respectively, the width, depth, cross-sectional averaged velocity, shear velocity, acceleration due to gravity, Froude number and longitudinal dispersion coefficient.
In recent decades, artificial intelligence approaches have been used for predicting DL. Examples include artificial neural networks (ANN) (Toprak & Cigizoglu 2008; Alizadeh et al. 2017b; Riahi-Madvar et al. 2019), the adaptive neuro-fuzzy inference system (ANFIS) (Riahi-Madvar et al. 2009; Noori et al. 2016), support vector machines (Noori 2009; Azamathulla & Wu 2011) and genetic programming (Sahay & Dutta 2009; Azamathulla & Ghani 2011). These studies reported satisfactory results in predicting the longitudinal dispersion coefficient.
In addition, new methods such as AI (Toprak & Cigizoglu 2008; Azamathulla & Wu 2011; Etemad-Shahidi & Taghipour 2012; Noori et al. 2016), evolutionary algorithms (Sahay & Dutta 2009; Li et al. 2013; Sattar & Gharabaghi 2015; Alizadeh et al. 2017b; Riahi-Madvar et al. 2019) and hybrid models (Najafzadeh & Tafarojnoruz 2016; Alizadeh et al. 2017a; Seifi & Riahi-Madvar 2019) have been used for predicting DL.
It should be mentioned that previous research has mainly focused on developing a highly accurate method to predict DL for a specific dataset, and the accuracy of each method has not been investigated across different datasets. There are two general aspects to the prediction of DL: (1) the method used for prediction and (2) the datasets used to evaluate the accuracy of that method.
In this study, the LS-SVM method was used to predict the DL in natural streams. Furthermore, the results were compared with empirical equations (Table 1) in the various datasets. Although the new methods have increased the prediction accuracy of the longitudinal dispersion coefficient, the increase in accuracy greatly depends on the statistical characteristics of the used datasets. Therefore, one of the aims of this paper is to evaluate the accuracy of different methods for different datasets and to compare the results of previous methods with the new methods.
MATERIAL AND METHODS
Experimental data
To evaluate the accuracy of the empirical equations and LS-SVM models for the prediction of DL, a wide range of longitudinal dispersion laboratory and field data was collected from different sources (Toprak & Cigizoglu 2008; Sahay & Dutta 2009; Li et al. 2013; Zeng & Huai 2014; Alizadeh et al. 2017a; Wang et al. 2017). The details of these datasets are presented in the appendix Table 8. Following the literature, the data were divided into three categories: (i) Table 8, rows 1–65, 'G1-65' (Sahay & Dutta 2009); (ii) Table 8, rows 1–116, 'G2-116' (Zeng & Huai 2014); and (iii) Table 8, rows 1–188, 'G3-188' (Alizadeh et al. 2017b). Each category therefore contains the previous one. The goal of this categorization is to evaluate the accuracy of the models in predicting DL for different datasets.
The studied parameters consist of the depth, H (m), the width, B (m), the mean velocity, U (m s−1), the shear velocity of the flow, U* (m s−1), and the longitudinal dispersion coefficient, DL (m2 s−1) (Table 8). These datasets were selected for two reasons: (1) they have been used by many researchers (Seo & Cheong 1998; Deng et al. 2001; Kashefipour & Falconer 2002; Toprak & Cigizoglu 2008; Sahay & Dutta 2009; Etemad-Shahidi & Taghipour 2012; Li et al. 2013; Zeng & Huai 2014; Disley et al. 2015; Wang & Huai 2016; Alizadeh et al. 2017a), so the results are comparable with those obtained by other researchers; and (2) they represent a wide range of geometrical (B, H) and hydrodynamic (U, U*) parameters of natural streams. Differences in statistical properties across the datasets can reveal how well different methods estimate the longitudinal dispersion coefficient. In the following, the datasets used in the present study are analyzed in terms of statistical properties, frequency ratios and regression relationships.
Statistical characteristics of data groups
Statistical characteristics of the studied data groups (G1-65, G2-116 and G3-188), such as minimum (Min), maximum (Max), average (AVG), standard deviation (STD), coefficient of variation (CV), skewness (SKW) and kurtosis (KUT) of the datasets are shown in Table 2.
Statistical characteristics of the studied datasets
Data category | Par. | B | H | U | U* | DL | B/H | U/U* | DL/HU* |
---|---|---|---|---|---|---|---|---|---|
G1-65 | Min-Max | 11.9–202.7 | 0.2–4.0 | 0.0–1.7 | 0.0–0.6 | 1.9–836.1 | 13.6–151.1 | 1.3–17.0 | 5.9–8625.0 |
G1-65 | AVG | 53.80 | 1.24 | 0.49 | 0.10 | 80.51 | 49.12 | 6.80 | 1083.17 |
G1-65 | STD | 47.30 | 0.99 | 0.35 | 0.09 | 131.82 | 29.89 | 3.70 | 1408.53 |
G1-65 | CV | 0.90 | 0.79 | 0.71 | 1.01 | 1.64 | 0.61 | 0.54 | 1.30 |
G1-65 | SKW | 1.70 | 1.23 | 1.62 | 3.82 | 3.86 | 1.45 | 1.12 | 3.32 |
G1-65 | KUT | 2.50 | 0.57 | 2.69 | 16.69 | 17.99 | 1.89 | 0.79 | 13.54 |
G2-116 | Min-Max | 11.9–711.2 | 0.2–19.9 | 0.0–1.7 | 0.0–1.0 | 1.9–1486.5 | 12.5–1000.0 | 0.2–62.9 | 6.2–40,183.9 |
G2-116 | AVG | 90.90 | 1.76 | 0.56 | 0.09 | 130.73 | 72.67 | 9.55 | 1937.89 |
G2-116 | STD | 111.20 | 2.26 | 0.41 | 0.12 | 230.60 | 133.39 | 9.06 | 5267.28 |
G2-116 | CV | 1.20 | 1.29 | 0.73 | 1.35 | 1.76 | 1.84 | 0.95 | 2.72 |
G2-116 | SKW | 2.90 | 5.03 | 1.20 | 5.22 | 3.23 | 5.77 | 3.29 | 6.00 |
G2-116 | KUT | 11.10 | 36.42 | 0.55 | 32.43 | 12.55 | 34.20 | 14.70 | 38.71 |
G3-188 | Min-Max | 1.4–711.2 | 0.1–19.9 | 0.0–1.7 | 0.0–1.0 | 0.2–1486.5 | 2.2–1000.0 | 0.2–62.9 | 3.1–40,183.9 |
G3-188 | AVG | 73.30 | 1.70 | 0.52 | 0.09 | 109.10 | 58.46 | 8.61 | 1533.40 |
G3-188 | STD | 94.43 | 2.04 | 0.37 | 0.10 | 206.67 | 107.71 | 7.81 | 4484.85 |
G3-188 | CV | 1.29 | 1.20 | 0.71 | 1.16 | 1.89 | 1.84 | 0.91 | 2.92 |
G3-188 | SKW | 3.39 | 4.65 | 1.30 | 5.35 | 3.44 | 7.07 | 3.45 | 6.70 |
G3-188 | KUT | 15.86 | 34.58 | 1.28 | 38.17 | 14.38 | 53.52 | 18.16 | 49.59 |
Note: Bold and Italic numbers show maximums among the datasets.
Table 2 shows that the parameter ranges (Min-Max) in the G3-188 group are at least as wide as those in the other two groups (G1-65 and G2-116). The values of AVG and STD in the G2-116 group are higher than in the other two groups (G1-65 and G3-188). The CV is highest in G2-116 for some parameters (H, U, U* and U/U*) and in G3-188 for others (B, DL, B/H and DL/HU*). SKW and KUT for most of the parameters (e.g. B, U*, B/H, U/U* and DL/HU*) are greater in the G3-188 group than in the other two groups (G1-65 and G2-116).
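The indices in Table 2 can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `describe`, the sample values, the use of the sample standard deviation and the use of excess kurtosis are all assumptions made for illustration.

```python
import numpy as np

def describe(values):
    """Return Min, Max, AVG, STD, CV, SKW and KUT for one parameter."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std(ddof=1)                  # sample standard deviation (assumed)
    z = (x - mean) / x.std(ddof=0)       # standardized values for the moments
    return {
        "Min": float(x.min()),
        "Max": float(x.max()),
        "AVG": float(mean),
        "STD": float(std),
        "CV": float(std / mean),         # coefficient of variation
        "SKW": float(np.mean(z ** 3)),   # moment-based skewness
        "KUT": float(np.mean(z ** 4) - 3.0),  # excess kurtosis (assumed)
    }

depth = [0.5, 1.0, 1.5, 2.0, 5.0]        # hypothetical depths H (m)
stats = describe(depth)
print(stats)
```

A positive SKW, as reported for every parameter in Table 2, indicates a long right tail, that is, a few rivers with much larger values than the rest.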
Frequency analysis
In order to provide an accurate interpretation of the data, the histograms of the ratios of B/H, U/U* and DL/HU* are shown in Figure 1(a)–1(c) for three datasets.
Frequency histogram of the datasets: (a) B/H; (b) U/U*; and (c) DL/HU*.
Figure 1 shows the frequency percentages of the ratios B/H, U/U*, and DL/HU* for the studied datasets. For about 30, 40, and 50% of the data, the values of B/H, U/U* and DL/HU* lie in the ranges 30–50, 5–10 and 200–1000, respectively. About 70% of the collected data (Table 8) were used for training (chosen randomly until the best training performance was obtained), while the remaining data (about 30%) were used for testing the models.
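A random 70/30 partition of this kind can be sketched as follows. The helper `split_indices` and the fixed seed are hypothetical; the study repeated the random selection until the best training performance was obtained, which this sketch does not reproduce.

```python
import numpy as np

def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Randomly permute row indices and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

# 188 rows, as in the G3-188 group
train_idx, test_idx = split_indices(188)
print(len(train_idx), len(test_idx))
```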
Regression analysis
The relationships between DL/HU* and B/H, U/U* and Fr are shown for the G1-65, G2-116 and G3-188 data groups in Figure 2(a)–2(c), respectively. In the G1-65 group, the highest coefficient of determination (R2 = 0.253) is obtained for the relationship between DL/HU* and U/U* (Figure 2(a)).
Relationship between DL/HU* vs. B/H, U/U* and Fr = U/√gH in (a) G1-65; (b) G2-116; and (c) G3-188 datasets.
In addition, as can be seen in Figure 2(b), the maximum value of R2 (=0.351) in the G2-116 group was observed between DL/HU* and B/H. It should be mentioned that the relationship between DL/HU* and Fr in the G1-65 group is very poor. In general, the relationships between DL/HU* and B/H, U/U* and Fr are not strong in any of the datasets, so methods are needed that can estimate DL accurately for any dataset.
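The dimensionless groups used in this analysis follow directly from the measured variables. The sketch below computes them and an R2 value for one pair; the sample values are invented, and taking R2 as the squared Pearson correlation of a linear fit is an assumption, since the type of regression used in Figure 2 is not stated here.

```python
import numpy as np

G = 9.81  # acceleration due to gravity (m s^-2)

def dimensionless_groups(B, H, U, U_star, DL):
    """Compute B/H, U/U*, Fr = U/sqrt(gH) and DL/HU* from raw measurements."""
    B, H, U, U_star, DL = (np.asarray(a, dtype=float) for a in (B, H, U, U_star, DL))
    return {
        "B/H": B / H,
        "U/U*": U / U_star,
        "Fr": U / np.sqrt(G * H),
        "DL/HU*": DL / (H * U_star),
    }

def r_squared(x, y):
    """Squared Pearson correlation (R2 of a straight-line fit)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)

# Hypothetical measurements for three river reaches
groups = dimensionless_groups(
    B=[20.0, 40.0, 60.0], H=[1.0, 2.0, 1.5],
    U=[0.5, 0.8, 0.6], U_star=[0.05, 0.08, 0.10], DL=[10.0, 40.0, 30.0],
)
r2 = r_squared(groups["B/H"], groups["DL/HU*"])
print(groups, r2)
```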
Least square support vector machines (LS-SVM)
Support vector machines were first used for classification (Vapnik 1999); another version of SVMs was then proposed by Drucker et al. (1997). In this method, a concept known as structural risk minimization is used to minimize the error of the model, while other methods (such as ANN) use the principle of empirical risk minimization (Cristianini et al. 2000; Dibike et al. 2001). In general, the SVM is used for classification into two or more groups and for regression analysis. In the SVM model, quadratic programming is used to solve the equations, making the problem complex and time-consuming (Seyedzadeh et al. 2019). The LS-SVM formulation replaces the quadratic program with a system of linear equations.
One of the factors affecting the prediction accuracy in the LS-SVM model is the selection of an appropriate kernel function. In this study, radial basis functions (RBF) were investigated. Figure 3 shows the schematic view of the flowchart used for the LS-SVM model. Table 3 presents the LS-SVM characteristics applied in the present study.
Characteristics of LS-SVM in the study
Input variables | Width (B), depth (H), velocity (U), and shear velocity (U*) of the flow |
---|---|
Target variable | Longitudinal dispersion coefficient (DL) |
Function estimation | Gaussian |
Kernel function | Radial basis function (RBF) |
Tuning parameters (γ, σ2) | γ = 10 and σ2 = 0.2 |
Selection function | Randomize selection (Randper's function) |
Datasets ratio in train and test phases | 70 and 30% |
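To illustrate how an LS-SVM regressor with an RBF kernel and the tuning parameters of Table 3 operates, the sketch below solves the standard LS-SVM linear system with NumPy. It is not the authors' implementation: the function names, the synthetic data, the kernel form exp(−‖x − z‖²/σ2) and the assumption that inputs are scaled to [0, 1] are all illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """RBF kernel matrix, K[i, j] = exp(-||A_i - B_j||^2 / sigma2) (assumed form)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def lssvm_fit(X, y, gamma=10.0, sigma2=0.2):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b, alpha]^T = [0, y]^T."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]      # bias b, support values alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma2=0.2):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b

# Synthetic stand-in for scaled inputs (B, H, U, U*) and a smooth target
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(40, 4))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.5 * X[:, 3]

b, alpha = lssvm_fit(X, y)
y_hat = lssvm_predict(X, b, alpha, X)
print(float(np.sqrt(np.mean((y - y_hat) ** 2))))
```

In this formulation every training point receives a support value, and the training residual at point i equals alpha_i / γ, so a larger γ fits the training data more closely.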
Evaluation performance criteria
Many performance criteria have been used to evaluate models for the prediction of DL. In the present study, six statistical criteria, namely the root mean square error (RMSE), standard error (SE), mean bias error (MBE), discrepancy ratio (DR), R2, and Nash-Sutcliffe efficiency (NSE), were applied to evaluate the accuracy of each model in predicting the longitudinal dispersion coefficient. The indices used are explained in Table 4.
Characteristics of evaluation performance criteria used in the present study
Criteria | Formula | References | Best value |
---|---|---|---|
Standard error (SE) | ![]() | Alizadeh et al. (2017a) | 0 |
Root mean square error (RMSE) | ![]() | Ma & Iqbal (1984) | 0 |
Mean bias error (MBE) | ![]() | Ma & Iqbal (1984) | 0 |
Discrepancy ratio (DR) | ![]() | Zeng & Huai (2014) | 0 |
Nash-Sutcliffe efficiency (NSE) | ![]() | Alizadeh et al. (2017a) | 1 |
Coefficient of determination (R2) | ![]() | Behar et al. (2015) | 1 |
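The sketch below computes five of these indices using their commonly used definitions. Because the formulas in Table 4 are not reproduced here, the definitions are assumptions: the bias is taken as predicted minus measured, DR as the mean of log10(predicted/measured), and R2 as the squared Pearson correlation. SE is omitted because its exact form cannot be inferred from the text.

```python
import numpy as np

def evaluate(measured, predicted):
    """Return RMSE, MBE, mean DR, NSE and R2 for paired DL values."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - m                                   # sign convention assumed
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MBE": float(np.mean(err)),
        "DR": float(np.mean(np.log10(p / m))),    # 0 for a perfect prediction
        "NSE": float(1.0 - np.sum(err ** 2) / np.sum((m - m.mean()) ** 2)),
        "R2": float(np.corrcoef(m, p)[0, 1] ** 2),
    }

measured = [10.0, 20.0, 40.0, 80.0]     # hypothetical DL values (m2 s-1)
predicted = [12.0, 18.0, 44.0, 76.0]
scores = evaluate(measured, predicted)
print(scores)
```

Note that NSE can be negative when a model performs worse than simply predicting the mean of the measurements, whereas R2 defined as a squared correlation cannot.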
In addition to statistical indices presented in Table 4, the model performance was assessed using predicted and measured DL values to calculate the standard deviation (STD), centered root mean square difference (RMSD), and correlation coefficient (R), as summarized by the Taylor diagram (Taylor 2001).
RESULTS AND DISCUSSION
Results of LS-SVM
The results of predicting the value of DL using LS-SVM model in 10 consecutive runs are shown in Table 5. In addition, the best results among 10 consecutive runs of LS-SVM models for G1-65 datasets are presented in Figure 4.
Results of empirical equations for prediction of longitudinal dispersion coefficient
Data group | Eq. Abb. | RMSE | SE | MBE | DR | R2 | NSE |
---|---|---|---|---|---|---|---|
G1-65 | KF | 51 | 35 | −10.7 | −0.16 | 0.85 | 0.85 |
G1-65 | SD | 46 | 31 | 8.9 | 0.03 | 0.90 | 0.88 |
G1-65 | ET | 68 | 38 | −13.2 | −0.09 | 0.76 | 0.73 |
G1-65 | LI | 39 | 26 | −4.3 | −0.07 | 0.91 | 0.91 |
G1-65 | ZH | 52 | 33 | −10.9 | −0.06 | 0.88 | 0.84 |
G1-65 | DI | 63 | 33 | −10.1 | 0.00 | 0.84 | 0.76 |
G1-65 | SG | 52 | 32 | −0.6 | 0.03 | 0.85 | 0.84 |
G1-65 | WA | 76 | 40 | −21.9 | −0.12 | 0.77 | 0.66 |
G1-65 | AL | 49 | 32 | −8.3 | −0.15 | 0.88 | 0.86 |
G1-65 | LS-SVM | 58 | 24 | −0.6 | 0.10 | 0.76 | 0.75 |
G2-116 | KF | 252 | 94 | 34.9 | −0.06 | 0.46 | −0.20 |
G2-116 | SD | 316 | 116 | 80.7 | 0.14 | 0.43 | −0.90 |
G2-116 | ET | 190 | 81 | −9.5 | −0.03 | 0.40 | 0.31 |
G2-116 | LI | 274 | 97 | 55.5 | 0.05 | 0.46 | −0.43 |
G2-116 | ZH | 201 | 80 | 9.5 | 0.03 | 0.44 | 0.23 |
G2-116 | DI | 239 | 93 | 6.7 | 0.07 | 0.27 | −0.09 |
G2-116 | SG | 285 | 100 | 37.5 | 0.11 | 0.31 | −0.54 |
G2-116 | WA | 180 | 78 | −27.1 | −0.05 | 0.41 | 0.39 |
G2-116 | AL | 283 | 95 | 45.7 | −0.08 | 0.48 | −0.52 |
G2-116 | LS-SVM | 82 | 39 | 2.6 | 0.08 | 0.88 | 0.87 |
G3-188 | KF | 213 | 86 | 25.1 | 0.01 | 0.45 | −0.07 |
G3-188 | SD | 257 | 92 | 49.1 | 0.15 | 0.43 | −0.56 |
G3-188 | ET | 167 | 69 | −16.1 | 0.00 | 0.40 | 0.34 |
G3-188 | LI | 226 | 80 | 31.4 | 0.07 | 0.46 | −0.20 |
G3-188 | ZH | 172 | 68 | 0.6 | 0.07 | 0.44 | 0.31 |
G3-188 | DI | 201 | 78 | 0.9 | 0.14 | 0.29 | 0.05 |
G3-188 | SG | 236 | 87 | 27.9 | 0.19 | 0.32 | −0.31 |
G3-188 | WA | 161 | 71 | −21.9 | 0.03 | 0.41 | 0.39 |
G3-188 | AL | 234 | 77 | 20.6 | −0.07 | 0.47 | −0.29 |
G3-188 | LS-SVM | 76 | 33 | −1.95 | 0.13 | 0.87 | 0.86 |
Note: Rank 1 (Bold, Italic, and Underline), Rank 2 (Bold and Italic) and Rank 3 (Bold).
(a) Measured and predicted values of DL, (b) Errors in prediction, and (c) Distribution of errors of LS-SVM model for prediction of DL in G1-65 datasets.
As can be seen in Figure 4(a), the accuracy of the LS-SVM model in the training phase is better than in the test phase. Figure 4(b) clearly shows that the error in the prediction of DL is greater in the test phase than in the training phase. Figure 5 shows the best results of the LS-SVM model among 10 consecutive runs for the G2-116 datasets.
(a) Measured and predicted values of DL, (b) Errors in prediction, and (c) Distribution of errors of LS-SVM model for prediction of DL in G2-116 datasets.
As can be seen in Figure 5(a), based on the values of RMSE and R2, the accuracy of the model in the training phase is better than in the testing phase; however, the value of R2 in G2-116 datasets in the testing stage is higher than that of the G1-65 datasets. Figure 5(b) shows that errors in the prediction of DL are similar in train and test phases (G2-116 datasets). Furthermore, the best results of the LS-SVM model in 10 consecutive runs for G3-188 datasets are shown in Figure 6.
(a) Measured and predicted values of DL, (b) Errors in prediction, and (c) Distribution of errors of LS-SVM model for prediction of DL in G3-188 datasets. LDC, longitudinal dispersion coefficient.
Figure 6(a) shows similar results to those in Figures 4(a) and 5(a). In general, as can be seen in Figures 4–6, the LS-SVM model has an appropriate performance in the prediction of longitudinal dispersion coefficient for all datasets. In addition, the results show that as the number of datasets increases, the value of the R2 increases, especially at the testing phase, so that the value of R2 increases from 0.46 in the G1-65 datasets to 0.81 in the G3-188 datasets.
Comparing the results of models
The statistical criteria were calculated for the nine top-ranked empirical equations and the results are presented in Table 5. In addition, the results of the LS-SVM model were calculated based on the average of 10 consecutive runs.
Table 5 shows the statistical criteria for the nine top-ranked empirical equations among those listed in Table 1. Six empirical equations, namely Elder (1959), Fischer (1975), Liu (1977), Seo & Cheong (1998), Deng et al. (2001) and Riahi-Madvar et al. (2019), are omitted from Table 5 because of their poor results. The results in Table 5 can be interpreted from two different aspects: (1) the values of the statistical indices in different datasets and (2) the best equations in different datasets.
The best statistical criteria in Table 5 were obtained for the G1-65 datasets: the average values of RMSE and R2 over all models are 55 and 0.84 for G1-65, compared with 230 and 0.45 for G2-116. The results indicate that the average and standard deviation of the data have a significant impact on the prediction of DL, because, according to Table 2, the AVG and STD of the G2-116 datasets are higher than those of the other two datasets. Another reason for the low accuracy of the empirical equations in predicting DL in the G2-116 datasets is the low correlation of DL with the river and flow characteristics (Figure 2(b)).
The best equations for the G1-65 datasets (Table 5) are Li et al. (2013) (LI) and Sahay & Dutta (2009) (SD); the main reason is that these equations were derived from the same datasets (G1-65).
In the G2-116 and G3-188 datasets, the best results are obtained by the LS-SVM method, which achieved the best values of RMSE, SE, R2 and NSE, although some empirical equations give a mean DR (and, in G3-188, an MBE) closer to zero. After the LS-SVM method, Etemad-Shahidi & Taghipour (2012) (ET), Zeng & Huai (2014) (ZH) and Wang et al. (2017) (WA) are the most accurate methods. The error indices (RMSE and SE) in the prediction of DL using the empirical equations and the LS-SVM method are shown in Figure 7.
Measured and predicted values of DL by empirical equations and LS-SVM model for (a) G1-65, (b) G2-116, and (c) G3-188 datasets.
As can be seen in Figure 7(a), the LS-SVM method has the lowest SE (SE = 22) for the prediction of DL in the G1-65 datasets, although the Li et al. (2013) (LI) equation achieved the lowest RMSE (RMSE = 39) in these datasets. In addition, Figure 7(b) shows that the LS-SVM method, with RMSE = 62 and SE = 36, has the best performance in the prediction of DL in the G2-116 datasets. Furthermore, as shown in Figure 7(c), the LS-SVM method (with RMSE = 60 and SE = 31) gives the best results for the prediction of DL in the G3-188 datasets.
Figure 7(b) and 7(c) illustrate the low accuracy of the empirical equations in predicting the longitudinal dispersion coefficient, especially in wide rivers (three measurements on the Mississippi River, rows 91–93 in the appendix Table 8). This shows that the empirical equations are inappropriate for the prediction of DL in wide rivers. Figure 8 shows the variation of the dispersion coefficient with river width for the three datasets.
The variation of dispersion coefficient according to the river width for (a) G1-65, (b) G2-116 and (c) G3-188 datasets.
As can be seen in Figure 8(a)–8(c), the accuracy of the LS-SVM predictions improves as the river width increases. On the other hand, Figure 8(c) shows that the lowest accuracy of LS-SVM in predicting the longitudinal dispersion coefficient occurs for the lowest measured values (rows No. 117, 143 and G3-188 in the appendix Table 8). For further comparison, the percentage of the DR values in different ranges of discrepancy is shown in Figure 9.
The histogram of DR values for six models (a) G1-65, (b) G2-116, and (c) G3-188 datasets.
The accuracy of each model may be characterized by the number of DR values between −0.3 and 0.3, relative to the total number of data points. This range was selected because the maximum acceptable error in predicting DL relative to the corresponding measured values is ±100% (Kashefipour & Falconer 2002). As can be seen in Figure 9(a)–9(c), the LS-SVM method obtained the best accuracy: in the G1-65, G2-116 and G3-188 datasets, about 83, 72 and 70% of the DR values, respectively, lie between −0.3 and 0.3. Figure 10(a)–10(c) shows the consistency of the measured and predicted values of the longitudinal dispersion coefficient in the G1-65, G2-116, and G3-188 datasets, based on the values of R2 and NSE.
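The share of predictions inside the accepted DR range can be computed as sketched below. The function name and sample values are hypothetical, and DR is assumed to be log10(predicted/measured) evaluated for each data point.

```python
import numpy as np

def dr_accuracy(measured, predicted, limit=0.3):
    """Percentage of points whose discrepancy ratio lies within [-limit, limit]."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    dr = np.log10(p / m)
    return float(100.0 * np.mean(np.abs(dr) <= limit))

# Hypothetical values: two predictions fall inside the range, two outside
acc = dr_accuracy([10.0, 20.0, 40.0, 80.0], [15.0, 50.0, 30.0, 8.0])
print(acc)  # 50.0
```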
Scatterplots of the measured and predicted values of longitudinal dispersion coefficients (a) G1-65, (b) G2-116, and (c) G3-188 datasets.
It can be generally observed from Figure 10(a)–10(c) that the LS-SVM model yields suitable predictions of the measured data. The best correlation between measured and predicted values of DL is achieved by the LS-SVM method, with values of 0.92, 0.93 and 0.92 in the G1-65, G2-116 and G3-188 datasets, respectively.
The Taylor diagram (Figure 11) was used to compare different performance indices. According to Taylor's diagram, the closer the model to the measured point, the more accurate the model will be in predicting the longitudinal dispersion coefficient.
Figure 11(b) and 11(c) indicate that the LS-SVM method considerably increased the accuracy of the prediction of DL. Furthermore, the models can be ranked by efficiency as follows: LS-SVM, Etemad-Shahidi & Taghipour (2012), Zeng & Huai (2014), Wang et al. (2017), Alizadeh et al. (2017a) and Li et al. (2013). The superiority of the LS-SVM method over the previous equations is clear from the Taylor diagram.
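The three quantities summarized by a Taylor diagram are linked by a law-of-cosines relation, which is what allows them to be drawn on a single plot. The sketch below computes them for invented values and evaluates that relation; population standard deviations are assumed, and the function name is hypothetical.

```python
import numpy as np

def taylor_stats(measured, predicted):
    """Return STD of measured, STD of predicted, correlation R and centered RMSD."""
    m = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    std_m, std_p = m.std(), p.std()               # population standard deviations
    r = np.corrcoef(m, p)[0, 1]
    crmsd = np.sqrt(np.mean(((p - p.mean()) - (m - m.mean())) ** 2))
    return float(std_m), float(std_p), float(r), float(crmsd)

std_m, std_p, r, crmsd = taylor_stats([10.0, 20.0, 40.0, 80.0],
                                      [12.0, 18.0, 44.0, 76.0])

# Law of cosines behind the diagram: RMSD^2 = STD_m^2 + STD_p^2 - 2 STD_m STD_p R
identity = std_m ** 2 + std_p ** 2 - 2.0 * std_m * std_p * r
print(crmsd ** 2, identity)
```

A model plots close to the measured point when its standard deviation matches that of the measurements and R is close to 1, which makes the centered RMSD small.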
For comparing the model results for DL estimation from narrow rivers (B/H < 10) to very wide rivers (B/H > 100), the statistical criteria including RMSE, MBE, and R2 were calculated based on five categories of aspect ratio (B/H < 10; 10 < B/H < 30; 30 < B/H < 50; 50 < B/H < 100 and B/H > 100). The results of the 10 superior models based on aspect ratio ranges are illustrated in Table 6.
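Grouping the errors by aspect ratio can be sketched as follows for RMSE; MBE and R2 could be computed per class in the same way. The names and sample data are hypothetical, and assigning a value that falls exactly on a class boundary to the upper class is an assumption, since the text uses strict inequalities.

```python
import numpy as np

EDGES = [10.0, 30.0, 50.0, 100.0]
LABELS = ["B/H < 10", "10 < B/H < 30", "30 < B/H < 50",
          "50 < B/H < 100", "B/H > 100"]

def rmse_by_aspect_ratio(B, H, measured, predicted):
    """Return the RMSE of DL predictions for each non-empty B/H class."""
    ratio = np.asarray(B, dtype=float) / np.asarray(H, dtype=float)
    err = np.asarray(predicted, dtype=float) - np.asarray(measured, dtype=float)
    classes = np.digitize(ratio, EDGES)       # class index 0..4 for each point
    result = {}
    for k, label in enumerate(LABELS):
        mask = classes == k
        if mask.any():
            result[label] = float(np.sqrt(np.mean(err[mask] ** 2)))
    return result

# Hypothetical reaches with aspect ratios 5, 40, 45 and 150
result = rmse_by_aspect_ratio(
    B=[5.0, 40.0, 45.0, 300.0], H=[1.0, 1.0, 1.0, 2.0],
    measured=[10.0, 50.0, 60.0, 400.0], predicted=[13.0, 53.0, 64.0, 390.0],
)
print(result)
```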
Models’ results for DL estimation in aspect ratio class
Model | B/H < 10: RMSE | B/H < 10: MBE | B/H < 10: R2 | 10 < B/H < 30: RMSE | 10 < B/H < 30: MBE | 10 < B/H < 30: R2 | 30 < B/H < 50: RMSE | 30 < B/H < 50: MBE | 30 < B/H < 50: R2 | 50 < B/H < 100: RMSE | 50 < B/H < 100: MBE | 50 < B/H < 100: R2 | B/H > 100: RMSE | B/H > 100: MBE | B/H > 100: R2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
KF | 201 | 166 | 0.34 | 110 | 20 | 0.16 | 225 | 21 | 0.17 | 303 | 36 | 0.53 | 123 | −12 | 0.68 |
SD | 28 | 7 | 0.95 | 91 | −3 | 0.31 | 178 | 16 | 0.20 | 368 | 94 | 0.45 | 471 | 253 | 0.97 |
ET | 14 | 8 | 0.99 | 107 | −21 | 0.13 | 136 | 2 | 0.22 | 255 | −42 | 0.41 | 113 | 20 | 0.74 |
LI | 40 | 15 | 0.90 | 93 | −5 | 0.29 | 178 | 8 | 0.19 | 337 | 68 | 0.48 | 318 | 149 | 0.94 |
ZH | 166 | −135 | 0.96 | 55 | −24 | 0.30 | 97 | −21 | 0.21 | 165 | −48 | 0.44 | 105 | 71 | 0.85 |
DI | 81 | 63 | 0.90 | 87 | 4 | 0.37 | 227 | 20 | 0.13 | 279 | −37 | 0.33 | 152 | 34 | 0.69 |
SG | 196 | 162 | 0.82 | 105 | 29 | 0.26 | 283 | 31 | 0.12 | 308 | 15 | 0.39 | 185 | 19 | 0.58 |
WA | 166 | 128 | 0.94 | 94 | 1 | 0.27 | 126 | −12 | 0.22 | 254 | −80 | 0.43 | 60 | 1 | 0.88 |
AL | 12 | −6 | 0.99 | 97 | −24 | 0.33 | 234 | 24 | 0.17 | 352 | 75 | 0.52 | 175 | 26 | 0.73 |
LS-SVM | 26 | 10 | 0.97 | 47 | 4 | 0.87 | 41 | −1 | 0.92 | 89 | −14 | 0.93 | 50 | 4 | 0.91 |
Note: bold numbers show the best results for each criterion between models.
Table 6 demonstrates that in narrow rivers (B/H < 10) the AL equation (Alizadeh et al. 2017a) was the most accurate model for DL estimation, with RMSE = 12 m2 s−1, MBE = −6 m2 s−1, and R2 = 0.99. In all other aspect ratio classes, the LS-SVM model was the most accurate model for DL estimation. Its coefficient of determination ranged from R2 = 0.87 (10 < B/H < 30) to R2 = 0.97 (B/H < 10), which indicates high performance from narrow to very wide rivers in comparison with the other models. In wide rivers (50 < B/H < 100), the LS-SVM model provides the best results, with RMSE = 89 m2 s−1, MBE = −14 m2 s−1 and R2 = 0.93. The comparison of the predictive model of this study with the existing models revealed that the LS-SVM model is superior to the others because it has the highest value of R2 and the lowest value of RMSE, especially in wide rivers (B/H > 50).
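The per-class evaluation behind Table 6 can be sketched as below. This is an illustrative outline only: the sign convention for MBE (predicted minus measured), the use of the squared Pearson correlation for R2, the assignment of boundary values to the upper class, and all function names and sample values are assumptions rather than details confirmed by the study.

```python
import numpy as np

# Aspect ratio class boundaries from the text; a value exactly on a boundary
# falls into the upper class here (an assumption).
BIN_EDGES = [10.0, 30.0, 50.0, 100.0]
BIN_LABELS = ["B/H < 10", "10 < B/H < 30", "30 < B/H < 50",
              "50 < B/H < 100", "B/H > 100"]


def metrics_by_aspect_ratio(width, depth, measured, predicted):
    """Compute RMSE, MBE and R2 of DL predictions within each B/H class."""
    width, depth, measured, predicted = (
        np.asarray(a, dtype=float) for a in (width, depth, measured, predicted)
    )
    classes = np.digitize(width / depth, BIN_EDGES)
    results = {}
    for k, label in enumerate(BIN_LABELS):
        mask = classes == k
        if mask.sum() < 2:  # a correlation needs at least two samples
            continue
        m, p = measured[mask], predicted[mask]
        results[label] = {
            "n": int(mask.sum()),
            "RMSE": float(np.sqrt(np.mean((p - m) ** 2))),
            "MBE": float(np.mean(p - m)),
            "R2": float(np.corrcoef(m, p)[0, 1] ** 2),
        }
    return results


# Hypothetical river data (B and H in m, DL in m2 s-1), for illustration only.
width = [5.0, 8.0, 200.0, 300.0]
depth = [1.0, 1.0, 1.0, 1.0]
measured = [1.0, 2.0, 3.0, 5.0]
predicted = [2.0, 3.0, 3.0, 5.0]
table = metrics_by_aspect_ratio(width, depth, measured, predicted)
for label, stats in table.items():
    print(label, stats)
```

Classes with fewer than two samples are skipped, so the output only lists the aspect ratio ranges actually represented in the data.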
Comparison with previous studies
Riahi-Madvar et al. (2019) presented the best equations for predicting the longitudinal dispersion coefficients as Pareto optimal multigene genetic programming (POMGGP) (Liu 1977; Sattar & Gharabaghi 2015; Alizadeh et al. 2017a). They concluded that the AL and SG equations were among the best equations for predicting the longitudinal dispersion coefficient. Alizadeh et al. (2017a) reported that DR values fell between −0.3 and 0.3 in 63, 60 and 58% of datasets for the particle swarm optimization (PSO), ET and SG methods, respectively. In the present study, the AL and ET equations, each with 61%, were the best empirical equations after the LS-SVM method with 72%. A comparison of the results of the present study with those of previous studies is summarized in Table 7.
Comparison of this study with the results of previous studies
Researchers | No. of datasets | Methods | RMSE | SE | Accuracy, −0.3 < DR < 0.3 (%) | R2 | NSE | d | PA% |
---|---|---|---|---|---|---|---|---|---|
Riahi-Madvar et al. (2019) | 503 | Pareto optimal multigene genetic programming (POMGGP) | 720 | 373 | – | 0.417 | 0.39 | 0.75 | – |
Alizadeh et al. (2017a) | 164 | Multi-objective particle swarm optimization (MOPSO) | 86.57 | 37.5 | 63 | 0.761 | 0.75 | – | – |
Sattar & Gharabaghi (2015) | 150 | Gene expression programming (GEP) | 464 | – | – | 0.80 | 0.80 | 0.93 | – |
Li et al. (2013) | 65 | Differential evolution (DE) | 38.96 | – | – | 0.913 | – | – | 56.92 |
Etemad-Shahidi & Taghipour (2012) | 149 | M5′ model tree (MT) | – | – | 63.1 | 0.36 | – | – | – |
Sahay & Dutta (2009) | 65 | Genetic algorithms (GA) | 45 | – | – | 0.902 | – | – | – |
This study | G1-65 | Least square support vector machines (LS-SVM) | 52 | 22 | 83 | 0.92 | 0.84 | 0.94 | 72 |
This study | G2-116 | Least square support vector machines (LS-SVM) | 62 | 36 | 72 | 0.93 | 0.94 | 0.98 | 63 |
This study | G3-188 | Least square support vector machines (LS-SVM) | 60 | 31 | 70 | 0.92 | 0.93 | 0.98 | 57 |
As can be seen in Table 7, the LS-SVM method has a robust capability to predict the longitudinal dispersion coefficient in natural streams across a wide range of datasets.
CONCLUSIONS
In this study, an LS-SVM method was applied for predicting the longitudinal dispersion coefficient in natural streams. The performance of the LS-SVM was evaluated for three datasets (G1-65, G2-116 and G3-188) with different hydraulic and geometrical characteristics. From each dataset, 70% of samples were used for the training phase, while the remaining 30% were used for the testing phase. In addition, the performance of previous empirical equations in predicting the longitudinal dispersion coefficient was evaluated on the collected datasets. The most important findings of the research can be summarized as follows:
The performance of the empirical equations depends significantly on the statistical properties of the datasets, and most of the empirical equations showed large errors on datasets with high variability. For example, RMSE ranges from 39 to 332 for the G1-65 dataset and from 180 to 53,256 for the G2-116 dataset, which have, respectively, the lowest and highest mean and standard deviation of DL.
The LS-SVM method has a high capability in predicting the longitudinal dispersion coefficients in different datasets; however, when only a small number of data are available for the training and testing phases, the accuracy of the method is reduced, especially in the testing phase.
In general, the Etemad-Shahidi & Taghipour (2012) (ET), Li et al. (2013) (LI), Zeng & Huai (2014) (ZH), Alizadeh et al. (2017a) (AL), Elder (1959) (EL), Fischer (1975) (FI), and Liu (1977) (LIU) equations, in that order, had the best performance in predicting the longitudinal dispersion coefficient DL across all three datasets (G1-65, G2-116 and G3-188).
Some empirical equations, such as ET and AL, have two different formulas based on the ratio of B/H and show the most accurate results in predicting longitudinal dispersion coefficients in different datasets.
The accuracy of the empirical equations in estimating DL depends on the data series from which each equation was developed. For example, in the G1-65 and G2-116 datasets, empirical equations such as Sahay & Dutta (2009) (SD), Li et al. (2013) (LI), Zeng & Huai (2014) (ZH), and Wang et al. (2017) (WA) had the best accuracy in estimating the value of DL. This is because these equations were developed based on the 65 and 116 data series.
The LS-SVM method was strongest in predicting DL in wide rivers (B = 533–711 m) and weakest in predicting low longitudinal dispersion (DL = 0.2–0.5 m2 s−1).
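The 70/30 training and testing procedure summarized in these conclusions can be illustrated with a small LS-SVM regression sketch. It uses the standard LS-SVM formulation, in which training reduces to solving one linear system for the bias b and the coefficients α. The RBF kernel, the hyperparameter values (gamma, sigma), the synthetic data and all function names are assumptions made for illustration; this is not the software used in the study.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def lssvm_fit(x, y, gamma, sigma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = 1.0
    system[1:, 0] = 1.0
    system[1:, 1:] = rbf_kernel(x, x, sigma) + np.eye(n) / gamma
    solution = np.linalg.solve(system, np.concatenate(([0.0], y)))
    return solution[0], solution[1:]  # bias b, coefficients alpha


def lssvm_predict(x_new, x_train, b, alpha, sigma):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(x_new, x_train, sigma) @ alpha + b


# Synthetic stand-in for scaled predictors and target (not river data).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(100, 2))
y = np.sin(3.0 * x[:, 0]) + x[:, 1] ** 2

# Random 70/30 split into training and testing samples.
order = rng.permutation(len(y))
n_train = int(round(0.7 * len(y)))
train, test = order[:n_train], order[n_train:]

GAMMA, SIGMA = 100.0, 0.5  # hypothetical hyperparameters
b, alpha = lssvm_fit(x[train], y[train], GAMMA, SIGMA)
train_pred = lssvm_predict(x[train], x[train], b, alpha, SIGMA)
test_pred = lssvm_predict(x[test], x[train], b, alpha, SIGMA)
test_rmse = float(np.sqrt(np.mean((test_pred - y[test]) ** 2)))
print(f"Test RMSE: {test_rmse:.4f}")
```

In practice gamma and sigma would be tuned, for example by cross-validation on the training portion, before the testing samples are used.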
CONFLICTS OF INTEREST
The authors declare no conflict of interest.
CODE AVAILABILITY
The software used in this research is available from the corresponding author upon reasonable request.
AUTHORS' CONTRIBUTIONS
Mehdi Mohammadi Ghaleni, Mahmood Akbari, Saeed Sharafi and Mohammad Javad Nahvinia conceived the study. MMG led the data analysis and prepared all figures. MA, SS, and MJN wrote the paper. All authors reviewed the paper and contributed to the discussions.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.