Using data mining methods to improve discharge coefficient prediction in Piano Key and Labyrinth weirs

As a key parameter, the discharge coefficient (Cd) plays an important role in determining the passing capacity of weirs. In this research, the support vector machine (SVM) and gene expression programming (GEP) algorithms were assessed for predicting the Cd of the piano key weir (PKW), rectangular labyrinth weir (RLW), and trapezoidal labyrinth weir (TLW) using a gathered experimental data set. Using dimensional analysis, various combinations of hydraulic and geometric non-dimensional parameters were extracted to perform the simulations. The superior SVM and GEP models for the PKW, RLW, and TLW included (Ho/P, N, W/P), (Ho/P, W/P, N), and (Ho/P, Lc/W, A/W, Fr), respectively. The results showed that both algorithms have the potential to predict the discharge coefficient, but the evaluation indices (RMSE, R², Cd(DDR)max) illustrated the superiority of the GEP over the SVM. The sensitivity analysis determined that the most effective parameters for predicting the discharge coefficient of the PKW, RLW, and TLW are Ho/P, Ho/P, and Fr, respectively.


INTRODUCTION
Weirs are hydraulic structures used for water surface control, flow measurement, and passing excess water volume. The shape and length of the weir crest play an important role in determining overflow capacity. Because of their greater crest length, nonlinear weirs of the labyrinth type can convey more flow than linear weirs for an equal hydraulic head. Their crest is often triangular, rectangular, trapezoidal, or arched in shape. The functional differences among labyrinth weirs are expressed by the discharge coefficient, Cd, which depends on hydraulic and geometric parameters. Research in recent decades has indicated that, because of the complexity and diversity of the parameters affecting Cd, numerical simulation has been developed to predict the discharge coefficient. The data-driven (soft computing) approach is another technique that has been used to predict Cd from experimental data through training and test phases. These models are capable of extracting complex and hidden relationships among dependent and independent variables (Samadi et al. 2014, 2015, 2020). In this regard, researchers have used the artificial neural network (ANN), group method of data handling (GMDH), gene expression programming (GEP), support vector machine (SVM), and adaptive neuro-fuzzy inference system (ANFIS). ASCE (2000) considers the SVM as a nonlinear mapping process for predicting the discharge coefficient. The superiority of the ANN over regression models for predicting the Cd of a sharp-crested triangular plan weir was shown by Kumar et al. (2011) and Juma et al. (2014). The application of the ANFIS for estimating Cd with desirable accuracy was reported by Kisi (2013) and Emiroglu & Kisi (2013). Ebtehaj et al. (2015a, 2015b, 2015c) utilized data-mining methods to determine the Cd of side weirs located on the sidewalls of rectangular channels.
Their results confirmed the acceptable accuracy of these models. Azamathulla et al. (2016) used the SVM to predict the Cd of a side weir. The determination coefficients for the training and test phases were 0.96 and 0.93, respectively, indicating an accurate simulation. Parsaei & Haghiabi (2017) performed a comparative study using the GMDH, multivariate adaptive regression splines (MARS), and the SVM to predict the discharge coefficient of a combined weir-gate. Their results confirmed not only that all models are capable of prediction but also that the SVM algorithm outperforms the other methods. Norouzi et al. (2019) performed a comparative study to predict the discharge coefficient of a trapezoidal labyrinth weir using the SVM and the ANN. Their results showed not only the good performance of the two models but also the remarkable superiority of the SVM. Azimi et al. (2019) employed support vector regression (SVR) to predict the Cd of a side weir in a trapezoidal channel. They reported that the SVR model simulates Cd with acceptable accuracy. Mohammed & Sharifi (2020) proposed using the GEP to predict the discharge coefficient of an oblique side weir. Zhou et al. (2015), Zaji et al. (2016), Zaji & Bonakdari (2017), Roushangar et al. (2017), Nadiri et al. (2018), Majedi Asl & Fuladipanah (2019), Sadeghfam et al. (2019), Kumar et al. (2019), and Roushangar et al. (2021) also evaluated the discharge coefficient using data-driven methods.
Given the significance of data-driven methods, this paper examines the application of the GEP and SVM algorithms to estimate the discharge coefficient of three different weirs, i.e., piano key, rectangular labyrinth, and trapezoidal labyrinth. Further, a new regression equation is extracted to predict the discharge coefficient. Finally, using sensitivity analysis, the order of the parameters affecting Cd is determined.

Experimental models description
Two models, SVM and GEP, have been used to predict the Cd of the piano key weir (PKW), rectangular labyrinth weir (RLW), and trapezoidal labyrinth weir (TLW). Experimental data were gathered from Rostami et al. (2018) and Seamons (2014). Figures 1 and 2 show the geometric characteristics of the weirs. Rostami et al. (2018) performed their experiments in a rectangular flume of length 10 m, width 0.3 m, and height 0.6 m at the Khuzestan Water and Power Authority Laboratory, Iran. The range of the experimental geometric characteristics of the weir is presented in Table 1. In this table, W is the width of each cycle, P is the weir height, L is the effective length of the weir, Ts is the weir thickness, Wi and Wo are the widths of the inlet and outlet keys of the weir, respectively, and N is the number of keys in the weir.
They expressed the discharge coefficient Cd of the PKW by Equation (1), in which Ho is the total upstream head of the weir. Seamons (2014) conducted the experiments in a rectangular flume 14.6 m in length, 1.2 m in width, and 0.9 m in depth at the Water Research Laboratory, Utah State University, Logan, UT, USA. The total number of data sets was 313, obtained from experiments on 13 labyrinth weirs. Table 2 summarizes the variation range of the applied parameters. In this table, HT denotes the total head of flow above the weir crest. The geometric parameters are introduced in Figure 2. Seamons (2014) presented Cd as a function of geometric characteristics and hydraulic conditions for the RLW and TLW, as shown by Equations (2) and (3), respectively.

Overview of Support Vector Machine (SVM)
Developed by Cortes & Vapnik (1995), the SVM is an optimization algorithm for binary classification that separates two classes using a border. In this optimization-based algorithm, the samples that form the boundary between the two classes are determined. These samples are called support vectors. Similar to other regression models, the dependent variable y is predicted using several independent variables xi. The regression function is formed as Equation (4), where the amount of noise is determined using an allowable error ε. In Equation (4), W is the coefficient vector, b is a constant value, and Φ is the kernel (mapping) function. The algorithm aims to find f(x) such that the error function is optimized by training the model with a data series. The optimization is carried out by minimizing Equation (5) under the conditions given in Equation (6), where C is the penalty applied when an error occurs, N is the number of samples, and ξi and ξi* are slack (deficiency) variables. The algorithm processes the considered function by minimizing the three terms in Equation (5). Some of the most important kernel functions are presented in Table 3. In this table, γ, C, and d are kernel parameters. The SVM generalization performance (estimation accuracy) depends on a good choice of these meta-parameters. The choices of γ, r, and d control the complexity of the prediction (regression) model. Of all kernel functions, the radial basis function (RBF) is the most widely used and is recommended by notable researchers. In the RBF, σ is the function parameter. The characteristic parameters of the SVM, i.e., ε and C, are optimized, and the value of the setting parameter, γ, is determined by the trade-off between the minimum fitting error and the estimated function. The user determines the SVM parameters by trial and error.
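For reference, the textbook ε-insensitive support vector regression formulation is sketched below. It is assumed, not confirmed by the source, that this is the form referred to as Equations (4) to (6) and as the kernel definition; the equation tags and the RBF parameterization with γ are likewise assumptions.

```latex
% Assumed standard epsilon-SVR formulation (not taken from the source)
\begin{equation}
f(x) = W^{T}\,\Phi(x) + b \tag{4}
\end{equation}
\begin{equation}
\min_{W,\,b,\,\xi,\,\xi^{*}} \;\; \frac{1}{2}\lVert W\rVert^{2}
  + C\sum_{i=1}^{N}\left(\xi_i + \xi_i^{*}\right) \tag{5}
\end{equation}
\begin{equation}
\begin{aligned}
y_i - W^{T}\Phi(x_i) - b &\le \varepsilon + \xi_i,\\
W^{T}\Phi(x_i) + b - y_i &\le \varepsilon + \xi_i^{*},\\
\xi_i,\ \xi_i^{*} &\ge 0, \qquad i = 1,\dots,N
\end{aligned} \tag{6}
\end{equation}
% Kernel as an inner product of mapped samples, and the RBF case
\[
K(x_i, x_j) = \Phi(x_i)^{T}\,\Phi(x_j), \qquad
K_{\mathrm{RBF}}(x_i, x_j) = \exp\!\left(-\gamma\,\lVert x_i - x_j\rVert^{2}\right)
\]
```

The three terms minimized in the objective are the norm of W and the two sums of slack variables, which matches the description of Equation (5) in the text.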
In the SVM regression, ε is the radius of the tube within which the regression function must lie.
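The roles of the RBF kernel and the ε-tube can be illustrated with a minimal Python sketch. The function names, the γ-based form of the RBF kernel, and all numerical values below are assumptions chosen for illustration; they are not the fitted models of this study.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2) (assumed gamma form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the tube of radius epsilon, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion f(x) = sum_i coef_i * K(sv_i, x) + b."""
    return bias + sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Hypothetical support vectors in the (Ho/P, W/P) plane and made-up coefficients.
support_vectors = [(0.2, 1.0), (0.5, 1.5)]
dual_coefs = [0.3, -0.1]
cd_estimate = svr_predict((0.3, 1.2), support_vectors, dual_coefs, bias=0.6, gamma=4.0)

# A residual of 0.05 lies inside a tube of radius 0.1, so it is not penalized.
penalty = eps_insensitive_loss(0.50, 0.55, epsilon=0.1)
```

In this form, a larger γ makes each support vector act more locally, while a larger ε leaves more residuals unpenalized.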

Overview of gene expression programming
Invented by Ferreira (2001, 2006), GEP is an artificial evolutionary procedure built on a genotype system. It is used for solving complex real-world problems (Azamathullah 2012; Samadi et al. 2021). GEP evolves computer programs of different shapes and lengths that are encoded in linear chromosomes of fixed size. Decoding the chromosome information into an expression tree, called the translation process, is the final step of the GEP algorithm. The GEP algorithm involves four main steps: (1) initialize the population by creating the chromosomes (individuals); (2) identify a suitable fitness function to evaluate the best individual; (3) conduct genetic operations to modify the individuals to achieve the optimal solution in the next generation; (4) check the stop conditions. The flowchart of the GEP algorithm is illustrated in Figure 3.
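The translation step described above, in which a fixed-length linear gene is read into an expression tree, can be sketched as follows. This is a minimal illustration assuming a breadth-first (Karva-style) reading and a function set of four binary operators; the gene and the input values are hypothetical and are not the chromosomes evolved in this study.

```python
from collections import deque

# Assumed function set: four binary arithmetic operators.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def decode(gene):
    """Translate a linear gene into a tree of [symbol, children] nodes.

    Symbols are consumed left to right and attached level by level.
    A valid gene has a long enough tail of terminals; unused symbols
    at the end are simply ignored.
    """
    symbols = iter(gene)
    root = [next(symbols), []]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        arity = 2 if node[0] in OPS else 0  # terminals take no children
        for _ in range(arity):
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root


def evaluate(node, values):
    """Evaluate the expression tree for a dict of terminal values."""
    symbol, children = node
    if symbol in OPS:
        left, right = (evaluate(child, values) for child in children)
        return OPS[symbol](left, right)
    return values[symbol]


# Hypothetical gene: head "+", "*", "N", then terminals; the last two are unused.
gene = ["+", "*", "N", "Ho/P", "W/P", "Ho/P", "N"]
tree = decode(gene)  # encodes (Ho/P * W/P) + N
result = evaluate(tree, {"Ho/P": 0.5, "W/P": 2.0, "N": 4.0})
```

Because unused tail symbols are ignored, chromosomes of equal length can encode expression trees of different sizes, which is the property the paragraph above refers to.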

Evaluation criteria
The efficiency of the classic regression and data-driven models has been assessed using statistical indices such as RMSE and R², in which xo, xp, and N are the observed discharge coefficient, the predicted discharge coefficient, and the number of tests, respectively. However, these statistical indices only show the mean error of the model and give no information about the error distribution. Therefore, it is important to test the model using another performance criterion to check its robustness. For this purpose, a criterion based on the discrepancy ratio (DR) defined by White et al. (1973) has been applied in this paper. DR is a common error measure in the literature and has been used by many researchers, such as Seo & Cheong (1998), Deng et al. (2002), Kashefipour & Falconer (2002), Tayfur & Singh (2005), and others. However, it cannot be used for negative and zero values. To remedy this problem, Noori et al. (2010) presented the developed discrepancy ratio (DDR) statistic. For better judgment and visualization, the Gaussian function of the DDR values can be calculated and illustrated in a standard normal distribution format. For this purpose, the DDR values must first be standardized, and then, using the Gaussian function, the normalized value of DDR (Cd(DDR)) is calculated. Generally, the more the error distribution graph concentrates around the centerline and the larger the maximum Cd(DDR), the higher the accuracy.
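The evaluation criteria can be sketched in Python as below. The definitions are assumptions based on common usage, because the original formulas are not reproduced here: R² is taken as the squared Pearson correlation, DDR as (predicted / observed) − 1, and the Gaussian value as the normal density built from the mean and standard deviation of the DDR values. The sample numbers are hypothetical.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation (assumed definition of R^2)."""
    mean_o = statistics.fmean(observed)
    mean_p = statistics.fmean(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def ddr(observed, predicted):
    """Developed discrepancy ratio, assumed as predicted / observed - 1."""
    return [p / o - 1.0 for o, p in zip(observed, predicted)]


def gaussian_of_ddr(ddr_values):
    """Return standardized DDR values and their Gaussian density.

    Requires a non-zero spread of the DDR values.
    """
    mu = statistics.fmean(ddr_values)
    sigma = statistics.pstdev(ddr_values)
    z = [(v - mu) / sigma for v in ddr_values]
    density = [
        math.exp(-0.5 * zi ** 2) / (sigma * math.sqrt(2.0 * math.pi)) for zi in z
    ]
    return z, density


# Hypothetical observed and predicted discharge coefficients.
observed = [0.50, 0.55, 0.60, 0.65]
predicted = [0.52, 0.54, 0.61, 0.63]
z_ddr, cd_ddr = gaussian_of_ddr(ddr(observed, predicted))
```

Under these assumptions the peak of the density equals 1 / (σ√(2π)), so a narrower spread of DDR values gives a larger maximum, which is consistent with reading a larger maximum Cd(DDR) as higher accuracy.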

RESULTS AND DISCUSSION
Different combinations of dimensionless parameters have been implemented to assess various models for predicting the discharge coefficient of the PKW, RLW, and TLW using the SVM and GEP algorithms. These combinations are presented in Table 4.

SVM solver
The total number of data used for modeling was 72. Of all measured data, the shares of the training and test phases were 75% and 25%, respectively. The results of the SVM algorithm for the PKW showed that model 1 has the best performance among the five models. The setting parameters C, γ, and ε were obtained as 40, 4, and 0.1 for the optimum model, with R² = 0.9785 and RMSE = 0.02418 for the training stage and R² = 0.9789 and RMSE = 0.027 for the test phase. These results were obtained with the RBF kernel function. Figure 4 illustrates a scatter plot of measured and predicted values of Cd and the conformity of the data points for Cd during the training and test phases for model 1. The distribution of predicted and observed data around the fit line and their conformity during the training and test phases indicate the high accuracy of the SVM model. More statistical indices are presented in Table 5, demonstrating the agreement between measured and predicted values of the discharge coefficient. As can be seen, all indices have very close values in the test and training stages. The standardized normal distribution of the Cd(DDR) values for the SVM model through the training and test phases is shown in Figure 5. The maximum values of Cd(DDR) for the training and test phases were 8.153 and 8.698, respectively. These values show that the model performs better during the test stage than during training.
A sensitivity analysis was performed for model 1 (Table 6) to evaluate the impact of the input variables on the estimated discharge coefficient. In this analysis, one parameter at a time was omitted from the model inputs, the model was run again, and its accuracy was evaluated. The omitted parameter whose removal caused the largest loss of precision and the largest increase in model error was rated as the most significant parameter. The results reveal that dropping the parameter Ho/P increased the RMSE to 0.1569 and 0.1905 and decreased R² to 0.1151 and 0.0127 in the training and test phases, respectively. The calculation demonstrated that the most prominent and sensitive parameter for the SVM predictor is Ho/P. The second and third most effective parameters are W/P and N, respectively.
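The omit-one-input procedure described above can be sketched as follows. To keep the example self-contained, a one-nearest-neighbour predictor stands in for the SVM, and the data are synthetic with a deliberately dominant first input; the function names and all values are assumptions for illustration only.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def nearest_neighbour_predict(train_x, train_y, test_x):
    """Stand-in regressor (not the SVM): copy the target of the closest sample."""
    predictions = []
    for query in test_x:
        distances = [sum((a - b) ** 2 for a, b in zip(row, query)) for row in train_x]
        predictions.append(train_y[distances.index(min(distances))])
    return predictions


def drop_column(rows, j):
    """Remove input j from every sample."""
    return [row[:j] + row[j + 1:] for row in rows]


def sensitivity(names, train_x, train_y, test_x, test_y, fit_predict):
    """Test RMSE obtained when each input is omitted in turn."""
    scores = {}
    for j, name in enumerate(names):
        predictions = fit_predict(
            drop_column(train_x, j), train_y, drop_column(test_x, j)
        )
        scores[name] = rmse(test_y, predictions)
    return scores


def synthetic_target(x0, x1):
    """Synthetic response dominated by the first input (illustrative only)."""
    return 2.0 * x0 + 0.1 * x1


names = ["Ho/P", "W/P"]
train_x = [(float(a), float(b)) for a in range(4) for b in range(4)]
train_y = [synthetic_target(a, b) for a, b in train_x]
test_x = [(0.1, 2.9), (2.9, 0.1), (1.1, 1.9), (1.9, 1.1)]
test_y = [synthetic_target(a, b) for a, b in test_x]

scores = sensitivity(names, train_x, train_y, test_x, test_y, nearest_neighbour_predict)
most_sensitive = max(scores, key=scores.get)
```

The input whose omission yields the largest test RMSE is ranked as the most influential; passing a different `fit_predict` callable would apply the same ranking to another model.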
The next simulation of Cd concerns the RLW. The number of data for the RLW is 33, of which 73% and 27% were assigned to the training and test phases, respectively. The first model of Table 4 had the best performance in predicting the discharge coefficient. The setting parameters C, γ, and ε were obtained as 60, 1.4, and 0.1, respectively. The values of (R², RMSE) for the training and test phases were calculated as (0.9745, 0.0208) and (0.9734, 0.0138), respectively. A scatter plot of observed versus predicted Cd and a plot of the discharge coefficient values against the data points for the training and test stages are presented in Figure 6. The data are closely clustered around the fit line and show significant conformity during the training and test phases, indicating acceptable performance of the SVM model. Table 7 summarizes the statistical indices for the performance of the SVM algorithm for the superior combination of parameters. The very small differences among the indices confirm the good simulation by the SVM model. Figure 7 illustrates Cd(DDR) vs. ZDDR for the superior model through the training and test stages. The maximum Cd(DDR) values for the training and test stages are 8.256 and 14.255, respectively. These values indicate better performance of the SVM during the test phase than during training, which points to correct operation. A sensitivity analysis has been performed for model 1 (Table 8). Removing Ho/P changed the model output dramatically, and a sharp difference between the model outputs is evident.
The discharge coefficient of the TLW is the third case modeled with the SVM. The shares of the training and test phases of the total observed data (140 data) were 80% and 20%, respectively. Among the eight combinations listed in Table 4, combination 3 agreed best with the measured discharge coefficients. This model includes the parameters Ho/P, Lc/W, A/W, and Fr. The setting parameters C, γ, and ε of the SVM algorithm were obtained as 100, 1, and 0.1, respectively. The values of (R², RMSE) for the training and test phases were (0.9896, 0.009) and (0.9886, 0.0144), respectively. The measured versus predicted values of Cd are illustrated in Figure 8, which reflects the high values of the performance indices. All measured and predicted data are scattered around the 1:1 line, and the conformity of the data sets during the training and test stages indicates high correlation and low error. The statistical characteristics of the predicted and measured values of Cd are summarized in Table 9, which shows an improved SVM performance for the TLW. Figure 9 illustrates the distribution of standardized Cd(DDR) vs. ZDDR for the TLW. The maximum values of Cd(DDR) for the training and test stages are 25.724 and 33.524, respectively. The better simulation for the test phase supports the correct performance of the modeling. The sensitivity analysis (Table 10) shows that Fr is the most effective parameter, because omitting it from the modeling process causes the largest decrease in R² and the largest increase in RMSE. The other effective parameters in descending order are Ho/P, Lc/W, and A/W.

GEP solver
In this section, the results of the GEP simulation are presented for the three weirs. The three PKW models listed in Table 4 were examined, and the first one gave the best output based on the chromosome properties presented in Table 11. The shares of the training and test phases of all data are 75% and 25%, respectively. The values of RMSE, used as the fitness function error of the GEP, for the training and test phases are 0.0297 and 0.0425, respectively. The expression tree of the GEP algorithm, including mathematical functions and operators, is presented in Figure 10. The equation corresponding to the expression tree is Equation (9). Given the analysis results (Table 12), there is a good fit between observed and predicted values of the discharge coefficient. The test phase is more accurate than the training phase. The variation of observed and predicted values of Cd through the training and test phases is displayed in Figure 11. It can be deduced that the GEP model performance is acceptable because of the agreement of the data sets during the training and test phases. Figure 12 illustrates the standardized normal distribution of the DDR for the PKW using the GEP algorithm. The maximum values of Cd(DDR) for the training and test phases are 7.531 and 9.305, respectively. Although both bell curves are similarly centered around the vertical axis, the higher maximum value of the test stage indicates better performance than in the training stage. The sensitivity analysis (Table 13) for the GEP illustrates the highest impact of Ho/P, because omitting it changes the accuracy indices drastically: in the test phase, RMSE increased from 0.0425 to 0.2411 and R² dropped from 0.9767 to 0.1403.
The discharge coefficient of the RLW was the second case modeled using the GEP. The chromosome parameters are presented in Table 14. The values of RMSE for the training and test phases were 0.01179 and 0.026, respectively. The expression tree of the GEP predictor is illustrated in Figure 13.
The corresponding formula is given by Equation (10) (see also Table 15). A comparison between observed and predicted values of Cd is presented in Figure 14. There is good agreement between measured and predicted values of the discharge coefficient. The improvement of the evaluation criteria during the test phase supports the correct modeling process of the GEP simulator. The normal distribution of the DDR for the RLW is presented in Figure 15. The highest values of Cd(DDR) for the training and test stages are 12.23 and 13.23, respectively. These figures indicate better operation in the test phase. According to Table 16, the most effective parameter for the GEP modeling is Ho/P. The simulation of the discharge coefficient for the TLW was performed based on the chromosome characteristics given in Table 17. The values of RMSE for the training and test phases were 0.00849 and 0.0823, respectively. Figure 16 presents the expression tree of the GEP model for the TLW.
The corresponding formula is given by Equation (11). According to Table 19, the hydraulic condition, Fr, has the greatest effect on the GEP performance. The other important parameters in descending order are Ho/P, A/W, and Lc/W.