Prediction of aeration efficiency of Parshall and Modified Venturi flumes: application of soft computing versus regression models

In this study, the potential of four soft computing techniques, namely Random Forest (RF), M5P, Multivariate Adaptive Regression Splines (MARS), and Group Method of Data Handling (GMDH), was evaluated for predicting the aeration efficiency (AE20) at Parshall and Modified Venturi flumes. Experiments were conducted on 26 Modified Venturi flumes and one Parshall flume, yielding a total of 99 observations. The results of the soft computing models were compared with regression-based models (MLR: multiple linear regression, and MNLR: multiple nonlinear regression). The MARS model outperformed the other soft computing and regression-based models in predicting AE20, with Pearson's correlation coefficient (CC) = 0.9997 and 0.9992 and root mean square error (RMSE) = 0.0015 and 0.0045 during the calibration and validation periods, respectively. A sensitivity analysis was also carried out using the best-performing MARS model to assess the effect of individual input variables on AE20 for both flumes. It indicated that the oxygen deficit ratio (r) was the most effective input variable in predicting AE20 at Parshall and Modified Venturi flumes. Outcomes were compared against regression-based models. Model effectiveness was evaluated using performance evaluation indicators. The MARS-based model outperformed the other models.


INTRODUCTION
Water pollution has several main sources, such as poisonous substances, detergents, repellents, mining products, and organic and industrial wastes (Schwarzenbach et al. 2010). The quality of water in natural systems such as rivers and in artificial systems such as canals or reservoirs is related to the presence of dissolved oxygen (DO), which is therefore considered necessary for aquatic life. The ideal value of DO in natural water bodies ranges between 5 and 6 mg/L (Sánchez et al. 2007). DO should be kept in this range: if it drops below 5 mg/L, water quality is affected, and if it drops below 1-2 mg/L, fish will die within hours (Baylar et al. 2009b). On the other hand, the DO level should not exceed 110%; otherwise, it may be harmful to aquatic life. Physical processes that transfer oxygen from the atmosphere to the water help to replenish the existing oxygen. This process is termed aeration and is considered a significant process in wastewater treatment plants (Sangeeta & Tiwari 2019).
Hydraulic structures can be considered the best choice for improving water quality in a river system through aeration, because they raise the amount of DO. A single hydraulic structure can transfer as much DO as would otherwise be transferred along several kilometers of river. This rapid transfer occurs because air entrained by the flow forms a large number of bubbles, which greatly increases the air-water surface area available for mass transfer. Different types of spillways, such as stepped or overflow spillways, can be used for aeration in river systems. These structures, however, are not considered the best solution for straight-flow canals, where other hydraulic structures such as drop structures and weirs are preferred. For prismatic channels, the small Parshall flume is regarded as the best solution for providing the required aeration.
Several studies have reviewed experimental work on the aeration performance of hydraulic structures, such as Gulliver et al. (1990), Wilhelms et al. (1993), Chanson (1995), and Ervine (1998). In the last two decades, Baylar & Bagatur (2000), Baylar et al. (2008, 2009a, 2009c), Hanbay et al. (2009), and Baylar & Emiroglu (2003) have examined sharp-crested barriers with diverse cross-sectional geometry. Their findings revealed that the air entrainment rate and the aeration efficiency varied depending on the barrier. The Parshall flume was designed and first used by Ralph Parshall. He based his study on work carried out by Cone in 1917 and improved it in 1928. The flume was subsequently used in several irrigation projects (USBR 2001; Kim et al. 2010; Dursun 2016). More recently, the Parshall flume has been used in projects around the world to measure discharge in channels and river systems.
In the last few years, Artificial Intelligence (AI) techniques have been widely applied in different fields of water resources because they can resolve intricate problems that have no tractable solution (Aghelpour et al. 2019; Jahani & Mohammadi 2019; Sihag et al. 2019b, 2019c; Thakur et al. 2021). AI-based models and algorithms have been employed extensively in many water resources applications, including hydrology (Banadkooki et al. 2020; Malik et al. 2020b; Tikhamarine et al. 2020c; Ghasempour et al. 2021; Sihag et al. 2021), hydraulics (Parsaie 2016; Parsaie et al. 2016, 2018; Ebtehaj et al. 2017; Najafzadeh et al. 2017; Sihag et al. 2019a), and water flow/quality (Heddam & Kisi 2017; Parsaie & Haghiabi 2017a; Haghiabi et al. 2018; Singh et al. 2019; Esmaeilbeiki et al. 2020; Pandhiani et al. 2020). Nevertheless, only a limited number of studies have considered the application of AI to the evaluation of Parshall flume aeration performance. Therefore, in this study, the M5P, RF, MARS, and GMDH techniques were used to develop AI-based models to predict the aeration efficiency at Parshall and Modified Venturi flumes, and the results were compared with regression-based models.

Mechanism of aeration
Aeration is the most dominant factor in wastewater treatment. In aeration, oxygen is introduced into the water to increase the DO level for the survival of aquatic life. Aeration in a Parshall flume differs from aeration at weirs. A Parshall flume consists of three portions: a converging upstream approach, a short sloping throat, and a diverging downstream portion, as shown in Figure 1. The upstream portion is the largest and has a level floor, the short, narrow throat has a floor that slopes downward, and the floor rises again in the downstream portion (Parshall 1926). In weirs, aeration is achieved by creating a hydraulic jump (Baylar & Emiroglu 2003). Figure 2 illustrates the aeration process for various stages of the cascade from a weir (Tsang 1987). In a Parshall flume, by contrast, the contracting sidewalls of the converging portion accelerate the flow. In the throat, the contraction and drop accelerate the flow from subcritical to supercritical; in the diverging section, the flow changes from fast, supercritical to slow, subcritical through a hydraulic jump just at the outlet of the flume, which results in aeration (Figure 1).

Aeration transfer efficiency
Oxygen dispersion in flowing water bodies is governed by turbulent mixing (hydraulic jump), molecular diffusion (bubble formation), or both. It has been proposed that a laminar layer exists on each side of the water-air interface and that these two layers control the movement of oxygen into the water. The oxygen transfer rate from one side to the other is directly proportional to the concentration gradient and is expressed as:

$$\frac{dC}{dt} = K_L \frac{A}{V}\,(C_S - C) \tag{1}$$

where C is the concentration of dissolved oxygen, K_L is the coefficient of the liquid layer for oxygen, A is the surface area, V is the volume over which oxygen transfer occurs, C_S is the saturation concentration, and t is time.
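As a worked step, the first-order rate law above can be integrated in closed form. The sketch below assumes that K_L, A/V, and C_S stay constant over the contact time (an idealisation) and introduces C_0, the concentration at t = 0, purely for illustration:

```latex
\begin{aligned}
\frac{dC}{dt} &= K_L \frac{A}{V}\,(C_S - C) \\
\int_{C_0}^{C(t)} \frac{dC}{C_S - C} &= K_L \frac{A}{V} \int_0^{t} dt \\
\ln\frac{C_S - C_0}{C_S - C(t)} &= K_L \frac{A}{V}\, t \\
C(t) &= C_S - (C_S - C_0)\, e^{-K_L (A/V)\, t}
\end{aligned}
```

Under these assumptions the oxygen deficit C_S - C decays exponentially, and the ratio of the initial deficit to the final deficit (the deficit ratio when C_0 and C(t) are read as the upstream and downstream concentrations) depends only on K_L (A/V) t.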
Equation (1) excludes sources and sinks of oxygen in the water body, as their rates are very slow compared with the rate of oxygen transfer at most hydraulic structures. Assuming that C_S is constant with respect to time, Gulliver et al. (1990) expressed the oxygen aeration efficiency (AE) as:

$$AE = \frac{C_D - C_U}{C_S - C_U} = 1 - \frac{1}{r} \tag{2}$$

where the subscripts U and D denote the upstream and downstream locations, and r is the oxygen deficit ratio, r = (C_S - C_U)/(C_S - C_D). An aeration efficiency of unity represents full transfer of oxygen, with the water leaving the structure fully saturated. No oxygen is transferred if the aeration efficiency is zero. To make measured results comparable, the aeration efficiency is generally normalized to a standard temperature of 20°C. The relation between temperature and aeration efficiency was given by Gulliver et al. (1990) as:

$$AE_{20} = 1 - (1 - AE)^{1/f} \tag{3}$$

in which AE is the oxygen transfer efficiency at the water temperature of measurement and AE20 is the oxygen transfer efficiency at 20°C. The aeration exponent f is calculated as:

$$f = 1.0 + 0.02103\,(T - 20) + 8.261 \times 10^{-5}\,(T - 20)^2 \tag{4}$$

In Equation (4), T is the water temperature in °C.
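The temperature normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the coefficients of the exponent f are the commonly quoted Gulliver et al. (1990) values, taken here as an assumption.

```python
def aeration_exponent(temp_c: float) -> float:
    """Exponent f as a function of water temperature in deg C (assumed coefficients)."""
    dt = temp_c - 20.0
    return 1.0 + 0.02103 * dt + 8.261e-5 * dt ** 2


def deficit_ratio(c_up: float, c_down: float, c_sat: float) -> float:
    """Oxygen deficit ratio r = (C_S - C_U) / (C_S - C_D)."""
    if not c_up <= c_down < c_sat:
        raise ValueError("expected C_U <= C_D < C_S")
    return (c_sat - c_up) / (c_sat - c_down)


def aeration_efficiency(c_up: float, c_down: float, c_sat: float) -> float:
    """AE = (C_D - C_U) / (C_S - C_U), equivalently 1 - 1/r."""
    return (c_down - c_up) / (c_sat - c_up)


def efficiency_at_20(ae: float, temp_c: float) -> float:
    """Normalize AE measured at temp_c to 20 deg C: AE20 = 1 - (1 - AE)^(1/f)."""
    return 1.0 - (1.0 - ae) ** (1.0 / aeration_exponent(temp_c))
```

At 20°C the exponent equals one, so AE20 and AE coincide; for warmer water f exceeds one and the normalized efficiency is lower than the measured one.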

Multiple linear regression (MLR)
The MLR method uses linear regression to establish the association between a dependent variable and several independent variables. The commonly used equation for MLR is as follows (Malik et al. 2021a):

$$AE_{20} = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_n x_n \tag{5}$$

In Equation (5), AE20 is the dependent (output) variable, c_0, c_1, c_2, c_3, ..., c_n are the regression coefficients, and x_1, x_2, x_3, ..., x_n are the independent variables.
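A least-squares fit of the linear form above can be sketched with NumPy. The study itself used XLSTAT, so this is only an illustrative stand-in; the function names and the synthetic data are assumptions made for the example.

```python
import numpy as np


def fit_mlr(X, y):
    """Return [c0, c1, ..., cn] minimising the squared error of the linear model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones lets the solver estimate the intercept c0.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Evaluate c0 + c1*x1 + ... + cn*xn for each row of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]


# Synthetic, noise-free data generated from known coefficients (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(30, 3))
true_coef = np.array([0.3, 0.5, -0.2, 0.1])
y_demo = true_coef[0] + X_demo @ true_coef[1:]
coef_demo = fit_mlr(X_demo, y_demo)
```

Because the synthetic target is exactly linear, the fitted coefficients recover the generating ones; with experimental data the residuals would be non-zero.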

Multiple nonlinear regression (MNLR)
The application of MNLR requires the observed/experimental data set to establish the relationship between input and output variables. For nonlinear and complex issues, MNLR is commonly chosen. This study uses XLSTAT tools for the creation of the MNLR-based model. Equation (6)

Review on MARS
MARS is a non-parametric method that makes no assumptions about the association between the independent and dependent variables. In MARS, the space of input variables is split into subdomains, and a linear regression equation is fitted to each subdomain. The boundary value between two subdomains is called a knot. The fitted linear regressions are defined as basis functions (BFs), which take the following forms:

$$BF = \max(0,\, x - k) \quad \text{or} \quad BF = \max(0,\, k - x) \tag{7}$$

where x denotes an input variable and k denotes the boundary (knot) value. The MARS model for the phenomenon of interest is then written as:

$$y = f(x) = S_0 + \sum_{n=1}^{H} S_n\, b_n(x) \tag{8}$$

where y is the dependent variable predicted by the function f(x), S_0 is a constant, H is the number of BFs, S_n is the coefficient multiplying the n-th BF, and b_n is the n-th BF. A MARS model is built in two phases. The first is the model creation phase, in which the input space is split into several subdomains; this is the growth (forward) phase. The second is the pruning phase, in which BFs created in the first phase that have no major impact on the accuracy of the model are removed based on a criterion called Generalized Cross-Validation (GCV). The structure of the final MARS model is thus selected by GCV, which is defined as:

$$GCV = \frac{\dfrac{1}{N}\sum_{i=1}^{N}\left[y_i - f(x_i)\right]^2}{\left[1 - \dfrac{C(H)}{N}\right]^2}, \qquad C(H) = (H + 1) + dH \tag{9}$$

where N is the number of data points and C(H) is the complexity penalty, which increases with the number of BFs. In this expression, d is the penalty for each BF and H is the number of BFs obtained from the MARS process (Parsaie & Haghiabi 2017b, 2017c; Sihag et al. 2019d).
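The two building blocks described above, hinge-shaped basis functions and the GCV criterion, can be sketched as follows. This is a minimal illustration rather than a MARS implementation: it does not search for knots, the function names are hypothetical, and the penalty default d = 3 is an assumed typical value, not one reported in this study.

```python
def hinge_pos(x: float, k: float) -> float:
    """Basis function max(0, x - k): active to the right of knot k."""
    return max(0.0, x - k)


def hinge_neg(x: float, k: float) -> float:
    """Mirror basis function max(0, k - x): active to the left of knot k."""
    return max(0.0, k - x)


def mars_predict(x, intercept, terms):
    """Evaluate S_0 + sum_n S_n * b_n(x); terms is a list of (S_n, b_n) pairs."""
    return intercept + sum(coef * basis(x) for coef, basis in terms)


def gcv(observed, predicted, n_basis, d=3.0):
    """Mean squared error inflated by the complexity penalty C(H) = (H + 1) + d*H."""
    n = len(observed)
    penalty = (n_basis + 1) + d * n_basis
    if penalty >= n:
        raise ValueError("too many basis functions for this sample size")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return mse / (1.0 - penalty / n) ** 2


# A hand-built model with one knot at 0.5 (illustrative coefficients).
demo_terms = [
    (2.0, lambda x: hinge_pos(x, 0.5)),
    (-1.0, lambda x: hinge_neg(x, 0.5)),
]
```

For identical residuals, a model with more basis functions receives a larger GCV, which is why pruning removes terms that do not reduce the error enough to pay for their penalty.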

Review on GMDH
GMDH is a self-organizing method that models the dynamic relationship between the inputs and the output of a complicated system. The approach was suggested by Ivakhnenko (1971). In the GMDH technique, each neuron receives a separate pair of inputs, and the governing equation is a quadratic polynomial:

$$y = w_0 + w_1 x_i + w_2 x_j + w_3 x_i x_j + w_4 x_i^2 + w_5 x_j^2 \tag{10}$$

where w_0 and w_i (i = 1, ..., 5) are coefficients, and x_i and x_j denote a pair of inputs. In general, the GMDH relationship between inputs and output can be expressed through the Volterra series as an infinite polynomial:

$$y = a_0 + \sum_{i=1}^{m} a_i x_i + \sum_{i=1}^{m}\sum_{j=1}^{m} a_{ij} x_i x_j + \sum_{i=1}^{m}\sum_{j=1}^{m}\sum_{k=1}^{m} a_{ijk} x_i x_j x_k + \cdots \tag{11}$$

This is the discrete form of the Volterra series, known as the Kolmogorov-Gabor polynomial (Najafzadeh & Lim 2015; Alfaifi et al. 2020).
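A single GMDH layer, in which every pair of inputs feeds one quadratic neuron and the best neuron is kept, can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the coefficients are fitted by ordinary least squares, selection uses the mean squared error on a held-out subset, and the data are synthetic.

```python
from itertools import combinations

import numpy as np


def quad_features(xi, xj):
    """Columns 1, xi, xj, xi*xj, xi^2, xj^2 of the quadratic neuron."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi ** 2, xj ** 2])


def fit_neuron(xi, xj, y):
    """Least-squares estimate of the six neuron coefficients w0..w5."""
    w, *_ = np.linalg.lstsq(quad_features(xi, xj), np.asarray(y, dtype=float), rcond=None)
    return w


def neuron_output(w, xi, xj):
    return quad_features(xi, xj) @ w


def select_best_pair(X_train, y_train, X_val, y_val):
    """Fit one neuron per input pair; keep the pair with the lowest validation MSE."""
    best = None
    for i, j in combinations(range(X_train.shape[1]), 2):
        w = fit_neuron(X_train[:, i], X_train[:, j], y_train)
        err = float(np.mean((neuron_output(w, X_val[:, i], X_val[:, j]) - y_val) ** 2))
        if best is None or err < best[0]:
            best = (err, (i, j), w)
    return best


# Synthetic target that depends only on inputs 0 and 2 (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] - X[:, 2] + 0.5 * X[:, 0] * X[:, 2]
err, pair, w = select_best_pair(X[:40], y[:40], X[40:], y[40:])
```

In a full GMDH network the outputs of the retained neurons become the inputs of the next layer, and layers are added until the validation error stops improving.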

Review on M5P tree
Model trees build on the idea of regression trees but have linear functions at their leaves (Pal & Deswal 2009), which makes them analogous to piecewise linear functions. The M5P model tree is a binary decision tree with linear equations at the terminal (leaf) nodes that can approximate continuous numerical attributes (Quinlan 1992). Pruning in this algorithm reduces the chance of overfitting. At each node, a splitting criterion is applied that minimizes the variation of the class values within the subsets passed down each branch. Preparing an M5P model involves three main steps: tree growth, pruning, and smoothing. The basic tree is generated using a splitting criterion that treats the standard deviation of the class values reaching a node as a measure of the error at that node, and linear functions are then generated at each node. The standard deviation reduction (SDR) is given as follows:

$$SDR = sd(Z) - \sum_{i} \frac{|Z_i|}{|Z|}\, sd(Z_i) \tag{12}$$

In Equation (12), Z is the set of examples that reach the node, Z_i is the i-th subset of examples produced by the candidate split, and sd is the standard deviation. This technique generates a sizeable tree structure with high prediction accuracy.
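The splitting criterion described above can be sketched for a single numeric attribute. This is a minimal illustration of standard deviation reduction, not the WEKA M5P implementation; the function names are hypothetical and the population standard deviation is an assumed choice.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction: sd(Z) - sum(|Z_i| / |Z| * sd(Z_i))."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


def best_split(xs, ys):
    """Try each midpoint between consecutive distinct x values; return (threshold, SDR)."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_gain = None, float("-inf")
    for (x_lo, _), (x_hi, _) in zip(pairs, pairs[1:]):
        if x_lo == x_hi:
            continue  # no threshold separates identical attribute values
        threshold = (x_lo + x_hi) / 2.0
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        gain = sdr(ys, [left, right])
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Two clearly separated groups of target values (illustrative data).
demo_x = [1, 2, 3, 4, 5, 6]
demo_y = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
demo_threshold, demo_gain = best_split(demo_x, demo_y)
```

The split that separates the two groups removes all within-subset variation, so its SDR equals the standard deviation of the parent node.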

Review on RF
The random forest (RF) procedure was primarily introduced by Breiman (1996). RF is a versatile approach that has been chosen to solve various nonlinear or intricate engineering problems (Mohammadi & Mehdizadeh 2020). In this technique, a large number of trees is grown, each from a different bootstrap (bagging) sample of the original data set. The split at each node is made using a randomly chosen subset of the input variables. The random forest algorithm is comparatively insensitive to the characteristics of the training set and can achieve high prediction accuracy (Breiman 2001). It requires two user-defined parameters: the number of trees grown (k) and the number of input variables tried at each split (m). A trial-and-error process is used for model development. In this investigation, the WEKA 3.9 software was used to develop the random forest-based model.
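The two sources of randomness described above, bootstrap sampling of rows for each tree and a random subset of inputs at each split, can be sketched with the standard library. This is not a random forest implementation (no trees are grown) and does not reflect WEKA internals; the function names are hypothetical.

```python
import random


def bootstrap_indices(n_rows: int, rng: random.Random) -> list:
    """Draw n_rows row indices with replacement: the training sample for one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows: int, sampled: list) -> list:
    """Rows never drawn for a tree; these can serve as its internal test set."""
    drawn = set(sampled)
    return [i for i in range(n_rows) if i not in drawn]


def candidate_features(n_features: int, m: int, rng: random.Random) -> list:
    """Pick m distinct input indices to be considered at one node split."""
    return sorted(rng.sample(range(n_features), m))


def forest_predict(tree_predictions: list) -> float:
    """For regression, the forest output is the mean of the k tree outputs."""
    return sum(tree_predictions) / len(tree_predictions)
```

With 69 training rows and six inputs, as in this study, each tree would see 69 rows drawn with replacement, and each split would consider only m of the six inputs.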

Performance evaluation indicators
To assess the accuracy of the implemented AI and regression-based models, six performance indicators were considered: Pearson's correlation coefficient (CC), scatter index (SI), mean absolute error (MAE), root mean square error (RMSE), Bias, and the Nash-Sutcliffe efficiency (NSE). They are defined as follows.

Pearson's correlation coefficient: The CC shows the interrelation between observed and predicted values and ranges from -1 to +1. It can be calculated as (Malik et al. 2019b):

$$CC = \frac{\sum_{i=1}^{N}(O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{N}(O_i - \bar{O})^2 \sum_{i=1}^{N}(P_i - \bar{P})^2}} \tag{13}$$

Root mean square error: The RMSE is one of the most widely used error measures for assessing model fitness. As the name indicates, it is the square root of the mean square error. A value of zero indicates the best prediction. It is expressed as (Elbeltagi et al. 2020; Malik et al. 2021c):

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(P_i - O_i)^2} \tag{14}$$

Scatter index: It is the ratio of the RMSE to the average of the actual observations and is written as (Tao et al. 2018; Malik et al. 2019a):

$$SI = \frac{RMSE}{\bar{O}} \tag{15}$$

Bias: It is the average difference between observed and predicted values. It ranges from -∞ to ∞ and is calculated as (Sihag et al. 2017):

$$Bias = \frac{1}{N}\sum_{i=1}^{N}(P_i - O_i) \tag{16}$$

Mean absolute error: The MAE is the mean of the absolute difference between observed and predicted values. It ranges from 0 to ∞ and is given by (Tikhamarine et al. 2020a; Malik et al. 2021b; Mohammadi et al. 2021):

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|P_i - O_i\right| \tag{17}$$

Nash-Sutcliffe efficiency: Nash & Sutcliffe (1970) introduced the NSE, which is used here to evaluate the efficacy of the soft computing models. The NSE ranges from -∞ to 1, and a value of 1 indicates a perfect model. It is computed as:

$$NSE = 1 - \frac{\sum_{i=1}^{N}(O_i - P_i)^2}{\sum_{i=1}^{N}(O_i - \bar{O})^2} \tag{18}$$

In Equations (13)-(18), O_i and P_i are the observed (actual) and predicted aeration efficiency, \bar{O} and \bar{P} are the averages of the observed and predicted aeration efficiency, and N is the number of observations.
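The six indicators can be computed together from paired observed and predicted values. The sketch below uses only the standard library; the function name is hypothetical, and the sign convention for Bias (predicted minus observed, so that positive values mean over-prediction) is an assumption.

```python
from math import sqrt


def evaluate(observed, predicted):
    """Return CC, RMSE, SI, Bias, MAE and NSE for paired observations and predictions."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and of equal length")
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]  # predicted minus observed
    sq_err = sum(e * e for e in errors)
    var_o = sum((o - o_mean) ** 2 for o in observed)
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    rmse = sqrt(sq_err / n)
    return {
        "CC": cov / sqrt(var_o * var_p),
        "RMSE": rmse,
        "SI": rmse / o_mean,
        "Bias": sum(errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "NSE": 1.0 - sq_err / var_o,
    }
```

A constant offset between predictions and observations leaves CC at one while RMSE, MAE, and Bias all equal the offset, which is one reason several indicators are reported side by side.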

Dataset
For model development and validation, a total of 99 experimental observations of aeration efficiency at 20°C at Parshall and Modified Venturi flumes were used. The total data set was divided into two separate subsets, and the division was made subjectively. The training data set used for model development included 69 observations, while the remaining 30 observations formed the testing data set used for model validation. Six independent variables were used as inputs: flow rate (Q), throat width (W), throat length (L), sill height (N), oxygen deficit ratio (r), and exponent (f). Aeration efficiency at 20°C (AE20) was the target for model development and validation. Table 1 outlines the features of the data used for model development (training) and validation.

RESULTS AND DISCUSSION
Soft computing and regression-based models were used in this investigation to predict the aeration efficiency at Parshall and Modified Venturi flumes. Six standard statistical parameters, CC, MAE, Bias, SI, RMSE, and NSE, were chosen to test the performance of all implemented models. Lower RMSE, Bias, SI, and MAE values and higher CC and NSE values indicate higher model accuracy. Model preparation is a trial-and-error process, and the optimum values of the user-defined parameters were obtained after several trials. There are well-defined statistical criteria for selecting and defining user-defined parameters that are unique to the model.

Results of linear and nonlinear regression-based models
In this investigation, linear and nonlinear regression-based models for the prediction of aeration efficiency at Parshall and Modified Venturi flumes have also been developed. XLSTAT software employing the least squares method was used to develop these equations. The performance measurement parameter values for all developed equations are listed in Table 2. The regression-based equations are as follows:

MLR equation:

$$AE_{20} = 0.33066 + 0.00019\,Q + 0.00040\,W - 0.00008\,L + 0.00122\,N + 0.62605\,r - 1.02413\,f \tag{19}$$
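The fitted MLR equation can be evaluated directly once the six inputs are known. The sketch below hard-codes the coefficients reported in Equation (19); the function name is hypothetical, and because the units of Q, W, L, and N are not restated here, inputs must be supplied in the same units as the experimental data set (Table 1).

```python
# Coefficients of the fitted MLR equation, Equation (19).
MLR_INTERCEPT = 0.33066
MLR_COEF = {
    "Q": 0.00019,   # flow rate
    "W": 0.00040,   # throat width
    "L": -0.00008,  # throat length
    "N": 0.00122,   # sill height
    "r": 0.62605,   # oxygen deficit ratio
    "f": -1.02413,  # temperature exponent
}


def ae20_mlr(Q: float, W: float, L: float, N: float, r: float, f: float) -> float:
    """Predict AE20 from the six inputs using the linear form of Equation (19)."""
    values = {"Q": Q, "W": W, "L": L, "N": N, "r": r, "f": f}
    return MLR_INTERCEPT + sum(MLR_COEF[name] * values[name] for name in MLR_COEF)
```

The magnitudes of the coefficients for r and f are far larger than those for the geometric and flow inputs, although a direct comparison of influence would require the inputs to be on comparable scales.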

The performance of the MLR and MNLR models was compared using six statistical performance measures, viz. CC, RMSE, Bias, SI, MAE, and NSE (Table 2). As per Table 2, the MLR-based model achieved higher CC and NSE (0.9840 and 0.9670) and lower RMSE, Bias, SI, and MAE (0.0131, -0.0019, 0.1073, and 0.0111) in the testing phase, indicating that it performed better than the MNLR-based model. Figure 3 displays agreement plots, for the training and testing datasets separately, between the observed aeration efficiency at Parshall and Modified Venturi flumes and the values predicted by the MLR and MNLR-based models. As depicted in the graph, the values predicted by the MLR-based model are close to the line of perfect agreement.

Table 2 summarizes the results for the training and testing datasets and shows that the performance of the RF model was appropriate for the prediction of aeration efficiency at Parshall and Modified Venturi flumes, with CC, RMSE, and Bias values of 0.9849, 0.0134, and -0.0014, respectively.

The results (Figure 7 and Table 2) suggest that both the pruned and unpruned M5P-based models are suitable for predicting the aeration efficiency at Parshall and Modified Venturi flumes.

Results of MARS based model
The dataset used for the development (training) of the MARS-based model for the prediction of aeration efficiency at Parshall and Modified Venturi flumes is described in Table 1. Thirty basis functions were chosen in the initial stage of development of the MARS model, and ten basis functions were pruned in the next step. Finally, 7 basis functions and 16 total effective parameters were found to be optimal for the MARS model. Equation (21) shows the basic form of the MARS model, and Table 5 lists its full form. The GCV criterion value obtained in pruning the MARS model was 0.0000040565. Figure 8 shows the results of the MARS-based model.

Results of GMDH based model
The development of the GMDH model entails a trial-and-error process similar to that of the RF, M5P, and MARS-based models. Table 2 outlines the outcomes of the developed GMDH model, which includes three hidden layers. The same training and testing datasets used for the other soft computing-based models were used to develop the GMDH model. Table 6 lists the constants of the developed GMDH model.

The comparison (Figure 10) indicates that the MARS-based model works better than the other regression and soft computing-based models. The potential of the regression and soft computing-based models for predicting the aeration efficiency at Parshall and Modified Venturi flumes was assessed through performance indicators, agreement plots, and the errors plotted in Figure 10 for the testing stage. The values predicted by the MARS-based model were close to the observed aeration efficiency, as indicated in the plots, and the predicted values follow the same pattern as the observed ones. The unpruned M5P-based model outperforms the pruned M5P-based model.
Overall, the MARS-based model was more accurate than the M5P, GMDH, MLR, MNLR, and RF-based models. The MLR-based model is also suitable, and better than MNLR, for predicting the aeration efficiency at Parshall and Modified Venturi flumes using this dataset. A box plot comparing the observed (actual) values with the values predicted by the various applied models for the testing stage is shown in Figure 11, and descriptive statistics of the observed and predicted values during the testing stage are listed in Table 7. Figure 11 and Table 7 suggest that the MARS model outperformed the other applied models. The minimum and maximum of the actual values and of the values predicted by the MARS model are very close, and the widths of the lower and upper quartiles are almost the same in Figure 11. The Taylor diagram in Figure 12, a graphical illustration of model performance in terms of CC, RMSE, and standard deviation, indicates that MARS was the best-performing model and MNLR the worst-performing model for the prediction of aeration efficiency at Parshall and Modified Venturi flumes using this dataset.

Sensitivity analysis
A sensitivity analysis was carried out to find the most influential input variable for AE20 at Parshall and Modified Venturi flumes. The analysis was performed with the best-performing model (MARS). The training data set was used, and the model was redeveloped after eliminating one input variable at a time. The results are listed in terms of the coefficient of correlation, RMSE, MAE, and NSE in Table 8, which indicates that the oxygen deficit ratio (r) is the most effective input variable in predicting AE20 at Parshall and Modified Venturi flumes using this dataset.
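The leave-one-input-out procedure described above can be sketched as a simple loop. The study used the MARS model for this step; the sketch below substitutes an ordinary least-squares linear model as a stand-in so that it stays self-contained, and the function names and synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_and_score(X, y):
    """Fit a linear model by least squares and return its RMSE on the same data."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sqrt(np.mean((design @ coef - y) ** 2)))


def leave_one_input_out(X, y, names):
    """Refit the model with each input removed in turn; report the resulting RMSE."""
    results = {}
    for k, name in enumerate(names):
        keep = [c for c in range(X.shape[1]) if c != k]
        results[name] = fit_and_score(X[:, keep], y)
    return results


# Synthetic training set: 69 rows, six inputs, target dominated by "r" (illustrative only).
names = ["Q", "W", "L", "N", "r", "f"]
rng = np.random.default_rng(2)
X_demo = rng.uniform(0.0, 1.0, size=(69, 6))
y_demo = 0.05 + 0.02 * X_demo[:, 0] + 0.6 * X_demo[:, 4] + rng.normal(0.0, 0.001, 69)
sensitivity = leave_one_input_out(X_demo, y_demo, names)
most_influential = max(sensitivity, key=sensitivity.get)
```

The input whose removal causes the largest increase in error is ranked as the most influential; in the synthetic example this is r by construction.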

CONCLUSION
In this research, M5P, Random Forest (RF), Multivariate Adaptive Regression Splines (MARS), and Group Method of Data Handling (GMDH) models were developed to predict the aeration efficiency (AE20) at Parshall and Modified Venturi flumes and were compared with multiple linear regression (MLR) and multiple nonlinear regression (MNLR) models. Experiments were conducted on 26 different Modified Venturi flumes and one Parshall flume. The comparison using performance evaluation indices shows that the MARS approach outperformed the rest of the models (i.e., M5P, RF, GMDH, MLR, and MNLR) during the development (training) and validation (testing) periods. Other major outcomes of this study are that the MLR-based model performs better than the MNLR-based model and that, among the M5P-based models, the unpruned model works better than the pruned model. Overall, the M5P-based models outperform the GMDH, MLR, RF, and MNLR-based models. A sensitivity analysis was carried out with the best-performing MARS model to evaluate the effect of each input variable on AE20. It indicates that the oxygen deficit ratio (r) was the most effective input variable for estimating AE20 at Parshall and Modified Venturi flumes using this dataset.

CONFLICT OF INTEREST
None.

DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.