## Abstract

This paper explores the ability of multivariate adaptive regression splines, decision trees, Gaussian processes, and a multiple non-linear regression equation approach to predict the aeration efficiency at various weirs and discusses their results. In total, 126 experimental observations were collected in the laboratory, of which 88 were arbitrarily selected for model training and the rest were used for model validation. Various graphical presentations and goodness-of-fit parameters were used to assess the performance of the models. The performance evaluation results, Whisker plot, and Taylor's diagram indicated that the GP_rbf-based model was superior to the other implemented models in predicting the aeration efficiency of weirs, with *CC* (0.9961 and 0.9793), *MAE* (0.0079 and 0.0195), *RMSE* (0.0122 and 0.0251), scattering index (0.0594 and 0.1238), and Nash Sutcliffe model efficiency (0.9923 and 0.9564) values in the training and validating stages, respectively. The values predicted by GP_rbf lie within the ±30% error lines in the training and validating stages, with most of them lying at or close to the line of agreement. Among the decision tree models implemented, the random forest model showed the best predictability. The sensitivity analysis of parameters suggests that the shape factor and drop height are the major influencing factors in predicting the aeration efficiency.

## HIGHLIGHTS

Experimental study to evaluate aeration efficiency at various shapes of sharp-crested weir models.

Application of machine learning techniques to predict aeration efficiency of sharp-crested weirs.

Introduction of shape factor for different shapes of weirs as an input to ML models.

Use of graphs and goodness-of-fit parameters to assess the performance of applied ML models.

Sensitivity analysis to identify the most influential input parameters.

## INTRODUCTION

Dissolved oxygen (DO) in water is indispensable for all life forms. It is a significant indicator of the water quality needed for human utility and healthy aquatic life (Goel 2013). DO in water should be ≥4 ppm and ≥5 ppm to sustain warm-water and cold-water aquatic life, respectively (Baylar *et al.* 2010). Reducing the DO concentration below this minimum requirement can significantly stress the natural aquatic cycle.

The aeration (oxygen transfer) efficiency *E* of a hydraulic structure is defined as (Gulliver *et al.* 1990)

$$E = \frac{C_d - C_u}{C_s - C_u}$$

where *C*_{u} and *C*_{d} are the DO concentrations upstream and downstream of the structure, respectively, and *C*_{s} is the saturation DO concentration. The aeration efficiency depends on the water temperature (*T*, °C). Thus, to provide a common base for comparing the performance of different weirs, a normalized aeration efficiency at 20 °C standard temperature is used (Eckenfelder & Ford 1970), as given in the following equation,

$$1 - E_{20} = (1 - E)^{1/f}$$

where the temperature-correction exponent *f* is (Gulliver *et al.* 1990)

$$f = 1 + 0.02103(T - 20) + 8.261 \times 10^{-5}(T - 20)^2$$

Here, *E*_{20} is the aeration efficiency normalized to 20 °C (Baylar *et al.* 2010).

Following this, investigations of the effects of the governing factors, such as head over the crest of the weir, drop height, weir configuration, temperature, and discharge, on finding the most efficient weir shape for oxygen transfer, together with empirical relations for aeration efficiency, were reported by Van der Kroon & Schram (1969a, 1969b), Apted & Novak (1973), Avery & Novak (1978), Nakasone (1987), Gulliver *et al.* (1990, 1998), Gulliver & Rindels (1993), Witt & Gulliver (2012), Baylar *et al.* (2001a, 2001b), and Jaiswal & Goel (2019). Goel (2013) and Jaiswal & Goel (2020) also assessed various machine learning models to predict aeration efficiency at sharp-crested weirs.

The present study investigates the potential of multivariate adaptive regression splines (MARS), Gaussian process (GP), and decision tree [random forest (RF), random tree (RT), and M5P] models to predict the aeration efficiency at various weirs, and compares their results with those of a multiple non-linear regression equation (MNRE) model. A series of experiments was performed to collect the aeration efficiency at various sharp-crested weirs under different governing parameters. The experiments were conducted on a recirculating rectangular flume in the hydraulics laboratory of CED, NIT, Kurukshetra. The governing parameters include the shape factor of the weir, head over the crest of the weir, drop height, and discharge. The correlation coefficient, mean absolute error, root mean square error, scattering index (*SI*), and Nash Sutcliffe model efficiency were used as goodness-of-fit parameters to evaluate the performance of the developed models.

## MODELING TECHNIQUES

The modeling techniques that were investigated in this paper are briefly discussed in the following.

### Multivariate adaptive regression splines

In the MARS method, basis functions (BFs) take the form

$$\mathrm{BF} = \max(0,\, x - k) \quad \text{or} \quad \mathrm{BF} = \max(0,\, k - x)$$

where *x* is an independent parameter and *k* is a border (knot) value. The dependent variable (*y*) in the MARS method is expressed as a function of *x*, as given in the following equation,

$$y = a_0 + \sum_{i=1}^{n} b_i\, \mathrm{BF}_i(x)$$

where *a*_{0} is a constant value, *n* is the number of BFs, BF_{i} is the *i*th basis function, and *b*_{i} is the coefficient of the *i*th basis function.

The optimal number of BFs is selected using generalized cross-validation (*GCV*) techniques. The *GCV* is represented by the following equation,

$$GCV = \frac{\dfrac{1}{N}\displaystyle\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\left[1 - \dfrac{C(H)}{N}\right]^2}$$

where *N* is the number of data and *C*(*H*) is the complexity penalty that increases with the number of BFs. The complexity penalty *C*(*H*), as provided by Parsaie *et al.* (2016), Haghiabi (2017), and Parsaie & Haghiabi (2017), is represented by the following equation,

$$C(H) = h + 1 + dH$$

where *d* is a penalty number for each basic function, *h* + 1 is a constant term representing a fixed value independent of the number of basic functions, and *H* is the number of basic functions in the MARS model.
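To make the construction above concrete, the following Python sketch evaluates a MARS-style model built from hinge basis functions and computes its *GCV*; the knot, coefficients, and penalty values here are illustrative assumptions, not the fitted model from this study.

```python
import numpy as np

def hinge(x, k, direction=1):
    """Basis function: max(0, x - k) for direction=+1, max(0, k - x) for direction=-1."""
    return np.maximum(0.0, direction * (x - k))

def mars_predict(x, a0, terms):
    """y = a0 + sum_i b_i * BF_i(x), with terms given as (b_i, k_i, direction_i)."""
    y = np.full_like(x, a0, dtype=float)
    for b, k, d in terms:
        y += b * hinge(x, k, d)
    return y

def gcv(y_obs, y_pred, H, d=2, h=1):
    """Generalized cross-validation with complexity penalty C(H) = h + 1 + d*H."""
    N = len(y_obs)
    mse = np.mean((y_obs - y_pred) ** 2)
    return mse / (1.0 - (h + 1 + d * H) / N) ** 2

x = np.linspace(0.0, 10.0, 50)
y_obs = 0.1 + 0.05 * np.maximum(0.0, x - 4.0)    # piecewise-linear target
y_hat = mars_predict(x, 0.1, [(0.05, 4.0, 1)])   # one hinge term, so H = 1
print(gcv(y_obs, y_hat, H=1))                    # 0.0 for this exact fit
```

Adding more hinge terms always lowers the residual error, but the *C*(*H*) penalty in the denominator grows with *H*, so *GCV* selects a parsimonious set of basis functions.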

### Decision tree

- (i)
*Random forest*

The RF method is a well-organized collection of tree predictors constructed from input vectors using random vector samples (Breiman 2001; Liaw & Wiener 2002). The RF algorithm (Figure 2) is a well-known general-purpose classification and regression tool in which, at each split, a random subset of variables is examined and the best split among them is taken (Biau & Scornet 2016). The method creates random forests by bagging a group of random trees (Mohanty *et al.* 2019). RF is a combination of bagging and the random subspace method that integrates weak classification trees and reaches a final decision by majority vote. The splits for the forest trees are governed by the number of decision trees to be generated (N-tree) and the number of features (*j*) to be tested at each node to determine the best split. Because the RF classifier is relatively efficient and resistant to overfitting, N-tree can be made as large as desired (Guan *et al.* 2013). Each tree is grown using about 67% of the training data, and the remaining 33%, known as out-of-bag (OOB) data, is used for validation. Hence, RF regression combines *k* trees, where *k*, the number of trees to be produced, can be any arbitrary value. The classification and regression tree (CART) algorithm generates all the trees in the forest without pruning. By combining different factors, RF regression allows each tree to grow to the full depth of its bootstrap training data. While creating individual trees, a training set of parameters is drawn from randomly selected data, and a Gini index is used to measure the impurity of the parameters with respect to the output (Breiman *et al.* 1984). RF also ranks the variables by their importance in achieving the optimum RF model.
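The bagging-plus-OOB workflow above can be sketched with scikit-learn; the data are synthetic stand-ins for the experimental inputs and the target function is hypothetical, so only the mechanics mirror the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Synthetic stand-ins for the inputs: shape factor, h (cm), H (cm), Q (l/s)
X = rng.uniform([1, 1, 60, 0.5], [6, 5.5, 90, 6], size=(126, 4))
# Hypothetical target, NOT the study's measured aeration efficiency
y = 0.05 * X[:, 0] + 0.002 * X[:, 2] - 0.01 * X[:, 3] + rng.normal(0, 0.01, 126)

# One feature tried per split (max_features=1); bootstrap sampling leaves ~33%
# of the data out-of-bag per tree, which serves as internal validation.
rf = RandomForestRegressor(n_estimators=50, max_features=1, bootstrap=True,
                           oob_score=True, random_state=0)
rf.fit(X, y)
print(round(rf.oob_score_, 3))   # OOB R^2 (coefficient of determination)
```

With only 10 trees, as used later in this study, some samples may never fall out-of-bag; 50 trees are used here to stabilize the OOB estimate.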

(ii)

*M5P*

In the first phase, the splitting criterion of the M5P algorithm is the standard deviation reduction (*SDR*),

$$SDR = sd(N) - \sum_i \frac{|N_i|}{|N|}\, sd(N_i)$$

where *N* is the set of samples reaching a node, *N*_{i} is the subset of samples corresponding to the *i*th potential outcome of the split, and *sd* is the standard deviation.
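The *SDR* criterion can be reproduced with a toy example (the numbers are illustrative, not the study's data): a split that separates low target values from high ones yields a large reduction.

```python
import statistics

def sdr(parent, subsets):
    """Standard deviation reduction: sd(N) - sum_i (|N_i| / |N|) * sd(N_i)."""
    n = len(parent)
    return statistics.pstdev(parent) - sum(
        len(s) / n * statistics.pstdev(s) for s in subsets
    )

# Candidate split separating low aeration-efficiency values from high ones
parent = [0.01, 0.02, 0.03, 0.40, 0.42, 0.44]
left, right = parent[:3], parent[3:]
print(round(sdr(parent, [left, right]), 4))   # ~0.1882: a large, useful reduction
```

The split with the largest *SDR* is chosen at each node, since it leaves the most homogeneous child subsets.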

In the second phase, the generated tree is pruned, as shown in Figure 3. To decide which branches should be pruned, the marginal branches (terminal sections) are excluded from the final tree, ensuring strong predictability. After pruning, new leaves are identified based on the distribution of observations employed in the learning process, and the predictability of the model is then further improved by a smoothing approach. In the final stage, a regularization technique is used to resolve discontinuities between neighboring linear models in the tree's leaves.

(iii)

*Random tree*

The RT model selects the split at every node from a specific number of random features, and no pruning is used. RT requires little parameter tuning, relying mainly on randomization, and otherwise applies a bagging concept (Hamoud *et al.* 2018). The data subsets at each node should be equally distributed among the branches. The technique addresses both classification and regression problems. In classification, the RT classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label with the highest number of votes. In regression, the classifier's response is the mean of the responses of all the trees in the forest (Cutler *et al.* 2012). Random trees combine two machine learning algorithms: single model trees and the RF algorithm. In model trees, the linear model at every leaf is tailored to its local subdomain, which considerably improves the efficiency of a single stable tree. The tree diversity of RF is developed in two ways: first, the training data are sampled with replacement for each tree, just as in bagging; second, while growing the tree, the best split is chosen from a single random subset of all attributes at each node. This contrasts with the traditional method of dividing each node according to its overall optimal division. Random model trees are thus a modified model combining model trees and RFs, using this splitting criterion to simplify the optimization process and encouraging well-balanced trees with ridge regression applied at the leaves (Barddal & Enembreck 2019).

### Gaussian process

In Gaussian process regression (GPR), the latent function *f*(*x*) is defined using a mean function *m*(*x*) and a kernel (covariance) function *k*(*x*, *x*′), as given in the following equation,

$$f(x) \sim GP\big(m(x),\, k(x, x')\big)$$

The mean function indicates the central tendency of the function and is generally taken as zero (Kuss & Rasmussen 2003). The covariance matrix describes the function's structure and form.

For non-linear decision surfaces, GPR employs the kernel function. The literature presents several kernels, and a perusal of these studies suggests that polynomial kernels, Pearson VII universal kernels, and radial basis kernels perform better (Pal & Mather 2003; Gill *et al.* 2006). The present study uses the following kernels,

- (i)
Polynomial (poly) kernel:

  $$K(x_i, x_j) = (x_i \cdot x_j + 1)^d$$

- (ii)
Pearson VII universal kernel (puk):

  $$K(x_i, x_j) = \left[1 + \left(\frac{2\sqrt{\|x_i - x_j\|^2}\,\sqrt{2^{1/\omega} - 1}}{\sigma}\right)^2\right]^{-\omega}$$

- (iii)
Radial basis kernel (rbf):

  $$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

User-defined parameters (the noise level and the kernel parameters) were selected based on a large number of trials carried out using different permutations.
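The study built its GP models in WEKA; as a rough scikit-learn analogue (synthetic one-dimensional data, with hyperparameters tuned by maximizing the marginal likelihood rather than by manual trials), GPR with an rbf kernel plus a noise term looks like:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.05, 40)   # noisy observations

# rbf covariance plus an explicit noise kernel; hyperparameters are optimized
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-2),
                               normalize_y=True, random_state=0)
gpr.fit(X, y)
mean, std = gpr.predict(np.array([[1.5]]), return_std=True)
print(round(float(mean[0]), 3))   # close to sin(1.5) ~ 0.997
```

A practical advantage of GPR over the other models compared here is that it returns a predictive standard deviation alongside the mean, quantifying the uncertainty of each prediction.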

### Multiple non-linear regression equation

The multiple non-linear regression equation takes the form

$$y = a\, x_1^{b_1} x_2^{b_2} \cdots x_n^{b_n}$$

where *y* is the dependent variable, i.e., the output, *x*_{1}, …, *x*_{n} are the predictor variables, i.e., the inputs, *a* is a constant coefficient, and the parameters were calculated by minimizing the sum of squares of the estimation error based on the least squares method.
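One common way to fit such a power-law model is to linearize it with logarithms and apply ordinary least squares; note that the study minimized squared error in the original space, so the log-space fit below is an approximation (exact here only because the synthetic data, with made-up coefficients, are noise-free).

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.uniform(1.0, 5.0, 100), rng.uniform(1.0, 5.0, 100)
y = 0.3 * x1**1.2 * x2**-0.5     # y = a * x1^b1 * x2^b2 with a=0.3, b1=1.2, b2=-0.5

# ln y = ln a + b1 ln x1 + b2 ln x2  ->  a linear model in the logs
A = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
a, b1, b2 = np.exp(coef[0]), coef[1], coef[2]
print(round(a, 3), round(b1, 3), round(b2, 3))   # recovers 0.3 1.2 -0.5
```

For noisy data, a non-linear least squares solver applied directly to the power-law form matches the study's stated fitting criterion more closely than the log-space shortcut.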

## DATA STATISTICS OF GOVERNING PARAMETERS

The present study aimed to assess the applicability of various machine learning algorithms in predicting the aeration efficiency at weirs and identifying the best-performing model. This study used five shapes of sharp-crested weirs, i.e., suppressed, rectangular, trapezoidal, semicircular, and triangular, to evaluate their aeration efficiency. Experiments were performed under various governing parameters and compared to the aeration data when no weir was used.

A total of 126 experimental observations were collected for aeration efficiency at weirs under varying parameters such as head over weir (*h*), drop height (*H*), and discharge (*Q*). The range of parameters used for experimentation is provided in Table 1. The experiments showed that the shape of the weir plays a defining role in aeration efficiency. Thus, a shape factor was defined to incorporate the weir shape as a model input, coded as 1, 2, 3, 4, 5, and 6 for no weir, suppressed, rectangular, trapezoidal, semicircular, and triangular weirs, respectively.
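The shape-factor coding described above amounts to a simple ordinal lookup, which can be expressed as:

```python
# Ordinal shape-factor codes assigned in this study
SHAPE_FACTOR = {
    "no weir": 1,
    "suppressed": 2,
    "rectangular": 3,
    "trapezoidal": 4,
    "semicircular": 5,
    "triangular": 6,
}
print(SHAPE_FACTOR["triangular"])   # 6
```

Note that an ordinal code imposes an ordering on the shapes; tree-based models are insensitive to the spacing between codes, whereas kernel and regression models treat it as meaningful.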

| Parameter | Notation | Unit | Range |
|---|---|---|---|
| Head over weir | h | cm | 1–5.5 |
| Drop height | H | cm | 60, 75, and 90 |
| Discharge | Q | l/s | 0–6 |


The correlation matrix among the governing variables was calculated using Pearson's method and is summarized in Table 2. This table reveals that the shape factor has the highest correlation of 0.844 with aeration efficiency (*E*_{20}), followed by the head over weir, discharge, and drop height. Thus, the shape factor is the most influential parameter for estimating aeration efficiency.

| Variables | Shape factor | h (cm) | H (cm) | Q (l/s) | E_{20} |
|---|---|---|---|---|---|
| Shape factor | 1 | | | | |
| h (cm) | 0 | 1 | | | |
| H (cm) | 0 | 0 | 1 | | |
| Q (l/s) | −0.010 | 0.994 | 0.013 | 1 | |
| E_{20} | 0.844 | 0.323 | 0.265 | −0.313 | 1 |


Out of 126 experimental data, 88 were used in the training stage for model generation, and 38 were used for model validation. In the present computational analysis, shape factor, *h* (cm), *H* (cm), and *Q* (l/s) were considered as the input parameters, whereas aeration efficiency was regarded as the output/target parameter. Table 3 summarizes the descriptive data statistics of model components such as minimum, maximum, mean, standard deviation, kurtosis, skewness, and confidence level (95%) for training and validating data sets.

| Statistic | Shape factor | h (cm) | H (cm) | Q (l/s) | E_{20} | Data set |
|---|---|---|---|---|---|---|
| Minimum | 1 | 1 | 60 | 0.486 | 0.0113 | Training |
| | 1 | 1 | 60 | 0.486 | 0.0112 | Validating |
| Maximum | 6 | 5.33 | 90 | 5.981 | 0.489 | Training |
| | 6 | 5.33 | 90 | 5.981 | 0.432 | Validating |
| Mean | 3.511 | 3.240 | 75.170 | 3.048 | 0.205 | Training |
| | 3.474 | 3.311 | 74.605 | 3.135 | 0.203 | Validating |
| Standard deviation | 1.800 | 1.504 | 12.351 | 1.898 | 0.139 | Training |
| | 1.520 | 1.515 | 12.323 | 1.938 | 0.122 | Validating |
| Kurtosis | −1.381 | −1.402 | −1.526 | −1.443 | −0.994 | Training |
| | −0.968 | −1.343 | −1.515 | −1.401 | −0.999 | Validating |
| Skewness | −0.019 | −0.116 | −0.021 | 0.087 | 0.209 | Training |
| | 0.053 | −0.094 | 0.050 | 0.138 | 0.122 | Validating |
| Confidence level (95%) | 0.381 | 0.319 | 2.617 | 0.402 | 0.029 | Training |
| | 0.499 | 0.498 | 4.051 | 0.637 | 0.040 | Validating |


### Performance evaluation indices

Five statistical parameters were employed to evaluate the accuracy of machine learning (ML) based models. These statistical parameters are coefficient of correlation (*CC*), mean absolute error (*MAE*), root mean square error (*RMSE*), Nash Sutcliffe model efficiency (*NSE*), and *SI*. These statistical parameters quantify the best fit between the observed and predicted data for all the applied models. The formulas of these indicators are as listed in Equations (11)–(15).

- (i)
Coefficient of correlation:

  $$CC = \frac{\sum_{i=1}^{N}(Q_i - \bar{Q})(R_i - \bar{R})}{\sqrt{\sum_{i=1}^{N}(Q_i - \bar{Q})^2 \sum_{i=1}^{N}(R_i - \bar{R})^2}}$$

- (ii)
Mean absolute error:

  $$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|Q_i - R_i\right|$$

- (iii)
Root mean square error:

  $$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(Q_i - R_i)^2}$$

- (iv)
Nash Sutcliffe model efficiency:

  $$NSE = 1 - \frac{\sum_{i=1}^{N}(Q_i - R_i)^2}{\sum_{i=1}^{N}(Q_i - \bar{Q})^2}$$

- (v)
Scattering index:

  $$SI = \frac{RMSE}{\bar{Q}}$$

where *Q*_{i} refers to the actual (observed) values, $\bar{Q}$ refers to the average of the observed values, *R*_{i} is the predicted value (model), and *N* is the number of observations.
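The five indices can be computed directly from paired observed and predicted values; the short vectors below are illustrative, not the study's data.

```python
import numpy as np

def fit_indices(Q, R):
    """Goodness-of-fit indices for observed values Q and predicted values R."""
    Q, R = np.asarray(Q, dtype=float), np.asarray(R, dtype=float)
    cc = np.corrcoef(Q, R)[0, 1]
    mae = np.mean(np.abs(Q - R))
    rmse = np.sqrt(np.mean((Q - R) ** 2))
    nse = 1.0 - np.sum((Q - R) ** 2) / np.sum((Q - Q.mean()) ** 2)
    si = rmse / Q.mean()
    return {"CC": cc, "MAE": mae, "RMSE": rmse, "NSE": nse, "SI": si}

observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.11, 0.19, 0.31, 0.38]
print({k: round(v, 4) for k, v in fit_indices(observed, predicted).items()})
```

*CC* and *NSE* approach 1 for a good model, while *MAE*, *RMSE*, and *SI* approach 0; *SI* is simply the *RMSE* normalized by the observed mean.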

### Specification of software and workstation

The models used for this study were trained and validated to assess their ability to predict aeration efficiency based on four input parameters: shape factor, *h*, *H*, and discharge. All the models used in this study were generated using MATLAB 2013a and WEKA 3.9 software on a Dell Vostro 2520 (Intel® Core™ i3-2348M CPU, 2.30 GHz (four logical CPUs), 2,048 MB RAM, Windows 8.1 Pro 64-bit) PC.

## RESULTS AND DISCUSSION

### Implementation and assessment of the MARS model

| Sr. No. | Basic function | Coefficient |
|---|---|---|
| 1 | | +0.038 |
| 2 | | −0.080 |
| 3 | | −0.016 |
| 4 | | +0.071 |
| 5 | | +0.001 |
| 6 | | −0.089 |


The MARS based model predicted the aeration efficiency at weirs with *CC* (0.9780 and 0.9519), *MAE* (0.0214 and 0.0284), *RMSE* (0.0289 and 0.0370), *SI* (0.1412 and 0.1823), and *NSE* (0.9565 and 0.9053) values in the training and validating stages, respectively.

### Implementation and assessment of the decision tree models

The RF based model was developed with the number of features (*m*) = 1 and the number of trees (*k*) = 10. The structure of the M5P based model developed for predicting aeration efficiency (*E*_{20}) at weirs is shown in Figure 5. Linear equations developed using the M5P based model to predict aeration efficiency (*E*_{20}) at weirs are listed in Table 5.

| Sr. No. | Linear model number | Linear equation |
|---|---|---|
| 1 | LM:1 | |
| 2 | LM:2 | |
| 3 | LM:3 | |
| 4 | LM:4 | |
| 5 | LM:5 | |


The values of the performance evaluation indices for the decision tree models are presented in Table 6. In the validating stage, the RF model predicts the aeration efficiency at weirs better than the other decision tree models used, with *CC* (0.9976 and 0.9653), *MAE* (0.0066 and 0.0213), *RMSE* (0.0098 and 0.0322), *SI* (0.0479 and 0.1585), and *NSE* (0.9950 and 0.9285) values in the training and validating stages, respectively.

| Models | CC | MAE | RMSE | NSE | SI |
|---|---|---|---|---|---|
| *Training data set* | | | | | |
| M5P | 0.9778 | 0.0240 | 0.0291 | 0.9560 | 0.1420 |
| RF | 0.9976 | 0.0066 | 0.0098 | 0.9950 | 0.0479 |
| RT | 0.9998 | 0.0015 | 0.0024 | 0.9997 | 0.0120 |
| *Validating data set* | | | | | |
| M5P | 0.9383 | 0.0317 | 0.0418 | 0.8793 | 0.2059 |
| RF | 0.9653 | 0.0213 | 0.0322 | 0.9285 | 0.1585 |
| RT | 0.9586 | 0.0227 | 0.0345 | 0.9178 | 0.1699 |


### Implementation and assessment of the Gaussian process model

The performance evaluation indices values for the GP models are presented in Table 7. The GP_rbf model performs better than GP_poly and GP_puk in predicting the aeration efficiency at weirs, with *CC* (0.9961 and 0.9793), *MAE* (0.0079 and 0.0195), *RMSE* (0.0122 and 0.0251), *SI* (0.0594 and 0.1238), and *NSE* (0.9923 and 0.9564) values in the training and validating stages, respectively.

| Models | CC | MAE | RMSE | NSE | SI |
|---|---|---|---|---|---|
| *Training data set* | | | | | |
| GP-poly | 0.9243 | 0.0375 | 0.0537 | 0.8494 | 0.2626 |
| GP-puk | 1.0000 | 0.0003 | 0.0003 | 1.0000 | 0.0016 |
| GP-rbf | 0.9961 | 0.0079 | 0.0122 | 0.9923 | 0.0594 |
| *Validating data set* | | | | | |
| GP-poly | 0.8604 | 0.0448 | 0.0617 | 0.7368 | 0.3040 |
| GP-puk | 0.9595 | 0.0259 | 0.0355 | 0.9132 | 0.1746 |
| GP-rbf | 0.9793 | 0.0195 | 0.0251 | 0.9564 | 0.1238 |


### Implementation and assessment of the MNRE model

The MNRE model predicted the aeration efficiency (*E*_{20}) at weirs with *CC* (0.9492 and 0.9143), *MAE* (0.0379 and 0.0393), *RMSE* (0.0450 and 0.0497), *SI* (0.2201 and 0.2450), and *NSE* (0.8942 and 0.8291) values in the training and validating stages, respectively. The plots of the observed aeration efficiency against the values predicted by MNRE for the training and validating phases are given in Figure 8. Some of the aeration efficiency values predicted by the MNRE model lie outside the ±30% error lines in both the training and validating stages; thus, this model tends to overestimate the aeration efficiency.

### Inter-comparison of the best-performing models

A comparison of the accuracy of the best-performing models from each approach (MARS, MNRE, RF among the decision trees, and GP_rbf among the GP models) in predicting the aeration efficiency at weirs is summarized in Table 8. The higher values of *CC* and *NSE* and lower values of *MAE*, *RMSE*, and *SI* indicate that the GP_rbf model outperforms the other models in this study.

| Models | CC | MAE | RMSE | NSE | SI |
|---|---|---|---|---|---|
| *Training data set* | | | | | |
| MARS | 0.9780 | 0.0214 | 0.0289 | 0.9565 | 0.1412 |
| RF | 0.9976 | 0.0066 | 0.0098 | 0.9950 | 0.0479 |
| GP-rbf | 0.9961 | 0.0079 | 0.0122 | 0.9923 | 0.0594 |
| MNRE | 0.9492 | 0.0379 | 0.0450 | 0.8942 | 0.2201 |
| *Validating data set* | | | | | |
| MARS | 0.9519 | 0.0284 | 0.0370 | 0.9053 | 0.1823 |
| RF | 0.9653 | 0.0213 | 0.0322 | 0.9285 | 0.1585 |
| GP-rbf | 0.9793 | 0.0195 | 0.0251 | 0.9564 | 0.1238 |
| MNRE | 0.9143 | 0.0393 | 0.0497 | 0.8291 | 0.2450 |


### Comparison with the literature

In earlier studies predicting the aeration efficiency of triangular weirs with ML models, Goel (2013) used support vector machines (SVMs), and Jaiswal & Goel (2020) used GP and M5P models. In both studies, the models provided promising results in predicting aeration efficiency at triangular weirs. The present research assesses the performance of several other ML models, including GP and M5P, in predicting aeration efficiency at different shapes of weirs. The weir shapes were also taken as input parameters in creating the models by introducing a shape factor. The performance of the ML models applied in the above-mentioned literature is presented in Table 9 in terms of *CC* and *RMSE* values for comparison.

| Study | Shape of weir | ML model | CC | RMSE |
|---|---|---|---|---|
| Goel (2013) | Triangular | SVM (POLY) | 0.9828 | 0.0245 |
| | | SVM (RBF) | 0.9656 | 0.0821 |
| | | Linear regression | 0.9822 | 0.0249 |
| Jaiswal & Goel (2020) | Triangular | GP_npoly | 0.9897 | 0.0184 |
| | | GP_poly | 0.9252 | 0.0496 |
| | | GP_puk | 0.9998 | 0.0025 |
| | | GP_rbf | 0.9998 | 0.0031 |
| | | M5P | 0.9998 | 0.0023 |


The results presented in Tables 6–9 show that the performance of the best-performing models from the literature (M5P, GP_rbf, and GP_puk) decreased when the shape factor was introduced as an input parameter in building the models. Nevertheless, GP_rbf outperformed all the other models and provided promising predictions with the shape factor as an input parameter, with *CC* and *RMSE* values of 0.9961 and 0.0122, respectively.

## STATISTICAL AND GRAPHICAL APPROACHES FOR COMPARING THE MODELS

### Analysis of variance through *F*-test

The single-factor analysis of variance (ANOVA) test verifies whether the means of two or more independent groups are equal, under the assumption that the groups have equal variances. The variation between two samples is insignificant if the *F*-value < *F*_{critical} and the *P*-value > *α* (=0.05). Single-factor ANOVA was performed, and the results are summarized in Table 10. The test results indicate an insignificant variation between the observed and predicted values for all the models tested in this paper, with *F*-values < *F*_{critical} and *P*-values > 0.05 in all groups.
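The same test can be reproduced with SciPy; the synthetic "predicted" series below is an assumed well-matched model, so the variation should come out insignificant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
observed = rng.normal(0.20, 0.12, 88)              # stand-in observed values
predicted = observed + rng.normal(0.0, 0.02, 88)   # a well-matched "model"

f_value, p_value = stats.f_oneway(observed, predicted)
f_critical = stats.f.ppf(0.95, dfn=1, dfd=2 * 88 - 2)  # alpha = 0.05
print(f_value < f_critical and p_value > 0.05)     # True -> insignificant variation
```

Note that this comparison of group means is a weak check: a model can pass it while still scattering badly around individual observations, which is why the error-based indices above remain the primary criteria.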

| Source of variation | F-value | P-value | F_{critical} | Insignificant variation between groups |
|---|---|---|---|---|
| Between observed and MARS | 0.000103 | 0.991939 | 3.970229 | ✓ |
| Between observed and M5P | 0.022449 | 0.881306 | 3.970229 | ✓ |
| Between observed and RF | 0.015778 | 0.900380 | 3.970229 | ✓ |
| Between observed and RT | 0.019649 | 0.888901 | 3.970229 | ✓ |
| Between observed and GP_poly | 0.055451 | 0.814486 | 3.970229 | ✓ |
| Between observed and GP_puk | 2.6 × 10⁻⁹ | 0.999959 | 3.970229 | ✓ |
| Between observed and GP_rbf | 0.003854 | 0.950669 | 3.970229 | ✓ |
| Between observed and MNRE | 0.00115 | 0.973035 | 3.970229 | ✓ |


### Interquartile range analysis

To evaluate the inconsistency in the estimation of the aeration efficiency at weirs, the 25th, 50th, and 75th percentile values of the observed and predicted aeration efficiency at weirs by the various models were assessed, as tabulated in Table 11. The interquartile range (IQR) is the difference between the 75th and 25th percentiles of the sample. The IQR of values predicted by the GP_rbf based model is in line with the IQR of the observed values. Thus, it confirms that GP_rbf predicts the aeration efficiency at the weirs with greater accuracy than the other models discussed.
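Because Table 11 lists five-number summaries, the percentiles of those summaries fall exactly on the tabulated quartiles, so the observed and GP_rbf IQRs can be reproduced as:

```python
import numpy as np

def iqr(values):
    """Interquartile range: 75th minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

# Five-number summaries (min, Q1, median, Q3, max) from Table 11
observed = [0.0112, 0.1038, 0.1968, 0.3096, 0.4320]
gp_rbf = [0.0010, 0.0930, 0.1905, 0.3110, 0.4510]
print(round(iqr(observed), 4), round(iqr(gp_rbf), 4))   # 0.2058 0.218
```

The GP_rbf IQR (0.2180) is the closest of the models to the observed IQR (0.2058), consistent with the conclusion drawn from Table 11.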

| Statistic | Observed | MARS | RF | GP_rbf | MNRE |
|---|---|---|---|---|---|
| Minimum | 0.0112 | 0.0015 | 0.0120 | 0.0010 | 0.0502 |
| Maximum | 0.4320 | 0.4637 | 0.4260 | 0.4510 | 0.4450 |
| 1st quartile (25%) | 0.1038 | 0.1069 | 0.1158 | 0.0930 | 0.1200 |
| Median (50%) | 0.1968 | 0.2062 | 0.1980 | 0.1905 | 0.2020 |
| 3rd quartile (75%) | 0.3096 | 0.2910 | 0.2788 | 0.3110 | 0.2721 |
| IQR | 0.2058 | 0.1841 | 0.1630 | 0.2180 | 0.1521 |


### Graphical methods

Two graphical methods, the Whisker plot and Taylor's diagram, were used to assess the accuracy of models used in this study in predicting the aeration efficiency at weirs.

- (i)
*Whisker plot*

- (ii)
*Taylor's diagram*

Taylor's diagram compares the models in terms of their *CC*, *RMSE*, and standard deviation values. In Taylor's diagram, the model nearest the observed point is the best-performing model. As seen in Figure 11, the GP_rbf model is closest to the observed point; thus, it predicts the aeration efficiency more accurately than all the other models used. Of the compared MARS-, RF-, GP_rbf-, and MNRE-based models, MNRE had the lowest prediction accuracy.

### Sensitivity analysis

Sensitivity analysis determines how the target variable is affected by changes in the input variables. The best-performing model (GP_rbf) was used to observe how its predictability is influenced by removing individual input parameters. The sensitivity of the predicted aeration efficiency (*E*_{20}) at weirs was investigated by examining the response of the output to each input parameter: one input parameter was removed from the training data, and the remaining input combination was supplied to the GP_rbf model. The variations in the performance evaluation indices obtained at each step, where one input parameter is removed from the training data set, are shown in Table 12. The shape factor is the most prominent parameter in estimating the aeration efficiency, with the largest variation in the performance evaluation indices (*CC* = 0.3812, *RMSE* = 0.1138, and *MAE* = 0.0895). The drop height is the second most prominent variable after the shape factor.
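The leave-one-input-out procedure can be sketched as follows; the data and target function are synthetic (with the first feature deliberately dominant), and scikit-learn's GPR stands in for the WEKA model, so only the mechanics, not the numbers, mirror the study.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)
names = ["shape factor", "h", "H", "Q"]
X = rng.uniform(size=(100, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(0.0, 0.05, 100)  # feature 0 dominates

X_tr, X_va, y_tr, y_va = X[:70], X[70:], y[:70], y[70:]
for i, name in enumerate(names):
    keep = [j for j in range(4) if j != i]          # drop one input at a time
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-2,
                                  normalize_y=True).fit(X_tr[:, keep], y_tr)
    rmse = np.sqrt(np.mean((gp.predict(X_va[:, keep]) - y_va) ** 2))
    print(f"{name:12s} removed -> RMSE {rmse:.3f}")
```

Removing the dominant input degrades the validation RMSE far more than removing the others, which is exactly the pattern Table 12 shows for the shape factor.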

| Shape factor | h (cm) | H (cm) | Q (l/s) | Target: E_{20} | CC | MAE | RMSE |
|---|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | ✓ | ✓ | 0.9793 | 0.0195 | 0.0251 |
| ✗ | ✓ | ✓ | ✓ | ✓ | 0.3812 | 0.0895 | 0.1138 |
| ✓ | ✗ | ✓ | ✓ | ✓ | 0.9844 | 0.0165 | 0.0218 |
| ✓ | ✓ | ✗ | ✓ | ✓ | 0.9198 | 0.0341 | 0.0474 |
| ✓ | ✓ | ✓ | ✗ | ✓ | 0.9854 | 0.0165 | 0.0209 |

*✓ = input retained, ✗ = input removed; the indices refer to the GP_rbf model.*


## CONCLUSIONS

This investigation studied the ability of various models (MARS, decision tree, and GP) to predict the aeration efficiency at weirs and compared them with the results of the MNRE. Various graphical presentations and goodness-of-fit parameters were used to assess the performance of the models used in this study. According to the performance evaluation results, the GP_rbf model outperformed the other implemented models in predicting the aeration efficiency at weirs, with more promising values of the goodness-of-fit parameters. Another significant outcome was that the RF model performed best among the decision tree models used in this study. Furthermore, the performance of the GP_poly model was the worst among all the models used. Based on the sensitivity analysis results using the GP_rbf model, the shape factor was the most prominent parameter, followed by drop height, for the current data.

This study found that all the models used to predict the aeration efficiency at weirs perform well, with *CC* values greater than 0.91 in the validating stage, except GP_poly, with a *CC* value of 0.86. Of all the models studied, the GP_rbf model provides the most promising predictions of aeration efficiency at weirs.

Machine learning models proved their effectiveness in predicting the aeration efficiency at weirs without requiring new experimentation on a similar setup designed to measure aeration efficiency. Using ML models, missing values in experimental data can also be predicted, and incorrect data can be identified. The current study can be extended with advanced hybrid models (e.g., an adaptive neuro-fuzzy inference system coupled with a genetic algorithm (GA) or particle swarm optimization (PSO), or support vector regression (SVR) coupled with GA or PSO) to assess their applicability in predicting aeration efficiency at weirs. Furthermore, a more extensive range of experimental and field aeration data at weirs, covering larger ranges of discharges and drop heights, should be gathered to refine the accuracy of the predictions of the models used in this study. For more precise ML predictions, optimization techniques can also be utilized to seek the input parameter values that yield the highest aeration efficiency.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.