Abstract
This paper explores the ability of multivariate adaptive regression splines, decision trees, Gaussian processes, and multiple non-linear regression equation approaches to predict the aeration efficiency at various weirs and discusses their results. In total, 126 experimental observations were collected in the laboratory, of which 88 were arbitrarily selected for model training, and the rest were used for model validation. Various graphical presentations and goodness-of-fit parameters were used to assess the performance of the models. Performance evaluation results, the Whisker plot, and Taylor's diagram indicated that the GP_rbf-based model was superior to the other implemented models in predicting the aeration efficiency of weirs, with CC (0.9961 and 0.9793), MAE (0.0079 and 0.0195), RMSE (0.0122 and 0.0251), scattering index (0.0594 and 0.1238), and Nash Sutcliffe model efficiency (0.9923 and 0.9564) values in the training and validating stages, respectively. The values predicted by GP_rbf lie within the ±30% error line in the training and validating stages, with most of them lying at or close to the line of agreement. The random forest model had better predictability than the other decision tree models applied. The sensitivity analysis of parameters suggests shape factor and drop height as the major influencing factors in predicting the aeration efficiency.
HIGHLIGHTS
Experimental study to evaluate aeration efficiency at various shapes of sharp-crested weir models.
Application of machine learning techniques to predict aeration efficiency of sharp-crested weirs.
Introduction of shape factor for different shapes of weirs as an input to ML models.
Use of graphs and goodness-of-fit parameters to assess the performance of applied ML models.
Sensitivity analysis.
INTRODUCTION
Dissolved oxygen (DO) in water is indispensable for all life forms. It is a significant indicator of the water quality needed for human utility and healthy aquatic life (Goel 2013). DO in water should be ≥4 and 5 ppm to sustain warm water and cold water aquatic life, respectively (Baylar et al. 2010). Reducing DO concentration below this minimum requirement can significantly stress the natural aquatic cycle.
The effect of governing factors such as head over the crest of weirs, drop height, weir configuration, temperature, and discharge has been investigated to identify the most efficient weir shape for oxygen transfer, and empirical relations for aeration efficiency were proposed by Van der Kroon & Schram (1969a, 1969b), Apted & Novak (1973), Avery & Novak (1978), Nakasone (1987), Gulliver et al. (1990, 1998), Gulliver & Rindels (1993), Witt & Gulliver (2012), Baylar et al. (2001a, 2001b), and Jaiswal & Goel (2019). Goel (2013) and Jaiswal & Goel (2020) also assessed various machine learning models to predict aeration efficiency at sharp-crested weirs.
The present study investigates the potential of multivariate adaptive regression splines (MARS), Gaussian process (GP), multiple non-linear regression equation (MNRE), and decision tree [random forest (RF), random tree (RT), and M5P] models to predict the aeration efficiency at various weirs. The results of the machine learning approaches were compared with those of the MNRE model. A series of experiments was performed to collect the aeration efficiency at various sharp-crested weirs under different governing parameters. The experiments were conducted on a recirculating rectangular flume in the hydraulics laboratory of CED, NIT, Kurukshetra. The governing parameters include the shape factor of weirs, head over the crest of the weir, drop height, and discharge. The correlation coefficient, mean absolute error, root mean square error, scattering index (SI), and Nash Sutcliffe model efficiency were used as goodness-of-fit parameters to evaluate the performance of the developed models.
MODELING TECHNIQUES
The modeling techniques that were investigated in this paper are briefly discussed in the following.
Multivariate adaptive regression splines





Decision tree
(i) Random forest
The RF method is a collection of tree predictors constructed from input vectors using randomly sampled vectors (Breiman 2001; Liaw & Wiener 2002). The RF algorithm, Figure (2), is a well-known general-purpose classification and regression tool based on a randomized approach in which each split is the best one found among the sampled variables (Biau & Scornet 2016). The method builds a forest by bagging a group of random trees (Mohanty et al. 2019). RF combines bagging with the random subspace method: it integrates weak trees and reaches a final decision by a majority vote in classification or by averaging the tree outputs in regression. The forest is governed by the number of decision trees to be generated (N-tree) and the number of features (j) to be tested to determine the best split. Because the RF classifier is relatively efficient and resistant to overfitting, N-tree can be as large as practical (Guan et al. 2013). Each tree is grown using about 67% of the training data, and the remaining 33%, known as out-of-bag (OOB) data, is used for validation. Hence, RF regression combines k trees, where k, the number of trees to be produced, can be any arbitrary value. The classification and regression tree (CART) algorithm generates all the trees in the forest without pruning, so each tree is allowed to grow to its full depth on its own training sample. While creating individual trees, a training set is drawn from randomly selected data, and a Gini index is used to measure the impurity level in the parameters compared with the output (Breiman et al. 1984). RF also ranks the variables by their importance in achieving the optimum RF model.
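The in-bag and out-of-bag split described above can be illustrated with a short sketch. It assumes plain bootstrap sampling (drawing as many indices as there are training records, with replacement); the function names, the seed, and the choice of 500 trees are illustrative and do not come from the study.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample (with replacement).

    Returns the in-bag indices used to grow a tree and the
    out-of-bag (OOB) indices that the tree never saw.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    seen = set(in_bag)
    oob = [i for i in range(n_samples) if i not in seen]
    return in_bag, oob


def mean_oob_fraction(n_samples, n_trees, seed=0):
    """Average share of records left out-of-bag across n_trees bootstrap draws."""
    rng = random.Random(seed)  # fixed seed only for repeatability (assumption)
    total = 0.0
    for _ in range(n_trees):
        _, oob = bootstrap_split(n_samples, rng)
        total += len(oob) / n_samples
    return total / n_trees


# 88 training records, as in this study; 500 trees is a hypothetical setting.
print(round(mean_oob_fraction(88, 500), 3))
```

Under this sampling scheme roughly one-third of the records are left out of each tree on average, which is the share the OOB validation relies on.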
(ii) M5P
In the second phase, the generated tree is pruned, as shown in Figure (3). Branches (terminal subtrees) that add little to the predictive accuracy are removed, which guards against overfitting. After pruning, the new leaves are identified from the distribution of the observations used in the learning process. At the final stage, a smoothing step compensates for the sharp discontinuities between neighboring linear models at the leaves of the tree, which frequently improves the predictability of the model.
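A minimal sketch of the smoothing step, assuming the form commonly given for M5-type model trees, in which the prediction passed up from a child node is blended with the prediction of the linear model at the current node. The smoothing constant k = 15 is a commonly quoted default and is an assumption here, as are the function names; the study does not report these settings.

```python
def smooth_prediction(p_child, q_node, n_child, k=15.0):
    """Blend a child's prediction with the current node's linear-model prediction.

    p_child : prediction passed up from the child node
    q_node  : prediction of the linear model at the current node
    n_child : number of training instances that reached the child
    k       : smoothing constant (assumed default of 15)
    """
    return (n_child * p_child + k * q_node) / (n_child + k)


def smooth_along_path(leaf_prediction, path):
    """Apply the smoothing from the leaf up to the root.

    path is a list of (q_node, n_child) pairs ordered from the
    leaf's parent to the root.
    """
    prediction = leaf_prediction
    for q_node, n_child in path:
        prediction = smooth_prediction(prediction, q_node, n_child)
    return prediction
```

The more training instances reach a child, the more weight its own prediction keeps; sparsely populated leaves are pulled toward the models higher up in the tree.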
(iii) Random tree
When pruning is not used, the RT model selects the split at every node from a specified number of randomly chosen features. RT involves little learning in itself, relying mainly on randomization; otherwise, it applies a bagging concept (Hamoud et al. 2018). The preceding subsets of each node in the RF should be equally distributed among them. The technique addresses both classification and regression problems. In classification, the RT classifier takes the input feature vector, classifies it with every tree in the forest, and returns the class label that receives the most votes. In regression, the response is the mean of the responses from all the trees in the forest (Cutler et al. 2012). Random model trees combine two machine learning algorithms: single model trees and the RF algorithm. In model trees, the linear model at every leaf is tailored to its local subdomain, which considerably improves on the efficiency of a single standard tree. RF's tree diversity is developed in two ways. First, the training data are resampled with replacement for each tree, as in bagging. Second, while a tree is grown, the best split at each node is chosen from a single random subset of all attributes. This contrasts with the traditional method of dividing each node according to the best of all possible divisions. RTs use this result as a dividing criterion to simplify the optimization process, encouraging reasonably balanced trees with a single ridge setting that works across all leaves (Barddal & Enembreck 2019).
Gaussian process



The mean vector indicates the function's central tendency and is generally taken as zero (Kuss & Rasmussen 2003). The covariance matrix describes the function's structure and form.
For non-linear decision surfaces, GPR employs a kernel function. The literature presents several kernels, and these studies suggest that polynomial kernels, Pearson VII universal kernels, and radial basis kernels perform better (Pal & Mather 2003; Gill et al. 2006). The present study uses the following kernels:
(i) Polynomial (poly) kernel
(ii) Pearson VII universal kernel (puk)
(iii) Radial basis kernel (rbf)


User-defined parameters (noise and the kernel parameters) were selected based on a large number of trials carried out with different parameter combinations.
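The three kernels can be sketched in a few lines, assuming their commonly used forms: a polynomial kernel with an additive constant of 1, the Pearson VII universal kernel with width sigma and tailing factor omega, and a radial basis kernel with width parameter gamma. These forms, the parameter names, and the default values are assumptions for illustration; the exact expressions and tuned values used in the study are not reproduced here.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def poly_kernel(x, y, d=2):
    """Polynomial kernel (x . y + 1)^d (assumed form)."""
    return (dot(x, y) + 1.0) ** d


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis kernel exp(-gamma * ||x - y||^2) (assumed form)."""
    return math.exp(-gamma * sq_dist(x, y))


def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII universal kernel (assumed form).

    K = 1 / [1 + (2 * ||x - y|| * sqrt(2^(1/omega) - 1) / sigma)^2]^omega
    """
    scaled = 2.0 * math.sqrt(sq_dist(x, y)) * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


# Two hypothetical input vectors (shape factor, h, H, Q), for illustration only.
a = [3.0, 2.5, 75.0, 2.0]
b = [3.0, 3.0, 75.0, 2.6]
print(poly_kernel(a, b), puk_kernel(a, b), rbf_kernel(a, b))
```

The puk and rbf kernels both equal 1 for identical inputs and decay with distance, which is why their width parameters have to be tuned by trial, as described above.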
Multiple non-linear regression equation


DATA STATISTICS OF GOVERNING PARAMETERS
The present study aimed to assess the applicability of various machine learning algorithms in predicting the aeration efficiency at weirs and identifying the best-performing model. This study used five shapes of sharp-crested weirs, i.e., suppressed, rectangular, trapezoidal, semicircular, and triangular, to evaluate their aeration efficiency. Experiments were performed under various governing parameters and compared to the aeration data when no weir was used.
A total of 126 experimental observations were collected for aeration efficiency at weirs under varying parameters such as head over weir (h), drop height (H), and discharge (Q). The range of parameters used for experimentation is provided in Table 1. The experiments showed that the shape of the weir has a defining role in aeration efficiency. Thus, a shape factor was defined to incorporate the weir shape as a model input, supplied as 1, 2, 3, 4, 5, and 6 for no weir, suppressed, rectangular, trapezoidal, semicircular, and triangular weirs, respectively.
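The shape factor coding can be written down as a small lookup, using the codes stated above. The helper name `make_input_row` and the order of the returned values are illustrative assumptions.

```python
# Shape factor codes as stated in the text.
SHAPE_FACTOR = {
    "no weir": 1,
    "suppressed": 2,
    "rectangular": 3,
    "trapezoidal": 4,
    "semicircular": 5,
    "triangular": 6,
}


def make_input_row(shape, h_cm, H_cm, Q_lps):
    """Build one model input row: [shape factor, h (cm), H (cm), Q (l/s)]."""
    key = shape.strip().lower()
    if key not in SHAPE_FACTOR:
        raise ValueError(f"unknown weir shape: {shape!r}")
    return [SHAPE_FACTOR[key], float(h_cm), float(H_cm), float(Q_lps)]


# Hypothetical observation, for illustration only.
print(make_input_row("Triangular", 3.0, 75, 2.5))
```

Note that these codes are ordinal labels, so a model treats the shapes as lying on a numeric scale in the order listed.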
Table 1. Range of parameters used for experimentation
Parameter | Notation | Unit | Range |
---|---|---|---|
Head over weir | h | cm | 1–5.5 |
Drop height | H | cm | 60, 75, and 90 |
Discharge | Q | l/s | 0–6 |
The correlation matrix among the governing variables was calculated using Pearson's method and is summarized in Table 2. The table reveals that the shape factor has the highest correlation (0.844) with aeration efficiency (E20), followed by the head over weir, discharge, and drop height. Thus, the shape factor is the most influential parameter for estimating aeration efficiency.
Table 2. Correlation matrix among explanatory variables
Variables | Shape factor | h (cm) | H (cm) | Q (l/s) | E20 |
---|---|---|---|---|---|
Shape factor | 1 | | | | |
h (cm) | 0 | 1 | | | |
H (cm) | 0 | 0 | 1 | | |
Q (l/s) | −0.010 | 0.994 | 0.013 | 1 | |
E20 | 0.844 | 0.323 | 0.265 | −0.313 | 1 |
Out of 126 experimental data, 88 were used in the training stage for model generation, and 38 were used for model validation. In the present computational analysis, shape factor, h (cm), H (cm), and Q (l/s) were considered as the input parameters, whereas aeration efficiency was regarded as the output/target parameter. Table 3 summarizes the descriptive data statistics of model components such as minimum, maximum, mean, standard deviation, kurtosis, skewness, and confidence level (95%) for training and validating data sets.
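A minimal sketch of an 88/38 split, assuming a simple random shuffle of the records. The study states only that the training observations were selected arbitrarily, so the shuffle, the seed, and the function name are assumptions.

```python
import random


def split_train_validation(records, n_train=88, seed=42):
    """Randomly split records into training and validating subsets.

    The seed is fixed only to make the sketch repeatable (assumption).
    """
    if n_train > len(records):
        raise ValueError("n_train exceeds the number of records")
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    train = [records[i] for i in indices[:n_train]]
    validation = [records[i] for i in indices[n_train:]]
    return train, validation


# Placeholder records standing in for the 126 observations.
records = list(range(126))
train, validation = split_train_validation(records)
print(len(train), len(validation))
```

Comparing descriptive statistics of the two subsets, as done in Table 3, is a quick check that a random split has not left either subset unrepresentative.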
Table 3. Descriptive statistics of governing parameters for training and validating data sets
Statistic | Shape factor | h (cm) | H (cm) | Q (l/s) | E20 | Data set |
---|---|---|---|---|---|---|
Minimum | 1 | 1 | 60 | 0.486 | 0.0113 | Training |
Minimum | 1 | 1 | 60 | 0.486 | 0.0112 | Validating |
Maximum | 6 | 5.33 | 90 | 5.981 | 0.489 | Training |
Maximum | 6 | 5.33 | 90 | 5.981 | 0.432 | Validating |
Mean | 3.511 | 3.240 | 75.170 | 3.048 | 0.205 | Training |
Mean | 3.474 | 3.311 | 74.605 | 3.135 | 0.203 | Validating |
Standard deviation | 1.800 | 1.504 | 12.351 | 1.898 | 0.139 | Training |
Standard deviation | 1.520 | 1.515 | 12.323 | 1.938 | 0.122 | Validating |
Kurtosis | −1.381 | −1.402 | −1.526 | −1.443 | −0.994 | Training |
Kurtosis | −0.968 | −1.343 | −1.515 | −1.401 | −0.999 | Validating |
Skewness | −0.019 | −0.116 | −0.021 | 0.087 | 0.209 | Training |
Skewness | 0.053 | −0.094 | 0.050 | 0.138 | 0.122 | Validating |
Confidence level (95%) | 0.381 | 0.319 | 2.617 | 0.402 | 0.029 | Training |
Confidence level (95%) | 0.499 | 0.498 | 4.051 | 0.637 | 0.040 | Validating |
Performance evaluation indices
Five statistical parameters were employed to evaluate the accuracy of machine learning (ML) based models. These statistical parameters are coefficient of correlation (CC), mean absolute error (MAE), root mean square error (RMSE), Nash Sutcliffe model efficiency (NSE), and SI. These statistical parameters quantify the best fit between the observed and predicted data for all the applied models. The formulas of these indicators are as listed in Equations (11)–(15).
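(i) Coefficient of correlation (CC):
$$CC = \frac{\sum_{i=1}^{n}(O_i-\bar{O})(P_i-\bar{P})}{\sqrt{\sum_{i=1}^{n}(O_i-\bar{O})^2\,\sum_{i=1}^{n}(P_i-\bar{P})^2}} \tag{11}$$
(ii) Mean absolute error (MAE):
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|O_i-P_i\right| \tag{12}$$
(iii) Root mean square error (RMSE):
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(O_i-P_i)^2} \tag{13}$$
(iv) Nash Sutcliffe model efficiency (NSE):
$$NSE = 1-\frac{\sum_{i=1}^{n}(O_i-P_i)^2}{\sum_{i=1}^{n}(O_i-\bar{O})^2} \tag{14}$$
(v) Scattering index (SI):
$$SI = \frac{RMSE}{\bar{O}} \tag{15}$$
where $O_i$ and $P_i$ are the observed and predicted values of aeration efficiency, $\bar{O}$ and $\bar{P}$ are their means, and $n$ is the number of observations.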
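The five indices can be computed with a short function. This is a sketch under their standard definitions (Pearson correlation for CC, and SI taken as RMSE divided by the mean of the observed values); the function name `evaluate` and the sample values are illustrative assumptions.

```python
import math


def evaluate(observed, predicted):
    """Return CC, MAE, RMSE, NSE, and SI for paired observed/predicted values.

    Assumes the observed values are not all identical and have a non-zero mean.
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [o - p for o, p in zip(observed, predicted)]
    sse = sum(e * e for e in errors)  # sum of squared errors
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    rmse = math.sqrt(sse / n)
    return {
        "CC": s_op / math.sqrt(s_oo * s_pp),
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": rmse,
        "NSE": 1.0 - sse / s_oo,
        "SI": rmse / mean_o,
    }


# Hypothetical values, for illustration only.
print(evaluate([0.10, 0.20, 0.30, 0.40], [0.12, 0.19, 0.33, 0.38]))
```

CC alone can be misleading: a prediction with a constant offset still has CC = 1, while NSE, RMSE, and SI all penalize the bias, which is why the indices are reported together.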

Specification of software and workstation
The models used for this study were trained and validated to assess their ability to predict aeration efficiency based on four input parameters: shape factor, h, H, and discharge. All the models used in this study were generated using MATLAB 2013a and WEKA 3.9 software on a Dell Vostro 2520 (Intel® Core™ i3-2348HQ CPU, 2.30 GHz (four CPUs), 2048 MB RAM, Windows 8.1 Pro 64-bit) PC.
RESULTS AND DISCUSSION
Implementation and assessment of the MARS model
Table 4. Basis functions and related coefficients of the MARS model
Sr. No. | Basis function | Coefficient |
---|---|---|
1 | ![]() | +0.038 |
2 | ![]() | −0.080 |
3 | ![]() | −0.016 |
4 | ![]() | +0.071 |
5 | ![]() | +0.001 |
6 | ![]() | −0.089 |

Plot for observed and predicted aeration efficiency at weirs using the MARS model for (a) training stage and (b) validating stage.
Implementation and assessment of the decision tree models
Table 5. Linear equations developed using the M5P model
Sr. No. | Linear model number | Linear equation |
---|---|---|
1 | LM:1 | ![]() |
2 | LM:2 | ![]() |
3 | LM:3 | ![]() |
4 | LM:4 | ![]() |
5 | LM:5 | ![]() |
M5P model developed for predicting aeration efficiency (E20) at weirs.

Plot for observed and predicted aeration efficiency at weirs using decision tree for (a) training data stage and (b) validating data stage.
The values of the performance evaluation indices for the decision tree models are presented in Table 6. On the validating data set, the RF model predicts the aeration efficiency at weirs better than the other decision tree models used, with CC (0.9976 and 0.9653), MAE (0.0066 and 0.0213), RMSE (0.0098 and 0.0322), SI (0.0479 and 0.1585), and NSE (0.9950 and 0.9285) values in the training and validating stages, respectively.
Table 6. Performance of the decision tree models in training and validating stages
Model | Data set | CC | MAE | RMSE | NSE | SI |
---|---|---|---|---|---|---|
M5P | Training | 0.9778 | 0.0240 | 0.0291 | 0.9560 | 0.1420 |
RF | Training | 0.9976 | 0.0066 | 0.0098 | 0.9950 | 0.0479 |
RT | Training | 0.9998 | 0.0015 | 0.024 | 0.9997 | 0.0120 |
M5P | Validating | 0.9383 | 0.0317 | 0.0418 | 0.8793 | 0.2059 |
RF | Validating | 0.9653 | 0.0213 | 0.0322 | 0.9285 | 0.1585 |
RT | Validating | 0.9586 | 0.0227 | 0.0345 | 0.9178 | 0.1699 |
Implementation and assessment of the Gaussian process model

Plot for observed and predicted aeration efficiency at weirs using GP models for (a) training stage and (b) validating stage.
The performance evaluation indices for the GP models are presented in Table 7. The GP_rbf model performs better than GP_poly and GP_puk in predicting the aeration efficiency at weirs, with CC (0.9961 and 0.9793), MAE (0.0079 and 0.0195), RMSE (0.0122 and 0.0251), SI (0.0594 and 0.1238), and NSE (0.9923 and 0.9564) values in the training and validating stages, respectively.
Table 7. Performance of the GP models for training and validating stages
Model | Data set | CC | MAE | RMSE | NSE | SI |
---|---|---|---|---|---|---|
GP_poly | Training | 0.9243 | 0.0375 | 0.0537 | 0.8494 | 0.2626 |
GP_puk | Training | 1.0000 | 0.0003 | 0.0003 | 1.0000 | 0.0016 |
GP_rbf | Training | 0.9961 | 0.0079 | 0.0122 | 0.9923 | 0.0594 |
GP_poly | Validating | 0.8604 | 0.0448 | 0.0617 | 0.7368 | 0.3040 |
GP_puk | Validating | 0.9595 | 0.0259 | 0.0355 | 0.9132 | 0.1746 |
GP_rbf | Validating | 0.9793 | 0.0195 | 0.0251 | 0.9564 | 0.1238 |
Implementation and assessment of the MNRE model

Plot for observed and predicted aeration efficiency at weirs using the MNRE model for (a) training stage and (b) validating stage.
Inter-comparison of the best-performing models
A comparison of the accuracy of the best-performing models (MARS, MNRE, decision trees, and GP) used for this study in predicting the aeration efficiency at weirs is summarized in Table 8. In the validating stage, the GP_rbf model has the highest CC and NSE and the lowest MAE, RMSE, and SI, indicating that it outperforms the other models in this study; in the training stage, the RF model attains slightly better values.
Table 8. Performance evaluation of the MARS, RF, GP_rbf, and MNRE models for training and validating stages
Model | Data set | CC | MAE | RMSE | NSE | SI |
---|---|---|---|---|---|---|
MARS | Training | 0.9780 | 0.0214 | 0.0289 | 0.9565 | 0.1412 |
RF | Training | 0.9976 | 0.0066 | 0.0098 | 0.9950 | 0.0479 |
GP_rbf | Training | 0.9961 | 0.0079 | 0.0122 | 0.9923 | 0.0594 |
MNRE | Training | 0.9492 | 0.0379 | 0.0450 | 0.8942 | 0.2201 |
MARS | Validating | 0.9519 | 0.0284 | 0.0370 | 0.9053 | 0.1823 |
RF | Validating | 0.9653 | 0.0213 | 0.0322 | 0.9285 | 0.1585 |
GP_rbf | Validating | 0.9793 | 0.0195 | 0.0251 | 0.9564 | 0.1238 |
MNRE | Validating | 0.9143 | 0.0393 | 0.0497 | 0.8291 | 0.2450 |

A plot between observed and predicted values of aeration efficiency at weirs using MARS, RF, GP_rbf, and MNRE for (a) training stage and (b) validating stage.
Comparison with the literature
In earlier studies predicting the aeration efficiency of triangular weirs with ML models, Goel (2013) used support vector machines (SVMs), and Jaiswal & Goel (2020) used GP and M5P models. In both studies, the models provided promising results in predicting aeration efficiency at triangular weirs. The present research assesses the performance of several other ML models, in addition to GP and M5P, in predicting aeration efficiency at different shapes of weirs. The weir shapes were also taken as input parameters in creating the models by introducing a shape factor. The performance of the ML models applied in the above-mentioned literature is presented in Table 9 in terms of CC and RMSE values for comparison.
Table 9. Predictability of various ML models used in the literature to predict aeration efficiency
Study | Shape of weir | ML model | CC | RMSE |
---|---|---|---|---|
Goel (2013) | Triangular | SVM (POLY) | 0.9828 | 0.0245 |
Goel (2013) | Triangular | SVM (RBF) | 0.9656 | 0.0821 |
Goel (2013) | Triangular | Linear regression | 0.9822 | 0.0249 |
Jaiswal & Goel (2020) | Triangular | GP_npoly | 0.9897 | 0.0184 |
Jaiswal & Goel (2020) | Triangular | GP_poly | 0.9252 | 0.0496 |
Jaiswal & Goel (2020) | Triangular | GP_puk | 0.9998 | 0.0025 |
Jaiswal & Goel (2020) | Triangular | GP_rbf | 0.9998 | 0.0031 |
Jaiswal & Goel (2020) | Triangular | M5P | 0.9998 | 0.0023 |
The results presented in Tables 6–9 show that the performance of the best-performing models in the literature (M5P, GP_rbf, and GP_puk) decreased when the shape factor was introduced as an input parameter in building the model. Even so, GP_rbf outperformed all other models and gave promising predictions with the shape factor as an input parameter, with CC and RMSE values of 0.9961 and 0.0122, respectively, in the training stage.
STATISTICAL AND GRAPHICAL APPROACHES FOR COMPARING THE MODELS
Analysis of variance through F-test
The F-test rests on the premise that the ratio of the variances of the samples in each pair follows the F-distribution. The single-factor analysis of variance (ANOVA) test is used to verify whether or not the means of two or more independent groups are equal. The variation between two samples is considered insignificant if the F-value < Fcritical and the P-value > α (= 0.05). Single-factor ANOVA was performed for each pair of observed and predicted values and is summarized in Table 10. The test results indicate an insignificant variation between observed and predicted values for all the models tested in this paper, with F-values < Fcritical and P-values > 0.05 in all groups.
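The F-value of a single-factor ANOVA can be sketched as the ratio of the between-group to the within-group mean square. The function names are illustrative assumptions, and the critical value is taken as tabulated in Table 10 rather than computed.

```python
def one_way_anova_f(groups):
    """Return (F-value, df_between, df_within) for a single-factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the values around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


def is_insignificant(f_value, f_critical=3.970229):
    """Decision rule from the text: insignificant variation if F < Fcritical."""
    return f_value < f_critical


# Hypothetical observed and predicted values, for illustration only.
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.19, 0.33, 0.38]
f_value, df_b, df_w = one_way_anova_f([observed, predicted])
print(f_value, df_b, df_w, is_insignificant(f_value))
```

Because this test compares only the group means, a small F-value shows that a model is unbiased on average; it does not by itself show that individual predictions are accurate.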
Table 10. Single-factor ANOVA results among observed and predicted values using all applied models
Source of variation | F-value | P-value | Fcritical | Insignificant variation between groups |
---|---|---|---|---|
Between observed and MARS | 0.000103 | 0.991939 | 3.970229 | ✓ |
Between observed and M5P | 0.022449 | 0.881306 | 3.970229 | ✓ |
Between observed and RF | 0.015778 | 0.900380 | 3.970229 | ✓ |
Between observed and RT | 0.019649 | 0.888901 | 3.970229 | ✓ |
Between observed and GP_poly | 0.055451 | 0.814486 | 3.970229 | ✓ |
Between observed and GP_puk | 2.6E−09 | 0.999959 | 3.970229 | ✓ |
Between observed and GP_rbf | 0.003854 | 0.950669 | 3.970229 | ✓ |
Between observed and MNRE | 0.00115 | 0.973035 | 3.970229 | ✓ |
Interquartile range analysis
To evaluate the inconsistency in the estimation of the aeration efficiency at weirs, the 25th, 50th, and 75th percentile values of the observed and predicted aeration efficiency at weirs by the various models were assessed, as tabulated in Table 11. The interquartile range (IQR) is the difference between the 75th and 25th percentiles of the sample. The IQR of values predicted by the GP_rbf based model is in line with the IQR of the observed values. Thus, it confirms that GP_rbf predicts the aeration efficiency at the weirs with greater accuracy than the other models discussed.
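The IQR comparison can be reproduced from the quartiles reported in Table 11. In this sketch the helper names are illustrative, and the use of the "inclusive" quantile method in `iqr` is an assumption, since the study does not state how its percentiles were computed.

```python
import statistics


def iqr(values):
    """Interquartile range of raw values (inclusive quantile method assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# (1st quartile, 3rd quartile) pairs as reported in Table 11.
QUARTILES = {
    "Observed": (0.1038, 0.3096),
    "MARS": (0.1069, 0.2910),
    "RF": (0.1158, 0.2788),
    "GP_rbf": (0.0930, 0.3110),
    "MNRE": (0.1200, 0.2721),
}


def closest_iqr_to_observed(quartiles):
    """Name of the model whose IQR differs least from the observed IQR."""
    observed_iqr = quartiles["Observed"][1] - quartiles["Observed"][0]
    gaps = {
        name: abs((q3 - q1) - observed_iqr)
        for name, (q1, q3) in quartiles.items()
        if name != "Observed"
    }
    return min(gaps, key=gaps.get)


print(closest_iqr_to_observed(QUARTILES))
```

With the Table 11 values, the GP_rbf IQR (0.2180) differs from the observed IQR (0.2058) by about 0.012, the smallest gap among the four models.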
Table 11. Quantitative statistics of observed values and predicted values by MARS, RF, GP_rbf, and MNRE models
Statistic | Observed | MARS | RF | GP_rbf | MNRE |
---|---|---|---|---|---|
Minimum | 0.0112 | 0.0015 | 0.0120 | 0.0010 | 0.0502 |
Maximum | 0.4320 | 0.4637 | 0.4260 | 0.4510 | 0.4450 |
1st Quartile (=25%) | 0.1038 | 0.1069 | 0.1158 | 0.0930 | 0.1200 |
Median (=50%) | 0.1968 | 0.2062 | 0.1980 | 0.1905 | 0.2020 |
3rd Quartile (=75%) | 0.3096 | 0.2910 | 0.2788 | 0.3110 | 0.2721 |
IQR | 0.2058 | 0.1841 | 0.1630 | 0.2180 | 0.1521 |
Graphical methods
Two graphical methods, the Whisker plot and Taylor's diagram, were used to assess the accuracy of models used in this study in predicting the aeration efficiency at weirs.
(i) Whisker plot
(ii) Taylor's diagram
Whisker plot for the observed and predicted values for the validating stage.
Taylor's diagram among observed and predicted values for the validating stage.
Sensitivity analysis
Sensitivity analysis determines how the target variable is affected by changes in input variables. The best-performing model (GP_rbf) was used to observe how its predictability changes when an input parameter is removed. One input parameter at a time was removed from the training data, and the remaining input combination was supplied to the GP_rbf model. The resulting performance evaluation indices for each case are shown in Table 12. Removing the shape factor degrades the indices the most (CC falls from 0.9793 to 0.3812, while RMSE and MAE rise to 0.1138 and 0.0895, respectively), so the shape factor is the most prominent parameter in estimating the aeration efficiency (E20). The drop height is the second most prominent variable: removing it lowers CC to 0.9198 and raises RMSE to 0.0474.
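The ranking of inputs can be reproduced from the indices reported in Table 12. The sketch below ranks each input by how much RMSE increases when that input is removed; using the RMSE increase as the ranking criterion, and the function name, are assumptions made for illustration.

```python
# Indices of the GP_rbf model with all four inputs (Table 12).
BASELINE = {"CC": 0.9793, "MAE": 0.0195, "RMSE": 0.0251}

# Indices after removing one input at a time (Table 12).
WITHOUT = {
    "Shape factor": {"CC": 0.3812, "MAE": 0.0895, "RMSE": 0.1138},
    "h": {"CC": 0.9844, "MAE": 0.0165, "RMSE": 0.0218},
    "H": {"CC": 0.9198, "MAE": 0.0341, "RMSE": 0.0474},
    "Q": {"CC": 0.9854, "MAE": 0.0165, "RMSE": 0.0209},
}


def rank_inputs_by_rmse_increase(baseline, without):
    """Order inputs from most to least influential by the RMSE increase on removal."""
    increase = {name: m["RMSE"] - baseline["RMSE"] for name, m in without.items()}
    return sorted(increase, key=increase.get, reverse=True)


print(rank_inputs_by_rmse_increase(BASELINE, WITHOUT))
```

With the Table 12 values the shape factor ranks first and drop height second, while removing h or Q leaves RMSE essentially unchanged.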
Table 12. Sensitivity analysis for parametric variation using the GP_rbf model
Shape factor | h (cm) | H (cm) | Q (l/s) | E20 (target) | CC | MAE | RMSE |
---|---|---|---|---|---|---|---|
✓ | ✓ | ✓ | ✓ | ✓ | 0.9793 | 0.0195 | 0.0251 |
✗ | ✓ | ✓ | ✓ | ✓ | 0.3812 | 0.0895 | 0.1138 |
✓ | ✗ | ✓ | ✓ | ✓ | 0.9844 | 0.0165 | 0.0218 |
✓ | ✓ | ✗ | ✓ | ✓ | 0.9198 | 0.0341 | 0.0474 |
✓ | ✓ | ✓ | ✗ | ✓ | 0.9854 | 0.0165 | 0.0209 |
CONCLUSIONS
This investigation aimed to study the ability of various models (MARS, decision tree, and GP) to predict the aeration efficiency at weirs. The same is compared with the results of the MNRE. Various graphical presentations and goodness-of-fit parameters were used to assess the performance of the models used in this study. According to performance evaluation results, the GP_rbf model outperformed the other implemented models in predicting the aeration efficiency at weirs with more promising values of goodness-of-fit parameters. Another significant outcome was that the RF model performed best among other decision tree models used in this study. Furthermore, the performance of the GP_poly model was the worst among all models used. Based on the sensitivity analysis results using the GP_rbf model, the shape factor was the most prominent parameter, followed by drop height in current data.
This study found that all the models used to predict the aeration efficiency at weirs perform well, with validating-stage CC values greater than 0.91, except GP_poly, with a CC value of 0.86. Of all the models studied, the GP_rbf model provides the most promising predicted values of aeration efficiency at weirs.
Machine learning models proved effective in predicting the aeration efficiency at weirs without any new experimentation on a similar setup designed to measure aeration efficiency. Using ML models, missing values in experimental data can also be predicted, and incorrect data can be identified. The current study can be extended with advanced hybrid models (adaptive neuro-fuzzy inference system with genetic algorithm (GA) or particle swarm optimization (PSO), support vector regression (SVR) with GA or PSO, etc.) to assess their applicability in predicting aeration efficiency at weirs. Furthermore, more extensive experimental and field aeration data at weirs, covering a larger range of discharges and drop heights, should be gathered to refine the accuracy of the predictions of the models used for this study. Optimization techniques can also be used to seek the values of the input parameters that give the highest aeration efficiency.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.