ABSTRACT
This study optimizes standard oxygen transfer efficiency (SOTE) in Venturi flumes by investigating the impact of key parameters: discharge per unit width (q), throat width (W), throat length (F), upstream entrance width (E), and gauge readings (Ha and Hb). To this end, a comprehensive experimental dataset was analysed using multiple linear regression (MLR), multiple nonlinear regression (MNLR), gradient boosting machine (GBM), extreme gradient boosting (XRT), random forest (RF), M5 (pruned and unpruned), random tree (RT), and reduced error pruning (REP). Model performance was evaluated using three metrics: correlation coefficient (CC), root mean square error (RMSE), and mean absolute error (MAE). Among the proposed models, M5_Unprun emerged as the top performer, with the highest CC (0.9455) and the lowest RMSE (0.1918), together with an MAE of 0.0030. GBM followed closely with a CC of 0.9372, an RMSE of 0.2067, and an MAE of 0.0006. Uncertainty analysis further supported the superior performance of M5_Unprun (0.7522) and GBM (0.8055), which had narrower prediction bands than the other models; MLR exhibited the widest band (1.4320). One-way analysis of variance confirmed the reliability and robustness of the proposed models. Sensitivity, correlation, and SHapley Additive exPlanations analyses identified W and Hb as the most influential factors.
HIGHLIGHTS
The study compared various machine learning models to predict standard oxygen transfer efficiency (SOTE) in Venturi flumes.
M5_Unprun and gradient boosting machine (GBM) consistently demonstrated superior performance in predicting SOTE.
Uncertainty analysis confirmed the superiority of M5_Unprun and GBM.
Sensitivity analysis identified throat width as the most significant factor influencing SOTE.
One-way analysis of variance indicated no significant differences between predicted and experimental values.
INTRODUCTION
Oxygen transfer, the process of introducing air into water to increase its oxygen concentration, is crucial for healthy aquatic ecosystems (Luxmi et al. 2023a). Minimum dissolved oxygen (DO) levels are necessary for various bodies of water like lakes and rivers. Hydraulic structures in these environments can naturally enhance DO by creating turbulence. This turbulence traps air bubbles, increasing the water's surface area for oxygen exchange and improving oxygen transfer efficiency (Baylar & Bagatur 2000). Maintaining adequate DO is vital for aquaculture. Aerators in water systems can provide sufficient oxygen, promoting the growth of aquatic life and improving overall water quality (Yadav et al. 2021a). In essence, DO levels act as an indicator of both water pollution and the ecosystem's ability to sustain aquatic life. Higher DO signifies better water quality, while natural processes that consume oxygen can lower these levels (Baylar & Ozkan 2006).
Venturi flumes, with their unique converging-throat-diverging sections, offer an efficient alternative for prismatic channels (Sihag et al. 2021). Traditional aeration methods like stepped spillways are effective but have limitations in specific channel types (Cone 1917). In contrast, Venturi flumes enhance aeration by accelerating water through a narrow throat, generating turbulence that mixes air and water efficiently. This method is particularly suitable for applications requiring precise oxygenation, such as wastewater treatment and aquaculture.
These flumes minimize head loss while promoting turbulent flow within the throat section (Cone 1917). This turbulence, a well-documented phenomenon in hydraulic structures (Avery & Novak 1978; Gulliver et al. 1990, 1997, 1998; Chanson 1995; Felder & Chanson 2015), significantly enhances oxygen transfer from the atmosphere into the water, improving overall DO levels in streams. Research demonstrates the effectiveness of Venturi flumes for oxygen transfer in low-slope channels (Dursun 2016). These flumes outperform traditional methods like Parshall flumes across various geometries (throat width, length, and sill height). Additionally, studies have explored optimizing Venturi flumes for aeration performance. Yadav et al. (2021b) investigated the influence of geometrical parameters like throat length, air holes, and convergence/divergence angles using non-dimensional techniques. Furthermore, Yadav et al. (2022) conducted an economic evaluation, comparing a proposed Venturi aerator to existing options across different pond sizes, initial DO concentrations, and operating hours. Research has compared the aeration efficiency of various water treatment systems (Puri et al. 2023a). Venturi flumes are a common option, alongside weirs, conduits, and stepped channels. Baylar et al. (2010) categorized these systems into high-head flow (like Venturi nozzles) and free-surface flow (like weirs) based on their configurations, emphasizing the role of air/water flow ratio and overall aeration efficiency. Studies have also explored factors influencing efficiency within specific systems. For example, Baylar et al. (2011) found a positive correlation between energy dissipation and aeration efficiency in stepped chutes. Similarly, Wormleaton & Tsang (2000) demonstrated that labyrinth weirs, with their unique design, significantly outperform straight weirs in aeration, especially at lower drop heights. 
Weir geometry also plays a role, with Baylar & Bagatur (2000) showing that triangular notch weirs generally have better aeration efficiency compared to other shapes. Researchers have explored methods to predict and optimize aeration efficiency in different water treatment systems. Khdhiri et al. (2014) developed a correlation for stepped cascades, considering changes in flow regimes that affect oxygen transfer. Bostan et al. (2013) investigated the impact of jet interaction on aeration efficiency in hydraulic jumps. Venturi devices have been identified as particularly effective for aeration due to their ability to entrain air (Baylar & Ozkan 2006). Studies on weirs have also shown efficiency improvements. Küçükali (2023) found a positive correlation between discharge rate and aeration efficiency in 90° V-notch weirs, emphasizing the role of free-surface macro turbulence. He also proposed a formula for efficiency prediction. Similarly, Ozkan et al. (2014) focused on high-head conduits, discovering an optimal air-demand ratio for efficiency based on the Froude number and proposing a design formula.
Machine learning (ML) models have been increasingly utilized in analysing Venturi flumes, focusing on improving accuracy in flow measurement. Sihag et al. (2021) assessed the predictive capability of soft computing methods, including random forest (RF), M5P, multivariate adaptive regression splines, and group method of data handling, for estimating the aeration efficiency of Parshall and modified Venturi flumes. They aimed to enhance precision in aeration efficiency predictions through advanced computational techniques.
Sangeeta et al. (2021) conducted a comparative analysis of advanced ML models, including K-nearest neighbour (KNN), random forest regression (RFR), and decision tree regression (DTR), to identify the most accurate method for predicting aeration efficiency in Parshall flumes. Their research aimed to optimize predictive models to ensure reliable assessments of aeration efficiency across diverse hydraulic conditions. Tiwari (2021) investigated the application of artificial neural networks (ANNs) and adaptive neuro-fuzzy inference system (ANFIS) models to estimate aeration efficiency in hydraulic jumps under sluice gates. The study aimed to enhance understanding and prediction capabilities in hydraulic engineering through the use of computational modelling. Accurate assessment of aeration efficiency is crucial for effective water management strategies. Baylar et al. (2009) demonstrated the effectiveness of least square support vector machine (LS-SVM) models in predicting the air entrainment rate and the aeration efficiency of weirs. These models outperformed other regression models in predicting the impact of plunging overfall jets on water quality. In the same year, Hanbay et al. (2009) employed LS-SVM models to forecast flow conditions and aeration efficiency in stepped cascades, utilizing critical flow depth, step height, and channel slope as input parameters. Later, Gerger et al. (2017) compared five data-driven techniques: feedforward neural network, radial basis neural network, generalized regression neural network, ANFIS with subtractive clustering, and ANFIS with fuzzy c-means clustering to estimate oxygen transfer efficiency in baffled chutes with various baffle block designs (stepped (S), wedge (W), trapezoidal (T), and T-shaped (T-S)). 
In the same year, Jahromi & Khiadani (2017) evaluated flow characteristics, including the volumetric oxygen transfer coefficient (VOTC) and standard oxygen transfer efficiency (SOTE), by considering factors such as the water jet to cross-flow velocity ratio, jet fall height, cross-flow depth, and jet impact angle. Srinivas & Tiwari (2022a) evaluated the performance of the ANFIS, backpropagation neural network, and deep neural network (DNN) in predicting the oxygen aeration efficiency of gabion spillways. Additionally, Srinivas & Tiwari (2022b) compared gradient boosting machine (GBM), neural network (NN), and DNN for the same purpose. Both studies aimed to enhance the predictive accuracy of oxygen aeration in gabion spillways using advanced ML methods. Tiwari et al. (2023) utilized experimental data to evaluate the performance of multiple linear regression (MLR), NNs, neuro-fuzzy systems, and DNNs in estimating the aeration performance efficiency of gabion weirs. In the same year, Luxmi et al. (2023a, 2023b) conducted a comparative analysis of various models for predicting gabion weir performance. Their findings indicated that neuro-fuzzy, NN, and Gaussian process regression models were effective for oxygen aeration efficiency, while ensemble models like bagging, boosting, and stacking were useful for predicting mass oxygen transfer, with RF-based bagging demonstrating superior performance. Puri et al. (2023b) employed ML methods such as reduced error pruning tree, RF, and M5P to forecast aeration efficiency through circular plunging jets. In the following year, Puri et al. (2024) combined the development of three ML models – ANN, M5P, and RF – with an experimental investigation to estimate aeration efficiency under various conditions of channel inclination angle, discharge, jet count, Froude number, and hydraulic radius for each jet.
Yadav et al. (2024) investigated the use of ANN technology in conjunction with optimization techniques like genetic algorithms (GAs) and particle swarm optimization to maximize the aeration efficiency of a Venturi aerator. Saha et al. (2024) assessed optimal aeration efficiency in stepwise cascade systems through laboratory experiments, employing KNN, gradient boosting regressor, DTR, and RFR to predict aeration efficiency. Oxygen transfer in Venturi flumes is a critical aspect of water treatment and aeration systems. He et al. (2003) measured oxygen transfer capacity in clean water using desorption and absorption techniques in a small-scale tank. The absorption method yielded a mean standard oxygen transfer coefficient of 8.60 h⁻¹ and a SOTE range of 4.5–4.9%, highlighting the significant influence of water depth on SOTE. Soreanu et al. (2010) demonstrated that non-porous, hollow fibre, gas-permeable (GP) membranes exhibit significantly higher oxygen transfer efficiency compared to conventional aerators. GP membranes achieved an SOTE of 70.6% under sulphite oxidation and 52.2% under re-aeration, indicating greater energy efficiency for aeration in water treatment plants.
Pham et al. (2008) investigated the impact of carrier media on standard oxygen transfer efficiency in clean water (SOTEcw) using coarse and fine bubble air diffusers in an 840-gallon test tank. Their findings revealed that fine bubble diffusers achieved approximately twice the SOTEcw of coarse bubble diffusers at up to 50% media fill, highlighting the superior efficiency of fine bubble diffusers in influencing SOTEcw. Gillot et al. (2005) investigated oxygen mass transfer in biofilters by tracking off-gas oxygen during excess sulphite oxidation in a 250 L pilot-scale unit. They found comparable SOTE to standardized methods without cobalt addition, proposed a conductivity-based SOTE relationship, and explored the effects of gas and liquid velocities on mass transfer efficiency. Ashley et al. (2014) investigated the Speece Cone hypolimnetic aerator, observing that higher gas-to-water flow rate ratios enhanced the oxygen transfer coefficient (KLa), standard oxygen transfer rate (SOTR), and standard aeration efficiency but diminished SOTE. Optimal performance was achieved at specific inlet water velocities and flow rate ratios, suggesting the need for further optimization to maximize efficiency. Barreto et al. (2018) investigated a pilot-scale superoxygenation system, achieving nearly 100% SOTRs and SOTEs in clean water and mixed liquor. However, they observed decreased efficiency at higher oxygen flow rates, with alpha factors ranging from 0.6 to 1.0, optimized through oxygen flow rate adjustments. Kujawiak et al. (2020) developed a hybrid barbotage reactor with various structural designs and found that placing the aeration nozzle mid-length in the 50-mm-diameter column maximized aeration efficiency. Moving bed variants (20 and 40%) generally did not significantly affect aeration efficiency despite decreasing water velocity.
Kumar et al. (2021) studied the effects of jet velocity, jet length, water depth, and jet thickness of plunging hollow jets on water oxygenation in an aeration tank. They proposed empirical correlations for determining the VOTC and SOTE based on jet velocities and kinetic powers.
Background and context
This research addresses the ongoing challenge of accurately modelling and optimizing SOTE in Venturi flumes, which play a crucial role in applications like wastewater treatment and aquaculture. Venturi flumes are widely used to boost DO levels through induced turbulence, a process essential for maintaining water quality in various ecosystems. The complexity of SOTE is rooted in the interplay of several factors, including flume geometry, discharge rates, and submergence conditions. While advancements in SOTE modelling have been made, existing methods often struggle to capture these variables precisely. This results in inaccuracies in SOTE predictions, hindering efforts to optimize oxygen transfer processes. Additionally, the absence of standardized evaluation methods further complicates the assessment and comparison of different models.
Environmental factors, such as water temperature and quality, also exert a considerable influence on oxygen transfer dynamics. These factors, coupled with the technological limitations of traditional aeration systems, including Venturi-based designs, restrict energy efficiency and operational flexibility. Consequently, there is a pressing need for innovative solutions to improve the effectiveness and sustainability of these systems. ML offers a promising avenue for enhancing SOTE prediction accuracy. However, challenges related to data quality, model interpretability, and overfitting must be addressed to ensure the successful implementation of ML-based approaches. By overcoming these obstacles, it is possible to advance oxygen transfer mechanisms in water treatment and contribute to overall water quality management.
Research gap and significance
While significant advancements have been made in modelling SOTE in Venturi flumes, several key gaps persist. Existing models often fall short, struggling to predict how variations in geometry, discharge rates, and submergence conditions affect SOTE. Additionally, the lack of standardized evaluation methods hinders the comparison and validation of results across different studies. To address these gaps, this study aims to develop robust and accurate SOTE models by leveraging advanced ML techniques.
The significance of this research lies in its potential to improve the design and operation of Venturi flumes. Identifying the critical parameters and their influence on SOTE allows flume design and operational strategies to be optimized for higher oxygen transfer efficiency. This, in turn, can lead to improved water quality in various applications, including wastewater treatment and aquaculture. Furthermore, the study's findings can inform future research and development of innovative aeration technologies. By bridging the gap between theoretical understanding and practical implementation, this research contributes to the advancement of sustainable water management practices.
Objectives and scope
The primary objective of this study is to optimize the SOTE in Venturi flumes by investigating the impact of various flume parameters on oxygen transfer. To achieve this, the research aims to develop accurate predictive models using ML techniques, including GBM, extreme gradient boosting (XRT), RF, M5 (Pruned and Unpruned), random tree (RT), and reduced error pruning (REP) models, as well as traditional regression techniques like MLR and multiple nonlinear regression (MNLR). Additionally, the study seeks to identify the most influential input variables affecting SOTE through sensitivity analysis, correlation matrix analysis, and SHapley Additive exPlanations (SHAP) analysis. By accomplishing these objectives, this research aims to advance the understanding of oxygen transfer mechanisms in Venturi flumes and provide valuable insights for optimizing their design and operation.
The scope of this research encompasses a comprehensive investigation into the optimization of SOTE in Venturi flumes. This includes conducting experiments to collect data on SOTE under various conditions, developing and evaluating the performance of multiple ML (GBM, XRT, RF, M5, RT, and REP) and traditional regression models (MLR and MNLR), and utilizing statistical analysis (correlation coefficient (CC), root mean square error (RMSE), and mean absolute error (MAE)) and uncertainty assessments to enhance model reliability. Additionally, the study will employ variable analysis techniques (sensitivity analysis, correlation analysis, and SHAP analysis) to identify the most influential parameters affecting SOTE. By addressing these key areas, the research aims to provide valuable insights into the complex relationships between flume parameters and oxygen transfer efficiency, ultimately contributing to the optimization of Venturi flume design and operation.
Novelty of the work
This research offers a comprehensive approach to optimizing SOTE in Venturi flumes, merging experimental insights with advanced ML techniques. At its core is a rigorous evaluation of diverse ML models, including GBM, XRT, RF, M5, RT, and REP, alongside traditional regression methods (MLR and MNLR), to pinpoint the most precise model for SOTE prediction. By integrating uncertainty analysis, this study enhances model reliability and interpretability, providing a measure of confidence in the predictions. Furthermore, sensitivity, correlation, and SHAP analyses identify critical parameters affecting SOTE, delivering valuable optimization insights, while analysis of variance (ANOVA) adds further rigour and credibility to the findings. These advancements hold practical implications for the design, operation, and maintenance of Venturi flumes, offering pathways to improved water quality and greater energy efficiency across numerous applications.
The paper is structured as follows. The introduction provides an overview of the topic, a literature review, and the research gaps and significance, and closes with the objectives, scope, and novelty of the work. The subsequent sections cover oxygen transfer characteristics, the experimental setup, data analysis methodologies, and the application of tree-based ML techniques. The results and discussion section analyses model performance using statistical metrics and visual representations. The paper concludes with a summary of key findings, limitations, implications, and recommendations for future research.
CHARACTERISTICS OF OXYGEN TRANSFER
Here, T denotes the test water temperature (°C), and KLa denotes the VOTC (s⁻¹) at 20 °C and 1 atm.
The SOTR is expressed in kilograms per hour (kg/h). The tank water volume (Vtank) is measured in cubic metres (m³), and the saturation concentration of DO at standard conditions (C*saturation) is measured in milligrams per litre (mg/L).
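The quantities defined above are commonly related through the log-deficit re-aeration model, a temperature correction to 20 °C, and a mass-rate conversion. The following is a minimal sketch of those textbook relations, not a reproduction of the paper's own equations: the function names, the correction factor of 1.024, and the sample numbers are assumptions made for illustration, and the final step from SOTR to SOTE is omitted because its definition is not given in this section.

```python
import math

# Assumed temperature correction factor (a value commonly used in aeration
# testing); the source text does not state which value was applied.
THETA = 1.024


def kla_at_test_temperature(c_sat, c_initial, c_final, duration_s):
    """VOTC at the test temperature (1/s) from the log-deficit relation.

    c_sat, c_initial, c_final are DO concentrations in mg/L;
    duration_s is the aeration time in seconds.
    """
    return math.log((c_sat - c_initial) / (c_sat - c_final)) / duration_s


def kla_at_20c(kla_t, temp_c, theta=THETA):
    """Correct a VOTC measured at temp_c (deg C) to the 20 deg C standard."""
    return kla_t * theta ** (20.0 - temp_c)


def sotr_kg_per_h(kla20_per_s, c_sat_std_mg_l, v_tank_m3):
    """Standard oxygen transfer rate in kg/h.

    mg/L equals g/m3, so kla * C* * V is in g/s; multiplying by 3600
    and dividing by 1000 converts g/s to kg/h.
    """
    return kla20_per_s * 3600.0 * c_sat_std_mg_l * v_tank_m3 / 1000.0


if __name__ == "__main__":
    # Hypothetical readings, chosen only to exercise the functions.
    kla_t = kla_at_test_temperature(c_sat=9.0, c_initial=1.5,
                                    c_final=3.0, duration_s=90.0)
    kla20 = kla_at_20c(kla_t, temp_c=22.0)
    print(sotr_kg_per_h(kla20, c_sat_std_mg_l=9.09, v_tank_m3=0.68))
```

Keeping the unit conversion in one place makes it easy to check against the stated units (KLa in s⁻¹, Vtank in m³, C*saturation in mg/L, SOTR in kg/h).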
The experimental layout and procedures
Sr. no. | Model no. | W (cm) | F (cm) | E (cm) | Ha (cm) | Hb (cm) | q (L/m/min) |
---|---|---|---|---|---|---|---|
1 | CVF 1 | 3.50 | 16.50 | 16.11 | 4.21–12.29 | 2.26–11.84 | 153.72–355.00 |
2 | CVF 2 | 10.20 | 36.00 | 15.68 | 2.78–10.64 | 1.30–10.71 | 153.72–557.14 |
3 | CVF 3 | 5.90 | 24.50 | 22.56 | 2.33–12.06 | 1.30–12.65 | 153.72–557.14 |
4 | RVF 4 | 7.62 | 15.00 | 15.68 | 2.81–13.60 | 1.66–14.19 | 153.72–557.14 |
5 | RVF 5 | 10.00 | 20.00 | 15.70 | 3.25–10.69 | 2.13–11.41 | 153.72–557.14 |
6 | RVF 6 | 5.08 | 11.43 | 15.66 | 6.90–18.73 | 4.90–18.92 | 153.72–557.14 |
7 | RVF 7 | 5.08 | 15.00 | 16.30 | 5.26–19.03 | 2.63–19.02 | 153.72–557.14 |
To evaluate oxygen transfer efficiency within the Venturi flumes, a closed-loop water circulation system was established in the laboratory. Water was supplied to the flume through a pipe connected to a dedicated oxygen transfer tank (0.87 m × 0.87 m × 0.90 m). This tank served two purposes: storing water for the experiment and measuring DO content. A pump with adjustable flow control ensured a consistent head in the system, drawing water from the tank through a 5.08 cm pipe. A digital flow meter with ±0.5% accuracy monitored the flow rate. Water temperature within the tank was measured using a thermometer accurate to ±0.1 °C.
The working section of the channel housed the Venturi flume under investigation. Here, oxygen transfer from air to water occurs. The remaining section of the channel, connected to the oxygen transfer tank, remained sealed to prevent air contact and ensure accurate DO measurement within the tank. The water level in the Venturi flume was measured using a precise pointer gauge with a resolution of 0.01 cm. This comprehensive setup allowed for controlled testing and accurate measurement of oxygen transfer efficiency within the various Venturi flume configurations.
This study investigates the influence of backwater effects on oxygen transfer efficiency in Venturi flumes. Backwater occurs when submerged flow conditions exist (Heyrani et al. 2022). This happens when the downstream water level (Hb) is high enough to impede upstream flow velocity. To quantify this effect, the submergence ratio is calculated by dividing the downstream depth (Hb) by the upstream depth (Ha) measured two-thirds upstream from the Venturi flume's throat (Hager & Hager 2010). While Ha is a fixed measurement point, Hb is controlled by adjusting a downstream tailgate (Cone 1917), allowing for the creation of various submergence ratios and the evaluation of their impact on oxygen transfer.
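The submergence ratio described above is a simple quotient of the two gauge readings. A minimal sketch follows; the function name and the sample readings are hypothetical and are not taken from the experimental dataset.

```python
def submergence_ratio(hb_cm, ha_cm):
    """Return the submergence ratio Hb/Ha.

    hb_cm: downstream depth Hb (cm); ha_cm: upstream depth Ha (cm).
    """
    if ha_cm <= 0:
        raise ValueError("upstream depth Ha must be positive")
    return hb_cm / ha_cm


if __name__ == "__main__":
    # Hypothetical pair of gauge readings taken in the same run.
    print(submergence_ratio(hb_cm=7.0, ha_cm=10.0))
```

Raising the tailgate increases Hb for a given Ha and therefore pushes the ratio up, which is how different submergence conditions are produced.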
To ensure consistent and measurable oxygen transfer, water was pre-treated before each experiment by adding sodium sulphite (7.9 g/m³) to deoxygenate it to a DO range of 1.0–2.0 ppm (Bagatur 2014). Cobalt chloride (3.3 g/m³) was introduced as a catalyst to expedite oxygen transfer and stabilize DO levels below saturation throughout the experiment (Kumar et al. 2022a). Each test began with 10 min of water deoxygenation, followed by DO sampling from various locations and depths to establish the initial concentration (Cinitial) in 300 mL biological oxygen demand (BOD) bottles. The water was then aerated for 90 s using Venturi flumes to control DO saturation at temperatures of 18–25 °C (Tiwari & Sihag 2020; Kumar et al. 2022a). For optimal DO distribution, we employed manual stirring before and after each experiment, a method that has proven effective in previous research (Kumar et al. 2022a; Luxmi et al. 2023a, b; Puri et al. 2023a, b). Despite the potential for minor heterogeneity, careful sampling and sufficient mixing achieved satisfactory DO uniformity, confirmed by multiple DO readings from various depths. A tolerance limit of ±0.2 ppm was set for DO variations, with any measurements outside this range discarded and experiments repeated. This comprehensive approach aligns with established standards (Kumar et al. 2022a; Luxmi et al. 2023a, b; Puri et al. 2023a, b, 2024), ensuring accurate DO levels for reliable SOTE measurements. DO was measured using the azide modification method (APHA AWWA & WEF 2005), involving titration with sodium thiosulphate. Post-aeration, DO concentration (Cfinal) was determined, and SOTE was calculated using Equations (1)–(5). This protocol was applied across seven Venturi flume setups, generating 609 SOTE observations in total.
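As a plausibility check on the sodium sulphite dose quoted above, the stoichiometry of sulphite oxidation can be worked through. The source gives the dose only as 7.9 g/m³; reading it as the amount required per 1 mg/L of DO removed is an assumption made here, based on the close match with the theoretical mass ratio.

```latex
% Sulphite deoxygenation reaction and the resulting mass ratio.
\[
2\,\mathrm{Na_2SO_3} + \mathrm{O_2} \longrightarrow 2\,\mathrm{Na_2SO_4}
\]
\[
\frac{2 \times 126.04\ \mathrm{g\,mol^{-1}}}{32.00\ \mathrm{g\,mol^{-1}}}
\approx 7.88\ \mathrm{g\ Na_2SO_3\ per\ g\ O_2}
\]
```

Since 1 mg/L equals 1 g/m³, about 7.9 g of sodium sulphite per cubic metre of water is needed for each 1 mg/L of DO to be removed.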
Impact of scale effect on experimental measurements
Scale effects can significantly influence the accuracy of experimental measurements, especially in studies involving hydraulic structures like Venturi flumes. When conducting laboratory-scale experiments, it is difficult to perfectly replicate full-scale hydraulic conditions due to differences in Reynolds and Froude numbers between scales. These discrepancies can lead to variations in flow patterns, oxygen transfer dynamics, and turbulence characteristics. In small-scale setups, scale effects may result in altered oxygen transfer rates, leading to results that may not accurately represent full-scale conditions. These inconsistencies can impact the SOTE measurements, as laboratory conditions may either amplify or dampen the actual turbulence levels observed in real-world scenarios, potentially affecting the validity of the findings.
Approach to enhance measurement accuracy and mitigate scale effects
To ensure accurate measurements and minimize scale effects, maintaining a Froude-scaled environment is crucial, as it ensures proportional representation of forces in the hydraulic experiment. Additionally, adjusting experimental setups by using larger-scale models or employing numerical scaling techniques can enhance the similarity between the experimental model and real-world conditions. Advanced ML techniques can further refine data analysis, correcting or accounting for scale-induced discrepancies, leading to more reliable SOTE predictions in practical applications. By carefully considering scaling laws and similarity criteria and employing advanced numerical modelling and ML techniques, researchers can improve the accuracy and reliability of experimental results, leading to more informative studies of hydraulic structures.
Limitations of the experimental study
This study acknowledges several limitations that may affect the broader applicability of its findings. Conducted in a controlled laboratory environment, the experiments may not fully replicate the dynamic variability observed in natural water bodies, including fluctuations in ambient conditions and variations in organic matter content. Despite efforts to address scale effects, these may still influence the relevance of the results to larger, real-world applications where physical conditions differ significantly. Additionally, the ML models developed were specifically designed for the experimental data used in this study, meaning their performance may vary when applied to other datasets or alternative flume configurations. This underscores the need for further validation across a wider range of flow conditions and diverse experimental setups. Lastly, environmental factors such as water temperature and initial DO levels could impact SOTE measurements, potentially introducing variability in model predictions when conditions fluctuate.
DATA SEGMENTATION AND RESEARCH METHODOLOGY
This study investigates the impact of Venturi flume design on oxygen transfer efficiency using two flume types, CVFs and RVFs, with variations in specific flume dimensions. In model development, data division is crucial for sound model calibration and validation, though there is no specific method or criterion for identifying the optimal division. Factors affecting the division ratio include the size of the dataset and the complexity of the estimation model (Kisi et al. 2019; Katipoğlu 2020). The SOTE dataset was divided into training and testing sets, and three ratios (70–30%, 75–25%, and 80–20%) were examined. The 75–25% split (458 training and 151 testing observations) appeared optimal and was adopted. In the training phase, 75% of the dataset was used to perform regression analysis, identifying relationships between the input parameters and SOTE. Hyperparameter tuning and optimization were employed to enhance prediction accuracy and model performance. The remaining 25% of the dataset was reserved for the testing phase, during which the model's predictive capability was evaluated using the CC, RMSE, and MAE metrics. This approach was intended to verify the model's robustness and its ability to generalize to new data.
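The data division and the three evaluation metrics described above can be sketched in a few lines. This is an illustrative sketch only: the shuffling procedure, the seed, and the function names are assumptions, since the source does not say how observations were assigned to each set, and CC is taken here to be the Pearson correlation coefficient.

```python
import math
import random


def split_dataset(rows, n_train, seed=42):
    """Shuffle rows reproducibly and return (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def correlation_coefficient(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


if __name__ == "__main__":
    # Row indices stand in for the 609 observations; 458 go to training.
    train, test = split_dataset(range(609), n_train=458)
    print(len(train), len(test))
```

Passing the training count explicitly reproduces the 458/151 division exactly, which a rounded 75% of 609 would not.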
Variables | q (L/m/min) | W (cm) | F (cm) | E (cm) | Ha (cm) | Hb (cm) | SOTE (kg/Wh) |
---|---|---|---|---|---|---|---|
Training | |||||||
Minimum | 153.720 | 3.500 | 11.430 | 15.660 | 2.330 | 1.300 | 0.001 |
Maximum | 557.140 | 10.200 | 36.000 | 22.560 | 19.030 | 19.020 | 3.828 |
Mean | 446.692 | 7.199 | 20.324 | 16.994 | 9.209 | 7.964 | 0.765 |
S. deviation | 127.306 | 2.213 | 7.922 | 2.571 | 3.219 | 3.352 | 0.717 |
C.O.V | 0.285 | 0.307 | 0.390 | 0.151 | 0.350 | 0.421 | 0.937 |
Kurtosis | 0.192 | −1.473 | −0.270 | 0.907 | −0.131 | 0.271 | 3.389 |
Skewness | −1.242 | 0.247 | 0.941 | 1.690 | 0.303 | 0.667 | 1.792 |
Sum | 204,585.080 | 3,297.020 | 9,308.380 | 7,783.160 | 4,217.930 | 3,647.290 | 350.337 |
Testing | |||||||
Minimum | 153.720 | 3.500 | 11.430 | 15.660 | 2.410 | 1.300 | 0.021 |
Maximum | 557.140 | 10.200 | 36.000 | 22.560 | 16.670 | 16.580 | 2.686 |
Mean | 447.333 | 7.099 | 20.732 | 17.458 | 9.106 | 7.928 | 0.687 |
S. deviation | 119.396 | 2.100 | 7.832 | 2.914 | 2.768 | 3.014 | 0.587 |
C.O.V | 0.267 | 0.296 | 0.378 | 0.167 | 0.304 | 0.380 | 0.854 |
Kurtosis | 0.036 | −1.285 | −0.405 | −0.590 | 0.339 | 0.281 | 2.155 |
Skewness | −1.123 | 0.388 | 0.815 | 1.180 | 0.176 | 0.631 | 1.573 |
Sum | 67,547.230 | 1,071.880 | 3,130.600 | 2,636.150 | 1,375.020 | 1,197.140 | 103.752 |
ML TECHNIQUES: TREE-BASED METHODS
ML offers powerful tools for tackling complex problems, and tree-based models are a prominent example. These models, including GBM, XRT, RF, M5 models (both Prun and Unprun), RT, and REP (REP_Prun and REP_Unprun) variants, excel at building predictive models. They achieve this by recursively splitting the data based on features, essentially creating a series of if-then-else rules. This approach offers several advantages: interpretability of the model's decision-making process, the ability to handle nonlinear relationships between variables, and wide applicability for tasks like classification, regression, and feature importance analysis. However, it is important to acknowledge potential challenges associated with tree-based models, such as overfitting and the need for careful parameter tuning.
Gradient boosting machines
GBMs, introduced by Friedman (2001), have become a widely used supervised ML technique for both regression and classification tasks (Freund 1995; Friedman 2001; Lu et al. 2016; Zhou et al. 2019). This approach leverages ‘boosting,’ a powerful technique applicable across various ML algorithms (Freund & Schapire 1996). GBMs build a robust model by sequentially adding weak learners, each focusing on improving the overall prediction accuracy (Singh & Minocha 2024). Each new learner is trained on a modified version of the data, emphasizing areas where previous learners struggled. Ultimately, the GBM combines these individual learners, weighted based on their performance, into a single, powerful predictive model (Luxmi et al. 2023b). In essence, GBMs excel at combining multiple weak learners to create a robust and effective prediction tool.
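The sequential idea can be sketched for the squared-error case, where each new weak learner is fitted to the residuals left by the learners before it. This is a deliberately reduced illustration: it uses depth-1 trees (stumps) on a single input and synthetic data, whereas the GBM in this study uses deeper trees and six inputs. All function names are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold on x that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_gbm(x, y, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * predict_stump(stump, xi)
                       for xi, pi in zip(x, predictions)]
    return base, learning_rate, stumps


def predict_gbm(model, xi):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)


def mse(y, predictions):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)


# Synthetic nonlinear target, for illustration only.
x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
model = fit_gbm(x, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
boosted_mse = mse(y, [predict_gbm(model, xi) for xi in x])
print(baseline_mse, boosted_mse)
```

The learning rate shrinks each stump's contribution, which is why many small corrections are needed and why the number of trees and the tree depth have to be tuned together.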
Extreme gradient boosting
XRT, introduced by Chen & Guestrin (2016), is an ML technique that falls under the category of ensemble learning. Ensemble methods, such as boosting, combine multiple weak learners (e.g., decision trees) to create a more robust final model (Tao et al. 2021). XRT builds such a model by aggregating multiple decision trees and combining their leaf weights for the final prediction. The objective function includes L1 and L2 regularization terms, which penalize overly complex models and thereby help to limit overfitting. At each iteration, XRT optimizes this regularized objective, adding a tree that corrects the errors of the previous learners (Chen & Guestrin 2016). These features, combined with its speed and ability to handle large-scale problems efficiently, have made XRT a popular choice for regression and classification tasks (Wang et al. 2018; Parsa et al. 2020).
Random forest
RF is a supervised ensemble learning technique introduced by Breiman (2001). Ensemble methods, such as RF, combine multiple weak learners (e.g., decision trees) to create a more robust model (Liaw & Wiener 2002; Janitza et al. 2016; Azarhoosh & Koohmishi 2023). In RF regression, each tree is built using a random subset of features (variables) from the original data, promoting diversity within the forest (Singh et al. 2019). The final prediction is obtained by averaging the predictions from all the trees in the forest. Tuning two key parameters, the number of trees (k) and the number of features considered at each node (m), is crucial for optimal performance (Breiman 2017). This approach offers several advantages: handling nonlinear relationships, being robust to outliers, and requiring minimal parameter tuning compared to other models (Luxmi et al. 2023b). Overall, RF regression is a powerful non-parametric method widely used for regression problems and can also be applied to multi-class classification tasks (Breiman 2001; Belmokre et al. 2019).
M5 model tree
The M5 model tree algorithm, developed by Quinlan (1992), is a powerful tool for uncovering relationships between input variables and numerical output (Samadi et al. 2021). It works by building a binary decision tree, iteratively splitting the data based on the attribute that best reduces the spread (standard deviation) within each subgroup (Wang & Witten 1996). This process continues until a stopping criterion is met, such as a minimum number of instances or a small variation in the output values. At each internal node (decision point) in the tree, a linear regression model is created using relevant features to predict the output variable (Bonakdar & Etemad-Shahidi 2011). To prevent overfitting and improve generalization, these models are then simplified by removing unnecessary features that do not significantly contribute to the prediction accuracy. Finally, a pruning step is applied. Pruning removes entire sub-trees where the error of the linear regression model at the root is less than or equal to the overall error of the sub-tree itself (Zhang & Tsai 2006). This technique helps to improve the model's performance, especially when dealing with limited training data. Overall, M5 model trees offer a valuable approach for identifying relationships in data and generating interpretable linear regression models for prediction.
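The splitting rule can be made concrete with a short sketch. It assumes the commonly quoted standard deviation reduction, SDR = sd(T) − Σ (|Tᵢ|/|T|)·sd(Tᵢ), evaluated for every candidate threshold on one attribute; the data are invented for illustration and the function names are hypothetical.

```python
from statistics import pstdev


def sdr(parent, left, right):
    """Standard deviation reduction achieved by splitting `parent` into two subsets."""
    n = len(parent)
    return (pstdev(parent)
            - (len(left) / n) * pstdev(left)
            - (len(right) / n) * pstdev(right))


def best_split(x, y):
    """Return (threshold, SDR) for the split on x that maximises SDR."""
    pairs = sorted(zip(x, y))
    targets = [t for _, t in pairs]
    best_threshold, best_gain = None, float("-inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates identical attribute values
        gain = sdr(targets, targets[:i], targets[i:])
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Invented attribute/target values with an obvious jump between 3 and 4.
x = [1, 2, 3, 4, 5, 6]
y = [0.1, 0.2, 0.1, 0.9, 1.0, 0.9]
threshold, gain = best_split(x, y)
print(threshold, round(gain, 3))
```

In a full M5 tree this search is repeated over all attributes at every node, and a linear model is then fitted to the instances reaching each node.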
Random trees
RTs are tree-based learning algorithms used for both classification and regression tasks (Breiman 2001). Compared to RFs, they are a simpler approach that builds a single decision tree. This tree recursively splits the data, choosing at each node the best split among a randomly selected subset of features so as to minimize prediction error within each branch. Random feature selection at each split, and bagging (training on random subsets of data) when several such trees are combined, improve robustness and reduce overfitting (Ho 1995; LaValle 1998; Aldous & Pitman 2000). While this may lead to slightly less interpretability compared to simpler models, RTs offer advantages like high accuracy, flexibility, and fast training times. During prediction, the input is passed down the tree to a leaf; the output is the most frequent class at that leaf for classification, or the average of the training targets at that leaf for regression (Witten & Frank 2002; Elbeltagi et al. 2023).
Reduced error pruning
REP combines the strengths of decision tree learning with a pruning technique to create a more efficient and accurate model. The decision tree algorithm itself helps simplify the model, while REP further reduces its complexity by removing unnecessary branches (Quinlan 1987; Mohamed et al. 2012). This pruning process identifies the smallest sub-tree that maintains high accuracy. It achieves this by analysing measures like information gain or variance reduction (Chen et al. 2009; Srinivasan & Mekala 2014). By removing less accurate branches, REP mitigates overfitting caused by noise or outliers in the data, ultimately leading to a more generalizable model. To achieve this balance between accuracy and simplicity, REP leverages a combination of training data for tree building, test data for pruning specific branches, and a validation dataset to estimate the model's performance on unseen data (Quinlan 1993; Polo et al. 2010; Nhu et al. 2020). This multi-pronged approach allows REP to achieve fast learning speeds with reduced variance, making it a valuable tool for building decision tree models.
Multiple linear regression
Multiple nonlinear regression
RESULTS AND ANALYSIS
Performance benchmarks
In the equations, ‘Sexp’ denotes the experimental SOTE values, ‘Spred’ represents the SOTE values predicted by the models, and ‘m’ signifies the total number of observations.
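For reference, the three metrics can be computed as in the sketch below. It assumes the standard textbook definitions (Pearson correlation for CC, root of the mean squared error for RMSE, mean of the absolute errors for MAE), written with the symbols defined above; the numerical values are invented and the function names are hypothetical.

```python
import math


def correlation_coefficient(s_exp, s_pred):
    """Pearson correlation between experimental and predicted values."""
    m = len(s_exp)
    mean_exp = sum(s_exp) / m
    mean_pred = sum(s_pred) / m
    covariance = sum((e - mean_exp) * (p - mean_pred) for e, p in zip(s_exp, s_pred))
    spread_exp = math.sqrt(sum((e - mean_exp) ** 2 for e in s_exp))
    spread_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in s_pred))
    return covariance / (spread_exp * spread_pred)


def rmse(s_exp, s_pred):
    """Root mean square error over the m observations."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(s_exp, s_pred)) / len(s_exp))


def mae(s_exp, s_pred):
    """Mean absolute error over the m observations."""
    return sum(abs(e - p) for e, p in zip(s_exp, s_pred)) / len(s_exp)


# Invented values, for illustration only.
s_exp = [0.20, 0.50, 0.90, 1.40]
s_pred = [0.25, 0.45, 1.00, 1.30]
print(correlation_coefficient(s_exp, s_pred), rmse(s_exp, s_pred), mae(s_exp, s_pred))
```

A CC close to 1 together with small RMSE and MAE indicates close agreement; RMSE penalizes large individual errors more heavily than MAE does.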
Results of ML models
This section provides a comprehensive analysis of the ML models employed for predicting SOTE in the experimental setup. A range of models, including GBM, XRT, RF, M5_Prun, M5_Unprun, RT, REP_Prun, and REP_Unprun, were evaluated. Detailed hyperparameter optimization was conducted for each model to maximize predictive accuracy, with optimal values listed in Table 3. The resulting SOTE predictions for both training and testing phases are shown in Table 4, along with key performance metrics: the CC, RMSE, and MAE. These metrics allowed us to assess each model's effectiveness and determine the most suitable approach for SOTE estimation.
Models | Optimal value of hyperparameters |
---|---|
GBM | Number of folds = 5, Number of trees = 67, Max depth = 5, Encoding method = enum, Fold assignment = modulo, and Distribution function = Gaussian |
XRT | Number of folds = 5, Number of trees = 38, Max depth = 20, Distribution function = Gaussian |
RF | Features (m) = 0, Number of trees grown (k) = 0 |
M5_Prun | Instances = 4 |
M5_Unprun | Instances = 4 |
RT | k value = 0; max depth = 0; minNum = 2.0; minVarianceProp = 0.002; numFolds = 0; and seed = 1 |
REP_Prun | Initial count = 0; max depth = −1; minNum = 2; minVarianceProp = 0.001; numFolds = 3; and seed = 1 |
REP_Unprun | Initial count = 0; max depth = −1; minNum = 2; minVarianceProp = 0.001; numFolds = 3; and seed = 1 |
S. no. | Models | CC | RMSE | MAE |
---|---|---|---|---|
Training | ||||
1 | GBM | 0.9683 | 0.1829 | 0.0001 |
2 | XRT | 0.9765 | 0.1882 | 0.0002 |
3 | RF | 0.9879 | 0.1248 | 0.0001 |
4 | M5_Prun | 0.8910 | 0.3261 | 0.0012 |
5 | M5_Unprun | 0.9304 | 0.2671 | 0.0007 |
6 | RT | 0.9770 | 0.1528 | 0.0002 |
7 | REP_Prun | 0.9089 | 0.2990 | 0.0019 |
8 | REP_Unprun | 0.9636 | 0.1915 | 0.0002 |
9 | MLR | 0.7780 | 0.4504 | 0.0002 |
10 | MNLR | 0.9495 | 0.2251 | 0.00002 |
Testing | ||||
1 | GBM | 0.9372 | 0.2067 | 0.0006 |
2 | XRT | 0.8852 | 0.2768 | 0.0006 |
3 | RF | 0.9087 | 0.2451 | 0.0004 |
4 | M5_Prun | 0.9109 | 0.2452 | 0.0048 |
5 | M5_Unprun | 0.9455 | 0.1918 | 0.0030 |
6 | RT | 0.8531 | 0.3259 | 0.0007 |
7 | REP_Prun | 0.8471 | 0.3185 | 0.0059 |
8 | REP_Unprun | 0.9034 | 0.2628 | 0.0005 |
9 | MLR | 0.7850 | 0.3683 | 0.0002 |
10 | MNLR | 0.9343 | 0.2118 | 0.0001 |
Combination | Variable removed | CC | RMSE | MAE |
---|---|---|---|---|
q, W, F, E, Ha, Hb | - | 0.9455 | 0.1918 | 0.0030 |
W, F, E, Ha, Hb | q | 0.8823 | 0.2816 | 0.0041 |
q, F, E, Ha, Hb | W | 0.7244 | 0.4092 | 0.0008 |
q, W, E, Ha, Hb | F | 0.9484 | 0.1870 | 0.0029 |
q, W, F, Ha, Hb | E | 0.9412 | 0.1985 | 0.0024 |
q, W, F, E, Hb | Ha | 0.9399 | 0.2007 | 0.0036 |
q, W, F, E, Ha | Hb | 0.7862 | 0.3633 | 0.00001 |
Note. Bold highlights the parameters that have the most significance.
The M5 model with unpruned trees (M5_Unprun) demonstrated the highest testing performance among all models, achieving a CC value of 0.9455, a low RMSE value of 0.1918, and an MAE value of 0.0030. The GBM model closely followed, with a CC value of 0.9372, an RMSE value of 0.2067, and an MAE value of 0.0006. While GBM showed strong performance, its slightly higher RMSE and marginally lower CC compared to M5_Unprun reflected a modest decrease in accuracy for SOTE prediction. The pruned version of the M5 model (M5_Prun) also performed well, achieving a CC value of 0.9109, though with an RMSE value of 0.2452 and an MAE value of 0.0048, suggesting that pruning led to a slight reduction in predictive accuracy compared to the unpruned M5 model.
Comparative analysis and justification of results of suggested ML models
The M5_Unprun model's superior performance stems from its ability to capture complex interactions in the dataset without the limitations imposed by pruning. Unlike M5_Prun, which removes sub-trees, M5_Unprun retains the full tree and can therefore represent finer distinctions in the data, enhancing its accuracy in SOTE prediction. This advantage is reflected in its CC and RMSE values in the testing phase, the phase that reflects generalization to unseen data; in training, several ensemble models (RF, RT, XRT, and GBM) achieve higher CC values but lose more accuracy on the testing set. Although the GBM model performs well, its slightly lower accuracy may result from constraints in its hyperparameters, such as max-depth limits, which may affect its ability to handle SOTE data variability. The competitive yet lower performance of models like RF and XRT highlights their stability in predictions, though they appear not to capture the finer details within the SOTE data as well. Similarly, the REP models (both pruned and unpruned) demonstrate reliable generalization; however, their higher error values suggest difficulty in achieving the precision seen with M5_Unprun and GBM.
In summary, the M5_Unprun model proves to be the most effective for SOTE prediction within the tested models, with its high CC and low error metrics indicating its reliability and precision. While other models are effective, they exhibit limitations in capturing the intricate relationships inherent in the SOTE data. Scatter plots and fitting-line curves visually reinforce these findings, showing that M5_Unprun predictions align most closely with actual values, a critical factor for accurate SOTE estimation.
These insights shed light on the models' behaviours in SOTE prediction, especially where precise predictions are essential for optimizing both experimental and practical applications. Future research might explore hybrid or ensemble methods, potentially integrating M5_Unprun with models like GBM to capitalize on their unique strengths and further enhance SOTE prediction accuracy. This comprehensive model comparison clarifies the performance landscape, guiding the choice of optimal models for reliable SOTE prediction in experimental flume studies.
Comparison of results
This study conducted a comprehensive evaluation of various ML models and traditional regression techniques for predicting SOTE. The models tested included GBM, XRT, RF, M5 model trees (both pruned and unpruned), RT, REP (with pruned and unpruned versions), MLR, and MNLR. Hyperparameters for each model were carefully optimized, as detailed in Table 3.
Figure 5 presents scatter plots of predicted versus experimental SOTE values for all models during both the training and testing stages. Each scatter plot includes ±10% error lines and a perfect prediction line (y = x), offering a clear view of each model's prediction accuracy. The plots indicate that predictions from the M5_Unprun model align most closely with experimental values, followed by GBM and MNLR, highlighting these models' high accuracy. These visual results reinforce M5_Unprun's superior performance in both the training and testing phases, as further evidenced by its statistical metrics in Table 4.
The Taylor diagram in Figure 6 provides a comprehensive comparison of model performance by displaying correlation, variance, and RMSE for each model relative to the reference dataset. It highlights M5_Unprun and GBM as most closely aligned with experimental data, showing higher CCs and lower errors, reinforcing their superior predictive performance. This alignment suggests that these models capture the underlying patterns in SOTE data more effectively than others.
Figure 7 uses box plots to depict the relative error distribution, offering insight into each model's consistency. The M5_Unprun model has the narrowest interquartile range (IQR) and the lowest median relative error, indicating minimal variance in prediction errors. GBM also demonstrates low error dispersion, while models such as XRT and M5_Prun have wider IQRs, suggesting less consistent predictions. These observations corroborate the rankings shown by the statistical metrics.
Predictive model fitting curves in Figure 8 further illustrate how each model aligns with experimental SOTE values. The M5_Unprun curve closely follows the actual data curve, with GBM also showing strong alignment, underscoring M5_Unprun's robust predictive capabilities. Among traditional models, MNLR stands out with fitting curves comparable to those of ML models, particularly the M5_Prun.
Table 4 summarizes the CC, RMSE, and MAE values for each model across the training and testing phases. M5_Unprun emerged as the top-performing ML model, achieving a CC value of 0.9455, an RMSE value of 0.1918, and an MAE value of 0.0030 in testing, indicating minimal deviation and high predictive accuracy. GBM followed closely with a CC value of 0.9372, an RMSE value of 0.2067, and an MAE value of 0.0006, also demonstrating strong accuracy and low average error. M5_Prun ranked third among ML models, with a CC value of 0.9109 and an RMSE value of 0.2452, reflecting slightly higher errors than M5_Unprun. Among traditional models, MNLR performed well, achieving a CC value of 0.9343, an RMSE value of 0.2118, and an MAE value of 0.0001, highlighting its predictive strength despite a simpler structure.
Detailed analysis of observations and proposed model behaviour
The superior performance of the M5_Unprun and GBM models for SOTE prediction can be attributed to several key strengths. First, their complexity and flexibility enable them to capture the complex, nonlinear relationships between input and output variables, a crucial requirement in this application where simpler models like MLR fall short. These models' ability to identify the importance of specific features enhances predictive accuracy by focusing on the most influential factors. Additionally, both M5_Unprun and GBM exhibit resilience to noise and outliers, a notable advantage given the experimental nature of our dataset. GBM, as an ensemble method, benefits from combining multiple decision trees to capture diverse patterns, reducing both variance and bias for more robust predictions. Its built-in regularization further prevents overfitting, improving the generalization of new data. These models have also shown a strong capacity to handle noise and missing values, likely contributing to their success with our data. Furthermore, their efficacy aligns with findings from previous studies (Kumar et al. 2022a; Luxmi et al. 2023b; Panwar & Tiwari 2024a), supporting our model choices for SOTE prediction. Overall, the ability of M5_Unprun and GBM to manage complex relationships and generalize well to unseen data makes them particularly well suited for this application.
Analysis of parameters and correlation diagram
Correlation diagrams and parameter analysis are essential tools for establishing correlations between parameters and performing statistical studies. Parameter assessment is the process of analyzing and understanding the characteristics, features, and relationships between different parameters in a dataset. This includes analyzing distributions, interdependencies, connections, and specific parameters. Correlation diagrams, also known as correlation matrices or plots (Brentan et al. 2017; Li et al. 2023), visually represent the correlation between parameters in a dataset. The CC is a statistical measure that expresses the strength and direction of the association between two factors. Ranging from −1 to +1, the CC indicates a strong negative association when it reaches −1, a strong positive correlation when it reaches +1, and no correlation when it reaches zero. By examining the correlation diagram and performing parameter evaluations, researchers can gain insights into the connections and trends between parameters.
Throat width (W) shows a weak positive correlation (0.2622) with SOTE, indicating that a wider throat may slightly improve oxygen transfer efficiency. The increased width likely affects the velocity profile, promoting turbulent mixing that can enhance oxygen transfer, though this effect remains modest, as shown by the correlation strength. Other parameters demonstrate minimal correlations with SOTE. Throat length (F), for instance, has almost no correlation, indicating a limited impact on oxygen transfer.
The upstream entrance width (E) shows a weak negative correlation (−0.2724) with SOTE, suggesting that a larger entrance width slightly reduces SOTE, possibly due to decreased velocity and turbulence at the inlet. Among water level gauges, Ha has a very weak negative correlation (−0.0835), indicating negligible influence on SOTE. In contrast, Hb, which measures the midpoint of the throat section, has a weak positive correlation (0.1741), suggesting that a higher water level in the throat may promote conditions favourable to turbulence and air entrainment, slightly enhancing oxygen transfer.
In summary, throat width (W) and midpoint throat gauge reading (Hb) are the most influential parameters on SOTE, with slight increases in both positively impacting oxygen transfer. This correlation analysis clarifies parameter interactions and offers valuable insights for optimizing oxygen transfer efficiency in the system.
SHapley Additive exPlanations (SHAP) analysis
To gain a deeper understanding of the predictions of the best-performing M5_Unprun model, we employed SHAP analysis. SHAP, based on the Shapley values from game theory (Shapley 1953), is a powerful tool for interpreting complex models (which behave like black boxes) by measuring feature importance (Lundberg & Lee 2017). It calculates feature-wise SHAP values, with positive values indicating a positive contribution to predicted SOTE (increased oxygen transfer) and negative values indicating a negative impact (decreased oxygen transfer). The magnitude of the value reflects the strength of the influence.
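The game-theoretic quantity behind SHAP can be illustrated by brute force on a very small example. The sketch below computes exact Shapley values by averaging each feature's marginal contribution over all feature orderings, representing an "absent" feature by a baseline value (one common convention, assumed here). The response function `toy_model`, its coefficients, and the baseline and instance values are hypothetical and are not the fitted M5_Unprun model; practical SHAP implementations rely on faster approximations or model-specific algorithms rather than this enumeration.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)          # start with every feature "absent"
        previous = predict(current)
        for name in ordering:
            current[name] = instance[name]  # switch this feature to its actual value
            value = predict(current)
            contributions[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in contributions.items()}


def toy_model(f):
    # Hypothetical response with a W-Hb interaction; for illustration only.
    return 0.3 * f["W"] + 0.1 * f["Hb"] - 0.002 * f["q"] + 0.01 * f["W"] * f["Hb"]


baseline = {"W": 7.0, "Hb": 8.0, "q": 450.0}
instance = {"W": 10.0, "Hb": 12.0, "q": 300.0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The values sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that lets a summary plot attribute each prediction to individual features.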
Figure 9(b) illustrates SHAP's effectiveness in visualizing feature distributions and their impact on SOTE predictions across the testing dataset. The SHAP summary plot identifies throat width ‘W’ as the most influential factor for SOTE prediction. High values of ‘W’ (indicated in red) correlate with increased SOTE, while lower values (in blue) tend to decrease it. This aligns with physical expectations, as a wider throat enhances turbulence and mixing within the Venturi flume, thus promoting better oxygen transfer. The broad spread of SHAP values further emphasizes W's substantial contribution to SOTE predictions.
The gauge reading at the midpoint of the throat section ‘Hb’ emerges as the second most influential factor. Higher ‘Hb’ values (red) are associated with a positive impact on SOTE, while lower values (blue) have a negative impact. This pattern aligns with Hb's role as an indicator of flow conditions in the throat, where higher readings suggest turbulence and mixing conditions that support more effective oxygen transfer. SHAP confirms both the significance and the positive direction of Hb's effect, reinforcing findings from the correlation analysis. Discharge per unit width ‘q’, on the other hand, exhibits an inverse relationship with SOTE. Higher ‘q’ values (red) negatively affect SOTE, while lower values (blue) positively influence it. This negative impact is likely due to higher flow rates reducing air–water contact time, thus limiting oxygen transfer. SHAP analysis thus confirms the importance of ‘q’. The remaining features, E, Ha, and F, have relatively minor impacts on SOTE, reflected in their smaller SHAP values. While they may influence SOTE to some extent, their contributions are less significant than those of W, Hb, and q.
Overall, the SHAP analysis underscores the primary roles of throat width (W) and gauge reading (Hb) in enhancing SOTE. This analysis not only validates correlation findings but also offers a clearer and deeper understanding of each parameter's impact direction and magnitude, further clarifying the mechanisms driving oxygen transfer efficiency in the Venturi flume setup.
Sensitivity analysis
The sensitivity analysis assesses how different input variables affect specific outcomes within a given system or model under specific conditions. The Morris method, Sobol's indices, Latin hypercube sampling (L.H.S.), Monte Carlo sampling (M.C.S.), the one-factor-at-a-time (OFAT) approach, and other techniques can be used to conduct sensitivity evaluations. However, the current study opts to use the fundamental OFAT method. Applying the OFAT technique, one input variable changes at a time while the others remain unchanged (Panwar & Tiwari 2024a, b; Tiwari & Panwar 2024). The sensitivity analysis focuses on the M5_Unprun model that performs most effectively in this investigation.
Table 5 reports the variations in CC, RMSE, and MAE obtained when each input variable is removed in turn, providing insight into the influence of each variable on the model's accuracy in predicting SOTE. This approach helps identify the most significant parameters in the Venturi flume setup, guiding optimization efforts for enhanced oxygen transfer.
Starting with throat width, ‘W’, its removal causes the most substantial performance degradation, with the CC dropping from 0.9455 to 0.7244 and the RMSE increasing from 0.1918 to 0.4092. The MAE, by contrast, falls from 0.0030 to 0.0008, so the assessment rests on CC and RMSE. This decline in CC and increase in RMSE confirm that throat width is the most influential variable affecting SOTE. Throat width, ‘W’, plays a critical role in fostering turbulence and improving air–water mixing within the flume. Excluding this parameter severely diminishes the model's ability to predict oxygen transfer efficiency accurately, emphasizing its vital role in maintaining flow dynamics conducive to efficient oxygen transfer.
The gauge reading at the midpoint of the throat section ‘Hb’ is the second most influential variable. Removing ‘Hb’ leads to a notable decrease in CC (to 0.7862) and an increase in RMSE (to 0.3633), while the MAE falls to 0.00001. The strong influence of ‘Hb’ can be attributed to its reflection of flow conditions within the throat section, where effective oxygen transfer occurs. Higher ‘Hb’ values typically indicate optimal turbulence, which enhances oxygen transfer. Without this variable, the model's predictive performance worsens, as it loses a key indicator of flow and turbulence characteristics within the flume.
Removing discharge per unit width ‘q’ produces an intermediate degradation (CC = 0.8823, RMSE = 0.2816). The influence of the remaining parameters, throat length ‘F’, upstream entrance width ‘E’, and water level at gauge ‘Ha’, is comparatively smaller. Excluding these variables results in minimal changes to CC and RMSE (removing ‘F’ even raises CC marginally, to 0.9484), further confirming their relatively low impact on SOTE.
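The variable-removal procedure can be sketched as a loop that refits the model with one input left out and records the change in a test metric. The sketch below is an illustration under stated assumptions: it uses a 1-nearest-neighbour predictor as a stand-in for M5_Unprun, RMSE as the only metric, and synthetic data in which the target depends on the first input alone. All names are hypothetical.

```python
import math


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def nearest_neighbour_predict(train_x, train_y, test_x, columns):
    """Predict each test row from its closest training row, using only `columns`."""
    predictions = []
    for row in test_x:
        nearest = min(
            range(len(train_x)),
            key=lambda j: sum((train_x[j][c] - row[c]) ** 2 for c in columns),
        )
        predictions.append(train_y[nearest])
    return predictions


def variable_removal_study(names, train_x, train_y, test_x, test_y):
    """Test RMSE with all inputs ('-') and with each input removed in turn."""
    all_columns = list(range(len(names)))
    results = {"-": rmse(test_y, nearest_neighbour_predict(train_x, train_y, test_x, all_columns))}
    for removed in all_columns:
        kept = [c for c in all_columns if c != removed]
        predictions = nearest_neighbour_predict(train_x, train_y, test_x, kept)
        results[names[removed]] = rmse(test_y, predictions)
    return results


# Synthetic data: the target depends on the first input only.
names = ["x_informative", "x_minor"]
train_x = [[float(i), ((i * 7) % 5) * 0.01] for i in range(20)]
train_y = [2.0 * row[0] for row in train_x]
test_x = [[i + 0.4, ((i * 3) % 5) * 0.01] for i in range(19)]
test_y = [2.0 * row[0] for row in test_x]

results = variable_removal_study(names, train_x, train_y, test_x, test_y)
print(results)
```

Removing the informative input raises the error sharply, while removing the minor one leaves it essentially unchanged, which is the pattern used to rank variables in this kind of analysis.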
In summary, the sensitivity analysis highlights throat width ‘W’ and gauge reading at the midpoint of the throat ‘Hb’ as the most crucial factors for accurate SOTE prediction. These findings align with both the correlation and SHAP analyses, which also emphasize the importance of these variables in optimizing oxygen transfer by enhancing turbulence and air–water interaction within the Venturi flume. These insights are valuable for guiding model improvements and optimizing Venturi flume configurations for better oxygen transfer efficiency.
One-way ANOVA
To evaluate if there are statistically significant differences between the means of various models used to predict SOTE, a one-way ANOVA test was employed. ANOVA differentiates systematic factors, which demonstrably affect the data, from random factors with no significant influence. Developed by Anscombe (1948), this test specifically targets the means of the groups being compared.
The core objective of one-way ANOVA is to assess the null hypothesis, which states that all groups have equal means. The test aims to determine whether there is enough evidence to reject this hypothesis, suggesting that at least one group's mean differs significantly from the others. The Fisher (F) statistic is the ratio of the variation between groups to the variation within groups. The p-value is the probability of obtaining an F-value at least as large as the one observed if the null hypothesis were true. The critical F-value serves as a threshold for rejecting the null hypothesis.
In this study, a significance level (α) of 0.05 was used, meaning the null hypothesis is rejected if the p-value falls below this pre-defined cutoff. Such a result would indicate a statistically significant difference between the group means being compared. Table 6 summarizes the assessment results for the various models evaluated.
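A minimal sketch of the F statistic follows, assuming the usual one-way ANOVA decomposition into between-group and within-group sums of squares. The two groups stand for experimental and predicted values; the numbers are invented and the function name is hypothetical.

```python
def one_way_anova_f(*groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    ss_between = sum(len(group) * (mean - grand_mean) ** 2
                     for group, mean in zip(groups, group_means))
    ss_within = sum((v - mean) ** 2
                    for group, mean in zip(groups, group_means)
                    for v in group)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Invented experimental and predicted values, for illustration only.
experimental = [0.20, 0.50, 0.90, 1.40, 0.70]
predicted = [0.25, 0.45, 1.00, 1.30, 0.72]
f_value = one_way_anova_f(experimental, predicted)
print(f_value)
```

The resulting F is then compared with the critical value for the corresponding degrees of freedom at the chosen significance level; an F below that threshold means the null hypothesis of equal means is not rejected.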
Models | F-statistic | p-value | F-crit | Experimental–predicted values |
---|---|---|---|---|
GBM | 0.0336 | 0.8545 | 3.8491 | Insignificant |
XRT | 0.0148 | 0.9031 | 3.8491 | Insignificant |
RF | 0.0019 | 0.9648 | 3.8491 | Insignificant |
M5_Prun | 0.0150 | 0.9026 | 3.8491 | Insignificant |
M5_Unprun | 0.0267 | 0.8703 | 3.8491 | Insignificant |
RT | 0.0102 | 0.9196 | 3.8491 | Insignificant |
REP_Prun | 0.0004 | 0.9844 | 3.8491 | Insignificant |
REP_Unprun | 0.0016 | 0.9681 | 3.8491 | Insignificant |
MLR | 0.1516 | 0.6971 | 3.8491 | Insignificant |
MNLR | 0.0735 | 0.7864 | 3.8491 | Insignificant |
However, in this study, all models yielded p-values well above 0.05, with the highest F-statistic (for MLR) being only 0.1516. None of the models' F-statistics exceeded the critical F-value (F-crit) of 3.8491, confirming that the observed differences in mean predictions are statistically insignificant. The results in Table 6 show that no model's predictions deviate significantly from the experimental SOTE values, with all tests marked as ‘Insignificant.’ This suggests that each model provides a reliable and consistent prediction of SOTE, with no single model standing out as exceptionally better or worse. The lack of significant difference likely reflects the similar predictive capabilities of these models under the current dataset and conditions, indicating that each model effectively captures the essential patterns required to predict SOTE accurately. This consistency across models enhances confidence in the robustness of the predictions and suggests that model selection may be based on other factors, such as computational efficiency or simplicity, rather than predictive accuracy.
Uncertainty exploration
To evaluate the predictive accuracy of the proposed models for SOTE, an uncertainty analysis was conducted solely on the testing dataset to ensure an unbiased evaluation of the models' predictive capabilities.
Confidence interval estimation, a statistical method that provides a quantitative measure of uncertainty, was employed. Constructing confidence intervals around the estimated values allowed the reliability of the results to be assessed. The method is advantageous because it is simple, interpretable, and widely applicable across statistical models. Additionally, the mean prediction error (ē) and the standard deviation of prediction errors (σe) were calculated to quantify the overall error and its variation. These metrics, combined with the confidence intervals, provide a comprehensive picture of each model's predictive accuracy and the consistency of its errors throughout the dataset.
Models | Mean prediction error (ē) | Standard deviation of prediction errors (σe) | ConL+ | ConL− | Band width (ConL+ − ConL−) | Rank |
---|---|---|---|---|---|---|
GBM | 0.0281 | 0.2055 | 0.4308 | −0.3747 | 0.8055 | 2 |
XRT | 0.0081 | 0.2776 | 0.5522 | −0.5360 | 1.0882 | 7 |
RF | 0.0086 | 0.2457 | 0.4903 | −0.4730 | 0.9633 | 5 |
M5_Prun | 0.0160 | 0.2455 | 0.4971 | −0.4652 | 0.9623 | 4 |
M5_Unprun | 0.0148 | 0.1919 | 0.3909 | −0.3613 | 0.7522 | 1 |
RT | −0.0159 | 0.3266 | 0.6242 | −0.6560 | 1.2802 | 9 |
REP_Prun | −0.0032 | 0.3195 | 0.6231 | −0.6295 | 1.2526 | 8 |
REP_Unprun | 0.0062 | 0.2636 | 0.5230 | −0.5105 | 1.0335 | 6 |
MLR | 0.0557 | 0.3653 | 0.7717 | −0.6603 | 1.4320 | 10 |
MNLR | 0.0307 | 0.2102 | 0.4428 | −0.3813 | 0.8242 | 3 |
The key takeaway from this analysis is that the ‘accuracy level’ is determined by the width of the confidence band. A narrower band indicates a more precise model, whereas a wider band indicates greater uncertainty in the model's predictions. M5_Unprun outperformed the others, showing the lowest uncertainty (0.7522), which suggests its superiority in predicting SOTE. In contrast, MLR had the highest uncertainty (1.4320), ranking last. Table 6 ranks all models (1–10) according to their uncertainty values, and this ranking agrees with the findings from the statistical evaluations, scatter plots, fitting curves, Taylor's diagrams, and box plots of relative error curves. In general, the uncertainty analysis confirms M5_Unprun as the best of the proposed models, while the others showed satisfactory performance.
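The band widths discussed above can be reproduced with a short Python sketch. It assumes the limits are the mean prediction error plus or minus 1.96 standard deviations (a 95% normal interval); that multiplier is not stated in this section, but it matches the tabulated limits to within rounding. The names `confidence_band`, `TABLE`, and `Z_95` are hypothetical; the (mean error, standard deviation) pairs are copied from the uncertainty table.

```python
Z_95 = 1.96  # assumed two-sided 95% normal quantile (not stated in the text)


def confidence_band(mean_error, sd_error, z=Z_95):
    """Return (upper limit, lower limit, band width) of the prediction error."""
    upper = mean_error + z * sd_error
    lower = mean_error - z * sd_error
    return upper, lower, upper - lower


# (mean prediction error, standard deviation) as reported per model.
TABLE = {
    "GBM": (0.0281, 0.2055),
    "XRT": (0.0081, 0.2776),
    "RF": (0.0086, 0.2457),
    "M5_Prun": (0.0160, 0.2455),
    "M5_Unprun": (0.0148, 0.1919),
    "RT": (-0.0159, 0.3266),
    "REP_Prun": (-0.0032, 0.3195),
    "REP_Unprun": (0.0062, 0.2636),
    "MLR": (0.0557, 0.3653),
    "MNLR": (0.0307, 0.2102),
}

# Band width per model, then models ordered from narrowest to widest band.
widths = {name: confidence_band(m, s)[2] for name, (m, s) in TABLE.items()}
ranking = sorted(widths, key=widths.get)

for rank, name in enumerate(ranking, start=1):
    print(rank, name, round(widths[name], 4))
```

Under this assumption the band width reduces to 3.92 σe, so the ranking depends only on the standard deviation of the errors, while the mean error shifts the band up or down without changing its width.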
Engineering applications of optimized oxygen transfer in Venturi flumes
The engineering value of the results obtained in this study lies in optimizing standard oxygen transfer efficiency (SOTE) in Venturi flumes, which is crucial for improving water quality in various environmental and industrial applications. The research demonstrates the effectiveness of ML models, particularly the M5_Unprun model, in accurately predicting SOTE by analyzing the impact of key design parameters such as throat width and gauge readings. These insights can be applied in wastewater treatment plants and aquaculture, where precise oxygenation is essential for maintaining healthy aquatic ecosystems.
By optimizing the design of Venturi flumes, this study provides a pathway to enhance aeration efficiency, reduce energy consumption, and improve the overall sustainability of water treatment processes. The results also offer valuable guidance for engineers and practitioners in the field, helping them design and implement more efficient oxygen transfer systems that meet the specific requirements of various hydraulic applications. Moreover, the integration of ML models with experimental data can inform future innovations in water treatment technologies, offering a more cost-effective and accurate approach to managing DO levels in water bodies.
CONCLUSIONS
Findings
This study investigated the effectiveness of various ML models in predicting SOTE in Venturi flumes, devices crucial for enhancing DO levels in water treatment. Venturi flumes achieve this by manipulating water velocity and pressure through converging, throat, and diverging sections, promoting oxygen absorption. Key variables, including discharge per unit width (q), throat width (W), throat length (F), upstream entrance width (E), and gauge readings (Ha and Hb), were examined using data from oxygenation experiments with different flume configurations.
GBM, XRT, RF, the M5 model tree (M5_Prun and M5_Unprun), RT, and REP (REP_Prun and REP_Unprun) were compared with traditional regression models (MLR and MNLR) and evaluated using CC, RMSE, and MAE. Graphical assessments were performed using scatter plots, Taylor's diagrams, box plots, relative error curves, and fitting curves. The evaluation was reinforced through uncertainty analysis, ANOVA, sensitivity analysis, correlation analysis, and SHAP analysis. Together, these techniques provided deeper insight into oxygen transfer and a solid foundation for a detailed assessment.
The M5_Unprun model emerged as the top performer, demonstrating the most accurate SOTE predictions across testing datasets with the highest CC value of 0.9455, the lowest RMSE value of 0.1918, and a minimal MAE value of 0.0030. Notably, the GBM model also displayed promising potential with a CC value of 0.9372, an RMSE value of 0.2067, and an MAE value of 0.0006.
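For readers who wish to compute the same three metrics on their own data, the following Python sketch uses the standard textbook definitions of CC (Pearson correlation), RMSE, and MAE. The exact formulas used in the study are assumed to match these definitions; the function name `evaluate` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import fmean


def evaluate(observed, predicted):
    """Return (CC, RMSE, MAE) for paired observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    mean_o = fmean(observed)
    mean_p = fmean(predicted)
    # Pearson correlation coefficient between observations and predictions.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    cc = cov / sqrt(var_o * var_p)
    # Root mean square error and mean absolute error of the predictions.
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = sqrt(fmean([e ** 2 for e in errors]))
    mae = fmean([abs(e) for e in errors])
    return cc, rmse, mae


# Hypothetical values, for illustration only (not the study's data).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
cc, rmse, mae = evaluate(observed, predicted)
print(cc, rmse, mae)
```

The sketch does not guard against constant inputs, for which the correlation coefficient is undefined.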
The findings are further validated through multiple evaluation criteria. Scatter plots and fitting curves demonstrated a strong agreement between predicted and experimental values, indicating high accuracy. Likewise, relative error curves in box plots consistently showed minimal error. Additionally, the Taylor diagram confirmed the model's effectiveness in closely aligning with experimental data.
On the testing dataset, MNLR was the stronger of the two traditional models, outperforming the proposed ML models other than M5_Unprun and GBM, with a CC value of 0.9343, an RMSE value of 0.2118, and an MAE value of 0.0001. This highlights MNLR's strong predictive power despite its simpler structure. In contrast, MLR showed the weakest performance among both ML and traditional models.
The uncertainty analysis confirmed the earlier results, highlighting M5_Unprun and GBM as the most precise models with the narrowest uncertainty bands of 0.7522 and 0.8055, respectively, while MLR had the widest band at 1.4320. Furthermore, the analysis indicated a tendency for the RT and REP_Prun models to underestimate SOTE, whereas GBM, XRT, RF, the M5 variants, REP_Unprun, and the traditional models generally overestimated SOTE.
Finally, one-way ANOVA indicated no significant differences between the predicted and experimental values of the models, suggesting their overall reliability. This consistency implies that the model choice can prioritize factors like computational efficiency or simplicity over predictive accuracy, as all proposed models reliably capture essential patterns for SOTE prediction.
Combined correlation, sensitivity, and SHAP analyses consistently identify throat width (W) and midpoint throat gauge reading (Hb) as the most influential factors affecting SOTE in the Venturi flume system. Correlation values of 0.26221 for W and 0.17407 for Hb suggest that increases in these parameters slightly improve SOTE by enhancing turbulence and promoting effective air–water interaction. Sensitivity analysis confirms their critical role in model accuracy, while SHAP analysis further highlights their positive impact on oxygen transfer. These findings collectively underscore the importance of W and Hb in optimizing Venturi flume design and maximizing oxygen transfer efficiency.
In conclusion, M5_Unprun emerged as the most effective model for predicting SOTE in Venturi flumes, followed by GBM. The study further emphasized the significance of throat width (W) and gauge reading (Hb) in optimizing oxygen transfer efficiency.
Research limitations
The controlled laboratory setting of this study, while crucial for ensuring experimental precision, imposes certain constraints on the generalizability of the findings. First, the laboratory-scale experiments may not fully replicate full-scale hydraulic conditions due to differences in Reynolds and Froude numbers. These variations can affect oxygen transfer dynamics and the measurement of SOTE. Second, the study assumes a uniform flow distribution within the Venturi flume, which may not hold under complex flow patterns. This assumption could introduce uncertainties in measuring flow parameters and oxygen transfer rates. Third, the study did not explicitly consider external factors, such as ambient air conditions, which may influence oxygen transfer efficiency. Finally, the M5_Unprun model, despite its effectiveness in predicting SOTE, is restricted by the quality and quantity of its training data. While both the M5_Unprun and GBM models demonstrated strong performance, the limited sample of tested flume geometries and potential measurement errors restrict the confidence with which these findings can be generalized.
Implications of the study
The findings of this study have significant implications for the design, optimization, and operation of Venturi flumes in water treatment processes. The identification of M5_Unprun and GBM as top-performing models provides a robust foundation for predicting SOTE in various flume configurations. This predictive capability can aid in the design of new flumes, optimizing parameters like throat width, length, and entrance width to maximize oxygen transfer efficiency. Furthermore, the study's emphasis on the importance of throat width (W) and gauge reading (Hb) as key factors influencing SOTE offers practical insights for field applications. By carefully considering these parameters during design and operation, engineers can enhance oxygen transfer rates and improve the overall performance of water treatment systems.
Recommendations for future research
Future research should aim to overcome the limitations of laboratory-controlled conditions by investigating a broader range of Venturi flume geometries and operating conditions, including multi-phase flows and a variety of environmental factors, to better mimic real-world scenarios. Expanding the use of ML algorithms through ensemble and hybrid approaches, as well as integrating numerical modelling, could deepen insights into SOTE, enabling more accurate extrapolation to full-scale applications. Combining analytical modelling with data-driven approaches would also improve model interpretability, clarifying the physical processes influencing SOTE and offering valuable insights for practical implementation. Further validation through larger datasets and field-scale studies is essential to enhance the accuracy and robustness of these models, ensuring their effectiveness in real-world environmental and industrial aeration systems.
AUTHOR CONTRIBUTIONS
N.K.T. was accountable for the conceptualization, method, and software development; envisaged the published work; and wrote, reviewed, and edited the original draft. D.P. wrote, reviewed, and edited the original manuscript, participated in data curation, and investigated this work. N.K.T. and D.P. wrote, reviewed, and edited the original manuscript and supervised this work. D.P. was responsible for the methodology and software development. N.K.T. validated and visualized the published work.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.