ABSTRACT
The integrated solar and hydraulic jump-enhanced waste stabilization pond (ISHJEWSP) has been proposed as a solution to enhance the performance of the conventional waste stabilization pond (WSP). Despite the better performance of the ISHJEWSP, there is seemingly no previous study that has deployed machine learning (ML) methods in modelling the ISHJEWSP. This study is aimed at determining the relationships between the ISHJEWSP effluent parameters as well as comparing the performance of extra trees (ET), random forest (RF), decision tree (DT), light gradient boosting machine (LightGBM), gradient boosting (GB), and extreme gradient boosting (XGBoost) methods in predicting the effluent biochemical oxygen demand (BOD5) in the ISHJEWSP. The feature importance technique indicated that the most important parameters were pH, temperature, solar radiation, dissolved oxygen (DO), and total suspended solids (TSS). These selected features yielded strong correlations with the dependent variable except DO, which had a moderate correlation. With respect to the coefficient of determination (R2) and root mean square error (RMSE), XGBoost performed better than the other models [R2 = 0.807, mean absolute error (MAE) = 4.3453, RMSE = 6.2934, root mean squared logarithmic error (RMSLE) = 0.1096]. Gradient boosting, XGBoost, and RF yielded the least MAE, RMSE, and RMSLE of 3.9044, 6.2934, and 0.1059, respectively. The study demonstrates the effectiveness of ML in predicting the effluent BOD5 in the ISHJEWSP.
HIGHLIGHTS
BOD5 prediction models were developed using different ML methods.
The performance of various gradient boosting machines was evaluated.
Extreme gradient boosting proved to be better than other models.
Feature importance indicated the most important variables.
The study demonstrated the effectiveness of ML in predicting BOD5.
INTRODUCTION
The global quest for improved living standards, through a varied range of activities, continually impacts water quality. Wastewater constitutes a risk factor to human health and serves as a source of environmental contamination. The increasing generation of wastewater, as a result of rapid population growth and industrialization, further exacerbates the scale of the problem. Therefore, wastewater treatment plays a vital role in safeguarding public health and the sustainability of the environment. The development of proper wastewater treatment is crucial for the prevention of many types of transmittable diseases (Olukanni & Ducoste 2011). The conventional wastewater treatment process train consists of various unit operations. The technical expertise required to operate these unit operations, together with the costs of construction and operation, renders them less attractive to most developing countries. Hence, it is necessary to develop a rapid, simple, eco-friendly, effective, and efficient method of wastewater treatment (Gedda et al. 2021).
A waste stabilization pond (WSP) is a natural treatment process and is therefore economically feasible compared to other treatment technologies with respect to its maintenance cost and energy requirement (Mahapatra et al. 2022). Its simple construction and low cost make it desirable in developing countries (Ho et al. 2017). However, the applicability of the WSP is limited by its large area requirement (Mara et al. 1983) as well as the problem of land availability. Consequently, various studies have been carried out to improve the performance of the conventional WSP via solar enhancement (Utsev & Agunwamba 2020). The high solar irradiance resulting from solar enhancement improved the treatment efficiencies and hence reduced the land area requirement (Utsev & Agunwamba 2020). Previously, the effects of pond depth (Hosetti & Patil 1987; Oragui et al. 1987; Silva et al. 1987), tapering (Agunwamba 2001), and baffles (Shilton & Mara 2005) have been reported. The integrated solar and hydraulic jump-enhanced waste stabilization pond (ISHJEWSP) incorporates solar reflectors and a hydraulic jump into the conventional WSP (Ogarekpe & Agunwamba 2016b). Besides the physical modification of the conventional WSP, several tools have been utilized for the analysis, interpretation, and representation of the results of these studies. One such analytical tool is machine learning (ML).
ML models use algorithms to recognize patterns in data, with one data subset used for training and a separate data subset used to verify the prediction accuracy in a procedure known as testing (Hammed et al. 2021). ML has been described as an appropriate solution for difficult and elusive problems involving input and output variables, where mathematical equations are difficult to use, assumptions are needed to interpret the equations, and the outcomes are hard to explain (Yang et al. 2024). Previously, ML algorithms have been found useful in predicting uncertain treatment performances involving the symbiotic relationship between algae and bacteria (Sundui et al. 2021). A better understanding of WSP performance was achieved using multiple regression and neural network modelling (Khodadadi et al. 2016). Nitrogen removal in WSPs has been modelled using optimal parameterization, local sensitivity analysis, and global uncertainty analysis (Mukhtar et al. 2017). Khatri et al. (2023) used artificial neural network (ANN)-based models for predicting the performance of a combined upflow anaerobic sludge blanket and facultative pond. Superior effluent quality predictive performance of back propagation neural networks (BPNN) over traditional mathematical models has been reported (Gao et al. 2023).
Biochemical oxygen demand (BOD5) is an important water quality indicator (Ooi et al. 2022). However, measuring BOD5 is time consuming and could delay important and appropriate decisions, plans, and actions. The indirect estimation of major wastewater quality parameters using ML has been advocated (Ooi et al. 2022). Modelling BOD5 would save the time and cost associated with actual measurements, as it can be estimated from the predictor variables in the model. ML methods have been developed for an accurate, reliable, and cost-effective prediction of BOD (Nafsin & Li 2022). Random forest (RF), support vector regression (SVR), and multilayer perceptron (MLP) have been used for the prediction of BOD5 in water based on other physicochemical characteristics of water (Ooi et al. 2022). A good model performance, for BOD measurement of the effluent quality, was obtained using optimized extreme ML (Yu et al. 2019). Granata et al. (2017) predicted BOD5 and other wastewater quality indicators using SVR and regression tree (RT) techniques. BOD5 has been predicted using ANN, support vector machine (SVM), RF, gradient boosting machine (GBM), and other hybrid algorithms (Nafsin & Li 2022).
Previous studies have shown that the ISHJEWSP yields better treatment efficiencies than the conventional WSP (Ogarekpe & Agunwamba 2016a). The precedence of ambient climatic factors and the state of the sewage over other variables, in relation to the performance of the ISHJEWSP, has been highlighted (Ogarekpe et al. 2022). The effect of rate constant models on the performance of the ISHJEWSP model was evaluated (Ogarekpe 2018). Despite the better performance of the ISHJEWSP, there is seemingly no previous study that has deployed ML methods in modelling the ISHJEWSP. This study is aimed at comparing the predictive performance of extra trees (ET), RF, decision tree (DT), light gradient boosting machine (LightGBM), GBM, and extreme gradient boosting (XGBoost) methods in modelling the effluent BOD5 in the ISHJEWSP. The specific objectives of this paper are to determine the relationships between the ISHJEWSP effluent parameters and to develop ML models for the prediction of the effluent BOD5 in the ISHJEWSP.
MATERIALS AND METHOD
Study area
Experimental setup and sample collection
Prediction models
The following algorithms were utilized for the analysis of the ISHJEWSP effluent data: DT, RF, ET, light gradient boosting machine (LightGBM), gradient boosting (GB), and XGBoost.
Decision tree
A DT model consists of nodes and branches (Song & Ying 2015; Charbuty & Abdulazeez 2021), and utilizes the important steps of splitting, stopping, and pruning in building the model (Song & Ying 2015). A tree consists of root nodes, non-terminal nodes, and terminal nodes (Swain & Hauska 1977). A node represents a test of an attribute and a leaf node provides the classification, while the branches from the selected node are the possible values (Gavankar & Sawarkar 2017). These tests are filtered down through the tree to get the right output for the input pattern (Navada et al. 2011). DT can simultaneously handle numerical and categorical input variables, is robust to outliers, and can efficiently deal with missing input data (Touzani et al. 2018). In spite of these advantages, DTs are prone to overfitting (James et al. 2013).
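The splitting step described above can be illustrated for a regression tree with a single numerical feature. The following is a minimal sketch, not the configuration used in this study: it assumes the split is chosen by minimizing the sum of squared errors (SSE) of the two child nodes, and the function names and the pH and BOD5 readings are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) of the split on x that minimizes the SSE of y."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split yet: error of the single root node
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical feature values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical readings (not the study's data): effluent pH and BOD5.
ph = [7.4, 7.9, 8.3, 9.6, 10.1, 10.8]
bod5 = [62.0, 60.0, 58.0, 31.0, 29.0, 27.0]
threshold, error = best_split(ph, bod5)
```

Applying the same search recursively to each child node grows the tree; stopping and pruning rules then limit its depth, which is one way to counter the overfitting noted above.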
Random forest
An RF is a tree-based ensemble method (Ahmad et al. 2018). RF uses randomization to create a large number of DTs (Rigatti 2017). Bootstrapped data subsets, used for training, are grown into unpruned regression (or classification) trees (Ahmad et al. 2018). Each new training set is drawn, with replacement, from the original training set, and each tree is grown using random feature selection (Breiman 2001). The randomized DTs are then aggregated by averaging (Biau & Scornet 2016). The out-of-bag samples are used to test the performance of the resulting RF model (Breiman 2001). RF makes few assumptions about the relationships between the variables and is extremely flexible (Langsetmo et al. 2023).
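The bootstrap and out-of-bag idea can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function name, the seed, and the number of records are hypothetical and do not come from the study.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    # Rows never drawn form the out-of-bag set for this tree.
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 20          # hypothetical number of effluent records
in_bag, out_of_bag = bootstrap_indices(n_samples, rng)
```

Each tree in the forest would be trained on its own `in_bag` rows and evaluated on its `out_of_bag` rows, and the forest prediction for a regression task is the average of the individual tree predictions.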
Extra tree
The ET method proposed by Geurts et al. (2006) belongs to the class of DT-based ensemble learning methods (John et al. 2016). DT-based ensemble methods utilize multiple DTs to perform classification and regression tasks (Gall et al. 2011). ET adds another layer of randomness to decision forests and utilizes an approach that reduces the search space, resulting in faster training (Maier et al. 2015). ET employs the same principle as RF (John et al. 2016) and is less likely to overfit a dataset (Hammed et al. 2021). In addition, the use of the ET ensemble technique for selecting the most important features has been reported (Arya et al. 2022).
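The additional layer of randomness can be illustrated by how a candidate split threshold is chosen. In the sketch below, the cut-point is drawn uniformly between the minimum and maximum of a feature instead of being searched exhaustively; the function name and the temperature values are hypothetical, and the sketch covers only this one step of the method.

```python
import random


def random_cut_point(feature_values, rng):
    """Pick a split threshold uniformly between the feature's min and max."""
    low, high = min(feature_values), max(feature_values)
    return rng.uniform(low, high)


rng = random.Random(1)  # fixed seed so the sketch is repeatable
temperature = [24.5, 26.0, 27.8, 29.1, 31.4]  # hypothetical values, in deg C
cut = random_cut_point(temperature, rng)
```

Because no threshold search is performed for the feature, the cost of evaluating a candidate split is lower than in an exhaustive search, which is consistent with the faster training noted above.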
Gradient boosting machine
Boosting models iteratively combine several simple models to obtain improved prediction accuracy (Touzani et al. 2018). Boosting has been utilized for classification problems (Freund 1995). Friedman (2001) introduced the GBM by extending boosting to regression. Gradient boosting is a way to gradually reduce error (Ayyadevara 2018). The GBM method can be considered a numerical optimization algorithm that aims at finding an additive model that minimizes the error function (Touzani et al. 2018). Each additional base model in a GBM aims at correcting the mistakes of the previous base models (Zhang & Haghani 2015). The family of gradient boosting algorithms has been extended with proposals that are centred on speed and accuracy (Bentéjac et al. 2021). These extensions include XGBoost, LightGBM, and CatBoost (Bentéjac et al. 2021). XGBoost is a scalable ensemble technique, while LightGBM uses selective sampling of high-gradient instances to provide extremely fast training performance (Bentéjac et al. 2021).
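The additive, error-correcting nature of gradient boosting can be sketched as follows. The sketch assumes a squared-error loss, for which the negative gradient is simply the residual, and uses one-split trees (stumps) as base models. It is not the GB, LightGBM, or XGBoost implementation used in this study; the function names, learning rate, number of rounds, and data are hypothetical.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right leaf is non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.3):
    """Squared-error boosting: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)  # initial model: the mean of the target
    stumps, history = [], []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi) for xi, pi in zip(x, predictions)]
        # Track the training mean squared error after each round.
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict, history


# Hypothetical readings (not the study's data): effluent pH and BOD5.
ph = [7.4, 7.9, 8.3, 9.6, 10.1, 10.8]
bod5 = [62.0, 60.0, 58.0, 31.0, 29.0, 27.0]
predict, history = gradient_boost(ph, bod5)
```

The `history` list records the training error after each round; under these assumptions it never increases, which reflects the gradual error reduction described above.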
Model training, testing, and evaluation
Data preprocessing and statistical analyses
The strength of the relationship between the variables was described using the following absolute indices: |r| < 0.500 (weak relationship), 0.500 ≤ |r| < 0.699 (moderate relationship), and |r| ≥ 0.700 (strong relationship). Prior to the correlation analysis, the data were cleaned in order to get rid of the outlier in the solar radiation data. The cleaning entailed using only the data within the upper and lower ranges of the box-and-whiskers plot (Figure 3). The statistical analyses were carried out in R version 4.2.2, while ML was implemented using Python.
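The two preprocessing rules above can be sketched in Python. Two points are assumptions rather than details taken from the study: the whiskers are taken as the common Tukey fences at 1.5 times the interquartile range, and values of |r| falling between 0.699 and 0.700 are treated as moderate. The function names and the solar radiation values are hypothetical.

```python
import statistics


def relationship_strength(r):
    """Label the absolute value of r using the bands stated in the text."""
    magnitude = abs(r)
    if magnitude >= 0.700:
        return "strong"
    if magnitude >= 0.500:
        return "moderate"  # assumption: the 0.699 to 0.700 gap falls here
    return "weak"


def within_whiskers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences, assumed)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical solar radiation readings; 1900 is a planted outlier.
solar = [410, 455, 470, 480, 495, 505, 520, 1900]
cleaned = within_whiskers(solar)
```

The quartile estimate depends on the interpolation method, so the fences computed here may differ slightly from those drawn by R's box plot for the same data.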
RESULTS AND DISCUSSION
Relationship between ISHJEWSP parameters
In order to demonstrate the relationships between the variables, Pearson's correlation analysis was carried out for the ISHJEWSP parameters. From the results, a strong positive relationship was observed between temperature and pH (r = 0.878), TSS and BOD5 (r = 0.836), solar radiation and pH (r = 0.888), and solar radiation and temperature (r = 0.889). Conversely, a strong negative relationship was observed between BOD5 and pH (r = −0.880), solar radiation and BOD (r = −0.833), TSS and pH (r = −0.843), TSS and temperature (r = −0.840), and TSS and solar radiation (r = −0.864). The remaining parameters have either a weak positive or negative relationship between the respective variables (Figure 4).
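For reference, Pearson's correlation coefficient for a pair of parameters can be computed as in the sketch below. The study's analysis was carried out in R; this Python version is only illustrative, and the paired readings are hypothetical rather than the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired readings (not the study's data): effluent pH and BOD5.
ph = [7.4, 7.9, 8.3, 9.6, 10.1, 10.8]
bod5 = [62.0, 60.0, 58.0, 31.0, 29.0, 27.0]
r = pearson_r(ph, bod5)
```

A negative value of `r` indicates that one parameter tends to decrease as the other increases, as reported above for BOD5 and pH.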
The relationships reported in this study were compared with trends from previous studies. In the past, a strong positive relationship was obtained between algae and temperature for the ISHJEWSP under review (Ogarekpe et al. 2016). An increase in temperature and algae concentration results in an increase in photosynthetic activity. The consumption of CO2 during photosynthesis, faster than it can be replaced by bacterial respiration, results in the occurrence of high pH values above 9 in ponds (Mara & Pearson 1998). Consequently, algae concentration and solar radiation (which provides energy for photosynthesis) play a vital role in enhancing the strong positive relationships between pH and temperature, as well as solar radiation and pH. Elevated temperature enhanced pH in a WSP in Portugal (Pearson et al. 1987). A strong positive relationship between solar radiation and surface temperature has been reported (Daut et al. 2012). TSS and BOD are also related: a portion of the TSS, the volatile suspended solids, exerts an oxygen demand in a facultative lagoon (Gerardi 2015).
The pH of the ISHJEWSP effluent ranged between 7.2 and 11.2. The high pH values perhaps inhibited the oxidation of organic matter, hence the strong negative relationship between the pH and BOD. Based on extrapolations from the results of activated sludge, Pipes (1962) stated that for a pH above 9.0, the oxidation of organic matter in a stabilization pond is severely inhibited. Most bacteria can grow in a pH range between 5 and 9; pH values below 5 or above 8.5 affect the growth and survival of aquatic microorganisms (Pearson et al. 1987). Microbial activity is slow below 5 °C, reaches a maximum between 25 and 30 °C, and thereafter decreases to a minimum at about 65 °C (Skiba 2008). The decrease or die-off of microorganisms at elevated temperature perhaps inhibited the degradation of the organic matter component of the TSS, hence the negative relationship between temperature and TSS as well as temperature and BOD.
The role of solar radiation in photosynthesis, and subsequently on the pH of the ISHJEWSP, plays a vital role in enhancing the strong negative relationships between solar radiation and BOD as well as solar radiation and TSS. The relationships are consequent upon the range of pH values obtained, hence, the inhibition of the oxidation of organic matter. The abatement of suspended solids, in oxidation pond effluents, at pH of 10.2 and above has been reported (Elmaleh et al. 1996). The high pH range from the study accounts for the abatement of the TSS. Hence, it justifies the strong negative relationship between the pH and TSS.
Conventionally, the dissolution of oxygen in pure water decreases with an increase in temperature (Xing et al. 2014). Weak negative relationships existed between DO and temperature, as well as DO and solar radiation. The DO and the ensuing relationships were influenced by algae concentration, solar radiation, and the photosynthetic activities in the pond. Oxygen, provided by the algal population, principally accounts for the oxygenation in pond systems (Mara & Pearson 1998). The diurnal variation of DO in WSP systems, in response to the photosynthetic activities of algae, has been reported (Ho et al. 2018). The timing and magnitude of solar radiation, as well as the variability in the algal concentration, perhaps confer additional complexities on the relationships between DO and the other parameters.
Comparison between GBMs, RF, and ET models
A comparison of the outcomes of the various ML models for effluent BOD5 prediction is presented in Table 1, considering the metrics of R2, MAE, RMSE, and RMSLE. The ML models under review included ET, RF, light gradient boosting machine (LightGBM), DT, gradient boosting (GB), and XGBoost.
 | Model | MAE | RMSE | RMSLE |
---|---|---|---|---|
Tree-based models | Extra trees regressor | 3.9379 | 6.6906 | 0.1065 |
 | Random forest regressor | 4.0974 | 6.5025 | 0.1059 |
 | Light gradient boosting machine | 4.8543 | 6.9004 | 0.1255 |
 | Decision tree regressor | 5.0714 | 7.1975 | 0.1237 |
 | Gradient boosting regressor | 3.9044 | 6.8201 | 0.1071 |
 | Extreme gradient boosting | 4.3453 | 6.2934 | 0.1096 |
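The model with the least error for each metric can be read off Table 1 programmatically. The sketch below transcribes the tabulated values; the dictionary layout and the function name are illustrative choices.

```python
# Error metrics transcribed from Table 1 (lower is better for all three).
table_1 = {
    "Extra trees regressor": {"MAE": 3.9379, "RMSE": 6.6906, "RMSLE": 0.1065},
    "Random forest regressor": {"MAE": 4.0974, "RMSE": 6.5025, "RMSLE": 0.1059},
    "Light gradient boosting machine": {"MAE": 4.8543, "RMSE": 6.9004, "RMSLE": 0.1255},
    "Decision tree regressor": {"MAE": 5.0714, "RMSE": 7.1975, "RMSLE": 0.1237},
    "Gradient boosting regressor": {"MAE": 3.9044, "RMSE": 6.8201, "RMSLE": 0.1071},
    "Extreme gradient boosting": {"MAE": 4.3453, "RMSE": 6.2934, "RMSLE": 0.1096},
}


def best_per_metric(results):
    """Map each metric to the model with the smallest value for it."""
    metrics = next(iter(results.values())).keys()
    return {m: min(results, key=lambda model: results[model][m]) for m in metrics}


best = best_per_metric(table_1)
```

No single model is best on all three metrics, so the choice of a preferred model depends on which error measure is given priority.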
In the past, the combination of the R2 score and error metrics has been used to determine, among a select set of ML models, the best model for BOD5 prediction (Ooi et al. 2022). Different ML models have been reported to perform better for certain variables than others (Khodadadi et al. 2016). BOD removal has been predicted satisfactorily using ANN (Akratos et al. 2008). The use of hybrid intelligent systems and ANN has yielded high predictive accuracies of water treatment efficiencies for BOD, chemical oxygen demand (COD), heavy metals, and organics (Malviya & Jaspal 2021).
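The R2 score and the error metrics referred to above are conventionally defined as in the following sketch. The RMSLE form shown, based on log(1 + value), is the common convention and is an assumption here, since the exact formula used in this study is not stated; the observed and predicted values are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + value) (assumed form)."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted effluent BOD5 values.
observed = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 53.0, 59.0]
scores = {
    "R2": r2(observed, predicted),
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "RMSLE": rmsle(observed, predicted),
}
```

MAE and RMSE are expressed in the units of BOD5, whereas RMSLE measures relative error and R2 is dimensionless, which is one reason different models can rank first on different metrics.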
CONCLUSION
This study compared the predictive performance of ET, RF, DT, light gradient boosting machine (LightGBM), gradient boosting (GB), and XGBoost methods in predicting the effluent BOD5 in the ISHJEWSP. With respect to the error evaluation metrics, the performance of the ML models varied, with different models yielding the least error values for different metrics. The relationships between the parameters of the ISHJEWSP were found to range from strong positive or negative to weak positive or negative relationships. The feature importance indicated that the most important parameters were pH, temperature, solar radiation, DO, and TSS. These selected features yielded strong correlations with the dependent variable except DO, which had a moderate correlation. With respect to the coefficient of determination and RMSE, XGBoost performed better than the other models (R2 = 0.807, MAE = 4.3453, RMSE = 6.2934, RMSLE = 0.1096). XGBoost was the best performing algorithm for RMSE and was in the middle range of the tree-based models for the other performance metrics. The study demonstrates the effectiveness of ML in predicting BOD5 in the ISHJEWSP. This can result in significant time savings relative to the determination of BOD5 via experimental procedures, especially when pollution mitigation works are required without delay.
AUTHOR CONTRIBUTIONS
NO initiated the study and wrote the first draft. The data collection was carried out by NO. IT wrote some sections of the paper as well as carried out the machine learning analysis. JA, OU, and AC revised the paper.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.