Abstract
The dynamic nature of surface water resources makes them highly vulnerable to contamination by waste from multiple sources. Water quality monitoring is critical to water environmental management, and successful monitoring provides direction and confirms the effectiveness of water management. Models based on artificial intelligence are important for planning appropriate mitigation measures for surface water quality. However, this remains a challenge, and model accuracy still needs to be improved. Faster and cheaper monitoring is required because of the real-world impact of low water quality. With this motivation, this research examines an array of machine-learning algorithms to estimate water quality. The proposed approach uses Random Forest for modeling and is useful for predicting surface water quality in the Kulik River region of West Bengal, India. It is a good tool for assessing water quality and ensuring the safe use of drinking water. Various water quality parameters (iron, fluoride, total coliform, fecal coliform, pH, total dissolved solids, magnesium, alkalinity, chloride, total hardness, nitrate, calcium, and Escherichia coli) were measured seasonally (winter, summer, and rainy season) over 10 years (2010–2019). The water quality parameters predicted in this study were total dissolved solids (TDS), pH, and iron.
HIGHLIGHTS
Most people in North Bengal depend on the Kulik River for multiple purposes such as settlement, cultivation, irrigation, fishing, and various primary activities, so water quality monitoring and management of the Kulik River is needed.
Analysis and prediction of 13 parameters will be helpful for society.
The proposed approach used Random Forest for modeling and assessing the water quality.
INTRODUCTION
Water quality has a direct effect on public health and the environment. Water is used for various purposes, such as drinking, agriculture, and industry. Recently, the development of water sports and recreation has done much to attract visitors (Jennings 2007). Among the various sources of water supply, rivers have most often been used for the development of human societies because they are easy to access. The use of other water sources, including groundwater and seawater, has at times been associated with problems. For example, the use of groundwater without adequate replenishment leads to subsidence (Motagh et al. 2017), and the use of seawater is usually associated with the transfer of pollutants (El-Kowrany et al. 2016). Therefore, the use of rivers has attracted attention. Surveying the water quality of rivers is not an uncommon research topic in earth science.
Studies of river water quality consider both the measurement of water quality components and the definition of the pollutant transfer mechanism (Kashefipour 2002; Kashefipour & Falconer 2012; Naseri Maleki & Kashefipour 2012; Qishlaqi et al. 2016). Among the water quality components, the measurement of dissolved oxygen (DO), chemical oxygen demand (COD), biochemical oxygen demand (BOD), electrical conductivity (EC), pH, temperature, K, Na, Mg, etc. has been proposed (Şener et al. 2017). To this end, governments have built hydrometric stations along rivers that flow through urban regions, agro-industrial projects, and industrial zones, and along rivers that are part of reservoirs (Herschy 1993; Kejiang 1993). Water quality assessment is a basic step in developing agricultural projects in terms of determining cropping patterns, the type of irrigation systems, and water purification structures for industry (Chen et al. 2017). To study the mechanism of pollutant transfer, advanced numerical techniques including computational hydraulics, image processing, and GIS techniques have been applied in addition to field and laboratory experiments (Parsai & Haghiabi 2015, 2017a, 2017b).
By reviewing the time series of water quality components, investigators have attempted to estimate future values. Recently, researchers have tried to study the time series of water quality components and their internal relationships more accurately by using advanced soft computing techniques in the fields of water and environmental engineering (May et al. 2008; Palani et al. 2008; Haghiabi 2016a, 2016b; Jaddi & Abdullah 2017). In this regard, Emamgholizadeh et al. (2013) studied the performance of the Multilayer Perceptron (MLP), Radial Basis Function (RBF), and Adaptive Neuro-Fuzzy Inference System (ANFIS) in predicting the water quality components of the Karoon River. They reported that all of the applied models performed reasonably well in predicting water quality components; however, the MLP model was slightly more accurate. Shokoohi et al. (2017) studied water quality management in a water distribution system. They treated this as an optimization problem and used state-of-the-art optimization techniques to solve it. Zhang et al. (2010) introduced a new method for water allocation. They considered water quality to be one of the most important elements of their method.
Nikoo & Mahjouri (2013) developed a PSVM (Probabilistic Support Vector Machines) model coupled with a GIS method for planning the quality and distribution of soil and groundwater in Iran. They stated that using these techniques could provide accurate data for feasibility studies of water conservation projects. Heddam (2016a, 2016b, 2016c, 2016d, 2016e) applied artificial neural networks to predict water quality components in several case studies.
He stated that artificial intelligence techniques perform reasonably well in modeling and predicting the internal relationships between water quality components and in modeling their time series. The review of the literature shows that water quality assessment and forecasting is essential for developing water conservation projects, and that artificial intelligence techniques have been proposed for this purpose. Therefore, in this study, the water quality components of the Kulik River, the main river of the city of Raiganj, were modeled using Random Forest.
MATERIALS AND METHODS
Study area
Methodology
All samples were collected by the authors from the four hydrological stations (Table 1). The surface water quality of the river basin was assessed through systematic sampling over 10 years (2010–2019) (Table 2). Similar work was conducted by Roy et al. (2022). The parameter values in the dataset are represented numerically. For prediction, either a water quality index can be constructed or a regression model can be used. Out of the 14 physicochemical and biological water quality parameters, such as chloride, alkalinity, total hardness, magnesium, total iron, calcium, Escherichia coli, fecal coliforms, and total coliforms, only three (pH, TDS, and iron) were modeled and predicted, using the Random Forest method.
Table 1. Geo-coordinates of the four sampling sites of the Kulik River
Sl. No. | Sampling sites | Name of the locality | Latitude | Longitude |
---|---|---|---|---|
1 | L-1 | Kalibari | 25° 38′ 10″N | 88° 07′ 25″E |
2 | L-2 | Kulik bridge (on NH-12) | 25° 38′ 06.7″N | 88° 07′ 19.9″E |
3 | L-3 | Abdulghata | 25° 38′ 10″N | 88° 07′ 25″E |
4 | L-4 | Bamuyaghat | 25° 40′ 26″N | 88° 09′ 06″E |
Table 2. Summary of descriptive statistics for water quality parameters
Year | AVG TDS (mg/L) | AVG pH | AVG Total Alkalinity (mg/L) | AVG Total Hardness (mg/L) | AVG Calcium as Ca (mg/L) | AVG Magnesium as Mg (mg/L) | AVG Chloride as Cl (mg/L) | AVG Sulfate as SO4 (mg/L) | AVG Nitrate as NO3 (mg/L) | AVG Total Iron as Fe (mg/L) | AVG Fluoride as F (mg/L) | AVG Total Coliforms (MPN/100 mL) | AVG Fecal Coliforms (MPN/100 mL) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 71.84 | 6.92 | 36.99 | 35.56 | 18.96 | 6.41 | 12.91 | 5.36 | 1.12 | 0.94 | 0.10 | 34.80 | 17.22 |
2011 | 69.35 | 6.94 | 36.85 | 34.19 | 18.85 | 6.34 | 12.71 | 5.33 | 1.12 | 0.94 | 0.10 | 30.45 | 12.87 |
2012 | 72.01 | 6.99 | 36.99 | 34.52 | 18.88 | 6.40 | 12.84 | 5.33 | 1.12 | 1.04 | 0.10 | 36.58 | 25.23 |
2013 | 74.35 | 6.89 | 36.99 | 34.62 | 18.88 | 6.44 | 12.91 | 5.33 | 1.12 | 0.74 | 0.10 | 31.76 | 12.64 |
2014 | 72.35 | 6.89 | 36.99 | 34.56 | 18.95 | 6.44 | 12.91 | 5.33 | 1.12 | 0.64 | 0.10 | 37.76 | 14.64 |
2015 | 73.68 | 6.89 | 36.80 | 34.66 | 18.90 | 6.40 | 12.91 | 5.32 | 1.12 | 0.74 | 0.10 | 43.76 | 18.64 |
2016 | 68.01 | 6.98 | 37.02 | 34.86 | 19.03 | 6.44 | 13.01 | 5.32 | 1.12 | 0.72 | 0.10 | 33.21 | 13.27 |
2017 | 69.48 | 6.98 | 37.02 | 34.98 | 18.98 | 6.40 | 13.01 | 5.33 | 1.12 | 0.93 | 0.10 | 32.57 | 15.98 |
2018 | 72.66 | 6.94 | 37.09 | 38.59 | 19.05 | 6.44 | 13.04 | 5.33 | 1.12 | 1.32 | 0.10 | 31.07 | 12.84 |
2019 | 77.01 | 6.83 | 37.23 | 40.62 | 19.05 | 6.41 | 13.07 | 5.34 | 1.10 | 1.40 | 0.10 | 33.93 | 22.01 |
Random Forest
Random Forest is one of the supervised machine-learning methods widely used for classification and regression problems. Breiman (2001) proposed the Random Forest algorithm, which has been extremely successful as a general-purpose classification and regression technique. The approach, which combines numerous randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. In addition, it is flexible enough to be applied to large-scale problems, is easily adapted to various ad hoc learning tasks, and returns measures of variable importance.
The Random Forest regression algorithm was chosen for the following two key reasons:
Multivariate regression analysis: The target parameter can be dependent on multiple attributes/parameters. This type of many-to-one relationship requires multivariate regression analysis instead of the usual one-to-one linear regression analysis.
Relatively small dataset: Because the dataset contains fewer than 5,000 samples, it is considered small. Small datasets are difficult to analyse because enough samples are needed both to train the model and to test and validate it.
Software used for Random Forest
Python was used for model building and prediction analysis. The Random Forest Regressor algorithm was imported from the ensemble methods available in the scikit-learn package. The dataset was split into training and test samples in a 7:3 ratio using the train_test_split method from the sklearn package. The matplotlib and seaborn packages were used for visualization. The pandas package was used for formatting and pre-processing the dataset.
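The workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' script: the synthetic data frame, the column names, `n_estimators=100`, and `random_state=42` are all hypothetical. Only the use of `RandomForestRegressor` and a 7:3 `train_test_split` comes from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the water quality dataset (hypothetical values and column names).
rng = np.random.default_rng(0)
n_samples = 120
df = pd.DataFrame({
    "pH": rng.normal(6.9, 0.1, n_samples),
    "calcium": rng.normal(18.9, 0.3, n_samples),
    "magnesium": rng.normal(6.4, 0.1, n_samples),
    "nitrate": rng.normal(1.12, 0.02, n_samples),
})
# Hypothetical target: TDS loosely driven by two features plus noise.
df["TDS"] = 2.0 * df["calcium"] + 5.0 * df["magnesium"] + rng.normal(0, 1.0, n_samples)

X = df.drop(columns="TDS")
y = df["TDS"]

# 7:3 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit the forest on the training part and predict the held-out part.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

Fixing `random_state` in both calls is a choice made here so that the split and the fitted forest are reproducible between runs.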
Model validation
Model accuracy
Here, we use two key metrics to measure whether the model can predict the target parameter with high accuracy: the coefficient of determination (R2) and the root mean squared error (RMSE).
The lower the RMSE of a model, the more precise its predictions and the smaller its residuals.
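As an illustration of how these two metrics are computed from observed and predicted values, a minimal sketch is given below. The helper names `rmse` and `r2` and the sample values are hypothetical and chosen only for this example; the formulas are the standard definitions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Hypothetical observed and predicted TDS values (mg/L), for illustration only.
observed = [70.0, 72.0, 74.0, 76.0]
predicted = [71.0, 72.0, 73.0, 76.0]
rmse_value = rmse(observed, predicted)
r2_value = r2(observed, predicted)
```

An R2 close to 1 together with a small RMSE (in the units of the target) indicates that the predictions track the observations closely.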
Dataset
The dataset used in this paper is made up of parameters that are considered vital for a healthy water ecosystem, such as total iron content, pH, sulfate, and nitrate levels. The breakdown of organic matter is measured in terms of TDS, the total amount of coliforms, and fecal coliform contents. Fluoride content, hardness, and alkalinity of the water are considered for the ergonomic use of the river water.
The data were collected from four different locations on the Kulik River. All parameters are recorded as numerical values, except for the presence of E. coli bacteria.
Random Forest regression algorithm
Random Forest is an ensemble learning algorithm: it combines multiple decision trees to produce a prediction with high accuracy. When multiple attributes are heavily correlated with the target parameter, a decision tree selects the parameter with the highest correlation with the target. From there, it builds the prediction through a sequence of comparisons with other parameters based on learned threshold values. Starting from the top (the parameter with the highest correlation with the target), it works its way down to the lowest-level nodes (with the least correlation with the target), ending in a leaf (decision/prediction) at the end of the tree. The mean squared error (MSE) is used to determine how the data branch at each node. MSE is the mean of the squared differences between the observed values and the value predicted at a node.
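To make the role of MSE in branching concrete, the sketch below chooses a split threshold for a single feature by minimizing the sample-weighted MSE of the two child nodes, where each node predicts the mean of its samples. It is a simplified illustration of the standard regression-tree criterion, not the authors' implementation; the function names `node_mse` and `best_split` and the sample values are hypothetical.

```python
import numpy as np

def node_mse(y):
    """MSE of a node when it predicts the mean of its samples."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - y.mean()) ** 2))

def best_split(x, y):
    """Return (threshold, weighted_mse) of the best split on one feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_score = None, float("inf")
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for threshold in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= threshold], y[x > threshold]
        # Child MSEs weighted by the share of samples in each child.
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = float(threshold), score
    return best_threshold, best_score

# Hypothetical magnesium (mg/L) and TDS (mg/L) values, for illustration only.
magnesium = [6.30, 6.35, 6.40, 6.45]
tds = [68.0, 69.0, 75.0, 76.0]
threshold, score = best_split(magnesium, tds)
```

In a full tree this search is repeated over all candidate features at every node, and a Random Forest averages the predictions of many such trees grown on random subsets of the samples and features.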
RESULTS
Ravindra et al. (2022) analyzed the surface water quality of the Amba River. Sipra & Baliarsingh (2017) analyzed the surface water quality of the Kathojodi River for prediction and modeling. In the present work, the physicochemical and microbial parameters of the Kulik River were studied. We studied 10 years of data from 2010 to 2019; according to Table 2, 2019 shows the highest average TDS value (77.01 mg/L), whereas the lowest was found in 2016 (68.01 mg/L). The water was slightly acidic in 2019 (pH 6.83), while in 2012 the pH was close to neutral (6.99).
Random Forest model's development
Data for the Kulik River were collected over a decade (2010–2019). TDS, pH, total iron content, fluoride content, presence or absence of E. coli and its content, chloride, magnesium, calcium, total alkalinity, total hardness, sulfate, and nitrate levels were documented.
A Random Forest model can help predict a target value accurately from multiple predictors. Regression analysis can help determine which variables most affect the value to be predicted. The data were processed by individual sampling date.
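One way to examine which variables most affect a target is sketched below: the Pearson correlation of each feature with the target, alongside the impurity-based importances of a fitted forest. This is only an illustrative sketch; the synthetic data, column names, and hyperparameters are assumptions, and the text does not state exactly how the reported values were computed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (hypothetical values and column names).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "magnesium": rng.normal(6.4, 0.1, n),
    "calcium": rng.normal(18.9, 0.3, n),
    "fluoride": rng.normal(0.10, 0.01, n),
})
# Hypothetical target: TDS driven mainly by magnesium, plus noise.
df["TDS"] = 30.0 * df["magnesium"] + rng.normal(0, 0.5, n)

# Pearson correlation of every feature with the target.
correlations = df.corr()["TDS"].drop("TDS")

# Impurity-based feature importances from a fitted forest (they sum to 1).
features = df.drop(columns="TDS")
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(features, df["TDS"])
importances = pd.Series(model.feature_importances_, index=features.columns)
```

Correlations are signed and range from −1 to 1, whereas forest importances are non-negative, so the two rankings are related but not interchangeable.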
Prediction with Random Forest for TDS
Correlation table for the features and target, TDS
Features | Feature importance (correlation with target) |
---|---|
pH | 0.71 |
Total alkalinity | −0.33 |
Total hardness | −0.17 |
Calcium | 0.765 |
Magnesium | 0.79 |
Chloride | −0.18 |
SO4 | 0.31 |
NO3 | 0.66 |
Total iron | 0.25 |
Fluoride | 0.06 |
Total coliforms | 0.29 |
Fecal coliforms | 0.12 |
Prediction with Random Forest for pH
Correlation table for the features and target, pH
Features | Feature importance (correlation with target) |
---|---|
TDS | 0.71 |
Total alkalinity | −0.49 |
Total hardness | −0.37 |
Calcium | 0.71 |
Magnesium | 0.85 |
Chloride | −0.38 |
SO4 | 0.14 |
NO3 | 0.72 |
Total iron | 0.02 |
Fluoride | 0.06 |
Total coliforms | 0.21 |
Fecal coliforms | 0.07 |
Prediction with Random Forest for iron
Correlation table for the features and target, iron
Features | Feature importance (correlation with target) |
---|---|
pH | 0.02 |
Total alkalinity | 0.07 |
Total hardness | 0.25 |
Calcium | 0.12 |
Magnesium | 0.03 |
Chloride | 0.01 |
SO4 | 0.13 |
NO3 | 0.03 |
TDS | 0.25 |
Fluoride | −0.05 |
Total coliforms | −0.05 |
Fecal coliforms | 0.26 |
DISCUSSION
This section discusses the prediction of the internal relationships between water quality components. Traditional GEP, SVM, RF, ANN, DT, and regression-based models are used in most published works on modeling surface water quality parameters. Many studies model and forecast water quality indicators using traditional AI algorithms, but the results were not as expected. Consequently, it is extremely important to combine modeling processes with optimization algorithms to achieve robust and accurate modeling results. Apart from recent work, few researchers have integrated modeling with search-based input optimization.
Al-Mukhtar & Al-Yaseen (2019) compared regression and ANN models for predicting EC and TDS, in line with previous and current studies that used modeling and optimization techniques. In addition, AliKhan et al. (2021) reported improved model results for predicting surface water salinity of the Indus using Random Forest.
The results of recent studies have shown that input optimization can be used to achieve higher modeling accuracy, a better model structure, reduced computation time, and reduced model complexity. In addition, models with integrated optimization algorithms perform better than standalone ANN, SVM, GEP, RF, and other regression analyses, delivering a robust model with improved output.
CONCLUSION
The scores obtained in this study confirmed the modeling and prediction of TDS, pH, and total iron. Regardless of the variation in the water quality of the river, the model performs correctly. The overall performance of the developed models was evaluated using statistical criteria, e.g. modeling accuracy (R2) and error evaluation (RMSE). Input optimization reduced modeling complexity, which is useful for reducing data collection and processing overhead. The accuracy for TDS, pH, and total iron was 97.74, 98.66, and 98.65, respectively. The advantage of the RF model proposed in this paper is the accurate assessment of surface water pollutant levels; furthermore, it avoids the lengthy calculations associated with the traditional water quality index (WQI).
FUNDING
This research received no external funding.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.