ABSTRACT
Drinking water purity analysis is an essential framework that requires several real-world parameters to ensure water quality. So far, sensor-based analysis of water quality in specific environments has been performed with respect to certain parameters including the pH level, hardness, TDS, etc. The outcome of such methods indicates whether the environment provides potable water or not. Potable denotes purified water that is free from all contaminants. This analysis gives only an absolute (binary) answer, whereas the growing demand for drinking water calls for multiple-level estimation so that the available water resources can be used efficiently. In this article, we used a benchmark water quality assessment dataset for analysis. To perform a level assessment, we computed three major features, namely correlation-entropy, dynamic scaling, and estimation levels, and annexed them to the earlier feature vector. The assessment of the available data was performed using a statistical machine learning model that ensembles the random forest model and the light gradient boosting model (LightGBM). The probability estimation of the ensemble model was performed using the Kullback–Leibler divergence model. The proposed probabilistic model achieved an accuracy of 96.8%, a sensitivity of 94.55%, and a specificity of 98.29%.
HIGHLIGHT
Developing an IoT framework for water quality management data extraction involves deploying a network of sensors capable of measuring key parameters such as pH, dissolved oxygen, turbidity, temperature, conductivity, and contaminants across water bodies. The proposed work uses a probabilistic machine learning model to estimate the multiple levels of water quality assessments.
INTRODUCTION
Water as an essential need for every human being around the globe demands a systematic framework for analyzing its potable nature and also the parameters to ensure its integrity. Numerous parameters are used to represent the integrity of water including pH levels, saltation levels, mineral levels, TDS, etc. (Qi et al. 2022). These are biological parameters that represent the quality of water management. The water quality index (WQI) is constructed by integrating various water quality parameters into a single numerical value, providing a comprehensive assessment of overall water quality. Initially, a selection of relevant parameters such as pH, dissolved oxygen, turbidity, temperature, conductivity, and levels of contaminants are identified based on their significance in determining water quality (Gaur et al. 2022). These parameters are assigned weights or importance factors reflecting their relative significance in influencing water quality. Subsequently, individual parameter values are normalized and transformed into dimensionless scores using appropriate mathematical functions or scales to ensure comparability across different parameters (Uddin et al. 2023). These normalized scores are then aggregated using weighted averaging or another aggregation method to compute a single composite score, representing the overall water quality status. The final step involves categorizing the composite score into predefined quality classes or categories (e.g., excellent, good, fair, poor) to facilitate interpretation and decision-making. The WQI framework provides a simplified yet informative way to assess and communicate water quality status to stakeholders, enabling informed management decisions and interventions aimed at preserving and improving water resources (Uddin et al. 2021). 
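As a concrete illustration of the aggregation pipeline just described, the following sketch computes a composite score from normalized, weighted parameters and maps it to a quality class. The parameter weights, normalization ranges, and class cut-offs are illustrative assumptions, not values from any WQI standard:

```python
def normalize(value, lo, hi):
    """Map a raw measurement onto a dimensionless 0-100 score.
    Passing (hi, lo) reversed inverts the scale (e.g., for turbidity,
    where lower raw values mean better quality)."""
    score = 100.0 * (value - lo) / (hi - lo)
    return max(0.0, min(100.0, score))

def water_quality_index(measurements, weights, ranges):
    """Weighted average of normalized parameter scores (composite WQI)."""
    total = sum(weights.values())
    return sum(
        weights[p] * normalize(v, *ranges[p]) for p, v in measurements.items()
    ) / total

def categorize(wqi):
    """Map the composite score onto predefined quality classes."""
    if wqi >= 90:
        return "excellent"
    if wqi >= 70:
        return "good"
    if wqi >= 50:
        return "fair"
    return "poor"

# Hypothetical readings, weights, and normalization ranges for illustration.
measurements = {"pH": 7.2, "dissolved_oxygen": 8.5, "turbidity": 2.0}
weights = {"pH": 0.4, "dissolved_oxygen": 0.4, "turbidity": 0.2}
ranges = {"pH": (0, 14), "dissolved_oxygen": (0, 10), "turbidity": (10, 0)}
```

The inverted `(10, 0)` range for turbidity encodes its significance as a negative indicator, mirroring how WQI models assign direction as well as weight to each parameter.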
With the biological parameters, only the assessment regarding a specific environment can be performed whereas the computational exploration of those parameters could provide a global framework for analyzing the water quality (Liu et al. 2022).
The computational exploration can be carried out using different strategies: data extraction can be performed with IoT sensors that transfer the obtained data, and parameter analysis can then be performed based on the nature of the parameters (Cao et al. 2022). Water quality assessment through IoT devices involves the strategic deployment of sensors capable of measuring various parameters such as pH, dissolved oxygen, turbidity, temperature, conductivity, and contaminant levels within water bodies. Initially, appropriate sensors are selected based on the specific parameters requiring monitoring. These sensors are then integrated into IoT devices, typically microcontrollers or single-board computers like Raspberry Pi, equipped with communication modules such as Wi-Fi, Bluetooth, or cellular connectivity (Kruse 2018). The sensors are strategically deployed at desired locations within the water bodies, ensuring proper calibration and positioning for accurate measurements. The IoT devices collect data from the sensors at regular intervals, with the frequency of data collection tailored to the monitoring requirements. Subsequently, the collected data are transmitted wirelessly to a central server or cloud platform using the communication capabilities of the IoT devices (Jayaraman et al. 2024). Upon transmission, the data are securely stored in a database or cloud storage, where they undergo analysis to identify patterns, trends, and anomalies. Visualization tools and dashboards are developed to provide real-time insights into water quality conditions, enabling stakeholders to make informed decisions. Regular maintenance and calibration of sensors are conducted to ensure measurement accuracy, with malfunctioning components promptly replaced or repaired. Integration with decision support systems further enhances the utility of the data by linking it with predictive models or resource management systems, facilitating timely interventions to address any detected issues.
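The collection-and-transmission loop described above can be simulated in a few lines. The sketch below stands in for real sensor drivers with random readings and serializes each sample for upload; the field names and value ranges are hypothetical placeholders, not an actual device API:

```python
import json
import random
import time

def read_sensors():
    """Stand-in for real sensor drivers; returns one timestamped sample."""
    return {
        "timestamp": time.time(),
        "ph": round(random.uniform(6.5, 8.5), 2),
        "turbidity_ntu": round(random.uniform(0.5, 5.0), 2),
        "conductivity_us_cm": round(random.uniform(200, 800), 1),
    }

def collect(n_samples, interval_s=0.0):
    """Poll the sensors at a fixed interval and serialize each sample
    for transmission (e.g., an MQTT/HTTP upload to a cloud platform)."""
    payloads = []
    for _ in range(n_samples):
        payloads.append(json.dumps(read_sensors()))
        time.sleep(interval_s)
    return payloads

batch = collect(3)
```

On a real deployment the serialized payloads would be handed to the device's communication module; the sampling interval would be set to match the monitoring requirements rather than the zero delay used here.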
Overall, water quality assessment via IoT devices enables continuous monitoring, proactive management, and effective preservation of water resources. The iterative modeling of water quality assessment can be carried out by inducing machine learning models to analyze the water quality-related data (Bedi et al. 2020). Water quality assessment employing machine learning models entails the utilization of historical water quality data to develop predictive algorithms capable of analyzing and forecasting water quality parameters. Initially, a comprehensive dataset containing information on various water quality parameters, alongside relevant environmental factors such as weather patterns, land use, and hydrological characteristics, is compiled (Thorslund & van Vliet 2020). This dataset serves as the foundation for training machine learning models, including regression, classification, and clustering algorithms, among others. The models are trained to recognize patterns, correlations, and anomalies within the data, enabling them to predict water quality parameters based on input variables. Once trained, these models can analyze real-time data collected from sensors deployed in water bodies, providing continuous assessment and early detection of water quality fluctuations. Additionally, machine learning models can be utilized to optimize sampling strategies, identify sources of contamination, and prioritize management interventions, thereby enhancing the efficiency and efficacy of water quality monitoring and management efforts. Regular updates and refinements to the models ensure their adaptability to changing environmental conditions and evolving water quality challenges, ultimately contributing to the sustainable management and preservation of water resources. Utilizing machine learning models for water quality assessment involves extracting pertinent features from sensor-gathered raw data in water bodies. These features are then utilized to train predictive algorithms. 
Initially, a comprehensive dataset comprising raw measurements of water quality parameters (e.g., pH, dissolved oxygen, turbidity, temperature, and conductivity) is compiled. Domain knowledge and statistical techniques are subsequently employed to compute additional features, including statistical moments, frequency-domain attributes, time series characteristics, and spatial patterns from the raw data. These computed features act as inputs for machine learning algorithms, facilitating the learning of intricate relationships between input features and target water quality parameters. Techniques for feature selection may also be applied to pinpoint the most informative features, thus enhancing the predictive performance of the models while reducing dimensionality and computational complexity. Post-training, these machine learning models can analyze real-time sensor data to predict water quality parameters, furnishing valuable insights into the current state of water quality and enabling timely management interventions. Continuous refinement and adaptation of feature computation techniques and machine learning models are crucial for effectively addressing evolving water quality challenges and enhancing water quality assessment systems' overall accuracy and reliability. There are three major objectives proposed in this article for estimating the water quality using IoT, and a machine learning model as follows:
IoT sensor-based data collection model from the real-world environment
A triple-stage feature computation model to increase the size of the feature vector
Probabilistic machine learning model to estimate the multiple levels of water quality assessments.
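The feature computation described above (statistical moments and simple time-series characteristics derived from raw readings) can be sketched using only standard-library statistics on a window of readings:

```python
import statistics

def compute_features(readings):
    """Derive statistical-moment and simple time-series features
    from a window of raw sensor readings (e.g., pH over time)."""
    mean = statistics.mean(readings)
    std = statistics.pstdev(readings)
    # Third standardized moment (skewness) as a distribution-shape feature.
    skew = (sum((x - mean) ** 3 for x in readings) / len(readings)) / std ** 3
    # Lag-1 autocorrelation captures short-term temporal dependence.
    num = sum((readings[i] - mean) * (readings[i + 1] - mean)
              for i in range(len(readings) - 1))
    autocorr = num / sum((x - mean) ** 2 for x in readings)
    return {"mean": mean, "std": std, "skew": skew, "lag1_autocorr": autocorr}

feats = compute_features([7.0, 7.1, 7.2, 7.1, 7.0])
```

Each derived value becomes one entry of the extended feature vector fed to the learning algorithm; frequency-domain and spatial features would be appended in the same way.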
Section 2 presents the literature review; Section 3 describes the materials and methods adopted for the implementation; Section 4 illustrates the experimental results obtained through the proposed model and discusses them; the conclusion and references follow.
LITERATURE SURVEY
The literature review on water quality assessment begins with analyzing the physical changes made during water quality data extraction, where the parameters are extracted using traditional methods and purity analysis using pH calculation is performed first (Wu et al. 2021). Because of its broad structure, the assessment highlights the consolidation of extensive water quality data into a single value or index as a valuable method. This process involves four sequential stages within WQI models: selection of water parameters, generation of sub-indices, assignment of weights to these parameters, and aggregation into the overall WQI. Consequently, a vast amount of water quality data is condensed into a solitary index, allowing for comparison among different traditional WQI indices concerning parameter selection, sub-index creation, weight assignment, aggregation methods, and rating scales (Nayak et al. 2020). After evaluating the WQI, it was determined that traditional calculations consumed excessive time and introduced a small number of errors during the sub-index calculations. Various statistical and visual evaluation indicators were employed to assess the models. The data were divided into training and testing sets, and the machine learning approach, which utilized hybrid algorithms, provided estimations of WQI values (Bui et al. 2020). The research conducted for WQI analysis took place in Lake Poyang, China, involving the classification of 24 water quality samples into three distinct groups. The analysis focused on 20 diverse water quality parameters, particularly emphasizing Total Nitrogen (TN) and Total Phosphorus (TP), while assigning lower ratings to hazardous metals and other criteria in the WQI analysis (Wu et al. 2017). The study employed a comprehensive range of fractional deviation techniques encompassing difference, ratio, and normalized difference indices.
The resultant WQI values exhibit a range from 56.61 to 2,886.51. The index above was derived through the assessment of curve slope and root mean square error values (Wang et al. 2017). A variety of feature extraction methods are employed in water quality assessment to distill relevant information from raw data collected by sensors and other monitoring devices (Sagan et al. 2020). One prevalent approach involves statistical methods such as mean, median, standard deviation, and skewness, which provide insights into the central tendency, variability, and distribution of water quality parameters (Kumar & Padhy 2014). Time-domain analysis techniques, including autocorrelation and spectral analysis, capture temporal patterns, and periodicities in water quality data, aiding in the identification of seasonal variations and long-term trends (Akbarighatar et al. 2023). Frequency-domain methods, such as Fourier analysis and wavelet transforms, enable the decomposition of signals into frequency components, facilitating the detection of cyclic patterns and oscillations (Condon et al. 2021). Spatial analysis techniques, such as kriging and spatial interpolation, are utilized to analyze spatial variability and spatial autocorrelation in water quality parameters across different locations within a water body (Monica & Choi 2016). Additionally, machine learning algorithms such as principal component analysis (PCA), independent component analysis (ICA), and feature selection algorithms like genetic algorithms and recursive feature elimination are employed to identify and prioritize the most informative features for water quality assessment (Hosseini Baghanam et al. 2022). These feature extraction methods collectively contribute to a comprehensive understanding of water quality dynamics, aiding in the detection of anomalies, prediction of future trends, and formulation of effective management strategies for preserving and improving water resources (Dilmi & Ladjal 2021). 
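Among the methods listed, PCA is straightforward to sketch from first principles: standardize the measurements, diagonalize their covariance matrix, and project onto the leading eigenvectors. The data here are random placeholders for real water quality measurements:

```python
import numpy as np

def pca_features(X, n_components=2):
    """Project standardized measurements onto the top principal
    components (the directions of maximum variance)."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each parameter
    cov = np.cov(Xc, rowvar=False)              # parameter covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # 50 samples, 5 hypothetical parameters
Z = pca_features(X, n_components=2)
```

The resulting low-dimensional scores can replace the raw parameters as model inputs, reducing dimensionality while retaining most of the variance, which is exactly the role PCA plays in the feature-selection pipelines cited above.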
A diverse array of machine learning classification methods is employed in water quality assessment to effectively analyze and categorize water quality data. Supervised learning algorithms such as decision trees, random forests, and support vector machines (SVMs) are widely utilized to classify water samples into different quality categories based on their feature vectors (Najwa Mohd Rizal et al. 2022). These algorithms leverage labeled training data to learn decision boundaries and classify unseen samples with high accuracy. Additionally, ensemble methods like AdaBoost and gradient boosting combine multiple weak classifiers to enhance classification performance (Khan et al. 2022). Deep learning techniques, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are increasingly employed to extract intricate patterns and temporal dependencies from water quality data, particularly in spatial and temporal modeling tasks (Baek et al. 2020). Unsupervised learning methods such as clustering algorithms, including k-means and hierarchical clustering, are utilized for exploratory analysis and pattern recognition in unlabeled data, enabling the identification of natural groupings and anomalies in water quality datasets (Marin Celestino et al. 2018). Furthermore, hybrid approaches that integrate multiple machine learning techniques, such as feature selection, dimensionality reduction, and ensemble learning, are employed to improve classification accuracy and robustness in complex water quality assessment tasks (Aslam et al. 2022). Overall, machine learning classification methods play a pivotal role in effectively analyzing and interpreting water quality data, facilitating informed decision-making and management of water resources.
MATERIALS AND METHODS
Dataset description
We used the global water quality dataset retrieved from the benchmark Kaggle repository for implementing the proposed methods. The dataset consists of 3,277 instances with 10 major features, namely pH level of the water, hardness, solids, amount of chloramines, amount of sulfate, conductivity, organic carbon, trihalomethanes, turbidity, and potable nature of the water. The only class description in the obtained dataset is the potable nature whereas all the other features remain as problem-related features.
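A minimal sketch of how this dataset is typically split into features and target follows; the two rows below only mimic the schema (column names as they appear on Kaggle) and are not actual dataset values:

```python
import pandas as pd

# Two illustrative rows mimicking the Kaggle water potability schema
# (values are made up, not taken from the dataset).
df = pd.DataFrame({
    "ph": [7.0, 5.6], "Hardness": [204.9, 129.4], "Solids": [20791.3, 18630.1],
    "Chloramines": [7.3, 6.6], "Sulfate": [368.5, 315.1],
    "Conductivity": [564.3, 592.9], "Organic_carbon": [10.4, 15.2],
    "Trihalomethanes": [86.9, 56.3], "Turbidity": [2.96, 4.5],
    "Potability": [0, 1],
})

# 'Potability' is the only class label; the remaining nine columns are
# the problem-related features described above.
X = df.drop(columns="Potability")
y = df["Potability"]
```

In the real dataset the same split yields 3,277 instances over nine problem-related features plus the potability label.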
IoT framework to extract the water quality management data
Developing an IoT framework for water quality management data extraction involves deploying a network of sensors capable of measuring key parameters such as pH, dissolved oxygen, turbidity, temperature, conductivity, and contaminants across water bodies. These sensors are connected to IoT devices equipped with communication modules like Wi-Fi, cellular, or LoRa, facilitating data collection and transmission to a central server or cloud platform (Liu et al. 2023; Zhou et al. 2024). Firmware/software in IoT devices collects data at regular intervals, ensuring real-time monitoring. Data are stored securely in databases or data lakes, enabling efficient management and analysis.
Triple-stage feature extraction model
Using the categorical encoding, the estimation levels have been framed into three categories, namely Excellent, Good, and Poor. Categorical encoding offers several advantages as a feature extraction model. Firstly, it allows for the representation of categorical data in a format that can be understood and utilized by machine learning algorithms, enabling the inclusion of categorical variables in predictive models. Secondly, categorical encoding techniques such as one-hot encoding and ordinal encoding preserve the inherent information and relationships within categorical variables, ensuring that important distinctions between categories are retained during feature extraction. The proposed triple-stage feature extraction algorithm is given below.
Algorithm 1: Triple-Stage Feature Extraction
Input: Water quality sample feature collections
Output: An extended feature set with problem-related features
Begin
Step 1: For every attribute in feature matrix f
Step 2: Perform combinational analysis f(x)
Step 3: Calculate the dissimilarity using Equation (1)
Step 4: For the individual parameters i and j
Step 5: Calculate the Euclidean distance
Step 6: Update the feature vector
Step 7: Calculate the correlation between all i and j
Step 8: If i and j are non-parametric
Step 9: Calculate and add to the feature vector
Step 10: Perform categorical encoding
Step 11: End if
Step 12: End for
Step 13: Repeat steps 2 to 10 to increase the size of the feature vector
Step 14: Return the feature vector
End
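Since Equation (1) and the exact symbols of Algorithm 1 are not reproduced here, the following is only one plausible reading of the triple-stage idea: a pairwise Euclidean dissimilarity, a non-parametric (rank) correlation, and a categorical level encoding, each appended to the feature vector. The correlation cut-offs for the Excellent/Good/Poor levels are assumptions:

```python
import numpy as np

def spearman(a, b):
    """Non-parametric (rank) correlation between two feature columns."""
    ra = np.argsort(np.argsort(a)).astype(float)   # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)   # ranks of b
    return np.corrcoef(ra, rb)[0, 1]

def extend_features(X):
    """Sketch of the triple-stage extension: for each pair of columns,
    derive (1) a Euclidean dissimilarity, (2) a rank correlation, and
    (3) a coarse categorical level, and append them to the feature set."""
    n, m = X.shape
    extra = []
    for i in range(m):
        for j in range(i + 1, m):
            dist = np.linalg.norm(X[:, i] - X[:, j])   # stage 1: dissimilarity
            rho = spearman(X[:, i], X[:, j])           # stage 2: correlation
            # stage 3: encode correlation strength as Excellent/Good/Poor.
            level = 2 if rho > 0.5 else (1 if rho > 0.0 else 0)
            extra.append((dist, rho, level))
    return extra

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))   # placeholder for the water quality matrix
derived = extend_features(X)
```

With four columns this yields six derived triples; repeating the pass over the enlarged matrix, as Step 13 suggests, grows the feature vector further.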
Additionally, categorical encoding facilitates the incorporation of domain knowledge and prior information about the categorical variables into the feature space, enhancing the interpretability and explainability of the resulting models. Furthermore, categorical encoding can effectively handle nominal and ordinal categorical variables, accommodating a wide range of categorical data types commonly encountered in real-world datasets. The proposed feature handling mechanism not only initiates the computation of problem-related features but also filters the data with respect to missing values during the preprocessing stage. In the preprocessing stage, the proposed mechanism removes missing values by applying a threshold of 5% per variable in the set. Once the categorical features are computed, the feature vector contains 13 features across 3,277 instances.
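The preprocessing step can be sketched as follows. The per-variable missing-value cutoff follows the text; mean imputation and one-hot encoding are assumed choices, and the toy frame uses a relaxed cutoff only so that a partially missing column survives to be imputed:

```python
import pandas as pd

def preprocess(df, missing_threshold=0.05):
    """Drop variables whose missing-value fraction exceeds the threshold,
    mean-impute the remaining numeric columns, then one-hot encode any
    categorical columns."""
    keep = [c for c in df.columns
            if df[c].isna().mean() <= missing_threshold]
    out = df[keep].copy()
    num = out.select_dtypes(include="number").columns
    out[num] = out[num].fillna(out[num].mean())
    return pd.get_dummies(out)   # one-hot encode remaining categoricals

df = pd.DataFrame({
    "ph": [7.0, None, 6.5, 7.2],                 # 25% missing: imputed
    "mostly_missing": [None, None, None, 1.0],   # 75% missing: dropped
    "level": ["Good", "Poor", "Excellent", "Good"],
})
clean = preprocess(df, missing_threshold=0.3)
```

With a realistic dataset the default 5% cutoff would apply; the encoded Excellent/Good/Poor levels become ordinary numeric columns usable by any downstream model.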
Probabilistic gradient boost model for multiclass classification
The objective function of a GBM combines the training loss with a regularization term to prevent overfitting; in standard notation, Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), where l measures the discrepancy between the true and predicted targets and Ω penalizes the complexity of each constituent tree f_k. GBMs have revolutionized machine learning by offering a potent ensemble learning technique ideal for diverse predictive modeling tasks. GBMs construct highly accurate predictive models by sequentially combining numerous weak learners, predominantly decision trees, to mitigate individual model limitations and enhance predictive performance. Their prowess extends to handling intricate, nonlinear data relationships, rendering them invaluable for regression, classification, and ranking tasks. GBMs demonstrate remarkable flexibility, adeptly adapting to diverse data types and accommodating various feature representations, thus finding extensive application across real-world scenarios. Additionally, GBMs ensure robustness against overfitting through the integration of regularization techniques and adaptive learning rates, facilitating effective generalization to unseen data. Notably, GBMs offer interpretability, furnishing insights into feature importance and model behavior, thereby aiding comprehension of underlying data patterns. With the evolution of implementation frameworks like XGBoost, LightGBM, and CatBoost, GBMs have become indispensable tools for data scientists, propelling innovations in predictive modeling, recommendation systems, anomaly detection, and beyond. Ultimately, GBMs' capacity to deliver superior predictive accuracy, flexibility, robustness, interpretability, and scalability solidifies their status as fundamental assets in the machine learning arsenal, proficiently addressing an extensive array of challenges. The trained GBM model was executed as a single-phase decision tree model and the initial validation was performed. Once the results of the single DT were analyzed, an ensemble of DTs, i.e. the random forest model, was chosen for executing the GBM algorithm for all the sets available in the feature vector.
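To make the boosting mechanics concrete, here is a from-scratch sketch on decision stumps: each round fits a stump to the residuals of the running prediction and adds a damped correction. This illustrates the principle only and is not the paper's LightGBM/XGBoost configuration:

```python
import numpy as np

LEARNING_RATE = 0.3   # damping factor applied to each stump's correction

def fit_stump(X, r):
    """Best single-feature threshold split minimizing squared error
    against the residuals r; returns (feature, threshold, left, right)."""
    best = (np.inf, 0, 0.0, float(r.mean()), float(r.mean()))
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            left, right = r[X[:, j] <= t], r[X[:, j] > t]
            err = (((left - left.mean()) ** 2).sum()
                   + ((right - right.mean()) ** 2).sum())
            if err < best[0]:
                best = (err, j, t, float(left.mean()), float(right.mean()))
    return best[1:]

def gbm_fit(X, y, n_rounds=20):
    """Sequentially fit stumps to the residuals of the running prediction."""
    base = float(y.mean())
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        j, t, lv, rv = fit_stump(X, y - pred)      # fit the current residuals
        stumps.append((j, t, lv, rv))
        pred = pred + LEARNING_RATE * np.where(X[:, j] <= t, lv, rv)
    return base, stumps

def gbm_predict(model, X):
    base, stumps = model
    pred = np.full(len(X), base)
    for j, t, lv, rv in stumps:
        pred = pred + LEARNING_RATE * np.where(X[:, j] <= t, lv, rv)
    return pred

# A step function is learned to near-zero error within 20 rounds.
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = (X[:, 0] > 0.5).astype(float)
pred = gbm_predict(gbm_fit(X, y), X)
```

Production GBM libraries add the regularization term Ω from the objective above, shrinkage schedules, and histogram-based split finding, but the residual-fitting loop is the same.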
Once the random outcomes are obtained, the majority voting is performed by the probabilistic Kullback–Leibler divergence model. The significance of Kullback–Leibler (KL) divergence in multiclass classification lies in its ability to quantify the difference between probability distributions, thereby aiding in model assessment, feature selection, and optimization. In multiclass classification, where the goal is to predict the class label or probability distribution for each sample, KL divergence serves as a crucial measure for evaluating the dissimilarity between the predicted and true class distributions. By comparing the predicted probabilities with the actual class distribution, KL divergence provides insights into the model's performance, highlighting areas where the predictions deviate from the ground truth. Moreover, KL divergence can be leveraged for feature selection, helping identify informative features that contribute significantly to the predictive performance of the model. Additionally, KL divergence plays a vital role in optimization tasks, guiding the fine-tuning of model parameters to minimize the discrepancy between predicted and actual class distributions, thereby enhancing classification accuracy. Overall, KL divergence serves as a valuable tool in multiclass classification, facilitating model evaluation, feature selection, and optimization, ultimately leading to improved classification performance and robustness in real-world applications. The proposed PGBM model begins with training using the traditional GBM model, and the objective assignment is done with multiple classes as the problem deals with multiclass classification. As a probabilistic model, the outcomes of the GBM were analyzed for divergence, and the predictive analysis was made, which is unique when compared to the existing models.
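The divergence-based voting idea can be illustrated directly from the KL definition: compare each ensemble member's predicted class distribution against a reference distribution and prefer the least divergent one. The three-class distributions below are hypothetical:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i): how much the predicted
    distribution q diverges from the reference distribution p.
    A small eps guards against zero probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical class distributions over three quality levels
# (Excellent, Good, Poor) from two ensemble members.
true_dist = [0.2, 0.5, 0.3]
model_a = [0.25, 0.45, 0.30]
model_b = [0.60, 0.30, 0.10]

# Prefer the member whose predictions diverge least from the reference.
best = min([model_a, model_b], key=lambda q: kl_divergence(true_dist, q))
```

KL divergence is asymmetric and zero only when the two distributions coincide, which is what makes it usable as a ranking criterion among candidate predictions.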
RESULTS AND DISCUSSION
The overall performance of the water quality assessment was improved by constructing the proposed probabilistic gradient boost model (PGBM). This model ensembles the LGBM classifier for model training, whereas in the testing and validation phase XGBoost along with the probabilistic KLD method was adopted. The predictions obtained through the proposed PGBM model recorded the highest accuracy and precision of 98%, whereas the recall and F1 score of the proposed model reached 99%, the highest performance metrics achieved by any of the existing models used for multiclass classification of water quality level assessment so far. The confusion matrix obtained by the proposed PGBM model is shown in Figure 4; it accommodated two more testing samples in the low-level class when compared to the traditional LGBM classifier.
From Figure 4, it is evident that the class discrimination during the testing and validation phase of the proposed PGBM model differs from that of the confusion matrix of the traditional LGBM prediction model. The proposed PGBM model was also compared with the existing models used in recent water quality assessment research [8,9] in terms of accuracy, specificity, and sensitivity. The comparison results are given in Table 1, which summarizes the models used in water quality assessment so far.
Methods | Accuracy (%) | Sensitivity (%) | Specificity (%)
---|---|---|---
XGBoost | 73.5 | 66.0 | 76.9
AdaBoost | 74.4 | 64.0 | 76.9
SVM | 75.5 | 60.9 | 79.0
K-NN | 75.9 | 45.7 | 83.1
DT | 78.3 | 80.3 | 79.5
LGBM | 90.21 | 89.03 | 91.54
PGBM (proposed) | 98 | 98 | 96
This model achieved a remarkable accuracy of 98% due to the probabilistic voting mechanism based on the Kullback–Leibler divergence method, where the change in distribution between the two predictive outcomes led to an accurate classification. The model also achieved 98% sensitivity and 96% specificity. The major factor in the performance improvement is the feature computation and the correlation between the available features.
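For reference, the three reported metrics derive from binary confusion-matrix counts as follows (the counts in the example are illustrative, not the paper's Figure 4 values):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall of positives), and specificity
    (recall of negatives) from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,        # all correct / all samples
        "sensitivity": tp / (tp + fn),        # true-positive rate
        "specificity": tn / (tn + fp),        # true-negative rate
    }

# Illustrative counts only.
m = metrics(tp=180, fp=8, tn=300, fn=12)
```

For the multiclass setting of the paper, each class's sensitivity and specificity are computed one-vs-rest from its row and column of the confusion matrix and then averaged.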
CONCLUSION
Estimating water quality levels by inducing three major features and initializing three major levels was the major focus of this article. We proposed an IoT framework to perform efficient data collection. The water quality assessments done so far analyzed the environmental aspects of the features, whereas the proposed models calculated the statistical relationships between the features, and a new set of features was computed using the proposed triple-stage feature computation mechanism. A multiclass classification model was proposed to classify the available samples into the classes extracted in the feature computation stage. We proposed a PGBM which outperformed all the other existing models in terms of accuracy, specificity, and sensitivity. Time-series-based feature extraction and the implementation of deep learning models for water quality level assessment on global data will be our future research directions.
ACKNOWLEDGEMENTS
This work was supported by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R432), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. This work was supported by the Deanship of Scientific Research, Vice President for Graduate Studies and Scientific Research, King Faisal University, Saudi Arabia [Grant A377].
FUNDING
Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R432), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.