ABSTRACT
Water pollution remains a longstanding challenge globally, prompting substantial investment in water quality protection. The integration of advanced machine learning models offers promising avenues for accurate water quality prediction, enabling proactive measures to safeguard water sources. Presently, water quality assessment relies predominantly on physical and chemical metrics. This study developed MultiLayer Perceptron (MLP), eXtreme Gradient Boosting (XGBoost), long short-term memory (LSTM), and a hybrid CNN–LSTM model to forecast pH and dissolved oxygen (DO) levels. Results demonstrated the hybrid model's superior performance, with mean squared errors (MSEs) of 0.0015 and 0.0361 for pH and DO prediction, respectively. For Water Quality Classification (WQC), Random Forest (RF), k-Nearest Neighbors (kNNs), Support Vector Machine (SVM), and Light gradient Boosting Machine (Light GBM) were employed, with SVM achieving the highest accuracy at 88.75%. The research underscores the effectiveness of the CNN–LSTM model in predicting pH and DO levels. Leveraging these predictions as inputs to the SVM model offers valuable insights, particularly in regions where conventional monitoring methods face limitations. This streamlined approach, requiring only two parameters, signifies a significant advancement in accurate water quality prediction.
HIGHLIGHTS
The study showcases the application of advanced machine learning techniques such as MLP, XGBoost, LSTM, and a hybrid CNN–LSTM model for accurate forecasting of pH and dissolved oxygen levels.
The hybrid model demonstrated superior performance as evidenced by its lower mean squared error (MSE) values.
Using the machine learning classifiers SVM, RF, KNN, and LightGBM, the study achieves high accuracy in predicting WQC.
INTRODUCTION
The quality of water, as a crucial resource for the survival of living organisms on Earth, is an issue of great concern. Determining the quality of water is crucial for various purposes, including irrigation, drinking, and industrial uses. Each of these applications requires specific water quality criteria to ensure safety, efficiency, and sustainability. The source of water can vary depending on the context and purpose for which it is being used. The main sources of water are surface water, groundwater or treated wastewater. Assessing water quality ensures that it meets the necessary standards for its intended use, safeguarding health, optimizing performance, and protecting the environment. The impact of water pollution on ecosystems and organisms cannot be ignored, and if pollutants enter the human body through the food chain and accumulate, they may lead to serious consequences (Schwarzenbach et al. 2010). Some factory wastes, as well as pesticides and fertilizers used for crops, can cause irreversible chemical pollution of water sources when they flow into water bodies. In order to reduce pollution and protect water sources, discharges can be pre-treated and harmful substances can be broken down through physical deposition and chemical and biological means (Posthuma et al. 2001). However, it is difficult to ensure that every source of pollution flowing into a water body can be effectively regulated, and highly toxic compounds such as sulfonic acid are difficult to treat. Therefore, it is more effective to detect water conditions directly and take precautionary measures in response to changes in water quality indicators. The aim of this study is to develop an integrated water quality prediction system based on multiple machine learning models, which can predict future water quality from collected historical data of water quality parameters, and thus provide decision support to regulators for water source protection.
Currently, machine learning has made some progress in the field of water quality prediction (Najah et al. 2021). Two prevalent indicators in research for water quality are Water Quality Index (WQI) and Water Quality Classification (WQC). Nasir et al. (2022) used machine learning models including Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), CATegorical Boosting (CATBoost), eXtreme Gradient Boosting (XGBoost), and MultiLayer Perceptron (MLP) to classify the water quality based on the WQI. CATBoost achieved the highest accuracy (94.51%). Ahmed et al. (2019) developed a WQC prediction system for temperature, turbidity, pH, and total dissolved solids (TDSs) using various machine learning models. Compared with SVM and RF, the MLP has better performance.
Hmoud Al-Adhaileh & Waselallah Alsaade (2021) developed an adaptive neuro-fuzzy inference system (ANFIS) WQI prediction model based on fuzzy logic and an artificial neural network (ANN); the regression coefficient of the ANFIS model reached 96.17%. By introducing fuzzy logic on top of the neural network, the ANFIS model deals better with nonlinear problems, which makes it suitable for complex water quality prediction problems. Eze et al. (2021) and Eze & Ajmal (2020) used Ensemble Empirical Mode Decomposition (EEMD) to decompose the water quality data and thereby enhance the adaptability of the LSTM model to nonlinear and non-stationary data. The Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) can decrease the volatility of the data by decomposing it; on this basis, CEEMDAN–XGBoost and CEEMDAN–RF show higher stability in short-term prediction of water quality parameters (Lu & Ma 2020).
Since water quality changes are usually time-dependent, scholars have introduced machine learning models with temporal properties for water quality prediction. Liu et al. (2020) constructed a Bi-directional Stacked Simple Recurrent Unit network (Bi-S-SRU) to include future context information, and the accuracy of the model in short-term prediction reached 94.42%. Fu et al. (2021) proposed a temporal convolutional network (TCN), which extracts time series features by dilated causal convolution, and demonstrated its advantages in long-term forecasting of water quality parameters, with a prediction accuracy of up to 91.91%. LSTM also shows excellent stability in long-term prediction. Similarly, Rasheed Abdul Haq & Harigovindan (2022) constructed a water quality prediction system for aquaculture using the LSTM model, and its prediction accuracy was slightly better than that of the autoregressive integrated moving average model (ARIMA). Barzegar Aalami & Adamowski (2020) reported that the CNN model has high accuracy in predicting dissolved oxygen (DO) in the low concentration range, while the LSTM model performs better in the high concentration range, and that the hybrid CNN–LSTM model combines the advantages of both models.
As a generalized water quality indicator, the calculation of the WQI has a minimum number requirement of water quality parameters, which can be expensive to collect, especially in some areas. Considering this situation, some prediction models for a single water quality parameter were developed. Biochemical oxygen demand (BOD) has been shown to be a measure of organic pollution in water sources. Kannel et al. (2007) developed low-cost water quality indicators WQIm and WQIDO. WQIm was calculated using five parameters: temperature, pH, DO, conductivity, and BOD, while WQIDO considered only a single parameter DO. It was verified that 90 and 93% of the samples were correctly classified using WQIm and WQIDO, respectively. Ho et al. (2019) built the DT model for WQI evaluation of the Klang River. The effects of different combinations of input parameters on the model performance were tried, and the results showed that NH3-N, pH, and suspended solids (SSs) had the least effect on the prediction of the WQI. The accuracy of the prediction model using only three water quality features, BOD, chemical oxygen demand (COD) and DO, as inputs was as high as 75%, while the accuracy of the model using all the inputs was 84.09%, which was within an acceptable range. This research demonstrates the feasibility of low-cost water quality prediction models, where fewer parameter inputs can speed up model computation while ensuring the accuracy of the prediction results. Othman et al. (2020) constructed a WQI prediction system based on ANN about BOD, DO, SS, COD, ammonia nitrogen (AN), and pH. After sensitivity analyses, DO was the parameter that had the greatest impact on the WQI predictions.
By comprehensively comparing the water quality assessment indicators discussed in the literature, pH and DO emerge as the most commonly used parameters. Models that use these indicators as input for water quality prediction tend to be more generalizable. To address this, the study innovatively proposed a flexible water quality prediction model, which constructed a mapping between pH, DO and WQI, reduced the input features of the prediction model, and thus reduced the cost of data collection, which is of great significance for areas where water quality data collection is difficult. In addition, the model is scalable and can adjust the parameter input according to the actual data collected. When there are fewer statistical water quality characteristics, the WQI can be obtained through model mapping. When the sample has enough water quality characteristic parameters, the water quality calculation method of different regions can be introduced to obtain the corresponding WQI, and finally input into the classification model to predict WQC. In addition, the model provides an adjustable input feature sequence length, and water quality prediction at different time scales can be achieved by adjusting the sequence length.
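The adjustable sequence length mentioned above can be illustrated with a sliding window over a single parameter series. This is a minimal sketch, not the authors' implementation: the function name `make_windows` and the sample values are hypothetical, and the sketch assumes one-step-ahead prediction.

```python
def make_windows(series, seq_len):
    """Split a time-ordered series into (input window, next value) pairs.

    seq_len is the adjustable sequence length: each input holds seq_len
    consecutive observations and the target is the observation that follows.
    """
    pairs = []
    for start in range(len(series) - seq_len):
        window = series[start:start + seq_len]
        target = series[start + seq_len]
        pairs.append((window, target))
    return pairs

# Hypothetical pH readings in temporal order.
ph_readings = [7.1, 7.2, 7.0, 7.3, 7.4]
pairs = make_windows(ph_readings, seq_len=3)
```

A longer `seq_len` gives the model more history per prediction at the cost of fewer training pairs from the same series.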
METHODS
In this research, two datasets were used to construct the water quality parameter prediction model and the WQC model, respectively. The dataset Accurate Prediction of Water Quality was provided by Liu et al. (2020). The dataset was collected in Hainan Province, China. The data are arranged in temporal order of collection and can be applied to construct a time sequence of water quality parameters. The other dataset is the publicly available Indian Water Quality Data on the Kaggle website (https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data). The data samples were collected from various water quality monitoring stations in India from 2003 to 2014. The dataset has a total of 1991 samples and contains 8 categories of water quality parameters, among which pH, DO, conductivity, BOD, nitrate, fecal coliform and total coliform can be used to calculate the WQI.
Data pre-processing
Considering that the deletion of vacancies will further compress the sample size of the dataset and have an impact on the model fitting performance, the nearest neighbor interpolation method is used to deal with the vacancies in the research. Nearest neighbor interpolation uses the closest data point to the missing value to fill in the gaps and is suitable for cases where the data changes by a large margin. After checking and correcting the data, the Inter Quartile Range (IQR) of each feature was calculated. The minimum value minus 1.5 times the IQR and the maximum value plus 1.5 times the IQR were used as boundaries to further filter outliers.
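The two pre-processing steps can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: missing values are represented as `None`, ties between equally distant neighbors go to the earlier point, and the outlier fences are the conventional quartile-based ones (Q1 minus 1.5 times the IQR and Q3 plus 1.5 times the IQR). Fences anchored at the minimum and maximum, as the wording above could be read, would exclude no points, so the quartile form is assumed here.

```python
import statistics

def fill_nearest(values):
    """Replace each None with the closest non-missing value by position."""
    known = [i for i, v in enumerate(values) if v is not None]
    filled = []
    for i, v in enumerate(values):
        if v is None:
            # min() keeps the first of equally distant neighbors (the earlier one).
            nearest = min(known, key=lambda k: abs(k - i))
            v = values[nearest]
        filled.append(v)
    return filled

def iqr_filter(values, factor=1.5):
    """Keep values inside [Q1 - factor*IQR, Q3 + factor*IQR] (assumed fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical readings with two gaps and one extreme value.
raw = [1.0, None, None, 4.0]
cleaned = iqr_filter([1, 2, 3, 4, 5, 6, 7, 8, 100])
```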
WQI calculation
In the first equation, N is the number of parameters for the water source. Si is the standard value of each water parameter, as listed in Table 1. In the third equation, Vi is the actual (measured) value of the water quality parameter. VIdeal is the value of the parameter in the ideal situation: the ideal value of pH is 7.0 and that of DO is 14.6 mg/l, while the ideal value of all other parameters is 0.
Parameter | Value |
---|---|
pH | 8.5 |
Dissolved oxygen | 10 mg/l |
Conductivity | 1,000 μS/cm |
Biological oxygen demand | 5 mg/l |
Nitrate | 45 mg/l |
Fecal coliform | 100 cfu/100 ml |
Total coliform | 1,000 cfu/100 ml |
According to the calculated WQI, the quality of the water body can be classified into levels. The specific rating rules are shown in Table 2. According to the classification rules in the table, the samples in the dataset Indian Water Quality Data are labeled with water quality classes, which are used for WQC model training and testing.
WQI range | Level |
---|---|
0–25 | Excellent |
26–50 | Good |
51–75 | Poor |
76–100 | Very poor |
Above 100 | Unsuitable for drinking purpose |
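The WQI equations themselves are not reproduced in this copy, so the following sketch assumes the widely used weighted arithmetic index, which is consistent with the symbols N, Si, Vi and VIdeal defined above: a proportionality constant K = 1 / Σ(1/Si), unit weights wi = K/Si, quality ratings qi = 100 (Vi − VIdeal) / (Si − VIdeal), and WQI = Σ(wi qi) / Σ(wi). The dictionary keys, function names, and the treatment of non-integer WQI values at class boundaries are assumptions for illustration.

```python
# Standard values Si from Table 1 and ideal values from the text.
STANDARDS = {
    "pH": 8.5, "DO": 10.0, "conductivity": 1000.0, "BOD": 5.0,
    "nitrate": 45.0, "fecal_coliform": 100.0, "total_coliform": 1000.0,
}
IDEALS = {"pH": 7.0, "DO": 14.6}  # every other parameter has ideal value 0

def wqi(sample):
    """Weighted arithmetic WQI (assumed form; see lead-in)."""
    k = 1.0 / sum(1.0 / s for s in STANDARDS.values())
    weighted_sum = 0.0
    weight_total = 0.0
    for name, standard in STANDARDS.items():
        ideal = IDEALS.get(name, 0.0)
        rating = 100.0 * (sample[name] - ideal) / (standard - ideal)
        weight = k / standard
        weighted_sum += weight * rating
        weight_total += weight
    return weighted_sum / weight_total

def classify(value):
    """Map a WQI value to the levels of Table 2 (upper bounds assumed inclusive)."""
    if value <= 25:
        return "Excellent"
    if value <= 50:
        return "Good"
    if value <= 75:
        return "Poor"
    if value <= 100:
        return "Very poor"
    return "Unsuitable for drinking purpose"

# A sample at the ideal values scores 0; a sample at the standard values scores 100.
ideal_sample = {name: IDEALS.get(name, 0.0) for name in STANDARDS}
standard_sample = dict(STANDARDS)
```

Because the weights are inversely proportional to the standard values, parameters with strict standards such as BOD and pH dominate the index under this form.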
Min–max scaling
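Min–max scaling maps each feature linearly onto the interval [0, 1] using x' = (x − min) / (max − min). The sketch below illustrates that standard formula; the function name and the handling of a constant feature are assumptions.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant feature maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([6.5, 7.0, 8.5])
```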
Water parameter prediction models
Water quality can be assessed by biological, physical and chemical indicators. After reviewing the literature, pH and DO were chosen as the indicators for water quality assessment in this study. DO and pH are relatively simple to collect, which reduces the cost, and since both parameters are involved in water quality studies in different regions, models using pH and DO as predictive targets will be informative for water quality prediction in other regions. DO refers to the oxygen content of the water, on which the growth of aquatic organisms relies. DO content that is too low can even lead to the death of aquatic organisms, and under ideal conditions DO should not be less than 2 mg/l. As a common indicator of water quality, pH is easy to measure and is a key factor in water quality assessment in areas where the cost of water quality testing is high. For conventional water bodies, both high and low pH are detrimental to the survival of organisms. When the pH is 7, the water is neutral, and near-neutral water is a suitable living condition for most aquatic organisms.
Multilayer perceptron
An MLP is a classic ANN consisting of an input layer, one or more hidden layers, and an output layer, and is commonly used for regression and classification. The input layer accepts the input features and passes them to the hidden layer. Each cell in the hidden layer performs a linear combination of the incoming data and transforms it into a nonlinear output by means of the activation function. Finally, the output of the hidden layer reaches the output layer, which computes the prediction results. The model updates the weights by backpropagation with gradient descent, thus reducing the model error. The parameters of the MLP model built in this experiment are shown in Table 3.
Parameter | Value (pH prediction) | Value (DO prediction) |
---|---|---|
Input size | 4 | 4 |
Number of hidden layers | 2 | 2 |
Hidden size | 128 | 128 |
Activation | ReLU | ReLU |
Optimizer | Adam | Adam |
Learning rate | 0.01 | 0.1 |
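The forward pass described for the MLP (linear combination, ReLU activation, output layer) can be sketched in plain Python. The tiny layer sizes and weights below are hypothetical and chosen only so the arithmetic is easy to follow; they are not the 128-unit configuration of Table 3, and training by backpropagation is omitted.

```python
def relu(vector):
    """ReLU activation: negative values become zero."""
    return [max(0.0, a) for a in vector]

def dense(x, weights, bias):
    """Fully connected layer: each output is a weighted sum of inputs plus a bias."""
    return [sum(w * a for w, a in zip(row, x)) + b for row, b in zip(weights, bias)]

def mlp_forward(x, hidden_layers, output_layer):
    """Pass x through the hidden layers (with ReLU) and a linear output layer."""
    h = x
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    out_weights, out_bias = output_layer
    return dense(h, out_weights, out_bias)

# Hypothetical 2-input network with one hidden layer of 2 units.
hidden = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])]
output = ([[2.0, 2.0]], [1.0])
prediction = mlp_forward([1.0, 2.0], hidden, output)
```

In this example the first hidden unit receives −1 and is zeroed by ReLU, while the second outputs 1.5, so the network returns 2 × 0 + 2 × 1.5 + 1 = 4.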
Extreme gradient boosting
XGBoost is derived from the Gradient Boosting Decision Tree (GBDT). The basic idea is to generate new trees by repeated splitting: after each iteration, the deviation between the real values of the samples and the values estimated by the updated trees is recalculated, a new tree is fitted to these residuals, and the next round of splitting begins. Tree splitting uses a greedy algorithm, which tries different tree structures until a set maximum depth is reached. Because XGBoost is parallelized, different splitting results can be computed in advance, so processing is faster than with traditional GBDT. Table 4 lists the parameters of the XGBoost model used in the research, which were determined by grid search.
Parameter | Value (pH prediction) | Value (DO prediction) |
---|---|---|
Max. depth | 7 | 3 |
Number of estimators | 100 | 150 |
Gamma | 0.0 | 0.1 |
Lambda | 0.01 | 1 |
Learning rate | 0.01 | 0.1 |
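The residual-fitting loop behind boosting can be sketched with depth-1 trees (stumps) on a single feature. This is plain gradient boosting for squared error, not XGBoost itself: the regularization terms (gamma, lambda) and the parallel split search are left out, and all names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Greedy search for the threshold that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # last value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_value(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_value(s, x) for s in stumps)

def mse(xs, ys, model):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)

# Hypothetical step-shaped data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
baseline_mse = mse(xs, ys, boost(xs, ys, n_estimators=0))
boosted_mse = mse(xs, ys, boost(xs, ys, n_estimators=50))
```

Each round removes a fraction of the remaining residual equal to the learning rate, which is why a small learning rate needs more estimators to reach the same fit.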
Long short-term memory
First, the previous hidden state ht–1 and the input value Xt enter the forget gate and are used to compute ft, which decides which information in the cell state Ct–1 of the previous moment to retain. At the same time, ht–1 and Xt enter the input gate to compute it, which controls the update of the cell. The gate outputs are then used together with the previous cell state Ct–1 to determine the current cell state Ct. The output gate is responsible for calculating Ot and thus determining the hidden state ht that is passed on to the next moment. Table 5 lists the LSTM parameter settings used in the experiment.
Parameter | Value (pH prediction) | Value (DO prediction) |
---|---|---|
Input size | 5 | 5 |
Number of hidden layers | 1 | 1 |
Hidden size | 128 | 128 |
Batch size | 64 | 64 |
Activation | ReLU | ReLU |
Optimizer | Adam | Adam |
Learning rate | 0.001 | 0.004 |
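The gate computations described above can be written out for a single scalar LSTM cell. The sketch uses the standard LSTM equations with sigmoid gates and tanh state activations, which is an assumption about the cell used here; the parameter names in the dictionary are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell; returns (h_t, c_t)."""
    f_t = sigmoid(p["wf"] * x_t + p["uf"] * h_prev + p["bf"])    # forget gate
    i_t = sigmoid(p["wi"] * x_t + p["ui"] * h_prev + p["bi"])    # input gate
    c_hat = math.tanh(p["wc"] * x_t + p["uc"] * h_prev + p["bc"])  # candidate state
    c_t = f_t * c_prev + i_t * c_hat                             # new cell state
    o_t = sigmoid(p["wo"] * x_t + p["uo"] * h_prev + p["bo"])    # output gate
    h_t = o_t * math.tanh(c_t)                                   # new hidden state
    return h_t, c_t

# With all parameters at zero every gate equals 0.5 and the candidate is 0,
# so the cell simply halves its previous state.
zero_params = {name: 0.0 for name in
               ["wf", "uf", "bf", "wi", "ui", "bi", "wc", "uc", "bc", "wo", "uo", "bo"]}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```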
CNN–LSTM hybrid model
The CNN–LSTM hybrid model uses a one-dimensional convolutional layer, Conv1D, which is commonly used to process sequence data. The convolutional kernel is cross-correlated with the input along the temporal dimension and therefore learns the temporal properties of sequence data. The extracted features then enter the pooling layer, where max pooling is used to reduce the model parameters and prevent overfitting. After CNN processing, the data enter the LSTM layer, which captures the temporal characteristics of the data, and finally a fully connected layer computes the prediction targets from the input information. The hybrid model retains the respective strengths of CNN and LSTM and enhances the robustness of the model. Table 6 lists the parameter settings of the CNN–LSTM network.
Parameter | Value (pH prediction) | Value (DO prediction) |
---|---|---|
In channels | 5 | 5 |
Out channels | 32 | 32 |
Kernel size | 3 | 3 |
Stride | 1 | 1 |
Number of hidden layers | 1 | 1 |
Hidden size | 128 | 128 |
Batch size | 64 | 64 |
Activation | ReLU | ReLU |
Optimizer | Adam | Adam |
Learning rate | 0.001 | 0.001 |
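The two CNN operations that precede the LSTM layer, one-dimensional cross-correlation and max pooling, can be sketched for a single channel. The kernel size of 3 and stride of 1 follow Table 6; the pool size of 2, the kernel values, and the input sequence are assumptions for illustration.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide the kernel along the time axis and take a dot product at each position."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(sequence) - k + 1, stride)]

def max_pool1d(sequence, size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, size)]

# A difference-like kernel applied to a steadily rising sequence.
features = conv1d([1, 2, 3, 4, 5], kernel=[1, 0, -1])
pooled = max_pool1d([1, 3, 2, 5, 4], size=2)
```

In the hybrid model the pooled feature sequence, rather than the raw series, is what the LSTM layer receives.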
Model | Parameter | MSE | MAE | RMSE | R2 |
---|---|---|---|---|---|
MLP | pH | 0.0446 | 0.1246 | 0.2113 | 0.3406 |
MLP | DO | 1.7075 | 1.0279 | 1.3067 | 0.7434 |
XGBoost | pH | 0.0224 | 0.1145 | 0.1499 | 0.3254 |
XGBoost | DO | 2.6263 | 1.0728 | 1.6206 | 0.6053 |
LSTM | pH | 0.0097 | 0.0821 | 0.0989 | 0.5531 |
LSTM | DO | 0.0860 | 0.1471 | 0.2932 | 0.9814 |
CNN–LSTM | pH | 0.0015 | 0.0310 | 0.0384 | 0.9328 |
CNN–LSTM | DO | 0.0361 | 0.0869 | 0.1900 | 0.9922 |
Model | Accuracy | Precision | Recall | F1-score |
---|---|---|---|---|
RF | 82.95% | 79.64% | 82.95% | 81.23% |
KNN | 87.52% | 78.79% | 87.52% | 82.92% |
LightGBM | 88.41% | 78.87% | 88.41% | 83.37% |
SVM | 88.75% | 78.77% | 88.75% | 83.46% |
WQC model
This part of the research refers to the establishment of a mapping relationship between water quality parameters and WQC, which makes it possible to estimate water quality by mapping a limited number of water quality parameters to the corresponding WQC when the number of water quality parameters does not meet the requirements of calculating WQI. This part of the research is of practical significance for some areas where it is difficult to collect water quality parameters, and it provides an idea for low-cost water quality prediction. In addition, the classification of WQC is based on the interval of WQI values, which is more error-tolerant than the direct prediction of WQI.
Random Forest
RF is an ensemble algorithm that determines the final class by voting over the classification results of multiple classifiers, a property that makes the predictions of RF models stable and less affected by unusual data. Each time a tree node is constructed, RF searches for the split point among a subset of the features, and each tree is trained on a random sample of the dataset; both measures effectively prevent the model from overfitting. Because multiple trees are constructed for classification, its generalization ability is relatively strong, but the computational cost of RF is also relatively high. Water quality samples usually contain many parameters, and water quality data are collected over periods of years, which means that the amount of data is large. RF can efficiently process a large number of features and data, so it is suitable as a WQC model.
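Two ingredients of RF named above, training each tree on a random sample of the data and combining the trees by voting, can be sketched as follows. Sampling with replacement (a bootstrap sample) is the usual choice and is assumed here; the function names and labels are hypothetical, and tree construction itself is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree would train on one such sample."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Return the class predicted by the largest number of trees."""
    return Counter(labels).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
final_class = majority_vote(["Excellent", "Good", "Excellent"])
```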
k-Nearest Neighbors
Support Vector Machine
SVM originated in convex quadratic programming, and its basic idea is to identify a hyperplane in the feature space that divides the data and maximizes the margin between different classes. When water quality is classified, a category with few samples is easily misclassified; because the hyperplane of the SVM model separates the samples of different categories with the maximum margin, it can reduce the effect of an uneven distribution in the dataset. Another key element of the SVM algorithm is the kernel trick: when the data are not linearly separable, the kernel function can project the features into a higher-dimensional space in which a decision boundary separating the data can be found, so the SVM model is also suitable for scenarios with more features.
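The kernel trick can be illustrated with the degree-2 polynomial kernel K(x, y) = (x · y)², which equals an ordinary dot product after mapping two-dimensional inputs to the three-dimensional features (x1², √2 x1 x2, x2²). The kernel choice and the sample vectors are assumptions for illustration; the kernel used in the experiment is not stated.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(x, y):
    """Degree-2 polynomial kernel computed in the original 2-D space."""
    return dot(x, y) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product reproduces poly_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, y)
mapped_value = dot(feature_map(x), feature_map(y))
```

Both routes give 121 for these vectors, but the kernel never builds the higher-dimensional features, which is what keeps the approach tractable when the mapped space is large.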
Light gradient boosting machine
LightGBM is a framework based on GBDT, which has the advantages of high efficiency, low storage, and the ability to process massive data. Unlike the traditional DT algorithm, LightGBM adopts a histogram-based DT algorithm that discretizes the feature values and constructs a distribution histogram to divide the data into bins, so that only the discrete bin boundaries need to be considered when splitting the data, which accelerates model computation. Another feature of LightGBM is Gradient-based One-Side Sampling, which gives priority to the data samples with larger gradients when calculating the information gain and randomly selects among the samples with smaller gradients, preserving classification accuracy while reducing the data volume. These features give LightGBM excellent training speed, so the classification task can be carried out efficiently on a large number of water samples. It is expected to achieve real-time prediction by updating the model as newly collected samples arrive.
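The histogram idea can be sketched by assigning feature values to a fixed number of bins so that only bin boundaries are candidate split points. Equal-width bins are used here as a simplification; LightGBM's own binning strategy is more elaborate, and the function names and data are hypothetical.

```python
def bin_index(value, low, high, n_bins):
    """Index of the equal-width bin containing value; the top edge goes in the last bin."""
    position = int((value - low) / (high - low) * n_bins)
    return min(position, n_bins - 1)

def histogram(values, low, high, n_bins):
    """Count how many values fall in each bin."""
    counts = [0] * n_bins
    for v in values:
        counts[bin_index(v, low, high, n_bins)] += 1
    return counts

do_values = [0.0, 2.5, 5.0, 7.5, 10.0]
indices = [bin_index(v, 0.0, 10.0, 4) for v in do_values]
counts = histogram(do_values, 0.0, 10.0, 4)
```

With 4 bins there are only 3 internal boundaries to evaluate as split points, regardless of how many distinct values the feature takes.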
Performance evaluation of prediction models
When the model is adjusted so that more samples are predicted to be positive, the recall rises, but this leads to some negative samples being incorrectly predicted as positive, and the precision decreases. Conversely, when the precision increases, fewer negative samples are incorrectly predicted as positive, and the recall eventually decreases. In order to balance precision and recall, the model can be evaluated using the F1-score.
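For a multi-class problem these metrics are computed per class and then averaged. The averaging scheme is not stated in this copy; the sketch below assumes support-weighted averaging, under which weighted recall is always equal to accuracy. The confusion matrix is hypothetical.

```python
def weighted_metrics(cm):
    """Accuracy and support-weighted precision, recall and F1.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[c][c] for c in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        true_pos = cm[c][c]
        support = sum(cm[c])                                  # samples truly in class c
        predicted = sum(cm[r][c] for r in range(n_classes))   # samples predicted as c
        p = true_pos / predicted if predicted else 0.0
        r = true_pos / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

# Hypothetical two-class confusion matrix.
accuracy, precision, recall, f1 = weighted_metrics([[8, 2], [1, 9]])
```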
RESULTS AND DISCUSSION
In this section, the prediction results of MLP, XGBoost, LSTM, and CNN–LSTM hybrid models for water quality parameters are shown and analyzed in order. The prediction results were plotted on the same graph with the real value of the water quality parameters to compare the deviation of the prediction results with the actual data. Finally, the models were evaluated by MSE, RMSE, and MAE to select the model with the best performance for predicting water quality parameters. Additionally, the predictions of the classification models for different classes of water quality will be discussed. The WQC model is evaluated by considering four indicators: accuracy, precision, recall, and F1-score, and the model with the best overall performance will be used for WQC prediction.
MLP model prediction analysis
XGBoost model prediction analysis
LSTM model prediction analysis
CNN–LSTM hybrid model prediction analysis
Analysis of evaluation indicators
The evaluation indexes of each prediction model are listed in Table 7. Overall, among the four water quality parameter prediction models, the MLP model performed the worst in predicting pH: its MSE, MAE, and RMSE were 0.0446, 0.1246, and 0.2113, respectively, the highest among all models. In the prediction of DO, however, the MSE, MAE, and RMSE of the XGBoost model were higher than those of the MLP model, making it the model with the largest deviation from the actual values among the four. The pH distribution in the dataset is relatively concentrated, while the DO data fluctuate greatly. It is worth noting that in Table 7 the LSTM pH prediction model is much more effective than the MLP and XGBoost models (MSE = 0.0097, MAE = 0.0821, RMSE = 0.0989): its MSE is less than half of the XGBoost value and roughly one-fifth of the MLP value. For DO prediction, the MSE of the LSTM model is reduced by about 95%, the MAE by about 85.7%, and the RMSE by about 77.6% compared with the MLP model; compared with the XGBoost model, the MSE, MAE, and RMSE decrease by about 96.7, 86.3, and 81.9%, respectively.
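The relative reductions quoted in this comparison follow from the tabulated metrics as (baseline − improved) / baseline. A short sketch, using MSE values from the evaluation table (the function name is hypothetical):

```python
def pct_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# DO prediction: LSTM MSE (0.0860) against MLP MSE (1.7075).
lstm_vs_mlp_do_mse = pct_reduction(1.7075, 0.0860)
# pH prediction: CNN-LSTM MSE (0.0015) against LSTM MSE (0.0097).
hybrid_vs_lstm_ph_mse = pct_reduction(0.0097, 0.0015)
```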
Since the MSE is calculated by squaring the error, the larger the error value, the greater its impact on the MSE. Combining Figures 4 and 5, it can be seen that the MLP and XGBoost models have large errors in pH and DO prediction for samples numbered between 6,000 and 7,000, while the LSTM and CNN–LSTM models show only a few large prediction biases; therefore, the MSEs obtained by the MLP and XGBoost models are much higher than those obtained by the LSTM and CNN–LSTM hybrid models. MAE is the mean of the absolute errors of all samples and intuitively reflects the overall prediction bias of the model, while RMSE takes the square root of the MSE, which returns the error to the units of the data and narrows the gap between models relative to MSE. Overall, the difference between the four models is most obvious in MSE, which suggests that the improvement in prediction performance depends on the ability to reduce extreme prediction bias. Observing Figures 4–7, it can be seen that the extreme prediction bias mainly occurs in the last 1,000 samples, which are characterized by a sudden increase in pH and DO values that then remain in a high range. The LSTM and CNN–LSTM models can capture trends in time series and adapt to the characteristics of the data at different time points, whereas the prediction accuracy of MLP and XGBoost suffers from the large variations in the samples. For pH prediction in Table 7, the MSE, MAE, and RMSE of the hybrid CNN–LSTM model were about 84.5, 62.2, and 61.2% lower than those of the single LSTM model, respectively. For DO prediction, the MSE, MAE, and RMSE of the hybrid CNN–LSTM model were 58, 40.9, and 35.2% lower than those of the single LSTM model, respectively. As for R2, it reflects the degree to which the model explains the variation of the observed variable, and the closer the value is to 1, the better the model fit.
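The four regression metrics can be sketched directly from their standard definitions. The example data are hypothetical and contain a single large error, to show how it weighs more heavily on MSE than on MAE.

```python
import math

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    return math.sqrt(mse(actual, predicted))

def r2(actual, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Three perfect predictions and one error of 2.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = (mse(actual, predicted), mae(actual, predicted),
          rmse(actual, predicted), r2(actual, predicted))
```

Here the single error of 2 yields an MAE of 0.5 but an MSE of 1.0, twice as large, because the error is squared before averaging.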
For pH prediction, the R2 of the MLP, XGBoost, LSTM, and CNN–LSTM models are 0.3406, 0.3254, 0.5531, and 0.9328, respectively. The R2 of LSTM is about 62 and 70% higher than that of the MLP and XGBoost models, respectively, and that of the hybrid CNN–LSTM model is about 69% higher than that of LSTM. The performance of the single LSTM model is significantly improved by the introduction of the CNN network. For DO prediction, the R2 of MLP, XGBoost, LSTM, and CNN–LSTM are 0.7434, 0.6053, 0.9814, and 0.9922, respectively. Although the LSTM model is already able to fit the variation pattern of DO, the hybrid model still achieves the highest R2, and about 99% of the variation in the samples is explained by the CNN–LSTM model. In the CNN–LSTM hybrid model, a one-dimensional convolutional layer is used to extract features in the time dimension and a max pooling layer is used to reduce the model parameters to prevent overfitting. By incorporating the CNN network, the LSTM model can further improve its capability of processing time series data. The experimental results demonstrated that the performance of the hybrid model was the highest among the four models for both pH and DO predictions.
WQC model analysis
The data on the main diagonal of the confusion matrix reflect the number of samples whose predicted category matches the actual category, and the more samples on the main diagonal, the higher the accuracy of the classification model. The total number of samples on the main diagonal for SVM is 505, the highest of the four classification models. In addition, samples in the excellent water quality class account for the largest proportion of the dataset. The SVM model has the highest recognition rate for the data in the excellent category. For the samples with good water quality, the RF model classified only two cases correctly and the other models classified none, which indicates that the four models are not sensitive to samples with good water quality. Similarly, when classifying the data with poor water quality, only the RF model classifies any samples correctly; KNN, LightGBM, and SVM incorrectly classify the data with poor water quality into the excellent category, which poses a risk for water quality prediction. The proportion of water quality categories in the dataset is not uniform, with 87% of samples at the excellent level, 10% at the good level, and only 3% in the poor water quality classes, which leads to the poor prediction of good and poor water quality data. To further improve the performance of the model, future research can use more uniformly distributed samples to train the model.
Table 8 shows the evaluation index values of the four classification models. The SVM model has the highest accuracy among the four models, and a higher accuracy means that the model can correctly classify more samples, which indicates that the SVM model is more reliable than the other three models in overall classification results. However, the sample sizes of the water quality categories in the dataset were not balanced, so a model can classify all samples into the water quality class with the largest proportion and still obtain a good accuracy; the best model therefore cannot be selected on the basis of accuracy alone. The second indicator considered was precision, which in this experiment reflects how many of the samples assigned to each category actually belong to the corresponding water quality level. Among the four classification models, RF achieved the highest precision, 79.64%, while the SVM model was slightly lower at 78.77%, and the overall difference between the four models was less than 1 percentage point. A higher precision indicates that samples are assigned to a particular water quality class with more confidence, but when the classes are unbalanced, the model may prefer to predict the class with the largest proportion. To balance the predictive power of the model, it is also necessary to consider recall.
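The effect of class imbalance on these indicators can be made concrete with a baseline that assigns every sample to the majority class. The sketch assumes support-weighted averaging of precision, recall and F1, and treats the precision of a class that is never predicted as 0. The counts are assumptions: 505 is the SVM diagonal total quoted above, and a test set of 569 samples is inferred from 505 / 0.8875 rather than stated in the text.

```python
def majority_baseline(n_majority, n_total):
    """Weighted metrics for a classifier that always predicts the majority class."""
    share = n_majority / n_total
    accuracy = share
    # Majority class: recall 1, precision equal to its share.
    # Every other class: recall 0 and (by convention) precision 0.
    weighted_recall = share * 1.0
    weighted_precision = share * share
    majority_f1 = 2 * share / (share + 1.0)
    weighted_f1 = share * majority_f1
    return accuracy, weighted_precision, weighted_recall, weighted_f1

accuracy, precision, recall, f1 = majority_baseline(505, 569)
```

With these assumed counts the baseline gives about 88.75% accuracy, 78.77% weighted precision and 83.46% weighted F1 without recognizing a single minority-class sample, which is why the confusion matrix should be read alongside the aggregate indicators.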
For recall, the SVM model reaches 88.75%, slightly higher than the LightGBM model (88.41%), followed by the KNN model (87.52%); the RF model is the lowest at only 82.95%. Recall measures the ability of the model to recognize each category. The higher recall of the SVM model indicates that it performs better in classifying samples of the different classes. When the sample categories are imbalanced, some categories contain few samples, and a high-recall model helps ensure that these categories are identified.
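The recall reported for the SVM model (88.75%) coincides with its accuracy. This is what one would expect if the per-class recalls were averaged with class-support weights, although the averaging scheme is an assumption here and is not stated in the study. With $n_k$ actual samples in class $k$, $TP_k$ of them classified correctly, and $N = \sum_k n_k$ samples in total:

```latex
\mathrm{Recall}_{\text{weighted}}
  = \sum_{k} \frac{n_k}{N} \cdot \frac{TP_k}{n_k}
  = \frac{\sum_{k} TP_k}{N}
  = \mathrm{Accuracy}
```

Under that assumption the overall recall is driven mainly by the largest class, so it is worth reading it together with the per-class results in the confusion matrices.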
In water quality prediction, when the water quality is predicted to be excellent or good but the actual water quality is poor, the error causes managers to miss the critical time for water quality protection. The classification model should therefore produce as few false positives as possible, which means that its precision should be high. On the other hand, the model is expected to identify as many positive samples as possible, so as to avoid the waste of manpower and material resources caused by unnecessary protection measures triggered by false negatives, which means that its recall must also be taken into account. The F1-score is the harmonic mean of precision and recall; it balances the two and gives a more stable assessment of the model on unbalanced data. The F1-score of the SVM model is also the highest among the four classification models. Taken together, the RF model has the lowest accuracy and recall: most of the test samples are at the excellent level, and the RF model has the weakest ability among the four models to identify samples of this level, which lowers its performance. The KNN model over-predicted the excellent class and assigned some samples of the good and poor grades to it. The LightGBM model is close to the SVM model and can be used as an alternative. Considering both accuracy and F1-score, the SVM model is the best classification model in the experiment for ensuring classification accuracy while balancing the recognition of different categories. Water quality tends to remain stable at one level over long periods, and the data were collected continuously over a certain period of time, which resulted in an uneven distribution of water quality classes in the samples and limited the ability of the models to recognize the other water quality categories. In future experiments, data from longer time periods or different regions can be introduced to test the classification performance of the model for each water quality category.
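The harmonic-mean definition of the F1-score given above can be checked against the values reported for the SVM model (precision 78.77%, recall 88.75%). The function name below is a hypothetical helper introduced only for this sketch.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the SVM model as reported in this study (in %).
svm_f1 = f1_from(78.77, 88.75)

if __name__ == "__main__":
    print(f"SVM F1-score: {svm_f1:.2f}%")  # prints "SVM F1-score: 83.46%"
```

The result agrees with the reported F1-score of 83.46%, which suggests that the F1-score was obtained from the aggregated precision and recall.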
Comparison with previous literature studies
Table 9 compares the CNN–LSTM water quality parameter prediction model proposed in this study with other models developed in the recent literature. The CEEMDAN–XGBoost model proposed by Lu & Ma (2020) uses six different parameters as inputs, and its overall performance differs markedly from that of the models proposed in other studies. The TCN model proposed by Fu et al. (2021) uses five inputs. In pH prediction, the accuracy of the model proposed in this study is slightly higher than that of the TCN model, whereas in DO prediction its MSE, MAE, and RMSE are 97.0, 78.4, and 82.8% lower, respectively, than those of the TCN model, a significant improvement in performance. Compared with the LSTM model proposed by Eze et al. (2021), the hybrid model proposed in this study reduces the MSE in pH prediction by 99.6%, but its errors in DO prediction are higher than those of the LSTM model on all three indicators. In general, among the four models, the hybrid model proposed in this study achieves the lowest MSE in pH prediction with the fewest input parameters, and although its error in DO prediction is higher than that of the LSTM model, the error values are within the acceptable range and do not lead to serious judgment errors.
Scholar | Model | Number of inputs | Predicted parameter | MSE | MAE | RMSE
---|---|---|---|---|---|---
Lu & Ma (2020) | CEEMDAN–XGBoost | 6 | pH | 0.04 | 0.02 | 0.02
Lu & Ma (2020) | CEEMDAN–XGBoost | 6 | DO | 0.04 | 0.19 | 0.20
Fu et al. (2021) | TCN | 5 | pH | 0.0025 | 0.0214 | 0.0505
Fu et al. (2021) | TCN | 5 | DO | 1.2201 | 0.4014 | 1.1046
Eze et al. (2021) | LSTM | 4 | pH | 0.3889 | 0.0140 | 0.6236
Eze et al. (2021) | LSTM | 4 | DO | 0.0013 | 0.0262 | 0.0355
Current study | CNN–LSTM | 3 | pH | 0.0015 | 0.0310 | 0.0384
Current study | CNN–LSTM | 3 | DO | 0.0361 | 0.0869 | 0.1900
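The percentage reductions quoted in the comparison can be recomputed from the Table 9 values as a relative decrease, (reference − new) / reference × 100. The sketch below does this for DO prediction against the TCN model and for the pH MSE against the LSTM model; the variable and function names are assumptions made for illustration.

```python
# Error metrics copied from Table 9.
TCN_DO = {"MSE": 1.2201, "MAE": 0.4014, "RMSE": 1.1046}
CNN_LSTM_DO = {"MSE": 0.0361, "MAE": 0.0869, "RMSE": 0.1900}
LSTM_PH_MSE = 0.3889
CNN_LSTM_PH_MSE = 0.0015


def relative_decrease(reference, new):
    """Percentage by which `new` is lower than `reference`."""
    return (reference - new) / reference * 100.0


# Reduction of each DO error metric relative to the TCN model.
do_decrease_vs_tcn = {
    metric: relative_decrease(TCN_DO[metric], CNN_LSTM_DO[metric])
    for metric in TCN_DO
}
# Reduction of the pH MSE relative to the LSTM model of Eze et al. (2021).
ph_mse_decrease_vs_lstm = relative_decrease(LSTM_PH_MSE, CNN_LSTM_PH_MSE)

if __name__ == "__main__":
    for metric, value in do_decrease_vs_tcn.items():
        print(f"DO {metric} decrease vs TCN: {value:.1f}%")
    print(f"pH MSE decrease vs LSTM: {ph_mse_decrease_vs_lstm:.1f}%")
```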
Table 10 compares the SVM model proposed in this study with other WQC prediction models reported in the literature. Ahmed et al. (2019) proposed an MLP model that used six parameters as inputs for WQC. Its accuracy was close to that of the SVM model developed in this study, but its precision and recall were much lower, and its F1-score was 26.97 percentage points lower. The DT model proposed by Nasir et al. (2022) used seven variables as inputs; its accuracy, recall, and F1-score were lower than those of the model proposed in this study, while its precision was higher than that of the SVM model. Compared with these studies, the SVM model developed in this study has higher accuracy, and using only two parameters as inputs greatly reduces the cost of data collection and the model running time, which is important for the development of low-cost water quality systems.
Scholar | Model | Number of inputs | Accuracy | Precision | Recall | F1-score
---|---|---|---|---|---|---
Ahmed et al. (2019) | MLP | 6 | 85.07% | 56.59% | 56.40% | 56.49%
Nasir et al. (2022) | DT | 7 | 81.62% | 81.69% | 81.63% | 81.56%
Current study | SVM | 2 | 88.75% | 78.77% | 88.75% | 83.46%
CONCLUSION
Pollution of water sources has always been an important issue worldwide. Because it is difficult to stop pollution at its source, water sources must be monitored regularly so that pollution can be prevented in advance. Water quality prediction is key to solving this problem: based on the prediction results, staff can judge the status of a water source and take appropriate protective measures. To this end, two forms of water quality assessment, water quality parameters and WQC, were used, and experiments on water quality parameter prediction and WQC prediction were carried out to determine the better-performing prediction models.
In the experiments on water quality parameter prediction, MLP, XGBoost, LSTM, and CNN–LSTM hybrid models were constructed to predict DO and pH. In terms of pH prediction, the MSE of the LSTM model is 78.3% lower than that of the MLP model and 56.7% lower than that of the XGBoost model. In terms of DO prediction, the MSE of the LSTM model is 85.6% lower than that of MLP and XGBoost by 95 and 96.7%. Nevertheless, the LSTM model still has some defects in pH prediction: the predicted values of some samples fluctuate drastically, which increases the error. To obtain more stable prediction results, the CNN–LSTM hybrid model was developed. The errors of the CNN–LSTM hybrid model are greatly reduced in pH prediction, with the MSE, MAE, and RMSE decreasing by 84.5, 62.2, and 61.2%, respectively. In addition, the CNN–LSTM hybrid model addresses the shortcomings of the LSTM model in predicting high DO levels and narrows the gap with the actual values, with the three metrics decreasing by 58, 40.9, and 35.2%, respectively, compared with the LSTM model.
For the WQC, the aim of this research is to classify water quality according to a small number of water quality parameters. In this part, RF, KNN, LightGBM, and SVM models were developed to classify water quality according to pH and DO. According to the confusion matrices, the KNN, LightGBM, and SVM models tend to classify samples as excellent water quality and are less sensitive to the good and poor levels. The RF model is able to distinguish between good and poor water quality, but its overall recall is low. This situation is related to the sample distribution of the dataset: about 87% of the samples are of excellent water quality, and there are few samples of the good and poor grades, which weakens the classification of these two levels. In subsequent research, the sample size can be increased to improve the classification performance. To compare the models more intuitively, this research also used accuracy, precision, recall, and F1-score to evaluate them. The SVM model has the lowest precision (78.77%) among the four classification models, but it is only 0.87 percentage points below the RF model, which has the highest precision. The SVM model was the highest on the other three metrics (accuracy: 88.75%, recall: 88.75%, F1-score: 83.46%). The classification model should avoid, as far as possible, categorizing poor water quality data (poor quality and below) as good quality water. At the same time, it should also ensure that high quality water is not misclassified as poor quality water, which would waste resources. Considering the four metrics together, the SVM model was identified as the final WQC model.
Future research directions
In summary, the CNN–LSTM water quality parameter prediction model and the SVM WQC model require little water quality data, which is of great significance for regions where water quality measurement is costly. Future research could take advantage of the extensibility of the two models and address the shortcomings of this research to achieve a more comprehensive automated water quality prediction system. The following points can be considered. First, depending on the water quality parameters available in the dataset, the prediction system could adjust the combination of input features according to the output results. Second, the prediction interval could be adjusted manually to achieve water quality prediction at multiple time scales. Third, the machine learning models used in the research could be componentized so that, as the field of machine learning develops, more algorithms can be integrated into the water quality prediction system. Fourth, incorporating the different criteria used to calculate WQI in different regions would make the model more generalizable.
AUTHOR CONTRIBUTIONS
The project idea and research design of this study were given by Z.C. H.G. analyzed the data, implemented methodology, and wrote the first draft of the manuscript. F.Y.T. assisted with the research framework design. Z.C. and F.Y.T. supervised the project. All authors read and approved the final manuscript.
FUNDING
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.