Water pollution remains a longstanding global challenge, prompting substantial investment in water quality protection. Advanced machine learning models offer a promising route to accurate water quality prediction, enabling proactive measures to safeguard water sources. At present, water quality assessment relies predominantly on physical and chemical metrics. This study developed MultiLayer Perceptron (MLP), eXtreme Gradient Boosting (XGBoost), long short-term memory (LSTM), and hybrid CNN–LSTM models to forecast pH and dissolved oxygen (DO) levels. The hybrid model performed best, with mean squared errors (MSEs) of 0.0015 and 0.0361 for pH and DO prediction, respectively. For Water Quality Classification (WQC), Random Forest (RF), k-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Light Gradient Boosting Machine (LightGBM) were employed, with SVM achieving the highest accuracy at 88.75%. The research underscores the effectiveness of the CNN–LSTM model in predicting pH and DO levels. Using these predictions as inputs to the SVM model offers valuable insights, particularly in regions where conventional monitoring methods face limitations. This streamlined approach, which requires only two parameters, represents a meaningful advance in accurate water quality prediction.

  • The study showcases the application of advanced machine learning techniques such as MLP, XGBoost, LSTM, and a hybrid CNN–LSTM model for accurate forecasting of pH and dissolved oxygen levels.

  • The hybrid model demonstrated superior performance as evidenced by its lower mean squared error (MSE) values.

  • Using the SVM, RF, KNN, and LightGBM classifiers, the study achieves high accuracy in predicting WQC.

The quality of water, a crucial resource for the survival of living organisms on Earth, is an issue of great concern. Determining water quality is essential for various purposes, including irrigation, drinking, and industrial use. Each of these applications requires specific water quality criteria to ensure safety, efficiency, and sustainability. The source of water varies with the context and purpose of use; the main sources are surface water, groundwater, and treated wastewater. Assessing water quality ensures that it meets the necessary standards for its intended use, safeguarding health, optimizing performance, and protecting the environment. The impact of water pollution on ecosystems and organisms cannot be ignored: if pollutants enter the human body through the food chain and accumulate, they may lead to serious consequences (Schwarzenbach et al. 2010). Some factory wastes, as well as pesticides and fertilizers used on crops, can cause irreversible chemical pollution when they flow into water bodies. To reduce pollution and protect water sources, discharges can be pre-treated and harmful substances broken down through physical deposition and chemical and biological means (Posthuma et al. 2001). However, it is difficult to ensure that every source of pollution flowing into a water body is effectively regulated, and highly toxic compounds such as sulfonic acid are difficult to treat. It is therefore more effective to monitor water conditions directly and take precautionary measures in response to changes in water quality indicators. The aim of this study is to develop an integrated water quality prediction system based on multiple machine learning models, which predicts future water quality from historical water quality parameter data and thus provides decision support to regulators for water source protection.

Currently, machine learning has made some progress in the field of water quality prediction (Najah et al. 2021). Two prevalent indicators in research for water quality are Water Quality Index (WQI) and Water Quality Classification (WQC). Nasir et al. (2022) used machine learning models including Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), CATegorical Boosting (CATBoost), eXtreme Gradient Boosting (XGBoost), and MultiLayer Perceptron (MLP) to classify the water quality based on the WQI. CATBoost achieved the highest accuracy (94.51%). Ahmed et al. (2019) developed a WQC prediction system for temperature, turbidity, pH, and total dissolved solids (TDSs) using various machine learning models. Compared with SVM and RF, the MLP has better performance.

Hmoud Al-Adhaileh & Waselallah Alsaade (2021) developed an adaptive neuro-fuzzy inference system (ANFIS) model for WQI prediction, which combines fuzzy logic with an artificial neural network (ANN). The regression coefficient of the ANFIS model reached 96.17%. By introducing fuzzy logic on top of the neural network, ANFIS handles nonlinear problems better, which makes it suitable for complex water quality prediction. Eze et al. (2021) and Eze & Ajmal (2020) used Ensemble Empirical Mode Decomposition (EEMD) to decompose water quality data, improving the adaptability of the LSTM model to nonlinear and non-stationary data. Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) reduces volatility by decomposing the data; building on it, CEEMDAN–XGBoost and CEEMDAN–RF show higher stability in short-term prediction of water quality parameters (Lu & Ma 2020).

Since water quality changes are usually time-dependent, scholars have introduced machine learning models with temporal properties for water quality prediction. Liu et al. (2020) constructed a Bi-directional Stacked Simple Recurrent Unit network (Bi-S-SRU) to include future context information, and the accuracy of the model in short-term prediction reached 94.42%. Fu et al. (2021) proposed a temporal convolutional network (TCN), which extracts time series features by dilated causal convolution, and demonstrated its advantages in long-term forecasting of water quality parameters, with a prediction accuracy of up to 91.91%. Long-term prediction with LSTM shows excellent stability. Similarly, Rasheed Abdul Haq & Harigovindan (2022) constructed a water quality prediction system for aquaculture using the LSTM model, and its prediction accuracy was slightly better than that of the autoregressive integrated moving average (ARIMA) model. Barzegar Aalami & Adamowski (2020) reported that the CNN model predicts dissolved oxygen (DO) accurately in the low concentration range, while the LSTM model performs better in the high concentration range, and that the hybrid CNN–LSTM model combines the advantages of both.

As a generalized water quality indicator, the calculation of the WQI has a minimum number requirement of water quality parameters, which can be expensive to collect, especially in some areas. Considering this situation, some prediction models for a single water quality parameter were developed. Biochemical oxygen demand (BOD) has been shown to be a measure of organic pollution in water sources. Kannel et al. (2007) developed low-cost water quality indicators WQIm and WQIDO. WQIm was calculated using five parameters: temperature, pH, DO, conductivity, and BOD, while WQIDO considered only a single parameter DO. It was verified that 90 and 93% of the samples were correctly classified using WQIm and WQIDO, respectively. Ho et al. (2019) built the DT model for WQI evaluation of the Klang River. The effects of different combinations of input parameters on the model performance were tried, and the results showed that NH3-N, pH, and suspended solids (SSs) had the least effect on the prediction of the WQI. The accuracy of the prediction model using only three water quality features, BOD, chemical oxygen demand (COD) and DO, as inputs was as high as 75%, while the accuracy of the model using all the inputs was 84.09%, which was within an acceptable range. This research demonstrates the feasibility of low-cost water quality prediction models, where fewer parameter inputs can speed up model computation while ensuring the accuracy of the prediction results. Othman et al. (2020) constructed a WQI prediction system based on ANN about BOD, DO, SS, COD, ammonia nitrogen (AN), and pH. After sensitivity analyses, DO was the parameter that had the greatest impact on the WQI predictions.

By comprehensively comparing the water quality assessment indicators discussed in the literature, pH and DO emerge as the most commonly used parameters. Models that use these indicators as input for water quality prediction tend to be more generalizable. To address this, the study innovatively proposed a flexible water quality prediction model, which constructed a mapping between pH, DO and WQI, reduced the input features of the prediction model, and thus reduced the cost of data collection, which is of great significance for areas where water quality data collection is difficult. In addition, the model is scalable and can adjust the parameter input according to the actual data collected. When there are fewer statistical water quality characteristics, the WQI can be obtained through model mapping. When the sample has enough water quality characteristic parameters, the water quality calculation method of different regions can be introduced to obtain the corresponding WQI, and finally input into the classification model to predict WQC. In addition, the model provides an adjustable input feature sequence length, and water quality prediction at different time scales can be achieved by adjusting the sequence length.

This section explains the data pre-processing methods, the structure of the models, and the evaluation metrics used in the experiments. Figure 1 illustrates the workflow of this research. The research process was divided into two stages according to the water quality assessment method. In the first stage, water quality parameters are predicted: time sequence data are constructed, pH and DO are predicted using the MLP, XGBoost, LSTM, and CNN–LSTM hybrid models, and the best prediction model is chosen based on the prediction results. In the second stage, WQC is predicted: water quality classes are assigned from the water quality parameters in the dataset using the RF, KNN, SVM, and LightGBM models, and the classification model with the best performance is selected based on the evaluation indexes. The models are built with the PyTorch framework and the scikit-learn library. The data are divided into training and testing sets in a ratio of 70% to 30%, the MSE is used as the loss function, the number of training epochs is set to 150, and the data are divided into batches of size 64 to speed up training. The parameter settings for each model are given in the following subsections.
Figure 1

Workflow illustration.

In this research, two datasets were used to construct the water quality parameter prediction model and the WQC model, respectively. The dataset Accurate Prediction of Water Quality was provided by Liu et al. (2020) and was collected in Hainan Province, China. Its data are arranged in temporal order of collection and can be used to construct a time sequence of water quality parameters. The other dataset is the publicly available Indian Water Quality Data on the Kaggle website (https://www.kaggle.com/datasets/anbarivan/indian-water-quality-data). Its samples were collected from various water quality monitoring stations in India from 2003 to 2014. The dataset has a total of 1991 samples and contains 8 categories of water quality parameters, among which pH, DO, conductivity, BOD, nitrate, fecal coliform, and total coliform can be used to calculate the WQI.
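The time-sequence construction mentioned for the first dataset can be sketched as a sliding window over a single parameter. This is a minimal illustration, not the study's actual code: the function names, the one-step-ahead target, the unshuffled (chronological) split, and the sample readings are all assumptions; only the 70/30 ratio and the adjustable sequence length come from the text.

```python
import numpy as np

def make_sequences(series, seq_len):
    """Turn a 1-D series into (window, next value) pairs."""
    series = np.asarray(series, dtype=float)
    # Each row of X holds seq_len consecutive readings.
    X = np.stack([series[i:i + seq_len] for i in range(len(series) - seq_len)])
    # The target is the reading that immediately follows each window.
    y = series[seq_len:]
    return X, y

def chronological_split(X, y, train_ratio=0.7):
    """Split without shuffling, so test windows come after training windows (assumption)."""
    cut = int(len(X) * train_ratio)
    return X[:cut], y[:cut], X[cut:], y[cut:]

# Hypothetical pH readings, for illustration only.
ph = np.array([7.1, 7.2, 7.0, 7.3, 7.4, 7.2, 7.1, 7.5, 7.6, 7.4, 7.3, 7.2])
X, y = make_sequences(ph, seq_len=2)
X_train, y_train, X_test, y_test = chronological_split(X, y)
```

Changing `seq_len` changes how much history each prediction sees, which is one way to read the adjustable sequence length described earlier.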

Data pre-processing

Because deleting records with missing values would further reduce the sample size of the dataset and affect model fitting, nearest neighbor interpolation is used to handle missing values in this research. Nearest neighbor interpolation fills each gap with the closest available data point and is suitable for cases where the data change by a large margin. After the data were checked and corrected, the interquartile range (IQR) of each feature was calculated. The first quartile minus 1.5 times the IQR and the third quartile plus 1.5 times the IQR were used as boundaries to further filter outliers.
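A minimal sketch of these two pre-processing steps is given below. It assumes the conventional IQR fences (first quartile minus 1.5 × IQR, third quartile plus 1.5 × IQR) and measures "nearest" by index position in the time-ordered series; the function names and the sample DO values are hypothetical.

```python
import numpy as np

def fill_nearest(values):
    """Replace each NaN with the value at the closest non-missing index."""
    values = np.asarray(values, dtype=float)
    known = np.flatnonzero(~np.isnan(values))
    filled = values.copy()
    for i in np.flatnonzero(np.isnan(values)):
        # On a tie, argmin picks the earlier neighbor.
        nearest = known[np.argmin(np.abs(known - i))]
        filled[i] = values[nearest]
    return filled

def iqr_mask(values, k=1.5):
    """Boolean mask that is True for values inside the IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)

# Hypothetical DO readings with two gaps and one implausible spike.
do = np.array([6.8, np.nan, 7.1, 7.0, 25.0, 6.9, np.nan, 7.2])
filled = fill_nearest(do)
clean = filled[iqr_mask(filled)]
```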

WQI calculation

The WQI is widely used as a comprehensive indicator in water quality assessment and is obtained by weighting the key parameters that affect water quality. Its numerical magnitude indicates the quality of the water. The calculation of the WQI depends on the type of water quality parameters and on the region, which is reflected in different parameter settings and weighting methods. In this research, the WQI was calculated following the method described by Tyagi et al. (2013). The calculation procedure is given in the following equations:
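$$K = \frac{1}{\sum_{i=1}^{N} \frac{1}{S_i}} \tag{1}$$
$$w_i = \frac{K}{S_i} \tag{2}$$
$$q_i = 100 \times \frac{V_i - V_{Ideal}}{S_i - V_{Ideal}} \tag{3}$$
$$\mathrm{WQI} = \frac{\sum_{i=1}^{N} q_i w_i}{\sum_{i=1}^{N} w_i} \tag{4}$$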

In the first equation, N is the number of parameters for the water source and Si is the standard value of each water parameter, as listed in Table 1. In the third equation, Vi is the actual (measured) value of the water quality parameter and VIdeal is the value of the parameter in the ideal situation. The ideal value of pH is 7.0 and that of DO is 14.6 mg/l; the ideal values of all other parameters are 0.
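As an illustration, the calculation can be sketched in a few lines. The sketch assumes the weighted arithmetic form of the WQI commonly attributed to Tyagi et al. (2013): a proportionality constant K = 1 / Σ(1/Si), unit weights wi = K/Si, quality ratings qi = 100 (Vi − VIdeal)/(Si − VIdeal), and WQI = Σ qi wi / Σ wi. The dictionary keys and the function name are hypothetical; the standard values are those of Table 1.

```python
# Standard values S_i (Table 1) and ideal values stated in the text.
STANDARD = {
    "pH": 8.5, "DO": 10.0, "conductivity": 1000.0, "BOD": 5.0,
    "nitrate": 45.0, "fecal_coliform": 100.0, "total_coliform": 1000.0,
}
IDEAL = {"pH": 7.0, "DO": 14.6}  # every other parameter has an ideal value of 0

def compute_wqi(sample):
    """Weighted arithmetic WQI over the parameters present in `sample` (assumed form)."""
    # Proportionality constant K over the available parameters.
    k = 1.0 / sum(1.0 / STANDARD[name] for name in sample)
    weighted_sum = 0.0
    weight_total = 0.0
    for name, value in sample.items():
        s = STANDARD[name]
        ideal = IDEAL.get(name, 0.0)
        w = k / s                                    # unit weight w_i
        q = 100.0 * (value - ideal) / (s - ideal)    # quality rating q_i
        weighted_sum += q * w
        weight_total += w
    return weighted_sum / weight_total

# A sample sitting exactly at the standard values scores 100;
# a sample sitting at the ideal values scores 0.
at_standard = compute_wqi(dict(STANDARD))
at_ideal = compute_wqi({name: IDEAL.get(name, 0.0) for name in STANDARD})
```

Under this form, a lower WQI indicates better water, which matches the rating bands used for labeling.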

Table 1

Standard values for water quality parameters (Hmoud Al-Adhaileh & Waselallah Alsaade 2021)

Parameter                  Value
pH                         8.5
Dissolved oxygen           10 mg/l
Conductivity               1,000 μS/cm
Biological oxygen demand   5 mg/l
Nitrate                    45 mg/l
Fecal coliform             100 cfu/100 ml
Total coliform             1,000 cfu/100 ml

According to the calculated WQI, the quality of the water body can be classified into levels. The specific rating rules are shown in Table 2. According to the classification rules in the table, the samples in the dataset Indian Water Quality Data are labeled with water quality classes, which are used for WQC model training and testing.

Table 2

Water quality rating based on the WQI value (Hmoud Al-Adhaileh & Waselallah Alsaade 2021)

WQI range   Level
0–25        Excellent
26–50       Good
51–75       Poor
76–100      Very poor
Above 100   Unsuitable for drinking purpose
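The labeling rule in Table 2 can be sketched as a simple threshold function. The table lists integer bands, so the treatment of fractional values (for example 25.5) is an assumption here: each upper bound is taken as inclusive and anything above it falls into the next band. The function name is hypothetical.

```python
def wqi_to_class(wqi):
    """Map a WQI value to its Table 2 rating (band edges treated as inclusive upper bounds)."""
    if wqi <= 25:
        return "Excellent"
    if wqi <= 50:
        return "Good"
    if wqi <= 75:
        return "Poor"
    if wqi <= 100:
        return "Very poor"
    return "Unsuitable for drinking purpose"

# Hypothetical WQI values, for illustration only.
labels = [wqi_to_class(v) for v in (12.0, 25.5, 60.0, 99.0, 140.0)]
```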

Min–max scaling

Min–max scaling can scale all the features between 0 and 1 to speed up the model training and avoid the influence of different dimensions of the features. Xmax and Xmin are the maximum and minimum values of the features, respectively.
$$X' = \frac{X - X_{min}}{X_{max} - X_{min}} \tag{5}$$
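A minimal sketch of min–max scaling applied column by column is shown below. The function name and the sample values are hypothetical, and the sketch assumes that no feature is constant (a constant column would divide by zero).

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Hypothetical rows of (pH, DO in mg/l).
raw = np.array([[6.5, 4.0], [7.0, 8.0], [8.5, 10.0]])
scaled = min_max_scale(raw)
```

In practice the minimum and maximum would usually be taken from the training set only and reused for the test set; the source does not say which was done.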

Water parameter prediction models

Water quality can be assessed by biological, physical, and chemical indicators. After reviewing the literature, pH and DO were chosen as the indicators for water quality assessment in this study. DO and pH are relatively simple to collect, which reduces cost, and since both parameters appear in water quality studies across different regions, models that use pH and DO as prediction targets will be informative for water quality prediction elsewhere. DO refers to the oxygen content of the water, on which the growth of aquatic organisms relies. Excessively low DO can lead to the death of aquatic organisms; ideally, DO should not fall below 2 mg/l. As a common indicator of water quality, pH is easy to measure and is a key factor in water quality assessment in areas where the cost of water quality testing is high. For conventional water bodies, both high and low pH are detrimental to the survival of organisms. Water with a pH of 7 is neutral, and near-neutral water is a suitable living condition for most aquatic organisms.

Multilayer perceptron

An MLP is a classic ANN consisting of an input layer, a hidden layer, and an output layer, and it is commonly used for regression and classification. The input layer accepts the input features and passes them to the hidden layer. Each cell in the hidden layer forms a linear combination of the incoming data and transforms it into a nonlinear output through the activation function. Finally, the output of the hidden layer reaches the output layer, which computes the prediction. The model updates its weights by backpropagation with gradient descent, thereby reducing the model error. The parameters of the MLP model built in this experiment are shown in Table 3.

Table 3

Parameter setting of the MLP model

Parameter                 Value (pH prediction)   Value (DO prediction)
Input size
Number of hidden layers
Hidden size               128                     128
Activation                ReLU                    ReLU
Optimizer                 Adam                    Adam
Learning rate             0.01                    0.1

Extreme gradient boosting

XGBoost is derived from the Gradient Boosting Decision Tree (GBDT). Its basic idea is to build trees iteratively: after each iteration, the deviation between the true and estimated values of the samples is recalculated using the updated trees, a new tree is fitted to these residuals, and the next round of splitting begins. Tree splitting uses a greedy algorithm, which tries different tree structures until a set maximum depth is reached. Because XGBoost is parallelized, different splitting results can be computed in advance, so processing is faster than with traditional GBDT. Table 4 lists the parameters of the XGBoost model used in this research, which were determined by grid search.

Table 4

Parameter setting of the XGBoost model

Parameter              Value (pH prediction)   Value (DO prediction)
Max. depth
Number of estimators   100                     150
Gamma                  0.0                     0.1
Lambda                 0.01
Learning rate          0.01                    0.1

Long short-term memory

Traditional neural networks perform poorly on sequence prediction. Recurrent neural networks (RNNs) were developed to address this problem: they introduce a hidden state that stores the current output and passes it through the network, so that the output at the current moment is linked to the output at the previous moment. However, further research exposed the shortcomings of RNNs in predicting long sequences. LSTM, a special RNN, introduces a cell state and controls the propagation of information through forget, input, and output gates. This delays the vanishing gradients that may occur during backpropagation, and LSTM therefore performs better on long sequences. Figure 2 shows the internal structure of LSTM.
Figure 2

Illustration of the structure of LSTM.

First, the previous hidden state ht–1 and the input value Xt enter the forget gate, which computes ft to decide which information in the previous cell state Ct–1 to retain. At the same time, ht–1 and Xt enter the input gate to compute it, which controls the update of the cell. The candidate cell state is then used together with the previous cell state Ct–1 to determine the current cell state Ct. The output gate computes Ot and thus determines the hidden state ht that is passed to the next moment. Table 5 lists the LSTM parameter settings used in the experiment.

Table 5

Parameter setting of the LSTM model

Parameter                 Value (pH prediction)   Value (DO prediction)
Input size
Number of hidden layers
Hidden size               128                     128
Batch size                64                      64
Activation                ReLU                    ReLU
Optimizer                 Adam                    Adam
Learning rate             0.001                   0.004
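
The gate computations described for the LSTM cell can be sketched as a single time step in NumPy. This follows the standard LSTM formulation rather than anything specific to the study; the stacked weight layout, the tiny hidden size, the random weights, and the input values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W and b stack the weights of the four gates."""
    hidden = h_prev.shape[0]
    # All gates read the previous hidden state and the current input.
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[:hidden])                   # forget gate
    i_t = sigmoid(z[hidden:2 * hidden])         # input gate
    c_hat = np.tanh(z[2 * hidden:3 * hidden])   # candidate cell state
    o_t = sigmoid(z[3 * hidden:])               # output gate
    c_t = f_t * c_prev + i_t * c_hat            # new cell state
    h_t = o_t * np.tanh(c_t)                    # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
hidden, n_in = 4, 1  # hypothetical sizes, far smaller than the 128 units in Table 5
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + n_in))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for value in [0.2, 0.4, 0.6]:  # a short, scaled input sequence (made up)
    h, c = lstm_step(np.array([value]), h, c, W, b)
```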

CNN–LSTM hybrid model

CNN, a deep learning model, consists of a convolutional layer, a pooling layer and a fully connected layer. The convolutional layer is the core of CNN, which performs cross-correlation operations on the feature matrix with the convolutional kernel to achieve feature extraction. The pooling layer speeds up the model operations by dimensionality reduction of the data while preserving the key features. In the fully connected layer, the processed features are regressed or classified. Considering the advantages of CNN in feature processing, a CNN–LSTM hybrid model is constructed in this research, and the conceptual diagram of this hybrid model is displayed in Figure 3.
Figure 3

Conceptual graph of the CNN–LSTM model.

The CNN–LSTM hybrid model uses a one-dimensional convolutional layer, Conv1D, which is commonly used to process sequence data. It slides the convolutional kernel along the temporal dimension and thereby learns the temporal properties of the sequence. The extracted features then enter the pooling layer, where Max Pooling reduces the model parameters to prevent overfitting. After CNN processing, the data enter the LSTM layer, which captures the temporal characteristics of the data, and finally a fully connected layer computes the prediction targets from the input information. The hybrid model retains the respective strengths of CNN and LSTM and enhances the robustness of the model. Table 6 lists the parameter settings of the CNN–LSTM network.

Table 6

Parameter setting of the CNN–LSTM model

Parameter                 Value (pH prediction)   Value (DO prediction)
In channels
Out channels              32                      32
Kernel size
Stride
Number of hidden layers
Hidden size               128                     128
Batch size                64                      64
Activation                ReLU                    ReLU
Optimizer                 Adam                    Adam
Learning rate             0.001                   0.001
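
The Conv1D, Max Pooling, LSTM, and fully connected pipeline described above might look roughly as follows in PyTorch. This is a sketch under stated assumptions, not the authors' implementation: the out channels (32), hidden size (128), batch size (64), ReLU activation, Adam optimizer, learning rate (0.001), and MSE loss are taken from the text, whereas the single input channel, kernel size 3, padding, pooling size 2, single LSTM layer, sequence length 10, and the class name are hypothetical.

```python
import torch
from torch import nn

class CNNLSTM(nn.Module):
    def __init__(self, in_channels=1, out_channels=32, kernel_size=3,
                 hidden_size=128, num_layers=1):
        super().__init__()
        # Convolution along the time axis extracts local patterns.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2)
        # The LSTM reads the pooled feature sequence.
        self.lstm = nn.LSTM(out_channels, hidden_size, num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x has shape (batch, seq_len, in_channels).
        x = x.permute(0, 2, 1)                  # -> (batch, channels, seq_len)
        x = self.pool(self.relu(self.conv(x)))  # -> (batch, 32, seq_len // 2)
        x = x.permute(0, 2, 1)                  # -> (batch, seq_len // 2, 32)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])           # predict from the last time step

model = CNNLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
batch = torch.randn(64, 10, 1)    # random stand-in for 64 windows of length 10
target = torch.randn(64, 1)
pred = model(batch)
loss = nn.MSELoss()(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```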

Table 7

Indicators for the evaluation of prediction models

Model      Parameter   MSE      MAE      RMSE     R2
MLP        pH          0.0446   0.1246   0.2113   0.3406
MLP        DO          1.7075   1.0279   1.3067   0.7434
XGBoost    pH          0.0224   0.1145   0.1499   0.3254
XGBoost    DO          2.6263   1.0728   1.6206   0.6053
LSTM       pH          0.0097   0.0821   0.0989   0.5531
LSTM       DO          0.0860   0.1471   0.2932   0.9814
CNN–LSTM   pH          0.0015   0.0310   0.0384   0.9328
CNN–LSTM   DO          0.0361   0.0869   0.1900   0.9922

Table 8

WQC prediction model metrics

Model      Accuracy   Precision   Recall   F1-score
RF         82.95%     79.64%      82.95%   81.23%
KNN        87.52%     78.79%      87.52%   82.92%
LightGBM   88.41%     78.87%      88.41%   83.37%
SVM        88.75%     78.77%      88.75%   83.46%

WQC model

This part of the research refers to the establishment of a mapping relationship between water quality parameters and WQC, which makes it possible to estimate water quality by mapping a limited number of water quality parameters to the corresponding WQC when the number of water quality parameters does not meet the requirements of calculating WQI. This part of the research is of practical significance for some areas where it is difficult to collect water quality parameters, and it provides an idea for low-cost water quality prediction. In addition, the classification of WQC is based on the interval of WQI values, which is more error-tolerant than the direct prediction of WQI.

Random Forest

RF is an ensemble algorithm that determines the final class by voting over the classification results of multiple classifiers, a property that makes its predictions stable and robust to unusual data. Each time a tree node is constructed, RF searches for split points among a subset of the features, and each tree is trained on a random sample of the dataset; both mechanisms effectively help prevent overfitting. Because multiple trees are built for classification, its generalization ability is relatively strong, but its computational cost is also relatively high. Water quality samples usually contain many parameters, and data collection spans years, so the amount of data is large. RF can process a large number of features and samples efficiently, which makes it suitable as a WQC model.

k-Nearest Neighbors

The current WQC criteria are based on WQI ranges with five classes, and the KNN algorithm is suitable for multi-category problems without additional labeling of the data. In addition, KNN classifies by the votes of neighboring samples, which can reduce the influence of outliers on classification performance, so this research includes the KNN model for comparison with the other classification models. The basic principle is to calculate the distance between every known sample and the sample to be predicted, select the K known samples closest to it, and assign it the category that accounts for the largest share of these K samples. The KNN model constructed in the experiment uses the Euclidean distance, and since only two water quality parameters are involved in the classification, the distance formula simplifies to Equation (6):
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \tag{6}$$
Here x and y denote the values of the two water quality parameters at each sample point, and the subscripts 1 and 2 identify the two sample points being compared.
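
The voting procedure can be sketched directly from this description. The function name, the choice K = 3, and the scaled (pH, DO) points with their labels are hypothetical and serve only to show the mechanics.

```python
import numpy as np
from collections import Counter

def knn_predict(train_xy, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    train_xy = np.asarray(train_xy, dtype=float)
    # Euclidean distance in the two-parameter plane.
    distances = np.sqrt(((train_xy - np.asarray(query)) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical min-max scaled (pH, DO) samples with made-up class labels.
train_xy = [[0.50, 0.80], [0.52, 0.78], [0.48, 0.82],
            [0.90, 0.20], [0.88, 0.25], [0.92, 0.18]]
train_labels = ["Good", "Good", "Good", "Poor", "Poor", "Poor"]

first = knn_predict(train_xy, train_labels, [0.51, 0.79])
second = knn_predict(train_xy, train_labels, [0.89, 0.22])
```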

Support Vector Machine

SVM originated in convex quadratic programming. Its basic idea is to identify a hyperplane in the feature space that divides the data and maximizes the margin between different classes. In water quality classification, a category with few samples is easily misclassified; because the SVM hyperplane separates the samples of different categories with the maximum margin, it can reduce the effect of an unevenly distributed dataset. Another key element of the SVM algorithm is the kernel technique: when the data are not linearly separable, the kernel function projects the features into a higher-dimensional space in which a decision boundary can be found, so the SVM model is also suitable for scenarios with more features.

Light gradient boosting machine

LightGBM is a framework based on GBDT that offers high efficiency, low memory use, and the ability to process massive data. Unlike the traditional DT algorithm, LightGBM adopts a histogram-based DT algorithm that discretizes the feature values and builds a distribution histogram to assign the data to bins, so that only the discrete boundaries need to be considered when dividing the data, which accelerates model computation. Another feature of LightGBM is the Gradient-based One-Side Sampling algorithm, which gives priority to data samples with larger gradients when calculating the information gain and randomly selects among samples with smaller gradients, preserving classification accuracy while reducing the data volume. These features give LightGBM excellent training speed, so the classification task can be carried out efficiently even with a large number of water samples. The model is also expected to support real-time prediction by being updated as new samples are collected.

Performance evaluation of prediction models

The water quality parameter prediction models and the WQC models proposed in the research will be evaluated separately by different indicators, and the model with the best performance will be selected based on the evaluation results. The equations below show how the evaluation metrics are calculated.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{7}$$
Mean squared error (MSE) is the average of the squared errors between the actual values y_i and the predicted values ŷ_i over n samples; the smaller the value, the more accurate the model prediction. Because of the squaring operation, MSE amplifies the effect of extreme values compared to MAE.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{8}$$
Root mean squared error (RMSE) is the arithmetic square root of MSE, which converts the error to the same unit as the target value and makes the error more intuitive to interpret.
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{9}$$
Mean absolute error (MAE) is the average of the absolute discrepancies between the actual and predicted values. All errors are equally weighted, so it reflects the overall error of the model well.
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{10}$$
R2 measures the performance of regression models by reflecting the extent to which the model explains the variation of the target variable, where ȳ is the mean of the actual values. The closer its value is to 1, the better the model fits.
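The four regression metrics can be computed together in a few lines. This is a minimal NumPy sketch using their standard definitions; the function name and the sample values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R2 for paired actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    # R2 compares the residual error with the spread around the mean.
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

# Made-up pH values: every prediction is off by 0.1.
metrics = regression_metrics([7.0, 7.2, 7.4, 7.6], [7.1, 7.1, 7.5, 7.5])
```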
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$
Accuracy is the ratio of correctly predicted observations to all observations and is the most intuitive metric for evaluating classification models. TP (true positive) is the number of positive cases correctly classified as positive; FP (false positive) is the number of negative cases misclassified as positive; FN (false negative) is the number of positive cases misclassified as negative; and TN (true negative) is the number of negative cases correctly predicted as negative.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{12}$$
Precision is the percentage of actually positive samples among the samples that the model predicts as positive; the higher the precision, the lower the likelihood of false alarms.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{13}$$
Recall is the proportion of all actual positive cases that are predicted to be positive; the higher the recall, the lower the likelihood that a positive case will be missed.
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{14}$$

When the model is adjusted so that more samples are predicted to be positive, the recall rises, but some negative samples are then incorrectly predicted to be positive, and the precision decreases. Conversely, when the precision increases, fewer negative samples are incorrectly predicted as positive, but the recall eventually decreases. To balance precision and recall, the model can be evaluated using the F1-score.
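
For a five-class problem such as WQC, the per-class values of these metrics have to be combined into a single score. The sketch below assumes support-weighted averaging, which the source does not state; the function name and labels are hypothetical.

```python
import numpy as np

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (averaging is an assumption)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    precision = recall = f1 = 0.0
    for c in np.unique(y_true):
        # Treat class c as "positive" and every other class as "negative".
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = np.mean(y_true == c)  # share of samples belonging to class c
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

# Made-up labels for illustration.
y_true = ["Good", "Good", "Good", "Poor", "Poor", "Excellent"]
y_pred = ["Good", "Good", "Poor", "Poor", "Good", "Good"]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
```

Under support-weighted averaging, recall equals accuracy by construction, which would be consistent with the identical accuracy and recall columns reported in Table 8.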

In this section, the prediction results of MLP, XGBoost, LSTM, and CNN–LSTM hybrid models for water quality parameters are shown and analyzed in order. The prediction results were plotted on the same graph with the real value of the water quality parameters to compare the deviation of the prediction results with the actual data. Finally, the models were evaluated by MSE, RMSE, and MAE to select the model with the best performance for predicting water quality parameters. Additionally, the predictions of the classification models for different classes of water quality will be discussed. The WQC model is evaluated by considering four indicators: accuracy, precision, recall, and F1-score, and the model with the best overall performance will be used for WQC prediction.

MLP model prediction analysis

Figure 4 shows the difference between the pH and DO values predicted by the MLP and the actual values. Overall, the MLP model overestimates pH. For the first 6,000 samples, the predicted values closely follow the trend of the actual values, but for samples after about 6,000 the predicted values far exceed the collected data. Although the plot shows a deviation of only about 1 unit, this is a large difference for pH: a deviation of 1 unit is sufficient to misclassify water from acidic to neutral. For DO, the MLP model predicts values above the actual values for the first 3,000 samples. For samples in the range of 3,000–5,000, the peak DO predictions begin to exceed the actual values. From sample 6,000 onward, the DO measurements are more variable, and the peaks of the predicted DO curve in this part of the sample are much lower than the actual values.
Figure 4

Comparison of predicted and actual values of pH and DO by the MLP model.

XGBoost model prediction analysis

The prediction results of the XGBoost model for pH and DO are shown in Figure 5, together with the measured data. According to Figure 5, the XGBoost model is more accurate for pH values below 8.5, where the discrepancy between the predicted and actual curves is slight. However, once the pH rises above 8.6, as in the last 1,000 samples on the graph, the pH prediction is much lower than the actual value. For some aquatic organisms such as fish, the pH range for survival is roughly between 6.5 and 8.5. Because XGBoost underestimates pH values above 8.6, the prediction may still be below 8.5 when the actual pH has already exceeded it, and regulators may therefore mistakenly believe that the water is still suitable for organisms. For DO prediction, the XGBoost predictions are close to the actual values when the DO is 8 or less. However, when the DO is in the high-level range, shown in the part of the curve for sample numbers 6,000–7,000, the XGBoost model again predicts lower values than the actual ones.
Figure 5

Comparison of predicted and actual values of pH and DO by the XGBoost model.

LSTM model prediction analysis

Figure 6 shows the prediction results of pH and DO by LSTM. Compared with the MLP and XGBoost models, the LSTM model predicts pH more accurately in the interval above 8.6, and the difference from the measured values is smaller than for the former two models. However, the pH prediction curve of the LSTM model shows two abrupt changes, in which the predicted values jump from about 8.5 to 9.0, a variation of 0.5, indicating that the model is still not stable enough. The LSTM model significantly outperforms the MLP and XGBoost models in DO prediction, and the DO prediction curve is almost identical to the measured curve for DO values below 10. However, the LSTM model still shows some errors for high DO values (DO > 10).
Figure 6

Comparison of predicted and actual values of pH and DO by the LSTM model.

CNN–LSTM hybrid model prediction analysis

Figure 7 shows the pH and DO predictions of the CNN–LSTM hybrid model. Compared with the single LSTM, the pH prediction curve of the CNN–LSTM hybrid model is closer to the actual value curve and shows no sudden changes, which indicates that the hybrid model predicts pH more stably than the single LSTM model. The predicted values of the first 300 sequences are mostly close to the actual values. The prediction deviation increases in the second half of the curve, but the maximum error does not exceed 0.1 and the MAE is only 0.0310, which is within the acceptable error range for pH. In addition, the CNN–LSTM hybrid model further improves on the LSTM model when the DO is below 10, where the prediction curve almost covers the actual value curve. Although the hybrid model still cannot fully fit the data changes when the DO is 10 or above, the distance between its predicted and actual curves is much smaller than that of the single LSTM model.
Figure 7

Comparison of predicted and actual values of pH and DO by the CNN–LSTM model.

Analysis of evaluation indicators

The evaluation indexes of each prediction model are listed in Table 3. Overall, among the four water quality parameter prediction models, the MLP model performed the worst in predicting pH, with MSE, MAE, and RMSE of 0.0446, 0.1246, and 0.2113, respectively, the highest among all models. In the prediction of DO, however, the MSE, MAE, and RMSE of the XGBoost model were higher than those of the MLP model, making it the model with the largest deviation from the actual values among the four. The pH distribution in the data set is relatively concentrated, while the DO data fluctuate greatly. It is worth noting that in Table 7 the LSTM pH prediction model is much more effective than the MLP and XGBoost models (MSE = 0.0097, MAE = 0.0821, RMSE = 0.0989), and its MSE falls below 0.01. For DO prediction, the MSE of the LSTM model is reduced by 95%, the MAE by about 85.6%, and the RMSE by about 77.6% compared with the MLP model. Compared with the XGBoost model, the MSE, MAE, and RMSE are decreased by about 96.8, 86.3, and 81.9%, respectively.
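The relative reductions quoted in this analysis follow from a simple ratio. As a minimal sketch, the helper below (the name `percent_reduction` is illustrative) reproduces the comparison from the pH errors reported in the text for the MLP and LSTM models.

```python
def percent_reduction(baseline, improved):
    """Relative decrease of an error metric, in percent of the baseline value."""
    return (baseline - improved) / baseline * 100.0

# pH errors reported in the text for the MLP and LSTM models.
mlp_ph = {"MSE": 0.0446, "MAE": 0.1246, "RMSE": 0.2113}
lstm_ph = {"MSE": 0.0097, "MAE": 0.0821, "RMSE": 0.0989}

reductions = {
    name: percent_reduction(mlp_ph[name], lstm_ph[name]) for name in mlp_ph
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% lower for LSTM than for MLP")
```

The MSE shrinks far more than the MAE or RMSE because squaring amplifies the few large errors that the MLP model makes.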

Since the MSE is calculated by squaring the error, the larger the error value, the greater its impact on the MSE. Combining Figures 4 and 5, it can be seen that the MLP and XGBoost models have large errors in pH and DO prediction for samples numbered between 6,000 and 7,000, while the LSTM and CNN–LSTM models have only a few large prediction biases. Therefore, the MSEs obtained by the MLP and XGBoost models are much higher than those obtained by the LSTM and CNN–LSTM hybrid models. MAE is the average of the absolute errors of all the samples and intuitively reflects the overall prediction bias of the model, while RMSE takes the square root of the MSE, which returns the error to the units of the measured variable and reduces the influence of large errors relative to the MSE. Taken together, the difference between the four models is most obvious in MSE, which suggests that improving prediction performance depends on the ability to reduce extreme prediction bias. Observing Figures 4–7, it can be seen that the extreme prediction bias mainly occurs in the last 1,000 samples, which are characterized by a sudden increase in pH and DO values that then remain in a high-level range. The LSTM and CNN–LSTM models can capture the trend within a time series and adapt to the characteristics of the data at different time points, whereas the prediction accuracy of MLP and XGBoost suffers from the large variations in the samples.

For pH prediction in Table 8, the MSE, MAE, and RMSE of the hybrid CNN–LSTM model were reduced by about 84.5, 62.2, and 61.2%, respectively, compared with the single LSTM model. For DO prediction, the MSE, MAE, and RMSE of the hybrid CNN–LSTM model were 58, 40.9, and 35.2% lower than those of the single LSTM model, respectively. R² reflects how much of the variation in the observed variable is explained by the model, and the closer the value is to 1, the better the model fit. For pH prediction, the R² values of the MLP, XGBoost, LSTM, and CNN–LSTM models are 0.3406, 0.3254, 0.5531, and 0.9328, respectively. Compared with the MLP and XGBoost models, the fit of LSTM is improved by about 60%, and that of the hybrid CNN–LSTM model is improved by a further 69% compared with LSTM. The performance of the single LSTM model is thus significantly improved by the introduction of the CNN network. For DO prediction, the R² values of MLP, XGBoost, LSTM, and CNN–LSTM are 0.7434, 0.6053, 0.9814, and 0.9922, respectively. Although the LSTM model can already fit the variation pattern of DO, the hybrid model still achieves the highest R², and nearly 99% of the sample variation can be represented by the CNN–LSTM model.

In the CNN–LSTM hybrid model, a one-dimensional convolutional layer is used to extract features in the time dimension, and a maximum pooling layer is used to reduce the model parameters and prevent overfitting. By incorporating the CNN network, the LSTM model can further improve its capability of processing time series data. The experimental results demonstrated that the performance of the hybrid model was the highest among the four models for both pH and DO predictions.
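To make the role of the convolutional and pooling layers more concrete, the sketch below applies a one-dimensional convolution followed by max pooling to a short window of readings, producing the shorter feature sequence that an LSTM layer would then consume. It is an illustration under stated assumptions, not the study's architecture: the window values, the kernel, the ReLU activation, and the pool size are all hypothetical, and in a trained network the kernel weights are learned.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (cross-correlation, as in most deep learning libraries)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]

def relu(values):
    """Rectified linear activation: negative responses are set to zero."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, pool_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]

# Hypothetical window of DO readings with a sudden rise, for illustration only.
window = [8.0, 8.1, 8.3, 9.0, 10.2, 10.4, 10.3, 9.8]
# Hand-picked kernel that responds to increases over two time steps.
kernel = [-1.0, 0.0, 1.0]

features = max_pool1d(relu(conv1d(window, kernel)), pool_size=2)
print(features)  # three pooled features summarizing eight raw readings
```

Pooling halves the length of the convolved sequence, which is how the layer reduces the amount of information passed to the recurrent part of the network.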

WQC model analysis

The dataset is split so that 70% of the samples form the training set, and the trained models are evaluated on the held-out test set. The confusion matrices of the RF, KNN, LightGBM, and SVM models are shown in Figure 8. The confusion matrix is a visual representation of the classification performance of a model: by comparing the distribution of predicted and actual categories, it shows how well the model classifies each category. Since the models are trained with the category data encoded as numerical data, the graphs use 0, 1, and 2 to represent the three water quality levels of excellent, good, and poor, respectively.
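The label encoding and the 70% training split can be sketched as follows. This is a minimal illustration: the sample rows are hypothetical, and the random shuffle with a fixed seed is an assumption, since the text does not state whether the split was random or chronological.

```python
import random

# Encoding stated in the text: excellent -> 0, good -> 1, poor -> 2.
LABEL_CODES = {"excellent": 0, "good": 1, "poor": 2}

def encode_labels(labels):
    """Map class names to the integer codes used for training."""
    return [LABEL_CODES[label] for label in labels]

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical (pH, DO, class) samples, for illustration only.
samples = [
    (7.2, 8.1, "excellent"), (7.3, 8.4, "excellent"), (7.4, 8.0, "excellent"),
    (7.5, 8.6, "excellent"), (7.6, 8.2, "excellent"), (7.7, 8.3, "excellent"),
    (7.8, 8.5, "excellent"), (8.6, 6.1, "good"), (8.7, 5.9, "good"),
    (9.1, 3.8, "poor"),
]

train_rows, test_rows = split_train_test(samples)
train_labels = encode_labels([row[2] for row in train_rows])
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

With strongly imbalanced classes, a stratified split that preserves class proportions in both parts may be worth considering, although the text does not say whether one was used.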
Figure 8

Confusion matrix of the classification model: (a) RF, (b) KNN, (c) LightGBM, and (d) SVM.

The values on the main diagonal of the confusion matrix give the number of samples whose predicted category matches the actual category, so larger values on the main diagonal mean higher accuracy of the classification model. The total number of samples on the main diagonal for SVM is 505, the highest of the four classification models. In addition, samples of the excellent water quality class account for the largest proportion of the dataset. The SVM model has the highest recognition rate for the excellent category. For the samples with good water quality, the RF model classified only two cases correctly and the other models classified none, which indicates that the four models are not sensitive to samples with good water quality. Similarly, for the data with poor water quality, only the RF model classified any samples correctly, while KNN, LightGBM, and SVM assigned the poor-quality data to the excellent category, which poses a risk of misclassification in water quality prediction. The proportion of water quality categories in the dataset is not uniform, with 87% of samples at the excellent level, 10% at the good level, and only 3% at the poor level, which leads to the poor prediction of good and poor water quality data. To further improve the performance of the model, future research can use more evenly distributed samples for training.
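How the main diagonal relates to accuracy can be shown with a small sketch. The ten labels below are hypothetical and use the 0, 1, 2 coding from the text; they are not the study's test data.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Share of samples that fall on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels: 0 = excellent, 1 = good, 2 = poor.
actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
predicted = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

matrix = confusion_matrix(actual, predicted, n_classes=3)
for row in matrix:
    print(row)
print("accuracy:", accuracy_from_matrix(matrix))
```

In this toy case the single poor-quality sample lands in the excellent column, the same kind of off-diagonal error that the text identifies as risky.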

Table 4 shows the evaluation index values of the four classification models. The SVM model has the highest accuracy among the four models. A higher accuracy means that the model correctly classifies more samples, which indicates that the SVM model is more reliable than the other three models in its overall classification results. However, the sample sizes of the water quality categories in the dataset were not balanced, so a model could classify all samples into the largest water quality class and still obtain a good accuracy. The best model therefore could not be selected based on accuracy alone. The second indicator considered was precision, which in this experiment reflects how many of the samples assigned to each category actually belong to the corresponding water quality level. Among the four classification models, RF achieved the highest precision, 79.64%, while the SVM model was slightly lower at 78.77%, and the overall difference between the four models was only about 1%. A higher precision indicates that samples are assigned to a particular water quality class with more confidence, but when the classes are unbalanced, the model may prefer to predict the class with the largest proportion. To balance the predictive power of the model, it is also necessary to consider recall.
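The point about accuracy under class imbalance can be checked with a trivial baseline. The sketch below builds a hypothetical set of 100 labels matching the class shares reported in the text (87% excellent, 10% good, 3% poor) and scores a "model" that always predicts the excellent class.

```python
# Hypothetical sample built to match the reported class shares.
actual = [0] * 87 + [1] * 10 + [2] * 3   # 0 = excellent, 1 = good, 2 = poor
predicted = [0] * len(actual)            # baseline: always predict "excellent"

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall_for(cls):
    """Share of the samples of one class that the baseline recognizes."""
    support = sum(1 for a in actual if a == cls)
    hits = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
    return hits / support

print("accuracy:", accuracy)
print("recall per class:", [recall_for(c) for c in (0, 1, 2)])
```

The baseline reaches 87% accuracy while recognizing none of the good or poor samples, which is why accuracy has to be read alongside precision and recall.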

For recall, the SVM model scores 88.75%, slightly higher than the LightGBM model (88.41%), followed by the KNN model (87.52%), while the RF model is the lowest at only 82.95%. Recall measures the ability of the model to recognize each category. The higher recall of the SVM model indicates that it performs better in classifying samples of different classes. When the sample categories are imbalanced, some categories have few samples, and a high-recall model helps ensure that these categories are identified.
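The SVM accuracy and recall are both reported as 88.75%. One possible reading, which is an assumption and is not stated in the text, is that the multi-class recall was averaged with class-support weights, in which case it is mathematically identical to accuracy. The sketch below shows this identity on hypothetical labels and contrasts it with the unweighted (macro) average.

```python
def per_class_recall(actual, predicted):
    """Recall and support (sample count) for every class present in `actual`."""
    result = {}
    for cls in sorted(set(actual)):
        support = sum(1 for a in actual if a == cls)
        hits = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
        result[cls] = (hits / support, support)
    return result

def weighted_recall(actual, predicted):
    """Average of per-class recall, weighted by each class's share of samples."""
    stats = per_class_recall(actual, predicted)
    return sum(r * s for r, s in stats.values()) / len(actual)

def macro_recall(actual, predicted):
    """Unweighted average of per-class recall."""
    stats = per_class_recall(actual, predicted)
    return sum(r for r, _ in stats.values()) / len(stats)

# Hypothetical labels: 0 = excellent, 1 = good, 2 = poor.
actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
predicted = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print("accuracy       :", accuracy)
print("weighted recall:", weighted_recall(actual, predicted))
print("macro recall   :", macro_recall(actual, predicted))
```

Under this reading, the macro average would be the more sensitive indicator of how well the rare good and poor classes are recognized.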

For water quality prediction, when water quality is predicted to be excellent or good but the actual water quality is poor, such errors will cause managers to miss the critical time for water quality protection. The classification model should therefore have as few false positives as possible, which means that its precision should be high. On the other hand, the model is expected to identify as many positive samples as possible, so as to avoid the waste of manpower and material resources caused by unnecessary water quality protection measures triggered by false negatives, so the recall of the model also needs to be taken into consideration. The F1-score is the harmonic mean of precision and recall, which balances the two and ensures higher stability of the model in the face of unbalanced data. The F1-score of the SVM model is also the highest among the four classification models. Taken together, the RF model has the lowest accuracy and recall: most of the test samples are of excellent water quality, and the RF model has the weakest ability among the four models to identify samples of this level, which affects its performance. The KNN model over-predicted the excellent class, classifying some samples with good and poor grades as excellent. The LightGBM model is close to the SVM model and can be used as an alternative. To ensure classification accuracy while balancing the recognition ability for different categories, the SVM model is selected as the best classification model in the experiment after considering accuracy and F1-score. Water quality tends to remain stable at one level over the long term, and the data were collected continuously over a certain period of time, which resulted in an uneven distribution of water quality classes in the samples and limited the ability of the models to recognize the other water quality categories. In future experiments, data from longer time periods or different regions can be introduced to test the classification performance of the model for each water quality category.
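As a quick consistency check on the harmonic-mean definition, the SVM F1-score can be recomputed from the precision (78.77%) and recall (88.75%) reported in the text. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the SVM model as reported in the text, in percent.
svm_f1 = f1_score(78.77, 88.75)
print(f"SVM F1-score: {svm_f1:.2f}%")
```

The result agrees with the reported F1-score of 83.46%. The harmonic mean sits closer to the lower of the two inputs than the arithmetic mean would, so a model cannot achieve a high F1-score by excelling in only one of them.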

Comparison with previous literature studies

Table 9 shows the performance difference between the CNN–LSTM water quality parameter prediction model proposed in this study and other models developed in recent literature. The CEEMDAN–XGBoost model proposed by Lu & Ma (2020) uses six different parameters as inputs, and overall its performance differs markedly from that of the models proposed in other studies. The TCN model proposed by Fu et al. (2021) uses only five inputs for prediction. In pH prediction, the accuracy of the model proposed in this study is slightly higher than that of the TCN model, while for DO prediction, the MSE, MAE, and RMSE obtained by the model in this study are lower than those of the TCN by about 97.0, 78.4, and 82.8%, respectively (calculated from the values in Table 9), which is a significant improvement in performance. Compared with the LSTM model proposed by Eze et al. (2021), the MSE of the hybrid model proposed in this study is 99.6% lower in pH prediction, but all three indicators of the model proposed in this study are higher than those of the LSTM model in DO prediction. In general, among the four models, the hybrid model proposed in this study has the lowest error in pH prediction with the fewest input parameters, and although its error in DO prediction is higher than that of the LSTM model, the error values are within the acceptable range and do not lead to serious judgment errors.

Table 9

Comparison of the performance of the water quality parameter prediction model proposed in this study with other models in the literature

Scholar             Model             Number of inputs   Prediction parameter   MSE      MAE      RMSE
Lu & Ma (2020)      CEEMDAN–XGBoost   6                  pH                     0.04     0.02     0.02
                                                         DO                     0.04     0.19     0.20
Fu et al. (2021)    TCN               5                  pH                     0.0025   0.0214   0.0505
                                                         DO                     1.2201   0.4014   1.1046
Eze et al. (2021)   LSTM                                 pH                     0.3889   0.0140   0.6236
                                                         DO                     0.0013   0.0262   0.0355
Current study       CNN–LSTM                             pH                     0.0015   0.0310   0.0384
                                                         DO                     0.0361   0.0869   0.1900

Table 10 shows the performance difference between the SVM model proposed in this study and other WQC prediction models developed in the literature. Ahmed et al. (2019) proposed an MLP model that used 6 parameters as inputs for WQC. Its accuracy was close to that of the SVM model developed in this study, but its precision and recall were much lower, and its F1-score was 26.97 percentage points lower. The KNN DT model proposed by Nasir et al. (2022) used 7 variables as inputs. Its accuracy, F1-score, and recall were lower than those of the model proposed in this study, while its precision was higher than that of the SVM model. Compared with the other models in the literature, the SVM model developed in this study has higher accuracy, and the use of only two parameters as inputs greatly reduces the cost of data collection and the model running time, which is important for the development of low-cost water quality systems.

Table 10

Comparison of the performance of the WQC prediction model proposed in this study with other models in the literature

Scholar               Model   Number of inputs   Accuracy   Precision   Recall   F1-score
Ahmed et al. (2019)   MLP     6                  85.07%     56.59%      56.40%   56.49%
Nasir et al. (2022)   DT      7                  81.62%     81.69%      81.63%   81.56%
Current study         SVM     2                  88.75%     78.77%      88.75%   83.46%

Pollution of water sources has always been an important issue worldwide, and since it is difficult to stop pollution at its source, water sources must be monitored regularly so that pollution can be prevented in advance. Water quality prediction is key to solving this problem: based on the prediction results, staff can judge the status of the water source and take appropriate protective measures. To this end, water quality parameters and WQC were used as two forms of water quality assessment, and experiments on water quality parameter prediction and WQC prediction were carried out to determine the prediction models with better performance.

In the experiments on predicting water quality parameters, MLP, XGBoost, LSTM, and CNN–LSTM hybrid models were constructed for predicting DO and pH. In terms of pH prediction, the MSE of the LSTM model is 78.3% lower than that of the MLP model and 56.7% lower than that of the XGBoost model. In terms of DO prediction, the MSE of the LSTM model is 95 and 96.7% lower than those of the MLP and XGBoost models, respectively, and its MAE is 85.6% lower than that of the MLP model. Nevertheless, the LSTM model still has some defects in pH prediction: the predicted values of some samples fluctuate drastically, resulting in an increase in error. To obtain more stable prediction results, the CNN–LSTM hybrid model was developed. The errors of the CNN–LSTM hybrid model are greatly reduced in pH prediction, with MSE, MAE, and RMSE reduced by 84.5, 62.2, and 61.2%, respectively, relative to the LSTM model. In addition, the CNN–LSTM hybrid model addresses the shortcomings of the LSTM model in high-level DO prediction and reduces the gap with the actual values, with the three metrics decreasing by 58, 40.9, and 35.2%, respectively, compared with the LSTM model.

For the WQC, the aim of this research is to classify water quality according to a small number of water quality parameters. In this part, RF, KNN, LightGBM, and SVM models were developed to classify water quality according to pH and DO. According to the confusion matrices, the KNN, LightGBM, and SVM models tend to misclassify samples as excellent and are less sensitive to the good and poor levels. The RF model is able to distinguish between good and poor water quality, but its overall recall is low. This situation is related to the sample distribution of the data set: about 87% of the samples are of excellent water quality, and samples of the good and poor grades are scarce, resulting in a weaker classification effect on the good and poor levels. The sample size can be increased in subsequent research to improve the classification effect. To compare the model capabilities more intuitively, this research also used accuracy, precision, recall, and F1-score to evaluate the models. The SVM model has the lowest precision (78.77%) among the four classification models, but it differs by only 0.87% from the RF model, which has the highest precision. The SVM model was the highest in the other three metrics (accuracy: 88.75%, recall: 88.75%, F1-score: 83.46%). The model should avoid, as far as possible, categorizing poor water quality data (poor quality and below) as good-quality water. At the same time, the model should also ensure that high-quality water is not misclassified as poor-quality water, which would result in wasted resources. Considering the four metrics together, the SVM model was identified as the final WQC model.

Future research directions

In summary, the CNN–LSTM water quality parameter prediction model and the SVM WQC model require less water quality data, which is of great significance for regions where the cost of measuring water quality data is high. Future research could take advantage of the extensibility of the two models and address the shortcomings of this research to achieve a more comprehensive automated water quality prediction system. The following points can be considered:

  • Depending on the water quality parameters available in the dataset, the prediction system can adjust the input feature combinations according to the output results.

  • The prediction interval can be manually adjusted to achieve water quality prediction at multiple time scales.

  • The machine learning models used in the research can be componentized, and as the field of machine learning develops, more algorithms can be integrated into the water quality prediction system.

  • Incorporating the different criteria used to calculate WQI in different regions would make the model more generalizable.
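One way to picture the componentized system suggested above is a pair of registries from which a forecaster and a classifier are chosen by name. The sketch below is purely illustrative: the registry names, the persistence forecaster, and the threshold rule with its cut-off values are hypothetical placeholders, not the CNN–LSTM or SVM models of this study and not its classification criteria.

```python
from typing import Callable, Dict, Sequence, Tuple

Reading = Tuple[float, float]                      # (pH, DO)
Forecaster = Callable[[Sequence[Reading]], Reading]
Classifier = Callable[[float, float], str]

FORECASTERS: Dict[str, Forecaster] = {}
CLASSIFIERS: Dict[str, Classifier] = {}

def register(registry, name):
    """Decorator that adds a component to a registry under the given name."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator

@register(FORECASTERS, "persistence")
def persistence_forecast(history):
    """Placeholder forecaster: repeat the latest (pH, DO) reading."""
    return history[-1]

@register(CLASSIFIERS, "threshold")
def threshold_classifier(ph, do):
    """Placeholder rule with made-up thresholds, not the study's criteria."""
    if 6.5 <= ph <= 8.5 and do >= 7.0:
        return "excellent"
    if 6.0 <= ph <= 9.0 and do >= 5.0:
        return "good"
    return "poor"

def predict_class(history, forecaster="persistence", classifier="threshold"):
    """Forecast the next (pH, DO) pair, then classify the forecast."""
    ph, do = FORECASTERS[forecaster](history)
    return CLASSIFIERS[classifier](ph, do)

# Hypothetical recent readings, for illustration only.
history = [(7.2, 8.1), (7.3, 8.0), (7.4, 7.9)]
result = predict_class(history)
print(result)
```

With this layout, a newly developed forecasting or classification algorithm could be added by registering one more function, without changing the code that calls `predict_class`.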

The project idea and research design of this study were given by Z.C. H.G. analyzed the data, implemented methodology, and wrote the first draft of the manuscript. F.Y.T. assisted with the research framework design. Z.C. and F.Y.T. supervised the project. All authors read and approved the final manuscript.

The authors declare that no funds, grants, or other support was received during the preparation of this manuscript.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Ahmed U., Mumtaz R., Anwar H., Shah A. A., Irfan R. & García-Nieto J. (2019) Efficient water quality prediction using supervised machine learning, Water, 11 (11), 2210. https://doi.org/10.3390/w11112210.
Barzegar R., Aalami M. T. & Adamowski J. (2020) Short-term water quality variable prediction using a hybrid CNN–LSTM deep learning model, Stochastic Environmental Research and Risk Assessment, 34 (2), 415–433. https://doi.org/10.1007/s00477-020-01776-2.
Eze E. & Ajmal T. (2020) Dissolved oxygen forecasting in aquaculture: A hybrid model approach, Applied Sciences, 10 (20), 7079. https://doi.org/10.3390/app10207079.
Eze E., Halse S. & Ajmal T. (2021) Developing a novel water quality prediction model for a South African aquaculture farm, Water, 13 (13), 1782. https://doi.org/10.3390/w13131782.
Fu Y., Hu Z., Zhao Y. & Huang M. (2021) A long-term water quality prediction method based on the temporal convolutional network in smart mariculture, Water, 13 (20), 2907. https://doi.org/10.3390/w13202907.
Hmoud Al-Adhaileh M. & Waselallah Alsaade F. (2021) Modeling and prediction of water quality by using artificial intelligence, Sustainability, 13 (8), 4259. https://doi.org/10.3390/su13084259.
Ho J. Y., Afan H. A., El-Shafie A. H., Koting S. B., Mohd N. S., Jaafar W. Z. B., Lai Sai H., Malek M. A., Ahmed A. N., Mohtar W. H. M. W., Elshorbagy A. & El-Shafie A. (2019) Towards a time and cost effective approach to water quality index class prediction, Journal of Hydrology, 575, 148–165. https://doi.org/10.1016/j.jhydrol.2019.05.016.
Kannel P. R., Lee S., Lee Y.-S., Kanel S. R. & Khan S. P. (2007) Application of water quality indices and dissolved oxygen as indicators for river water classification and urban impact assessment, Environmental Monitoring and Assessment, 132 (1–3), 93–110. https://doi.org/10.1007/s10661-006-9505-1.
Liu J., Yu C., Hu Z., Zhao Y., Bai Y., Xie M. & Luo J. (2020) Accurate prediction scheme of water quality in smart mariculture with deep Bi-S-SRU learning network, IEEE Access, 8, 24784–24798. https://doi.org/10.1109/ACCESS.2020.2971253.
Lu H. & Ma X. (2020) Hybrid decision tree-based machine learning models for short-term water quality prediction, Chemosphere, 249, 126169. https://doi.org/10.1016/j.chemosphere.2020.126169.
Najah A., Teo F. Y., Chow M. F., Huang Y. F., Latif S. D., Abdullah S., Ismail M. & El-Shafie A. (2021) Surface water quality status and prediction during movement control operation order under COVID-19 pandemic: Case studies in Malaysia, International Journal of Environmental Science and Technology, 18 (4), 1009–1018. https://doi.org/10.1007/s13762-021-03139-y.
Nasir N., Kansal A., Alshaltone O., Barneih F., Sameer M., Shanableh A. & Al-Shamma'A A. (2022) Water quality classification using machine learning algorithms, Journal of Water Process Engineering, 48, 102920. https://doi.org/10.1016/j.jwpe.2022.102920.
Othman F., Alaaeldin M. E., Seyam M., Ahmed A. N., Teo F. Y., Ming Fai C., Afan H. A., Sherif M., Sefelnasr A. & El-Shafie A. (2020) Efficient river water quality index prediction considering minimal number of inputs variables, Engineering Applications of Computational Fluid Mechanics, 14 (1), 51–763. https://doi.org/10.1080/19942060.2020.1760942.
Posthuma L., Suter II G. W. & Traas T. P. (eds.) (2001) Species sensitivity distributions in ecotoxicology, CRC Press EBooks, 20, Dec. 2001. https://doi.org/10.1201/9781420032314.
Rasheed Abdul Haq K. P. & Harigovindan V. P. (2022) Water quality prediction system based on adam optimised LSTM neural network for aquaculture: A case study in Kerala, India, J. Inst. Eng. India Ser. B, 103, 2177–2188. https://doi.org/10.1007/s40031-022-00806-7.
Schwarzenbach R. P., Egli T., Hofstetter T. B., Von Gunten U. & Wehrli B. (2010) Global water pollution and human health, Annual Review of Environment and Resources, 35 (1), 109–136. https://doi.org/10.1146/annurev-environ-100809-125342.
Tyagi S., Sharma B., Singh P. & Dobhal R. (2013) Water quality assessment in terms of water quality index, American Journal of Water Resources, 1 (3), 34–38. https://doi.org/10.12691/ajwr-1-3-3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).