Machine learning (ML), a branch of artificial intelligence (AI), has been increasingly used in environmental engineering because it can analyze complex nonlinear problems (such as those connected with water quality management) through a data-driven approach. This study provides an overview of different ML algorithms applied for monitoring and predicting river water quality. Different parameters could be monitored or predicted, such as dissolved oxygen (DO), biological and chemical oxygen demand (BOD and COD), turbidity levels, the concentration of different ions (such as Mg2+ and Ca2+), the concentration of heavy metals or other pollutants, pH, temperature, and many more. Although many algorithms have been investigated for the prediction of river water quality, several are most commonly used in engineering practice. These models mostly include so-called supervised learning algorithms, such as artificial neural network (ANN), support vector machine (SVM), random forest (RF), decision tree (DT), and deep learning (DL). To further enhance prediction power, novel hybrid algorithms could be used. However, the quality of prediction depends not only on the applied algorithm but also on the availability of the previously mentioned water quality parameters, their selection, and the combination of input data used to train the ML model.

  • Classification, prediction, and anomaly detection algorithms were reviewed.

  • Hydrometeorology data can be used to compensate for missing parameter data.

  • Algorithms can struggle with generalization aspects important for real applications.

  • Covering critical sampling points and periods could enhance prediction accuracy.

  • Hybrid models could overcome the limitations of single models.

In recent decades, the surface water quality (WQ) of rivers has been negatively impacted by pollutants and wastes (Khullar & Singh 2021). Deteriorated WQ may have serious consequences for humans, aquatic life, and the environment in general. Moreover, climate change puts additional pressure on surface WQ by reducing WQ during the low-flow seasons and increasing the river water temperature over the year. The required quality of surface water is defined by the framework directive on water and the law on water. In accordance with the Directive (2000/60/EC), the main goal is the sustainable management of all water systems, which involves determining the impacts and pressures on water bodies because these are the main causes of pollution. The law on water includes decrees that define the issue in more detail. The most significant are the decree on limit values of emission of polluting substances in water and the decree on limit values of priority and priority hazardous substances that pollute surface waters. These decrees specify the limit values of the main parameters that define the required WQ (‘Sl. glasnik RS’, br. 50/2012, 24/2014). To define WQ, different data collection techniques could be applied, such as sampling and analysis in the field, laboratory analyses, and the application (appl.) of monitoring sensors that operate in real time. High-quality sensors can be quite expensive, need regular maintenance to function optimally, and require calibration to ensure accurate WQ data. With the development of technology, research is moving toward optimized ways of managing WQ (Ahmed et al. 2019a, 2019b; Park et al. 2020). This research intends to analyze the existing challenges and find opportunities for the appl. of machine learning (ML) algorithms (algo.) in river WQ management. It covers several aspects of ML appl. within the following issues: (a) WQ estimation and prediction (Wagle et al. 2020; Khullar & Singh 2021); (b) WQ classification (Abuzir & Abuzir 2022); and (c) WQ anomaly detection (Russo et al. 2021). Exhaustive investigation of these issues enables the development of ML algo. as a tool for the decision-making process in the river basin, with the ultimate goal of improving the river's health and maintaining wildlife.

The term ‘water quality’ is often defined in terms of chemical, physical, and biological water indicators (Antanasijević et al. 2013). Different parameters could affect and serve as indicators of WQ, which could be expressed through the water quality index (WQI) and water quality classes (WQCs). Among the most frequently monitored parameters are chemical (DO, COD, BOD, total dissolved solids (TDS), nitrates, pH, etc.), physical (water temperature (WT), turbidity, electrical conductivity (EC), solids, etc.), and biological (chlorophyll-a) ones. Precise definitions of several of them and their contribution to WQ have been published by Khullar & Singh (2022) and Syeed et al. (2023).

Water quality index and water quality classes

WQI is a dimensionless, single number that indicates the status of WQ (Gorde & Jadhav 2013; Sutadian et al. 2016). As mentioned, it depends on different water parameters that reflect WQ. WQI is calculated by the following equation:
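$$\mathrm{WQI} = \frac{\sum_{i=1}^{n} q_i w_i}{\sum_{i=1}^{n} w_i} \qquad (1)$$
where q_i is the value of parameter i in the range of 0–100, w_i is the weight of that parameter, and n is the number of parameters (Ahmed et al. 2019a, 2019b).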

Based on the WQI, WQC for each water body could be established. An example of such classification is provided in Table 1 (Ahmed et al. 2019a, 2019b).

Table 1

WQ classification (Ahmed et al. 2019a, 2019b)

WQI rate | Classification
0–25 | Very bad
25–50 | Bad
50–70 | Medium
70–90 | Good
90–100 | Excellent
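
As an illustration of how a WQI value can be mapped to the classes in Table 1, the following minimal Python sketch computes a weighted-average index and looks up its class. It assumes the common weighted-mean form of the index (sub-index values q_i on a 0–100 scale and weights w_i); the function names, the example values, and the choice to treat each lower class boundary as inclusive are illustrative assumptions, not part of the reviewed studies.

```python
from typing import Sequence

# Lower class bounds taken from Table 1. Treating each lower bound as
# inclusive is an assumption, because adjacent ranges share their end points.
WQI_CLASSES = [
    (90.0, "Excellent"),
    (70.0, "Good"),
    (50.0, "Medium"),
    (25.0, "Bad"),
    (0.0, "Very bad"),
]


def wqi(q: Sequence[float], w: Sequence[float]) -> float:
    """Weighted mean of sub-index values q_i (0-100) with weights w_i."""
    if len(q) != len(w) or len(q) == 0:
        raise ValueError("q and w must be non-empty and of equal length")
    total_weight = sum(w)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(qi * wi for qi, wi in zip(q, w)) / total_weight


def classify(value: float) -> str:
    """Return the Table 1 class for a WQI value between 0 and 100."""
    if not 0.0 <= value <= 100.0:
        raise ValueError("WQI must lie in the range 0-100")
    for lower_bound, label in WQI_CLASSES:
        if value >= lower_bound:
            return label
    raise AssertionError("unreachable for values in 0-100")


# Hypothetical sub-index values and weights for three parameters.
example_wqi = wqi([80, 60, 90], [5, 3, 2])
example_class = classify(example_wqi)
```

With the hypothetical inputs above, the index is (400 + 180 + 180) / 10 = 76, which falls into the "Good" class.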

ML has a wide appl. in environmental science (ES) and engineering (EE), thanks to its high precision, flexible customization, and ability to resolve complex data patterns (Maganathan et al. 2020). According to Zhong et al. (2021), from 1990 to 2020, 5,855 publications resulted from ML appl. in EE in the fields of water (47.63%), air (27.32%), soil (21.02%), and sediment (4.02%). Four general appl. of ML in the field of ES and EE are provided by Zhu et al. (2022): (1) making predictions, (2) identifying feature importance, (3) detecting anomalies, and (4) discovering new materials or chemicals. Appl. of (1) and (3) are mostly reflected in supervised (regression or classification) learning (SupVL), but also, to a lesser extent, in unsupervised (clustering) learning (unSupVL). Techniques within (2) and (4) can be implemented through SupVL, using for example linear discriminant analysis (LDA), a classification technique, for (2). SupVL is dominantly applied for EE issues, such as the prediction of particulate matter (PM2.5), water resource availability, and modeling of biochemical wastewater treatment systems (Zhong et al. 2021). Over the past few decades, different ML models have been developed to solve various water engineering management problems (Syeed et al. 2023). Surface WQ profiling is one of the high priorities, especially in developing countries. According to Zhu et al. (2022), ML algo. applied in WQ evaluation of surface river waters are bootstrapped wavelet neural network (BWNN), ANN, autoregressive integrated moving average (ARIMA), bootstrapped artificial neural network (BANN), long short-term memory (LSTM), Nash–Sutcliffe efficiency (NSE), polynomial neural network (PNN), cascade correlation neural network (CCNN), Tsinghua/Temporary DeepSpeed (TDS), deep neural network (DNN), support vector regression (SVR), RF, SVM, and convolutional neural network (CNN). The significance of ML appl. in river research is evident in the number of publications per year, which increased from 310 in 2000 to 3,444 in 2020. Until the 2000s, SupVL appl. was dominant, but after the 2000s, it gradually equalized with unSupVL. Trend analysis also showed that unSupVL and SupVL dominated the field of river research (1990–2020), while NN and DL have gained more attention in this field, featuring in 15–21% of the total publications over the last two decades (Ho & Goethals 2022). ML models frequently used for WQ classification, WQ estimation and prediction, and anomaly detection as parts of river WQ management are the tree-structured algo. DT and RF, SVM, ANN, and LSTM as a type of ANN.

Tree-structured algorithms

DT is a SupVL classifier, consisting of decision and leaf nodes and its structure is presented in Figure 1.
Figure 1

Single DT and RF (Quinlan 1993).

Decision nodes are used to make any decision and have multiple branches, whereas the leaf nodes are the output of those decisions, and do not contain any further branches. Each node represents features in a category to be classified and each subset defines a value that can be taken by the node (Abuzir & Abuzir 2022).

To select the best splitting feature, several selection measures can be used to obtain the purest subsets, such as Information Gain. Information Gain evaluates a feature for splitting based on the difference in entropy (Equation (2)) before and after the split. The feature with the highest Information Gain value (Equation (3)) is used to split the node, and the DT is built by repeating this step. The frequently applied RF algo. is an ensemble of DTs: it fits multiple DTs on various subsamples of the dataset and makes predictions by averaging the predictions of the individual DTs (or, for classification, by majority vote). As a result, possible DT overfitting can be controlled and better prediction accuracy is provided (Quinlan 1993):
$$E(p) = -\sum_{i=1}^{N} p_i \log_2 p_i \qquad (2)$$
where p is the whole dataset, N is the number of classes, and p_i is the frequency of class i in the same dataset.
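$$\mathrm{IG} = E(\mathrm{bef}) - \sum_{j=1}^{K} \frac{|(j, \mathrm{aft})|}{|\mathrm{bef}|}\, E(j, \mathrm{aft}) \qquad (3)$$
where E(·) is the entropy from Equation (2), bef is the dataset before the split, K is the number of subsets generated by the split, and (j, aft) is the j-th subset after the split, weighted by its share of the samples in bef.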
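
The entropy and Information Gain measures described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in any of the reviewed papers: it assumes the standard formulation in which each subset's entropy is weighted by its share of the samples, and the function names and class labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the class labels in a dataset."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(before: Sequence[Hashable],
                     subsets: Sequence[Sequence[Hashable]]) -> float:
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(before)
    if n == 0:
        return 0.0
    after = sum(len(subset) / n * entropy(subset) for subset in subsets)
    return entropy(before) - after


# Hypothetical WQ class labels before a split and after a perfect split.
before_split = ["good"] * 4 + ["bad"] * 4
perfect_split = [["good"] * 4, ["bad"] * 4]
gain = information_gain(before_split, perfect_split)
```

In this example the dataset is evenly divided between two classes, so its entropy is 1 bit; a split that separates the classes completely leaves zero entropy in each subset and therefore yields an Information Gain of 1 bit.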

Support vector machine algorithms

Although the ANN model is often examined and proven as suitable for different predictions, the SVM model has been recognized as more reliable for the same purpose, by several authors (Liu & Lu 2014; Park et al. 2015; Haghiabi et al. 2018; Zhu et al. 2022). Hence, the general architecture of the SVM model is presented in Figure 2.
Figure 2

SVM architecture (Liu & Lu 2014).

K(.) is a kernel function, while n represents the number of support vectors (Liu & Lu 2014). The following Equation (4) presents the SVM regression function:
$$f(x) = w^{T}\varphi(x) + b \qquad (4)$$
where φ(x) is the nonlinear mapping function, w is the weight vector, and b is the bias term (Liu & Lu 2014).
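
In practice, the regression function is usually evaluated through the kernel function K(.) and the n support vectors rather than through the explicit mapping. The sketch below illustrates this kernel-expansion (dual) form with a radial basis function kernel. The kernel choice, the gamma value, and the support vectors and coefficients are hypothetical values chosen for illustration; in a real model they would be obtained by training.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, z: Vector, gamma: float = 0.5) -> float:
    """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def svr_predict(x: Vector,
                support_vectors: Sequence[Vector],
                coefficients: Sequence[float],
                bias: float,
                kernel: Callable[[Vector, Vector], float] = rbf_kernel) -> float:
    """Evaluate f(x) = sum_i c_i * K(x_i, x) + b over the n support vectors."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coefficients)) + bias


# Hypothetical trained model with two support vectors (e.g. scaled pH and WT).
support_vectors = [(0.2, 0.4), (0.8, 0.1)]
coefficients = [1.5, -0.7]
prediction = svr_predict((0.5, 0.3), support_vectors, coefficients, bias=0.1)
```

The kernel replaces the inner product of mapped inputs, so the nonlinear mapping never has to be computed explicitly.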

Long short-term memory (LSTM)

Figure 3 depicts the structure of LSTM, a deep learning-based algo. commonly used for anomaly detection.
Equation (5) provides the calculation of the forget gate of the LSTM algo.:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_t\right) \qquad (5)$$
where f_t is the forget gate at t, t is the timestep, σ is the sigmoid activation function, x_t is the input, h_{t–1} is the previous hidden state, W_f is the weight matrix between the forget gate and the input gate, and b_t is the connection bias at t (Vijayaprabakaran & Sathiyamurthy 2022).
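
To make the forget-gate calculation concrete, the following Python sketch evaluates it for a single timestep. It assumes the standard LSTM formulation, in which the previous hidden state and the current input are concatenated, multiplied by the weight matrix, shifted by the bias, and passed through a sigmoid; all weights and inputs are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(W_f: Sequence[Sequence[float]],
                h_prev: Sequence[float],
                x_t: Sequence[float],
                b: Sequence[float]) -> List[float]:
    """Compute sigmoid(W_f . [h_prev, x_t] + b), one value per hidden unit."""
    concatenated = list(h_prev) + list(x_t)
    return [sigmoid(sum(w * v for w, v in zip(row, concatenated)) + bias)
            for row, bias in zip(W_f, b)]


# Hypothetical example: two hidden units and one input feature (e.g. scaled DO).
W_f = [[0.5, -0.2, 0.1],
       [0.0, 0.3, -0.4]]
gate = forget_gate(W_f, h_prev=[0.1, -0.3], x_t=[0.7], b=[0.0, 0.1])
```

Each gate value lies between 0 and 1: values near 0 discard the corresponding part of the previous cell state, while values near 1 retain it.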

Water quality classification

Anthropogenic activities in urban and rural areas are the most common causes of deteriorated water quality (Nazir et al. 2016); hence, WQ assessment and WQI estimation are vital for preserving human and environmental health (Wang et al. 2017; Zhu et al. 2022; Syeed et al. 2023). The WQ parameters most often used in the selected articles are DO, BOD, nitrate, pH, EC (Bui et al. 2020; Sillberg et al. 2021; Al-Adhaileh & Alsaade 2021; Hassan et al. 2021), and, to a lesser extent, COD, total solids (TS), phosphate (Bui et al. 2020), turbidity (Bui et al. 2020; Sillberg et al. 2021; Abuzir & Abuzir 2022), fecal coliform (FC) (Bui et al. 2020; Al-Adhaileh & Alsaade 2021), total coliform (TC) (Hassan et al. 2021), total coliform bacteria (TCB), salinity, TDS, suspended solids (SS) (Sillberg et al. 2021), total organic carbon (TOC) (Abuzir & Abuzir 2022), and ammonia nitrogen (AN) (Shamsuddin et al. 2022). Table 2 lists the ML algo. applied for this assessment in the reviewed papers.

Table 2

ML algo. commonly used for WQ classification

Domain of application | Type of algorithm | References
Water quality classification (determination of water quality index (WQI) and water classes) | Neural network (NN) (artificial neural network (ANN), feedforward neural network (FFNN)) | Hassan et al. (2021), Al-Adhaileh & Alsaade (2021), and Shamsuddin et al. (2022)
 | Random forest (RF) | Bui et al. (2020) and Hassan et al. (2021)
 | Multinomial logistic regression (MLR) | Hassan et al. (2021)
 | Support vector machine (SVM) | Hassan et al. (2021) and Shamsuddin et al. (2022)
 | Bagged tree model (BTM) | Hassan et al. (2021)
 | Decision tree (DT) (M5P) | Bui et al. (2020) and Shamsuddin et al. (2022)
 | K-nearest neighbor (KNN) | Al-Adhaileh & Alsaade (2021)
 | Random tree (RT) | Bui et al. (2020)
 | Reduced error pruning tree (REPT) | Bui et al. (2020)
 | Hybrid models: 12 hybrid algo. as combinations of standalones with bagging (BA), CV parameter selection (CVPS), and randomizable filtered classification (RFC); attribute-realization (AR) and SVM (AR-SVM); adaptive neuro-fuzzy inference system (ANFIS) | Bui et al. (2020), Sillberg et al. (2021), and Al-Adhaileh & Alsaade (2021)

Hassan et al. (2021) utilized ML algo. from Table 2 and developed a software appl. that used MLR to predict WQ in India in real time for three classes (good, poor, and unsuitable for drinking). The RF model was used to handle missing data. The performance of the MLR, RF, BTM, NN, and SVM classification models was 99.83, 98.99, 98.99, 98.65, and 96.98%, respectively. The highest variable importance values for nitrate, pH, EC, DO, TC, and BOD were obtained with NN (19.67), BTM (36.805), BTM (81.494), BTM (147.558), BTM (105.166), and BTM (130.173), respectively.

Similarly, Shamsuddin et al. (2022) utilized ANN, DT, and SVM for multiclass classification of the river WQ of the Langat River Basin, but arrived at a different best classification model. Although ANN handles big datasets and predicts WQI efficiently, it was outperformed by SVM, whose applicability to small datasets was further improved by the kernel function. The most frequent WQCs were III and II, defined as suitable for water supply/fisheries. The reasons for choosing ANN, DT, and SVM were, respectively, the ability to model nonlinear and complex relationships between input and output variables, the fact that it is an easy and widely used classification technique, and the ability to model nonlinear relationships between input variables. All three models achieved more than 85% performance, with macro accuracy and precision values of 96.35 and 91.97% for SVM, 95.62 and 92.06% for ANN, and 94.71 and 89.22% for DT.

Sillberg et al. (2021) applied the integrated AR-SVM approach, along with 11 water quality parameters (WQPs), to classify the WQ of the Chao Phraya River. Linear regression proved to be the most suitable function for WQ classification, with six WQPs and an accuracy and precision of 0.94 and 0.84, respectively. Water classes varied; however, the poor WQ class III prevailed. The main WQPs in WQC and their confidence values were NH3-N (0.80), TCB (0.79), FCB (0.78), BOD (0.76), DO (0.69), and Sal (0.64). A smaller number of significant variables aided the AR-SVM model by minimizing some limitations.
AR-SVM gave the same results as the traditional WQI calculation for 15 of the 16 datasets (93.75%), confirming good correspondence between the two. With three to six WQPs applied, the AR-SVM model proved to be a potent approach for classifying river WQ, with an accuracy of 0.86–0.95.

Al-Adhaileh & Alsaade (2021) utilized ANFIS for WQI prediction, and KNN and FFNN for WQC of different water bodies across India. The WQI determined by ANFIS showed high efficiency and accuracy, with a regression coefficient of 96.17%. The FFNN model showed superior robustness in classifying the WQC, with accuracy and precision of 100 and 99.961%, while KNN achieved 80.63 and 82.50%. ANFIS and FFNN, as types of ANN, showed the best performance for the prediction of WQI compared to Hassan et al. (2021) and Shamsuddin et al. (2022). ANFIS confirmed its ability to monitor drinking and contaminated water with high accuracy. The determined WQ was classified as poor; hence, the proposed method has been described as helpful in water treatment and management.

The determination of the monthly WQI of a river in Iran by Bui et al. (2020) involved the appl. of four standalone and 12 hybrid data-mining models, presented in Table 2. The main WQPs in WQC were FC and TS. Different WQP combinations provided different levels of performance for each model. The rank of the algo. based on prediction power (best to worst) was BA-RT, BA-RF, BA-M5P, CVPS-RF, RF, RFC-RF, BA-REPT, M5P, CVPS-M5P, RFC-REPT, RFC-M5P, REPT, RT, RFC-RT, CVPS-RT, and CVPS-REPT. All 16 validated algo. performed well, but BA-RT had the highest power (R2 = 0.941), while CVPS-REPT had the lowest (R2 = 0.853) in predicting WQI. Hybrid tree-based models (especially those using the bagging algo.) were more robust and flexible than standalone models. Among the standalone models, RF outperformed M5P, REPT, and RT. Nearly all algo. overestimated WQI values; RT, BA-RT, and CVPS-REPT did not. Even though the hybrid BA-RT outperformed the other models, it did not predict extreme WQI values accurately.
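
The accuracy and precision values reported for the classification models above are computed from the true and predicted water quality classes. The following Python sketch shows one common way to compute overall accuracy and macro-averaged precision. The exact averaging conventions used in the individual studies are not stated here, so this form is an assumption, and the class labels are hypothetical.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Share of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_precision(y_true: Sequence[Hashable],
                    y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class precision values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    classes = sorted(set(y_true) | set(y_pred), key=str)
    per_class = []
    for cls in classes:
        # True labels of all samples that were predicted as this class.
        truths = [t for t, p in zip(y_true, y_pred) if p == cls]
        # A class that is never predicted contributes a precision of 0.
        per_class.append(sum(t == cls for t in truths) / len(truths)
                         if truths else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical true and predicted water quality classes for four samples.
true_classes = ["II", "II", "III", "III"]
predicted_classes = ["II", "III", "III", "III"]
acc = accuracy(true_classes, predicted_classes)
prec = macro_precision(true_classes, predicted_classes)
```

Here three of the four samples are classified correctly (accuracy 0.75), while precision is 1 for class II and 2/3 for class III, giving a macro precision of about 0.83. Macro averaging weights every class equally, which matters when some water quality classes are rare.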

Water quality prediction

Monitoring and prediction of WQ are very important as they can enhance water management, including WQ preparation and regulation, higher quality and development of irrigation strategy, the efficiency of aquaculture, and improved drinking water preparation and strategies for the prevention of water contamination (Al-Adhaileh & Alsaade 2021; Khullar & Singh 2022). Table 3 summarizes several ML algo. which can be used for prediction purposes. ANN, DNN, and SVM models were used more frequently than other models, possibly because of the advantages they offer (Krishnan et al. 2022). Among the most commonly used parameters for surface WQ prediction are DO, WT, pH, SS, nitrates (NOx), TDS, EC, turbidity, BOD, and COD. Other parameters, such as FC, chlorides, sulfates, and organic and inorganic pollutants, were also occasionally used, depending on the availability of data and the type and locality of the river (Syeed et al. 2023). The mentioned indicators of WQ are interconnected, since one parameter can affect the value of another. Hence, it is important to evaluate each parameter's significance and their mutual correlations (Zhu et al. 2022). For instance, many authors have recognized that DO is one of the WQ indicators of greatest global concern and that it correlates strongly with certain input parameters such as pH value, temperature, and NOx concentration (Zhu et al. 2022). These correlations can affect the prediction accuracy of the applied ML model and should therefore be thoroughly investigated. Besides the prediction of the concentration/level of certain parameters, or of general WQ based on several parameters, other predictions could be made, such as monthly runoff prediction (Samantaray et al. 2022) or water level prediction (Baek et al. 2020).
A common approach for modeling future water status is to collect a large amount of data from previously published articles or public services/monitoring stations and to generate databases which could serve as input for ML model development for certain purposes. However, data are sometimes missing. The problem of incomplete data could be overcome by adding another type of data to model development, such as hydrological data, which is more often available (Zhi et al. 2021), or by using additional data post-processing tools such as the Multivariate Bayesian Uncertainty Processor (MBUP) (Zhou 2020). Developing more than one model is beneficial as it allows the utilized models to be compared for the investigated purpose and the most appropriate one to be chosen (highest accuracy, lowest error, etc.). Specific problems could occur when predicting concentrations of certain components in agricultural drainage river basins, because of the existence of self-purification mechanisms and nonpoint source transport of pollutants. The ML algo. ANN and SVM succeeded in predicting total nitrogen and total phosphorus concentrations in rivers in China. However, SVM showed better generalization, as it avoided overtraining and optimized fewer parameters based on the structural risk minimization principle. To optimize the parameters of the models, genetic algo. and trial-and-error analysis were used (Liu & Lu 2014). Because single models face different limitations, they could be outperformed by hybrid models. For instance, although some authors report good prediction ability of the deep learning LSTM model (Liu et al. 2019), Khullar & Singh (2022) reported that single CNN and LSTM models are often highly complex and have low prediction accuracy. This could be overcome by an improved Bi-LSTM model, which proved adaptable to various WQ samples from different sources (Khullar & Singh 2022).
Hybrid models also proved efficient for important short-term WQ prediction, such as in the case when an advanced data denoising technique – complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) was integrated with extreme gradient boosting and RF to predict six WQ indicators (Lu & Ma 2020).

Table 3

Common ML algo. used for water quality prediction

Domain of application | Type of algorithm | References
Water quality prediction and estimation | ANN and their variations (backpropagation neural network (BPNN), general regression neural network (GRNN), recurrent neural network (RNN)), deep neural networks (DNN), their variations (convolutional neural network (CNN), long short-term memory (LSTM)), and combinations (CNN-LSTM) | Antanasijević et al. (2013), Liu & Lu (2014), Haghiabi et al. (2018), Liu et al. (2019), Baek et al. (2020), Bilali & Taleb (2020), and Khullar & Singh (2022)
 | Group method of data handling (GMDH) | Haghiabi et al. (2018)
 | SVM | Liu & Lu (2014) and Haghiabi et al. (2018)
 | Extra tree regression (ETR) | Asadollah et al. (2021)
 | Support vector regression (SVR) | Bilali & Taleb (2020), Asadollah et al. (2021), and Khullar & Singh (2022)
 | Decision tree regression (DTR) | Asadollah et al. (2021)
 | Decision tree (DT)-based hybrid models: CEEMDAN-RF and CEEMDAN-XGBoost | Lu & Ma (2020)
 | DNN-based hybrid models: Bi-LSTM model (DLBL-WQA) | Khullar & Singh (2022)

Anomaly detection

The process of identifying unexpected problems in water supply data, such as missing values, unusual patterns, or inconsistent specifications, is called anomaly detection. Anomaly detection is done by applying ML models that may or may not require model calibration against a labeled dataset, like the SupVL ML model in the first and the unSupVL ML model in the second case. Given that the SupVL model requires large datasets, the unSupVL models can be used as the alternative (Russo et al. 2021). In Table 4, a few ML algo. used for anomaly detection within reviewed articles are presented.

Table 4

Common ML algo. used for anomaly detection

Domain of application | Type of algorithm | References
Water quality anomaly detection | Logistic regression | Muharemi et al. (2019)
 | SVM | Muharemi et al. (2019)
 | LSTM | Muharemi et al. (2019) and Miau & Hung (2020)
 | ANN | Muharemi et al. (2019) and Miau & Hung (2020)
 | DNN | Muharemi et al. (2019)
 | RNN | Muharemi et al. (2019)
 | LDA (linear discriminant analysis) | Muharemi et al. (2019)
 | CNN with an extreme learning machine (ELM) (CNN-ELM) | Miau & Hung (2020)
 | Seq2seq | Miau & Hung (2020)
 | Conv-GRU (CNN and GRU model) | Miau & Hung (2020)
 | BAR (Bayesian autoregressive) model and IF (Isolation forest) algo. | Liu et al. (2020)

Muharemi et al. (2019) set out to check whether ML models give more accurate results than logistic regression and which model performs best for WQ data. To identify anomalies in WQ data, the authors applied the ML algo. SVM, ANN, DNN, RNN, LSTM, and LDA and compared them all to the logistic regression algo. used for data classification. The results were ranked by F1 value (as a measure of accuracy): the SVM model performed best with 0.9891, followed by DNN (0.9485), LSTM (0.9023), RNN (0.8345), logistic regression (0.6027), ANN (0.5768), and LDA (0.0820). All models were vulnerable to unbalanced datasets and gave worse results on them, although SVM, logistic regression, and ANN were less vulnerable. This points to a clear weakness of ML models when they are applied to unbalanced WQ data. The results also revealed logistic regression's ability to explain the relationship between one dependent variable and one or more independent variables, and SVM's ability to accurately predict data series in the case of nonlinear and nonstationary underlying systems. NN algo. simulate the structure of the human brain with a network of many interconnected neurons (ANN); they are effective and reliable models with many hidden layers (DNN), and they can use previous time series information through recurrent loops in which the output state is fed back to the input state of the cell (RNN). LSTM has long-term benefits for making accurate predictions, as it learns useful information while forgetting useless information, whereas LDA gives excellent results for independent measurements, but classical recognition techniques pose a problem for this model. Research conducted by Miau & Hung (2020) focused on the comparison of ANN, CNN, LSTM, Seq2seq, and Conv-GRU models for water level prediction in the Danshui River Basin in Taiwan.
The achieved performance, expressed through Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), was as follows: Conv-GRU (RMSE = 0.774, MAE = 0.567, MAPE = 30.684), LSTM (RMSE = 1.032, MAE = 0.620, MAPE = 31.035), and CNN (RMSE = 1.144, MAE = 0.745, MAPE = 37.154). The Conv-GRU model had the smallest error between actual and predicted values and therefore the best results when predicting the river level; LSTM and CNN had slightly higher errors, but smaller ones than ANN and Seq2seq. CNN achieved good prediction results since it could pick out local trends and observe the same patterns repeating themselves in different places. Only when integrated did the CNN and GRU model outperform the other four models in prediction performance, acting as a time series modeler that provides an early indication of anomalous behavior. Seq2seq provided sequence-by-sequence forecasting based on multi-step time series forecasting, and LSTM and ANN confirmed their abilities as mentioned in Muharemi et al. (2019). Anomaly detection performed by Liu et al. (2020) used data from the Potomac River in West Virginia, USA, and integrated the BAR model and the IF algo. The evaluation index was represented by the error indicators RMSE, MAE, and MSE (Mean Square Error), while turbidity (TURB), specific conductivity (SC), and DO were used as quality parameters. The error indicator values were RMSE (TURB = 0.1694, SC = 0.0831, DO = 0.0332), MAE (TURB = 0.1086, SC = 0.0453, DO = 0.0282), and MSE (TURB = 0.0287, SC = 0.0069, DO = 0.0011). Both models showed excellent results in anomaly detection. The developed integrated model detected WQ anomalies accurately and showed the ability to provide effective early warning for emergency operations.
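
For reference, the error indicators used in these comparisons can be computed as in the following minimal Python sketch. The formulas are the standard definitions of MSE, RMSE, MAE, and MAPE; the function names and the sample values are illustrative and do not come from the reviewed studies.

```python
import math
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> int:
    """Validate the inputs and return the number of samples."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return len(actual)


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean square error."""
    n = _check(actual, predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error, the square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    n = _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error; undefined if any actual value is 0."""
    n = _check(actual, predicted)
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


# Hypothetical observed and predicted river levels (m).
observed = [2.0, 4.0]
forecast = [1.0, 6.0]
errors = (mse(observed, forecast), rmse(observed, forecast),
          mae(observed, forecast), mape(observed, forecast))
```

For these two samples the errors are 1 and 2, so MSE = 2.5, RMSE ≈ 1.58, MAE = 1.5, and MAPE = 50%. Because RMSE is the square root of MSE, the two indicators reported for the BAR and IF model are mutually consistent (for turbidity, 0.1694² ≈ 0.0287).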

The application of ML in surface/river water management offers many opportunities, but each of them faces different challenges. Although many authors have already recognized the importance of comparative analysis and included several ML models to determine the most suitable one, it is important to highlight that prediction accuracy also depends on input parameters. Hence, careful selection of available WQPs is of key importance. Several ML algo. struggle with generalization, which is important for real applications within different areas. The inclusion of other variables (hydrological, morphological, geological, etc.) in model development should be considered, and a model presented in one study should be assessed for other rivers with diverse climates and hydrology (Asadollah et al. 2021). Quantification of the uncertainty of the regression model caused by missing input data is highly challenging. To compensate for missing data concerning WQ parameters, hydrometeorology data could be used (Zhi et al. 2021). In order to achieve higher prediction accuracies, future studies should be strategically planned. Besides the choice of ML algo., their comparison, and parameter selection, this includes covering critical sampling points and sampling periods, when higher oscillations of input parameter concentrations are expected (Zhi et al. 2021). The utilization of hybrid ML models has generally been an attractive solution, as they can overcome the limitations of single models and achieve higher performance and accuracy than single ML models (Khullar & Singh 2021). This trend has been observed for all three application domains, to a certain extent. Accordingly, researchers should consider this aspect in future studies.

ML has relatively recently found its purpose within EE, including river water management. Various ML algo. have proved their applicability for monitoring, classifying, and predicting river WQ and for detecting anomalies. Among the most common algo. that proved their efficiency within the mentioned research were DT for classification; ANN, DNN, and SVM for prediction; and DNN-based algo. for anomaly detection. Limitations of single models are found to be overcome by a hybrid approach. The application of AI for the mentioned purposes is beneficial from economic, ecological, and strategic aspects. However, the full potential and real appl. of these systems are yet to be investigated and implemented.

This research was supported by the Science Fund of the Republic of Serbia, grant number 6707, REmote WAter quality monitoRing anD IntelliGence – REWARDING; by the European Union's Horizon Europe Marie Sklodowska-Curie Actions (MSCA) under grant agreement project number 101086387 – REMARKABLE; and by the Ministry of Science, Technological Development and Innovation through project no. 451-03-47/2023-01/200156 ‘Innovative scientific and artistic research from the FTS (activity) domain’.

All relevant data are available from an online repository or repositories (https://scholar.google.com).

The authors declare there is no conflict.

Abuzir S. Y. & Abuzir Y. S. 2022 Machine learning for water quality classification. Water Quality Research Journal 57 (3), 152–164.
Ahmed A., Othman F., Afan H. A., Ibrahim R. K., Chow M. F., Hossain S., Ehteram M. & El-Shafie A. 2019a Machine learning methods for better water quality prediction. Journal of Hydrology 578, 124084. https://doi.org/10.1016/j.jhydrol.2019.124084
Ahmed U., Mumtaz R., Anwar H., Shah A. A., Irfan R. & Garcia-Nieto J. 2019b Efficient water quality prediction using supervised machine learning. Water 11 (11), 2210. https://doi.org/10.3390/w11112210
Al-Adhaileh M. H. & Alsaade F. W. 2021 Modelling and prediction of water quality by using artificial intelligence. Sustainability 13 (8), 4259. https://doi.org/10.3390/su13084259
Antanasijević D., Pocajt V., Povrenović D., Perić-Grujić A. A. & Ristić M. 2013 Modelling of dissolved oxygen content using artificial neural networks: Danube River, North Serbia, case study. Environmental Science and Pollution Research 20 (12), 9006–9013. https://doi.org/10.1007/s11356-013-1876-6
Asadollah S. B. H. S., Sharafati A., Motta D. & Yaseen Z. M. 2021 River water quality index prediction and uncertainty analysis: a comparative study of machine learning models. Journal of Environmental Chemical Engineering 9 (1), 104599. https://doi.org/10.1016/j.jece.2020.104599
Baek S., Pyo J. & Chun J. A. 2020 Prediction of water level and water quality using a CNN-LSTM combined deep learning approach. Water 12 (12), 3399. https://doi.org/10.3390/w12123399
Bilali A. E. & Taleb A. 2020 Prediction of irrigation water quality parameters using machine learning models in a semi-arid environment. Journal of the Saudi Society of Agricultural Sciences 19 (7), 439–451. https://doi.org/10.1016/j.jssas.2020.08.001
Bui D. T., Khosravi K., Tiefenbacher J., Nguyen H. & Kazakis N. 2020 Improving prediction of water quality indices using novel hybrid machine-learning algorithms. Science of the Total Environment 721, 137612. https://doi.org/10.1016/j.scitotenv.2020.137612
Directive 2000/60/EC of the European Parliament and of the Council of 23 October 2000 Establishing a framework for Community Action in the Field of Water Policy. Available from: https://eur-lex.europa.eu/eli/dir/2000/60/2014-11-20 (accessed 09 May 2023).
Gorde S. P. & Jadhav M. V. 2013 Assessment of water quality parameters: a review. International Journal of Engineering Research and Applications 3 (6), 2029–2035.
Haghiabi A. H., Nasrolahi A. & Parsaie A. 2018 Water quality prediction using machine learning methods. Water Quality Research Journal of Canada 53 (1), 3–13. https://doi.org/10.2166/wqrj.2018.025
Hassan M. M., Hassan M. M., Akter L., Rahman M. M., Zaman S., Hasib K. M., Jahan N., Smrity R. N., Farhana J., Raihan M. & Mollick S. 2021 Efficient prediction of water quality index (WQI) using machine learning algorithms. Human-Centric Intell. Syst. 1, 86. https://doi.org/10.2991/hcis.k.211203.001
Ho L. & Goethals P. 2022 Machine learning applications in river research: trends, opportunities and challenges. Methods in Ecology and Evolution 13, 2603–2621.
Khullar S. & Singh N. 2022 Water quality assessment of a river using deep learning Bi-LSTM methodology: forecasting and validation. Environmental Science and Pollution Research 29 (9), 12875–12889. https://doi.org/10.1007/s11356-021-13875-w
Krishnan S. R., Nallakaruppan M. K., Chengoden R., Koppu S., Iyapparaja M., Sadhasivam J. & Sethuraman S. 2022 Smart water resource management using artificial intelligence – a review. Sustainability 14, 1–28. https://doi.org/10.3390/su142013384
Liu M. & Lu J. 2014 Support vector machine – an alternative to artificial neuron network for water quality forecasting in an agricultural nonpoint source polluted river? Environmental Science and Pollution Research 21 (18), 11036–11053. https://doi.org/10.1007/s11356-014-3046-x
Liu P., Wang J., Sangaiah A. K., Xie Y. & Yin X. 2019 Analysis and prediction of water quality using LSTM deep neural networks in IoT environment. Sustainability 11 (7), 2058. https://doi.org/10.3390/su11072058
Liu J., Wang P., Jiang D., Nan J. & Zhu W. 2020 An integrated data-driven framework for surface water quality anomaly detection and early warning. Journal of Cleaner Production 251, 119145. https://doi.org/10.1016/j.jclepro.2019.119145
Lu H. & Ma X. 2020 Hybrid decision tree-based machine learning models for short-term water quality prediction. Chemosphere 249, 126169. https://doi.org/10.1016/j.chemosphere.2020.126169
Maganathan T., Senthilkumar S. & Balakrishnan V. 2020 Machine learning and data analytics for environmental science: a review, prospects and challenges. IOP Conference Series 955 (1), 012107. https://doi.org/10.1088/1757-899x/955/1/012107
Miau S. & Hung W. H. 2020 River flooding forecasting and anomaly detection based on deep learning. IEEE Access 8, 198384–198402. https://doi.org/10.1109/access.2020.3034875
Muharemi F., Logofătu D. & Leon F. 2019 Machine learning approaches for anomaly detection of water quality on a real-world data set. Journal of Information and Telecommunication 3 (3), 294–307. https://doi.org/10.1080/24751839.2019.1565653
Nazir H. M., Hussain I., Zafar M. I., Ali Z. & AbdEl-Salam N. M. 2016 Classification of drinking water quality index and identification of significant factors. Water Resources Management 30 (12), 4233–4246. https://doi.org/10.1007/s11269-016-1417-4
Park Y., Cho K. H., Park J., Min S. R. & Kim J. B. 2015 Development of early-warning protocol for predicting chlorophyll-a concentration using machine learning models in freshwater and estuarine reservoirs. Science of the Total Environment 502, 31–41. https://doi.org/10.1016/j.scitotenv.2014.09.005
Quinlan J. R. 1993 C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA.
Regulation on the limit values of pollutants in surface and groundwater and sediment and deadlines for their achievement. ‘Sl. Glasnik RS’, br. 50/2012.
Russo S., Besmer M. D., Blumensaat F., Bouffard D., Disch A., Hammes F., Hess A., Lürig M., Matthews B., Minaudo C., Morgenroth E., Tran-Khac V. & Villez K. 2021 The value of human data annotation for machine learning based anomaly detection in environmental systems. Water Research 206, 117695. https://doi.org/10.1016/j.watres.2021.117695
Samantaray S., Das S., Sahoo A. & Satapathy D. P. 2022 Monthly runoff prediction at Baitarani river basin by support vector machine based on Salp swarm algorithm. Ain Shams Engineering Journal 13 (5), 101732. https://doi.org/10.1016/j.asej.2022.101732
Shamsuddin I. I. S., Othman Z. & Sani N. F. M. 2022 Water quality index classification based on machine learning: a case from the Langat River Basin Model. Water 14 (19), 2939. https://doi.org/10.3390/w14192939
Sillberg C., Kullavanijaya P. & Chavalparit O. 2021 Water quality classification by integration of attribute-realization and support vector machine for the Chao Phraya River. Journal of Ecological Engineering 22, 70–86.
Sutadian D., Muttil N., Yilmaz A. G. & Perera B. J. C. 2016 Development of river water quality indices – a review. Environmental Monitoring and Assessment 188 (1), 1–29. https://doi.org/10.1007/s10661-015-5050-0
Syeed M. M. M., Hossain M. S., Karim M. R., Uddin M. F., Hasan M. & Khan R. H. 2023 Surface water quality profiling using the water quality index, pollution index and statistical methods: a critical review. Environmental and Sustainability Indicators 18, 100247. https://doi.org/10.1016/j.indic.2023.100247
Vijayaprabakaran V. & Sathiyamurthy K. 2022 Towards activation function search for long short-term model network: a differential evolution based approach. Journal of King Saud University – Computer and Information Sciences 34 (6), 2637–2650. https://doi.org/10.1016/j.jksuci.2020.04.015
Wang X., Wang S. & Ding J. 2017 Evaluation of water quality based on a machine learning algorithm and water quality index for the Ebinur Lake Watershed, China. Scientific Reports 7 (1). https://doi.org/10.1038/s41598-017-12853-y
Zhi W., Feng D., Tsai W., Sterle G., Harpold A. A., Shen C. & Li L. 2021 From hydrometeorology to river water quality: can a deep learning model predict dissolved oxygen at the continental scale? Environmental Science & Technology 55 (4), 2357–2368. https://doi.org/10.1021/acs.est.0c06783
Zhong S., Zhang K., Bagheri M., Burken J., Gu A., Li B., Ma X., Marrone B., Ren Z., Schrier J., Shi W., Tan H., Wang T., Wang X., Wong B., Xiao X., Yu X., Zhu J. & Zhang H. 2021 Machine learning: new ideas and tools in environmental science and engineering. Environmental Science & Technology 55 (19), 12741–12754. https://doi.org/10.1021/acs.est.1c01339
Zhu M., Wang J., Yang X., Zhang Y., Zhang L., Ren H., Wu B. & Ye L. 2022 A review of the application of machine learning in water quality evaluation. Eco-environment & Health 1 (2), 107–116. https://doi.org/10.1016/j.eehl.2022.06.001
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC-ND 4.0), which permits copying and redistribution for non-commercial purposes with no derivatives, provided the original work is properly cited (http://creativecommons.org/licenses/by-nc-nd/4.0/).