Abstract
Managing water resources and determining the quality of surface water and groundwater are among the issues most fundamental to human and societal well-being. Maintaining water quality and managing water resources well are complicated by human-induced errors, so applications that facilitate and enhance these processes have gained importance. In recent years, machine learning techniques have been applied successfully to preserving water quality and to managing and planning water resources, and water researchers have used them effectively to support public management systems. In this study, the data sources, pre-processing steps, and machine learning methods used in water research are briefly described, and the algorithms are categorized. A general summary of the literature on water quality determination and on applications in water resources management is then presented. Lastly, the study is illustrated with machine learning experiments on two publicly shared datasets.
HIGHLIGHTS
Preserving water quality and managing water resources are vital.
Data acquisition and pre-processing strategies are presented for water researchers.
Machine learning algorithms in water research are categorized and described.
Examinations with machine learning are exemplified for the analyzed topics.
Graphical Abstract
ABBREVIATIONS
- AI
artificial intelligence
- ANFIS
adaptive-network-based fuzzy inference system
- ANN
artificial neural network
- AUC
area under the ROC curve
- BMA
Bayesian model averaging
- BOD
biochemical oxygen demand
- Chl-a
chlorophyll-a
- CNN
convolutional neural network
- COD
chemical oxygen demand
- CRT
completely random tree
- DA
data augmentation
- DBP
disinfection by-products
- DCF
deep cascade forest
- DENFIS
dynamic evolving neural fuzzy inference system
- DFA
desirability function analysis
- DL
deep learning
- DNN
deep neural network
- DO
dissolved oxygen
- DR
dimensionality reduction
- DT
decision tree
- E. coli
Escherichia coli
- EC
electrical conductivity
- ELM
extreme learning machine
- FIB
fecal indicator bacteria
- FIS
fuzzy inference system
- FS
feature selection
- GA
genetic algorithm
- GAN
generative adversarial network
- GBM
gradient boosting machines
- GEE
Google Earth Engine
- GMDH
group method of data handling
- GRNN
generalized regression neural network
- GRU
gated recurrent unit
- HM
heavy metal
- IoT
internet of things
- kNN
k-nearest neighbors
- LDA
linear discriminant analysis
- LR
logistic regression
- LSTM
long short-term memory
- MAE
mean absolute error
- MAPE
mean absolute percentage error
- ML
machine learning
- MLP
multilayer perceptron
- MLR
multiple linear regression
- MPC
Microsoft planetary computer
- MSE
mean square error
- MVI
missing value imputation
- NARX
nonlinear autoregressive exogenous
- NC
nitrogen compounds
- NI
no information
- NIOA
nature-inspired optimization algorithms
- NN
neural network
- NSE
Nash–Sutcliffe efficiency
- OCT
optical coherence tomography
- OLS
ordinary least squares
- PARAFAC
parallel factor analysis
- PCA
principal component analysis
- pH
potential of hydrogen
- PLS
partial least squares
- PSO
particle swarm optimization
- RBC
rotating biological contactor
- RBFNN
radial basis function NN
- RF
random forest
- RMSE
root mean square error
- RNN
recurrent neural network
- ROC
receiver operating characteristic
- RS
remote sensing
- RSM
response surface methodology
- RSR
RMSE observations standard deviation ratio
- SDG
sustainable development goals
- SGB
stochastic gradient boosting
- SOM
self-organizing map
- SPM
suspended particulate matter
- SVM
support vector machine
- SVR
support vector regression
- TDS
total dissolved solids
- TH
total hardness
- TSS
total suspended solids
- UAV
unmanned aerial vehicle
- VS
virtual sensors
- WDS
water distribution system
- WQ
water quality
- WQI
Water Quality Index
- WQP
water quality parameter
- WSM
water supply and management
- WT
wavelet transform
- WTE
water temperature
- WTP
water treatment plant
- WWTP
wastewater treatment plant
- XGBoost
extreme gradient boosting
- YOLO
you only look once
INTRODUCTION
While water covers approximately 71% of the Earth's surface, only about 3% of the world's water bodies (ice caps/glaciers, groundwater, lakes, soil, the atmosphere, rivers, etc.) are fresh water, and the amount of usable fresh water is approximately 0.5% (U.S. Bureau of Reclamation 2020). Given that water resources are in such a critical state, water quality (WQ) monitoring and water supply and management (WSM) are becoming increasingly vital. Anthropogenic activity is the most significant cause of WQ degradation and of damage to the ecosystem in general. Human beings, who are both polluters and protectors of nature, urgently need to increase their environmental awareness, consciousness, and healing role in order to maintain their existence. For this purpose, the importance of water should remain at the top of the agenda of both states and supranational international organizations. Freshwater consumption and the acidification of ocean water have been highlighted as two of the nine key variables that require the most attention for the well-being of humanity and the planet (Rockström et al. 2009; Steffen et al. 2015).
In 2015, 17 high-priority goals were set by the United Nations (UN) within the scope of sustainable development goals (SDG). One of these goals, SDG 6, has a vital vision to provide clean, quality water and sanitation for all (UN Environment Programme 2021). Increasing awareness about the value of water, recognizing, monitoring, and determining its quality, and integrating it into the system of decision-makers is critical to ensure sustainable and equitable water resource management (UNESCO 2021).
Although there are some differences in determining the WQ of surface water and groundwater, dozens of parameters measured from the physical, chemical, and biological properties of water are used to determine WQ. Some WQ parameters (WQPs) frequently used in the literature are as follows (Davie 2008; Spellman 2017; Omer 2019). (1) Physical WQPs include water temperature (WTE), color, total dissolved solids (TDS), total suspended solids (TSS), turbidity, and electrical conductivity (EC). (2) Chemical WQPs include dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), potential of hydrogen (pH), nitrogen compounds (NC), heavy metals (HM), and total hardness (TH). (3) Biological WQPs include algae (e.g., chlorophyll-a, Chl-a) and fecal indicator bacteria (FIB) such as Escherichia coli. Some WQPs may indirectly affect both the overall WQ and other WQPs.
WQ indices (WQIs) combine more than one WQP into a single measure of WQ. A general WQI method aims to express the selected water parameters of the analyzed medium as a function and to obtain an output through mathematical equations. The resulting output divides WQ into good, medium, and poor quality (or derived categories), which are usually set as classification targets. Work on the first WQI standard began in the mid-1960s by calculating parameters from physical and chemical factors of water bodies (Horton 1965; Yan et al. 2022). After the WQI method proposed by Horton, many WQI methods were developed for various tasks by essential organizations worldwide (Akhtar et al. 2021). For example, the foundations of the NSFWQI were laid in 1970 with the contributions of the National Sanitation Foundation (NSF) (Brown et al. 1970), and the Canadian WQI (CWQI), prepared by the Canadian Council of Ministers of the Environment (CCME), was published in 2001 (CCME 2001). However, there is still no universal WQI for evaluating WQ, although many studies have been conducted to create WQIs for more efficient assessment of diverse water types (surface water, groundwater) (Sutadian et al. 2016).
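To make the aggregation idea concrete, the following minimal Python sketch computes a generic weighted arithmetic index from two WQPs; the sub-index functions, weights, and category cut-offs are hypothetical illustrations, not a published standard such as NSFWQI or CWQI.

```python
# Minimal sketch of a weighted arithmetic water quality index.
# Sub-index functions, weights, and cut-offs below are illustrative, not a published standard.

def sub_index_do(do_mg_l):
    # Map dissolved oxygen (mg/L) to a 0-100 sub-index (hypothetical linear scaling).
    return max(0.0, min(100.0, do_mg_l / 8.0 * 100.0))

def sub_index_ph(ph):
    # Penalize deviation from neutral pH (hypothetical).
    return max(0.0, 100.0 - abs(ph - 7.0) * 20.0)

def wqi(do_mg_l, ph, weights=(0.6, 0.4)):
    """Weighted arithmetic WQI = sum(w_i * q_i) with sum(w_i) = 1."""
    q = (sub_index_do(do_mg_l), sub_index_ph(ph))
    return sum(w * qi for w, qi in zip(weights, q))

score = wqi(do_mg_l=6.5, ph=7.4)
label = "good" if score >= 70 else "medium" if score >= 40 else "poor"
print(f"WQI = {score:.1f} ({label})")
```

Real WQI standards differ mainly in which sub-index curves and weights they prescribe; the aggregation step itself follows this pattern.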
Traditionally, WQ is determined by in situ measurement. Samples are tested in a laboratory, WQPs are measured, and the WQ is decided using a WQI. These techniques can provide accurate values but are often time-consuming, uneconomical, and incapable of capturing the fundamental temporal and spatial changes in WQ (Wagle et al. 2020). A study that compared sensor and laboratory measurements, including a cost analysis, found laboratory measurements to be more expensive (Paepae et al. 2021).
This study analyzes studies that apply machine learning (ML) models to WQ and WSM. Water data are challenging because they are generally nonlinear and non-stationary and are affected by unpredictable natural and artificial factors. ML models, which have gained the trust of the scientific community in almost every field where they are applied, can effectively learn from such data and extract patterns with high accuracy. Numerous studies have used ML models in water-related areas and obtained effective results in many applications, including the smart city concept. WQ and WSM analyses using ML are also topics frequently studied in the literature. In order to systematically clarify the roadmap and concepts in this study, many review articles were analyzed in detail.
In the field of WQ assessment, one study presented a comprehensive summary of the work conducted between 2000 and 2020 to determine the WQ of rivers and systematically classified the ML methods used (Tiyasha et al. 2020). Another review analyzed WQ studies in groundwater (Hanoon et al. 2021). A further study made detailed technical analyses of some ML architectures used in the WQ determination process, examining many technical issues, especially data processing, identifying deficiencies in the data, and dividing the data during the training phase (Chen et al. 2020b). Some other detailed reviews have also been examined (Hassan & Woo 2021; Paepae et al. 2021).
Although relatively few studies analyze the use of ML in detail under both the WQ and WSM headings, some critical reviews in this sphere have been analyzed. One review covered most of the issues that this study deals with and explained their sub-topics (Huang et al. 2021). Alongside some other critical reviews (Sit et al. 2020; Lowe et al. 2022), several studies that analyzed only the WSM portion using ML were also reviewed (Kirstein et al. 2020; Rahim et al. 2020).
The remainder of the work is organized as follows. In Section 2, bibliometric analyses outlining the use of ML in the WQ and WSM processes are presented. In Section 3, data acquisition sources for WQ and WSM and the pre-processing methods applied to the obtained data are described. In Section 4, ML algorithms are classified, and the evaluation metrics used to assess their results are presented. In Sections 5 and 6, studies on using ML in WQ and WSM processes are discussed, respectively. In Section 7, experiments with an ML method are conducted on two publicly shared datasets, and the results are compared. In Section 8, the article concludes with discussions and conclusions.
BIBLIOMETRIC ANALYSIS
The bibliometric data of the WQ and WSM topics were analyzed and visualized under separate headings, and word clouds and numerical summaries of the obtained keywords enrich this study. There are thousands of studies in the literature on the use of ML for WQ and WSM. Since not all of these studies could be analyzed in detail, the 'Scopus' database was examined to grasp the essence of the field. Scopus was chosen because, for the identified key queries, more studies can be accessed through it than through the 'Web of Science' database. Extensive work has been conducted comparing these two databases (Chadegani et al. 2013; Martín-Martín et al. 2018; Singh et al. 2021).
When examining the WQ heading, the query 'TITLE-ABS-KEY ("water quality" AND ("machine learning" OR "deep learning" OR "neural network*")) AND (LIMIT-TO (LANGUAGE, "English"))' was used. The query considered only studies in English; the oldest study found dates from 1990, and there were 3,802 studies in total as of September 2022. Similarly, when examining the WSM heading, the query 'TITLE-ABS-KEY ((water W/5 (management OR administration OR supply)) AND ("machine learning" OR "deep learning" OR "neural network*")) AND (LIMIT-TO (LANGUAGE, "English"))' was used. Again, only studies in English were considered; the oldest study, a different one, also dates from 1990, and a total of 5,309 studies had been published as of September 2022. When the keywords created for both headings are combined into a single Scopus search, the result is nearly 1,000 documents fewer than the sum of the separate searches. This indicates that many documents address both subjects and reflects the close relationship between the WQ and WSM topics.
A word cloud is a visual representation of word frequency: the more frequently a term appears in the analyzed text, the larger the word appears in the rendered image. The word clouds used in this study were created with an open-source, multifunctional bibliometric analysis tool distributed as an installable library for the R language (Aria & Cuccurullo 2017).
When creating the word cloud for WQ, some unnecessary duplicative synonym keywords were removed from the list. For example, for the keyword phrase 'artificial neural network,' synonymous keywords such as 'artificial neural networks,' 'ann,' and 'artificial neural network (ann)' were excluded. At the same time, the terms 'water quality,' 'machine learning,' and 'artificial neural network' were excluded to increase the visibility of other keywords, as their repetition frequencies are very high and they would dominate the word cloud.
Figure 1: Word cloud representation created with the keywords obtained as a result of the literature search: (a) for WQ and (b) for WSM.
Figure 2: Representation of the number of studies according to the year intervals determined as a result of the literature review: (a) for WQ and (b) for WSM.
Figure 3: Representation of the 20 countries with the most studies according to the authors on the map: (a) for WQ and (b) for WSM.
In Figure 2, an increase in the studies conducted over the years can be seen; this growth rate shows how popular the topics have become, especially in the past 5 years. In Figure 3, the ranking of the countries by total publication output is shown, broken down into single-country publications (SCPs) and multiple-country publications (MCPs). In light of the information obtained, China, the United States, India, and Iran hold the first four places for both headings in this study. The number of publications from the African continent, which hosts many countries and has significant water problems, is notably low.
DATA SOURCES AND PRE-PROCESSING
Data sources
The quality and quantity of data are highly critical to an appropriate and well-trained ML model, and the variety of big data obtained in water management systems is wide (Eggimann et al. 2017; Sun & Scanlon 2019). The first step in WQ and WSM evaluation is measurement from various data sources. Data are generally obtained through the internet of things (IoT) and geographic monitoring systems (satellites, unmanned aerial vehicles (UAVs), cameras, etc.) together with on-site measurements, laboratory analyses, and sensors. Reliable water sensors that can monitor the complex physicochemical and biochemical reactions occurring in water systems, and a spatio-temporal processing capability that can handle large amounts of data collected over varying periods in near real time, are essential. Water analysis sensors are generally electrochemical (voltammetric/amperometric, potentiometric, conductometric), optical sensors, or biosensors (Huang et al. 2022). Water monitoring approaches can be divided into various temporal types based on sensing capabilities, monitoring environment, and target. In this sensor-based temporal classification, long-term continuous monitoring (LTCM) has been identified as the most capable approach: LTCM acquires data continuously, has high monitoring capacity, successfully detects abnormal events, and requires a high-performance sensor (Huang et al. 2022). High-quality collected data make it possible to prepare alternative plans for many possible scenarios and thus to avoid most water-related problems.
With sensors, data can be received from the water, satellites, UAVs, and the SCADA system where the device is placed. IoT sensors can monitor WQPs in near real-time, helping to record much more data with much higher temporal resolution (Yang et al. 2022). IoT plays a significant role in the WQ and WSM concept as an element that enables the automation of the entire system. Remote sensing (RS) by satellites is one of the essential methods for observing the Earth's surface and plays a vital role in many fields. According to The Union of Concerned Scientists1, updated on May 1, 2022, there are 5,465 satellites for various missions (civil, commercial, military, Earth observation, etc.) in earth orbit. There are important satellites that provide many free data, such as Landsat (the first civilian Earth observation satellite active since 1972), Sentinel-2 (sent by the European Space Agency in 2015), Zhuhai-1 (sent by China in 2017), and Landsat 9 (launched by NASA in 2021). The countries with the most satellites in orbit are the USA, China, and Russia. There are studies in which the spatial resolutions obtained from the relevant satellites are classified as coarse, medium, and high (years of use of the satellite, spatial and temporal resolution, spectral range, and free availability of data, etc.) (Chen et al. 2022).
It is concluded that the RS big data obtained by satellite has excellent potential for estimating optically active water parameters (Chen et al. 2022). It has been determined that some of the main factors in deciding WQ with RS are Chl-a, suspended particulate matter (SPM), and colored dissolved organic matter (CDOM) (Dörnhöfer & Oppelt 2016; Chen et al. 2022). Some WQ variables cannot be measured directly with RS because they are not optically active or lack high-resolution hyperspectral data. Real-time detection of specific parameters with sensors is not easy or economical, so predictions can be made by creating virtual (soft) sensors (VS) using other measurable data (Paepae et al. 2021). To some extent, VSs can provide near real-time monitoring of parameters that are difficult to measure. For example, direct COD measurement with sensors is very costly. Again, COD can be measured with VS created with formulaic calculations based on measurements taken with much cheaper basic sensors (pH, EC, Chl-a, temperature, DO, etc.) (Paepae et al. 2021).
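As a brief illustration of the virtual-sensor idea, the hedged sketch below (assuming scikit-learn and NumPy are available) fits a simple regression that estimates COD from cheaper measurements; the data and the assumed COD relationship are synthetic placeholders that would have to be replaced by calibration against laboratory COD measurements.

```python
# Illustrative virtual (soft) sensor: estimate COD from low-cost sensor readings.
# The synthetic data and linear relationship are placeholders for real calibration data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(6.0, 9.0, n),    # pH
    rng.uniform(100, 1500, n),   # EC (uS/cm)
    rng.uniform(5, 30, n),       # water temperature (deg C)
    rng.uniform(2, 12, n),       # DO (mg/L)
])
# Hypothetical COD response used only to generate training targets.
y = 0.02 * X[:, 1] - 1.5 * X[:, 3] + 0.5 * X[:, 2] + rng.normal(0, 2, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
soft_sensor = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", round(soft_sensor.score(X_test, y_test), 3))
```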
Efficient processing and analysis of the data obtained with RS may involve serious difficulties (Zhang et al. 2021). These are visual difficulties caused by clouds and rainy weather, a low reflection of water, storage and processing of big data, and estimation of optically inactive water parameters (Chen et al. 2022). In addition, parameters related to image resolution, atmospheric window state, and spectral bands are also important while obtaining data from the satellite. Despite these limitations, image-based water turbidity detection results using RS are as accurate as actual turbidity measurement techniques and an encouraging method for monitoring WQ at larger spatial scales (Yang et al. 2022).
Cloud-based computing platforms are also available that can be used to process satellite data and other geographic data, such as the Google Earth Engine (GEE) (Gorelick et al. 2017) and the Microsoft planetary computer (MPC). Such platforms can help research the main desired topics rather than dealing with the time it takes to download and process large amounts of data. In addition, EPANET software, designed by the Environmental Protection Agency (EPA) to run hydraulic and WQ simulations and display the results in various formats, is also frequently used for data generation and simulation.
Aquatic eutrophication seriously threatens aquatic ecosystems, resulting in the death of aquatic organisms. Therefore, it is crucial to monitor Chl-a levels in water bodies and identify algal blooms before they occur (Yang et al. 2022). Water eutrophication can be evaluated qualitatively based on color characterization: in terms of water color, WQ is excellent if the water is blue, good if green, slightly polluted if yellow, moderately polluted if orange, and heavily polluted if red (Yan et al. 2022). Therefore, many studies can detect WQ from water color in images obtained with RS data (Hassan & Woo 2021).
UAVs are also used as a data acquisition platform. In addition to the color image taken from the air, more substantial data with high spatial resolution can be obtained with the help of various sensors (multispectral, hyperspectral, thermal, etc.) added to the UAV itself (Vélez-Nicolás et al. 2021). Water level information obtained with the help of sensors, flow and pressure information in pumps and pipes, acoustic recorders, etc., are also used as WSM data sources in addition to the sources mentioned above. In addition, it has been determined that smart water meters are one of the effective data collection methods in WSM (Monks et al. 2019; Kirstein et al. 2020; Velani et al. 2022).
Most studies use the dataset they need in cooperation with government agencies but do not share it publicly. Despite intensive research, the sources of publicly shared data included in the WQ and WSM in the reviewed sources are minimal. Studies of some openly shared datasets are as follows: (Hamshaw et al. 2018; Moritz et al. 2018, 2017; Mo et al. 2019a, 2019b; Ross et al. 2019; Zhao et al. 2019; Zhou et al. 2019; Wang et al. 2021a; Nasir et al. 2022). Experiments were carried out with ML methods with two of these open datasets, and they are explained in detail below.
Data pre-processing
Data management and pre-processing are important steps in their own right, and there are many detailed studies on these subjects (Han et al. 2012; Fan et al. 2021). Data management usually includes pre-processing steps such as missing value imputation (MVI), detection of noise and outlier/anomaly values, feature selection (FS), dimensionality reduction (DR), normalization, data augmentation (DA), and proper partitioning of data. In MVI, data should not be deleted unless values are missing for an entire column over a significant range of rows; when possible, missing data should be completed with meaningful estimates using approaches based on averaging, linear interpolation, regression, and ML algorithms (Alasadi & Bhaya 2017; Chhabra et al. 2017). For anomaly detection, determining lower and upper quantile thresholds, measuring the distance of data points in terms of standard deviations, and clustering and other complementary ML algorithms can be used.
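A minimal sketch of these imputation and outlier-screening ideas is shown below, assuming pandas is available; the dissolved-oxygen series, the IQR rule, and the interpolation choice are illustrative rather than prescriptive.

```python
import numpy as np
import pandas as pd

# Hourly dissolved-oxygen series with artificial gaps and one spike (illustrative data).
idx = pd.date_range("2022-01-01", periods=8, freq="H")
do = pd.Series([8.1, np.nan, 7.9, 8.0, np.nan, np.nan, 25.0, 7.8], index=idx)

# Simple outlier screening with the interquartile range (IQR) rule on the raw values.
q1, q3 = do.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (do < q1 - 1.5 * iqr) | (do > q3 + 1.5 * iqr)
do_clean = do.mask(outliers)  # treat the flagged 25.0 reading as missing

# Missing value imputation: linear interpolation between neighbouring observations.
do_filled = do_clean.interpolate(method="linear")
print(do_filled)
```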
Statistical models such as the Pearson correlation coefficient (PCC), recursive feature elimination (RFE), and tree-based ML algorithms can be used for accurate FS (Fan et al. 2021). The DR process is also essential for data management and engineering; statistical methods and clustering ML algorithms are often used for it (Velliangiri et al. 2019). Normalization ensures that the variables are in the same range, preferably as small numbers, so that ML algorithms built on distances between data points can produce reliable results. Some essential normalization techniques are min-max normalization, z-score standardization, decimal scaling, and logarithmic methods (Saranya & Manik 2013).
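The following short sketch, assuming scikit-learn is available, applies min-max and z-score scaling to two WQPs with very different ranges; the values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two WQP columns on very different scales: EC (uS/cm) and pH (illustrative values).
X = np.array([[250.0, 6.8],
              [900.0, 7.4],
              [1500.0, 8.1]])

X_minmax = MinMaxScaler().fit_transform(X)    # min-max normalization to [0, 1]
X_zscore = StandardScaler().fit_transform(X)  # z-score standardization (mean 0, std 1)
print(X_minmax)
print(X_zscore)
```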
DA is a set of techniques that can be applied to all data types, such as audio, text, and images. It aims to artificially increase the amount of data by deriving new data points from existing data, and it is an effective way to combat overfitting, which primarily arises from a lack of data. In addition to simple techniques such as adding noise to an image, cropping, rotating, color-based changes, and deleting a random part, artificial data can also be produced with advanced ML techniques (Shorten & Khoshgoftaar 2019). For text data, methods such as replacing words with synonyms, random insertion/deletion/shuffling/replacement, and translating the text into another language and then back into the original language can be used (Wei & Zou 2019). For audio data, augmentation methods include clipping, adding noise, shifting in the time domain, changing playback speed, and frequency masking (Wei et al. 2020).
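For sensor time series, a hedged sketch of simple augmentation by jittering and time shifting is given below; the noise level and shift range are hypothetical and would need tuning for real data.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_series(x, n_copies=3, noise_std=0.05, max_shift=2):
    """Create augmented copies of a 1-D series by jittering and small time shifts.

    A minimal illustration of augmentation for sensor time series; the noise level
    and shift range are hypothetical and should be tuned to the data at hand.
    """
    augmented = []
    for _ in range(n_copies):
        noisy = x + rng.normal(0.0, noise_std * np.std(x), size=x.shape)  # additive noise
        shift = rng.integers(-max_shift, max_shift + 1)                   # time-domain shift
        augmented.append(np.roll(noisy, shift))
    return np.stack(augmented)

signal = np.sin(np.linspace(0, 4 * np.pi, 100))
print(augment_series(signal).shape)  # (3, 100)
```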
Dividing the data neatly for training and testing (or training, validation, and testing) involves approaches that vary from study to study. The training set is used to train the model, the validation set is used to tune the hyperparameters of the trained model, and the test set provides a result by comparing the prediction outputs with the observations. The share of data allocated for training is typically at least 50% and can reach 80% (Joseph 2022); the remaining data are used either only for testing or for both validation and testing. Although model verification techniques such as cross-validation are frequently utilized, alternative techniques may be needed, especially for time series: because past observations matter in time series, validation folds should respect temporal order rather than being chosen randomly (Bergmeir & Benítez 2012). The chosen split ratio may vary within the range mentioned above depending on the dataset's serial pattern, size, and number of features, and studies have also been carried out to determine the optimum split ratio (Joseph & Vakayil 2022).
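The sketch below, assuming scikit-learn, shows a conventional random 70/15/15 split alongside a time-series-aware cross-validation that preserves temporal order; the ratios are one common choice, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

# Conventional random split: 70% training, 15% validation, 15% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

# For time series, keep temporal order: each fold validates on data that follows its training window.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    print(f"train up to index {train_idx[-1]}, validate on {val_idx[0]}..{val_idx[-1]}")
```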
In addition to the methods mentioned above, pre-processing methods such as various radiometric calibrations, atmospheric correction, and spectral band selection are needed to use the data obtained from RS sources such as satellites and UAVs reliably (Chen et al. 2022).
AN OVERVIEW OF MACHINE LEARNING
ML is at the core of artificial intelligence (AI) and is one of the main routes through which the goals of AI are pursued. While AI aims to simulate natural intelligence to solve complex problems, the goal of ML is for machines to learn and master how to perform human-like behavior (Xue & Zhu 2009). An algorithm that exhibits human-like behavior can be called AI, but it cannot be called part of ML unless that algorithm learns and evolves from data automatically. Data collection is essential during the ML implementation stages. Some of the data obtained may inevitably be incomplete, incorrect, or damaged, leading to poor performance if used in their current form. In such cases, the methods mentioned under the data pre-processing heading above become another critical step in ML applications. Then, appropriate algorithm selection, training of the selected model, and model validation are required.
Supervised and unsupervised approaches are the two main learning paradigms of ML algorithms. In supervised learning, the data are labeled to tell the machine which patterns to look for; in unsupervised learning, the data are not labeled, and the machine finds patterns without knowing which to look for. In addition to algorithms that learn by trial and error, called reinforcement learning, there is also semi-supervised learning, which is generally used when there are many unlabeled data and only a few labeled data but is not applied as often as the others.
ML tasks can typically be classified into broad categories depending on the type of learning (supervised or unsupervised), the learning model (classification, regression, clustering, etc.), and the algorithm chosen to implement the desired task. While it is vital to have good accuracy when training an ML model, its ability to generalize to unseen data also matters to a great extent.
Machine learning algorithms
In this section, the basic ML algorithms used in WQ and WSM processes are classified according to their structures and briefly described. As shown in Table 1, these algorithms are classified as artificial neural network (ANN)-based, deep learning (DL)-based, tree-based, fuzzy logic-based, complementary, and other algorithms.
Classification of ML algorithms used in WQ and WSM processes
Category | ML algorithms |
---|---|
ANN-based | ELM, GRNN, MLP, NARX, RBFNN, SOM |
DL-based | Autoencoder, CNN, DNN, GAN, GRU, LSTM, RNN |
Tree-based | CatBoost, CRT, DCF, DT, GBM, RF, SGB, XGBoost |
Fuzzy logic-based | ANFIS, DENFIS |
Complementary | Bat, DFA, GA, LDA, PARAFAC, PCA, PSO, RSM, WT |
Other | BMA, GMDH, k-means, kNN, LR, MLR, OLS, PLS, SVM, SVR |
ANN-based algorithms
ANN, often called neural network (NN), is an ML method that can derive new information using previously learned or classified information by imitating the biological neural structure of the human brain. Neurologist Warren S. McCulloch and logician Walter Pitts published the first ANN model in 1943 with an artificial ‘neural network’ formulation in the brain (McCulloch & Pitts 1943). ANN consists of three main layers: input, hidden, and output. ANN models can sometimes be categorized as a ‘black box’ model because the results of the numerical approach are used at the hidden layer level without fully understanding the mechanism inside.
After the back-propagation approach was proposed as a new learning procedure to minimize the error obtained from a loss function in classification, the use of ANN continued to develop rapidly (Rumelhart et al. 1986). Fully connected feed-forward networks, also known as the multilayer perceptron (MLP), are another common ANN nomenclature. In some studies, algorithms referred to as ANN actually meant MLP, so corrections were made accordingly when those studies were detailed in the tables below. ANN has been continuously improved over time, and its efficiency has increased with the help of various data optimizations.
There are many algorithms based on ANN. Some of the frequently used topics in this article are radial basis function NN (RBFNN) (Broomhead & Lowe 1988), generalized regression NN (GRNN) (Specht 1991), self-organizing map (SOM or Kohonen map) (Kohonen 1982, 1990), nonlinear autoregressive exogenous (NARX) (Lin et al. 1996) and extreme learning machine (ELM) (or randomized NN) (Huang et al. 2004, 2006).
SOM is an unsupervised algorithm that trains with competitive learning and is used to reduce feature dimensionality and to cluster data (Kalteh et al. 2008). The NARX model is an algorithm for time series in which the network output is fed back to the input; it can deal with missing data, noise, and nonlinear inputs (Chang et al. 2016). RBFNN uses radial basis functions as activation functions and is trained as a curve-fitting approach in multidimensional space, making it faster than MLP networks. Since GRNN also works on a radial basis and follows a one-pass learning approach, it does not require the iterative training process of ANNs using back-propagation. Because the complexity of both RBFNN and GRNN grows excessively, they do not deal easily with large datasets. Since ELM assigns input weights and bias values randomly, computes the output weights by matrix inversion instead of back-propagation, and can work with non-differentiable activation functions, its training is swift and it generalizes very well (Ertuğrul et al. 2021).
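As a small illustration of an ANN-based regressor, the sketch below (assuming scikit-learn) trains an MLP to predict DO from three WQPs; the data and the assumed DO relationship are synthetic placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic example: predict DO from WTE, pH, and EC (placeholder data, not real measurements).
rng = np.random.default_rng(1)
n = 600
X = np.column_stack([rng.uniform(5, 30, n),      # WTE
                     rng.uniform(6, 9, n),       # pH
                     rng.uniform(100, 1200, n)]) # EC
y = 12.0 - 0.25 * X[:, 0] + rng.normal(0, 0.4, n)  # hypothetical DO response

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0))
model.fit(X_tr, y_tr)
print("test R^2:", round(model.score(X_te, y_te), 3))
```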
DL-based algorithms
DL algorithms, which can also be classified as ANN-based, should be examined under a separate heading due to their groundbreaking techniques and usage areas. DL differs from ANN according to its depth, number of hidden layers, and architectural diversity. An ANN structure with several hidden layers may also be known as deep NN (DNN) as a general nomenclature. DL algorithms have quickly attracted the attention of researchers from various scientific disciplines, along with the developments in graphics processing units (GPUs), which have compelling computational features (Goodfellow et al. 2016).
While traditional ML approaches rely on handcrafted features for feature extraction, in DL algorithms, features are learned automatically without explicit instruction and are represented at multiple, increasingly abstract levels. While the performance of DL approaches improves as datasets grow, it degrades with insufficient data, and overfitting problems may occur (Liu et al. 2017). In addition, more nonlinear activation functions, initialization schemes, and regularization methods are available in DL than in other ANN derivatives.
Fundamental DL model types can be broadly classified as the convolutional NN (CNN) (LeCun et al. 1989), recurrent NN (RNN) (Rumelhart et al. 1986), autoencoder (AE) (Rumelhart et al. 1985), and generative adversarial network (GAN) (Goodfellow et al. 2014). Algorithms with different structures derived from these basic DL classes provide solutions in many different domains.
CNN is an important DL algorithm type, often used in image processing, where convolution and pooling operations are applied in its hidden layers and usually take images as input to the algorithm. It seems that both CNN and DL algorithms, in general, have become much more popular in the last ten years after successful results using DL on ImageNet. ImageNet was made publicly accessible by researchers at Stanford in 2009 and contained more than 14 million tagged images (Deng et al. 2009; Krizhevsky et al. 2017). There are numerous derivatives of CNN, and it is seen that wide varieties are used for many purposes in WQ and WSM studies. For example, CNN can often be used in image recognition, and derivatives such as the ‘you only look once’ (YOLO) algorithm can be used in automatic object detection, especially for WSM tasks (Redmon et al. 2016).
RNNs exploit existing correlations between input data to predict new data; they are built on a system in which the current state is affected by past states at each step. Long short-term memory (LSTM) and gated recurrent unit (GRU)-like algorithms have been proposed to alleviate the difficulty RNNs have in accessing earlier information and their gradient vanishing and exploding problems (Hochreiter & Schmidhuber 1997; Cho et al. 2014). In general, neither LSTM nor GRU is clearly superior in performance, but it has been concluded that when fast training is necessary, the less complex GRU can be preferred (Chung et al. 2014).
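A minimal LSTM sketch for one-step-ahead forecasting of a WQP series is given below, assuming TensorFlow/Keras is available; the sine series, window length, and layer sizes are illustrative, and a GRU layer could be swapped in as noted in the comment.

```python
# Minimal LSTM sketch for one-step-ahead forecasting, assuming TensorFlow/Keras is installed.
import numpy as np
import tensorflow as tf

# Build (samples, timesteps, features) windows from a synthetic hourly series.
series = np.sin(np.linspace(0, 20 * np.pi, 2000)).astype("float32")
window = 24
X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., None]
y = series[window:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),  # tf.keras.layers.GRU(32) could be swapped in for a lighter model
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=3, batch_size=64, verbose=0)
print("one-step forecast:", float(model.predict(X[-1:], verbose=0)[0, 0]))
```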
AEs are unsupervised algorithms; although they can be used in many different tasks, their main purpose is to compress the given input into a lower-dimensional representation and reconstruct it at the output as faithfully as possible. They are instrumental in data pre-processing techniques such as FS, DR, and noise and anomaly removal in water surveys (Russo et al. 2020; Ba-Alawi et al. 2021). GANs are groundbreaking generative DL models consisting of two different, competing NNs, and they can generally be used in data pre-processing steps such as FS, DA, and anomaly removal (Li et al. 2018; Huynh et al. 2022).
Tree-based algorithms
Tree-based ML algorithms are named as such because they can be visualized with nodes and branches, and they have many types. One of the most basic models is the decision tree (DT), a structure used to divide a dataset containing many records into smaller clusters by applying a set of simple decision rules (Morgan & Sonquist 1963; Hunt et al. 1966). Node types such as the root, parent and child, decision, and terminal (leaf) nodes are the essential elements of a DT, and different algorithms are used to decide whether to split a node into two or more child nodes. Many directly DT-based algorithms exist (CART, ID3, C4.5, etc.) (Mienye et al. 2019). As tree branches grow, overfitting problems may occur, and various measures (parameter restrictions, pruning of branches, etc.) are taken to prevent this. After these measures were taken, other important algorithms emerged.
Random forest (RF) is an approach that combines many weak, uncorrelated trees in a forest using the logic of 'bagging,' takes estimates from each tree, and determines the winner according to the votes received (Breiman 1996, 2001). An important aspect of the RF algorithm is that the features contributing most to information gain can be identified, so RF can also serve as a kind of FS and DR process.
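To illustrate this feature-ranking use of RF, the sketch below (assuming scikit-learn) fits a forest on synthetic data in which the target depends mainly on two of four hypothetical WQPs and prints the resulting importances.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative feature ranking with a random forest; data and target are synthetic.
rng = np.random.default_rng(7)
n = 500
features = ["WTE", "pH", "EC", "turbidity"]
X = rng.normal(size=(n, len(features)))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(0, 0.3, n)  # target depends mainly on WTE and EC here

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```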
Variants of the RF algorithm, which has many derivatives, are also used, such as the completely random tree (CRT) and the deep cascade forest (DCF), which is very successful and inspired by the architectures of DL algorithms (Zhou & Feng 2017). The gradient boosting machines (GBM) algorithm, an effective method, was developed by combining the gradient descent optimization algorithm with the 'boosting' logic to strengthen DTs (Friedman 2001). With a further modification by the author of GBM, stochastic gradient boosting (SGB) (or the boosted regression tree) was introduced, making randomness an integral part of the gradient procedure (Friedman 2002). Furthermore, some of the newest GBM derivatives, which can prevent overfitting, manage uninformative data, and run very fast, are the extreme gradient boosting (XGBoost) and CatBoost algorithms (Chen & Guestrin 2016; Prokhorenkova et al. 2017).
Fuzzy logic-based algorithms
Fuzzy logic is a logical framework in which, instead of the strict true-false boundary of classical Aristotelian logic, ambiguous expressions can also be quantified: membership values of terms are defined between 0 and 1, which allows ambiguity to be handled (Zadeh 1965, 1978). The fuzzy inference system (FIS), which better defines the relationships between variables and then interprets them in simple language, has therefore been accepted as a 'grey box' model, a combination of the 'white box' and 'black box' modeling approaches (Obadina et al. 2022).
FIS does not perform learning, so it is not considered an ML approach, even though it is a subset of AI. Therefore, flexible ML models that can better describe complexity and nonlinear data have been created by combining the FIS approach with NN learning logic. Many fuzzy logic-based algorithms have been brought into the ML family; some of the important ones are the adaptive-network-based FIS (ANFIS) (Jang 1993) and the dynamic evolving neural FIS (DENFIS) (Kasabov & Song 2002). DENFIS uses the evolving clustering method to optimize membership function parameters, and studies have been carried out in the water field with this algorithm in the past (Heddam 2014; Heddam & Dechemi 2015). However, ANFIS is used more frequently in the current literature because it combines the advantages of the ANN and FIS models, successfully handles noisy data, and can generally be used as a hybrid with multidimensional data analysis methods (Tiyasha et al. 2020).
Complementary algorithms
Wavelet transform (WT) is an algorithm that can significantly improve the overall performance of a model and is capable of describing both spectral and temporal information (Grossmann & Morlet 1984). WT analysis has proven more efficient than the Fourier transform and has successfully analyzed signals in the time-frequency domain (Chun-Lin 2010). WT has presented successful examples in many tasks, such as extracting meaningful information from a dataset, decomposing non-stationary signals into different sub-signals, noise removal, and signal compression (Walczak & Massart 1997; Tiyasha et al. 2020). WT is a complementary algorithm that is effective both alone and in combination with many models such as ANN and ANFIS.
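A short wavelet-denoising sketch is given below, assuming the PyWavelets (pywt) package is installed; the 'db4' wavelet, decomposition level, and threshold are illustrative choices.

```python
# Wavelet-based denoising sketch, assuming the PyWavelets package (pywt) is available.
import numpy as np
import pywt

rng = np.random.default_rng(3)
signal = np.sin(np.linspace(0, 8 * np.pi, 512)) + rng.normal(0, 0.3, 512)

# Decompose, shrink small detail coefficients, and reconstruct.
coeffs = pywt.wavedec(signal, "db4", level=4)
threshold = 0.3
coeffs = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft") for c in coeffs[1:]]
denoised = pywt.waverec(coeffs, "db4")
print(denoised.shape)
```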
The use of methods to reduce the size of the features to achieve better results, less processing time, and prevent overfitting is also a complementary method. Principal component analysis (PCA) and linear discriminant analysis (LDA) methods, the foundations of which have been established for nearly a century, can be given as examples of the main algorithms used for this purpose (Cunningham 2008). In addition, it is seen that the parallel factor analysis (PARAFAC) method, a very old mathematical model used to focus on the features of interest, is also used in some studies (Harshman 1970).
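The following sketch, assuming scikit-learn, applies PCA to reduce ten correlated synthetic WQP columns to three components and reports the variance they explain.

```python
import numpy as np
from sklearn.decomposition import PCA

# Dimensionality reduction sketch: project 10 correlated columns onto 3 principal components.
rng = np.random.default_rng(5)
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + rng.normal(0, 0.1, size=(300, 10))  # synthetic data

pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
```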
Nature-inspired optimization algorithms (NIOA), also a subset of AI, can be used with ML algorithms as a complementary model because they successfully find optimized solutions to multidimensional and multimodal challenging problems (Yang 2014). Many NIOA methods are available, inspired by nature's evolutionary, physical and chemical processes (Wang et al. 2021b). Some of the frequently used ones in the literature are the genetic algorithm (GA) (Holland 1975), particle swarm optimization (PSO) (Kennedy & Eberhart 1995), and the recent bat algorithm (Yang 2010). In addition, it is seen that methods such as response surface methodology (RSM) and desirability function analysis (DFA) are used as complementary optimization models in some studies (Bobadilla et al. 2020).
Other algorithms
This section mentions some models that do not fall into the other class categories in water studies. The most popular of these is the support vector machine (SVM), an ML method that tries to separate the points placed on a plane with an optimum margin (Cortes & Vapnik 1995). To cope with high-dimensional and nonlinear problems, it performs the kernel trick, using kernel functions that transform the input space into a higher-dimensional space. Support vector regression (SVR), which is based on the same concept as SVM but is used for regression tasks, also appears in studies (Drucker et al. 1996).
Another algorithm is the k-nearest neighbors (kNN) method, in which the distance metric and the number of neighbors are the essential parameters and a new individual is classified by looking at its proximity to the 'k' nearest previous individuals (Altman 1992). Because kNN defers computation from the training phase to the testing phase, where distances over the entire dataset are calculated, it is described as 'lazy' and can be costly on large multidimensional datasets. The k-means algorithm, one of the unsupervised ML methods, provides solutions for association and clustering problems by finding the pattern of the data and dividing the data into 'k' clusters, and it is also used in some studies (MacQueen 1967; Lloyd 1982).
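The sketch below, assuming scikit-learn, contrasts the two methods on synthetic two-dimensional data: kNN classifies a new point from its labeled neighbors, while k-means partitions the same points into clusters without labels.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(11)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic groups
y = np.array([0] * 50 + [1] * 50)                                      # e.g. "good" vs "poor" WQ labels

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # supervised: distance-based classification
print("kNN prediction:", knn.predict([[4.5, 4.8]]))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # unsupervised: partition into k clusters
print("cluster sizes:", np.bincount(km.labels_))
```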
In addition, techniques such as Bayesian model averaging (BMA) (Raftery et al. 2005), ordinary least squares (OLS), partial least squares (PLS), logistic regression (LR), multiple linear regression (MLR), and the group method of data handling (GMDH) (Ivakhnenko 1970) have been used in some studies.
Evaluation metrics
In order to evaluate the performance of ML and DL algorithms, mainly in classification and regression tasks, many evaluation criteria and detailed analyses are available in the literature (Goodfellow et al. 2016; Botchkarev 2019; Yang et al. 2022). In this section, to help understand some concepts in the outputs of ML research, the criteria used for these evaluation tasks are briefly mentioned without going into much detail. One of the essential evaluation tools in classification tasks is the confusion matrix and the elements it contains. This table has four combinations of predicted and actual values resulting from the classification: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). Many classification metrics are derived from these four counts. Mathematically simple but vital metrics are obtained from the confusion matrix, such as accuracy $= (TP + TN)/(TP + TN + FP + FN)$, precision $= TP/(TP + FP)$, recall or sensitivity $= TP/(TP + FN)$, specificity $= TN/(TN + FP)$, and the F1-score (or F-score or F-measure) $= 2 \cdot (\mathrm{precision} \cdot \mathrm{recall})/(\mathrm{precision} + \mathrm{recall})$.
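For completeness, the sketch below (assuming scikit-learn) computes these confusion-matrix-based metrics for a small illustrative binary example.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Small illustrative binary example (1 = poor water quality, 0 = acceptable).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
```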
In addition to the criteria above, (1) the area under the ROC curve (AUC) is an effective method for visually evaluating the performance of ML classifiers, where ROC refers to the receiver operating characteristic, a probability curve. (2) Log loss is a good evaluation metric for binary classifiers and is a slightly modified version of the likelihood function. (3) The Cohen kappa statistic is frequently used in determining inter-rater reliability and was developed to determine the degree of agreement between two rater models scoring at the classification level. For all the classification criteria mentioned except log loss, larger values indicate better classification.
There are also many evaluation criteria in regression tasks, and no single ideal performance criterion can be generalized (Botchkarev 2019). Some methods sum the absolute error between the actual and predicted values; common examples of this approach are the mean absolute error (MAE) and the mean absolute percentage error (MAPE). Criteria that are frequently used in regression and that measure error by squaring it include the mean square error (MSE), root mean square error (RMSE), R, $R^2$, and Nash–Sutcliffe efficiency (NSE). MSE is the mean of the squared differences between the actual and predicted values, a variation on MAE. RMSE is the square root of MSE, which brings the error back to the scale of the data so that it can be compared more easily.

The R measure, also known as the correlation coefficient, gives the correlation between the observed values and the response generated by the model. $R^2$, a statistical measure of how close the data are to the fitted regression line, is also known as the coefficient of determination and is obtained by squaring the R criterion. Although $R^2$ is a frequently used criterion, adjusted $R^2$ has been proposed as a remedy because $R^2$ is not sensitive to outliers and overfitting problems (Babyak 2004; Li 2017). NSE is a normalized statistic that stands out in the field of hydrology and measures the residual variance relative to the variance of the observed data. In addition, some studies have also used the RMSE-observations standard deviation ratio (RSR).
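The sketch below, assuming scikit-learn and NumPy, computes MAE, RMSE, R², and NSE for a short illustrative pair of observed and predicted series; note that NSE, computed against the observed mean, coincides with the coefficient of determination reported by r2_score.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_obs = np.array([7.9, 8.1, 6.5, 7.2, 8.4])   # observed DO values (illustrative)
y_sim = np.array([7.7, 8.3, 6.8, 7.0, 8.2])   # model predictions

mae = mean_absolute_error(y_obs, y_sim)
rmse = np.sqrt(mean_squared_error(y_obs, y_sim))
r2 = r2_score(y_obs, y_sim)
# Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)
nse = 1 - np.sum((y_obs - y_sim) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}, NSE={nse:.3f}")
```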
ML IN WQ PROCESSES
WQ monitoring and determination are essential for well-functioning surface water and groundwater systems. Table 2 contains some recently published studies (from the last 5 years) that summarize the use of ML for WQ in general terms.
WQ assessment using ML algorithms
Authors (year) | Number of samples/Data range/Location | Input variables | Data pre-processing | Target | ML techniques | Evaluation metrics |
---|---|---|---|---|---|---|
Najah Ahmed et al. (2019) | NI/2009-2010/Johor River (Malaysia) | Chloride, E. coli, EC, iron, magnesium, nitrate, phosphate, potassium, salinity, sodium, turbidity, WTE | Noise removal with WT | Ammoniacal nitrogen, pH, TSS | ANFIS, MLP RBFNN, WT-ANFIS | R2 |
Li et al. (2019) | 1448/2017-2018/Qiantang River (China) | DO, permanganate index, pH, total phosphorus | Normalization | DO, permanganate index, pH, total phosphorus | GRU, LSTM, MLP, SVR, RNN, RNN with Dempster/ Shafer evidence theory | MAE, MAPE, NSE, RMSE |
García-Alba et al. (2019) | NI/1990 and 2013-2015/Eo Estuary (Spain) | The flow of fecal discharges, river flow/WTE, sea salinity/WTE, solar radiation, water level | NI | FIB (E. coli) | MLP | NSE, R² |
Kisi et al. (2020) | NI/2015-2017 (for Link River) and NI/1997-2004 (for Klamath River)/(United States) | EC, pH, WTE | NI | DO | ANFIS, BMA, CART, ELM, MLR, MLP | NSE, R² |
Zhang et al. (2020) | 3672/2015/Burnett River (Australia) | Chl-a, DO, EC, pH, turbidity, WTE | DR and noise removal with kernel PCA, FS | DO | GRNN, kPCA-LSTM, MLP, SVR | MAE, percent of prediction within a factor, R² |
Ansari & Akhoondzadeh (2020) | 102/2013-2018/Karun River (Iran) | Salinity (in situ data) + Landsat 8 satellite images | Radiometric/atmospheric correction, FS, normalization | salinity | GA-MLP, OLS, SVR | R² |
Chen et al. (2020a) | 33612/2012-2018/124 automatic WQ monitoring stations in 10 rivers and lakes (China) | Ammonia nitrogen, COD, DO, pH | Normalization | WQI | CRT, DCF, DT, kNN, LDA, LR, naive Bayes, RF, SVM | F1-score, precision, recall, weighted F1-score |
Lombard et al. (2021) | 19678 for RF and 20450 for SGB/1970-2013/(United States) | 249 variables from (base-flow index, bedrock geology, ecoregion, general lithology, groundwater recharge, multi-order hydrologic position, percent tile drainage, precipitation, soil geochemistry, surficial geology) | NI | arsenic | RF, SGB | accuracy, AUC, kappa, sensitivity, specificity |
Alizamir et al. (2021) | 1317 /NI (for Illinois River) and 1841/NI (for Grand Lake)/(United States) | EC, pH, turbidity, WTE | NI | Chl-a | Bat-ELM, CART, ELM, GMDH, RF | NSE, R, RMSE |
Liu et al. (2021) | NI/2021/Poyang Lake, Three Gorges, Yangtze River (China) | Chl-a (in situ data) + hyperspectral and RGB images obtained by UAV | Geometric registration, radiometric correction, image processing | Chl-a | MLP, RF, PLS, PSO-SVM | MAPE, R² |
Xu et al. (2021) | 3287 sample/2011-2019/Dongjiang River (China) | (ammonia nitrogen, DO, EC, permanganate, pH, turbidity, total phosphorus, WTE) and (air pressure/temperature, evaporation, flow, humidity, precipitation, sunshine duration, wind velocity) | FS, MVI, normalization | DO | MLP, MLR, RF, SVM, WT-MLP, WT-MLR, WT-RF, WT-SVM | MAE, NSE, R, RMSE |
Singha et al. (2021) | NI/2019/226 different excavation points in Chhattisgarh (India) | Bicarbonate, calcium, chloride, fluoride, magnesium, nitrate, pH, phosphate, potassium, sodium, sulfate, TDS, TH | Normalization | WQI | DNN, MLP, RF, XGBoost | MAPE, MSE, NSE, R² |
Podgorski et al. (2022) | over 6000/NI/(Bangladesh, Cambodia, Indonesia, Myanmar, Vietnam) | (ammonium, arsenic, bicarbonate, chloride, DO, EC, iron, manganese, nitrate, pH, phosphate, redox potential, sodium, sulfate, well depth, WTE) and (57 spatially continuous data of climate, geology, soil, and topography) | DR, FS | iron, manganese | GBM, RF | AUC, balanced accuracy, kappa |
Shan et al. (2022) | NI /2014-2018 (for algal density) and NI/2017-2018 (for Chl-a)/Three Gorges Reservoir (China) | (algal cell, ammonia nitrogen, Chl-a, COD, DO, EC, microcystin, pH, total nitrogen, total phosphorus, turbidity, WTE) and (air temperature, atmospheric pressure, flow velocity, water level, wind direction/speed) | FS, MVI | Chl-a, algal density, and microcystin | LSTM, MLP, RNN, SVM, XGBoost-LSTM | MAE, RMSE |
Nasir et al. (2022) | 1679/2005-2014/diverse biomes over 600 locations (India) | BOD, DO, coliform data for fecal/total, EC, nitrate, pH | DA, MVI | WQI | CatBoost, DT, LR, MLP, RT, SVM, XGBoost | accuracy, AUC, F1-score, precision, sensitivity |
DO is one of the most commonly used output parameters when determining WQ. In a study conducted for this purpose, BMA gave the most successful results and ELM the second most successful, and the WTE variable was found to be the most helpful attribute among the three WQPs (Kisi et al. 2020). In another study, the DO level 1-3 hours ahead was estimated from data obtained by sensors, and the most successful result was achieved with a hybrid model combining kernel PCA and LSTM (Zhang et al. 2020). Another study observed that the four most important attributes (previous DO, WTE, air temperature, and air pressure) gave successful predictions and that the MLR model was successful among the individual algorithms (Xu et al. 2021); hybrid models combined with WT were more successful in overall performance.
Estimation of algal blooms is also an important consideration. In one study, the Bat algorithm was used as a complementary model. The Bat-ELM algorithm was most successful in estimating the daily Chl-a concentration (Alizamir et al. 2021). In another study, hyperspectral and RGB images obtained by UAV were subjected to multiple pre-processing processes and compared with measurements with sensors on the ground.
The most successful results were obtained with the RF algorithm, showing that the Chl-a ratio can be determined reliably from UAV imagery (Liu et al. 2021). A notable study was carried out on the world's largest hydroelectric dam by combining data from online monitoring devices and laboratory measurements; the XGBoost-LSTM hybrid approach obtained the most successful result in all scenarios (Shan et al. 2022). In some cases, classification was performed using WQIs. In one study based on a considerable amount of data, classification was made according to the Chinese quality standard (GB3838-2002), and DCF obtained the most successful result (Chen et al. 2020a). The DL algorithm used in a study on the quality classification of groundwater yielded much better estimates than the other three classifiers (Singha et al. 2021). In a recent study targeting the WQI of an openly shared dataset, CATBoost obtained the most successful result (Nasir et al. 2022).
FIB estimation is also an important consideration. In one study, a dataset was created by combining data from many sources, and E. coli prediction and bathing-area classification were performed (García-Alba et al. 2019). In some studies, more than one WQP was estimated as output. A hybrid algorithm built around WT removed noise and provided more successful estimates of ammoniacal nitrogen, pH, and TSS (Najah Ahmed et al. 2019). In another study, a multiscale estimate of each input was produced as output for DO, the permanganate index, pH, and total phosphorus, and RNN combined with evidence theory obtained more successful results at every stage (Li et al. 2019). A study on water salinity is notable for its pre-processing techniques and its comparison of combined in situ measurements and satellite data; the most successful results were obtained with the GA-MLP hybrid method (Ansari & Akhoondzadeh 2020).
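As an illustration of the WT-based pre-processing used by such hybrids, the following sketch soft-thresholds the wavelet detail coefficients of a noisy series before it is handed to a downstream model. It assumes the PyWavelets package, and the wavelet family, decomposition level, and threshold rule are our own illustrative choices rather than those of the cited studies.

```python
# Minimal sketch of wavelet-transform (WT) denoising as a pre-processing step.
# Requires PyWavelets (pywt); all settings are illustrative assumptions.
import numpy as np
import pywt

def wt_denoise(signal, wavelet="db4", level=3):
    """Soft-threshold the detail coefficients and reconstruct the series."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))   # universal threshold
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

# Example: a noisy synthetic series standing in for a WQP time series.
t = np.linspace(0, 10, 512)
noisy = np.sin(t) + np.random.default_rng(1).normal(scale=0.3, size=t.size)
denoised = wt_denoise(noisy)   # pass `denoised` to an MLP/SVM/etc. afterwards
```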
The estimation of groundwater contaminants is also essential to the WQ process. In one study, arsenic levels in thousands of wells were classified into three classes according to specific threshold values, and SGB achieved more than 90% success (Lombard et al. 2021). In another, very comprehensive study, iron and manganese levels in the groundwater of five countries were classified against threshold values and the pollution was mapped; each of the two ML algorithms used proved superior in different tasks (Podgorski et al. 2022).
ML IN WSM PROCESSES
The WSM process refers to a series of physical, chemical, and biological stages covering the collection, treatment, and discharge of natural water and the treatment, distribution, and collection of wastewater, carried out with minimal damage to nature and with the goal of protecting WQ. The main facilities used in these stages are water treatment plants (WTPs), where water is brought to a higher quality; water distribution systems (WDSs), which include all systems through which water is distributed; and wastewater treatment plants (WWTPs), where wastewater is treated. Although there are numerous studies on WSM, Table 3 includes some recently published studies that summarize the use of ML in WSM in general.
Using ML algorithms for WSM processes
Task | Authors (year) | Input variables | Data pre-processing | Target | ML techniques | Evaluation metrics
---|---|---|---|---|---|---
aqueous adsorption | Jun et al. (2020) | agitation speed, contact time, hydrogen peroxide, pH | normalization | optimum parameters for maximum cleaning of adsorbate (Adsorbate is ‘methylene blue’ and Adsorbent is ‘jicama peroxidase used buckypaper/polyvinyl alcohol membrane’) | PSO-MLP, RSM | R² |
aqueous adsorption | Radmehr et al. (2021) | nalidixic acid, NC dose, pH, contact time, temperature | NI | optimum parameters for maximum cleaning of adsorbate (Adsorbate is ‘nalidixic acid’ and Adsorbent is ‘NiZrAl-layered double hydroxide-graphene oxide-chitosan’) | ANFIS, GRNN, RSM, RSM-GA, RSM-DFA | kappa, MAE, MSE, R² |
chlorination and disinfection | Peleato (2022) | fluorescence spectra | DR | prediction of DBP with chlorine residual (Target DBP: bromodichloromethane, dichloroacetic acid, total trihalomethanes, trichloromethane, total haloacetic acids, trichloroacetic acid) | CNN, MLP, PCA-MLP, PARAFAC-MLP | MAE |
defect detection | Yin et al. (2021) | labeled CCTV videos acquired by the pipe inspection crawler | NI | sewer pipe defect type (broken, cracked, deposits, etc.) and defect location | CNN (YOLOv3) | F1-score |
fault detection | Mamandipoor et al. (2020) | 12 different chemical/operational sensors (including ammonia, nitrate etc. measurements) | MVI | detecting faults during the oxidation and nitrification processes in WWTP | PCA-SVM, LSTM | accuracy, F1-score, precision, recall |
leak detection | Mashhadi et al. (2021) | water flow at the three supply sections, water pressure values at five observation points | DR | detection and localization of leaks in WDS | LR, RF, DT, PCA, k-means, MLP | accuracy, F1-score, precision, recall |
membrane filtration parameters | Shim et al. (2021) | (dissolved organic carbon, fouling thickness, initial flux, modified fluorescence regional integration, operation time, pressure) and (real-time OCT images) | DR, black/white noise removal | membrane fouling growth (fouling thickness and permeate flux) | LSTM | R² |
membrane filtration parameters | Srivastava et al. (2021) | feed concentration/pH/pressure/temperature | NI | treatment of brackish groundwater (permeate flux, salt rejection, specific energy consumption, water recovery) | MLP, RSM | R² |
wastewater treatment | Moreno-Rodenas et al. (2021) | images obtained from the embedded camera system to observe the formation of different water levels and oil layers | DA | estimation of grease, fat, and oil in wastewater pumping stations | CNN (VGG16) | accuracy |
wastewater treatment | Gopi Kiran et al. (2021) | hydraulic retention times, heavy metal concentration (cadmium, copper, lead) | normalization | COD/heavy metal removal efficiency by RBC treatment | MLP | MAPE, R |
While chlorination is used to disinfect bacteria, viruses, and other microbes in water, chlorine also poses some dangers to human health. Dangerous substances called disinfection by-products (DBPs) are sometimes formed by chemical reactions between treatment agents and the organic and inorganic substances in water. In a study conducted to detect these substances, the CNN algorithm estimated DBPs more successfully than the hybrid methods, without needing data pre-processing (Peleato 2022).
Optical coherence tomography (OCT) images and in situ data were taken as input in a study of WTP processes, and parameters related to membrane fouling in the water filtration process were predicted using LSTM (Shim et al. 2021). Another study conducted successful trials with MLP for the treatment of brackish groundwater using an NF-RO hybrid membrane system (Srivastava et al. 2021).
A study on leak detection and localization in WDSs tried six different algorithms, and ANN, RF, and LR obtained successful results (Mashhadi et al. 2021). In another study, an automated sewer-pipe assessment system was designed in which meaningful text was extracted from labeled videos captured by a camera-equipped tracked CCTV crawler and reported automatically; a trained YOLOv3 algorithm, an excellent architecture for image-based object detection, was used for classification (Yin et al. 2021).
In a study on WWTP systems, more than 5.1 million data samples were analyzed for fault detection with the aim of increasing treatment efficiency, and LSTM obtained more successful results than the other hybrid networks (Mamandipoor et al. 2020). In another study, intelligent hardware was designed to automate the detection and cleaning of oil layers in wastewater pumping stations, and the CNN algorithm produced a highly successful system (Moreno-Rodenas et al. 2021). In a further study, the rotating biological contactor (RBC) reactor used in the secondary treatment of water was tested with MLP for COD and HM removal, and successful results were observed (Gopi Kiran et al. 2021).
The ‘aqueous adsorption’ process, in which adsorbates in the water are removed with the help of various adsorbents, is also essential for WSM. In one study, a notable analysis was made of cleaning industrial wastes by adsorption in an aqueous medium, and the hybrid algorithm obtained more successful results (Jun et al. 2020). In another study on ideal adsorption conditions, the most successful results were obtained with ANFIS among individual models and RSM-DFA among hybrid approaches (Radmehr et al. 2021).
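For readers unfamiliar with RSM, the underlying idea can be sketched as fitting a quadratic response surface to designed experiments and then searching that surrogate for the most favourable operating conditions; the variables, ranges, and synthetic response below are hypothetical and do not reproduce either cited study.

```python
# Minimal sketch of an RSM-style quadratic response surface for adsorption
# conditions. All factors, ranges, and the response are synthetic assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
# Hypothetical design points: columns stand for pH, contact time, adsorbent dose.
X = rng.uniform([2, 10, 0.1], [10, 120, 2.0], size=(30, 3))
y = (90 - (X[:, 0] - 6) ** 2 - 0.001 * (X[:, 1] - 60) ** 2
        - 5 * (X[:, 2] - 1) ** 2 + rng.normal(scale=1.0, size=30))

surface = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
surface.fit(X, y)                       # second-order response surface

# Search the surrogate on a grid for the predicted optimum settings.
grid = np.array(np.meshgrid(np.linspace(2, 10, 25),
                            np.linspace(10, 120, 25),
                            np.linspace(0.1, 2.0, 25))).reshape(3, -1).T
best = grid[np.argmax(surface.predict(grid))]
print("predicted optimum (pH, time, dose):", np.round(best, 2))
```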
EXPERIMENTS WITH TWO DATASETS
In this review study, after examining many algorithms and publications, trials were conducted with some of the previously mentioned publicly available datasets. Two datasets closely related to the selected topics, compiled in two different areas, were subjected to tests with various parameters. The first, ‘Dataset 1: WQ prediction’, is a regression problem in which the next day's pH value is estimated. The second, ‘Dataset 2: Monitoring of drinking WQ’, is a classification problem in which anomalies in water quality are detected. The tests show that, with good optimization, more successful results can be obtained than previously reported, and they demonstrate the flexibility of the studied ML technique across different problem types (classification and regression). In addition, the techniques and the transparency practices proposed in this article are summarized by applying them to the selected datasets.
Results obtained using ELM in dataset trials: (a) RMSE outputs of Dataset 1 and (b) F1-score outputs of Dataset 2.
Dataset 1: WQ prediction
A dataset used by Zhao et al. and later shared on the UCI platform was analyzed; it contains values taken from 37 different water sources between 2016 and 2018 (a total of 705 days), with the aim of predicting the next day's average pH (Zhao et al. 2019).
The dataset consists of 26,085 (705 × 37) rows, 11 attribute columns (the maximum, minimum, and average of the DO, EC, pH, and WTE parameters), and one label column (the next day's pH). Since the data were provided already normalized, no re-normalization was performed. For a sound analysis, the data were shuffled and validated with the 5-fold cross-validation technique.
RMSE was used as the evaluation metric, and the detailed results are given in Figure 4(a). The most successful result, an RMSE of 0.0048, was obtained with 200 hidden neurons and the ‘radbas’ activation function in ELM. The entire data pre-processing, validation, training, and testing process took 71 s of runtime in MATLAB on the hardware mentioned earlier. The best error rate reported in the original study was 0.0115, so this study achieved an appreciably better result, with less than half the original error (Zhao et al. 2019).
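As a rough illustration of the configuration just described, the sketch below implements a basic ELM regressor in NumPy with 200 hidden neurons and a ‘radbas’-style activation (exp(-a²)), evaluated with 5-fold cross-validation. The original runs were carried out in MATLAB, and the synthetic stand-in data mean the printed RMSE is not comparable to the 0.0048 reported above.

```python
# Minimal NumPy sketch of the ELM regression set-up used for Dataset 1.
# The real experiments ran in MATLAB; here X and y are synthetic stand-ins,
# so only the pipeline (random hidden layer, least-squares output weights,
# 5-fold CV, RMSE) is illustrated.
import numpy as np
from sklearn.model_selection import KFold

def elm_fit(X, y, n_hidden=200, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1, 1, size=n_hidden)
    H = np.exp(-np.square(X @ W + b))        # 'radbas'-style activation
    beta = np.linalg.pinv(H) @ y             # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.exp(-np.square(X @ W + b)) @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(26085, 11))             # stand-in for the 11 attributes
y = 0.1 * (X @ rng.normal(size=11)) + rng.normal(scale=0.01, size=26085)

rmses = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    W, b, beta = elm_fit(X[train], y[train], n_hidden=200)
    pred = elm_predict(X[test], W, b, beta)
    rmses.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
print(f"mean 5-fold RMSE: {np.mean(rmses):.4f}")
```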
Dataset 2: monitoring of drinking WQ
The dataset of the GECCO Challenge 2017, which was held for anomaly detection in WQ, was analyzed (Moritz et al. 2017). The input consists of nine attribute columns (chlorine dioxide amounts, EC, flow rates, pH, redox, turbidity, and WTE variables) and a label column marked as normal or abnormal. The original dataset has 122,335 rows, which decreases to 110,815 rows when the rows with missing values are deleted. MVI pre-processing methods were not tried because the deleted rows had missing values in all columns. Z-score normalization was applied to the data, and since anomaly detection is performed over a time series, the dataset was simply divided into 60% training and 40% testing instead of using classical k-fold cross-validation. Because the F1-score was requested in the relevant competition and used in other studies of this dataset, outputs are reported in the same way here; they are given in detail in Figure 4(b).
The most successful result, an F1-score of 0.9993, was obtained with 400 hidden neurons and the ‘sig’ activation function in ELM. The entire data pre-processing, validation, training, and testing process took 92 s of runtime in MATLAB on the hardware mentioned earlier. Muharemi et al., who won the competition and further improved its results in their publication, obtained an F1-score of 0.9891 using SVM (Muharemi et al. 2019). The higher score obtained in this study once again demonstrates the effectiveness and speed of the ELM algorithm.
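The Dataset 2 pipeline can be sketched in the same way: z-score normalization, a chronological 60/40 split, an ELM with a sigmoid (‘sig’) activation used as a binary classifier, and F1-score evaluation. The data below are again a synthetic stand-in, so the printed score is meaningless; only the sequence of steps is illustrated.

```python
# Minimal NumPy sketch of the Dataset 2 workflow (synthetic stand-in data;
# replace X and y with the real 110,815-row sensor matrix and labels).
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 9))                   # 9 sensor attributes
y = (rng.random(20000) < 0.02).astype(float)      # rare "abnormal" labels

X = (X - X.mean(axis=0)) / X.std(axis=0)          # z-score normalization
split = int(0.6 * len(X))                         # chronological 60/40 split
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

n_hidden = 400
W = rng.uniform(-1, 1, size=(X.shape[1], n_hidden))
b = rng.uniform(-1, 1, size=n_hidden)
sig = lambda a: 1.0 / (1.0 + np.exp(-a))          # 'sig' activation
beta = np.linalg.pinv(sig(X_tr @ W + b)) @ y_tr   # ELM output weights
pred = (sig(X_te @ W + b) @ beta > 0.5).astype(float)
print("F1-score:", round(f1_score(y_te, pred), 4))
```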
DISCUSSIONS AND CONCLUSIONS
ML algorithms continue to develop, and complex algorithms are becoming easier to use thanks to technological infrastructures. To work with ML in WQ and WSM processes and to generalize strategies effectively at a global level, several issues need to be addressed. One of these is the quantity and quality of the datasets. Complete datasets with high resolution, transparent sampling frequency, and verifiability from multiple sources are very valuable. There is, therefore, an urgent need for publicly available WQ and WSM datasets that are well analyzed, accurately labeled, and, where possible, integrate data from multiple sources (on-site measurement, IoT, UAV, satellite, etc.). To this end, the water research community should be encouraged to share properly obtained research data without restriction (Huang et al. 2021).
Another factor is the choice of ML algorithm. Although there is no single algorithm suited to every task in water science, ANFIS, MLP, RF, and SVM are the ones most often used in the field, and CatBoost, ELM, and XGBoost have achieved considerably successful results in the tasks to which they have been applied. Hybrid approaches, generally created by combining complementary techniques, have also outperformed single algorithms. In general, simple techniques have performance flaws despite their low time costs, while high-performing methods that require large datasets often involve complex structures and high hardware/time costs. Especially after the extraordinary success of DL on all kinds of data, interest in DL algorithms is striking; one reason for this popularity is the ease of image-based processing with CNNs and the ability of LSTM and GRU algorithms to deal successfully with time series. Sharing the hardware used and the processing time spent should also be encouraged to allow detailed comparison.
ML models may not be easy to grasp and apply, and understandably, a water researcher may not be entirely familiar with these techniques. For this reason, developments such as interdisciplinary communication, increased cooperation with data mining experts, geographical and intuitive visualizations, and the spread of cloud-based computing applications (GEE and MPC) are gaining importance (Chen et al. 2022; Yang et al. 2022). The ethical implications of a study should also be considered, given the importance of the subjects studied: biases in the dataset and the sensitivity of the study's content can affect the public planning models built on it (Sit et al. 2020). For this reason, transparency should be emphasized at every stage of studies involving such vital processes.
This broad-spectrum article has provided an understanding of the processes applied when seeking machine learning solutions to problems in both WQ and WSM. For both beginner and advanced researchers, it presents a descriptive introduction to the basic concepts, a bibliometric analysis of an extensive literature review, the types of data sources used, the pre-processing applied to the data, the machine learning algorithms employed, the topics in which they are used, and the evaluation criteria for the outputs. The machine learning techniques introduced were selected and categorized according to their effectiveness, and the methods are illustrated with examples from the most recent work. In addition, two datasets closely related to the selected topics, compiled in two different fields, were tested with various parameters. The challenges and limitations of the WQ and WSM processes were discussed, along with the points at which transparency is required for the development and reproducibility of research. It is hoped that this study will benefit water researchers by presenting a general summary of the use of ML.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.