Managing water resources and determining the quality of surface water and groundwater are among the most significant issues for human and societal well-being. Maintaining water quality and managing water resources are complicated by human-induced errors, so applications that facilitate and enhance these processes have gained importance. In recent years, machine learning techniques have been applied successfully to the preservation of water quality and to the management and planning of water resources, and water researchers have integrated these techniques into public management systems. In this study, the data sources, pre-processing steps, and machine learning methods used in water research are briefly described, and the algorithms are categorized. Then, a general summary of the literature on water quality determination and on applications in water resources management is presented. Lastly, the study is illustrated with machine learning experiments on two publicly shared datasets.

  • Preserving water quality and managing water resources are vital.

  • Data acquisition and pre-processing strategies are presented for water researchers.

  • Machine learning algorithms in water research are categorized and described.

  • Examinations with machine learning are exemplified for the analyzed topics.

Graphical Abstract

AI

artificial intelligence

ANFIS

adaptive-network-based fuzzy inference system

ANN

artificial neural network

AUC

area under the ROC curve

BMA

Bayesian model averaging

BOD

biochemical oxygen demand

Chl-a

chlorophyll-a

CNN

convolutional neural network

COD

chemical oxygen demand

CRT

completely random tree

DA

data augmentation

DBP

disinfection by-products

DCF

deep cascade forest

DENFIS

dynamic evolving neural fuzzy inference system

DFA

desirability function analysis

DL

deep learning

DNN

deep neural network

DO

dissolved oxygen

DR

dimensionality reduction

DT

decision tree

E. coli

Escherichia coli

EC

electrical conductivity

ELM

extreme learning machine

FIB

fecal indicator bacteria

FIS

fuzzy inference system

FS

feature selection

GA

genetic algorithm

GAN

generative adversarial network

GBM

gradient boosting machines

GEE

Google Earth Engine

GMDH

group method of data handling

GRNN

generalized regression neural network

GRU

gated recurrent unit

HM

heavy metal

IoT

internet of things

kNN

k-nearest neighbors

LDA

linear discriminant analysis

LR

logistic regression

LSTM

long short-term memory

MAE

mean absolute error

MAPE

mean absolute percentage error

ML

machine learning

MLP

multilayer perceptron

MLR

multiple linear regression

MPC

Microsoft planetary computer

MSE

mean square error

MVI

missing value imputation

NARX

nonlinear autoregressive exogenous

NC

nitrogen compounds

NI

no information

NIOA

nature-inspired optimization algorithms

NN

neural network

NSE

Nash–Sutcliffe efficiency

OCT

optical coherence tomography

OLS

ordinary least squares

PARAFAC

parallel factor analysis

PCA

principal component analysis

pH

potential of hydrogen

PLS

partial least squares

PSO

particle swarm optimization

RBC

rotating biological contactor

RBFNN

radial basis function NN

RF

random forest

RMSE

root mean square error

RNN

recurrent neural network

ROC

receiver operating characteristic

RS

remote sensing

RSM

response surface methodology

RSR

RMSE observations standard deviation ratio

SDG

sustainable development goals

SGB

stochastic gradient boosting

SOM

self-organizing map

SPM

suspended particulate matter

SVM

support vector machine

SVR

support vector regression

TDS

total dissolved solids

TH

total hardness

TSS

total suspended solids

UAV

unmanned aerial vehicle

VS

virtual sensors

WDS

water distribution system

WQ

water quality

WQI

Water Quality Index

WQP

water quality parameter

WSM

water supply and management

WT

wavelet transform

WTE

water temperature

WTP

water treatment plant

WWTP

wastewater treatment plant

XGBoost

extreme gradient boosting

YOLO

you only look once

While water covers approximately 71% of the Earth's surface, only about 3% of the world's water (ice caps/glaciers, groundwater, lakes, soil, atmosphere, rivers, etc.) is fresh water, and the amount of usable fresh water is approximately 0.5% (U.S. Bureau of Reclamation 2020). Given that water resources are in such a critical state, water quality (WQ) monitoring and water supply and management (WSM) are becoming increasingly vital. However, anthropogenic activity is the most significant cause of WQ degradation and of damage to the ecosystem. Human beings, who are both polluters and protectors of nature, urgently need to increase their environmental awareness and their restorative role in order to maintain their existence. For this purpose, the importance of water should remain at the top of the agenda of both states and supranational organizations. Freshwater consumption and ocean acidification have been highlighted as two of the nine key variables that require the most attention for the well-being of humanity and the world (Rockström et al. 2009; Steffen et al. 2015).

In 2015, 17 high-priority goals were set by the United Nations (UN) within the scope of sustainable development goals (SDG). One of these goals, SDG 6, has a vital vision to provide clean, quality water and sanitation for all (UN Environment Programme 2021). Increasing awareness about the value of water, recognizing, monitoring, and determining its quality, and integrating it into the system of decision-makers is critical to ensure sustainable and equitable water resource management (UNESCO 2021).

Although there are some differences between surface water and groundwater, dozens of parameters measured from the physical, chemical, and biological properties of water are used to determine WQ. Some WQ parameters (WQPs) frequently used in the literature are as follows (Davie 2008; Spellman 2017; Omer 2019). (1) Physical WQPs include water temperature (WTE), color, total dissolved solids (TDS), total suspended solids (TSS), turbidity, and electrical conductivity (EC). (2) Chemical WQPs include dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), the potential of hydrogen (pH), nitrogen compounds (NC), heavy metals (HM), and total hardness (TH). (3) Biological WQPs include algae (Chl-a, etc.) and fecal indicator bacteria (FIB) (Escherichia coli, etc.). Some WQPs may have indirect effects both on the WQ and on other WQPs.

WQ indices (WQIs) combine more than one WQP to determine WQ. A general WQI method expresses the selected water parameters of the analyzed medium as a single output through mathematical equations. The resulting output divides WQ into categories such as good, medium, and poor quality (or similar classes), which are usually set as classification targets. Work on the first WQI standard began in the mid-1960s, when an index was calculated from the physical and chemical factors of water bodies (Horton 1965; Yan et al. 2022). After the WQI method proposed by Horton, many WQI methods were developed for various tasks by major organizations worldwide (Akhtar et al. 2021). For example, the foundations of NSFWQI were laid in 1970 with the contributions of the National Sanitation Foundation (NSF) (Brown et al. 1970), and the Canadian WQI (CWQI) prepared by the Canadian Council of Ministers of the Environment (CCME) was published in 2001 (CCME 2001). However, there is no universal WQI for evaluating WQ, although many studies have been conducted to create more efficient WQIs for different water types (surface water, groundwater) (Sutadian et al. 2016).
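As a rough illustration of how a WQI turns several WQPs into one score and then into a category, the following minimal sketch uses a weighted arithmetic mean of sub-index scores. It does not reproduce NSFWQI, CWQI, or any other published index: the function names, the 0 to 100 sub-index scale, the weights, and the category cut-offs are all assumptions chosen for the example.

```python
def weighted_wqi(sub_indices, weights):
    """Weighted arithmetic mean of sub-index scores (each assumed to be on a 0-100 scale)."""
    if sub_indices.keys() != weights.keys():
        raise ValueError("sub_indices and weights must cover the same parameters")
    total_weight = sum(weights.values())
    return sum(weights[k] * sub_indices[k] for k in sub_indices) / total_weight


def classify_wqi(score, good=70.0, medium=50.0):
    """Map a WQI score to a category; the cut-offs are illustrative assumptions."""
    if score >= good:
        return "good"
    if score >= medium:
        return "medium"
    return "poor"


# Hypothetical sub-index scores and weights for four WQPs.
sub_indices = {"DO": 80.0, "BOD": 60.0, "pH": 90.0, "turbidity": 50.0}
weights = {"DO": 0.4, "BOD": 0.3, "pH": 0.2, "turbidity": 0.1}

score = weighted_wqi(sub_indices, weights)
category = classify_wqi(score)
print(f"WQI = {score:.1f} ({category})")
```

In an ML setting, the category returned by a function such as `classify_wqi` would typically serve as the classification target, with the raw WQPs as input features.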

Traditionally, in situ measurement is applied when determining WQ. Measured samples are tested in a laboratory, and WQPs are estimated and decided using the WQI. These techniques can provide accurate values but are often time-consuming, uneconomical, and incapable of demonstrating fundamental temporal and spatial changes in WQ (Wagle et al. 2020). A study that compared sensors and laboratory measurements and showed cost analysis indicated that laboratory measurements were more expensive (Paepae et al. 2021).

This study analyzes studies that use machine learning (ML) models on WQ and WSM. Water data are challenging because they are generally nonlinear and non-stationary and are affected by unpredictable natural and artificial factors. ML models, which have gained the trust of the scientific community in almost every field where they have been applied, can be used to understand and learn from these data and to extract information with high accuracy. Numerous studies have been conducted on using ML models in water-related areas, and effective results have been obtained in many application areas, including the smart city concept. WQ- and WSM-focused analyses using ML are also frequently studied in the literature. In order to systematically clarify the roadmap and concepts in this study, many review articles were analyzed in detail.

In a study in the field of WQ assessment, a massive summary of the studies conducted to determine the WQ of rivers between 2000 and 2020 was presented, and the ML methods used were systematically classified (Tiyasha et al. 2020). Another review analyzed WQ studies in groundwater (Hanoon et al. 2021). In another study, detailed technical analyses of some ML architectures used in the WQ determination process were made. Many technical issues were examined, especially data processing, identifying deficiencies in the data, and dividing the data during the training phase (Chen et al. 2020b). Some other detailed reviews have also been examined (Hassan & Woo 2021; Paepae et al. 2021).

Although the number of studies that analyze the use of ML in detail for both WQ and WSM is relatively small, some critical reviews in this area have been analyzed. One review covered most of the issues that this study deals with and explained their sub-topics (Huang et al. 2021). Alongside some critical reviews (Sit et al. 2020; Lowe et al. 2022), several studies that analyzed only the WSM side using ML were also reviewed (Kirstein et al. 2020; Rahim et al. 2020).

The remainder of the work is organized as follows. Section 2 presents bibliometric analyses outlining the use of ML in the WQ and WSM processes. Section 3 describes data acquisition sources for WQ and WSM and the pre-processing methods applied to the obtained data. Section 4 classifies ML algorithms and describes the metrics used to evaluate their results. Sections 5 and 6 review studies on using ML in WQ and WSM processes, respectively. In Section 7, experiments are conducted with an ML method on two publicly shared datasets, and the results are compared. Section 8 closes the article with discussions and conclusions.

The bibliometric data of the WQ and WSM tasks were analyzed and visualized under separate titles, and the study was enriched with word clouds and numerical data on the obtained keywords. There are thousands of studies on using ML for WQ and WSM in the literature. Since not all of these studies could be analyzed in detail, the ‘Scopus’ database was examined to grasp the essence of the studies. Scopus was chosen because it gives access to more studies than the ‘Web of Science’ database for the identified key queries. Extensive work has been conducted comparing these two databases (Chadegani et al. 2013; Martín-Martín et al. 2018; Singh et al. 2021).

For the WQ title, the query ‘TITLE-ABS-KEY (‘water quality’ AND (‘machine learning’ OR ‘deep learning’ OR ‘neural network*’)) AND (LIMIT-TO (LANGUAGE, ‘English’))’ was used. The query covered only studies in the English language; the oldest study found dates from 1990, and there were 3,802 studies in total as of September 2022. For the WSM title, the query ‘TITLE-ABS-KEY ((water W/5 (management OR administration OR supply)) AND (‘machine learning’ OR ‘deep learning’ OR ‘neural network*’)) AND (LIMIT-TO (LANGUAGE, ‘English’))’ was used, again restricted to studies in the English language. The oldest study is a different one from 1990; as of September 2022, a total of 5,309 studies had been found. When the keywords for both titles are combined in a single Scopus search, the number of results is nearly 1,000 lower than the sum of the two separate searches. This shows that many documents are common to both subjects, which reflects the close relationship between WQ and WSM.

A word cloud is a visual representation of word frequency. The more frequently a term appears in the analyzed text, the larger the word appears in the rendered image. The word clouds used in this study were created with an open-source, multifunctional bibliometric analysis tool distributed as a library that can be installed in the R language (Aria & Cuccurullo 2017).

When creating the word cloud for WQ, redundant synonymous keywords were removed from the list. For example, for the keyword phrase ‘artificial neural network,’ synonymous keywords such as ‘artificial neural networks, ann, artificial neural network (ann)’ were excluded. In addition, the terms ‘water quality,’ ‘machine learning,’ and ‘artificial neural network’ were excluded to increase the visibility of other keywords, because their frequencies are so high that they would take up much of the space in the word cloud.
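The keyword clean-up described above can be sketched in a few lines. This is not the R workflow used in the study; it is a hypothetical Python illustration in which the synonym map, the exclusion set, the function name, and the sample keywords are assumptions based on the examples given in the text.

```python
from collections import Counter

# Synonyms collapsed onto one canonical phrase (taken from the example in the text).
SYNONYMS = {
    "artificial neural networks": "artificial neural network",
    "ann": "artificial neural network",
    "artificial neural network (ann)": "artificial neural network",
}

# Dominant terms left out so that other keywords remain visible.
EXCLUDED = {"water quality", "machine learning", "artificial neural network"}


def keyword_frequencies(keywords, top_n=200):
    """Normalize keywords, merge synonyms, drop excluded terms, and count the rest."""
    counts = Counter()
    for keyword in keywords:
        normalized = keyword.strip().lower()
        normalized = SYNONYMS.get(normalized, normalized)
        if normalized not in EXCLUDED:
            counts[normalized] += 1
    return counts.most_common(top_n)


# Hypothetical author keywords from a handful of records.
sample = ["Water quality", "ANN", "Artificial neural networks",
          "Remote sensing", "remote sensing", "Turbidity"]
top_keywords = keyword_frequencies(sample)
print(top_keywords)
```

The resulting (keyword, frequency) pairs are the kind of input a word cloud renderer scales font sizes by.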

While creating the word cloud for WSM, some synonymous keywords were likewise removed from the list. Moreover, to optimize visibility, the terms ‘machine learning’ and ‘artificial neural network,’ which have high frequencies, were excluded. Details of the various bibliometric analyses for both WQ and WSM titles are presented in Figures 1 to 3. Parts (a) of Figures 1 to 3 were created for the WQ process, and parts (b) were created for the WSM process. Figure 1 shows the word clouds; Figure 2 gives a waterfall-style view of the number of studies carried out in the determined year intervals; and Figure 3 maps the 20 countries with the most studies according to the article authors. The word cloud in Figure 1 gives a general idea of the contents of the topics. It was built from the 200 most frequently used keyword phrases, filtered as detailed above.
Figure 1

Word cloud representation created with the keywords obtained as a result of the literature search: (a) for WQ and (b) for WSM.

Figure 2

Representation of the number of studies according to the year intervals determined as a result of the literature review: (a) for WQ and (b) for WSM.

Figure 3

Representation of 20 countries with the most studies according to the authors on the map: (a) for WQ and (b) for WSM.


Figure 2 shows that the number of studies has increased over the years. This growth indicates how popular the topics have become, especially in the past 5 years. Figure 3 ranks the countries by their total number of publications, based on single-country publications (SCPs) and multiple-country publications (MCPs). According to this information, China, the United States, India, and Iran hold the first four places in both titles. The number of publications from the African continent, which hosts many countries and has significant water problems, is low.

Data sources

The quality and quantity of data are highly critical to an appropriate and well-trained ML model. Water management systems produce a wide variety of large datasets (Eggimann et al. 2017; Sun & Scanlon 2019). The first step in WQ and WSM evaluation is collecting measurements from various data sources. Data are generally obtained through on-site measurements, laboratory analyses, and sensors, using the internet of things (IoT) and geographic monitoring systems (satellite, unmanned aerial vehicle (UAV), camera, etc.). Two things are essential: reliable water sensors that can monitor the complex physicochemical and biochemical reactions occurring in water systems, and a spatio-temporal data processing capability that can instantly process large amounts of data collected over varying periods. Water analysis sensors are generally electrochemical (voltammetric/amperometric, potentiometric, conductometric) sensors, optical sensors, and biosensors (Huang et al. 2022). Water monitoring approaches are divided into several temporal types based on sensing capabilities, monitoring environment, and target. In this sensor-based temporal classification, long-term continuous monitoring (LTCM) has been identified as the best approach. LTCM collects data continuously, has a high monitoring capacity, successfully detects abnormal events, and requires a high-performance sensor (Huang et al. 2022). High-quality data of this kind support contingency planning for a wide range of scenarios, which helps to avoid many water-related problems.

Sensor data can be received from devices placed in the water, on satellites, on UAVs, and in the SCADA system. IoT sensors can monitor WQPs in near real-time, helping to record much more data with much higher temporal resolution (Yang et al. 2022). IoT plays a significant role in the WQ and WSM concept as an element that enables the automation of the entire system. Remote sensing (RS) by satellites is one of the essential methods for observing the Earth's surface and plays a vital role in many fields. According to the Union of Concerned Scientists (updated on May 1, 2022), there are 5,465 satellites for various missions (civil, commercial, military, Earth observation, etc.) in Earth orbit. Important satellites that provide much free data include Landsat (the first civilian Earth observation satellite program, active since 1972), Sentinel-2 (launched by the European Space Agency in 2015), Zhuhai-1 (launched by China in 2017), and Landsat 9 (launched by NASA in 2021). The countries with the most satellites in orbit are the USA, China, and Russia. Some studies classify the relevant satellites by spatial resolution (coarse, medium, and high) together with other characteristics (years of operation, spatial and temporal resolution, spectral range, free availability of data, etc.) (Chen et al. 2022).

The RS big data obtained by satellite have been found to have excellent potential for estimating optically active water parameters (Chen et al. 2022). Some of the main factors in determining WQ with RS are Chl-a, suspended particulate matter (SPM), and colored dissolved organic matter (CDOM) (Dörnhöfer & Oppelt 2016; Chen et al. 2022). Some WQ variables cannot be measured directly with RS because they are not optically active or because high-resolution hyperspectral data are lacking. Real-time detection of specific parameters with sensors is not always easy or economical, so predictions can be made by creating virtual (soft) sensors (VS) from other measurable data (Paepae et al. 2021). To some extent, VSs can provide near real-time monitoring of parameters that are difficult to measure. For example, direct COD measurement with sensors is very costly. Instead, COD can be estimated with a VS built on model-based calculations from measurements taken with much cheaper basic sensors (pH, EC, Chl-a, temperature, DO, etc.) (Paepae et al. 2021).
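The virtual sensor idea can be sketched as a regression from inexpensive sensor readings to a costly target such as COD. The example below is a minimal illustration, not a method taken from the cited studies: the data are synthetic, the coefficients that generate the "laboratory" COD values are invented, and a plain least-squares linear model stands in for whatever ML model a real VS would use.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic readings from inexpensive sensors (hypothetical value ranges).
ph = rng.uniform(6.5, 8.5, n)
ec = rng.uniform(200.0, 800.0, n)
do = rng.uniform(4.0, 10.0, n)
temp = rng.uniform(10.0, 25.0, n)
X = np.column_stack([ph, ec, do, temp])

# Synthetic "laboratory" COD values; the relationship is invented for the demo.
cod = 5.0 + 0.03 * ec - 1.5 * do + 0.4 * temp + rng.normal(0.0, 0.5, n)

X_train, X_test = X[:150], X[150:]
y_train, y_test = cod[:150], cod[150:]


def fit_linear(features, target):
    """Ordinary least squares with an intercept column."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef, features):
    """Apply the fitted intercept and slopes to new sensor readings."""
    return coef[0] + features @ coef[1:]


coef = fit_linear(X_train, y_train)
cod_estimate = predict_linear(coef, X_test)
rmse = float(np.sqrt(np.mean((cod_estimate - y_test) ** 2)))
print(f"Virtual-sensor RMSE on held-out samples: {rmse:.2f}")
```

Once fitted, `predict_linear` can be applied to each new set of cheap sensor readings, which is what allows a VS to approximate near real-time monitoring of the expensive parameter.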

Efficient processing and analysis of the data obtained with RS may involve serious difficulties (Zhang et al. 2021). These include visibility problems caused by clouds and rainy weather, the low reflectance of water, the storage and processing of big data, and the estimation of optically inactive water parameters (Chen et al. 2022). In addition, parameters related to image resolution, atmospheric window state, and spectral bands are important when obtaining data from satellites. Despite these limitations, image-based water turbidity detection using RS has been reported to be as accurate as actual turbidity measurement techniques and is an encouraging method for monitoring WQ at larger spatial scales (Yang et al. 2022).

Cloud-based computing platforms such as the Google Earth Engine (GEE) (Gorelick et al. 2017) and the Microsoft planetary computer (MPC) are also available for processing satellite data and other geographic data. Such platforms let researchers focus on their main research questions rather than spending time downloading and processing large amounts of data. In addition, EPANET software, designed by the Environmental Protection Agency (EPA) to run hydraulic and WQ simulations and display the results in various formats, is frequently used for data generation and simulation.

Aquatic eutrophication seriously threatens aquatic ecosystems, resulting in the death of aquatic organisms. Therefore, it is crucial to monitor Chl-a levels in water bodies and identify algal blooms before they occur (Yang et al. 2022). Water eutrophication can be evaluated qualitatively based on color characterization. By water color, WQ is excellent if the water is blue, good if green, slightly polluted if yellow, moderately polluted if orange, and heavily polluted if red (Yan et al. 2022). Therefore, WQ can be detected from the water color in images obtained with RS, and many studies have taken this approach (Hassan & Woo 2021).

UAVs are also used as a data acquisition platform. In addition to the color image taken from the air, more substantial data with high spatial resolution can be obtained with the help of various sensors (multispectral, hyperspectral, thermal, etc.) added to the UAV itself (Vélez-Nicolás et al. 2021). Water level information obtained with the help of sensors, flow and pressure information in pumps and pipes, acoustic recorders, etc., are also used as WSM data sources in addition to the sources mentioned above. In addition, it has been determined that smart water meters are one of the effective data collection methods in WSM (Monks et al. 2019; Kirstein et al. 2020; Velani et al. 2022).

Most studies obtain the dataset they need in cooperation with government agencies but do not share it publicly. Despite intensive research, very few publicly shared data sources for WQ and WSM were found in the reviewed literature. Some studies with openly shared datasets are as follows: (Moritz et al. 2017, 2018; Hamshaw et al. 2018; Mo et al. 2019a, 2019b; Ross et al. 2019; Zhao et al. 2019; Zhou et al. 2019; Wang et al. 2021a; Nasir et al. 2022). Experiments were carried out with ML methods on two of these open datasets, and they are explained in detail below.

Data pre-processing

Data management and pre-processing are important steps in their own right, and there are many detailed studies on these subjects (Han et al. 2012; Fan et al. 2021). Data management usually includes pre-processing steps such as missing value imputation (MVI), detection of noise and outlier (anomalous) values, feature selection (FS), dimensionality reduction (DR), normalization, data augmentation (DA), and proper partitioning of data. In MVI, records should not be deleted unless all column values are missing over a substantial range of rows. When possible, missing data should be filled in with meaningful estimates using approaches based on averaging, linear interpolation, regression, and ML algorithms (Alasadi & Bhaya 2017; Chhabra et al. 2017). For anomaly detection, lower and upper threshold values, the distance of data points from the mean in units of standard deviation, clustering, and complementary ML algorithms can be used.
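Two of the steps named above, linear interpolation for MVI and threshold-based outlier detection, can be sketched as follows. This is a minimal illustration under stated assumptions: the DO series is invented, the function names are hypothetical, and the interquartile-range (IQR) rule with a factor of 1.5 is one common choice of lower and upper thresholds rather than a rule prescribed by the cited works.

```python
import numpy as np


def interpolate_missing(values):
    """Fill NaN gaps by linear interpolation between the nearest known samples.

    Leading or trailing NaNs are held at the nearest known value (np.interp behavior).
    """
    values = np.asarray(values, dtype=float)
    index = np.arange(len(values))
    known = ~np.isnan(values)
    return np.interp(index, index[known], values[known])


def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical dissolved oxygen readings (mg/L) with two gaps and one spike.
do_series = np.array([8.1, 8.0, np.nan, 7.8, 7.9, 25.0, 8.0, np.nan, 8.2])

filled = interpolate_missing(do_series)
outlier_flags = iqr_outliers(filled)
print(filled)
print(np.where(outlier_flags)[0])
```

Here the outlier check runs after imputation, so the spike at index 5 influences only its own flag; in practice it is often safer to flag and mask outliers first so that they are not used as interpolation endpoints.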

Statistical measures such as the Pearson correlation coefficient (PCC), as well as recursive feature elimination (RFE) and tree-based ML algorithms, can be used for accurate FS (Fan et al. 2021). The DR process is also essential for data management and engineering; statistical methods and clustering ML algorithms are often used for it (Velliangiri et al. 2019). Normalization ensures that the variables are in the same range, preferably as small numbers, so that ML algorithms built on the distance between data points can perform a sound analysis. Some essential normalization techniques are min-max normalization, z-score standardization, decimal scaling, and logarithmic methods (Saranya & Manik 2013).
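The four normalization techniques listed above can be written compactly with NumPy. The sketch below is illustrative only: the EC values are invented, the function names are hypothetical, and the logarithmic variant uses log(1 + x), which is one of several possible logarithmic transforms.

```python
import numpy as np


def min_max(x):
    """Rescale to the [0, 1] range."""
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    """Center to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()


def decimal_scaling(x):
    """Divide by 10**j, with j the smallest integer such that max(|x|) / 10**j < 1.

    Assumes the data contain at least one nonzero value.
    """
    j = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / 10.0 ** j


def log_transform(x):
    """log(1 + x); assumes non-negative inputs."""
    return np.log1p(x)


# Hypothetical electrical conductivity readings (uS/cm).
ec = np.array([250.0, 400.0, 550.0, 700.0, 980.0])

print(min_max(ec))
print(z_score(ec))
print(decimal_scaling(ec))
print(log_transform(ec))
```

To avoid information leaking from the test data, the statistics used for scaling (minimum, maximum, mean, standard deviation) would normally be computed on the training set only and then reused for the validation and test sets.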

DA is a set of techniques that can be applied to all data types, such as audio, text, and images. It aims to artificially increase the amount of data by deriving new data points from existing data. It is an effective way to combat overfitting, which arises primarily when there is too little data. In addition to simple image techniques such as adding noise, cropping, rotating, color-based changes, and deleting a random part, artificial data can also be produced with advanced ML techniques (Shorten & Khoshgoftaar 2019). For text data, methods such as replacing words with synonyms, random insertion/deletion/shuffle/replacement, and translating the text into another language and then back into its original language can be used (Wei & Zou 2019). For audio data, methods such as clipping, adding noise, shifting in the time domain, changing the playback speed, and frequency masking can be used (Wei et al. 2020).
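Noise injection, the simplest of the techniques mentioned above, carries over directly to numeric sensor series. The following sketch is an assumption-laden illustration rather than a method from the cited works: the function name, the turbidity values, and the choice of Gaussian noise scaled to 5% of the series standard deviation are all hypothetical.

```python
import numpy as np


def jitter(series, n_copies=3, noise_scale=0.05, seed=0):
    """Return n_copies noisy variants of a 1-D series.

    Gaussian noise is scaled to noise_scale times the standard deviation of the series.
    """
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    sigma = noise_scale * series.std()
    noise = rng.normal(0.0, sigma, size=(n_copies, series.size))
    return series + noise


# Hypothetical turbidity readings (NTU).
turbidity = np.array([3.2, 3.5, 4.1, 6.8, 5.0, 3.9])
augmented = jitter(turbidity)
print(augmented.shape)
```

Augmented copies should be generated from the training portion only, otherwise near-duplicates of test samples can leak into training and inflate the reported accuracy.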

The division of data for training and testing (or training, validation, and testing) involves approaches that vary from study to study. The training set is used to train the model, while the validation set is used to tune the hyperparameters of the model. The test set is used to evaluate the final model by comparing its predictions with the actual values. In the literature, at least 50% and up to 80% of the data is allocated for training (Joseph 2022). The remaining data are used either only for testing or for both validation and testing. Model verification techniques such as cross-validation are frequently utilized, but standard cross-validation may need to be replaced with alternative techniques, especially for time series. Because temporal order matters in time series, the validation samples should not be chosen randomly; the model should be validated on data that follow the period it was trained on (Bergmeir & Benítez 2012). The chosen split ratio may vary within the range mentioned above depending on parameters such as the dataset's serial pattern, its size, and the number of features. Studies are also carried out to determine the optimum split ratio of a dataset (Joseph & Vakayil 2022).
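For time series, the point about not choosing validation data at random can be made concrete with index-based splits that preserve order. The sketch below is a hypothetical illustration: the function names, the 70/15/15 ratio, and the expanding-window fold scheme are assumptions, chosen as one common way to respect temporal order rather than as the procedure of any cited study.

```python
def chronological_split(n_samples, train_frac=0.7, val_frac=0.15):
    """Split sample indices into train/validation/test blocks without shuffling."""
    train_end = round(n_samples * train_frac)
    val_end = train_end + round(n_samples * val_frac)
    indices = list(range(n_samples))
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


def expanding_window_folds(n_samples, n_folds=3):
    """Yield (train, validation) index lists where training always precedes validation."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train = list(range(0, k * fold_size))
        validation = list(range(k * fold_size, (k + 1) * fold_size))
        yield train, validation


train_idx, val_idx, test_idx = chronological_split(100)
folds = list(expanding_window_folds(100, n_folds=3))

print(len(train_idx), len(val_idx), len(test_idx))
for train, validation in folds:
    print(f"train 0..{train[-1]}, validate {validation[0]}..{validation[-1]}")
```

Each fold trains on everything observed up to a cut-off and validates on the block that follows it, so the model is never evaluated on data older than its training data.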

In addition to the methods mentioned above, pre-processing methods such as radiometric calibration, atmospheric correction, and spectral band selection are needed to use the data obtained from RS sources such as satellites and UAVs reliably (Chen et al. 2022).

ML is a core part of artificial intelligence (AI) and one of the main routes by which the goals of AI are pursued. While AI aims to simulate natural intelligence to solve complex problems, the goal of ML is for machines to learn how to perform human-like behavior (Xue & Zhu 2009). An algorithm that exhibits human-like behavior can be called AI, but it cannot be considered ML unless it learns and evolves from data automatically. Data collection is essential during the ML implementation stages. Some of the data obtained may inevitably be incomplete, incorrect, or damaged, leading to poor performance when used in their current form. In such cases, the methods described under the data pre-processing heading above become another critical step in ML applications. Then, appropriate algorithm selection, training of the selected model, and model validation are required.

Supervised and unsupervised learning are the two main learning approaches of ML algorithms. In supervised learning, the data are labeled to tell the machine which patterns to look for; in unsupervised learning, the data are not labeled, and the machine finds patterns without knowing in advance which patterns to look for. There are also reinforcement learning algorithms, which learn by trial and error, and semi-supervised learning algorithms, which are generally used when a large amount of unlabeled data is combined with a small amount of labeled data; the latter are not used as often as the others.

ML tasks can typically be classified into broad categories depending on the type of learning (supervised or unsupervised), the learning model (classification, regression, clustering, etc.), and the algorithm chosen to implement the desired task. While it is vital to have good accuracy when training an ML model, its ability to generalize to unseen data also matters to a great extent.

Machine learning algorithms

In this section, the basic ML algorithms used in WQ and WSM processes are classified according to their structures and briefly described. As shown in Table 1, these algorithms are classified as artificial neural network (ANN)-based, deep learning (DL)-based, tree-based, fuzzy logic-based, complementary, and others.

Table 1

Classification of ML algorithms used in WQ and WSM processes

| Category | ML algorithms |
| --- | --- |
| ANN-based | ELM, GRNN, MLP, NARX, RBFNN, SOM |
| DL-based | Autoencoder, CNN, DNN, GAN, GRU, LSTM, RNN |
| Tree-based | CatBoost, CRT, DCF, DT, GBM, RF, SGB, XGBoost |
| Fuzzy logic-based | ANFIS, DENFIS |
| Complementary | Bat, DFA, GA, LDA, PARAFAC, PCA, PSO, RSM, WT |
| Other | BMA, GMDH, k-means, kNN, LR, MLR, OLS, PLS, SVM, SVR |

ANN-based algorithms

ANN, often called neural network (NN), is an ML method that can derive new information from previously learned or classified information by imitating the biological neural structure of the human brain. Neurologist Warren S. McCulloch and logician Walter Pitts published the first ANN model in 1943 as a mathematical formulation of the neural activity in the brain (McCulloch & Pitts 1943). ANN consists of three main layers: input, hidden, and output. ANN models are sometimes described as ‘black box’ models because their outputs are used without a full understanding of the computations carried out in the hidden layers.

After the back-propagation (BP) approach was proposed as a learning procedure that minimizes the error measured by a loss function, the use of ANN developed rapidly (Rumelhart et al. 1986). Fully connected feed-forward networks are also known as multilayer perceptrons (MLP), which is another name often used for ANN. In some studies, the algorithm called ANN was in fact an MLP, so the naming was corrected accordingly when those studies were detailed in the tables below. ANN has been continuously improved over time, and its efficiency has increased with the help of various optimizations.

There are many algorithms based on ANN. Some of those frequently used in the studies covered by this article are radial basis function NN (RBFNN) (Broomhead & Lowe 1988), generalized regression NN (GRNN) (Specht 1991), self-organizing map (SOM or Kohonen map) (Kohonen 1982, 1990), nonlinear autoregressive exogenous (NARX) (Lin et al. 1996), and extreme learning machine (ELM) (or randomized NN) (Huang et al. 2004, 2006).

SOM is an unsupervised algorithm trained by competitive learning and used for feature reduction and clustering (Kalteh et al. 2008). The NARX model handles time series by feeding the network output back to the input, and it can cope with missing data, noise, and nonlinear inputs (Chang et al. 2016). RBFNN uses radial basis functions as activation functions, treats training as a curve-fitting problem in multidimensional space, and trains faster than MLP networks. Since GRNN is also built on radial basis functions and learns in a single pass, it does not require the iterative training process of ANNs that use BP. Because the complexity of both RBFNN and GRNN grows excessively with the amount of data, they can be difficult to apply to large datasets. ELM assigns the input weights and bias values randomly, computes the output weights with a matrix inverse instead of BP, and can work with non-differentiable activation functions; as a result, its training is very fast and it generalizes well (Ertuğrul et al. 2021).
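
The ELM idea described above can be illustrated with a minimal sketch. It is not the implementation of any cited study: the tanh activation, the ridge term, the hidden layer size, and the synthetic sine data are all assumptions chosen for illustration. The hidden weights are drawn once at random and never trained, and only the output weights are obtained by solving a regularized least-squares system.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hidden_layer(x, weights, biases):
    """Random, untrained hidden layer (tanh is an illustrative choice)."""
    return [math.tanh(w * x + b) for w, b in zip(weights, biases)]


def train_elm(xs, ys, n_hidden=20, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = [hidden_layer(x, weights, biases) for x in xs]
    # Output weights from (H^T H + ridge * I) beta = H^T y, with no BP.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    hty = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n_hidden)]
    beta = solve_linear_system(hth, hty)
    return weights, biases, beta


def predict_elm(model, x):
    weights, biases, beta = model
    return sum(b * h for b, h in zip(beta, hidden_layer(x, weights, biases)))


# Synthetic one-dimensional regression problem (hypothetical data).
xs = [-3 + 6 * i / 49 for i in range(50)]
ys = [math.sin(x) for x in xs]
model = train_elm(xs, ys)
mse = sum((predict_elm(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline = sum(y * y for y in ys) / len(ys)  # error of always predicting 0
```

Because training reduces to one linear solve, the cost is dominated by the size of the hidden layer rather than by the number of training epochs.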

DL-based algorithms

DL algorithms, which can also be classified as ANN-based, should be examined under a separate heading due to their groundbreaking techniques and usage areas. DL differs from ANN according to its depth, number of hidden layers, and architectural diversity. An ANN structure with several hidden layers may also be known as deep NN (DNN) as a general nomenclature. DL algorithms have quickly attracted the attention of researchers from various scientific disciplines, along with the developments in graphics processing units (GPUs), which have compelling computational features (Goodfellow et al. 2016).

While traditional ML approaches rely on handcrafted features in feature extraction, DL algorithms learn features automatically without explicit instruction and represent them at multiple, increasingly abstract levels. The performance of DL approaches improves as datasets grow, but it decreases when data are insufficient, and overfitting problems may occur (Liu et al. 2017). In addition, a wider range of nonlinear activation functions, initialization methods, and regularization methods can be found in DL than in other ANN derivatives.

The fundamental DL model types can be broadly classified as convolutional NN (CNN) (LeCun et al. 1989), recurrent NN (RNN) (Rumelhart et al. 1986), autoencoder (AE) (Rumelhart et al. 1985), and generative adversarial network (GAN) (Goodfellow et al. 2014). Algorithms with different structures are derived from these basic DL classes and provide solutions in many different fields.

CNN is an important type of DL algorithm, often used in image processing, that applies convolution and pooling operations in its hidden layers and usually takes images as input. Both CNN and DL algorithms in general have become much more popular in the last ten years, following the successful results obtained with DL on ImageNet. ImageNet was made publicly accessible by researchers at Stanford in 2009 and contains more than 14 million tagged images (Deng et al. 2009; Krizhevsky et al. 2017). There are numerous derivatives of CNN, and a wide variety of them are used for many purposes in WQ and WSM studies. For example, CNN is often used in image recognition, and derivatives such as the ‘you only look once’ (YOLO) algorithm can be used in automatic object detection, especially for WSM tasks (Redmon et al. 2016).
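
The two operations named above, convolution and pooling, can be sketched in a few lines. This is a minimal illustration on a hypothetical 5×5 image with a hand-picked edge kernel; in a real CNN the kernel values are learned and library implementations are used.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in cols]
            for i in rows]


# Hypothetical image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [[-1, 1],
               [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)  # 4 x 4 map, peak at the edge
pooled = max_pool2d(feature_map)                # 2 x 2 summary
```

The feature map is nonzero only where the edge lies, and pooling keeps that response while halving each spatial dimension.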

RNNs take advantage of existing correlations between input data to predict new data. They are built on a system in which the current state is affected by past states at each step. Algorithms such as long short-term memory (LSTM) and gated recurrent unit (GRU) have been proposed to alleviate the difficulty RNNs have in accessing earlier information and the vanishing and exploding gradient problems (Hochreiter & Schmidhuber 1997; Cho et al. 2014). In general, neither LSTM nor GRU is clearly better in terms of performance, but it has been concluded that the less complex GRU can be used when fast training is necessary (Chung et al. 2014).

AEs are unsupervised algorithms, and although they can be used in many different tasks, their main purpose is to reduce the size of the given input by compressing it. They reproduce the input data at the output as closely as possible and are instrumental in data pre-processing techniques such as FS, DR, and noise and anomaly removal in water studies (Russo et al. 2020; Ba-Alawi et al. 2021). GANs are groundbreaking generative DL applications consisting of two different, competing NNs, and can generally be used in data pre-processing steps such as FS, DA, and anomaly removal (Li et al. 2018; Huynh et al. 2022).
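
One way to see why GRU is described as less complex is to count trainable parameters. The sketch below assumes the textbook formulation, in which each gate or candidate block has one input weight matrix, one recurrent weight matrix, and one bias vector. LSTM has four such blocks and GRU has three. Library implementations may add extra bias terms, so exact counts can differ, and the layer sizes used here are hypothetical.

```python
def gated_rnn_params(n_inputs, n_hidden, n_blocks):
    """Parameter count for a gated recurrent layer (textbook formulation)."""
    per_block = n_hidden * (n_inputs + n_hidden) + n_hidden
    return n_blocks * per_block


# Hypothetical layer: 8 input features (e.g. sensor readings), 32 hidden units.
lstm_params = gated_rnn_params(8, 32, n_blocks=4)  # 5248
gru_params = gated_rnn_params(8, 32, n_blocks=3)   # 3936
```

Under this assumption a GRU layer always has three quarters of the parameters of an LSTM layer of the same size.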

Tree-based algorithms

Tree-based ML algorithms are so named because they can be visualized with nodes and branches, and they have many types. One of the most basic models is the decision tree (DT) algorithm, a structure used to divide a dataset containing many records into smaller groups by applying a set of simple decision rules (Morgan & Sonquist 1963; Hunt et al. 1966). The essential node types in DT are (1) root, parent, and child nodes, and (2) decision and terminal nodes, and DT uses different algorithms to decide whether to split a node into two or more child nodes. Many direct DT-based algorithms exist (CART, ID3, C4.5, etc.) (Mienye et al. 2019). As tree branches grow, overfitting problems may occur, and various measures (parameter restrictions, pruning of branches, etc.) are taken to prevent this. Other important algorithms emerged from these measures.

Random forest (RF) is an approach that combines many weak, uncorrelated trees in a forest with the logic of ‘bagging,’ takes an estimate from each tree, and determines the winner according to the votes received (Breiman 1996, 2001). An important aspect of the RF algorithm is that the features contributing the most to the information gain can be identified, so RF can also act as a kind of FS and DR process.
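
The ‘bagging’ and voting logic can be sketched as follows. This is a deliberately reduced illustration, not a full RF: the trees are one-split stumps on a single feature, there is no random feature subsampling, and the data, labels, and function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split tree; returns (threshold, left_label, right_label)."""
    labels = sorted(set(ys))
    best = None
    for threshold in sorted(set(xs)):
        for left in labels:
            for right in labels:
                errors = sum((left if x <= threshold else right) != y
                             for x, y in zip(xs, ys))
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    return best[1:]


def predict_stump(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual tree predictions."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical turbidity readings with two quality classes.
turbidity = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
quality = ["clean"] * 5 + ["polluted"] * 5
forest = fit_bagged_stumps(turbidity, quality)
low_vote = predict_forest(forest, 0.5)
high_vote = predict_forest(forest, 20)
```

Each stump sees a slightly different resampled dataset, so individual errors tend to be outvoted when the predictions are combined.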

RF has many derivatives. Variants such as the completely-random tree (CRT) and the deep cascade forest (DCF), which are very successful and inspired by the architectures of DL algorithms, are also used (Zhou & Feng 2017). The gradient boosting machine (GBM), an effective method, was developed by combining the gradient descent optimization algorithm with the logic of ‘boosting’ to strengthen DT (Friedman 2001). The author of GBM later introduced stochastic gradient boosting (SGB) (or boosted regression tree), in which randomness became an integral part of the gradient procedure (Friedman 2002). Furthermore, some of the newest derivatives of GBM, which can limit overfitting, handle uninformative data, and run very fast, are extreme gradient boosting (XGBoost) and CatBoost (Chen & Guestrin 2016; Prokhorenkova et al. 2017).

Fuzzy logic-based algorithms

Fuzzy logic is a logical structure which argues that ambiguous expressions can also be quantified. Instead of the strict true–false boundary of classical Aristotelian logic, it assigns terms membership values between 0 and 1, and it was developed in this way to deal with ambiguity (Zadeh 1965, 1978). The fuzzy inference system (FIS), which describes the relationship between variables and then interprets it in simple language, has therefore been accepted as a ‘grey box’ model, a combination of the ‘white box’ and ‘black box’ model approaches (Obadina et al. 2022).
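
Membership values between 0 and 1 can be illustrated with a triangular membership function, one common choice. The fuzzy set ‘moderate DO’ and its breakpoints (4, 6, and 8 mg/L) are hypothetical values picked only for this sketch.

```python
def triangular_membership(x, left, peak, right):
    """Degree (0 to 1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy set 'moderate DO' with breakpoints in mg/L.
moderate_do = {reading: triangular_membership(reading, 4.0, 6.0, 8.0)
               for reading in (3.0, 5.0, 6.0, 7.5)}
# 3.0 -> 0.0, 5.0 -> 0.5, 6.0 -> 1.0, 7.5 -> 0.25
```

A reading of 5 mg/L is therefore ‘moderate’ to degree 0.5 rather than being forced into a strict true or false category.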

FIS does not perform learning, so it is not considered an ML approach, even though it is a subset of AI. Therefore, flexible ML models that can better describe the complexity and nonlinear data have been created by combining the FIS approach with NN learning logic. Many fuzzy logic-based algorithms have been brought into the ML approach; some of the important ones are adaptive-network-based FIS (ANFIS) (Jang 1993) and dynamic evolving neural FIS (DENFIS) (Kasabov & Song 2002). DENFIS uses the evolving clustering method to optimize membership function parameters, and studies have been carried out in the water field with this algorithm in the past (Heddam 2014; Heddam & Dechemi 2015). However, ANFIS is used more frequently in the current literature. Because ANFIS has the advantages of ANN and FIS models, it successfully reverses coding and reduces noise and can generally be used as a hybrid with multidimensional data analysis methods (Tiyasha et al. 2020).

Complementary algorithms

Wavelet transform (WT) is an algorithm that can significantly improve the overall performance of a model and is capable of describing spectral and temporal information (Grossmann & Morlet 1984). WT analysis has proven more efficient than the Fourier transform and has successfully analyzed signals in the time–frequency domain (Chun-Lin 2010). WT has been applied successfully in many tasks, such as extracting meaningful information from a dataset, decomposing non-stationary signals into different sub-signals, noise removal, and signal compression (Walczak & Massart 1997; Tiyasha et al. 2020). WT is a complementary algorithm that is effective alone and in combination with many models such as ANN and ANFIS.

Methods that reduce the number of features to achieve better results, shorten processing time, and prevent overfitting are also complementary methods. Principal component analysis (PCA) and linear discriminant analysis (LDA), whose foundations were laid nearly a century ago, can be given as examples of the main algorithms used for this purpose (Cunningham 2008). In addition, the parallel factor analysis (PARAFAC) method, a long-established mathematical model used to focus on the features of interest, is also used in some studies (Harshman 1970).
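
Decomposition into sub-signals can be illustrated with a single level of the Haar wavelet, the simplest wavelet. This is only a sketch: the cited studies may use other wavelet families and several decomposition levels, and the input series is hypothetical. The signal length is assumed to be even.

```python
import math

SQRT2 = math.sqrt(2)


def haar_step(signal):
    """Split an even-length signal into approximation and detail parts."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]  # smooth trend
    detail = [(a - b) / SQRT2 for a, b in pairs]  # local fluctuations
    return approx, detail


def inverse_haar_step(approx, detail):
    """Rebuild the signal from its approximation and detail parts."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / SQRT2, (a - d) / SQRT2])
    return signal


# Hypothetical sensor series.
series = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, detail = haar_step(series)
rebuilt = inverse_haar_step(approx, detail)

# Crude noise removal: drop all detail coefficients before rebuilding.
smoothed = inverse_haar_step(approx, [0.0] * len(detail))
```

Setting small detail coefficients to zero before the inverse step is the basic idea behind wavelet noise removal and compression.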

Nature-inspired optimization algorithms (NIOA), also a subset of AI, can be used with ML algorithms as a complementary model because they successfully find optimized solutions to multidimensional and multimodal challenging problems (Yang 2014). Many NIOA methods are available, inspired by nature's evolutionary, physical and chemical processes (Wang et al. 2021b). Some of the frequently used ones in the literature are the genetic algorithm (GA) (Holland 1975), particle swarm optimization (PSO) (Kennedy & Eberhart 1995), and the recent bat algorithm (Yang 2010). In addition, it is seen that methods such as response surface methodology (RSM) and desirability function analysis (DFA) are used as complementary optimization models in some studies (Bobadilla et al. 2020).

Other algorithms

This section mentions some models used in water studies that do not fall into the other categories. The most popular of these is the support vector machine (SVM), an ML method that tries to separate data points with a boundary placed at an optimum distance (margin) from them (Cortes & Vapnik 1995). To cope with high-dimensional and nonlinear problems, it applies the kernel trick, using kernel functions that transform the input space into a higher-dimensional space. Support vector regression (SVR), which is based on the same concept as SVM, is used for regression tasks (Drucker et al. 1996).

Another algorithm is the k-nearest neighbors (kNN) method, in which the distance measure and the number of neighbors are the essential parameters, and a new individual is classified by looking at the ‘k’ closest previous individuals (Altman 1992). This algorithm is described as ‘lazy’ because it does almost no work in the training phase and instead performs its calculations over the entire dataset in the testing phase, which can be costly for large multidimensional datasets. The k-means algorithm, an unsupervised ML method that solves clustering problems by finding patterns in the data and dividing the data into ‘k’ clusters, is also used in some studies (MacQueen 1967; Lloyd 1982).

In addition, techniques such as Bayesian model averaging (BMA) (Raftery et al. 2005), ordinary least squares (OLS), partial least squares (PLS), logistic regression (LR), multiple linear regression (MLR), and group method of data handling (GMDH) (Ivakhnenko 1970) have been used in some studies.
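
The ‘lazy’ character of kNN is visible in a minimal sketch: there is no training function, and every prediction scans the whole stored dataset. Euclidean distance and the sample points (pH, turbidity) are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # All distance computations happen at prediction time.
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Hypothetical (pH, turbidity) samples with known quality labels.
points = [(7.0, 1.0), (7.2, 1.5), (6.9, 2.0), (5.5, 9.0), (5.8, 8.5), (6.0, 9.5)]
labels = ["good", "good", "good", "poor", "poor", "poor"]

near_good = knn_predict(points, labels, (7.1, 1.2))
near_poor = knn_predict(points, labels, (5.7, 9.1))
```

In practice the features are usually normalized first, since otherwise the variable with the largest numeric range dominates the distance.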

Evaluation metrics

Many evaluation criteria and detailed analyses are available in the literature for evaluating the performance of ML and DL algorithms, mainly in classification and regression tasks (Goodfellow et al. 2016; Botchkarev 2019; Yang et al. 2022). In this section, the criteria used in evaluation tasks are mentioned briefly, without going into much detail, to clarify some concepts that appear in the outputs of ML research. One of the essential evaluation tools in classification tasks is the confusion matrix and the elements it contains. This table has four combinations of the predicted and actual values resulting from the classification: true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN). Many classification metrics are derived from these four values. Simple and vital metrics obtained from the confusion matrix include accuracy ((TP + TN)/(TP + TN + FP + FN)), precision (TP/(TP + FP)), recall or sensitivity (TP/(TP + FN)), specificity (TN/(TN + FP)), and F1-score (or F-score or F-measure) (2 × precision × recall/(precision + recall)).

In addition to the classification criteria above: (1) The area under the ROC curve (AUC) is an effective measure for evaluating the performance of ML models visually. Here, ROC stands for receiver operating characteristics, a probability curve. (2) Log loss is a good evaluation metric for binary classifiers. It is a slightly modified version of the likelihood function. (3) The Cohen kappa statistic is frequently used to measure reliability between raters and was developed to determine the degree of agreement between two raters scoring at the classification level. For all the classification criteria mentioned except log loss, a larger value indicates better classification; for log loss, a smaller value is better.
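
The metrics derived from the confusion matrix can be sketched using their standard definitions. The counts in the usage example are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics derived from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical result: 100 water samples classified as polluted or not.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
# accuracy 0.85, precision 0.80, recall about 0.889,
# specificity about 0.818, F1 about 0.842
```

The sketch assumes every denominator is nonzero; a production implementation would need to handle classes that never occur or are never predicted.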
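
Log loss and the Cohen kappa statistic can be sketched from their standard definitions. The clipping constant, labels, and probabilities below are assumptions for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def cohen_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    classes = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in classes)
    return (observed - expected) / (1 - expected)


# An uninformative classifier that always outputs 0.5 scores ln(2).
uninformative = log_loss([1, 0], [0.5, 0.5])

# Two hypothetical raters agree on 6 of 8 samples: kappa = 0.5.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(rater_a, rater_b)
```

The raw agreement here is 0.75, but half of that would be expected by chance, which is why kappa is only 0.5.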

There are also many evaluation criteria in regression tasks, and no single performance criterion is ideal in every case (Botchkarev 2019). Some metrics average the absolute error between the actual and predicted values; common examples are mean absolute error (MAE) and mean absolute percentage error (MAPE). Criteria frequently used in regression that are based on the squared error include mean square error (MSE), root mean square error (RMSE), R, R², and Nash–Sutcliffe efficiency (NSE). MSE is the mean of the squared differences between the actual and predicted values, so it differs from MAE in squaring the error rather than taking its absolute value. RMSE is the square root of MSE, which returns the error to the units of the target variable so that it can be compared more easily.

The R measure, also known as the correlation coefficient, gives the correlation between the observed values and the response generated by the model. R², a statistical measure of how close the data are to the fitted regression line, is also known as the coefficient of determination and is obtained by squaring the R criterion. Although R² is a frequently used criterion, it does not guard well against outliers and overfitting problems, and adjusted R² has been proposed as a solution (Babyak 2004; Li 2017). NSE is a normalized statistic that stands out in the field of hydrology and determines the relative magnitude of the residual variance compared to the variance of the observation data. In addition, some studies have also used the metric called RMSE observations standard deviation ratio (RSR).
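
The regression metrics above can be sketched from their standard definitions. R² is computed here as the squared correlation coefficient, following the description in the text, and RSR is taken as RMSE divided by the standard deviation of the observations. The observed and predicted values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Common regression metrics from paired observed/predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_obs = sum((o - mean_o) ** 2 for o in observed)   # spread of observations
    ss_pred = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    mse = ss_res / n
    r = cov / math.sqrt(ss_obs * ss_pred)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "r": r,
        "r2": r * r,
        "nse": 1.0 - ss_res / ss_obs,
        "rsr": math.sqrt(ss_res / ss_obs),
    }


# Hypothetical DO observations and model predictions (mg/L).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
scores = regression_metrics(observed, predicted)
# mae 0.5, mse 0.25, rmse 0.5, nse 0.95, rsr about 0.224
```

MAPE is undefined when an observed value is zero, so this sketch assumes strictly nonzero observations.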

WQ monitoring and determination are essential for keeping surface water and groundwater in good condition. Table 2 lists some studies published recently (in the last 5 years) that summarize, in general terms, how ML is used for WQ.

Table 2

WQ assessment using ML algorithms

| Authors (year) | Number of samples/Data range/Location | Input variables | Data pre-processing | Target | ML techniques | Evaluation metrics |
|---|---|---|---|---|---|---|
| Najah Ahmed et al. (2019) | NI/2009-2010/Johor River (Malaysia) | Chloride, E. coli, EC, iron, magnesium, nitrate, phosphate, potassium, salinity, sodium, turbidity, WTE | Noise removal with WT | Ammoniacal nitrogen, pH, TSS | ANFIS, MLP, RBFNN, WT-ANFIS | R² |
| Li et al. (2019) | 1448/2017-2018/Qiantang River (China) | DO, permanganate index, pH, total phosphorus | Normalization | DO, permanganate index, pH, total phosphorus | GRU, LSTM, MLP, SVR, RNN, RNN with Dempster/Shafer evidence theory | MAE, MAPE, NSE, RMSE |
| García-Alba et al. (2019) | NI/1990 and 2013-2015/Eo Estuary (Spain) | The flow of fecal discharges, river flow/WTE, sea salinity/WTE, solar radiation, water level | NI | FIB (E. coli) | MLP | NSE, R² |
| Kisi et al. (2020) | NI/2015-2017 (for Link River) and NI/1997-2004 (for Klamath River)/(United States) | EC, pH, WTE | NI | DO | ANFIS, BMA, CART, ELM, MLR, MLP | NSE, R², RMSE |
| Zhang et al. (2020) | 3672/2015/Burnett River (Australia) | Chl-a, DO, EC, pH, turbidity, WTE | DR and noise removal with kernel PCA, FS | DO | GRNN, kPCA-LSTM, MLP, SVR | MAE, percent of prediction within a factor, R², RMSE |
| Ansari & Akhoondzadeh (2020) | 102/2013-2018/Karun River (Iran) | Salinity (in situ data) + Landsat 8 satellite images | Radiometric/atmospheric correction, FS, normalization | Salinity | GA-MLP, OLS, SVR | R², RMSE |
| Chen et al. (2020a) | 33612/2012-2018/124 automatic WQ monitoring stations in 10 rivers and lakes (China) | Ammonia nitrogen, COD, DO, pH | Normalization | WQI | CRT, DCF, DT, kNN, LDA, LR, naive Bayes, RF, SVM | F1-score, precision, recall, weighted F1-score |
| Lombard et al. (2021) | 19678 for RF and 20450 for SGB/1970-2013/(United States) | 249 variables from (base-flow index, bedrock geology, ecoregion, general lithology, groundwater recharge, multi-order hydrologic position, percent tile drainage, precipitation, soil geochemistry, surficial geology) | NI | Arsenic | RF, SGB | Accuracy, AUC, kappa, sensitivity, specificity |
| Alizamir et al. (2021) | 1317/NI (for Illinois River) and 1841/NI (for Grand Lake)/(United States) | EC, pH, turbidity, WTE | NI | Chl-a | Bat-ELM, CART, ELM, GMDH, RF | NSE, R, RMSE |
| Liu et al. (2021) | NI/2021/Poyang Lake, Three Gorges, Yangtze River (China) | Chl-a (in situ data) + hyperspectral and RGB images obtained by UAV | Geometric registration, radiometric correction, image processing | Chl-a | MLP, RF, PLS, PSO-SVM | MAPE, R², RMSE |
| Xu et al. (2021) | 3287/2011-2019/Dongjiang River (China) | (Ammonia nitrogen, DO, EC, permanganate, pH, turbidity, total phosphorus, WTE) and (air pressure/temperature, evaporation, flow, humidity, precipitation, sunshine duration, wind velocity) | FS, MVI, normalization | DO | MLP, MLR, RF, SVM, WT-MLP, WT-MLR, WT-RF, WT-SVM | MAE, NSE, R, RMSE |
| Singha et al. (2021) | NI/2019/226 different excavation points in Chhattisgarh (India) | Bicarbonate, calcium, chloride, fluoride, magnesium, nitrate, pH, phosphate, potassium, sodium, sulfate, TDS, TH | Normalization | WQI | DNN, MLP, RF, XGBoost | MAPE, MSE, NSE, R², RMSE, RSR |
| Podgorski et al. (2022) | Over 6000/NI/(Bangladesh, Cambodia, Indonesia, Myanmar, Vietnam) | (Ammonium, arsenic, bicarbonate, chloride, DO, EC, iron, manganese, nitrate, pH, phosphate, redox potential, sodium, sulfate, well depth, WTE) and (57 spatially continuous data of climate, geology, soil, and topography) | DR, FS | Iron, manganese | GBM, RF | AUC, balanced accuracy, kappa |
| Shan et al. (2022) | NI/2014-2018 (for algal density) and NI/2017-2018 (for Chl-a)/Three Gorges Reservoir (China) | (Algal cell, ammonia nitrogen, Chl-a, COD, DO, EC, microcystin, pH, total nitrogen, total phosphorus, turbidity, WTE) and (air temperature, atmospheric pressure, flow velocity, water level, wind direction/speed) | FS, MVI | Chl-a, algal density, and microcystin | LSTM, MLP, RNN, SVM, XGBoost-LSTM | MAE, RMSE |
| Nasir et al. (2022) | 1679/2005-2014/diverse biomes over 600 locations (India) | BOD, DO, coliform data for fecal/total, EC, nitrate, pH | DA, MVI | WQI | CatBoost, DT, LR, MLP, RT, SVM, XGBoost | Accuracy, AUC, F1-score, precision, sensitivity |

DO is one of the most commonly used output parameters when determining WQ. In one study on DO prediction, BMA gave the most successful results and ELM the second most successful, and WTE was found to be the most helpful of the three WQPs used as inputs (Kisi et al. 2020). In another study, DO within the next 1–3 hours was estimated from sensor data, and the most successful result was achieved with a hybrid model combining kernel PCA and LSTM (Zhang et al. 2020). Another study identified the four most important attributes (previous DO, WTE, air temperature, and air pressure) and found MLR to be the most successful of the individual algorithms, while hybrid models combined with WT were more successful overall (Xu et al. 2021).

Estimation of algal blooms is also an important consideration. In one study, the Bat algorithm was used as a complementary model, and the resulting Bat-ELM algorithm was the most successful in estimating the daily Chl-a concentration (Alizamir et al. 2021). In another study, hyperspectral and RGB images obtained by UAV went through several pre-processing steps and were compared with measurements from sensors on the ground. The most successful results were obtained with the RF algorithm, showing that Chl-a could be determined successfully with the UAV (Liu et al. 2021). A further study was carried out at the world's largest hydroelectric dam by combining data from online monitoring devices and laboratory measurements, and the XGBoost-LSTM hybrid approach gave the most successful result in all scenarios (Shan et al. 2022).

In some cases, WQ was classified using WQIs. In one study, a considerable amount of data was classified according to the Chinese quality standards (GB3838-2002), and DCF obtained the most successful result (Chen et al. 2020a). In a study on the quality classification of groundwater, the DL algorithm yielded much better results than the other three algorithms (Singha et al. 2021). In a recent study, the WQI was the target for an openly shared dataset, and CatBoost obtained the most successful result (Nasir et al. 2022).

FIB estimation is also an important consideration. In one study, a dataset was created by combining data from many sources, and E. coli prediction and bathing area classification were carried out (García-Alba et al. 2019). Some studies estimate more than one WQP as output. In one of them, a hybrid algorithm built with WT removed the noise and provided more successful results for estimating ammoniacal nitrogen, pH, and TSS (Najah Ahmed et al. 2019). In another study, each of the inputs (DO, permanganate index, pH, and total phosphorus) was also estimated as an output at multiple scales, and RNN combined with the evidence theory obtained more successful results at every stage (Li et al. 2019). In a study on water salinity, the pre-processing techniques used and the combination of in situ measurements with satellite data for comparison are notable, and the most successful results were obtained with the GA-MLP hybrid method (Ansari & Akhoondzadeh 2020).

The estimation of groundwater contaminants is also essential to the WQ process. In one study, arsenic levels in thousands of wells were classified into three classes according to specific threshold values, and SGB achieved more than 90% success (Lombard et al. 2021). In another, very comprehensive study, iron and manganese levels in the groundwaters of five countries were classified against threshold values, and the pollution was mapped (Podgorski et al. 2022). Each of the two ML algorithms used outperformed the other in different tasks.

The WSM process refers to a series of physical, chemical, and biological stages consisting of the collection, treatment, and distribution of natural water and the collection, treatment, and discharge of wastewater, with minimal damage to nature and with the goal of protecting WQ. The main facilities used in these stages are water treatment plants (WTPs), where the quality of water is improved; the water distribution system (WDS), which includes all systems through which water is distributed; and wastewater treatment plants (WWTPs), where wastewater is treated. Although there are numerous studies on WSM, Table 3 includes some recently published studies that summarize, in general terms, how ML is used in WSM.

Table 3

Using ML algorithms for WSM processes

| Task | Authors (year) | Input variables | Data pre-processing | Target | ML techniques | Evaluation metrics |
|---|---|---|---|---|---|---|
| Aqueous adsorption | Jun et al. (2020) | Agitation speed, contact time, hydrogen peroxide, pH | Normalization | Optimum parameters for maximum cleaning of adsorbate (adsorbate is ‘methylene blue’ and adsorbent is ‘jicama peroxidase used buckypaper/polyvinyl alcohol membrane’) | PSO-MLP, RSM | R², RMSE |
| Aqueous adsorption | Radmehr et al. (2021) | Nalidixic acid, NC dose, pH, contact time, temperature | NI | Optimum parameters for maximum cleaning of adsorbate (adsorbate is ‘nalidixic acid’ and adsorbent is ‘NiZrAl-layered double hydroxide-graphene oxide-chitosan’) | ANFIS, GRNN, RSM, RSM-GA, RSM-DFA | Kappa, MAE, MSE, R², RMSE |
| Chlorination and disinfection | Peleato (2022) | Fluorescence spectra | DR | Prediction of DBP with chlorine residual (target DBP: bromodichloromethane, dichloroacetic acid, total trihalomethanes, trichloromethane, total haloacetic acids, trichloroacetic acid) | CNN, MLP, PCA-MLP, PARAFAC-MLP | MAE |
| Defect detection | Yin et al. (2021) | Labeled CCTV videos acquired by the pipe inspection crawler | NI | Sewer pipe defect type (broken, cracked, deposits, etc.) and defect location | CNN (YOLOv3) | F1-score |
| Fault detection | Mamandipoor et al. (2020) | 12 different chemical/operational sensors (including ammonia, nitrate, etc. measurements) | MVI | Detecting faults during the oxidation and nitrification processes in WWTP | PCA-SVM, LSTM | Accuracy, F1-score, precision, recall |
| Leak detection | Mashhadi et al. (2021) | Water flow at the three supply sections, water pressure values at five observation points | DR | Detection and localization of leaks in WDS | LR, RF, DT, PCA, k-means, MLP | Accuracy, F1-score, precision, recall |
| Membrane filtration parameters | Shim et al. (2021) | (Dissolved organic carbon, fouling thickness, initial flux, modified fluorescence regional integration, operation time, pressure) and (real-time OCT images) | DR, black/white noise removal | Membrane fouling growth (fouling thickness and permeate flux) | LSTM | R², RMSE |
| Membrane filtration parameters | Srivastava et al. (2021) | Feed concentration/pH/pressure/temperature | NI | Treatment of brackish groundwater (permeate flux, salt rejection, specific energy consumption, water recovery) | MLP, RSM | R² |
| Wastewater treatment | Moreno-Rodenas et al. (2021) | Images obtained from the embedded camera system to observe the formation of different water levels and oil layers | DA | Estimation of grease, fat, and oil in wastewater pumping stations | CNN (VGG16) | Accuracy |
| Wastewater treatment | Gopi Kiran et al. (2021) | Hydraulic retention times, heavy metal concentration (cadmium, copper, lead) | Normalization | COD/heavy metal removal efficiency by RBC treatment | MLP | MAPE, R |

Chlorination is used to inactivate bacteria, viruses, and other microbes in water, but chlorine also poses some risks to human health. Chemical reactions between treatment agents and the organic and inorganic substances in water can form hazardous compounds called disinfection by-products (DBPs). In a study on predicting these compounds, a CNN estimated DBPs more successfully than the hybrid methods it was compared with, and it did so without requiring data pre-processing (Peleato 2022).

In a study of WTP processes, optical coherence tomography (OCT) images and in situ data were taken as input, and parameters related to membrane fouling in the water filtration process were predicted using LSTM (Shim et al. 2021). Another study reported successful results with MLP in modelling the treatment of brackish groundwater with an NF-RO hybrid membrane system (Srivastava et al. 2021).

A study on leak detection and localization in WDS tried six different algorithms, of which ANN, RF, and LR obtained successful results (Mashhadi et al. 2021). In another study, an automated system was designed to evaluate sewer pipes: labeled videos from a tracked CCTV crawler equipped with a camera were used to extract meaningful text, which was then reported automatically (Yin et al. 2021). The authors used a trained YOLOv3 model, an algorithm well suited to image-based object detection, for classification.

In a study on WWTP systems, more than 5.1 million data samples were analyzed for fault detection with the aim of increasing treatment efficiency (Mamandipoor et al. 2020). LSTM obtained more successful results than the hybrid method it was compared with. In another study, intelligent hardware was designed to automate the detection and cleaning of oil layers in wastewater pumping stations, and a CNN-based system achieved high accuracy (Moreno-Rodenas et al. 2021). In a further study, the removal of COD and heavy metals (HM) by a rotating biological contactor (RBC) reactor, which is used in the secondary treatment of water, was modelled with MLP, and successful results were observed (Gopi Kiran et al. 2021).

The ‘aqueous adsorption’ process, in which adsorbates in water are removed with the help of various adsorbents, is also essential for WSM. In one study, the cleaning of industrial wastes by adsorption in an aqueous medium was analyzed, and the hybrid algorithm obtained more successful results (Jun et al. 2020). In another study on optimal adsorption conditions, the most successful results were obtained with ANFIS among the individual models and with RSM-DFA among the hybrid approaches (Radmehr et al. 2021).

In this review, after examining many algorithms and publications, trials were conducted on two of the publicly available datasets mentioned earlier. The two datasets, which are closely related to the selected topics and were compiled in two different areas, were tested with various parameters. The first, ‘Dataset 1: WQ prediction’, is a regression problem in which the pH value of the next day is estimated. The second, ‘Dataset 2: Monitoring of drinking WQ’, is a classification problem in which anomalies in water quality are detected. The tests show that, with good optimization, more successful results can be obtained than those previously reported. They also demonstrate the flexibility of the studied ML technique across different problem types (classification and regression). In addition, the techniques mentioned in the article and the proposed transparency practices are illustrated by applying them to the selected datasets.

Trials for both the classification and regression tasks were performed in MATLAB, and the ELM algorithm, which was found to be fast and effective, was used for both datasets. The experiments were run on a laptop with an AMD Ryzen 7 4800H CPU and 16 GB of RAM. The activation functions tested in ELM were ‘sig’, ‘sin’, ‘tribas’, and ‘radbas’. The numbers of hidden neurons tested were 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 400, and 500. The results obtained are quite successful; the comparative analyses are given below and visualized in Figure 4.
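The experimental setup described above can be sketched in a few lines. The block below is a minimal illustration in Python with NumPy rather than the MATLAB code used in the study, and it makes several assumptions: the class name `ELM`, the uniform initialization range, the pseudoinverse solution, and the synthetic data are illustrative choices, and the four activation functions follow the usual MATLAB definitions of the labels ‘sig’, ‘sin’, ‘tribas’, and ‘radbas’.

```python
import numpy as np

# Activation functions keyed by the labels used in the text
# (assumed to follow the usual MATLAB definitions).
ACTIVATIONS = {
    "sig": lambda z: 1.0 / (1.0 + np.exp(-z)),
    "sin": np.sin,
    "tribas": lambda z: np.maximum(1.0 - np.abs(z), 0.0),
    "radbas": lambda z: np.exp(-(z ** 2)),
}

# Hidden-layer sizes listed in the text.
HIDDEN_SIZES = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 300, 400, 500]


class ELM:
    """Single-hidden-layer ELM: random hidden weights, least-squares output weights."""

    def __init__(self, n_hidden, activation="sig", seed=0):
        self.n_hidden = n_hidden
        self.act = ACTIVATIONS[activation]
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return self.act(X @ self.W + self.b)

    def fit(self, X, y):
        # Hidden-layer parameters are drawn once and never trained.
        self.W = self.rng.uniform(-1.0, 1.0, size=(X.shape[1], self.n_hidden))
        self.b = self.rng.uniform(-1.0, 1.0, size=self.n_hidden)
        # Output weights come from the Moore-Penrose pseudoinverse of H.
        self.beta = np.linalg.pinv(self._hidden(X)) @ y
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def grid_search(X_tr, y_tr, X_te, y_te):
    """Evaluate every (activation, hidden size) pair and return the test RMSEs."""
    results = {}
    for name in ACTIVATIONS:
        for n_hidden in HIDDEN_SIZES:
            model = ELM(n_hidden, name).fit(X_tr, y_tr)
            results[(name, n_hidden)] = rmse(y_te, model.predict(X_te))
    return results


# Synthetic stand-in data (an assumption; not one of the paper's datasets).
rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(600, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2
results = grid_search(X[:400], y[:400], X[400:], y[400:])
best = min(results, key=results.get)
print("best configuration:", best, "RMSE:", results[best])
```

Because only the output weights are solved for, each of the 60 configurations needs a single linear least-squares step, which is consistent with the short runtimes reported for ELM in this section.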
Figure 4

Results obtained using ELM in dataset trials: (a) RMSE outputs of Dataset 1 and (b) F1-score outputs of Dataset 2.


Dataset 1: WQ prediction

A dataset first used by Zhao et al. (2019) and later shared on the UCI platform was analyzed. It contains values collected from 37 different water sources between 2016 and 2018 (705 days in total), and the task is to predict the average pH of the next day.

The dataset consists of 26,085 (705 × 37) rows, 11 attribute columns (maximum, minimum, and average values of the DO, EC, pH, and WTE parameters), and one label column (the next day's pH). Since the data were provided already normalized, no re-normalization was performed. For a reliable analysis, the data were shuffled and evaluated with 5-fold cross-validation.
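The shuffled 5-fold evaluation can be illustrated as follows. This is a hedged sketch in Python with NumPy, not the study's MATLAB code: the helper names, the ordinary least-squares placeholder model, and the synthetic data are assumptions introduced only to show how fold-wise RMSE values are obtained.

```python
import numpy as np


def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle the row indices once, then split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cross_val_rmse(X, y, fit, predict, k=5, seed=0):
    """Return the test RMSE of each fold; every row is tested exactly once."""
    folds = kfold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        err = y[test_idx] - predict(model, X[test_idx])
        scores.append(float(np.sqrt(np.mean(err ** 2))))
    return scores


# Placeholder model (ordinary least squares with an intercept); any
# regressor, such as an ELM, could be plugged in through fit/predict.
def fit_ols(X, y):
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_ols(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


# Synthetic stand-in with 11 attribute columns, mirroring the dataset layout.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 11))
y = X @ rng.normal(size=11) + 0.01 * rng.normal(size=200)
scores = cross_val_rmse(X, y, fit_ols, predict_ols)
print("mean RMSE over 5 folds:", np.mean(scores))
```

Reporting the mean of the fold-wise RMSE values gives a single figure that does not depend on one particular train/test partition.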

RMSE was used as the evaluation metric, and the detailed results are given in Figure 4(a). The lowest error, an RMSE of 0.0048, was obtained with 200 hidden neurons and the ‘radbas’ activation function in ELM. The entire process of data pre-processing, validation, training, and testing took 71 s in MATLAB on the hardware described above. The best error reported in the original study was 0.0115 (Zhao et al. 2019), so this study achieved less than half of that error.
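For reference, the RMSE and the comparison with the original study can be written out as below. The symbols $y_i$ (observed next-day pH), $\hat{y}_i$ (predicted value), and $n$ (number of test samples) are notation introduced here for illustration; the two error values are those quoted in the text.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}},
\qquad
\frac{0.0048}{0.0115} \approx 0.417,
\qquad
1 - \frac{0.0048}{0.0115} \approx 0.583
```

In other words, the reported error is about 42% of the earlier figure, a reduction of roughly 58%.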

Dataset 2: monitoring of drinking WQ

The dataset of the GECCO Challenge 2017, which was held for anomaly detection in WQ, was analyzed (Moritz et al. 2017). It consists of nine attribute columns (chlorine dioxide amounts, EC, flow rates, pH, redox, turbidity, and WTE variables) and a label column marked as normal or abnormal. The original dataset has 122,335 rows, which decreases to 110,815 rows when the rows with no data are deleted. MVI pre-processing methods were not tried, because the deleted rows are missing data in all columns. Z-score normalization was applied to the data. Because anomaly detection is performed on a time series, the dataset was simply divided into 60% training and 40% testing instead of using classical k-fold cross-validation. Since the F1-score was the metric required in the competition and used in the studies based on this dataset, the results are reported with the same metric here and are given in detail in Figure 4(b).
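The pre-processing steps described for this dataset can be sketched as follows. The Python/NumPy block is illustrative only and rests on assumptions not stated in the text: the split is taken in time order (first 60% of rows for training), the z-score statistics are computed on the training part alone, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def drop_missing_rows(X, y):
    """Remove every row that contains at least one missing (NaN) value."""
    keep = ~np.isnan(X).any(axis=1)
    return X[keep], y[keep]


def chronological_split(X, y, train_fraction=0.6):
    """Keep the time order: the first rows train the model, the rest test it."""
    cut = int(round(len(X) * train_fraction))
    return X[:cut], y[:cut], X[cut:], y[cut:]


def zscore(X_train, X_test):
    """Standardize both parts with the mean and std of the training part."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std


def f1_score(y_true, y_pred):
    """F1 for the positive (anomaly) class: 2TP / (2TP + FP + FN)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic stand-in: 100 time-ordered rows, 9 attributes, 10 empty rows.
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(100, 9))
X_raw[5:15] = np.nan
y_raw = (rng.random(100) < 0.1).astype(int)

X_clean, y_clean = drop_missing_rows(X_raw, y_raw)
X_tr, y_tr, X_te, y_te = chronological_split(X_clean, y_clean)
X_tr, X_te = zscore(X_tr, X_te)
print(X_tr.shape, X_te.shape)
print(f1_score(np.array([1, 1, 0, 0, 1]), np.array([1, 0, 0, 1, 1])))
```

Keeping the rows in time order means the model is always tested on observations that come after its training data, which is the usual reason for avoiding shuffled k-fold splits on time series.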

The most successful result, an F1-score of 0.9993, was obtained with 400 hidden neurons and the ‘sig’ activation function in ELM. The entire process of data pre-processing, validation, training, and testing took 92 s in MATLAB on the hardware described above. Muharemi et al., who won the competition and further improved on their competition results in a later publication, obtained an F1-score of 0.9891 using SVM (Muharemi et al. 2019). Since the score obtained in this study is higher than that of the compared study, the effectiveness and speed of the ELM algorithm are demonstrated once again.

ML algorithms continue to develop, and complex algorithms are becoming easier to use thanks to technological infrastructure. To work with ML in WQ and WSM processes and to generalize strategies at a global level, various issues need to be addressed. One of these is the quantity and quality of datasets. Complete datasets that have higher resolution, a transparent sampling frequency, and values verifiable from multiple sources are especially valuable. There is, therefore, an urgent need for publicly available WQ and WSM datasets that are better analyzed, accurately labeled, and, where possible, integrate data from multiple sources (on-site measurement, IoT, UAV, satellite, etc.). To this end, the water research community should be encouraged to share properly obtained research data without restriction (Huang et al. 2021).

Another factor is the choice of ML algorithm. Although no single algorithm suits every task in water science, ANFIS, MLP, RF, and SVM have been used most often in the field. In addition, the CatBoost, ELM, and XGBoost algorithms have achieved considerably successful results in the tasks to which they were applied. Hybrid approaches, which generally combine complementary techniques, have also produced more successful outcomes than single algorithms. In general, simply constructed techniques have low time costs but weaker performance. Conversely, methods with excellent performance often require large datasets, involve complex structures, and have high hardware and time costs. Nevertheless, interest in DL algorithms has grown markedly following their success on many kinds of data. One reason for this popularity is the ease of image-based processing with CNNs and the ability of the LSTM and GRU algorithms to deal successfully with time series. Sharing the hardware specifications and processing times of studies should also be encouraged so that detailed analyses are possible.

ML models may not be easy to grasp and apply, and a water researcher understandably may not be entirely familiar with these techniques. For this reason, developments such as interdisciplinary communication, closer cooperation with data mining experts, geographical and intuitive visualizations, and the spread of cloud-based computing applications (GEE and MPC) are gaining importance (Chen et al. 2022; Yang et al. 2022). In addition, the ethical implications of a study should be considered because of the importance of the subjects studied. Ethical concerns may arise because biases in the dataset and the sensitivity of the study's content can affect the public planning that relies on that study (Sit et al. 2020). For this reason, more attention should be paid to transparency at every stage of studies involving such vital processes.

This broad-spectrum article has provided an overview of how machine learning techniques are applied to problems in both WQ and WSM. It presents, for both beginner and advanced researchers, a descriptive introduction to the basic concepts, a bibliometric analysis based on an extensive literature review, the types of data sources used, the pre-processing applied to the data, the machine learning algorithms used and the topics in which they are applied, and the evaluation criteria for the outputs. The machine learning techniques introduced were selected and categorized according to their effectiveness, and examples from the most recent work are provided. In addition, two datasets closely related to the selected topics and compiled in two different fields were tested with various parameters. The challenges and limitations of WQ and WSM processes were discussed, and the points that require transparency for the development and reproducibility of research were highlighted. It is thought that this study will benefit water researchers by presenting a general summary of the use of ML.

All relevant data are included in the paper or its Supplementary Information.

Akhtar, N., Ishak, M. I. S., Ahmad, M. I., Umar, K., Md Yusuff, M. S., Anees, M. T., Qadir, A. & Almanasir, Y. K. A. 2021 Modification of the water quality index (WQI) process for simple calculation using the multi-criteria decision-making (MCDM) method: a review. Water (Switzerland) 13, 905. https://doi.org/10.3390/W13070905/S1.
Alasadi, S. A. & Bhaya, W. S. 2017 Review of data preprocessing techniques in data mining. Journal of Engineering and Applied Sciences 12, 4102–4107.
Altman, N. S. 1992 An introduction to kernel and nearest-neighbor nonparametric regression. American Statistician 46, 175. https://doi.org/10.2307/2685209.
Ansari, M. & Akhoondzadeh, M. 2020 Mapping water salinity using Landsat-8 OLI satellite images (Case study: Karun basin located in Iran). Advances in Space Research 65, 1490–1502. https://doi.org/10.1016/J.ASR.2019.12.007.
Aria, M. & Cuccurullo, C. 2017 bibliometrix: an R-tool for comprehensive science mapping analysis. Journal of Informetrics 11, 959–975. https://doi.org/10.1016/j.joi.2017.08.007.
Ba-Alawi, A. H., Vilela, P., Loy-Benitez, J., Heo, S. & Yoo, C. 2021 Intelligent sensor validation for sustainable influent quality monitoring in wastewater treatment plants using stacked denoising autoencoders. Journal of Water Process Engineering 43, 102206. https://doi.org/10.1016/j.jwpe.2021.102206.
Babyak, M. A. 2004 What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine 66, 411–421. https://doi.org/10.1097/01.psy.0000127692.23278.a9.
Bergmeir, C. & Benítez, J. M. 2012 On the use of cross-validation for time series predictor evaluation. Information Science (N Y) 191, 192–213. https://doi.org/10.1016/j.ins.2011.12.028.
Bobadilla, M. C., Lorza, R. L., Gómez, F. S. & García, R. E. 2020 Adsorptive of nickel in wastewater by olive stone waste: optimization through multi-response surface methodology using desirability functions. Water 12, 1320. https://doi.org/10.3390/W12051320.
Botchkarev, A. 2019 A new typology design of performance metrics to measure errors in machine learning regression algorithms. Interdisciplinary Journal of Information, Knowledge, and Management 14, 045–076. https://doi.org/10.28945/4184.
Breiman, L. 1996 Bagging predictors. Machine Learning 24, 123–140. https://doi.org/10.1007/BF00058655.
Breiman, L. 2001 Random forests. Machine Learning 45 (1), 5–32. https://doi.org/10.1023/A:1010933404324.
Broomhead, D. S. & Lowe, D. 1988 Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks.
Brown, R. M., McClelland, N. I., Deininger, R. A. & Tozer, R. G. 1970 A water quality index-do we dare. Water Sew Works 117 (10), 339–343.
CCME 2001 Canadian Water Quality Index 1.0 Technical Report and Users Manual.
Chadegani, A. A., Salehi, H., Yunus, M. M., Farhadi, H., Fooladi, M., Farhadi, M. & Ebrahim, N. A. 2013 A comparison between two main academic literature collections: Web of Science and Scopus databases. Asian Social Science 9, 18–26. https://doi.org/10.5539/ass.v9n5p18.
Chang, F.-J., Chen, P.-A., Chang, L.-C. & Tsai, Y.-H. 2016 Estimating spatio-temporal dynamics of stream total phosphate concentration by soft computing techniques. Science of the Total Environment 562, 228–236. https://doi.org/10.1016/j.scitotenv.2016.03.219.
Chen, T. & Guestrin, C. 2016 XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 13-17-August-2016, pp. 785–794. https://doi.org/10.1145/2939672.2939785.
Chen, K., Chen, H., Zhou, C., Huang, Y., Qi, X., Shen, R., Liu, F., Zuo, M., Zou, X., Wang, J., Zhang, Y., Chen, D., Chen, X., Deng, Y. & Ren, H. 2020a Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Research 171, 115454. https://doi.org/10.1016/j.watres.2019.115454.
Chen, Y., Song, L., Liu, Y., Yang, L. & Li, D. 2020b A review of the artificial neural network models for water quality prediction. Applied Sciences 10, 5776. https://doi.org/10.3390/APP10175776.
Chen, J., Chen, S., Fu, R., Li, D., Jiang, H., Wang, C., Peng, Y., Jia, K. & Hicks, B. J. 2022 Remote sensing big data for water environment monitoring: current status, challenges, and future prospects. Earth's Future 10, e2021EF002289. https://doi.org/10.1029/2021EF002289.
Chhabra, G., Vashisht, V. & Ranjan, J. 2017 A comparison of multiple imputation methods for data with missing values. Indian Journal of Science and Technology 10, 1–7. https://doi.org/10.17485/ijst/2017/v10i19/110646.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. & Bengio, Y. 2014 Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1724–1734. https://doi.org/10.3115/v1/D14-1179.
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. 2014 Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
Chun-Lin, L. 2010 A tutorial of the wavelet transform. NTUEE, Taiwan 21, 22.
Cortes, C. & Vapnik, V. 1995 Support-vector networks. Machine Learning 20, 273–297. https://doi.org/10.1007/BF00994018.
Cunningham, P. 2008 Dimension reduction. Cognitive Technologies, 91–112. https://doi.org/10.1007/978-3-540-75171-7_4/COVER.
Davie, T. 2008 Fundamentals of Hydrology, 2nd edn. Routledge, London, UK.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K. & Fei-Fei, L. 2009 ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255. https://doi.org/10.1109/CVPR.2009.5206848.
Dörnhöfer, K. & Oppelt, N. 2016 Remote sensing for lake research and monitoring – recent advances. Ecological Indicators 64, 105–122. https://doi.org/10.1016/j.ecolind.2015.12.009.
Drucker, H., Burges, C. J., Kaufman, L., Smola, A. & Vapnik, V. 1996 Support vector regression machines. Advances in Neural Information Processing Systems 9, 155–161.
Eggimann, S., Mutzner, L., Wani, O., Schneider, M. Y., Spuhler, D., Moy De Vitry, M., Beutler, P. & Maurer, M. 2017 The potential of knowing more: a review of data-driven urban water management. Environmental Science & Technology 51, 2538–2553. https://doi.org/10.1021/ACS.EST.6B04267/ASSET/IMAGES/MEDIUM/ES-2016-04267S_0004.GIF.
Ertuğrul, Ö. F., Acar, E., Aldemir, E. & Öztekin, A. 2021 Automatic diagnosis of cardiovascular disorders by sub images of the ECG signal using multi-feature extraction methods and randomized neural network. Biomed Signal Process Control 64, 102260. https://doi.org/10.1016/j.bspc.2020.102260.
Fan, C., Chen, M., Wang, X., Wang, J. & Huang, B. 2021 A review on data preprocessing techniques toward efficient and reliable knowledge discovery from building operational data. Frontiers in Energy Research 9, 77. https://doi.org/10.3389/FENRG.2021.652801/BIBTEX.
Friedman, J. H. 2001 Greedy function approximation: a gradient boosting machine. The Annals of Statistics 29. https://doi.org/10.1214/aos/1013203451.
Friedman, J. H. 2002 Stochastic gradient boosting. Computational Statistics and Data Analysis 38, 367–378. https://doi.org/10.1016/S0167-9473(01)00065-2.
García-Alba, J., Bárcena, J. F., Ugarteburu, C. & García, A. 2019 Artificial neural networks as emulators of process-based models to analyse bathing water quality in estuaries. Water Research 150, 283–295. https://doi.org/10.1016/J.WATRES.2018.11.063.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. & Bengio, Y. 2014 Generative adversarial networks. Commun ACM 63, 139–144. https://doi.org/doi.org/10.48550/arXiv.1406.2661.
Goodfellow, I., Bengio, Y. & Courville, A. 2016 Deep Learning. MIT Press, London, UK.
Gopi Kiran, M., Das, R., Behera, S. K., Pakshirajan, K. & Das, G. 2021 Modelling a rotating biological contactor treating heavy metal contaminated wastewater using artificial neural network. Water Supply 21, 1895–1912. https://doi.org/10.2166/ws.2020.304.
Gorelick, N., Hancher, M., Dixon, M., Ilyushchenko, S., Thau, D. & Moore, R. 2017 Google Earth Engine: planetary-scale geospatial analysis for everyone. Remote Sensing of Environment 202, 18–27. https://doi.org/10.1016/J.RSE.2017.06.031.
Grossmann, A. & Morlet, J. 1984 Decomposition of Hardy functions into square integrable wavelets of constant shape. SIAM Journal on Mathematical Analysis 15, 723–736. https://doi.org/10.1137/0515056.
Hamshaw, S. D., Dewoolkar, M. M., Schroth, A. W., Wemple, B. C. & Rizzo, D. M. 2018 A new machine-learning approach for classifying hysteresis in suspended-sediment discharge relationships using high-frequency monitoring data. Water Resources Research 54, 4040–4058. https://doi.org/10.1029/2017WR022238.
Han, J., Kamber, M. & Pei, J. 2012 Data Mining: Concepts and Techniques. Elsevier. https://doi.org/10.1016/C2009-0-61819-5.
Hanoon, M. S., Ahmed, A. N., Fai, C. M., Birima, A. H., Razzaq, A., Sherif, M., Sefelnasr, A. & El-Shafie, A. 2021 Application of artificial intelligence models for modeling water quality in groundwater: comprehensive review, evaluation and future trends. Water, Air, & Soil Pollution 232, 411. https://doi.org/10.1007/s11270-021-05311-z.
Harshman, R. A. 1970 Foundations of the PARAFAC Procedure: Models and Conditions for an ‘Explanatory’ Multimodal Factor Analysis.
Hassan, N. & Woo, C. S. 2021 Machine learning application in water quality using satellite data. IOP Conference Series: Earth and Environmental Science 842, 012018. https://doi.org/10.1088/1755-1315/842/1/012018.
Heddam, S. & Dechemi, N. 2015 A new approach based on the dynamic evolving neural-fuzzy inference system (DENFIS) for modelling coagulant dosage (Dos): case study of water treatment plant of Algeria. Desalination and Water Treatment 53, 1045–1053. https://doi.org/10.1080/19443994.2013.878669.
Hochreiter, S. & Schmidhuber, J. 1997 Long short-term memory. Neural Computation 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.
Holland, J. H. 1975 Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor.
Horton, R. K. 1965 An index number system for rating water quality. Water Pollution Control Federation 37, 300–306.
Huang, G.-B., Zhu, Q.-Y. & Siew, C.-K. 2004 Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541). IEEE, pp. 985–990. https://doi.org/10.1109/IJCNN.2004.1380068.
Huang, G.-B., Zhu, Q.-Y. & Siew, C.-K. 2006 Extreme learning machine: theory and applications. Neurocomputing 70, 489–501. https://doi.org/10.1016/j.neucom.2005.12.126.
Huang, R., Ma, C., Ma, J., Huangfu, X. & He, Q. 2021 Machine learning in natural and engineered water systems. Water Research 205, 117666. https://doi.org/10.1016/j.watres.2021.117666.
Huang, Y., Wang, X., Xiang, W., Wang, T., Otis, C., Sarge, L., Lei, Y. & Li, B. 2022 Forward-looking roadmaps for long-term continuous water quality monitoring: bottlenecks, innovations, and prospects in a critical review. Environmental Science and Technology 56, 5334–5354. https://doi.org/10.1021/acs.est.1c07857.
Hunt, E. B., Marin, J. & Stone, P. J. 1966 Experiments in Induction. Academic Press, New York, USA.
Huynh, N. H., Böer, G. & Schramm, H. 2022 Self-attention and generative adversarial networks for algae monitoring. European Journal of Remote Sensing 55, 10–22. https://doi.org/10.1080/22797254.2021.2010605.
Ivakhnenko, A. G. 1970 Heuristic self-organization in problems of engineering cybernetics. Automatica 6, 207–219. https://doi.org/10.1016/0005-1098(70)90092-0.
Jang, J.-S. R. 1993 ANFIS: adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics 23, 665–685. https://doi.org/10.1109/21.256541.
Joseph, V. R. 2022 Optimal ratio for data splitting. Statistical Analysis and Data Mining: The ASA Data Science Journal 15, 531–538. https://doi.org/10.1002/sam.11583.
Joseph, V. R. & Vakayil, A. 2022 SPlit: an optimal method for data splitting. Technometrics 64, 166–176. https://doi.org/10.1080/00401706.2021.1921037.
Jun, L. Y., Karri, R. R., Yon, L. S., Mubarak, N. M., Bing, C. H., Mohammad, K., Jagadish, P. & Abdullah, E. C. 2020 Modeling and optimization by particle swarm embedded neural network for adsorption of methylene blue by jicama peroxidase immobilized on buckypaper/polyvinyl alcohol membrane. Environmental Research 183, 109158. https://doi.org/10.1016/J.ENVRES.2020.109158.
Kalteh, A. M., Hjorth, P. & Berndtsson, R. 2008 Review of the self-organizing map (SOM) approach in water resources: analysis, modelling and application. Environmental Modelling & Software 23, 835–845. https://doi.org/10.1016/j.envsoft.2007.10.001.
Kasabov, N. K. & Song, Q. 2002 DENFIS: dynamic evolving neural-fuzzy inference system and its application for time-series prediction. IEEE Transactions on Fuzzy Systems 10, 144–154. https://doi.org/10.1109/91.995117.
Kennedy, J. & Eberhart, R. 1995 Particle swarm optimization. In: Proceedings of ICNN'95 - International Conference on Neural Networks. IEEE, pp. 1942–1948. https://doi.org/10.1109/ICNN.1995.488968.
Kirstein, J., Rygaard, M., Borup, M., Löwe, R. & Blokker, M. 2020 Data-driven water distribution system analysis–exploring challenges and potentials from smart meters and beyond. Downloaded From Orbit.dtu.dk on: Au 27, 2022.
Kisi, O., Alizamir, M. & Docheshmeh Gorgij, A. 2020 Dissolved oxygen prediction using a new ensemble method. Environmental Science and Pollution Research 27, 9589–9603. https://doi.org/10.1007/s11356-019-07574-w.
Kohonen, T. 1982 Self-organized formation of topologically correct feature maps. Biological Cybernetics 43, 59–69. https://doi.org/10.1007/BF00337288.
Kohonen, T. 1990 The self-organizing map. Proceedings of the IEEE 78, 1464–1480. https://doi.org/10.1109/5.58325.
Krizhevsky, A., Sutskever, I. & Hinton, G. E. 2017 ImageNet classification with deep convolutional neural networks. Commun ACM 60, 84–90. https://doi.org/10.1145/3065386.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W. & Jackel, L. 1989 Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems, 396–404.
Li, D., Chen, D., Goh, J. & Ng, S. 2018 Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series. https://doi.org/10.48550/arxiv.1809.04758.
Li, L., Jiang, P., Xu, H., Lin, G., Guo, D. & Wu, H. 2019 Water quality prediction based on recurrent neural network and improved evidence theory: a case study of Qiantang River, China. Environmental Science and Pollution Research 26, 19879–19896. https://doi.org/10.1007/s11356-019-05116-y.
Lin, T., Horne, B. G., Tiňo, P. & Giles, C. L. 1996 Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks 7, 1329–1338. https://doi.org/10.1109/72.548162.
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y. & Alsaadi, F. E. 2017 A survey of deep neural network architectures and their applications. Neurocomputing 234, 11–26. https://doi.org/10.1016/j.neucom.2016.12.038.
Liu, H., Yu, T., Hu, B., Hou, X., Zhang, Z., Liu, X., Liu, J., Wang, X., Zhong, J., Tan, Z., Xia, S. & Qian, B. 2021 UAV-Borne Hyperspectral imaging remote sensing system based on acousto-optic tunable filter for water quality monitoring. Remote Sens (Basel) 13, 4069. https://doi.org/10.3390/rs13204069.
Lloyd, S. P. 1982 Least squares quantization in PCM. IEEE Transactions on Information Theory 28, 129–137. https://doi.org/10.1109/TIT.1982.1056489.
Lombard, M. A., Bryan, M. S., Jones, D. K., Bulka, C., Bradley, P. M., Backer, L. C., Focazio, M. J., Silverman, D. T., Toccalino, P., Argos, M., Gribble, M. O. & Ayotte, J. D. 2021 Machine learning models of arsenic in private wells throughout the conterminous United States as a tool for exposure assessment in human health studies. Environmental Science and Technology 55, 5012–5023. https://doi.org/10.1021/acs.est.0c05239.
Lowe, M., Qin, R. & Mao, X. 2022 A review on machine learning, artificial intelligence, and smart technology in water treatment and monitoring. Water (Basel) 14, 1384. https://doi.org/10.3390/w14091384.
MacQueen
J.
1967
Classification and analysis of multivariate observations
. In:
5th Berkeley Symp. Math. Statist. Probability
(Lucien M. Le Cam & Jerzy Neyman, eds)
.
California Press
,
Los Angeles, USA
, pp.
281
297
.
Mamandipoor
B.
,
Majd
M.
,
Sheikhalishahi
S.
,
Modena
C.
&
Osmani
V.
2020
Monitoring and detecting faults in wastewater treatment plants using deep learning
.
Environmental Monitoring and Assessment
192
,
148
.
https://doi.org/10.1007/s10661-020-8064-1
.
Martín-Martín
A.
,
Orduna-Malea
E.
,
Thelwall
M.
&
Delgado López-Cózar
E.
2018
Google scholar, web of science, and scopus: a systematic comparison of citations in 252 subject categories
.
Journal of Informetrics
12
,
1160
1177
.
https://doi.org/10.1016/j.joi.2018.09.002
.
Mashhadi
N.
,
Shahrour
I.
,
Attoue
N.
,
el Khattabi
J.
&
Aljer
A.
2021
Use of machine learning for leak detection and localization in water distribution systems
.
Smart Cities
4
,
1293
1315
.
https://doi.org/10.3390/smartcities4040069
.
McCulloch
W. S.
&
Pitts
W.
1943
A logical calculus of the ideas immanent in nervous activity
.
Bulletin of Mathematical Biology
5
,
115
133
.
https://doi.org/10.1007/BF02478259
.
Mienye
I. D.
,
Sun
Y.
&
Wang
Z.
2019
Prediction performance of improved decision tree-based algorithms: a review
.
Procedia Manufacturing
35
,
698
703
.
https://doi.org/10.1016/j.promfg.2019.06.011
.
Mo
S.
,
Zabaras
N.
,
Shi
X.
&
Wu
J.
2019a
Deep autoregressive neural networks for high-dimensional inverse problems in groundwater contaminant source identification
.
Water Resources Research
55
,
3856
3881
.
https://doi.org/10.1029/2018WR024638
.
Mo
S.
,
Zhu
Y.
,
Zabaras
N.
,
Shi
X.
&
Wu
J.
2019b
Deep convolutional encoder-decoder networks for uncertainty quantification of dynamic multiphase flow in heterogeneous media
.
Water Resources Research
55
,
703
728
.
https://doi.org/10.1029/2018WR023528
.
Monks
I.
,
Stewart
R. A.
,
Sahin
O.
&
Keller
R.
2019
Revealing unreported benefits of digital water metering: literature review and expert opinions
.
Water (Basel)
.
https://doi.org/10.3390/w11040838
.
Moreno-Rodenas
A. M.
,
Duinmeijer
A.
&
Clemens
F. H. L. R.
2021
Deep-learning based monitoring of FOG layer dynamics in wastewater pumping stations
.
Water Research
202
,
117482
.
https://doi.org/10.1016/j.watres.2021.117482
.
Morgan
J. N.
&
Sonquist
J. A.
1963
Problems in the analysis of survey data, and a proposal
.
Journal of the American Statistical Association
58
,
415
434
.
https://doi.org/10.1080/01621459.1963.10500855
.
Moritz
S.
,
Friese
M.
,
Stork
J.
,
Rebolledo
M.
,
Fischbach
A.
&
Bartz-Beielstein
T.
2017
GECCO Industrial Challenge 2017 Dataset: A water quality dataset for the ‘Monitoring of drinking-water quality’ competition at the Genetic and Evolutionary Computation Conference 2017, Berlin, Germany. https://doi.org/10.5281/ZENODO.3884465
.
Moritz
S.
,
Rehbach
F.
,
Chandrasekaran
S.
,
Rebolledo
M.
&
Bartz-Beielstein
T.
2018
GECCO Industrial Challenge 2018 Dataset: A water quality dataset for the ‘Internet of Things: Online Anomaly Detection for Drinking Water Quality’ competition at the Genetic and Evolutionary Computation Conference 2018, Kyoto, Japan. https://doi.org/10.5281/ZENODO.3884398
.
Muharemi
F.
,
Logofătu
D.
&
Leon
F.
2019
Machine learning approaches for anomaly detection of water quality on a real-world data set
.
Journal of Information and Telecommunication
3
,
294
307
.
https://doi.org/10.1080/24751839.2019.1565653
.
Najah Ahmed
A.
,
Binti Othman
F.
,
Abdulmohsin Afan
H.
,
Khaleel Ibrahim
R.
,
Ming Fai
C.
,
Shabbir Hossain
M.
,
Ehteram
M.
&
Elshafie
A.
2019
Machine learning methods for better water quality prediction
.
Journal of Hydrology
578
,
124084
.
https://doi.org/10.1016/j.jhydrol.2019.124084
.
Nasir
N.
,
Kansal
A.
,
Alshaltone
O.
,
Barneih
F.
,
Sameer
M.
,
Shanableh
A.
&
Al-Shamma'a
A.
2022
Water quality classification using machine learning algorithms
.
Journal of Water Process Engineering
48
,
102920
.
https://doi.org/10.1016/j.jwpe.2022.102920
.
Obadina
O. O.
,
Thaha
M. A.
,
Mohamed
Z.
&
Shaheed
M. H.
2022
Grey-box modelling and fuzzy logic control of a Leader–Follower robot manipulator system: a hybrid Grey Wolf–Whale Optimisation approach
.
ISA Transactions
https://doi.org/10.1016/j.isatra.2022.02.023
.
Omer
N. H.
2019
Water quality parameters
. In:
Water Quality
(
Summers
K.
, ed.).
IntechOpen
,
Rijeka
.
https://doi.org/10.5772/intechopen.89657
.
Peleato
N. M.
2022
Application of convolutional neural networks for prediction of disinfection by-products
.
Scientific Reports
12
,
612
.
https://doi.org/10.1038/s41598-021-03881-w
.
Podgorski
J.
,
Araya
D.
&
Berg
M.
2022
Geogenic manganese and iron in groundwater of Southeast Asia and Bangladesh – machine learning spatial prediction modeling and comparison with arsenic
.
Science of The Total Environment
833
,
155131
.
https://doi.org/10.1016/j.scitotenv.2022.155131
.
Prokhorenkova
L.
,
Gusev
G.
,
Vorobev
A.
,
Dorogush
A. V.
&
Gulin
A.
2017
CatBoost: unbiased boosting with categorical features
. In:
Advances in Neural Information Processing Systems 2018-December
, pp.
6638
6648
.
Radmehr
S.
,
Hosseini Sabzevari
M.
,
Ghaedi
M.
,
Ahmadi Azqhandi
M. H.
&
Marahel
F.
2021
Adsorption of nalidixic acid antibiotic using a renewable adsorbent based on graphene oxide from simulated wastewater
.
Journal of Environmental Chemical Engineering
9
,
105975
.
https://doi.org/10.1016/j.jece.2021.105975
.
Raftery
A. E.
,
Gneiting
T.
,
Balabdaoui
F.
&
Polakowski
M.
2005
Using Bayesian model averaging to calibrate forecast ensembles
.
Monthly Weather Review
133
,
1155
1174
.
https://doi.org/10.1175/MWR2906.1
.
Rahim
M. S.
,
Nguyen
K. A.
,
Stewart
R. A.
,
Giurco
D.
&
Blumenstein
M.
2020
Machine learning and data analytic techniques in digital water metering: a review
.
Water (Basel)
12
,
294
.
https://doi.org/10.3390/w12010294
.
Redmon
J.
,
Divvala
S.
,
Girshick
R.
&
Farhadi
A.
2016
You only look once: unified, real-time object detection
. In:
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
.
IEEE
, pp.
779
788
.
https://doi.org/10.1109/CVPR.2016.91
.
Rockström
J.
,
Steffen
W.
,
Noone
K.
,
Persson
Å.
,
Chapin
F. S.
,
Lambin
E. F.
,
Lenton
T. M.
,
Scheffer
M.
,
Folke
C.
,
Schellnhuber
H. J.
,
Nykvist
B.
,
de Wit
C. A.
,
Hughes
T.
,
van der Leeuw
S.
,
Rodhe
H.
,
Sörlin
S.
,
Snyder
P. K.
,
Costanza
R.
,
Svedin
U.
,
Falkenmark
M.
,
Karlberg
L.
,
Corell
R. W.
,
Fabry
V. J.
,
Hansen
J.
,
Walker
B.
,
Liverman
D.
,
Richardson
K.
,
Crutzen
P.
&
Foley
J. A.
2009
A safe operating space for humanity
.
Nature
461
,
472
475
.
https://doi.org/10.1038/461472a
.
Ross
M. R. V.
,
Topp
S. N.
,
Appling
A. P.
,
Yang
X.
,
Kuhn
C.
,
Butman
D.
,
Simard
M.
&
Pavelsky
T. M.
2019
AquaSat: a data set to enable remote sensing of water quality for inland waters
.
Water Resources Research
55
,
10012
10025
.
https://doi.org/10.1029/2019WR024883
.
Rumelhart
D. E.
,
Hinton
G. E.
&
Williams
R. J.
1985
Learning internal representations by error propagation
.
Institute for Cognitive Science, University of California, San Diego, La Jolla
.
Rumelhart
D. E.
,
Hinton
G. E.
&
Williams
R. J.
1986
Learning representations by back-propagating errors
.
Nature
323
,
533
536
.
https://doi.org/10.1038/323533a0
.
Russo
S.
,
Disch
A.
,
Blumensaat
F.
&
Villez
K.
2020
Anomaly detection using deep autoencoders for in-situ wastewater systems monitoring data
.
Saranya
C.
&
Manik
G.
2013
A study on normalization techniques for privacy preserving data mining
.
International Journal of Engineering and Technology
5
,
2701
2704
.
Shan
K.
,
Ouyang
T.
,
Wang
X.
,
Yang
H.
,
Zhou
B.
,
Wu
Z.
&
Shang
M.
2022
Temporal prediction of algal parameters in Three Gorges Reservoir based on highly time-resolved monitoring and long short-term memory network
.
Journal of Hydrology
605
,
127304
.
https://doi.org/10.1016/j.jhydrol.2021.127304
.
Shim
J.
,
Park
S.
&
Cho
K. H.
2021
Deep learning model for simulating influence of natural organic matter in nanofiltration
.
Water Research
197
,
117070
.
https://doi.org/10.1016/j.watres.2021.117070
.
Shorten
C.
&
Khoshgoftaar
T. M.
2019
A survey on image data augmentation for deep learning
.
Journal of Big Data
6
,
1
48
.
https://doi.org/10.1186/s40537-019-0197-0
.
Singh
V. K.
,
Singh
P.
,
Karmakar
M.
,
Leta
J.
&
Mayr
P.
2021
The journal coverage of Web of Science, Scopus and Dimensions: a comparative analysis
.
Scientometrics
126
,
5113
5142
.
https://doi.org/10.1007/s11192-021-03948-5
.
Singha
S.
,
Pasupuleti
S.
,
Singha
S. S.
,
Singh
R.
&
Kumar
S.
2021
Prediction of groundwater quality using efficient machine learning technique
.
Chemosphere
276
,
130265
.
https://doi.org/10.1016/j.chemosphere.2021.130265
.
Sit
M.
,
Demiray
B. Z.
,
Xiang
Z.
,
Ewing
G. J.
,
Sermet
Y.
&
Demir
I.
2020
A comprehensive review of deep learning applications in hydrology and water resources
.
Water Science and Technology
82
,
2635
2670
.
https://doi.org/10.2166/wst.2020.369
.
Specht
D. F.
1991
A general regression neural network
.
IEEE Transactions on Neural Networks
2
,
568
576
.
Spellman
F. R.
2017
The Drinking Water Handbook
, 3rd edn.
CRC Press
,
Boca Raton 
.
https://doi.org/10.1201/9781315159126
.
Srivastava
A.
,
A
K.
,
Nair
A.
,
Ram
S.
,
Agarwal
S.
,
Ali
J.
,
Singh
R.
&
Garg
M. C.
2021
Response surface methodology and artificial neural network modelling for the performance evaluation of pilot-scale hybrid nanofiltration (NF) & reverse osmosis (RO) membrane system for the treatment of brackish ground water
.
Journal of Environmental Management
278
,
111497
.
https://doi.org/10.1016/j.jenvman.2020.111497
.
Steffen
W.
,
Richardson
K.
,
Rockström
J.
,
Cornell
S. E.
,
Fetzer
I.
,
Bennett
E. M.
,
Biggs
R.
,
Carpenter
S. R.
,
de Vries
W.
,
de Wit
C. A.
,
Folke
C.
,
Gerten
D.
,
Heinke
J.
,
Mace
G. M.
,
Persson
L. M.
,
Ramanathan
V.
,
Reyers
B.
&
Sörlin
S.
2015
Planetary boundaries: guiding human development on a changing planet
.
Science
347
,
1259855
.
https://doi.org/10.1126/science.1259855
.
Sun
A. Y.
&
Scanlon
B. R.
2019
How can big data and machine learning benefit environment and water management: a survey of methods, applications, and future directions
.
Environmental Research Letters
14
,
073001
.
https://doi.org/10.1088/1748-9326/ab1b7d
.
Sutadian
A. D.
,
Muttil
N.
,
Yilmaz
A. G.
&
Perera
B. J. C.
2016
Development of river water quality indices – a review
.
Environmental Monitoring and Assessment
188
,
58
.
https://doi.org/10.1007/s10661-015-5050-0
.
Tiyasha
,
Tung
T. M.
&
Yaseen
Z. M.
2020
A survey on river water quality modelling using artificial intelligence models: 2000–2020
.
Journal of Hydrology
585
,
124670
.
https://doi.org/10.1016/j.jhydrol.2020.124670
.
UN Environment Programme
2021
Progress on ambient water quality. Tracking SDG 6 series: global indicator 6.3.2 updates and acceleration needs, Nairobi
.
UNESCO
2021
The United Nations World Water Development Report 2021: Valuing Water, United Nations Educational, Scientific and Cultural Organization
.
UNESCO
,
Paris
.
U.S. Bureau of Reclamation
2020
Water Facts – Worldwide Water Supply [WWW Document]. Available from: https://www.usbr.gov/mp/arwec/water-facts-ww-water-sup.html (accessed 31 August 2022)
.
Velani
A. F.
,
Narwane
V. S.
&
Gardas
B. B.
2022
Contribution of internet of things in water supply chain management: a bibliometric and content analysis
.
Journal of Modelling in Management Ahead-of-Print
.
https://doi.org/10.1108/JM2-04-2021-0090
.
Vélez-Nicolás
M.
,
García-López
S.
,
Barbero
L.
,
Ruiz-Ortiz
V.
&
Sánchez-Bellón
Á.
2021
Applications of Unmanned Aerial Systems (UASs) in hydrology: a review
.
Remote Sensing (Basel)
13
,
1359
.
https://doi.org/10.3390/rs13071359
.
Velliangiri
S.
,
Alagumuthukrishnan
S.
&
Thankumar Joseph
S. I.
2019
A review of dimensionality reduction techniques for efficient computation
.
Procedia Computer Science
165
,
104
111
.
https://doi.org/10.1016/j.procs.2020.01.079
.
Wagle
N.
,
Acharya
T. D.
&
Lee
D. H.
2020
Comprehensive review on application of machine learning algorithms for water quality parameter estimation using remote sensing data
.
Sensors and Materials
32
,
3879
.
https://doi.org/10.18494/SAM.2020.2953
.
Walczak
B.
&
Massart
D. L.
1997
Noise suppression and signal compression using the wavelet packet transform
.
Chemometrics and Intelligent Laboratory Systems
36
,
81
94
.
https://doi.org/10.1016/S0169-7439(96)00077-9
.
Wang
S.
,
Li
J.
,
Zhang
W.
,
Cao
C.
,
Zhang
F.
,
Shen
Q.
,
Zhang
X.
&
Zhang
B.
2021a
A dataset of remote-sensed Forel-Ule Index for global inland waters during 2000–2018
.
Scientific Data
8
(
1
),
1
10
.
https://doi.org/10.1038/s41597-021-00807-z
.
Wang
Z.
,
Qin
C.
,
Wan
B.
&
Song
W. W.
2021b
A comparative study of common nature-inspired algorithms for continuous function optimization
.
Entropy
23
,
874
.
https://doi.org/10.3390/e23070874
.
Wei
J.
&
Zou
K.
2019
EDA: easy data augmentation techniques for boosting performance on text classification tasks
. In:
EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference
, pp.
6382
6388
.
https://doi.org/10.48550/arxiv.1901.11196
.
Wei
S.
,
Zou
S.
,
Liao
F.
&
Lang
W.
2020
A comparison on data augmentation methods based on deep learning for audio classification
.
Journal of Physics: Conference Series
1453
,
012085
.
https://doi.org/10.1088/1742-6596/1453/1/012085
.
Xu
C.
,
Chen
X.
&
Zhang
L.
2021
Predicting river dissolved oxygen time series based on stand-alone models and hybrid wavelet-based models
.
Journal of Environmental Management
295
,
113085
.
https://doi.org/10.1016/j.jenvman.2021.113085
.
Xue
M.
&
Zhu
C.
2009
A study and application on machine learning of artificial intelligence
. In:
2009 International Joint Conference on Artificial Intelligence
.
IEEE
, pp.
272
274
.
https://doi.org/10.1109/JCAI.2009.55
.
Yan
T.
,
Shen
S.-L.
&
Zhou
A.
2022
Indices and models of surface water quality assessment: review and perspectives
.
Environmental Pollution
308
,
119611
.
https://doi.org/10.1016/j.envpol.2022.119611
.
Yang
X.-S.
2010
A new metaheuristic bat-inspired algorithm
.
Studies in Computational Intelligence
284
,
65
74
.
Yang
X.-S.
2014
Nature-Inspired Optimization Algorithms
.
Academic Press
,
London, UK
.
Yang
L.
,
Driscol
J.
,
Sarigai
S.
,
Wu
Q.
,
Lippitt
C. D.
&
Morgan
M.
2022
Towards synoptic water monitoring systems: a review of AI methods for automating water body detection and water quality monitoring using remote sensing
.
Sensors
22
,
2416
.
https://doi.org/10.3390/s22062416
.
Yin
X.
,
Ma
T.
,
Bouferguene
A.
&
Al-Hussein
M.
2021
Automation for sewer pipe assessment: CCTV video interpretation algorithm and sewer pipe video assessment (SPVA) system development
.
Automation in Construction
125
,
103622
.
https://doi.org/10.1016/j.autcon.2021.103622
.
Zadeh
L. A.
1965
Fuzzy sets
.
Information and Control
8
,
338
353
.
https://doi.org/10.1016/S0019-9958(65)90241-X
.
Zadeh
L. A.
1978
Fuzzy sets as a basis for a theory of possibility
.
Fuzzy Sets and Systems
1
,
3
28
.
https://doi.org/10.1016/0165-0114(78)90029-5
.
Zhang
Y.-F.
,
Fitch
P.
&
Thorburn
P. J.
2020
Predicting the trend of dissolved oxygen based on the kPCA-RNN model
.
Water (Basel)
12
,
585
.
https://doi.org/10.3390/w12020585
.
Zhang
X.
,
Zhou
Y.
&
Luo
J.
2021
Deep learning for processing and analysis of remote sensing big data: a technical review
.
Big Earth Data
1
34
.
https://doi.org/10.1080/20964471.2021.1964879
.
Zhao
L.
,
Gkountouna
O.
&
Pfoser
D.
2019
Spatial auto-regressive dependency interpretable learning based on spatial topological constraints
.
ACM Transactions on Spatial Algorithms and Systems
5
,
1
28
.
https://doi.org/10.1145/3339823
.
Zhou
Z.-H.
&
Feng
J.
2017
Deep forest: towards an alternative to deep neural networks
. In:
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
.
International Joint Conferences on Artificial Intelligence Organization
, California, pp.
3553
3559
.
https://doi.org/10.24963/ijcai.2017/497
.
Zhou
X.
,
Tang
Z.
,
Xu
W.
,
Meng
F.
,
Chu
X.
,
Xin
K.
&
Fu
G.
2019
Deep learning identifies accurate burst locations in water distribution networks
.
Water Research
166
,
115058
.
https://doi.org/10.1016/j.watres.2019.115058
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).