Data-driven models for predicting lake eutrophication rely on long-term water quality datasets. Where such data are not readily available, lake management through data-driven modeling becomes impractical. A novel approach is therefore presented here for predicting eutrophication indicators, namely dissolved oxygen, Secchi depth, total nitrogen, and total phosphorus, in the waterbodies of Assam, India. The models were developed using water quality datasets collected through laboratory investigation of artificially simulated lake systems. Two artificial prototype lakes were eutrophied in a controlled environment through the gradual application of wastewater, and water quality was assessed periodically for model development. Four data-driven techniques were used: multilayer perceptron (MLP), time-delay neural network (TDNN), support vector regression (SVR), and Gaussian process regression (GPR). The accuracy of the trained models was evaluated using statistical parameters, and a reasonable correlation was observed between target and model-predicted values. Finally, the trained models were tested against some natural waterbodies in Assam, and satisfactory prediction accuracy was obtained. The TDNN and GPR models were found to be superior to the other methods. The results indicate the feasibility of the adopted modeling approach for predicting lake eutrophication when periodic water quality data are limited for the waterbody under consideration.

  • A novel approach is proposed for predicting eutrophication indicators.

  • Two prototype lakes were artificially eutrophied.

  • Data-driven modeling techniques were employed.

  • Developed models were used to predict natural water bodies.

  • Further studies will help in framing the policies.

For the last four decades, several modeling methodologies have been used to forecast and mitigate lake eutrophication with various levels of accomplishment (Jørgensen 2010; Bhagowati & Ahamad 2019). The application of both process-based and data-driven modeling approaches has been observed frequently in lake modeling (Vinçon-Leite & Casenave 2019; Rousso et al. 2020; Su et al. 2022). Out of different methods, data-driven modeling techniques, like the artificial neural network (ANN), have been increasingly utilized for ecological modeling in recent times (Che Nordin et al. 2021). Such machine learning-based supervised models can regenerate the non-linear relationship between the ecosystem parameters in a more efficient manner, unlike the multiple regression techniques, which generally consider the linear relationship among variables for model development (Kuo et al. 2007). Moreover, as the underlying relationships among ecological variables are commonly intricate and accompany extensive uncertainties, machine learning based models are being developed as useful tools for ecosystem management and restoration (Lek & Guegan 1999).

Aria et al. (2019) successfully used two types of ANN algorithms to model algal bloom in Iran's eutrophic Amirkabir reservoir. ANN methodology was used to predict Chl-a concentrations in a shallow eutrophic lake, Mikri Prepa in Greece, by Hadjisolomou et al. (2021). Machine learning methods in the form of ANN and support vector regression (SVR) were used by Jimeno-Sáez et al. (2020) to model the eutrophication indicator Chl-a in the eutrophic Mar Menor coastal lagoon in Spain. A feed forward neural network modeling approach was used successfully by Ubah et al. (2021) for predicting water quality indicators such as pH, electrical conductivity (EC), total dissolved solids (TDS), and sodium in Ele River Nnewi, Nigeria. Chen et al. (2010) used the backpropagation ANN approach successfully to estimate the concentrations of TN, TP, and DO in the Chagle River in China. ANN was used by Sarkar & Pandey (2015) to estimate the DO concentrations of River Yamuna in the Mathura region of Uttar Pradesh, India. Gazzaz et al. (2012) used a multilayer perceptron (MLP) network to estimate the water quality index (WQI) in Kinta River, Malaysia. For eutrophication management in shallow eutrophic Chaohu Lake in China, SVR was used to formulate predictive models for Chl-a, TN, and TP by Xu et al. (2015). The Gaussian process regression (GPR) method was used by Xu et al. (2015) for sulfate content prediction in the lakes of China. As such, ANN and other machine learning methods have been widely used all across the world to model eutrophication indicators, and some such selected works are presented in Table 1. Out of the different neural network topologies, a feed forward network with a backpropagation algorithm has been widely used in most of the works to predict common eutrophication indicators such as dissolved oxygen (DO), total nitrogen (TN), total phosphorus (TP), chlorophyll-a (chl-a), and Secchi depth (SD). 
Apart from neural networks, other data-driven modeling techniques such as SVR, the adaptive neuro-fuzzy inference system (ANFIS), and GPR are gaining attention in lake modeling. Nevertheless, collecting an exhaustive dataset for the successful training of a predictive data-driven model is a major concern. It is evident from Table 1 that for most of the earlier lake-specific eutrophication models, water quality parameters were gathered over long periods, typically 5–20 years. However, such long-term datasets may not be available for all waterbodies that require prompt mitigation and restoration measures, especially in underdeveloped and developing countries. Under such circumstances, lake management with a data-driven modeling approach becomes impractical.

Table 1

Application of data-driven modeling to predict lake eutrophication

| References | Location | Eutrophication indicators | Modeling approach | Data collection duration |
|---|---|---|---|---|
| Karul et al. (1999, 2000) | Keban Dam reservoir, Mogan Lake, Eymir Lake in Turkey | chl-a | ANN | 1991–1996 |
| Kuo et al. (2007) | Te-Chi Reservoir, Taiwan | DO, TP, SD, chl-a | ANN | 1983–1999 |
| Akkoyunlu & Akiner (2010) | Omerli Lake, Turkey | DO | ANN, MLR | 1990–2004 |
| Huo et al. (2013) | Lake Fuxian, China | DO, TN, SD, chl-a | ANN | 2003–2008 |
| Chen & Liu (2014) | Feitsui Reservoir, Taiwan | DO | ANN, ANFIS, MLR | 1993–2011 |
| Chen & Liu (2015) | Mingder Reservoir, Taiwan | DO, TP, SD, chl-a | ANN, ANFIS, MLR | 1993–2013 |
| Heddam (2016) | Saginaw Bay, Lake Huron, USA | SD | ANN, MLR | 1991–1996 |
| García Nieto et al. (2019) | Englishmen Lake, Spain | chl-a, TP | SVM, M5 tree, ANN | 2006–2014 |
| Aria et al. (2019) | Amirkabir Reservoir, Iran | Bacillariophyceae species | ANN | 2000–2012 |
| García-Nieto et al. (2020) | Tanes Reservoir, Spain | chl-a | GPR | 2006–2015 |

ANN, artificial neural network; ANFIS, adaptive neuro-fuzzy inference system; MLR, multiple linear regression; chl-a, chlorophyll-a; DO, dissolved oxygen; SVM, support vector machine; TP, total phosphorus; TN, total nitrogen; SD, Secchi depth; GPR, Gaussian process regression.

In India, surface waterbodies like lakes, rivers, etc., are getting increasingly polluted around different parts of the country due to human activities (Singh et al. 2019). Assam, the economic and cultural hub of North-Eastern India, is bestowed with a large number of lakes, ponds, and wetlands, but research works related to their trophic state, water quality, and management policies are meager. Increased anthropogenic activities have made most of the urban waterbodies susceptible to eutrophication. Periodic monitoring of water quality data in these waterbodies is also unavailable. Taking these aspects into consideration, the major aim of this research was to present eutrophication models for waterbodies in Assam, with the help of experimental water quality data monitored from two artificially eutrophied prototype lakes. Physical water quality indices DO and SD, and nutrients TN and TP, which are primarily responsible for the occurrence of eutrophication were chosen as target variables of the models. For predicting these eutrophication indicators, the popular data-driven technique ANN was used. Under ANN topology, the most commonly employed ANN architecture, i.e. MLP, is used. To compare the performance of the static MLP network with other ANN architectures, a dynamic time-delay neural network (TDNN) was also explored for model development. The neural network models were further compared with the non-parametric machine learning models based on SVR and GPR techniques, which are explored less in lake eutrophication modeling. Finally, the feasibility of the modeling approach to be used for eutrophication prediction in the waterbodies of Assam was checked by testing the developed models against natural waterbody data. 
The adopted investigations hold a major significance in the field of lake management through the data-driven modeling approach, by providing an alternate measure for the generation of datasets, where long-term monitoring data are not available for the considered waterbody.

Experimental set-up

Two concrete-bed tanks with rectangular cross sections were constructed at the Civil Engineering Department, Tezpur University, Assam, to replicate eutrophication scenarios in shallow lakes. The experimental methodology is similar to that of the earlier work by Bhagowati et al. (2022). The dimensions of the tanks were 2.61 m × 1.63 m × 0.73 m (Tank 1) and 2.5 m × 2.5 m × 0.85 m (Tank 2), with initial filled volumes of 2.637 and 4.375 m³, respectively. The prototype lakes were first filled with fresh water; subsequently, 5 L of waste water was applied to Tank 1 every fourth day and 8 L to Tank 2 every seventh day. The application of waste water and the sample collection times were kept uniform throughout the investigation. Water samples were collected from the tanks after each application of waste water, and different water quality parameters were then analyzed in the laboratory of the Civil Engineering Department, Tezpur University, following the protocols of Standard Methods (APHA 1995). The monitored physico-chemical parameters, which describe the quality of surface waters, were pH, EC, TDS, DO, biochemical oxygen demand (BOD), SD, turbidity, total nitrogen as nitrite and nitrate (TN), TP, and average water temperature (Temp). The target outputs DO and SD were measured at the tank site itself. DO was recorded with an electronic meter, and SD was measured using a 15 cm diameter metallic circular disc with alternate black and white strips. TN and TP of the water samples were measured in the laboratory with an ultraviolet (UV)-visible spectrophotometer. The laboratory examinations spanned around 8 and 9 months for Tank 1 and Tank 2, respectively, between March and December 2018, by which point total degradation of water quality had been observed. 
The change in the trophic status of the prototype lakes was mathematically verified by evaluating Carlson's trophic status index (TSI) (Carlson 1977). Pearson's correlation coefficient matrix analysis was performed to check the degree of dependency between any two water quality parameters (Rezaei et al. 2020). A controlled environment was kept during the study period, in both lakes studied, and the effect of heavy rainfall was also avoided using shades. Finally, the investigated dataset was used to develop machine learning-based models for eutrophication prediction, in terms of indicators such as DO, SD, TN, and TP. Apart from the artificial prototype lakes, water samples were gathered from a few natural water bodies in Assam, and water quality parameters were investigated, which were utilized for model validation. The locations of the study area and the natural water bodies are presented in Figure 1.
Figure 1

Location map showing the study area and natural water bodies.


Modeling methodology

For the prediction of eutrophication indicators such as DO, SD, TN, and TP, data-driven machine learning approaches in the form of MLP, TDNN, SVR, and GPR were used in this present work. In modeling methodology, the input parameters were initially finalized out of the different experimentally investigated water quality parameters for the prediction of each of the desired outputs using a parameter trimming method under neural network topology. The same inputs were then used for the prediction of output parameters under the considered machine learning algorithms. A simple multiple linear regression (MLR) model was also developed for each of the target variables initially, to compare the performance of the linear model with sophisticated machine learning regression methods. The MLR, MLP, TDNN, SVR, and GPR models presented in this work were developed in MATLAB (release 2017a) environment. Details of different modeling methodologies have been illustrated in the next sections.

Input selection

The choice of the best combination of input variables is one of the most crucial aspects of any predictive neural network model. However, in the vast majority of previously developed ecological machine learning models, little consideration has been given to this aspect. In most cases, the input parameters were picked using a model-free approach, for example, domain knowledge, or on an ad hoc basis, which can lead to too many or too few input parameters (Maier et al. 2010). A stepwise model-based parameter trimming method was therefore used in this study to obtain the best input combinations for the proposed DO, SD, TN, and TP models (Maier et al. 2010; Wu et al. 2014). Different combinations of the water quality variables from the studied prototype lakes were used to train several MLP networks for each target variable. The best input combinations selected by this procedure were subsequently used for the prediction of eutrophication indicators in the MLR, TDNN, SVR, and GPR models.
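As an illustration only, one common form of stepwise trimming (backward elimination) can be sketched in Python as follows. The function name `trim_inputs` and the `evaluate` callback are hypothetical; in the study each candidate combination was scored by training an MLP in MATLAB, and the exact trimming rule is not spelled out in the text, so the stopping criterion below is an assumption.

```python
from typing import Callable, Sequence, Tuple


def trim_inputs(
    candidates: Sequence[str],
    evaluate: Callable[[Tuple[str, ...]], float],
    min_inputs: int = 1,
) -> Tuple[Tuple[str, ...], float]:
    """Backward stepwise trimming (illustrative assumption, not the study's code).

    `evaluate` is a hypothetical callback that trains a model on the given
    input subset and returns its error (for example, the validation MSE).
    At each step the input whose removal gives the lowest error is dropped;
    trimming stops when no removal lowers the error.
    """
    current = tuple(candidates)
    best_err = evaluate(current)
    while len(current) > min_inputs:
        trials = []
        for name in current:
            subset = tuple(v for v in current if v != name)
            trials.append((evaluate(subset), subset))
        err, subset = min(trials, key=lambda t: t[0])
        if err >= best_err:
            break  # removing any further input does not help
        current, best_err = subset, err
    return current, best_err
```

In practice `evaluate` would wrap the network training and return the MSE (or a score derived from R) for the given subset of water quality variables.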

ANN models

An ANN is an information-processing system that performs a particular task in a manner analogous to the human brain (Haykin 2007). In the ANN paradigm, two or more layers of neurons linked by weights operate in parallel to accomplish a specific task. The schematic architecture of the neural network models used for DO, SD, TN, and TP predictions in eutrophic lakes is shown in Figure 2; these models have a single input layer, a single hidden layer, and a single output layer. The number of neurons in the input layer was determined using a model-based parameter trimming method, whereas the number in the hidden layer was decided on a trial-and-error basis. The output layer contained a single neuron for each of the target variables DO, SD, TN, and TP.
Figure 2

Topology of the developed neural network models for DO, SD, TN, and TP predictions.


In this study, two types of ANN models were used for target prediction in the eutrophic lakes of Assam: a static MLP and a dynamic TDNN. The basic difference between the two is that static models do not consider time in their predictions, whereas dynamic models also use earlier data when predicting future values (Aria et al. 2019). The TDNN used in this study was a non-linear autoregressive network with exogenous inputs (NARX), which is a recurrent dynamic network (Demuth et al. 2014). The Levenberg–Marquardt backpropagation training algorithm was used in the MLP and TDNN models because it converges rapidly in comparison with other methods (Demuth et al. 2014). A hyperbolic tangent sigmoid activation function (TANSIG) and a linear transfer function (PURELIN) were used in the hidden and output layers, respectively.
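The difference between the static and dynamic formulations lies mainly in how the input vector is assembled. The sketch below, in Python rather than the MATLAB toolbox used in the study, shows one way of building NARX-style training rows from a target series and exogenous inputs. The function name `narx_regressors` and the choice of using only past (not current) exogenous values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def narx_regressors(
    y: Sequence[float],
    x: Sequence[Sequence[float]],
    delay: int,
) -> Tuple[List[List[float]], List[float]]:
    """Build (features, targets) for a NARX-style model (illustrative sketch).

    For each time step t >= delay, the feature row holds the previous `delay`
    target values followed by the previous `delay` exogenous input vectors;
    the target is y[t]. A static MLP would instead use only x[t].
    """
    if len(y) != len(x):
        raise ValueError("y and x must have the same number of time steps")
    features: List[List[float]] = []
    targets: List[float] = []
    for t in range(delay, len(y)):
        row = list(y[t - delay:t])      # lagged outputs
        for past in x[t - delay:t]:     # lagged exogenous inputs
            row.extend(past)
        features.append(row)
        targets.append(y[t])
    return features, targets
```

With a delay of two sampling steps, for example, each DO prediction would be based on the two preceding DO values together with the two preceding sets of input parameters.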

Due to the unavailability of any fixed method to determine the ideal neuron number in the hidden layer, in the present work, a trial-and-error approach was employed (Che Nordin et al. 2021). For simplicity of the model, only one hidden layer was used and different MLP neural networks were developed by altering the neuron numbers from 4 to 40 in the hidden layer. All networks were examined based on their mean squared error (MSE) between the output and target values. To reduce the impact of variable dimensionality, a linear transformation was used to convert all data into a specified interval of (0.15, 0.85). The linear transformation equation used for data normalization is given in the following equation (Huo et al. 2013):
$$ y_i' = \mathrm{Min} + \frac{y_i - y_{\min}}{y_{\max} - y_{\min}}\,(\mathrm{Max} - \mathrm{Min}) \tag{1} $$
where $y_i'$ and $y_i$ are the normalized and observed values, respectively; $y_{\max}$ and $y_{\min}$ are the maximum and minimum values of $y_i$; and Max and Min are the upper and lower limits of the data normalization range (0.85 and 0.15 here). Of the data, 70% were used for model training, and 15% each were used for testing and validation.
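The linear transformation and the 70/15/15 partition can be sketched in a few lines of Python. This is a minimal illustration, not the MATLAB code used in the study; the function names and the rounding rule used to turn percentages into sample counts are assumptions.

```python
from typing import List, Sequence, Tuple


def normalize(values: Sequence[float], lower: float = 0.15, upper: float = 0.85) -> List[float]:
    """Linearly map values onto [lower, upper] using their own min and max."""
    y_min, y_max = min(values), max(values)
    if y_max == y_min:
        raise ValueError("cannot normalize a constant series")
    scale = (upper - lower) / (y_max - y_min)
    return [lower + (v - y_min) * scale for v in values]


def denormalize(scaled: Sequence[float], y_min: float, y_max: float,
                lower: float = 0.15, upper: float = 0.85) -> List[float]:
    """Invert `normalize` given the original minimum and maximum."""
    scale = (y_max - y_min) / (upper - lower)
    return [y_min + (s - lower) * scale for s in scaled]


def split_counts(n: int, train: float = 0.70, val: float = 0.15) -> Tuple[int, int, int]:
    """Return (train, validation, test) sample counts; the test set takes the remainder."""
    n_train = round(n * train)
    n_val = round(n * val)
    return n_train, n_val, n - n_train - n_val
```

Model outputs produced on the normalized scale would be passed through `denormalize` before being compared with observed concentrations.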

SVR models

Support vector machines (SVM), developed by Vapnik (1995), are supervised machine learning methods used mainly for classification and regression problems. The basic aim of the SVM algorithm is to locate the best hyperplane by mapping the original input space (xi) into a higher-dimensional feature space through a non-linear mapping function (ϕ(x)) (Naghibi et al. 2017). By introducing an ε-insensitive loss function, SVM can be used to solve regression problems, in which case it is commonly termed SVR (García Nieto et al. 2019). The ε-insensitive loss function for the target variable (y) and the predictor variable (x) is defined in the following equation:
$$ L_{\varepsilon}\bigl(y, f(x)\bigr) = \begin{cases} 0, & \lvert y - f(x) \rvert \le \varepsilon \\ \lvert y - f(x) \rvert - \varepsilon, & \text{otherwise} \end{cases} \tag{2} $$
The objective function f(x) in SVR can be expressed in the following equation:
$$ f(x) = w \cdot \phi(x) + b \tag{3} $$
where w and b are the vector coefficient and a constant, respectively. The values of these parameters are evaluated by minimizing a regularized risk function with two slack variables as given in Equation (4) and the associated boundary conditions given in Equation (5).
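$$ \min_{w,\,b,\,\xi,\,\xi^*} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \tag{4} $$
$$ \text{subject to} \quad \begin{cases} y_i - \bigl(w \cdot \phi(x_i) + b\bigr) \le \varepsilon + \xi_i \\ \bigl(w \cdot \phi(x_i) + b\bigr) - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \ldots, n \end{cases} \tag{5} $$
Here, ξi and ξi* are the slack variables introduced to handle infeasible constraints, C is a constant that sets the tradeoff between model flatness and training error, and i = 1, …, n indexes the n training samples. Finally, with the help of the kernel trick, the regression formula (Equation (3)) can be converted into non-linear SVR, as given by the dual formula in Equation (6), for the prediction of new values.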
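$$ f(x) = \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^{*} \right) K(x_i, x) + b \tag{6} $$
where αi and αi* are the Lagrange multipliers and K(xi, x) is the kernel function evaluated between training sample xi and the new input x.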
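To make the roles of the ε-insensitive loss and the dual prediction formula concrete, a minimal Python sketch is given below. It assumes the dual coefficients (αi − αi*), the support vectors, and the bias have already been obtained by a solver; the function names are hypothetical and the code is not the MATLAB implementation used in the study.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def eps_insensitive_loss(y: float, f_x: float, eps: float) -> float:
    """Zero inside the epsilon tube, linear in the residual outside it."""
    return max(0.0, abs(y - f_x) - eps)


def svr_predict(
    x: Vector,
    support_vectors: Sequence[Vector],
    dual_coefs: Sequence[float],
    bias: float,
    kernel: Callable[[Vector, Vector], float],
) -> float:
    """Evaluate f(x) = sum_i (alpha_i - alpha_i*) K(x_i, x) + b.

    `dual_coefs[i]` holds the difference (alpha_i - alpha_i*) for the i-th
    support vector; these values are assumed to come from a trained model.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias
```

Only samples whose residual falls on or outside the ε tube receive non-zero dual coefficients, which is why the sum runs over support vectors rather than over all training data.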

The kernel function performs the non-linear data transformation, and hence, the selection of an appropriate kernel is very important for satisfactory model performance (Pradhan 2013). In this study, four commonly used kernel functions, viz. linear, quadratic, cubic, and Gaussian kernels, were therefore compared under the SVR method for DO, SD, TN, and TP predictions. Based on the coefficient of determination (R²) and the root mean square error (RMSE), the best kernel was chosen for each model. In the SVR models, C and ε are very important parameters whose suitable values depend on the noise associated with the input data. Because their optimal values are not known in advance, a trial-and-error method was used to find the best combination of C and ε for DO, SD, TN, and TP predictions.
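For reference, the four kernel families can be written as short Python functions. The exact parameterization used by the software in the study is not given in the text, so the forms below (a polynomial kernel with unit offset and a Gaussian kernel with a single `scale` parameter) are assumptions chosen for illustration.

```python
import math
from functools import partial
from typing import Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def linear_kernel(a: Vector, b: Vector) -> float:
    """K(a, b) = a . b"""
    return _dot(a, b)


def polynomial_kernel(a: Vector, b: Vector, degree: int) -> float:
    """K(a, b) = (1 + a . b)^degree; degree 2 is quadratic, degree 3 is cubic."""
    return (1.0 + _dot(a, b)) ** degree


def gaussian_kernel(a: Vector, b: Vector, scale: float = 1.0) -> float:
    """K(a, b) = exp(-||a - b||^2 / scale^2); `scale` is an assumed parameter."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq_dist / scale ** 2)


# Candidate kernels to compare for each target variable.
KERNELS = {
    "linear": linear_kernel,
    "quadratic": partial(polynomial_kernel, degree=2),
    "cubic": partial(polynomial_kernel, degree=3),
    "gaussian": gaussian_kernel,
}
```

A trial-and-error search would loop over the entries of `KERNELS` and over candidate (C, ε) pairs, keeping the combination with the highest R² and the lowest RMSE.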

GPR models

GPR is a non-parametric machine learning approach that models sample data probabilistically and thereby accounts for uncertainty about the target variable (Sahoo et al. 2019; Li et al. 2020). The GPR technique is generally reported to be robust and precise compared with other regression methods (Singh et al. 2021). In the GPR algorithm, the target variable is assumed to follow a multivariate normal distribution. A mean function and a covariance (kernel) function specify the Gaussian process (Rasmussen 2004), and the posterior is obtained using Bayesian inference. For a target variable (y(k)), a Gaussian process can be expressed in terms of a random variable (x(k)) as given in the following equation:
$$ y(k) = f\bigl(x(k)\bigr) + \zeta \tag{7} $$
where ζ is Gaussian noise, assumed to be normally distributed with zero mean and variance σ². The function f(x) is driven by a Gaussian process on x and is specified by a kernel as given in the following equation:
$$ y = (y_1, \ldots, y_n) \sim \mathcal{N}\bigl(0,\; K + \sigma^{2} I\bigr) \tag{8} $$
where K is the kernel or covariance matrix and I is the identity matrix. K signifies that for similar inputs in kernel space, their output values will be similar. After fixing the Gaussian noise, Bayesian inference is applied, which minimizes the negative log-posterior (Equation (9)) to train the model under GPR (Deo & Samui 2017).
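$$ -\log p(\sigma^{2}, k \mid y) = \frac{1}{2}\, y^{T} \left( K + \sigma^{2} I \right)^{-1} y + \frac{1}{2} \log \left\lvert K + \sigma^{2} I \right\rvert - \log p(\sigma^{2}) - \log p(k) + \text{const} \tag{9} $$
where p(σ²) and p(k) denote the priors on the noise variance and on the kernel hyperparameters k, respectively.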

As a Gaussian process is fully defined by its second-order statistics, a zero-mean process is completely determined by its covariance function (García-Nieto et al. 2020). Hence, the choice of a proper covariance function becomes necessary. Several types of covariance functions are commonly used for GPR, such as squared exponential, Matérn 5/2, Matérn 3/2, exponential, rational quadratic, and polynomial (Rasmussen & Williams 2005). In the present work, the DO, SD, TN, and TP models were initially developed using four different covariance functions, namely squared exponential, rational quadratic, Matérn 5/2, and exponential. Thereafter, the best GPR model was selected for each target variable and compared with the corresponding ANN- and SVR-based models.
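The four covariance functions compared here have standard textbook forms (Rasmussen & Williams 2005) as functions of the distance r between two inputs. The Python sketch below writes them out with a length scale `length`, a signal standard deviation `sigma_f`, and, for the rational quadratic kernel, a shape parameter `alpha`. The parameter names and default values are assumptions for illustration; the hyperparameters fitted in the study are not given in the text.

```python
import math


def squared_exponential(r: float, length: float = 1.0, sigma_f: float = 1.0) -> float:
    """k(r) = sigma_f^2 * exp(-r^2 / (2 l^2))"""
    return sigma_f ** 2 * math.exp(-0.5 * (r / length) ** 2)


def exponential(r: float, length: float = 1.0, sigma_f: float = 1.0) -> float:
    """k(r) = sigma_f^2 * exp(-r / l)"""
    return sigma_f ** 2 * math.exp(-r / length)


def matern52(r: float, length: float = 1.0, sigma_f: float = 1.0) -> float:
    """k(r) = sigma_f^2 * (1 + sqrt(5) r / l + 5 r^2 / (3 l^2)) * exp(-sqrt(5) r / l)"""
    s = math.sqrt(5.0) * r / length
    return sigma_f ** 2 * (1.0 + s + s * s / 3.0) * math.exp(-s)


def rational_quadratic(r: float, length: float = 1.0, sigma_f: float = 1.0,
                       alpha: float = 1.0) -> float:
    """k(r) = sigma_f^2 * (1 + r^2 / (2 alpha l^2))^(-alpha)"""
    return sigma_f ** 2 * (1.0 + r * r / (2.0 * alpha * length ** 2)) ** (-alpha)
```

All four equal `sigma_f ** 2` at r = 0 and decay with distance; they differ in how quickly correlation falls off, which controls the smoothness of the fitted function.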

Model efficiency evaluation

The efficiency of the developed models was analyzed using statistical parameters, namely the coefficient of determination (R²) and the Nash–Sutcliffe efficiency (E) (Nash & Sutcliffe 1970), calculated using Equations (10) and (11), respectively. For both parameters, a value close to unity represents ideal prediction performance, whereas values close to 0 signify an unacceptable correlation. To check the applicability of model prediction, the error functions RMSE and mean absolute error (MAE) between the observed and predicted values were calculated using Equations (12) and (13), respectively.
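$$ R^{2} = \left[ \frac{\sum_{i=1}^{n} (O_i - O_{mean})(P_i - P_{mean})}{\sqrt{\sum_{i=1}^{n} (O_i - O_{mean})^{2} \, \sum_{i=1}^{n} (P_i - P_{mean})^{2}}} \right]^{2} \tag{10} $$
$$ E = 1 - \frac{\sum_{i=1}^{n} (O_i - P_i)^{2}}{\sum_{i=1}^{n} (O_i - O_{mean})^{2}} \tag{11} $$
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (O_i - P_i)^{2}} \tag{12} $$
$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert O_i - P_i \rvert \tag{13} $$
where Oi and Pi are the experimentally observed and model-predicted values, respectively; Omean and Pmean are the arithmetic means of the observed and predicted values, respectively; and n is the number of data samples.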
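The four evaluation statistics can be computed with a few lines of plain Python. In this sketch R² is taken as the squared Pearson correlation between observed and predicted values, which is an assumption about the definition used; the function names are illustrative.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def r_squared(obs: Sequence[float], pred: Sequence[float]) -> float:
    """Squared Pearson correlation between observed and predicted values."""
    o_mean, p_mean = _mean(obs), _mean(pred)
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
    var_o = sum((o - o_mean) ** 2 for o in obs)
    var_p = sum((p - p_mean) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


def nash_sutcliffe(obs: Sequence[float], pred: Sequence[float]) -> float:
    """E = 1 - SSE / sum of squared deviations of observations from their mean."""
    o_mean = _mean(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - o_mean) ** 2 for o in obs)
    return 1.0 - sse / sst


def rmse(obs: Sequence[float], pred: Sequence[float]) -> float:
    return math.sqrt(_mean([(o - p) ** 2 for o, p in zip(obs, pred)]))


def mae(obs: Sequence[float], pred: Sequence[float]) -> float:
    return _mean([abs(o - p) for o, p in zip(obs, pred)])
```

Unlike R², the Nash–Sutcliffe efficiency penalizes systematic bias: a model whose predictions are perfectly correlated with the observations but offset from them has R² = 1 and E < 1.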

Feasibility assessment of the developed models

Eutrophication models for the indicators DO, SD, TN, and TP were developed with the dataset of water quality parameters monitored in two artificially simulated prototype lakes. The feasibility of using the adopted modeling approach as a eutrophication predictor in natural waterbodies was evaluated by checking model performance on samples collected from a few natural shallow waterbodies in Assam. Because the models' predictions should remain reliable under different ecological conditions, samples were gathered from several types of waterbody and in different seasons to check the accuracy of the presented models for the waterbodies of Assam. Water samples were gathered from two sampling locations in Deepor Bil, a Ramsar wetland in Guwahati city, as well as from a marsh (two locations), a manmade lake, and a village pond in and around the Tezpur University campus in Tezpur city; details are presented in Table 2 and the locations are shown in Figure 1. All the previously mentioned water quality parameters were investigated on the samples collected from the waterbodies during different weather conditions in March, May, September, and December 2019.

Table 2

Sampling location details for model testing

| Waterbody type | Location | Latitude | Longitude |
|---|---|---|---|
| Deepor Bil (Ramsar wetland) | Near Pamohi, Guwahati, Assam, India | 26°6′47.72″ N | 91°39′35.76″ E |
| Deepor Bil (Ramsar wetland) | Near Pamohi, Guwahati, Assam, India | 26°6′46.32″ N | 91°39′13.51″ E |
| Marsh | Near Tezpur University, Tezpur, Assam, India | 26°41′13.97″ N | 92°48′58.78″ E |
| Marsh | Near Tezpur University, Tezpur, Assam, India | 26°41′16.37″ N | 92°48′58.00″ E |
| Village pond | Near Tezpur University, Tezpur, Assam, India | 26°41′30.56″ N | 92°49′15.22″ E |
| Artificial lake | Tezpur University, Tezpur, Assam, India | 26°42′3.78″ N | 92°49′52.08″ E |

Sensitivity analysis

Sensitivity analyses of the trained DO, SD, TN, and TP models were performed using the data perturbation method (Cao et al. 2016). To check the impact of changes in the input values on the predicted output, each input parameter was increased and decreased by 20%, one at a time, while the other inputs were held constant. The resulting change in output was calculated as a percentage and reported as the sensitivity value of that input parameter for target prediction (Maier et al. 1998). A parameter with a sensitivity value greater than 100% was identified as a sensitive parameter for the model.
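The one-at-a-time perturbation procedure can be sketched in Python as below. The `model` argument is a hypothetical callable standing in for any trained predictor, and the sketch reports the raw percentage change in output; whether the study additionally scaled this by the 20% input change before applying the 100% threshold is not stated in the text, so no threshold is applied here.

```python
from typing import Callable, Dict, List, Sequence


def perturbation_sensitivity(
    model: Callable[[Sequence[float]], float],
    baseline: Sequence[float],
    change: float = 0.20,
) -> List[Dict[str, float]]:
    """One-at-a-time perturbation sensitivity (illustrative sketch).

    Each input is scaled by (1 + change) and (1 - change) in turn while the
    other inputs stay at their baseline values. The result holds, per input,
    the percentage change in the model output for both perturbations.
    """
    base_out = model(baseline)
    if base_out == 0:
        raise ValueError("baseline output is zero; percentage change undefined")
    results: List[Dict[str, float]] = []
    for i in range(len(baseline)):
        row: Dict[str, float] = {}
        for label, factor in (("plus", 1.0 + change), ("minus", 1.0 - change)):
            perturbed = list(baseline)
            perturbed[i] *= factor
            row[label] = 100.0 * (model(perturbed) - base_out) / base_out
        results.append(row)
    return results
```

Here `baseline` would typically be a representative input vector, such as the mean of each selected water quality parameter.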

Experimental results

The eutrophication process was recreated effectively with the periodic application of waste water to the studied artificial lakes. As the major cause of contamination in most urban and rural surface water bodies in Assam is domestic and agricultural runoffs, waste water applied to the studied lakes had been gathered from similar sources throughout the study period. The average pH, EC, TDS, and turbidity values of the applied waste water were 7.54, 689.80 μS/cm, 344.71 ppm, and 178.65 nephelometric turbidity units (NTU), respectively. The nutrient concentration of the applied waste water was high with average TN and TP concentrations of 0.48 and 2.79 mg/L, respectively. From Figure 3, it can be observed that the water quality in both artificial lakes considerably declined in the last phase of experimentation, compared to the starting conditions. Over the study period, the effects of eutrophication were prominent in artificial lakes with increased algal growth, high turbidity, and hypoxia. The variation of the desired eutrophication indicator parameters DO, SD, TN, and TP in the studied artificial lakes is presented in Figure 4. It can be inferred from the figure that because of the continuous increase in nutrient concentration, a favorable condition for the occurrence of eutrophication was initiated in these lakes, and as a result, the DO and SD values decreased significantly. The statistical summary of the experimental investigation for different physio-chemical properties of water samples, collected during the study period from the prototype lakes, is given in Table 3. A total of 98 samples were analyzed from the prototype lakes for model training. The investigated dataset was used for the calculation of TSI of the studied artificial lakes, and thereafter, the data-driven models were trained and tested for DO, SD, TN, and TP predictions in the eutrophic lakes of Assam.
Table 3

Statistical summary of the investigated water quality variables

| Statistic | pH | EC (μS/cm) | TDS (ppm) | Turb (NTU) | TN (mg/L) | TP (mg/L) | Temp (°C) | BOD (mg/L) | DO (mg/L) | SD (cm) |
|---|---|---|---|---|---|---|---|---|---|---|
| Max | 10.32 | 837.00 | 561.50 | 68.90 | 5.77 | 9.62 | 36.00 | 51.80 | 7.13 | 74.00 |
| Min | 6.93 | 164.80 | 61.31 | 2.30 | 0.00 | 0.01 | 25.00 | 0.00 | 3.40 | 13.50 |
| Avg | 8.67 | 363.67 | 189.00 | 17.75 | 0.71 | 2.60 | 31.40 | 12.99 | 5.64 | 40.94 |
| St. Dev. | 0.90 | 155.44 | 100.60 | 15.07 | 1.03 | 1.95 | 2.51 | 10.31 | 0.97 | 15.29 |

Turb, turbidity; Max, maximum; Min, minimum; St Dev., standard deviation; Avg, average.

Figure 3

Water quality deterioration in the studied artificial lakes during the first and last months of experimentation.

Figure 4

Variation of DO, SD, TN, and TP with time in the studied artificial lakes.

To verify the occurrence of eutrophication in the examined prototype lakes quantitatively, the experimental dataset was used for trophic status assessment following the convention proposed by Carlson (1977). Carlson's trophic status index (TSI) was determined from the SD and TP values of the studied lakes using Equations (14) and (15), respectively. A lake is considered oligotrophic if the calculated TSI value is less than 40, mesotrophic for values between 40 and 50, eutrophic for values from 50 to 70, and hypereutrophic for values greater than 70 (Carlson 1977; Saghi et al. 2015).
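$$ \mathrm{TSI}_{SD} = 10 \left( 6 - \frac{\ln SD}{\ln 2} \right) \tag{14} $$
$$ \mathrm{TSI}_{TP} = 10 \left( 6 - \frac{\ln (48 / TP)}{\ln 2} \right) \tag{15} $$
where SD is expressed in metres and TP in μg/L (Carlson 1977).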
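As a worked illustration, Carlson's (1977) index and the classification thresholds quoted above can be evaluated with the short Python sketch below. The function names are hypothetical, and the handling of values that fall exactly on a class boundary is an assumption, since the text does not specify it.

```python
import math


def tsi_sd(sd_m: float) -> float:
    """Carlson TSI from Secchi depth in metres."""
    return 10.0 * (6.0 - math.log(sd_m) / math.log(2.0))


def tsi_tp(tp_ug_per_l: float) -> float:
    """Carlson TSI from total phosphorus in micrograms per litre."""
    return 10.0 * (6.0 - math.log(48.0 / tp_ug_per_l) / math.log(2.0))


def trophic_state(tsi: float) -> str:
    """Classify a TSI value; boundary handling is an assumption."""
    if tsi < 40.0:
        return "oligotrophic"
    if tsi < 50.0:
        return "mesotrophic"
    if tsi <= 70.0:
        return "eutrophic"
    return "hypereutrophic"
```

For the initial TP concentration of 0.01 mg/L (10 μg/L), `tsi_tp(10.0)` gives approximately 37.37, which falls in the oligotrophic class; for the minimum recorded Secchi depth of 13.5 cm, `tsi_sd(0.135)` gives approximately 88.89, in the hypereutrophic class.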

The maximum, minimum, and average values of TSI based on TP and SD values for the investigated artificial lakes are presented in Table 4. The concentration of TP during the initial period of investigation was quite low (0.01 mg/L) and water was transparent up to the full depth of the lakes. For this condition, the TSI value was assessed as less than 40 indicating that the studied lakes were in the oligotrophic state. With the gradual application of nutrients to artificial lakes, the water quality of the lakes deteriorated considerably with a higher TP concentration and lower SD values. In this condition, the calculated value of TSI was well in excess of 70, as shown in Table 4, for both TP and SD standards, inferring that lake water quality had changed to a hypereutrophic stage from a freshwater stage. As such, the lake eutrophication phenomenon was replicated in a controlled environment successfully, and thereafter, the investigated dataset was used for model development.

Table 4

Calculated TSI values of the studied artificial lakes

| TSI_TP Min. | TSI_TP Max. | TSI_TP Avg. | TSI_SD Min. | TSI_SD Max. | TSI_SD Avg. |
|---|---|---|---|---|---|
| 37.37 | 136.48 | 117.61 | < 40 | 88.90 | 72.89 |

To check the linear relationship between the investigated water quality parameters, a correlation coefficient matrix analysis was performed, and the results are presented in Table 5. Taking a correlation coefficient of 0.5 as the threshold value based on previous research (Mukaka 2012; Rehman et al. 2018), the linear dependency between most pairs of investigated parameters was poor. A strong relationship was observed only between EC and TDS, with a correlation coefficient of 0.94. This strong positive correlation may be due to the ions dissolved in water that conduct electricity. The target eutrophication indicators DO, SD, TN, and TP showed at most moderate correlation with the other parameters (the largest being 0.71 between DO and SD), and as such, non-linear machine learning algorithms in the form of MLP, TDNN, SVR, and GPR were used for model development.
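A correlation matrix of this kind can be computed with a small Python sketch such as the one below. The function names and the dictionary-of-columns layout are assumptions for illustration; the 0.5 threshold follows the text.

```python
import math
from typing import Dict, List, Sequence, Tuple


def pearson_r(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    a_mean = sum(a) / len(a)
    b_mean = sum(b) / len(b)
    cov = sum((p - a_mean) * (q - b_mean) for p, q in zip(a, b))
    var_a = sum((p - a_mean) ** 2 for p in a)
    var_b = sum((q - b_mean) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def pairs_above(columns: Dict[str, Sequence[float]],
                threshold: float = 0.5) -> List[Tuple[str, str, float]]:
    """List variable pairs whose absolute correlation exceeds the threshold."""
    names = list(columns)
    hits = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(columns[first], columns[second])
            if abs(r) > threshold:
                hits.append((first, second, r))
    return hits
```

Passing one list per water quality variable returns the pairs that exceed the chosen threshold in absolute value, together with the sign of each relationship.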

Table 5

Correlation coefficient matrix of water quality variables

| | pH | EC | TDS | Turb | TN | TP | Temp | BOD | DO | SD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pH | 1.00 | | | | | | | | | |
| EC | 0.14 | 1.00 | | | | | | | | |
| TDS | 0.06 | 0.94 | 1.00 | | | | | | | |
| Turb | 0.35 | 0.49 | 0.41 | 1.00 | | | | | | |
| TN | −0.41 | −0.26 | −0.24 | −0.25 | 1.00 | | | | | |
| TP | −0.50 | 0.30 | 0.33 | −0.01 | 0.22 | 1.00 | | | | |
| Temp | 0.09 | −0.09 | −0.11 | −0.03 | 0.06 | −0.16 | 1.00 | | | |
| BOD | −0.15 | 0.50 | 0.53 | 0.24 | 0.03 | 0.49 | −0.18 | 1.00 | | |
| DO | 0.21 | −0.58 | −0.54 | −0.49 | −0.02 | −0.47 | 0.21 | −0.44 | 1.00 | |
| SD | −0.01 | −0.52 | −0.48 | −0.59 | −0.17 | −0.39 | 0.04 | −0.43 | 0.71 | 1.00 |
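
The screening step behind a matrix like Table 5 can be sketched as follows: compute the Pearson coefficient for every pair of variables and flag the pairs whose absolute value exceeds the chosen threshold. The helper names and the sample readings are hypothetical and serve only to show the mechanics; they are not the study's data.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strong_pairs(data, threshold=0.5):
    """Return (name_a, name_b, r) for pairs with |r| above the threshold."""
    result = []
    for a, b in combinations(data, 2):
        r = pearson_r(data[a], data[b])
        if abs(r) > threshold:
            result.append((a, b, round(r, 2)))
    return result


# Hypothetical readings, for illustration only (not the study's data).
samples = {
    "EC": [210.0, 250.0, 300.0, 340.0, 410.0],
    "TDS": [135.0, 160.0, 190.0, 220.0, 262.0],
    "Temp": [24.0, 27.5, 23.0, 26.0, 25.0],
}
print(strong_pairs(samples))
```

In this toy dataset only the EC and TDS pair passes the threshold, while temperature is nearly uncorrelated with both.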

Modeling results

Input selection

For the prediction of DO, SD, TN, and TP, six random input combinations were considered for each target to determine the best parameter combination from the investigated water quality indices, as shown in Table 6. The coefficient of correlation (R) and the MSE were used as the assessment criteria for choosing the optimum combination. Each combination shows the effect of the omitted parameters on model training performance. Based on the highest R and the lowest MSE values, combinations 3, 6, 4, and 5 were the most suitable of the scenarios considered under MLP for the DO, SD, TN, and TP models, respectively. Accordingly, 7, 4, 5, and 6 input parameters were used for DO, SD, TN, and TP prediction, respectively.

Table 6

Selection of input variables for the investigated models

| Model | Sl. No. | Input variables | R | MSE |
| --- | --- | --- | --- | --- |
| DO | 1 | pH, EC, TDS, TN, TP, BOD, turbidity, SD, Temp | 0.897 | 0.0121 |
| DO | 2 | pH, EC, TDS, TN, TP, turbidity, Temp | 0.901 | 0.0101 |
| DO | 3 | pH, EC, BOD, TN, TP, turbidity, Temp | 0.936 | 0.0094 |
| DO | 4 | pH, EC, TDS, TN, TP, Temp | 0.788 | 0.0203 |
| DO | 5 | pH, EC, TN, TP, Temp | 0.804 | 0.0204 |
| DO | 6 | pH, EC, TN, TP | 0.755 | 0.0214 |
| SD | 1 | pH, EC, TDS, DO, TN, TP, turbidity, Temp | 0.883 | 0.0110 |
| SD | 2 | pH, EC, TDS, DO, TN, TP, Temp | 0.771 | 0.0175 |
| SD | 3 | pH, EC, turbidity, TN, TP, Temp | 0.845 | 0.0120 |
| SD | 4 | pH, EC, TN, TP, Temp | 0.783 | 0.0198 |
| SD | 5 | TN, TP, turbidity, Temp | 0.748 | 0.0118 |
| SD | 6 | pH, EC, turbidity, Temp | 0.921 | 0.0033 |
| TN | 1 | pH, EC, DO, TDS, TP, BOD, turbidity, SD, Temp | 0.856 | 0.0122 |
| TN | 2 | pH, EC, TDS, BOD, turbidity, SD, Temp | 0.825 | 0.0186 |
| TN | 3 | pH, EC, DO, BOD, turbidity, SD, Temp | 0.809 | 0.0204 |
| TN | 4 | pH, EC, TDS, turbidity, SD | 0.915 | 0.0105 |
| TN | 5 | pH, EC, DO, turbidity, SD | 0.875 | 0.0118 |
| TN | 6 | pH, EC, TDS, DO, SD | 0.847 | 0.0121 |
| TP | 1 | pH, EC, DO, TDS, TN, BOD, turbidity, SD, Temp | 0.817 | 0.0224 |
| TP | 2 | pH, EC, DO, TDS, TN, BOD, turbidity | 0.887 | 0.0181 |
| TP | 3 | pH, EC, DO, TDS, BOD, SD, turbidity | 0.846 | 0.0194 |
| TP | 4 | pH, EC, DO, TDS, TN, BOD, turbidity | 0.840 | 0.0183 |
| TP | 5 | pH, EC, DO, TDS, BOD, SD | 0.924 | 0.0112 |
| TP | 6 | pH, EC, DO, SD, Temp | 0.785 | 0.0235 |
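
The selection rule applied to Table 6 can be written down in a few lines. The sketch below uses the DO-model values from the table; the function name and the tie-breaking order (highest R first, then lowest MSE) are assumptions for illustration, since the text names both criteria without ranking them.

```python
# (R, MSE) for the six DO-model input combinations, copied from Table 6.
do_results = {
    1: (0.897, 0.0121),
    2: (0.901, 0.0101),
    3: (0.936, 0.0094),
    4: (0.788, 0.0203),
    5: (0.804, 0.0204),
    6: (0.755, 0.0214),
}


def best_combination(results):
    """Pick the combination with the highest R, breaking ties by lowest MSE."""
    return max(results, key=lambda k: (results[k][0], -results[k][1]))


best = best_combination(do_results)
print(best, do_results[best])
```

For all four targets in Table 6, the combination with the highest R also has the lowest MSE, so the tie-breaking order does not affect the outcome here.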

MLR model

An MLR model was trained initially for DO, SD, TN, and TP prediction, and the observed vs model-predicted values are presented in Figure 5. The training efficiency of the MLR model was generally poor for the prediction of SD, TN, and TP. For the DO model, slightly better results were observed, with R² and RMSE values of 0.94 and 0.22 mg/L. For the SD, TN, and TP MLR models, the R² values were 0.89, 0.65, and 0.90, and the RMSE values were 4.93 cm, 0.33 mg/L, and 0.21 mg/L, respectively. The relationships between ecological parameters are generally complex and not linear, as found in the present study (Table 5); hence, the MLR training was not satisfactory. The results are consistent with previous works (Akkoyunlu & Akiner 2010; Chen & Liu 2014, 2015), in which MLR models were not adequate for the prediction of all eutrophication indicators, which prompted the use of more sophisticated, non-linear machine learning tools.
Figure 5

Result of MLR model training for the prediction of (a) DO, (b) SD, (c) TN, and (d) TP, respectively.


ANN models

Two types of neural network models, MLP and TDNN, were employed in the present work for the prediction of eutrophication indicators using the experimental data collected from the artificial lakes. Using a trial-and-error approach in the MLP network, 20, 12, 12, and 10 hidden-layer neurons were determined as optimum for the DO, SD, TN, and TP models, respectively. Neural network structures (inputs-hidden neurons-output) of 7-20-1, 4-12-1, 5-12-1, and 6-10-1 were thus finalized as optimum in the MLP topology for the prediction of DO, SD, TN, and TP, respectively. Thereafter, keeping the same number of hidden-layer neurons and the same input parameters, TDNN models were developed for two-step-ahead prediction of the eutrophication indicators. Model training results revealed satisfactory performance of both the MLP and TDNN approaches. A strong correlation was observed between the observed and model-predicted values, and the results are presented in Figures 6–9 for the DO, SD, TN, and TP models, respectively. The goodness-of-fit parameters R² and E were 0.97 or higher for DO, SD, TN, and TP predictions in MLP and TDNN training, and the error estimation parameters were also reasonable, as presented in Table 7. Compared with previous works by Huo et al. (2013), Akkoyunlu & Akiner (2010), and Kuo et al. (2007), which predicted DO and SD in eutrophic lakes using MLP models, the present study produced a higher coefficient of determination (R²) and lower RMSE values during model training. Of the two neural network approaches, the TDNN models were slightly superior to the MLP models, with smaller RMSE and MAE values, and the results are consistent with the work carried out by Aria et al. (2019).
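
To make the idea of two-step-ahead prediction with delayed inputs concrete, the following sketch shows one way the training pairs for a time-delay network could be assembled from a time-ordered record. The function name and the number of delays are hypothetical; the delay length used in the study is not stated in this section.

```python
def make_delay_samples(inputs, target, delays=2, horizon=2):
    """Build (features, label) pairs for a time-delay network.

    inputs  : list of per-time-step input vectors (lists of floats)
    target  : list of target values, same length as inputs
    delays  : number of past steps fed in besides the current one (assumed)
    horizon : how many steps ahead the label lies (2 = two-step-ahead)
    """
    samples = []
    for t in range(delays, len(inputs) - horizon):
        # Concatenate the input vectors at t-delays, ..., t-1, t.
        window = []
        for step in range(t - delays, t + 1):
            window.extend(inputs[step])
        samples.append((window, target[t + horizon]))
    return samples


# Hypothetical record with two inputs per sampling date.
record_inputs = [[float(i), 10.0 * i] for i in range(6)]
record_target = [100.0 + i for i in range(6)]
for features, label in make_delay_samples(record_inputs, record_target):
    print(features, "->", label)
```

Each training pair combines the current and the two preceding sets of measurements with the target value observed two sampling dates later, so the usable sample count shrinks by the number of delays plus the horizon.
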
Table 7

Performance results of the trained models

| Model | Metric | MLP | TDNN | SVR | GPR |
| --- | --- | --- | --- | --- | --- |
| DO | R² | 0.97 | 0.98 | 0.95 | 0.96 |
| DO | E | 0.97 | 0.98 | 0.97 | 0.96 |
| DO | RMSE | 0.16 mg/L | 0.13 mg/L | 0.17 mg/L | 0.18 mg/L |
| DO | MAE | 0.09 mg/L | 0.06 mg/L | 0.13 mg/L | 0.14 mg/L |
| SD | R² | 0.98 | 0.98 | 0.93 | 0.98 |
| SD | E | 0.98 | 0.98 | 0.95 | 0.98 |
| SD | RMSE | 2.44 cm | 2.13 cm | 3.16 cm | 1.73 cm |
| SD | MAE | 1.98 cm | 1.42 cm | 2.22 cm | 1.33 cm |
| TN | R² | 0.97 | 0.98 | 0.91 | 0.98 |
| TN | E | 0.97 | 0.98 | 0.93 | 0.98 |
| TN | RMSE | 0.10 mg/L | 0.09 mg/L | 0.15 mg/L | 0.07 mg/L |
| TN | MAE | 0.07 mg/L | 0.06 mg/L | 0.10 mg/L | 0.04 mg/L |
| TP | R² | 0.98 | 0.99 | 0.96 | 0.99 |
| TP | E | 0.99 | 0.99 | 0.98 | 0.99 |
| TP | RMSE | 0.08 mg/L | 0.04 mg/L | 0.11 mg/L | 0.01 mg/L |
| TP | MAE | 0.05 mg/L | 0.02 mg/L | 0.09 mg/L | 0.001 mg/L |

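
The four statistics reported in Table 7 can be computed as in the sketch below. It assumes that R² is the squared Pearson correlation between observed and predicted values and that E is the Nash–Sutcliffe efficiency; both definitions are assumptions for this example, since the formulas are not given in this section.

```python
import math


def r_squared(obs, pred):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


def efficiency(obs, pred):
    """Assumed Nash-Sutcliffe form: 1 - squared error / observed variance."""
    mo = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / sum((o - mo) ** 2 for o in obs)


def rmse(obs, pred):
    """Root mean square error, in the units of the target variable."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error, in the units of the target variable."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


# Hypothetical values: predictions carry a constant bias of 0.5.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(r_squared(observed, predicted), efficiency(observed, predicted))
print(rmse(observed, predicted), mae(observed, predicted))
```

In this toy case the constant bias leaves R² at 1 but lowers E to 0.8, which is one reason to report both measures alongside RMSE and MAE.
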
Figure 6

Observed vs predicted plot of DO model training under (a) MLP, (b) TDNN, (c) SVR, and (d) GPR, respectively.

Figure 7

Observed vs predicted plot of SD model training under (a) MLP, (b) TDNN, (c) SVR, and (d) GPR, respectively.

Figure 8

Observed vs predicted plot of TN model training under (a) MLP, (b) TDNN, (c) SVR, and (d) GPR, respectively.

Figure 9

Observed vs predicted plot of TP model training under (a) MLP, (b) TDNN, (c) SVR, and (d) GPR, respectively.


SVR models

A five-fold cross-validation technique was employed throughout SVR model training, and different kernel functions were initially tried for each target variable. Using the default values of C and ε in the MATLAB environment, linear, quadratic, cubic, and Gaussian kernels were compared. Based on the highest R² and lowest RMSE values, the Gaussian kernel-based SVR models yielded better prediction accuracy than the other kernels, as found in some previous works (Xu et al. 2015; García Nieto et al. 2019). Furthermore, different values of C (from 0 to 100 in increments of 10 for the DO and SD models, and from 0 to 10 in increments of 0.1 for the TN and TP models) and ε (from 0 to 0.1 in increments of 0.01) were considered to find the best combination under the Gaussian kernel. The optimal (C, ε) combinations were (10, 0.01) for DO, (20, 0.02) for SD, (8, 0.01) for TN, and (6, 0.01) for TP. After finalizing the SVR structure, model training was carried out for each target output, and the results are shown in Table 7. Compared with the SD and TN models (Figures 7 and 8), better prediction accuracy was achieved for the DO and TP models (Figures 6 and 9) under SVR. RMSE and MAE values of 0.17 and 0.13 mg/L for the DO model, 3.16 and 2.22 cm for the SD model, 0.15 and 0.10 mg/L for the TN model, and 0.11 and 0.09 mg/L for the TP model were obtained under SVR during training. Overall, an acceptable correlation was observed, with R² and E values of 0.91 or higher for all models. Similar SVR models were used by García Nieto et al. (2019) to predict the TP concentration in eutrophic lakes, where an R² value of 0.90 was achieved. The presented SVR models for the eutrophication indicators DO, SD, TN, and TP therefore show good prediction accuracy.
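
The roles of the two tuned parameters can be illustrated with a short sketch. It enumerates the (C, ε) candidates described above for the DO and SD models and shows the ε-insensitive loss that SVR minimizes, in which errors smaller than ε are ignored. The helper names are hypothetical, and dropping C = 0 from the grid is an assumption based on the usual requirement that C be strictly positive.

```python
def value_range(start, stop, step, digits=2):
    """Inclusive range built from integer counts to avoid float drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, digits) for i in range(count + 1)]


def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Errors within +/- epsilon cost nothing; larger ones cost the excess."""
    return max(0.0, abs(observed - predicted) - epsilon)


# Search ranges described in the text (DO and SD models).
c_values = value_range(0, 100, 10)
eps_values = value_range(0.0, 0.1, 0.01)

# Assumption: C must be strictly positive for a solver to accept it,
# so the zero end point is dropped before training.
grid = [(c, e) for c in c_values if c > 0 for e in eps_values]
print(len(grid), "candidate (C, epsilon) pairs")
```

Each candidate pair would then be scored by five-fold cross-validation, keeping the pair with the lowest validation error. A larger C penalizes errors outside the ε band more heavily, while a larger ε widens the band of errors that are tolerated.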

GPR models

Under GPR, squared exponential, rational quadratic, Matérn 5/2, and exponential kernels were tried for modeling DO, SD, TN, and TP, using five-fold cross-validation and a constant basis function in MATLAB. Based on the R² and RMSE values, the squared exponential kernel produced the best training efficiency, as also reported in earlier research (García-Nieto et al. 2020). Table 7 shows the results of GPR model training using the squared exponential kernel, and Figures 6–9 illustrate the correlation achieved between actual and model-predicted values for the DO, SD, TN, and TP models, respectively. The GPR models showed high prediction accuracy during training: R² and E values of 0.99 were observed for the TP model, 0.98 for the SD and TN models, and 0.96 for the DO model. The error parameters RMSE and MAE were also quite low during GPR model training. As GPR models can handle uncertainty, such as that present in lake ecosystems, and can perform well on small datasets (García-Nieto et al. 2020), the performance of the presented models is promising. Moreover, as GPR has been explored much less in the literature for predicting lake eutrophication indicators, the presented DO, SD, TN, and TP models are of particular interest.
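
For reference, the four covariance functions compared above can be written in their standard textbook forms as functions of the distance r between two input points. The parameter names (signal standard deviation `sigma_f`, length scale `length`, and mixture parameter `alpha`) and their default values are assumptions for this sketch and are not fitted values from the study.

```python
import math


def squared_exponential(r, sigma_f=1.0, length=1.0):
    """Smoothest of the four: covariance decays as a Gaussian in r."""
    return sigma_f ** 2 * math.exp(-0.5 * (r / length) ** 2)


def rational_quadratic(r, sigma_f=1.0, length=1.0, alpha=1.0):
    """Mixture of squared exponentials over many length scales."""
    return sigma_f ** 2 * (1.0 + r ** 2 / (2.0 * alpha * length ** 2)) ** (-alpha)


def matern52(r, sigma_f=1.0, length=1.0):
    """Matern kernel with smoothness 5/2 (twice-differentiable functions)."""
    s = math.sqrt(5.0) * r / length
    return sigma_f ** 2 * (1.0 + s + s ** 2 / 3.0) * math.exp(-s)


def exponential(r, sigma_f=1.0, length=1.0):
    """Roughest of the four: covariance decays linearly in the exponent."""
    return sigma_f ** 2 * math.exp(-r / length)


for r in (0.0, 0.5, 1.0, 2.0):
    row = [k(r) for k in (squared_exponential, rational_quadratic, matern52, exponential)]
    print(r, [round(v, 3) for v in row])
```

All four kernels equal the signal variance at zero distance and decay as points move apart; they differ mainly in how smooth the resulting regression function is assumed to be.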

Comparing the training results of the four approaches for DO prediction, the lowest error values were achieved with the TDNN model, as evident from Figure 6, with RMSE and MAE values of 0.13 and 0.06 mg/L. For SD prediction, the highest training efficiency was obtained with GPR, followed by TDNN, as shown in Figure 7. Figures 8 and 9 show that, for the TN and TP models as well, the lowest errors were obtained with GPR, followed by TDNN. From the summary of training results in Table 7, it can be observed that the overall R² and E values of the TDNN, GPR, and MLP models were comparable and slightly superior to those of the SVR models.

Model testing with natural water body

To check the feasibility of using the trained models as a management tool for waterbodies in Assam, the optimum MLP-, TDNN-, SVR-, and GPR-based models were used to predict previously observed values of DO, SD, TN, and TP in the natural waterbodies listed in Table 2. These four model types significantly outperformed their MLR counterparts during the training phase; hence, the MLR models were not considered for further investigation. The correlations between the observed values of the target parameters in the natural waterbodies and the model-predicted values are presented in Figures 10–13 for the DO, SD, TN, and TP models, respectively. A total of 25 samples were used for model testing, and the statistical evaluation of the testing results is presented in Table 8. For the prediction of DO, SD, TN, and TP, the trained TDNN and GPR models showed better consistency between field data and predicted values, with R² values of 0.90 or higher for all models. RMSE and MAE values were also generally the smallest for the TDNN models, followed by the GPR models. Except for the SVR-based TP model, acceptable prediction accuracy was found for the MLP and SVR models. The agreement between observed and predicted values was somewhat lower for all models in the testing phase than in the training phase. This may be because the models were trained on data from artificially eutrophied lakes in a controlled environment, whose conditions do not fully hold in natural water bodies.
Table 8

Performance results of model testing against natural waterbodies

| Model | Metric | MLP | TDNN | SVR | GPR |
| --- | --- | --- | --- | --- | --- |
| DO | R² | 0.92 | 0.93 | 0.85 | 0.90 |
| DO | E | 0.91 | 0.93 | 0.85 | 0.89 |
| DO | RMSE | 0.39 mg/L | 0.35 mg/L | 0.52 mg/L | 0.42 mg/L |
| DO | MAE | 0.33 mg/L | 0.31 mg/L | 0.45 mg/L | 0.36 mg/L |
| SD | R² | 0.90 | 0.91 | 0.84 | 0.90 |
| SD | E | 0.84 | 0.90 | 0.83 | 0.89 |
| SD | RMSE | 4.69 cm | 3.73 cm | 4.85 cm | 3.88 cm |
| SD | MAE | 4.36 cm | 3.56 cm | 4.45 cm | 3.46 cm |
| TN | R² | 0.89 | 0.91 | 0.89 | 0.93 |
| TN | E | 0.87 | 0.90 | 0.83 | 0.87 |
| TN | RMSE | 0.06 mg/L | 0.05 mg/L | 0.07 mg/L | 0.06 mg/L |
| TN | MAE | 0.05 mg/L | 0.04 mg/L | 0.06 mg/L | 0.05 mg/L |
| TP | R² | 0.89 | 0.92 | 0.77 | 0.91 |
| TP | E | 0.87 | 0.92 | 0.76 | 0.90 |
| TP | RMSE | 0.06 mg/L | 0.05 mg/L | 0.11 mg/L | 0.06 mg/L |
| TP | MAE | 0.05 mg/L | 0.05 mg/L | 0.09 mg/L | 0.05 mg/L |

Figure 10

Prediction performance of (a) MLP, (b) TDNN, (c) SVR, and (d) GPR DO models, respectively, against a natural water body.

Figure 11

Prediction performance of (a) MLP, (b) TDNN, (c) SVR, and (d) GPR SD models, respectively, against a natural water body.

Figure 12

Prediction performance of (a) MLP, (b) TDNN, (c) SVR, and (d) GPR TN models, respectively, against a natural water body.

Figure 13

Prediction performance of (a) MLP, (b) TDNN, (c) SVR, and (d) GPR TP models, respectively, against a natural water body.


Sensitivity analysis

The data perturbation method was used to assess the impact of the input variables on the prediction of DO, SD, TN, and TP under the TDNN topology, and the results are presented in Table 9. For the DO model, a 20% increase in BOD and TN values produced sensitivity percentages higher than 100%; hence, these were identified as sensitive input variables for DO prediction. In contrast, the input variables used for SD prediction gave sensitivity percentages below 100% throughout, and no sensitive parameters were observed. Changes in pH and a decrease in SD had a major effect on the prediction of TN in eutrophic lakes. For TP prediction, pH and DO were the most sensitive input parameters.

Table 9

Result of sensitivity analysis with data perturbation

| Parameter | DO model (+20%) | DO model (−20%) | SD model (+20%) | SD model (−20%) | TN model (+20%) | TN model (−20%) | TP model (+20%) | TP model (−20%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| pH | 86.17 | 93.46 | 36.17 | 57.84 | 107.84 | 103.25 | 112.54 | 115.38 |
| EC | 86.45 | 89.74 | 64.21 | 58.24 | 81.47 | 74.35 | 92.57 | 83.86 |
| TDS | – | – | – | – | 69.38 | 76.17 | 80.34 | 88.93 |
| Turbidity | 95.74 | 89.93 | 59.36 | 85.69 | 88.55 | 82.28 | – | – |
| TN | 102.83 | 66.46 | – | – | – | – | – | – |
| TP | 84.69 | 95.29 | – | – | – | – | – | – |
| Temperature | 97.43 | 95.33 | 57.35 | 67.08 | – | – | – | – |
| BOD | 121.26 | 98.18 | – | – | – | – | 67.35 | 74.55 |
| DO | – | – | – | – | – | – | 105.29 | 109.45 |
| SD | – | – | – | – | 91.35 | 110.24 | 86.74 | 90.86 |
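
A data perturbation analysis of this kind can be sketched as follows. The sensitivity definition used here (the mean ratio of the relative change in output to the relative change in one input, expressed in percent) is an assumption for this example; the exact formula used in the study is not given in this section. The toy model is a hypothetical stand-in for a trained network.

```python
def sensitivity_percent(model, samples, index, change=0.20):
    """Mean ratio of relative output change to relative input change, in %.

    model   : callable mapping a list of inputs to one prediction
    samples : list of input vectors (baseline predictions must be non-zero)
    index   : position of the input variable to perturb
    change  : fractional perturbation, e.g. +0.20 or -0.20
    """
    ratios = []
    for x in samples:
        base = model(x)
        perturbed = list(x)
        perturbed[index] = x[index] * (1.0 + change)
        relative_output_change = (model(perturbed) - base) / base
        ratios.append(abs(relative_output_change / change))
    return 100.0 * sum(ratios) / len(ratios)


def toy_model(x):
    """Hypothetical stand-in for a trained network (not a fitted model)."""
    return x[0] ** 2 * x[1] ** 0.5


inputs = [[2.0, 4.0], [3.0, 9.0], [1.5, 1.0]]
for name, idx in (("first input", 0), ("second input", 1)):
    up = sensitivity_percent(toy_model, inputs, idx, change=0.20)
    down = sensitivity_percent(toy_model, inputs, idx, change=-0.20)
    print(name, round(up, 2), round(down, 2))
```

Under this definition, a value above 100% means that the prediction changes more than proportionally to the perturbed input, which matches the 100% criterion used above to label an input as sensitive.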

Model utility and limitations

The presented data-driven models have the advantage that, with a finite set of inputs, target prediction can be done in real time, in contrast to physical process-based models, which require substantial input conditions and are fairly complex. The present and future trophic status of water bodies under a given nutrient loading can be estimated rationally. Using such models, the influence of the input parameters on the predicted target values, and the factors contributing to the eutrophication of the waterbody concerned, can be easily identified. Data-driven models developed earlier relied on data collected over several years of continuous monitoring. As surface waterbodies in India are currently highly susceptible to cultural eutrophication and long-term water quality data are not readily available, the approach presented here can be applied to the management of lakes and surface waterbodies.

Lake eutrophication is a complex process dependent on several factors, such as morphometry, nutrient loading, thermal stratification, and climatic conditions. The presented work utilizes periodic monitoring data for the development of eutrophication models and can therefore accommodate frequent variations in water quality. Considering these aspects, the presented models are most useful for eutrophication management in shallow waterbodies under tropical monsoon climatic conditions with high humidity levels, similar to those of Assam, India.

In the present work, a novel approach was employed to check the effectiveness of artificially replicated lake systems in predicting eutrophication indicators. Two model tanks were used as prototype lakes, and the eutrophication process was initiated under controlled conditions by applying wastewater on a regular basis. A gradual rise in nutrient concentrations and a decrease in DO and SD values were observed, indicating the onset of eutrophication in the prototype lakes. TSI calculations were used to quantitatively confirm the deterioration in water quality in the lakes under investigation. Neural network based MLP and TDNN techniques, as well as the non-parametric SVR and GPR algorithms, were used to develop models of the eutrophication indicators DO, SD, TN, and TP from the water quality data collected on the prototype lakes during the study period. Following a model-based parameter trimming method under the MLP architecture, the optimum numbers of input variables for the DO, SD, TN, and TP models were found to be 7, 4, 5, and 6, respectively. Thereafter, using the same inputs in each learning algorithm, predictive models were trained for the desired eutrophication indicators.

MLP-, TDNN-, SVR-, and GPR-based non-linear models were found to predict more accurately than the conventional MLR model. The trained MLP, TDNN, SVR, and GPR models were tested against previously observed data from a few natural water bodies. These models were able to predict the eutrophication indicators DO, SD, TN, and TP in the considered natural water bodies of Assam with an acceptable level of accuracy. Based on the R², E, RMSE, and MAE values, the TDNN- and GPR-based models were generally superior to the MLP and SVR models for the prediction of DO, SD, TN, and TP during both the training and testing phases. R² values of at least 0.96 and 0.90 were achieved for all TDNN and GPR models during the training and testing phases, respectively. BOD and TN were the most sensitive inputs for DO prediction, whereas none of the input parameters of the SD model was found to be sensitive. Similarly, pH and SD for the TN model, and pH and DO for the TP model, were the most sensitive input parameters. The results obtained in this study demonstrate the effectiveness of the adopted experimental set-up and modeling approach for the management of surface waterbodies in situations where long-term monitoring data are not readily available. Detailed investigations could be carried out in future, considering more water quality variables, including biological parameters, under different nutrient loading and climatic conditions. The use of other machine learning techniques, such as ANFIS, the random forest algorithm, and neural networks optimized with a genetic algorithm (GA-ANN), could also be explored.

The authors gratefully acknowledge the financial support received from the Department of Science & Technology (DST) – Science and Engineering Research Board (SERB), New Delhi, India under project file no. ECR/2017/000740.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Akkoyunlu A. & Akiner E. M. 2010 Feasibility assessment of data-driven models in predicting pollution trends of Omerli Lake, Turkey. Water Resour. Manage. 24, 3419–3436. https://doi.org/10.1007/s11269-010-9613-0.
APHA 1995 Standard Methods for the Examination of Water and Waste Water, 19th edn. American Public Health Association, Washington, DC.
Aria H. S., Asadollahfardi G. & Heidarzadeh N. 2019 Eutrophication modelling of Amirkabir Reservoir (Iran) using an artificial neural network approach. Lakes Reservoirs 24, 48–58. https://doi.org/10.1111/lre.12254.
Bhagowati B. & Ahamad K. U. 2019 A review on lake eutrophication dynamics and recent developments in lake modeling. Ecohydrol. Hydrobiol. 19, 155–166. https://doi.org/10.1016/j.ecohyd.2018.03.002.
Bhagowati B., Talukdar B., Narzary B. K. & Ahamad K. U. 2022 Prediction of lake eutrophication using ANN and ANFIS by artificial simulation of lake ecosystem. Model. Earth Syst. Environ. 8, 5289–5304. https://doi.org/10.1007/s40808-022-01377-8.
Cao M., Alkayem F. N., Pan L. & Novak D. 2016 Advanced methods in neural networks-based sensitivity analysis with their applications in civil engineering. In: Artificial Neural Networks-Models and Applications (Rosa J. L. G., ed.). IntechOpen, London, pp. 335–353.
Carlson R. E. 1977 A trophic state index for lakes. Limnol. Oceanogr. 22, 361–369.
Chen W. & Liu W. 2014 Artificial neural network modeling of dissolved oxygen in reservoir. Environ. Monit. Assess. 186, 1203–1217. https://doi.org/10.1007/s10661-013-3450-6.
Che Nordin N. F., Mohd N. S., Koting S., Ismail Z., Sherif M. & El-Shafie A. 2021 Groundwater quality forecasting modelling using artificial intelligence: A review. Groundw. Sustain. Dev. 14, 100643. https://doi.org/10.1016/j.gsd.2021.100643.
Demuth H. B., Beale M. H., Jesús O. D. & Hagan M. T. 2014 Neural Network Design. Martin Hagan, Stillwater.
García Nieto P. J., García-Gonzalo E., Alonso Fernández J. R. & Díaz Muñiz C. 2019 Water eutrophication assessment relied on various machine learning techniques: A case study in the Englishmen Lake (Northern Spain). Ecol. Modell. 404, 91–102. https://doi.org/10.1016/j.ecolmodel.2019.03.009.
García-Nieto P. J., García-Gonzalo E., Alonso Fernández J. R. & Díaz Muñiz C. 2020 A new predictive model for evaluating chlorophyll-a concentration in Tanes reservoir by using a Gaussian process regression. Water Resour. Manage. 34, 4921–4941. https://doi.org/10.1007/s11269-020-02699-x.
Gazzaz N. M., Yusoff M. K., Aris A. Z., Juahir H. & Ramli M. F. 2012 Artificial neural network modeling of the water quality index for Kinta River (Malaysia) using water quality variables as predictors. Mar. Pollut. Bull. 64 (11), 2409–2420. https://doi.org/10.1016/j.marpolbul.2012.08.005.
Hadjisolomou E., Stefanidis K., Herodotou H., Michaelides M., Papatheodorou G. & Papastergiadou E. 2021 Modelling freshwater eutrophication with limited limnological data using artificial neural networks. Water 13 (11), 1590. https://doi.org/10.3390/w13111590.
Haykin S. 2007 Neural Networks – A Comprehensive Foundation, 3rd edn. Prentice-Hall, Inc, Hoboken.
Huo S., He Z., Su J., Xi B. & Zhu C. 2013 Using artificial neural network models for eutrophication prediction. Procedia Environ. Sci. 18, 310–316. https://doi.org/10.1016/j.proenv.2013.04.040.
Jimeno-Sáez P., Senent-Aparicio J., Cecilia J. M. & Pérez-Sánchez J. 2020 Using machine-learning algorithms for eutrophication modeling: Case study of Mar Menor Lagoon (Spain). Int. J. Environ. Res. Public Health 17 (4). https://doi.org/10.3390/ijerph17041189.
Jørgensen S. E. 2010 A review of recent developments in lake modelling. Ecol. Modell. 221, 689–692. https://doi.org/10.1016/j.ecolmodel.2009.10.022.
Karul C., Soyupak S. & Yuteri C. 1999 Neural network models as a management tool in lakes. Hydrobiologia 408/409, 139–144.
Karul C., Soyupak S., Cilesiz F. A., Akbay N. & Germen E. 2000 Case studies on the use of neural networks in eutrophication modeling. Ecol. Modell. 134, 145–152.
Kuo J., Hsieh M., Lung W. & She N. 2007 Using artificial neural network for reservoir eutrophication prediction. Ecol. Modell. 200, 171–177. https://doi.org/10.1016/j.ecolmodel.2006.06.018.
Li X., Yuan C., Li X. & Wang Z. 2020 State of health estimation for Li-Ion battery using incremental capacity analysis and Gaussian process regression. Energy 190, 116467. https://doi.org/10.1016/j.energy.2019.116467.
Mukaka M. M. 2012 Statistics corner: A guide to appropriate use of correlation coefficient in medical research. Malawi Med. J. 24 (3), 69–71.
Naghibi S. A., Ahmadi K. & Daneshi A. 2017 Application of support vector machine, random forest, and genetic algorithm optimized random forest models in groundwater potential mapping. Water Resour. Manage. 31, 2761–2775. https://doi.org/10.1007/s11269-017-1660-3.
Rasmussen C. E. 2004 Gaussian processes in machine learning. In: Advanced Lectures on Machine Learning: ML Summer Schools 2003, Canberra, Australia, February 2–14, 2003, Tübingen, Germany, August 4–16, 2003, Revised Lectures (Bousquet O., von Luxburg U. & Rätsch G., eds). Springer, Berlin, Heidelberg, pp. 63–71.
Rasmussen C. E. & Williams C. K. I. 2005 Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA.
Rehman F., Cheema T., Lisa M., Azeem T., Rehman S. U., Naseem A. A. & Khan Z. 2018 Statistical analysis tools for the assessment of groundwater chemical variations in Wadi Bani Malik area, Saudi Arabia. Global NEST Int. J. 20, 355–362. doi:10.30955/gnj.002237.
Rezaei A., Hassani H., Tziritis E., Mousavi S. B. F. & Jabbari N. 2020 Hydrochemical characterization and evaluation of groundwater quality in Dalgan basin, SE Iran. Groundwater Sustainable Dev. 10, 100353. https://doi.org/10.1016/j.gsd.2020.100353.
Rousso B. Z., Bertone E., Stewart R. & Hamilton D. P. 2020 A systematic literature review of forecasting and predictive models for cyanobacteria blooms in freshwater lakes. Water Res. 182, 115959. https://doi.org/10.1016/j.watres.2020.115959.
Saghi H., Karimi A. & Javid A. H. 2015 Investigation on trophic state index by artificial neural networks (case study: Dez Dam of Iran). Appl. Water Sci. 5, 127–136. https://doi.org/10.1007/s13201-014-0161-2.
Sahoo B. B., Jha R., Singh A. & Kumar D. 2019 Application of support vector regression for modeling low flow time series. KSCE J. Civ. Eng. 23, 923–934. https://doi.org/10.1007/s12205-018-0128-1.
Sarkar A. & Pandey P. 2015 River water quality modelling using artificial neural network technique. Aquat. Procedia 4, 1070–1077. https://doi.org/10.1016/j.aqpro.2015.02.135.
Singh A. P., Dhadse K. & Ahalawat J. 2019 Managing water quality of a river using an integrated geographically weighted regression technique with fuzzy decision-making model. Environ. Monit. Assess. 191, 378. https://doi.org/10.1007/s10661-019-7487-z.
Singh A., Nagar J., Sharma S. & Kotiyal V. 2021 A Gaussian process regression approach to predict the k-barrier coverage probability for intrusion detection in wireless sensor networks. Expert Syst. Appl. 172, 114603. https://doi.org/10.1016/j.eswa.2021.114603.
Su Y., Hu M., Wang Y., Zhang H., He C., Wang Y., Wang D., Wu X., Zhuang Y., Hong S. & Trolle D. 2022 Identifying key drivers of harmful algal blooms in a tributary of the three gorges reservoir between different seasons: Causality based on data-driven methods. Environ. Pollut. 297, 118759. https://doi.org/10.1016/j.envpol.2021.118759.
Ubah J. I., Orakwe L. C., Ogbu K. N., Awu J. I., Ahaneku I. E. & Chukwuma E. C. 2021 Forecasting water quality parameters using artificial neural network for irrigation purposes. Scientific Reports 11 (1), 24438. https://doi.org/10.1038/s41598-021-04062-5.
Vapnik V. N. 1995 The Nature of Statistical Learning Theory, 2nd edn. Springer, New York, NY.
Vinçon-Leite B. & Casenave C. 2019 Modelling eutrophication in lake ecosystems: A review. Sci. Total Environ. 651, 2985–3001. https://doi.org/10.1016/j.scitotenv.2018.09.320.
Xu Y., Ma C., Liu Q., Xi B., Qian G., Zhang D. & Huo S. 2015 Method to predict key factors affecting lake eutrophication – A new approach based on support vector regression model. Int. Biodeterior. Biodegrad. 102, 308–315. https://doi.org/10.1016/j.ibiod.2015.02.013.
Zhao J., Guo H., Han M., Tang H. & Li X. 2019 Gaussian process regression for prediction of sulfate content in lakes of China. J. Eng. Technol. Sci. 51 (2), 198–215. https://doi.org/10.5614/J.ENG.TECHNOL.SCI.2019.51.2.4.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).