Abstract
Data-driven models for predicting lake eutrophication rely on water quality datasets collected over long durations. Where such data are not readily available, lake management through data-driven modeling becomes impractical. A novel approach is therefore presented here for predicting eutrophication indicators, namely dissolved oxygen, Secchi depth, total nitrogen, and total phosphorus, in the waterbodies of Assam, India. The models were developed using water quality datasets obtained through laboratory investigation of artificially simulated lake systems. Two artificial prototype lakes were eutrophied in a controlled environment through the gradual application of wastewater, and water quality was assessed periodically for model development. Four data-driven techniques were used: multilayer perceptron (MLP), time-delay neural network (TDNN), support vector regression (SVR), and Gaussian process regression (GPR). The accuracy of the trained models was evaluated using statistical parameters, and a reasonable correlation was observed between target and model-predicted values. Finally, the trained models were tested against natural waterbodies in Assam, and satisfactory prediction accuracy was obtained. The TDNN and GPR models were found to be superior to the other methods. The results indicate that the adopted modeling approach is feasible for predicting lake eutrophication when periodic water quality data are limited for the waterbody under consideration.
HIGHLIGHTS
A novel approach is proposed for predicting eutrophication indicators.
Two prototype lakes were artificially eutrophied.
Data-driven modeling techniques were employed.
The developed models were used to predict eutrophication indicators in natural waterbodies.
Further studies will help in framing lake management policies.
INTRODUCTION
Over the last four decades, several modeling methodologies have been used to forecast and mitigate lake eutrophication, with varying levels of success (Jørgensen 2010; Bhagowati & Ahamad 2019). Both process-based and data-driven modeling approaches have been applied frequently in lake modeling (Vinçon-Leite & Casenave 2019; Rousso et al. 2020; Su et al. 2022). Among these, data-driven techniques such as the artificial neural network (ANN) have been increasingly used for ecological modeling in recent times (Che Nordin et al. 2021). Such supervised machine learning models can reproduce the non-linear relationships between ecosystem parameters more efficiently than multiple regression techniques, which generally assume a linear relationship among variables (Kuo et al. 2007). Moreover, because the underlying relationships among ecological variables are commonly intricate and carry considerable uncertainty, machine learning based models are being developed as useful tools for ecosystem management and restoration (Lek & Guegan 1999).
Aria et al. (2019) successfully used two types of ANN algorithms to model algal bloom in Iran's eutrophic Amirkabir reservoir. Hadjisolomou et al. (2021) used ANN methodology to predict chlorophyll-a (chl-a) concentrations in Mikri Prespa, a shallow eutrophic lake in Greece. Jimeno-Sáez et al. (2020) used ANN and support vector regression (SVR) to model chl-a in the eutrophic Mar Menor coastal lagoon in Spain. Ubah et al. (2021) successfully used a feed-forward neural network to predict water quality indicators such as pH, electrical conductivity (EC), total dissolved solids (TDS), and sodium in Ele River, Nnewi, Nigeria. Chen et al. (2010) successfully used the backpropagation ANN approach to estimate the concentrations of total nitrogen (TN), total phosphorus (TP), and dissolved oxygen (DO) in the Changle River in China. Sarkar & Pandey (2015) used ANN to estimate DO concentrations in the River Yamuna in the Mathura region of Uttar Pradesh, India. Gazzaz et al. (2012) used a multilayer perceptron (MLP) network to estimate the water quality index (WQI) in Kinta River, Malaysia. For eutrophication management in the shallow eutrophic Chaohu Lake in China, Xu et al. (2015) used SVR to formulate predictive models for chl-a, TN, and TP. The Gaussian process regression (GPR) method was used by Xu et al. (2015) for sulfate content prediction in the lakes of China. ANN and other machine learning methods have thus been widely used across the world to model eutrophication indicators, and selected works are presented in Table 1. Among the different neural network topologies, a feed-forward network with a backpropagation algorithm has been used in most works to predict common eutrophication indicators such as DO, TN, TP, chl-a, and Secchi depth (SD).
Apart from neural networks, other data-driven modeling techniques such as SVR, the adaptive neuro-fuzzy inference system (ANFIS), and GPR are gaining attention in lake modeling. Nevertheless, collecting a dataset large enough to train a predictive data-driven model successfully is a major concern. It is evident from Table 1 that, for most of the earlier lake-specific eutrophication models, water quality parameters were collected over long periods, typically 5–20 years. However, such long-term datasets may not be available for all waterbodies that require prompt mitigation and restoration measures, especially in underdeveloped and developing countries. Under such circumstances, lake management with a data-driven modeling approach becomes impractical.
References | Location | Eutrophication indicators | Modeling approach | Data collection duration |
---|---|---|---|---|
Karul et al. (1999, 2000) | Keban Dam reservoir, Mogan Lake, Eymir Lake in Turkey | chl-a | ANN | 1991–1996 |
Kuo et al. (2007) | Te-Chi Reservoir, Taiwan | DO, TP, SD, chl-a | ANN | 1983–1999 |
Akkoyunlu & Akiner (2010) | Omerli Lake, Turkey | DO | ANN, MLR | 1990–2004 |
Huo et al. (2013) | Lake Fuxian, China | DO, TN, SD, chl-a | ANN | 2003–2008 |
Chen & Liu (2014) | Feitsui Reservoir, Taiwan | DO | ANN, ANFIS, MLR | 1993–2011 |
Chen & Liu (2015) | Mingder Reservoir, Taiwan | DO, TP, SD, chl-a | ANN, ANFIS, MLR | 1993–2013 |
Heddam (2016) | Saginaw Bay, Lake Huron, USA | SD | ANN, MLR | 1991–1996 |
García Nieto et al. (2019) | Englishmen Lake, Spain | Chl-a, TP | SVM, M5 tree, ANN | 2006–2014 |
Aria et al. (2019) | Amirkabir Reservoir, Iran | Bacillariophyceae species | ANN | 2000–2012 |
García-Nieto et al. (2020) | Tanes Reservoir, Spain | Chl-a | GPR | 2006–2015 |
ANN, artificial neural network; ANFIS, adaptive neuro-fuzzy inference system; MLR, multiple linear regression; chl-a, chlorophyll-a; DO, dissolved oxygen; SVM, support vector machine; TP, total phosphorus; TN, total nitrogen; SD, Secchi depth; GPR, Gaussian process regression.
In India, surface waterbodies such as lakes and rivers are becoming increasingly polluted in different parts of the country due to human activities (Singh et al. 2019). Assam, the economic and cultural hub of North-Eastern India, has a large number of lakes, ponds, and wetlands, but research on their trophic state, water quality, and management policies is meager. Increased anthropogenic activity has made most of the urban waterbodies susceptible to eutrophication, and periodic water quality monitoring data for these waterbodies are unavailable. Taking these aspects into consideration, the major aim of this research was to present eutrophication models for waterbodies in Assam using experimental water quality data monitored in two artificially eutrophied prototype lakes. The physical water quality indices DO and SD and the nutrients TN and TP, which are primarily responsible for the occurrence of eutrophication, were chosen as the target variables of the models. To predict these eutrophication indicators, the popular data-driven technique ANN was used, in the form of its most commonly employed architecture, the MLP. To compare the performance of the static MLP network with other ANN architectures, a dynamic time-delay neural network (TDNN) was also explored for model development. The neural network models were further compared with non-parametric machine learning models based on the SVR and GPR techniques, which have been explored less in lake eutrophication modeling. Finally, the feasibility of using the modeling approach for eutrophication prediction in the waterbodies of Assam was checked by testing the developed models against natural waterbody data.
The investigation is significant for lake management through data-driven modeling because it provides an alternative way of generating datasets where long-term monitoring data are not available for the waterbody under consideration.
METHODOLOGY
Experimental set-up
Modeling methodology
For the prediction of the eutrophication indicators DO, SD, TN, and TP, data-driven machine learning approaches in the form of MLP, TDNN, SVR, and GPR were used in the present work. The input parameters for each desired output were first selected from the experimentally investigated water quality parameters using a parameter trimming method under the neural network topology. The same inputs were then used to predict the output parameters with the other machine learning algorithms. A simple multiple linear regression (MLR) model was also developed for each target variable at the outset, to compare the performance of a linear model with that of the more sophisticated machine learning regression methods. The MLR, MLP, TDNN, SVR, and GPR models presented in this work were developed in the MATLAB (release 2017a) environment. Details of the different modeling methodologies are described in the following sections.
Input selection
Choosing the best combination of input variables is one of the most crucial aspects of building a predictive neural network model. However, in the vast majority of previously developed ecological machine learning models, little consideration has been given to this aspect. In most cases, the input parameters were selected using a model-free methodology, for example domain knowledge, or on an ad-hoc basis, which can lead to too many or too few input parameters (Maier et al. 2010). A stepwise model-based parameter trimming method was therefore used in this study to obtain the best input combinations for the proposed DO, SD, TN, and TP models (Maier et al. 2010; Wu et al. 2014). Different combinations of the water quality variables from the studied prototype lakes were used to train several MLP networks for each target variable. The best input combinations selected in this procedure were subsequently used for the prediction of eutrophication indicators in the MLR, TDNN, SVR, and GPR models.
ANN models
In this study, two types of ANN models were used for target prediction in the eutrophic lakes of Assam. The first is a static MLP and the second is a dynamic TDNN. The basic difference between the two is that static models do not take time into account, whereas dynamic models also use earlier data when predicting future values (Aria et al. 2019). The TDNN used in this study was a non-linear autoregressive network with exogenous inputs (NARX), which is a recurrent dynamic network (Demuth et al. 2014). The Levenberg–Marquardt backpropagation training algorithm was used in the MLP and TDNN models because of its rapid convergence in comparison with other methods (Demuth et al. 2014). A hyperbolic tangent sigmoid transfer function (TANSIG) and a linear transfer function (PURELIN) were used in the hidden and output layers, respectively.
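The distinction between static and dynamic inputs can be illustrated with a minimal Python sketch. It is not the MATLAB NARX implementation used in the study: the function name `make_lagged_samples`, the toy readings, and the delay of two time steps are all assumptions chosen for illustration. A static model would see only the inputs of one sampling date, whereas each row built here also carries the inputs and targets of the preceding dates.

```python
def make_lagged_samples(inputs, targets, delay=2):
    """Build NARX-style training rows from a time-ordered series.

    inputs  : list of input vectors, one per sampling date
    targets : list of target values, one per sampling date
    delay   : number of past steps fed to the model (assumed value)
    """
    rows, outputs = [], []
    for t in range(delay, len(targets)):
        # Flatten the exogenous inputs of the previous `delay` dates.
        past_inputs = [v for step in inputs[t - delay:t] for v in step]
        # Append the previous target values (the autoregressive part).
        past_targets = list(targets[t - delay:t])
        rows.append(past_inputs + past_targets)
        outputs.append(targets[t])
    return rows, outputs


# Toy series (made-up values): two inputs, e.g. pH and EC, and one target, e.g. DO.
x = [[8.1, 200.0], [8.3, 240.0], [8.6, 300.0], [8.9, 360.0]]
y = [6.9, 6.5, 6.0, 5.4]
rows, outputs = make_lagged_samples(x, y, delay=2)
```

With four sampling dates and a delay of two, only two training rows remain, which shows one practical cost of dynamic models on short monitoring records.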
SVR models
The kernel function is used for non-linear data transformation, and hence the selection of an appropriate kernel is very important for satisfactory model performance (Pradhan 2013). In this study, four commonly used kernel functions, viz. linear, quadratic, cubic, and Gaussian, were compared under the SVR method for DO, SD, TN, and TP prediction. The best kernel for each model was chosen based on the coefficient of determination (R²) and the root mean square error (RMSE). In SVR models, C and ε are very important parameters whose suitable values depend on the noise associated with the input data. Because the optimal values of these parameters are not known in advance, a trial-and-error method was used to find the best combination of C and ε for DO, SD, TN, and TP prediction.
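The roles of the kernel and of ε can be made concrete with a short Python sketch of the standard textbook definitions. The function names and the `gamma` parameter are assumptions for illustration and do not reproduce MATLAB's internal parameterization. In standard SVR formulations, C is the weight placed on the summed ε-insensitive losses relative to model flatness, so a larger C tolerates fewer points outside the tube.

```python
import math


def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two input vectors; gamma is an assumed width parameter."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive_loss(observed, predicted, epsilon):
    """Errors inside the epsilon tube cost nothing; larger errors cost linearly."""
    return max(0.0, abs(observed - predicted) - epsilon)


# A residual smaller than epsilon is ignored, a larger one is penalized.
small = epsilon_insensitive_loss(5.0, 5.005, epsilon=0.01)
large = epsilon_insensitive_loss(5.0, 5.5, epsilon=0.1)
```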
GPR models
As a Gaussian process can be fully defined by its second-order statistics, a process with zero mean implies that the covariance function completely determines the behavior of the process (García-Nieto et al. 2020). Hence, the choice of a proper covariance function becomes necessary. Several types of covariance functions are commonly used for GPR, such as squared exponential, Matérn 5/2, Matérn 3/2, exponential, rational quadratic, and polynomial (Rasmussen & Williams 2005). In the present work, the DO, SD, TN, and TP models were initially developed using four different covariance functions, namely squared exponential, rational quadratic, Matérn 5/2, and exponential. Thereafter, the best GPR model was selected for each target variable and used for comparison with the corresponding ANN- and SVR-based models.
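For reference, the four covariance functions compared here can be written as functions of the distance r between two input points. The Python sketch below uses the standard textbook forms; the parameter names (`length`, `sigma`, `alpha`) and their default values are assumptions for illustration, not the fitted hyperparameters of this study.

```python
import math


def squared_exponential(r, length=1.0, sigma=1.0):
    return sigma ** 2 * math.exp(-r ** 2 / (2.0 * length ** 2))


def exponential(r, length=1.0, sigma=1.0):
    return sigma ** 2 * math.exp(-r / length)


def matern52(r, length=1.0, sigma=1.0):
    s = math.sqrt(5.0) * r / length
    return sigma ** 2 * (1.0 + s + s ** 2 / 3.0) * math.exp(-s)


def rational_quadratic(r, length=1.0, sigma=1.0, alpha=1.0):
    return sigma ** 2 * (1.0 + r ** 2 / (2.0 * alpha * length ** 2)) ** (-alpha)


kernels = [squared_exponential, exponential, matern52, rational_quadratic]
# Covariance at zero distance and at a distance of two length scales.
at_zero = [k(0.0) for k in kernels]
at_two = [k(2.0) for k in kernels]
```

All four decay with distance but at different rates, and that decay rate is what governs how smooth the fitted response is assumed to be.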
Model efficiency evaluation
Feasibility assessment of the developed models
Eutrophication models for the indicators DO, SD, TN, and TP were developed using the dataset of water quality parameters monitored in two artificially simulated prototype lakes. The feasibility of using the adopted modeling approach as a eutrophication predictor in natural waterbodies was evaluated by checking the performance of the models with samples collected from several natural shallow waterbodies in Assam. Because the models' predictions should remain reliable under different ecological conditions, a varied dataset was gathered to check the accuracy of the presented models for use in the waterbodies of Assam. Water samples were collected from two sampling locations in Deepor Bil, a Ramsar-designated wetland in Guwahati city, as well as from a marsh (two locations), a manmade lake, and a village pond in and around the Tezpur University campus in Tezpur city. Details of the sampling locations are presented in Table 2, and their positions are shown in Figure 1. All the previously mentioned water quality parameters were measured in samples collected from the waterbodies under different weather conditions in March, May, September, and December 2019.
Sampling location | Waterbody type | Location | Latitude | Longitude |
---|---|---|---|---|
1 | Deepor Bil (RAMSAR wetland) | Near Pamohi, Guwahati, Assam, India | 26°6′47.72″ N | 91°39′35.76″ E |
2 | Deepor Bil (RAMSAR wetland) | Near Pamohi, Guwahati, Assam, India | 26°6′46.32″ N | 91°39′13.51″ E |
3 | Marsh | Near Tezpur University, Tezpur, Assam, India | 26°41′13.97″ N | 92°48′58.78″ E |
4 | Marsh | Near Tezpur University, Tezpur, Assam, India | 26°41′16.37″ N | 92°48′58.00″ E |
5 | Village pond | Near Tezpur University, Tezpur, Assam, India | 26°41′30.56″ N | 92°49′15.22″ E |
6 | Artificial lake | Tezpur University, Tezpur, Assam, India | 26°42′3.78″ N | 92°49′52.08″ E |
Sensitivity analysis
Sensitivity analyses of the trained DO, SD, TN, and TP models were performed using the data perturbation method (Cao et al. 2016). To check the impact of changes in input values on the predicted output, each input parameter was increased and decreased by 20%, one at a time, while the other inputs were held constant. The resulting change in output was then calculated as a percentage and reported as the sensitivity value of that input parameter for target prediction (Maier et al. 1998). A parameter with a sensitivity value greater than 100% was identified as a sensitive parameter for the model.
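The perturbation procedure can be sketched in Python as follows. The text does not spell out the exact formula, so this sketch assumes that the sensitivity value is the percentage change in output divided by the percentage change in input, multiplied by 100; under that assumption a value above 100% means the output changes proportionally more than the input. The function `sensitivity_percent` and the stand-in `toy_model` are hypothetical and are not the trained networks of this study.

```python
def sensitivity_percent(model, baseline, index, change=0.20):
    """Perturb one input by `change` (e.g. +0.20 or -0.20), holding the others fixed."""
    perturbed = list(baseline)
    perturbed[index] = baseline[index] * (1.0 + change)
    base_out = model(baseline)
    new_out = model(perturbed)
    output_change = (new_out - base_out) / base_out * 100.0
    input_change = change * 100.0
    # Assumed definition: relative output change per unit relative input change.
    return abs(output_change / input_change) * 100.0


def toy_model(x):
    """Stand-in for a trained model: non-linear in x[0], linear in x[1]."""
    return x[0] ** 2 + x[1]


baseline = [2.0, 4.0]
up_first = sensitivity_percent(toy_model, baseline, index=0, change=0.20)
down_first = sensitivity_percent(toy_model, baseline, index=0, change=-0.20)
up_second = sensitivity_percent(toy_model, baseline, index=1, change=0.20)
```

In this toy case the first input exceeds the 100% threshold only when increased, which illustrates why increases and decreases are evaluated separately.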
RESULTS AND DISCUSSION
Experimental results
Statistic | pH | EC (μS/cm) | TDS (ppm) | Turb (NTU) | TN (mg/L) | TP (mg/L) | Temp (°C) | BOD (mg/L) | DO (mg/L) | SD (cm) |
---|---|---|---|---|---|---|---|---|---|---|
Max | 10.32 | 837.00 | 561.50 | 68.90 | 5.77 | 9.62 | 36.00 | 51.80 | 7.13 | 74.00 |
Min | 6.93 | 164.80 | 61.31 | 2.30 | 0.00 | 0.01 | 25.00 | 0.00 | 3.40 | 13.50 |
Avg | 8.67 | 363.67 | 189.00 | 17.75 | 0.71 | 2.60 | 31.40 | 12.99 | 5.64 | 40.94 |
St. Dev. | 0.90 | 155.44 | 100.60 | 15.07 | 1.03 | 1.95 | 2.51 | 10.31 | 0.97 | 15.29 |
Turb, turbidity; Max, maximum; Min, minimum; St Dev., standard deviation; Avg, average.
The maximum, minimum, and average values of the TSI based on TP and SD for the investigated artificial lakes are presented in Table 4. The concentration of TP during the initial period of investigation was quite low (0.01 mg/L), and the water was transparent down to the full depth of the lakes. For this condition, the TSI value was less than 40, indicating that the studied lakes were in an oligotrophic state. With the gradual application of nutrients, the water quality of the artificial lakes deteriorated considerably, with higher TP concentrations and lower SD values. In this condition, the calculated TSI was well in excess of 70 for both the TP and SD criteria, as shown in Table 4, indicating that the lakes had shifted from the initial oligotrophic state to a hypereutrophic state. The lake eutrophication phenomenon was thus replicated successfully in a controlled environment, and the resulting dataset was used for model development.
TSI(TP) Min | TSI(TP) Max | TSI(TP) Avg | TSI(SD) Min | TSI(SD) Max | TSI(SD) Avg |
---|---|---|---|---|---|
37.37 | 136.48 | 117.61 | < 40 | 88.90 | 72.89 |
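This section does not state which TSI formulation was used. As a hedged illustration, the Python sketch below assumes Carlson's trophic state index, with TP expressed in µg/L and SD in metres; the function names and unit conversions are assumptions. Applied to the extreme TP and SD values reported for the prototype lakes, these equations give values close to, though not identical with, those in Table 4.

```python
import math


def tsi_tp(tp_mg_per_l):
    """Carlson TSI from total phosphorus (assumed formulation); input in mg/L."""
    tp_ug_per_l = tp_mg_per_l * 1000.0  # convert mg/L to ug/L
    return 14.42 * math.log(tp_ug_per_l) + 4.15


def tsi_sd(sd_cm):
    """Carlson TSI from Secchi depth (assumed formulation); input in cm."""
    sd_m = sd_cm / 100.0  # convert cm to m
    return 60.0 - 14.41 * math.log(sd_m)


initial_tp_index = tsi_tp(0.01)   # lowest TP reported for the prototype lakes
final_tp_index = tsi_tp(9.62)     # highest TP reported
final_sd_index = tsi_sd(13.5)     # lowest Secchi depth reported
```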
To check the linear relationship between the investigated water quality parameters, a correlation coefficient matrix analysis was performed, and the results are presented in Table 5. Taking a correlation coefficient of 0.5 as the threshold value based on previous research (Mukaka 2012; Rehman et al. 2018), it was observed that the linear dependency between most pairs of investigated parameters was poor. A very strong relationship was observed only between EC and TDS, with a correlation coefficient of 0.94; the next highest value was 0.71, between DO and SD. The strong positive correlation between EC and TDS may be due to the ions dissolved in water that conduct electricity. The target eutrophication indicators DO, SD, TN, and TP showed at most a moderate linear correlation with the remaining parameters, and as such, non-linear machine learning algorithms in the form of MLP, TDNN, SVR, and GPR were used for model development.
Parameter | pH | EC | TDS | Turb | TN | TP | Temp | BOD | DO | SD |
---|---|---|---|---|---|---|---|---|---|---|
pH | 1.00 | |||||||||
EC | 0.14 | 1.00 | ||||||||
TDS | 0.06 | 0.94 | 1.00 | |||||||
Turb | 0.35 | 0.49 | 0.41 | 1.00 | ||||||
TN | −0.41 | −0.26 | −0.24 | −0.25 | 1.00 | |||||
TP | −0.50 | 0.30 | 0.33 | −0.01 | 0.22 | 1.00 | ||||
Temp | 0.09 | −0.09 | −0.11 | −0.03 | 0.06 | −0.16 | 1.00 | |||
BOD | −0.15 | 0.50 | 0.53 | 0.24 | 0.03 | 0.49 | −0.18 | 1.00 | ||
DO | 0.21 | −0.58 | −0.54 | −0.49 | −0.02 | −0.47 | 0.21 | −0.44 | 1.00 | |
SD | −0.01 | −0.52 | −0.48 | −0.59 | −0.17 | −0.39 | 0.04 | −0.43 | 0.71 | 1.00 |
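A correlation screening of this kind can be sketched in Python as follows. The readings in `sample` are made up for illustration and are not the study's data; the function names and the treatment of the 0.5 threshold as an absolute-value cut-off are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.5):
    """List parameter pairs whose |r| reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 2)))
    return pairs


# Made-up readings for illustration only.
sample = {
    "EC": [165.0, 250.0, 340.0, 480.0, 620.0],
    "TDS": [62.0, 120.0, 175.0, 260.0, 350.0],
    "Temp": [30.0, 27.0, 33.0, 26.0, 31.0],
}
pairs = correlated_pairs(sample)
```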
Modeling results
Input selection
For the prediction of DO, SD, TN, and TP, six random input combinations were considered for each target to determine the best parameter combination from the investigated water quality indices, as shown in Table 6. The coefficient of correlation (R) and the mean squared error (MSE) were used as the assessment criteria for choosing the optimum parameter combination. Each combination represents the effect of the omitted parameters on model training performance. Based on the highest R and the lowest MSE values, combinations 3, 6, 4, and 5 were the best of the scenarios considered under MLP for the DO, SD, TN, and TP models, respectively. Accordingly, 7, 4, 5, and 6 input parameters were used for DO, SD, TN, and TP prediction, respectively.
Sl. No. | Input variables | R | MSE |
---|---|---|---|
DO model | |||
1 | pH, EC, TDS, TN, TP, BOD, turbidity, SD, Temp | 0.897 | 0.0121 |
2 | pH, EC, TDS, TN, TP, turbidity, Temp | 0.901 | 0.0101 |
3 | pH, EC, BOD, TN, TP, turbidity, Temp | 0.936 | 0.0094 |
4 | pH, EC, TDS, TN, TP, Temp | 0.788 | 0.0203 |
5 | pH, EC, TN, TP, Temp | 0.804 | 0.0204 |
6 | pH, EC, TN, TP | 0.755 | 0.0214 |
SD model | |||
1 | pH, EC, TDS, DO, TN, TP, turbidity, Temp | 0.883 | 0.0110 |
2 | pH, EC, TDS, DO, TN, TP, Temp | 0.771 | 0.0175 |
3 | pH, EC, turbidity, TN, TP, Temp | 0.845 | 0.0120 |
4 | pH, EC, TN, TP, Temp | 0.783 | 0.0198 |
5 | TN, TP, turbidity, Temp | 0.748 | 0.0118 |
6 | pH, EC, turbidity, Temp | 0.921 | 0.0033 |
TN model | |||
1 | pH, EC, DO, TDS, TP, BOD, turbidity, SD, Temp | 0.856 | 0.0122 |
2 | pH, EC, TDS, BOD, turbidity, SD, Temp | 0.825 | 0.0186 |
3 | pH, EC, DO, BOD, turbidity, SD, Temp | 0.809 | 0.0204 |
4 | pH, EC, TDS, turbidity, SD | 0.915 | 0.0105 |
5 | pH, EC, DO, turbidity, SD | 0.875 | 0.0118 |
6 | pH, EC, TDS, DO, SD | 0.847 | 0.0121 |
TP model | |||
1 | pH, EC, DO, TDS, TN, BOD, turbidity, SD, Temp | 0.817 | 0.0224 |
2 | pH, EC, DO, TDS, TN, BOD, turbidity | 0.887 | 0.0181 |
3 | pH, EC, DO, TDS, BOD, SD, turbidity | 0.846 | 0.0194 |
4 | pH, EC, DO, TDS, TN, BOD, turbidity, | 0.840 | 0.0183 |
5 | pH, EC, DO, TDS, BOD, SD | 0.924 | 0.0112 |
6 | pH, EC, DO, SD, Temp | 0.785 | 0.0235 |
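The selection rule (highest R, lowest MSE) can be expressed as a short Python sketch using the DO rows of Table 6. The function name `pick_best` and the use of MSE as a tie-breaker are assumptions for illustration; the study reports only the criteria, not the selection code.

```python
# (combination number, R, MSE) for the DO model, copied from Table 6.
do_candidates = [
    (1, 0.897, 0.0121),
    (2, 0.901, 0.0101),
    (3, 0.936, 0.0094),
    (4, 0.788, 0.0203),
    (5, 0.804, 0.0204),
    (6, 0.755, 0.0214),
]


def pick_best(candidates):
    """Return the row with the highest R, using the lowest MSE to break ties."""
    return max(candidates, key=lambda row: (row[1], -row[2]))


best = pick_best(do_candidates)
# Check that the two criteria point to the same combination.
agrees_on_mse = best == min(do_candidates, key=lambda row: row[2])
```

For the DO model both criteria select combination 3, so no trade-off between R and MSE has to be resolved.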
MLR model
ANN models
Model | Metric | MLP | TDNN | SVR | GPR |
---|---|---|---|---|---|
DO model | R² | 0.97 | 0.98 | 0.95 | 0.96 |
DO model | E | 0.97 | 0.98 | 0.97 | 0.96 |
DO model | RMSE | 0.16 mg/L | 0.13 mg/L | 0.17 mg/L | 0.18 mg/L |
DO model | MAE | 0.09 mg/L | 0.06 mg/L | 0.13 mg/L | 0.14 mg/L |
SD model | R² | 0.98 | 0.98 | 0.93 | 0.98 |
SD model | E | 0.98 | 0.98 | 0.95 | 0.98 |
SD model | RMSE | 2.44 cm | 2.13 cm | 3.16 cm | 1.73 cm |
SD model | MAE | 1.98 cm | 1.42 cm | 2.22 cm | 1.33 cm |
TN model | R² | 0.97 | 0.98 | 0.91 | 0.98 |
TN model | E | 0.97 | 0.98 | 0.93 | 0.98 |
TN model | RMSE | 0.10 mg/L | 0.09 mg/L | 0.15 mg/L | 0.07 mg/L |
TN model | MAE | 0.07 mg/L | 0.06 mg/L | 0.10 mg/L | 0.04 mg/L |
TP model | R² | 0.98 | 0.99 | 0.96 | 0.99 |
TP model | E | 0.99 | 0.99 | 0.98 | 0.99 |
TP model | RMSE | 0.08 mg/L | 0.04 mg/L | 0.11 mg/L | 0.01 mg/L |
TP model | MAE | 0.05 mg/L | 0.02 mg/L | 0.09 mg/L | 0.001 mg/L |
SVR models
A five-fold cross-validation technique was employed throughout the SVR model training process, and different kernel functions were initially tried for each target variable. Using the default values of C and ε in the MATLAB environment, the linear, quadratic, cubic, and Gaussian kernels were compared. Based on the highest R² and lowest RMSE values, the Gaussian kernel-based SVR models yielded better prediction accuracy than the other kernels, as found in some previous works (Xu et al. 2015; García Nieto et al. 2019). Different values of C (from 0 to 100 in increments of 10 for the DO and SD models, and from 0 to 10 in increments of 0.1 for the TN and TP models) and ε (from 0 to 0.1 in increments of 0.01) were then considered to find the best combination under the Gaussian kernel. The optimal combinations of C and ε were 10 and 0.01 for DO, 20 and 0.02 for SD, 8 and 0.01 for TN, and 6 and 0.01 for TP. After the structure of the SVR was finalized, model training was carried out for each target output, and the results are shown in Table 7. Better prediction accuracy was achieved for the DO and TP models (Figures 6 and 9) than for the SD and TN models (Figures 7 and 8) under SVR. During training, RMSE and MAE values of 0.17 and 0.13 mg/L for the DO model, 3.16 and 2.22 cm for the SD model, 0.15 and 0.10 mg/L for the TN model, and 0.11 and 0.09 mg/L for the TP model were obtained. Overall, an acceptable correlation was observed, with R² and E values of 0.91 or higher for all models. Similar SVR models were used to predict the TP concentration in eutrophic lakes by García Nieto et al. (2019), where an R² value of 0.90 was achieved. The presented SVR models for the eutrophication indicators DO, SD, TN, and TP therefore show acceptable prediction accuracy.
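The trial-and-error search over C and ε amounts to a grid search, sketched below in Python. The helper names and the placeholder `toy_score` are hypothetical; a real run would train an SVR with five-fold cross-validation for each pair and return its RMSE. Because C must be positive in standard SVR formulations, the sketch skips the C = 0 endpoint, which is an assumption about how the reported range was applied.

```python
def frange(start, stop, step):
    """Inclusive range of evenly spaced values, rounded to avoid floating-point drift."""
    count = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(count + 1)]


def best_setting(c_values, eps_values, score):
    """Return the (C, epsilon) pair with the lowest score, skipping C = 0."""
    grid = [(c, e) for c in c_values for e in eps_values if c > 0]
    return min(grid, key=lambda pair: score(*pair))


c_grid = frange(0, 100, 10)       # range reported for the DO and SD models
eps_grid = frange(0, 0.1, 0.01)   # range reported for epsilon


def toy_score(c, eps):
    """Placeholder with a known minimum; stands in for cross-validated RMSE."""
    return (c - 10) ** 2 + (eps - 0.01) ** 2


best = best_setting(c_grid, eps_grid, toy_score)
```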
GPR models
Under GPR, the squared exponential, rational quadratic, Matérn 5/2, and exponential kernels were tried for modeling DO, SD, TN, and TP, using five-fold cross-validation and a constant basis function in MATLAB. Based on the R² and RMSE values, the squared exponential kernel was found to produce the best training efficiency, which was also reported in earlier research (García-Nieto et al. 2020). Table 7 shows the results of GPR model training using the squared exponential kernel, and Figures 6–9 illustrate the correlation achieved between actual and model-predicted values for the DO, SD, TN, and TP models, respectively. The GPR models showed high prediction accuracy during training: R² and E values of 0.99 were observed for the TP model, 0.98 for the SD and TN models, and 0.96 for the DO model. The error parameters RMSE and MAE were also quite low during GPR model training. As GPR models can handle the uncertainty typical of lake ecosystems and can perform well on small datasets (García-Nieto et al. 2020), the performance of the presented models is quite promising. Moreover, as GPR has been explored much less in the literature for predicting lake eutrophication indicators, the presented DO, SD, TN, and TP models are a useful addition.
Comparing the training results of the four approaches for DO prediction, the lowest error values were achieved with the TDNN model, as is evident from Figure 6. RMSE and MAE values of 0.13 and 0.06 mg/L were obtained with TDNN during DO model training. For SD prediction, the highest training efficiency was obtained with GPR, followed by TDNN, as shown in Figure 7. Figures 8 and 9 show that, for the TN and TP models as well, the lowest errors were obtained with GPR, followed by TDNN. From the summary of training results presented in Table 7, it can be observed that the overall R² and E values of the TDNN, GPR, and MLP models were comparable and slightly better than those of the SVR models.
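The four statistics reported in Table 7 can be computed as in the following Python sketch. Two points are assumptions, since their definitions are not given in this section: E is taken to be the Nash–Sutcliffe efficiency, and R² is taken as the squared Pearson correlation between observed and predicted values. The observed and predicted values are made up for illustration.

```python
import math


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def nash_sutcliffe(obs, pred):
    """Assumed definition of E: 1 - (sum of squared errors / variance of observations)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def r_squared(obs, pred):
    """Assumed definition of R2: squared Pearson correlation."""
    mean_o, mean_p = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


# Made-up DO values in mg/L, for illustration only.
observed = [5.0, 6.0, 7.0, 4.0]
predicted = [5.5, 6.0, 6.5, 4.0]
scores = {
    "R2": r_squared(observed, predicted),
    "E": nash_sutcliffe(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
}
```

RMSE and MAE carry the units of the target variable, which is why they are reported in mg/L or cm, while R² and E are dimensionless.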
Model testing with natural water body
Model | Metric | MLP | TDNN | SVR | GPR |
---|---|---|---|---|---|
DO model | R² | 0.92 | 0.93 | 0.85 | 0.90 |
DO model | E | 0.91 | 0.93 | 0.85 | 0.89 |
DO model | RMSE | 0.39 mg/L | 0.35 mg/L | 0.52 mg/L | 0.42 mg/L |
DO model | MAE | 0.33 mg/L | 0.31 mg/L | 0.45 mg/L | 0.36 mg/L |
SD model | R² | 0.90 | 0.91 | 0.84 | 0.90 |
SD model | E | 0.84 | 0.90 | 0.83 | 0.89 |
SD model | RMSE | 4.69 cm | 3.73 cm | 4.85 cm | 3.88 cm |
SD model | MAE | 4.36 cm | 3.56 cm | 4.45 cm | 3.46 cm |
TN model | R² | 0.89 | 0.91 | 0.89 | 0.93 |
TN model | E | 0.87 | 0.90 | 0.83 | 0.87 |
TN model | RMSE | 0.06 mg/L | 0.05 mg/L | 0.07 mg/L | 0.06 mg/L |
TN model | MAE | 0.05 mg/L | 0.04 mg/L | 0.06 mg/L | 0.05 mg/L |
TP model | R² | 0.89 | 0.92 | 0.77 | 0.91 |
TP model | E | 0.87 | 0.92 | 0.76 | 0.90 |
TP model | RMSE | 0.06 mg/L | 0.05 mg/L | 0.11 mg/L | 0.06 mg/L |
TP model | MAE | 0.05 mg/L | 0.05 mg/L | 0.09 mg/L | 0.05 mg/L |
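The four statistics reported for model testing can be computed as in the sketch below. This section does not restate the formulas, so the definitions are assumptions: R2 is taken as the squared Pearson correlation and E as the Nash-Sutcliffe efficiency. The function name `evaluate` and the sample values are hypothetical and are not data from the study.

```python
import math

def evaluate(observed, predicted):
    """Return R2, E, RMSE, and MAE for paired observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "R2": cov ** 2 / (ss_tot * var_p),   # assumption: squared Pearson correlation
        "E": 1.0 - ss_res / ss_tot,          # assumption: Nash-Sutcliffe efficiency
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
    }

# Hypothetical DO values in mg/L, for illustration only.
observed = [6.8, 5.9, 5.1, 4.4, 3.7]
predicted = [6.5, 6.1, 4.8, 4.6, 3.9]
for name, value in evaluate(observed, predicted).items():
    print(name, round(value, 3))
```

Under these definitions, a prediction with a constant bias can still give an R2 of 1 while E drops. This is one reason the two statistics are reported side by side.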
Sensitivity analysis
The data perturbation method was used to assess the impact of each input variable on the prediction of DO, SD, TN, and TP under the TDNN topology, and the results are presented in Table 9. For the DO model, a 20% increase in BOD and TN values produced sensitivity percentages above 100%, and these were therefore identified as sensitive input variables for DO prediction. Compared with the DO model, the input variables used for SD prediction were more or less consistent, and no sensitive parameters were observed. Changes in pH and a decrease in SD had a major effect on the prediction of TN in eutrophic lakes. For TP prediction, changes in pH and DO were the most sensitive parameters.
Parameters | DO model (+20%) | DO model (−20%) | SD model (+20%) | SD model (−20%) | TN model (+20%) | TN model (−20%) | TP model (+20%) | TP model (−20%)
---|---|---|---|---|---|---|---|---
pH | 86.17 | 93.46 | 36.17 | 57.84 | 107.84 | 103.25 | 112.54 | 115.38
EC | 86.45 | 89.74 | 64.21 | 58.24 | 81.47 | 74.35 | 92.57 | 83.86
TDS | – | – | – | – | 69.38 | 76.17 | 80.34 | 88.93
Turbidity | 95.74 | 89.93 | 59.36 | 85.69 | 88.55 | 82.28 | – | –
TN | 102.83 | 66.46 | – | – | – | – | – | –
TP | 84.69 | 95.29 | – | – | – | – | – | –
Temperature | 97.43 | 95.33 | 57.35 | 67.08 | – | – | – | –
BOD | 121.26 | 98.18 | – | – | – | – | 67.35 | 74.55
DO | – | – | – | – | – | – | 105.29 | 109.45
SD | – | – | – | – | 91.35 | 110.24 | 86.74 | 90.86
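A data perturbation check of this kind can be sketched as follows. The exact sensitivity formula is not given in this section, so the code assumes one common definition: the relative change in the model output divided by the relative change in the perturbed input, expressed as a percentage. The function `perturbation_sensitivity`, the stand-in model `toy_do_model`, and the baseline values are all hypothetical.

```python
def perturbation_sensitivity(model, baseline, name, fraction=0.20):
    """Sensitivity (%) of the model output to a +/- change in one input.

    Assumed definition: 100 * (relative output change) / (relative input change).
    """
    base_out = model(baseline)
    results = {}
    for sign in (+1, -1):
        perturbed = dict(baseline)                      # leave other inputs unchanged
        perturbed[name] = baseline[name] * (1 + sign * fraction)
        rel_out = abs(model(perturbed) - base_out) / abs(base_out)
        results[format(sign * fraction, "+.0%")] = 100.0 * rel_out / fraction
    return results

def toy_do_model(inputs):
    """Hypothetical stand-in for a trained DO model (not the TDNN from the study)."""
    return 10.0 - 0.5 * inputs["BOD"] - 0.2 * inputs["TN"]

# Hypothetical baseline inputs, for illustration only.
baseline = {"BOD": 4.0, "TN": 2.5}
for variable in baseline:
    print(variable, perturbation_sensitivity(toy_do_model, baseline, variable))
```

Under this assumed definition, a value above 100% means the output changes proportionally more than the perturbed input, which matches the threshold used in the text to flag sensitive variables.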
Model utility and limitations
The presented data-driven models have the advantage that, with a finite set of inputs, the targets can be predicted in real time, in contrast to physical process-based models, which require substantial input conditions and are fairly complex. The present and future trophic status of water bodies under a given nutrient loading can be estimated rationally. Using such models, the influence of the input parameters on the prediction of the target values, and the factors contributing to the eutrophication of the waterbody concerned, can be readily identified. Data-driven models developed earlier rely on data collected over several years through continuous monitoring. As surface waterbodies in India are currently more susceptible to cultural eutrophication, and long-term water quality data are not readily available, the approach presented here can be applied to the management of lakes and surface waterbodies.
Lake eutrophication is a complex process that depends on several factors, such as morphometry, nutrient loading, thermal stratification, and climatic conditions. The presented work uses periodic monitoring data for the development of eutrophication models and can therefore accommodate frequent variations in water quality. Considering these aspects, the presented models are most useful for eutrophication management in shallow waterbodies under tropical monsoon climatic conditions with high humidity levels, similar to those of Assam, India.
CONCLUSION
In the present work, a novel approach was employed to check the effectiveness of artificially replicated lake systems in predicting eutrophication indicators. Two model tanks were employed as prototype lakes, and the eutrophication process was initiated under controlled conditions by applying wastewater on a regular basis. A gradual rise in nutrient concentrations and a decrease in DO and SD values were observed, indicating the onset of eutrophication in the studied prototype lakes. TSI measurements were used to mathematically validate the deterioration in water quality in the lakes under investigation. Neural network based MLP and TDNN techniques, as well as non-parametric SVR and GPR algorithms, were used to generate models of the eutrophication indicators DO, SD, TN, and TP, based on the water quality data collected from the prototype lakes during the study period. Following a model-based parameter trimming method under the MLP architecture, the optimum numbers of input variables for the DO, SD, TN, and TP models were found to be 7, 4, 5, and 6, respectively. Thereafter, using the same inputs in each learning algorithm, predictive models were trained for the desired eutrophication indicators. The MLP-, TDNN-, SVR-, and GPR-based non-linear models were found to have better prediction accuracy than the conventional MLR model. The trained MLP, TDNN, SVR, and GPR models were tested against previously observed data from a few natural water bodies. These models were able to forecast the eutrophication indicators DO, SD, TN, and TP in the considered natural water bodies of Assam with an acceptable level of accuracy. Based on R2, E, RMSE, and MAE values, the TDNN- and GPR-based models were superior to the MLP and SVR models for the prediction of DO, SD, TN, and TP during both the training and testing phases.
R2 values of at least 0.96 and 0.90 were achieved for all TDNN and GPR models during the training and testing phases, respectively. BOD and TN were the most sensitive inputs for DO prediction, whereas none of the input parameters of the SD model were found to be sensitive. Similarly, pH and SD for the TN model, and pH and DO for the TP model, were the most sensitive input parameters. The results obtained in this study demonstrate the effectiveness of the adopted experimental set-up and modeling approach for the management of surface waterbodies in situations where long-term monitoring data are not readily available. Detailed investigations could be carried out in future, considering more water quality variables, including biological parameters, under different nutrient loading and climatic conditions. The use of other machine learning techniques, such as ANFIS, the random forest algorithm, and neural networks optimized with a genetic algorithm (GA-ANN), could also be explored.
ACKNOWLEDGEMENT
The authors gratefully acknowledge the financial support received from the Department of Science & Technology (DST) – Science and Engineering Research Board (SERB), New Delhi, India under project file no. ECR/2017/000740.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.