Abstract
This paper examines a series of connected and isolated lakes in the UK as a model system with historic episodes of heavy metal contamination. A 9-year hydrometeorological dataset for the sites was identified and used to analyse the legacy of heavy metal concentrations within the selected lakes based on physico-chemical and hydrometeorological parameters, and to compare the complementary methods of multiple regression, time series analysis, and artificial neural networks (ANNs). The results highlight the importance of the quality of historic datasets, without which analyses such as those presented in this paper cannot be undertaken. The results also indicate that the ANNs developed were more realistic than the other methodologies considered (regression and time series analysis). The ANNs provided a higher correlation coefficient and a lower mean squared error than the regression models. However, quality assurance and pre-processing of the data were challenging and were addressed by transforming the relevant dataset and interpolating the missing values. The selection and application of the most appropriate temporal modelling technique, which relies on the quality of the available dataset, is crucial for the management of legacy contaminated sites and for guiding successful mitigation measures that avoid significant environmental and human health implications.
HIGHLIGHTS
Heavy metal contamination in aquatic ecosystems is a significant environmental concern.
A series of connected and isolated lakes was examined as a model system.
A 9-year hydrological and meteorological dataset was analysed.
Multiple regression, time series analysis, and artificial neural networks were used.
The accuracy of ANN models was better than the other methods tested.
INTRODUCTION
Exposure to heavy metals within riverine systems and the potential effects of these metals on aquatic life are a significant environmental concern in many developed (Howard et al. 2015; Hurley et al. 2017, 2019; Mistri et al. 2020) and low- and middle-income countries (Sow et al. 2019; Siddique et al. 2020; Tunde & Oluwagbenga 2020). When heavy metals enter aquatic ecosystems, they may have significant environmental consequences given their toxicity, long-term persistence in the environment and potential for biomagnification within terrestrial and aquatic food chains (Zhou et al. 2008). A wide range of anthropogenic sources are known to lead to metal pollution within the natural environment, including metal mines (Varol & Sen 2012; Alam et al. 2020), industrial wastes/effluents (Muhammad & Ahmad 2020), sewage treatment works (Dukes et al. 2020), inappropriately protected landfill sites (Wuana & Okieimen 2011; Chandrappa & Das 2021), agricultural land (Karaouzas et al. 2021), combusted fossil fuels (Siddique et al. 2020), and atmospheric deposition (Tunde & Oluwagbenga 2020). The most common heavy metals recorded at polluted sites include copper (Cu), zinc (Zn), lead (Pb), chromium (Cr), cadmium (Cd), nickel (Ni), manganese (Mn), and mercury (Hg) (Wuana & Okieimen 2011; Sodango et al. 2018; Karaouzas et al. 2021). Living organisms are affected by these metals in different ways. For example, Cu, Zn, and Mn play essential roles in metabolic functioning and are required for the good health of humans and other organisms (Rainbow 2007). However, when concentrations of these metals are elevated above essential levels, they can have toxic effects (Kouba et al. 2010). There is therefore an urgent need for improved understanding of the associations between trace element concentration and long-term hydrological and meteorological parameters.
There is significant evidence of historic and current heavy metal pollution in riverine water and sedimentary deposits (Yi et al. 2011; Li et al. 2012; Islam et al. 2014, 2015; Martin et al. 2015; Ahmed et al. 2015; Siddique et al. 2020; Usman et al. 2020; Karaouzas et al. 2021), which has resulted in serious long-term effects on aquatic ecosystems (Turner 2000; Brewer et al. 2005; Armitage et al. 2007; Rowland et al. 2011; Byrne et al. 2012; Howard et al. 2015; Hurley et al. 2017, 2019). Although extensive long-term historical datasets on heavy metal availability in the environment exist within some geographical regions, such as the UK, few studies have attempted to develop predictive analysis tools to forecast or identify the potential cause–effect relationships between metal concentrations in the environment and the abiotic parameters which may have governed these relationships (Byrne et al. 2012; Yaho et al. 2016). An exception is Rizal et al. (2023), who discussed the development of a hydrological model with the application of the Hydrologic Engineering Center's Hydrologic Modeling System (HEC-HMS) using soil water data and daily rainfall records from 2009 to 2019. The theoretical runoff was calibrated and validated using the Soil Conservation Service curve number (SCS-CN) method with a good coefficient of determination. In the present study, predictive models were developed for forecasting heavy metal concentrations within the Attenborough Nature Reserve (ANR) lakes based on different parameters by combining different modelling techniques in order to conduct a comprehensive environmental risk assessment. This study analysed a long-term historical dataset to identify the key metals of concern, with a view to identifying the temporal patterns along with the potential cause–effect relationships. The availability of a long-term monthly dataset (2009–2017) for the same site is relatively rare and is therefore exceptionally useful for environmental monitoring purposes.
Water quality models are useful to help guide monitoring activities and predict metal concentrations in the environment; however, an accurate mechanistic model that captures all physico-chemical parameters, hydrometeorological processes and the associations among them is difficult to develop. As a result, it is important to evaluate the most appropriate approaches that can accurately capture the temporal patterns and trends in key environmental parameters, such as heavy metal concentrations. The accuracy of the temporal modelling technique depends on the selection of the most appropriate method based on the parameters available, and it may therefore not be a trivial process to determine which method or model is the ‘best’ (Chatfield 2000).
To address the identified needs, this paper aims to develop and compare the ability of statistical models (regression and time series analyses) and machine learning models (artificial neural networks (ANNs)) to predict the concentration of total Cu and Zn within freshwater lakes as functions of a range of abiotic parameters. Cu and Zn are the focus of this investigation due to the availability of a long-term monitoring dataset and because they are the metals of greatest concern in the study area (see Section 2.1). The other feasible alternatives to Zn and Cu were Al, Cd, Cr, Pb, and Ni. However, these were not used for model development because of the very low metal concentrations recorded. The major advantage of using Cu and Zn over the other metals for temporal modelling was the lower noise and variation in the data, which resulted in the development of better predictive models. In previous research, Dunea & Iordache (2011) used autoregressive integrated moving average (ARIMA) modelling to create time series forecasting models and found that some models performed well for certain parameters (e.g., total suspended solids (TSS)), while the models for other parameters (e.g., pH) were less accurate. Similarly, multiple regression models were found to be accurate in some instances (Yaho et al. 2016) but performed relatively poorly in others (Tipping et al. 2003). In recent years, ANNs have been increasingly used with a high degree of accuracy to predict heavy metal concentrations in soil, waste waters, rivers and ponds (Xu & Liu 2013; Zhou et al. 2015; Elzwayie et al. 2017; Ayaz & Khan 2019). Other methodological approaches considered during the study included random forest (RF), additive regression (AR), and support vector regression (SVR). Our choice of methods was based on the initial analysis of the dataset.
The dataset is relatively simple, displaying weak seasonality patterns with a relatively small number of variables/parameters, and our primary aim was to test whether the simplest machine learning model, such as a single-layer ANN, had an advantage over classical statistical techniques such as multiple regression and ARIMA. The move towards ANN was also motivated by the observation that the relationship between the Cu concentration and the independent parameters was non-linear. In addition, as a parametric model, the size of an ANN is fixed, while non-parametric models such as SVMs can become quite large and expensive in terms of time and space consumption. Similarly, with no obvious underlying rules, methods such as decision trees were discounted for the current study. However, these methods could be applied in future studies to determine whether it is possible to improve model accuracy and predictive power.
This paper addresses two research objectives in order to identify the most suitable modelling technique, or combination of techniques: (i) to characterise the dominant temporal patterns of metal (Zn and Cu) concentrations derived from surface water samples and (ii) to develop a series of models capturing the temporal variability of Zn and Cu concentrations in the environment in order to characterise the potential cause–effect relationships.
Quality assurance and pre-processing of the data were challenging. They were addressed by logarithmically transforming the data to achieve a normal distribution and interpolating the missing values with specific attention to the values before and after the missing data. The successful development and correct selection of a predictive model will help guide future assessment strategies and the identification of potential pollution hotspots that may require the development of appropriate remediation techniques.
MATERIALS AND METHODS
Study site description and model variable selection criteria
River Erewash, UK
The River Erewash in the UK has been subject to historic pollution due to former industrial activities, mining and sewage treatment (National Rivers Authority (NRA) 1995). During the early 1990s, the river was recognised as one of the most polluted rivers in the UK (NRA 1995). The existence of heavy metals in historic metal and coal mine waters flowing into this river was identified as being responsible for the decline in the abundance of aquatic organisms, including fish and invertebrate species (Derbyshire Wildlife Trust 2019). A number of sites on the river had a strong historic mismatch between the biological and chemical classifications of environmental quality (Environment Agency Thames Region 1995). As a result, the river has been extensively studied by the UK's environmental regulatory body, the Environment Agency.
ANR, UK
Due to the historic metal pollution associated with the River Erewash, the hydrologically connected lakes at ANR act as a potential storage area for suspended sediment prior to its entry into the nearby River Trent. In contrast, the lakes that are not connected provide ideal control sites against which to gauge the effectiveness of the connected lakes in storing sediment and any associated pollutants.
In the present study, two metals (Cu and Zn) for all chosen lakes and seven abiotic environmental parameters (pH, dissolved oxygen (DO), TSS, temperature, conductivity, flow (river discharge) and precipitation) for one connected (Coneries lake (Figure 1)) and one isolated lake (Church pond (Figure 1)) were used. This research utilizes the metal concentration data for Cu and Zn obtained from all five water bodies, but the water quality variables were used from Coneries lake and Church pond. Discharge draining the River Erewash passes through the connected lakes but does not flow through the isolated waterbodies (Figure 1). Since the historical dataset was only available for the lakes of ANR, it was adopted as a model system for this study. A total of 108 monthly measurements of total Cu and Zn concentrations and the above abiotic parameters were taken at each study site during a 9-year longitudinal study between 2009 and 2017. The total Cu and Zn concentration, pH, DO, TSS, water temperature and conductivity were determined by the University of Nottingham in partnership with Cemex (UK) Ltd with the data analysis being conducted by the Environment Agency for the duration 2009–2015 and Enitial Consultants, UK for the duration 2016–2017 (Table 1).
Name of the dataset | Data type | Water quality characteristics | Temporal resolution | Metal concentration | Temporal resolution |
---|---|---|---|---|---|
Appendix Dec 2019 (McGowan & Salgado-Bonnet 2019) | Physico-chemical data | Temperature, conductivity, DO, pH, TSS | Monthly values between 2005 and 2019 | Cu, Zn | Monthly values between 2005 and 2019 |
R_Erewash_Sandiacre_1965–2018 (National River Flow Archive 2020) | River flow data | Gauged daily flow | Daily flow values between 1965 and 2018 | | |
GHCNdaily_NottWatnall_1960–2017 (GHCN-Daily 2021) | Meteorological data | Precipitation, maximum stream temperature, minimum stream temperature | Daily average values between 1960 and 2017 | | |
The monthly measurements of river flow (river discharge, m3 s−1), which were measured by a multi-path ultrasonic time-of-flight gauge for the River Erewash at Sandiacre (1965–2018), were collected from the National River Flow Archive (2020). Meteorological data used in this study were obtained from the Nottingham Watnall meteorological site (1960–2017) and are available on the website of the Global Historical Climatology Network daily (GHCN-Daily 2021). All analyses in this study were undertaken using the Statistical Package for Social Sciences (SPSS), MATLAB, and Microsoft Excel.
Data analysis methodologies
Data collection and processing
Before conducting the analyses, quality assurance and pre-processing of the data were undertaken to ensure a consistent format and consistent units. Following the quality assurance and interpolation of missing values, summary statistics were derived for each of the variables used in subsequent analyses (Table 2). The results of the statistical analysis indicated that both Cu (skewness = 3.61, kurtosis = 25.12) and Zn (skewness = 3.04, kurtosis = 14.89) concentration data were positively skewed and not normally distributed. The data for the remaining parameters were normally distributed. Because most of the statistical methods used were developed for normally distributed data, and such data can be predicted with better accuracy, the Cu and Zn concentration data were logarithmically transformed to achieve a normal distribution.
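The check-and-transform step described above can be sketched as follows. This is a minimal illustration only: the concentration values are hypothetical, the natural logarithm is an assumed choice of base, and the moment-based estimators shown here differ slightly from the small-sample-corrected skewness and kurtosis reported by SPSS.

```python
import math


def _central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Moment-based skewness (third standardised moment)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0


def log_transform(values):
    """Natural-log transform; the base is an assumption, not stated in the text."""
    return [math.log(v) for v in values]


# Hypothetical Cu concentrations (ug/L) with one high outlier.
cu = [1.6, 2.1, 2.4, 2.8, 3.0, 3.3, 3.9, 4.2, 4.4, 5.1, 6.0, 16.8]
cu_log = log_transform(cu)

print("raw skewness:", round(skewness(cu), 2))
print("log skewness:", round(skewness(cu_log), 2))
print("raw excess kurtosis:", round(excess_kurtosis(cu), 2))
print("log excess kurtosis:", round(excess_kurtosis(cu_log), 2))
```

For a right-skewed series such as this one, the transform pulls in the long upper tail, which is why the skewness of the transformed values is smaller.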
Water quality characteristics | Skewness | Kurtosis | Standard deviation | Distribution | Mean | Max | Min |
---|---|---|---|---|---|---|---|
Cu concentration (μg L−1) | 3.61 | 25.12 | 1.73 | Positively skewed | 4.26 | 16.8 | 1.6 |
Zn concentration (μg L−1) | 3.04 | 14.889 | 9.39 | Positively skewed | 13.98 | 72.28 | 2 |
Temperature (°C) | 0.04 | −1.19 | 5.33 | Normal distribution | 11.55 | 22.8 | 2.2 |
Conductivity (μS cm−1) | −0.17 | −0.84 | 0.14 | Normal distribution | 0.94 | 1.22 | 0.62 |
DO (mg L−1) | 2.29 | 9.60 | 5.01 | Normal distribution | 13.26 | 41 | 6.48 |
pH | 0.26 | −0.23 | 0.58 | Normal distribution | 8.39 | 10 | 6.85 |
TSS (mg L−1) | 1.56 | 2.96 | 10.37 | Normal distribution | 13.41 | 57 | 1 |
Precipitation (mm) | 0.83 | 0.44 | 0.99 | Normal distribution | 1.88 | 4.67 | 0.10 |
Flow (m3 s−1) | 1.62 | 3.12 | 1.12 | Normal distribution | 1.81 | 6.68 | 0.59 |
Missing data
To improve the robustness of the analysis and model performance, missing values were estimated and imputed (Bennett 2001). However, 18.5% of the DO data were missing (Table 3); this exceeds 10%, a level above which results may be biased (Bennett 2001). This problem was addressed by analysing the variation in the data as a whole, with specific attention to the values before and after the missing values. The mean was used for imputation of the variables that were normally distributed, and the median was used for the Cu and Zn data, which were positively skewed.
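A minimal sketch of the imputation rule described above is given below. The function names and the example DO series are hypothetical, and representing missing observations as `None` is an assumption made for illustration; the `neighbours` helper only mimics the manual inspection of values before and after each gap.

```python
from statistics import mean, median


def missing_fraction(series):
    """Share of observations that are missing (None)."""
    return sum(v is None for v in series) / len(series)


def neighbours(series, index):
    """Nearest observed values before and after a gap (None if there is none)."""
    before = next((v for v in reversed(series[:index]) if v is not None), None)
    after = next((v for v in series[index + 1:] if v is not None), None)
    return before, after


def impute(series, skewed=False):
    """Fill gaps with the median for skewed data, otherwise with the mean."""
    observed = [v for v in series if v is not None]
    fill = median(observed) if skewed else mean(observed)
    return [fill if v is None else v for v in series]


# Hypothetical monthly DO values (mg/L) with two gaps.
do_series = [10.2, 11.0, None, 12.4, 13.1, None, 9.8, 10.5]

print("missing:", f"{missing_fraction(do_series):.1%}")
for i, value in enumerate(do_series):
    if value is None:
        print("gap at", i, "neighbours:", neighbours(do_series, i))
print("imputed:", impute(do_series))
```

Comparing the fill value with the neighbouring observations gives a quick check on whether a single summary statistic is a plausible replacement at that point in the series.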
Abiotic parameter | Percentage (%) of missing data |
---|---|
Conductivity (μS cm−1) | Less than 2% |
Cu concentration (μg L−1) | Less than 2% |
Zn concentration (μg L−1) | Less than 2% |
Temperature (°C) | Less than 2% |
pH | Less than 2% |
TSS (mg L−1) | 2.7% |
DO (mg L−1) | 18.5% |
Approaches used for the identification of the best predictive models
Regression analysis
In the current investigation, 95% confidence interval (CI) data for both metals (Cu and Zn) for all ponds (Coneries, Main, Tween, Church, and Clifton) from ANR were analysed to determine the variability of metal concentration in relation to connectivity with the River Erewash over four seasons (winter: December–February, spring: March–May, summer: June–August, autumn: September–November) based on the climatological year. Bivariate Pearson correlation coefficients and multiple linear regression models (MLRMs) were used to determine the associations and predict metal concentrations from other abiotic environmental parameters (pH, DO, temperature, TSS, river flow, conductivity, and precipitation). All regression analyses were carried out using International Business Machines (IBM) SPSS Statistics 25.
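The seasonal grouping described above can be illustrated with a short sketch. The function name, the example records and the use of a normal-approximation interval are assumptions made for illustration; the interval calculation used in the original SPSS analysis is not specified in the text.

```python
from collections import defaultdict
from statistics import NormalDist, mean, stdev

# Climatological seasons as defined in the text (month number -> season).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}


def seasonal_ci(records, level=0.95):
    """Map (month, value) pairs to {season: (mean, lower, upper)}.

    Uses a normal approximation; seasons with fewer than two values are skipped.
    """
    groups = defaultdict(list)
    for month, value in records:
        groups[SEASONS[month]].append(value)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    result = {}
    for season, values in groups.items():
        if len(values) < 2:
            continue
        centre = mean(values)
        half_width = z * stdev(values) / len(values) ** 0.5
        result[season] = (centre, centre - half_width, centre + half_width)
    return result


# Hypothetical (month, Cu concentration in ug/L) records for one lake.
records = [(12, 4.8), (1, 5.2), (2, 4.5), (3, 3.9), (4, 4.1), (5, 3.6),
           (6, 3.2), (7, 3.0), (8, 3.4), (9, 4.0), (10, 4.4), (11, 4.6)]

for season, (centre, lower, upper) in seasonal_ci(records).items():
    print(f"{season}: {centre:.2f} ({lower:.2f} to {upper:.2f})")
```

With only a few observations per season, a t-based interval would be wider than the normal approximation used here.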
MLRMs have been widely used to determine the associations between metal concentrations and abiotic environmental parameters in aquatic environments (Chatterjee & Hadi 2015; Schober et al. 2018). Sauvé et al. (1997) reported that some of the predictive models for metal concentrations performed well when only one parameter was included (Yaho et al. 2016), while other models performed better with multiple parameters (Janssen et al. 1997; Tipping et al. 2003; Meers et al. 2005). This research explores the utility of MLRMs in the scenario described above and, in particular, their usefulness in comparison to other possible approaches such as time series analysis and ANN. The results are presented in Section 3.1.
Time series analysis
ARIMA modelling is a traditional technique of time series analysis which facilitates the analysis of a variable trend. It may be used to forecast future values (Ayaz & Khan 2019) and has been used in the past for the prediction of heavy metal concentrations in the environment (Liu et al. 2012; Pandey et al. 2015). In the present study, ARIMA was used to characterise the temporal trends and patterns for the selected heavy metals (Cu and Zn) and associated abiotic parameters as individual parameters which cannot be achieved otherwise by other techniques such as regression analysis or ANN. The ARIMA model parameters, autoregressive (AR) term (p), number of differencing (d) and moving average (MA) term (q), were selected based on the results of sample partial autocorrelation function plots (PACF), augmented Dickey–Fuller tests (ADF) and sample autocorrelation function plots (ACF), respectively (Prabhakaran 2021). The number of lags above the significance line in the PACF and ACF plots determines the AR (p) and MA (q) terms, and the models were built by making the time series stationary as determined based on the results of an ADF test (Prabhakaran 2021). The best ARIMA models were determined by selecting the parameters that fit the raw data most closely and by deriving the significant p values for the AR (p) and MA (q) terms, and the Akaike information criterion (AIC) values to quantify the goodness of fit (Prabhakaran 2021).
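The role of the ACF and PACF plots in choosing candidate orders can be sketched in plain Python. This is an illustration under stated assumptions: the series is a simulated AR(1) process rather than the study data, the significance band is the usual large-sample approximation of ±1.96/√n, and the ADF test and the ARIMA fit itself are not reproduced here, so the differencing order is simply passed in.

```python
import math
import random


def difference(series, d=1):
    """Apply first differencing d times."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Sample partial autocorrelation via the Durbin-Levinson recursion."""
    rho = acf(series, max_lag)
    out, prev = [1.0], []
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


def significant_lags(values, n):
    """Lags (excluding lag 0) outside the approximate 95% band."""
    bound = 1.96 / math.sqrt(n)
    return [k for k, v in enumerate(values) if k > 0 and abs(v) > bound]


# Simulated AR(1) series of 108 points (same length as the monthly record).
rng = random.Random(1)
series = [0.0]
for _ in range(107):
    series.append(0.7 * series[-1] + rng.gauss(0.0, 1.0))

print("PACF lags suggesting p:", significant_lags(pacf(series, 10), len(series)))
print("ACF lags suggesting q:", significant_lags(acf(series, 10), len(series)))
```

In practice the candidate orders read from these plots are a starting point, and competing models are then compared on the significance of their terms and on AIC, as described above.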
ANN analysis
ANNs emulate the functioning of the human brain (Gredilla et al. 2013) and are typically composed of three types of layer (an input layer, one or more hidden layers and an output layer) of interconnected neurons (American Society of Civil Engineers (ASCE) 2000; IBM 2020). Each connection is associated with a weight, and the output of any neuron depends upon the weights of its input connections and its associated activation function (commonly used activation functions include the sigmoid and hyperbolic tangent functions). The network is trained (i.e., the weights of the connections are determined) using available historic data through algorithms that fall broadly under two categories, namely, supervised and unsupervised techniques (Friedel 2014). Elzwayie et al. (2017) stated that ANN is a flexible method that recognises the patterns within the dataset and finds the association between inputs and outputs to predict heavy metal concentrations.
A MATLAB toolbox was used to train and test various ANN configurations to estimate heavy metal concentrations (outputs) from the environmental parameters (inputs). Separate ANNs were created for Cu in the connected and isolated lakes because correlation analysis, which is an appropriate method for investigating the association between water quality variables, river discharge, and precipitation (Grum et al. 1997; Vervier et al. 1999), indicated that different parameters influenced Cu concentrations in connected and isolated lakes. The Zn data were not used for the ANN configurations because the regression and time series analyses indicated that the Zn results were not sufficiently reliable and that the data were noisier, with relatively large variation. As a result, only the ANN results for Cu are presented in the current paper.
Several ANN architectures were tested, and the feedforward ANN using back propagation for error reduction with five neurons in its hidden layer yielded the highest coefficient of correlation (R) values for predicting Cu concentrations within both the connected and isolated lakes. R and mean squared error (MSE) values were used as the performance metrics to characterise the accuracy of the models. Feasible alternatives to R and MSE include mean absolute error (MAE), root mean square error (RMSE), relative absolute error (RAE), and root relative square error (RRSE). Since R and MSE can be more easily interpreted, they were adopted to characterise the accuracy of the models in the current research.
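For reference, the performance metrics named above can be written out as a short sketch using their standard definitions. The function names and the example values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean


def mse(observed, predicted):
    """Mean squared error."""
    return mean((p - o) ** 2 for o, p in zip(observed, predicted))


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(mse(observed, predicted))


def mae(observed, predicted):
    """Mean absolute error."""
    return mean(abs(p - o) for o, p in zip(observed, predicted))


def rae(observed, predicted):
    """Relative absolute error: total error relative to a mean-only predictor."""
    centre = mean(observed)
    return (sum(abs(p - o) for o, p in zip(observed, predicted))
            / sum(abs(o - centre) for o in observed))


def rrse(observed, predicted):
    """Root relative squared error."""
    centre = mean(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted))
                     / sum((o - centre) ** 2 for o in observed))


def pearson_r(observed, predicted):
    """Coefficient of correlation (R) between observed and predicted values."""
    mo, mp = mean(observed), mean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Hypothetical observed and predicted Cu concentrations (ug/L).
observed = [3.1, 4.0, 4.4, 5.2, 3.8, 6.1]
predicted = [3.4, 3.8, 4.9, 5.0, 4.1, 5.6]
print("R:", round(pearson_r(observed, predicted), 3))
print("MSE:", round(mse(observed, predicted), 3))
```

R is unit-free and MSE is in squared concentration units, which is one reason the two are often reported together.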
The back propagation algorithm has the advantage of being simple to adapt, with no parameter and function features to tune or learn. However, noise in the data can negatively affect the performance of the back propagation algorithm, even leading to convergence to local minima/maxima. The dataset used after pre-processing in the current study was relatively noise free and void of major irregularities, leading us to adopt back propagation as opposed to more sophisticated learning algorithms. Future studies will explore techniques such as evolutionary approaches to improve the prediction accuracy of the back propagation neural network and to ensure global convergence. In addition to using the back propagation algorithm for training, the initial weights and biases of the model were set to 0 and the number of training iterations was set to a maximum of 100, with an option to halt when the accuracy of the model did not improve or remained equal. While different weights and biases were not tested, different numbers of epochs were tested and the accuracy did not improve any further, which led to the number of iterations being finalised as 100.
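The general shape of such a network (one hidden layer of five neurons, back propagation, a cap of 100 epochs and a halt when accuracy stops improving) can be sketched in plain Python. This is not the MATLAB configuration used in the study: the data are synthetic, plain stochastic gradient descent is an assumed training rule, the learning rate and patience are hypothetical, and small random initial weights are used instead of zeros purely so that this toy example trains.

```python
import math
import random


def init_network(n_inputs, n_hidden=5, seed=0):
    """One tanh hidden layer and a linear output; small random weights (assumption)."""
    rng = random.Random(seed)
    return {"W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)],
            "b1": [0.0] * n_hidden,
            "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
            "b2": 0.0}


def forward(net, x):
    """Return hidden activations and the network output for one sample."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    return hidden, sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]


def mse(net, X, y):
    return sum((forward(net, x)[1] - t) ** 2 for x, t in zip(X, y)) / len(y)


def train(net, X, y, X_val, y_val, lr=0.05, max_epochs=100, patience=10):
    """Back propagation with per-sample updates and validation-based halting."""
    history = [mse(net, X, y)]
    best_val, stale = mse(net, X_val, y_val), 0
    for _ in range(max_epochs):
        for x, t in zip(X, y):
            hidden, out = forward(net, x)
            err = out - t
            for j, h in enumerate(hidden):
                # Error signal propagated back through hidden unit j.
                grad_hidden = err * net["W2"][j] * (1.0 - h * h)
                net["W2"][j] -= lr * err * h
                for i, xi in enumerate(x):
                    net["W1"][j][i] -= lr * grad_hidden * xi
                net["b1"][j] -= lr * grad_hidden
            net["b2"] -= lr * err
        history.append(mse(net, X, y))
        val = mse(net, X_val, y_val)
        if val < best_val - 1e-9:
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return history


# Synthetic, scaled inputs and a non-linear target (not the study data).
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(80)]
y = [0.5 + 0.8 * a - 0.6 * b * b for a, b in X]
net = init_network(n_inputs=2)
history = train(net, X[:60], y[:60], X[60:], y[60:])
print("epochs run:", len(history) - 1)
print("training MSE before and after:", round(history[0], 4), round(history[-1], 4))
```

Holding out part of the data for the halting rule, as done here, is one common way to implement a stop when accuracy no longer improves; the text does not state how this was configured in MATLAB.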
RESULTS AND DISCUSSION
To identify the models with the highest accuracy, predictive models for heavy metal concentrations were developed from physico-chemical, biological, and meteorological parameters using multiple regression, time series analysis, and ANN, and their performance was compared.
Statistical analysis
Regression models
To investigate the associations between Cu and Zn concentration and other environmental parameters, regression analysis was undertaken (Grum et al. 1997; Vervier et al. 1999). Bivariate Pearson correlation coefficients (r) were calculated for Zn and Cu (dependent variables) in association with the environmental parameters: conductivity, precipitation, temperature, pH, TSS, flow, and DO. A weak but significant positive correlation was observed between Cu concentration (μg L−1) and conductivity (μS cm−1) (r = 0.22; p < 0.01; Table 4). There was also a weak, positive correlation between Zn concentration (μg L−1) and river discharge (m3 s−1) which was statistically significant (r = 0.23; p < 0.01) (Table 4). No other statistically significant correlation was recorded.
Metal | TSS (mg L−1) | DO (mg L−1) | Conductivity (μS cm−1) | pH | Temp (°C) | Flow (m3 s−1) | Precipitation (mm) |
---|---|---|---|---|---|---|---|
Cu | 0.20 | 0.14 | 0.22 | −0.11 | −0.12 | 0.18 | 0.14 |
Zn | 0.16 | 0.06 | 0.15 | −0.17 | −0.14 | 0.23 | 0.04 |
MLRMs were developed to predict Cu and Zn concentrations for all lakes. Both models included all the abiotic parameters and identified conductivity, TSS, flow and precipitation as the best predictor variables. The adjusted R2 values were 0.134 for Cu and 0.101 for Zn (Table 5). The models therefore explained more of the variance for Cu (13.4%) than for Zn (10.1%). Based on these results, regression analysis was not considered sufficiently robust for the prediction of either Cu or Zn concentrations.
Model | Adjusted R2 | Number of samples | Predictor variable and sign |
---|---|---|---|
Cu | 0.134 | 530 | +Conductivity, +TSS, +Flow, +Precipitation |
Zn | 0.101 | 530 | +Flow, +Conductivity, +TSS, +Precipitation |
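The adjusted R2 values reported above penalise the ordinary R2 for the number of predictors. A minimal sketch of the standard relationship follows; treating the four retained predictors as the model size k is an assumption made for illustration, since the text also states that all abiotic parameters were included.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 from ordinary R2, n samples and k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def r2_from_adjusted(adj_r2, n, k):
    """Invert the adjustment to recover the ordinary R2."""
    return 1.0 - (1.0 - adj_r2) * (n - k - 1) / (n - 1)


# Reported values: adjusted R2 and n = 530 samples; k = 4 is assumed.
for metal, adj in (("Cu", 0.134), ("Zn", 0.101)):
    print(metal, "ordinary R2 under k = 4:", round(r2_from_adjusted(adj, 530, 4), 4))
```

With 530 samples the penalty is small, so the adjusted and ordinary values differ only in the third decimal place under this assumption.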
Correlation and regression analysis have been used in previous research to examine abiotic environmental variables that may influence metal concentrations (Li et al. 2013; Yaho et al. 2016; Nasrabadi et al. 2018). For example, metal concentrations have been reported to increase at times of elevated TSS (Nasrabadi et al. 2018). Increased runoff from road surfaces due to precipitation may increase the suspended solid load in rivers resulting in an increase in metal concentrations (Yaho et al. 2016). Metal concentration may also increase under acidic conditions with subsequent high conductivity, or during periods of high river flow following precipitation input (Li et al. 2013). In the context of the current study, metals have run off from the surrounding industrial and historic mining areas.
Multiple regression analysis has been used in the past to develop predictive models of heavy metal contamination. For example, Tipping et al. (2003) used multiple regression analysis to forecast solid–solution partitioning of heavy metals in upland soils in England and Wales. Carlon et al. (2004) used multiple regression analysis to predict the mobility of heavy metals and other pollutants in soil by using a stepwise forward procedure where data from five previous published studies (Gerritse & Van Driel 1984; Jopony 1991; Aten & Gupta 1996; Janssen et al. 1997; Sauvé et al. 1997) were used. A correlation was developed to predict the mobility of lead based on pH and the performance of the equation was evaluated by considering the adjusted R2, which demonstrated underperformance of the equation. Meers et al. (2005) used multivariate regression to predict the solubility of metals in soils in Belgium. Although the regression model with pH and total cadmium displayed a good model fit in this study, it was difficult to find a practical predictive model based on the equations derived. Similarly, Ivezić et al. (2012) used multiple regression to predict the solubility of metals in soil. However, the models overestimated the concentration of metals within the polluted soils and were not useful for unpolluted soils. Yaho et al. (2016) used two statistically significant linear regression models to predict the concentration of heavy metals in two rivers from China. The R2 values obtained for the models based on the samples from an industrial river ranged between 0.86 and 0.93, and those for the cleaner river ranged between 0.60 and 0.85. Kouadri et al. (2021) developed Water Quality Index (WQI) predictions using eight artificial intelligence algorithms based on two different scenarios. The performance of the models was assessed by correlation coefficient (R), MAE, RMSE, RAE, and RRSE. The results indicated that the multilinear regression (MLR) model provided the greatest accuracy for the first scenario and RF provided the greatest accuracy for the second scenario.
In the current research, the multiple regression models had a low predictive ability for the metal concentrations and were unable to explain the variability within the data with accuracy. Consequently, other techniques were examined as outlined in the following sections.
Time series analysis of the historical metal data
Trends and patterns of abiotic parameters of a connected lake
One connected lake (Coneries lake) was used for building the ARIMA models (Figure 3). The series of ARIMA models developed indicate good model fits for pH (Figure 3(a)), temperature (Figure 3(b)), DO (Figure 3(c)), conductivity (Figure 3(e)), and flow (Figure 3(f)) where the p values of the AR (p) and MA (q) terms were all highly significant (p < 0.05) with AIC values of −259.94, 33.68, 92.93, −122.22, and 183.91, respectively. This probably reflects the high temporal resolution of the data and the fact that all model parameters were selected after carefully reviewing the PACF plots, ACF plots, and ADF tests. However, some of the models did not provide a good fit (e.g., TSS (Figure 3(d)) and precipitation (Figure 3(g))): although the p values of the AR (p) and MA (q) terms were significant (p < 0.05), the AIC values were high. This was probably because there was limited data variability compared to other parameters, with many values below the limits of detection.
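To show how AIC can rank candidate model orders, the following is a simplified sketch. It fits a pure autoregressive model by conditional least squares, not the full ARIMA procedure used in the study, and computes AIC from the Gaussian log-likelihood. The series is synthetic and the function name is hypothetical.

```python
import numpy as np

def fit_ar_aic(series, p, max_lag):
    """Fit AR(p) by conditional least squares on observations max_lag..n-1.

    Using the same max_lag for every candidate p keeps the sample identical,
    so the AIC values are comparable. Returns (coefficients, aic).
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    target = x[max_lag:]
    cols = [np.ones(n - max_lag)]            # intercept
    for lag in range(1, p + 1):
        cols.append(x[max_lag - lag : n - lag])  # x[t - lag]
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    m = len(target)
    sigma2 = float(resid @ resid) / m        # ML estimate of innovation variance
    log_lik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1.0)
    k = p + 2                                # intercept + p AR terms + variance
    return coef, 2 * k - 2 * log_lik

# Synthetic AR(1) series (assumption: NOT the study's data)
rng = np.random.default_rng(1)
n = 108
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + rng.normal(scale=0.5)

max_lag = 3
aics = {p: fit_ar_aic(series, p, max_lag)[1] for p in range(max_lag + 1)}
best_p = min(aics, key=aics.get)
ar1_coef = fit_ar_aic(series, 1, max_lag)[0]
print(aics, "best order:", best_p)
```

A lower AIC indicates a better trade-off between fit and complexity among candidate models fitted to the same series. AIC can be negative when the log-likelihood is positive.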
Trends and patterns of abiotic parameters of an isolated lake
The data for one isolated lake (Church pond) were used to build the ARIMA models (Figure 4) for analysing the trends and patterns of environmental parameters. The ARIMA models indicate a good model fit for pH (Figure 4(a)), temperature (Figure 4(b)), conductivity (Figure 4(e)), and flow (Figure 4(f)) where the p values of the AR (p) and MA (q) terms were highly significant (p < 0.05) with AIC values of 89.03, −288.1, 87.45, −246.79, 183.08, respectively. However, some of the models did not provide a good model fit, for example, DO (Figure 4(c)), TSS (Figure 4(d)) and precipitation (Figure 4(g)). For DO (Figure 4(c)), the p value of the AR (p) term was not significant (p > 0.05), although the AIC value was low. For TSS (Figure 4(d)) and precipitation (Figure 4(g)), the p values of the AR (p) and MA (q) terms were highly significant (p < 0.05) but AIC values were high. This was probably due to the limited data variability (compared to other variables) and many values being below the limits of detection.
Trends and patterns of Cu concentrations for all connected and isolated lakes
Time series plots were prepared for all lakes to illustrate the variability of monthly Cu concentrations for the period 2009–2017 (Figure 5). A peak in Cu concentration (16.80 μg L−1) was observed in April 2013 for Coneries lake (Figure 5(a)). This was due to high river discharge, elevated water levels and flooding in Nottinghamshire during this period, when water levels were elevated at the Coneries lake outlet because water from the River Erewash was being diverted away from the Coneries–Tween–Main lake chain (McGowan & Salgado-Bonnet 2019). The time series plot for the unconnected lake, Church pond, illustrates the variation of monthly Cu concentration data for the same period (Figure 5(d)). A peak in Cu concentration (12.60 μg L−1) was also observed in April 2013. Metal contaminated river sediments and metal pollutants were mobilised and carried downstream during these events. Flooding is known to result in marked fluctuations in metal concentrations associated with increased mobilisation and re-suspension of sediments (Hurley et al. 2019) on the rising limb of the hydrograph. Thus, heavy metal contaminated sediments deposited and stored within the channel and floodplain may be easily mobilised and pose a threat to the aquatic environment (Environment Agency 2008). The AIC values obtained for Cu concentrations for the lakes were as follows: (a) Coneries = 28, (b) Main = 90, (c) Tween = 69, (d) Church = 87, and (e) Clifton = 120 (Figure 5).
Trends and patterns of Zn concentrations for all connected and isolated lakes
Time series plots were also prepared to analyse the variation of monthly Zn concentration values for all lakes (Figure 6) for the time-period of 2009–2017. The AIC values obtained for Zn concentrations of all the lakes were as follows: (a) Coneries = 161, (b) Main = 161, (c) Tween = 221, (d) Church = 187 and (e) Clifton = 213 (Figure 6).
The ARIMA models helped to identify and visualise temporal patterns within the historic metal (Cu and Zn) data and other parameters. The analysis demonstrated that the lakes directly connected to the inflowing River Erewash received a greater metal load than the unconnected lakes. However, it is also clear that metal concentrations do not necessarily follow seasonal patterns or the trend lines in all instances. The models performed well for some variables (e.g., pH, temperature, conductivity, river flow) but less well for other parameters (e.g., TSS and precipitation). The AIC values obtained for Cu were lower than for Zn, indicating a better goodness of fit for Cu.
A review of the ARIMA models established that time series analysis was not particularly accurate for the prediction of Cu and Zn concentrations. However, the method has been widely used in previous research centred on the analysis of seasonal and long-term water quality variation. For example, Lehmann & Rode (2001) used time series and trend analysis to characterise water quality and demonstrated that there was a clear association between the years 1984 and 1989 but the patterns broke down after 1990. The correlated parameters were DO saturation and biological oxygen demand, DO saturation and nitrate, nitrate and discharge, chlorophyll-a and pH. It was also demonstrated that ARIMA models forecast events more accurately in the short term and became less effective in the long term. Rao et al. (2011) used time series analysis to examine trace metal pollution in surface waters from historic Cu mining areas in China using data over 13 years from 1995 to 2007. The results indicated that the model performed well at α = 0.2 (smoothing coefficient) with an error between the estimated values and actual values of below 5%.
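The role of a smoothing coefficient such as α = 0.2 can be sketched as follows. The sketch assumes that the coefficient refers to simple (single) exponential smoothing, which this section does not confirm. The input values and function name are hypothetical.

```python
def simple_exponential_smoothing(values, alpha=0.2):
    """Return one-step-ahead forecasts; forecasts[t] predicts values[t]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    # Initialise with the first observation (a common convention, assumed here)
    forecasts = [values[0]]
    for observed in values[:-1]:
        forecasts.append(alpha * observed + (1.0 - alpha) * forecasts[-1])
    return forecasts

# Hypothetical concentrations, not taken from any cited study
values = [10.0, 12.0, 11.0, 13.0]
forecasts = simple_exponential_smoothing(values, alpha=0.2)

# Relative error (%) between estimated and actual values
relative_errors = [abs(f - v) / v * 100.0 for f, v in zip(forecasts, values)]
print(forecasts)
print(relative_errors)
```

A small α weights the forecast towards the smoothed history, so the estimate responds slowly to sudden changes such as flood-driven peaks.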
In the current study, the results of the time series analysis demonstrated that it was not possible to predict metal concentrations accurately using ARIMA models because the models reflected neither the temporal trend nor the seasonal pattern accurately. This observation resulted in the use of ANN as outlined below.
ANN analysis
The network for Cu (Figure 7) was trained to achieve the highest model accuracy. Tests were conducted with alternative structures in addition to the structure shown in Figures 7 and 9, such as ANNs with multiple hidden layers and more hidden nodes. However, those were avoided as they took longer to converge while showing no improvement in the prediction accuracy. The validation vector stopped the training procedure when the accuracy of the model stopped improving. DO and TSS were found to be the best predictors for Cu values.
ANNs have been successfully used for building predictive models of heavy metal concentrations in the environment. For example, Rooki et al. (2011) used back propagation, multiple linear regression and general regression neural networks to predict heavy metal concentrations in the Shur River in southeast Iran, where a back propagation neural network displayed the highest performance. Cavalcante et al. (2013) used an ANN (multilayer perceptron) with one hidden layer of five neurons to predict the changes in physico-chemical parameters based on seasonal variability. Subida et al. (2013) compared univariate techniques, multivariate techniques and ANNs to assess their performance and to estimate heavy metal contamination in estuarine benthic habitats. Based on the results, univariate techniques displayed poorer performance compared to multivariate techniques, and ANNs provided the most easily interpreted results with the most efficient analytical effort. Wang et al. (2014) developed a neural network to detect the Cu, cadmium, and lead concentrations in industrial wastewaters. The performance of the model was assessed by the average mean relative error (MRE), and its accuracy was compared with that of other popular chemometric neural network methods; the results indicated that the model performed very well. Zhou et al. (2015) used ANNs (Levenberg–Marquardt algorithm and back propagation) to identify contaminant sources and to predict the concentration of heavy metals in soil from a known city area of China. The results indicated that the sources for areas with high levels of heavy metal contamination were successfully identified and the three-layer feed-forward ANNs displayed high accuracy in estimating heavy metal concentrations. Elzwayie et al. (2017) used a radial basis function neural network (RBFNN) to predict the heavy metal concentrations in two different lakes from Libya and Malaysia with different climatic and environmental conditions (e.g., tropical and arid; polluted and unpolluted lakes). The significant parameters were different in each study area and the accuracy of the models was very high. Ayaz & Khan (2019) developed a feed forward neural network with back propagation to predict the heavy metal concentrations in Karachi coastal waters and harbour. The results clearly demonstrated that neural networks were the most useful method to predict metal concentrations. Hadjisolomou et al. (2021) investigated chlorophyll-a levels of a shallow eutrophic lake using an ANN model for different water quality parameters including pH, DO, water temperature, phosphorus, nitrogen, electric conductivity, and Secchi disk depth. K-fold cross validation and leave-one-out (LOO) cross validation were applied to train the ANN model. The results improved as k increased, and the best results were obtained using LOO cross validation.
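The relationship between k-fold and LOO cross validation can be illustrated with a short sketch of how the sample indices are partitioned. The function name is hypothetical, and contiguous unshuffled folds are an assumption made for simplicity; the sketch does not reproduce the procedure of any cited study.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds.

    Setting k == n_samples gives leave-one-out (LOO) cross validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))   # three folds of sizes 4, 3, 3
loo = list(k_fold_indices(5, 5))      # LOO: each test fold holds one sample
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

As k increases, each model is trained on a larger share of the data, at the cost of fitting more models.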
In the current study, a 4:5:1 neural network was developed with four input nodes, five hidden nodes, and one output node, where DO and TSS were found to be the best predictors with an R value of 0.72. The ANN model could more accurately predict Cu concentrations than any of the other methods considered (regression or ARIMA) for the abiotic parameters used. A thorough examination of the regression models in this research indicated that it was not possible to predict metal concentrations accurately using multiple regression. In addition, the results of the time series analysis indicated that it was not possible to accurately predict metal concentrations using time series (ARIMA) methods because the models for the target metals reflected neither the trend nor any seasonal patterns. However, the results of the ANNs indicated that all models developed displayed a lower MSE (MSE = 0.01) compared to regression models, indicating that ANNs were better suited to predicting metal concentrations than the other methods considered in this study. The ANN models developed may ultimately assist local managers of the ANR in managing the metal concentrations within the lakes, which may currently affect the birds, aquatic organisms, and the wider environment.
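The structure of a 4:5:1 network, validation-based early stopping, and the R and MSE scores can be sketched as follows. This is a minimal illustration under stated assumptions: the tanh activation, plain gradient descent, data split, and synthetic data are all chosen for the example, since this section does not state the training algorithm or software used in the study.

```python
import numpy as np

def init_params(n_in=4, n_hidden=5, seed=0):
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(scale=0.5, size=(n_in, n_hidden)), "b1": np.zeros(n_hidden),
            "W2": rng.normal(scale=0.5, size=(n_hidden, 1)), "b2": np.zeros(1)}

def forward(params, X):
    hidden = np.tanh(X @ params["W1"] + params["b1"])        # five hidden nodes
    output = (hidden @ params["W2"] + params["b2"]).ravel()  # one linear output node
    return hidden, output

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def train(params, X_tr, y_tr, X_val, y_val, lr=0.05, epochs=3000, patience=100):
    """Full-batch gradient descent; stops when validation MSE stops improving."""
    best_params = {k: v.copy() for k, v in params.items()}
    best_val = mse(y_val, forward(params, X_val)[1])
    stale = 0
    for _ in range(epochs):
        hidden, pred = forward(params, X_tr)
        grad_out = (2.0 / len(y_tr)) * (pred - y_tr)[:, None]   # d(MSE)/d(output)
        grad_hidden = (grad_out @ params["W2"].T) * (1.0 - hidden ** 2)
        grads = {"W2": hidden.T @ grad_out, "b2": grad_out.sum(axis=0),
                 "W1": X_tr.T @ grad_hidden, "b1": grad_hidden.sum(axis=0)}
        for name in params:
            params[name] -= lr * grads[name]
        val = mse(y_val, forward(params, X_val)[1])
        if val < best_val:
            best_val, stale = val, 0
            best_params = {k: v.copy() for k, v in params.items()}
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_val

# Synthetic, standardised stand-in data (assumption: NOT the study's measurements)
rng = np.random.default_rng(42)
X = rng.normal(size=(108, 4))  # four hypothetical abiotic predictors
y = 0.5 + 0.3 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.05, size=108)
X_tr, y_tr = X[:70], y[:70]
X_val, y_val = X[70:89], y[70:89]
X_te, y_te = X[89:], y[89:]

params, best_val = train(init_params(), X_tr, y_tr, X_val, y_val)
test_pred = forward(params, X_te)[1]
test_mse = mse(y_te, test_pred)
test_r = float(np.corrcoef(y_te, test_pred)[0, 1])
print(f"test R = {test_r:.2f}, test MSE = {test_mse:.4f}")
```

Keeping a held-out test set separate from the validation set means the reported R and MSE are not biased by the early-stopping decision.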
CONCLUSIONS
Heavy metal contamination in aquatic ecosystems and its potential effects on aquatic life is a significant issue of environmental concern in many developed economies as well as low- and middle-income countries. Despite significant historic and current issues associated with metal pollution, there have been comparatively few contemporary studies examining the long-term legacy and effect of heavy metals on aquatic ecosystems in post-industrial areas. Although extensive long-term historical datasets on heavy metal availability in the environment exist within some geographical regions, few studies have attempted to develop predictive models to forecast or to identify the potential cause–effect relationships between metal concentrations in the environment and abiotic parameters. As a result, this study attempted to compare multiple regression, time series analysis, and ANNs to identify the most accurate model to predict metal concentrations based on different abiotic parameters.
A series of lakes within the ANR, UK, was used as a model system in this paper to analyse a 9-year hydrological and meteorological dataset. ANR is a conservation area in Nottinghamshire, UK, located near the confluence of the River Erewash and River Trent. This research developed a series of models (regression analysis, time series analysis and ANNs) to predict the concentration of Cu and Zn in freshwater lakes using a range of abiotic parameters. The correlation analysis conducted in this study indicated that the electrical conductivity values (Pearson correlation coefficient, r = 0.22) and river flow (r = 0.23) were significantly associated with metal concentrations. Electrical conductivity, TSS, flow volume and precipitation were the best predictors in regression models (explaining 13.4% of the variance for Cu and 10.1% of the variance for Zn). As expected, time series analysis in this work showed that the lakes directly connected to the inflowing river received a greater metal concentration than the unconnected lakes. Individual ANN models were created for metal concentrations, where DO and TSS were found to be the best predictors, with a high R value of 0.72 and a low MSE of 0.01. Overall, the results suggest that the accuracy of ANN models was the highest among all methods considered and that they could more accurately predict metal concentrations based on abiotic parameters. Since ANNs could reveal different effects and patterns of the physico-chemical and seasonal changes in metal concentrations, and provided more accurate results, they were considered the most appropriate for building predictive models. Other methodological approaches could have been employed, such as RF, M5P tree (M5P), random subspace (RSS), AR, SVR, and locally weighted linear regression (LWLR). These methods could be applied in future studies to determine if it is possible to improve model accuracy and predictive power.
The results of this study revealed that TSS and DO have a strong effect on the metal concentrations in the lakes of ANR. The turbidity of a river system is raised by TSS and is strongly related to the volume (flooding) and timing of runoff. Therefore, it can be concluded that metal concentrations tend to increase in the rainy season or during flood events when there is high river flow. The environmental regulators responsible for ANR need to manage the effect of flooding in the area. The implications of this research may help environmental regulators, reserve managers and local authorities responsible for the River Erewash or ANR to develop strategies to manage environmental metal concentrations and identify more robust monitoring and mitigation measures.
ACKNOWLEDGEMENTS
The monitoring and analytical work was funded by CEMEX UK Operations Ltd as part of a monitoring programme overseen by Christopher Pointer, Principal Hydrogeologist at CEMEX. The work has included many University of Nottingham students and staff, with significant contributions from Iain Cross, Teresa Needham, Julie Swales, Ian Conway, and Sarah Taylor. We would like to express our sincere gratitude to Nottinghamshire Wildlife Trust and the ANR for allowing access to the sites.
AUTHOR CONTRIBUTIONS
B.B. and L.B. conceptualised the study, prepared the methodology, did formal analysis, investigated, wrote the original draft, wrote, reviewed, and edited the article, and visualised the study. L.D. and P.J.W. conceptualised the study, prepared the methodology, supervised the study, wrote, reviewed, and edited the article. S.M.G. did field work, prepared the methodology, did laboratory analyses, did data curation, wrote, reviewed, and edited the article. D.B.D. conceptualised the study, prepared the methodology, collected resources, supervised the study, wrote, reviewed, and edited the article. All of the authors have read and agreed to the submitted version of the manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.