ABSTRACT
The present study aims to streamline the long-term spatiotemporal river water quality assessment and forecasting utilizing three intuitive Python-based modules: (1) Python toolbox for scalable Outlier Detection (PyOD) to classify significant deviations from the expected water quality norms (outliers), (2) Statsmodels to decompose the river time series data into its trend, seasonal, and residual components, and (3) Automatic Time Series forecasting model (AutoTS) to forecast (and compare) the future water quality state of the Karun River (the case study) in Southern Iran. The findings indicate that the outlier elimination has a remarkable impact on the outcomes of the Karun time series data analysis. Additionally, a significant increase in total dissolved solids (TDS) concentrations and a cyclic pattern were discernable in the decomposed time series. Furthermore, the water quality values were found to be clustered around the median of their datasets. Based on the forecasting validation metrics, the proposed automated forecasting model was found to be promising in predicting the future water quality state of the river.
HIGHLIGHTS
An intuitive model for a rapid spatiotemporal river water quality analysis was proposed.
Python toolbox for scalable Outlier Detection was leveraged to identify outliers in the river water quality dataset.
Statsmodels was applied to decompose the data time series.
Automatic Time Series forecasting model (AutoTS) was trained to automatically forecast the future water quality state of the river.
Violin and box plots were used to illustrate the quality variation patterns along the river.
NOMENCLATURE
ABBREVIATION
- AutoTS: automatic time series forecasting model
- CV: coefficient of variation
- LOESS: locally estimated scatterplot smoothing
- ML: machine learning
- WQI: water quality indicator
- sMAPE: symmetric mean absolute percentage error
- TDS: total dissolved solids
- MAE: mean absolute error
- PyOD: Python toolbox for scalable Outlier Detection
- EC: electrical conductivity
- SD: standard deviation
- SPL: scaled pinball loss
- STL: seasonal-trend decomposition using LOESS
- k-NN: k-nearest neighbors
INTRODUCTION
Water is crucial not only for the survival of all living organisms but also for preserving ecological balance and supporting human activities (Chintalapudi et al. 2022; Sajan & Christopher 2023). Over the past decades, the quality of water resources has steadily declined due to a wide range of anthropogenic factors, including rapid industrialization, unrestrained urbanization, and intensive agricultural practices. These activities contribute to the pollution of water bodies, posing a grave threat to both aquatic ecosystems and the well-being of human populations (Chung et al. 2021a; Arabameri et al. 2023; Guo et al. 2023; Locke 2024).
Surface waters, particularly rivers, as the primary sources of community water supply, assume a significant role in protecting the environment and human health (Barakat et al. 2016; Gil-Rodas et al. 2023). The main sources of river contamination consist of municipal and industrial wastewater, as well as agricultural drainage water containing a variety of physical, chemical, and biological pollutants (Ahmadmoazzam et al. 2021; Wu & Chen 2023). One of the effective solutions to ensure the safety of river ecosystems and the health of humans is the regular monitoring of river water quality (Singh et al. 2022). The monitoring process reveals plenty of valuable information, including water characteristics, trends and changes in components, and potential water quality issues. Subsequently, the obtained data assists decision-makers in developing pollution prevention scenarios, treatment plans, and in making informed decisions (Shi et al. 2021).
Spatial and temporal river water quality monitoring, analysis, and forecasting are long-term processes conducted by hydrologists worldwide. While the spatial assessment is less complex, the temporal part comprises multiple arduous steps, including (1) data provisioning, (2) time series decomposition, and (3) future quality state prediction (Georgescu et al. 2023; Long et al. 2023; Mijošek et al. 2023; Qian et al. 2023; Dilekoğlu et al. 2024).
In the first step (data provisioning), outlier detection is among the most critical tasks (Garces & Sbarbaro 2009). Outliers are data points that stand out from the remainder of the dataset; consequently, they can significantly affect the outcomes of time series analysis. In this regard, we conducted a comprehensive literature review of the tools and methods that hydrologists have used to detect outliers in river water quality datasets. The approaches studied for anomaly/outlier detection include the functional depth method (Di Blasi et al. 2013), isolation forest, kernel density estimation (Liu et al. 2020; Uddin et al. 2024), quartiles and box plots (Alberdi Igartua et al. 2024; Qian et al. 2024), linear prediction correction filter, multivariate nearest neighbor (Nafsin & Li 2021), support vector machine (Vellingiri et al. 2023; Kushwaha et al. 2024), Grubbs's test (Casillas-García et al. 2021), the Scikit-learn library of Python (as the tool) (Mdegela et al. 2023), nearest neighbor-high dimensional algorithm, aggregated k-nearest neighbors (k-NN), sum of distance to k-NN, local distance-based outlier factor, density-based local outlier factor, connectivity-based outlier factor, influenced outlierness, and robust kernel-based outlier factor (Talagala et al. 2019; Mokua et al. 2021).
The present research, for the first time, examines the performance of a versatile Python library, namely the Python toolbox for scalable Outlier Detection (PyOD) (PyOD 2.0.1 documentation 2024; Zhao et al. 2019), as the tool, and a proximity-based algorithm, namely k-NN (Beckmann et al. 2015; Wang et al. 2022a), as the algorithm, for rapidly detecting and eliminating outliers from a river water quality dataset. PyOD is open-source, has detailed documentation, and supports advanced models. The toolkit is comprehensive and scalable, and it provides easy-to-use and quickly executable features for researchers to isolate outliers in a set of data (PyOD 2.0.1 documentation 2024). By using this module, we seek to reduce the amount of time, effort, and expertise needed to spot the outliers in a river water quality dataset.
The second step, time series decomposition, refers to the process of breaking down time series data into its individual components, i.e., trend, seasonal, and residual. It is a well-known technique applied in time series analysis to better understand the underlying patterns and characteristics of the data. Commonly used time series decomposition methods include classical decomposition, moving averages, seasonal-trend decomposition using locally estimated scatterplot smoothing (LOESS), known as STL, the exponential smoothing state space model, and Fourier analysis (Hyndman & Athanasopoulos 2015).
Based on the literature review, STL is among the methods most frequently employed by hydrologists (Cheng et al. 2021; Deng et al. 2021; Dong et al. 2023; Huan 2023; Wu et al. 2023; Yin et al. 2023). Therefore, we sought to facilitate the use of this method by introducing a user-friendly Python module, namely Statsmodels (Seabold & Perktold 2010; Statsmodels 0.14.1 2024), to unveil the underlying patterns (trend, seasonal, and residual) of a river water quality dataset. Statsmodels is an open-source Python module that provides functions for the estimation of various statistical models, as well as for swiftly implementing statistical tests and data exploration. To the best of the authors' knowledge, the performance of this toolkit for river water quality assessment has not been investigated.
The third step, water quality forecasting, provides valuable information to support water resources management and protect public health (Yu et al. 2022). To date, various methods and techniques such as autoregressive integrated moving average, nonlinear autoregressive neural network, long short-term memory (Hien Than et al. 2021), support vector model (Liu & Lu 2014), cascade-forward network, radial basis function network (Georgescu et al. 2023), and Thomas-Fiering (Kurunç et al. 2005) have been applied to forecast the water quality of different rivers. However, to the best of our knowledge, no study has applied 42 stand-alone forecasting models as an integrated system to predict the water quality parameters of a river. In the present work, we propose an automated forecasting framework in Python, namely the Automatic Time Series forecasting model (AutoTS) (Intro – AutoTS 0.6.13 documentation 2024; Wang et al. 2022b). The module is designed specifically for the rapid implementation of accurate forecasts at scale. The novelty of this approach lies in running 42 different forecasting models, and an ensemble of them, automatically within a single framework.
Regarding the case study, Iran, with an area of approximately 1,648,000 km², is located in the southwest of Asia and lies roughly between 25°N and 40°N in latitude and between 44°E and 64°E in longitude. Karun, Dez, Karkhe, Jarahi, and Maroon are the main rivers of this country, all of which are located in Khuzestan province (Emamgholizadeh et al. 2014). The Karun River, with a length of 950 km and a catchment area of 67,000 km², is the longest river in Iran and the largest by discharge. As the only navigable river in the country, it collects runoff from extensive regions and conveys it to the Persian Gulf (Golshan et al. 2020). The river serves as a source of hydroelectric power generation, irrigation (covering an area of more than 280,000 ha), and potable water supply for several cities, and also as a critical commercial waterway. However, in recent decades, significant contamination and ecological destruction have occurred due to the discharge of industrial, agricultural, and domestic wastewater into the Karun without appropriate (standard) treatment (Noori et al. 2010).
Despite the significance of the Karun River, few studies have investigated its long-term spatiotemporal variation in water quality. Among the earlier studies, Naddafi et al. (2007) traced the changes in water quality along the Karun River by monitoring two stations, Gotvand and Khorramshahr, from 1967 to 2005. Furthermore, a statistical technique, namely factor analysis, was employed by Zarei & Pourreza Bilondi (2013) to evaluate temporal variations in the Karun water quality from 1976 to 2005 at Gotvand station. Besides, a long-term (1968–2015) evaluation of water quality parameters for the Karun River was conducted by Mahmoodabadi & Rezaei Arshad (2018), which was limited to a single station, Ahvaz. Moreover, a univariate water quality (electrical conductivity) evaluation of the Karun was carried out by Ahmadmoazzam et al. (2017) spatially and temporally from 1968 to 2014 for six different stations (Gotvand, Shushtar-Gargar, Mollasani, Arab-Asad, Ahvaz, and Darkhoveyn). Additionally, Ebadati & Hooshmandzadeh (2019), focusing exclusively on the Mollasani station, reported the water quality data of the Karun River over a period of 49 years.
Based on the literature review, there is no up-to-date long-term evaluation of the Karun River water quality, which includes both spatial and temporal studies on a comprehensive set of water quality indicators (WQIs). The current study presents an intuitive Pythonic framework for long-term spatiotemporal river water quality monitoring, analysis, and forecasting, using the Karun water quality dataset from 1985 to 2020. Accordingly, six WQIs, including total dissolved solids (TDS), electrical conductivity (EC), pH, calcium (Ca), magnesium (Mg), and sodium (Na), were taken into consideration. These indicators were measured at four different hydrometric stations along the river from upstream to downstream, i.e., Mollasani (STN. 1), Ahvaz (STN. 2), Farsiat (STN. 3), and Darkhoveyn (STN. 4).
The present study aims to streamline long-term spatiotemporal river water quality assessment and forecasting by employing user-centered Python-based methods and modules. First, we performed data provisioning and temporal analysis. Subsequently, spatial analysis was carried out by visualizing two types of graphs, (i) violin and box plots and (ii) the annual mean values of the Karun monthly dataset, to compare water quality among the four surveyed hydrometric stations. Because the temporal analysis process is much more complicated than the spatial one, the majority of this paper is devoted to the temporal water quality evaluation.
MATERIALS AND METHODS
Study area and data source
Location and elevation of the four surveyed hydrometric stations along the Karun River.
Monthly data collected over a period of 36 years (1985–2020) were used to perform a spatiotemporal evaluation of the Karun River water quality. Table 1 presents basic descriptive statistics, including the maximum, minimum, mean, standard deviation (SD), and number of observations (n) for each indicator. The complete dataset of monthly sampling data is available upon request.
Descriptive statistics (maximum, minimum, mean, SD, and number) of the water quality dataset
STN. | WQI | Max. | Min. | Mean | SD | n |
---|---|---|---|---|---|---|
Mollasani | TDS (mg/L) | 1,722 | 470 | 997.5 | 288.2 | 327 |
Mollasani | EC (μS/cm) | 2,690 | 716 | 1,533.4 | 448.8 | 324 |
Mollasani | pH (-) | 8.6 | 7.1 | 7.9 | 0.3 | 379 |
Mollasani | Ca (mg/L) | 159.7 | 53.1 | 93.4 | 24 | 339 |
Mollasani | Mg (mg/L) | 74.6 | 10.9 | 32.8 | 10.1 | 338 |
Mollasani | Na (mg/L) | 388.1 | 55.2 | 179.6 | 71.6 | 336 |
Ahvaz | TDS (mg/L) | 1,850 | 478 | 1,073.6 | 317 | 326 |
Ahvaz | EC (μS/cm) | 2,920 | 750 | 1,689.5 | 499.6 | 326 |
Ahvaz | pH (-) | 8.6 | 7.2 | 7.9 | 0.3 | 361 |
Ahvaz | Ca (mg/L) | 181 | 46.1 | 100 | 30.7 | 342 |
Ahvaz | Mg (mg/L) | 69.3 | 10.3 | 36.7 | 11.8 | 330 |
Ahvaz | Na (mg/L) | 388.1 | 69 | 199.2 | 73.2 | 332 |
Farsiat | TDS (mg/L) | 1,983 | 530 | 1,093.1 | 316.3 | 327 |
Farsiat | EC (μS/cm) | 2,770 | 820 | 1,706 | 480.7 | 326 |
Farsiat | pH (-) | 8.6 | 7.2 | 7.9 | 0.3 | 383 |
Farsiat | Ca (mg/L) | 173.3 | 52.1 | 100.9 | 29.5 | 332 |
Farsiat | Mg (mg/L) | 72.9 | 13.4 | 37.6 | 12 | 331 |
Farsiat | Na (mg/L) | 411.5 | 64.4 | 201.1 | 73.7 | 326 |
Darkhoveyn | TDS (mg/L) | 2,310 | 520 | 1,191.6 | 470.1 | 327 |
Darkhoveyn | EC (μS/cm) | 3,610 | 855 | 1,871.2 | 742.3 | 328 |
Darkhoveyn | pH (-) | 8.6 | 7.2 | 7.9 | 0.3 | 346 |
Darkhoveyn | Ca (mg/L) | 187.8 | 53.1 | 107.8 | 35 | 331 |
Darkhoveyn | Mg (mg/L) | 83.3 | 17 | 41.3 | 15.8 | 339 |
Darkhoveyn | Na (mg/L) | 540.3 | 57.5 | 235.4 | 124.4 | 328 |
Proposed approach
Outlier detection
Within the collected water quality dataset, ‘outliers’ are observations that are numerically distant from the rest of the data; hence, they have a high potential to distort the results of the time series analysis. Their identification and elimination are therefore crucial before the water quality data are evaluated. In the present study, the user-centered PyOD library (PyOD 2.0.1 documentation 2024; Zhao et al. 2019) was employed to detect and remove the outliers from the WQIs dataset. PyOD is an open-source Python library for performing scalable outlier detection on univariate and multivariate data. One of its most substantial advantages is access to a wide range of detection algorithms, such as Empirical Cumulative Distribution-based Outlier Detection (ECOD), Median Absolute Deviation (MAD), Stochastic Outlier Selection (SOS), Quasi-Monte Carlo Discrepancy (QMCD), and k-NN. The algorithm used in this research, k-NN, is a non-parametric, proximity-based machine learning (ML) method. It is best known as a supervised classifier that uses proximity to assign an individual data point to a group (Beckmann et al. 2015; Wang et al. 2022a); when used for outlier detection, it instead scores each observation by its distance to its nearest neighbors, so that isolated points receive the highest scores.
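The proximity idea behind k-NN outlier detection can be illustrated with a minimal, standard-library sketch. It is not the PyOD implementation (PyOD ships its own `KNN` detector in `pyod.models.knn`); the function names, the choice of k, the `contamination` fraction, and the sample readings below are all assumptions made for illustration.

```python
from typing import List


def knn_outlier_scores(values: List[float], k: int = 5) -> List[float]:
    """Score each value by its distance to its k-th nearest neighbour."""
    scores = []
    for i, v in enumerate(values):
        # Sorted distances from this point to every other point in the series.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(dists[k - 1])
    return scores


def flag_outliers(values: List[float], k: int = 5,
                  contamination: float = 0.1) -> List[bool]:
    """Flag the `contamination` fraction of points with the largest scores."""
    scores = knn_outlier_scores(values, k)
    n_out = max(1, int(round(contamination * len(values))))
    threshold = sorted(scores, reverse=True)[n_out - 1]
    return [s >= threshold for s in scores]


# Hypothetical TDS readings (mg/L); NOT taken from the Karun dataset.
tds = [950, 980, 1010, 990, 1005, 970, 2900, 1000, 985, 995]
flags = flag_outliers(tds, k=3, contamination=0.1)
cleaned = [v for v, is_out in zip(tds, flags) if not is_out]
```

With these toy values, only the isolated reading of 2,900 mg/L is flagged and dropped from `cleaned`. The assumed contamination fraction directly controls how many points are removed, so it should be chosen with care for real data.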
Time series decomposition
Hydrological data collected over time can display a variety of patterns, and breaking a time series down into its components helps to reveal them. A time series is commonly decomposed into three components: trend, seasonality, and residual (remainder). The ‘trend’ represents the consistent upward or downward movement of the series over an extended period. ‘Seasonality’ refers to regular changes that recur over a fixed period, while the ‘residual’ is what remains after the trend and seasonal components have been removed. In our proposed framework, we used Statsmodels (a handy Python module that provides classes and functions for the estimation of many different statistical models) (Seabold & Perktold 2010; Statsmodels 0.14.1 2024) to decompose the WQIs time series into their components by the STL decomposition technique, which uses locally fitted regression models to decompose a time series (Cleveland et al. 1990).
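To make the three components concrete, the sketch below implements the simpler classical additive decomposition (centered moving average plus per-period seasonal means) using only the standard library. It is deliberately not STL, which relies on LOESS smoothing and is exposed in Statsmodels as the `STL` class in `statsmodels.tsa.seasonal`. The function names and the synthetic series are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def centered_moving_average(y: List[float], period: int) -> List[Optional[float]]:
    """Trend estimate via a centered moving average (2 x MA for even periods)."""
    n, half = len(y), period // 2
    trend: List[Optional[float]] = [None] * n
    for t in range(half, n - half):
        window = y[t - half: t + half + 1]
        if period % 2 == 0:
            # End points get half weight so the window stays centered on t.
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period
    return trend


def additive_decompose(y: List[float], period: int) -> Tuple[list, list, list]:
    """Split y into trend, seasonal, and residual so that y = T + S + R."""
    trend = centered_moving_average(y, period)
    # Collect detrended values for each position within the cycle.
    buckets: List[List[float]] = [[] for _ in range(period)]
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            buckets[t % period].append(obs - tr)
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # force seasonal effects to sum to zero
    seasonal = [means[t % period] - offset for t in range(len(y))]
    resid = [None if tr is None else obs - tr - s
             for obs, tr, s in zip(y, trend, seasonal)]
    return trend, seasonal, resid


# Synthetic series: linear trend plus a repeating 4-step cycle (no noise).
cycle = [10.0, -5.0, -10.0, 5.0]
y = [100.0 + 2.0 * t + cycle[t % 4] for t in range(24)]
trend, seasonal, resid = additive_decompose(y, period=4)
```

On this noise-free series the moving average recovers the linear trend, the seasonal component reproduces the 4-step cycle, and the residuals are zero up to floating-point error. The trend and residual are undefined (`None`) for the first and last half-period, which is one practical limitation of the classical method that STL avoids.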
Forecasting
Forecasting the quality of water bodies entails predicting the future state of various WQIs, including physical, chemical, and biological factors, based on the available data. These predictions play a vital role in assisting decision-making processes associated with water resource management, public health, and environmental protection (Ubah et al. 2021). Multiple forecasting methods are used across various disciplines and industries, including time series analysis, regression analysis, ML, qualitative and expert judgment, ensemble methods, scenario analysis, the Delphi method, market research and surveys, simulation and modeling, and data mining (Petropoulos et al. 2022).
Spatial tracing
The proposed approach for the rapid spatiotemporal assessment and forecasting of river water quality.
RESULTS AND DISCUSSION
Detected outliers
Detected outliers in the dataset of (a) TDS, (b) EC, (c) pH, (d) Ca, (e) Mg, and (f) Na at STN. 1.
Basic descriptive statistics, mean and SD, were adopted to explore the impact of removing outliers on the dataset. The mean and SD of the TDS dataset at STN. 1 were 1,043.2 and 363.7 mg/L before and 997.5 and 288.2 mg/L after the outlier removal process, respectively. Likewise, outlier elimination for EC at STN. 1 led to 6.1 and 21.7% reductions in the mean and SD, respectively. Moreover, at STN. 1, the mean and SD changed by 7.8 and 34.7% for Ca, 6.5 and 22.7% for Mg, and 9.1 and 20.8% for Na, respectively. In contrast, outlier detection and removal had no impact on the pH dataset; accordingly, the mean and SD of pH remained at 7.9 and 0.3 at STN. 1. Similar results were obtained for the other sampling stations, i.e., the means and standard deviations of all WQIs except pH shifted after outlier removal (data not shown here).
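The relative changes quoted above follow from a simple before/after ratio. As a worked example, the snippet below applies that ratio to the TDS values reported for STN. 1; the helper name is an assumption, and the numbers are the ones given in the text.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before


# TDS at STN. 1 (mg/L), before and after outlier removal, as reported above.
mean_drop = percent_reduction(1043.2, 997.5)
sd_drop = percent_reduction(363.7, 288.2)
print(f"mean: {mean_drop:.1f}%  SD: {sd_drop:.1f}%")  # mean: 4.4%  SD: 20.8%
```

The SD falls far more than the mean, which is the expected signature of removing a small number of extreme values.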
We compared the performance of PyOD with two other research works that specifically investigated outlier detection approaches for river water quality datasets. Talagala et al. (2019) introduced an automated procedure, namely oddwater (an open-source R package), which includes eight different detection algorithms, while Di Blasi et al. (2013) applied the functional depth method. The results of both studies are significant and in accordance with expectations; however, the paths used to achieve them are either confined to a limited number of detection models (eight algorithms) or not user-centered (composed of complicated mathematical modeling). The module proposed in this study (PyOD) offers more than 50 detection algorithms, including individual models and ensembles, that can be applied depending on the type of problem. Moreover, the intuitive interface of this toolkit reduces the need for specialized mathematical and programming skills.
Decomposed time series
Time series decomposition of the dataset of TDS (a) and EC (b) at STN. 1.
Figure 4 reveals a significant upward trend in the TDS and EC datasets, suggesting serious changes in the water quality of the Karun River over the study period (1985–2020). A similar increasing trend was found for Ca, Mg, and Na (data not shown here). Conversely, the pH of the Karun showed a considerable downward trend (data not shown here), indicating that the river water was shifting toward more acidic conditions during these 36 years.
A discernible annual periodic behavior is observed in these two indicators (TDS and EC), reflecting the influence of cyclic patterns of hydrological phenomena on the river environment. The pattern in the ‘seasonal’ panels barely changes over time, implying similar behavior in the periodic fluctuations of TDS and EC throughout the 36-year records. Moreover, given that the seasonal variations keep a roughly constant magnitude over time, the additive decomposition model (rather than the multiplicative one) proved to be the appropriate model for decomposing the WQIs dataset.
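For reference, the two decomposition models contrasted above can be written as follows, where the symbols are notational assumptions: $y_t$ is the observed value at time $t$, and $T_t$, $S_t$, and $R_t$ are the trend, seasonal, and residual components.

```latex
\begin{align}
  \text{additive:}       \quad & y_t = T_t + S_t + R_t \\
  \text{multiplicative:} \quad & y_t = T_t \times S_t \times R_t
  \;\;\Longleftrightarrow\;\;
  \log y_t = \log T_t + \log S_t + \log R_t
\end{align}
```

In the additive form the seasonal swing $S_t$ has the same size regardless of the level of the series, which matches the constant-magnitude seasonal panels described above. In the multiplicative form the swing scales with the level, and for positive data it reduces to the additive case after a log transform.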
The residual component, shown in the bottom panel (Figure 4), represents what remains after subtracting the seasonal and trend components from the observed data. The same outcomes and evaluations were achieved for other WQIs within STN. 1 (data not shown here). Figures S4–S6 demonstrate the decomposition results of the WQIs associated with STN. 2, STN. 3, and STN. 4, respectively. The obtained patterns and outcomes for these three stations were found to be similar to STN. 1.
Forecasted values (from 2018 to 2020)
Considering the increasing pollution of the Karun River, reliable prediction of its water quality is critical for environmental protection, safeguarding human health, preserving ecosystems, guiding agricultural and industrial practices, and formulating effective policies and regulations. In the present study, we employed a fast and straightforward automated time series module, i.e., the AutoTS library in Python, to forecast six WQIs of the Karun River from 2018 to 2020 (the last 36 months of the record). The module was fed with the monthly dataset from the preceding eight years (2010–2017) for the automated training/testing stage. Subsequently, the performance of the module was evaluated by comparing its forecasts with the actual data of the last 3 years (2018–2020). Accordingly, 42 models from the AutoTS library were called and examined. Finally, a combination of independent models (an ensemble) demonstrated the best performance in forecasting the WQI values.
The accuracy of the forecasts was appraised by calculating the sMAPE. The smaller the sMAPE value, the better the accuracy of the forecasting model; specifically, a model can be categorized as excellent (sMAPE ≤ 10%), good (10% < sMAPE ≤ 20%), acceptable (20% < sMAPE ≤ 50%), or inaccurate (sMAPE > 50%) (Aldrees et al. 2023). Moreover, two additional error metrics, MAE and SPL, were considered to further inspect the accuracy of the model. On the scale of the target value, lower MAE and SPL indicate better predictive performance.
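The hold-out protocol described above (train on 96 months, compare forecasts against the final 36) can be sketched with the standard library alone. The two candidate models and the averaging ensemble below are simple stand-ins chosen for illustration; they are not the AutoTS model pool, and the synthetic series is not Karun data. All function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def split_holdout(series: List[float], horizon: int) -> Tuple[List[float], List[float]]:
    """Keep the last `horizon` points aside for evaluation."""
    return series[:-horizon], series[-horizon:]


def seasonal_naive(train: List[float], horizon: int, period: int = 12) -> List[float]:
    """Repeat the last observed cycle over the forecast horizon."""
    last_cycle = train[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


def last_year_mean(train: List[float], horizon: int, period: int = 12) -> List[float]:
    """Flat forecast at the mean of the last observed cycle."""
    level = sum(train[-period:]) / period
    return [level] * horizon


def simple_ensemble(forecasts: Sequence[List[float]]) -> List[float]:
    """Point-wise average of several candidate forecasts."""
    return [sum(vals) / len(vals) for vals in zip(*forecasts)]


# Synthetic monthly series covering 11 years (132 months): level plus annual cycle.
series = [1000 + 150 * math.sin(2 * math.pi * m / 12) for m in range(132)]
train, actual = split_holdout(series, horizon=36)  # 96 training, 36 hold-out months
candidates = [seasonal_naive(train, 36), last_year_mean(train, 36)]
ensemble = simple_ensemble(candidates)
```

An automated framework repeats this kind of fit-and-compare loop over a much larger set of candidate models and keeps whichever model or ensemble scores best on the validation data.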
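The sMAPE thresholds above translate directly into a small classification helper. The sketch below uses one common sMAPE definition (absolute error divided by the mean of the absolute actual and forecast values, averaged and expressed in percent); the exact variant computed by AutoTS may differ, and SPL is not sketched here. The function names and sample values are assumptions.

```python
from typing import Sequence


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Mean absolute error, in the units of the target variable."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Symmetric MAPE in percent (one common definition, range 0-200)."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Treat a 0/0 term as zero error to avoid division by zero.
        terms.append(0.0 if denom == 0 else abs(f - a) / denom)
    return 100.0 * sum(terms) / len(terms)


def accuracy_class(smape_pct: float) -> str:
    """Map an sMAPE value to the categories of Aldrees et al. (2023)."""
    if smape_pct <= 10:
        return "excellent"
    if smape_pct <= 20:
        return "good"
    if smape_pct <= 50:
        return "acceptable"
    return "inaccurate"


# Hypothetical actual and forecast values, for illustration only.
actual = [100.0, 200.0]
forecast = [110.0, 180.0]
error_mae = mae(actual, forecast)
error_smape = smape(actual, forecast)
label = accuracy_class(error_smape)
```

For these two hypothetical points the MAE is 15 and the sMAPE is just above 10%, so the forecast falls in the "good" band rather than "excellent".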
Actual versus forecasted values of (a) TDS, (b) EC, (c) pH, (d) Ca, (e) Mg, and (f) Na at STN. 1 for the years 2018, 2019, and 2020.
Figures S7–S9 depict the forecasting results for STN. 2, STN. 3, and STN. 4, respectively. According to these figures, the highest prediction accuracy was also associated with pH at the other three stations (Ahvaz, Farsiat, and Darkhoveyn), while the lowest accuracy belonged to Mg for the discussed reason. The only exception was STN. 4, where the fluctuations in the Na dataset were greater than those of the other indicators, which led to the lowest prediction accuracy. Hence, the coefficient of variation (CV, the relative spread) of each WQI dataset played the most crucial role in determining the accuracy level of the forecasting model. However, contributors such as multiple pollution sources and their variation, the complex behavior of the pollutants, and unexpected climatic and hydrological phenomena strongly influence the future state of the Karun River water quality and affect the results of the predictive model.
Compared with the studies by Nouraki et al. (2021) and Salari et al. (2021) on the prediction of the Karun water quality parameters using ML methods, the present research offers an improved forecasting framework that eliminates the need for manual data division during the training/testing process. Moreover, Emamgholizadeh et al. (2014) examined three separate forecasting models, namely multi-layer perceptron, radial basis network, and adaptive neuro-fuzzy inference systems, without combining them (no ensemble), to forecast the WQIs of the Karun, whereas the automated forecasting system proposed in this research applies an ensemble of forecasting models autonomously. Together, these two features make the AutoTS framework a rapid tool for predicting the water quality of the Karun River.
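The role of the CV can be illustrated with the means and SDs listed for Darkhoveyn (STN. 4) in Table 1. Note the assumption involved: Table 1 summarizes the whole 1985–2020 record, whereas the forecasting model was trained on 2010–2017 data only, so the values below are indicative rather than the CVs actually seen by the model.

```python
# (mean, SD) pairs for Darkhoveyn (STN. 4), copied from Table 1.
stn4 = {
    "TDS": (1191.6, 470.1),
    "EC": (1871.2, 742.3),
    "pH": (7.9, 0.3),
    "Ca": (107.8, 35.0),
    "Mg": (41.3, 15.8),
    "Na": (235.4, 124.4),
}

# Coefficient of variation in percent: CV = 100 * SD / mean.
cv = {wqi: 100.0 * sd / mean for wqi, (mean, sd) in stn4.items()}

most_variable = max(cv, key=cv.get)   # indicator with the largest relative spread
least_variable = min(cv, key=cv.get)  # indicator with the smallest relative spread
```

On these full-record statistics, Na has the largest CV (about 53%) and pH the smallest (about 4%), which agrees with the ranking of forecast accuracy reported for STN. 4.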
Spatial patterns
Violin and Box plot of six WQIs ((a) TDS, (b) EC, (c) pH, (d) Ca, (e) Mg, and (f) Na) in four hydrometric stations.
Annual mean value comparison of six WQIs ((a) TDS, (b) EC, (c) pH, (d) Ca, (e) Mg, and (f) Na) measured in four hydrometric stations.
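The quantities behind these two kinds of plots, annual means of the monthly records and the five-number summary drawn by a box plot, can be computed with the standard library as sketched below. The record layout, function names, and sample values are assumptions for illustration and do not come from the Karun dataset.

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def annual_means(records: Iterable[Tuple[int, int, float]]) -> Dict[int, float]:
    """Average monthly (year, month, value) records into one value per year."""
    by_year: Dict[int, List[float]] = defaultdict(list)
    for year, _month, value in records:
        by_year[year].append(value)
    return {year: statistics.mean(vals) for year, vals in sorted(by_year.items())}


def five_number_summary(values: List[float]) -> Tuple[float, float, float, float, float]:
    """Minimum, lower quartile, median, upper quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical monthly TDS records (mg/L) for one station.
records = [(2019, 1, 900.0), (2019, 2, 1100.0), (2020, 1, 1000.0), (2020, 2, 1200.0)]
yearly = annual_means(records)
summary = five_number_summary([1.0, 2.0, 3.0, 4.0, 5.0])
```

Computing these per station and placing the results side by side gives the upstream-to-downstream comparison that the plots visualize. A violin plot additionally shows the estimated density of the values, which is not reproduced in this sketch.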
CONCLUSIONS
The present study focused on a rapid spatiotemporal river water quality evaluation by applying user-centered Python-based modules, namely PyOD (for outlier detection), Statsmodels (for time series decomposition), and AutoTS (for water quality prediction). For the case study, we considered six riverine WQIs, including TDS, EC, pH, Ca, Mg, and Na, in the Karun, the largest river in Iran. A dataset of 36 years of sampling (1985–2020) from four hydrometric stations along the river, i.e., Mollasani, Ahvaz, Farsiat, and Darkhoveyn, was used. In the spatial analysis, violin and box plots and a comparison of the annual mean values of the WQIs were used to illustrate the water quality patterns along the Karun River.
The main conclusions of this paper are as follows:
The PyOD package provides an easy-to-use and quickly executable framework for researchers and hydrologists to spot abnormalities in a set of water quality data. In this study, we demonstrated that the k-NN algorithm of the PyOD module is a potent method to detect and remove outliers from the dataset of the Karun River. Five WQIs, namely TDS, EC, Ca, Mg, and Na, were highly sensitive to outlier removal, meaning that outlier isolation had a significant impact on the results of statistical analyses for these indicators. On the contrary, outlier removal had a negligible impact on the pH dataset, implying that pH fluctuated relatively smoothly during the 36 years.
Water quality time series decomposition using Statsmodels unveiled underlying information about the long-term trends, seasonal variations, and irregular fluctuations in the WQIs dataset. The long-term trends indicated a significant increase in TDS, EC, and Ca over the past four decades. In addition, an annual periodic behavior was discernible in all WQIs, reflecting the influence of cyclic hydrological phenomena on the river environment. The STL technique of Statsmodels is a practical and rapid method for decomposing a river water quality time series into its components. This method can be applied to any set of data, but meaningful results are only attained if a recurring temporal pattern exists in the data.
The auto-forecasting system (AutoTS), which packages 42 stand-alone forecasting models and their ensembles, was found to be promising in predicting the future water quality state of the river based on the obtained error metric values. This toolkit eliminates the need for manual training/testing in the forecasting process. The highest and lowest forecast accuracies for the Karun belonged to the WQIs with the minimum and maximum CV values, respectively, indicating the influence of the fluctuation level of the dataset on the forecasting accuracy.
The violin and box plots, as well as the annual mean value graphs, presented a clear picture of the changes in the WQIs along the river. The spatial tracing showed a nearly consistent distribution pattern of the WQI values throughout the first three stations and a significant discrepancy (more contaminated water) at Darkhoveyn. Besides, according to the violin plots, the water quality values were clustered around the median of their datasets.
FUNDING
The authors gratefully acknowledge the support and funding provided by the Shahrood University of Technology (grant number: 26631) to conduct this research.
AUTHOR CONTRIBUTIONS
A.A. conceptualized the project, developed the methodology, performed the work, rendered support in software and programming, validated and visualized the process, and wrote the original draft. S.E. and B.C. supervised the study, supported project administration, conceptualized the whole process, developed the methodology, and wrote, reviewed, and edited the article. E.Z. rendered support in data provisioning and project administration.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.