Abstract
This article explores the forecasting capabilities of three classic linear and nonlinear autoregressive modeling techniques and proposes a new ensemble evolutionary time series approach to model and forecast daily dynamics in stream dissolved organic carbon (DOC). The model used data from the Oulankajoki River basin, a boreal catchment in Northern Finland. The models that were evolved used both accuracy and parsimony measures including autoregressive (AR), vector autoregressive (VAR), and self-exciting threshold autoregressive (SETAR). The new method, called genetic-based SETAR (GTAR), evolved through the integration of state-of-the-art genetic programming with SETAR. To develop the models, high-resolution DOC concentration and daily streamflow (as the external input for VAR) were measured at the same gauging station throughout the ice free season. The results showed that all the models characterize the DOC dynamics with an acceptable 1-day-ahead forecasting accuracy. Use of the streamflow time series as an exogenous variable did not increase the predictive accuracy of AR models. Moreover, the hybrid GTAR provided the best accuracy for the holdout testing data and proved to be a suitable approach for predicting DOC in boreal conditions.
HIGHLIGHTS
A new ensemble model called GTAR, was introduced for DOC time series modeling and prediction.
GTAR integrates self-exciting threshold autoregressive (SETAR) and genetic programming.
GTAR produces more accurate predictions than AR, VAR, and SETAR.
Graphical Abstract
INTRODUCTION
Dissolved organic carbon (DOC) is an important water quality variable in aquatic systems that are controlled by a variety of in-stream and catchment processes. DOC is critically important to streams’ biotic and abiotic health as it provides sustenance to microbes at the base of aquatic food webs, controls the light penetration in water, and initiates photochemical reactions (Spencer et al. 2009; Moran et al. 2016). Additionally, DOC influences drinking water quality and purification processes, particularly in communities that rely on rivers and lakes as freshwater resources (Ledesma et al. 2012; Gough et al. 2014). However, DOC transport, concentrations, and loading patterns are being altered locally by land features and storm patterns (Mancinelli et al. 2018) as well as globally by climate change (Liu & Wang 2022) and land-use change (Seybold et al. 2019). DOC dynamics in the northern latitudes are particularly vulnerable to climatic-related changes due to strong seasonality in DOC (Tiwari et al. 2018). These changes are especially notable in boreal areas due to longer growing seasons and increasing precipitation, while intensive forestry and agriculture have already increased DOC loading (Asmala et al. 2019). Due to the significant role of DOC in the riverine systems, many studies have focused on the process understanding of DOC transport with intensive measurements and process-based models (Jones et al. 2014; Rawlins et al. 2021). However, many current approaches are time-consuming to apply to new locations, and thus also, traditional time series analyses are needed to further develop DOC prediction and forecast.
In previous studies, several linear regression methods, such as multiple linear regression (MLR) and partial least-squares regression (PLSR), were applied for the direct prediction of DOC concentration (e.g., Escalas et al. 2003; Marhaba et al. 2003; Ågren et al. 2010; Etheridge et al. 2014; Hestir et al. 2015). For example, Ågren et al. (2010) implemented linear regression to model and predict DOC concentrations in Swedish catchments during a spring flood. Comparing different easily available landscape data and base flow chemistry, the authors demonstrated that the best regression model included base flow, DOC, mean annual runoff, and wetland coverage as the dominant predictors. Etheridge et al. (2014) implemented PLSR, lasso, and stepwise regression techniques to find a relationship between the absorption spectra, collected by ultraviolet-visual spectrometers, and multiple water quality parameters in a brackish tidal marsh. The results showed a high covariance between measured optical properties and some material concentrations such as nitrate, total nitrogen, and DOC if independent calibration is developed for each site using the PLSR technique. Recent studies have demonstrated the partial success of linear regression techniques, particularly when the sample size and range are limited (Etheridge et al. 2014; Zerafati et al. 2022). Moreover, site-specific measurement and analysis become costly when temporal resolution boosts. To address these issues, nonlinear methods using machine learning (ML) approaches such as artificial neural networks (ANNs), random forest (RF), and support vector regression (SVR), have been implemented (Sharghi et al. 2018; Elkiran et al. 2019; Zerafati et al. 2022; Toming et al. 2020; Gul et al. 2021). For instance, Zerafati et al. (2022) demonstrated that DOC dynamics in Groves Creek, a salt marsh creek in eastern Georgia, USA, can ideally be modeled using SVR or RF paired with cost-free auxiliary predictors such as online rainfall, streamflow, and tidal stage data.
Despite the complexity of RF or SVR calibration, a main drawback is their implicit structure so that the modeler can scarcely distill insight from the model, albeit it provides accurate predictions. In this technical study, we therefore, aimed at developing prediction models for DOC that not only are precise but also may provide explicit structures. Thus, they would be simpler to be interpreted and implemented in practical applications for DOC prediction challenges. To this end, we explored the efficiency of two linear models namely autoregressive (AR) and vector autoregressive (VAR), as well as a nonlinear autoregressive model, called self-exciting threshold autoregressive (SETAR), for 1-day-ahead forecasting of DOC series in a boreal river catchment. In addition, we introduced a new hybrid ML model in which capabilities of evolutionary programming are integrated with SETAR solutions to increase the generalization abilities of a classic SETAR approach.
In boreal catchments dissolved organic matter typically comes from waterlogged areas (wetlands/peatlands) and forest areas (litter, surface soil). In our case study area, the Oulankariver basin, DOC transport is controlled by the processes in the peatlands (source). However, hydrological connectivity (= runoff) is controlling the transportation of DOC from wetlands to downstream. This study uses instream measurements for predicting DOC time series, thus spatial analysis (hydrological connectivity) is not in current paper targets.
METHODS
Case study area and data
Discharge is measured at the Oulanka Research Station location (Figure 1) using a gauging station. Discharge data were obtained from Finnish Environment Institute's open access Hertta-data archive. The DOC was measured with 15 min intervals using TriOS Opus UV spectral sensor (TriOS, Germany), which was calibrated and validated against measured DOC water samples in Oulanka research station laboratory. The purpose of the paper was to develop and test new prediction tool to predict DOC concentration in daily resolution, thus data from 5 December 2020 to 31 December 2020 were aggregated to mean daily resolution for the purposes of this study. Our approach could be applied also in higher resolutions, however, in larger river systems DOC processes are slower and thus daily resolution is justified. In smaller headwater systems, higher hourly or even 15 min resolution is reasonable.
Variable . | Mean . | Minimum . | Maximum . | Std. dev. . | Skewness . |
---|---|---|---|---|---|
Discharge (m3 s−1) | 45.73 | 5.47 | 300.0 | 61.79 | 1.35 |
DOC (mg l−1) | 9.52 | 6.10 | 13.22 | 2.05 | 0.22 |
Variable . | Mean . | Minimum . | Maximum . | Std. dev. . | Skewness . |
---|---|---|---|---|---|
Discharge (m3 s−1) | 45.73 | 5.47 | 300.0 | 61.79 | 1.35 |
DOC (mg l−1) | 9.52 | 6.10 | 13.22 | 2.05 | 0.22 |
Input selection
Autoregressive models
The best order for an AR model is commonly determined by assessing the partial autocorrelation sequence (Partial ACF) of the desired time series. However, Partial ACF does not meet the parsimony conditions. To cope with this problem, we considered AIC as suggested in the literature (Tür 2020).
VAR models
Self-exciting threshold autoregressive (SETAR)
Threshold AutoRegressive (TAR) is a popular class of autoregressive models that was introduced as a simple and parsimonious method for nonlinear time series modeling. Despite the simplicity, TAR modeling suffers from a variety of parameters/variables that must be estimated or defined by modelers. To tackle these shortcomings, Tong (2012) proposed self-exciting TAR (SETAR) as a superior type of TAR that has been applied to model a variety of nonlinear time series comprising limit cycles, chaos, harmonic distortion, jumps, and time irreversibility features (e.g., Gonzalo & Wolf 2005; Amiri 2015).
The new genetic-based SETAR model
Model selection
RESULTS AND DISCUSSION
Preprocessing the data
AR, VAR, and SETAR models
GTAR modeling
Comparison of the models’ performance
Method . | Model . | Training phase . | Testing phase . | ||
---|---|---|---|---|---|
RMSE . | NSE . | RMSE . | NSE . | ||
Linear | AR (4) | 0.349 | 0.972 | 0.268 | 0.958 |
VAR (1) | 0.312 | 0.978 | 0.289 | 0.951 | |
Nonlinear | SETAR | 0.328 | 0.976 | 0.290 | 0.951 |
GTAR | 0.311 | 0.978 | 0.212 | 0.974 |
Method . | Model . | Training phase . | Testing phase . | ||
---|---|---|---|---|---|
RMSE . | NSE . | RMSE . | NSE . | ||
Linear | AR (4) | 0.349 | 0.972 | 0.268 | 0.958 |
VAR (1) | 0.312 | 0.978 | 0.289 | 0.951 | |
Nonlinear | SETAR | 0.328 | 0.976 | 0.290 | 0.951 |
GTAR | 0.311 | 0.978 | 0.212 | 0.974 |
Considering the nonlinear models, no change was observed in the models’ accuracy during the training period. While the SETAR performance is close to the VAR during the testing period, the GTAR provides a significant increase in the models’ accuracy during the testing period. Regarding the model's performance in the testing phase, the table proves the superiority of GTAR over its counterparts.
CONCLUSION
DOC concentration cannot always be directly measured in situ (expensive sensors), and prohibitively expensive laboratory-based analysis is required to capture high-resolution dynamics of DOC variation in different aquatic systems. Thus, the data-driven models possessing high generalization ability are of paramount importance and required for ecological modeling and water quality management. Previous DOC model development endeavors did not fully consider linear-nonlinear fusion models, so this paper carried out research on the comparison and combination of TAR models with the state-of-the-art GP in the Oulankajoki River basin. The experimental results showed that the proposed fusion model, called GTAR, had the best prediction accuracy and the strongest correlation with the observed DOC compared with the AR, VAR, and SETAR. Our modeling study shows clearly that autocorrelation statistical modeling can be efficiently used for 1-day-ahead forecasting of DOC concentration. This allows methodology to be improved for example drinking water purification processes or predicting ecological conditions in the river systems. Our approach should however further be tested with longer time series, variable seasons, and also in a different kind for river systems.
ACKNOWLEDGEMENTS
This research was funded by HYDRO-RDI network, HYDRO-RI-platform and Green-Digi-Basin, Academy of Finland (337523, 346163, 347704) and Freshwater Competence Centre (FWCC).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.