## Abstract

This article explores the forecasting capabilities of three classic linear and nonlinear autoregressive modeling techniques and proposes a new ensemble evolutionary time series approach to model and forecast daily dynamics in stream dissolved organic carbon (DOC). The model used data from the Oulankajoki River basin, a boreal catchment in Northern Finland. The models that were evolved used both accuracy and parsimony measures including autoregressive (AR), vector autoregressive (VAR), and self-exciting threshold autoregressive (SETAR). The new method, called genetic-based SETAR (GTAR), evolved through the integration of state-of-the-art genetic programming with SETAR. To develop the models, high-resolution DOC concentration and daily streamflow (as the external input for VAR) were measured at the same gauging station throughout the ice free season. The results showed that all the models characterize the DOC dynamics with an acceptable 1-day-ahead forecasting accuracy. Use of the streamflow time series as an exogenous variable did not increase the predictive accuracy of AR models. Moreover, the hybrid GTAR provided the best accuracy for the holdout testing data and proved to be a suitable approach for predicting DOC in boreal conditions.

## HIGHLIGHTS

A new ensemble model called GTAR, was introduced for DOC time series modeling and prediction.

GTAR integrates self-exciting threshold autoregressive (SETAR) and genetic programming.

GTAR produces more accurate predictions than AR, VAR, and SETAR.

### Graphical Abstract

## INTRODUCTION

Dissolved organic carbon (DOC) is an important water quality variable in aquatic systems that are controlled by a variety of in-stream and catchment processes. DOC is critically important to streams’ biotic and abiotic health as it provides sustenance to microbes at the base of aquatic food webs, controls the light penetration in water, and initiates photochemical reactions (Spencer *et al.* 2009; Moran *et al.* 2016). Additionally, DOC influences drinking water quality and purification processes, particularly in communities that rely on rivers and lakes as freshwater resources (Ledesma *et al.* 2012; Gough *et al.* 2014). However, DOC transport, concentrations, and loading patterns are being altered locally by land features and storm patterns (Mancinelli *et al.* 2018) as well as globally by climate change (Liu & Wang 2022) and land-use change (Seybold *et al.* 2019). DOC dynamics in the northern latitudes are particularly vulnerable to climatic-related changes due to strong seasonality in DOC (Tiwari *et al.* 2018). These changes are especially notable in boreal areas due to longer growing seasons and increasing precipitation, while intensive forestry and agriculture have already increased DOC loading (Asmala *et al.* 2019). Due to the significant role of DOC in the riverine systems, many studies have focused on the process understanding of DOC transport with intensive measurements and process-based models (Jones *et al.* 2014; Rawlins *et al.* 2021). However, many current approaches are time-consuming to apply to new locations, and thus also, traditional time series analyses are needed to further develop DOC prediction and forecast.

In previous studies, several linear regression methods, such as multiple linear regression (MLR) and partial least-squares regression (PLSR), were applied for the direct prediction of DOC concentration (e.g., Escalas *et al.* 2003; Marhaba *et al.* 2003; Ågren *et al.* 2010; Etheridge *et al.* 2014; Hestir *et al.* 2015). For example, Ågren *et al.* (2010) implemented linear regression to model and predict DOC concentrations in Swedish catchments during a spring flood. Comparing different easily available landscape data and base flow chemistry, the authors demonstrated that the best regression model included base flow, DOC, mean annual runoff, and wetland coverage as the dominant predictors. Etheridge *et al.* (2014) implemented PLSR, lasso, and stepwise regression techniques to find a relationship between the absorption spectra, collected by ultraviolet-visual spectrometers, and multiple water quality parameters in a brackish tidal marsh. The results showed a high covariance between measured optical properties and some material concentrations such as nitrate, total nitrogen, and DOC if independent calibration is developed for each site using the PLSR technique. Recent studies have demonstrated the partial success of linear regression techniques, particularly when the sample size and range are limited (Etheridge *et al.* 2014; Zerafati *et al.* 2022). Moreover, site-specific measurement and analysis become costly when temporal resolution boosts. To address these issues, nonlinear methods using machine learning (ML) approaches such as artificial neural networks (ANNs), random forest (RF), and support vector regression (SVR), have been implemented (Sharghi *et al.* 2018; Elkiran *et al.* 2019; Zerafati *et al.* 2022; Toming *et al.* 2020; Gul *et al.* 2021). For instance, Zerafati *et al.* (2022) demonstrated that DOC dynamics in Groves Creek, a salt marsh creek in eastern Georgia, USA, can ideally be modeled using SVR or RF paired with cost-free auxiliary predictors such as online rainfall, streamflow, and tidal stage data.

Despite the complexity of RF or SVR calibration, a main drawback is their implicit structure so that the modeler can scarcely distill insight from the model, albeit it provides accurate predictions. In this technical study, we therefore, aimed at developing prediction models for DOC that not only are precise but also may provide explicit structures. Thus, they would be simpler to be interpreted and implemented in practical applications for DOC prediction challenges. To this end, we explored the efficiency of two linear models namely autoregressive (AR) and vector autoregressive (VAR), as well as a nonlinear autoregressive model, called self-exciting threshold autoregressive (SETAR), for 1-day-ahead forecasting of DOC series in a boreal river catchment. In addition, we introduced a new hybrid ML model in which capabilities of evolutionary programming are integrated with SETAR solutions to increase the generalization abilities of a classic SETAR approach.

In boreal catchments dissolved organic matter typically comes from waterlogged areas (wetlands/peatlands) and forest areas (litter, surface soil). In our case study area, the Oulankariver basin, DOC transport is controlled by the processes in the peatlands (source). However, hydrological connectivity (= runoff) is controlling the transportation of DOC from wetlands to downstream. This study uses instream measurements for predicting DOC time series, thus spatial analysis (hydrological connectivity) is not in current paper targets.

## METHODS

### Case study area and data

*et al.*2008). The annual mean flow is 25.5 m

^{3}s

^{−1}, and minimum flows down to 3 m

^{3}s

^{−1}occur in late winter, and peak flows soon after snowmelt (in May–June, up to 249 m

^{3}s

^{−1}). The Oulankajoki River is partly located in Oulanka National Park, and its upper parts contain forestry and peatland drainage activities and rural settlements. The landscape is peatland-dominated and forest typical boreal forest with pine and spruce species.

Discharge is measured at the Oulanka Research Station location (Figure 1) using a gauging station. Discharge data were obtained from Finnish Environment Institute's open access Hertta-data archive. The DOC was measured with 15 min intervals using TriOS Opus UV spectral sensor (TriOS, Germany), which was calibrated and validated against measured DOC water samples in Oulanka research station laboratory. The purpose of the paper was to develop and test new prediction tool to predict DOC concentration in daily resolution, thus data from 5 December 2020 to 31 December 2020 were aggregated to mean daily resolution for the purposes of this study. Our approach could be applied also in higher resolutions, however, in larger river systems DOC processes are slower and thus daily resolution is justified. In smaller headwater systems, higher hourly or even 15 min resolution is reasonable.

*Q*and

_{t}*DOC*, respectively) are depicted in Figure 2, and their associated statistical features are summarized in Table 1. To account for the observed negative trend as well as the hidden cyclic oscillations of both

_{t}*Q*and

_{t}*DOC*time series, they were made stationary via differencing before the evolution of the linear AR and VAR models.

_{t}Variable . | Mean . | Minimum . | Maximum . | Std. dev. . | Skewness . |
---|---|---|---|---|---|

Discharge (m^{3} s^{−1}) | 45.73 | 5.47 | 300.0 | 61.79 | 1.35 |

DOC (mg l^{−1}) | 9.52 | 6.10 | 13.22 | 2.05 | 0.22 |

Variable . | Mean . | Minimum . | Maximum . | Std. dev. . | Skewness . |
---|---|---|---|---|---|

Discharge (m^{3} s^{−1}) | 45.73 | 5.47 | 300.0 | 61.79 | 1.35 |

DOC (mg l^{−1}) | 9.52 | 6.10 | 13.22 | 2.05 | 0.22 |

### Input selection

*Q*and

_{t}*DOC*time series (Figure 3) demonstrate high correlation in their sequential values so that the future magnitude can be projected from their previous values. Figure 3 implies that linear dependency among daily DOC values decreases slower than discharge. Thus, a univariate DOC time series model, which uses historical DOC values, could be developed for DOC prediction for the study area. In this study, the daily

_{t}*DOC*time series was modeled using two linear (AR and VAR) and two nonlinear (SETAR and genetic-based SETAR (hereafter GTAR)) approaches. The models were trained and tested using the first 70% and the last 30% of the observations, respectively. Indeed, the last 30% of data was held out as unseen values to test the models’ generalization skills. The best model for each scheme was selected based on the Akaike information criterion (AIC) that considers both accuracy and parsimony measures.

_{t}### Autoregressive models

*p*) of the autoregressive (AR) model. For example, an AR model to predict

*DOC*based on its observed values in the preceding 2 days can be expressed as:where the is the model constant, is the coefficient of the AR model, and the is the error term.

_{t}The best order for an AR model is commonly determined by assessing the partial autocorrelation sequence (Partial ACF) of the desired time series. However, Partial ACF does not meet the parsimony conditions. To cope with this problem, we considered AIC as suggested in the literature (Tür 2020).

### VAR models

*p*) of VAR model. For example, a VAR model with the lag order 4 (i.e., VAR (4)) to predict based on

*DOC*and

*Q*values can be expressed as follows:where the is the model constant, are the coefficients of the VAR model. Like AR, the optimum number of delays is selected through minimizing information criteria such as AIC and Bayesian information criterion (BIC).

### Self-exciting threshold autoregressive (SETAR)

Threshold AutoRegressive (TAR) is a popular class of autoregressive models that was introduced as a simple and parsimonious method for nonlinear time series modeling. Despite the simplicity, TAR modeling suffers from a variety of parameters/variables that must be estimated or defined by modelers. To tackle these shortcomings, Tong (2012) proposed self-exciting TAR (SETAR) as a superior type of TAR that has been applied to model a variety of nonlinear time series comprising limit cycles, chaos, harmonic distortion, jumps, and time irreversibility features (e.g., Gonzalo & Wolf 2005; Amiri 2015).

*r*) determines the number of regimes. It should be set so that each regime has at least a fraction of observations that are enough to produce reliable estimates of the associated AR model's parameter (the order

*p*). Following Gonzalo & Pitarakis (2002), to attain the best SETAR model, we used an AIC-based parameter selection criterion. Considering a time series of , a two-regime SETAR model can be expressed as follow:where = model intercepts and are the AR models’ coefficients,

*p*

_{1}and

*p*

_{2}are the orders of AR models for lower and upper regimes, and is the residuals time series. The residual value at time

*t*is the value

*d*at a lag or delay time

*t*.

### The new genetic-based SETAR model

*et al.*(2022) for evolutionary time series modeling. After the evolutionary modeling, the model output performance and generalization ability are controlled by considering statistical indicators (explained later in Section 3.5), and the best model is reported. For details on evolutionary modeling and GP engine alternatives, the reader is referred to Hrnjica & Danandeh Mehr (2018).

### Model selection

*n*is the number of measurements in the training (12 May 2020 to 22 October 2020) and the testing (23 October 2020 to 31 December 2020) periods. The SSR is the sum of squared residuals, and

*k*denotes the number of independently adjusted parameters within the model.

## RESULTS AND DISCUSSION

### Preprocessing the data

*Q*and

_{t}*DOC*time series was done to test if a unit root is present in the time series (null hypothesis). The test statistic revealed that we could not reject the null hypothesis of the test, albeit the negative trend was alleviated through the initial standardization of the data. Thus, differencing technique with the orders of one and two was applied to both series, and the results were controlled again using the same test. The results showed that the first-order differencing failed to remove the short-window trend in the discharge measurements, which is required for VAR modeling. Therefore, the second-order differencing was adopted to achieve stationary series (Figure 5). To develop nonlinear models, the use of stationary time series is not a must; however, we normalized the

_{t}*DOC*time series using

_{t}*min–max*normalization to alleviate the sharp difference in their standard deviation and abolish the dimensional inconsistency of the hybrid model.

### AR, VAR, and SETAR models

*DOC*series with AIC of the AR model in the training period (12 May 2020 to 22 October 2020; 164 days) and RMSE in the testing period (23 October 2020 to 31 December 2020; 70 days) were utilized to determine the order of the best AR model. The figure shows that a significant negative correlation exists within the first five lags of

_{t}*DOC*. Considering both model parsimony and accuracy, we selected an order of four, and the associated model was expressed as Equation (9). As we used the second-order of differencing to meet the stationary condition of AR-based modeling, the evolved AR (4) model could be considered the same as an ARIMA (4,2,0) model developed by row DOC time series. Equation (10) expresses the AR model attained for DOC prediction in the study area.

*TSA*packages from the

*R*library were used. As for the AR and VAR models, the least-squares method was used to optimize the models’ parameters in which modeling residuals are minimized using merely the training dataset. As previously discussed, the optimum threshold and the order of the associated AR models were determined based on AIC. As a result, the best orders

*p*

_{1}= 3,

*p*

_{2}= 1 and

*d*= 2, and

*r*= 6.816 were obtained. Thus, the optimum SETAR (2, 3, 1) was attained for the DOC time series as expressed below:

### GTAR modeling

### Comparison of the models’ performance

Method . | Model . | Training phase . | Testing phase . | ||
---|---|---|---|---|---|

RMSE . | NSE . | RMSE . | NSE . | ||

Linear | AR (4) | 0.349 | 0.972 | 0.268 | 0.958 |

VAR (1) | 0.312 | 0.978 | 0.289 | 0.951 | |

Nonlinear | SETAR | 0.328 | 0.976 | 0.290 | 0.951 |

GTAR | 0.311 | 0.978 | 0.212 | 0.974 |

Method . | Model . | Training phase . | Testing phase . | ||
---|---|---|---|---|---|

RMSE . | NSE . | RMSE . | NSE . | ||

Linear | AR (4) | 0.349 | 0.972 | 0.268 | 0.958 |

VAR (1) | 0.312 | 0.978 | 0.289 | 0.951 | |

Nonlinear | SETAR | 0.328 | 0.976 | 0.290 | 0.951 |

GTAR | 0.311 | 0.978 | 0.212 | 0.974 |

Considering the nonlinear models, no change was observed in the models’ accuracy during the training period. While the SETAR performance is close to the VAR during the testing period, the GTAR provides a significant increase in the models’ accuracy during the testing period. Regarding the model's performance in the testing phase, the table proves the superiority of GTAR over its counterparts.

*R*

^{2}) fitted on each data. Figure 10 shows the AR, VAR, SETAR, and GTAR model scatter plots with correlation of 0.965, 0.957, 0.956, and 0.981, respectively. Overall, the GTAR model is 2.44, 2.37, and 1.57% better than SETAR, VAR, and AR models in terms of the correlation coefficient, respectively. The Taylor diagram provides a concise statistical summary of how well the models’ prediction and the observed data match each other in terms of their correlation, their RMSE and the ratio of their variances. The figure shows that the GTAR with acorrelation of 0.98 is in a closer position to the observation than SETAR, VAR, and AR models. Like Figure 10(b) and 10(c), the VAR and SETAR models are in the same location indicating identical correlation and standard deviation. Figure 11(b) showed that the frequency of prediction error generated by the GTAR has the highest likelihood of being zero. Besides, the uncertainties from the GTAR are highly concentrated around the median showing wide in the middle and narrow onboth sides.

## CONCLUSION

DOC concentration cannot always be directly measured *in situ* (expensive sensors), and prohibitively expensive laboratory-based analysis is required to capture high-resolution dynamics of DOC variation in different aquatic systems. Thus, the data-driven models possessing high generalization ability are of paramount importance and required for ecological modeling and water quality management. Previous DOC model development endeavors did not fully consider linear-nonlinear fusion models, so this paper carried out research on the comparison and combination of TAR models with the state-of-the-art GP in the Oulankajoki River basin. The experimental results showed that the proposed fusion model, called GTAR, had the best prediction accuracy and the strongest correlation with the observed DOC compared with the AR, VAR, and SETAR. Our modeling study shows clearly that autocorrelation statistical modeling can be efficiently used for 1-day-ahead forecasting of DOC concentration. This allows methodology to be improved for example drinking water purification processes or predicting ecological conditions in the river systems. Our approach should however further be tested with longer time series, variable seasons, and also in a different kind for river systems.

## ACKNOWLEDGEMENTS

This research was funded by HYDRO-RDI network, HYDRO-RI-platform and Green-Digi-Basin, Academy of Finland (337523, 346163, 347704) and Freshwater Competence Centre (FWCC).

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Boreal Environment Research*

*Theoretical and Applied Climatology*

**141**(3–4), 1151–1163. doi:10.1007/s00704-020-03272-7

*Environmental Science and Pollution Research*1–23. doi: 10.1007/s11356-022-19465-8.