A regression model integrating data pre-processing and transformation, input selection techniques and a data-driven statistical model facilitated accurate 7 day ahead time series forecasting of selected water quality parameters. A core feature of the modelling approach is a novel recursive input–output algorithm. The model development procedure described herein was applied to the case of a 7 day ahead dissolved oxygen (DO) concentration forecast for the upper hypolimnion of Advancetown Lake, Queensland, Australia. The DO was predicted with an *R ^{2}* > 0.8 and a normalised root mean squared error of 14.9% on a validation data set by using 10 inputs related to water temperature or pH. A key feature of the model is that it can handle nonlinear correlations, which was essential for this environmental forecasting problem. The pre-processing of the data revealed some relevant inputs that had only 6 days' lag; as a consequence, those predictors were in turn forecasted 1 day ahead using the same procedure. In this way, the targeted prediction horizon (i.e. 7 days) was preserved. The implemented approach can be applied to a wide range of time-series forecasting problems in the complex hydro-environment research area. The reliable DO forecasting tool can be used by reservoir operators to achieve more proactive and reliable water treatment management.

## INTRODUCTION

### Dissolved oxygen cycle in drinking water reservoirs

Oxygen is an essential element for most forms of life; specifically, the oxygen dissolved in a body of water (dissolved oxygen (DO)) influences the majority of the biogeochemical processes occurring in it. The solubility of oxygen depends partly on the water temperature, with higher solubility in colder waters. Typically, subtropical lakes and reservoirs such as Advancetown Lake, which was the case study location for this study, are thermally stratified for most of the year, forming a top layer of warmer, less dense waters called the epilimnion, and a colder, denser bottom called the hypolimnion. The transition layer, where the highest temperature gradient is recorded, is called the metalimnion.

Oxygen production occurs in the epilimnion, where solar radiation enables photosynthesis. Oxygen consumption, on the other hand, is much higher in the hypolimnion, where the DO level is markedly lower than in the epilimnion despite the higher solubility of its colder waters (Tundisi & Matsumura 2011). In the hypolimnion, light penetration is limited and photosynthesis is therefore limited too, since algae can hardly develop there, even though they can move between layers using internal buoyancy. This implies a lower pH (because of high CO_{2} and composite forms, not dissociated because of the absence of photosynthesis) and no production of oxygen. Moreover, DO is completely consumed (in the case of productive lakes) by the microbial oxidation of organic matter (Bertoni 2011), with a rate of depletion depending on the volume of the reservoir and the amount of organic matter available for metabolism (Macdonald 1995).

As a consequence, during the stratification season, hypolimnetic waters usually record a gradual depletion of available DO; in eutrophic lakes, this leads to anoxia, with important consequences including additional challenges to water treatment. Previous research reports that DO plays an important role in releases from bottom sediments, which occur during stratification periods when the DO drops below critical levels; for example, Chiswell & Huang (2003) reported a remarkable release of manganese from the bottom sediments for DO levels below 1.5 mg/L. Other studies also found similar relationships between increases in manganese and DO depletion at the sediment–water interface (Delfino & Lee 1971). During winter, when epilimnetic waters cool down and sink and a full lake circulation occurs, anoxic, nutrient-rich waters can reach the top layer, leading to high nutrient loads throughout the whole water column. Since nutrients must be removed before they reach consumers' taps, it is essential for water treatment organisations to understand the storage's dynamic DO cycle and its implications for other element cycles. The purpose of this research was to predict DO concentrations 7 days ahead in the upper hypolimnion of Advancetown Lake, which is located in south-east Queensland, Australia.

### Review of DO prediction models

There are a number of previous studies that seek to predict DO concentrations in lakes and reservoirs. Akkoyunlu *et al.* (2011) applied artificial neural networks (ANN) to estimate DO concentrations in a Turkish lake with respect to depth rather than time. Jayaweera & Asaeda (1993) built a mathematical model for DO in lakes, but the model cannot predict future concentrations unless all future input values are known. Moreover, Rocha *et al.* (2009) applied multiple regression models to predict DO and chlorophyll-*a* in 25 lakes of the Upper Paraná River floodplain. Their study found that pH, nitrate and lake area all had relevant positive correlations with DO values, while water temperature and electrical conductivity had negative correlations. Ranković *et al.* (2012) attempted to predict DO concentrations in reservoirs using adaptive network-based fuzzy inference systems, with the optimal combination of model inputs selected based on final model performance. El-Shaarawi (1984) applied regression models to predict the DO depletion rate and probability of anoxia in the hypolimnion of the Central Basin of Lake Erie, noting that the main variables affecting these phenomena are water level and total phosphorus. Since the model was applied only during stratification seasons, the importance of water temperature (particularly the temperature differential between the top and bottom of the lake, which is a good detector of lake circulation) was only partially recognised. However, later studies on the same lake (Patterson *et al.* 1985) coupled a vertical mixing model with a DO budget model, thus recognising the importance of turbulent mixing in the distribution of DO throughout the water column.

All these models aimed to understand the relative importance of the possible inputs, but they did not aim to forecast future concentrations. Only a few research studies have tried to achieve such a goal: Xu & Liu (2013) predicted DO concentrations in a freshwater pond 1 hour ahead (*t* = 1) with the use of a wavelet ANN fed with the input at time *t* = 0. Another example is given by Coopersmith *et al.* (2011), who forecasted DO and probability of hypoxia in multiple points of a reservoir 1 day ahead. However, both of these studies had short-term prediction horizons, while the aim of this project is to enable a 7 days ahead DO forecast.

### Review of hydrological forecasting models

To the authors’ knowledge, there has been no attempt to build a DO forecasting model with prediction horizons as long as 7 days. However, in the hydro-environmental field, medium- to long-term forecasting models have been created to predict a number of parameters. To achieve that, in recent decades multi-step-ahead (MS) techniques have been applied; these MS techniques can predict time series values several time steps into the future, and can be divided into direct (where a model predicts directly *n* steps ahead with *n* the prediction horizon) or recursive (where the output of a one-step-ahead model is given as an input to the same model to predict two steps ahead and so on until *n* is reached) variations. Direct models usually perform better (e.g., Ji *et al.* 2005), since in recursive models the performance decreases with an increase in the prediction horizon due to the accumulation of error.
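As a toy illustration of the mechanics of the two MS strategies (a hypothetical Python example with a contrived one-step model; none of this code comes from the cited studies):

```python
def recursive_forecast(one_step_model, last_value, n):
    """Recursive MS: reuse a one-step model n times, feeding each
    prediction back in as the next input (errors accumulate)."""
    y = last_value
    for _ in range(n):
        y = one_step_model(y)
    return y

def direct_forecast(n_step_model, last_value):
    """Direct MS: one dedicated model maps the input straight to t+n."""
    return n_step_model(last_value)

# contrived system: y(t+1) = 0.9*y(t), so the exact 7-step-ahead model
# is y(t+7) = 0.9**7 * y(t); with perfect models both strategies agree
rec = recursive_forecast(lambda y: 0.9 * y, 10.0, n=7)
dire = direct_forecast(lambda y: 0.9**7 * y, 10.0)
```

With imperfect one-step models, each recursive pass compounds the previous pass's error, which is exactly the degradation noted above; the direct model incurs its estimation error only once.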

An application in the hydrologic field is given by Cheng *et al.* (2008), who predicted 12 steps ahead the monthly discharge from a Chinese hydropower plant using an improved, ANN-based MS model. However, for both the Ji *et al.* (2005) and Cheng *et al.* (2008) models only a single dependent variable was used. A second example is provided by Zaldívar *et al.* (2000), who forecasted high waters at the Venice lagoon using both direct and recursive nonlinear neural networks; it was underlined how, for recursive univariate models, the error propagates and performance dramatically decreases, every time the forecasted output is given as a new input. Lekkas *et al.* (2001) tried to overcome the problem by using real-time error updating techniques. On the other hand, however, Bertone *et al.* (2014) recently developed a data-driven model to predict manganese concentrations in Advancetown Lake 7 days ahead, also using predicted inputs (i.e., water and air temperatures). Limiting the number of dependent variables in the model, especially those based on less reliable future forecasts (e.g., rain or wind), provided better performance than the more complex modelling approaches, such as physical, process-based models (Helfer *et al.* 2011; Bertone *et al.* 2014).

### The case for data-driven models in forecasting applications

Although more complex models (often process-based models) are accurate for real-time modelling, since they include several inputs and processes, they usually perform poorly for medium-term forecasts (such as 7 days), because all the input variables must also be forecasted, adding a degree of uncertainty even before running the model itself. Moreover, several studies have successfully applied data-driven models to estimate or forecast hydrological variables. Such data-driven models often incorporate pre-processing algorithms or optimisation techniques in order to enhance their reliability and performance. Among these models, neural networks are widely used, for instance for algal bloom forecasting (Recknagel *et al.* 2002), even though for the same problem other models such as regression trees can be successfully applied (Jung *et al.* 2010). Regarding model optimisation, Abrahart *et al.* (1999) used an iterative learning approach to remove unimportant components, as well as optimisation strategies to make the neural network more efficient to run. Similarly, quicker simulations can be obtained by removing superfluous or redundant points in the input time series, with the use of subtractive clustering methods or genetic algorithms (Sannasiraj *et al.* 2004; Doan *et al.* 2005). Other data pre-processing techniques include data transformation (Bowden *et al.* 2003; Joorabchi & Zhang 2007), which is particularly important for neural network models, and the use of singular spectrum analysis (Sivapragasam *et al.* 2001) in the case of input time series containing several repetitive values (such as rain, which is 0 for several time steps of a time series).
Giustolisi & Savic (2006, 2009) recently proposed a new data-driven technique that is based on evolutionary polynomial regression (EPR), where genetic algorithms are used for estimating the best combination of parameter inputs (i.e., which inputs to include, and how to combine them together); it is also possible to use nonlinear transformations of the inputs, but not a combination of different transformations at the same time.

However, in cases where a sufficiently robust model needs to utilise a number of forecasted inputs (calculated by associated sub-models) as presented by Bertone *et al.* (2014), there are presently limited reported studies focussing on the hydrological field. Han *et al.* (2007) worked on a flood-forecasting model and noted how the inclusion of unlagged rain input data would have substantially increased the prediction accuracy, but no attempt to predict rainfall was completed by the authors. In other studies, Tsanis *et al.* (2008) and Joorabchi *et al.* (2009) attempted to forecast the groundwater level by including rainfall forecasts linked to historical seasonal averages. While this approach has some merit, since seasonal averages may be adequate for slower processes such as groundwater replenishment, such an approach would be inadequate where daily or weekly forecasts are needed since there is high rainfall variability within each season.

### Applying a novel data-driven modelling approach for forecasting DO in reservoirs

The use of modelled, forecasted inputs proved to be a promising addition to the current model as, unlike the traditional MS recursive techniques, error propagation is kept limited for the following two main reasons:

1. The model is not univariate but multivariate; it therefore does not rely only on the previous output forecast, but mainly on other inputs, which may or may not be forecasted. Moreover, it is not the output that is forecasted for the following time step; only those inputs are predicted for which a lower lag would drastically increase the accuracy.

2. There is no summation of *n* models' errors, where *n* is the number of time steps ahead required for prediction. There is only one direct model forecasting the inputs and one direct final model forecasting the required output; hence only two errors need to be summed, thereby limiting the error propagation.

For the present study, since DO concentration in the hypolimnion has been shown to be affected by only a few main parameters and decreases approximately linearly throughout the stratification (Dobson & Gilbertson 1971; Charlton 1979; Anderson *et al.* 1984), a data-driven model was deemed appropriate for this forecasting problem. A 7 days ahead statistical model was applied, extending the methodology of Bertone *et al.* (2014). This data-driven approach includes algorithms for data pre-processing, a variable input selection process, and variable input forecasts in those cases where the near real-time inputs do not yield a satisfactory performance. The inclusion of secondary input-forecasting sub-models, besides being novel in this particular formulation, follows the recommendation of Solomatine & Ostfeld (2008), who emphasised the importance of hybrid models and in particular the use of sub-models to better identify sub-processes. An advantage of the proposed forecasting approach is that it is completely automated, thus not requiring human intervention (e.g., visual inspection of correlations as in previous studies such as Lees (2000)). Moreover, unlike EPR, different linear and nonlinear input transformations are considered during the same model run, thus allowing a more comprehensive investigation of candidate inputs to be conducted in a relatively short amount of time.

## METHODS

### Research domain

The location of this study is Advancetown Lake (153.28 °E, 28.06 °S), also known as Hinze Dam (Figure 1). It is located in South East Queensland (SEQ), Australia. Its current capacity is 310,730 ML, after a third upgrade of the dam was completed in 2011, raising the dam wall a further 15 m. This reservoir services the City of Gold Coast, which has a population greater than 500,000. Moreover, it is connected to the SEQ Water Grid, a water supply scheme covering all of SEQ, designed to respond to drought seasons by redirecting water from regions with an oversupply to regions in need (Spiller 2008).

The lake's average depth at full capacity is 32 m, while its surface area is approximately 15 km^{2}. The catchment area, mostly within national parks, covers 207 km^{2} of terrain; the two main inflows are the Nerang River and the Little Nerang Creek, the latter also being the outflow from another dam, Little Nerang Dam, the oldest reservoir servicing the Gold Coast region. An intake tower, located next to the Hinze Dam wall, draws water from the most suitable depth based on reservoir water quality at the time of extraction and redirects it to the closest treatment plant in Molendinar, approximately 10 km north-east of the reservoir. The reservoir and treatment operations are managed by Seqwater, a Queensland Government-owned authority responsible for bulk water supply for the SEQ region.

### Data collection

Currently, water quality is primarily monitored through laboratory analysis of water samples collected manually on a weekly basis. However, in 2008 a vertical profiling system (VPS) was installed near the intake tower (point A in Figure 1). This VPS consists of a YSI Sonde suspended by a cable from a floating buoy; the sonde is automatically winched up and down through the water column and collects water quality parameters every metre, including water temperature, DO, pH, specific conductivity, redox potential and turbidity. Collected data for the whole profile are transmitted via telemetry every 2 hours back to a central data repository for collation, analysis and display. Through an effective collaboration with Seqwater, the data from the VPS and from the manual water samplings were made available for this research project. Data from the manual water samplings, which are more comprehensive, exist for a longer period of time (i.e., 2000–2013), but are limited to a depth of 24 m, collected at 3 m intervals. VPS data typically cover the whole water column, but do not include the complete range of parameters relevant to this study. Hence, for the sake of consistency, the data used in this study come from the period 2008–2013 and from depths of 0–24 m. In addition, weather data for the same period were collected from the Australian Bureau of Meteorology and from a small weather station installed on the VPS buoy. Finally, the river inflow data were obtained from the Department of Energy and Resources Management of the Queensland Government.

### Model development

To predict DO concentrations in the upper hypolimnion of Advancetown Lake, a number of techniques for data pre-processing, variable input selection, data post-processing, and a multiple regression model, were combined to create the final forecasting model named ALMO (pre-processing ALgorithms + regression Models). Although ALMO was specifically applied for DO prediction in Advancetown Lake in this present paper, it can be applied for the prediction of other independent variables at any predetermined number of time steps ahead. The logical flow chart of ALMO is shown in Figure 2.

The DO forecast model's input parameters are the relevant data, such as pH or water temperature, which can be defined as *u _{i}*(*t _{j−l}*), …, *u _{i}*(*t _{j}*), where *i* is the index for the various parameters, *t _{j}* represents the current time, and *l* is the maximum lag to be considered in the inputs, according to the knowledge the researcher has of the system and the computational time that can be accepted. The higher the chosen *l*, the bigger the input matrix and the longer the simulation; however, for some particular prediction problems, where a high correlation between the target variable and a certain input variable is known to exist at a high lag, it can be useful to select a high *l*. To further reduce computational time, a preliminary analysis of the available data is recommended in order to exclude those variables that are not correlated with the target vector.
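For illustration, the lagged candidate matrix for a single parameter can be assembled as follows (a minimal Python sketch; the function name and layout are assumptions, not taken from ALMO):

```python
import numpy as np

def build_lagged_inputs(u, max_lag):
    """Stack u(t), u(t-1), ..., u(t-max_lag) as columns, aligned in time."""
    u = np.asarray(u, dtype=float)
    rows = len(u) - max_lag           # usable time steps after lagging
    return np.column_stack([u[max_lag - k : max_lag - k + rows]
                            for k in range(max_lag + 1)])

# toy series whose value equals its time index, with maximum lag l = 3
U = build_lagged_inputs([0, 1, 2, 3, 4, 5], max_lag=3)
# each row holds [u(t), u(t-1), u(t-2), u(t-3)] for t = 3, 4, 5
```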

*n* is the number of time steps ahead of the forecast (i.e., the prediction horizon), and the desired statistical performance of the forecasting model is fixed by the user. After this, the inputs are pre-processed, including data normalisation and division into a calibration and a validation set. Normalising the data within a certain range (e.g., between 0 and 1) provides the opportunity to assign the same importance to each input, rather than overestimating the importance of inputs with a higher magnitude and underestimating inputs with a lower magnitude. The elements of the input variable vectors can be normalised as:

*u _{i,norm}*(*t*) = (*u _{i}*(*t*) − *u _{i,min}*) / (*u _{i,max}* − *u _{i,min}*)

where *t* represents any time step of the input time series and *u _{i,max}* and *u _{i,min}* are, respectively, the maximum and minimum values of the time series considered.
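The min–max normalisation above can be sketched in Python (illustrative only):

```python
import numpy as np

def min_max_normalise(u):
    """Scale a time series into [0, 1] so all inputs carry equal weight."""
    u = np.asarray(u, dtype=float)
    return (u - u.min()) / (u.max() - u.min())

x = np.array([2.0, 4.0, 6.0, 10.0])
x_norm = min_max_normalise(x)  # -> [0.0, 0.25, 0.5, 1.0]
```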

To account for nonlinear correlations, a number of nonlinear transformations are applied to the inputs, and the resulting transformed time series are added to the input matrix. Although a number of default nonlinear transformations (e.g., hyperbolic, logarithmic, exponential, etc.) are predetermined, the user can modify them or add more nonlinear transformations according to the results of the data analysis.
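A bank of candidate transformations might look like the following sketch (the exact set of default transformations is an assumption for demonstration; the cube and inverse-power forms are the ones that appear in the final model of Table 2):

```python
import numpy as np

# candidate nonlinear transformations applied to each normalised input,
# so that a linear model can still capture nonlinear correlations
TRANSFORMS = {
    "identity":      lambda u: u,
    "cube":          lambda u: u**3,
    "inverse_power": lambda u: (1.0 + u)**-5,  # form seen in Table 2
    "log":           lambda u: np.log1p(u),
    "exp":           lambda u: np.expm1(u),
}

def expand_inputs(u):
    """Return one candidate series per transformation of input u."""
    u = np.asarray(u, dtype=float)
    return {name: f(u) for name, f in TRANSFORMS.items()}

candidates = expand_inputs([0.0, 0.5, 1.0])
```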

The final aspect of the pre-processing of the data is the division of the inputs and target time series into a calibration and a validation set. The percentages of the data in the calibration and validation sets are decided by the researcher; this procedure allows for testing of the prediction performance on an independent data set in order to avoid problems such as over-fitting (i.e., the model fitting the calibration set very well, but performing much more poorly when new data are presented).
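A chronological split (using the 70/30 proportion adopted later in this study) can be sketched as:

```python
def calibration_validation_split(series, calibration_fraction=0.7):
    """Chronological split: the first fraction calibrates, the rest
    validates. Time order is preserved (no shuffling), so the
    validation set represents genuinely unseen future data."""
    cut = int(len(series) * calibration_fraction)
    return series[:cut], series[cut:]

cal, val = calibration_validation_split(list(range(10)))
```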

The variable input selection procedure is adapted from *et al.* (2011). It is based on estimating the relative contribution of each input candidate in predicting the target *y* through the application of regression models. To do so, the normalised root mean squared error (NRMSE) is calculated using the following formula:

NRMSE = √[(1/*m*) Σ_{j=1}^{m} (*y*(*t _{j}*) − *ŷ*(*t _{j}*))^{2}] / (*y*_{max} − *y*_{min})

where *m* is the number of time steps included in the calibration set, *y* is the target time series for the calibration set and *ŷ* is the time series of the output of the performed regression analysis.

The inputs are ranked according to their performance; the best performing input (i.e., the one yielding the lowest NRMSE, called NRMSE_{min}) is selected and stored, and the residuals are calculated. If the target NRMSE is not achieved, a new iteration of the algorithm is performed, estimating which input among the remaining candidates best fits the residuals through the use of the same predefined regression model. The iterative process is repeated until the target performance, defined by a certain predetermined value of NRMSE, is reached; the set of selected input variables is thus defined and stored in a matrix called *X*_{mod}. One advantage of this procedure, and in particular of using the residuals as the next target, is that it avoids multicollinearity; once one input has been selected, all the variables highly correlated with that input become useless when the next residuals are targeted, and the rankings therefore have to be reconsidered. The logical steps of the input addition algorithm are illustrated in Figure 3. According to the inputs that have been selected, the equivalent of *X*_{mod} for the validation set is prepared by another algorithm.
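The input addition loop can be sketched as follows (illustrative Python; a simple 1-D least-squares fit stands in for whichever regression model the user predefines):

```python
import numpy as np

def nrmse(y, y_hat):
    """Normalised RMSE, as in the selection algorithm."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat)**2)) / (y.max() - y.min())

def iterative_input_selection(X, y, target_nrmse=0.05, max_inputs=10):
    """Greedy residual-fitting selection (sketch of the ALMO-style loop).

    X: dict name -> candidate input series; y: target series.
    Each round, the single input whose least-squares fit of the
    current residuals gives the lowest NRMSE is added.
    """
    residuals = np.asarray(y, float).copy()
    selected, remaining = [], dict(X)
    while remaining and len(selected) < max_inputs:
        scores = {}
        for name, u in remaining.items():
            A = np.column_stack([np.ones_like(u), u])
            coef, *_ = np.linalg.lstsq(A, residuals, rcond=None)
            scores[name] = (nrmse(residuals, A @ coef), A @ coef)
        best = min(scores, key=lambda k: scores[k][0])
        selected.append(best)
        residuals = residuals - scores[best][1]  # residuals become next target
        del remaining[best]
        if nrmse(y, np.asarray(y, float) - residuals) <= target_nrmse:
            break
    return selected

# toy example: y depends on x1 only; x2 is a plausible but irrelevant input
t = np.arange(50, dtype=float)
x1, x2 = np.sin(t / 5.0), np.cos(t / 3.0)
y = 2.0 * x1 + 1.0
chosen = iterative_input_selection({"x1": x1, "x2": x2}, y)
```

Because each round targets the residuals of the previous fit, an input highly correlated with one already selected contributes little and naturally drops down the ranking, which is the multicollinearity-avoidance property noted above.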

When the iterative input selection procedure is completed, the model's forecasting performance is then assessed using the validation data set. Model users can decide whether to use NRMSE as their performance index, or to also calculate *R ^{2}*. For this specific case study, an *R ^{2}* ≥ 0.8 was selected as the target performance; this standard was chosen because in environmental systems the level of uncertainty is often high. In the case where the target statistical performance cannot be achieved for the desired forecasting time frame (7 days ahead), the model gradually reduces *n* in order to assess whether an acceptable performance can be achieved with a shorter prediction horizon (as can sensibly be expected, since a forecast closer in time involves less uncertainty). Hence, *n* is iteratively reduced and the previously described procedure repeated, until the input selection algorithm yields an acceptable validation performance.
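The horizon-reduction loop can be sketched as follows (a hypothetical wrapper; `evaluate_r2` stands in for a full run of the selection and validation pipeline at horizon *n*):

```python
def find_feasible_horizon(evaluate_r2, n_start=7, r2_target=0.8):
    """Reduce the prediction horizon n until validation R^2 is acceptable.

    evaluate_r2(n) is assumed to run the whole input-selection and
    validation procedure for horizon n and return the validation R^2.
    Returns the largest feasible n, or None if none qualifies.
    """
    for n in range(n_start, 0, -1):
        if evaluate_r2(n) >= r2_target:
            return n
    return None

# toy stand-in: accuracy improves as the horizon shrinks
toy_scores = {7: 0.65, 6: 0.71, 5: 0.76, 4: 0.82}
n = find_feasible_horizon(lambda n: toy_scores.get(n, 0.9))
```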

When the desired *R ^{2}* is reached, in order to guarantee that the prediction is still valid as many time steps ahead as originally decided despite the *n* reduction, the same forecast procedure is applied to each input, selected after the reduction of *n*, that shows a delay from *t _{j}* lower than the originally established prediction horizon. The number of time steps ahead of each input forecast is equal to the number of *n* reductions before that input was found (e.g., if *n* was reduced twice before finding a certain relevant input, then the selected input must be forecasted two steps ahead). The same input selection algorithm is adopted, and the outputs of this input forecast procedure substitute the original associated inputs in *X*_{mod}. If, for instance, we have to predict *y _{O}*(*t _{j+7}*), and ALMO found that the relevant inputs are *u _{i}*(*t _{j−1}*), *v _{j}*(*t _{j−3}*) and *w _{k}*(*t _{j+2}*), then obviously *w _{k}*(*t _{j+2}*) must also be calculated using inputs at time *t _{j}* or older; hence ALMO will then also select the best inputs and run a prediction model for *w _{k}*(*t _{j+2}*), and the results will be used as one of the inputs for the prediction of *y _{O}*(*t _{j+7}*). Finally, the regression model is run again for the validation set by using the updated *X*_{mod}. The final model output and performance are displayed and saved.

*et al.* (2011) take nonlinearities into account through the application of a nonlinear (tree-based) model; however, since Bertone *et al.* (2014) showed that nonlinear models such as ANN may yield high volatility in the residuals, this could be detrimental in cases where forecasted variables must be used as inputs for the target prediction. In this case, the volatility would increase exponentially, since the inputs themselves would already show high volatility in their residuals, before even running the nonlinear model for the target forecast. Thus, following the outcomes of Bertone *et al.* (2014), a linear regression model, which typically yields smoother outputs, was considered more appropriate for this research study. Applied to time series forecasting, this model can be described by the following equation:

*y*(*t _{j+n}*) = *a _{0}* + Σ_{i=1}^{k} Σ_{p=0}^{l} *a _{i,p}* *x _{i}*(*t _{j−p}*) + *ɛ*

where *n* is the prediction horizon, *l* is the maximum lag, *a _{i,p}* are the regression coefficients for the *k* regressors, *a _{0}* is the intercept and *ɛ* represents the error term. The model is linear because Equation (9) is a linear function of the unknown parameters *a _{i}* (Montgomery *et al.* 2012). The equation can be summarised as:

*Y* = *X*_{mod} *A* + *E*

where *A* is a vector containing the unknown regression coefficients (with *a _{0}* in the first row), *X*_{mod} is a matrix containing the input time series after the input selection procedure and a column of ones in the first column, and *E* is the residuals' time series.

The least squares method is used to estimate *A* by minimising the sum of squares of the differences between the target values and the model outputs. To do so, the following equation is solved:

*A* = (*X*_{mod}^{T} *X*_{mod})^{−1} *X*_{mod}^{T} *Y*

The target performances chosen were an NRMSE of 0.096 during the calibration process and an *R ^{2}* of 0.8 for the validation set. The calibration set comprised 70% of the data and the validation set the remaining 30%; these proportions have been widely applied in several previous studies (see Ismail *et al.* 2011), proving to be an efficient division for calibration and validation purposes. The chosen *n* was seven time steps, while *l* was set to three to contain the computational time. Since daily data were adopted, one time step corresponds to 1 day.
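The least-squares solution of the normal equations can be sketched in Python, with synthetic data standing in for *X*_{mod} (illustrative only; `fit_linear_model` is not a function from ALMO):

```python
import numpy as np

def fit_linear_model(X_mod, y):
    """Solve the normal equations (X'X) A = X'Y for the coefficients A.

    X_mod is assumed to already carry a leading column of ones so that
    the first entry of A is the intercept a0.
    """
    return np.linalg.solve(X_mod.T @ X_mod, X_mod.T @ y)

# synthetic example with known coefficients: y = 1.5 + 2.0*u1 - 0.5*u2
rng = np.random.default_rng(0)
U = rng.normal(size=(100, 2))
X_mod = np.column_stack([np.ones(100), U])
y = 1.5 + 2.0 * U[:, 0] - 0.5 * U[:, 1]
A = fit_linear_model(X_mod, y)  # recovers approx [1.5, 2.0, -0.5]
```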

## MODEL RESULTS

### Raw data analysis and inputs pre-selection

In the following, *X* represents the generic variable; subscripts hyp, met and ep stand for upper hypolimnion, metalimnion and epilimnion; and *z* represents the depth in metres. In this study, data were collected at 3 m intervals, so the depth vector *Z* = [0, 3, 6, …, 24] was used for discretisation.

During the stratification season, the DO shows a linear, gradual decrease. However, at the onset of the circulation period, its level sharply increases. A good variable describing the strength of the stratification and, in turn, the commencement of the lake circulation, is the water column temperature differential *ΔT _{w}* between the epilimnion and upper hypolimnion, as described in Equation (16). When *ΔT _{w}* is high, the stratification is strong and the DO level decreases until anoxia is reached. If *ΔT _{w}* is approximately zero, meaning that the epilimnetic waters have a similar temperature (and density) to the hypolimnion, then lake circulation is occurring and the DO concentration increases. Figure 6 clearly displays this inverse relationship, which, through proper linear and nonlinear transformations (e.g., hyperbolic), will be detected by the model.

All the other available time series were visually inspected. In order to reduce computational time, the most relevant ones were selected to be part of the model's input matrix (Table 1).

| Input | Variable |
|---|---|
| X_{1} | ΔT _{w} |
| X_{2} | Air temperature |
| X_{3} | Water temperature hyp |
| X_{4} | pH hyp |
| X_{5} | pH met |
| X_{6} | Conductivity hyp |
| X_{7} | Redox potential hyp |
| X_{8} | Turbidity hyp |


### Prediction model outcomes

Using an Intel^{®} Core™ i5-2400 CPU @ 3.10 GHz processor, ALMO took approximately 16 minutes to perform the simulation, with most of the time (more than 15 minutes) spent on the calibration phase. Nine variables were selected as relevant inputs. The general equation describing the ALMO regression model outputs is:

*y _{O}*(*t _{j+7}*) = *a _{0}* + Σ_{i=1}^{9} *a _{i}* *s _{i}*(*t _{i}*)

where *s _{i}* are the time series vectors contained in *X*_{mod} created after the ALMO input selection, *a _{i}* are the respective regression coefficients, and *t _{i}* are the lags of the *i* relevant inputs selected by ALMO.

Table 2 gives a list of the final inputs selected by the model with the respective lags *t _{i}* and regression coefficients *a _{i}*, as described in Equation (17). The respective NRMSE reached when each input was added to the regression is also shown in Figure 7. The order reflects the rank of each input, as can be noticed by observing the relative NRMSE reduction (Figure 7): the first input addition yields most of the performance, with the addition of s_{1} producing an NRMSE reduction of 0.032, while the addition of s_{9} implies a reduction of only 0.001, and from s_{10} onwards the performance improvement was negligible. As a consequence, for this case study it was decided to retain all of the first nine inputs, as s_{10} did not imply any noticeable increase in performance; however, if the user was willing to accept a slight reduction in forecasting accuracy, a simpler model could have been developed with just a few key variables.

| Input name | Variable | a _{i} | t _{i} | NRMSE |
|---|---|---|---|---|
| s_{1} | (1 + ΔT _{w})^{−5} | 6.131 | t _{j−3} | 0.152 |
| s_{2} | (1 + Tw _{hyp})^{−5} | 4.244 | t _{j−3} | 0.110 |
| s_{3} | pH _{hyp}^{3} | 1.608 | t _{j+1} | 0.106 |
| s_{4} | (1 + ΔT _{w})^{−5} | −1.654 | t _{j+1} | 0.103 |
| s_{5} | ΔT _{w}^{3} | −0.540 | t _{j−1} | 0.101 |
| s_{6} | (1 + T _{air})^{−5} | −0.621 | t _{j+1} | 0.099 |
| s_{7} | (1 + pH _{hyp})^{−5} | 0.994 | t _{j−2} | 0.098 |
| s_{8} | (1 + pH _{met})^{−5} | 0.984 | t _{j+1} | 0.097 |
| s_{9} | ΔT _{w}^{3} | −1.160 | t _{j+1} | 0.096 |
| s_{10}^{a} | (1 + T _{air})^{−5} | −0.498 | t _{j−3} | 0.096 |


*a _{0}* = intercept = −1.278.

^{a} = first excluded variable, as it did not decrease NRMSE.

It can be noted how only nonlinear transformations of the original inputs were considered relevant by the model, and this reflects well the uncertainty and lack of linearity of environmental systems such as lakes. The most important inputs (five of the first six), as expected from the preliminary visual inspection of the data, are related to the lake circulation, and altogether help in modelling the nonlinear spikes in DO concentrations at the beginning of the turnover event. The pH, in both the upper hypolimnion and metalimnion, was also a relevant input. Other variables such as specific conductivity, turbidity or redox potential did not help in improving the model performance.

Although the forecast target is 7 days ahead (*t* + 7), the model had to use several input variables at a reduced lag (*t* + 1), implying that those variables had to be forecasted as well. Hence, each of them was forecasted in turn (1 day ahead) and the results were substituted into the recorded time series in order to estimate the true final performance.
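This recursive input–output step can be sketched as follows; `sub_models` and `main_model` are hypothetical stand-ins for the secondary 1 day ahead forecasters and the main 7 day ahead regression:

```python
def recursive_forecast(series, future_lags, sub_models, main_model):
    """Splice 1 day ahead sub-forecasts into the recorded series.

    series: dict name -> list of recorded values (most recent last)
    future_lags: dict name -> number of future days the main model needs
    sub_models: dict name -> callable(history) -> next-day forecast
    main_model: callable(completed series dict) -> DO forecast at t+7
    """
    completed = {}
    for name, values in series.items():
        values = list(values)  # copy; keep the recorded series intact
        for _ in range(future_lags.get(name, 0)):
            values.append(sub_models[name](values))  # fill each future step
        completed[name] = values
    return main_model(completed)
```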

Figure 8 displays the model's forecasted DO concentrations against the measured DO values for the validation set (August 2011 to October 2012, corresponding to the final 30% of the period considered). Validation yielded a respectable *R ^{2}* = 0.804 and *NRMSE* = 0.149. Interestingly, the worst-performing segment is caused by an unpredicted rapid spike in concentration at the end of January 2012. The variation in model accuracy becomes even clearer when assessing seasonal performance, as detailed in Table 3.
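The two validation metrics quoted above can be computed as below; since the normalisation used for the NRMSE is not restated in this section, division of the RMSE by the observed range (maximum minus minimum) is assumed:

```python
import math

def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1 - ss_res / ss_tot

def nrmse(obs, pred):
    """RMSE normalised by the observed range (assumed normalisation)."""
    mse = sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)
    return math.sqrt(mse) / (max(obs) - min(obs))
```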

Period | Dates | NRMSE |
---|---|---|
Winter/Spring DO depletion | 2 August 2011–2 December 2011; 20 August 2012–25 September 2012 | 0.071 |
Summer/Autumn hypoxia | 2 December 2011–2 May 2012 | 0.197 |
Autumn/Winter oxygenation | 2 May 2012–20 August 2012 | 0.115 |

Figure 9 plots the model results against the observed values for the three periods detailed in Table 3. DO was forecasted with high accuracy during the period of declining DO. However, during the period of very low DO concentrations, there was no appreciable correlation, owing to the aforementioned spike. A consistent underestimation is also noticeable at low DO levels during the period of increasing DO concentrations, probably also related to the same unusually sharp increase.

The unusual peak was caused by an extremely heavy precipitation event, which led to a partial mixing of the reservoir and the intrusion of oxygenated waters from the river and the epilimnion. No such event occurred within the calibration data set, hence the model could not predict it. It is reasonable to hypothesise that, over a validation set free of extreme weather events, model performance would be even higher. Alternatively, if both river inflow and temperature data were available, they could be used, alone or in combination, as extra model inputs. However, a similar event would still have to be present in the calibration data for the model to learn to detect comparable events in future validation data. Had such an event been included in the calibration data set, ALMO would have given more importance to meteorological variables such as rainfall at time *t* + 6 or *t* + 7, implying the difficult challenge of reliably forecasting rain 6–7 days ahead. This could compromise the accuracy of the final DO prediction, as rainfall, unlike, e.g., air temperature, is difficult to predict accurately. Nevertheless, since only very extreme wet weather events can induce mixing and affect DO in Advancetown Lake, it might be sufficient to predict only those extreme events (e.g., rainfall >100 mm) without having to quantify their exact magnitude. Hence, the use of imperfect weather forecasts would potentially not substantially jeopardise the reliability of ALMO for DO prediction in Advancetown Lake.

As expected, the output time series extends seven time steps beyond the input series. ALMO can therefore be used effectively for real-time DO prediction 7 days ahead, provided that all the required inputs are available.

## DISCUSSION

The Model results section showed how ALMO was successfully applied for the DO forecasting case study. Forecasting DO in an intermediate layer of a reservoir is challenging due to the complexity of the mixing and biogeochemical reactions occurring in it. Yet, the ability to reliably estimate DO 7 days ahead is highly beneficial for the Advancetown Lake water operators, as it would enable proactive raw water offtake gate level selection to occur.

The installation of remote water quality monitoring instrumentation (e.g., VPS) in drinking water reservoirs is becoming more commonplace and will greatly expand the quantity of lake and weather data available for analysis and decision-making. Owing to the high installation and running costs of these systems, there is increasing pressure from water utilities to exploit such databases for enhanced reservoir and treatment management. The formulation and application of data-driven modelling approaches such as ALMO provides bulk water utility operators with a reliable forecasting tool that selects its own predictor variables and adequately accounts for the different types of nonlinearity that may be evident in the data. Its user-friendliness also enables the modeller to adjust, for example, the target performance or the number of nonlinear transformations in order to optimise processing time. Moreover, since ALMO is a purely data-driven approach, able to detect linear or nonlinear correlations at different lags between any time series presented to it, it can be applied to forecast a wide range of independent variables over a number of feasible time horizons. Readers should be cautioned, however, that the approach requires basic knowledge of the environmental system being investigated, as well as sufficient data availability, for ALMO to derive accurate forecasts of the independent variable.

ALMO is a novel, practical and user-friendly approach for time series forecasting of independent water quality variables. After a careful review of contemporary modelling approaches in this field, a hybrid model was constructed comprising analytical techniques such as nonlinear data transformations, linear regression and recursive input–output forecasting. Other popular approaches were trialled to examine their fitness for purpose but exhibited deficiencies that prevented their inclusion. Two examples: (a) ANN models were unsuitable for this application because sub-model outputs are used as inputs to another sub-model, and ANN outputs typically present higher variance, increasing the likelihood of error propagation, as argued in Bertone *et al.* (2014); (b) ALMO handled the numerous data transformations needed for this problem better than EPR. Overall, ALMO is a sensible addition to the current suite of environmental modelling techniques, and has particular merit for applications similar to that presented herein (i.e., week ahead forecasting of a particular independent variable using a remote lake monitoring system).

## CONCLUSIONS

An innovative ensemble of data processing algorithms coupled with a prediction model (ALMO) has been built. Its main features are:

- nonlinearities are taken into account;
- only relevant inputs are selected;
- the user can define the prediction horizon and the maximum lag, along with the target performance;
- if the target performance is not reached, the model automatically reduces the prediction horizon in order to reach a better correlation by using inputs closer in time to the output; and
- if and when the target performance is reached, the model also forecasts those added inputs.
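The listed behaviour can be condensed into a control loop of the following shape; `fit_model` and `forecast_input` are hypothetical stand-ins for the regression-fitting and secondary-forecasting stages, not ALMO's actual interfaces:

```python
def almo_loop(data, target_horizon, target_nrmse, fit_model, forecast_input):
    """Shrink the horizon until the target performance is met.

    fit_model(data, horizon) -> (model, nrmse, selected_input_names)
    forecast_input(data, name, days_ahead) -> sub-model for that input
    """
    horizon = target_horizon
    while horizon > 0:
        model, nrmse, inputs = fit_model(data, horizon)
        if nrmse <= target_nrmse:
            # Inputs now lag inside the original horizon and must be
            # forecast (target_horizon - horizon) days ahead themselves.
            gap = target_horizon - horizon
            sub_models = ({name: forecast_input(data, name, gap)
                           for name in inputs} if gap else {})
            return model, sub_models, horizon
        horizon -= 1  # use inputs closer in time to the output and retry
    raise ValueError("target performance not reachable at any horizon")
```

For the case study above, this loop would accept at a 6 day horizon (gap of 1 day), matching the secondary 1 day ahead models described in the results.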

The formulated modelling procedure has far-reaching potential for application to a range of time series forecasting problems in different fields. In the case study presented here, the model effectively predicted DO concentrations a week ahead in Advancetown Lake with *R*^{2} > 0.8. The high correlation was reached by decreasing the prediction horizon from 7 to 6 days and building a secondary 1 day ahead forecasting model for the newly selected inputs. Hence, the indirect approach first introduced in Bertone *et al.* (2014), and formalised comprehensively herein, represents a novel way to increase forecasting model performance. Furthermore, the introduction of nonlinear input transformations was essential for an accurate forecast, as could be expected for an environmental system. It should be stressed that user knowledge of the environmental system under investigation is essential for developing a prediction model with the described approach, since including inputs that are physically or chemically related to the output is essential for reliable simulation and for avoiding spurious high correlations.

Successfully predicting DO concentrations in a reservoir 1 week ahead is of paramount importance, since this parameter is highly correlated with critical nutrient cycles. The implemented data-driven forecasting tool can serve as an early warning system for Seqwater operators to more proactively address critical DO conditions. Future work will focus on the application of the model to different reservoirs and water parameters, on the reduction of the computational time when using large input matrices, and on the creation of a user-friendly interface that clearly displays outputs and warnings.

## ACKNOWLEDGEMENTS

The authors are grateful to Griffith University for support for Mr Bertone and to Seqwater for their technical and financial support to this collaborative project.