Autoregressive time series forecasting is common in many areas of water resources, including hydrology, ecology, and the environment. Simple forecasting models such as linear regression have the advantage of fast runtime, which is attractive for real-time forecasting. However, their forecasting performance may not be acceptable when a non-linear relationship exists between model inputs and outputs, which necessitates the use of more sophisticated forecasting models such as artificial neural networks. This study investigates the performance and potential of a hybrid pre-processing technique to enhance the forecasting accuracy of two commonly used neural network models (feed-forward and layered recurrent neural networks) and a multiple linear regression model. The hybrid technique combines significant input variable selection (using partial linear correlation), which reduces the dimensionality of the input data, with input data transformation using the discrete wavelet transform, which decomposes each input time series into low and high frequency components. Two case study forecasting applications, namely, monthly inflow forecasting for a lake in Victoria (Australia) and weekly algal bloom prediction at a bay in Hong Kong, were used to assess the forecasting ability of the models when used in conjunction with the hybrid technique. Results demonstrate that the hybrid technique can significantly improve the forecasting performance of all the models used.

## INTRODUCTION

Autoregressive time series forecasting is common in many disciplines, such as finance and water resources. Its defining feature is the use of a model to forecast future values of a time series variable of interest in a process or system from its known past values. The goal of autoregressive time series forecasting is to provide accurate forecasts of the variable of interest, such as foreign exchange rates (Vojinovic *et al.* 2001), streamflow (Wei *et al.* 2012), salinity (Maier *et al.* 2010), and chlorophyll (Cheng & Wei 2009). Such forecast information can be utilized in future operational planning of the system at hand.

The general difficulty associated with autoregressive time series forecasting is the non-linear and non-stationary characteristics often encountered in water-related time series data (Coulibaly *et al.* 2001). The non-linear characteristic describes a dynamic relationship between past values (i.e., model inputs) and future values (i.e., model outputs) in which a small change in model inputs can affect model outputs significantly. An exponential increase in the magnitude of model outputs due to a small change in model inputs is one example of non-linear behavior; in linear behavior, by contrast, such an effect is negligible (Makridakis *et al.* 2003). The non-linear characteristic is commonly modeled as a non-linear combination of model inputs to predict the model output.

The non-stationary characteristic of time series data refers to the absence of a constant mean and/or constant variance within the time series (Tsay 2005). This characteristic can be observed through seasonality and/or trend in the data. However, the key consideration is not the time variation of the data itself but whether the underlying process that generates the data is evolving (Cannas *et al.* 2006). This consideration is commonly addressed with a 'random walk' model, which assumes that the future value equals the previous value plus random noise, in order to capture the non-stationary characteristic (Tsay 2005).
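The random-walk idea can be sketched in a few lines (an illustrative Python example, not part of the original study): accumulating white noise produces a series whose spread grows over time, unlike the stationary noise that generates it.

```python
import numpy as np

# Illustrative sketch (not the authors' code): a 'random walk'
# Y_t = Y_{t-1} + e_t, with e_t white noise. The accumulated series
# spreads out over time, which is one way the non-stationary
# characteristic described above shows up in data.
rng = np.random.default_rng(42)
noise = rng.normal(0.0, 1.0, size=5000)   # stationary shocks e_t
walk = np.cumsum(noise)                   # Y_t = sum of all shocks up to t

# The shocks keep a variance near 1, while the walk built from them
# has a far larger variance because its mean level keeps drifting.
print(noise.var() < walk.var() / 10)      # True
```

A stationary series would keep roughly the same variance in any window; the growing spread of the walk is exactly what trend removal or wavelet decomposition tries to isolate.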

Many forecasting models have been developed in the past. The multiple linear regression (MLR) model is among the simplest, performing forecasts by fitting a linear relationship between inputs and outputs. Although the MLR model can be very attractive for real-time forecasting due to its simplicity, it is considered unsuitable for non-linear data, which necessitates the use of more sophisticated models (Elshorbagy *et al.* 2010; Jung *et al.* 2010). To overcome this limitation of linear models, various data-driven modeling techniques such as artificial neural networks (ANN), support vector machines (SVM), and genetic programming (GP) have been proposed (Minns & Hall 1996; Babovic & Keijzer 2002; Liong *et al.* 2002; Yu *et al.* 2004). Among them, ANN models have been widely used for autoregressive time series forecasting over the last 15 years (Islam 2010; Maier *et al.* 2010; Arunkumar & Jothiprakash 2013). Several distinguishing features of ANN models make them valuable and attractive for forecasting tasks (Samarasinghe 2006; Maier *et al.* 2010). First, ANN models make few *a priori* assumptions: they learn from examples and capture the functional relationship in the data even when the underlying relationships are too complex to specify. Second, ANN models can generalize after learning from the sample data presented to them. Third, ANN models are universal functional approximators, able to approximate any continuous function to the desired accuracy.

There are many different types of ANN models, and among them, feed-forward neural networks (FFNN) and layered recurrent neural networks (LRNN) have gained attention in the literature for autoregressive time series forecasting (Güldal & Tongal 2010; Wei *et al.* 2012). FFNN are static and are the most commonly used architecture due to their simple framework (Güldal & Tongal 2010). LRNN provide dynamic internal feedback loops that store information for later use and enhance the efficiency of learning; they are better suited to mapping problems that involve auto-correlated samples (Parlos *et al.* 2000). However, the forecasting accuracy of ANN models has not always been high. For example, in a recent study by Wang *et al.* (2009) involving ANN models for forecasting monthly streamflow time series at two different case study locations, one location demonstrated a high Nash–Sutcliffe coefficient (E) value of 0.87 while the other had an E value of only 0.61 between observed and forecasted streamflow values for the validation dataset. Furthermore, there is limited information on the comparative performance of feed-forward and recurrent neural networks in autoregressive time series forecasting.

Recently, input selection techniques have been used for reducing the number of model inputs, and can improve forecasting performance in some cases (May *et al.* 2008; Tran *et al.* 2015). Input selection techniques can be broadly divided into two approaches, namely, model-free and model-based approaches (Fernando *et al.* 2009). Model-free approaches use statistical measures of dependence to determine the strength of the relationship between candidate model inputs and the model output. The partial linear correlation (PLC) technique is the simplest model-free technique, which can select inputs that are linearly correlated with outputs. The PLC technique has been used in several studies (Tiwari *et al.* 2013; Valipour *et al.* 2013; Liu *et al.* 2014). The partial mutual information (PMI) technique (Fernando *et al.* 2009) is another model-free technique, which can select inputs that are non-linearly related to outputs. On the other hand, model-based approaches use the performance of calibrated models with different inputs as the basis for choosing the most appropriate inputs. ANN and GP based techniques are examples of the model-based approach. Muttil & Chau (2007) compared two model-based approaches (ANN and GP) for selecting ecologically significant inputs for algal bloom prediction and showed that both methods produced similar results. Tran *et al.* (2015) conducted a comparative study between PLC, PMI, and GP techniques on four hypothetical and two real datasets. They found that there are differences in the significant inputs selected by these techniques. However, in terms of forecasting performance, inputs selected by the PLC technique provided forecasting performance similar to that of inputs selected by the PMI and GP techniques when tested using ANN models.

Another approach that has also been recently used to improve the forecasting performance of ANN models is input preprocessing using wavelet transformation (Adamowski & Sun 2010; Wei *et al.* 2012; Tiwari *et al.* 2013). The wavelet transformation is used to decompose an input variable into a set of linear components with high and low frequency. Such decomposition can be considered to separate the non-stationary and stationary parts, which is particularly useful for autoregressive time series forecasting. However, in these studies, the effects of different wavelet parameters, including the wavelet function and the decomposition level, were not investigated.

In this study, the overall objective was to improve the performance of autoregressive time series forecasting by using a hybrid combination of significant input selection and input preprocessing using wavelet transformation. Several studies have used the hybridization of various techniques to enhance forecasting accuracy (Kisi 2010; Cao *et al.* 2013; Sun *et al.* 2014). However, as far as the authors are aware, the hybridization of the PLC technique for input selection (used in this study because of its simplicity) with wavelet decomposition to improve the forecasting performance of MLR and ANN models has not been undertaken in the past. The major contributions of this paper are as follows:

(i) A hybrid technique that combines input selection using the PLC technique with wavelet decomposition for MLR and ANN models is investigated for enhancing autoregressive time series forecasting.

(ii) Wavelet transformation has been demonstrated to be useful for improving time series forecasting. This study analyzes the sensitivity of different decomposition levels and different wavelet functions on the forecasting performance.

(iii) The improvement in performance of two popularly used ANN models (FFNN and LRNN) when used in conjunction with input selection and wavelet transformation is analyzed.

The remainder of this paper is structured as follows. Brief descriptions of feed-forward and layered recurrent ANN models, and then PLC and wavelet decomposition techniques are presented in the next section. A description of the performance indicators used in this study to evaluate the performance of the models is also included in the next section. This is followed by a description of the two case study datasets. Results and discussion are then presented, which is followed by the conclusions drawn from this study.

## METHODS AND TECHNIQUES

### ANN models

The RNN, on the other hand, has feedback loops from the output layer or hidden layer to the input layer, as shown in Figure 2(b). The RNN allows signals to propagate in both forward and backward directions, which gives the network dynamic memory (Samarasinghe 2006). In other words, the feedback loops allow the RNN to delay and store information from previous time steps, a capacity absent from the architecture of feed-forward networks. The presence of these feedback loops has a profound impact on the learning capability and performance of the network, allowing it to capture the hidden dynamic memory or autoregressive components of nonlinear time series systems (Carcano *et al.* 2008). The RNN has proved to be a powerful method for handling complex systems such as nonlinear time-varying systems. A LRNN, which is a form of RNN, is investigated and compared with the FFNN in this study.
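The static/dynamic distinction above can be sketched with bare numpy forward passes (an illustrative toy, not the Matlab Neural Network Toolbox models used in the study; all weights and names are hypothetical): a feed-forward pass maps the same input to the same output every time, while a recurrent pass also depends on the hidden state carried over from earlier steps.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 4, 8
W_in = 0.5 * rng.normal(size=(n_in, n_hidden))       # input-to-hidden weights
W_rec = 0.5 * rng.normal(size=(n_hidden, n_hidden))  # feedback-loop weights
W_out = 0.5 * rng.normal(size=(n_hidden, 1))         # hidden-to-output weights

def ffnn_step(x):
    """Static feed-forward pass: output depends only on the current inputs."""
    return np.tanh(x @ W_in) @ W_out

def lrnn_step(x, h):
    """Recurrent pass: the hidden state h carries information forward in time."""
    h_new = np.tanh(x @ W_in + h @ W_rec)
    return h_new @ W_out, h_new

x = rng.normal(size=n_in)

# Feed-forward: the same input always maps to the same output.
print(np.allclose(ffnn_step(x), ffnn_step(x)))       # True

# Recurrent: feeding the same input twice gives different outputs,
# because the hidden state left by the first step changes the second.
y1, h1 = lrnn_step(x, np.zeros(n_hidden))
y2, _ = lrnn_step(x, h1)
print(np.allclose(y1, y2))                           # False
```

This hidden-state dependence is the "dynamic memory" referred to above, and also why recurrent training is more expensive than feed-forward training.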

### PLC

The PLC technique is based on the linear correlation between two variables *X* and *Y*, which measures the degree to which *X* and *Y* co-vary in a linear manner and which can also be examined visually using a scatter plot (Makridakis *et al.* 2003). In time series data, when *Y*_{t} has a correlation with its past data *Y*_{t−h} (*h* being the number of time lags), the correlation is known as autocorrelation (Tsay 2005). On the other hand, if *Y*_{t} has a linear correlation with *X*_{t−h} of a different time series, the correlation is known as lag-*h* cross-correlation. The correlation can be estimated from the sample data of *X* and *Y* using Equation (1) below:

$$ r_{XY} = \frac{\sum_{i=1}^{N}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2}\sqrt{\sum_{i=1}^{N}\left(y_i-\bar{y}\right)^2}} \quad (1) $$

where *x* and *y* are sample values of *X* and *Y*, *N* is the sample size, and $\bar{x}$ and $\bar{y}$ are the sample means.

Consider the case in which *Y*_{t} has autocorrelation with *Y*_{t−1}. Due to the nature of time series data, *Y*_{t−1} will automatically have correlation with *Y*_{t−2}, and thus *Y*_{t−2} is also automatically correlated with *Y*_{t} (i.e., bridge correlation) through *Y*_{t−1}. The PLC is often used to detect the true linear relationship between *Y*_{t} and *Y*_{t−2} by removing the effect of *Y*_{t−1}. The partial correlation coefficient representing the true level of linear relationship between *Y*_{t} and *Y*_{t−2}, controlling on *Y*_{t−1} and *Y*_{t−3} (for example), can be found by computing the correlation (using Equation (1)) between the residuals of two MLR models, as shown in Equations (2)–(5) (Tsay 2005):

$$ Y_{t-2} = a_{1,1}\,Y_{t-1} + a_{1,2}\,Y_{t-3} + \varepsilon_{t-2} \quad (2) $$

$$ r_{t-2} = Y_{t-2} - \left(a_{1,1}\,Y_{t-1} + a_{1,2}\,Y_{t-3}\right) \quad (3) $$

$$ Y_{t} = b_{1,1}\,Y_{t-1} + b_{1,2}\,Y_{t-3} + \varepsilon_{t} \quad (4) $$

$$ r_{t,2} = Y_{t} - \left(b_{1,1}\,Y_{t-1} + b_{1,2}\,Y_{t-3}\right) \quad (5) $$

where $a_{i,j}$ and $b_{i,j}$ are regression coefficients, $r_{t-2}$ and $r_{t,2}$ are the residuals of the MLR models for $Y_{t-2}$ and $Y_{t}$, respectively, on the remaining variables, and $\varepsilon_t$ is the error term. The same procedure can be applied when *Y*_{t} has cross-correlation with *X*_{t−2} through *X*_{t−1} of a different time series. The partial correlation coefficients are statistically tested using the *t*-test at the 95% confidence level in the MLR analysis, in order to avoid selecting significant inputs by chance. In this study, a variable is considered statistically significant if its partial correlation coefficient passes the *t*-test and has a value larger than 0.1. The MLR analysis of the statistical software package IBM SPSS was used to implement this technique.
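The residual-correlation procedure described above can be sketched in a few lines of numpy (an illustrative sketch, not the SPSS implementation used in the study; the AR(1) test series and all function names are hypothetical):

```python
import numpy as np

def partial_corr(y, x, controls):
    """Correlate the least-squares residuals of y and x after regressing
    each on the controlling variables (the residual-based PLC procedure)."""
    C = np.column_stack([np.ones(len(y))] + list(controls))  # intercept + controls
    r_y = y - C @ np.linalg.lstsq(C, y, rcond=None)[0]
    r_x = x - C @ np.linalg.lstsq(C, x, rcond=None)[0]
    return np.corrcoef(r_y, r_x)[0, 1]

# Hypothetical AR(1) series: Y_t truly depends only on Y_{t-1}, so any
# correlation between Y_t and Y_{t-2} is a 'bridge' correlation.
rng = np.random.default_rng(0)
n = 3000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal()

y_t, y_lag1, y_lag2 = y[2:], y[1:-1], y[:-2]
raw = np.corrcoef(y_t, y_lag2)[0, 1]            # large, via the bridge through Y_{t-1}
partial = partial_corr(y_t, y_lag2, [y_lag1])   # near zero once Y_{t-1} is controlled
print(raw > 0.4, abs(partial) < 0.1)            # True True
```

For an AR(1) process the lag-2 correlation is entirely inherited from lag 1, so the partial coefficient correctly drops toward zero and lag 2 would not be selected as a significant input.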

### DWD

Preprocessing of input data has recently been used to enhance the performance of ANN models (Wu *et al.* 2009). The goal of data preprocessing is to identify important characteristics such as trend, oscillation, and noise in order to attain better predictability, because time series data carry time-varying signals with static and dynamic components. Among the available data preprocessing techniques, wavelet decomposition has demonstrated good performance in several forecasting studies (Wu *et al.* 2009; Kisi & Cimen 2011; Wei *et al.* 2012). The underlying principle of this data preprocessing technique is to decompose the time series of interest into several component time series, which are then used as model inputs to forecast the future value of the time series.

The DWD technique is based on wavelet transformation or decomposition, which decomposes an original time series into a set of component time series whose sum recovers the original series. In general, wavelet decomposition is mathematically expressed as follows (Cannas *et al.* 2006):

$$ \mathrm{CWT}(\tau, s, \psi) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} Y(t)\,\psi^{*}\!\left(\frac{t-\tau}{s}\right) dt \quad (6) $$

where CWT(*τ, s, ψ*) are the continuous transform coefficients of a continuous wavelet transformation of a time series *Y(t)*; *τ* is the position or shifting parameter; *s* is the scale or frequency parameter; and *ψ* is the transforming function or wavelet (with $\psi^{*}$ its complex conjugate). A wavelet is a waveform of effectively limited duration that has an average value of zero. In comparison, sine waves, which are the basis of Fourier analysis, do not have limited duration and extend from minus to plus infinity. While sinusoids are smooth and predictable, wavelets tend to be irregular and asymmetric. The CWT(*τ, s, ψ*) provides the local similarity between the time series data and the wavelet at a particular *τ* and *s*. However, the shortfall of the continuous wavelet transformation is that there are numerous pairs (*τ, s*) that can be used, resulting in redundant information and enormous computational effort in the continuous domain (Kisi & Cimen 2011). Furthermore, many practical time series data (except electrical signals) are discrete in nature.

In the discrete wavelet transformation, the pairs (*τ, s*) are limited to powers of two and depend on two integer parameters *m* and *n* (Kisi & Cimen 2011):

$$ \mathrm{DWT}(m, n, \psi) = 2^{-m/2} \sum_{t=0}^{N-1} Y(t)\,\psi\!\left(2^{-m} t - n\right) \quad (7) $$

where DWT(*m, n, ψ*) are the discrete transform coefficients of a discrete wavelet transformation of a time series *Y(t)* of length *N*; *m* indicates the level of decomposition (*m* < *M* with *N* = 2^{M}); and the value of *n* depends on *m*. As a result of the discrete wavelet transformation, a given time series *Y(t)* can be decomposed as:

$$ Y(t) = A_{1}(t) + D_{1}(t) = A_{2}(t) + D_{2}(t) + D_{1}(t) = A_{3}(t) + D_{3}(t) + D_{2}(t) + D_{1}(t) = \cdots \quad (8) $$

where *A*_{m}(*t*) is called the approximation component time series and *D*_{m}(*t*) is called the detail component time series. From Equation (8), it can be seen that, starting from the left side, *A*_{1}(*t*) and *D*_{1}(*t*) are obtained by discrete wavelet transformation of *Y(t)*, then *A*_{2}(*t*) and *D*_{2}(*t*) are obtained by discrete wavelet transformation of *A*_{1}(*t*), and so on. In theory, *A*_{m}(*t*) represents the low frequency part of the signal, which is smooth and carries the trend, while *D*_{m}(*t*) represents the high frequency part, which carries fluctuation or noise (Catalao *et al.* 2011). The performance of DWD depends on the choice of the wavelet function *ψ* and the decomposition level. However, there are no guidelines for selecting the wavelet function and decomposition level for a particular case. The 'Haar' wavelet function, which is the simplest wavelet and resembles a step function (Taylor 2009), is used for DWD in this study. Furthermore, three levels of decomposition, as used in several other studies (Choi *et al.* 2011; Kisi & Cimen 2011), are chosen for this study. Finally, an analysis of the forecasting performance of other wavelet functions and decomposition levels is also conducted in this study.
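The Haar case of this decomposition can be sketched in plain numpy (illustrative only; the study itself used the Matlab Wavelet Toolbox). One level splits the series into scaled pairwise sums (approximation) and differences (detail); repeating on the approximation gives the multi-level decomposition, and the transform is perfectly invertible:

```python
import numpy as np

def haar_level(y):
    """One level of the (decimated) Haar DWT: pairwise scaled sums give
    the approximation (low frequency), pairwise scaled differences the
    detail (high frequency)."""
    y = np.asarray(y, dtype=float)
    a = (y[0::2] + y[1::2]) / np.sqrt(2.0)
    d = (y[0::2] - y[1::2]) / np.sqrt(2.0)
    return a, d

def haar_dwd(y, levels=3):
    """Multi-level decomposition: each level re-decomposes the previous
    approximation, mirroring A1 -> (A2, D2) -> (A3, D3) above."""
    details = []
    a = np.asarray(y, dtype=float)
    for _ in range(levels):
        a, d = haar_level(a)
        details.append(d)
    return a, details  # (A_levels coefficients, [D1, D2, ..., D_levels])

def haar_inverse(a, details):
    """Invert the transform level by level (perfect reconstruction)."""
    for d in reversed(details):
        y = np.empty(2 * len(a))
        y[0::2] = (a + d) / np.sqrt(2.0)
        y[1::2] = (a - d) / np.sqrt(2.0)
        a = y
    return a

# Toy signal of dyadic length N = 2^M, as required by the transform.
signal = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.arange(256) / 256
a3, (d1, d2, d3) = haar_dwd(signal, levels=3)
print(len(a3), len(d1), len(d2), len(d3))  # 32 128 64 32
rebuilt = haar_inverse(a3, [d1, d2, d3])
print(np.allclose(rebuilt, signal))        # True
```

The coefficient arrays here are the downsampled DWT coefficients; the full-length component series *A*_{m}(*t*) and *D*_{m}(*t*) of Equation (8) are obtained by reconstructing from each coefficient band separately, which toolboxes such as Matlab's handle internally.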

### Performance indicators

Three performance indicators, namely, the mean square error (MSE), the Nash–Sutcliffe coefficient of efficiency (E), and the mean absolute percentage error (MAPE), were used to evaluate the models. The MSE quantifies the differences between observed and predicted values along the time series axis and penalizes large differences because of its square power. Minimizing the MSE is commonly used as the training objective in ANN models (Samarasinghe 2006), which allows the models to better capture peak values of time series data. E captures the goodness of fit of the model of interest by comparing it with a naïve model in which the mean value is used as the prediction. The value of E lies between one and minus infinity: a value of unity implies that the model exactly matches the observations, zero implies that the model is no better than assuming the mean value, and a negative value indicates that the mean of the observed time series is a better predictor than the model. However, a limitation of MSE and E is that they do not provide a practical assessment of the difference between predicted and observed values. For example, MSE gives the same value of 4 for an observation of 1,000 with a prediction of 1,002 as for an observation of 1 with a prediction of 3, although the former is obviously a much better prediction. The MAPE was therefore used in this study, in addition to the well-known MSE and E, to provide a scale-free percentage difference between observed and predicted values (Coulibaly *et al.* 2001):

$$ \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 $$

$$ E = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} $$

$$ \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| $$

where *n* is the number of data points, $y_i$ and $\hat{y}_i$ are the observed and predicted values, and $\bar{y}$ is the mean of the observed values.
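The three indicators can be sketched as follows (an illustrative Python version; the worked example from the text shows why MAPE complements MSE):

```python
import numpy as np

def mse(obs, pred):
    """Mean square error: penalizes large errors via the square power."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return np.mean((obs - pred) ** 2)

def nash_sutcliffe(obs, pred):
    """E: 1 = perfect fit, 0 = no better than predicting the mean,
    negative = worse than predicting the mean."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

def mape(obs, pred):
    """Scale-free percentage error (observed values must be non-zero)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs((obs - pred) / obs))

# The scale-dependence example from the text: both predictions miss by 2,
# so MSE is identical, but MAPE separates the good forecast from the poor one.
print(mse([1000.0], [1002.0]), mse([1.0], [3.0]))                  # 4.0 4.0
print(mape([1000.0], [1002.0]) < 1.0, mape([1.0], [3.0]) > 100.0)  # True True
```

Note the MAPE definition breaks down when observations are zero or near zero, which is one reason it is reported alongside, rather than instead of, MSE and E.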

## CASE STUDIES

The first case study dataset (Data #1) consists of the monthly inflows to Lake Eildon in Victoria, Australia. The monthly inflow data used are from January 1891 to June 2007. Lake Eildon was constructed to protect farmers during drought years and to provide irrigation water for what was a vast uncultivated area in the northern plains of Victoria. This region has since become the largest area of irrigated farmland in Australia, known as the Goulburn-Murray Irrigation District. Construction of the lake took place in several development stages. The last development stage, completed in 1955, enlarged the storage to its present capacity of 3,334,158 ML. The enlargement plans also considered Victoria's electricity needs: the original 15 MW hydro-electric generation capacity was increased to 120 MW, and the oldest turbines were renovated in 2001 to provide a generation capacity of 135 MW. Figure 3 shows the monthly time series plot of Data #1. The probable inputs for the one-month ahead forecast of the inflow *Y*_{t} are assumed to be its time lagged values up to *Y*_{t−15} (i.e., from *Y*_{t−1} to *Y*_{t−15}).

The second case study dataset (Data #2) consists of chlorophyll concentrations at Tolo Harbour, Hong Kong, used for 7-day ahead algal bloom prediction. The chlorophyll concentration *Y*_{t} is assumed to depend on its own time lagged variables (i.e., from *Y*_{t−7} to *Y*_{t−13}). The following eight variables were also assumed to be influential for the prediction of chlorophyll concentrations at Tolo Harbour: total inorganic nitrogen, *X*1 (mg/L); phosphorus, *X*2 (mg/L); dissolved oxygen, *X*3 (mg/L); Secchi-disc depth, *X*4 (m); water temperature, *X*5 (°C); daily rainfall, *X*6 (mm); daily solar radiation, *X*7 (MJ/m^{2}); and daily average wind speed, *X*8 (m/s). Figure 4 shows the time series plot of daily chlorophyll concentrations used in Data #2.

## RESULTS

### Input preprocessing

Three types of model inputs (as summarized in Table 1) were used for the two case studies to investigate the effectiveness of input selection and wavelet transformation on autoregressive forecasting. The first type of input is called 'Original inputs'; these comprise all probable inputs for forecasting and are not subjected to either of the two input preprocessing techniques. As shown in Table 1, for Data #1, a set of 15 time lagged inputs (i.e., *Y*_{t−1} to *Y*_{t−15}) is used to forecast the current value, *Y*_{t}. For Data #2, nine variables are used as input, which include the eight variables presented earlier along with chlorophyll itself. For each of these nine variables, time lags of 7–13 days are used (giving a total of 63 inputs) for the 7-day ahead forecasting of chlorophyll concentrations.

| Type of model inputs | Description | Data #1 | Data #2 |
|---|---|---|---|
| Original inputs | All probable inputs to forecast the current value *Y*_{t} | *Y*_{t−1}, *Y*_{t−2}, …, *Y*_{t−15} (15 inputs) | *Y*_{t−7}, *Y*_{t−8}, …, *Y*_{t−13}; *X*_{(1)t−7}, *X*_{(1)t−8}, …, *X*_{(1)t−13}; *X*_{(2)t−7}, *X*_{(2)t−8}, …, *X*_{(2)t−13}; …; *X*_{(8)t−7}, *X*_{(8)t−8}, …, *X*_{(8)t−13} (63 inputs) |
| PLC-selected inputs | Inputs selected by the PLC technique from the Original inputs | *Y*_{t−1}, *Y*_{t−11}, *Y*_{t−13}, *Y*_{t−15} (4 inputs) | *Y*_{t−7}, *Y*_{t−8}, *Y*_{t−13} (3 inputs) |
| PLC-Wavelet inputs | Inputs created by applying wavelet transformation to the PLC-selected inputs | *D*1_{t−1}, *D*1_{t−11}, *D*1_{t−13}, *D*1_{t−15}; *D*2_{t−1}, *D*2_{t−11}, *D*2_{t−13}, *D*2_{t−15}; *D*3_{t−1}, *D*3_{t−11}, *D*3_{t−13}, *D*3_{t−15}; *A*3_{t−1}, *A*3_{t−11}, *A*3_{t−13}, *A*3_{t−15} (16 wavelet inputs) | *D*1_{t−7}, *D*1_{t−8}, *D*1_{t−13}; *D*2_{t−7}, *D*2_{t−8}, *D*2_{t−13}; *D*3_{t−7}, *D*3_{t−8}, *D*3_{t−13}; *A*3_{t−7}, *A*3_{t−8}, *A*3_{t−13} (12 wavelet inputs) |

The second type of input is called PLC-selected inputs, which are identified by applying the PLC technique on the Original inputs. As can be seen from Table 1, the PLC technique selected 4 inputs out of 15 probable inputs for Data #1 while only 3 inputs are selected out of 63 inputs for Data #2.

The third type of inputs is called the PLC-Wavelet inputs, which are created by applying the DWD technique to the PLC-selected inputs. For each input variable (e.g., *Y*_{t−1}), four components are created (e.g., *D*1_{t−1}, *D*2_{t−1}, *D*3_{t−1}, and *A*3_{t−1}) for three decomposition levels as per Equation (8). As a result, a total of 16 wavelet inputs are created for Data #1 and 12 for Data #2, as shown in Table 1. In this study, the PLC technique was implemented using the SPSS software package and the DWD technique was implemented using the Wavelet Toolbox of the Matlab software package.

### Forecasting performance

The FFNN and LRNN were implemented using the Neural Network Toolbox of Matlab. Both FFNN and LRNN models used only one hidden layer for ease of comparison. The MLR models were also implemented using Matlab.

All datasets were divided into training (60%), validation (20%), and testing (20%) datasets. The testing dataset was the last 20% of the time series. The training and validation datasets were randomly selected to improve the chance of representative learning data for the FFNN and LRNN models. As indicated in the flow chart of the methodology (presented in Figure 1), the Original inputs (made up of the training dataset) are used as input to the PLC to select the significant inputs, called the PLC-selected inputs. The PLC-selected inputs are then decomposed by the DWD to generate the PLC-Wavelet inputs. For the purpose of comparison, the Original inputs (containing all probable inputs), the PLC-selected inputs, and the PLC-Wavelet inputs are used as inputs for all three forecasting models. The training and validation datasets were used in the training process of the FFNN and LRNN models, in which the training dataset provided the 'learning knowledge' while the validation dataset was used to avoid over-fitting. The MLR model used the initial 80% of the dataset for calibration without splitting off a validation dataset. The testing dataset was used only for assessing the forecasting performance of the MLR, FFNN, and LRNN models. Both datasets were scaled to the range [−1, 1] before training and testing.
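A sketch of this data handling, assuming a simple min–max scaler fitted on the pre-test portion (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def split_and_scale(series, seed=0):
    """Chronological last 20% held out for testing; the remaining 80% is
    shuffled into training (60% of the full series) and validation (20%);
    all parts are scaled to [-1, 1] using only the pre-test range."""
    series = np.asarray(series, dtype=float)
    n_test = len(series) // 5
    dev, test = series[:-n_test], series[-n_test:]

    idx = np.random.default_rng(seed).permutation(len(dev))
    n_train = int(0.75 * len(dev))          # 60% of the full series
    train_idx, val_idx = idx[:n_train], idx[n_train:]

    lo, hi = dev.min(), dev.max()           # scaler fitted without the test set
    def scale(x):
        return 2.0 * (x - lo) / (hi - lo) - 1.0

    return scale(dev[train_idx]), scale(dev[val_idx]), scale(test)

train, val, test = split_and_scale(np.arange(100.0))
print(len(train), len(val), len(test))      # 60 20 20
```

Fitting the scaler on the pre-test portion only keeps the held-out period genuinely unseen; scaled test values can fall slightly outside [−1, 1] when they exceed the range observed during calibration.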

The training process for FFNN and LRNN models was carried out using the Levenberg–Marquardt algorithm (Samarasinghe 2006). The least square method (Tsay 2005) was used for the calibration of the MLR model. The single objective of minimizing MSE was used in the training process of all models. The remaining two performance indicators (i.e., E and MAPE) were then computed using the training (i.e., combination of training and validation) and testing outcomes. A suitable number of hidden neurons for FFNN and LRNN models was selected using trial and error. The results are presented in Table 2 only for the testing performances.

| Models | Inputs | hn | MSE | MAPE | E | hn | MSE | MAPE | E |
|---|---|---|---|---|---|---|---|---|---|
| | | **Data #1** | | | | **Data #2** | | | |
| MLR | Original inputs | – | 0.052 | 119.3 | 0.567 | – | 0.030 | 26.2 | 0.769 |
| | PLC-selected inputs | – | 0.047 | 106.7 | 0.570 | – | 0.024 | 25.5 | 0.770 |
| | PLC-Wavelet inputs | – | 0.026 | 85.1 | 0.702 | – | 0.011 | 16.7 | 0.914 |
| FFNN | Original inputs | 12 | 0.034 | 86.3 | 0.589 | 2 | 0.020 | 22.3 | 0.873 |
| | PLC-selected inputs | 10 | 0.030 | 79.0 | 0.632 | 6 | 0.018 | 20.8 | 0.884 |
| | PLC-Wavelet inputs | 16 | 0.022 | 77.8 | 0.732 | 6 | 0.009 | 16.8 | 0.944 |
| LRNN | Original inputs | 12 | 0.035 | 104.2 | 0.580 | 2 | 0.021 | 24.1 | 0.867 |
| | PLC-selected inputs | 12 | 0.033 | 65.1 | 0.604 | 2 | 0.018 | 21.9 | 0.882 |
| | PLC-Wavelet inputs | 16 | 0.018 | 71.6 | 0.787 | 6 | 0.009 | 16.7 | 0.940 |


hn, number of hidden neurons.

With regard to the effect of input selection, the forecasting performance of the MLR, FFNN, and LRNN models showed slight improvement when the PLC-selected inputs were used. As can be seen from Table 2, the use of PLC-selected inputs slightly improved the performance of all models in all cases. Although input selection provided only a marginal improvement in forecasting performance, a secondary benefit is the reduction in computing time when only significant inputs are used.

As far as the comparison of performances between FFNN and LRNN models is concerned, the LRNN model produced slightly better forecast performance than the FFNN model for Data #1 while both FFNN and LRNN produced similar forecasting performance for Data #2, which is demonstrated through three performance indicators presented in Table 2. On the other hand, both FFNN and LRNN outperformed the MLR for all three types of inputs. However, it is interesting to note that the performance gap between MLR and ANN models substantially reduced when PLC-Wavelet inputs were used.

### Sensitivity of wavelet parameters

A sensitivity analysis was conducted to understand the effect of different levels of decomposition and the different wavelet functions used for producing the PLC-Wavelet inputs. In this study, three different decomposition levels are investigated and then four different wavelet functions are studied on Data #1 (i.e., monthly inflow values to Lake Eildon) using the FFNN model.

As mentioned earlier, for decomposition level 3 of an input variable, there are four probable component inputs (i.e., *D*_{1}, *D*_{2}, *D*_{3}, and *A*_{3}), which were used as model inputs in this study (as shown in Table 1). The probable inputs for decomposition levels 1 and 2 of an input variable are *A*_{1} and *D*_{1} (for level 1) and *D*_{1}, *D*_{2}, and *A*_{2} (for level 2). For Data #1, the PLC technique selected four inputs and thus there will be a total of 8, 12, and 16 model inputs for the three different levels of decomposition, respectively. The sensitivity analyses for the three levels of decomposition for Data #1 are shown in Table 3. As can be seen from this table, level 2 of decomposition produced the best forecasting performance based on the MSE and E values, whereas level 3 is slightly better than level 2 decomposition based on the MAPE value.

Performance indicators for Data #1:

| PLC-Wavelet inputs | hn | MSE | MAPE | E |
|---|---|---|---|---|
| Level 1 (8 inputs) | 12 | 0.021 | 81.4 | 0.743 |
| Level 2 (12 inputs) | 20 | 0.018 | 78.9 | 0.786 |
| Level 3 (16 inputs) | 16 | 0.022 | 77.8 | 0.732 |


hn, number of hidden neurons.

Among the discrete wavelet functions available within the Matlab toolbox, the four commonly used wavelet functions, namely, the Haar wavelet, the Daubechies wavelet of order 3, Biorthogonal wavelet, and the Symlets wavelet of order 4 are used for the sensitivity analysis in this study. The FFNN model and level 2 of wavelet decomposition is used to compare the forecasting performance of the four different wavelet functions. The results presented in Table 4 show that the Daubechies wavelet of order 3 produced the best forecasting performance based on the MSE and E values, whereas the Symlets wavelet of order 4 produced the best forecasting performance based on the MAPE value.

**Table 4.** Performance indicators for Data #1

| Wavelet function | hn | MSE | MAPE | E |
|---|---|---|---|---|
| Haar wavelet | 20 | 0.018 | 78.9 | 0.786 |
| Daubechies wavelet of order 3 | 4 | 0.016 | 62.4 | 0.802 |
| Biorthogonal wavelet | 16 | 0.018 | 51.1 | 0.782 |
| Symlets wavelet of order 4 | 12 | 0.020 | 38.4 | 0.755 |

hn, number of hidden neurons.
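Why the choice of wavelet function matters can be illustrated with a single-level DWT built from two standard filter banks (Haar and Daubechies-2). This is a minimal sketch, not the study's FFNN-based sensitivity analysis: the filter coefficients are the standard published ones, and the comparison shown (detail-band energy) is purely illustrative:

```python
import numpy as np

# Standard orthogonal low-pass filters (Haar and Daubechies-2)
FILTERS = {
    "haar": np.array([1.0, 1.0]) / np.sqrt(2.0),
    "db2": np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
                     3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2)),
}

def dwt_single(x, lo):
    """One DWT level: filter with the low/high-pass pair, then downsample by 2."""
    hi = lo[::-1] * (-1.0) ** np.arange(len(lo))  # quadrature mirror filter
    a = np.convolve(x, lo[::-1])[len(lo) - 1::2]  # approximation (low frequency)
    d = np.convolve(x, hi[::-1])[len(lo) - 1::2]  # detail (high frequency)
    return a, d

x = np.sin(np.linspace(0.0, 4.0 * np.pi, 64))
for name, lo in FILTERS.items():
    a, d = dwt_single(x, lo)
    print(name, float(np.sum(d ** 2)))  # detail-band energy differs by wavelet
```

Different filters split the same series into different low/high-frequency components, which is why the downstream forecasting performance in Table 4 varies with the wavelet function.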

## DISCUSSION

As expected, the MLR model is not well suited to non-linear datasets, as demonstrated by its poor performance on both case study datasets. However, when the hybrid technique is used to preprocess the inputs, the forecasting performance of the MLR model improves significantly, which keeps it attractive for on-line forecasting. Therefore, sophisticated models such as ANN models need to be used only when simple models such as MLR produce poor performance. This observation is consistent with one of the common findings of the M-competitions reported in Crone *et al.* (2011), namely that 'sophisticated methods are not better than simpler methods'.
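An autoregressive MLR forecaster reduces to ordinary least squares on lagged values, which is why its runtime is so low. The following is a minimal sketch of that idea; the synthetic series, lag choice, and function names are illustrative, not the study's data or implementation:

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(1)
t = np.arange(200, dtype=float)
y = np.sin(0.3 * t) + 0.1 * rng.normal(size=200)   # synthetic monthly series
lags = np.column_stack([y[2:-1], y[1:-2], y[0:-3]])  # lags 1, 2, 3 as inputs
target = y[3:]
coef = fit_mlr(lags, target)
mse = np.mean((predict_mlr(coef, lags) - target) ** 2)
print(coef.shape, mse)
```

Fitting is a single linear solve, with no iterative training as in an ANN, which underpins the runtime advantage noted above.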

The LRNN model did not significantly outperform the FFNN model, yet it required considerably more computational time. This is understandable, since the feedback loops of the LRNN model demand additional computing resources.

It is observed that the use of PLC for input selection leads to a marginal improvement in the forecasting accuracy for both datasets and for all three models. On the other hand, the use of the DWD technique on the PLC-selected inputs significantly improves the forecasting performance of all three models for both datasets. However, identifying the optimal DWD parameters still involves trial and error. The sensitivity analysis of the different DWD parameters indicated that forecasting performance differs across decomposition levels and wavelet functions. Level 2 decomposition and the Symlets wavelet might be a reasonable initial choice for these two DWD parameters.
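The idea behind PLC-style input selection is to measure the correlation between a candidate input and the target after removing the linear influence of inputs already selected. A minimal sketch of this partial-correlation computation follows; the variables, seed, and helper name are illustrative, not the study's actual PLC procedure:

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation of x and y after linearly removing the effect of the
    control variables (columns of `controls`) from both, via OLS residuals."""
    A = np.column_stack([np.ones(len(x)), controls])
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(2)
z = rng.normal(size=300)             # a common driver (e.g., an earlier lag)
x = z + 0.1 * rng.normal(size=300)   # candidate input, mostly redundant with z
y = z + 0.1 * rng.normal(size=300)   # forecast target
print(np.corrcoef(x, y)[0, 1])       # high ordinary correlation
print(partial_corr(x, y, z.reshape(-1, 1)))  # near zero once z is controlled for
```

A candidate whose partial correlation is near zero adds little beyond the inputs already chosen, which is how PLC can shrink the input set with only a minor effect on accuracy.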

The DWD technique has the ability to decompose a time series signal into low- and high-frequency components. The low-frequency component is often viewed as representing the mean behaviour of the series, while the high-frequency components represent its variation (Kisk 2010). Such decomposition is similar to de-trending (i.e., separating out the non-stationary part of the signal), which makes it easier for ANN models, or even MLR models, to simulate time series data with improved forecasting performance. Conventional de-trending (e.g., differencing), on the other hand, is often carried out for traditional statistical models (e.g., ARMA models), which require stationary signals to accurately forecast time series data. However, a negative effect of conventional de-trending is the loss of variation and peak values, which are vital for hydroinformatics applications such as streamflow and rainfall forecasting. Therefore, conventional de-trending is often not used in forecasting studies using data-driven models (e.g., Wang *et al.* 2009; Kisk 2010).
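The contrast between wavelet decomposition and conventional differencing can be illustrated with a toy Haar split. This is a minimal sketch with illustrative function names and data, intended only to show that the wavelet split is fully invertible while differencing discards the absolute level of the series:

```python
import numpy as np

def haar_split(x):
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)  # approximation (low frequency)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)  # detail (high frequency)
    return a, d

def haar_merge(a, d):
    """Exact inverse of haar_split."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

x = np.array([4.0, 2.0, 5.0, 7.0, 1.0, 0.0, 3.0, 6.0])  # toy series with a peak
a, d = haar_split(x)
assert np.allclose(haar_merge(a, d), x)  # decomposition loses no information

dx = np.diff(x)  # conventional first differencing
# dx discards the absolute level of the series: recovering x from dx also
# requires the initial value, and models trained on dx never see peak magnitudes
```

Because the low- and high-frequency components jointly carry all of the original information, a model fed both can still represent peaks, whereas a model fed only differenced values cannot.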

Although the hybrid technique used in this study improved the forecasting performance on both datasets, the forecasting accuracy is still not as high as desired. This indicates that the underlying mechanism generating the time series data is not yet fully captured, and that further development in both process understanding and mathematical modeling is necessary.

## CONCLUSION

This study investigated a hybrid technique combining PLC (for input selection) and DWD (for separating the non-stationary part of the signal) to improve the forecasting performance of regression-based models such as ANN and MLR models. Based on the forecasting results of two real-world case study datasets, this study concluded that the hybrid technique can significantly enhance the forecasting performance of both the simple MLR model and the ANN models. Furthermore, the LRNN model produced forecasting performance similar to that of the FFNN model. As far as input selection is concerned, the PLC technique can significantly reduce the number of model inputs with only a minor impact on the forecasting performance of ANN models. The input transformation using the DWD technique leads to significant improvement in the performance of the ANN models as well as the MLR model. This was demonstrated not only through improved performance indicators, but also through better capturing of peak values. Thus, this study provides useful insights into the performance gains achievable when ANN models are used in conjunction with significant input selection and wavelet-based data transformation, which could benefit researchers and practitioners involved in autoregressive time series forecasting of real-world applications.