The purpose of this paper is to present a simple yet highly effective method to reconstruct missing data in flow time series. The presence of missing values in network flow data severely restricts their use for an adequate management of billing systems and for network operation. Despite significant technology improvements, missing values are frequent due to metering, data acquisition and storage issues. The proposed method is based on a weighted function for forecast and backcast obtained from existing time series models that accommodate multiple seasonality. A comprehensive set of tests were run to demonstrate the effectiveness of this new method and results indicated that a model for flow data reconstruction should incorporate daily and seasonal components for more accurate predictions, the window size used for forecast and backcast should range between 1 and 4 weeks, and the use of two disjoint training sets to generate flow predictions is more robust to detect anomalous events than other existing methods. Results obtained for flow data reconstruction provide evidence of the effectiveness of the proposed approach.
INTRODUCTION
Urban water systems need to ensure adequate customer service and to improve efficiency through water loss control. Water losses represent a significant economic and environmental inefficiency in most water utilities. Flow monitoring through SCADA (Supervisory Control And Data Acquisition) or telemetry systems that remotely collect data from flow meters is one of the most important tools to improve network operation and management and to ensure the efficient use of water. The number of meters installed in networks has increased remarkably as a result of technological advances. However, extracting useful information can be a difficult task due to the need to combine data from multiple meters and the fact that flow data are often faulty (e.g., missing, duplicate or out of range values).
Missing data might be due to problems with flow meters, sensors, data loggers and central database communication, or with data processing. Having complete and accurate flow data is essential for a reliable billing system and a high quality customer relation and network management. Filling gaps from flow series is also relevant for online anomalous event detection (Loureiro et al. 2016a) and for a proper water loss assessment. Examples of this assessment include the calculation of annual water balance (Lambert & Hirner 2000; Alegre et al. 2006), or more detailed and complementary methods, such as night flow analysis (Farley & Trow 2003).
Since network flow data are usually characterized by daily and weekly cycles (de Marinis et al. 2008; Mamade 2013), the focus will be on models that can accommodate multiple seasonality.
Multiple seasonality models are used, for instance, in electricity load demand forecasting. Mohamed et al. (2010) investigated the use of a double seasonal ARIMA (Auto Regressive Integrated Moving Average) model for electricity load demand forecasting in the context of electric power planning. Hassan et al. (2012) compared double seasonal ARIMA and double seasonal ARFIMA (Auto Regressive Fractionally Integrated Moving Average) models for forecasting half-hourly electricity load demand, with the ARFIMA model producing slightly better results.
In the context of water demand forecasting, various models and techniques have already been explored but the generality of work has been dedicated chiefly to forecasting and did not consider the problem of missing data. Alvisi et al. (2007) developed a short-term forecasting procedure of hourly water demand based on two modules: a daily module, which included annual and weekly seasonality, and an hourly module, which incorporated the intra-day patterns. Li & Huicheng (2010) employed multiple linear regression and fuzzy neural network in order to model the trend and cyclical components of yearly urban water demand, respectively. Caiado (2009) examined the forecasting performance of several univariate time series models, including Holt-Winters model, ARIMA model and GARCH (Generalized Auto Regressive Conditional Heteroskedasticity) model to predict daily water demand. Firat et al. (2010) compared several artificial neural network (ANN) techniques to forecast monthly water demand. Herrera et al. (2010) compared several techniques, such as ANN, projection pursuit regression, multivariate adaptive regression splines, random forests and support vector regression, to forecast hourly water demand. Quevedo et al. (2010) employed two forecasting models in order to validate and reconstruct missing and false flow meter data in water distribution systems: a daily model based on ARIMA time series model and a 10-min intra-day model based on a 10-min demand pattern, determined through correlation analysis and an unsupervised fuzzy logic classification (LAMDA algorithm).
To forecast intricate time series, De Livera et al. (2011) introduced a model incorporating Box–Cox transformations, Fourier representations with time varying coefficients and ARMA (Auto Regressive Moving Average) error correction. This enables forecasting complex seasonal time series, such as those with multiple seasonal periods, high-frequency seasonality, non-integer seasonality, and dual-calendar effects, and as such, can cover a broad range of applications. They applied this model to forecast electricity demand, gasoline and call centre data.
Despite the work put forth by previous studies, namely Quevedo et al. (2010), the need remains for a simple and robust methodology that can accommodate multiple seasonality to reconstruct online or offline flow data in water distribution networks. The novelty of this work resides in: (i) the proposed weighted function of the forecast and backcast values obtained from multiple seasonality time series models; (ii) its application in the reconstruction of water flow data; (iii) an extensive study of the performance of the proposed method in common situations for this type of data analysis, namely the existence of outliers and multiple seasonality.
The paper is organized as follows. In the Methodology section, the proposed approach, the tests carried out are presented and the general testing procedure is described. First, the forecasting model is defined and incorporated into three different reconstruction methods: a Forecast Method, a Backcast Method and a Combined Method. Furthermore, several tests are carried out to justify the model selection, to assess the window size for training the model, to study model robustness in terms of anomalous events' location in the training window, and in terms of reconstruction method, and finally to compare prediction accuracy in different reconstruction methods. The outcome of these tests is presented in the Results and discussion section. In the Conclusions section, we present the main conclusions drawn from the study.
METHODOLOGY
The proposed approach is based on a weighted function of forecasts and backcasts for estimating gaps in flow time series with multiple seasonality. Data were collected from existing flow meters that monitored the inflow in three different urban sectors, whose boundaries were clearly defined and watertight to ensure an adequate network operation and management. Electromagnetic flow meters with 150 mm of nominal diameter registered readings with 10- or 15-minute time steps throughout a SCADA system.
The first steps of data processing are known as data validation and normalization (Mamade 2013; Loureiro et al. 2016a). Data validation consists of detecting and correcting anomalous data caused by the measuring equipment, such as duplicate or out of range values. These values were removed as they might bias the subsequent analysis. Data normalization, in turn, consists of setting a regular time step for the measurement variable. A 15-minute time step was defined and water flow units were set to m3/h. Once these first steps have been completed, the next step is data reconstruction.
Only daily and weekly seasonality were considered in the current study. Annual seasonality was not included, since only 1 year of historical data was used. The availability of accurate flow data from multiple years is less frequent among the water utilities. Moreover, to test the method it was important to analyse periods where network operation was stable (i.e., clearly defined and watertight network boundaries and constant operating configurations), in order to ensure data consistency throughout the time period.
The TBATS model
The smoothing parameters are given by and , represents the damping parameter, and denote the seasonal periods.
To fit this model it is necessary to estimate not only the smoothing parameters and the damping parameter, but also the Box–Cox transformation parameter as well as the ARMA coefficients p and q. The implemented R function (Team 2014) of the model automatically handles these estimates, as well as optimal model selection through the Akaike information criterion (AIC) (Akaike 1973). The AIC is used in order to determine the best fit, and therefore allows a decision as to whether to use the Box–Cox transformation, whether to include a trend, whether to include a damping parameter in the trend and whether to include ARMA errors.
The Forecast Method
The initial idea for data reconstruction is to iteratively fit a forecasting model to the data preceding each sequence of consecutive missing values, and then generate forecasts in order to fill each sequence with reasonable values. In this study, this procedure is referred to as the Forecast Method.
Since flow data are provided with a regular 15-minute time step, there are four time steps in 1 hour, and time steps in 1 day. Therefore, daily and weekly seasonalities are accounted for TBATS model with seasonal components containing 96 and time steps, respectively. The initial approach to flow data reconstruction is to apply the Forecast Method with this double seasonal TBATS model. A second TBATS model that only accommodates daily seasonality (with seasonal component containing 96 time steps) is also considered. To compare TBATS models with a classical approach, the Forecast Method was also applied with a seasonal ARIMA model.
Furthermore, depending on the location of the sequence of missing values in the time series, the Forecast Method may not be applicable. For instance, if the very first values of flow data are missing, the Forecast Method lacks the flow data needed for fitting the model. In this case, there is a need for a reconstruction method that incorporates the flow data succeeding the sequence of missing values, in order to generate predictions for the missing past values.
The Backcast Method
Having high-resolution data (15-min time steps) allowed fitting the model in different sections of complete data, which enabled computing not only forecasts, but also backcasts. The term backcasting is introduced as a means to back-forecast the unknown past values (Wei 2006). This concept was applied in the context of flow data reconstruction: if we consider a given sequence of missing values, the flow data succeeding the sequence may be used to fit a model, thus generating predictions for the preceding missing values.
In essence, the Backcast Method allows us to predict missing values if the Forecast Method is not applicable due to lack of data. In instances where both methods are applicable, two sets of predictions are generated, which can then be combined into a third reconstruction method – the Combined Method.
The Combined Method
We consider that the uncertainty of the predictions generated by a forecast model should increase as we get further away from the left bound of the prediction window. Conversely, the predictions generated with a backcasting model are progressively more reliable as we approach the right bound of the prediction window. Therefore, when considering a sequence of missing values, a combination of predictions generated by the Forecast Method and the Backcast Method should assign progressively less weight to the forecast predictions, and progressively more weight to the backcast predictions.
where l is the length of the sequence of missing values, and are the -th component of the forecast and backcast prediction sequences, respectively, and is the prediction for -th component of the sequence of missing values, as generated by the Combined Method.
When the Forecast Method (resp. the Backcast Method) is unable to generate predictions due to lack of data, the Combined Method consists only in the application of the Backcast Method (resp. the Forecast Method).
Nevertheless, the Combined Method usually takes advantage of two disjoint sets of flow data in order to compute the predictions.
General testing procedure
Evaluation is the key to assess the actual performance of the prediction methods, and splitting data into training and testing sets is a central part of this evaluation (Witten & Frank 2005). The use of a set of independent data (test set), but with the same distribution of the training set, avoids overfitting and allows obtaining the performance characteristics of the models. In the scope of this study, the test set corresponds to a section of data that is removed from the time series data, previous to the application of a data reconstruction method. The training set is composed of a window of adjacent data preceding the test set (in the Forecast Method), succeeding the test set (in the Backcast Method), or both (in the Combined Method).
The key aspects of the reconstruction approach have been analysed through a series of five tests:
In Test 1, we analysed the prediction accuracy of the Forecast Method with three different models: a daily seasonal ARIMA model, a daily seasonal TBATS model, and a daily and weekly seasonal TBATS model.
In Test 2, we addressed the issue of the window size for fitting the TBATS model.
In Test 3, we studied the robustness of the TBATS model by creating artificial anomalous events at various instants in the training data and then analysing their impact on the forecasts.
In Test 4, we analysed the impact that the Combined Method has on robustness.
In Test 5, we compared the prediction accuracy of the Forecast Method, the Backcast Method and the Combined Method.
RESULTS AND DISCUSSION
Test 1: impact of seasonal effects on model forecasts
The three models were fitted on a window size of 3 weeks, the forecast window was set to 1 week and the process was repeated for each DMA.
Test 2: impact of window size for daily and weekly seasonal TBATS model fitting
This test focuses on the assessment of the effect of the window size assigned for fitting the forecasting model when conducting the Forecast Method (the result of this test also holds for the Backcast Method). Results obtained from Test 1 indicated that the daily and weekly seasonal TBATS model is the most suitable to the flow data. Therefore, this test was conducted for that model only.
The test was performed for each DMA as follows. The test set was fixed and assigned a window size of 1 week. Since the daily and weekly seasonal TBATS model incorporates weekly seasonality, the minimum window size for fitting was 1 week. The window size for fitting was then iteratively increased by 1 week, reaching a maximum of 4 weeks.
In general, results indicate that there is virtually no difference in terms of prediction error by selecting either 2, 3 or 4 weeks for model fitting, with a tendency toward lower errors as the window size increases. Therefore, we conclude that the reconstruction algorithm should consider the maximum length of complete data available for fitting each model. We note that the window size considered for this test is somewhat limited, and future tests should include a finer granularity of window size.
Test 3: impact of location of anomalous event in training set
In this test we studied the robustness of the daily and weekly seasonal TBATS model: artificial anomalous events were created at various times in the data used for fitting the model (training set), and the change in the prediction error of the Forecast Method was analysed.
The test was performed for each DMA as follows. The daily and weekly seasonal TBATS model was fitted on a window size of 3 weeks, and the forecast window was set to 1 week. As indicated by Test 2, it would be best to use the largest number of weeks possible. However, the data are frequently faulty and a trade-off had to be reached between having enough data for model fitting and minimizing the prediction error.
An artificial anomalous event was created in the training set. In the context of network flow data, several types of anomalous events may take place (e.g., pipe bursts, atypical consumptions due to anomalous water uses, infrequent tanks or pumping stations operational conditions). For this study, a rule of thumb was adopted to simulate a reported pipe burst – usually characterized by a moderate to high flow rate whose effect may last a few hours before detection (Loureiro et al. 2016a). Therefore, an event with a 6-hour duration was simulated in the following way: a section of flow data corresponding to a time window of 6 hours was selected, and multiplied by a factor of 2. The model was then fitted on the new data, and the forecast was generated. The impact of the anomalous event was studied by iteratively placing the event in successive days preceding the test set (at the same time of day) and calculating the error of each forecast.
Test 4: impact of Combined Method on robustness
This section focuses on dealing with the lack of robustness issue following the application of the Forecast Method on DMA 1, as illustrated in Test 3. After applying the Forecast Method, the Backcast Method and the Combined Method were applied in order to generate predictions for the same test set.
The largest amount of complete flow data available for computing the backcast was equivalent to a window size of 1 week, as opposed to the window size of 3 weeks for the forecast. Nevertheless, it is apparent from Figure 8(a) that the Backcast Method generated more accurate predictions than the Forecast Method. Indeed, the RMSE was equal to for the Forecast Method, for the Backcast Method and for the Combined Method.
In practice, if a detection step of anomalous events is not performed on the data before the data reconstruction, it is not possible to know whether an event will fall on the training set of the Forecast Method or the Backcast Method. Had the anomalous event been placed in the training set of the Backcast Method, the Forecast Method would have performed better. Therefore, the reconstruction method that would make sense as a compromise for an automated process would be the Combined Method, which would attenuate the impact of the anomalous event regardless of its location.
Test 5: comparison of prediction performance
The aim of this test is to compare and decide which of the data reconstruction methods selected for each approach is more suitable to the flow data provided. In this study, we focus on accurately reconstructing missing data in the short term. In this test, we perform a reconstruction of flow data with a test set window size equivalent to 1 day. For each day of the week, a section of flow data corresponding to that day is selected at random as the test set. Then, the Forecast, Backcast and Combined Methods are applied, generating three sets of predictions for each day of the week. The performance measures are determined, and the process is repeated for each DMA.
CONCLUSIONS
Flow data reconstruction in water distribution systems is an essential step towards improving the billing system and network operation, namely water loss control, and is achieved through the imputation of missing values with accurate predictions.
In this paper, a new method for filling missing values was developed and tested, which comprised a combination of forecast and backcast values generated by TBATS and ARIMA models. An extensive set of tests that evaluated the suitability and robustness of the method was carried out, which yielded effective results and highlighted the advantages of the Combined Method for offline data reconstruction, over a simple forecast or backcast approach.
In summary, models for flow data reconstruction should incorporate daily and seasonal components for more accurate prediction; the window size used for forecast and backcast should comprise between 1 and 4 weeks, which reflects a compromise between the typical length of datasets with continuous and complete records available and the accuracy gain. Since the Combined Method uses two disjoint training sets to generate flow predictions, it is more robust to anomalous events than are other existing methods. However, in order to better assess the adaptability of the proposed Combined Method, it should be tested on a larger number of flow data time series. Furthermore, as part of ongoing research, we are comparing the performance of the proposed methods with alternatives from the literature.
It is also worth noting that the Combined Method benefits from the so-called ‘offline approach’ to data reconstruction, since it makes use of historical flow data in order to generate predictions. Nevertheless, the proposed method is very flexible and for online flow data reconstruction only the Forecast Method should be applied.
ACKNOWLEDGEMENTS
The authors would like to express their gratitude to the iPerdas project and the water utilities that participated for providing the data used in this study.