## Abstract

Accurate and reliable simulation models are crucial for the operation and management of systems. Developing a simulation model to forecast future states of a system is generally followed by errors in prediction. Frequently, data-based models such as support vector machines (SVM) are used as forecasting techniques. This paper introduces a modular method which couples the machine learning technique of support vector regression (SVR) as a prediction model and a modified data assimilation (MDA) technique to partially correct the predicted values based on the observation data. To improve the performance and accuracy of the system output, the ensemble Kalman filter (EnKF) as a data assimilation procedure is implemented with an optimization procedure. As a case study, inflow quantities to Zayandehroud reservoir is considered as the state vector in the assimilation process to enhance the system output. Evaluation criteria such as root mean square error (RMSE) and R-squared criteria are implemented to evaluate the performance of the proposed model. The adjusted values of a hybrid model compared to the SVR model and standard DA indicate improved performance of the proposed model.

## INTRODUCTION

One of the main challenges in simulation models is reliability of their performance. Concerns about performance and accuracy of simulation models receive attention for reducing errors and ambiguities about models that produce uncertainty. Presenting approaches or methods to reduce errors and improve accuracy of models is the main concern of any data assimilation (DA) modeling to increase certainty and reliability of the systems. For example, uncertainty about the inflow discharge to a reservoir is the key factor in optimization of reservoir operation. The inflow usually consists of the runoff from upstream which is related to different parameters such as precipitation, temperature and water transfer from other basins. The accurate estimation of these parameters, which can be distributed spatially and temporally, has a significant effect on prediction of inflow. Deterministic optimization models which directly use historical data usually suffer from obtaining reliable results for future planning and management. To balance the uncertainty in the data, a simulation model is developed to predict future estimation while incorporating the information inherited into the observation data by using techniques such as DA.

For the last two decades, data-based methods have been developed and implemented in water resource system analysis as simulation models with acceptable performance. Artificial neural networks (ANN) and support vector machines (SVM) are among the frequently used data-based models capable of proper prediction and flexible operation. Introduction of an SVM model as a statistical data analysis technique was presented by Vapnik (1995) and initiated a wide variety of research in the 1990s using implementations of SVM. The studies of rainfall-runoff modeling by Dibike *et al.* (2001); Behzad *et al.* (2009) and Jajarmizadeh *et al.* (2015), river inflow prediction by Asefa *et al.* (2006) and Lin *et al.* (2006) and controlling groundwater levels in aquifers by Asefa *et al.* (2004) are examples of these studies using SVM in water resources fields. Often, the results of these data-based modeling techniques require improvement due to confounded noise from both known and unknown sources.

DA has been used in many engineering applications to improve modeling accuracy. One of the most efficient and current sequential DA methods is the Kalman filter (KF) which was developed by Kalman (1960). This filtering process reads the variables of previous state or time step methods in sequential order and analyzes the forcing terms and observations of the current state to update the variables for the subsequent state.

DA can be a useful tool in hydrological modeling as it improves the model operation using background data in the numerical model. Refsgaard *et al.* (1983) developed a lumped rainfall-runoff model in state space form to show the implementation of KF where the input uncertainties are predominant for uncertainty of simulated runoff values. Wood & O'Connell (1985) studied the KF and extended Kalman filter (EKF) for rainfall-runoff modeling and applied it on state and parameter estimation of National Weather Service River Forecasting and Sacramento Soil Moisture model at the same time. Lee & Singh (1999) applied KF in the rainfall-runoff process of a tank model. The model parameters used as state vectors and uncertainty in rainfall-runoff process decreased by variation of parameters in time. Hartnack & Madsen (2001) combine ensemble Kalman filter (EnKF) in MIKE 11 as river modeling where the implemented model corrected the model perturbations on boundary conditions. In order to improve accuracy of SVM, Gill *et al.* (2007) and Liu *et al.* (2010) integrated SVM and EnKF to predict soil moisture in different soil layers. The results in both studies show better performance of SVM-EnKF in comparison with SVM without using EnKF. Dechant & Moradkhani (2011) proposed a framework to consider uncertainty in ensemble streamflow prediction (ESP) within the National Weather Service River Forecasting system by developing DA to improve ESP. Nasseri *et al.* (2011) coupled EKF and genetic programming to predict urban water demand. Ricci *et al.* (2011) applied the KF algorithm to update river water level observation for better flood forecasting. Significant improvements were achieved in the water level and discharge values in both analysis and forecast modes. The proposed DA modeling technique of Li *et al.* (2012) includes integration of EnKF and SVM to decrease data predicted of SVM via variable precision rough set theory. Dumedah & Coulibaly (2014) implemented EnKF and particle filter methods for a hydrological DA procedure. Estimation of measurement error is improved via integration of pareto-optimality into the DA techniques. Li *et al.* (2014) coupled EnKF and SVM for rainfall-runoff simulation and proved better accuracy of coupled models for real-time flood forecasting. Liu *et al.* (2016) evaluated the uncertainty of coupled SVM and EnKF to estimate soil moisture in the ground. Siswantoro *et al.* (2016) applied linear KF to improve the performance of neural network classification. Wang & Babovic (2016) developed a hybrid model to improve water level forecasting. The coupled model includes a data-driven model and KF as the data assimilation technique.

In this study, we considered modified DA (MDA) as a process tool which includes the combination of two robust techniques, namely SVM as a simulation forecasting model to estimate inflow discharge of Zayanderoud reservoir and then the EnKF was used to update and balance the SVM regression (SVR) estimates by available observed historical data. According to the studies by Babovic *et al.* (2000), Mancarella *et al.* (2008) and Sun *et al.* (2010), in analyzing the error propagation in different modeling techniques, the predicted residuals or errors of models can be used to correct prediction of simulation models. Within the same framework for error correction, a modified approach is proposed in the updating process by an optimization procedure. This optimization model makes sure that the error in the next step is less than or equal to the previous time error. In the following sections the theory of SVM and EnKF is briefly described and further sections explain the implementations, results and discussions of the case study.

## METHODOLOGY

Using physical simulation models as prediction tools involves a variety of parameters which usually require calibration. An alternative approach would be a robust data-driven model based on analyzing the data about a system, and finding connections between the system state variables (input, internal and output variables) without explicit knowledge of the physical behavior of the system. In order to simulate the outlet streamflow of a sub-basin (reservoir inflow), historical data including rainfall, temperature and inflow discharge are required to develop a data-based model such as SVM. All data-based models need to be trained with collected data and then be tested for simulation. In the following study, EnKF as a data assimilation technique is implemented to improve the efficiency of an SVR model in each simulation step. The proposed approach is presented to improve the performance of integration of SVR and EnKF. The methodology of SVR, EnKF, the embedding approach and evaluation criteria are briefly described in the next sections.

### Support vector machine

*xєR*as

^{d}*d*-dimensional input vector and the output vector

*yєR*. SVM is divided into two classification and regression (SVR) methods in which SVR is more commonly used in water-related applications. Let vector (

*x*),

_{j}, y_{j}*jє*{1, 2, 3, … ,

*N*} be the set of input and output observed data. The aim of a data-based model such as SVR generally is searching for a function

*f(x*) as an approximation of the value

*y*with minimum risk, and is only based on the available independent and identically distributed data. In the SVR algorithm, the estimation function is determined by a small subset of training samples, namely support vectors. Also in this algorithm, a specific loss function called

_{i}*ɛ*-insensitive loss is developed to create a sparseness property for SVR. It means instead of minimizing the empirical error over the training data, SVR minimizes a regulated risk function which states that for achieving the minimum risk, simultaneous control of the complexity of the model and the error owing to training data is essential (principle of the structural risk minimization theory). In a linear SVR shown in Equation (1), the input vector is mapped into the

*N*-dimensional feature space by non-linear kernel transformation function

*k*(

_{j}*x*), where

*j*

*=*1, 2, … ,

*N*.

*α*and

*b*denote the weighted vector and bias term, respectively. The transformation of input

*x*vector into the

*N*-dimensional feature is shown in Figure 1.

*y*. Vapnik (1998) defined loss function to make the optimization model feasible. Thus, two slack variables are defined to determine upper and lower errors over the value and parameter

*C*indicates the loss function factor. The loss function is shown in Equation (3) which is also called -insensitive function: we used the radial basis function (RBF) as a proper kernel function. This function has proven to be more efficient than other kernel functions based on studies reported on the use of RBF (Asefa

*et al.*2006; Behzad

*et al.*2009).

### Kalman filter

The KF technique is one of the data assimilation methods rooted from the Monte-Carlo and Bayesian techniques. Generally, approximate Bayesian state estimation is considered in this method which states that the errors increase linearly and they are distributed normally. KF estimations are performed in two steps. In the updating step, the estimation of uncertain forecast state for time step *t* is adjusted and improved by a kernel function. By incorporating the observation of time step *t* in the kernel function, the forecast state will be modified and updated. In the forecasting step, the solution will propagate into the next time step (*t* + 1) by using the previous updated state variable.

EKF and EnKF are a variant of KF that can be used for the nonlinear filtering problem. Implementation of EKF is infeasible for large systems, such as hydrological systems, due to their calculation complexity. So EnKF, which considers a statistical approximation of EKF by sampling the forecast and analysis error, is substituted for EKF in this study.

*X*an ensemble of forecasts with size

_{t}^{b}*m*which randomly sample the background of model errors at time

*t*and the average of

*X*which is shown by . Similarly, the observation vector is shown by

_{t}^{b}*Y*: EnKF uses an ensemble data that contains

_{t}*m*datum in Equation (5) as an update step. In Equation (5), each ensemble member (

*i*= 1, 2, … ,

*m*) is updated to the

*i*th member of the

*X*vector based on the observation vector and observation operator that is shown by

_{t}^{a}*H*(

*x*). The

*H*operator is briefly explained in Appendix A (available with the online version of this paper). This operator is determined by the type of problem and can be equal to the identity matrix: where

*K*is the Kalman gain matrix, P

*is background-error covariance, and R is observation-error covariance. P*

^{b}*is created from ensemble data that are made from nonlinear forecasts and is determined by Equation (7). Background-error covariance is a function of deviation matrix*

^{b}*X*where its members are computed by . Similarly, the perturbed observation members are determined by

_{t}^{’b}*y*∼

_{i,t}^{t}*N*(0, R) which is normally distributed with an average equal to zero with covariance R: The components of Kalman gain matrix, which are the variance of perturbation observation operator matrix HP

^{b}H

^{T}and covariance of perturbation observation operator and background matrixes P

^{b}H

^{T}, are computed by the following equations: In the forecasting step, the model prediction transmits analysis or updated vector

*X*from time

^{a}*t*to

*t*

*+*1 by model operator

*M*(

*x*) which the prediction step follows by .

_{t}^{a}### Model development

For developing the SVR model to predict or simulate streamflow, EnKF is implemented to update and optimize SVR predictions. Figure 2 shows the combined use of SVR and EnKF. In time step *t,* SVR predicts the background vector *X _{t}^{b}* by reading input data for three regions including rainfall (

*p*

_{t}^{1},

*p*

_{t}^{2},

*p*

_{t}^{3}), temperature (

*T*

_{t}^{1},

*T*

_{t}^{2},

*T*

_{t}^{3}) and inter-basin water transfer (

*I*). The other input data is the inflow to the reservoir from the previous time step. Inflow to the reservoir from the previous time step is considered as a background vector in time step

_{t}*t*−1 and is accounted as a state variable of EnKF. The output of SVR is the current inflow to the reservoir and is adjusted for the background vector in time step

*t.*After prediction by the model forecast, the background data are updated to analyze the vector

*X*by an assimilation procedure and then the updated data will be used as input for time step

_{t}^{a}*t*

*+*1 (Figure 2).

*X*

_{t–1}

^{a}estimation. To improve the performance of SVR simulation, an error function is defined for each time step which must be minimized during each time step. To reach this goal, an optimization procedure for the forecast step is defined as follows: Subject to: where

*M*is the SVR operator,

*X*

_{t–1}

^{a}is updated data from a previous time,

*X*is the resulted decision variable from simulation by SVR,

_{t}^{b}*y*is the observation data and

_{t}*ɛ*is the noise variable. Since the input and output data of SVR are normalized, the

*ɛ*domain is limited to −1 and 1. By employing the optimization procedure, the best

*ɛ*will be found to correct the input state variable of SVR. It must be noted that

*ɛ*and

*X*are the only variables of the optimization procedure.

_{t}^{b}*Er*is an objective function of the optimization procedure within the DA process for time step

_{t}*t*which defines the process of minimizing the error ratio as a function of model output and observation data. The simulation output value, which is the state variable (

*X*), can be lower or higher than

_{t}^{b}*y*. The ratio of min(…)/max(…), which is a normalized error (similar to error measures used in any data-driven model), has a value between 0 and 1 to indicate the difference between observation and simulation values, therefore 1−

_{t}*Er*demonstrates the closeness of these two values at time step

_{t}*t*.

The proposed model is called modified data assimilation (MDA) which is an integration of the EnKF technique and an optimization procedure. This model will be implemented to predict the streamflow of a watershed which is an inflow to a downstream reservoir. MDA will be evaluated and compared with model prediction without using an EnKF technique (named forecast results), and will be compared with the model using an EnKF technique without any optimization (named DA results). The proposed procedure will address the uncertainty associated with parameters involved in hydrological modeling. A schematic diagram of different steps of this model is illustrated in Figure 2.

### Evaluation criteria

*n*is the number of data points, and

*y*and

_{i}*y*indicate the model measurements and observational values, respectively:

_{i}^{o}### Study area and data

Zayanderoud reservoir, located in central Iran (Figure 3), was selected to assess the proposed models and to measure their performances. The catchment area is 3694 km^{2} which is divided into three sub-basins. More than 23 rainfall and climate stations are located in the catchment area. ARCGIS analysis tools, such as kriging and zonal, are implemented to compute the average values of rainfall and climate data for three sub-basins. This procedure has been implemented for 30 years to develop an SVR model as a monthly prediction and simulation model. To forecast monthly inflow to the reservoir from October 2012 to September 2013, the SVR model uses 360 months of historical data for training and testing. Inputs include eight types of data such as rainfall (three zones), temperature (three zones), and inter-basin water transfer and monthly reservoir inflow at the outlet of the catchment from the previous month. The data sets were divided into two sets: a training set consisting of 25 years of data and validation set of five years of data.

## RESULTS AND DISCUSSION

Twenty-five years of data were used in the training phase of SVR modeling, and in the testing phase, five years of data were used. Results from the developed SVR model are shown in Table 1. In this table, the optimized parameters and goodness-of-fit values for two training and testing phases are highlighted. To find proper values of SVR parameters, both trial and error and cross-validation approaches were used. The results of both methods, considering the two evaluation criteria, reveal several valuable features. Firstly, both the trial and error and cross-validation approaches are relatively well behaved. Secondly, all two criteria values show the same trend in training and testing phases. Finally, the generalization performance of SVR for three criteria is closely established.

Trial and error | C= 25 ɛ= 0.1 γ = 0.9 | |

| RMSE (MCM^{a}) | R |

Training | 38.14 | 0.948 |

Testing | 45.92 | 0.89 |

Cross-validation | C= 15 ɛ= 0.9 γ = 0.7 | |

| RMSE (MCM^{a}) | R |

Training | 44.87 | 0.94 |

Testing | 49.2 | 0.9 |

Trial and error | C= 25 ɛ= 0.1 γ = 0.9 | |

| RMSE (MCM^{a}) | R |

Training | 38.14 | 0.948 |

Testing | 45.92 | 0.89 |

Cross-validation | C= 15 ɛ= 0.9 γ = 0.7 | |

| RMSE (MCM^{a}) | R |

Training | 44.87 | 0.94 |

Testing | 49.2 | 0.9 |

^{a}Million cubic meter.

The ensemble data, including 30 measured samples for the first month of assimilation, is considered. By using SVR as a forecasting model, the next month value will be predicted. The procedure of KF is separated into two forecasting and updating steps. For updating data, the forecasted results are assimilated using kernel updating of Equation (5) and incorporation of the EnKF technique. The results are the background data (streamflow) which is the output of SVR simulation corrected based on observation data. There is no guarantee that the error will not propagate for the next simulation time step. If *X*_{t−1}^{a} is the updated data, *X _{t}^{b}* will be predicted by SVR operator

*M*, for the next step time. By implementing the optimization procedure,

*X*=

_{t}^{b}*M*(

*X*

_{t−1}

^{a}) is substituted using Equation (12).

*ɛ*variable is a noise value that is summed with

*X*

_{t−1}

^{a}to force the model to simulate proper values in the frame of optimization. The results in Figure 4 compare the observed monthly average of streamflow values (state vector) with forecasted, DA and MDA solutions.

In the process of the error correction, the optimization procedure was incorporated into a conventional DA method to further reduce the error which was produced by the SVR. The analysis of error propagation is directed by the concept of error ratio in time step *t* (*Er _{t}*) and error ratio in time step

*t−*1 to obtain the proper output which happens when

*Er*is less than

_{t}*Er*

_{t–1}. To show the improved performance of the proposed method, Figure 5 shows the cumulative error ratio of the assimilation process for a period of five years (2003–2008). Monthly perturbation of forecasts and updates are shown by these error ratios which were the solution of the objective function of the optimization model. As seen in Figure 5, the error ratios show the differences between approximate values of forecasted, DA and MDA modeling with respect to the observational value. Monthly values refer to the average of the ensemble for each month. Referring to Figure 5, the trend of error propagation due to the MDA technique shows substantial improvement compared to SVR and DA, however it must be noticed that the error propagation cannot completely be eliminated. As indicated by Figure 6, the error ratio is generally reducing in 24 months of the simulation period. It is clear that in any non-physical modeling there is always a contingency of error variation in time series of predicted values, which is no exception in MDA techniques. To quantify the performance of the optimization procedure, Table 2 illustrates the variation of

*X*,

_{t}^{b}*y*, and

_{t}*Er*for the first 12 months corresponding to data in Figure 6. The decreasing trend of

_{t}*Er*from 0.27 to 0.02 with respect to state variable

_{t}*X*and observation

_{t}^{b}*y*variations explains the effectiveness of this procedure.

_{t}t (month)
. | 1 . | 2 . | 3 . | 4 . | 5 . | 6 . | 7 . | 8 . | 9 . | 10 . | 11 . | 12 . |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Er _{t} | 0.27 | 0.27 | 0.22 | 0.15 | 0.15 | 0.14 | 0.09 | 0.08 | 0.04 | 0.03 | 0.09 | 0.02 |

X (MCM) _{t}^{b} | 50.3 | 54.4 | 68.6 | 76.0 | 174.9 | 286.4 | 355.7 | 232.7 | 148.5 | 75.8 | 50.4 | 76.0 |

y (MCM) _{t} | 36.5 | 39.6 | 53.6 | 64.7 | 149.3 | 245.4 | 325.3 | 213.5 | 142.6 | 78.5 | 45.7 | 74.2 |

t (month)
. | 1 . | 2 . | 3 . | 4 . | 5 . | 6 . | 7 . | 8 . | 9 . | 10 . | 11 . | 12 . |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Er _{t} | 0.27 | 0.27 | 0.22 | 0.15 | 0.15 | 0.14 | 0.09 | 0.08 | 0.04 | 0.03 | 0.09 | 0.02 |

X (MCM) _{t}^{b} | 50.3 | 54.4 | 68.6 | 76.0 | 174.9 | 286.4 | 355.7 | 232.7 | 148.5 | 75.8 | 50.4 | 76.0 |

y (MCM) _{t} | 36.5 | 39.6 | 53.6 | 64.7 | 149.3 | 245.4 | 325.3 | 213.5 | 142.6 | 78.5 | 45.7 | 74.2 |

Figure 7 shows last-year monthly inflow computations from three modeling techniques of SVR, DA and MDA, compared to observed data. Clearly, the DA results are further corrected when the optimization procedure is integrated into the assimilation process. Although observation data for the last month is not considered in the modeling scheme, the trend of predicted values using the updated values of DA and MDA from the previous month closely follows the observation data.

To quantify the modeling performance, two error criteria were implemented to measure the significance of improvement resulting from the DA and MDA modeling techniques. In Table 3, the lower values of RMSE for the MDA technique stated improved performance of the proposed model for the planning horizon compared with the forecasted SVR or DA modeling process. Scattered diagrams of Figure 8 compare the MDA, DA and forecasted inflow for 30 years of data, which indicates a significant improvement performance of MDA. The correlation of 0.73 between observed inflow and SVR forecasted inflow is increased to 0.85 and 0.99 applying the DA and MDA techniques, respectively.

. | RMSE (MCM) . | ||
---|---|---|---|

Month . | SVR forecasted . | DA . | MDA . |

October | 52.66 | 16.55 | 13.45 |

November | 70.12 | 14.03 | 10.36 |

December | 62.97 | 5.65 | 4.54 |

January | 55.24 | 7.11 | 4.23 |

February | 71.15 | 24.79 | 9.35 |

March | 87 | 39.27 | 18.11 |

April | 99.95 | 46.47 | 14.57 |

May | 75.35 | 25.97 | 8.2 |

June | 43.55 | 15.11 | 3.32 |

July | 30.25 | 15.06 | 10.94 |

August | 16.3 | 8.21 | 6.96 |

September | 34.21 | 12.69 | 10.48 |

. | RMSE (MCM) . | ||
---|---|---|---|

Month . | SVR forecasted . | DA . | MDA . |

October | 52.66 | 16.55 | 13.45 |

November | 70.12 | 14.03 | 10.36 |

December | 62.97 | 5.65 | 4.54 |

January | 55.24 | 7.11 | 4.23 |

February | 71.15 | 24.79 | 9.35 |

March | 87 | 39.27 | 18.11 |

April | 99.95 | 46.47 | 14.57 |

May | 75.35 | 25.97 | 8.2 |

June | 43.55 | 15.11 | 3.32 |

July | 30.25 | 15.06 | 10.94 |

August | 16.3 | 8.21 | 6.96 |

September | 34.21 | 12.69 | 10.48 |

## CONCLUSIONS

In this study, the combined use of a regression support vector machine (SVR) and modified ensemble Kalman filter (EnKF) as a data assimilation process is investigated. Considering the importance of developing a real-world model to predict water inflow into the reservoir, Zayanderoud reservoir located in central Iran is selected as the case study to evaluate the performance of the proposed model. The data-based SVR model is trained to forecast inflow with a correlation coefficient of 0.95 and 0.89 for training and testing, respectively. Eight hydrologic input data including rainfall (for three regions), temperature (for three regions), and inter-basin water and reservoir inflow from the previous month are considered to predict the current inflow. By considering 30 years' measurement samples for the first month of planning horizon, the EnKF is implemented to update (DA) and optimize (MDA) the prediction values for the next month and is repeated continuously for five years. The propagation of errors is controlled when applying the DA and is further reduced by MDA techniques. It is concluded that, when EnKF as a data assimilation technique is integrated with an optimization procedure, the modified assimilated routine is capable of reducing the computational error inherited in any simulation model. Evaluation criteria such as RMSE are used to compare the results of SVR (forecast without a DA technique), DA (using EnKF) and MDA (using modified data assimilation). The quantified results indicate a better performance of the MDA approach over SVR and DA with RMSE average values of 9.54, 58.23 and 19.24 MCM, respectively. Additionally, the performance of the proposed technique was measured by comparison of computed inflow with observed data using R-squared criteria. The R-squared value for the MDA model was improved to 0.992, comparable to 0.85 of DA model.

## REFERENCES

*.*

*.*