Self-optimizer data-mining method for aquifer level prediction

Groundwater management requires accurate methods for simulating and predicting groundwater processes. Data-based methods can be applied to serve this purpose. Support vector regression (SVR) is a novel and powerful data-based method for predicting time series. This study proposes the genetic algorithm (GA)–SVR hybrid algorithm that combines the GA for parameter calibration and the SVR method for the simulation and prediction of groundwater levels. The GA–SVR algorithm is applied to three observation wells in the Karaj plain aquifer, a strategic water source for municipal water supply in Iran. The GA–SVR’s groundwater-level predictions were compared to those from genetic programming (GP). Results show that the randomized approach of GA–SVR prediction yields R values ranging between 0.88 and 0.995, and root mean square error (RMSE) values ranging between 0.13 and 0.258 m, which indicates better groundwater-level predictive skill of GA–SVR compared to GP, whose R and RMSE values range between 0.48–0.91 and 0.15–0.44 m, respectively.
doi: 10.2166/ws.2019.204
Omid Bozorg-Haddad (corresponding author), Mohammad Delpasand: Department of Irrigation & Reclamation Engineering, Faculty of Agricultural Engineering & Technology, College of Agriculture & Natural Resources, University of Tehran, Karaj, Tehran, Iran. E-mail: obhaddad@ut.ac.ir
Hugo A. Loáiciga: Department of Geography, University of California, Santa Barbara, California 93106, USA


INTRODUCTION
Groundwater is a vital source of water for municipal, industrial, and agricultural use worldwide. One important aspect of groundwater management is the prediction of groundwater levels, which is mostly done by simulating groundwater flow with numerical models. This paper proposes an alternative approach to the prediction of groundwater levels, namely a data-based methodology that relies on time series of observed groundwater levels and other relevant variables such as precipitation, recharge, and aquifer discharge. This paper shows that data-based groundwater-level prediction models constitute an alternative method for predicting groundwater levels, bypassing the implementation [...] showed GPs had the best predictive skill for groundwater levels up to 7 days beyond the recorded data. Gong et al. () evaluated the validity of three nonlinear time-series intelligence models (adaptive neuro-fuzzy inference system (ANFIS), support vector machine (SVM), and artificial neural network (ANN)) in the prediction of groundwater levels. They developed and applied these three models for two wells close to Lake Okeechobee in Florida. The latter authors employed data sets of temperature, precipitation, groundwater level, and other variables as input data to forecast groundwater levels. They also calculated five standard statistical evaluation measures (correlation coefficient, normalized mean square error, root mean square error (RMSE), Nash-Sutcliffe efficiency coefficient, and Akaike information criterion) to evaluate the performance of these models.
Their results established that the SVM and ANFIS models' predictions were more accurate than the ANN model's. Zhou et al. () applied a data-based prediction model combining a support vector machine with discrete wavelet transform preprocessing for groundwater-level forecasting. They applied regular SVM, regular ANN, and wavelet-preprocessed SVM and ANN models to monthly groundwater-level records over a period of 37 years from ten wells in Mengcheng County, China. The latter authors' results indicate that wavelet preprocessing improved the training and testing performance of the SVM and ANN models. The wavelet-preprocessed SVM model provided the most accurate and reliable groundwater-level prediction among the models compared, and the prediction results of the SVM model were superior to the ANN model's in generalization ability and precision.
A literature review shows an increasing trend in the use of data-driven methods in groundwater prediction and simulation. The ANN and ANFIS models have been more frequently applied than GP and SVR models, which is partly explained by the more recent origin of the latter two methods. The application of SVR to groundwater prediction and simulation has been hindered by the reliance on a linear objective function (Jin et al. ) and by the lack of a systematic approach to tune its parameters (Behzad et al. ).
This paper introduces a coupled GA-SVR algorithm and tests its accuracy in predicting groundwater levels. The approach selects the training (i.e., calibration) and testing groundwater-level data both chronologically and randomly. The GA is applied to optimize the SVR parameters, and the parameter-optimized SVR is implemented for groundwater-level prediction. The prediction results obtained with the GA-SVR and the GP methods are compared to assess their relative merits.

THE SVR METHOD
The SVR method was introduced by Vapnik () and has been applied to solve regression problems and to predict time series in a wide range of fields. The main idea behind the SVR method is to first select a nonlinear mapping algorithm, called the support vector kernel function, through which the input vectors are mapped onto a high-dimensional feature space and related to the output vectors. The method generalizes well on multi-dimensional prediction problems. The steps of the SVR method are detailed next. The function f represents the nonlinear relation between the inputs (x) and outputs (y) of an arbitrary process. Equation (1) shows the function f:
$y = f(x) = \omega\,\phi(x) + b$ (1)
where x = input vector belonging to an n-dimensional space (x ∈ ℝⁿ, an n × 1 vector); y = output (y ∈ ℝ, a real scalar); ω = the regression weight vector (1 × n); b = model bias (scalar); and ϕ(x) = nonlinear mapping. b and ω are calculated with Equations (2) and (3). ϕ maps the nonlinear regression between inputs and outputs onto a high-dimensional space whereby a simpler regression is achieved that replaces the complex nonlinear regression of the original input space.
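Equations (2) and (3) are not reproduced in this text. For orientation only, the standard ε-insensitive formulation of SVR, whose notation may differ from the paper's own equations, determines ω and b by solving the problem below. The sample count N and the slack variables ξ_i, ξ_i* are symbols assumed here for illustration; C and ε are the tuning parameters discussed later in the paper.

```latex
\begin{aligned}
\min_{\omega,\,b,\,\xi,\,\xi^{*}} \quad
  & \tfrac{1}{2}\lVert \omega \rVert^{2}
    + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad
  & y_i - \omega\,\phi(x_i) - b \le \varepsilon + \xi_i, \\
  & \omega\,\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \dots, N
\end{aligned}
```

In this formulation, errors smaller than ε are not penalized, and C sets the trade-off between a flat regression function and the tolerance of larger errors.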
The kernel function K is needed in the mapping process, where x′ is the transpose of x. This mapping replaces x in the original space by ϕ(x). It is not necessary to know the explicit expression of the nonlinear mapping ϕ throughout the solution process.
The kernel is appropriately chosen iteratively by the SVR method.
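The remark that ϕ need not be known explicitly can be illustrated with a small sketch. It assumes a degree-2 polynomial kernel on two-dimensional inputs, chosen only because its feature map is easy to write down; the names `phi`, `dot`, and `poly_kernel` are hypothetical and not taken from the paper.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative assumption)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


x = [1.0, 2.0]
z = [3.0, 0.5]

# Inner product computed in the 3-D feature space via the explicit map.
explicit = dot(phi(x), phi(z))
# Same quantity computed in the original 2-D space via the kernel.
implicit = poly_kernel(x, z)
```

Both routes give the same value, so a method that only needs inner products in the feature space can work with K alone.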

The selection of the kernel function K
There is no general principle to obtain the kernel function K given by Equation (4).
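Equation (4) itself is not reproduced in this text. As a hedged illustration of what a kernel with a γ parameter can look like, the sketch below implements the Gaussian radial basis function (RBF) kernel, a common choice in SVR applications. That the paper uses this kernel is an assumption here, and the function names and sample values are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(samples, gamma):
    """Kernel (Gram) matrix of all pairwise kernel values."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


# Hypothetical input vectors; gamma = 0.5 is an arbitrary illustrative value.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
K = gram_matrix(samples, gamma=0.5)
```

Larger γ values make the kernel decay faster with distance, so each training point influences a smaller neighborhood of the input space.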

OPTIMIZATION OF THE SVR PARAMETERS WITH THE GA
The accuracy of the SVR method depends largely on the tuning of its parameters (C, γ, and ε). [...] optimal in the sense that they constitute a solution that is very near the global optimal solution of the problem being solved. The GA is employed in this study to obtain optimal parameters of the SVR method. The flowchart of the coupled GA-SVR method is depicted in Figure 1.
The content of the left dotted box in Figure 1 concerns the preprocessing of data. The SVR method with the chosen kernel function is trained after identifying the training and testing data sets. The contents of the right dotted box in Figure 1 describe the procedure of determining the optimal parameters (C, γ, and ε). The iteration number in the optimization process starts from zero (i = 0), at which time the initial population is randomly generated. The SVR method is trained (calibrated), employing the population of its members (potential solutions), and the objective function is calculated for each individual member to evaluate its fitness. Population, elitism, crossover rate, mutation rate, and the number of iterations of the GA were set to 10, 1, 0.8, 0.02, and 100, respectively. The RMSE was herein employed as the objective function to be minimized in the search for the SVR optimal parameters. The training data set was applied to calculate the objective function.
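The optimization loop described above can be sketched as a small real-coded GA. This is a minimal illustration under stated assumptions, not the authors' implementation: tournament selection, arithmetic crossover, and uniform mutation are assumed because the text does not name the operators, the parameter bounds are hypothetical, and `stand_in_objective` replaces the actual objective (the training RMSE of an SVR built with the candidate C, γ, and ε) so that the sketch runs without an SVR library. Only the GA settings (10, 1, 0.8, 0.02, 100) come from the text.

```python
import random


def run_ga(objective, bounds, pop_size=10, n_elite=1, crossover_rate=0.8,
           mutation_rate=0.02, n_iter=100, seed=0):
    """Minimize `objective` over the box `bounds` with a simple real-coded GA."""
    rng = random.Random(seed)

    def evaluate(population):
        # Score every member and sort from best (lowest) to worst.
        return sorted(((objective(m), m) for m in population),
                      key=lambda s: s[0])

    def tournament(scored):
        # Assumed selection operator: binary tournament.
        a, b = rng.sample(scored, 2)
        return a[1] if a[0] <= b[0] else b[1]

    # Iteration i = 0: random initial population.
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []
    for _ in range(n_iter):
        scored = evaluate(population)
        history.append(scored[0][0])
        # Elitism: copy the best member(s) unchanged.
        next_pop = [list(m) for _, m in scored[:n_elite]]
        while len(next_pop) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            if rng.random() < crossover_rate:
                w = rng.random()  # assumed arithmetic crossover
                child = [w * a + (1.0 - w) * b for a, b in zip(p1, p2)]
            else:
                child = list(p1)
            for k, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[k] = rng.uniform(lo, hi)  # assumed uniform mutation
            next_pop.append(child)
        population = next_pop
    scored = evaluate(population)
    history.append(scored[0][0])
    return scored[0][1], scored[0][0], history


def stand_in_objective(params):
    """Hypothetical smooth surrogate for the SVR training RMSE."""
    c, gamma, eps = params
    return ((c - 10.0) ** 2 + (gamma - 0.5) ** 2 + (eps - 0.1) ** 2) ** 0.5


# Hypothetical search ranges for (C, gamma, epsilon).
bounds = [(0.1, 100.0), (0.001, 1.0), (0.001, 1.0)]
best_params, best_score, history = run_ga(stand_in_objective, bounds)
```

Because the best member is copied unchanged into each new generation, the best objective value recorded in `history` never increases from one iteration to the next.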
The splitting of the data into training and testing data sets is discussed later. The RMSE and regression R² criteria were implemented to assess the error and accuracy of the hybrid GA-SVR method. The RMSE and R² formulas are given respectively by Equations (5) and (6). [...]
where $h^{j}_{t}$ = groundwater level of the jth well at time step t; j = index of the observation wells: Shah-Abbasi (j = 1), Tarbiat-Moallem (j = 2), and Mehr-Shahr (j = 3); t = monthly time step; $h^{1}_{t-1}$ = groundwater level of well number 1 at time step t − 1; $h^{2}_{t-1}$ = groundwater level of well number 2 at time step t − 1; $h^{3}_{t-1}$ = groundwater level of well number 3 at time step t − 1; $P_{t-1}$ = precipitation depth at time step t − 1; $P_{t}$ = precipitation depth at time step t; $EV_{t-1}$ = evaporation depth at time step t − 1; $EV_{t}$ = evaporation depth at time step t; $f^{j}_{1}$ = prediction model for the jth well with groundwater data (Model 1); f
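Equations (5) and (6) are not reproduced in this text. The sketch below shows how the two criteria are commonly computed. The RMSE follows its standard definition. For R², the coefficient of determination is assumed; the squared Pearson correlation is another common definition, and the text does not say which one the paper uses. The groundwater levels are hypothetical values in metres.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted groundwater levels (m).
observed = [10.0, 10.5, 11.0, 10.8]
predicted = [10.1, 10.4, 11.2, 10.7]

error = rmse(observed, predicted)
fit = r_squared(observed, predicted)
```

A lower RMSE and an R² closer to 1 both indicate a closer fit, which is how the models are ranked in the results.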

RESULTS AND DISCUSSION
This work applied three different approaches to select the training and testing data sets of the GA-SVR method. In the first approach, all the available data in the statistical period (84 months in total) were employed for training. The purpose of this approach is to assess the ability of the GA-SVR to reproduce groundwater-level fluctuations.
The second approach divides the data into training and testing sets chronologically. The first 6 years of the statistical period were selected for training the GA-SVR method, and the last year was used to test the method. The aim of this division is to assess the ability of the GA-SVR method to estimate groundwater levels with non-trained data, i.e., to test the predictive accuracy of the trained SVR with a data set not used in its training.
In the third approach, as in the second, the data were divided into two sets: training and testing. The difference introduced in the third approach is the random selection of the data sets. The purpose of the third approach is to increase the probability of finding similarity in the correlation structure between the training and testing data sets.
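The three selection approaches can be sketched with index lists over the 84 monthly records. The chronological split (72 training months, 12 testing months) follows the text. Using the same 72/12 proportion for the random split is an assumption, as are the function names and the fixed seed.

```python
import random

N_MONTHS = 84   # length of the statistical period (from the text)
N_TRAIN = 72    # first 6 years for training (from the text)


def chronological_split(n_total, n_train):
    """Approach 2: the first n_train months train, the remaining months test."""
    idx = list(range(n_total))
    return idx[:n_train], idx[n_train:]


def random_split(n_total, n_train, seed=0):
    """Approach 3: months are assigned to training or testing at random."""
    rng = random.Random(seed)
    idx = list(range(n_total))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])


# Approach 1: every month is used for training; there is no testing set.
all_train = list(range(N_MONTHS))

chrono_train, chrono_test = chronological_split(N_MONTHS, N_TRAIN)
rand_train, rand_test = random_split(N_MONTHS, N_TRAIN)
```

Both splits partition the same 84 months; they differ only in which months end up in the testing set.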
Assessing the ability of the GA-SVR method to reproduce groundwater-level fluctuation
The prediction and simulation models consider all the available data (84 months) in the first approach. The GA-SVR method was trained with all available data to test its ability to reproduce groundwater levels. The GA-SVR calculated parameters and the error criteria are listed in Table 1.
The differences between the observation data and the calculated results from the best prediction and simulation models for each well are illustrated in Figure 3. Among prediction and simulation models, the model with the lowest value of the RMSE objective function was the best one. It is seen in Figure 3 and Table 1 that the GA-SVR method is able to fit the selected models to the groundwater-level data quite well. One noteworthy issue in Table 1 is that the prediction model of the Shah-Abbasi well increases its accuracy with the addition of surface data, which is not the case in the other wells.
The models that consider the surface variables (precipitation and evaporation) in prediction and simulation of the Shah-Abbasi well's groundwater level are 0.03 and 0.04 m more accurate than the models that only consider the subsurface factors. The reason may be found in the data correlation structure. The correlation structure of the observation wells' groundwater levels is listed in Table 2.
It is seen in Table 2 that the groundwater level of the Shah-Abbasi well is correlated with the groundwater level of the Shah-Abbasi well at time step t − 1 (the previous time step), with the groundwater level of the Mehr-Shahr well at time steps t and t − 1, and with the evaporation at time step t at the 1% significance level. The groundwater

Chronological selection of the training and testing data
The GA-SVR method was trained with the training data set and the parameters (C, γ, and ε) were determined. The predictive skill of the GA-SVR method was tested using the optimized parameters and the testing data. The optimized parameter values are listed in Table 3.
The calculated error criteria are listed in Table 4. The contents of Table 4 show that the first and fourth models for the Shah-Abbasi well fit the data quite well, while the results of Models 2 and 3 are not acceptable because of the correlation structure of the data. The correlation structure of the training and testing data is shown in Table 5. It is seen in Table 5 that the correlation structures of the training and testing data differ. The SVR is a data-based method. Therefore, the SVR's predictive accuracy drops dramatically if the structure of the testing data is [...] According to Table 5 [...] The calculated results shown in Table 4 indicate that the GA-SVR method has superior predictive accuracy compared to GP. Applying the coupled GA-SVR method improved the predictive and simulation accuracies by about 5% and 35%, respectively, in comparison with the GP model for the Shah-Abbasi well. The predictive accuracy of the prediction models obtained with the GP algorithm for the Tarbiat-Moallem well was 4% better than that of the GA-SVR method. For the simulation models, the accuracy of the GA-SVR method is about 7% higher than the GP's.
The application of the GA-SVR method in the prediction models for the Mehr-Shahr well resulted in 18% better accuracy than that achieved with GP; however, the simulation model with GP exhibited a 9% improvement in comparison to the GA-SVR method. [...] (7) and (8) and the correlation structure of the training and testing data presented in Table 5.
The evaporation at the t − 1 time step was included in Model 2. The training data for the groundwater level of the Shah-Abbasi well show significant correlation with evaporation at the t − 1 time step; however, this correlation is not seen in the testing data. In summary, the GA-SVR method is trained in a manner that specifies weights to link the evaporation at the t − 1 time step to the groundwater level in the Shah-Abbasi well, but these weights introduce errors in the GA-SVR method with Model 2 because of the dissimilar correlation structure of the testing data. Similar analyses can be performed for the other models and wells.
This means that the accuracy of the GA-SVR method in prediction and simulation of the groundwater level is higher than the GP's, but it is also more sensitive to the correlation structure of the data in comparison to GP. The results of GA-SVR method are presented in Figure 4.

Random selection of the training and testing data
The previous results established that the selection of the training and testing data set affects the GA-SVR method's performance. The recorded monthly time series data used in this work constitute an extensive hydrologic time series.
The hydrologic and climatic indicators have a specific pattern of temporal fluctuations. The second and third approaches are the principal methods to select the training and testing data. The second approach selected the data chronologically. One of the advantages of that approach is capturing seasonal fluctuations of the time series. It is essential to notice that selecting the training and testing data sequentially does not mean that the correlation structure between these two data sets is well preserved. The third approach selects the training and testing data randomly.
The specific time patterns in the original data series might not be preserved in this approach. However, the random selection preserves the correlation structure between the training and testing data better than the chronological selection of the same data sets. The dissimilarity of the training and testing data correlation structures was observed in the second selection approach in this study. The third approach was implemented to evaluate the effect of the training and test data selection on prediction accuracy. The correlation structure of the randomly selected training and testing data sets is listed in Table 6.
Comparison of the results listed in Table 6 with those of Table 2 indicates that the correlation structure between the training and testing data sets conforms well with the correlation structure of the total data set, something not observed with the second approach, wherein the two data sets (training and testing) were chosen chronologically. It is also evident in Table 6 that there is similarity in the correlation structure between the training and testing data sets. Recall from Table 5 that there is no clear similarity between the training and testing data in the second approach, and that differences in correlation structure were observed between the first and second approaches. These results suggest that the third approach, i.e., random selection of the training and testing data sets, is more appropriate than the second approach of chronological selection of the data sets.
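The comparison of correlation structures can be sketched by computing the Pearson correlation between an input variable and the groundwater level separately on the training and testing subsets. The series below are synthetic and purely illustrative: they are built so that the level depends on the previous month's evaporation only during the first 72 months, which mimics the kind of dissimilarity described for the chronological split. None of the numbers are the paper's data, and all names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def subset_correlation(x, y, indices):
    """Correlation restricted to the records listed in `indices`."""
    return pearson([x[i] for i in indices], [y[i] for i in indices])


rng = random.Random(1)
n = 84
# Synthetic lagged evaporation with a 12-month cycle plus noise.
evap_prev = [50.0 + 40.0 * math.sin(2.0 * math.pi * t / 12.0) + rng.gauss(0.0, 5.0)
             for t in range(n)]
# Synthetic level: tied to evaporation in months 0-71, pure noise afterwards.
level = [(-0.02 * e if t < 72 else 0.0) + rng.gauss(0.0, 0.2)
         for t, e in enumerate(evap_prev)]

# Chronological split: first 72 months train, last 12 test.
chrono_train, chrono_test = list(range(72)), list(range(72, n))
# Random split of the same sizes (the 72/12 proportion is an assumption).
shuffled = list(range(n))
rng.shuffle(shuffled)
rand_train, rand_test = shuffled[:72], shuffled[72:]

r_chrono = (subset_correlation(evap_prev, level, chrono_train),
            subset_correlation(evap_prev, level, chrono_test))
r_random = (subset_correlation(evap_prev, level, rand_train),
            subset_correlation(evap_prev, level, rand_test))
```

Comparing the two values in `r_chrono` with the two in `r_random` shows how similar the training and testing correlation structures are under each selection approach, which is the kind of comparison the tables above summarize.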
The GA-SVR method was run to evaluate the third approach. The calculated results are presented in Table 7.
Comparing Tables 4 and 7

CONCLUSION
This paper introduced the coupled GA-SVR method for the prediction and simulation of groundwater levels. The GA-SVR was tested with a data set from the Karaj aquifer in Iran. The selection of the training and testing data sets affects the GA-SVR method's performance. Therefore, three approaches for selecting the calibration and testing data sets were considered: (1) all of the available data were employed for training; (2) the data were divided into training and testing sets chronologically; and (3) the data were divided into training and testing sets randomly. This paper's results have established that the GA-SVR method exhibits accurate prediction skill when its parameters are optimized and the training and testing data sets are chosen randomly. In addition, the results demonstrate the GA-SVR algorithm's