The optimization design of surfactant-enhanced aquifer remediation for dense non-aqueous phase liquids (DNAPLs)-contaminated groundwater is proposed through integrating simulation and optimization models. Many studies have demonstrated that the surrogate model is an effective tool for building a bridge between simulation and optimization models. In this paper, the simulation model was first established to simulate a surfactant-enhanced aquifer remediation process, and on this basis, the ensemble surrogate models were constructed by applying the polynomial regression model, radial basis function neurons network and Kriging surrogate models. Second, the best surrogate model was selected in terms of the three performance indices. Lastly, a non-linear programming optimization model was constructed with the target of minimizing the DNAPLs-contaminated aquifer remediation cost. Meanwhile, the best surrogate model was embedded into the optimization model as a constrained condition, and it was used to reflect the non-linear complex relationship between injection and extraction rates with DNAPLs removal rates instead of simulation models. The results showed that the ensemble surrogate models improved the performance of the single surrogate models. Moreover, ensemble surrogate models improved the computational efficiency, and the optimal strategies have been proved to be an effective guide for contaminant remediation processes.

## INTRODUCTION

Dense non-aqueous phase liquids (DNAPLs) are typically immiscible with water and denser than water, and they are widely used in industrial and manufacturing operations (Liang & Falta 2008). DNAPLs may act as long-term sources of groundwater contamination and lead to a great risk for drinking water resources. Because of low solubility, high interfacial tension and the sinking tendency of DNAPLs, as well as the complex subsurface condition of remediation sites, it is difficult to remediate groundwater contaminated by DNAPLs (Falta *et al.* 2005). Traditional remediation methods included cosolvent flooding, steam flooding, air sparging, and soil vapor extraction. More recently, surfactant-enhanced aquifer remediation (SEAR), which was developed on the basis of pump-and-treat techniques, has been widely considered as one of the most promising techniques to remediate DNAPL contaminations (Schaerlaekens *et al.* 2006). It is necessary to optimize the SEAR remediation design because the remediation is very expensive.

An optimization model was constructed on the basis of a simulation model, and the simulation model was used to forecast the fate of contaminants in the subsurface under various conditions and describe the response relationship between the inputs and outputs of the groundwater system. It is necessary to embed the simulation model into an optimization model. The traditional methods, such as the embedding method, the response matrix method and the state transition equation method, have limitations in practical applications.

In recent years, surrogate models have also been proved effective in embedding simulation models into optimization models. The surrogate models mainly included the polynomial regression model, Kriging, artiﬁcial neural networks (ANNs), and support vector regression, and were widely used in various areas. Cooper *et al.* (1998) presented a simulation-optimization model for supporting the light non-aqueous phase liquid recovery process decisions, and the regression model was used as a surrogate model. Shourian *et al.* (2008) built an optimization model for water allocation planning at basin scale, and an ANN model was trained as a surrogate model to approximate the simulation model. Fu *et al.* (2010) also proposed ANN as a surrogate for the simulation of the urban wastewater system. Simpson *et al.* (2001) used Kriging models as alternatives for constructing global approximations in a real aerospace engineering application. Yun *et al.* (2009) proposed a multi-objective optimization method based on support vector regression as a surrogate model. Some research efforts have also been made for the surrogate models of the SEAR remediation operations. Huang *et al.* (2003) and Qin *et al.* (2007) presented integrated simulation-optimization systems for supporting decisions of SEAR. The surrogate models were both dual-response surface models. He *et al.* (2008) developed a simulation-based fuzzy chance-constrained programming model, and the surrogate model was built with ANN, which was used to reﬂect the relationship inputs and outputs instead of the simulation model. The surrogates used in SEAR remediation were discussed in detail by Qin *et al.* (2009). Hence, the ensemble surrogate models were proposed to improve the performance of single surrogate models. However, few efforts have been made to apply the ensemble surrogate model in this area.

The simulation-optimization process was carried out through integrating the simulation, surrogate model, and optimization methods, which was solved by simulated an annealing algorithm (Ramesh & Slobodan 2002). The flow chart of a simulation-optimization model is shown in Figure 1. The simulation model had been first established to simulate a surfactant-SEAR process by UTCHEM. Second, the Latin hypercube sampling (LHS) method and the multiphase flow simulation models were used for collecting input-output samples. Steps 1 and 2 were the research basis; parameter identification and calibration for the simulation model had been performed, so the simulation model behaved well enough to forecast the fate of contaminants in the subsurface under various conditions. The process was not described in detail because the main research concentrated on the surrogate models. In this paper, the polynomial regression model, RBFNN (radial basis function neurons network), and Kriging were built as the surrogate model, and the ensemble surrogate models were constructed with various combinations of single surrogate models, and the best surrogate model was selected. Finally, a non-linear programming optimization model was built to minimize the contaminated DNAPLs aquifer remediation cost, and so the optimal remediation strategy would be obtained for decision-makers with satisfying the remediation requirement.

## METHOD

### Polynomial regression model

*et al.*2005; Chang & Hsu 2006).

### Radial basis function neurons network

*et al.*2011; Luo & Lu 2014). A single-output RBF neural network with

*m*hidden layer neurons can be expressed as where = the output neurons in the output layer, = the weight connection between the

*i*th hidden and the output neurons, = the threshold value of the output neurons (Chattopadhyay 2007).

The learning process of the RBF network is divided into two phases: (1) the input layer-hidden layer, where the K-means clustering center method is usually used to allocate the centers and the spread of the Gaussian functions in the input space; and (2) the hidden layer-output layer, where the weights connecting the hidden and output neurons are estimated by the least mean square method (Han *et al.* 2011). The performance of an RBFNN mainly depends on the number of neurons in the hidden layer. In this paper, the optimal number of neurons in the hidden layer is determined by a trial-and-error process.

### Kriging

The Kriging method was first used in the 1960s by Krige, and then developed by Matheron (Kleijnen 2009). In recent years, Kriging models have attracted much attention as a special technique and been widely used in many areas.

*n*points denoted by and the associated response is denoted by ;

*Y*can be expressed as where

*f*(

*x*) represents the regression function model, and

*Z*(

*x*) is a model of a Gaussian and stationary stochastic process with mean of 0 and variance of , which can be written as where denotes the matrix of regression coefficients. In the stochastic process, the errors for the samples are related and the correlation is linked to the distance between the corresponding samples, and usually is expressed as where

*R*represents the

*n*×

*n*correlation function matrix between and , which can be formed by the Gaussian correlation function with a single correlation parameter

*θ*. The covariance of is expressed as The unknown correlation parameters

*β*,

*θ*, and can be estimated by maximizing the log-likelihood function (Forrester & Keane 2009; Huang

*et al.*2011): by differentiating the log-likelihood function with respect to

*β, θ,*and , respectively, and letting them be equal to zero (Lophaven

*et al.*2002; Fu

*et al.*2009).

### The ensemble surrogate models

*F*(

*x*) was the ensemble surrogate model, was the

*i*th surrogate model, was the weight coefficient of the

*i*th surrogate models,

*n*was the number of surrogate models. If the ensemble models were constructed by Kriging (K) and radial basis neural network (R), the ensemble models (KR) can be expressed as So the other ensemble models can also be expressed as The weight coefficients were calculated according to PRESS (predicted residual sum of squares) weighted average surrogate, as follows: where was the PRESS error of the

*i*th surrogate,

*n*was the number of surrogate models (Goel

*et al.*2007).

Excel 2003 and Matlab were used to calculate the parameters of the surrogates developed in the present study. The root mean squared error (RMSE), the maximum absolute error (MAE) and coefficient of determination (*R*^{2}) were used in order to assess the effectiveness of each model and its ability to approximate the simulation model (May *et al.* 2008). RMSE, MAE, and *R*^{2} were employed to evaluate the accuracy of surrogate models. The lower RMSE and MAE were, the nearer the *R*^{2} value to 1, the more approximated to the simulation model the surrogate model was.

## CASE STUDY

To illustrate the application of the ensemble surrogate model in the groundwater remediation process, an illustrative PCE (petrachloroethylene)-contaminated aquifer was selected as the case study.

The study area was 882 m^{2}, 49 m in the *x* direction and 18 m in the *y* direction. The thickness of aquifer in porous layers is 24 m, and the aquifer is confined and heterogeneous anisotropic. Two soil types, clay and sand, occupied the simulation domain. Initial conditions were prescribed such that the groundwater ﬂowed from left to right, with a hydraulic gradient of 0.004706, and the west and east boundaries were set as first-type boundaries, south and north were set as no-flow boundaries. The other main parameters are listed in Table 1.

Parameter | Value | Parameter | Value |
---|---|---|---|

Porosity | 0.32 | Hydraulic gradient | 0.004706 |

Longitudinal dispersivity | 3 m | Transverse dispersivity | 0.3 m |

PCE solubility in water | 240 mg/L | PCE/water interfacial tension | 44.67 dyn/cm |

Water density | 0.998 g/cm^{3} | PCE density | 1.623 g/cm^{3} |

Water viscosity | 1.00 cp | PCE viscosity | 0.89 cp |

Residual water saturation | 0.24 | Residual PCE saturation | 0.17 |

Parameter | Value | Parameter | Value |
---|---|---|---|

Porosity | 0.32 | Hydraulic gradient | 0.004706 |

Longitudinal dispersivity | 3 m | Transverse dispersivity | 0.3 m |

PCE solubility in water | 240 mg/L | PCE/water interfacial tension | 44.67 dyn/cm |

Water density | 0.998 g/cm^{3} | PCE density | 1.623 g/cm^{3} |

Water viscosity | 1.00 cp | PCE viscosity | 0.89 cp |

Residual water saturation | 0.24 | Residual PCE saturation | 0.17 |

According to the contaminant distribution shown in Figure 2, a SEAR system with three extraction wells and three injection wells at a constant flow rate of 80 m^{3}/d and a 4% surfactant solution was designed. Numbers 1–3 are injection wells and numbers 4–6 are extraction wells. The simulations were conducted using a numerical simulator UTCHEM.

### Surrogate model

In all the surrogate models, the injection and extraction rates were explanatory variables, and DNAPLs removal rates were considered as system response variables. The injection and extraction rates were set as the inputs into the simulation model, and UTCHEM software was called, and finally DNAPLs removal rates were calculated as the system response. The sampling was performed with the LHS technique. LHS provides a random sampling but ensures a stratiﬁed sample within the full range of each dimension of the sample space (Hora & Helton 2003; Davey 2008). For this study, a total of 160 samples were obtained by running the simulation models. They were divided into training and testing sets. The training set with 140 samples was used for constructing the surrogate models, and the remaining 20 samples were used for testing the surrogate models, as shown in Table 2.

Inputs (extraction and injection rate) m^{3}/d | Outputs (DNAPLs removal rates) | ||||||
---|---|---|---|---|---|---|---|

Q_{1} | Q_{2} | Q_{3} | Q_{4} | Q_{5} | Q_{6} | Year (%) | |

1 | 7.97 | 37.66 | 75.43 | 56.73 | 46.26 | 18.08 | 57.02 |

2 | 38.00 | 59.96 | 23.90 | 46.83 | 38.49 | 36.53 | 71.17 |

3 | 39.89 | 27.91 | 4.84 | 58.42 | 8.01 | 6.21 | 57.29 |

4 | 5.38 | 62.59 | 1.59 | 56.10 | 0.66 | 12.79 | 46.45 |

5 | 12.69 | 11.13 | 58.73 | 40.21 | 7.86 | 34.47 | 52.05 |

6 | 35.87 | 38.98 | 62.22 | 48.70 | 52.72 | 35.64 | 73.74 |

7 | 28.78 | 47.27 | 51.61 | 12.49 | 37.70 | 77.46 | 70.19 |

8 | 16.60 | 49.15 | 22.97 | 16.33 | 13.99 | 58.40 | 56.66 |

9 | 75.86 | 76.27 | 11.41 | 57.08 | 27.39 | 79.08 | 80.33 |

10 | 45.91 | 28.12 | 30.74 | 54.09 | 13.68 | 36.99 | 69.36 |

Inputs (extraction and injection rate) m^{3}/d | Outputs (DNAPLs removal rates) | ||||||
---|---|---|---|---|---|---|---|

Q_{1} | Q_{2} | Q_{3} | Q_{4} | Q_{5} | Q_{6} | Year (%) | |

1 | 7.97 | 37.66 | 75.43 | 56.73 | 46.26 | 18.08 | 57.02 |

2 | 38.00 | 59.96 | 23.90 | 46.83 | 38.49 | 36.53 | 71.17 |

3 | 39.89 | 27.91 | 4.84 | 58.42 | 8.01 | 6.21 | 57.29 |

4 | 5.38 | 62.59 | 1.59 | 56.10 | 0.66 | 12.79 | 46.45 |

5 | 12.69 | 11.13 | 58.73 | 40.21 | 7.86 | 34.47 | 52.05 |

6 | 35.87 | 38.98 | 62.22 | 48.70 | 52.72 | 35.64 | 73.74 |

7 | 28.78 | 47.27 | 51.61 | 12.49 | 37.70 | 77.46 | 70.19 |

8 | 16.60 | 49.15 | 22.97 | 16.33 | 13.99 | 58.40 | 56.66 |

9 | 75.86 | 76.27 | 11.41 | 57.08 | 27.39 | 79.08 | 80.33 |

10 | 45.91 | 28.12 | 30.74 | 54.09 | 13.68 | 36.99 | 69.36 |

The polynomial regression model, RBFNN, and Kriging were used to build the surrogate models, and the ensemble surrogate models were constructed according to single surrogate models, then the best surrogate model was selected according to the three performance indices.

### Optimization model

In the optimization model, minimizing the operating system costs was considered as the objective function, and operating cost was simplified as the function with extraction and injection rates. Therefore, the objective function can be transformed to minimize the total extraction and injection rates. The total extraction and injection rates are the decision variables, with the main constraints being: (1) extraction and injection rates meet both the lower and upper limit; (2) the total extraction rates equal the injection rates, which does not exert impact on groundwater flow; (3) the final surrogate model, which reflects the relationship between extraction and injection rate and DNAPLs removal rates; and (4) the remediation aims to determine the most cost-effective strategy for removing more than 80% of the DNAPLs contamination.

## RESULTS AND DISCUSSION

### Surrogate models

*newrb*function in MATLAB was called. The optimal number of the neurons in the hidden layer was identiﬁed using a trial and error procedure by changing the number of hidden neurons from 1 to 20. The effect of number of hidden neurons on the RMSE of the testing samples is shown in Figure 3. It can be seen that the RMSE reached the minimum when the number of hidden neurons was 12, so the numbers of hidden neurons were set as 12. Hence, a three-layer ANN model was built with six neurons in the input layer, 12 neurons in the hidden layer and one neuron in the output layer. The spread was 1.9, and the target error of training is 0.003; as shown in Figure 4, the error performance is 0.0028, which is less than the target error of training.

The Kriging model was completed with the computing program, dacefit and predictor function in a MATLAB toolbox. It was expressed as [dmodel, perf] = dacefit(*S*, *Y*, regr, corr, theta0). *S* and *Y* are, respectively, the input and output data sets, second order polynomial and Gaussian function were selected as regression models and correlation models, and theta0 was the initial value of the parameter . The Kriging model parameters corresponding to the initial sample are listed in Table 3.

θ_{1} | θ_{2} | θ_{3} | θ_{4} | θ_{5} | θ_{6} | σ^{2} |
---|---|---|---|---|---|---|

0.1242 | 0.1130 | 0.1655 | 0.1879 | 0.1000 | 0.1809 | 3.8592 |

θ_{1} | θ_{2} | θ_{3} | θ_{4} | θ_{5} | θ_{6} | σ^{2} |
---|---|---|---|---|---|---|

0.1242 | 0.1130 | 0.1655 | 0.1879 | 0.1000 | 0.1809 | 3.8592 |

Figure 5 illustrates a comparison among predicted and simulated DNAPLs removal rates for testing by three surrogate models. The results showed that the surrogate models were all accurate in forecasting the DNAPLs removal rate, which implies that they are capable of being alternatives of the simulation model. However, the Kriging model approximates best the simulation model, and the polynomial response surface model performs worst. The RMSE for the Kriging and polynomial regression model is 1.30 and 2.30, respectively, the MAE 2.37 and 3.94, respectively, and the *R*^{2} 0.99 and 0.97, respectively.

KR | KP | RP | KRP | |
---|---|---|---|---|

K | 0.542 | 0.745 | – | 0.455 |

R | 0.458 | – | 0.713 | 0.386 |

P | – | 0.255 | 0.287 | 0.159 |

KR | KP | RP | KRP | |
---|---|---|---|---|

K | 0.542 | 0.745 | – | 0.455 |

R | 0.458 | – | 0.713 | 0.386 |

P | – | 0.255 | 0.287 | 0.159 |

The results show that the better the single surrogate model performs, the bigger weight coefficients are. On the other hand, the weight coefficients for the single surrogate models that perform poorly are relatively smaller.

Figure 6 shows the modeling results of the different ensemble surrogate models. It is obvious that the KRP model fits the predicted and simulated DNAPL removal rate best. It can also be seen in Figure 7 that the minimum and maximum value of *R*^{2} for all the surrogates (single and ensemble) is 0.975 by the polynomial regression model and 0.993 by the KRP model, and the better ﬁt between observed and predicated values, and the bigger *R*^{2} value would be, until it reached 1. The minimum and maximum value of MAE is 2.12 by the KRP model and 3.94 by the polynomial regression model, minimum and maximum value of RMSE is 1.15 by the KRP model and 2.30 by the polynomial response surface model; MAE and RMSE demonstrate a similar trend. The lower the RMSE and MAE, the more accurate the prediction is. The KRP model has the lower RMSE/MAE values and higher *R*^{2} compared with the other models, which implies that KRP has a superior performance in reflecting the non-linear relationships between explanatory and response variables.

Figure 7 indicates that all three single surrogate models can be used for constructing a linkage between extraction and injection rates and the DNAPLs removal rates of the real groundwater system. Polynomials could capture the global trend over the entire input space, which is relatively easy to construct and has the simplest type of parameters, but the accuracy is not satisfactory. The RBF model is appropriate to reproduce the non-linear problem, and the accuracy is satisfactory, but there is a risk of overfitting. The Kriging model is the combination of a regression model and a localized deviation model based on spatial correlation of samples. It could interpolate at the sample points and is more accurate for non-linear problems. Considering the overall characteristics of all single surrogate models, the ensemble of surrogates not only improved the performance of single surrogate models, but also performed more stably than the single surrogate models. In this paper, the ensemble surrogate model, which was composed of the polynomial regression model, the radial basis function neurons network, and the Kriging (KRP model) performed the best among all the surrogates, and the polynomial regression model performed the worst among all the surrogates. However, the ensemble needed many parameters for all the single surrogate models, and it took a long training time to get the parameters. Hence, the ensemble of surrogates method regarding optimizing parameters should be further studied.

### The optimization results

The simulated annealing method was used to solve the optimization model. The optimal pumping rates *Q*_{1}–*Q*_{6} were 73.89, 29.13, 25.83, 58.5, 15.31 and 54.97 m^{3}/d, respectively. The remediation effectiveness was 80.06%, and the total extraction rate and injection rate were 257.69 m^{3}/d, respectively. If the total extraction rate and injection rate remain constant, that is, the remediation cost remains constant, the extraction rate and injection rate are randomly generated, the optimal pumping rates *Q*_{1}–*Q*_{6} were 58.89, 39.13, 30.83, 43.57, 20.31, 20.31 and 64.97 m^{3}/d, respectively, and the remediation effectiveness was 77.79%. It was obvious that the optimization model based on the ensemble surrogate model effectively improved DNAPLs removal rate even with the same cost.

In the optimization model, the remediation levels can be changed according to the requirements. If the optimal strategy needs to be implemented in the water supply region, the remediation levels should be as high as possible, because it is important to protect drinking water resources. If the optimal strategy needs to be implemented in remote areas, where there are few people, the remediation levels can be relatively low. The variations in the total extraction and injection rates for the different remediation levels (DNAPLs removal rates) are shown in Figure 8. As can be seen in Figure 8, the remediation levels varied from 50 to 90%. As the remediation levels improved, which means the values of DNAPLs removal rates became larger, the total extraction and injection rates also increased. It is significant for decision-makers to obtain the optimal strategies under different situations.

The surrogate model was an important part of the optimization process as constraints, and it could capture better the response relationship between the extraction (injection) rates and DNAPLs removal rates. If the optimal solution was obtained under 2,000 times to call the simulation model or surrogate models in the optimization process, it would take 30 minutes to assess the solution feasibility by calling the simulation model in the optimization process, so it needed 52 days to get the optimal solution; however, it took only 2 minutes by calling the surrogate model. The results showed that the surrogate model reduced the huge computational burden and remediation cost and it could approximate the best solution quickly.

## CONCLUSIONS

A numerical scheme was developed on the integrated simulation and optimization model for exploring groundwater remediation with DNAPLs contamination. The framework included the UTCHEM model that simulated a three-dimensional, multi-component, multiphase, compositional surfactant flooding process. This was followed by the ensemble surrogate model to reflect the non-linear complex relationship between injection and extraction rates with DNAPLs removal rates instead of the simulation model. Finally, an optimization model to optimize the operation cost with the ensemble surrogate model was proposed. The results show that the ensemble surrogate models not only improved the performance of single surrogate models, but also performed more stably than the single surrogate models. However, the ensemble surrogate models needed many parameters for all the single surrogate models; further study is needed to provide the parameters more quickly and accurately. Moreover, ensemble surrogate models could reduce the computational burden and save a lot of time. The optimal strategies have proved themselves as an effective guide for decision-makers in the DNAPLs contaminants remediation process.

There are a large number of uncertainty factors during the optimization process based on the surrogate model, which inevitably have an effect on the reliability of the remediation design. Future studies need to be undertaken to investigate the uncertainty source and type, consider how the uncertainty is introduced into the optimization model, and how to combine the optimization design and its reliability.

## ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Funds of China (no. 41072171 and 41372237) and Key Laboratory of Groundwater Resources and Environment, Ministry of Education, Jilin University, Changchun, China.