In most instances, the number, location and release intensity of groundwater contamination sources (GCSs) are all unknown. The 0-1 mixed integer nonlinear programming optimization model (0-1 MINPOM) used previously could only identify the location and release intensity for GCSs. Thus a 0-1 MINPOM was improved and applied to simultaneously identify the number, location and release history of GCSs. In addition, to reduce the time of calling the simulation model for iterative calculation, the deep belief neural network (DBNN), long short-term memory network (LSTM) and Kriging method were applied to establish an ensemble surrogate model of the simulation model to replace the simulation model for iterative calculation. The results show that compared with the three single surrogate models, the ensemble surrogate model has the highest approximation accuracy to the simulation model, and could save 99% of the calculation time when iterative calculation was performed in place of the simulation model. The improved 0-1 MINPOM could identify the number, location, and release history of GCSs simultaneously. The accuracy of identifying the number of GCSs was 100%, and the accuracies of identifying the locations and release history were above 95.12% and 83.51%, respectively.

  • A 0-1 mixed integer nonlinear programming optimization model (0-1 MINPOM) was improved and applied to the identification of GCSs.

  • The number, location, and release history of GCSs were identified simultaneously based on the improved 0-1 MINPOM.

  • The DBNN, LSTM and Kriging methods were applied to establish an ensemble surrogate.

  • The ensemble surrogate was linked to the improved 0-1 MINPOM for iterative calculation.

Groundwater contamination is concealed beneath the surface of the earth and cannot be detected in time after it occurs, which makes it difficult to understand the status of groundwater contamination sources (GCSs) (Lapworth et al. 2012). As a consequence, there are great challenges for contamination risk assessment, contamination remediation and identifying responsibility for contamination accidents (Sun et al. 2006). Developing an effective method for identifying GCSs would address these problems and have great significance (Han et al. 2020).

The identification of GCSs roughly started in the 1980s (Gorelick et al. 1983). After nearly 40 years of development, a variety of approaches have been proposed for identifying GCSs, including simulation optimization (S/O) approaches (Mahar & Datta 2000; Xing et al. 2019), direct approaches (Skaggs & Kabala 1994), probabilistic and geostatistical approaches (Neupauer et al. 2000), and analytical approaches (Sidauruk et al. 2010).

Among those approaches, S/O approaches are used widely to characterize GCSs due to their strong applicability (Hou & Lu 2018; Moghaddam et al. 2021). Woodbury & Ulrych (1996) recover the evolution history of contamination plumes in groundwater systems by utilizing the minimum relative entropy method. Mahar & Datta (2000) proposed a simulation nonlinear optimization approach to identify the release intensity and spatial location for GCSs in an unsteady flow system. Mahinthakumar & Sayeed (2006) employed the hybrid optimization method to reconstruct release histories of GCSs. Jha & Datta (2011) applied the S/O approach to identify the GCSs information, and used the simulated annealing algorithm to solve the optimization problem to obtain the release intensity and duration time of GCSs. Zhao et al. (2015) employed optimization approaches to identify the location and release intensity for GCSs.

Good research results have been achieved in the application of S/O methods for GCSs identification (Mahar & Datta 2001; Jiang et al. 2013). However, how to simultaneously identify the number, location and release history of GCSs in a potential contamination area (a continuous area, not just a few fixed locations) is still not reported. Most researchers set preconditions when applying S/O approaches to identify GCSs, only identifying one or two characteristics in the number, location and release history, such as: (1) identify the release history of GCSs on the premise of knowing the number and location of GCSs (Zhao et al. 2016); (2) identify the actual release history and location of GCSs on the premise that there are 3 ∼ 4 potential fixed locations of GCSs (Ayvaz 2010; Guo et al. 2019); and (3) identify the actual release history and location of GCSs on the premise of knowing the number of GCSs (set to one) and a potential contaminated area where the contaminant may be released (Yeh et al. 2014; Xu & Gómez-Hernández 2016). Therefore, the simultaneous identification of the number, location and release history of GCSs is still a thorny problem.

Guo et al. (2019) used a 0-1 MINPOM to identify the actual location (treated as a 0-1 integer variable) of the GCS among the four potential fixed locations, while identifying the release intensity (treated as a continuous variable) of the GCSs. Though the identification of number, location and release history of GCSs in a continuous potential contamination area remains unresolved, the emergence of 0-1 MINPOM makes it possible to simultaneously identify the number, location and release history of GCSs in a suspected contaminated area. Thus, a 0-1 MINPOM with better performance based on the study of Guo et al. (2019) was proposed to simultaneously identify the number, location and release history of GCSs in this study.

In addition, when solving the improved 0-1 MINPOM, the simulation model must be called repeatedly for calculations, thereby incurring miscellaneous computational load and wasting a lot of computational time (Zhao et al. 2016). This deficiency can be solved by using a surrogate model instead of simulation model for iterative calculation. In the previous study, the deep belief neural network (DBNN), long short-term memory network (LSTM) and Kriging methods were used to build surrogate models for simulation models (Jiang et al. 2015; Li et al. 2021; Miao & Guo 2021).

However, the approximation accuracy of a single surrogate model to a simulated model is often different due to different problems. No single surrogate model can guarantee the highest accuracy at any point (Xing et al. 2019; Hou et al. 2017). To make the single surrogate model give full play to its own advantages and abandon its own shortcomings meanwhile, this research establishes an ensemble surrogate model of the simulation model for the first time based on DBNN, LSTM and Kriging method, so as to further improve the approximation accuracy of the surrogate model to the simulation model.

Then the ensemble surrogate model was linked to the improved 0-1 MINPOM and participates in the iterative computation in place of the simulation model in the process of solving the optimization model. After solving the improved 0-1 MINPOM, the identification results of the number, location and release history of GCSs could be obtained simultaneously.

Surrogate modeling methods

Deep belief neural network

The DBNN consists of multiple restricted Boltzmann machine (RBM) stacks and a top-level algorithm (Hinton et al. 2006). The role of an RBM is to extract the features of the input data and the role of the top-level algorithm is to calculate the final output value of the network (Chen et al. 2017). The learning process of the DBNN includes two steps: unsupervised learning training and supervised learning training:

  • (1)

    Unsupervised learning training

The RBM is trained from the bottom to top layer by layer to obtain the important features of the input. An RBM is usually composed of a visible layer and a hidden layer. The visible layer of the RBM has n neurons, and the hidden layer has m neurons. There is no internal connection between the neurons, while the neurons in the hidden layer and the neurons in the visual layer are fully connected. An RBM is an energy-based model. Its joint probability distribution can be calculated based on the energy function (Hecht-Nielsen 1989). When dealing with regression prediction problems, a Gaussian RBM is typically used to process continuous data. In this case, the output of the RBM becomes a continuous value between 0 and 1. For any given set of states , its energy function is expressed as follows:
formula
(1)
where ; represents all neurons in the visible layer, represents all neurons in the hidden layer, w is the connection weight between the visible layer and the hidden layer, is the bias vector of the visible layer, and is the bias vector of the hidden layer, is the tolerance value corresponding to the input v; By deriving Equation (1), the joint probability density function of state can be obtained as follows:
formula
(2)
where is the normalization factor. Then the marginal distribution of the visible layer and the hidden layer can be calculated from the joint probability distribution, respectively:
formula
(3)
formula
(4)
Based on the above work, the conditional probability distributions of the visible layer and hidden layer can be obtained respectively:
formula
(5)
formula
(6)
when the states of all the neurons in the visible layer are known, the activation probability of the neurons in the hidden layer can be calculated based on the above conditional probability distribution:
formula
(7)
Similarly, when the state of all neurons in the hidden layer is known, the activation probability of the neurons in the visible layer can be calculated:
formula
(8)
After a training sample is given, the training process of the RBM is essentially to continuously adjust , so that the probability distribution of the RBM under the influence of is as close as possible to the probability distribution of the training sample. The minimum energy of the entire model can be obtained by maximizing the following likelihood function:
formula
(9)

The last layer of the DBNN is a back propagation (BP) neural network. After training the last layer of the RBM, the output is used as the input of the BP neural network, and the BP neural network is used to predict the output.

For a set of input data , the output of the hidden layer can be obtained after the calculation of the first hidden layer, as shown in Equation (10):
formula
(10)
where is the output of the first hidden layer, is the weight between the input layer and the hidden layer, and is the bias of the first hidden layer.
The RBM is trained layer by layer until the output of the last hidden layer is obtained. Then, the output is input into the BP output layer, and the output of the BP output layer is given by Equation (11):
formula
(11)
where is the output result of the BP neural network, is the output of the last hidden layer.
  • (2)

    Supervised learning training

After unsupervised training, combined with the output data of the training samples, the gradient algorithm is used to perform a top-down supervised learning training phase for the DBNN. The parameters of the DBNN were adjusted inversely according to the error between the output result of the BP neural network and the output data in the training samples. The error was calculated as follows:
formula
(12)
where is the output data in the training samples.
By calculating the error of the BP output layer, the error of the weight and bias of the hidden layer can be calculated:
formula
(13)
formula
(14)
Then, the parameters can be updated and a better DBNN can be obtained:
formula
(15)
formula
(16)
where is learning rate.

Long short-term memory network

LSTM is an improved recurrent neural network (RNN). Compared with RNN, LSTM introduces three gate mechanisms, namely input gate, forget gate and output gate. Through these three gate mechanisms, LSTM can selectively remember or forget some information in the training process, achieve better training effect, and solve the problem that the RNN has long-term dependence on input data (Hochreiter & Schmidhuber 1997; Graves & Schmidhuber 2005; Li et al. 2021). The training of LSTM consists of two main processes, forward propagation and back propagation.

  • (1)

    Forward propagation process

In the forward propagation process of LSTM, the memory cell at time t will receive input data from two sources, namely the output of the memory cell at time and the input data at time t. Based on the input data from these two sources and the three gate mechanisms in the memory cell, the output of the memory cell at time t can be calculated.

First, use the forget gate to retain or forget the information entered into the memory cell:
formula
(17)
where is the output of the forget gate, is the sigmoid function, the sigmoid function can map a real value to the interval 0~1, and then describe the amount of input information that passes through. 0 means that no information passes and 1 means that all information can pass. and are the weight matrices of and in the forget gate, respectively, is the bias vector in the forget gate.

Then, use the input gate to update the state of the memory cell at time t. The input gate consists of two parts: sigmoid activation function part and the Tanh activation function part. Input the two sources of input data into the two function parts for activation, in preparation for updating the state of the memory cell.

The sigmoid activation function part is as follows:
formula
(18)
where is the output calculated by the sigmoid activation function of the input gate, and are the weight matrices and of the sigmoid activation function part in the input gate, respectively, is bias vector in the input gate.
The Tanh activation function part is as follows:
formula
(19)
where is the output of the input gate which is activated by the Tanh activation function (the candidate state of the memory cell at the time ), and are the weight matrices and of the Tanh activation function part in the input gate, respectively, is bias vector in the input gate.
Multiply the output of the forget gate by the state of the memory cell at time . Then multiply the two outputs obtained by the input gate; on this basis, get the updated state of the memory cell at time :
formula
(20)
where is the updated state of the memory cell.
Finally, use the output gate to control the influence of on the current output value , the mathematical expression of the output gate is as follows:
formula
(21)
formula
(22)
where is the output of the output gate, and are the weight matrices and of the output gate, respectively, is bias vector in the output gate.
Finally, after Softmax activation and transformation, the regression prediction value of LSTM is obtained:
formula
(23)
where is the output of LSTM, and are the weight matrices and bias vector of LSTM visual layer, respectively.
  • (2)

    Back propagation process

To further optimize the LSTM network, the relevant parameters of the LSTM are adjusted inversely by using the training samples. According to the loss function, the weight matrix and bias vector are adjusted using the chain rule. The mathematical expressions for calculating the gradient of the related parameters are as follows (take the output gate and visual layer as examples).

The output gate:
formula
(24)
The visible layer:
formula
(25)
where L is loss function. After calculating the gradient of the above parameters, the relevant parameters of the LSTM can be updated by using the gradient algorithm. After all the weight matrices and bias vectors are obtained, the LSTM regression prediction model can be obtained.

Kriging

Kriging is an unbiased optimal predictor based on variational function and structural analysis (Bargaoui & Chebbi 2009; Kleijnen 2009). In recent years, the Kriging approach is extended as a surrogate modeling method with applications in many fields of engineering (Kleijnen & Beers 2005). A surrogate model based on Kriging can be established according to the following principle (Guo et al. 2019):
formula
(26)
where: is the input item of the training data set, are known as determinate regression functions, is a regression coefficient matrix, which can be estimated from the training data set, p is the number of determinate regression functions, and is a random part deviation. For the detailed principle of the Kriging, please refer to the article by Guo et al. (2019).

Simulation optimization method

The identification of GCSs in this study was based on the S/O method. Therefore, establishing the optimization model is a key step in the research process (Cristo & Leopardi 2008; Datta et al. 2009; Jiang et al. 2021). A brief introduction to establishing an optimization model is as follows: the GCSs identification problem is treated as an optimization problem, then determine the decision variables, objective function (the fitting error between the observed and simulated concentrations is usually used as the objective function) and constraints (include parameter value constraints, GCSs information constraints, and contaminant transport rules constraints). Finally, the objective function and constraints constitute the optimization model and the optimization model is solved to obtain the GCSs identification results.

The general form of the optimization model applied for GCSs identification can be expressed as follows (Ayvaz 2010; Guo et al. 2019):
formula
(27)
where is the objective function, are the decision variables in the optimization model, e is the total number of decision variables, is the inequality constraint, and are the lower and upper bound of the inequality constraint, n is the total number of inequality constraints, is the equality constraint and m is the total number of inequality constraints.

Site overview

A hypothetical contaminated site was used to analyze the applicability of the improved method in the identification of GCSs. The contaminant of the hypothetical contaminated site does not undergo chemical and biological conversion. The study area is a two-dimensional aquifer with irregular boundaries, and the aquifer medium is heterogeneous and isotropic. There are three parameter zones in the aquifer. The groundwater was transient flow. The potential contamination source area (all locations where the contaminant might have been released), boundary conditions and observation locations are shown in Figure 1. The other parameters related to the aquifer are presented in Table 1. The waterhead of the northwest specific head boundary was 27 m and the waterhead of the southeast specific head boundary was 25 m. The aquifer received uniform recharge from atmospheric precipitation in the vertical direction, and the recharge amount was 730 mm/a. The natural background concentration of the contaminant was 0 mg/L.
Table 1

Related parameters for the aquifers in the hypothetical contaminated site

Parameter valuesValue
IIIIII
Hydraulic conductivity in x-direction, Kxx (m/d) 42 30 18 
Hydraulic conductivity in y-direction, Kyy (m/d) 42 30 18 
Effective porosity, θ 0.25 0.2 0.18 
Longitudinal dispersivity, αL (m) 40 40 40 
Transverse dispersivity, αT (m) 
Saturated thickness, b(m) 35 35 35 
Grid spacing in x-direction, Δx (m) 
Grid spacing in y-direction, Δy (m) 
Length of the simulation period, Δt (month) 
Recharge flux per unit area, W (m/d) 0.0004 0.00035 0.0003 
Initial concentration (mg/L) 
Parameter valuesValue
IIIIII
Hydraulic conductivity in x-direction, Kxx (m/d) 42 30 18 
Hydraulic conductivity in y-direction, Kyy (m/d) 42 30 18 
Effective porosity, θ 0.25 0.2 0.18 
Longitudinal dispersivity, αL (m) 40 40 40 
Transverse dispersivity, αT (m) 
Saturated thickness, b(m) 35 35 35 
Grid spacing in x-direction, Δx (m) 
Grid spacing in y-direction, Δy (m) 
Length of the simulation period, Δt (month) 
Recharge flux per unit area, W (m/d) 0.0004 0.00035 0.0003 
Initial concentration (mg/L) 
Figure 1

Boundary conditions, location of observation wells, potential contamination source area and location of GCSs: (a) case one; (b) case two.

Figure 1

Boundary conditions, location of observation wells, potential contamination source area and location of GCSs: (a) case one; (b) case two.

Close modal

This study uses two cases to test the applicability of the method developed. The aquifer parameters of the hypothetical contaminated site were the same for the two cases, as shown in Table 1.

The aquifer was discretely divided into 5262 grids in order to make the study more convenient. Since the grid is treated as an integer variable, the location information of the potential contamination source area in the study area is represented by grid numbers. The horizontal and longitudinal locations of the potential contamination source areas and their corresponding grid numbers are shown in Table 2.

Table 2

The horizontal and longitudinal locations and their corresponding grid numbers

(a) horizontal
NumberLocationNumberLocationNumberLocationNumberLocation
19 55.5 32 94.5 45 133.5 58 172.5 
20 58.5 33 97.5 46 136.5 59 175.5 
21 61.5 34 100.5 47 139.5 60 178.5 
22 64.5 35 103.5 48 142.5 61 181.5 
23 67.5 36 106.5 49 145.5 62 184.5 
24 70.5 37 109.5 50 148.5 63 187.5 
25 73.5 38 112.5 51 151.5 64 190.5 
26 76.5 39 115.5 52 154.5 65 193.5 
27 79.5 40 118.5 53 157.5 66 196.5 
28 82.5 41 121.5 54 160.5 67 199.5 
29 85.5 42 124.5 55 163.5 68 202.5 
30 88.5 43 127.5 56 166.5 69 205.5 
31 91.5 44 130.5 57 169.5 70 208.5 
(b) longitudinal
NumberLocationNumberLocationNumberLocation
18 157.5 31 196.5 44 235.5   
19 160.5 32 199.5 45 238.5   
20 163.5 33 202.5 46 241.5   
21 166.5 34 205.5 47 244.5   
22 169.5 35 208.5 48 247.5   
23 172.5 36 211.5 49 250.5   
24 175.5 37 214.5 50 253.5   
25 178.5 38 217.5 51 256.5   
26 181.5 39 220.5 52 259.5   
27 184.5 40 223.5 53 262.5   
28 187.5 41 226.5 54 265.5   
29 190.5 42 229.5 55 268.5   
30 193.5 43 232.5     
(a) horizontal
NumberLocationNumberLocationNumberLocationNumberLocation
19 55.5 32 94.5 45 133.5 58 172.5 
20 58.5 33 97.5 46 136.5 59 175.5 
21 61.5 34 100.5 47 139.5 60 178.5 
22 64.5 35 103.5 48 142.5 61 181.5 
23 67.5 36 106.5 49 145.5 62 184.5 
24 70.5 37 109.5 50 148.5 63 187.5 
25 73.5 38 112.5 51 151.5 64 190.5 
26 76.5 39 115.5 52 154.5 65 193.5 
27 79.5 40 118.5 53 157.5 66 196.5 
28 82.5 41 121.5 54 160.5 67 199.5 
29 85.5 42 124.5 55 163.5 68 202.5 
30 88.5 43 127.5 56 166.5 69 205.5 
31 91.5 44 130.5 57 169.5 70 208.5 
(b) longitudinal
NumberLocationNumberLocationNumberLocation
18 157.5 31 196.5 44 235.5   
19 160.5 32 199.5 45 238.5   
20 163.5 33 202.5 46 241.5   
21 166.5 34 205.5 47 244.5   
22 169.5 35 208.5 48 247.5   
23 172.5 36 211.5 49 250.5   
24 175.5 37 214.5 50 253.5   
25 178.5 38 217.5 51 256.5   
26 181.5 39 220.5 52 259.5   
27 184.5 40 223.5 53 262.5   
28 187.5 41 226.5 54 265.5   
29 190.5 42 229.5 55 268.5   
30 193.5 43 232.5     

The real number, location, and release history of GCSs for the two cases are as follows:

Case one: the number was one, the location is as shown in Figure 1(a) and release intensities during two release periods are as shown in Table 3a.

Table 3

Details of the real GCSs: (a) case one; (b) case two

(a) case one
Number of sourcesGrid number
Release intensity (g/d)
HorizontalLongitudinalSP1SP2
35 52 260 140  
(b) case two
Grid number
Release intensity (g/d)
Number of sourcesSourcesHorizontalLongitudinalSP1SP2
S1 35 52 190 120 
 S2 41 36 170 160 
(a) case one
Number of sourcesGrid number
Release intensity (g/d)
HorizontalLongitudinalSP1SP2
35 52 260 140  
(b) case two
Grid number
Release intensity (g/d)
Number of sourcesSourcesHorizontalLongitudinalSP1SP2
S1 35 52 190 120 
 S2 41 36 170 160 

Case two: the number was two, the locations of the two GCSs are shown in Figure 1(b) and release intensities of the two GCSs during two release periods are as shown in Table 3b.

The information on the GCSs for the two cases were regarded as unknown during the inverse identification process (include the number, location, and release intensity); only the potential contamination source area was known (shown in Figure 1(a) and 1(b)). For hypothetical case one and case two it is estimated that the maximum number of GCSs is 3. The number, location, and release intensity of the GCSs were then identified.

All the GCSs are designed to have same initial release times (the first day of simulation) and the time intervals between the initial contaminant release of the GCSs and first contaminant concentration measurement of the two cases are shown in Table 4. The total contaminant release duration of the two cases is 180 days and the total release time is divided into two equal stress periods of 90 days. Concentration is measured every 90 days from the start of the first measurement time. The contaminant release intensity of the GCSs is designed to be constant during the two stress periods. The total simulation time for contaminant transport is 900 days. A total of eight periods of concentration measurements are utilized for the two cases.

Table 4

Test case scenarios

CaseActivity duration (d)Time interval between the initial contaminant release of the source and first measurement (d)Concentration measurement intervals (d)
one 2 × 90 270 90 
two 2 × 90 270 90 
CaseActivity duration (d)Time interval between the initial contaminant release of the source and first measurement (d)Concentration measurement intervals (d)
one 2 × 90 270 90 
two 2 × 90 270 90 

Numerical simulation model

The identification of GCSs needs to be carried out on the basis of establishing a numerical simulation model of the groundwater system (Singh & Datta 2006). The governing partial differential mathematical equations for the transient flow and the contaminant transport are defined as follows (Pinder & Bredehoeft 1968):

The groundwater flow:
formula
(28)
where is the hydraulic conductivity, is the hydraulic head, is the volumetric flux per unit volume, is the specific yield and is the simulated area range.
The contaminant transport:
formula
(29)
formula
(30)
where C is the contamination concentration, is the effective porosity, R is the source or sink term, is the dispersion coefficient and is the average linear velocity of the groundwater flow.

The GMS software was used to solve the transient flow and contaminant transport simulation model.

Unlike actual identification of GCSs, the hypothetical example had no actual contaminant monitoring data. Thus, it is necessary to input the actual information of GCSs into the simulation model first, and then run the simulation model to output the contaminant concentration of the observation wells during each simulation period. During the identification process of GCSs, the contaminant concentrations were taken as the actual data measurements. Figure 2 shows the contaminant plume distributions on day 450 and day 900 of each case. Figure 3 shows the contaminant concentration measured in the two cases.
Figure 2

Contaminant plume distributions: (a), (b) case one; (c), (d) case two.

Figure 2

Contaminant plume distributions: (a), (b) case one; (c), (d) case two.

Close modal
Figure 3

Values measured of the observation wells for two cases: (a) case one; (b) case two.

Figure 3

Values measured of the observation wells for two cases: (a) case one; (b) case two.

Close modal

Ensemble surrogate model

A surrogate model with satisfactory accuracy is a stand-in for iterative calculation in place of the simulation model, and it can output nearly the same results as the simulation model under the premise of the same output. The surrogate model is not only simpler to call, but it can reduce a lot of computational load and computational time when identifying the GCSs (Simpson et al. 2001; Jiang et al. 2015).

The horizontal and longitudinal grid numbers of locations and release intensity corresponding to a maximum number of the GCSs were used as the input item for the surrogate model (6 locations and 6 release intensities comprising a total of 12 input variables). The contaminant concentrations were treated as the output of the surrogate model. Using the Latin hypercube method, the locations (grid numbers) and release intensity of each GCS were sampled 420 times in their feasible domains. Then the locations and release intensities were used sequentially as inputs for the simulation model. Run the simulation model to output the contaminant concentrations data of each observation well (output data). Divide all 420 groups data into training data and testing data. The training data consists of 360 groups of input and output data, and the testing data consists of 60 groups of input and output data. The DBNN, LSTM and Kriging methods were coded with MATLAB, and the DBNN, LSTM and Kriging surrogate model were then established based on the training and testing data.

To further improve the approximation accuracy of the surrogate model to the simulation model, an ensemble surrogate model of the simulation model was established based on the above three single surrogate models. To make the three single surrogate models exert their own advantages, they were weighted and superimposed. To determine the weights of the three single surrogate models in the ensemble surrogate model, the least square error of the output results of the ensemble surrogate model and the simulated model was regarded as the objective function, and the weights of the three single surrogate models were regarded as decision variables. The optimization model for establishing the ensemble surrogate model is as follows:
formula
(31)
where and were the contaminant concentrations of the i-th observation well in the j-th period of the k-th test sample output by the ensemble surrogate model and the simulation model, respectively, m is the total number of observation wells, t is the total number of observation periods, n the total number of test samples. , , and are the weights occupied by DBNN, LSTM and Kriging surrogate models in the ensemble surrogate model, respectively.
The accuracy of the above four surrogate models was tested based on the determination coefficient (R2), root mean square error (RMSE), and mean relative error (MRE).
formula
(32)
formula
(33)
formula
(34)

In Equations (32)–(34), is the i-th output value obtained from the numerical simulation model, is the i-th output value obtained from the surrogate model, and is the mean value of the output values obtained from the simulation model. Smaller values for the MRE and RMSE, as well as values of R2 closer to 1 indicate that the surrogate model has higher approximation accuracy to the simulation model.

0-1 MINPOM

The number, location and release intensity of the GCSs were regarded as decision variables. Among them the number of GCSs was treated as 0-1 integer variable. The location (grid numbers) and release intensities were treated as integer and continuous variables, respectively. The least squares error of the measured and simulated contaminant concentrations was minimized as the objective function. The surrogate model was regarded as an equality constraint and was linked to the optimization model to ensure that the optimization model satisfied the solute transport law in a groundwater system during the iterative calculation of optimization solving process. The feasible domains for the number, location (grid numbers), and release history of GCSs comprised the inequality constraints. The objective function and constraints constituted the 0-1 MINPOM:
formula
(35)
where is the 0-1 integer variable representing whether the current location is GCS, 1 indicates that a real GCS is present in the current location, 0 indicates that no GCS is present in the current location; is the horizontal grid numbers; is the longitudinal grid numbers; i represents the i-th potential GCS; is the release intensity of the GCSs; m represents the m-th release intensity variable; is the simulated contaminant concentration; and is the measured contaminant concentration.

The 0-1 MINPOM for identifying the GCSs in each case was the same, except the measured contaminant concentrations corresponding to the two cases were different.

Accuracy analysis of the surrogate models

The weights of the three single surrogate models in the ensemble surrogate model are shown in Table 5.

Table 5

The weight of the single surrogate model in the ensemble surrogate model

Single surrogate modelDBNNLSTMKriging
Weights 0.095 0.638 0.267 
Single surrogate modelDBNNLSTMKriging
Weights 0.095 0.638 0.267 

The precision of the four surrogate models was evaluated by calculating the R2, RMSE, and MRE of the 60 groups of output samples from the simulation model and those from the four surrogate models. To see the fitting effect between the surrogate models and the simulation model, the fitting effect between simulation model and surrogate model was drawn (shown in Figure 4). To clearly compare and analyze the accuracies of the four surrogate models, the contrast histograms were plotted for R2, RMSE, and MRE (shown in Figure 5).
Figure 4

Fitting effect between simulation model and surrogate model: (a) DBNN; (b) LSTM (c) Kriging (d) ensemble.

Figure 4

Fitting effect between simulation model and surrogate model: (a) DBNN; (b) LSTM (c) Kriging (d) ensemble.

Close modal
Figure 5

Evaluative coefficients of surrogate model for the two cases: (a) R2; (b) RMSE; (c) MRE.

Figure 5

Evaluative coefficients of surrogate model for the two cases: (a) R2; (b) RMSE; (c) MRE.

Close modal

It can be seen from Table 5 that the higher the accuracy of the single surrogate model, the greater the proportion in the ensemble surrogate model. It could be seen from Figure 4 that among the three single surrogate models, the LSTM surrogate model has the best fitting effect to the simulation model, followed by the Kriging surrogate model, while the DBNN surrogate model was the poorest. The fitting effect of the ensemble surrogate model based on the three single surrogate models was higher than that of any single surrogate model. It could be clearly seen from Figure 5 that for each observation well, the three accuracy indexes of the ensemble surrogate model were better than the other three single surrogate models. The mean R2 of DBNN, LSTM, Kriging and ensemble surrogate model were 0.9812, 0.9948, 0.9904 and 0.9960, respectively. The mean RMSE of DBNN, LSTM, Kriging and ensemble surrogate model were 0.0068, 0.0039, 0.0052 and 0.0034, respectively. The mean MRE of DBNN, LSTM, Kriging and ensemble surrogate model were 8.16%, 3.98%, 5.41% and 3.65%, respectively. The above results indicated that the ensemble surrogate model had the best approximation accuracy to the simulated model. Therefore, the ensemble surrogate model was linked to the 0-1 MINPOM for simultaneously identifying the number, location (grid numbers) and release history of GCSs.

Analysis of GCSs identification results

The ensemble surrogate model was linked to the 0-1 MINPOM and the (genetic algorithm) GA was used to solve the 0-1 MINPOM. A detailed introduction to GA was provided by Guo et al. (2019) and Zwickl (2006). The population size and generation of GA are 200 and 1200, respectively, and for other relevant parameter settings of the GA refer to the articles of Li et al. (2021).

It takes about 1,333 hours to call the simulation model 240,000 times during the solving process of the 0-1 MINPOM, while it only takes about 4.79 hours if the surrogate model is applied 240,000 times. Thus, using the surrogate model for iterative calculation could save 99% of the computational time and the computational load.

The identification results of GCSs for the two cases are shown in Figure 6 and Table 6.
Table 6

Source number, location, and release history results for the two cases

(a) case one
Case oneNumber of sourcesS-location (m)
S-Release intensity (g/d)
XYSP1SP2
Real value 35 52 260 140     
Identified results 35 53 254.06 141.48     
Relative error 0.00% 2.86% 3.85% 2.28% 1.06%     
(b) case two
S1-location (m)
S2-location (m)
S1-Release intensity (g/d)
S2-Release intensity (g/d)
Case twoNumber of sourcesXYXYSP1SP2SP1SP2
Real value 35 52 41 36 190 120 170 160 
Identified results 35 55 41 33 195.83 115.09 189.08 133.62 
Relative error 0.00% 0.00% 0.00% 4.88% 2.78% 3.07% 4.09% 11.22% 16.49% 
(a) case one
Case oneNumber of sourcesS-location (m)
S-Release intensity (g/d)
XYSP1SP2
Real value 35 52 260 140     
Identified results 35 53 254.06 141.48     
Relative error 0.00% 2.86% 3.85% 2.28% 1.06%     
(b) case two
S1-location (m)
S2-location (m)
S1-Release intensity (g/d)
S2-Release intensity (g/d)
Case twoNumber of sourcesXYXYSP1SP2SP1SP2
Real value 35 52 41 36 190 120 170 160 
Identified results 35 55 41 33 195.83 115.09 189.08 133.62 
Relative error 0.00% 0.00% 0.00% 4.88% 2.78% 3.07% 4.09% 11.22% 16.49% 
Figure 6

The curve convergence and the identification results of GCSs: (a) case one; (b) case two.

Figure 6

The curve convergence and the identification results of GCSs: (a) case one; (b) case two.

Close modal

Case one: A value ‘1’ (shown in Figure 6(a)) in the identification results indicates that the number of GCSs was one. The location (grid number) and the release intensities of GCSs are shown in Table 6a. The identification accuracy of the number and location was 100%, and the identification accuracy of location and release history was above 96.15% and 97.72%, respectively.

Case two: Those two values of ‘1’ (shown in Figure 6(b)) in the identification results indicate that the number of GCSs was two. The location (grid number) and the release intensities of two GCSs are shown in Table 6b. The identification accuracy of the number was 100%, and the identification accuracy of location and release history was above 95.12% and 83.51%, respectively.

The identification results of GCSs for the two cases showed that compared with the following situations the 0–1 MINPOM in this study not only effectively synchronously identified the number, location, and release history of the GCSs, but ensured high identification accuracy: (1) The exact location of the GCSs can only be identified from the potential contamination source area on the premise that the number of GCSs is known (Mo et al. 2019; Pan et al. 2021; Wang et al. 2022); (2) Under the premise that the number and location of GCSs is known, the release history is identified (Xing et al. 2019). It effectively improved the shortcoming of the previous 0-1 MINPOM which could only identify the location and release history of GCSs.

The actual number of GCSs was lower than the estimated maximum number of three in this study. If the actual number of GCSs is equal to or more than the maximum estimated number of GCSs, the solution is increasing the original estimated maximum number of GCSs. Then reestablishing the surrogate model and 0-1 MINPOM to identify the information of the GCSs until the number of GCSs identified is lower than the estimated maximum number of GCSs (in a feedback correction process).

However, the input dimension of the simulation model will increase as the estimated maximum number of GCSs increases and the nonlinear relationships will become more complicated. When faced with a simulation model with a more complex nonlinear relationship, the accuracy of the surrogate model should be further improved through reasonable study.

Moreover, when the estimated maximum number of GCSs increases, the number of variables to be identified will inevitably increase, which may increase the probability of the occurrence of ‘different parameters with same effect’. When using the simulation optimization method to identify GCSs information, it is often difficult to solve the problem of ‘different parameters with same effect’. In addition to increasing the constraints of the optimization model as much as possible, how to solve this problem from other perspectives needs to be further studied. In addition, this study did not consider the uncertainty of parameters. Since the uncertainty of parameters exists objectively, the uncertainty of parameters should be considered in the GCSs identification research in future.

Two conclusions were obtained based on the study:

First, among the three single surrogate models, the LSTM surrogate model had the best accuracy, and the DBNN surrogate model had the worst accuracy. The single surrogate model with better accuracy occupied a larger proportion of the ensemble surrogate model. Compared with the three single surrogate models of DBNN, LSTM and Kriging, the fitting effect of the ensemble surrogate model on the output results of the numerical simulation model was significantly improved. Applying the ensemble surrogate model instead of the simulation model to participate in iterative calculation could save more than 99% of the computational time and the computational load.

Second, the 0-1 MINPOM could only identify the location and release history of the GCSs in the previous study. In the present study, an improved 0-1 MINPOM was established based on treating the number of GCSs as an integer variable with a value of 0 or 1, the location (represented by grid number) and release intensity of GCSs as integer variables and continuous variables, respectively. Then the number, location, and release intensity of GCSs were simultaneously identified using the improved 0-1 MINPOM. The identification results were very similar to the actual values of the GCSs characteristics. The accuracy of identifying the number of GCSs was 100%, and the accuracies of identifying the locations and release intensities were over 95.12% and 83.51%, respectively. Comparing with the 0-1 MINPOM proposed by previous researchers, the improved 0-1 MINPOM proposed in this study is more applicable to the identification of GCSs.

The authors acknowledge support provided by the National Nature Science Foundation of China (Grant Nos. 42202273), the Fundamental Research Funds for the Central Universities (2412022QD001) and the National Key R&D Program of China (Grant Nos. 2019YFC0409101). Special gratitude is extended to the journal editors for their efforts in evaluating this study. The valuable comments provided by the anonymous reviewers are also gratefully acknowledged.

This work was supported by the National Nature Science Foundation of China (Grant Nos. 42202273), the Fundamental Research Funds for the Central Universities (2412022QD001) and the National Key R&D Program of China (Grant Nos. 2019YFC0409101).

All authors contributed to the study conception and design. Conceptualization, methodology, software, writing the original draft, and formal analysis were performed by Jiuhui Li. Writing review & editing, supervision, and project administration were performed by Zhengfang Wu. Software, validation, conceptualization, and methodology were performed by Wenxi Lu. Validation, writing review & editing were performed by Hongshi He.

All relevant data are available from an online repository or repositories at https://github.com/lijiuhui666/0-1 MINLPOM/upload

Ayvaz
M. T.
2010
A linked simulation–optimization model for solving the unknown groundwater pollution source identification problems
.
Journal of Contaminant Hydrology
117
(
1–4
),
46
59
.
https://doi.org/10.1016/j.jconhyd.2010.06.004
.
Bargaoui
Z. K.
&
Chebbi
A.
2009
Comparison of two kriging interpolation methods applied to spatiotemporal rainfall
.
Journal of Hydrology
365
(
1–2
),
56
73
.
https://doi.org/10.1016/j.jhydrol.2008.11.025
.
Chen
H.
,
Wan
G. X.
&
Xiao
Z. J.
2017
Intrusion detection method of deep belief network model based on optimization of data processing
.
Journal of Computer Applications
37
(
6
),
1636
1643
.
Cristo
C. D.
&
Leopardi
A.
2008
Pollution source identification of accidental contamination in water distribution networks
.
Journal of Water Resources Planning & Management
134
,
197
202
.
https://doi.org/10.1061/(ASCE)0733-9496(2008)134:2(197)
.
Datta
B.
,
Chakrabarty
D.
&
Dhar
A.
2009
Simultaneous identification of unknown groundwater pollution sources and estimation of aquifer parameters
.
Journal of Hydrology
376
(
1–2
),
48
57
.
https://doi.org/10.1016/j.jhydrol.2009.07.014
.
Gorelick
S. M.
,
Evans
B.
&
Remson
I.
1983
Identifying sources of groundwater pollution: an optimization approach
.
Water Resources Research
19
(
3
).
https://doi.org/10.1029/WR019i003p00779.
Graves
A.
&
Schmidhuber
J.
2005
Framewise phoneme classification with bidirectional LSTM and other neural network architectures
.
Neural Networks
18
(
5–6
),
602
610
.
https://doi.org/10.1016/j.neunet.2005.06.042
.
Guo
J. Y.
,
Lu
W. X.
,
Yang
Q. C.
&
Miao
T. S.
2019
The application of 0–1 mixed integer nonlinear programming optimization model based on a surrogate model to identify the groundwater pollution source
.
Journal of Contaminant Hydrology
220, 18–25.
https://doi.org/10.1016/j.jconhyd.2018.11.005
Han
K. X.
,
Zuo
R.
,
Ni
P. C.
,
Xue
Z. K.
,
Xu
D. H.
,
Wang
J. S.
&
Zhang
D.
2020
Application of a genetic algorithm to groundwater pollution source identification
.
Journal of Hydrology
125343
.
https://doi.org/10.1016/j.jhydrol.2020.125343
.
Hecht-Nielsen, R. 1989 Theory of the backpropagation neural network. International 1989 Joint Conference on Neural Networks (IJCNN)
.
Hinton
G. E.
,
Osindero
S.
&
Teh
Y. W.
2006
A fast learning algorithm for deep belief nets
.
Neural Computation
18
(
7
),
1527
1554
.
https://doi.org/10.1162/neco.2006.18.7.1527
.
Hochreiter
S.
&
Schmidhuber
J.
1997
Long short-term memory
.
Neural Computation
9
(
8
),
1735
1780
.
https://doi.org/10.1162/neco.1997.9.8.1735.
Hou
Z. Y.
&
Lu
W. X.
2018
Comparative study of surrogate models for groundwater contamination source identification at DNAPL-contaminated sites
.
Hydrogeology Journal
26
(
3
),
1
10
.
https://doi.org/10.1007/s10040-017-1690-1
.
Hou
Z. Y.
,
Lu
W. X.
,
Xue
H. B.
&
Lin
J.
2017
A comparative research of different ensemble surrogate models based on set pair analysis for the DNAPL-contaminated aquifer remediation strategy optimization
.
Journal of Contaminant Hydrology
203
,
28
37
.
https://doi.org/10.1016/j.jconhyd.2017.06.003
.
Jha
M. K.
&
Datta
B.
2011
Simulated annealing based simulation-optimization approach for identification of unknown contaminant sources in groundwater aquifers
.
Desalination and Water Treatment
32
(
1-3
),
79
85
.
https://doi.org/10.5004/dwt.2011.2681
.
Jiang
S.
,
Zhang
Y.
,
Wang
P.
&
Zheng
M.
2013
An almost-parameter-free harmony search algorithm for groundwater pollution source identification
.
Water Science & Technology
68
,
2359
2366
.
https://doi.org/10.2166/wst.2013.499
.
Jiang
X.
,
Lu
W. X.
,
Hou
Z. Y.
,
Zhao
H. Q.
&
Na
J.
2015
Ensemble of surrogates-based optimization for identifying an optimal surfactant-enhanced aquifer remediation strategy at heterogeneous DNAPL-contaminated sites
.
Computers & Geosciences
84
,
37
45
.
https://doi.org/10.1016/j.cageo.2015.08.003
.
Jiang
X.
,
Ma
R.
,
Wang
Y.
,
Gu
W. L.
,
Lu
W. X.
&
Na
J.
2021
Two-stage surrogate model-assisted Bayesian framework for groundwater contaminant source identification
.
Journal of Hydrology
594
(
1–2
),
125955
.
https://doi.org/10.1016/j.jhydrol.2021.125955
.
Kleijnen
J. P.
2009
Kriging metamodeling in simulation: a review
.
European Journal of Operational Research
192
(
3
),
707
716
.
https://doi.org/10.1016/j.ejor.2007.10.013
.
Kleijnen
J. P. C.
&
Beers
W. C. M. V.
2005
Robustness of Kriging when interpolating in random simulation with heterogeneous variances: some experiments
.
European Journal of Operational Research
165
(
3
),
826
834
.
https://doi.org/10.1016/j.ejor.2003.09.037
.
Lapworth
D. J.
,
Baran
N.
,
Stuart
M. E.
&
Ward
R. S.
2012
Emerging organic contaminants in groundwater: a review of sources, fate and occurrence
.
Environmental Pollution
163
(
4
),
287
303
.
https://doi.org/10.1016/j.envpol.2011.12.034
.
Li
J. H.
,
Lu
W. X.
&
Luo
J. N.
2021
Groundwater contamination sources identification based on the Long-Short Term Memory network
.
Journal of Hydrology
601
,
126670
.
https://doi.org/10.1016/j.jhydrol.2021.126670
.
Mahar
P. S.
&
Datta
B.
2000
Identification of pollution sources in transient groundwater systems
.
Water Resources Management
14
(
3
),
209
227
.
https://doi.org/10.1023/A:1026527901213
.
Mahar
P. S.
&
Datta
B.
2001
Optimal identification of ground-water pollution sources and parameter estimation
.
Journal of Water Resources Planning and Management
127
,
20
29
.
Mahinthakumar, G. K. & Sayeed, M. 2006 Reconstructing groundwater source release histories using hybrid optimization approaches. Environmental Forensics 7 (1), 45–54. https://doi.org/10.1080/15275920500506774.
Miao
T. S.
&
Guo
J. Y.
2021
Application of artificial intelligence deep learning in numerical simulation of seawater intrusion
.
Environmental Science and Pollution Research
32
(
12
),
1016
1026
.
https://doi.org/10.1089/ees.2015.0055
.
Mo
S. X.
,
Zabaras
N.
,
Shi
X. Q.
&
Wu
J. C.
2019
Deep autoregressive neural networks for high-dimensional inverse problems in groundwater contaminant source identification
. Water Resources Research
55
(
5
),
3856
3881
.
https://doi.org/10.1029/2018WR024638
.
Moghaddam
M. B.
,
Mazaheri
M.
&
Samani
J.
2021
Inverse modeling of contaminant transport for pollution source identification in surface and groundwaters: a review
.
Groundwater for Sustainable Development
12
,
100651
.
https://doi.org/10.1016/j.gsd.2021.100651
.
Neupauer
R. M.
,
Borchers
B.
&
Wilson
J. L.
2000
Comparison of inverse methods for reconstructing the release history of a groundwater contamination source
.
Water Resources Research
36
(
9
),
2469
2475
.
https://doi.org/10.1029/2000wr900176
.
Pan
Z. D.
,
Lu
W. X.
,
Fan
Y.
&
Li
J. H.
2021
Identification of groundwater contamination sources and hydraulic parameters based on Bayesian regularization deep neural network
.
Environmental Science and Pollution Research
28
(
13
),
16867
16879
.
https://doi.org/10.1007/s11356-020-11614-1
.
Pinder
G. F.
&
Bredehoeft
J. D.
1968
Application of the digital computer for aquifer evaluations
.
Water Resources Research
4
(
5
),
1069
1093
.
https://doi.org/10.1029/WR004i005p01069
.
Sidauruk
P.
,
Cheng
H. D.
&
Ouazar
D.
2010
Ground water contaminant source and transport parameter identification by correlation coefficient optimization
.
GroundWater
36
(
2
),
208
214
.
https://doi.org/10.1111/j.1745-6584.1998.tb01085.x
.
Simpson
T. W.
,
Mauery
T. M.
,
Korte
J. J.
&
Mistree
F.
2001
Kriging models for global approximation in simulation-based multidisciplinary design optimization
.
AIAA Journal
39
(
12
),
2233
2241
.
https://doi.org/10.2514/3.15017
.
Singh
R. M.
&
Datta
B.
2006
Identification of groundwater pollution sources using GA-based linked simulation optimization model
.
Journal of Hydrologic Engineering
11
(
2
),
101
109
.
https://doi.org/10.1061/9780784413623.118
.
Skaggs, T. H. & Kabala, Z. J. 1994 Recovering the release history of a groundwater contaminant. Water Resources Research 30 (1), 71–79. https://doi.org/10.1029/93WR02656
.
Sun
A. Y.
,
Painter
S. L.
&
Wittmeyer
G. W.
2006
A constrained robust least squares approach for contaminant release history identification
.
Water Resources Research
42
(
4
),
263
269
.
https://doi.org/10.1029/2005WR004312
.
Woodbury
A. D.
&
Ulrych
T. J.
1996
Minimum relative entropy inversion: theory and application to recovering the release history of a groundwater contaminant
.
Water Resources Research
32
(
9
),
2671
2681
.
https://doi.org/10.1029/95wr03818
.
Xing
Z. X.
,
Qu
R. Z.
,
Zhao
Y.
,
Fu
Q.
,
Ji
Y.
&
Lu
W. X.
2019
Identifying the release history of a groundwater contaminant source based on an ensemble surrogate model
.
Journal of Hydrology
572
,
501
516
.
https://doi.org/10.1016/j.jhydrol.2019.03.020
.
Xu
T.
&
Gómez-Hernández
J. J.
2016
Joint identification of contaminant source location, initial release time and initial solute concentration in an aquifer via ensemble Kalman filtering
.
Water Resources Research
52 (8), 6587–6595.
https://doi.org/10.1002/2016WR019111.
Yeh
H. D.
,
Lin
C. C.
&
Yang
B. J.
2014
Applying hybrid heuristic approach to identify contamination source information in transient groundwater flow systems
.
Mathematical Problems in Engineering
369369
,
1
13
.
https://doi.org/10.1155/2014/369369
.
Zhao
Y.
,
Lu
W. X.
&
An
Y. K.
2015
Surrogate model-based simulation-optimization approach for groundwater source identifcation problems
.
Environmental Forensics
16
(
3
),
296
303
.
https://doi.org/10.1080/15275922.2015.1059908
.
Zhao
Y.
,
Lu
W. X.
&
Xiao
N. C.
2016
A Kriging surrogate model coupled in simulation–optimization approach for identifying release history of groundwater sources
.
Journal of Contaminant Hydrology
185–186
,
51
60
.
https://doi.org/10.1016/j.jconhyd.2016.01.004
.
Zwickl
D. J.
2006
Genetic Algorithm Approaches for the Phylogenetic Analysis of Large Biological Sequence Datasets Under the Maximum Likelihood Criterion
.
PhD dissertation, The University of Texas at Austin.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).