## Abstract

In most instances, the number, location and release intensity of groundwater contamination sources (GCSs) are all unknown. The 0-1 mixed integer nonlinear programming optimization model (0-1 MINPOM) used previously could only identify the location and release intensity for GCSs. Thus a 0-1 MINPOM was improved and applied to simultaneously identify the number, location and release history of GCSs. In addition, to reduce the time of calling the simulation model for iterative calculation, the deep belief neural network (DBNN), long short-term memory network (LSTM) and Kriging method were applied to establish an ensemble surrogate model of the simulation model to replace the simulation model for iterative calculation. The results show that compared with the three single surrogate models, the ensemble surrogate model has the highest approximation accuracy to the simulation model, and could save 99% of the calculation time when iterative calculation was performed in place of the simulation model. The improved 0-1 MINPOM could identify the number, location, and release history of GCSs simultaneously. The accuracy of identifying the number of GCSs was 100%, and the accuracies of identifying the locations and release history were above 95.12% and 83.51%, respectively.

## HIGHLIGHTS

A 0-1 mixed integer nonlinear programming optimization model (0-1 MINPOM) was improved and applied to the identification of GCSs.

The number, location, and release history of GCSs were identified simultaneously based on the improved 0-1 MINPOM.

The DBNN, LSTM and Kriging methods were applied to establish an ensemble surrogate.

The ensemble surrogate was linked to the improved 0-1 MINPOM for iterative calculation.

## INTRODUCTION

Groundwater contamination is concealed beneath the surface of the earth and cannot be detected in time after it occurs, which makes it difficult to understand the status of groundwater contamination sources (GCSs) (Lapworth *et al.* 2012). As a consequence, there are great challenges for contamination risk assessment, contamination remediation and identifying responsibility for contamination accidents (Sun *et al.* 2006). Developing an effective method for identifying GCSs would address these problems and have great significance (Han *et al.* 2020).

The identification of GCSs roughly started in the 1980s (Gorelick *et al.* 1983). After nearly 40 years of development, a variety of approaches have been proposed for identifying GCSs, including simulation optimization (S/O) approaches (Mahar & Datta 2000; Xing *et al.* 2019), direct approaches (Skaggs & Kabala 1994), probabilistic and geostatistical approaches (Neupauer *et al.* 2000), and analytical approaches (Sidauruk *et al.* 2010).

Among those approaches, S/O approaches are used widely to characterize GCSs due to their strong applicability (Hou & Lu 2018; Moghaddam *et al.* 2021). Woodbury & Ulrych (1996) recover the evolution history of contamination plumes in groundwater systems by utilizing the minimum relative entropy method. Mahar & Datta (2000) proposed a simulation nonlinear optimization approach to identify the release intensity and spatial location for GCSs in an unsteady flow system. Mahinthakumar & Sayeed (2006) employed the hybrid optimization method to reconstruct release histories of GCSs. Jha & Datta (2011) applied the S/O approach to identify the GCSs information, and used the simulated annealing algorithm to solve the optimization problem to obtain the release intensity and duration time of GCSs. Zhao *et al.* (2015) employed optimization approaches to identify the location and release intensity for GCSs.

Good research results have been achieved in the application of S/O methods for GCSs identification (Mahar & Datta 2001; Jiang *et al.* 2013). However, how to simultaneously identify the number, location and release history of GCSs in a potential contamination area (a continuous area, not just a few fixed locations) is still not reported. Most researchers set preconditions when applying S/O approaches to identify GCSs, only identifying one or two characteristics in the number, location and release history, such as: (1) identify the release history of GCSs on the premise of knowing the number and location of GCSs (Zhao *et al.* 2016); (2) identify the actual release history and location of GCSs on the premise that there are 3 ∼ 4 potential fixed locations of GCSs (Ayvaz 2010; Guo *et al.* 2019); and (3) identify the actual release history and location of GCSs on the premise of knowing the number of GCSs (set to one) and a potential contaminated area where the contaminant may be released (Yeh *et al.* 2014; Xu & Gómez-Hernández 2016). Therefore, the simultaneous identification of the number, location and release history of GCSs is still a thorny problem.

Guo *et al.* (2019) used a 0-1 MINPOM to identify the actual location (treated as a 0-1 integer variable) of the GCS among the four potential fixed locations, while identifying the release intensity (treated as a continuous variable) of the GCSs. Though the identification of number, location and release history of GCSs in a continuous potential contamination area remains unresolved, the emergence of 0-1 MINPOM makes it possible to simultaneously identify the number, location and release history of GCSs in a suspected contaminated area. Thus, a 0-1 MINPOM with better performance based on the study of Guo *et al.* (2019) was proposed to simultaneously identify the number, location and release history of GCSs in this study.

In addition, when solving the improved 0-1 MINPOM, the simulation model must be called repeatedly for calculations, thereby incurring miscellaneous computational load and wasting a lot of computational time (Zhao *et al.* 2016). This deficiency can be solved by using a surrogate model instead of simulation model for iterative calculation. In the previous study, the deep belief neural network (DBNN), long short-term memory network (LSTM) and Kriging methods were used to build surrogate models for simulation models (Jiang *et al.* 2015; Li *et al.* 2021; Miao & Guo 2021).

However, the approximation accuracy of a single surrogate model to a simulated model is often different due to different problems. No single surrogate model can guarantee the highest accuracy at any point (Xing *et al.* 2019; Hou *et al.* 2017). To make the single surrogate model give full play to its own advantages and abandon its own shortcomings meanwhile, this research establishes an ensemble surrogate model of the simulation model for the first time based on DBNN, LSTM and Kriging method, so as to further improve the approximation accuracy of the surrogate model to the simulation model.

Then the ensemble surrogate model was linked to the improved 0-1 MINPOM and participates in the iterative computation in place of the simulation model in the process of solving the optimization model. After solving the improved 0-1 MINPOM, the identification results of the number, location and release history of GCSs could be obtained simultaneously.

## METHODOLOGY

### Surrogate modeling methods

#### Deep belief neural network

The DBNN consists of multiple restricted Boltzmann machine (RBM) stacks and a top-level algorithm (Hinton *et al.* 2006). The role of an RBM is to extract the features of the input data and the role of the top-level algorithm is to calculate the final output value of the network (Chen *et al.* 2017). The learning process of the DBNN includes two steps: unsupervised learning training and supervised learning training:

- (1)
Unsupervised learning training

*n*neurons, and the hidden layer has

*m*neurons. There is no internal connection between the neurons, while the neurons in the hidden layer and the neurons in the visual layer are fully connected. An RBM is an energy-based model. Its joint probability distribution can be calculated based on the energy function (Hecht-Nielsen 1989). When dealing with regression prediction problems, a Gaussian RBM is typically used to process continuous data. In this case, the output of the RBM becomes a continuous value between 0 and 1. For any given set of states , its energy function is expressed as follows:where ; represents all neurons in the visible layer, represents all neurons in the hidden layer,

*w*is the connection weight between the visible layer and the hidden layer, is the bias vector of the visible layer, and is the bias vector of the hidden layer, is the tolerance value corresponding to the input

*v*; By deriving Equation (1), the joint probability density function of state can be obtained as follows:where is the normalization factor. Then the marginal distribution of the visible layer and the hidden layer can be calculated from the joint probability distribution, respectively:

The last layer of the DBNN is a back propagation (BP) neural network. After training the last layer of the RBM, the output is used as the input of the BP neural network, and the BP neural network is used to predict the output.

- (2)
Supervised learning training

#### Long short-term memory network

LSTM is an improved recurrent neural network (RNN). Compared with RNN, LSTM introduces three gate mechanisms, namely input gate, forget gate and output gate. Through these three gate mechanisms, LSTM can selectively remember or forget some information in the training process, achieve better training effect, and solve the problem that the RNN has long-term dependence on input data (Hochreiter & Schmidhuber 1997; Graves & Schmidhuber 2005; Li *et al.* 2021). The training of LSTM consists of two main processes, forward propagation and back propagation.

- (1)
Forward propagation process

In the forward propagation process of LSTM, the memory cell at time *t* will receive input data from two sources, namely the output of the memory cell at time and the input data at time *t*. Based on the input data from these two sources and the three gate mechanisms in the memory cell, the output of the memory cell at time *t* can be calculated.

Then, use the input gate to update the state of the memory cell at time *t*. The input gate consists of two parts: sigmoid activation function part and the Tanh activation function part. Input the two sources of input data into the two function parts for activation, in preparation for updating the state of the memory cell.

To further optimize the LSTM network, the relevant parameters of the LSTM are adjusted inversely by using the training samples. According to the loss function, the weight matrix and bias vector are adjusted using the chain rule. The mathematical expressions for calculating the gradient of the related parameters are as follows (take the output gate and visual layer as examples).

*L*is loss function. After calculating the gradient of the above parameters, the relevant parameters of the LSTM can be updated by using the gradient algorithm. After all the weight matrices and bias vectors are obtained, the LSTM regression prediction model can be obtained.

#### Kriging

*et al.*2019):where: is the input item of the training data set, are known as determinate regression functions, is a regression coefficient matrix, which can be estimated from the training data set,

*p*is the number of determinate regression functions, and is a random part deviation. For the detailed principle of the Kriging, please refer to the article by Guo

*et al.*(2019).

### Simulation optimization method

The identification of GCSs in this study was based on the S/O method. Therefore, establishing the optimization model is a key step in the research process (Cristo & Leopardi 2008; Datta *et al.* 2009; Jiang *et al.* 2021). A brief introduction to establishing an optimization model is as follows: the GCSs identification problem is treated as an optimization problem, then determine the decision variables, objective function (the fitting error between the observed and simulated concentrations is usually used as the objective function) and constraints (include parameter value constraints, GCSs information constraints, and contaminant transport rules constraints). Finally, the objective function and constraints constitute the optimization model and the optimization model is solved to obtain the GCSs identification results.

*et al.*2019):where is the objective function, are the decision variables in the optimization model,

*e*is the total number of decision variables, is the inequality constraint, and are the lower and upper bound of the inequality constraint,

*n*is the total number of inequality constraints, is the equality constraint and

*m*is the total number of inequality constraints.

## NUMERICAL APPLICATIONS

### Site overview

Parameter values . | Value . | ||
---|---|---|---|

I . | II . | III . | |

Hydraulic conductivity in x-direction, Kxx (m/d) | 42 | 30 | 18 |

Hydraulic conductivity in y-direction, Kyy (m/d) | 42 | 30 | 18 |

Effective porosity, θ | 0.25 | 0.2 | 0.18 |

Longitudinal dispersivity, α (m) _{L} | 40 | 40 | 40 |

Transverse dispersivity, α (m) _{T} | 8 | 8 | 8 |

Saturated thickness, b(m) | 35 | 35 | 35 |

Grid spacing in x-direction, Δx (m) | 3 | 3 | 3 |

Grid spacing in y-direction, Δy (m) | 3 | 3 | 3 |

Length of the simulation period, Δt (month) | 6 | 6 | 6 |

Recharge flux per unit area, W (m/d) | 0.0004 | 0.00035 | 0.0003 |

Initial concentration (mg/L) | 0 | 0 | 0 |

Parameter values . | Value . | ||
---|---|---|---|

I . | II . | III . | |

Hydraulic conductivity in x-direction, Kxx (m/d) | 42 | 30 | 18 |

Hydraulic conductivity in y-direction, Kyy (m/d) | 42 | 30 | 18 |

Effective porosity, θ | 0.25 | 0.2 | 0.18 |

Longitudinal dispersivity, α (m) _{L} | 40 | 40 | 40 |

Transverse dispersivity, α (m) _{T} | 8 | 8 | 8 |

Saturated thickness, b(m) | 35 | 35 | 35 |

Grid spacing in x-direction, Δx (m) | 3 | 3 | 3 |

Grid spacing in y-direction, Δy (m) | 3 | 3 | 3 |

Length of the simulation period, Δt (month) | 6 | 6 | 6 |

Recharge flux per unit area, W (m/d) | 0.0004 | 0.00035 | 0.0003 |

Initial concentration (mg/L) | 0 | 0 | 0 |

This study uses two cases to test the applicability of the method developed. The aquifer parameters of the hypothetical contaminated site were the same for the two cases, as shown in Table 1.

The aquifer was discretely divided into 5262 grids in order to make the study more convenient. Since the grid is treated as an integer variable, the location information of the potential contamination source area in the study area is represented by grid numbers. The horizontal and longitudinal locations of the potential contamination source areas and their corresponding grid numbers are shown in Table 2.

(a) horizontal . | |||||||
---|---|---|---|---|---|---|---|

Number . | Location . | Number . | Location . | Number . | Location . | Number . | Location . |

19 | 55.5 | 32 | 94.5 | 45 | 133.5 | 58 | 172.5 |

20 | 58.5 | 33 | 97.5 | 46 | 136.5 | 59 | 175.5 |

21 | 61.5 | 34 | 100.5 | 47 | 139.5 | 60 | 178.5 |

22 | 64.5 | 35 | 103.5 | 48 | 142.5 | 61 | 181.5 |

23 | 67.5 | 36 | 106.5 | 49 | 145.5 | 62 | 184.5 |

24 | 70.5 | 37 | 109.5 | 50 | 148.5 | 63 | 187.5 |

25 | 73.5 | 38 | 112.5 | 51 | 151.5 | 64 | 190.5 |

26 | 76.5 | 39 | 115.5 | 52 | 154.5 | 65 | 193.5 |

27 | 79.5 | 40 | 118.5 | 53 | 157.5 | 66 | 196.5 |

28 | 82.5 | 41 | 121.5 | 54 | 160.5 | 67 | 199.5 |

29 | 85.5 | 42 | 124.5 | 55 | 163.5 | 68 | 202.5 |

30 | 88.5 | 43 | 127.5 | 56 | 166.5 | 69 | 205.5 |

31 | 91.5 | 44 | 130.5 | 57 | 169.5 | 70 | 208.5 |

(b) longitudinal . | |||||||

Number . | Location . | Number . | Location . | Number . | Location . | . | . |

18 | 157.5 | 31 | 196.5 | 44 | 235.5 | ||

19 | 160.5 | 32 | 199.5 | 45 | 238.5 | ||

20 | 163.5 | 33 | 202.5 | 46 | 241.5 | ||

21 | 166.5 | 34 | 205.5 | 47 | 244.5 | ||

22 | 169.5 | 35 | 208.5 | 48 | 247.5 | ||

23 | 172.5 | 36 | 211.5 | 49 | 250.5 | ||

24 | 175.5 | 37 | 214.5 | 50 | 253.5 | ||

25 | 178.5 | 38 | 217.5 | 51 | 256.5 | ||

26 | 181.5 | 39 | 220.5 | 52 | 259.5 | ||

27 | 184.5 | 40 | 223.5 | 53 | 262.5 | ||

28 | 187.5 | 41 | 226.5 | 54 | 265.5 | ||

29 | 190.5 | 42 | 229.5 | 55 | 268.5 | ||

30 | 193.5 | 43 | 232.5 |

(a) horizontal . | |||||||
---|---|---|---|---|---|---|---|

Number . | Location . | Number . | Location . | Number . | Location . | Number . | Location . |

19 | 55.5 | 32 | 94.5 | 45 | 133.5 | 58 | 172.5 |

20 | 58.5 | 33 | 97.5 | 46 | 136.5 | 59 | 175.5 |

21 | 61.5 | 34 | 100.5 | 47 | 139.5 | 60 | 178.5 |

22 | 64.5 | 35 | 103.5 | 48 | 142.5 | 61 | 181.5 |

23 | 67.5 | 36 | 106.5 | 49 | 145.5 | 62 | 184.5 |

24 | 70.5 | 37 | 109.5 | 50 | 148.5 | 63 | 187.5 |

25 | 73.5 | 38 | 112.5 | 51 | 151.5 | 64 | 190.5 |

26 | 76.5 | 39 | 115.5 | 52 | 154.5 | 65 | 193.5 |

27 | 79.5 | 40 | 118.5 | 53 | 157.5 | 66 | 196.5 |

28 | 82.5 | 41 | 121.5 | 54 | 160.5 | 67 | 199.5 |

29 | 85.5 | 42 | 124.5 | 55 | 163.5 | 68 | 202.5 |

30 | 88.5 | 43 | 127.5 | 56 | 166.5 | 69 | 205.5 |

31 | 91.5 | 44 | 130.5 | 57 | 169.5 | 70 | 208.5 |

(b) longitudinal . | |||||||

Number . | Location . | Number . | Location . | Number . | Location . | . | . |

18 | 157.5 | 31 | 196.5 | 44 | 235.5 | ||

19 | 160.5 | 32 | 199.5 | 45 | 238.5 | ||

20 | 163.5 | 33 | 202.5 | 46 | 241.5 | ||

21 | 166.5 | 34 | 205.5 | 47 | 244.5 | ||

22 | 169.5 | 35 | 208.5 | 48 | 247.5 | ||

23 | 172.5 | 36 | 211.5 | 49 | 250.5 | ||

24 | 175.5 | 37 | 214.5 | 50 | 253.5 | ||

25 | 178.5 | 38 | 217.5 | 51 | 256.5 | ||

26 | 181.5 | 39 | 220.5 | 52 | 259.5 | ||

27 | 184.5 | 40 | 223.5 | 53 | 262.5 | ||

28 | 187.5 | 41 | 226.5 | 54 | 265.5 | ||

29 | 190.5 | 42 | 229.5 | 55 | 268.5 | ||

30 | 193.5 | 43 | 232.5 |

The real number, location, and release history of GCSs for the two cases are as follows:

Case one: the number was one, the location is as shown in Figure 1(a) and release intensities during two release periods are as shown in Table 3a.

(a) case one . | |||||
---|---|---|---|---|---|

Number of sources . | Grid number . | Release intensity (g/d) . | . | ||

Horizontal . | Longitudinal . | SP1 . | SP2 . | . | |

1 | 35 | 52 | 260 | 140 | |

(b) case two . | |||||

. | . | Grid number . | Release intensity (g/d) . | ||

Number of sources . | Sources . | Horizontal . | Longitudinal . | SP1 . | SP2 . |

2 | S1 | 35 | 52 | 190 | 120 |

S2 | 41 | 36 | 170 | 160 |

(a) case one . | |||||
---|---|---|---|---|---|

Number of sources . | Grid number . | Release intensity (g/d) . | . | ||

Horizontal . | Longitudinal . | SP1 . | SP2 . | . | |

1 | 35 | 52 | 260 | 140 | |

(b) case two . | |||||

. | . | Grid number . | Release intensity (g/d) . | ||

Number of sources . | Sources . | Horizontal . | Longitudinal . | SP1 . | SP2 . |

2 | S1 | 35 | 52 | 190 | 120 |

S2 | 41 | 36 | 170 | 160 |

Case two: the number was two, the locations of the two GCSs are shown in Figure 1(b) and release intensities of the two GCSs during two release periods are as shown in Table 3b.

The information on the GCSs for the two cases were regarded as unknown during the inverse identification process (include the number, location, and release intensity); only the potential contamination source area was known (shown in Figure 1(a) and 1(b)). For hypothetical case one and case two it is estimated that the maximum number of GCSs is 3. The number, location, and release intensity of the GCSs were then identified.

All the GCSs are designed to have same initial release times (the first day of simulation) and the time intervals between the initial contaminant release of the GCSs and first contaminant concentration measurement of the two cases are shown in Table 4. The total contaminant release duration of the two cases is 180 days and the total release time is divided into two equal stress periods of 90 days. Concentration is measured every 90 days from the start of the first measurement time. The contaminant release intensity of the GCSs is designed to be constant during the two stress periods. The total simulation time for contaminant transport is 900 days. A total of eight periods of concentration measurements are utilized for the two cases.

Case . | Activity duration (d) . | Time interval between the initial contaminant release of the source and first measurement (d) . | Concentration measurement intervals (d) . |
---|---|---|---|

one | 2 × 90 | 270 | 90 |

two | 2 × 90 | 270 | 90 |

Case . | Activity duration (d) . | Time interval between the initial contaminant release of the source and first measurement (d) . | Concentration measurement intervals (d) . |
---|---|---|---|

one | 2 × 90 | 270 | 90 |

two | 2 × 90 | 270 | 90 |

### Numerical simulation model

The identification of GCSs needs to be carried out on the basis of establishing a numerical simulation model of the groundwater system (Singh & Datta 2006). The governing partial differential mathematical equations for the transient flow and the contaminant transport are defined as follows (Pinder & Bredehoeft 1968):

The GMS software was used to solve the transient flow and contaminant transport simulation model.

### Ensemble surrogate model

A surrogate model with satisfactory accuracy is a stand-in for iterative calculation in place of the simulation model, and it can output nearly the same results as the simulation model under the premise of the same output. The surrogate model is not only simpler to call, but it can reduce a lot of computational load and computational time when identifying the GCSs (Simpson *et al.* 2001; Jiang *et al.* 2015).

The horizontal and longitudinal grid numbers of locations and release intensity corresponding to a maximum number of the GCSs were used as the input item for the surrogate model (6 locations and 6 release intensities comprising a total of 12 input variables). The contaminant concentrations were treated as the output of the surrogate model. Using the Latin hypercube method, the locations (grid numbers) and release intensity of each GCS were sampled 420 times in their feasible domains. Then the locations and release intensities were used sequentially as inputs for the simulation model. Run the simulation model to output the contaminant concentrations data of each observation well (output data). Divide all 420 groups data into training data and testing data. The training data consists of 360 groups of input and output data, and the testing data consists of 60 groups of input and output data. The DBNN, LSTM and Kriging methods were coded with MATLAB, and the DBNN, LSTM and Kriging surrogate model were then established based on the training and testing data.

*i*-th observation well in the

*j*-th period of the

*k*-th test sample output by the ensemble surrogate model and the simulation model, respectively,

*m*is the total number of observation wells,

*t*is the total number of observation periods,

*n*the total number of test samples. , , and are the weights occupied by DBNN, LSTM and Kriging surrogate models in the ensemble surrogate model, respectively.

In Equations (32)–(34), is the *i-*th output value obtained from the numerical simulation model, is the *i-*th output value obtained from the surrogate model, and is the mean value of the output values obtained from the simulation model. Smaller values for the MRE and RMSE, as well as values of *R*^{2} closer to 1 indicate that the surrogate model has higher approximation accuracy to the simulation model.

### 0-1 MINPOM

*i*represents the

*i*-th potential GCS; is the release intensity of the GCSs;

*m*represents the

*m*-th release intensity variable; is the simulated contaminant concentration; and is the measured contaminant concentration.

The 0-1 MINPOM for identifying the GCSs in each case was the same, except the measured contaminant concentrations corresponding to the two cases were different.

## RESULTS AND DISCUSSION

### Accuracy analysis of the surrogate models

The weights of the three single surrogate models in the ensemble surrogate model are shown in Table 5.

Single surrogate model . | DBNN . | LSTM . | Kriging . |
---|---|---|---|

Weights | 0.095 | 0.638 | 0.267 |

Single surrogate model . | DBNN . | LSTM . | Kriging . |
---|---|---|---|

Weights | 0.095 | 0.638 | 0.267 |

^{2}, RMSE, and MRE of the 60 groups of output samples from the simulation model and those from the four surrogate models. To see the fitting effect between the surrogate models and the simulation model, the fitting effect between simulation model and surrogate model was drawn (shown in Figure 4). To clearly compare and analyze the accuracies of the four surrogate models, the contrast histograms were plotted for R

^{2}, RMSE, and MRE (shown in Figure 5).

It can be seen from Table 5 that the higher the accuracy of the single surrogate model, the greater the proportion in the ensemble surrogate model. It could be seen from Figure 4 that among the three single surrogate models, the LSTM surrogate model has the best fitting effect to the simulation model, followed by the Kriging surrogate model, while the DBNN surrogate model was the poorest. The fitting effect of the ensemble surrogate model based on the three single surrogate models was higher than that of any single surrogate model. It could be clearly seen from Figure 5 that for each observation well, the three accuracy indexes of the ensemble surrogate model were better than the other three single surrogate models. The mean R^{2} of DBNN, LSTM, Kriging and ensemble surrogate model were 0.9812, 0.9948, 0.9904 and 0.9960, respectively. The mean RMSE of DBNN, LSTM, Kriging and ensemble surrogate model were 0.0068, 0.0039, 0.0052 and 0.0034, respectively. The mean MRE of DBNN, LSTM, Kriging and ensemble surrogate model were 8.16%, 3.98%, 5.41% and 3.65%, respectively. The above results indicated that the ensemble surrogate model had the best approximation accuracy to the simulated model. Therefore, the ensemble surrogate model was linked to the 0-1 MINPOM for simultaneously identifying the number, location (grid numbers) and release history of GCSs.

### Analysis of GCSs identification results

The ensemble surrogate model was linked to the 0-1 MINPOM and the (genetic algorithm) GA was used to solve the 0-1 MINPOM. A detailed introduction to GA was provided by Guo *et al.* (2019) and Zwickl (2006). The population size and generation of GA are 200 and 1200, respectively, and for other relevant parameter settings of the GA refer to the articles of Li *et al.* (2021).

It takes about 1,333 hours to call the simulation model 240,000 times during the solving process of the 0-1 MINPOM, while it only takes about 4.79 hours if the surrogate model is applied 240,000 times. Thus, using the surrogate model for iterative calculation could save 99% of the computational time and the computational load.

(a) case one . | |||||||||
---|---|---|---|---|---|---|---|---|---|

Case one . | Number of sources . | S-location (m) . | S-Release intensity (g/d) . | . | . | . | . | ||

X . | Y . | SP1 . | SP2 . | . | . | . | . | ||

Real value | 1 | 35 | 52 | 260 | 140 | ||||

Identified results | 1 | 35 | 53 | 254.06 | 141.48 | ||||

Relative error | 0.00% | 2.86% | 3.85% | 2.28% | 1.06% | ||||

(b) case two . | |||||||||

. | . | S1-location (m) . | S2-location (m) . | S1-Release intensity (g/d) . | S2-Release intensity (g/d) . | ||||

Case two . | Number of sources . | X . | Y . | X . | Y . | SP1 . | SP2 . | SP1 . | SP2 . |

Real value | 2 | 35 | 52 | 41 | 36 | 190 | 120 | 170 | 160 |

Identified results | 2 | 35 | 55 | 41 | 33 | 195.83 | 115.09 | 189.08 | 133.62 |

Relative error | 0.00% | 0.00% | 0.00% | 4.88% | 2.78% | 3.07% | 4.09% | 11.22% | 16.49% |

(a) case one . | |||||||||
---|---|---|---|---|---|---|---|---|---|

Case one . | Number of sources . | S-location (m) . | S-Release intensity (g/d) . | . | . | . | . | ||

X . | Y . | SP1 . | SP2 . | . | . | . | . | ||

Real value | 1 | 35 | 52 | 260 | 140 | ||||

Identified results | 1 | 35 | 53 | 254.06 | 141.48 | ||||

Relative error | 0.00% | 2.86% | 3.85% | 2.28% | 1.06% | ||||

(b) case two . | |||||||||

. | . | S1-location (m) . | S2-location (m) . | S1-Release intensity (g/d) . | S2-Release intensity (g/d) . | ||||

Case two . | Number of sources . | X . | Y . | X . | Y . | SP1 . | SP2 . | SP1 . | SP2 . |

Real value | 2 | 35 | 52 | 41 | 36 | 190 | 120 | 170 | 160 |

Identified results | 2 | 35 | 55 | 41 | 33 | 195.83 | 115.09 | 189.08 | 133.62 |

Relative error | 0.00% | 0.00% | 0.00% | 4.88% | 2.78% | 3.07% | 4.09% | 11.22% | 16.49% |

Case one: A value ‘1’ (shown in Figure 6(a)) in the identification results indicates that the number of GCSs was one. The location (grid number) and the release intensities of GCSs are shown in Table 6a. The identification accuracy of the number and location was 100%, and the identification accuracy of location and release history was above 96.15% and 97.72%, respectively.

Case two: Those two values of ‘1’ (shown in Figure 6(b)) in the identification results indicate that the number of GCSs was two. The location (grid number) and the release intensities of two GCSs are shown in Table 6b. The identification accuracy of the number was 100%, and the identification accuracy of location and release history was above 95.12% and 83.51%, respectively.

The identification results of GCSs for the two cases showed that compared with the following situations the 0–1 MINPOM in this study not only effectively synchronously identified the number, location, and release history of the GCSs, but ensured high identification accuracy: (1) The exact location of the GCSs can only be identified from the potential contamination source area on the premise that the number of GCSs is known (Mo *et al.* 2019; Pan *et al.* 2021; Wang *et al.* 2022); (2) Under the premise that the number and location of GCSs is known, the release history is identified (Xing *et al.* 2019). It effectively improved the shortcoming of the previous 0-1 MINPOM which could only identify the location and release history of GCSs.

## DISCUSSION

The actual number of GCSs was lower than the estimated maximum number of three in this study. If the actual number of GCSs is equal to or more than the maximum estimated number of GCSs, the solution is increasing the original estimated maximum number of GCSs. Then reestablishing the surrogate model and 0-1 MINPOM to identify the information of the GCSs until the number of GCSs identified is lower than the estimated maximum number of GCSs (in a feedback correction process).

However, the input dimension of the simulation model will increase as the estimated maximum number of GCSs increases and the nonlinear relationships will become more complicated. When faced with a simulation model with a more complex nonlinear relationship, the accuracy of the surrogate model should be further improved through reasonable study.

Moreover, when the estimated maximum number of GCSs increases, the number of variables to be identified will inevitably increase, which may increase the probability of the occurrence of ‘different parameters with same effect’. When using the simulation optimization method to identify GCSs information, it is often difficult to solve the problem of ‘different parameters with same effect’. In addition to increasing the constraints of the optimization model as much as possible, how to solve this problem from other perspectives needs to be further studied. In addition, this study did not consider the uncertainty of parameters. Since the uncertainty of parameters exists objectively, the uncertainty of parameters should be considered in the GCSs identification research in future.

## CONCLUSIONS

Two conclusions were obtained based on the study:

First, among the three single surrogate models, the LSTM surrogate model had the best accuracy, and the DBNN surrogate model had the worst accuracy. The single surrogate model with better accuracy occupied a larger proportion of the ensemble surrogate model. Compared with the three single surrogate models of DBNN, LSTM and Kriging, the fitting effect of the ensemble surrogate model on the output results of the numerical simulation model was significantly improved. Applying the ensemble surrogate model instead of the simulation model to participate in iterative calculation could save more than 99% of the computational time and the computational load.

Second, the 0-1 MINPOM could only identify the location and release history of the GCSs in the previous study. In the present study, an improved 0-1 MINPOM was established based on treating the number of GCSs as an integer variable with a value of 0 or 1, the location (represented by grid number) and release intensity of GCSs as integer variables and continuous variables, respectively. Then the number, location, and release intensity of GCSs were simultaneously identified using the improved 0-1 MINPOM. The identification results were very similar to the actual values of the GCSs characteristics. The accuracy of identifying the number of GCSs was 100%, and the accuracies of identifying the locations and release intensities were over 95.12% and 83.51%, respectively. Comparing with the 0-1 MINPOM proposed by previous researchers, the improved 0-1 MINPOM proposed in this study is more applicable to the identification of GCSs.

## ACKNOWLEDGEMENTS

The authors acknowledge support provided by the National Nature Science Foundation of China (Grant Nos. 42202273), the Fundamental Research Funds for the Central Universities (2412022QD001) and the National Key R&D Program of China (Grant Nos. 2019YFC0409101). Special gratitude is extended to the journal editors for their efforts in evaluating this study. The valuable comments provided by the anonymous reviewers are also gratefully acknowledged.

## FUNDING

This work was supported by the National Nature Science Foundation of China (Grant Nos. 42202273), the Fundamental Research Funds for the Central Universities (2412022QD001) and the National Key R&D Program of China (Grant Nos. 2019YFC0409101).

## AUTHOR CONTRIBUTIONS

All authors contributed to the study conception and design. Conceptualization, methodology, software, writing the original draft, and formal analysis were performed by Jiuhui Li. Writing review & editing, supervision, and project administration were performed by Zhengfang Wu. Software, validation, conceptualization, and methodology were performed by Wenxi Lu. Validation, writing review & editing were performed by Hongshi He.

## DATA AVAILABILITY STATEMENT

All relevant data are available from an online repository or repositories at https://github.com/lijiuhui666/0-1 MINLPOM/upload

## REFERENCES

**220**, 18–25.

*International 1989 Joint Conference on Neural Networks (IJCNN)*

*Environmental Forensics*

**7**(1), 45–54. https://doi.org/10.1080/15275920500506774.

*Water Resources Research*

*Water Resources Research*

**30**(1), 71–79. https://doi.org/10.1029/93WR02656

**52**(8), 6587–6595.