A better understanding of the distribution of nonaqueous phase liquid (NAPL) plumes is of great importance to groundwater pollution remediation and control. However, the efficiency of surrogate models in simulating the transport is still not well addressed. Selecting a leakage problem as an example, 50 sets of random permeability distributions are generated using the Monte Carlo method, and a numerical model is used to obtain benchmark data of NAPL transport. Four machine learning methods are employed to simulate dense NAPL transport under point leakage sources across spatiotemporal scales. The validation of the models demonstrate that the random forest model can also effectively capture the spatial-temporal distribution of the plume in heterogeneous aquifers, with a maximum mean absolute error and root mean square error smaller than 5.55 × 10−4 and 5.88 × 10−5, respectively. Meanwhile, the multiple phase outcome from the random forest model fits well with the numerical results under the scenarios of linear leakage sources and light NAPL transport. The total time consumed in the computation is reduced by over 150 times after using the surrogate models. The results suggest that surrogate models can provide a promising way to understand NAPL transport in heterogeneous aquifers.

  • Numerical solutions provided the NAPL transport benchmark dataset under different scenarios.

  • Four ML models were trained to capture DNAPL transport across the spatial-temporal scales.

  • The RF model had the best performance in reproducing multiphase transport in terms of both the accuracy and calculation efficiency.

  • Surrogate models provided promising applicability to quickly and effectively evaluate NAPL transport.

Groundwater provides a large share of the world's water demand for various sections, especially drinking water. However, under recent intense anthropogenic activities seeking rapid economic and social development, the safety of groundwater resources has been increasingly threatened (Jia et al. 2019). At many chemical sites, the accidental release of underground nonaqueous phase liquids (NAPLs) is the greatest threat to drinking groundwater and has received increasing attention (Kang et al. 2021). NAPL pollutants, especially dense nonaqueous phase liquids (DNAPLs), can cause serious harm to human health after entering soil and groundwater. For example, trichloroethylene (TCE) can cause cancer, teratogenesis, and mutagenesis (Huang & Jennifer 2011). NAPL has been approved by the US Environmental Protection Agency and is listed as a priority control pollutant. Compared with light nonaqueous phase liquid (LNAPL), DNAPL has a higher density, lower interfacial tension and more complex migration behavior. Due to the difficulty of detection, wide distribution in chemical companies and their complex migration behavior, NAPL has become an important source of pollution in soil and groundwater.

When entering the underground medium, DNAPL tends to move down through the vadose zone and accumulate in the lower confining bed due to its high density and insolubility in water. Additionally, the heterogeneity of underground media makes DNAPL transport with different phases much more complicated in contaminated sites. As an important means to analyze NAPL migration and transformation in soil and groundwater, numerical models can accurately describe the migration process and mass transfer mechanism of pollutants through strict mathematical equations, which have been identified by many laboratory tests (Kueper et al. 1989) and field tests (Boggs et al. 1992; Aly & Peralta 1999; Akbarnejad-Nesheli et al. 2016). TOUGH2/TMVOC (Pruess & Battistelli 2002), a numerical simulator for multiphase non-isothermal flows of multicomponent hydrocarbon mixtures, can accurately simulate NAPL migration and transformation in saturated-unsaturated heterogeneous media. TOUGH2-MP/TMVOC is a massively parallel (MP) version of TOUGH2/TMVOC (Zhang et al. 2007). It uses the METIS software package for grid partitioning and AZTEC linear-equation solver to solve large simulation problems that may not be solved by the standard, single-CPU TOUGH2/TMVOC code. The parallel code has been successfully applied from multi-core PCs to supercomputers on real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport.

Usually, reasonable physical parameter estimation, time consumption and necessary knowledge of model exercises from model construction and simulation to analysis disable environmental managers to effectively apply numerical models to understand NAPL migration processes. In response to this problem, some scholars have sought surrogate models for the simulation of multiphase flow (Shieh & Peralta 2005; Kumar et al. 2013; Yadav et al. 2016; Mo et al. 2019; Luo et al. 2020), which are approximate substitutes for numerical models. With high accuracy, the relationship between input variables and output variables is simulated with a small calculation burden and can be used to quickly investigate the NAPL distributions over the spatial and temporal scales. The machine learning method, which seeks the relation between the data while ignoring the physical principles, has been developed rapidly and has been widely utilized in groundwater-related fields in recent years (Nozari et al. 2022). Data-driven methods have achieved reasonable results in many case studies related to the quantity and quality of groundwater (Sahoo et al. 2017; Sun & Scanlon 2019; Bedi et al. 2020; Chakraborty et al. 2020; Desimone et al. 2020; Malakar et al. 2021; Sun et al. 2021a, 2021b; Zounemat-Kermnai et al. 2021; Deng et al. 2023). The commonly used machine learning methods include support vector machines (Khalil et al. 2005; Arabgol et al. 2016), artificial neural networks (ANNs) (Rao 2006; Srivastava & Singh 2015; Seifi & Riahi-Madvar 2019), the RF method (Naghibi et al. 2020), the kriging model (Kholghi & Hosseini 2009), and fuzzy theory (Chu & Chang 2009). Unfortunately, there are few studies on the application of machine learning in the field of multiphase flow. Zhong et al. (2019) used the generative adversarial nets (GAN) model to establish a multiphase flow surrogate model to simulate the CO2 pollution plume in heterogeneous media. Hou et al. (2015) established a surrogate model of a multiple phase flow model (UTCHEM) using support vector regression (SVR). With the optimization of machine learning algorithms, the application of machine learning in underground multiphase flow will become increasingly reliable (Zhang et al. 2020). Deep learning models (such as convolutional neural networks and long- and short-term memory models) have also demonstrated their potential. Mo et al. (2019) built a surrogate model for the quantification of the uncertainty of multiphase flow based on a deep convolutional codec network and proposed a two-stage training strategy to improve the approximation of saturation discontinuities, accurately describing the spatiotemporal evolution of pressure and discontinuous CO2 saturation fields.

The analysis of NAPL transport using surrogate models has been less focused, especially on NAPL transport in different phases and different leakage mode. Under this situation, the objective of this study is to explore the effectiveness of surrogate models in simulating NAPL transport across spatial and temporal scales. First, the TOUGH2-MP/TMVOC model was employed to obtain the benchmark data for the distribution of NAPL under both point and linear leakage conditions in heterogeneous aquifers. Then, four surrogate models, MLP, SVM, GBR, and RF, were trained and validated to capture the detailed features of DNAPL transport with point leakage pollution releases. Next, a comprehensive comparison of the outcomes generated by the four surrogate models was conducted to identify the optimal model, and the sensitivities of the model parameters and number of scenarios on the model accuracy were analyzed. Finally, the identified optimal model was used to assess the model efficiency in depicting multiphase distributions, DNAPL transport with linear leakage pollution sources, and LNAPL transport. The results can provide reference the selection of remediation strategies for organic contamination in heterogeneous aquifers.

Hypothetical model

The flowchart of this work is shown in Figure 1. For the sake of simplicity, a classical two-dimensional cross-sectional model (Pruess & Battistelli 2002) is applied to explore the applicability of machine learning models. The focused study area has a length of 100 m and a total depth of 15 m (Figure 2), and it is assumed that the aquifer system be under hydrostatic equilibrium state with limited precipitation infiltration. Then NAPL component is injected to the aquifer and multiphase NAPL transport will be focused. Considering that the porous media in the study area are usually heterogeneous, a random distribution of permeability varying from 0.001 to 10 Darcy using the Monte Carlo method is assumed throughout the domain, where one kind of distribution is shown in Figure 3. Relative permeability functions are set as Fatt and Klikoff, and three-phase capillary functions are set from Pruess et al. (2012). The top and bottom boundaries of the model are set as the Dirichlet boundary with constant atmospheric pressure and no flow boundary, respectively. The pressures at the left and right boundaries remain constant. The depth to the water table is set as 5 m on the left and right sides of the model. The impact of temperature changes is not considered, and the ambient temperature is set as 20 °C. It is assumed that the precipitation infiltrates uniformly underground with an average annual rate of 0.10 m/year through the ground. Two leakage scenarios are considered, including point and linear pollution sources. The point leakage is almost located at the center and top of the study area, and the linear leakage is set along the horizontal direction (Figure 2). The pollutant is assumed to continuously leak underground at a constant rate of 0.26 kg/d. TCE and benzene are chosen as the DNAPL and LNAPL components, respectively. The densities of TCE and benzene are set as 1,462 and 885 kg/m3, respectively. The study area was discretized into 6,222 gridblocks with 102 horizontal elements and 61 vertical model layers. Empirical values of model parameters are applied.
Figure 1

Flowchart of the study.

Figure 1

Flowchart of the study.

Close modal
Figure 2

Grid diagram for the two-dimensional cross-sectional model.

Figure 2

Grid diagram for the two-dimensional cross-sectional model.

Close modal
Figure 3

The absolute permeability (m2) distribution for the heterogeneous soil.

Figure 3

The absolute permeability (m2) distribution for the heterogeneous soil.

Close modal

Numerical simulation

TOUGH2-MP/TMVOC (Zhang et al. 2007), a parallel-processing version of TMVOC, is applied to conduct the simulation. The code is a numerical simulator for the three-phase non-isothermal flow of water, soil gas, and a multicomponent mixture of volatile organic chemicals (VOCs) in multidimensional heterogeneous porous media. The code is written in Fortran language. For related documents refer to this website: https://tough.lbl.gov/software/tough2-mp-software/. The basic mass and energy-balance equations that describe fluid and heat flow in general multiphase, multicomponent systems can be written in the following general form:
formula
(1)
where the integration is over an arbitrary subdomain of the flow system, which is bounded by the closed surface . The quantity appearing in the accumulation term (left-hand side) represents mass or energy per volume, with κ = 1, 2, 3 labeling the mass components (water, NCGs, VOCs) and κ = 4 the heat ‘component’. denotes mass or heat flux (see below), and denotes sinks and sources. is a normal vector on surface element , pointing inward into .

Machine learning methods

Multilayer perceptron method

The ANN can realize multidimensional and nonlinear precise mapping between the input and output layers. A multilayer perceptron (MLP) is a forward structured ANN consisting of an input layer, an output layer, and one or more hidden layers (Bishop 1995). Each layer of MLP contains one or more neurons, and these neurons have directional connections with neurons in the upper and lower layers. The basic idea of the MLP network algorithm is first to obtain the error through forwarding propagation, then use backpropagation to modify the weight, and finally obtain the optimal model. The algorithm usually uses stochastic gradient descent to adjust the weights in the backpropagation process. The principle of the gradient descent method is to calculate the gradient of the loss function for all internal variables and perform backpropagation. Common activation functions include the ReLU function, sigmoid function, and tanh function.

If there are m samples and n features in the experiment, the input layer is . Assuming that the MLP has only one hidden layer and the hidden layer has h neurons, the weight and deviation of the hidden layer can be expressed as and , respectively. There are q label values of the output layer, and the weight and deviation parameters of the output layer are and , respectively. We can calculate the output (H) of the hidden layer and the result (O) of the output layer by Equation (2). The MLP model has two hidden layers, each with 50 neurons in this paper:
formula
(2)

Support vector machine

Vapnik (1995) proposed the support vector machine (SVM) algorithm for classification problems and then gradually extended it to regression problems (Vapnik 1998). The SVM algorithm has been an excellent machine learning algorithm for classification and regression problems with continuous improvement. SVM can effectively solve practical problems with small samples, nonlinearities, high dimensions, and local minima based on optimization methods, statistical theories, and structural risk minimization principles. Compared with other machine learning algorithms, SVM has good generalization ability and avoids the overfitting problem during the training process. In recent years, it has been successfully used in hydrological forecasting and anomaly detection. The main calculation process of this method is as follows:

  • (1)

    Take the selected feature variables, X = {x1, x2, …, xn}, as the input variables of the model. A nonlinear regression model is used to construct the relationship with the target variable, Y = {y1, y2, …, yn}.

  • (2)

    For the sample set {(x1, y1),…, (xi, yi), …, (xn, yn)}, use nonlinear mapping Ø to map the sample set to the high-dimensional feature space to obtain the mapped sample Ø(xi).

  • (3)

    Construct the nonlinear function (x) as a mathematical model for predicting y.

Gradient boosting regression

The gradient boosting regression (GBR) algorithm is a representative boosting algorithm that uses all samples to train the model during each round of training and updates the weight of the sample in each round of training. The GBR model uses classification and regression decision tree (CART) as a single model (Friedman 2001) to process multiple data types and has the advantages of low learning error and robustness to parameters.

Random forest

The random forest (RF) algorithm was first proposed by Breiman in 2001 (Breiman 2001). The RF combines multiple CARTs and mainly solves classification and regression problems (Chen et al. 2019). The bootstrap method is first used to randomly extract multiple samples from the original data for model training. The undrawn data are used as out-of-bag data to test the algorithm's accuracy. Subsequently, it constructs the decision tree for each training sample and integrates all sample decision trees into the RF model. The main parameters in the RF model include the maximum number of features allowed to try in individual trees (max_features) and the number of trees to build before taking the maximum voting or averages of predictions (n_estimators).

The RF result is the average of the predicted results of each decision tree, as shown in Equation (3):
formula
(3)
where represents the number of decision trees; represents the explanatory variable; is the regression function between the predictor and the predictor variable.

Grid search method for parameter optimization

Grid search is essentially an optimization algorithm that selects the best parameters for the optimization problem from a list of parameter options, hence automating the ‘trail-and-error’ method. The grid search method is most popularly known in machine learning to obtain the parameters with which the model provides the best accuracy. Additionally, the number of evaluations increases exponentially as the parameter increases, which requires much time to train the model.

Evaluation metrics

To evaluate the effectiveness of the four models (MLP, SVM, GBR, and RF), three indices, including the mean absolute error (MAE), root mean square error (RMSE) and Nash-Sutcliffe efficiency coefficient (NSE), are used to evaluate the performance of the four surrogate models, and the algorithms can be seen in Equations (4)–(6). The models are constructed by using Python scikit-learn package:
formula
(4)
formula
(5)
formula
(6)
where and represent the observed and predicted total mass fractions in aqueous phase of sample i, respectively. represents the mean value of , and n is the number of time series.

Numerical modeling results

The Newton–Raphson method is used to solve Equation (1). The total simulation period is set as 10 years. The model predictions in 10 years of the point source scenario (Figure 4) suggest that TCE gradually migrates to the bottom of the formation driven by gravity action since the leakage, then accumulates at the bottom of the stratum and migrates horizontally over time. Simultaneously, TCE plumes diffuse laterally during migration due to the heterogeneity of underground media.
Figure 4

Distribution of (a) permeability, (b) NAPL saturation, (c) pressure, and (d) the total VOC mass fraction in groundwater from numerical solutions after 10 years of TCE leakage.

Figure 4

Distribution of (a) permeability, (b) NAPL saturation, (c) pressure, and (d) the total VOC mass fraction in groundwater from numerical solutions after 10 years of TCE leakage.

Close modal

Performance evaluation of four surrogate models

Scenarios set

Performance evaluation of four surrogate models (MLP, SVM, GBR, and RF) is conducted by comparing the results of two groups (Groups A and B), which represent the spatial distribution and temporal NAPL evolution results, respectively. As listed in Table 1, Groups A and B have 50 scenarios and 20 scenarios for 6,222 gridblocks, respectively. Group B has two input variables, permeability distribution and time section, whereas Group A has only permeability distribution at the end of simulation. The NAPL leakage type is a point for both Groups A and B.

Table 1

Two groups of scenarios for performance evaluation of surrogate models

GroupaComparisonInputsOutputsNumber of scenarios
Spatial distribution Permeability distribution after 10 years Total VOC mass fractions for each gridblock at 10 years 50 for 6,222 gridblocks (A1-A50) 
Temporal evolution Permeability distribution, time periods 20 for 6,222 gridblocks (B1-B20) 
GroupaComparisonInputsOutputsNumber of scenarios
Spatial distribution Permeability distribution after 10 years Total VOC mass fractions for each gridblock at 10 years 50 for 6,222 gridblocks (A1-A50) 
Temporal evolution Permeability distribution, time periods 20 for 6,222 gridblocks (B1-B20) 

aLeakage type is the point for both Groups A and B.

Spatial distribution

Forty-six scenarios (A1–A46) out of 50 in Group A are selected for model training, and the remaining four scenarios (AA1–AA4) are used for validation. The sample size for model training in each scenario is 6,222. Taking the numerical results as benchmarks, the comparisons from four models with the index of MAE, RMSE, and NSE are illustrated in Figure 5. The results indicate that the performances of the RF and GBR methods are better, with the MAE and RMSE smaller than 4.61 × 10−4 and 1.44 × 10−5, respectively. The worst performance is detected in the SVM with an MAE of 1.17 × 10−2. Concerning NSE, the best performance is observed using the RF method with the NSE index of 0.99.
Figure 5

Heatmap of different performance evaluation indexes for four surrogate models.

Figure 5

Heatmap of different performance evaluation indexes for four surrogate models.

Close modal
Using scenario AA4 as an example, Figure 6(A) and 6(B) illustrate a contour plot and profile at location 49.5 m for the total VOC mass fraction m in groundwater from the TMVOC model and four surrogate models, respectively. The MLP, GBR, and RF models can generally capture the shape of the pollution plume to a certain extent. However, the simulated plume from the MLP method is obviously lower than that from the GBR and RF models. The distributions of the total VOC mass fraction by the GBR and RF methods are highly consistent with the numerical results by TMVOC in terms of morphology and extent. The performance of the SVM model is the worst.
Figure 6

Comparison of distribution in the total VOC mass fraction in groundwater calculated by TMVOC and four surrogate models under scenario AA4: (A) spatial distribution and (B) profile at location 49.5 m.

Figure 6

Comparison of distribution in the total VOC mass fraction in groundwater calculated by TMVOC and four surrogate models under scenario AA4: (A) spatial distribution and (B) profile at location 49.5 m.

Close modal

For scenario AA4, it takes approximately 4.75 h for the TOUGH2-MP/TMVOC model (12 cores) for the simulation; however, the total time consumed in the calculation (Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz × 12, and 31.8 GiB memory) is 1,333, 11,081, 1,822, and 92 s for the MLP, GBR, SVM, and RF models, respectively. It is obvious that the calculation efficiency in the RF model is relatively high, of which the execution efficiency is approximately 178 times shorter than that of the numerical model.

Temporal evolution

Nineteen scenarios out of 20 in Group B, which represent 20 time sections with an interval of half a year over 10 years, are selected for model training, and the results at the end of the simulation (the 10th year) are used for verification. The MAE, RMSE, NSE and total time consumption for the four models are listed in Table 2. The best performance can be observed based on the RF model, with the minimum MAE (5.55 × 10−4) and RMSE (5.88 × 10−5), largest NSE (0.99), and the shortest time consumption (83 s)

Table 2

List of MAE, RMSE, NSE, and the total time consumption for four methods

MethodMAERMSENSEThe total time consumption (s)
MLP 4.34 × 10−3 5.87 × 10−4 −0.25 296 
SVM 7.77 × 10−3 2.98 × 10−4 −15.97 633 
GBR 1.10 × 10−3 3.05 × 10−4 −104.31 305 
RF 5.55 × 10−4 5.88 × 10−5 0.99 83 
MethodMAERMSENSEThe total time consumption (s)
MLP 4.34 × 10−3 5.87 × 10−4 −0.25 296 
SVM 7.77 × 10−3 2.98 × 10−4 −15.97 633 
GBR 1.10 × 10−3 3.05 × 10−4 −104.31 305 
RF 5.55 × 10−4 5.88 × 10−5 0.99 83 

Figure 7(A) and 7(B) show the spatial and profile distributions of the total VOC mass fraction in groundwater at the end of the simulation (10 years) using four methods, respectively. It can be observed that the distribution of the total VOC mass fraction from the RF model exhibits the same pattern as that from the numerical results, whereas the other three models have poor performance in capturing the shape of the plume. The performance of GBR model in simulating DNAPL temporal distribution is worse than that of RF model. It can only capture high concentration areas, but is difficult to simulate the complete distribution of pollution plume.
Figure 7

Distribution of the total VOC mass fraction in groundwater using four methods 10 years later. (A) Spatial distribution (B) Profile at location 49.5 m.

Figure 7

Distribution of the total VOC mass fraction in groundwater using four methods 10 years later. (A) Spatial distribution (B) Profile at location 49.5 m.

Close modal

The performance of each machine learning method in simulating multiphase flow depends on various factors such as the heterogeneity of medium, the leakage history, the choice of features, and the tuning of hyperparameters. From the comparison of both the spatial distribution and temporal evolution as well as the calculation efficiency, the RF model performs best among the four models and captures the tempo-spatial characteristics of the plume of the total VOC mass fraction, because the RF model constructed a large number of decision trees and then combined their results to make predictions. As studied in Sun et al. (2021a, 2021b, 2022), the RF model performs better in accurate groundwater level prediction, which is in accordance with our studies. Therefore, the RF model can act as a reasonable surrogate for the numerical models and then be further used to discuss the sensitivity of the model parameters and their efficiency in simulating NAPL transport in different phases.

Sensitivity of model parameters and the number of scenarios

The parameter max_features and n_estimators can significantly influence the results of the RF model. To analyze the influence of these two parameters, the simulation results are evaluated if different parameter combinations are adopted in terms of MAR, RMSE and the total time consumption in the calculation (Table 3). It can be clearly observed that the RF algorithm has high simulation accuracy and short calculation time consumption for the nine parameter combinations. The parameter n_estimator has very limited influences on the model results. The total time consumption in the calculation increases with the number of forest trees. The best parameter combination is max_features of ‘auto’ and n_estimators of 200, which is consistent with the optimization results from the grid search method.

Table 3

Evaluation of the models under nine parameter combinations

max_featuresn_estimatorsMAERMSENSEThe total time consumption (s)
'auto'a 50 4.33 × 10−4 7.01 × 10−6 0.99 60 
'sqrt'b 50 6.71 × 10−4 1.15 × 10−5 0.99 49 
'log2'c 50 4.62 × 10−4 1.05 × 10−5 0.99 49 
'auto' 100 4.38 × 10−4 6.89 × 10−6 0.99 86 
'sqrt' 100 4.16 × 10−4 8.87 × 10−6 0.99 66 
'log2' 100 4.98 × 10−4 9.31 × 10−6 0.99 66 
'auto' 200 4.25 × 10−4 6.88 × 10−6 0.99 138 
'sqrt' 200 4.25 × 10−4 8.38 × 10−6 0.99 97 
'log2' 200 4.14 × 10−4 8.66 × 10−6 0.99 98 
max_featuresn_estimatorsMAERMSENSEThe total time consumption (s)
'auto'a 50 4.33 × 10−4 7.01 × 10−6 0.99 60 
'sqrt'b 50 6.71 × 10−4 1.15 × 10−5 0.99 49 
'log2'c 50 4.62 × 10−4 1.05 × 10−5 0.99 49 
'auto' 100 4.38 × 10−4 6.89 × 10−6 0.99 86 
'sqrt' 100 4.16 × 10−4 8.87 × 10−6 0.99 66 
'log2' 100 4.98 × 10−4 9.31 × 10−6 0.99 66 
'auto' 200 4.25 × 10−4 6.88 × 10−6 0.99 138 
'sqrt' 200 4.25 × 10−4 8.38 × 10−6 0.99 97 
'log2' 200 4.14 × 10−4 8.66 × 10−6 0.99 98 

aConsidering all the features that make sense in every tree.

bUsing the square root of the total number of features in an individual run.

cAnother similar type of option for max_features.

To analyze the influence of the total number of scenarios on the model results, four conditions are considered, which are 50 scenarios (46 for the training and four for the verification), 38 scenarios (35 for the training and three for the verification), 25 scenarios (23 for the training and two for the verification) and 12 scenarios (11 for the training and one for the verification). Table 3 shows that the performance generally increases with the number of scenarios; however, the increase in performance is linearly dependent on the number of scenarios. The MAE for different numbers of scenarios is basically at the magnitude of 10−4, ranging from 3.26 × 10−4 to 7.93 × 10−4. The RMSE for the condition of 50 scenarios is the minimum, with a value of 6.89 × 10−6 ranging from 6.89 × 10−6 to 3.79 × 10−5. While the NSE is kept constant at 0.99.

Four additional groups (Groups C, D, E, and F) are designed to investigate the model efficiency in simulating the distribution of the VOC mass fraction in gaseous, aqueous and NAPL phases under spatial and temporal scales, linear leakage and LNAPL transport (Table 4).

Table 4

Scenario sets from the further model efficiency evaluation

GroupComparisonLeakage typeInput variablesOutput variablesNumber of scenarios
Spatial distribution Point Permeability distribution Mass fraction in gaseous, aqueous and NAPL phases 50 (C1–C50) 
Temporal evolution Permeability distribution, time section 20 (D1–D20) 
Leakage type Linear Permeability distribution Total VOC mass fraction 50 (E1–E50) 
GroupComparisonLeakage typeInput variablesOutput variablesNumber of scenarios
Spatial distribution Point Permeability distribution Mass fraction in gaseous, aqueous and NAPL phases 50 (C1–C50) 
Temporal evolution Permeability distribution, time section 20 (D1–D20) 
Leakage type Linear Permeability distribution Total VOC mass fraction 50 (E1–E50) 

Efficiency in the spatial distribution and temporal evolution of different phases of DNAPLs

The RF model supports the simulation with multiple factors, so the scenario of Group C is designed to investigate the distribution of the VOC mass fraction in the gaseous, aqueous and NAPL phases. The grid search method is also used to optimize the model parameters. Forty-six scenarios of distribution are used for model training, and the other four scenarios are used for validation. The simulated DNAPL distribution of mass fraction in three phases from the RF model matches well with numerical results with the largest MAE and RMSE of 0.3920 and 0.0124 in the NAPL phase, respectively. The scenario with the worst performance of the largest model error is chosen to examine the distribution of the VOC mass fraction in three phases at the end of the simulation. Figure 8 shows that the high concentration area of gaseous DNAPL is mainly located in the vadose zone, while the high concentration areas of aqueous and NAPL phase DNAPL are located in the aquifer below the leakage point. The distribution of gaseous, aqueous and NAPL phases can be accurately captured by the RF model, especially the high-concentration areas.
Figure 8

Distribution comparison of the VOC mass fraction after 10 years in the gas, aqueous, and NAPL phases by the RF model and its errors with the TMVOC model.

Figure 8

Distribution comparison of the VOC mass fraction after 10 years in the gas, aqueous, and NAPL phases by the RF model and its errors with the TMVOC model.

Close modal
Twenty scenarios with the same distribution of permeability in the interval of half a year over 10 years and different time sections are considered (Group D), where 18 out of 20 is chosen as the model train. The results from two validated time periods (9.5 and 10 years) do not show obvious differences in gaseous, aqueous and NAPL phases because they have the same order for MAE and RMSE for each phase for the two-time sections. Figure 9 demonstrates the comparison of the distribution of the mass fraction in gaseous, aqueous and NAPL phases over 10 years by the RF model and TMVOC. The results indicate that the shape of the pollution plume simulated by the RF model is relatively accurate and slightly smaller than that of TMVOC in the horizontal direction.
Figure 9

Distribution comparison of the VOC mass fraction after 10 years in gaseous, aqueous, and NAPL phases by the RF model and its errors with the TMVOC model.

Figure 9

Distribution comparison of the VOC mass fraction after 10 years in gaseous, aqueous, and NAPL phases by the RF model and its errors with the TMVOC model.

Close modal

Efficiency under the linear DNAPL leakage scenario

In the real contaminated site, there may be linear leakage or multipoint leakage in the pipelines, and thus, the scenario of linear leakage (Group E) is considered. Forty-six out of 50 scenarios simulated by TMVOC are used for the model training, and the rest are used for the validation. The comparison shows that the RF model performs better with MAE and RMSE less than 1.10 × 10−3 and 2.09 × 10−4, respectively. Compared with the model error in Group A, it can be concluded that the MAE and RMSE with linear pollution sources (Group E) are larger, ranging from 3.65 × 10−4 to 1.10 × 10−3 and 1.09 × 10−5 to 2.09 × 10−4, respectively, which is mainly caused by the larger magnitude of the total VOC mass fraction.

Figure 10 shows the comparison of the RF and TMVOC models in four scenarios during the validation period. The RF model can accurately capture the spatial distribution of the pollution plume in cases EA1 and EA3. However, there are some differences in the capture of the priority flow in EA2 and EA4, which may be caused by the number of scenarios with relatively coarse mesh sizes. Generally, the RF model can reflect the preferential flow during the validation periods.
Figure 10

Distribution comparison of the total VOC mass fraction in groundwater by the RF model and its error with the TMVOC model.

Figure 10

Distribution comparison of the total VOC mass fraction in groundwater by the RF model and its error with the TMVOC model.

Close modal

Efficiency in LNAPL transport

Considering that in situ pollutant-contaminated sites are usually diverse, Group F is set to investigate the applicability of the RF model to LNAPL transport. The contaminant is assumed to be benzene, which leaks continuously underground at a constant rate of 0.26 kg/d. For the four validation scenarios, the MAE and RMSE vary from 0 to 0.0008 and 6.70 × 10−9 to 1.66 × 10−5, respectively, showing that the model performs better. Figure 11 shows the comparison of the VOC mass fraction in groundwater under four validation scenarios by the RF model and TMVOC. The RF model can accurately predict the spatial distribution of the VOC mass fraction of benzene in groundwater in heterogeneous media and accurately capture the high concentration area.
Figure 11

Distribution comparison of the total VOC mass fraction in groundwater under the four validation scenarios using the RF model and its errors with the TMVOC model.

Figure 11

Distribution comparison of the total VOC mass fraction in groundwater under the four validation scenarios using the RF model and its errors with the TMVOC model.

Close modal

Accurate estimation of VOC distribution across spatial and temporal scales is a challenge in the in situ remediation of groundwater contamination induced by NAPL leakage. NAPL migration involves complicated physical, chemical, and even biological functions, and thus, phase transfer is very complex. In this study, surrogate models are developed to overcome the difficulties encountered in numerical models, including the large dependence on physical parameters and large calculation burden, in in situ NAPL-contaminated sites, especially highly heterogeneous aquifers.

Fifty random permeability distributions are generated by the Monte Carlo method to represent strongly heterogeneous aquifers. TMVOC is applied to obtain the distribution of both TCE and benzene under the point and linear leakage sources over the model area at different time sections, where the results are regarded as benchmark data. The applicability of four surrogate models (MLP, SVM, GBR, and RF) in simulating DNAPL transport is examined over the spatiotemporal distributions. The RF model is found to have the best performance where it was then further used to validate its reliability in simulating multiphase and multicomponent transport. The main conclusions are as follows:

  • The DNAPL distributions from four surrogate models (MLP, SVM, GBR, and RF) are compared in detail at both spatial and temporal scales, and the RF method has the best performance with the NSE of 0.99, the maximum MAE and RMSE smaller than 5.55 × 10−4 and 5.88 × 10−5, respectively, and over 150 times faster than TMVOC.

  • Model parameters and the number of scenarios for model training have a certain influence on model results, and the influences are not obvious in this study; thus, the grid search method is a good choice to decide the optimal parameter combination.

  • The RF model can better characterize the spatiotemporal distribution of gaseous, aqueous and DNAPL phases with high accuracy and calculation efficiency in NAPL-contaminated sites. However, the accuracy of the DNAPL phase distribution is lower than that of gaseous and aqueous phases, which may be caused by the low resolution of grid sizes and smaller number of scenarios.

  • The RF method can accurately simulate the spatial distribution of the TCE plume under linear leakage pollution sources with MAE and RMSE less than 1.10 × 10−3 and 2.09 × 10−4, respectively.

  • The RF model can efficiently and accurately simulate the spatial distribution of benzene-contaminated plumes with an MAE of almost 0 and an RMSE of less than 6.70 × 10−9.

Overall, it should be concluded that in this study the high-accuracy RF model can reproduce the NAPL distribution with multiple phases across spatial and temporal scales and considerably reduce the computational burden for modeling NAPL transport, which will provide new technical support for multiphase flow simulation and is of great applicability to real NAPL-contaminated sites. It is suggested that the surrogate models should be oriented toward the real site conditions of the contaminant site for environmental management and design the appropriate number of scenarios and the mesh size to improve the model accuracy. Then, the established model can be used to quickly and effectively produce multiphase NAPL distributions across spatiotemporal scales.

According to the results of the case study, the RF model can serve as an efficient surrogate model for simulating spatiotemporal distribution of NAPL in heterogeneous media. Compared to LNAPL, DNAPL may migrate deeper and accumulate laterally along an aquitard. Therefore, the scope of investigation and treatment of DNAPL-contaminated sites should be increased. Additionally, local preferential flow of DNAPL may occur in heterogeneous media, leading to higher concentrations than expected, which should be paid attention.

Due to the limited availability of actual site measurement data, we relied on numerical results as the reference for our surrogate models to investigate the spatiotemporal evolution of pollutants. In future research, we plan to assess the performance of these surrogate models using actual site data to enhance the credibility of our analysis.

This study was supported by the National Natural Science Foundation of China (Grant numbers: U2167211 and 41877173). We thank the public uncertainty quantification tools from Prof. Qingyun Duan at Hohai University.

L.H. was involved in conceptualization, methodology, writing, reviewing, and editing; M.Z. was involved in writing theoriginal draft, model calculation, data collection and analysis, writing, reviewing, and editing; L.T. was involved in writing, reviewing, and editing; S.H. was involved in writing, reviewing, editing.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Akbarnejad-Nesheli
S.
,
Haddad
O. B.
&
Loáiciga
H. A.
2016
Optimal in situ bioremediation design of groundwater contaminated with dissolved petroleum hydrocarbons
.
Journal of Hazardous, Toxic, and Radioactive Waste
20
(
2
),
04015021
.
Arabgol
R.
,
Sartaj
M.
&
Asghari
K.
2016
Predicting nitrate concentration and its spatial distribution in groundwater resources using Support Vector Machines (SVMs) Model
.
Environmental Modeling & Assessment
21
(
1
),
71
82
.
https://doi.org/10.1007/s10666-015-9468-0
.
Bedi
S.
,
Samal
A.
,
Ray
C.
&
Snow
D.
2020
Comparative evaluation of machine learning models for groundwater quality assessment
.
Environmental Monitoring and Assessment
192
(
12
),
776
776
.
https://doi.org/10.1007/s10661-020-08695-3
.
Bishop
M. C.
1995
Neural Networks for Pattern Recognition
.
Oxford University Press
,
New York, USA
.
Boggs
J. M.
,
Young
S. C.
,
Beard
L. M.
,
Gelhar
L. W.
,
Rehfeldt
K. R.
&
Adams
E. E.
1992
Field study of dispersion in a heterogeneous aquifer: 1. Overview and site description
.
Water Resources Research
28
(
12
),
3281
3291
.
https://doi.org/10.1029/92WR01756
.
Breiman
L.
2001
Random forests
.
Machine Learning
45
,
5
32
.
https://doi.org/10.1023/A:1010933404324
.
Chakraborty
M.
,
Sarkar
S.
,
Mukherjee
A.
,
Shamsudduha
M.
,
Ahmed
K. M.
,
Bhattacharya
A.
&
Mitra
A.
2020
Modeling regional-scale groundwater arsenic hazard in the transboundary Ganges River Delta, India and Bangladesh: Infusing physically-based model with supervised machine learning
.
Science of the Total Environment
748
(
14
),
141107
.
https://doi.org/10.1016/j.scitotenv.2020.141107
.
Chen
L.
,
He
Q.
,
Liu
K.
,
Li
J.
&
Jing
C.
2019
Downscaling of GRACE-derived groundwater storage based on the random forest model
.
Remote Sensing
11
,
2979
.
https://doi.org/10.3390/rs11242979
.
Chu
H. J.
&
Chang
L. C.
2009
Application of optimal control and fuzzy theory for dynamic groundwater remediation design
.
Water Resources Management
23
(
4
),
647
660
.
https://doi.org/10.1007/s11269-008-9293-1
.
Deng
Y. D.
,
Ye
X. Y.
&
Du
X. Q.
2023
Predictive modeling and analysis of key drivers of groundwater nitrate pollution based on machine learning
.
Journal of Hydrology
624
,
129934
.
https://doi.org/10.1016/j.jhydrol.2023.129934
.
Desimone
L. A.
,
Pope
J. P.
&
Ransom
K. M.
2020
Machine-learning models to map pH and redox conditions in groundwater in a layered aquifer system, Northern Atlantic Coastal Plain, eastern USA
.
Journal of Hydrology: Regional Studies
30
,
100697
.
https://doi.org/10.1016/j.ejrh.2020.100697
.
Friedman
J. H.
2001
Greedy function approximation: A gradient boosting machine
.
The Annals of Statistics
29
,
1189
1232
.
https://doi.org/10.1214/AOS/1013203451
.
Hou
Z.
,
Lu
W.
,
Chu
H.
&
Luo
J.
2015
Selecting parameter-optimized surrogate models in DNAPL-contaminated aquifer remediation strategies
.
Environmental Engineering Science
32
(
12
),
1016
1026
.
https://doi.org/10.1089/ees.2015.0055
.
Huang
D. Y.
&
Jennifer
G.
2011
Dehalorespiration model that incorporates the self-inhibition and biomass inactivation effects of high tetrachloroethene concentrations
.
Environmental Science & Technology
45
(
3
),
1093
1099
.
https://doi.org/10.1021/ES102729v
.
Jia
X.
,
O'Connor
D.
,
Hou
D.
,
Jin
Y.
,
Li
G.
,
Zheng
C.
,
Ok
Y. S.
,
Daniel
C. W.
&
Luo
J.
2019
Groundwater depletion and contamination: Spatial distribution of groundwater resources sustainability in China
.
Science of the Total Environment
672
,
551
562
.
https://doi.org/10.1016/j.scitotenv.2019.03.457
.
Kang
X.
,
Kokkinaki
A.
,
Power
C.
,
Kitanidis
P. K.
,
Shi
X.
,
Duan
L.
,
Liu
T.
&
Wu
J.
2021
Integrating deep learning-based data assimilation and hydrogeophysical data for improved monitoring of DNAPL source zones during remediation
.
Journal of Hydrology
(
11
),
126655
.
https://doi.org/10.1016/j.jhydrol.2021.126655
.
Khalil
A.
,
Almasri
M. N.
,
Mckee
M.
&
Kaluarachchi
J. J.
2005
Applicability of statistical learning algorithms in groundwater quality modeling
.
Water Resources Research
41
(
5
),
W5010
W5011
.
https://doi.org/10.1029/2004WR003608
.
Kholghi
M.
&
Hosseini
S. M.
2009
Comparison of groundwater level estimation using neuro-fuzzy and ordinary Kriging
.
Environmental Modeling & Assessment
14
(
6
),
729
737
.
https://doi.org/10.1007/s10666-008-9174-2
.
Kueper
B. H.
,
Abbott
W.
&
Farquhar
G.
1989
Experimental observations of multiphase flow in heterogeneous porous media
.
Journal of Contaminant Hydrology
5
(
1
),
83
95
.
https://doi.org/10.1016/0169-7722(89)90007-7
.
Kumar
D.
,
Prasad
R. K.
&
Mathur
S.
2013
Optimal design of an in-situ bioremediation system using support vector machine and particle swarm optimization
.
Journal of Contaminant Hydrology
151
,
105
116
.
Luo
J.
,
Lu
W.
,
Yang
Q.
,
Ji
Y.
&
Xin
X.
2020
An adaptive dynamic surrogate model using a constrained trust region algorithm: Application to DNAPL-contaminated-groundwater-remediation design
.
Hydrogeology Journal
28
(
4
),
1
14
.
https://doi.org/10.1007/s10040-020-02130-0
.
Malakar
P.
,
Sarkar
S.
,
Mukherjee
A.
,
Bhanja
S.
&
Sun
A. Y.
2021
Use of machine learning and deep learning methods in groundwater
.
Global Groundwater
40
(
1
),
545
557
.
Mo
S.
,
Zhu
Y.
,
Zabaras
N.
,
Shi
X.
&
Wu
J.
2019
Deep convolutional encoder-decoder networks for uncertainty quantification of dynamic multiphase flow in heterogeneous media
.
Water Resources Research
55
(
1
),
703
728
.
https://doi.org/10.1029/2018WR023528
.
Naghibi
S. A.
,
Hashemi
H.
,
Berndtsson
R.
&
Lee
S.
2020
Application of extreme gradient boosting and parallel random forest algorithms for assessing groundwater spring potential using DEM-derived factors
.
Journal of Hydrology
589
(
1
),
125197
.
https://doi.org/10.1016/j.jhydrol.2020.125197
.
Nozari
S.
,
Bailey
R. T.
,
Haacker
E. M. K.
,
Zambreski
Z. T.
,
Xiang
Z.
&
Lin
X.
2022
Employing machine learning to quantify long-term climatological and regulatory impacts on groundwater availability in intensively irrigated regions
.
Journal of Hydrology
614
,
128511
.
https://doi.org/10.1016/j.jhydrol.2022.128511
.
Pruess
K.
&
Battistelli
A.
2002
TMVOC, A Numerical Simulator for Three-Phase non-Isothermal Flows of Multicomponent Hydrocarbon Mixtures in Saturated-Unsaturated Heterogeneous Media
.
Lawrence Berkeley National Laboratory
,
Berkeley, LBNL-49375
.
Pruess
K.
,
Oldenburg
C.
&
Moridis
G.
2012
TOUGH2 User's Guide, Version 2.0
.
Lawrence Berkeley National Laboratory
,
Berkeley, LBNL-43134
.
Sahoo
S.
,
Russo
T. A.
,
Elliott
J.
&
Foster
I.
2017
Machine learning algorithms for modeling groundwater level changes in agricultural regions of the U.S
.
Water Resources Research
53
(
5
),
3878
3895
.
https://doi.org/10.1002/2016WR019933
.
Seifi
A.
&
Riahi-Madvar
H.
2019
Improving one-dimensional pollution dispersion modeling in rivers using ANFIS and ANN-based GA optimized models
.
Environmental Science Pollution Research
26
,
867
885
.
https://doi.org/10.1007/s11356-018-3613-7
.
Shieh
H. J.
&
Peralta
R. C.
2005
Optimal in situ bioremediation design by hybrid genetic algorithm-simulated annealing
.
Journal of Water Resources Planning and Management
131
(
1
),
67
78
.
Srivastava
D.
&
Singh
R. M.
2015
Groundwater system modeling for simultaneous identification of pollution sources and parameters with uncertainty characterization
.
Water Resources Management
29
(
13
),
4607
4627
.
https://doi.org/10.1007/s11269-015-1078-8
.
Sun
A. Y.
&
Scanlon
B. R.
2019
How can big data and machine learning benefit environment and water management: A survey of methods, applications, and future directions
.
Environmental Research Letters
14
(
7
),
1
48
.
https://doi.org/10.1088/1748-9326/ab1b7d
.
Sun
A. Y.
,
Jiang
P.
,
Mudunuru
M. K.
&
Chen
X.
2021a
Explore spatio-temporal learning of large sample hydrology using graph neural networks
.
Water Resources Research
57
(
12
).
https://doi.org/10.1029/2021WR030394
.
Sun
K. N.
,
Hu
L. T.
,
Guo
J. L.
,
Yang
Z. Q.
,
Zhai
Y. Z.
&
Zhang
S. Q.
2021b
Enhancing the understanding of hydrological responses induced by ecological water replenishment using improved machine learning models: A case study in the Yongding River
.
Science of the Total Environment
768
(
10
),
145489
.
Sun
J. C.
,
Hu
L. T.
,
Li
D. D.
,
Sun
K. N.
&
Yang
Z. Q.
2022
Data-driven models for accurate groundwater level prediction and their practical significance in groundwater management
.
Journal of Hydrology
608
,
127630
.
https://doi.org/10.1016/j.jhydrol.2022.127630
.
Vapnik
V.
1995
The Natural of Statistical Learning Theory
.
Springer
,
New York, NY
,
USA
.
Vapnik
V.
1998
Statistical Learning Theory
.
Wiley
,
New York, NY
,
USA
.
Zhang
K.
,
Yamamoto
H.
&
Pruess
K.
2007
TMVOC-MP: A Parallel Numerical Simulator for Three-Phase non-Isothermal Flows of Multicomponent Hydrocarbon Mixtures in Porous/Fractured Media
.
Lawrence Berkeley National Laboratory Report, LBNL-63827
.
https://doi.org/10.2172/925544
.
Zhang
G.
,
Wu
L.
,
Jiao
K.
,
Tian
P.
,
Wang
B.
,
Wang
Y.
&
Liu
Z.
2020
Optimization of porous media flow field for proton exchange membrane fuel cell using a data-driven surrogate model
.
Energy Conversion and Management
226
,
113513
.
https://doi.org/10.1016/j.enconman.2020.113513
.
Zhong
Z.
,
Sun
A. Y.
&
Jeong
H.
2019
Predicting CO2 plume migration in heterogeneous formations using conditional deep convolutional generative adversarial network
.
Water Resources Research
55
(
7
),
5830
5851
.
https://doi.org/10.1029/2018WR024592
.
Zounemat-Kermnai
M.
,
Batelaan
O.
,
Fadaee
M.
&
Hinkelmann
R.
2021
Ensemble machine learning paradigms in hydrology: A review
.
Journal of Hydrology
598
,
126266
.
https://doi.org/10.1016/j.jhydrol.2021.126266
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).