## Abstract

A better understanding of the distribution of nonaqueous phase liquid (NAPL) plumes is of great importance to groundwater pollution remediation and control. However, the efficiency of surrogate models in simulating the transport is still not well addressed. Selecting a leakage problem as an example, 50 sets of random permeability distributions are generated using the Monte Carlo method, and a numerical model is used to obtain benchmark data of NAPL transport. Four machine learning methods are employed to simulate dense NAPL transport under point leakage sources across spatiotemporal scales. The validation of the models demonstrate that the random forest model can also effectively capture the spatial-temporal distribution of the plume in heterogeneous aquifers, with a maximum mean absolute error and root mean square error smaller than 5.55 × 10^{−4} and 5.88 × 10^{−5}, respectively. Meanwhile, the multiple phase outcome from the random forest model fits well with the numerical results under the scenarios of linear leakage sources and light NAPL transport. The total time consumed in the computation is reduced by over 150 times after using the surrogate models. The results suggest that surrogate models can provide a promising way to understand NAPL transport in heterogeneous aquifers.

## HIGHLIGHTS

Numerical solutions provided the NAPL transport benchmark dataset under different scenarios.

Four ML models were trained to capture DNAPL transport across the spatial-temporal scales.

The RF model had the best performance in reproducing multiphase transport in terms of both the accuracy and calculation efficiency.

Surrogate models provided promising applicability to quickly and effectively evaluate NAPL transport.

## INTRODUCTION

Groundwater provides a large share of the world's water demand for various sections, especially drinking water. However, under recent intense anthropogenic activities seeking rapid economic and social development, the safety of groundwater resources has been increasingly threatened (Jia *et al.* 2019). At many chemical sites, the accidental release of underground nonaqueous phase liquids (NAPLs) is the greatest threat to drinking groundwater and has received increasing attention (Kang *et al.* 2021). NAPL pollutants, especially dense nonaqueous phase liquids (DNAPLs), can cause serious harm to human health after entering soil and groundwater. For example, trichloroethylene (TCE) can cause cancer, teratogenesis, and mutagenesis (Huang & Jennifer 2011). NAPL has been approved by the US Environmental Protection Agency and is listed as a priority control pollutant. Compared with light nonaqueous phase liquid (LNAPL), DNAPL has a higher density, lower interfacial tension and more complex migration behavior. Due to the difficulty of detection, wide distribution in chemical companies and their complex migration behavior, NAPL has become an important source of pollution in soil and groundwater.

When entering the underground medium, DNAPL tends to move down through the vadose zone and accumulate in the lower confining bed due to its high density and insolubility in water. Additionally, the heterogeneity of underground media makes DNAPL transport with different phases much more complicated in contaminated sites. As an important means to analyze NAPL migration and transformation in soil and groundwater, numerical models can accurately describe the migration process and mass transfer mechanism of pollutants through strict mathematical equations, which have been identified by many laboratory tests (Kueper *et al.* 1989) and field tests (Boggs *et al.* 1992; Aly & Peralta 1999; Akbarnejad-Nesheli *et al.* 2016). TOUGH2/TMVOC (Pruess & Battistelli 2002), a numerical simulator for multiphase non-isothermal flows of multicomponent hydrocarbon mixtures, can accurately simulate NAPL migration and transformation in saturated-unsaturated heterogeneous media. TOUGH2-MP/TMVOC is a massively parallel (MP) version of TOUGH2/TMVOC (Zhang *et al.* 2007). It uses the METIS software package for grid partitioning and AZTEC linear-equation solver to solve large simulation problems that may not be solved by the standard, single-CPU TOUGH2/TMVOC code. The parallel code has been successfully applied from multi-core PCs to supercomputers on real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport.

Usually, reasonable physical parameter estimation, time consumption and necessary knowledge of model exercises from model construction and simulation to analysis disable environmental managers to effectively apply numerical models to understand NAPL migration processes. In response to this problem, some scholars have sought surrogate models for the simulation of multiphase flow (Shieh & Peralta 2005; Kumar *et al.* 2013; Yadav *et al.* 2016; Mo *et al.* 2019; Luo *et al.* 2020), which are approximate substitutes for numerical models. With high accuracy, the relationship between input variables and output variables is simulated with a small calculation burden and can be used to quickly investigate the NAPL distributions over the spatial and temporal scales. The machine learning method, which seeks the relation between the data while ignoring the physical principles, has been developed rapidly and has been widely utilized in groundwater-related fields in recent years (Nozari *et al.* 2022). Data-driven methods have achieved reasonable results in many case studies related to the quantity and quality of groundwater (Sahoo *et al.* 2017; Sun & Scanlon 2019; Bedi *et al.* 2020; Chakraborty *et al.* 2020; Desimone *et al.* 2020; Malakar *et al.* 2021; Sun *et al.* 2021a, 2021b; Zounemat-Kermnai *et al.* 2021; Deng *et al.* 2023). The commonly used machine learning methods include support vector machines (Khalil *et al.* 2005; Arabgol *et al.* 2016), artificial neural networks (ANNs) (Rao 2006; Srivastava & Singh 2015; Seifi & Riahi-Madvar 2019), the RF method (Naghibi *et al.* 2020), the kriging model (Kholghi & Hosseini 2009), and fuzzy theory (Chu & Chang 2009). Unfortunately, there are few studies on the application of machine learning in the field of multiphase flow. Zhong *et al.* (2019) used the generative adversarial nets (GAN) model to establish a multiphase flow surrogate model to simulate the CO_{2} pollution plume in heterogeneous media. Hou *et al.* (2015) established a surrogate model of a multiple phase flow model (UTCHEM) using support vector regression (SVR). With the optimization of machine learning algorithms, the application of machine learning in underground multiphase flow will become increasingly reliable (Zhang *et al.* 2020). Deep learning models (such as convolutional neural networks and long- and short-term memory models) have also demonstrated their potential. Mo *et al.* (2019) built a surrogate model for the quantification of the uncertainty of multiphase flow based on a deep convolutional codec network and proposed a two-stage training strategy to improve the approximation of saturation discontinuities, accurately describing the spatiotemporal evolution of pressure and discontinuous CO_{2} saturation fields.

The analysis of NAPL transport using surrogate models has been less focused, especially on NAPL transport in different phases and different leakage mode. Under this situation, the objective of this study is to explore the effectiveness of surrogate models in simulating NAPL transport across spatial and temporal scales. First, the TOUGH2-MP/TMVOC model was employed to obtain the benchmark data for the distribution of NAPL under both point and linear leakage conditions in heterogeneous aquifers. Then, four surrogate models, MLP, SVM, GBR, and RF, were trained and validated to capture the detailed features of DNAPL transport with point leakage pollution releases. Next, a comprehensive comparison of the outcomes generated by the four surrogate models was conducted to identify the optimal model, and the sensitivities of the model parameters and number of scenarios on the model accuracy were analyzed. Finally, the identified optimal model was used to assess the model efficiency in depicting multiphase distributions, DNAPL transport with linear leakage pollution sources, and LNAPL transport. The results can provide reference the selection of remediation strategies for organic contamination in heterogeneous aquifers.

## METHODOLOGY

### Hypothetical model

*et al.*(2012). The top and bottom boundaries of the model are set as the Dirichlet boundary with constant atmospheric pressure and no flow boundary, respectively. The pressures at the left and right boundaries remain constant. The depth to the water table is set as 5 m on the left and right sides of the model. The impact of temperature changes is not considered, and the ambient temperature is set as 20 °C. It is assumed that the precipitation infiltrates uniformly underground with an average annual rate of 0.10 m/year through the ground. Two leakage scenarios are considered, including point and linear pollution sources. The point leakage is almost located at the center and top of the study area, and the linear leakage is set along the horizontal direction (Figure 2). The pollutant is assumed to continuously leak underground at a constant rate of 0.26 kg/d. TCE and benzene are chosen as the DNAPL and LNAPL components, respectively. The densities of TCE and benzene are set as 1,462 and 885 kg/m

^{3}, respectively. The study area was discretized into 6,222 gridblocks with 102 horizontal elements and 61 vertical model layers. Empirical values of model parameters are applied.

### Numerical simulation

*et al.*2007), a parallel-processing version of TMVOC, is applied to conduct the simulation. The code is a numerical simulator for the three-phase non-isothermal flow of water, soil gas, and a multicomponent mixture of volatile organic chemicals (VOCs) in multidimensional heterogeneous porous media. The code is written in Fortran language. For related documents refer to this website: https://tough.lbl.gov/software/tough2-mp-software/. The basic mass and energy-balance equations that describe fluid and heat flow in general multiphase, multicomponent systems can be written in the following general form:where the integration is over an arbitrary subdomain of the flow system, which is bounded by the closed surface . The quantity appearing in the accumulation term (left-hand side) represents mass or energy per volume, with

*κ*= 1, 2, 3 labeling the mass components (water, NCGs, VOCs) and

*κ*= 4 the heat ‘component’. denotes mass or heat flux (see below), and denotes sinks and sources. is a normal vector on surface element , pointing inward into .

### Machine learning methods

#### Multilayer perceptron method

The ANN can realize multidimensional and nonlinear precise mapping between the input and output layers. A multilayer perceptron (MLP) is a forward structured ANN consisting of an input layer, an output layer, and one or more hidden layers (Bishop 1995). Each layer of MLP contains one or more neurons, and these neurons have directional connections with neurons in the upper and lower layers. The basic idea of the MLP network algorithm is first to obtain the error through forwarding propagation, then use backpropagation to modify the weight, and finally obtain the optimal model. The algorithm usually uses stochastic gradient descent to adjust the weights in the backpropagation process. The principle of the gradient descent method is to calculate the gradient of the loss function for all internal variables and perform backpropagation. Common activation functions include the ReLU function, sigmoid function, and tanh function.

*m*samples and

*n*features in the experiment, the input layer is . Assuming that the MLP has only one hidden layer and the hidden layer has

*h*neurons, the weight and deviation of the hidden layer can be expressed as and , respectively. There are

*q*label values of the output layer, and the weight and deviation parameters of the output layer are and , respectively. We can calculate the output (

*H*) of the hidden layer and the result (

*O*) of the output layer by Equation (2). The MLP model has two hidden layers, each with 50 neurons in this paper:

#### Support vector machine

Vapnik (1995) proposed the support vector machine (SVM) algorithm for classification problems and then gradually extended it to regression problems (Vapnik 1998). The SVM algorithm has been an excellent machine learning algorithm for classification and regression problems with continuous improvement. SVM can effectively solve practical problems with small samples, nonlinearities, high dimensions, and local minima based on optimization methods, statistical theories, and structural risk minimization principles. Compared with other machine learning algorithms, SVM has good generalization ability and avoids the overfitting problem during the training process. In recent years, it has been successfully used in hydrological forecasting and anomaly detection. The main calculation process of this method is as follows:

- (1)
Take the selected feature variables,

*X*= {*x*,_{1}*x*, …,_{2}*x*}, as the input variables of the model. A nonlinear regression model is used to construct the relationship with the target variable,_{n}*Y*= {*y*,_{1}*y*, …,_{2}*y*}._{n} - (2)
For the sample set {(

*x*,_{1}*y*),…, (_{1}*x*,_{i}*y*), …, (_{i}*x*,_{n}*y*)}, use nonlinear mapping Ø to map the sample set to the high-dimensional feature space to obtain the mapped sample Ø(_{n}*x*)._{i} - (3)
Construct the nonlinear function (

*x*) as a mathematical model for predicting*y*.

#### Gradient boosting regression

The gradient boosting regression (GBR) algorithm is a representative boosting algorithm that uses all samples to train the model during each round of training and updates the weight of the sample in each round of training. The GBR model uses classification and regression decision tree (CART) as a single model (Friedman 2001) to process multiple data types and has the advantages of low learning error and robustness to parameters.

#### Random forest

The random forest (RF) algorithm was first proposed by Breiman in 2001 (Breiman 2001). The RF combines multiple CARTs and mainly solves classification and regression problems (Chen *et al.* 2019). The bootstrap method is first used to randomly extract multiple samples from the original data for model training. The undrawn data are used as out-of-bag data to test the algorithm's accuracy. Subsequently, it constructs the decision tree for each training sample and integrates all sample decision trees into the RF model. The main parameters in the RF model include the maximum number of features allowed to try in individual trees (*max_features*) and the number of trees to build before taking the maximum voting or averages of predictions (*n_estimators*).

#### Grid search method for parameter optimization

Grid search is essentially an optimization algorithm that selects the best parameters for the optimization problem from a list of parameter options, hence automating the ‘trail-and-error’ method. The grid search method is most popularly known in machine learning to obtain the parameters with which the model provides the best accuracy. Additionally, the number of evaluations increases exponentially as the parameter increases, which requires much time to train the model.

#### Evaluation metrics

*i*, respectively. represents the mean value of , and

*n*is the number of time series.

## MODEL RESULTS

### Numerical modeling results

### Performance evaluation of four surrogate models

#### Scenarios set

Performance evaluation of four surrogate models (MLP, SVM, GBR, and RF) is conducted by comparing the results of two groups (Groups A and B), which represent the spatial distribution and temporal NAPL evolution results, respectively. As listed in Table 1, Groups A and B have 50 scenarios and 20 scenarios for 6,222 gridblocks, respectively. Group B has two input variables, permeability distribution and time section, whereas Group A has only permeability distribution at the end of simulation. The NAPL leakage type is a point for both Groups A and B.

Group^{a}
. | Comparison . | Inputs . | Outputs . | Number of scenarios . |
---|---|---|---|---|

A | Spatial distribution | Permeability distribution after 10 years | Total VOC mass fractions for each gridblock at 10 years | 50 for 6,222 gridblocks (A1-A50) |

B | Temporal evolution | Permeability distribution, time periods | 20 for 6,222 gridblocks (B1-B20) |

Group^{a}
. | Comparison . | Inputs . | Outputs . | Number of scenarios . |
---|---|---|---|---|

A | Spatial distribution | Permeability distribution after 10 years | Total VOC mass fractions for each gridblock at 10 years | 50 for 6,222 gridblocks (A1-A50) |

B | Temporal evolution | Permeability distribution, time periods | 20 for 6,222 gridblocks (B1-B20) |

^{a}Leakage type is the point for both Groups A and B.

#### Spatial distribution

^{−4}and 1.44 × 10

^{−5}, respectively. The worst performance is detected in the SVM with an MAE of 1.17 × 10

^{−2}. Concerning NSE, the best performance is observed using the RF method with the NSE index of 0.99.

For scenario AA4, it takes approximately 4.75 h for the TOUGH2-MP/TMVOC model (12 cores) for the simulation; however, the total time consumed in the calculation (Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz × 12, and 31.8 GiB memory) is 1,333, 11,081, 1,822, and 92 s for the MLP, GBR, SVM, and RF models, respectively. It is obvious that the calculation efficiency in the RF model is relatively high, of which the execution efficiency is approximately 178 times shorter than that of the numerical model.

#### Temporal evolution

Nineteen scenarios out of 20 in Group B, which represent 20 time sections with an interval of half a year over 10 years, are selected for model training, and the results at the end of the simulation (the 10th year) are used for verification. The MAE, RMSE, NSE and total time consumption for the four models are listed in Table 2. The best performance can be observed based on the RF model, with the minimum MAE (5.55 × 10^{−4}) and RMSE (5.88 × 10^{−5}), largest NSE (0.99), and the shortest time consumption (83 s)

Method . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|

MLP | 4.34 × 10^{−3} | 5.87 × 10^{−4} | −0.25 | 296 |

SVM | 7.77 × 10^{−3} | 2.98 × 10^{−4} | −15.97 | 633 |

GBR | 1.10 × 10^{−3} | 3.05 × 10^{−4} | −104.31 | 305 |

RF | 5.55 × 10^{−4} | 5.88 × 10^{−5} | 0.99 | 83 |

Method . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|

MLP | 4.34 × 10^{−3} | 5.87 × 10^{−4} | −0.25 | 296 |

SVM | 7.77 × 10^{−3} | 2.98 × 10^{−4} | −15.97 | 633 |

GBR | 1.10 × 10^{−3} | 3.05 × 10^{−4} | −104.31 | 305 |

RF | 5.55 × 10^{−4} | 5.88 × 10^{−5} | 0.99 | 83 |

The performance of each machine learning method in simulating multiphase flow depends on various factors such as the heterogeneity of medium, the leakage history, the choice of features, and the tuning of hyperparameters. From the comparison of both the spatial distribution and temporal evolution as well as the calculation efficiency, the RF model performs best among the four models and captures the tempo-spatial characteristics of the plume of the total VOC mass fraction, because the RF model constructed a large number of decision trees and then combined their results to make predictions. As studied in Sun *et al.* (2021a, 2021b, 2022), the RF model performs better in accurate groundwater level prediction, which is in accordance with our studies. Therefore, the RF model can act as a reasonable surrogate for the numerical models and then be further used to discuss the sensitivity of the model parameters and their efficiency in simulating NAPL transport in different phases.

#### Sensitivity of model parameters and the number of scenarios

The parameter *max_features* and *n_estimators* can significantly influence the results of the RF model. To analyze the influence of these two parameters, the simulation results are evaluated if different parameter combinations are adopted in terms of MAR, RMSE and the total time consumption in the calculation (Table 3). It can be clearly observed that the RF algorithm has high simulation accuracy and short calculation time consumption for the nine parameter combinations. The parameter *n_estimator* has very limited influences on the model results. The total time consumption in the calculation increases with the number of forest trees. The best parameter combination is *max_features* of ‘auto’ and *n_estimators* of 200, which is consistent with the optimization results from the grid search method.

max_features . | n_estimators . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|---|

'auto'^{a} | 50 | 4.33 × 10^{−4} | 7.01 × 10^{−6} | 0.99 | 60 |

'sqrt'^{b} | 50 | 6.71 × 10^{−4} | 1.15 × 10^{−5} | 0.99 | 49 |

'log2'^{c} | 50 | 4.62 × 10^{−4} | 1.05 × 10^{−5} | 0.99 | 49 |

'auto' | 100 | 4.38 × 10^{−4} | 6.89 × 10^{−6} | 0.99 | 86 |

'sqrt' | 100 | 4.16 × 10^{−4} | 8.87 × 10^{−6} | 0.99 | 66 |

'log2' | 100 | 4.98 × 10^{−4} | 9.31 × 10^{−6} | 0.99 | 66 |

'auto' | 200 | 4.25 × 10^{−4} | 6.88 × 10^{−6} | 0.99 | 138 |

'sqrt' | 200 | 4.25 × 10^{−4} | 8.38 × 10^{−6} | 0.99 | 97 |

'log2' | 200 | 4.14 × 10^{−4} | 8.66 × 10^{−6} | 0.99 | 98 |

max_features . | n_estimators . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|---|

'auto'^{a} | 50 | 4.33 × 10^{−4} | 7.01 × 10^{−6} | 0.99 | 60 |

'sqrt'^{b} | 50 | 6.71 × 10^{−4} | 1.15 × 10^{−5} | 0.99 | 49 |

'log2'^{c} | 50 | 4.62 × 10^{−4} | 1.05 × 10^{−5} | 0.99 | 49 |

'auto' | 100 | 4.38 × 10^{−4} | 6.89 × 10^{−6} | 0.99 | 86 |

'sqrt' | 100 | 4.16 × 10^{−4} | 8.87 × 10^{−6} | 0.99 | 66 |

'log2' | 100 | 4.98 × 10^{−4} | 9.31 × 10^{−6} | 0.99 | 66 |

'auto' | 200 | 4.25 × 10^{−4} | 6.88 × 10^{−6} | 0.99 | 138 |

'sqrt' | 200 | 4.25 × 10^{−4} | 8.38 × 10^{−6} | 0.99 | 97 |

'log2' | 200 | 4.14 × 10^{−4} | 8.66 × 10^{−6} | 0.99 | 98 |

^{a}Considering all the features that make sense in every tree.

^{b}Using the square root of the total number of features in an individual run.

^{c}Another similar type of option for max_features.

To analyze the influence of the total number of scenarios on the model results, four conditions are considered, which are 50 scenarios (46 for the training and four for the verification), 38 scenarios (35 for the training and three for the verification), 25 scenarios (23 for the training and two for the verification) and 12 scenarios (11 for the training and one for the verification). Table 3 shows that the performance generally increases with the number of scenarios; however, the increase in performance is linearly dependent on the number of scenarios. The MAE for different numbers of scenarios is basically at the magnitude of 10^{−4}, ranging from 3.26 × 10^{−4} to 7.93 × 10^{−4}. The RMSE for the condition of 50 scenarios is the minimum, with a value of 6.89 × 10^{−6} ranging from 6.89 × 10^{−6} to 3.79 × 10^{−5}. While the NSE is kept constant at 0.99.

## FURTHER DISCUSSION OF MODEL EFFICIENCY

Four additional groups (Groups C, D, E, and F) are designed to investigate the model efficiency in simulating the distribution of the VOC mass fraction in gaseous, aqueous and NAPL phases under spatial and temporal scales, linear leakage and LNAPL transport (Table 4).

Group . | Comparison . | Leakage type . | Input variables . | Output variables . | Number of scenarios . |
---|---|---|---|---|---|

C | Spatial distribution | Point | Permeability distribution | Mass fraction in gaseous, aqueous and NAPL phases | 50 (C1–C50) |

D | Temporal evolution | Permeability distribution, time section | 20 (D1–D20) | ||

E | Leakage type | Linear | Permeability distribution | Total VOC mass fraction | 50 (E1–E50) |

Group . | Comparison . | Leakage type . | Input variables . | Output variables . | Number of scenarios . |
---|---|---|---|---|---|

C | Spatial distribution | Point | Permeability distribution | Mass fraction in gaseous, aqueous and NAPL phases | 50 (C1–C50) |

D | Temporal evolution | Permeability distribution, time section | 20 (D1–D20) | ||

E | Leakage type | Linear | Permeability distribution | Total VOC mass fraction | 50 (E1–E50) |

### Efficiency in the spatial distribution and temporal evolution of different phases of DNAPLs

### Efficiency under the linear DNAPL leakage scenario

In the real contaminated site, there may be linear leakage or multipoint leakage in the pipelines, and thus, the scenario of linear leakage (Group E) is considered. Forty-six out of 50 scenarios simulated by TMVOC are used for the model training, and the rest are used for the validation. The comparison shows that the RF model performs better with MAE and RMSE less than 1.10 × 10^{−3} and 2.09 × 10^{−4}, respectively. Compared with the model error in Group A, it can be concluded that the MAE and RMSE with linear pollution sources (Group E) are larger, ranging from 3.65 × 10^{−4} to 1.10 × 10^{−3} and 1.09 × 10^{−5} to 2.09 × 10^{−4}, respectively, which is mainly caused by the larger magnitude of the total VOC mass fraction.

### Efficiency in LNAPL transport

*in situ*pollutant-contaminated sites are usually diverse, Group F is set to investigate the applicability of the RF model to LNAPL transport. The contaminant is assumed to be benzene, which leaks continuously underground at a constant rate of 0.26 kg/d. For the four validation scenarios, the MAE and RMSE vary from 0 to 0.0008 and 6.70 × 10

^{−9}to 1.66 × 10

^{−5}, respectively, showing that the model performs better. Figure 11 shows the comparison of the VOC mass fraction in groundwater under four validation scenarios by the RF model and TMVOC. The RF model can accurately predict the spatial distribution of the VOC mass fraction of benzene in groundwater in heterogeneous media and accurately capture the high concentration area.

## CONCLUSIONS

Accurate estimation of VOC distribution across spatial and temporal scales is a challenge in the *in situ* remediation of groundwater contamination induced by NAPL leakage. NAPL migration involves complicated physical, chemical, and even biological functions, and thus, phase transfer is very complex. In this study, surrogate models are developed to overcome the difficulties encountered in numerical models, including the large dependence on physical parameters and large calculation burden, in *in situ* NAPL-contaminated sites, especially highly heterogeneous aquifers.

Fifty random permeability distributions are generated by the Monte Carlo method to represent strongly heterogeneous aquifers. TMVOC is applied to obtain the distribution of both TCE and benzene under the point and linear leakage sources over the model area at different time sections, where the results are regarded as benchmark data. The applicability of four surrogate models (MLP, SVM, GBR, and RF) in simulating DNAPL transport is examined over the spatiotemporal distributions. The RF model is found to have the best performance where it was then further used to validate its reliability in simulating multiphase and multicomponent transport. The main conclusions are as follows:

The DNAPL distributions from four surrogate models (MLP, SVM, GBR, and RF) are compared in detail at both spatial and temporal scales, and the RF method has the best performance with the NSE of 0.99, the maximum MAE and RMSE smaller than 5.55 × 10

^{−4}and 5.88 × 10^{−5}, respectively, and over 150 times faster than TMVOC.Model parameters and the number of scenarios for model training have a certain influence on model results, and the influences are not obvious in this study; thus, the grid search method is a good choice to decide the optimal parameter combination.

The RF model can better characterize the spatiotemporal distribution of gaseous, aqueous and DNAPL phases with high accuracy and calculation efficiency in NAPL-contaminated sites. However, the accuracy of the DNAPL phase distribution is lower than that of gaseous and aqueous phases, which may be caused by the low resolution of grid sizes and smaller number of scenarios.

The RF method can accurately simulate the spatial distribution of the TCE plume under linear leakage pollution sources with MAE and RMSE less than 1.10 × 10

^{−3}and 2.09 × 10^{−4}, respectively.The RF model can efficiently and accurately simulate the spatial distribution of benzene-contaminated plumes with an MAE of almost 0 and an RMSE of less than 6.70 × 10

^{−9}.

Overall, it should be concluded that in this study the high-accuracy RF model can reproduce the NAPL distribution with multiple phases across spatial and temporal scales and considerably reduce the computational burden for modeling NAPL transport, which will provide new technical support for multiphase flow simulation and is of great applicability to real NAPL-contaminated sites. It is suggested that the surrogate models should be oriented toward the real site conditions of the contaminant site for environmental management and design the appropriate number of scenarios and the mesh size to improve the model accuracy. Then, the established model can be used to quickly and effectively produce multiphase NAPL distributions across spatiotemporal scales.

According to the results of the case study, the RF model can serve as an efficient surrogate model for simulating spatiotemporal distribution of NAPL in heterogeneous media. Compared to LNAPL, DNAPL may migrate deeper and accumulate laterally along an aquitard. Therefore, the scope of investigation and treatment of DNAPL-contaminated sites should be increased. Additionally, local preferential flow of DNAPL may occur in heterogeneous media, leading to higher concentrations than expected, which should be paid attention.

Due to the limited availability of actual site measurement data, we relied on numerical results as the reference for our surrogate models to investigate the spatiotemporal evolution of pollutants. In future research, we plan to assess the performance of these surrogate models using actual site data to enhance the credibility of our analysis.

## ACKNOWLEDGEMENTS

This study was supported by the National Natural Science Foundation of China (Grant numbers: U2167211 and 41877173). We thank the public uncertainty quantification tools from Prof. Qingyun Duan at Hohai University.

## AUTHOR CONTRIBUTION

L.H. was involved in conceptualization, methodology, writing, reviewing, and editing; M.Z. was involved in writing theoriginal draft, model calculation, data collection and analysis, writing, reviewing, and editing; L.T. was involved in writing, reviewing, and editing; S.H. was involved in writing, reviewing, editing.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.