Abstract
A better understanding of the distribution of nonaqueous phase liquid (NAPL) plumes is of great importance to groundwater pollution remediation and control. However, the efficiency of surrogate models in simulating the transport is still not well addressed. Selecting a leakage problem as an example, 50 sets of random permeability distributions are generated using the Monte Carlo method, and a numerical model is used to obtain benchmark data of NAPL transport. Four machine learning methods are employed to simulate dense NAPL transport under point leakage sources across spatiotemporal scales. The validation of the models demonstrate that the random forest model can also effectively capture the spatial-temporal distribution of the plume in heterogeneous aquifers, with a maximum mean absolute error and root mean square error smaller than 5.55 × 10−4 and 5.88 × 10−5, respectively. Meanwhile, the multiple phase outcome from the random forest model fits well with the numerical results under the scenarios of linear leakage sources and light NAPL transport. The total time consumed in the computation is reduced by over 150 times after using the surrogate models. The results suggest that surrogate models can provide a promising way to understand NAPL transport in heterogeneous aquifers.
HIGHLIGHTS
Numerical solutions provided the NAPL transport benchmark dataset under different scenarios.
Four ML models were trained to capture DNAPL transport across the spatial-temporal scales.
The RF model had the best performance in reproducing multiphase transport in terms of both the accuracy and calculation efficiency.
Surrogate models provided promising applicability to quickly and effectively evaluate NAPL transport.
INTRODUCTION
Groundwater provides a large share of the world's water demand for various sections, especially drinking water. However, under recent intense anthropogenic activities seeking rapid economic and social development, the safety of groundwater resources has been increasingly threatened (Jia et al. 2019). At many chemical sites, the accidental release of underground nonaqueous phase liquids (NAPLs) is the greatest threat to drinking groundwater and has received increasing attention (Kang et al. 2021). NAPL pollutants, especially dense nonaqueous phase liquids (DNAPLs), can cause serious harm to human health after entering soil and groundwater. For example, trichloroethylene (TCE) can cause cancer, teratogenesis, and mutagenesis (Huang & Jennifer 2011). NAPL has been approved by the US Environmental Protection Agency and is listed as a priority control pollutant. Compared with light nonaqueous phase liquid (LNAPL), DNAPL has a higher density, lower interfacial tension and more complex migration behavior. Due to the difficulty of detection, wide distribution in chemical companies and their complex migration behavior, NAPL has become an important source of pollution in soil and groundwater.
When entering the underground medium, DNAPL tends to move down through the vadose zone and accumulate in the lower confining bed due to its high density and insolubility in water. Additionally, the heterogeneity of underground media makes DNAPL transport with different phases much more complicated in contaminated sites. As an important means to analyze NAPL migration and transformation in soil and groundwater, numerical models can accurately describe the migration process and mass transfer mechanism of pollutants through strict mathematical equations, which have been identified by many laboratory tests (Kueper et al. 1989) and field tests (Boggs et al. 1992; Aly & Peralta 1999; Akbarnejad-Nesheli et al. 2016). TOUGH2/TMVOC (Pruess & Battistelli 2002), a numerical simulator for multiphase non-isothermal flows of multicomponent hydrocarbon mixtures, can accurately simulate NAPL migration and transformation in saturated-unsaturated heterogeneous media. TOUGH2-MP/TMVOC is a massively parallel (MP) version of TOUGH2/TMVOC (Zhang et al. 2007). It uses the METIS software package for grid partitioning and AZTEC linear-equation solver to solve large simulation problems that may not be solved by the standard, single-CPU TOUGH2/TMVOC code. The parallel code has been successfully applied from multi-core PCs to supercomputers on real field problems of multi-million-cell simulations for three-dimensional multiphase and multicomponent fluid and heat flow, as well as solute transport.
Usually, reasonable physical parameter estimation, time consumption and necessary knowledge of model exercises from model construction and simulation to analysis disable environmental managers to effectively apply numerical models to understand NAPL migration processes. In response to this problem, some scholars have sought surrogate models for the simulation of multiphase flow (Shieh & Peralta 2005; Kumar et al. 2013; Yadav et al. 2016; Mo et al. 2019; Luo et al. 2020), which are approximate substitutes for numerical models. With high accuracy, the relationship between input variables and output variables is simulated with a small calculation burden and can be used to quickly investigate the NAPL distributions over the spatial and temporal scales. The machine learning method, which seeks the relation between the data while ignoring the physical principles, has been developed rapidly and has been widely utilized in groundwater-related fields in recent years (Nozari et al. 2022). Data-driven methods have achieved reasonable results in many case studies related to the quantity and quality of groundwater (Sahoo et al. 2017; Sun & Scanlon 2019; Bedi et al. 2020; Chakraborty et al. 2020; Desimone et al. 2020; Malakar et al. 2021; Sun et al. 2021a, 2021b; Zounemat-Kermnai et al. 2021; Deng et al. 2023). The commonly used machine learning methods include support vector machines (Khalil et al. 2005; Arabgol et al. 2016), artificial neural networks (ANNs) (Rao 2006; Srivastava & Singh 2015; Seifi & Riahi-Madvar 2019), the RF method (Naghibi et al. 2020), the kriging model (Kholghi & Hosseini 2009), and fuzzy theory (Chu & Chang 2009). Unfortunately, there are few studies on the application of machine learning in the field of multiphase flow. Zhong et al. (2019) used the generative adversarial nets (GAN) model to establish a multiphase flow surrogate model to simulate the CO2 pollution plume in heterogeneous media. Hou et al. (2015) established a surrogate model of a multiple phase flow model (UTCHEM) using support vector regression (SVR). With the optimization of machine learning algorithms, the application of machine learning in underground multiphase flow will become increasingly reliable (Zhang et al. 2020). Deep learning models (such as convolutional neural networks and long- and short-term memory models) have also demonstrated their potential. Mo et al. (2019) built a surrogate model for the quantification of the uncertainty of multiphase flow based on a deep convolutional codec network and proposed a two-stage training strategy to improve the approximation of saturation discontinuities, accurately describing the spatiotemporal evolution of pressure and discontinuous CO2 saturation fields.
The analysis of NAPL transport using surrogate models has been less focused, especially on NAPL transport in different phases and different leakage mode. Under this situation, the objective of this study is to explore the effectiveness of surrogate models in simulating NAPL transport across spatial and temporal scales. First, the TOUGH2-MP/TMVOC model was employed to obtain the benchmark data for the distribution of NAPL under both point and linear leakage conditions in heterogeneous aquifers. Then, four surrogate models, MLP, SVM, GBR, and RF, were trained and validated to capture the detailed features of DNAPL transport with point leakage pollution releases. Next, a comprehensive comparison of the outcomes generated by the four surrogate models was conducted to identify the optimal model, and the sensitivities of the model parameters and number of scenarios on the model accuracy were analyzed. Finally, the identified optimal model was used to assess the model efficiency in depicting multiphase distributions, DNAPL transport with linear leakage pollution sources, and LNAPL transport. The results can provide reference the selection of remediation strategies for organic contamination in heterogeneous aquifers.
METHODOLOGY
Hypothetical model
Numerical simulation
Machine learning methods
Multilayer perceptron method
The ANN can realize multidimensional and nonlinear precise mapping between the input and output layers. A multilayer perceptron (MLP) is a forward structured ANN consisting of an input layer, an output layer, and one or more hidden layers (Bishop 1995). Each layer of MLP contains one or more neurons, and these neurons have directional connections with neurons in the upper and lower layers. The basic idea of the MLP network algorithm is first to obtain the error through forwarding propagation, then use backpropagation to modify the weight, and finally obtain the optimal model. The algorithm usually uses stochastic gradient descent to adjust the weights in the backpropagation process. The principle of the gradient descent method is to calculate the gradient of the loss function for all internal variables and perform backpropagation. Common activation functions include the ReLU function, sigmoid function, and tanh function.
Support vector machine
Vapnik (1995) proposed the support vector machine (SVM) algorithm for classification problems and then gradually extended it to regression problems (Vapnik 1998). The SVM algorithm has been an excellent machine learning algorithm for classification and regression problems with continuous improvement. SVM can effectively solve practical problems with small samples, nonlinearities, high dimensions, and local minima based on optimization methods, statistical theories, and structural risk minimization principles. Compared with other machine learning algorithms, SVM has good generalization ability and avoids the overfitting problem during the training process. In recent years, it has been successfully used in hydrological forecasting and anomaly detection. The main calculation process of this method is as follows:
- (1)
Take the selected feature variables, X = {x1, x2, …, xn}, as the input variables of the model. A nonlinear regression model is used to construct the relationship with the target variable, Y = {y1, y2, …, yn}.
- (2)
For the sample set {(x1, y1),…, (xi, yi), …, (xn, yn)}, use nonlinear mapping Ø to map the sample set to the high-dimensional feature space to obtain the mapped sample Ø(xi).
- (3)
Construct the nonlinear function (x) as a mathematical model for predicting y.
Gradient boosting regression
The gradient boosting regression (GBR) algorithm is a representative boosting algorithm that uses all samples to train the model during each round of training and updates the weight of the sample in each round of training. The GBR model uses classification and regression decision tree (CART) as a single model (Friedman 2001) to process multiple data types and has the advantages of low learning error and robustness to parameters.
Random forest
The random forest (RF) algorithm was first proposed by Breiman in 2001 (Breiman 2001). The RF combines multiple CARTs and mainly solves classification and regression problems (Chen et al. 2019). The bootstrap method is first used to randomly extract multiple samples from the original data for model training. The undrawn data are used as out-of-bag data to test the algorithm's accuracy. Subsequently, it constructs the decision tree for each training sample and integrates all sample decision trees into the RF model. The main parameters in the RF model include the maximum number of features allowed to try in individual trees (max_features) and the number of trees to build before taking the maximum voting or averages of predictions (n_estimators).
Grid search method for parameter optimization
Grid search is essentially an optimization algorithm that selects the best parameters for the optimization problem from a list of parameter options, hence automating the ‘trail-and-error’ method. The grid search method is most popularly known in machine learning to obtain the parameters with which the model provides the best accuracy. Additionally, the number of evaluations increases exponentially as the parameter increases, which requires much time to train the model.
Evaluation metrics
MODEL RESULTS
Numerical modeling results
Performance evaluation of four surrogate models
Scenarios set
Performance evaluation of four surrogate models (MLP, SVM, GBR, and RF) is conducted by comparing the results of two groups (Groups A and B), which represent the spatial distribution and temporal NAPL evolution results, respectively. As listed in Table 1, Groups A and B have 50 scenarios and 20 scenarios for 6,222 gridblocks, respectively. Group B has two input variables, permeability distribution and time section, whereas Group A has only permeability distribution at the end of simulation. The NAPL leakage type is a point for both Groups A and B.
Groupa . | Comparison . | Inputs . | Outputs . | Number of scenarios . |
---|---|---|---|---|
A | Spatial distribution | Permeability distribution after 10 years | Total VOC mass fractions for each gridblock at 10 years | 50 for 6,222 gridblocks (A1-A50) |
B | Temporal evolution | Permeability distribution, time periods | 20 for 6,222 gridblocks (B1-B20) |
Groupa . | Comparison . | Inputs . | Outputs . | Number of scenarios . |
---|---|---|---|---|
A | Spatial distribution | Permeability distribution after 10 years | Total VOC mass fractions for each gridblock at 10 years | 50 for 6,222 gridblocks (A1-A50) |
B | Temporal evolution | Permeability distribution, time periods | 20 for 6,222 gridblocks (B1-B20) |
aLeakage type is the point for both Groups A and B.
Spatial distribution
For scenario AA4, it takes approximately 4.75 h for the TOUGH2-MP/TMVOC model (12 cores) for the simulation; however, the total time consumed in the calculation (Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz × 12, and 31.8 GiB memory) is 1,333, 11,081, 1,822, and 92 s for the MLP, GBR, SVM, and RF models, respectively. It is obvious that the calculation efficiency in the RF model is relatively high, of which the execution efficiency is approximately 178 times shorter than that of the numerical model.
Temporal evolution
Nineteen scenarios out of 20 in Group B, which represent 20 time sections with an interval of half a year over 10 years, are selected for model training, and the results at the end of the simulation (the 10th year) are used for verification. The MAE, RMSE, NSE and total time consumption for the four models are listed in Table 2. The best performance can be observed based on the RF model, with the minimum MAE (5.55 × 10−4) and RMSE (5.88 × 10−5), largest NSE (0.99), and the shortest time consumption (83 s)
Method . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|
MLP | 4.34 × 10−3 | 5.87 × 10−4 | −0.25 | 296 |
SVM | 7.77 × 10−3 | 2.98 × 10−4 | −15.97 | 633 |
GBR | 1.10 × 10−3 | 3.05 × 10−4 | −104.31 | 305 |
RF | 5.55 × 10−4 | 5.88 × 10−5 | 0.99 | 83 |
Method . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|
MLP | 4.34 × 10−3 | 5.87 × 10−4 | −0.25 | 296 |
SVM | 7.77 × 10−3 | 2.98 × 10−4 | −15.97 | 633 |
GBR | 1.10 × 10−3 | 3.05 × 10−4 | −104.31 | 305 |
RF | 5.55 × 10−4 | 5.88 × 10−5 | 0.99 | 83 |
The performance of each machine learning method in simulating multiphase flow depends on various factors such as the heterogeneity of medium, the leakage history, the choice of features, and the tuning of hyperparameters. From the comparison of both the spatial distribution and temporal evolution as well as the calculation efficiency, the RF model performs best among the four models and captures the tempo-spatial characteristics of the plume of the total VOC mass fraction, because the RF model constructed a large number of decision trees and then combined their results to make predictions. As studied in Sun et al. (2021a, 2021b, 2022), the RF model performs better in accurate groundwater level prediction, which is in accordance with our studies. Therefore, the RF model can act as a reasonable surrogate for the numerical models and then be further used to discuss the sensitivity of the model parameters and their efficiency in simulating NAPL transport in different phases.
Sensitivity of model parameters and the number of scenarios
The parameter max_features and n_estimators can significantly influence the results of the RF model. To analyze the influence of these two parameters, the simulation results are evaluated if different parameter combinations are adopted in terms of MAR, RMSE and the total time consumption in the calculation (Table 3). It can be clearly observed that the RF algorithm has high simulation accuracy and short calculation time consumption for the nine parameter combinations. The parameter n_estimator has very limited influences on the model results. The total time consumption in the calculation increases with the number of forest trees. The best parameter combination is max_features of ‘auto’ and n_estimators of 200, which is consistent with the optimization results from the grid search method.
max_features . | n_estimators . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|---|
'auto'a | 50 | 4.33 × 10−4 | 7.01 × 10−6 | 0.99 | 60 |
'sqrt'b | 50 | 6.71 × 10−4 | 1.15 × 10−5 | 0.99 | 49 |
'log2'c | 50 | 4.62 × 10−4 | 1.05 × 10−5 | 0.99 | 49 |
'auto' | 100 | 4.38 × 10−4 | 6.89 × 10−6 | 0.99 | 86 |
'sqrt' | 100 | 4.16 × 10−4 | 8.87 × 10−6 | 0.99 | 66 |
'log2' | 100 | 4.98 × 10−4 | 9.31 × 10−6 | 0.99 | 66 |
'auto' | 200 | 4.25 × 10−4 | 6.88 × 10−6 | 0.99 | 138 |
'sqrt' | 200 | 4.25 × 10−4 | 8.38 × 10−6 | 0.99 | 97 |
'log2' | 200 | 4.14 × 10−4 | 8.66 × 10−6 | 0.99 | 98 |
max_features . | n_estimators . | MAE . | RMSE . | NSE . | The total time consumption (s) . |
---|---|---|---|---|---|
'auto'a | 50 | 4.33 × 10−4 | 7.01 × 10−6 | 0.99 | 60 |
'sqrt'b | 50 | 6.71 × 10−4 | 1.15 × 10−5 | 0.99 | 49 |
'log2'c | 50 | 4.62 × 10−4 | 1.05 × 10−5 | 0.99 | 49 |
'auto' | 100 | 4.38 × 10−4 | 6.89 × 10−6 | 0.99 | 86 |
'sqrt' | 100 | 4.16 × 10−4 | 8.87 × 10−6 | 0.99 | 66 |
'log2' | 100 | 4.98 × 10−4 | 9.31 × 10−6 | 0.99 | 66 |
'auto' | 200 | 4.25 × 10−4 | 6.88 × 10−6 | 0.99 | 138 |
'sqrt' | 200 | 4.25 × 10−4 | 8.38 × 10−6 | 0.99 | 97 |
'log2' | 200 | 4.14 × 10−4 | 8.66 × 10−6 | 0.99 | 98 |
aConsidering all the features that make sense in every tree.
bUsing the square root of the total number of features in an individual run.
cAnother similar type of option for max_features.
To analyze the influence of the total number of scenarios on the model results, four conditions are considered, which are 50 scenarios (46 for the training and four for the verification), 38 scenarios (35 for the training and three for the verification), 25 scenarios (23 for the training and two for the verification) and 12 scenarios (11 for the training and one for the verification). Table 3 shows that the performance generally increases with the number of scenarios; however, the increase in performance is linearly dependent on the number of scenarios. The MAE for different numbers of scenarios is basically at the magnitude of 10−4, ranging from 3.26 × 10−4 to 7.93 × 10−4. The RMSE for the condition of 50 scenarios is the minimum, with a value of 6.89 × 10−6 ranging from 6.89 × 10−6 to 3.79 × 10−5. While the NSE is kept constant at 0.99.
FURTHER DISCUSSION OF MODEL EFFICIENCY
Four additional groups (Groups C, D, E, and F) are designed to investigate the model efficiency in simulating the distribution of the VOC mass fraction in gaseous, aqueous and NAPL phases under spatial and temporal scales, linear leakage and LNAPL transport (Table 4).
Group . | Comparison . | Leakage type . | Input variables . | Output variables . | Number of scenarios . |
---|---|---|---|---|---|
C | Spatial distribution | Point | Permeability distribution | Mass fraction in gaseous, aqueous and NAPL phases | 50 (C1–C50) |
D | Temporal evolution | Permeability distribution, time section | 20 (D1–D20) | ||
E | Leakage type | Linear | Permeability distribution | Total VOC mass fraction | 50 (E1–E50) |
Group . | Comparison . | Leakage type . | Input variables . | Output variables . | Number of scenarios . |
---|---|---|---|---|---|
C | Spatial distribution | Point | Permeability distribution | Mass fraction in gaseous, aqueous and NAPL phases | 50 (C1–C50) |
D | Temporal evolution | Permeability distribution, time section | 20 (D1–D20) | ||
E | Leakage type | Linear | Permeability distribution | Total VOC mass fraction | 50 (E1–E50) |
Efficiency in the spatial distribution and temporal evolution of different phases of DNAPLs
Efficiency under the linear DNAPL leakage scenario
In the real contaminated site, there may be linear leakage or multipoint leakage in the pipelines, and thus, the scenario of linear leakage (Group E) is considered. Forty-six out of 50 scenarios simulated by TMVOC are used for the model training, and the rest are used for the validation. The comparison shows that the RF model performs better with MAE and RMSE less than 1.10 × 10−3 and 2.09 × 10−4, respectively. Compared with the model error in Group A, it can be concluded that the MAE and RMSE with linear pollution sources (Group E) are larger, ranging from 3.65 × 10−4 to 1.10 × 10−3 and 1.09 × 10−5 to 2.09 × 10−4, respectively, which is mainly caused by the larger magnitude of the total VOC mass fraction.
Efficiency in LNAPL transport
CONCLUSIONS
Accurate estimation of VOC distribution across spatial and temporal scales is a challenge in the in situ remediation of groundwater contamination induced by NAPL leakage. NAPL migration involves complicated physical, chemical, and even biological functions, and thus, phase transfer is very complex. In this study, surrogate models are developed to overcome the difficulties encountered in numerical models, including the large dependence on physical parameters and large calculation burden, in in situ NAPL-contaminated sites, especially highly heterogeneous aquifers.
Fifty random permeability distributions are generated by the Monte Carlo method to represent strongly heterogeneous aquifers. TMVOC is applied to obtain the distribution of both TCE and benzene under the point and linear leakage sources over the model area at different time sections, where the results are regarded as benchmark data. The applicability of four surrogate models (MLP, SVM, GBR, and RF) in simulating DNAPL transport is examined over the spatiotemporal distributions. The RF model is found to have the best performance where it was then further used to validate its reliability in simulating multiphase and multicomponent transport. The main conclusions are as follows:
The DNAPL distributions from four surrogate models (MLP, SVM, GBR, and RF) are compared in detail at both spatial and temporal scales, and the RF method has the best performance with the NSE of 0.99, the maximum MAE and RMSE smaller than 5.55 × 10−4 and 5.88 × 10−5, respectively, and over 150 times faster than TMVOC.
Model parameters and the number of scenarios for model training have a certain influence on model results, and the influences are not obvious in this study; thus, the grid search method is a good choice to decide the optimal parameter combination.
The RF model can better characterize the spatiotemporal distribution of gaseous, aqueous and DNAPL phases with high accuracy and calculation efficiency in NAPL-contaminated sites. However, the accuracy of the DNAPL phase distribution is lower than that of gaseous and aqueous phases, which may be caused by the low resolution of grid sizes and smaller number of scenarios.
The RF method can accurately simulate the spatial distribution of the TCE plume under linear leakage pollution sources with MAE and RMSE less than 1.10 × 10−3 and 2.09 × 10−4, respectively.
The RF model can efficiently and accurately simulate the spatial distribution of benzene-contaminated plumes with an MAE of almost 0 and an RMSE of less than 6.70 × 10−9.
Overall, it should be concluded that in this study the high-accuracy RF model can reproduce the NAPL distribution with multiple phases across spatial and temporal scales and considerably reduce the computational burden for modeling NAPL transport, which will provide new technical support for multiphase flow simulation and is of great applicability to real NAPL-contaminated sites. It is suggested that the surrogate models should be oriented toward the real site conditions of the contaminant site for environmental management and design the appropriate number of scenarios and the mesh size to improve the model accuracy. Then, the established model can be used to quickly and effectively produce multiphase NAPL distributions across spatiotemporal scales.
According to the results of the case study, the RF model can serve as an efficient surrogate model for simulating spatiotemporal distribution of NAPL in heterogeneous media. Compared to LNAPL, DNAPL may migrate deeper and accumulate laterally along an aquitard. Therefore, the scope of investigation and treatment of DNAPL-contaminated sites should be increased. Additionally, local preferential flow of DNAPL may occur in heterogeneous media, leading to higher concentrations than expected, which should be paid attention.
Due to the limited availability of actual site measurement data, we relied on numerical results as the reference for our surrogate models to investigate the spatiotemporal evolution of pollutants. In future research, we plan to assess the performance of these surrogate models using actual site data to enhance the credibility of our analysis.
ACKNOWLEDGEMENTS
This study was supported by the National Natural Science Foundation of China (Grant numbers: U2167211 and 41877173). We thank the public uncertainty quantification tools from Prof. Qingyun Duan at Hohai University.
AUTHOR CONTRIBUTION
L.H. was involved in conceptualization, methodology, writing, reviewing, and editing; M.Z. was involved in writing theoriginal draft, model calculation, data collection and analysis, writing, reviewing, and editing; L.T. was involved in writing, reviewing, and editing; S.H. was involved in writing, reviewing, editing.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.