## Abstract

Meta-model based coupled simulation-optimization methodology is an effective tool in developing sustainable saltwater intrusion management strategies for coastal aquifers. Such management strategies largely depend on the accuracy, reliability, and computational feasibility of meta-models and the numerical simulation model. However, groundwater models are associated with a certain amount of uncertainties, e.g. parameter uncertainty and uncertainty in prediction. This study addresses uncertainties related to input parameters of the groundwater flow and transport system by using a set of randomized input parameters. Three meta-models are compared to characterize responses of water quality in coastal aquifers due to groundwater extraction patterns under parameter uncertainty. The ensemble of the best meta-model is then coupled with a multi-objective optimization algorithm to develop a saltwater intrusion management model. Uncertainties in hydraulic conductivity, compressibility, bulk density, and aquifer recharge are incorporated in the proposed approach. These uncertainties in the physical system are captured by the meta-models whereas the prediction uncertainties of meta-models are further addressed by the ensemble approach. An illustrative multi-layered coastal aquifer system is used to demonstrate the feasibility of the proposed approach. Evaluation results indicate the capability of the proposed approach to develop accurate and reliable management strategies for groundwater extraction to control saltwater intrusion.

## ABBREVIATIONS AND NOTATION

### Abbreviations

- ANN
Artificial Neural Network

- CART
Classification and Regression Tree

- CEMGA
Controlled Elitist Multi-objective Genetic Algorithm

- EGPR
Ensemble GPR

- EMARS
Ensemble MARS

- EPR
Evolutionary Polynomial Regression

- ERT
Ensemble RT

- FIS
Fuzzy Inference System

- GP
Genetic Programming

- GPR
Gaussian Process Regression

- LHS
Latin Hypercube Sampling

- LSBoost
Least Squared Boosting

- MARS
Multivariate Adaptive Regression Spline

- OP
Observation Point

- RBF
Radial Basis Function

- RT
Regression Tree

- S/O
Simulation-Optimization

### Notation

*a*_{L}longitudinal dispersivity [L];

*a*_{m}molecular diffusion coefficient [L

^{2}/T];*a*_{T}lateral dispersivity [L];

*C*material concentration in aqueous phase [M/L

^{3}];*D*Dispersion coefficient tensor [L

^{2}/T];*F*storage coefficient;

*g*acceleration of gravity [L/T

^{2}];*h*pressure head [L];

*k*_{r}relative permeability or relative hydraulic conductivity [L

^{2}];*K*_{s}first order biodegradation rate through adsorbed phase;

*k*_{s}saturated permeability tensor [L

^{2}];*k*_{so}referenced saturated conductivity tensor [L

^{2}];*K*hydraulic conductivity tensor [L/T];

*K*_{d}distribution coefficient;

*K*_{w}first order biodegradation rate constant through dissolved phase;

*n*porosity of the medium;

*q*source and/or sink [[L

^{3}/T]/L^{3}];*s*saturation;

*t*time [T];

- T
number of time steps;

- V
Darcy velocity [L/T];

magnitude of V;

*z*potential head [L];

compressibility of the medium [LT

^{2}/M];*α*′modified compressibility of the medium [1/L];

compressibility of water [LT

^{2}/M];*β*′modified compressibility of the water [1/L];

moisture concentration;

decay constant;

*μ*dynamic viscosity of water at chemical concentration C [M/LT];

*μ*_{o}reference dynamic viscosity at zero chemical concentration [M/LT];

water density at chemical concentration C [M/L

^{3}];bulk density of the medium (M/L

^{3});*ρ**density of either the injection fluid or the withdrawn water [M/L

^{3}];*ρ*_{o}referenced water density at zero chemical concentration [M/L

^{3}];tortuosity;

Kronecker delta tensor;

del operator;

the density dependent coupled flow and salt transport meta-model.

## INTRODUCTION

Coastal groundwater resources are susceptible to saltwater intrusion as a result of unplanned and irrational water extraction from the aquifers. This has become a major management problem, which can be tackled through a judicial pumping management strategy by adopting optimization of groundwater extraction patterns. The optimal management strategy for groundwater pumping to control saltwater intrusion can be attained through utilizing coupled simulation-optimization (S/O) approaches (Bhattacharjya & Datta 2009; Sreekanth & Datta 2010, 2011b; Hussain *et al.* 2015). In a coupled S/O approach, the role of the simulation model is to simulate physical processes whereas the optimization part is intended to find the optimal groundwater extraction values to control saltwater intrusion. The reliability and accuracy of the optimal management strategy based on a coupled S/O approach depend on how accurately the simulation model captures and simulates the accompanying physical processes. Moreover, groundwater flow and solute transport systems in coastal aquifers are associated with uncertain model parameters (Sreekanth & Datta 2014). The multidimensional heterogeneity of aquifer properties such as hydraulic conductivity, compressibility, and bulk density are considered as major sources of uncertainty in groundwater modelling systems (Ababou & Al-Bitar 2004). Other sources of uncertainty are associated with spatial and temporal variability of hydrologic as well as human interventions, e.g. aquifer recharge and transient groundwater extraction patterns. The present study addresses uncertainties arising from uncertain model parameters (hydraulic conductivity, compressibility, bulk density, and aquifer recharge), and spatial and temporal variation of groundwater extraction patterns. Uncertain model parameters are randomly paired with the transient groundwater extraction patterns in the simulation of aquifer processes.

Coupling of simulation models in the optimization algorithm in a linked S/O approach is computationally inefficient because of the multiple calls of the detailed and complex simulation model by the optimization routine to obtain a global optimal solution (Dhar & Datta 2009). Therefore, a reasonably accurate and reliable replacement of the original numerical simulation model, commonly known as emulators or meta-models, can be used to achieve computational efficiency in the optimization routine of computationally intensive problems (Goel *et al.* 2007). Recently, Artificial Neural Network (ANN) (Bhattacharjya & Datta 2009; Kourakos & Mantoglou 2009; Sreekanth & Datta 2010), Genetic Programming (GP) (Sreekanth & Datta 2010, 2011a), Evolutionary Polynomial Regression (EPR) (Hussain *et al.* 2015), Fuzzy Inference System (FIS) (Roy & Datta 2017a), and cubic Radial Basis Function (RBF) (Christelis & Mantoglou 2016) have been used as computationally efficient substitutes of the density dependent linked flow and solute transport processes in coastal aquifers. The present study investigates the potential applicability and relative comparison of Multivariate Adaptive Regression Spline (MARS), Gaussian Process Regression (GPR), and Regression Tree (RT) based meta-models and their ensembles to approximate physical processes in a multi-layered coastal aquifer system under parameter uncertainty. A saltwater intrusion management model is developed by coupling the ensemble of the best meta-model obtained from these three meta-models to a Controlled Elitist Multi-objective Genetic Algorithm (CEMGA) (Deb & Goel 2001) in a multiple objective problem setting.

MARS (Friedman 1991) utilizes a set of coefficients and Basis functions in developing a functional relationship between inputs and outputs. This non-parametric artificial intelligence based approach provides a relatively rapid, flexible, and reasonably accurate meta-model (Salford-Systems 2016). One of the main advantages of using MARS in developing a predictor-response relationship is its ability to build simple and easy-to-interpret meta-models even from a high dimensional data pattern (Zhang & Goh 2016). MARS also considers the most important input variables in determining the output during the backward pass of the model development. MARS has been applied successfully in different areas of water resources research (Sharda *et al.* 2006, 2008; Beuchat *et al.* 2011; Adamowski *et al.* 2012; Nasseri *et al.* 2013; Samadi *et al.* 2015; Zabihi *et al.* 2016). Recently, Roy & Datta (2017b) demonstrated the potential applicability of MARS and ensemble MARS (EMARS) models in approximating the physical processes of a multi-layered coastal aquifer system. This study extends the use of MARS and EMARS to approximate the physical processes of a multi-layered coastal aquifer system under parameter uncertainty.

GPR (Rasmussen & Williams 2005) is a stochastic approach (Jacobs & Koziel 2015) to building meta-models based on the assumption of a Gaussian prior distribution of model variables. GPR meta-models are based on a non-parametric approach in which the parameters are presumed as random variables that are taken from a Gaussian distribution (Bazi *et al.* 2012). Although the GPR approach of meta-model formation is applied successfully in many engineering applications (Forrester *et al.* 2008), GPR based meta-models and their ensembles have not been utilized thus far to approximate density dependent linked flow and salt transport processes in multi-layered coastal aquifers under parameter uncertainty. Recently, Rajabi & Ketabchi (2017) proposed GPR based meta-models to develop a single objective coastal groundwater management model. Nevertheless, the capability of the GPR meta-models and their ensembles needs to be evaluated for multi-objective optimization of coastal aquifer management problems. Therefore, the present study is an attempt to develop a multiple objective management model with parameter uncertainty.

Regression Trees (RTs) are able to build flexible models that are less complex and easily interpretable. The theory behind the decision trees is based on the Classification and Regression Tree (CART) algorithm proposed by Breiman *et al.* (1984). The ensemble RT (ERT) method, integrating the flexibility of the tree model space and robustness of the ensemble learning technique, is based on the weighted combination of multiple regression trees. Boosting is one of the most commonly used methods of formulating the tree ensemble (Freund & Schapire 1997). The boosting algorithm used in this study is the least squared boosting (LSBoost) (Hastie *et al.* 2008).

However, meta-models are associated with a certain amount of uncertainty in prediction that results from the residuals. Using an ensemble of individual meta-models is one of the most effective ways of reducing predictive uncertainty of the meta-modelling approach in which outputs obtained from a group of separately trained meta-models are integrated to acquire one unified output (Zhou *et al.* 2002). An ensemble of meta-models is able to capture the accurate trend of data by combining the outputs from a certain number of individual models, thus providing better prediction capability than individual models within the ensemble (Goel *et al.* 2007; Jovanović *et al.* 2015). Nevertheless, individual meta-models within the ensemble should be adequately accurate and sufficiently diverse. This diversity of individual models can be maintained through utilization of several approaches including optimizing model structures and parameters, varying the training algorithm, and the use of different realizations of the training dataset (Sharkey 1999; Zhang & Berardi 2001). The random sampling without replacement (Hastie *et al.* 2008) technique can be utilized to obtain different realizations of the training dataset in order to form individual ensemble members.

Therefore, this study seeks to develop an optimal groundwater extraction strategy to control saltwater intrusion in multi-layered coastal aquifers incorporating the uncertainty of model parameters and the prediction uncertainty of meta-model based modelling approaches. MARS, GPR, and RT based meta-models are proposed as computationally efficient substitutes of the complex numerical simulation model. The study also considers predictive uncertainty of meta-modelling approaches by proposing ensembles of MARS, GPR, and RT based meta-models. A comparative evaluation of the MARS, GPR, and RT based meta-models and their ensembles is performed based on the accuracy of prediction. The best ensemble model thus obtained is coupled with CEMGA within a linked S/O framework in order to develop the saltwater intrusion management model. The contribution of the present study involves incorporation of parameter and prediction uncertainties, utilization of the stochastic meta-modelling approach, development of saltwater intrusion management models, and suggestions for water managers on choosing appropriate management alternatives. Applicability of the proposed methodology is evaluated using an anisotropic multi-layered coastal aquifer system. Each individual layer of this multi-layered coastal aquifer represents separate material zones with different values of hydraulic conductivities. The coupled flow and solute transport processes considered are transient and density dependent.

## METHODOLOGY

The proposed methodology integrates a numerical simulation model to simulate aquifer processes in a multi-layered coastal aquifer system under parameter uncertainty, meta-models to achieve computational efficiency in the linked S/O approach, ensembles of meta-models to address prediction uncertainty in meta-modelling, and an optimization algorithm to search for the global optimal solution of groundwater extraction patterns in a general framework. Components of the proposed approach are briefly described in the following sub-sections.

### Numerical simulation model

*et al.*1997). The simulation model uses a randomized set of uncertain model parameters and randomized groundwater pumping values as inputs, and generates saltwater concentration values at specified observation points (OPs) at the end of the specified management period as outputs. The 3D flow and solute transport processes are expressed by the following set of equations (Lin

*et al.*1997): The hydraulic conductivity tensor,

**, can be expressed as**

*K*### Generation of uncertain model parameters

This study specifically assumes four model parameters to be uncertain, i.e. hydraulic conductivity, aquifer recharge, bulk density, and compressibility of the aquifer material are considered as uncertain model parameters. These parameters are presumed homogeneous but uncertain within each material layer. Different realizations of a representative set of hydraulic conductivity values are obtained from a lognormal distribution with a specific mean and standard deviation of the associated normal distribution (Table 1). Lognormal distribution is a probability distribution in which the uncertain model parameter, hydraulic conductivity, is divided into *N* equi-probable intervals from which a single value is chosen randomly. The obtained lognormal hydraulic conductivity values more or less represent the standard values of the corresponding soil materials in different layers. These values are carefully chosen so that the hydraulic conductivity values within a material layer do not vary significantly so as to spill over to the hydraulic conductivity values in the range of the other materials. Moreover, the unsaturated curve is drawn using Van Genuchten's equation for every layer for the specific soil type. The ranges of hydraulic conductivity values for the four material layers are: Material layer 1: 4.322553–5.793223 m/day, Material layer 2: 9.334227–10.69161 m/day, Material layer 3: 14.3263–15.74599 m/day, Material layer 4: 2.524144–3.926397 m/day.

Parameters | Unit | Material layer | Distribution | Mean | Standard deviation |
---|---|---|---|---|---|

Hydraulic conductivity in x-direction | m/d | Silt | lognormal | 5.02 | 0.30 |

Hydraulic conductivity in y-direction | m/d | Silt | lognormal | 2.52 | 0.15 |

Hydraulic conductivity in z-direction | m/d | Silt | lognormal | 0.50 | 0.03 |

Compressibility | md^{2}/kg | Silt | uniform (LHS) | 7.37 × 10^{−16} | 3.50 × 10^{−16} |

Bulk density | kg/m^{3} | Silt | normal (LHS) | 1,650 | 5 |

*Recharge | m/d | Silt | uniform (LHS) | 0.00019 | 3.48 × 10^{−5} |

Hydraulic conductivity in x-direction | m/d | Sandy loam | lognormal | 9.97 | 0.28 |

Hydraulic conductivity in y-direction | m/d | Sandy loam | lognormal | 4.98 | 0.14 |

Hydraulic conductivity in z-direction | m/d | Sandy loam | lognormal | 1.00 | 0.03 |

Compressibility | md^{2}/kg | Sandy loam | uniform (LHS) | 7.35 × 10^{−18} | 3.49 × 10^{−18} |

Bulk density | kg/m^{3} | Sandy loam | normal (LHS) | 1,600 | 5 |

Hydraulic conductivity in x-direction | m/d | Sand | lognormal | 14.95 | 0.29 |

Hydraulic conductivity in y-direction | m/d | Sand | lognormal | 7.47 | 0.15 |

Hydraulic conductivity in z-direction | m/d | Sand | lognormal | 1.49 | 0.03 |

Compressibility | md^{2}/kg | Sand | uniform (LHS) | 7.35 × 10^{−18} | 3.49 × 10^{−18} |

Bulk density | kg/m^{3} | Sand | normal (LHS) | 1,550 | 5 |

Hydraulic conductivity in x-direction | m/d | Sandy clay | lognormal | 3.05 | 0.30 |

Hydraulic conductivity in y-direction | m/d | Sandy clay | lognormal | 1.53 | 0.15 |

Hydraulic conductivity in z-direction | m/d | Sandy clay | lognormal | 0.31 | 0.03 |

Compressibility | md^{2}/kg | Sandy clay | uniform (LHS) | 7.35 × 10^{−17} | 3.49 × 10^{−17} |

Bulk density | kg/m^{3} | Sandy clay | normal (LHS) | 1,700 | 5 |

Parameters | Unit | Material layer | Distribution | Mean | Standard deviation |
---|---|---|---|---|---|

Hydraulic conductivity in x-direction | m/d | Silt | lognormal | 5.02 | 0.30 |

Hydraulic conductivity in y-direction | m/d | Silt | lognormal | 2.52 | 0.15 |

Hydraulic conductivity in z-direction | m/d | Silt | lognormal | 0.50 | 0.03 |

Compressibility | md^{2}/kg | Silt | uniform (LHS) | 7.37 × 10^{−16} | 3.50 × 10^{−16} |

Bulk density | kg/m^{3} | Silt | normal (LHS) | 1,650 | 5 |

*Recharge | m/d | Silt | uniform (LHS) | 0.00019 | 3.48 × 10^{−5} |

Hydraulic conductivity in x-direction | m/d | Sandy loam | lognormal | 9.97 | 0.28 |

Hydraulic conductivity in y-direction | m/d | Sandy loam | lognormal | 4.98 | 0.14 |

Hydraulic conductivity in z-direction | m/d | Sandy loam | lognormal | 1.00 | 0.03 |

Compressibility | md^{2}/kg | Sandy loam | uniform (LHS) | 7.35 × 10^{−18} | 3.49 × 10^{−18} |

Bulk density | kg/m^{3} | Sandy loam | normal (LHS) | 1,600 | 5 |

Hydraulic conductivity in x-direction | m/d | Sand | lognormal | 14.95 | 0.29 |

Hydraulic conductivity in y-direction | m/d | Sand | lognormal | 7.47 | 0.15 |

Hydraulic conductivity in z-direction | m/d | Sand | lognormal | 1.49 | 0.03 |

Compressibility | md^{2}/kg | Sand | uniform (LHS) | 7.35 × 10^{−18} | 3.49 × 10^{−18} |

Bulk density | kg/m^{3} | Sand | normal (LHS) | 1,550 | 5 |

Hydraulic conductivity in x-direction | m/d | Sandy clay | lognormal | 3.05 | 0.30 |

Hydraulic conductivity in y-direction | m/d | Sandy clay | lognormal | 1.53 | 0.15 |

Hydraulic conductivity in z-direction | m/d | Sandy clay | lognormal | 0.31 | 0.03 |

Compressibility | md^{2}/kg | Sandy clay | uniform (LHS) | 7.35 × 10^{−17} | 3.49 × 10^{−17} |

Bulk density | kg/m^{3} | Sandy clay | normal (LHS) | 1,700 | 5 |

*****Recharge is distributed uniformly over the first layer of the study area.

Aquifer recharge and compressibility realizations are generated from Latin Hypercube Sampling (LHS) (Pebesma & Heuvelink 1999) uniform distributions within the parameter space for specific lower and upper bounds. Realizations of bulk density are obtained from the LHS technique from a -dimensional multivariate normal distribution with specific mean and covariance. The values of hydraulic conductivity, recharge, compressibility, and bulk density are then shuffled randomly and combined to obtain multivariate random realizations of uncertain model parameters. One hundred realizations of each parameter are generated for each material layer. A total of 3000 uniformly distributed LHS of groundwater extraction values are generated from the variable space having a range of 0–1,300 m^{3}/day. For each randomized uncertain model parameter set, 30 sets of transient pumping values from the well locations are assigned. When the uncertain model parameters are combined with the transient groundwater extraction values, the resulting uncertainty level further increases. As a whole, the system resembles more or less a real world study area. Again, synthetic data from the illustrative site is chosen for evaluation, because that is necessary for the actual evaluation of a methodology in the absence of unknown measurement errors and unquantified uncertainties. The obtained 3000 randomized combined realizations of uncertain model parameters and transient groundwater extraction values are then used along with other initial and boundary conditions as inputs to the simulation model in order to obtain the corresponding salinity concentrations at specified OPs.

To demonstrate the variation of salinity concentration values within the ranges of chosen hydraulic parameters, numerical experiments are conducted using the simulation model. The effects of hydraulic conductivity on salinity concentrations are presented here as an example. The percentage difference in salinity concentration values using the lowest and highest ranges of these hydraulic conductivity values at five OPs are 4.87%, 13%, 15.93%, 9.53%, and 10.31%, respectively. For example, in observation point OP5, the lowest values of hydraulic conductivity at different soil layers produce a salinity concentration value of 4,791 mg/l. The highest hydraulic conductivity values produce 5,285 mg/l of salinity concentration. These results are obtained only by varying the hydraulic conductivity values while other uncertain parameters are kept constant. In addition to hydraulic conductivity, aquifer recharge, bulk density and compressibility are also considered uncertain. As mentioned, different realizations of these uncertain model parameters are randomly paired to obtain a multivariate realization of these uncertain parameters. This multivariate realization is further randomly paired with the random field of groundwater extraction in space and time. Therefore, the resulting uncertainty represents, more or less, a realistic coastal aquifer study area.

### Meta-models

GPR, MARS, and RT approaches and their ensembles are used to construct the meta-models. Brief descriptions of the meta-models are provided in the following sub-sections.

#### Gaussian process regression

*Y*and predictor variables, such that , where is a Gaussian noise with variance (Bishop 2006). GPR is characterized by mean and covariance functions: mean function describes the anticipated value of the function at any particular point within the variable space whereas the covariance function provides an indication of the proximity or resemblance between the response and the predictor values (Rasmussen & Williams 2005). The mean and covariance functions are defined by

*n*is the number of training samples.

The GPR based meta-models are developed using the commands and functions of MATLAB (MATLAB 2017a).

#### Multivariate adaptive regression spline

MARS is an adaptive and non-parametric regression approach (Friedman 1991) in which the total solution space is divided into different intervals of input variables to which individual Splines or Basis functions are fitted during formation of the MARS models (Bera *et al.* 2006). MARS uses a forward and a backward stepwise procedure for building models that are capable of predicting future responses through non-linear mapping of the predictors and response variables. In the forward pass, MARS builds a relatively complex model based on the number of Basis functions provided by the user. However, in the backward pass, MARS sparingly selects only those input variables, that are important in predicting the output variable. MARS utilizes this backward step to prevent model over-fitting, and to avoid formulation of the unnecessarily complex model (Salford-Systems 2016).

*i*and

*j*are the indices for Basis functions and input variables (groundwater extraction), respectively; Basis functions; input variables (groundwater extraction); = some threshold values chosen by the MARS model; = constant value, = corresponding coefficients of , ±indicates that the next entity may be added to or subtracted from the previous entity; and = predicted saltwater concentration value at a specified OP.

A commercial software package, Salford Predictive Modeller^{®} (SPM 2016) is used to build the MARS models.

#### RT

RTs are able to build flexible models that are less complex and easily interpretable. The theory behind the decision trees are based on the CART algorithm proposed by Breiman *et al.* (1984). The CART procedure builds models by following three major steps. The first step uses a binary split procedure to build a complex full tree having numerous terminal nodes. In the second step, the full tree is pruned to reduce model over-fitting. In the third step, the CART algorithm selects an optimal sub-tree so that the quality of prediction for new samples is ensured. RT methods of meta-modelling provide a predicted response by following the decisions in the tree from the root node (beginning node) to the leaf node where the responses are contained. The RT meta-models are developed in the MATLAB environment (MATLAB 2017a).

### Ensemble meta-models

Prediction uncertainty of meta-models can effectively be reduced by an ensemble of such meta-models, instead of using the single model. An ensemble approach is able to capture the nearly true trends of data by utilizing unique features of individual meta-models. Individual meta-models are built by varying different architectures of the same meta-model, using different types of meta-models, utilizing different algorithms for training, or using different realizations of training data (Sharkey 1999; Shu & Ouarda 2007). Different realizations of the training dataset using a suitable resampling technique (e.g. random sampling without replacement) often provide better results than other techniques (Shu & Burn 2004; Zaier *et al.* 2010). For developing an ensemble of GPR and MARS (written as EGPR and EMARS thereafter), different realizations of training datasets are generated using the random sampling without replacement technique (Hastie *et al.* 2008). On the other hand, the LSBoost (Hastie *et al.* 2008) algorithm is used to build RT ensembles (ERT). The optimum number of RTs within the ensemble is determined by observing the Root Mean Squared Error (RMSE) between the observed and predicted responses. In the LSBoost approach, every step of ensemble formation is characterized by fitting a new regression tree to the variance between actual response and combined prediction of all formerly grown regression trees.

Selecting the optimum number of individual meta-models within the ensemble is one of the most crucial parts of the ensemble based meta-modelling approach. The present study selects the right number of individual meta-models for building EGPR and EMARS by observing the RMSE between the observed and predicted responses after sequential addition of individual meta-models to the ensemble. For ERT formation, RMSE values between the observed and cumulative prediction of sequentially added RTs are used to determine the optimum number of RTs in the ensemble. Selection of optimum number of RTs is performed by automatic tuning of the hyperparameters using the LSBoost method and the Bayesian Optimization technique (MATLAB 2017a).

The sequential approach is adopted because this approach selects the ensemble (collection of the meta-models) resulting in the best performance in terms of predictions. The sequence is actually based on the performance, while all meta-models are included in determining the best sequence and the total number of meta-models in the sequence. It is difficult to say that this approach is the best. However, this selection process appears logical, though a different approach may result in a different set of predictions. The order is not very important as all the training patterns for the individual meta-models in the ensemble are obtained from the random sampling without replacement technique. In addition, to avoid bias in selecting individual meta-models, the original randomized order of the meta-models is not altered.

### Performance evaluation criteria

Mean Absolute Percentage Relative Error (MAPRE), Coefficient of Correlation (R), Nash-Sutcliffe Efficiency Coefficient (NS), Index of Agreement (IOA), RMSE and Threshold Statistics (TS) are used to evaluate the performance of GPR, MARS, EGPR, EMARS, and ERT based meta-models. In addition, the proposed management models' performance is evaluated by observing the constraint violation, and by confirming whether the constraints are satisfied at their upper limits.

*n*represents the number of data points.

### Management model

The management model is multi-objective in nature in which two conflicting objectives of groundwater extraction patterns are considered. Therefore, the management model provides several alternate feasible solutions of groundwater extraction patterns from the well field, showing a trade-off between the conflicting objectives represented by a Pareto optimal front. The conflicting objectives are: (i) maximize extraction of water from the production bores for beneficial purposes; and (ii) minimize water extraction from barrier extraction wells. Water extracted from barrier extraction wells cannot be used because of its high salinity content; therefore, the objective is to minimize water extraction from these wells. Barrier extraction wells are used to create a hydraulic barrier along the coast to allow hydraulic control of saltwater intrusion. The management model integrates an ensemble of meta-models and a CEMGA based optimization algorithm into a general framework of a linked S/O approach.

_{PW}and

_{BW}stand for production bores and barrier extraction wells, respectively;

*M*,

*N*, and

*T*stand for the entire pumping wells, barrier extraction wells, and time periods, respectively. The first objective of maximization of groundwater extraction from the pumping wells for beneficial use is represented by Equation (17), and the second objective of minimizing the water extraction from barrier pumping wells is given by Equation (18).

### Optimization algorithm: CEMGA

A population based optimization algorithm, CEMGA (Deb & Goel 2001) is utilized to solve the optimization routine of the linked S/O based saltwater intrusion management model. CEMGA includes individuals with a relatively low fitness value to the next generation in order to increase the diversity of the population. Therefore, the algorithm has a mechanism to control the elite members of the population to maintain the population diversity. As the algorithm progresses, this control of elite members makes the new population more diverse. This controlled elitist approach reduces the elitism effect by including a particular fraction of the dominated populations with the present best non-dominated populations. CEMGA has performed better in terms of providing better convergence to the global Pareto optimal solution for a number of complex test problems (Deb & Goel 2001) compared to its previous version of the non-CEMGA (Deb *et al.* 2000). The elite control mechanism of CEMGA uses a pre-defined geometric distribution of the number of individuals in each front to control the maximum number of individuals allowed in the *i ^{th}* front,

*n*, such that

_{i}*n*

_{i}*=*

*r*×

*n*where

_{i-1,}*r*represents the reduction rate, the value of which should be less than 1.

### Coupled simulation-optimization

This study proposes multiple objective optimization in the form of an externally linked S/O methodology together with an ensemble meta-modelling approach to develop a saltwater intrusion management model. Individual meta-models within the ensemble are separately linked to the optimization algorithm as constraints of the optimization routine. Other constraints of the optimization procedure are the maximum allowable saltwater concentrations at the specified OPs. These constraints are set based on the potential usability of the extracted water from the production bores in different regions of the study area. Therefore, the aim of developing this saltwater intrusion management model is to provide several alternate feasible solutions of optimal groundwater extraction values while restricting the salinity concentrations at specified OPs to predefined allowable limits.

### Parallel computing

The proposed ensemble meta-model based linked S/O methodology considers a large number of constraints in terms of individual meta-models requiring a relatively large computational time. Therefore, to achieve further computational efficiency, the multi-objective optimization formulation is executed in a parallel computation framework by distributing the objective functions and all constraints among a parallel pool of multiple workers. Parallel computing enables speed computations that help achieve an additional computational efficiency in the ensemble based linked S/O approach. The present study performs parallel computing utilizing the physical cores of a CPU [Intel (R) Core (TM) i7-4790 CPU@3.60 GHz] by using the parallel computing toolbox of MATLAB (MATLAB 2017b).

## APPLICATION OF THE PROPOSED METHODOLOGY

Performance evaluation of the proposed methodology is carried out in an illustrative multi-layered coastal aquifer system considering uncertainty in some of the model parameters. The reason for choosing a synthetic application is that any real life application needs extensive calibration and validation of the simulation models. The calibration process is also dependent on the accuracy of field measurements. Therefore, performance evaluation of a proposed methodology becomes complicated when actual field measurements are utilized. The evaluation process becomes dependent on the unquantifiable errors in measurement of parameters and other measurements, as well as on calibration inadequacies. However, it is necessary to incorporate realistic error scenarios or uncertainties in the modelling within the scope of synthetic modelling as reported in this study. Indeed, it needs to be emphasized that it is almost impossible to properly evaluate the accuracy and other performance criteria, with different scenarios of uncertainties and parameter values, when the performance is evaluated for a real life aquifer with a given set of parameter values and field measurements. It is for this reason that the performance needs to be first validated using synthetic data. Therefore, an illustrative aquifer and synthetic/simulated data are utilized for evaluation of the method, with quantifiable uncertainties. Nevertheless, the illustrative example model incorporates important features of a real coastal aquifer system.

Hydraulic conductivity, compressibility, and bulk density are considered homogeneous but uncertain within each vertical zone of material layers. However, a set of different realizations of these uncertain model parameters are used for each vertical material layer, making the unconfined aquifer anisotropic. The reason why an anisotropic medium is chosen is that most of the real world coastal aquifers are anisotropic in nature, where the hydraulic conductivity in the principal direction is the largest, compared to the hydraulic conductivity in an orthogonal direction even in the same plane. In addition, the hydraulic conductivity is different in a vertical direction compared to the horizontal directions. For hydraulic conductivity realizations, the ratio of horizontal hydraulic conductivities in the X- and Y-directions is taken as 2.0, i.e. . The value of vertical hydraulic conductivity in the Z-direction is specified as one tenth of the hydraulic conductivity values in the X-direction, i.e. . A set of realizations of aquifer recharge intended to be spread uniformly over the top layer of the aquifer is generated and randomly paired with the other uncertain model parameters. Finally, these uncertain model parameters are combined with the transient groundwater extraction values obtained from a set of production bores and barrier extraction wells. The multi-layered aquifer system is similar to the one developed in Roy & Datta (2017a), and is illustrated in Figure 1.

The illustrative study area has an aerial extent of 4.35 km^{2}. The total thickness of the unconfined aquifer is 80 m, dividing into four different layers with different aquifer materials. The seaside boundary has an initial head of 0 m and a constant concentration of 35,000 mg/l. The upstream end of the river is assigned with a specified head of 1 m that varies linearly along the stream until it reaches 0 m at the seaside boundary. The study considers 11 production bores and 5 barrier extraction wells having a fairly well spread extraction well field with a well density of 3.68 wells/km^{2}. The production bores, denoted by PW1–PW11 in Figure 1, are intended to extract water for beneficial purposes. On the other hand, to control saltwater intrusion by creating a hydraulic barrier along the coast, the study also considers five barrier extraction wells denoted by BW1–BW5. Water is extracted from the second and third layer of the aquifer. The transient simulation for a time horizon of 5 years is divided into 365 uniform time intervals of 5 days duration, which is the length of the time step for the numerical simulation. The total management period of 5 years is divided into five uniform management periods of 1 year each. Water extracted from both the production bores and barrier extraction wells during this 1-year management period is assumed constant. Saltwater concentrations at the end of the management period are monitored at five OPs located at three different salinity zones: OP1 is located in the low salinity zone, OP2 and OP3 are located in the moderate salinity zone, and OP4 and OP5 are located in high salinity zones. OPs are placed at different salinity zones with a view to using the extracted water from different regions of the aquifer for different purposes.

The proposed ensemble meta-model based linked S/O methodology considers 80 input pumping variables [16 wells (11 production bores + 5 barrier extraction wells) × 5 years] of groundwater extraction in space and time. These variables are designated by *X1*–*X80*. Variables *X1*–*X55* represent groundwater extraction for the management period of 5 years from the production bores, PW1–PW11. Water extracted from barrier wells BW1–BW5 are indicated by *X56*–*X80*.

## PERFORMANCE EVALUATION

Performance of the GPR, MARS, and RT based meta-models and their ensembles (EGPR, EMARS, and ERT) to approximate density dependent coupled flow and salt transport processes in a multi-layered coastal aquifer system are evaluated. The developed meta-models and their ensembles trained using the solution results of the simulation model are used to predict saltwater concentrations at specified OPs. Once proper training and validation are completed, the meta-models are presented with a new realization of test dataset to ensure a fair comparison among the meta-models. The best meta-model thus obtained from the comparison results is then used to develop the saltwater intrusion management model. The performance of the meta-model based management model is evaluated by using the optimized pumping value obtained from the management model to run the original simulation model. The percentage relative error values between the meta-model predicted saltwater concentration and simulation model's result are observed to validate the optimal solution obtained from the proposed management model. Solution results are presented in the following sub-sections.

### Comparison of the performance of GPR, MARS, EGPR, EMARS, and ERT

#### GPR and EGPR

The performance of the best GPR in the ensemble and EGPR in terms of prediction accuracy is evaluated. The optimal model structures of the GPR models are obtained through automatic tuning of the hyperparameters for three different kernel functions, namely squaredexponential, matern32, and matern52. From these trials, matern52 kernel function is selected based on the RMSE values between the training and testing datasets. Bayesian optimization is used to tune the hyperparameters for which the GPR based meta-models are formulated. Then EGPR models are developed by sequential addition of individual GPR models. EGPR models are developed for five OPs. The RMSE criterion is used to determine the optimum number of individual meta-models within the ensemble. Based on this criterion, the optimum number of GPR models for ensemble formation at five OPs are 12, 13, 20, 20, and 12, respectively, for OP1, OP2, OP3, OP4, and OP5. Once the EGPR models are obtained, both the EGPR models and the best individual GPR models in the ensemble for all OPs are presented with a completely new realization of test dataset.

In this evaluation, the actual saltwater concentration values are represented by synthetically generated saltwater concentrations obtained using the simulation model. Actual and predicted saltwater concentration values with a 95% prediction interval at five OPs are presented in Figure 2. For brevity of presentation, 20 observations from the middle part of the test dataset are presented. It is observed from Figure 2 that both actual, GPR predicted, and EGPR predicted saltwater concentration values are within the 95% confidence interval identified by GPR and EGPR models. It is also noted that both the best GPR and EGPR predictions are very close to the actual saltwater concentration values.

#### MARS and EMARS

A sufficient number of individual MARS models (50 in this study) are developed using different realizations of the input-output training patterns. During the forward pass, sufficiently large and complex MARS models are developed by using 200 Basis functions, and allowing 100 forward passes. No penalty is given to any of the variables, i.e. equal priority is given to all 80 pumping variables during the forward pass of the model development. However, during the backward pass, relatively simple but accurate MARS models are developed based on the most important input variables for predicting saltwater concentrations at specified OPs. EMARS models are then developed at each OP by sequentially adding individual MARS models, and by observing the resulting RMSE values between the actual and the predicted saltwater concentration values. Based on RMSE criterion, the optimum number of individual MARS models to build EMARS models at OP1, OP2, OP3, OP4, and OP5 are 32, 5, 27, 6, and 14, respectively. Actual and predicted values as well as errors of prediction for both the MARS and EMARS based meta-models at OP1 are illustrated in Figure 3.

It is observed from Figure 3 that both models produce a comparable result, which is very close to the actual saltwater concentration values. Figure 3(b) also depicts that the errors of prediction for both the models are very close. Therefore, in addition to reducing the prediction uncertainty, the EMARS models performed equally well when compared to the best individual MARS models within the ensemble. A similar trend of results is obtained for other OPs.

#### ERT

RT based ERT meta-models are built by automatic tuning of the model parameters, e.g. number of regression trees, learning rate, and maximum number of splits through LSBoost, by utilizing Bayesian optimization. The optimum number of regression trees to build ERT for observation points OP1, OP2, OP3, OP4, and OP5 are 498, 476, 499, 469, and 500, respectively. Once the ERT models are developed, they are then presented with a completely new set of test data. The prediction capability of the ERT models on the new test data at OP1 is shown in Figure 4. It is noted from Figure 4 that ERT predicted saltwater concentration values are very similar to the actual values, and that the errors of prediction are within the acceptable limits. A similar trend is observed at other OPs.

Although the developed meta-models and their ensembles provide a reasonably accurate prediction of saltwater concentration values at all OPs, a comparative evaluation of meta-model performances is carried out to find the best meta-model for utilization in the proposed linked S/O based saltwater intrusion management model. For consistency and a fair comparison, all the meta-models and their ensembles are tested on the same realization of the input-output patterns of the test dataset. This comparison is performed by using a number of statistical indices calculated based on the prediction capability of the proposed meta-models and their ensembles on the test data. Comparison results are illustrated in Figures 5 and 6.

Figure 5 illustrates the values of MAPRE, R, NS, IOA, and RMSE values calculated at OPs using the predicted and actual saltwater concentration values. It is observed from Figure 5 that all the models produce acceptable prediction results at all OPs: all models produce higher values of R, NS, and IOA as well as lower values of MAPRE and RMSE. However, a closer look at the comparison results reveals that GPR and EGPR based meta-models outperform the results obtained from both MARS, EMRS, and ERT based meta-models. It is also noted that the prediction capabilities of EGPR models are almost similar to the best GPR model within the ensemble at all OPs.

In most cases, the selection of the best standalone meta-model is difficult. A greater number of meta-models needs to be evaluated to decide on the best performing meta-model. On the other hand, an ensemble provides better solutions than most of the meta-models considered for the ensemble. An ensemble is formed by utilizing each individual meta-model's prediction capabilities from a diverse range of datasets. Therefore, it is more likely that an ensemble will perform better than most standalone meta-models. Indeed, this is one reason for utilizing the ensembles.

It is also worth mentioning that proposed meta-models produce comparatively lower values of R and NS at observation points OP2, OP4, and OP5. For instance, NS values produced by the best GPR model at OP2, OP4, and OP5 are 0.53, 0.65, and 0.64, respectively, whereas the NS values at those OPs provided by EGPR model are 0.78, 0.61, and 0.60, respectively. Therefore, the IOA (Willmott 1984) criterion is proposed to ascertain the proposed meta-model's prediction capability as well as to overcome the insensitivity of NS and R to differences in the simulated and predicted means and variances (Legates & McCabe 1999). All the meta-models have IOA values greater than 80% indicating the good prediction capability of the proposed meta-models.

Figure 6 illustrates boxplots of absolute errors between the simulated and predicted saltwater concentration values at different OPs obtained from GPR, MARS, EGPR, EMARS, and ERT based meta-models. In Figure 6, horizontal lines within the box indicate the median of the absolute errors whereas the small black circles represent the mean of absolute errors. It is also demonstrated from this figure that the best GPR and EGPR based meta-models are superior in terms of prediction accuracy.

Therefore, EGPR models incorporating prediction uncertainty are selected for developing the saltwater intrusion management model.

### Performance of the management model using EGPR based meta-models

The saltwater intrusion management model is developed by integrating EGPR models with a population based multi-objective optimization algorithm, CEMGA. The proposed management model provides the optimal solution of groundwater extraction in the form of a Pareto optimal front that shows the trade-off between the two conflicting objectives of groundwater extraction. EGPR consists of 77 individual GPR models, each of which is separately linked to the CEMGA. To reduce additional computational burden, compact versions of these 77 individual GPR models are linked to the optimization algorithm. The compact GPR models retain essential properties of standard GPR models. Finally, the optimization routine is run in a parallel computing platform by distributing the objective functions and the constraints among four physical cores of a PC. The EGPR-CEMGA model evaluates 1,978,801 functions (1,648 generations × 1,200 populations) before deciding on the global Pareto optimal solution. The optimum combination of the parameters of CEMGA is selected by carrying out a set of numerical experiments by varying the parameters. Based on the numerical trials, the CEMGA uses a population size of 1,200, crossover rate of 0.9, and Pareto front population fraction of 0.7. The function and constraint tolerances are set as 1 × 10^{−5} and 1 × 10^{−3}, respectively. The optimal groundwater extraction strategy in the form of a Pareto optimal front is presented in Figure 7. The Pareto front provides 840 non-dominated solutions from which the managers can choose the right combination of production and barrier well pumping. These solutions are based on limiting the salinity concentrations at specified OPs to the pre-defined maximum allowable limits.

### Verification of the management model

The performance of the developed saltwater intrusion management model is verified by comparing the solution results obtained from the optimization routine with those obtained from the numerical simulation model. To do this, 10 solutions of optimal groundwater extraction values are randomly selected from different regions of the Pareto optimal front. These solutions are used as inputs to the simulation model, developed by using the average values of uncertain model parameters to obtain the corresponding saltwater concentration values at each OP.

Table 2 represents the percentage absolute relative errors (PARE) between the EGPR predicted saltwater concentration values and simulation model's output. It is observed from Table 2 that PARE values are less than 3% for all estimates at all OPs. Therefore, the proposed saltwater intrusion management model can be applied to obtain optimal groundwater extraction values in a multi-layered coastal aquifer system under prediction and parameter uncertainties.

Obs. | Relative error, % | ||||
---|---|---|---|---|---|

OP1 | OP2 | OP3 | OP4 | OP5 | |

1 | 0.20 | 2.44 | 2.06 | 0.58 | 0.66 |

2 | 0.16 | 2.56 | 2.19 | 0.75 | 0.71 |

3 | 0.21 | 2.45 | 2.01 | 0.67 | 0.68 |

4 | 0.21 | 2.39 | 1.90 | 0.88 | 0.69 |

5 | 0.11 | 2.55 | 2.34 | 0.46 | 0.65 |

6 | 0.13 | 2.50 | 2.24 | 0.53 | 0.67 |

7 | 0.04 | 2.65 | 2.54 | 0.50 | 0.71 |

8 | 0.18 | 2.46 | 2.03 | 0.52 | 0.65 |

9 | 0.13 | 2.53 | 2.35 | 0.58 | 0.67 |

10 | 0.12 | 2.51 | 2.30 | 0.53 | 0.67 |

Obs. | Relative error, % | ||||
---|---|---|---|---|---|

OP1 | OP2 | OP3 | OP4 | OP5 | |

1 | 0.20 | 2.44 | 2.06 | 0.58 | 0.66 |

2 | 0.16 | 2.56 | 2.19 | 0.75 | 0.71 |

3 | 0.21 | 2.45 | 2.01 | 0.67 | 0.68 |

4 | 0.21 | 2.39 | 1.90 | 0.88 | 0.69 |

5 | 0.11 | 2.55 | 2.34 | 0.46 | 0.65 |

6 | 0.13 | 2.50 | 2.24 | 0.53 | 0.67 |

7 | 0.04 | 2.65 | 2.54 | 0.50 | 0.71 |

8 | 0.18 | 2.46 | 2.03 | 0.52 | 0.65 |

9 | 0.13 | 2.53 | 2.35 | 0.58 | 0.67 |

10 | 0.12 | 2.51 | 2.30 | 0.53 | 0.67 |

## CONCLUSIONS AND RECOMMENDATIONS

This study proposes a saltwater intrusion management model based on a linked S/O approach that integrates ensemble of meta-models with an optimization algorithm. Prediction uncertainty of the meta-modelling approach is incorporated by proposing an ensemble of such meta-models. The best meta-models and their ensembles are selected by comparing the prediction capabilities of GPR, MARS, RT and their ensembles (EGPR, EMARS, and ERT). Results indicate that GPR based meta-models and their ensemble, EGPR, provide the best prediction capabilities compared to MARS, RT based meta-models and their ensembles. In addition, EGPR models provide better results than most of the single GPR models within the ensemble, and EGPR models provide solution results very close to the best GPR models in the ensemble. The study develops and evaluates a more accurate and reliable meta-model compared to the existing meta-models, while also combining the ensemble approach to approximate coupled flow and solute transport processes of a coastal aquifer. The present study demonstrates the suitability of the GPR meta-model, which is also a new application in multi-objective setting. In addition, this approach is shown to facilitate a more accurate prediction of the aquifer responses, which is an important issue in developing vector (multiple) optimization based management strategies.

Therefore, EGPR models are linked to the optimization algorithm to develop the optimal groundwater extraction strategies for the coastal aquifer management problem. Results suggest that the EGPR models are suitable for incorporating into a management model as an approximate simulator of the physical processes of a multi-layered coastal aquifer system under parameter and prediction uncertainty. The present study also aims to achieve an additional computational efficiency by running the optimization formulation utilizing parallel computing facilities. Therefore, the proposed management model addresses parameter uncertainty in numerical modelling, prediction uncertainty in meta-modelling through the ensemble approach, and computational feasibility by using parallel computing facilities. Performance evaluation results reveal the potential applicability of the developed saltwater intrusion management model in providing optimal groundwater extraction values to control saltwater intrusion in multi-layered coastal aquifers under parameter uncertainty.

The present study considers a multi-layered coastal aquifer system in which four vertical material layers vary in randomized realizations of uncertain model parameters. However, the aquifer materials within each layer are considered homogeneous but uncertain. Future research may be directed towards applicability of the saltwater intrusion management model for a heterogeneous coastal aquifer system.

## REFERENCES

^{®}(Version 8.2), Salford Predictive Modeller