## Abstract

Hydrological simulation has long been a challenge because of the computationally intensive and expensive nature of complex hydrological models. In this paper, a surrogate modelling (SM) framework is presented based on the Hadoop cloud for approximating complex hydrological models. The substantial model runs required by the design of the experiment (DOE) of SM were solved using the Hadoop cloud. Polynomial chaos expansion (PCE) was fitted and verified using the high-fidelity model DOE and was then used as a case study to investigate the approximation capability in a Soil and Water Assessment Tool (SWAT) surrogate model with regard to the accuracy, fidelity, and efficiency. In experiments, the Hadoop cloud reduced the computation time by approximately 86% when used in a global sensitivity analysis. PCE achieved results equivalent to those of the standard Monte Carlo approach, with a flow variance coefficient of determination of 0.92. Moreover, PCE proved to be as reliable as the Monte Carlo approach but significantly more efficient. The proposed framework greatly decreases the computational costs through cloud computing and surrogate modelling, making it ideal for complex hydrological model simulation and optimization.

## HIGHLIGHTS

Our surrogate modelling framework reduces the computational cost of simulations.

The design of the experiment was parallelized on a Hadoop cloud.

PCE was fitted and verified using a high-fidelity model.

The approximation ability of PCE in the SWAT surrogate model was investigated.

## INTRODUCTION

Environmental models are increasingly used to evaluate the effectiveness of alternative designs, operational management, and policy options (Black *et al.* 2014; Maier *et al.* 2016). This is largely attributable to the fact that they can emulate the dynamics of real-world systems in both conventional management scenarios and alternative virtual realities (Maier *et al.* 2016; Maier *et al.* 2019). Hydrological models, for instance, have been widely used to aid in the comprehension of natural processes and the investigation of the effects of anthropogenic activity on watershed systems (Fernandez-Palomino *et al.* 2021). However, modern simulation models are often computationally demanding, as they strive to rigorously capture comprehensive knowledge about real-world systems. As a result, the number of choices that can be assessed manually is often restricted, making it difficult to decide which alternatives to explore during the decision-making process (Mugunthan *et al.* 2005; Keating *et al.* 2010; Maier *et al.* 2019). To address this challenge, optimization algorithm (OA) techniques are commonly used to locate the best solution to minimize the number of models evaluated. In addition, high-performance computing (HPC) techniques such as cloud computing are also extensively used to ‘divide’ the modelling procedures into several small subtasks and ‘conquer’ – run these subtasks concurrently, to reduce the overall computation time. As a result, simulation optimization frameworks that couple model simulations with OAs and HPC techniques have become increasingly prevalent.

Through HPC, a whole computational task can be separated into several small separate subtasks that can be processed in parallel. HPC commonly uses two primary parallel computing architectures – standalone machines and clusters or gridded networks – that are characterized in terms of the degree of parallelism allowed by the hardware (Zhang *et al.* 2016). The former assigns tasks to diverse processors on a standalone machine, whereas the latter assigns tasks within a cluster or across a gridded network. The major drawback of the former parallelization strategy is it is not as scalable, while the latter strategy has a risk of network bandwidths becoming bottlenecks (White 2012). Moreover, the time and money required to obtain such resources may impede the use of contemporary parallel computing paradigms because they often derive from conventional supercomputing resources. Recently, cloud computing has shown promising capability for parallelizing model simulations and managing massive post-simulation results (Zhang *et al.* 2016). In particular, Hadoop – an open-source software framework extensively used for building cloud computing environments – has recently been the focus of extensive testing to address computational complexity (Hu *et al.* 2015). Ideally, even though HPC can perform large-scale simulations, given the complexity of the model, we still utilize OA to further reduce the number of model runs needed.

With OAs, the goal is to find the best solution given a limited amount of time and computing budget. Many OAs have been proposed in the literature (Tayfur 2017). Typical algorithms include the earlier shuffled complex evolution (SCE-UA) algorithm developed at the University of Arizona, the dynamically dimensioned search (DDS) algorithm designed for calibration problems with many parameters, and the Markov chain Monte Carlo (MCMC) sampler-based differential evolution adaptive metropolis (DREAM) method. In acknowledgement of the inherent multi-objective nature of complex models, various multi-objective algorithms (MOEAs) have also been developed. Notable algorithms include the non-dominated sorting genetic algorithm II (NSGA-II) (Ercan & Goodall 2016), Borg multi-objective evolutionary algorithm (Borg-MOEA) (Hadka & Reed 2013, 2015), and Pareto archived dynamically dimensioned search (PA-DDS) (Asadzadeh & Tolson 2013). Although most OAs strive to drastically reduce computing costs, in some circumstances, they may inversely increase them. In reality, rather than the optimization procedure, it is the model computation that dominates the execution time required by the majority of simulation practices. Recognizing that neither HPC nor OAs may appropriately address the computational burden of simulation optimization, this raises the question of whether it is possible to significantly lower the computational cost of the original complex model while maintaining comparable accuracy to approximate the original models. Fortunately, surrogate modelling (SM) is one approach that can achieve this.

SM, which operates at a higher degree of abstraction than the original simulation, is concerned with creating a ‘surrogate’ of the raw simulation model that is less expensive to execute (Razavi *et al.* 2012a). Surrogate models have been designed to be used intelligently to substitute simulation models. Response surface modelling and lower-fidelity modelling are the two main categories that fall under SM. The former is experimentally approximated via data-driven function approximation. The latter, in contrast to the original simulation models that are often regarded as high-fidelity models, uses physically based simulation models that are less detailed. The reduced simulation models known as low-fidelity models maintain most of the processes that were represented in the original simulation model. In practice, response surface modelling is more broadly applicable than lower-fidelity modelling since it requires no modifications to the original model, has fewer parameters and is straightforward to use (Razavi *et al.* 2012b). In response to surface modelling, there are three main categories of research: (1) the creation of experimental designs (design of the experiment (DOE)) for efficient approximation, (2) the creation and use of function approximation techniques as surrogates, and (3) the creation of frameworks using surrogates. To date, the second and third groups have received most of the attention in environmental research. However, the first category of DOE-related research has rarely been reported, which is mostly because of its notoriously high computational burden, limiting the potential of this SM tool in real-life applications.

There are numerous SM techniques that have been widely used to surrogate the original computationally intensive models, such as polynomials (Blatman & Sudret 2008; Fen *et al.* 2009), ANN (Papadrakakis *et al.* 1998; Behzadian *et al.* 2009), RBF (Hussain *et al.* 2002; Regis & Shoemaker 2007), Kriging (Sacks *et al.* 1989; Sakata *et al.* 2003), and MARS (Friedman 1991; Jin *et al.* 2001; Chen *et al.* 2018). Recently, a novel data-driven SM methodology called polynomial chaos expansion (PCE) has been proposed. PCE was first proposed by Wiener (Wiener 1938), and the method is now widely used in a variety of fields, ranging from transportation (Stavropoulou & Müller 2015) to chemistry (Paffrath & Wever 2007; Villegas *et al.* 2012), aerodynamics (WU *et al.* 2018), groundwater flow and contaminant transport (Laloy *et al.* 2013; Deman *et al.* 2016), and surface water modelling (Fan *et al.* 2015; Wang *et al.* 2015). Unlike other SM techniques, PCE provides an effective solution to characterize nonlinear effects in stochastic analysis by capturing the model output's dependency on uncertain input parameters using a collection of high-dimensional orthogonal polynomials (Wang *et al.* 2015). According to previous research, PCE has the potential to be a useful approach when applied to simple conceptual hydrological models, such as HYdrological MODel (HYMOD), to save time and computational resources (Fan *et al.* 2015). A more recent study investigated the capabilities of PCE in generating probabilistic flow predictions and quantifying the uncertainty of Soil and Water Assessment Tool (SWAT) parameters to further establish its applicability and reliability for approximating more complex hydrological models (Ghaith & Li 2020). These findings suggest that PCE is a potentially effective technique that reduces the time and computational effort and that its advantages are more significant when applied to more complex models. As mentioned above, substantial research has been conducted on the creation and use of function approximation techniques as surrogates, as well as on the creation of application frameworks employing surrogates. However, to our knowledge, no studies have fully addressed the DOE implementation process, which is a fundamental component of SM. Due to a lack of DOE research, the adoption of SM approaches capable of supporting computationally intensive investigations such as sensitivity analysis (SA), calibration, and optimization has been hampered.

The primary goal of this research is to provide a framework to solve the DOE of SM implementation using cloud computing techniques. Another objective is to illustrate how SM may be exploited to further counterbalance the very large computational burden of sophisticated hydrological model simulations. We conducted this investigation utilizing the PCE approximation of the SWAT model as an example since both the SWAT model and PCE are widely utilized. The main contributions of our work are to illustrate how cloud computing methodologies can facilitate the surrogate modeling process, and how the trained surrogate modeling reduces the computational cost of complex hydrological model simulation while keeping their high accuracy, with implications for future projects that aim to reduce computation time in the hydrological simulation.

The paper is organized as follows. In Section 2, the Hadoop cloud-based SM framework is introduced for approximating the SWAT model. The major components of the framework, including the physical SWAT model setup, DOE, Hadoop cloud, and construction of PCE, are described. In Section 3, the presented framework is applied to the SWAT model approximation in a case study, and the accuracy, fidelity, and efficiency of the method are evaluated. In Sections 4 and 5, the approximation findings are presented and discussed. Finally, some conclusions are presented in Section 6.

## HADOOP CLOUD-BASED SM FRAMEWORK FOR APPROXIMATING COMPLEX MODELS

The primary purpose of using a Hadoop cloud framework to surrogate complex models is to execute DOE – by far the most computationally intensive part of SM – into the Hadoop cloud. In this sense, the DOE's extensive evaluations of the original complex model are wrapped as a MapReduce job and then concurrently executed on the Hadoop cluster, allowing flexible integration of the model into cloud computing and guaranteeing that all the model evaluations are successfully processed. Another purpose is to elaborate on how to build a practical SM application by taking advantage of the appropriate toolkits. The entire framework may be logically separated into the following two sections based on the programming languages used. The Python web framework is used to host our main SM approach – a PCE library and an additional core SA library. The Java web framework is used to handle massive model calculations with the help of a Hadoop cloud. It is within the Hadoop cloud that the DOE is decomposed into substantial SWAT model evaluations that can run simultaneously to reduce the overall time needed. Although previous research has confirmed the feasibility of wrapping complex models into the MapReduce framework (Hu *et al.* 2015; Zhang *et al.* 2016; Ma *et al.* 2022a), to our knowledge, it is rarely used in solving the DOE problem of SM.

The proposed framework is primarily developed in Python in a Linux environment by combining the PCE library (Chaospy) and SA library (SALib). Python is used to construct Chaospy, a numerical toolkit for performing uncertainty quantification using non-intrusive PCE and sophisticated Monte Carlo (MC) techniques (Feinberg & Langtangen 2015). This toolkit is designed to help scientists develop customized statistical approaches by fusing many fundamental and sophisticated building blocks. Various techniques, including Sobol, FAST, the delta moment-independent measure, and the derivative-based global sensitivity measure (DGSM), are all implemented in SALib (Herman & Usher 2017).

- (i)
Hydrological model setup

Initially, a high-fidelity model must be created as the target to be surrogated. Here, the setup of the SWAT model is illustrated as an example. The workflow starts with the compilation and collection of the input data, including meteorology, topography, land use, and soil (Franco *et al.* 2020). Then, the SWAT model was constructed through a professional graphical user interface (GUI) toolkit, either using ArcSWAT or Quantum Geographic Information System (QGIS), because both have a user-friendly interface. Subsequently, a SWAT project that comprises many configuration files can be generated with the help of the above modelling tool. These files must be executed directly by the SWAT model (Fortran-based SWAT executable program compiled for the Linux operating system). The subsequent DOEs are generated by updating these configuration files before running the SWAT executable program.

(ii) DOE

A typical SM framework begins with a DOE, which entails efficiently sampling the feasible parameter space and assessing their associated objective function values using the original model (Razavi *et al.* 2012b). Therefore, a well-distributed DOE program to generate the surrogate is critical to the success of SM practice. Because a priori knowledge of the parameter distribution is generally unavailable, either a uniform or gamma distribution is often suggested instead. Such assumptions of distribution are mainly derived from default and literature parameter values.

The DOE is made up of the following tasks. First, the model parameters are initially chosen based on prior experience, and their ranges are carefully determined based on available data. Second, certain sampling processes, such as the Latin hypercube sampling method (LHS) (Mckay *et al.* 2000), are utilized to sample the parameters properly. Namely, each sample is taken from a multi-dimensional space to represent as many potential combinations of the model parameters as possible. LHS is a stratification technique that forces random samples to be distributed more widely than regular random samples. It is similar to other low discrepancy sequences such as halton, hammersley korobov or sobol, but it keeps random samples at its core. The entire sampling process is conducted within the Python web framework, and the sampling results are sent to the Java web framework. Finally, the Java web framework creates a MapReduce task from the parsed samples and submits it to the Hadoop cloud for subsequent model evaluations.

(iii) Hadoop cloud

When the Hadoop cloud receives the MapReduce job, it begins to concurrently run the SWAT models. More specifically, the SWAT project configuration files located at the local computing node are executed by the SWAT executable program. Once the model execution finishes, the parsed simulation results are first stored in the HBase database for persistent storage purposes and then sent to the Python web framework. The Python web framework is responsible for receiving the parsed results sequentially until the final DOE is generated. Typically, the DOE is expressed as a tabular form of input parameter-output variable pairs. For more information, a detailed description of the coupling of the complex model and the Hadoop cloud can be found in prior publications (Zhang *et al.* 2016; Ma *et al.* 2022a).

(iv) Construction of the PCE

Fitting surrogates for the response surfaces of computationally intensive models over a set of previously evaluated samples will provide an approximate representation of the response surfaces. Numerous approximation strategies have been developed in the past and used as surrogates (Razavi *et al.* 2012b). In this study, the PCE was chosen because it is widely used. The popularity of PCE stems from one major feature: it has the simplest type of coefficients within a polynomial, and these coefficients can be flexibly extracted (Ghaith & Li 2020) or even estimated using the least squares regression approach. Second-order polynomial functions are the polynomials that are most frequently utilized as response surface surrogates. These functions contain (*D* + 1)(*D* + 2)/2 parameters, where *D* is the number of explanatory parameters. Once the DOE is generated, these data will be regarded as training data and used as input for the following polynomials. First, an expansion of orthogonal polynomials needs to be selected. Although not mandatory, first-order and second-order orthogonal polynomials are generally preferred over other types of polynomials due to their simplicity and stability (Feinberg & Langtangen 2015). In contrast, higher-order polynomials (third-order or more), which are frequently used in curve fitting, are sometimes impractical when D is large; as a result, they are rarely used (Razavi *et al.* 2012b). Second, coefficients of polynomials are estimated using either quadrature integration or the linear regression method.

- (v)
Evaluation of the PCE

Performing model analysis on an approximation, as a surrogate for the real model, is the final goal of SM. Following the construction of the PCE, the original model is completely replaced by the PCE to perform the analysis of interest. In this step, the approximation capability of the PCE in surrogating the SWAT model is assessed against extensive MC sampling and evaluation of the original SWAT model with regards to the accuracy, fidelity, and efficiency. Here, the difference (in the mean and variance) between the PCE and MC-simulated values of the variable of interest serves as a measure of accuracy. Fidelity is measured by the difference between the PCE-SA and the SA of the actual model. Efficiency is measured by the difference between the PCE execution time and the MC execution time.

### Polynomial chaos expansion

*et al.*2007). By constructing a mapping between an unknown random variable and other (known) random variables, the PCE can be used to determine the probability distribution of the unknown random variable. Therefore, the PCE approach is frequently used to quantify how uncertainty propagates in a dynamic system with random inputs. Wiener (1938) first proposed the PCE approach by decomposing the model stochastic process into Hermite polynomials and Gaussian random variables. However, for non-Gaussian random input variables (e.g., variables from uniform or gamma distributions), the convergence of Hermite polynomial expansion may not be optimal. Consequently, Xiu

*et al.*(2002) proposed generalized PCE (GPCE) to address the above convergence problem for non-Gaussian distributions. The polynomials can be chosen from the hypergeometric polynomials of the Askey scheme based on the type of random input (Shi

*et al.*2009). The general PCE can be written in the form below:where

*y*is the output, and is the polynomial chaos of order p. For standard normal variables, the Hermite polynomial is used and expressed aswhere are the standard normal random variables.

In comparison to two other methods, such as Gram-Schmidt and Gill-Cholesky, the discretized Stieltjes procedure is the most widely used method for creating orthogonal polynomials. It is also thought to be the only method for building orthogonal polynomials that is truly numerically stable (Feinberg & Langtangen 2015). The discretized Stieltjes procedure is based on one-dimensional recursion coefficients, which can be easily estimated using numerical integration. Unfortunately, regarding the multivariate case, this approach can be used only if the variables are stochastically independent. Thus, the GPCE approach is used instead when the variables are stochastically dependent.

### Sensitivity analysis library

To identify the parameters that have the greatest effect on the model performance, it is necessary to screen out sensitive parameters and statistically examine the effects of each parameter on the model performance. Many prior studies (van Griensven & Meixner 2006; Borgonovo *et al.* 2012) have employed SA for this purpose. As a result, if any parameters that are not influential can be identified and maintained at reasonable values within their ranges, the computational cost can be reduced without sacrificing model performance. In this research, SALib was used to perform SA for the SWAT model, which aids in estimating the effect of the model inputs or exogenous influences on the desired outputs in simulations (Herman & Usher 2017). SALib includes several global SA approaches that are simple to integrate into regular modelling processes, facilitating the creation of samples from model inputs. Moreover, it includes utilities for analyzing and visualizing model outputs.

## CASE STUDY

### SWAT model setup

To simplify the research procedure, the Meichuan River Basin, which has been used in previous research (Ma *et al.* 2022b), was chosen as the study area due to the availability of data. The same eight parameters (CANMX, CN2, CH_N2, CH_K2, ALPHA_BNK, SOL_AWC, SOL_K, and SOL_BD) were chosen as target variables, and they have the same upper and lower bounds as they did in previous research (Ma *et al.* 2022b).

### Hadoop cloud setup

Information about the software and hardware of the Hadoop cloud can be found in Table 1 of previous research (Ma *et al.* 2022a). The basic sequential framework (also known as an offline framework) of SM was employed in this study, which is the most basic SM-enabled analytic framework and requires that all DOEs be available in the first step. Therefore, all DOEs were grouped into one single MapReduce job consisting of multiple SWAT model evaluation tasks. Such a design is consistent with the design philosophy of MapReduce and has several advantages in offline computing on the Hadoop cloud.

SA method . | SA measurements . | Sampling technique . | Sample size . |
---|---|---|---|

Morris | Mean elementary effect | Morris | 9,216 |

Sobol | Sobol's first and total indices | Satellite | 18,432 |

Delta | Delta first-order index | Latin hypercube | 1,024 |

FAST | FAST first-order index (FAST) | Fast_sampler | 8,192 |

RBD-FAST | RBD-FAST first-order index | Latin hypercube | 1,024 |

DGSM | DGSM Indice | Finite_diff | 9,216 |

SA method . | SA measurements . | Sampling technique . | Sample size . |
---|---|---|---|

Morris | Mean elementary effect | Morris | 9,216 |

Sobol | Sobol's first and total indices | Satellite | 18,432 |

Delta | Delta first-order index | Latin hypercube | 1,024 |

FAST | FAST first-order index (FAST) | Fast_sampler | 8,192 |

RBD-FAST | RBD-FAST first-order index | Latin hypercube | 1,024 |

DGSM | DGSM Indice | Finite_diff | 9,216 |

### Construction of the PCE for the surrogate SWAT model

As previously stated, our offline framework begins (Step 1) with the DOE, which draws a specific number of samples over the feasible parameter space and evaluates their related objective function values using the original SWAT model. These parameter-target value pairs constitute the training dataset. In Step 2, an SM model is globally fitted to the training dataset. In Step 3, the SM model completely replaces the original model in performing the analysis of interest. For example, a specific search algorithm, such as Bayesian optimization, is typically combined with the SM model for calibration purposes. The result obtained from the SM model is commonly deemed to be the result of the same analysis as the original model; for example, the optimal point found in the SM model is typically first examined by the original function and then judged to be the optimal solution to the original function (Razavi *et al.* 2012b).

Of the three steps, Step 1 is the most computationally expensive since it consumes almost the entire computational budget allocated to solving the problem (Razavi *et al.* 2012b). First, each of the eight parameters of the SWAT model must be assigned a probability density, and they are assumed to be stochastically independent. The choice of the uniform distribution is motivated by the fact that very little information was available from the literature or previous studies. Therefore, it is generally assumed that each of the eight parameters is uniformly distributed, and then a joint distribution derived from the eight uniform distributions is generated using the Chaospy library. Second, the sample points with eight vectors in a probability space can be drawn by using either a quasi-MC scheme, such as a Sobol sequence or Hammersley sequence, or other methods, such as LHS. In this study, we specified a second-order Gaussian quadrature scheme that is customized to the eight-vector joint distribution. With this scheme, 6,561 sample points were required in total. The final step of the DOE is to run the computational model at the sample points. By packaging the 6,561 sample points into a MapReduce job and submitting it to the Hadoop cloud, the evaluation results can be efficiently calculated and retrieved.

In Step 2, a PCE is created and globally fitted to the training dataset. First, the orthogonal second-order polynomials corresponding to the eight-vector joint distribution are generated using the Chaospy library. The approximation polynomial solver, which joins the PCE, quadrature nodes and weights, and model samples together, must estimate the polynomial coefficients before it can be used for later prediction purposes. There are numerous approaches for estimating the polynomial coefficients, which are often classified as either invasive or non-intrusive. Because invasive methods need to use information from the underlying model while calculating the coefficients, they typically need modifications of the source code of the original model. Therefore, the non-intrusive method was instead used in this study. In the field of non-intrusive approaches, there are essentially two viable options available: projection of pseudospectral data (Gottlieb & Orszag 1977) and point collocation (Hosder *et al.* 2007). The former uses a numerical integration approach to determine Fourier coefficients, whereas the latter solves a linear system generated by a statistical regression formulation. Unlike both MC integration and pseudospectral projection, the point collocation method does not assume that the samples follow any particular form, although they are traditionally selected to be random, quasi-random, nodes from quadrature integration, or a combination of the three. For this case study, we selected the samples to follow the Sobol samples from MC integration and optimal quadrature nodes from pseudospectral projection. Moreover, unlike those in the pseudospectral projection, the polynomials in point collocations are not required to be orthogonal. Nonetheless, orthogonal polynomials have been shown to work well with respect to stability (Feinberg & Langtangen 2015).

In Step 3, the approximate solver of the model can efficiently evaluate the model at any point in probability space. The built-in tools of Chaospy can further be used to derive statistical information from the model prediction. For example, the expected value and variance can easily be calculated. Our study focused on the evaluation of the approximation capability of SM with regards to the accuracy, fidelity, and efficiency. The SA technique was applied to measure the fidelity of the PCE in approximating the SWAT model. As shown in Table 1, six SA approaches were simultaneously used to check whether the reduced and statistic-based PCE could inherit the parameter sensitivity information from the target mechanistic model (SWAT).

## RESULTS

### Accuracy of the PCE model

### Fidelity of the PCE model

*et al.*2021). Figure 8 shows that CN2, ALPHA_BNK, and SOL_AWC are recognizable and that the other five parameters are hardly recognizable or not recognizable because different values of these parameters may produce comparable effects when combined with the other parameters. This finding also shows that the PCE model is vulnerable to the consequences of equifinality, with alternative calibration procedures using different sets of parameters producing equivalent simulations (Beven 2006), which is common in complex hydrological nonlinear models. This result confirms the capability of the PCE model to approximate complex models such as SWAT, i.e., the PCE inherits the equifinality property of complex models.

### Reduction of the computation time

A previous study demonstrated that 100 tasks (SWAT executions) took an average of 198 s (Ma *et al.* 2022a). Conversely, a single task took an average of 112 s when using a single core. Thus, when 100 cores were used, an average speedup of 57 was obtained, which is faster than the traditional standalone model evaluation. The global SA in this case study required 47,104 model evaluations to ensure convergence of the sensitivity indices of SA and took 1,584 min on the Hadoop cloud with 100 cores using the original model. In contrast, the PCE model required 6,561 evaluations of the original model and took just 22 min. Furthermore, because of the fast processing performance of the PCE, 47,104 surrogate SA assessments took approximately 7 s on a standalone machine with an Intel 16-core i7-10700 processor and 32 GB RAM. Overall, the proposed global SA process reduced the number of model evaluations by 40,543 and the computation time by approximately 86%.

## DISCUSSION

A hydrological simulation framework usually consists of three key parts: HPC, OAs, and a complex simulation model. Despite significant progress in the first two parts, the development of such a framework is still constrained by the inherent computational complexity of the simulation models. For this reason, the emphasis seems to have shifted from the first two parts to the simulation models themselves. Consequently, SM is being widely investigated as a viable solution because it is believed to preserve the high accuracy of the original model while significantly decreasing the computational costs. Within SM, the DOE is deemed to be the most computationally expensive part because it requires nearly the whole computing budget to solve the problem (Razavi *et al.* 2012b). To date, no studies have fully addressed the DOE implementation process. This deficiency has hampered the adoption of SM methods that may aid in simulation optimization. We have extended the previously presented framework to address the DOE process of SM implementation using cloud computing techniques (Ma *et al.* 2022a). The framework allows substantial high-fidelity model evaluations of DOEs to be implemented elegantly on the Hadoop cloud. The case study uses the PCE model to surrogate the SWAT model, demonstrating how SM-based PCE and cloud computing methodologies could be exploited to successfully counterbalance the very large computational burden of complex hydrological model simulations. Moreover, the Hadoop cloud-based framework can provide an ideal environment for assessing the applicability of SM, which is critical before SM is used. The approximation capability of PCEs in surrogate SWAT models was examined within the framework. The case study demonstrated how SM libraries, such as Chaospy, and SA libraries, such as SALIB, can be conveniently coupled with cloud computing to perform approximation analysis regarding the accuracy, fidelity, and efficiency.

Previous SM-related research efforts have focused on either Step 1 (described in Section 3.3): constructing and using function approximation methods as substitutes or Step 3: constructing frameworks that utilize surrogates (Razavi *et al.* 2012b). To date, there has been little research focusing on Step 1 of SM (DOE), mostly because it is commonly thought to be biased towards computational aspects rather than SM itself. In practice, the DOE consumes nearly the whole computing budget allocated to perform SM. This was confirmed by our case study, in which 6,561 original model evaluations required by the DOE took 22 min, whereas 47,104 surrogate SA assessments took only approximately 7 s. These observations were also reported in other studies (Westermann & Evins 2019; Chen *et al.* 2020; Zhang 2020). Notable examples include the development of a surrogate method of groundwater modelling using a gated recurrent unit, in which 2,000 evaluations of the original models required to fit the surrogate took nearly 8.2 h, whereas 70,000 evaluations of the surrogate took nearly 3 min (Chen *et al.* 2020). In this regard, our framework is the first of its kind to introduce cloud computing to address the computational burden of DOEs. Various DOE strategies may be readily explored within the framework to facilitate the construction of the appropriate SM model, whose objective may range from finding the optimum to mapping out the whole surface of the original model. Furthermore, reports of previous studies provided little information on the modelling and analysis toolkits or libraries that they used. Conversely, our framework explicitly introduces SM-related libraries, such as Chaospy and SALib, and thereby provides environmental modellers who are considering SM with a more complete description of the components of the SM process, as well as recommendations for the subjective choices needed when using surrogate models for hydrological simulations.

A significant number of model simulations are typically required by simulation optimization frameworks that combine model simulations with OAs. Despite the extensive effort that has been devoted to improving HPC and OAs, one such approach could be challenging when the single-model simulation is computationally expensive. A promising framework would therefore alleviate the very large computational burden as much as possible. Using the Chaospy library, our framework can successfully surrogate the SWAT model with a second-order PCE and then examine the approximation capability of the PCE with SALib. This framework is particularly suitable for addressing the DOE problem because it inherits the parallel processing feature of cloud computing. Separating the DOE from the other two steps of SM is beneficial for programming because it allows the DOE to be extended to more complex models. Because it adopts the Python-based web framework to host the main SM library Chaospy and core SA library SALIB, the framework can be flexibly extended to other Python-based SM-related libraries and statistical analysis libraries, such as DAKOTA (Eldred *et al.* 2010), Open TURNS (Andrianov *et al.* 2007), and the Surrogate Modelling Toolbox (Hwang *et al.* 2019). Therefore, it is a generic framework for tackling a wide range of SM tasks and is not specifically limited to PCE-based SM and global SA.

Despite its advantages, the proposed framework has two potential drawbacks. One drawback is that it is currently a basic sequential framework (also known as an offline framework) and is not an adaptive–recursive framework (also known as an online framework). The basic sequential framework – the simplest type of SM analysis framework – consists of three main components: DOE, global fitting on the DOE to construct a surrogate, and replacement for the original model in conducting the relevant analyses (Razavi *et al.* 2012b). This indicates that the size of the DOE in Step 1 of the offline framework is often fixed and larger than that of the initial DOE in more advanced online SM frameworks. Because the DOE in Step 1 of the offline framework consumes nearly all computing resources allocated to address the problem, the other frameworks may be thought of as being online in this sense since they update the surrogate regularly as new data become available. Currently, only one DOE is performed in our framework to fit the PCE, and PCE updates are not yet supported. This might raise the concern that the PCE method is sometimes not good at identifying sensitive parameters or locating the optimum because the surrogate model is smoother than the true response surface, which is unavoidable due to the information limitation of a ‘once-for-all’ DOE – a limited number of samples. Notwithstanding this drawback, our framework can support the online framework through an appropriate orchestration of continuous DOEs, as required by an online framework, and earlier studies have provided us with insight in this regard (Ma *et al.* 2022b). Therefore, establishing an adaptive sampling technique capable of detecting and refining the informative region will be a critical task in the future. Another possible drawback of this framework comes from the assumption that the input variables are stochastically independent; however, this is not often the case. Reformulating the issue as a set of independent variables using GPCE is one approach for quantitatively addressing stochastically dependent variables (Xiu *et al.* 2002). Because the computational architecture of Chaospy includes the Rosenblatt transformation, which allows for mapping any domain to the unit hypercube [0, 1]* ^{D}*, GPCE is available for all densities. Future research that incorporates GPCE within the framework is thus recommended. Furthermore, the framework provides a ideal computational environment for investigating whether alternative time series prediction-based machine learning models (Mosavi

*et al.*2018; Kaya

*et al.*2019; Fu

*et al.*2020; Ehteram

*et al.*2022; Wang

*et al.*2022; Xie

*et al.*2022) could be used as quick and accurate surrogate models.

## CONCLUSIONS

In this paper, a Hadoop cloud-based SM framework is presented for approximating complex hydrological models, with the goals of improving the computational efficiency of DOEs that commonly require substantial model evaluations and of evaluating the applicability of SM before it is used. The framework is the first of its kind to solve the problem of the very large computational burden of DOEs on the Hadoop cloud. Because it employs a Python web framework to host both the SM libraries and other statistical analysis libraries, the framework is also capable of providing an ideal environment for assessing the applicability of SM with regards to the accuracy, fidelity, and efficiency. The case study using PCEs to approximate the SWAT model indicates that the framework is capable of solving the DOE problem as well as evaluating the applicability of SM. The framework is particularly well suited to complex hydrological model optimization, but it may also be extended to solve various optimization problems. It is expected to provide hydrological modellers who are contemplating SM with a more detailed description of the components of the process than what was previously available, as well as a new perspective on how to make the necessary subjective decisions when using surrogate models.

## ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (41925005) and the National Key R&D Program of China (2019YFD0901105).

## DATA AVAILABILITY STATEMENT

All relevant data are available from an https://github.com/JinfengM/PCE-Hadoop.

## CONFLICT OF INTEREST

The authors declare there is no conflict.