## Abstract

To alleviate the computational burden of parameter calibration for the Variable Infiltration Capacity (VIC) model, a stepwise surrogate model (SM) is developed based on AdaBoost. The SM first picks out the parameter sets whose objective-function values are close to the optimization objectives and then approximates the objective-function values for these parameter sets. The *ɛ*-NSGA II (Nondominated Sorting Genetic Algorithm II) algorithm is used to search for the optimal solutions of the SM. The SM is tested with a case study in the upper Brahmaputra River basin, Tibet Plateau, China. The results show that the stepwise SM performs well, with a misclassification rate of less than 2.56% in the global simulation step and a root mean square error of less than 0.0056 in the local simulation step. With no large difference in the optimal solutions between VIC and the SM, the SM-based algorithm saves up to 90% of the computation time.

## HIGHLIGHTS

A stepwise surrogate model (SM) for VIC parameter calibration is built with AdaBoost.

The SM first simulates the response surface coarsely in the entire parameter space and then simulates the response surface in the crucial range precisely.

The surrogate-based algorithm improves the efficiency of VIC parameter calibration remarkably.

## INTRODUCTION

Distributed hydrological models are widely used to study the effects of climate change on the hydrological cycle (Leng *et al.* 2015; Lu *et al.* 2018), streamflow forecasting (Schumann *et al.* 2013; Mo & Lettenmaier 2014), and water resource assessment (Xu *et al.* 2015; Sridhar *et al.* 2019). Calibrating the parameters of a distributed hydrological model automatically usually requires substantial computational power, as the model often has to run thousands of times. A surrogate model (SM) is one of the most important measures for alleviating this computational burden (Maier *et al.* 2014).

A number of studies (Yan & Minsker 2006; Kourakos & Mantoglou 2009; Teweldebrhan *et al.* 2019; Xing *et al.* 2019) have handled the problem of parameter optimization for complex models with SMs in several engineering fields, and these applications show the efficiency of SMs in optimization. The artificial neural network (ANN) is one of the most commonly used methods to construct SMs (Behzadian *et al.* 2009; Yalçın *et al.* 2018; Sayers *et al.* 2019); it can reduce the number of required model simulations without reducing the accuracy of the optimization results. Besides the ANN, the radial basis function (RBF; Christelis *et al.* 2016), support vector machine (Zhang *et al.* 2009), and Gaussian process regression (Zhang *et al.* 2016) are also widely employed as SMs. In these cases, where data-driven or black box models are used to estimate the relationships between the input parameters of the physical models and the outputs, the approach is known as response surface surrogacy (Razavi *et al.* 2012a).

The SMs referred to in most studies are global response surface surrogates, meaning that they simulate the outputs over the entire parameter space. However, global approaches converge slowly when the outputs are relatively flat around the optimal solution (Wang *et al.* 2004). Local response surface surrogates have therefore been proposed to solve this problem. Bliznyuk *et al.* (2008) applied RBFs to surrogate the physical model in the high posterior density region when calibrating parameters with Markov Chain Monte Carlo (MCMC). However, this algorithm surrogates only one region of the entire response surface, which makes it difficult to apply to model parameter calibration coupled with a global optimization algorithm such as the Nondominated Sorting Genetic Algorithm II (NSGA II). In contrast, some studies (Regis & Shoemaker 2004; Razavi *et al.* 2012b) evaluated offspring mainly with approximated functions in a single-objective genetic algorithm (GA) and re-evaluated the most promising ones with the original model. However, the practicality and efficiency of this algorithm, which does not update the approximated functions, are unclear when it is applied to a more complex case such as multi-objective optimization.

In distributed hydrological model parameter calibration, several studies (Zhang *et al.* 2009; Gong *et al.* 2016; Teweldebrhan *et al.* 2019) adopted black box models as SMs to surrogate hydrological models and then linked the SMs with optimization algorithms to calibrate the parameters of the distributed model. In these studies, the use of a global response surface surrogate avoids losing local optimal solutions. Usually, these local optimal solutions are discretely distributed in the entire parameter space because of the equifinality of different parameters in the hydrological model (Khatami *et al.* 2019; Liu *et al.* 2019). However, the problem of slow convergence in the local optimal regions for global approaches remains unsolved. Adding a physical layer between input parameters and output objectives to convert the highly nonlinear response into a linear one is one method to address this problem (Razavi & Tolson 2013; Sun *et al.* 2017), but the efficiency of the SM algorithm suffers from the addition of the physical layer.

In our study, a stepwise SM is proposed to improve the efficiency of distributed hydrological model parameter calibration. With global surrogates approximating the response surface over the entire parameter space and local surrogates adding detail to the response surface in the local optimal regions, the approach addresses the problem of slow convergence while avoiding the loss of local optimal solutions. A self-adaptive updating scheme is also introduced into this SM-based parameter calibration procedure. This study mainly investigates the improvement in efficiency achieved with the SM in multi-objective optimization.

## STUDY AREA AND DATA

The upper Brahmaputra River (UBR) basin is located in the southern part of the Tibetan Plateau, Southwest China, with a drainage area of approximately 1.8 × 10^{5} km^{2} controlled by the Yangcun hydrologic station. The UBR is the longest plateau river in China and one of the highest rivers in the world, with a mean basin altitude over 4,000 m. The river originates on the Jiemayangzong Glacier on the northern side of the Himalayas at an elevation of 5,590 m above sea level (Zeng *et al.* 2018). The altitudes of the UBR basin range from 3,509 to 7,185 m. Precipitation, groundwater, and snow (including glacier) melt water are the three main sources of water in the UBR basin. The annual average precipitation is about 560 mm and varies only slightly from year to year, while the monthly runoff distribution is highly inhomogeneous, with most of the precipitation occurring from June to September. The annual average temperature in the basin is about 6.27 °C. In general, the highest temperature in a year occurs in June and July and the lowest in January. As a result, a large amount of precipitation is always accompanied by a large amount of snow melt water.

Figure 1 shows the basin location, elevations, weather stations, and the two hydrologic outlets used in this study. The gauged meteorological data, including daily precipitation, daily average air temperature, wind speed, relative humidity, and atmospheric pressure, are collected from 23 China National Meteorological Observatory stations located in and around the UBR basin, as shown in Figure 1 (black points). The daily runoff series of two gauge stations (black triangles), i.e., Nugesha station (NGS) and Yangcun station (YC), are utilized in this study. The digital elevation model data with 90 m resolution are downloaded from the American National Aeronautics and Space Administration Shuttle Radar Topography Mission; the soil data set is the Harmonized World Soil Database version 1.1 (HWSD v1.1; Fischer *et al.* 2008) at a resolution of 1 km; the landcover data set is the WESTDC landcover data set version 2.0 (Ran *et al.* 2010) at a resolution of 1 km. Due to the intermittence of the discharge data at NGS and YC from 2005 to 2009, the meteorological and discharge data in the first 6 years, from 1999 to 2004, are employed for calibration, and the last 6 years, from 2010 to 2015, are used for validation.

## METHODS

### Hydrological model

The Variable Infiltration Capacity (VIC) Macroscale Hydrological Model was originally developed by Liang *et al.* (1994, 1996) at the University of Washington for macroscale catchments. The model simulates runoff grid by grid. However, the VIC model itself does not include a routing model to route the runoff of each grid cell to the outlet station. Therefore, the routing model of Lohmann *et al.* (1996, 1998), a source-to-sink model that solves the linearized Saint-Venant equations, is used to calculate the streamflow at the outlet station.

In this study, the land surface simulation uses VIC (version 4.2.d) at a six-hourly time step in water and energy balance modes. Since the study basin is located in the alpine region, the snow (Bowling *et al.* 2004; Andreadis *et al.* 2009) and frozen soil (Cherkauer & Lettenmaier 1999; Cherkauer *et al.* 2003) algorithms are adopted. Since the percentage of glaciers in the study basin is less than 5% (Xuan *et al.* 2018), the glacier melt water is simulated in the snow module coupled with the snow melt water.

The study area is divided into 0.25° × 0.25° grid cells, in which surface characteristics such as elevation, soil, and vegetation are specified (Nijssen *et al.* 2001). The gauged meteorological data are also interpolated onto the 0.25° × 0.25° grid using the Inverse Distance Weighted (IDW) method coupled with elevation-based lapse rates, set to 0.6 mm/km for precipitation and −6.5 °C/km for temperature (Cuo *et al.* 2014).
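A minimal sketch of this interpolation step may look as follows, assuming station coordinates in projected kilometre units and a single lapse rate per variable; the function name and array layout are illustrative, not the paper's code.

```python
import numpy as np

def idw_with_lapse(xy_st, z_st, vals, xy_grid, z_grid, lapse, power=2.0):
    """Inverse-distance-weighted interpolation with an elevation lapse-rate
    correction: each station value is first adjusted to the grid-cell
    elevation, then distance-weighted. Elevations in km, e.g. lapse=-6.5
    for temperature in degC/km."""
    out = np.empty(len(xy_grid))
    for k, (gxy, gz) in enumerate(zip(xy_grid, z_grid)):
        d = np.linalg.norm(xy_st - gxy, axis=1)
        adj = vals + lapse * (gz - z_st)      # station value at grid elevation
        if np.any(d < 1e-9):                  # grid cell coincides with a station
            out[k] = adj[np.argmin(d)]
        else:
            w = 1.0 / d ** power
            out[k] = np.sum(w * adj) / np.sum(w)
    return out
```

For example, a station 1 km below a grid cell contributes its value shifted by one lapse-rate unit before the distance weighting is applied.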

There are numerous parameters in the VIC model. However, a large proportion of them are usually set to default values, and some adopt measured values (such as most vegetation parameters). As a result, the parameters for calibration are usually soil parameters that are not easy to measure. Generally, the parameters selected for calibration are the thicknesses of the second and third soil moisture layers (*d*_{2}, *d*_{3}), the fraction of maximum soil moisture where nonlinear baseflow occurs (*Ws*), the maximum velocity of baseflow (*Ds*_{max}), the fraction of *Ds*_{max} where nonlinear baseflow begins (*Ds*), and the variable infiltration curve parameter (*b*_{inf}) (Lohmann *et al.* 1996; Xie *et al.* 2007; Mishra *et al.* 2010; Voisin *et al.* 2011; Lu *et al.* 2013; Zhao *et al.* 2013; Kang *et al.* 2014; Hengade & Eldho 2016; Liu *et al.* 2017). Some studies also calibrated more parameters according to different catchment conditions, such as routing parameters (Parada *et al.* 2003), snow parameters (Isenstein *et al.* 2015), and vegetation parameters (Newman *et al.* 2017). In our work, Global Land Data Assimilation System (GLDAS) data from a period close to the calibration period were used to estimate the values of the leaf area index and minimum stomatal resistance, so these two vegetation parameters were not calibrated. Following suggestions from previous work, eight parameters of VIC are selected for calibration in our study: *d*_{2}, *d*_{3}, *b*_{inf}, *Ws*, *Ds*, *Ds*_{max}, the surface roughness of the snowpack (Sr), and the mean flow velocity (*v*). The feasible ranges of these parameters are given in Table 1.

| Type | Parameter | Description | Unit | Range |
| --- | --- | --- | --- | --- |
| Soil | *b*_{inf} | Variable infiltration curve parameter | – | 0–0.4 |
| | *Ds*_{max} | Maximum velocity of baseflow | mm/day | 0–30 |
| | *Ds* | Fraction of *Ds*_{max} where nonlinear baseflow begins | Fraction | 0–1 |
| | *Ws* | Fraction of maximum soil moisture where nonlinear baseflow occurs | Fraction | 0–1 |
| | *d*_{2} | Thickness of the second soil moisture layer | m | 0.1–1.5 |
| | *d*_{3} | Thickness of the third soil moisture layer | m | 0.1–1.5 |
| Snow | Sr | Surface roughness of snowpack | m | 0.001–0.03 |
| Routing | *v* | Mean flow velocity | m/s | 0.5–5 |


### Calibration strategy

In this study, *ɛ*-NSGA II (Kollat & Reed 2006), a global multi-objective optimization algorithm improved from NSGA II (Deb *et al.* 2002), is used to calibrate the parameters of VIC. Liu *et al.* (2019) successfully applied *ɛ*-NSGA II to VIC parameter calibration in the same study area. Three features adopted in *ɛ*-NSGA II make the algorithm efficient, reliable, and easy to use for water resource applications (Kollat & Reed 2007).

First, the *ɛ*-dominance is used in *ɛ*-NSGA II instead of the original dominance (Laumanns *et al.* 2002). In the method of *ɛ*-dominance, the search space is divided into a grid with the grid cell size of *ɛ* × *ɛ*, and then the solution nearest to the lower left-hand corner in each grid cell is preserved. The value of *ɛ* determines the precision of the Pareto solution. A user can specify *ɛ* based on the precision goals (Kollat & Reed 2006), which makes the algorithm more efficient.
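The grid-based archiving can be sketched as follows, assuming minimization of all objectives; this illustrative helper keeps one point per grid box and, for brevity, omits the box-level dominance filtering that full *ɛ*-dominance also applies.

```python
import numpy as np

def eps_archive(objs, eps):
    """Thin a set of objective vectors (all objectives minimized) using the
    eps x eps grid: each vector falls into one grid box, and only the point
    closest to the box's lower left-hand corner is kept."""
    boxes = {}
    for f in objs:
        f = np.asarray(f, dtype=float)
        key = tuple(np.floor(f / eps).astype(int))   # grid box index
        corner = np.array(key) * eps                 # lower left-hand corner
        dist = np.linalg.norm(f - corner)
        if key not in boxes or dist < boxes[key][0]:
            boxes[key] = (dist, f.tolist())
    return [f for _, f in boxes.values()]
```

With *ɛ* = 0.1, for example, two points falling in the same 0.1 × 0.1 box collapse to the one nearer that box's corner, which is how *ɛ* controls the resolution of the Pareto set.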

Second, the population size in each run is adapted to the number of nondominated solutions, as recommended by Harik & Lobo (1999), in what is termed the 25% injection scheme. In this scheme, 25% of the population in the next run consists of the nondominated solutions and the remaining 75% is generated randomly. With the injection scheme, *ɛ*-NSGA II determines the population size more objectively, while the population size of other GAs, such as Deb's NSGA II, is specified by users through trial and error. The injection scheme simplifies the parameterization and makes the algorithm easy to use.

Third, the search is terminated automatically when the solutions are not improved or the population reaches the maximum. The spontaneous termination can minimize the generations of optimization (Reed *et al.* 2003).

In our work, the initial population size of *ɛ*-NSGA II was set to 24. The maximum generations to adjust the population size and the maximum number of populations are specified as 250 and 10,000, respectively. *ɛ* is specified as 0.001, which is determined by the smallest parameter search step length of Sr.

### Surrogate model

#### Adaboost algorithm

Boosting is an ensemble algorithm for reducing bias and variance in supervised learning, based on the theory of Kearns & Valiant (1994). In boosting, a weak predictor refers to a predictor that performs only slightly better than random guessing, while a strong predictor has a perfect prediction ability. It is therefore much easier to obtain a weak predictor than a strong one.

AdaBoost (Freund *et al.* 1996) is one of the boosting methods; it assembles several weak predictors into a strong predictor to increase the accuracy of the ANN and requires no prior guess of a distribution-independent bound on the accuracy of the weak predictors (Schapire 1990). AdaBoost is an iterative algorithm (Zhao *et al.* 2014). In each step, different weights are assigned to the training samples according to the errors of the training results, and these weights are used in the next step. One weak predictor is saved at each step until the loop ends, and each weak predictor is assigned a weight according to its prediction error. Finally, the output of the strong predictor is a weighted mean of the outputs of these weak predictors. In this study, a standard backpropagation neural network (BPNN) is used as the weak learner.

The AdaBoost algorithm can solve the problem of unstable training results caused by the random generation of initial sample weights and thresholds in the BPNN. Furthermore, it automatically adjusts the weights to pay more attention to the points whose training results are worse than others', thereby improving the accuracy of the BPNN.

The range of the number of nodes in the hidden layer is from 4 to 14, calculated with the empirical formula proposed by Li & Liu (2009). The best number of hidden nodes is usually decided by trial and error. However, in AdaBoost, the BPNN is used as the weak learner, and the final results of the strong learner stabilize during the iterative process, so the structure of a single weak learner in AdaBoost is not as important as that of a BPNN used in isolation. As a result, the number of nodes in the hidden layer is set to 10 in this study, which is around the median of the range (from 4 to 14). The maximum number of iterations and the training rate are set to 100 and 0.01, respectively. The sigmoid activation function, with input and output scaled to the range from −1 to 1, is selected to compute the output of each neuron. The default values of the other parameters suggested by Beale *et al.* (2010) are employed in the training of the weak learners. Ten weak predictors are trained and then assembled into a strong predictor.

#### Algorithm of SM

Let **O** = *f*(**X**), where **X** is an *m*-dimensional row vector composed of the parameters to be calibrated, *m* is the number of calibration parameters, **O** is a row vector consisting of the corresponding values of the objective functions, and *f*(·) represents the response process mapping **X** to **O**. In the original model, *f*(·) is the process by which streamflow is first simulated with VIC and the values of the objective functions are then calculated from the simulated streamflow (VIC outputs); in the SM, this process is replaced by AdaBoost (SM outputs).

As the number of optimization generations increases, the values of the objective functions converge in a range around the optimization objectives. The parameter sets selected as the calibration results are usually distributed in this range. Therefore, we define the range in which the values of the objective functions converge as the crucial range and the remaining range outside it as the trivial range. Since more attention is paid to the results in the crucial range, a stepwise SM is proposed. The crucial boundaries of NSE at the NGS and YC stations (NSE_{NGS}, NSE_{YC}), i.e., the split lines between the crucial and trivial ranges, were set to the upper quartiles of the training samples, and those of BIAS at the NGS and YC stations (BIAS_{NGS}, BIAS_{YC}) to the lower quartiles.
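Under this rule, the crucial boundaries can be read directly off the quartiles of the training-sample objective values; the dictionary layout and key names below are illustrative assumptions.

```python
import numpy as np

def crucial_boundaries(samples):
    """Split lines between the crucial and trivial ranges: the upper
    quartile for NSE (to be maximized) and the lower quartile for BIAS
    (to be minimized). `samples` maps output names to 1-D arrays."""
    return {
        "NSE_NGS":  np.quantile(samples["NSE_NGS"], 0.75),
        "NSE_YC":   np.quantile(samples["NSE_YC"], 0.75),
        "BIAS_NGS": np.quantile(samples["BIAS_NGS"], 0.25),
        "BIAS_YC":  np.quantile(samples["BIAS_YC"], 0.25),
    }
```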

The SM can be divided into two steps, i.e., the global simulation step and the local simulation step. In the global simulation step, the SM is used to estimate the values of the objective functions over the entire parameter space and screen the values in the crucial range; the values acquired in this step are not very accurate. In the local simulation step, the SM simulates the values of the objective functions in the crucial range more accurately. This step helps acquire Pareto fronts that are closer to those of VIC in the crucial range.

Training the stepwise SM requires an adequate number of samples in the crucial range. Uniformly distributed sampling algorithms, such as Monte Carlo (MC), Good Lattice Points (GLPs; Gong *et al.* 2015a, 2015b), and Latin Hypercube (LH; McKay *et al.* 1979), can generate samples distributed uniformly over the whole sample space. However, these algorithms cannot generate more samples in the crucial range without adding a large number of samples in the trivial range. In our work, to acquire more samples in the crucial range, *ɛ*-NSGA II is used to generate the training samples.
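For reference, a uniform LH design over the feasible ranges of Table 1 can be drawn with SciPy's `qmc` module (the 480-sample size matches the comparison in the Discussion); such a design, unlike the *ɛ*-NSGA II generator, cannot concentrate samples in the crucial range.

```python
import numpy as np
from scipy.stats import qmc

# Feasible ranges from Table 1, in the order:
# b_inf, Ds_max, Ds, Ws, d2, d3, Sr, v
lower = [0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 0.001, 0.5]
upper = [0.4, 30.0, 1.0, 1.0, 1.5, 1.5, 0.03, 5.0]

sampler = qmc.LatinHypercube(d=8, seed=0)
samples = qmc.scale(sampler.random(n=480), lower, upper)  # (480, 8) design
```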

Figure 2 indicates the flowchart for the parameter calibration based on SM and the algorithm is described as follows:

*Step 1: Initial sample generation.* An initial set of samples **S** is generated. To acquire sufficient samples in the crucial range to train the SM, *ɛ*-NSGA II coupled with VIC is run for a handful of generations. The whole population of *ɛ*-NSGA II is used as the training and testing samples.

*Step 2: SM training.* After eliminating duplicates from **S**, the SM is trained. Hold-out cross-validation is utilized: 70% of the samples are extracted randomly for model training and the remaining 30% for model testing. The SM is trained in two steps. The global simulation step trains the SM to simulate the response surface over the entire range (both trivial and crucial ranges) and to screen the points in the crucial range. The local simulation step trains the SM to simulate the response surface within the crucial range in order to improve the simulation accuracy. The metrics in testing are used to show the performance of the SM in both the global and local simulation steps.

*Step 3: Calibration with the SM.* The trained SM is used to calibrate the parameters with *ɛ*-NSGA II instead of the VIC model. The optimal Pareto fronts **P**_{f} and the optimal Pareto solutions **P**_{s} of the SM in the crucial range are acquired.

*Step 4: Optimal Pareto solution evaluation.* The optimal Pareto solutions **P**_{s} of the SM in the crucial range are used as the input parameters of VIC to calculate the values **P**_{v} of the objective functions. **P**_{v} is used to validate **P**_{f} and confirm whether **P**_{f} can meet the condition to replace the optimal Pareto fronts of VIC. The *ɛ*-indicator (Zitzler *et al.* 2003; Kollat & Reed 2006) is utilized to evaluate the consistency between **P**_{f} and **P**_{v}, calculated as:

$$I_{\varepsilon}(\mathbf{P}_v, \mathbf{P}_f) = \max_{\mathbf{z} \in \mathbf{P}_f} \min_{\mathbf{p} \in \mathbf{P}_v} \max_{1 \le i \le n} \left| z_i - p_i \right|$$

where *I*_{ɛ}(**P**_{v}, **P**_{f}) is the *ɛ*-indicator between **P**_{v} and **P**_{f}; *n* is the number of objectives; **z** = [*z*_{1}, *z*_{2}, …, *z*_{n}] is one vector in **P**_{f}; **p** = [*p*_{1}, *p*_{2}, …, *p*_{n}] is one vector in **P**_{v}; and |·| is the absolute value function.

*Step 5: SM updating.* New samples consisting of **P**_{s} and **P**_{v} are mixed into **S**, and Steps 2–5 are repeated until the *ɛ*-indicator is less than a specified value.
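The *ɛ*-indicator of Step 4 can be computed directly from the two sets of objective vectors; this sketch assumes each set is an array whose rows are objective vectors.

```python
import numpy as np

def eps_indicator(P_v, P_f):
    """epsilon-indicator between the validated values P_v and the surrogate
    fronts P_f: for each vector z in P_f, take the nearest vector p in P_v
    under the max-|z_i - p_i| distance, then the worst case over P_f."""
    P_v, P_f = np.asarray(P_v, float), np.asarray(P_f, float)
    # |z_i - p_i| maximized over objectives for every (z, p) pair
    diff = np.abs(P_f[:, None, :] - P_v[None, :, :]).max(axis=2)
    return diff.min(axis=1).max()
```

Identical fronts give an indicator of 0, so iterating until the value falls below a threshold (0.0275 in the case study) stops the updating loop once the surrogate fronts are validated.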

### Performance metrics

The Nash-Sutcliffe efficiency (*NSE*) and the absolute relative bias (*BIAS*) of the simulated streamflow are used as the objective functions:

$$NSE = 1 - \frac{\sum_{i=1}^{n} \left[ Q_o(i) - Q_s(i) \right]^2}{\sum_{i=1}^{n} \left[ Q_o(i) - \overline{Q_o} \right]^2}$$

$$BIAS = \left| \frac{\sum_{i=1}^{n} \left[ Q_s(i) - Q_o(i) \right]}{\sum_{i=1}^{n} Q_o(i)} \right|$$

where *n* is the length of the daily runoff series; *Q*_{o}(*i*) and *Q*_{s}(*i*) are the observed and simulated daily streamflow of the *i*th day, respectively; and $\overline{Q_o}$ is the mean value of the observed daily streamflow. A value of *NSE* closer to 1 means a lower difference between the observed and simulated flows, while the preferred value of *BIAS* is 0.

The error rate (*ER*) is used to evaluate the classification performance of the SM in the global simulation step:

$$ER = \frac{e_1 + e_2}{N_e} \times 100\%$$

where *e*_{1} is the number of points on the response surface that are originally in the trivial range but misclassified into the crucial range by the SM (the first type of error); *e*_{2} is the number of points in the crucial range but assigned to the trivial range by the SM (the second type of error); and *N*_{e} is the number of points to classify.
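These metrics translate directly into code; the function names below are illustrative.

```python
import numpy as np

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency of a simulated daily streamflow series."""
    q_obs, q_sim = np.asarray(q_obs, float), np.asarray(q_sim, float)
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def bias(q_obs, q_sim):
    """Absolute relative volume bias of the simulation."""
    q_obs, q_sim = np.asarray(q_obs, float), np.asarray(q_sim, float)
    return abs(np.sum(q_sim - q_obs) / np.sum(q_obs))

def error_rate(e1, e2, n_e):
    """Misclassification rate of the global simulation step (as a fraction)."""
    return (e1 + e2) / n_e
```

A perfect simulation gives `nse` of 1 and `bias` of 0; a simulation equal to the observed mean everywhere gives `nse` of 0.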

## RESULTS

### Sampling for SM training and testing

To investigate the relationship between the performance of the SM and the size of the training and testing samples, *ɛ*-NSGA II with a small number of generations was used to generate the samples. Here, the number of generations *ɛ*-NSGA II ran varied from 5 to 25, and each number of generations was repeated 100 times to investigate the uncertainty.

Figure 3 shows the relationship between the ER of the four outputs of the SM in the global simulation step and the increasing number of generations for sampling. In Figure 3(a), with the increase in generations, the uncertainty of the ER of NSE_{NGS} reduces slightly, and the mean value also changes slightly. In contrast to NSE_{NGS}, the uncertainties and mean values of the ER of the other three outputs, NSE_{YC}, BIAS_{NGS}, and BIAS_{YC}, decrease remarkably with the increase in generations, as shown in Figure 3(b) and 3(c). Generally, the performance of the SM in the global simulation step becomes stable when the number of sampling generations reaches around 20.

The variations of the RMSE of the four outputs of the SM in the local simulation step with the increasing number of sampling generations are shown in Figure 4. As shown in Figure 4(a), the mean value and the uncertainty of the RMSE of NSE_{NGS} decrease sharply as the number of generations increases from 13 to 20, after which the decrease gradually becomes mild. In Figure 4(c), BIAS_{NGS} shows a similar result and becomes stable with around 20 sampling generations. Meanwhile, as shown in Figure 4(b) and 4(d), the mean values and uncertainties of the RMSE of NSE_{YC} and BIAS_{YC} decline sharply as the number of generations increases from 6 to 10, and they are also stable around the 20th generation.

With the increasing number of sampling generations, the crucial boundaries of the four outputs also change. As shown in Figure 5(a) and 5(b), the mean values of the crucial boundaries of NSE_{NGS} and NSE_{YC} increase sharply as the number of sampling generations rises from 5 to 20 and then become stable, while their uncertainties decrease remarkably. The mean values of the crucial boundaries of BIAS_{NGS} and BIAS_{YC} decrease remarkably as the number of generations increases from 5 to 10. As shown in Figure 5(c) and 5(d), their mean values change slightly and their uncertainties reach minimal values when the number of generations is between 11 and 20. After that, though their mean values continue to decrease slightly, their uncertainties begin to grow. This may be caused by the stochastic method used to choose the training samples: when the limited number of samples with small BIAS values is allocated to training, the crucial boundaries will be smaller; otherwise, they will be larger. When the number of sampling generations is around 20, the variations in the mean values and uncertainties of the crucial boundaries of BIAS_{NGS} and BIAS_{YC} are not too large compared with those for 5–11 sampling generations, so we did not delve more deeply into this issue.

Considering the results shown in Figures 3–5, the training and testing samples for SM were generated by *ɛ*-NSGA II with 20 generations, with which the stepwise SM was trained. The crucial ranges of NSE_{NGS}, BIAS_{NGS} and NSE_{YC}, BIAS_{YC} were ≥0.8385, ≤0.1145 and ≥0.8445, ≤0.2534, respectively.

Figure 6 shows the scatter plots of the SM in the global simulation step in training and testing, with the SM outputs on the vertical *y*-axis plotted against the corresponding VIC outputs on the horizontal *x*-axis. The blue line is the '*y* = *x*' line. As shown in Figure 6(a)–6(h), the points are distributed along the '*y* = *x*' lines and most outliers are in the trivial ranges. The ERs for the four outputs of the SM in the global simulation step are also calculated and shown in Figure 6, with a maximum value of 1.10% in training and 2.56% in testing. This means that the SM in the global simulation step performs well in classifying the response points into the crucial and trivial ranges.

Figure 7 shows the performance of SM outputs in the crucial range in the training and testing phases. In Figure 7, the points are almost distributed along the ‘*y* = *x*’ lines, except Figure 7(e) and 7(f). In Figure 7(e) and 7(f), the points clearly deviate from the ‘*y* = *x*’ lines, with RMSE of 0.0043 in training and 0.0056 in testing. However, compared with NSE_{YC} simulated in the global simulation step whose values of RMSE in the crucial range are 0.0161 and 0.0152, the performance of SM in the local simulation step on the simulation of NSE_{YC} in the crucial range is improved greatly. In the local simulation step, with the maximum RMSE value of 0.0057 in testing, the SM outputs in the crucial range are reliable to replace the VIC outputs.

### Parameter calibration and validation

The trained stepwise SM was operated in *ɛ*-NSGA II and updated until the *ɛ*-indicator between the optimal Pareto fronts of SM (F_{SM}) and the validated-SM fronts (F_{VS}) in the crucial range was less than 0.0275, where F_{VS} are the values of objective functions calculated in the original model with the optimal Pareto solutions of SM (P_{SM}). The optimal Pareto solutions of VIC (P_{VIC}) were acquired in *ɛ*-NSGA II, which were terminated with the total populations to the maximum value of 10,000.

As shown in Figure 8, green points are the optimal Pareto fronts of VIC (F_{VIC}), and red points and gray points are F_{SM} and F_{VS}, respectively. Figure 8(a)–8(c) shows F_{SM} in the three iterations, respectively. In the third iteration, *ɛ*-indicator between F_{SM} and F_{VS} in the crucial range reaches 0.0261, so the optimization is terminated with three iterations. In Figure 8(a) and 8(b), there are large deviations between F_{SM} and F_{VS}, and the red points and the green points in the crucial range are clearly gathered together into individual groups. This phenomenon disappeared in Figure 8(c), where red and green points in the crucial range are mixed. Interestingly, it is found that the *ɛ*-indicator of the second iteration at 0.0663 is slightly larger than the one in the first iteration at 0.0480. This is caused by the further convergence of F_{SM} to the ideal point, where some of F_{SM} are not included in the range of training samples and the SM needs to be updated. The ideal point is a four-dimensional objective point with NSE_{NGS} of 1, BIAS_{NGS} of 0, NSE_{YC} of 1, and BIAS_{YC} of 0.

The final results of F_{VIC}, F_{SM}, and F_{VS} are shown in Figure 8(c). In the crucial range, there are larger deviations between F_{VIC} and F_{SM}, while F_{VIC} and F_{VS} are more similar. In particular, compared with F_{SM}, larger NSE_{NGS} and smaller BIAS_{NGS} are obtained in F_{VIC} at the NGS station.

The optimal Pareto solutions of VIC and the SM in the crucial range found by *ɛ*-NSGA II are presented in Figure 9. The optimal Pareto solutions of VIC in the crucial range (gray lines) show a structure and distribution similar to those of the SM in the crucial range (orange lines). For the Pareto solutions in the trivial range, there are significant differences between VIC (gray dashes) and the SM (orange dashes), especially in the parameter *d*_{3}. The solutions of the SM in the trivial range are also much fewer than those of VIC. This loss of solutions in the trivial range arises because the SM mainly concentrates on the simulation in the crucial range, and the samples for SM updating are also mainly distributed in the crucial range. In the crucial range, the parameter Sr has different distributions in VIC and the SM: Sr in VIC ranges from 0.0011 to 0.0147, while that in the SM converges around 0.0030. This may be caused by the low sensitivity of Sr (Houle *et al.* 2017).

Seen from Figure 9, the compromise solution of VIC (*P*_{1}), whose corresponding front is the closest to the ideal point, is shown as the black line, and that of SM (*P*_{2}) is the green line. These two solutions, *P*_{1} and *P*_{2}, were used as the input parameters of the VIC model to verify whether *P*_{2} could replace *P*_{1} in the streamflow simulation.

At the NGS and YC stations, the daily streamflows simulated with *P*_{1} and *P*_{2} during the calibration and validation periods are displayed in the left panels of Figure 10, and the right panels show the corresponding empirical CDF curves. The empirical distributions of the simulation with *P*_{1} are consistent with those with *P*_{2}, and the two series of simulations are also highly consistent. It is noted that the peak flow simulation with *P*_{2} is slightly degraded compared with that with *P*_{1}.

With the parameter set *P*_{1}, the values of NSE at the NGS station during the calibration and validation periods are 0.8650 and 0.8609, respectively, while those of BIAS are 0.0294 and 0.0413, respectively. Similarly, the values of NSE at the YC station during the calibration and validation periods are 0.8898 and 0.8933, respectively, and those of BIAS are −0.1443 and −0.1782, respectively, where the absolute symbol of BIAS function is removed to evaluate whether the streamflow is overestimated or underestimated. The performance of the simulation with *P*_{2} is slightly worse at the NGS station with NSE of 0.8421 and 0.8429, and BIAS of 0.0433 and 0.0604 during the calibration and validation periods, respectively, but slightly better at the YC station with NSE of 0.8959 and 0.8995, and BIAS of −0.1337 and −0.1663. The values of BIAS at NGS show that the streamflow is overestimated slightly and the negative values at YC mean that the streamflow is underestimated with both *P*_{1} and *P*_{2}. In general, the daily simulation results simulated with the parameter set obtained by the SM at NGS and YC demonstrate decent applicability.

### Efficiency evaluation

The efficiency of the SM is also a key point, as our original intention is to save calibration time. One execution of VIC in the study area requires about 24 min on one core of an Intel Core i5-7500 CPU (3.40 GHz; Ubuntu 16.04 LTS desktop). Hence, calibrating the parameters with the original model takes a long time even in parallel mode, where all four cores were used.

The optimal Pareto fronts of VIC in the crucial range in the final generation of *ɛ*-NSGA II (when the whole population reaches 10,000) were used as the reference fronts to calculate the *ɛ*-indicators of the optimal fronts of VIC in the crucial range obtained from each generation and those of the SM in the crucial range obtained from each iteration. Figure 11 shows the variation of the *ɛ*-indicator against the time consumed. The time consumed for parameter calibration shrinks sharply when the SM is used instead of VIC. The computational time to acquire the solutions of the SM is about 58 h with an *ɛ*-indicator of 0.0164, of which the time consumed for sampling accounts for 78.72%. The *ɛ*-indicator for the VIC model remains larger than 0.0164 until the optimization reaches 250 generations, which costs about 561 h. The computational saving is thus 89.66% at the point where the *ɛ*-indicator for VIC first drops below that for the SM.
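As a sketch of how these front comparisons can be computed, the function below implements the common additive *ɛ*-indicator for minimization problems: the smallest shift by which the approximate front must be improved so that every reference point is weakly dominated. We assume this is the form used in the paper; the fronts below are illustrative.

```python
def additive_eps_indicator(approx_front, ref_front):
    """Additive epsilon-indicator (minimization): the smallest eps such
    that every reference point r is weakly dominated by some approximate
    point a shifted by -eps in every objective."""
    return max(
        min(
            max(a_i - r_i for a_i, r_i in zip(a, r))
            for a in approx_front
        )
        for r in ref_front
    )

# illustrative two-objective fronts (both objectives minimized)
ref = [(0.0, 1.0), (1.0, 0.0)]
approx = [(0.1, 1.1), (1.2, 0.05)]
print(round(additive_eps_indicator(approx, ref), 6))  # → 0.2
```

A smaller indicator means the approximate front lies closer to the reference front, so tracking its decay against wall-clock time (as in Figure 11) shows how quickly each approach converges.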

## DISCUSSION

### Comparison of different sampling methods

Sampling with *ɛ*-NSGA II over a small number of generations was adopted in our work to acquire more samples in the crucial range. To show the effect of using the *ɛ*-NSGA II-based sample generator instead of uniformly distributed sampling algorithms, two popular approaches, GLP and LH, were compared with it.
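For context, a minimal LH sampler can be sketched as follows. This is a generic stratified sampler under hypothetical parameter bounds, not the implementation used in the study: each dimension is divided into equal-probability strata, one point is drawn per stratum, and the strata are shuffled independently across dimensions.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Minimal Latin hypercube sampler: each dimension is split into
    n_samples equal strata, one point is drawn per stratum, and the
    strata are randomly paired across dimensions."""
    rng = random.Random(seed)
    dims = len(bounds)
    samples = [[0.0] * dims for _ in range(n_samples)]
    for d, (lo, hi) in enumerate(bounds):
        # one uniform point per stratum, then a random permutation
        points = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(points)
        for i in range(n_samples):
            samples[i][d] = lo + (hi - lo) * points[i]
    return samples

# e.g. 480 samples in a hypothetical 7-dimensional unit parameter space
pts = latin_hypercube(480, [(0.0, 1.0)] * 7)
print(len(pts), len(pts[0]))  # → 480 7
```

Unlike the *ɛ*-NSGA II generator, such a sampler spreads points evenly over the whole parameter space, which is exactly why it places few samples inside a small crucial range.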

In this study, *ɛ*-NSGA II generated the training samples over 20 generations with a population of 24 in each generation. Accordingly, the sample size for GLP and LH was set to 480. To obtain enough samples in both the crucial range and the trivial range, the testing samples consisted of two parts: the first part was generated with MC sampling, and the other part was the population of the final generation (the 280th) of *ɛ*-NSGA II when calibrating the VIC parameters without the SM. The total size of the testing samples was 306.

Figure 12 shows the performance of the SM in the global simulation step with the three sample generators. As shown in Figure 12, the SM trained with samples generated by *ɛ*-NSGA II performs better than the SM trained with samples generated by GLP and LH. In particular, for BIAS_{NGS} and BIAS_{YC}, the ER of the SM with GLP and LH reaches around 50%. The sources of ER also differ among the three sample generators. For the SM with the *ɛ*-NSGA II-based generator, the first type of error makes the major contribution to ER, whereas the ERs of the SM with GLP and LH mainly stem from the second type of error, meaning that the SM with GLP and LH is much more likely to misclassify points in the crucial range into the trivial range. This is caused by insufficient training samples in the crucial range when the GLP and LH sample generators are used. If these uniformly distributed sample generators were utilized to train the SM, additional iterations would be needed to update it.
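The decomposition of ER into its two sources can be illustrated with a short sketch. The labels and counts below are hypothetical; only the bookkeeping mirrors the description above, with the second type of error being the costlier one, since crucial points misclassified as trivial are lost to the local simulation step.

```python
def error_rates(true_labels, pred_labels):
    """Split the misclassification rate ER into its two sources:
    type 1 -- trivial points predicted as crucial;
    type 2 -- crucial points predicted as trivial."""
    n = len(true_labels)
    type1 = sum(1 for t, p in zip(true_labels, pred_labels)
                if t == "trivial" and p == "crucial")
    type2 = sum(1 for t, p in zip(true_labels, pred_labels)
                if t == "crucial" and p == "trivial")
    return type1 / n, type2 / n, (type1 + type2) / n

# hypothetical classification results on five testing samples
true_lbl = ["crucial", "crucial", "trivial", "trivial", "trivial"]
pred_lbl = ["crucial", "trivial", "crucial", "trivial", "trivial"]
t1, t2, er = error_rates(true_lbl, pred_lbl)
print(t1, t2, er)  # → 0.2 0.2 0.4
```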

### Limitations and prospects

Among multi-objective optimization algorithms, *ɛ*-NSGA II is a good choice; as shown in many previous studies, it is highly efficient in solving multi-objective problems (Hadka & Reed 2012; Hurford *et al.* 2014; Wang *et al.* 2014; Liu *et al.* 2017). Other methods, such as NSGA II, NSGA III (Deb & Jain 2013), MOCOM-UA (Yapo *et al.* 1998), and SPEA2 (Zitzler *et al.* 2001), could also be utilized as the optimization algorithm instead of *ɛ*-NSGA II. The efficiency comparison of different algorithms has been studied in many previous works (Reed *et al.* 2003; Hadka & Reed 2012) and is therefore not the focus of our work. However, the compatibility of the SM with different optimization algorithms is not investigated here and will be studied in our future work.

In this study, the NSE, which describes the skill of the modeled streamflow against observations, and BIAS, which reflects the error between the simulated and observed streamflow, were selected to evaluate the performance of the model. There are many other evaluation indicators and statistical indices that quantify the accuracy of streamflow estimates against gauge observations, such as the correlation coefficient, probability of detection, mean error (Zhu *et al.* 2016), and Kling–Gupta efficiency (Gupta *et al.* 2009). Although different choices of evaluation indicators may influence the calibrated parameters and the performance of VIC, the efficiency of the SM is not affected.

The UBR basin is used as a case study here. Figure 10 illustrates that two large peak flows occurred in the summers of 1999 and 2000, the first two years of the calibration period. The streamflow in the other years of the calibration period is similar to that in the validation period. According to Klemeš (1986), large floods should be included in the calibration period. This contributes to the good performance of both VIC and the SM in the validation period, especially at the YC station, where NSE_{YC} in the validation period is larger than that in the calibration period. The prediction capability of the SM is mainly limited by the original hydrological model: when extreme floods occur in the validation period and result in poor performance of VIC, the prediction capability of the SM is also affected.

In terms of the UBR basin, as shown in Figure 10, the high flow is overestimated at the NGS station and the low flow is underestimated at the YC station. The ungauged condition in the upper stream of the UBR basin may cause these deviations. Although the area of glaciers in the study basin is small, treating glacier melt water as part of snow melt water, without a dedicated glacier melt module, can also introduce small errors. Some studies indicate that remote sensing data, such as the TRMM 3B42 precipitation product (Wang *et al.* 2017) and GRACE total water storage changes (Chen *et al.* 2017), can improve the performance of VIC. In addition, embedding an energy-balance-based glacier algorithm into VIC in the study area can improve its performance (Zhao *et al.* 2013). In our work, the performance of VIC is acceptable, with the smallest NSE of 0.8421 and the largest absolute BIAS of 0.1782. Hence, the improvements mentioned above were not implemented here but can be adopted in our future work.

## CONCLUSIONS

In this study, a stepwise SM was proposed for VIC parameter calibration, and the SM-based algorithm was applied successfully in the UBR basin. AdaBoost was used to screen the parameter sets whose objective function values lie in the crucial range, and the objective function values in the crucial range were then simulated more accurately. *ɛ*-NSGA II was adopted as the calibration strategy. The main findings are as follows:

- (1) The AdaBoost algorithm performed well, with a low ER in the global simulation step and a small RMSE in the local simulation step. The response surface interpolated with the SM could be used for parameter calibration instead of VIC.
- (2) There is no large difference between the optimal Pareto solutions of the SM and those of VIC in the crucial range.
- (3) The stepwise SM accelerated the calibration process in the UBR basin remarkably, saving up to 90% of the time needed to obtain a solution similar to that of VIC.

## ACKNOWLEDGEMENTS

The authors thank the National Natural Science Foundation of China (91547106) and the National Key Research and Development Plan ‘Inter-governmental Cooperation in International Scientific and Technological Innovation’ (2016YFE0122100) for financial support. The China Meteorological Administration and the Tibet Bureau of Hydrology and Water Resources Survey are also gratefully acknowledged for providing the meteorological and hydrologic data used in this study.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.