## Abstract

Streamflow prediction of rivers is crucial for making decisions in watershed and inland waterways management. The US Army Corps of Engineers (USACE) uses a river routing model called RAPID to predict water discharges for thousands of rivers in a network. However, the calibration of hydrological streamflow parameters in RAPID is time-consuming and requires streamflow measurement data that may not be available at some ungauged locations. In this study, we address the calibration aspect of the RAPID model by exploring machine learning (ML)-based methods that facilitate efficient calibration of hydrological model parameters without the need for streamflow measurements. Various ML models are constructed and compared to learn a relationship between hydrological model parameters and various river parameters, such as length, slope, catchment size, percentage of vegetation, and elevation contours. The studied ML models include Gaussian process regression, Gaussian mixture copula, Random Forest, and XGBoost. This study shows that ML models that are carefully constructed by considering causal and sensitive input features offer a potential approach that not only obtains calibrated hydrological model parameters with reasonable accuracy but also bypasses the current calibration challenges.

## HIGHLIGHTS

Calibration of hydrology model using machine learning.

Learning unknown relationship between model parameters and river features.

Rapid calibration of hydrological model parameters.

Comparative study of different machine learning techniques.

## INTRODUCTION

Watershed and inland waterways management generate a broad range of crucial and often expensive engineering decision-making problems such as dredging (Yeates *et al.* 2020), analysis of a dam's remaining useful life (England 2018), flood prediction (Maidment 2017), understanding climate change (Lucas-Picher *et al.* 2003), and sediment and contamination analysis (Palermo *et al.* 2008). The key constituent of these decision-making analyses is the streamflow discharge prediction of rivers that form a river network. The flow of water in a river network is modeled through *river routing* that simulates the changes in the shape of a hydrograph as water moves through the rivers. The US Army Corps of Engineers (USACE) uses a computational river routing model, RAPID (David *et al.* 2011a) (http://rapid-hub.org/), which offers a numerical solution to river routing modeled by a physics-based hydrological model called the *Muskingum method* to simultaneously predict water discharges in all the rivers in the network using parallelization (David *et al.* 2011a; Rouholahnejad *et al.* 2012).

The application of the Muskingum model in any form involves both a calibration step and a prediction step. The calibration step includes estimating calibrated streamflow parameters *k* and *x* (which will be discussed later) for each river by using historical inflow–outflow discharge measurements at certain gauge stations spread across a given geographic region (USACE uses discharge measurements from gauges spread across the United States for calibration). Since the Muskingum method was first introduced by McCarthy in 1938 (McCarthy 1938; Cunge 1969), numerous research articles have targeted the calibration problem to improve the prediction accuracy of streamflow forecasts by incorporating nonlinearities in river routing problems, making the calibration more computationally efficient via curve-fitting, and employing parallel computing, Bayesian methods, optimization techniques, and other statistical methods (see, for example, Gill 1978; Yoon & Padmanabhan 1993; Mohan 1997; Das 2004; Chu & Chang 2009; Rouholahnejad *et al.* 2012; Xu *et al.* 2012). In terms of effective computational implementation of river routing, a breakthrough was achieved by David *et al.* (2011a, 2011b, 2013; see also Tavakoly *et al.* 2017), who developed the RAPID model, a matrix form of the Muskingum model through which streamflow can be predicted for a large set of rivers that comprise a large network. The most attractive feature of RAPID is its ability to use efficient parallel computing. However, despite RAPID being a state-of-the-art computational tool equipped with parallel computing, calibration of these streamflow parameters is a computationally expensive and time-consuming process (especially when done at a continental or global scale) and requires historical streamflow measurement data, which may not be available for certain ungauged locations.
This may potentially hinder an accurate prediction of streamflow forecasting in an ungauged region where there is no measurement data available. To address this challenge, our goal is to use machine learning (ML) to establish a causal connection between hydrological parameters and a wide range of hydrological and topological features extracted through satellite imagery. This data-driven approach will enable us to predict these parameters even in ungauged locations without relying on the currently used river routing model-based calibration process.

In recent years, there has been a significant increase in the hydrologic community's interest in machine learning. This surge can be attributed to the rapid expansion of hydrologic data repositories and the successful implementation of ML in diverse academic and commercial applications. This newfound enthusiasm is largely facilitated by the increasing accessibility to supportive hardware and software. Studies have demonstrated the accuracy of data-driven models in predicting extreme events and capturing valuable information from hydrological datasets, outperforming physics-based models in several aspects. Over a decade ago, Todini (2007) highlighted the divergence of opinions between physical process-oriented and system engineering-oriented modelers regarding data-driven models in hydrology. While physical process-oriented modelers expressed skepticism due to the heavy reliance on training sets, system engineering-oriented modelers argued that data-driven models outperformed complex physics-based models in forecasting. With advancements in ML and computational power, hydrology has seen active adoption of data-driven models. Frame *et al.* (2022) demonstrated the accuracy of deep learning-based data-driven models in predicting extreme rainfall–runoff events compared to physics-based models. Similarly, Kratzert *et al.* (2019) and Nearing *et al.* (2021) noted the ability of data-driven models to capture significant information from hydrological datasets and outperform physics-based models, including future predictions in ungauged basins and knowledge transfer between basins using modified LSTM networks. Kan *et al.* (2020) used an ANN with the K-nearest neighbor-based clustering method to propose a hybrid ML hydrological model capable of forecasting floods. 
Guse *et al.* (2017) noted the importance of model parameters in hydrological models and focused on establishing and investigating the connection between the model parameters and different performance criteria that evaluate the robustness of the hydrological model. Fernandez-Palomino *et al.* (2021) highlighted that model calibration purely using discharge data is problematic since it does not guarantee the correct representation of hydrological processes. Instead, Fernandez-Palomino *et al.* (2021) opted for a multi-objective calibration that includes additional information captured by vegetation data and hydrological signatures. Zhang *et al.* (2018) exploited the regression tree ensemble approach and compared it with three other widely used approaches (multiple linear regression, multiple log-transformed linear regression, and hydrological modeling) to assess the prediction accuracy of 13 runoff characteristics or signatures in Australia. Frame *et al.* (2021) utilized up to 26 parameters, ranging from watershed attributes to meteorological data, to build streamflow prediction models and perform model diagnostics. A recent study by Liu *et al.* (2023) compares the flood simulation capabilities of a hydrological model and ML-based model. Readers are recommended to refer to Xu & Liang (2021) and the references therein for a comprehensive and non-technical review of the application of ML-based methods in hydrology.

The objective of this paper is to establish a relationship between the hydrological parameters and the attributes reported by global satellite monitoring systems, thereby enhancing physics-derived modeling on a large scale. Although ML models can directly be used to develop a hybrid river routing model that improves the discharge predictions by incorporating additional hydrological and topological information in tandem with the standard physics-based model to predict discharge, it is still worthwhile to obtain data-informed hydrological parameters that serve as input to the river routing model. Three primary reasons motivate our current investigation:

1. There are specific applications where physics-based models remain desirable. Despite its simplicity, the Muskingum model continues to be employed in many of the largest-scale hydrological applications (see Tavakoly *et al.* 2017; Grogan *et al.* 2022). It is these users who can benefit from improved methodologies for parameter selection and calibration. This investigative approach aims to explore the implications of using ML to learn hydrological model parameters that feed into river routing models.
2. Some of these hydrological parameters are macro variables that have physical meaning. For example, the *k* parameter that is the primary focus of our current investigation is linked to the flow travel time within the river reach. The significance of these parameters extends beyond their utilization within the routing model. They prove valuable in other hydrological domains, such as sediment forecasting, reservoir management, river flow estimation and flood control, dam break analysis, and groundwater studies, among others (for example, see Marcus 1989; Meyer *et al.* 2018). Consequently, the implications of this research reach far beyond supporting the routing model alone. Moreover, due to the intricate dynamics of rivers and the dependence of their flow on both the watershed and their own geometric characteristics, the process is likely influenced by multiple interrelated parameters. This complexity makes ML an appealing approach for obtaining macro hydrological quantities, including the *k* parameter, by leveraging geography-dependent features. The traditional *k* parameter is obtained by forcing the RAPID predictions to closely match the measured discharge. As pointed out in Fernandez-Palomino *et al.* (2021), judging the hydrological model reliability purely based on discharge-based calibration is problematic, as it does not guarantee the correct representation of internal hydrological processes. In this paper, our goal is to leverage ML to obtain the *k* parameter even in ungauged locations by utilizing geographical features that can be obtained from satellite data.
3. River dynamics and their dependency on the proximal surroundings and environmental conditions in the given region are very complex. It is practically impossible to obtain a closed-form physics-based model that accurately models the flow. Defining the prediction bounds is also challenging since quantifying the uncertainties in the causal parameters is difficult. In that regard, our aim is to leverage the power of advanced ML methods to calibrate a hydrological quantity, *k*, which possibly has dependencies on numerous variables or embedded features. Theoretically, the *k* parameter obtained using the ML model is more reliable and meaningful, as it is obtained as a functional dependency of factors that influence *k*. This contrasts with the traditional approach, where the *k* parameter is obtained by forcing the RAPID predictions to be as close as possible to the measured discharge.

In summary, by obtaining a more informed and improved value of these parameters, we aim to enhance the predictive capabilities of the hydrological model (even if the improvement is marginal) and improve our understanding of the flow dynamics within the river system.

To do so, we start by reasonably assuming that the Muskingum parameters bear a functional relationship with topography and the hydrodynamic characteristics of the system. This paper focuses on calibrating the streamflow parameter *k* since the Muskingum model is more sensitive to *k*; the same methodology may also be applied to calibrate *x*. The objective is to carefully consider various features that could potentially have a causal relationship with the calibrated *k*, investigate the sensitivity of these features, select the most sensitive features, and finally learn the unknown functional relationship between these features and the calibrated Muskingum parameters. We consider topographical and hydrodynamic features extracted from the rivers and their watersheds. These include properties like river length, river slope, vegetation data, catchment area, and elevation data and its gradient. The features related to the vegetation data and the elevation data are extracted using the MODIS Vegetation Continuous Field (VCF) (https://lpdaac.usgs.gov/products/mod44bv006/) and Digital Elevation Model (DEM) satellite images (http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_Hydro/), respectively. We extract two sets of features from the VCF and DEM images. The first set of features comprises the statistical moments of the vegetation and elevation data that capture their spatial distribution, whereas the second set of features is extracted using a Convolution Neural Network (CNN)-based autoencoder that takes these images as input and extracts the useful features. Rivers and watersheds can be grouped into various classes/clusters based on their shared properties (like the length and slope of rivers, and the mean vegetation percentage of the watershed). We use a Gaussian Mixture Model (GMM)-based clustering technique to group the rivers into different clusters.
For each cluster consisting of numerous rivers, these features are used as input, and the calibrated *k* (obtained from the calibration procedure of RAPID) is used as an output to train the ML models. The studied ML models include Gaussian process regression (GPR), Gaussian mixture copula (GMC), Random Forest, and XGBoost.

The rest of the paper is arranged as follows. Section 2 lays a brief background of the Muskingum model and the calibration procedure of RAPID. Section 3 describes the solution flow of the ML-based calibration of *k* and investigates various ML models. Section 4 delineates and compares the numerical results of using different ML models and also considers the impact of different feature extraction techniques. Finally, Section 5 concludes the paper and lists ongoing research directions.

## BACKGROUND OF RIVER ROUTING MODEL

### The linear Muskingum model and RAPID

The linear Muskingum model describes a single river reach using two streamflow parameters, *k* and *x*. The parameter *k* is the storage time constant for the river reach, with a value reasonably close to the flow travel time within the river reach, and the parameter *x* is a dimensionless weighting factor that quantifies the relative influence of the inflow and the outflow on the volume of the channel or river reach. Consider a channel with $S(t)$ denoting the absolute channel storage at time *t*; as per the linear Muskingum model, we have (see Tung 1985)

$$S(t) = k\left[x\,I(t) + (1 - x)\,Q(t)\right]$$

where $I(t)$ and $Q(t)$ are the rates of inflow and outflow at time *t*, respectively. Therefore, by definition, the flow continuity equation can be written as

$$\frac{\mathrm{d}S(t)}{\mathrm{d}t} = I(t) - Q(t)$$

Combining these two equations and discretizing over a time step $\Delta t$ yields the familiar update for the outflow,

$$Q(t + \Delta t) = C_1 I(t + \Delta t) + C_2 I(t) + C_3 Q(t)$$

with

$$C_1 = \frac{\Delta t/2 - kx}{k(1-x) + \Delta t/2}, \qquad C_2 = \frac{\Delta t/2 + kx}{k(1-x) + \Delta t/2}, \qquad C_3 = \frac{k(1-x) - \Delta t/2}{k(1-x) + \Delta t/2}.$$

David *et al.* (2011a) build on this single-channel Muskingum model to develop a river routing model for the entire river network, yielding the following matrix form of the Muskingum model:

$$(\mathbf{I} - \mathbf{C}_1 \mathbf{N})\,\mathbf{Q}(t+\Delta t) = \mathbf{C}_1 \mathbf{Q}^{e}(t) + \mathbf{C}_2 \left[\mathbf{N}\mathbf{Q}(t) + \mathbf{Q}^{e}(t)\right] + \mathbf{C}_3 \mathbf{Q}(t)$$

Here, $\mathbf{I}$ is the identity matrix; $\mathbf{N}$ is the river network (connectivity) matrix; $\mathbf{C}_1$, $\mathbf{C}_2$, and $\mathbf{C}_3$ are diagonal matrices with elements $C_{1j}$, $C_{2j}$, and $C_{3j}$, respectively, which are functions of the time interval $\Delta t$ and the streamflow parameters $k_j$ and $x_j$ for $j = 1, \ldots, m$ in a river network with *m* river reaches, as defined above for a single reach; $\mathbf{Q}$ is a vector of outflow from each reach; and $\mathbf{Q}^{e}$ is a vector of lateral inflow for each reach (collected in the watershed of the river). To better understand this matrix form of the Muskingum model, the readers are referred to a pedagogical example of a river network consisting of five reaches, two nodes, and two gauge locations provided in David *et al.* (2011a).
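As a toy illustration of the matrix update (the three-reach network, parameter values, and inflows below are synthetic and chosen only for the example), one time step can be sketched as:

```python
import numpy as np

# Toy 3-reach network: reaches 0 and 1 flow into reach 2.
N = np.array([[0, 0, 0],
              [0, 0, 0],
              [1, 1, 0]], dtype=float)          # river network (connectivity) matrix

dt = 1800.0                                      # time step [s]
k = np.array([3600.0, 4000.0, 5000.0])           # storage time constants per reach [s]
x = np.full(3, 0.2)                              # weighting factors per reach

d = k * (1 - x) + dt / 2
C1 = np.diag((dt / 2 - k * x) / d)               # diagonal Muskingum coefficient matrices
C2 = np.diag((dt / 2 + k * x) / d)
C3 = np.diag((k * (1 - x) - dt / 2) / d)
I = np.eye(3)

def step(Q, Qe):
    """One step of (I - C1 N) Q(t+dt) = C1 Qe + C2 (N Q + Qe) + C3 Q."""
    rhs = C1 @ Qe + C2 @ (N @ Q + Qe) + C3 @ Q
    return np.linalg.solve(I - C1 @ N, rhs)
```

Because $C_{1j} + C_{2j} + C_{3j} = 1$ for each reach, iterating `step` with constant lateral inflows drives each reach's outflow to its own lateral inflow plus the upstream contributions, as expected from mass balance.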

### Calibration of the streamflow parameters in RAPID

The calibration of the streamflow parameters in RAPID is described in David *et al.* (2011a) and Tavakoly *et al.* (2017). The calibration procedure of RAPID involves obtaining multiplication factors $\lambda_k$ and $\lambda_x$ such that the calibrated parameters for each river reach are $k_j = \lambda_k k_{0j}$ and $x_j = \lambda_x x_{0j}$, where $k_{0j}$ and $x_{0j}$ denote initial estimates. Tavakoly *et al.* (2017) proposes three experiments/options for the initial estimates $k_{0j}$ of the *m* river reaches, defined in terms of the river length and slope (whose values can be obtained from the NHD*Plus* dataset), a reference water wave celerity $c_0$, and the inverse of the average velocity for the *j*th river in the network, with $c_0$ computed using the mean length of the rivers in the network (see Tavakoly *et al.* 2017).

As mentioned in Tavakoly *et al.* (2017), for the first experiment (Equation (7a)), the wave celerity is assumed to be independent of topography and is constant for all river reaches. Although a simple assumption, this choice of initial *k*-parameter makes it devoid of reality since it implies that the travel time of a flow wave is independent of the topography. The second and third experiments (Equations (7b) and (7c)) tackle the topographical dependence of the initial *k*-parameter by defining it in terms of the length and slope, which are topographical features. For the third experiment, to avoid the influence of extreme values on celerity, the celerity values are restricted between the 5 and 95% thresholds based on the cumulative probability function. Tavakoly *et al.* (2017) concluded that the predictions of RAPID improved when topographical features were explicitly considered in the choice of the initial *k*-parameter. This is our primary motivation to assume that the Muskingum parameters bear a functional relationship with topography and the hydrodynamic characteristics of the system.

RAPID's calibration procedure then uses an optimization algorithm to estimate the *k* and *x* parameters of all the rivers in the network by minimizing the mismatch between the predicted and observed discharges. For *n* gauge locations, and for an optimization time-window ranging from the first day $t_1$ to the last day $T$, the cost function is defined as

$$J = \sum_{j=1}^{n} \sum_{t=t_1}^{T} \left[ Q_j(t) - Q_j^{\mathrm{obs}}(t) \right]^2$$

where $Q_j(t)$ is the discharge predicted by RAPID and $Q_j^{\mathrm{obs}}(t)$ is the measured discharge at gauge *j* on day *t*.

The calibration process described in this section (which is currently being used in the RAPID model run by USACE), although state-of-the-art, is time-consuming and depends on the availability of observation data from gauges across the United States. Therefore, we can only expect good predictions in the geographical regions where gauge measurements are available. This limits the streamflow prediction capability in geographic regions where good measurement data are unavailable or inaccessible. As pointed out in Fernandez-Palomino *et al.* (2021), judging the hydrological model reliability purely based on discharge-based calibration is problematic, as it does not guarantee the correct representation of internal hydrological processes. In the next section, we propose a data-driven ML-enabled calibration of the streamflow parameters that tackles the aforementioned challenges in the traditional calibration approach.
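To make the gauge-based calibration concrete, here is a minimal single-reach sketch (not RAPID itself; the hydrograph, parameter values, and variable names are invented for illustration): a synthetic "observed" hydrograph is generated with a known *k*, and the multiplication factor $\lambda_k$ applied to an initial estimate $k_0$ is recovered by minimizing a squared-error cost, with SciPy's `minimize_scalar` standing in for RAPID's optimization engine.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def route(inflow, k, x=0.3, dt=1800.0):
    """Linear Muskingum routing of one reach."""
    d = k * (1 - x) + dt / 2
    c1, c2, c3 = (dt / 2 - k * x) / d, (dt / 2 + k * x) / d, (k * (1 - x) - dt / 2) / d
    q = np.empty_like(inflow)
    q[0] = inflow[0]                       # assume an initial steady state
    for t in range(1, len(inflow)):
        q[t] = c1 * inflow[t] + c2 * inflow[t - 1] + c3 * q[t - 1]
    return q

# Synthetic inflow hydrograph and "gauge record" generated with a true k of 4,000 s.
inflow = np.concatenate([np.full(5, 10.0), np.linspace(10, 100, 10), np.linspace(100, 10, 30)])
q_obs = route(inflow, k=4000.0)

k0 = 2500.0                                 # initial estimate (e.g., from length/celerity)
cost = lambda lam: np.sum((route(inflow, lam * k0) - q_obs) ** 2)
lam_opt = minimize_scalar(cost, bounds=(0.1, 5.0), method="bounded").x
```

In RAPID the same idea is applied jointly over all reaches and gauges, which is what makes the procedure computationally expensive at continental scale.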

## THE PROPOSED ML-ENABLED CALIBRATION OF THE STREAMFLOW PARAMETERS

### Overview

To establish a data-driven calibration model, we aim to obtain statistical features that have a causal relationship with the calibrated streamflow parameters. We hypothesize that these features are dependent on the topography and the hydrodynamic characteristics of the region of interest. We recall that this paper focuses on calibrating the streamflow parameter *k* since the Muskingum model is more sensitive to *k*, and the same methodology can be applied to calibrate *x*. We extract the features from the raw datasets of various topographic quantities (discussed in Section 3.2) by using two feature extraction techniques: (1) statistical moment analysis (see Section 3.3.1), and (2) a CNN-based autoencoder (see Section 3.3.2). We then use variance-based global sensitivity analysis (GSA) to select sensitive features. In theory, all the rivers and their respective watersheds could be classified into a countable number of groups/clusters based on their shared topographical and hydrodynamic properties. The sensitive features selected using variance-based GSA can be used to classify a river (and its respective watershed) based on its topographical and hydrodynamic properties. We use a GMM-based clustering technique to obtain these clusters (see Section 3.5). The optimal number of clusters is determined based on the Bayesian information criterion (BIC) (see Chen & Gopalakrishnan 1998). For each cluster, we then train ML models (discussed in Section 3.3) using the selected features as input data and the calibrated streamflow parameter as the output. We focus our attention on the following four ML techniques: (1) GPR (see Section 3.6.1), (2) GMC (see Section 3.6.2), (3) Random Forest (see Section 3.6.3), and (4) XGBoost (see Section 3.6.4). The following sections discuss these steps comprehensively.

### Capturing and extracting the raw data

Each river (and its corresponding watershed) in the network is identified by a *river ID*, and each river ID has a *shape file* (in .shp format) depicting the geographic shape of the watershed. We consider the following *raw input data sources*: length of river, slope of river (data obtained from NHD*Plus*), catchment area, vegetation data extracted using the MODIS VCF (https://lpdaac.usgs.gov/products/mod44bv006/), and elevation data obtained using the DEM satellite images (http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_Hydro/). The elevation and the percentage of vegetation data are available for various land segments in DEM (with extension *.dem*) and Hierarchical Data Format (with extension *.hdf*), respectively. These images for various land segments are glued together using the open-source Geospatial Data Abstraction Library (GDAL) (https://gdal.org/) to obtain the *global* vegetation and elevation maps in *.tiff* format for the United States consisting of seven catchment zones. Once the *global* vegetation and elevation maps are obtained, the *local* vegetation and elevation clips corresponding to each river ID (identifying a watershed) of interest are obtained by clipping the *global* map using GDAL. Once the elevation clips are obtained, we also obtain clips of the magnitude of the elevation's gradient, since it is reasonable to assume that the elevation's gradient has a direct impact on the river's streamflow. Figure 2 shows a schematic flowchart demonstrating the vegetation data extraction; a similar approach is undertaken to obtain the local elevation clips.

As an expected *output* for the ML model, we obtain the calibrated *k* values using RAPID's calibration procedure as described in Figure 1 for all the river reaches with gauges. At this point, each river ID of interest has a corresponding river length, river slope, local vegetation clip in *.tiff* format, and local elevation clip in *.tiff* format as the input data, and the calibrated *k* as the output data to build the ML model. The raw data in tabulated form is shown in Figure 3.

**Remark 1:** The proposed framework is designed to be flexible and can accommodate additional information or features as inputs to obtain hydrological parameters. In this paper, we intentionally focused on showcasing the framework's capabilities by using limited information related to watershed characteristics, specifically river slope, river length, watershed area, vegetation patterns, and elevation data. This choice was made for practical reasons, aiming to enable the development of an ML algorithm that can be expanded later to include more features. Moreover, this approach streamlined data collection and pre-processing, saving time that could be dedicated to building the ML algorithm and maintaining a more focused investigation.

### Feature extraction

#### Features extracted via statistical moment analysis

In the first approach, for each river ID, we obtain the statistical moments of the vegetation data, elevation data, and the magnitude of the elevation gradient extracted using the distribution of these quantities in their respective *local* clips. We extracted the first four statistical moments and decided to use the first two (i.e., the mean and the standard deviation) after GSA (discussed in detail in the upcoming Sections 3.4 and 4.1). The catchment size is obtained as the number of pixels in the local vegetation clip. This feature extraction technique gives us 10 features: river length, river slope, catchment size (defined as the number of non-zero pixels in the vegetation clip), catchment geographic area, and the remaining six features corresponding to the mean and standard deviation of the vegetation data, elevation data, and magnitude of the elevation gradient.
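A minimal sketch of how the 10 moment-based features could be assembled for one river ID (the function and variable names are our own, and the clips are assumed to be loaded from the *.tiff* files as 2D arrays):

```python
import numpy as np

def moment_features(length, slope, area, veg_clip, elev_clip):
    """Assemble the 10 moment-based features for one river ID.

    veg_clip / elev_clip are 2D arrays from the local .tiff clips;
    zero pixels in the vegetation clip lie outside the watershed.
    """
    # Magnitude of the elevation gradient, computed from the elevation clip.
    grad_y, grad_x = np.gradient(elev_clip.astype(float))
    grad_mag = np.hypot(grad_x, grad_y)

    # Catchment size = number of non-zero pixels in the vegetation clip.
    catchment_size = float(np.count_nonzero(veg_clip))

    feats = [length, slope, catchment_size, area]
    for field in (veg_clip, elev_clip, grad_mag):
        vals = np.asarray(field, dtype=float)[veg_clip != 0]  # watershed pixels only
        feats += [vals.mean(), vals.std()]                    # first two moments
    return np.array(feats)
```

The higher moments (skewness and kurtosis) would be computed the same way; as noted above, they were dropped after the sensitivity analysis.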

#### Features extracted via a convolution autoencoder

A *convolution autoencoder* integrates an autoencoder and a CNN. The autoencoder comprises the encoder, the decoder, and a latent space known as a bottleneck. The encoder takes an image and extracts the latent features, thereby transforming a high-dimensional image into a lower-dimensional feature space (called the bottleneck). The decoder, on the other hand, uses the features extracted by the encoder to recover the image. In a convolution autoencoder, the encoder and the decoder use a CNN structure to achieve their respective targets of data transformation. The convolution autoencoder, therefore, first compresses the input data from a high-dimensional form to a lower-dimensional latent space using a convolution encoder, and the decoder uses the convolution transpose to convert the lower-dimensional latent features into a reconstruction of the original higher-dimensional input (Chen *et al.* 2017). The difference between the attempted reconstructed data and the original input data is called the *reconstruction error*. Therefore, the *convolution autoencoder* is trained using a large amount of input data (numerous vegetation clips in our case) so as to minimize the reconstruction error. The network learns to exploit natural structure in the input data to find an efficient lower-dimensional representation of the input data. Among various neural networks, the convolution autoencoder is the most suitable architecture to extract features from a high-dimensional image because of two primary benefits:

1. The CNN-based architecture can better retain the connectivity information between the pixels of an image (Guo *et al.* 2017).
2. Slicing and stacking the data, as in other neural network architectures, leads to a large loss of information. The convolution autoencoder, rather than stacking the data, preserves the spatial information of the input image and extracts information gently through the convolution layers.

We use the convolution autoencoder to extract latent features from the local vegetation clips. Since training the convolutional autoencoder is a computationally expensive and time-consuming process, we limit its use to extracting features for the vegetation clips and use statistical features for the elevation clips. This feature extraction technique gives us 14 features: river length, river slope, catchment size (defined as the number of non-zero pixels in the vegetation clip), catchment geographic area, six CNN-based features of the vegetation data, the mean and standard deviation of the elevation data, and the mean and standard deviation of the elevation gradient's magnitude.

To train the model, we use all the raw local vegetation clips (column 5 of Figure 3). We start by reshaping all the input images to a common size; the reshaped images are then loaded into the *DataLoader* class of the *PyTorch* package, which is then used to train the convolution autoencoder model.

**The encoder and decoder:** The convolution encoder used for this research consists of 2D convolution layers, each followed by a pooling layer, a flattening layer, and a leaky rectified linear unit (ReLU). The first convolution layer deploys a kernel of size 9 (a scanner) and has one input channel (the reshaped image) and 16 output channels. To assist the kernel with processing the image, padding is added to the frame of the image to allow more space for the kernel to cover the image. For the first convolution layer, we use padding of 4 pixels with zero value and a stride of 1 (the stride is the number of pixels the kernel shifts over the input matrix). The first convolution layer is followed by the application of the ReLU, which rectifies any negative value in the output channels of the first convolution layer. Following this, two-dimensional max-pooling is imposed to decrease the dimensions of the first convolution layer's output. The output of the first convolution layer acts as the input to the second convolution layer. Therefore, the second convolution layer has 16 input channels and yields six output channels. For the second layer, we use the same kernel size, padding size, stride, and max-pooling size as the first layer. The output of the second convolution layer after max-pooling contains 1,440 values. These 1,440 numbers are flattened, and the latent features are extracted using a Fully Connected Network (FCN), yielding six features. Along similar lines, the decoder uses the latent space to reconstruct the original input image using two transpose convolution layers. The convolution autoencoder is trained to minimize the reconstruction loss quantified by the mean-square error between the reconstructed and the original input datasets. Figure 4 illustrates the convolution autoencoder architecture adopted in this paper.
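The encoder half can be sketched in PyTorch as follows. The channel counts (1 → 16 → 6), kernel size 9, padding 4, stride 1, and six latent features follow the description above; the input size of 64 × 60 pixels and the 2 × 2 pooling are assumptions chosen only so that the flattened output has the stated 1,440 values:

```python
import torch
import torch.nn as nn

class VegEncoder(nn.Module):
    """Encoder half of the convolution autoencoder; the decoder would mirror
    it with two ConvTranspose2d layers. Input assumed to be 1 x 64 x 60."""

    def __init__(self, n_latent=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=9, stride=1, padding=4),
            nn.ReLU(),
            nn.MaxPool2d(2),                          # -> 16 x 32 x 30
            nn.Conv2d(16, 6, kernel_size=9, stride=1, padding=4),
            nn.ReLU(),
            nn.MaxPool2d(2),                          # -> 6 x 16 x 15 = 1,440 values
        )
        self.fc = nn.Linear(6 * 16 * 15, n_latent)    # fully connected bottleneck

    def forward(self, x):
        h = self.conv(x)
        return self.fc(h.flatten(start_dim=1))        # six latent features
```

Training would pair this with the mirrored decoder and minimize the mean-square reconstruction error over all vegetation clips.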

### Feature selection using the variance-based GSA

Variance-based GSA quantifies the sensitivity of a quantity of interest (QoI, denoted *Y*, which for this paper is the calibrated *k*) to the input variables. Various approaches have been developed for GSA in the past decades, such as the Fourier amplitude sensitivity test (McRae *et al.* 1982), correlation ratio, Kullback-Leibler divergence (Greegar & Manohar 2015), and Sobol's indices (Sudret 2008). Among these various methods, variance decomposition-based GSA is one of the most widely used (Sudret 2008). In this method, the variance of the QoI can be decomposed into contributions of individual variables, denoted by $V_i$, and the interactions between different variables, denoted by $V_{ij}, V_{ijk}, \ldots$, such that

$$\mathrm{Var}(Y) = \sum_{i=1}^{d} V_i + \sum_{i < j} V_{ij} + \cdots$$

where $d$ is the number of uncertain input variables and $V_i = \mathrm{Var}_{X_i}\!\left(\mathbb{E}[Y \mid X_i]\right)$; the first-order Sobol index of variable $X_i$ is $S_i = V_i / \mathrm{Var}(Y)$.

GSA usually requires a double-loop Monte Carlo simulation (MCS), as indicated in the above equation. The double-loop MCS needs to evaluate the simulation model thousands of times. This is inapplicable to the studied problem since there is no model available to represent the relationship between the driving factors and the calibrated streamflow parameter *k*. To overcome this challenge, a data-driven probability model-based GSA method is employed. A data-driven GSA method only requires a data matrix to compute the first-order Sobol's indices (see Hu & Mahadevan 2019). It consists of three main steps, which are briefly explained as follows:

Step 1: Extract the data of the input variable $X_i$ and that of the QoI *Y* from the original data matrix.

Step 2: Build a probability model using a GMM.

Step 3: Compute the Sobol's indices using the constructed GMM.

It has been shown that this method can achieve a similar accuracy level for GSA compared to the double-loop MCS method. We direct interested readers to Hu & Mahadevan (2019) for a detailed description of the probability model-based GSA method.
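As a simplified stand-in for the probability model-based method (quantile binning replaces the GMM-based conditional expectation of Hu & Mahadevan (2019)), the first-order Sobol' indices can be estimated directly from a data matrix:

```python
import numpy as np

def first_order_sobol(X, y, n_bins=20):
    """Estimate S_i = Var(E[Y | X_i]) / Var(Y) for each column of X from data,
    approximating the conditional expectation by quantile binning."""
    y = np.asarray(y, dtype=float)
    var_y = y.var()
    indices = []
    for i in range(X.shape[1]):
        edges = np.quantile(X[:, i], np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, X[:, i], side="right") - 1, 0, n_bins - 1)
        occupied = [b for b in range(n_bins) if np.any(bins == b)]
        cond_means = np.array([y[bins == b].mean() for b in occupied])
        weights = np.array([np.mean(bins == b) for b in occupied])
        # Variance of the conditional expectation, weighted by bin occupancy.
        mu = np.sum(weights * cond_means)
        indices.append(np.sum(weights * (cond_means - mu) ** 2) / var_y)
    return np.array(indices)
```

An influential input yields an index near 1, an irrelevant one an index near 0, which is exactly the screening behavior used here to select features.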

### Watershed classification via clustering by a GMM

A GMM models the PDF of the feature vector $\mathbf{z}$ of a river as a weighted sum of Gaussian components:

$$p(\mathbf{z}) = \sum_{q=1}^{Q} w_q\, \phi\!\left(\mathbf{z}; \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q\right)$$

Here, *Q* is the number of Gaussian components in the GMM model, $\phi(\cdot; \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)$ is the probability density function (PDF) of a multi-variate Gaussian distribution, $w_q$ is the weight of the $q$th Gaussian component, and $\boldsymbol{\mu}_q$ and $\boldsymbol{\Sigma}_q$ are respectively the mean vector and covariance of the $q$th Gaussian component, such that $w_q \ge 0$ and $\sum_{q=1}^{Q} w_q = 1$.
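A sketch of the clustering step with scikit-learn, selecting the number of components by BIC as described above (the helper name and feature values below are illustrative, not our production code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_rivers(features, max_q=6, seed=0):
    """Cluster rivers by their selected features with a GMM, choosing the
    number of components Q that minimizes the BIC."""
    best_q, best_bic, best_model = None, np.inf, None
    for q in range(1, max_q + 1):
        gmm = GaussianMixture(n_components=q, n_init=3, random_state=seed).fit(features)
        bic = gmm.bic(features)               # lower BIC = better fit/complexity trade-off
        if bic < best_bic:
            best_q, best_bic, best_model = q, bic, gmm
    return best_q, best_model.predict(features)
```

Each resulting cluster then gets its own regression model in the next step.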

### ML models for prediction of calibrated streamflow parameter

We investigate the performance of four ML models with varying complexity and characteristics for predicting the calibrated *k* parameter. GPR and GMC generate probabilistic predictions, whereas XGBoost and Random Forest produce single-point predictions. Figure 5 provides an overview of the proposed ML-based calibration framework.

In the following sections, we provide a comprehensive description of all four methods, along with thorough testing and the presentation of their results. This will enable future readers to make informed decisions when applying similar approaches in their own work.

#### Gaussian process regression

A GPR model learns a mapping from the input features $\mathbf{x}$ to the output *y* (the calibrated *k* in this paper). As described in Williams & Rasmussen (2006), a GPR probabilistically models the output *y* as a realization of a Gaussian process (GP), such that

$$y(\mathbf{x}) \sim \mathrm{GP}\big(m(\mathbf{x}),\, \kappa(\mathbf{x}, \mathbf{x}')\big),$$

where $m(\mathbf{x})$ is the mean function and $\kappa(\mathbf{x}, \mathbf{x}')$ is the covariance (kernel) function. The hyperparameters of the mean and kernel functions, together with the noise variance, are estimated from the training data through Bayesian parameter estimation or maximum likelihood estimation (see Hu & Mahadevan 2016; Ramancha *et al.* 2022).
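A minimal GPR sketch with scikit-learn is given below; calling `fit()` estimates the kernel hyperparameters by maximizing the marginal likelihood, matching the maximum likelihood route described above. The 1-D toy function is illustrative, not the paper's feature set.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 5.0, (40, 1))              # hypothetical 1-D inputs
y = np.sin(X).ravel() + rng.normal(0, 0.05, 40) # noisy toy observations

# Kernel hyperparameters (signal variance, length scale, noise variance)
# are estimated by maximizing the marginal likelihood inside fit().
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_test = np.linspace(0.0, 5.0, 20).reshape(-1, 1)
mean, std = gpr.predict(X_test, return_std=True)  # probabilistic prediction
```

The returned `std` quantifies the prediction uncertainty, which is the property that distinguishes GPR (and GMC) from the tree-based models compared later.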

#### Gaussian mixture copula

Let $\mathbf{X} = [X_1, \ldots, X_n]$ denote the random input variables and *Y* denote the random variable corresponding to the output. A GMC constructs a probabilistic model of the joint PDF of the inputs and output, such that (see Hu & Mahadevan 2019):

$$f_{\mathbf{X},Y}(\mathbf{x}, y) = c\big(F_{X_1}(x_1), \ldots, F_{X_n}(x_n), F_Y(y)\big)\, f_Y(y) \prod_{i=1}^{n} f_{X_i}(x_i),$$

where $f_{X_i}(x_i)$ is the marginal PDF of $X_i$, $f_Y(y)$ is the marginal PDF of the output *Y*, $F_{X_i}(x_i)$ is the marginal cumulative distribution function (CDF) of $X_i$, and $c(\cdot)$ is a copula function approximated by a GMM with a parameter vector that is estimated from the data. The copula function is approximated using the GMM (see Hu & Mahadevan 2019) as:

$$c(u_1, \ldots, u_{n+1}) = \frac{\displaystyle\sum_{q=1}^{Q} w_q\, \phi\big(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_{n+1}); \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q\big)}{\displaystyle\prod_{j=1}^{n+1} \varphi\big(\Phi^{-1}(u_j)\big)}.$$

Here, *Q* is the number of Gaussian components in the GMM, $\varphi(\cdot)$ is the PDF of a standard normal random variable, $\Phi^{-1}(\cdot)$ is the inverse CDF of a standard normal random variable, $w_q$ is the weight of the $q$th Gaussian component, and $\boldsymbol{\mu}_q$ and $\boldsymbol{\Sigma}_q$ are, respectively, the mean vector and covariance matrix of the $q$th Gaussian component.

The maximum likelihood estimate of *y* given an input $\mathbf{x}$ can then be obtained as $\hat{y} = \arg\max_y f_{Y \mid \mathbf{X}}(y \mid \mathbf{x})$, and the corresponding mean estimate as $\bar{y} = E[Y \mid \mathbf{X} = \mathbf{x}] = \int y\, f_{Y \mid \mathbf{X}}(y \mid \mathbf{x})\, dy$. More details about the GMC model can be found in Hu & Mahadevan (2019). The advantage of the GMC model is that it provides a probabilistic prediction instead of a point estimate, which allows the uncertainty in the prediction to be quantified. Moreover, the probabilistic prediction is not limited to a Gaussian distribution, which is another advantage over the GPR model.
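The conditional prediction idea can be illustrated with a simplified stand-in: fitting a GMM directly to the joint $(X, Y)$ samples (rather than in copula space, as the GMC does) and evaluating the conditional mean $E[Y \mid X = x]$ in closed form from the mixture components. The nonlinear test relation is hypothetical.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gmm_conditional_mean(gmm, x):
    """E[Y | X = x] for a 2-D GMM fitted on columns (X, Y)."""
    mus, covs, ws = gmm.means_, gmm.covariances_, gmm.weights_
    # Posterior component weights given x (proportional to w_q * N(x; mu_x, S_xx)).
    dens = np.array([w * norm.pdf(x, m[0], np.sqrt(c[0, 0]))
                     for w, m, c in zip(ws, mus, covs)])
    dens /= dens.sum(axis=0)
    # Per-component conditional means: mu_y + S_yx / S_xx * (x - mu_x).
    cond = np.array([m[1] + c[1, 0] / c[0, 0] * (x - m[0])
                     for m, c in zip(mus, covs)])
    return (dens * cond).sum(axis=0)

rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, 4000)
y = x**2 + rng.normal(0.0, 0.1, 4000)   # nonlinear, non-Gaussian relation
gmm = GaussianMixture(n_components=8, random_state=0).fit(np.column_stack([x, y]))

x_query = np.array([-1.5, 0.0, 1.5])
y_hat = gmm_conditional_mean(gmm, x_query)
```

Because the mixture can place components along a curved relationship, the conditional mean tracks the parabola, something a single joint Gaussian cannot do.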

#### Random forest

Random forest is an ensemble supervised learning algorithm constructed from a set of decision trees, proposed by Breiman (see Breiman 2001; Qi 2012). This ensemble method improves the prediction accuracy of individual decision trees by combining multiple trees through the bagging (bootstrap aggregating) algorithm. Compared with a single decision tree, a random forest is less prone to overfitting and can efficiently handle a large number of input variables.

As the basis of a random forest, a decision tree consists of three types of nodes: the root node, decision nodes, and leaf nodes. The root node segregates the training dataset into different branches, decision nodes split on the attributes used to predict the output, and each branch ends in a leaf node holding a predicted value. Decision trees are sensitive to the training dataset; even small changes in the data can lead to significant differences among tree predictions (see Biau & Scornet 2016). A random forest counters this in two ways: each tree is trained on a random bootstrap sample of the training data, and at each split only a random subset of the input features is considered, with the best split chosen within that subset. As a result, different trees in the forest see different data and features, which reduces the correlation between trees and increases their diversity. The final prediction of a random forest is obtained by averaging the predictions of the individual trees, and since a large number of de-correlated trees can be generated, overfitting issues are minimized.

The random forest can handle both regression and classification problems; in this paper, it is used for regression. For the random regression forest used here, the number of trees is 170, selected through cross-validation, so the final prediction is the average of 170 decision trees.
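The tree-count selection can be sketched with scikit-learn's cross-validated grid search, mirroring how the 170-tree forest was chosen; the synthetic data and candidate grid are illustrative, not the paper's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, (300, 4))                      # hypothetical features
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(0, 0.05, 300)

# Select the number of trees by cross-validation; 170 appears in the grid
# because that is the value the paper's tuning arrived at.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100, 170]},
    cv=5, scoring="r2",
).fit(X, y)
best_rf = search.best_estimator_    # refit on the full training set
```

The refit `best_rf` averages the predictions of its `n_estimators` trees, exactly as described above.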

#### XGBoost

XGBoost is an extreme gradient boosting tree algorithm proposed by Chen & Guestrin (2016). It is an improved ensemble learning method based on the gradient boosting decision tree, whose objective is to combine several weak models into a collectively strong one. Like the random forest described above, it is built from a set of decision trees and can handle regression, classification, and ranking problems; in this paper, XGBoost is used to solve a regression problem. The difference from the random forest is that gradient boosting builds the trees sequentially: each new tree is fit to the residuals (prediction errors) of the ensemble built so far, and the final prediction combines all trees. The term 'gradient boosting' refers to the use of the gradient descent algorithm to minimize the objective function: the method first generates an initial model and computes its residuals for each training input, then fits the next model to those residuals, and repeats these two steps sequentially to obtain the final prediction. XGBoost has an advantage over standard gradient boosting trees because it uses L1 and L2 regularization, which improves computational speed and prediction accuracy during training.
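The residual-fitting loop described above can be sketched directly. This is a bare-bones gradient boosting routine for squared loss built from scikit-learn decision trees, not the full XGBoost algorithm (no regularization, column sampling, or second-order terms); the toy data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 2.0, (400, 1))                     # hypothetical inputs
y = np.sin(2.0 * X[:, 0]) + rng.normal(0, 0.05, 400)    # noisy target

# Minimal gradient boosting for squared loss: each new tree is fit to the
# residuals (the negative gradients) of the current ensemble prediction.
pred = np.full_like(y, y.mean())   # initial model F_0: the training mean
learning_rate, trees = 0.1, []
for _ in range(100):
    residuals = y - pred           # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)
```

After 100 rounds the ensemble's training error approaches the noise floor, illustrating why boosting turns many shallow "weak" trees into a strong predictor.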

### ML-based streamflow parameter calibration and discharge prediction framework

## RESULTS AND DISCUSSIONS

### GSA results

Based on the first-order Sobol’s indices illustrated in Figure 8, obtained through the GSA, we make the following observations:

- 1.
Length and slope are the most sensitive features since they have the two highest values of the first-order Sobol's index. This observation is consistent with prior engineering knowledge; for example, as shown in Equation (7), the initial calibrated parameter values are functions of the river's length and/or slope.

- 2.
Statistical moments of the elevation gradient direction have much lower contributions than the other factors (see Figure 8). The elevation gradient direction is therefore not included in the analysis.

- 3.
Statistical moments of the percentage of vegetation are more dominant than those of the magnitude of the vegetation-percentage gradient (see Figure 8). Therefore, we only consider the percentage of vegetation and ignore its gradient.

- 4.
For a given feature, contributions of the higher-order statistical moments are lower than those of the lower-order statistical moments. To maintain the number of variables at a manageable level, higher-order statistical moments are not included in the analysis.

- 5.
Based on GSA, we consider the following features for building the ML model: river length, river slope, the area of the watershed, the number of pixels in the vegetation clip, the first two moments (mean and the standard deviation) of the percentage of vegetation, elevation contour, and the magnitude of the elevation gradient.

### Clustering results

Table 2 reports the statistics (mean $\mu$, standard deviation $\sigma$, and coefficient of variation $\mathrm{CoV} = \sigma/\mu$) of the input parameters used for clustering (i.e., the river's slope, the river's length, and the mean of the vegetation percentage) for all six clusters.

| Cluster | Number of data points | % of the dataset | Number of training data points | Number of testing data points |
|---|---|---|---|---|
| 1 | 577 | 12.63% | 402 | 175 |
| 2 | 524 | 11.47% | 366 | 158 |
| 3 | 681 | 14.91% | 476 | 205 |
| 4 | 542 | 11.86% | 379 | 163 |
| 5 | 1,676 | 36.68% | 1,163 | 513 |
| 6 | 569 | 12.45% | 402 | 167 |


| Cluster | Slope $\mu$ | Slope $\sigma$ | Slope CoV (%) | Length $\mu$ | Length $\sigma$ | Length CoV (%) | Vegetation % $\mu$ | Vegetation % $\sigma$ | Vegetation % CoV (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0059 | 0.0035 | 58.9 | 2,789.7 | 2,029.8 | 72.8 | 10.9955 | 0.5689 | 5.2 |
| 2 | 0.0040 | 0.0032 | 80.1 | 1,535.1 | 949.5 | 61.8 | 14.4486 | 2.8188 | 19.5 |
| 3 | 0.0025 | 0.0017 | 67.5 | 4,246.3 | 3,126.7 | 73.6 | 15.6748 | 1.2285 | 7.8 |
| 4 | 0.0096 | 0.0085 | 88.1 | 2,953.9 | 2,048.7 | 69.4 | 16.1485 | 3.9934 | 24.7 |
| 5 | 0.0143 | 0.0232 | 161.8 | 3,662.2 | 3,006.1 | 82.1 | 14.6886 | 5.5317 | 37.7 |
| 6 | 0.0030 | 0.0031 | 100.8 | 3,363.4 | 2,539.4 | 75.5 | 14.9984 | 4.9923 | 33.3 |


Based on the statistics reported in Table 2, we make the following observations:

- 1.
The coefficient of variation for the distribution of length does not change appreciably for different clusters. That is, each cluster has similar variability in the lengths of the river.

- 2.
The coefficient of variation for slope and the vegetation mean is relatively low for clusters 1 and 3.

- 3.
The coefficient of variation for slope and the vegetation mean is relatively high for clusters 2, 4, 5, and 6.

### Comparison of different ML models

| Clusters | GPR | GMC | Random Forest | XGBoost |
|---|---|---|---|---|
| 1 | 0.8895 | 0.9758 | 0.9627 | 0.9688 |
| 2 | 0.8315 | 0.4322 | 0.8343 | 0.7855 |
| 3 | 0.9136 | 0.7600 | 0.9146 | 0.9147 |
| 4 | 0.4334 | −0.4849 | 0.7130 | 0.6574 |
| 5 | 0.7594 | 0.6851 | 0.7580 | 0.7580 |
| 6 | 0.5909 | 0.1620 | 0.5388 | 0.5469 |


| Clusters | GPR | GMC | Random Forest | XGBoost |
|---|---|---|---|---|
| 1 | 0.8665 | 0.9856 | 0.9573 | 0.9721 |
| 2 | 0.7845 | 0.3898 | 0.7918 | 0.7577 |
| 3 | 0.9186 | 0.7116 | 0.9016 | 0.9120 |
| 4 | 0.3101 | −0.3909 | 0.6621 | 0.9704 |
| 5 | 0.6548 | 0.6000 | 0.6878 | 0.7342 |
| 6 | 0.4833 | −0.1898 | 0.4642 | 0.5267 |


Based on the presented results, we make the following observations:

- 1.
The predictions for clusters 1 and 3 are consistently better than those for clusters 2, 4, 5, and 6 across all ML techniques. As noted at the end of Section 4.2, clusters 1 and 3 have a relatively low coefficient of variation for slope and vegetation mean, whereas clusters 2, 4, 5, and 6 have a relatively high one. Larger variability in the input parameters within a cluster results in a lower coefficient of determination of the prediction.

- 2.
The performance of most of the ML techniques is consistent. Judging from the $R^2$ values, XGBoost offers the best performance for most of the clusters, followed by Random Forest and GPR, with GMC the relatively worst performer. As can be observed in Figure 11 and reflected in Tables 3 and 4, the $R^2$ for clusters 3 and 6 corresponding to the GMC is notably worse because of a few large outliers.

- 3.
Among the four ML techniques, GPR and GMC provide probabilistic predictions, whereas the Random Forest (see Figure 12) and XGBoost (see Figure 13) yield single-point predictions. Figures 10 and 11, corresponding to the GPR and GMC, respectively, plot the mean value of the predicted calibrated *k*.

- 4.
The ML predictions obtained using statistical features are better than those obtained using convolutional autoencoder-based features. This is because the statistical moments of the vegetation and elevation clips are better suited to capture the distributions of these quantities, whereas the autoencoder-based features may carry inherent noise and bias, both from the numerous transformations applied to the high-dimensional raw data (vegetation clips) to obtain low-dimensional features and from the model's structural bias.

## SUMMARY AND CONCLUSIONS

This paper focuses on comparing ML techniques to enable computationally expeditious calibration of hydrological model parameters without requiring streamflow discharge measurements. There are two potential benefits of this study: (1) a trained, validated, evaluated, and deployable ML model can be used to obtain the calibrated hydrological parameters in regions where there are no discharge measurement gauges; (2) the entire framework is built on the assumption that the hydrological parameters bear a functional relationship with the topography and hydrodynamic characteristics of the river system, so the ML models can take into account numerous features that are sensitive to and correlated with the calibrated hydrological parameters.

The four compared ML architectures included GPR, GMC, Random Forest, and XGBoost. GPR and GMC are capable of producing probabilistic predictions and can therefore quantify the uncertainty in the predicted calibrated parameters, whereas Random Forest and XGBoost yield single-point predictions. The primary input information in raw form includes the length of the river, the slope of the river, the percentage of vegetation, the elevation profile, and the catchment area. To build ML models that predict the calibrated streamflow parameters, we extract numeric features from the vegetation and elevation profiles, which in raw form are *.tiff* images. To do so, we deploy two feature extraction techniques: statistical moment analysis and a convolutional autoencoder.
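The statistical-moment feature extraction can be sketched as follows, assuming a hypothetical raster clip already loaded as a NumPy array (the function name, clip shape, and value range are illustrative, not the paper's actual pipeline).

```python
import numpy as np
from scipy import stats

def moment_features(clip):
    """First four statistical moments of a raster clip (e.g., a
    vegetation-percentage grid), treated as a flat 1-D sample."""
    v = np.asarray(clip, dtype=float).ravel()
    v = v[~np.isnan(v)]   # ignore no-data pixels
    return np.array([v.mean(), v.std(), stats.skew(v), stats.kurtosis(v)])

rng = np.random.default_rng(5)
clip = rng.uniform(0.0, 100.0, (64, 64))  # hypothetical 64x64 vegetation-% clip
feats = moment_features(clip)             # [mean, std, skewness, excess kurtosis]
```

Each image clip thus collapses to a short, fixed-length numeric vector, which is what makes these features directly usable by the GPR, GMC, Random Forest, and XGBoost models.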

This study has shown encouraging prediction results for this exceedingly complex problem, with a better-than-expected coefficient of determination for most clusters. Judging by the coefficient of determination, all four ML techniques performed reasonably well, with XGBoost the best, followed by GPR and Random Forest, and GMC the lowest in the performance rank. Our primary conclusion is that ML models that are carefully constructed by considering causal and sensitive input features (related to the rivers and the geographic conditions of the watersheds) offer a potential approach that can not only obtain calibrated hydrological model parameters with reasonable accuracy but also side-step the current calibration challenges.

## ACKNOWLEDGEMENTS

Funding for this work was provided by the United States Army Corps of Engineers through the U.S. Army Engineer Research and Development Center Research Cooperative Agreement W912HZ-21-C-0042.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

## Author notes

These authors contributed equally.