ABSTRACT
Accurate river discharge forecasts for short to intermediate time intervals are crucial for decision-making related to flood mitigation, the seamless operation of inland waterways management, and optimal dredging. River routing models that are physics based, such as RAPID (‘routing application for parallel computation of discharge’) or its variants, are used to forecast river discharge. These physics-based models make numerous assumptions, including linear process modeling, accounting for only adjacent river inflows, and requiring brute force calibration of hydrological input parameters. As a consequence of these assumptions and the missing information that describes the complex dynamics of rivers and their interaction with hydrology and topography, RAPID leads to noisy forecasts that may, at times, substantially deviate from the true gauged values. In this article, we propose hybrid river discharge forecast models that integrate physics-based RAPID simulation model with advanced data-driven machine learning (ML) models. They leverage runoff data of the watershed in the entire basin, consider the physics-based RAPID model, take into account the variability in predictions made by the physics-based model relative to the true gauged discharge values, and are built on state-of-the-art ML models with different complexities. We deploy two different algorithms to build these hybrid models, namely, delta learning and data augmentation. The results of a case study indicate that a hybrid model for discharge predictions outperforms RAPID in terms of overall performance. The prediction accuracy for various rivers in the case study can be improved by a factor of four to seven.
HIGHLIGHTS
Hybrid model that combines physics-based model with machine learning model.
Two hybrid modeling techniques: delta learning and data augmentation.
Hybrid model improves streamflow prediction accuracy over time.
Multiple machine learning models are implemented.
INTRODUCTION
The management of stream and river systems entails significant and frequently costly decision-making procedures. These processes include floodplain mapping, citizen water usage, national-level infrastructure designs, and agricultural arrangements. Notable modeling investigations were conducted in the early 1950s, and over the past two decades, hydrologists have made significant efforts to develop large-scale hydrologic models (David et al. 2011a). Prior to the utilization of data-driven analysis through computational simulation, hydrologists predominantly relied on creating mathematical models of river systems as the primary method for generating hydrometeorological forecasts (David et al. 2016). Currently, significant research is dedicated to creating river models that focus on ground-level stream wave propagation across vast geographical boundaries. Notable examples of such models include the routing model proposed by Lohmann et al. (1996), total runoff integrating pathways by Oki & Sud (1998), LISFLOOD-FP by Bates & De Roo (2000), river transport model by Branstetter (2001), RiTHM (river-transfer hydrological model) by Ducharne et al. (2003), hillslope river routing by Beighley et al. (2009), and CaMa-Flood (catchment-based macro-scale floodplain) by Yamazaki et al. (2011).
Despite the advancements in these models, a significant challenge persists: a substantial portion of river stretches remains unmonitored. Streamflow measurements for the United States are primarily provided by the National Water Information System (NWIS) of the US Geological Survey (USGS) (see https://waterdata.usgs.gov/nwis). Local state or municipal institutions also contribute short-duration streamflow observations, which are generally less reliable than the NWIS dataset (Ghimire et al. 2023). As of 2023, the USGS reports that only 10,996 sites are equipped with direct automatic monitoring devices to record current conditions, despite the United States being home to approximately 250,000 rivers and having over 3.5 million miles of waterways. Furthermore, monitoring intervals vary widely and lack consistency. The only monitoring technique available aside from USGS observations entails using a bucket and timer to measure the volume of water collected over a specific period, but this is impractical for most streams except for those of the smallest order. Therefore, the challenge in predicting streams and waterways lies in the limited availability of discharge observations that are obtained from a restricted number of gauged locations. As a result, flow estimations rely on alternative techniques, such as the use of an acoustic Doppler current profiler or the utilization of the routing application for parallel computation of discharge (RAPID) model (McCarthy 1938).
This study focuses on RAPID, which is a river routing model based on the Muskingum algorithm developed by McCarthy (1938). David et al. (2011b, 2013, 2015) expanded on McCarthy’s work by proposing a matrix-based version of the Muskingum method that enables computation of flow and water volume in numerous sections of a river network, even those without measurement data. RAPID was specifically designed for the National Hydrography Dataset Plus (NHDPlus), a comprehensive dataset that effectively captures the water flow wave across a wide range of catchment boundaries (David et al. 2011b; Tavakoly et al. 2017). Within the Geographic Information System (GIS), NHDPlus contains monthly and annual streamflow estimates for over 3 million stream segments and provides a systematic representation of how these streams are interconnected within global river networks (David et al. 2011b).
Despite its high-performance computing capabilities, calibrating parameters in the RAPID model poses challenges due to the necessity for numerous measurements and the intensive computational and data requirements (Zhao et al. 2023). RAPID requires access to streamflow measurement data, which may be unattainable for certain unmonitored catchments. Meanwhile, RAPID’s process-based hydrology simulation also demands a substantial amount of computational processing time (Ghimire et al. 2023). To address these aforementioned difficulties, various research studies have been conducted. One such study integrates anticipated discharges from the variable infiltration capacity model with the RAPID model on a nationally scalable framework, incorporating waterway measurements from the NHDPlusV2 river network. This approach leverages high-level computational efforts to enhance parameter calibration (Ghimire et al. 2023). Another coupled model, AutoRAPID, combines runoff data from RAPID with watershed storm surge values generated by the AutoRoute model. This integration enables swift flood inundation predictions across large geographic areas in response to catastrophic situations (Follum et al. 2017). In addition to these coupled models, data-driven machine learning (ML) models have witnessed tremendous success in hydrological predictions and related fields. In particular, various ML models and hybrid models that combine two different types of ML models have been developed for time series prediction related to hydrology. For instance, a long short-term memory (LSTM) method tuned by ant lion optimization (called LSTM-ALO) has been applied to monthly runoff forecasting in the study by Yuan et al. (2018), where the ALO algorithm is employed to optimize parameters of the LSTM model. Similarly, a method called LSTM-INFO has been developed to predict water temperature by optimizing the LSTM model parameters using reptile search algorithm and weighted mean of vectors optimizer (Ikram et al. 2023). In addition to LSTM-based methods, support vector machine (SVM)-based methods have been explored in the studies by Adnan et al. (2023b) and Adnan et al. (2022) for evaporation prediction, resulting in methods called relevance vector machine tuned with improved Manta-Ray foraging optimization (RVM-IMRFO) and firefly algorithm and particle swarm optimization (SVM-FFAPSO). The other studied ML models for time series forecasting including the random vector functional link networks (Mostafa et al. 2023), hybrid heuristic algorithm (Adnan et al. 2021), and extreme learning machine methods tuned with metaheuristic algorithms (Adnan et al. 2023a). In the past decade, predicting time series data using ML models has become a very crowded field and numerous approaches have been proposed in the literature. Among all these available approaches, deep learning (DL)-based methods have shown promising potentials, especially for river discharge prediction. Kratzert provides convincing evidence that DL, specifically through the utilization of LSTM, produces superior runoff simulations from precipitation data in ungauged sites compared to conceptual models in monitored basins (Kratzert et al. 2019). Similarly, Nearing’s research also highlights the notable advantage of LSTM-based DL models in capturing dynamic watershed behavior and improving hydrological performance across various climatological circumstances (Nearing et al. 2021). Remarkably, even when catastrophic scenarios were not explicitly included in the training data, the LSTM network and its variants consistently predicted severe events with remarkable accuracy compared to the conceptual Sacramento model and the process-based US National Water Model (Frame et al. 2022). Furthermore, the hybrid hydrological ML model, which efficiently leverages the strengths of artificial neural networks and the k-nearest neighbor method, has demonstrated accurate and robust forecasting capabilities for extreme events (Kan et al. 2020).
The challenges associated with calibrating RAPID streamflow parameters, coupled with the calibration process’s reliance on gauged measurement data, motivated us to develop a novel data-driven parameter calibration model that utilizes the hydrological and topological features of the river’s watershed (Zhao et al. 2023). These calibrated parameters can be fed into a physics-based model that produces discharge forecasts. However, physics-based models are often simplified by numerous assumptions, such as RAPID being a linear river routing model, and therefore, they can be improved. While the aforementioned reviewed data-driven ML models has shown promising potential in river discharge prediction, their performance is often affected by both the quantity and quality of the data used for training the models. In this article, we extend our research efforts to build a hybrid streamflow discharge prediction model for gauged rivers that leverages the strengths of both the physics-based RAPID simulation model and data-driven ML models. It is worth noting that the ‘hybrid model’ in this article is different from the variety of hybrid data-driven models previously reviewed, which typically amalgamate various ML algorithms or optimization techniques. The ‘hybrid model’ in this article refers to the integration of the physics-based analysis model with the data-driven ML model, which is a concept also known as physics-informed/enhanced ML models. The major contributions and novelty of the proposed methods can be summarized as follows:
Incorporating physics-based principles: Integrating the intrinsic physics of river flow as encapsulated by the RAPID model into the data-driven ML prediction framework.
Leveraging data-informed insights: Harnessing hydrological insights embedded in the runoff data of nearby gauged rivers to address the gaps, discrepancies, and missing information in the physics-based model created as a result of inherent assumptions in physics-based models.
Utilizing cutting-edge ML techniques: Building a hybrid model that captures the capabilities of the physics-based model while being data-informed, by employing cutting-edge ML-based forecasting techniques of varying complexities, including Gaussian process (GP)-based nonlinear autoregressive with exogenous inputs (NARX) regressor (GP-NARX), neural network-based NARX regressor (NARX-Net), LSTM, and bidirectional LSTM (BiLSTM).
Forecasting for extended time periods: The model should be capable of predicting river discharge for long-time horizons (such as predicting the discharge a year in advance) and capturing the observed seasonality and trends.
The following sections of this article are structured as follows: Section 2 summarizes the datasets used and the techniques employed for feature extraction from the runoff data. Section 3 provides detailed information on the two algorithms, namely, delta learning and data augmentation, used to build the hybrid model. Section 4 outlines various ML methods employed to implement the delta learning and data augmentation algorithms and obtain the hybrid discharge prediction models. Section 5 presents the case study, and Section 6 concludes the article.
DATASETS AND EXTRACTION OF FEATURES
Datasets
The true discharge data and RAPID’s predictions are vector time series, with each element representing a discharge value on a given day. Let denote the total number of days for which the data are available (these days will be split later for the purpose of training and testing the hybrid model). We denote true gauge data as a vector and the RAPID’s prediction data time series as a vector , with . Unlike the true gauge discharge and RAPID values, which are associated with the river whose discharge we aim to predict, the runoff dataset consists of time series data for runoff from all the rivers in the basin. Therefore, the runoff data are represented as a higher-dimensional matrix, denoted by , with .
The jth element of the ith column of the matrix M, denoted as , represents the runoff values for the jth rivers in the basin on the ith day. Therefore, the dimensionality of the runoff data is . Using these three datasets in their raw format presents two major challenges that require further data processing.
First, for basins with a large number of watersheds, the dimension of runoff data can be very high, as there is one time series of runoff data for each watershed. For example, the basin studied in the case study has 431 watersheds, which means that the dimension of the runoff data is . This high-dimensional data poses significant challenges to the training of the hybrid models. Since these watersheds are in the geographic neighborhood of each other, there is bound to be correlation between the runoff time series of rivers passing through these watersheds. Therefore, we first reduce the high-dimensional data into low-dimensional features using dimension reduction techniques (i.e., reducing the number of rows in the M matrix). Two types of dimension reduction techniques, namely, SVD and LSTM-based autoencoder, have been studied.
Second, the RAPID model exhibits considerable variations and noise due to the presence of unmodeled features and underlying assumptions. To enhance the accuracy of hybrid models based on RAPID model predictions, it is essential to employ data smoothing techniques that help eliminate outliers and highlight underlying trends. We use a Hanning filter based on the convolution of a scaled window with the signal to smooth RAPID model predictions (Oppenheim 1999). The window size used in this article is 30.
Surface runoff features extracted by singular value decomposition
Surface runoff features extracted by a long short-term memory autoencoder
LSTM overview
LSTM is a recurrent neural network (RNN)-based architecture specifically developed to address the challenge of learning sequence data with long-term dependencies. LSTM networks have been employed to tackle this issue by ensuring a consistent backward flow in the error signal. The utilization of LSTM networks not only excels in capturing short-term data dependencies for accurate predictions but also effectively considers the explicit dependencies among multiple outputs over a long-term time horizon.
In this article, we leverage LSTM for both the dimension reduction of the runoff data and hybrid modeling. This section explains the application of LSTM for feature extraction of the runoff data, while the LSTM architecture is later described in Section 4.3, which discusses the hybrid modeling.
Long short-term memory autoencoder network
An autoencoder is a form of unsupervised neural network that uses data to find the best encoding–decoding architecture (Nguyen et al. 2021). It can be regarded as an n-layer neural network that aims to reconstruct the inputs based on the provided code. The autoencoder comprises five components: an input layer, an output layer, an encoder, a decoder, and a bottleneck region known as the latent space.
The encoder reduces the input dimensions to generate a representation for the bottleneck, while the decoder expands the bottleneck’s dimensions to recreate the input layers. The latent space represents the high-level inputs using compressed datasets at a lower level. In this article, considering the advantage of LSTM in dealing with time series data and the promising performance of LSTM shown in the literature, LSTM is used for constructing the autoencoder. The structures of the encoder and decoder in this article are similar, with two LSTM layers stacked, as shown in Figure 1. The encoder takes high-dimensional datasets as inputs and generates compressed datasets as outputs. The decoder uses the compressed dataset stored in the bottleneck to reconstruct the original input datasets.
Encoder structure
The encoder consists of two LSTM layers, followed by a RepeatVector layer. It serves as an encoder that takes high-dimensional datasets as inputs and generates compressed datasets as outputs.
To implement the encoder, we begin by reshaping all the input datasets at each time step into a size of (1,431). This allows us to standardize the input datasets before passing them into the LSTM layers. The first LSTM layer (LSTM layer 1) sets the input channel to one, indicating two dimensions. The output dimension of LSTM layer 1 is configured as 36, and the kernel initialization is defined as he_uniform, which follows a truncated normal distribution. Following the first LSTM layer, we proceed to create the second LSTM layer (LSTM layer 2). In this layer, the input dimension is set to 36, reflecting the output channel from the previous layer, while the output dimension is configured as 18. Like the first LSTM layer, we maintain the kernel initialized as he_uniform. The final layer utilized in the encoder structure is the RepeatVector layer, which is repeated a defined number of times using this architecture.
Decoder structure
The structure of the decoder comprises two LSTM layers. Its purpose is to utilize the compressed dataset stored in the bottleneck to reconstruct the original input datasets.
The reconstruction error and the parameters used for LSTM autoencoder architecture
Model architecture . | Description . |
---|---|
Number of layers | 5 |
Batch size | 32 |
Number of epochs | 500 |
Optimizer | Adam |
Loss function | MSE |
Model architecture . | Description . |
---|---|
Number of layers | 5 |
Batch size | 32 |
Number of epochs | 500 |
Optimizer | Adam |
Loss function | MSE |
Data processing workflow
Processing the data involves three steps:
Smoothing the RAPID’s prediction data: The prediction data R from RAPID is smoothed using a Hanning filter, with no change in the dimensionality of the data (Oppenheim 1999).
Reducing the dimensions of the runoff data: By employing SVD or a LSTM autoencoder, we can reduce the number of rows in the M matrix, consequently altering its dimensions from to either when using SVD or when using a LSTM autoencoder.
Downsampling the number of days: To further reduce the dimensionality of all three datasets, we downsample the number of days if the ML method deems it necessary. For example, if we downsample by a factor of 2, it would require deleting every other data point in RAPID’s predictions and the true gauge values (since we use daily average RAPID estimates for building the hybrid model), as well as deleting alternate columns in the runoff matrix. Let the number of remaining days after downsampling be denoted by .
ALGORITHMS FOR PHYSICS-ENHANCED ML MODEL FOR DISCHARGE FORECASTING
Having presented two data compression techniques previously, in this section, we describe two algorithms used in building the hybrid forecasting models: (a) delta learning and (b) data augmentation (Thelen et al. 2022). Both of these hybrid models utilize RAPID as the physics-based model and enhance it by incorporating runoff data from the entire basin and the historical discharge measurements of the specific gauged river whose discharge we aim to forecast.
Delta learning
The delta learning algorithm is based on learning the error between RAPID’s discharge prediction and measured gauge data.
Training
- Step 1: The first step is to construct a runoff data-informed surrogate for the RAPID model, enabling the prediction of future RAPID discharge based on the historical time series of RAPID data and runoff data. Incorporating the runoff data into the surrogate model of RAPID helps include some of the missing information about river dynamics not captured by the physics-based RAPID model. This surrogate model can be considered an enhanced physics-based model. The goal of this surrogate is to predict the discharge at the next time step given a window of RAPID data and runoff data in the immediate past. For instance, consider the current time step and the window size and , such that and . Since Noah-MP provides runoff forecasts for the near future, we also consider the next runoff data as the input to train RAPID surrogate since it informs the future. The input to the surrogate model will be the vector . For the current time step and a window size of , the discharge predicted by the RAPID’s surrogate for the next time step is given by:
- Step 2: The second step is to obtain the error in the anticipated outflow based on the physics-based prediction relative to the gauged measurement. To obtain the physics-based prediction, we utilize the previously trained surrogate defined in Equation (6). However, instead of using the RAPID data as input, we use the gauge measurements along with the runoff data to evaluate the physics-based surrogate response. This tweak is necessary for a like-for-like comparison of the physics-based prediction relative to the gauged measurement. That is,The error term (or the delta term) is given as follows:Using this construct, the delta term can be evaluated for each day by considering data points from the preceding days, except for the first time steps. This exception is necessary because evaluating the predicted discharge using the surrogate requires data from the last time steps. This yields the delta vector denoted by .
- Step 3: The goal of this step is to build a predictor for the error in discharge (the delta term surrogate) capable of predicting the error for a future time step, considering the preceding gauge measurements and runoff data for a window size of and , respectively. To maintain compatibility in dimensionality between the input and output, we delete the first columns of and . The predicted error for the step is then defined as follows:Figure 2 illustrates the steps involved in training the hybrid model using delta learning algorithm.
Testing and forecasting
Data augmentation
Training
Similar to delta learning, training the hybrid forecast model using the data augmentation algorithm involves constructing two ML models and consists of three steps, as outlined below:
Step 1: The first step is identical to delta learning as discussed in Section 3.1.1, and it entails creating a runoff data-informed surrogate for the RAPID model, denoted as and defined in Equation (6).
Step 2: At time step , the model can be used to obtain the predicted physics-based discharge for the next time step using Equation (7). The predicted physics-based discharge can be evaluated for each day by considering data points from the preceding days, except for the first time steps. This exception is necessary because evaluating the predicted discharge using the surrogate requires data from the last time steps. This yields the vector of predictions made by the model for the entire training period denoted by .
We emphasize that to obtain the prediction vector produced by the model , we use gauge measurements in conjunction with runoff data as inputs to evaluate the physics-based surrogate response, rather than relying on RAPID data that were one of the training input for the model in the first step (Equation (7)). This approach is taken because the physics-based prediction vector will be used to augment the inputs for training the surrogate model that forecasts gauge discharge measurements, as will be explained in the next step.
Step 3: This step involves utilizing the gauge data and runoff data , along with the prediction vector obtained in the previous step, to train a ML model capable of directly predicting the gauge measurements. This is unlike the delta Learning approach, where the error term was predicted first before the gauge measurements.
To maintain compatibility in dimensionality between the inputs and output vectors, we delete the first columns of and . The predicted gauge measurement for the step is then defined as follows:Notice that, in addition to the preceding gauge measurements and runoff data, the output of the physics-based surrogate , denoted as , is included as part of the input to the surrogate . That is, the input for the model was obtained by augmenting the input for the model . This is the reason why this approach is referred to as data augmentation. The term contains information about the immediate future (the next time step). Consequently, the model is trained to incorporate this immediate future information as predicted by the physics-based surrogate (at the next time step).
Testing and forecasting
IMPLEMENTATION OF PHYSICS-ENHANCED ML ALGORITHMS FOR DISCHARGE PREDICTION
As discussed in the previous section, building a hybrid discharge forecast model using the delta learning algorithm requires the construction of two ML models, and . In addition, using the delta augmentation algorithm necessitates the creation of two ML models, and . We have investigated four different ML methods with varying complexity and characteristics to train these models. The following write-up discusses these ML methods. To maintain generality in our descriptions, we use the notation of a generic input vector and a scalar output that is available for the training purpose. We denote a matrix of multivariate training inputs as , where each row represents an instance of the input vector . Similarly, we represent the output as a column vector , with each element denoting an instance of the output variable . Finally, let the input vector for a testing point be denoted by and the predicted output be denoted by .
Gaussian process regression-based nonlinear autoregressive network with exogenous inputs
Gaussian process regression
NARX model based on GPR
NARX-Net
The NARX basis discussed in Section 4.1.2 can be implemented using a neural network as the regressor function, as opposed to a polynomial or GP regressor (Xie et al. 2009). In this approach, the predictor basis comprises lags of both autoregressive and exogenous terms, which are then processed through a ML model to predict future time steps of the output. Specifically, the function in Equation (25) is estimated using a neural network model, which may be a fully connected feed-forward network with multiple layers, activation functions, dropout, and numerous parameters in both the neuron weights and biases.
Unlike the immediate explicit solution of model parameters, which is typical in ordinary least squares regression and GPR, due to the high nonlinearity of the network function, iterative training on a subset of the data is now necessary. This training involves optimization schemes commonly used in the ML/AI community, such as ADAM, RMSProp, Adagrad, and others.
The choice and application of this modeling methodology are motivated by its utilization of potentially sophisticated neural network architectures, which can capture complex patterns and data features, as well as its traditional aspect of establishing a relationship between previous and current values in an autoregressive sense. This characteristic is favorable for physics-based problems that are typically governed by differential equations (Liu et al. 2022). Furthermore, this method serves as a suitable middle ground in terms of computational complexity when compared to GP-NARX and more intensive ML methods, such as LSTM or gated response unit (GRU), when assessing its performance.
Long short-term memory
As mentioned in Section 2.3.1, we recall that LSTM is an RNN-based algorithm that is especially useful for learning sequences of data with long-term dependencies. Apart from using LSTM to extract features from high-dimensional runoff data, we also utilize LSTM to build a hybrid discharge forecasting model. In this section, we present the general description of the LSTM algorithm and its application for hybrid modeling.
As illustrated in Figure 6, multiple LSTM cells are connected together to map an input sequence to an output sequence . In each LSTM cell, the operations in the three gates are explained as follows.
Input gate
Forget gate
Output gate
The hidden states are ultimately transformed into the output through a fully connected layer. In a conventional LSTM topology, groups of neurons are organized into layers, comprising the input layer, output layer, and multiple hidden layers. The input layer receives input from external sources, while the output layer transmits information to the surroundings. The hidden layers, consisting of several levels, do not have a direct connection to the external physical world but are linked to the input and output layers. In the multilayer LSTM architecture used in this article, feed-forward neural networks are constructed by stacking cells across multiple hidden levels.
Bidirectional long short-term memory
CASE STUDY
Basin and rivers of interest
To develop the hybrid model, three datasets are required, as detailed in Section 2.1. The first and second datasets pertain to gauge measurements and discharge predictions produced by the RAPID model for the six gauged rivers, while the third dataset consists of runoff data for all 431 rivers in the basin, and all these datasets span 5,311 days. Specifically, the dimensions of the gauge measurements , the RAPID prediction vector R, and the runoff data M are 1× 5000, 1× 5000, and 431× 5000, respectively. The six gauged rivers for which we aim to forecast discharge are identified by the following USGS river IDs: 700739316, 700749972, 700777836, 700782048, 700811946, and 700825284. In this article, we refer to these rivers using numerals 1–6, respectively.
Data preprocessing
Dimension reduction of the runoff data: Since these rivers all belong to the same basin and their dynamics are interrelated, it is expected that the runoff data for all 431 rivers will exhibit some correlation. This correlation arises from the interconnected nature of the river network, where the dynamics of each river are influenced by the dynamics of the entire network and vice versa. Consequently, this correlation allows for the reduction of dimensions in the runoff data to extract essential information in a latent space. Reducing the dimensionality of the runoff dataset M is crucial for enhancing the overall construction of the hybrid model, as it significantly improves computational efficiency. To achieve this, we employ both the single value decomposition (SVD) technique (Section 2.2) and an LSTM-based autoencoder (Section 2.3) for dimension reduction. By employing both of these techniques, we extract 18 essential features from the original dataset. The dimensionality of features obtained through SVD and an LSTM-based autoencoder is the same, consisting of 18 features.
Figure 10 depicts the reconstruction of the runoff data distributed over the entire basin for a particular day using an LSTM-based autoencoder. The top figure displays the original runoff data, while the bottom figure shows the reconstructed runoff dataset obtained through the LSTM-based autoencoder. Each pixel in this figure represents the runoff volume for a specific catchment, with varying colors indicating different volume ranges as listed on the color-coded legend. The darker hues signify larger volumes, and among the 431 catchments, the largest recorded volume is 54,500 m3/s. The similarity between the original and reconstructed datasets indicates that the autoencoder efficiently captures the properties of the runoff dataset using just 18 dimensions.
Smoothing of the RAPID prediction data: The RAPID prediction is smoothed using a Hanning filter with a window size of 65 days (Oppenheim 1999).
Quality assessment and data standardization: We conduct a thorough data quality assessment to address issues such as mismatched data types, mixed data values, data outliers, and missing data. To ensure standardized data for analysis, the StandardScaler transform in Python is applied to normalize all three datasets. This normalization method ensures that the data are scaled consistently, allowing for accurate outcomes in subsequent studies while limiting the impact of various scales on hybrid model building.
Down sampling the number of days: We downsampled the data by a factor of 2 for GP-NARX. This was necessary due to GP-NARX’s inability to handle large dimensionality, as will be discussed in more detail in Section 5.4. No downsampling was performed for the other methods.
Performance evaluation metrics
Since MSE can be obtained by squaring RMSE, and RMSE has the same physical units as the predictions, we report RMSE in addition to KGE and NSE error metrics.
Numerical results of hybrid forecasting models
Prediction results
LSTM, BiLSTM, and NARX-Net are employed to forecast river discharge for six gauged rivers in all four scenarios mentioned earlier. The key parameters of the models are tuned through a grid search method. Four different dropout rates, 0.001, 0.005, 0.01, and 0.1, are used. Four different batch sizes (16, 32, 64, and 128) are also compared to determine the best batch size. The studied number of layers are 2, 3, 4, and 5. The studied learning rates are 0.0001, 0.001, 0.01, and 0.1. The studied number of neurons are 30, 40, 60, 80, and 100. The optimal hyperparameters for the models are as follows: dropout rate, 0.01; batch size, 32; number of hidden layers, 3; learning rate, 0.001; and number of neurons, 80. LSTM and BiLSTM use a past window size of . NARX-Net utilizes a window size of and for data augmentation. For delta learning, it employs a window size of and . However, GPR is used only for Scenario 1 and Scenario 3, which use SVD to reduce runoff data dimensionality, with the window size of days. GPR is sensitive to high-dimensional input, and we sidestep this challenge by using a shorter lag time and by utilizing only the most important runoff features obtained by SVD. Although SVD reduced the dimensionality of runoff data from 431 rivers to just 18 values, most of the information is contained in the first few terms. We exploit this fact and use only the first two terms of the 18 features extracted by SVD to build the GPR model.
Discussion
Table 2 shows the best-performing ML method for all the rivers and various scenarios of algorithms used in hybrid modeling. We make the following observations based on these analysis results:
Improved overall prediction results relative to the original RAPID model: Based on the error plots for RMSE, GPR-based hybrid algorithms (Scenario 1 and Scenario 3) outperforms RAPID 100% of the time. LSTM- and BiLSTM-based algorithms outperform RAPID 91.667% of the time for delta Learning and 100% of the time for data augmentation. Narx-Net-based hybrid algorithms outperform RAPID 100% of the time for all the four scenarios of hybrid models. This clearly indicates that the predictions of the hybrid algorithms perform better than the RAPID predictions. That is, the ML component in the hybrid model successfully retrieves the information that has not been modeled in the original RAPID model.
Data augmentation vs. delta learning: It is concluded from the error plots that data augmentation is overall a more robust algorithm than delta Learning. Overall, all the ML methods perform marginally better with data augmentation relative to delta Learning.
Predictions in human-controlled river with a dam: As discussed earlier, River 4 is an anomaly in that it has a dam that obstructs the flow. Consequently, the true measured discharge is almost static and low (near zero). The RAPID model, unable to incorporate this information, produces a noisy prediction of a large positive discharge. This particular river, however, reveals a very interesting behavior of the delta learning and data augmentation algorithm. Since the delta learning algorithm involves evaluating the error between smoothed RAPID and the true discharge, it is highly sensitive to the deviation of RAPID’s predictions relative to the true value. In cases where the true discharge measurements are near zero, the delta terms become almost equal to smoothed RAPID predictions. As a result, the delta learning algorithm’s predictions are influenced by RAPID’s measurements and follow a similar pattern to RAPID’s predictions. On the other hand, the data augmentation algorithm does not have this sensitivity to RAPID predictions and yields a prediction pattern closer to the true discharge. This also justifies the overall better behavior of data augmentation relative to the delta learning algorithm (the first observation).
Impact of feature extraction techniques on predictions: LSTM performs better with the delta learning algorithm that utilizes autoencoder-based runoff features relative to the SVD-based runoff features. Contrarily, the data augmentation algorithm that uses SVD runoff features performs better relative to the autoencoder-based runoff features.
Performance of different ML algorithms in different scenarios: GPR is the overall best performer for Scenario 1 of the hybrid model; LSTM for Scenario 2; BiLSTM for Scenario 4. In Scenario 3, different methods perform the best for different rivers.
Training cost and complexities of different ML algorithms: BiLSTM is harder to train than LSTM. Therefore, even though LSTM and BiLSTM perform relatively similarly, LSTM has an edge in terms of training ease. Ignoring River 4, overall, GPR marginally outperforms both LSTM and BiLSTM, even though it is a less complex architecture in terms of required training resources. This makes GPR a powerful method to support the hybrid model. However, GPR suffers from the curse of dimensionality, as discussed earlier. That is, it cannot handle larger input dimensions and is, hence, less informed than LSTM and BiLSTM.
Performance of NARX-Net: The NARX-Net method of modeling with delta learning consistently demonstrated an overly strong correlation with the RAPID prediction. This trend was also observed in other modeling techniques, albeit to a lesser extent. Two possible explanations arise: there may be an excessive bias toward RAPID as the sole predictor for the output, implying that the utilization of runoff data and historic gauge information was insufficient to rectify the exclusive reliance on RAPID. Alternatively, and more likely, a very small and inadequate delta term is being generated and added back to the original estimation as a correction. This phenomenon could be attributed to the presence of noisy, erratic, and higher frequency variance in the training data, as is the case in some instances of RAPID and gauge data. Models, when exposed to such data, may undesirably learn the variance through potential overfitting, resulting in a biased model that either inappropriately predicts noise or abandons the natural features in the data, defaulting to a smooth or flat average. In the case of NARX-Net, it appears that, during training, the model produces a very small delta term that provides minimal correction to the RAPID prediction.
Rivers . | Scenario 1 delta learning with SVD runoff features . | Scenario 2 delta learning with autoencoder runoff features . | Scenario 3 data augmentation with SVD runoff features . | Scenario 4 data augmentation with autoencoder runoff features . |
---|---|---|---|---|
RMSE: GPR | RMSE: LSTM | RMSE: LSTM | RMSE: BiLSTM | |
1 | NSE: GPR | NSE: LSTM | NSE: LSTM | NSE: BiLSTM |
KGE: GPR | KGE: LSTM | KGE: LSTM | KGE: BiLSTM | |
RMSE: GPR | RMSE: LSTM | RMSE: GPR | RMSE: BiLSTM | |
2 | NSE: GPR | NSE: LSTM | NSE: GPR | NSE: BiLSTM |
KGE: GPR | KGE: LSTM | KGE: LSTM | KGE: BiLSTM | |
RMSE: BiLSTM | RMSE: LSTM | RMSE: BiLSTM | RMSE: BiLSTM | |
3 | NSE: BiLSTM | NSE: LSTM | NSE: LSTM | NSE: BiLSTM |
KGE: BiLSTM | KGE: LSTM | KGE: LSTM | KGE: BiLSTM | |
RMSE: NARX-Net | RMSE: LSTM | RMSE: BiLSTM | RMSE: BiLSTM | |
4 | NSE: NARX-Net | NSE: LSTM | NSE: BiLSTM | NSE: BiLSTM |
KGE: NARX-Net | KGE: LSTM | KGE: BiLSTM | KGE: BiLSTM | |
RMSE: GPR | RMSE: LSTM | RMSE: GPR | RMSE: LSTM | |
5 | NSE: GPR | NSE: BiLSTM | NSE: GPR | NSE: LSTM |
KGE: BiLSTM | RMSE: LSTM | KGE: GPR | KGE: BiLSTM | |
RMSE: GPR | RMSE: LSTM | RMSE: GPR | RMSE: LSTM | |
6 | NSE: GPR | NSE: LSTM | NSE: BiLSTM | NSE: LSTM |
KGE: LSTM | KGE: LSTM | KGE: BiLSTM | KGE: BiLSTM |
Rivers . | Scenario 1 delta learning with SVD runoff features . | Scenario 2 delta learning with autoencoder runoff features . | Scenario 3 data augmentation with SVD runoff features . | Scenario 4 data augmentation with autoencoder runoff features . |
---|---|---|---|---|
RMSE: GPR | RMSE: LSTM | RMSE: LSTM | RMSE: BiLSTM | |
1 | NSE: GPR | NSE: LSTM | NSE: LSTM | NSE: BiLSTM |
KGE: GPR | KGE: LSTM | KGE: LSTM | KGE: BiLSTM | |
RMSE: GPR | RMSE: LSTM | RMSE: GPR | RMSE: BiLSTM | |
2 | NSE: GPR | NSE: LSTM | NSE: GPR | NSE: BiLSTM |
KGE: GPR | KGE: LSTM | KGE: LSTM | KGE: BiLSTM | |
RMSE: BiLSTM | RMSE: LSTM | RMSE: BiLSTM | RMSE: BiLSTM | |
3 | NSE: BiLSTM | NSE: LSTM | NSE: LSTM | NSE: BiLSTM |
KGE: BiLSTM | KGE: LSTM | KGE: LSTM | KGE: BiLSTM | |
RMSE: NARX-Net | RMSE: LSTM | RMSE: BiLSTM | RMSE: BiLSTM | |
4 | NSE: NARX-Net | NSE: LSTM | NSE: BiLSTM | NSE: BiLSTM |
KGE: NARX-Net | KGE: LSTM | KGE: BiLSTM | KGE: BiLSTM | |
RMSE: GPR | RMSE: LSTM | RMSE: GPR | RMSE: LSTM | |
5 | NSE: GPR | NSE: BiLSTM | NSE: GPR | NSE: LSTM |
KGE: BiLSTM | RMSE: LSTM | KGE: GPR | KGE: BiLSTM | |
RMSE: GPR | RMSE: LSTM | RMSE: GPR | RMSE: LSTM | |
6 | NSE: GPR | NSE: LSTM | NSE: BiLSTM | NSE: LSTM |
KGE: LSTM | KGE: LSTM | KGE: BiLSTM | KGE: BiLSTM |
SUMMARY AND CONCLUSIONS
This article focuses on building a hybrid data-driven physics-enhanced model to forecast river discharge. The model is based on three data sources: true discharge values measured by gauges, discharge predicted by the physics-based RAPID, and runoff data from all the rivers in the basin. Two cutting-edge algorithms, namely, delta learning and data augmentation, are proposed for hybrid modeling. Building a hybrid model using either of these algorithms requires constructing three different ML models. We exploit four different ML methods, including GP-NARX, NARX-Net, LSTM, and BiLSTM.
The inclusion of runoff data from all rivers within the basin offers a comprehensive view of the dynamics of river flow across the entire surrounding region. The runoff data from neighboring rivers provide insights into the dynamic interactions and interconnected physics of the broader river network, contributing to more informed discharge forecasts for the specific river of interest. However, considering the runoff data of all the rivers constituting a basin of interest makes the runoff dataset have higher dimensions. We utilize SVD and LSTM-based autoencoder techniques to extract features with reduced dimensionality.
There are four combinations of algorithms and two feature extraction scenarios for the runoff data: (1) Scenario 1: Delta Learning algorithm combined with runoff data features extracted using SVD; (2) Scenario 2: Delta Learning algorithm combined with runoff data features extracted using an LSTM-based autoencoder; (3) Scenario 3: Data augmentation algorithm combined with runoff data features extracted using SVD; and (4) Scenario 4: Data augmentation algorithm combined with runoff data features extracted using an LSTM-based autoencoder. Each of these scenarios can be built by exploiting various ML techniques. In the hybrid model, GPR outperforms other methods in Scenario 1, while LSTM excels in Scenario 2, and BiLSTM is the top performer in Scenario 4. In Scenario 3, various methods demonstrate the best performance for different rivers.
This study has demonstrated encouraging results in discharge forecasting using various hybrid models to address this exceedingly complex problem. The performance metrics examined, namely, RMSE, KGE, and NSE, consistently show that hybrid algorithms outperform RAPID. For example, when considering RMSE as the error measure, GPR-based hybrid algorithms (Scenario 1 and Scenario 3) outperform RAPID 100% of the time. LSTM- and BiLSTM-based algorithms outperform RAPID 91.667% of the time for delta learning and 100% of the time for data augmentation. Narx-Net-based hybrid algorithms (Scenario 2 and Scenario 3) also outperform RAPID 91.667% of the time.
One major requirement upon which these models were contingent was the availability of true discharge measurement data. The encouraging results in this study have motivated us to extend this research and build a hybrid model capable of predicting the discharge of ungauged rivers where true discharge measurements are not available. Currently, the developed method is mainly designed for rivers with gauges. The extension of the proposed methods to un-gauged rivers through transfer learning is an interesting direction worth studying in the future. In addition, for rivers with a large volume of discharge data, purely data-driven ML models can be employed. The method developed in this article is more suitable for situations where we have only a limited amount of river discharge data. Another important research direction worth studying is the selection or ensemble of ML models using optimization algorithms. In this article, we compared the performance of different hybrid modeling techniques. These models and techniques can be integrated together to achieve better performance using ensemble learning methods.
ACKNOWLEDGEMENTS
Funding for this work was provided by the http://dx.doi.org/10.13039/100006752 United States Army Corps of Engineers through the http://dx.doi.org/10.13039/100006505 U.S. Army Engineer Research and Development Center Research Cooperative Agreement W912HZ-21-C-0042.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.
REFERENCES
Author notes
These authors contributed equally.