Abstract
Better understanding the predictive capabilities of hydrological models under contrasting climate conditions will enable more robust decision-making. Here, we tested the ability of the long short-term memory (LSTM) network to predict daily discharge under changing conditions using six snow-influenced catchments in Switzerland. We benchmarked the LSTM against the Hydrologiska Byråns Vattenbalansavdelning (HBV) bucket-type model with two parameterizations. We compared the model performance under changing conditions against constant conditions and tested the impact of the time-series length used in calibration on the model performance. In calibration, the LSTM resulted in a much better fit than the HBV. However, in validation, the performance of the LSTM dropped considerably, and the fit was as good as or poorer than that of the HBV. Using longer time series in calibration improved the robustness of the LSTM, whereas the HBV needed fewer data to ensure a robust parameterization. When using the maximum number of years in calibration, the LSTM was considered robust enough to simulate discharge in a drier period than the one used in calibration. Overall, the HBV was found to be less sensitive than the data-driven model when applied under contrasting climates. However, other LSTM modeling setups might be able to improve the transferability between different conditions.
HIGHLIGHTS
The long short-term memory (LSTM) had good predictive accuracy in both the calibration and validation periods; however, it was always less robust than the HBV model.
When using the maximum number of years in calibration, the LSTM was robust enough under changing conditions when applied to a period drier than the one used in calibration.
INTRODUCTION
The use of hydrological models in conditions that differ from those during model calibration is a challenging problem in hydrology and critical for application in impact studies (Blöschl et al. 2019). Models calibrated in certain conditions have been shown to be not always suitable for different conditions or transferable in time (Bastola et al. 2011; Coron et al. 2012; Thirel et al. 2015; Broderick et al. 2016; Dakhlaoui et al. 2017; Grusson et al. 2017; Her et al. 2019; Ouermi et al. 2019; Pan et al. 2019). The lack of a robust analysis of model performance under changing conditions may lead to poor water resource management.
In the context of catchment hydrology, a changing condition refers to any significant modification in land cover, climate, or water management infrastructure, potentially affecting the transformation of rainfall into runoff (Thirel et al. 2015). A general approach for developing hydrological models suitable for use in transient conditions is the differential split-sample test (DSST). In this method, the model is calibrated and validated over contrasting periods, for instance, calibrated over a wet period and validated during a dry period (Klemeš 1986; Coron et al. 2012). The modeler should seek good transferability of the calibrated parameters to a different dataset in validation, rather than only a good fit during calibration; this transferability is often referred to as model robustness. Robustness is a model's degree of insensitivity to climatic and environmental conditions (Seiller et al. 2012).
Model generalization for contrasting climates has been extensively explored in the literature using the DSST (Seibert 2003; Wilby 2005; Vaze et al. 2010; Merz et al. 2011; Coron et al. 2012; Li et al. 2012; Seiller et al. 2012; Brigode et al. 2013; Kling et al. 2015; Li et al. 2015; Seiller et al. 2015; Thirel et al. 2015; Broderick et al. 2016; Fowler et al. 2016; Vormoor et al. 2018). The results have shown that model parameters are sensitive to the climatic conditions of the calibration period (Pan et al. 2019), that the transfer of model parameters in time may introduce a significant level of simulation errors (Zhu et al. 2016), and that calibration over a wetter (drier) climate than the validation climate leads to an overestimation (underestimation) of the mean simulated runoff (Coron et al. 2012). Changes in mean rainfall were more likely than those in mean potential evapotranspiration or air temperature to impact performance during validation (Coron et al. 2012). Furthermore, Broderick et al. (2016) pointed out that the model transferability in contrasted climates may vary depending on the testing scenario, catchment, and evaluation criteria. Here, we argue that testing new models and new calibration protocols can help with our understanding of the modeling capabilities under changing conditions.
Although data-driven techniques have proved to outperform many traditional approaches based on conceptual or physical models for constant conditions (Dawson & Wilby 1998; Dibike & Solomatine 2001; Hu et al. 2018; Lee et al. 2018; Kratzert et al. 2019a; Rafaeli Neto et al. 2019; Xu et al. 2020), and the models are reliable in out-of-sample generalization (Shen 2018), little work has been carried out to test the capabilities of data-driven methods to make reasonable predictions under changing conditions. A significant limitation of data-driven models may be that they do not benefit from our understanding of physical phenomena and instead rely on the data provided during optimization. Shortridge et al. (2016) argued that data-driven models could only generate reliable predictions for conditions comparable to those experienced historically. Otherwise, the models are likely to introduce considerable uncertainty into their projections.
The long short-term memory (LSTM), a particular type of recurrent neural network (RNN), has been shown to be promising in capturing the hydrological behavior from the learning process (Xu et al. 2020). Lees et al. (2021) showed that the LSTM simulates discharge with a consistent high model performance in a large range of catchments in Great Britain, including catchments typically considered difficult to model with four lumped conceptual models. Kratzert et al. (2019b) applied the LSTM model over 531 basins over the USA and found a high correlation between the values of the internal cells of an LSTM network and natural processes.
Recently, O & Orth (2020) evaluated the robustness of state-of-the-art models to changing conditions, calibrating an LSTM network and two process-based models in 161 catchments distributed across Europe. In their modeling setup, the LSTM model and the process-based models had different calibration approaches. The LSTM was calibrated over all catchments at once using two approaches: calibrating on an extreme reference period (365 days) and calibrating with one randomly selected year from each catchment rather than the respective extreme reference year. In contrast, the process-based models were calibrated in individual catchments and only using the extreme reference period. The models were then used to simulate the remaining years, which were characterized by a transient condition. The models showed an overall performance loss, which generally increased the more the conditions deviated from the reference climate, and overall, relatively high robustness was demonstrated by the physically-based model.
In light of the discussion above, in this paper we tested new calibration protocols and extended the scope of the model evaluation, with a focus on the LSTM model. This was done by (a) benchmarking the LSTM using the same modeling setup for both data-driven and process-based models (which includes calibrating one model to each catchment instead of calibrating the LSTM over all catchments), (b) testing whether increasing the number of years in model calibration would lead to better model performance and robustness (in contrast to the single year used in the previous study), and finally (c) calibrating the models in constant conditions for comparison. We then evaluated the robustness of the LSTM for contrasting conditions compared to both its application in constant conditions and the robustness obtained by the process-based model.
STUDY AREA AND DATA
For our study, we used six snow-influenced catchments located in Switzerland, ranging from ∼60 to 400 km², with a mean altitude between ∼500 and 1,200 m a.s.l. The location and description of the catchments (location, area, altitude, daily mean temperature, annual precipitation, mean daily discharge, and snow fraction) are presented in Figure 1 and Table 1. Our catchment choice aimed to select catchments mainly located in the Swiss plateau, within a climate homogeneous area, and considered nearly natural (i.e., there is negligible impact on runoff from human activity) (Orth et al. 2015).
Catchment | Mean altitude (m) | Area (km²) | Daily mean temperature (°C) | Total precipitation (mm year⁻¹) | Mean discharge (mm d⁻¹) | Snow fraction^a (%) |
---|---|---|---|---|---|---|
Broye | 710 | 392 | 8.6 | 1,190 | 1.6 | 5 |
Emme | 1,189 | 124 | 5.6 | 1,692 | 3.0 | 19 |
Ergolz | 590 | 261 | 8.6 | 1,091 | 1.2 | 6 |
Langeten | 766 | 60 | 7.5 | 1,305 | 1.8 | 10 |
Murg | 650 | 79 | 8.0 | 1,313 | 2.0 | 7 |
Sense | 1,068 | 352 | 6.3 | 1,445 | 2.1 | 13 |
^a Snow fraction (%): fraction of precipitation falling at temperatures below 0 °C.
The data needed to model the daily discharge were air temperature (°C), precipitation (mm d−1), and the estimates of long-term monthly potential evapotranspiration (mm month−1). Precipitation and air temperature data were obtained from the gridded meteorological forcing data at the spatial resolution of 2 km × 2 km from the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss). We obtained daily discharge measurements from the Swiss Federal Office for the Environment (FOEN).
METHODS
Long short-term memory (LSTM)
The LSTM is a particular type of RNN used to process long time-sequences of data (Hochreiter & Schmidhuber 1997), in which the output of each time-step is fed as input to the next time-step. The information flow is controlled by units called gates and memory cells. The cell remembers values over arbitrary time intervals, and three gates regulate the flow of information into and out of the cell: the forget gate, the input gate, and the output gate. At every time-step t, each of the three gates is presented with the input (i.e., explanatory variables) as well as the output of the memory cells at the previous time-step (t − 1).
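The gate mechanism described above can be illustrated with a single forward step of a standard LSTM cell. The following is a minimal NumPy sketch, not the implementation used in this study: the function name `lstm_step`, the weight layout (`W`, `U`, `b`), and the gate ordering are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One time-step of a standard LSTM cell (illustrative layout).

    x_t    : (n_in,)  explanatory variables at time-step t
    h_prev : (n_hid,) cell output at time-step t-1
    c_prev : (n_hid,) memory-cell state at time-step t-1
    W, U, b: stacked weights of shape (4*n_hid, n_in), (4*n_hid, n_hid),
             (4*n_hid,), ordered as forget, input, candidate, output.
    """
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:n])          # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])      # input gate: how much new information to store
    g = np.tanh(z[2 * n:3 * n])  # candidate values for the memory cell
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much memory to expose
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Hypothetical example: 2 inputs (precipitation, temperature), 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 2, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
forcing = rng.normal(size=(10, n_in))  # 10 synthetic daily time-steps
for x_t in forcing:
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Because the forget gate multiplies the previous cell state, values can persist across many time-steps, which is what allows the cell to "remember" over arbitrary intervals.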
In this work, we used a network consisting of a single hidden LSTM layer and a dense layer that connects the output of the LSTM at the last time-step to a single output neuron with linear activation. The LSTM model was implemented using the Keras package in Python, with the Adam optimizer and the mean-squared error as loss function. To predict the discharge of a single time-step (day), we provided as input the last n consecutive time-steps of independent meteorological variables (daily precipitation [mm d−1] and air temperature [°C]). We obtained the best hyperparameters of the LSTM model through a trial-and-error tuning approach. We varied the values of the following hyperparameters: length of the input sequence (time-steps), number of neurons in the hidden layer, and number of epochs. Our analysis resulted in the selection of 50 neurons, 50 epochs, and 365 days as time-steps.
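The input structure described above (a window of consecutive daily forcings for each predicted day) can be sketched as follows. This is an illustrative NumPy example on synthetic data; the helper name `make_windows` is hypothetical, and it assumes that each window ends on the day whose discharge is predicted, which the text does not state explicitly.

```python
import numpy as np

def make_windows(forcing, discharge, n_steps=365):
    """Build LSTM training samples from daily series.

    forcing   : (n_days, n_features) array, e.g. precipitation and temperature
    discharge : (n_days,) array of observed daily discharge
    Returns X of shape (n_samples, n_steps, n_features) and y of shape
    (n_samples,), where sample k covers days k .. k+n_steps-1 and targets
    the discharge on day k+n_steps-1 (an assumption of this sketch).
    """
    n_days = forcing.shape[0]
    if n_days < n_steps:
        raise ValueError("series is shorter than the input sequence length")
    n_samples = n_days - n_steps + 1
    X = np.stack([forcing[k:k + n_steps] for k in range(n_samples)])
    y = discharge[n_steps - 1:]
    return X, y

# Synthetic example: two years of daily data with two forcing variables.
rng = np.random.default_rng(1)
forcing_2y = rng.normal(size=(730, 2))
discharge_2y = rng.random(730)
X_2y, y_2y = make_windows(forcing_2y, discharge_2y, n_steps=365)
print(X_2y.shape, y_2y.shape)
```

With a 365-day window, the first 364 days of a calibration period produce no training target, so a 2-year period yields roughly one year of samples. This may be one reason why short calibration periods are demanding for a sequence model.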
HBV model
We benchmarked the performance of the LSTM model against the bucket-type HBV-light model (Seibert & Vis 2012). The HBV model consists of four routines: the snow routine, the soil routine, the groundwater routine, and the routing routine. This model usually simulates daily discharge based on daily precipitation, daily air temperature, and estimates of long-term monthly potential evapotranspiration rates. The HBV was used as both a lower and an upper benchmark with two different parameterization methods (Seibert et al. 2018). As a lower benchmark, we used the ensemble mean of simulations with 1,000 randomly selected parameter sets, referred to hereafter as ‘uncalibrated HBV’. For the upper benchmark, we calibrated the HBV model using an automatic genetic algorithm and the Nash–Sutcliffe efficiency (NSE) as objective function, referred to hereafter as ‘calibrated HBV’. In both cases, we specified feasible parameter ranges based on previous model applications.
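The idea behind the lower benchmark (an ensemble mean over randomly drawn parameter sets) can be sketched in a few lines. The example below does not implement HBV: `toy_bucket` is a hypothetical single linear reservoir standing in for the full model, and the parameter range and precipitation values are invented for illustration only.

```python
import numpy as np

def toy_bucket(precip, k):
    """Stand-in for a bucket-type model (NOT HBV): one linear reservoir.

    precip : daily precipitation (mm/d)
    k      : recession coefficient (1/d), the only parameter of this toy model
    """
    storage = 0.0
    q = np.empty(len(precip))
    for t, p in enumerate(precip):
        storage += p             # add today's input to the bucket
        q[t] = k * storage       # outflow proportional to storage
        storage -= q[t]
    return q

def uncalibrated_benchmark(precip, k_range=(0.05, 0.5), n_sets=1000, seed=42):
    """Ensemble-mean simulation over randomly sampled parameter sets."""
    rng = np.random.default_rng(seed)
    ks = rng.uniform(k_range[0], k_range[1], size=n_sets)
    sims = np.stack([toy_bucket(precip, k) for k in ks])
    return sims.mean(axis=0)

precip_demo = np.array([0.0, 10.0, 0.0, 0.0, 5.0, 0.0])  # hypothetical forcing
q_lower = uncalibrated_benchmark(precip_demo)
```

Because no observed discharge enters this procedure, any skill of the ensemble mean comes from the model structure and the feasible parameter ranges alone.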
Calibration procedure
The LSTM and HBV were calibrated individually for each one of the catchments resulting in six LSTM models and six HBV models. We calibrated and validated the models according to the DSST proposed by Klemeš (1986) for changing conditions. According to Klemeš (1986), if the model is intended to simulate streamflow in a wet climate scenario, then it should be calibrated on a dry period of the historical record and validated on a wet period and vice versa. Additionally, we calibrated and validated a model under constant conditions.
Selection of the calibration and validation periods
The period between 1961 and 2018 was used to select the constant and changing condition periods. We mimicked the changing conditions by selecting two continuous periods in the time series with different hydrological conditions in the historical record. The dry and wet periods were defined as years with annual discharge below and above the long-term average discharge, respectively. The discharge changes between the periods were on average 50%. This is similar to the future hydrological changes expected for Switzerland of an increase in mean and maximum floods of 5–24% in the near future and of 25–49% in the far future, with the exception of the Southern alpine catchments, where the mean annual floods may decrease in the far future (Köplin et al. 2014). For the constant conditions, we selected continuous periods containing both dry and wet years.
We also selected calibration periods with different time-series sizes, ranging from 2 to 6 years (2, 3, 4, and 6 years) for each catchment and condition (constant and changing), to test the influence of the amount of data used in the calibration on the model performance. We limited this analysis to 6 years due to data availability. We needed continuous periods with only low or high discharge, which were limited, on average, to 6 years across all the catchments.
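The period selection described in this subsection can be sketched as a two-step procedure: label each year as dry or wet relative to the long-term mean, then search for runs of consecutive years with the same label. The function names and the annual discharge values below are hypothetical, and years exactly equal to the mean are treated as wet, which is an arbitrary choice of this sketch.

```python
def classify_years(annual_q):
    """Label each year 'dry' or 'wet' relative to the long-term mean.

    annual_q : dict mapping year -> annual discharge (e.g. mm/year)
    """
    mean_q = sum(annual_q.values()) / len(annual_q)
    return {yr: ("dry" if q < mean_q else "wet") for yr, q in annual_q.items()}

def continuous_periods(labels, label, length):
    """Return (start, end) for every run of `length` consecutive years
    that all carry the requested label."""
    periods = []
    for start in sorted(labels):
        window = range(start, start + length)
        if all(labels.get(yr) == label for yr in window):
            periods.append((start, start + length - 1))
    return periods

# Hypothetical annual discharges (mm/year); the long-term mean is 700.
annual_q = {2000: 500, 2001: 480, 2002: 900, 2003: 950, 2004: 920, 2005: 450}
labels = classify_years(annual_q)

for n_years in (2, 3, 4, 6):
    print(n_years,
          continuous_periods(labels, "dry", n_years),
          continuous_periods(labels, "wet", n_years))
```

In this toy record no run of four or more same-label years exists, which mirrors the data-availability limit that capped the analysis at 6 years.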
Evaluation metrics and robustness
We evaluated the model performance using the NSE (Nash & Sutcliffe 1970), Kling–Gupta efficiency (KGE) (Gupta et al. 2009), non-parametric efficiency (NPE) (Pool et al. 2018), and mean absolute relative error (MARE) (Staudinger et al. 2011). The metrics range from −∞ to 1, where 1 indicates perfect agreement between simulations and observations, and values lower than zero indicate very poor performance. These metrics were chosen to evaluate different hydrograph phases: the NSE focuses on peaks and discharge dynamics; the KGE focuses on the mean, variability, and dynamics; the NPE is the non-parametric version of the KGE; and the MARE focuses on low-to-medium flows. The robustness was calculated as the difference between the efficiency in calibration and validation (Hallouin et al. 2020). The independent two-sample t-test was used to evaluate whether the LSTM mean robustness was equal to the robustness obtained with the HBV model, and to compare the mean robustness of the LSTM under changing and constant conditions, at the significance level (α) of 0.05.
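Two of the efficiency metrics and the robustness measure can be written compactly. The sketch below uses the standard definitions of the NSE (Nash & Sutcliffe 1970) and the KGE (Gupta et al. 2009); the NPE and MARE are omitted, and the function names and sample values are chosen for illustration rather than taken from the study's code.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error
    to the variance of the observations around their mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def kge(obs, sim):
    """Kling-Gupta efficiency from correlation (r), variability ratio
    (alpha) and bias ratio (beta)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def robustness(eff_calibration, eff_validation):
    """Performance drop from calibration to validation; values near zero
    indicate a robust model."""
    return eff_calibration - eff_validation

# Hypothetical discharge values (mm/d).
obs_q = np.array([1.0, 2.0, 4.0, 3.0, 1.5])
sim_q = np.array([1.2, 1.8, 3.5, 3.1, 1.4])
print(nse(obs_q, sim_q), kge(obs_q, sim_q))
```

A simulation equal to the observed mean gives an NSE of 0, and a perfect simulation gives 1 for both metrics. With this sign convention, a negative robustness value means the model performed better in validation than in calibration.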
RESULTS
Model performance
In calibration, the LSTM performed better than the HBV model for all criteria, as expected, since it is more flexible (it has more degrees of freedom) than the conceptual model. However, the performance of the LSTM decreased more than that of the calibrated HBV when switching to the validation periods (Figure 2). The uncalibrated HBV model performed less well, but its performance was still better than what one might expect from a model run with random parameters and/or no local information. Therefore, we considered that a model performance of about 0.5 for NSE basically indicates that a model has no skill. By definition, the performance of the uncalibrated model did not systematically differ between the calibration and validation periods. For KGE, the patterns were roughly similar, whereas for NPE and MARE, which differ more from the NSE used for calibration, the advantage of the calibrated models (LSTM and HBV) over the uncalibrated HBV model was smaller, especially when using fewer years during calibration.
The effect of the time-series length used in calibration on the performance of the models is represented on the x-axis of Figure 2. There was a positive correlation between the time-series length and model performance, which was more pronounced for the LSTM model. When evaluating the model's performance against metrics not used for the optimization of the model (i.e., KGE, NPE, and MARE), increasing the time-series length used in calibration was essential to obtain LSTM performances comparable to the HBV model during validation for contrasting conditions. Simulations for changing conditions performed less well than those for constant conditions in validation. However, the differences were less pronounced when using the maximum number of years in calibration (i.e., 6 years).
The hydrographs and scatter plots of observed and estimated discharge using the best configuration, that is, using 6 years in calibration, for one of the study catchments are presented in Figures 3 and 4, respectively. The hydrograph shows the underestimation of the peaks, especially those in spring (when the snow accumulated during the winter starts to melt) by all models. However, most low- and mid-flows were predicted well. This is clearly shown in the scatter plots of the observed and simulated flows in Figure 4. The scatter plots also indicate that the predictions deviate more from the observed values in the uncalibrated HBV model. There is an underestimation of the peaks when applying the model in conditions wetter than those it was calibrated in, and the LSTM model simulations are slightly less spread than those of the calibrated HBV model.
Model robustness
The model robustness was evaluated as the difference in performance between calibration and validation periods (Table 2). The LSTM was considered robust enough for generalization in changing conditions when the LSTM mean robustness did not significantly differ from both the mean robustness of the bucket-type model and of the constant period for most of the metrics.
Condition | Calibration period length (years) | NSE | KGE | NPE | MARE |
---|---|---|---|---|---|
Constant | 2 | 0.28 / 0.04 | 0.16 / 0.03 | 0.14 / 0.01 | 0.27 / 0.06 |
Constant | 3 | 0.16 / 0.00 | 0.17 / 0.00 | 0.07 / −0.01 | 0.1 / −0.02 |
Constant | 4 | 0.12 / 0.02 | 0.09 / 0.04 | 0.07 / 0.03 | 0.08 / 0.04 |
Constant | 6 | 0.14 / 0.10 | 0.10 / 0.04 | 0.08 / 0.07 | 0.12 / 0.07 |
Dry → Wet | 2 | 0.28 / 0.13 | 0.17 / 0.04 | 0.14 / 0.04 | 0.18 / 0.00 |
Dry → Wet | 3 | 0.24 / 0.13 | 0.21 / 0.08 | 0.13 / 0.05 | 0.06 / −0.06 |
Dry → Wet | 4 | 0.26 / 0.09 | 0.18 / 0.02 | 0.11 / −0.01 | 0.06 / −0.18 |
Dry → Wet | 6 | 0.19 / 0.15 | 0.14 / 0.04 | 0.05 / 0.00 | 0.07 / −0.04 |
Wet → Dry | 2 | 0.31 / 0.24 | 0.28 / 0.20 | 0.23 / 0.03 | 0.54 / 0.14 |
Wet → Dry | 3 | 0.31 / 0.09 | 0.21 / 0.07 | 0.21 / 0.05 | 0.34 / 0.18 |
Wet → Dry | 4 | 0.32 / 0.18 | 0.22 / 0.14 | 0.17 / 0.07 | 0.41 / 0.20 |
Wet → Dry | 6 | 0.13 / 0.04 | 0.12 / 0.07 | 0.11 / 0.06 | 0.16 / 0.10 |
Bold values indicate that the LSTM did not differ significantly from the calibrated HBV. Underlined values indicate that the LSTM under a changing condition did not differ significantly from the constant condition (α = 0.05).
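The significance comparisons behind Table 2 rely on an independent two-sample t-test of mean robustness. The sketch below computes the test statistic with pooled variance (an assumption; the text does not state whether equal variances were assumed) and compares it with the two-sided critical value for six catchments per group. The per-catchment robustness values are invented for illustration and are not results from this study.

```python
from math import sqrt
from statistics import mean, variance

# Two-sided critical value of Student's t for alpha = 0.05 and
# df = 6 + 6 - 2 = 10 (six catchments in each group).
T_CRIT_DF10 = 2.228

def pooled_t(a, b):
    """Independent two-sample t statistic assuming equal variances."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1.0 / na + 1.0 / nb))

# Hypothetical per-catchment robustness (calibration minus validation NSE).
lstm_rob = [0.10, 0.15, 0.12, 0.18, 0.11, 0.14]
hbv_rob = [0.05, 0.03, 0.06, 0.04, 0.02, 0.04]

t_stat = pooled_t(lstm_rob, hbv_rob)
differs = abs(t_stat) > T_CRIT_DF10
print(f"t = {t_stat:.2f}, significantly different: {differs}")
```

Under this rule, a pair whose absolute t statistic stays below the critical value would be marked as not significantly different, which corresponds to the bold or underlined entries described in the table note.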
The calibrated HBV was always more robust than the LSTM model for both constant and non-constant conditions. The LSTM was robust enough for changing conditions only when the model was applied in a drier period than that used in calibration and using the maximum number of years during calibration (6 years). While a good indication of robustness was already observed with a shorter time series used in the calibration for the HBV, a longer dataset length was needed for the LSTM.
DISCUSSION
The LSTM had poorer performance under changing conditions. Others have found similar results when applying process-based models under changing conditions (Refsgaard & Knudsen 1996; Xu 1999; Seibert 2003; Wilby 2005; Chiew et al. 2009; Vaze et al. 2010; Bastola et al. 2011).
Overall, in calibration, the LSTM resulted in a much better fit than the HBV. However, the performance drop when moving to validation was also much larger for the LSTM (i.e., it was less robust). For the validation period, the LSTM was at best as good as the HBV (especially for criteria other than the one used in calibration and for changing conditions).
The LSTM was shown to be more dependent on dataset length to perform as well as the bucket-type model. The improvement in model performance/robustness with the increase of the time-series size used in calibration was also observed by Ayzel & Heistermann (2021) and Gauch et al. (2019) while testing the performance of the LSTM networks for streamflow prediction in constant conditions. Here, this positive correlation was more pronounced for the constant conditions than that for changing conditions. The lesser contribution of the time-series size in model performance under changing conditions may be explained by the limitation of the data provided for the calibration (only dry or wet periods used in calibration), that is, less information about the hydrological processes was provided to the model. The physical constraints of the HBV model made the need for longer data series in calibration less important, indicating the suitability of this model for predictions when data are limited. The same was observed by Ayzel & Heistermann (2021) while comparing an LSTM network to the GR4H conceptual model.
The robustness analysis showed that the LSTM is robust enough for climate transposability to a drier period. The generalization from a dry period to a wetter period is less satisfactory, mainly because in this case, the model needs to extrapolate to a discharge range not used in calibration, as also reported by Pan et al. (2019) and Wilby (2005) employing traditional hydrological models.
It is important to highlight that in this model setup, the LSTM was trained on individual catchments. Calibrating it over a large number of catchments and with longer data series may yield better results and should be explored further. However, comparing models with different structures is not an easy task, especially when trying to keep the comparison fair. More sophisticated hyperparameter tuning techniques may also improve the LSTM model's simulations, as may coupling the model with process-based models.
CONCLUSION
In this work, we tested the predictive ability of the LSTM for daily discharge prediction in snow-influenced catchments under changing conditions. In calibration, the LSTM resulted in a much better fit than the HBV; however, in validation, the LSTM often performed worse than the HBV (especially for criteria other than the one used in calibration and for changing conditions). The performance drop when moving to validation was larger for the LSTM, indicating less robustness, and the data-driven model was shown to be more dependent on the length of the dataset used in calibration to deliver robustness comparable to a bucket-type model.
Despite this, the results indicate that using longer data series in calibration can benefit the use of the LSTM in contrasting conditions. We recommend that other LSTM modeling setups should be studied further to improve the model performance in such conditions.
ACKNOWLEDGEMENTS
The authors thank the research funding agencies CAPES-Brazil and ESKAS – Swiss Government Excellence Scholarship for the scholarships granted to the first author at different periods of time during this research.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.