Machine learning algorithms are increasingly applied in hydrological studies with promising results. However, these algorithms generally lack the ability for easy interpretability of the results by users. In this study, we compare six different explainable artificial intelligence (XAI) algorithms that help understand the effect of input data on the simulation results. The methods are explored on two distinct approaches for streamflow modeling using a long short-term memory (LSTM) model: a single model approach using only meteorological forcing data and a regional approach including also static catchment attributes. To gain further insight into the internal dynamics of the LSTM models, the relationship between cell states and soil moisture is investigated. A strong correlation suggests that the LSTM models inherently capture the concept of soil moisture as a catchment-scale storage mechanism. The XAI methods are applied to derive a timestep of influence, revealing how many days of input data are relevant for the model output. All XAI methods result in similar seasonal patterns in the timestep of influence, suggesting that the methods are comparable. Setting soil moisture dynamics in context to seasonal development of the timestep of influence suggests resetting LSTM as soon as soil moisture saturation occurs.

  • Two long short-term memory (LSTM) models were trained for streamflow modeling.

  • Internal cell states of LSTMs correspond well with ERA5-Land soil moisture dynamics.

  • Six explainable artificial intelligence algorithms show similar results.

  • LSTM resets as soon as soil moisture saturation is reached.

Like other scientific disciplines, hydrology witnessed numerous applications of machine learning (ML) and deep learning models in recent years. For example, an artificial neural network has been used for the prediction of groundwater levels (Guzman et al. 2019), a physics-constrained recurrent neural network has been used to predict evapotranspiration (Zhao et al. 2019), and a support vector machine has been applied to analyze the competitive relationship between reservoir operation decisions for hydroelectricity production and irrigation (Zeng et al. 2017).

Long short-term memory (LSTM) models have shown exceptional results in predicting sequential data such as streamflow modeling (Le et al. 2021). Also, they have repeatedly proven their potential over traditional hydrological models (e.g. Kratzert et al. 2019b; Gauch et al. 2021; Lees et al. 2021; Frame et al. 2022). ML models mimic the physically relevant runoff processes from historical data by establishing functional relationships between input and output data (Duan et al. 2020). The inherent complexity and nonlinear nature of such models enable them to learn abstract patterns but lead to difficulties in explaining their predictions. It can be concluded that there is an inverse relationship between explainability in a system and its predictive accuracy, and the link between ML model's internal weights and the physical environment of the problem is extremely hard to isolate (Angelov et al. 2021). However, understanding the features that lead to a particular output builds trust with users (Shrikumar et al. 2017). The need to open the ‘black box’ has been expressed not only in hydrological research but also in almost all application areas.

Initial attempts have been made to connect the internal state vector within the LSTM cells to our human concept of the hydrological cycle. Kratzert et al. (2019a) related the cell state to soil water storage and snow depth from the Sacramento/Snow-17 hydrological model in two exemplary catchments. Lees et al. (2022) were able to fit a linear probe to model ERA5-Land soil water storage from the cell states.

In addition, explainable artificial intelligence (XAI) methods have been applied to ML models in hydrology. XAI is used to provide interpretations of the model internals, and therefore results can be better understood by the user (Althoff et al. 2021a). Expected Gradients were used to distinguish three flood-inducing mechanisms (snowmelt, recent rainfall, and historical rainfall) within the contiguous United States (Jiang et al. 2022). Other applied methods in streamflow modeling include attention mechanism and perturbation methods to obtain information about the importance of input features (Ding et al. 2020; Anderson & Radić 2022; Prasanth Kadiyala & Woo 2022). Kratzert et al. (2019a) and Althoff et al. (2021b) used the Integrated Gradient (IG) method to distinguish the ‘timestep of influence’ (TSOI), defined as the amount of previous days relevant for the result. However, the sheer amount of available XAI algorithms makes it hard to compare results of different studies as many authors only used one or a maximum of two different XAI algorithms on their models. A systematic comparison of such algorithms is missing.

Therefore, in this study we aim to (i) compare the results of multiple XAI methods, evaluated as the TSOI, to increase their comparability in the context of streamflow modeling and (ii) link the resulting TSOI to the physically relevant processes in a catchment exemplarily. We establish two different LSTM models for the purpose of streamflow modeling using a regional and a single-catchment approach. Then six different XAI algorithms are applied and evaluated using an adapted version of the TSOI to uncover the temporal dynamics of the input feature attribution. The TSOI is then set into context with soil moisture dynamics in the catchment to investigate its hydrological significance.

Study area

The study area is the Ems catchment located in northwest Germany (Figure 1). From its spring to the outlet in the North Sea, the river only surpasses 135 m in elevation. The entire catchment displayed in Figure 1 has a size of 8,076 km2, and the sub-catchment for in-depth evaluation has a size of 1,448 km2. Long-term climate characteristics are 800 mm/a of precipitation, which is almost equally distributed throughout the year. The mean temperature is around 9 °C and the potential evapotranspiration (PET) is around 490 mm/a. The climatic water balance (precipitation − PET) (Figure 1) shows the surplus of precipitation in the winter month and a potential depletion of catchment water storage from April to August. High streamflow season usually lasts from December to March and low-flow season from June to October (Geschäftsstelle Ems et al. 2009). Sand is the dominant grain size in the catchment's soil, resulting in a baseflow ratio up to 0.8 (Wendland et al. 2005). A more detailed description of the catchment can be found in Ley et al. (2023).
Figure 1

(a) Overview map of the location of the Ems catchment in Germany. (b) Detailed view on the topography of the three sub-catchments used to train the regional LSTM and the location of the most upstream sub-catchment, which we will evaluate in depth. Furthermore, the long-term seasonal dynamics of streamflow (c), precipitation (d), temperature (e), and climatic water balance (f) are shown.

Figure 1

(a) Overview map of the location of the Ems catchment in Germany. (b) Detailed view on the topography of the three sub-catchments used to train the regional LSTM and the location of the most upstream sub-catchment, which we will evaluate in depth. Furthermore, the long-term seasonal dynamics of streamflow (c), precipitation (d), temperature (e), and climatic water balance (f) are shown.

Close modal

Data

Daily streamflow data for the three streamflow gauges (Figure 1) were acquired from the Global Runoff Data Center (GRDC) for the period 1969–2016 (BFG 2022).

All available station data within the Ems catchment area were acquired from the German Weather Service (DWD 2022) to generate the hydroclimatic input. Not all stations provide a continuous time series throughout the whole period of 1969–2016. Data gaps were not filled. Potential biases due to missing station data and the orographic effects on climate data can be neglected due to the homogeneity and low elevation differences in all sub-catchments (Ley et al. 2023). Information on soil properties was taken from the European Soil Database v2.0 to calculate static attributes for the regional LSTM (Panagos et al. 2022). Land use information was derived from CORINE Land Cover (European Environment Agency et al. 2018).

CAMELS-GB dataset

Large standardized datasets are required to successfully train a regional LSTM model (e.g. Kratzert et al. 2019b; Althoff et al. 2021b; Lees et al. 2021). The idea behind the CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) initiative is to provide such datasets. Datasets include a wide range of hydrometeorological drivers and static catchment attributes such as elevation, land use, and soil conditions. To this point, no dataset exists for Germany. Assuming similar climate conditions, we chose to use the CAMELS-GB dataset for training, which includes 671 catchments across Great Britain (Coxon et al. 2020).

ERA5-Land reanalysis

ERA5-Land reanalysis combines model data with observations into a globally complete and consistent dataset of hourly and monthly estimates for hydroclimatic state variables (Muñoz-Sabater et al. 2021). The spatial resolution of the dataset is 0.1°, which equals around 9 km. To assess the dynamics of catchment water storage, we use results for the ‘volumetric soil water layer 3’, which describes water content within the soil column in a depth of 0.28–1 m. In a preliminary analysis, soil moisture dynamics in this depth showed the highest correlation to the internal cell state of the LSTM models.

General study design

Two separate LSTM models were trained for streamflow modeling for the low-land Ems catchment in Germany. One model only used catchment-specific hydrometeorological forcing data to provide streamflow estimations at the outlet gauge. A regional approach was applied to train the second LSTM model. Ems catchment data were harmonized and added to a large sample dataset (CAMELS-GB) including hydrometeorological forcing and static attributes such as land use and soil characteristics. The LSTM is then trained using the large dataset.

Afterward, six different XAI methods were applied to both successfully trained models to quantify the feature importance of the meteorological inputs at the upstream sub-catchment (Figure 1). The results were evaluated using an adapted version of the TSOI to distinguish seasonal patterns of input feature importance. Second, the internal cell states of the LSTM were compared to ERA5-Land soil moisture values to reveal the potential derivation of soil moisture dynamics from the correlations between the input data and the target variable streamflow as proposed by Kratzert et al. (2019a) and Lees et al. (2022). The TSOI dynamics resulting from the XAI methods are then set into context to soil moisture variations to distinguish a potential physical relevance of the TSOI.

LSTM training design

LSTM models, in particular, have proven themselves useful for time series modeling (Hochreiter & Schmidhuber 1997). Within the LSTM, two cell states continuously process information from one timestep to the other. The cell state can be interpreted as the long-term memory of the cell, whereas a hidden state processes short-term information. A more detailed description of the LSTM can be found in Ley et al. (2023). The cell state makes LSTM capable of capturing persistent dependencies, like soil moisture and snow accumulation, which makes them suitable to model streamflow (Kratzert et al. 2018).

For the ‘single model’ approach, only hydroclimatic forcing data are used as input for the LSTM during the training process. Applying this approach only allows one to apply the trained model for the particular catchment it is trained for (e.g. Han et al. 2021; Ley et al. 2023). In addition, an LSTM can be trained with a dataset containing multiple catchments. Next to hydroclimatic data, these data must include static catchment characteristics (e.g. soil properties, elevation, catchment size, land use). Regional models resulted in improved performance indices and increased transferability to ungauged catchments in model intercomparison (Kratzert et al. 2019b). Detailed information about forcing data and hyperparameters for both model setups is provided in the Supplementary Material A1. As no large standardized catchment dataset exists for Germany, we use the CAMELS-GB dataset where climate conditions can be considered similar. Fifty random catchments were selected from the dataset due to storage limitations, and additionally, 3 of the 10 largest catchments were included randomly because the Ems catchment is relatively large. Then, we prepared input data for three sub-catchments of the Ems to match the CAMELS-GB dataset (Figure 1). The data were passed as one combined input dataset to train the regional LSTM. Input data were separated into three different periods: 2000–2009 for training, 2010–2016 for validation, and 1970–1999 for testing. Hyperparameter search was carried out using a small grid search, but we did not want to focus on the overall model performance and instead compared the results of XAI methods on both approaches.

Explainable artificial intelligence

The model complexity of ML applications makes interpretation of the results difficult. Therefore, XAI has received increasing scientific attention over recent years (Linardatos et al. 2021). XAI includes methods that interpret and explain ML model results.

The XAI methods applied in this study can be separated into two groups. First, gradient-based methods analyze the gradient of the cost function of LSTM with respect to the input feature. Gradients are analogous to model coefficients and describe the contribution of each input feature to the output (Althoff et al. 2021b). LSTMs are completely differentiable. Therefore, the attribution of each individual input feature can be calculated. Second, perturbation-based methods directly compute the attribution of an input feature by removing, masking, or altering them and measuring the difference with the original output (Ancona et al. 2018).

The XAI algorithms provide an attribution for each individual input. The input features with a positive attribution increase the output values and the ones with a negative value decrease the output (Xu et al. 2021). The larger the attribution, the higher its importance for the result. The following XAI algorithms were tested on both models (single and regional approach). A more detailed description of all algorithms can be found in Kokhlikyan et al. (2020).

Gradient-based methods

Saliency

Using gradients to create the so-called saliency (Sal) maps was first proposed by Simonyan et al. (2014). Sal maps are used extensively to analyze the importance of input features to model predictions and are mostly used for application in vision and language tasks with their application to time series data being relatively unexplored (Ismail et al. 2020). Sal calculates attributions by taking the absolute value of the partial derivative of the target output with respect to the input feature.

InputXGradient

InputXGradient (IXG) is basically an extension of the Sal algorithm. With IXG, the attribution scores are calculated by elementwise multiplication of gradients with the actual input (Nielsen et al. 2022). The elementwise multiplication can be considered an application of a model-independent filter (to the input), which may reduce noise and smoothen the attribution maps (Ancona et al. 2019).

Integrated Gradient

IG was introduced by Sundararajan et al. (2017). IG computes the partial derivatives of the output with respect to each input feature. While Sal and IXG compute a single derivative, evaluated at the provided input, IG computes the average gradient while the input varies along a linear path from a given baseline (Ancona et al. 2018). From the baseline to the actual input, the gradient is integrated along an equally split path. The baseline was set to zeros.

GradientSHAP

GradientSHAP (GS) approximates SHAP (SHapley Additive exPlanations) values by computing the expectations of gradients by randomly sampling from baselines (Lundberg & Lee 2017). SHAP values are another way to explain the output of an ML model by using a game theory approach. It adds white noise to each input sample, selects a random baseline and a random point along the path between the baseline and the input, and computes the gradient of outputs with respect to those selected random points. In a sense, it can be viewed as an approximation of IGs by computing the expectations of gradients for different baselines (Kokhlikyan et al. 2020). The baseline was set to zeros.

Perturbation-based methods

Feature ablation

Feature ablation (FA) is a perturbation-based approach to compute the attribution by replacing each input feature with a given baseline and computing the difference in output. Each value is taken as a feature and replaced independently. The calculated difference signifies the attribution for the permuted feature (Kokhlikyan et al. 2020).

Feature permutation

Feature permutation (FP) takes each input feature, permutes the feature values within a batch, and computes the difference between the original and shuffled outputs for the given batch. This difference signifies the feature importance of the permuted feature (Kokhlikyan et al. 2020).

Timestep of influence

The hyperparameter ‘window size’ defines how many past days are used as input data to model the streamflow on a particular day. In our case, this value was set to 250 days. Each of the aforementioned XAI methods assigns an individual attribution for each of the input features and for every timestep. The TSOI concept (Kratzert et al. 2019a) was introduced to gain an understanding about the temporal dynamics of input importance. The idea behind the TSOI is to distinguish the number of relevant past input time steps (in our example days) for the result. The methodology to calculate the TSOI is presented exemplarily in Figure 2. The plot shows the attributions resulting from the IXG method for a given day (Figure 2(a)). The absolute values for each day are then summed, as negative and positive attributions are of equal importance for the final result (Figure 2(b)). In the next step, we adapted the original method proposed by Kratzert et al. (2019a) and followed the recommendations given by Althoff et al. (2021b). Instead of using one fixed threshold throughout the whole year, the cumulative attribution with respect to the overall sum is calculated. The first day with relevant input is defined as the day when 5% of the total attribution is reached (Figure 2(c)). Using this approach, the method becomes more independent of individual extreme events and also of the interannual variability of attributions.
Figure 2

Scheme that explains the method to use attributions resulting from the XAI methods to calculate the TSOI for a particular day. Input timestep 250 is the previous day of the actual modeled day.

Figure 2

Scheme that explains the method to use attributions resulting from the XAI methods to calculate the TSOI for a particular day. Input timestep 250 is the previous day of the actual modeled day.

Close modal

Model training results

The models were evaluated using the widely applied indices like Nash–Sutcliffe efficiency (NSE) (Nash & Sutcliffe 1970), Kling–Gupta efficiency (KGE) (Gupta et al. 2009), and the logNSE (NSE with logarithmic values). For the entire testing period (1970–1999), an NSE of 0.88 was achieved for both models. Looking at the logNSE, the single model has a higher performance with a value of 0.85 compared to 0.77 for the regional model. The KGE is 0.82 for the single model and 0.78 for the regional model. Looking at the hydrograph of the year 1998 exemplarily, both models show a good fit within all streamflow ranges (Figure 3).
Figure 3

Hydrographs of the single model (a) and regional model (b) exemplarily for the year 1998 in the test period with their corresponding NSE and KGE.

Figure 3

Hydrographs of the single model (a) and regional model (b) exemplarily for the year 1998 in the test period with their corresponding NSE and KGE.

Close modal

Soil moisture and cell state dynamics

The Ems catchment is characterized by a low elevation gradient and highly permeable sandy soils (Ley et al. 2023). According to the hydrological process understanding, it can be expected that the infiltration of precipitation is enhanced and subsurface flow driven by soil moisture has a high importance for streamflow generation. The Pearson correlation of the internal cell state of the LSTMs and ERA5-Land soil moisture (0.28–1 m) is shown in Figure 4(a). LSTM vectors are passed through a linear layer before the final results are returned. Therefore, the values can be inversed again. This is why absolute values are inspected. In the single model, five cell states have an absolute correlation of above 0.75. In the regional model, 55 of 128 cells have an absolute correlation above 0.6. The majority of the cells are negatively correlated, but also for the regional model, the cell state with the highest correlation is positive (0.93).
Figure 4

(a) Heatmap for the correlation of ERA5-Land soil moisture (0.28–1 m) and the cell states within the LSTM cells on a monthly scale. All cells with an absolute correlation of above 0.6 are displayed. (b) Sorted cell state of the highest correlated cells for each model. The corresponding months with the 20 highest and lowest values of the ERA5-Land soil moisture are marked for each model.

Figure 4

(a) Heatmap for the correlation of ERA5-Land soil moisture (0.28–1 m) and the cell states within the LSTM cells on a monthly scale. All cells with an absolute correlation of above 0.6 are displayed. (b) Sorted cell state of the highest correlated cells for each model. The corresponding months with the 20 highest and lowest values of the ERA5-Land soil moisture are marked for each model.

Close modal

The cell state with the highest correlation of each model was further evaluated (Figure 4(b)). The cell states of cell numbers 8 (single model) and 65 (regional model) were sorted. Then, the months with the 20 lowest and highest ERA5-Land soil moisture levels were marked to check whether extreme values were properly reproduced within the cell states. The months with the lowest ERA5-Land soil moisture show a good match with the lowest cell states within both models. All 20 values are located within the lowest 75 cell states. The distribution of the months with high soil moisture is wider. However, most of the data points are well located within the 80 highest cell states.

Timestep of influence

The TSOI analysis investigates the seasonal patterns of how many past days are relevant for the model prediction. Each of the six XAI algorithms shows similar seasonal dynamics for the single model (Figure 5(a)). The period with the lowest TSOI occurs around days 70–90 of the year (March) depending on the method. After the wet winter period, it can be assumed that the catchment storage is almost saturated. From that point on, the TSOI rises continuously while the catchment dries out toward the peak (at the end of the year). The absolute values of the TSOI vary depending on the method. For Sal, IXG, Fa, and FP, the maximum TSOI values reach up to 200. Instead, the 25% percentile only reaches up to 170 when IG and GS are used. From the peak on, the TSOI declines steeply and steadily until the minimum is reached in winter.
Figure 5

Seasonal dynamics of the TSOI plotted for every day of the year for the single model (a) and regional model (b) for every day of the year, the median and 25th and 75th percentile TSOI are plotted.

Figure 5

Seasonal dynamics of the TSOI plotted for every day of the year for the single model (a) and regional model (b) for every day of the year, the median and 25th and 75th percentile TSOI are plotted.

Close modal

The results for the regional model in Figure 5(b) show a comparable seasonal dynamic to the single model but differ in some details. Next to a minimum around day 60 of the year (late February/early March), the methods Sal, IXG, FA, and FP show a second local minimum around the day 210 of the year (end of July). The minimum is most pronounced using the IXG method. Instead, this effect is not visible for IG and GS. The range of values is smaller than in the single model. The maximum TSOI values stay below 200 days and the minima tend to be higher with TSOI values around 70 days (Sal, IXG, FA, FP) and 50 days (IG, GS), respectively.

To further analyze the development of the previously mentioned seasonal dynamics comprehensively, all relevant data are analyzed for the driest and wettest year in the test period (Figure 6). In the dry year 1971 (Figure 6(a)), monthly precipitation values are below average in all months except June with an annual sum of 605 mm/a. ERA5-Land soil moisture corresponds to this. Highest correlated cells show a similar development compared to ERA5-Land soil moisture. Instead, the TSOI of both models starts on a low level and rises continuously until it peaks at the end of the year.
Figure 6

Overview of ERA5-Land soil moisture, the highest correlated cells in both models, the deviation of precipitation compared to the long-term mean, and TSOI development of both models. The results are presented for the driest year 1971 (a) and wettest year 1998 within the test period (b).

Figure 6

Overview of ERA5-Land soil moisture, the highest correlated cells in both models, the deviation of precipitation compared to the long-term mean, and TSOI development of both models. The results are presented for the driest year 1971 (a) and wettest year 1998 within the test period (b).

Close modal

In the wet year 1998 (Figure 6(b)), precipitation is above average in early spring and autumn. As expected, the observed precipitation results in higher soil moisture in spring and autumn. In August, still a very low soil moisture of 0.15 is modeled by ERA5-Land, but already the values reached up to 0.35 in October. Cell state values follow the development of soil moisture values subsequently. The TSOI time series shows a different development compared to the dry year. In both models, the TSOI is very high at the beginning of the year and drops to the minimum when the first soil moisture peak in spring is reached. The TSOI then increases again until the second soil moisture peak is reached. Even though the TSOI at this point is still relatively low with values around 100 days, it drops down to around 40 days.

The results indicate a nonlinear correlation between soil moisture dynamics and the TSOI. In Figure 7, a 10-year time series of monthly soil moisture values is plotted against daily TSOI dynamics. Soil moisture values usually peak between November and February and then decline steadily until late summer. TSOI dynamics for both models show different seasonal patterns compared to soil moisture. The TSOI peak usually is reached shortly before the end of each year and then decreases within several days down to the minimum value. From the lowest point on, the TSOI then increases steadily in the single model. In the regional model, the signal is noisier and the TSOI remains relatively small during spring and early summer. From late summer on, the TSOI increases steeply until the maximum. The gray lines indicate the first month reaching a soil moisture of above 0.3 for every wet season suggesting a state of soil moisture saturation. Timing of the soil moisture peak corresponds well with the steep decline of the TSOI in both models.
Figure 7

ERA5-Land soil moisture dynamics (monthly) and the TSOI (daily) calculated with the IG methods for the single and regional model plotted for the period 1990–2000. The first peak reaching above a value of 0.30 in every winter half year was evaluated for soil moisture dynamics and plotted also in the TSOI time series.

Figure 7

ERA5-Land soil moisture dynamics (monthly) and the TSOI (daily) calculated with the IG methods for the single and regional model plotted for the period 1990–2000. The first peak reaching above a value of 0.30 in every winter half year was evaluated for soil moisture dynamics and plotted also in the TSOI time series.

Close modal

Model performance

Two approaches were used to build up an LSTM model for the Ems catchment. The NSE values of 0.88 for both models can be considered very good according to Moriasi et al. (2007). Regarding logNSE and KGE, the single model has superior performance compared to the regional model. This is despite the fact that regional models have shown higher model quality in the past (Kratzert et al. 2019b). This can be explained by the homogeneity of the catchment properties and by the limited effort, which has been put into model training. A satisfactory level was reached quickly, and a model comparison was beyond the scope of this study. The use of a larger subset of the CAMELS-GB dataset or the availability of a German dataset would probably increase the quality of the regional model results and improve the spatial transferability of the trained model. However, both models show good quality in all streamflow ranges and are applicable for further analysis.

Cell state and soil moisture dynamics

Baseflow, driven by soil moisture, is expected to be the main runoff component in a sandy low-land catchment such as the Ems for most of the year (Bormann 2010). To evaluate whether the internal cell state of the LSTM reflects the hydrological state variables, we used ERA5-Land reanalysis data to compare cell states and soil moisture. It must be taken into consideration that ERA5-Land soil moisture data comes associated with uncertainties and that LSTM cell states do not necessarily have to map the ERA5-Land data one-to-one. The LSTMs are not constrained to mimic soil moisture dynamics in a single cell state vector. Instead, the signal could be split up into multiple vectors (Lees et al. 2022). Also, the ERA5-Land data use fixed depth levels, which most probably are not equally expressed in the cell state.

However, the strong correlation between soil moisture and cell states in both models supports the idea that the LSTM learns the concept of subsurface storage, as a storage or retention of rainfall input. The idea of a storage concept such as soil moisture was never intentionally implemented in the LSTM but was only derived from the statistical relationship between forcing data and the target variable streamflow. The resulting storage effect is similar to conceptual models such as HBV (Bergström & Forsman 1973), applied to the Ems river with similar model performance (Ley et al. 2023) (Figure 8). To improve soil moisture results directly from the LSTM, a probe can be trained as carried out by Lees et al. (2022). The method can also be extended to map the proportion on the overall LSTM memory to gain insights into the relevance of soil moisture to streamflow on the seasonal scale.
Figure 8

(a) Soil moisture storage dynamics of a calibrated HBV model in the same catchment (Ley et al. 2023) compared to the highest correlated cell states in the single model (b) and regional model (c).

Figure 8

(a) Soil moisture storage dynamics of a calibrated HBV model in the same catchment (Ley et al. 2023) compared to the highest correlated cell states in the single model (b) and regional model (c).

Close modal

The proven capability of LSTM models to map soil moisture dynamics according to our understanding of the hydrological system allows us to expect an accurate representation of other relevant catchment processes within the LSTM internals. Potentially, evapotranspiration and nonlinear routing dynamics can be derived for the catchment. Due to the flexibility of the LSTM model, those processes might be represented more precisely than they are in conceptual models with a more rigid structure.

Explainable artificial intelligence

According to Ismail et al. (2020), gradient-based and perturbation-based methods failed to provide high-quality interpretations for multivariate time series classification and Schlegel et al. (2019) concluded that more suitable XAI methods for time series should be introduced to achieve a better human interpretable understanding. However, six different XAI methods were applied in this study to compare their results using the TSOI. The application of the gradient-based XAI algorithms in LSTM models requires consideration of the following aspect. The use of gradient-based approaches can be problematic because sigmoid and hyperbolic tangent activations have a near-zero gradient at high or low inputs even though these inputs are significant for the outcome (Shrikumar et al. 2017). Accordingly, it is possible that the relevance of extreme values is neglected by the methods. In particular, this can be expected for precipitation because extreme precipitation events are many times greater than normal events. Also, two of the applied XAI algorithms (IG and GS) require a baseline from which the attribution is calculated stepwise while being shifted to the actual input values. The baseline is a hyperparameter that requires careful selection because if the target output is too close to the baseline, the importance of the feature will not be recognized (Sturmfels et al. 2020). In the hydrological context, it is difficult to choose a baseline that contains no information. We chose a sequence of zeros as the baseline. However, the resulting seasonal TSOI dynamics show a similar behavior for all XAI methods. The TSOI dynamics appear to be independent of the applied XAI method and the model type on the seasonal timescale. Nevertheless, differences exist and will be subject to future studies.

We adopted the TSOI method following the recommendation of Althoff et al. (2021b). Using a percentage of the accumulated sum instead of a fixed value as a threshold has the advantage that the generally higher attribution in the winter months is not considered. The seasonal pattern of the TSOI differs from previous studies (e.g. Kratzert et al. 2019a; Althoff et al. 2021b). Differences may be due to the adapted TSOI methodology but also on different catchment characteristics and runoff generation mechanisms (baseflow in this study).

However, the application of XAI methods in time series modeling is still controversial in the literature. The results of the exceptionally dry and wet years in Figure 6 reveal that a direct correlation between TSOI and soil moisture or cell state is not possible. Instead, it appears that the TSOI drops down to a baseline value of about 40 days once the peak in soil moisture is reached. This hypothesis is supported by the results in Figure 7. The timing of soil moisture peaks corresponds well with the steep decline of the TSOI. In terms of the hydrological concept, this can be interpreted as a deletion of previous information when saturation in the soil storage is reached. Further in-depth evaluation of the relationship between TSOI and soil moisture dynamics in various catchments is necessary to confirm the hypothesis. The CAMELS-GB dataset used in this study already includes catchments covering a range of hydroclimatic conditions, in particular, regarding elevation, precipitation, and mean streamflow (Coxon et al. 2020). Regional LSTM approaches show good results in different climatic regions and ungauged catchments (e.g. Kratzert et al. 2019b; Arsenault et al. 2023). Therefore, transferability of the results presented in this study can be expected.

The comparison with ERA5-Land soil moisture dynamics provides further evidence that LSTMs are capable of deriving soil moisture dynamics from the functional relationship of climate forcing and output. The application of six different XAI algorithms in combination with the TSOI showed that the methods lead to comparable results in terms of seasonal dynamics. For future evaluations, it is therefore not necessary to use an ensemble of the XAI applied in this study. Results from publications are transferable despite the use of a different subset of methods. The combination of the TSOI signal and soil moisture dynamics further supports the idea of a physically realistic representation of the important storage effects by the LSTM, as represented in conceptual models in a similar way. Because the rapid decrease in TSOI corresponds well with soil moisture peaks in spring, we hypothesize that the TSOI is neutralized after the system has passed through a state of saturation. Storage level depletion is accompanied by an increase in the TSOI. However, this opens an interesting field of study in the future to test this statement in catchments with different climatic conditions and streamflow regimes.

AL contributed to conceptualization, methodology, data curation, formal analysis, and writing the original draft. HB contributed to conceptualization, supervision, reviewing and editing the writing, and funding acquisition. MC contributed to conceptualization, supervision, and reviewing and editing the writing.

This research was funded by the German Federal Ministry of Education and Research (BMBF) (grant numbers: 01LR2003D and 01LR2003D1) within the project WAKOS – Wasser an den Küsten Ostfrieslands.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Althoff
D.
,
Bazame
H. C.
&
Nascimento
J. G.
2021a
Untangling hybrid hydrological models with explainable artificial intelligence
.
H2Open Journal
4
,
13
28
.
https://doi.org/10.2166/h2oj.2021.066
.
Althoff
D.
,
Rodrigues
L. N.
&
da Silva
D. D.
2021b
Addressing hydrological modeling in watersheds under land cover change with deep learning
.
Advances in Water Resources
154
,
103965
.
https://doi.org/10.1016/j.advwatres.2021.103965
.
Ancona
M.
,
Ceolini
E.
,
Öztireli
C.
&
Gross
M.
2018
Towards better understanding of gradient-based attribution methods for Deep Neural Networks. http://arxiv.org/abs/1711.06104.
Ancona
M.
,
Ceolini
E.
,
Öztireli
C.
,
Gross
M.
,
2019
Gradient-based attribution methods
. In:
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science
(
Samek
W.
,
Montavon
G.
,
Vedaldi
A.
,
Hansen
L. K.
&
Müller
K.-R.
, eds.).
Springer International Publishing
,
Cham
, pp.
169
191
.
https://doi.org/10.1007/978-3-030-28954-6_9
.
Angelov
P. P.
,
Soares
E. A.
,
Jiang
R.
,
Arnold
N. I.
&
Atkinson
P. M.
2021
Explainable artificial intelligence: An analytical review
.
WIRES Data Mining and Knowledge Discovery
11
,
e1424
.
https://doi.org/10.1002/widm.1424
.
Arsenault
R.
,
Martel
J.-L.
,
Brunet
F.
,
Brissette
F.
&
Mai
J.
2023
Continuous streamflow prediction in ungauged basins: Long short-term memory neural networks clearly outperform traditional hydrological models
.
Hydrology and Earth System Sciences
27
,
139
157
.
https://doi.org/10.5194/hess-27-139-2023
.
Bergström
S.
&
Forsman
A.
1973
Development of a conceptual deterministic rainfall-runoff model
.
Hydrology Research
4
,
147
170
.
https://doi.org/10.2166/nh.1973.0012
.
BFG
2022
Global Runoff Data Centre
. .
Bormann
H.
2010
Runoff regime changes in German rivers due to climate change
.
ERDKUNDE
64
,
257
279
.
https://doi.org/10.3112/erdkunde.2010.03.04
.
Coxon
G.
,
Addor
N.
,
Bloomfield
J. P.
,
Freer
J.
,
Fry
M.
,
Hannaford
J.
,
Howden
N. J. K.
,
Lane
R.
,
Lewis
M.
,
Robinson
E. L.
,
Wagener
T.
&
Woods
R.
2020
CAMELS-GB: Hydrometeorological time series and landscape attributes for 671 catchments in Great Britain
.
Earth System Science Data
12
,
2459
2483
.
https://doi.org/10.5194/essd-12-2459-2020
.
Ding
Y.
,
Zhu
Y.
,
Feng
J.
,
Zhang
P.
&
Cheng
Z.
2020
Interpretable spatio-temporal attention LSTM model for flood forecasting
.
Neurocomputing
403
,
348
359
.
https://doi.org/10.1016/j.neucom.2020.04.110
.
Duan
S.
,
Ullrich
P.
&
Shu
L.
2020
Using convolutional neural networks for streamflow projection in California
.
Frontiers in Water
2
,
28
.
https://doi.org/10.3389/frwa.2020.00028
.
European Environment Agency, European Union, Copernicus Land Monitoring Service
2018
Corine Land Cover
. .
Frame
J. M.
,
Kratzert
F.
,
Klotz
D.
,
Gauch
M.
,
Shalev
G.
,
Gilon
O.
,
Qualls
L. M.
,
Gupta
H. V.
&
Nearing
G. S.
2022
Deep learning rainfall–runoff predictions of extreme events
.
Hydrology and Earth System Sciences
26
,
3377
3392
.
https://doi.org/10.5194/hess-26-3377-2022
.
Gauch
M.
,
Kratzert
F.
,
Klotz
D.
,
Nearing
G.
,
Lin
J.
&
Hochreiter
S.
2021
Rainfall–runoff prediction at multiple timescales with a single long short-term memory network
.
Hydrology and Earth System Sciences
25
,
2045
2062
.
https://doi.org/10.5194/hess-25-2045-2021
.
Geschäftsstelle Ems, Ministerie van Verkeer en Waterstaa, Geschäftsstelle Ems-NRW
2009
Internationaler Bewirtschaftungplan nach Artikel 13 Wasserrahmenrichtlinie für die Flussgebietseinheit Ems
. .
Gupta
H. V.
,
Kling
H.
,
Yilmaz
K. K.
&
Martinez
G. F.
2009
Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling
.
Journal of Hydrology
377
,
80
91
.
https://doi.org/10.1016/j.jhydrol.2009.08.003
.
Guzman
S. M.
,
Paz
J. O.
,
Tagert
M. L. M.
&
Mercer
A. E.
2019
Evaluation of seasonally classified inputs for the prediction of daily groundwater levels: NARX networks vs support vector machines
.
Environmental Modeling and Assessment
24
,
223
234
.
https://doi.org/10.1007/s10666-018-9639-x
.
Han
H.
,
Choi
C.
,
Jung
J.
&
Kim
H. S.
2021
Deep learning with long short term memory based sequence-to-sequence model for rainfall-runoff simulation
.
Water
13
,
437
.
https://doi.org/10.3390/w13040437
.
Hochreiter
S.
&
Schmidhuber
J.
1997
Long short-term memory
.
Neural Computation
9
,
1735
1780
.
https://doi.org/10.1162/neco.1997.9.8.1735
.
Ismail
A. A.
,
Gunady
M.
,
Bravo
H. C.
&
Feizi
S.
2020
Benchmarking deep learning interpretability in time series predictions. Presented at the 34th Conference on Neural Information Processing Systems, Vancouver, Canada. http://arxiv.org/abs/2010.13924.
Jiang
S.
,
Zheng
Y.
,
Wang
C.
&
Babovic
V.
2022
Uncovering flooding mechanisms across the contiguous United States through interpretive deep learning on representative catchments
.
Water Resources Research
58
.
https://doi.org/10.1029/2021WR030185
.
Kokhlikyan
N.
,
Miglani
V.
,
Martin
M.
,
Wang
E.
,
Alsallakh
B.
,
Reynolds
J.
,
Melnikov
A.
,
Kliushkina
N.
,
Araya
C.
,
Yan
S.
&
Reblitz-Richardson
O.
2020
Captum: A unified and generic model interpretability library for PyTorch. https://doi.org/10.48550/arXiv.2009.07896
.
Kratzert
F.
,
Klotz
D.
,
Brenner
C.
,
Schulz
K.
&
Herrnegger
M.
2018
Rainfall–runoff modelling using long short-term memory (LSTM) networks
.
Hydrology and Earth System Sciences
22
,
6005
6022
.
https://doi.org/10.5194/hess-22-6005-2018
.
Kratzert
F.
,
Herrnegger
M.
,
Klotz
D.
,
Hochreiter
S.
,
Klambauer
G.
,
2019a
NeuralHydrology – Interpreting LSTMs in hydrology
. In:
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, Lecture Notes in Computer Science
(
Samek
W.
,
Montavon
G.
,
Vedaldi
A.
,
Hansen
L. K.
&
Müller
K.-R.
, eds).
Springer International Publishing
,
Cham
, pp.
347
362
.
https://doi.org/10.1007/978-3-030-28954-6_19
.
Kratzert
F.
,
Klotz
D.
,
Shalev
G.
,
Klambauer
G.
,
Hochreiter
S.
&
Nearing
G.
2019b
Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets
.
Hydrology and Earth System Sciences
23
,
5089
5110
.
https://doi.org/10.5194/hess-23-5089-2019
.
Le
X.-H.
,
Nguyen
D.-H.
,
Jung
S.
,
Yeon
M.
&
Lee
G.
2021
Comparison of deep learning techniques for river streamflow forecasting
.
IEEE Access
9
,
71805
71820
.
https://doi.org/10.1109/ACCESS.2021.3077703
.
Lees
T.
,
Buechel
M.
,
Anderson
B.
,
Slater
L.
,
Reece
S.
,
Coxon
G.
&
Dadson
S. J.
2021
Benchmarking data-driven rainfall–runoff models in Great Britain: A comparison of long short-term memory (LSTM)-based models with four lumped conceptual models
.
Hydrology and Earth System Sciences
25
,
5517
5534
.
https://doi.org/10.5194/hess-25-5517-2021
.
Lees
T.
,
Reece
S.
,
Kratzert
F.
,
Klotz
D.
,
Gauch
M.
,
De Bruijn
J.
,
Kumar Sahu
R.
,
Greve
P.
,
Slater
L.
&
Dadson
S. J.
2022
Hydrological concept formation inside long short-term memory (LSTM) networks
.
Hydrology and Earth System Sciences
26
,
3079
3101
.
https://doi.org/10.5194/hess-26-3079-2022
.
Linardatos
P.
,
Papastefanopoulos
V.
&
Kotsiantis
S.
2021
Explainable AI: A review of machine learning interpretability methods
.
Entropy
23
,
18
.
https://doi.org/10.3390/e23010018
.
Lundberg
S. M.
&
Lee
S.-I.
2017
A Unified Approach to Interpreting Model Predictions, in: Advances in Neural Information Processing Systems
.
Curran Associates, Inc, Long Beach, CA
.
Moriasi
D. N.
,
Arnold
J. G.
,
Liew
M. W. V.
,
Bingner
R. L.
,
Harmel
R. D.
&
Veith
T. L.
2007
Model evaluation guidelines for systematic quantification of accuracy in watershed simulations
.
Transactions of the ASABE
50
,
885
900
.
https://doi.org/10.13031/2013.23153
.
Muñoz-Sabater
J.
,
Dutra
E.
,
Agustí-Panareda
A.
,
Albergel
C.
,
Arduini
G.
,
Balsamo
G.
,
Boussetta
S.
,
Choulga
M.
,
Harrigan
S.
,
Hersbach
H.
,
Martens
B.
,
Miralles
D. G.
,
Piles
M.
,
Rodríguez-Fernández
N. J.
,
Zsoter
E.
,
Buontempo
C.
&
Thépaut
J.-N.
2021
ERA5-Land: A state-of-the-art global reanalysis dataset for land applications
.
Earth System Science Data
13
,
4349
4383
.
https://doi.org/10.5194/essd-13-4349-2021
.
Nash
J. E.
&
Sutcliffe
J. V.
1970
River flow forecasting through conceptual models part I – A discussion of principles
.
Journal of Hydrology
10
,
282
290
.
https://doi.org/10.1016/0022-1694(70)90255-6
.
Nielsen
I. E.
,
Dera
D.
,
Rasool
G.
,
Bouaynaya
N.
&
Ramachandran
R. P.
2022
Robust explainability: A tutorial on gradient-based attribution methods for deep neural networks
.
IEEE Signal Processing Magazine
39
,
73
84
.
https://doi.org/10.1109/MSP.2022.3142719
.
Panagos
P.
,
Van Liedekerke
M.
,
Borrelli
P.
,
Köninger
J.
,
Ballabio
C.
,
Orgiazzi
A.
,
Lugato
E.
,
Liakos
L.
,
Hervas
J.
,
Jones
A.
&
Montanarella
L.
2022
European soil data centre 2.0: Soil data and knowledge in support of the EU policies
.
European Journal of Soil Science
73
,
e13315
.
https://doi.org/10.1111/ejss.13315
.
Prasanth Kadiyala
S.
&
Woo
W. L.
2022
Flood prediction and analysis on the relevance of features using explainable artificial intelligence
. In:
2021 2nd Artificial Intelligence and Complex Systems Conference, AICSconf ‘21
.
Association for Computing Machinery
,
New York, NY, USA
, pp.
1
6
.
https://doi.org/10.1145/3516529.3516530
.
Schlegel
U.
,
Arnout
H.
,
El-Assady
M.
,
Oelke
D.
&
Keim
D. A.
2019
Towards a rigorous evaluation of XAI methods on time series. Presented at the International Conference on Computer Vision Workshop (ICCVW), Piscataway, New Jersey. https://doi.org/10.1109/ICCVW.2019.00516
.
Shrikumar
A.
,
Greenside
P.
,
Shcherbina
A.
&
Kundaje
A.
,
2017
Not just a black box: Learning important features through propagating activation differences. arXiv:1605.01713 [cs]
.
Simonyan
K.
,
Vedaldi
A.
&
Zisserman
A.
2014
Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034 [cs]
.
Sturmfels
P.
,
Lundberg
S.
&
Lee
S.-I.
2020
Visualizing the impact of feature attribution baselines
.
Distill
5
,
e22
.
https://doi.org/10.23915/distill.00022
.
Sundararajan
M.
,
Taly
A.
&
Yan
Q.
2017
Axiomatic attribution for deep networks. https://doi.org/10.48550/arXiv.1703.01365
.
Wendland
F.
,
Bogena
H.
,
Goemann
H.
,
Hake
J. F.
,
Kreins
P.
&
Kunkel
R.
2005
Impact of nitrogen reduction measures on the nitrogen loads of the river Ems and Rhine (Germany)
.
Physics and Chemistry of the Earth, Parts A/B/C
30
,
527
541
.
https://doi.org/10.1016/j.pce.2005.07.007
.
Xu
W.
,
Luo
X.
,
Ren
Y.
,
Park
J. H.
,
Yoo
S.
&
Nadiga
B. T.
2021
Feature importance in a deep learning climate emulator
. In:
AI: Modeling Oceans and Climate Change Workshop. Presented at the International Conference on Learning Representations
. .
Zeng
R.
,
Cai
X.
,
Ringler
C.
&
Zhu
T.
2017
Hydropower versus irrigation – An analysis of global patterns
.
Environmental Research Letters
12
,
034006
.
https://doi.org/10.1088/1748-9326/aa5f3f
.
Zhao
W. L.
,
Gentine
P.
,
Reichstein
M.
,
Zhang
Y.
,
Zhou
S.
,
Wen
Y.
,
Lin
C.
,
Li
X.
&
Qiu
G. Y.
2019
Physics-constrained machine learning of evapotranspiration
.
Geophysical Research Letters
46
,
14496
14507
.
https://doi.org/10.1029/2019GL085291
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).

Supplementary data