ABSTRACT
Machine learning algorithms are increasingly applied in hydrological studies with promising results. However, these algorithms generally lack the ability for easy interpretability of the results by users. In this study, we compare six different explainable artificial intelligence (XAI) algorithms that help understand the effect of input data on the simulation results. The methods are explored on two distinct approaches for streamflow modeling using a long short-term memory (LSTM) model: a single model approach using only meteorological forcing data and a regional approach including also static catchment attributes. To gain further insight into the internal dynamics of the LSTM models, the relationship between cell states and soil moisture is investigated. A strong correlation suggests that the LSTM models inherently capture the concept of soil moisture as a catchment-scale storage mechanism. The XAI methods are applied to derive a timestep of influence, revealing how many days of input data are relevant for the model output. All XAI methods result in similar seasonal patterns in the timestep of influence, suggesting that the methods are comparable. Setting soil moisture dynamics in context to seasonal development of the timestep of influence suggests resetting LSTM as soon as soil moisture saturation occurs.
HIGHLIGHTS
Two long short-term memory (LSTM) models were trained for streamflow modeling.
Internal cell states of LSTMs correspond well with ERA5-Land soil moisture dynamics.
Six explainable artificial intelligence algorithms show similar results.
LSTM resets as soon as soil moisture saturation is reached.
INTRODUCTION
Like other scientific disciplines, hydrology witnessed numerous applications of machine learning (ML) and deep learning models in recent years. For example, an artificial neural network has been used for the prediction of groundwater levels (Guzman et al. 2019), a physics-constrained recurrent neural network has been used to predict evapotranspiration (Zhao et al. 2019), and a support vector machine has been applied to analyze the competitive relationship between reservoir operation decisions for hydroelectricity production and irrigation (Zeng et al. 2017).
Long short-term memory (LSTM) models have shown exceptional results in predicting sequential data such as streamflow modeling (Le et al. 2021). Also, they have repeatedly proven their potential over traditional hydrological models (e.g. Kratzert et al. 2019b; Gauch et al. 2021; Lees et al. 2021; Frame et al. 2022). ML models mimic the physically relevant runoff processes from historical data by establishing functional relationships between input and output data (Duan et al. 2020). The inherent complexity and nonlinear nature of such models enable them to learn abstract patterns but lead to difficulties in explaining their predictions. It can be concluded that there is an inverse relationship between explainability in a system and its predictive accuracy, and the link between ML model's internal weights and the physical environment of the problem is extremely hard to isolate (Angelov et al. 2021). However, understanding the features that lead to a particular output builds trust with users (Shrikumar et al. 2017). The need to open the ‘black box’ has been expressed not only in hydrological research but also in almost all application areas.
Initial attempts have been made to connect the internal state vector within the LSTM cells to our human concept of the hydrological cycle. Kratzert et al. (2019a) related the cell state to soil water storage and snow depth from the Sacramento/Snow-17 hydrological model in two exemplary catchments. Lees et al. (2022) were able to fit a linear probe to model ERA5-Land soil water storage from the cell states.
In addition, explainable artificial intelligence (XAI) methods have been applied to ML models in hydrology. XAI is used to provide interpretations of the model internals, and therefore results can be better understood by the user (Althoff et al. 2021a). Expected Gradients were used to distinguish three flood-inducing mechanisms (snowmelt, recent rainfall, and historical rainfall) within the contiguous United States (Jiang et al. 2022). Other applied methods in streamflow modeling include attention mechanism and perturbation methods to obtain information about the importance of input features (Ding et al. 2020; Anderson & Radić 2022; Prasanth Kadiyala & Woo 2022). Kratzert et al. (2019a) and Althoff et al. (2021b) used the Integrated Gradient (IG) method to distinguish the ‘timestep of influence’ (TSOI), defined as the amount of previous days relevant for the result. However, the sheer amount of available XAI algorithms makes it hard to compare results of different studies as many authors only used one or a maximum of two different XAI algorithms on their models. A systematic comparison of such algorithms is missing.
Therefore, in this study we aim to (i) compare the results of multiple XAI methods, evaluated as the TSOI, to increase their comparability in the context of streamflow modeling and (ii) link the resulting TSOI to the physically relevant processes in a catchment exemplarily. We establish two different LSTM models for the purpose of streamflow modeling using a regional and a single-catchment approach. Then six different XAI algorithms are applied and evaluated using an adapted version of the TSOI to uncover the temporal dynamics of the input feature attribution. The TSOI is then set into context with soil moisture dynamics in the catchment to investigate its hydrological significance.
STUDY AREA AND DATA
Study area
Data
Daily streamflow data for the three streamflow gauges (Figure 1) were acquired from the Global Runoff Data Center (GRDC) for the period 1969–2016 (BFG 2022).
All available station data within the Ems catchment area were acquired from the German Weather Service (DWD 2022) to generate the hydroclimatic input. Not all stations provide a continuous time series throughout the whole period of 1969–2016. Data gaps were not filled. Potential biases due to missing station data and the orographic effects on climate data can be neglected due to the homogeneity and low elevation differences in all sub-catchments (Ley et al. 2023). Information on soil properties was taken from the European Soil Database v2.0 to calculate static attributes for the regional LSTM (Panagos et al. 2022). Land use information was derived from CORINE Land Cover (European Environment Agency et al. 2018).
CAMELS-GB dataset
Large standardized datasets are required to successfully train a regional LSTM model (e.g. Kratzert et al. 2019b; Althoff et al. 2021b; Lees et al. 2021). The idea behind the CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) initiative is to provide such datasets. Datasets include a wide range of hydrometeorological drivers and static catchment attributes such as elevation, land use, and soil conditions. To this point, no dataset exists for Germany. Assuming similar climate conditions, we chose to use the CAMELS-GB dataset for training, which includes 671 catchments across Great Britain (Coxon et al. 2020).
ERA5-Land reanalysis
ERA5-Land reanalysis combines model data with observations into a globally complete and consistent dataset of hourly and monthly estimates for hydroclimatic state variables (Muñoz-Sabater et al. 2021). The spatial resolution of the dataset is 0.1°, which equals around 9 km. To assess the dynamics of catchment water storage, we use results for the ‘volumetric soil water layer 3’, which describes water content within the soil column in a depth of 0.28–1 m. In a preliminary analysis, soil moisture dynamics in this depth showed the highest correlation to the internal cell state of the LSTM models.
METHODS
General study design
Two separate LSTM models were trained for streamflow modeling for the low-land Ems catchment in Germany. One model only used catchment-specific hydrometeorological forcing data to provide streamflow estimations at the outlet gauge. A regional approach was applied to train the second LSTM model. Ems catchment data were harmonized and added to a large sample dataset (CAMELS-GB) including hydrometeorological forcing and static attributes such as land use and soil characteristics. The LSTM is then trained using the large dataset.
Afterward, six different XAI methods were applied to both successfully trained models to quantify the feature importance of the meteorological inputs at the upstream sub-catchment (Figure 1). The results were evaluated using an adapted version of the TSOI to distinguish seasonal patterns of input feature importance. Second, the internal cell states of the LSTM were compared to ERA5-Land soil moisture values to reveal the potential derivation of soil moisture dynamics from the correlations between the input data and the target variable streamflow as proposed by Kratzert et al. (2019a) and Lees et al. (2022). The TSOI dynamics resulting from the XAI methods are then set into context to soil moisture variations to distinguish a potential physical relevance of the TSOI.
LSTM training design
LSTM models, in particular, have proven themselves useful for time series modeling (Hochreiter & Schmidhuber 1997). Within the LSTM, two cell states continuously process information from one timestep to the other. The cell state can be interpreted as the long-term memory of the cell, whereas a hidden state processes short-term information. A more detailed description of the LSTM can be found in Ley et al. (2023). The cell state makes LSTM capable of capturing persistent dependencies, like soil moisture and snow accumulation, which makes them suitable to model streamflow (Kratzert et al. 2018).
For the ‘single model’ approach, only hydroclimatic forcing data are used as input for the LSTM during the training process. Applying this approach only allows one to apply the trained model for the particular catchment it is trained for (e.g. Han et al. 2021; Ley et al. 2023). In addition, an LSTM can be trained with a dataset containing multiple catchments. Next to hydroclimatic data, these data must include static catchment characteristics (e.g. soil properties, elevation, catchment size, land use). Regional models resulted in improved performance indices and increased transferability to ungauged catchments in model intercomparison (Kratzert et al. 2019b). Detailed information about forcing data and hyperparameters for both model setups is provided in the Supplementary Material A1. As no large standardized catchment dataset exists for Germany, we use the CAMELS-GB dataset where climate conditions can be considered similar. Fifty random catchments were selected from the dataset due to storage limitations, and additionally, 3 of the 10 largest catchments were included randomly because the Ems catchment is relatively large. Then, we prepared input data for three sub-catchments of the Ems to match the CAMELS-GB dataset (Figure 1). The data were passed as one combined input dataset to train the regional LSTM. Input data were separated into three different periods: 2000–2009 for training, 2010–2016 for validation, and 1970–1999 for testing. Hyperparameter search was carried out using a small grid search, but we did not want to focus on the overall model performance and instead compared the results of XAI methods on both approaches.
Explainable artificial intelligence
The model complexity of ML applications makes interpretation of the results difficult. Therefore, XAI has received increasing scientific attention over recent years (Linardatos et al. 2021). XAI includes methods that interpret and explain ML model results.
The XAI methods applied in this study can be separated into two groups. First, gradient-based methods analyze the gradient of the cost function of LSTM with respect to the input feature. Gradients are analogous to model coefficients and describe the contribution of each input feature to the output (Althoff et al. 2021b). LSTMs are completely differentiable. Therefore, the attribution of each individual input feature can be calculated. Second, perturbation-based methods directly compute the attribution of an input feature by removing, masking, or altering them and measuring the difference with the original output (Ancona et al. 2018).
The XAI algorithms provide an attribution for each individual input. The input features with a positive attribution increase the output values and the ones with a negative value decrease the output (Xu et al. 2021). The larger the attribution, the higher its importance for the result. The following XAI algorithms were tested on both models (single and regional approach). A more detailed description of all algorithms can be found in Kokhlikyan et al. (2020).
Gradient-based methods
Saliency
Using gradients to create the so-called saliency (Sal) maps was first proposed by Simonyan et al. (2014). Sal maps are used extensively to analyze the importance of input features to model predictions and are mostly used for application in vision and language tasks with their application to time series data being relatively unexplored (Ismail et al. 2020). Sal calculates attributions by taking the absolute value of the partial derivative of the target output with respect to the input feature.
InputXGradient
InputXGradient (IXG) is basically an extension of the Sal algorithm. With IXG, the attribution scores are calculated by elementwise multiplication of gradients with the actual input (Nielsen et al. 2022). The elementwise multiplication can be considered an application of a model-independent filter (to the input), which may reduce noise and smoothen the attribution maps (Ancona et al. 2019).
Integrated Gradient
IG was introduced by Sundararajan et al. (2017). IG computes the partial derivatives of the output with respect to each input feature. While Sal and IXG compute a single derivative, evaluated at the provided input, IG computes the average gradient while the input varies along a linear path from a given baseline (Ancona et al. 2018). From the baseline to the actual input, the gradient is integrated along an equally split path. The baseline was set to zeros.
GradientSHAP
GradientSHAP (GS) approximates SHAP (SHapley Additive exPlanations) values by computing the expectations of gradients by randomly sampling from baselines (Lundberg & Lee 2017). SHAP values are another way to explain the output of an ML model by using a game theory approach. It adds white noise to each input sample, selects a random baseline and a random point along the path between the baseline and the input, and computes the gradient of outputs with respect to those selected random points. In a sense, it can be viewed as an approximation of IGs by computing the expectations of gradients for different baselines (Kokhlikyan et al. 2020). The baseline was set to zeros.
Perturbation-based methods
Feature ablation
Feature ablation (FA) is a perturbation-based approach to compute the attribution by replacing each input feature with a given baseline and computing the difference in output. Each value is taken as a feature and replaced independently. The calculated difference signifies the attribution for the permuted feature (Kokhlikyan et al. 2020).
Feature permutation
Feature permutation (FP) takes each input feature, permutes the feature values within a batch, and computes the difference between the original and shuffled outputs for the given batch. This difference signifies the feature importance of the permuted feature (Kokhlikyan et al. 2020).
Timestep of influence
RESULTS
Model training results
Soil moisture and cell state dynamics
The cell state with the highest correlation of each model was further evaluated (Figure 4(b)). The cell states of cell numbers 8 (single model) and 65 (regional model) were sorted. Then, the months with the 20 lowest and highest ERA5-Land soil moisture levels were marked to check whether extreme values were properly reproduced within the cell states. The months with the lowest ERA5-Land soil moisture show a good match with the lowest cell states within both models. All 20 values are located within the lowest 75 cell states. The distribution of the months with high soil moisture is wider. However, most of the data points are well located within the 80 highest cell states.
Timestep of influence
The results for the regional model in Figure 5(b) show a comparable seasonal dynamic to the single model but differ in some details. Next to a minimum around day 60 of the year (late February/early March), the methods Sal, IXG, FA, and FP show a second local minimum around the day 210 of the year (end of July). The minimum is most pronounced using the IXG method. Instead, this effect is not visible for IG and GS. The range of values is smaller than in the single model. The maximum TSOI values stay below 200 days and the minima tend to be higher with TSOI values around 70 days (Sal, IXG, FA, FP) and 50 days (IG, GS), respectively.
In the wet year 1998 (Figure 6(b)), precipitation is above average in early spring and autumn. As expected, the observed precipitation results in higher soil moisture in spring and autumn. In August, still a very low soil moisture of 0.15 is modeled by ERA5-Land, but already the values reached up to 0.35 in October. Cell state values follow the development of soil moisture values subsequently. The TSOI time series shows a different development compared to the dry year. In both models, the TSOI is very high at the beginning of the year and drops to the minimum when the first soil moisture peak in spring is reached. The TSOI then increases again until the second soil moisture peak is reached. Even though the TSOI at this point is still relatively low with values around 100 days, it drops down to around 40 days.
DISCUSSION
Model performance
Two approaches were used to build up an LSTM model for the Ems catchment. The NSE values of 0.88 for both models can be considered very good according to Moriasi et al. (2007). Regarding logNSE and KGE, the single model has superior performance compared to the regional model. This is despite the fact that regional models have shown higher model quality in the past (Kratzert et al. 2019b). This can be explained by the homogeneity of the catchment properties and by the limited effort, which has been put into model training. A satisfactory level was reached quickly, and a model comparison was beyond the scope of this study. The use of a larger subset of the CAMELS-GB dataset or the availability of a German dataset would probably increase the quality of the regional model results and improve the spatial transferability of the trained model. However, both models show good quality in all streamflow ranges and are applicable for further analysis.
Cell state and soil moisture dynamics
Baseflow, driven by soil moisture, is expected to be the main runoff component in a sandy low-land catchment such as the Ems for most of the year (Bormann 2010). To evaluate whether the internal cell state of the LSTM reflects the hydrological state variables, we used ERA5-Land reanalysis data to compare cell states and soil moisture. It must be taken into consideration that ERA5-Land soil moisture data comes associated with uncertainties and that LSTM cell states do not necessarily have to map the ERA5-Land data one-to-one. The LSTMs are not constrained to mimic soil moisture dynamics in a single cell state vector. Instead, the signal could be split up into multiple vectors (Lees et al. 2022). Also, the ERA5-Land data use fixed depth levels, which most probably are not equally expressed in the cell state.
The proven capability of LSTM models to map soil moisture dynamics according to our understanding of the hydrological system allows us to expect an accurate representation of other relevant catchment processes within the LSTM internals. Potentially, evapotranspiration and nonlinear routing dynamics can be derived for the catchment. Due to the flexibility of the LSTM model, those processes might be represented more precisely than they are in conceptual models with a more rigid structure.
Explainable artificial intelligence
According to Ismail et al. (2020), gradient-based and perturbation-based methods failed to provide high-quality interpretations for multivariate time series classification and Schlegel et al. (2019) concluded that more suitable XAI methods for time series should be introduced to achieve a better human interpretable understanding. However, six different XAI methods were applied in this study to compare their results using the TSOI. The application of the gradient-based XAI algorithms in LSTM models requires consideration of the following aspect. The use of gradient-based approaches can be problematic because sigmoid and hyperbolic tangent activations have a near-zero gradient at high or low inputs even though these inputs are significant for the outcome (Shrikumar et al. 2017). Accordingly, it is possible that the relevance of extreme values is neglected by the methods. In particular, this can be expected for precipitation because extreme precipitation events are many times greater than normal events. Also, two of the applied XAI algorithms (IG and GS) require a baseline from which the attribution is calculated stepwise while being shifted to the actual input values. The baseline is a hyperparameter that requires careful selection because if the target output is too close to the baseline, the importance of the feature will not be recognized (Sturmfels et al. 2020). In the hydrological context, it is difficult to choose a baseline that contains no information. We chose a sequence of zeros as the baseline. However, the resulting seasonal TSOI dynamics show a similar behavior for all XAI methods. The TSOI dynamics appear to be independent of the applied XAI method and the model type on the seasonal timescale. Nevertheless, differences exist and will be subject to future studies.
We adopted the TSOI method following the recommendation of Althoff et al. (2021b). Using a percentage of the accumulated sum instead of a fixed value as a threshold has the advantage that the generally higher attribution in the winter months is not considered. The seasonal pattern of the TSOI differs from previous studies (e.g. Kratzert et al. 2019a; Althoff et al. 2021b). Differences may be due to the adapted TSOI methodology but also on different catchment characteristics and runoff generation mechanisms (baseflow in this study).
However, the application of XAI methods in time series modeling is still controversial in the literature. The results of the exceptionally dry and wet years in Figure 6 reveal that a direct correlation between TSOI and soil moisture or cell state is not possible. Instead, it appears that the TSOI drops down to a baseline value of about 40 days once the peak in soil moisture is reached. This hypothesis is supported by the results in Figure 7. The timing of soil moisture peaks corresponds well with the steep decline of the TSOI. In terms of the hydrological concept, this can be interpreted as a deletion of previous information when saturation in the soil storage is reached. Further in-depth evaluation of the relationship between TSOI and soil moisture dynamics in various catchments is necessary to confirm the hypothesis. The CAMELS-GB dataset used in this study already includes catchments covering a range of hydroclimatic conditions, in particular, regarding elevation, precipitation, and mean streamflow (Coxon et al. 2020). Regional LSTM approaches show good results in different climatic regions and ungauged catchments (e.g. Kratzert et al. 2019b; Arsenault et al. 2023). Therefore, transferability of the results presented in this study can be expected.
CONCLUSION
The comparison with ERA5-Land soil moisture dynamics provides further evidence that LSTMs are capable of deriving soil moisture dynamics from the functional relationship of climate forcing and output. The application of six different XAI algorithms in combination with the TSOI showed that the methods lead to comparable results in terms of seasonal dynamics. For future evaluations, it is therefore not necessary to use an ensemble of the XAI applied in this study. Results from publications are transferable despite the use of a different subset of methods. The combination of the TSOI signal and soil moisture dynamics further supports the idea of a physically realistic representation of the important storage effects by the LSTM, as represented in conceptual models in a similar way. Because the rapid decrease in TSOI corresponds well with soil moisture peaks in spring, we hypothesize that the TSOI is neutralized after the system has passed through a state of saturation. Storage level depletion is accompanied by an increase in the TSOI. However, this opens an interesting field of study in the future to test this statement in catchments with different climatic conditions and streamflow regimes.
AUTHOR CONTRIBUTIONS
AL contributed to conceptualization, methodology, data curation, formal analysis, and writing the original draft. HB contributed to conceptualization, supervision, reviewing and editing the writing, and funding acquisition. MC contributed to conceptualization, supervision, and reviewing and editing the writing.
FUNDING
This research was funded by the German Federal Ministry of Education and Research (BMBF) (grant numbers: 01LR2003D and 01LR2003D1) within the project WAKOS – Wasser an den Küsten Ostfrieslands.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.