Untangling hybrid hydrological models with explainable artificial intelligence

Hydrological models are valuable tools for developing streamflow predictions in unmonitored catchments and for increasing our understanding of hydrological processes. Recent efforts have been made to develop hybrid (conceptual/machine learning) models that preserve some of the hydrological processes represented by conceptual models while improving streamflow predictions. However, these studies have not explored how the data-driven component of hybrid models resolves runoff routing. In this study, explainable artificial intelligence (XAI) techniques are used to turn a 'black-box' model into a 'glass-box' model. The hybrid models reduced the root-mean-square error of the simulated streamflow values by approximately 27, 50, and 24% for stations 17120000, 27380000, and 33680000, respectively, relative to the traditional method. XAI techniques helped unveil the importance of accounting for soil moisture in hydrological models. Unlike purely data-driven hydrological models, the inclusion of the production storage in the proposed hybrid model, which is responsible for estimating the water balance, reduced the short- and long-term dependencies of input variables for streamflow prediction. In addition, soil moisture controlled water percolation, which was the main predictor of streamflow. This is because soil moisture controls the underlying mechanisms of groundwater flow into river streams.


INTRODUCTION
Water scarcities, climate variabilities, and climate extremes have been a great concern in the planning and management of water resources in recent decades (Ni et al. 2020; Wyatt et al. 2020). Increasing water demands constantly threaten the planet's security and sustainability. In this scenario, the knowledge of streamflow in watersheds is essential to determining the availability of water over time. However, historical data alone have a limited value in supporting decision-making (Humphrey et al. 2016), and restrictions on gauge data in terms of density and frequency can affect the efficient management of water resources.
Alternatively, hydrological models enable accurate and reliable streamflow estimation and forecasting and play a crucial role in the proper management and allocation of water resources. Streamflow is a response to the physical interactions among climate, geology, topography, and land use in a watershed, but it is mainly a final derivation of rainfall (Poff et al. 1997). Thus, to estimate streamflow, models can use meteorological forcings along with physical and/or climatic parameters of watersheds, as well as historical data, as guides to solve empirical relationships.
Hydrological modeling is a challenging task due to the nonlinear and nonstationary nature of streamflow, especially under extreme events (high- and low-flow conditions). To obtain reliable estimates, considerable efforts have been made to develop and apply different methods to improve the accuracy of such models. Typically, hydrological predictions are generated using models that consider a single meteorological forcing or multiple forcings for simulating historical climatic scenarios and those of the present or the next few days, weeks, months, seasons, or even years (Sabzipour et al. 2020).
The models are based on methods that utilize various concepts, equations, parameters, and structures to describe the rainfall-runoff relationship (Tian et al. 2013). These models can be classified into two types. The first type includes process-based models in which a hydrological model is established to calculate the catchment response to rainfall based on empirical equations (conceptual) or physical processes (Ni et al. 2020). These require the calibration of parameters to estimate, for example, evapotranspiration, infiltration, percolation, surface runoff, and other processes occurring in a basin (Kumanlioglu & Fistikoglu 2019). Regarding their disadvantages, these models have limitations due to the physical conditioning of parameters, high computational power requirements for spatially explicit operations, or the use of purely conceptual parameters that are not related to the hydrological properties of the watersheds. The second type includes data-driven models, which are based solely on observations. The modeling process becomes simpler and easier to implement, as it is based on observations/history and does not require information about physical processes (Liu et al. 2015). An advantage of these models, for example, machine learning models, compared to the more traditional approaches, is their flexible structure, which allows them to capture arbitrarily nonlinear and complex input-output relationships from data without any restrictive assumptions about the functional form of the underlying process (Humphrey et al. 2016). However, despite their appeal, the performance of machine learning models, as well as that of different statistical models, is highly dependent on the availability and quality of the observed data, and these models do not provide any information regarding hydrological processes, such as percolation and soil moisture.
In recent years, great effort has been made by scientists to combine machine learning hydrological models with conceptual models in which some hydrological processes are empirically described (Robertson et al. 2013; Humphrey et al. 2016; Ren et al. 2018; Kumanlioglu & Fistikoglu 2019). For instance, Ren et al. (2018) combined the HBV (Hydrologiska Byråns Vattenbalansavdelning) (Bergström 1992) model with artificial Bayesian neural networks, which enabled a better assessment of the uncertainties of hydrological predictions simulated on a monthly scale. Humphrey et al. (2016) combined artificial Bayesian neural networks with the intermediate products of the GR4J (Génie Rural à 4 paramètres Journalier) (Perrin et al. 2003) model to resolve runoff routing and estimate streamflow on a daily scale. Kumanlioglu & Fistikoglu (2019) also combined neural networks with outputs from the soil moisture accounting storage of the GR4J model to resolve runoff routing. These studies revealed the great potential of hybrid models in improving hydrological prediction. They did not, however, explore how the data-driven component of the models resolved the runoff routing processes and instead simply accepted them as 'black-box' models.
A subdomain of machine learning, explainable artificial intelligence (XAI), has recently received significant attention for helping its users to better understand how their 'black-box' models operate (Maksymiuk et al. 2020). The use of XAI techniques can extend the interpretability of machine learning models; therefore, the results can be better understood by humans. Thus, in this study, the objectives were (i) to assess the performance improvement in streamflow simulation resulting from using a hybrid (process-based/data-driven) hydrological model over a conceptual model and (ii) to explore a range of XAI techniques to uncover how runoff routing is being resolved and turn 'black-box' models into 'glass-box' models.

Study areas
The study areas consist of catchments corresponding to three gauge stations within the Brazilian Cerrado biome (Figure 1). The catchment areas of the corresponding gauge stations were delimited using the TauDEM toolbox (Tarboton 2008) and a Shuttle Radar Topography Mission (SRTM) digital elevation model (DEM) with a 90-m spatial resolution (Jarvis et al. 2008). The gauge stations are part of the Brazilian National Hydrometeorological Network (in Portuguese: Rede Hidrometeorológica Nacional), and their corresponding ID codes are 17120000, 27380000, and 33680000.
As part of the Brazilian Cerrado biome, these tropical watersheds are characterized by a high energy input throughout the year and seasonal rainfall. The catchment areas of stations 17120000 and 27380000 present climates that, according to Köppen's climate classification (Köppen 1936; Alvares et al. 2013), can be classified predominantly as tropical savannah climates with dry winters (Aw). Conversely, the catchment area of station 33680000 has portions characterized by a tropical savannah climate with a dry winter, a tropical savannah climate with a dry summer (As), and a semiarid climate (BSh). The land-use and land-cover classes of the catchment areas are described in Supplementary Material Section S1.

Hydrological models
In this section, we describe a 'traditional' conceptual rainfall-runoff model, the GR4J (Perrin et al. 2003), and the hybrid rainfall-runoff model used in this study. GR4J is a daily lumped conceptual hydrological model (Figure 2(a)) based on four free parameters (x1, x2, x3, and x4) and is used in this study as a benchmark for the hybrid model to improve upon. The hybrid model, termed GR-Hyb here, consists of a process-based component, which shares a free parameter (x1) with the GR4J model, and a data-driven-based component, which simulates streamflow using output from the process-based component (Figure 2(b)).

The GR4J hydrological model
As a lumped conceptual model, GR4J uses empirical relationships to model hydrological processes aggregated to the catchment scale (Figure 2(a)). In terms of meteorological forcings, it only requires as input the catchment average potential evaporation (E) and precipitation (P). The conceptual structure of the GR4J model consists of two storages: the production storage (S) and the routing storage (R). In Figure 2, x1 represents the size of the production storage (mm), x2 controls the flux to groundwater (mm), x3 represents the size of the routing storage (mm), and x4 controls the time base of the UH1 and UH2 unit hydrographs (days). After precipitation interception, if any, the current level of the production storage (S) regulates the fraction of precipitation that fills the production storage (Ps), water evaporating from the storage (actual evaporation, Es), and water percolating from the storage (Perc). As it represents the soil water balance, the production storage is also referred to as the soil moisture accounting storage (Pushpalatha et al. 2011). The sum of the precipitation that does not fill the production storage and percolation (Pr) is divided into two flow components according to a fixed split. Of Pr, 90% is routed by a unit hydrograph (UH1) to the nonlinear routing storage, while the remaining 10% is routed by a unit hydrograph (UH2) to streamflow as direct flow (Qd) after exchange with groundwater. Groundwater exchange is a function of x2 (f(x2)). A negative x2 value represents water infiltrating to the aquifer, while a positive value represents water exiting the aquifer and contributing to the routing storage and direct flow. The routing storage level (R) governs outflow (Qr) and is able to simulate long streamflow recessions (Perrin et al. 2003).
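The production-store step described above can be sketched in Python following the published GR4J equations (Perrin et al. 2003). This is an illustrative sketch, not the authors' implementation; the function name and return layout are our own.

```python
import math

def production_store(p, e, s, x1):
    """One daily step of the GR4J production (soil moisture) store.

    p, e : catchment precipitation and potential evaporation (mm)
    s    : current store level (mm), 0 <= s <= x1
    x1   : store capacity (mm), the single free parameter shared with GR-Hyb
    Returns (pn, ps, es, perc, pr, s_new).
    """
    if p >= e:
        pn, en = p - e, 0.0                      # net precipitation
        ws = math.tanh(pn / x1)
        ps = x1 * (1 - (s / x1) ** 2) * ws / (1 + (s / x1) * ws)  # fraction filling the store
        es = 0.0
    else:
        pn, en = 0.0, e - p                      # net evaporation demand
        ws = math.tanh(en / x1)
        es = s * (2 - s / x1) * ws / (1 + (1 - s / x1) * ws)      # actual evaporation
        ps = 0.0
    s = s - es + ps
    perc = s * (1 - (1 + (4 * s / (9 * x1)) ** 4) ** -0.25)       # percolation leakage
    s -= perc
    pr = perc + (pn - ps)                        # water reaching the routing part
    return pn, ps, es, perc, pr, s
```

In the hybrid model, the outputs Pn, Es, and Perc of this step become the inputs of the data-driven routing component.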

The hybrid hydrological model
The GR-Hyb model presents two components: a process-based component and a data-driven component. The process-based component of GR-Hyb is identical to the soil moisture accounting component of GR4J and consists of a single free parameter (x1). The data-driven component, on the other hand, is used to replace the routing component of GR4J. Thus, the hybrid model is able to simulate the soil water balance at the catchment scale and uses variables derived from this balance as input to the data-driven model to resolve routing. The reasoning behind combining process-based and data-driven components is that streamflow is generally considered an outcome of superimposed hydrological processes, for example, superimposing runoff contributions from distinct precipitation events. In addition, antecedent soil moisture is an important factor considered when calculating the event-based runoff coefficient (Song & Wang 2019). For these conditions to be captured by a machine learning model, long-time series are generally required as input; for example, Kratzert et al. (2019) showed that streamflow predictions may be influenced by over 100 previous timesteps.
To account for the soil water balance at the catchment scale, machine learning models are provided with more information on the antecedent conditions; therefore, the data-driven component has a better ability to simulate streamflow. The main differences between the GR-Hyb model presented here and the one presented by Kumanlioglu & Fistikoglu (2019) are with respect to the data-driven component and the parameter optimization techniques. In this study, cubist regression (Quinlan 1992, 1993) is used instead of a neural network to model streamflow. Cubist regression is an ensemble-like model based on the concept of regression trees (Breiman et al. 1984). Cubist regression builds rule-based regression trees in which the split criterion tests which of the predictor variables reaching a node can minimize the intrasubset standard deviation of the target values after the split. It adopts a boosting technique that helps adjust the predictions for observations that were poorly predicted in previous trees. Its regression trees are, however, different from traditional trees. Instead of discrete predictions, the final node of a tree presents linear models built considering the variables used in previous splits of the tree. The final prediction is smoothed by considering predictions made by the linear models built recursively up the tree (in previous splits). Cubist regression is chosen in this study for two reasons. First, it allowed other authors to achieve state-of-the-art performance within a range of studies (Noi et al. 2017; Worland et al. 2018; Filgueiras et al. 2019; Althoff et al. 2020b), and second, it provided a good trade-off between performance and computational time (Althoff et al. 2020b), which was an important aspect when considering the optimization task. For more information on cubist regressions, refer to Quinlan (1993).
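The two ideas that distinguish cubist from a classic regression tree — splitting to minimize intrasubset standard deviation, and fitting a linear model per leaf instead of a constant — can be illustrated with a stripped-down, single-split sketch in NumPy. This is not the cubist algorithm (no rules, boosting, or smoothing), only the split criterion and leaf linear models; all function names are our own.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on a single predictor that minimizes the
    size-weighted intra-subset standard deviation of the target
    (the cubist split criterion, reduced to one variable, one split)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_sd = None, np.inf
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sd = (len(left) * left.std() + len(right) * right.std()) / len(ys)
        if sd < best_sd:
            best_sd, best_t = sd, (xs[i - 1] + xs[i]) / 2
    return best_t

def fit_leaf_models(x, y, t):
    """Fit a separate linear model in each leaf -- cubist's key difference
    from classic regression trees, which store a single constant per leaf."""
    models = {}
    for name, mask in (("lo", x <= t), ("hi", x > t)):
        models[name] = np.polyfit(x[mask], y[mask], 1)  # slope, intercept
    return models

def predict(x_new, t, models):
    coef = models["lo"] if x_new <= t else models["hi"]
    return np.polyval(coef, x_new)
```

On piecewise-linear data, the split lands near the kink and each leaf recovers its local linear relationship, which is the behavior that makes cubist effective for the nonlinear but locally smooth rainfall-runoff relationship.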
For parameter optimization, the particle swarm optimization technique, developed by Kennedy & Eberhart (1995) to mimic a simplified social model (bird flocking, fish schooling, and swarming theory), was used instead of a genetic algorithm to optimize x1. Both optimization techniques have performed well in a range of previous studies. However, Panda & Padhy (2008) argued that from an evolutionary point of view, the performance of the particle swarm optimization technique is better than that of the genetic algorithm. Optimizing a single parameter seems to be a simple task. However, the cubist model calibration is independent of physical meanings in the outputs of the process-based component; therefore, calibration on highly varied parameter values can introduce equifinality. In this case, the particle swarm optimization technique was best suited to avoid becoming trapped in local optima and to find the parameter value resulting in the best outcome. The optimization algorithm was implemented using the model-independent hydroPSO software (Zambrano-Bigiarini & Rojas 2013).
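A minimal one-dimensional particle swarm optimizer, in the spirit of Kennedy & Eberhart (1995), can be sketched as follows. This is an illustrative toy, not the hydroPSO implementation used in the study; the function name and default hyperparameters (inertia w, acceleration coefficients c1 and c2) are our own choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def pso_minimize(loss, lo, hi, n_particles=20, n_iter=60, w=0.7, c1=1.5, c2=1.5):
    """Minimal 1-D particle swarm optimizer: each particle is pulled toward
    its own best position (c1) and the swarm's best position (c2), with
    inertia w damping its previous velocity."""
    x = rng.uniform(lo, hi, n_particles)        # positions (e.g., candidate x1 values)
    v = np.zeros(n_particles)                   # velocities
    pbest = x.copy()                            # each particle's best position so far
    pbest_f = np.array([loss(xi) for xi in x])
    g = pbest[np.argmin(pbest_f)]               # swarm (global) best position
    for _ in range(n_iter):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)              # keep particles inside the bounds
        f = np.array([loss(xi) for xi in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[np.argmin(pbest_f)]
    return g
```

In the GR-Hyb setting, `loss` would wrap a full calibration run: simulate the production store for a candidate x1, fit the cubist model on its outputs, and return 1 − KGE on the calibration period.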
By accounting for soil moisture, the model decreases its dependency on inputs from previous timesteps, as required by purely data-driven hydrological models (Kratzert et al. 2019; Kumanlioglu & Fistikoglu 2019). Nevertheless, conceptual models usually depend on previous timesteps. For example, in GR4J, the time base of the unit hydrograph UH2 (2 × x4) results in fluxes exceeding a few consecutive days. Direct flow is then obtained by superimposing the fluxes from individual events. In Perrin et al. (2003), x4 ranged from 1.1 to 2.9 (80% confidence interval) for a large sample of catchments. Thus, two versions of the hybrid model were assessed in this study. The first model, denoted GR-Hyb-1, considered only inputs from the respective day for the machine learning model, i.e., to predict Q_t, only Perc_t, Es_t, and Pn_t are considered. The second model, denoted GR-Hyb-2, additionally considered inputs from previous days up to three timesteps, i.e., Perc_t, Perc_t−1, Perc_t−2, Perc_t−3, Es_t, Es_t−1, Es_t−2, Es_t−3, Pn_t, Pn_t−1, Pn_t−2, and Pn_t−3 were used to predict Q_t.
The three input variables were selected based on model parsimony. The soil water balance input is the net precipitation, and the outputs are the actual evaporation and percolation. However, part of the net precipitation used to fill the storage can be inferred from the final water percolation. Thus, we allow the model to handle net precipitation alone to predict its contribution to runoff. The inputs from three previous timesteps were used in the GR-Hyb-2 version considering the x4 ranges found by Perrin et al. (2003); x4 represents the time base of the unit hydrograph and hence the possible number of timesteps over which fluxes are superimposed for runoff generation.
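Building the GR-Hyb-2 design matrix of current plus lagged water-balance outputs amounts to stacking shifted copies of each series. A possible sketch (function name and column-naming convention are ours):

```python
import numpy as np

def make_lagged(series_dict, n_lags):
    """Build a GR-Hyb-2 style design matrix: for each water-balance output
    (e.g., Perc, Es, Pn) include the current value plus n_lags previous
    timesteps. The first n_lags rows are dropped (incomplete history)."""
    names, cols = [], []
    for name, x in series_dict.items():
        x = np.asarray(x, dtype=float)
        for lag in range(n_lags + 1):
            # element j of this column is x[(j + n_lags) - lag]
            shifted = x[n_lags - lag : len(x) - lag]
            names.append(name if lag == 0 else f"{name}_t-{lag}")
            cols.append(shifted)
    return names, np.column_stack(cols)
```

With `n_lags=0` the same helper produces the GR-Hyb-1 inputs, so both model versions share one data pipeline.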

Meteorological forcing data
The study period ranged from 1 June 2000 to 31 December 2014 (∼14.5 years). The meteorological forcings required by the models are precipitation (P) and potential evaporation (E). The datasets used to represent these variables were the GPM IMERG Final Precipitation V06 1-day product (Huffman et al. 2019) and ETo-Brazil V3 (Althoff et al. 2020a). The IMERG dataset has been favorably assessed across the Brazilian territory, while ETo-Brazil is a daily gridded Penman-Monteith evapotranspiration dataset developed for Brazil using machine learning models. Although Penman-Monteith reference evapotranspiration differs from potential evaporation (Miralles et al. 2020), the two are closely related, and the dataset was available for the entire study period and areas.

Observed streamflow data
Daily streamflow data for gauge stations 17120000, 27380000, and 33680000 were downloaded from HidroWeb (ANA 2020), a tool that is part of the National System for Water Resources Information (in Portuguese: Sistema Nacional de Informações sobre Recursos Hídricos). The data were filtered for the same period as considered for the meteorological forcings.

Model calibration and validation
The observed streamflow and forcing data were split into warm-up, calibration, and validation periods. The warm-up period consisted of 2.5 years (2000–2002), while the remaining 12 years of records were split in a 75:25 ratio for calibration-validation, i.e., 9 years for calibration (2003–2011) and the 3 most recent years for validation (2012–2014). Considering the strong seasonality of streamflow in the region and the importance of water availability during the dry season, it is essential to choose an objective function that gives proper weight to low-flow conditions during calibration. It is well documented that the commonly used Nash-Sutcliffe efficiency index (NSE) (Nash & Sutcliffe 1970) poorly assesses the model's performance under low-flow conditions (Legates & McCabe 1999; Krause et al. 2005; Schaefli & Gupta 2007; Gupta et al. 2009). Thus, we used the Kling-Gupta efficiency (KGE) index (Gupta et al. 2009):

KGE = 1 − ED,  ED = \sqrt{(r − 1)^2 + (α − 1)^2 + (β − 1)^2}  (1)

where ED is the Euclidean distance, r is the linear correlation coefficient between the observed (O_i) and predicted (P_i) streamflow, α is the variability ratio, i.e., the ratio between the standard deviation of simulated values and the standard deviation of observed values (σ_P/σ_O), and β is the ratio between the mean simulated and observed values (μ_P/μ_O). The performance of the models was also evaluated by considering the following metrics: percent bias (PBIAS, Equation (2)), mean absolute error (MAE, Equation (3)), root-mean-square error (RMSE, Equation (4)), NSE (Equation (5)), and relative NSE (rNSE, Equation (6)), i.e., NSE computed with relative values:

PBIAS = 100 · \sum (P_i − O_i) / \sum O_i  (2)

MAE = (1/n) \sum |P_i − O_i|  (3)

RMSE = \sqrt{(1/n) \sum (P_i − O_i)^2}  (4)

NSE = 1 − \sum (O_i − P_i)^2 / \sum (O_i − \bar{O})^2  (5)

rNSE = 1 − \sum [(O_i − P_i)/O_i]^2 / \sum [(O_i − \bar{O})/\bar{O}]^2  (6)

where O_i and P_i are the observed and predicted (simulated) streamflows at timestep i, the overbar denotes the mean for the entire period, for example, \bar{O} is the mean observed streamflow, and n denotes the total number of timesteps.
PBIAS is the sum of the prediction errors normalized by the sum of the observed streamflow values and indicates whether a model generally under- or overestimates the observed values. MAE indicates the average magnitude of the prediction error, while RMSE is penalized by larger errors. NSE is a commonly used goodness-of-fit criterion in many studies, which penalizes high error dispersion in relation to observed streamflow variability. rNSE provides a better assessment of streamflow estimates under low-flow conditions because errors in low-flow conditions tend to be relatively larger. The ideal value for PBIAS, MAE, and RMSE is 0, while that for KGE, NSE, and rNSE is 1.
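The evaluation metrics above translate directly into a few NumPy one-liners; a possible sketch (function names are ours) is:

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta efficiency (Gupta et al. 2009): 1 minus the Euclidean
    distance of (r, alpha, beta) from the ideal point (1, 1, 1)."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # bias ratio
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def nse(obs, sim):
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rnse(obs, sim):
    """NSE computed on relative values, which weights low flows more heavily."""
    rel_err = (obs - sim) / obs
    rel_var = (obs - obs.mean()) / obs.mean()
    return 1 - np.sum(rel_err ** 2) / np.sum(rel_var ** 2)

def pbias(obs, sim):
    return 100 * np.sum(sim - obs) / np.sum(obs)
```

Note that rNSE divides by each observed value, so it is undefined for zero flows; ephemeral records would need masking before use.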

Explanation tasks for machine learning models
Because of their inner complexity, machine learning models are often considered 'black-boxes'. Even though the hybrid models are built based on cubist regression, which is based on more interpretable units (decision trees), the number of rules and linear regressions and the boosting-like scheme used for nonlinear predictions come at the expense of interpretability. Recent advances made within XAI, a subdomain of machine learning, can aid its users in better understanding and reporting how their 'black-box' models function on a global scale, as well as allowing researchers to zoom into specific cases (Maksymiuk et al. 2020). To gain some insights into the models developed, we apply two global and two local interpretation techniques.

Global interpretation
The most common methods used for global interpretation include (i) variable/feature importance (VI/FI) measures and (ii) partial dependence profiles (PDPs) (Friedman 2001). VI is an explanation task based on model parts and requires only the trained model. For the cubist regression, the VI is calculated based on variable usage statistics, i.e., the percentage of times each variable is used in a condition and/or in a linear model (Kuhn et al. 2018). At each split of the tree, cubist regression saves a linear model after feature selection. This linear model may utilize any variable used in the current split or the splits above it. The usage statistics reflect all the models involved in the predictions. The PDP technique, on the other hand, is an explanation task based on the model profile. This method profiles the global behavior of models within the context of a single variable, i.e., it presents the expected prediction as a function of a specific variable (Maksymiuk et al. 2020). Thus, records from the validation period were used for the PDP. The PDP, unlike the VI technique, is a model-agnostic method that does not require knowledge of the model's internal structure. The VI and PDP techniques were implemented in the R programming language and environment (R Core Team 2020) using the caret package (Kuhn et al. 2018) and the pdp package (Greenwell 2018).
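Because the PDP is model-agnostic, it reduces to a short loop over a grid of values for the variable of interest. The sketch below (in Python rather than the R pdp package used in the study; the function name is ours) takes any prediction callable:

```python
import numpy as np

def partial_dependence(model, X, feature_idx, grid):
    """Model-agnostic PDP: for each grid value of one feature, overwrite
    that column for every row, predict, and average. The per-row curves
    (before averaging) are the ICE profiles; the PDP is their mean."""
    pdp = []
    for g in grid:
        Xg = X.copy()
        Xg[:, feature_idx] = g          # vary one feature, hold all others fixed
        pdp.append(model(Xg).mean())    # average prediction over the dataset
    return np.array(pdp)
```

For a model that is linear in the profiled feature, the PDP recovers that linear effect exactly, which is why the nearly flat PDPs reported later for evaporation and net precipitation indicate a weak global dependence.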

Local interpretation
For the local interpretation, two model-agnostic techniques were used: the individual conditional expectation (ICE) (Goldstein et al. 2015) to examine prediction profiles, and the local interpretable model-agnostic explanation (LIME) (Ribeiro et al. 2016) to analyze specific cases. ICE, also known as Ceteris Paribus profiles, iteratively examines the influence of each explanatory variable while holding the values of all other variables unchanged. The main difference between ICE and PDP is that there is no aggregation over the entire dataset, as ICE presents the perspective of single observations (Maksymiuk et al. 2020). The ICE profiles were also generated for the records from the validation period and implemented in R using the pdp package (Greenwell 2018).
A closer look was given to five different cases over a year using the LIME algorithm. The LIME explanation task is based on model parts, i.e., it attempts to explain individual predictions. It assumes that every complex model is linear on a local scale and can be explained by a simple model fit around a single observation. The LIME algorithm first creates (i) a large number of permutations, i.e., replicates of the feature data with slight value modifications. It then computes (ii) a similarity distance measure between the original observation and the permutations and (iii) applies the original model to predict the outcomes of the permuted data. Finally, it (iv) selects the features that best describe the predicted outcomes and (v) fits a simple model to the permuted data weighted by their similarity to the original observation. The feature weights of this simple model are used to explain the local behavior. The dates chosen were 25 January 2012, 8 April 2012, 22 June 2012, 5 September 2012, and 19 November 2012, which are all part of the validation period. These dates were chosen to represent the different contexts of streamflow prediction throughout the year.
The LIME algorithm was implemented using the lime package (Pedersen & Benesty 2019). In the LIME algorithm, (i) the number of permutations was set to 5,000, (ii) the similarity was computed using Gower's distance (Gower 1971), (iii) the trained cubist model was used to predict outcomes from permutations, (iv) the forward selection was used for feature selection, and (v) the ridge regression was fitted to the weighted permuted observations.
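The five LIME steps can be condensed into a minimal numeric sketch. This is deliberately simpler than the lime R package configuration above: it uses a Gaussian kernel on Euclidean distance instead of Gower's distance, skips step (iv) (forward feature selection), and solves the weighted ridge fit directly; the function name and defaults are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def lime_explain(model, x0, n_perm=2000, kernel_width=1.0, ridge=1e-2):
    """Minimal LIME for a numeric instance x0:
    (i)   perturb the features around x0 ('permutations'),
    (ii)  weight each perturbation by its similarity to x0,
    (iii) score the perturbations with the original 'black-box' model,
    (v)   fit a weighted ridge regression; its coefficients are the
          local feature weights. Step (iv), feature selection, is omitted."""
    Z = x0 + rng.normal(0.0, 1.0, size=(n_perm, x0.size))    # (i) permutations
    d = np.linalg.norm(Z - x0, axis=1)
    w = np.exp(-((d / kernel_width) ** 2))                   # (ii) similarity weights
    y = model(Z)                                             # (iii) black-box predictions
    Z1 = np.column_stack([np.ones(n_perm), Z])               # add intercept column
    A = (Z1 * w[:, None]).T @ Z1 + ridge * np.eye(x0.size + 1)
    coef = np.linalg.solve(A, (Z1 * w[:, None]).T @ y)       # (v) weighted ridge solve
    return coef[1:]                                          # drop the intercept
```

For a model that is genuinely linear near x0, the recovered weights match the model's local slopes, which is exactly the assumption LIME exploits.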

Performance of the hydrological models
A brief overview of the hydrological context of the studied catchments is provided in Supplementary Material Section S2, and a discussion regarding calibrated parameters is provided in Supplementary Material Section S3. The time series and scatterplots between the observed and predicted streamflows are presented in Figures 3 and 4, respectively. In general, the hybrid models presented similar performances and were superior to the GR4J model, which demonstrates that the association of the GR4J model with machine learning techniques improved the prediction of daily flow in the three catchment areas analyzed. The hybrid models performed remarkably better under low-flow conditions and in transitions between wet and dry seasons. For example, let us consider low-flow conditions roughly as flows below the 66% exceedance probability level and transition conditions between seasons as flows between the 33 and 66% exceedance probability levels. Under low-flow conditions, the GR-Hyb-2 model produced RMSEs of 0.11, 0.07, and 0.02 mm day−1 for catchments 17120000, 27380000, and 33680000, respectively, while GR4J produced RMSEs of 0.15, 0.30, and 0.03 mm day−1, respectively. During the transition periods, GR-Hyb-2 produced RMSEs of 0.18, 0.44, and 0.05 mm day−1 for catchments 17120000, 27380000, and 33680000, respectively, while GR4J produced RMSEs of 0.26, 0.51, and 0.06 mm day−1, respectively. The rNSE values are also displayed in Table 1. The hybrid models also showed a general reduction in dispersion (Figure 4), i.e., higher values of the determination coefficient (R²). Their better alignment with the ideal 1:1 line (black dashed line) was also notable, with the slopes of both adjusted regressions (ordinary least squares and forced through the origin) closer to 1.
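The regime-conditioned RMSEs above can be reproduced with a small helper. A sketch under our interpretation that a flow below the 66% exceedance probability level is one exceeded 66% of the time, i.e., below roughly the 34th flow percentile (function names are ours):

```python
import numpy as np

def rmse(o, s):
    return np.sqrt(np.mean((o - s) ** 2))

def regime_rmse(obs, sim):
    """Split RMSE by flow regime: 'low flow' = observed flow below the 66%
    exceedance level (about the 34th flow percentile), 'transition' =
    between the 33% and 66% exceedance levels."""
    q34, q67 = np.quantile(obs, [0.34, 0.67])
    low = obs <= q34
    trans = (obs > q34) & (obs <= q67)
    return rmse(obs[low], sim[low]), rmse(obs[trans], sim[trans])
```

Conditioning the error metric on flow bands like this is what exposes the hybrid models' advantage, which whole-record RMSE, dominated by wet-season peaks, would partly hide.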
A summary of the performance of the models is presented in Table 1. In the catchment areas of stations 17120000 and 27380000, the GR-Hyb-1 model produced PBIAS values of −4.5 and 3.2%, indicating an under- and overestimation, respectively, of the daily flows observed in these basins. In the catchment area of station 33680000, the GR4J model presented the lowest PBIAS value (−15.0%) and showed a tendency to underestimate the values of daily flow, especially during the wet season. In contrast, the hybrid models tended to overestimate the streamflow in the same catchment. This overestimation occurred with a greater magnitude mainly during the wet period of 2014 (Figure 3). The hybrid models showed lower MAE and RMSE values relative to the GR4J model. The reduction in RMSE was approximately 27, 50, and 24% for stations 17120000, 27380000, and 33680000, respectively. Based on the NSE, rNSE, and KGE performance criteria, the GR-Hyb-2 model presented the highest values in basin 17120000, while the GR-Hyb-1 model produced the best results (values closest to 1) for streamflow prediction at stations 27380000 and 33680000. A more objective way to analyze and compare the performance of these models is through a Taylor diagram (Taylor 2001; Figure 5). The metrics presented by this diagram are similar to those in the KGE calculation. With the exception of the bias, the diagram shows the same linear correlation (blue dashed lines) and the variability ratio, which, in the case of the Taylor diagram, is referred to as the normalized standard deviation (black dash-dotted lines). In addition, the diagram shows the centered RMSE (green dotted lines). The model that comes closest to 1 on the abscissa axis was considered the most suitable.
The proximity between the GR-Hyb-1 and GR-Hyb-2 models indicates that the inclusion of input variables with values from previous days was not an efficient strategy to better capture the temporal effects. Studies using purely statistical models show that this inclusion can be advantageous for flow estimates, especially when the streamflow variable is available (Adnan et al. 2020a, 2020b; Tyralis et al. 2020). In contrast, hybrid models have the advantage of the water balance of storage x1, which, in this study, seems to be sufficient to capture the temporal effects and antecedent water conditions of the catchment area. Thus, the greater complexity given to the model through the inclusion of observations from previous days is unnecessary.

Explainable artificial intelligence
It is clear from the VI plots (Figure 6) that percolation was by far the most important input variable to the cubist regression models. Even when input variables of lagged observations were included (GR-Hyb-2), the lagged percolation variables still presented a higher importance. The importance of the actual evaporation and net precipitation varied between the gauge stations and within the models applied to the same gauge station. For example, the actual evaporation (VI: ∼10%) ranked higher than the net precipitation (VI: ∼0%) at gauge station 17120000 in GR-Hyb-1 and was generally considered more important than net precipitation in GR-Hyb-2. On the other hand, despite the net precipitation (VI: ∼50%) showing a high importance at gauge station 27380000 in GR-Hyb-1, its importance was lower than that of the actual evaporation in GR-Hyb-2, even for the variables with lagged observations.
Because the lagged net precipitation observations presented a higher importance in GR-Hyb-2 than the net precipitation of the current day, Pn_t−3 for stations 17120000 and 27380000 and Pn_t−2 for station 33680000 could be related to the unit hydrograph duration (parameter x4; see Supplementary Material Section S3) in the GR4J model. For example, the calibrated x4 parameter was higher for station 17120000 (6.25 days) and lower for stations 27380000 (2.45 days) and 33680000 (2.44 days). However, it is difficult to make such an affirmation since x4 was not included in GR-Hyb-1 and GR-Hyb-2. In addition, despite the fact that GR-Hyb-1 used only input variables corresponding to the current day, both hybrid models were calibrated to similar x1 values and performed correspondingly.

Figure 6 | Variable importance in the cubist regression of the hybrid hydrological models. The suffix numbers refer to lagged observations of the input variables, e.g., Perc1 = Perc_t−1 (percolation from the previous day), Pn1 = Pn_t−1 (net precipitation from the previous day), and En1 = En_t−1 (evaporation from the previous day).

As both the GR-Hyb-1 and GR-Hyb-2 models showed similar performances, it seems reasonable to use the more parsimonious model, especially when the objective is to untangle the complexities of the 'black-box' models. For this reason, the remaining explanation tasks were applied only to the cubist regression of the GR-Hyb-1 models. The ICEs and PDPs obtained from GR-Hyb-1 are shown in Figure 7. Despite showing a large VI for some of the models (Figure 6), the actual evaporation and net precipitation variables show a limited influence on the prediction of streamflow. These variables seem to have weighted only a small number of instances, as shown by the black lines with gentler slopes. However, the overall dependence of streamflow on these variables was insignificant, as the PDPs (red lines) show almost flat slopes.
In contrast, percolation was the largest driver of streamflow, especially for stations 17120000 and 27380000, accounting altogether for the baseflow and interflow components. The closer analysis provided by LIME on five different cases over a year (Supplementary Material Figure S2) shows that percolation has a higher absolute weight on the prediction for all cases. For all catchments, the weight was positive for the predictions performed during the wet season and negative during the dry season and under low-flow conditions. Thus, streamflow is, in most part, controlled by the production storage.
It may be surprising that the precipitation from the respective day, which can result in direct runoff, played such a small role in the streamflow prediction. However, a study considering a large number of catchments across Europe also showed that runoff anomalies correlated well with soil moisture anomalies (Ghajarnia et al. 2020). Ghajarnia et al. (2020) explain that the role of soil moisture, represented here by the production storage, is mechanistically explainable as it implies changes in the groundwater table depth and, therefore, in the hydraulic gradients governing groundwater flow toward streams. These findings are important in untangling how runoff routing is resolved and the key role that soil moisture should play in hydrological and earth system modeling.

CONCLUSIONS
In this study, we compared the performance of a conceptual hydrological model to that of a hybrid hydrological model. The hybrid hydrological model consisted of a conceptual-based component, in which a single parameter is used to calibrate the capacity of the production storage (soil moisture accounting), and a data-driven-based component, which uses water balance outputs as inputs for machine learning modeling to predict streamflow. In the data-driven component, two types of input were used: one that only considered variables of the respective day and another that also considered variables from past days (lagged input). The machine learning model benefitted from the inclusion of the soil moisture accounting storage and showed that including observations from past days, as done by many studies of purely data-driven hydrological models, was unnecessary. Thus, the conceptual-based component used to account for soil moisture reduced the short- or long-term dependencies of the input variables used to make predictions. In addition, the hybrid models showed clear improvements compared to the conceptual model, especially for predictions made during the transition period between wet and dry seasons and during low-flow conditions. This improvement is important for watersheds with strong seasonality, as is the case with the tropical watersheds observed in this study.
A highlight from this study is the use of XAI techniques to untangle how the machine learning component of a hybrid hydrological model resolves runoff routing. These techniques showed that soil moisture is the most important factor in predicting streamflow. Soil moisture is represented in the hybrid models by the water level in the production storage and is responsible for the groundwater flow contribution to streamflow. These findings agree with large-scale studies that have unveiled a strong correlation between soil moisture and runoff.

FUNDING
This study was financed in part by the Coordination of Improvement of Higher Education Personnel (CAPES), finance code 001, and by the National Council for Scientific and Technological Development (CNPq), grant no. 142273/2019-8.