Abstract
Runoff has been greatly affected by climate change and human activities. Studying nonlinear controls on runoff response is of great significance for water resource management decision-making and ecological protection. However, there is limited understanding of what physical mechanisms dominate the runoff response and of their predictability over space. This study analyzed the spatial patterns of runoff response including runoff changes and its sensitivity to climate–landscape variations in 1,003 catchments of the contiguous United States (CONUS). Then, an interpretable machine learning method was used to investigate the nonlinear relationship between watershed attributes and runoff response, which enables the importance of influencing factors. Finally, the random forest model was employed to predict runoff response according to the predictors of catchment attributes. The results show that alteration of runoff is up to 56%/10 years due to climate change and human activities. Catchment attributes substantially altered runoff over CONUS (−60% to 56%/10 years). Climate, topography, and hydrology are the top three key factors which nonlinearly control runoff response patterns which cannot be captured by the linear correlation method. The random forest can predict runoff response well with the highest R2 of 0.96 over CONUS.
HIGHLIGHTS
Spatial patterns of runoff response were explored over 1,000+ catchments of the contiguous United States.
Interpretable machine learning method was used to investigate the nonlinear relationship between runoff response and catchment attributes.
Climate, hydrology, and topography are the top three dominant factors that nonlinearly control the runoff response, which can be well predicted through a random forest based on 56 indices.
INTRODUCTION
Global warming will bring about a series of meteorological problems, such as rising temperatures (Gampe et al. 2021), uneven spatiotemporal distribution of rainfall (Ghosh et al. 2011), glacier melting (Huss & Hock 2018), and drought risks (Pokhrel et al. 2021). The impact of such climate change on the hydrological cycle has been a hot research area for hydrologists (Zhang et al. 2021). Human activities are another major factor in runoff behavior. The impact of direct human activities, such as the construction of river engineering, industrial and agricultural water intake, land use/cover changes that cause evaporation and changes in groundwater level on runoff lead to spatiotemporal changes in water resources; human urbanization, farmland transformation, and other activities affect the land–atmosphere feedback mechanism, further affecting hydrological processes (Wang et al. 2020). Therefore, understanding the impact of climate change and human intensification on runoff behavior will help decision-makers in facing water resource shortage crises and drought risks.
Climate elasticity is a method that reflects the sensitivity of runoff to climate change. The larger its absolute value, the more sensitive is runoff variation to the influencing factors. Tang & Lettenmaier (2012) demonstrated that the sensitivity of runoff to climate change is relatively stable in scenarios of global temperature rise and increased emissions, but there are significant differences in high-latitude and arid or semi-arid regions. Gong et al. (2022) investigated the spatial patterns of sensitivity of runoff to climate and catchment properties over 188 catchments of the contiguous United States (CONUS). Their results demonstrated the critical role of catchment properties on runoff, which are consistent with ours; this indicates the robustness of our results. In addition to the natural factors, human activities have a nonnegligible role in runoff. For example, Li et al. (2020) demonstrated that future watershed hydrologic impacts of urbanization vary dramatically across the United States through use of a monthly water balance model across 81,900 12-digit hydrologic unit code (HUC) watersheds of CONUS. The direct impact of human activities on runoff has not been studied in this study. Our results also validated this impact on runoff. Total changes in runoff did not equal the contribution of natural factors, and we infer the difference between them is caused by human activities. However, there is little research on the relationship between climate elasticity, runoff changes, and controlling factors. The current understanding of climate elasticity is stable across time variations within the watershed. Understanding this unchanged pattern and the mechanism would help model the runoff–precipitation relationship.
Machine learning is a black-box model whose behavior and internal connections are often unobservable (Molnar et al. 2018). In order to understand the running process and internal connections of the model, model users need not only to pay attention to the accuracy of the model's results but also to understand the situation of the model itself. The interpretable role of machine learning is to make the running behavior of the black-box model visible. Accumulated local effects (ALE) is an effective interpretation method based on the completion of model operation, which is an extension of the partial dependency plots (PDP) method, through judging the average change in output results through the edge distribution changes of input variables and further evaluating the importance of input variables to the output results. However, the disadvantage of PDP is that it cannot accurately evaluate the importance of predictive variables with related relationships. The ALE method enables the evaluation of nonlinear relationships between attributes, and makes it possible to overcome the effect of correlated catchment attributes; it has not been used often in hydrology. Recently, the method has been widely used to investigate how catchment attributes and climate attributes influence the hydrological response. For example, Stein et al. (2021) applied the ALE method to evaluate how catchment attributes and climate attributes influence the distribution of flood processes across 671 catchments of CONUS. They found the climate in terms of the fraction of snow, aridity, precipitation seasonality, and mean precipitation to be most influential on flood process distribution. The ALE method, however, also has certain limitations: firstly, for example, ALE evaluates the model's response to changes in a certain predictive variable, and its results are not directly based on data (Anchang et al. 2020). When the simulation results of the model are poor, ALE results are generally not good. Secondly, when there is less predictive variable data, ALE results are unreliable.
The ability to accurately predict behavior patterns is a critical test of the adequacy of knowledge in any subject (Crawford & Linsley 1966). Previous studies on spatial prediction in ungauged basins have mainly focused on hydrological variables such as runoff and hydrological signatures (Hrachowitz et al. 2013; Kuentz et al. 2017). Studies on runoff changes caused by climate change and/or human activities and their relative importance have expanded over time (Kalugin 2019; Somorowska & Łaszewski 2019). There are three approaches to achieving spatial predictability of runoff changes based on predictors of climatic and physiographic attributes: (1) statistical models with predictors of catchment characteristics (Beck et al. 2013; Kuentz et al. 2017). For example, Stein et al. (2021) used a random forest model to predict the flood generation process. (2) Interpolation with similar catchments between adjacent gauged basins (Westerberg & McMillan 2015); Kuentz et al. (2017) analyzed the hydrological similarity and further explored the dominant controls on hydrological response over 30,000+catchments of Europe. (3) Hydrological models with parameters regionalized based on catchment attributes (Wagener & Wheater 2006; Ragettli et al. 2017); Shao et al. (2020) used a physics-based distributed hydrological model, embedded with a multi-purpose reservoir module, to quantify the effects of urbanization on floods.
Random forests can consider multiple predictors and nonlinear relationships. Catchment behavior is usually determined by multiple factors, but not a single influence. Beck et al. (2013) have explored global patterns in base flow, and they have shown weak relationships between flow patterns and single factors in catchments. The random forest model can deal with the above problems by coupling with multiple predictors and considering nonlinear relationships; it is not limited by physical principles or hypothesized hydrological processes and is a flexible statistical model without the problem of overfitting. The random forest model can measure the importance of predictors among all predictors using an indicator of the increase in the mean square error. Like other machine learning models, random forests can obtain the highest accuracy with the most convenient calculation procedure, and have been widely used to predict hydrological response. For example, as stated above, Stein et al. (2021) used the model to predict the variability of the flood generation process. Dogan (2023) employed a random forest model to predict streamflow over the Kızılırmak River, Turkey and obtained ideal results.
- (1)
How well can runoff response be predicted by the statistical model (data-driven) with a consideration of nonlinear relationships? And what are the patterns of predictability of runoff response? We used random forests with a consideration of nonlinear relationships to explore the predictions of runoff response based on predictors of catchment attributes covering six types (topography, geology, climate, hydrology, land use/cover, and soil characteristics) indicated by 56 indicators over 1,003 catchments in the CONUS. We quantified how catchment attributes impact strongly the prediction of runoff response by means of ALE, which can rank the importance of catchment attributes to runoff response.
- (2)
How do key factors control the runoff response? What differences are there compared with linear relationships? After key factors are determined through step 1, we quantified the nonlinear relationship between runoff response and key factors using ALE.
DATA
Catchment attributes
The United States Geological Survey (USGS) provided Geospatial Attributes of Gages for Evaluating Streamflow (GAGES-II), version II, for evaluating inland hydrological stations in the United States (Falcone et al. 2010) (https://www.usgs.gov/3d-elevation-program). GAGES-Ⅱ includes nearly 9,322 hydrological stations distributed in the United States (including Arras, Hawaii, and Puerto Rico), and the each flow stations are divided into ‘reference station’ and ‘disturbed station’ according to whether they are affected by human activities (http://esapubs.org/Archive/ecol/E091/045/default.htm). GAGES-II is updated from the original dataset GAGES, and has two main purposes: (1) to provide large and comprehensive geographic spatial data within the watershed, especially long-term runoff data; (2) to identify whether runoff sites are affected by human activities, which will help restore or establish ecological flow. The length of the runoff series included in this dataset is at least 20 years (since 1950), but the runoff series might not be continuous and these sites are all valid, and their data can be subject to real-time review. The reference stations are considered to be unaffected or minimally affected by human activities and are all distributed in 12 major ecological regions of the United States, with a total of 2,057 stations distributed in the dataset. There are 7,265 sites that are significantly affected by human activities (‘disturbed sites’), mainly distributed outside the ecological zone. Among the 2,057 reference stations, 1,633 have a runoff sequence length exceeding 20 years, and some stations have a runoff sequence length of 110 years (from 1900 to 2009). Geospatial data mainly include nearly 100 types of watershed attribute-characteristic data, including climate, hydrology, geomorphology, land use/land cover, geology, and human activities within each watershed. This study considered six types of watershed attribute data within the watershed, consisting of topography, geology, climate, hydrology, land use/land cover change, and soil characteristics. The specific descriptions and sources are shown in the Supplementary Information, Table A1.
The runoff data of the GAGES-II hydrological station are downloaded from the USGS Water Information System (USGS NWIS) of the United States Geological Survey (Harris 2002) (https://waterdata.usgs.gov/nwis/). When downloading runoff data, it is necessary to consider the length and continuity of the data, that is, at least 20 years of data, and at least 80% of the data are continuous (at least 8 years of data every decade is continuous on a daily basis). In the end, a total of 2,482 sites (1966–2015) met the above screening criteria, of which 203 were in 1916–2015 and 1,408 were in 1941–2015. This study selected the period 1941–2015 as the research period. In order to facilitate analysis, sites with a runoff coefficient greater than 1 were abandoned. In the end, 1,003 sites were selected for this study.
Climate data
The climate data used in this study mainly include daily rainfall, maximum daily temperature, and minimum daily temperature. The climate data come from the PRISM climate dataset. The parameter elevation regression on independent slopes model (PRISM) climate dataset is a parameter elevation regression model that is independent of slope (Daly et al. 2008). Theoretical-based systems are mainly used to insert climate elements into complex landforms and landscapes. Regression-based PRISM uses point data, digital elevation model (DEM), other spatial datasets, and encoded spatial climate theory to generate estimates of annual, monthly, daily and event-based climate elements. These estimated values are inserted into the regular grid to make them compatible with the geographic information system (GIS). The climate data in the PRISM dataset have undergone peer review, including the official rainfall and temperature map data of 50 states, Caribbean and Pacific islands released by the US Department of Agriculture, the new official climate map of the United States, with 112 consecutive years of monthly temperature, precipitation, and dew-point maps of 48 states, detailed precipitation and temperature maps of Canada, China, and Mongolia, and the first comprehensive precipitation map of the European Alps region (Daly et al. 2021). Its quality is reliably guaranteed (Strachan & Daly 2017). The PRISM Climate Group collects climate observation data from a wide range of monitoring networks, adopts complex quality-control measures, and develops spatial climate datasets to reveal short-term and long-term climate patterns. The generated dataset includes various modeling techniques and is available in various spatial/temporal resolutions, covering the period from 1895 to the present, and the data are freely available. The original grid resolution of the PRISM dataset is 800 m, but it has been filtered to a 4 km resolution for download and operation.
METHODS
Runoff elasticity and runoff changes
Scholar . | . |
---|---|
Schreiber (1904) | |
Ol'dekop (1911) | |
Yang et al. (2008) | |
Fu (1981) | |
Zhang et al. (2001) | |
Porporato et al. (2004) | |
Donohue et al. (2012) | |
Wang & Tang (2014) |
Replacing the data of maximum and minimum daily temperatures with X in the above equation can thus obtain the elasticity of runoff according to maximum and minimum daily temperatures, respectively.
The Food and Agriculture Organization of the United Nations (FAO) has improved and upgraded the estimation method of potential evapotranspiration (ET) by introducing the concept of a reference crop (grass) described by the Penman–Monteith (PM-ET) equation (Allen et al. 1998), and this method has been well verified under different climate and timescale calculations and is now widely used worldwide (Almorox et al. 2015). However, the high cost of maintaining agricultural meteorological stations and estimating the large number of sensors required for ET makes it difficult to apply in practice, especially in rural areas (Moratiel et al. 2020). For this reason, using air temperature estimation ET in areas where wind speed, solar radiation, and air humidity data are not easily available is highly popular. The method of estimating potential evaporation based on temperature has been studied for a long time. Allen et al. (1998) recommended two methods for calculating potential evaporation based on the Penman–Monteith formula: (1) using the Hargreaves–Samani equation; (2) using the Penman–Monteith formula, estimating net radiation and water vapor pressure difference based on temperature. This study uses the second method to estimate potential evaporation; for details of the method, refer to Allen et al. (1998).
Random forest
Regression trees generally grow in a recursive binary-splitting manner. Firstly, run from the top of the tree and split the nodes, with one variable and one threshold, in order to minimize the error of the prediction as much as possible. The final prediction result is the average of all prediction factors within each category. Therefore, the prediction of the decision tree is a discrete value (one prediction value per terminal node, for example, in Figure 2, the leftmost terminal node of the tree is 0.48). In order to prevent overfitting, the cross-validation method is used to minimize the error to make the tree grow and prune. Although regression trees are easy to explain and can handle nonlinear relationships between variables, they often lack robustness.
In order to overcome this shortcoming, this paper integrates 500 regression trees to form a random forest, and lets each tree grow freely, which gives the random forest a strong robustness. At each node, predictive factors are randomly selected, and the remaining predictive factors are used for other nodes. This means that predictive factors that are important for response variables may be removed. The number of trees and the number of predictive factors removed at each node can be freely determined by the user. In general, the number of trees at 500 has a good prediction effect, while the removal of prediction factors at 1/3 of the total number for each node has a small impact on the prediction results (Addor et al. 2018).
In this study, 18 hydrological characteristic variables are predicted, each regression tree has 180 parameters, and 500 trees have 90,000 parameters. It is impractical to analyze such a large number of parameters one by one, but we can evaluate the importance of predictive variables through the ALE.
The prediction results of the random forest pass the ten-fold cross-validation evaluation, that is, 90% of the catchment is used as the training set, 10% of the catchment is used as the test set, and the cycle is repeated ten times. During this process, the catchment is randomly sampled.
RESULTS AND DISCUSSION
Spatial patterns of runoff response
Comparing the spatial distribution of runoff changes caused by rainfall, it can be found that in the central and central southern regions, the rainfall elasticity is in the range from 2 to 11, and the runoff changes are 2%–17%/10 years, indicating that the runoff in this region is linearly increasing (Figures 3 and 4). In the central northern and northeastern regions, the elasticity of rainfall is mainly in the range of 1–2, while the runoff changes are in the range of 5%–17%/10 years, indicating that the water supply in this region should also include snowmelt in addition to rainfall. In the southeastern and western regions, the rainfall elasticity is mainly in the range of 1 to 4, and the runoff changes show a negative change. This can prove that rainfall elasticity is not necessarily the only criterion for judging runoff variation but can be used as one of the means to determine the sensitivity of runoff to influencing factors. Compared with the amount of runoff caused by potential evaporation, the potential evaporation elasticity in the central southern and southeastern regions is −1 to 4, and the corresponding runoff changes are 6%–19%/10 years, indicating that in this type of region, the amount of runoff is about five times the potential evaporation variation (Figures 3 and 4). In most other regions (northeastern, central northern, and western), the average potential evaporation elasticity is −1.0, and the corresponding runoff changes are 1%/10 years. This indicates that in these regions, the changes in runoff and elasticity are consistent, and the corresponding runoff can be quantitatively evaluated using the potential evaporation elasticity. The runoff changes do not vary linearly with the catchment attribute elasticity. It is crucial to further explore the factors that affect this type of runoff change.
Cumulative local effects between catchment behavior and runoff response
Predictions of runoff response through random forest
The predictions of the random forest based on predictors of catchment attributes are shown in Figure 5. When the catchment attributes are used to predict the total changes and catchment-attributes-induced changes, the accuracy of both reaches 0.6. The R2 of runoff changes caused by rainfall, potential evaporation, and temperature is 0.5, 0.3, and 0.1, respectively. The reliability of ALE results is closely related to the accuracy of model predictions; the higher the prediction accuracy, the more reliable the ALE results of the model. There is a significant difference in accuracy between different types of runoff changes. Moreover, the 56 watershed attributes adopted in this study cannot confidently explain the distribution of runoff changes caused by rainfall, potential evaporation, and average temperature, even though the nonlinear relationships are considered. The relationship between runoff changes and catchment behavior is not a simple linear relationship, and might even be offset by the role of various factors on the runoff changes. In the future, more influencing factors should be considered and more advanced mathematical methods should be adopted to explore the relationship between them.
The predictions of runoff elasticity of the random forest are shown in Figure 6. The accuracy of the elasticity of watershed attributes is the highest, 0.9, followed by rainfall elasticity (0.7) and potential evaporation elasticity (0.7). The prediction accuracy of watershed attributes n value and average temperature elasticity is relatively low, with 0.4 and 0.1, respectively. For runoff elasticity with high accuracy, an empirical relationship between runoff elasticity and influencing factors can be established based on ALE results to explore the patterns of runoff elasticity in areas without data. For runoff elasticity with lower accuracy, it is necessary to consider more and more comprehensive influencing factors or adopt more comprehensive methods to analyze the relationship between them.
CONCLUSION
Runoff changes and runoff elasticity present a heterogeneous pattern over the CONUS. The total changes show a positive trend in the east and negative in the west (56% to −36%/10 years); the changes caused by rainfall are generally consistent with the total changes; the changes caused by potential evaporation have shown an overall positive trend in the United States (19.3 to −3.7% per decade); temperature-induced changes show an increasing trend in the 100th meridian; catchment-attributes-induced changes are relatively consistent with total changes. The rainfall elasticity in most catchments is from 1.0 to 2.0; the potential evaporation elasticity is consistent with rainfall elasticity; temperature elasticity around the 100th meridian is 1–16; the elasticity of catchment attributes is relatively regular, with the value in the range of −1 to 5 near the 100th meridian, and an average of −1.2 in other regions. The distribution of the nine ecological regions in CONUS can explain the patterns of runoff response.
Compared with traditional linear correlation analysis, accumulative local effects can be used to analyze quantitatively the nonlinear relationship between watershed attributes and runoff response and determine the importance of influencing factors. Longitude and annual runoff are the top two in importance for total runoff changes and rainfall-runoff changes; the annual runoff is the most important for potential-evaporation-induced changes, and the annual average herbaceous coverage also plays an important role; the precipitation regime is the most important for runoff changes caused by average temperature; the annual runoff is the most important for catchment-attributes-induced changes. The runoff elasticity in arid areas is larger than that in humid areas, and the annual runoff has the greatest impact on the runoff elasticity. The impact of the catchment area cannot be ignored.
The random forest model was used to predict runoff response, and the accuracy was also analyzed. When using 56 indices of watershed attributes to predict the total changes, watershed attributes (n), and runoff changes caused by rainfall, potential evaporation, and average temperature, the accuracy R2 is 0.6, 0.6, 0.5, 0.3, and 0.1, respectively. The accuracy of rainfall and potential evaporation elasticity is 0.7, and 0.96 for catchment attribute elasticity.
ACKNOWLEDGEMENTS
This study is supported by the National Key R&D Program Project Topic (No. 2021YFD150080505).
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories: catchment properties data from http://ned.usgs.gov/ and https://waterdata.usgs.gov/nwis/.
CONFLICT OF INTEREST
The authors declare there is no conflict.