After some background about what I have learned from a career in hydrological modelling, I present some opinions about how we might make progress in improving hydrological models in future, including how to decide whether a model is fit for purpose; how to improve process representations in hydrological models; and how to take advantage of Models of Everywhere. Underlying all those issues, however, is the fundamental problem of improving the hydrological data available for both forcing and evaluating hydrological models. It would be a major advance if the hydrological community could come together to prioritise and commission the new observational methods that are required to make real progress.
My first attempt at a hydrological model was produced as an undergraduate student at the University of Bristol in about 1970. It was an attempt to model the famous Lynmouth Flood in 1952. It was programmed in Algol and physically existed as a pack of punched cards that needed to be fed into a card reader every time a run was made (compilation errors, run-time errors, and eventually, production runs included). The primary data available were rainfall records, so the only ‘calibration’ data were indirect post-flood estimates of a peak discharge. This was a high sediment laden flow that transported some huge boulders, so any such estimate would have been highly uncertain. Even so, that simple study taught me a great deal about the importance of antecedent conditions in trying to predict flood discharges; the wetting of the catchment prior to the flood was extremely important (as has also been the case in many more recent cases of flash flooding in the UK).
In starting my PhD at the University of East Anglia in Norwich in 1971, I made a survey of hydrological models in the literature. Even at that time, there was a plethora of different models. With the more widespread availability of digital computers in the late 1960s, many PhD projects and consultants were producing their own models. Most of these were conceptual models of the Stanford Watershed Model type, which itself was the PhD project of Norman Crawford at Stanford University under the direction of Ray K. Linsley (Crawford & Linsley 1966). This model was the foundation of the Hydrocomp consultancy, and I met both of them when I was able to participate, while still a PhD student myself, in the first UK Hydrocomp workshop. The Hydrocomp Simulation Programme in Fortran (HSPF) was later adopted by the US EPA and remained in use as a freely available tool. When I gave up my count of models at over 100 in 1971, I was already asking the question of how can we do better?
My response was to try and be objective, to base a model on best physical principles, and to measure rather than calibrate the model parameters. Al Freeze was already advocating this in the Freeze and Harlan paper in Journal of Hydrology in 1969 (Freeze & Harlan 1969) and implementing it using finite difference methods at the Thomas J. Watson research centre of IBM at Yorktown Heights. I took a slightly different strategy using finite element methods to solve the Richards equation so that my hillslopes and soil horizons on those hillslopes could look more natural in the discretisation grid (I had a physical geography rather than engineering degree after all!). I did not have quite the same resources as Al Freeze. The model was implemented as two full boxes of computer cards (with all the same issues of compilation and run-time errors, but now with many more cards to get through the card reader successfully) and ran on an ICL1904 mainframe computer. I also carried out the laboratory work necessary to determine all the soil moisture characteristics on soil cores and the fieldwork necessary for channel cross-sections and roughness. The model was applied to the East Twin catchment in the Mendips that had been studied by Darrell Weyman (1970, 1973) for his PhD, and the results were really rather bad, a fact noted by my PhD examiner, Terrence O'Donnell. They were finally published as part of my Dalton Lecture paper (Beven 2001) which tells the story of how that experience shaped my research career.
I was fortunate to then work with Mike Kirkby as a postdoc at the University of Leeds on the development of Topmodel (see Beven & Kirkby 1979) based on Mike's concept of the topographic index. Given my experience of physically based modelling, I was more than happy to take another approach, but still one that allowed the results of the modelling to be mapped back into space (I was again in a Geography Department). The only problem with that was that both the topographic analysis that went into Topmodel and the analysis of the spatial nature of the results had to be done manually. There were no Digital Terrain Models, and computer outputs were still on lineprinter paper. We were also running a nested catchment experiment, with both rainfall and stream-level data recorded on paper charts so a great deal of time was spent just getting the data into a computer-compatible form (e.g. Beven & Callen 1979). One of the nice outcomes of that project was that we demonstrated a model structure that could be applied successfully based on field-measured parameters (Beven et al. 1984). I also learned that parameter optimisation would not necessarily use the model concepts in the correct way (see, for example, Figure 14 in Beven & Kirkby 1979).
My experience with these different types of models proved valuable in being appointed (actually as a ‘mathematic modeller’ despite my geography degree) at the Institute of Hydrology (IH) in Wallingford in 1977. Part of my time was devoted to the SHE (Système Hydrologique Européen) project, a joint initiative with the Danish Hydraulics Institute (DHI) and SOGREAH in France, funded by a European Community loan. This was another attempt at producing a complete ‘physically-based’ hydrological model and was led by Mike Abbott who had successfully dealt with the numerical issues of solving the shallow water equations in hydraulics which were the basis of the DHI MIKE series of simulation packages. Before I joined IH, I had participated in the first SHE meeting at Wallingford, and the minutes record that, as a result of my PhD modelling experience, I raised many of the problems that would be met in the SHE project particularly in the decoupling of the saturated and unsaturated zone solutions. This was a pragmatic decision to reduce dimensionality, based on the available computer resource, but was the main reason why it was 1986 before the first SHE applications appeared (Abbott et al. 1986a, 1986b; Bathurst 1986). It was later relaxed as computer power increased and much later both the MIKE-SHE and SHETRAN versions of SHE were implemented with fully 3D partially saturated Darcy–Richards subsurface solutions (see Ewen et al. 2000; Graham & Butts 2005). Speed could still be an issue, however, and MIKE-SHE has also been used with conceptual groundwater storage components in some applications (see the history of SHE in Refsgaard et al. 2010).
After 3 years in Wallingford, in 1979, I moved to the University of Virginia and was able to return to working with Topmodel. I took advantage of its computational speed and the availability of a CDC6600 mainframe computer to start making Monte Carlo runs of the model in around 1980. This soon showed that there were many runs of the model with different parameter sets that gave more or less equivalent results, something that was later developed into the equifinality concept (Beven 1993, 2006a), though equifinality had already been mentioned in my PhD thesis (Beven 1975). It was also the origin of the Generalised Likelihood Uncertainty Estimation (GLUE) methodology, although I did not have the confidence to publish this until much later (Beven & Binley 1992). Returning to Wallingford in 1982, I told the Director, Jim McCulloch, that I would not work on SHE, but there was also funding for another physically based model, the IH Distributed Model or IHDM, that had been started by Liz Morris. We rewrote the IHDM, producing Version 4, which was based on finite element rather than finite difference methods. It was therefore rather similar to my PhD model but with better numerics and finer discretisations because of more computer resource. The numerics of the IHDM were later improved still further by Calver & Wood (1989) but remained subject to the problems of using the Richards equation as a representation of flows in real soils (see, for example, Beven 1989, a paper that started out as a commentary on the first 1986 SHE applications).
In 1985, I moved to Lancaster and continued work on 3D finite element modelling with Andy Binley (e.g. Binley et al. 1989a, 1989b; Binley & Beven 1992); Topmodel and the development of Dynamic Topmodel with Jim Freer (Beven & Freer 2001); modelling flow and transport for water quality (e.g. Page et al. 2007; Dean et al. 2009; Hollaway et al. 2018a); pollutant dispersion and flood forecasting using the Data-Based Mechanistic (DBM) methods of Peter Young (e.g. Wallis et al. 1989; Young & Beven 1994); and a wide range of applications of the GLUE methodology (e.g. Beven 2009, 2016a, and the references therein). Some of that work proved controversial, in particular about whether informal likelihood measures and rejection criteria could replace formal statistical methods in model evaluation (see, for example, Beven 2006b, 2008; Andréassian et al. 2007; Hall et al. 2007; Todini & Mantovan 2007). However, controversy encourages harder thinking about what is important and what is required to go beyond the norms of the current paradigm and make real advances.
This background frames the comments about how to make advances in hydrological modelling that are set out in the following sections. This essentially updates the final chapter of Beven (2012a). I concentrate on what I see as the three most important issues. These are: how to decide whether a model is fit for purpose; how to improve process representations in hydrological models; and how to take advantage of Models of Everywhere. Underlying all those issues, however, is the fundamental problem of improving the hydrological data available for both forcing and evaluating hydrological models.
THE NEED TO IMPROVE HYDROLOGICAL DATA FOR MODEL APPLICATIONS
Hydrological data are highly uncertain (see, most recently, Beven 2019). This is true for the most basic of quantities, such as rainfall at a point, discharge at a point (particularly at the highest and lowest flows), and actual evapotranspiration fluxes (sort of at a point). It is even more problematic if we are interested in the water balance over a catchment area because there are uncertainties in catchment area rainfall, snowfall, evapotranspiration fluxes, and storages. The issue is greater because, in general, the uncertainties involved are the result of a lack of knowledge (i.e. epistemic uncertainties) rather than random variability (the aleatory uncertainties) (see, for example, Kauffeldt et al. 2013; Beven 2016a; Westerberg et al. 2016; Wilby et al. 2017). In some cases, we choose to treat data uncertainties as if they are aleatory because of the convenience of the statistical techniques available (e.g. kriging for the interpolation of areal rainfalls; repeat measurements for ADCP (Acoustic Doppler Current Profiler) estimates of flows; and the choice and fitting of flood frequency distributions). A good example is the use of statistical regression for fitting rating curves for the conversion of observed water levels to discharges. It is often assumed that some simple power law will hold over the range of the data (with or without an offset and with or without multiple segments). This might be satisfactory within the range of the actual gaugings, at least if they are not too variable and if effects such as weed growth and a mobile bed are negligible, but extrapolation beyond that range can be potentially misleading (see, for example, Beven et al. 2012; McMillan & Westerberg 2015; Hollaway et al. 2018b; and the comparison of Kiang et al. 2018).
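As a minimal sketch of the rating curve regression described above, the following fits a single-segment power law Q = a(h - h0)^b in log space and flags when a stage lies outside the gauged range. The gaugings are synthetic (generated from a known power law, not real data), the stage of zero flow h0 is assumed known, and all function names are illustrative. It says nothing about the uncertainty of the extrapolated value, which is the point at issue.

```python
import math

def fit_power_law_rating(stages, discharges, h0=0.0):
    """Fit Q = a * (h - h0)**b by least squares on log-transformed gaugings.

    h0 (the stage of zero flow) is assumed known here; in practice it is
    often uncertain and may need to be estimated as well.
    """
    xs = [math.log(h - h0) for h in stages]
    ys = [math.log(q) for q in discharges]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(y_mean - b * x_mean)
    return a, b

def rating_discharge(stage, a, b, h0=0.0):
    """Convert a stage to a discharge with the fitted power law."""
    return a * (stage - h0) ** b

def is_extrapolated(stage, gauged_stages):
    """True when the stage lies outside the range of the gaugings."""
    return not (min(gauged_stages) <= stage <= max(gauged_stages))

# Synthetic gaugings (hypothetical values, generated from a = 5.0, b = 1.7)
H0 = 0.1
gauged_stages = [0.4, 0.6, 0.9, 1.3, 1.8]
gauged_flows = [rating_discharge(h, 5.0, 1.7, H0) for h in gauged_stages]

a_fit, b_fit = fit_power_law_rating(gauged_stages, gauged_flows, H0)
flood_stage = 3.0  # well above the highest gauging
flood_estimate = rating_discharge(flood_stage, a_fit, b_fit, H0)
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}")
print(f"Q at {flood_stage} m = {flood_estimate:.1f}, "
      f"extrapolated: {is_extrapolated(flood_stage, gauged_stages)}")
```

With real gaugings the fit would not be exact, and the fitted exponent would carry no information about out-of-bank behaviour above the highest gauging.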
There is some movement towards the use of extrapolations based on hydraulic modelling of a gauging site, particularly for overbank flows. The review of the Sheepmount rating curve at Carlisle after the 2005 flood is a good example. Consulting engineers were commissioned by the Environment Agency to revisit the rating curve at this site using hydraulic modelling, since the recorded water level was over a metre higher than the level of the highest measured discharge. This led to a significant increase in the estimated discharge relative to that produced by extrapolation of the rating curve fitted to the discharge measurements. The revised rating was then used to estimate the even higher flood peak from Storm Desmond in 2015. However, such estimates are very dependent on the estimation of effective roughness coefficients for the out-of-bank conditions, which is necessarily uncertain. Extrapolated discharge estimates are still often cited without any associated uncertainty range even though there is evidence that effective roughness might be model structure dependent and vary with peak magnitude (e.g. Romanowicz & Beven 2003; Pappenberger et al. 2006).
These experiences led to Beven et al. (2011) and Beven & Smith (2015), suggesting that some catchment data might be disinformative in deciding whether a model is acceptable or not. They identified events that gave exceedingly high or exceedingly low runoff coefficients in a rapid response catchment in the north of England. Clearly if a model is constrained by mass balance, but the data for an event suggest a runoff coefficient greater than 1, then the model is going to produce residuals that reflect the deficiencies in the original data, not only any failure of the model (see also the examples in global data sets included in Kauffeldt et al. 2013). In this case, the problem is quite evident, and such data, if included in model evaluation, will lead to bias in inference about parameter values and in predicted outcomes, especially if simple evaluation measures based on the sum of squared errors are used. There will, however, be many other periods of data when the effects on model evaluation will be subtle and difficult to allow for.
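The event screening idea can be sketched as follows. This is only an illustration under stated assumptions: the event totals are invented, the thresholds `low` and `high` are hypothetical choices rather than those of the cited studies, and a real analysis would also have to deal with event separation, antecedent conditions, and recession extrapolation.

```python
def runoff_coefficient(rain_mm, runoff_mm):
    """Event runoff coefficient: event runoff depth over event rainfall depth."""
    if rain_mm <= 0:
        raise ValueError("event rainfall must be positive")
    return runoff_mm / rain_mm

def screen_events(events, low=0.05, high=1.0):
    """Flag events whose runoff coefficient looks physically suspect.

    events: iterable of (name, rain_mm, runoff_mm) tuples.
    low, high: hypothetical screening thresholds (assumptions).
    A flag marks an event for inspection; it is not an automatic rejection.
    """
    flags = {}
    for name, rain_mm, runoff_mm in events:
        rc = runoff_coefficient(rain_mm, runoff_mm)
        if rc > high:
            flags[name] = "suspect-high"   # more runoff than rainfall
        elif rc < low:
            flags[name] = "suspect-low"
        else:
            flags[name] = "ok"
    return flags

# Invented event totals (mm), for illustration only
example_events = [
    ("E1", 30.0, 12.0),
    ("E2", 20.0, 26.0),
    ("E3", 40.0, 1.0),
]
event_flags = screen_events(example_events)
for name, flag in event_flags.items():
    print(name, flag)
```

A very low coefficient may be perfectly plausible after a dry period, which is why the low threshold in particular should be read as a prompt to look at the data rather than as a rule.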
The conclusion of this is that we need to be much more careful about considering the value of the available data in model evaluation, and that we need better observational techniques, not only for the inputs and outputs in the water balance equation but also for internal state variables. In the latter case, there is still a great deal of epistemic uncertainty about subsurface flow pathways on hillslopes (and in valley bottoms). Where internal state data are used, there can also be incommensurability between observed variables and simulated variables (e.g. soil moisture at a point relative to the soil moisture output at the discrete element scale of a distributed model). There have been some advances, such as the COSMOS measurement of soil moisture over an area, but that has both variable effective depths and areal extent depending on the levels of near-surface moisture (Zreda et al. 2012; Evans et al. 2016; Baroni et al. 2018).
We also know enough from tracing experiments and the nature of the physics to conclude that the Richards equation should not be used in modelling flow through soils (in fact, we argued this in Beven 1989; Binley et al. 1989a, 1989b, nearly 30 years ago). It is based on the wrong experiment that excluded the possibility of preferential flows in focusing on capillary equilibrium conditions. This might be more applicable under relatively dry conditions, but even then, the physics itself suggests that the usual form should not be used if there is any heterogeneity of soil properties within the scale of the application, which is, of course, always the case (see also Beven & Germann 1982, 2013; Beven 2012a, 2018b). However, we have no good (non-destructive) measurement techniques at scales of interest with which to study vertical and downslope preferential flows and recharge. Those detailed observations that have been done have suggested that the flows can be highly localised, highly variable, and subject to complex connectivity issues in space and time (e.g. Freer et al. 1997; Jencso et al. 2009; McGuire 2010; Klaus & Jackson 2018).
In fact, we are not interested in such detail (except in terms of scientific understanding), and it might be better to develop new measurement techniques at larger scales that would integrate over the detail. If, for example, we had an effective and affordable gravity anomaly technique for total water storage over an area; coupled with a method for measuring stream discharges that were sufficiently accurate to determine incremental discharges downstream in a river network, then we might be able to infer much more useful process relationships than those we have currently. However, as a community, we have not been at all pro-active about deciding on priorities for measurement requirements and commissioning new techniques.
The satellite community have done so much more effectively (including the SWOT launch planned for 2021 which will be of some hydrological interest), but, from a hydrological point of view, satellite imaging has always had potential but not actually been that useful, apart from generating digital terrain data, particularly LIDAR (Light Detection and Ranging) that has led to significant improvements in, for example, flood inundation mapping. Even then, however, there are both aleatory and epistemic uncertainties associated with the treatment of the digital numbers (how to deal with vegetation and buildings; small-scale features such as walls and hedges on flood plains; later infilling of sinks or burn-in of channels in the terrain to get consistent flow lines; determination of catchment boundaries etc.) that will have an effect on any model outputs when compared with observations. Most other remote sensing is also associated with epistemic uncertainties, including rainfall and soil moisture estimation, with the result that it provides only some qualitative and uncertain indication of patterns in the landscape relevant to hydrology.
Improving the quantity and quality of hydrological data is essential to what follows, in particular in deciding on whether particular models might be fit-for-purpose. Note that this paper is about how to make improvements in hydrological simulation models. It is not about models used for forecasting, i.e. modelling using data assimilation for getting the best real time n step ahead predictions with minimal uncertainty (see Beven & Young 2013 for a discussion of different types of model prediction). Forecasting does not necessarily require process representations nor physical constraints such as mass balance that may not be a feature of the available data. This is particularly true in forecasting flood events when there may be poor sampling of the most intense rainfalls, and the discharge rating curve may be subject to epistemic uncertainties. Data assimilation is then a valuable tool in improving forecasts. Far better to forecast levels and use data assimilation to compensate for the limitations in the input data (see, for example, Romanowicz et al. 2006; Leedal et al. 2010). (Note that while I consider data assimilation to be essential in forecasting, I do not consider it to be good practice to use data assimilation to compensate for model deficiencies in simulation modelling, especially if there is no attempt to learn from the data assimilation about how a model might be in error. There have been a number of such studies in the literature. Clearly, it is not possible to use data assimilation to compensate for model deficiencies in simulating the impacts of future changes. It is better then to attempt to produce a realistic estimate of the associated uncertainties, both aleatory and epistemic.)
Here I shall be interested in the representation and simulation of hydrological processes in the context of not only reproducing historical behaviours but also future behaviours under change. Even a cursory survey of the literature will reveal that this is a challenge and difficult to achieve. Hydrological systems are complex and nonlinear, and we have little in the way of techniques for studying patterns of processes at the catchment scale. We rely on the way in which catchments act as integrators over small-scale complexity and heterogeneity in resorting to the calibration of simple model representations against the very discharge data that we want to predict. That clearly helps in getting better reproduction of discharges without change but not necessarily for the right reasons. Getting good results for the wrong reasons could then be misleading when we want to simulate the impacts of change (rarely is any consideration given to change during a calibration period, but see Merz et al. 2011; Peel & Blöschl 2011; Harrigan et al. 2014). In the past, I have had some success in making predictions using only measured parameters (e.g. Beven et al. 1984), but also some notable failures (e.g. Beven 2001).
EVALUATING HYDROLOGICAL MODELS AS FIT-FOR-PURPOSE
We know very well that the process representations used in hydrological models are only approximations to the real-world complexity of surface and subsurface flows. It is also obvious that the epistemic issues with hydrological data mean that we would not expect even a perfect model to provide perfect predictions. We see this in the comparisons of observed and predicted variables in a multitude of academic papers and reports to clients. Sometimes, indeed, it seems that the predictions are rather poor, especially if models are applied without calibration as if to an ungauged catchment. Calibration is generally helpful in finding parameter sets that give predictions that are closer to the observations at least in the calibration period. When a split record evaluation is also done, it is common to find that the model performance is not so good in the validation period or under different seasonal or climate conditions (Refsgaard & Knudsen 1996; Freer et al. 2003; Choi & Beven 2007; Coron et al. 2014; Fowler et al. 2016, 2018; Dakhlaoui et al. 2017; Pool et al. 2017). This might be the result of over-fitting an overparameterised model; it might be because the model is producing good results in calibration for the wrong reasons; it might be only because the forcing data errors are quite different in the validation period. For more severe testing (see Klemes 1986; Refsgaard & Knudsen 1996; Ewen & Parkin 1996; Seibert 2003), it is often difficult to declare any form of success.
We can think about models as hypotheses about how a hydrological system functions (e.g. Beven 2012b, 2018a). Thus, testing whether a model should be considered as fit-for-purpose can be considered a form of hypothesis testing, with the possibility of rejecting models that do not fit the evaluation data to some defined level of acceptability. Model rejection in this sense is a good thing; it means that we need to make some improvements, either to the model structure or to the data that we are using with the model (Beven 2018a). Clearly, methods for hypothesis testing are well developed in statistics under assumptions that variables can be considered to have aleatory variability. However, when we know that we are dealing with epistemic uncertainties, it might be incoherent to use simple statistical assumptions (e.g. Beven et al. 2008). This is evident, for example, in the use of formal likelihood functions in model evaluation that, particularly for long time series, can give quite a misleading impression of the relative merits of different models and parameter sets (e.g. Beven & Smith 2015; Beven 2016a).
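One way to see why formal likelihoods can exaggerate differences between models for long time series is the following sketch. It assumes the simplest possible formal likelihood (independent Gaussian residuals with constant variance, with the variance set to its maximum-likelihood value), which is an assumption made here for illustration and not a likelihood taken from the cited papers. Under that assumption, the log-likelihood difference between two models grows linearly with the number of time steps, so a 5% difference in mean squared error becomes an enormous likelihood ratio.

```python
import math

def gaussian_max_log_likelihood(mse, n):
    """Log-likelihood of n independent Gaussian residuals, with the error
    variance set to its maximum-likelihood value (the mean squared error)."""
    return -0.5 * n * (math.log(2.0 * math.pi * mse) + 1.0)

def log_likelihood_ratio(mse_a, mse_b, n):
    """Log of L(model A) / L(model B) for the same n observations."""
    return (gaussian_max_log_likelihood(mse_a, n)
            - gaussian_max_log_likelihood(mse_b, n))

# Two hypothetical models whose mean squared errors differ by only 5%
for n_steps in (10, 100, 1000, 10000):
    log_ratio = log_likelihood_ratio(1.0, 1.05, n_steps)
    print(f"n = {n_steps:6d}: log likelihood ratio = {log_ratio:8.2f}")
```

The ratio itself is the exponential of the printed value, so for a thousand time steps the marginally better model is already favoured by many orders of magnitude, even though the two sets of residuals would be hard to tell apart by eye.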
In assessing fitness-for-purpose, of course, we do need to consider what is the purpose. We can differentiate between two major types of purpose (though each could have a variety of subdivisions). The first is in the use of models to test the science, i.e. the understanding of how a hydrological system might function. This might involve the more detailed consideration of the internal states and other detail in experimental plots and catchments, and how they differ from responses reported from elsewhere. The second is in the use of models for decision-making. The important factor then is that the model should make predictions of the future behaviour of a hydrological system that will not deviate too far from what would happen under the assumed boundary conditions. This might allow a greater degree of approximation to be considered to be acceptable, especially if decisions are being taken at larger scales (such as in the methods used for the UK National Flood Risk Assessment that is currently under revision). A particular feature of this second purpose is that the results cannot really be tested, even if a model has survived a validation test, since the future boundary conditions are necessarily unknown or epistemically uncertain (see, for example, the post-audit analyses of groundwater models in Konikow & Bredehoeft (1992), where some models failed only because of poor assumptions about the future boundary conditions). We might hope, of course, that as the science evolves, the purpose of improving understanding will feed into the purpose of decision-making, with a better theoretical basis for moving from local scales to national scales and for assessing changes in parameter values but we are not there yet (see below).
The question remains of how should we test models as hypotheses in the face of epistemic uncertainties? Beven & Lane (2019) suggest that one way of looking at this problem is in the form of testing for model invalidation (see also Beven 2018a). There is, of course, a long history of applying such tests, at least implicitly in the form of not invalidating a model based on its simulated outputs. Every time a referee accepts a paper with model results for publication, s/he is essentially applying such a test. Every time a report is presented to a client, then the authors of that report have applied such a test. Every time a report is accepted by the client (perhaps after an independent assessment by another consultant), then such a test has been applied. Most of these judgments are qualitative and subjective, albeit that they may be supported by some quantitative measures (such as quoting the Nash–Sutcliffe efficiency despite all of its faults as a measure of calibration or validation performance).
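For reference, the Nash–Sutcliffe efficiency mentioned above can be computed as in the sketch below, using the usual definition of one minus the ratio of the sum of squared errors to the sum of squared deviations of the observations from their mean. The function name and the example values are illustrative only.

```python
def nash_sutcliffe_efficiency(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    if len(observed) != len(simulated):
        raise ValueError("observed and simulated series must have equal length")
    obs_mean = sum(observed) / len(observed)
    error_ss = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance_ss = sum((o - obs_mean) ** 2 for o in observed)
    if variance_ss == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    return 1.0 - error_ss / variance_ss

# Invented discharge values, for illustration only
observed_flow = [1.0, 2.0, 3.0, 4.0]
print(nash_sutcliffe_efficiency(observed_flow, observed_flow))          # perfect fit
print(nash_sutcliffe_efficiency(observed_flow, [2.5, 2.5, 2.5, 2.5]))   # mean of obs
```

A simulation equal to the observed mean scores zero, and because the measure is built on squared errors it is dominated by the largest residuals, which is one reason why it is sensitive to disinformative events.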
It is therefore interesting to speculate about what information such a group of experts would require in order to make such an invalidation more rigorous, both in the use of models for predicting an ungauged catchment and in the case where some output observations are available to evaluate model runs. One interesting feature of this strategy is that there is a possibility for the users of the model outputs, such as decision or policymakers or stakeholders affected by a decision, to be involved in such a process in considering not only the acceptability of the model outputs but also the assumptions that contribute to the outputs (see Beven 2018a, and the condition tree approach of Beven et al. 2014).
There is actually a precedent for this type of approach in the ‘blind validation’ approach of Ewen & Parkin (1996). This requires the modeller (in their case) to define some criteria for acceptability prior to making any model runs. Model parameters were estimated from past experience, and no prior model calibration was allowed. The range of simulated outcomes was then compared with available observations of flows and internal state data (assumed at that time to be known accurately). Blind validation was applied to the SHE model by Parkin et al. (1996) and Bathurst et al. (2004). In both cases, the model simulations failed to meet all the defined validation criteria. In the application of Parkin et al. (1996), the model failed one out of four tests; in the case of Bathurst et al. (2004), two out of 10 tests were failed. This was despite the criteria for success being rather relaxed and some model simulations being excluded on the basis of expert evaluations. These failures do not seem to have had much effect on the use of the SHE model elsewhere. In fact, the failures are not mentioned at all in the SHE review paper of Refsgaard et al. (2010), which includes just a brief passing mention of the development of model testing methods based on the Klemes (1986) concepts. There have been no other applications of this blind validation methodology, to my knowledge, though it has much in common with the setting of limits of acceptability within the GLUE methodology (see Beven 2006a, 2009, 2016a) that has led to some other model invalidations (e.g. Page et al. 2007; Dean et al. 2009; Liu et al. 2009; Hollaway et al. 2018a).
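The logic of setting acceptability criteria before making any model runs can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure of the cited studies: the limits here are a fixed relative band around each observation (an assumption, standing in for a proper assessment of observational uncertainty), a run is rejected if any simulated value falls outside its limits, and the run names and values are invented.

```python
def limits_from_relative_error(observations, rel_error):
    """Lower and upper limits as a fixed relative band around each observation.

    Assumes non-negative observations; rel_error is a hypothetical stand-in
    for an assessment of observational uncertainty made before any model run.
    """
    return [(obs * (1.0 - rel_error), obs * (1.0 + rel_error))
            for obs in observations]

def within_limits(simulated, limits):
    """True only if every simulated value lies inside its limits."""
    return all(lower <= sim <= upper
               for sim, (lower, upper) in zip(simulated, limits))

def screen_runs(runs, limits):
    """Map each run name to True (not rejected) or False (rejected)."""
    return {name: within_limits(sim, limits) for name, sim in runs.items()}

# Limits are fixed first, from the observations alone
observed = [10.0, 20.0, 15.0]
acceptability_limits = limits_from_relative_error(observed, rel_error=0.2)

# Only then are the (invented) model runs compared with the limits
candidate_runs = {
    "run_A": [9.0, 22.0, 13.0],
    "run_B": [9.0, 25.0, 13.0],
}
run_status = screen_runs(candidate_runs, acceptability_limits)
print(run_status)
if not any(run_status.values()):
    print("All runs rejected: improve the model or the data")
```

Requiring every time step to lie within the limits is the strictest possible choice; any relaxation of it would also need to be decided, and justified, before the runs are made.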
One of the issues in this type of evaluation is, again, the data being used to both drive and test a model as hypothesis. Since we do not expect a model to be better than the data it is used with, any invalidation test should first make some assessment and allowance for the uncertainties, both epistemic and aleatory, associated with those data, although in some (wet) cases any model that gets the water balance separation approximately correct might provide quite good measures of performance (e.g. Seibert et al. 2018). How uncertain do we expect the inputs used to force the model to be? If we have observations of the system response, how uncertain are those observations relative to the variables predicted by the model? We do not expect this assessment of uncertainty to be a simple statistical variability (though lacking better knowledge we might choose to treat it as such). We are not used to framing model testing in this way (and indeed perhaps, we have avoided it because these are very difficult questions to resolve when we expect the nature of errors in the inputs to vary from event to event, and parameter interactions to be complex). Data uncertainty also raises the issue of how to avoid Type I hypothesis testing errors (accepting a model hypothesis that is not fit-for-purpose because of the data uncertainties) and Type II errors (rejecting a model hypothesis that would be fit-for-purpose because of the data uncertainties). The former is more problematic but should hopefully be reduced as new data, or different types of data are added to the assessment. Such difficulties should not, however, stop us from thinking more deeply about how to make an invalidation test more rigorous.
A further feature that might be considered is whether a model contradicts some secure evidence on the nature of the system response. If that is the case, it should not be considered as fit-for-purpose. We want to base decisions on predictions from a model that, as far as possible, is producing the right results for the right reasons. A nice example of this appears in the very first Topmodel paper (Beven & Kirkby 1979) where it was shown that optimising the model parameters resulted in using the model structure in a way that contradicts the theory on which it was based by using the subsurface store with a very low time constant to control the timing of fast runoff. There are also examples from other domains such as climate models (e.g. Liepert & Lo 2013). Thus, how to show that a model is giving the right results for the right reasons should be a subject for some deeper thought (see, for example, Kirchner 2006).
An interesting possibility that arises from applying more rigorous testing to model applications in hydrology is that all the models tried might be rejected as not fit-for-purpose. This invalidation might be for different parameter sets in a single model structure; it might extend to multiple model structures. There are published examples of where all the models tried have been rejected (see, most recently, the case of the SWAT model in Hollaway et al. (2018a), in an application to a small UK catchment). As noted earlier, such model rejection is really a good outcome, in that it requires either that we do better modelling or find better data, or that we find some other way of making decisions within an adaptive management framework. We should note, however, that even where more rigorous invalidation testing is carried out, the results will always be conditional on the information that is to hand now. The future remains epistemically uncertain, and the possibility of future surprise remains. That should not, however, be a reason for relaxing the testing. It should still be considered as a poor practice to relax rejection criteria just because a decision needs to be made. That may not result in a good decision if the model is not fit-for-purpose or if the decision is sensitive to the uncertainty in model predictions.
IMPROVING PROCESS REPRESENTATIONS IN HYDROLOGICAL MODELS
The concept of being able to reject models as hypotheses has an important implication: that we might be able to learn from the nature of the rejection to refine the representation of hydrological processes and systems where this is shown to be necessary. In this context, model rejection is a good outcome. It is the point at which creativity of analysis and thought is required if we are to do better in the future.
It is already possible, however, to make some suggestions as to what such innovations might look like, particularly if we want process representations that will satisfy the need to predict both flow and transport within a consistent framework. This assumes greater importance when we start to accept the limitations of gradient-based continuum approaches such as the Buckingham–Richards equation (which I have argued since Beven 1989 need to be reconsidered). Such a framework is required to consider both velocities (in predicting conservative transport) and celerities (in predicting flows). Since celerities generally differ from, and are faster than, velocities, it follows that any process representation should be length scale-dependent, i.e. different scales of spatial discretisation might require different parameter values. The difference between velocities and celerities will also be state-dependent, suggesting that, at any scale, the hysteresis in the storage–flux response will change with system state. This has been shown numerically using the Multiple Interacting Pathways (MIPs) model by Davies & Beven (2015).
The MIPs model allows velocity distributions to be specified as part of a random particle representation of all the water in the flow domain. Celerities follow from the filling and emptying of storage in the system. It is a computationally expensive modelling strategy and therefore has, to date, been restricted to small-scale applications. While there is still much to explore in the interaction between the scale of discretisation, time step, velocity distributions, and transition probabilities, it does have the type of consistent framework that might be valuable in future. Zehe & Jackisch (2016) and Jackisch & Zehe (2018) have taken a somewhat similar approach including more explicit consideration of the effects of capillarity. Such approaches might be one way of approaching a theory of scale-dependent process representations for both flow and transport.
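The distinction between velocities and celerities, and its dependence on state, can be illustrated with a simple sketch. This is not the MIPs formulation; it assumes a hypothetical power-law relation between flux and storage, q = a S^b, for which the mean velocity is q/S and the kinematic celerity is dq/dS. The coefficient a and the exponent b = 5/3 (the value that would follow from the Manning equation for a wide channel) are illustrative assumptions only.

```python
def velocity(storage, a=1.0, b=5.0 / 3.0):
    """Mean water velocity q/S for the assumed relation q = a * S**b."""
    return a * storage ** (b - 1.0)


def celerity(storage, a=1.0, b=5.0 / 3.0):
    """Kinematic celerity dq/dS for the assumed relation q = a * S**b."""
    return a * b * storage ** (b - 1.0)


# Compare the two at a low and a high storage state (arbitrary units).
for s in (1.0, 4.0):
    v, c = velocity(s), celerity(s)
    print(f"S={s:.1f}  velocity={v:.3f}  celerity={c:.3f}  difference={c - v:.3f}")
```

Under this assumption the celerity is always b times the velocity, so perturbations in flow travel faster than the water itself, and the absolute difference between the two grows as storage increases. Real storage–flux relations will, of course, be more complex and hysteretic than a single-valued power law.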
I have argued before (e.g. Beven 2006c, 2012a) that there is already a useful framework within which new process representations might be embedded. This is the Representative Elementary Watershed (REW) framework (see, for example, Reggiani et al. 2000; Reggiani & Schellekens 2003). This sets out a framework of mass, energy, and momentum equations that is common for any spatial discretisation. However, those balance equations need closure, i.e. a way of defining the flux terms of mass, energy, and momentum at the boundaries of each discrete element, together with how those fluxes depend on the internal states of the system. I believe that this will lead to closure schemes based on hysteretic relationships between element storages and boundary fluxes. A move in this direction would, of course, be greatly enhanced by the availability of the relevant storages or fluxes at the element scale, and it may be (again) that real progress will await the availability of new measurement techniques. What we should not do, however, is to continue to ignore the implications of the difference between velocities and celerities and the scale-dependent and hysteretic nature of hydrological responses at the element scale.
It is perhaps worth pointing out that the asymmetry of the unit hydrograph or linear transfer functions derived at catchment scales is a representation of hysteresis in the storage–flow relationship. But as a linear model, it relies on a way of processing the inputs to represent the effects of nonlinearity and antecedent conditions in predicting the catchment response over a wider range of conditions. I could speculate that if input, storage and output data were available for discrete elements of the landscape (or arbitrary REWs), then a transfer function modelling framework such as the DBM approach developed by Young (1998) and Young & Beven (1994) would be a suitable way of deriving closure schemes at the required scale. The parameters of such a model would then be quite different from those we use today: the time constants for the linear transfer function and some coefficients for the nonlinear processing of the input sequence. Given additional tracer data, it might also be possible to derive a consistent set of concepts relating parameters for both flow and transport within such a framework (e.g. Harman 2019). The emphasis is, again, on making the right type of data available, initially at research locations, so that we can learn about how to produce closure schemes that might be applicable more widely.
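The structure described here, a nonlinear processing of the inputs followed by a linear transfer function, can be sketched as follows. This is only an illustration in the spirit of such approaches, not the DBM method itself: the power-law weighting by a wetness index, the exponent gamma, and the first-order transfer function coefficients a and b are all hypothetical choices. The one firm relationship used is that a first-order discrete-time recursion with coefficient a has a time constant of -dt / ln(a).

```python
import math


def effective_rainfall(rain, wetness, gamma=0.5):
    """Nonlinear input processing (assumed form): scale rain by wetness**gamma."""
    return [r * (w ** gamma) for r, w in zip(rain, wetness)]


def first_order_tf(u, a, b, q0=0.0):
    """Linear transfer function q[t] = a * q[t-1] + b * u[t]."""
    q, out = q0, []
    for ut in u:
        q = a * q + b * ut
        out.append(q)
    return out


def time_constant(a, dt=1.0):
    """Time constant of the first-order recursion, in the units of dt."""
    return -dt / math.log(a)


# The same rainfall pulse under dry and wet antecedent conditions.
rain = [10.0, 0.0, 0.0, 0.0]
dry = first_order_tf(effective_rainfall(rain, [0.25] * 4), a=0.8, b=0.2)
wet = first_order_tf(effective_rainfall(rain, [1.0] * 4), a=0.8, b=0.2)
```

The impulse response rises in a single step and then recedes exponentially, which is the asymmetry referred to above, while the wetness weighting makes the same rainfall produce a larger response under wetter antecedent conditions. The parameters to be identified at the element scale would then be the time constant and the coefficient of the input nonlinearity.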
However, I could also speculate that rather than accepting a limitation to the linear transfer function or unit hydrograph, with its constant time distribution of contributions of effective rainfall to the hydrograph, perhaps there will be other ways of analysing such data that might more explicitly reflect antecedent states and input intensities at the required scale of discretisation. There are methods of developing hysteretic functions that have been applied to hydrological systems (e.g. O'Kane & Flynn 2007; Appelbe et al. 2009), but these also have some rather strong assumptions. Given recent developments in data-mining techniques, might this be a way of deriving the forms of functions that would be applicable more widely, that would suggest quite different process representations than those being used today?
HYDROLOGICAL MODELS OF EVERYWHERE
The other advance that is certainly going to have a major impact on modelling practice is the much more widespread availability of spatial predictions of hydrological models on the internet. I first suggested a Models of Everywhere concept more than a decade ago (Beven 2007; Beven & Alcock 2012), but it is only relatively recently that this has become computationally easier to implement and computer scientists have become more interested in the problem of producing facilitating software (e.g. Blair et al. 2019).
What is critical to this Models of Everywhere concept is that the predictions are at a sufficiently fine resolution that local stakeholders can relate to them directly. The concept is therefore quite different from providing the global ‘hyperresolution’ simulations presented, for example, by Wood et al. (2011). Hyperresolution in their sense is of the order of 1 km (see Bierkens et al. 2015) and while there may be some variables that local stakeholders can relate to at that scale, there will also be a great deal of hyperresolution ignorance about what parameters and variables at that scale might mean (see, for example, the discussion in Beven et al. 2015). There is a movement to finer-resolution, continental-scale simulations, such as the HydroBlocks of Chaney et al. (2016), which is based on the Dynamic Topmodel. At much finer scales, such as the 2 m scale used in producing the UK pluvial flooding maps, the ability of people with local knowledge to provide feedback on the model outputs is much more direct. In this case, modelling becomes much more of a learning process, driven by the feedback about where the model predictions are demonstrably wrong. It is a learning process about places that starts to reflect the uniqueness of places in terms of both learning about appropriate effective parameter values and learning about appropriate process representations (see Beven 2000). The possibility of local feedback on the acceptability of model simulations will change the nature of the modelling process in fundamental ways. While we might start with general model structures that are applied to places as in the past, what we need are methods of learning about places from the availability of local data, effective ways of obtaining new data for different purposes, and ways of making use of local (perhaps qualitative) knowledge and expertise (see, for example, the study of Landström et al. 2011).
There is an interesting issue in the question of data assimilation in applications of Models of Everywhere. Clearly, we would wish to use all the useful information available to test models locally and to ensure that we get the right results for the right reasons. This might include whatever quantitative data might be available but might also be a matter of learning how to use ‘soft’ data in model evaluations (see, for example, Seibert & McDonnell 2002; Fenicia et al. 2008; Winsemius et al. 2009). We would also like to re-evaluate models as more data are made available. But, as noted earlier, there have been cases where data assimilation is used simply to compensate for model deficiencies by updating model states so as to get a better predicted outcome. If the purpose of modelling is real-time forecasting into the near future, then that might be acceptable or even advisable. Where the purpose is for simulation and assessing the impacts of future change, then we should be very wary of compensating for important model deficiencies. For Models of Everywhere, we might want to do both forecasting and simulation, in which case it will be important to learn from the process of data assimilation for forecasting in improving the model formulation for simulation. There have been few studies (to my knowledge) that have done so (but see the learning from nonstationarity of Westra et al. 2014, as an example of the type of analysis that might lead to model modifications). More generally, forecasters have been satisfied with using data assimilation to get better forecasts, and simulation modellers have been satisfied with using calibration to either find an optimal model or constrain the associated uncertainty. Perhaps, we can do better or at least be a little more thoughtful in applying models. The feedback from users once Models of Everywhere visualisations are more widely available may force us to do so.
I have written about these issues in many past papers (including Beven 2016b), but this has been a useful opportunity to bring together the strands of thought about the future of hydrological modelling in one place. I do think that hydrology remains a field of inexact science that is still greatly constrained by observational limitations, and it would be really good to see the community make a real effort to decide on what its priorities should be and then move to commission what is needed (as has happened, for example, with the SWOT satellite). The process might be long, but the benefits to the science would be great, including for testing models as hypotheses, developing new process representations, and constraining predictive uncertainties.
The role of Models of Everywhere in improving modelling capability will also make for an interesting future. What new techniques for learning about places and for learning from clear errors in representing the response of places will need to be developed? And how can new types of knowledge be used to constrain uncertainties? What should the learning framework for both quantitative and qualitative information look like, including the issue of distinguishing information from disinformation? These are issues that are relevant to a wider range of research areas than hydrology, which is just one of many inexact environmental sciences (Beven 2002, 2019).
There is a particularly interesting aspect of uncertainty for the modeller in this context. A realistic assessment of uncertainty in predicting how places respond will mean that the modeller is much less likely to be obviously wrong in those predictions. This is clearly a good thing (at least from a modeller's point of view), but should not preclude an effort being made to carry out model testing and find ways of reducing that predictive uncertainty.
As I said in the talk on which this paper is based, I am ending my career with much more uncertainty than when I started as a young PhD student in 1971. But that is a good thing – it means that there is still so much good research to do in the closely linked areas of novel observational methods, closure schemes and model testing, theoretical development, and learning about places. In particular, learning about the assessment of epistemic uncertainties will also lead to the development of methods for reducing those uncertainties. The near future could be an exciting time for hydrological research and practice.
Over a long career, I have been fortunate to work and collaborate with many excellent hydrological modellers and experimentalists, really too many to list here. I will just mention the contribution of Dr Peter Metcalfe, who had only recently started work on the Q-NFM project led by Dr Nick Chappell at Lancaster University before his sudden death in a climbing accident. Working with Peter on the problem of modelling distributed natural flood management measures in this project (NERC grant no. NE/R004722/1) was instrumental in my thinking again about how to improve hydrological models. I am also grateful to Jan Seibert and two anonymous referees who made some useful suggestions for relevant papers and improvements to the paper and presentation.
President's Invited Address, BHS Symposium, University of Westminster, September 2018.