How to make advances in hydrological modelling

After some background about what I have learned from a career in hydrological modelling, I present some opinions about how we might make progress in improving hydrological models in future, including how to decide whether a model is fit for purpose; how to improve process representations in hydrological models; and how to take advantage of Models of Everywhere. Underlying all those issues, however, is the fundamental problem of improving the hydrological data available for both forcing and evaluating hydrological models. It would be a major advance if the hydrological community could come together to prioritise and commission the new observational methods that are required to make real progress.

physically based modelling, I was more than happy to take another approach, but still one that allowed the results of the modelling to be mapped back into space (I was again in a Geography Department). The only problem with that was that both the topographic analysis that went into Topmodel and the analysis of the spatial nature of the results had to be done manually. There were no Digital Terrain Models, and computer outputs were still on lineprinter paper. We were also running a nested catchment experiment, with both rainfall and stream-level data recorded on paper charts, so a great deal of time was spent just getting the data into a computer-compatible form (e.g. Beven & Callen ). One of the nice outcomes of that project was that we demonstrated a model structure that could be applied successfully based on field-measured parameters (Beven et al. ). I also learned that parameter optimisation would not necessarily use the model concepts in the correct way (see, for example, Figure 14 in Beven & Kirkby ). Before I joined IH, I had participated in the first SHE meeting at Wallingford, and the minutes record that, as a result of my PhD modelling experience, I raised many of the problems that would be met in the SHE project, particularly in the decoupling of the saturated and unsaturated zone solutions. This was a pragmatic decision to reduce dimensionality, based on the available computer resource, but was the main reason why it was 1986 before the first SHE applications appeared (Abbott et al. a, b; Bathurst ).

There is some movement towards the use of extrapolations based on hydraulic modelling of a gauging site, particularly for overbank flows. The review of the Sheepmount rating curve at Carlisle after the 2005 flood is a good example.
Consulting engineers were commissioned by the Environment Agency to revisit the rating curve at this site using hydraulic modelling, since the recorded water level was over a metre higher than the highest measured discharge. This led to a significant increase in the estimated discharge relative to that produced by extrapolation of the rating curve fitted to the discharge measurements. The revised rating was then used to estimate the even higher flood peak from Storm Desmond in 2015. However, such estimates are very dependent on the estimation of effective roughness coefficients for the out-of-bank conditions, which is necessarily uncertain. Extrapolated discharge estimates are still often cited without any associated uncertainty range, even though there is evidence that the effective roughness might be model-structure dependent.

In fact, we are not interested in such detail (except in terms of scientific understanding), and it might be better to develop new measurement techniques at larger scales that would integrate over the detail. If, for example, we had an effective and affordable gravity anomaly technique for total water storage over an area, coupled with a method for measuring stream discharges accurately enough to determine incremental discharges downstream in a river network, then we might be able to infer much more useful process relationships than those we have currently. However, as a community, we have not been at all pro-active about deciding on priorities for measurement requirements and commissioning new techniques.
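The sensitivity of out-of-bank discharge estimates to the effective roughness can be illustrated with a minimal Monte Carlo sketch based on Manning's equation. The section geometry and roughness range below are hypothetical round numbers chosen for illustration, not values from the Sheepmount site:

```python
import random

random.seed(42)

def manning_discharge(n, area, hyd_radius, slope):
    """Manning's equation: Q = (1/n) * A * R^(2/3) * S^(1/2) (SI units)."""
    return (1.0 / n) * area * hyd_radius ** (2.0 / 3.0) * slope ** 0.5

# Hypothetical out-of-bank cross-section (illustrative values only).
area = 450.0        # flow cross-section area, m^2
hyd_radius = 3.2    # hydraulic radius, m
slope = 0.0008      # water-surface slope

# Effective roughness for the floodplain is poorly known; sample a range.
samples = [manning_discharge(random.uniform(0.04, 0.10),
                             area, hyd_radius, slope)
           for _ in range(10_000)]
q_low, q_high = min(samples), max(samples)
print(f"Discharge range from roughness uncertainty alone: "
      f"{q_low:.0f} to {q_high:.0f} m^3/s")
```

Even this crude calculation spans more than a factor of two in discharge from the roughness assumption alone, which is why quoting an extrapolated estimate without an uncertainty range can be so misleading.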
The satellite community has been much more effective in doing so (including the SWOT launch planned for 2021, which will be of some hydrological interest), but, from a hydrological point of view, satellite imaging has always had potential without actually being that useful, apart from generating digital terrain data, particularly from LIDAR (Light Detection and Ranging), which has led to significant improvements in, for example, flood inundation mapping. Even then, however, there are both aleatory and epistemic uncertainties associated with the treatment of the digital numbers (how to deal with vegetation and buildings; small-scale features such as walls and hedges on flood plains; later infilling of sinks or burn-in of channels in the terrain to get consistent flow lines; determination of catchment boundaries, etc.) that will have an effect on any model outputs when compared with observations. Most other remote sensing is also associated with epistemic uncertainties, including rainfall and soil moisture estimation, with the result that it provides only some qualitative and uncertain indication of patterns in the landscape relevant to hydrology.
Improving the quantity and quality of hydrological data is essential to what follows, in particular in deciding whether particular models might be fit-for-purpose. Note that this paper is about how to make improvements in hydrological simulation models. It is not about models used for forecasting, i.e. modelling using data assimilation to get the best real-time n-step-ahead predictions with minimal uncertainty (see Beven & Young  for a discussion of different types of model prediction). Forecasting does not necessarily require process representations or physical constraints such as mass balance that may not be a feature of the available data. This is particularly true in forecasting flood events, when there may be poor sampling of the most intense rainfalls, and the discharge rating curve may be subject to epistemic uncertainties. Data assimilation is then a valuable tool in improving forecasts. (Note that while I consider data assimilation to be essential in forecasting, I do not consider it good practice to use data assimilation to compensate for model deficiencies in simulation modelling, especially if there is no attempt to learn from the data assimilation about how a model might be in error. There have been a number of such studies in the literature. Clearly, it is not possible to use data assimilation to compensate for model deficiencies in simulating the impacts of future changes. It is better, then, to attempt to produce a realistic estimate of the associated uncertainties, both aleatory and epistemic.) Here I shall be interested in the representation and simulation of hydrological processes in the context of reproducing not only historical behaviours but also future behaviours under change. Even a cursory survey of the literature will reveal that this is a challenge that is difficult to meet.
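The difference between forecasting with data assimilation and simulation can be made concrete with a deliberately simple sketch: the raw forecasts of a simulation model are corrected using an assumed AR(1) model of the most recent simulation error. All numbers and the error model itself are synthetic, for illustration only:

```python
def corrected_forecast(model_forecasts, last_obs, last_sim, rho=0.8):
    """n-step-ahead forecasts with a simple AR(1) error correction.

    model_forecasts: raw simulation forecasts for steps 1..n ahead
    last_obs, last_sim: latest observed and simulated discharge
    rho: assumed persistence of the simulation error (hypothetical)
    """
    error = last_obs - last_sim
    # The latest error is propagated forward, decaying by rho each step.
    return [q + error * rho ** (k + 1)
            for k, q in enumerate(model_forecasts)]

raw = [10.0, 12.0, 15.0]   # raw model forecasts, m^3/s
adj = corrected_forecast(raw, last_obs=11.0, last_sim=9.0)
print([round(q, 3) for q in adj])  # error of +2 decays into the forecast
```

The correction improves the short-lead forecasts without ever touching the process representations, which is exactly why good forecast skill under assimilation says little about whether the underlying simulation model is right, and nothing at all about its behaviour under future change.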
Hydrological systems are complex and nonlinear, and we have little in the way of techniques for studying patterns of processes at the catchment scale. We rely on the way in which catchments act as integrators over small-scale complexity and heterogeneity in resorting to the calibration of simple model representations against the very discharge data that we want to predict. That clearly helps in getting better reproduction of discharges without change, but not necessarily for the right reasons. Getting good results for the wrong reasons could then be misleading when we want to simulate the impacts of change (rarely is any consideration given to change during a calibration exercise).

EVALUATING HYDROLOGICAL MODELS AS FIT-FOR-PURPOSE
We know very well that the process representations used in hydrological models are only approximations to the real-world complexity of surface and subsurface flows. It is also obvious that the epistemic issues with hydrological data mean that we would not expect even a perfect model to provide perfect predictions. We see this in the comparisons of observed and predicted variables in a multitude of academic papers and reports to clients. Sometimes, indeed, it seems that the predictions are rather poor, especially if models are applied without calibration, as if to an ungauged catchment. Calibration is generally helpful in finding parameter sets that give predictions that are closer to the observations, at least in the calibration period. When a split record evaluation is also done, it is common to find that performance is poorer in the evaluation period.

We can think about models as hypotheses about how a hydrological system functions (e.g. Beven b, a).
Thus, testing whether a model should be considered as fit-for-purpose can be considered a form of hypothesis testing, with the possibility of rejecting models that do not fit the evaluation data to some defined level of acceptability.
Model rejection in this sense is a good thing; it means that we need to make some improvements, either to the model structure or to the data that we are using with the model.

In assessing fitness-for-purpose, of course, we do need to consider what the purpose is. We can differentiate between two major types of purpose (though each could have a variety of subdivisions). The first is in the use of models to test the science, i.e. the understanding of how a hydrological system might function. This might involve the more detailed consideration of the internal states and other detail in experimental plots and catchments, and how they differ from responses reported from elsewhere.
The second is in the use of models for decision-making.
The important factor then is that the model should make predictions of the future behaviour of a hydrological system that will not deviate too far from what would happen under the assumed boundary conditions. This might allow a greater degree of approximation to be considered acceptable.

There is actually a precedent for this type of approach in the 'blind validation' approach of Ewen & Parkin ().
This requires the modeller (in their case) to define some criteria for acceptability prior to making any model runs.
Model parameters were estimated from past experience, and no prior model calibration was allowed.

One of the issues in this type of evaluation is, again, the data being used to both drive and test a model as hypothesis.
Since we do not expect a model to be better than the data it is used with, any invalidation test should first make some assessment and allowance for the uncertainties, both epistemic and aleatory, associated with those data, although in some (wet) cases any model that gets the water balance separation approximately correct might provide quite good measures of performance (e.g. Seibert et al. ). How uncertain do we expect the inputs used to force the model to be? If we have observations of the system response, how uncertain are those observations relative to the variables predicted by the model? We do not expect this assessment of uncertainty to be a simple statistical variability (though lacking better knowledge we might choose to treat it as such). We are not used to framing model testing in this way (and indeed perhaps we have avoided it because these are very difficult questions to resolve when we expect the nature of errors in the inputs to vary from event to event, and parameter interactions to be complex). Data uncertainty also raises the issue of how to avoid Type I hypothesis testing errors (accepting a model hypothesis that is not fit-for-purpose because of the data uncertainties) and Type II errors (rejecting a model hypothesis that would be fit-for-purpose because of the data uncertainties). The former is more problematic but should hopefully be reduced as new data, or different types of data, are added to the assessment.
Such difficulties should not, however, stop us from thinking more deeply about how to make an invalidation test more rigorous.
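One minimal form such an invalidation test might take is a limits-of-acceptability check, in which a model run survives only if its predictions lie inside bounds that reflect the assessed data uncertainty at every evaluation point. The observations, limits, and simulated runs below are entirely synthetic, and the flat +/-20% bounds are a placeholder for what would, in practice, be event-dependent assessments:

```python
observed = [5.0, 22.0, 14.0, 8.0]   # observed discharges, m^3/s (synthetic)

# Bounds widened around each observation to allow for input and rating
# uncertainties; a crude uniform +/-20% here, purely for illustration.
limits = [(o * 0.8, o * 1.2) for o in observed]

def fit_for_purpose(simulated):
    """Reject a model run if any prediction falls outside its limits."""
    return all(lo <= s <= hi
               for s, (lo, hi) in zip(simulated, limits))

run_a = [5.5, 20.0, 13.0, 8.5]   # inside all limits: not rejected
run_b = [5.5, 30.0, 13.0, 8.5]   # misses the flood peak: rejected
print(fit_for_purpose(run_a), fit_for_purpose(run_b))  # True False
```

The interesting decisions are all hidden in how the limits are set: too tight, and data uncertainty produces Type II rejections of useful models; too slack, and poor models pass as fit-for-purpose.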
A further feature that might be considered is whether a model contradicts some secure evidence on the nature of the system response. If that is the case, it should not be considered as fit-for-purpose. We want to base decisions on predictions from a model that, as far as possible, is producing the right results for the right reasons. A nice example of this appears in the very first Topmodel paper (Beven & Kirkby ) where it was shown that optimising the model parameters resulted in using the model structure in a way that contradicts the theory on which it was based by using the subsurface store with a very low time constant to control the timing of fast runoff. There are also examples from other domains such as climate models (e.g. Liepert & Lo ). Thus, how to show that a model is giving the right results for the right reasons should be a subject for some deeper thought (see, for example, Kirchner ).
An interesting possibility that arises from applying more rigorous testing to model applications in hydrology is that all the models tried might be rejected as not fit-for-purpose (as happened, for example, in an application to a small UK catchment). As noted earlier, such model rejection is really a good outcome, in that it requires either that we do better modelling or find better data, or that we find some other way of making decisions within an adaptive management framework. We should note, however, that even where more rigorous invalidation testing is carried out, the results will always be conditional on the information that is to hand now. The future remains epistemically uncertain, and the possibility of future surprise remains. That should not, however, be a reason for relaxing the testing. It should still be considered poor practice to relax rejection criteria just because a decision needs to be made. That may not result in a good decision if the model is not fit-for-purpose or if the decision is sensitive to the uncertainty in model predictions.

IMPROVING PROCESS REPRESENTATIONS IN HYDROLOGICAL MODELS
The concept of being able to reject models as hypotheses has an important implication; that we might be able to learn from the nature of the rejection to refine the representation of hydrological processes and systems where this is shown to be necessary. In this context, model rejection is a good outcome. It is the starting point for where the creativity of analysis and thought is required for doing better in the future.
It is already possible, however, to make some suggestions as to what such innovations might look like, particularly if we want process representations that will satisfy the need to predict both flow and transport within a consistent framework. This assumes greater importance when we start to accept the limitations of gradient-based continuum approaches such as the Buckingham-Richards equation (which I have argued need to be reconsidered since Beven ). Such a framework is required to consider both velocities (in predicting conservative transport) and celerities (in predicting flows). Since celerities are generally different from, and faster than, velocities, it follows that any process representation should be length scale-dependent, i.e. different scales of spatial discretisation might require different parameter values. The difference between velocities and celerities will also be state-dependent, suggesting that, at any scale, the hysteresis in the storage-flux response will change with system state. This has been shown numerically.
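The velocity-celerity distinction can be sketched for a simple power-law flux-storage relation, q = a * h**b, with purely illustrative coefficients: the flow velocity is q/h, while the celerity, the speed at which a disturbance in the hydrograph propagates, is dq/dh. For this relation the celerity is exactly b times the velocity, and both are state-dependent:

```python
# Velocity vs celerity for a power-law relation q = a * h**b
# (illustrative coefficients; b is a Manning-like exponent for a
# wide channel, where h plays the role of storage per unit area).
a, b = 2.0, 5.0 / 3.0

def velocity(h):
    return a * h ** b / h            # v = q/h = a * h**(b-1)

def celerity(h):
    return b * a * h ** (b - 1.0)    # c = dq/dh = b * a * h**(b-1)

for h in (0.2, 1.0, 3.0):
    v, c = velocity(h), celerity(h)
    # The ratio c/v = b at any state, but both speeds grow with h,
    # so hydrograph responses travel faster when the system is wetter.
    print(f"h={h}: v={v:.3f}, c={c:.3f}, c/v={c/v:.3f}")
```

With b > 1 the disturbance always outruns the water itself, which is the essential reason why flow (celerity-controlled) and transport (velocity-controlled) cannot share a single effective parameterisation across scales and states.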

HYDROLOGICAL MODELS OF EVERYWHERE
The other advance that is certainly going to have a major impact on modelling practice is the much more widespread availability of spatial predictions of hydrological models on the internet. I first suggested a Models of Everywhere concept more than a decade ago (Beven ; Beven & Alcock ), but it is only relatively recently that this has become computationally easier to implement and computer scientists have become more interested in the problem of producing facilitating software (e.g. Blair et al. ).
What is critical to this Models of Everywhere concept is that the predictions are at a sufficiently fine resolution that local stakeholders can relate to them directly. The concept is therefore quite different from the global 'hyper-resolution' simulations that have been presented elsewhere.

There is an interesting issue in the question of data assimilation in applications of Models of Everywhere.
Clearly, we would wish to use all the useful information available to test models locally and to ensure that we get the right results for the right reasons. This might include whatever quantitative data might be available, but might also be a matter of learning how to use 'soft' data in model evaluations (see, for example, Seibert ).

More generally, forecasters have been satisfied with using data assimilation to get better forecasts, and simulation modellers have been satisfied with using calibration to either find an optimal model or constrain the associated uncertainty.
Perhaps we can do better, or at least be a little more thoughtful in applying models. The feedback from users once Models of Everywhere visualisations are more widely available may force us to do so.

CONCLUSIONS
I have written about these issues in many past papers (including Beven b), but this has been a useful opportunity to bring the strands of thought about the future of hydrological modelling together in one place. I do think that hydrology remains a field of inexact science that is still greatly constrained by observational limitations, and it would be really good to see the community make a real effort to decide on what its priorities should be and then move to commission what is needed (as has happened, for example, with the SWOT satellite). The process might be long, but the benefits to the science would be great, including for testing models as hypotheses, developing new process representations, and constraining predictive uncertainties.
The role of Models of Everywhere in improving modelling capability will also make for an interesting future. What new techniques for learning about places, and for learning from clear errors in representing the response of places, will need to be developed? And how can new types of knowledge be used to constrain uncertainties? What should the learning framework for both quantitative and qualitative information look like, including the issue of distinguishing information from disinformation? These are issues that are relevant to a wider range of research areas than hydrology, which is just one of many inexact environmental sciences (Beven , ).
There is a particularly interesting aspect of uncertainty for the modeller in this context. A realistic assessment of uncertainty in predicting how places respond will mean that the modeller is much less likely to be obviously wrong in those predictions. This is clearly a good thing (at least from a modeller's point of view), but should not preclude an effort being made to carry out model testing and find ways of reducing that predictive uncertainty.
As I said in the talk on which this paper is based, I am ending my career with much more uncertainty than when I started as a young PhD student in 1971. But that is a good thing: it means that there is still so much good research to do in the closely linked areas of novel observational methods, closure schemes and model testing, theoretical development, and learning about places. In particular, learning about the assessment of epistemic uncertainties will also lead to the development of methods for reducing those uncertainties. The near future could be an exciting time for hydrological research and practice.

ACKNOWLEDGEMENTS
Over a long career, I have been fortunate to work and collaborate with many excellent hydrological modellers and experimentalists, too many to list here. I will just mention the contribution of Dr Peter Metcalfe, who had only recently started work on the Q-NFM project led by Dr Nick Chappell at Lancaster University before his sudden death in a climbing accident.
Working with Peter on the problem of modelling distributed natural flood management measures in this project (NERC grant no. NE/R004722/1) was instrumental in my thinking again about how to improve hydrological models. I am also grateful to Jan Seibert and two anonymous referees who made some useful suggestions for relevant papers and improvements to the paper and presentation.