Genetic programming for hydrological applications: to model or to forecast that is the question

Genetic programming (GP) is a widely used machine learning (ML) algorithm that has been applied in water resources science and engineering since its conception in the early 1990s. However, as with many other ML applications, the GP algorithm is often used as a data fitting tool rather than as a model building instrument. We find this a gross underutilization of GP's capabilities: the feature that makes GP distinctly different from the rest of the ML techniques is its capability to produce explicit mathematical relationships between input and output variables. Theory-guided data science (TGDS) recently emerged as a new paradigm in ML with the main goal of blending the existing body of knowledge with ML techniques to induce physically sound models, and it has become a popular data science paradigm in scientific disciplines including water resources. Following these ideas, in our prior work, we developed two hydrologically informed rainfall-runoff model induction toolkits based on GP, for lumped modelling and distributed modelling. In the current work, the two toolkits are applied using a different hydrological model building library: the model building blocks are derived from the Sugawara TANK model template, which represents the elements of hydrological knowledge. Results are compared against the traditional GP approach and suggest that GP as a rainfall-runoff model induction toolkit preserves the prediction power of the traditional GP short-term forecasting approach while helping to better understand catchment runoff dynamics through the readily interpretable induced models.


INTRODUCTION
Data science methods have shown success in many scientific fields including hydrology. However, it may be argued that, compared with the level of success in commercial fields, the potential in water resources has not been fully realized. There are two major reasons for this: the limited availability of labelled instances for model training, and the black-box nature of data-driven models, where a modeller has little or no knowledge about how the model makes its predictions (Karpatne et al. ). Although data-driven models are often more accurate than the more traditional physics-based, conceptual and empirical hydrological models in terms of predictive capabilities, they contribute little towards the advancement of scientific theories due to the lack of interpretability of the model configurations.
Recently, a novel modelling paradigm called theory-guided data science (TGDS) (Karpatne et al. ) has emerged with the goal of blending existing scientific knowledge with learning algorithms. Genetic programming (GP) (Koza ) is an ML algorithm that has been used for many applications in water resources science and engineering since its invention in the early 1990s. However, in most state-of-the-art GP applications, the algorithm is used as a data fitting tool instead of as a model building instrument. Such data fitting makes these GP applications very similar to other ML algorithms, such as artificial neural networks or support vector machines. In hydrological rainfall-runoff modelling, the most frequent use of ML, including GP, is as a short-term forecasting tool. We find this to be an underutilization of GP's capabilities. The feature that makes GP so different from the rest of the ML techniques lies in its capability to produce explicit mathematical relationships between input and output variables.
A recent paper (Addor & Melsen ), based on more than 1,500 peer-reviewed research articles, concluded that the model selection in hydrological modelling is more often driven by legacy rather than adequacy. Furthermore, the so-called uniqueness of the place has been identified as one of the key aspects of hydrological modelling (Beven ). Hence, an automatic model induction and model selection framework may serve as an alternative to the more traditional subjective model selection and should induce a model architecture that might be more adequate for the intended application.

HYDROLOGICAL MODELLING
Hydrological models play an important role in understanding catchment dynamics. In this section, different hydrological model-building strategies, some of which have been used in the present study, are briefly discussed.
Theory-based models vs. data science models

Here, we refer to both conceptual and physics-based hydrological models as theory-based models. In conceptual modelling, hydrological processes are represented mathematically, with reservoir units describing the catchment storages. Fluxes, reservoirs, closure relations and transfer functions are the main building blocks of a conceptual model. Conceptual models are several orders of magnitude less complex than physics-based models because they represent catchment dynamics conceptually instead of through the small-scale physics used in physics-based models. However, in conceptual models, there is little or no direct relation between the model parameters and physically measurable quantities in the field. Therefore, calibration schemes are required to select appropriate parameter values which provide a reasonable match between observed and model-simulated values.
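The reservoir concept above can be made concrete with a minimal sketch of the simplest conceptual building block, a single linear reservoir. The function name and parameter values are illustrative assumptions, not taken from the study.

```python
def linear_reservoir(precip, k=0.1, s0=0.0):
    """Simulate a single linear reservoir: dS/dt = P - Q, with Q = k * S.

    precip : daily precipitation depths (mm/day)
    k      : outflow coefficient (1/day), a calibration parameter
    s0     : initial storage (mm)
    Returns the simulated daily discharge series (mm/day).
    """
    storage, runoff = s0, []
    for p in precip:
        storage += p      # add the day's precipitation to storage
        q = k * storage   # linear outflow proportional to storage
        storage -= q      # deplete storage by the released runoff
        runoff.append(q)
    return runoff
```

The outflow coefficient `k` has no directly measurable field counterpart, which is exactly why calibration against observed discharge is needed.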
Two types of modelling approaches can be identified within conceptual modelling: models based on a single hypothesis (fixed models) and models based on multiple hypotheses (flexible models). Fixed models are built around a general model architecture that gives satisfactory performance over a fairly broad range of watersheds and climatic regions. On the other hand, flexible modelling frameworks provide model building blocks that can be arranged in different ways to test many hypotheses about catchment dynamics instead of the one fixed hypothesis of fixed models. This modular nature allows the modeller to account for the uniqueness of the area of application.
Data science models, such as ML, gained popularity in many disciplines including hydrology with the advancement of computing power and data availability. The main advantage of data science models is their ability to use the available data to build the input-output relationships which provide actionable models with good predictive capabilities without relying on scientific theories. The short lead time predictive power of ML models is often superior to that of traditional physics-based and conceptual hydrological models due to their ability to capture the non-linearity, non-stationarity, noise complexity and dynamism of data (Yaseen et al. ). Another major advantage of an ML model is that it requires less effort to develop and calibrate than a physics-based model. Data science models, such as deep learning (DL) models, have shown more accurate performances in hydrograph predictions than the traditional approaches in ungauged catchments (Kratzert et al. ).
Furthermore, ML models can capture watershed similarities by providing satisfactory results even for catchments that were not used for training those models. This implies the potential of ML models for evolving catchment-scale theories, something the traditional models have been unable to do well (Nearing et al. ).
Theory-based models are frequently admired by the community for their interpretability, which may lead to a better understanding of catchment dynamics. The simple application of data-driven techniques may produce models with high prediction accuracy yet with meaningless model configurations which do not satisfy basic hydrological understanding and are seriously difficult to explain. The black-box nature of data science models has been criticized, limiting their application and hindering their success in many scientific fields. Beven () highlights the importance of the interpretability of DL models and suggests a more direct incorporation of process information into such models.
Furthermore, he points out that ML modellers should also pay attention to issues associated with traditional modelling approaches, such as data and parameter uncertainties and equifinality. Nearing et al. () contend that there is a danger for the hydrologic community in not recognizing the potential that ML offers for the future of hydrological modelling.
Furthermore, the authors counter the most common criticism of ML models (the lack of explainability) by noting that the adequacy of process representation in theory-based models can equally be questioned given their poor prediction accuracy. In summary, despite having huge potential, modern ML capabilities have not been thoroughly tested in hydrological modelling, even though there is an expectation that distributed hydrological models will be developed principally on ML in the near future. While the new data science techniques, such as DL, have become indispensable tools in many disciplines, their applications in water resources remain quite limited (Shen et al. ).
Lumped models vs. distributed models vs. semi-distributed models
Hydrological models can be broadly categorized into lumped and distributed models based on whether a model considers the spatial heterogeneity of watershed properties (e.g. geology, topography, land use) and climate variables (e.g. rainfall distribution). Lumped models assume spatially uniform catchment characteristics and use catchment average climate variable values (e.g. precipitation, temperature, evaporation) as model inputs. In contrast, distributed models incorporate the spatial variability in watershed properties and climate variables into the modelling process.
Model parsimony due to relatively few model parameters, ease and simplicity of use have made lumped modelling a popular hydrological modelling approach.
However, the meaningfulness of lumped values may deteriorate, especially as the catchment size increases, and hence the lumped representation may not correctly correspond to the actual physical reality of the catchment. A lumped model may still produce accurate results for a large catchment where the spatial heterogeneities are significant, but the inferences made through the model may not be reasonable or realistic.
The so-called uniqueness of the place has been identified as one of the crucial factors in hydrological modelling (Beven ). One way of addressing this phenomenon is using distributed models as they consider the spatial variabilities in their modelling process. On top of that, distributed models are required in situations where the discharge forecasting is required at multiple points within the catchment, and the impacts of the change of land use patterns within the watershed are to be tested, and critical source detections are required for contaminant or sediment transport modelling (Fenicia et al. ).
Two subcategories can be identified within distributed modelling: fully distributed and semi-distributed models.
Fully distributed models discretize the watershed into regular or irregular grids and use small-scale physics to model the fluxes through the spatial elements. Initially, their usage was greatly limited by computational demand and intensive data requirements. However, researchers believed that more data on catchment properties and climate variables would become available with the advancement of technology and invested heavily in developing fully distributed models. To date, however, distributed models have not delivered the expected outcomes.

GP APPLICATIONS IN WATER RESOURCES ENGINEERING
Over the past three decades, ML has evolved into an irreplaceable tool in many commercial and scientific disciplines, including water resources engineering. The approach outperforms the traditional approaches in many applications, such as autonomous driving, language translation, character recognition and object tracking (Karpatne et al. ). ML techniques have received considerable attention among hydrologists during the last 20 years. The earliest GP application in water resources engineering was reported in the late 1990s (Babovic ). Since then, GP has been used in many research directions in water resources engineering, such as rainfall-runoff modelling. Karpatne et al. () suggest five ways of combining scientific knowledge with data science: (i) theory-guided design of data science models, (ii) theory-guided learning of data science models, (iii) theory-guided refinement of data science outputs, (iv) learning hybrid models of theory and data science and (v) augmenting theory-based models using data science. A typical theory-guided machine learning approach may follow one or more of the above-named strategies to couple scientific knowledge with learning algorithms.

GP FOR HYDROLOGICAL FORECASTING
In the traditional setting, when GP is used for hydrological forecasting, the terminal set of the GP algorithm consists of past and current states of meteorological variables and randomly generated constants. The function set consists of basic mathematical functions. The GP algorithm uses these building blocks to generate its solution set for predicting the catchment response into the future.
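As a rough, self-contained illustration (not the toolkits used in this study), the terminal set, function set and tree-structured candidate solutions of GP can be sketched as follows; the variable names `P_t`, `E_t`, `Q_t` and the tuple-based tree encoding are assumptions made for this example.

```python
import random
import operator

# Function set: basic mathematical operators.
FUNCTIONS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
# Terminal set: input variables plus ephemeral random constants.
TERMINALS = ['P_t', 'E_t', 'Q_t', 'const']

def random_tree(depth, rng):
    """Grow a random GP expression tree, encoded as nested tuples."""
    if depth == 0 or rng.random() < 0.3:
        term = rng.choice(TERMINALS)
        # 'const' becomes an ephemeral random constant.
        return round(rng.uniform(-1, 1), 3) if term == 'const' else term
    op = rng.choice(list(FUNCTIONS))
    return (op, random_tree(depth - 1, rng), random_tree(depth - 1, rng))

def evaluate(tree, inputs):
    """Evaluate an expression tree against a dict of input values."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, inputs), evaluate(right, inputs))
    return inputs[tree] if isinstance(tree, str) else tree

rng = random.Random(42)
tree = random_tree(3, rng)
q_hat = evaluate(tree, {'P_t': 12.0, 'E_t': 3.0, 'Q_t': 5.0})
```

A full GP run would repeatedly mutate, recombine and select such trees by fitness; only the representation and evaluation steps are sketched here.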

Runoff forecasting
For example, to produce one-day (Q_{t+1}) and five-day (Q_{t+5}) runoff forecasts, one would use precipitation (P_t), evaporation (E_t) and antecedent runoff (Q_t) as the input variables.
Furthermore, lagged precipitation (P_{t-lag}) and evaporation (E_{t-lag}) vectors up to a certain period (lag) may also be used to incorporate the precipitation and evaporation history into the forecasting process (in the current study, lag = 5 days). Although the runoff history can also be included in the forecasting process, in this study no lagged runoff vectors are used, as the forecasted runoff is largely correlated with the last observed runoff value (Q_t). The expected relationship between the forecasted runoff and the input variables can be captured as in Equation (1). GP may create its candidate equations using one or more input vectors along with the random constants and mathematical functions. The GP settings used for the runoff forecasts are given in Table 1.
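The assembly of these inputs can be sketched as follows; `make_forecast_dataset` is a hypothetical helper, not part of the study's code.

```python
def make_forecast_dataset(P, E, Q, lag=5, lead=1):
    """Assemble GP terminal-set inputs and forecast targets.

    Each row holds P_t..P_{t-lag}, E_t..E_{t-lag} and Q_t; the target
    is Q_{t+lead}. No lagged runoff beyond Q_t is included, mirroring
    the setup described in the text.
    """
    X, y = [], []
    for t in range(lag, len(Q) - lead):
        row = [P[t - k] for k in range(lag + 1)]   # P_t ... P_{t-lag}
        row += [E[t - k] for k in range(lag + 1)]  # E_t ... E_{t-lag}
        row.append(Q[t])                           # antecedent runoff Q_t
        X.append(row)
        y.append(Q[t + lead])                      # forecast target Q_{t+lead}
    return X, y
```

Setting `lead=5` in the same helper yields the inputs for the five-day forecast.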
Case studies

The catchment characteristics of both watersheds are summarized in Table 2. Precipitation (P), streamflow (Q) and evaporation (E) data for 11 years from 1 January 2004 were used. Model performance is evaluated with four objective functions (where σ is the standard deviation, μ is the mean, Q_ot is the mean of the observed discharge values and ln is the natural logarithm); they are responsive to contrasting flow segments of the simulated and observed hydrographs.
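The exact set of objective functions is not fully recoverable from the extracted text; as a sketch using the quantities defined above, two commonly used efficiency measures are NSE and its log-transformed variant, which respond to the high-flow and low-flow segments respectively.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations.
    Values near 1 indicate a good fit; NSE is dominated by high flows."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    var = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / var

def log_nse(obs, sim, eps=1e-6):
    """NSE on ln-transformed flows, emphasising low-flow performance.
    eps guards against taking the log of zero discharge."""
    return nse([math.log(o + eps) for o in obs],
               [math.log(s + eps) for s in sim])
```

Evaluating a simulation under both measures exposes whether a model fits floods at the expense of recessions, which is why contrasting objective functions are used together.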

Results
Four equations were derived by using GP as a forecasting tool for one-day and five-day runoff prediction in the two catchments. Equations (6) and (7) are for the Sipsey Fork catchment and Equations (8) and (9) are for the Red Creek catchment. The performance metrics of the four equations are given in Table 3. Figure 3 shows the forecasted hydrographs together with the observed hydrographs of the catchments.
• One-day runoff forecasts of the Red Creek catchment achieve high-efficiency values for both the training and testing periods under all four objective functions. This demonstrates the ability of GP to achieve high-efficiency values in runoff forecasting. Furthermore, the optimal equation (Equation (8)) shows no signs of overfitting due to its consistent performance across both the training and testing periods.
• As can be seen from the observed hydrographs, the runoff signature of the Sipsey Fork catchment is relatively more complex than that of the Red Creek catchment. In this context, GP with only mathematical functions is unable to achieve satisfactory performance in terms of the training NSE value. Additionally, the optimal equation (Equation (6)) is overfitted to its training data, as the NSE value falls below zero for the testing period. This highlights the importance of cautious usage of traditional GP forecasts when they are applied to out-of-sample values.
• As per the efficiency values, one-day forecasts of both catchments perform significantly better than five-day forecasts. Therefore, the performance of GP as a forecasting tool deteriorates significantly as the lead time increases.
• The optimal equations provide little or no knowledge about the underlying physical phenomena of runoff generation. Furthermore, the equations tend to become lengthier with insignificant components.

GP FOR HYDROLOGICAL MODELLING
In this section, GP is used for automatic rainfall-runoff model induction. In contrast to runoff forecasting, GP here maps the input forcing variables at time t to the runoff at the same time t, which involves no lead time.

RB function
The RB function represents a single reservoir in a TANK model configuration. As shown in Equation (10), RB has nine function arguments.

Q = RB(RI, h_1, a_1, h_2, a_2, b, L_1, L_2, S)   (10)

where Q is the discharge components (two side flows and one bottom flow) of the reservoir (mm/day) and RI is all inputs to the reservoir.
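A plausible single-step sketch of the reservoir behaviour behind RB is given below. The outlet equations are a common Sugawara TANK formulation assumed here; the lag arguments L_1, L_2 of Equation (10) are omitted, and the storage argument S appears as the explicit `storage` state.

```python
def rb_step(storage, inflow, h1, a1, h2, a2, b):
    """One daily step of a single TANK reservoir (sketch of RB).

    storage : current water level in the tank (mm)
    inflow  : total input RI for the step (mm/day)
    h1, h2  : side-outlet heights (mm); a1, a2: outlet coefficients (1/day)
    b       : bottom-outlet (infiltration) coefficient (1/day)
    Returns (new_storage, q_side1, q_side2, q_bottom).
    """
    storage += inflow
    q1 = a1 * max(storage - h1, 0.0)  # lower side outlet (flows only above h1)
    q2 = a2 * max(storage - h2, 0.0)  # upper side outlet (flows only above h2)
    qb = b * storage                  # bottom outlet, e.g. to a lower tank
    storage -= q1 + q2 + qb
    return storage, q1, q2, qb
```

Stacking several such tanks, with each bottom outflow feeding the tank below, reproduces the layered structure of a TANK model configuration.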

RJOIN function
The RJOIN function is defined in Equation (11), and the corresponding GP settings are given in Table 5. Optimal model selection was carried out as described in earlier sections, while 5,000 behavioural models (parameter sets) were identified for uncertainty analysis using NSE as the performance indicator with a threshold value of 0.6.
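The behavioural-model identification step can be sketched as a simple GLUE-style filter; `simulate` is an assumed model runner supplied by the caller, not part of the study's code.

```python
def behavioural_sets(candidates, simulate, observed, threshold=0.6):
    """Keep candidate parameter sets whose simulated hydrograph exceeds
    the behavioural NSE threshold (0.6 in the text)."""
    def nse(obs, sim):
        m = sum(obs) / len(obs)
        sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
        var = sum((o - m) ** 2 for o in obs)
        return 1.0 - sse / var
    # A parameter set is 'behavioural' if its simulation is good enough.
    return [p for p in candidates if nse(observed, simulate(p)) >= threshold]
```

In the study, 5,000 such behavioural sets feed the uncertainty analysis; this sketch only shows the acceptance rule.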

Results
Lumped modelling

The performance metrics of the optimal model are given in Table 6. As per the efficiency values, the optimal model shows consistent performance over the calibration, validation and testing periods. Hence, it shows no sign of overfitting to its training data. The simulated hydrograph along with the observed hydrograph of the Sipsey Fork catchment is given in Figure 9, and a good visual match between the two hydrographs can be observed. However, on some occasions, the optimal model underestimates high discharge values. The simulated FDC is shown in Figure 10. The calibrated values of the optimal model reveal that almost the entire surface runoff consists of three surface runoff components: q_sur2 of RB 1, and q_sur1 and q_sur2 of RB 7. The associated time delays of their lag functions are 0.04, 0.05 and 0.01 days, respectively. This suggests that the optimal model has a quick discharge response to its forcing terms.
Interestingly, the experimental insights of the Sipsey Fork catchment indicate that the most distinct feature of the catchment is its rapid flow response to the precipitation events.

FORECASTING VS. MODELLING
The most common application of GP in rainfall-runoff modelling so far has been the use of the algorithm as a short-term forecasting tool. As shown in this and many previous studies, GP is capable of achieving high efficiency values when used as a forecasting tool. However, in this study, we explore the potential of GP as a model induction toolkit. In contrast to traditional GP as a forecasting tool, the two model induction toolkits (ML-RR-MI and MIKA-SHA) include the following augmentations when they are used as model induction engines.
• GP is used to simulate the runoff at the time t using the forcing terms at the time t which involves no lead times. The precipitation, evaporation and streamflow histories are automatically taken into account by the reservoir storage and lag functions.
• Existing hydrological knowledge is incorporated as special functions (RB, RJOIN and DISTRIBUTED) into the GP function set. The objective of adding special functions is to make the induced models readily interpretable and to capture the complex runoff dynamics.
• Input variable data are divided into three segments (calibration, validation and testing) and the performance of the validation period is also considered in the optimal model selection stage. This is used to avoid the selection of overfitted models as the optimal model.
• A multi-objective optimization scheme is used to optimize both model structure and parameters simultaneously. Models selected using single-objective optimization may be biased towards a specific flow segment. For example, the popular NSE is sensitive towards the high flows. Hence, a model derived through a multi-objective optimization is expected to perform well in many flow characteristics.
• Optimal model selection is not only based on the best training fitness but also based on validation fitness, relative performance in capturing observed FDC, model parsimony, time-series complexity and pattern matching.
• Induced models are readily interpretable and provide insights about catchment dynamics.
• Parallel computation is used at the fitness calculation stage. As the special functions require considerably longer computational times than the basic mathematical functions, the use of parallel computation greatly helps to reduce overall computational time.
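The multi-objective selection described above can be illustrated with a minimal Pareto-dominance filter; this is a sketch, not the study's actual optimizer.

```python
def pareto_front(models):
    """Return the non-dominated subset of candidate models.

    `models` maps a model id to a tuple of objective scores to be
    maximised (e.g. NSE on high flows and log-NSE on low flows).
    A model is dominated if another candidate is at least as good on
    every objective and strictly better on at least one.
    """
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))
    return {m: s for m, s in models.items()
            if not any(dominates(other, s) for other in models.values())}
```

The optimal model is then picked from this front using the additional criteria listed above (validation fitness, FDC match, parsimony), rather than from a single ranked list.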
As can be seen with the Red Creek catchment, GP as a model induction toolkit can achieve high efficiency values similar to those of the runoff forecasts (one-day forecasts). More importantly, when the runoff signature is much more complex, as with the Sipsey Fork catchment, the model induction toolkit was able to achieve a satisfactory prediction power which runoff forecasts through traditional GP could not. In the present study, a relatively short period (6 years) was used for model training (calibration). However, both model induction toolkits were able to capture the runoff signature reasonably well within that period. Obtaining a long, high-quality dataset for model training is often challenging in hydrological modelling. Therefore, it may be safe to assume that GP as a model induction toolkit performs better than forecasting when the discharge signatures are more complex and difficult to model and/or when the training labelled instances are limited.

CONCLUSIONS
We recognize the potential offered by ML algorithms for hydrological modelling. We also recognize that simplistic black-box data-driven models may lead to accurate yet senseless models with serious interpretation difficulties, which do not serve the advancement of hydrological knowledge. Therefore, we chart the most promising way ahead through the integration of existing hydrological knowledge with learning algorithms to induce more generalizable and physically coherent models. This was the motivation behind the development of GP-based model induction frameworks founded on both ML and theory-driven models. We expect this work to strengthen the link between two major, historically quite separate communities in water resources science and engineering: those working with physics-based process simulation modelling and those working with machine learning.
The results presented in this contribution demonstrate the potential of GP as a model induction toolkit, in contrast to its most frequent usage as a short-term forecasting tool. More importantly, GP as a model induction engine preserves the high prediction accuracies shown with runoff forecasting while helping hydrologists gain a better understanding of the watershed dynamics through the resultant interpretable models. Furthermore, GP as a model induction toolkit produces more accurate results than forecasting when the discharge signatures are more complex and difficult to model and/or when the training labelled instances are limited. In the current study, GP is used to its full capacity as a model induction engine to simultaneously optimize both model configuration and parameters, so that model selection is based on adequacy rather than legacy. The modular approach of these model induction toolkits may therefore address the so-called uniqueness of the place in hydrological modelling. A unique and perhaps more important feature of these model induction toolkits is their capability to couple with any internally coherent collection of building blocks representing the elements of hydrological knowledge.
The optimal model configurations derived in this study are in agreement with experimental insights and previously disclosed research findings of the catchments. Hence, GP as a model induction toolkit is smart enough to mine knowledge from data which makes it viable to depend on the induced models with more than just statistical confidence.
The optimal models derived by the two GP-based model induction toolkits are readily interpretable by domain experts.
This makes the approach introduced here different from other ML utilizations in rainfall-runoff modelling. We need data to build models. In the absence of data, human experts may construct models based on their intuition, but even human-built models require some sort of validation against data. With limited data, both human experts and our model induction toolkits (ML-RR-MI and MIKA-SHA) face the same challenge. From the perspective of the bias-variance dilemma, DL models, such as Long Short-Term Memory (LSTM) networks, will not be particularly useful under such poor data availability due to their very large number of free parameters. However, the present work is just the beginning of coupling the strengths of human awareness with those of computational discovery techniques. We expect further research on theory-guided machine learning to be directed towards knowledge discovery and automatic model building in hydrological modelling.

DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.