## Abstract

Genetic programming (GP) is a widely used machine learning (ML) algorithm that has been applied in water resources science and engineering since its conception in the early 1990s. However, similar to other ML applications, the GP algorithm is often used as a data fitting tool rather than as a model building instrument. We find this a gross underutilization of the GP capabilities. The most unique and distinct feature of GP that makes it distinctly different from the rest of ML techniques is its capability to produce explicit mathematical relationships between input and output variables. In the context of theory-guided data science (TGDS) which recently emerged as a new paradigm in ML with the main goal of blending the existing body of knowledge with ML techniques to induce physically sound models. Hence, TGDS has evolved into a popular data science paradigm, especially in scientific disciplines including water resources. Following these ideas, in our prior work, we developed two hydrologically informed rainfall-runoff model induction toolkits for lumped modelling and distributed modelling based on GP. In the current work, the two toolkits are applied using a different hydrological model building library. Here, the model building blocks are derived from the Sugawara TANK model template which represents the elements of hydrological knowledge. Results are compared against the traditional GP approach and suggest that GP as a rainfall-runoff model induction toolkit preserves the prediction power of the traditional GP short-term forecasting approach while benefiting to better understand the catchment runoff dynamics through the readily interpretable induced models.

## HIGHLIGHTS

GP is used as a rainfall-runoff model induction toolkit rather than as a short-term forecasting mechanism.

Hydrological knowledge has been integrated with the machine learning algorithm.

This guides algorithm to induce physically sound and consistent model configurations.

The model building blocks are inspired by the Sugawara TANK model template.

They represent the elements of incorporated hydrological knowledge.

## INTRODUCTION

Data science methods have shown success in many scientific fields including hydrology. However, it may be argued that, compared with the level of success in commercial fields, the potential in water resources has not been fully realized. There are two major reasons for this: a lack of the availability of labelled instances for model training and the black-box nature of data-driven models where a modeller has little or no knowledge about how the model makes its prediction (Karpatne *et al.* 2017). Although the data-driven models are often more accurate than more traditional hydrological physics-based, conceptual and empirical models in terms of predictive capabilities, they contribute little towards the advancement of scientific theories due to the lack of interpretability of the model configurations. Recently, a novel modelling paradigm called theory-guided data science (TGDS) (Karpatne *et al.* 2017) or physics informed machine learning (Babovic & Keijzer 2002; Babovic 2005, 2009; Physics Informed Machine Learning Conference 2016; Chadalawada *et al.* 2020) emerged to enhance the explainability of machine learning (ML) models in general and water resources in particular. In this approach, the existing body of knowledge is blended with ML algorithms to induce physically sound and consistent models.

Genetic programming (GP) (Koza 1992) is an ML algorithm that has been used for many applications in water resources science and engineering since its invention in the early 1990s. However, as per the state-of-the-art GP applications, the algorithm is often used as a data fitting tool instead of as a model building instrument. The data fitting makes the GP applications very similar to other ML algorithms, such as artificial neural networks or support vector machines. In hydrological rainfall-runoff modelling, the most frequent use of ML, including GP, is as a short-term forecasting tool. We find this to be an underutilization of the GP capabilities. The most unique and distinct feature of GP that makes it so different from the rest of ML techniques lies in its capability to produce explicit mathematical relationships between input and output variables.

A recent paper (Addor & Melsen 2019), based on more than 1,500 peer-reviewed research articles, concluded that the model selection in hydrological modelling is more often driven by legacy rather than adequacy. Furthermore, the so-called uniqueness of the place has been identified as one of the key aspects of hydrological modelling (Beven 2020). Hence, an automatic model induction and model selection framework may serve as an alternative to the more traditional subjective model selection and should induce a model architecture that might be more adequate for the intended application.

Following this line of thought in our prior work, we developed two hydrologically informed rainfall-runoff model induction toolkits, namely Machine Learning Rainfall-Runoff Model Induction toolkit (ML-RR-MI) (Chadalawada *et al.* 2020) and Machine Induction Knowledge Augmented-System Hydrologique Asiatique (MIKA-SHA) (Herath *et al.* 2020). Both toolkits are capable of inducing fully fledged rainfall-runoff *models* for a catchment of interest where the former is used for lumped modelling and the latter is used for distributed modelling. In this approach, GP has been used as the ML induction algorithm to learn both model structure and optimize parameters simultaneously. Hydrological knowledge is incorporated through the building blocks introduced into the GP framework used to govern the induction of physically sound and consistent models. In prior applications of ML-RR-MI and MIKA-SHA, building blocks of two modular modelling frameworks namely SUPERFLEX (Fenicia *et al.* 2011; Kavetski & Fenicia 2011) and FUSE (Clark *et al.* 2008) were used as the elements of incorporated hydrological knowledge.

In the current study, both ML-RR-MI and MIKA-SHA are armed with a different model building library inspired by the Sugawara TANK model template (Sugawara 1979). One of the most unique features of our hydrologically informed model induction frameworks is the capability of integrating frameworks with any internally coherent (consistent) collection of building blocks as the elements of incorporated hydrological knowledge. Therefore, we chose to utilize model building components from the Sugawara TANK model template for the current study in contrast to the model building components of flexible modelling frameworks used in earlier ML-RR-MI and MIKA-SHA applications to demonstrate the versatility of our approach.

The main objective of this study is to compare the performance of GP as a hydrological model induction engine against the GP as a short-term forecasting tool. Both lumped and semi-distributed rainfall-runoff model induction capabilities of GP have been evaluated on two United States watersheds.

The rest of the paper is arranged as follows. The next section provides the background of different hydrological modelling strategies. GP applications in water resources engineering are discussed in the section that follows. Traditional GP has been used for runoff forecasting of two United States watersheds and the results are included in the section on GP for hydrological forecasting. The section on GP for hydrological modelling describes the applications of the ML-RR-MI and MIKA-SHA toolkits along with the Sugawara TANK model-based building block library for lumped and distributed model induction for the same two catchments. The comparison between GP for runoff forecasting vs. hydrological modelling has been given in the next section. The last section summarizes the conclusions made through the research findings. Additional details are summarized in the Supplementary Appendix.

## HYDROLOGICAL MODELLING

Hydrological models play an important role in understanding catchment dynamics. In this section, different hydrological model-building strategies, some of which have been used in the present study are briefly discussed.

### Theory-based models vs. data science models

Here, we refer to both conceptual and physics-based hydrological models as theory-based models. In conceptual modelling, hydrological processes are represented mathematically with the reservoir units describing the catchment storages. Fluxes, reservoirs, closure relations, and transfer functions are the main building blocks of a conceptual model. Conceptual models are several orders less complex than the physics-based models due to the conceptual representation of catchment dynamics instead of small-scale physics used in physics-based models. However, in conceptual models, there are little or no direct relations between the model parameters and physically measurable quantities in the field. Therefore, it is required to use calibration schemes to select appropriate parameter values which provide a reasonable match between observed and model-simulated values.

Two types of modelling approaches can be identified within the conceptual modelling: the models based on a single hypothesis (fixed models) and the models based on multiple hypotheses (flexible models). Fixed models are built around a general model architecture that gives satisfactory model performances over a fairly broad range of watersheds and climatic regions. On the other hand, flexible modelling frameworks provide model building blocks that can be arranged in different ways to test many hypotheses about catchment dynamics instead of the one fixed hypothesis in fixed models. Such robust nature of any modular modelling framework allows the modeller to consider the uniqueness of the area of their application.

Data science models, such as ML, gained popularity in many disciplines including hydrology with the advancement of computing power and data availability. The main advantage of data science models is their ability to use the available data to build the input–output relationships which provide actionable models with good predictive capabilities without relying on scientific theories. The short lead time predictive power of ML models is often superior to that of traditional physics-based and conceptual hydrological models due to their ability to capture the non-linearity, non-stationarity, noise complexity and dynamism of data (Yaseen *et al.* 2015). Another major advantage of an ML model is that it requires less effort to develop and calibrate than a physics-based model. Data science models, such as deep learning (DL) models, have shown more accurate performances in hydrograph predictions than the traditional approaches in ungauged catchments (Kratzert *et al.* 2019). Furthermore, ML models can capture watershed similarities by providing satisfactory results even for the catchments that were not used for the training of those models. This implies the potential of ML models in evolving catchment scale theories in which the traditional models were unable to do so well (Nearing *et al.* 2020).

Theory-based models are frequently admired by the community due to their interpretability which may lead to a better understanding of catchment dynamics. The simple application of data-driven techniques may produce models with high prediction accuracy yet with meaningless model configurations which do not satisfy basic hydrological understanding and may have serious complications with the explanation. The black-box nature of data science models has been criticized, leading to a few of their applications and hindered success in many scientific fields. Beven (2020) highlights the importance of the interpretability of DL models and suggests a more direct incorporation of process information into such models. Furthermore, he points out that ML models should also need to pay attention to similar issues associated with traditional modelling approaches like data and parameter uncertainties and equifinality.

Nearing *et al.* (2020) contend that there is a danger for the hydrologic community in not recognizing the potential of ML offers for the future of hydrological modelling. Furthermore, the authors reject the most common criticism on ML models (the lack of explainability) by stating that the adequacy of process representation in theory-based models can be questioned due to the poor prediction accuracy. In summary, despite having a huge potential, the modern ML capabilities have not been thoroughly tested in hydrological modelling where there is an expectation that even distributed hydrological models are to be developed principally on ML in near future. Even though the new data science techniques, such as DL, have become indispensable tools in many disciplines, their applications in water resources remain quite limited (Shen *et al.* 2018).

### Lumped models vs. distributed models vs. semi-distributed models

Hydrological models can be broadly categorized into lumped and distributed models based on whether a model considers the spatial heterogeneity of watershed properties (e.g. geology, topography, land use) and climate variables (e.g. rainfall distribution). Lumped models assume spatially uniform catchment characteristics and use catchment average climate variable values (e.g. precipitation, temperature, evaporation) as model inputs. In contrast, distributed models incorporate the spatial variability in watershed properties and climate variables into the modelling process. Model parsimony due to relatively few model parameters, ease and simplicity of use have made lumped modelling a popular hydrological modelling approach.

However, the meaningfulness of lumped values may deteriorate, especially when the catchment size increases, and hence, the lumped representation may not be correctly corresponding to the actual physical reality of the catchment. A lumped model may still produce accurate results for a large catchment where the spatial heterogeneities are significant, but the inferences made through the model may not be reasonable or realistic.

The so-called uniqueness of the place has been identified as one of the crucial factors in hydrological modelling (Beven 2020). One way of addressing this phenomenon is using distributed models as they consider the spatial variabilities in their modelling process. On top of that, distributed models are required in situations where the discharge forecasting is required at multiple points within the catchment, and the impacts of the change of land use patterns within the watershed are to be tested, and critical source detections are required for contaminant or sediment transport modelling (Fenicia *et al.* 2016).

Two subcategories can be identified within distributed modelling: fully distributed and semi-distributed models. Fully distributed models discretize the watershed into regular or irregular grids and use small-scale physics to model the fluxes through the spatial elements. In the beginning, its usage was greatly limited due to computation demand and intensive data requirements. However, researchers believed that more data about the catchment properties and climate variables would be available with the advancement of technology and invested heavily on developing fully distributed models. Until today, however, the distributed models have not reached the expected outcome. Distributed models suffer from the consequences of equifinality and a large number of model parameters often lead to overparameterization. A comprehensive review on applications, challenges and future trends of fully distributed modelling in hydrology is presented in Fatichi *et al.* (2016).

In contrast, semi-distributed models represent the catchment dynamics using a network of separate conceptual models assigned to functionally distinguishable land segments (Boyle *et al.* 2001). The functionally distinguishable land segments can either be subcatchments (e.g. Boyle *et al.* 2001) or Hydrological Response Units (HRUs) (e.g. Fenicia *et al.* 2016). There are no interconnections or dependencies among the models in the network, and the total model output consists of the routed sum of the individual model responses of each spatial element. This together with the use of conceptual models instead of small-scale physics makes this approach more parsimonious and several orders less complex than fully distributed models. Moreover, semi-distributed models do not suffer much from equifinality and overparameterization as in fully distributed models.

## GP APPLICATIONS IN WATER RESOURCES ENGINEERING

Over the past three decades, ML has evolved into an irreplaceable tool in many commercial and scientific disciplines, including water resources engineering. The approach outperforms the traditional approaches in many applications, such as autonomous driving, language translation, character recognition and object tracking (Karpatne *et al.* 2017). ML techniques received attention among hydrologists to a great extent during the last 20 years. The majority of ML methods used in hydrological modelling can be categorized into five groups (Yaseen *et al.* 2015): (i) Artificial Neural Networks (ANNs; Minns & Hall 1996), (ii) Fuzzy set (Lohani *et al.* 2014), (iii) Support Vector Machines (SVMs; Yu *et al.* 2004), (iv) Evolutionary Computation (EC; Babovic & Keijzer 2002) and (v) Wavelet–Artificial Intelligence models (W-AI; Moosavi *et al.* 2013). The selection of an ML technique for a particular application should be based on the objectives of the study as each of these methods has its pros and cons. A comprehensive review of different ML techniques and their applications in water resources engineering is beyond the scope of this paper. Instead, interested readers are directed to review papers by Govindaraju (2000), Yaseen *et al.* (2015), Mehr *et al.* (2018) and the textbook by Hsieh (2009).

GP (Koza 1992) is an evolutionary computation technique inspired by the basic principles of Darwin's theory of evolution. GP is capable of automatic generation of computer programs and falls under the supervised ML category. The ability to generate the input–output relationships as explicit symbolic mathematical expressions differentiates GP from the rest of ML techniques. Therefore, GP is often regarded as a grey-box data-driven technique rather than as a black-box data-driven technique, such as ANNs or SVMs. Moreover, conceptual simplicity, the ability of parallel computing and the capability of obtaining the near-global or global solution make GP a powerful ML technique. There are several variants of GP-like Monolithic GP (MGP), Multi-Gene GP (MGGP), Gene Expression Programming (GEP), Linear GP (LGP) and Grammar-based GP (GGP) (Mehr *et al.* 2018). Despite variants, the fundamental operations are much the same. The GP algorithm starts with a randomly generated initial set of candidate solutions for the task at hand. It uses mathematical functions, physical or random constants and independent variables as its building blocks to generate candidate solutions. Then, the output of each candidate solution (GP individual) is evaluated against the labelled instances using a so-called objective function. Next, the genetic operators, such as crossover and mutation, are applied to GP individuals of the current generation (parent population) to generate offsprings for the next generation. The selection mechanism used to pick suitable parents for breeding ensures that more fit individuals get higher chances of selection. This process continues while the GP algorithm minimizes the difference between labelled instances and modelled values until it reaches the termination criteria (execution time, desired level of accuracy or a maximum number of generations) (Babovic & Keijzer 2000).

The earliest GP application in water resources engineering was reported in the late 1990s (Babovic 1996). Since then GP has been used in many research directions in water resources engineering, such as rainfall-runoff modelling (Babovic & Keijzer 2002; Havlicek *et al.* 2013; Babovic *et al.* 2020), streamflow prediction (Meshgi *et al.* 2014; Karimi *et al.* 2016), water quality modelling (Savic & Khu 2005), groundwater modelling (Fallah-Mehdipour *et al.* 2013; Datta *et al.* 2014), reservoir management (Giuliani *et al.* 2015), sediment transport (Babovic 2000; Safari & Mehr 2018), climate variables and soil properties modelling (Bautu & Bautu 2006; Elshorbagy & El-Baroudy 2009). A comprehensive review on GP applications in water resources engineering is presented in Mehr *et al.* (2018), whereas Fallah-Mehdipour & Haddad (2015) presents the applications of GP in hydrological modelling.

In the context of rainfall-runoff modelling, GP has been reported to be more accurate than the other ML techniques, such as ANN in terms of prediction efficiencies (Havlicek *et al.* 2013). However, it is worthwhile to note that, in many of the GP utilizations in water resources engineering so far, the GP algorithm is used as a short-term forecasting mechanism by paying more attention towards the prediction power and less attention to the resultant input–output relationship. Yet, the ability to produce an explicit mathematical relationship between input and output variables has made GP so different from other ML techniques. Recent studies, such as Chadalawada *et al.* (2017, 2020) and Herath *et al.* (2020), explore this potential of GP and use it as a rainfall-runoff model induction toolkit to induce interpretable models with good predictive powers.

### Physics informed GP

One effective way to address the main criticism of ML models (lack of interpretability) would be to use the existing hydrological knowledge to guide ML models (Babovic & Keijzer 2002; Babovic 2009). Presently, blending the existing body of knowledge with learning algorithms has been identified as a novel modelling paradigm in ML. Despite the different taxonomy used by different researchers, such as TGDS (Karpatne *et al.* 2017), physics informed machine learning (Physics Informed Machine Learning Conference 2016; Chadalawada *et al.* 2020) and AI-neuroscience (Voosen 2017), this modelling paradigm aims to enhance the explainability of ML models by inducing more generalizable and physically consistent models. Karpatne *et al.* (2017) suggest five ways of combining scientific knowledge with data science: (i) theory-guided design of data science models, (ii) theory-guided learning of data science models, (iii) theory-guided refinement of data science outputs, (iv) learning hybrid models of theory and data science and (v) augmenting theory-based models using data science. A typical theory-guided machine learning approach may follow one or more above-named strategies to couple scientific knowledge with learning algorithms.

McGovern *et al.* (2019) highlight that an increasing trend can be observed by using TGDS models in water resources engineering applications including optimal process-oriented hydroclimatic model building and multiple-source data fusion for hydrologic process simulation (e.g. Snauffer *et al.* 2018; Solander *et al.* 2019; Chadalawada *et al.* 2020; Herath *et al.* 2020) in contrast to limited such applications in past (e.g. Cannon & Mckendry 2002; Keijzer *et al.* 2001; Keijzer & Babovic 2002; Fleming 2007). Although there are efforts in nearly all machine learning techniques to integrate existing knowledge into the basic frameworks, in the sequel, we will focus our discussion on efforts in GP. For a complete discussion on theory-guided machine learning applications in water resources engineering, refer to Karpatne *et al.* (2017).

Even though the TGDS was only recently identified as a novel modelling paradigm in the context of GP, there were efforts over the past 20 years to couple the hydrological knowledge into the basic GP framework to induce more physically reliable hydrological models. Babovic & Keijzer (2000, 2002) and Keijzer & Babovic (2002) augmented the basic GP algorithm by introducing preferential bias and declarative bias into the model building process of GP to induce more physically sound and consistent models that consist of dimensionally accurate equations with minimal physical violations. It is reported that the dimensionally aware GP approach performs better than the traditional GP by achieving faster convergence and parsimonious model equations. In another study (Babovic *et al.* 2000), the same approach was used to induce hydraulic equations from the data. Babovic (2009) incorporated high-level concepts with the GP algorithm to evolve sediment transport process equations and noticed that the resulted equations perform equally or better than the equations derived by human experts. GP was used to analyse the configurations of hydrological models by Selle & Muttil (2011) to gain insights about the dominant hydrological processes in environmental systems. Basic hydrological concepts were incorporated with the GP algorithm as special functions (e.g. a simple reservoir function, simple moving average function, delay operator and cumulative sum operator) in Havlicek *et al.* (2013), to improve the runoff forecasting capabilities of GP. They found that the GP with special functions performs better than the traditional GP and ANNs in terms of prediction efficiencies. Chadalawada *et al.* (2017) used GP to identify the optimal TANK model configuration (Sugawara 1979) for the catchment of interest by optimizing the number of reservoir blocks (tanks), connections and outflows based on field measurements. In our prior work (Chadalawada *et al.* 2020; Herath *et al.* 2020), an automatic lumped and semi-distributed conceptual rainfall-runoff model induction toolkit was developed using GP, and the building blocks available in two modular modelling frameworks (SUPERFLEX and FUSE) were used as the aspects of hydrological insights.

## GP FOR HYDROLOGICAL FORECASTING

In the traditional setting, when GP is used for hydrological forecasting, the terminal set of GP algorithm consists of past and current states of meteorological variables and randomly generated constants. The function set consists of basic mathematical functions. The GP algorithm uses these building blocks to generate its solution set for predicting catchment response into the future.

### Runoff forecasting

*Q*

_{t}_{+1}) and five-day (

*Q*

_{t}_{+5}) runoff forecasts, one would use precipitation (

*P*), evaporation (

_{t}*E*) and antecedent runoff (

_{t}*Q*) as the input variables. Furthermore, lagged precipitation (

_{t}*P*

_{t}_{−lag}) and evaporation (

*E*

_{t}_{−lag}) vectors up to a certain period (lag) may also be used to incorporate precipitation and evaporation history into the forecasting process (in the current study, lag = 5 days). Although the runoff history can also be included in the forecasting process, in this study, no lagged runoff vectors are used as the forecasted runoff is largely correlated with the last observed runoff value (

*Q*). The expected relationship between the forecasted runoff and input variables can be captured as in Equation (1). GP may create its candidate equations either using one or more input vectors along with the random constants and mathematical functions. GP settings used for the runoff forecasts are given in Table 1.

_{t}### Case studies

Two catchments located in the Eastern United States, namely Sipsey Fork catchment, near Grayson, Alabama and Red Creek catchment, near Vestry, Mississippi, were selected for the runoff forecasting using traditional GP. The catchment characteristics of both watersheds are summarized in Table 2. Precipitation (*P*), Streamflow (*Q*) and Evaporation (*E*) data for 11 years from 1 January 2004 to 31 December 2014 were used for model training (1 January 2004–31 December 2009) and model testing (1 January 2010–31 December 2014).

The catchment average daily data of *P*, *Q* and *E* of both the catchments were downloaded from CAMELS dataset (Newman *et al.* 2015). Time-series plots of forcing terms and streamflow data of the Sipsey Fork catchment and the Red Creek catchment are shown in Figures 1 and 2, respectively.

### Objective functions

*et al.*2009) and log Nash–Sutcliffe Efficiency (logNSE; Krause

*et al.*2005) are evaluated. Equations of the four objective functions are given below (

*N*is the time steps, is the observed streamflow, is the simulated streamflow,

*r*is the linear correlation coefficient, , is the standard deviation, is the mean, is the mean of observed discharge values and is the natural logarithm). They are responsive to contrasting flow segments of simulated and observed hydrographs.

### Results

Following observations can be made through the forecasting results of the two catchments.

One-day runoff forecasts of the Red Creek catchment achieve high-efficiency values for both the training and testing periods under all four objective functions. This demonstrates the ability of GP to achieve high-efficiency values in runoff forecasting. Furthermore, the optimal equation (Equation (8)) shows no signs of overfitting due to its consistent performance across both the training and testing periods.

As it can be seen with observed hydrographs, the runoff signature of the Sipsey Fork catchment is relatively more complex than that of the Red Creek catchment. In this context, GP with only mathematical functions is unable to achieve satisfactory performance in terms of the training NSE value. Additionally, the optimal equation for one-day forecasts of the Sipsey Fork catchment (Equation (6)) is overfitted to its training data as the NSE value falls below zero for the testing period. This highlights the importance of cautious usage of traditional GP forecasts when they are used with out of sample values.

As per the efficiency values, one-day forecasts of both catchments perform significantly better than five-day forecasts. Therefore, the performance of GP as a forecasting tool deteriorates significantly as the lead time increases.

The optimal equations provide little or no knowledge about the underlying physical phenomena of runoff generation. Furthermore, the equations tend to become lengthier with insignificant components. For example, the third term of Equation (9) () results in insignificant contribution towards the total forecasted runoff.

## GP FOR HYDROLOGICAL MODELLING

In this section, GP is used for automatic rainfall-runoff model induction. In contrast to runoff forecasting, GP here maps the input forcing variables at the time *t* to predict the runoff at the time *t* which involves no lead time.

## ML-RR-MI AND MIKA-SHA

In the present study, ML-RR-MI (Chadalawada *et al.* 2020) and MIKA-SHA (Herath *et al.* 2020) toolkits are used to identify the optimal rainfall-runoff model configurations for catchments of interest. Both toolkits are based on GP and classified as hydrologically informed ML techniques where existing hydrological knowledge is incorporated into the learning algorithm (as special functions) to induce physically sound and consistent models. ML-RR-MI is used for lumped modelling and MIKA-SHA is used for distributed modelling. Building blocks used in the GP frameworks of both toolkits represent the elements of available hydrological knowledge. In their earlier applications, the model components of FUSE and SUPERFLEX flexible modelling frameworks were used as model building blocks. The most important feature of these two model induction frameworks is that they can be coupled with any internally coherent collection of building blocks. Hence, in this paper, both ML-RR-MI and MIKA-SHA are using different model inventory based on the Sugawara TANK model template. As per the taxonomy defined in Karpatne *et al.* (2017), our approach can be categorized as a hybrid TGDS approach.

Both toolkits are primarily coded in R programming language (R Core Team 2013) based on the canonical GP approach proposed in Havlicek *et al.* (2013). The multi-objective optimization framework of the toolkits is based on Non-dominated Sorting Genetic Algorithm-II (NSGA-II; Deb *et al.* 2002). Parallel computation has been used at the fitness computation stage of GP individuals to reduce the overall computation time. However, the genetic operators are applied to the whole generation in series (using one core) to have more diversity among breeding individuals. Both toolkits consist of four major stages. They are data pre-processing stage, model identification stage, model selection stage and uncertainty and sensitivity analysis stage. Each step is briefly described in the following section, while for a detailed explanation refer to Chadalawada *et al.* (2020) and Herath *et al.* (2020).

The data pre-processing stage involves the quality control of independent (forcing terms) and dependent variable (streamflow) data. More steps are required with MIKA-SHA, such as watershed delineation, and are discussed in the next section. The core of both toolkits is the GP-based model identification stage where both the model structure and model parameters of GP individuals are optimized simultaneously. The workflow diagram of the model identification stage is given in Figure 4. The model identification stage starts with randomly generated models (GP individuals) to capture the runoff signature of the catchment of interest by using independent variables, random constants, mathematical functions and special functions. Then, the fitness of each individual is assessed based on the multi-objective criterion used. Both toolkits are equipped with an objective function library which consists of most of the performance measures in practice. Candidate models are evolved through generations as the genetic operators are applied. Optimization terminates once the number of generations reaches the maximum number of generations. The whole process is repeated to a user-defined number of iterations (independent runs) to have more coverage of the solution space. The Pareto optimality concept used here results in a group of model structures that are not dominated by each other on the user-defined performance measures as the final output of the model identification stage.

The model selection stage involves the optimal model selection out of the Pareto-optimal models identified in the model identification stage. Fitness function values of the same multi-objective criterion over the validation period are calculated and Pareto-optimal models are re-identified based on both the calibration and validation performance values. Then, the standardized signature index sum value (SIS; Ley *et al.* 2016) of each model is calculated and only the models with negative SIS values are passed to the next selection step (the SIS value is a relative performance measure which evaluates how well a particular model approximates the measured flow duration curve (FDC) relative to the other models. The more negative the value the better the performance). Then, the models with unique model structures are identified (known as competitive models). Finally, each competitive model is ranked based on three relative performance indicators (model parsimony based on the number of reservoir units/model parameters, dynamic time warping distance and cross-sample entropy). The model with the least sum-up rank is selected as the optimal model for the watershed of interest.

The uncertainty and sensitivity analysis of both toolkits is based on generalized likelihood uncertainty estimation (GLUE; Beven & Binley 1992). The model parameter values of the selected optimal model are changed uniformly across their parameter ranges and a user-defined number of satisfactory model parameter sets are identified (known as behavioural models and the performance is evaluated using user-specified objective criteria). Then, the percentage of the observed values that exist within the 90% confidence interval bands are used as a gauge of the parameter uncertainty of the optimal model. Sensitivity scatter plots are drawn for each model parameter and their shape is used to identify the model sensitive parameters.

### Special functions

The elements of hydrological knowledge are incorporated through the addition of special functions to the GP function set. The special functions used in the current study are inspired by the Sugawara TANK model template elements and semi-distributed modelling concepts.

#### Sugawara TANK model

Originally, the TANK model (Sugawara 1979) was developed to represent the hydrological processes in small humid catchments. In the TANK framework, several reservoir units (maximum four reservoirs) are arranged vertically to conceptualize the different soil layers in a catchment (hereinafter referred to as a series TANK model). Later, a parallel version of the TANK model (hereinafter referred to as a parallel TANK model) was proposed for large non-humid catchments. The parallel TANK model consists of several branches (maximum four branches) of series TANK models. Each branch represents a different zone of the watershed. The outermost zone represents hill/upstream side and the innermost represents the river/downstream side of the catchment basin. The parallel TANK model architecture with the maximum possible reservoir units (four branches each with four reservoirs) is shown in Figure 5. A typical reservoir unit in a TANK model consists of at least one side flow and one bottom flow (bottom flow is zero for the last reservoir in each branch). Side flows are used to represent surface runoff, interflow, sub baseflow and baseflow, while the bottom flows are used to represent interception, infiltration and deep percolation. By considering the uniqueness of the area of interest, one can test many hypotheses by altering the number of branches in the parallel TANK model configuration and by varying the number of reservoir units in each branch. Hence, the parallel TANK model provides more granularity in the model building than the other conceptual models with a fixed structure. Computational and conceptual simplicity, higher predictive power and easy interpretability have made the TANK model a widely used rainfall-runoff model (Song *et al.* 2017). More details and applications of the TANK model can be found in Nie *et al.* (2017) and Chadalawada *et al.* (2017).

#### RB function

*Q*is the discharge components (two side flows and one bottom flow) of the reservoir (mm/day); is all input variables, such as precipitation, evaporation and flows from the other reservoirs (mm/day); are the height of side-outlet one and two (mm). for the bottom reservoir in each branch; is the linear discharge coefficients of the two side flows and the bottom flow. for the top reservoir in each branch and elsewhere. for every reservoir; is the time delay in lag functions of two side flows (days);

*S*is the initial storage of the reservoir (mm).

#### RJOIN function

#### DISTRIBUTED function

*k*at time

*t*, are the outlet coefficients of side-outlet

*k*and bottom outlet, is the height of side-outlet

*k*, is the bottom flow, is the evaporation from reservoir

*i*at time

*t*, is the actual evaporation at time

*t*and is the storage of reservoir

*i*at time .

In the current study, RB function and RJOIN function along with other basic mathematical functions are used for lumped modelling (with ML-RR-MI), whereas RB, RJOIN and DISTRIBUTED functions along with basic mathematical functions are used for semi-distributed modelling (with MIKA-SHA). The parse tree representation of RB, RJOIN and DISTRIBUTED functions are given in Figure 6. The two-parameter gamma function is used as the lag functions in RB and DISTRIBUTED functions.

### Case studies

The Sipsey Fork catchment, near Grayson, Alabama, was used for lumped modelling with ML-RR-MI, while the Red Creek catchment, near Vestry, Mississippi, was used for semi-distributed modelling with MIKA-SHA. *P*, *Q* and *E* data for 11 years from 1 January 2004 to 31 December 2014 were used for model spin-up (1 January 2004–31 December 2004), model calibration (1 January 2005–31 December 2009), model validation (1 January 2010–31 December 2012) and model testing (1 January 2013–31 December 2014).

The catchment average daily data of *P*, *Q* and *E* of the Sipsey Fork catchment were downloaded from CAMELS dataset (Newman *et al.* 2015). Watershed delineation was required for the Red Creek catchment to identify subcatchments and HRUs. In the current study, SWAT+ plugin in QGIS software (QGIS 2020) was used for the watershed delineation. The whole watershed was divided into three subcatchments (Sub 1, Sub 2 and Sub 3) and HRUs were identified based on the geology of the area. Four HRUs were identified to represent the four major soil types of the Red Creek catchment (Figure 7). The area percentages of each HRU within three subcatchments are given in Table 4. The spatial distribution of daily precipitation data of the Red Creek catchment was considered and lumped at the subcatchment scale (three precipitation time series for the three subcatchments). Soil and land use data (resolution 30 m × 30 m), digital elevation data (resolution 30 m × 30 m), evaporation and streamflow data (catchment average daily data) and precipitation data (daily time step, 1 km × 1 km resolution) of the Red Creek catchment were downloaded from the United States Department of Agriculture's (USDA's) Geospatial Data Gateway (USDA's Geospatial Data Gateway 2020), Shuttle Radar Topography Mission (SRTM) data from the United States Geological Survey (USGS) EarthExplorer (USGS EarthExplorer 2020), CAMELS dataset (Newman *et al.* 2015) and Daymet dataset (Daymet 2020), respectively.

### Rainfall-runoff modelling

ML-RR-MI was used to identify an optimal lumped rainfall-runoff model for the Sipsey Fork catchment, while MIKA-SHA was used to identify an optimal semi-distributed rainfall-runoff model for the large Red Creek catchment. As the zonal representation of the parallel TANK modelling template captures the topography of the area to a certain extent, the HRUs were identified based on the geology of the Red Creek catchment. Therefore, MIKA-SHA model architectures derived by semi-distributed modelling are expected to represent the impacts of both topography and geology towards the runoff generation of the catchment. GP settings used with two model induction toolkits are summarized in Table 5. Optimal model selection was carried out as described in earlier sections, while 5,000 behavioural models (parameter sets) were identified for uncertainty analysis using NSE as the performance indicator with a threshold value of 0.6.

### Results

#### Lumped modelling

Following the methodology described in earlier sections, ML-RR-MI identified a three-branch model configuration (Figure 8(a)) as the optimal model structure representing the watershed dynamics of the Sipsey Fork catchment. The induced model configuration divides the watershed into three equal zones (each comprises 1/3 of the catchment area). Zone 1 represents the hillside of the catchment, while Zone 3 represents the riverside. The three-zone representation of the Sipsey Fork catchment is illustrated in Figure 8(b). The first two branches have three reservoirs each and the last branch has only one reservoir. The total catchment outflow consists of four surface runoff components (*q*_{sur2} of RB_{1}, *q*_{sur2} of RB_{4}, *q*_{sur2} and *q*_{sur1} of RB_{7}), one interflow component (*q*_{inf} of RB_{5}) and one baseflow component (*q*_{bf} of RB_{6}). Out of the total model outflow during the calibration period, 59% consists of surface flow components and 41% consists of subsurface flow components (baseflow and interflow).

The performance matrices of the optimal model are given in Table 6. As per the efficiency values, the optimal model shows a consistent performance over calibration, validation and testing periods. Hence, it shows no sign of overfitting issues to its training data. The simulated hydrograph along with the observed hydrograph of the Sipsey Fork catchment is given in Figure 9 and a good visual match between the two hydrographs can be observed. However, on some occasions, the optimal model underestimates high discharge values. The simulated FDC (Figure 10) of the calibration period closely follows the measured FDC in all flow segments, while the simulated FDCs of the validation and testing periods tend to deviate from the observed FDC in low flow regimes. According to the uncertainty analysis, 67.1% of observed discharge values during the calibration period fall within the 90% uncertainty bounds. This high percentage implies that the parameter uncertainty solely is competent enough to capture most of the total uncertainty of the optimal model (Boukezzi *et al.* 2016). Out of the 38 model parameters of the optimal model, seven parameters (*b*_{1}, *b*_{4}, *b*_{5}, *L*_{2,5}, *a*_{2,5}, *a*_{1,7}, *a*_{2,7}) can be identified as the model sensitive parameters towards achieving high model performances. Model parameter details are given in the Supplementary Appendix.

The calibrated values of the optimal model reveal that almost the entire surface runoff consists of three surface runoff components: *q*_{sur2} of RB_{1}, *q*_{sur1} and *q*_{sur2} of RB_{7}. The associated time delays of their lag functions are 0.04, 0.05 and 0.01 days, respectively. This suggests that the optimal model has a quick discharge response to its forcing terms. Interestingly, the experimental insights of the Sipsey Fork catchment indicate that the most distinct feature of the catchment is its rapid flow response to the precipitation events. Shallow soils, deeply incised stream channels and steep topography have been identified as the causes of such flashy behaviour (Mast & Turk 1999). On the other hand, the main stream of the catchment is characterized as a perennial river. This indicates that there may be a steady subsurface flow to sustain the streamflow during dry periods. Hence, having a significant subsurface flow component (baseflow + interflow) in the optimal model architecture is meaningful. Furthermore, the main soil type of the Sipsey Fork catchment belongs to the Montevallo Enders Townley soil association which is characterized as a deep and moderately permeable soil type (Official Soil Series Descriptions 2020). This may cause to have vertical flow processes, such as infiltration and deep percolation, and hence, water may reach deeper soil layers. Therefore, it is quite logical to have three reservoir configurations for both Zone 1 and Zone 2. However, as the flow reaches towards the riverside of the catchment (Zone 3), it can be expected to have more surface flow responses, such as saturation excess overland flow due to the presence of a shallow water table. Hence, having one reservoir with only surface runoff components in the optimal model configuration for Zone 3 is reasonable.

#### Semi-distributed modelling

MIKA-SHA identified the following model architecture (Figure 11) to represent the watershed dynamics of the Red Creek catchment. Here, S1, S2, S3 and S4 stand for the four major soil associations of the catchment. S1, S2 and S4 model structures consist of series model configurations (only one branch) with two reservoirs each. The S3 model structure has a parallel model configuration with three branches. First two branches consist of two reservoirs each and the last branch has three reservoirs. The model outflows of S1, S2 and S4 model configurations consist of two surface runoff components (*q*_{sur1} and *q*_{sur2} of RB_{1}) and one baseflow component (*q*_{bf} of RB_{2}). The simulated discharge values during the calibration period reveal that S1, S2 and S4 model outflows are dominated by their surface runoff components (89% in S1, 77% in S2 and 74% in S4). In contrast, the outflow of S3 model configuration consists of four surface runoff components (*q*_{sur1} of RB_{5} and *q*_{sur2} of RB_{1}, RB_{3} and RB_{5}), one interflow component (*q*_{inf} of RB_{6}) and one baseflow component (*q*_{bf} of RB_{7}). Furthermore, the total discharge of the S3 model configuration is dominated by subsurface discharge components (interflow 33% and baseflow 49%).

The performance matrices of the optimal model are given in Table 7. The optimal model shows a consistent performance throughout calibration, validation and testing periods. Hence, we may expect no overfitting issues. It can be noted that the testing fitness values are higher than the validation fitness values in all four absolute performance measures. This may be due to the relatively simpler discharge signature of the Red Creek catchment during the testing period. A good visual match can be observed between simulated and observed hydrographs (Figure 12). The simulated FDC follows the observed FDC closely in the calibration period and tends to deviate at low flows in the validation and testing periods (Figure 13). As per the uncertainty analysis, 78% of the observed discharge values during the calibration period fall within the 90% uncertainty bounds. Hence, it would be safe to accept that the parameter uncertainty of the optimal model is capable to estimate the total output uncertainty satisfactorily. By studying the shape of the sensitivity scatterplots, nine model parameters can be recognized as the model sensitive parameters of the optimal model (*b*_{1} of S1, *b*_{1} of S2, *b*_{1}, *b*_{2} and *b*_{3} of S3, *a*_{2,1} and *b*_{1} of S4, *L*_{HRU} and *L*_{sub}). Refer to the Supplementary Appendix for the model parameter details.

Most of the catchment area belongs to soil type S3 (60%). Due to the large spatial extent of S3 soil type (McLaurin Heidal) from hillside to the riverside of the catchment, the three-branch configuration of the MIKA-SHA optimal model is meaningful. The remaining three soil types have approximately the same spatial coverage. Furthermore, the S3 soil type is characterized as a sandy, well-drained soil type with a moderate permeability (Official Soil Series Descriptions 2020). This nature may encourage more subsurface type response due to the vertical drainage, such as infiltration and deep percolation. Interestingly, the outflow of the S3 model structure of the optimal model is also dominated by subsurface flow components (82% of the simulated discharge during the calibration period). On the other hand, the remaining three model configurations (S1, S2 and S4) consist of surface flow dominated discharge responses. The S2 soil type (Smithton) is formed in loamy alluvial sediments and characterized as a very deep, poorly drained soil type with a slow permeability. On top of that, as the S2 soil type is located close to the river network of the catchment, a shallow water table can also be expected. These properties may result in a high percentage of surface flow in runoff generation of S2 soil type. Both S1 (Susquehanna Poarch Benndale) and S4 (Susquehanna Benndale) soil types consist of Susquehanna soil series which is characterized as a soil type with a very slow permeability. Therefore, it can be expected to have surface runoff components like infiltration excess overland flow in both soil types. Furthermore, the bottom reservoirs of each S1, S2 and S4 model configurations show large storage throughout the calibration period. The experiment insights of the catchment reveal that S1, S2 and S4 soil types have deep to very deep soil stratum which may act as aquifers. All four model configurations have at least one subsurface flow component and hence provide a stable baseflow in runoff generation. Interestingly, a stable ground flow response can also be expected in the catchment dynamics of the Red Creek watershed as its main river channel is categorized as a perennial type river that has a continuous runoff throughout the year.

## FORECASTING vs. MODELLING

The most common application of GP in rainfall-runoff modelling so far is based on the use of the algorithm as a short-term forecasting tool. As shown in this and many previous studies, GP is capable of achieving high-efficiency values when it is used as a forecasting tool. However, in this study, we explore the potential of GP as a model induction toolkit. In contrast to the traditional GP as a forecasting tool, the two model induction toolkits (ML-RR-MI and MIKA-SHA) consist of the following augmentations when they are used as model induction engines.

GP is used to simulate the runoff at the time

*t*using the forcing terms at the time*t*which involves no lead times. The precipitation, evaporation and streamflow histories are automatically taken into account by the reservoir storage and lag functions.Existing hydrological knowledge is incorporated as special functions (RB, RJOIN and DISTRIBUTED) into the GP function set. The objective of adding special functions is to make the induced models readily interpretable and to capture the complex runoff dynamics.

Input variable data are divided into three segments (calibration, validation and testing) and the performance of the validation period is also considered in the optimal model selection stage. This is used to avoid the selection of overfitted models as the optimal model.

A multi-objective optimization scheme is used to optimize both model structure and parameters simultaneously. Models selected using single-objective optimization may be biased towards a specific flow segment. For example, the popular NSE is sensitive towards the high flows. Hence, a model derived through a multi-objective optimization is expected to perform well in many flow characteristics.

Optimal model selection is not only based on the best training fitness but also based on validation fitness, relative performance in capturing observed FDC, model parsimony, time-series complexity and pattern matching.

Induced models are readily interpretable and provide insights about catchment dynamics.

Parallel computation is used at the fitness calculation stage. As the special functions require considerably longer computational times than the basic mathematical functions, the use of parallel computation greatly helps to reduce overall computational time.

As it can be seen with the Red Creek catchment, GP as a model induction toolkit can achieve the same high-efficiency values similar to runoff forecasts (one-day forecasts). More importantly, when the runoff signature is much more complex like with the Sipsey Fork catchment, the model induction toolkit was able to achieve satisfactory prediction power which runoff forecasts through traditional GP were unable to achieve. In the present study, a relatively shorter period (6 years) was used for the model training (calibration). However, both model induction toolkits were able to capture the runoff signature reasonably well within that period. Obtaining a long quality dataset for model training is often challenging in hydrological modelling. Therefore, it may be safe to assume GP as a model induction toolkit performs better than forecasting when the discharge signatures are more complex and difficult to model and/or when the training labelled instances are limited.

## CONCLUSIONS

We recognize the potential offered by ML algorithms towards hydrological modelling. We also recognize that simplistic black-box type data-driven models may lead to the evolution of accurate; yet, senseless models with serious difficulties with interpretation may not serve towards the advancement of hydrological knowledge. Therefore, we chart the most promising way ahead through the integration of existing hydrological knowledge with learning algorithms to induce more generalizable and physically coherent models. This was the motivation behind the development of GP-based model induction frameworks which have been founded on both ML and theory-driven models. Therefore, we expect this work will strengthen the link between two major, historically, quite separate communities in water resource science and engineering: those working with physics-based process simulation modelling and those working with machine learning.

Results demonstrated in this contribution the potential of GP as a model induction toolkit in contrast to its most frequent usage as a short-term forecasting tool. More importantly, GP as a model induction engine preserves its high prediction accuracies shown with runoff forecasting while benefiting hydrologists to gain a better understanding of the watershed dynamics through the resultant interpretable models. Furthermore, GP as a model induction toolkit produces more accurate results than forecasting when the discharge signatures are more complex and difficult to model and/or when the training labelled instances are limited. Furthermore, in the current study, GP is used to its full capacity as a model induction engine to simultaneously optimize both model configuration and parameters.

GP model induction toolkits (MIKA-SHA and ML-RR-MI) may serve as an alternative to the traditional subjective model selection approach by automatically inducing optimal model configurations for a watershed of concern based on adequacy rather than legacy. Therefore, the modular approach of these model induction toolkits may address the so-called uniqueness of the place in hydrological modelling. The unique perhaps more important feature of these model induction toolkits would be their capability to couple with any internally coherent collection of building blocks representing the elements of hydrological knowledge. The optimal model configurations derived in this study are in agreement with experimental insights and previously disclosed research findings of the catchments. Hence, GP as a model induction toolkit is smart enough to mine knowledge from data which makes it viable to depend on the induced models with more than just statistical confidence.

The optimal models derived by the two GP-based model induction toolkits are readily interpretable by domain experts. This makes the approach introduced here different from the other ML utilizations in rainfall-runoff modelling. We need data to build models. In the absence of data, human experts will be able to construct models based on their intuition since even human-built models require some sort of validation against data. With the limited data, both human and our model induction toolkits (both ML-RR-MI and MIKA-SHA) face the same challenge. DL models will not be useful in such a situation due to the very large number of free parameters. From the perspective of the bias-variance dilemma, DL models, such as Long Short-Term Memory networks (LSTM), will not be particularly useful under the poor data availability circumstances. However, the present work is just the beginning of coupling the strengths of human awareness with those of computational discovery techniques. We expect further research studies on theory-guided machine learning to be directed towards knowledge discovery and automatic model building in hydrological modelling.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.