## ABSTRACT

Baseflow is a vital component of the water balance. The fractured hard rock aquifers of the German low mountain range are in danger of increased water stress due to climate change because they react rapidly to deficits in precipitation and groundwater tables decline sharply. Therefore, simulation software must be able to model baseflow accurately. Three soil moisture simulation and two monthly factor-based baseflow models are evaluated using two calibration strategies. Models were calibrated to total flow (S1) or stepwise to baseflow and then total flow (S2). Results were not significantly different for total flow. Regarding baseflow, S2 proved significantly better with median values (S1 calibration, validation | S2 calibration, validation) of SSE (20.3, 20.3 | 13.5, 13.8), LnNSE (0.15, 0.17 | 0.47, 0.34), and PBIAS (27.8, 21.6 | 2.5, −0.8). Parallel linear reservoir proved best at modeling baseflow with a median SSE (S2: 6.1, 5.9), LnNSE (S2: 0.64, 0.71), and PBIAS (S2: 3.8, 3.8). The new modified monthly factor approach is a simple and robust alternative with SSE (13.0, 13.3), LnNSE (0.61, 0.61), and PBIAS (9.8, −8.6). The results are useful regarding selection of baseflow model structure and calibration strategy in low mountain ranges with fractured hard rock aquifers.

## HIGHLIGHTS

Application and evaluation of the effect of two different calibration strategies on total flow and baseflow modeling.

Implementation and evaluation of two additional soil moisture-based baseflow model structures and a new factor-based baseflow model in BlueM.Sim.

The two parallel linear reservoir models proved to be the best of all evaluated model structures.

The newly factor-based model proved to be simple yet robust.

## INTRODUCTION

Baseflow is defined as the streamflow that originates from groundwater and other delayed sources (Tallaksen 1995; Smakhtin 2001; Price 2011). The terms baseflow and low flow are often used synonymously in the literature (Price 2011). Baseflow plays a vital role in many aspects of integrated water resources management as it sustains the flow in rivers during dry periods and impacts water quality and temperature (Gomez-Velez *et al.* 2015; Hare *et al.* 2021; Miller *et al.* 2021). Climate change threatens to impact baseflow across a variety of catchments with different geological and climatological backgrounds (Wang *et al.* 2014; Hellwig & Stahl 2018; Zhang *et al.* 2019; Grosser & Schmalz 2021, 2023). Hellwig & Stahl (2018) and Grosser & Schmalz (2021) found that in Germany catchments with hard rock aquifers, such as the German low mountain range, are at a higher risk for decreasing seasonal low flows. Additionally, Grosser & Schmalz (2021, 2023) found climate change will lead to an intensification of water stress and prolonged dry periods in the German low mountain range.

Accordingly, accurate assessment of the quantity and modeling of baseflow are essential. One of the most common methods to assess baseflow are recursive digital filters (RDFs) (Lyne & Hollick 1979; Arnold *et al.* 1995; Chapman 1999; Eckhardt 2005) which separate baseflow from observed total flow. RDFs are commonly used for comparison of separated, i.e., observed, baseflow versus simulated baseflow. However, RDFs cannot be used to make predictions of future conditions. Therefore, hydrological models are commonly used as tools for integrated water resources management (Jayakrishnan *et al.* 2005; Daggupati *et al.* 2015) and low flow/baseflow modeling (Luo *et al.* 2012; Nicolle *et al.* 2014; Pfannerstill *et al.* 2014; Stoelzle *et al.* 2015; Parra *et al.* 2019). Luo *et al.* (2012) investigated modeling of baseflow with SWAT via a single linear reservoir and two parallel linear reservoirs (PLRs) in the Tianshan Mountains, China, with calibration being conducted with regard to total observed flow. Pfannerstill *et al.* (2014) investigated a two linear reservoir structure for the shallow aquifer in SWAT as well in the German lowlands (northern Germany) with calibration to total observed flow. Parra *et al.* (2019) examined four different baseflow model structures in the HBV model including linear and nonlinear reservoirs by investigating different geological and climatological settings with calibration conducted with regard to total observed flows. Stoelzle *et al.* (2015) studied nine different conceptual baseflow model structures by directly introducing groundwater seepage, estimated by a water balance model, as input to each model and then calibrating each model to a reference series of baseflow. The two parallel linear reservoir approach was found to be the best of the investigated model structures (Luo *et al.* 2012; Pfannerstill *et al.* 2014) for fractured hard rock aquifers (Stoelzle *et al.* 2015; Parra *et al.* 2019). However, none of these studies examined modeled baseflow depending on baseflow model structure and calibration strategy, i.e., calibration to total flow versus calibration to baseflow and then total flow. Additionally, the aforementioned studies utilized different hydrological models of which each has its own specific data requirements, input parameters as well as mathematical models representing elements of the water cycle. Therefore, there is a need for an investigation of baseflow model structures as well as impact of calibration strategy with a single hydrological model to reduce uncertainty due to differences between individual hydrological models.

Another major objective of integrated water resources management in peri-urban catchments is the assessment of impacts from sewer system emissions on rivers and streams (Braud *et al.* 2013), both regarding water quantity (hydraulic stress) and water quality (acute, delayed, and accumulating effects). The flow stemming from rural catchments is significant for the resulting water quantity and quality in rivers – in dry weather times, this flow is mainly dependent on baseflow.

In the federal state of Hesse, Germany, an official guideline for modeling flow and water quality in rivers is mandatorily applied to all catchments to assess these impacts (HMUELV 2012). A similar approach will be applied to the whole of Germany in the coming years (DWA/BWK 2021). In the Hessian guideline, a hydrological model is coupled with a sewer system model. The impact of emissions out of the sewer system into rivers and streams is assessed regarding hydraulic stress and water quality (HMUELV 2012). Low flow/baseflow is an essential component of the assessment of the impact on water quality as critical conditions are most likely during the hot and dry summer months with low flow rates and lower saturated oxygen concentrations in rivers and streams (HMUELV 2012). The baseflow represents the conditions in the streams and rivers without impacts from the sewer system. Therefore, it determines the underlying boundary conditions for the assessment of the impact on water quality. Presently, a constant low flow is assumed as flow originating from rural catchments (HMUELV 2012) and therefore only incorporates rural catchment hydrology at the most basic level which is a gross oversimplification. The respective software tool for this task is based on the hydrological model BlueM.Sim which is part of the BlueM software package (Bach *et al.* 2009). The package consists of BlueM.Sim, BlueM.Wave and BlueM.Opt (Bach *et al.* 2009). BlueM.Sim is the hydrological simulation core and BlueM.Wave is used for time series analysis and visualization. BlueM.Opt is a tool for multi-objective optimization and auto-calibration. A recent overview of applications of the BlueM software package is given in Kissel *et al.* (2023). Kissel *et al.* (2023) showed that BlueM.Sim is already capable of more accurate baseflow modeling in the German low mountain range by using either an annual baseflow pattern based on monthly factors (factor approach (FA)) or soil moisture modeling (soil moisture approach (SMA)) with seepage from the deepest soil layer used as the input for a single linear reservoir. In their study, Kissel *et al.* (2023) concluded that the FA could be suited for use as part of the Hessian guideline as data requirements are minimal and computation complexity is less compared to the SMA. The current FA, however, cannot account for variation of hydrological conditions outside of the long-term average. The SMA is the only approach in BlueM.Sim which can take varying hydrological conditions and their effect on resulting baseflow into consideration. The widespread application of the SMA for practitioners is unfeasible due to the data requirements, data complexity as well as the complexity of calibration and length of computing time (Kissel *et al.* 2023). The SMA is, however, of great interest for ongoing research purposes in the Fischbach catchment (FIS) in the federal state of Hesse, Germany, which is a representative catchment for the German low mountain range (Schmalz & Kruse 2019). Based on the findings in the literature and the limitations identified with regard to Germany-wide detailed soil moisture modeling for baseflow estimation, the following research questions will be addressed in this study:

– SMA: Can the implementation of alternative baseflow model structures improve baseflow modeling in a typical German low mountain range catchment?

– FA: Can a novel simple modification allowing for consideration of dry, normal, and wet years improve baseflow modeling with the FA and thus improve assessment of hydraulic stress and water quality impacts in peri-urban catchments under changing boundary conditions?

– How does the calibration strategy affect the achieved baseflow and total flow modeling results?

The results of this study are relevant for researchers and practitioners in the German low mountain range. Application of the findings of this study can lead to improved water management strategies to meet future challenges due to climate change. In Hesse (HMUELV 2012) and Germany (DWA/BWK 2021) the results can lead to an improved assessment of the impact of emission from sewer systems and hence to improved management strategies for water quality of aquatic ecosystems. The findings regarding baseflow modeling can be transferred and adapted to studies in any low mountain range with similar characteristics.

## METHODS

### Study area

^{2}and elevations range from 600 to 160 m a.s.l. Geologically FIS is part of the Odenwald crystalline basement complex. Baseflow stems from a hard rock aquifer which is partly overlain with weathered bedrock, which acts as an additional shallow aquifer (HLNUG 2017). Lateral subsurface flow in the vadose zone (interflow) forms the majority of direct runoff, which is defined as the sum of surface runoff and interflow (HLNUG 2017). The baseflow index (BFI), which is the ratio of mean baseflow to mean total flow, was found to be in the range of 0.4–0.5 (Kissel & Schmalz 2020). Land use classification is mostly forests and vegetation (51.7%), followed by agricultural land (41.8%) and settlements (6.5%) (Schmalz & Kruse 2019). Mean annual precipitation for FIS based on data from 1971 to 2000 is 981 mm and mean annual actual evapotranspiration is 650 mm (HLNUG 2017). A long-term flow gage is located at the catchment outlet. FIS is an active research catchment of the Chair of Engineering Hydrology and Water Management of the Technical University of Darmstadt. Further information on the catchment and the scientific activities in this field lab can be found in Schmalz & Kruse (2019), David & Schmalz (2020), Grosser & Schmalz (2021) and Scholand & Schmalz (2021).

### Datasets

The datasets of daily records used in Kissel *et al.* (2023) are the basis for this study. The datasets consist of flow, precipitation, and temperature data from 1989–2002. The precipitation data stems from the three closest gaging stations outside of the FIS and an aerial average was computed using the Thiessen method (Thiessen 1911). Additionally, the BlueM.Sim dataset used in Kissel *et al.* (2023) is applied and modified for this study. The datasets are presented in full detail in Kissel *et al.* (2023). Calibration is conducted for the years 1989–1997 and the models are validated for the years 1998–2002. For the modification of the FA, flow, and precipitation data from 1985 to 2002 is used to have an evenly distributed sample of dry, normal, and wet years.

### Recursive digital filtering

The Eckhardt filter (Eckhardt 2005) is used to separate daily baseflow from total observed flow. The two parameters of the Eckhardt filter that need to be estimated before its application are *BFI _{max}* and

*α*.

*BFI*is the maximum BFI that can be separated by the Eckhardt filter.

_{max}*α*is the filter parameter and it is equivalent to the recession constant, which describes baseflow recession during dry periods (Eckhardt 2005).

*BFI*is set to 0.46 (Kissel

_{max}*et al.*2023) and

*α*is set to 0.976 (Kissel & Schmalz 2020) for FIS in this study.

### The BlueM.Sim model: current state and newly implemented and developed approaches

BlueM.Sim (Bach *et al.* 2009) offers two methods for modeling rainfall runoff and baseflow from catchments. The first model setup, FA, uses the curve number (CN) method (USDA 1954) modified according to Zaiß (1989) to enable continuous simulation of direct runoff. The modified CN method is described in Zaiß (1989) and Kissel *et al.* (2023). In the FA, baseflow is modeled by multiplying the mean baseflow with monthly factors *f _{i}* allowing for annual baseflow patterns. Total runoff from a catchment is then computed as the sum of direct runoff and baseflow.

is the change in soil moisture in a time step. The complete soil moisture model is documented in Bach (2011). Surface runoff is the difference of precipitation – infiltration into the topsoil layer if precipitation > infiltration. The surface runoff is routed via a cascade of PLRs to the catchment outlet. Interflow and baseflow are routed to the catchment outlet via single linear reservoirs for each component (Bach 2011; Kissel *et al.* 2023). The recharge *R* for the baseflow reservoir is the seepage from the deepest soil layer. Total runoff at the catchment outlet is the sum of surface runoff, interflow and baseflow.

Total flow from a catchment is routed in river segments using the Kalinin-Miljukov method (Rosemann & Vederal 1970) in both the FA and the SMA.

In this study, a novel method for the FA and two alternative baseflow models for the SMA are implemented in BlueM.Sim, see Table 1. The two alternative baseflow models for the SMA are PLRs and a nonlinear reservoir (NLR) according to Ostrowski (1982). The parameters and relevant equations of PLR and NLR are given in Table 1. For the PLR, recharge *R* is distributed into each individual linear reservoir as shown in Equations (2) and (3) based on the distribution factor *a*. Total baseflow is calculated according to Equation (4). The NLR uses a single reservoir in which the storage outflow relationship is modeled according to Equation (5). The exponent *n* was found to be between 0.3 and 1.1 by Wittenberg (1999). The modified factor approach (MFA) classifies individual years *j* into dry, normal, and wet years based on their annual sum of precipitation P_{Sum,j}. For each class, an annual pattern of monthly baseflow factors *fi(c)* can be defined to reflect baseflow variation due to prevalent hydrological conditions in a year. The methodology of this study is compared to the previous study by Kissel *et al.* (2023) in Supplementary material, Appendix A.

Model . | Equations . | Parameters . | Assumptions . |
---|---|---|---|

PLR | (2) (3) (4) | k, _{1}k, _{2}a | |

NLR | (5) | k, n | (Wittenberg 1999) |

MFA | (6) | c, | described in the text |

Model . | Equations . | Parameters . | Assumptions . |
---|---|---|---|

PLR | (2) (3) (4) | k, _{1}k, _{2}a | |

NLR | (5) | k, n | (Wittenberg 1999) |

MFA | (6) | c, | described in the text |

*Q _{b}* = baseflow [m

^{3}/s]; = mean baseflow discharge rate [l/(s*km

^{2})];

*A*= area of a subcatchment [km

^{2}];

*c*= range parameter [%] to classify a year as dry, normal, or wet regarding its annual sum precipitation;

*f*

_{i}(c)*=*monthly factor of the annual pattern for dry, normal, or wet years;

*k, k*= recession constant [h];

_{1}, k_{2}*S, S*= baseflow storage [m

_{1}, S_{2}^{3}];

*a*

*=*distribution factor;

*n*= exponent of the nonlinear reservoir.

The novel approach to estimating the factors *f _{i}(c)* for the MFA is outlined in the following:

2. Calculate mean baseflow A over the entire time period using the time series separated with the Eckhardt filter.

3. Calculate mean baseflow of each month

*i*of each individual year*j*in the entire time period.5. Calculate mean monthly factors

*f*for dry, normal, and wet years by calculating the mean of each class._{i}(c)6. Verify that the

*f*adhere to dry < normal < wet._{i}(c)

### Calibration with BlueM.Opt

Auto-calibration is conducted with BlueM.Opt (Bach *et al.* 2009) using a Parametric Evolution Strategy (PES) (Rechenberg 1973; Schwefel 1993). The PES aims to find Pareto optimal solutions (parameter sets) by mimicking evolutionary processes, namely mutation and recombination, over many generations of individuals (parameter sets) by considering one or more objective functions (Muschalla *et al.* 2008). The Pareto optimal solutions are the set of individuals which are non-dominated. ‘The solutions known as non-dominated or Pareto-optimum solutions are those which are superior to all other possible solutions in the search space when all objectives are considered. The set of non-dominated solutions has been found when no other solution exists that improves simultaneously all objective functions’ (Muschalla *et al.* 2008, p. 1550). To account for parameter uncertainty all input parameters are given a plausible range based on standard references, e.g., the Bodenkundliche Kartieranleitung (Ad-hoc-Arbeitsgruppe Boden 2005) for soil parameters in Germany. This parameter space is explored by the PES through mutation and recombination. Additionally, rules are defined in BlueM.Opt to ensure that parameters generated through random mutation form a physically meaningful parameter set, e.g., the total pore volume of the soil must be greater than the field capacity (the capacity to retain water against the gravimetric pull) of the soil. In this study, two calibration strategies using the PES are applied. The first strategy (S1) follows the calibration procedure in Kissel *et al.* (2023) in which simulated total flow at the outlet of FIS is fitted to total observed flow. The second strategy (S2) applies stepwise calibration in which simulated baseflow is fitted to separated baseflow and afterwards the models are calibrated again for total flow without varying the parameters of the baseflow models. In Table 2, all objective functions which are used for calibration and selection of favorite solutions are presented. The objective functions for total flow are the same as in Kissel *et al.* (2023). BlueM.Sim has been modified in this study to allow for direct calibration of the baseflow component. The separated baseflow of the Eckhardt filter is used as a proxy for the unknown real baseflow. The objective functions for baseflow do not include the absolute peak difference (APD) but rather the volume error (PBIAS) because Kissel *et al.* (2023) noted that the volume of baseflow component was overestimated with the SMA. The Nash-Sutcliffe Efficiency is calculated in addition to the other objective functions, but not used for calibration, as it one of the most reported objective functions in the literature. The SSE and APD are sensitive to differences in peak flows and their consideration as objective function contributes toward the selection of parameter sets that tend to replicate the peak flows better. The LnNSE is less sensitive to peak flows through the logarithmic transformation of observed and simulated flows and therefore puts a stronger emphasis on mid to low flows (Krause *et al.* 2005; Pushpalatha *et al.* 2012). The choice of objective functions is based on (a) comparability to the previous study by Kissel *et al.* (2023), (b) the available objective functions in the BlueM software package and (c) the SSE, LnNSE and PBIAS are among the widest spread objective functions in hydrology. Other objective functions have been suggested in the literature such as the Kling–Gupta Efficiency (Gupta *et al.* 2009). Merits and limitations of specific objective functions have been discussed extensively in the literature, e.g., Krause *et al.* (2005), Gupta *et al.* (2009), Ritter & Muñoz-Carpena (2013), Liu (2020) and Onyutha (2022). The classifications of the achieved goodness of fit of the different objective functions as unsatisfactory, satisfactory, good and very good are based on Moriasi *et al.* (2007). As a set, the objective functions will be addressed as goodness of fit criteria (GOFC) in this study. In addition to the GOFC, the plots of observed and simulated flows as well as the flow duration curves (FDCs) of observed and simulated flows are used for favorite selection.

Objective function . | Equation . | Total flow . | Baseflow . | Favorite selection . |
---|---|---|---|---|

Sum of squared errors (SSE) | (9) | X | X | X |

Logarithmic Nash-Sutcliffe Efficiency (LnNSE) | (10) | X | X | X |

Absolute Peak Difference (APD) | (11) | X | X | |

Volume error (PBIAS) | (12) | X | X | |

Nash-Sutcliffe Efficiency (NSE) | (13) | X |

Objective function . | Equation . | Total flow . | Baseflow . | Favorite selection . |
---|---|---|---|---|

Sum of squared errors (SSE) | (9) | X | X | X |

Logarithmic Nash-Sutcliffe Efficiency (LnNSE) | (10) | X | X | X |

Absolute Peak Difference (APD) | (11) | X | X | |

Volume error (PBIAS) | (12) | X | X | |

Nash-Sutcliffe Efficiency (NSE) | (13) | X |

Q_{obs} = observed flow, Q_{sim} = simulated flow.

### Statistical analysis

To aid in the determination of a potentially optimal model structure or calibration strategy several statistical tests are performed. The GOFC of S1 and S2 as well as the GOFC of the baseflow model structures are analyzed regarding their statistical significance. Each GOFC of S1 and S2, respectively, is grouped and compared via a two-sided Mann–Whitney U-Test (Mann & Whitney 1947) to determine if there is a statistically significant difference between a GOFC of the S1 and S2 group.

– Null hypothesis: The respective GOFC of the S1 and S2 groups do not differ.

– Alternative hypothesis: The respective GOFC of the S1 and S2 groups differ.

The individual model structures are analyzed for each GOFC via a Kruskal–Wallis Test (Kruskal & Wallis 1952) for S1 and S2, respectively, to determine if the groups of GOFC differ.

– Null hypothesis: The respective GOFC of the baseflow model structures do not differ for calibration strategy S1 or S2.

– Alternative hypothesis: The respective GOFC of the baseflow model structures differ for calibration strategy S1 or S2.

The Kruskal–Wallis Test can only indicate if one of the groups differs but not which one. A potential differing group is analyzed via a one-sided Mann–Whitney U-Test by pairing the individuals of a group in every permutation.

– Null hypothesis: Baseflow model structure A has equal or smaller values regarding a specific GOFC than baseflow model structure B.

– Alternative hypothesis: Baseflow model structure has greater values regarding a specific GOFC than baseflow model structure B.

A significance level of *p* = 0.05 is applied for all statistical tests.

Chosen favorite solutions for each baseflow model structure are analyzed for their capabilities to reproduce extreme high and low flows via an extreme value analysis. The highest (lowest) annual simulated and observed flows are extracted for both total flow and baseflow. A theoretical distribution function is fit to the series of annual peak (low) flows and flows for the 1a, 2a, 5a, 10a, 25a, and 50a return period are estimated and compared to the estimated flows derived from observed flows. A more detailed explanation of an extreme value analysis is, e.g., given in Onyutha (2019). The return period was limited to 50 years as the total available time series is only 14 years.

## RESULTS

*c*was set to 10% to define the upper and lower bounds separating the three classes according to Equations (7a)–(7c). The driest and wettest years are 1991 and 2002, respectively. Each class contains six years. (1985–2002) is 0.164 m

^{3}/s. For each year mean monthly baseflows were calculated and converted to mean monthly factors according to Equation (8). The of each year are shown in Figure 2(b)–2(d) as grey lines. Mean monthly factors for each class were derived by averaging the corresponding . The of January 1991,1994 and 2002 were excluded from the average as their unusually high values are artifacts of the boundary condition used for the starting baseflow value of the Eckhardt Filter in which it is assumed that the first value of the baseflow series is equal to the first value of the observed total flow series. In Figure 2(e), the of each class are compared. The green and blue lines, representing the normal and wet class, respectively, have distinct peak values in the spring months. The peak value of the dry class (red line) is substantially lower. All three classes reach their minimum value in September. The factors of the normal class are closer to the factors of the wet class during the winter and spring months. For the typically dryer summer and autumn months, the normal class factors are closer to the dry class factors.

*n*close to 1.0, which is the exponent in all SMA models. The median exponents of NLR_S1 and NLR_S2a are 0.94 and 0.97, respectively. The model MFA_* is an MFA model in which the estimated dry, normal, and wet baseflow factors are applied as estimated, see Figure 2, and only the parameters relating to direct runoff are calibrated.

### Calibration

#### Baseflow hydrographs

The baseflow hydrographs of the chosen favorite solutions in the calibration period are depicted in Figure 3. Figure 3(a) depicts the hydrographs of the models which were calibrated according to S1 as well as the MFA_* model. The PLR_S1 model reflects the dynamic in the baseflow separated by the Eckhardt filter the best out of the S1 models. However, all soil moisture-based models (NLR_S1, PLR_S1, SMA_S1) overestimate the amount of baseflow. FA_S1 and MFA_S1 can reproduce lower baseflows in 1990–1994 but fail to capture the lower baseflows in the wetter years from 1994–1995. In contrast, MFA_* can reproduce the lower baseflow values in these periods reasonably well. Similarly, MFA_* is better at reproducing high baseflow phases compared to FA_S1 and MFA_S1. For the year 1992, MFA_* greatly overestimates baseflow. However, compared to FA_S1 and MFA_S1, it optically fits the baseflow hydrograph the best of the factor-based models. Figure 3(b) depicts the hydrographs of all the S2a models. The factor-based models FA_S2a and MFA_S2a struggle with the same problems as their S1 counterparts. However, the optical difference between FA_S2a and MFA_S2a is less than the difference of FA_S1, MFA_S1 and MFA_*. The only optically quite significant difference between FA_S2a and MFA_S2a is in 1995. Similarly, the NLR_S2a and SMA_S2a models are virtually indistinguishable from each other. The PLR_S2a model clearly fits the separated baseflow the best. The overestimation of baseflow is significantly reduced in the soil moisture-based S2a models compared to their S1 counterparts. However, baseflow in the dry years 1990 and 1991 is still overestimated. The baseflow hydrographs of the S2b models, see Figure 3(c), lie optically between the hydrographs of the S1 and S2a models. Essentially, the same observations can be made as in the description of the S2a hydrographs.

#### Baseflow FDCs

The FDC of the favorite solutions of each model and calibration strategy are shown in Figure 4. In Figure 4(a) the FDC of the PLR_S1 and SMA_S1 reveal that the baseflow is generally overestimated by these models. This is especially true for the baseflows with a percentage of exceedance (PE) between 0–40%. The FA_S1 and MFA_S1 models tend to underestimate baseflow with a PE between 0–15%. The NLR_S1 and MFA_* models underestimate the baseflows with a PE of less than 10%. The soil moisture models mostly overestimate the baseflow with PE > 50%. However, the overestimation is less when compared to the overestimation in the section of the FDC with < 50% PE. Only the MFA_S1 model tends to underestimate baseflows with PE > 65%. The FDCs of the S2a models are depicted in Figure 4(b). In comparison to the FDC in Figure 4(a), the FDC in Figure 4(b) are much closer to the FDC of the separated baseflow. The PLR_S2a model fits the FDC of the separated baseflow very well. The FDCs of the NLR_S2a and SMA_S2a models are almost identical and match the FDCs of the separated baseflow well. However, they do not match it as well as the PLR_S2a model in the section with 0–15% PE as well as in the 25–80% PE section of the FDC. The factor-based models (FA_S2a, MFA_S2a) perform similarly to their S1 counterparts. Figure 4(c) depicts the FDCs of the S2b models. The FDCs of these models optically lie between the FDCs of the S1 and S2a models. The FDCs of the FA_S2b, MFA_S2b and SMA_S2b models are very similar to their S2a counterparts. The FDC of PLR_S2b shows more overestimation in the 5–40% PE section of the FDC compared to PLR_S2a, but high and low baseflows are reproduced well.

#### Total flow: GOFC boxplots

Note: The classifications of the GOFC are summarized in Supplementary material, Appendix B for an easier overview.

The GOFC NSE, LnNSE, APD, PBIAS and SSE of the models FA_S2a and MFA_S2a are not shown in Figure 5. This is because only the parameters regarding baseflow are calibrated in S2a, however baseflow and direct runoff are decoupled in these models. Therefore, GOFC regarding total flow are not of interest for these models, whereas there is a direct connection between all water balance components in models using soil moisture simulation (SMA, PLR, NLR). The red ‘x’ in Figure 5 marks the GOFC of the chosen favorite solution for each model. GOFC with the Suffix ‘_Qb’ are the GOFC of simulated and separated baseflow. Regardless of calibration strategy, all models, except for MFA_S1, were able to achieve at minimum a median NSE > 0.5 (satisfactory) as all the boxplots of the remaining models are above the 0.5 mark. FA_S1, MFA_S2b, FA_S2b and MFA_* have a satisfactory median NSE in the range of 0.5 < NSE_{median} ≤ 0.65. The remaining models, except for SMA_S1, achieved a good median NSE in the range of 0.65 < NSE_{median} ≤ 0.75. SMA_S1 achieved a very good median NSE with a value of 0.77. A similar general summary can be made for the LnNSE as well. All models, except for MFA_S1, performed at least satisfactorily. Notably, only the soil moisture-based models (NLR_S1, PLR_S1, SMA_S1 and PLR_S2b) were able to achieve good median LnNSE values, of which three models were calibrated according to S1 and only PLR_S2b according to S2. FA_S1, NLR_S2a, MFA_S2b, FA_S2b and MFA_* have median SSE values in the range of 100 ≤ SSE_{median} ≤ 200. MFA_S1 has a median SSE > 200. All other models have a median SSE < 100. With regards to PBIAS the performance of MFA_S1 is unsatisfactory with a median of 27.04%. All other S1 models have very good median values of less than ±10%. NLR_S2a, PLR_S2a, SMA_S2a and MFA_* have good median PBIAS values in the range of ±10% ≤ PBIAS_{median} ≤ ±15%. All S2b models achieved very good median PBIAS values. The factor-based models (FA, MFA) all have median APD values < 1 m^{3}/s. All soil moisture-based models (NLR, PLR, SMA), except for PLR_S1, have median APD values > 2 m^{3}/s. PLR_S1 has a median APD value < 1 m^{3}/s.

#### Baseflow: GOFC boxplots

Note: The classifications of the GOFC are summarized in Supplementary material, Appendix B for an easier overview.

Contrary to the median NSE values of total simulated flow versus observed total flow, all S1 models have unsatisfactory median NSE_Qb values below 0.5, see Figure 5. The MFA and FA models calibrated according to S2 performed unsatisfactorily as well. The MFA_*, NLR_S2a and SMA_S2b models achieved satisfactory median NSE_Qb values with 0.55, 0.64 and 0.54, respectively. Good median NSE_QB values are found for SMA_S2a (0.67) and PLR_S2b (0.72). The PLR_S2a model achieved a very good median NSE_Qb with 0.79. Only three models (PLR_S2a, PLR_S2b and MFA_*) were able to achieve satisfactory median LnNSE_Qb values > 0.5. All other models are classified as unsatisfactory with regard to the LnNSE_Qb. The highest median SSE_Qb values are found for the models MFA_S1 and FA_S1 with approximately 30 and 23. The lowest median SSE_Qb were achieved by the PLR_S2a and PLR_S2b models with approximately 6 and 8, respectively. All other models have median SSE values in the range of 10–20. Regarding PBIAS_Qb only the models PLR_S1 and SMA_S1 performed unsatisfactorily with median values > 25%. The MFA_S1 and NLR_S1 models performed satisfactorily with median PBIAS_Qb values of 15.3% and 23.4%, respectively. The PLR_S2b model achieved a good median PBIAS_Qb with 14.2%. All other models achieved a very good median PBIAS_Qb of less than 10%.

### Validation

#### Baseflow hydrographs

Figure 6 is split into three subplots a-c. Hydrographs of all S1 models as well as the MFA_* model are shown in Figure 6(a). The factor-based S1 models underestimate baseflow whereas the soil moisture-based models overestimate it. FA_S1 generally underestimates baseflow, regardless of high or low baseflow conditions. MFA_S1 also does not fit the baseflow hydrograph satisfactorily. However, MFA_* reproduces the lower baseflows well and simulates high baseflows reasonably well. The NLR_S1 model reacts too slowly. It does not display sharp peaks as the separated baseflow does and the receding limbs decline too gradually. SMA_S1 matches the general form of the separated baseflow but does not display the sharply defined peaks. The PLR_S1 model reproduces the dynamic of the separated baseflow very well but it generally overestimates the amount of baseflow. In Figure 6(b), the hydrographs of the S2a models are shown. As in the calibration period, the overestimation of baseflow by the soil moisture-based models is significantly reduced compared to their S1 counterparts. The SMA_S2a and NLR_S2a models are nearly indistinguishable from each other. The PLR_S2a model optically fits the separated baseflow the best of all the S2a models. The MFA_S2a and FA_S2a models produce similar hydrographs except for the year 2002. Both factor-based models underestimate baseflow, as is the case in the S1 models as well. The hydrographs of the S2b models in Figure 6(c) are optically between those of the S1 and S2a models. The PLR_S2b model optically outperforms the other S2b models.

#### Baseflow FDCs

The FDCs of the validation period are shown in Figure 7. Figure 7(a) depicts the FDC of all the S1 models as well as MFA_*. Both PLR_S1 and SMA_S1 overestimate baseflow over the entire spectrum, whereas FA_S1 underestimates it. However, the form of the FDC of PLR_S1 is very similar to the FDC of the separated baseflow. The MFA_S1 model underestimates baseflow for PE < 50% and overestimates it for PE > 50%. MFA_* and NLR_S1 both underestimate baseflow for PE < 20% but the FDC of the MFA_* model fits the FDC of separated baseflow well for PE > 20%, whereas the NLR_S1 model overestimates baseflow for PE > 20%. Figure 7(b) depicts the FDC of the S2a models. The soil moisture-based models all reproduce the FDC of the separated baseflow very well. As was the case in the calibration period, the FDCs of NLR_S2a and SMA_S2a are nearly identical. Neither model simulates the highest baseflows (PE < 5%) well. The PLR_S2a model, however, also reproduces the baseflows for PE < 5% very well. FA_S2a underestimates baseflow over the entire spectrum, as does its S1 counterpart. MFA_S2a underestimated baseflow for PE < 40% and for PE > 80%. The S2b models generally adhere to the findings made for the S2a models with a few exceptions, see Figure 7(c). The PLR_S2b model now tends to overestimate the baseflow for PE > 5%, but the general form of the FDC matches the form of the FDC of the separated baseflow quite well. SMA_S2b matches the baseflow very well for 5% < PE < 35% but fails to reproduce high baseflows (PE < 5%) and overestimates baseflow with a PE > 35%.

#### Total flow: GOFC boxplots

Note: The classifications of the GOFC are summarized in Supplementary material, Appendix B for an easier overview.

The boxplots of the GOFC in the validation period are depicted in Figure 8. Most models achieved similar or better median NSE values in the validation period. The notable exceptions are the factor-based models FA_S1, MFA_S2b, FA_S2b and MFA_* which have median NSE values < 0.5. The median LnNSE of all models is at least satisfactory, except for once again MFA_S1. The soil moisture-based models (NLR, PLR, SMA) and the MFA_* model achieved at least good median LnNSE values. Very good median LnNSE values were achieved by the PLR_S1 and PLR_S2b models. MFA_S1 has a median SSE > 300, whereas only soil moisture-based models have a median SSE < 100. FA_S1, NLR_S2a, SMA_S2a, MFA_S2b, FA_S2b and MFA_* have median SSE values in the range of 100 < SSE_{median} ≤ 200. Median PBIAS values remain similar to the calibration period. MFA_S1 has a good median PBIAS, whereas it was classified as unsatisfactory in the calibration period. The median PBIAS of FA_S1a, MFA_S2b and FA_S2b were > 0 in the calibration and are all < 0 in the validation period. For the APD, the general summary is identical to that of the calibration period. S1 models as well as the factor-based models MFA_S2b and FA_S2b have the lowest median APD values (<1.0 m^{3}/s). Of the soil moisture-based models, only PLR_S1 and PLR_S2b have median APD values < 1.0 m^{3}/s.

#### Baseflow: GOFC boxplots

As in the calibration period, most models achieved unsatisfactory median NSE_Qb values. Three models achieved a satisfactory median NSE_Qb. These models are the NLR_S2a, SMA_S2a and MFA_* models with 0.59, 0.61 and 0.51, respectively. In stark contrast to all other models, the PLR_S2a and PLR_S2b models both achieved very good median NSE_Qb values with 0.78 and 0.75, respectively. Similarly, all models, except for PLR_S2a, PLR_S2b and MFA_*, achieved unsatisfactory median LnNSE_Qb values as well. PLR_S2a, PLR_S2b and MFA_* achieved good, satisfactory, and satisfactory median LnNSE_Qb. In the validation period, all FA models have the highest median SSE_Qb values with FA_S1 approximately 31, FA_S2a with 28 and FA_S2b with 26. Similarly, the MFA_S1, MFA_S2a and MFA_S2b models have median SSE_Qb values > 20. Only the PLR_S2a and PLR_S2b models have median SSE_Qb values of less than 10. All other models have median SSE values in the range of 10–20. All FA models performed unsatisfactorily regarding their median PBIAS_Qb. However, the MFA models achieved satisfactory (MFA_S2a, MFA_S2b) and very good (MFA_S1, MFA_*) median PBIAS_QB. In general, the S1 models performed worse than the S2 models. While the median PBIAS_QB of NLR_S1 is still satisfactory, both PLR_S1 and SMA_S1 are classified as unsatisfactory. The lowest median PBIAS_QB are found for NLR_S2a, PLR_S2a and SMA_S2a with values ≤ 5%. The median PBIAS_QB of PLR_S2b is classified as good, whereas SMA_S2b achieved a very good classification.

### Statistical analysis

The two-sided Mann–Whitney U-Test of the grouped S1 and S2 median GOFC showed that regarding total flow, the NSE, LnNSE and SSE values are not significantly different for both the calibration and validation periods. Only the PBIAS and APD values of S1 and S2 are significantly different regarding total flow. All of the GOFC regarding baseflow are significantly different between the S1 and S2 groups. The GOFC regarding baseflow of S2 are mostly better than their S1 counterparts. The results of the two-sided Mann–Whitney U-Test are shown in Supplementary material, Appendix C. To determine if there are significant differences between median GOFC of total flow and baseflow as a result of baseflow model structure within a calibration strategy and either the calibration or validation period, a Kruskal–Wallis Test was performed. The result indicates that there are significant differences between the baseflow model structures regardless of calibration strategy and whether the calibration or validation period is considered, see Supplementary material, Appendix D. The Kruskal–Wallis Test does not specify which baseflow model structures differ. Based on the findings in Sections 3.1 and 3.2, the MFA_* and PLR baseflow model structures were selected for further investigation to determine if they are significantly different, i.e., better, than the other baseflow model structures. To this end, the median GOFC of the MFA_*/PLR baseflow models were paired with the other baseflow model structures and a one-sided Mann–Whitney U-Test was performed. The results are given in Supplementary material, Appendix E and Appendix F. The central findings from the paired one-sided Mann–Whitney U-Tests are: (1) The PLR model is significantly better in reproducing the dynamic of the separated baseflow time series compared to all other factor-based and soil moisture-based models. (2) The PLR model is however not significantly better at reproducing total baseflow volume. (3) These findings are more pronounced for the PLR model calibrated according to S2. (4) The MFA_* model is generally significantly better at baseflow modeling than the other factor-based models. (5) The soil moisture-based approaches are only significantly better at baseflow modeling than MFA_* if calibrated according to S2. The results of the extreme value analysis are given in Supplementary material, Appendix G and Appendix H. For S1, the factor-based models MFA_S1 and MFA_* estimated extreme high total flows closest to the estimations from observations. For extreme low total flows, the soil moisture-based models SMA_S1 and PLR_S1 give the best estimations. Extreme high baseflows are almost identical for the SMA_S1 model and the estimations from observations. The PLR_S1 model gives decent estimates and the MFA_* model ranks third. Extreme low baseflow values are best estimated by the SMA_S1 model. For S2, the FA_S2b model best estimates extreme high total flows compared to the estimations from observations. The SMA_S2b model performs worst. The other S2b models give very similar estimates. Regarding extreme low total flows, the SMA_S2b model performs best and the FA_S2b and MFA_* models perform worst. The other S2b models are very similar to each other. For extreme high baseflows, the PLR_S2b model is best with the SMA_S2b and MFA_* model in rank 2 and rank 3. For extreme low baseflow values, the MFA_S2b and SMA_S2b models perform best with the PLR_S2b model in third rank. In general, the soil moisture-based models are better suited for estimation of extreme baseflow values as well as low total flow values.

## DISCUSSION

### Derivation of monthly factors for the MFA

The central influence on the MFA is the range parameter *c,* which is used to classify the years into dry, normal, and wet based on their yearly sum of precipitation. In this study, *c* was set to 10%. This value was found by first manually classifying the years by considering the bar plots of the yearly sums and the long-term mean yearly precipitation. Afterwards the parameter *c* was varied to achieve as close a match as possible between the classification based on Equations (7a)–7(c) and the manual classification. The manual classification is compared to the classification of the three closest precipitation gaging stations to FIS (Table 3). The closest stations are Reinheim, Reichelsheim and Lautertal/ Odw-Reichenbach (Kissel *et al.* 2023). The federal state of Hesse classifies years based on the deviation of the yearly precipitation sum compared to a reference period (1991–2020) either as extremely dry, very dry, substantially dryer, dryer, wetter, substantially wetter, very wet, and extremely wet (HLNUG 2023). As the classification in this study contains a class for normal and the classes of the federal state of Hesse does not, the assumption was made that normal years are valid if the classification of the individual stations is made up of both classes from the dry and wet spectrum. Dry years are valid if the individual classifications stem only from the dry classes and conversely wet years are valid if the individual classifications stem only from the wetter classes. Based on these assumptions, the classification with parameter *c* and Equations (7a)–7(c) hold true for 15 years of the 18-year time period. All years (1999, 2000, 2001) that do not fit the assumptions underlying the classification are classified as normal. The classification of these years as normal is a result of the Thiessen method (Thiessen 1911), which was used to obtain a mean precipitation series by Kissel *et al.* (2023) since the closest precipitation stations do not lie within the FIS catchment. In general, it can be concluded that the classification based on Equations (7a)–7(c) fits the official classifications very well. The timeframe of years for classification was expanded from 1989–2002 to 1985–2002 in order to have an equal distribution of dry, normal and wet years. The factors for dry, normal and wet years shown in Figure 2(e) fit into the typical flow regime of the FIS catchment and the German low mountain range with highest baseflows found in the winter and spring months and the lowest in the late summer and early autumn months (HLNUG 2017; Kissel & Schmalz 2020). Stoelzle *et al.* (2020) show a similar annual pattern of Pardé coefficients (ratio of long-term mean monthly flow to long-term mean annual flow) for rainfall dominated catchments with mean elevations below 1,000 m.a.s.l. in Germany. Therefore, the coefficients of monthly baseflow in this study can be interpreted as Pardé coefficients of the baseflow component.

Year . | Lautertal/Odw-Reichenbach ^{a}
. | Reichelsheim ^{a}
. | Reinheim ^{a}
. | This study . |
---|---|---|---|---|

1985 | substantially dryer | dryer | dryer | dry |

1986 | wetter | very wet | very wet | wet |

1987 | extremely wet | substantially wetter | very wet | wet |

1988 | wetter | substantially wetter | wetter | wet |

1989 | dryer | wetter | wetter | normal |

1990 | dryer | dryer | dryer | dry |

1991 | very dry | very dry | very dry | dry |

1992 | dryer | dryer | substantially wetter | normal |

1993 | dryer | dryer | dryer | dry |

1994 | wetter | dryer | dryer | normal |

1995 | substantially wetter | wetter | very wet | wet |

1996 | dryer | dryer | dryer | dry |

1997 | substantially dryer | dryer | dryer | dry |

1998 | substantially wetter | wetter | substantially wetter | wet |

1999 | wetter | wetter | wetter | normal |

2000 | substantially wetter | wetter | wetter | normal |

2001 | substantially wetter | wetter | substantially wetter | normal |

2002 | very wet | very wet | very wet | wet |

Year . | Lautertal/Odw-Reichenbach ^{a}
. | Reichelsheim ^{a}
. | Reinheim ^{a}
. | This study . |
---|---|---|---|---|

1985 | substantially dryer | dryer | dryer | dry |

1986 | wetter | very wet | very wet | wet |

1987 | extremely wet | substantially wetter | very wet | wet |

1988 | wetter | substantially wetter | wetter | wet |

1989 | dryer | wetter | wetter | normal |

1990 | dryer | dryer | dryer | dry |

1991 | very dry | very dry | very dry | dry |

1992 | dryer | dryer | substantially wetter | normal |

1993 | dryer | dryer | dryer | dry |

1994 | wetter | dryer | dryer | normal |

1995 | substantially wetter | wetter | very wet | wet |

1996 | dryer | dryer | dryer | dry |

1997 | substantially dryer | dryer | dryer | dry |

1998 | substantially wetter | wetter | substantially wetter | wet |

1999 | wetter | wetter | wetter | normal |

2000 | substantially wetter | wetter | wetter | normal |

2001 | substantially wetter | wetter | substantially wetter | normal |

2002 | very wet | very wet | very wet | wet |

^{a}*Source*: HLNUG (2023).

### Using separated baseflow time series as proxy values for the unknown actual baseflow

As the actual baseflow is typically unknown, numerous studies have used various separation techniques to partition observed total flow into baseflow and direct runoff for the purpose of either calibration or evaluation of baseflow modeling. RDFs are the most used separation technique for this purpose (Zhang *et al.* 2011; Luo *et al.* 2012; He *et al.* 2015; Jang *et al.* 2018; Onyutha 2019). This is despite RDF often being characterized as non-physical, e.g., Kang *et al.* (2022). However, Eckhardt (2023) showed that the Eckhardt filter can be formulated to be equivalent to the physically based filter of Furey & Gupta (2001). Independently of this issue, research has shown that different separations techniques can result in quite significant differences in BFI estimates as well as separated timeseries (Gonzales *et al.* 2009; Lott & Stewart 2016; Kissel & Schmalz 2020). Accordingly, goodness of fit to the separated baseflow can be significantly impacted by the choice of separation method. For FIS, Kissel & Schmalz (2020) concluded that the Eckhardt filter with an estimated BFI_{max} using the Kille method is a valid approach. With the assumption that on the annual scale baseflow is equal to the groundwater recharge (Piggott *et al.* 2005), the estimation of BFI_{max} via the Kille method corresponds to the method of estimating BFI_{max} as the ratio of mean groundwater recharge to mean streamflow as suggested by Eckhardt (2023).

### Impact of calibration strategy on total flow and baseflow modeling

Two calibration strategies were applied in this study. The first (S1) is a traditional calibration strategy in which the objective functions are optimized regarding the total observed streamflow. Individual processes or water balance components are not calibrated. The objective is simply to achieve the best possible fit between total observed and simulated flow. The second strategy (S2) is based on a stepwise calibration of baseflow and then total flow in which the exclusively baseflow related parameters are not included in the calibration of the total flow. S2 can be classified as a stepwise single-pass calibration following the classification by Daggupati *et al.* (2015). In the literature, both stepwise and simultaneous calibration, in which separated baseflows and observed total flows are supplied simultaneously as references, are reported, e.g., Santhi *et al.* (2001), Zhang *et al.* (2011), He *et al.* (2015), Jang *et al.* (2018) and Onyutha (2019). Fenicia *et al.* (2007) note that the potential to calibrate individual components of a model to various observations is the strength of stepwise calibration. The separated baseflows were not used as references for S1 in order to be comparable to the results of Kissel *et al.* (2023). For total flow, similar results were achieved by the S1 and their counterpart S2 models regarding the NSE, LnNSE and SSE and the results were found to be statistically not significantly different. However, consideration of PBIAS and the APD reveals that most of the S1 models had lower median PBIAS and median APD values. Taking the GOFC of baseflow into consideration, the lower PBIAS regarding total flow of the soil moisture-based S1 models can be explained by the significant overestimation of the amount of baseflow by these models. The median APD of the soil moisture models, except for PLR_S1, is generally higher than that of the factor-based models, which is most likely due to the high sensitivity to the CN (Ponce & Hawkins 1996) and the decoupled modeling of baseflow and direct runoff, which enable the factor-based models to more easily achieve low APD values. Yet the soil moisture-based models generally have better NSE, LnNSE and SSE for both total flow and baseflow. This is indicative of tradeoffs between objective functions in multi-objective calibration. Kissel *et al.* (2023) showed that the timing of peak flows with BlueM.Sim, to which the NSE, LnNSE and SSE are sensitive (Krause *et al.* 2005; Gupta *et al.* 2009), is better met with the soil moisture-based model SMA because this model includes a better representation of retention processes in a catchment. Accordingly, a higher or lower retention could result in a better fit of the peak flow timing, however at the cost of greater error regarding the APD. The extreme value analysis of the chosen favorite solutions confirmed that the factor-based models can better estimate peak values of total flow, but are worse at the estimation of extreme low total flows and are generally worse at estimating extreme values of baseflow compared to the soil moisture-based models. Regarding baseflow, the S2 models perform significantly better than their S1 counterparts in both the calibration and validation periods. As could be expected, the S2a models, which were calibrated solely regarding baseflow, achieve the best GOFC for baseflow modeling. But more significantly, the S2b models still performed substantially better than their S1 counterparts regarding the GOFC of baseflow. Therefore, models calibrated according to S2 better represent the flow processes in the FIS and S2 should be the preferred calibration method for the soil moisture-based models. For factor-based models, neither S1 nor S2 offers significant benefits for direct baseflow calibration as the GOFC in reference to the separated baseflow are unsatisfactory and the MFA_* model performed best out of the factor-based models. The applicability of factor-based models in the context of the Hessian guideline (HMUELV 2012) will be addressed in the following discussion on the influence of baseflow model structure.

### Assessment of optimal baseflow model structure in the FIS

Two factor-based and three soil moisture-based baseflow model structures were applied in this study. The first factor-based model FA applies monthly factors to a mean baseflow and thereby strives to replicate the variation of baseflow throughout the year. Kissel *et al.* (2023) showed that to achieve satisfactory validation results with the FA the monthly factors of the favorite solution must be close to the factors representing long-term average baseflow conditions even though there may be solutions that lead to a better fit of observed and simulated FDC for low flows in the calibration period. In this study, it was found that if the FA is compared directly to a series of separated baseflow the GOFC are unsatisfactory. However, in the Hessian guideline, the mean long-term baseflow is used as the baseline flow in rivers and streams (HMUELV 2012). From the plot of separated baseflow, Figure 3, it is apparent that the baseflow varies over time. The minimum and maximum value of separated baseflow are 0.02 and 0.59 m^{3}/s, respectively. Accordingly, FA is a step toward a more realistic representation as fluctuation in baseflow is modeled at a basic level. However, as Kissel *et al.* (2023) noted, its major issue is the lack of any possibility to connect the baseflow to changing hydrological conditions. FA is therefore only applicable to represent average conditions but could satisfy the demands of the Hessian guideline (Kissel *et al.* 2023). In the guideline average conditions in rivers are supposed to be represented which are then impacted by emissions from urban sewer systems during storm events (HMUELV 2012). The new model structure, MFA, addresses the central issue of FA by classifying years as dry, normal, and wet based on the sum of yearly precipitation. The MFA_* model, which is the model in which baseflow factors for each class were applied as estimated without calibration, consistently performed satisfactorily with regard to total flow and baseflow. This highlights the potential of the MFA_* model because it is a fairly simple model with significantly fewer data requirements and computational complexity compared to the soil moisture-based models. A possible explanation as to why calibrated MFA models performed worse than MFA_* is that during the calibration monthly baseflow values are trying to be fit to a series of daily baseflow values which can negatively impact the overall goodness of fit due to the higher variability in the daily series. Although it was shown that the classification algorithm used in this study replicates the official classification given by HLNUG (2023) very well, the MFA could be potentially improved upon by extending the classification algorithm. For instance, the year 1992, which is classified as normal, has low baseflows because of two preceding dry years and therefore corresponds to the baseflow of a dry year as the groundwater storage is still depleted. A possible extension could for instance use classical drought indices such as the standardized precipitation index which take antecedent rainfall conditions into account and are available for the FIS (Grosser & Schmalz 2021). For practitioners, the MFA offers significant advantages in ease of use, comprehensibility, fewer data requirements and computational complexity. Regarding the model structure of the soil moisture-based models, the PLR models clearly outperform the NLR and SMA models if they are calibrated according to S2. The better performance of the PLR model for a hard rock aquifer like in FIS is in accordance with findings in the literature, e.g., Luo *et al.* (2012), Stoelzle *et al.* (2015), Parra *et al.* (2019). For FIS, the single NLR did not model baseflow better than a single linear reservoir (SMA). The selected favorite solutions of NLR_S2a and SMA_S2a are indistinguishable from each other in the hydrograph and FDC plots, see Figures 3, 4, 7 and 8 and their GOFC are very similar as well. However, their retention constants, which influence the shape of the simulated baseflow hydrograph, differ greatly with 2,297 h and 3,417 h for NLR_S2a and SMA_S2a, respectively. This is due to non-uniqueness of parameter sets in complex models which can lead to identical simulated outputs despite differences in parameters (Beven & Freer 2001; Bárdossy 2007). The soil moisture-based models are superior to the factor-based models as they directly link driving factors and responses in baseflow. But MFA_* was statistically not significantly worse at baseflow modeling if the soil moisture-based models are calibrated according to S1. The application of soil moisture-based models by practitioners is limited due to greater data requirements, extensive parameter estimation and calibration as well as higher computational demands.

## CONCLUSION

Three research questions were formulated for this study.

1. SMA: Can the implementation of alternative baseflow models improve baseflow modeling in a typical German low mountain range catchment?

2. FA: Can a novel simple modification allowing for consideration of dry, normal, and wet years improve baseflow modeling with this approach and thus improve assessment of hydraulic stress and water quality impacts in peri-urban catchments under changing boundary conditions?

3. How does the calibration strategy affect the achieved baseflow and total flow modeling results?

Regarding the first question it was found that a model structure of two PLR representing a quick and a slow baseflow component can significantly improve the GOFC of simulated versus separated baseflow in a hard rock aquifer. For the FIS catchment the differences between a single NLR and a single linear reservoir were negligible.

To answer the second question the new MFA model proved to be a significant improvement compared to the standard FA model in BlueM.Sim. The ability to classify years into dry, normal, and wet and derivation of baseflow factors for each class resulted in a much better goodness of fit of simulated and separated baseflow. The simplicity of the MFA and the relatively low data requirements make it a viable tool for practitioners. However, several questions remain regarding the MFA. The influence of unequal distributions between the three classes and the minimum number of years required for reliable monthly factor estimation was not examined in this study but should be considered in future research. This could be done by employing sampling strategies to reflect different distributions and minimum number of years per class to determine a minimum number of years as of which the estimated monthly factors are relatively stable. Varying length of time series could be investigated by using bootstrapping techniques to generate the necessary time series. While the general methodology used to estimate the parameters of the MFA can be recommended no recommendations can be made regarding the outlined issues regarding minimum number of years, etc. As the MFA is a novel approach, its applicability in other low mountain range catchments should be examined in future research. Additionally, evaluating the MFA using classical Pardé factors of total flow could be of interest as the separated baseflow series is derived from the total flow via an RDF and potentially the added step of an RDF could be omitted. Therefore, a study focused on the comparison of baseflow modeling with monthly factors based on Pardé factors of total flow versus those of separated baseflow could be of significant interest. The incorporation of indices based on antecedent rainfall could also be a potential improvement to the MFA as outlined in the discussion section of this study, as baseflow is not solely a product of current precipitation conditions but is also greatly influenced by antecedent conditions. With regard to the third question, the calibration strategy impacted the goodness of fit of total flow and baseflow quite differently. Regarding total flow, the GOFC were similar for both the stepwise and the single step calibration strategy and differences were mostly not statistically significant. However, including the additional calibration step focused on baseflow significantly improved the partitioning of simulated flow components to match that of the observed total flow and separated baseflow. Therefore, the findings of this study suggest that stepwise calibration should be performed, when possible, to ensure a more physically meaningful calibration result.

To summarize, this study has assessed the impact of baseflow model structure and calibration strategy in fractured hard rock aquifers. From the findings of this study, it can be concluded that researchers wishing to model baseflow in such a geological setting should apply a two parallel linear reservoir structure calibrated in a stepwise manner to ensure physically sound modeling of flow components. If data availability limits detailed soil moisture simulation, the modified factor approach (MFA_*) could be a robust alternative as it did not statistically perform significantly worse than the soil moisture-based models calibrated only to total observed flow.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Integrierte Modellierung für Einzugsgebiete mit Komplexer Nutzung (Integrated Modelling of Catchments with Complex Land use)*

*PhD thesis*

*Forum für Hydrologie und Wasserbewirtschaftung*

*ISBN: 978-3-96862-137-1*[Preprint]

*Ein Beitrag zur Kontinuierlichen Simulation der Wasserbilanz (A Contribution to the Continuous Simulation of the Water Balance)*

*PhD thesis*

*Simulation Ereignisspezifischer Einflüsse des Niederschlag-Abfluß-Prozesses von Hochwasserereignissen Kleiner Einzugsgebiete mit Niederschlag-Abfluß-Modellen (Simulation of Event Dependent Influences on the Rainfall Runoff Process of Flood Events in Small*

*PhD thesis*