Predictions of the probabilities and magnitudes of extreme events are essential for water management. One approach to flood estimation is the use of conceptual runoff models. This approach, however, can be questioned for the same reason as extreme-value statistics: the model has to be applied to conditions far beyond those used for model development and calibration. In this study the HBV model, a conceptual runoff model, was applied to four different catchments. Differential split-sample testing (calibrating on years with lower runoff peaks and testing on years with higher peaks) was used to evaluate model performance when the model must simulate runoff under conditions different from those observed during calibration. To assess the value of improved calibration, different goodness-of-fit measures were used, which made it possible to explicitly consider the ability of the model to simulate groundwater levels and peak flows. The results indicated that applying a model to conditions different from those of the calibration period might not give accurate results, and that improved calibration procedures might not automatically provide more accurate flood estimates.
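
As a rough illustration of the differential split-sample idea described above, the following minimal sketch divides years into a calibration set (lower annual peaks) and a test set (higher annual peaks) and scores a simulation against observations. Several details are assumptions made only for illustration and are not taken from the study: the median annual peak is used as the split threshold, the Nash-Sutcliffe efficiency stands in for the unspecified goodness-of-fit measures, and the function names `split_years_by_peak` and `nash_sutcliffe` as well as the example values are hypothetical.

```python
from statistics import median


def split_years_by_peak(annual_peaks):
    """Split years into calibration (lower peaks) and test (higher peaks).

    annual_peaks maps year -> annual peak runoff.
    Assumption: the median annual peak is the split threshold.
    """
    threshold = median(annual_peaks.values())
    calibration = sorted(y for y, p in annual_peaks.items() if p <= threshold)
    test = sorted(y for y, p in annual_peaks.items() if p > threshold)
    return calibration, test


def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the
    skill of always predicting the observed mean.

    Assumption: used here only as one example of a goodness-of-fit measure.
    """
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance_sum = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / variance_sum


# Hypothetical annual peak runoff values, for illustration only.
peaks = {1981: 12.0, 1982: 30.5, 1983: 8.4, 1984: 22.1}
calibration_years, test_years = split_years_by_peak(peaks)
print(calibration_years, test_years)
```

In such a scheme the model parameters would be fitted using only the calibration years, and the efficiency would then be computed separately for the test years to see how far performance degrades when peaks exceed those seen during calibration.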