## ABSTRACT

This study investigates the performance of six machine learning (ML) models – Random Forest (RF), Adaptive Boosting (ADA), CatBoost (CAT), Support Vector Machine (SVM), Lasso Regression (LAS), and Artificial Neural Network (ANN) – against traditional empirical formulas for estimating maximum scour depth downstream of sluice gates. Our findings indicate that ML models generally outperform empirical formulas, with correlation coefficients (CORR) ranging from 0.882 to 0.944 for ML models compared with 0.835–0.847 for empirical methods. Notably, ANN exhibited the highest performance, followed closely by CAT with a CORR of 0.936, while RF, ADA, and SVM delivered competitive metrics around 0.928. Variable importance assessments highlighted the dimensionless densimetric Froude number (*F*_{d}) as significantly influential, particularly in the RF, CAT, and LAS models. Furthermore, SHAP value analysis provided insights into each predictor's impact on model outputs. Uncertainty assessment through Monte Carlo (MC) and Bootstrap (BS) methods, each with 1,000 iterations, indicated ML's capability to produce reliable uncertainty maps. ANN leads in performance with higher mean values and lower standard deviations, followed by CAT. MC results trend towards optimistic predictions compared with BS, as reflected in median values and interquartile ranges. This analysis underscores the efficacy of ML models in providing precise and reliable scour depth predictions.

## HIGHLIGHTS

Benchmarked six ML models against empirical formulas for estimating scour depth.

ML algorithms achieved superior performance, with CORR in [0.882–0.944].

CAT, ANN, and RF models excelled in precision and accuracy.

Evaluated predictor importance using permutation and SHAP values.

Assessed uncertainty in predictions by Monte Carlo and Bootstrap methods.

## INTRODUCTION

Scouring downstream of sluice outlets remains a primary concern in hydraulic engineering due to its implications for structural damage and the alteration of hydrodynamic conditions within water bodies. The phenomenon of scour pertains to the erosion or displacement of sedimentary materials, such as sand and rocks, by water flow (Verma & Goel 2005; Yeganeh-Bakhtiary *et al.* 2020). Specifically, scour at sluice outlets can imperil the structural integrity of the outlets and proximate infrastructure, potentially resulting in operational complications and even failures (Yousif *et al.* 2019). Attempts have been made to counteract the erosive force of water, such as introducing non-erodible aprons downstream (Aamir & Ahmad 2022). However, depending on its magnitude, a consequential scour hole can critically threaten the foundational stability of gates. This emphasizes the importance of precise scour depth estimation and management strategies.

The maximum scour depth typically reaches the equilibrium stage when no grain movement occurs within the scour hole (Chatterjee *et al.* 1994; Fitri *et al.* 2019). Estimating maximum scour depth at sluice gate outlets is inherently intricate, given the myriad of variables governing the scour processes (Sarathi *et al.* 2008). Influences such as soil properties, initial conditions, and hydrodynamic flow characteristics play pivotal roles in the stability of hydraulic constructions (Najafzadeh *et al.* 2017). In particular, some researchers have studied the effect on this parameter of auxiliary works such as wing walls (Le *et al.* 2022) or of apron roughness (Aamir *et al.* 2022). Consequently, predicting maximum scour depth, especially given the wide range of possible outcomes, becomes a difficult task requiring in-depth understanding and rigorous modelling approaches.

Over the years, researchers have employed many methods for scour depth estimation at hydraulic structures, encompassing empirical equations, physical observations, and hydraulic models (Mutlu Sumer 2007), each of which attempts to predict the scour dynamics based on parameters such as velocity, particle size, and outlet design (Mostaani & Azimi 2022). Despite the varied approaches, each method presents its own challenges. Numerical models, for instance, often leveraging the Navier-Stokes equations coupled with sediment transport formulations, have proven valuable for simulating scour evolution (Olsen & Kjellesvig 1998). Yet their practical application sometimes struggles with computational demands and reliability (Le *et al.* 2022). Empirical equations, generally developed through thorough analysis of experimental data and an understanding of the factors that influence scour depth, have found broad utility (Hamidifar *et al.* 2011). However, they sometimes falter in reliability owing to their dependency on a limited scope of experimental data and the intrinsic complexity of scour phenomena (Najafzadeh *et al.* 2017).

In light of these challenges and rapid advancements in computational capacities and data analytics, there has been a paradigm shift towards embracing artificial intelligence techniques, particularly ML, to address these challenges (Sharafati *et al.* 2020; Kartal *et al.* 2023). This transition is driven by the inherent ability of ML models to discern intricate and non-linear interrelationships among a multitude of variables, a task that traditional methodologies often grapple with (Sreedhara *et al.* 2021; Le *et al.* 2023). ML algorithms, especially Gene Expression Programming (GEP), Group Method of Data Handling (GMDH) networks, Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs), have emerged as indispensable tools in maximum scour depth estimation (Najafzadeh 2015). Studies such as those of Najafzadeh *et al.* (2017) have juxtaposed several techniques, including GEP, Evolutionary Polynomial Regression, and Model Tree (MT), illuminating the superior predictive accuracy of MT over traditional empirical equations. Similarly, Abd El-Hady Rady (2020) affirmed the genetic programming algorithm's prowess, noting its superior performance over adaptive neuro-fuzzy inference system (ANFIS) models and empirical equations. Qaderi *et al.* (2021) accentuated the aptitude of ANFIS, observing its outperformance over common algorithms such as GEP, GMDH, SVM, and ANN. Parsaie *et al.* (2019) added another layer to this discourse by presenting SVM's slight edge over ANFIS and ANN in maximum scour depth prediction. Despite promising accuracy, the performance of these methods is inherently dependent on the datasets on which they are trained, emphasizing the pivotal role of data quality and volume in model outcomes (Aamir & Ahmad 2016). This inherent dependency underscores the burgeoning significance of quantifying and interpreting uncertainty in ML predictions, especially within the niche domain of hydrodynamic scouring phenomena.

Uncertainty quantification techniques range from probabilistic methods such as Monte Carlo (MC) simulations (Han *et al.* 2011) to non-parametric ones such as bootstrapping (Efron & Tibshirani 1994). Such traditional methods have carved a niche in hydrodynamic studies, offering the dual benefit of assessing variability and conferring confidence in predictions. The recent literature also emphasizes adopting these methodologies to bolster prediction reliability (Palmer *et al.* 2022), especially in risk assessments (Grana *et al.* 2020). However, despite these advancements, the field of uncertainty quantification is still developing (Ustimenko *et al.* 2020). As Hüllermeier & Waegeman (2021) noted, understanding the nuanced distinctions between aleatoric and epistemic uncertainties in machine learning offers a deeper insight into the limits and potential of predictive models. Furthermore, Abed *et al.* (2023) emphasized the increasing role of artificial intelligence in environmental modelling, pointing to the expansive potential of these technologies in enhancing methodological approaches in hydrodynamic studies. Nonetheless, comprehensive studies on uncertainty quantification in the context of scour depth predictions at sluice outlets are scarce. The literature tends to focus on the capabilities of individual ML models rather than exploring comparative or ensemble approaches that might enhance predictive accuracy and reliability (Rezaie-Balf 2019). The present research gap underscores the significance of this research, which endeavours to benchmark the various ML models and quantify the uncertainty in their predictions of maximum scour depth.

In response to these identified gaps, this study comprehensively assesses six ML models and two empirical formulas. The evaluation highlights their respective and comparative prediction performances for estimating maximum scour depth at sluice outlets. This approach compares their effectiveness and delves deeper into each model's capabilities through an integrated methodological framework. Furthermore, the research incorporates advanced interpretability techniques, namely permutation feature importance and SHAP (SHapley Additive exPlanations) values, which enhance the transparency and understanding of how different predictors influence the model outputs. These tools are vital for dissecting the complex dynamics of the predictive models and refining their accuracy. In addition to interpretability, this study also intensively applies MC simulations and Bootstrap (BS) techniques to thoroughly quantify the uncertainty in the predictions provided by these models. By generating a multitude of predictive outcomes through these techniques, the study assesses the reliability and variability of the model forecasts, offering a robust statistical basis to evaluate their predictive confidence.

## MATERIALS AND METHODS

### Overview of empirical equations

The maximum scour depth (*d*_{s}) is a crucial metric that describes the shape of the scour hole. The determination of *d*_{s} is affected by parameters such as the input velocity (*V*), tailwater depth (*d*_{t}), the open height of the sluice gate (*a*), apron length (*L*), apron roughness (Lim & Yu 2002; Dey & Westrich 2003), and bed material properties, including soil density (*ρ*_{s}), its geometric standard deviation (*σ*_{g}), mean grain size (*D*_{50}), and the soil type (Najafzadeh & Lim 2015). Further compounding the determination are the gravitational acceleration (*g*), water density (*ρ*_{w}), and kinematic viscosity (*ν*). The connection between the scour depth and these governing variables has been recognized as:

*d*_{s} = *f*(*V*, *d*_{t}, *a*, *L*, *ρ*_{s}, *σ*_{g}, *D*_{50}, *g*, *ρ*_{w}, *ν*) (1)

Applying dimensional analysis, the three repeating variables are *a*, *V*, and *ρ*_{w}. The eight non-repeating variables are *L*, *d*_{t}, *D*_{50}, *ρ*_{s}, *d*_{s}, *g*, *ν*, and *σ*_{g}. From these variables, eight dimensionless π terms are formulated as follows:

*d*_{s}/*a* = *f*(*L*/*a*, *d*_{t}/*a*, *D*_{50}/*a*, *F*, *F*_{d}, π_{7}, *σ*_{g})

Here, *F* represents the Froude number:

*F* = *V*/√(*g* *a*) (2)

*F*_{d} represents the densimetric Froude number:

*F*_{d} = *V*/√[((*ρ*_{s} − *ρ*_{w})/*ρ*_{w}) *g* *D*_{50}] (3)

and π_{7} represents the Reynolds number, which was kept much higher than the threshold value for turbulent flow in a fully rough zone (Aamir & Ahmad 2022). Although the Reynolds number (π_{7}) is crucial for identifying the flow regime, it is found to have an insignificant effect on maximum scour depth in turbulent conditions. In addition, Aamir & Ahmad (2019) performed a test to determine the significance of each π term in predicting maximum scour depth, concluding that *σ*_{g} is a negligible quantity. Therefore, instead of the 10 variables in Equation (1), the relative maximum scour depth (*d*_{s}/*a*) can be determined by five π terms in the following function:

*d*_{s}/*a* = *f*(*L*/*a*, *d*_{t}/*a*, *D*_{50}/*a*, *F*, *F*_{d}) (4)

From Equation (4), researchers have developed many empirical formulas over the years. The equations that have been proposed for this prediction are briefly summarized in Table 1.

| Researcher | Equation | No. |
|---|---|---|
| Chatterjee et al. (1994) | | (5) |
| Aderibigbe & Rajaratnam (1998) | where *F*_{d(95)} is *F*_{d} based on *D*_{95} | (6) |
| Lim & Yu (2002) | where | (7) |
| Hopfinger et al. (2004) | | (8) |
| Dey & Sarkar (2006) | | (9) |
| Aamir et al. (2022) | where *k*_{s} is the apron roughness | (10) |


Table 1 illustrates that these formulas frequently simplify the underlying complexities when input data are insufficient. For instance, many formulations consider *d*_{s} to be a function of a single parameter, as seen in Equations (5) and (6), or focus on jet characteristics and a specific soil, as in Equations (7) and (8). Moreover, parameters such as *F*_{d(95)} often prove difficult to obtain, rendering these equations impracticable in practical scenarios. Another observation from Table 1 is the neglect of tailwater depth (*d*_{t}) in Equations (5)–(8). This contrasts with the argument made by Dey & Sarkar (2006) that the maximum scour depth decreases as *d*_{t} increases up to the critical tailwater depth. The same authors also noted that an increase in sediment size (*D*_{50}) is correlated with a decrease in maximum scour depth, an effect absent from Equations (5) and (6), where sediment characteristics are ignored. Likewise, *d*_{s}/*a* decreases as *L*/*a* increases, a dependence not captured in Equations (5), (6), and (8).

According to the findings of Aamir & Ahmad (2019), empirical equations, including those that rely on complex multiple linear regressions, can on occasion fail to estimate maximum scour depth accurately. It is important to note that many of these formulas were developed and calibrated using particular experimental datasets and conditions; their efficacy may therefore fluctuate when applied to different datasets. This emphasizes the significance of the initial experimental conditions and datasets in the process of formula determination. With advances in technology and computational methods, there has been a shift towards using data-driven techniques, such as ML-based methods, to boost the predictability of maximum scour depth.
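The dimensionless groups in Equation (4) can be computed directly from raw measurements. A minimal Python sketch with illustrative laboratory-scale values (the numbers are assumptions for demonstration, not taken from the experimental datasets), using the standard definitions of *F* and *F*_{d}:

```python
import math

def pi_terms(V, d_t, a, L, D50, rho_s, rho_w=1000.0, g=9.81):
    """Return the five dimensionless groups of Equation (4).

    V: jet velocity, d_t: tailwater depth, a: gate opening,
    L: apron length, D50: mean grain size (all in SI units).
    """
    F = V / math.sqrt(g * a)                               # Froude number, Eq. (2)
    F_d = V / math.sqrt((rho_s / rho_w - 1.0) * g * D50)   # densimetric Froude number, Eq. (3)
    return {"L/a": L / a, "d_t/a": d_t / a, "D50/a": D50 / a, "F": F, "F_d": F_d}

# illustrative values only
terms = pi_terms(V=1.2, d_t=0.20, a=0.02, L=0.66, D50=0.8e-3, rho_s=2650.0)
```

These five ratios are exactly the predictors fed to the ML models later in the study.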

### Data collection

For this research, our focus was on experimental data sourced from the seminal work of Dey & Sarkar (2006) (hereafter Dey_2006) and the more recent findings of Aamir *et al.* (2022) (hereafter Aamir_2022). Both investigated the local scour caused by a two-dimensional submerged water jet downstream of a sluice gate. This intentional data selection was premised on the two empirical equations from these studies, encapsulated in Equations (9) and (10). Anchoring our study to these specific equations allows a consistent and systematic comparative analysis, which serves as a reasonable yardstick against which the performance of the other prediction models can be evaluated. Table 2 shows the variety of test runs with detailed metrics, such as apron length, gate opening, tailwater depth, and the Froude number of the water jet issuing from the sluice gate. This study considers only smooth, rigid aprons, for which roughness is negligible. The condition of a submerged hydraulic jump, in which the tailwater depth exceeds the conjugate depth of a free hydraulic jump, was maintained throughout all experiments (Aamir *et al.* 2022).

| Investigator | Number of runs | *D*_{50}/*a* | *L*/*a* | *d*_{t}/*a* | *F* | *F*_{d} | *d*_{s}/*a* |
|---|---|---|---|---|---|---|---|
| Dey_2006 | 225 | 0.02–0.4 | 26.7–55 | 6.6–13.9 | 2.4–4.9 | 3.3–22.1 | 1.5–8.2 |
| Aamir_2022 | 126 | 0.02–1.3 | 20–100 | 6.7–40 | 1.5–12.1 | 2.7–25.3 | 0.3–20.2 |


### ML models

#### Random Forest (RF)

RF is an ensemble learning technique that trains a collection of decision trees on bootstrap samples of the data and aggregates their outputs (Breiman 2001). RF possesses an inherent capability to measure the importance of individual features, making it invaluable for choosing features in complex hydrodynamic predictions. The overall prediction is obtained by averaging the predictions generated by each individual tree:

ŷ = (1/*T*) Σ_{i=1}^{*T*} *f*_{i}(*x*)

where the number of trees is denoted by *T*, and the prediction of the *i*-th tree is denoted by *f*_{i}(*x*).
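The averaging rule can be checked directly with scikit-learn: a fitted `RandomForestRegressor` exposes its individual trees via `estimators_`, and the ensemble prediction equals the mean of their predictions. A sketch on synthetic data (a stand-in for the experimental set, so the numbers are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))              # stand-ins for d_t/a, L/a, D50/a, F, F_d
y = 2.0 * X[:, 4] + X[:, 1] + 0.1 * rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=0).fit(X, y)

# ensemble output = (1/T) * sum of the T individual tree predictions
tree_mean = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)
```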

#### Adaptive Boosting (ADA)

ADA builds its ensemble sequentially, re-weighting the training samples so that each new weak learner focuses on the cases its predecessors handled poorly; the final output is a weighted combination of the individual learners:

ŷ = Σ_{i} *α*_{i} *f*_{i}(*x*)

where *α*_{i} denotes the weight assigned to the *i*-th tree, which is computed using the error of that tree.

#### CatBoost (CAT)

CAT is a gradient-boosting algorithm tailored to tabular data (Prokhorenkova *et al.* 2018). Its primary advantage is its ability to handle categorical variables without manual pre-processing by transforming them into numerical values using techniques such as one-hot encoding and mean encoding. CAT also provides built-in support for handling missing values. For the regression problem, the CAT prediction can be described as:

ŷ = Σ_{i} *η* *f*_{i}(*x*)

where *η* represents the learning rate that multiplies the contribution of the *i*-th tree.

#### Support Vector Machine (SVM)

SVM is a kernel-based learning method that has been widely applied to hydraulic prediction problems. A prominent advantage of SVM is its capability to work in a transformed feature space via the kernel trick, effectively handling non-linear relationships (Cortes & Vapnik 1995). For regression tasks, the simplified representation of the SVM can be depicted as:

ŷ = *ω*^{T} *ϕ*(*x*) + *b*

where *b* denotes the bias; *ω* represents the weight vector; and *ϕ*(*x*) indicates the transformation of input *x* through the kernel function.

#### Lasso Regression (LAS)

LAS is a linear model that augments the least-squares objective with an L1 penalty, shrinking uninformative coefficients exactly to zero and thereby performing built-in variable selection. The fitted coefficients minimize:

(1/(2*n*)) Σ_{i=1}^{*n*} (*y*_{i} − *β*_{0} − Σ_{j=1}^{*p*} *β*_{j} *x*_{ij})² + *λ* Σ_{j=1}^{*p*} |*β*_{j}|

where *λ* denotes the regularization parameter; *β*_{0} and *β*_{j} denote the coefficients; *y* and *x* are the response and predictor variables; and *p* and *n* are the numbers of predictors and observations.

#### Artificial Neural Network (ANN)

ANNs are flexible, data-driven models comprising interconnected layers of nodes (or ‘neurons’) that can learn and represent almost any function given enough depth and data. In maximum scour depth prediction, an ANN can adaptively learn from the data without relying on a pre-determined functional form and can capture the inherent complexities and non-linearities of hydraulic processes. In a single-layer ANN, the output for an input vector *x* is computed as follows:

ŷ = *σ*(Σ_{i=1}^{*n*} *ω*_{i} *x*_{i})

where *ω* denotes the weight vector; *n* represents the number of input nodes; and *σ* denotes the activation function.
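The single-layer computation reduces to a weighted sum passed through an activation. A minimal numpy sketch (tanh is chosen here as an illustrative *σ*; the input and weight values are arbitrary):

```python
import numpy as np

def single_layer(x, w, sigma=np.tanh):
    """y = sigma(sum_i w_i * x_i): one neuron, no hidden layers."""
    return sigma(np.dot(w, x))

x = np.array([0.5, -1.0, 2.0])   # illustrative input vector
w = np.array([0.2,  0.4, 0.1])   # illustrative weights
y = single_layer(x, w)           # tanh(0.1 - 0.4 + 0.2) = tanh(-0.1)
```

Deep networks, such as the three-hidden-layer model tuned later in this study, simply stack this operation with a weight matrix per layer.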

### Uncertainty analysis methods

#### Permutation importance

Permutation feature importance provides a metric to discern the significance of each feature (or variable) used by an ML algorithm. By shuffling the values of a particular variable and measuring the subsequent decrease in model performance (typically, accuracy or error rate), one can ascertain the importance of that feature in the model's predictions (Breiman 2001). The rationale behind this is that if a predictor is vital for the model, randomly altering its values would result in a notable drop in the model's performance. Conversely, insignificant features would have a negligible impact on the performance metric. Altmann *et al.* (2010) further elucidated that permutation importance can offer insight into the model's decision-making process and can be especially effective when benchmarking multiple ML algorithms, ensuring a uniform evaluation metric.
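The procedure is model-agnostic: score the model, shuffle one feature column, rescore, and take the performance drop as that feature's importance. A self-contained sketch with a toy linear "fitted model" (purely illustrative; scikit-learn's `permutation_importance` implements the same idea):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=30, rng=None):
    """Mean increase in MSE when each feature column is shuffled."""
    if rng is None:
        rng = np.random.default_rng(0)
    base_mse = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])   # break feature j's link to the target
            increases.append(np.mean((predict(Xp) - y) ** 2) - base_mse)
        importances[j] = np.mean(increases)
    return importances

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
model = lambda X: 3.0 * X[:, 0] + 0.2 * X[:, 2]   # toy model: feature 1 is unused
y = model(X)
imp = permutation_importance(model, X, y)
```

Shuffling the unused feature leaves performance unchanged, so its importance is zero, while the strongest predictor shows the largest drop.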

#### SHAP values

SHAP values are a state-of-the-art model interpretability method that ascribes each variable as an essential value for a specific prediction (Rodríguez-Pérez & Bajorath 2020). Originating from cooperative game theory, SHAP values mathematically guarantee a unique and consistent attribution for each feature, ensuring fairness and avoiding potential biases (Lundberg & Lee 2017). Within the context of scour depth prediction, SHAP values can unveil the influence of individual predictors and the intricate non-linear relationships and interactions among predictors, thereby furnishing a nuanced understanding of how each feature contributes, either positively or negatively, to the predicted maximum scour depth.
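For a model with only a handful of features, exact Shapley values can be computed from the game-theoretic definition by averaging a feature's marginal contribution over all coalitions. A brute-force sketch (missing features are replaced by their background means, an assumption of this illustration; practical SHAP libraries use faster approximations):

```python
import itertools
import math
import numpy as np

def exact_shapley(predict, x, background, j):
    """Exact Shapley value of feature j for input x.
    Features absent from a coalition are set to their background mean."""
    n = len(x)
    others = [k for k in range(n) if k != j]
    base = background.mean(axis=0)
    phi = 0.0
    for size in range(n):
        for S in itertools.combinations(others, size):
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            x_with, x_without = base.copy(), base.copy()
            for k in S:
                x_with[k] = x[k]
                x_without[k] = x[k]
            x_with[j] = x[j]                     # add feature j to the coalition
            phi += weight * (predict(x_with) - predict(x_without))
    return phi

# for a linear model, the Shapley value of feature j is w_j * (x_j - mean_j)
w = np.array([2.0, -1.0, 0.5])
predict = lambda z: float(np.dot(w, z))
background = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])   # feature means = 0.5
x = np.array([2.0, 0.0, 1.0])
phi = [exact_shapley(predict, x, background, j) for j in range(3)]
```

By the efficiency property, the values sum to the gap between the prediction at *x* and the prediction at the background mean, which is exactly what the bar lengths in a SHAP plot decompose.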

#### Uncertainty quantification

Quantifying the uncertainty in model predictions is of paramount importance, especially in scenarios with high-stakes outcomes, such as estimating maximum scour depths at sluice outlets. Two prevalent methods, MC simulation and BS, are employed to understand this uncertainty.

Rooted in probabilistic theory, MC methods involve running numerous simulations with random input values within a predefined distribution to approximate the output's expected distribution (Brownlee 2019). By repeatedly sampling and simulating, MC furnishes an explicit representation of the prediction's uncertainty, capturing both its variability and sensitivity to changes in the input values (Han *et al.* 2011). While the technique's inherent simplicity makes it attractive, its effectiveness hinges on the availability of a well-defined probability distribution for each input parameter and requires extensive computational resources due to repeated simulations. In maximum scour depth prediction, MC simulations can provide a probability distribution of the predicted scour depth rather than a single deterministic value.
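A minimal MC sketch: draw each input from an assumed distribution, push every draw through the predictor, and summarize the resulting output distribution. The linear stand-in model and Gaussian input uncertainties below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000                                       # MC iterations, matching the study

# stand-in for a trained ML model (illustrative linear response)
predict = lambda X: 1.5 * X[:, 0] + 0.8 * X[:, 1]

# assumed Gaussian measurement uncertainty on two inputs
means = np.array([4.0, 10.0])
sds = np.array([0.2, 0.5])

samples = rng.normal(means, sds, size=(N, 2))  # one input draw per iteration
preds = predict(samples)

mc_mean = preds.mean()
lo, hi = np.percentile(preds, [2.5, 97.5])     # 95% uncertainty band
```

The output is a full distribution of scour depth estimates rather than a single deterministic value, from which the mean, standard deviation, and interquartile range reported later can all be read off.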

Another widely acknowledged method for uncertainty quantification is BS. Efron & Tibshirani (1994) presented BS as a resampling technique where random samples (with replacement) are drawn from the dataset and are used to gauge the variability or confidence intervals of an estimator. Recent studies have indicated bootstrapping's viability in enhancing the robustness of ML predictions in water resources (Palmer *et al.* 2022). The key advantage of BS lies in its non-parametric nature, requiring no assumptions about the data's distribution, making it an ideal choice for complex, non-linear datasets (Gewerc 2020), as frequently encountered in hydraulic studies. In the current analysis, both MC and BS methods were employed to understand the uncertainty bounds of the ML models, a step crucial in establishing the credibility of ML predictions for maximum scour depth estimations.
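Bootstrapping instead resamples the data itself: re-evaluate the statistic on each resample and read confidence bounds off the resulting distribution. A sketch for a 95% interval on a mean prediction error, with toy error data standing in for the model residuals:

```python
import numpy as np

rng = np.random.default_rng(7)
errors = rng.normal(loc=0.6, scale=0.3, size=200)   # stand-in for |observed - predicted|

B = 1000                                            # resamples, matching the study
boot_means = np.array([
    rng.choice(errors, size=errors.size, replace=True).mean()   # sample WITH replacement
    for _ in range(B)
])
ci_lo, ci_hi = np.percentile(boot_means, [2.5, 97.5])
```

No distributional assumption about the errors is needed, which is the non-parametric advantage noted above.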

### Model design and hyperparameter tuning

Hyperparameter tuning emerges as a crucial procedure when enhancing the capabilities of ML models to estimate the maximum scour depth accurately. The grid search technique is a well-regarded method in hyperparameter optimization owing to its rigorous and exhaustive exploration of potential parameter combinations. Such meticulousness is especially suited to datasets of moderate sizes, as it facilitates the evaluation of the ML model across every permutation of hyperparameters within a predetermined grid. The widely recognized Python library, scikit-learn, offers efficient functionalities for the effective application of grid search procedures (Pedregosa *et al.* 2011).

An essential aspect of hyperparameter tuning is the use of cross-validation. This study adopts the grid search methodology, which incorporates 5-fold cross-validation. This approach substantially augments the model's resilience, ensuring its predictions remain consistent and reliable when subjected to previously unobserved data. In Table 3, the particular hyperparameters and their respective ranges for each ML algorithm are exhaustively detailed.

| Algorithm | Hyperparameter | Value range | Optimal value |
|---|---|---|---|
| RF | n_estimators | [50, 100, 200] | 50 |
| | max_depth | [None, 10, 20, 30] | 10 |
| | min_samples_leaf | [1, 2, 4] | 1 |
| | min_samples_split | [2, 3, 4, 5] | 5 |
| | bootstrap | [True, False] | True |
| ADA | n_estimators | [50, 100, 200] | 200 |
| | learning_rate | [0.001, 0.01, 0.1, 0.5, 1.0] | 0.01 |
| | loss | [linear, square, exponential] | linear |
| | base_estimator | DecisionTreeRegressor(max_depth) | 4 |
| CAT | learning_rate | [0.01, 0.05, 0.1, 0.5, 1] | 0.01 |
| | iterations | [100, 500, 1,000] | 1,000 |
| | depth | [3, 5, 7] | 3 |
| SVM | kernel | [linear, poly, rbf, sigmoid] | rbf |
| | C | [0.01, 0.1, 1, 10, 100] | 10 |
| | epsilon | [0.01, 0.1, 1] | 1 |
| LAS | alpha | [0.001, 0.01, 0.1, 1, 10, 100, 1,000] | 0.1 |
| ANN | drop_rate | [0.1, 0.2, 0.25, 0.3] | 0.25 |
| | hidden layers | [2, 3] | 3 |
| | number of units | [128, 64] | 128, 64, 64^{a} |
| | loss | [mse, mae] | mse |


^{a}The number of units per respective hidden layer for ANN is 128, 64, 64.
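The tuning procedure in Table 3 corresponds to scikit-learn's `GridSearchCV` with `cv=5`. A condensed sketch over a trimmed RF grid (synthetic data stand in for the experimental set, so the selected values are illustrative, not the optima of Table 3):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(size=(246, 5))                  # same size as the training split
y = 2.0 * X[:, 4] + X[:, 0] + 0.1 * rng.normal(size=246)

param_grid = {
    "n_estimators": [50, 100],                  # grid trimmed for speed
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,                                       # 5-fold cross-validation, as in the study
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best = search.best_params_
```

Every combination in the grid is fitted five times (once per fold), and the combination with the best mean validation score is retained.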

For the current investigation, the dataset comprises 351 laboratory-derived samples. These samples elucidate the multifaceted nature of maximum scour depths encountered downstream of sluice gates. The ratio of maximum scour depth to the open height of the sluice gate (*d*_{s}/*a*) is modelled using five instrumental variables: *d*_{t}/*a*, *L*/*a*, *D*_{50}/*a*, *F*, and *F*_{d}. By encompassing this breadth of variables, the study guarantees a thorough assimilation of the parameters influencing maximum scour depths. Considering the dataset dimensions, a strategic division was executed, allocating 70% of the samples (246 samples) for training while reserving the remaining 30% (about 105 samples) for validation and testing. This distribution ensures that the model undergoes intensive training while retaining a significant portion of data for unbiased performance evaluation.
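The 70/30 partition can be reproduced with scikit-learn's `train_test_split` (a sketch; the placeholder arrays and random seed are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(351 * 5, dtype=float).reshape(351, 5)   # placeholder for the 351 samples
y = np.arange(351, dtype=float)

# hold out 30% for validation/testing, as in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
```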

### Evaluation metrics

Ensuring accurate and reliable assessment of scour depth prediction models requires incorporating a careful selection of performance metrics. Therefore, this study takes advantage of various statistical criteria, each of which scrutinizes diverse aspects of the model predictions compared with the observed values. In essence, these metrics evaluate the predictive accuracy, bias, and overall reliability of the models, thereby facilitating a comprehensive understanding of their respective capabilities.

The following statistical criteria were chosen for model evaluation: Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Symmetric Mean Absolute Percentage Error (SMAPE), CORR, and Nash–Sutcliffe Efficiency (NSE). The RMSE and MAE, both in their unique ways, gauge the magnitude of the estimation error, offering insights into model precision and bias, respectively. SMAPE provides a percentage error between the estimated and observed values, offering a scale-independent error metric. CORR reflects the linear relationship, while NSE represents the model's predictive accuracy and is especially appreciated for its ability to delineate the proportion of the variance in the measured data that the model captures. Supplementary Table 4 briefly summarizes these evaluation metrics, their respective mathematical formulations, and interpretative insights.
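The five criteria can be implemented in a few lines of numpy. Standard formulations are used below; the SMAPE variant (mean ratio to the average of absolute observed and predicted values) is an assumption, since the exact formulas live in Supplementary Table 4:

```python
import numpy as np

def metrics(obs, pred):
    """RMSE, MAE, SMAPE (%), CORR, and NSE between observed and predicted values."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    err = pred - obs
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    smape = 100.0 * np.mean(np.abs(err) / ((np.abs(obs) + np.abs(pred)) / 2.0))
    corr = np.corrcoef(obs, pred)[0, 1]
    nse = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "SMAPE": smape, "CORR": corr, "NSE": nse}

m = metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A perfect model yields RMSE = MAE = SMAPE = 0 and CORR = NSE = 1; NSE below zero would mean the model predicts worse than the observed mean.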

## RESULTS

### Performance of methods: benchmarking

Regarding RMSE and MAE, which offer insights into the dispersion and deviation from observed values, the ML algorithms prominently outshone the empirical methods. Specifically, CAT registered the lowest MAE of 0.63 alongside an RMSE of 1.07. ANN and RF followed suit, mirroring each other in MAE at 0.66, with ANN slightly taking the lead in RMSE at 1.05. In contrast, the empirical formula proposed by Aamir_2022 presented the most significant deviation, with an MAE of 1.33 and an RMSE of 2.23. Although Dey_2006 performed marginally better than Aamir_2022, the accuracy of the ML models generally overshadowed both empirical equations.

For the SMAPE criterion, the errors of the six ML algorithms fall within 7.5–10.3%, below those of the two empirical formulas (10.4% for Dey_2006 and 12.0% for Aamir_2022). Among the ML models, CAT, RF, and ANN lead with errors of about 7.6%, significantly lower than ADA (9.1%) and LAS (10.3%). This disparity emphasizes the robustness of the ML algorithms, especially CAT, RF, and ANN, in mirroring observed values with reduced bias.

A similar trend was observed through meticulous analysis when CORR and NSE indices were examined, illuminating the superior performance of ML techniques over empirical formulations, as demonstrated in Figures 2 and 3. These figures interweave the statistical metrics, emphasizing the overarching dominance of ML algorithms, especially ANN and CAT. This superior linear predictive capability is contrasted sharply with the empirical equations, as the visualizations confirm the superior clustering of ML predictions around observed values (see Figure 2). In addition, the scatter plots for ANN and CAT depict a pronounced alignment with the observed data, while empirical methods reveal a more considerable dispersion (see Figure 3).

For the empirical formulas, Dey_2006 again performed better than Aamir_2022 in both criteria, registering CORR and NSE values of 0.847 and 0.687, respectively. Although the difference in linear prediction capability between the two formulas was modest (approximately 0.012 in CORR), a clear contrast was revealed in their NSE values, which differed by 0.124. The ML models demonstrated notable consistency across both criteria, spanning the ranges [0.882–0.944] for CORR and [0.818–0.903] for NSE. Among them, the ANN model emerged as the superior performer, while the LAS model displayed the poorest performance. The ANN algorithm is closely followed by CAT, with a CORR of 0.936 and an NSE of 0.899, while the RF, ADA, and SVM algorithms maintained competitive performance metrics with minor discrepancies, hovering around 0.928 for CORR and 0.883 for NSE.

In general, the comprehensive performance benchmarking of methods for predicting maximum scour depth indicated that ML algorithms consistently demonstrated superior accuracy and efficiency. Notably, the CAT, ANN, and RF models stood out for their precision, closely reflecting observed values with minimal deviations. Meanwhile, SVM, ADA, and LAS models exhibited lower performance in descending order. In contrast, while the empirical formula by Dey_2006 showed relatively better performance than that of Aamir_2022, both were overshadowed by the predictive ability of most ML models.

### Uncertainty analysis of ML models

#### Insights from variable importance

A careful analysis of Supplementary Table 6 reveals clear differences in the importance of the variables across the models, which can be broadly classified into three levels based on their influence. The first level is dominated by the *F*_{d} variable, which attained significant scores in most models, the highest being in RF and CAT (0.342 and 0.364, respectively). In the LAS model, *F*_{d} likewise dominates, with the highest importance score among all analysed variables, 0.556 compared with 0.444 for *d*_{t}/*a*. In juxtaposition, the ANN assigns relatively less emphasis to *F*_{d} (0.170), directing more weight towards *d*_{t}/*a* and *F*, with respective importance scores of 0.361 and 0.315. The prominence of *F*_{d}, the dimensionless densimetric Froude number, underscores its pivotal role in influencing maximum scour depth predictions.

The second level includes the variables *d _{t}*/*a* and *F*. Both emerged as crucial predictors for ADA and ANN, with significance scores of (0.392 and 0.357) and (0.361 and 0.315), respectively.

Interestingly, *F* has its highest influence in SVM at 0.305, but its importance in LAS is zero. Although slightly lesser, the robust influence of *d _{t}*/*a* was also identified in CAT and SVM, with respective importance scores of 0.227 and 0.211.

The remaining variables, *D _{50}*/*a* and *L*/*a*, show relatively muted effects across the board. Specifically, *L*/*a* appears relatively unimportant in the ADA, LAS, and ANN models, with modest significance scores of less than 0.05. This contrasts with RF, where *L*/*a* achieved an importance score of 0.248, ranking second among the compared variables. For *D _{50}*/*a*, most models recorded their lowest scores on this variable (see Figure 4). The relatively low significance scores of these predictors, associated with the mean grain diameter and the characteristic channel length, indicate a less direct influence on maximum scour depth estimates in most models, especially LAS, where their impact diminishes to non-existence.
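The permutation scores discussed above follow the standard procedure: shuffle one predictor at a time and record the resulting drop in model skill. A minimal scikit-learn sketch on synthetic data (the feature roles are hypothetical stand-ins, not the study's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic stand-in for the scour dataset: four dimensionless predictors,
# with the target depending strongly on the first one (playing the role of F_d).
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Shuffle each predictor in turn and measure the drop in R^2 (default scorer).
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank predictors by mean importance; the dominant synthetic one is index 0.
ranking = np.argsort(result.importances_mean)[::-1]
```

Because the scores measure loss of skill after shuffling, they are model-specific: the same predictor can rank first in RF and near zero in LAS, exactly the pattern reported above.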

In Figure 5, the range of SHAP values denotes the magnitude of the impact a predictor has on the model output; a longer bar therefore signifies a more influential variable. In general, the variable *F* dominates the forecast results in most ML models, except LAS (where it ranks 3rd). For the RF model, the *F _{d}* and *L*/*a* variables dominate in terms of influence, closely followed by *F*. The ADA algorithm showcases a relatively balanced influence among *F*, *L*/*a*, and *F _{d}*. The SVM model reflects a similar pattern to CAT, with *F* being favoured, albeit with a slightly more substantial influence from *F _{d}* and *d _{t}*/*a*. The LAS algorithm exhibits a distinct shift, with *F _{d}* and *d _{t}*/*a* attracting significant attention while other variables play a softer role or are even ignored, such as *L*/*a*. Lastly, in the ANN, the *F* and *D _{50}*/*a* predictors exhibit higher SHAP values than the others.

The contrast between the permutation importance and the SHAP values can be seen in some models. While predictors like *F* and *F _{d}* maintain consistent importance across multiple models, the relative influence of variables like *L*/*a* and *d _{t}*/*a* sees marked fluctuations. This highlights the subtle complexities and nuances between the two measures of variable importance. Together, the permutation importance and SHAP values underscore the multifaceted nature of maximum scour depth prediction and the complex interaction of variables in the predictive power of an ML algorithm.
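The additive property that makes SHAP values interpretable — per-predictor contributions that sum exactly to the prediction minus the average prediction — can be verified by hand for a linear model, where (assuming feature independence) the exact SHAP value of feature *i* is *w _{i}*(*x _{i}* − E[*x _{i}*]). A small numeric sketch with hypothetical coefficients:

```python
import numpy as np

# For a linear model f(x) = w @ x + b with independent features, the exact
# SHAP value of feature i at instance x is w_i * (x_i - mean_i), and the
# contributions sum to f(x) - f(mean) (the "local accuracy" property).
w = np.array([0.9, 0.4, -0.2])           # hypothetical fitted coefficients
b = 0.5                                  # hypothetical intercept
X = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.0, 1.0],
              [2.0, 1.0, 2.0]])          # hypothetical background data
x = X[0]                                 # instance being explained

shap_values = w * (x - X.mean(axis=0))
f = lambda z: float(w @ z + b)
residual = shap_values.sum() - (f(x) - f(X.mean(axis=0)))   # should be ~0
```

This instance-level decomposition is what distinguishes SHAP from permutation importance, which only yields one global score per predictor.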

#### Uncertainty in predictions

From Supplementary Table 7 and Figure 6, it is clear that the ANN model consistently displays superior performance, marked by mean scores of 0.938 for MC and 0.923 for BS. In contrast, the LAS model exhibits the worst performance, with respective scores of 0.882 and 0.875. However, the uncertainty of the LAS model, characterized by values of 0.0012 for MC and 0.0088 for BS, is markedly lower.

This indicates that its predictions, though slightly less accurate, are more consistent. The CAT algorithm closely follows in performance, with promising scores of 0.935 and 0.911 for MC and BS, respectively. Moreover, the standard deviations, which represent the variation in predictions, were relatively low for all models, suggesting consistent predictive capabilities. Notably, the ANN showed exceptionally low standard deviations of around 0.005 for both techniques, reinforcing its stable performance. The upper 95% CI further corroborated these findings, with the ANN model consistently outperforming the others.

Figure 6 depicts several statistical properties of the 1,000 simulations for the six ML algorithms. It can be observed that the median values were close to the mean values from Supplementary Table 7. This indicated a symmetric distribution of prediction values around the median, suggesting minimal skewness in the predictive outcomes. For all models, the median values derived from the MC technique are marginally higher than those from the BS approach, indicating a slight leaning of the MC simulations towards optimistic predictions. For example, the RF has a median of 0.930 and 0.907 for MC and BS, respectively. For the 1st Quartile (Q1) and 3rd Quartile (Q3) values, a narrower interquartile range (IQR) is identified for the LAS and ANN models, suggesting a higher concentration of prediction values within this range. In contrast, RF exhibited a wider spread with IQR values of 0.007 and 0.035 for MC and BS, respectively, indicating greater prediction variability. Additionally, the IQR values of the MC method were mostly smaller than those of the BS method, as shown by the range of [0.001–0.007] compared with the range of [0.007–0.036]. This trend is repeated across models, illustrating a consistent pattern.

The results of the prediction uncertainty analysis of the ML models using the MC and BS techniques exhibit varying performance levels and internal variability. The ANN stands out with superior mean and standard deviation performance, but this does not overshadow the remarkable results of other algorithms such as RF, ADA, CAT, and SVM. The performance of CAT closely follows that of the ANN, with standard deviations of about 0.0044 and 0.0274 for MC and BS, respectively. While the RF, SVM, and ADA models exhibit competitive performance, with a 1,000-prediction mean of approximately 0.925, the standard deviation of ADA (about 0.0052) is slightly smaller than those of RF and SVM (about 0.0072 and 0.0125, respectively). The synthesis of results from the MC and BS techniques offers a balanced perspective, emphasizing the strengths and limitations of each approach in capturing the inherent unpredictability of maximum scour depth estimation at sluice outlets.
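The BS procedure summarized above amounts to resampling the test pairs with replacement and recomputing the skill score, whereas the MC variant perturbs the model or its inputs instead. A sketch of the bootstrap side with 1,000 iterations on synthetic observed/predicted values (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observed vs predicted test values (stand-ins for the study's data).
obs = rng.normal(1.5, 0.5, size=120)
pred = obs + rng.normal(0.0, 0.15, size=120)

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Bootstrap: resample the (obs, pred) pairs with replacement 1,000 times and
# recompute CORR; the spread of the scores is the uncertainty estimate.
scores = np.empty(1000)
for i in range(1000):
    idx = rng.integers(0, len(obs), size=len(obs))
    scores[i] = corr(obs[idx], pred[idx])

mean, std = scores.mean(), scores.std()
q1, med, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
```

The mean, standard deviation, median, and IQR of `scores` correspond directly to the statistics reported in Supplementary Table 7 and Figure 6.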

## DISCUSSION

### Comparative analysis: ML models vs. empirical equations

The field of maximum scour depth prediction is witnessing a gradual transition from long-standing empirical methods to ML algorithms. Historically, empirical equations were derived from observational and experimental data to provide a generalized solution for complex physical phenomena. In this context, the empirical formulas of Dey_2006 and Aamir_2022 were introduced to estimate maximum scour depth. However, as evidenced in the results presented, there appears to be a paradigm shift in the effectiveness of prediction methods.

ML models, especially CAT, ANN, and RF, have showcased a pronounced superiority in statistical metrics. The performance benchmarking emphasized their lower deviation and bias compared with empirical equations. Although the Dey_2006 formula still holds up relatively better than the Aamir_2022 formula, the sheer dominance of ML algorithms, particularly CAT and ANN, cannot be overlooked. The consistent accuracy and efficiency delivered by these ML models hint at the vast potential of data-driven techniques in capturing complex processes, a characteristic that traditional empirical formulas sometimes miss (Muzzammil & Alam 2010; Khosravi *et al.* 2021; Le & Le 2024). Furthermore, as more data becomes available from diverse environments and scenarios, these ML algorithms can continuously learn and improve, a feature not easily attainable with static empirical equations (Sharafati *et al.* 2021).

The distinction between ML models and empirical equations becomes even more profound when the predictive capability is dissected. For instance, the scatter plots of the empirical methods reveal a broader dispersion compared with the likes of ANN and CAT. This divergence may be due to the inherent limitations of empirical formulas, as they are built on certain assumptions. These assumptions often oversimplify complex real-world scenarios, leading to compromised predictive abilities (Aamir & Ahmad 2016). These findings align with the observations of Sharafati *et al.* (2021), who pointed out that ML models produce more accurate forecasts and display less bias than conventional empirical methods because of their capacity to adapt and learn from data.

In summary, while empirical equations have played a pivotal role in maximum scour depth predictions for years, the emergence of ML offers a paradigm of increased accuracy and adaptability. Additionally, the ability of ML algorithms to incorporate multidimensional factors into their predictions can enable researchers and professionals to integrate a broader spectrum of parameters, thereby enriching the depth and breadth of scour depth analysis. Such advancements might also pave the way for real-time monitoring and prediction systems, leveraging the real-time learning capability of these algorithms.

### Importance and impact of uncertainty in predictions

The variability in predictions showcased by the various ML models in this study draws attention to the underlying uncertainty levels inherent in their algorithms. One pivotal observation from the results is the disparity between models regarding their permutation importance and SHAP values. The permutation feature importance, while offering a comprehensive view, tends to give a global perspective on predictor significance. On the other hand, SHAP values provide a more granular, instance-specific interpretation of variable importance. Such distinctions emphasize the multifaceted nature of scour depth prediction, wherein each model can process predictor importance differently, thus influencing the final output (Kaur *et al.* 2020). This also underlines the sensitivity of these models to predictors and their interrelationships, which can vary based on the underlying mathematical architecture of each model (Štrumbelj & Kononenko 2014).

Another important aspect to mention is the performance difference across models under uncertainty estimation techniques. Techniques such as MC and BS serve as valuable tools in uncertainty analysis, gauging the robustness and reliability of employed algorithms (Papadopoulos & Yeung 2001). As illustrated, the ANN model emerged as superior in performance, with remarkably consistent predictive capabilities. Such prediction consistency can greatly enhance trust in model outputs, especially when they are used to inform critical decisions. However, it should be noted that high prediction performance does not necessarily guarantee that the model accurately captures real-world complexities (Abdar *et al.* 2021). For instance, despite its slightly lower accuracy, the LAS model had lower uncertainty, which implies more consistent predictions. This observation underscores the trade-off between accuracy and consistency, highlighting that higher accuracy does not necessarily translate to more reliable predictions, especially under varied conditions.

A deeper dive into the MC and BS techniques reveals nuanced differences in their capabilities to capture uncertainties. The MC method displayed a subtle tendency towards optimistic predictions (Nguyen *et al.* 2021). In situations where overestimations could lead to wasted resources or other negative consequences, such biases need to be carefully considered. On the other hand, the BS approach provided a slightly wider spread of predictions, indicating a broader scope of possibilities, which might be preferable in applications requiring more conservative estimates. In summary, resolving these uncertainties carefully and accurately will enhance the reliability and trustworthiness of the models and their utility in real-world applications, fostering more informed decision-making processes.

## CONCLUSIONS

This study advanced the understanding of maximum scour depth prediction behind sluice outlets by evaluating the efficacy of ML models relative to traditional empirical equations. The research notably highlighted how six ML algorithms – RF, ADA, CAT, SVM, LAS, and ANN – outperform the empirical equations of Dey_2006 and Aamir_2022 in predicting maximum scour depth. This achievement underscores the significant potential of integrating ML into hydraulic engineering practices.

The findings demonstrate that variable importance varies across ML models, with the dimensionless densimetric Froude number (*F _{d}*) consistently emerging as a pivotal predictor in most models. This research also established that other hydraulic parameters, such as *d _{t}*/*a*, *F*, *L*/*a*, and *D _{50}*/*a*, significantly influence scour depth predictions, reflecting the complex dynamics that govern scouring processes.

The utilization of permutation importance and SHAP values has provided a nuanced view of how predictors impact model outputs, enhancing the interpretability of ML models. While specific predictors (like *F _{d}* or *F*) maintained consistent importance across models, the relative influence of others witnessed marked fluctuations, highlighting the multifaceted nature of scour depth prediction and the dynamic interactions between predictors.

Our uncertainty analysis, employing both MC and BS methods, revealed varying performance levels among ML models. The ANN displayed superior performance, marked by higher mean values and minimal standard deviations in the predictions, closely followed by the CAT. Other models like RF, ADA, and SVM also showcased commendable performance. The MC simulations generally leaned slightly towards optimistic predictions compared with the BS approach, as evidenced by marginally higher median values and the differences in their interquartile ranges.

The study, however, recognizes certain limitations. The reliance on specific datasets might affect the generalizability of the findings, and the computational intensity of some ML models could restrict their practical applicability in certain settings. The inherent unpredictability in estimating scour depth at sluice gates requires a combination of empirical knowledge and advanced computational techniques. The findings of this study advocate for integrating traditional hydraulic understanding with modern ML algorithms to achieve more accurate and reliable maximum scour depth predictions. In light of these findings, future research could further investigate hybrid models that combine the strengths of multiple algorithms to enhance prediction accuracy and reliability.

## DATA AVAILABILITY STATEMENT

All relevant data are available from an online repository or repositories: https://github.com/LXHien88/Scour_Depth_ML/.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*ArXiv*. 10.48550/arXiv.1706.09516

*ArXiv*. 10.48550/arXiv.2006.10562