Abstract
In natural rivers, flow conditions depend mainly on flow resistance and the type of roughness. The interactions between flow and bedforms are inherently complex, as bedform dynamics primarily regulate flow resistance. Manning's equation is the most frequently used relation for this purpose. There is therefore a need to develop reliable alternative techniques for adequately predicting Manning's roughness coefficient (n) in alluvial channels with bedforms. Thus, the main objective of this study is to employ machine learning (ML) models to predict ‘n’ from six input features. The performance of the ML models was assessed using the coefficient of determination (R2), sensitivity analysis, Taylor diagrams, and box plots, with the K-fold method used for cross-validation. Based on the results of the current work, the random forest, extra trees regression, and extreme gradient boosting models performed extremely well (R2 ≥ 0.99), whereas the Lasso regression model showed only moderate efficiency in predicting roughness. The sensitivity analysis indicated that the energy grade line has a greater impact on predicting roughness than the other parameters. The alternative approach used in the present study provides insights into riverbed characteristics, enhancing the understanding of the complex relationship between roughness and the other independent parameters.
HIGHLIGHTS
This study focuses on accurately predicting n in alluvial channels with bedforms.
The intricate interplay between flowing water and bedforms adds complexity to flow resistance prediction.
A significant observation is that integrating all input parameters results in enhanced accuracy when predicting flow resistance.
Leveraging modern techniques, the study employs four machine learning models to predict n.
INTRODUCTION
In natural alluvial river channels, flowing water mobilizes sediment particles on the bed surface; the onset of particle movement is referred to as incipient motion. It is the critical threshold at which the forces acting on the particles overcome their resistance to motion, initiating sediment transport. The mobile sediment mainly comprises bed load and suspended load. The different shapes and geometries that develop from bed load transport under varied flow conditions are known as bedforms. Furthermore, the flow conditions largely depend on these various types of bedforms (Kwoll 2016). The relatively small or large bedforms include dunes, ripples, antidunes, bars, chutes, pools, etc. Dunes and ripples are asymmetric, triangular-shaped geometries that develop under lower flow conditions, while the other bedforms develop at higher flow conditions (Cardenas & Wilson 2007; Dey 2014a; Lefebvre 2019). The morphology (shape, size, and spacing) of bedforms depends on various characteristics, such as flow velocity, flow depth, and the attributes of riverbed sediments. The larger the bedform, the higher the resistance it offers to the flow (Venditti 2007, 2013).
In the case of open channel flows, the roughness coefficient (n) is more significant than the friction factor (Alam & Kennedy 1969). Various other factors, such as hydraulic flow conditions, channel shape, sediment transport, bed load fluctuations, bedforms, and so on, notably affect this parameter (Bridge 1993; Kumar et al. 2023a, 2023b). It is essential to have a comprehensive understanding of these parameters and their influence on ‘n’ to obtain accurate roughness values in alluvial channels with bedforms. The parameter ‘n’ plays a crucial role in channel dynamics and is used to characterize the roughness of the channel. The factors mentioned earlier, along with the surface conditions and vegetation cover, uniquely affect flow resistance within the channel. Therefore, by understanding the influence of these factors on the value of ‘n’, engineers and researchers can make more accurate predictions and calculations related to fluid flow and hydraulic systems.
The importance of various bedforms for ‘n’ has been examined by multiple researchers (van der Mark et al. 2008; Shiono et al. 2009; Aberle et al. 2010; Roushangar et al. 2017). One study examined the consequences of dune morphology for flow resistance in an alluvial channel (Talebbeydokhti et al. 2006). The influence of dune characteristics on flow resistance within the channel was observed, the relationship between dune geometry and flow resistance was analyzed, and insights into the hydraulic behaviour of channels with dune bedforms were provided. Further research was carried out to understand the impact of bed load fluctuations and sediment transport on flow resistance in channels covered with bedforms. The presence of dune formations and their effect on flow dynamics and resistance within the channel was observed, and the interaction between bed load transport and dune bedforms was also studied, with the intention of enhancing the understanding of sediment transport processes and hydraulic behaviour in channels with dunes (Omid et al. 2010).
The friction factor in gravel-bed rivers with bedforms has been examined by various researchers (Griffiths 1981; Clifford et al. 1992; Darby 1999; Venditti 2013; Dey 2014a). The contribution of bedforms to the friction factor and its influence on flow resistance was also studied. By examining the relationship between bedforms and the friction factor, it was suggested that the characterization of hydraulic behaviour in gravel-bed rivers with varying bed configurations needed improvement (Afzalimehr et al. 2010). The effects of different bedforms, such as dunes and ripples, were further investigated, along with the impact of bank vegetation on the flow dynamics (Murray & Paola 2003; Gilvear & Willby 2006; Kabiri et al. 2017). It was concluded that bedform configurations, together with the presence of vegetation on the channel banks, influenced flow conditions, and that studying the interaction between bedforms, vegetated banks, and flow conditions helps explain the complex hydraulic behaviour of natural channels (Dehsorkhi et al. 2011). The influence of flow conditions on sediment transport, bed load, and the associated bedforms in channels with low, mild, and steep slopes has also been examined (Lisle 1982; Young & Davies 1991; Harbor 1998; Buffington et al. 2002; Roushangar et al. 2017). By exploring the interconnection between flow conditions and bed load transport, the understanding of sediment dynamics in alluvial river channels with different slopes was enhanced (Chegini & Pender 2012). In other research, the effect of the Reynolds stress distribution and velocity over sand-bedded dunes was studied, and the flow patterns and turbulence characteristics associated with gravel dunes in a channel were investigated.
By studying the velocity and Reynolds stress distributions, insight into the flow dynamics and sediment transport processes influenced by gravel dunes was provided (Kabiri 2014).
Venditti (2007, 2013) observed the consequences of the dune leeside slope on flow resistance and turbulent flow structure. It was found that the slope of dune formations influences flow resistance and flow characteristics. The relationship between flow resistance, dune leeside slope, and turbulent flow structure was examined to enhance the understanding of flow dynamics in channels with dunes. A numerical model, Delft3D, was used to evaluate the effects of bedform roughness on hydrodynamic and sediment transport patterns (Brakenhoff et al. 2020). Another study investigated the bedform friction factor in armoured river channel beds, examining the influence of armoured bed conditions on flow resistance and hydraulic behaviour (Okhravi & Gohari 2020). Furthermore, a study examined the impact of flow hydrodynamics on two-dimensional dunes (Dey et al. 2020), with the main aim of understanding the flow characteristics and hydrodynamic forces that govern the development and behaviour of two-dimensional dune formations (Dey 2014a). A laboratory study examined the impact of bedforms with varying particle sizes on parameters such as bed shear stress and flow resistance, and also observed how bedforms of different sizes influence flow resistance and bed shear stress within a channel (Heydari & Yarahmadi 2022).
As discussed earlier, most of the research has been conducted in the laboratory using flume experiments to gather the essential data. The experimental approach in the laboratory is a time- and effort-consuming process that may not be feasible in some situations. In a channel consisting of various bedforms, it is therefore important to make appropriate predictions using soft computing techniques when a parameter relates to hydraulic and sedimentary processes. A study was conducted to predict values of ‘n’ using various soft computing methods. An artificial neural network (ANN) model was used to capture the non-linear relationships among the parameters that influence it (Yuhong & Wenxin 2009). The study focused on developing a reliable method to estimate the friction factor by training the ANN model with relevant data. Another study developed an adaptive neuro-fuzzy inference system (ANFIS) to establish a strong relationship between the experimental and observed data for estimating the output parameter, i.e., the friction factor (Essays 2011). The ANFIS model was designed to learn adaptively from the input data and provide accurate predictions based on fuzzy logic principles. Moreover, ANFIS models were also used to estimate the friction factor in dune-bed rivers (Roushangar et al. 2014). In addition, feed-forward neural networks (FFNN) and radial basis function neural networks (RBFNN) were utilized. The study compared the performance of these models and found that ANFIS outperformed the other models in accuracy, though it was computationally less efficient. A sensitivity analysis revealed that parameters such as the Reynolds number (Re) and the ratio of hydraulic radius to median grain size (R/d50) significantly influenced the prediction of the friction factor.
Models such as ANN and genetic programming were used to estimate grain-size and Manning's coefficients with higher accuracy than empirical formulas (Niazkar et al. 2019). The study aimed to improve the estimation of these coefficients by employing advanced computational methods. ANN and ANFIS models were also used to estimate the roughness coefficient in erodible channels (Zanganeh & Rastegar 2020). The sensitivity analysis indicated that Re had the most important effect on predicting ‘n’ in alluvial channels. Various other models were likewise used to estimate the friction factor in dune- and ripple-bed rivers (Saghebian et al. 2020), including combined models such as multilayer perceptron firefly neural networks (MLP-FFNN) and multilayer perceptron firefly algorithms (MLP-FFA).
The study emphasized the importance of parameters such as Re, the Froude number (Fr), and R/d50 in accurately modelling the friction factor based on bedform characteristics (Yao et al. 2023). In another study, a successive-approximation-based stepwise optimizing strategy was demonstrated to yield better solutions than other models; the proposed model outperformed models that did not consider the variation of the friction factor with both discharge and sediment conditions. Another recent study investigated the prediction of ‘n’ in rivers with bedforms using soft computing models such as the multilayer perceptron, the group method of data handling, support vector machines, and genetic programming (Yarahmadi et al. 2023). That study explored the influence of flow conditions, the energy grade line, the Froude number, relative submergence, and dimensionless bedform parameters on the estimation of ‘n’. The main aim of these studies was to provide valuable insights into accurately estimating ‘n’ in rivers with bedforms, contributing to the field of hydraulic analysis and design.
Previous studies have made significant contributions to understanding flow conditions, bedforms, sediment transport, and flow resistance in natural channels (Balachandar & Patel 2008; Patel et al. 2015, 2016, 2017; Patel 2017; Patel & Kumar 2017; Brakenhoff et al. 2020; Heydari & Yarahmadi 2022). In addition, researchers have employed a variety of machine learning (ML) and hybrid models to forecast various parameters, emphasizing the significance of these techniques in practical applications of civil engineering (Chadalawada et al. 2020; Jiang et al. 2022; Bassi et al. 2023a, 2023b; Kumar et al. 2023a, 2023b; Singh & Patel 2023; Wadhawan et al. 2023). In another recent study, various ML techniques were used to determine the friction factor in mobile-bed channels (Bassi et al. 2023b). However, these studies did not consider the application of advanced soft computing models such as Lasso regression (LR), extra trees regression (ETR), random forest (RF), and extreme gradient boosting (XGB) to predict ‘n’ in rivers with bedforms. The significance of the present study lies in filling this research gap by employing the above-mentioned alternative techniques to estimate ‘n’ in alluvial rivers with bedforms. By utilizing these models, the current study aims to enhance prediction accuracy by exploring the complex interconnections among the hydraulic, sedimentary, and geometric variables influencing ‘n’. The outcomes of the study are anticipated to make noteworthy contributions to hydraulic analysis, offering invaluable understanding for accurate estimation of Manning's roughness coefficient in rivers characterized by bedforms. Furthermore, the findings provide an advanced comprehension of flow dynamics and sediment transport mechanisms in natural alluvial channels, thereby refining hydraulic designs and management strategies across diverse river engineering projects.
MATERIAL AND METHODS
The experimental data were used to predict the value of ‘n’ using various ML algorithms. This section discusses laboratory equipment, experimental procedure, and a summary of different ML models employed. The data for the analysis in the current study were extracted from a previous study (Yarahmadi et al. 2023). The details of the experimental setup and other important parameters are provided in the subsequent sections.
Experimentation and procedure
Bedforms
The ripples and dunes are the two types of bedforms used for experiments. The bedforms were developed in the shape of asymmetrical triangles. The upstream (u/s) side of the bedform with a triangular shape had a gentle slope, while the downstream (d/s) side had a steep slope. The slope of the d/s end was set equal to the angle of repose of the mobile bed sediments, approximately 32°. This design ensured that each bedform resembled a natural asymmetric triangle.
Experimental groups
The experiments were divided into two groups. In the first group, the bedforms had dimensions of 20 cm length, 30 cm width, and 4 cm height, with the angle of the d/s side maintained at 32°. To roughen the bedform surface, sediments of sizes 0.51, 1.29, and 2.18 mm were applied using an adhesive. In the second group, the bedforms had dimensions of 25 cm length and 30 cm width, with the d/s angle again at 32°; this group included bedforms with four different heights: 1, 2, 3, and 4 cm. To roughen the surface, sediment of size 0.45 mm was applied. In both groups, sand with a relative density of 2.65 was used.
Attachment of bedforms
The bedforms were fixed to the bottom of the experimental flume along the entire length of the test section. This attachment ensured that the bedforms remained stable during the experiments and allowed for controlled flow conditions.
Experimental parameters
The various experiments were conducted with different discharges and bed slopes. Discharge rates of 10, 15, 20, 25, and 30 L/s were tested. The bed slopes ranged from 0 to 0.0015. These parameters were chosen to study the influence of flow intensity and bed slope on the behaviour of the bedforms. A total number of 215 experiments were conducted, covering a wide range of flow conditions.
Geometric and hydraulic parameters
Several geometric and hydraulic parameters were measured and analyzed in this study. These included the dimensions of the bedforms (length, width, and height), depth of flow, sediment size, flow rates, velocity of flow, and Fr. These parameters provided valuable insights into the characteristics and interactions of the bedforms under different flow conditions.
Dimensional analysis
The dimensional analysis carried out in this study focused on determining the variables that affect the value of ‘n’ of channels with different bedforms. The various parameters which were considered are shown in Table 1.
| Variables | Symbol | Variables | Symbol |
|---|---|---|---|
| Flow velocity (m/s) | V | Channel width (m) | B |
| Gravitational acceleration (m/s2) | g | Energy grade line (m) | Sf |
| Flow depth (m) | y | Bedform length (m) | λ |
| Specific mass (kg/m3) | ρw | Bedform height (m) | Δ |
| Dynamic viscosity (Pa s) | μ | Bedform u/s angle (degree) | α |
| Specific sediment mass (kg/m3) | ρs | Bedform d/s angle (degree) | θ |
| Average diameter (mm) | d50 | Froude number | Fr |
This dimensional analysis helps to identify the critical dimensionless parameters that affect the value of n in channels with bedforms, thus providing valuable insights for understanding and predicting flow behaviour in such channels.
Theoretical analysis
In the context of river hydraulics, Manning's equation plays a crucial role in relating flow velocity to ‘n’ (Bhattacharya et al. 2019; Tuozzolo et al. 2019). This relationship emphasizes the significance of understanding ‘n’ when predicting flow dynamics in rivers. Moreover, the hydraulic radius (R) is inherently linked to ‘n’ because it depends on the shape and dimensions of the channel, so changes in channel geometry can have a direct impact on ‘n’. Furthermore, the slope (S) plays a vital role within Manning's equation, as variations in this parameter can initiate ripple effects, impacting flow velocity and, consequently, ‘n’. Variations in S are often attributable to bedforms or other geomorphic features and have a substantial influence on ‘n’ (Thomas & Nisbet 2007). Understanding these interconnections is pivotal for comprehending and predicting the dynamic nature of river flows.
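As a concrete illustration of the relationship discussed above, Manning's equation in SI units can be sketched as a small function. The numerical values below are illustrative only and are not taken from the study's dataset:

```python
def manning_velocity(n, R, S):
    """Mean flow velocity from Manning's equation (SI units):
    V = (1/n) * R**(2/3) * S**(1/2)
    n -- Manning's roughness coefficient
    R -- hydraulic radius (m)
    S -- slope of the energy grade line (m/m)
    """
    return (1.0 / n) * R ** (2.0 / 3.0) * S ** 0.5

# Illustrative values: n and S lie in the ranges reported later in this study
V = manning_velocity(n=0.023, R=0.12, S=0.005)
```

Because V varies as 1/n, even small errors in the predicted roughness propagate directly into the estimated velocity, which is why accurate prediction of ‘n’ matters.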
In addition to hydraulic features, the sediment transport phenomenon constitutes a crucial aspect of the theoretical analysis, examining the intricate mechanics of sediment movement within river channels (Venditti 2013; Dey 2014b). On the other hand, there are a number of significant factors, such as sediment particle size, concentration, and the dynamic behaviour of various bedforms, which may affect the sediment transport. After analyzing these parameters, the theoretical analysis aims to interpret the complex interaction between sediment characteristics and ‘n’. The size and concentration of the sediment particles have a significant impact on ‘n’, consequently helping in understanding the role of controlling flow resistance. Furthermore, the analysis considers the dynamic nature of bedforms, such as ripples or dunes, that can significantly alter sediment transport patterns, subsequently affecting flow resistance (Dey 2014a). By following that, the analysis establishes a robust theoretical basis for understanding the complex connection between sediment transport and ‘n’. This improves the capability to model and interpret these aspects in alluvial channels with bedforms.
Moreover, understanding fluvial hydrodynamics is essential when dealing with alluvial channels, as it requires a comprehensive exploration of natural geomorphic features that take different shapes and evolve over time. This theoretical framework considers the processes that give rise to the formation of different bedforms and their subsequent transformations. The current study thoroughly examines the experimental models that explain the bedforms' behaviour and interaction with the flow in closed environments.
Input parameter combinations
Various combinations of input parameters were tested to observe their effect on the output parameter. The different cases (C1–C6) and their respective combinations of input and output parameters are presented in Table 2. Each case represents a specific scenario considered in the analysis. In C1, the output parameter ‘n’ (Manning's roughness coefficient) is determined from a combination of input parameters comprising Sf (slope of the energy grade line in the channel), Fr (Froude number), y/d50 (ratio of flow depth to median grain size), Δ/d50 (ratio of bedform height to median grain size), Δ/λ (ratio of bedform height to bedform length), and Δ/y (ratio of bedform height to flow depth). Case C2 simplifies the combination by excluding Sf while still considering Fr, y/d50, Δ/d50, Δ/λ, and Δ/y. In C3, the combination is reduced further by excluding Fr, leaving y/d50, Δ/d50, Δ/λ, and Δ/y as the input parameters. Case C4 focuses on Δ/d50, Δ/λ, and Δ/y, excluding y/d50. Case C5 considers only Δ/λ and Δ/y, and C6 uses Δ/y alone. Each case represents a specific combination of input parameters that contribute to determining the output parameter ‘n’.
| Cases | Combinations of input parameters | Output parameter |
|---|---|---|
| C1 | f (Sf, Fr, y/d50, Δ/d50, Δ/λ, Δ/y) | Manning's roughness coefficient (n) |
| C2 | f (Fr, y/d50, Δ/d50, Δ/λ, Δ/y) | |
| C3 | f (y/d50, Δ/d50, Δ/λ, Δ/y) | |
| C4 | f (Δ/d50, Δ/λ, Δ/y) | |
| C5 | f (Δ/λ, Δ/y) | |
| C6 | f (Δ/y) | |
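The case structure of Table 2 can be sketched as a small helper that selects the input columns for each case from a dataset. The column names below are illustrative labels (e.g. `D_y` for Δ/y), not identifiers from the original data files:

```python
import pandas as pd

# Input feature sets for cases C1-C6 (illustrative column labels)
CASES = {
    "C1": ["Sf", "Fr", "y_d50", "D_d50", "D_lambda", "D_y"],
    "C2": ["Fr", "y_d50", "D_d50", "D_lambda", "D_y"],
    "C3": ["y_d50", "D_d50", "D_lambda", "D_y"],
    "C4": ["D_d50", "D_lambda", "D_y"],
    "C5": ["D_lambda", "D_y"],
    "C6": ["D_y"],
}

def case_inputs(df, case):
    """Return (X, y) for a given case: the selected input columns and the target n."""
    return df[CASES[case]], df["n"]
```

Framing the cases as data rather than code makes it easy to loop over all six combinations and compare model accuracy per case.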
Statistical analysis
Statistical analysis was conducted on the data to determine the variability and trends in the measured parameters. The energy grade line (Sf) ranged from 0.004 to 0.006, with an average value of 0.005083 and a standard deviation of 0.000797. The distribution of Sf exhibited a negative skewness of −0.1508, indicating a slight asymmetry towards lower values, and a kurtosis of −1.40819, indicating a platykurtic distribution. The Froude number (Fr) varied from 0.2 to 0.7, with an average value of 0.455349 and a standard deviation of 0.177092; its distribution showed a slight negative skewness of −0.07631 and a kurtosis of −1.3743, again platykurtic. The ratio of flow depth to sediment size (y/d50) ranged between 70.849 and 287.709, with an average value of 172.6865 and a standard deviation of 62.42025; its distribution was nearly symmetrical, with a slight negative skewness of −0.07039 and a platykurtic kurtosis of −1.28172. The ratio of bedform height to sediment size (Δ/d50) varied between 0.06 and 0.18, with an average value of 0.126953 and a standard deviation of 0.036966; its distribution showed a negative skewness of −0.34247 and a platykurtic kurtosis of −1.04904. The ratio of bedform height to bedform length (Δ/λ) ranged between 0.13 and 0.46, with an average value of 0.293023 and a standard deviation of 0.098083; its distribution was nearly symmetrical, with a slight negative skewness of −0.05843 and a platykurtic kurtosis of −1.15867. Finally, the output parameter ‘n’ had an average value of 0.023321, as shown in Table 3.
| Parameters | Sf | Fr | y/d50 | Δ/d50 | Δ/λ | Δ/y | n |
|---|---|---|---|---|---|---|---|
| Maximum | 0.006 | 0.7 | 287.709 | 0.18 | 0.46 | 82.317 | 0.032 |
| Minimum | 0.004 | 0.2 | 70.849 | 0.06 | 0.13 | 19.276 | 0.013 |
| Standard deviation | 0.000797 | 0.177092 | 62.42025 | 0.036966 | 0.098083 | 19.58988 | 0.005703 |
| Average | 0.005083 | 0.455349 | 172.6865 | 0.126953 | 0.293023 | 49.89475 | 0.023321 |
| Kurtosis | −1.40819 | −1.3743 | −1.28172 | −1.04904 | −1.15867 | −1.34305 | −1.21499 |
| Skewness | −0.1508 | −0.07631 | −0.07039 | −0.34247 | −0.05843 | −0.0669 | −0.34548 |
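The summary statistics of Table 3 can be reproduced with pandas; the short series below is synthetic and only demonstrates the calls. Note that pandas reports excess kurtosis, so negative values indicate a platykurtic distribution, as in the table:

```python
import pandas as pd

# Synthetic sample of Sf values for demonstration; the study used 215 runs
sf = pd.Series([0.004, 0.005, 0.006, 0.005, 0.004, 0.006])

summary = {
    "max": sf.max(),
    "min": sf.min(),
    "mean": sf.mean(),
    "std": sf.std(),              # sample standard deviation
    "skew": sf.skew(),            # Fisher-Pearson skewness
    "kurtosis": sf.kurtosis(),    # excess kurtosis (< 0 => platykurtic)
}
```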
Machine learning models
Random forest
RF is a powerful ensemble learning technique that leverages the concept of bagging to build a collection of decision trees for making predictions. By training each tree on a different bootstrap subset of the training data, RF reduces the risk of overfitting and enhances the model's ability to generalize to new data. Unlike traditional decision trees, RF incorporates feature randomness by considering only a random subset of features at each split, further enhancing the model's robustness and reducing the correlation between trees. The final prediction of the RF model is obtained by aggregating the predictions of all individual trees, resulting in a more accurate and reliable prediction without an explicit equation (Belgiu & Drăgu 2016; Yoon 2021). RF has gained immense popularity in the field of ML due to its remarkable predictive capabilities and resilience against overfitting; it is particularly useful when dealing with intricate, high-dimensional datasets and has found applications in various other domains as well.
During the construction of each decision tree, only a random subset of features is considered at each split point. By doing so, the model prevents a single dominant feature from influencing the entire forest, making it less sensitive to noise in the data and improving its generalization performance (Qi 2012; Chang et al. 2018). The final prediction of an RF model is determined through ensemble averaging. Each decision tree within the forest makes its prediction, and for regression tasks, these predictions are averaged. In classification tasks, a majority vote is conducted to determine the final class label. One of the standout advantages of RF is its capability to handle high-dimensional datasets with a substantial number of features without succumbing to overfitting. Additionally, it demonstrates robustness to outliers and offers insights into feature importance, aiding in the identification of the most influential features for making predictions. Due to its versatility and effectiveness, RF has become a staple in real-world applications, such as disease diagnosis, credit risk assessment, and natural language processing tasks. While it lacks an explicit equation like linear regression, RF excels in capturing intricate relationships within the data and stands as a valuable tool in the era of ML and artificial intelligence (AI) (Probst et al. 2019).
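The workflow described above, including the K-fold cross-validation mentioned earlier, can be sketched with scikit-learn on synthetic data. The dataset below is random and merely stands in for the 215 experimental runs; it is not the study's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((215, 6))              # 6 input features, 215 samples as in the study
y = 0.013 + 0.019 * X[:, 0]           # synthetic target within the reported range of n

# Bagged trees with feature randomness; predictions are averaged over the forest
model = RandomForestRegressor(n_estimators=200, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
```

Averaging R2 over the five folds gives a more honest estimate of generalization than a single train/test split.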
Lasso regression
LR, also known as L1 regularization or the Lasso penalty, is a linear regression technique that introduces a penalty term to the least squares objective function. Its primary purpose is to perform feature selection by shrinking the coefficients of less significant features towards zero. This regularization technique helps mitigate the risk of overfitting and can yield a more interpretable model by excluding irrelevant features. While the equation for LR is similar to that of linear regression, it incorporates an additional penalty term that encourages sparsity in the coefficient estimates (Alhamzawi & Ali 2018; Pai et al. 2021). The primary objective of LR is twofold. First, it aims to minimize the residual sum of squares, the same objective as in ordinary least squares regression: LR seeks to fit a linear relationship between the dependent variable and the independent variables. The difference in the Lasso is its second objective: feature selection. The Lasso penalty term is designed to shrink the coefficients of less significant features toward zero. In practical terms, this means that Lasso can effectively eliminate or reduce the impact of irrelevant or less important variables in the model. By encouraging sparsity in the coefficient estimates, Lasso helps to select a subset of the most influential features while disregarding those that do not significantly contribute to the prediction. This feature selection aspect of Lasso makes it particularly valuable when working with high-dimensional datasets, where identifying and utilizing relevant features can be challenging (Chang et al. 2018; Jun & ZeXin 2021).
In addition to this, LR is widely used in domains such as economics, where interpretable models are of paramount importance. The equation for LR resembles that of linear regression, with the addition of the L1 regularization term, which penalizes the absolute values of the coefficients. The choice of the regularization strength, often denoted as λ, determines the degree of shrinkage applied to the coefficients: a larger λ value results in greater shrinkage and, consequently, more features with coefficients reduced to zero. LR stands as a valuable tool for linear modelling when the goal is not only to predict accurately but also to identify the most relevant features and simplify the model. Its ability to strike a balance between model complexity and predictive performance makes it a significant choice in scenarios where interpretable models are desired (Alanazi 2022).
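The shrinkage behaviour described above can be sketched as follows. The synthetic target depends strongly on one feature and weakly on a second, so with a suitable regularization strength the Lasso keeps the strong feature and zeroes out the rest (the inputs are standardized first so that the penalty acts on comparable scales):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((215, 6))
y = 0.013 + 0.019 * X[:, 0] + 0.001 * X[:, 1]   # one strong, one weak feature

# alpha is the regularization strength (the lambda of the text);
# larger alpha => stronger shrinkage and more coefficients driven to zero
model = Lasso(alpha=0.001)
model.fit(StandardScaler().fit_transform(X), y)
n_selected = int((model.coef_ != 0).sum())
```

Inspecting `model.coef_` shows which inputs survive the penalty, which is exactly the feature-selection property discussed above.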
Extra trees regression
ETR is another ensemble learning method similar to RF. It also combines multiple decision trees through the technique of bagging. However, ETR introduces additional randomness by selecting random splits for each feature rather than searching for the best split. This randomness helps to increase the diversity among the trees and reduces overfitting. The equation for ETR is not explicitly defined, as it combines multiple decision trees (Jaiswal & Lohani 2023). In standard decision trees, when selecting the best split for a node, the algorithm considers all available features and evaluates various splitting criteria to find the optimal one. In contrast, ETR takes a different approach: it introduces randomness by considering only a random subset of features at each node for determining the split. This means that instead of exhaustively searching for the best split, ETR makes a more randomized choice. This feature randomness adds diversity to the ensemble because different trees can choose different features for splitting at each node. The advantage of this added randomness is that it helps to reduce the correlation between individual trees within the ensemble, making the model less susceptible to overfitting. Overfitting occurs when a model captures noise in the data rather than the underlying patterns; by introducing diversity, ETR can mitigate this risk. Regarding the mathematical equation for ETR, as with RF, there is no single explicit equation that defines the model. Instead, ETR combines the predictions from multiple decision trees within the ensemble to make the final prediction. The ensemble approach ensures that the model leverages the combined knowledge of all trees, leading to more robust and accurate predictions. In addition, ETR builds upon the principles of bagging and introduces additional randomness in the feature selection process.
This randomness enhances the diversity among trees, reducing overfitting and improving predictive accuracy (Qi 2012; Liu et al. 2015).
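The random-split idea at a single tree node can be sketched as follows; the function name and the toy feature table are illustrative, not part of the study:

```python
import random

def extra_trees_candidate_splits(features, n_features, rng):
    """Extra-Trees style node splitting: pick a random subset of features,
    then draw each cut point uniformly between that feature's min and max,
    instead of scanning all thresholds for the best split."""
    names = rng.sample(sorted(features), k=min(n_features, len(features)))
    return {name: rng.uniform(min(features[name]), max(features[name]))
            for name in names}

rng = random.Random(0)
features = {"Sf": [0.001, 0.004, 0.009],
            "Fr": [0.3, 0.6, 1.2],
            "y_d50": [40.0, 90.0, 160.0]}
splits = extra_trees_candidate_splits(features, n_features=2, rng=rng)
```

Because the threshold is drawn at random rather than optimized, two trees given the same node data will generally split differently, which is exactly the decorrelation mechanism described above.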
XGBoost
The model XGBoost is a cutting-edge gradient-boosting algorithm renowned for its remarkable predictive accuracy and versatility. It integrates gradient-boosting principles with regularization techniques to build an ensemble of weak prediction models, typically decision trees, and iteratively enhances their predictive capabilities. It differs from other models in its advanced regularization methods, including L1 and L2 regularization, which significantly counteract overfitting and promote robust model generalization; these regularization terms are incorporated into the algorithm's loss function and controlled precisely through hyperparameters. XGBoost is further enhanced by a wide range of capabilities, including the assessment of feature importance, early stopping of training, and an efficient tree-pruning mechanism, all of which collectively strengthen its power and effectiveness. With its ensemble approach, regularization capabilities, and efficient optimization, XGBoost is a formidable tool for diverse ML applications, spanning from classification and regression to ranking and recommendation systems, making it a top choice among data scientists seeking both exceptional predictive performance and interpretability (Tang et al. 2015; Sharma et al. 2022).
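The core boosting loop can be sketched from scratch for squared loss, where the negative gradient is simply the residual; this toy, one-feature illustration (with made-up data) shows the principle only and omits XGBoost's regularization and second-order optimization:

```python
class Stump:
    """Depth-1 regression tree: one threshold, two leaf means."""
    def fit(self, x, y):
        best = None
        for t in x:
            left = [yi for xi, yi in zip(x, y) if xi <= t]
            right = [yi for xi, yi in zip(x, y) if xi > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((yi - lm) ** 2 for yi in left)
                   + sum((yi - rm) ** 2 for yi in right))
            if best is None or sse < best[0]:
                best = (sse, t, lm, rm)
        _, self.t, self.left_mean, self.right_mean = best
        return self

    def predict(self, xi):
        return self.left_mean if xi <= self.t else self.right_mean

def gradient_boost(x, y, rounds=200, eta=0.1):
    """Each round fits a stump to the current residuals and adds it,
    scaled by the learning rate eta (XGBoost's 'eta' hyperparameter)."""
    pred = [sum(y) / len(y)] * len(x)
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = Stump().fit(x, residuals)
        pred = [pi + eta * stump.predict(xi) for pi, xi in zip(pred, x)]
    return pred

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.020, 0.024, 0.031, 0.035, 0.042]   # illustrative 'n'-like values
fitted = gradient_boost(x, y)
```

A small eta means each tree corrects only a fraction of the remaining error, which is why boosting pairs a low learning rate with many estimators (e.g., eta = 0.01 with 1,800 trees in Table 4).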
Data preprocessing
During the data preprocessing stage for the four regression models (LR, ETR, RF, and XGB), the wide variation in parameter ranges must be addressed. A normalization technique is therefore applied to bring the numerical columns onto a standard scale: each column is standardized to a mean of 0 and a standard deviation of 1. This ensures consistency and comparability among the different parameters used in the models, so that each parameter contributes equally to the overall analysis and no bias arises from differences in their original scales. Data normalization thus plays a crucial role in preparing the data for regression modelling and enables accurate and meaningful interpretation of the results.
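The z-score transformation described here can be sketched with the standard library alone; the column values below are illustrative, not from the dataset:

```python
from statistics import mean, stdev

def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

# Columns on very different scales become directly comparable.
slopes = [0.001, 0.003, 0.007, 0.009]        # illustrative S_f-like values
depth_ratios = [45.0, 80.0, 120.0, 155.0]    # illustrative y/d50-like values
z_slopes, z_ratios = standardize(slopes), standardize(depth_ratios)
```

After the transform, a one-unit change means "one standard deviation" for every parameter, which is what prevents large-magnitude features from dominating the regression.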
Data split
In the data splitting process, the initial dataset, comprising 215 data points, is divided into two sets: the training set and the testing set. The splitting is performed in a ratio of 80:20, meaning that 80% of the data (172 data points) is allocated to the training set, while the remaining 20% (43 data points) forms the testing set. The purpose of this division is to have a dedicated portion of the data used solely for training the ML models. This allows the models to learn patterns, relationships, and features present in the data during the training phase. The training set serves as the basis for building and optimizing the models, and the testing set is utilized to evaluate the performance and generalization capabilities of the trained models. After being trained on the training set, the models are applied to the testing set to make predictions or classifications. By comparing the predicted outcomes with the actual known outcomes in the testing set, the models' performance metrics, such as accuracy, precision, recall, or mean squared error (MSE), can be computed to assess how well they generalize to unseen data. Splitting the data into training and testing sets is a common practice in ML to ensure unbiased model performance evaluations. It helps to assess the models' ability to handle new, unseen data and provides a realistic estimation of their effectiveness in real-world scenarios.
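The 80:20 split described above can be sketched as follows (the shuffle seed is arbitrary, chosen only to make the example reproducible):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices, hold out test_fraction for testing, rest for training."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

rows = list(range(215))          # stand-in for the 215 data points
train, test = train_test_split(rows)
# 215 rows -> 172 training, 43 testing
```

Shuffling before splitting matters: it prevents any ordering in the laboratory runs (e.g., by flow condition) from leaking systematically into only one of the two sets.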
Criteria for model evaluation
The RMSE is derived as the square root of the MSE, providing a measure of the average magnitude of the prediction errors. The RAE is the sum of the absolute differences between the forecasted and actual values, normalized by the sum of the absolute deviations of the actual values from their mean; this metric evaluates prediction accuracy relative to a naive mean-only baseline and to the scale of the data. The RRSE is likewise the square root of the ratio of the sum of squared prediction errors to the sum of squared deviations of the observed values from their mean, expressed as a percentage; it provides a relative measure of the prediction error compared with the overall variability of the observed data. Furthermore, the coefficient of determination (R2) is employed to assess the proportion of the variance in the dependent variable that the independent variables in the model can explain. A higher R2 value indicates a better fit of the model to the data.
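Assuming the standard mean-baseline forms of RAE and RRSE, the four indices can be computed in a few lines:

```python
from math import sqrt

def evaluate(actual, predicted):
    """RMSE, RAE, RRSE (as fractions; multiply by 100 for %) and R2."""
    n = len(actual)
    mean_a = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sae = sum(abs(a - p) for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)   # squared deviations from mean
    sad = sum(abs(a - mean_a) for a in actual)     # absolute deviations from mean
    return {
        "RMSE": sqrt(sse / n),
        "RAE": sae / sad,
        "RRSE": sqrt(sse / sst),
        "R2": 1.0 - sse / sst,
    }
```

Note that RRSE and R2 are two views of the same ratio (R2 = 1 − RRSE²), which is why models with R2 near 1 in this study also show RRSE values of only a few percent.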
In addition to these evaluation measures, heatmap data visualization and parametric analysis are conducted to examine the correlation between the various parameters and the output parameter. These analyses help identify the relationships and dependencies among the variables and provide insights into the predictive capabilities of the models. With these evaluation measures and analytical techniques, the reliability and accuracy of the models in predicting ‘n’ for the movable-bed channel can be assessed, allowing informed decision-making and further refinement of the models where necessary. In addition, Table 4 presents the hyperparameter settings and configurations of the ML models used in this study, listing each model with its optimized hyperparameter values, such as N_estimators, Max_depth, and learning rate. These configurations serve as guidelines for setting hyperparameters when using these models in a predictive task.
| Models | Hyperparameter | Optimized value |
|---|---|---|
| XGB | N_estimators | 1,800 |
| XGB | Eta | 0.01 |
| XGB | Max_depth | 5 |
| XGB | Subsample | 0.5 |
| XGB | Colsample_bytree | 1 |
| RF | Max_depth | 5 |
| RF | Min_sample_split | 2 |
| RF | Max_features | Auto |
| ETR | N_estimators | 100 |
| ETR | Max_features | Auto |
| LR | Alpha | 1 |
| LR | Selection | Cyclic |
In the equations provided, the variable ‘n’ is Manning's roughness coefficient related to the bedforms, and n′ represents the overall roughness. The statistical error indices, such as R2 and RMSE, were calculated to assess the accuracy of these equations using the experimental data, and their values were obtained for all the models under the various input scenarios. Each input case was formed by removing one parameter at a time (C1, C2, C3, C4, C5, and C6). These error indices measure the fit and precision of the equations in relation to the data observed in the laboratory study, and their assessment validates the exactness of the equations in estimating ‘n’ for alluvial river channels with bedforms such as dunes and ripples. Given the significance of ‘n’ in river hydraulic studies and the requirement for more adequate estimations, the current study emphasizes developing and evaluating different ML models for the prediction of this parameter, with the main aim of exploring the accuracy of ‘n’ estimation in alluvial river channel applications.
The correlation coefficients provide insights into the relationships between the variables in the dataset. Sf (energy grade line slope) correlates only weakly, and negatively, with the other variables: Fr (Froude number) has a correlation of −0.283 with Sf, the relative flow depth y/d50 a correlation of −0.3, Δ/d50 of −0.335, Δ/λ of −0.291, and Δ/y of −0.388. The remaining variables are strongly and positively inter-correlated: y/d50 correlates at 0.882 with Fr; Δ/d50 at 0.961 with Fr and 0.936 with y/d50; Δ/λ at 0.936 with Fr, 0.919 with y/d50, and 0.97 with Δ/d50; and Δ/y at 0.876 with Fr, 0.938 with y/d50, 0.94 with Δ/d50, and 0.962 with Δ/λ. The target ‘n’ exhibits a negative correlation with Sf (−0.281) and strong positive correlations with Fr (0.934), y/d50 (0.955), Δ/d50 (0.973), Δ/λ (0.978), and Δ/y (0.974). These coefficients highlight the strength and direction of the relationships between the variables, providing valuable insights into their interdependencies.
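Each entry in the correlation matrix is a Pearson product-moment coefficient; a minimal implementation (the sample columns are illustrative only):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r_pos = pearson_r([1, 2, 3], [2, 4, 6])    # perfect positive relationship
r_neg = pearson_r([1, 2, 3], [6, 4, 2])    # perfect negative relationship
```

The coefficient is always between −1 and 1, and any variable against itself yields exactly 1, which is why the matrix diagonal in the heatmap is uniformly unity.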
RESULTS AND DISCUSSION
Heatmap visualization
In this study, various soft computing models, namely LR, ETR, RF, and XGB, were evaluated to assess their effectiveness in estimating ‘n’ in alluvial channels with bedforms. The initial step of the soft computing modelling process involved data preparation, which included dividing the dataset into training and testing sets. The allocation of percentages for these sets was determined by trial and error, taking into account previous research studies (Bassi et al. 2023a; Wadhawan et al. 2023). In this study, 80% of the data were allocated for training (calibration), while the remaining 20% were used for testing (verification). It is important to note that the statistical characteristics of the dimensionless parameters in both the training and testing datasets exhibited similar patterns, ensuring the representativeness and suitability of the datasets for model evaluation and comparison.
Parametric analysis
Scatter plots
Box plot
Figure 4 presents a boxplot illustrating the distribution of riverbed roughness values and their corresponding predictions generated by the ML models. The inclusion of two median lines, one for the input roughness values and another for the predicted values, highlights the central tendencies of both datasets. The absence of outliers in the plot indicates a high level of consistency between predicted and actual values. Whiskers are drawn at 1.5 times the interquartile range (IQR) to identify potential outliers; this gives a balanced representation of the data's central tendency and dispersion, capturing the majority of the distribution while accounting for extreme values.
Furthermore, the boxplot is drawn up to the 75th percentile because this emphasizes the main body of the data distribution while minimizing the impact of potential outliers beyond the whiskers. This concentration on the lower and upper quartiles allows a clearer visualization of the central tendencies and spreads in both the input and predicted riverbed roughness values. The careful design of the boxplot, with whiskers at 1.5 times the IQR and the focus on the 75th percentile, contributes to a refined understanding of the data distribution and the reliability of the predictions.
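The whisker construction described above can be sketched as follows, assuming quartiles from the standard library's `statistics.quantiles` (exclusive method); the sample data are illustrative:

```python
from statistics import quantiles

def box_stats(data):
    """Quartiles, 1.5*IQR whisker fences, and any points falling outside them."""
    q1, median, q3 = quantiles(data, n=4)   # exclusive method by default
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < lower or v > upper]
    return {"Q1": q1, "median": median, "Q3": q3,
            "fences": (lower, upper), "outliers": outliers}

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Any point beyond the fences is flagged as a potential outlier; in Figure 4 no roughness value falls outside them, which is the consistency the text refers to.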
Histogram
As per Figure 5, the data points in the histograms vary across categories, with some bins having higher counts and others lower. The histograms represent the distribution of the six parameters central to the present study: (a) Sf represents the slope of the energy grade line in a hydraulic system; it is crucial in fluid dynamics and hydraulic engineering because it helps determine the energy losses and head distribution in open channels, and the range of its values in the dataset can be visualized; (b) Fr values help classify the flow regimes within the data, showing whether most of the data points fall within the subcritical, critical, or supercritical flow categories; (c) y/d50 values characterize the relationship between flow depth and sediment size, which is significant for indicating the flow's capability of entraining sediment particles (y/d50 > 1 versus y/d50 < 1) and provides valuable information for sediment transport studies; (d) Δ/d50 values illustrate the range of bedform heights relative to sediment size, representing the dominance of bedforms over the bed and the mixing of different bedform sizes, which helps in the characterization of bedforms; (e) Δ/λ depicts the distribution of bedform shapes, indicating that bedforms have a specific wavelength relative to their height; and (f) Δ/y values show that bedform heights are relatively small compared with the flow depth yet remain significant features within the alluvial channel bed.
Relationship between different errors and ML models
Table 5 provides a comprehensive overview of the R2 values obtained from the four regression models (LR, ETR, RF, and XGB) across the six input cases (C1–C6). The R2 values indicate accuracy and range from close to 1 (good performance) down to around 0.1. The purpose of the table is to allow a comparative analysis of each model's performance across the different scenarios.
| Models | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| LR | 0.822368321 | 0.822368321 | 0.822368321 | 0.109620603 | 0.109620603 | 0.822368321 |
| ETR | 0.99928393 | 0.999230522 | 0.999248134 | 0.99928393 | 0.9866761 | 0.99928393 |
| RF | 0.999544496 | 0.999559321 | 0.999516037 | 0.999544496 | 0.997588874 | 0.999544496 |
| XGB | 0.997261388 | 0.997261388 | 0.997261388 | 0.997195934 | 0.990703555 | 0.997261388 |
Table 6 presents RMSE values for various regression algorithms across six scenarios. The RMSE values represent the average magnitude of prediction errors, with lower values indicating better model performance. It provides a concise comparison of algorithmic performance in terms of prediction accuracy across different scenarios.
| Algorithms | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| LR | 0.049011 | 0.049011 | 4.90E−02 | 0.073335 | 0.073335 | 0.073335 |
| ETR | 0.000153 | 0.000158 | 1.56E−04 | 1.53E−04 | 0.000213 | 0.000658 |
| RF | 0.000121 | 0.000119 | 1.12E−02 | 1.21E−04 | 0.011718 | 0.016715 |
| XGB | 0.017256 | 0.017256 | 1.73E−02 | 0.017358 | 0.018514 | 0.023423 |
Table 7 displays MSE values for the four regression algorithms across the six cases. The MSE values, given in scientific notation, indicate the average of the squared prediction errors, with smaller values suggesting higher accuracy. For instance, LR exhibits consistently low MSE values, ranging from 5.77 × 10−6 to 2.89 × 10−5 across the scenarios. ETR, RF, and XGB demonstrate still smaller MSE values, reflecting the high accuracy of these models in capturing the target variable across diverse situations. The table thus provides a concise summary of the models' performance in minimizing squared prediction errors across various scenarios.
| Algorithms | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| LR | 2.89E−05 | 5.77E−06 | 5.77E−06 | 5.77E−06 | 2.89E−05 | 2.89E−05 |
| ETR | 4.33E−07 | 2.33E−08 | 2.50E−08 | 2.44E−08 | 2.33E−08 | 4.54E−08 |
| RF | 7.81E−08 | 1.47E−08 | 1.43E−08 | 1.57E−08 | 1.47E−08 | 1.89E−08 |
| XGB | 3.01E−07 | 8.87E−08 | 8.87E−08 | 8.87E−08 | 9.08E−08 | 1.17E−07 |
Table 8 presents MAE values for the four regression algorithms across the six scenarios. The MAE values indicate the average absolute difference between predicted and actual values. LR exhibits MAE values ranging from 0.001944 to 0.00474 across the scenarios, reflecting its performance in minimizing prediction errors, while ETR, RF, and XGB show markedly lower values, with ETR consistently small, emphasizing its accuracy in capturing the variations of the target variable. The table provides a concise overview of each algorithm's performance across diverse scenarios, facilitating a comparative analysis of predictive capabilities in terms of minimizing absolute errors.
| Algorithms | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| LR | 0.001944 | 0.001944 | 1.94E−03 | 0.00474 | 0.00474 | 0.00474 |
| ETR | 4.00E−05 | 4.19E−05 | 4.33E−05 | 4.00E−05 | 6.86E−05 | 3.94E−04 |
| RF | 0.000149 | 0.000149 | 1.49E−04 | 0.000451 | 0.000451 | 0.000451 |
| XGB | 0.000281 | 0.000226 | 2.87E−04 | 0.000364 | 0.000487 | 0.000888 |
Table 9 lists RRSE values, as percentages, for the four algorithms across the six scenarios. The RRSE measures the prediction error relative to the overall variability of the observed data. LR exhibits RRSE values ranging from 13.300 to 29.878%, indicating a comparatively large relative error across the scenarios, whereas ETR, RF, and XGB show far lower values, with RF consistently demonstrating the smallest percentage errors. Table 9 thus offers a concise summary of the algorithms' relative accuracy, allowing a quick comparison of predictive capability across diverse scenarios.
| Algorithms | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| LR | 13.345% | 13.345% | 13.300% | 29.878% | 29.878% | 29.878% |
| ETR | 0.847% | 0.878% | 0.868% | 0.847% | 1.184% | 3.655% |
| RF | 0.639% | 0.629% | 0.659% | 0.639% | 0.723% | 1.471% |
| XGB | 1.567% | 1.567% | 1.570% | 1.586% | 1.804% | 2.888% |
Table 10 shows RAE values for the four regression algorithms in the six cases, with lower values indicating better performance. LR consistently shows the highest values, suggesting less accurate predictions, while ETR, XGB, and RF exhibit much lower RAE values, indicative of better predictive accuracy. Notably, ETR demonstrates the smallest values in most scenarios, with RF attaining the lowest RAE in C5 and C6. The table allows a quick comparison of the algorithms' effectiveness in capturing the underlying patterns in the data across diverse scenarios.
| Algorithms | C1 | C2 | C3 | C4 | C5 | C6 |
|---|---|---|---|---|---|---|
| LR | 0.107978 | 0.107978 | 1.08E−01 | 0.263359 | 0.263359 | 0.26335906 |
| ETR | 0.002222 | 0.002326 | 2.40E−03 | 2.22E−03 | 0.003811 | 0.02189922 |
| XGB | 0.009764 | 0.009764 | 9.76E−03 | 0.009093 | 0.011017 | 0.0202274 |
| RF | 0.002421 | 0.002353 | 2.56E−03 | 2.42E−03 | 0.003084 | 0.00788005 |
Based on the data in Figure 6(a), R2 values are depicted for the different models across the six cases (C1–C6). R2 represents the proportion of the dependent variable's variance explained by the independent variables in a regression model, with higher values indicating better fits. For instance, the LR model in case C4 has a low R2 value of 0.109, suggesting a weak relationship. Figure 6(b) displays the distribution of RMSE values for the different models across the six cases; RMSE represents the average magnitude of prediction errors, with smaller values indicating better model performance, and the LR model in case C4, for example, shows a relatively high RMSE of 0.073, suggesting larger prediction errors. Similarly, Figure 6(c) presents the MSE values across the six cases; the LR model in case C1 has an MSE of 2.89E−05, the largest among the models considered.
Figure 6(d) displays the MAE values for the different algorithms across the six cases. MAE represents the average absolute difference between predicted and actual values, reflecting prediction accuracy. In case C1, LR exhibits an MAE of 0.001944, while the ensemble models remain well below this; across the cases, the MAE values of ETR, RF, and XGB stay small, indicating accurate predictions. Figure 6(e) shows the RRSE values, which express prediction error relative to the variability of the output parameter; for LR these range from 13.30 to 29.878% across the cases, while ETR, RF, and XGB attain much lower values, reflecting their accuracy in predicting the output parameter. Finally, Figure 6(f) presents the RAE values, which relate the absolute prediction errors to a mean-only baseline; LR exhibits higher values, ranging from 0.107 to 0.263, whereas ETR, RF, and XGB again show much lower RAE, representing their superior accuracy in predicting the output parameter.
Relationship between actual and predicted values of ‘n’ using different models
In the current study, the relationship between predicted and experimental values of the parameter ‘n’ obtained with the various ML models has been examined. The observations reveal distinct patterns in the performance of these models. The predictions of ‘n’ deviate noticeably from the best-fit line when the LR model is considered, implying substantial variation between the predicted values and the actual experimental values; the R2 value for this model is well below 1, a strong indication of this deviation. By contrast, RF, ETR, and XGB produce predictions of ‘n’ that closely align with the experimental values and exhibit high R2 values (approaching 1). This indicates that these models effectively capture the intricate relationship between the predictor and target parameters, and these findings are consistent with prior research (Azamathulla et al. 2013, 2016; Kitsikoudis et al. 2015; Baharvand et al. 2021; Roushangar & Shahnazi 2021).
Sensitivity analysis
Taylor's diagram
Insights based on ML approach
The selected ensemble models, RF and XGB, demonstrated exceptional performance due to their ensemble nature: they capture complex relationships within the data by combining multiple weaker learners into a stronger one. This ensemble approach is particularly important when dealing with complex hydraulic phenomena, as it allows the models to collectively consider a wide range of factors and interactions, resulting in more accurate predictions of ‘n’. Moreover, RF and XGB provide valuable insights through feature importance scores, which highlight the relative importance of the input parameters in predicting ‘n’ and allow attention to be focused on the variables with the most significant impact. Identifying such influential factors yields a deeper understanding of the underlying mechanisms driving variations in ‘n’, which is crucial for both prediction accuracy and physical interpretation. The ML models utilized in the present study are inherently suited to handling the non-linear relationships that often exist in hydraulic data, and their performance can be attributed to this capacity to adapt to complex relationships within the data, ultimately leading to more reliable predictions.
However, the model LR is well suited to mitigating overfitting, a common challenge in predictive modeling. Its L1 regularization reduces model complexity and emphasizes the most relevant features; the success of LR in the present study can be attributed to its ability to select important input parameters while avoiding unnecessary complexity, and this feature selection enhances the model's generalization to new data. The model XGB, in turn, builds its ensemble sequentially, with each new tree fitted to the residual errors of the current ensemble, and it therefore performs well when systematic patterns remain in the residuals; in the present analysis, its effectiveness can be interpreted through this capacity to identify and correct such patterns iteratively, providing accurate predictions even where local patterns strongly influence ‘n’. In addition, the success of the selected ML models in predicting ‘n’ can be attributed to their ability to handle complexity, capture non-linear relationships, adapt to data patterns, mitigate overfitting, and leverage effective data preprocessing. These models, along with sensitivity analysis, provide valuable insight into the underlying mechanisms governing roughness predictions and advance our understanding of hydraulic processes in alluvial channels with bedforms.
Several of the ML models, such as RF and ETR, leveraged ensemble techniques to enhance predictive performance: these methods aggregate predictions from multiple trees, reducing overfitting and improving robustness. The model XGB excelled through gradient-based optimization of its ensemble, which was particularly effective where local patterns were significant. In contrast, LR showed only moderate efficiency in predicting ‘n’ owing to its limitations in capturing the dataset's intricacies and non-linearity; it assumes a linear relationship that may not represent complex patterns adequately, and while its L1 regularization curbs overfitting, aggressive feature selection can discard vital information. This underlines the importance of employing advanced techniques such as ensemble methods and non-linear regression for datasets with diverse and intricate roughness coefficient patterns.
Comparison of results with previous studies
In a previous study, Roushangar et al. (2018) used different models (FFNN, RBFNN, and ANFIS) to estimate ‘n’ in rivers with dunes and observed R2 ≤ 0.9 during the verification stage. In another study, Saghebian et al. (2020) reported R2 = 0.56 and RMSE = 0.0034 for the best scenario when predicting ‘n’ using MLP-FFA and FFNN algorithms. In addition, a study carried out by Yarahmadi et al. (2023) reported R2 values for the multilayer perceptron neural network (MLPNN), group method of data handling (GMDH), support vector machine (SVM), and GP models of 0.982, 0.979, 0.999, and 0.926, respectively, during the verification phase, with corresponding RMSE values of 0.0006, 0.0006, 0.0000, and 0.0013. A closer analysis of the statistical error indices, specifically R2 and RMSE, when comparing the ML models in this study with those established by previous researchers (Roushangar et al. 2018; Saghebian et al. 2020; Yarahmadi et al. 2023), reveals a clear trend of significantly improved accuracy. The ML models in this study outperformed the previously utilized soft computing models in their ability to accurately predict and represent the roughness coefficients in alluvial channels with bedforms. This enhanced accuracy is crucial for improving our understanding of flow dynamics in these complex environments and has the potential to contribute significantly to river management and engineering practices.
K-fold cross-validation
K-fold cross-validation is a fundamental technique in ML modeling used to assess and validate the performance of predictive models. It serves as a robust method for estimating the ability of ML models to generalize to new, unseen data. The process involves partitioning the dataset into K equally sized subsets, or folds. The model is trained and evaluated K times, each time using a different fold as the validation set while the remaining K−1 folds are used for training, so that every data point is used for validation exactly once. The results of these K iterations are typically averaged to provide a more stable and representative evaluation of the model's performance. K-fold cross-validation helps detect issues such as overfitting, where a model performs exceptionally well on the training data but poorly on new data, and provides a more accurate estimate of model performance.
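The partitioning logic can be sketched as follows (index generation only; model fitting and scoring are omitted):

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists; each sample validates exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(215, 10))   # 10 folds over the 215 data points
```

With 215 samples and K = 10, the validation folds hold 21 or 22 samples each, and averaging the per-fold RMSE and R2 gives the stable estimates reported in Table 11.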
In the comprehensive cross-validation analysis, the performance of the four regression models (LR, ETR, RF, and XGB) was evaluated. Across all folds, the results consistently showcase the exceptional predictive capabilities of these models. The RMSE values are uniformly small, indicating that the models make precise predictions that closely align with the actual values. Equally impressive are the consistently high R2 values, which approach 1, signifying that the models effectively capture and explain a substantial portion of the variance in the dependent variable. While minor fluctuations in performance are observed between folds, the values emphasize the robust and reliable predictive power of these ML techniques. The choice among the models may hinge on considerations such as model complexity and interpretability; collectively, however, these results confirm their capability to elucidate the underlying patterns within the dataset. The K-fold cross-validation results for all the models are provided in Table 11.
Table 11 | K-fold cross-validation results (RMSE and R2 score per fold)

| Fold | XGB RMSE | XGB R2 | RF RMSE | RF R2 | LR RMSE | LR R2 | ETR RMSE | ETR R2 |
|------|----------|--------|---------|-------|---------|-------|----------|--------|
| 1 | 0.0003 | 0.9974 | 0.0002 | 0.9987 | 0.0004 | 0.9944 | 0.0002 | 0.9988 |
| 2 | 0.0002 | 0.9985 | 0.0000 | 0.9999 | 0.0003 | 0.9967 | 0.0000 | 1.0000 |
| 3 | 0.0002 | 0.9987 | 0.0002 | 0.9983 | 0.0003 | 0.9961 | 0.0001 | 0.9998 |
| 4 | 0.0006 | 0.9885 | 0.0004 | 0.9959 | 0.0004 | 0.9957 | 0.0003 | 0.9964 |
| 5 | 0.0004 | 0.9950 | 0.0003 | 0.9977 | 0.0004 | 0.9942 | 0.0002 | 0.9990 |
| 6 | 0.0011 | 0.9607 | 0.0006 | 0.9892 | 0.0004 | 0.9940 | 0.0006 | 0.9868 |
| 7 | 0.0004 | 0.9952 | 0.0002 | 0.9987 | 0.0007 | 0.9883 | 0.0002 | 0.9985 |
| 8 | 0.0004 | 0.9968 | 0.0003 | 0.9981 | 0.0004 | 0.9965 | 0.0003 | 0.9980 |
| 9 | 0.0003 | 0.9960 | 0.0002 | 0.9982 | 0.0004 | 0.9932 | 0.0003 | 0.9972 |
| 10 | 0.0007 | 0.9839 | 0.0005 | 0.9932 | 0.0005 | 0.9931 | 0.0004 | 0.9950 |
CONCLUSION
In this study, four ML models, namely LR, ETR, RF, and XGB, were utilized to predict the parameter ‘n’ in alluvial channels while considering the resistance due to bedforms. The prediction of ‘n’ was based on six input parameters: Sf, Fr, y/d50, Δ/d50, Δ/λ, and Δ/y. The values of R2 were used to assess the performance of the models in capturing the relationship between the input parameters and the output parameter ‘n’. The correlation matrix revealed significant relationships between the input parameters. The main conclusions drawn from the study are listed below:
The parameter Sf showed a negative correlation coefficient of −0.281, indicating an inverse relationship with ‘n’. The input parameter Fr exhibited a positive correlation coefficient of 0.934, indicating a strong positive relationship, as also depicted in the heatmap. Similarly, y/d50, Δ/d50, Δ/λ, and Δ/y demonstrated positive correlations of 0.882, 0.961, 0.936, and 0.876, respectively, suggesting their significant influence on ‘n’.
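Pairwise Pearson correlations of this kind can be reproduced with a short pandas sketch. The data below are random placeholders and the column names are merely ASCII stand-ins for the study's six inputs and target ‘n’, so the resulting coefficients will not match the values reported above.

```python
# Illustrative Pearson correlation sketch (random placeholder data, not the study's measurements).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical columns standing in for Sf, Fr, y/d50, Δ/d50, Δ/λ, Δ/y, and the target 'n'.
df = pd.DataFrame(
    rng.normal(size=(100, 7)),
    columns=["Sf", "Fr", "y_d50", "D_d50", "D_lambda", "D_y", "n"],
)

# Pearson correlation of each input parameter with 'n', as in the correlation matrix/heatmap.
corr_with_n = df.corr(method="pearson")["n"].drop("n")
print(corr_with_n)
```

A negative coefficient (as for Sf) indicates an inverse relationship with ‘n’, while coefficients near +1 (as for Fr or Δ/d50) indicate a strong direct relationship.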
In the specific input scenario C1, all the parameters (Sf, Fr, y/d50, Δ/d50, Δ/λ, and Δ/y) were found to be essential for predicting the parameter ‘n’ in the alluvial channel with bedforms. Moreover, Taylor's diagram showed that ETR, RF, and XGB models have high accuracy and strong correlation coefficients, indicating reliable predictions with minimal errors.
The sensitivity analysis highlighted that all the parameters were important for the LR model, whereas Sf had the most significant impact, relative to the other parameters, for the ETR model in predicting roughness in alluvial channels. These findings emphasize the importance of considering all these parameters when predicting and analyzing the behaviour of ‘n’ in alluvial channels.
The R2 values achieved by the ML models varied among the different input features. The RF, XGB, and ETR models outperformed the others, with low error values and R2 values very close to one. Specifically, the RF, XGB, and ETR models demonstrated R2 values ranging from 0.997 to 0.999, suggesting their effectiveness in predicting ‘n’. In contrast, the LR model showed an R2 value of 0.82, indicating only a moderately strong relationship between the input parameters and ‘n’.
K-fold cross-validation was carried out to assess the predictive performance of the ML models. The results confirmed the reliability of the models and their ability to generalize to new data, supporting the validity of the findings.
This study highlights the successful application of various ML models in predicting the value of ‘n’ in alluvial channels based on the given input parameters. The models demonstrated strong relationships and high accuracy, as evidenced by the high R2 values. The identified correlations between the input parameters and ‘n’ provide insights into the underlying relationships and can be valuable for understanding and managing alluvial channel dynamics. It is essential to note that a potential limitation of the present study is the absence of actual field data, since the current predictions were carried out on an experimental dataset. Further studies incorporating actual field data, hydrodynamic simulations, and computational fluid dynamics modelling alongside ML algorithms could provide a more in-depth understanding of flow resistance. In addition, the methodology employed in this study exhibits robust performance in predicting riverbed characteristics when applied to experimental datasets characterized by noticeable patterns and clear relationships. However, its sensitivity to datasets with inherent randomness, where the absence of distinct patterns can lead to challenges and prediction failures, highlights a limitation with respect to field data. In scenarios where the input data lack a clear structure or exhibit significant random variations, the ML models may struggle, resulting in prediction failures. Notably, when attempting to validate the models with real-world field data, these limitations become more pronounced, and the performance of the ML models can be reduced significantly. This emphasizes the need for caution when applying the method to datasets with inherent randomness, especially when extrapolating findings to real field scenarios.
The models' limited performance on field data underscores the challenges of adapting the methodology to the unpredictable variations in riverbed characteristics encountered under actual field conditions. This indicates the need for further studies that account for the complexities of field data for the adequate prediction of roughness in natural streams.
ACKNOWLEDGEMENTS
The authors would like to thank their fellow researchers for providing valuable insights that helped to enhance the quality of the data presented in the manuscript.
ROLE OF FUNDING
The authors are grateful to the Core Research Grant, SERB Government of India (CRG/2021/002119), for their generous financial assistance, which made it possible to conduct the research presented in this paper.
AUTHOR CONTRIBUTIONS
A. A. M. wrote the original draft, conceptualized the whole article, developed the methodology, and reviewed and edited the article, and M. P. rendered support in data curation, prepared the article, visualized the data, investigated and supervised the work.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.