Abstract
A fundamental issue in the hydraulics of movable bed channels is the determination of the friction factor (λ), which represents the head loss due to hydraulic resistance. Laboratory experiments are time-consuming, which makes it difficult to obtain λ within a short period. Traditional forecasting approaches are further limited by their subjective nature and their reliance on various assumptions. Advanced machine learning (ML) and artificial intelligence approaches can therefore be used to ease this tedious task. Here, eight ML techniques are employed to predict λ from eight input features. Model performance is compared using several error metrics, and graphical inferences are drawn from heatmap data visualization, a Taylor diagram, sensitivity analysis, and parametric analysis under different input scenarios (ISs). K Star in IS1 gave the highest correlation coefficient (R²) of 0.9716, followed by M5 Prime in IS2 (0.9712) and Random Forest in IS4 (0.9603); these models predicted λ with the smallest errors among the ML models considered.
HIGHLIGHTS
The study employs ML algorithms to accurately predict friction factor (λ) in movable bed channels, comparing eight ML techniques.
The K Star model in input scenario 1 achieves the highest correlation coefficient (R²) value of 0.9716 for predicting λ.
The research findings guide engineers in selecting appropriate input variables and ML models to predict λ accurately.
INTRODUCTION
The friction factor (λ) plays a crucial role in hydraulic engineering. In alluvial river channels it is influenced by hydraulic and morphological factors. Determining λ is both important and challenging because the flow boundary in such channels is not stable and varies continuously with time, which produces complex interactions between the flow and the channel bed. An accurate estimate of λ is therefore vital for resolving numerous practical problems across engineering specialties. In open channels, λ can be determined from laboratory and field measurements and is typically influenced by the type of roughness and the Reynolds number.
The changes in size and shape of riverbeds affect the variations in λ of natural alluvial channels (Robert 1990). However, it may not be possible to have an idea about how bed configurations change with different flow conditions. Furthermore, the formation and vanishing of the bedforms alters the flow velocity and flow resistance (Patel & Kumar 2017). In a previous study, it was found that the variation of resistance coefficient was in accordance with bedforms using experimental methodology. The estimation of λ is significant as it is an essential component of both the simulation used for the evaluation of riverbeds and the real-time flow prediction. In alluvial channels, the resistance to flow is complicated in nature due to a large number of attributes (Simons & Richardson 1966). The viscosity of the fluid, the size of the channel, the roughness of the inside surface of the channel, the variations in elevations within the system, and the fluid's travel distance are some essential parameters required to determine the head loss in natural channels. The majority of the nonlinear formulas, which currently exist to describe the λ of mobile channels, are based on dimensional analysis and statistical data fitting to parameters that are implicitly taken into account in functional relationships.
The hydraulics of movable bed channels, hydraulic resistance, and λ have long been of significant importance (Patel et al. 2016). Multiple approaches, including analytical and numerical methods, have been used to characterize vegetation-induced roughness (Baptist et al. 2010). In a novel application, genetic programming was used to derive roughness expressions from synthetic data, with validation against flume experiments (Babovic & Keijzer 2000). Predicting bed resistance and accurately assessing the mean flow parameters in alluvial channels is difficult because the channel bed itself varies. Data mining approaches can help to avoid these complexities, as they are able to assess the data adequately and can provide better predictions. Moreover, determining λ in a movable bed channel through laboratory testing and experimentation requires significant effort and time. To ease this burden, numerous studies have predicted λ for movable bed channels using one or two ML models, such as artificial neural networks (ANNs) or gene programming.
Earlier, Azamathulla (2013) used a model to predict the friction coefficient in natural channels. Although the proposed model performed well in that study, its results were not compared or cross-validated against other available models. In addition, several studies (Roushangar et al. 2018; Li et al. 2019; Milukow et al. 2019) determined λ using a limited number of ML models and reported inconsistent results, so their predictions of λ were not satisfactorily validated. Previous studies also did not examine the effect of the different parameter scenarios that should be considered when training and testing the models (Nitsche et al. 2011; Shaghaghi et al. 2018; Khosravi et al. 2020). Selecting the optimal ISs is essential because it identifies the data that are actually required for the analysis, which increases efficiency and saves time and effort.
Furthermore, ML techniques have been used to improve lumped groundwater level prediction at different catchment scales (Cai et al. 2022). In that study, different data-driven methods were explored as alternatives to groundwater models, and the hydrological awareness of deep learning (DL) algorithms for groundwater level simulation was improved. The superior performance of the models with physical constraints increased the reliability of data-driven approaches in groundwater modelling. In another study, interpretive DL was applied to representative catchments to examine flood mechanisms across the contiguous United States and so gain a better understanding of floods that could occur in these regions in future (Jiang et al. 2022). Hydrological knowledge has also been integrated with ML techniques such as genetic programming in MIKA-SHA to interpret distributed rainfall-runoff models (Chadalawada et al. 2020; Herath et al. 2021). The approach used in these studies captured spatial variabilities without explicit user selections, enabling the induction of semi-distributed models.
Furthermore, Wang et al. (2020) carried out a study on the firefly algorithm (FA), which is based on the flashing patterns and behaviour of fireflies. They proposed the Yin-Yang Firefly Algorithm (YYFA) to address the limitations of the FA in exploration. The modifications included a Cauchy mutation, to achieve a better balance among functions, and a good nodes set, to improve the spatial representativeness of the firefly population. These modifications enhanced the efficiency of the algorithm by allowing more robust optimization in various applications. A novel study was conducted on a partition cum unification-based genetic FA that combines the benefits of the FA and the genetic algorithm for optimization problems (Gupta et al. 2021). The results demonstrated that the new algorithm outperforms other models by providing the best objective function values and significantly faster convergence, making it a highly efficient and effective optimization technique.
Another study focused on predicting scour depth downstream of a ski-jump spillway to ensure dam safety (Sammen et al. 2020). In that study, a hybrid model based on the ANN was used to improve prediction accuracy, and its performance was compared with that of other hybrid models. A similar study considered optimization techniques that maximize or minimize functions to achieve optimal results in various domains (Devi et al. 2022). A new variant of the Runge-Kutta Optimization (RKO) algorithm, termed Improved Runge-Kutta Optimization (IRKO), was introduced to enhance the diversification and intensification capabilities of the basic RKO version. IRKO showed improved performance on standard benchmark functions and on engineering-constrained optimization problems. Moreover, IRKO exhibited an efficient run time, taking less than 0.5 s for most of the benchmark problems, and excelled in real-world optimization scenarios.
As optimization problems become more complex, efficient and innovative techniques are needed. In response, a recent study examined bio-inspired meta-heuristic algorithms (Ghasemi et al. 2022). A biologically based optimization algorithm, circulatory system-based optimization (CSBO), was applied to a wide range of complex real-world functions and compared with standard meta-heuristic ML algorithms. CSBO reached optimal solutions and effectively avoided local optima, making it a promising and reliable optimization approach.
Many researchers have used different ML and hybrid models to predict various parameters, indicating the importance of these techniques in real-world scenarios (Babovic & Keijzer 2000; Chadalawada et al. 2020; Cai et al. 2022; Jiang et al. 2022; Bassi et al. 2023; Singh & Patel 2023; Wadhawan et al. 2023). In addition, suspended sediment load (SSL) was estimated using intrinsic time-scale decomposition (ITD) and two data-driven techniques (DDTs), evolutionary polynomial regression (EPR) and model tree (MT), at the Sarighamish and Varand stations in Iran (Zhao et al. 2021). The analysis demonstrated that ITD-EPR gave the best prediction accuracy at both stations compared with standalone MT. The results also highlighted the superiority of ITD-EPR in predicting SSL, outperforming conventional methods and providing valuable insights for water resources management and the design of hydraulic structures.
Previous studies employed traditional forecasting approaches, which are error-prone and rest on a plethora of assumptions, resulting in subjective conclusions over time (Clifford et al. 1992; Rasmussen 2004; Tang & Wang 2009; Azamathulla et al. 2010; Harish et al. 2015; Safari et al. 2016). In addition, some of these strategies did not perform efficiently with limited or historical data. In light of these issues, ML is being used for forecasting because it can provide more accurate predictions with a minimal loss function. This method is more scientific in nature and focuses on the result or outcome rather than on hidden correlations between factors. It is particularly suitable when the goal is to examine datasets with a large number of features, and it can handle enormous amounts of data. Therefore, to overcome the limitations of previous studies and bridge the research gap, cutting-edge computing and digital technologies such as artificial intelligence (AI) can be applied. For instance, new modelling paradigms and data mining techniques can be employed together for the complicated problem of predicting mobile bed friction. This has opened up new modelling opportunities for processes where the current level of knowledge hinders the inclusion of pertinent data in a mathematical framework. Within AI, ML is a significant technological advancement because of its capability to learn from data, and it has already enhanced daily life even in its early-stage applications. Incorporating ML models makes it possible to reach the desired accuracy, which in turn leads to better predictions.
The main objective of the present study is to predict λ in mobile bed channels using various ML models. Eight ML models are used: Linear Regression (LR), Gaussian processes (GPs), Multilayer perceptron (MLP), K Star (K*), Additive Regression (AR), M5 Prime (M5P), Random Forest (RF), and Support Vector Machine (SVM). Including a significant number of ML models allows the most suitable ones to be identified from their predictive performance while addressing the shortcomings of previously available models. These models are employed because they can provide comprehensive simulations, considering their performance and appropriateness for specific tasks or datasets. In addition, no previous study has used the full set of input parameters considered here: kinematic viscosity of the fluid (v), mean size of the sediment particles (d), gradation coefficient (σ), specific gravity (G), gravitational acceleration (g), bed slope (So), mean flow velocity (u), flow depth (df), and width of the channel bed (b). Moreover, seven ISs are considered in order to determine the optimal combination of input parameters for the ML models. The data used in this study were collected from previous laboratory and experimental studies and divided into training and testing sets to support the training and validation of the models. To evaluate and compare the models, a range of error metrics is examined. Graphical analyses, including heatmap data visualization, histograms, scatter plots, Taylor diagrams, sensitivity analysis, and parametric analysis, are conducted across the different ISs. Finally, the advantages and disadvantages of the current approach are assessed while evaluating the effectiveness of ML models in predicting λ for mobile bed channels.
Background and context
At present, the digital sphere holds a great pool of data, and computing power continues to advance. AI and ML have therefore become necessary to analyze data and to develop corresponding intelligent and automated applications. There is a great deal of uncertainty about the nature of any dataset and problem domain, and AI technologies are the key to unravelling it. This section gives a concise description of the ML approaches used in this study.
Linear regression
M5 Prime
This equation is used to fit a linear model to the data. Y is the dependent variable, X1–Xk are the independent variables, and β0–βk are the coefficients of the linear model.
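A form consistent with these definitions, assuming the standard multiple linear model, would be:

```latex
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
\]
```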
Random Forest
In this equation, MSE is the mean squared error, N is the number of data points, fa is the value returned by the model, and pa is the true value for data point a.
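With the symbols defined above, and assuming the usual definition of the mean squared error, the expression can be written as:

```latex
\[
\mathrm{MSE} = \frac{1}{N} \sum_{a=1}^{N} \left( f_a - p_a \right)^2
\]
```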
K Star
In this equation, R is the predicted value of the new instance, and Ri is the output value of the ith nearest neighbour of the new instance. The mean function returns the average of the output values among the K nearest neighbours.
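Following this description, and writing K for the number of nearest neighbours, the averaging rule can be sketched as:

```latex
\[
R = \operatorname{mean}\left( R_1, R_2, \ldots, R_K \right) = \frac{1}{K} \sum_{i=1}^{K} R_i
\]
```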
Additive Regression
Numerous methods, including nonparametric regression, splines, and kernel methods, can be used to estimate the smooth functions. Iterative algorithms like back fitting and boosting can also be used to fit the model. High-dimensional data and complex interactions between the variables can be handled by the model. However, for better estimation, the model can require a lot of data that can be computationally expensive.
Gaussian Processes
GPs are a type of probabilistic model used in ML for regression and classification tasks. They do not need the explicit specification of a structural pattern for the relationship between inputs and outputs, in contrast to other ML techniques. Instead, they represent the underlying function as a distribution over functions, with the mean distribution function and covariance function acting as its defining characteristics. The covariance function describes how much the function values at various input locations are associated, whereas the mean function represents the predicted value of the function at each input point. The complexity and smoothness of the modelled function are determined by the selected covariance functions (Wang et al. 2021).
In this equation, f(x) is a random function that maps input x to output y, m(x) is the mean function that represents the prior belief about the function, k (x, x′) is the covariance function that measures the similarity between input x and x′. The covariance function is also called the kernel function in ML. The GP algorithm assumes that the distribution of the function f(x) is a multivariate Gaussian distribution, which is fully specified by its mean and covariance functions.
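In the usual notation for Gaussian processes, which is assumed here to match the symbols above, the model and its two defining functions are:

```latex
\[
f(x) \sim \mathcal{GP}\left( m(x),\, k(x, x') \right)
\]
\[
m(x) = \mathbb{E}\left[ f(x) \right], \qquad
k(x, x') = \mathbb{E}\left[ \left( f(x) - m(x) \right) \left( f(x') - m(x') \right) \right]
\]
```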
Multilayer perceptron
A popular ANN for classification and regression applications in ML is the MLP. The MLP is made up of numerous layers of interconnected nodes, or neurons, where each neuron takes inputs from a lower layer, transforms those inputs nonlinearly, and then sends the output to a higher layer. The input layer, which receives the data's characteristics, is the top layer of the MLP. The output layer, the bottom layer, is where the final prediction is generated. One or more hidden layers may exist between, each of which has a group of neurons that applies a nonlinear change to the inputs. Typically, the MLP's neurons are set up in a feedforward manner, which means that inputs flow directly from the input layer to the output layer without any need for feedback connections. In order to reduce the error between the expected output and the actual output, the weights of the connections between the neurons are changed during training (Tang et al. 2015; Pham et al. 2017; Nosratabadi et al. 2021; Sharma et al. 2022).
In this equation, x is the input vector, w is the weight vector that connects the input x to the neuron, b is the bias term that shifts the activation function, z is the linear combination of the input and weights, f(z) is the activation function that maps the output to a nonlinear space, and y is the output of the neuron.
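For a single neuron, a form consistent with these definitions (assuming the standard weighted-sum formulation) would be:

```latex
\[
z = w^{\top} x + b, \qquad y = f(z)
\]
```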
Support vector machine
The ML approach known as SVMs can be applied to both classification and regression applications. Finding the hyperplane that best divides the data into distinct classes or predicts a continuous target variable is the basic goal of SVMs. The margin (separation between the hyperplane and the nearest data points from each class) is maximized by selecting the hyperplane. It is simpler to locate a hyperplane that can divide the data in SVMs since the data points are mapped into a higher-dimensional feature space. The performance of SVMs depends on the kernel, or mapping function, that is selected. Radial basis function (RBF), linear, and polynomial are a few of the frequently utilized kernel functions.
In this equation, x is the input vector, w is the weight vector that defines the hyperplane, b is the bias term, and y(x) is the predicted output. The SVM algorithm tries to find the optimal values of w and b that minimize the classification error and maximize the margin, which is the distance between the hyperplane and the closest data points.
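A form consistent with these definitions, assuming the standard linear decision function, would be the first line below. When a kernel is used, the input is replaced by its image under a feature map, written here as φ (a symbol assumed for illustration), as in the second line:

```latex
\[
y(x) = w^{\top} x + b
\]
\[
y(x) = w^{\top} \varphi(x) + b
\]
```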
METHODOLOGY
The flow chart in Figure 1 depicts the steps of the approach adopted in the current study, in which eight ML models are developed with 80% of the data used for training and 20% for testing. It provides a visual summary of the entire process and helps ensure that all necessary steps are followed to obtain accurate and reliable results.
Data collection
A total of 2,133 data points were collected from previous studies for forecasting the λ of mobile bed channels. The input parameters are v, d, σ, G, g, So, u, df, and b, and the λ of the mobile bed channel is the output parameter. Table 1 lists the minimum, maximum, mean, and standard deviation of each parameter, which together describe its distribution and variability. The parameter v ranges from 5.936 × 10⁻⁷ to 1.73 × 10⁻⁶, with a mean of 8.768 × 10⁻⁷ and a standard deviation of 2.379 × 10⁻⁷. The parameter d spans 0.00002 to 0.027, with a mean of 0.001432 and a standard deviation of 0.003263. The parameter G has a narrower range, from 2.25 to 2.68, with a mean of 2.648621 and a relatively low standard deviation of 0.030715, indicating a fairly consistent distribution. The parameter σ has a wider range, from 1 to 13.83, with a mean of 1.389398 and a standard deviation of 0.472589. The parameter So ranges from 0 to 0.0275, with a mean of 0.004022 and a standard deviation of 0.004646.
Parametersᵃ | Minimum | Maximum | Mean | Standard deviation |
---|---|---|---|---|
v | 5.936 × 10⁻⁷ | 1.73 × 10⁻⁶ | 8.768 × 10⁻⁷ | 2.379 × 10⁻⁷ |
d | 0.00002 | 0.027 | 0.001432 | 0.003263 |
G | 2.25 | 2.68 | 2.648621 | 0.030715 |
σ | 1 | 13.83 | 1.389398 | 0.472589 |
So | 0 | 0.0275 | 0.004022 | 0.004646 |
u | 0.150429916 | 2.218796 | 0.71197 | 0.326384 |
df | 0.0079 | 4.2977 | 0.40649 | 0.751401 |
b | 0.134 | 162.431 | 10.30327 | 29.37706 |
λ | 0.020156162 | 0.471829 | 0.064634 | 0.044 |
ᵃ Input attributes considered in the current study.
The parameter u ranges from 0.150429916 to 2.218796, with a mean of 0.71197 and a standard deviation of 0.326384. The parameter df has a broader range, from 0.0079 to 4.2977, with a mean of 0.40649 and a larger standard deviation of 0.751401. The parameter b has the widest range, from 0.134 to 162.431, with a mean of 10.30327 and a relatively high standard deviation of 29.37706, indicating significant variability and dispersion. In summary, Table 1 gives a comprehensive overview of the statistical properties of each parameter, including its distribution, central tendency, and variability within the dataset.
The assumption of negligible cross-sectional non-uniformities is commonly made in hydraulic studies when analysing flow resistance. It rests on the understanding that minor variations in the channel's cross-sectional geometry may not significantly affect flow resistance, especially in relatively uniform channels. However, where significant cross-sectional changes exist, such as abrupt contractions or expansions, the assumption may lead to inaccurate predictions of the friction factor, because these features can cause localized turbulence and alter flow resistance. Assuming a rectangular cross-section simplifies the analysis and allows a straightforward application of hydraulic principles. Natural channels, however, often exhibit different shapes, and a rectangular cross-section may not fully capture the complexity of real-world geometries. The effect on the friction factor will depend on how far the actual channel geometry departs from the assumed rectangular shape.
The assumption of turbulent flow is often valid in practical open-channel flow scenarios. Turbulent conditions are typically associated with high Reynolds numbers, where inertial effects dominate viscous forces. In the context of friction factor prediction, assuming turbulent flow enables the use of appropriate turbulence models to better describe flow resistance. If the flow is not turbulent, however, the friction factor predictions may deviate significantly. The assumption of a relationship between the friction factor and the Froude number is supported by empirical observations in alluvial channels. The Froude number characterizes the relative importance of inertial forces to gravity, and it affects channel flow behaviour and sediment transport. By considering the Froude number in the friction factor prediction, the model can account for variations in flow resistance due to changes in flow velocity.
The assumptions made in the study help simplify and focus the analysis while still accounting for critical factors influencing flow resistance. Conducting sensitivity analyses and justifying each assumption based on theoretical insights and empirical evidence will strengthen the study's reliability and ensure more accurate predictions of the friction factor in alluvial channels.
IS | Input scenarios | Output |
---|---|---|
IS1 | f (So, b, df, d, G, u, v, σ) | Friction factor (λ) |
IS2 | f (b, df, d, G, u, v, σ) | Friction factor (λ) |
IS3 | f (b, df, d, G, u, σ) | Friction factor (λ) |
IS4 | f (b, df, d, u, σ) | Friction factor (λ) |
IS5 | f (d, u, σ) | Friction factor (λ) |
IS6 | f (d, σ) | Friction factor (λ) |
IS7 | f (d) | Friction factor (λ) |
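The seven input scenarios in the table above can be encoded as plain feature lists, which makes it easy to train each model on each subset in turn. The sketch below is illustrative only: the column labels (for example `sigma` for σ) and the helper names are assumptions, not identifiers from the study.

```python
# Feature subsets for the seven input scenarios (IS1-IS7), as listed in the table.
# Column labels are illustrative; "sigma" stands for the gradation coefficient.
INPUT_SCENARIOS = {
    "IS1": ["So", "b", "df", "d", "G", "u", "v", "sigma"],
    "IS2": ["b", "df", "d", "G", "u", "v", "sigma"],
    "IS3": ["b", "df", "d", "G", "u", "sigma"],
    "IS4": ["b", "df", "d", "u", "sigma"],
    "IS5": ["d", "u", "sigma"],
    "IS6": ["d", "sigma"],
    "IS7": ["d"],
}


def select_features(record, scenario):
    """Return the input vector of one record (a dict) under a given scenario."""
    return [record[name] for name in INPUT_SCENARIOS[scenario]]


def dropped_between(first, second):
    """List the inputs present in `first` but absent from `second`."""
    return sorted(set(INPUT_SCENARIOS[first]) - set(INPUT_SCENARIOS[second]))


# Each scenario removes inputs from the previous one.
for a, b in zip(list(INPUT_SCENARIOS), list(INPUT_SCENARIOS)[1:]):
    print(a, "->", b, "drops", dropped_between(a, b))
```

Written this way, the scenarios are visibly nested: every scenario is a subset of the one before it, so comparing adjacent scenarios isolates the effect of the inputs that were removed.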
Data pre-processing
This stage pre-processes the data before they are passed to the ML models for training. First, an imputation technique is used to handle missing values and create a complete dataset for analysis (Zhang 2016). Next, because the parameters span different ranges, they must be brought to a common scale. A normalization technique is therefore applied to the numerical columns during data preparation (Ahsan et al. 2021): the values of the dataset were standardized to a mean of 0 and a standard deviation of 1.
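The two pre-processing steps can be sketched in a few lines of standard-library Python. The text does not say which imputation method was used, so mean imputation is an assumption here, as are the function names and the sample values.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values.

    Mean imputation is an assumption; the source does not name the method.
    """
    observed = [x for x in column if x is not None]
    fill = mean(observed)
    return [fill if x is None else x for x in column]


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation (z-scores)."""
    mu = mean(column)
    sd = pstdev(column)  # population standard deviation
    if sd == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mu) / sd for x in column]


depths = [0.2, None, 0.4, 0.6]  # hypothetical flow depths with one gap
filled = impute_mean(depths)
scaled = standardize(filled)
print(filled)
print(scaled)
```

In practice the mean and standard deviation would normally be computed on the training set only and then reused for the test set, so that no information from the test data leaks into training.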
Data split
In this step, the complete dataset of 2,133 data points is split into training and testing sets in an 80:20 ratio. The training set (80%; 1,704 data points) is used to train the ML models, and the test set (20%; 427 data points) is used to evaluate the trained models' performance.
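A minimal sketch of an 80:20 split is given below. It assumes a random shuffle with a fixed seed, which the text does not specify; the function name and seed are illustrative. The exact counts depend on the rounding rule and on any records removed during cleaning, so they need not match the figures quoted above.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(2133))  # stand-ins for the 2,133 data points
train, test = train_test_split(records)
print(len(train), len(test))  # prints "1706 427"
```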
Model evaluation criteria
The models must be validated to ensure that their predictions are reliable and accurate. Several measures, namely root mean squared error (RMSE), mean absolute error (MAE), MSE, relative absolute error (RAE), root relative squared error (RRSE), and R², are used to evaluate the precision and accuracy of the proposed models in predicting the λ of the movable bed channel. Heatmap data visualization and parametric analysis are also performed to determine the correlation among the input parameters and between each input and the output parameter.
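The error measures named above can be computed as in the following sketch. The formulas are the usual textbook definitions, assumed here because the text does not state them: RAE and RRSE compare the model's errors with those of a baseline that always predicts the mean of the actual values. R² is left out because its exact definition is not given in the text. The sample values are hypothetical.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Error metrics for paired lists of actual and predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [p - a for a, p in zip(actual, predicted)]
    sum_abs_err = sum(abs(e) for e in errors)
    sum_sq_err = sum(e * e for e in errors)
    # Baseline: always predicting the mean of the actual values
    sum_abs_dev = sum(abs(a - mean_actual) for a in actual)
    sum_sq_dev = sum((a - mean_actual) ** 2 for a in actual)
    mse = sum_sq_err / n
    return {
        "MAE": sum_abs_err / n,
        "MSE": mse,
        "RMSE": sqrt(mse),
        "RAE_percent": 100 * sum_abs_err / sum_abs_dev,
        "RRSE_percent": 100 * sqrt(sum_sq_err / sum_sq_dev),
    }


# Hypothetical friction-factor values, for illustration only
actual = [0.030, 0.050, 0.070, 0.090]
predicted = [0.032, 0.048, 0.074, 0.086]
metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

An RAE or RRSE below 100% means the model does better than the mean-value baseline; the closer to 0%, the better.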
RESULTS AND DISCUSSION
Heatmap data visualization
Most and least effective variables
This investigation determines the relative contribution of each input parameter to the output parameter. The correlation coefficients between inputs and output are displayed as a heatmap to show the degree of association between the variables, which helps to identify features suited to building ML models. The input variables with the most significant influence on the λ of the mobile bed channel are shown in Figure 2, based on the magnitude of the Pearson correlation coefficient. According to the correlation coefficients, d (R = 0.4), σ (R = 0.28), u (R = −0.24), df (R = −0.15), and b (R = −0.15) had a significant impact, followed by G (R = −0.048), v (R = 0.0013), and So (R = 0.0011), which had the least impact.
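The ranking described above follows from sorting the inputs by the absolute value of their Pearson correlation with λ. The sketch below shows the coefficient itself and reproduces the ranking from the values reported in the text; the function name and the dictionary keys are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Coefficients reported in the text for each input against the friction factor
reported = {
    "d": 0.4, "sigma": 0.28, "u": -0.24, "df": -0.15,
    "b": -0.15, "G": -0.048, "So": 0.0011, "v": 0.0013,
}

# Rank by strength of association, ignoring the sign
ranked = sorted(reported, key=lambda name: abs(reported[name]), reverse=True)
print(ranked)
```

Sorting on the absolute value matters because a strongly negative coefficient, such as that of u, indicates as strong a linear association as a positive one of the same size.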
Parametric analysis
Scatter plots
Box plot
Histogram
Sensitivity analysis
Relationship between the errors and ML models
Models | IS1 | IS2 | IS3 | IS4 | IS5 | IS6 | IS7 |
---|---|---|---|---|---|---|---|
LR | 0.0383 | 0.0383 | 0.0429 | 0.0426 | 0.0372 | 0.0409 | 0.0424 |
GP | 0.0388 | 0.0388 | 0.0429 | 0.0431 | 0.0377 | 0.0374 | 0.0433 |
MLP | 0.0197 | 0.0406 | 0.0574 | 0.0397 | 0.0381 | 0.0353 | 0.0432 |
K* | 0.0116 | 0.0131 | 0.0158 | 0.015 | 0.024 | 0.0292 | 0.0438 |
AR | 0.028 | 0.0279 | 0.0325 | 0.0315 | 0.0345 | 0.0333 | 0.0312 |
RF | 0.013 | 0.0177 | 0.0169 | 0.0136 | 0.0234 | 0.0295 | 0.0261 |
M5P | 0.0145 | 0.0129 | 0.0795 | 0.0792 | 0.0279 | 0.0649 | 0.0325 |
SVM | 0.0434 | 0.0436 | 0.0437 | 0.0436 | 0.041 | 0.0372 | 0.0445 |
Supplementary material, Table S6 presents the RAE of the ML models in the various ISs. The lowest RAE (24.9413%) was obtained with K* in IS2, followed by K* in IS1 (27.0855%) and RF in IS4 (28.666%). The K* and RF models performed better than the LR, M5P, SVM, AR, MLP, and GP models in terms of least RAE across all seven ISs. Supplementary material, Figure S10 shows the RAE values of all eight ML models under all seven ISs. Supplementary material, Table S7 presents the RRSE of the ML models in the various ISs. The K*, M5P, and RF models performed better than the LR, SVM, AR, MLP, and GP models in terms of least RRSE across all seven ISs. K* in IS1 had the lowest RRSE of all models (23.9617%), followed by M5P in IS2 (26.4129%) and K* in IS4 (33.0569%). Figures S10–S20 are provided in the Supplementary material.
Relationship between the actual and predicted λ of different ML models
The correlation coefficients of all eight ML models in the various ISs are given in Supplementary material, Table S8. The highest R² (0.9716) was obtained with K* in IS1, followed by M5P in IS2 (0.9712) and RF in IS4 (0.9603). A higher R² indicates good agreement between the trend of the experimental data and the predicted data, and the higher R² values of these models attest to their superior fit to the present data. The other combinations of ISs and models showed a weak correlation between the experimental and predicted λ, as reflected in their lower R² values. In almost all ISs, the agreement between experimental and predicted λ obtained with the K*, M5P, and RF models is very close to a linear function. LR, SVM, and GP show considerably lower R² values for every combination of inputs, and the dispersion of the actual and predicted data points shows that these models do not fit the data adequately. With the lowest R² (0.2161) and the greatest scatter and deviation from the line of coincidence, LR performed poorly. The K*, RF, and M5P models thus achieved the better results in predicting the λ of mobile bed channels in the present work. These findings match those of studies cited in the literature. Supplementary material, Figures S11–S17 show scatter plots of test data vs. predicted data for all the ISs used in this study. Supplementary material, Tables S4–S8 and Figures S10–S20 illustrate that K* performed best in IS1, followed by M5P in IS2 and RF in IS4.
Taylor's diagram
The Taylor's diagram of the better-performing model in each IS considered in the current investigation is shown in Supplementary material, Figure S19. A Taylor diagram shows visually the extent to which a pattern (or collection of patterns) resembles the reference data. The diagram is very helpful for studying different aspects of complex models or evaluating the relative skill of multiple models. The similarity between two patterns is quantified by their correlation, their root-mean-square difference, and the amplitude of their variations (represented by their standard deviations). The standard deviation of the experimental data in the present case is 0.002696. From Supplementary material, Figure S19, it can be observed that the K* model in IS1 lies close to the reference line. Next closest to the reference are the K* model in IS3 and M5P in IS2. These three ML models also have high Pearson correlation coefficients (0.9716 for K* in IS1, 0.9712 for M5P in IS2, and 0.9445 for K* in IS3) and low RMSE values (0.0116 for K* in IS1, 0.0129 for M5P in IS2, and 0.0158 for K* in IS3), which indicates that they predict λ in mobile bed channels very well.
The results obtained by K* in IS1 were more accurate than those of the other models, followed by M5P in IS2 and RF in IS4; these models showed the smallest errors and the best prediction of λ. The low values of MAE (0.0075), MSE (0.00013456), RMSE (0.0116), RAE (27.0855%), and RRSE (23.9617%) for the K* model in IS1 confirm its better accuracy in comparison to the other ML models used in this study. Taylor's diagram shows that K* in IS1, K* in IS3 and M5P in IS2 possess high Pearson correlation coefficients, indicating accurate prediction with the smallest errors.
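The three statistics summarised by a Taylor diagram can be computed as in the minimal sketch below. It assumes population standard deviations and uses the centred root-mean-square difference (means removed), which is the quantity conventionally plotted on a Taylor diagram and can differ from the plain RMSE when the prediction is biased. The function name `taylor_statistics` and the sample data are hypothetical.

```python
import math


def taylor_statistics(reference, model):
    """Return (std_ref, std_model, correlation, centred RMS difference)."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_m = sum(model) / n

    # Deviations of each series from its own mean.
    dev_r = [r - mean_r for r in reference]
    dev_m = [m - mean_m for m in model]

    std_r = math.sqrt(sum(d * d for d in dev_r) / n)
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    corr = sum(a * b for a, b in zip(dev_r, dev_m)) / (n * std_r * std_m)
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_r, dev_m)) / n)

    return std_r, std_m, corr, crmsd


if __name__ == "__main__":
    # Hypothetical values for demonstration only.
    reference = [1.0, 2.0, 3.0, 4.0]
    model = [1.5, 2.5, 2.5, 4.5]
    std_r, std_m, corr, crmsd = taylor_statistics(reference, model)

    # The diagram's geometry rests on the law-of-cosines identity:
    # crmsd^2 = std_r^2 + std_m^2 - 2 * std_r * std_m * corr
    rhs = std_r ** 2 + std_m ** 2 - 2.0 * std_r * std_m * corr
    print(f"crmsd^2 = {crmsd ** 2:.6f}, identity rhs = {rhs:.6f}")
```

Because the three statistics satisfy this identity, a single point on the diagram encodes all of them: its radial distance is the model's standard deviation, its angle gives the correlation, and its distance from the reference point is the centred RMS difference.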
K* is a nonparametric, instance-based (IB) technique that defers the main work as long as feasible, whereas other ML algorithms develop generalizations as soon as they ‘meet’ the data. IB learners, also known as memory-based learners, store the training instances in a lookup table and interpolate from them. K* is an IB classifier designed to improve performance in dealing with missing values, smoothing issues and attributes that can be both real and symbolic in nature. It assigns a test instance the class of similar training instances, as determined by a specified similarity function. Classification with K* sums the probabilities from the new instance to all of the members of a category; this is repeated for the remaining categories, and the category with the highest probability is selected. The fundamental advantage of using an ML model is that it can map nonlinear relationships between variables without requiring the user to understand the physics of the problem. The ML model aims to map the nonlinear relationship between the dependent variable (output variable) and the independent variables (input variables) using its complicated internal structures. In comparison to other models, the K* model in IS1 provides the best overall prediction of λ.
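The memory-based idea described above can be illustrated with a deliberately simplified sketch. This is not the K* algorithm itself, which uses an entropy-based distance with a blend parameter; it is a generic instance-based regressor that stores the training instances and predicts by weighting their targets with an exponentially decaying similarity. The function name `memory_based_predict`, the `scale` parameter and the sample data are all hypothetical.

```python
import math


def memory_based_predict(train_x, train_y, query, scale=1.0):
    """Predict a target as a similarity-weighted average of stored targets.

    Illustrative only: an exponential kernel on the L1 distance stands in
    for the transformation probabilities used by the real K* algorithm.
    """
    # L1 distance from the query to every stored training instance.
    distances = [
        sum(abs(xi - qi) for xi, qi in zip(x, query)) for x in train_x
    ]

    # Subtract the smallest distance before exponentiating so that the
    # nearest instance always has weight 1 (avoids underflow to all zeros).
    d_min = min(distances)
    weights = [math.exp(-(d - d_min) / scale) for d in distances]

    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total


if __name__ == "__main__":
    # Hypothetical one-feature training set for demonstration only.
    train_x = [[0.0], [1.0], [2.0]]
    train_y = [0.0, 1.0, 4.0]
    print(memory_based_predict(train_x, train_y, [0.5], scale=0.5))
```

No model is fitted in advance: all work happens at query time, which is the defining property of lazy, instance-based learners. A smaller `scale` makes the prediction follow the nearest stored instance more closely, while a larger one smooths over more neighbours.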
CONCLUSION
In this study, eight ML models, namely LR, GP, MLP, K*, AR, M5P, RF, and SVM, have been utilized to determine the λ of movable bed channels. The outcomes of the research lead to the following conclusions:
The lower values of the K* model in IS1 for MAE (0.0075), MSE (0.00013456), RMSE (0.0116), RAE (27.0855%), and RRSE (23.9617%) confirm the better accuracy of this model in comparison to the other ML models.
K* in IS1, followed by M5P in IS2 and RF in IS4, showed the smallest errors and the best prediction of λ.
Taylor's diagram showed that K* in IS1, K* in IS3 and M5P in IS2 possess high Pearson correlation coefficients, indicating accurate prediction with the smallest errors.
The heatmap data visualization demonstrates the impact of the input variables on the λ of the mobile bed channel, measured by the magnitude of the Pearson correlation coefficient. According to the correlation coefficients, d (R = 0.4), σ (R = 0.28), u (R = −0.24), df (R = −0.15), and b (R = −0.15) are the most influential parameters, whereas G (R = −0.048), So (R = 0.0011) and v (R = 0.0013) are the least influential.
The highest R2 for the forecasted λ has been found in the case of K* (0.9716) in IS1, followed by M5P (0.9712) and RF (0.9603) in IS2 and IS4, respectively.
A comparison of the eight models under the various ISs shows that K* in IS1 has strong forecasting capability, suggesting its suitability for the preliminary estimation of λ in the mobile bed channel.
The small deviation between the actual values of λ and the predictions of the ML models indicates that the models can predict λ accurately and efficiently from the considered input parameters.
The agreement between predicted and actual values of λ strengthens the possibility of using ML approaches on-site to predict the values of λ for specific parameters.
In addition, this study could be further extended by incorporating a broader set of performance indices, as outlined by Chadalawada & Babovic (2019), which are Volumetric Efficiency, Kling-Gupta Efficiency (KGE), Nash–Sutcliffe Efficiency (NSE), and Log NSE, that can help in gaining a more comprehensive understanding of the models' performance and suitability for the prediction of λ.
The current study gives engineers and researchers a better basis for choosing input variables and regressors when applying ML models to predict λ accurately. The accuracy of the models underlines their value in the civil engineering domain, especially for the determination of λ, as the experimental approach consumes time and effort. The range of the parameters is also significant in the model's training. Although the dataset in this study comprises diverse sources from field and laboratory research, there are circumstances in which the range of input parameters may exceed the values that have been evaluated. It is probable that the proposed model will underperform in such instances. Most ML-based models share this difficulty, since they rely largely on the dataset on which they are trained and its properties. Such prediction models could be used in developing countries where there is a lack of technical expertise in the estimation of λ. The K* model in IS1 can be used as an additional tool to forecast λ without experimental procedures being carried out. The capacity to be retrained on new data is a critical property of ML models, and it facilitates the model's ability to adapt to shifting environmental conditions.
ACKNOWLEDGEMENTS
The authors acknowledge the insights provided by their colleagues that significantly improved the quality of the manuscript.
AUTHOR CONTRIBUTIONS
A.B. conceptualized the study, did data analysis, wrote the original draft, and prepared the article; A.A.M. did formal analysis and modelling; B.K. was involved in review and editing; M.P. wrote the original draft, edited, investigated, and reviewed the article.
FUNDING
The authors gratefully acknowledge the financial support from the Core Research Grant, SERB Government of India (CRG/2021/002119), to carry out the review work presented in this paper.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.