## Abstract

River bottom tearing scour (RBTS) has a strong effect on the scouring and moulding of the channel in the Yellow River. Owing to its special forming conditions, complex influencing factors, and limited observed data, it is difficult to predict accurately whether RBTS will occur. By collecting and processing the hydrodynamic, sediment, and initial boundary data of 246 flood events related to RBTS in three typical reaches of the Yellow River basin, the correlation between different characteristic influencing factors and the occurrence or absence of RBTS was analysed, and prediction models based on machine learning algorithms were constructed. The results showed that, under the existing data conditions, the maximum sediment concentration *S*_{m}, average sediment concentration *S*_{p}, flood growth rate *ν*, and shape coefficient *δ* were the four key indices that most readily distinguish whether RBTS will occur. The support vector machine model performed best and exhibited higher accuracy and precision in predicting RBTS occurrence than the other models under given water and sediment conditions. The approach proposed in this study provides a new method for predicting RBTS in the Yellow River.

## HIGHLIGHTS

The maximum sediment concentration, average sediment concentration, flood growth rate, and shape coefficient are the four key indices for distinguishing whether RBTS will occur.

Prediction models of RBTS based on machine learning algorithms were built.

The SVM model showed the best predictive performance in the case study of the middle reach of the Yellow River.

## INTRODUCTION

*et al.* 2016; Bi *et al.* 2019; Chen *et al.* 2021; Medel *et al.* 2022). In general, cohesive river sediment is gradually consolidated and compacted under long-term flow and forms stable structures with strong resistance to erosion. When a large flood arrives, channel scour and sediment transport may occur if the hydrodynamic forces surpass the particle resistance forces (Slaa *et al.* 2013; Bosa *et al.* 2018; Das *et al.* 2019). In the areas of Longmen and Lintong of the Weihe River, there is a special phenomenon in which high-sediment floods cause large-scale, long-distance erosion within a short period: the riverbed sediment layer is uplifted from the river bottom and exposed in blocks or patches at the water surface. The block area can reach several square metres or even tens of square metres. The sediment is then carried away by the flow in a short period, so that the riverbed is scoured by several metres, or even ten metres, during a single flood. This is called river bottom tearing scour (RBTS; Figure 1) (Kuang *et al.* 2000; Cao *et al.* 2006; Van Maren *et al.* 2009a; Jiang *et al.* 2010).

The occurrence of RBTS requires certain conditions. Channel deposition is usually severe beforehand, and the riverbed sediment has become dense and reached a certain thickness. Moreover, clay blocks composed of fine cohesive sediment particles have formed after a period of consolidation, surrounded by granular coarse sand. Beyond the formation of clay blocks, adequate hydrodynamic conditions are also required. A high-sediment-concentration flood from upstream, with its strong sediment-carrying capacity, increases the viscous force of the flow and intensifies the impact of sediment particles on the silted riverbed. This not only provides sufficient energy to lift sediment from the river bottom but also continuously carries the sediment broken up by the impact downstream. Therefore, when a high-sediment-concentration flood occurs, the main channel is strongly eroded as the discharge and velocity increase, which may result in RBTS (Van Maren *et al.* 2009b; Wang *et al.* 2009; Tabarestani & Zarrati 2015; He & Yan 2019; Anand *et al.* 2021).

RBTS in the Yellow River has a strong effect on the scouring and moulding of the channel and may cause channel erosion and deposition, main channel migration, and constant changes in the slip position at river engineering works, which can easily cause these works to collapse and puts tremendous pressure on flood defence (Jiang *et al.* 2015). Studying the factors that influence its occurrence and achieving accurate prediction can therefore provide an important basis for responding to RBTS and ensuring the safety of river engineering, which has practical significance for river management.

Presently, research on RBTS in the Yellow River has mainly focused on its formation mechanisms and mechanical analysis. From a mechanical point of view, RBTS can be divided into three stages: separation, sliding, and turning of the settlement layer of the riverbed (Gou 2004). Based on analysis of original test data combined with generalised model tests, the main conditions for RBTS are considered to be the form of the sediment, the water and sediment conditions, the characteristics of the antecedent channel boundary, and the instantaneous lifting force caused by the different propagation speeds of the pulsating pressure wave on the upper and lower surfaces of the clay block. A critical criterion for determining the occurrence of the phenomenon was established (Jiang *et al.* 2010; Zhang *et al.* 2022b). A relationship model between the clay particle size distribution and the shear strength of the RBTS reach in the Yellow River was established and solved (Dong *et al.* 2012). Furthermore, based on the characteristics of the sediment layer, some researchers proposed the concept of an elastic viscous layer of the riverbed and calculated the critical instability force of the cemented block using the material mechanics cross-section method and the principle of minimum potential energy (Zhang & Hu 2013). However, existing research does not agree on the conditions that discriminate RBTS occurrence, and the discriminant indices have been derived from empirical formulas or parameters under various assumptions and simplifications. These indices have not been sufficiently verified, and their applicability and accuracy are limited.

With the continuous development of artificial intelligence and data mining technology, intelligent algorithms such as machine learning, which use automatic data analysis and modelling to predict multifactor, nonlinear system problems, have been widely used in many scientific fields and have become effective tools for data analysis. These methods have also been applied to predicting the bed evolution of the Yellow River and other rivers. For instance, Xia *et al.* (2023) used remote sensing images, cross-sectional terrain, and water and sediment data to construct a random forest (RF) prediction model for thalweg migration in the wandering reach of the lower Yellow River, and the average accuracy across river sections reached up to 80%. Hu *et al.* (2023) proposed a new method for monitoring river suspended sediment concentrations (SSCs) that combines remote sensing with the light gradient-boosting machine method, and it proved accurate and robust in an experiment on the lower Yellow River. Yan *et al.* (2023) developed a rough-set-based approach to discriminating river patterns and compared it with several competitive machine learning methods, including support vector machine (SVM), extreme gradient-boosting (XGBoost), and deep neural networks (DNNs). Although the proposed method performed well and offered interpretability, simple modelling, and a need for fewer training samples, XGBoost, a black box model, still achieved the highest performance. Li *et al.* (2021) used 10-m Sentinel-2 multispectral instrument (MSI) imagery, digital elevation model (DEM) data, and the RF algorithm to extract bankfull river widths on the upper Yellow River. Moreover, Ahmed *et al.* (2019) used remote sensing and unsupervised machine learning techniques to quickly forecast regions that are subject to future river sediment deposition. Xu *et al.* (2022) built an integrated model combining a numerical fluid-flow and sediment model with a long short-term memory (LSTM) module to analyse the detailed process of the bed-steadying discharge (Q_{S}) in the Middle Huaihe River. Ren *et al.* (2020) adopted RF to identify the most influential factors and to develop a model for categorising and mapping the spatial distribution of riverbed substrate grain size in the Hanford Reach of the Columbia River Basin. In these studies, multiple input parameters usually need to be considered, and machine learning methods can respond quickly to driving factors when constructing models, which have shown good predictive performance. However, there are few reports on using machine learning to predict RBTS in the Yellow River. Complex nonlinear relationships exist among the influencing factors of RBTS, and machine learning algorithms are well suited to capturing such relationships among multiple factors. Therefore, accurate machine learning prediction models for RBTS, and the evaluation of their application effect, still need to be studied.

Due to the special formation conditions, complicated influencing factors, and limited observed data of RBTS, predicting the occurrence of RBTS is difficult. To solve the above problems, this study establishes a prediction model for the RBTS phenomenon based on the machine learning algorithms to provide theoretical and technical support for its accurate prediction and effective prevention and control.

## METHODS

### Characteristic influencing factor calculation

Based on the mechanism of the RBTS phenomenon and existing studies, the characteristic influencing factor indices selected in this study were mainly divided into hydrodynamic, sediment, and initial boundary (Dong *et al.* 2011; Li *et al.* 2017; Liu 2018; Liu *et al.* 2021; Zhang *et al.* 2022a).

#### Hydrodynamic factor

- (1) High flow duration ratio

- (2) Flood growth rate

- (3) Shape coefficient

The shape coefficient is defined from the flood volume *W*, the peak discharge, and the flood duration *T*.

- (4) Flood peak pattern flow
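The defining equations for these indices did not survive in this text, so the following is only a minimal sketch of how a shape coefficient could be computed from a gauged hydrograph. It assumes the common hydrological definition *δ* = *W* / (*Q*_{m} *T*), with *Q*_{m} standing for the peak discharge; that formula, the symbol *Q*_{m}, and the function names are assumptions for illustration, not taken from the source.

```python
def flood_volume(times_s, discharges):
    """Flood volume W (m^3): trapezoidal integral of discharge (m^3/s) over time (s)."""
    pairs = zip(times_s, times_s[1:], discharges, discharges[1:])
    return sum(0.5 * (q0 + q1) * (t1 - t0) for t0, t1, q0, q1 in pairs)


def shape_coefficient(times_s, discharges):
    """ASSUMED definition: delta = W / (Q_m * T).

    W   : flood volume, integrated from the hydrograph
    Q_m : peak discharge (maximum of the hydrograph)
    T   : flood duration (last time stamp minus first)
    """
    w = flood_volume(times_s, discharges)
    q_m = max(discharges)
    t = times_s[-1] - times_s[0]
    return w / (q_m * t)


# Synthetic hydrographs, purely illustrative (not observed data).
triangular = shape_coefficient([0, 3600, 7200], [0.0, 100.0, 0.0])
flat = shape_coefficient([0, 3600, 7200], [100.0, 100.0, 100.0])
```

Under this assumed definition, a sharply peaked flood gives a small coefficient (0.5 for the triangular case) and a sustained flood approaches 1.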

#### Sediment factor

- (1) Maximum sediment concentration

- (2) Average sediment concentration

The average is taken over *T*, the entire duration of the flood.

- (3) Median particle diameter and average particle diameter

The median particle diameter is the particle size at which the cumulative particle size distribution reaches 50%, and the average particle diameter is the mean particle size obtained using standardised sampling methods on the day closest to the start date of the flood.

#### Initial boundary factor

- (1) River width

- (2) Maximum water depth

- (3) Average water depth

Here *n* is the number of measurement points at different positions in a section.

- (4) Width-depth ratio

The width-depth ratio is a comprehensive index that reflects the initial boundary conditions of the river channel before the flood.

### Key indicator extraction with strong correlation

In this study, the occurrence or absence of the RBTS phenomenon is a discrete variable with only two outcomes, whereas the indices of the characteristic influencing factors are continuous variables. Conventional correlation analysis (such as the correlation coefficient *R*^{2}) therefore does not reflect the degree of correlation between the two well. For the correlation analysis between discrete and continuous variables, the *F*-score and the *η*^{2} index were combined to evaluate the degree of influence of different factors on the dependent variable, which provides better threshold standards for selecting key factors for subsequent modelling.

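- (1) *F*-score

The larger the *F*-score value, the greater the influence of the characteristic influencing factor on the dependent variable, and the easier it is for this factor to distinguish whether the RBTS phenomenon occurs. The *F*-score is defined as (Chen & Lin 2006):

$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\dfrac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \dfrac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$

where $\bar{x}_i$, $\bar{x}_i^{(+)}$, and $\bar{x}_i^{(-)}$ are the average values of the *i*th characteristic index in the dataset of the total, occurrence, and absence of RBTS, respectively; $x_{k,i}^{(+)}$ and $x_{k,i}^{(-)}$ are the *k*th values of the *i*th characteristic index in the dataset of the occurrence and absence of RBTS, respectively; and $n_+$ and $n_-$ are the numbers of occurrences and absences of RBTS, respectively.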

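- (2) *η*^{2}

$$\eta^2 = \frac{\sum_{k} n_k\left(\bar{x}_k - \bar{x}\right)^2}{\sum_{k}\sum_{i=1}^{n_k}\left(x_{ki} - \bar{x}\right)^2}$$

where *k* is the group of the RBTS phenomenon, $k \in$ {occurrence, absence}; $n_k$ is the number of samples in group *k*; $\bar{x}_k$ is the average value of the characteristic index in group *k*; $\bar{x}$ is the average value of the characteristic index in all groups; and $x_{ki}$ is the *i*th value of the characteristic index in group *k*.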
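As a rough illustration of the two measures, the sketch below computes the *F*-score in the form given by Chen & Lin (2006) and a between-group to total sum-of-squares ratio for a single index. Reading the second index as the correlation ratio *η*^{2} is an assumption inferred from the variable definitions in the text; the function names and sample values are hypothetical.

```python
from statistics import mean


def f_score(occurrence, absence):
    """F-score of one index: between-class separation over within-class variance."""
    pos, neg = list(occurrence), list(absence)
    m_all, m_pos, m_neg = mean(pos + neg), mean(pos), mean(neg)
    between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    within = (
        sum((x - m_pos) ** 2 for x in pos) / (len(pos) - 1)
        + sum((x - m_neg) ** 2 for x in neg) / (len(neg) - 1)
    )
    return between / within


def eta_squared(occurrence, absence):
    """Correlation ratio (ASSUMED second index): SS_between / SS_total."""
    groups = [list(occurrence), list(absence)]
    m_all = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - m_all) ** 2 for g in groups)
    ss_total = sum((x - m_all) ** 2 for g in groups for x in g)
    return ss_between / ss_total


# Made-up values of one index for the two groups (not observed data).
f_val = f_score([4.0, 6.0], [1.0, 3.0])
eta_val = eta_squared([4.0, 6.0], [1.0, 3.0])
```

Both measures grow as the two group means move apart relative to the scatter of the samples, which is why they can be used side by side to rank indices.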

Based on the above calculation formulas of the *F*-score and *η*^{2}, the correlation between the indices of the characteristic influencing factors and the occurrence or absence of the RBTS phenomenon was analysed. The extraction steps of key indices with strong correlation are as follows:

- (1) For the collected data of each flood, calculate the indices of the characteristic influencing factors in hydrodynamics, sediment, and initial boundary;

- (2)

- (3) According to the empirical results of *η*^{2} and the control criteria of the *F*-score, select an appropriate threshold [*F*-score] to distinguish the key indices with strong correlations from others with weak correlations;

- (4) The key indices with strong correlations are selected as the input variables of the subsequent prediction model.

### Basic principle of machine learning algorithms

The prediction models of the RBTS phenomenon were built based on four machine learning algorithms, including SVM, k-nearest neighbour (KNN), RF, and XGBoost. These four algorithms belong to different types and have good competitiveness and performance, which is more conducive to obtaining the optimal prediction model.

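- (1) SVM

*et al.* 2013; Aburomman & Reaz 2017; Huang *et al.* 2023). For a given training set of *m* pairs of data points $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^{d}$ and $y_i \in \{-1, +1\}$ are the feature vector and category label of the *i*th sample, respectively, training is equivalent to the following optimisation problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i\left(w^{\mathsf{T}}x_i + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m$$

where *w* is the normal vector of the hyperplane, *b* is the bias, $\xi_i$ is the 'slack variable', which corresponds to the hinge loss, and *C* is a regularisation constant that controls the trade-off between the classification margin and the misclassification cost.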
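To make the trade-off controlled by the regularisation constant concrete, the following sketch evaluates the standard soft-margin objective, half the squared norm of *w* plus *C* times the summed hinge losses, for a fixed hyperplane. It is a minimal illustration assuming a linear kernel and labels in {-1, +1}; the function name and the toy data are hypothetical, and no solver is included.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for a fixed linear hyperplane (w, b).

    Each slack is the hinge loss max(0, 1 - y_i * (w . x_i + b)):
    zero for points outside the margin, positive otherwise.
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        decision = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        slacks.append(max(0.0, 1.0 - y_i * decision))
    margin_term = 0.5 * sum(w_j * w_j for w_j in w)
    return margin_term + C * sum(slacks), slacks


# Toy two-feature samples (illustrative only).
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=1.0)
```

Here only the second sample lies inside the margin, so it alone contributes a slack of 0.5; raising *C* would penalise that violation more heavily relative to the margin width.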

- (2) KNN

The KNN algorithm is a classical and effective algorithm for classification recognition that has been widely applied in several domains owing to its methodological simplicity, nonparametric working principle, and easy implementation. The algorithm finds the *k* most similar data samples according to their distance and predicts the category of the sample to be classified from the labels of these *k* most similar samples using the majority voting principle (Wang *et al.* 2020; Kim 2021). The distances can be calculated as shown in Equation (13), where *x*_{i} and *y*_{i} are the *i*th data points of samples *X* and *Y*, respectively.
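The neighbour search and majority vote can be sketched in a few lines. The distance formula itself is not legible in this text, so the Euclidean distance is used here as an assumption; the function names and toy samples are likewise hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors (ASSUMED metric)."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy samples: label 1 = occurrence, 0 = absence (illustrative only).
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1]
near_origin = knn_predict(train_X, train_y, [0.2, 0.2], k=3)
near_cluster = knn_predict(train_X, train_y, [5.5, 5.0], k=3)
```

Because the vote depends directly on distances, inputs with very different units (for example kg/m^{3} versus a dimensionless coefficient) would normally be rescaled first; an odd *k* avoids ties in a two-class problem.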

- (3) RF

The RF algorithm is an ensemble classification method based on multiple decision trees built from data subspaces. It can achieve high accuracy in classifying high-dimensional data and is characterised by trivial parallelisation, high speed, and strong generalisation ability. It generates each decision tree randomly by introducing attribute selection into the training process and classifies by aggregating the results of the decision trees (Mantas *et al.* 2019; Wang *et al.* 2022). The basic steps of RF classification are:

- (1) Randomly select *N* training samples from the original dataset, and obtain *k* independent training sample subsets through *k* rounds of sampling with replacement;

- (2) Construct different decision tree models by selecting features from the original input variables, using the feature with the best value to split each node;

- (3) Each decision tree returns a classification value that is a vote for that class. The final classification result is determined from the individual tree results by ensemble voting.
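The first and last of these steps can be illustrated without a tree learner. The sketch below draws bootstrap subsets (sampling with replacement) and aggregates class votes; the tree-fitting step is deliberately left out, and the "trees" are stand-in callables. All names and values are hypothetical.

```python
import random
from collections import Counter


def bootstrap_subsets(samples, k, n, seed=0):
    """Draw k subsets of n samples each, with replacement, from `samples`."""
    rng = random.Random(seed)
    return [[rng.choice(samples) for _ in range(n)] for _ in range(k)]


def majority_vote(trees, x):
    """Combine the class votes of all trees for one input x."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Stand-in data and stand-in "trees" (illustrative only).
subsets = bootstrap_subsets(list(range(10)), k=5, n=10, seed=1)
trees = [lambda x: 1, lambda x: 0, lambda x: 1]
decision = majority_vote(trees, x=[0.3, 0.7])
```

Because each subset is drawn with replacement, some samples repeat and others are left out, which is what makes the individual trees differ from one another.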

- (4) XGBoost

XGBoost is an optimised and improved algorithm based on gradient-boosted regression trees; it is a typical ensemble model with excellent learning effects and high computational performance. The basic idea is to use multiple decision trees as base classifiers. In each training round, a new base classifier is fitted to the residual of the previous prediction. By constantly adding new decision trees to minimise the loss value, multiple base classifiers are weighted and integrated into a strong estimator for predictive analysis, improving the model accuracy and giving the final classification results (Gu *et al.* 2022; Pan *et al.* 2022). The objective function is:

$$Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{j}\Omega\left(f_j\right)$$

where *n* is the total amount of sample data; $y_i$ and $\hat{y}_i$ are the observed and predicted values of the *i*th sample, respectively; $l$ is the loss function that reflects the difference between them; $\Omega$ is the complexity of the model; and $f_j$ is the model of the *j*th tree.

### Prediction model construction and effect evaluation based on machine learning

For the key feature influencing factor indicators extracted with strong correlation, the machine learning algorithms were applied to establish a prediction model that comprehensively considered different feature indicators for whether RBTS occurs. The methods for establishing the prediction model and evaluating its effects are as follows:

- (1) The key indices with strong correlation determined by the *F*-score and *η*^{2} analysis, and the occurrence or absence of the RBTS phenomenon, were used as the input and output, respectively, of the prediction model expression, in which *I* is the dependent variable indicating whether the RBTS phenomenon occurs and the independent variables are the extracted key influencing factor indices.

- (2)

- (3) The data were divided according to a certain proportion, and the training and test samples were determined.

- (4) The training samples were fed into the different machine learning classification algorithms for learning, and the parameter combinations of the algorithms were adjusted and optimised to establish prediction models that comprehensively considered the different characteristic indices of RBTS.

- (5) The test data were fed into the established model to obtain the predicted results, which were then compared with the observations. The prediction accuracy of the model was evaluated according to five indices: accuracy rate *A*, precision rate *P*, recall rate *R*, *f*1 score, and *AUC* value (the area under the receiver operating characteristic (ROC) curve) (Fu *et al.* 2023). The calculation formulas for the first four evaluation indices are:

$$A = \frac{TP + TN}{TP + TN + FP + FN},\quad P = \frac{TP}{TP + FP},\quad R = \frac{TP}{TP + FN},\quad f1 = \frac{2PR}{P + R}$$

where *TP* and *TN* are the correctly predicted results of the occurrence and absence of the RBTS phenomenon, respectively, and *FP* and *FN* are the incorrectly predicted results of the occurrence and absence of the RBTS phenomenon, respectively.

- (6) The preferred machine learning prediction model was determined.
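The first four evaluation indices follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions with the occurrence class treated as positive; the function name is hypothetical. The counts in the usage example are not reported by the source: they are back-calculated here from the SVM row of Table 3 and the test-set sizes in Table 2 (8 occurrences, 66 absences), and should be read as an inference.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and f1 for the positive (occurrence) class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Inferred counts: 7 of 8 occurrences detected, no false alarms among 66 absences.
accuracy, precision, recall, f1 = classification_metrics(tp=7, tn=66, fp=0, fn=1)
```

With so few occurrence events in the test set, one extra missed event changes the recall by 0.125, so accuracy alone says little and the per-class precision and recall matter more.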

## STUDY AREA AND DATA COLLECTION

^{2}, respectively. Moreover, most tributaries in the middle reaches of the Yellow River are located in the Loess Plateau area, which has loose soil; the major tributaries are the Weihe and Fenhe rivers. The reaches of Fugu, Longmen to Tongguan, and Lintong to Huaxian of the Weihe River, in the middle reaches of the Yellow River, are the main streams where RBTS has occurred most often and were selected as the study area (Figure 3).

A total of 246 floods recorded at four hydrological stations (Fugu, Longmen, Lintong, and Huaxian) in three reaches of the Yellow River Basin were selected for data analysis (Figure 3). Hydrological data on the discharge and sediment of floods from 1954 to 2017 were collected from measurements gauged at these stations. Based on the specific time at which each RBTS event occurred, flood events without RBTS were selected from the adjacent years before and after it, so that the data more effectively reflect the contrast between the occurrence and absence of RBTS. The main data are presented in Table 1. All measured data were provided by the Yellow River Conservancy Commission in Zhengzhou, China. In addition, because large cross-section data were lacking, the water depth and river width measured most recently before each flood, with or without RBTS, were used to represent the initial boundary conditions and to keep them distinct from the hydrodynamic and sediment factors during the flood. No further adjustments were made to the datasets, as this would have introduced additional scale-dependent issues requiring further correction.

| Study reach | Whole study period | Occurrence time | Discharge (m^{3}/s) | Sediment concentration (kg/m^{3}) | Occurrence of events | Absence of events | Total |
|---|---|---|---|---|---|---|---|
| Fugu | 1986/03/26-1990/12/03 | 1988/06/12-1988/06/25 | 506 | 98.6 | 1 | 19 | 20 |
| Longmen | 1954/08/29-2017/08/02 | 1954/08/29-1954/09/09 | 10,800 | 500 | 11 | 126 | 137 |
| | | 1964/07/03-1964/07/09 | 5,260 | 433 | | | |
| | | 1966/07/17-1966/07/24 | 3,450 | 667 | | | |
| | | 1969/07/22-1966/08/06 | 3,210 | 668 | | | |
| | | 1970/07/31-1970/08/15 | 4,670 | 702 | | | |
| | | 1977/07/05-1977/07/13 | 6,450 | 485 | | | |
| | | 1977/08/01-1977/08/09 | 7,910 | 603 | | | |
| | | 1993/07/04-1993/07/18 | 546 | 336.5 | | | |
| | | 1995/07/14-1995/07/21 | 2,310 | 397.8 | | | |
| | | 2002/07/04-2002/07/07 | 1,220 | 788 | | | |
| | | 2017/07/24-2017/08/02 | 2,990 | 217 | | | |
| Lintong | 1964/05/24-1979/09/28 | 1964/07/16-1964/07/25 | 2,410 | 562 | 6 | 40 | 46 |
| | | 1964/08/12-1964/08/17 | 1,990 | 613 | | | |
| | | 1966/07/26-1966/08/01 | 3,260 | 589 | | | |
| | | 1970/08/01-1970/08/09 | 2,250 | 555 | | | |
| | | 1975/07/25-1975/07/28 | 1,350 | 553 | | | |
| | | 1977/07/05-1977/07/10 | 4,120 | 609 | | | |
| Huaxian | 1964/05/24-1978/09/17 | 1964/07/16-1964/07/25 | 2,640 | 491 | 6 | 37 | 43 |
| | | 1964/08/13-1964/08/17 | 2,550 | 598 | | | |
| | | 1966/07/26-1966/08/01 | 4,660 | 573 | | | |
| | | 1970/08/02-1970/08/09 | 2,150 | 423.8 | | | |
| | | 1975/07/25-1975/07/29 | 962 | 531.3 | | | |
| | | 1977/07/06-1977/07/10 | 3,610 | 656.5 | | | |


## RESULTS AND DISCUSSION

### Analysis of correlation degree of different characteristic influencing factors

The dispersion of each characteristic index was smaller when RBTS occurred than when it was absent. Except for the high flow duration ratio and the shape coefficient *δ*, the averages of the indices in the occurrence category were higher than those in the absence category. This indicates that RBTS is mainly associated with high-sediment floods that have a faster flood growth rate and a higher sediment concentration.

In addition, regarding the degree of correlation, the maximum sediment concentration *S*_{m} and the average sediment concentration *S*_{p} showed higher correlation, whereas the initial boundary factors, the median particle diameter, and the average particle diameter showed lower correlation. The results indicate that, under the existing data conditions, initial boundary factors and particle size are not important influencing factors for the occurrence of the RBTS phenomenon, whereas sediment content is the key characteristic index.

*F*-score and *η*^{2} of all the characteristic influencing factors. Although the values and the importance ranking of each index differed between the two methods, overall the *F*-score and *η*^{2} of the maximum sediment concentration *S*_{m} and the average sediment concentration *S*_{p} among the sediment factors, and of the flood growth rate *ν* and shape coefficient *δ* among the hydrodynamic factors, ranked in the top four of all the characteristic indices. Meanwhile, the *F*-score and *η*^{2} values of these indices were above 0.469 and 0.065, respectively (which, according to empirical results, corresponds to above-medium correlation), indicating that these four indices had a greater impact on the RBTS phenomenon and could readily distinguish whether RBTS would occur. In particular, the two indices with the highest *F*-score and *η*^{2} values were the maximum sediment concentration *S*_{m} and the average sediment concentration *S*_{p}, which further indicated that sediment concentration was the key factor influencing the RBTS phenomenon and the most sensitive to its occurrence. The calculation results also reflect the relationship between the turbulence intensity of the flow and the gravity of the clay block in the riverbed, which is the physical mechanism behind the RBTS phenomenon. Initiating scour of the clay block requires sufficient hydrodynamic conditions, including the flow, its pulsation intensity, and the near-bottom flow structure. It also follows the law of erosion on the rise and deposition on the fall; that is, during a flood, erosion of the riverbed typically occurs during the rising stage and siltation during the falling stage (Kothyari & Jain 2008; Li *et al.* 2017). This is consistent with the calculation results for the flood growth rate *ν* and shape coefficient *δ*. In addition, increased sediment inflow from the upper reaches raises the viscous force and sediment transport capacity of the flow and strengthens the collision of sediment particles with the silted bed surface, which further increases the erosion of cohesive sediment on a consolidated, dense riverbed (Thompson & Amos 2004; Lichtman 2017). Therefore, a high sediment concentration is also an important condition for the occurrence of RBTS.

### Strong correlation index extraction

According to the analysis results in Section 4.1, under the existing data conditions, the initial boundary factors and particle size indices had a low correlation with whether RBTS occurred, and their *F*-score values were almost an order of magnitude lower than those of the other indicators, which showed that they were not key factors. Therefore, because the observed samples of RBTS occurrence were insufficient, the initial boundary factors and particle size indices were excluded to improve the generalisation of the subsequent prediction model, and all data of the 246 floods, comprising 24 occurrence and 222 absence events of the RBTS phenomenon, were used for analysis. The new *F*-score and *η*^{2} results corresponding to the different indices are shown in Figure 5(b).

After the amount of data was increased, the *F*-score and *η*^{2} values changed again. However, the maximum sediment concentration *S*_{m}, average sediment concentration *S*_{p}, flood growth rate *ν*, and shape coefficient *δ* still ranked in the top four. Although the *F*-score of the flood growth rate *ν* decreased, the *η*^{2} of each index remained essentially above 0.06 (still above medium correlation), indicating that these key indices were minimally affected by the change in the amount of data and retained a stable, strong correlation as the amount of data increased. In particular, the *F*-score and *η*^{2} of the maximum sediment concentration *S*_{m} and average sediment concentration *S*_{p} increased with the amount of data, further indicating that sediment concentration had the most critical influence on the occurrence of the RBTS phenomenon.

### Data processing and model training

According to the mechanism of the RBTS phenomenon, initiating motion of the clay blocks requires sufficient hydrodynamic and sediment conditions; in fact, the occurrence of RBTS is a competition between the force of the water flow and the impact resistance of the clay block. Therefore, combining the analysis results of Sections 4.1 and 4.2, it is appropriate and necessary to consider both hydrodynamic and sediment factors. Considering the calculated *F*-score and *η*^{2} values together, [*F*-score] = 0.25 was taken as the threshold so as to retain valid water and sediment information while reducing the number of input variables as far as possible. Four key indices were extracted as the input variables: the maximum sediment concentration *S*_{m}, average sediment concentration *S*_{p}, flood growth rate *ν*, and shape coefficient *δ*. The occurrence and absence of RBTS were encoded as the binary values '1' and '0', respectively, as the output variable. All observed data of the 246 floods were used to build the prediction models of RBTS. It should be noted that although the sample number was only 246, this is a classification problem with only two outcomes. Therefore, the accuracy of the classification results can still be guaranteed when the samples corresponding to the two outcomes are clearly distinct. Moreover, indices with strong correlation were specifically extracted in this paper, which further improved the differentiation between the occurrence and absence of RBTS, so the rationality and stability of the results can be considered acceptable.

Because there were few occurrence samples, a 2:1 split was adopted for them to ensure the generalisation of the model. Overall, all data were divided into training and test sets in a proportion of approximately 7:3. A total of 172 samples were used for training and learning, and the remaining 74 were used for prediction. The data composition is presented in Table 2.

| Algorithm type | Input variable | Output variable | Training set | Test set |
|---|---|---|---|---|
| SVM, RF, KNN, XGBoost | Flood growth rate *ν*, shape coefficient *δ*, maximum sediment concentration *S*_{m}, average sediment concentration *S*_{p} | Occurrence, Absence | Occurrence: 16; Absence: 156; Total: 172 | Occurrence: 8; Absence: 66; Total: 74 |
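A class-wise random split with the training counts listed in Table 2 could be sketched as follows. The source does not describe its sampling procedure, so the shuffling, the seed, and the function name are assumptions; only the counts (16 of 24 occurrences and 156 of 222 absences for training) come from the table.

```python
import random


def split_by_class(labels, train_counts, seed=0):
    """Randomly pick `train_counts[cls]` indices per class for training.

    Returns (train_indices, test_indices); every sample lands in exactly one set.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls, n_train in train_counts.items():
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return sorted(train), sorted(test)


# 24 occurrence (1) and 222 absence (0) events, as in the dataset description.
labels = [1] * 24 + [0] * 222
train_idx, test_idx = split_by_class(labels, {1: 16, 0: 156}, seed=42)
```

Splitting each class separately keeps the rare occurrence events represented in both sets, which a plain random split of all 246 floods would not guarantee.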


### Prediction results evaluation of different algorithm models

The evaluation results of prediction performance are shown in Table 3. Among the final model evaluation indices, the SVM model had the highest value of each index: its accuracy rate *A*, precision rate *P*, recall rate *R*, and *f*1 score for the occurrence events were 0.986, 1.000, 0.875, and 0.933, respectively. Among the other three models, the accuracy rate *A*, precision rate *P*, and *f*1 score of the RF model (0.959, 0.778, and 0.824, respectively) were lower than those of the KNN and XGBoost models. Although the precision rate *P* of the KNN model was higher than that of the XGBoost model, its recall rate *R* and *f*1 score were only 0.750 and 0.857, respectively.

| Algorithm type | Predicted results | Accuracy rate *A* | Precision rate *P* | Recall rate *R* | *f*1 score |
|---|---|---|---|---|---|
| SVM | Occurrence | 0.986 | 1.000 | 0.875 | 0.933 |
| SVM | Absence | 0.986 | 0.985 | 1.000 | 0.992 |
| KNN | Occurrence | 0.973 | 1.000 | 0.750 | 0.857 |
| KNN | Absence | 0.973 | 0.971 | 1.000 | 0.985 |
| RF | Occurrence | 0.959 | 0.778 | 0.875 | 0.824 |
| RF | Absence | 0.959 | 0.985 | 0.970 | 0.977 |
| XGBoost | Occurrence | 0.973 | 0.875 | 0.875 | 0.875 |
| XGBoost | Absence | 0.973 | 0.985 | 0.985 | 0.985 |
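
To make the evaluation indices concrete, the sketch below computes accuracy, precision, recall, and *f*1 from confusion-matrix counts. The counts are not reported in the paper; they are inferred here as the only integer counts on the 74-sample test set (8 occurrence, 66 absence) that reproduce the rounded values in Table 3, and should be read as an assumption. The function names are hypothetical.

```python
def class_metrics(hits, false_alarms, misses):
    """Precision, recall, and f1 for one class treated as the positive class."""
    precision = hits / (hits + false_alarms)
    recall = hits / (hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def evaluate(tp, fn, fp, tn):
    """Return accuracy and (P, R, f1) for the occurrence and absence classes."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    occurrence = class_metrics(tp, fp, fn)
    # For the absence class the roles swap: true negatives are the hits,
    # missed occurrences are false alarms, and false occurrences are misses.
    absence = class_metrics(tn, fn, fp)
    return accuracy, occurrence, absence


# Inferred (tp, fn, fp, tn) counts with occurrence as the positive class.
inferred_counts = {
    "SVM": (7, 1, 0, 66),
    "KNN": (6, 2, 0, 66),
    "RF": (7, 1, 2, 64),
    "XGBoost": (7, 1, 1, 65),
}
report = {name: evaluate(*counts) for name, counts in inferred_counts.items()}

for name, (accuracy, occurrence, absence) in report.items():
    print(name, round(accuracy, 3),
          [round(x, 3) for x in occurrence],
          [round(x, 3) for x in absence])
```

Under these inferred counts, every model except KNN misses exactly one of the eight occurrence events (KNN misses two), so the differences in Table 3 come down to a handful of test samples.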

The larger the area under the ROC curve (the *AUC* index), the more accurate the classification and the better the prediction performance of the model.

Among the prediction results of the different algorithm models, the ROC curve obtained by the SVM algorithm model enclosed the largest area with the X-axis. The *AUC* values of the SVM, KNN, RF, and XGBoost models were 0.996, 0.922, 0.980, and 0.975, respectively. Therefore, the model obtained using the SVM algorithm showed the best performance in predicting the RBTS phenomenon.
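
As a minimal illustration of the *AUC* index, the sketch below uses the standard equivalence between the area under the ROC curve and the fraction of (occurrence, absence) pairs in which the occurrence sample receives the higher score, with ties counted as one half. The labels and scores are hypothetical and are not model outputs from this study; the function name is likewise an assumption.

```python
def auc_from_scores(labels, scores):
    """AUC as the share of (occurrence, absence) pairs that are ranked correctly."""
    occurrence_scores = [s for y, s in zip(labels, scores) if y == 1]
    absence_scores = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in occurrence_scores:
        for n in absence_scores:
            if p > n:
                wins += 1.0   # occurrence ranked above absence
            elif p == n:
                wins += 0.5   # a tie counts as half a correct ranking
    return wins / (len(occurrence_scores) * len(absence_scores))


# Hypothetical labels (1 = occurrence, 0 = absence) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = auc_from_scores(labels, scores)
print(auc)  # 0.75: three of the four pairs are ranked correctly
```

An *AUC* of 1.0 would mean that every occurrence flood is scored above every absence flood, which is why values close to 1 indicate a model that separates the two classes well.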

## CONCLUSION

RBTS is a special erosion phenomenon of high-sediment-concentration floods in the Yellow River that has a considerable impact on river channel engineering. Owing to the small amount of observed data and the relatively complex influencing factors, it is difficult to accurately predict whether it will occur. Therefore, the observed data of 246 floods in three typical reaches of the Yellow River basin (Fugu, Longmen, and Weihe) were used in this study to build prediction models. The primary research results and conclusions of this study are summarised as follows:

- (1) The total data of 179 floods with all the index factors were analysed, and the *F-*score and of the maximum sediment concentration *S_m*, average sediment concentration *S_p*, flood growth rate *v*, and shape coefficient *δ* ranked in the top four among all characteristic indices. The *F-*score and values of these indices were above 0.469 and 0.065, respectively, indicating that the four indices are the key factors to more easily distinguish whether RBTS will occur.
- (2) After increasing the amount of data, although the new results of the *F-*score and changed, the top four indices remained the same, and each index was above the medium correlation, indicating that these key index factors could still ensure a certain stability of strong correlation with the change in the amount of data.
- (3) Four key indices were extracted as the input variables, and the occurrence and absence of RBTS were the output variables. Prediction models were constructed based on the SVM, KNN, RF, and XGBoost classification algorithms. After comparison and analysis, the accuracy rate *A*, precision rate *P*, recall rate *R*, *f*1 score, and *AUC* of the SVM algorithm model under occurrence events were 0.986, 1.000, 0.875, 0.933, and 0.996, respectively, which were the highest among the four models and showed that the SVM algorithm had higher accuracy and precision in predicting the RBTS phenomenon.

## ACKNOWLEDGEMENTS

This research was supported by the National Natural Science Foundation of China (42041004, U2243601, U2243241, U2243215), the Major Science and Technology Project of Ministry of Water Resources (SKR-2022021, SKS-2022088), the Science and Technology Development Fund of the Yellow River Institute of Hydraulic Research (HKF202111), and the Special Fund of Basic Scientific Research Business Expenses of Central Public Welfare Scientific Research Institutes (HKY-JBYW-2024-13).

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.
