ABSTRACT
Soil erosion represents a significant environmental threat, particularly in watersheds with complex geomorphological characteristics. This study evaluates soil erosion vulnerability (SEV) in the Saddang Watershed, Indonesia, using tree-based machine learning (ML) models: decision tree (DT) and random forest (RF). The research incorporates key hydrological, topographical, and environmental factors, such as drainage density, rainfall erosivity, soil bulk density, and vegetation index, which influence soil erosion dynamics. A dataset of 2,000 locations, validated through 20-fold cross-validation, was utilized for model training and evaluation. Results indicate that the RF model demonstrated superior predictive accuracy (AUCROC: 0.953) compared with the DT model (AUCROC: 0.917), attributed to RF's ensemble methodology that enhances robustness against overfitting. Spatial analysis classified SEV into five categories, highlighting moderate to very high vulnerability near riverbanks and steep terrains. Drainage density and river proximity emerged as the most influential factors, underscoring the necessity for targeted conservation measures in these areas. The integration of ML models with GIS and remote sensing provides a robust framework for real-time SEV assessment, aiding sustainable land-use planning. This approach offers valuable insights for policymakers and practitioners, enabling evidence-based interventions to mitigate soil erosion and enhance environmental resilience in watersheds worldwide.
HIGHLIGHTS
Innovative modeling techniques.
Comprehensive factor analysis.
Climate change integration.
Practical conservation framework.
Holistic analytical approach.
INTRODUCTION
Globally, soil erosion poses critical risks to food security, environmental sustainability, and socioeconomic growth due to factors like unsustainable land use, deforestation, climate change, and population growth (Chakrabortty et al. 2020; Sahour et al. 2021; Folharini et al. 2023; Nguyen et al. 2023; Olii et al. 2023). This process degrades arable land, reduces soil fertility, and impairs water quality through sedimentation, which disrupts aquatic ecosystems and water supply (Borrelli et al. 2017; François et al. 2024). To address soil erosion, vulnerability assessments that consider soil type, slope, land use, vegetation, rainfall, and human activities are essential (Shit et al. 2015; Pandey et al. 2021; Mosaid et al. 2022; Nguyen et al. 2023). Identifying soil erosion vulnerability (SEV) areas enables targeted soil erosion control, supports resource allocation, and provides critical insights into underlying soil erosion drivers, which aid in developing tailored soil conservation strategies (Olii & Ichsan 2020; Avand et al. 2021; Olii et al. 2024b).
Traditional SEV assessment methods, such as RUSLE, plot-scale studies, and MCDM, offer useful insights but face limitations, including high uncertainty and limited accuracy across diverse conditions (Alewell et al. 2019; Mosavi et al. 2020; Sahour et al. 2021; Olii et al. 2024b). Machine learning (ML) has improved soil erosion prediction accuracy by efficiently handling large datasets and capturing complex, non-linear interactions (Dinh et al. 2021; Nguyen et al. 2023). ML models can incorporate data from diverse sources, such as remote sensing (RS) imagery, soil properties, and climate variables, providing scalability across spatial and temporal scales (Chakraborty et al. 2024). This capability enables precise SEV modeling to inform sustainable land management and improve decision-making (Band et al. 2020; Ghorbanzadeh et al. 2020; Mosavi et al. 2020; Saha et al. 2020; Avand et al. 2021; Dinh et al. 2021; Sahour et al. 2021; Baiddah et al. 2023; Folharini et al. 2023; Nguyen et al. 2023; Singh et al. 2023; Wang et al. 2023). Integrating RS, GIS, and ML techniques enables robust SEV modeling, as RS provides essential spatial data, while GIS organizes and visualizes these datasets (Mosavi et al. 2020; Baiddah et al. 2023; Nguyen et al. 2023). This study uses tree-based ML algorithms, known for their interpretability, scalability, and resilience, as well as their ability to handle heterogeneous data and provide feature importance analysis (Chen & Guestrin 2016). Despite these benefits, the trade-offs between performance and interpretability must be examined carefully (Friedman 2001; Liaw & Wiener 2002).
Most SEV studies using ML rely on continuous data, which may decrease interpretability and lead to inaccuracies. To address this limitation, we propose a novel method that combines traditional classification techniques with ML-based SEV factor weighting, enhancing interpretability for decision-makers and improving model accuracy, especially in unique environments (Ghorbanzadeh et al. 2020; Mosavi et al. 2020; Saha et al. 2020; Dinh et al. 2021; Sahour et al. 2021; Baiddah et al. 2023; Chakrabortty & Pal 2023; Nguyen et al. 2023; Wang et al. 2023). By mitigating issues such as data complexity, overfitting, and multicollinearity, this approach makes model results more accessible to decision-makers and adaptable to varied geographic contexts. The flexible classification system is intended to improve prediction stability, especially in regions with unique environmental factors. We compare two tree-based ML algorithms, decision tree (DT) and random forest (RF), and integrate them with RS and GIS data to identify the most effective SEV prediction model for South Sulawesi's Saddang Watershed. This method not only improves forecast precision but also offers deeper insights into soil erosion dynamics, advancing SEV prediction and land management practices. Findings from this study are expected to support evidence-based decision-making, enabling stakeholders to create sustainable land-use policies tailored to the region's characteristics. This research also highlights broader applications of tree-based ML, GIS, and RS methodologies in environmental management, showing potential for adaptation across various landscapes. These advancements contribute to soil erosion studies and offer scalable solutions for other regions with similar environmental challenges.
MATERIALS AND METHODS
Study area
Overview of methodological framework
Data sources
Data types | Sources | Scale |
---|---|---|
SRTM 1 ARC-Second Global: s03_e119_1arc_v3; s03_e120_1arc_v3; s04_e119_1arc_v3; and s04_e120_1arc_v3 | https://earthexplorer.usgs.gov/ | 30 × 30 m |
Landsat 9 OLI/TIRS C2 L1: LC09_L1TP_115062_20230929_20230929_02_T1 | https://earthexplorer.usgs.gov/ | 30 × 30 m |
Soil texture map (clay, silt, and sand content) | https://soilgrids.org/ | 250 × 250 m |
Soil organic carbon map | https://soilgrids.org/ | 250 × 250 m |
Soil bulk density map | https://soilgrids.org/ | 250 × 250 m |
Rainfall data | https://power.larc.nasa.gov/data-access-viewer/ | 0.25° × 0.25° |
Google Earth Image | Data SIO, NOAA, US Navy, NGA, GEBCO | 30 × 30 image |
Boundary administration | https://gadm.org/ | Shapefile |
Soil erosion inventory mapping
The soil erosion inventory map is essential for preparing the SEV model and serves as the dependent variable in this study. Susceptibility mapping of the Saddang Watershed requires the locations of both eroded and non-eroded areas. Therefore, the locations (i.e., x- and y-coordinates) of 2,000 sites (1,000 soil erosion locations and 1,000 non-soil erosion locations) were sampled through field surveys and image interpretation using SAS Planet and Google Earth to model SEV on a binary scale (occurrence/non-occurrence). The inventoried soil erosion types include sheet, rill, and gully erosion, as well as mass movements.
Stratified 20-fold cross-validation is a robust method for evaluating model performance. The dataset is split into 20 equally sized subsets (folds), each of which maintains the same class distribution as the original dataset. The model is trained on 19 folds and validated on the remaining one, and the process is repeated 20 times with a different validation fold each time. Averaging the results across all folds provides a reliable performance estimate and helps detect overfitting. Stratified sampling ensures a balanced representation of classes in every fold, which is particularly valuable for imbalanced datasets and improves the fairness of evaluation metrics.
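The fold construction described above can be sketched in a few lines. This is a minimal, standard-library illustration rather than the code used in the study; the function name `stratified_kfold`, the random seed, and the round-robin dealing strategy are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=20, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds preserve the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, test_idx


# 1,000 eroded (1) and 1,000 non-eroded (0) locations, as in the inventory.
labels = [1] * 1000 + [0] * 1000
splits = list(stratified_kfold(labels, k=20))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))  # 20 1900 100
print(sum(labels[i] for i in test_idx))            # 50 eroded sites per fold
```

With a balanced inventory of 2,000 locations, each validation fold holds 100 locations (50 per class) and each training set holds the remaining 1,900.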
Selection of the SEV factors
The factors were chosen using the following criteria: data availability, previous experiences and reports in the literature, data connectivity and heterogeneity, and local geo-environmental characteristics. According to these criteria, 11 important soil erosion factors were collected and compiled for this study: five hydrological factors (rainfall erosivity, topographical wetness index (TWI), distance to the river, stream power index (SPI), and drainage density), two topographic factors (slope length factor and topographic roughness index (TRI)), and four environmental factors (bulk density, clay ratio, soil organic carbon, and normalized difference vegetation index (NDVI)).
Hydrological factors
The distance to the river and drainage density are crucial factors in determining SEV. Areas close to rivers frequently face higher soil erosion rates because of the erosive power of concentrated water flow and the greater likelihood of flooding. Proximity to rivers also facilitates sediment transport, resulting in higher soil erosion downstream. Furthermore, rivers can modify local microclimates, influencing vegetation growth and soil stability. Drainage density, the total length of streams and channels per unit area of a landscape, is also an important consideration. High drainage density is associated with enhanced runoff potential, greater sediment transport, and more frequent erosive events.
Topographic factors
Environmental factors
Spatial distribution of SEV of each method in Saddang Watershed: DT (Left) and RF (Right).
Multicollinearity analysis
Collinearity occurs when an independent variable is a linear function of another independent variable (Mosaid et al. 2024). High collinearity in a regression equation indicates a strong correlation between independent variables, which can compromise the model's accuracy and the reliability of its coefficients (Dormann et al. 2013). To evaluate collinearity among independent variables, this study employed two common indicators: tolerance (TOL) and variance inflation factor (VIF) (Miles 2014). These measures provide insights into the extent of collinearity present in the data. While no universally accepted thresholds exist, the literature suggests widely used criteria for interpreting these indices: a VIF value of ≤ 5 or 10 and a TOL value of ≥ 0.1 or 0.2 generally indicate acceptable levels of collinearity (Arabameri et al. 2019). These thresholds imply that the independent variables are not excessively correlated and can function reliably in the regression model. This assessment is essential for ensuring the robustness and accuracy of the model's results.
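Both indices follow directly from the coefficient of determination obtained when one factor is regressed on all the others: TOL = 1 − R² and VIF = 1 / TOL. The short sketch below applies these standard definitions; the function names and the default thresholds (TOL ≥ 0.1, VIF ≤ 10) are choices made for illustration.

```python
def collinearity_indices(r_squared):
    """Return (TOL, VIF) for a factor whose auxiliary regression has the given R^2."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    tol = 1.0 - r_squared
    vif = 1.0 / tol
    return tol, vif


def is_acceptable(tol, vif, tol_min=0.1, vif_max=10.0):
    """Check a factor against the (assumed) lenient thresholds TOL >= 0.1 and VIF <= 10."""
    return tol >= tol_min and vif <= vif_max


# R^2 = 0.317 is the value reported for rainfall erosivity in this study.
tol, vif = collinearity_indices(0.317)
print(round(tol, 3), round(vif, 3))  # 0.683 1.464
print(is_acceptable(tol, vif))       # True
```

Small differences from published VIF values can arise when R² is rounded before the calculation.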
Soil erosion modeling using ML models
Tree-based ML algorithms, such as DT and RF, are highly effective for modeling soil erosion. These algorithms use DTs to understand and predict SEV patterns. Their ability to handle complex, non-linear relationships in data makes them well-suited for accurately predicting SEV, aiding in developing targeted soil conservation strategies.
Decision tree
The DT algorithm generates a tree-like structure, with each internal node representing a decision based on a feature attribute and each leaf node representing a class label or regression value (Kotsiantis 2013). The technique starts with the complete dataset and splits it recursively at each node based on feature values to maximize information gain or decrease impurity (Ghosh & Maiti 2021). This procedure continues until a stopping criterion, such as a maximum tree depth or a minimum number of samples per leaf, is met. DTs provide interpretable models and can handle categorical and continuous data. Because the best split at each node is selected greedily using criteria such as Gini impurity or information gain, DTs offer a flexible approach for classification and regression tasks (Band et al. 2020).
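A single split decision of this kind can be illustrated as follows. The sketch computes the Gini impurity and searches one continuous feature for the threshold that minimizes the weighted impurity of the two child nodes. The toy drainage density values and labels are hypothetical and serve only to show the mechanics.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split: impurity of the parent node
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: drainage density (km/km2) and erosion occurrence (1/0).
density = [0.10, 0.15, 0.30, 0.35, 0.50, 0.70]
eroded = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(density, eroded)
print(threshold, impurity)
```

A full tree repeats this search over every feature at every node until a stopping criterion is reached.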
Random forest
The RF algorithm is a non-parametric, supervised ensemble ML model (Breiman 2001). It creates an ensemble of DTs using bootstrap sampling and random feature selection. Numerous DTs are first grown from randomly selected subsets of the training data. At each tree node, a random subset of the full feature set is considered for splitting. This randomness reduces the likelihood of overfitting and encourages diversity among the trees. Each tree generates its prediction independently, and the final prediction is obtained by aggregating the outputs of all trees, usually by majority voting for classification problems or averaging for regression tasks. This ensemble approach produces a robust and highly accurate model capable of identifying complicated relationships within data while resisting noise and outliers (Herrera et al. 2019). RF is a powerful tool for both classification and regression problems and has been widely used for SEV mapping (Gayen et al. 2019; Ghorbanzadeh et al. 2020; Mosavi et al. 2020; Avand et al. 2021; Jiang et al. 2021; Folharini et al. 2023; Wang et al. 2023).
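The three ingredients named above (bootstrap sampling, random feature selection, and aggregation) can be sketched independently of any particular tree implementation. The helper names below are illustrative, and the choice of three candidate features per split (roughly the square root of 11 factors) is a common default assumed here, not a setting reported in this study.

```python
import random
from collections import Counter


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate features to consider at one split."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Aggregate per-tree class labels for classification."""
    return Counter(predictions).most_common(1)[0][0]


def mean_prediction(predictions):
    """Aggregate per-tree outputs for regression (or as a class probability)."""
    return sum(predictions) / len(predictions)


rng = random.Random(42)
rows = bootstrap_sample(2000, rng)            # training rows for one tree
features = random_feature_subset(11, 3, rng)  # candidate factors for one split

votes = [1, 0, 1, 1, 0]        # hypothetical outputs of five trees for one location
print(majority_vote(votes))    # 1
print(mean_prediction(votes))  # 0.6
```

Averaging the votes, as in the last line, also yields a continuous score per location, which is the kind of output a ROC analysis requires.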
Evaluating models’ performance
The quality of the produced maps was assessed using receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). The ROC curve plots sensitivity (y-axis) against 1 − specificity (x-axis). The AUC summarizes how well a model separates erosion from non-erosion locations. For a model that performs at least as well as random guessing, the AUC ranges from 0.5 to 1.0 and is classified into five grades: bad (<0.6), fair (0.6–0.7), good (0.7–0.8), very good (0.8–0.9), and excellent (>0.9) (Wei et al. 2022; Naceur et al. 2024).
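The AUC can be read as the probability that a randomly chosen erosion location receives a higher score than a randomly chosen non-erosion location. The sketch below computes it through that pairwise definition and maps the result to the five grades listed above. The example scores are hypothetical, and the assignment of boundary values (for example, exactly 0.7) to the lower grade is an assumption, since the grading scheme does not specify it.

```python
def auc_roc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_grade(auc):
    """Map an AUC value to a grade; boundary handling is an assumption."""
    if auc < 0.6:
        return "bad"
    if auc <= 0.7:
        return "fair"
    if auc <= 0.8:
        return "good"
    if auc <= 0.9:
        return "very good"
    return "excellent"


# Hypothetical predicted scores for three eroded and three non-eroded sites.
scores = [0.90, 0.80, 0.35, 0.60, 0.20, 0.10]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_roc(scores, labels)
print(round(auc, 3), auc_grade(auc))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for an inventory of 2,000 locations; rank-based formulas give the same value more efficiently.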
RESULTS AND DISCUSSION
Multicollinearity analysis for SEV factors
Based on Table 2, all factors listed can be used in the SEV model, as their collinearity metrics fall within acceptable thresholds reported in the literature. The TOL values for all variables are above the critical threshold of 0.1, and their VIF values are below 10, indicating that none of the variables exhibit excessive multicollinearity (Miles 2014). While some variables, such as bulk density (VIF = 5.307, TOL = 0.188), show relatively higher VIF and lower TOL values, these remain within the permissible range. This suggests that their inclusion in the model is justifiable and unlikely to compromise the stability or predictive accuracy. Other factors, including rainfall erosivity, NDVI, and distance to the river, demonstrate low VIF values (<2) and high TOL values (>0.5), indicating strong independence and minimal multicollinearity, making them reliable contributors to the model. Variables with moderate collinearity, such as TWI, TRI, and clay ratio, have VIF values below 5 and can also be included without significant risk of instability.
Multicollinearity test among SEV factors
SEV factors | R² | TOL | VIF |
---|---|---|---|
Rainfall erosivity | 0.317 | 0.683 | 1.464 |
Topographical wetness index | 0.764 | 0.236 | 4.241 |
Stream power index | 0.676 | 0.324 | 3.089 |
Distance to river | 0.454 | 0.546 | 1.831 |
Drainage density | 0.663 | 0.337 | 2.965 |
Slope length factor | 0.516 | 0.484 | 2.065 |
Topographic roughness index | 0.764 | 0.236 | 4.241 |
Bulk density | 0.812 | 0.188 | 5.307 |
Clay ratio | 0.751 | 0.249 | 4.013 |
Carbon organic | 0.701 | 0.299 | 3.345 |
Normalized difference vegetation index | 0.307 | 0.693 | 1.442 |
Weights for SEV factors
The study evaluates SEV using various factors, each weighted according to its influence, and divides each factor into classes with corresponding grid counts, area coverage, and scores (Table 3). The highest weight is attributed to drainage density, with values of 0.384 in RF and 0.385 in DT. This highlights the critical role of drainage networks in intensifying soil erosion. A higher drainage density indicates a more extensive network of streams and channels, which facilitates concentrated water flow and thereby increases soil erosion potential in the affected areas (Bhattacharya et al. 2019). The largest share of the area (41.0%) falls within the 0.0–0.2 km/km² class, suggesting a lower vulnerability due to sparser drainage networks. The distance to rivers is the second most influential factor, with weights of 0.257 (RF) and 0.251 (DT). Areas closer to rivers are more vulnerable to the erosive force of flowing water, especially during high-intensity rainfall events, which underscores the importance of spatial planning and soil erosion control measures near riverbanks (Gayen et al. 2019). The analysis shows that 56.6% of the area lies more than 1,600 m from rivers, suggesting a lower SEV for those regions, whereas the 13.1% of the area within 400 m of rivers is more vulnerable.
Weights, classes, and scores of SEV factors in Saddang Watershed
Variables | SEV factors | Weight (RF) | Weight (DT) | Classes of SEV factors | Grid total | Area (km²) | Area (%) | Score |
---|---|---|---|---|---|---|---|---|
Hydrological data | Rainfall erosivity (MJ mm ha⁻¹ h⁻¹ year⁻¹) | 0.094 | 0.100 | <1,750 | – | – | – | 1 |
Hydrological data | | | | 1,750–2,000 | 1,589,039 | 1,430 | 29.1 | 2 |
Hydrological data | | | | 2,000–2,250 | 2,167,987 | 1,951 | 39.7 | 3 |
Hydrological data | | | | 2,250–2,500 | 1,697,289 | 1,528 | 31.1 | 4 |
Hydrological data | | | | >2,500 | – | – | – | 5 |
Hydrological data | Topographical wetness index | 0.003 | 0.003 | <5 | 2,177,564 | 1,960 | 39.9 | 1 |
Hydrological data | | | | 5–10 | 2,990,637 | 2,692 | 54.8 | 2 |
Hydrological data | | | | 10–15 | 252,194 | 227 | 4.6 | 3 |
Hydrological data | | | | 15–20 | 31,256 | 28 | 0.6 | 4 |
Hydrological data | | | | >20 | 2,664 | 2 | 0.0 | 5 |
Hydrological data | Stream power index | 0.002 | 0.003 | <0 | 21,442 | 19 | 0.4 | 1 |
Hydrological data | | | | 0–5 | 4,459,813 | 4,014 | 81.8 | 2 |
Hydrological data | | | | 5–10 | 922,106 | 830 | 16.9 | 3 |
Hydrological data | | | | 10–15 | 45,804 | 41 | 0.8 | 4 |
Hydrological data | | | | >15 | 5,150 | 5 | 0.1 | 5 |
Hydrological data | Distance to river (m) | 0.257 | 0.251 | >1,600 | 3,085,972 | 2,777 | 56.6 | 1 |
Hydrological data | | | | 1,200–1,600 | 504,789 | 454 | 9.3 | 2 |
Hydrological data | | | | 800–1,200 | 546,529 | 492 | 10.0 | 3 |
Hydrological data | | | | 400–800 | 600,307 | 540 | 11.0 | 4 |
Hydrological data | | | | <400 | 716,718 | 645 | 13.1 | 5 |
Hydrological data | Drainage density (km/km²) | 0.384 | 0.385 | 0.0–0.2 | 2,234,961 | 2,011 | 41.0 | 1 |
Hydrological data | | | | 0.2–0.4 | 2,016,758 | 1,815 | 37.0 | 2 |
Hydrological data | | | | 0.4–0.6 | 1,027,461 | 925 | 18.8 | 3 |
Hydrological data | | | | 0.6–0.8 | 166,738 | 150 | 3.1 | 4 |
Hydrological data | | | | 0.8–1.0 | 8,397 | 8 | 0.2 | 5 |
Topographic data | Slope length factor | 0.005 | 0.003 | <0.4 | 1,054,041 | 949 | 19.3 | 1 |
Topographic data | | | | 0.4–1.4 | 105,427 | 95 | 1.9 | 2 |
Topographic data | | | | 1.4–3.1 | 233,839 | 210 | 4.3 | 3 |
Topographic data | | | | 3.1–6.8 | 682,546 | 614 | 12.5 | 4 |
Topographic data | | | | >6.8 | 3,378,462 | 3,041 | 61.9 | 5 |
Topographic data | Topographic roughness index | 0.032 | 0.031 | 0.0–0.2 | 1,054,041 | 949 | 19.3 | 1 |
Topographic data | | | | 0.2–0.4 | 105,427 | 95 | 1.9 | 2 |
Topographic data | | | | 0.4–0.6 | 233,839 | 210 | 4.3 | 3 |
Topographic data | | | | 0.6–0.8 | 682,546 | 614 | 12.5 | 4 |
Topographic data | | | | 0.8–1.0 | 3,378,462 | 3,041 | 61.9 | 5 |
Environmental data | Bulk density (cg/cm³) | 0.159 | 0.157 | <50 | – | – | – | 1 |
Environmental data | | | | 50–75 | 20,451 | 18 | 0.4 | 2 |
Environmental data | | | | 75–100 | 3,219,692 | 2,898 | 59.0 | 3 |
Environmental data | | | | 100–125 | 2,214,172 | 1,993 | 40.6 | 4 |
Environmental data | | | | >125 | – | – | – | 5 |
Environmental data | Clay ratio | 0.168 | 0.166 | 0.0–0.2 | – | – | – | 1 |
Environmental data | | | | 0.2–0.4 | 143,651 | 129 | 2.6 | 2 |
Environmental data | | | | 0.5–0.6 | 3,152,259 | 2,837 | 57.8 | 3 |
Environmental data | | | | 0.7–0.8 | 2,127,587 | 1,915 | 39.0 | 4 |
Environmental data | | | | 0.8–1.0 | 30,818 | 28 | 0.6 | 5 |
Environmental data | Carbon organic (dg/kg) | 0.230 | 0.225 | >125 | 68,582 | 62 | 1.3 | 1 |
Environmental data | | | | 100–125 | 2,072,569 | 1,865 | 38.0 | 2 |
Environmental data | | | | 75–100 | 2,416,955 | 2,175 | 44.3 | 3 |
Environmental data | | | | 50–75 | 875,170 | 788 | 16.0 | 4 |
Environmental data | | | | <50 | 21,039 | 19 | 0.4 | 5 |
Environmental data | Normalized difference vegetation index | 0.134 | 0.133 | >0.7 | – | – | – | 1 |
Environmental data | | | | 0.5–0.7 | 348,182 | 313 | 6.4 | 2 |
Environmental data | | | | 0.3–0.5 | 4,320,562 | 3,889 | 79.2 | 3 |
Environmental data | | | | 0.2–0.3 | 338,890 | 305 | 6.2 | 4 |
Environmental data | | | | <0.2 | 446,681 | 402 | 8.2 | 5 |
Bulk density, with weights of 0.159 (RF) and 0.157 (DT), also plays a substantial role in SEV. This parameter reflects soil compaction, where higher bulk density can reduce water infiltration and increase surface runoff (Zhao et al. 2018). Although compact soils are more resistant to detachment, they exacerbate runoff, leading to higher SEV. Similarly, the clay ratio, weighted at 0.168 (RF) and 0.166 (DT), influences soil texture and cohesiveness (Soinne et al. 2023). Soils with higher clay content are generally more resistant to soil erosion; however, once detached, clay particles are easily transported by water due to their fine size. Another critical factor is organic carbon content, with weights of 0.230 (RF) and 0.225 (DT). This factor reflects the role of vegetation and organic matter in stabilizing soil. Areas with higher organic carbon content typically exhibit lower soil erosion potential due to the binding effect of organic material on soil particles and the protective role of vegetation cover.
Rainfall erosivity, with weights of 0.094 (RF) and 0.100 (DT), indicates the impact of rainfall intensity and duration on soil erosion. High rainfall erosivity values signify greater potential for soil detachment and transport (Sujatha & Sridhar 2018; Jothimani et al. 2022). The largest share of the area (39.7%) falls within the 2,000–2,250 MJ mm ha⁻¹ h⁻¹ year⁻¹ class, indicating significant soil erosion potential due to rainfall. NDVI, a vegetation-related factor, has weights of 0.134 (RF) and 0.133 (DT). NDVI reflects vegetation health and density, where higher values signify better vegetation cover that can mitigate soil erosion by stabilizing soil with root systems (Mokarram & Zarei 2021). Other factors, such as the TWI, SPI, TRI, and slope length factor, have lower weights but still contribute to SEV. These factors represent terrain characteristics that influence water accumulation, runoff energy, and surface irregularity (Gómez-Gutiérrez et al. 2015; Arabameri et al. 2019; Getnet & Mulu 2021). Though less significant than other factors, they provide valuable insights into localized soil erosion processes.
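One way to combine the factor weights and class scores into a single index per grid cell is a weighted overlay. The sketch below is an illustration under stated assumptions and not the study's documented procedure: the RF weights are taken from Table 3, but the combination rule (a weighted mean of the 1–5 scores), the rescaling to the 0–1 range, the treatment of class boundaries, and the example cell are all assumed for the example.

```python
# RF weights as listed in Table 3. They do not sum to 1, so the sketch
# normalizes by their total (an assumption).
RF_WEIGHTS = {
    "rainfall_erosivity": 0.094,
    "twi": 0.003,
    "spi": 0.002,
    "distance_to_river": 0.257,
    "drainage_density": 0.384,
    "slope_length": 0.005,
    "tri": 0.032,
    "bulk_density": 0.159,
    "clay_ratio": 0.168,
    "organic_carbon": 0.230,
    "ndvi": 0.134,
}


def sev_index(scores, weights):
    """Weighted mean of class scores (1-5), rescaled linearly to the 0-1 range."""
    total_weight = sum(weights[f] for f in scores)
    weighted_mean = sum(weights[f] * scores[f] for f in scores) / total_weight
    return (weighted_mean - 1.0) / 4.0


def sev_class(index):
    """Assign one of five SEV classes using 0.2-wide intervals (upper bound inclusive)."""
    for upper, name in [(0.2, "very low"), (0.4, "low"), (0.6, "moderate"), (0.8, "high")]:
        if index <= upper:
            return name
    return "very high"


# Hypothetical cell: close to a river, dense drainage, mid-range otherwise.
cell = {factor: 3 for factor in RF_WEIGHTS}
cell["drainage_density"] = 4
cell["distance_to_river"] = 5

index = sev_index(cell, RF_WEIGHTS)
print(round(index, 3), sev_class(index))
```

Because drainage density and distance to the river carry the largest weights, raising only those two scores is enough to move this hypothetical cell from the moderate class into the high class.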
Cross-validation
By employing stratified 20-fold cross-validation, the study provides a thorough and fair assessment of the RF and DT models' performance in analyzing SEV. This method strengthens the reliability of the factor weights, guards against overfitted performance estimates, and indicates that the findings, such as the critical role of drainage density and distance to rivers, are robust across data subsets. This reinforces the practical utility of the models for identifying and mitigating soil erosion in vulnerable regions like the Saddang Watershed.
Spatial distribution of SEV
This section assesses the spatial distribution of SEV in the Saddang Watershed using the DT and RF models. The analysis categorizes SEV into five classes: very low, low, moderate, high, and very high (Table 4). The results, expressed as areas and area percentages, show how the two models differ in classifying SEV (Figure 6). For the very low class, the DT model identifies 1,194 km² (24.43%), nearly identical to the RF model's classification of 1,201 km² (24.58%). In the low class, the DT model estimates 1,729 km² (35.38%), slightly higher than the RF model's classification of 1,720 km² (35.20%), indicating minor differences in identifying low-vulnerability areas. Both models show remarkable agreement for the moderate class, with the DT classifying 1,256 km² (25.70%) and the RF 1,257 km² (25.73%). In the high class, the DT model identifies 640 km² (13.09%), slightly lower than the RF model's classification of 642 km² (13.14%). For the very high class, the DT model estimates 68 km² (1.39%), while the RF model assigns a slightly lower value of 66 km² (1.36%). Overall, the DT and RF models provide consistent classifications: class areas differ by at most 9 km², and the minor differences reflect the sensitivity of each model in recognizing the various levels of SEV. Notably, the areas classified as moderate, high, and very high are predominantly located near riverbanks and river branches, which underscores the influence of fluvial processes on SEV and is in line with Nwilo et al. (2021) and Ratiat et al. (2023). Although the two models agree closely, the small variations between them highlight the value of using multiple models to understand soil erosion dynamics comprehensively and to inform effective soil conservation strategies.
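The areas and percentages reported for each class follow from the grid totals once a cell size is fixed. The sketch below reproduces the DT figures from Table 4 under the assumption that each grid cell measures 30 m × 30 m, which matches the resolution of the elevation and Landsat data but is not stated explicitly for the output grid.

```python
# Assumed cell size: 30 m x 30 m = 900 m2 = 0.0009 km2 per grid cell.
CELL_AREA_KM2 = 30 * 30 / 1_000_000

# Grid totals per SEV class for the DT model, taken from Table 4.
dt_grid_totals = {
    "very low": 1_326_473,
    "low": 1_921_084,
    "moderate": 1_395_457,
    "high": 710_876,
    "very high": 75_425,
}


def area_summary(grid_totals, cell_area_km2=CELL_AREA_KM2):
    """Return {class: (area_km2, area_percent)} computed from grid cell counts."""
    total = sum(grid_totals.values())
    return {
        name: (count * cell_area_km2, 100.0 * count / total)
        for name, count in grid_totals.items()
    }


summary = area_summary(dt_grid_totals)
for name, (area, percent) in summary.items():
    print(f"{name}: {area:.0f} km2 ({percent:.2f}%)")

total_area = sum(area for area, _ in summary.values())
print(f"total: {total_area:.0f} km2")
```

Under this assumption the 5,429,315 grid cells correspond to about 4,886 km², in agreement with the total reported in Table 4.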
Distribution of spatial classes of SEV in Saddang Watershed
SEV classes | Total score classes | DT grid total | DT area (km²) | DT area (%) | RF grid total | RF area (km²) | RF area (%) |
---|---|---|---|---|---|---|---|
Very low | 0.0–0.2 | 1,326,473 | 1,194 | 24.43 | 1,334,524 | 1,201 | 24.58 |
Low | 0.2–0.4 | 1,921,084 | 1,729 | 35.38 | 1,910,978 | 1,720 | 35.20 |
Moderate | 0.4–0.6 | 1,395,457 | 1,256 | 25.70 | 1,396,923 | 1,257 | 25.73 |
High | 0.6–0.8 | 710,876 | 640 | 13.09 | 713,142 | 642 | 13.14 |
Very high | 0.8–1.0 | 75,425 | 68 | 1.39 | 73,748 | 66 | 1.36 |
Total | | 5,429,315 | 4,886 | 100 | 5,429,315 | 4,886 | 100 |
Validation model
Discussions
Performance evaluation of DT and RF algorithms
The performance of the DT and RF algorithms in predicting SEV can be assessed in terms of accuracy, robustness, and generalization capability. The models are evaluated using AUCROC values, which indicate how well they distinguish between erosion and non-erosion locations. In this study, the higher AUCROC of the RF model (0.953 versus 0.917 for the DT model) indicates greater discriminatory power. This enhanced performance is attributed to RF's ensemble approach, which combines predictions from multiple decision trees. By aggregating the results, the RF model improves accuracy and reduces the risk of overfitting, making it more adept at capturing complex patterns and providing reliable predictions for new, unseen data. The DT model, despite its lower AUCROC, offers significant advantages in simplicity and interpretability. The visual clarity of decision trees helps stakeholders understand and communicate the factors influencing SEV, which is particularly valuable in applications where understanding the decision-making process is crucial. In summary, the RF model provides more accurate predictions and handles complex environmental data robustly, while the DT model offers clear, understandable insights for straightforward applications. Both models are valuable for predicting SEV and informing soil conservation policies, and combining their strengths can offer a comprehensive method for evaluating and managing SEV.
Significance of input variables and their interactions
Understanding the significance and interactions of input variables is crucial for accurately predicting SEV using ML models, as each input variable provides unique insights into soil erosion dynamics. Drainage density is a key factor, where higher values indicate extensive stream networks that intensify concentrated water flow, increasing soil erosion potential (Sajedi-Hosseini et al. 2018; Arabameri et al. 2020). Conversely, areas with sparse drainage networks exhibit lower SEV. Distance to rivers further underscores the vulnerability of areas near riverbanks, where flowing water exerts erosive forces, particularly during high-intensity rainfall events (Gayen et al. 2019). Bulk density and the clay ratio significantly affect soil texture and water behavior. High bulk density reduces infiltration, increasing surface runoff and erosion risks, while clay-rich soils, though resistant to detachment, are easily transported once eroded (Gayen et al. 2019). Organic carbon content plays a stabilizing role, reducing erosion through the binding effect of vegetation and organic matter. Meanwhile, rainfall erosivity, which reflects the intensity and duration of rainfall, drives soil detachment (Jothimani et al. 2022; Olii et al. 2023, 2024a). These factors also interact. For instance, areas with high drainage density near rivers and steep slopes (TRI) experience severe soil erosion during intense rainfall (rainfall erosivity). In such regions, soils with high bulk density exacerbate runoff, while lower organic carbon content and sparse vegetation (NDVI) fail to provide stabilization. This interplay highlights the need for integrated conservation strategies.
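One common way to quantify the relative significance of input variables in a fitted RF model is permutation importance. The sketch below is illustrative only and is not claimed to be the method used in this study. The data are synthetic, the feature names are hypothetical labels borrowed from the factors discussed above, and the response is deliberately constructed so that the first feature dominates. The function name `rank_factors` is likewise hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature labels; the values generated below are synthetic.
FEATURES = ["drainage_density", "distance_to_river", "bulk_density", "ndvi"]


def rank_factors(seed=0, n_samples=1000):
    """Rank synthetic conditioning factors by permutation importance (AUCROC drop)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, len(FEATURES)))
    # Assumed synthetic response: the first feature dominates by construction,
    # the fourth has no effect at all.
    score = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.3 * X[:, 2]
    y = (score + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X_tr, y_tr)

    # Shuffle one column at a time on held-out data and record the mean drop
    # in AUCROC; a larger drop means the model relies more on that feature.
    result = permutation_importance(
        rf, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=seed
    )
    order = np.argsort(result.importances_mean)[::-1]
    return [(FEATURES[i], float(result.importances_mean[i])) for i in order]


if __name__ == "__main__":
    for name, drop in rank_factors():
        print(f"{name}: mean AUCROC drop = {drop:.3f}")
```

Because permutation importance is computed on held-out data, it reflects how much each variable contributes to predictive skill. It does not separate interacting variables, so the interactions described above would still need to be examined case by case.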
Implications for soil conservation and management practices
Understanding the significance and interactions of input variables in predicting SEV offers crucial insights for developing effective soil conservation and management practices in the Saddang Watershed. Key variables such as drainage density, distance to rivers, rainfall erosivity, and soil properties help prioritize interventions. High drainage density areas, characterized by concentrated water flow, require measures such as check dams or vegetative barriers to disrupt soil erosion processes and stabilize soil (Zema et al. 2022). Regions near rivers, where the distance to rivers is a critical factor, benefit from riparian buffer zones and reinforced riverbanks to mitigate soil erosion during intense rainfall (Singh et al. 2021; Graziano et al. 2022). Areas with high rainfall erosivity and sparse vegetation should adopt conservation practices such as afforestation, cover cropping, or reduced tillage to protect soil surfaces from raindrop impact and enhance ground cover (Farmaha et al. 2022). Tailored strategies are necessary for managing soil properties. High bulk-density areas should focus on soil aeration and organic amendments to improve infiltration and reduce runoff (Basset et al. 2023). For clay-rich soils, SEV can be reduced by implementing sediment traps or constructing retaining walls to manage particle transport. Integrating these approaches with terrain-specific interventions, such as terracing in rugged areas, ensures effective erosion control and sustainable land management.
Integrating ML models like DT and RF into GIS and RS enables real-time SEV assessments and supports proactive decision-making. This approach facilitates the monitoring of SEV and the evaluation of conservation measures' effectiveness over time. Furthermore, the insights gained from SEV models can drive educational initiatives and inform policy development, promoting sustainable land management practices and raising awareness about the importance of soil conservation. By applying these insights, stakeholders can enhance their soil management strategies, protect soil health, and foster sustainable land-use practices, ultimately reducing soil erosion vulnerabilities and improving environmental outcomes.
CONCLUSIONS
This study demonstrates the effectiveness of DT and RF models in determining SEV in the Saddang Watershed. Drainage density is the most critical factor influencing SEV, intensifying soil erosion through concentrated water flow. Distance to rivers, bulk density, clay ratio, organic carbon, and rainfall erosivity also significantly affect SEV. Vegetation cover and terrain factors contribute less but provide insights into localized soil erosion, aiding targeted soil conservation efforts. Both the DT and RF models are dependable, but the RF model has superior sensitivity and accuracy, as indicated by its AUCROC score of 0.953 versus the DT's 0.917. The RF model's ensemble technique, which integrates numerous decision trees, improves its capacity to capture complex patterns while reducing overfitting, making it more suitable for environmental modeling. However, the DT model's simplicity and interpretability make it useful for stakeholders who require a thorough understanding of the decision-making process. Overall, both models provide useful insights into SEV, with the RF model outperforming the DT model in accuracy and the DT model offering greater interpretability. These findings highlight the value of employing multiple models to gain a thorough understanding of soil erosion dynamics, which will aid in developing successful soil conservation methods in the Saddang Watershed.
ACKNOWLEDGEMENTS
We thank the Faculty of Engineering at Universitas Gorontalo for their exceptional support and resources throughout this research. Their commitment to academic excellence and research development has been instrumental in the success of this study. We also extend our thanks to all colleagues, research assistants, and collaborators who contributed their expertise and time, enhancing the quality of this work. Their dedication and collaborative efforts have been crucial in achieving the objectives of this research.
AUTHOR CONTRIBUTIONS
M.R.O. conceived and designed the study, performed data analysis, and drafted the manuscript. A.K.Z.O. contributed to data collection and interpretation and revised the manuscript critically for important intellectual content. A.O. assisted with data analysis and provided technical support and expertise in ML algorithms. R.A.D. contributed to research methodology and experimental design, and reviewed and edited the manuscript. M.A.M. conducted fieldwork and data acquisition and contributed to the discussion and interpretation of results. B.A.K. guided statistical analysis and assisted with data visualization. B.B. contributed to the literature review, supported the research design, and participated in manuscript revision. R.S.N.O. assisted with manuscript preparation and contributed to the discussion of findings. R.P. coordinated the research project, contributed to manuscript drafting, and managed the overall project. All authors reviewed and approved the final version of the manuscript.
DATA AVAILABILITY STATEMENT
The data used for analysis in this study are available from the corresponding author upon reasonable request.
CONFLICT OF INTEREST
The authors declare there is no conflict.