In this study, several regression models were used to predict total sediment yield (in tons), and their accuracy and reliability were evaluated. The dataset contains numerous predictors that were standardized and processed through principal component analysis (PCA) to improve model performance. The models evaluated include linear regression, normalized linear regression, linear regression with PCA, Pearson correlation coefficient (PCC) with generalized ridge regression, kernel ridge regression, multivariate regression, lasso regression, and neural network approaches such as the artificial neural network (ANN) and the Cellular Automata-Artificial Neural Network (CA-ANN). The ANN model achieved the lowest mean squared error (MSE), 113.641, which suggests superior predictive capability compared with the other models. Although the environmental data and the relationships within them were complex, the ANN model showed the least error, followed closely by CA-ANN with an MSE of 124.83. Traditional models such as linear and lasso regression produced larger errors and negative R2 values, indicating poor fits to the data. This analysis highlights the effectiveness of advanced machine learning in environmental modeling and the importance of selecting models suited to the data and the phenomenon studied, which helps environmental planners predict and manage soil erosion and sediment transport.

  • Comprehensive comparison of traditional and advanced models, including TabNet, CATBOOST, and AutoGLON, for predicting total upland sediment yield (TUSY).

  • Innovative use of TabNet for accurate sediment yield prediction, demonstrating the model's potential in hydroinformatics.

  • Significant improvement in prediction accuracy with advanced techniques, reducing mean squared error (MSE).

Soil erosion and sediment transport are integral parts of the environmental system and play a major role in agricultural productivity, water quality, and ecosystem health (Pathak et al. 2020a; Wynants et al. 2021). Watershed management and soil conservation measures depend on projecting and quantifying the sediment yield from upland areas (Leta et al. 2023). The growth of large datasets, advances in statistical methods and computational power, and the rapid development of artificial intelligence (AI) and machine learning (ML) have made it possible to capture the complexity that influences sediment production (Pathak et al. 2020b; Ma & Mei 2021). Regression methods have been applied to identify the environmental features that cause soil erosion, and the models' performance has been evaluated to determine how well this complexity is captured (De Vente et al. 2013). Sediment yield in a watershed affects river morphology, bank erosion, silt deposition in reservoirs, and the degradation of aquatic habitats (Al-Mamari et al. 2023). Further, pollutants from fertilizers and pesticides are carried by these sediments into aquatic environments, causing environmental degradation (Remund et al. 2021). Accurate prediction is hampered by the complexity of land-use land cover changes (LULCC), vegetation type, and the variability of watershed topography (Suwardi et al. 2024). AI and ML have an advantage in handling large, complex datasets and can readily assimilate factors such as LULCC, slope, rainfall intensity, and soil properties (Mohsenzadeh Karimi et al. 2020; Gani et al. 2023).
Sediment yield has traditionally been studied using the universal soil loss equation (USLE) and software such as the Soil and Water Assessment Tool (SWAT). These approaches are constrained by their data requirements and fail to capture the complexity involved in sediment estimation, challenges that advanced ML and AI methods have addressed (Jimeno-Sáez et al. 2018). Research on erosion caused by climate change is becoming more frequent, using climate projections to predict changes in erosion patterns and sediment flux caused by altered precipitation regimes and extreme weather events (Li & Fang 2016). Land-use changes, particularly urbanization, agriculture, and deforestation, have also been widely studied, and their correlation with erosion rates and sediment transport is evident (Obiahu & Elias 2020). Technological innovations in remote sensing and Geographic Information Systems (GIS) have advanced soil erosion monitoring, significantly improving model validation as well as spatial risk analysis (Shivappa Masalvad et al. 2023). Studies of socioeconomic impacts and of the effectiveness of regulatory frameworks have played a pivotal role in understanding the wider implications of sediment yield, leading to sustainable land management policies and practices (Haregeweyn et al. 2023). Precise estimation of sediment yield is crucial for efficient irrigation and land management and affects soil erosion control, water quality management, and sustainable agriculture. Traditional models such as USLE and SWAT cannot handle large datasets or account for nonlinear relationships. In this study, we address these shortcomings by comparing traditional regression techniques with advanced ML models. The aim is to find the most efficient methods for predicting sediment yield.

The Manjeera Watershed lies within India's Deccan Plateau region (Figure 1). It extends from 17°30′ to 18°25′ North latitude and from 77°40′ to 78°20′ East longitude and serves several districts of Telangana state as an indispensable water source for agriculture, industry, and domestic needs. The Manjeera River, an important tributary of the Godavari River, is the main river of the watershed. Its course across varied topography and both rural and urban settings makes the Manjeera a suitable case study for hydrological and environmental evaluation. Elevation varies considerably within the watershed, as illustrated by the detailed topographical maps prepared for this study. These variations influence hydrological processes within its boundaries and therefore the availability and quality of water, and the sub-watershed delineations provide a framework for analyzing flow rate, sediment transport rate, and potential sources of pollution at a finer scale.
Figure 1

Study area.

This research collects an extensive set of environmental variables to estimate sediment yields at various geographical locations (Pandey et al. 2022). Variables considered include chemical and physical characteristics of soils, climate data, topographic measurements, and land-use characteristics (Basha et al. 2024). Prior to analysis, the predictor variables were standardized to a mean of zero and a standard deviation of one; this ensured uniformity and comparability across scales.
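The standardization step described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `standardize` and the sample sub-watershed areas are hypothetical and are not taken from the study's dataset.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Hypothetical sub-watershed areas in km2 (illustrative values only)
area_km2 = [12.0, 15.5, 9.8, 30.2, 18.4]
z_scores = standardize(area_km2)
```

After this transformation every predictor is expressed in units of its own standard deviation, so variables measured in km2, metres, or degrees can be compared on the same scale.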

Predicting the total upland sediment yield requires an organized process that produces precise and reliable results (Kumar et al. 2023). The first step is data collection: pertinent sediment yield information, together with land-use and environmental variables, is gathered from various sources. The data are then preprocessed to clean them, format them properly, and address missing values or other issues. Feature engineering follows, in which significant features are extracted and transformed to better reflect the underlying patterns in the data (Figure 2).
Figure 2

Methodology adopted.


Exploratory data analysis (EDA) gives a better understanding of the relationships and distributions within a dataset and helps detect significant trends or possible outliers that require further investigation (Shivashankar et al. 2022). EDA also informs the selection of regression and ML models, which are then trained on a portion of the data and rigorously tested to evaluate their efficacy.

Model evaluation compares the models using performance metrics such as mean squared error (MSE) and R2 to assess their predictive ability. The best-performing model is then selected and tested further so that it can give actionable insight into land management or water conservation strategies (Singh et al. 2022). This methodical approach supports well-informed choices in environmental management.
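The two metrics can be written out directly. The sketch below uses the textbook definitions of MSE and R2; the observed and predicted values are made up for illustration and are not results from this study. It also shows how R2 becomes negative when predictions are worse than simply predicting the mean of the observations.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sediment yields (tons); illustrative values only
observed = [10.0, 20.0, 30.0, 40.0]
close_fit = [12.0, 18.0, 31.0, 39.0]   # small residuals
poor_fit = [40.0, 10.0, 45.0, 5.0]     # worse than predicting the mean

print(mse(observed, close_fit), r2_score(observed, close_fit))
print(mse(observed, poor_fit), r2_score(observed, poor_fit))
```

Because R2 compares the residual sum of squares with that of a constant prediction at the mean, any model whose residuals exceed that baseline yields a negative value.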

Principal component analysis (PCA) was used to manage multicollinearity and improve model interpretability by reducing the dimensionality of the data while preserving enough variance for accurate predictions. The variables were standardized to a mean of zero and a standard deviation of one to ensure consistency and comparability. The dataset also underwent quality checks to confirm its accuracy and integrity for regression analysis. Carefully collected data support the application of sophisticated regression models and provide a sound basis for assessing how accurately they anticipate sediment yield, an essential aspect of environmental planning.

This research examined the performance of various regression models in predicting total sediment yield from a dataset rich in environmental variables, using MSE and R2 to assess accuracy and predictive capability. MSE is the average squared difference between estimated and actual values; lower MSE values indicate better model performance. R2 measures the proportion of variance in the dependent variable that is explained by the independent variables. A negative R2 means that a model does not capture the variance of the dependent variable and performs worse than a horizontal line at the mean of the observations.
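As an illustration of the dimensionality reduction idea, the sketch below finds the first principal component of a small two-variable dataset by power iteration on the covariance matrix. It is a simplified stand-in written with the standard library only; the study does not state which PCA implementation was used, and the sample rows are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector,
    which holds for the positively correlated example below.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue


# Two strongly correlated hypothetical predictors (illustrative values only)
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
cov = covariance_matrix(rows)
component, variance = first_component(cov)
explained_ratio = variance / (cov[0][0] + cov[1][1])  # share of total variance
```

When two predictors are nearly collinear, as here, a single component carries almost all of the variance, which is why PCA helps with multicollinearity.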

Linear regression models

Linear regression models are essential analytical tools in research on sediment yield prediction. This group includes basic and normalized linear regression as well as linear regression with PCA preprocessing, and it helps describe the relationship between environmental factors and sediment yield. The general form is:
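$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon \tag{1}$$
where $Y$ represents the predicted sediment yield, $\beta_0$ is the intercept, $\beta_1, \beta_2, \dots, \beta_p$ are the coefficients of the predictor variables $X_1, X_2, \dots, X_p$, and $\varepsilon$ is the error term, capturing the discrepancy between predicted and actual values.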

The unmodified linear regression model produced a very large MSE of 530,302.6 and an R2 of −203.14. These results indicate a very poor fit and poor predictive ability for this model.

Ridge and kernel ridge regression

Ridge and kernel ridge regression are extensions of linear regression that incorporate regularization to address multicollinearity as well as avoid overfitting in environmental models such as sediment yield prediction. Both models modify linear regression using penalties added to their coefficients that can be controlled using a complexity parameter.

Ridge regression improves on linear regression by penalizing coefficient size; its cost function is given by the following equation:
$$J(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \tag{2}$$

Here, $y_i$ represents the observed sediment yields, $x_{ij}$ are the predictor variables, $\beta_j$ are the regression coefficients, and $\lambda$ is the regularization parameter that controls the degree of shrinkage of the coefficients toward zero. By selecting a suitable $\lambda$, ridge regression can reduce model complexity and prevent overfitting. Kernel ridge regression extends ridge regression by employing kernel methods to learn in higher-dimensional spaces, which makes it particularly suitable for nonlinear relationships.

The model is built on a kernel function that computes the inner product of two input vectors after they are mapped into a high-dimensional feature space:
$$K(x, x') = \langle \phi(x), \phi(x') \rangle \tag{3}$$
where $\phi$ denotes the feature mapping.

Generalized ridge regression and kernel ridge regression showed improved performance, with MSE values of 3,009.155 and 3,025.75, respectively. However, their R2 values remained negative, suggesting potential issues with model suitability or data characteristics.
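A minimal sketch of both estimators is given below, assuming the standard closed-form solutions: ridge coefficients from the penalized normal equations, and kernel ridge dual weights from the regularized kernel matrix. The radial basis function kernel, the `gamma` value, the helper names, and the toy data are all assumptions made for illustration; the study does not specify its kernel or implementation.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ridge_fit(X, y, lam):
    """Coefficients (X^T X + lam I)^-1 X^T y, without an intercept."""
    p = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * t for r, t in zip(X, y)) for a in range(p)]
    return solve(xtx, xty)


def rbf(u, v, gamma=1.0):
    """Radial basis function kernel (an assumed choice of kernel)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def kernel_ridge_fit(X, y, lam, gamma=1.0):
    """Dual weights alpha = (K + lam I)^-1 y."""
    n = len(X)
    k_reg = [[rbf(X[i], X[j], gamma) + (lam if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    return solve(k_reg, y)


def kernel_ridge_predict(X_train, alpha, x_new, gamma=1.0):
    """Prediction as a kernel-weighted sum over the training points."""
    return sum(a * rbf(xi, x_new, gamma) for a, xi in zip(alpha, X_train))


# Toy data with y = 2x (illustrative values only)
X = [[1.0], [2.0], [3.0]]
y = [2.0, 4.0, 6.0]
beta_unpenalized = ridge_fit(X, y, 0.0)    # ordinary least squares
beta_shrunk = ridge_fit(X, y, 14.0)        # penalty pulls the slope toward zero
alpha = kernel_ridge_fit(X, y, 1e-6)
prediction = kernel_ridge_predict(X, alpha, [2.0])
```

With the penalty set to zero the ridge solution reduces to ordinary least squares, and increasing it shrinks the coefficient, which is the behaviour the regularization parameter is meant to control.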

Multivariate and lasso regressions

In this study, we evaluated multivariate regression and lasso regression to model sediment yield from environmental predictors. Multivariate regression, also referred to as multiple linear regression (MLR), describes how multiple independent variables contribute to one dependent variable and is frequently used to predict it; its general form is given by the following equation:
$$Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon \tag{4}$$
MLR is prone to overfitting when the number of predictors is large; therefore, we employed lasso regression, which adds a regularization term to the regression model to limit overfitting and improve prediction accuracy. The lasso regression model is defined by the following equation:
$$J(\beta) = \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \tag{5}$$
where $y_i$ are the observed sediment yields, $x_{ij}$ are the predictor values, $\beta_j$ are the regression coefficients, and $\lambda$ is the regularization parameter that controls the magnitude of the L1 penalty applied to the coefficients. By tuning $\lambda$, lasso helps in feature selection by shrinking less important feature coefficients to zero, thus focusing the model on the most relevant predictors.

Multivariate regression mirrored the poor fit of ridge regression, yielding an MSE of 30,091.155 and an R2 of −10.583 despite the use of multiple predictors. Lasso regression, which shrinks the coefficients of less important features, gave a lower MSE of 9,313.58 and an R2 of −2.585, a modest reduction in error that still falls short of an adequate fit. These results show that although lasso regression improved on the multivariate approach by focusing on relevant features, it still struggled to capture the complexity of this dataset, which includes nonlinear relationships and interactions among the predictors and between the predictors and sediment yield. More sophisticated nonlinear models may therefore represent these environmental dynamics more effectively.
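The feature selection behaviour of the L1 penalty can be seen in a small coordinate descent sketch. It minimizes the sum of squared residuals plus `lam` times the sum of absolute coefficients, without an intercept; the function names and the toy data are hypothetical, and this is not the solver used in the study.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0.0 inside [-t, t]."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_fit(X, y, lam, sweeps=200):
    """Cyclic coordinate descent for squared error plus an L1 penalty."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes column j
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                                      for k in range(p) if k != j))
                for i in range(n)
            )
            z = sum(X[i][j] ** 2 for i in range(n))
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# First column drives y (y = 2 * x1); second column is uninformative
X = [[1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, -0.5]]
y = [2.0, 4.0, 6.0, 8.0]
beta_moderate = lasso_fit(X, y, 4.0)     # keeps x1, drops x2
beta_heavy = lasso_fit(X, y, 200.0)      # penalty large enough to zero both
```

With a moderate penalty the uninformative coefficient is set exactly to zero while the informative one is only slightly shrunk, which is the selection effect described above.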

Advanced neural network models

Composite artificial neural networks (CANNs) (Figure 3) are neural network structures that combine several neural networks into a single model in order to reach higher performance and greater adaptability. The composite structure draws on the individual strengths of each network to handle more complex tasks with higher accuracy and greater robustness against noise or data disturbance (Ahmed et al. 2023). CANNs are built from multiple types of neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and fully connected components. Each component is tailored to its function: CNNs excel at feature extraction from images, for instance, while RNNs offer sequential processing capability (Huang et al. 2023). In a CANN, CNNs can be used at the outset to extract features from raw data, and these features are then passed to subsequent layers for further processing. RNNs or long short-term memory (LSTM) networks can manage sequential dependencies within data, which makes the approach particularly suitable for time series forecasting and natural language processing (Gers et al. 2002).
Figure 3

ANN flow chart.

The integration layer combines the outputs of the component networks in order to exploit each network's strengths. It may use techniques such as concatenation, weighted averaging, or more sophisticated fusion methods (Alzubaidi et al. 2021). Because a CANN uses multiple networks, it can capture more patterns and relationships within data, leading to more precise results. The modular design allows a CANN to be adapted to specific requirements by selecting suitable component networks. CANNs can also better tolerate noise and resist overfitting because they draw on the abilities of all their constituent networks (Alzubaidi et al. 2021).
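The two simplest fusion techniques named above, concatenation and weighted averaging, can be sketched as follows. The function names, the feature vectors, and the weights are hypothetical and only illustrate the mechanics; they do not describe the integration layer used in this study.

```python
def concatenate(*feature_vectors):
    """Join the feature vectors of several component networks end to end."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def weighted_average(predictions, weights):
    """Combine scalar predictions, weighting each component network."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total


# Hypothetical outputs of two component networks (illustrative values only)
features_a = [0.2, 0.7]
features_b = [1.5, 0.1, 0.9]
fused_features = concatenate(features_a, features_b)

yield_a, yield_b = 100.0, 120.0          # predicted sediment yields in tons
fused_yield = weighted_average([yield_a, yield_b], [0.25, 0.75])
```

Concatenation leaves the weighting of each component to later layers, whereas weighted averaging fixes it in advance; in practice the weights would be learned or tuned on validation data.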

CANNs have proven useful across various fields and applications, such as education. CNNs and RNNs can be combined to analyze visual data more thoroughly and efficiently. In natural language processing, RNNs and LSTMs support tasks such as sentiment analysis, language translation, and text generation (Signoroni et al. 2019). By using CNNs to extract features and RNNs to handle temporal dependencies, time series forecasting can predict weather patterns, stock prices, and other time-dependent phenomena more reliably (Zhao et al. 2024).

CANNs provide an effective method of ML that permits the construction of highly sophisticated models that leverage multiple kinds of networks for improved efficiency, flexibility, and durability – making CANNs useful tools in many different sectors and fields of application (Taye 2023).

Artificial neural networks (ANNs) are computational models inspired by the human brain's network of neurons. ANNs consist of layers of interconnected nodes (neurons) that process data hierarchically; each node computes a weighted sum of its inputs and passes it through an activation function, which enables complex pattern recognition in data (Montesinos López et al. 2022). They have applications across industries, including image and speech recognition, natural language processing, and predictive modeling, because their ability to improve through adaptive training makes them effective for nonlinear, complex problems in many domains (Sarker 2021).
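The node computation described above, a weighted sum followed by an activation function, can be shown with a small forward pass. The layer sizes, weights, and the choice of ReLU are assumptions for illustration only; the architecture of the ANN used in this study is not specified here.

```python
def relu(z):
    """Rectified linear activation."""
    return max(0.0, z)


def identity(z):
    """Linear output, as commonly used for regression targets."""
    return z


def dense_layer(inputs, weights, biases, activation):
    """Each node applies the activation to its weighted input sum plus bias."""
    return [
        activation(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Two inputs -> two hidden nodes -> one output (hypothetical weights)
inputs = [1.0, 2.0]
hidden = dense_layer(inputs, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0], relu)
output = dense_layer(hidden, [[1.0, 0.5]], [0.0], identity)
```

Training adjusts the weights and biases so that the output approaches the observed values; the nonlinearity in the hidden layer is what lets the network represent relationships a linear model cannot.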

The composite artificial neural network (CA-ANN) and artificial neural network (ANN) models yielded promising results (Figure 4). The ANN showed the lowest MSE, 113.641, at an R2 of −4.142, suggesting that it predicted sediment yield better than the rival models. The disparities among models illustrate both the complexity of environmental data and the difficulty of modeling them under linear assumptions. The negative R2 values of nearly all models indicate that they could not adequately account for the variability of the dependent variable, suggesting that the relationships may be nonlinear and may involve interactions that simpler models do not capture. ANN outperformed CA-ANN in prediction accuracy for this environmental modeling task. Its performance also highlights how advanced ML techniques can offer significant improvements over traditional regression approaches in environmental prediction.

TabNet

TabNet is a deep learning model developed specifically for tabular data, with applications in disciplines such as health, finance, and environmental science. It combines the benefits of neural networks and decision trees with an attention mechanism that automatically selects the features to use, so the model focuses on the relevant predictors while remaining efficient and interpretable. TabNet therefore gives insight into which features contribute most to accurate forecasts, which is particularly helpful with large-scale datasets. In environmental science, TabNet is used to forecast outcomes such as sediment yield and to give land managers and conservation efforts an actionable view of the results.

AutoGLON

AutoGLON is an automated machine learning (AutoML) framework designed to make developing and tuning ML models simpler and quicker, so that high-performing models can be created without extensive ML experience. It automates tasks such as model selection, hyperparameter optimization, and feature engineering, which makes advanced ML methods more accessible and accelerates model development. In environmental science, AutoGLON helps researchers quickly develop models to predict phenomena such as sediment yield, leaving them free to focus on understanding the results and on solutions that enhance land and water conservation. Because it delivers reliable models with little manual intervention, AutoGLON is a valuable resource in research and decision-making.
Figure 4

Flow chart CA-ANN.

Figure 5 shows histograms of the morphometric parameters needed to understand the geomorphological characteristics of the watershed. Each histogram gives the frequency distribution of one attribute and so indicates its variability and significance. The histograms for area (km2), maximum elevation (m), minimum elevation, and minimum base level show how these quantities are distributed across the study region. Most areas cluster in the lower size ranges, suggesting small sub-basins; minimum elevation has a more even distribution, while maximum elevation values cluster more tightly within a certain range, reflecting the topographical relief of the region. The minimum base level histogram shows this clustering more clearly than maximum elevation, suggesting relatively uniform base levels throughout the study area. Maximum slope values vary considerably, indicating large differences in slope steepness that affect erosion rates and runoff characteristics. The maximum relief (m) histogram shows variation in elevation differences; higher relief values occur less often, indicating that only certain regions experience large elevation differentials. The histogram for maximum island soil erodibility (t/ha/year) shows the potential soil loss due to erosion; most regions have moderate values. The channel slope (degree) distribution indicates variation in river channel steepness, which affects flow velocity and sediment transport, while the angle of slope (degree) distribution covers a range of slope angles that helps in understanding landslide susceptibility and water flow dynamics. Channel lengths range widely; most channels are relatively short, which affects drainage density and network complexity.
Basin lengths (km) span a wide range, and the compactness coefficient shows that most basins are not perfectly compact, which alters runoff patterns. The fitness ratio distribution suggests a mix of well-developed and less efficient drainage networks. The form factor distribution gives insight into basin shape: lower values indicate more elongated basins and higher values more circular ones. Drainage texture reflects channel density and frequency, and its variation points to different stages of drainage network development. Drainage density (km/km2) describes the spacing of drainage channels, which affects surface runoff and infiltration rates. The circularity ratio values cluster near the middle of the range, suggesting moderately circular basins, while the elongation ratio values indicate that most basins are elongated rather than circular. The bifurcation ratio measures the branching pattern of the drainage networks, with variations reflecting different branching complexity across regions. The relief ratio indicates overall steepness, and its values range from gentle to steep terrain. Finally, the length of overland flow (km) measures how far water travels before entering a stream; its variation may reflect differences in infiltration and surface runoff processes.
Figure 5

Histogram of morphological parameters.
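Several of the shape and drainage parameters discussed above have widely used textbook definitions, sketched below. The formulas are the conventional ones (Horton's form factor and length of overland flow, Miller's circularity ratio, Schumm's elongation ratio); whether the study computed its parameters in exactly this way is an assumption, and the basin dimensions are hypothetical.

```python
import math


def form_factor(area_km2, basin_length_km):
    """Basin area divided by the square of basin length."""
    return area_km2 / basin_length_km ** 2


def circularity_ratio(area_km2, perimeter_km):
    """Basin area relative to a circle with the same perimeter."""
    return 4.0 * math.pi * area_km2 / perimeter_km ** 2


def elongation_ratio(area_km2, basin_length_km):
    """Diameter of a circle with the basin's area, divided by basin length."""
    return 2.0 * math.sqrt(area_km2 / math.pi) / basin_length_km


def drainage_density(total_stream_length_km, area_km2):
    """Total stream length per unit basin area (km/km2)."""
    return total_stream_length_km / area_km2


def length_of_overland_flow(density_km_per_km2):
    """Half the reciprocal of drainage density (km)."""
    return 1.0 / (2.0 * density_km_per_km2)


# Hypothetical basin (illustrative values only)
area, length, perimeter, streams = 50.0, 10.0, 32.0, 100.0
ff = form_factor(area, length)
rc = circularity_ratio(area, perimeter)
re = elongation_ratio(area, length)
dd = drainage_density(streams, area)
lof = length_of_overland_flow(dd)
```

Under these definitions a perfectly circular basin has a circularity ratio and an elongation ratio of 1, and a denser drainage network directly implies a shorter overland flow path.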

The correlation matrix (Figure 6) displays the relationships among the morphometric parameters of the watershed. Each cell holds a correlation coefficient between −1 and 1; positive correlations are shown in red hues and negative ones in shades of blue. There is an almost perfect correlation (close to 1) among parameters related to basin dimensions, such as basin length (km), perimeter (km), and basin area (km2), indicating that larger basins tend to have longer perimeters and lengths, as expected in natural systems. There are also notable negative correlations, for example between basin width (km) and the compactness coefficient, where wider basins tend to be less compact. Drainage density (km/km2) is strongly negatively correlated with the length of overland flow, so denser drainage networks have shorter overland flow paths. Parameters related to slope and elevation, such as maximum slope and maximum elevation, show moderate-to-strong correlations with erosion-related metrics such as total upland sediment yield (TUSY, in tons) and maximum upland sediment yield (tons per hectare), showing that topography plays a crucial role in soil erosion potential.
Figure 6

Heat map of morphological parameters.
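A correlation matrix of this kind can be computed from Pearson coefficients for every pair of parameters. The sketch below uses the standard library only; the column names and values are hypothetical and merely mimic the kinds of relationships described above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations keyed by column name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical morphometric parameters for four sub-basins
columns = {
    "area_km2": [10.0, 20.0, 30.0, 40.0],
    "perimeter_km": [14.0, 22.0, 31.0, 39.0],
    "drainage_density": [1.0, 1.25, 1.667, 2.5],
    "overland_flow_km": [0.5, 0.4, 0.3, 0.2],
}
matrix = correlation_matrix(columns)
```

Values near 1 or −1 between predictors, as between area and perimeter here, signal the multicollinearity that motivates PCA or regularization before regression.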


The matrix also highlights interdependencies such as the negative correlation between form factor and elongation ratio, which suggests that more elongated basins have lower form factors. This is essential information for hydrologic modeling, watershed management, and understanding geomorphological processes. Overall, correlation matrices are a valuable way of uncovering key relationships among morphometric parameters and support more precise predictive models and more efficient management plans.

TUSY forecasting is essential to managing watersheds, conserving soil, and safeguarding water quality, because inaccurate predictions of TUSY can undermine agricultural and environmental planning. In this study, we compared regression models to determine which gives the most accurate predictions: linear regression, generalized ridge regression, kernel ridge regression, multivariate regression, lasso regression, CA-ANN, ANN, TabNet, CATBOOST, and AutoGLON. Each model was evaluated using MSE and R2. Linear regression, the base model, shows an extremely large MSE (530,302.6) and a strongly negative R2 (−203.14), which indicates poor performance. Linear regression with normalization reduces the MSE substantially, to 30,125.797, but its R2 remains negative (−10.6); adding PCA preprocessing brings no further improvement, leaving the MSE at 30,125.797 and the R2 at −10.6. Generalized ridge regression lowers the MSE to 3,009.155 but does not improve R2 (−10.583). Kernel ridge regression, which uses kernel functions, gives an MSE of 3,025.75 with an R2 of −0.164, showing better potential although still not sufficient for real-world applications.

Integrating multiple variables in multivariate regression does not improve predictions, as shown by its MSE of 30,091.155 and R2 of −10.583. Lasso regression improves moderately on linear regression, with an MSE of 9,313.58, but its R2 of −2.585 shows that the fit is still inadequate. Ensemble and nonlinear models show varied results. CA-ANN shows a significant reduction in MSE (124.183); however, its R2 is still negative (−5.73). ANN reduces the MSE to 113.641 with an R2 of −4.142, which indicates better performance than CA-ANN. TabNet stands out among the more modern algorithms with a low MSE of 155.5 and the first positive R2 value (0.1462), suggesting that it captures some of the variation in TUSY. The CATBOOST gradient boosting algorithm gives an MSE of 13,202.22 with an R2 of −4.082, better than the linear models but not as effective as TabNet. The AutoGLON model gives an MSE of 13,731.5 and an R2 value of 0.655, which represents significant progress from conventional linear and nonlinear models. Overall, traditional linear models, even those that incorporate PCA and normalization, fail to predict TUSY accurately, as shown by their very large MSE values and negative R2 values.

Generalized ridge regression and kernel ridge regression add regularization and kernel functions, respectively, but do not produce acceptable R2 scores despite reducing the MSE considerably. Among the nonlinear methods, CA-ANN and ANN decreased the MSE markedly, while their negative R2 scores indicated inadequate generalization or overfitting; CATBOOST improved on the traditional approaches but fell short in terms of R2. TabNet stands out with its low MSE of 1,515.3 and the first positive R2 (0.1462), which indicates that it captures some of the variance in TUSY and is suitable for further investigation and refinement. AutoGLON also reduces the MSE yet retains a negative R2 (−0.655), leaving room for refinement and optimization.

This study shows both the limitations and the promise of the regression models used to predict TUSY. Traditional linear models do not meet the accuracy and reliability requirements, whereas modern nonlinear methods and ensembles such as TabNet and AutoGLON have shown promising results. Future research should focus on refining these models through hybrid approaches, feature engineering, and feature selection techniques in order to increase prediction accuracy and model generalization. Such efforts will strengthen the application of these models in agricultural planning and environmental conservation, helping to manage soil and watersheds more efficiently and to conserve scarce natural resources. ANN stands out for its capacity to recognize complex nonlinear relationships in data. Traditional linear models had difficulty accommodating the complexity of the dataset and produced higher errors, whereas preprocessing steps such as normalization and PCA standardized the predictors and reduced dimensionality.

Figure 7 and Table 1 show the performance of the regression models used for predicting TUSY, an essential quantity in land management and water conservation, in terms of MSE and R². Linear models, including basic linear regression and its variants with normalization or PCA, have high MSE and strongly negative R² values, indicating poor performance. Regularized models such as generalized ridge regression and kernel ridge regression achieve lower MSE, but their negative R² values indicate limited predictive accuracy. Nonlinear and ensemble models performed better than the traditional linear ones. CA-ANN and ANN achieved the lowest MSE values, but their negative R² values suggest overfitting. TabNet, a deep learning model, achieved the first positive R² value, suggesting that it captured some of the variance in TUSY. AutoGLON and CATBOOST also improved on the basic linear models but did not surpass TabNet in terms of R². TUSY prediction plays a vital role in land management and water conservation. Precise predictions allow targeted interventions such as soil stabilization or planting cover crops in erosion-prone areas, and reliable models assist in developing sustainable land-use plans, preventing soil degradation, and maintaining water quality by decreasing sediment runoff into water bodies. The comparison in Figure 7 thus highlights the significance of reliable models for improving environmental management strategies and practices. These results align with previous studies highlighting the power of ML models for environmental prediction (Jimeno-Sáez et al. 2018; Al-Mamari et al. 2023). Consistent with Leta et al. (2023), our results indicate that more advanced methods can increase prediction accuracy, which underlines their value as hydroinformatics techniques.
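Because the two metrics do not rank the models identically, it can help to sort the Table 1 values by each metric separately. The sketch below does this with the MSE and R² values copied from Table 1; the variable names are illustrative assumptions.

```python
# (model, MSE, R2) as reported in Table 1.
TABLE_1 = [
    ("Linear regression", 530302.6, -203.14),
    ("Linear regression with normalization", 30125.797, -10.6),
    ("Linear regression with PCA", 30125.797, -10.6),
    ("Generalised ridge regression", 3009.155, -10.583),
    ("Kernel ridge regression", 3025.75, -0.164),
    ("Multivariate regression", 30091.155, -10.583),
    ("Lasso regression", 9313.58, -2.585),
    ("CA-ANN", 124.183, -5.73),
    ("ANN", 113.6431, -4.142),
    ("TabNet", 1515.3, 0.1462),
    ("CATBOOST", 13202.22, -4.082),
    ("AutoGLON", 1371.5, -0.655),
]

# Lower MSE is better; higher R2 is better.
by_mse = sorted(TABLE_1, key=lambda row: row[1])
by_r2 = sorted(TABLE_1, key=lambda row: row[2], reverse=True)
positive_r2 = [name for name, _, r2 in TABLE_1 if r2 > 0]

print("Lowest MSE:", [name for name, _, _ in by_mse[:4]])
print("Highest R2:", [name for name, _, _ in by_r2[:4]])
print("Positive R2:", positive_r2)
```

Sorted this way, ANN and CA-ANN lead on MSE, while TabNet is the only model with a positive R².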
Table 1

Performance evaluation of regression techniques

| Sl. No | Regression analysis | MSE | R² |
| --- | --- | --- | --- |
| 1 | Linear regression | 530302.6 | −203.14 |
| 2 | Linear regression with normalization | 30125.797 | −10.6 |
| 3 | Linear regression with PCA | 30125.797 | −10.6 |
| 4 | Generalised ridge regression | 3009.155 | −10.583 |
| 5 | Kernel ridge regression | 3025.75 | −0.164 |
| 6 | Multivariate regression | 30091.155 | −10.583 |
| 7 | Lasso regression | 9313.58 | −2.585 |
| 8 | CA-ANN | 124.183 | −5.73 |
| 9 | ANN | 113.6431 | −4.142 |
| 10 | TabNet | 1515.3 | 0.1462 |
| 11 | CATBOOST | 13202.22 | −4.082 |
| 12 | AutoGLON | 1371.5 | −0.655 |
Figure 7

Regression model performance for TUSY prediction.

This research evaluated several regression models for predicting TUSY, an important measure for water and land conservation. Traditional linear models, including basic linear regression, normalized linear regression, and linear regression with PCA, could not represent the environmental dataset accurately, as shown by their high MSE and negative R² values. Generalized ridge regression and kernel ridge regression achieved lower MSE values; however, their negative R² values limit their practical use. By comparison, ensemble and nonlinear models exhibited better results. CA-ANN and ANN reduced MSE substantially, with ANN achieving the lowest MSE, but their negative R² scores indicated overfitting or inadequate generalization. TabNet stands out as the model with the first positive R² value, suggesting that it accounts for variability more effectively. AutoGLON and CATBOOST performed better than the basic linear models; however, they were not as effective as TabNet in terms of R². These results highlight the limitations of linear models for forecasting TUSY and indicate that nonlinear and ensemble methods may be more appropriate for this task. TUSY predictions are critical for identifying areas at risk of erosion, for targeting interventions such as planting or soil stabilization, and for creating sustainable land-use plans. These strategies can help protect water quality by limiting the amount of sediment entering water bodies. Future research should focus on improving these advanced models, exploring hybrid strategies, and applying feature engineering and selection methods to increase prediction accuracy and generalization.
These efforts will expand the practical use of sediment yield models in agricultural and environmental planning, ultimately aiding watershed management strategies and soil conservation measures. Applications of such models could include targeted interventions in erosion-prone areas, sustainable land-use planning, and improved water quality management. Furthermore, this research provides a basis for further development of predictive sediment yield models, which could allow more effective environmental management strategies to be developed.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Ahmed R., Bibi M. & Syed S. 2023 Improving heart disease prediction accuracy using a hybrid machine learning approach: A comparative study of SVM and KNN algorithms. International Journal of Computations, Information and Manufacturing (IJCIM) 3 (1), 49–54. https://doi.org/10.54489/IJCIM.V3I1.223
Al-Mamari M. M., Kantoush S. A., Al-Harrasi T. M., Al-Maktoumi A., Abdrabo K. I., Saber M. & Sumi T. 2023 Assessment of sediment yield and deposition in a dry reservoir using field observations, RUSLE and remote sensing: Wadi Assarin, Oman. Journal of Hydrology 617, 128982. https://doi.org/10.1016/J.JHYDROL.2022.128982
Alzubaidi L., Zhang J., Humaidi A. J., Al-Dujaili A., Duan Y., Al-Shamma O., Santamaría J., Fadhel M. A., Al-Amidie M. & Farhan L. 2021 Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 8 (1), 1–74. https://doi.org/10.1186/S40537-021-00444-8
Basha U., Pandey M., Nayak D., Shukla S. & Shukla A. K. 2024 Spatial–temporal assessment of annual water yield and impact of land use changes on Upper Ganga Basin, India, using InVEST model. Journal of Hazardous, Toxic, and Radioactive Waste 28 (2), 04024003. https://doi.org/10.1061/JHTRBP.HZENG-1245
De Vente J., Poesen J., Verstraeten G., Govers G., Vanmaercke M., Van Rompaey A., Arabkhedri M. & Boix-Fayos C. 2013 Predicting soil erosion and sediment yield at regional scales: Where do we stand? Earth-Science Reviews 127, 16–29. https://doi.org/10.1016/J.EARSCIREV.2013.08.014
Gani A., Pathak S., Hussain A., Ahmed S., Singh R., Khevariya A., Banerjee A., Ayyamperumal R. & Bahadur A. 2023 Water quality index assessment of river Ganga at Haridwar stretch using multivariate statistical technique. Molecular Biotechnology. https://doi.org/10.1007/S12033-023-00864-2
Gers F. A., Eck D. & Schmidhuber J. 2002 Applying LSTM to Time Series Predictable Through Time-Window Approaches. 193–200. https://doi.org/10.1007/978-1-4471-0219-9_20
Haregeweyn N., Tsunekawa A., Tsubo M., Fenta A. A., Ebabu K., Vanmaercke M., Borrelli P., Panagos P., Berihun M. L., Langendoen E. J., Nigussie Z., Setargie T. A., Maurice B. N., Minichil T., Elias A., Sun J. & Poesen J. 2023 Progress and challenges in sustainable land management initiatives: A global review. Science of the Total Environment 858, 160027. https://doi.org/10.1016/J.SCITOTENV.2022.160027
Huang A., Xu R., Chen Y. & Guo M. 2023 Research on multi-label user classification of social media based on ML-KNN algorithm. Technological Forecasting and Social Change 188, 122271. https://doi.org/10.1016/J.TECHFORE.2022.122271
Jimeno-Sáez P., Senent-Aparicio J., Pérez-Sánchez J. & Pulido-Velazquez D. 2018 A comparison of SWAT and ANN models for daily runoff simulation in different climatic zones of Peninsular Spain. Water 10 (2), 192. https://doi.org/10.3390/W10020192
Kumar B., Patra S. & Pandey M. 2023 Experimental investigation on flow configuration in flexible and rigid vegetated streams. Water Resources Management 37 (15), 6005–6019. https://doi.org/10.1007/S11269-023-03640-8/METRICS
Leta M. K., Waseem M., Rehman K. & Tränckner J. 2023 Sediment yield estimation and evaluating the best management practices in Nashe watershed, Blue Nile Basin, Ethiopia. Environmental Monitoring and Assessment 195 (6), 1–20. https://doi.org/10.1007/S10661-023-11337-Z/TABLES/7
Li Z. & Fang H. 2016 Impacts of climate change on water erosion: A review. Earth-Science Reviews 163, 94–117. https://doi.org/10.1016/J.EARSCIREV.2016.10.004
Ma Z. & Mei G. 2021 Deep learning for geological hazards analysis: Data, models, applications, and opportunities. Earth-Science Reviews 223, 103858. https://doi.org/10.1016/J.EARSCIREV.2021.103858
Mohsenzadeh Karimi S., Kisi O., Porrajabali M., Rouhani-Nia F. & Shiri J. 2020 Evaluation of the support vector machine, random forest and geo-statistical methodologies for predicting long-term air temperature. ISH Journal of Hydraulic Engineering 26 (4), 376–386. https://doi.org/10.1080/09715010.2018.1495583
Montesinos López O. A., Montesinos López A. & Crossa J. 2022 Fundamentals of artificial neural networks and deep learning. Multivariate Statistical Machine Learning Methods for Genomic Prediction, 379–425. https://doi.org/10.1007/978-3-030-89010-0_10
Obiahu O. H. & Elias E. 2020 Effect of land use land cover changes on the rate of soil erosion in the Upper Eyiohia river catchment of Afikpo North Area, Nigeria. Environmental Challenges 1, 100002. https://doi.org/10.1016/J.ENVC.2020.100002
Pandey M., Jamei M., Ahmadianfar I., Karbasi M., Lodhi A. S. & Chu X. 2022 Assessment of scouring around spur dike in cohesive sediment mixtures: A comparative study on three rigorous machine learning models. Journal of Hydrology 606, 127330. https://doi.org/10.1016/J.JHYDROL.2021.127330
Pathak S., Liu M., Jato-Espino D. & Zevenbergen C. 2020a Social, economic and environmental assessment of urban sub-catchment flood risks using a multi-criteria approach: A case study in Mumbai City, India. Journal of Hydrology 591, 125216. https://doi.org/10.1016/J.JHYDROL.2020.125216
Pathak S., Ojha C. S. P., Garg R. D., Liu M., Jato-Espino D. & Singh R. P. 2020b Spatiotemporal analysis of water resources in the Haridwar Region of Uttarakhand, India. Sustainability 12 (20), 8449. https://doi.org/10.3390/SU12208449
Remund D., Liebisch F., Liniger H. P., Heinimann A. & Prasuhn V. 2021 The origin of sediment and particulate phosphorus inputs into water bodies in the Swiss Midlands – a twenty-year field study of soil erosion. CATENA 203, 105290. https://doi.org/10.1016/J.CATENA.2021.105290
Sarker I. H. 2021 Machine learning: Algorithms, real-world applications and research directions. SN Computer Science 2 (3), 1–21. https://doi.org/10.1007/S42979-021-00592-X/FIGURES/11
Shivappa Masalvad S., Patil C., Pravalika A., Katageri B., Bekal P., Patil P., Hegde N., Sahoo U. K. & Sakare P. K. 2023 Application of geospatial technology for the land use/land cover change assessment and future change predictions using CA Markov chain model. Environment, Development and Sustainability 2023, 1–26. https://doi.org/10.1007/S10668-023-03657-4
Shivashankar M., Pandey M. & Zakwan M. 2022 Estimation of settling velocity using generalized reduced gradient (GRG) and hybrid generalized reduced gradient–genetic algorithm (hybrid GRG-GA). Acta Geophysica 70 (5), 2487–2497. https://doi.org/10.1007/S11600-021-00706-2/METRICS
Signoroni A., Savardi M., Baronio A. & Benini S. 2019 Deep learning meets hyperspectral image analysis: A multidisciplinary review. Journal of Imaging 5 (5), 52. https://doi.org/10.3390/JIMAGING5050052
Singh U. K., Jamei M., Karbasi M., Malik A. & Pandey M. 2022 Application of a modern multi-level ensemble approach for the estimation of critical shear stress in cohesive sediment mixture. Journal of Hydrology 607, 127549. https://doi.org/10.1016/J.JHYDROL.2022.127549
Suwardi, Sutiarso L., Wirianata H., Nugroho A. P., Sukarman, Primananda S., Dasrial M. & Hariadi B. 2024 Optimization of a soil type prediction method based on the deep learning model and vegetation characteristics. Plant Science Today 11 (1), 480–499. https://doi.org/10.14719/PST.2926
Wynants M., Patrick A., Munishi L., Mtei K., Bodé S., Taylor A., Millward G., Roberts N., Gilvear D., Ndakidemi P., Boeckx P. & Blake W. H. 2021 Soil erosion and sediment transport in Tanzania: Part II – sedimentological evidence of phased land degradation. Earth Surface Processes and Landforms 46 (15), 3112–3126. https://doi.org/10.1002/ESP.5218
Zhao S., Kong M., Li R., Hounye A. H., Su R., Hou M. & Cao C. 2024 SCARNet: Using convolution neural network to predict time series with time-varying variance. Multimedia Tools and Applications. https://doi.org/10.1007/S11042-024-19322-5

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).