ABSTRACT
In this study, various regression models were used to predict total sediment yield in tons, and their performance was evaluated for accuracy and reliability. The dataset contains numerous predictors that were standardized and processed through principal component analysis (PCA) to improve model performance. The models evaluated include linear regression, normalized linear regression, linear regression with PCA, Pearson correlation coefficient (PCC) with generalized ridge regression, kernel ridge regression, multivariate regression, lasso regression, and neural network approaches such as the artificial neural network (ANN) and the Cellular Automata-Artificial Neural Network (CA-ANN), among others. The ANN model achieved the lowest mean squared error (MSE), 113.6431, suggesting superior predictive capability compared with the other models. Despite the complexity of the environmental data and of the relationships among variables, the ANN model showed the least error, followed closely by CA-ANN with an MSE of 124.183. Traditional models such as linear and lasso regression produced larger errors and negative R2 values, indicating poor fits to the data. This analysis highlights the effectiveness of advanced machine learning in environmental modeling and emphasizes the importance of selecting models suited to the data and the specific phenomena, aiding environmental planners in predicting and managing soil erosion and sediment transport.
HIGHLIGHTS
Comprehensive comparison of traditional and advanced models, including TabNet, CATBOOST, and AutoGLON, for predicting total upland sediment yield (TUSY).
Innovative use of TabNet for accurate sediment yield prediction, demonstrating the model's potential in hydroinformatics.
Significant improvement in prediction accuracy with advanced techniques, reducing mean squared error (MSE).
INTRODUCTION
Soil erosion and sediment transport are integral parts of the environmental system and play a major role in agricultural productivity, water quality, and ecosystem health (Pathak et al. 2020a; Wynants et al. 2021). Watershed management and soil conservation measures depend on projecting and quantifying the sediment yield from upland areas (Leta et al. 2023). The growth of large datasets, advances in statistical methods and computational power, and the rapid development of artificial intelligence (AI) and machine learning (ML) have made it possible to capture the complexity that influences sediment production (Pathak et al. 2020b; Ma & Mei 2021). Regression methods have been applied to identify the environmental features that cause soil erosion, and model performance has been evaluated to determine how well this complexity is captured (De Vente et al. 2013). Sediment yield in a watershed affects river morphology, bank erosion, silt deposition in reservoirs, and the degradation of aquatic habitats (Al-Mamari et al. 2023). Further, pollutants from fertilizers and pesticides are carried by these sediments into aquatic environments, causing environmental degradation (Remund et al. 2021). Accurate prediction is hampered by the complexity of land-use land cover changes (LULCC), vegetation type, and variability in watershed topography (Suwardi et al. 2024). AI and ML have an advantage in handling large and complex datasets and can readily assimilate factors such as LULCC, slope, rainfall intensity, and soil properties (Mohsenzadeh Karimi et al. 2020; Gani et al. 2023).
Sediment yield has traditionally been studied using the universal soil loss equation (USLE) and software such as the Soil and Water Assessment Tool (SWAT). These approaches are constrained by their data requirements and fail to capture the complexity involved in sediment estimation, challenges that advanced ML and AI methods have addressed (Jimeno-Sáez et al. 2018). Research on erosion caused by climate change is becoming more frequent, using climate projections to predict changes in erosion patterns and sediment flux caused by altered precipitation regimes and extreme weather events (Li & Fang 2016). Land-use changes, particularly urbanization, agriculture, and deforestation, have also been widely studied, and their correlation with erosion rates and sediment transport is evident (Obiahu & Elias 2020). Technological innovations in remote sensing and Geographic Information Systems (GIS) have advanced soil erosion monitoring, improving model validation as well as spatial risk analyses (Shivappa Masalvad et al. 2023). Studies of socioeconomic impacts and the effectiveness of regulatory frameworks have clarified the wider implications of sediment yield, leading to sustainable land management policies and practices (Haregeweyn et al. 2023). Precise estimation of sediment yield is crucial for efficient irrigation and land management and affects soil erosion control, water quality management, and sustainable agriculture practices. Traditional models such as USLE and SWAT cannot readily handle large datasets or account for nonlinear relations. In this study, we address these shortcomings by comparing traditional regression techniques with advanced ML models, with the aim of finding the most efficient methods to predict sediment yield.
STUDY AREA
DATA AND METHODOLOGY
This research collects an extensive set of environmental variables to estimate sediment yields at various geographical locations (Pandey et al. 2022). The variables include chemical and physical characteristics of soils, climate data, topographic measurements, and land-use characteristics (Basha et al. 2024). Prior to analysis, the predictor variables were standardized to a mean of zero and a standard deviation of one, which ensured uniformity and comparability across scales.
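The standardization step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `standardize` and the demo predictors (slope, rainfall, clay fraction) are hypothetical and chosen only to show the scaling.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unscaled to avoid division by zero
    return (X - mean) / std, mean, std

# Hypothetical predictors: slope (%), annual rainfall (mm), clay fraction
X_demo = np.array([
    [5.0, 900.0, 0.20],
    [12.0, 1200.0, 0.35],
    [8.0, 1500.0, 0.25],
    [20.0, 700.0, 0.40],
])
Z_demo, col_mean, col_std = standardize(X_demo)
```

In practice, the mean and standard deviation would be estimated on the training split only and reused to scale any held-out data.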
Exploratory data analysis (EDA) allows researchers to gain a greater understanding of the relationships and distributions within a dataset, enabling them to detect significant trends or possible outliers that require further investigation (Shivashankar et al. 2022). EDA also informs the selection of regression and ML models, which are then trained on subsets of the data and evaluated rigorously to assess their efficacy.
Model evaluation involves comparing models using performance metrics such as mean squared error (MSE) and R2 to assess their predictive ability. The most effective models are then examined further to select one for additional testing, so that the results can give actionable insight into land management or water conservation strategies (Singh et al. 2022). This methodical approach yields a comprehensive yet efficient analysis and supports well-informed choices in environmental management.
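As a small worked example of the two metrics, the sketch below computes MSE and R2 from their standard definitions. The observed and predicted values are made up for illustration; they show how a model that fits worse than the mean of the observations receives a negative R2.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot; negative when the model is worse than the mean."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_obs = [10.0, 12.0, 14.0, 16.0]        # illustrative observed yields
y_mean_model = [13.0] * 4               # always predicts the observed mean
y_bad_model = [16.0, 14.0, 12.0, 10.0]  # predicts the reverse trend

r2_mean = r2_score(y_obs, y_mean_model)  # 0.0
r2_bad = r2_score(y_obs, y_bad_model)    # -3.0
mse_bad = mse(y_obs, y_bad_model)        # 20.0
```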
Principal component analysis (PCA) was used to manage multicollinearity and improve model interpretability by reducing data dimensionality while preserving enough variance for accurate predictions. Before PCA, variables were standardized to a mean of zero and a standard deviation of one to ensure consistency and comparability. The dataset also underwent quality checks to confirm its accuracy and integrity for regression analysis. Carefully collected data support the application of sophisticated regression models and provide a basis for assessing how accurately they anticipate sediment yields, an essential aspect of environmental planning.
This research examined the performance of various regression models in predicting total sediment yield from a dataset rich in environmental variables, using MSE and R2 scores to assess accuracy and predictive capability. MSE is the average squared difference between estimated and actual values, so lower MSE values indicate better model performance. R2 measures how much of the variance in the dependent variable is explained by the independent variables. A negative R2 means that the model fits worse than a horizontal line at the mean of the observed values.
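The PCA step can be illustrated with a short sketch based on the singular value decomposition. It assumes the input has already been standardized; the function name `pca_fit_transform`, the 99% variance threshold, and the synthetic collinear predictors are assumptions made for this example and are not taken from the study.

```python
import numpy as np

def pca_fit_transform(Z, variance_to_keep=0.95):
    """Project Z onto the fewest principal components reaching the variance target."""
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)                      # PCA requires centered data
    _, S, Vt = np.linalg.svd(Zc, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)          # variance ratio per component
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(S))                           # guard against rounding at the end
    return Zc @ Vt[:k].T, explained[:k]

# Synthetic predictors with deliberate multicollinearity (only two independent signals)
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1], 2.0 * base[:, 0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

scores, ratios = pca_fit_transform(Z, variance_to_keep=0.99)
```

Here four correlated columns collapse to two components, which is the sense in which PCA manages multicollinearity.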
Linear regression models
The unmodified linear regression model produced a very high MSE of 530,302.6 and an R2 value of −203.14. These results indicate a very poor fit and weak predictive ability for this model.
Ridge and kernel ridge regression
Ridge and kernel ridge regression are extensions of linear regression that incorporate regularization to address multicollinearity as well as avoid overfitting in environmental models such as sediment yield prediction. Both models modify linear regression using penalties added to their coefficients that can be controlled using a complexity parameter.
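The ridge estimate minimizes the penalized sum of squared residuals:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

Here, $y_i$ represents the observed sediment yields, $x_{ij}$ are the predictor variables, $\beta_j$ are the regression coefficients, and $\lambda$ is the regularization parameter that controls the degree of shrinkage of the coefficients toward zero. By selecting a suitable $\lambda$, ridge regression can reduce model complexity and prevent overfitting. Kernel ridge regression expands on ridge regression by employing kernel methods to learn in higher dimensional spaces, making this approach particularly suitable for nonlinear relationships. The model is built on a kernel function $k(\cdot,\cdot)$, with predictions of the form $\hat{y}(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$, where $\alpha = (K + \lambda I)^{-1} y$ and $K_{ij} = k(x_i, x_j)$.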
The generalized ridge regression and kernel ridge regression methods showed improved performance, with MSE values of 3,009.155 and 3,025.75, respectively. However, their R2 values remained negative, suggesting potential issues regarding model suitability or data characteristics.
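To make the role of the regularization parameter concrete, the following sketch implements the closed-form ridge and kernel ridge solutions with NumPy. It is an illustration under stated assumptions: the Gaussian (RBF) kernel, the values of `lam` and `gamma`, and the synthetic data are choices made here, since the study does not state which kernel or settings were used.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_ridge_fit(X, y, lam, gamma):
    """Dual coefficients alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))                       # synthetic, centered predictors
y_lin = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)
y_nonlin = np.sin(X[:, 0])                         # a nonlinear target

beta_weak = ridge_fit(X, y_lin, lam=0.01)          # little shrinkage
beta_strong = ridge_fit(X, y_lin, lam=100.0)       # heavy shrinkage toward zero

alpha = kernel_ridge_fit(X, y_nonlin, lam=1e-2, gamma=0.5)
fitted = kernel_ridge_predict(X, alpha, X, gamma=0.5)
train_mse = float(np.mean((y_nonlin - fitted) ** 2))
```

A larger `lam` always yields coefficients with a smaller norm, which is the shrinkage effect described above.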
Multivariate and lasso regressions
Multivariate regression yielded an MSE of 30,091.155 with an R2 value of −10.583, the same R2 as generalized ridge regression, indicating a poor fit despite the use of multiple predictors. Lasso regression, which shrinks coefficients and can set some of them to zero, gave a lower MSE of 9,313.58 and an R2 value of −2.585, a modest improvement that still does not amount to an adequate fit. These results show that although lasso regression improved upon the multivariate approach by focusing on relevant features, it still struggled to capture the complexity of this dataset because of nonlinear relationships and interactions among the predictors and between the predictors and sediment yield. More sophisticated nonlinear models may therefore represent complex environmental data dynamics more effectively.
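The feature-selection behavior of lasso can be sketched with a plain coordinate descent solver. This is a textbook formulation, minimizing (1/2n)·||y − Xβ||² + λ·||β||₁, and is an assumption of this example; the study does not say which solver or penalty value was used, and the data below are synthetic.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become zero."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the lasso objective stated above."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)  # only two relevant features

beta_light = lasso_coordinate_descent(X, y, lam=0.01)   # close to least squares
lam_max = np.max(np.abs(X.T @ y)) / len(y)              # smallest penalty that zeroes everything
beta_zero = lasso_coordinate_descent(X, y, lam=1.01 * lam_max)
```

Between these two extremes, increasing `lam` removes predictors one by one, which is how lasso focuses on relevant features.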
Advanced neural network models
In a composite artificial neural network (CANN), the integration layer blends the outputs of the component networks in order to leverage each network's individual strengths. It may use techniques such as concatenation, weighted averaging, or more sophisticated fusion methods to combine the outputs effectively (Alzubaidi et al. 2021). A CANN uses multiple networks to capture more patterns and relationships within the data, leading to more precise results. The modular design of CANNs allows them to be tailored to specific requirements by selecting appropriate component networks. CANNs can also better tolerate noise and overfitting because they draw on the abilities of all their constituent networks (Alzubaidi et al. 2021).
CANNs have proven their worth across various fields and applications, including education. CNNs and RNNs can be combined to analyze visual data more thoroughly and efficiently. In natural language processing, RNNs and LSTMs are used for tasks such as sentiment analysis, language translation, and text generation (Signoroni et al. 2019). By using CNNs to extract features and RNNs to handle temporal dependencies, time series forecasting can predict weather patterns, stock prices, and other time-dependent phenomena more reliably (Zhao et al. 2024).
CANNs provide an effective ML approach for constructing sophisticated models that leverage several kinds of networks for improved efficiency, flexibility, and robustness, making them useful tools in many sectors and fields of application (Taye 2023).
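The two simplest fusion techniques mentioned for the integration layer, weighted averaging and concatenation, can be sketched as follows. The function names, the component outputs, and the weights are hypothetical; the text does not specify how the CA-ANN in this study combines its components.

```python
import numpy as np

def weighted_average(outputs, weights):
    """Blend per-network predictions using weights normalized to sum to one."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, np.stack(outputs), axes=1)

def concatenate_features(feature_blocks):
    """Join per-network feature matrices side by side for a downstream layer."""
    return np.concatenate(feature_blocks, axis=1)

# Hypothetical predictions from two component networks for three samples
preds_a = np.array([100.0, 120.0, 90.0])
preds_b = np.array([110.0, 100.0, 95.0])
blended = weighted_average([preds_a, preds_b], weights=[3.0, 1.0])  # 0.75 * a + 0.25 * b

# Hypothetical hidden features from the same two networks
feats_a = np.ones((3, 2))
feats_b = np.zeros((3, 4))
fused = concatenate_features([feats_a, feats_b])
```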
Artificial neural networks (ANNs) are computational models inspired by the network of neurons in the human brain. ANNs consist of layers of interconnected nodes (neurons) that process data hierarchically; each node computes a weighted sum of its inputs and passes it through an activation function, which enables complex pattern recognition in data (Montesinos López et al. 2022). They are applied across industries, including image and speech recognition, natural language processing, and predictive modeling, because their ability to improve through adaptive training makes them effective for nonlinear, complex problems in many domains (Sarker 2021).
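The weighted-sum-plus-activation computation described above can be traced in a small forward pass. The layer sizes, the ReLU activation, and the weight values are illustrative assumptions; the architecture of the ANN used in this study is not given in the text.

```python
import numpy as np

def relu(z):
    """Rectified linear activation: negative inputs become zero."""
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Apply each hidden layer as relu(x W + b), then a linear output layer."""
    a = x
    for W, b in layers[:-1]:
        a = relu(a @ W + b)
    W_out, b_out = layers[-1]
    return a @ W_out + b_out

# One sample with two inputs, one hidden layer of two nodes, one output node
x = np.array([[1.0, 2.0]])
W1 = np.array([[1.0, -1.0],
               [0.5, 1.0]])
b1 = np.array([0.0, -2.0])
W2 = np.array([[3.0],
               [4.0]])
b2 = np.array([1.0])

# Hidden pre-activations are [2, -1]; ReLU gives [2, 0]; output is 2*3 + 0*4 + 1 = 7
y_hat = forward(x, [(W1, b1), (W2, b2)])
```

Training adjusts the weights and biases to reduce a loss such as MSE; that step is omitted here.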
The composite artificial neural network (CA-ANN) and artificial neural network (ANN) models yielded promising results (Figure 4), with the latter showing the lowest MSE of 113.6431 at an R2 of −4.142, suggesting a better ability to predict sediment yield than any of the other models. The disparities among models illustrate both the complexity of environmental data and the difficulty of modeling them under linear assumptions. Nearly all R2 values are negative, indicating that almost none of the models could adequately account for the variability of the dependent variable; the relationships may be nonlinear and may involve interactions that simpler models do not capture. The ANN model outperformed the CA-ANN model in prediction error. Its performance also highlights how advanced ML techniques can offer significant improvements over traditional regression approaches in environmental prediction models.
TabNet
TabNet is a deep learning model developed specifically for tabular data, with applications in disciplines such as health, finance, and environmental science. It combines the benefits of neural networks and decision trees with an attention mechanism that automatically selects which features to use at each step. This enables TabNet to focus on relevant features while improving efficiency and interpretability. Because the selected features can be inspected, TabNet also provides insight into which predictors contribute most to accurate forecasts, which is particularly helpful when working with large datasets. In environmental science, TabNet can be used to forecast quantities such as sediment yield and to give land managers and conservation programs an actionable view of the results.
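The feature-selection idea can be illustrated with sparsemax, the sparse alternative to softmax that the published TabNet design uses to build its feature masks. This sketch is not the TabNet architecture: the relevance scores and feature names are hypothetical, whereas in TabNet the scores are learned at each decision step.

```python
import numpy as np

def sparsemax(z):
    """Project scores onto the probability simplex; low scores become exactly zero."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1.0 + k * z_sorted > cumsum      # entries that stay nonzero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1.0) / k_max    # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

# Hypothetical relevance scores for four predictors
scores = np.array([1.0, 0.8, 0.1, -0.5])   # e.g. rainfall, slope, clay fraction, aspect
mask = sparsemax(scores)                   # [0.6, 0.4, 0.0, 0.0]

features = np.array([2.0, -1.0, 0.5, 3.0])
selected = mask * features                 # only the first two predictors pass through
```

Because the mask contains exact zeros, it can be read directly as a statement of which predictors the model used, which is the source of the interpretability noted above.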
AutoGLON
RESULTS AND DISCUSSIONS
The correlation matrix also highlights notable interdependencies, such as the negative correlation between form factor and elongation ratio, which suggests that more elongated basins have lower form factors. This is essential information for hydrologic modeling, watershed management, and understanding geomorphological processes. Overall, correlation matrices are a valuable way of uncovering key relationships among morphometric parameters and support more precise predictive models and efficient management plans.
Forecasting TUSY is essential to managing watersheds effectively, conserving soil quality, and safeguarding water quality, as inaccurate predictions can negatively affect agricultural and environmental planning. In this study, we explored various regression models to find which provides the most accurate predictions: linear regression, generalized ridge regression, kernel ridge regression, multivariate regression, lasso regression, CA-ANN, ANN, TabNet, CATBOOST, and AutoGLON. Each was evaluated using MSE and R2. Linear regression, the base model, shows an extremely large MSE (530,302.6) and a strongly negative R2 (−203.14), indicating poor performance. Linear regression with normalization improves substantially, reducing MSE to 30,125.797, although R2 remains negative (−10.6); adding PCA preprocessing brings no further improvement, leaving MSE at 30,125.797 and R2 at −10.6. Generalized ridge regression lowers MSE to 3,009.155 but does not improve R2 (−10.583). Kernel ridge regression, which uses kernel functions, yields an MSE of 3,025.75 and an R2 of −0.164, showing better potential although not sufficient for real-world applications.
Integrating multiple variables into multivariate regression does not improve predictions, as evidenced by its MSE of 30,091.155 and R2 of −10.583. Lasso regression shows a moderate improvement over linear regression, with an MSE of 9,313.58 and an R2 of −2.585, but still falls short of an adequate fit. Ensemble and nonlinear models show varied results. CA-ANN shows a significant reduction in MSE (124.183); however, its R2 is still negative (−5.73). ANN reduces MSE to 113.6431 with an R2 of −4.142, which indicates lower error than CA-ANN. TabNet stands out among the more modern algorithms with a comparatively low MSE of 1,515.3 and the first positive R2 value (0.1462). These metrics suggest it captures some of the variation in TUSY and offers a promising approach. The CATBOOST gradient boosting algorithm gives an MSE of 13,202.22 with an R2 of −4.082, more effective than the basic linear models but not as effective as TabNet. The AutoGLON model has an MSE of 1,371.5 and an R2 of −0.655, which represents progress over the conventional linear models. The results show that traditional linear models, even those that incorporate PCA and normalization, fail to predict TUSY accurately, as shown by their high MSE values and strongly negative R2 values.
Generalized ridge regression and kernel ridge regression, which use regularization and kernel functions, do not produce acceptable R2 scores despite significantly reducing MSE. Among the nonlinear and ensemble methods, CA-ANN and ANN significantly decreased MSE, while their negative R2 scores indicated inadequate generalization or overfitting, and CATBOOST offered an improvement over traditional approaches but fell short in terms of R2. TabNet stands out with an MSE of 1,515.3 and the first positive R2 (0.1462), which indicates that it captures some of the variance in TUSY and is suitable for further investigation and refinement. AutoGLON reduces MSE to 1,371.5 yet retains a negative R2 (−0.655), leaving ample room for refinement and optimization.
This study explores both the limitations and the promise of various regression models used to predict TUSY. Linear models based on traditional methods do not meet accuracy and reliability requirements; however, modern nonlinear and ensemble methods such as TabNet and AutoGLON have shown promising results. The next phase of research should focus on refining these models through hybrid approaches, feature engineering, and feature selection techniques in order to increase prediction accuracy and model generalization. Such efforts will strengthen the application of these models to agricultural planning and environmental conservation, helping to manage soil and watersheds more efficiently and to conserve scarce natural resources. ANN stands out for its capacity to recognize complex nonlinear relationships in data. Traditional linear models had difficulty accommodating the complexity of the dataset and produced higher errors, although preprocessing steps such as normalization and PCA standardized the predictors and reduced dimensionality, and normalization in particular lowered the error of linear regression considerably.
Sl. No. | Regression analysis | MSE | R2 |
---|---|---|---|
1 | Linear regression | 530302.6 | −203.14 |
2 | Linear regression with normalization | 30125.797 | −10.6 |
3 | Linear regression with PCA | 30125.797 | −10.6 |
4 | Generalised ridge regression | 3009.155 | −10.583 |
5 | Kernel ridge regression | 3025.75 | −0.164 |
6 | Multivariate regression | 30091.155 | −10.583 |
7 | Lasso regression | 9313.58 | −2.585 |
8 | CA-ANN | 124.183 | −5.73 |
9 | ANN | 113.6431 | −4.142 |
10 | TabNet | 1515.3 | 0.1462 |
11 | CATBOOST | 13202.22 | −4.082 |
12 | AutoGLON | 1371.5 | −0.655 |
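The comparison in the table above can be checked with a few lines of code. The dictionary below simply transcribes the tabulated MSE and R2 values; the variable names are hypothetical, and the snippet only ranks the reported numbers rather than reproducing any model.

```python
# (MSE, R2) per model, transcribed from the results table
results = {
    "Linear regression": (530302.6, -203.14),
    "Linear regression with normalization": (30125.797, -10.6),
    "Linear regression with PCA": (30125.797, -10.6),
    "Generalised ridge regression": (3009.155, -10.583),
    "Kernel ridge regression": (3025.75, -0.164),
    "Multivariate regression": (30091.155, -10.583),
    "Lasso regression": (9313.58, -2.585),
    "CA-ANN": (124.183, -5.73),
    "ANN": (113.6431, -4.142),
    "TabNet": (1515.3, 0.1462),
    "CATBOOST": (13202.22, -4.082),
    "AutoGLON": (1371.5, -0.655),
}

# Model with the smallest reported error
lowest_mse_model = min(results, key=lambda name: results[name][0])

# Models that explain some variance, i.e. have a positive R2
positive_r2_models = [name for name, (_, r2) in results.items() if r2 > 0]

# Full ranking from lowest to highest MSE
ranking_by_mse = sorted(results, key=lambda name: results[name][0])
```

Ranking by MSE and by R2 gives different winners here (ANN and TabNet, respectively), so both metrics need to be read together.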
CONCLUSIONS
This research evaluated several regression models for predicting TUSY, an important measure for water and land conservation. Traditional linear models, including basic linear regression, normalized linear regression, and linear regression with PCA, could not represent the environmental dataset accurately, as shown by their high MSE values and negative R2 values. Generalized ridge regression and kernel ridge regression offer lower MSE values; however, their negative R2 values indicate that they are not yet practical. By comparison, ensemble and nonlinear models gave better results. The ANN and CA-ANN models achieved the lowest MSE values, while their negative R2 scores indicated overfitting or inadequate generalization. TabNet stands out as the only model with a positive R2 value (0.1462), suggesting that it accounts for variability more effectively. AutoGLON and CATBOOST outperformed the basic linear models; however, neither achieved a positive R2 as TabNet did. These results highlight the limitations of linear models for forecasting TUSY and show why nonlinear and ensemble methods may be more appropriate for this task. TUSY predictions are critical for identifying areas at risk of erosion, for targeting interventions such as planting or soil stabilization, and for creating sustainable land-use plans. These strategies can help protect water quality by limiting the amount of sediment entering water bodies. Future research should focus on improving these advanced models, exploring hybrid strategies, and applying advanced feature engineering and selection methods to increase prediction accuracy and generalization.
These efforts will expand the practical use of sediment yield models in agricultural and environmental planning, aiding watershed management strategies and soil conservation measures. Applications of such models could include targeted interventions in erosion-prone areas, sustainable land-use planning, and improved water quality management. Furthermore, this research provides a basis for the further development of predictive sediment yield models, which could allow more effective strategies to be developed for environmental management.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.