ABSTRACT
Accurate river flow prediction is essential for effective water resource management and flood forecasting. This study investigates advanced machine learning (ML) models (CatBoost, LGBM, Random Forest, and XGBoost) for predicting downstream flow at Mandleshwar using level data from key stations in the Narmada River basin. The dataset, spanning 1978–2020 with over 15,000 daily observations, was obtained from India's Water Resources Information System (WRIS). Rigorous preprocessing ensured high-quality input data for modeling. During training, XGBoost achieved an MAE of 0.0031, MSE of 0.00002, RMSE of 0.0048, and R2 of 0.99, effectively capturing flow dynamics. In validation, LGBM performed best, with an R2 of 0.95, MAE of 0.0646, and RMSE of 0.2262, while CatBoost, XGBoost, and Random Forest followed closely. In testing, XGBoost led with an R2 of 0.96, RMSE of 0.2326, and MAE of 0.0680, while Random Forest and the other models also demonstrated strong predictive capability. The findings emphasize the utility of level data for downstream flow prediction and the effectiveness of ML models in hydrological applications. This study provides actionable insights for optimizing reservoir operations, flood mitigation, and sustainable water management, supporting policymakers in developing adaptive strategies for the Narmada River basin.
HIGHLIGHTS
Advanced ML models improve downstream flow prediction accuracy in river basins.
Level data utilized to predict downstream flow in the Narmada River Basin.
XGBoost and CatBoost demonstrate superior performance in flow prediction.
Rigorous statistical metrics validate the robustness of ML-based predictions.
Findings support enhanced water management and flood forecasting strategies.
INTRODUCTION
Flow and level measurements in rivers are fundamental for managing water resources, predicting floods, monitoring pollution, and maintaining ecological balance. These measurements are critical for ensuring water availability for agriculture, industry, and domestic use, while also preserving the health of riverine ecosystems (Kundzewicz et al. 2014; Malekmohammadi et al. 2023). Globally, they play a vital role in assessing ecological conditions (Mahdian et al. 2024) and pollution levels (Stride et al. 2023; Guo et al. 2024) and predicting flood potentials (Ahmadi et al. 2024), making accurate and timely data indispensable for sustainable water management practices.
Traditional methods for flow and level measurements, such as physical gauges and manual readings, have been widely used but are increasingly seen as insufficient (Zhang et al. 2022). These methods are labour-intensive, prone to human error, and often lack the real-time capability necessary for timely decision-making (Kundzewicz et al. 2014; Wohl et al. 2023). Their limitations are particularly evident during extreme weather events, where delayed or inaccurate measurements can lead to inadequate responses (Yusoff et al. 2002). Furthermore, traditional approaches often fail to capture the spatial and temporal complexities of river dynamics and do not account for gradual morphological or hydrological changes (Hughes et al. 2021; Tian et al. 2023). These constraints highlight the growing need for advanced techniques to improve the accuracy and efficiency of river flow and level measurements.
Physical gauges, such as weirs and flumes, and manual measurement methods have been pivotal for decades, yet they suffer from significant limitations. Accuracy is often compromised by factors such as calibration issues, environmental conditions, and human error (Knighton 2014). Additionally, these methods are time-consuming and labour-intensive, particularly during periods of high flow, such as floods, when rapid response is critical. The lack of timely information due to manual data processing further hinders effective water management and flood mitigation efforts (Sassi & Hoitink 2013; Samboko et al. 2020). Moreover, the operational costs of maintaining these systems are high, necessitating the adoption of more efficient and cost-effective alternatives (Junqueira et al. 2021). There is thus a clear need for efficient, cost-effective methods of flow and level measurement in rivers, particularly in light of increasing demands on water resources and the need for accurate and timely data for water management and flood forecasting (Bjerklie et al. 2018).
The hydrology and water resources community is increasingly adopting machine learning (ML) techniques for flow prediction due to their ability to model complex relationships within data. Traditional methods often fall short in capturing the nonlinear and non-stationary characteristics of hydrological processes, resulting in less reliable predictions, particularly under dynamic environmental conditions (Sit et al. 2020; Rathnayake et al. 2023a). In contrast, ML algorithms excel in processing large datasets, uncovering intricate patterns that may not be easily discernible through conventional approaches (Zhao et al. 2022). This capability enables ML models to deliver more accurate flow predictions, especially in regions where hydrological systems are shaped by diverse factors such as climate change and land-use transformations (Kumar et al. 2023).
A key advantage of ML in flow prediction is its ability to minimize reliance on labour-intensive manual methods. ML algorithms can efficiently process vast datasets, significantly reducing the need for manual data collection and processing efforts (Ghobadi & Kang 2023). This efficiency translates into cost savings, enhanced water resource management, and more effective flood forecasting capabilities (Rathnayake et al. 2023b). Furthermore, ML algorithms are adaptable to evolving conditions, making them robust and reliable tools for long-term predictions (Kumar et al. 2023). The adoption of ML techniques for flow prediction marks a major advancement in hydrology and water resource management, offering significant improvements in the accuracy and efficiency of predicting flows in rivers and other aquatic systems (Rathnayake et al. 2023c). Recent work has further validated these advancements; for instance, Khosravi et al. (2024) compared and evaluated different ensemble models, demonstrating their potential to outperform traditional models in various river basins.
The application of ML for flow prediction in river basins has been widely explored, employing diverse methodologies and techniques. Numerous studies have demonstrated the effectiveness of ML models, such as CatBoost, Light Gradient-Boosting Machine (LGBM), Random Forest, and XGBoost, in enhancing flow prediction accuracy. For instance, Li et al. (2022) utilized the Random Forest algorithm to predict streamflow in the Upper Yangtze River Basin, achieving superior accuracy compared to traditional hydrological models. Similarly, Patel & Ramachandran (2015) employed XGBoost for river flow prediction in the Karkheh River Basin in Iran, reporting notable improvements in predictive accuracy over conventional methods.
Other studies have compared the performance of different ML models for river flow prediction. For example, Huang et al. (2021) evaluated Random Forest, XGBoost, and LGBM for predicting river flow in the Poyang Lake Basin in China, finding that while all three models performed well, XGBoost achieved the highest prediction accuracy. Similarly, Wang & Qian (2023) compared CatBoost and LGBM in the Yellow River Basin, concluding that both models outperformed traditional approaches, with CatBoost demonstrating slightly better performance. These findings underscore the potential of advanced ML models, including CatBoost, LGBM, Random Forest, and XGBoost, to significantly enhance flow prediction accuracy in river basins. Farajpanah et al. (2024) have explored innovative hybrid ML approaches like wavelet-ML models to enhance flow prediction accuracy, further illustrating the rapid advancements in this field.
The existing literature on flow prediction in the Indian Narmada River Basin using advanced ML models has shown promising results. However, significant research gaps remain, particularly in the methodologies used for downstream flow prediction. Most studies rely on historical flow data as the primary input for predictive models, which, while effective, often fail to explore alternative data sources that could offer new insights. In contrast, this study adopts a novel approach by utilizing upstream-level data to predict downstream flow. This shift from flow-based inputs to level-based inputs provides a fresh perspective on modeling flow dynamics, which has not been extensively explored in the field.
The novelty of this work lies in leveraging level data as the primary input for downstream flow prediction. By focusing on the relationship between upstream-level measurements and downstream flow, this study moves beyond traditional approaches, offering a unique methodology to enhance predictive accuracy. Advanced ML techniques, including CatBoost, LGBM, Random Forest, and XGBoost, are employed to process and analyze the level data, capturing complex and nonlinear relationships that traditional models often overlook.
To achieve this, a robust methodological framework is developed, emphasizing the integration, preprocessing, and modeling of level data from upstream locations. This approach not only improves downstream flow prediction accuracy but also provides deeper insights into the connection between level and flow dynamics. By addressing this critical gap, the study contributes to advancing hydrological modeling and offers scalable solutions for water resource management and flood mitigation in similar river systems globally.
Objective of the study
The primary objective of this study is to address the critical research gap in downstream flow prediction by utilizing upstream-level data as the primary input, a novel approach in hydrological modeling. Unlike most existing studies that rely on historical flow data, this research focuses on integrating upstream-level measurements into advanced ML models, such as CatBoost, LGBM, Random Forest, and XGBoost, to predict the downstream flow. By leveraging level data from strategically chosen upstream locations, this study aims to enhance the accuracy of downstream flow predictions and provide deeper insights into flow dynamics. The findings are expected to significantly improve water resource management practices, including reservoir operations, flood mitigation strategies, and decision-making processes for sustainable water allocation. This innovative methodology not only advances the understanding of the relationship between level and flow but also establishes a robust framework for applying ML in similar hydrological systems globally.
STUDY AREA AND DATA COLLECTION
Map of the Indian Narmada River Basin showing the locations of the five river station points used in the study for downstream flow prediction.
The selected river station points are critical for downstream flow prediction in the Narmada River Basin. Data collected from these points enable researchers and policymakers to understand flow patterns, particularly in downstream areas where flow changes have more significant impacts. This information is essential for developing accurate models for downstream flow prediction, which is vital for effective water resource management and flood forecasting. Climatically, the basin experiences a tropical, humid environment influenced by the southwest monsoon, which contributes about 90% of the annual rainfall. Precipitation ranges from 650 mm in the lower plains to 1,400 mm in the upper hilly regions, with an average of approximately 1,060 mm annually. Temperature variations are significant, ranging from 40 to 42 °C in summer and 8 to 13 °C in winter. The average annual flow discharge of the Narmada River is approximately 41.3 billion cubic meters, reflecting its substantial hydrological contribution to the region. Evaporation rates vary seasonally, ranging from 6 to 28 mm during the summer months (April to June) and 1 to 9 mm in winter (October to March), significantly influencing water availability.
Datasets and parameters
The dataset used in this study was sourced from India's Water Resources Information System (WRIS), managed by the Ministry of Jal Shakti in India. The WRIS provides a comprehensive database of water resource information, including river flow data, groundwater levels, and water quality data, ensuring the reliability and quality of the dataset. The dataset used in this study spans from 1978 to 2020. It includes daily level data from five strategically selected stations: Belkhedi, Gadarwara, Hoshangabad, Chhidgaon, and Mandleshwar, along with flow data from Mandleshwar as the target variable. These datasets provide comprehensive insights into the hydrological dynamics of the study area and serve as inputs for downstream flow prediction. Table 1 presents the key statistical characteristics of level data from the five stations, namely Belkhedi, Chhidgaon, Gadarwara, Hoshangabad, and Mandleshwar, and flow data from Mandleshwar. Metrics include mean, median, mode, standard deviation, kurtosis, skewness, and range (minimum to maximum values), providing a detailed overview of the dataset's variability and distribution.
Statistical summary of level and flow data for selected stations
| Station | Mean | Median | Mode | Standard deviation | Kurtosis | Skewness | Minimum | Maximum |
|---|---|---|---|---|---|---|---|---|
| Belkhedi (level) | 341.45 | 341.32 | 341.1 | 0.63 | 114.01 | 7.18 | 340.52 | 359.95 |
| Chhidgaon (level) | 287.75 | 287.58 | 287.5 | 0.669 | 92.76 | 7.11 | 287.02 | 301.81 |
| Gadarwara (level) | 323.34 | 323.03 | 322.94 | 10.39 | 1,025.15 | 22.78 | 32.34 | 665.86 |
| Hoshangabad (level) | 285.50 | 284.74 | 285 | 34.30 | 14,145.03 | 117.5 | 222.65 | 4,422 |
| Mandleshwar (level) | 140.02 | 140.11 | 140.06 | 7.4051 | 223.45 | −14.72 | 25 | 157.23 |
| Mandleshwar (flow) | 966.18 | 327.25 | 250 | 2,363.73 | 93.07 | 8.04 | 3.46 | 48,200 |
The key parameters considered for this study include:
(a) Water level: Daily measurements from the stations Belkhedi, Gadarwara, Hoshangabad, Chhidgaon, and Mandleshwar.
(b) Flow discharge: The flow data from Mandleshwar are used as the target variable.
(c) Temporal scope: Seasonal and annual variations to capture dynamic flow patterns.
The dataset was divided into training (70%), validation (15%), and testing (15%) subsets using a stratified random splitting approach, ensuring representative data distribution across subsets. This process was implemented using the train_test_split method from scikit-learn, with a random_state of 42 to ensure reproducibility.
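As a rough illustration, the 70/15/15 split can be reproduced with two successive calls to scikit-learn's `train_test_split` using `random_state=42` as stated. This is a minimal sketch on synthetic stand-in data (column names and values are placeholders, not the actual WRIS records); plain random splitting is shown here, whereas stratification on binned flow values would additionally require the `stratify` argument.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the WRIS dataset (values are illustrative only).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "Belkhedi": rng.normal(341.5, 0.6, 1000),
    "Chhidgaon": rng.normal(287.8, 0.7, 1000),
    "Gadarwara": rng.normal(323.3, 10.4, 1000),
    "Hoshangabad": rng.normal(285.5, 34.3, 1000),
    "Mandleshwar": rng.normal(140.0, 7.4, 1000),
    "Flow": rng.gamma(2.0, 480.0, 1000),
})
X, y = df.drop(columns="Flow"), df["Flow"]

# First split off 70% for training, then divide the remaining 30%
# equally into validation and test (15% each), reproducibly.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 700 150 150
```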
Data preparation
A comprehensive data-cleaning process was implemented in this study to ensure the dataset's quality and reliability. Missing values were addressed by removing rows with null entries using the dropna() method. This step was critical for maintaining data consistency and preventing biases or inaccuracies in model predictions, ensuring that the analysis relied on high-quality and complete data. To further enhance the predictive model's accuracy, an outlier detection and removal process was carried out. The Z-score method was employed to identify outliers in the variable, with data points exceeding three standard deviations from the mean flagged as outliers. These outliers were subsequently removed to improve the robustness of the analysis, minimizing the impact of extreme values and enhancing the overall reliability of the predictive model.
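The two cleaning steps described above (dropping null rows, then removing points beyond three standard deviations) can be sketched as follows, on an illustrative series rather than the actual WRIS data:

```python
import numpy as np
import pandas as pd

# Illustrative flow series: 200 plausible values, one missing entry,
# and one extreme outlier.
rng = np.random.default_rng(0)
values = list(rng.normal(1000.0, 200.0, 200)) + [np.nan, 50000.0]
df = pd.DataFrame({"Flow": values})

# Step 1: remove rows with null entries.
df = df.dropna()

# Step 2: flag points more than three standard deviations from the mean
# (the Z-score method) and drop them.
z = (df["Flow"] - df["Flow"].mean()) / df["Flow"].std()
df_clean = df[z.abs() <= 3]

print(len(df_clean))  # the NaN row and the extreme outlier are gone
```

Note that the Z-score filter is applied once here; in practice it may be iterated, since removing an extreme point changes the mean and standard deviation used to score the rest.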
METHODOLOGY
Methodology flowchart for downstream flow prediction in the Narmada River Basin using ML.
Model development
This study leverages level data from five strategically selected stations (Belkhedi, Gadarwara, Hoshangabad, Chhidgaon, and Mandleshwar), along with the flow data from Mandleshwar, to predict downstream flow at Mandleshwar. The methodology employs advanced ML models, including CatBoost, LGBM, Random Forest, and XGBoost, to capture the complex relationships between upstream-level data and downstream flow.
Model inputs
Daily level data from all five stations served as primary inputs, with the flow data of Mandleshwar used as the target variable for model training. These datasets, representing key hydrological parameters, were preprocessed to address missing values, outliers, and inconsistencies, ensuring high-quality inputs. The inclusion of flow data as the target variable allowed the models to establish a direct relationship between upstream levels and downstream flow.
Model training, validation, and testing
The dataset was divided into training, validation, and testing subsets to comprehensively represent diverse hydrological conditions as discussed before. During training, the models were optimized to minimize error metrics using performance evaluation metrics. Validation ensured the models' ability to generalize across varying flow conditions, while the testing set was used to evaluate the final model performance on unseen data, providing a reliable measure of predictive accuracy and robustness.
Feature engineering
Relevant features, including ‘Belkhedi,’ ‘Chhidgaon,’ ‘Gadarwara,’ ‘Hoshangabad,’ and ‘Mandleshwar,’ were selected based on their contribution to the river's flow characteristics. The target variable, ‘Flow,’ was normalized by dividing by 1,000 to scale the data effectively for model training and predictions, improving model convergence and accuracy.
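The feature selection and target scaling described above amount to the following sketch (the two rows of data are made up for illustration; the divisor of 1,000 is the one stated in the text):

```python
import pandas as pd

# Hypothetical raw frame: five level columns plus the Mandleshwar flow target.
df = pd.DataFrame({
    "Belkhedi": [341.4, 341.5],
    "Chhidgaon": [287.7, 287.8],
    "Gadarwara": [323.3, 323.4],
    "Hoshangabad": [285.5, 285.6],
    "Mandleshwar": [140.0, 140.1],
    "Flow": [966.0, 1250.0],
})

features = ["Belkhedi", "Chhidgaon", "Gadarwara", "Hoshangabad", "Mandleshwar"]
X = df[features]
# Scale the target by 1,000 so the models train on flow in thousands of
# m3/s; predictions are multiplied back by 1,000 afterwards.
y = df["Flow"] / 1000.0

print(y.tolist())
```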
ML models
The ML models employed in this study, including CatBoost, XGBoost, Random Forest, and LGBM, were selected for their robustness and suitability in addressing the challenges of hydrological predictions. CatBoost, known for its efficient handling of categorical data and the prevention of overfitting through ordered boosting, has shown significant advantages in recent studies (Habib et al. 2024). Its ability to process large datasets efficiently makes it highly applicable to the complex flow dynamics of the Narmada River Basin. XGBoost, another advanced algorithm, excels in computational efficiency and adaptability. Its sparsity-aware algorithm effectively manages missing data, and its gradient-boosting framework is well-suited for capturing nonlinear relationships. Studies such as Habib et al. (2023) validate its effectiveness in hydrological applications, highlighting its suitability for this study.
Random Forest, an ensemble learning method, is robust against overfitting and highly effective in handling high-dimensional data. Its ability to average predictions from multiple decision trees enhances accuracy and reduces variability, as demonstrated in various hydrological studies. This makes it a reliable choice for capturing intricate relationships between upstream-level data and downstream flow. LGBM, known for its efficiency and speed, employs techniques such as gradient-based one-side sampling (GOSS) and exclusive feature bundling to handle large-scale datasets with reduced computational cost. Its high accuracy and fast training make it an ideal choice for hydrological prediction tasks, as demonstrated in recent research. To rigorously evaluate the performance of these models, statistical metrics including mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), normalized root MSE (NRMSE), root mean square percent error (RMSPE), and R-squared (R2) were employed. These metrics provide a comprehensive analysis of predictive accuracy, with MAE and RMSE quantifying error magnitude, and R2 measuring the variance explained by the models.
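The six metrics can be computed as below. Note that NRMSE is taken here as RMSE normalized by the observed range and RMSPE as the root mean square of relative errors in percent; both are common conventions assumed rather than confirmed by the text.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Compute MAE, MSE, RMSE, NRMSE, RMSPE, and R2.

    NRMSE here divides RMSE by the observed range, and RMSPE is the
    root mean square of relative errors in percent (assumed conventions).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    nrmse = rmse / (y_true.max() - y_true.min())
    rmspe = np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2)) * 100
    r2 = r2_score(y_true, y_pred)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse,
            "NRMSE": nrmse, "RMSPE": rmspe, "R2": r2}

m = evaluate([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(m)
```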
CatBoost algorithm
CatBoost is a powerful ML algorithm that has gained popularity for its ability to handle categorical features efficiently. Developed by Dorogush, CatBoost stands for ‘Categorical Boosting,’ highlighting its strength in dealing with categorical data. Unlike traditional gradient-boosting algorithms, CatBoost incorporates a novel method for handling categorical variables, which is particularly useful in real-world applications where datasets often contain a mix of categorical and numerical features. The algorithm is based on gradient-boosting principles but includes optimizations to improve training speed and reduce overfitting, making it a robust choice for various predictive modeling tasks (Prokhorenkova et al. 2018).
One of the key features of CatBoost is its support for categorical features without the need for manual encoding. This is achieved through an innovative algorithm that effectively handles categorical variables internally, reducing the preprocessing burden on the user. Additionally, CatBoost includes advanced techniques such as ordered boosting, which optimizes the sequence of trees in the ensemble to improve model performance. These features make CatBoost a valuable tool for data scientists and ML practitioners working with complex datasets containing categorical features (Prokhorenkova et al. 2018).
CatBoost uses a boosting technique where each subsequent tree learns to correct the errors of the previous trees. The final prediction is the sum of predictions from all the trees in the model.
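This residual-correction scheme is the core of all gradient-boosting variants. A hand-rolled sketch with shallow scikit-learn trees illustrates it (plain squared-error boosting only; CatBoost adds ordered boosting and categorical handling on top of this idea, and the data below are synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem.
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, (300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Start from the mean; each new tree fits the residuals of the current
# ensemble, and the prediction is the running sum of tree outputs.
pred = np.full_like(y, y.mean())
trees, lr = [], 0.1
for _ in range(100):
    tree = DecisionTreeRegressor(max_depth=2).fit(X, y - pred)
    pred += lr * tree.predict(X)
    trees.append(tree)

mse_before = float(np.mean((y - y.mean()) ** 2))
mse_after = float(np.mean((y - pred) ** 2))
print(mse_after < mse_before)  # boosting reduces training error
```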
Light Gradient-Boosting Machine
LGBM is a powerful ML algorithm that has gained popularity for its efficiency and effectiveness in handling large-scale datasets. LGBM is based on the gradient-boosting framework and is designed to be faster and more accurate than traditional gradient-boosting algorithms. One of the key advantages of LGBM is its ability to handle categorical features directly, without the need for one-hot encoding, which can significantly reduce the computational complexity of the model (Ke et al. 2017). Additionally, LGBM uses a novel technique called GOSS to reduce the number of data instances used in each iteration of the boosting process, further improving its speed and efficiency (Ke et al. 2017).
LGBM has been successfully applied in various domains, including image recognition, natural language processing, and financial modeling, due to its ability to handle large datasets efficiently and its high level of accuracy. For example, Liu et al. (2021) used LGBM to predict the occurrence of severe convective storms, demonstrating its effectiveness in predicting rare events with high accuracy. Zheng et al. (2021) applied LGBM to predict the risk of heart failure in patients with diabetes, achieving superior performance compared to other ML algorithms. These studies highlight the versatility and effectiveness of LGBM in a wide range of applications, making it a valuable tool for researchers and practitioners in the field of ML.
Each tree in the LGBM model is trained sequentially, with each subsequent tree learning from the errors of the previous trees. The final prediction is the sum of predictions from all the trees in the ensemble. The LGBM algorithm uses a gradient-based optimization technique to train the trees, aiming to minimize the loss function, typically the MSE, between the predicted and actual flow values.
Random Forest
Random Forest is a powerful ML algorithm that is widely used for regression and classification tasks. It belongs to the ensemble learning family and is based on the principle of decision tree ensembles. In a Random Forest model, multiple decision trees are constructed during training, and the final prediction is made by averaging the predictions of individual trees (Breiman 2001). This approach helps reduce overfitting and improves the model's generalization ability, making it robust to noise and outliers in the data.
One of the key advantages of Random Forest is its ability to handle large datasets with high dimensionality. The algorithm is computationally efficient and can effectively deal with missing values and categorical variables without the need for extensive data preprocessing. Random Forest has been successfully applied in various domains, including remote sensing, bioinformatics, and finance, demonstrating its versatility and effectiveness in real-world applications (Liaw & Wiener 2002).
For a Random Forest with $N$ trees $T_1, \dots, T_N$, the prediction for an input $x$ is

$$\hat{y}(x) = \frac{1}{N} \sum_{i=1}^{N} T_i(x)$$
This equation represents the ensemble prediction made by averaging the predictions of all individual trees in the Random Forest model, where each tree's contribution is the value of the leaf node reached by the input.
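This averaging behaviour can be verified directly with scikit-learn: the forest's prediction coincides with the plain mean of the individual trees' predictions. The data below are synthetic and the hyperparameters illustrative, not the study's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data standing in for the level/flow records.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (200, 2))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Average the 50 individual trees' predictions by hand and compare
# with the forest's own output.
x_new = X[:5]
per_tree = np.stack([tree.predict(x_new) for tree in rf.estimators_])
manual_mean = per_tree.mean(axis=0)

print(np.allclose(manual_mean, rf.predict(x_new)))  # True
```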
Extreme Gradient Boosting
XGBoost is a powerful and popular ML algorithm known for its efficiency and effectiveness in a wide range of applications. It belongs to the ensemble learning method category and is based on the gradient-boosting framework. XGBoost has gained significant attention and acclaim for its ability to produce highly accurate predictions, particularly in structured/tabular data and regression and classification problems. One of the key strengths of XGBoost lies in its ability to handle complex relationships in the data and its robustness against overfitting, making it a preferred choice for many data scientists and practitioners (Chen & Guestrin 2016).
The algorithm's effectiveness stems from its innovative features, such as a novel sparsity-aware algorithm for handling missing data and a parallelized implementation that makes it efficient for large datasets. These features have contributed to XGBoost's success in various ML competitions and real-world applications. Additionally, XGBoost provides several hyperparameters that allow users to fine-tune the model's performance, making it flexible and adaptable to different types of datasets and problem domains (Chen & Guestrin 2015). The combination of XGBoost's performance, efficiency, and flexibility has established it as a prominent algorithm in the ML community.
For an XGBoost ensemble of $K$ trees $f_1, \dots, f_K$, the prediction for an input $x$ is

$$\hat{y}(x) = \sum_{k=1}^{K} f_k(x)$$
In this equation, the XGBoost model combines the predictions of multiple decision trees to produce the final prediction for downstream flow at ‘Mandleshwar’. Each decision tree learns a set of rules from the training data to make predictions, and the final prediction is the sum of the predictions of all trees in the model.
Performance evaluation metrics
For $n$ observations with observed flows $y_i$, predicted flows $\hat{y}_i$, and mean observed flow $\bar{y}$, the evaluation metrics are defined as

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{y_{\max} - y_{\min}}, \qquad \mathrm{RMSPE} = 100 \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Parameter settings
Regarding the parameter settings of the ML models in this study, CatBoost was configured with 5,000 iterations and a silent logging level to reduce output verbosity. LGBM was set to 5,000 iterations with 1,000 estimators, with verbosity adjusted to reduce the amount of printed information and row-wise tree building forced for improved efficiency. Random Forest was configured with 500 estimators. XGBoost was set to 1,000 estimators with a verbosity level of 0 to minimize output. These parameter settings were selected based on previous research and experimentation to optimize the performance of each model in predicting river flow in the Narmada River Basin.
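Collected as a configuration fragment, the settings above map onto each library's scikit-learn-style parameter names roughly as follows. The key names are assumptions where the text is ambiguous (for example, which LGBM knob the "5,000 iterations" refers to, given that LGBM's iteration count is its estimator count):

```python
# Hyperparameters as described in the text, expressed with each
# library's usual scikit-learn-style parameter names; values not
# stated in the text are deliberately omitted.
MODEL_PARAMS = {
    "CatBoost": {"iterations": 5000, "logging_level": "Silent"},
    "LGBM": {"n_estimators": 1000, "verbosity": -1, "force_row_wise": True},
    "RandomForest": {"n_estimators": 500},
    "XGBoost": {"n_estimators": 1000, "verbosity": 0},
}

print(sorted(MODEL_PARAMS))
```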
RESULTS AND DISCUSSION
(a) MSE for training, validation, and test sets across ML models ((m³/s)²). (b) MAE for training, validation, and test sets across ML models (m³/s). (c) RMSE for training, validation, and test sets across ML models (m³/s). (d) NRMSE for training, validation, and test sets across ML models. (e) RMSPE for training, validation, and test sets across ML models (%). (f) R2 for training, validation, and test sets across ML models.
In the training phase, XGBoost demonstrated exceptional performance, achieving the lowest error metrics across all models with an MAE of 0.0031, MSE of 0.00002, RMSE of 0.0048, and NRMSE of 0.00058. These values highlight its precision in capturing the flow dynamics with minimal error. Additionally, XGBoost's RMSPE of 2.99% and R2 of 0.99 indicate its capability to handle complex relationships in the data with outstanding accuracy. CatBoost followed with an MAE of 0.0401, MSE of 0.00583, RMSE of 0.0763, and NRMSE of 0.0093, showcasing strong predictive accuracy. Its RMSPE of 45.51% and R2 of 0.9958 reflect robust performance but with slightly higher errors compared to XGBoost. LGBM performed better than CatBoost in some metrics, with an MAE of 0.0236, MSE of 0.00255, RMSE of 0.0506, and NRMSE of 0.00616, indicating improved accuracy. Its RMSPE of 53.92% and R2 of 0.99 highlight its strong explanatory power. Random Forest, while competitive, exhibited slightly higher error metrics (MAE: 0.0221, RMSE: 0.0676, NRMSE: 0.0082) and a higher RMSPE of 67.65%, suggesting it is less reliable in capturing extreme variations.
In the validation phase, CatBoost achieved a balance between accuracy and robustness, with an MAE of 0.0632, MSE of 0.0587, RMSE of 0.2422, and NRMSE of 0.0297. Its RMSPE of 59.56% and R2 of 0.94 highlight its ability to generalize effectively while maintaining moderate prediction accuracy. LGBM delivered slightly better results, with an MAE of 0.0646, MSE of 0.0511, RMSE of 0.2262, and NRMSE of 0.0277. Its RMSPE of 86.90% indicates higher variability in predictions, despite achieving the highest R2 of 0.95 in this phase. XGBoost performed competitively, with an MAE of 0.0618, RMSE of 0.2438, NRMSE of 0.0299, and RMSPE of 41.00%. These results highlight XGBoost's ability to maintain relative prediction accuracy, with the lowest RMSPE among all models during validation. Random Forest also showed competitive results, with an MAE of 0.0617, RMSE of 0.2546, NRMSE of 0.0312, and RMSPE of 58.61%, but slightly lagged in terms of consistency and overall accuracy, as indicated by its R2 of 0.94.
The testing phase highlighted XGBoost's consistent superiority, achieving an MAE of 0.0679, MSE of 0.0541, RMSE of 0.2326, and NRMSE of 0.0292. Its RMSPE of 33.10% and R2 of 0.96 further confirm its reliability in making accurate predictions on unseen data. Random Forest followed closely, with the lowest RMSE among all models (0.2216), an NRMSE of 0.0278, and R2 of 0.96, indicating competitive accuracy. However, its RMSPE of 86.76% suggests higher prediction variability in extreme cases. CatBoost continued to deliver robust performance, achieving an MAE of 0.0677, MSE of 0.0548, RMSE of 0.2342, NRMSE of 0.0294, and an RMSPE of 54.72%. Its R2 of 0.96 indicates strong explanatory power but slightly higher errors compared to XGBoost. LGBM demonstrated consistent performance with an MAE of 0.0706, MSE of 0.0570, RMSE of 0.2388, and NRMSE of 0.0299. However, its RMSPE of 74.95% suggests greater variability in predictions compared to CatBoost and XGBoost.
Across all phases, XGBoost consistently delivered the most accurate predictions, as evidenced by its lowest error metrics. CatBoost and LGBM were strong contenders, with LGBM slightly outperforming CatBoost in the validation phase but showing higher variability in the testing phase. Random Forest exhibited reliable performance but with relatively higher RMSPE, suggesting it might struggle with extreme flow conditions. These findings emphasize the effectiveness of advanced ML models in hydrological applications, with XGBoost proving to be the most reliable for downstream flow prediction. The utilization of level data and rigorous preprocessing ensured accurate modeling of complex flow dynamics, providing actionable insights for flood forecasting, reservoir operations, and sustainable water resource management in the Narmada River Basin.
Model performance evaluation through prediction error analysis
(a) Prediction error plot for the CatBoost model. (b) Prediction error plot for the Random Forest model. (c) Prediction error plot for the XGBoost model. (d) Prediction error plot for the LGBM model.
The Random Forest model (Figure 4(b)) also shows good performance but with slightly higher variability compared to CatBoost. While a significant proportion of points are close to the perfect prediction line, there is a noticeable spread, especially in the 10–20% and 20–30% error ranges. This indicates that the model might occasionally struggle to capture more complex relationships in the data, potentially due to its ensemble structure, which may over-average certain predictions. The XGBoost model (Figure 4(c)) exhibits exceptional performance, with most data points tightly aligned with the perfect prediction line. The dominance of points in the 0–10% error range, along with minimal spread, reflects its ability to effectively generalize and make accurate predictions even for complex hydrological datasets. XGBoost's gradient-boosting mechanism appears to handle feature interactions and non-linearities effectively, resulting in highly accurate predictions.
The LGBM model (Figure 4(d)) also performs well but shows slightly higher dispersion compared to CatBoost and XGBoost. While the majority of points remain within the 0–10% and 10–20% error ranges, there is a greater spread in the higher error categories, particularly beyond 20%. This suggests that while LGBM is efficient in handling large datasets and provides competitive accuracy, it may face challenges in capturing certain complex patterns or edge cases in the data. This could be due to its reliance on histogram-based splitting techniques, which might lose some granularity in highly intricate datasets. Comparatively, XGBoost and CatBoost demonstrate superior performance, with minimal prediction error and better clustering around the perfect prediction line. Their ability to manage nonlinear relationships and feature interactions efficiently contributes to their higher accuracy. Random Forest and LGBM, while still robust, exhibit slightly more variability, reflecting their relative sensitivity to certain complexities in the data.
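The banded error ranges discussed above (0–10%, 10–20%, 20–30%) can be quantified directly rather than read off a plot. The following sketch is a plausible way to do so; the function name and the band edges are illustrative choices, not taken from the paper.

```python
def error_band_shares(obs, pred, bands=(10, 20, 30)):
    """Fraction of predictions whose absolute percentage error falls in
    each band: 0-10%, 10-20%, 20-30%, and >30% with the default edges."""
    apes = [abs(p - o) / abs(o) * 100 for o, p in zip(obs, pred)]
    edges = (0,) + tuple(bands) + (float("inf"),)
    shares = []
    for lo, hi in zip(edges, edges[1:]):
        # count points whose percentage error lies in [lo, hi)
        shares.append(sum(lo <= a < hi for a in apes) / len(apes))
    return shares
```

A model whose first share dominates (most points within 0–10%) corresponds to tight clustering around the perfect prediction line in Figure 4.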
Evaluation using Taylor diagrams
Taylor diagrams for validation results of the predictive models: (a) MAE for validation (m3/s), (b) MSE for validation (m3/s)2, (c) RMSE for validation (m3/s), (d) NRMSE for validation, (e) RMSPE for validation (%), and (f) R2 for validation.
Taylor diagrams for testing results of the predictive models: (a) MAE for test data (m3/s), (b) MSE for test data (m3/s)2, (c) RMSE for test data (m3/s), (d) NRMSE for test data, (e) RMSPE for test data (%), and (f) R2 for test data.
During the validation phase (Figure 5(a)–(f)), XGBoost demonstrated superior performance by consistently maintaining a closer position to the reference point across all metrics. This indicates its ability to achieve high correlation coefficients while minimizing prediction errors, such as RMSE and MAE. CatBoost performed competitively, showcasing high accuracy but with slightly higher error variability compared to XGBoost, as seen in the NRMSE and RMSPE diagrams (Figure 5(d) and 5(e)). Random Forest and LGBM exhibited moderate performance, with larger deviations from the reference point, suggesting their relatively lower reliability for scenarios requiring high precision.
In the testing phase (Figure 6(a)–(f)), XGBoost again emerged as the best-performing model, maintaining its position closest to the reference point in metrics such as RMSE, NRMSE, and R2 (Figures 6(c), 6(d), and 6(f)). This highlights its consistency and robustness in handling unseen data. CatBoost continued to deliver competitive results, with its performance slightly behind XGBoost. LGBM and Random Forest, while effective, displayed higher prediction variability, as indicated by their positions in the MAE and RMSPE diagrams (Figure 6(a) and 6(e)).
The results emphasize the ability of XGBoost to generalize well across validation and testing datasets, making it the most reliable model for downstream flow prediction. CatBoost's robust performance further supports its applicability in hydrological modeling. However, the higher variability observed in Random Forest and LGBM suggests that these models may be more suitable for less complex or non-critical applications.
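A classical Taylor diagram relates three statistics geometrically: the standard deviation of the predictions, their correlation with the observations, and the centred RMS difference. The sketch below computes these quantities; it assumes the standard Taylor geometry, since the diagrams in Figures 5 and 6 present the comparison per metric.

```python
import math

def taylor_stats(obs, pred):
    """Standard-deviation ratio, Pearson correlation, and centred RMS
    difference -- the three quantities a Taylor diagram relates."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    r = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / (n * so * sp)
    # law of cosines linking the three statistics on the diagram;
    # clamp at zero to guard against floating-point round-off
    crmsd = math.sqrt(max(0.0, so**2 + sp**2 - 2 * so * sp * r))
    return sp / so, r, crmsd
```

A model plotted close to the reference point has a standard-deviation ratio near 1, correlation near 1, and centred RMS difference near 0, which is how proximity to the reference point in Figures 5 and 6 translates into skill.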
Comparison with the existing studies
The application of ML models for hydrological predictions, including downstream flow prediction, has gained significant attention globally and in India. However, no studies have specifically focused on the Narmada River Basin using upstream-level data, as explored in this research. By leveraging advanced ML models, including XGBoost, CatBoost, Random Forest, and LGBM, this study demonstrated superior performance metrics compared to the existing studies. A recent study by Kedam et al. (2024) also investigated streamflow prediction in the Narmada River Basin using the same models with historical streamflow data from five gauging stations. In contrast, this study introduces upstream-level data from the five gauging stations as a novel input source. The use of upstream-level data was adopted as an alternative due to the challenges and inconsistencies often encountered in obtaining streamflow data. This innovative approach enables the models to better capture upstream influences and improve downstream predictions. The inclusion of upstream-level data has led to significant improvements in prediction accuracy, with XGBoost achieving an R2 of 0.96 and an RMSE of 0.2326 m3/s, surpassing the metrics reported by Kedam et al. (2024). This demonstrates the novelty and practical value of the methodological framework, providing enhanced insights for sustainable water management in the Narmada River Basin.
Several other studies have explored ML applications for river flow prediction in other Indian basins. For instance, Patel & Ramachandran (2015) utilized XGBoost for river flow prediction in the Cauvery Basin, achieving moderate accuracy with RMSE values ranging between 0.3 and 0.5 m3/s. In contrast, this study achieved an RMSE of 0.2326 m3/s in the testing phase, demonstrating improved precision. Similarly, Li et al. (2022) reported higher RMSE values of approximately 0.4 m3/s in the Upper Yangtze River Basin using Random Forest, whereas the application of the same model in the Narmada Basin yielded an RMSE of 0.2216 m3/s, highlighting the efficacy of the methodological framework and data preprocessing techniques employed.
Furthermore, Huang et al. (2021) compared various ML models, including XGBoost and LGBM, for river flow prediction in the Poyang Lake Basin, finding that XGBoost outperformed other models with R2 values of approximately 0.92. This study extends these findings by achieving an R2 of 0.96 with XGBoost, underscoring its capability to model complex hydrological dynamics. Farajpanah et al. (2024) explored hybrid ML approaches, such as wavelet-ML models, for improving runoff forecasting. While the hybrid model achieved comparable accuracy, the results in this study demonstrate the advantage of utilizing upstream-level data as a novel input source for enhancing downstream flow predictions.
A key observation is the very high accuracy of the obtained results compared to other studies. Specifically, XGBoost achieved an R2 of 0.96, outperforming many comparable studies in India and globally. This directly addresses the need to benchmark the findings with previous research. For example, studies in other regions, such as the Cauvery Basin (Patel & Ramachandran 2015), reported lower R2 values, highlighting the strength of the methodology and data utilized in this research. This reinforces the utility of advanced ML models and underscores the novelty of using upstream-level data for downstream flow prediction.
Region-specific factors, such as the Narmada Basin's unique climatic and hydrological characteristics, also contribute to the robustness of the models. The basin's tropical climate, monsoonal rainfall, and varying elevation profiles add complexities to flow prediction, which were effectively captured using the selected ML models. These findings highlight the applicability of advanced ML techniques in addressing similar hydrological challenges across other Indian river basins. By integrating upstream-level data and adopting a rigorous methodological framework, this study provides novel insights into downstream flow dynamics, offering enhanced prediction accuracy. The results also underscore the potential of ML models to support effective water resource management and flood mitigation strategies, particularly in regions with dynamic hydrological systems like the Narmada River Basin.
CONCLUSIONS
This study introduces a novel approach for downstream flow prediction in the Narmada River Basin by leveraging upstream-level data as the primary input for advanced ML models. The research addresses existing gaps and demonstrates enhanced predictive accuracy, contributing significantly to the field of hydrology. By employing advanced models such as XGBoost, CatBoost, Random Forest, and LGBM, this study highlights the scalability and potential of using level data for downstream flow prediction. Among these models, XGBoost consistently delivered superior performance, achieving an R2 of 0.96 and an RMSE of 0.2326 m3/s, surpassing previous studies in terms of accuracy. These findings underscore the importance of integrating upstream data in river flow prediction, which has broader implications for improving hydrological modeling.
Despite these contributions, certain limitations exist. The study relies on historical level data from the WRIS, which may contain inherent inaccuracies despite rigorous preprocessing. Moreover, the models do not explicitly account for extreme flow events or anomalies driven by climatic and land-use changes, limiting their application under rapidly evolving environmental conditions. To address these limitations, future research could explore the integration of additional data sources, such as remote sensing and climate projections, to enhance prediction robustness. The findings of this study have practical implications for addressing similar challenges in other river basins. The demonstrated accuracy and reliability of the proposed methodology provide valuable insights for optimizing water resource management, reservoir operations, and flood mitigation strategies. By offering a scalable framework for downstream flow prediction, this research contributes to advancing hydrological modeling and supports sustainable water management efforts.
ACKNOWLEDGEMENT
This work was supported by JSPS KAKENHI (Grant No. 22KK0160).
DECLARATION OF GENERATIVE AI IN SCIENTIFIC WRITING
During the preparation of this work, the author(s) used Grammarly to refine their English academic writing. After using this tool/service, the author(s) reviewed and edited the content as needed and take full responsibility for the publication's content.
AUTHOR CONTRIBUTIONS
V.K.: conceptualization, methodology, software. N.R.: data curation, writing – original draft preparation, and visualization. S.H.R.: correction in the revision, writing – revised manuscript. N.K.: visualization and investigation. U.R.: conceptualization, validation, and supervision. Y.H.: writing – reviewing and editing, validation, and supervision.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.