Abstract
Streamflow forecasting is crucial for planning, designing, and managing water resources. Accurate streamflow forecasting is essential in developing water resource systems that are both technically and economically efficient. This study tested several machine learning techniques to estimate monthly streamflow in the Hunza River Basin, Pakistan, using streamflow, precipitation, and air temperature data between 1985 and 2013. The techniques tested were adaptive boosting (AB), gradient boosting (GB), random forest (RF), and K-nearest neighbors (KNN). The models were developed using river discharge as the target variable, and air temperature and precipitation as the input variables. The models' performance was assessed via four statistical performance indicators, namely root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and the coefficient of determination (R2). The results obtained for RMSE, MSE, MAE, and R2 using the AB, GB, RF, and KNN techniques are (16.8, 281, 6.53, and 0.998), (95.1, 9,047, 61.5, and 0.921), (126.8, 16,078, 74.6, and 0.859), and (219.9, 48,356, 146.3, and 0.775), respectively. The results indicate that AB outperforms GB, RF, and KNN in predicting monthly streamflow for the Hunza River Basin. Machine learning, particularly AB, offers a reliable approach for streamflow forecasting, aiding hazard and water management in the area.
HIGHLIGHTS
This study used machine learning techniques to estimate monthly streamflow in Hunza River Basin, Pakistan.
The dataset included the mean monthly streamflow, precipitation, and temperature between 1985 and 2013.
AB, GB, RF, and KNN models were used for forecasting.
RMSE, MSE, MAE, and R2 were used as performance indicators.
The AB model outperformed the other models in terms of prediction accuracy.
INTRODUCTION
The hydrologic system is complex, nonlinear, and dynamic (Sivakumar & Singh 2012). Nonlinearity is common in hydrological time-series due to the coupling of natural elements such as climate, landform, soil, vegetation, and anthropogenic activities (Saco et al. 2018). River flow forecasting is necessary for managing water resources, especially controlling reservoir outflows during low and high river flow (Ficchì et al. 2016). Proper management of reservoir outflow needs accurate streamflow prediction (Ghumman et al. 2018), as do hydroelectric project design, real-time operation of water resource projects, efficient management strategies, and proactive mitigation efforts to reduce the environmental impact of climatic events (Anusree & Varghese 2016). A range of models has been used to simulate the complicated nonlinearity of streamflow processes, including autoregressive (AR) and autoregressive moving average (ARMA) models, which use classical statistics to assess historical data and create streamflow projections (Mehdizadeh & Kozekalani Sales 2018). However, these models may not always work well because they cannot represent the nonlinear dynamics involved in transforming rainfall to runoff (Weerts & El Serafy 2006). Machine learning and artificial neural networks (ANNs) have become popular since about 2000 for their ability to model complex nonlinear processes, particularly in the area of hydrological time-series (Yu et al. 2014). ANNs have been found superior to AR models in simulating river flow (Kisi & Kerem Cigizoglu 2007). ANN-based models have also been found superior to other models in simulating streamflow time-series (Gholami & Sahour 2022). ANNs have been used to simulate monthly streamflow in the River Nile, Egypt, and found superior to linear ARMA models (Elganiny & Eldwer 2018).
Although ANNs have been effective in modelling streamflow, they can still face issues such as overfitting, slow convergence, and local minima (Afan et al. 2016). To address these challenges, researchers have explored other subcategories of artificial intelligence (AI) in the hydrological and environmental fields. Machine learning-based approaches such as AB, GB, RF, and KNN have recently been proposed to tackle complex hydrologic processes (Hadi & Tombul 2018). For instance, Liao et al. (2020) developed a hybrid inflow prediction framework by combining gradient-boosting regression trees (GBRT) and the maximum information coefficient (MIC). GBRT models, which consist of a group of decision trees, can capture nonlinear correlations between input and output with extended lead periods (Goldstein et al. 2018). Ren et al. (2014) created a combined model using empirical mode decomposition (EMD) and KNN for forecasting yearly average rainfall. Snieder et al. (2021) used a hybrid approach, combining ANNs with additional algorithms such as Synthetic Minority Over-sampling Technique for Regression (SMOTER-AB) to achieve accurate results for two Canadian watersheds, the Bow River in Alberta and the Don River in Ontario. The Nash–Sutcliffe coefficients of efficiency for the Bow and Don River base models using SMOTER-AB were 0.95 and 0.80, respectively.
RF and extreme learning machine (ELM) are two popular algorithms used for time-series prediction in the hydrologic and environmental fields (Feng et al. 2022). RF requires minimal data pre-processing, making it convenient for practical applications (Kannangara et al. 2018). For instance, Tongal & Booij (2018) used RF to develop a simulation framework for predicting streamflow in four American rivers, based on temperature, precipitation, and potential evapotranspiration. ELM has also been applied in hydrologic and environmental studies. It is a feedforward neural network that can perform well in large-scale and complex data analysis (Ni et al. 2006). ELM can handle nonlinearity, noise, and high dimensionality, and does not require parameter tuning, which can be time-consuming (Song et al. 2017).
The aim of this study was to evaluate the effectiveness of various machine learning techniques, including AB, GB, RF, and KNN algorithms, in modelling monthly streamflow data for the Hunza River, Pakistan.
STUDY AREA, DATA COLLECTION, AND METHODS
Study area
The Hunza watershed is in Pakistan's northern Karakoram Mountain range, in the Upper Indus Basin, and covers 13,567.23 km2 (Garee et al. 2017). It is fed by 14 small and medium tributaries, which together sustain the main river's flow. The highest point in the area is Distaghil Sar, Pakistan's 7th-highest peak and the 19th-highest mountain in the world, standing 7,885 metres above sea level. The lowest is Danyore Gauging Station, at 1,370 metres above sea level. About 80% of the total inflow comes from the snowy and heavily glaciated region above 3,500 metres. The study area is a primary source of the Indus River, which contributes more than 12% of inflow to the Tarbela Dam (Ali & De Boer 2007).
Data collection
Several datasets were used for simulating monthly streamflow data in the river.
Digital elevation model
Hydroclimatic datasets
Meteorological data on mean precipitation and temperature for the period 1985 to 2013 were collected from the Pakistan Meteorological Department (PMD). Monthly streamflow data for the period 1985 to 2013 were obtained from the Water and Power Development Authority (WAPDA).
Methods
Gradient boosting
Gradient boosting (GB) is a well-known machine learning algorithm used for regression and classification tasks. It builds an ensemble of decision trees sequentially to approximate the underlying function. Pseudo-residuals, which represent the gradients of the loss function, are employed to train each individual tree. The final prediction of the ensemble is obtained by combining the previous approximation with the contribution of the current tree, weighted appropriately. GB incorporates regularization techniques and early stopping to mitigate overfitting. Moreover, it demonstrates fast execution, scalability, and efficiency, making it highly suitable for handling large datasets.
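The sequential tree-building, shrinkage weighting, and early stopping described above can be sketched with scikit-learn's `GradientBoostingRegressor`. The precipitation/temperature inputs and the discharge response below are synthetic stand-ins for the study's variables, not the Hunza records, and scikit-learn is an assumed implementation choice (the paper does not name its software).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 348  # roughly 29 years of monthly records, as in 1985-2013
X = np.column_stack([
    rng.gamma(2.0, 30.0, n),   # synthetic precipitation (mm)
    rng.normal(5.0, 10.0, n),  # synthetic air temperature (deg C)
])
# Hypothetical nonlinear discharge response plus noise
y = 2.0 * X[:, 0] + 15.0 * np.maximum(X[:, 1], 0.0) + rng.normal(0.0, 10.0, n)

model = GradientBoostingRegressor(
    n_estimators=200,      # maximum number of sequential trees
    learning_rate=0.05,    # shrinkage weighting each tree's contribution
    max_depth=3,           # shallow trees fit the pseudo-residuals
    validation_fraction=0.2,
    n_iter_no_change=10,   # early stopping to mitigate overfitting
    random_state=0,
)
model.fit(X, y)
pred = model.predict(X)
```

Each tree is fitted to the pseudo-residuals of the current ensemble, and its shrunken prediction is added to the running approximation.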
Adaptive boosting
Adaptive boosting (AB) is an ensemble method that combines multiple classifiers iteratively. It selects weak classifiers from a predefined set and assigns them coefficients. The algorithm learns from the training data to create a discriminant function through weighted voting of the weak classifiers. This enables reliable classification by aggregating the predictions of the weak classifiers.
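For the regression task in this study, the iterative reweighting idea can be sketched with scikit-learn's `AdaBoostRegressor`, whose default weak learner is a shallow regression tree. The data below are synthetic placeholders for the study's inputs, and scikit-learn is an assumed implementation choice.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(0)
n = 348
X = np.column_stack([
    rng.gamma(2.0, 30.0, n),   # synthetic precipitation (mm)
    rng.normal(5.0, 10.0, n),  # synthetic air temperature (deg C)
])
y = 2.0 * X[:, 0] + 15.0 * np.maximum(X[:, 1], 0.0) + rng.normal(0.0, 10.0, n)

# Each boosting round reweights the training samples toward those the
# previous weak learner predicted worst, and each learner receives a
# coefficient used in the final weighted combination.
model = AdaBoostRegressor(n_estimators=100, learning_rate=0.5, random_state=0)
model.fit(X, y)
pred = model.predict(X)
```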
Random forest
Random forest (RF) is a popular machine learning approach developed by Breiman that offers stability and generalization capabilities. It consists of an ensemble of decision trees, where each tree is built by randomly selecting samples and attributes from all predictors. RF's final output is determined through majority voting of the decision trees' outputs. Out-of-bag samples, obtained by removing certain items from the original dataset, are commonly used to evaluate RF's performance. Two key parameters in RF calibration are the number of trees (ntree) and the number of predictors evaluated at each node (mtry). The optimal value of mtry can be determined empirically, while the choice of ntree significantly affects the forecast results and is determined through experimentation.
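In scikit-learn (an assumed implementation, with synthetic stand-in data), ntree and mtry correspond to `n_estimators` and `max_features`, and out-of-bag evaluation is enabled with `oob_score`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 348
X = np.column_stack([
    rng.gamma(2.0, 30.0, n),   # synthetic precipitation (mm)
    rng.normal(5.0, 10.0, n),  # synthetic air temperature (deg C)
])
y = 2.0 * X[:, 0] + 15.0 * np.maximum(X[:, 1], 0.0) + rng.normal(0.0, 10.0, n)

model = RandomForestRegressor(
    n_estimators=500,  # ntree: number of trees in the ensemble
    max_features=1,    # mtry: predictors tried at each split (of the 2 inputs)
    oob_score=True,    # score each tree on the samples left out of its bootstrap
    random_state=0,
)
model.fit(X, y)
oob_r2 = model.oob_score_  # R2 estimated from out-of-bag samples
```

Because each tree sees only a bootstrap sample, the held-out (out-of-bag) observations give a built-in validation estimate without a separate test split.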
K-nearest neighbors
K-nearest neighbors (KNN) is a widely used algorithm that requires no training. During testing, predictions are made by comparing an input to the values of its nearest neighbours in the training data. It operates by assigning a value to a query point based on its proximity to its nearest neighbours. The distance between objects is typically calculated using the Euclidean distance formula, which is the square root of the sum of squared differences between each feature of the objects. KNN is applied in various fields for its simplicity and effectiveness.
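The Euclidean distance and the neighbour-averaging step can be sketched as follows; the tiny dataset is illustrative only, and scikit-learn's `KNeighborsRegressor` (whose default Minkowski metric with p=2 is exactly the Euclidean distance) is an assumed implementation choice.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def euclidean(a, b):
    """Square root of the sum of squared differences between features."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Toy training data: two features, one target value per sample.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])

model = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# The two nearest neighbours of [1.5, 1.5] are [1, 1] and [2, 2];
# the prediction is the mean of their targets, (1 + 2) / 2 = 1.5.
pred = model.predict([[1.5, 1.5]])
```

There is no fitting step beyond storing the training data, which is why KNN trains instantly but must search the stored samples at prediction time.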
MODEL PERFORMANCE
Evaluation of a machine learning model's performance is essential in its development. Several statistical indicators were used in this study, including root mean square error (RMSE), mean square error (MSE), mean absolute error (MAE), and coefficient of determination (R2). These indicators are commonly used in regression tasks to quantify the difference between the predicted and actual values.
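The four indicators can be computed with scikit-learn's metrics module (an assumed tooling choice; the observed/predicted arrays below are hypothetical values, not Hunza data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical observed vs. predicted monthly discharges
obs = np.array([3.0, 5.0, 7.0])
pred = np.array([2.0, 5.0, 9.0])

mse = mean_squared_error(obs, pred)    # mean of squared errors
rmse = float(np.sqrt(mse))             # same units as the discharge itself
mae = mean_absolute_error(obs, pred)   # mean of absolute errors
r2 = r2_score(obs, pred)               # 1 - SS_res / SS_tot
```

RMSE and MAE are in the units of the target variable, while R2 is dimensionless, which is why the table below reports values on very different scales.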
Root mean square error
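With Q_i the observed discharge, Q̂_i the predicted discharge, and n the number of months, RMSE follows its standard definition:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_i - \hat{Q}_i\right)^2}
```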
Mean square error
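MSE is the square of RMSE, i.e. the average of the squared prediction errors:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Q_i - \hat{Q}_i\right)^2
```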
Mean absolute error
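MAE averages the absolute prediction errors, so it penalizes large errors less heavily than RMSE:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Q_i - \hat{Q}_i\right|
```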
Coefficient of determination (R2)
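R2 compares the residual sum of squares with the total variance of the observations about their mean Q̄:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(Q_i - \hat{Q}_i\right)^2}{\sum_{i=1}^{n}\left(Q_i - \bar{Q}\right)^2}
```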
DATA EXPLORATION
RESULTS AND DISCUSSION
Table 1 presents an overview of the statistical outcomes of the various models' prediction accuracy.
| Model | RMSE | MSE | MAE | R2 |
|---|---|---|---|---|
| AB | 16.8 | 281 | 6.53 | 0.998 |
| GB | 95.1 | 9,047 | 61.5 | 0.921 |
| RF | 126.8 | 16,078 | 74.6 | 0.859 |
| KNN | 219.9 | 48,356 | 146.3 | 0.775 |
Table 1 demonstrates that AB outperforms the other three models in terms of prediction accuracy. Previous studies have also reported the better prediction capability of AB, especially for large-scale watersheds or low runoff periods (Liu et al. 2014). AB has demonstrated superior generalization ability, indicating its capability to capture underlying patterns effectively and make accurate predictions on unseen data. This highlights AB's reliability and suitability for real-world applications across various domains (Ying et al. 2013).
KNN had the lowest prediction performance of the four. This is because KNN has no training procedure and does not attempt to maximize any effectiveness metric. On the other hand, GB performed better than RF and KNN in terms of RMSE, MSE, MAE, and R2 (Table 1). The literature also suggests that the Extreme GB approach outperforms both RF and KNN in terms of accuracy (Venkatesan & Mahindrakar 2019).
In comparison to KNN, RF demonstrates superior performance, although it has lower performance indicators when compared to both AB and GB.
It is clear from Figures 4 and 5 that AB predicts streamflow in the Hunza River Basin much better than the other three models, as there is a greater overlap between the observed and predicted flows. KNN, by contrast, exhibits the least overlap of the four models.
The median value represents the typical streamflow condition, excluding the effects of extreme events such as floods and droughts. The median values of the observed data, and the forecasts from AB, GB, RF, and KNN are 91.23, 102.87, 121.69, 132.11, and 202.32, respectively. GB and RF have almost the same median values, while KNN has the highest and AB has the lowest. The distributions of the observed data and that predicted by the AB model are also similar in both general and extreme conditions (Figure 6).
CONCLUSIONS
Four machine learning models were used in this study – AB, GB, RF, and KNN – to predict monthly streamflow for the Hunza River Basin, Pakistan, using precipitation and temperature data as input and discharge as the output. The performance indicators used to evaluate the models were RMSE, MSE, MAE, and R2. AB outperformed the other three models, with the lowest RMSE, MSE, and MAE and the highest R2 (16.8, 281, 6.53, and 0.998, respectively). In contrast, the KNN model had the lowest performance among the models due to its lack of a training procedure. The findings suggest that the AB model can be used to forecast monthly streamflow reliably for the Hunza River and potentially for other watersheds.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.