Abstract
Pakistan, being an agricultural country, highly depends on its natural water resources which originate from the upper regions of Hindu Kush-Karakoram-Himalaya Mountains and nourish one of the world's largest Indus Basin irrigation systems. This paper presents streamflow modelling and forecasting using signal difference average (SDA) based variational mode decomposition (VMD) combined with machine learning (ML) methods at Chitral and Tarbela stations on the Indus River network. For this purpose, VMD based random forest (VMD-RF), gradient boosting machine (VMD-GBM) and Bayesian regularized neural network (VMD-BRNN) have been chosen. Moreover, traditional time series flow models, that is seasonal autoregressive integrated moving average (SARIMA) and classical decomposition approach with particle swarm optimization-based support vector regression (PSO-SVR), are considered as benchmark models for comparison. The results show that overall, VMD-BRNN performed best, followed by VMD-GBM and VMD-RF, whereas SARIMA and PSO-SVR ranked last. Overall, SARIMA and PSO-SVR are failing to capture most of the peaks even during the training period whereas hybridization of VMD and ML methods has shown increased robustness of the models. The results show that the influential role of the high dimensional components and robustness on the river flow may be explored by most optimum SDA based VMD signals hybrid with BRNN method.
HIGHLIGHTS
SDA based VMD combined with machine learning model is developed for streamflow modelling and forecasting at Chitral and Tarbela stations.
Modelling and forecasting capability are enhanced via SDA based VMD combined with Machine Learning methods.
VMD-BRNN is more capable of capturing and forecasting streamflow pattern, especially peaks as compared to VMD-RF and VMD-GBM methods.
INTRODUCTION
It has been highlighted by many researchers and decision-makers about the existing hydrological resources related threats of frequent occurrence of floods and drought at global and regional level (Ostad-Ali-Askari et al. 2017). The history of floods has been reported, including a recent devastating event in 2010 due to heavy rains, which affected almost all the regions of Pakistan (Houze Jr et al. 2011). Low to moderate level of floods arrive every year, depending on the hydrological cycle, however, after every 7–8 years, severe flood unexpectedly appears due to some external forcing factors such as large-scale circulation and other global variations. They are responsible for the loss of lives, property damage and the destruction of bridges, dams and riverine areas. Snow/glacier melts from Hindu Kush-Karakoram-Himalaya Mountains are an elementary source of fresh water to the Indus River and feed one of the world's largest irrigation systems, whereas monsoon and seasonal rainfall are the major sources of water in the system (Hassan & Ansari 2015). Consequently, in some years, abundant rainfall and unexpected rise in snow/glacier melt due to high temperatures may result in flooding of the river. On the other hand, less or no seasonal precipitation in other years plays an important role in setting severe hydrological drought conditions.
Mathematical models have proven their capabilities in the most important tasks for problem solving in hydrology. An innovation was provided by Box et al. (2015) in the form of stochastic time series models such as the autoregressive moving average (ARMA) and further modifications to those models, for example autoregressive integrated moving average (ARIMA) and their seasonal counterpart (SARIMA) (Wood & O'Connell 1985; Tadesse & Dinka 2017). Furthermore, the modification of stochastic methods and implementation of machine learning (ML) techniques have led to an understanding of adequate modelling tactics according to the nature of the time series. Additionally, data-driven techniques due to their self-regularization characteristic, do not depend on the physical behaviour of the hydrological phenomenon and lead to an adequate accuracy in flow forecasting (Poul et al. 2019). Among available data-driven methods, Support Vector Regression (SVR) combined with optimization techniques such as genetic algorithm (GA), particle swarm optimisation (PSO), firefly algorithm (FA) etc., random forest (RF) and gradient boosting machine (GBM) are extensively applied in various fields, including hydrology and have proven their capabilities in model building and forecasting (Elbisy 2015; Touzani et al. 2018; Liu & Sun 2019; Liu et al. 2020). Moreover, conventional artificial neural network (ANN) may capture nonlinearity of the streamflow adequately well, which results in over fitting of the modelled data and hence may fail in the testing period with unseen data (Wang et al. 2021). Furthermore, a regularization approach in Bayesian regularized neural network (BRNN) using a penalty term may handle the complexity of the final network avoiding over/under estimation (Sinecen 2019), discussed in detail in the methodology section. BRNN has been utilized in model building and forecasting of different time series data with esteemed performance (Sinecen 2019), however, the same has not been studied in combination with signal difference average (SDA) based VMD for streamflow forecast previously.
Data pre-processing techniques have been coupled with different ML methods for accuracy enhancement of the models (Zuo et al. 2020; Wu et al. 2021). Wavelet transforms, empirical and ensemble empirical mode decomposition techniques are commonly used for decomposition of actual time series data into sub-series (Wu et al. 2021). Among recently available data decomposition techniques, variational mode decomposition (VMD) proposed by Dragomiretskiy & Zosso (2013) with solid mathematical foundation has been utilized successfully to enhance the robustness of the models in various fields including hydrology (Isham et al. 2018; Wang et al. 2021). Moreover, an important step in VMD is the selection of level of decomposition or mode number (k) of the intrinsic mode function (IMF) and it may affect further computation if not selected appropriately.
Furthermore, different approaches have been reported for selection of k in literature such as: the traditional empirical mode decomposition (EMD) method (Wu et al. 2021) or by observing the existence of central-frequency distortion of the previous IMF component (Zuo et al. 2020), whereas determination of k, based on the similarity concept between actual and ensemble decomposed IMFs, is applied in this paper known as SDA. The motivation for opting for the SDA method is its efficiency (Isham et al. 2018) of selection of k on different data sets with and without noise. A suitable selection of k can boost the effectiveness of VMD over the traditional EMD approach (Isham et al. 2018).
This paper presents streamflow modelling and forecasting using three approaches: (i) traditional time series method seasonal autoregressive integrated moving average (SARIMA); (ii) classical decomposition with particle swarm optimization-based support vector regression (PSO-SVR); and (iii) VMD based ML methods like VMD-RF, VMD-GBM and VMD-BRNN. One of the main objectives of this paper is to distinguish the effective utilization of SDA based VMD on selected ML methods over conventional approaches, which has never been applied previously.
DATA AND METHODOLOGY
Study area
The Chitral River, one of the major tributaries contributing indirectly to the Indus River flow, is most of the time nourished by glacial melt from the upper regions of Pakistan. The river flow characteristics over the years are steady with the only exception of becoming is enhanced during summer months temperature-dependent snow/glacier melt (Hassan & Ansari 2015). The Chitral River situated at 35°50 N and 71°48 E (Hassan & Ansari 2015) in the upstream section between Mastuj and Chitral towns (average elevation of 1500 m.a.s.l) and is known as Yarkhum River. Unlike Chitral River, Tarbela station is one of the major reservoirs of the Indus River with an annual inflow of 175.2 × 109 m3/year consisting of about 70% only from melting of snow/glacier (Charles et al. 2018). The reservoir has been functional since 1976, having a storage capacity of about 10.3 billion m3, which is now reduced by 28%. It mainly serves in hydropower generation and irrigation while regulating floods triggered by summer monsoon (Khattak et al. 2015). Monthly observed streamflow data of the Chitral River at Chitral and the Indus River at Tarbela stations from April 1969 to March 2008 (468 data points) and from April 1976 to March 2011 (420 data points), respectively, were acquired from the Water and Power Development Authority (WAPDA), Pakistan. Moreover, the last 36 monthly values from each dataset (April 2005 to March 2008 of Chitral and April 2008 to March 2011 of Tarbela) are exempt from model building and the same are used for model testing.
Streamflow data has a strong seasonal pattern for both the stations (Figure 1). Moreover, the data for Chitral station has smooth recession flow whereas Tarbela station shows a jerky pattern (Figure 1). This shows that the water comes to Tarbela from the huge upper Indus basin comprised of different climatic regions (Charles et al. 2018). Furthermore, data average (Av) and standard deviation (SD) are analysed to understand the stationarity of the data estimated from non-overlapping equal length intervals (Table 1). It is found that the data has variation in Av and SD values (Av = 23.97 and SD = 18.67) and (Av = 121.91 and SD = 173.56) for Chitral and Tarbela stations, respectively. Additionally, Av and SD for the last interval (including the testing period) has deviated values as compared to other intervals at the Chitral station (Table 1). On the contrary, variation in Av is minimum among all the intervals for Tarbela, showing consistency in the data pattern.
Basic statistics of streamflow data at Chitral and Tarbela stations
Parameters . | Chitral (m3/s) data . | Tarbela (m3/s) data . | ||||
---|---|---|---|---|---|---|
Entire . | Training . | Testing . | Entire . | Training . | Testing . | |
Minimum | 56.38 | 56.38 | 67.14 | 269.01 | 269.01 | 348.03 |
Maximum | 1251 | 1134.48 | 1251 | 9455.05 | 8706.53 | 9455.05 |
Average (Av) | 281.47 | 277.87 | 322.87 | 2355.38 | 2354.78 | 2379.88 |
Av1 | 282.61 | – | – | 2323.98 | – | – |
Av2 | 271.54 | – | – | 2438.76 | – | – |
Av3 | 274.45 | – | – | 2316.85 | – | – |
Av4 | 295.51 | – | – | 2341.95 | – | – |
Standard Deviation (SD) | 274.31 | 270.82 | 311.01 | 1830.17 | 1822.90 | 1962.11 |
SD1 | 268.79 | – | – | 1716.77 | – | – |
SD2 | 267.59 | – | – | 1890.33 | – | – |
SD3 | 273.03 | – | – | 1863.55 | – | – |
SD4 | 286.26 | – | – | 1842.57 | – | – |
Kurtosis | 0.49 | 0.38 | 0.96 | 1.04 | 0.79 | 3.22 |
Skewness | 1.28 | 1.26 | 1.37 | 1.27 | 1.23 | 1.63 |
Parameters . | Chitral (m3/s) data . | Tarbela (m3/s) data . | ||||
---|---|---|---|---|---|---|
Entire . | Training . | Testing . | Entire . | Training . | Testing . | |
Minimum | 56.38 | 56.38 | 67.14 | 269.01 | 269.01 | 348.03 |
Maximum | 1251 | 1134.48 | 1251 | 9455.05 | 8706.53 | 9455.05 |
Average (Av) | 281.47 | 277.87 | 322.87 | 2355.38 | 2354.78 | 2379.88 |
Av1 | 282.61 | – | – | 2323.98 | – | – |
Av2 | 271.54 | – | – | 2438.76 | – | – |
Av3 | 274.45 | – | – | 2316.85 | – | – |
Av4 | 295.51 | – | – | 2341.95 | – | – |
Standard Deviation (SD) | 274.31 | 270.82 | 311.01 | 1830.17 | 1822.90 | 1962.11 |
SD1 | 268.79 | – | – | 1716.77 | – | – |
SD2 | 267.59 | – | – | 1890.33 | – | – |
SD3 | 273.03 | – | – | 1863.55 | – | – |
SD4 | 286.26 | – | – | 1842.57 | – | – |
Kurtosis | 0.49 | 0.38 | 0.96 | 1.04 | 0.79 | 3.22 |
Skewness | 1.28 | 1.26 | 1.37 | 1.27 | 1.23 | 1.63 |
Average (Av) and Standard Deviation (SD) are further computed for four equal and non-overlapping intervals.
Strong seasonal pattern is obvious from the (I) Time series, (II) ACF and (III) PACF plots of (a) Chitral River flow at Chitral stations and (b) Indus River flow at Tarbela station. Moreover, PACF has significant spikes at lag 1 and 2 suggesting the expected order of the AR model.
Strong seasonal pattern is obvious from the (I) Time series, (II) ACF and (III) PACF plots of (a) Chitral River flow at Chitral stations and (b) Indus River flow at Tarbela station. Moreover, PACF has significant spikes at lag 1 and 2 suggesting the expected order of the AR model.
Seasonal autoregressive integrated moving average model





The adequate SARIMA models are selected on the basis of minimum Akaike information criterion (AIC) values and are found to be SARIMA (1,0,0) × (1,1,1)12 and SARIMA (0,0,1) × (1,1,2)12 for Chitral and Tarbela, respectively.
Particle swarm optimization-support vector regression
SVR is a set of supervised learning methods based on statistical learning theory used for classification and regression analysis, introduced by Vapnik and Chervonenkis (Chervonenkis 2013). A regression via SVR is obtained for a data set of length N by mapping input vector xi into a feature space using a nonlinear function ϖ(x) and further finding the regression function, f(x) =Ω .ϖ (x) + b. Hence, approximating the more accurate output yi with error bounded by δ, such that weight vector Ω and bias value b are the parameters of the regression function. Moreover, for the general purpose of SVR methodology, a kernel function K(xi, x) is incorporated that satisfies the Mercer's theorem (Vapnik 2013). The Radial Basis Function (RBF) is applied from available kernel functions as it can reduce the complexity for input data by effectively adjusting cost (C) and gamma (γ) values (Elbisy 2015), defined as: , whereas the width of the Gaussian function is controlled by the user-defined value of γ >0. Nevertheless, the performance of SVR is significantly affected by the parameters: δ, C and γ. Therefore, proper tuning of these parameters is essential for adequate training of the SVR. Among several available approaches, PSO is selected for tuning of the parameters.
Variational mode decomposition







Random forest
RF are adaptable ML algorithms based on several independent decision trees first introduced by Breiman in 2001 (see for more details Shabani et al. 2020). The structure of the tree is simply defined as branching to two nodes known as root and leaf, respectively (Shabani et al. 2020), where each tree grows by dividing the data with self-search of random samples (see reference therein Shabani et al. 2020). This randomization mitigates extreme overfitting with robust anti-noise ability and leads to extraordinarily quick calculations (Liu & Sun 2019). Error calculation in RF depends on the placement of the data in their final subspace (leaf). Concisely, similarity between A and B in a subspace Q(A, B) is actually the ratio of the times when they appeared in the same leaf. Moreover, the functioning of the classification tree is independent of the values of predictors therefore, inadequacy of similarity is bound to the type of predictors. In general, for any n number of samples with sample feature dimension K and number of decision trees J in a random forest, j decision trees are constructed from actual samples via bootstrap. After that, k samples are withdrawn having dimension K for training purposes of several decision trees. These decision trees are allowed to grow as much as possible and the results obtained from each tree are then averaged for the final RF model (Liu & Sun 2019).
Gradient boosting machine
Initially, the GBMs were proposed for the purpose of classification in the field of ML, later on, connected to the idea of loss function in the statistical aspect (regression) by Friedman (2001). In GBM, a regression tree serves as the base model, whereas residuals (gradients) from the base model consequently fit new models which later on combines with the base model. Moreover, combining of prediction from the previous trees is controlled by involving a shrinking parameter (learning rate) ranges between 0 and 1 (Touzani et al. 2018). Furthermore, regression trees being the base model bound the number of samples in each node in order to minimize the overall variance of the model.
Bayesian regularized neural network

Proposed methodology
Proposed framework for streamflow modelling and forecasting is designed as:
Step 1: Determination of level k: An appropriate k is determined using a similarity concept of IMFs sum and input signals via , where y is the observed streamflow signal and Yk is the sum of decomposed IMFs. Moreover, the SDA values are estimated for each k (2–15) and k is selected to the corresponding minimum SDA value.
Step 2: Dataset acquired in step 1 is further split into training (85%) and testing (15%).
Step 3: Selection of set of predictors and target: Significant lags are selected for each IMF via partial autocorrelation function (PACF) plot, considering jth -1 lags outside the confidence interval and greater than 0.5 are indicated as the input predictor of the respective IMF (Zuo et al. 2020). After combining all the predictor's lag, maximum (jth +1) lag of streamflow data is selected as target.
Moreover, SARIMA is applied on streamflow data using the forecast() package, RF/GBM and BRNN are applied on VMD based predictors using h2o() and brnn() packages in R, respectively. Furthermore, PSO-SVR is applied on seasonally adjusted data using a classical additive decomposition approach resulting in the components: trend, seasonal and remainder using pso() and SVR() available in the pyswarm and sklearn packages in the python environment, respectively. Additionally, model(s) accuracy is measured via Root Mean Square Error (RMSE), sMAPE, and coefficient of correlation (CC) for training, testing and combined modelled data (Table 2).
Accuracy measurements of training, testing and overall data ranges
Model . | Station . | Training . | Testing . | Overall . | ||||||
---|---|---|---|---|---|---|---|---|---|---|
RMSE . | sMAPE (%) . | CC . | RMSE . | sMAPE (%) . | CC . | RMSE . | sMAPE (%) . | CC . | ||
SARIMA | Chitral | 60.937 | 10.052 | 0.974 | 86.782 | 16.335 | 0.975 | 63.305 | 10.536 | 0.973 |
Tarbela | 643.035 | 18.495 | 0.936 | 784.759 | 15.604 | 0.919 | 656.478 | 18.246 | 0.934 | |
PSO-SVR | Chitral | 49.943 | 7.197 | 0.983 | 82.528 | 14.995 | 0.977 | 53.170 | 7.798 | 0.981 |
Tarbela | 615.942 | 16.896 | 0.941 | 735.834 | 13.283 | 0.933 | 627.197 | 16.584 | 0.940 | |
VMD-RF | Chitral | 19.504 | 4.925 | 0.998 | 116.123 | 39.751 | 0.952 | 37.290 | 7.610 | 0.991 |
Tarbela | 162.051 | 6.429 | 0.997 | 704.180 | 20.606 | 0.949 | 258.461 | 7.653 | 0.991 | |
VMD-GBM | Chitral | 16.610 | 5.668 | 0.998 | 104.602 | 29.847 | 0.952 | 33.138 | 7.532 | 0.993 |
Tarbela | 134.345 | 5.704 | 0.997 | 561.766 | 16.540 | 0.963 | 209.129 | 6.640 | 0.994 | |
VMD-BRNN | Chitral | 6.644 | 3.054 | 0.9996 | 50.204 | 15.204 | 0.987 | 15.331 | 3.990 | 0.998 |
Tarbela | 59.972 | 3.016 | 0.9995 | 244.178 | 10.651 | 0.993 | 91.834 | 3.675 | 0.999 |
Model . | Station . | Training . | Testing . | Overall . | ||||||
---|---|---|---|---|---|---|---|---|---|---|
RMSE . | sMAPE (%) . | CC . | RMSE . | sMAPE (%) . | CC . | RMSE . | sMAPE (%) . | CC . | ||
SARIMA | Chitral | 60.937 | 10.052 | 0.974 | 86.782 | 16.335 | 0.975 | 63.305 | 10.536 | 0.973 |
Tarbela | 643.035 | 18.495 | 0.936 | 784.759 | 15.604 | 0.919 | 656.478 | 18.246 | 0.934 | |
PSO-SVR | Chitral | 49.943 | 7.197 | 0.983 | 82.528 | 14.995 | 0.977 | 53.170 | 7.798 | 0.981 |
Tarbela | 615.942 | 16.896 | 0.941 | 735.834 | 13.283 | 0.933 | 627.197 | 16.584 | 0.940 | |
VMD-RF | Chitral | 19.504 | 4.925 | 0.998 | 116.123 | 39.751 | 0.952 | 37.290 | 7.610 | 0.991 |
Tarbela | 162.051 | 6.429 | 0.997 | 704.180 | 20.606 | 0.949 | 258.461 | 7.653 | 0.991 | |
VMD-GBM | Chitral | 16.610 | 5.668 | 0.998 | 104.602 | 29.847 | 0.952 | 33.138 | 7.532 | 0.993 |
Tarbela | 134.345 | 5.704 | 0.997 | 561.766 | 16.540 | 0.963 | 209.129 | 6.640 | 0.994 | |
VMD-BRNN | Chitral | 6.644 | 3.054 | 0.9996 | 50.204 | 15.204 | 0.987 | 15.331 | 3.990 | 0.998 |
Tarbela | 59.972 | 3.016 | 0.9995 | 244.178 | 10.651 | 0.993 | 91.834 | 3.675 | 0.999 |
Metrics used for accuracy assessment of the model(s)


RESULTS AND DISCUSSION
The regional scale modelling and forecasting of monthly streamflow via VMD-ML has been carried out in the Swat River basin of Pakistan (Sibtain et al. 2020) with ML methods other than those we are discussing here. In their paper, ML alone, VMD-ML and (two stage decomposed)-ML could attain CC values of about 0.869, 0.957 and 0.980 in forecasting, respectively. Furthermore, SDA based VMD-BRNN has not been applied in the Pakistan region previously for streamflow modelling and forecasting with esteemed statistical performance as obtained in the current paper (Table 2).
Hydroclimatic data possess seasonality, cyclicity or trends of different scales in diverse regions of the world. Therefore, streamflow data may contain inline complexities in different seasons, which may vary regionally. The streamflow data of Chitral and Tarbela stations show a strong summer seasonal pattern in time series and respective ACF and PACF plots (Figure 1). Moreover, the data series of Chitral River is positively skewed and more consistent during the training period as depicted by standard deviation whereas that of Tarbela is consistent throughout (section 2, Table 1). Furthermore, variation in consistency in training and testing data series (for example, at the Chitral River) could be challenging for modelling and forecasting.
Despite having a strong seasonal pattern, modelling of streamflow data series via SARIMA performed below average with maximum RMSE and sMAPE values during the training and combined period (Table 2). Unlike SARIMA, performance of PSO-SVR is enhanced during the training and combined period. Especially, during the testing period PSO-SVR performed comparatively well. Moreover, SDA suggested k as 9 and 12, which are further utilized for decomposition of streamflow via VMD for Chitral and Tarbela stations, respectively (Figure 2). Integrated VMD based models achieved increased accuracy during the training period at both the stations (Table 2).
VMD subseries of streamflow quantified by SDA technique with (a) k = 9 for Chitral and (b) k = 12 for Tarbela, respectively.
VMD subseries of streamflow quantified by SDA technique with (a) k = 9 for Chitral and (b) k = 12 for Tarbela, respectively.
Accuracy of VMD-BRNN is prominent with minimum RMSE, sMAPE and maximum CC values at both the stations (Table 2). Unlike the training period, testing period accuracy is significantly decreased in VMD-RF and VMD-GBM by increased values of RMSE and sMAPE for Chitral and increased values of sMAPE; though RMSE remained decreased at Tarbela (Table 2). Such limitations of ML and ANN during the testing period are common in the forecasting of streamflow (Wang et al. 2021), nevertheless, VMD-BRNN remained consistent throughout at both the stations (Table 2). Furthermore, sMAPE is subtle to underestimation of the model as compared to over-estimation (Chen et al. 2017). That is, the jerky pattern of recession flow at the Tarbela station (Figure 1(b)) is distinguishably detected by sMAPE during the testing period (Table 2). Additionally, the combined performance using CC, SD and the centre root mean square difference (RMSD) of the models could be easily visualised via the Taylor diagram (Figure 3). The actual streamflow values are displayed at the center of RMSD on x-axis of Figure 3.
Taylor diagram displaying a statistical comparison of combined (training and testing) estimated results of different models of river flow at (a) Chitral (b) Tarbela stations. Where observed river flow values are co-centred with RMSD on x-axis represented by dash line. Among all the models, VMD-BRNN (represented by ◊) is found to be adequate having minimum RMSD and maximum CC values.
Taylor diagram displaying a statistical comparison of combined (training and testing) estimated results of different models of river flow at (a) Chitral (b) Tarbela stations. Where observed river flow values are co-centred with RMSD on x-axis represented by dash line. Among all the models, VMD-BRNN (represented by ◊) is found to be adequate having minimum RMSD and maximum CC values.
VMD-BRNN is found best among other models with all the optimum values of CC (0.998 and 0.999), SD and RMSD at both the stations (Figure 3), whereas overall CC values of PSO-SVR, VMD-RF and VMD-GBM are nearly same for Chitral and that of PSO-SVR for Tarbela slightly differs from VMD-RF and VMD-GBM (Figure 3 and Table 2). Overall, minimum values of CC are found to be 0.973 and 0.934 for SARIMA at Chitral and Tarbela stations, respectively (Figure 3 and Table 2).
The violin plots (Figure 4) are depicted as a distribution of observed and modelled streamflow data (from left to right) and their respective box-plots (with approximately similar median values) are represented in the center. Thickness of the violin plot around the first quartile (left edge of box-plot) shows that all the observed and modelled data sets are positively skewed with slight variation in frequency distribution around the third quartile. Specifically, VMD-based models have captured frequency distribution near the third quartile well enough at both the stations. Moreover, observed values have prominent extreme values (outliers) which are captured by VMD-BRNN alone at both the stations (Figure 4).
Represents violin plots comparing the distribution of modelled with observed river flow at (a) Chitral and (b) Tarbela stations.
Represents violin plots comparing the distribution of modelled with observed river flow at (a) Chitral and (b) Tarbela stations.
Time series plots provide a visual understanding of over and under fitting of the modelled and forecasted values. SARIMA overestimated the few streamflow peaks during the training and testing period at Chitral (Figure 5(a)). Moreover, the ratio of overestimation is minimum in PSO-SVR as compared to SARIMA. Furthermore, VMD based RF and GBM adequately captured the flow pattern during the training period, however, both of them missed a prominent peak and recession flow throughout the testing period, whereas VMD-BRNN outperformed not only during training, but was able to successfully capture the prominent peak in the testing period (Figure 5(a)) despite having a difference in SD values of Chitral observed data series for the training and testing period (previous section and Table 1).
Time series plot of observed and combined (training and testing) fitted of different model(s) output river flow at (a) Chitral and (b) Tarbela stations. Observed streamflow is indicated by the black line whereas dot-point is used for predicted flow.
Time series plot of observed and combined (training and testing) fitted of different model(s) output river flow at (a) Chitral and (b) Tarbela stations. Observed streamflow is indicated by the black line whereas dot-point is used for predicted flow.
In the case of Tarbela, both SARIMA and PSO-SVR overestimated the flow at most of the lower peaks and missed few prominent peaks (Figure 5(b)), whereas SARIMA overestimated the flow pattern during the training period. Nevertheless, the same pattern has been successfully captured by VMD based models and among them, VMD-GBM exaggerated few peaks during the testing period. On the other hand, VMD-BRNN was able to capture the prominent peak in the testing period with lowest RMSE, sMAPE and highest CC values (Figure 5(b) and Table 2). Moreover, streamflow pattern was consistent with annual peak followed by a recession period (minimum flow) throughout, whereas high peaks may fluctuate after a few years (Figures 1 and 5). Almost all the models could capture the streamflow pattern at recession flow, however, only few could perform well due to fluctuations at the peaks. Furthermore, variation in phenotype of streamflow peaks contribute to the regional cycle in the form of either drought or flood (both are rainfall influenced most of the time). Usually, neural network methods can be trained in capturing the complexity concealed in the phenomenon of streamflow efficiently, resulting in an overfitting of the modelled data. Consequently, their performance during the testing period is affected with unseen datasets. This issue is handled by a regularization approach in BRNN by compromising the goodness of the fit over the complexity of the final network by defining the penalty term as discussed in the previous section. Therefore, fluctuations and inconsistency between training and testing data as well as in the occurrence of high flow peaks of streamflow could be handled by VMD-BRNN adequately well. Overall, the performance of VMD-BRNN is excellent in streamflow modelling and forecasting at both the stations.
CONCLUSION
This paper presents monthly streamflow forecasting by utilising non-decomposed, classical decomposed and SDA based VMD decomposed data approach by stochastic and machine learning models including: SARIMA, PSO-SVR, VMD-RF, VMD-GBM and VMD-BRNN at Chitral and Tarbela stations. Among all the implemented models, the traditional approach (via SARIMA) and the classical decomposed approach (via PSO-SVR) displayed poor performance, failing to capture most of the peaks even during the training period, whereas hybridization of VMD and machine learning methods has shown increased robustness of the models. Moreover, integration of VMD with SDA for determination of number of IMFs has decreased the computational time, allowing the users to focus on the analysis of results from hybrid models completely. Unlike VMD-RF/GBM (poor performance during testing period), the regularization approach in VMD-BRNN distinguishably handled the over/under fitting in both modelling and forecasting monthly streamflow at both the stations. The results show that the influential role of the high dimensional components and robustness on the river flow may be explored by most optimum SDA based VMD signals hybrid with BRNN method. Furthermore, the incorporation of the climatic variables may enhance the accuracy of the forecast. Based on these results, analyses of the influential role of the temperature, rainfall and other climatic factor variations in the river flow will appear in a future communication by the authors.
DECLARATION OF INTERESTS
The author(s) declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
ACKNOWLEDGEMENTS
The author(s) acknowledge the WAPDA, Pakistan for cooperating in data collection. The contents of this paper form part of the first author's doctoral thesis that will be submitted at the Department of Mathematics, University of Karachi, Karachi, Pakistan.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.