Drought is a harmful and little understood natural hazard. Effective drought prediction is vital for sustainable agricultural activities and water resources management. The support vector regression (SVR) model and two of its enhanced variants, namely, fuzzy-support vector regression (F-SVR) and boosted-support vector regression (BS-SVR) models, were developed to predict the Standardized Precipitation Evapotranspiration Index (SPEI) at three timescales (SPEI-1, SPEI-3 and SPEI-6) with a lead time of one month, in order to minimize potential drought impact on oil palm plantations at the downstream end of the Langat River Basin, which has a tropical climate pattern. Observed SPEIs from 1976 to 2011 and from 2012 to 2015 were used for model training and validation, respectively. Using the MAE, RMSE, MBE and R2 as model assessments, it was found that the F-SVR model performed best, with accuracy improving as the timescale of the SPEIs increased. It was also found that the differences in performance between the models diminish as the timescale of the SPEIs increases. The outlier-reducing effect of the fuzzy concept improves the SVR-based models more than the boosting technique does in predicting SPEI-1, SPEI-3 and SPEI-6 for a one-month lead time at the downstream end of the Langat River Basin.
Drought is a damaging and little understood natural calamity (Pulwarty & Sivakumar 2014). Drought events usually develop slowly over time with their effects normally lasting for a long period of time (Wilhite et al. 2014). These features allow for making drought mitigation possible, albeit difficult, as the starting and ending of droughts are difficult to determine precisely. In particular, some of the rare and extreme drought events vary considerably in time and extent (Burke et al. 2010). These characteristics of droughts, no doubt, will further cause significant difficulties in drought mitigation. Hence, effective drought prediction and its subsequent management is important for sustainable agricultural activity and water resource management.
Although Malaysia is located in a tropical region and receives an average of 2,800 mm of precipitation annually, the rainfall amount and rain day occurrence exhibit large variability. As a result, wet and dry conditions can be extreme at times, causing difficulties in sustaining dam water storage and supply management. Some of the drastic droughts that have occurred in the basin and its surrounding areas include the 1991 Malacca water crisis, the 1998 Klang Valley water crisis (El-Nino) and the 2014 Selangor water crisis (Abdulah et al. 2014). These events show that the study area, which is located in the Langat River Basin of Peninsular Malaysia, is vulnerable to droughts and that its drought prediction capability needs to be improved for better drought preparedness. It was reported that 60% of the Langat River Basin is used for agricultural activity (DOA 1995; JICA 2002), with oil palm being the major crop. Oil palm plantations, with an approximate area of 847 km2, are located downstream, as shown in Figure 1. Since oil palm production plays a leading role in Malaysia's agricultural and industrial sectors, developing an agricultural drought prediction model is important for mitigating the negative impacts of drought events. This is especially important as virtually all the oil palm estates in Peninsular Malaysia are rain-fed and rely solely on direct precipitation.
Agricultural drought refers to the circumstances when soil moisture is insufficient, resulting in the lack of water availability for crop growth and production (Wilhite & Glantz 1985). However, the estimation of soil moisture is always a challenging task, more so when in drought management. In order to overcome this problem, the multi-scalar drought index, namely, Standardized Precipitation Index (SPI) has been used to describe different types of drought, including agricultural droughts (Hu et al. 2015; Stagge et al. 2015; Liu et al. 2016; Venkataraman et al. 2016). The WMO (2012) stated that the SPI with any timescale from one month to six months, could be used to define agricultural drought as soil moisture conditions respond to precipitation anomalies on a relatively short timescale. However, it is undeniable that there may be a delay between the estimated and actual condition due to the indirect estimation of soil moisture. Hence, the Standardized Precipitation Evapotranspiration Index (SPEI) has been developed to include the potential evapotranspiration (PET), which is normally used to quantify the loss of water to the atmosphere by combining the processes of evaporation from the soil and plant surfaces and transpiration from plants, to the description of drought. In view of its multi-scalar representation and with the inclusion of the consideration of the effect of PET, it has been a popular index in various drought studies (Begueria et al. 2014; Li et al. 2015; Hernandez & Uddameri 2016; Liu et al. 2016; Maca & Pech 2016; Xiao et al. 2016; Alam et al. 2017; Manatsa et al. 2017; Chen et al. 2018; Soh et al. 2018).
Given the advantages of SPEI, it was chosen in this study to describe agricultural drought conditions for the study area, with the timescales of one month, three months and six months. According to the USDA (2016), drought stress lasting more than 8 weeks in Malaysia will usually result in reduced flowering and fruit production during the subsequent 12-month period. This statement is supported by historical events: the moisture deficits that occurred between January and March 2016 were reported to have possibly caused the 8% reduction in crop yield at that time, although the loss was quickly recovered with near-normal rainfall between May and July. In addition, the newest oil palm hybrid was reported not only to have improved drought tolerance, but also to survive a maximum of 57 no-rain days (Silva et al. 2017). Hence, SPEI-1, SPEI-3 and SPEI-6 were proposed based on the reported moisture sensitivity of oil palm in Malaysia.
Drought monitoring and the early warning instrument are important phases to manage droughts (Bachmair et al. 2016). An approach for drought prediction concerns the application of machine learning models. Since the characteristics of droughts are difficult to determine, machine learning models, well known for their high flexibility and adaptability, have been used to predict droughts that have different durations, frequencies and intensities. Moreover, the use of machine learning has shown outstanding performance and accuracy (Ozger et al. 2011; Belayneh et al. 2014; Masinde 2014; Deo et al. 2016; Prasad et al. 2017; Liang et al. 2018). One of the foremost machine learning models used for drought prediction is support vector regression (SVR), which is a family of data-driven type supervised machine learning models. It has been used in several studies for drought prediction and a host of papers have validated that the SVR approach is a promising tool for drought prediction (Chiang & Tsai 2012, 2013; Belayneh & Adamowski 2013; Jalili et al. 2014; Jalalkamali et al. 2015; Borji et al. 2016). In view of the data-driven characteristics inherent in SVR models, it was used to develop the agricultural drought prediction model in this study area, in tandem with using the drought index SPEI.
According to Freund & Schapire (1996), the boosting ensemble technique can improve the performance of a given learning algorithm. The boosting ensemble technique is a method that combines multiple weak learners to produce predictions with higher accuracy, after measuring the pseudo residuals between the predicted and observed values. However, due to the rapid and multi-directional growth of machine learning models in the hydrological field, the application of boosting-ensemble machine learning model is very limited even though it has shown promising results. For example, a recent study by Belayneh et al. (2016) showed that the boosting technique is suitable for improving the performance of SVR models for the prediction of the SPI. Hence, this study aimed to explore further the application of the boosting ensemble technique in drought prediction.
According to the standard practice, the option available for the modeller towards the problem of outliers is to discard them from the data sets through careful reasoning and selection. Although these data points are treated as the redundant outliers that may cause undesired errors in the modelling processes, it is inevitable that they are part of the observed values to describe the event. In order to cater for both situations, the concept of fuzzy logic is normally applied to define the grey zone. For this reason, a method called the fuzzy SVR has been developed so that different input data points can provide different contributions to the learning of decision surface based on respective fuzzy membership values, which indicates their importance among the data sets. They have shown outstanding performances in predicting runoff (Wiriyarattanakul et al. 2009) and in other applications (Chaudhuri & Kajal 2011; Allaoua & Laoufi 2013; Hung 2016; Edwin & Somasundaram 2016). Given the effectiveness of both techniques in improving the prediction accuracy of machine learning models, the motivation for this study is to improve the agricultural drought predictions with the SVR model by hybridizing it with the boosting ensemble technique and fuzzy membership values, namely, F-SVR and BS-SVR models.
To the best knowledge of the authors, SVR-based drought prediction models coupled with the fuzzy or boosting technique, using the SPEI as the predictor, have not previously been developed for the Langat River Basin. Since the study area is the downstream end of the Langat River Basin and shares the basin's humid and warm tropical climate, it is of interest to develop the aforementioned agricultural drought prediction models to predict wet and dry conditions by considering simultaneous changes in both precipitation and PET. In order to evaluate the improvements that the fuzzy and boosting techniques bring to the SVR model, all the models were developed to produce SPEI-1, SPEI-3 and SPEI-6 one month ahead (lead time). Hence, the expected results of this study are improved one-month lead time predictions of the SPEI at various timescales from the SVR, BS-SVR and F-SVR models.
STUDY AREA AND DATA SET
Study area and data acquisition
The Langat River Basin with an approximate total area of 2,400 km2 is located over two Peninsular Malaysia states, Selangor and Negeri Sembilan, within latitudes 2° 40′ 15″ N to 3° 16′ 15″ N and longitudes 101° 19′ 20″ E to 102° 1′ 10″ E (Juahir et al. 2011). The precipitation data were retrieved from the Department of Irrigation and Drainage (DID) Malaysia, while the temperature data were from the Malaysian Meteorology Department (MMD). It was observed that the main agricultural activity in Langat River Basin is oil palm plantations, located at the downstream of the basin with an approximate area of 847 km2, as shown in Figure 1. Hence, the rainfall station at Pejabat JPS Sg. Manggis (ID: s2815001) located at the centre of the basin downstream, and temperature station at Petaling Jaya (ID: 48648), both with 40 years (1976–2015) of data, were used to generate the SPEIs to represent the agricultural drought conditions at the downstream agricultural area of the basin (Figure 1), which fulfils the minimum required density of one station per 575–900 km2 for non-mountainous areas (WMO 2008).
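The one-month lead-time setup and the 1976–2011 / 2012–2015 split can be illustrated with a minimal sketch. Python is used here only for illustration (the study itself used MATLAB); the function names `make_lead_pairs` and `split_by_year` are hypothetical, and the series is a placeholder, not real SPEI data. Assigning each pair to a period by the year of its target month is also an assumption.

```python
def make_lead_pairs(years, spei, lead=1):
    """Pair each monthly SPEI value with the value observed `lead` months later."""
    pairs = []
    for t in range(len(spei) - lead):
        # (year of the target month, input SPEI, target SPEI)
        pairs.append((years[t + lead], spei[t], spei[t + lead]))
    return pairs


def split_by_year(pairs, last_train_year=2011):
    """Split pairs into training and validation sets by the target month's year."""
    train = [(x, y) for (yr, x, y) in pairs if yr <= last_train_year]
    valid = [(x, y) for (yr, x, y) in pairs if yr > last_train_year]
    return train, valid


# Placeholder monthly record covering 1976-2015 (40 years x 12 months).
years = [1976 + m // 12 for m in range(40 * 12)]
spei = [((m * 7) % 11 - 5) / 3.0 for m in range(40 * 12)]  # NOT real SPEI values

pairs = make_lead_pairs(years, spei, lead=1)
train, valid = split_by_year(pairs)
print(len(train), len(valid))
```

With a full 480-month record, the first month has no earlier input, so one pair is lost. For SPEI-3 and SPEI-6 the usable record would be shorter still, because the first few months lack a complete accumulation window.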
Standardized precipitation evapotranspiration index (SPEI)
|Category|SPEI value|
|---|---|
|Extremely wet|2.00 and above|
|Very wet|1.50 to 1.99|
|Moderately wet|1.00 to 1.49|
|Near normal|−0.99 to 0.99|
|Moderately dry|−1.00 to −1.49|
|Severely dry|−1.50 to −1.99|
|Extremely dry|−2.00 and below|
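The classification above can be expressed as a small lookup. This is a minimal sketch in Python: the function name `classify_spei` is hypothetical, and values that fall between the tabulated two-decimal bounds (for example 0.995) are assumed to belong to the category nearer to normal.

```python
def classify_spei(value):
    """Map an SPEI value to its wetness/dryness category from the table."""
    if value >= 2.0:
        return "Extremely wet"
    if value >= 1.5:
        return "Very wet"
    if value >= 1.0:
        return "Moderately wet"
    if value > -1.0:
        return "Near normal"
    if value > -1.5:
        return "Moderately dry"
    if value > -2.0:
        return "Severely dry"
    return "Extremely dry"


for v in (-2.08, -0.82, 1.5):
    print(v, classify_spei(v))
```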
Support vector regression (SVR)
Support vector machines (SVM) were introduced by Vapnik (1995) in an effort to characterize the properties of learning machines so that they can generalize well to unseen data (Kisi & Cimen 2011). The learning task is insensitive to the relative number of training examples in positive and negative classes. Compared to the artificial neural networks (ANN), the SVM is less prone to overfitting as it seeks to minimize the generalization error, while the ANN seeks to minimize the training error (Chiang & Tsai 2013; Belayneh et al. 2016; Borji et al. 2016). Thus, the SVM was chosen over the ANN. The SVM can be separated into two types: support vector classification (SVC) and support vector regression (SVR). This study is primarily concerned with the prediction of the SPEI and hence, the SVR was chosen.
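For reference, the standard ε-insensitive SVR formulation is sketched below. The notation ($w$, $b$, $\phi$, $C$, $\varepsilon$, $\xi_i$, $\xi_i^*$) is the conventional textbook one and is an assumption here; it is not necessarily the notation or the exact variant used by the authors.

```latex
$$
\min_{w,\,b,\,\xi,\,\xi^*} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^* \right)
$$
$$
\text{subject to} \quad
\begin{cases}
y_i - \left( w^\top \phi(x_i) + b \right) \le \varepsilon + \xi_i, \\
\left( w^\top \phi(x_i) + b \right) - y_i \le \varepsilon + \xi_i^*, \\
\xi_i,\; \xi_i^* \ge 0, \qquad i = 1, \dots, n.
\end{cases}
$$
```

The $\lVert w \rVert^2$ term penalizes model complexity, which is the sense in which the SVR targets generalization error rather than training error alone. Deviations smaller than $\varepsilon$ incur no penalty, and $C$ controls the trade-off between flatness and tolerance of larger deviations.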
Boosting-support vector regression (BS-SVR)
In this study, the ‘LSBoost’ function from Ensemble Learning Toolbox in MATLAB was utilized to combine weak learners and generate a more accurate ensemble. The process to generate accurate and generalized boosted values was iterative and, hence, it was carried out with the aid from ‘resume’ function in MATLAB. After that, boosted SPEIs were produced and imported to the ‘fitrsvm’ function in MATLAB together with targeted observed SPEIs with lead time of one month, as inputs for training and validation of SVR.
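The idea behind least-squares boosting can be sketched as follows. This is a pure-Python illustration, not the MATLAB ‘LSBoost’ implementation: the weak learner (a one-split regression stump), the learning rate, the function names and the toy data are all assumptions. It covers only the boosting step, not the subsequent SVR training on the boosted values.

```python
def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimising squared error.

    Assumes x holds at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def ls_boost(x, y, n_cycles=50, learn_rate=0.1):
    """Start from the overall mean, then fit one stump per cycle to the residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_cycles):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [pi + learn_rate * (lm if xi <= t else rm) for xi, pi in zip(x, pred)]
    return {"base": base, "stumps": stumps, "learn_rate": learn_rate}


def boost_predict(model, xi):
    """Sum the base prediction and the shrunken contribution of every stump."""
    total = model["base"]
    for t, lm, rm in model["stumps"]:
        total += model["learn_rate"] * (lm if xi <= t else rm)
    return total


# Toy data (not SPEI): a noiseless linear relation on a symmetric grid.
x = [(i - 20) / 10 for i in range(41)]
y = [0.8 * xi for xi in x]
model = ls_boost(x, y, n_cycles=50)
```

Each cycle adds one weak learner, so the number of learning cycles directly controls ensemble size. Too many cycles can fit noise in the training series, so in practice the cycle count would be chosen against held-out error.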
Fuzzy-support vector regression (F-SVR)
Fuzzy logic was also adopted in this study for its mathematical modelling ability to incorporate imprecision and tolerance towards uncertainty. Classically (Boolean or crisp set theory), membership of an element x in a set A, is defined by the value of either 1 (true) or 0 (false) to each individual in the universal set X, which also means ‘every proposition is either true or false’. However, fuzzy logic violates both ‘excluded middle’ and ‘contradiction’ laws (Klir & Yuan 2008). According to Zadeh (1965), the true values of variables may be any real number between 0 and 1, which can be done by using fuzzy membership function (FMF) to assign membership value (or degree/grade of membership) between 0 and 1 to every point in the input space (universe of discourse). However, the fuzzy membership function varies for different types of data and the importance to be evaluated on.
With these, the fuzzy membership values that correspond to each observed SPEI data point were produced. Thereafter, the fuzzy membership values, the observed SPEI and the targeted observed SPEI with a lead time of one month were imported to the ‘fitrsvm’ function in MATLAB for training and validation of the SVR. The generated fuzzy membership values were used as additional inputs together with the SPEIs (two input variables) to transform the training points from (xi, yi) to (xi, yi, Si), where Si is the fuzzy membership value of the i-th point. Figure 2 shows the flowchart for the development of models.
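As an illustration of how membership values of this kind can be generated, the sketch below uses one common distance-based form in which membership falls linearly with distance from the class mean. This specific formula, the split into classes at zero, the `delta` parameter and the function name are assumptions for illustration only; they are not the membership function used in the study and will not reproduce its reported values.

```python
def fuzzy_memberships(spei, delta=1e-6):
    """Assign each value a membership in (0, 1] based on distance to its class mean.

    Values >= 0 form the positive class, values < 0 the negative class (assumed).
    """
    def class_stats(values):
        if not values:
            return None
        mean = sum(values) / len(values)
        radius = max(abs(v - mean) for v in values)
        return mean, radius

    stats = {
        True: class_stats([v for v in spei if v >= 0]),
        False: class_stats([v for v in spei if v < 0]),
    }
    memberships = []
    for v in spei:
        mean, radius = stats[v >= 0]
        # Points at the class mean get 1.0; the farthest point gets a value near 0.
        memberships.append(1.0 - abs(v - mean) / (radius + delta))
    return memberships


spei = [-0.82, -2.08, -0.30, 0.50, 1.20, 0.10]  # placeholder values
s = fuzzy_memberships(spei)
features = list(zip(spei, s))  # two input variables per point: (SPEI, membership)
```

Pairing each SPEI value with its membership gives the two-variable input described above, so points far from their class mean carry a low membership alongside their raw value.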
Models’ performance evaluation
RESULTS AND DISCUSSION
Development of models
For this paper, the results of all three proposed models, namely, the SVR, the F-SVR and the BS-SVR, were generated for the station s2815001 (Pejabat JPS Sg. Manggis) located in the Langat River Basin. These models were used to predict the SPEI-1, the SPEI-3 and the SPEI-6, respectively, each with a one-month lead time. Before the development of the models, the precipitation and temperature data were used to generate the SPEI-1, SPEI-3 and SPEI-6; the resulting series are shown in Figure 3. Compared to the SPEI-1, it can be observed that both the SPEI-3 and SPEI-6 are less sensitive to the changes in monthly precipitation and/or temperature within the long-term record. Since the SPEI-3 and SPEI-6 are accumulated over longer periods than the SPEI-1, the onset of drought only becomes obvious over a longer time frame. The average moving range (AMR) was used to describe the variations caused by the sensitivity of each of the SPEIs. As expected, the SPEI-1, which is the most sensitive to changes, had the highest AMR of 1.0942. The less sensitive SPEI-3 and SPEI-6 had lower AMR values of 0.6472 and 0.5622, respectively.
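The average moving range can be computed as the mean absolute difference between consecutive values. The sketch below assumes a moving range of span two, which is the usual definition; the source does not state the span, and the function name and sample series are illustrative.

```python
def average_moving_range(series):
    """Mean of |x[t] - x[t-1]| over consecutive pairs (span-2 moving range)."""
    ranges = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(ranges) / len(ranges)


rough = [0.0, 1.0, -1.0, 0.5]    # placeholder, changes sharply month to month
smooth = [0.0, 0.2, 0.3, 0.5]    # placeholder, changes gradually
print(average_moving_range(rough), average_moving_range(smooth))
```

A series that swings sharply from month to month, as SPEI-1 does relative to SPEI-3 and SPEI-6, yields a larger value.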
As mentioned in the section ‘Study area and data set’, the fuzzy membership values (Si) were used in this study to represent the degree of importance of all data (Figure 4) and adopted as an additional input in the SVR to reduce the effects of outliers. From Figure 4, it was observed that an SPEI value close to the mean value acquires a high Si, and vice versa. For example, the SPEI-6 with a value of −0.82 at time step 64 attained an Si of 0.99 (the largest possible value is 1.00) because it is close to the negative class mean of −0.83. In contrast, the SPEI-6 with a value of −2.08 at time step 75 only attained an Si of 0.46 because of its large difference from the negative class mean. The same goes for the Si of the SPEI-1 and the SPEI-3, as shown in Figure 4. These results show that the adopted fuzzy membership functions are able to estimate the degree of importance of each point.
For the boosting ensemble, since the algorithm improves the performance of the models by improving the learning effects from the weak learners, overfitting may occur when the number of learning cycles becomes too high. Thus, the selection of an appropriate number of learning cycles is important for a generalized model. For this study, the optimal numbers of learning cycles giving the lowest generalization error were 313, 206 and 195, respectively, for the SPEI-1, the SPEI-3 and the SPEI-6. It was observed that the number of learning cycles decreased when the timescale of the SPEI series increased. At every learning cycle, MATLAB trains one weak learner for every template object in learners. Thus, the larger number of learning cycles needed to reach the optimum also indicates that more weak learners were required for the shorter timescales. Hence, the results of the boosting ensembles also indicate that the training difficulty decreases as the timescale of the SPEIs increases.
Performance of the models
With the optimal parameters of each model being determined, the prediction results from each model were generated and their performances were evaluated based on the measures of MAE, RMSE, MBE and R2 between the observed and predicted SPEIs, as tabulated in Table 2.
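The four performance measures can be sketched as follows. Two conventions are assumed because the source equations are not available here: MBE is taken as the mean of (predicted − observed), so positive values indicate overprediction, and R2 is taken as the squared Pearson correlation between observed and predicted values. The function names and sample values are illustrative.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(p - o) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mbe(obs, pred):
    """Mean bias error, assumed here as mean(predicted - observed)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Squared Pearson correlation (assumed definition of R2)."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


obs = [0.0, 1.0, 2.0, 3.0]    # placeholder observed values
pred = [0.5, 1.0, 2.5, 3.0]   # placeholder predicted values
print(mae(obs, pred), rmse(obs, pred), mbe(obs, pred), r2(obs, pred))
```

Lower MAE and RMSE, an MBE closer to zero and a higher R2 all indicate better agreement, which is why the four measures move in different directions when a model improves.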
Based on the results, it can be observed that the overall performance of the models improved with timescale, from SPEI-1 to SPEI-6. For example, the validation ranges of MAE, RMSE, MBE and R2 changed from 0.325–0.559, 0.372–0.644, −0.017–0.133 and 0.796–0.828 for SPEI-1 to 0.126–0.170, 0.159–0.202, −0.026–0.040 and 0.824–0.854 for SPEI-3, and further to 0.105–0.144, 0.137–0.187, −0.023–0.011 and 0.866–0.903 for SPEI-6; that is, the errors (MAE and RMSE) decreased while R2 increased. Hence, the prediction accuracy of the models increases when the timescale increases, especially for the SPEI-3 and the SPEI-6. On the other hand, the AMR, which indicates the variation in a series, was highest for the SPEI-1 at 1.0942, followed by the SPEI-3 and the SPEI-6 with values of 0.6472 and 0.5622, respectively. In addition, given the drastic improvement in the performance measures from the SPEI-1 to the SPEI-3 and the gentler improvement from the SPEI-3 to the SPEI-6, it is further deemed that the prediction capability of the models is affected by the variation in the series.
For the comparison between models, the results show that the overall performance of the BS-SVR and F-SVR models have been improved as compared to the standalone SVR model. For example, the MAE of the models in predicting SPEI-1 reduced from 0.559 in SVR to 0.543 in BS-SVR and 0.325 in F-SVR during the validation period. These results of BS-SVR and F-SVR models outperforming the standalone SVR model were also shown in other performance measures, such as RMSE and R2 (Table 2). These indicate that both the boosting ensemble technique and fuzzy membership values have successfully improved the prediction capability of the SVR model, as the prediction errors have shown to be reduced together with the increase in correlation between predicted and observed values.
It was also observed that the difference in performance between the BS-SVR model and the F-SVR model becomes less significant as the timescale increases. Taking into account the fact that the variation in the series decreases with increasing timescale, it is deemed that the advantage of fuzzy membership values over the boosting ensemble technique gets smaller when the variation in the series decreases. Reviewing the algorithms of each approach supports this conclusion: the fuzzy membership values in this study were estimated in an effort to reduce the effect of outliers by assigning lower fuzzy membership values to the points further from the class mean, while the boosting ensemble technique improved the predictions by combining weak learners. Hence, when the variation in the series decreases over increasing timescales, the advantage of fuzzy membership values over the boosting ensemble technique also decreases.
Further evaluation of the models was also carried out using a time series plot of data in the validation period (Figure 5). As clearly illustrated in Figure 5, the predicted SPEIs generated by each model closely mirrored the pattern of the observed SPEIs. There was also no noticeable delay between the observed and predicted SPEIs. This shows that the SVR-based models have no time-shift error in this study and are well suited to the prediction of agricultural droughts for the downstream end of the Langat River Basin. However, the instances in which the SVR models underpredicted the SPEI values also show that improvements are needed to generate better predictions. Figure 5 also shows that the BS-SVR model tends to overpredict the extremes, e.g., at the ten-month time step in Figure 5(a), the 11-month time step in Figure 5(b) and the 18-month time step in Figure 5(c). This phenomenon can be explained by reviewing the algorithm in the models.
As mentioned, the boosting process was initialized using the overall mean as the first prediction. Thereafter, pseudo residuals between the predicted and observed values were estimated and used as the indication to decide the number of weak learners to be combined in order to improve the predictions in the next step. However, this initial estimation may produce high pseudo residuals at extreme values due to the zero or near to zero mean generated from the cancel-off effect between the positive and negative values during the calculation of mean. Thereafter, higher weightage will be assigned to the extreme values during the training process and cause overprediction at those points, which is in agreement with the graphical illustration shown in Figure 5. However, this problem was avoided in the F-SVR model as the fuzzy membership values used were generated with reference to respective mean from positive and negative classes. In other words, the effects from each point were altered using the fuzzy membership values with reference to its class mean. As compared to the initial mean generated in the BS-SVR model, the class mean in the F-SVR model has better representation on the original characteristics of the data and will not cause higher assignation of weightage to the extreme values. Hence, overprediction did not happen in the F-SVR model, whereas there were some minor underpredictions, as shown in Figure 5, which may be due to the reduced effects from data points caused by assignation of fuzzy membership values.
This study was done to assess the efficiency of the various SVR-based models in predicting one-month lead time agricultural drought at the downstream end of the Langat River Basin, which has a tropical climate pattern. Drought indices, namely, the SPEI-1, the SPEI-3 and the SPEI-6, were adopted to describe drought severity. Apart from the standalone SVR model, this study also separately combined the concepts of fuzzy logic and the boosting ensemble technique with the SVR model in order to improve the accuracy in predicting the SPEIs. Based on the performance measures determined, it was observed that the accuracy of all three models improved with increasing timescale of the SPEIs: predictions were most accurate for the SPEI-6, followed by the SPEI-3 and lastly the SPEI-1. In view of previous findings and the decreasing variation in the series with increasing timescale of the SPEIs, the authors maintain that the difficulty the models have in fitting the training data is affected by the variations in the series. Both the F-SVR and the BS-SVR models were found to consistently give better predictions than the standalone SVR model, suggesting that both can improve the prediction capability of the SVR model, by reducing the outlier effects and by creating ensembles from weak learners, respectively. Nevertheless, of these two models, the F-SVR model performed better, with lower prediction errors (MAE and RMSE) and a higher correlation (R2). In view of the algorithm of the F-SVR model and the finding of improving accuracy with decreasing variation in the series, the authors concluded that the advantage of the F-SVR model over the BS-SVR model is due to its outlier-reducing effect, which reduces the training difficulties caused by the variations by assigning lower fuzzy membership values to the points further from the class mean.
Future work should be carried out using these methods in other study areas to ensure the models' robustness, especially when longer lead time is required due to the differences in climatic conditions. Attempts to include wavelet transformation as the data smoothing technique to improve the performance of these models could be considered given the finding in this study showing that variations in training series are affecting the prediction capability of the models.
The authors would like to express their sincere appreciation to the Universiti Tunku Abdul Rahman, Bandar Sungai Long, Cheras, 43000 Kajang, Selangor, Malaysia for funds allocated to this project.