Abstract
Drought is quantified with one or a set of drought indices for monitoring and risk management. These indices have a limited ability to capture drought impacts. Drought impact prediction models have been developed to explore the interactions between the drought impact data and the physical drought indices. This study demonstrates the use of extreme gradient boosting (XGB), a well-known machine learning technique, to predict the likelihood of impact occurrence (LIO) of drought on public water supply as a function of drought indices, with high accuracy and low uncertainty. Using text-based drought impact data from multiple sources, the prediction accuracy of drought LIO on the public water supply of South Korea was evaluated using XGB and reference models (log-logistic, support vector machine, and random forest). We also analyzed receiver operating characteristics and quantified the uncertainty of each model with bootstrapping. This study shows that XGB and random forest have a high level of suitability. However, random forest presents a higher level of uncertainty than XGB for predicting drought LIO on the public water supply in South Korea. Although some limitations exist, the results suggest that text-based drought impact data collected from multiple sources can provide insightful information for drought risk management.
HIGHLIGHTS
SPEI was used to model the likelihood of drought impact on public water supply.
The South Korean drought impact inventory was constructed using text-based data.
XGB showed the best predictive performance with high accuracy and low uncertainty.
INTRODUCTION
Drought is a complex natural hazard that affects the environment, society, and economy (Wilhite et al. 2000). Since the 1990s, more than two billion people have been affected, in addition to more than 11 million casualties worldwide due to drought (UNISDR 2009; EM-Dat 2019). The Tana River Basin, which plays an essential role in hydropower generation in Kenya, experienced continuous droughts from 1999 to 2001. This has led to severe water scarcity and power shortages, causing hydropower generation and industrial production losses of approximately 2 billion USD (Mogaka et al. 2006). Drought was a direct cause of more than 500,000 fatalities in Africa during the 1980s (Kallis 2008). According to the Australian Bureau of Agricultural and Resource Economics and Sciences, winter cereal crop yields across Australia were reduced by 36% due to a 2006 drought, leading to fiscal crises for numerous farmers and a total cost of 3.5 billion AUD (Wong et al. 2010).
The Korean peninsula has been experiencing an extreme nationwide drought over a 4–6 year cycle, and its impacts are becoming more pronounced (Hong et al. 2016a). South Korea experienced severe droughts with an annual precipitation 35–50% lower than the average values of 2013–2015 (Kwon et al. 2016). This is one of the most prolonged droughts recently observed in South Korea. It led to water scarcity in agriculture, resulting in a 17.3% decrease in crop production and increased food prices, which severely impacted the Korean economy (Hong et al. 2016b). Furthermore, the characteristics of drought in South Korea, such as its frequency, period, and severity, are projected to change with the increasing impacts of climate change (Boo et al. 2006; Yoo et al. 2012; Nam et al. 2015; Waseem et al. 2016).
Drought is a creeping phenomenon: its effects lag well behind reduced precipitation, which makes it challenging to determine the onset, extent, and end of a drought. Quantifying drought events in terms of their geographic extent, scale, intensity, and duration is therefore problematic (Wilhite & Svoboda 2000). Multiple drought indices have been suggested for evaluating droughts (e.g., McKee et al. 1993; Svoboda et al. 2002; Shukla & Wood 2008; Mu et al. 2013), including the Palmer Drought Severity Index (Palmer 1965), Standardized Precipitation Index (SPI; McKee et al. 1993), and Standardized Precipitation Evapotranspiration Index (SPEI; Vicente-Serrano et al. 2010). Zhao et al. (2014) analyzed meteorological and hydrological drought characteristics in the Jinghe Basin of China, using SPI and the Standardized Runoff Index (SRI; Shukla & Wood 2008), respectively. Lee et al. (2022) analyzed meteorological, agricultural, and hydrological drought characteristics in South Korea using SPI, the Standardized Soil Moisture Index, and the Standardized Streamflow Index (Barella-Ortiz & Quintana-Seguí 2019), respectively. However, these drought indices have a limited ability to capture drought impacts such as wildfires, water shortages, and crop losses (Gudmundsson et al. 2014; Blauhut et al. 2015; Stagge et al. 2015).
Blauhut et al. (2015) recently suggested the use of a log-logistic (LL) regression function to relate a drought index (SPEI) to drought impacts in the public water supply, energy and industry, water quality, and agriculture and livestock farming sectors in European countries. This was undertaken using the European Drought Impact Inventory (EDII; Stahl et al. 2016), which encompasses drought impact data from 15 categories and 33 countries. Stagge et al. (2015) strengthened the link between drought impacts and the index by incorporating data on seasonality, interannual trends, and nonlinear effects of droughts. Blauhut et al. (2016) presented a method that examined the vulnerability factors and drought index (SPEI) used for monitoring to model the likelihood of impact occurrence (LIO). Recently, Bachmair et al. (2017) used an ensemble tree-based model (i.e., random forest; Breiman 2001) to quantify the link between the drought index and drought impact. Furthermore, Sutanto et al. (2019) assessed the forecasting of drought impacts in Germany using hydrometeorological drought indices (i.e., SRI) and drought impacts in EDII. These studies have shown that drought impacts can be forecasted using machine learning techniques at temporal scales of a few months, based on the duration of the drought and the number of reported drought impacts. Note that all these studies were based on the EDII.
The majority of previous drought studies undertaken in South Korea have been based on hydrometeorological data and drought indices (e.g., Kim et al. 2012, 2014; Um et al. 2017, 2018a, 2018b; Bae et al. 2019), except for a few efforts to develop new drought indices using drought impact data (e.g., Lee et al. 2016; Jung et al. 2020). Lee et al. (2016) explored the use of unstructured data in the study of droughts based on correlation analysis between SPI as the meteorological drought index, reservoir water storage rate data, and drought impact data on agriculture collected from news articles. Jung et al. (2020) calculated meteorological and hydrological big data drought indices by combining unstructured data from news articles with hydrometeorological observation data, including precipitation and dam inflows using the Clayton Copula function. At present, there have been no studies in South Korea that have attempted to predict drought impacts using the relationship between drought indices and impacts.
The present study demonstrates the use of machine learning techniques to predict the LIO of drought in the public water supply as a function of drought indices. Unstructured data on drought impacts on the public water supply were collected from multiple sources and then processed into LIO data. A model relating the drought index, specifically SPEI, to drought impact was then developed using extreme gradient boosting (XGB; Chen & Guestrin 2016) and three reference approaches: LL regression (Cox 1958), support vector machine (SVM; Cortes & Vapnik 1995), and random forest (RF; Breiman 2001). Similar studies have shown that machine learning approaches can predict drought impacts with high performance and low uncertainty (Blauhut et al. 2015; Bachmair et al. 2017; Sutanto et al. 2019, 2020). Model uncertainty and prediction accuracy were examined using bootstrapping and receiver operating characteristic (ROC) curves. Building upon the prior literature, this study highlights the applications of XGB in drought impact prediction while considering prediction skill and uncertainty. Furthermore, the possibility of applying a drought impact prediction model with unstructured local impact data, other than EDII, is also demonstrated.
MATERIALS AND METHODS
Study design
Drought impact inventory
Reference type | Source | Source detail |
---|---|---|
Database | National Drought Information Portal | www.drought.go.kr |
Database | Korea Water Resources Corporation | Emergency Water Supply Statistics Database, 2019 |
Newspaper article | Naver (largest search portal in South Korea) | Web crawling |
Governmental report | Ministry of Environment | National Drought Records Survey Report, 1995/2001 |
Governmental report | Ministry of the Interior and Safety | National Drought Information Statistics Report, 2018 |
Governmental report | Korea Water Resources Corporation | Drought Information Annual Report, 2018 |
Meteorological drought index
SPEI was used to quantify drought hazards. This index has the advantage of considering both temperature and precipitation. The estimation of SPEI uses the climatic water balance concept: the difference between precipitation and potential evapotranspiration is calculated, and the resulting series is fitted to a probability distribution. The LL distribution, which provides a better fit for extremely negative values, is recommended for the SPEI (Hernandez & Uddameri 2014). The process of calculating the SPEI using climatic data is summarized below.
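The steps can be sketched in code as follows. This is a minimal illustration, not the study's implementation: it assumes the three-parameter log-logistic form and the classical rational approximation to the normal quantile described by Vicente-Serrano et al. (2010), and it takes the distribution parameters (`alpha`, `beta`, `gamma`) as given rather than fitting them. A full calculation would also aggregate the water balance over the chosen time scale and fit the parameters separately for each calendar month. All function names are hypothetical.

```python
import math

# Constants of the classical rational approximation to the normal quantile
C0, C1, C2 = 2.515517, 0.802853, 0.010328
D1, D2, D3 = 1.432788, 0.189269, 0.001308


def water_balance(precip, pet):
    """Climatic water balance D = P - PET for each month."""
    return [p - e for p, e in zip(precip, pet)]


def log_logistic_cdf(x, alpha, beta, gamma):
    """Three-parameter log-logistic CDF (scale alpha, shape beta, origin gamma)."""
    if x <= gamma:
        return 0.0
    return 1.0 / (1.0 + (alpha / (x - gamma)) ** beta)


def standardize(cdf_value, eps=1e-9):
    """Map a cumulative probability to an approximate standard normal score."""
    f = min(max(cdf_value, eps), 1.0 - eps)   # keep the logarithm finite
    p = 1.0 - f                               # exceedance probability
    sign = 1.0
    if p > 0.5:                               # dry side: use symmetry
        p, sign = 1.0 - p, -1.0
    w = math.sqrt(-2.0 * math.log(p))
    z = w - (C0 + C1 * w + C2 * w * w) / (1.0 + D1 * w + D2 * w * w + D3 * w ** 3)
    return sign * z


def spei(precip, pet, alpha, beta, gamma):
    """SPEI series for given (assumed, not fitted) log-logistic parameters."""
    return [standardize(log_logistic_cdf(d, alpha, beta, gamma))
            for d in water_balance(precip, pet)]
```

Negative values indicate a water deficit relative to the fitted climatology, and positive values indicate a surplus.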
Modeling the likelihood of drought impact occurrence
This study related the occurrence of drought impacts to the drought index following the methods of Blauhut et al. (2016). The monthly SPEI values were considered independent variables, whereas the monthly drought impact binary data (0 for no impact and 1 for impact) were considered dependent variables. The values were obtained from the drought impact inventory, which was constructed using data from multiple unstructured data sources. The LIO ranged from 0 to 1 and was estimated using the XGB and three other reference models: LL, SVM, and RF. LL and RF have been used in previous drought impact studies (Blauhut et al. 2015, 2016; Stagge et al. 2015; Bachmair et al. 2017; Sutanto et al. 2019, 2020), and to our knowledge, SVM and XGB were employed for the first time in this study. While SVM is a well-known machine learning algorithm, XGB is a recently developed machine learning algorithm presenting outstanding performance across multiple subject areas, including economics (Carmora et al. 2019), human disease risk assessment (Zhang et al. 2019b), streamflow forecasting (Zhang et al. 2019a; Ni et al. 2020), and flash-flood risk assessment (Ma et al. 2021; Maduhuri et al. 2021).
Log-logistic model
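As a minimal sketch of a logistic impact function, the example below assumes the common form LIO = 1 / (1 + exp(-(a + b · SPEI))), as used in logistic-regression impact models such as Blauhut et al. (2015). The fitting routine (plain gradient ascent on the log-likelihood), the synthetic data, and all names are illustrative assumptions and are not taken from the study's code.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_impact_function(spei, impact, lr=0.1, n_iter=2000):
    """Fit LIO = sigmoid(a + b * SPEI) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(spei)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for x, y in zip(spei, impact):
            err = y - sigmoid(a + b * x)      # observed minus predicted
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def predict_lio(spei_value, a, b):
    """Likelihood of impact occurrence for one SPEI value."""
    return sigmoid(a + b * spei_value)


# Synthetic illustration only: impacts are more likely at low SPEI
rng = random.Random(0)
spei_values = [rng.gauss(0.0, 1.0) for _ in range(300)]
impacts = [1 if rng.random() < sigmoid(-1.0 - 2.0 * x) else 0 for x in spei_values]
a, b = fit_impact_function(spei_values, impacts)
```

A negative fitted slope `b` means that the predicted LIO rises as SPEI falls, that is, as conditions become drier.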
Support vector machine
The SVM is one of the most popular and representative supervised learning models for classification and regression tasks (Cortes & Vapnik 1995). Classification, which is a common assignment in machine learning, is the process of determining the class for a given set of data. SVM is based on the concept of finding a line or hyperplane (margin) that best separates the data into two classes. The classifier has a lower generalization error when the margins are larger. Therefore, the line or hyperplane with the greatest distance from the nearest training data for any class is the determining factor for robust classification.
In this study, the default radial basis function kernel was used as the classifier for SVM modeling, and three parameters (gamma, cost, and epsilon) were tuned. Gamma controls the degree of curvature of the hyperplane, cost is the weight given to misclassification and thereby controls the size of the SVM margin, and epsilon is the margin of tolerance within which no weight is given to errors. If gamma has a higher value, cost has a lower value, and epsilon is close to 0, the hyperplane is curved with a larger margin, which can lead to overfitting because the model has a low bias and high variance (Pardo & Sberveglieri 2005). The SVM is very sensitive to parameter selection, and even minor changes in the parameters can lead to very different classification results (Lin et al. 2008).
Therefore, 10-fold cross-validation, a commonly used procedure, was conducted to tune the parameters for each region and to evaluate the effectiveness of SVM with the selected parameters. The data are divided randomly into ten parts, of which nine are used for training and one for testing. The tuning ranges for each parameter are as follows: gamma from 0.5 to 2, cost from 4 to 16, and epsilon from 0 to 1. The cost parameter was set to 4 and the epsilon parameter was set to 0.35. Given that gamma can have different values for each region, it was set to 1 in Gangwon and 0.5 in other regions.
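The tuning procedure can be sketched as a grid search wrapped around a 10-fold split. The sketch below is model-agnostic: `score_fn` is a hypothetical user-supplied callable that would train an SVM on the training indices and return a score on the test indices, and the grid values are illustrative picks inside the ranges stated above, not the grid used in the study.

```python
import itertools
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly split sample indices into k folds of near-equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(n_samples, params, score_fn, k=10, seed=0):
    """Average a user-supplied score over k train/test splits."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training set: every fold except the one held out for testing
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(score_fn(train_idx, test_idx, params))
    return sum(scores) / k


def grid_search(n_samples, grid, score_fn, k=10):
    """Return (best parameters, best mean CV score) over the full grid."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_validate(n_samples, params, score_fn, k)
        if best is None or score > best[0]:
            best = (score, params)
    return best[1], best[0]


# Illustrative grid inside the tuning ranges given in the text
grid = {"gamma": [0.5, 1.0, 2.0], "cost": [4, 8, 16], "epsilon": [0.0, 0.35, 1.0]}
```

Because the same folds are reused for every parameter combination (fixed seed), differences in the mean score reflect the parameters rather than the random split.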
Random forest
RF is a representative machine learning model that constructs numerous decision trees on bootstrapped subsamples for classification or regression and is suitable for prediction model development (Breiman 2001). RF is a special type of bagging, an ensemble meta-algorithm that aggregates base classifiers trained on slightly different training data obtained through bootstrapping. RF prevents overfitting to the training data configuration by generating numerous random independent trees, and it estimates the error cost-effectively because there is no iterative training cost related to cross-validation. RF is, therefore, widely used in various studies because of its flexibility, high accuracy, and better performance compared with other machine learning models (e.g., Wang et al. 2015; Naghibi et al. 2016; Bachmair et al. 2017).
In this study, default values were used for all parameters except mtry and ntree, which have the greatest effect on the prediction performance of RF (Liaw & Wiener 2002). The mtry parameter is the number of variables randomly sampled for partitioning at each node, and ntree is the number of trees grown. Lower mtry values improve the stability of the bagging because the trees in the ensemble differ more and are less correlated (Strobl et al. 2008; Probst et al. 2019). If excessive trees are generated owing to high ntree values, RF incurs a higher computational cost without significant performance gains (Oshiro et al. 2012; Probst & Boulesteix 2018). For small datasets, such as those used in the current study, a small number of trees has been suggested to be sufficient for good performance (Oshiro et al. 2012). Therefore, the out-of-bag error, which is commonly used for tuning RF parameters, was used to set the optimized parameters. The out-of-bag error is the average error for each predicted result, calculated using predictions from only those trees whose bootstrap sample did not include that data point. Approximately one-third of the data is used for model validation, while the remaining two-thirds are used to train RF. The tuning ranges for each parameter are as follows: mtry from 1 to 100 and ntree from 1 to 500. The mtry and ntree parameters were set to 2 and 50 for all regions, respectively.
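The out-of-bag idea can be illustrated with a short sketch. Each bootstrap sample leaves out roughly (1 − 1/n)^n ≈ 37% of the data, which is the "one-third" used for validation. In the code below, `fit_tree` is a hypothetical user-supplied callable that trains one tree on the in-bag indices and returns a function mapping a sample index to a 0/1 prediction; the majority-vote rule is an assumption for binary classification.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw one bootstrap sample; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def oob_error(labels, n_trees, fit_tree, seed=0):
    """Out-of-bag error: each sample is scored only by trees that never saw it."""
    rng = random.Random(seed)
    n = len(labels)
    votes = [[] for _ in range(n)]
    for _ in range(n_trees):
        in_bag, oob = bootstrap_split(n, rng)
        predict = fit_tree(in_bag)            # hypothetical: index -> 0 or 1
        for i in oob:
            votes[i].append(predict(i))
    # Samples that were in-bag for every tree have no votes and are skipped
    scored = [i for i in range(n) if votes[i]]
    wrong = 0
    for i in scored:
        majority = 1 if 2 * sum(votes[i]) >= len(votes[i]) else 0
        if majority != labels[i]:
            wrong += 1
    return wrong / len(scored)
```

Comparing this error across candidate mtry and ntree values gives a tuning criterion that needs no separate cross-validation loop.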
Extreme gradient boosting
XGB builds on the gradient boosting concept and is an ensemble machine learning algorithm based on decision trees for solving regression and classification problems (Chen & Guestrin 2016). The two main strengths of XGB are its superior execution speed and model performance when compared with other gradient boosting implementations. Gradient boosting fits the model via a gradient descent optimization algorithm and can use any differentiable loss function. The loss is reduced along its gradient as the model is fitted, which is similar to training a neural network. Trees are added to the ensemble one at a time, and each is fitted to correct the prediction error of the prior models. Based on the ensemble results of the previous models, the sample weights are adjusted for the next model as the ensemble is constructed. The XGB model makes efficient use of limited computational resources for boosted trees. Although boosting itself adds trees sequentially, XGB parallelizes the construction of each tree, which improves the processing speed. The model is thus designed to be more computationally efficient than other open-source gradient boosting programs.
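The boosting loop described above can be sketched with depth-1 trees (stumps) on a single predictor such as SPEI. This is plain first-order gradient boosting with a logistic loss, written for illustration only; it is not XGBoost, which additionally uses second-order gradient information, regularization, and an optimized tree builder. All names and default values are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function with the argument clipped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))


def fit_stump(x, residual):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for t in sorted(set(x))[:-1]:             # candidate thresholds
        left = [r for xi, r in zip(x, residual) if xi <= t]
        right = [r for xi, r in zip(x, residual) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm


def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Additive model of stumps, each fitted to the negative loss gradient."""
    stumps = []
    score = [0.0] * len(x)                    # log-odds, starting at p = 0.5
    for _ in range(n_rounds):
        # Negative gradient of the logistic loss w.r.t. the score is y - p
        residual = [yi - sigmoid(s) for yi, s in zip(y, score)]
        stump = fit_stump(x, residual)
        stumps.append(stump)
        score = [s + learning_rate * stump(xi) for s, xi in zip(score, x)]
    return lambda v: sigmoid(sum(learning_rate * st(v) for st in stumps))
```

For example, `model = boost(spei_values, impacts)` returns a function that maps an SPEI value to a predicted probability of impact; the predictor needs at least two distinct values for a split to exist.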
Tuning the XGB is complicated because changing any parameter can affect the optimal values of the others. All the parameters were set to their default values, except for three parameters: max_depth, nround, and early stopping rounds. These have the most pronounced effect on the prediction performance of XGB (Carmora et al. 2019). Controlling these parameters is also important for XGB to avoid overfitting. Our study used 10-fold cross-validation to set the optimized max_depth parameter. The data are divided randomly into ten parts, of which nine are used for training and one for testing, as was done for the SVM. The max_depth parameter, which is the maximum depth of an individual decision tree, was set to 2 after tuning over the range 2–10. With a high value for the max_depth parameter, the model could improve its accuracy, but it would be more complex and more likely to overfit (Carmora et al. 2019). Moreover, XGB aggressively consumes memory when training a deep tree with a high max_depth value. The nround parameter, which is the maximum number of boosting iterations, was set to 200, and the early stopping rounds parameter, which controls how many iterations the training will wait for the next decrease in the loss value, was set to 50 to avoid overfitting. With early stopping rounds set, XGB can prevent overfitting and achieve stable performance (Fan et al. 2018; Bikmukhametov & Jäschke 2019; Qiu et al. 2022).
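The early stopping rule can be sketched independently of any boosting library. The function below is a hypothetical helper, not part of XGBoost: it scans a sequence of validation losses, one per boosting round, and stops once the loss has failed to improve for `patience` consecutive rounds. The `settings` dictionary simply restates the values given in the text.

```python
def best_round_with_early_stopping(val_losses, patience=50):
    """Return (best_round, rounds_run) for a sequence of validation losses."""
    best_loss = float("inf")
    best_round = -1
    rounds_run = 0
    for i, loss in enumerate(val_losses):
        rounds_run = i + 1
        if loss < best_loss:
            best_loss, best_round = loss, i   # new best: reset the wait
        elif i - best_round >= patience:
            break                             # no improvement for `patience` rounds
    return best_round, rounds_run


# Values reported in the text; the dictionary itself is only illustrative
settings = {"max_depth": 2, "nround": 200, "early_stopping_rounds": 50}
```

With a cap of 200 rounds and a patience of 50, training ends early only if the validation loss stalls for 50 rounds; otherwise all 200 rounds are run.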
Evaluation of model performance
Model performance was evaluated using the ROC curve and area under the curve (AUC). Together, these provide a comprehensive evaluation of regression and classification models (Wilks 2001; Mason & Graham 2002; Hernández-Orallo et al. 2013). The ROC curve is a tool for the visual assessment of each model, while the AUC value is a numeric representation of the model performance. For the ROC curve, models with curves closer to the 45° diagonal indicate lower performance, whereas those closer to the top-left corner indicate better performance. The ROC curve is expressed by a combination of metrics, which are calculated using a confusion matrix. A well-known metric combination for evaluating predictive models is the true positive rate (TPR) and the false positive rate (FPR). TPR is also known as sensitivity and defines the proportion of correctly predicted positive results across all positive samples, whereas FPR defines the proportion of negative samples that are incorrectly predicted as positive. The ROC curve is widely used to evaluate probabilistic forecasting systems such as the value of ensemble weather forecasts (Liguori et al. 2012), rainfall threshold estimation for shallow landslide forecasting (Gariano et al. 2015), and drought impact prediction (Blauhut et al. 2015). The AUC is the area under the ROC curve and has a value between 0 and 1. If the AUC value is greater than 0.5, the predictions of the chosen model are better than random guesses, while values close to 1 indicate a near-perfect model. In this study, test data with the same sample size were generated using simple random sampling to estimate the ROC curve and the AUC value.
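As a minimal sketch of these definitions, the code below sweeps the decision threshold over the predicted scores, computes TPR and FPR at each threshold, and integrates the resulting curve with the trapezoidal rule. The function names are illustrative, and the inputs are assumed to contain at least one positive and one negative sample.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold downward."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # Samples with score >= thr are predicted positive
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75.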
Quantification of model uncertainty
Obtaining estimates of machine learning model uncertainties for newly predicted data is essential for determining whether predictions can be trusted. A common approach for such uncertainty quantification is to estimate the error from an ensemble of models. These are often generated by the bootstrap method (e.g., Slaets et al. 2017; Bomer et al. 2019). The bootstrap method, a resampling technique that samples a dataset with replacement, is used to estimate statistics including bias, variance, and confidence intervals. The confidence interval (95%) for each model was constructed using the bootstrap method (Efron 1979) by randomly sampling the dataset 1,000 times with replacement.
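The resampling step can be sketched as follows. The percentile interval is an assumption, since the text does not state which bootstrap interval was used, and `statistic` is a hypothetical callable; in the study's setting it would refit a model on the resampled data and return a prediction or a score.

```python
import random


def bootstrap_ci(data, statistic, n_boot=1000, level=0.95, seed=0):
    """Percentile confidence interval and standard error of a statistic."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        # Resample the dataset with replacement, keeping the original size
        sample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(sample))
    reps.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_boot)
    hi_idx = int((1.0 + level) / 2.0 * n_boot) - 1
    mean = sum(reps) / n_boot
    se = (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5
    return reps[lo_idx], reps[hi_idx], se
```

A narrow interval and a small standard error across the 1,000 replicates indicate that the estimate is stable under resampling.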
RESULTS
Drought impact database
Some temporal and regional deviations were noted in terms of data quantity and quality (Figure 4(a)). Drought impact data related to the public water supply dated predominantly from after 2009, accounting for 90% of the total impact data. Most of the drought impact data were collected from a specific region, such as Jeonnam, accounting for 30–80% of the total data from 2009 to 2013. The Gangwon, Gyeonggi, and Jeonnam regions (R6, R9, and R14 in Figure 2, respectively) presented the greatest amount of drought impact data on public water supply; therefore, these regions were selected for further analysis (Figure 4(b)). Nationwide data for all 17 provinces were also analyzed.
Model prediction accuracy
Region | LL | SVM | RF | XGB |
---|---|---|---|---|
Gangwon | 0.91 | 0.85 | 0.99 | 0.99 |
Gyeonggi | 0.74 | 0.79 | 0.99 | 0.99 |
Jeonnam | 0.67 | 0.67 | 0.98 | 0.99 |
Nationwide | 0.73 | 0.70 | 0.87 | 0.98 |
Model uncertainty
The uncertainty of the model prediction was evaluated using the bootstrap method to quantify the confidence intervals (95%) (Figure 5) and estimate the standard error values (SE) (Figure 6(b)). Results showed that the uncertainty of the XGB was the lowest, given that the confidence intervals were narrow (Figure 5) and the SE was almost 0 (Figure 6(b)). Although RF had a similar model performance (Figure 5 and Table 2), it showed much greater uncertainty than the XGB (Figures 5 and 6(b)).
LL has a relatively low predictive performance, based on and ROC, and was estimated to have a lower uncertainty than RF and SVM. This suggests that linear logistic regression is more stable than other machine learning techniques. Furthermore, the SVM results for the Jeonnam region showed that the uncertainty of the model was much greater than that of the other three regions (Figure 6(b)). This is in accordance with its lowest AUC and values (Table 2 and Figure 6(a), respectively) and indicates that SVM is not suitable for the Jeonnam region, where the independent variable (SPEI) does not explain the dependent variable, as suggested by accuracy measures.
DISCUSSION
Drought impact prediction with XGB
The drought impact inventory constructed for this study is a new and valuable data source. The data in the inventory are somewhat biased in time and space, with more records available for recent events. These biases should decrease as more events are collected. Despite these limitations and uncertainties, the XGB model used to predict the likelihood of drought impact occurrence on public water supply as a function of SPEI was found to be meaningful in South Korea. These drought impact prediction models thus allow a quantitative assessment of regional differences in drought risk across South Korea.
The present study, via a case study of South Korea, demonstrates that XGB can predict drought LIO with high accuracy and low uncertainty. This may be because XGB builds one tree at a time and then updates the weights of the misclassified data in each classification process, before applying them in the next classification. Conversely, RF is simply a collection of trees, each of which is built independently from a random sample of the data and provides its own prediction. RF then collects the classification results from all trees and takes the mean, median, or mode as the prediction. However, there is a high probability that many trees will make predictions with some random chance, as each tree has its own circumstances, which may include sample duplication, overfitting, and inappropriate node splitting. Therefore, the RF results show greater uncertainty.
Drought hazard characteristics with SPEI
Our study used binary drought impact data on public water supply and SPEI, a meteorological drought index derived from climatic observation data, to derive a drought impact function. In particular, the Gangwon region showed reasonable performance with all models. This means that SPEI (the independent variable) is closely related to the LIO (the dependent variable) in this region, which suggests that the region is more likely to suffer from drought impacts on public water supply due to meteorological drought. In contrast, the results for the Jeonnam region suggest that the drought impact on the public water supply in that region might be better captured by other drought conditions, i.e., agricultural or hydrological drought conditions. To sum up, our results suggest that drought hazards in each area should be analyzed for better drought impact prediction.
Impact prediction for severe droughts
Region | LL | SVM | RF | XGB |
---|---|---|---|---|
Gangwon | 0.76 | 0.74 | 0.77 | 0.99 |
Gyeonggi | 0.67 | 0.61 | 0.68 | 0.99 |
Jeonnam | 0.52 | 0.52 | 0.61 | 0.99 |
Nationwide | 0.54 | 0.55 | 0.56 | 0.97 |
CONCLUSION
In this study, the LIO for the public water supply was modeled and evaluated in South Korea as a function of SPEI using XGB and three other reference models: LL, SVM, and RF. More than 3,000 drought impact data points were collected from various sources, such as databases, newspaper articles, and governmental reports. In particular, the text-based drought impact inventory constructed for this study is meaningful as it is the first such attempt in South Korea. The collected drought impact data were somewhat time-biased, with an increasing trend in the number of reported drought impacts for recent drought events (Figure 4). Moreover, the drought impact function was fitted for the entire period using a binary drought impact and drought index, to predict the likelihood of drought impact occurrence.
The model prediction results showed that XGB exhibited the best performance for all regions. The RF showed a similar performance to XGB but with substantial uncertainty. This suggests that the advantage of XGB is based on boosting, which gives weight to misclassification and contributes to better model performance. The models showed the best performance in Gangwon and less effective performance in Jeonnam. This implies that the impact of drought on the public water supply in Gangwon is strongly associated with meteorological drought. However, other drought indices, such as the hydrological drought index (SRI), might improve drought impact prediction in Jeonnam, where the current model with SPEI showed less effective performance.
The results of this study suggest that XGB is suitable for drought impact prediction when considering the model prediction accuracy and uncertainty, and they indicate the possibility of using the drought impact prediction model with local data in South Korea other than EDII. This is the case despite the limited availability of drought impact data. However, the likelihood of drought impact occurrence was only assessed using machine learning models; thus, it is necessary to predict actual drought impacts in the future. As droughts have social and economic impacts in multiple areas beyond the public water supply, it is also necessary to predict the impact of droughts on other sectors. Future work on drought impact evaluation in several areas, as well as decreased bias, can be expected to improve the prediction skill of LIO modeling. It is therefore necessary to systematically archive drought impact data to build on recent advancements in deep learning techniques. The study also highlights the potential for using text-based impact data to characterize the risk of complex natural hazards other than droughts using an appropriate machine learning technique, specifically XGB.
ACKNOWLEDGEMENTS
This study was supported by the Basic Science Research Program through the National Research Foundation of Korea, which was funded by the Ministry of Science, ICT & Future Planning (No. 2020R1A2C2007670), and the Technology Advancement Research Program through the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant 22CTAP-C163540-02).
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories. The model usage and codes of the reference models (LL, SVM, RF, and XGB) are available in a GitHub repository (https://github.com/krsmsuh/JHI_DI). The drought impact inventory is included in this paper.
CONFLICT OF INTEREST
The authors declare there is no conflict.