ABSTRACT
Flooding in remote regions presents significant challenges due to data scarcity, complicating impact assessment and mitigation efforts. This research delineates an integrated methodology for quantifying flood impacts in such contexts. By leveraging machine-learning algorithms, Sentinel-1 synthetic aperture radar (SAR) imagery was combined with digital elevation model data and river proximity metrics to predict and demarcate flood extents. Geographic information system (GIS) overlay techniques were then employed for spatial analysis of the floods' impacts on population and infrastructural assets. The methodology was applied in a case study in Ngabang District, Indonesia, demonstrating its utility. Analysis using decision tree, random forest (RF), and gradient boosting machine models provided critical insights into flood prediction factors. The RF model performed best and successfully identified flood-prone regions, achieving an accuracy of 0.94 and a Kappa of 0.87 on the testing data. The flood map showed significant impacts, affecting 373.81 hectares, 10,706 people, 1,500 buildings, and 15 km of roads. This study highlights the importance of proximity, elevation, SAR imagery, and iterative model improvements in flood prediction, offering valuable insights for flood management and mitigation efforts in data-scarce regions.
HIGHLIGHTS
This study presents a novel method using machine learning algorithms, synthetic aperture radar imagery, digital elevation model data, and river proximity metrics to predict flood extents.
Advanced models (decision tree, random forest, and gradient boosting machine) are used.
Geographic information systems overlay techniques offer a detailed spatial analysis of flood impacts.
The method is applied in a remote area to identify the flood extent and its significant impacts.
This study provides valuable tools for regions that have limited data.
INTRODUCTION
Flooding is a recurrent and devastating natural disaster that affects millions of people each year, causing significant economic losses and numerous fatalities (Olanrewaju et al. 2019). Mitigation efforts focus primarily on urban areas because of their dense populations and extensive infrastructure, which are highly vulnerable to flood damage. However, remote areas are not exempt from the impacts of flooding and often suffer greatly. In fact, they face challenges in both impact assessment and disaster mitigation due to the lack of comprehensive data. Hence, flood management and response are particularly difficult in these areas (Manyangadze et al. 2022; Iqbal & Nazir 2023).
The existing body of literature shows that synthetic aperture radar (SAR) imagery has become crucial in flood mapping thanks to its ability to capture high-resolution images in various weather conditions, including cloud cover and nighttime (Panahi et al. 2022; Riazi et al. 2023). This capability is vital for effective flood management and response strategies. However, relying solely on SAR data can result in artifacts, so additional data is needed for accurate flood extent delineation.
Meanwhile, machine learning (ML) can improve the accuracy and timeliness of flood mapping by analyzing complex hydrological data (Elkhrachy 2022; Soria-Ruiz et al. 2022), including satellite imagery and real-time sensor data. ML models can also enhance flood mapping precision by identifying high-risk areas with greater detail and accuracy (Sampurno et al. 2023). As such, stakeholders can make more informed decisions on flood risk management and mitigation.
Likewise, geographic information systems (GIS) are indispensable in disaster management as they integrate diverse spatial data for a holistic analysis of flood impacts (Tomaszewski 2020). GIS facilitates real-time monitoring and decision-making during disaster events, enhancing the efficiency and effectiveness of response efforts. When combined with SAR imagery and ML algorithms, GIS can generate a precise flood hazard map, which offers detailed insights into the impacts on communities and infrastructure (Elkhrachy 2022; Soria-Ruiz et al. 2022; Riazi et al. 2023).
Integrating SAR, GIS, and ML can optimize flood mapping and risk management (Amiri et al. 2024) for three reasons. First, SAR can enhance flood mapping by capturing high-resolution images in all weather conditions with minimum revisit time, which allows for timely data acquisition during flood events (Tripathi et al. 2021; Islam & Meng 2022). Second, the combination of SAR data and GIS spatial analysis will allow for more accurate monitoring and prediction of future flood events, thereby reducing potential human and economic losses (Khosravi et al. 2019). Third, ML further enhances this capability by analyzing large datasets from SAR and GIS to uncover patterns and relationships that may not be evident through traditional methods. This data-driven flood mapping will result in better flood risk management and impact mitigation (Nachappa et al. 2020; Shahabi et al. 2020) as the robust framework offers a more comprehensive and accurate tool for decision-makers in flood-prone areas.
However, it should be noted that such advancements in flood mapping may not be as effective without sufficient data. In remote areas, the critical gap in accurately delineating flood extents is limited data availability. Therefore, this study addresses this gap by integrating SAR imagery, digital elevation models (DEMs), and proximity to rivers as predictors, as well as utilizing ML algorithms to map flood extents. The novelty of this work lies in utilizing these integrated predictors in a case study in the Ngabang District, Indonesia, a region that exemplifies the challenges faced by many remote communities. This approach aims to improve the accuracy of flood extent mapping while providing actionable insights that can enhance disaster management strategies and allow for more effective responses in vulnerable areas.
METHODS
Study area
Data acquisition
In addition to the SAR imagery, we incorporated predictors derived from DEMNAS' DEM data (Badan Informasi Geospasial 2018) and proximity metrics from the Landak River and its branches. The inclusion of DEM data provides essential topographic context, enhancing the accuracy of flood mapping by accounting for elevation-related variations in water flow and accumulation. Similarly, proximity to the Landak River and its branches helps identify areas at higher risk of flooding, further refining our predictive model. The combination of satellite imagery, DEM data, and proximity metrics offers a comprehensive approach to flood risk assessment in the region. The details of the data utilized in this study are presented in Table 1.
| No. | Data type | Source |
|---|---|---|
| 1 | DEM | DEMNAS (Badan Informasi Geospasial 2018) |
| 2 | Proximity to river | Calculated from waterway map (OpenStreetMap 2022) |
| 3 | Sentinel-1 SAR GRD | Accessed and processed using the GEE platform (https://code.earthengine.google.com/) (ESA 2024) |
| 4 | Validation points | In situ observations during the January 2024 flood event |
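To illustrate the river proximity predictor listed in Table 1, the sketch below computes a Euclidean distance raster from a small grid in which river cells are marked. It is a minimal illustration under stated assumptions, not the workflow used in this study: the function name `river_proximity`, the toy grid, and the 10-unit cell size are hypothetical.

```python
import math

def river_proximity(river_mask, cell_size=1.0):
    """Return a grid of Euclidean distances to the nearest river cell.

    river_mask: list of rows, 1 where a river is present, 0 elsewhere.
    cell_size: hypothetical raster resolution in map units.
    """
    river_cells = [(r, c) for r, row in enumerate(river_mask)
                   for c, value in enumerate(row) if value == 1]
    if not river_cells:
        raise ValueError("river_mask contains no river cells")
    distances = []
    for r, row in enumerate(river_mask):
        out_row = []
        for c in range(len(row)):
            # Brute-force search over all river cells (fine for a toy grid).
            nearest = min(math.hypot(r - rr, c - cc) for rr, cc in river_cells)
            out_row.append(nearest * cell_size)
        distances.append(out_row)
    return distances

# Toy 3 x 4 grid with a river running down the first column (assumed data).
mask = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
proximity = river_proximity(mask, cell_size=10.0)
```

The brute-force search is only practical for small grids; a real raster would typically call for a dedicated distance transform in a GIS or image-processing library.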
ML model
To transform the SAR backscatter characteristics into precise flood extent maps, a comparative approach using ML algorithms was adopted. The SAR backscatter values, comprising both the VV and VH bands, served as input features for the models. These bands allow the models to capture detailed surface properties and variations in water presence. DEM and river proximity data were also included to account for topographical influences and hydrological connectivity. These additional predictors were crucial for enhancing the models' ability to delineate flood-prone areas accurately.
The study employed three key ML algorithms: decision tree (DT), random forest (RF), and gradient boosting machine (GBM) (Felix & Sasipraba 2019; Panahi et al. 2022; Sampurno et al. 2023). DT is known for its simplicity and interpretability, making it easy to understand the decision-making process. However, DT models are prone to overfitting, particularly when dealing with complex datasets. To mitigate this issue, RF was utilized, which enhances model accuracy and reduces overfitting by aggregating the results of multiple decision trees. RF's ensemble approach creates a more stable and reliable model. GBM, on the other hand, offers robust performance by iteratively correcting errors from previous trees, thus enhancing overall model accuracy (Hastie et al. 2009).
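The aggregation idea behind RF can be sketched with a simplified stand-in. The example below is bagging with single-split decision stumps and majority voting; it omits the per-split feature subsampling of a full random forest and is not the implementation used in this study. The feature values (elevation in metres, distance to river in metres), the labels, and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Find the single-feature threshold split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            for left_label in (0, 1):
                errors = sum(
                    (left_label if row[f] <= threshold else 1 - left_label) != y
                    for row, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left_label)
    return best[1:]  # (feature index, threshold, label on the <= side)

def predict_stump(stump, row):
    f, threshold, left_label = stump
    return left_label if row[f] <= threshold else 1 - left_label

def fit_bagged_stumps(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return ensemble

def predict_bagged(ensemble, row):
    """Aggregate the stumps by majority vote."""
    votes = Counter(predict_stump(stump, row) for stump in ensemble)
    return votes.most_common(1)[0][0]

# Assumed toy data: (elevation_m, distance_to_river_m); 1 = flooded.
rows = [(5, 20), (6, 50), (7, 80), (8, 120),
        (30, 900), (35, 1200), (40, 1500), (45, 2000)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
ensemble = fit_bagged_stumps(rows, labels, n_trees=25, seed=0)
near_low = predict_bagged(ensemble, (4, 30))      # low and close to the river
far_high = predict_bagged(ensemble, (38, 1300))   # high and far from the river
```

Because each stump sees a different bootstrap sample, an individual stump may choose a poor threshold, but the majority vote smooths out such errors, which is the stabilizing effect described above.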
The target variable for these algorithms was the flood extent, which was critical for assessing each model's effectiveness in identifying flooded areas. The models were trained using data from a single flood event in January 2024. This event provided SAR backscatter data, including VV and VH bands, DEM, and proximity data, correlated with verified flood extents derived from in situ real-time observations. The data were split into 80% for training and 20% for testing to ensure a robust model performance evaluation. This training process involved extensive data pre-processing and validation, ensuring the input features' reliability and the models' generalizability. By leveraging this carefully prepared data, the models could learn from the observed flooding patterns, significantly enhancing their predictive capabilities for future flood events.
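An 80/20 split of labelled points can be sketched as follows. The source does not state whether the split was stratified by class, so the stratification here is an assumption, as are the function name, the random seed, and the synthetic feature tuples (VV, VH, elevation, distance to river).

```python
import random

def stratified_split(features, labels, test_fraction=0.2, seed=42):
    """Split samples into train and test sets, keeping class proportions."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(i)

    train_idx, test_idx = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx):
        return [features[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)

# Hypothetical samples: (VV, VH, elevation, distance to river) per point.
features = [(-18.0 + i * 0.1, -25.0 + i * 0.1, 10 + i, 50 * i)
            for i in range(100)]
labels = [1 if i < 50 else 0 for i in range(100)]  # 1 = flooded (assumed)
(train_X, train_y), (test_X, test_y) = stratified_split(features, labels)
```

With 50 points per class, this yields 80 training and 20 testing samples, with 10 testing samples from each class.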
The performance of each model was subsequently evaluated using accuracy and the Kappa coefficient, among other metrics, to determine the most effective algorithm for flood mapping using SAR data. Accuracy is a straightforward metric calculated as the ratio of correctly predicted instances to the total instances in the dataset (Liu et al. 2014). However, accuracy can be misleading for imbalanced datasets. Therefore, the Kappa coefficient was used as a complementary measure. This statistic compares the observed accuracy against the accuracy expected by chance (Liu et al. 2014). In flood mapping scenarios, the differentiation between water and non-water classes is paramount, so the two metrics combined offer a more holistic view of the ML model's robustness and reliability. High values of accuracy and Kappa give more confidence in the model's ability to delineate flood extents with precision (Congalton & Green 2008).
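The two metrics can be computed directly from paired true and predicted labels. The sketch below uses the standard definition of Cohen's Kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement). The label counts are illustrative only and are not results from this study; the function name is hypothetical.

```python
def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's Kappa) for two equal-length label lists."""
    n = len(y_true)
    classes = set(y_true) | set(y_pred)
    # Observed agreement: share of samples where prediction matches truth.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of marginal class shares, summed over classes.
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n)
                   for c in classes)
    if expected == 1.0:
        return observed, float("nan")  # Kappa is undefined in this case
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa

# Illustrative labels (1 = water, 0 = non-water), not data from the paper:
# 45 true positives, 5 false negatives, 5 false positives, 45 true negatives.
y_true = [1] * 50 + [0] * 50
y_pred = [1] * 45 + [0] * 5 + [1] * 5 + [0] * 45
accuracy, kappa = accuracy_and_kappa(y_true, y_pred)
```

In this balanced example the chance agreement is 0.5, so an accuracy of 0.90 corresponds to a Kappa of 0.80; with a strongly imbalanced dataset the gap between the two metrics would widen.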
Flood impact assessment
After the flood extents were mapped in the previous stage, the next stage assessed their impact on existing infrastructure, specifically buildings and road networks. This process provided a detailed visual and quantitative analysis of the potential damages and disruptions. The assessment also incorporated demographic data to estimate the impact on the population. The impact was calculated using the overlay technique, which layers the flood extent map over the infrastructure and population datasets to identify intersections and areas of overlap. The overlay thus determines which buildings, roads, and population clusters fall within the flood-affected zones. The number of affected buildings, the affected road length, and the affected population were then calculated by summing the respective values from the intersection map. This approach facilitated the identification of high-risk zones and hence the formulation of targeted evacuation plans and resource allocation. The assessment used the InaSAFE plugin within the Quantum GIS (QGIS) environment (InaSAFE 2024). This plugin combines hazard data with socio-economic datasets to evaluate the potential impact of flood events on critical infrastructure and human populations. Infrastructure data were sourced from OpenStreetMap (OpenStreetMap 2022), and population data were derived from the GHSL Data Package 2022 (Schiavina et al. 2022).
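The overlay logic can be illustrated on a toy raster. The sketch below is not the InaSAFE API; it only shows the principle of intersecting a flood mask with exposure layers and summing the values that fall inside the flooded cells. The grids, the building locations, and the function name are assumptions.

```python
def overlay_impact(flood_mask, population, building_cells):
    """Sum population and count buildings that fall in flooded cells.

    flood_mask: grid of 1 (flooded) / 0 (dry).
    population: grid of people per cell, same shape as flood_mask.
    building_cells: list of (row, col) cells, one entry per building.
    """
    affected_population = sum(
        people
        for mask_row, pop_row in zip(flood_mask, population)
        for flooded, people in zip(mask_row, pop_row)
        if flooded
    )
    affected_buildings = sum(1 for r, c in building_cells if flood_mask[r][c])
    return affected_population, affected_buildings

# Toy 3 x 3 layers (assumed values, not data from the study area).
flood_mask = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
population = [
    [12, 30, 8],
    [5, 20, 40],
    [0, 15, 25],
]
buildings = [(0, 0), (1, 1), (1, 2), (2, 2)]
people_hit, buildings_hit = overlay_impact(flood_mask, population, buildings)
```

Here 62 people and 2 of the 4 buildings fall inside the flooded cells. Road length could be handled the same way by summing a per-cell road length layer over the flooded cells.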
RESULTS AND DISCUSSIONS
ML model performance
Flood extent and its impact
Furthermore, we investigated the impact of the flood event on the region of interest (ROI) (Table 2). The event directly affected 10,706 of the 79,292 people in the ROI. As for the built environment, about 1,500 of roughly 11,100 buildings in this region were affected, signifying considerable disruption. The road infrastructure within the ROI, which plays a crucial role in transportation and connectivity, was also severely impacted: about 15 km of the 141 km of roads were disrupted. These results indicate the scale and scope of the event's effect on the population and the built environment within this region.
| No. | Exposure | Type | Affected | Not affected | Total |
|---|---|---|---|---|---|
| 1 | Road (m) | Motorway | 491 | 7,130 | 7,621 |
| | | Local | 1,454 | 16,694 | 18,148 |
| | | Path | 1,305 | 2,822 | 4,127 |
| | | Secondary | 0 | 3,044 | 3,044 |
| | | Other | 11,882 | 96,188 | 108,070 |
| 2 | Building | Place of worship | 0 | 1 | 1 |
| | | Education | 0 | 2 | 2 |
| | | Residential | 1,486 | 9,567 | 11,053 |
| 3 | Population | | 10,706 | 68,586 | 79,292 |
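The totals quoted in the text can be checked against the per-type rows of Table 2. The sketch below sums the rows and derives the affected shares. It assumes the road values are in metres, which is consistent with the 15 km and 141 km figures reported above; the variable names are illustrative.

```python
# (affected, total) per type, copied from Table 2.
roads_m = {
    "Motorway": (491, 7621),
    "Local": (1454, 18148),
    "Path": (1305, 4127),
    "Secondary": (0, 3044),
    "Other": (11882, 108070),
}
buildings = {
    "Place of worship": (0, 1),
    "Education": (0, 2),
    "Residential": (1486, 11053),
}
population = (10706, 79292)

def summarize(rows):
    """Return (affected, total, affected share in percent) for a set of rows."""
    affected = sum(a for a, _ in rows)
    total = sum(t for _, t in rows)
    return affected, total, round(100 * affected / total, 1)

road_affected, road_total, road_share = summarize(roads_m.values())
bldg_affected, bldg_total, bldg_share = summarize(buildings.values())
pop_affected, pop_total, pop_share = summarize([population])
```

Under these assumptions, the rows sum to 15,132 m of affected roads out of 141,010 m (about 10.7%), 1,486 of 11,056 buildings (about 13.4%), and 10,706 of 79,292 people (about 13.5%), which matches the rounded figures given in the text.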
DISCUSSIONS
The comparative analysis of the DT, RF, and GBM models underscores the strengths and limitations of each algorithm in flood extent prediction. The DT model's simplicity and interpretability make it a valuable tool for understanding the influence of individual predictors (Ludwig et al. 2017), such as proximity and elevation. However, its tendency to overfit, especially with complex datasets, limits its predictive power, as evidenced by decreased performance from training to testing datasets (Stiglic et al. 2012).
The RF model mitigates overfitting through its ensemble approach, resulting in higher accuracy and stability (Zhang & Wang 2021; Sun et al. 2024). The feature importance analysis revealed that proximity and DEM are crucial predictors of flood extent, aligning with hydrological knowledge. Albeit less critical, the VV and VH bands contributed additional detail to the model. The RF model's performance metrics, with minimal drop from training to testing datasets, highlight its robustness and reliability in flood prediction tasks. This consistently high performance makes RF the best choice among the evaluated models (Chen et al. 2018; Rodriguez-Galiano et al. 2018; Song et al. 2021).
The GBM model excelled in accuracy by iteratively correcting errors, but the fluctuation in accuracy with varying iterations and tree depths highlights the need for careful parameter tuning. As the number of iterations increases and trees grow deeper, the model becomes more complex, which can lead to overfitting or underfitting depending on the dataset and the specific configuration applied (Kiatkarun & Phunchongharn 2020; Xia et al. 2021; Mwita et al. 2023). However, its robust performance in both training and testing datasets, consistent accuracy, and Kappa values indicate its potential for flood extent prediction when properly utilized. The GBM's ability to maintain high-performance metrics across datasets suggests it can effectively make generalizations from training data to unseen data, making it a strong candidate for practical flood prediction applications (Felix & Sasipraba 2019).
In conclusion, while the DT model offers interpretability and the GBM model provides robust performance through iterative improvement, the RF model emerged as the most effective algorithm for flood extent prediction in this study area. Its ability to handle overfitting, combined with high accuracy and stability across datasets, makes it the most reliable choice for effective flood management and mitigation strategies (Kumar et al. 2021; Sun et al. 2023). RF's high accuracy and stability have been consistently demonstrated in numerous studies, reinforcing its reliability for applications requiring dependable predictions, such as flood management and mitigation (Bharathidason & Jothi Venkataeswaran 2014; Dheenadayalan et al. 2016).
However, while RF is generally reliable, its performance can be affected by the presence of noisy trees or correlated decision trees within the ensemble, which can impact classification accuracy (Li et al. 2010). Consequently, caution is advised when applying this model to other remote areas. Considering the unique environmental and data-specific factors that may influence its performance is crucial.
CONCLUSIONS
This study demonstrates that ML approaches using Sentinel-1 SAR, DEM, and river proximity data can effectively map floods in Ngabang District, Indonesia. Analyses using DT, RF, and GBM models provided critical insights into flood prediction factors. The RF model, chosen as the best, successfully identified flood-prone regions. While the DT model experienced some overfitting, the RF and GBM models maintained high accuracy and reliability. The map showed that the flood significantly impacted 373.81 hectares of land, 10,706 people, 1,500 buildings, and 15 kilometers of roads. This analysis highlights the importance of proximity, elevation, SAR imagery, and iterative model improvements in flood prediction, offering valuable insights for flood management and mitigation efforts.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.