ABSTRACT
Climate change has caused dramatic changes in monsoon precipitation rates in Malaysia, contributing to repetitive flooding events. This research examines the practicality of machine learning (ML) in performing high-performance, accurate flood forecasting (FF), taking the Dungun River as a case study. IGISMAP datasets of water level and rainfall (1986–2000) were investigated, and forecasting was implemented for the current period (1986–2000) and the near future (2020–2030). The ML algorithms were logistic regression, K-nearest neighbors (KNN), support vector classifier, Naive Bayes, decision tree, random forest, and artificial neural network (ANN). Simulations were run in the Colab software tool. The results revealed that between 1986 and 2000 there were on average 18–55 floods around the Dungun River Basin. Floods occurred rarely before 1985 but have been common since 2000, with an average of 35 floods annually. It is predicted that flooding events will grow on the Dungun River Basin between 2020 and 2030. Most floods occurred at rainfall between 1 and 500 mm, and the maximum flood frequency, 110 occurrences, was measured at a rainfall of 250 mm. The overall accuracies were 75.61% for random forest, 73.17% for KNN, and 48.78% for logistic regression. Overall, the ANN models had a competitive mean accuracy of 90.85%.
HIGHLIGHTS
Comprehensive analysis of the flooding events has been carried out.
Different machine learning models capture flood events in a main Malaysian river.
INTRODUCTION
The Dungun River is considered the most important river in the Malaysian district of Dungun; hence, flood forecasting (FF) for it is essential. It is also Terengganu's second-biggest river. However, this river periodically encounters a wide range of severe flooding events and acute natural circumstances, causing large-scale damage and harmful effects (Sidek et al. 2020). For this reason, flooding has become a persistent problem and a challenging issue facing risk authorities and the Malaysian government owing to the considerable financial losses, threats to citizens' lives, and other damages that occur frequently. All these severe conditions call for active analytical investigation alongside efficient flood mitigation mechanisms.
Accordingly, scholars and hydrological engineers thought of adopting feasible techniques that rely on practical data collection processes. One of these workable mechanisms is the FF. FF relies on real-time database collection using Internet-of-things (IoT) sensors from the flood site. Machine learning (ML) algorithms and functional artificial intelligence (AI) models are adopted to analyze the gathered databases numerically and make necessary forecasting of flooding events over the near future (next years) or far future (coming decades) with the support of software simulation tools (Kumar & Yadav 2021; Mateo-Garcia et al. 2021; Sidek et al. 2021; Avand et al. 2022; Kumar et al. 2023a, 2023b).
Those high-performance AI and ML paradigms contributed to considerable levels of forecasting accuracy, reliability, speed, and effectiveness. As a result, their beneficial gains supported responsible risk managers in carrying out essential and robust decision-making processes, translated by swift responses and active actions that protect local residents' lives, properties, and financial resources. Hence, ultimate rates of safety and protection can be fulfilled by comprehensive and real-time data gathering from flooding sites. Besides, these ML algorithms and numerical AI models could aid in accomplishing valuable environmental tasks, namely, the protection of the local nature and creatures from severe flooding circumstances. Referring to their socioeconomic benefits, ML techniques and AI models could prevent or at least minimize immediate damage brought by hazardous floods, involving loss of physical lives, property loss, crop destruction, livestock loss, infrastructure facility deterioration, and degradation of health conditions connected with waterborne ailments via precise and durable flood prediction (Manzoor et al. 2022).
The Dungun River is notable for its vibrancy among Malaysia's waterways. In the Dungun region, the Dungun River is the most vital waterway, and it is Terengganu's second largest river. The overall area of the watershed for this river is approximately 1,858 km². The Dungun flows for roughly 110 km: it begins in Pasir Raja and flows through Kuala Jengai and Jerangau, thus covering a total of four mukims.
Kuala Dungun faces multiple flooding situations per annum. The river's breadth may fluctuate between 50 and 300 m. The Dungun River receives runoff from numerous significant tributaries, including the Kelmin and Loh Rivers (sometimes known as the Telemboh and Perlis Rivers), before emptying its waters into the South China Sea. Numerous ancient village communities have been uncovered in the upper regions. In addition, along the river's 80-km length, various urban projects, including factories and homes, have been built. The Dungun River experiences a tide range of 0.4–3.5 m.
Many floods have occurred over the past decade due to climate change, global warming, and heavy rainfall. As a tropical country, Malaysia faces a considerable rise in monsoon rainfall, causing an annual increase in flooding opportunities, mainly in the Dungun River Basin zone (Ishak et al. 2014). In this context, Noor et al. (2016) investigated FF methods to predict flooding events at the Terengganu River, Malaysia. They clarified that this region is prone to periodic floods and higher water levels (WLs) due to heavy rainfall and its topography. Moreover, the Terengganu River receives higher rainfall during the monsoon between November and February, which increases the flooding opportunity. Noor et al. (2016) relied on a statistical database gathered to determine the WLs of the Terengganu River through past years. They used two approaches to achieve their study goal: (1) noninclusion of residual and (2) inclusion of residual. Their analytical results revealed that the noninclusion of residual approach is better than the inclusion of residual approach for predicting the water level's rise and flooding chance at the Terengganu River. In addition, their proposed method provided a higher level of accuracy and better performance in predicting severe flooding events compared with conventional methods.
Similarly, Mosavi et al. (2018), who reported that floods are among the most catastrophic natural disasters, conducted a study inspecting this problem. Their research verified the significant effectiveness of ML in long- and short-term flood prediction. Concurrently, flooding of the lower Tapi basin was studied by Kumar & Yadav (2021), who employed artificial neural networks (ANNs) to model the phenomenon on a real-time data collection basis. The upstream metering station, reservoir flow, and downstream metering location were simulated. The models relied on a feed-forward network, the Levenberg–Marquardt learning rule, and the axon transfer function. A strong statistical link was found between the established models. The results also showed that the developed models were effective, and the flood discharge predicted by the ANN was in agreement with the observed values.
Anaraki et al. (2021) led an analytical exploration in which the uncertainty of climate change effects was inspected to forecast flood frequency rates in the coming decades using ML models. The scholars developed a novel framework integrating metaheuristic algorithms, decomposition, and ML techniques to predict future flooding events, taking into account climate change influences with the adoption of HadCM3 (A2 and B2 scenarios), CGCM3 (A2 and A1B scenarios), and CanESM2 (RCP2.6, RCP4.5, and RCP8.5 scenarios) within global climate models (GCMs). In their suggested framework, multivariate adaptive regression splines (MARS) and the M5 model tree were employed for the determination of the rainfall rate (on wet and dry days). Also, the whale optimization algorithm (WOA) was considered for training the least square support vector machine (LSSVM). Meanwhile, the wavelet transform was employed for the decomposition of precipitation and temperature, while LSSVM-WOA, LSSVM, K-nearest neighbor (KNN), and ANN were applied for downscaling precipitation and temperature. At the same time, discharge was simulated for the present period (1972–2000), the near future (2020–2040), and the far future (2070–2100). A log-normal distribution was deployed for the flood frequency analysis. Furthermore, analysis of variance (ANOVA) and a fuzzy method were utilized for uncertainty analysis. The Karun Basin, in the southwest of Iran, was addressed as a case study. Based on their numerical simulations, their critical prediction outcomes confirmed that MARS outpaced the M5 model tree. In downscaling, ANN and LSSVM_WOA performed slightly more robustly than other ML algorithms. Discharge simulation findings affirmed the superiority of the LSSVM_WOA_WT model (Nash–Sutcliffe efficiency of 0.911). Simultaneously, flood frequency analysis confirmed that the 200-year discharge declines for all scenarios except the CanESM2 RCP2.6 scenario in the near-future prediction.
In the near- and far-future intervals, the ANOVA uncertainty analysis indicated that hydrological algorithms were among the most critical uncertainty sources. According to the fuzzy uncertainty analysis, the HadCM3 model had low uncertainty levels at longer return periods (up to 60% lower compared with other models at a 1,000-year return interval). However, one of their research limitations was correlated with the drawbacks of GCMs, which are the primary instruments for predicting the future influence of climate change. GCMs operate on a global scale, so their outputs must be downscaled to small temporal and spatial ranges when regional impact research investigations are performed.
Janizadeh et al. (2019) conducted numerical work verifying the pivotal role of ML techniques in achieving successful projection of flash floods for watershed areas in Tafresh, Iran. The researchers reported that floods are among the most catastrophic and destructive natural disasters globally. The development of managerial strategies requires an in-depth knowledge of the probability and size of future flooding events. Hence, the scholars estimated flash flood vulnerability in the Tafresh watershed, Iran. They relied on five ML algorithms: (1) alternating decision tree (ADT), (2) functional tree (FT), (3) kernel logistic regression (KLR), (4) multilayer perceptron (MLP), and (5) quadratic discriminant analysis (QDA). Geospatial databases involving 320 historical flooding circumstances were formulated, and eight geo-environmental parameters were considered. These parameters include the following:
- (a) Elevation,
- (b) Slope,
- (c) Slope aspect,
- (d) Distance from rivers,
- (e) Average annual rainfall,
- (f) Land use,
- (g) Soil type, and
- (h) Lithology.
Those variables were employed as influencing elements of floods. Relying on various performance measures, it was confirmed that the ADT model offered dominant outcomes compared with the other ML algorithms. Concurrently, the FT model was scored as the second-best model, followed by the KLR, MLP, and QDA. Given the few variations between the goodness of fit and projection robustness of the models, it was deduced that all five ML-based algorithms are serviceable and workable for flood vulnerability mapping in other areas to protect communities from damaging flooding incidents.
Nonetheless, the authors stated a research limitation, reflected in the multifaceted and time-consuming nature of precise FF. This problem arises owing to the significant complexity of flood occurrence and the sophisticated interactions of multiple anthropogenic and geo-environmental variables. Besides, accurate databases of flooding events are rare, which makes the generation of reliable vulnerability maps more challenging.
By the same token, Islam et al. (2021) clarified that when it comes to property and lives lost, floods are considered among the worst natural calamities imaginable. The dynamic and complicated character of flash floods makes it challenging to predict the areas that are prone to flooding. Therefore, cutting-edge ML models for handling flood disasters were employed to locate areas likely to be impacted by flash floods at an earlier time. In their research, they applied and evaluated four different state-of-the-art ML models to create flood susceptibility mapping for the Teesta River basin in northern Bangladesh: a Dagging ensemble, an ANN, a random forest (RF), and a support vector machine (SVM). These models were put to use in a geographic information system (GIS), where data on 413 flood-related points (both recent and historic) were imported. The correlation between the events and the flood-inducing components was calculated via the information gain ratio and multicollinearity diagnostics tests. For the validation and comparison of these models, the researchers implemented Friedman, Wilcoxon signed-rank, and paired t-tests as well as the receiver operating characteristic (ROC) curve to upgrade the statistical appraisal measures. All models achieved an area under the ROC curve greater than 0.80. Depending on their numerical analysis, it was found that in predicting how likely an area was to be impacted by flooding, the Dagging model accomplished better outcomes, followed by the RF, the ANN, and the SVM, and then several benchmark models. All in all, their article could help state and local authorities and policymakers reduce flood-related dangers and execute effective mitigation plans to limit future harm by outlining a strategy and solution-oriented outcomes.
Within this framework of literature background, it is worth remarking that in the face of increasing climatic uncertainties and the rising frequency of flood events, such as along the Dungun River, the need for reliable and accurate FF has become more urgent and demanding. Consequently, this research aims to address this critical issue by harnessing workable AI and ML techniques. A collection of ML models was utilized, specifically logistic regression (LR), KNN, support vector classifier (SVC), Naive Bayes (NB), decision tree (DT), RF, and ANNs. These paradigms were employed to predict future flood events. This research not only provides an in-depth analysis of the ML algorithms that most accurately forecast flooding but also offers valuable insights into flood frequencies and associated rainfall amounts, with implications for flood mitigation strategies. Although the article's findings are currently focused on the Dungun River, the methodologies and insights could be considered for broader applications and future research on other flooding regions.
The introduction of this article was extended to underscore the pivotal role of accurate rainfall prediction via ML and AI models in semi-arid regions, where the hydrological balance is a keystone for socioeconomic and environmental stability. These regions, characterized by erratic precipitation patterns, face unique challenges that stem from water scarcity interspersed with sporadic flood events. The implications of this variability are profound, influencing agricultural cycles, water resource management, and the integrity of local ecosystems. In semi-arid landscapes, the delicate equilibrium between drought and deluge determines the sustenance of biodiversity, the productivity of the soil, and the livelihoods of the inhabitants.
Furthermore, a precise representation of existing meteorological methodologies was considered, indicating the transformation from traditional FF mechanisms to AI-based models. Traditional approaches exhibit some constraints in their application to the complex dynamics of semi-arid climates, often failing to capture the nonlinear nature of atmospheric variables. AI-based methods emerge as a response to these limitations, offering enhanced predictive capabilities through smart learning algorithms that can adapt to the diversified nature of climatic data. However, these advanced models encounter certain challenges, including the need for extensive data and the problem of overfitting.
In view of these considerations, this article's primary objective is to investigate the critical roles and practical uses of AI and ML models in executing efficient and precise FF procedures. The Dungun River forms the focal point of this analysis. ML paradigms such as LR, KNN, SVC, NB, DT, RF, and ANN are deployed. The study considers flooding datasets comprising water level and rainfall measures to forecast river flood events from 1971 to 2030. Besides, this analysis aims to identify patterns in flood occurrences, evaluate the prediction accuracy of each AI model, and provide reliable and workable methods for predicting future flooding events.
The study will collect real-time data on rainfall, precipitation type, and water level in the Dungun River, Malaysia. The research reviews several recent articles and publications that examine the contributory benefits of ML models in executing efficient FF missions. Hence, a quick alert system can be employed to accelerate the response to flash floods or extensive flooding events, enabling rapid decision-making by the corresponding authorities to safeguard people and property. Historical information related to the flood parameters of the Dungun River is included in the dataset. The AI models for this study are developed for FF. ML and flood specialists are consulted to validate and fine-tune the models' outputs. Finally, the study proposes a collection of recommendations and future work aspects aimed at assisting decision-makers in implementing strategies to protect citizens.
MATERIALS AND METHODS
The research procedure involved training AI models on this historical dataset to identify key trends and relationships between rainfall and flood occurrences. To enable the considered ML models to predict future events appropriately, some assumptions were made to remain consistent with global climate change models and regional development trends. It is crucial to point out that these projections were based on the premise of existing conditions continuing, with the understanding that actual future events might diverge due to changes in regional policies, environmental conservation efforts, or significant shifts in climate patterns. The assumptions and methods related to these projections can help researchers understand the importance and potential limitations of ML models for performing efficient FF processes.
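As a minimal sketch of this train-and-evaluate workflow (not the study's actual code), the following trains a few scikit-learn classifiers on synthetic rainfall and water-level values and compares their accuracies; the feature names, flood rule, and thresholds are illustrative assumptions.

```python
# Sketch of the train/evaluate workflow, using scikit-learn. The synthetic
# rainfall/water-level features stand in for the Dungun River records.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 400
rainfall = rng.uniform(100, 500, n)        # annual rainfall (mm), synthetic
water_level = rng.uniform(33.0, 34.5, n)   # annual water level (m), synthetic
X = np.column_stack([rainfall, water_level])
# Toy labeling rule: a flood year has both high rainfall and high water level.
y = ((rainfall > 300) & (water_level > 33.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2%}")
```

On real records, the same held-out-accuracy comparison is what yields per-model scores like those reported in the abstract.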
Simulation approach
From Figure 3, the input layer feeds into the first hidden layer; that layer's output is then transformed by a second hidden layer (the green nodes) to produce the output layer.
Relying on ML fundamentals, ANNs take their working principles from the way real neurons in the brain work. ANNs can be trained to perform diversified tasks, from pattern recognition in data to forecasting the future. An ANN's ‘neurons,’ or nodes, are grouped into layers and form a network with one another. Data are received at the input layer, passed on to one or more hidden layers for processing, and ultimately delivered at the output layer. Each neuron takes information from its neighbors in the preceding layer and sends its output signal on to the next (Alasali et al. 2021).
For an ANN to improve over time, it should be trained by making minor adjustments to the weights of connections between its neurons to minimize a cost function that assesses the deviation between the expected and actual outputs. The gradient of the cost function relative to the network's weights and biases is often calculated using a method called backpropagation.
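A minimal sketch of this training loop, assuming a single logistic neuron and a cross-entropy cost (toy data and learning rate, not the paper's configuration): the weights are nudged down the gradient of the cost at each step.

```python
# One logistic neuron trained by gradient descent: the weight update uses
# the gradient of a cross-entropy cost, the core idea behind backpropagation.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

w = np.zeros(2)
b = 0.0
lr = 0.5
for _ in range(500):
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))            # sigmoid activation
    grad_z = (p - y) / len(y)               # d(cost)/dz for cross-entropy
    w -= lr * (X.T @ grad_z)                # propagate gradient to the weights
    b -= lr * grad_z.sum()                  # ... and to the bias

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
print("training accuracy:", (pred == y).mean())
```

In a multilayer network, backpropagation applies this same chain-rule gradient layer by layer, from the output back to the input weights.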
The ability to learn and generalize from examples is a crucial strength of ANNs. For instance, an ANN can be taught to detect handwritten numbers by exposing it to many such samples and allowing it to modify its weights and biases accordingly. After being taught, the ANN is capable of recognizing completely novel sets of handwritten digits.
ANNs have been implemented in many areas such as autonomous cars, language processing, and image and speech recognition. They excel at problems where standard rule-based approaches may fail, such as those involving enormous volumes of data and complex patterns (Bre et al. 2018).
Although they have been relatively successful, ANNs have some drawbacks. Training them can be computationally costly, and they often need large amounts of data to work well. It is also not always clear how ANNs arrive at their forecasts because they are opaque. Nonetheless, studies are being conducted to solve these issues and enhance the efficiency and interpretability of ANNs.
The choice to use a preexisting model architecture was informed by its prior success in complex pattern recognition tasks within the hydrological domain. The pretrained model offered a sophisticated foundation, with established layers and configurations that have been optimized through cumulative research efforts. This approach allowed us to leverage the inherent strengths of the ANN while focusing research efforts on fine-tuning the model to align with the specific characteristics of the Dungun River dataset. A detailed description of the model's architecture, including its original training background and its application to this study, will provide critical contributions to FF procedures (Kumar & Yadav 2021).
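As a hedged, freshly trained stand-in for an ANN classifier of this kind (not the pretrained architecture described above), the following uses scikit-learn's MLPClassifier; the layer sizes and the synthetic rainfall/water-level features are illustrative assumptions.

```python
# ANN sketch: a small multilayer perceptron mirroring the
# input -> hidden -> hidden -> output structure of Figure 3.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n = 600
rainfall = rng.uniform(100, 500, n)        # synthetic annual rainfall (mm)
water_level = rng.uniform(33.0, 34.5, n)   # synthetic water level (m)
X = np.column_stack([rainfall, water_level])
y = ((rainfall > 300) & (water_level > 33.5)).astype(int)  # toy flood label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling matters for gradient-based training of neural networks.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0),
)
ann.fit(X_tr, y_tr)
print("test accuracy:", ann.score(X_te, y_te))
```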
Simulation algorithms
This article relies on six major AI algorithms and intelligent ML schemes that carry out the numerical simulation and optimization processes needed for accurate, high-performance prediction of flooding events in the Dungun River with the help of the numerical real-time FF approach. These six intelligent models are as follows:
- A. LR,
- B. KNN,
- C. SVC,
- D. NB,
- E. DT, and
- F. RF.
The following paragraphs provide a brief description of each numerical AI model.
Logistic regression
LR is a statistical method for analyzing and modeling the relationship between a binary dependent variable and one or more independent variables. It has widespread application in many sectors, including medicine, business, advertising, and the social sciences. LR is beneficial when the independent factors are continuous while the dependent variable is either binary (yes/no) or ordinal (low/medium/high). LR goes beyond simple classification methods since it can also forecast the likelihood of events. Model construction and variable selection using LR also work for data with nonlinear associations between independent and dependent variables. Predicting medical outcomes, such as the likelihood of contracting an illness, is a prominent use of LR.
Gyawali et al. (2018) reported that diabetic individuals' risk of cardiovascular disease can be predicted flexibly using LR. In addition, the field of marketing uses LR to foretell buyer actions and client defection. In the field of finance, it has been employed in forecasting such events as bankruptcy and loan default (Kleinbaum & Klein 2010). When modeling the association between a binary dependent variable and a set of independent variables, LR can supply a robust and extensively used statistical approach. It can be valuable in various contexts due to its adaptability and predictive power.
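A minimal LR example in the spirit of the description above, using scikit-learn on synthetic data: a binary flood/no-flood label is modeled from two continuous predictors, and the fitted model returns event probabilities rather than just class labels. The feature names, thresholds, and query points are illustrative assumptions.

```python
# Logistic regression on synthetic flood data, returning event probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
rainfall = rng.uniform(100, 500, 300)      # synthetic annual rainfall (mm)
water_level = rng.uniform(33.0, 34.5, 300) # synthetic water level (m)
X = np.column_stack([rainfall, water_level])
y = (rainfall + 150 * (water_level - 33.0) > 400).astype(int)  # toy rule

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of a flood for a wet year vs. a dry year (illustrative inputs).
wet, dry = [450.0, 34.3], [150.0, 33.1]
p_wet, p_dry = clf.predict_proba([wet, dry])[:, 1]
print(f"P(flood | wet year) = {p_wet:.2f}, P(flood | dry year) = {p_dry:.2f}")
```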
K-nearest neighbors
For classification and regression, the KNN algorithm is one of the most well-known and successful ML algorithms available. It is a nonparametric algorithm, so it does not assume anything about the data's distribution. It works by finding the K-nearest data points in the feature space to a given data point and then labeling or valuing that point based on the majority class or average value of its nearest neighbors.
The KNN method has a number of benefits, including its ease of use, adaptability, and capacity to deal with classification problems involving several classes. It is also a memory-based method, so it may be used for learning in either online or offline environments without first being trained on the data. However, the KNN algorithm's performance is sensitive to the distance metric and the value of K chosen, which can have a major effect on the algorithm's accuracy and computing cost.
In addition to its use in image recognition and natural language processing (NLP), the KNN algorithm has also been implemented in recommendation systems. Some applications of KNN include sentiment analysis (Wang et al. 2018), collaborative filtering (Liu et al. 2019), and handwritten digit recognition (Chen et al. 2019).
The KNN algorithm is a simple and effective method for classification and regression, making it a significant and popular ML algorithm. Because of its adaptability and ease of use with complex data, it can be put to good use in a wide variety of contexts.
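A minimal KNN sketch matching the description above: a point is classified by the majority label among its K nearest neighbors. The data are synthetic and the choice K = 5 is an illustrative assumption.

```python
# K-nearest neighbors: majority vote among the K closest training points.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # toy decision boundary

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Points far from the boundary get a unanimous neighborhood vote.
print(knn.predict([[0.9, 0.9], [0.1, 0.1]]))  # expect [1 0]
print(knn.predict_proba([[0.9, 0.9]]))        # neighbor vote fractions
```

Varying `n_neighbors` directly trades off noise sensitivity (small K) against boundary smoothing (large K), which is the tuning cost noted above.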
Support vector classifier
One well-known ML technique that sees extensive application in classification tasks is the SVC. The method is based on finding a hyperplane that divides the data into classes while maximizing the distance between the hyperplane and the nearest data points. Since the SVC approach may use kernel functions to project data into a higher-dimensional space where it can be partitioned by a hyperplane, it is well suited to nonlinearly separable data.
The SVC technique has several benefits, including its scalability and efficiency when working with both large and small datasets. Since only the data points closest to the hyperplane (the support vectors) determine the hyperplane, the method yields a unique solution. However, the effectiveness of the SVC algorithm can depend delicately on its hyperparameters, such as the regularization parameter and the kernel function used.
Image classification, biology, and even finance are just a few of the areas where the SVC method has been successfully implemented, including image object identification (Kumar et al. 2020), gene expression analysis (Bai et al. 2019), and credit risk assessment (Wu & Lin 2017). For classification missions, the SVC method is a popular and successful ML algorithm; its versatility stems from its ability to handle high-dimensional and nonlinearly separable data.
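A minimal SVC sketch of the kernel idea above, on synthetic data: an RBF kernel lets the classifier separate a circular class that no straight line can. The hyperparameters shown are illustrative defaults, not tuned values.

```python
# Support vector classifier with an RBF kernel on nonlinearly separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
# Circular class region: not separable by any single straight line.
y = (np.hypot(X[:, 0], X[:, 1]) < 1.0).astype(int)

svc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", svc.score(X, y))
print("support vectors per class:", svc.n_support_)
```

Only the support vectors (the points nearest the decision surface) determine the fitted boundary, which is why `n_support_` is typically much smaller than the dataset.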
Naive Bayes
The NB algorithm is widely employed in classification applications and is a popular ML technique. It is grounded in Bayes' theorem, which stipulates that fresh evidence (in this case, the feature values of a data point) leads to an updated hypothesis probability (here, the class label of a data point). The NB algorithm simplifies the probability calculation by assuming that the characteristics are conditionally independent given the class label.
The NB algorithm's strengths lie in its flexibility of implementation and its potency in handling high-dimensional data. Compared with other classification algorithms, it has a low propensity for overfitting, and it can be easily modified for use with online and incremental training. However, the assumption of feature independence is not guaranteed to hold in all circumstances, which might result in subpar performance.
Spam filtering, sentiment analysis, and document classification are just a few of the areas where the NB method has been successfully implemented. NB has been applied to various fields, such as social media sentiment analysis (Zhang & Liu 2011), email spam filtering (Gao et al. 2020), and medical record classification (Ji et al. 2018).
Decision tree
In ML, the DT technique is commonly employed for classification and regression purposes. The method works by repeatedly partitioning the data into smaller subsets according to the feature values, with the intention of either increasing the information gain or decreasing the impurity. The resulting tree can then be used to make predictions on unlabeled data by traversing its nodes according to their feature values.
The DT algorithm's strengths lie in its easy-to-understand fundamentals, which also enable its use on problems with high-dimensional data. It captures nonlinear associations between features and the dependent variable, and it works with both categorical and continuous features. However, DTs are vulnerable to overfitting and can be overly sensitive to slight variations in the input data or the parameters used to train the model.
Several fields, including business, medicine, and marketing, have found the use of DT algorithms remarkably feasible. Credit risk analysis (Zhang & Kou 2020), medical diagnostics (Zhang et al. 2019), and consumer segmentation (Berman et al. 2018) are just a few examples of the many applications of DTs. Concerning classification and regression problems, the DT algorithm is a popular and effective ML technique since it is easy to understand and implement. Because of its flexibility and interpretability, it can be applied to various problems.
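A minimal DT sketch of the recursive-splitting idea above, on synthetic data: the fitted tree can be printed as human-readable if/else split rules, which is the interpretability advantage noted. The flood rule and depth limit are illustrative assumptions.

```python
# Decision tree: recursive splits on feature values, readable as rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
rainfall = rng.uniform(100, 500, 300)      # synthetic annual rainfall (mm)
water_level = rng.uniform(33.0, 34.5, 300) # synthetic water level (m)
X = np.column_stack([rainfall, water_level])
y = ((rainfall > 300) & (water_level > 33.5)).astype(int)  # toy flood rule

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Print the tree as human-readable split rules.
print(export_text(tree, feature_names=["rainfall", "water_level"]))
print("training accuracy:", tree.score(X, y))
```

The `max_depth` cap is one simple guard against the overfitting sensitivity mentioned above.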
Random forest
A typical ML approach for classification, regression, and other purposes is the RF algorithm. In this variant of the DT algorithm, multiple DTs are built on random subsets of the data and features. The resulting ensemble mitigates overfitting and improves model accuracy and resilience. The RF algorithm's strengths include its adaptability to varying dataset sizes, its robustness in the face of noisy or missing data, and its scalability. It may also shed light on the relative significance of the features for the prediction task. However, RFs may not do well on severely skewed or imbalanced datasets and can be computationally intensive.
Several fields, including biology, finance, and marketing, have found success using the RF algorithm. Gene expression analysis (Liaw & Wiener 2002), stock price prediction (Singh & Tandon 2019), and customer churn prediction (Luo et al. 2017) are just a few examples of its many applications. With its robust and versatile approach to classification, regression, and other problems, the RF algorithm has earned its place as a prominent and widely used ML algorithm. Its versatility and durability make it an invaluable resource in varied settings.
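A minimal RF sketch of the ensemble idea above, on synthetic data, also showing the feature-importance side benefit: one feature is deliberately irrelevant noise and should receive the lowest importance. Feature names and the flood rule are illustrative assumptions.

```python
# Random forest: an ensemble of randomized decision trees, with
# feature-importance scores as a by-product.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(9)
n = 400
rainfall = rng.uniform(100, 500, n)        # synthetic annual rainfall (mm)
water_level = rng.uniform(33.0, 34.5, n)   # synthetic water level (m)
noise = rng.normal(size=n)                 # deliberately irrelevant feature
X = np.column_stack([rainfall, water_level, noise])
y = ((rainfall > 300) & (water_level > 33.5)).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(["rainfall", "water_level", "noise"],
                       rf.feature_importances_))
print(importances)   # the noise feature should score lowest
```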
Details on the study area
The study area of this research is correlated to the Dungun River. It is located in the state of Terengganu, Malaysia. This river basin has a tropical climate with a monsoon season causing significant variations in river discharge levels. The Dungun River is characterized by its unique catchment area, which has been subject to both natural and anthropogenic changes over the years. This region has a history of flooding, particularly in areas with dense vegetation and urban settlements close to the riverbanks. In recent years, the frequency and intensity of these events have increased, prompting a need for in-depth analysis and accurate forecasting models. The topography, land use patterns, and existing water management infrastructure all influence the hydrological processes of the Dungun River, making it an ideal case study for flood prediction research (Ishak et al. 2014; Sidek et al. 2020).
Data collection
The data collection for this study was carefully performed to ensure a comprehensive understanding of the factors influencing flood events in the Dungun River. Primary data were obtained through field measurements, including water level recorders and rainfall gauges strategically placed throughout the catchment area. These instruments provided continuous monitoring of river levels and precipitation, offering high-resolution data critical for accurate modeling.
Secondary data were sourced from local meteorological and hydrological agencies, which supplied historical records of rainfall and flood events. Satellite imagery and aerial surveys were employed to assess land use changes over time, providing additional insights into the anthropogenic impact on flood patterns. All data underwent rigorous quality control procedures to verify their accuracy before being compiled into a centralized database for analysis.
Nonetheless, it is vital to mention that collecting real-time records of water level, air temperature, soil humidity, rainfall type, and precipitation rates over long time spans (several years) via IoT sensors would demand considerable computational effort and add calculation complexity. To avoid these costs, this article relied on a previously recorded database provided by International Geographic Information System Maps (IGISMAPs), which offers open-source geographic information for different regions worldwide.
Table 1 presents a year-by-year sample of statistical data collected on rainfall and water level, along with other variables, spanning the years 1986 to 2000 for some sites around the Dungun River.
| Site – Dungun | Year | Annual RF | Annual snowfall | Annual WL | Floods |
|---|---|---|---|---|---|
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1986 | 267.125 | 32.6 | 33.8 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1987 | 250.292 | 51.1 | 33.54 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1988 | 290.85 | 51.9 | 33.33 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1989 | 240.317 | 40.7 | 33.25 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1990 | 290.292 | 57.9 | 33.25 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1991 | 272.333 | 76.5 | 33.18 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1992 | 289.692 | 94.9 | 33.1 | NO |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1993 | 326.767 | 113.2 | 33.08 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1994 | 328.75 | 131.5 | 33.12 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1995 | 189.917 | 149.9 | 33.19 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1996 | 276.7 | 168.5 | 33.73 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1997 | 218.917 | 186.2 | 34.02 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1998 | 173.517 | 204 | 33.78 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 1999 | 206.125 | 222.3 | 33.45 | YES |
| Site 4529001 Rumah Pam Paya Kempian at Pasir Raja | 2000 | 127.183 | 240.6 | 33.33 | YES |
Accordingly, a massive dataset can be avoided and a suitably sized Dungun River dataset can be employed to perform the essential FF task, minimizing the computational time, effort, and cost required to conduct the FF.
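The dataset preparation described above can be sketched as follows. This is a minimal illustration assuming pandas; the column names and the 0/1 flood-label encoding are hypothetical choices, not the authors' exact schema, and only a few rows from Table 1 are reproduced.

```python
import pandas as pd

# Hypothetical column names; a small sample of the Table 1 records.
records = [
    (1986, 267.125, 33.80, "NO"),
    (1993, 326.767, 33.08, "YES"),
    (1997, 218.917, 34.02, "YES"),
    (2000, 127.183, 33.33, "YES"),
]
df = pd.DataFrame(records, columns=["year", "annual_rf", "annual_wl", "flood"])

# Encode the YES/NO flood label as 1/0 for the classifiers.
df["flood"] = (df["flood"] == "YES").astype(int)
X = df[["annual_rf", "annual_wl"]]  # feature matrix
y = df["flood"]                     # target labels
```

A table in this shape can then be fed directly to the classifiers discussed below.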
RESULTS AND DISCUSSION
The main results were obtained from the simulation process and mathematical analysis carried out in the Google Colab environment, considering the following aspects:
- a.
The frequency of flooding events, derived from the historical dataset to project floods for future years. The indices taken into account included the water level and the duration of heavy precipitation; rainfall intensity, water flow rate, and historical flood data were also addressed.
- b.
The frequency of flooding events at each prediction site around the Dungun River.
- c.
Surface WLs on an annual basis.
- d.
Annual rainfall (RF).
The numerical code for the AI and ML models adopted to perform the FF was developed and run in Google Colab (https://colab.research.google.com/).
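A minimal sketch of such a Colab pipeline, assuming scikit-learn implementations of the six ML classifiers (LR, KNN, SVC, NB, DT, RF) and a synthetic stand-in for the rainfall/water-level features; the real notebook uses the IGISMAP records described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features/labels (flood = 1, no flood = 0).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its held-out accuracy.
accuracies = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
              for name, m in models.items()}
```

The resulting `accuracies` dictionary corresponds to the per-model accuracy comparison reported in the results.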
Figure 10 illustrates the frequency of flooding events, where (0) and (1) represent ‘no flooding’ and ‘flooding,’ respectively.
The Colab simulation results on the frequency of flooding in the Dungun River indicate that the number of floods between 1970 and 2020 ranged between approximately 18 and 55 across all sites. Further, Figure 7 describes the frequency of flooding events by year.
It is inferred from the simulation results in Figure 10 that a water level of 33.50 cm corresponded to the maximum flood frequency of about 50 events, followed by WLs of approximately 33.35, 33.62, 34.00, and 33.17 cm, at which the flood frequencies were about 35, 30, 20, and 10, respectively.
On the other hand, Figure 10 offers a comprehensive view of the frequency of flooding events in the Dungun River area, broken down by year, annual rainfall, and annual water level. This figure correlates the annual rainfall and water level variations with the frequency of flooding incidents. By mapping these variables across a timeline, any patterns or trends could be flexibly identified, thereby providing valuable insights for flood prediction and management in the area. This figure is particularly crucial for understanding the environmental factors that contribute to flooding risks, and it will serve as a foundational element in this analysis of the six sites the study is focusing on.
ML simulation outcomes and evaluation criteria
The simulations executed using the Colab software package and the numerical code provided the rates of accuracy linked to the six algorithms addressed and employed in this article to make flooding predictions under FF principles. Those results are expressed in Figure 12.
It is indicated from the numerical outputs represented in Figure 12 that the RF algorithm offered the maximum accuracy, registering 75.61%, followed by the KNN model at 73.17%. In contrast, the LR model recorded the lowest accuracy, approximately 48.78%. In addition, it is worth mentioning that the average accuracy of the ANN model was 90.85%.
Accordingly, it can be deduced from the overall outcomes attained in this work that employing high-performance ANNs and some efficient ML models could provide considerable levels of forecasting accuracy, reliability, speed, and robustness. As a consequence, these contributory merits could support responsible risk managers in conducting valid and appropriate decision-making processes, reflected in rapid responses and active actions that protect local citizens' lives, properties, and financial resources, relying on the maximized degree of safety fulfilled by comprehensive and real-time data collection from flooding areas.
In examining the predictive capabilities of various AI models, the research findings highlighted the performance difference between ensemble methods and traditional algorithms. The RF algorithm's ensemble approach produced more reliable outcomes by integrating multiple decision trees (DTs) to mitigate the risk of overfitting and capture complex patterns. The success of KNN, in turn, suggested that local spatial patterns were highly indicative of flood events, which this model could exploit adeptly. In contrast, other ML models such as LR and SVC might not fully capture the multiple patterns in the analyzed data. These insights underscore the importance of model selection in FF and suggest a potential avenue for further refining these models to enhance their accuracy in future studies.
Table 2 provides a comparative summary of ML models' performance metrics, evaluating the following performance evaluation metrics:
- (a)
Precision (P),
- (b)
Recall (R),
- (c)
F1-score (F1), and
- (d)
Accuracy (Acc).
| Model | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|
| KNN | 0.74 | 0.74 | 0.74 | 0.7317 |
| LR | 0.74 | 0.74 | 0.74 | 0.4878 |
| Decision tree | 0.58 | 0.50 | 0.36 | 0.5853 |
| Random forest | 0.58 | 0.50 | 0.36 | 0.756 |
| SVM | 0.58 | 0.50 | 0.36 | 0.6585 |
| Naive Bayes | 0.58 | 0.50 | 0.36 | 0.5365 |
These four effectiveness assessment measures are the most prevalent indices utilized to examine the outcomes' robustness when ML and AI models are employed.
The models assessed are KNN, LR, DT, RF, SVM, and NB. KNN and LR share precision, recall, and F1-scores of 0.74, suggesting balanced predictions for positives and negatives, although their accuracies differ markedly (0.7317 versus 0.4878). DT, RF, SVM, and NB show lower precision, recall, and F1 values (0.58, 0.50, and 0.36), indicating that these models struggle to balance P and R and have difficulty correctly classifying positive (flood) cases. Among these four, RF achieves the highest accuracy at 0.756, followed by SVM at 0.6585, but their shared low F1 suggests they may be emphasizing the majority class.
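As a concrete illustration of how these four metrics are computed, the following sketch assumes scikit-learn's metric functions; the flood labels below are toy values, not the study's data.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Toy flood labels (1 = flood, 0 = no flood) for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)  # share of correct labels
p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)         # harmonic mean of P and R
# Here TP = 3, FP = 1, FN = 1, TN = 3, so all four metrics equal 0.75.
```

Each model in Table 2 is scored this way against the held-out test labels.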
From Figure 14, it can be derived that the model's loss declines as the number of epochs increases, eventually reaching a minimum value of approximately 0.41 at 100 epochs. This indicates that the FF model is effectively learning from the training data. The decreasing trend in the loss function suggests that the model's predictions are increasingly aligning with the actual outcomes, improving its performance over time. Reaching a loss of 0.41 at 100 epochs may imply that the model has reached a satisfactory level of convergence, where further training could lead to overfitting or negligible improvements (Chen & Pattabiraman 2023). On the other hand, based on the numerical findings for the performance examination criteria, it can be concluded that KNN accomplished higher precision, recall, F1-score, and accuracy rates than the other ML and AI algorithms, corresponding to 74, 74, 74, and 73.17%, respectively. In contrast, NB recorded precision, recall, F1-score, and accuracy levels of 58, 50, 36, and 53.65%, respectively.
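The epoch-by-epoch loss monitoring described for Figure 14 can be sketched as follows, assuming a scikit-learn MLP as a stand-in for the paper's ANN; the architecture and data here are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Illustrative stand-in data and architecture, not the paper's exact ANN.
X, y = make_classification(n_samples=300, n_features=4, random_state=1)
ann = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=100,
                    random_state=1)
ann.fit(X, y)

# loss_curve_ holds one loss value per epoch; a steadily falling curve,
# as in Figure 14, indicates the network is learning from the data.
assert ann.loss_curve_[-1] < ann.loss_curve_[0]
```

Plotting `ann.loss_curve_` against the epoch index reproduces the kind of loss-versus-epochs view shown in Figure 14.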
The outcomes of this article are consistent with the results of previous studies (Noor et al. 2016; Janizadeh et al. 2019; Kumar & Yadav 2020; Anaraki et al. 2021; Islam et al. 2021; Mateo-Garcia et al. 2021; Sidek et al. 2021; Avand et al. 2022; Kumar et al. 2023a, 2023b), who found that adopting AI, ANN, ML, and advanced ML and AI models could provide accurate forecasting processes and high-performance prediction of WLs, rainfall, and other flooding parameters, which, in turn, could support decision-makers in executing active risk management missions.
However, it should be noted that in the context of active prediction and precise FF, DL algorithms and NNs could deliver more robust outcomes, especially for longer-term prediction intervals where FF needs to be executed for the far future (several decades or a century ahead) (Song et al. 2019).
The robustness of the KNN model, which outpaced several other models, would supply critical benefits in the field of FF due to its effectiveness in:
performing data imputation,
carrying out optimization, since the number of neighbors (k) serves as a tunable hyperparameter, and
conducting efficient and potent prediction tasks flexibly compared with other ML and AI models.
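The KNN-based data imputation mentioned in the first point can be sketched as follows, assuming scikit-learn's `KNNImputer`; the rainfall/water-level readings are illustrative values.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative (rainfall, water level) readings with one missing value.
readings = np.array([
    [267.1, 33.80],
    [250.3, 33.54],
    [np.nan, 33.33],   # missing rainfall reading
    [240.3, 33.25],
])

# Fill each gap with the mean of the k most similar complete records.
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(readings)
```

The missing rainfall value is replaced by the average of its two nearest neighbors' rainfall, keeping the record usable for the classifiers.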
Given the valuable applications of innovative AI and ML methods, further in-depth investigation of accurate FF procedures is warranted to confirm their importance for agile and reliable FF in other world regions, wherever sufficient datasets or open-source GIS information are available.
CONCLUSIONS AND RECOMMENDATIONS
Conclusions
This article executed an FF task using high-performance numerical models to deliver reliable, robust, and accurate FF outputs. Mathematical AI, ML, and ANN paradigms were analyzed and numerically simulated. A previously recorded dataset of flood parameters for the Dungun River Basin was downloaded from a web GIS platform, since measuring flood variables over long periods would entail massive computational complexity. A panel of ML and flooding experts validated the numerical research findings.
The outcomes revealed that among a group of ML models tested in this analysis, RF and KNN delivered enhanced performance and reliability, since their accuracy rates attained approximately 75 and 73%, respectively. These results are significant when considering the increasing trend of flooding events within the Dungun River since 2000, which is often associated with larger rainfall, amounting to 250 mm. In addition, these statistical figures illuminate the importance of practical FF mechanisms. Hence, appropriate flood mitigation strategies can be followed.
Furthermore, it was determined that the majority of Dungun River floods occurred due to rainfalls between 1 and 500 mm. The maximum frequency of flooding was measured at 110 occurrences at a rainfall of 250 mm. The accuracy of the RF was 75.61%, followed by the KNN at 73.17%. The accuracy of the LR was the lowest (approximately 48.78%). In comparison, the ANN model had a satisfying mean accuracy with an amount of 90.85%.
Consequently, it is notable that KNN proved more robust and efficient than the other ML and AI models in conducting high-performance flood projection in this work.
The notable rise in flood frequency along the Dungun River since 2000 was particularly linked to heavier rainfall events. This raises important questions about the causes of these observations. The results of this research suggest that anthropogenic factors (such as climate change and global warming) are directly connected with these problems. Unusual climate conditions have been noted worldwide, contributing to harmful impacts on living creatures. These uncommon climate behaviors include extreme weather events (such as very hot summers and considerably colder winter days), larger precipitation rates in some global zones, and substantial droughts and deforestation in other areas. In addition, the NASA climate platform continues to record key statistics on severe events taking place on the planet (NASA 2022).
These harmful effects of climate change and global warming have unfortunately brought increased flooding frequencies and extensive precipitation events, changing the hydrological response of the river system. To help offer a better understanding of these climate patterns, it is essential to employ durable AI models, ML fundamentals, and ANN paradigms to provide a detailed analysis and thorough inspection regarding how to tackle these problems, their reasons, and corresponding strategies to mitigate them via efficient risk management.
In addition, based on the ML analytical simulation process, it was found that adopting active ML prediction of severe flooding circumstances could accomplish upgraded risk management efficiency and performance to respond to flooding events.
Simultaneously, offering practical IoT sensors and proper wireless measurement tools is necessary to ensure sufficient real-time data collection of the various flood variables, so that better numerical prediction can be attained. These real-time datasets can also support risk managers and rescue teams in choosing the most feasible response plan, minimizing the loss of property and life and maximizing safety and protection for local residents.
On the other hand, the unique attributes of this methodology stem from the application of a sophisticated pretrained ANN model, adjusted to align with the hydrological dynamics of the Dungun River. The model's ability to analyze complex patterns from vast datasets represents a substantial advancement over conventional methods. It offers increased flooding prediction accuracy, potency, and reliability, which formulate some critical factors in the context of FF for enhanced life-saving responses.
Above these outcomes, it is important to remark that the potential applications of these AI, ML, and ANN algorithms can be engaged in several practical domains and other uses. By providing more accurate forecasts, those models could support local authorities and responsible parties in selecting the most agile and efficient action to respond to harmful flood events flexibly. As a result, the adoption of such numerical models can bring improved performance related to early warning systems, enhance the efficacy of emergency response plans, and inform the development of adequate infrastructure, which is better suited to withstand flash flooding circumstances or rapid rain rates. In addition, adopting these numerical models could foster societal and environmental resilience toward robust flooding event response.
Recommendations
Depending on the numerical outcomes attained from this work, the following recommendations are drawn to help elevate the FF efficacy and robustness. These crucial recommendations comprise the following aspects:
To broaden the consideration of ML models and AI techniques in performing high-performance flooding forecasting process for different rivers with frequent rainfall.
To support risk management authorities and responsible flood seniors in executing appropriate rescue missions of local citizens and properties using agile AI and ML forecasting approaches.
To conduct educational sessions and training courses for computer engineers and risk management professionals on ML and AI engagement to supply more potent risk management missions.
To employ practical data dimensionality reduction techniques, which can mitigate the massive size of data that contributes to excessive and complex computational time, effort, and budget.
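The dimensionality reduction recommended in the last point can be sketched as follows, assuming PCA via scikit-learn; the feature matrix here is an illustrative stand-in for a wide hydrological sensor table.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative wide feature table: 100 records, 10 raw sensor features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Keep enough principal components to retain 95% of the variance,
# shrinking the table fed to the FF models.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```

Training the classifiers on `X_reduced` instead of `X` reduces the computational time, effort, and budget the recommendation targets.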
RESEARCH LIMITATIONS
Despite the successful implementation of this research, it is crucial to acknowledge that this numerical analysis faced some limitations. One limitation is the lower precision, accuracy, and FF durability and efficiency achieved by some ML and AI models compared with the ANN model, whose accuracy approached 91%. Thus, not all AI and ML models are well suited to high-performance FF tasks; some may instead achieve better performance in other contexts (e.g., detection or visual recognition).
Furthermore, another limitation of this article was associated with conducting real-time data collection of water level, air temperature, soil humidity, rainfall type, and precipitation magnitude for longer timespans. This concern could consume much time, effort, and budget, contributing to computational complexity to run the FF process.
FUTURE WORK
To enhance the generalizability of flood prediction models, future research should address the limitations noted in the current study and aim to integrate a broader spectrum of data, embracing both spatial and temporal diversity. In light of the critical findings and accuracy outputs attained in this work, the following future work suggestions are proposed to help maximize the contributions and value of this article:
Given that the research simulations were executed in Colab with mathematical AI models, it would be advantageous to repeat the FF procedure while taking into account other critical factors and flooding elements that affect the overall performance of the FF process.
To employ another simulation software tool that has more efficiency and feasibility in the mathematical analysis of FF.
To deploy DL models, which are more effective and functional than conventional ML and AI paradigms in leading feasible FF procedures.
To implement a similar FF procedure but for another site in the world to affirm the robustness and practicality of AI, ML, and ANN algorithms in performing efficient FF task.
ACKNOWLEDGEMENT
This research was supported by the Ministry of Higher Education (MoHE), Malaysia, through the Trans-Disciplinary Research Grant Scheme, under project code TRGS/1/2020/UNITEN/01/1/1. The authors would like to acknowledge the Department of Drainage and Irrigation Malaysia (DID) for providing the hydrological data and report required for this study.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.