ABSTRACT
Hydrological modeling is one of the most complicated tasks in sustainable water resources management, particularly when it comes to predicting rainfall. Predicting rainfall is critical to building a sustainable society through hydropower operations, agricultural planning, and flood control. In this study, a hybrid model based on the integration of k-nearest neighbor (KNN), XGBoost (XGB), decision tree (DCT), and Random Forest (RF) has been developed and implemented for the first time to forecast daily rainfall at Sydney Airport, Australia. Daily rainfall, temperature, evaporation, and humidity were selected as input parameters. Four statistical measurements, namely root mean square error (RMSE), coefficient of determination (R2), mean absolute error (MAE), and normalized root mean square error (NRMSE), were used to check the accuracy of the proposed model. A sensitivity analysis was conducted, and the results indicated that the rainfall prediction was highly sensitive to temperature, humidity, and evaporation. According to the results, the developed hybrid model predicted daily rainfall with high performance in both training (RMSE = 0.124, R2 = 0.999, MAE = 0.007, NRMSE = 0.04) and testing (RMSE = 1.246, R2 = 0.991, MAE = 0.109, NRMSE = 0.339).
HIGHLIGHTS
A novel hybrid model has been developed for rainfall prediction for the first time.
Accurate rainfall prediction will lead to better management of water resources.
Rainfall prediction is an important hydrological tool due to global climate change.
Farmers use rainfall forecast methods to plan for irrigation decisions.
ABBREVIATIONS
INTRODUCTION
For human life, societal growth, and ecological (natural, biological, and environmental) health, water is a vital natural resource. Water is essential for manufacturing, agriculture, and biotransformation, whether for drinking or personal hygiene. Since the global population is rapidly increasing, the demand for water, especially fresh water, is increasing as well. The demand for food increases and it all depends on irrigation, which requires a huge amount of water. As a result, humans will face more water problems. Water shortage has become a major issue worldwide in recent decades, particularly in developing countries (Gohari et al. 2013; Sowby 2023; Villanueva et al. 2023).
Rainfall is an important component of the global water cycle and the main source of water supply, especially in dry and semi-arid regions. The richness of water resources in any area depends on the amount of rainfall. Rainfall is also a crucial indicator for assessing the ecological integrity of a place because it reflects the dynamic changes brought on by drought. As a result, rainfall prediction is commonly needed in hydrological forecasting and water resource management. Forecasting rainfall also has a large effect on economic growth and people's lives. It can help individuals anticipate disasters and develop plausible strategies to lessen the effects of unforeseen climatic events. For instance, Tesco, a British grocery chain, decreased costs by 30% and saved over 6 million pounds in 2013 by modifying warehouse inventory and sales methods based on weather forecast information. However, because meteorological conditions are unpredictable, diverse, and complex, the rainfall process involves considerable uncertainty. So far, physical factors have made it challenging to pinpoint the precise amount of rain that will fall at a given location and time. Beyond the atmosphere's inherent complexity and its associated dynamic processes, the rising demand for rainfall prediction poses further challenges for the hydrological field, which has attracted the interest of the machine learning community (Zhang et al. 2020; Latif et al. 2023).
With 70% of its land classed as desert or semi-desert, Australia is the driest inhabited continent. This island continent's water supply is dependent on rainfall. As in any other region, variation in rainfall frequently influences water availability across Australia. The continent experiences a variety of rainfall patterns, including dryness (lack of rain), flood (excess rain), and droughts. Examples include the Millennium drought, which lasted from 1995 to 2009, the 1970s dry shift in southwest Australia, and widespread flooding in eastern Australia from 2009 to 2012. The tropical regions in the north receive the most rainfall, while the interior is dry desert. These variations in rainfall raise serious concerns about water availability, management, and future resource planning (Raval et al. 2021).
Many studies focused on developing machine learning techniques to forecast hydrological parameters (Praveen et al. 2020; Koppa et al. 2022; Mohammadi et al. 2022), especially rainfall (Dash et al. 2018; Xiang et al. 2018). For instance, Chatterjee et al. (2018) proposed a study to develop a novel method for forecasting rainfall in India. They have applied two algorithms, namely, Greedy forward selection and k-means algorithms with neural networks in order to develop a hybrid model. They have also applied a multilayer perceptron hybrid model for comparison purposes with the proposed hybrid model. According to their findings, it is revealed that the proposed hybrid model outperformed the conventional model with a huge difference in terms of accuracy. Moreover, Kumar et al. (2021) developed two hybrid models for predicting rainfall in India. They have also applied three other machine learning algorithms for comparison purposes with their hybrid models. According to their results, their proposed hybrid models performed better than single machine learning algorithms. Furthermore, Bellido-Jiménez et al. (2021) implemented several artificial intelligence techniques to forecast rainfall in different areas in Spain. Based on their findings, the utilization of neighbor data within a 50-km radius outperformed the other options tested. Furthermore, inland areas outperformed coastal areas in most locations, indicating that efficiency effects based on distance to the sea exist. For their proposed models, MLP outperformed simple RF. In addition, Xiang et al. (2020) proposed a study for estimating hourly rainfall-runoff in Clear Creek and Upper Wapsipinicon River in Iowa. They have applied a prediction model based on long short-term memory (LSTM) and the seq2seq structure. Based on their results, the LSTM-seq2seq outperformed other proposed machine learning models including single LSTM. 
According to their findings, the LSTM-seq2seq model has enough predictive capacity to enhance forecast accuracy in short-term flood forecasting applications. On the other hand, Aggarwal et al. (2023) showed the influence of climate change uncertainty in small regions. They reported a high correlation between temperature and rainfall in Ludhiana district, Punjab, India. This shows that hydrological and meteorological parameters should be taken into consideration for sustainable water resources management. Moreover, different models have been used for different fields in water resources management. For instance, Eltarabily et al. (2023) utilized the Slide2 model for estimating seepage loss.
In this study, a hybrid model based on the integration of four machine learning algorithms has been developed and applied in order to forecast daily rainfall at Sydney Airport, Australia. Several statistical indices were utilized to check the accuracy of the proposed models. The conventional machine learning model could not successfully predict the rainfall parameter in the humid subtropical areas due to the complexity of the weather pattern. Therefore, the novelty of this study is that it is the first time these four algorithms, namely, k-nearest neighbor (KNN), XGBoost (XGB), decision tree (DCT), and RF have been combined as a hybrid model in order to predict daily rainfall. The developed model could be practically used by the water sectors to achieve sustainable water resources management.
MATERIALS AND METHODS
Study area and data
Data | Mean | Median | Mode | SD | Min | Max |
---|---|---|---|---|---|---|
Min. Temperature (°C) | 14.87 | 14.9 | 11.1 | 4.55 | 4.3 | 27.6 |
Max. Temperature (°C) | 23.00 | 22.8 | 19.6 | 4.49 | 11.7 | 45.8 |
Rainfall (mm) | 3.32 | 0 | 0 | 9.88 | 0 | 119.4 |
Evaporation (mm) | 5.17 | 4.8 | 4 | 2.77 | 0 | 18.4 |
Humidity (9:00 am) | 68.18 | 69 | 71 | 15.1 | 19 | 100 |
Humidity (3:00 pm) | 54.68 | 56 | 59 | 16.3 | 10 | 99 |
KXDF hybrid model
In this study, a hybrid model for predicting rainfall is developed based on the integration of KNN, XGB, DCT, and RF. The developed hybrid model was named KXDF, referring to the four machine learning algorithms used. KNN regression is a non-parametric method that approximates the relationship between independent variables and a continuous outcome in an understandable manner by averaging data in the same neighborhood (Song et al. 2017). XGBoost is an open-source tool that efficiently implements the gradient boosted trees method. Gradient boosting is a supervised learning technique that combines the forecasts of a number of weaker, simpler models in an effort to accurately predict a target variable (Yu et al. 2020). Because they are easy to comprehend and efficient, decision trees are frequently the tool of choice for predictive modeling. The primary function of a decision tree is to partition a large amount of data into more manageable chunks (Mehraein et al. 2022). RF is an ensemble of decision trees that is widely used for regression and time-series prediction modeling, especially in hydrology (Saadi et al. 2019). Performance assessment has been carefully carried out to check the accuracy of the proposed model (Krishnaraj & Honnasiddaiah 2022).
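The neighborhood averaging behind KNN regression can be illustrated with a minimal sketch. The function name `knn_predict`, the choice of Euclidean distance, and the toy values below are assumptions made for illustration; they are not taken from the study's code or data.

```python
import math

def knn_predict(train_X, train_y, query, k=3):
    """Predict a continuous target as the mean of the k nearest training targets."""
    # Euclidean distance from the query point to every training point
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # Keep the k closest points and average their target values
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical toy data: (max temperature, 3 pm humidity) -> rainfall in mm
train_X = [(22.0, 80.0), (23.0, 85.0), (30.0, 40.0), (31.0, 35.0), (21.0, 90.0)]
train_y = [5.0, 7.0, 0.0, 0.0, 12.0]

prediction = knn_predict(train_X, train_y, query=(22.5, 84.0), k=3)
print(prediction)  # mean rainfall of the three most similar days
```

Because the prediction is just an average over stored samples, there is no fitted model to retrain when new observations are appended to `train_X` and `train_y`.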
The techniques used for predictions in this study are machine learning algorithms using Python programming language version 3.9, the latest Python version with the updated list of packages. The reason for using Python is that this programming language provides many available packages in different programming areas, especially in machine learning fields, which use CSV datasets for data preprocessing and predictions.
The process starts with preparing the dataset and loading the CSV file into Python. The dataset then passes through several preparation steps before reaching the machine learning models. These steps include managing null values and converting the data to a numerical type. In addition, each column is checked for outliers; if any are found, the data are normalized with a min-max scaler, which scales the values in the dataset into the same range.
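The preparation steps described above can be sketched in plain Python as follows. The paper does not say how null values were handled, so mean imputation is an assumption here, and the helper names and sample values are hypothetical.

```python
def to_float(value):
    """Convert a raw CSV field to float; return None for empty or non-numeric fields."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def fill_missing(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max_scale(column):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical raw evaporation column as read from a CSV file
raw_evaporation = ["4.0", "", "18.4", "0", "NA", "4.8"]
numeric = [to_float(v) for v in raw_evaporation]
filled = fill_missing(numeric)
scaled = min_max_scale(filled)
```

Min-max scaling maps each value x to (x - min) / (max - min), so a single extreme value compresses the rest of the column toward zero; this is worth keeping in mind for a skewed variable such as daily rainfall.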
KXDF is a hybrid model consisting of KNN, XGBoost, Decision Tree, and RF. These models work together as the base level (level-0) to make predictions. Each model makes its prediction separately; at the stage called the stacking regressor, the outcomes of all predictions are combined and passed to the final model (level-1), for which RF is used. The output of the stacking regressor is effectively a table with four columns, one for each of the predictions of KNN, DCT, XGB, and RF. The final RF then runs on this table and produces the final output prediction.
KXDF was chosen as the hybrid model in this work because the dataset values are not linearly distributed, so non-linear models with a high capability for further numerical calculation were selected. KNN and support vector machines can work on linear and non-linear data distributions, and they perform better on non-linear data. The same idea applies to the DCT and RF models. Moreover, the chosen models are less sensitive to outliers and unnormalized data, especially DCT, XGB, and RF. As a result, normalizing the data and removing outliers, steps that make the modeling process slow and time-consuming, are not required.
The method had some limitations. For instance, training took a very long time since four different machine learning techniques were combined. KNN struggled to train on the large dataset but, with the help of the other algorithms, it could proceed successfully. XGB was extremely sensitive to outliers. Furthermore, the large number of trees in the RF model made the algorithm slow, although RF is generally fast to train. Finally, the calculation for the final prediction in the decision tree was complex.
The proposed hybrid KXDF model is built in two stages. In the first stage, training, the network learns input-output patterns from a training set. In the second stage, testing, the performance of the network is assessed on patterns that were not seen during training. A hybrid model integrating four machine learning algorithms, namely KNN, XGB, DCT, and RF, has been successfully developed for forecasting daily rainfall at Sydney Airport, Australia. Two conventional machine learning algorithms, RF and artificial neural networks, have been implemented for comparison with the accuracy of the proposed hybrid KXDF model.
The primary model used is the stacking model, a type of hybrid model that combines different models in two levels: level-0, the base models, and level-1, the meta-model. On a classification or regression problem, the hybrid model can combine the capabilities of several highly effective models to produce results that are superior to those of any individual model in the ensemble. Using a hybrid model therefore increases the performance of the proposed model. The main reason is that the data are predicted at different levels by different models, one after another. If one of the models does not perform well or yields a low prediction score, the following model works on its output and improves the score. The first level consists of the base models, which are fit on the training data and whose predictions are collected. The second level is the meta-model, which learns the most effective way to combine the predictions of the base models and acts as the ultimate decider for the forecast.
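A two-level stacking arrangement of this kind might be sketched with scikit-learn's `StackingRegressor` as shown below. This is not the authors' code. Several choices are assumptions: scikit-learn's `GradientBoostingRegressor` stands in for XGBoost so that the example needs only one library, RF is used as the level-1 model (the text names RF in one place and a decision tree in another), all hyperparameter values are placeholders, and the data are synthetic rather than the Sydney Airport records.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (NOT the study's dataset): 5 inputs, non-linear target
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 5))
y = 10.0 * X[:, 0] * X[:, 1] + np.sin(6.0 * X[:, 2]) + rng.normal(0.0, 0.1, size=300)

# Level-0: four base models, each trained on the same inputs
base_models = [
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("gbt", GradientBoostingRegressor(random_state=0)),  # stand-in for XGBoost
    ("dct", DecisionTreeRegressor(max_depth=6, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# Level-1: the meta-model learns how to combine the base predictions
stack = StackingRegressor(
    estimators=base_models,
    final_estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    cv=5,  # meta-model is trained on out-of-fold base predictions
)

stack.fit(X[:180], y[:180])          # first 60% of rows for training
test_pred = stack.predict(X[180:])   # remaining 40% for testing

# The level-1 input: one column of predictions per base model
meta_features = stack.transform(X[180:])
print(meta_features.shape)
```

Training the meta-model on out-of-fold predictions (the `cv` argument) is what keeps the level-1 model from simply trusting whichever base model has memorized the training set; near-perfect training scores in a stacked model are worth checking against this kind of leakage.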
In the hybrid model, the first algorithm in the base model list is KNN. It was chosen because it requires almost no hyperparameter tuning; the only hyperparameter used is K, the number of neighbors, which should be an odd number. Also, because there is no explicit training step, predictions adjust as new data are added to the dataset without retraining a new model. XGB is the second base model; it performs faster and can work with a dataset of high dimensionality. The dataset used in this study contains around three thousand samples. The main hyperparameters used with XGB are the type of the kernel, which is linear and works on finding the best estimators in a linear solution, and gamma, set to two, which works on the distance of separators of the data points to predict rainfall values with a high score and as few errors as possible. The third base model is the Decision Tree, which predicts values using n estimators and a given depth of tree branches. Increasing the number of splits makes the prediction more accurate with minimum error. It also performs well even when data accuracy is low, which means that if the previous model did not perform well, the DCT tuned the data and gave a better result. The last of the level-0 base models is Random Forest, which gets all the predicted data as input from the previous models and predicts the output. The RF model works on arranging the inputs, and it can reduce overfitting through a regularization technique that can capture the noisy data.
An extremely popular supervised machine learning approach called the RF algorithm is utilized to solve classification and regression issues. A forest is made up of numerous different species of trees, and the forest will be more vigorous the more trees there are. Similar to this, the number of trees in an RF algorithm enhances the algorithm's accuracy and ability to solve problems.
The final level of the hybrid algorithm (level-1) is the meta-model, which makes the final decision for the predicted result. This model takes all the previous predictions produced by the base models and predicts new results. At this level, a decision tree is used because the DCT achieved a good score among the base models as well, and it can provide a more in-depth division of the tree to give better predictions. In the current study, the data were divided into two subsets: 60% for training and 40% for testing. The optimized weights and their sensitivity to rainfall were assessed. Some input variables were removed from the model since they were not correlated with the output.
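The 60/40 division can be sketched as below. The text does not state whether the split was chronological or random, so the function offers both; the name `split_60_40` and the default of keeping the original order are assumptions.

```python
import random

def split_60_40(rows, shuffle=False, seed=0):
    """Split rows into 60% training and 40% testing subsets."""
    rows = list(rows)
    if shuffle:
        # Random split; the default keeps the original (e.g. chronological) order
        random.Random(seed).shuffle(rows)
    cut = (len(rows) * 6) // 10  # integer arithmetic avoids float rounding
    return rows[:cut], rows[cut:]

# Hypothetical index of about 3,000 daily records
days = list(range(3000))
train, test = split_60_40(days)
print(len(train), len(test))
```

For daily weather records, an ordered split keeps every test day later than every training day, which avoids evaluating the model on days that sit between two training days.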
Statistical measurements
Root mean square error
Coefficient of determination (R2)
Mean absolute error (MAE)
Normalized root mean square error
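The four measures listed above can be computed as in the following sketch, which uses their standard textbook definitions. Two points are assumptions because the formulas are not reproduced here: R2 is taken as one minus the ratio of residual to total sum of squares, and NRMSE is normalized by the mean of the observations (other conventions divide by the range or the standard deviation).

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def nrmse(obs, pred):
    """RMSE divided by the mean of the observations (assumed normalization)."""
    return rmse(obs, pred) / (sum(obs) / len(obs))

# Hypothetical observed and predicted daily rainfall in mm
observed = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0, 6.0]

scores = {
    "RMSE": rmse(observed, predicted),
    "R2": r2(observed, predicted),
    "MAE": mae(observed, predicted),
    "NRMSE": nrmse(observed, predicted),
}
print(scores)
```

RMSE and MAE are in millimetres, while R2 and NRMSE are dimensionless, which makes the latter two easier to compare across stations with different rainfall regimes.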
RESULTS AND DISCUSSION
Dataset | RMSE | R2 | MAE | NRMSE |
---|---|---|---|---|
Training | 0.124 | 0.999 | 0.007 | 0.04 |
Testing | 1.246 | 0.991 | 0.109 | 0.339 |
Based on Fig. 5, the model is very accurate on the training set, and a linear relationship can be seen between the observed and predicted data. However, there are minor errors on the testing set, so for some extreme events the model was not able to predict rainfall accurately. The dataset has very few outliers, which is why some of the data points fall out of range; this does not affect the performance of the proposed model, since the dataset contains around 3,000 data points and only a few are detected as outliers. Also, the chosen models are not sensitive to outliers and unnormalized data because they can work on non-linear data. Overall, the performance on both the training and testing datasets was reliable and acceptable. The results of the hybrid model developed in this study are very important for managing water resources in Australia, since heavy and irregular rainfall can destroy farmers' crops, farms, and materials. The model's forecasts are therefore crucial for giving early warnings, so that the risks of heavy rainfall for humans, animals, and plants can be reduced and water resources can be managed effectively. Many different parameters can be used for forecasting rainfall, including weather conditions such as temperature, humidity, and evaporation. Therefore, in the current study, humidity, temperature, evaporation, and rainfall were used to develop the KXDF hybrid model. However, the most common model for forecasting weather is the Numerical Weather Prediction (NWP) model. The model used in this work was proposed with careful consideration of running time, central processing unit (CPU) utilization, and memory utilization. The proposed machine learning model runs on a CPU.
The models were also tested on their ability to give high accuracy. In addition, because the hybrid model combines several different machine learning models, its components should be chosen based on the type of dataset and the distribution of the data. The data were found to be largely non-linear, so the models chosen were those most likely to fit this type of dataset. Sensitivity analysis was performed, and it showed that the rainfall prediction was very sensitive to temperature, humidity, and evaporation.
The results of the developed model can help decision-makers prepare for unwanted events such as flooding in Sydney. Rainfall can be predicted in two ways: short-term and long-term. Most of the time, short-term predictions are more accurate. One of the most complex issues researchers face is that they cannot build an accurate model for long-term prediction. Heavy precipitation is one of the major issues that scientists struggle with, since it is related to the economy and social security. Rainfall extremes lead to many disasters; for example, drought, water shortages, and floods happen yearly in many countries around the world. Recording rainfall is very important for developed countries such as Australia, since they rely on agricultural production and export many agricultural products to many countries in the world. Therefore, the authorities of Sydney could consider the developed model in order to manage water resources effectively. Furthermore, a potential source of error is that when a single machine learning model is used for predicting rainfall, the accuracy will be very low. As a consequence of this error, policymakers will plan wrongly for the water resources sector. Finally, even though the results of the current study were obtained for a case study of Sydney, the developed hybrid model could be generalized and applied to other regions around the world since it consists of four different machine learning models.
CONCLUSION
Forecasting rainfall is important because rainfall can lead to a range of negative effects, such as the destruction of crops and farms and property damage. A suitable prediction model is essential for early warnings that can reduce risks to life and property while also improving the management of agricultural farms. The main objective of this study was to develop a hybrid model based on the integration of four machine learning algorithms (KNN, XGB, DCT, and RF) for forecasting daily rainfall. The proposed hybrid KXDF model was capable of predicting daily rainfall at Sydney Airport accurately. A sensitivity analysis was conducted, and the results indicated that the rainfall prediction was highly sensitive to temperature, humidity, and evaporation. One limitation is that training took a very long time because four distinct machine learning techniques were used in combination. Although KNN had trouble training on the large dataset, it was able to do so with the assistance of the other methods. XGB was highly susceptible to anomalies. In addition, the RF model's high tree count made the process slow, although the technique generally trains quickly. Finally, the decision tree's prediction required a complicated computation. This research has contributed to the field of water resources engineering by drawing the attention of academics, teachers, and policymakers to forecast models. In future studies, it is suggested that the proposed hybrid KXDF model be applied in diverse climate zones; it can be generalized to other regions around the world since it consists of four different machine learning models. Due to the intricate weather pattern, the traditional machine learning model was unable to accurately estimate rainfall in humid subtropical regions.
Thus, what makes this work new is that it is the first time a hybrid model combining KNN, XGB, DCT, and RF has been used to forecast daily rainfall. The water sectors could put the created model to practical use to manage water resources in a sustainable manner. Finally, further studies might be needed to use graphics processing units (GPU) for time-series forecasting models for predicting rainfall data in other regions around the world.
ACKNOWLEDGEMENT
The authors would like to thank the Australian Government for providing data through the Bureau of Meteorology.
AUTHOR CONTRIBUTIONS
S.D.L. wrote the original draft, performed the methodology, and did analysis; M.H.M. wrote the review and edited, performed the methodology, and did analysis.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.