ABSTRACT
This study aims to identify an efficient and accurate model for long-term rainfall prediction by comparing tree-based machine learning approaches with the global prediction model of the European Centre for Medium-Range Weather Forecasts (ECMWF). Two tree-based machine learning algorithms, light gradient boosting (LGB) and regression tree (RT), are used and compared with the global model. Local meteorological parameters (relative humidity, dew point temperature, minimum temperature, maximum temperature, wind speed, convective available potential energy, and sunshine) and a large-scale climate variable (sea surface temperature) were used as inputs during model development. The database was first preprocessed and then partitioned into a training set and a testing set. The GridSearchCV technique was used to tune the hyperparameters of the models. At the daily and monthly time scales, respectively, LGB exhibits the strongest performance, with the highest coefficient of determination (R2 = 0.991; 0.996), lowest root mean squared error (RMSE = 1.411 mm; 0.383 mm), lowest mean squared error (MSE = 1.992; 0.146), and lowest mean absolute error (MAE = 0.899 mm; 0.302 mm). At both time scales, the LGB model shows significantly higher accuracy than both RT and ECMWF. Relative humidity is the most influential meteorological parameter for rainfall prediction, as identified by random forest (RF) feature importance with a value of 0.4129. The suggested models will be incorporated into an agricultural decision support system for Ethiopia that is still in development.
HIGHLIGHTS
The work is new and novel.
It compares machine learning models with a global model that have not previously been compared for rainfall prediction.
The study uses input meteorological parameters that have not previously been used.
INTRODUCTION
Problems related to climate change and global warming are worldwide concerns. One important aspect of weather change is rainfall, which is the most important meteorological parameter throughout Africa. This is because rain-fed agriculture makes up the majority of the continent's economy and rainfall has a strong impact on crop production in many places. Unusually high or low rainfall can cause floods or droughts, respectively, with catastrophic effects on the environment and human welfare (Diro et al. 2008; Hooshyaripor et al. 2020). In many areas of the continent, the current gauge network is unable to give timely or sufficient information on rainfall patterns because of uneven distribution, missing data, and sparse observations. This has led to the use of global numerical models such as that of the European Centre for Medium-Range Weather Forecasts (ECMWF) (Dinku et al. 2007; Diro et al. 2009; Koutsouris et al. 2016; Olaniyan et al. 2018). However, these models were found to perform less well in capturing the observed long-term trends and to be unable to predict extreme rainfall events in many regions of Africa (Koutsouris et al. 2016; Olaniyan et al. 2018; Lemma et al. 2019). Additionally, the ECMWF model shows local biases in its rainfall estimates over Ethiopia (Diro et al. 2009; Gleixner et al. 2020). These studies highlighted the shortcomings of this global model in the African region. Several neural network-based models have been developed in the African region to fill this gap (Endalie et al. 2022; Ojo & Ogunjo 2022; Abebe & Endalie 2023). The results of these investigations revealed that, in comparison to the global models, neural network (NN)-based models show great promise in capturing the overall dynamics of rainfall changes. However, this does not mean that NN models always produce more accurate estimates than other machine learning methods in every scenario.
Because NN algorithms require a lot of data to fully utilize their potential, they often overfit small datasets (Piotrowski & Napiorkowski 2013; Kim et al. 2024b).
Recent studies have highlighted the application of machine learning and advanced modeling techniques in environmental prediction and management. For instance, Pandey et al. (2021) applied ensemble machine learning models to predict scour depth and assess sediment dynamics around spur dikes, while Basha et al. (2024) used the InVEST model to evaluate the impact of land use changes on water yield in India. Kim et al. (2024b) modeled surface water temperature dynamics in Arctic lakes using machine learning, Mahdian et al. (2024) analyzed environmental changes in the Anzali Wetland in Iran, focusing on the effects of climate variability on ecosystems, and Endalie et al. (2022) modeled rainfall using machine learning models. These approaches, particularly in hydrology and ecology, can be leveraged for modeling rainfall and other climatic factors in regions like Ethiopia. In recent years, weather estimation, including rainfall estimation, has relied increasingly on tree-based machine learning techniques (Kumar et al. 2023). They are appropriate for problems that are too complex or large for global model approaches. Numerous technological and scientific sectors, including atmospheric weather research, have recently shown great interest in machine learning techniques (Geetha & Nasira 2014), with a focus on modeling nonlinear phenomena such as rainfall. With advances in technology, the introduction of light gradient boosting (LGB hereafter; Microsoft 2016; Ke et al. 2017) and the regression tree model (RT hereafter; Breiman 2017), and the complex dynamics of rainfall, the atmospheric science community has begun to predict rainfall using machine learning techniques other than NN models and the global model, both worldwide and in Africa. For example, several studies (Appiah-Badu et al. 2021; Misra et al. 2021; Ridwan et al. 2021; Zhou et al. 2021; Kim et al. 2022; Monego et al. 2022; Yirga 2023) modeled rainfall using machine learning approaches.
Their findings have demonstrated that machine learning techniques are efficient in modeling rainfall with both large and small amounts of data. Additionally, tree-based algorithms are inexpensive because they require fewer computational resources and less training time than NN algorithms (Ramsundram et al. 2016; Bentéjac et al. 2021; Kim et al. 2024a). As a result, these machine learning algorithms are relatively simple to optimize compared to NN techniques.
In previous studies conducted in the African region, machine learning methodologies other than tree-based algorithms were primarily used for modeling rainfall, with a focus on short-term data and small-scale, local meteorological input parameters. Even the study by Endalie et al. (2022) in Ethiopia, which modeled rainfall with limited input parameters and ignored large-scale climate indices, stressed the importance of global climatic indices such as sea surface temperature (SST) in its motivation. The focus on short-term rainfall estimation, the small number of local meteorological parameters, the scarcity of studies using large-scale climate indices, and the lower accuracy of existing models motivate us to look for an alternative method for modeling rainfall variability that uses large-scale climate indices such as SST and local meteorological indices, including convective available potential energy (CAPE), as input parameters. Given the topicality of the issue, a review of previous work makes clear that not enough research has been done on evaluating different machine learning models for long-term rainfall estimation. It has also been observed that no studies have compared tree-based machine learning and global models for rainfall estimation. This gap has motivated the current research, which aims to explore the development of a new feasible model using local input meteorological parameters and large-scale climatic indices for rainfall estimation and modeling.
Consequently, this study tests the effectiveness of tree-based algorithms in rainfall estimation using the LGB and RT models and compares them with the existing global numerical weather prediction model (ECMWF hereafter) to identify a reliable model for accurate rainfall prediction.
MATERIALS AND METHODS
Study area and data
In Bahir Dar, the agricultural sector can be characterized as segmented, highly dependent on rainfall, and lacking permanent watercourses. This is also true for horticulture, agro-industrial processing, urban agriculture, manufacturing, and a variety of service businesses. On the other hand, the main economic activity in Bahir Dar is tourism, which attracts large numbers of people. Furthermore, the diverse range of meteorological conditions in Bahir Dar makes rainfall estimation a crucial matter. Combined with its temporal estimation, it forms a crucial component for making informed decisions and mitigating potential hazards associated with abrupt spikes in rainfall and their wide spread across the impacted region.
Daily local meteorological data for the years 1993 to 2022 were collected from the Bahir Dar meteorological institute, located in Bahir Dar (latitude: 11.610° N, longitude: 37.380° E, altitude: 1,800 m). The collected meteorological parameters include maximum temperature (Tmax), minimum temperature (Tmin), relative humidity (RH), wind speed (Wspeed), sunshine (Sunshine), dew point temperature (Tdew), and rainfall (R). CAPE, a further local meteorological parameter, and SST, a large-scale climate index, were obtained as hourly datasets for 1993 to 2022 from the ECMWF fifth-generation reanalysis, ERA5. ERA5 provides hourly data at a 0.25° spatial resolution and is accessible via the Copernicus Climate Data Store (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-singlelevels?tab=form). The hourly datasets were converted into a daily format using the pandas library in Python. A summary of the meteorological data and large-scale climate variables is given in Table 1. The methodology used for the study is shown in Figure 2.
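The hourly-to-daily conversion described above can be sketched as follows. This is a minimal illustration only: the study used pandas, whereas this sketch uses only the Python standard library, and the function name `hourly_to_daily`, the `how` switch, and the sample values are hypothetical. It assumes that accumulated quantities such as rainfall are summed over the day, while state variables such as CAPE or SST would be averaged; the source does not state which aggregation was applied to each variable.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_to_daily(records, how="sum"):
    """Aggregate (timestamp, value) pairs into one value per calendar day.

    how="sum"  -> daily total (assumed suitable for hourly rainfall)
    how="mean" -> daily average (assumed suitable for CAPE or SST)
    """
    buckets = defaultdict(list)
    for timestamp, value in records:
        buckets[timestamp.date()].append(value)

    daily = {}
    for day, values in sorted(buckets.items()):
        if how == "sum":
            daily[day] = sum(values)
        else:
            daily[day] = sum(values) / len(values)
    return daily


# Made-up example: 48 hourly rainfall values of 0.5 mm each.
start = datetime(2017, 1, 1)
hourly_rain = [(start + timedelta(hours=h), 0.5) for h in range(48)]
daily_rain = hourly_to_daily(hourly_rain, how="sum")
```

With pandas, the same step is usually written as a resampling of a time-indexed series to daily frequency followed by a sum or mean.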
Description of data in the study
Parameter | Symbol used in study | Units | Description |
---|---|---|---|
Relative humidity | RH | Percentage (%) | Ratio of the actual water vapor pressure to the saturation vapor pressure |
Dew point temperature | Tdew | Degree Celsius | The temperature at which the air becomes saturated with the amount of moisture present |
Maximum temperature | Tmax | Degree Celsius | The highest temperature measured in a certain time period |
Minimum temperature | Tmin | Degree Celsius | The lowest temperature recorded during a specified period of time |
Wind speed | Wspeed | m/s | The speed of wind |
Convective available potential energy | CAPE | J/kg | Estimate of the occurrence of convective rainfall situations |
Sunshine | Sunshine | Hour | Hours of sunshine in a day |
Sea surface temperature | SST | Degree Celsius | The temperature of the water near the surface of the ocean |
Rainfall | Rainfall | Millimeter | The amount of water that precipitates in liquid form over a given region and time period; often measured in millimeters or inches |
Models and technique
RT model
The decision tree algorithm is used for both classification and regression. The purpose of RTs is to predict continuous values, and the sum of squared differences between predicted and actual values is used to determine how accurate the predictions are. The RT model repeatedly applies binary partitions to the parameters of the dataset at each level until the prescribed maximum depth is reached (Rokach & Maimon 2005; Liang et al. 2022). Because the target variable in this study (rainfall) is continuous, the problem is a regression problem, and an RT is therefore used.
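The binary partitioning step can be illustrated with a small sketch that searches one input variable for the threshold minimizing the sum of squared errors (SSE) of the two resulting groups. This is an assumption-laden illustration of the general regression tree idea, not the authors' implementation; the helper names `sse` and `best_split` and the sample values are hypothetical.

```python
def sse(values):
    """Sum of squared differences between each value and the group mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) for the best binary split on x.

    If no split improves on the unsplit data, the threshold is None.
    """
    pairs = sorted(zip(x, y))
    best = (None, sse(list(y)))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical x values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Made-up values: rainfall jumps once relative humidity exceeds about 60%.
rh = [30, 35, 40, 80, 85, 90]
rain = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
threshold, err = best_split(rh, rain)
```

A full tree repeats this search over every input variable and recursively within each resulting group until a stopping rule such as the maximum depth is met.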
LGB model
LGB improves on conventional gradient boosting by introducing exclusive feature bundling (EFB) and gradient-based one-side sampling (GOSS) to reduce the computational cost. More precisely, EFB simplifies the feature space by bundling mutually exclusive features, whereas GOSS down-samples the training instances by keeping those with large gradients and randomly dropping a portion of those with small gradients (Tang et al. 2021).
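A rough sketch of the GOSS idea follows, under the usual description of the method: keep the fraction of instances with the largest absolute gradients, randomly sample a fraction of the remainder, and up-weight the sampled small-gradient instances by (1 - top_rate) / other_rate so that their contribution is not understated. The function name `goss_sample`, the default rates, and the gradient values are hypothetical; this is not LightGBM's internal code.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance index: weight} for a GOSS-style subsample."""
    n = len(gradients)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]                          # always kept
    rng = random.Random(seed)
    other = rng.sample(order[n_top:], n_other)   # random small-gradient subset

    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in other})
    return weights


# Made-up gradients that increase with the instance index.
grads = [0.01 * i for i in range(100)]
weights = goss_sample(grads)
```

In this example 30 of the 100 instances are retained: the 20 with the largest gradients at weight 1 and 10 randomly chosen others at weight 8.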
ECMWF model–global model
The ECMWF global atmospheric reanalysis product is created by fusing observational data from satellites and ground observations with a numerical weather prediction model. The data utilized in the present study come from the ERA5 ECMWF reanalysis product (Gleixner et al. 2020), which provides hourly meteorological conditions back to 1979. This version of the ECMWF reanalysis is based on the Integrated Forecasting System (IFS) and includes a four-dimensional variational analysis (4D-Var). ERA5 data are available at higher spatial and temporal resolution than earlier versions, on a 0.25° grid with hourly intervals. The number of observational datasets that serve as input for the assimilation system was also increased, and a major difference is the consideration of satellite estimates of rainfall in ERA5. For this study, an ERA5 reanalysis rainfall dataset (0.25° × 0.25°, hourly), spanning January 2017 to December 2022, was downloaded from https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-singlelevels?tab=form, and the hourly precipitation was then converted to a daily scale.
Development of the models
A 30-year (1993–2022) daily record of meteorological data was used to train and test the models. The data are typically divided into two datasets. Although no universal guideline exists for preparing training and testing datasets in spatial and time series estimation, it is recommended to divide the complete dataset into two sub-datasets for training (80%) and testing (20%) (Khosravi et al. 2018).
In this study, the data from January 1993 to December 2016 (80%) were used as the training dataset for model building, while the data from January 2017 to December 2022 (20%) served as the testing dataset for model evaluation. The experiments consisted of developing models for rainfall estimation over Bahir Dar, Ethiopia, using two different machine learning techniques, RT and LGB.
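The chronological split described above can be checked with a short sketch. The helper `chronological_split` is a hypothetical name, and the sketch works on calendar days only, assuming one record per day with no gaps; the real dataset may contain missing days.

```python
from datetime import date, timedelta


def chronological_split(days, cutoff):
    """Split an ordered list of dates into (before cutoff, cutoff onward)."""
    train = [d for d in days if d < cutoff]
    test = [d for d in days if d >= cutoff]
    return train, test


# One entry per calendar day from 1 January 1993 to 31 December 2022.
first, last = date(1993, 1, 1), date(2022, 12, 31)
days = [first + timedelta(days=k) for k in range((last - first).days + 1)]

train_days, test_days = chronological_split(days, cutoff=date(2017, 1, 1))
train_share = len(train_days) / len(days)
```

Under these assumptions the training period covers 8,766 days and the testing period 2,191 days, which corresponds to the 80%/20% partition. Splitting by date rather than at random keeps the test years strictly after the training years.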
Model hyperparameter adjustment
Hyperparameters are a set of parameters in machine learning algorithms that need to be adjusted specifically for a given learning problem because they cannot be learned from the data. The ideal values of hyperparameters vary depending on the data and the problem; they are often found by testing different combinations and evaluating each model's performance. This parameter optimization selects the most suitable set of parameters (Wu et al. 2019) through a trial-and-error process until the best estimation score is obtained. The model's performance largely depends on the optimization of hyperparameters (Diez-Sierra & Del Jesus 2020). The study by Ridwan et al. (2021) found that without tuning, the models (boosting and decision tree regression) performed poorly, but when tuned, their accuracy noticeably increased. The optimization processes in this study show that tuning tends to reduce the model errors and that the relationship between hyperparameters and model accuracy is not linear for all ML models. Careful consideration is therefore required when tuning the hyperparameters: increasing hyperparameter values does not necessarily improve ML accuracy, and both accuracy and computational cost need to be considered in model development. The tuned hyperparameters are the maximum tree depth (max depth), the number of trees used in ensemble learning (n estimators), the fraction of samples used to fit individual base learners (subsample), the learning rate that scales each gradient step (learning rate), and the maximum number of leaves in each weak learner (num leaves). The search ranges and the selected optimum hyperparameters are displayed in Table 2.
Optimal hyperparameters of the RT and LGB models
Model | Selected hyperparameters | Range of search |
---|---|---|
Regression tree model | max_depth = 18; min_sample_split = 8; min_sample_split_leaf = 4 | max_depth: [4,5,6,8,12,14,16,18,20]; min_sample_split: [2,4,5,6,8,10]; min_sample_split_leaf: [2,4,6] |
Light gradient boosting model | num_leaves = 160; learning rate = 0.0988; max_depth = 48; n estimators = 160 | num_leaves = 100–180; learning rate = 0.03–0.1; max_depth = 18–54; n estimators = 120–200 |
A GridSearchCV method was implemented to optimize the hyperparameters and identify the combination that yielded the most accurate results. GridSearchCV is a machine learning technique that uses a methodical search process to determine the best hyperparameters for a given algorithm. Because it employs cross-validation to evaluate model performance for every possible set of hyperparameters in the grid, it is thorough and effective (Kartini et al. 2021). Cross-validation evaluates models on several subsets of the training data in order to prevent overfitting. The hyperparameters were adjusted using GridSearchCV in Python, and the selected values were used to train the models and produce the predictions.
To enhance prediction accuracy and prevent overfitting, we employed GridSearchCV for hyperparameter tuning with k-fold cross-validation as the evaluation method. GridSearchCV exhaustively searches a specified hyperparameter grid, trying all possible combinations to find the set that yields the best performance on a chosen evaluation metric. This strategy helps the model generalize to unseen data by systematically evaluating each combination across various subsets of the training data. K-fold cross-validation is a widely used technique that mitigates the risk of overfitting by partitioning the training data into k subsets (or ‘folds’). For each combination of hyperparameters in the grid, the model is trained on k − 1 folds and evaluated on the remaining fold, and this process is repeated until every fold has been used for validation. The performance is then averaged over all folds, which provides a more reliable estimate of how the model will perform on unseen data.
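The search procedure described above can be sketched in plain Python. This is a simplified re-implementation of the idea, not scikit-learn's GridSearchCV: the names `k_fold_indices`, `grid_search_cv`, and `make_knn` are hypothetical, the folds are contiguous and unshuffled (the source does not state k or the fold strategy), and a toy nearest-neighbour regressor stands in for the RT and LGB models.

```python
from itertools import product


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) for k contiguous, non-overlapping folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def grid_search_cv(make_model, param_grid, x, y, k=5):
    """Try every combination in param_grid; return the one with lowest mean CV MSE."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train, val in k_fold_indices(len(x), k):
            predict = make_model([x[i] for i in train], [y[i] for i in train], **params)
            scores.append(mse([y[i] for i in val], [predict(x[i]) for i in val]))
        mean_score = sum(scores) / len(scores)  # average over all folds
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def make_knn(x_train, y_train, n_neighbors):
    """Toy model: predict the mean target of the n_neighbors closest points."""
    def predict(x0):
        nearest = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x0))[:n_neighbors]
        return sum(t for _, t in nearest) / len(nearest)
    return predict


# Made-up data: a noise-free linear relationship.
x = [float(i) for i in range(40)]
y = [2.0 * xi for xi in x]
best_params, best_score = grid_search_cv(make_knn, {"n_neighbors": [1, 2, 10]}, x, y, k=5)
```

The cost grows with the product of the grid sizes times k, which is why the search ranges in a grid are usually kept narrow.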
RESULTS AND DISCUSSION
The correlation of rainfall and input parameters
This section starts by presenting the correlation of the input data from 1993 to 2016 that are used for model development. The model is intended to be fed with as much data as possible to facilitate the identification and learning of meaningful patterns. Choosing the input data is one of the most essential phases in developing a predictive model and has a big impact on model performance (Ghorbani et al. 2016). The data obtained may contain many attributes or variables, which may or may not be related to the dependent variable. Therefore, for the most accurate analysis of the dependent variable, only attributes related to it should be selected as model inputs. This study selected meteorological data for rainfall estimation based on Pearson's correlation. The correlation matrix between the input and the output data can be used to confirm these relationships, and a heat map makes it simple to find the features that are most associated with the target variable. To describe the meteorological variables that correlate with rainfall and are relevant for rainfall estimation, a rectangular correlation heat map was used (Salaeh et al. 2022), which shows the Pearson correlation among the variables presented.
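For reference, the Pearson correlation underlying such a heat map can be computed as in the sketch below. The helper names `pearson_r` and `correlation_matrix` are hypothetical, and the four sample values per variable are invented purely for illustration; they are not the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Invented illustrative values only.
data = {
    "RH": [40.0, 55.0, 70.0, 85.0],
    "Sunshine": [10.0, 8.0, 5.0, 2.0],
    "Rainfall": [0.0, 2.0, 6.0, 12.0],
}
corr = correlation_matrix(data)
```

Each cell of the heat map corresponds to one entry of this matrix, with the row or column for rainfall showing how strongly each input varies with the target.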
Several studies have confirmed the importance of RH in predicting rainfall. In tropical and subtropical regions, high humidity is closely linked to cloud formation and the likelihood of precipitation. For instance, Taye et al. (2021) demonstrated that RH is a strong predictor of rainfall in East Africa, particularly in regions like Ethiopia, where the main rainy season (kiremt) is characterized by high moisture content in the atmosphere. Similarly, Seleshi & Zanke (2004) highlighted the critical role of atmospheric moisture, often measured by RH, in determining the onset and intensity of the rainy seasons in Ethiopia.
A significant positive relationship exists between daily minimum temperature and rainfall, whereas a less significant relationship is observed between maximum temperature and rainfall. It is assumed that the correlation between rainfall and temperature changes over the months. According to the study by Cong & Brady (2012), temperature and rainfall were positively correlated during January and May but negatively correlated in July. Moreover, little is known about the correlation between these three variables (minimum temperature, maximum temperature, and rainfall) in Ethiopia, particularly in this study area; the present study shows that such a correlation exists. CAPE, a crucial indicator of the meteorological conditions needed for convective precipitation events to occur (Seeley & Romps 2015), correlates weakly and positively with rainfall and is identified as one of the important correlated variables. This is consistent with previous studies showing that CAPE is an important indicator of extreme precipitation in the eastern US (Lepore et al. 2015; Gizaw et al. 2021). For the large-scale climate indices, the correlation between rainfall and SST was weakly negative (r = −0.015). Positive SST anomalies result in reduced rainfall or drought (El Niño), whereas negative SST anomalies result in high rainfall (La Niña) (Kirtphaiboon et al. 2014). Another weakly correlated variable is sunshine duration, an important indicator of the amount of solar radiation received in a region, with r = −0.23 with rainfall.
Relative feature importance for rainfall estimation
Variable selection (also called feature or attribute selection) in machine learning, that is, identifying and selecting a subset of relevant features from a large set of variables, is a crucial step for improving a model's performance. Feature importance analysis is a useful procedure in the machine learning community, as it can guide model development by focusing on important variables (Zhou et al. 2021). In this work, the random forest (RF) technique was employed to determine the importance of each feature for rainfall estimation, as supported by recent studies on environmental data modeling that underscore RF's effectiveness in quantifying variable impacts (Kim et al. 2024a; Mahdian et al. 2024). The RF model was built and trained on daily data, and the significance of the variables was determined. RF is used to assess how modifying a variable affects a model's estimation capacity and to quantify the relative importance of the variables in that model.
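The idea of assessing importance by modifying a variable and observing the change in estimation error can be sketched with permutation importance. Note that this is a swapped-in, model-agnostic technique used here for illustration: the source does not state whether its RF scores are impurity-based or permutation-based, and the names `permutation_importance`, `rows`, and `predict` as well as the toy data are hypothetical.

```python
import random


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_features, seed=0):
    """Increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(mse(y, [predict(r) for r in shuffled]) - baseline)
    return importances


# Toy setup: the target depends on feature 0 only; feature 1 is irrelevant.
rows = [(float(i), float(i % 3)) for i in range(30)]
y = [2.0 * r[0] for r in rows]


def predict(row):
    """Stand-in for a fitted model (here it knows the true relationship)."""
    return 2.0 * row[0]


imp = permutation_importance(predict, rows, y, n_features=2)
```

A feature the model relies on produces a large error increase when shuffled, while an unused feature produces none.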
Performance of models for temporal rainfall variation
This section examines the observed rainfall variability in the test data over the region and shows how reliably the developed models reproduce the temporal variability of rainfall. The section also presents an intercomparison of the LGB, RT, and ECMWF models.
Estimating daily rainfall
Scatter plot of estimated versus measured daily rainfall values for the testing phase (marked black dot) with the linear fit (blue line) using the following: (a) LGB model; (b) RT model; and (c) ECMWF model.
The panels of Figure 5 show the LGB, RT, and ECMWF models from left to right. The R2 scores for the LGB, RT, and ECMWF models are 0.991, 0.874, and 0.778, respectively. The scatter plots show that the predictions of all three rainfall models agree with the observed values, but to different degrees. The LGB model agrees the most. The RT model agrees better than the ECMWF model but less well than LGB. The LGB model is therefore a strong candidate for predicting rainfall at the daily time scale.
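The four statistics used to compare the models can be computed as in the sketch below. The function name `regression_metrics` and the sample values are hypothetical, and R2 is computed here as one minus the ratio of residual to total sum of squares; the source does not state which definition of R2 was used.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return R2, MSE, RMSE, and MAE for paired observed/predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,             # units: mm^2 for rainfall in mm
        "RMSE": sqrt(mse),      # units: mm
        "MAE": sum(abs(e) for e in errors) / n,  # units: mm
    }


# Made-up observed and predicted daily rainfall (mm).
obs = [0.0, 2.0, 4.0, 6.0]
pred = [1.0, 2.0, 3.0, 6.0]
m = regression_metrics(obs, pred)
```

Because RMSE is the square root of MSE, the two can be cross-checked against each other: for example, an MSE of 1.992 corresponds to an RMSE of about 1.411.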
Daily statistical values of each model
Statistic | LGB | RT | ECMWF |
---|---|---|---|
R2 | 0.991 | 0.874 | 0.778 |
MSE (mm²) | 1.992 | 53.44 | 62.28 |
RMSE (mm) | 1.411 | 7.31 | 7.892 |
MAE (mm) | 0.899 | 4.27 | 4.306 |
Statistical criteria of three models for the monthly rainfall scale
Monthly rainfall variation | LGB | RT | ECMWF |
---|---|---|---|
R2 | 0.996 | 0.940 | 0.925 |
RMSE (mm) | 0.383 | 1.541 | 1.727 |
MSE (mm²) | 0.146 | 2.375 | 2.982 |
MAE (mm) | 0.302 | 0.838 | 1.166 |
Histogram of error distribution of daily rainfall for the testing phase using the following: (a) LGB model, (b) RT model, and (c) ECMWF model.
Comparison of model performance for the daily trend of rainfall variation
Time series plot of daily rainfall variation using the LGB, RT, and ECMWF models.
However, once larger rainfall accumulation amounts are attained, the disagreement among the model's ensemble members could escalate, making it less accurate in predicting extreme precipitation. This result is similar to that of other studies (Nguyen et al. 2018; Olaniyan et al. 2018). These results indicate that the LGB and RT models are generally better at predicting rainfall in less rainy seasons than in wet seasons, a finding also supported by Zhou et al. (2021). The ECMWF model has relatively less skill in capturing the variation of rainfall because it covers a wider area and a longer time span. This model generally runs at a lower resolution, both spatially (fewer forecast points per given area) and temporally (fewer time points receive a forecast). Overall, the LGB and RT models agree better with the observed rainfall than the ECMWF model.
Comparison of model performance for monthly rainfall variation
Scatter plot of estimated versus measured monthly rainfall values for the testing phase (marked black dot) with the linear fit (blue line) using the following: (a) LGB model, (b) RT model, and (c) ECMWF model.
Histogram of error distribution of monthly rainfall for the testing phase using the following: (a) LGB model, (b) RT model, and (c) ECMWF model.
At the monthly scale, the R2 values of the models are larger than at the daily scale. This shows that the estimates of the LGB model agree better with the measured or observed values at the monthly scale than at the daily time scale, as shown by the good fit between predicted and observed rainfall. The rainfall variation estimated by the tree-based machine learning models captures the trend of the observed monthly variation in the rainfall rate. Our model estimates agree closely with the observed rainfall variation, and ECMWF also agrees well with the observed rainfall variation at this time scale.
Across all the performance criteria, the optimal model for rainfall modeling is found to be LGB. This model has the lowest RMSE, MAE, and MSE and the highest R2 compared with the RT and ECMWF models. ECMWF has relatively high RMSE, MAE, and MSE values and a low R2 value. The LGB model has been found to generalize best and to be the most accurate in rainfall modeling.
Our study shows that modeling at the monthly time scale agrees better with the observations than modeling at the daily time scale, as shown by the good fits between the modeled and observed plots. This result is also confirmed by Nguyen et al. (2018).
Comparison of model performance for the monthly trend of rainfall variation
This section shows the average monthly rainfall pattern (trend) and the ability of our models to capture the monthly trends of rainfall over the Bahir Dar sector. Rainfall at Bahir Dar increases gradually from 4.5 mm in May and reaches its maximum of 18.5 mm in July, the month with the highest average rainfall. This has been the general rainfall pattern at Bahir Dar Station, and the maximum rainfall occurs once a year. The rainfall pattern at the Bahir Dar Station can therefore be referred to as a uni-modal rainfall distribution (Elvis et al. 2015). The results from the models and the observed data confirm that stations with a uni-modal rainfall distribution have rainfall periods ranging from May to September, with July and August recording the highest amount of rainfall.
Trend of the average monthly rainfall of years 2017–2022 over Bahir Dar using various models.
The results show that LGB performed best in June, July, August, and September. RT performs well in the rainy months of October, December, June, and May. Both the LGB and RT models are good candidates for rainy-month rainfall prediction. ECMWF grossly underestimates the amount of rainfall in all months, and the ECMWF reanalysis model generally had the largest discrepancies when compared with the other models. This is attributed to the fact that its rainfall estimates are derived using several algorithms and models based on multiple wavelengths. It is essential to note that rainfall estimates derived from the ECMWF model are indirect and are inevitably accompanied by a large degree of variability. For instance, many previous studies have indicated that reanalysis-based models generally have difficulty in representing rainfall in areas with complex topography, in which rainfall is controlled by orography and characterized by high spatiotemporal variability (Sun et al. 2018).
Overall, the outcome of this study suggests that the LGB and RT models have the potential to be suitable for studies of rainfall variability and trends over Bahir Dar. The ability of these models to accurately predict rainfall variability can play a vital role in water resources management in Bahir Dar, since the observed data are generally sparse and riddled with gaps.
CONCLUSION
This study builds two tree-based machine learning models (RT and LGB) for the estimation of rainfall, based on local input meteorological parameters (RH, dew point temperature, minimum temperature, maximum temperature, wind speed, CAPE, and sunshine) as well as a large-scale climate variable (SST). First, the data from NMI are preprocessed, split into a training set and a testing set, and fed into the models. Then, the machine learning models are tuned with GridSearchCV to achieve high accuracy. The tree-based machine learning models achieve high accuracy on the testing set (R2 ≥ 0.874) and largely outperform the broadly used global model, ECMWF (R2 = 0.778), for daily rainfall variation. Similarly, for monthly variation, LGB and RT achieve high accuracy on the testing set (R2 ≥ 0.940) and outperform ECMWF (R2 = 0.925). Compared with RT and ECMWF, LGB achieves significantly higher accuracy at both time scales: the R2 values of LGB, RT, and ECMWF are 0.991, 0.874, and 0.778, respectively, for daily variation and 0.996, 0.940, and 0.925, respectively, for monthly variation. The heat map method shows a strong correlation between rainfall and dew point temperature (Tdew), minimum temperature (Tmin), wind speed (Wspeed), relative humidity (RH), and maximum temperature (Tmax), with correlations in the range r = 0.82–0.94. On the other hand, CAPE is weakly positively correlated and SST and sunshine (Sunshine) are weakly negatively correlated with rainfall; these are also relevant atmospheric features, with correlations in the range r = −0.015 to 0.013. The RF model is used for feature importance analysis, and it identifies RH and dew point temperature (Tdew) as the leading parameters, with RF scores of 0.4129 and 0.2234, respectively.
This feature importance analysis is consistent with common knowledge about factors that influence rainfall, which supports the feasibility of the proposed RF model for feature importance analysis. For future work, using additional datasets and exploring other locations throughout Ethiopia would be a good option for building a stronger model.
ACKNOWLEDGEMENTS
The authors are thankful to the National Meteorology Institute (NMI) of Ethiopia for providing their observational data on the Bahir Dar sector. The authors also acknowledge the Bahir Dar University institutional repository website for archiving the thesis part of our work (http://ir.bdu.edu.et/handle/123456789/15484).
FUNDING
The authors declare that no funds, grants, or other support were received during the preparation of this article.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.