ABSTRACT
Rainfall is one of the most important meteorological phenomena because it supplies water to the Earth's surface, with a significant impact on the daily life of human beings. Understanding its behavior in a semi-arid basin is an important and challenging task for taking advantage of this natural resource, given that water is usually scarce in such regions. Artificial intelligence and machine learning algorithms help to identify rainfall patterns and trends within a region. Multiple linear regression, random forest (RF), support vector machine, and artificial neural network (ANN) algorithms were implemented using daily rainfall data from climatological stations located within the basin, using one station as the target variable and the remaining ones as input variables. The metrics used to evaluate the models were the coefficient of determination (R2), mean absolute error, root mean square error, and the Kling–Gupta efficiency coefficient. The results showed that daily rainfall prediction is better for individual stations than for the basin overall, and that the models obtained by RF and ANN simulate daily rainfall in the basin best.
HIGHLIGHTS
Artificial neural network algorithms were employed using rainfall data within a semi-arid basin.
This work is based on multiple linear regression, random forest (RF), support vector machine, and ANNs.
Daily rainfall prediction is better for individual stations than for the basin overall.
Models obtained by ANN and RF simulate daily rainfall in the basin best.
INTRODUCTION
Rainfall prediction has become an important and challenging task for the scientific community all over the world because of its great impact on the daily life of humans and on sectors such as agriculture, public health services, livestock, industry, electric power generation, and hydrological sciences, among others (Oswal 2019; Yan et al. 2021). Climate change has partially altered the hydrological cycle, and the main effects are evident in rainfall. A further difficulty is that rainfall is not distributed uniformly across regions; in addition, air pollution has been found to be closely related to meteorological conditions in summer and winter periods (Ridwan et al. 2021; Barrera-Animas et al. 2021). The main impacts observed in recent years due to sudden global climate transitions are temperature increases, extreme weather events, and the alteration of ecosystems. This has led researchers to investigate techniques, algorithms, and tools that help to understand the phenomenon and mitigate its effects on the daily lives of living beings (Afrifa et al. 2022, 2023b; Lu 2024). The challenge researchers face is how to predict rainfall: it is a complex phenomenon that involves the analysis of many variables and factors, and measurement can become complicated because it needs specialized equipment, which is sometimes expensive and generally requires a large amount of data storage (Sarasa-Cabezuelo 2022).
One of the advantages of estimating rainfall is that it helps to mitigate natural disasters, such as droughts and floods, and to make the best use of scarce rainfall in semi-arid regions, since in many cases access to groundwater or the construction of reservoirs can be costly, and the water does not have the quality required for human activities (Basha et al. 2020; Liyew & Melese 2021). However, the spatiotemporal behavior of rainfall makes estimation even more difficult, and its distribution is not linear from a mathematical point of view, so other alternatives are explored to understand this phenomenon and define its patterns and trends (Kashiwao et al. 2017; Basha et al. 2020; Gu et al. 2022). Conventional hydrological methods for rainfall estimation are based on numerical and statistical models, but they are computationally very expensive because they involve a considerable amount of data and variables (Kashiwao et al. 2017; Gu et al. 2022).
In recent years, artificial intelligence has been widely used for water resources studies, since it includes several disciplines that help to understand the occurrence and distribution of meteorological variables such as rainfall. Artificial intelligence encompasses data mining, machine learning (ML), and deep learning (DL), which help to identify correlations and trends and to predict rainfall (Liyew & Melese 2021). Different ML algorithms are applied to hydrology problems, although the most prominent are DL algorithms, such as artificial neural networks (ANN), because these try to model the nonlinearity of rainfall (Kashiwao et al. 2017; Endalie et al. 2022; Afrifa et al. 2023a). Other regression algorithms have also been applied for predictions, notably random forest (RF), support vector machine (SVM), K-nearest neighbors (K-NN), multiple linear regression (MLR), and extreme gradient boosting (Hasan et al. 2015; Diez-Sierra & del Jesus 2020; Gu et al. 2022). In addition, in recent years, optimizers for DL models have been used to improve the performance of ANN models in both classification and regression for surface water, groundwater, and atmospheric applications (Roy & Singh 2020; Anaraki et al. 2023; Mohseni & Muskula 2023; Saqr et al. 2023; Abd-Elmaboud et al. 2024).
Rainfall regression helps to determine the amount of water available at the surface and to characterize how it varies in space and time. In terms of regression, Ridwan et al. (2021) worked with rainfall prediction in Kenyir Lake, within Hulu Terengganu province, Malaysia, using decision tree (DT), RF, and ANN regression, Bayesian linear regression, and boosted decision tree regression (BDTR), and evaluated different scenarios to better understand the behavior of water in the area near the lake. The database was created with rainfall data from 10 different stations near the lake. The evaluation metrics were root mean squared error (RMSE), mean absolute error (MAE), R2, relative absolute error, and relative squared error (RSE). They concluded that the best algorithm was BDTR. Likewise, Diez-Sierra & del Jesus (2020) applied ML techniques, such as support vector regression (SVR), K-NN, K-means, RF, and ANN, as well as generalized regression models, for long-term daily rainfall prediction with atmospheric patterns from re-analyzed databases on the island of Tenerife, Spain, showing that ANN models best reproduce daily rainfall, followed by RF and logistic regression. In the same year, Phạm et al. (2020) applied three ML algorithms for daily rainfall prediction in Hoa Binh province, Vietnam: particle swarm optimized adaptive network-based fuzzy inference system (PSOANFIS), SVM, and ANN. Variables such as maximum and minimum temperatures, wind speed, solar radiation, relative humidity, and rainfall were used as inputs. The validation metrics used were MAE, R2, skill score, probability of detection, critical success index, and false alarm ratio (FAR). In that study, the SVM gave the best result, compared with the ANN and PSOANFIS models.
To understand the behavior of rainfall in a semi-arid basin, the main objective of our research is to estimate rainfall with ML techniques using only daily records from the pluviographic and climatological stations located within the study area. We analyze the correlations between the stations and implement regression algorithms (RF, SVM, ANN, and MLR), validating the models with MAE, RMSE, and R2 under three methodologies. The first uses the current station as output and the remaining stations as input. The second repeats this process with databases augmented with Gaussian noise. The third predicts the daily rainfall of the current station from the station with the highest correlation, also with augmented databases. Finally, we compare the three methodologies.
The rest of this paper is organized as follows. The next section briefly describes the study area and presents the dataset used in this work. The methodology section presents the AI techniques implemented and describes the parameters and evaluation metrics of the models. Finally, the results and discussion are detailed in order to draw conclusions.
MATERIALS AND METHODS
Study area
Dataset
The databases were created from the daily rainfall records of the three rain gauges and the climatological station installed within the basin. The data considered were the days with records at the four stations in the years 2019–2022. During the analysis, days with missing records at the measurement stations were identified; these gaps were caused by various technical limitations in data collection. To ensure the quality and consistency of the results, only the days with complete records at all the stations considered were selected. This strategy allowed for an accurate and robust spatial estimation and prevented missing data from negatively affecting the model results. Five databases were built with the same structure: the first three columns hold the input variables, which are the data of three stations, and the output variable is the remaining station. This was done for each of the four stations, and a fifth database was built to obtain a general model (MG) of rainfall within the basin.
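The database construction described above can be sketched as follows. This is only an illustration under stated assumptions: the station codes are borrowed from the model names used later in the paper, missing records are assumed to be stored as NaN, and the general (MG) database is assumed to be the four per-station databases stacked together; the helper names are hypothetical.

```python
import numpy as np

# Hypothetical station codes, borrowed from the model names used later in the paper.
STATIONS = ["VERT", "CLIM", "SAVN", "NAST"]

def complete_days(records):
    """Keep only the rows (days) that have a valid reading at every station."""
    return records[~np.isnan(records).any(axis=1)]

def station_database(records, target_idx):
    """Return (X, y): the other three stations as inputs, one station as output."""
    X = np.delete(records, target_idx, axis=1)
    y = records[:, target_idx]
    return X, y

def general_database(records):
    """Assumption: the MG database stacks the four per-station databases."""
    parts = [station_database(records, i) for i in range(records.shape[1])]
    X = np.vstack([p[0] for p in parts])
    y = np.concatenate([p[1] for p in parts])
    return X, y

# Toy daily rainfall in mm (one column per station); NaN marks a missing record.
raw = np.array([
    [0.0, 0.2, 0.0, 0.1],
    [5.1, np.nan, 4.0, 6.3],
    [12.4, 10.9, 15.2, 11.0],
])
clean = complete_days(raw)
X_vert, y_vert = station_database(clean, STATIONS.index("VERT"))
X_mg, y_mg = general_database(clean)
```

With the toy data, the day with a missing record is dropped, each station database has three input columns, and the stacked database has four times as many rows as a single-station one.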
Databases augmented with Gaussian noise were generated using a sigma value corresponding to ±1 mm, which falls within the World Meteorological Organization's tolerance for rainfall measurement sensors (WMO 2005). Although no fixed value for sigma is established, the authors chose this value to represent the variation for the present study. This sigma value was applied individually to each station, which improved the regression metrics for both the individual stations and the general model (MG). Specifically, the 225 records available per station were amplified by a factor of 10, to 2,250 records per station. This augmentation resulted in a total of 9,000 data points for the general model (MG) database.
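A minimal sketch of this kind of augmentation is shown below. It assumes that sigma is 1 mm, that each record is copied 10 times with independent noise, and that negative values are clipped to zero so that rainfall stays physically valid; the clipping, the seed, and the function name are assumptions made for illustration, and the synthetic input only stands in for the real records.

```python
import numpy as np

def augment_with_noise(records, copies=10, sigma=1.0, seed=0):
    """Repeat each daily record `copies` times and add Gaussian noise (mm)."""
    rng = np.random.default_rng(seed)
    repeated = np.repeat(records, copies, axis=0)
    noisy = repeated + rng.normal(0.0, sigma, size=repeated.shape)
    # Assumption: rainfall cannot be negative, so noisy values are clipped at 0.
    return np.clip(noisy, 0.0, None)

# Synthetic stand-in for 225 complete days at four stations (not the real data).
original = np.random.default_rng(1).gamma(shape=0.5, scale=8.0, size=(225, 4))
augmented = augment_with_noise(original)
```

With 225 input rows and 10 copies, the augmented array has 2,250 rows per station, which matches the record counts given in the text.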
METHODOLOGY
Three different methodologies were implemented for the geospatial estimation of rainfall in the basin as follows:
1. For the first methodology, the estimates were performed with the original data for the four stations within the watershed and the general model (MG) using one station as the target and the remaining stations as input.
2. In the second methodology, rainfall was estimated with the databases augmented with Gaussian noise for the four stations and the general model (MG) using one station as target and the remaining stations as input.
3. Finally, the third methodology was based on the rainfall estimates with the augmented databases, but this time only having as an input variable the station with the highest Pearson correlation of the target station.
All models trained with the augmented data were tested with the original databases to verify that they estimated the rainfall values correctly with respect to the observed data.
Data augmentation
Pearson's correlation
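As an illustration of how the input station for the third methodology could be chosen, the sketch below computes the Pearson correlation matrix between stations and picks the station most correlated with the target. The function name is hypothetical, the data are toy values, and the signed coefficient (not its absolute value) is assumed as the selection criterion.

```python
import numpy as np

def most_correlated_station(records, target_idx):
    """Return (index, r) of the station with the highest Pearson r to the target."""
    corr = np.corrcoef(records, rowvar=False)  # stations are columns
    row = corr[target_idx].copy()
    row[target_idx] = -np.inf  # exclude the target itself
    best = int(np.argmax(row))
    return best, float(row[best])

# Toy daily rainfall (mm); column 1 follows column 0 closely.
records = np.array([
    [0.0, 0.1, 3.0, 1.0],
    [2.0, 2.2, 0.0, 5.0],
    [10.0, 9.5, 1.0, 0.0],
    [4.0, 4.1, 7.0, 2.0],
])
best_idx, best_r = most_correlated_station(records, target_idx=0)
```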
Multiple linear regression
According to Eberly (2007), MLR makes the following assumptions:
Outputs y are independent.
The outputs y follow a normal distribution.
The mean of the distribution is a linear function of each input x.
The variance for the distribution is equal for each output y.
MLR models use least-squares estimation to find the plane with the minimum error, defined as the sum of the squared differences between the predicted and true values (Carrasquilla et al. 2016).
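The least-squares fit can be sketched with NumPy as follows. This is only an illustration with toy numbers; the function names are hypothetical and the paper's own implementation is not shown in the text.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... ; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Evaluate the fitted plane at the rows of X."""
    return coef[0] + X @ coef[1:]

# Toy inputs (three stations) and a target generated by a known plane.
X = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]
coef = fit_mlr(X, y)
y_hat = predict_mlr(coef, X)
```

Because the toy target lies exactly on a plane, the fit recovers the generating coefficients and the residuals are zero up to rounding.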
Random forest
Support vector regression
Support vector regression (SVR) is a supervised ML technique for regression problems. It was proposed by Drucker et al. (1996) as an extension of the SVM algorithm for classification; the difference is that the objective is to find a regression function that maps the input variables to the output variables. In classification, the extreme points are the ones that define the separating boundary used to classify the labels; in SVR, by contrast, the model depends primarily on the number of support vectors and not on the magnitude of the input data (Zhang & O'Donnell 2020).
Artificial neural networks
ANNs are artificial intelligence (AI) models inspired by the biological capacity of the brain to recognize patterns, trends, and correlations in order to make decisions (Dölling & Varas 2002; Sharma et al. 2017). Neurons are the basic building blocks of the model; they are interconnected by synaptic weights, and each contains a bias, an adder, and an activation function (Sarmiento-Ramos 2020). The first artificial neuron was the McCulloch–Pitts neuron (1943); the perceptron followed in 1958, consisting of an input layer, an output layer, and a hidden layer (in general, a network has one or more hidden layers).
The activation functions are very important because they transform the input data into output information. In ANN models, each neuron computes the sum of the products of its input values and their respective weights, adds the bias, and passes the result through the activation function of the current layer. The performance of the model depends primarily on the number of layers and the activation functions, as these allow the model to adapt to the problem and find the optimal solution. Activation functions must be set for each layer; otherwise, linear functions are used by default. This can cause problems in classification or regression because not all problems are linear (Feng & Lu 2019; Sharma et al. 2021).
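The computation performed by a layer can be illustrated with a small NumPy forward pass. The weights, biases, and layer sizes below are made-up values chosen only to make the arithmetic easy to follow; they are not taken from the trained models.

```python
import numpy as np

def relu(z):
    """ReLU activation: negative values become zero."""
    return np.maximum(z, 0.0)

def dense(x, W, b, activation):
    """One layer: weighted sum of the inputs plus bias, then the activation."""
    return activation(x @ W + b)

# Toy example: three input stations, one hidden layer of two neurons, one output.
x = np.array([2.0, 0.0, 1.0])                          # rainfall inputs (mm)
W1 = np.array([[0.5, -1.0], [0.2, 0.3], [1.0, 0.5]])   # hidden-layer weights
b1 = np.array([0.0, 1.0])                              # hidden-layer biases
W2 = np.array([[1.0], [2.0]])                          # output-layer weights
b2 = np.array([0.1])                                   # output-layer bias

hidden = dense(x, W1, b1, relu)             # weighted sums [2.0, -0.5] -> [2.0, 0.0]
output = dense(hidden, W2, b2, lambda z: z)  # linear output layer
```

Here the second hidden neuron has a negative weighted sum, so ReLU sets it to zero and only the first neuron contributes to the output.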
Model parameters
The same parameters were used for the station models and for the models with a single selected input variable, so that the comparison is as fair as possible. For the MLR, the parameters were left at their default values.
For the RF models, the number of trees was set to 10 and the maximum depth to 10. For the SVR models, a radial basis function (RBF) kernel was used, with the C parameter equal to 100 and the gamma parameter equal to 0.1. For the ANN models, the mean squared error (MSE) loss function was used, and the optimizer for all the stations was Adam with a learning rate of 0.001, β1 equal to 0.9, β2 equal to 0.999, and a small epsilon constant. The architectures of the ANN models differed between stations, since the best results were sought by varying the number of layers and neurons; these architectures are presented in Table 1. In the architectures of the table, the first number in bold refers to the input layer and the last number to the output layer. All models use the ReLU activation function in the input layer and the hidden layers, while the output layer uses the linear function, all with a batch size of 10 and 200 training epochs.
Model | Architecture |
---|---|
VERT_ANN_1MET | (**10**, 20, 1) |
CLIM_ANN_1MET | (**10**, 25, 1) |
SAVN_ANN_1MET | (**10**, 20, 1) |
NAST_ANN_1MET | (**5**, 15, 1) |
MG_ANN_1MET | (**15**, 25, 12, 5, 1) |
VERT_ANN_2MET | (**10**, 60, 60, 80, 80, 1) |
CLIM_ANN_2MET | (**10**, 60, 80, 60, 80, 1) |
SAVN_ANN_2MET | (**10**, 60, 60, 80, 80, 1) |
NAST_ANN_2MET | (**10**, 60, 60, 80, 80, 1) |
MG_ANN_2MET | (**10**, 60, 60, 80, 80, 1) |
VERT_ANN_3MET | (**10**, 60, 60, 80, 80, 1) |
CLIM_ANN_3MET | (**10**, 60, 60, 80, 80, 1) |
SAVN_ANN_3MET | (**10**, 60, 60, 80, 80, 1) |
NAST_ANN_3MET | (**10**, 60, 60, 80, 80, 1) |
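The MLR, RF, and SVR settings reported above could be expressed in scikit-learn roughly as follows. This is a sketch under assumptions: the text names scikit-learn but not the exact classes, so the mapping of "number of trees" and "depth" to `n_estimators` and `max_depth` is assumed, the `random_state` is added only for reproducibility, and the data are synthetic. The Keras ANN models are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

def build_models(seed=0):
    """Regressors configured with the parameter values reported in the text."""
    return {
        "MLR": LinearRegression(),  # default parameters
        "RF": RandomForestRegressor(n_estimators=10, max_depth=10, random_state=seed),
        "SVR": SVR(kernel="rbf", C=100, gamma=0.1),
    }

# Synthetic stand-in data: three input stations and one target station (mm).
rng = np.random.default_rng(0)
X = rng.gamma(0.5, 8.0, size=(200, 3))
y = X.mean(axis=1) + rng.normal(0.0, 0.5, size=200)

models = build_models()
predictions = {name: model.fit(X, y).predict(X) for name, model in models.items()}
```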
Evaluation metrics for regression
ML models are trained to learn tasks from a database, and they are used for classification or regression in applications of any kind. Model performance is evaluated with metrics, which verify how well the model predicts classes or continuous values (Naser & Alavi 2021). The evaluation metrics for regression in the present research were RMSE, MAE, the coefficient of determination (R2), and the Kling–Gupta efficiency (KGE), which are described as follows.
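For reference, the commonly used forms of these four metrics are written out below. The notation is an assumption made here: y_i is the observed rainfall on day i, ŷ_i the estimated value, ȳ the mean of the observations, n the number of days, r the Pearson correlation between observed and estimated series, and σ and μ their standard deviations and means. The KGE is given in its original three-component form, which may differ from the exact variant used by the authors.

```latex
\begin{align}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
R^2 &= 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \\
\mathrm{KGE} &= 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2},
\qquad \alpha = \frac{\sigma_{\hat{y}}}{\sigma_{y}},
\quad \beta = \frac{\mu_{\hat{y}}}{\mu_{y}}
\end{align}
```

RMSE and MAE are in millimetres and are best at 0, while R² and KGE are dimensionless and are best at 1.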
Several scripts were developed in Python using the Keras and scikit-learn libraries to apply the ML algorithms to the rainfall data.
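A compact NumPy version of the four evaluation metrics might look like the sketch below. The function names are hypothetical, R2 is assumed to be computed as one minus the ratio of residual to total sum of squares, and KGE is assumed to follow the original three-component definition (correlation, variability ratio, bias ratio).

```python
import numpy as np

def rmse(obs, sim):
    """Root mean square error, in the units of the data (mm)."""
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def mae(obs, sim):
    """Mean absolute error, in the units of the data (mm)."""
    return float(np.mean(np.abs(obs - sim)))

def r2(obs, sim):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((obs - sim) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability and bias ratios."""
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = sim.std() / obs.std()
    beta = sim.mean() / obs.mean()
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

# Toy check: every estimate is exactly 1 mm too high.
observed = np.array([0.0, 2.0, 4.0, 6.0])
estimated = observed + 1.0
scores = {
    "RMSE": rmse(observed, estimated),
    "MAE": mae(observed, estimated),
    "R2": r2(observed, estimated),
    "KGE": kge(observed, estimated),
}
```

In this toy case the constant 1 mm offset leaves the correlation and variability untouched, so the KGE penalty comes entirely from the bias ratio.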
RESULTS
Results of the first methodology
The results of the first methodology are presented in Table 2. For station EST_VERT, the best model was RF, with an R2 of 0.809, whereas the rest of the models did not exceed 0.7. RF also had an MAE of less than 1 mm, the lowest RMSE (2.163 mm), and the best KGE (0.839). For station EST_CLIM, the best model was also RF, with the highest R2 (0.770) and the highest KGE (0.927), very close to unity. RF also had the lowest RMSE (2.492 mm), but in MAE the best model was SVR with an error of 1.086 mm of rainfall, not far from the RF value. For station EST_SAVN, the RF model was the best, with 0.686 and 0.822 in R2 and KGE, respectively, and residuals of 1.955 mm in MAE and 4.609 mm in RMSE. At station EST_NAST, the pattern was the same as at EST_CLIM: RF was the best model in R2 (0.754), RMSE (3.511 mm), and KGE (0.934), while SVR was the best in MAE (1.061 mm). Finally, in the MG, the best model was RF, with the best performance in R2 (0.790), MAE (1.028 mm), RMSE (2.969 mm), and KGE (0.866).
Model | R2 | MAE (mm) | RMSE (mm) | KGE | Station |
---|---|---|---|---|---|
MLR | 0.621 | 1.325 | 3.042 | 0.770 | EST_VERT |
RF | **0.809** | **0.788** | **2.163** | **0.839** | EST_VERT |
SVR | 0.672 | 1.026 | 2.832 | 0.715 | EST_VERT |
ANN | 0.692 | 1.127 | 2.742 | 0.797 | EST_VERT |
MLR | 0.457 | 1.833 | 3.833 | 0.757 | EST_CLIM |
RF | **0.770** | 1.117 | **2.492** | **0.927** | EST_CLIM |
SVR | 0.593 | **1.086** | 3.320 | 0.652 | EST_CLIM |
ANN | 0.608 | 1.343 | 3.256 | 0.891 | EST_CLIM |
MLR | 0.366 | 3.218 | 6.548 | 0.664 | EST_SAVN |
RF | **0.686** | **1.955** | **4.609** | **0.822** | EST_SAVN |
SVR | 0.380 | 2.222 | 6.477 | 0.535 | EST_SAVN |
ANN | 0.435 | 2.875 | 6.180 | 0.740 | EST_SAVN |
MLR | 0.574 | 1.818 | 4.616 | 0.820 | EST_NAST |
RF | **0.754** | 1.308 | **3.511** | **0.934** | EST_NAST |
SVR | 0.692 | **1.061** | 3.926 | 0.831 | EST_NAST |
ANN | 0.619 | 1.762 | 4.367 | 0.835 | EST_NAST |
MLR | 0.466 | 2.059 | 4.739 | 0.711 | MG |
RF | **0.790** | **1.028** | **2.969** | **0.866** | MG |
SVR | 0.550 | 1.360 | 4.351 | 0.647 | MG |
ANN | 0.621 | 1.557 | 3.995 | 0.770 | MG |
Note. Bold values indicate the best metric for each station for all methods.
Results of the second methodology
The results of the ML models implemented are presented in Table 3. At station EST_VERT, the best model was RF, with R2 and KGE values very close to unity and MAE and RMSE residuals of less than 1 mm; the ANN model followed closely, with only small differences across the metrics. At station EST_CLIM, the best model for almost all metrics was RF, with an R2 of 0.904, an MAE of 0.368 mm, and an RMSE of 1.435 mm; the exception was KGE, where the best model was ANN with 0.967. At station EST_SAVN, the best model was RF, with slightly lower performance than at the previous stations: an R2 of 0.825, an MAE of 1.477 mm, and an RMSE of 3.440 mm. The best KGE at this station was obtained by the ANN model, with 0.925. At station EST_NAST, the SVR model outperformed RF only in MAE, with 0.636 mm, while RF was better in R2 (0.896), RMSE (2.278 mm), and KGE (0.961). Finally, the MG metrics were better with the RF model, which yielded an R2 of 0.879, an MAE of 0.920 mm, an RMSE of 2.272 mm, and a KGE of 0.927, although the ANN model reached a slightly higher KGE (0.934).
Model | R2 | MAE (mm) | RMSE (mm) | KGE | Station |
---|---|---|---|---|---|
MLR | 0.620 | 1.393 | 3.046 | 0.816 | EST_VERT |
RF | **0.969** | **0.323** | **0.874** | **0.971** | EST_VERT |
SVR | 0.880 | 0.630 | 1.714 | 0.907 | EST_VERT |
ANN | 0.940 | 0.478 | 1.215 | 0.954 | EST_VERT |
MLR | 0.459 | 1.799 | 3.827 | 0.711 | EST_CLIM |
RF | **0.904** | **0.615** | **1.612** | 0.947 | EST_CLIM |
SVR | 0.861 | 0.662 | 1.940 | 0.875 | EST_CLIM |
ANN | 0.898 | 0.700 | 1.659 | **0.967** | EST_CLIM |
MLR | 0.366 | 3.257 | 6.551 | 0.667 | EST_SAVN |
RF | **0.825** | **1.477** | **3.440** | 0.909 | EST_SAVN |
SVR | 0.599 | 1.563 | 5.209 | 0.727 | EST_SAVN |
ANN | 0.774 | 1.686 | 3.908 | **0.925** | EST_SAVN |
MLR | 0.576 | 1.890 | 4.606 | 0.796 | EST_NAST |
RF | **0.896** | 0.721 | **2.278** | **0.961** | EST_NAST |
SVR | 0.868 | **0.636** | 2.569 | 0.892 | EST_NAST |
ANN | 0.882 | 0.699 | 2.435 | 0.935 | EST_NAST |
MLR | 0.442 | 2.176 | 4.885 | 0.710 | MG |
RF | **0.879** | **0.920** | **2.272** | 0.927 | MG |
SVR | 0.697 | 1.030 | 3.598 | 0.771 | MG |
ANN | 0.839 | 0.966 | 2.625 | **0.934** | MG |
Note. Bold values indicate the best metric for each station.
Results of the third methodology
The third methodology showed lower performance than the first two methodologies, as presented in Table 4. At station EST_VERT, the best model overall was RF, with the highest R2 (0.727) and the lowest RMSE (2.583 mm); the best KGE (0.880) and the lowest MAE (1.063 mm) in Table 4 correspond to the ANN and SVR models, respectively. For station EST_CLIM, the best model was also RF, with 0.618 for R2, 0.814 for KGE, and 1.439 and 3.216 mm of rainfall for MAE and RMSE, respectively. At station EST_SAVN, the highest residuals and the lowest performance coefficients were obtained: the RF model reached only 0.484 for R2, 0.747 for KGE, and 5.910 mm for RMSE, and the lowest MAE (2.399 mm) was obtained with SVR. Finally, for station EST_NAST, the best values were 0.606 for R2, 0.860 for KGE, 1.510 mm for MAE, and 4.441 mm for RMSE; in Table 4 these correspond to SVR (R2 and RMSE), ANN (KGE), and RF (MAE). In general, when a single station is used as the predictor, the RF models give the best rainfall estimates.
Model | R2 | MAE (mm) | RMSE (mm) | KGE | Station |
---|---|---|---|---|---|
MLR | 0.535 | 1.670 | 3.372 | 0.757 | EST_VERT |
RF | **0.727** | 1.127 | **2.583** | 0.871 | EST_VERT |
SVR | 0.609 | **1.063** | 3.091 | 0.714 | EST_VERT |
ANN | 0.660 | 1.626 | 2.882 | **0.880** | EST_VERT |
MLR | 0.399 | 1.988 | 4.033 | 0.673 | EST_CLIM |
RF | **0.618** | **1.439** | **3.216** | **0.814** | EST_CLIM |
SVR | 0.390 | 1.505 | 4.062 | 0.587 | EST_CLIM |
ANN | 0.385 | 2.239 | 4.080 | 0.727 | EST_CLIM |
MLR | 0.326 | 3.379 | 6.752 | 0.635 | EST_SAVN |
RF | **0.484** | 2.786 | **5.910** | **0.747** | EST_SAVN |
SVR | 0.343 | **2.399** | 6.665 | 0.519 | EST_SAVN |
ANN | 0.330 | 3.394 | 6.731 | 0.694 | EST_SAVN |
MLR | 0.537 | 2.043 | 4.811 | 0.772 | EST_NAST |
RF | 0.602 | **1.510** | 4.463 | 0.732 | EST_NAST |
SVR | **0.606** | 1.791 | **4.441** | 0.784 | EST_NAST |
ANN | 0.583 | 2.227 | 4.570 | **0.860** | EST_NAST |
Note. Bold values indicate the best metric for each station.
Comparison between the methodologies
DISCUSSION
A strength of this research lies in the fact that the estimates are based on rainfall measurements obtained through a pluviographic network within the basin. In some scientific studies, a limitation is the lack of sufficient climatological information about the study region. Here, it was possible to make a point-specific estimation using rainfall data exclusively, together with a technique to augment the dataset.
When comparing the results obtained in this research with those of Zahran et al. (2015), who estimated annual rainfall in an arid region of Jordan using ANN, their evaluation metrics were limited to prediction error and accuracy error, without comparing the residuals of the models. This restricted the visualization of residuals in their estimates. In our research, we conducted daily estimations using an ANN, differing in the evaluation metrics employed, such as MAE and RMSE, which enhance the analysis of residuals between measured and estimated values. Furthermore, the daily analysis allows for a better understanding of rainfall behavior.
Studies such as Dutta & Gouthaman (2020), Endalie et al. (2021), Gu et al. (2022), Liyew & Melese (2021), and Yan et al. (2021) investigated the behavior of meteorological variables such as radiation, wind speed, relative humidity, and rainfall, and noted that the lack of data could limit the application of ML techniques. Their predictions also used daily rainfall data similar to those in this study. The metrics they employed for rainfall prediction included the coefficient of determination (R²), MAE, and RMSE. The regions analyzed had arid, semi-arid, and humid climates, similar to those studied in this research. However, the difference is that these authors focused on rainfall prediction, while we performed a spatial estimation, which represents a significant contribution of this study to the application of ML to this phenomenon. Additionally, the use of the KGE index as a metric strengthens model validation; this metric has been used in very few studies.
Other researchers, such as Ridwan et al. (2021), used daily precipitation data from a network of around 10 rain gauge stations, with a more robust database, to determine rainfall relationships. They used Pearson's coefficient, reporting values between 0.7 and 0.8 for their ANN model. In this study, coefficients between 0.8 and 0.9 were achieved using RF and ANN models, which is similar to the results reported by these authors. However, a comparison of residuals could not be made because the authors normalized them. Liyew & Melese (2021) used a Pearson correlation matrix to extract the most representative features for rainfall, subsequently generating RF, MLR, and XGBoost models. Their metrics included MAE and RMSE, with reported values of 3.58 mm for MAE and 7.85 mm for RMSE. The models in this research obtained lower errors, with values of 0.323 mm for MAE and 0.874 mm for RMSE. Another contribution of this study is the understanding of the spatiotemporal behavior of rainfall through the generated models.
CONCLUSIONS
In this research, a temporal rainfall estimation was conducted for a semi-arid basin in Zacatecas, Mexico, applying various ML algorithms such as MLR, RF, SVR, and ANN, and evaluating regression metrics, including R², MAE, RMSE, and KGE. Three different methodologies were implemented: the first using the original data, estimating one station as the target station and the remaining stations within the basin as input variables; this was applied to the four individual stations and a general model (MG). The second methodology followed the same approach but incorporated data augmentation with Gaussian noise. The third methodology used a single target station and selected the station with the highest Pearson correlation among the remaining ones as the input variable.
Based on the results obtained, it can be concluded that the second methodology and the RF models, along with ANN, are the best estimators of daily rainfall in the semi-arid ‘Chilitas’ basin. These models show values very close to unity in the R² and KGE coefficients, as well as low residuals in the MAE and RMSE indicators. The implementation of a data augmentation technique has proven effective in mitigating the problem of missing or insufficient data. The most outstanding model was the RF applied at the EST_VERT station using the second methodology, achieving an R² of 0.969, a KGE of 0.971, and residuals of 0.323 mm in MAE and 0.874 mm in RMSE.
By contrast, the third methodology is only applicable for estimating rainfall from the station with the highest correlation when there are not enough sensors, and its results are not as good as those obtained with the other two methodologies. Daily rainfall predictions improve with a greater number of measurements throughout the basin, since relying on a single station limits the accuracy of rainfall estimates at a specific point, even if the correlation is high. The EST_VERT and EST_NAST stations yielded better results than the EST_CLIM and EST_SAVN stations, the latter of which obtained results very different from the other three stations. This indicates that, at least for this basin, the farther north the station, the better the results.
In future work, a rainfall-runoff model and a rainfall-infiltration model will be developed to verify the behavior of rainfall and thus make better use of water resources in regions where rainfall is scarce, such as the state of Zacatecas and central-northern Mexico. The application of climate change models will be considered to forecast the behavior of rain in a semi-arid basin.
ACKNOWLEDGMENTS
The authors would like to thank the National Council of Humanities, Sciences, and Technologies (CONAHCYT, from the Spanish Consejo Nacional de Humanidades, Ciencias y Tecnologías) for the financial support through the ‘Becas Nacional (Tradicional) 2021-3’ call for scholarships for the student José Armando Rodríguez Carrillo, which facilitated the development of this research. We also thank the state council of science COZCYT for the support in writing this paper.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.