Rainfall is one of the most important meteorological phenomena, since it provides water to the Earth's surface, which has a significant impact on the daily life of human beings. Gaining knowledge of its behavior in a semi-arid basin is an important and challenging task for taking advantage of this natural resource, given that water is usually scarce in such regions. Artificial intelligence and machine learning algorithms help to identify rainfall patterns and trends within a region. Multiple linear regression, random forest (RF), support vector machine, and artificial neural network (ANN) algorithms were implemented using daily rainfall data from climatological stations located within the basin, using one station as the target variable and the remaining ones as input variables. The metrics used to evaluate the models were the coefficient of determination (R2), mean absolute error, root mean square error, and the Kling–Gupta efficiency coefficient. The results showed that daily rainfall prediction is better individually than overall, finding that the models obtained by RF and ANN better simulate daily rainfall in the basin.

  • Artificial neural network algorithms were employed using rainfall data within a semi-arid basin.

  • This work is based on multiple linear regression, random forest (RF), support vector machine, and ANNs.

  • Daily rainfall prediction is better individually than overall.

  • Models obtained by ANN and RF better simulate daily rainfall in the basin.

Rainfall prediction has become an important and challenging task for the scientific community all over the world due to the great impact it has on the daily life of humans and on sectors such as agriculture, public health services, livestock, industry, electric power generation, and hydrological sciences, among others (Oswal 2019; Yan et al. 2021). Climate change has partially altered the hydrological cycle, and its main effects are evidenced in rainfall. However, the problem lies in the fact that rainfall is not distributed uniformly across all regions; it has also been found that air pollution is closely related to meteorological conditions in the summer and winter periods (Ridwan et al. 2021; Barrera-Animas et al. 2021). The main impacts observed in recent years due to sudden global climate transitions are temperature increases, extreme weather events, and the alteration of ecosystems. This has led researchers to investigate techniques, algorithms, and tools to help understand the phenomenon and mitigate its effects on the daily lives of living beings (Afrifa et al. 2022, 2023b; Lu 2024). The challenge researchers face is how to predict rainfall, because it is a complex phenomenon that involves the analysis of many variables and factors, and measurement can become complicated due to the need for specialized equipment, which is sometimes expensive and generally requires a lot of storage memory (Sarasa-Cabezuelo 2022).

One of the advantages of estimating rainfall is that it helps to prevent natural disasters, such as droughts and floods, making the best use of scarce rainfall in semi-arid regions, since in many cases, access to groundwater or construction of reservoirs can be costly, and the water does not have the quality for the activities carried out by mankind (Basha et al. 2020; Liyew & Melese 2021). Conversely, the spatiotemporal behavior of rainfall makes estimation even more difficult, and it does not have a linear distribution from a mathematical point of view, so other alternatives are explored to understand this phenomenon, defining its patterns and trends (Kashiwao et al. 2017; Basha et al. 2020; Gu et al. 2022). Conventional hydrological methods for rainfall estimation are based on numerical and statistical models, but they are computationally very expensive since they involve a considerable amount of data and variables (Kashiwao et al. 2017; Gu et al. 2022).

In recent years, artificial intelligence has been widely used for water resources studies, since within it we can find disciplines that allow understanding of the occurrence and distribution of meteorological variables such as rainfall. Artificial intelligence encompasses data mining, machine learning (ML), and deep learning (DL), which help to identify correlations, trends, and predictions of rainfall (Liyew & Melese 2021). Different ML algorithms are applied to hydrology problems, although the most prominent are DL algorithms, such as artificial neural networks (ANNs), as these try to model the nonlinearity of rainfall (Kashiwao et al. 2017; Endalie et al. 2022; Afrifa et al. 2023a). Other regression algorithms have also been applied for predictions, highlighting random forest (RF), support vector machine (SVM), K-nearest neighbors (K-NN), multiple linear regression (MLR), and extreme gradient boosting (Hasan et al. 2015; Diez-Sierra & del Jesus 2020; Gu et al. 2022). In addition, in recent years, optimizers for DL models have been implemented to improve the performance of ANN models in both classification and regression for surface water, groundwater, and atmospheric applications (Roy & Singh 2020; Anaraki et al. 2023; Mohseni & Muskula 2023; Saqr et al. 2023; Abd-Elmaboud et al. 2024).

Regression on rainfall helps to determine the amount of water available at the surface, which allows recognizing how these changes fluctuate in space and time. In terms of regression, Ridwan et al. (2021) worked with rainfall prediction in Kenyir Lake, within Hulu Terengganu province, Malaysia, using decision tree (DT), RF, and ANN, all for regression, together with Bayesian linear regression and boosted decision tree regression (BDTR), as well as evaluating different scenarios to better understand the behavior of water in the area near the lake. The database was created with rainfall data from 10 different stations near the lake. The evaluation metrics were root mean squared error (RMSE), mean absolute error (MAE), R2, relative absolute error, and relative squared error (RSE). They concluded that the best algorithm was BDTR. Likewise, Diez-Sierra & del Jesus (2020) applied ML techniques, such as support vector regression (SVR), K-NN, K-means, RF, and ANN, as well as generalized regression models, for long-term daily rainfall prediction with atmospheric patterns from re-analyzed databases on the island of Tenerife, Spain, showing that ANN models best model daily rainfall, followed by RF and logistic regression. In the same year, Phạm et al. (2020) applied three ML algorithms: particle swarm optimized adaptive network-based fuzzy inference system (PSOANFIS), SVM, and ANN for daily rainfall prediction in Hoa Binh province, Vietnam. Different variables such as maximum and minimum temperatures, wind speed, solar radiation, relative humidity, and rainfall were used as input variables. The validation metrics used were MAE, R2, skill score, probability of detection, critical success index, and false alarm ratio (FAR). For this study, the best result was obtained with the SVM, compared to the ANN and PSOANFIS models.

In order to understand the behavior of rainfall in a semi-arid basin, the main objective of our research is to perform rainfall estimations with ML techniques, using only daily records from different pluviographic and climatological stations located within the study area, analyzing the correlations between the stations and implementing regression algorithms (RF, SVM, ANN, and MLR), with the models validated using MAE, RMSE, and R2. Three methodologies were implemented: the first uses the current station as output and the remaining stations as input; the second repeats this process but with databases augmented with Gaussian noise; and the third predicts the daily rainfall of the current station from the station with the highest correlation, also with augmented databases. In the end, we compare the three methodologies.

The rest of this paper is organized as follows: In the next section, the study area is briefly described, presenting the dataset used in this work. The methodology section shows the AI techniques implemented, describing the parameters and evaluation metrics of the models. Finally, discussions and results are detailed in order to draw conclusions.

Study area

The study area is a semi-arid basin located in the state of Zacatecas, Mexico (Figure 1), 20 km south of the state capital. Most of the basin is in the municipality of Genaro Codina, with the slope toward the municipality of Zacatecas. The basin covers approximately 100 km2, with the centroid at 740,912.02 East, 2,500,760.34 North in the Universal Transverse Mercator (UTM) World Geodetic System 1984 (WGS84) Zone 13N projection. Vegetation and land uses comprise the following: 63.5% natural grassland, 19.1% annual seasonal agriculture, 9% crasicaule scrub, 4.3% annual irrigated agriculture, 3.6% shrub secondary vegetation, 0.4% urban built-up, and lastly, 0.03% oak forest. The edaphology of the basin is 62.6% phaeozem, 33.4% kastanozem, 3.8% leptosol, and 0.2% calcisol (INEGI 2023). The basin is rural, where the main activities are agriculture and livestock; it does not have large water reservoirs, and the population is less than 500 inhabitants (Dávila-Hernández et al. 2022). Within the basin there are three pluviographic stations, called Vertedor (EST_VERT) with an elevation of 2,244 m.a.s.l., Saint Venant (EST_SAVN) with an elevation of 2,409 m.a.s.l., and Navier Stokes (EST_NAST) with an elevation of 2,467 m.a.s.l.; in addition, there is an automatic climatological station (EST_CLIM) with an elevation of 2,326 m.a.s.l. All of them can be seen in Figure 1. This equipment was strategically located to evaluate the spatiotemporal variation of rainfall in the basin.
Figure 1

Map of the basin study with its drainage network and the stations applying a DEM at 12 m resolution (INEGI 2023).


Dataset

The databases were created from the collection of data from the three rain gauges and the climatological station installed within the basin, all with daily rainfall records. The data considered were the days with records at all four stations during the years 2019–2022. In the analysis process, days with missing records at the measurement stations were identified, caused by various technical limitations in data collection. To ensure the quality and consistency of the results, only those days on which complete records were available at all the stations considered were selected. This strategy allowed for an accurate and robust spatial estimation, preventing the presence of missing data from negatively affecting the model results. Five databases were built with a particular structure: the first three columns contain the input variables, which are the data of three stations, and the output variable is the remaining station. This technique was applied for each of the four stations and also used to obtain a general model (MG) of rainfall within the basin.
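As a minimal sketch of this day-selection step, the following assumes hypothetical station columns and placeholder values; only days with a record at every station are kept, and one station is split off as the target:

```python
import numpy as np
import pandas as pd

# Placeholder daily records (mm); NaN marks a day with a missing record.
df = pd.DataFrame({
    "EST_VERT": [0.0, np.nan, 3.2, 1.1],
    "EST_SAVN": [0.1, 2.0, np.nan, 1.0],
    "EST_NAST": [0.0, 1.8, 3.0, 1.2],
    "EST_CLIM": [0.2, 2.2, 3.5, np.nan],
})

complete = df.dropna()                               # days recorded at all stations
X = complete[["EST_SAVN", "EST_NAST", "EST_CLIM"]]   # input stations
y = complete["EST_VERT"]                             # target station
print(len(complete))  # 1 complete day remains in this toy example
```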

Databases augmented with Gaussian noise were generated using a sigma value corresponding to ±1 mm, which falls within the World Meteorological Organization's tolerance for rainfall measurement sensors (WMO 2005). Although no fixed value for sigma is established, the authors chose this value to represent the variation for the present study. This sigma value was applied individually to each station, improving the estimates in both the regression metrics and the general model (MG). Specifically, 225 records were available per station, which were then amplified by a factor of 10, to 2,250 records per station. This augmentation resulted in a total increase of 9,000 data points for the general model (MG) database.

Three different methodologies were implemented for the geospatial estimation of rainfall in the basin as follows:

  • 1. For the first methodology, the estimates were performed with the original data for the four stations within the watershed and the general model (MG) using one station as the target and the remaining stations as input.

  • 2. In the second methodology, rainfall was estimated with the databases augmented with Gaussian noise for the four stations and the general model (MG) using one station as target and the remaining stations as input.

  • 3. Finally, the third methodology was based on the rainfall estimates with the augmented databases, but this time only having as an input variable the station with the highest Pearson correlation of the target station.

MLR, RF, SVR, and ANN algorithms were implemented to estimate daily rainfall at each station, using as input the remaining stations within the basin and the general model (MG). The training was performed with a five-fold cross-validation to avoid overfitting the models, with the input data standardized with the z-score, using the following equation:

$$z = \frac{x - \mu}{\sigma} \quad (1)$$

where z is the standardized value, x is the given value, $\mu$ is the mean, and $\sigma$ is the standard deviation of the data (DeVore 2017). A shuffle and a random state equal to 42 were applied. The evaluation metrics of the models were R2, MAE, RMSE, and the Kling–Gupta efficiency (KGE) coefficient, as they are the main metrics used to evaluate regression models.

All models were trained with the augmented data but were tested with the original databases to verify that the models estimated the rainfall value concerning the true data.
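The training scheme above (z-score standardization inside a shuffled five-fold cross-validation with a random state of 42) can be sketched as follows; the synthetic data and the use of RF here are placeholders for illustration, not the basin records:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the station data: 225 days, 3 input stations.
rng = np.random.default_rng(42)
X = rng.gamma(0.5, 4.0, size=(225, 3))
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2]   # toy target station

kf = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle, random state 42
scores = []
for train_idx, test_idx in kf.split(X):
    scaler = StandardScaler()                       # z-score standardization
    X_train = scaler.fit_transform(X[train_idx])    # fit only on training folds
    X_test = scaler.transform(X[test_idx])
    model = RandomForestRegressor(n_estimators=10, max_depth=10, random_state=42)
    model.fit(X_train, y[train_idx])
    scores.append(model.score(X_test, y[test_idx]))  # R^2 per fold
print(np.mean(scores))
```

Fitting the scaler only on the training folds avoids leaking test information into the standardization, which keeps the cross-validation estimate honest.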

Data augmentation

Data augmentation is used when databases contain a limited number of records. In these cases, it is necessary to find alternatives to make the data applicable to ML or DL algorithms and achieve the desired accuracy without altering the behavior of the phenomenon under study, e.g., rainfall (Wang et al. 2018). Gaussian noise is a widely employed technique to increase the amount of data, especially in image classification problems and speech recognition. This technique (Benegui & Ionescu 2020) can be expressed by the following equation:
$$\tilde{x} = x + n, \qquad n \sim \mathcal{N}(\mu, \sigma^2) \quad (2)$$

where $x$ represents the original signal to which data augmentation is applied and $n$ denotes the added Gaussian noise of the same length, randomly generated from a normal distribution $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma$ is the standard deviation. It is important to note that the data magnification may vary depending on the amplitude attributed to the $\sigma$ value.
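A minimal sketch of this augmentation with the values stated earlier ($\sigma$ = 1 mm, amplification factor 10) might look as follows; clipping negative values to zero is an assumption here, since rainfall cannot be negative:

```python
import numpy as np

# Add zero-mean Gaussian noise (sigma = 1 mm) and replicate each record
# 10 times, as in Equation (2); values are placeholders, not basin data.
rng = np.random.default_rng(0)
rainfall = np.array([0.0, 2.5, 12.3, 0.4])   # original daily records (mm)

augmented = np.concatenate([
    np.clip(rainfall + rng.normal(0.0, 1.0, rainfall.size), 0.0, None)
    for _ in range(10)
])
print(augmented.size)  # 4 records x factor 10 = 40
```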

Pearson's correlation

According to Lalinde et al. (2018), the terms ‘relationship’ and ‘association’ are equivalent and are used to refer to the assessment of covariation between at least two variables in the field of statistics. These terms indicate the relationship between two or more variables and allow conclusions to be drawn about trends or data fits. This can be explored using Pearson's correlation expression as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \quad (3)$$

where $\bar{x}$ and $\bar{y}$ are the sample means of the data and X and Y are the true data of their respective sets. These concepts are based on the work of Fiallos (2021). When the value of $r$ approaches unity, either positively, where an increase in the independent variable is related to an increase in the value of the dependent variable, or negatively, where an increase in the independent variable is associated with a decrease in the value of the dependent variable, it indicates a high correlation. Conversely, when the value of $r$ approaches zero, the correlation is weak.
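In practice, the pairwise Pearson correlation between stations can be obtained in one call; this sketch uses hypothetical station columns and placeholder values:

```python
import pandas as pd

# Toy daily rainfall records (mm); column names are assumptions.
df = pd.DataFrame({
    "EST_VERT": [0.0, 5.2, 1.1, 0.0, 8.4],
    "EST_NAST": [0.2, 4.8, 1.5, 0.0, 7.9],
    "EST_SAVN": [0.0, 2.1, 0.3, 1.2, 5.5],
})

corr = df.corr(method="pearson")   # applies Equation (3) to every station pair
print(corr.loc["EST_VERT", "EST_NAST"])
```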

Multiple linear regression

MLR is an algorithm that performs regression analysis to find the existing correlation between two or more variables, thus deriving a behavior where one variable depends on what happens with the second or more variables, allowing one to make predictions. The MLR equation represents the linear correlation between the dependent variable and the independent variables (Tranmer et al. 2020). Mathematically it can be expressed as follows:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \epsilon \quad (4)$$

where y is the model output, $x_i$ is a model input variable, $\beta_0$ is the model bias (intercept), $\beta_i$ is the coefficient given to each input variable, and $\epsilon$ refers to the model error.

According to Eberly (2007), for MLR the following considerations must be assumed:

  • Outputs y are independent.

  • The outputs y follow a normal distribution.

  • The mean of the distribution is a linear function of each input $x_i$.

  • The variance for the distribution is equal for each output y.

MLR models use least squares estimation to calculate the plane that produces the minimum error, i.e., the sum of the squared differences between the predicted and true values (Carrasquilla et al. 2016).
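A least-squares fit of Equation (4) can be sketched with scikit-learn on synthetic placeholder data; the coefficients below are arbitrary choices, not the basin model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noiseless toy data following y = 1.0 + 0.6*x1 + 0.3*x2.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(50, 2))
y = 1.0 + 0.6 * X[:, 0] + 0.3 * X[:, 1]

mlr = LinearRegression().fit(X, y)   # minimizes the sum of squared errors
print(mlr.intercept_, mlr.coef_)     # recovers ~1.0 and ~[0.6, 0.3]
```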

Random forest

RF is a supervised ML algorithm used to improve predictions from a forest of decision trees, where each tree stores different random variables for prediction, introduced by Breiman (2001). Mathematically, Cutler et al. (2012) defined it with a $p$-dimensional random vector $X = (X_1, \ldots, X_p)^T$ representing the input variables and an output Y representing the dependent output variable, with an unknown joint distribution $P_{XY}(X, Y)$. So the objective is to predict Y with a function $f(X)$. The prediction function is given by a loss function $L(Y, f(X))$ that minimizes the expected value of the loss, represented by the following equation:

$$E_{XY}\left[L(Y, f(X))\right] \quad (5)$$

Minimizing the squared error loss $L(Y, f(X)) = (Y - f(X))^2$ results in the conditional expected value, which is used as the RF formula in regression:

$$f(x) = E(Y \mid X = x) \quad (6)$$

For f there is a set called 'base learners' $h_1(x), \ldots, h_J(x)$, which are combined to obtain a 'joint predictor' $f(x)$, which, finally, for regression, are averaged as follows:

$$f(x) = \frac{1}{J}\sum_{j=1}^{J} h_j(x) \quad (7)$$
The RF algorithm selects a random subset of the original database and then creates a set of decision trees. The trees separate the observations by binary decision rules, each with a cut-off point selected to divide the data into two groups, until a prediction is reached. After this process, the different results of the trees are combined to produce the output, by averaging in the case of regression. Figure 2 visualizes the learning and prediction process of the algorithm.
Figure 2

RF learning and prediction process for n decision trees.

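The averaging of Equation (7) can be verified directly with scikit-learn; this sketch reuses the 10-tree, depth-10 settings reported in the model parameters section, on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy regression data, not basin records.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(200, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2

rf = RandomForestRegressor(n_estimators=10, max_depth=10, random_state=42)
rf.fit(X, y)

# The forest prediction equals the mean of the individual tree predictions.
pred = rf.predict(X[:5])
tree_mean = np.mean([t.predict(X[:5]) for t in rf.estimators_], axis=0)
print(np.allclose(pred, tree_mean))  # True
```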

Support vector regression

SVR is a supervised ML technique for performing regression problems. It was proposed by Drucker et al. (1996) as an extension of the SVM classification algorithm, the difference being that the objective is to find a regression function that maps the values of the output variable with respect to the input variables. In classification, the extreme points are the ones that make up the separating vector used to classify the labels; conversely, SVR depends primarily on the number of support vectors and not on the magnitude of the input data (Zhang & O'Donnell 2020).

To make predictions, SVR uses an $\varepsilon$-insensitive loss function to compute a hyperplane, in such a way that the training values can deviate at most $\varepsilon$ from it. The hyperplane delimits the $\varepsilon$-insensitive tube, which delimits the boundaries for the regression values; the SVR technique applies a loss function to make the tube as thin as possible but at the same time contain as much training data as possible within the tube (Welling 2004; Zhang & O'Donnell 2020); as shown in Figure 4, this is expressed mathematically as follows:

$$\min_{w,\, b,\, \xi,\, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad (8)$$

$$\text{subject to} \quad y_i - w^T x_i - b \le \varepsilon + \xi_i, \qquad w^T x_i + b - y_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0 \quad (9)$$

where w refers to the vector of model weights, $x_i$ is the vector of input values, b is the bias, and $y_i$ is the output value of the instances. The slack variables $\xi_i$, $\xi_i^*$ are introduced to protect the margin from outliers. C is a parameter that regulates the trade-off between the flatness of the function and the tolerated prediction errors: the larger the value of C, the more the prediction errors are penalized; if C is small, the flatness is favored (Zhang & O'Donnell 2020).

In SVR, kernel functions are used to transform the model input values into a higher-dimensional space so that nonlinear functions can be discriminated. There are different types of kernels, of which we can highlight the following: linear, polynomial, radial basis function (RBF), and sigmoidal. The formula of the RBF kernel used is represented as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right) \quad (10)$$

where $\gamma$ is a parameter that controls the influence of the distance on the kernel value, allowing the information to be discriminated, and $\|x_i - x_j\|^2$ is the squared Euclidean distance (Sujay Raghavendra & Deka 2014).
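An SVR fit with the RBF kernel of Equation (10) can be sketched as follows, reusing the C = 100 and gamma = 0.1 settings reported in the model parameters section; the sine-shaped toy data are an assumption for illustration:

```python
import numpy as np
from sklearn.svm import SVR

# Toy one-dimensional regression problem.
rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

# RBF kernel with the paper's reported C and gamma; epsilon sets the tube width.
svr = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[1.5]]))  # close to sin(1.5)
```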

Artificial neural networks

ANNs are artificial intelligence (AI) models inspired by the biological capacity of the brain to recognize patterns, trends, and correlations in order to make decisions (Dölling & Varas 2002; Sharma et al. 2017). Neurons are the basic building blocks of the model; they are interconnected by synaptic weights and contain a bias, an adder, and an activation function (Sarmiento-Ramos 2020). The first neuron model was the McCulloch–Pitts neuron (1943); later, the perceptron was created in 1958, consisting of an input layer, an output layer, and a hidden layer (in general, one or more hidden layers are used).

The training procedure consists of receiving the data at the input layer, which transmits this information to the neurons of the next layers. The information is transformed by the weights until it reaches the output layer. The weights are then modified in such a way that they are propagated to find the optimal values for the regression with an optimizer, until the performance of the model improves and the error is minimized (Dölling & Varas 2002; Sujay Raghavendra & Deka 2014; Otchere et al. 2021). Mathematically, the ANN model can be explained with the following equation (Dongare et al. 2012):

$$y = \varphi\left(\sum_{i=1}^{n} w_i x_i + b\right) \quad (11)$$

where y refers to the output value, $\varphi$ is the activation function, b is the model bias, $x_i$ refers to the input values, and $w_i$ refers to the synaptic weights that are updated as the model is trained. Generally, a neural network schematic can be represented as shown in Figure 3.
Figure 3

Fundamentals of how the ANN model is composed, with up to n inputs and one output y (Sharma et al. 2021).


The activation functions are very important because they transform the input data into output information. In ANN models, the products of the input values and their respective weights are summed, and the result is passed through the activation function of the current layer. The performance of the model depends primarily on the number of layers and the activation functions, as these allow the model to adapt to the established problem and find the optimal solution. Activation functions must be established for each layer since, if this is not done, a linear function is used by default. This can cause a problem when classifying or obtaining regressions, because not all problems are linear (Feng & Lu 2019; Sharma et al. 2021).
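As a minimal numeric illustration of Equation (11) with a ReLU activation, the forward pass of a single neuron can be written directly; all values here are arbitrary placeholders:

```python
import numpy as np

def relu(z):
    # ReLU activation: max(0, z)
    return np.maximum(0.0, z)

x = np.array([2.0, -1.0, 0.5])   # example input values x_i
w = np.array([0.4, 0.3, -0.2])   # example synaptic weights w_i
b = 0.1                          # bias

# y = phi(sum_i w_i * x_i + b): 0.4*2.0 + 0.3*(-1.0) - 0.2*0.5 + 0.1 = 0.5
y = relu(np.dot(w, x) + b)
print(y)  # 0.5
```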

Model parameters

The same parameters were used for the models of the meteorological stations and the models with a single selected variable to make a more equivalent comparison. For the MLR, the parameters were left as default.

For the RF models, the number of trees was set to 10 and the maximum depth to 10. For the SVR models, an 'RBF' kernel was used, with the C parameter equal to 100 and the gamma parameter equal to 0.1. For the ANN models, the mean squared error (MSE) loss function was used, and the optimizer used for all the stations was 'Adam' with a learning rate of 0.001, a $\beta_1$ parameter equal to 0.9, $\beta_2$ equal to 0.999, and a small epsilon term. The architectures of the ANN models were different for each station, since the best results were sought by varying the number of layers and neurons; these architectures are presented in Table 1. In the architectures of the table, the first number refers to the input layer and the last number to the output layer. All models have the ReLU activation function in the input layer and the hidden layers, while the output layer uses the linear function, all with a batch size of 10 and trained for 200 epochs.
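The VERT_ANN_1MET architecture (10, 20, 1) from Table 1, with the hyperparameters above, might be sketched in Keras as follows; the input dimension of 3 (the three remaining stations) is an assumption of this sketch:

```python
from tensorflow import keras

# ReLU in the first and hidden layers, linear output, Adam optimizer
# (lr 0.001, beta_1 0.9, beta_2 0.999) and MSE loss, as stated in the text.
model = keras.Sequential([
    keras.layers.Input(shape=(3,)),              # three input stations (assumed)
    keras.layers.Dense(10, activation="relu"),   # "input" layer of Table 1
    keras.layers.Dense(20, activation="relu"),   # hidden layer
    keras.layers.Dense(1, activation="linear"),  # output layer
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999),
    loss="mse",
)
print(model.count_params())  # 3*10+10 + 10*20+20 + 20*1+1 = 281
# Training would then use: model.fit(X_train, y_train, batch_size=10, epochs=200)
```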

Table 1

Architectures for all the ANN models

Model           Architecture
VERT_ANN_1MET   (10, 20, 1)
CLIM_ANN_1MET   (10, 25, 1)
SAVN_ANN_1MET   (10, 20, 1)
NAST_ANN_1MET   (5, 15, 1)
MG_ANN_1MET     (15, 25, 12, 5, 1)
VERT_ANN_2MET   (10, 60, 60, 80, 80, 1)
CLIM_ANN_2MET   (10, 60, 80, 60, 80, 1)
SAVN_ANN_2MET   (10, 60, 60, 80, 80, 1)
NAST_ANN_2MET   (10, 60, 60, 80, 80, 1)
MG_ANN_2MET     (10, 60, 60, 80, 80, 1)
VERT_ANN_3MET   (10, 60, 60, 80, 80, 1)
CLIM_ANN_3MET   (10, 60, 60, 80, 80, 1)
SAVN_ANN_3MET   (10, 60, 60, 80, 80, 1)
NAST_ANN_3MET   (10, 60, 60, 80, 80, 1)

Evaluation metrics for regression

ML models are trained to learn tasks from a database. These algorithms are then used to classify or perform regression in applications of any kind. The way to evaluate the performance of a model is through metrics, which allow verifying how well it predicts classes or regressions (Naser & Alavi 2021). The evaluation metrics for regression in the present research were RMSE, MAE, the coefficient of determination, and KGE, which are described as follows.

RMSE is obtained by calculating the square root of the mean of the squared differences between the true and predicted values. Among its notable characteristics, it is scale-dependent, the lower the value the better the performance, and it is sensitive to outliers (Naser & Alavi 2021). Its formula is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \quad (12)$$
MAE is calculated by obtaining the mean of the absolute value of the difference between the true values and the predicted values. It is used to compare series of different scales; the lower the value, the better the performance (Naser & Alavi 2021). It is expressed as follows:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right| \quad (13)$$
The coefficient of determination (R2) is one of the most common metrics; the closer it is to 1, the higher the performance of the model. It is obtained by subtracting from 1 the quotient of the sum of squared residuals and the total sum of squares (Naser & Alavi 2021; Gu et al. 2022). This metric is computed with the expression

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \quad (14)$$

where $y_i$ are the true values, $\hat{y}_i$ refers to the predicted values, $\bar{y}$ is the average of the true values, and N is the total number of values.
The KGE was selected because of its ability to more comprehensively capture the relationship between observed and estimated values, as compared with more traditional metrics such as RMSE or MAE. While the latter focus primarily on measuring absolute or mean square error, the KGE incorporates multiple statistical components that are particularly useful for identifying variations in predicted or estimated values. It is a coefficient used mainly to evaluate regression models in hydrology, comprising the following parameters: r, which refers to Pearson's correlation coefficient; $\alpha$, which is the ratio of the standard deviations of the estimated and observed values; and $\beta$, which is the ratio of their means. The optimal value is equal to 1; therefore, the closer the coefficient is to unity, the better the rainfall estimate in the basin, and if it is equal to or less than zero, this indicates that the mean of the observations is a better predictor than the established model (Gupta et al. 2009; Knoben et al. 2019). It is defined as follows:

$$\mathrm{KGE} = 1 - \sqrt{(r-1)^2 + (\alpha-1)^2 + (\beta-1)^2} \quad (15)$$
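The four metrics can be sketched in a few lines; the observed/simulated arrays are placeholders, and the KGE follows Equation (15) with $\alpha$ and $\beta$ as the ratios of standard deviations and means (simulated over observed):

```python
import numpy as np

def kge(obs, sim):
    # Equation (15): KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)
    r = np.corrcoef(obs, sim)[0, 1]
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = np.array([0.0, 2.5, 12.3, 0.4, 6.1])   # placeholder observed rainfall (mm)
sim = np.array([0.3, 2.1, 11.8, 0.9, 5.6])   # placeholder simulated rainfall (mm)

rmse = np.sqrt(np.mean((obs - sim) ** 2))                              # Eq. (12)
mae = np.mean(np.abs(obs - sim))                                       # Eq. (13)
r2 = 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)    # Eq. (14)
print(round(mae, 3), round(kge(obs, sim), 3))
```

A perfect simulation (sim equal to obs) gives r = $\alpha$ = $\beta$ = 1 and therefore KGE = 1, matching the optimal value described above.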

Several codes were developed in Python using the Keras and scikit-learn modules to apply ML algorithms to rainfall data.

First, we computed the Pearson correlation between the stations, shown in the diagram of Figure 4. The highest correlation was between the EST_NAST and EST_VERT stations, with 0.74, and the lowest correlations involved the EST_SAVN station, with a value of 0.50. According to this analysis, the input station was selected for each current station: in general, the EST_VERT station is used as the input variable, except in the case of the EST_VERT station itself, which used the EST_NAST station as input.
Figure 4

Pearson correlation matrix of stations within the basin.


Results of the first methodology

The results of the first methodology are presented in Table 2. For station EST_VERT, the best model was RF, since it obtained an R2 of 0.809, higher than the rest of the models, which did not exceed 0.7. RF also has an MAE of less than 1 mm, the lowest RMSE with 2.163 mm, and the best KGE with a value of 0.839. For station EST_CLIM, the best model was also RF, with the highest R2 of 0.770 and the highest KGE, with a value of 0.927, very close to unity and better than the rest of the models. As far as RMSE is concerned, RF still prevails as the best model with 2.492 mm, but in MAE the best model was SVR with an error of 1.086 mm of rainfall, not far from the RF model. For station EST_SAVN, the RF model was the best with 0.686 and 0.822 in R2 and KGE, respectively, while in its residuals it obtained an MAE of 1.955 mm and an RMSE of 4.609 mm. At station EST_NAST, the same happened as at EST_CLIM: in R2 (0.754), RMSE (3.511 mm), and KGE (0.934) the best model was RF, while in the MAE residual (1.061 mm) the best model was SVR. Finally, in the MG, the best model was RF, with the best performance in R2 (0.790), MAE (1.028 mm), RMSE (2.969 mm), and KGE (0.866).

Table 2

Results of the ML models implemented in the stations and the MG for the first methodology

Model  R2  MAE (mm)  RMSE (mm)  KGE  Station
MLR 0.621 1.325 3.042 0.770 EST_VERT 
RF 0.809 0.788 2.163 0.839 
SVR 0.672 1.026 2.832 0.715 
ANN 0.692 1.127 2.742 0.797 
MLR 0.457 1.833 3.833 0.757 EST_CLIM 
RF 0.770 1.117 2.492 0.927 
SVR 0.593 1.086 3.320 0.652 
ANN 0.608 1.343 3.256 0.891 
MLR 0.366 3.218 6.548 0.664 EST_SAVN 
RF 0.686 1.955 4.609 0.822 
SVR 0.380 2.222 6.477 0.535 
ANN 0.435 2.875 6.180 0.740 
MLR 0.574 1.818 4.616 0.820 EST_NAST 
RF 0.754 1.308 3.511 0.934 
SVR 0.692 1.061 3.926 0.831 
ANN 0.619 1.762 4.367 0.835 
MLR 0.466 2.059 4.739 0.711 MG 
RF 0.790 1.028 2.969 0.866 
SVR 0.550 1.360 4.351 0.647 
ANN 0.621 1.557 3.995 0.770 

Note. Bold values indicate the best metric for each station for all methods.

Figure 5 shows the comparisons of the results for the first methodology using radar charts. In Figure 5(a), we can analyze the results in R2, where the RF polygon covers the largest area, indicating that it was the best model for all stations. It is followed by the ANN models, with a significant decrease with respect to the RF model, and then by SVR, which trails ANN by a small margin and even outperformed it at the EST_NAST station; finally, the MLR model obtained the lowest results in this metric. The KGE comparisons are shown in Figure 5(b), where a change in the order of performance was identified: RF continues to be the best model and ANN follows closely in area, but the MLR models captured the phenomenon better than SVR. For the MAE residuals (Figure 5(c)), the models with the best results were RF, followed by SVR, then ANN, and finally MLR, with the lowest performance. Finally, the RMSE results (Figure 5(d)) differ from those in MAE: although RF continues to be the best for each station, the second-best performance was obtained by the ANN models, followed by SVR and finally the MLR models.
Figure 5

Radar chart with the results of the first methodology for all stations and the general model (MG).

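The "largest polygon area" reading of the radar charts can be made quantitative. Assuming equally spaced spokes (one per station), the enclosed area decomposes into triangles between consecutive spokes; the sketch below applies this to the R2 values of Table 2 and reproduces the model ranking described for Figure 5(a).

```python
import numpy as np

# R2 per station for the first methodology (values from Table 2),
# in the order EST_VERT, EST_CLIM, EST_SAVN, EST_NAST, MG.
r2 = {
    "MLR": [0.621, 0.457, 0.366, 0.574, 0.466],
    "RF":  [0.809, 0.770, 0.686, 0.754, 0.790],
    "SVR": [0.672, 0.593, 0.380, 0.692, 0.550],
    "ANN": [0.692, 0.608, 0.435, 0.619, 0.621],
}

def radar_area(radii):
    """Area of the radar polygon whose vertices sit at the given radii
    on equally spaced spokes: the sum of the triangle areas
    0.5 * r_i * r_{i+1} * sin(2*pi/n) between consecutive spokes."""
    r = np.asarray(radii, float)
    n = len(r)
    return 0.5 * np.sin(2 * np.pi / n) * np.sum(r * np.roll(r, -1))

areas = {model: radar_area(vals) for model, vals in r2.items()}
ranking = sorted(areas, key=areas.get, reverse=True)
# RF's polygon encloses the largest area and MLR's the smallest,
# matching the visual reading of Figure 5(a).
```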

Results of the second methodology

The results of the ML models implemented are presented in Table 3. At station EST_VERT, the best model was RF, with R2 and KGE values very close to unity and MAE and RMSE residuals below 1 mm; the ANN models come very close, with minimal differences in the remaining metrics. At station EST_CLIM, the best model in almost every metric was RF, with an R2 of 0.904, an MAE of 0.615 mm, and an RMSE of 1.612 mm, except for KGE, where the best model was ANN with 0.967. As for station EST_SAVN, the best model was again RF, although with slightly lower performance than at the previous stations: an R2 of 0.825, an MAE of 1.477 mm, and an RMSE of 3.440 mm; the best KGE at this station was obtained by the ANN model, with 0.925. At the EST_NAST station, the SVR model outperformed RF only in MAE, with 0.636 mm, while in R2, RMSE, and KGE the results were better with the RF model, at 0.896, 2.278 mm, and 0.961, respectively. Finally, the MG metrics were best with the RF model in R2 (0.879), MAE (0.920 mm), and RMSE (2.272 mm), while the highest KGE (0.934) was obtained by ANN.
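The per-station estimation scheme behind these tables (one station as the target, the remaining stations as inputs) can be sketched with ordinary least squares, the simplest of the four models (MLR). The rainfall matrix below is synthetic stand-in data, not measurements from the basin; RF, SVR, and ANN would be fitted on the same X and y with their respective estimators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy daily-rainfall matrix (mm): one column per station, purely
# synthetic stand-ins for EST_VERT, EST_CLIM, EST_SAVN, EST_NAST.
base = rng.gamma(shape=0.3, scale=8.0, size=(365, 1))
rain = np.clip(base + rng.normal(0.0, 1.0, size=(365, 4)), 0.0, None)

target_col = 0                              # station being estimated
X = np.delete(rain, target_col, axis=1)     # remaining stations as inputs
y = rain[:, target_col]                     # target station

# MLR via ordinary least squares: y ~ intercept + X @ weights.
X_aug = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
y_hat = X_aug @ coef

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot   # in-sample coefficient of determination
```

In the study itself, the data were split for training and evaluation and the remaining three model families were fitted alongside MLR; this sketch only shows the input/target arrangement.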

Table 3

Results of the ML models implemented in the stations and the MG for the second methodology

Station  | Model | R2     | MAE (mm) | RMSE (mm) | KGE
EST_VERT | MLR   | 0.620  | 1.393    | 3.046     | 0.816
EST_VERT | RF    | 0.969* | 0.323*   | 0.874*    | 0.971*
EST_VERT | SVR   | 0.880  | 0.630    | 1.714     | 0.907
EST_VERT | ANN   | 0.940  | 0.478    | 1.215     | 0.954
EST_CLIM | MLR   | 0.459  | 1.799    | 3.827     | 0.711
EST_CLIM | RF    | 0.904* | 0.615*   | 1.612*    | 0.947
EST_CLIM | SVR   | 0.861  | 0.662    | 1.940     | 0.875
EST_CLIM | ANN   | 0.898  | 0.700    | 1.659     | 0.967*
EST_SAVN | MLR   | 0.366  | 3.257    | 6.551     | 0.667
EST_SAVN | RF    | 0.825* | 1.477*   | 3.440*    | 0.909
EST_SAVN | SVR   | 0.599  | 1.563    | 5.209     | 0.727
EST_SAVN | ANN   | 0.774  | 1.686    | 3.908     | 0.925*
EST_NAST | MLR   | 0.576  | 1.890    | 4.606     | 0.796
EST_NAST | RF    | 0.896* | 0.721    | 2.278*    | 0.961*
EST_NAST | SVR   | 0.868  | 0.636*   | 2.569     | 0.892
EST_NAST | ANN   | 0.882  | 0.699    | 2.435     | 0.935
MG       | MLR   | 0.442  | 2.176    | 4.885     | 0.710
MG       | RF    | 0.879* | 0.920*   | 2.272*    | 0.927
MG       | SVR   | 0.697  | 1.030    | 3.598     | 0.771
MG       | ANN   | 0.839  | 0.966    | 2.625     | 0.934*

Note. Values marked with * are the best metric for each station.

In Figure 6, the results of the second methodology are plotted for the different stations and models. Figure 6(a) shows the R² results, where all vertices of the RF polygon lie close to unity, closely followed by the ANN models; in third place, with lower performance, are the SVR models, and finally the MLR models, which performed significantly below the rest. For KGE (Figure 6(b)), the best model was ANN, except at the EST_NAST station, where RF was superior; these two models show very similar values, followed by SVR, which drops markedly, especially at the EST_SAVN station, and finally MLR, with even lower values than the rest. Regarding the residuals, MAE is shown in Figure 6(c), where the results were very similar for the RF, ANN, and SVR models, ranked from best to worst in that order, while MLR significantly increased its residuals compared to the rest of the models. Finally, for RMSE (Figure 6(d)), the differences are clearer: the RF polygon lies inside those of the other models, representing the best results in this metric; the ANN models rank second, with a small difference from RF, followed by SVR, whose area grows slightly but still encloses the two best models, and finally MLR, which again produced very high residuals compared to the rest.
Figure 6

Radar chart with the results of the second methodology for all stations and the general model (MG).


Results of the third methodology

The third methodology produced lower performance than the first two, as presented in Table 4. At station EST_VERT, RF achieved the highest R2 (0.727) and the lowest RMSE (2.583 mm), while the lowest MAE (1.063 mm) corresponded to SVR and the highest KGE (0.880) to ANN. For station EST_CLIM, the best model in every metric was RF, with an R2 of 0.618, a KGE of 0.814, and MAE and RMSE of 1.439 and 3.216 mm of rainfall, respectively. Station EST_SAVN produced the highest residuals and the lowest performance coefficients: the RF model reached only 0.484 in R2, 0.747 in KGE, and 5.910 mm in RMSE, with the lowest MAE (2.399 mm) obtained by SVR. Finally, at station EST_NAST the best metrics were again split among models: SVR obtained the highest R2 (0.606) and the lowest RMSE (4.441 mm), RF the lowest MAE (1.510 mm), and ANN the highest KGE (0.860). Overall, when a single station is used as the predictor, the RF models still provide the most consistent rainfall estimates, although with a clear loss of accuracy relative to the first two methodologies.
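The predictor selection of the third methodology, keeping only the station with the highest Pearson correlation to the target, can be sketched as follows. The station series here are synthetic stand-ins, not the basin's records.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic daily rainfall (mm) for four hypothetical stations that
# share a common signal plus station-specific noise.
names = ["EST_VERT", "EST_CLIM", "EST_SAVN", "EST_NAST"]
common = rng.gamma(0.3, 8.0, size=365)
rain = np.column_stack(
    [np.clip(common + rng.normal(0.0, s, 365), 0.0, None)
     for s in (0.5, 1.0, 3.0, 0.8)]
)

target = 0  # index of the station to estimate

# Pearson r between the target and each remaining station.
corrs = {}
for j, name in enumerate(names):
    if j == target:
        continue
    corrs[name] = np.corrcoef(rain[:, target], rain[:, j])[0, 1]

# The single input variable for the third methodology.
best = max(corrs, key=corrs.get)
```

The chosen station's series would then serve as the lone input variable for the MLR, RF, SVR, and ANN models.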

Table 4

Results of the ML models implemented in the stations and the MG for the third methodology

Station  | Model | R2     | MAE (mm) | RMSE (mm) | KGE
EST_VERT | MLR   | 0.535  | 1.670    | 3.372     | 0.757
EST_VERT | RF    | 0.727* | 1.127    | 2.583*    | 0.871
EST_VERT | SVR   | 0.609  | 1.063*   | 3.091     | 0.714
EST_VERT | ANN   | 0.660  | 1.626    | 2.882     | 0.880*
EST_CLIM | MLR   | 0.399  | 1.988    | 4.033     | 0.673
EST_CLIM | RF    | 0.618* | 1.439*   | 3.216*    | 0.814*
EST_CLIM | SVR   | 0.390  | 1.505    | 4.062     | 0.587
EST_CLIM | ANN   | 0.385  | 2.239    | 4.080     | 0.727
EST_SAVN | MLR   | 0.326  | 3.379    | 6.752     | 0.635
EST_SAVN | RF    | 0.484* | 2.786    | 5.910*    | 0.747*
EST_SAVN | SVR   | 0.343  | 2.399*   | 6.665     | 0.519
EST_SAVN | ANN   | 0.330  | 3.394    | 6.731     | 0.694
EST_NAST | MLR   | 0.537  | 2.043    | 4.811     | 0.772
EST_NAST | RF    | 0.602  | 1.510*   | 4.463     | 0.732
EST_NAST | SVR   | 0.606* | 1.791    | 4.441*    | 0.784
EST_NAST | ANN   | 0.583  | 2.227    | 4.570     | 0.860*

Note. Values marked with * are the best metric for each station.

Comparison between the methodologies

There are significant differences among the three methodologies applied in this research. Figure 7 shows the comparisons for the R2 and KGE metrics. Figure 7(a) compares R2 at station EST_VERT, where the second methodology surpassed the first by an average of 15% and the third by an average of 22%, with RF as the best model. Figure 7(b) shows the R2 comparisons at station EST_CLIM, where the second methodology also outperformed the first by an average of 17% and the third by 33%, again with RF as the best model. Figure 7(c) shows the R2 differences at station EST_SAVN, where the second methodology again performed best, 17% above the first methodology and 27% above the third. Finally, at station EST_NAST (Figure 7(d)), as at the previous stations, the best methodology was the second one, 15% better than the first and 22% better than the third in R2. Turning to KGE, Figure 7(e) shows that at station EST_VERT the second methodology outperformed the first and third by 13 and 11%, respectively. For station EST_CLIM (Figure 7(f)), the second methodology also outperforms the first and third for the SVR, RF, and ANN models; the exception is MLR, for which the first methodology surpasses the second and the third. At station EST_SAVN (Figure 7(g)), the second methodology prevailed in KGE, with 12% over the first and 16% over the third. To conclude the KGE analysis, at station EST_NAST (Figure 7(h)), as at station EST_CLIM, the first methodology outperformed the others for the MLR model, but for the remaining models the second methodology was the best in this metric. It is worth noting that the RF model was the best estimator at all stations.
Figure 7

Comparison of the three methodologies in R2 and KGE.

Now we compare the residuals of the different methodologies, shown in Figure 8 for MAE and RMSE. For station EST_VERT (Figure 8(a)), the first methodology gave the lowest MAE for the MLR model, but for the remaining models the second methodology estimated rainfall best, averaging 0.5 mm below the first methodology and 0.67 mm below the third. For MAE at station EST_CLIM (Figure 8(b)), the same pattern occurs: the first methodology is best for MLR, but for the rest the second methodology remains preferable, averaging 0.5 and 0.85 mm below the first and third methodologies, respectively. Figure 8(c) shows these MAE differences for station EST_SAVN, where the pattern of the previous stations is repeated: for MLR the first methodology surpasses the second by a minimal margin, giving the lowest residuals, but for the remaining models the second methodology stays ahead, averaging 0.58 mm below the first methodology and 1 mm below the third. Finally, for MAE at station EST_NAST (Figure 8(d)), as at the other stations, the first methodology gives the smallest errors for the MLR model, but the second is still the best for the rest, with average reductions of 0.69 mm relative to the first methodology and 0.91 mm relative to the third. Again, it can be generalized that the RF models give the best results. For RMSE at station EST_VERT (Figure 8(e)), the second methodology was the best, averaging 1 mm below the first methodology and 1.27 mm below the third. For station EST_CLIM (Figure 8(f)), the second methodology prevailed in RMSE, averaging 0.96 mm below the first methodology and 1.60 mm below the third.
Figure 8(g) shows the RMSE comparison for station EST_SAVN, where the second methodology was also the best, averaging 1.18 mm below the first methodology and 1.74 mm below the third. Finally, for RMSE at station EST_NAST (Figure 8(h)), the second methodology again produced the lowest residuals, averaging 1.13 mm below the first methodology and 1.6 mm below the third, with RF being the best model for estimating rainfall overall.
Figure 8

Comparison of the three methodologies in MAE and RMSE.


A strength of this research lies in the fact that the estimates are based on rainfall measurements obtained through a pluviographic network within the basin. In many scientific studies, a limitation is the lack of sufficient climatological information about the study region; here, a point-specific estimation is possible using exclusively rainfall data, complemented by a technique to augment the dataset.

When comparing the results obtained in this research with those of Zahran et al. (2015), who estimated annual rainfall in an arid region of Jordan using an ANN, their evaluation metrics were limited to prediction error and accuracy error, with no comparison of model residuals. In our research, we conducted daily estimations using an ANN and employed additional evaluation metrics, such as MAE and RMSE, which enable an analysis of the residuals between measured and estimated values. Furthermore, the daily analysis allows for a better understanding of rainfall behavior.

Regarding studies such as Dutta & Gouthaman (2020), Endalie et al. (2022), Gu et al. (2022), Liyew & Melese (2021), and Yan et al. (2021), which investigated the behavior of meteorological variables such as radiation, wind speed, relative humidity, and rainfall, these authors noted that the lack of data could limit the application of ML techniques. Nevertheless, their prediction reports also used daily rainfall data similar to those in this study. The metrics they employed for rainfall prediction included the coefficient of determination (R²), MAE, and RMSE. The regions analyzed had arid, semi-arid, and humid climates, similar to those studied in this research. However, the difference lies in the fact that these authors focused on rainfall prediction, while we performed a spatial estimation, which represents a significant contribution of this study to the application of ML to this phenomenon. Additionally, the use of the KGE index as a metric strengthens model validation, an approach implemented in very few studies.

Another group of researchers, such as Ridwan et al. (2021), used daily precipitation data from a network of around 10 rain gauge stations, a more robust database for determining rainfall relationships. They used Pearson's coefficient, reporting values between 0.7 and 0.8 for their ANN model. In this study, coefficients between 0.8 and 0.9 were achieved using the RF and ANN models, results similar to those reported by these authors; however, a comparison of residuals could not be made because they normalized theirs. Liyew & Melese (2021) used a Pearson correlation matrix to extract the most representative features for rainfall, subsequently generating RF, MLR, and XGBoost models. Their metrics included MAE and RMSE, with reported values of 3.58 mm for MAE and 7.85 mm for RMSE. The results of this research improved on that estimation performance, achieving values of 0.323 mm for MAE and 0.874 mm for RMSE. Another contribution of this study is the understanding of the spatiotemporal behavior of rainfall through the generated models.

In this research, a temporal rainfall estimation was conducted for a semi-arid basin in Zacatecas, Mexico, applying various ML algorithms such as MLR, RF, SVR, and ANN, and evaluating regression metrics, including R², MAE, RMSE, and KGE. Three different methodologies were implemented: the first using the original data, estimating one station as the target station and the remaining stations within the basin as input variables; this was applied to the four individual stations and a general model (MG). The second methodology followed the same approach but incorporated data augmentation with Gaussian noise. The third methodology used a single target station and selected the station with the highest Pearson correlation among the remaining ones as the input variable.
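The Gaussian-noise augmentation used in the second methodology can be sketched as below. The noise level and number of copies are assumed tuning parameters for illustration, not the values used in the study; targets are kept unchanged, and the noisy inputs are clipped at zero since rainfall cannot be negative.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment_gaussian(X, y, n_copies=2, noise_std=0.1):
    """Data augmentation with Gaussian noise: each original sample is
    duplicated n_copies times with zero-mean noise added to the inputs,
    while the target values are left unchanged. Noisy inputs are
    clipped at zero because rainfall depths cannot be negative."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        noisy = np.clip(X + rng.normal(0.0, noise_std, X.shape), 0.0, None)
        X_parts.append(noisy)
        y_parts.append(y)
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Hypothetical samples: two input stations (columns) and one target.
X = np.array([[0.0, 2.5], [10.0, 4.2], [1.1, 0.0]])
y = np.array([0.4, 8.7, 0.9])
X_aug, y_aug = augment_gaussian(X, y)
# 3 original samples plus 2 noisy copies each give 9 training samples.
```

Tripling the sample count in this way is one plausible reading of why the second methodology's metrics improved over the first; the models see slightly perturbed versions of each rainy day instead of a single example.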

Based on the results obtained, it can be concluded that the second methodology and the RF models, along with ANN, are the best estimators of daily rainfall in the semi-arid ‘Chilitas’ basin. These models show values very close to unity in the R² and KGE coefficients, as well as low residuals in the MAE and RMSE indicators. The implementation of a data augmentation technique has proven effective in mitigating the problem of missing or insufficient data. The most outstanding model was the RF applied at the EST_VERT station using the second methodology, achieving an R² of 0.969, a KGE of 0.971, and residuals of 0.323 mm in MAE and 0.874 mm in RMSE.

In contrast, the third methodology is only applicable for estimating rainfall at the station with the highest correlation when there are not enough sensors, and its results are not as significant as those obtained with the other two methodologies. Daily rainfall predictions improve with a greater number of measurements throughout the basin, as relying on a single station limits accurate rainfall estimates at a specific point, even if the correlation is high. The EST_VERT and EST_NAST stations yielded better results than EST_CLIM and EST_SAVN, the latter of which produced results very different from the other three stations. This indicates that, at least for this basin, the more northern the station, the better the results.

In future work, a rainfall-runoff model and a rainfall-infiltration model will be developed to verify the behavior of rainfall and thus make better use of water resources in regions where rainfall is scarce, such as the state of Zacatecas and central-northern Mexico. The application of climate change models will also be considered to forecast the behavior of rain in a semi-arid basin.

The authors would like to thank the National Council of Humanities, Sciences, and Technologies (CONAHCYT, Spanish for Consejo Nacional de Humanidades, Ciencias y Tecnologías) for the economic support through the ‘Becas Nacional (Tradicional) 2021-3’ call for scholarships for the student José Armando Rodríguez Carrillo, which facilitated the development of this research. We also thank the state science council COZCyT for its support in writing this paper.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

References

Abd-Elmaboud, M. E., Saqr, A. M., El-Rawy, M., Al-Arifi, N. & Ezzeldin, R. (2024) Evaluation of groundwater potential using ANN-based mountain gazelle optimization: A framework to achieve SDGs in East El Oweinat, Egypt, Journal of Hydrology: Regional Studies, 52, 101703. https://doi.org/10.1016/j.ejrh.2024.101703
Afrifa, S., Zhang, T., Appiahene, P. & Varadarajan, V. (2022) Mathematical and machine learning models for groundwater level changes: A systematic review and bibliographic analysis, Future Internet, 14 (9), 259. https://doi.org/10.3390/fi14090259
Afrifa, S., Varadarajan, V., Appiahene, P., Zhang, T. & Domfeh, E. A. (2023a) Ensemble machine learning techniques for accurate and efficient detection of botnet attacks in connected computers, Eng, 4 (1), 650–664. https://doi.org/10.3390/eng4010039
Afrifa, S., Zhang, T., Zhao, X., Appiahene, P. & Yaw, M. S. (2023b) Climate change impact assessment on groundwater level changes: A study of hybrid model techniques, IET Signal Processing, 17 (6), e12227. https://doi.org/10.1049/sil2.12227
Anaraki, M. V., Achite, M., Farzin, S., Elshaboury, N., Al-Ansari, N. & Elkhrachy, I. (2023) Modeling of monthly rainfall–runoff using various machine learning techniques in Wadi Ouahrane Basin, Algeria, Water, 15 (20), 3576.
Barrera-Animas, A., Oyedele, L., Bilal, M., Akinosho, T., Davila Delgado, M. & Akanbi, L. (2021) Rainfall prediction: A comparative analysis of modern machine learning algorithms for time-series forecasting, Machine Learning with Applications, 7, 100204. https://doi.org/10.1016/j.mlwa.2021.100204
Basha, C. Z., Bhavana, N., Bhavya, P. & Sowmya, V. (2020) ‘Rainfall prediction using machine learning & deep learning techniques’, 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC), pp. 92–97. https://doi.org/10.1109/ICESC48915.2020.9155896
Benegui, C. & Ionescu, R. T. (2020) To augment or not to augment? Data augmentation in user identification based on motion sensors. In: Yang, H., Pasupa, K., Leung, A. C.-S., Kwok, J. T., Chan, J. H. & King, I. (eds.) Neural Information Processing. Cham: Springer International Publishing, pp. 822–831.
Carrasquilla, A., Chacon-Rodriguez, A., Núñez Montero, K., Gómez-Espinoza, O., Valverde, J. & Guerrero, M. (2016) Regresión lineal simple y múltiple: aplicación en la predicción de variables naturales relacionadas con el crecimiento microalgal, Revista Tecnología en Marcha, 29, 33. https://doi.org/10.18845/tm.v29i8.2983
Cutler, A., Cutler, D. R. & Stevens, J. R. (2012) Random forests. In: Ensemble Machine Learning: Methods and Applications. Springer, US, pp. 157–175.
Dávila-Hernández, S., González-Trinidad, J., Júnez, H., Bautista-Capetillo, C., Ávila, H., Escareño, J., Ortiz-Letechipia, J., Robles-Rovelo, C. O. & López-Baltazar, E. (2022) Effects of the digital elevation model and hydrological processing algorithms on the geomorphological parameterization, Water, 14, 2363. https://doi.org/10.3390/w14152363
DeVore, G. (2017) Computing the Z score and centiles for cross-sectional analysis: A practical approach, Journal of Ultrasound in Medicine, 36, 459–473. https://doi.org/10.7863/ultra.16.03025
Diez-Sierra, J. & del Jesus, M. (2020) Long-term rainfall prediction using atmospheric synoptic patterns in semi-arid climates with statistical and machine learning methods, Journal of Hydrology, 586, 124789. https://doi.org/10.1016/j.jhydrol.2020.124789
Dölling, O. & Varas, E. (2002) Artificial neural networks for streamflow prediction, Journal of Hydraulic Research, 40, 547–554. https://doi.org/10.1080/00221680209499899
Dongare, A. D., Kharde, R. R. & Kachare, A. D. (2012) Introduction to artificial neural network, International Journal of Engineering and Innovative Technology (IJEIT), 2 (1), 189–194.
Drucker, H., Burges, C., Kaufman, L., Smola, A. & Vapnik, V. (1996) Support vector regression machines, Advances in Neural Information Processing Systems, 28, 779–784.
Dutta, K. & Gouthaman, P. (2020) Rainfall prediction using machine learning and neural network, International Journal of Recent Technology and Engineering (IJRTE), 9, 1954–1961. https://doi.org/10.35940/ijrte.A2747.059120
Eberly, L. (2007) Multiple linear regression, Methods in Molecular Biology, 404, 165–187. https://doi.org/10.1007/978-1-59745-530-5_9
Endalie, D., Dagnaw, G. & Abebe, W. (2022) Deep learning model for daily rainfall prediction: Case study of Jimma, Ethiopia, Water Supply, 22 (3), 3448–3461. https://doi.org/10.2166/ws.2021.391
Feng, J. & Lu, S. (2019) Performance analysis of various activation functions in artificial neural networks, Journal of Physics: Conference Series, 1237, 022030. https://doi.org/10.1088/1742-6596/1237/2/022030
Fiallos, G. (2021) La correlación de Pearson y el proceso de regresión por el Método de Mínimos Cuadrados, Ciencia Latina Revista Científica Multidisciplinar, 5, 2491–2509. https://doi.org/10.37811/cl_rcm.v5i3.466
Gu, J., Liu, S., Zhou, Z., Chalov, S. & Zhuang, Q. (2022) A stacking ensemble learning model for monthly rainfall prediction in the Taihu Basin, China, Water, 14, 492. https://doi.org/10.3390/w14030492
Gupta, H., Kling, H., Yilmaz, K. & Martinez, G. (2009) Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, Journal of Hydrology, 377, 80–91. https://doi.org/10.1016/j.jhydrol.2009.08.003
Hasan, N., Nath, N. C. & Rasel, R. I. (2015) A support vector regression model for forecasting rainfall. In: 2015 2nd International Conference on Electrical Information and Communication Technologies (EICT), pp. 554–559. IEEE.
INEGI (2023) Continuo de Elevaciones Mexicano (CEM). Available at: https://www.inegi.org.mx/app/geo2/elevacionesmex/.
Kashiwao, T., Nakayama, K., Ando, S., Ikeda, K., Lee, M. & Bahadori, A. (2017) A neural network-based local rainfall prediction system using meteorological data on the internet: A case study using data from the Japan Meteorological Agency, Applied Soft Computing, 56, 317–330. https://doi.org/10.1016/j.asoc.2017.03.015
Knoben, W. J., Freer, J. E. & Woods, R. A. (2019) Technical note: Inherent benchmark or not? Comparing Nash–Sutcliffe and Kling–Gupta efficiency scores, Hydrology and Earth System Sciences, 23, 4323–4331. https://doi.org/10.5194/hess-23-4323-2019
Lalinde, J. D. H., Castro, F. E., Rodríguez, J. E., Rangel, J. G. C., Sierra, C. A. T., Torrado, M. K. A., Sierra, S. M. C. & Pirela, V. J. B. (2018) Sobre el uso adecuado del coeficiente de correlación de Pearson: definición, propiedades y suposiciones, Archivos Venezolanos de Farmacología y Terapéutica, 37 (5), 587–595.
Liyew, C. & Melese, H. (2021) Machine learning techniques to predict daily rainfall amount, Journal of Big Data, 8, 1–11. https://doi.org/10.1186/s40537-021-00545-4
Lu, L. (2024) In-depth analysis of artificial intelligence for climate change mitigation. Preprints. https://doi.org/10.20944/preprints202402.0022.v1
Mohseni, U. & Muskula, S. B. (2023) Rainfall–runoff modeling using artificial neural network – A case study of Purna sub-catchment of Upper Tapi Basin, India, Environmental Sciences Proceedings, 25 (1), 1.
Naser, M. Z. & Alavi, A. (2021) Error metrics and performance fitness indicators for artificial intelligence and machine learning in engineering and sciences, Architecture, Structures and Construction, 3 (4), 499–517. https://doi.org/10.1007/s44150-021-00015-8
Oswal, N. (2019) Predicting rainfall using machine learning techniques, arXiv preprint arXiv:1910.13827.
Otchere, D., Ganat, T., Gholami, R. & Ridha, S. (2021) Application of supervised machine learning paradigms in the prediction of petroleum reservoir properties: Comparative analysis of ANN and SVM models, Journal of Petroleum Science and Engineering, 200, 108182. https://doi.org/10.1016/j.petrol.2020.108182
Phạm, T., Le, L., Le, T.-T., Bui, K.-T., Vương, L., Ly, H.-B. & Prakash, I. (2020) Development of advanced artificial intelligence models for daily rainfall prediction, Atmospheric Research, 237, 104845. https://doi.org/10.1016/j.atmosres.2020.104845
Ridwan, W., Sapitang, M., Aziz, A., Kushiar, K., Ali Najah Ahmed, A.-M. & El-Shafie, A. (2021) Rainfall forecasting model using machine learning methods: Case study Terengganu, Malaysia, Ain Shams Engineering Journal, 12 (2), 1651–1663. https://doi.org/10.1016/j.asej.2020.09.011
Roy, B. & Singh, M. P. (2020) An empirical-based rainfall–runoff modelling using optimization technique, International Journal of River Basin Management, 18 (1), 49–67.
Saqr, A. M., Nasr, M., Fujii, M., Yoshimura, C. & Ibrahim, M. G. (2023) ‘Optimal solution for increasing groundwater pumping by integrating MODFLOW-USG and particle swarm optimization algorithm: A case study of Wadi El-Natrun, Egypt’, Proceedings of the 2022 12th International Conference on Environment Science and Engineering (ICESE 2022), pp. 59–73.
Sarasa-Cabezuelo, A. (2022) Prediction of rainfall in Australia using machine learning, Information, 13 (4), 163. https://doi.org/10.3390/info13040163
Sarmiento-Ramos, J. L. (2020) Aplicaciones de las redes neuronales y el deep learning a la ingeniería biomédica, Revista UIS Ingenierías, 19 (4), 1–18. https://doi.org/10.18273/revuin.v19n4-2020001
Sharma, S., Sharma, S. & Athaiya, A. (2017) Activation functions in neural networks, Towards Data Science, 6 (12), 310–316.
Sharma, P., Singh, S. & Sharma, S. (2022) Artificial neural network approach for hydrologic river flow time series forecasting, Agricultural Research, 11 (3), 465–476. https://doi.org/10.1007/s40003-021-00585-5
Sujay Raghavendra, N. & Deka, P. C. (2014) Support vector machine applications in the field of hydrology: A review, Applied Soft Computing, 19, 372–386. https://doi.org/10.1016/j.asoc.2014.02.002
Tranmer, M. & Elliot, M. (2008) Multiple linear regression. The Cathie Marsh Centre for Census and Survey Research (CCSR), 5 (5), 1–5.
Wang, F., Zhong, S. H., Peng, J., Jiang, J. & Liu, Y. (2018) Data augmentation for EEG-based emotion recognition with deep convolutional neural networks. In: MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, February 5–7, 2018, Proceedings, Part II, pp. 82–93. Springer International Publishing.
Welling, M. (2004) Support Vector Regression. Toronto, Canada: Department of Computer Science, University of Toronto.
World Meteorological Organization (WMO) (2005) WMO Laboratory Intercomparison of Rainfall Intensity Gauges: Trappes (France) - Genoa (Italy) - De Bilt (Netherlands). Geneva, Switzerland.
Yan, J., Xu, T., Yu, Y. & Xu, H. (2021) Rainfall forecast model based on the TabNet model, Water, 13, 1272. https://doi.org/10.3390/w13091272
Zahran, B., Mesleh, A., Matouq, M., Alheyasat, O. & Alwadan, T. (2015) Rainfall prediction in semi-arid regions in Jordan using back propagation neural networks, International Journal on Engineering Applications (IREA), 3, 162.
Zhang, F. & O'Donnell, L. J. (2020) Support vector regression. In: Machine Learning, pp. 123–140. Academic Press. https://doi.org/10.1016/B978-0-12-815739-8.00007-9
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).