ABSTRACT
Due to ongoing climate change, accurately predicting rainfall has become increasingly critical. This paper explores an approach that uses two machine learning algorithms, multilayer perceptron neural networks (MPNN) and random forest regressors (RFR), to enhance rainfall forecast accuracy. Historical daily weather data spanning more than 100 years (1913–2023) from the Agro Climate Research Centre at Tamil Nadu Agricultural University were used. The study focused on global climate drivers like the Southwest Monsoon (SWM) and Northeast Monsoon (NEM) over the Coimbatore region; this region receives more rainfall during NEM. Normalization and scaling techniques addressed missing values, preserving 70–85% of the original data for the training set. Results demonstrated that MPNN outperformed RFR for SWM, achieving an accuracy of 85.55%, while RFR outperformed MPNN for NEM, producing an accuracy of 86.50%. The coefficient of determination (R2) for predicted versus observed values was 0.8 for daily rainfall from 2020 to 2023.
HIGHLIGHTS
In this study, we aimed to assess multilayer perceptron neural networks (MPNN) and random forest regressors. Using expectation-maximization, we reduced the feature set from seven to five dimensions. Normalization helped identify centroids by selecting rows with distinct features and covering 90% of the original dataset for training. Future work will concentrate on minimizing error discrepancies using advanced neural network techniques.
MPNN and RFR are compared in terms of model performance, and both perform better than linear regression or time series models.
Expectation-maximization in SPSS is employed together with machine learning techniques.
Actual and predicted values of rainfall over SWM and NEM were estimated with MPNN and RFR.
R2, RMSE, and MAE were used as evaluation metrics to validate the accuracy.
INTRODUCTION
Rainfall has a major impact on many different facets of human life and culture. Although difficult, accurate rainfall forecasts are necessary for efficient reservoir operations, flood avoidance, and water resource management. Rainfall also affects urban operations, such as traffic and sewer systems; therefore, accurate data are essential for managing and planning cities (Nayak et al. 2013).
This approach enhances our comprehension of the physical mechanisms underlying hydrological processes (Young & Liu 2015). In recent years, there has been significant interest among various research groups in developing high-resolution gridded rainfall datasets. Artificial neural network (ANN) activity mostly mimics that of the human brain. An ANN does calculations, recognizes patterns, and performs other tasks. Different datasets are used to train the ANN after it has been built (Basha et al. 2020). Another technique in artificial intelligence (AI) is the expectation-maximization (EM) algorithm, a statistical technique for estimating parameters in models with hidden variables by iteratively maximizing the likelihood of the observed data.
Similarly, the EM algorithm produces predictions by refining the model parameters to best fit the observed data. Most research projects are exposed to enormous amounts of data, which require considerable operations and computations (Khalifeloo et al. 2015). Rainfall forecasting has been practiced for many years using conventional procedures that evaluate the relationship between rainfall, other meteorological parameters (such as pressure, temperature, wind speed, and humidity), and geographic coordinates (such as latitude and longitude) using statistical techniques. However, the intricacy of rainfall, including its nonlinearity, makes forecasting difficult (Barrera-Animas et al. 2022). Furthermore, this paper examined and evaluated how the randomness of the database influenced the ANN and random forest regressors' (RFR) capacity to forecast results using statistical analysis and the Monte Carlo method (Pham et al. 2020).
These models are commonly referred to as universal approximators for this reason. Thus, the application of these AI methods is crucial for advanced climate forecasting (Dash et al. 2018). Accurate rainfall forecasting is still a desirable goal, and understanding the intricate physical processes that produce rainfall is still a huge task (Abbot & Marohasy 2012). MPNN models often require more data to train effectively due to their complexity and the number of parameters. RFR models, being ensemble methods that aggregate multiple decision trees, might need less data to achieve good performance (Mohammed & Kora 2023).
Previous work applying machine learning (ML) models to rainfall prediction includes a study of the Ponnaiyar River Basin, which traverses the states of Tamil Nadu, Andhra Pradesh, and Karnataka and is crucial to irrigation and industrial development (Samhitha & Srikanth 2017). To predict rainfall at Coonoor one day ahead of time, several ANN-based rainfall forecasting models were created. ANN, which often has many parameters and layers, might benefit from a larger training set to capture complex patterns and prevent overfitting (Brownlee 2018). A time-delay neural network with a restricted number of input predictors produced a high correlation coefficient. Findings demonstrate that the suggested wavelet-based approach performed well in the ANN model (Renuga Devi et al. 2017).
These days, ML techniques are successfully used for problems involving the identification, clustering, or dimensionality reduction of massive collections of high-dimensional input data (Geethalakshmi et al. 2022). A multilayer perceptron neural network (MPNN) can model the behavior of such nonlinear processes and has been employed as a suitable forecasting tool to obtain trustworthy data on inflow into a reservoir (Nandakumar et al. 2021). Another study in the Coimbatore region used the average monthly inflow to the Bhavanisagar reservoir over 24 years (1989–2013) as historical data for training, testing, and validating the model (Suriya et al. 2021).
The primary objective of this study is to evaluate and compare the performance of MPNN and RFR models in accurately predicting seasonal rainfall patterns during the Southwest Monsoon (SWM) and Northeast Monsoon (NEM) periods. This analysis aims to enhance rainfall forecasting accuracy by leveraging historical weather data, ultimately supporting better agricultural planning and disaster management decision-making in semi-arid regions.
MATERIALS AND METHODS
The dataset discussed in this paper comprises monthly weather observations spanning more than a century, from 1913 to 2023. These observations were collected from the Agro Climate Research Centre at Tamil Nadu Agricultural University, Coimbatore-3. Coimbatore is located at 11°N latitude and 77°E longitude, at an altitude of 427 m above sea level. The district receives rainfall influenced by both the southwest and northeast monsoons. According to Köppen's climate classification, it features a typical semi-arid climate within the western zone of agro-climate regional planning. Following this period, wind speeds gradually decrease to a low of 3.7 km/h. Overall, the district is characterized by high average temperatures and relatively low humidity.
Data description
Daily rainfall data, maximum temperature, minimum temperature, and humidity for the Coimbatore district, spanning from 1913 to 2023, were collected from the Agro Climate Research Centre at Tamil Nadu Agricultural University, Coimbatore.
Data preprocessing
Normalization
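As a minimal sketch, assuming simple min-max scaling to the range [0, 1] (an assumption made for illustration, not necessarily the scaling applied in this study), a weather variable could be normalized as follows. The function name `min_max_normalize` and the sample temperatures are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical daily maximum temperatures in degrees Celsius.
tmax = [28.5, 31.0, 33.5, 30.0]
scaled = min_max_normalize(tmax)
print(scaled)
```

Scaling each predictor to a common range keeps variables with large numeric ranges from dominating those with small ranges when a neural network is trained.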
Development of ANN
The brain is made up of many neurons that are linked to one another by synapses. These networks of neurons are known as natural neural networks. An ANN is a mathematically simplified representation of a natural neural network: a directed graph in which each vertex represents a neuron and each edge represents a synapse. Since the 1940s, many ANN models have been suggested; however, the most popular ones are the radial basis function network and the MPNN (Lee et al. 1998).
The choice of ML algorithms for rainfall prediction typically depends on several key factors: the selected algorithms should be accurate, efficient, and robust. They must handle specific data types effectively, such as time-series data, and be computationally efficient for large datasets and real-time predictions. The correlation coefficient (CC) and mean square error are used as performance metrics to evaluate how well ANN models forecast future values (Abinayarajam et al. 2021).
Neural network architecture
The neurons of ANNs are connected by synapses. The values that are fed into this network of synapses and neurons determine the value that it generates. Different neuronal types use different processes to generate their output. Input neurons receive the input values and feed them into the network; an input neuron's output equals its input value. Hidden (sigmoid) neurons handle the computations carried out within the network, and the output values of the network are accessible at the output neurons (Dubey 2015). The split between training and testing data (Table 1) may vary depending on how much data is provided. For a big dataset, a lower percentage can be put aside for testing (e.g., 85:15) because there is sufficient data to train the model successfully. On the other hand, if there is less data, a greater percentage might be set aside for testing to guarantee the accuracy of the test findings (e.g., 70:30). The performance of the model can be more thoroughly assessed by adjusting the split ratios (Poornima et al. 2020). A smaller training set, such as 70%, can help detect overfitting.
Statistical analyses of data points or total rainfall (in mm) developed by ANN model
Year | Data points/total rainfall (mm/day) | Training (%) | Testing (%) |
---|---|---|---|
1913–1923 | 5,660 | 75 | 25 |
1923–1933 | 4,240 | 75 | 25 |
1933–1943 | 5,320 | 75 | 25 |
1943–1953 | 5,500 | 70 | 30 |
1953–1963 | 3,600 | 70 | 30 |
1963–1973 | 4,200 | 80 | 20 |
1973–1983 | 2,400 | 80 | 20 |
1983–1993 | 4,300 | 80 | 20 |
1993–2003 | 3,820 | 75 | 25 |
2003–2013 | 5,695 | 85 | 15 |
2013–2023 | 5,243 | 85 | 15 |
Note. Data points = total rainfall in mm/day.
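To make the training and testing proportions in the table concrete, the sketch below converts each percentage split into sample counts. It treats the "data points" column as a count of records, which is an assumption; the function name `split_counts` is hypothetical.

```python
# Decade, data points, and training percentage as listed in the table above.
DECADES = [
    ("1913-1923", 5660, 75),
    ("1923-1933", 4240, 75),
    ("1933-1943", 5320, 75),
    ("1943-1953", 5500, 70),
    ("1953-1963", 3600, 70),
    ("1963-1973", 4200, 80),
    ("1973-1983", 2400, 80),
    ("1983-1993", 4300, 80),
    ("1993-2003", 3820, 75),
    ("2003-2013", 5695, 85),
    ("2013-2023", 5243, 85),
]


def split_counts(n_points, train_pct):
    """Return (training, testing) sizes for a given training percentage."""
    n_train = round(n_points * train_pct / 100)
    return n_train, n_points - n_train


for decade, n_points, train_pct in DECADES:
    n_train, n_test = split_counts(n_points, train_pct)
    print(f"{decade}: {n_train} training, {n_test} testing")
```

Computing the testing size as the remainder guarantees that the two subsets always add up to the original number of records, even when the percentage does not divide evenly.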
Multilayer perceptron neural network
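The output of a single neuron is obtained from the weighted sum of its inputs:
$$u = \sum_{i=1}^{n} w_i x_i + b$$
where
$x_1, \dots, x_n$ are the inputs to the neuron.
$w_1, \dots, w_n$ are the synaptic weights of the neuron corresponding to each input.
$b$ is the bias term.
$z$ represents the neuron's output, typically determined by passing the linear combination through an activation function: $z = f(u)$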
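The neuron computation described above (a weighted sum of the inputs plus a bias, passed through an activation function) can be sketched as follows. The hyperbolic tangent is used as the default activation because it is the hidden-layer activation reported for the network in this paper; the input values, weights, and the function name `neuron_output` are hypothetical.

```python
import math


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Compute z = f(u), where u is the weighted sum of inputs plus the bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    u = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(u)


# Hypothetical example: two inputs, two weights, and a bias term.
z = neuron_output([0.5, -1.0], [0.8, 0.2], bias=0.1)
print(z)  # tanh(0.3), roughly 0.291
```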
Random forest regressor
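A random forest regressor fits many decision trees, each to a bootstrap resample of the training data, and averages their predictions. The sketch below illustrates only that averaging idea and is not the configuration used in this study: it substitutes one-split regression trees (stumps) for full trees, and with a single predictor it reduces to plain bagging because there are no features to subsample. All names and data values are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: predict the overall mean everywhere.
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Fit each stump to a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees in the ensemble."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical predictor and response values.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
forest = fit_forest(xs, ys)
print(predict_forest(forest, 1.5), predict_forest(forest, 7.5))
```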











Implementation of the EM algorithm in SPSS
Using the EM algorithm in SPSS together with the inbuilt MPNN and RFR procedures, this paper estimates the actual and predicted rainfall (mm) values for the SWM from 1913 to 2023, with nearly 400–850 mm of rainfall estimated per decade. In SPSS, the EM algorithm is commonly used for data imputation and clustering analysis. The EM algorithm operates in two main steps. In the expectation step (E-step), the algorithm estimates the missing or hidden data based on the observed data and the current parameter estimates; it computes the expected value of the log-likelihood function, assuming the current parameter estimates are correct. In the maximization step (M-step), the algorithm updates the parameter estimates by maximizing the expected log-likelihood using the values computed in the E-step.
This step refines the model parameters to fit the data better. For the NEM, actual and predicted rainfall (mm) values from 1913 to 2023 range from nearly 900 to 1,500 mm of rainfall estimated per decade. The best-fitting model was selected based on the root mean square error (RMSE) and R2 values. From 1913 to 2023, the RMSE and R2 values using MPNN for the SWM were 0.814 mm and 0.87, respectively. For NEM, the RMSE and R2 values from 1913 to 2023 using RFR were 0.784 mm and 0.89, respectively. These results indicate that the Coimbatore region receives more rainfall during NEM. In comparison, MPNN produced the higher accuracy for SWM (85.55%), whereas RFR produced the higher accuracy for NEM (86.50%).
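The alternation between the two EM steps described in this section can be illustrated with a small, self-contained sketch. It fits a two-component one-dimensional Gaussian mixture, which is a textbook use of EM chosen here for illustration; it is not the SPSS procedure and does not reproduce the analysis in this paper. The data values and function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with the EM algorithm."""
    n = len(data)
    # Crude initialization: means at the data extremes, shared variance.
    means = [min(data), max(data)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate parameters from the weighted data.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances


# Hypothetical data drawn around two distinct levels.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

Each pass through the loop cannot decrease the likelihood of the observed data, which is why the parameter estimates settle after a modest number of iterations on well-separated data like this.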
(a)–(k) Comparison of actual and predicted decadal rainfall during the SWM from 1913 to 2023 (mm).
(a)–(k) Comparison of actual and predicted decadal rainfall during the NEM from 1913 to 2023 (mm).
Table 2 displays the case processing summary for the MPNN and RFR models. The variation in training and testing dataset proportions between the algorithms (72.30% training and 27.70% testing for MPNN; 73.50% training and 26.50% testing for RFR) can be attributed to several factors, mainly that the algorithms differ significantly in complexity. The proportions might be chosen based on empirical results or previous studies that have shown optimal performance for each algorithm with those specific splits (Shalev-Shwartz & Ben-David 2014; Salehin et al. 2020). After being fitted to actual data, the models are used to forecast one step ahead of the observed data series. Table 3 shows the ANN information.
Multilayer perceptron neural network and RFR case summary
Case processing summary
Sample | N | Percent (1) | Percent (2) |
---|---|---|---|
Training | 313 | 72.30% | 73.50% |
Testing | 120 | 27.70% | 26.50% |
Valid | 433 | 100.00% | 100.00% |
Excluded | 0 | | |
Total | 433 | | |
Multilayer perceptron neural network information
Network information
Layer | Parameter | Value |
---|---|---|
Input layer | Factors | Tmax, Tmin |
Input layer | Number of units | 3 |
Hidden layer(s) | Number of hidden layers | 1 |
Hidden layer(s) | Number of units in hidden layer 1^a | 3 |
Hidden layer(s) | Activation function | Hyperbolic tangent |
Output layer | Dependent variables | Rainfall |
Output layer | Number of units | 3 |
Output layer | Activation function | Softmax |
Output layer | Error function | Cross-entropy |
Excluding the bias unit.
a The number of units in hidden layer 1 was determined based on empirical tuning, optimizing model performance using cross-validation.
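The network described above (three input units, one hidden layer of three hyperbolic tangent units, and a three-unit softmax output scored with cross-entropy) can be sketched as a single forward pass. All weights, biases, inputs, and function names below are hypothetical and chosen only to show the computation; they are not the fitted SPSS parameters.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Cross-entropy error for one case whose true class is true_index."""
    return -math.log(probabilities[true_index])


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """One forward pass: tanh hidden layer followed by a softmax output layer."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(output_w, output_b)
    ]
    return softmax(scores)


# Hypothetical parameters for a 3-3-3 network.
hidden_w = [[0.2, -0.1, 0.4], [0.5, 0.3, -0.2], [-0.3, 0.1, 0.2]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.6, -0.4, 0.1], [-0.2, 0.5, 0.3], [0.1, 0.2, -0.5]]
output_b = [0.0, 0.0, 0.0]

probs = forward([1.0, 0.0, 1.0], hidden_w, hidden_b, output_w, output_b)
print(probs, cross_entropy(probs, true_index=0))
```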
RESULTS AND DISCUSSION
Temporal rainfall distribution
According to a temporal analysis (1913–2023), the rainfall distribution in Coimbatore is not uniform across the period under study. Over this time scale, seasonality was most noticeable during the pre-monsoon (April and May), the Southwest Monsoon (June–September), and the Northeast Monsoon (October and November).
For the decadal average SWM rainfall, the highest actual rainfall (223.09 mm) occurred in 1953–1963, with MPNN predicting 253.09 mm and RFR 220.9 mm. The lowest actual SWM rainfall (159.94 mm) was recorded in 1943–1953, with MPNN predicting 160.16 mm and RFR 124.5 mm.
(a), (b) Decadal average of Tamil Nadu Agricultural University (TNAU) manually observed data (mm) and MPNN and RFR model-predicted rainfall data (mm) for both SWM and NEM.
Selection of the best algorithm
After employing MPNN and RFR to analyze all of the data, this work obtained accuracies of 85.55 and 86.50%, respectively. MPNN exhibited higher accuracy for SWM, and RFR exhibited higher accuracy for NEM. Our projections are based on data from the Coimbatore region, which receives more rainfall during NEM.
Overall model summary of the hidden layer activation function
Model summary
Sample | Measure | Value |
---|---|---|
Training | Cross-entropy error | 125.448 |
Training | Percent incorrect predictions | 14.10% |
Training | Stopping rule used | 1 consecutive step(s) with no decrease in error |
Training | Training time | 00:00.1 |
Testing | Cross-entropy error | 56.811 |
Testing | Percent incorrect predictions | 18.30% |
Dependent variable: VAR00001
a. Error computations are based on the testing sample
The percentage correctness report of observed and predicted values of rainfall (mm) used in MPNN model prediction
Classification
Sample | Observed | Predicted 0 | Predicted 1 | Percent correct |
---|---|---|---|---|
Training | 0 | 221 | 13 | 94.40% |
Training | 1 | 30 | 48 | 61.50% |
Training | Overall percent | 80.20% | 19.80% | 85.55% |
Testing | 0 | 82 | 6 | 93.20% |
Testing | 1 | 16 | 16 | 50.00% |
Testing | Overall percent | 81.70% | 18.30% | 81.70% |
Dependent variable: VAR00001 (rainfall)
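The "percent correct" entries in a classification table of this kind follow directly from the cell counts. As a worked illustration, the sketch below recomputes them for the MPNN testing sample (82, 6, 16, and 16 cases); the function name `classification_summary` and the dictionary keys are hypothetical.

```python
def classification_summary(n00, n01, n10, n11):
    """Percent correct per observed class and overall.

    n00: observed 0, predicted 0    n01: observed 0, predicted 1
    n10: observed 1, predicted 0    n11: observed 1, predicted 1
    """
    total = n00 + n01 + n10 + n11
    return {
        "correct_class_0": 100 * n00 / (n00 + n01),
        "correct_class_1": 100 * n11 / (n10 + n11),
        "overall_correct": 100 * (n00 + n11) / total,
        "predicted_0_share": 100 * (n00 + n10) / total,
        "predicted_1_share": 100 * (n01 + n11) / total,
    }


# Cell counts of the MPNN testing sample in the table above.
summary = classification_summary(n00=82, n01=6, n10=16, n11=16)
for name, value in summary.items():
    print(f"{name}: {value:.2f}%")
```

The overall value of about 81.7% is the share of testing cases on the diagonal of the table, and its complement matches the 18.30% of incorrect predictions reported for the testing sample.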
The percentage correctness report of observed and predicted values of rainfall (mm) with RFR model prediction
Classification
Sample | Observed | Predicted 0 | Predicted 1 | Percent correct |
---|---|---|---|---|
Training | 0 | 321 | 17 | 93.45% |
Training | 1 | 50 | 68 | 60.45% |
Training | Overall percent | 81.25% | 20.90% | 86.50% |
Testing | 0 | 102 | 16 | 94.40% |
Testing | 1 | 28 | 26 | 55.00% |
Testing | Overall percent | 83.70% | 19.40% | 83.70% |
Dependent variable: VAR00001 (rainfall)
Descriptive summary of the output
Classification/Units | SWM (1913–2023) | NEM (1913–2023) |
---|---|---|
Algorithm | MPNN | RFR |
N | 20 | 20 |
Mean (mm) | 329.1667 | 362.5 |
SD (mm) | 3.156 | 2.154 |
Standard error of the mean (mm) | 0.881 | 0.567 |
Mean absolute error (mm) | 0.751 | 0.834 |
RMSE (mm) | 0.814 | 0.784 |
R^2 | 0.87 | 0.89 |
Accuracy (%) | 85.55 | 86.50 |
Training (%) | 72.30 | 73.50 |
Testing (%) | 27.70 | 26.50 |
Precision (%) | 82 | 80 |
Note. mm, millimeters; N, number of parameters; SD, standard deviation; RMSE, root mean square error.
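The error measures reported in this table follow standard definitions, which the sketch below implements. The observed and predicted values are hypothetical and serve only to show the calculation; they are not taken from the study's data.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted rainfall values (mm).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

print(mae(observed, predicted))        # 0.5
print(rmse(observed, predicted))       # 0.5
print(r_squared(observed, predicted))  # 0.95
```

MAE and RMSE carry the units of the data (mm here), whereas R2 is dimensionless.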
CONCLUSION
This study aimed to assess MPNN and RFR algorithms for forecasting monthly rainfall from 1913 to 2023. Using EM, the feature set was reduced from seven to five dimensions. Normalization helped to identify centroids by selecting rows with distinct features, covering 70–85% of the original dataset for training. MPNN exhibited a higher accuracy of 85.55% for SWM compared to RFR. For NEM, RFR showed a higher accuracy of 86.50% than MPNN. The coefficient of determination (R2) between predicted and observed rainfall values from 2020 to 2023 was 0.8, indicating good agreement between predictions and observations. Future work will concentrate on minimizing error discrepancies using advanced neural network techniques and on integrating satellite data with ML for accurate rainfall predictions.
ACKNOWLEDGEMENT
The opinions and perspectives presented in this research paper/article are those of the authors and may not necessarily represent the views of their respective organizations.
DISCLAIMER
The contents and viewpoints articulated in this study are the authors' own and do not necessarily represent the opinions of their affiliated organizations.
AUTHORS CONTRIBUTION
The authors confirm their contribution to the paper as follows: study conception and design: Oviyakandasamy, N. Maragatham; data collection: N. Maragatham; analysis and interpretation of results: Oviyakandasamy, E. Somasundaram, R. Ravikumar; draft manuscript preparation: Oviyakandasamy, Balajikannan. All authors reviewed the results and approved the final version of the manuscript.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.