Abstract
Precipitation is one of the driving forces of the water cycle and is vital for understanding its components, such as surface runoff, soil moisture, and evapotranspiration. However, missing precipitation data at observation stations are an obstacle to improving the accuracy and efficiency of hydrological analysis. To address this issue, we developed a machine learning-based precipitation data recovery tool that detects and predicts missing precipitation data at observation stations. This study investigated 30 weather stations in South Korea, evaluating the applicability of two machine learning algorithms (artificial neural network, ANN, and random forest, RF) for precipitation data recovery using environmental variables such as air pressure, temperature, humidity, and wind speed. The proposed models detected the occurrence of missing precipitation with an accuracy of 80%. In addition, the predicted precipitation amounts had a correlation coefficient ranging from 0.5 to 0.7 and R² values of 0.53. Although both algorithms performed similarly in estimating precipitation, ANN performed slightly better. Based on these results, we expect that machine learning algorithms can contribute to improving hydrological modeling performance by recovering missing precipitation data at observation stations.
HIGHLIGHTS
Missing precipitation data is recovered using ANN and RF algorithms.
Air humidity and air pressure have a high correlation with precipitation occurrence.
Both models perform well in detecting precipitation occurrence.
The ANN model performs better than the RF model in recovering daily precipitation data in South Korea.
INTRODUCTION
Precipitation is a key element of the hydrological cycle and is used as input data for various applications, such as flood prediction, agricultural management, and drought analysis (Shin & Salas 2000; Lobell et al. 2007; Wang et al. 2019). These applications require accurate precipitation estimates and sufficient spatiotemporal coverage for hydrological modeling. However, there are areas where precipitation data are not available because rainfall gauges are absent or measuring equipment has malfunctioned. Such missing precipitation data may lead to significant uncertainty in hydrological modeling, particularly in large-scale modeling (Renard et al. 2010; Song et al. 2011).
Beven (2012) emphasized the importance of accurate rainfall data in rainfall-runoff modeling. Many researchers have studied the effect of precipitation data errors on runoff models (Kuczera et al. 2006; Gabellani et al. 2007; Arnaud et al. 2011; Rico-Ramirez et al. 2015). For example, Kobold & Sušelj (2005) studied the sensitivity of a rainfall-runoff model to precipitation data for the Savinja basin, Slovenia, and found that uncertainty in rainfall causes a significant error in peak discharge. Other studies have addressed the importance of accurate precipitation data in various types of hydrological modeling. Peters-Lidard et al. (2008) showed that the accuracy of precipitation is a decisive factor influencing the uncertainty of soil texture and hydraulic properties estimation. Also, Shen et al. (2015) confirmed that the error in rainfall measurement significantly affects the uncertainty of hydrologic modeling and simulation of nonpoint source pollution in the Daning River watershed in China.
Many traditional methods exist for estimating missing precipitation data. These interpolation methods can be classified into two categories: deterministic and geostatistical. The inverse distance weighting (IDW) method, one of the deterministic interpolation methods, enables a simple calculation by weighting known observations according to their distance from the target location (Tomczak 1998). However, this simple deterministic approach relies only on the values of neighboring observations. Therefore, precipitation estimates from IDW can be inaccurate depending on the density of the stations and are strongly affected by outliers (Li et al. 2018). In addition, this method ignores the spatial dependency between the data.
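The IDW calculation described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in any of the cited studies: the function name `idw_estimate`, the exponent `power=2.0`, and the gauge coordinates and values are all assumptions made for the example.

```python
import math

def idw_estimate(target, stations, power=2.0):
    """Estimate a value at `target` (x, y) from (x, y, value) stations by IDW."""
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        distance = math.hypot(target[0] - x, target[1] - y)
        if distance == 0.0:
            return value  # the target coincides with a station
        weight = 1.0 / distance ** power  # closer stations get larger weights
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical daily precipitation (mm) at three neighboring gauges.
gauges = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0)]
estimate = idw_estimate((1.0, 1.0), gauges)
```

Because the estimate depends only on neighboring values and distances, a single erroneous gauge reading close to the target dominates the result, which is the outlier sensitivity noted above.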
On the other hand, kriging, a geostatistical method, accounts for spatial dependence by using a variogram to interpolate data. For example, ordinary kriging resembles IDW in that it assigns weights to known values, but it considers the spatial dependence between data when calculating the weights (Lloyd 2005). Also, multivariate kriging, such as simple kriging with varying local means, can introduce explanatory variables to estimate ungauged values by considering known varying means of the external variables (Goovaerts 1997).
Recently, with improvements in computing algorithms, hardware, and cloud data systems, data-driven models such as machine learning techniques have been widely used in various fields of study (Cho et al. 2019; Kim et al. 2019; Xiang et al. 2020; Han et al. 2021). Because machine learning learns relationships from data, it generally requires massive datasets to produce high-quality modeling results. As observation systems have developed rapidly, the use of machine learning has expanded dramatically. Machine learning approaches have been used in many fields of hydrological study, such as predictions of runoff (Kumar et al. 2019; Adnan et al. 2021), precipitation (Agrawal et al. 2019; Nelson et al. 2021), water quality (Ahmed et al. 2019; Bui et al. 2020), and river stages (Choi et al. 2020; Zhao et al. 2020). In addition, machine learning techniques are an effective method for ‘big’ data management. They can be used to fill in missing values in datasets and reduce the uncertainty inherent in them. In other words, machine learning can compensate for errors and gaps in the actual observation system.
Numerous studies have recovered missing data such as precipitation (Lee et al. 2014; Chivers et al. 2020), streamflow (Kim et al. 2015; Arriagada et al. 2021; Hamzah et al. 2021), temperature (Wang et al. 2021), and soil properties (Ramcharan et al. 2018). These studies applied various machine learning algorithms, including artificial neural networks (ANN), support vector machines (SVM), random forest (RF), and k-nearest neighbors, alongside traditional statistical methods. They showed that machine learning-based approaches recover missing data with higher quality than the traditional methods. Although machine learning-based methods have the limitation that users cannot directly intervene in how the model arrives at its output, they have the advantage of providing high-quality results. Thus, machine learning algorithms can be effective alternatives to traditional statistical methods for recovering missing parts of ‘big’ datasets in many fields of study.
Several studies have filled gaps in precipitation data using machine learning algorithms. For example, Portuguez-Maurtua et al. (2022) used a regression model and a machine learning algorithm to fill gaps in daily precipitation data in the central part of Peru. They reported that the machine learning technique has sufficient ability to estimate missing precipitation. Bellido-Jiménez et al. (2021) assessed several machine learning algorithms to gap-fill precipitation data in the southern area of Spain. They showed that machine learning algorithms perform better than simple linear regression estimation methods. However, most previous studies have focused only on estimating the amount of precipitation, without considering the detection of precipitation occurrence. In addition, they are limited in that they did not consider a diversity of input data for the machine learning algorithms.
Thus, this study aims to develop a machine learning-based model for recovering missing precipitation data using precipitation-related meteorological variables. The model has two main modules: (1) detecting precipitation occurrence using ‘classification’ machine learning algorithms, and (2) predicting the precipitation rate using ‘regression’ machine learning algorithms. We used two machine learning algorithms, ANN and RF, which are widely used in classification and regression tasks, and compared the reconstructed results with actual precipitation data obtained from 30 weather stations in South Korea. By reconstructing both the occurrence and the amount of precipitation for the missing parts of the given data, the model is expected to contribute to reducing the uncertainty of precipitation data.
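One plausible way to chain the two modules is sketched below. The text does not state how the classification and regression outputs are combined, so the gating logic (predict a rate only for days classified as wet) is an assumption, as are the function names and the placeholder rules standing in for the trained ANN and RF models.

```python
def recover_precipitation(records, detect_occurrence, predict_rate):
    """Fill each missing day: 0 mm if no rain is detected, else the predicted rate."""
    recovered = []
    for features in records:
        if detect_occurrence(features):
            # Rates cannot be negative, so clip the regression output at zero.
            recovered.append(max(0.0, predict_rate(features)))
        else:
            recovered.append(0.0)
    return recovered

# Placeholder models (NOT the trained ANN/RF): simple rules on humidity only.
def toy_classifier(features):
    return features["humidity"] >= 80.0

def toy_regressor(features):
    return 0.5 * (features["humidity"] - 70.0)

# Hypothetical records for three days with missing precipitation.
missing_days = [{"humidity": 90.0}, {"humidity": 60.0}, {"humidity": 80.0}]
filled = recover_precipitation(missing_days, toy_classifier, toy_regressor)
```

Passing the two models in as callables keeps the classifier and the regressor interchangeable, so either algorithm could be plugged into either module.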
MATERIALS AND METHODS
Study area
Datasets
Type | Variables |
---|---|
Type I | Humidity |
Type II | Humidity + air pressure |
Type III | Humidity + air pressure + air temperature |
Type IV | Humidity + air pressure + air temperature + wind speed |
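The four input combinations in the table can be expressed as a small lookup, which makes it straightforward to train and compare one model per type. The key names (`humidity`, `air_pressure`, and so on), the helper `select_features`, and the sample record are hypothetical; only the grouping of variables comes from the table.

```python
# Input variable combinations, following the Type I to Type IV table.
INPUT_TYPES = {
    "Type I": ["humidity"],
    "Type II": ["humidity", "air_pressure"],
    "Type III": ["humidity", "air_pressure", "air_temperature"],
    "Type IV": ["humidity", "air_pressure", "air_temperature", "wind_speed"],
}

def select_features(record, input_type):
    """Return the feature vector for one daily record under the given input type."""
    return [record[name] for name in INPUT_TYPES[input_type]]

# Hypothetical daily record from one weather station.
record = {
    "humidity": 85.0,         # %
    "air_pressure": 1009.2,   # hPa
    "air_temperature": 21.5,  # degrees Celsius
    "wind_speed": 3.1,        # m/s
}
type_iii_vector = select_features(record, "Type III")
```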
Machine learning algorithms
In this study, two machine learning algorithms were used to develop a prediction model for reconstructing missing values in the daily precipitation dataset. Machine learning tasks can be broadly classified into two types: classification and regression (or prediction) (Kim et al. 2019). Machine learning algorithms for classification and regression include ANN (McCulloch & Pitts 1943), RF (Breiman 2001), decision tree (DT; Quinlan 1986), SVM (Vapnik 1999), and deep learning-based models. This study used the ANN and RF models, which are easy to operate and are representative machine learning algorithms widely used in various fields. In addition, both algorithms can perform ‘classification’ as well as ‘regression’, the two tasks required in this study.
Random forest
The RF model is based on bootstrap aggregation (i.e., bagging) of multiple DT models. Each DT model is trained on a sub-dataset randomly selected from the total sample dataset and produces its own result based on various parameters; the RF model then combines the prediction (or classification) results of all trees through a majority voting process to produce the final answer. The RF model has shown high-quality results on a wide range of modeling tasks in many research fields. In addition, one of the advantages of the RF model is that it can provide the importance of each variable to the target. More details about the determination of importance can be found in Louppe et al. (2013).
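The bagging and majority-voting idea can be illustrated with a deliberately simplified sketch. It is not a full random forest and not the model configured in this study: each "tree" here is a one-split decision stump on a single randomly chosen feature, and the training data, function names, and `seed` are assumptions made for the example.

```python
import random
from collections import Counter

def stump_predict(stump, x):
    """Classify x as wet (1) or dry (0) with a single-threshold rule."""
    feature_index, threshold, predict_wet_above = stump
    is_above = x[feature_index] > threshold
    return int(is_above == predict_wet_above)

def fit_stump(samples, feature_index):
    """Pick the threshold and direction with the fewest training errors."""
    best_stump, best_errors = None, None
    for threshold in sorted({x[feature_index] for x, _ in samples}):
        for predict_wet_above in (True, False):
            stump = (feature_index, threshold, predict_wet_above)
            errors = sum(stump_predict(stump, x) != y for x, y in samples)
            if best_errors is None or errors < best_errors:
                best_stump, best_errors = stump, errors
    return best_stump

def fit_bagged_stumps(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample and a random feature."""
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    forest = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(samples) for _ in samples]  # sample with replacement
        forest.append(fit_stump(bootstrap, rng.randrange(n_features)))
    return forest

def predict_majority(forest, x):
    """Combine the individual votes by majority."""
    votes = Counter(stump_predict(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical (humidity %, air pressure hPa) records labeled wet (1) or dry (0).
training = [
    ((90.0, 1002.0), 1), ((85.0, 1004.0), 1), ((95.0, 1000.0), 1), ((88.0, 1003.0), 1),
    ((50.0, 1018.0), 0), ((60.0, 1015.0), 0), ((55.0, 1020.0), 0), ((45.0, 1016.0), 0),
]
forest = fit_bagged_stumps(training)
```

An odd number of stumps avoids tied votes in this two-class setting. A real RF grows full decision trees and also reports variable importance, which this sketch omits.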
Artificial neural network
McCulloch & Pitts (1943) proposed the initial ANN model, an algorithm now widely used for classification and regression problems. The ANN model refers to a computer algorithm based on the neural network structure of the human brain (Choi et al. 2020; Jung et al. 2021). Because the ANN model is a practical algorithm for non-linear modeling, it is a valuable tool for solving various problems such as regression, classification, pattern recognition, machine translation, and decision-making. In hydrological studies, the ANN has been used to model rainfall-runoff, water quality, and groundwater flow (Tanty & Desmukh 2015).
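To make the non-linear mapping concrete, the sketch below computes the forward pass of a generic network with one hidden layer and sigmoid activations. The architecture, the weights, and the scaled inputs are illustrative assumptions; the text does not specify the network structure used in this study.

```python
import math

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> sigmoid hidden layer -> single sigmoid output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical scaled inputs (humidity, air pressure) and made-up weights.
scaled_inputs = [0.9, 0.2]
probability = forward(
    scaled_inputs,
    hidden_weights=[[2.0, -1.5], [-1.0, 0.5]],
    hidden_biases=[0.1, -0.2],
    output_weights=[1.8, -1.2],
    output_bias=0.0,
)
```

For classification, the output can be read as a probability of precipitation occurrence; for regression, the final sigmoid would typically be replaced by a linear output. Training adjusts the weights to reduce the error between outputs and observations.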
Model development
Metrics for evaluation
Actual | Predicted positive | Predicted negative |
---|---|---|
Positive | True positive (TP); hit | False negative (FN); miss |
Negative | False positive (FP); false alarm | True negative (TN); correct rejection |
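The four cells of the confusion matrix can be counted directly from paired observations and predictions, and accuracy follows as (TP + TN) divided by the total. The sketch below uses made-up labels; the probability of detection, TP / (TP + FN), is a commonly used companion score included here as an assumption, since the metric definitions are not reproduced in this text.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN for binary wet (1) / dry (0) labels."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1  # hit
        elif a and not p:
            counts["FN"] += 1  # miss
        elif not a and p:
            counts["FP"] += 1  # false alarm
        else:
            counts["TN"] += 1  # correct rejection
    return counts

def accuracy(counts):
    """Share of days classified correctly."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

def probability_of_detection(counts):
    """Share of observed wet days that were detected (assumed extra score)."""
    return counts["TP"] / (counts["TP"] + counts["FN"])

# Hypothetical observed and predicted precipitation occurrence for ten days.
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]
counts = confusion_counts(actual, predicted)
```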
RESULTS
Model optimization
For the ANN, we considered two parameters for model optimization: batch size and epoch size. The batch size was varied from 8 to 96 and the epoch size from 5 to 150. The model with a batch size of 16 and an epoch size of 150 provided the best predictive performance (Table 3).
Batch size | Epoch size | RMSE |
---|---|---|
8 | 5 | 11.14 |
8 | 10 | 11.15 |
8 | 50 | 10.83 |
8 | 100 | 11.00 |
8 | 150 | 10.78 |
16 | 5 | 11.18 |
16 | 10 | 10.88 |
16 | 50 | 10.90 |
16 | 100 | 10.66 |
**16** | **150** | **10.59** |
24 | 5 | 11.32 |
24 | 10 | 11.03 |
24 | 50 | 10.83 |
24 | 100 | 10.71 |
24 | 150 | 10.73 |
48 | 5 | 11.25 |
48 | 10 | 11.06 |
48 | 50 | 10.75 |
48 | 100 | 11.08 |
48 | 150 | 10.92 |
96 | 5 | 11.57 |
96 | 10 | 11.08 |
96 | 50 | 10.81 |
96 | 100 | 10.72 |
96 | 200 | 10.88 |
Bold numbers refer to the parameters of ANN representing the best performance.
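Selecting the best setting from such a grid amounts to taking the combination with the lowest RMSE. The short sketch below does this with the values reported in Table 3; the variable names are illustrative, and the epoch size of 200 in the last entry is copied as it appears in the table.

```python
# RMSE for each (batch size, epoch size) pair, as reported in Table 3.
rmse_by_setting = {
    (8, 5): 11.14, (8, 10): 11.15, (8, 50): 10.83, (8, 100): 11.00, (8, 150): 10.78,
    (16, 5): 11.18, (16, 10): 10.88, (16, 50): 10.90, (16, 100): 10.66, (16, 150): 10.59,
    (24, 5): 11.32, (24, 10): 11.03, (24, 50): 10.83, (24, 100): 10.71, (24, 150): 10.73,
    (48, 5): 11.25, (48, 10): 11.06, (48, 50): 10.75, (48, 100): 11.08, (48, 150): 10.92,
    (96, 5): 11.57, (96, 10): 11.08, (96, 50): 10.81, (96, 100): 10.72,
    (96, 200): 10.88,  # epoch size listed as 200 in the table
}

# The best setting is the one with the smallest RMSE.
best_setting = min(rmse_by_setting, key=rmse_by_setting.get)
best_rmse = rmse_by_setting[best_setting]
```

The spread across the grid is small (10.59 to 11.57), so the choice of batch and epoch size affects the RMSE only modestly within the tested range.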
Performance for detecting precipitation occurrence
Performance for predicting precipitation rate
DISCUSSION
Machine learning algorithms for recovering missing precipitation by hydrological data
Similar studies also showed the applicability of the machine learning-based approach for data recovery. Chivers et al. (2020) investigated the performance of four types of machine learning models for sub-hourly precipitation recovery using meteorological variables and data from neighboring precipitation stations. They applied the proposed models to 37 sites across the UK and showed that the models provide R² values of 0.6 or higher at most stations. In addition, they suggested that combining precipitation-related meteorological variables with data from neighboring points gave better prediction results for data imputation. Kim et al. (2015) developed a machine learning-based data imputation model to recover missing streamflow using meteorological variables such as daily precipitation and temperature. The proposed models achieved R² and Nash–Sutcliffe efficiency (NSE) values above 0.6 for predicting missing streamflow compared with the observed data.
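For reference, the scores quoted for precipitation rate and streamflow recovery can be computed as follows. The formulas are the standard definitions of RMSE, the Pearson correlation coefficient (CC), and NSE; the function names and the observed and simulated values are made up for illustration.

```python
import math

def mean(values):
    return sum(values) / len(values)

def rmse(observed, simulated):
    """Root mean square error, in the units of the data (e.g., mm/day)."""
    return math.sqrt(mean([(o - s) ** 2 for o, s in zip(observed, simulated)]))

def correlation(observed, simulated):
    """Pearson correlation coefficient (CC) between observed and simulated values."""
    mo, ms = mean(observed), mean(simulated)
    covariance = sum((o - mo) * (s - ms) for o, s in zip(observed, simulated))
    spread_o = sum((o - mo) ** 2 for o in observed)
    spread_s = sum((s - ms) ** 2 for s in simulated)
    return covariance / math.sqrt(spread_o * spread_s)

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mo = mean(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mo) ** 2 for o in observed)
    return 1.0 - residual / variance

# Hypothetical daily precipitation (mm/day): observed versus recovered values.
observed = [0.0, 5.0, 10.0, 20.0]
simulated = [1.0, 4.0, 12.0, 18.0]
scores = (rmse(observed, simulated), correlation(observed, simulated), nse(observed, simulated))
```

RMSE penalizes large errors on heavy-rain days most strongly, whereas CC measures only how well the simulated values co-vary with the observations, so the two can rank models differently.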
From these studies, it is evident that machine learning-based algorithms can predict missing data in hydrological studies, such as precipitation and streamflow. However, the models are limited in that their performance depends heavily on the quality of the input data and deteriorates when that quality is poor. In other words, high-quality input variables are required to reproduce missing data. Therefore, further study is needed to select the best input datasets for the model to ensure better data recovery performance.
Implementation for water data management
This study demonstrated that the ANN and RF machine learning algorithms showed comparable performance in recovering missing precipitation datasets and that they have the potential to be alternatives to traditional methods. Based on the results for detecting precipitation occurrence in the missing parts, both models performed well, with an accuracy value of 0.83, indicating that they identify the occurrence of precipitation with greater than 80% accuracy. The proposed models are expected to be an effective data generation tool in various fields such as water resources management, drought analysis, and agricultural systems, where the timing of daily precipitation plays an important role. Although there is room for further study to predict the precipitation rate more accurately, the results of this study can provide essential information for the effective management of various water-related data, such as precipitation, streamflow, and soil moisture, by supplementing missing values.
CONCLUSIONS
For 30 weather stations in South Korea, the performance of each model was assessed in terms of two aspects: detection of precipitation occurrence and prediction of precipitation rate. The results are as follows:
1. This study optimized each model by using different parameters. For the RF model, the number of samples was set to 100, the number of features to 5, and the number of trees to 1,000, with no limit on tree depth (none). The ANN model showed the best predictive performance with a batch size of 16 and an epoch size of 150.
2. This study assessed the modeling performance with four different combinations of input variables, including air pressure, temperature, wind speed, and humidity. Type IV, which combines all variables, shows the best detection performance compared with the other input combinations. In addition, the ANN model has slightly better detection ability than the RF model for all types.
3. The evaluation results for predicting the precipitation rate showed an excellent performance, and Type IV has slightly larger CC and R² values than the other three types. In addition, the ANN and RF models have RMSE values of less than 10 mm/day for all four types.
This study demonstrates that machine learning algorithms can be an effective alternative to conventional methods that require complex data recovery processes. The models used in this study showed sufficient power to generate the missing parts of precipitation datasets using different meteorological variables. Although only two algorithms were applied for data generation in this study, we intend to use other machine learning or deep learning techniques to improve the accuracy of data recovery in the future.
ACKNOWLEDGEMENTS
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant Number 2022R1A6A3A01086229).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.