Precipitation is one of the driving forces of the water cycle and is vital for understanding its components, such as surface runoff, soil moisture, and evapotranspiration. However, missing precipitation data at observation stations are an obstacle to improving the accuracy and efficiency of hydrological analysis. To address this issue, we developed a machine learning-based precipitation data recovery tool that detects precipitation occurrence and predicts precipitation amounts for missing records at observation stations. This study investigated 30 weather stations in South Korea, evaluating the applicability of two machine learning algorithms, artificial neural network (ANN) and random forest (RF), for precipitation data recovery using environmental variables such as air pressure, temperature, humidity, and wind speed. The proposed models detected precipitation occurrence in the missing records with an accuracy of 80%. In addition, the predicted precipitation showed a correlation coefficient ranging from 0.5 to 0.7 and an R² value of 0.53. Although both algorithms performed similarly in estimating precipitation, ANN performed slightly better. Based on the results of this study, we expect that machine learning algorithms can contribute to improving hydrological modeling performance by recovering missing precipitation data at observation stations.

  • Missing precipitation data is recovered using ANN and RF algorithms.

  • Air humidity and air pressure have a high correlation with precipitation occurrence.

  • Both models have high performance in detecting the precipitation occurrence.

  • ANN model has better performance than the RF model for recovering daily precipitation data in South Korea.

Precipitation is a key element of the hydrological cycle and is used as input data for various applications, such as flood prediction, agricultural management, and drought analysis (Shin & Salas 2000; Lobell et al. 2007; Wang et al. 2019). These applications require accurate precipitation estimates and sufficient spatiotemporal coverage for hydrological modeling. However, there are areas where precipitation data are not available due to the absence of rainfall gauges or malfunction of measuring equipment. Those missing precipitation data may lead to significant uncertainty in hydrological modeling, mainly when applied for large-scale modeling (Renard et al. 2010; Song et al. 2011).

Beven (2012) emphasized the importance of accurate rainfall data in rainfall-runoff modeling in his book. Many researchers have conducted studies on precipitation data errors in runoff models (Kuczera et al. 2006; Gabellani et al. 2007; Arnaud et al. 2011; Rico-Ramirez et al. 2015). For example, the sensitivity of the rainfall-runoff model to precipitation data was studied by Kobold & Sušelj (2005) for the Savinja basin, Slovenia. The study found that uncertainty in rainfall causes a significant error in peak discharge. In addition, there are studies on the importance of accurate precipitation data in various hydrological modeling. Peters-Lidard et al. (2008) showed that the accuracy of precipitation is a decisive factor influencing the uncertainty of soil texture and hydraulic properties estimation. Also, Shen et al. (2015) confirmed that the error in rainfall measurement significantly affects the uncertainty of hydrologic modeling and simulation of nonpoint source pollution in the Daning River watershed in China.

There are many traditional methods that estimate those missing precipitation data. These interpolation methods can be classified into two categories which are deterministic and geostatistical methods. The inverse distance weighting (IDW) method, one of the deterministic interpolation methods, enables a simple calculation by considering the weight according to the distance of the known observations (Tomczak 1998). However, this simple deterministic approach only relies on the values from neighboring observations. Therefore, the estimation of precipitation using IDW can be inaccurate depending on the density of the stations and is greatly affected by outliers (Li et al. 2018). In addition, this method ignores the spatial dependency between the data.
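
The IDW idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `idw_estimate`, the station coordinates, the rainfall totals, and the exponent `power=2.0` are all hypothetical and are not taken from this study.

```python
import numpy as np

def idw_estimate(target_xy, station_xy, station_values, power=2.0):
    """Estimate a value at target_xy as a distance-weighted mean of station values."""
    station_xy = np.asarray(station_xy, dtype=float)
    station_values = np.asarray(station_values, dtype=float)
    dist = np.linalg.norm(station_xy - np.asarray(target_xy, dtype=float), axis=1)
    # If the target coincides with a station, return that station's value.
    if np.any(dist == 0.0):
        return float(station_values[dist == 0.0][0])
    # Closer stations receive larger weights (inverse distance to the given power).
    weights = 1.0 / dist ** power
    return float(np.sum(weights * station_values) / np.sum(weights))

# Hypothetical example: three gauges (coordinates in km) and daily totals in mm.
stations = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
rain_mm = [12.0, 4.0, 8.0]
estimate = idw_estimate((5.0, 0.0), stations, rain_mm)
```

Because the estimate depends only on neighboring gauge values and their distances, a single outlying gauge or a sparse network directly distorts the result, which is the weakness noted above.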

On the other hand, kriging, the geostatistical method, can consider spatial dependence using variogram to interpolate data. For example, ordinary kriging is the same as IDW in that it gives weights to given values but considers spatial dependence between data in calculating the weights (Lloyd 2005). Also, multivariate kriging, such as simple kriging with varying local means, can introduce explanatory variables to estimate ungauged values by considering known varying means of the external variables (Goovaerts 1997).

Recently, with the improvement of computing algorithms, hardware, and cloud data systems, data-driven models such as machine learning techniques have been widely used in various fields of study (Cho et al. 2019; Kim et al. 2019; Xiang et al. 2020; Han et al. 2021). Because machine learning learns relationships from data, it generally requires large datasets to produce high-quality modeling results. As observation systems have developed rapidly, the use of machine learning has expanded dramatically. Machine learning approaches have been used in many fields of hydrological study, such as predictions of runoff (Kumar et al. 2019; Adnan et al. 2021), precipitation (Agrawal et al. 2019; Nelson et al. 2021), water quality (Ahmed et al. 2019; Bui et al. 2020), and river stages (Choi et al. 2020; Zhao et al. 2020). In addition, machine learning techniques are an effective method for ‘big’ data management. These methods can be utilized to find missing values in datasets and reduce the uncertainty inherent in them. In other words, machine learning can compensate for errors and gaps in the actual observation system by exploiting the relationships among the observed variables.

Numerous studies have addressed the recovery of missing data such as precipitation (Lee et al. 2014; Chivers et al. 2020), streamflow (Kim et al. 2015; Arriagada et al. 2021; Hamzah et al. 2021), temperature (Wang et al. 2021), and soil properties (Ramcharan et al. 2018). These studies applied various machine learning algorithms, including artificial neural networks (ANN), support vector machines (SVM), random forest (RF), and k-nearest neighbors, alongside traditional statistical methods. They showed that machine learning-based approaches recover missing data better than the traditional methods. Although machine learning-based methods have the limitation that users cannot directly inspect or intervene in the process by which the model works, they have the advantage of providing high-quality results. Thus, many machine learning algorithms can be effective alternatives to the traditional statistical methods for recovering missing parts in ‘big’ datasets in many fields of study.

Several studies have also used machine learning algorithms specifically to fill missing precipitation data. For example, Portuguez-Maurtua et al. (2022) used regression models and machine learning algorithms to fill gaps in daily precipitation data in the central part of Peru. They reported that the machine learning technique has sufficient ability to estimate missing precipitation. Bellido-Jiménez et al. (2021) assessed several machine learning algorithms for gap-filling precipitation data in the southern area of Spain. They showed that machine learning algorithms perform better than simple linear regression estimation methods. However, most previous studies have focused only on estimating the amount of precipitation, while detecting the occurrence of precipitation has not been considered. In addition, the effect of using different combinations of input data for the machine learning algorithms has not been considered.

Thus, this study aims to develop a machine learning-based model for recovering missing precipitation data using precipitation-related meteorological variables. The model has two main modules: (1) detecting precipitation occurrence using ‘classification’ machine learning algorithms, and (2) predicting precipitation rate using ‘regression’ machine learning algorithms. We used two machine learning algorithms, ANN and RF, which are widely used in classification and regression tasks, and compared the reconstructed values with actual precipitation data obtained from 30 weather stations in South Korea. By reconstructing not only the occurrence but also the amount of precipitation for the missing parts of a given dataset, the model is expected to contribute to reducing the uncertainty of precipitation data.

Study area

South Korea is located at about 37° N and 127° E in East Asia. The country is surrounded by sea on three sides, and the mountain ranges run along the east coastline, which results in gentler slopes on the western side than on the eastern side. The land covers an area of 100,413 km², and the highest and average altitudes are about 1,950 and 450 m, respectively (Figure 1). The climate of South Korea is cold and dry in winter and hot and humid in summer. The average annual precipitation is about 1,200 mm, and about 60% of the total precipitation is concentrated in summer due to the influence of the monsoon. This strong seasonality causes considerable flood damage in summer. Also, due to the geomorphological characteristics, precipitation patterns vary significantly between regions.
Figure 1

The study area; red points indicate the locations of the 30 weather stations in South Korea. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wst.2023.237.


Datasets

The meteorological data were obtained from the Korea Meteorological Administration (KMA; https://data.kma.go.kr/) for 30 stations covering South Korea, with elevations ranging from 3 to 1,900 m. Each station is equipped with an Automated Synoptic Observing System (ASOS) that measures meteorological variables at time resolutions from minutes to months. In this study, daily datasets of humidity, air pressure, air temperature, and wind speed from 2001 to 2020 were used as model inputs for recovering missing precipitation data. Four types of input datasets were built based on a correlation analysis between these four environmental variables and precipitation. Figure 2 shows box plots of the correlation coefficients between the four environmental variables and precipitation at the 30 stations. As shown in the figure, humidity has the highest correlation with precipitation, followed by air pressure, temperature, and wind speed. Thus, based on the results of the correlation analysis, we considered four types with different combinations of inputs, each adding the next most correlated variable (Table 1).
Table 1

Combination of input variables for four types

| Types | Variables |
|---|---|
| Type I | Humidity |
| Type II | Humidity + air pressure |
| Type III | Humidity + air pressure + air temperature |
| Type IV | Humidity + air pressure + air temperature + wind speed |

Figure 2

The correlation coefficient between precipitation and other environmental variables.
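
A minimal sketch of how the four cumulative input types could be assembled from a correlation ranking is given below. The column names, the helper functions `rank_by_correlation` and `build_inputs`, and the synthetic data are assumptions made for illustration; they are not the KMA records or the code used in this study.

```python
import numpy as np

# Cumulative input combinations from Table 1 (column names are hypothetical).
INPUT_TYPES = {
    "I": ["humidity"],
    "II": ["humidity", "air_pressure"],
    "III": ["humidity", "air_pressure", "air_temperature"],
    "IV": ["humidity", "air_pressure", "air_temperature", "wind_speed"],
}

def rank_by_correlation(columns, precipitation):
    """Return variable names sorted by |Pearson r| with precipitation, strongest first."""
    scores = {name: abs(np.corrcoef(values, precipitation)[0, 1])
              for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

def build_inputs(columns, type_name):
    """Stack the selected daily series into an (n_days, n_features) array."""
    return np.column_stack([columns[name] for name in INPUT_TYPES[type_name]])

# Synthetic stand-in data, not KMA observations.
rng = np.random.default_rng(0)
n_days = 365
columns = {name: rng.normal(size=n_days) for name in INPUT_TYPES["IV"]}
precip = np.clip(2.0 * columns["humidity"] + rng.normal(size=n_days), 0.0, None)

ranking = rank_by_correlation(columns, precip)
X_type3 = build_inputs(columns, "III")
```

In this synthetic setup humidity ranks first by construction; with real station data the ranking would come from the observed correlations.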


Machine learning algorithms

In this study, two machine learning algorithms were used as the main methods to develop a prediction model for reconstructing missing values in the daily precipitation dataset. Machine learning tasks can be mainly classified into two types: classification and regression (or prediction) (Kim et al. 2019). Machine learning algorithms for classification and regression include ANN (McCulloch & Pitts 1943), RF (Breiman 2001), Decision Tree (DT; Quinlan 1986), SVM (Vapnik 1999), and deep learning-based models. This study used ANN and RF models, which are easy to operate and are known as representative machine learning algorithms widely used in various fields. In addition, these two algorithms can perform both ‘classification’ and ‘regression’, the two tasks required in this study.

Random forest

Breiman (2001) introduced the RF model, which is widely used for a broad range of classification and regression tasks. The RF model is an ensemble machine learning algorithm that combines the classification or prediction results obtained from multiple decision trees. Figure 3 shows the conceptual diagram of the RF model.
Figure 3

Conceptual diagram of RF model.


The RF model is based on bootstrap aggregation (i.e., bagging) of multiple DT models. Each DT model is trained on a sub-dataset randomly sampled from the full dataset, and the individual results are then combined, by majority voting for classification or by averaging for regression, to produce the final output. The RF model has shown high-quality results on a wide range of modeling tasks in many research fields. In addition, one of the advantages of the RF model is that it can provide the importance of each input variable to the target. More details about the determination of importance can be found in Louppe et al. (2013).
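
The bagging and variable-importance ideas above can be illustrated with scikit-learn's `RandomForestClassifier`. The sketch uses synthetic data and arbitrary settings (`n_estimators=200`), which are assumptions for illustration rather than the configuration of this study. Note that scikit-learn combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 600
# Four synthetic columns standing in for humidity, pressure, temperature, wind speed.
X = rng.normal(size=(n, 4))
# Synthetic wet/dry label driven mostly by the first column (an assumption).
y = (X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.normal(size=n) > 0.5).astype(int)

# bootstrap=True draws a random resample of the training data for each tree.
forest = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=0)
forest.fit(X, y)

# Impurity-based importance of each input; the values sum to 1.
importances = forest.feature_importances_
# Averaged probability of the wet class for the first five samples.
wet_probability = forest.predict_proba(X[:5])[:, 1]
```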

Artificial neural network

McCulloch & Pitts (1943) proposed the initial ANN model, an algorithm now widely used for classification and regression problems. The ANN model is a computer algorithm inspired by the neural network structure of the human brain (Choi et al. 2020; Jung et al. 2021). Because the ANN model is a practical algorithm for non-linear modeling, it is a valuable tool for solving various problems such as regression, classification, pattern recognition, machine translation, and decision-making. In hydrological studies, the ANN has been used to model rainfall-runoff, water quality, and groundwater flow (Tanty & Desmukh 2015).

Generally, the ANN has three layers: input layer, hidden layer (or computation layer), and output layer (Figure 4). The input data are fed into the input layer, and each node computes an output value and passes it to the following layer via an activation function (Tanty & Desmukh 2015). Two or more hidden layers can be used depending on the features of a given dataset. The model's primary purpose is to find the optimal weight values (w) that minimize the error between predicted and observed values. For a single hidden layer, the ANN model can be mathematically formulated as follows.

$$Y = f\left(B + \sum_{j} w_{j}\, f\left(b_{j} + \sum_{i} w_{ij} X_{i}\right)\right) \tag{1}$$

where f is the activation function in the layers, $X_i$ are the input data, and $w_{ij}$ and $w_j$ are the weight values of the hidden and output layers. B and $b_j$ are the biases in the output and hidden layers. Y is the final output value produced by the model. In the model algorithm, X is multiplied by the weight values (w), the bias is added, and the result is then transformed by the activation function (f). Representative activation functions include the linear, sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU) functions. In this study, the sigmoid function, which is commonly used in ANN models, was used as the activation function. The sigmoid function maps its input to a value between 0 and 1, making it useful for classification and logistic regression tasks. In this study, Python 3.7.10 with the Scikit-Learn 0.23.2 library was used to develop the algorithm.
Figure 4

Conceptual diagram of ANN model.
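
The forward pass of such a network can be sketched with NumPy as follows. The sketch assumes a single hidden layer with a sigmoid activation in both the hidden and output layers; the layer sizes, random weights, and variable names are hypothetical and chosen only to show the order of operations.

```python
import numpy as np

def sigmoid(z):
    """Map any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w_hidden, b_hidden, w_out, B_out):
    """One-hidden-layer forward pass (assumed form):
    Y = f(B + sum_j w_j * f(b_j + sum_i w_ij * X_i))."""
    hidden = sigmoid(X @ w_hidden + b_hidden)   # shape (n_samples, n_hidden)
    return sigmoid(hidden @ w_out + B_out)      # shape (n_samples,)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))          # 5 samples, 4 input variables
w_hidden = rng.normal(size=(4, 3))   # weights from 4 inputs to 3 hidden nodes
b_hidden = np.zeros(3)               # hidden-layer biases b
w_out = rng.normal(size=3)           # weights from hidden nodes to the output
B_out = 0.0                          # output-layer bias B
Y = forward(X, w_hidden, b_hidden, w_out, B_out)
```

Training then consists of adjusting the weights and biases so that `Y` approaches the observed target, which is what a library such as scikit-learn handles internally.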


Model development

This study used RF and ANN machine learning models to recover missing daily precipitation at 30 ground-based stations. Figure 5 illustrates the flowchart of this study, which includes the following two steps: (1) detecting precipitation occurrence; (2) predicting precipitation rate using machine learning models with four input variables. The four variables are air pressure, humidity, air temperature, and wind speed, and they were used to build different combinations of model inputs for training and validation at each station. In the input datasets, the four variables served as the independent (explanatory) variables and precipitation as the dependent (target) variable. Each algorithm was trained and verified using data from ground-based stations where precipitation was observed. In this process, an optimization was also performed to find the best parameters of each algorithm. Model training was performed using a randomly selected 70% of the dataset, and the remaining 30% was used for model validation. Finally, the optimized model was tested to assess whether it is appropriate for estimating missing precipitation.
Figure 5

Flow chart of this study.
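
One possible way to chain the two steps is sketched below with scikit-learn, using a 70/30 random split as described above. Several choices here are assumptions and not confirmed details of this study: the synthetic data, the use of RF for both steps, training the regressor on wet days only, and setting the recovered amount to zero wherever the classifier predicts a dry day.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 1000
X = rng.normal(size=(n, 4))                        # four synthetic input variables
wet = X[:, 0] + 0.3 * rng.normal(size=n) > 0.6     # synthetic occurrence
amount = np.where(wet, np.exp(1.0 + 0.5 * X[:, 0]), 0.0)  # synthetic mm/day

# 70% of the samples for training, 30% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, amount, test_size=0.3, random_state=0)

# Step 1: classification of precipitation occurrence (wet vs dry).
detector = RandomForestClassifier(n_estimators=100, random_state=0)
detector.fit(X_train, y_train > 0)

# Step 2: regression of the precipitation rate, here fitted on wet days only.
wet_train = y_train > 0
estimator = RandomForestRegressor(n_estimators=100, random_state=0)
estimator.fit(X_train[wet_train], y_train[wet_train])

# Recovery: predict an amount only where occurrence is detected.
is_wet = detector.predict(X_val).astype(bool)
recovered = np.zeros(len(X_val))
if is_wet.any():
    recovered[is_wet] = estimator.predict(X_val[is_wet])
```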


Metrics for evaluation

In this study, six statistical metrics were used to evaluate modeling performance. The root mean square error (RMSE) measures the standard deviation (SD) of the prediction errors, indicating the difference between predicted and actual values. The correlation coefficient (CC) ranges from −1 to 1 and describes the strength of the linear relationship between predicted and observed values. A CC value of 0 indicates no correlation between observed and predicted precipitation (Moriasi et al. 2007).
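$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(P_i - O_i\right)^2} \tag{2}$$

$$\mathrm{CC} = \frac{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)\left(O_i - \bar{O}\right)}{\sqrt{\sum_{i=1}^{n}\left(P_i - \bar{P}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(O_i - \bar{O}\right)^2}} \tag{3}$$

where $P_i$ and $O_i$ denote the predicted and actual observation values, n is the number of data samples used in the evaluation, and $\bar{P}$ and $\bar{O}$ represent the mean values of the predicted and actual data.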
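
As a small worked example of these two metrics, the following Python sketch computes the RMSE and CC for a handful of made-up values. The function names and the numbers are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def rmse(pred, obs):
    """Root mean square error between predicted and observed values."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def correlation(pred, obs):
    """Pearson correlation coefficient between predicted and observed values."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    dp, do = pred - pred.mean(), obs - obs.mean()
    return float(np.sum(dp * do) / np.sqrt(np.sum(dp ** 2) * np.sum(do ** 2)))

# Made-up daily precipitation values in mm/day.
obs = [0.0, 2.0, 10.0, 4.0]
pred = [1.0, 1.0, 8.0, 4.0]

error = rmse(pred, obs)        # sqrt((1 + 1 + 4 + 0) / 4) = sqrt(1.5)
cc = correlation(pred, obs)
```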
To evaluate the performance of the models in detecting precipitation occurrence in missing data, classification accuracy and the F1 score are used. The F1 score is a metric widely used to measure the accuracy of binary classification results. The F1 score and the other classification evaluation metrics are mathematically expressed as follows (Chivers et al. 2020):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{6}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$

where TP is a true positive, defined as a hit (i.e., precipitation is both observed and detected by the model); FP is a false positive, defined as a false alarm (i.e., the model detects precipitation that was not observed); TN is a true negative, defined as a correct rejection; and FN is a false negative, defined as a miss (i.e., the model does not detect observed precipitation) (Table 2). The four metrics, F1 score, precision, recall, and accuracy, are thus defined as combinations of TP, FP, TN, and FN.
Table 2

Contingency table of precipitation

| Actual | Predicted positive | Predicted negative |
|---|---|---|
| Positive | True positive (TP); hit | False negative (FN); miss |
| Negative | False positive (FP); false alarm | True negative (TN); correct rejection |
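
The four classification metrics can be computed directly from the counts in the contingency table. The sketch below uses made-up counts and a hypothetical helper `classification_scores`; it is an illustration of the standard definitions, not output from this study.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_scores(tp, fp, tn, fn):
    """Precision, recall, accuracy, and F1 from contingency-table counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    accuracy = safe_ratio(tp + tn, tp + fp + tn + fn)
    f1 = safe_ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Made-up counts: 60 hits, 20 false alarms, 200 correct rejections, 20 misses.
scores = classification_scores(tp=60, fp=20, tn=200, fn=20)
```

With these counts, precision, recall, and F1 are all 0.75 while accuracy is about 0.87. When dry days outnumber wet days, the correct rejections raise accuracy, which is one reason to read it alongside recall and the F1 score.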

Model optimization

Proper parameter selection and configuration are essential for machine learning model training and testing. Generally, the RF model has four parameters: number of samples, number of features, number of trees, and tree depth. This study considered these four parameters for tuning the RF model and evaluated their effect on model performance. Figure 6 shows boxplots of the accuracy for different parameter values. This study used the parameter values with the highest accuracy. The average model accuracy was 82% with 100 samples, 84% with 5 features, 84% with 1,000 trees, and 83.5% with no limit on the tree depth.
Figure 6

Boxplots of the accuracy of different parameters of the RF model.


For the ANN, we considered two parameters for model optimization: batch size and epoch size (number of epochs). The batch size was varied from 8 to 96 and the epoch size from 5 to 150. The model with a batch size of 16 and an epoch size of 150 provided the best predictive performance, with the lowest RMSE (Table 3).

Table 3

Optimization results of ANN parameters

| Batch size | Epoch size | RMSE |
|---|---|---|
| 8 | 5 | 11.14 |
| 8 | 10 | 11.15 |
| 8 | 50 | 10.83 |
| 8 | 100 | 11.00 |
| 8 | 150 | 10.78 |
| 16 | 5 | 11.18 |
| 16 | 10 | 10.88 |
| 16 | 50 | 10.90 |
| 16 | 100 | 10.66 |
| **16** | **150** | **10.59** |
| 24 | 5 | 11.32 |
| 24 | 10 | 11.03 |
| 24 | 50 | 10.83 |
| 24 | 100 | 10.71 |
| 24 | 150 | 10.73 |
| 48 | 5 | 11.25 |
| 48 | 10 | 11.06 |
| 48 | 50 | 10.75 |
| 48 | 100 | 11.08 |
| 48 | 150 | 10.92 |
| 96 | 5 | 11.57 |
| 96 | 10 | 11.08 |
| 96 | 50 | 10.81 |
| 96 | 100 | 10.72 |
| 96 | 200 | 10.88 |

Bold numbers refer to the parameters of ANN representing the best performance.
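
A batch-size and epoch search of this kind could be sketched with scikit-learn's `MLPRegressor`, where `max_iter` sets the number of epochs for the default stochastic solver. The reduced grid, the hidden-layer size, and the synthetic data below are assumptions for illustration; they do not reproduce the network or the results in Table 3.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 4))
y = np.clip(3.0 * X[:, 0] + rng.normal(size=400), 0.0, None)  # synthetic target
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = {}
for batch_size in (8, 16):          # reduced grid for illustration
    for epochs in (5, 50):
        model = MLPRegressor(hidden_layer_sizes=(8,), activation="logistic",
                             batch_size=batch_size, max_iter=epochs,
                             random_state=0)
        # Short runs stop before convergence; silence that expected warning.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", ConvergenceWarning)
            model.fit(X_train, y_train)
        error = y_val - model.predict(X_val)
        results[(batch_size, epochs)] = float(np.sqrt(np.mean(error ** 2)))

# Parameter pair with the lowest validation RMSE.
best = min(results, key=results.get)
```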

Performance for detecting precipitation occurrence

This study evaluated the proposed models in terms of precipitation detection and precipitation rate prediction. Figure 7 shows the evaluation results of the four types of input datasets for detecting precipitation occurrence at 30 stations. Four evaluation metrics (precision, recall, accuracy, and F1 score) were used. As shown in Figure 7, type IV shows the best detection performance among the four types. For type I, the difference in detection performance between the two models is significant. When using all environmental variables as inputs (i.e., type IV), both models have a mean value of 0.77 for precision, 0.64 for recall, 0.83 for accuracy, and 0.68 for F1 score. The ANN model detects precipitation occurrence slightly better than the RF model for all types of input data.
Figure 7

Evaluation results of four types for detecting precipitation occurrence at 30 stations. (a)–(d) indicate evaluation results with different combinations of input variables.


Performance for predicting precipitation rate

The performance of the ANN and RF models in predicting precipitation rate with the four types of input data was evaluated using Taylor diagrams, which summarize how closely modeled results match a reference (Taylor 2001; Han & Morrison 2022). The diagram provides visual indicators based on three metrics: CC, SD, and RMSE. Figure 8 presents four Taylor diagrams showing the performance for the four types of input datasets. As shown in Figure 8, the four diagrams show similar patterns and magnitudes for both models. The CC of types I, II, and III ranges from 0.4 to 0.7, and type IV has slightly larger CC values than the other three types. In addition, the ANN and RF models have RMSE values of less than 10 mm/day for all four types.
Figure 8

Taylor diagrams for daily precipitation predicted from ANN and RF models representing CC, SD, and RMSE for four types. (a)–(d) indicate evaluation results with different combinations of input variables.
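
The three statistics shown on a Taylor diagram are linked by a law-of-cosines relation (Taylor 2001), in which the RMSE is the centered one, computed after removing the means. The sketch below checks this relation numerically on synthetic data; the data and the helper `taylor_statistics` are assumptions for illustration only.

```python
import numpy as np

def taylor_statistics(pred, obs):
    """Return SD of predictions, SD of observations, CC, and centered RMSE."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    sd_pred, sd_obs = pred.std(), obs.std()
    cc = np.corrcoef(pred, obs)[0, 1]
    # Centered RMSE: differences are taken after removing each series' mean.
    centered = (pred - pred.mean()) - (obs - obs.mean())
    centered_rmse = np.sqrt(np.mean(centered ** 2))
    return sd_pred, sd_obs, cc, centered_rmse

# Synthetic, skewed "observations" and a damped, noisy "prediction".
rng = np.random.default_rng(5)
obs = rng.gamma(shape=0.5, scale=10.0, size=500)
pred = 0.6 * obs + rng.normal(scale=3.0, size=500)

sd_pred, sd_obs, cc, centered_rmse = taylor_statistics(pred, obs)
# Law of cosines: E'^2 = sd_pred^2 + sd_obs^2 - 2 * sd_pred * sd_obs * CC.
law_of_cosines = np.sqrt(sd_pred ** 2 + sd_obs ** 2
                         - 2.0 * sd_pred * sd_obs * cc)
```

This relation is why a single point on the diagram can encode all three quantities at once.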

Figure 9 presents scatter plots comparing predicted and observed precipitation for the 30 stations. The RF and ANN models show similar patterns for all four types in that the predicted precipitation is underestimated compared to the observed datasets. It can be inferred that these results are influenced by the zero-precipitation values included in the model training process. Among the four types, type IV has the best predictive performance, with an R² value of 0.53. Also, the ANN model predicts precipitation slightly better than the RF model for all four types. Therefore, it is expected that the RF and ANN models based on the type IV input data can be used effectively to recover missing rainfall data.
Figure 9

Scatter plots for daily precipitation predicted from ANN and RF models and observations. (a)–(d) indicate evaluation results with different combinations of input variables.


Machine learning algorithms for recovering missing precipitation by hydrological data

This study evaluated RF and ANN models for recovering missing daily precipitation data in South Korea. The proposed models used different combinations of input variables, including air temperature, wind speed, humidity, and air pressure. We found that the ANN model performs relatively better than the RF model for recovering precipitation data. Figure 10 shows an example of missing precipitation data recovered by the ANN model at one of the stations (#90) in South Korea. The recovered precipitation data can be used as input data for hydrologic models for flood and drought prediction and analysis and can reduce the uncertainty arising from unmeasured data.
Figure 10

Examples of missing precipitation recovered by the ANN model at one of the stations (#90). The lower panels show the recovered (red) daily precipitation (a) between March 2001 and April 2001 and (b) between June 2001 and July 2001. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wst.2023.237.


Similar studies have also shown the applicability of the machine learning-based approach for data recovery. Chivers et al. (2020) investigated the performance of four types of machine learning models for sub-hourly precipitation recovery using meteorological variables and data from neighboring precipitation stations. They applied the proposed models at 37 sites across the UK and showed that the models provide R² values of 0.6 or higher at most stations. In addition, they suggested that combining precipitation-related meteorological variables with data from neighboring points gives better prediction results for data imputation. Kim et al. (2015) developed a machine learning-based data imputation model to recover missing streamflow using meteorological variables such as daily precipitation and temperature. The proposed models showed R² and Nash–Sutcliffe efficiency coefficient (NSE) values of over 0.6 for predicting missing streamflow compared with the observation data.

These studies make it evident that machine learning-based algorithms can predict missing data in hydrological datasets such as precipitation and streamflow. However, the models still have the limitation that their performance depends strongly on the quality of the input data and can be poor when that quality is low. In other words, high-quality input variables are required to reproduce missing data. Therefore, further study is needed to select the best input datasets for the model to ensure better performance in data recovery.

Implementation for water data management

This study demonstrated that the ANN and RF machine learning algorithms showed comparable performance in recovering missing precipitation datasets and that they have the potential to be alternatives to traditional methods. For the detection of precipitation occurrence in missing parts, both models achieved an accuracy value of 0.83, indicating that they identify the occurrence of precipitation with greater than 80% accuracy. It is expected that the proposed models can be an effective data generation tool in various fields such as water resources management, drought analysis, and agricultural systems, where the timing of daily precipitation plays an important role. Although there is room for further studies to predict the precipitation rate more accurately, the results of this study can be essential information for the effective management of various water-related data such as precipitation, streamflow, and soil moisture by supplementing missing values.

For 30 weather stations in South Korea, the performances of each model were assessed in terms of two aspects: detection of precipitation occurrence and prediction of precipitation rate. The results are as follows:

  1. This study optimized each model by tuning its parameters. For the RF model, the selected values were 100 samples, 5 features, 1,000 trees, and no limit on the tree depth. The ANN model showed the best predictive performance with a batch size of 16 and an epoch size of 150.

  2. This study assessed the modeling performance with four different combinations of input variables, including air pressure, temperature, wind speed, and humidity. Type IV, which combines all variables, shows the best detection performance compared to the other input combinations. In addition, the ANN model has slightly better detection ability than the RF model for all types.

  3. The evaluation of the predicted precipitation rate showed predictive ability for all types, and type IV has slightly larger CC and R² values than the other three types. In addition, the ANN and RF models have RMSE values of less than 10 mm/day for all four types.

This study demonstrates that machine learning algorithms can be an effective alternative to conventional methods requiring complex data recovery processes. The models used in this study showed sufficient power to generate the missing parts of precipitation datasets using different meteorological variables. Even though in this study only two algorithms were applied for data generation, we intend to use other types of machine learning or deep learning techniques to improve the accuracy of data recovery in the future.

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant Number 2022R1A6A3A01086229).

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC-ND 4.0), which permits copying and redistribution for non-commercial purposes with no derivatives, provided the original work is properly cited (http://creativecommons.org/licenses/by-nc-nd/4.0/).