ABSTRACT
Rain gauge networks provide direct precipitation measurements and have been widely used in hydrology, synoptic-scale meteorology, and climatology. However, rain gauge observations are subject to a variety of error sources, and quality control (QC) is required to ensure their reasonable use. To enhance the automatic detection of anomalies in the data, a novel multi-source data quality control (NMQC) method is proposed for hourly rain gauge data. It employs a phased strategy to reduce the misjudgment risk caused by the uncertainty of radar and satellite remote-sensing measurements. NMQC is applied to the QC of hourly gauge data from more than 24,000 hydro-meteorological stations in the Yangtze River Basin in 2020. The results show that its detection ratio of anomalous data is 1.73‰, of which only 1.73% are suspicious data requiring confirmation by experts. Moreover, the distribution characteristics of the anomaly data are consistent with the climatic characteristics of the study region as well as the measurement and maintenance modes of rain gauges. Overall, NMQC has a strong ability to label anomaly data automatically while identifying a low proportion of suspicious data. It can greatly reduce manual intervention and shorten the impact time of anomaly data in operational work.
HIGHLIGHTS
Quantitative and fusion application of radar and satellite data were addressed for the quality control (QC) of hourly rain gauge data.
Precipitation quality anomaly events were defined to improve the traceability of QC.
A phased QC strategy was adopted to reduce the misjudgment risk caused by the uncertainty of remote-sensing measurements.
INTRODUCTION
Precipitation data are essential for numerous operational applications in hydrology, synoptic-scale meteorology, and climatology, such as hydrological modeling (Mohammadi et al. 2024), weather forecasting (Imhoff et al. 2023), flood warning (Ma et al. 2024), drought forecasting (Mohammadi 2023), and decision-making services (Hassani et al. 2023). Precipitation data can be obtained by different means, e.g., rain gauge stations, radar, and satellite. Among them, rain gauges provide direct precipitation measurements, which are routinely used as ground truth; radar and satellite precipitation, by contrast, are obtained by modern remote-sensing techniques and have to be calibrated against rain gauge data (Nešpor & Sevruk 1999; Martinaitis 2008; Qi et al. 2016; WMO 2021). In general, the direct precipitation measurements provided by gauge networks have higher accuracy than remote-sensing measurement systems. However, rain gauge observations are also subject to inaccuracies caused by random and systematic errors, the main causes of which include wind-induced error, wetting and evaporation losses, instrument malfunctions, transmission errors, a poor observation environment, and mistakes made during data processing (Groisman & Legates 1994; Nešpor & Sevruk 1999; Nešpor et al. 2000; Adam & Lettenmaier 2003; Yang et al. 2005; Baltas et al. 2016).
Quality control (QC) is the best known component of quality management systems to ensure the highest possible reasonable standard of accuracy for the optimum use of these data by all possible users (Zahumenský 2004). Although many sophisticated QC procedures have been carried out in various hydrological and meteorological research projects (Ren et al. 2010; Qi 2015; Blenkinsop et al. 2017), QC of rain gauge data has been a challenge, especially at hourly or sub-hourly scales, because of their high spatial and temporal variability with skewed intensity spectra (Li & Sun 2021; Sha et al. 2021).
In this study, the existing QC procedures are divided into two types: single-source QC and multi-source QC. Single-source QC mainly uses data from the checked rain gauge station or its neighbor stations, including limit value checks, station or regional extreme value checks, internal consistency checks, time consistency checks, and spatial consistency checks (Upton & Rahimi 2003; Kondragunta & Shrestha 2006; Ren et al. 2010; Schneider et al. 2014; Blenkinsop et al. 2017). Although these procedures have played an important role in many projects, most of the labeled suspect data still require further confirmation by human visual checks. With the continued growth of data volume, these procedures are resource-intensive and can cause delays. Meanwhile, radar and satellite products are being used in an increasing number of applications because of their wide spatial and temporal coverage (Lengfeld et al. 2020; Adane et al. 2021; Liu et al. 2021; Thiruvengadam et al. 2021; Gebremicael et al. 2022; Zhao et al. 2022). On this basis, some multi-source QC methods have been developed to increase the efficiency of gauge precipitation QC by introducing data from different observation systems (Hill 2013; Qi & Zhang 2013; Qi 2015; Qi et al. 2016; Zhao et al. 2018; Sha et al. 2021), and some new technologies have been adopted for the automatic labeling of anomalies, e.g., decision trees (Qi et al. 2016), neural networks (Zhao et al. 2018), and deep learning (Sha et al. 2021); analysis shows, however, that the effectiveness of these methods is inseparable from the support of multi-source data.
The existing multi-source QCs mainly focus on the use of radar data, and there are few studies on the comprehensive application of radar and satellite. The accuracy of ground-based radar data is easily affected by complex terrain and its usage is limited in areas of poor or no radar coverage. However, satellites, especially geostationary satellites, can prove to be an excellent data source providing high spatial and temporal resolutions for regions where radar networks are missing or unevenly distributed. In this paper, a novel multi-source data quality control method called NMQC is proposed for the hourly rain gauge data by the quantitative and fusion application of radar and satellite data. Notably, although NMQC is designed to be as automatic as possible to improve timeliness and reduce the human workload, it is not fully automatic owing to the skewness of precipitation frequency distribution. The contributions are summarized as follows:
Precipitation quality anomaly event (PQAE) is defined with massive data exploration and analysis to improve the traceability management of QC, and several types of QC-oriented parameters are designed to support the implementation of related algorithms based on the multi-source data.
A phased strategy is applied to logically divide QC procedures into two steps of PQAE detection and PQAE diagnosis to reduce the misjudgment risk caused by the uncertainties from radar and satellite remote-sensing measurements.
A hybrid QC processing mode combining automatic and human-based checks is retained to avoid the incorrect elimination of rare extreme precipitation values in fully automatic QC procedures, which behave similarly to spurious outliers, as these true extremes are very important for describing the variability of precipitation.
METHODS
Study area description
The study region is the Yangtze River Basin (24.50°–35.75°N, 90.55°–122.42°E) with a total area of ∼1.8 million km2, accounting for nearly 19% of China's land area. With a length of over 6,300 km, the Yangtze River is the longest river in Asia and the third longest river in the world. The Yangtze River Basin stretches from the eastern Tibetan Plateau to the East China Sea with a wide range of climate variability and diverse ecosystems (Zhang et al. 2014). Affected by its unique geographic location, special land-sea thermal differences, and seasonal variations of atmospheric circulation, the Yangtze River Basin is a typical East Asian monsoon region that is sensitive and vulnerable to climate changes, characterized by simultaneous rainy and hot weather in summer, and cool and dry conditions in winter. In recent years, against the background of climate warming, the annual precipitation in the Yangtze River Basin shows a trend of dry and wet polarization, and extreme precipitation events occur frequently, which brings new challenges to the QC of precipitation data (Lin et al. 2021; Cheng et al. 2022).
Definition of precipitation quality anomaly event
The concept of anomaly detection is introduced to improve the traceability of QC. The occurrence of anomalous precipitation data with large deviations from the real precipitation is called a precipitation quality anomaly event (PQAE). Two categories, real-time and non-real-time events, are defined by comprehensively analyzing the causes, spatio-temporal distribution characteristics, and application sensitivity of a large number of anomaly data. Real-time events are gross errors in the data, which have a great impact on real-time applications such as weather forecasts and warnings, and can usually be labeled from a single time step of data. Non-real-time events are systematic errors in the data, whose elimination can provide high-quality data support for non-real-time applications such as climate analysis and disaster assessment; they need to be labeled by analyzing data changes over a period of time. NMQC focuses on the processing of real-time events; non-real-time events will be discussed separately. Meanwhile, real-time events are subdivided into five types for more refined analysis, as presented in Table 1.
Event codes | Event names | Event definitions |
---|---|---|
CSPQAE | Clear sky precipitation quality anomaly event | Anomaly precipitation is observed on a clear day; there is no limit on the amount of precipitation. |
PSPQAE | Pseudo small precipitation quality anomaly event | Anomaly precipitation is caused by equipment flipping without rain or fog, dew, frost, and snow gathering or melting; the amount of precipitation ranges from 0.1 to 0.3 mm. |
IPQAE | Isolated precipitation quality anomaly event | Only the checked station has precipitation within a specific spatial range; the amount of precipitation is greater than 0.3 mm. |
SLPQAE | Single larger precipitation quality anomaly event | Precipitation is significantly greater than that of all neighbor stations within a specific spatial range; the amount of precipitation is greater than or equal to 1.0 mm. |
LPQAE | Larger precipitation quality anomaly event | Precipitation is significantly greater than that of neighbor stations or historical extremes within a specific spatial range; the amount of precipitation is greater than or equal to 1.0 mm. |
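The event definitions in Table 1 can be sketched as a simple rule-based classifier for a suspected anomalous hourly value. The spatial-context inputs and the factor used for the "significantly greater" comparison are hypothetical simplifications for illustration, not the paper's exact business rules:

```python
def classify_pqae(pre, sky_clear, neighbors_with_rain, max_neighbor_pre, hist_extreme):
    """Assign a real-time PQAE type (per Table 1) to a suspected anomalous
    hourly precipitation value `pre` (mm). The neighbor/history inputs and
    the x10 'significantly greater' factor are illustrative assumptions."""
    if sky_clear and pre > 0:
        return "CSPQAE"  # precipitation observed on a clear day, any amount
    if 0.1 <= pre <= 0.3:
        return "PSPQAE"  # pseudo small precipitation (0.1-0.3 mm)
    if pre > 0.3 and neighbors_with_rain == 0:
        return "IPQAE"   # only the checked station has precipitation
    if pre >= 1.0 and pre > 10 * max_neighbor_pre:
        return "SLPQAE"  # far above all neighbors (factor is illustrative)
    if pre >= 1.0 and pre > hist_extreme:
        return "LPQAE"   # above neighbors or historical extremes
    return None          # no anomaly event matched
```

In practice such rules would run only on values already flagged as suspected anomalies, with the spatial parameters computed from the neighbor stations in the checked range.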
Calculation of QC-oriented parameters
Quality control factor
Quality control factors (QCFs) are designed based on station, radar, and satellite data with massive amounts of data exploration and analysis, which are sensitive to the quality confirmation of hourly rain gauge data. Station data are mainly used to design QCFs to characterize the precipitation extremes and spatial distribution around the checked station.
For radar data, although quantitative precipitation estimation (QPE) can be directly compared with rain gauge data, it is usually calibrated by rain gauge data in real time to improve its accuracy (Koch et al. 2005; Xiao et al. 2008). Considering the coupling between rain gauge data and the QPE product, only the composite reflectivity (CR) product is adopted, and QPE is estimated directly from the CR product according to the classical Z–R relationship (Fang et al. 2018).
For satellite data, the CTT, CLM, CLT, and CTA products are used, because satellite rainfall estimates cannot be used directly in conjunction with gauge data, especially at the hourly scale (Hughes 2006). Studies have shown that these parameters are closely related to ground precipitation (Liu et al. 2009; Han et al. 2011; Kim & Kwon 2011; Yuan & Hu 2015; Jin et al. 2018; Ombadi et al. 2021). Rainfall intensity corresponds well with cloud-top temperature (CTT): CTT gradually decreases and its range becomes more concentrated as rainfall intensity increases, but CTT is less sensitive to small precipitation. Thus, CLM, CLT, and CTA are introduced to enhance the robustness of the satellite QCFs when precipitation is small or not obvious.
The QCFs are calculated based on the QCRF of the corresponding data or products. They comprise: a factor measuring the historical precipitation extremes around the checked station; a factor representing the spatial frequency of stations where non-zero precipitation is observed; the maximum of the radar CR; the average of the QPE from the Z–R relationship of different precipitation clouds; the minimum of CTT; the average of CTA; the cloud type with the highest proportion; and a factor representing the probability of clear sky, where a higher value corresponds to a lower probability of rainfall.
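The factor set can be sketched as a small aggregation routine over data clipped to the checked station's QCRF; the argument names and dictionary keys below are hypothetical stand-ins for the paper's symbols:

```python
import numpy as np

def compute_qcfs(neighbor_precip, cr_dbz, qpe, ctt, cta, history_max):
    """Aggregate multi-source quality control factors (QCFs) for one
    checked station-hour. All inputs are assumed to be values within
    the station's quality control reference field (QCRF)."""
    neighbor_precip = np.asarray(neighbor_precip, dtype=float)
    return {
        "hist_extreme": float(history_max),                   # historical precipitation extreme
        "spatial_freq": float((neighbor_precip > 0).mean()),  # fraction of neighbors with rain
        "cr_max": float(np.max(cr_dbz)),                      # maximum composite reflectivity
        "qpe_avg": float(np.mean(qpe)),                       # mean radar QPE
        "ctt_min": float(np.min(ctt)),                        # minimum cloud-top temperature
        "cta_avg": float(np.mean(cta)),                       # mean of the CTA product
    }
```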
Quality control decision index
Station weather type classification
The phased QC strategy for PQAEs
PQAE detection algorithm
The PQAE detection algorithm combines spatial analysis with Isolation Forest to improve the efficiency of NMQC, as shown in Figure 3(a). Spatial analysis is essential for rain gauge data QC. However, its efficiency is limited when the data volume is large because it relies on calculations over neighbor stations' data, and it mainly detects anomalies in heavy precipitation rather than small precipitation. Isolation Forest is an unsupervised machine learning algorithm that performs anomaly detection by isolating outliers (Ding & Fei 2013; Yao et al. 2022); it needs no labeled training dataset prepared in advance and makes no assumptions about the probability distribution of the checked data. It is one of the most widely used algorithms in anomaly detection. First, Isolation Forest is adopted to quickly detect and label suspected anomaly data. Second, the labeled suspected anomaly data are checked using a percentile-based spatial analysis method to obtain more refined spatial distribution parameters. Finally, comprehensive analyses restricted by business rules are made to classify the corresponding PQAEs.
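A minimal sketch of the first step, using scikit-learn's IsolationForest on a hypothetical feature matrix (gauge precipitation, radar CR maximum, satellite CTT minimum per station-hour); the feature choice and contamination value are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical features: [gauge precip (mm), radar CR max (dBZ), CTT min (K)]
normal = rng.normal(loc=[2.0, 35.0, 230.0], scale=[1.0, 5.0, 10.0], size=(500, 3))
anomalies = np.array([[25.0, 5.0, 285.0],   # heavy gauge value with no radar/cloud signal
                      [30.0, 8.0, 290.0]])
X = np.vstack([normal, anomalies])

# A small contamination mirrors the low anomaly ratio reported for NMQC;
# fit_predict returns -1 for suspected anomalies and 1 for normal data.
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = clf.fit_predict(X)
suspected = np.where(labels == -1)[0]
```

The suspected indices would then be passed to the percentile-based spatial analysis rather than being flagged directly.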
PQAE diagnosis algorithm
The PQAE diagnosis algorithm is implemented as shown in Figure 3(b). Data belonging to a clear sky precipitation quality anomaly event (CSPQAE) are directly confirmed as erroneous, because the corresponding SWT is clear, which has been comprehensively assessed with multi-source data. For other PQAEs, considering that the recognition of small precipitation is more uncertain than that of heavy precipitation, the diagnosis rules vary with the hourly precipitation, denoted pre, as follows: (1) when pre is below a given threshold and there are no sufficient reasons for error, the data are directly labeled with T as available; for instance, if SWT is rain, a PSPQAE will be labeled with T; (2) when pre is at or above the threshold and there are no sufficient reasons to confirm the data as available or erroneous, they are labeled with D as suspicious and confirmed by manual intervention if necessary. As defined above, the values 0 and 1 for QCDIS are confirmed by multiple QCFs and thus have higher accuracy than the corresponding values for QCDIR. Conversely, once precipitation reaches a certain grade, the relationship between radar QPE and gauge precipitation is more accurate than that between CTT and gauge precipitation. Therefore, QCDIS has priority over QCDIR for small precipitation, and QCDIR has priority over QCDIS for heavy precipitation.
To facilitate the expression, the hourly precipitation grade is denoted by PREGRD. According to the above analysis, the threshold method is adopted to design four diagnosis rules, DRULE1–DRULE4, on the basis of SWT, PREGRD, QCDIR, and QCDIS. Taking one of these rules as an example: when SWT is rainless, it is used for the comprehensive diagnosis of isolated precipitation quality anomaly events (IPQAEs), single larger precipitation quality anomaly events (SLPQAEs), and larger precipitation quality anomaly events (LPQAEs). First, check whether the value of QCDIR is 0, 1, or M; if so, the data are labeled according to the pre-designed decision tables. If not, the following steps are performed: (1) two parameters are defined and a threshold c is given; (2) when the first parameter does not exceed c, if PREGRD is 4 or 5, QCDIR is 2, and QCDIS is 0, the data are flagged with F as erroneous, otherwise they are flagged with T as available; (3) otherwise, if QCDIS is 1 or M, the data are flagged with D as suspicious; if the second parameter satisfies its threshold condition, they are flagged with T as available, otherwise they are flagged with D as suspicious.
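The structure of such a rule can be sketched as follows; the decision-table branch and the single ratio parameter with threshold c are hypothetical stand-ins for the unspecified parameters, while the flags follow the paper (F erroneous, T available, D suspicious):

```python
def diagnose_rainless_large(pregrd, qcdir, qcdis, support_ratio, c=0.5):
    """Sketch of a threshold-style diagnosis rule for IPQAE/SLPQAE/LPQAE
    under a rainless SWT. `support_ratio` and `c` are illustrative
    placeholders for the paper's unspecified parameters."""
    if qcdir in (0, 1, "M"):
        # radar index is decisive: use a pre-designed decision table
        return {0: "F", 1: "T", "M": "D"}[qcdir]
    if support_ratio <= c:
        if pregrd in (4, 5) and qcdir == 2 and qcdis == 0:
            return "F"  # heavy grade with weak radar and satellite support
        return "T"
    if qcdis in (1, "M"):
        return "D"      # satellite evidence inconclusive or missing
    return "T"
```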
RESULTS
NMQC is evaluated by analyzing the QC results of hourly rain gauge data from more than 24,000 hydro-meteorological stations in the Yangtze River Basin from January to December 2020. Data status flags are adopted to track and analyze the QC process of NMQC, as presented in Table 2. During PQAE detection, data that are not labeled are available and flagged with C0; the other, labeled data are suspected anomaly data and flagged with C1. During PQAE diagnosis, C1 data are further confirmed as belonging to one of three categories: available, suspicious, and erroneous, flagged with D0, D1, and D2, respectively. In addition, taking the hourly rain gauge data from June to August in Hubei Province as an example, a test dataset referred to as tSet is created to evaluate the performance of NMQC, in which the anomaly dataset referred to as aSet has already been labeled by data quality analysis experts.
Flag | Description |
---|---|
C0 | Detected as true, data is available |
C1 | Detected as suspected anomaly data |
D0 | Detected as suspected anomaly data, diagnosed as available data |
D1 | Detected as suspected anomaly data, diagnosed as suspicious data |
D2 | Detected as suspected anomaly data, diagnosed as erroneous data |
Detection performance of anomaly data
The hit ratio, false detection ratio, and miss detection ratio of NMQC are verified on the test dataset tSet. The results are listed in Table 3: the number of expert-labeled anomalies is 7,231, the number of data detected by NMQC is 7,215, of which 124 are false detections and 7,101 are hits. The statistics show that these metrics are better for anomaly data below 1.0 mm than for data of 1.0 mm or more, and the overall hit, false detection, and miss detection ratios are 98.20, 1.71, and 1.80%, respectively.
Precipitation grades (mm) | Labeled anomalies | Detected | False detections | Hits |
---|---|---|---|---|
≥1.0 | 413 | 403 | 13 | 390 |
[0.1,1.0) | 6,818 | 6,812 | 101 | 6,711 |
Total | 7,231 | 7,215 | 124 | 7,101 |
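The ratios in Table 3 can be recomputed from the counts; the column interpretation (expert-labeled anomalies in aSet, data detected by NMQC, false detections, hits) is inferred from the arithmetic and stated here as an assumption:

```python
def detection_metrics(n_labeled, n_detected, n_false, n_hit):
    """Hit, false detection, and miss detection ratios from raw counts.
    Denominators are assumed: hits and misses against the expert-labeled
    set, false detections against the data flagged by NMQC."""
    hit_ratio = n_hit / n_labeled
    false_ratio = n_false / n_detected
    miss_ratio = (n_labeled - n_hit) / n_labeled
    return hit_ratio, false_ratio, miss_ratio

# Totals from Table 3
h, f, m = detection_metrics(7231, 7215, 124, 7101)
```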
Distribution characteristics of anomaly data
PQAEs distribution
To understand the event distribution of the anomaly data, the ratios of the different PQAEs are counted. The highest proportion belongs to PSPQAE with 89.52%, followed by IPQAE with 4.84%, SLPQAE with 3.00%, and CSPQAE with 2.58%, while LPQAE is the lowest with 0.06%. The monthly distribution of PQAEs is shown in Figure 5(a). Because the proportions of the different PQAEs differ greatly, with PSPQAEs far outnumbering the others, the data amounts are logarithmically transformed to make the analysis clearer. The amount of PSPQAE is smaller in the summer half year than in the winter half year; IPQAE is higher in March and August; SLPQAE is higher in March and from June to August; CSPQAE is higher from February to April; and LPQAE is higher in March and July.
Temporal distribution
Since the proportion of suspicious data is much lower than that of erroneous data, the data amounts are logarithmically transformed to better analyze the monthly change of anomaly data, as shown in Figure 5(b). There are more anomaly data in the winter half year than in the summer half year. Meanwhile, there are fewer erroneous data in the summer half year than in the other months, but more suspicious data. The main reason is that, under the influence of extreme weather such as local strong convection, it is more difficult to confirm data quality in the summer half year. The monthly change of anomaly data is consistent with that of erroneous data because the proportion of erroneous data is as high as 98.27%.
Spatial distribution
The statistical results show that 83.75% of stations have anomaly data. Among these stations, the proportions falling in the five anomaly-ratio bins, from lowest to highest, are 58.72, 35.57, 4.18, 1.21, and 0.32%, so the two lowest bins together account for 94.29%. Although most stations have anomaly data, stations with a high anomaly data ratio are relatively few. The spatial distribution of stations with a higher anomaly ratio (≥0.57%) is shown in Figure 5(c); the distribution of these stations is relatively uniform and has no obvious regional characteristics except for areas with sparse station density. According to the analysis of stations with a higher anomaly ratio, the main anomaly data are CSPQAEs and PSPQAEs, and these data are mainly the result of a poor observation environment and instrument maintenance. Taking Yimencun station as an example, it has the highest anomaly ratio of 5.01%; among its anomaly data, PSPQAEs account for 59.09%, CSPQAEs for 38.41%, and others for 2.50%.
Grades distribution
The non-zero precipitation data are analyzed to further understand the distribution of anomaly data across precipitation grades. The results show that anomaly data are mainly concentrated below 1.0 mm, which accounts for 94.36%. This is consistent with the fact that there are more observation samples of small precipitation. Moreover, the proportion of each PQAE type within its own total is counted for the different precipitation grades. The results show that PSPQAEs all fall in [0.4, 1.0), which is consistent with the definition; CSPQAEs occur at any grade but are mainly concentrated below 1.0 mm, accounting for 93.05%; IPQAEs are greater than or equal to 0.4 mm, of which [0.4, 1.0) accounts for 63.96% and ≥1.0 for 36.04%; SLPQAEs are greater than or equal to 1.0 mm, of which [1.0, 20.0) accounts for 90.68% and ≥20.0 for 9.32%; LPQAEs are greater than or equal to 1.0 mm, of which [1.0, 20.0) accounts for 6.18%, [20.0, 60.0) for 86.8%, and ≥60.0 for 7.01%.
Case study
Pseudo small precipitation quality anomaly event
There are 102 stations showing a PSPQAE at 10:00 on January 29, 2020 (Beijing time, the same hereafter) in southeastern Sichuan and the border areas of Chongqing, Hubei, and Guizhou (28.30°–32.30°N, 103.30°–109.30°E); these stations account for 1.7% of all stations in the region. The analysis of the QC parameters shows that SWT is rainless, QCDIS is 0, and QCDIR is 0. The data are automatically confirmed as erroneous according to the DRULE1 diagnosis rule.
Clear sky precipitation quality anomaly event
Isolated precipitation quality anomaly event
DISCUSSIONS
In this paper, NMQC, a multi-source data quality control method, is proposed for hourly rain gauge data. First, PQAEs are defined to improve the traceability management of the QC process. Second, several types of QC-oriented parameters are designed through the quantitative and fused application of radar and satellite multi-source data to support the implementation of the algorithms. Third, a phased QC strategy is adopted to logically divide the QC procedure into the two steps of PQAE detection and PQAE diagnosis, reducing the misjudgment risk caused by the uncertainty of radar and satellite remote-sensing measurements. Finally, NMQC is evaluated by analyzing the QC results of hourly rain gauge data from more than 24,000 hydro-meteorological stations in the Yangtze River Basin from January to December 2020.
Overall, NMQC performs well in detecting anomaly data and has a strong ability to automatically label erroneous data. Compared with the operationally used single-source QC method (Ren et al. 2010), NMQC has a higher anomaly data detection ratio and a lower proportion of detected suspicious data: as shown in Section 3.1, the suspicious, erroneous, and total detection ratios for NMQC are 0.03‰, 1.70‰, and 1.73‰, respectively, with a suspicious data proportion of 1.73%, whereas the corresponding ratios for the operationally used QC method are 0.44‰, 0.18‰, and 0.62‰, with a suspicious data proportion of 70.97%. The main reasons are as follows: (1) the operationally used QC method mainly focuses on the QC of heavy precipitation, whereas NMQC adds the QC of small precipitation, and the observation samples of small precipitation far outnumber those of heavy precipitation; (2) the confirmation of anomaly data is restricted by information resources in the single-source QC method, because it is mainly based on spatio-temporal checks within the gauge network itself, while NMQC makes a comprehensive judgment from multiple perspectives by introducing satellite and radar data, which greatly improves the automatic labeling ability for anomaly data.
Moreover, it should be noted that NMQC is not a fully automatic QC procedure compared with some existing multi-source QC methods; the suspicious anomaly data will be further confirmed by data quality analysis experts in the subsequent business operations. Qi et al. (2016) developed an automated gauge QC scheme based on the consistency of hourly gauge and radar QPE observations to benefit the making of gridded QPE products in radar coverage area. Notably, NMQC is designed for processing the gauge data itself. In other words, their application scenarios are not consistent. We believe that this is an important issue that must be carefully considered when designing QC procedures. Sha et al. (2021) present a supervised automated QC system using convolutional neural networks with grid precipitation and elevation analyses data as input for a sparse gauge station observation network; it is a meaningful exploration. Although manual QC is required to be performed on the raw gauge observations to create binary classification quality labels (good or bad) for the supervised training dataset, we think it is a possible future improvement direction for NMQC, which can be used as a better tool for making multi-classification labels.
Additionally, the results shown in Section 3 are consistent with the climatic characteristics of the Yangtze River Basin as well as the measurement and maintenance modes of rain gauges. In particular, NMQC has a high false detection ratio in the summer half year during PQAE detection, and there are more anomaly data in the winter half year than in the summer half year after PQAE diagnosis. On the one hand, rain and heat occur in the same period in the Yangtze River Basin, and local short-duration heavy precipitation and showers occur frequently in the summer half year (Cheng et al. 2022). The spatial distribution of such precipitation is usually similar to that of anomalous precipitation, so it is easily falsely detected during PQAE detection. However, the authenticity of such precipitation can be determined during PQAE diagnosis by using satellite and radar data to eliminate the false detections. On the other hand, tipping bucket rain gauges are the main observation instruments, and they cannot accurately measure solid precipitation (Colli et al. 2014; Martinaitis et al. 2015; Choi et al. 2022). Snow or sleet occurs in most areas of the Yangtze River Basin in the winter half year (Zhang et al. 2016). Although tipping bucket rain gauges are disabled in areas of the upper reaches with prolonged snowfall, they are still used in the middle and lower reaches during sleet, which can produce many anomalous data. Meanwhile, data quality is also affected by the relatively loose rain gauge maintenance requirements, owing to the low probability of high-impact severe rainfall in winter in the Yangtze River Basin. For these reasons, it is suggested that tipping bucket rain gauge precipitation data in the winter half year be used cautiously in combination with the synoptic background. More in-depth research on the QC of winter precipitation is required in the future.
Finally, the present diagnosis rules are sensitive to the values of the relevant QC parameters, which were obtained through a large number of experiments. In the future, efforts will be made to develop more intelligent methods for setting parameter values. In addition, although the hit, false detection, and miss detection ratios of NMQC are discussed on the labeled test dataset, the volume of the test dataset needs to be expanded further for a more comprehensive evaluation.
CONCLUSION
In this study, we presented a novel quality control method for the hourly rain gauge data called NMQC to enhance the automatic labeling ability of anomaly data and the robustness of algorithms by the quantitative and fusion application of radar and satellite multi-source data in the Yangtze River Basin. The main conclusions are summarized as follows:
- (1)
NMQC has a strong ability to label anomaly data automatically, which can greatly reduce manual intervention and shorten the impact time of anomaly data on applications. The detection ratio of anomaly data for NMQC is 1.73‰, of which only 1.73% are suspicious data. The hit, false detection, and miss detection ratios on the test dataset are 98.20, 1.71, and 1.80%, respectively.
- (2)
The distribution characteristics of the anomaly data detected by NMQC are consistent with the climatic characteristics of the Yangtze River Basin as well as the measurement and maintenance modes of rain gauges, which implies the feasibility of NMQC. Specifically, the amounts of the different PQAEs vary greatly, with the highest proportion belonging to PSPQAE at 89.52%, and the anomaly data are mainly concentrated below 1.0 mm. Stations with high anomaly data ratios account for a small proportion, and their spatial distribution is relatively uniform with no obvious regional characteristics. There are fewer anomaly data in the summer half year than in the other months, but more suspicious data.
In summary, NMQC performs well in detecting anomaly data while flagging a low proportion of suspicious data, and it has a strong ability to automatically label erroneous data. Consequently, it can greatly reduce manual intervention and shorten the impact time of anomaly data, which makes it feasible for operational work.
ACKNOWLEDGEMENTS
This work was supported by the Yangtze River Basin Meteorological Open Fund Project (No.CJLY2022Y08), the Key Laboratory of South China Sea Meteorological Disaster Prevention and Mitigation of Hainan Province Open Fund Project (No.SCSF202209), and the China Yangtze Power Company Limited Scientific Research Project (No.2423020002).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.