An entropy-based approach for the optimization of rain gauge network using satellite and ground-based data

Accurate and precise rainfall records are crucial for hydrological applications and water resources management. The accuracy and continuity of ground-based time series rely on the density and distribution of rain gauges over a territory. As rain gauge networks decline, how to design optimal networks remains an open issue. In this work, we present a method to optimize a ground-based rainfall network using satellite-based observations, maximizing the information content of the network. We combine Climate Prediction Center MORPhing technique (CMORPH) observations at ungauged locations with an existing rain gauge network in the Rio das Velhas catchment, in Brazil. We use a greedy algorithm to rank the potential locations for new sensors, based on their contribution to the joint entropy of the network. Results show that the most informative locations in the catchment correspond to the areas with the highest rainfall variability and that satellite observations can be successfully employed to optimize rainfall monitoring networks.


INTRODUCTION
Quantification of precipitation is essential for hydrological and water resources applications, including water allocation, water resources monitoring and risk assessment. Yet, the state of our knowledge depends on the density and distribution of rainfall monitoring networks. For this reason, it is desirable to have dense rain gauge networks (Li et al. ). However, despite their crucial role, rainfall networks have been declining in recent decades due to their high maintenance and operating costs (Mishra & Coulibaly ; Dai et al. ), and data are scarce or lacking in some areas of the world (Walker et al. ). Although many remote sensing products are now available, ground-based observations are still needed for their calibration, validation and bias removal (Li et al. ). In this context, many researchers have tried to answer the question of how to design optimal monitoring networks, which can guarantee accurate information and reduce uncertainty in precipitation […] directional information transfer (Yang & Burn ). Some studies adopted additional objective functions not related to information theory (IT), such as hydrological model efficiency (Xu et al. ), rainfall field interpolation accuracy (Xu et al. ) and spatiotemporality information (Huang et al. ).
A common issue for monitoring network design, in general, is that precipitation observations, and all the information we can derive from them, are available only at the locations where sensors are deployed. Thus, the question is how to decide which ungauged locations are the most convenient for placing new sensors. Most authors either interpolate precipitation observations (e.g., Xu et al. ), for the case of rainfall networks, or employ hydrological models to produce water level time series at ungauged locations (e.g., Werstuck & Coulibaly ). However, when interpolating precipitation, some biases are introduced and the accuracy of the resulting rainfall field depends on the specific interpolation technique adopted and on the characteristics of the area considered (Hofstra et al. ). This problem can be addressed using remote sensing data, which have been widely used in the last decade for many hydrological applications (Li et al. ; Mazzoleni et al. ; Bertini et al. ). Remote sensing products proved to better reflect spatial relationships among objects when compared to interp[…]
A few authors have adopted satellite observations in network design, among them Contreras et al. () and Huang et al. (). The former applied the conditioned Latin hypercube sampling method to a TMPA product to capture spatiotemporal precipitation at ungauged locations, while the latter applied IT within a multi-objective optimization approach. However, methods that consider satellite information are very limited and still rely on prior data interpolation, without analysing the potential information content of satellite observations. Since the main advantage of a gridded dataset is that the information contained at ungauged locations can be investigated without introducing interpolation biases, in this work we propose a method that uses satellite precipitation estimates to evaluate and optimize an existing rain gauge network. The number and locations of new sensors are chosen in order to maximize the total information content of the network, given by its joint entropy. This follows the generally accepted view that the information content of a time series can be taken as a measure of variability that indicates where it is more appropriate to measure rainfall (e.g., Krstanovic & Singh ; Mishra & Coulibaly ). Entropy at ungauged locations is evaluated using version 1.0 of the CMORPH rainfall product, which meets the requirements of a fine spatial scale and good performance. In contrast to the study of Huang et al. (), we do not take into account redundancy reduction, with the aim of obtaining a robust network and ensuring the capture of essential information even in case of a sensor failure. Indeed, we do not mean to change or remove the existing stations, as most hydrological applications need long time series; instead, we aim to increase the network density in order to gain a better understanding of rainfall characteristics and improve water resources assessment in the area.
This paper is organized as follows. First, we provide a background on IT; second, the case study and the dataset are introduced; then, details about the methodology adopted are provided. Finally, results and discussions are presented and conclusions of the study are drawn.

BACKGROUND

Information theory
The amount of information content and of redundancy given by a monitoring network can be measured using IT (Shannon ). Definitions of the IT-related quantities employed are presented below.
Given a set of $n$ events, with known probabilities of occurrence $p_1, p_2, \ldots, p_n$, entropy is defined as the measure of uncertainty of the $n$ possible outcomes. If more information about one of the events is obtained, then the uncertainty of the outcomes decreases. Information can thus be regarded as a decrease in uncertainty, and entropy can be seen as a measure of information content. The concept of entropy can be extended to a discrete random variable $X$ (Shannon & Weaver ), with discrete values $x_1, x_2, \ldots, x_n$ and corresponding probabilities $p(x_1), p(x_2), \ldots, p(x_n)$:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \tag{1}$$

where $H(X)$ is the entropy of the variable $X$, also called marginal entropy.
In a similar way, it is possible to evaluate the information content of $N$ variables $X_1, X_2, \ldots, X_N$, introducing the concept of joint entropy:

$$JH(X_1, X_2, \ldots, X_N) = -\sum_{i_1=1}^{n_1} \sum_{i_2=1}^{n_2} \cdots \sum_{i_N=1}^{n_N} p(x_{i_1}, x_{i_2}, \ldots, x_{i_N}) \log_2 p(x_{i_1}, x_{i_2}, \ldots, x_{i_N}) \tag{2}$$

where $JH(X_1, X_2, \ldots, X_N)$ is the joint entropy of $N$ discrete random variables, $p(x_{i_1}, x_{i_2}, \ldots, x_{i_N})$ is the joint probability of the variables $X_1, X_2, \ldots, X_N$, and $n_j$ is the number of discrete values taken by $X_j$.
The logarithm in Equations (1) and (2) is base 2; therefore, marginal entropy and joint entropy are measured in bits.
In monitoring network design and optimization problems, each precipitation time series recorded by a sensor can be regarded as a discrete random variable $X$, with marginal entropy $H(X)$. The information content provided by the whole network, made of $N$ sensors, is given by the joint entropy $JH(X_1, X_2, \ldots, X_N)$.
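As a minimal illustration of these definitions, the following Python sketch estimates marginal and joint entropy from relative frequencies of already discretized series. The function names and the toy series are assumptions made for this example and are not taken from the study.

```python
import math
from collections import Counter


def marginal_entropy(series):
    """Shannon entropy in bits of one discrete series (frequency estimate)."""
    n = len(series)
    counts = Counter(series)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def joint_entropy(*series):
    """Joint entropy in bits of several equally long discrete series."""
    n = len(series[0])
    if any(len(s) != n for s in series):
        raise ValueError("all series must have the same length")
    # Each time step contributes one tuple of simultaneous values.
    counts = Counter(zip(*series))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical toy series, standing in for two quantized rainfall records.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]

h1 = marginal_entropy(x1)           # 1.0 bit
jh_distinct = joint_entropy(x1, x2)  # 2.0 bits: x2 adds new information
jh_copy = joint_entropy(x1, x1)      # 1.0 bit: a duplicate adds nothing
```

The last two values show why joint entropy suits network design: a sensor that repeats an existing record does not raise the information content of the network.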
Estimating joint entropy as defined in Equation (2) […] Quantization can be defined as the division of a quantity into a discrete number of smaller parts, often integral multiples of a common quantity (Gray & Neuhoff ).
Its oldest form, rounding off, was already employed in 1898 for the estimation of densities by histogram (Sheppard ).
In this work, a normalized rounding off is adopted to convert a continuous signal (precipitation) into discrete values (bins), filtering out the noise from the observed time series. The quantization adopted here rounds a value $x$ down to the nearest integer $x_q$ that is a multiple of a predefined quantity $a$, following a rule (Equation (3)) in which $k$ is a constant value used to normalize the time series.
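The sketch below shows one possible form of such a quantization rule. It is an assumption made for illustration only: the value is divided by `k` and then floored to a multiple of `a`, which may differ from the exact expression of Equation (3).

```python
import math


def quantize(x, a, k=1.0):
    """Floor the normalized value x / k to a multiple of the bin width a.

    Assumed form for illustration; not necessarily Equation (3) of the study.
    """
    if a <= 0 or k <= 0:
        raise ValueError("a and k must be positive")
    return a * math.floor((x / k) / a)


# Hypothetical daily rainfall depths in mm, with bin width a = 5.
daily_mm = [0.0, 4.9, 5.0, 12.3]
bins = [quantize(v, a=5) for v in daily_mm]  # [0, 0, 5, 10]
```

A larger `a` merges more values into the same bin and therefore lowers the estimated entropy, which is why the choice of `a` and `k` deserves a sensitivity analysis.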

STUDY AREA AND DATA

Study area
The study is applied to the Rio das Velhas catchment, located […]

Data pre-processing
The optimization of the rainfall monitoring network is conducted using ground-based and satellite-based precipitation time series, both covering the period 1998-2014, for a total of 17 years of observations, which meets the requirement of a minimum of 10 years of records when entropy is applied to network design (Keum & Coulibaly ).
The existing rain gauges provide daily precipitation depth estimates, while CMORPH gives 30 min rain intensity; therefore, we pre-process the satellite observations to make them comparable to the ground-based ones. First, we transform CMORPH intensity estimates into precipitation depth; then, we aggregate them to the daily scale, to match the temporal resolution of the gauge-based measurements. Finally, to match the standard rain gauge minimum resolution, which is 0.1 mm for the case study, each record lower than 0.1 mm is set to 0 mm.
[…] Table 1. Further details are given in the following subsections.
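The pre-processing of the satellite estimates described above can be sketched as follows. The example assumes that intensities are expressed in mm/h, that each day holds 48 consecutive 30 min slots, and that the 0.1 mm threshold is applied to the daily totals; the function and constant names are hypothetical.

```python
SLOTS_PER_DAY = 48    # number of 30 min slots in one day
SLOT_HOURS = 0.5      # duration of one slot in hours
MIN_DEPTH_MM = 0.1    # rain gauge resolution of the case study


def daily_depth(intensity_mm_per_h):
    """Convert 30 min intensities (assumed mm/h) to thresholded daily depths (mm)."""
    if len(intensity_mm_per_h) % SLOTS_PER_DAY != 0:
        raise ValueError("input must cover whole days")
    daily = []
    for start in range(0, len(intensity_mm_per_h), SLOTS_PER_DAY):
        day = intensity_mm_per_h[start:start + SLOTS_PER_DAY]
        # Depth of a slot = intensity * duration; the daily depth is their sum.
        depth = sum(v * SLOT_HOURS for v in day)
        # Totals below the gauge resolution are treated as no rain.
        daily.append(depth if depth >= MIN_DEPTH_MM else 0.0)
    return daily


# Two hypothetical days: a drizzle below the threshold, then steady rain.
drizzle = [0.1] + [0.0] * 47   # 0.05 mm in total, set to 0
steady = [1.0] * 48            # 24 mm in total
depths = daily_depth(drizzle + steady)
```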

Ranking experiments
The existing stations and CMORPH cells are ranked to identify the most informative locations within the catchment. For experiment G, the set of candidate sensor locations is the set […] found, label it as $g_1$, store it in the set $RG$ (ranked set $g$) and remove $g^*$ from the original set $g$. Then, search for another station $g^*$ among the candidates in the updated set $g$, such that $JH(g_1, g^*)$ is maximum; when found, label it as $g_2$, append it to $RG$ and remove it from $g$. Repeat the procedure until the size of $RG$ is $N$. The set $RG$, updated at each step of the algorithm, corresponds to a quasi-optimal set of stations.
The mathematical procedure of the ranking problem is the following:
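As a hedged illustration of the greedy ranking described above, the following Python sketch repeatedly selects the candidate that maximizes the joint entropy of the stations ranked so far. The function names, the tie-breaking by input order and the toy series are assumptions of this example, not details confirmed by the study.

```python
import math
from collections import Counter


def joint_entropy(series_list):
    """Joint entropy in bits of a list of equally long discrete series."""
    n = len(series_list[0])
    counts = Counter(zip(*series_list))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def greedy_rank(candidates):
    """Rank candidates (name -> quantized series) by greedy JH maximization.

    Returns a list of (name, joint entropy after adding that station).
    Ties are resolved by dictionary order, an arbitrary choice.
    """
    remaining = dict(candidates)
    chosen, ranked = [], []
    while remaining:
        best = max(remaining,
                   key=lambda name: joint_entropy(chosen + [remaining[name]]))
        chosen.append(remaining.pop(best))
        ranked.append((best, joint_entropy(chosen)))
    return ranked


# Hypothetical quantized series; station C duplicates station A.
stations = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [0, 0, 0, 0, 1, 1, 1, 1],
    "D": [0, 1, 2, 0, 1, 2, 0, 1],
}
ranking = greedy_rank(stations)
```

In this toy case station D, which has the most variable record, is ranked first, and the duplicate station C is ranked last because it does not increase the joint entropy.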

Sensitivity analysis of quantization parameters
The estimation of both marginal and joint entropy requires the calculation of probabilities (see Equations (1) and (2)) […]

Monitoring network optimization
The optimal layout of sensor locations is defined with an optimization procedure (experiment GS) that takes into account the information provided by both ground-based and satellite-based datasets. The optimal network is defined in two steps: first, select a set of quasi-optimal rain gauges from the existing network, i.e., the first $m$ sensors from set […]

RESULTS
First, results of the ranking experiments of the datasets are shown, followed by our findings on the sensitivity analysis and by the optimal network obtained.

Ranking experiments
A map of the ranked existing rain gauges is shown in […] It is interesting to note that none of the existing rain gauge locations is selected. However, both the total amount of information provided by a satellite-based network, which is $JH_S = 8.6$ bits, and the trend of joint entropy with an increasing number of stations are very similar to the corresponding ones obtained for rain gauge observations (Figure 3(b)).

Sensitivity analysis of quantization parameters
Bin widths and probabilities of occurrence evaluated with quantization are influenced by the parameters $k$ and $a$ of Equation (3) […] shows that for rain gauge observations there is a well-defined interval of most probable $JH$ values, while for CMORPH time series the same interval is still present but more dispersed (Figure 4(b)).
The main idea of this sensitivity analysis is to find the […] and Keum et al. ().

Monitoring network optimization
The optimal monitoring network is defined with an optimization problem that combines information provided by ground-based and satellite-based observations. We take a subset of $m$ optimal rain gauges and complement it with locations chosen among CMORPH cells, with the aim of maximizing the information provided by the final network, as expressed by Equation (6). The network is considered complete when adding one more station would increase the total joint entropy by less than 1%.
The initial set of quasi-optimal rain gauges should be defined so that it provides a high amount of information that is nevertheless lower than the maximum value of joint entropy given by the existing network. In other words, if we refer to the graph of Figure 3(b), the point representing the initial subset should be located in the ascending part, before it reaches the stable value of $JH$. In this way, we ensure that the high information content of the existing network is preserved but that, at the same time, the network can be improved. To this end, we take $m = 8$ rain gauges, i.e., the first eight ranked.
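The procedure described above can be sketched in Python as follows: starting from an initial set of gauges, candidate cells are added greedily until the gain in joint entropy falls below a threshold. This is a sketch under stated assumptions. The 1% gain is taken relative to the current joint entropy, which is one possible reading of the stopping rule, and all names and toy series are hypothetical.

```python
import math
from collections import Counter


def joint_entropy(series_list):
    """Joint entropy in bits of a list of equally long discrete series."""
    n = len(series_list[0])
    counts = Counter(zip(*series_list))
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def extend_network(initial, candidates, min_gain=0.01):
    """Greedily add candidate cells to an initial set of gauges.

    Stops when the best candidate raises JH by less than min_gain
    (relative to the current JH, an assumed interpretation).
    Returns the names of the added cells and the final joint entropy.
    """
    chosen = list(initial.values())
    remaining = dict(candidates)
    added = []
    current = joint_entropy(chosen)
    while remaining:
        best = max(remaining,
                   key=lambda name: joint_entropy(chosen + [remaining[name]]))
        new = joint_entropy(chosen + [remaining[best]])
        gain = new - current
        if gain <= 0 or (current > 0 and gain / current < min_gain):
            break
        chosen.append(remaining.pop(best))
        added.append(best)
        current = new
    return added, current


# Hypothetical quantized series: one gauge and three satellite cells.
gauges = {"g1": [0, 0, 0, 0, 1, 1, 1, 1]}
cells = {
    "s1": [0, 0, 0, 0, 1, 1, 1, 1],  # duplicates g1, adds no information
    "s2": [0, 0, 1, 1, 0, 0, 1, 1],
    "s3": [0, 1, 0, 1, 0, 1, 0, 1],
}
added_cells, final_jh = extend_network(gauges, cells)
```

In this toy case the cells s2 and s3 are added, raising the joint entropy from 1 to 3 bits, while s1 is rejected because it contributes no gain.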
The results we obtained are presented in Figure 5. To complete the optimal network only eight sensors are needed, four of which are placed in the west, two in the south and the remaining two in the north-east. Looking at Figure 5(b), it can be observed that combining a set of quasi-optimal rain gauges with a set of quasi-optimal satellite cells yields a total amount of information $JH_{GS} = 9$ bits, higher than the total information provided by either the rain gauge or the CMORPH network. This value is achieved with fewer sensors. This is probably because the observations from the two datasets are less correlated than those coming from the same data source.
[…] Our study gives insights into the information content of satellite data, both from a statistical and an IT perspective. It emerged that merging observations provided by rain gauges and satellite improves the amount of information. However, more research in this direction is needed, to verify whether this result is due either to the optimal layout defined or to the combination […]
Even though the optimal network does not contain some of the existing stations located in the inner part of the catchment (Figure 5(b)), these stations should be kept operating, both to improve rainfall knowledge in the area and to meet the minimum density requirements suggested by the WMO (). It is worth noting that accuracy, understood as the deviation of the measurement from the real rainfall value, is not included in our analysis. We are aware that rainfall measurements can differ, for instance, by an average of ±8% in gauge-radar comparisons (Vieux & Vieux ), and that similar situations can occur with satellite data. We are also aware that the accuracy of rainfall estimates depends on rainfall intensity, topography and climatic conditions of the area and that the use of remote sensing products adds even more uncertainties to these estimates (Yang & Luo ).
Further considerations in this direction could be addressed in future research, e.g., adding to each cell time series black noise within a predefined range and repeating the optimization procedure, obtaining a family of Pareto fronts whose probabilistic distribution can be analysed. Finally, some limitations may arise when applying the methods in small catchments, as some of them could be characterized by localized convective rainfall events, while some others could have more uniformly distributed precipitation, depending on the climate and topography of the study area.

CONCLUSIONS
In this paper we present a method to optimize rain gauge networks using satellite observations with an entropy-based approach. The main idea is to use CMORPH precipitation records to derive information at ungauged locations and identify the most suitable locations to place new ground-based sensors, based on their information content.
To quantify the information achievable by rainfall records we applied the concept of joint entropy ($JH$) […] Finally, we combined the information coming from the two datasets to define the optimal network layout. The optimal network configuration is made of the first eight ranked existing rain gauges, complemented with eight locations chosen from CMORPH observations through an optimization problem based on the maximization of the joint entropy. The number of existing rain gauges, i.e., eight, is chosen in order to preserve the high information content of the original network, while the addition of satellite cells is stopped when the increment in joint entropy is very limited, i.e., lower than 1%. The total amount of information provided by the optimal network is higher than the corresponding value obtained in the two previous networks considering the same number of sensors. Also in this case, the most informative locations were found to be in the areas of the catchment with the highest variance. It is important to note that, although the optimal layout does not include all the existing rain gauges, we intend to preserve the excluded stations as well, in order to maintain time series accuracy and length and to obtain a robust network.
An investigation of the variance distribution over the catchment from the two datasets was also conducted, to check whether the two data sources were capturing the same rainfall variability. It emerged that, despite a difference in the absolute value, with higher values for ground-based observations, the spatial pattern of variance is the same for both data sources. The south and south-east areas and the western part of the catchment have the highest variance, probably due to the influence of topography.
In conclusion, satellite observations proved to be a powerful tool to derive rainfall records at ungauged locations to solve the network optimization problem. However, this result should be interpreted carefully and more research is needed to verify whether the significant information obtained is due either to the combination of two different and less dependent data sources or to the specific spatial configuration of sensors. Furthermore, the spatial scale of the CMORPH product, even if finer than that of other satellite-based products, still remains coarser than the resolution of rain gauges. Other considerations, such as site accessibility, are needed to precisely locate the sensors within the identified cells.