## ABSTRACT

The effect of different temporal (from seconds to months) and spatial aggregation scales (from individual users to full urban areas) on water demand behavior has been explored to a limited degree. The effort described here extends those works by evaluating the scale effects of residential water consumption in a unique US data set that covers 10,000 households with a 1-gallon (3.79 L) hourly resolution over 2 years. A preliminary data analysis and a sequential Principal Component Analysis (PCA) are carried out to assess the effect of different temporal (weekly, daily, hourly) and spatial aggregation (individual meters and groups of 10, 100 and 1,000 meters) levels on demand. Results show that individual users act very differently from each other, and individual consumer variability is only canceled out when a significant number of households are aggregated. The implications of this finding are assessed from a hydraulic modeling perspective, as the spatiotemporal scale of measurements may condition the type of analysis that can be carried out in practice. However, additional work is needed to explore the point at which it may be worth embracing a *micro* (per fixture/household) or a *macro* (per node/network) approach for different purposes.

## INTRODUCTION

The availability of smart technologies for water supply systems has increased considerably over the last decades. One of the key advances is the emergence of Advanced Metering Infrastructure (AMI), which remotely collects water consumption data with high temporal and spatial resolution (USEPA 2023). Consumer demands condition flow and pressure throughout the network and are integral to modeling studies and real-time operational decisions.

Water demand is stochastic in nature and responsible for most of the variability in pressure or flow within a water supply system (Magini *et al.* 2008). Individual consumer demand variability becomes relatively less important when aggregating users. The effect of spatial resolution on water demands and network flows has been studied in the past. Transport mains (with few demand nodes that represent aggregated users) maintain more consistent flow rates and have higher correlation than individual consumer withdrawals (e.g., Blokker *et al.* 2008).

Water demands on a household level behave as sporadic pulses (e.g., Buchberger & Wu 1995; Blokker *et al.* 2010). These pulses are the result of each user's behavior that is individually specific and independent from other users (Díaz & González 2020). Thus, consumers' demands typically have low auto and cross-correlations (e.g., Filion *et al.* 2006; Blokker *et al.* 2008). However, as users are exposed to similar external factors (e.g., weather, work schedules), apparent correlations are seen in flow series that represent aggregated users (Díaz & González 2021, 2022). This aggregation effect permits the common modeling practice of lumping users at a single node in network hydraulic models (e.g., Kang & Lansey 2009; De Oliveira & Boccelli 2021).

In this work, two scales are differentiated: the *micro* and *macro* scales. The micro level refers to individual fixtures or households, typically associated with time scales in the order of seconds/minutes or minutes/hours, that are determined by the household end-uses (e.g., taking a shower or washing hands). The macro level corresponds to users/households that are aggregated at a node or networks/subnetworks (sets of nodes) within the system. The time scale for nodal measurements is usually the hydraulic model time step (from several minutes to hours), whereas the network temporal scale can be several minutes, days or even months depending upon application. Table 1 summarizes the micro- and macro-scale definition and properties of water consumption. The boundaries between micro and macro scales are blurry. Scales for nodes and networks/subnetworks are related to population density. For example, a node in an urban network may represent as many consumers as a network or DMA (District Metered Area) in a suburban/rural system. In some other infrastructure systems, the ‘in-between’ (i.e., user aggregation) may be called meso-scale (e.g., Li *et al.* 2023), but in this work, scales are limited to two categories (micro and macro).

| | Micro scale: Fixture | Micro scale: Household | Macro scale: Node | Macro scale: Network |
|---|---|---|---|---|
| Consumer | Individual person | Family | WDS model node | DMA or source inflow (tank, well, WTP) |
| Space | Fixture | Meter | Group of consumers | Network or subnetwork |
| Time | Seconds to minutes | Minutes to hours | WDS hydraulic time step (several minutes to hours) | WDS hydraulic time step or longer depending upon application (several minutes to days or months) |
| Measurement | Special high resolution metering | AMI | AMI aggregation or estimation | Source flow meter |

WDS, Water Distribution System; WTP, Water Treatment Plant.

Macro- and micro-scales have been studied in the past with the understanding that they are two scales representing the same reality. Linking these two domains is challenging, but necessary to understand the possibilities of hydraulic modeling applications. The aim of this work is to assess the scale effects associated with a set of consumption data to identify the type of analysis that is worthwhile to carry out in a subsequent hydraulic modeling application.

Previous works have suggested that aggregation of users impacts correlations and predictive ability. This work attempts to quantify aggregation effects from a large real data set. This novel multi-scale assessment is possible with the unique (large, homogeneous and very complete) data set for a residential area provided by the Oro Valley Water Utility (Arizona, US) and described in the next section. This analysis sheds light on the properties of water flows within a hydraulic network at alternative spatial and temporal scales, laying the groundwork to discuss the implications (usefulness and potential) of data for specific hydraulic applications, such as model construction or leakage detection. In other words, the strategy here adopted will guide the answer to the following questions: What can we do with this data set? Is the spatial/temporal resolution adequate to address a specific application (e.g., leakage detection)? Should we change the measurement strategy/resolution for that purpose? These are relevant questions given the variety of data resolutions and applications that currently coexist in the water industry (Oberascher *et al.* 2022). This work focuses on the analysis of water consumption. Subsequent modeling and/or forecasting applications are beyond the scope of this paper, but the presented analysis is meaningful to define their potential as discussed in the Implications section.

## STUDY AREA AND DATA SET

### Study area

Oro Valley is a suburban town located 10 km north of Tucson, Arizona (US) in the western foothills of Santa Catalina Mountains. In 2020, Oro Valley had 47,070 inhabitants (USCB 2023) including a number of winter visitors who reside elsewhere in the summer. The town is relatively affluent with a median household income that is 31% higher than the US median (USCB 2022a).

Oro Valley's water is largely supplied from the local aquifer that is replenished by flow from the Santa Catalina Mountains. The town also has a Colorado River water allocation and reclaims part of its wastewater for irrigation purposes. Even so, the town places a significant emphasis on water conservation (Oro Valley 2023). To that end, the Oro Valley Water Utility deployed AMI with the goal of increasing network efficiency and reducing household leakage losses. The utility commissioned Tetra Tech (2022) to perform a data analytics evaluation with the aim of identifying usage patterns, establishing usage metrics, and creating a digital dashboard for analytical support.

### Data set

Oro Valley Water Utility provides water to 20,620 AMI volumetric flow meters within its service area. Hourly consumption is recorded with 1-gallon (3.79 L) resolution. The roughly 18,000 single-family households are analyzed in this study for the period from January 1, 2019, to December 31, 2020. A preliminary analysis of these data carried out by Tetra Tech (2022) recommended discarding about 5,000 meters due to data deficiencies (e.g., duplicate meters, meters missing latitude and longitude data, meters with negative values and meters with gaps in data greater than 10 h), leaving 12,697 meters with high-quality hourly data. Consumption records have been anonymized by the water utility and are identified by a meter ID with no spatial referencing.

To assess water use at different temporal scales (weeks, days and hours), the time window is adjusted to work with full weeks: from January 7, 2019, to December 27, 2020 (103 weeks). To keep the records as complete as possible and avoid the effects of second residences or long periods of absence (e.g., winter visitors), the 2,675 meters with more than 20 null consumption days (about 3 weeks) within the 103-week period were not considered. Another 22 devices with exactly 20 null consumption days were also removed from the analysis set. This last reduction is applied to provide a data set of exactly 10,000 meters that can be conveniently grouped into subsets of 10, 100 and 1,000 units.
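The window length and meter count quoted above can be checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are illustrative, and the dates and counts are taken directly from the text.

```python
from datetime import date

# Study window stated in the text (inclusive on both ends).
start = date(2019, 1, 7)
end = date(2020, 12, 27)

n_days = (end - start).days + 1             # inclusive day count
n_weeks, leftover_days = divmod(n_days, 7)  # full weeks and any remainder
n_hours = n_days * 24

# Meter counts quoted in the text.
high_quality = 12_697
removed_over_20_null_days = 2_675
removed_exactly_20_null_days = 22
remaining = high_quality - removed_over_20_null_days - removed_exactly_20_null_days

print(n_weeks, leftover_days, n_days, n_hours, remaining)
```

The window is a whole number of weeks, which is what allows the weekly, daily and hourly analyses to share the same record.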

This data set is unique for assessing the scale effects of water consumption at demand nodes (i.e., flows at any other location within the system) for several reasons. First, it is large compared to previous studies that rarely exceed a thousand household meters (Cominola *et al.* 2015; Mazzoni *et al.* 2022). Second, all households are relatively homogeneous: similarly constructed single-family homes in a mostly residential suburban town. Third, the data set is very complete, with an average of only 0.06% of hourly values missing per meter over the study period. Also of note is that the time window includes the COVID outbreak (assumed to begin on April 1, 2020 in this work), so differences in behavior before and during the first months of the pandemic can be assessed.

Oro Valley Water Utility provided two additional pieces of anonymized metadata for each property: the binned lot size [<465 m² (5,000 ft²); between 465 and 745 m² (5,000 and 8,000 ft²); between 745 and 1,394 m² (8,000 and 15,000 ft²); >1,394 m² (15,000 ft²); or unknown] and pool availability (yes, no or unknown). The lot size is greater than 465 m² (5,000 ft²) for 85% of the households. With respect to pools, 3,099 (∼31%) of the 10,000 analyzed properties have a pool and 5,039 (∼50%) do not (1,862 households have unknown status). Finally, based on usage patterns, Tetra Tech (2022) identified that 5,091 (∼51%) of the 10,000 homes operate programmed irrigation systems.

## METHODOLOGY

To assess the stochastic nature of water consumption, its variability and its associated scale effects, the data above is analyzed in two stages: (1) preliminary data analysis and (2) Principal Component Analysis (PCA). Preliminary analysis examines the average and correlation of and between hourly, daily and weekly consumptions that are representative of short-term, medium-term and long-term variability (as defined in Díaz & González 2022), respectively. PCA is a statistical technique that reduces the data set's dimensionality while retaining as much as possible of the variation present in the original data set. Therefore, by analyzing the relative importance of the Principal Components (PCs) and their role in explaining variance, it is possible to understand demand relationships between consumers and how they might be modeled.

### Preliminary data analysis

Hourly demand values are indexed by hour *i* and meter *j*, covering the hours of the 103-week study window and the 10,000 meters. Because few data are missing, no imputation procedure was applied to estimate missing values. Data analysis included computing the mean and variance of aggregated hourly demands and correlation studies for different time intervals and spatial aggregation levels.

#### Spatial aggregation: Total consumption and average pattern

*i* are modified accordingly.

#### Temporal aggregation: Daily consumption and representative hourly consumption for different time windows

*k* and week *l*, respectively.

#### Construction of data matrices

The ‘*individual data matrices*’ collect each individual meter's consumption over time. Each column corresponds to a meter. Depending upon the temporal scale, the matrix has a different number of rows *n* (one per hour, day or week of the study window).

Each of these matrices can be manipulated to compute an equivalent ‘*group data matrix*’ that contains as many columns as the number of groups being considered. As previously noted, the 10,000 households are grouped in this work every 10 meters (1,000 groups), 100 meters (100 groups) and 1,000 meters (10 groups), leading to differently sized matrices. To avoid excessive computation times, a single set of randomly formed groups is developed in this work (i.e., meters are aggregated randomly once).
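The random grouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `group_columns`, the fixed seed and the choice to aggregate by summing the member meters are all assumptions made here.

```python
import random

def group_columns(matrix, group_size, seed=0):
    """Aggregate columns (meters) into randomly formed groups.

    `matrix` is a list of rows (time steps); each row holds one value per meter.
    Aggregation by summation is an assumption of this sketch.
    """
    n_meters = len(matrix[0])
    if n_meters % group_size != 0:
        raise ValueError("number of meters must be a multiple of group_size")
    order = list(range(n_meters))
    random.Random(seed).shuffle(order)  # meters are assigned to groups once
    groups = [order[k:k + group_size] for k in range(0, n_meters, group_size)]
    return [[sum(row[j] for j in g) for g in groups] for row in matrix]

# Toy example: 3 time steps, 6 meters, groups of 3 meters (2 groups).
X = [[1, 2, 3, 4, 5, 6],
     [0, 1, 0, 1, 0, 1],
     [2, 2, 2, 2, 2, 2]]
G = group_columns(X, group_size=3)
print(len(G), len(G[0]))
```

Whatever the random assignment, the total consumption per time step is preserved, which is a convenient check on the grouping.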

Equations similar to Equation (12) can be written for average hourly demands within each day and average hourly consumptions within each week.

Three group data matrices are computed for each temporal level for the three grouping levels $m_g$, i.e., for the hourly, daily and weekly data (with superscripts dropped for convenience in the general discussion that follows). For the sake of simplicity, hereafter, both individual and grouped data matrices will be referred to as data matrices (whose dimensions depend on the temporal and spatial levels considered).

#### Standardization of data matrices

Data matrices are ‘*standardized*’ in this work per spatial unit (e.g., per meter or group of meters) so that each unit contributes equally to the analysis. Standardization is implemented by taking the difference between the matrix values of a meter/group of meters *s* (i.e., the values in column *s* of the data matrix) and the mean consumption for that same meter/group of meters (i.e., the mean of the column-*s* values). This difference is then divided by the corresponding standard deviation (i.e., the standard deviation of the column-*s* values).

Since water consumption is standardized per spatial unit, the mean and standard deviation of each column of the standardized matrix are equal to 0 and 1, respectively.
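A minimal sketch of the per-column standardization described above is given here. The function name is illustrative, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not state which estimator is used.

```python
from statistics import mean, stdev

def standardize_columns(matrix):
    """Return z-scores per column: (value - column mean) / column std.

    Uses the sample standard deviation (n - 1 denominator), which is an
    assumption of this sketch. A constant column (zero standard deviation)
    would raise ZeroDivisionError and needs separate handling.
    """
    n_cols = len(matrix[0])
    cols = [[row[s] for row in matrix] for s in range(n_cols)]
    mus = [mean(c) for c in cols]
    sigmas = [stdev(c) for c in cols]
    return [[(row[s] - mus[s]) / sigmas[s] for s in range(n_cols)]
            for row in matrix]

# Toy data matrix: 3 time steps (rows) by 2 meters (columns).
X = [[1.0, 10.0],
     [2.0, 30.0],
     [3.0, 20.0]]
Z = standardize_columns(X)
print(Z)
```

After this step, two meters with very different consumption volumes contribute on the same scale, so only the shape of their time series matters.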

#### Correlation analysis

After standardization, all diagonal entries of the covariance matrix equal 1. Since the size of the covariance matrix varies with the spatial unit scale (individual meters or groups of meters), it is not straightforward to directly compare correlations across different spatiotemporal scales. The cumulative distribution function (CDF) of the upper triangular submatrix of this symmetric matrix is computed to facilitate comparison.
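The comparison device described above (the empirical CDF of the off-diagonal correlations) can be sketched as follows. The function names are illustrative, and dividing by n − 1 assumes the columns were standardized with the sample standard deviation.

```python
def correlation_matrix(Z):
    """Covariance of standardized columns, i.e., their correlation matrix.

    Assumes each column of Z has zero mean and unit sample standard deviation.
    """
    n, m = len(Z), len(Z[0])
    return [[sum(Z[i][a] * Z[i][b] for i in range(n)) / (n - 1)
             for b in range(m)] for a in range(m)]

def upper_triangle(R):
    """Off-diagonal entries above the diagonal (each pair counted once)."""
    m = len(R)
    return [R[a][b] for a in range(m) for b in range(a + 1, m)]

def empirical_cdf(values):
    """Sorted values paired with their cumulative probabilities."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (k + 1) / n) for k, v in enumerate(ordered)]

# Toy standardized matrix: 3 time steps by 3 spatial units.
Z = [[-1.0, -1.0, 1.0],
     [0.0, 1.0, -1.0],
     [1.0, 0.0, 0.0]]
R = correlation_matrix(Z)
cdf = empirical_cdf(upper_triangle(R))
print(cdf)
```

Because the CDF is defined on the fixed interval from −1 to 1 regardless of how many columns the matrix has, curves for 10,000 meters and for 10 groups can be overlaid directly.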

### Principal Component Analysis (PCA)

PCA linearly transforms data into a new coordinate system where the variation in the data can be described with fewer dimensions by creating new uncorrelated variables that successively maximize variance (Jolliffe & Cadima 2016).

#### Data matrices and standardization

#### PCA transformation

The ‘*scores*’ are computed as the projection of the standardized data matrix onto the new orthonormal coordinate system defined by the eigenvector matrix. This matrix represents the new space, in which PCs are selected to maximize the variance that is transferred to the scores. Each column of the eigenvector matrix represents a so-called eigenvector.

Equation (21) is the characteristic equation of the matrix that is to be decomposed and can be expanded to a polynomial form. The roots of the polynomial equation are the eigenvalues. Eigenvectors can be computed after the eigenvalues are computed as the solution of the homogeneous linear system of equations (Equation (20)). Eigenvalues and eigenvectors are sorted in the descending order of their eigenvalues.
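For reference, the standard eigen-decomposition relations that this paragraph describes can be written as below. The notation is assumed here for illustration ($\mathbf{C}$ for the matrix being decomposed, $\lambda_p$ and $\mathbf{v}_p$ for the $p$-th eigenvalue and eigenvector, $\mathbf{Z}$ for the standardized data matrix and $\mathbf{T}$ for the scores) and may differ from the symbols of the original equations.

```latex
% Homogeneous linear system solved for each eigenvector
(\mathbf{C} - \lambda_p \mathbf{I})\,\mathbf{v}_p = \mathbf{0}

% Characteristic equation, whose roots are the eigenvalues
\det(\mathbf{C} - \lambda \mathbf{I}) = 0

% Scores: projection of the standardized data onto the eigenvectors
\mathbf{T} = \mathbf{Z}\,\mathbf{V}, \qquad
\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots], \qquad
\lambda_1 \ge \lambda_2 \ge \cdots

% Share of variance explained by the p-th principal component
\text{explained variance}_p = \frac{\lambda_p}{\sum_q \lambda_q}
```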

The percentage of variance explained by each component and the specific contribution of each PC to each spatial unit's variance are useful indicators for the extent to which the model dimensionality can be reduced and give an idea of the singularity of each meter/group. Reducing the dimensionality of the problem (i.e., selecting fewer eigenvalues/eigenvectors) to reconstruct the series and/or its variance is associated with a reduction in the percentage of explained variance.
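The percentage of variance explained by the leading components can be illustrated with a small stdlib-only sketch. Note that it uses power iteration with deflation rather than solving the characteristic polynomial described above; that substitution, the function names and the starting vector are choices made here for brevity, and a production analysis would normally rely on a numerical linear algebra library.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenpair(A, iters=1000):
    """Dominant eigenpair of a symmetric positive semidefinite matrix.

    Plain power iteration; it assumes the start vector is not orthogonal
    to the dominant eigenvector.
    """
    m = len(A)
    v = [1.0 + 0.1 * k for k in range(m)]  # arbitrary non-uniform start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = matvec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:  # nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(x * y for x, y in zip(v, matvec(A, v)))  # Rayleigh quotient
    return lam, v

def explained_variance(R, n_components):
    """Percent of total variance explained by the first PCs of R."""
    m = len(R)
    A = [row[:] for row in R]
    total = sum(R[k][k] for k in range(m))  # trace = total variance
    shares = []
    for _ in range(n_components):
        lam, v = top_eigenpair(A)
        shares.append(100.0 * lam / total)
        # Deflation: remove the component that was just extracted.
        A = [[A[a][b] - lam * v[a] * v[b] for b in range(m)] for a in range(m)]
    return shares

# Two spatial units with correlation 0.8: eigenvalues are 1.8 and 0.2.
shares = explained_variance([[1.0, 0.8], [0.8, 1.0]], n_components=2)
print([round(s, 3) for s in shares])
```

In this toy case a single component captures most of the variance because the two units are strongly correlated; weakly correlated units would spread the variance over many components.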

## RESULTS

### Preliminary data analysis

#### Spatial and temporal aggregation: Daily consumption and average daily pattern

Using the Tetra Tech (2022) classification of households without and with scheduled irrigation systems, Figure 1(b) and 1(c) show daily consumption histograms for the two meter subsets. For reference, the 4,909 users without programmed irrigation systems consume 177.8 gallons/HH/day on average (672.9 L/HH/day), while the 5,091 customers with programmed irrigation systems use on average 376.7 gallons/HH/day (1,425.9 L/HH/day). Figure 1(c) shows that most of the variability in the data set is associated with outdoor use.

Figure 2(b) and 2(c) show the daily withdrawal pattern before and after the COVID-19 outbreak (assumed to begin on April 1, 2020), respectively. Due to confinement and/or mobility restrictions (Díaz *et al.* 2021), the average household consumption increased [from 287.7 gallons/day (1,089.0 L/day) to 315.9 gallons/day (1,195.8 L/day)] over the comparable time window (271 days). The temporal distribution of use also changed during COVID. Use in the community became more consistent, as seen in the decreased standard deviations (Equation (3)) and confidence interval widths (Equation (4)) after the COVID outbreak (Figure 2(c), in gray). This is captured by an increase in the average of the mean consumption and a reduction in the average of the standard deviation.
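The figures quoted in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes that the 271-day window runs from the assumed outbreak date to the end of the study window (one plausible reading of the text), and the relative increase is derived here rather than reported in the source.

```python
from datetime import date

covid_start = date(2020, 4, 1)    # outbreak date assumed in the text
window_end = date(2020, 12, 27)   # last day of the study window
days_after = (window_end - covid_start).days + 1  # inclusive count

before_gal, after_gal = 287.7, 315.9  # gallons/household/day, from the text
increase_pct = 100.0 * (after_gal - before_gal) / before_gal

print(days_after, round(increase_pct, 1))
```

Under that reading, the post-outbreak window matches the 271 days stated in the text, and the reported averages correspond to an increase of roughly one tenth.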

#### Temporal aggregation: Construction and standardization of data matrices

Some of the household consumption variability can be explained by household characteristics (i.e., lot size, pool availability and scheduled irrigation system availability). To complete this assessment, standardized individual data matrices are computed using Equations (9)–(11) and (13) and averaged within meter subsets based on lot size and the existence of a pool or scheduled irrigation system. Results are presented from weekly to hourly to progressively increase the level of detail.

The standardized values for the daily level are plotted in Figure 3(b) for the first 32 weeks of 2019 (roughly 8 months). While the frequency of the time series increases because of the daily periodicity, the trends are similar to the weekly pattern regardless of the lot size group and, as in the weekly data, amplitudes increase with the lot size. To compare summer and winter, hourly patterns are presented. The group averages of the standardized hourly values, zoomed in on two winter weeks and two summer weeks, are shown in Figure 3(c) and 3(d), respectively. The trend in variability by season and lot size is again seen in the plots, with the most significant differences associated with use amplitude. All lot sizes have similar hourly patterns with greater summer amplitudes (likely due to irrigation). Also, weekday-weekend differences are more apparent in summer (Figure 3(d); weekends correspond to 08/24/2019–08/25/2019 and 08/31/2019–09/01/2019) than in winter (Figure 3(c); weekends correspond to 01/12/2019–01/13/2019 and 01/19/2019–01/20/2019).

Supplementary material, Figure S1 shows the evolution of the group average of standardized consumption values for different time windows for homes with and without pools. Again, amplitudes are larger for households with a pool in a pattern similar to homes with larger lots and have lower variability in winter. As with different lot sizes, the standardized distributions for both groups are similar for the two summer weeks. These tendencies are seen across groups and temporal aggregation levels.

To summarize the standardized consumption findings, water consumption trends are very similar regardless of the lot size or pool availability group. Amplitudes change at consistent times in the daily and weekly patterns, implying that these factors condition the average consumption but not its evolution over time. On the other hand, programmed irrigation systems (Figure 4) affect consumption timing.

#### Correlation analysis

*et al.* (2008) and Filion *et al.* (2006), who argue that at the individual user level correlation is low. The CDF becomes more vertical around 0 (i.e., lower dependency) at shorter time intervals, and user dependencies increase with temporal aggregation over longer time intervals, which is consistent with Magini *et al.* (2008).

Figure 5(b)–5(d) show similar plots for the spatial aggregation levels (groups of 10, 100 and 1,000 meters). The three curves within each subplot (weekly, daily and hourly) shift towards the right with higher spatial aggregation levels, demonstrating that correlation between user groups increases. Even for hourly data, correlations are low for groups of 1–10 meters but substantially higher when aggregating 100–1,000 users. These results call for caution when developing a hydraulic model under the assumption that all nodes have the same average pattern (i.e., top-down approach – Blokker *et al.* 2011). That assumption requires a minimum level of aggregation in the order of hundreds or thousands of users, which may not be reached in suburban areas where nodes may represent tens of residences, as discussed in the Implications section.

### Principal Component Analysis (PCA)

PCA assesses to what extent it is possible to explain the variation in data with fewer dimensions. PCA is applied here to identify the strength of the relationship among meters at different spatiotemporal aggregation scales. It is first applied to individual meters then groups of meters to assess the spatial aggregation effect.

#### Individual meters

Figure 6(b) represents the scores at the daily level after weekly effects are removed. PC1's evolution practically mirrors PC2's, indicating that the pattern provided by the first component is partly counterbalanced by the second component. This result shows that no clear tendency exists in the data and several components must be combined to reconstruct an individual user's consumption. Figure 6(c)–6(d) show the winter and summer hourly level PCA scores, respectively, after daily effects are removed. These figures show the ‘usual’ hourly patterns represented by PC1, with morning and evening peaks that are higher in summer and low overnight values. PC1 is then adjusted by the subsequent principal components to account for the individual meter variability. Figure 4(c) and 4(d) show slightly different total use and temporal distribution on weekdays compared to weekends. Weekends tend to have lower consumption and smaller morning peaks regardless of the season. This is reflected in the PC weights, which are also lower on weekends for comparable times (Figure 6(c) and 6(d)).

Combining all households in this analysis could potentially bias the above results. However, Figure 3 and Supplementary material, Figure S1 show that the lot size and pool availability affect the amplitude of water consumption but have little impact on its temporal evolution. Therefore, conducting PCA for lot size or pool availability groupings would provide similar conclusions. In contrast, scheduled irrigation systems do appear to affect water demand timing (Figure 4), so the sequential PCA was repeated for the two samples without (4,909 households) and with (5,091 households) scheduled irrigation systems.

Similar to Figure 6, Supplementary material, Figures S2 and S3 display the evolution of PCA scores for the first and last principal components within each subset. The daily pattern of PC scores is less erratic in households without programmed irrigation (Supplementary material, Figure S2(b)) than in all homes (Figure 6(b)) and in the most erratic group, homes with scheduled irrigation (Supplementary material, Figure S3(b)). This is likely due to the variability of irrigation timing in homes with programmed irrigation. Indoor uses are likely more consistent between homes. Households without programmed irrigation presumably irrigate manually at regular intervals (e.g., weekly or at a shorter period), so their behavior is likely more similar across users than that of households with programmed irrigation. These trends are seen through PCA, which represents the periodicity of groups of residences.

Following recommended watering patterns, the morning peak is higher in homes with timed irrigation systems (Supplementary material, Figure S3) than in residences without them (Supplementary material, Figure S2). Peaks are in general lower in winter (Supplementary material, Figure S2(c) and S3(c)) than in summer (Supplementary material, Figure S2(d) and Figure S3(d)). This indicates that most scheduled irrigation activities take place in the morning and are especially intense in summer, which is consistent with Oro Valley's recommended practice.

The strong PC1 peaks for all days in homes with scheduled irrigation demonstrate its significant influence on demand. Interestingly, households without programmed irrigation in summer (Supplementary material, Figure S2(d)) and with programmed irrigation in winter (Supplementary material, Figure S3(c)) both have strong early morning weekday PC1 peaks. This may suggest that the former homes are watering in some manner early in the day and that many residents in the latter group are not turning off their irrigation systems in the fall. However, PCA can only assess relative consumption, so this hypothesis should be tested considering absolute consumption values and typical practices at the household level. In homes without programmed irrigation, PC1 reflects the decrease and wider distribution of morning demand on weekends and is adjusted by the behavior of the other PCs.

The colormaps (Supplementary material, Figure S4 – no scheduled irrigation system – and Supplementary material, Figure S5 – with scheduled irrigation system) are predominantly blue and demonstrate that a significant number of PCs are needed to explain the variability of household water consumption for these household classes. Table 2 lists the mean and variance of the percentage of variance explained by PC1, PC2 and PC3 for all meters and the subclasses without and with programmed irrigation systems. These values are equivalent to the mean and variance of the first three columns in the colormaps in Figure 8, Supplementary material, Figures S4 and S5.

| Temporal level | PC | All meters (10,000): Mean | All meters (10,000): Var | Meters without scheduled irrigation system (4,909): Mean | Meters without scheduled irrigation system (4,909): Var | Meters with scheduled irrigation system (5,091): Mean | Meters with scheduled irrigation system (5,091): Var |
|---|---|---|---|---|---|---|---|
| Weekly | PC1 | 22.9 | 456.4 | 17.0 | 317.4 | 29.0 | 523.1 |
| Weekly | PC2 | 8.8 | 142.1 | 8.8 | 134.3 | 8.7 | 145.7 |
| Weekly | PC3 | 5.3 | 68.3 | 5.1 | 62.0 | 5.5 | 75.0 |
| Daily | PC1 | 3.9 | 61.8 | 3.8 | 42.6 | 5.5 | 156.8 |
| Daily | PC2 | 2.8 | 30.7 | 1.7 | 17.3 | 3.0 | 23.0 |
| Daily | PC3 | 1.6 | 23.2 | 1.1 | 10.3 | 2.4 | 94.8 |
| Hourly | PC1 | 8.6 | 85.6 | 6.1 | 27.4 | 11.7 | 130.1 |
| Hourly | PC2 | 3.9 | 35.5 | 2.5 | 10.2 | 5.7 | 57.1 |
| Hourly | PC3 | 2.8 | 16.6 | 2.4 | 14.3 | 3.6 | 26.9 |

The mean of PC1 is higher for the scheduled irrigation meters subgroup for all temporal aggregation levels. The variance of PC1 is also considerably higher in this subclass, which is anticipated given the variability in household irrigation schedules. Trends are less clear for PC2 and PC3 at all temporal levels. However, nearly all means and variances of homes with (without) programmed irrigation are larger (smaller) than those of the all-meter class, which combines the two subclasses.

This analysis demonstrates that the variability of household water consumption is difficult to explain (i.e., requires many PCs) even for residences that have key characteristics in common (e.g., a scheduled irrigation system). It also suggests that individual residence water consumption models should include a common component with a low or medium weight and other behaviors/tweaks that may be specific to each household.

Since the water consumption of each meter is unpredictable and any possible modeling would be complicated (due to the presence of zero-consumption periods and how to treat them, the high variety of individual patterns, etc.), it is worth wondering if it is truly indispensable to model each household individually for some specific applications. If demand is characterized with the aim of analyzing the state of the hydraulic network, it may be enough to characterize the behavior on a spatially aggregated level. The following subsection analyzes the extent that statistical properties are simplified due to aggregation in the upstream direction. As the availability of a scheduled irrigation system does not seem to influence how the variance is explained, the remainder of this work will deal with the whole set of 10,000 meters.

#### Spatially aggregated users

The previous section highlighted the high number of PCs needed for demand prediction for individual households. The exploratory data analysis showed that, in general, higher spatially aggregated data has higher correlations. Given higher correlations in demand, a logical follow-up to this work is to determine if demands for spatially aggregated users can be predicted with better accuracy.

Table 3 lists the percentage of the variance explained by the first PC for different temporal and spatial aggregation levels. First, the percentage of explained variance increases with the level of spatial aggregation. Next, similar to results in Figure 7, the percentage of variance explained by PC1 is greatest for weekly temporal aggregation followed by hourly and finally daily temporal scales for the four spatial aggregation levels. Thus, for daily and hourly temporal aggregations, a higher spatial aggregation level is needed to reach the same level of PC1's explained variance in the weekly analysis.

| Temporal level | Individual meters | Every 10 meters | Every 100 meters | Every 1,000 meters |
|---|---|---|---|---|
| Weekly | 22.9 | 56.0 | 91.5 | 98.8 |
| Daily | 3.9 | 10.6 | 47.7 | 88.2 |
| Hourly | 8.6 | 37.9 | 83.2 | 96.4 |

This table shows that high percentages of explained variance (over 90%) are achieved for the weekly and hourly time scales when 1,000 meters are aggregated (the weekly scale already exceeds 90% with groups of 100 meters), whereas the daily scale reaches 88.2%. Demand aggregation is therefore promising for developing water consumption models that characterize a network's hydraulic behavior, particularly for dense urban systems with a large number of customers served through a node.

## IMPLICATIONS

### Aggregation and apparent correlation

Results show that consumption/flow series have some commonality depending on the spatial aggregation level. When the spatial aggregation level is low (i.e., *micro* scale), users have weak relationships among each other, and a significant number of principal components is needed to explain the variance of water consumption. The average consumption pattern is not related to any specific individual series because consumption at each household experiences significant variations. Correlation appears when water consumption is aggregated, as water users share similar external factors and fewer principal components are needed to explain the variability of the flow series.

In other words, random variability and individual user differences cancel out when aggregating households (i.e., *macro* scale), and the average pattern is what remains after the aggregation process. The correlation that appears (i.e., *apparent* correlation) is a result of causality that is determined by broad user behavior (Díaz & González 2022). For example, consider the morning peak, when most consumers wake, prepare for the day, and leave their homes sometime between 6:30 and 9:00 am. With a small set of users, the distribution of water use may be dispersed over the full 2.5 hours. This effect is seen in the low correlations and explained variance for individual meters and ten-user sets (Figure 5 and Table 3, respectively). As more users are included, the signal becomes more distinct and recognizable at specific times, and water use can be more readily explained. Temporal aggregation will likely have a similar impact. This hypothesis would be more appropriately tested with data recorded at shorter time intervals (i.e., below one hour).
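The emergence of apparent correlation can be sketched with a simple signal-plus-noise model. This is an illustrative assumption, not a model fitted in this work: suppose the demand of household $i$ is $d_i(t) = s(t) + \varepsilon_i(t)$, where $s(t)$ is a shared pattern with variance $\sigma_s^2$ and the $\varepsilon_i(t)$ are mutually independent individual deviations with variance $\sigma_\varepsilon^2$. The correlation between two households, and between the summed demands of two disjoint groups of $n$ households, is then

$$
\rho_1 = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_\varepsilon^2},
\qquad
\rho_n = \frac{n^2 \sigma_s^2}{n^2 \sigma_s^2 + n \sigma_\varepsilon^2}
       = \frac{n \rho_1}{1 + (n - 1)\rho_1}.
$$

For a hypothetical $\rho_1 = 0.05$, this gives $\rho_{10} \approx 0.34$, $\rho_{100} \approx 0.84$ and $\rho_{1000} \approx 0.98$: the shared pattern accumulates with $n^2$ while the individual deviations accumulate only with $n$, so nearly uncorrelated households can still yield highly correlated aggregates.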

### Micro vs macro scales and demand modeling

The data analysis above suggests that modeling/forecasting water consumption on a customer (micro scale) basis requires a flexible and versatile model that can adapt to each user (e.g., household waking time). On the other hand, modeling/forecasting for aggregated users (hundreds or thousands of users, i.e., macro scale) should be less complex and therefore more accurate.

High-resolution demand (micro) models are relatively new. In their literature review, Creaco *et al.* (2017) identified two types of household demand prediction models. Household models (first type) directly estimate residential consumption (e.g., Buchberger & Wu 1995) using high-resolution flow measurements to adjust site-specific statistical process parameters. The second type, known as end-use models, builds household consumption by summing micro-component (end-use/fixture) demands (e.g., SIMDEUM; Blokker *et al.* 2010). End-use models are based on Monte Carlo simulations that are driven by surveyed/measured/estimated end-use intensity, duration and frequency (IDF) parameters and socioeconomic information.

This work shows that variations appear in the same household at different times and between households at the same time, so a sufficient number of simulations and households are needed to accurately compute an average consumption pattern and its associated variance (Blokker *et al.* 2011; Díaz & González 2021). Further, model parameters should be specified for each customer using flow measurements or precise inhabitant information. That is, running simulations using ‘off the shelf’ IDF parameters is unlikely to be representative of a specific household, which would require adjusted IDF parameters. Thus, without precise site-specific data, end-use models are appropriate to model a significant number of aggregated customers (e.g., 100–1,000). The lack of correlation and the significant number of PCs needed according to the analysis presented here suggest that individual residence variability will result in large errors in small samples (10 or fewer).
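The dependence of end-use simulations on the number of households can be illustrated with a minimal Monte Carlo sketch in Python. It is not SIMDEUM or any published model: the IDF values in `END_USES`, the uniform event timing, the exponential event volumes, the two residents per household and the function names are hypothetical choices made only to show the mechanism.

```python
import numpy as np

# Hypothetical IDF parameters per end use (illustrative, not calibrated):
# intensity (L/min), duration (min), frequency (events per person per day).
END_USES = {
    "shower": (8.0, 8.0, 0.7),
    "toilet": (6.0, 1.0, 5.0),
    "tap": (4.0, 0.5, 12.0),
}


def simulate_household_day(rng: np.random.Generator, residents: int = 2) -> np.ndarray:
    """Return the hourly volume (L) of one household over one day, shape (24,)."""
    hourly = np.zeros(24)
    for intensity, duration, frequency in END_USES.values():
        n_events = rng.poisson(frequency * residents)
        # Uniform start hours are a simplifying assumption; real end-use
        # models draw event times from time-of-day occupancy patterns.
        start_hours = rng.integers(0, 24, size=n_events)
        # Event volumes scatter around the mean volume intensity * duration.
        volumes = rng.exponential(intensity * duration, size=n_events)
        np.add.at(hourly, start_hours, volumes)
    return hourly


def hourly_cv(n_households: int, n_days: int = 100, seed: int = 0) -> float:
    """Coefficient of variation of the hourly demand summed over households."""
    rng = np.random.default_rng(seed)
    total = np.zeros((n_days, 24))
    for _ in range(n_households):
        for day in range(n_days):
            total[day] += simulate_household_day(rng)
    return float(total.std() / total.mean())


cv = {n: hourly_cv(n) for n in (1, 10, 100)}
for n_households, value in cv.items():
    print(f"{n_households:>3} households: hourly demand CV = {value:.2f}")
```

Under these assumptions the coefficient of variation of the aggregated hourly demand falls as households are added, which is consistent with the argument that generic IDF parameters describe a large group far better than any single home.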

Finally, from the data presented here, macro models that are generally statistically based will be useful for 100–1,000 households. More data and analysis are needed for both micro and macro models on spatial aggregations between 10 and 100 households.

### Hydraulic modeling scales

Expectations of accurate demands have implications for the temporal and spatial hydraulic modeling scales, particularly for applications like leakage assessment, real-time modeling or water quality analysis. Estimating demands for one hundred or more households may be appropriate for nodes in urban areas, DMAs or small systems. Defining nodal demands in suburban areas, such as in the US, may prove more difficult with 20 or fewer homes being supplied by each node. Modeling is more complex if commercial, industrial and/or public uses (out of the scope of this paper) coexist with residential consumption (Creaco *et al.* 2017).

While estimating demands is a valuable research direction, understanding the effect of its uncertainty on model predictions will be a key driver for practitioner modeling. Model outputs of concern include pressures for real-time control and leakage management, and water quality for chlorine injection rates. All are functions of nodal demands and their spatial distribution. In an all-pipe hydraulic model, nodal demands may be driven by a small number of users and have high uncertainty. Aggregating nodes (and pipes) increases the number of consumers and reduces demand prediction errors, but aggregation introduces model representation errors. Impacts will vary for hydraulic modeling and local water quality analyses.

Cost-benefit analyses should be carried out to define the scale of interest for researchers/practitioners depending on their specific application. For example, if a hydraulic model is to be calibrated with a limited budget for instrumentation, building the hydraulic model down to each water connection may be impractical. Rather than locating flow meters at the entrance to a few households, which only characterizes what happens in those homes, installing flow meters at strategic positions that aggregate several users may provide a better representation of reality. If leakage is to be assessed (i.e., quantifying the amount of water lost), water balances on hourly or daily data may be sufficient, while leak location may require better demand resolution. As noted, the spatial aggregation level conditions the temporal resolution. Therefore, scale effects should be considered to optimize the instrumentation and monitoring strategy according to economic, technical and/or technological constraints for each application.

### Future work

Results from this work show that different types of users coexist in Oro Valley's data set. These differences occur with and without scheduled irrigation systems, but outdoor use clearly complicates water consumption patterns. This work has not attempted to isolate outdoor uses due to the relatively low measurement resolution (1 gallon and 1 hour). Previous works show that higher resolution measurements are needed to separate outdoor from indoor use (e.g., Meyer *et al.* 2021). Of interest is assessing the similarities among users for only indoor use through PCA in other data sets. Moreover, if data were not anonymized, satellite image processing (e.g., Halipu *et al.* 2022) or weather metadata (e.g., Xenochristou *et al.* 2019) could be used to correlate outdoor use to other external factors. Clustering could also be useful to strategically aggregate users (Noiva *et al.* 2016) in ways that maximize their apparent correlation.

## CONCLUSIONS

This work assesses the stochastic structure of water consumption on multiple temporal and spatial scales through the analysis of a unique residential data set. An initial preliminary analysis is conducted to examine the variability in the consumption series. Sequential PCA is then applied to assess the relationship among users at different temporal scales (weekly, daily, and hourly) and to explore the spatial aggregation effect in flow series (individual users vs groups every 10, 100, and 1,000 meters).

In this unique data set, individual household consumption is uncorrelated irrespective of the temporal scale, and correlation (i.e., apparent correlation) grows with spatial aggregation as a result of causality. Thus, modeling/forecasting water consumption per customer (i.e., *micro* scale) will require a versatile type of model that is to be adjusted to individual users. However, the effort required to build models per customer may be substantial and prohibitive when building a hydraulic model for simulating network flows within a water system. Standard parameters may be enough to get average patterns from such models on a *macro* scale, but not on a *micro* scale. Using groups of consumers (rather than individual household water demands) may be useful for some applications (such as hydraulic model construction or leakage assessment) but not for others (e.g., leakage location).

Thus, using this unique data set, water consumption/flow series are shown to have some commonality depending on the spatial aggregation level. Previous studies have discussed the aggregation effect, but the novelty of this work lies in demonstrating that demand series must be aggregated from many (perhaps hundreds or thousands) users/meters to accurately analyze/model/forecast actual demand patterns with common approaches. If aggregation is insufficient, common patterns will not exist because individual users behave randomly, and modeling and forecasting will be compromised. Therefore, the level of aggregation will determine the ability of the model to explain demand variability and may vary depending on the specific modeling needs and application characteristics.

Further research is needed to weigh the impacts of mixed uses on demand aggregation and subsequent modeling. In addition, the potential for improving predictability and understanding of users should be explored through clustering and follow-on statistical analysis. Finally, as AMI becomes more prevalent, other data sets, including those with short reporting intervals, should be examined to better understand water consumption and demand forecasting.

## ACKNOWLEDGEMENTS

The authors thank Oro Valley Water Utility and Tetra Tech for collaborating with us by providing the data and answering questions about the utility and data. S.D. thanks the Spanish Ministry of Universities for the financial support (CAS21/00392) provided to visit the University of Arizona in Fall 2022 under the Fulbright Program. M.P. thanks the Austrian Marshall Plan Foundation for financially supporting his stay at the University of Arizona in 2022–2023.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.