Abstract
Understanding the end-use of water is essential to a plethora of critical research in premise plumbing. However, direct access to end-use data through physical sensors is prohibitively expensive for most researchers, building owners, operators, and practitioners. Machine learning models can alleviate these costs by predicting downstream end-use events (e.g., sink, shower, dishwasher, and washing machine) from an affordable subset of upstream sensors. Choosing which upstream sensors, as well as which data preprocessing methods, are best suited for machine learning has historically been a manual process. This paper proposes a novel approach to configuring the machine learning platform systematically and automatically. The optima were determined through a Pareto analysis of the exhaustive combinations of upstream predictors and preprocessing methods. The model was trained and validated with real-world data from a house that has been extensively monitored for over a year. The analysis suggests that downstream events can be predicted effectively, with minimal overfitting error for most categories, using as few as two to four upstream sensors. This study automatically implemented highly accurate machine learning models to predict downstream features within premise plumbing systems, significantly lowering the cost of researching residential plumbing best practices such as water conservation.
HIGHLIGHTS
Physical end-use monitoring is prohibitively expensive for most practitioners.
Machine learning is a viable method of lowering the cost of end-use categorization.
Here we propose a novel approach to configuring the machine learning platform systematically and automatically.
Effective end-use prediction is possible with a small subset of upstream features.
Graphical Abstract
INTRODUCTION
Potable water in domestic plumbing is used for a variety of end-uses including bathing, clothes washing, dish washing, and toilet flushing. These different end-uses may have unique implications depending on the application. For instance, hot water consumption, unlike cold water consumption, has a significant impact on building energy consumption due to heating requirements (Griffiths-Sattenspiel & Wilson 2009). Due to the widespread use of water conservation measures and devices over the past three decades coupled with the rising number of waterborne opportunistic pathogen outbreaks, it is currently critical for researchers to establish an understanding of how water is used within buildings in order to minimize degradation of drinking water quality at the tap (Julien et al. 2020). Factors influencing the hydraulic residence time of water within a system (including how and when water is used) are known to negatively affect water quality and potentially exacerbate human health risks (Salehi et al. 2018, 2020; Ley et al. 2020). Therefore, a detailed understanding of water end-use is critical to develop appropriate building plumbing design guidelines, forecast water demand, project impacts of water-saving practices and/or devices, and to determine water age in building plumbing (Pickering et al. 2018).
Determining the end-use of water has previously required expensive data collection equipment or burdensome labor. The advent of electronic data loggers and smart meters has enabled the relatively affordable collection of household water consumption data at sub-daily resolutions, which has in turn been used to disaggregate water end-use from a single household-level flowmeter. For example, a pair of studies characterized end-use by employing household-level flowmeters and the commercial disaggregation software TraceWizard (Mayer et al. 1999; Water Research Foundation 2016). Alternative water use disaggregation software such as Identiflow is currently available; however, each of these decision-tree-based methods still requires significant labor to determine end-use (Pastor-Jabaloyes et al. 2018). Froehlich et al. (2011) demonstrated that water pressure can be used as an alternative to flowrate to disaggregate use events via a decision tree framework. However, this methodology requires extensive monitoring during model calibration that is not plausible for widespread use (Froehlich et al. 2011; Cominola et al. 2015). AutoFlow has been developed as an alternative for end-use disaggregation. It classifies water use events autonomously based on flowrate using a hidden Markov model assisted by dynamic time warping with an integrated neural network, correctly classifying nearly 90% of water use events while being significantly less labor- and time-intensive than previous software such as TraceWizard (Nguyen et al. 2015). Several recent publications have demonstrated the utility of machine learning in disaggregating smart meter data into downstream features. Gourmelon et al. (2021) evaluated the efficacy of several machine learning techniques in predicting downstream events via simulated smart meter data. Mazzoni et al. (2021), on the other hand, used intrusive data read from four households to disaggregate end-use features from an upstream smart meter via machine learning. Finally, Meyer et al. (2021) employed machine learning to determine whether an upstream water event, recorded on a smart meter, indicated an indoor or outdoor end-use event. Each of these studies demonstrated the promise of machine learning as a disaggregation technique, but the features and preprocessing metrics used were hand-chosen and idiosyncratic to the residential plumbing system of interest.
At this time, the methodologies in the literature are limited to a few case-specific systems, manually determined upstream predictors, and/or manually determined data analyses. The novelty in this study is that we seek to automate the feature and preprocessing method selection process in upstream use event disaggregation. In other words, our goal is to improve on these existing methods via an entirely data-driven framework for systematically selecting optimal (1) upstream predictors and (2) event preprocessing metrics for categorizing water end-use events in a household system via machine learning. In doing so, we will assess whether water end-use can be accurately determined using an affordable subset of upstream sensors. In addition, these analyses can then be used to improve the accuracy of future end-use event categorization machine learning algorithms.
METHODOLOGY
Machine learning analysis overview


For the purposes of this study, machine learning is formally defined as categorizing a number of target events (i.e., water end-use events) using a number of predictors (i.e., upstream sensors) by fixture type (i.e., sink, shower, dishwasher, washing machine).

A single iteration for the machine learning process, with metrics of amplitude (m1) and duration (m2). This process is repeated one time for every combination of predictors and metrics.
Sample of the complete PCA hydrograph composed of all available predictors.
In other words, a metric converts a signature into a real-valued scalar. The machine learning algorithm is then trained on a set of these labeled, disaggregated events and validated on the test set (Figure 1(h)). After all combinations of metrics and predictors are exhausted, the optimal runs are compiled and analyzed to determine the optimal upstream predictors and metrics for future end-use event categorization research.
Data source
Data was collected from the Retrofitted Net-zero Energy, Water & Waste (ReNEWW) house. The ReNEWW house is a three-bedroom, 1.5-bath single-family residence and a joint project between the Whirlpool Corporation and Purdue University. The home includes residents who live and work locally. Within the house, there are 92 sensors for temperature, volumetric flow (Figure 3), relative humidity, and usage occurrence. A data acquisition system (DAQ) records second-by-second readings of the sensors into a remote MySQL database managed by Whirlpool, with one column for each of the 92 sensors. We selected a representative subset of the data spanning September 2019 to November 2019, while the house was occupied by college students from Purdue University.
Predictor selection and preprocessing
To minimize the number of runs in the analysis, a subset of the available upstream predictors is selected for input into the machine learning algorithm based on the following criteria. First, selected predictors should be relevant to predicting end-use events. For example, the flow rate in the main water line is a pertinent predictor for flows in the kitchen sink, since flow through the latter is directly caused by the former (McCabe et al. 1967). Second, the chosen predictors are also limited by location. For example, if a future project sought to predict downstream events using upstream predictors, limiting the sensors to easily accessible locations (e.g., near a water heater or in a basement) would keep installation costs manageable. Finally, the number of predictors is also limited by cost. Researchers at Purdue University and Whirlpool Corporation spent over $100,000 to install the sensors throughout the reasonably sized ReNEWW house, an infeasible cost for the average homeowner (Salehi et al. 2020). In the end, the selected upstream predictors are the city main volumetric flow (L/s), city main temperature (°C), hot water return volumetric flow (L/s), hot water return temperature (°C), hot water heater inlet volumetric flow (L/s), hot water heater inlet temperature (°C), and total hot water volumetric flow (L/s).
Another aspect of the machine learning setup is establishing the learning targets. This study classifies upstream water usage events into their causal end-use: sink, shower, dishwasher, and washing machine events. For each run (one per combination of upstream predictors and metrics), the unstandardized dimensionality of the given set of upstream predictors is compressed into a modified hydrograph (Figure 1(c)). A hydrograph is a two-dimensional graph where the x-axis is time and the y-axis is volumetric flow. Hydrographs are commonly used to analyze flow events, such as rainfall events within a river basin or water use events in a plumbing system. To this end, principal component analysis (PCA) is performed on our dataset and the most significant component is plotted against time. The machine learning algorithm can therefore predict end-use flow events with a simple and consistent two-dimensional set of upstream predictors regardless of how many upstream predictors are in a given configuration.
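The dimensionality reduction step above can be sketched in a few lines of Python. This is a minimal illustration, assuming scikit-learn and synthetic sensor readings; the array sizes are hypothetical, not the ReNEWW data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical second-by-second readings for one predictor configuration:
# rows are timestamps, columns are the selected upstream sensors.
rng = np.random.default_rng(0)
readings = rng.random((3600, 3))  # one hour of data, three sensors

# Project the (unstandardized) multi-sensor series onto its first
# principal component, yielding a single "PCA hydrograph" series that
# can be plotted against time.
pca = PCA(n_components=1)
hydrograph = pca.fit_transform(readings).ravel()
```

Because the output is always one component, downstream steps see the same two-dimensional (time, value) shape no matter how many sensors a configuration includes.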
Our machine learning algorithm is supervised since the ReNEWW house data uniquely includes labeled target data (i.e., labeled end-use flow events). Supervised learning involves teaching a machine learning algorithm how to identify events with the aid of both the predictors and true labeled target events. This enables us to look at global flow events (i.e., when the service line has a non-zero flow) and automatically label the causal end-use flow event or events (Figure 1(d)). This provides feedback to the algorithm throughout the learning process, further increasing algorithm accuracy. Predictors that contain no label, such as the portion of the PCA hydrograph during periods of no flow from the main line, are omitted from the analysis. This assumption stands since these moments in time are delineated by zero-flow readings from the main service line. The PCA hydrograph is then divided into separate time series based on their coincident flow events.
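The delineation by zero-flow readings can be expressed as a short sketch. The flow values here are invented for illustration; only the splitting logic reflects the procedure described above:

```python
import numpy as np

# Hypothetical main-line flow series; zero readings delineate the gaps
# between global flow events.
flow = np.array([0, 0, 1.2, 1.4, 1.1, 0, 0, 0, 0.8, 0.9, 0, 0])

# Indices of non-zero readings, split wherever consecutive indices jump,
# i.e., wherever a zero-flow period intervenes. Each resulting segment
# is one coincident global flow event.
idx = np.flatnonzero(flow)
segments = np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1)
```

Here the series yields two segments, one per non-zero run, and the zero-flow periods are dropped entirely, mirroring the omission of unlabeled predictors.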
The next step entails disaggregating the PCA time series into discrete events through the density-based spatial clustering of applications with noise (DBSCAN) algorithm (Figure 1(f)) (Ester et al. 1996). Clustering algorithms, which group data into clusters (i.e., single end-use events), allow us to individually examine the temporal signature of individual end-use events. The clustering technique in use, DBSCAN, breaks each of the events (not including non-flow readings) into individual packets. DBSCAN belongs to the family of density-based clustering algorithms, which cluster points of approximately equal density together. The data is ideal for density-based clustering, since the readings within a single event are evenly spaced in time (Δt = 1 s) and the readings are tightly clustered around relatively sparse flow events. We run the algorithm with an epsilon (EPS) of 3 and minimum points (MPS) of 3 in order to remove noise readings that are less than three seconds in duration.
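A minimal sketch of this clustering step, assuming scikit-learn's DBSCAN implementation and an invented toy series (two genuine events plus a sub-three-second blip):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical PCA hydrograph: two flow events separated by a long gap,
# plus a 2-second blip that should be discarded as noise.
flow = np.zeros(120)
flow[10:40] = 1.0    # event 1
flow[50:52] = 0.5    # blip shorter than 3 s
flow[80:110] = 0.8   # event 2

# Cluster the timestamps of non-zero readings. With eps=3 and
# min_samples=3, runs shorter than three seconds are labeled noise (-1).
times = np.flatnonzero(flow).reshape(-1, 1).astype(float)
labels = DBSCAN(eps=3, min_samples=3).fit_predict(times)

n_events = len(set(labels) - {-1})
```

With these parameters the two long runs each form a cluster, while the two-second blip is labeled noise, matching the intent of EPS = 3 and MPS = 3 described above.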
With the time series disaggregated into individual flow events, metric functions are applied to each event (Figure 1(g)). The metrics should ideally capture an aspect of an event, and thus in this study included event amplitude, duration, mean slope, time until next event, max slope, min slope, and vertical variance. At this point, the raw data is suitable for training a machine learning algorithm: it is low-dimensional (i.e., two-dimensional), it is divided into discrete flow events, and it is labeled.
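The metric functions can be sketched directly. The function below is a hypothetical helper, not the paper's code; it computes the listed metrics for one clustered event given its timestamps and hydrograph values:

```python
import numpy as np

def event_metrics(t, q, t_next_start=None):
    """Scalar metrics summarizing one disaggregated flow event.

    t: 1-D array of timestamps (s); q: PCA-hydrograph values;
    t_next_start: start time of the following event, if any.
    """
    slopes = np.diff(q) / np.diff(t)  # finite-difference slopes
    return {
        "amplitude": float(q.max() - q.min()),
        "duration": float(t[-1] - t[0]),
        "mean_slope": float(slopes.mean()),
        "max_slope": float(slopes.max()),
        "min_slope": float(slopes.min()),
        "variance": float(q.var()),  # vertical variance of the signature
        "time_to_next": None if t_next_start is None
                        else float(t_next_start - t[-1]),
    }

# Toy event: a 4-second triangular pulse, with the next event at t = 60 s.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
q = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
m = event_metrics(t, q, t_next_start=60.0)
```

Each event thus becomes a fixed-length vector of scalars, which is the low-dimensional, labeled representation the learning step requires.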
Learning process
The machine learning algorithm is then trained with a subset of events and metrics and validated for overfitting with the remaining events. To do so, the data is first divided into a training and a test set, with equal proportions of each target category (i.e., shower, sink, dishwasher, and washing machine) in each. The machine learning algorithm itself, bootstrap aggregated (bagged) decision trees, consists of an ensemble of t decision trees that are trained with t bootstrapped samples of the main training sample (Breiman 1996). Bootstrapped samples are random samples, drawn with replacement, of the training dataset (Efron 1992). Once trained, the t decision trees individually predict the end-use cause of each upstream event and then vote on a single classification. For example, if there are five decision trees, and three categorize a given event as a sink event while two categorize it as a shower event, the ensemble classifies the event as a sink event. The combination of bootstrapping and ensemble classification reduces the risk of overfitting in the model (Breiman 1996). Once the tree ensembles are trained on the training set, the authors validate the ensemble model on the test set in order to measure the model's overfitting. The training and validation procedure is repeated for every combination of flow event metrics and possible upstream predictors.
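A sketch of this training-and-validation step, assuming scikit-learn (whose BaggingClassifier defaults to decision trees as base learners) and synthetic metric vectors; the cluster locations and sample counts are illustrative only:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical metric table: rows are events, columns are metric values
# (e.g., amplitude and duration); y holds the labeled end-use category.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in (0.0, 2.0, 4.0, 6.0)])
y = np.repeat(["sink", "shower", "dishwasher", "washing machine"], 50)

# Stratified split keeps equal category proportions in the training and
# test sets, mirroring the splitting strategy described above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Bagged decision trees: 25 trees, each fit on a bootstrapped sample,
# vote on each event's classification.
model = BaggingClassifier(n_estimators=25, random_state=0)
model.fit(X_tr, y_tr)
test_error = 1.0 - model.score(X_te, y_te)
```

Comparing the training error with `test_error` on held-out events is what quantifies overfitting for each predictor-and-metric configuration.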
After training and validation are completed for each combination of upstream predictors and metrics, the best subsets are chosen through non-dominated sorting. Non-dominated sorting is an algorithm that selects the best solutions out of a set of possible solutions along multiple objectives, known as a Pareto optimal set (Goldberg 1989). In this study, we aim to minimize the following objectives: total training error, total test error, maximum training error among categories, and maximum testing error among categories. These objectives take into account overall accuracy, overfitting, and performance among all fixture categories. Because a Pareto optimal set only includes optimal solutions, which predictors and metrics appear within the set will shed light on their skill as predictors of end-use events. This concept, known as innovization (Deb & Srinivasan 2006), will allow future practitioners to innovate future machine learning procedures for water end-use event prediction.
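Non-dominated sorting for a minimization problem can be sketched compactly. The objective pairs below are invented for illustration (e.g., total training error and total test error per configuration):

```python
def pareto_front(solutions):
    """Return the non-dominated subset of minimization objective vectors.

    A solution dominates another if it is no worse in every objective
    and strictly better in at least one.
    """
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical (training error %, test error %) pairs for four runs.
runs = [(0.5, 9.0), (1.0, 3.0), (2.0, 8.0), (4.0, 2.5)]
front = pareto_front(runs)
```

Here `(2.0, 8.0)` is dominated by `(1.0, 3.0)` and drops out, while the remaining three runs trade off against each other and survive; in the study the same filtering is applied over four error objectives rather than two.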
Run-time complexity
Our algorithm is designed to be executed within the capabilities of an affordable modern desktop workstation. The problem at hand, determining the optimal features and metrics, is an NP-complete problem: the search space is exponential with respect to the input (O(2^(f+m)) candidate configurations for f candidate features and m candidate metrics), and the objective function is a non-linear black-box function. Therefore, there is no known polynomial-time algorithm for finding the optimal feature and metric combination (Cormen et al. 2022). However, in this particular study, as is the case with many houses, brute-forcing the optimization problem is feasible because f and m are small. Specifically, our methodology searches among seven possible features and six possible metrics, yielding 2^(7+6) = 8,192 configurations, which is well within the capabilities of a typical desktop workstation to evaluate.
Moreover, in most premise plumbing studies the number of metrics and features is significantly smaller than the number of sensor readings n (i.e., f, m ≪ n), so the 2^f and 2^m factors in the overall run time practically behave like constants. For example, a single year of second-by-second sensor data is on the order of 10^7 readings, whereas f = 7 and m = 6 in this study. In summary, for studies such as ours with relatively small f and m, the run time behaves practically like O(n).
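The configuration count can be verified with a short enumeration. The feature and metric names below are abbreviations of those listed in the Methodology, and the count includes the empty subset of each, matching the 8,192 figure:

```python
from itertools import combinations

features = ["city main flow", "city main temp", "HW return flow",
            "HW return temp", "heater inlet flow", "heater inlet temp",
            "total HW flow"]
metrics = ["duration", "amplitude", "time to next", "mean slope",
           "max slope", "variance"]

# Every subset of the 7 features (enumerated explicitly) times every
# subset of the 6 metrics: 2**7 * 2**6 = 2**13 = 8,192 configurations.
n_feature_subsets = sum(1 for k in range(len(features) + 1)
                        for _ in combinations(features, k))
n_metric_subsets = 2 ** len(metrics)
n_configurations = n_feature_subsets * n_metric_subsets
```

Each of these configurations corresponds to one training-and-validation run, so the brute-force search remains tractable on a desktop workstation.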
RESULTS AND DISCUSSION
Machine learning results
Among the Pareto optimal solutions (Table 1, Figure 4), each unique configuration met accuracy and overfitting goals with varying degrees of success. One determinant of accuracy was the target category. All of the optimal solutions consistently achieved low training error across all categories (non-outliers lying between 0% and 8.87% error) (Figure 5). Testing error, on the other hand, varied by event category: showers and sinks were excellent (lying between 0.57% and 6.56% error) (Table 2), whereas dishwashers and washing machines performed comparatively poorly (non-outliers lying between 4.26% and 21.28% and between 12.44% and 55.24% error, respectively) (Figure 5, Table 2). The significant difference between training and testing errors suggests that the bagged decision trees tended to slightly overfit dishwashers and significantly overfit washing machines. However, a few configurations did obtain better testing accuracy for washing machines. One configuration (city main volumetric flow, total hot water volumetric flow, and amplitude) achieved a 19.62% worst-case test error rate while maintaining strong error metrics overall (0.87% total training error, 3.23% total test error, 4.52% worst-case training error) (Table 1). This suggests that these predictors and metrics are suitable for training bagged decision trees for washing machine prediction, and that further research into this specific configuration is warranted. Another configuration (hot water return volumetric flow, total hot water volumetric flow, and amplitude) achieved an excellent worst-case testing error of 12.44%, but a markedly poor worst-case training error of 68.85%. This tradeoff makes that predictor and metric combination unsuitable.
Analyzed Pareto optimal solutions ranked from least to most overfitted
| Predictors | Metrics | Total training error (%) | Total test error (%) | Worst training error among categories (%) | Worst test error among categories (%) | No. of predictors | No. of metrics | Vol. sensors | Temp. sensors | % vol. sensors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Hot water return volumetric flow, Total hot water volumetric flow | Amplitude | 6.52 | 2.24 | 68.85 | 12.44 | 2 | 1 | 2 | 0 | 100 |
City main volumetric flow, Total hot water volumetric flow | Amplitude | 0.87 | 3.23 | 4.52 | 19.62 | 2 | 1 | 2 | 0 | 100 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 0.08 | 3.37 | 1.03 | 29.19 | 2 | 6 | 2 | 0 | 100 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Max slope, Variance | 0.03 | 3.59 | 0.62 | 31.58 | 2 | 5 | 2 | 0 | 100 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 0.06 | 3.63 | 1.03 | 29.67 | 3 | 6 | 3 | 0 | 100 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.04 | 3.67 | 0.62 | 29.67 | 2 | 5 | 2 | 0 | 100 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 0.24 | 3.93 | 3.90 | 27.75 | 2 | 4 | 2 | 0 | 100 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Variance | 0.02 | 3.97 | 0.41 | 34.93 | 2 | 5 | 2 | 0 | 100 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 0.22 | 4.06 | 3.49 | 28.23 | 3 | 4 | 3 | 0 | 100 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 0.65 | 4.32 | 7.19 | 24.88 | 3 | 3 | 3 | 0 | 100 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 0.66 | 4.36 | 7.19 | 23.92 | 2 | 3 | 2 | 0 | 100 |
City main volumetric flow, Hot water return volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Max slope | 0.04 | 5.65 | 0.21 | 44.98 | 3 | 3 | 3 | 0 | 100 |
Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.01 | 6.89 | 0.12 | 53.33 | 3 | 5 | 2 | 1 | 67 |
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.00 | 6.98 | 0.00 | 55.24 | 4 | 5 | 2 | 2 | 50 |
Hot water return volumetric flow, Hot water return temperature, Hot water heater inlet temperature | Duration, Time to next event, Mean slope, Max slope | 0.00 | 7.02 | 0.00 | 54.76 | 3 | 4 | 1 | 2 | 33 |
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.00 | 7.02 | 0.00 | 54.76 | 4 | 5 | 2 | 2 | 50 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope | 0.00 | 7.02 | 0.00 | 52.86 | 4 | 4 | 3 | 1 | 75 |
City main temperature, Hot water return volumetric flow | Duration, Time to next event, Mean slope | 0.00 | 7.75 | 0.00 | 51.90 | 2 | 3 | 1 | 1 | 50 |
City main volumetric flow, City main temperature, Hot water heater inlet volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Mean slope | 0.00 | 7.75 | 0.00 | 51.90 | 4 | 3 | 3 | 1 | 75 |
City main volumetric flow, City main temperature, Hot water return temperature | Duration, Amplitude, Max slope, Variance | 0.01 | 8.18 | 0.01 | 48.94 | 3 | 4 | 1 | 2 | 33 |
City main temperature, Hot water return volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.02 | 8.48 | 0.03 | 47.14 | 2 | 5 | 1 | 1 | 50 |
City main temperature, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.01 | 8.53 | 0.01 | 46.81 | 3 | 5 | 2 | 1 | 67 |
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow | Amplitude, Time to next event, Max slope | 0.00 | 8.87 | 0.00 | 50.95 | 3 | 3 | 2 | 1 | 67 |
Complete Pareto front
| Features | Metrics | Shower training error (%) | Sinks training error (%) | Dishwasher training error (%) | Washing machine training error (%) | Shower testing error (%) | Sinks testing error (%) | Dishwasher testing error (%) | Washing machine testing error (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
Hot water return volumetric flow, Total hot water volumetric flow | Amplitude | 4.73 | 0.06 | 0.00 | 68.85 | 4.10 | 0.82 | 10.64 | 12.44 |
City main volumetric flow, Total hot water volumetric flow | Amplitude | 4.52 | 0.43 | 0.53 | 2.98 | 6.56 | 0.88 | 19.15 | 19.62 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 1.03 | 0.00 | 0.53 | 0.12 | 1.64 | 0.62 | 6.38 | 29.19 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Max slope, Variance | 0.62 | 0.00 | 0.00 | 0.00 | 1.64 | 0.57 | 8.51 | 31.58 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 1.03 | 0.00 | 0.53 | 0.00 | 2.46 | 0.83 | 6.38 | 29.67 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.62 | 0.00 | 0.00 | 0.12 | 1.64 | 0.88 | 8.51 | 29.67 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 3.90 | 0.00 | 0.53 | 0.24 | 2.46 | 1.14 | 17.02 | 27.75 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Variance | 0.41 | 0.00 | 0.00 | 0.00 | 1.64 | 0.67 | 8.51 | 34.93 |
| Features | Metrics | Training: Shower (%) | Training: Sinks (%) | Training: Dishwasher (%) | Training: Washing machine (%) | Testing: Shower (%) | Testing: Sinks (%) | Testing: Dishwasher (%) | Testing: Washing machine (%) |
|---|---|---|---|---|---|---|---|---|---|
Hot water return volumetric flow, Total hot water volumetric flow | Amplitude | 4.73 | 0.06 | 0.00 | 68.85 | 4.10 | 0.82 | 10.64 | 12.44 |
City main volumetric flow, Total hot water volumetric flow | Amplitude | 4.52 | 0.43 | 0.53 | 2.98 | 6.56 | 0.88 | 19.15 | 19.62 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 1.03 | 0.00 | 0.53 | 0.12 | 1.64 | 0.62 | 6.38 | 29.19 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Max slope, Variance | 0.62 | 0.00 | 0.00 | 0.00 | 1.64 | 0.57 | 8.51 | 31.58 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 1.03 | 0.00 | 0.53 | 0.00 | 2.46 | 0.83 | 6.38 | 29.67 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.62 | 0.00 | 0.00 | 0.12 | 1.64 | 0.88 | 8.51 | 29.67 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 3.90 | 0.00 | 0.53 | 0.24 | 2.46 | 1.14 | 17.02 | 27.75 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Variance | 0.41 | 0.00 | 0.00 | 0.00 | 1.64 | 0.67 | 8.51 | 34.93 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 3.49 | 0.01 | 0.53 | 0.12 | 1.64 | 1.24 | 19.15 | 28.23 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 7.19 | 0.18 | 1.58 | 0.95 | 1.64 | 1.86 | 21.28 | 24.88 |
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 7.19 | 0.15 | 1.58 | 1.31 | 1.64 | 2.07 | 19.15 | 23.92 |
City main volumetric flow, Hot water return volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Max slope | 0.21 | 0.04 | 0.00 | 0.00 | 2.46 | 1.60 | 6.38 | 44.98 |
Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.00 | 0.00 | 0.00 | 0.12 | 1.64 | 2.21 | 6.38 | 53.33 |
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.00 | 0.00 | 0.00 | 0.00 | 0.82 | 2.21 | 4.26 | 55.24 |
Hot water return volumetric flow, Hot water return temperature, Hot water heater inlet temperature | Duration, Time to next event, Mean slope, Max slope | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 2.26 | 4.26 | 54.76 |
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 2.26 | 4.26 | 54.76 |
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 2.47 | 4.26 | 52.86 |
City main temperature, Hot water return volumetric flow | Duration, Time to next event, Mean slope | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 3.09 | 19.15 | 51.90 |
City main volumetric flow, City main temperature, Hot water heater inlet volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Mean slope | 0.00 | 0.00 | 0.00 | 0.00 | 2.46 | 3.09 | 17.02 | 51.90 |
City main volumetric flow, City main temperature, Hot water return temperature | Duration, Amplitude, Max slope, Variance | 0.00 | 0.01 | 0.00 | 0.00 | 1.64 | 3.24 | 48.94 | 48.57 |
City main temperature, Hot water return volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.00 | 0.03 | 0.00 | 0.00 | 1.64 | 3.81 | 46.81 | 47.14 |
City main temperature, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.00 | 0.01 | 0.00 | 0.00 | 1.64 | 3.91 | 46.81 | 46.67 |
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow | Amplitude, Time to next event, Max slope | 0.00 | 0.00 | 0.00 | 0.00 | 1.64 | 3.81 | 48.94 | 50.95 |
The varying success across target types can be attributed to several factors. The number of washing machine events available for bagged decision tree training was sufficient: the dataset contained more washing machine events (1,046) than shower or dishwasher events (608 and 237, respectively). It is therefore plausible that dishwasher and washing machine events are inherently more difficult to classify on an event-by-event basis. For example, delimiting events by pauses of three seconds or greater breaks a single washing machine session into separate sub-events corresponding to its discrete cycles (i.e., prewash cycle, rinse cycle). Dishwashers similarly fill their tubs repeatedly throughout a single session (i.e., prewash, main wash, rinse), so their cycles are likewise broken into separate events. Each of these dishwasher or washing machine sessions may, as a whole, be distinguishable from a single sink event, but the individual sub-events may share characteristics with simple sink events. In addition, due to the nature of the events, the degree of overfitting also depends on the predictors and metrics chosen in a given configuration.
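The event-delimiting step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the assumed 1 Hz sampling, the three-second pause threshold, and the exact metric definitions (amplitude as peak flow, slope between consecutive readings) are our assumptions for illustration.

```python
def split_events(flow, pause_len=3):
    """Split a sequence of flow readings into events separated by
    pause_len or more consecutive zero readings (shorter pauses are
    kept inside the event)."""
    events, current, zero_run = [], [], 0
    for x in flow:
        if x > 0:
            current.append(x)
            zero_run = 0
        else:
            zero_run += 1
            if zero_run >= pause_len:
                # Close the current event, dropping the trailing pause.
                while current and current[-1] == 0:
                    current.pop()
                if current:
                    events.append(current)
                    current = []
            elif current:
                current.append(x)
    if current:
        while current and current[-1] == 0:
            current.pop()
        events.append(current)
    return events


def event_metrics(event, dt=1.0):
    """Per-event metrics analogous to those in Table 1 (assumed forms)."""
    slopes = [(b - a) / dt for a, b in zip(event, event[1:])]
    mean = sum(event) / len(event)
    return {
        "duration": len(event) * dt,
        "amplitude": max(event),
        "mean_slope": sum(slopes) / len(slopes) if slopes else 0.0,
        "max_slope": max(slopes) if slopes else 0.0,
        "variance": sum((x - mean) ** 2 for x in event) / len(event),
    }
```

Under this delimiting rule, a washing machine session with long inter-cycle pauses is indeed split into several sub-events, consistent with the difficulty described above.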
Predictor and metric analysis
The selected predictors were the greatest source of overfitting variation for dishwashers and washing machines. One moderately significant determinant of overfitting was the number of predictors selected: as it increased, the worst testing error among configurations also tended to increase, with a correlation of 0.64 between the two. Additionally, no optimal configuration used more than four predictors, even though seven were available (Table 1). According to the concept of innovization, this suggests that all solutions that are optimal with respect to maximizing accuracy and minimizing overfitting contain no more than four predictors. Furthermore, training machine learning models with fewer than four predictors in this system tends to yield the most accurate models with the least overfitting. This, in turn, implies not only that a few (i.e., fewer than four) upstream predictors can accurately categorize end-use events, but that such a small number of predictors is, in fact, better for classification with respect to overfitting. It is therefore highly plausible that a machine learning system could affordably predict end-use events in real time using a very small number of basement sensors. Conversely, no Pareto optimal solution used fewer than two predictors (Table 1), which implies there is value in a modestly diverse set of predictors.
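The correlation reported here is an ordinary Pearson correlation between predictor count and worst testing error, which can be computed as below. The data pairs are hypothetical placeholders, not values from Table 1; only the procedure is illustrated.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical (number of predictors, worst testing error %) pairs:
configs = [(2, 20.0), (2, 25.0), (3, 30.0), (4, 45.0), (5, 50.0)]
r = pearson([c[0] for c in configs], [c[1] for c in configs])
```

A positive `r`, as with the 0.64 reported in the text, indicates that configurations with more predictors tend to have a larger worst-case testing error.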
A strong correlation also exists between the selected predictor type and overfitting. The two types of predictors were volumetric flow sensors and temperature sensors, so for each optimal configuration we computed the percentage of its selected predictors that were volumetric flow sensors. This percentage correlates negatively with overfitting (i.e., the worst test error), with a value of −0.82. This suggests that volumetric flow sensors are better candidates than temperature sensors for predicting end-use events with minimal overfitting. Our best explanation of this phenomenon is as follows. Water temperature changes during use because the plumbing is flushed with water from either the city main or the water heater for cold and hot fixtures, respectively. However, it takes time for the water stored in the pipe prior to use to be displaced by hotter or colder water (Klein 2013). Short, low-volume water use events may not flush the pipes thoroughly enough to cause a distinct change in temperature. This explanation agrees with our literature review, which consistently found volumetric flow used as a predictor in machine learning models. Another piece of evidence supporting the fitness of volumetric flow metrics is their frequency in the Pareto optimal set: the four volumetric flow sensors each appear more frequently than any of the temperature predictors (Figure 6), which occur in less than 25% of the optimal solutions.
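The Pareto optimal set referred to throughout this analysis is the set of configurations not dominated on both objectives at once (lower classification error and lower overfitting). A minimal sketch of that filtering step, with hypothetical objective values, is:

```python
def pareto_front(points):
    """Return the non-dominated subset of (error, overfit) pairs,
    minimizing both objectives. A point is dominated if some other
    point is at least as good on both objectives."""
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (classification error %, overfitting %) per configuration:
candidates = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0)]
optimal = pareto_front(candidates)
```

Here `(3.0, 4.0)` is dominated by `(2.0, 3.0)` and is dropped; the remaining points trade accuracy against overfitting, as the rows of Table 1 do.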
Frequency of (a) available predictors and (b) metrics among the optimal configurations.
According to the Pareto analysis, the sensors that were most effective as upstream predictors in bagged decision trees were (in decreasing order of frequency): city main volumetric flow, hot water heater inlet volumetric flow, hot water return volumetric flow, and total hot water volumetric flow (Figure 6). Of these four sensors, city main volumetric flow and hot water heater inlet volumetric flow appeared in the vast majority of optimal solutions (Figure 6), while the remaining volumetric flow meters appeared in less than half. The remaining predictors, each with a frequency below 25%, were temperature sensors. The prevalence of city main volumetric flow and hot water heater inlet volumetric flow demonstrates their robustness as predictors in the bagged decision tree learning process; they can serve in future work as reliable end-use categorizers.
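A bagged decision tree classifier of the kind used here can be set up in a few lines; scikit-learn's `BaggingClassifier` defaults to a decision tree base estimator, matching the bagged-tree setup named in the text. The feature values and labels below are synthetic stand-ins for event metrics from the two most robust predictors, not the ReNEWW house data.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
# Synthetic (duration, max slope) metrics for two fictional end-use classes.
X = np.vstack([
    rng.normal([30.0, 2.0], 1.0, (50, 2)),  # "shower"-like events
    rng.normal([5.0, 8.0], 1.0, (50, 2)),   # "sink"-like events
])
y = np.array([0] * 50 + [1] * 50)

# Default base estimator is a decision tree, i.e., bagged decision trees.
model = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
```

With well-separated synthetic clusters like these, the ensemble fits the training set essentially perfectly; real event metrics overlap far more, which is why the testing errors in Table 1 vary by fixture.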
The frequency of the metrics among the optimal solutions varied less than the frequency of the predictors. All metrics appeared in between 45% and 85% of the optimal solutions, with event duration appearing most frequently, followed by maximum slope, amplitude, time to next event, mean slope, and variance (Figure 6). This implies that event duration, amplitude, and maximum slope, each appearing in 70–85% of the optimal solutions, are ideal metrics for quantifying upstream flow events in machine learning routines.
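The frequency tallies behind Figure 6 amount to counting how often each predictor and metric appears across the Pareto optimal configurations. A minimal sketch, using a small hypothetical subset of configurations rather than the full optimal set:

```python
from collections import Counter

# Hypothetical Pareto optimal configurations (names abbreviated):
optimal = [
    {"predictors": ["city_main_flow", "heater_inlet_flow"],
     "metrics": ["duration", "amplitude", "max_slope"]},
    {"predictors": ["city_main_flow", "hot_return_flow"],
     "metrics": ["duration", "max_slope"]},
]

pred_freq = Counter(p for cfg in optimal for p in cfg["predictors"])
metric_freq = Counter(m for cfg in optimal for m in cfg["metrics"])
```

Dividing each count by the number of optimal configurations gives the percentages plotted in Figure 6.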
Beyond the direct findings of our analysis, our framework enables future researchers to identify the predictors, metrics, and machine learning algorithm best suited to their particular application of end-use event classification: they can select the predictors, metrics, and algorithms relevant to their specific system and determine the optimal machine learning configuration with our framework.
CONCLUSIONS
This study demonstrates how to effectively determine predictors, metrics, and learning algorithms for categorizing end-use events. A Pareto analysis determined that two of the predictors were especially effective in bagged tree classification (city main volumetric flow and hot water heater inlet volumetric flow), and that two of the metrics were highly effective preprocessing methods (duration and maximum slope). Further, configurations containing a greater proportion of volumetric flowmeters outperformed temperature-based configurations, as reflected in their negative correlation with overfitting and their higher frequency in the Pareto optimal set. Innovization revealed that between two and four predictors are ideal for training bagged decision trees in this system. Overall, we achieved high accuracy with low overfitting in classifying sinks and showers, decent accuracy for the dishwasher, and poor accuracy for the washing machine (save for a few counterexamples), perhaps due to the more complex nature of dishwasher and washing machine water demands.
The framework developed can innovate future machine learning routines not only for the ReNEWW house, but also for other systems. Following the same pattern, using a different set of predictors, metrics, datasets, and machine learning models, future studies can systematically discover the innovations in their own machine learning systems. Unlike models in the existing literature that are hand-configured by experts, our data-driven approach saves practitioners labor hours of manually determining optimal predictors and preprocessing routines. Also, data-driven approaches can discover innovations, through innovization, in end-use categorization machine learning, which can then seed future research in academia and industry. For example, commercial buildings may need a different range of predictors to maximize accuracy and minimize overfitting. Such innovations can help guide the direction of future research, which could unearth the theoretical mechanisms behind these innovations.
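The exhaustive search at the heart of the framework can be sketched as enumerating every non-empty combination of candidate predictors and metrics; in a real run, each configuration would then be trained and scored. The predictor and metric names below are illustrative placeholders, and the evaluation step is omitted.

```python
from itertools import combinations

predictors = ["city_main_flow", "heater_inlet_flow", "hot_return_flow"]
metrics = ["duration", "amplitude", "max_slope"]

def all_subsets(items):
    """Yield every non-empty subset of items."""
    for r in range(1, len(items) + 1):
        yield from combinations(items, r)

# Every (predictor subset, metric subset) pair is one candidate
# configuration to train and evaluate.
configurations = [(p, m)
                  for p in all_subsets(predictors)
                  for m in all_subsets(metrics)]
# 7 predictor subsets x 7 metric subsets = 49 configurations here.
```

Because each added predictor or metric doubles the number of subsets on its side, this enumeration grows exponentially, which motivates the evolutionary search discussed below.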
Further research should determine why washing machine prediction tended to overfit, and why one configuration (the predictors of city main volumetric flow and total hot water volumetric flow with the amplitude metric) was able to avoid overfitting on washing machines. One possible approach to avoiding overfitting is examining events within multiple time windows, depending on the target fixture. For example, dishwashers, washing machines, and showers operate at different time scales than bathroom and kitchen sinks, and this knowledge can be exploited to improve learning performance. Also, in future work, a larger number of features and/or metrics may make the run-time complexity of the exhaustive search grow prohibitively, since the number of candidate configurations doubles with every feature or metric added.
Therefore, future research should use evolutionary multi-objective optimization methods to search the space of metric and feature configurations more effectively. Evolutionary methods such as NSGA-III would not only speed up the analysis, but also find a solution set that is more diverse and better converged with respect to high accuracy and low overfitting (Deb & Jain 2014).
ACKNOWLEDGEMENTS
This work was supported by the US Environmental Protection Agency (grant number R836890).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST STATEMENT
The authors declare there is no conflict.