Understanding the end-use of water is essential to a wide range of critical research in premise plumbing. However, direct access to end-use data through physical sensors is prohibitively expensive for most researchers, building owners, operators, and practitioners. Machine learning models can alleviate these costs by predicting downstream end-use events (e.g., sink, shower, dishwasher, and washing machine) from an affordable subset of upstream sensors. Choosing which upstream sensors and data preprocessing methods are best suited to machine learning has historically been a manual process. This paper proposes a novel approach that configures the machine learning platform systematically and automatically. The optima were determined through a Pareto analysis of the exhaustive combinations of upstream predictors and preprocessing methods. The model was trained and validated with real-world data obtained from a house that has been extensively monitored for over a year. The results suggest that downstream events can be predicted effectively, with minimal overfitting error for most categories, using as few as two to four upstream sensors. This study automatically implemented highly accurate machine learning models to predict downstream features within premise plumbing systems, significantly lowering the cost of researching residential plumbing best practices such as water conservation.

  • Physical end-use monitoring is prohibitively expensive for most practitioners.

  • Machine learning is a viable method of lowering the cost of end-use categorization.

  • We propose a novel approach that systematically and automatically configures the machine learning platform.

  • Effective end-use prediction is possible with a small subset of upstream features.

Graphical Abstract

Potable water in domestic plumbing is used for a variety of end-uses including bathing, clothes washing, dish washing, and toilet flushing. These different end-uses may have unique implications depending on the application. For instance, hot water consumption, unlike cold water consumption, has a significant impact on building energy consumption due to heating requirements (Griffiths-Sattenspiel & Wilson 2009). Due to the widespread use of water conservation measures and devices over the past three decades coupled with the rising number of waterborne opportunistic pathogen outbreaks, it is currently critical for researchers to establish an understanding of how water is used within buildings in order to minimize degradation of drinking water quality at the tap (Julien et al. 2020). Factors influencing the hydraulic residence time of water within a system (including how and when water is used) are known to negatively affect water quality and potentially exacerbate human health risks (Salehi et al. 2018, 2020; Ley et al. 2020). Therefore, a detailed understanding of water end-use is critical to develop appropriate building plumbing design guidelines, forecast water demand, project impacts of water-saving practices and/or devices, and to determine water age in building plumbing (Pickering et al. 2018).

Determining the end-use of water has previously required expensive data collection equipment or burdensome labor. The advent of electronic data loggers and smart meters has enabled the relatively affordable collection of household water consumption data at sub-daily resolutions, which has in turn been used to disaggregate water end-use from a single household-level flowmeter. For example, a pair of studies characterized end-use by employing household-level flowmeters and the commercial disaggregation software TraceWizard (Mayer et al. 1999; DeOreo et al. 2016). Alternative water use disaggregation software such as Identiflow is currently available; however, each of these decision-tree-based methods still requires significant labor to determine end-use (Pastor-Jabaloyes et al. 2018). Froehlich et al. (2011) demonstrated that water pressure can be used as an alternative to flowrate to disaggregate use events via a decision tree framework. However, this methodology requires extensive monitoring during model calibration that is not plausible for widespread use (Froehlich et al. 2011; Cominola et al. 2015). AutoFlow has been developed as another alternative for end-use disaggregation: it classifies water use events autonomously based on flowrate using a hidden Markov model, assisted by dynamic time warping and an integrated neural network, correctly classifying nearly 90% of water use events while being significantly less labor- and time-intensive than previous software such as TraceWizard (Nguyen et al. 2015). Several recent publications have demonstrated the utility of machine learning in disaggregating smart meter data into downstream features. Gourmelon et al. (2021) evaluated the efficacy of several machine learning techniques in predicting downstream events via simulated smart meter data. Mazzoni et al. (2021), on the other hand, used intrusive meter data from four households to disaggregate end-use features from an upstream smart meter via machine learning. Finally, Meyer et al. (2021) employed machine learning to determine whether an upstream water event, recorded on a smart meter, indicated an indoor or outdoor end-use event. Each of these studies demonstrated the promise of machine learning as a disaggregation technique, but the features and preprocessing metrics used were hand-chosen and idiosyncratic to the residential plumbing system of interest.

At this time, the methodologies in the literature are limited to a few case-specific systems, manually determined upstream predictors, and/or data analyses. The novelty of this study is that we seek to automate the feature and preprocessing method selection process in upstream use event disaggregation. In other words, our goal is to improve on these existing methods via an entirely data-driven framework for systematically selecting optimal (1) upstream predictors and (2) event preprocessing metrics for categorizing water end-use events in a household system via machine learning. In doing so, we will determine whether water end-use can be accurately determined using an affordable subset of upstream sensors. In addition, these analyses can then be used to improve the accuracy of future end-use event categorization machine learning algorithms.

Machine learning analysis overview

The goal of the machine learning analysis is to determine the optimal upstream predictors and preprocessing methods by exhaustively iterating over the available machine learning configurations. The configurations include all possible combinations of the relevant upstream predictors ($p$) and preprocessing methods ($m$), which yields the following number of combinations (Equation (1)):

$n_{\text{configurations}} = 2^{p} \times 2^{m} = 2^{p+m}$  (1)

For the purposes of this study, machine learning is formally defined as categorizing a number of target events (i.e., water end-use events) using a number of predictors (i.e., upstream sensors) by fixture type (i.e., sink, shower, dishwasher, washing machine).
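To make the size of this search space concrete, the sketch below enumerates every configuration as a pair of predictor and metric subsets. It is a Python illustration (the study itself was implemented in MATLAB), and the shorthand predictor and metric names are assumptions for demonstration:

```python
# Enumerate every machine learning configuration (Equation (1)):
# one (predictor subset, metric subset) pair per run.
from itertools import chain, combinations

def powerset(items):
    """All subsets of items, including the empty set."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

predictors = ["city_main_flow", "city_main_temp", "hw_return_flow",
              "hw_return_temp", "hw_inlet_flow", "hw_inlet_temp",
              "total_hw_flow"]                       # p = 7 upstream sensors
metrics = ["amplitude", "duration", "mean_slope",
           "time_to_next", "max_slope", "variance"]  # m = 6 searched metrics

configs = [(preds, mets)
           for preds in powerset(predictors)
           for mets in powerset(metrics)]
assert len(configs) == 2 ** (len(predictors) + len(metrics))  # 2^13 = 8,192
```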

Measuring the efficacy of various preprocessing methods requires a degree of standardization, due to the variable number of predictors in each run configuration. To do so, all relevant predictors are combined with a principal component analysis (PCA) (Pearson 1901) (Figure 1(c)) and plotted against time, similar to a hydrograph (Figure 2). This provides every preprocessing method with a standardized two-dimensional signal, which is then disaggregated via clustering into individual events ($e$) (Figure 1(f)), each labeled with the causal target end-use event (Figure 1(d)). The standardized preprocessing methods (Figure 1(g)), which we refer to as event metrics ($m$) to avoid confusion with other preprocessing in our methodology, must also conform to a standard function format (Equation (2)):

$m(e) = x, \quad x \in \mathbb{R}$  (2)
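As an illustration, a minimal Python sketch of this function format is given below; the array layout and the two example metrics are assumptions for demonstration:

```python
# Sketch of the standard metric format m: event -> real scalar (Equation (2)).
# An "event" is assumed to be one PCA hydrograph segment: a sequence of
# (time, value) samples at 1-second spacing.
import numpy as np

def amplitude(event: np.ndarray) -> float:
    """m1: peak height of the event signature."""
    return float(event[:, 1].max())

def duration(event: np.ndarray) -> float:
    """m2: elapsed time from first to last sample, in seconds."""
    return float(event[-1, 0] - event[0, 0])

event = np.array([[0.0, 0.1], [1.0, 0.8], [2.0, 0.7], [3.0, 0.2]])
print(amplitude(event), duration(event))  # 0.8 3.0
```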
Figure 1

A single iteration for the machine learning process, with metrics of amplitude (m1) and duration (m2). This process is repeated one time for every combination of predictors and metrics.
Figure 2

Sample of the complete PCA hydrograph composed of all available predictors.

In other words, a metric converts an event signature into a real-valued scalar. A training set of these labeled, disaggregated events is then used to train the machine learning algorithm, which is validated on the test set (Figure 1(h)). After all combinations of metrics and predictors are exhausted, the optimal runs are compiled and analyzed to determine the optimal upstream predictors and metrics for future end-use event categorization research.

Data source

Data were collected from the Retrofitted Net-zero Energy, Water & Waste (ReNEWW) house. The ReNEWW house is a three-bedroom, 1.5-bath single-family residence and a joint project between the Whirlpool Corporation and Purdue University. The home includes residents who live and work locally. Within the house, there are 92 sensors for temperature, volumetric flow (Figure 3), relative humidity, and usage occurrence. A data acquisition system (DAQ) records second-by-second readings of the sensors into a remote MySQL database managed by Whirlpool; this database has 92 columns, one per sensor measured by the DAQ. We selected a representative subset of the data, spanning September 2019 to November 2019, while the house was occupied by college students from Purdue University.
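A hedged sketch of pulling the study window from such a database follows; the connection string, table, and column names are hypothetical, since the actual ReNEWW database schema is not published:

```python
# Hypothetical example of retrieving the September-November 2019 window.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@host/reneww")  # placeholder DSN

query = """
    SELECT *
    FROM sensor_readings              -- hypothetical table with 92 sensor columns
    WHERE reading_time BETWEEN '2019-09-01' AND '2019-11-30'
"""
readings = pd.read_sql(query, engine, parse_dates=["reading_time"])
print(readings.shape)  # roughly one row per second over the three-month window
```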

Figure 3

The ReNEWW house plumbing, sensing, and data acquisition systems.

Predictor selection and preprocessing

To minimize the number of runs in the analysis, a subset of the available upstream predictors is selected for input into the machine learning algorithm based on the following criteria. First, selected predictors should be relevant to predicting end-use events. For example, the flow rate in the main water line is a pertinent predictor of flows in the kitchen sink, since flow through the latter is directly supplied by the former (McCabe et al. 1967). Second, the chosen predictors are also limited by location. For example, if a future project sought to predict downstream events using upstream predictors, limiting the sensors to easily accessible locations (e.g., near a water heater or in a basement) would keep installation costs manageable. Finally, the number of predictors is also limited by cost. Researchers at Purdue University and Whirlpool Corporation spent over $100,000 to install the sensors throughout the reasonably sized ReNEWW house, which is an infeasible cost for the average homeowner (Salehi et al. 2020). In the end, the selected upstream predictors are the city main volumetric flow (L/s), city main temperature (°C), hot water return volumetric flow (L/s), hot water return temperature (°C), hot water heater inlet volumetric flow (L/s), hot water heater inlet temperature (°C), and total hot water volumetric flow (L/s).

Another aspect of the machine learning setup is establishing the learning targets. This study classifies upstream water usage events into their causal end-uses, which include sink, shower, dishwasher, and washing machine events. For each run (one per combination of upstream predictors and metrics), the unstandardized dimensionality of the given set of upstream predictors is compressed into a modified hydrograph (Figure 1(c)). A hydrograph is a two-dimensional graph where the x-axis is time and the y-axis is volumetric flow. Hydrographs are commonly used to analyze flow events, such as rainfall events within a river basin or water use events in a plumbing system. Therefore, PCA is performed on our dataset and the most significant component is plotted against time. The machine learning algorithm can therefore predict end-use flow events with a simple and consistent two-dimensional set of upstream predictors, regardless of how many upstream predictors are in a given configuration.
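A minimal sketch of this compression step, using scikit-learn's PCA in place of the study's MATLAB implementation, is shown below:

```python
# Compress a predictor subset into a one-dimensional "PCA hydrograph":
# the most significant principal component plotted against time.
import numpy as np
from sklearn.decomposition import PCA

def pca_hydrograph(readings: np.ndarray) -> np.ndarray:
    """readings: (n_seconds, n_predictors) matrix for one configuration.
    Returns an (n_seconds,) signal regardless of the predictor count."""
    return PCA(n_components=1).fit_transform(readings).ravel()

rng = np.random.default_rng(0)
demo = rng.random((3600, 3))          # one hour of 1 Hz data, three predictors
print(pca_hydrograph(demo).shape)     # (3600,)
```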

Our machine learning algorithm is supervised, since the ReNEWW house data uniquely includes labeled target data (i.e., labeled end-use flow events). Supervised learning involves teaching a machine learning algorithm how to identify events with the aid of both the predictors and true labeled target events. This enables us to look at global flow events (i.e., when the service line has a non-zero flow) and automatically label the causal end-use flow event or events (Figure 1(d)). This provides feedback to the algorithm throughout the learning process, further increasing algorithm accuracy. Predictor readings that carry no label, such as the portions of the PCA hydrograph during periods of no flow from the main line, are omitted from the analysis. This assumption stands since these moments in time are delineated by zero-flow readings from the main service line. The PCA hydrograph is then divided into separate time series based on their coincident flow events.
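The delineation of global flow events by zero-flow readings can be sketched as follows (an illustrative Python version; the exact labeling logic of the study's MATLAB pipeline may differ):

```python
# Find contiguous intervals of non-zero main-line flow; each interval is a
# global flow event that inherits the end-use label(s) logged for that span.
import numpy as np

def global_events(main_flow: np.ndarray):
    """Return (start, stop) sample indices of non-zero main-line flow runs."""
    active = np.r_[False, main_flow > 0, False]         # pad so runs have edges
    edges = np.flatnonzero(np.diff(active.astype(int)))
    return list(zip(edges[::2], edges[1::2]))           # rises pair with falls

main = np.array([0, 0, 0.2, 0.5, 0.4, 0, 0, 0.1, 0, 0])
print(global_events(main))  # [(2, 5), (7, 8)]
```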

The next step entails disaggregating the PCA time series into discrete events through the density-based spatial clustering of applications with noise (DBSCAN) algorithm (Figure 1(f)) (Ester et al. 1996). Clustering algorithms, which group data into clusters (i.e., single end-use events), allow us to individually examine the temporal signature of individual end-use events. The clustering technique in use, DBSCAN, breaks each of the events (not including non-flow readings) into individual packets. DBSCAN belongs to the family of density-based clustering algorithms, which cluster points of approximately equal density together. The data are ideal for density-based clustering, since the readings within a single event are evenly spaced in time (Δt = 1 second) and are tightly clustered around relatively sparse flow events. We run the algorithm with an epsilon (EPS) of 3 and minimum points (MPS) of 3 in order to remove noise readings that are less than three seconds in duration.
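A small sketch of this clustering step, using scikit-learn's DBSCAN with the study's parameters, illustrates how sub-events separate and how short blips are discarded as noise:

```python
# Disaggregate non-zero samples of a PCA hydrograph with DBSCAN
# (eps = 3, min_samples = 3, as in the study).
import numpy as np
from sklearn.cluster import DBSCAN

# Toy (time, PCA value) samples: two events plus an isolated 1-second blip.
times = np.r_[np.arange(0, 10), np.arange(20, 26), [40]]
values = np.r_[np.full(10, 0.8), np.full(6, 0.5), [0.3]]
points = np.column_stack([times, values]).astype(float)

labels = DBSCAN(eps=3, min_samples=3).fit_predict(points)
print(labels)  # clusters 0 and 1; the lone blip is labeled noise (-1)
```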

With the time series disaggregated into individual flow events, metric functions are applied to each event (Figure 1(g)). Each metric should ideally capture a distinct aspect of an event; in this study the metrics included event amplitude, duration, mean slope, time until next event, max slope, min slope, and vertical variance. At this point, the raw data are suitable for training a machine learning algorithm: they are low-dimensional (i.e., two-dimensional), divided into discrete flow events, and labeled.
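A sketch of converting one clustered event into its metric vector follows; the slope and variance definitions are assumptions, since the paper does not give closed forms:

```python
# Map a clustered event (t, y samples) to its seven metric values.
import numpy as np

def metric_vector(event: np.ndarray, next_start: float) -> np.ndarray:
    t, y = event[:, 0], event[:, 1]
    slopes = np.diff(y) / np.diff(t)   # events shorter than 3 s were removed upstream
    return np.array([
        y.max(),             # amplitude
        t[-1] - t[0],        # duration
        slopes.mean(),       # mean slope
        next_start - t[-1],  # time until next event
        slopes.max(),        # max slope
        slopes.min(),        # min slope
        y.var(),             # vertical variance
    ])

evt = np.array([[0.0, 0.2], [1.0, 0.9], [2.0, 0.6]])
print(metric_vector(evt, next_start=30.0))
```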

Learning process

The machine learning algorithm is then trained with a subset of events and metrics and validated for overfitting with the remaining events. To do so, the data are first divided into a training and a test set, with equal proportions of each target category (i.e., shower, sink, dishwasher, and washing machine) in each. The machine learning algorithm itself, bootstrap aggregated (bagged) decision trees, consists of an ensemble of t decision trees that are trained with t bootstrapped samples of the main training sample (Breiman 1996). Bootstrapped samples are random samples of the training dataset drawn with replacement (Efron 1992). Once trained, the t decision trees individually predict the end-use cause of each upstream event and then vote on a single classification. For example, if there are five decision trees, and three categorize a given event as a sink event while two categorize it as a shower event, the ensemble classifies the event as a sink event. The combination of bootstrapping and ensemble classification reduces the risk of overfitting in the model (Breiman 1996). Once the tree ensembles are trained on the training set, we validate the new ensemble model on the test set in order to measure the model's overfitting. The training and validation procedure is repeated for every combination of flow event metrics and possible upstream predictors.
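The split-train-validate loop can be sketched as follows, with scikit-learn's BaggingClassifier (version 1.2 or later) standing in for MATLAB's bagged tree ensemble and random data standing in for the event metric vectors:

```python
# Stratified train/test split and bagged decision trees.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.random((500, 6))   # one metric vector per disaggregated event
y = rng.choice(["sink", "shower", "dishwasher", "washing machine"], size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0   # equal class proportions
)
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=50  # t bootstrapped trees vote
).fit(X_tr, y_tr)

train_err = 1 - model.score(X_tr, y_tr)  # comparing training and test error
test_err = 1 - model.score(X_te, y_te)   # gauges the degree of overfitting
print(train_err, test_err)
```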

After training and validation are completed for each combination of upstream predictors and metrics, the best subsets are chosen through non-dominated sorting. Non-dominated sorting is an algorithm that selects the best solutions out of a set of possible solutions along multiple objectives, known as a Pareto optimal set (Goldberg 1989). In this study, we aim to minimize the following objectives: total training error, total test error, maximum training error among categories, and maximum test error among categories. These objectives take into account overall accuracy, overfitting, and performance across all fixture categories. Because a Pareto optimal set includes only optimal solutions, the predictors and metrics that appear within the set shed light on their skill as predictors of end-use events. This concept, known as innovization (Deb & Srinivasan 2006), will allow practitioners to refine future machine learning procedures for water end-use event prediction.
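A compact sketch of the non-dominated filter over the four objectives is given below (a generic Pareto filter, not the study's exact implementation):

```python
# Keep configurations not dominated on all four minimization objectives:
# total training error, total test error, worst-category training error,
# and worst-category test error.
import numpy as np

def pareto_front(objectives: np.ndarray) -> np.ndarray:
    """objectives: (n_configs, 4) matrix; returns indices of non-dominated rows."""
    keep = []
    for i, row in enumerate(objectives):
        dominated = np.any(
            np.all(objectives <= row, axis=1) & np.any(objectives < row, axis=1)
        )
        if not dominated:
            keep.append(i)
    return np.array(keep)

scores = np.array([[0.9, 3.2, 4.5, 19.6],
                   [6.5, 2.2, 68.9, 12.4],
                   [7.0, 4.0, 70.0, 20.0]])
print(pareto_front(scores))  # [0 1]; the third configuration is dominated
```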

Run-time complexity

Our algorithm is designed to be executed within the capabilities of an affordable modern desktop workstation. The problem at hand, determining optimal fixtures and metrics, is an NP-complete problem because the search space is exponential with respect to the input ($2^{p+m}$ configurations), and the objective function is a non-linear black-box function. Therefore, there is no known polynomial-time algorithm for finding the optimal feature and metric combination (Cormen et al. 2022). However, in this particular study, as is the case with many houses, brute-forcing the optimization problem is a feasible option since $p$ and $m$ are small values. Specifically, our methodology searches among seven different possible features and six different metrics, yielding 8,192 ($2^{13}$) configurations, which is well within the capabilities of a typical desktop workstation to evaluate.

To evaluate each of these configurations, the data must be preprocessed, bootstrapped, trained, and then evaluated for testing error. Feature selection (Figure 1(b)) is manual and poses no significant effect on run time. The MATLAB implementation of PCA employed in this study (Figure 1(c)) uses singular value decomposition (MathWorks, Inc. 2020), which has a run-time of $O(nd^2)$, where $n$ is the number of sensor readings inputted and $d$ is the number of features. In our case, since the number of sensor readings is always far higher than the (small, constant) number of features, the run-time complexity of PCA is $O(n)$. The labeling of events (Figure 1(d)) takes a run time of $O(n)$, since each event must be compared with the labeled dataset of flow types. For the clustering step (Figure 1(f)), DBSCAN has a run-time complexity of $O(n \log n)$ (Ester et al. 1996). In the calculating metrics step (Figure 1(g)), extra care was taken to ensure that each metric algorithm had a reasonable run-time. In this particular study, all metrics had a run-time complexity of $O(n)$, which leads to an overall complexity of $O(mn)$ for all metric calculations. In the training stage (Figure 1(h)), bootstrapping occurs a constant number of times per configuration evaluation, and in MATLAB the training of each tree takes $O(dn \log n)$, where $d$ is the number of dimensions trained (MathWorks, Inc. 2020). This complexity simplifies to $O(n \log n)$, since $d \le m \ll n$ in our case. Finally, each ensemble of trees was assessed against testing data (Figure 1(i)), which has a complexity of $O(n \log n)$ as well (MathWorks, Inc. 2020). Therefore, one evaluation of a given feature and metric configuration would have the following verbose complexity (Equation (3)):

$O(nd^2) + O(n) + O(n \log n) + O(mn) + O(n \log n) + O(n \log n)$  (3)

which simplifies to Equation (4):

$O(mn + n \log n)$  (4)

Therefore, the entire routine of searching and evaluating the different metrics and features will have the following run-time complexity (Equation (5)):

$O\left(2^{p+m}\,(mn + n \log n)\right)$  (5)

However, considering that in most premise plumbing studies the number of metrics and features is significantly less than the number of sensor readings (i.e., $m, p \ll n$), the $m$ and $2^{p+m}$ terms in Equation (5) practically behave like constants. For example, a single year of second-by-second sensor data is on the order of magnitude of $10^7$ readings, whereas $p = 7$ and $m = 6$ in this study. In summary, for studies such as ours with relatively insignificant $p$ and $m$ values, the run time behaves practically more like $O(n \log n)$.

Machine learning results

Among the Pareto optimal solutions (Table 1, Figure 4), each unique configuration met the accuracy and overfitting goals with varying degrees of success. One determinant of accuracy was the target category. All of the optimal solutions consistently achieved low training error across all categories (non-outliers lying between 0% and 8.87% error) (Figure 5). Testing error, on the other hand, varied by event category: showers and sinks performed excellently (between 0.57% and 6.56% error) (Table 2), whereas dishwashers and washing machines performed comparatively poorly (non-outliers between 4.26% and 21.28% and between 12.44% and 55.24% error, respectively) (Figure 5, Table 2). The significant difference between training and testing errors suggests that the bagged decision trees tended to slightly overfit dishwashers and significantly overfit washing machines. However, a few solutions among the configurations did obtain better testing accuracy for washing machines. One configuration (city main volumetric flow, total hot water volumetric flow, and amplitude) achieved a 19.62% worst-case test error rate while maintaining strong error metrics overall (0.87% total training error, 3.23% total test error, 4.52% worst-case training error) (Table 1). This suggests that these predictors and metrics are suitable for training bagged decision trees for washing machine prediction, indicating that further research into this specific configuration is warranted. Another configuration (hot water return volumetric flow, total hot water volumetric flow, and amplitude) achieved an excellent worst-case testing error of 12.44%, but a significantly poor training error of 68.85%. This tradeoff renders this predictor and metric combination unsuitable.

Table 1

Analyzed Pareto optimal solutions ranked from least to most overfitted

Predictors | Metrics | Total error (%): training, test | Worst category error (%): training, test | Volumetric flow sensors (% of predictors)
Hot water return volumetric flow, Total hot water volumetric flow | Amplitude | 6.52, 2.24 | 68.85, 12.44 | 100
City main volumetric flow, Total hot water volumetric flow | Amplitude | 0.87, 3.23 | 4.52, 19.62 | 100
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 0.08, 3.37 | 1.03, 29.19 | 100
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Max slope, Variance | 0.03, 3.59 | 0.62, 31.58 | 100
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 0.06, 3.63 | 1.03, 29.67 | 100
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.04, 3.67 | 0.62, 29.67 | 100
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 0.24, 3.93 | 3.90, 27.75 | 100
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Variance | 0.02, 3.97 | 0.41, 34.93 | 100
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 0.22, 4.06 | 3.49, 28.23 | 100
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 0.65, 4.32 | 7.19, 24.88 | 100
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 0.66, 4.36 | 7.19, 23.92 | 100
City main volumetric flow, Hot water return volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Max slope | 0.04, 5.65 | 0.21, 44.98 | 100
Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.01, 6.89 | 0.12, 53.33 | 67
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.00, 6.98 | 0.00, 55.24 | 50
Hot water return volumetric flow, Hot water return temperature, Hot water heater inlet temperature | Duration, Time to next event, Mean slope, Max slope | 0.00, 7.02 | 0.00, 54.76 | 33
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.00, 7.02 | 0.00, 54.76 | 50
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope | 0.00, 7.02 | 0.00, 52.86 | 75
City main temperature, Hot water return volumetric flow | Duration, Time to next event, Mean slope | 0.00, 7.75 | 0.00, 51.90 | 50
City main volumetric flow, City main temperature, Hot water heater inlet volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Mean slope | 0.00, 7.75 | 0.00, 51.90 | 75
City main volumetric flow, City main temperature, Hot water return temperature | Duration, Amplitude, Max slope, Variance | 0.01, 8.18 | 0.01, 48.94 | 33
City main temperature, Hot water return volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.02, 8.48 | 0.03, 47.14 | 50
City main temperature, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.01, 8.53 | 0.01, 46.81 | 67
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow | Amplitude, Time to next event, Max slope | 0.00, 8.87 | 0.00, 50.95 | 67
Table 2

Complete Pareto front

Features | Metrics | Training error (%): shower, sinks, dishwasher, washing machine | Testing error (%): shower, sinks, dishwasher, washing machine
Hot water return volumetric flow, Total hot water volumetric flow | Amplitude | 4.73, 0.06, 0.00, 68.85 | 4.10, 0.82, 10.64, 12.44
City main volumetric flow, Total hot water volumetric flow | Amplitude | 4.52, 0.43, 0.53, 2.98 | 6.56, 0.88, 19.15, 19.62
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 1.03, 0.00, 0.53, 0.12 | 1.64, 0.62, 6.38, 29.19
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Max slope, Variance | 0.62, 0.00, 0.00, 0.00 | 1.64, 0.57, 8.51, 31.58
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope, Variance | 1.03, 0.00, 0.53, 0.00 | 2.46, 0.83, 6.38, 29.67
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.62, 0.00, 0.00, 0.12 | 1.64, 0.88, 8.51, 29.67
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 3.90, 0.00, 0.53, 0.24 | 2.46, 1.14, 17.02, 27.75
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Variance | 0.41, 0.00, 0.00, 0.00 | 1.64, 0.67, 8.51, 34.93
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope, Variance | 3.49, 0.01, 0.53, 0.12 | 1.64, 1.24, 19.15, 28.23
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 7.19, 0.18, 1.58, 0.95 | 1.64, 1.86, 21.28, 24.88
City main volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Max slope | 7.19, 0.15, 1.58, 1.31 | 1.64, 2.07, 19.15, 23.92
City main volumetric flow, Hot water return volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Max slope | 0.21, 0.04, 0.00, 0.00 | 2.46, 1.60, 6.38, 44.98
Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.00, 0.00, 0.00, 0.12 | 1.64, 2.21, 6.38, 53.33
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Amplitude, Time to next event, Mean slope, Max slope | 0.00, 0.00, 0.00, 0.00 | 0.82, 2.21, 4.26, 55.24
Hot water return volumetric flow, Hot water return temperature, Hot water heater inlet temperature | Duration, Time to next event, Mean slope, Max slope | 0.00, 0.00, 0.00, 0.00 | 1.64, 2.26, 4.26, 54.76
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope, Variance | 0.00, 0.00, 0.00, 0.00 | 1.64, 2.26, 4.26, 54.76
City main volumetric flow, Hot water return volumetric flow, Hot water heater inlet temperature, Total hot water volumetric flow | Duration, Time to next event, Mean slope, Max slope | 0.00, 0.00, 0.00, 0.00 | 1.64, 2.47, 4.26, 52.86
City main temperature, Hot water return volumetric flow | Duration, Time to next event, Mean slope | 0.00, 0.00, 0.00, 0.00 | 1.64, 3.09, 19.15, 51.90
City main volumetric flow, City main temperature, Hot water heater inlet volumetric flow, Total hot water volumetric flow | Duration, Time to next event, Mean slope | 0.00, 0.00, 0.00, 0.00 | 2.46, 3.09, 17.02, 51.90
City main volumetric flow, City main temperature, Hot water return temperature | Duration, Amplitude, Max slope, Variance | 0.00, 0.01, 0.00, 0.00 | 1.64, 3.24, 48.94, 48.57
City main temperature, Hot water return volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.00, 0.03, 0.00, 0.00 | 1.64, 3.81, 46.81, 47.14
City main temperature, Hot water return volumetric flow, Hot water heater inlet volumetric flow | Duration, Amplitude, Mean slope, Max slope, Variance | 0.00, 0.01, 0.00, 0.00 | 1.64, 3.91, 46.81, 46.67
City main volumetric flow, Hot water return temperature, Hot water heater inlet volumetric flow | Amplitude, Time to next event, Max slope | 0.00, 0.00, 0.00, 0.00 | 1.64, 3.81, 48.94, 50.95
Figure 4

Pareto optimal configurations of the machine learning problem.
Figure 5

Training and testing error of optimal solutions.

The varying success across target types can be attributed to several causes. The number of washing machine training events given to the bagged decision tree training was sufficient, considering the dataset contained more washing machine events (1,046) than shower or dishwasher events (608 and 237, respectively). It is plausible that the nature of dishwasher and washing machine events is inherently more difficult to classify on an event-by-event basis. For example, delimiting events by pauses of three seconds or greater would break a single washing machine session into separate sub-events due to the various discrete cycles of a washing machine (e.g., prewash cycle, rinse cycle). Dishwashers similarly fill their tubs repeatedly throughout a single session (e.g., prewash, main wash, rinse), and their cycles would therefore be broken into separate events. Each of these dishwasher or washing machine sessions may as a whole be distinguishable from a single sink event, but the individual sub-events may hold similar characteristics to simple sink events. In addition, due to the nature of the events, the degree of overfitting also depends on the predictors and metrics chosen in the given configuration.

Predictor and metric analysis

The predictors selected were the greatest sources of overfitting variation for dishwashers and washing machines. For example, one moderately significant determinant of overfitting was how many predictors were selected. As the number of predictors increased, the worst testing error among configurations also tended to increase; these two sets of values had a correlation of 0.64. Additionally, no optimal configuration had more than four selected predictors, even though seven predictors were available (Table 1). This, according to the concept of innovization, suggests that all optimal solutions with respect to maximizing accuracy and minimizing overfitting will contain no more than four predictors. Furthermore, training machine learning models with fewer than four predictors in this system tends to produce the most accurate models with the lowest overfitting. This, in turn, implies that not only can a few (i.e., fewer than four) upstream predictors accurately categorize end-use events, but such a low number of predictors is, in fact, better for classification with respect to overfitting. Therefore, it is highly plausible to design a machine learning system that affordably predicts end-use events in real time via a very low number of sensors in the basement. Conversely, no solution among the Pareto optimal solutions used fewer than two predictors (Table 1), which also implies the utility of using a slightly diverse set of predictors.
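The correlation check behind this observation can be sketched as follows; the predictor counts and errors below are illustrative values read off a handful of Table 1 rows rather than the full Pareto set, so the resulting coefficient only approximates the reported 0.64:

```python
# Innovization-style check: does predictor count track worst-category test error?
import numpy as np

n_predictors = np.array([2, 2, 2, 3, 3, 4, 5, 5])       # per Pareto configuration
worst_test_err = np.array([12.44, 19.62, 29.19, 29.67,
                           44.98, 52.86, 54.76, 55.24])  # from Table 1

r = np.corrcoef(n_predictors, worst_test_err)[0, 1]
print(f"Pearson r = {r:.2f}")  # positive: more predictors, more overfitting
```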

A strong correlation also exists between the selected predictor type and overfitting. The two types of predictors were volumetric flow sensors and temperature sensors, so each optimal configuration contained a certain percentage of volumetric flow predictors. This percentage correlates negatively with overfitting (i.e., the worst test error), with a value of −0.82. This suggests that volumetric flow sensors are fitter candidates than temperature sensors for predicting end-use events with minimal overfitting. Our best explanation of this phenomenon is as follows. Water temperature changes during use because the plumbing is flushed with water from either the city main or the water heater, for cold and hot fixtures respectively. However, it takes time for the water stored in the pipe prior to use to be displaced by hotter or colder water (Klein 2013). Short, low-volume water use events may not flush the pipes thoroughly enough to cause a distinct change in temperature. This theory agrees with our literature review, which consistently included volumetric flow as a predictor in machine learning models. Another piece of evidence supporting the fitness of volumetric flow predictors is their frequency in the Pareto optimal set: the four volumetric flow sensors each appear more frequently than any of the temperature predictors (Figure 6), which occur in less than 25% of the optimal solutions.

Figure 6

Frequency of (a) available predictors and (b) metrics among the optimal configurations.

According to the Pareto analysis, the sensors that were most effective as upstream predictors in bagged decision trees were (in order of decreasing frequency): city main volumetric flow, followed by hot water heater inlet volumetric flow, hot water return volumetric flow, and total hot water volumetric flow (Figure 6). Of these four sensors, city main volumetric flow and hot water heater inlet volumetric flow appeared in the vast majority of optimal solutions (Figure 6), while the remaining volumetric flowmeters appeared in less than half of the optimal solutions. The remaining predictors, each with a frequency of less than 25%, were temperature sensors. The frequency of city main volumetric flow and hot water heater inlet volumetric flow demonstrates their robustness as predictors in the bagged decision tree learning process, and they can be used in future work as reliable end-use categorizers.

The frequency of the metrics among the optimal solutions was less varied than the frequency of predictors. All metrics appeared in between 45% and 85% of the optimal solutions, with event duration appearing most frequently, followed by maximum slope, amplitude, time until next event, mean slope, and vertical variance (Figure 6). This implies that event duration, amplitude, and maximum slope, each appearing in between 70% and 85% of the optimal solutions, are ideal metrics for quantifying upstream flow events in machine learning routines.

Beyond the direct findings of our analysis, our framework enables future researchers to discover the predictors, metrics, and machine learning algorithm best suited to their particular application of end-use event classification. In future work, researchers can choose the relevant predictors, metrics, and algorithms for their specific system and determine their optimal machine learning configuration with our framework.

This study demonstrates how to effectively determine predictors, metrics, and learning algorithms while categorizing end-use events. A Pareto analysis determined that two of the predictors were especially effective in bagged tree classification (city main volumetric flow and hot water heater inlet volumetric flow), and that two of the metrics were highly effective preprocessing methods (duration and max slope). Further, configurations containing a greater proportion of volumetric flowmeters outperformed those relying on temperature sensors, as shown by the negative correlation between that proportion and overfitting and by the flowmeters' higher frequency in the Pareto optimal set. Innovization discovered that between two and four predictors are ideal for training bagged decision trees on this system. Overall, we were able to achieve high accuracy with low overfitting in classifying sinks and showers, decent accuracy in classifying the dishwasher, and poor accuracy in classifying the washing machine (save for a few counterexamples), perhaps due to the more complex nature of dishwasher and washing machine water demands.

The framework developed here can innovate future machine learning routines not only for the ReNEWW house but also for other systems. Following the same pattern with a different set of predictors, metrics, datasets, and machine learning models, future studies can systematically discover the innovations in their own machine learning systems. Unlike models in the existing literature that are hand-configured by experts, our data-driven approach saves practitioners the labor of manually determining optimal predictors and preprocessing routines. Data-driven approaches can also discover innovations, through innovization, in end-use categorization machine learning, which can then seed future research in academia and industry. For example, commercial buildings may need a different range of predictors to maximize accuracy and minimize overfitting. Such innovations can help guide the direction of future research, which could unearth the theoretical mechanisms behind them.

Further research should determine why washing machine prediction tended to be overfitted, and why one configuration (the predictors of city main volumetric flow and total hot water volumetric flow with the amplitude metric) was able to avoid the overfitting issues with washing machines. One possible approach to avoiding overfitting might be examining events within multiple time windows, depending on the target fixture. For example, dishwashers, washing machines, and showers operate at different time scales than bathroom and kitchen sinks, and this knowledge can be exploited to improve learning performance. Also, in future work, the number of features and/or metrics may make the run-time complexity behave more like the full exponential bound of Equation (5). Therefore, future research should use evolutionary multi-objective optimization methods to more effectively search for optimal metric and feature configurations. Using these evolutionary methods, such as NSGA-III, would not only speed up the analysis but also find a more diverse and convergent solution set with regard to high accuracy and low overfitting (Deb & Jain 2014).

This work was supported by the US Environmental Protection Agency (grant number R836890).

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Breiman L. 1996 Bagging predictors. Machine Learning 24 (2), 123–140.

Cominola A., Giuliani M., Piga D., Castelletti A. & Rizzoli A. E. 2015 Benefits and challenges of using smart meters for advancing residential water demand modeling and management: a review. Environmental Modelling & Software 72, 198–214. https://doi.org/10.1016/j.envsoft.2015.07.012.

Cormen T. H., Leiserson C. E., Rivest R. L. & Stein C. 2022 Introduction to Algorithms. MIT Press, Cambridge, MA, USA.

Deb K. & Jain H. 2014 An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints. IEEE Transactions on Evolutionary Computation 18 (4), 577–601. https://doi.org/10.1109/TEVC.2013.2281535.

Deb K. & Srinivasan A. 2006 Innovization: innovating design principles through optimization. In: GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (M. Keijzer, ed.), Association for Computing Machinery, New York, USA, pp. 1629–1636.

DeOreo W. B., Mayer P. W., Dziegielewski B. & Kiefer J. 2016 Residential End Uses of Water, Version 2, Report #4309b. Water Research Foundation, Denver, CO, USA.

Efron B. 1992 Bootstrap methods: another look at the jackknife. In: Breakthroughs in Statistics (S. Kotz & N. L. Johnson, eds), Springer, New York, USA, pp. 569–593.

Ester M., Kriegel H.-P., Sander J. & Xu X. 1996 A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD'96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (E. Simoudis, J. Han & U. Fayyad, eds), AAAI Press, Menlo Park, CA, USA, pp. 226–231.

Froehlich J., Larson E., Saba E., Campbell T., Atlas L., Fogarty J. & Patel S. 2011 A longitudinal study of pressure sensing to infer real-world water usage events in the home. In: Pervasive Computing (K. Lyons, J. Hightower & E. M. Huang, eds), Springer, Berlin, Germany, pp. 50–69. https://doi.org/10.1007/978-3-642-21726-5_4.

Goldberg D. E. 1989 Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, Boston, MA, USA.

Gourmelon N., Bayer S., Mayle M., Bach G., Bebber C., Munck C., Sosna C. & Maier A. 2021 Implications of experiment set-ups for residential water end-use classification. Water 13 (2), 236.

Griffiths-Sattenspiel B. & Wilson W. 2009 The Carbon Footprint of Water. River Network, Portland, OR, USA.

Julien R., Dreelin E., Whelton A. J., Lee J., Aw T. G., Dean K. & Mitchell J. 2020 Knowledge gaps and risks associated with premise plumbing drinking water quality. AWWA Water Science 2 (3), e1177. https://doi.org/10.1002/aws2.1177.

Klein G. 2013 Efficient hot-water piping. The Journal of Light Construction 2013 (March), 73–78.

Ley C. J., Proctor C. R., Singh G., Ra K., Noh Y., Odimayomi T., Salehi M., Julien R., Mitchell J., Nejadhashemi A. P., Whelton A. J. & Aw T. G. 2020 Drinking water microbiology in a water-efficient building: stagnation, seasonality, and physicochemical effects on opportunistic pathogen and total bacteria proliferation. Environmental Science: Water Research & Technology 6 (10), 2902–2913.

MathWorks, Inc. 2020 MATLAB, Version 2020b (computer software). www.mathworks.com/.

Mayer P. W., DeOreo W. B., Opitz E. M., Kiefer J. C., Davis W. Y., Dziegielewski B. & Nelson J. O. 1999 Residential End Uses of Water. AWWA Research Foundation and American Water Works Association, Denver, CO, USA.

Mazzoni F., Alvisi S., Franchini M., Ferraris M. & Kapelan Z. 2021 Automated household water end-use disaggregation through rule-based methodology. Journal of Water Resources Planning and Management 147 (6), 04021024.

McCabe W. L., Smith J. C. & Harriott P. 1967 Unit Operations of Chemical Engineering, 5th edn. McGraw-Hill, New York, USA.

Meyer B. E., Nguyen K., Beal C. D., Jacobs H. E. & Buchberger S. G. 2021 Classifying household water use events into indoor and outdoor use: improving the benefits of basic smart meter data sets. Journal of Water Resources Planning and Management 147 (12), 04021079.

Nguyen K. A., Stewart R. A., Zhang H. & Jones C. 2015 Intelligent autonomous system for residential water end use classification: Autoflow. Applied Soft Computing 31, 118–131. https://doi.org/10.1016/j.asoc.2015.03.007.

Pastor-Jabaloyes L., Arregui F. J. & Cobacho R. 2018 Water end use disaggregation based on soft computing techniques. Water 10 (1), 46. https://doi.org/10.3390/w10010046.

Pearson K. 1901 On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2 (11), 559–572. https://doi.org/10.1080/14786440109462720.

Pickering R., Onorevole K., Greenwood R. & Shadid S. 2018 Measurement Science Roadmap Workshop for Water Use Efficiency and Water Quality in Premise Plumbing Systems, August 1–2, 2018. NIST GCR 19-020, National Institute of Standards and Technology, Gaithersburg, MD, USA. https://doi.org/10.6028/NIST.GCR.19-020.

Salehi M., Abouali M., Wang M., Zhou Z., Nejadhashemi A. P., Mitchell J., Caskey S. & Whelton A. J. 2018 Case study: fixture water use and drinking water quality in a new residential green building. Chemosphere 195, 80–89. https://doi.org/10.1016/j.chemosphere.2017.11.070.

Salehi M., Odimayomi T., Ra K., Ley C., Julien R., Nejadhashemi A. P., Hernandez-Suarez J. S., Mitchell J., Shah A. D. & Whelton A. 2020 An investigation of spatial and temporal drinking water quality variation in green residential plumbing. Building and Environment 169, 106566. https://doi.org/10.1016/j.buildenv.2019.106566.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).