Abstract
The provision of potable water is a severe challenge that engages many people worldwide. Identifying the factors that influence water demand forecasting (WDF) for the residential sector therefore plays a vital role in water crisis management. Long-term macro-planning for vast geographical areas helps policymakers achieve sustainable development goals, and this study adopts that perspective to present a pattern for water consumption behavior and its prediction. For this purpose, yearly residential water consumption data, along with climatic characteristics and socioeconomic factors of rural areas of Isfahan, Iran, are aggregated. Feature selection is conducted on the collected data using various machine learning (ML) methods along with a novel approach, forward selection based on the smoothness index (FSSmI). After selecting the features influencing residential water demand (WD), the data are analyzed using regression techniques, including multiple linear regression, support vector regression, and random forest regression. The employed methods show an improvement in the feature selection procedure and in the coefficient of determination as a result of implementing the FSSmI method. Based on the results, multiple linear regression and support vector regression achieve 96% and 95% accuracy and less than 11% and 13% error, respectively, which demonstrates the validity of the forecasting methods.
HIGHLIGHTS
A machine learning approach is applied to predict long-term residential water demand on a new dataset.
A novel feature selection method is developed based on the smoothness index criterion and forward selection.
Machine-learning-based methods are tested and compared for feature selection and prediction.
The proposed improvements result in better water demand prediction performance.
Graphical Abstract
INTRODUCTION
Water is considered one of the most important primary resources of any settlement, whether a village, town, city, or megacity. Water shortage is a common problem that can be tackled by expensive strategies such as dams and desalination plants; however, limitations such as high cost and environmental impacts have encouraged decision-makers to search for more sustainable solutions. The novel understanding of water resource management has brought about a new paradigm in water supply and demand management and incorporates new policy frameworks for administration. Integrated water resource management has always been a complex issue due to the presence of multiple variables such as climate change, rapid population growth, and urbanization. As water resource management mainly concerns competing water demands and allocating resources on an equitable basis to satisfy them, water demand forecasting (WDF) has become an essential tool for the optimal scheduling of water resource management (Villarin & Rodriguez-Galiano 2019).
Traditionally, calculating water demand (WD) for different applications, including potable, agricultural, industrial, and ecological sectors, relies on considering socioeconomic (including population, gross domestic product, income, education, water price, family or household size) and environmental (including precipitation, temperature, wind, evapotranspiration) changes.
Estimating WD is challenging for decision-makers because of the many parameters, conditions, factors, and uncertainties involved. Modeling WD is a complex exercise but vital for utilities to meet consumer demand while managing the available water resources. Determining the impact of changes in any socioeconomic or environmental parameter on WD is crucial for implementing a range of strategies to ensure the water security of urban and rural residents. Thereby, water authorities can develop practical and realistic investment plans for the future (Makki et al. 2015).
Water utilities use WDF for various purposes, whether design, planning, operation, or management. Utilities' perspectives, objectives, and planning levels are reflected in the forecast horizon and the temporal resolution of modeling and forecasting (Nasseri et al. 2011). Although there are no generally accepted time-frames for these horizons, Nasseri et al. (2011) recommended three classes: (1) long-term forecasting, with a resolution of one year or greater, which deals with large-scale planning and management; (2) middle-term forecasting, with a resolution equal to or greater than one month, for middle-term management; and (3) short-term forecasting, with a resolution of less than one month, down to one hour, which concerns small-scale planning and management. A review of the literature thus shows that the forecast span is related to the management perspective and goal; in addition, it determines the method to some extent (Donkor et al. 2014).
Although forecasting is not a new subject, its multi-dimensional nature, the quality of available data, and uncertainties mean it still presents open scientific questions. Various proposed methods and models attempt to improve forecasts despite these conditions. Qualitative methods, unit WD analysis, several regression methods (such as multiple regression, piecewise linear regression, and geographically weighted regression), moving averages, Bayesian maximum entropy, artificial neural networks, time series, and scenario-based and composite forecasts are examples of forecasting methods (Donkor et al. 2014). The choice of forecasting method depends on the available data and on technical and analytical complexities such as the modeling strategy and criteria selection.
WDF methods are categorized into four classes: temporal extrapolation, unit WD, end-use forecasting, and multivariate statistical model. These methods are described in Table 1.
| WDF method | Description | Strengths | Weaknesses |
|---|---|---|---|
| Temporal extrapolation or time-series extrapolation models | These models include simple time trends, exponential smoothing, and moving averages. They are based on the claim that WDF can be conducted using historical water consumption data (Lee et al. 2010). | Rapid execution and wide application when major changes are not expected and causal factors are expected to remain constant (Glantz & Mun 2011). | Not appropriate for long-term forecasting, as they do not take into account possible future changes (Donkor et al. 2014). |
| Unit WD | Future demand is forecast from the number of water subscribers: relying on the per capita consumption of subscribers and calculating unit water consumption coefficients for each category (such as residential, commercial, and industrial), WD is estimated (Donkor et al. 2014). | Not feature-intensive: all it requires is an estimate of consumption per unit of a customer category and the number of units in each category; it is the simplest approach, used by most water utilities (Billings & Jones 2008). | Not appropriate for long-term forecasting, as it does not take into account possible future changes (Donkor et al. 2014). |
| Micro-component or end-use forecasting | Considers all end uses of water, such as behavioral patterns (shower, bathtub, bathroom, faucets, and toilet) and data obtained from water-consuming appliances (washing machine and dishwasher) (Levin et al. 2006). | Simulates the long-term effects of technological change on WD management (Levin et al. 2006). | Detailed information on every subscriber is required. |
| Multivariate statistical models | Used to consider the impact of socioeconomic (water price, income, population, urban density, etc.) and climatic (precipitation, temperature, humidity, wind, etc.) factors on WD (Donkor et al. 2014). | A powerful method, as it examines the effect of a set of independent variables on per capita consumption (the dependent variable) and establishes a statistical relationship between them (Fullerton & Molina 2010). | The performance of these models relies on the volume of the dataset (Kindler & Russell 1984). |
Regression methods, as a type of multivariate statistical model, are split into two categories: parametric and non-parametric regression. The former assumes a specific functional form for the relationship between the independent and dependent variables, where the type of relationship is known in advance. Fitting a single equation to a large number of data points that represent complex behavior leads to undesirable results (Cleveland et al. 1988), and this method also requires time-consuming calculations. Schleich & Hillenbrand (2009) studied about 600 water supply areas of Germany to investigate the impact of several economic, environmental, and social factors (such as price, income, household size, age, rainfall, and temperature) on per capita WD using the ordinary least squares method. According to the results, both rainfall pattern and household size have a negative relationship with WD, whereas age has a positive relationship with WD.
In non-parametric regression, the regressor does not take a predetermined functional form; i.e., no assumptions are made about the distribution of the data (Cleveland et al. 1988). This method divides the data into several regions, fits a curve to the data in each region, and then combines these into appropriate and useful estimators. Eslamian et al. (2016) predicted the daily WD of Quebec City, Canada, from climate parameters and socioeconomic factors using the multiple regression method; according to the results, WD decreases on rainy days and increases on weekends. Other non-parametric methods include neural networks and fuzzy logic, in which learning the relationships between variables in the model leads to a prediction. These methods are more challenging to use and harder to explain to potential users, which prevents many water analysts from using them (Billings & Jones 2008).
On the same mathematical and theoretical basis, machine learning (ML) techniques are useful in many of the previously mentioned methods. ML applies to a wide variety of areas in the water sector, from forecasting future WD and early flood warning to monitoring and alarm handling in distribution systems, identifying factors and features that influence actions in related systems, detecting leakage in the water supply, optimizing management, and classifying types of consumers (Villarin & Rodriguez-Galiano 2019).
In this study, a framework for forecasting residential WD is developed based on various ML methods, aiming to improve on previous methods and suggest a novel approach: a new feature selection method named forward selection based on the smoothness index (FSSmI).
The FSSmI method is designed to address the following gaps in previous research:
Some feature selection methods are constrained to linear dependency, such as the Pearson correlation coefficient, which measures only the linear dependency between two random variables.
Some feature selection methods rely on estimating a probability density function, such as mutual information, which measures the extent of the difference between the joint distribution of (X, Y) and the product of the marginal distributions of X and Y.
Some feature selection methods depend on the error of the regression model, such as recursive feature elimination, in which the importance of each feature is determined by the regression model's error on the selected feature set.
The developed FSSmI method provides a short run time and a straightforward framework for feature selection, addressing the abovementioned problems. This method, along with principal component analysis (PCA), generic univariate selection (GUS), recursive feature elimination (RFE), and mutual information (MI), is implemented to determine the most informative features. This phase reduces the problem dimensions and computational costs, resulting in a more feasible forecasting procedure. Afterward, multiple ML methods, including support vector regression (SVR), multiple linear regression (MLR), and random forest regression (RFR), are applied to the reduced set of features. Their results are analyzed with performance evaluation metrics, including mean square error (MSE) and R2. Eventually, the best model based on minimum error or maximum accuracy, simplicity, and runtime cost is presented. This study employs diversified data for modeling, including climatic and socioeconomic data. The remainder of this paper proceeds as follows. Section METHODS offers an overview of the method, including the aggregated data, data preprocessing, feature selection methods, the proposed method, forecasting methods, and performance evaluation metrics. Section RESULTS AND DISCUSSION presents the dataset's details, selected features, parameter settings, results, and an analysis of the methods' performance. Finally, the conclusion is presented in Section CONCLUSIONS.
METHODS
An ML approach usually consists of several stages, including data preparation and preprocessing, data transformation and feature selection, modeling, and deployment and testing. This section presents the methodology and materials borrowed from the ML context and applied to the WDF problem. An overview of the methodology used in this research is given in Figure 1.
Case study design
Data collection
The data implemented in this paper are gathered from socioeconomic, climatic, and water consumption records of Isfahan province, Iran. Isfahan's area is about 107,000 square kilometres, larger than that of 89 countries, including Iceland, South Korea, and Portugal. Its population is about 5 million based on 2016 demographic data, roughly that of Norway, Finland, or Ireland, and greater than that of 75 other countries. The data contain the annual potable water consumption of 871 rural areas.
The literature considers an enormous number of factors as potentially influential on WD, from management interventions and socioeconomic variables to demographic data, household characteristics, and weather-climate variables. This paper studies the most recent census of the area, collected and published in 2016 (Statistical Centre of Iran 2016). It includes the population, gender distribution, age distribution, employment rate, home-ownership rate, house-type distribution, and housing area distribution of each rural area. To include weather, the most commonly measured weather parameter, air temperature, is used. The data provided by the Iran Meteorological Organization (2016), which include the daily mean air temperature for each area, are aggregated into several year-scale parameters.
The potable water consumption data in this paper, along with the number of water subscriptions, are provided by the Rural Water and Wastewater Company of Isfahan Province, Iran, and are measured every other month (Isfahan Province Rural Water and Wastewater Co 2016). The water consumption data are aggregated into a single annual feature.
Data description
The dataset used in this study consists of annual data for 871 rural areas located in Isfahan province, Iran. The demographic data of these rural areas, assembled with their potable water consumption data and weather information, compose the dataset. The total population and number of households in the dataset are 508,111 and 163,213, respectively. All measurements were made in 2016, the most recent year for which demographic data were collected. Target water consumption data, along with 29 features, define each case of the dataset; data for every feature were available for each rural area. The water data include a feature representing the number of potable water subscriptions, plus water consumption in m3.
The demographic data represent the number of households, average family size, female ratio (in the form of a female-to-population ratio), age structure (in the form of a distribution ratio among age groups), literacy rate, employment rate, numbers of owner-occupied and non-owner-occupied housings, non-apartment housing ratio, and housing area distribution.
The daily measured average temperature for each rural area is converted into several annual features: maximum temperature, average summer temperature, and cooling degree days (CDD). A concise description of this paper's dataset is presented in Tables 2 and 3.
| Feature | Type/Unit | Feature | Type/Unit |
|---|---|---|---|
| Subscriptions | count | Non-owner-occupied housings | count |
| Households | count | Non-apartment housings | % |
| Average family size | number | Area −50 m2 | count |
| Female ratio | % | Area 51 to 75 m2 | count |
| Age 0 to 9 | count | Area 76 to 80 m2 | count |
| Age 10 to 19 | count | Area 81 to 100 m2 | count |
| Age 20 to 29 | count | Area 101 to 150 m2 | count |
| Age 30 to 39 | count | Area 151 to 200 m2 | count |
| Age 40 to 49 | count | Area 201 to 300 m2 | count |
| Age 50 to 59 | count | Area 301 to 500 m2 | count |
| Age 60 to 69 | count | Area 501+ m2 | count |
| Age 70+ | count | Max temperature | °C |
| Literacy rate | % | Summer temperature | °C |
| Employment rate | % | CDD | count |
| Owner-occupied housings | count | | |
| Feature | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| Subscriptions | 188.54 | 245.80 | 1 | 30 | 92 | 243 | 1,833 |
| Water consumption | 36,122.32 | 45,960.00 | 150 | 5,177 | 18,444 | 48,500 | 258,516 |
| Households | 187.39 | 233.51 | 4 | 37 | 93 | 247.5 | 1,678 |
| Average family size | 2.96 | 0.52 | 1 | 2.7 | 3.0 | 3.3 | 5.5 |
| Female ratio | 284.10 | 361.70 | 0 | 50 | 135 | 382.5 | 2,548 |
| Age 0 to 9 | 0.13 | 0.06 | 0 | 0.10 | 0.14 | 0.17 | 0.36 |
| Age 10 to 19 | 0.11 | 0.04 | 0 | 0.08 | 0.12 | 0.14 | 0.36 |
| Age 20 to 29 | 0.16 | 0.06 | 0 | 0.14 | 0.17 | 0.19 | 1 |
| Age 30 to 39 | 0.16 | 0.05 | 0 | 0.14 | 0.17 | 0.20 | 0.53 |
| Age 40 to 49 | 0.12 | 0.04 | 0 | 0.11 | 0.13 | 0.15 | 0.40 |
| Age 50 to 59 | 0.11 | 0.04 | 0 | 0.08 | 0.10 | 0.12 | 0.33 |
| Age 60 to 69 | 0.09 | 0.06 | 0 | 0.05 | 0.07 | 0.10 | 0.78 |
| Age 70+ | 0.12 | 0.12 | 0 | 0.04 | 0.07 | 0.14 | 0.88 |
| Literacy rate | 0.75 | 0.15 | 0 | 0.69 | 0.79 | 0.85 | 0.95 |
| Employment rate | 0.50 | 0.09 | 0.11 | 0.47 | 0.50 | 0.54 | 1 |
| Owner-occupied housings | 148.00 | 184.57 | 0 | 29 | 75 | 192 | 1,349 |
| Non-owner-occupied housings | 21.86 | 42.13 | 0 | 2 | 7 | 23 | 488 |
| Non-apartment housings | 0.97 | 0.05 | 0.49 | 0.96 | 0.98 | 1 | 1 |
| Area −50 m2 | 5.99 | 10.17 | 0 | 1 | 2 | 7 | 125 |
| Area 51 to 75 m2 | 11.25 | 18.21 | 0 | 1 | 5 | 13 | 187 |
| Area 76 to 80 m2 | 11.30 | 18.63 | 0 | 1 | 4 | 12 | 149 |
| Area 81 to 100 m2 | 42.69 | 59.05 | 0 | 7 | 18 | 52.5 | 415 |
| Area 101 to 150 m2 | 69.81 | 97.35 | 0 | 8 | 30 | 89 | 652 |
| Area 151 to 200 m2 | 25.26 | 41.13 | 0 | 2 | 9 | 32 | 358 |
| Area 201 to 300 m2 | 8.36 | 14.96 | 0 | 0 | 3 | 9 | 159 |
| Area 301 to 500 m2 | 2.14 | 4.34 | 0 | 0 | 0 | 2 | 54 |
| Area 501+ m2 | 0.52 | 1.51 | 0 | 0 | 0 | 0 | 18 |
| Max temperature | 32.98 | 3.17 | 27 | 30 | 34 | 35 | 40 |
| Summer temperature | 28.86 | 2.43 | 24 | 27 | 29 | 31 | 34 |
| CDD | 164.13 | 25.23 | 121 | 140 | 170 | 185 | 221 |
Data preprocessing
In a prediction problem, a dataset is divided into two subsets: one used for training (the training set) and one, not seen by the model, used for testing (the validation or test set). The purpose of cross-validation is to examine the model's ability to predict new data that were not used in the training phase.
Suppose the data of a dataset are available for regression modeling. The goal of cross-validation is to obtain a model with an optimal number of parameters, i.e., a model that does not suffer from overfitting. To achieve this, the data are usually divided into two parts:
Training data (training set): this part of the data is dedicated to creating the model and estimating its parameters.
Test data (test set): this part of the data is devoted to testing the performance of the model. It includes values of the independent and output variables that are not included in training the model but that make the comparison of predicted and actual values possible.
Since the test data do not affect the model, they have no role in determining the model's parameters and are used only for model evaluation. In the cross-validation method, however, the training set itself is repeatedly divided into two parts: in each repetition, one part is used for training the model and the other for testing it. This process is therefore a resampling approach for estimating the model error.
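As a minimal illustration in scikit-learn (the library used later in this paper), assuming hypothetical names X for the feature matrix and y for the annual water consumption target:

```python
from sklearn.model_selection import train_test_split

# Hold out part of the cases as an untouched test set; the 80/20 ratio
# here is illustrative, not a setting reported in this paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```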
Since the water consumption data and the prediction span are annual, the daily temperature data must be transformed into appropriate parameters. CDD is the number of days with an average temperature above 18 °C (65 °F); it is typically used in the energy consumption context, and its influence on water consumption is examined in this study. Variables such as CDD, maximum temperature, and average summer temperature are derived from the daily air temperature data for each year.
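A minimal pandas sketch of this aggregation, assuming a frame with a DatetimeIndex and a temp column of daily mean temperatures for one rural area, and taking June to August as the summer months (an assumption; the paper does not state the month range):

```python
import pandas as pd

def annual_temperature_features(daily: pd.DataFrame) -> dict:
    """Aggregate one year of daily mean temperatures into annual features."""
    summer = daily[daily.index.month.isin([6, 7, 8])]   # assumed summer months
    return {
        "max_temperature": daily["temp"].max(),
        "summer_temperature": summer["temp"].mean(),
        # CDD as defined above: days whose mean temperature exceeds 18 degC.
        "cdd": int((daily["temp"] > 18).sum()),
    }
```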
The demographic data represent the age distribution as five-year groups, from 0 to 74, plus a group for 75 and above. To reduce the problem dimensions, these groups are merged into eight categories: 0 to 9, 10 to 19, 20 to 29, 30 to 39, 40 to 49, 50 to 59, 60 to 69, and 70+.
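A sketch of this merge, assuming the census columns carry hypothetical names such as age_0_4, age_5_9, ..., age_70_74, age_75_plus:

```python
import pandas as pd

def regroup_ages(census: pd.DataFrame) -> pd.DataFrame:
    """Collapse five-year census age groups into the eight ten-year
    categories used in this study (0-9, 10-19, ..., 60-69, 70+)."""
    out = pd.DataFrame(index=census.index)
    for lo in range(0, 70, 10):
        out[f"age_{lo}_{lo + 9}"] = (
            census[f"age_{lo}_{lo + 4}"] + census[f"age_{lo + 5}_{lo + 9}"]
        )
    # 70+ merges the 70-74 group with the open-ended 75+ group.
    out["age_70_plus"] = census["age_70_74"] + census["age_75_plus"]
    return out
```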
Data transformation
Data transformation is usually required to convert data into a new coordinate system of lower dimension. Since the transformation reduces the number of input variables while the selected variables retain most of the information in the original data, the learning process and data analysis are performed more quickly, the results are more reliable, and data processing is more efficient. For this purpose, PCA is used: the orthogonal projection of the data that maximizes the projected data's variance (Bishop 2006). The first coordinate (the first principal component) carries the greatest variance of the data's projection, the second coordinate the next greatest variance, and so on.
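A minimal sketch of this transformation with scikit-learn, choosing the smallest number of components whose cumulative explained variance reaches a target (the 99% level here is illustrative; cf. the discussion of Figure 6 later):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)        # X: feature matrix (assumed)
pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.99) + 1)
X_reduced = PCA(n_components=n_components).fit_transform(X_std)
```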
Feature selection methods
Feature selection methods select the relevant features that best represent and fit the data. This paper uses three categories of feature selection methods: filter, wrapper, and embedded methods. Filter methods act independently of the learning models: they select features by assigning them scores based on evaluation metrics. This paper uses K-best selection (KBS), GUS, MI, the Pearson correlation coefficient (PCC), variance threshold (VT), and FSSmI.
The proposed FSSmI feature selection method also belongs to the filter methods; it works based on the smoothness index (SmI) (Kalhor et al. 2019). This index has mostly been used for classification problems and has improved layer performance in deep neural networks. In this paper, the index is customized to select the relevant features of a regression problem using forward selection. The combination of forward selection and the SmI proposed in this paper is expected to select the features affecting regression. Details of the proposed method are given in the following subsection.
Feature selection with smoothness index
The SmI is a generalized and extended version of the separation index (Cano 2013): quantizing the SmI yields the separation index. The separation index is a measure of data evaluation and data complexity for classification problems, as it shows the separability between samples of different classes. Counting, over all data points, those whose nearest neighbor shares the same label, and normalizing by the total number of data points, gives a number between 0 and 1. The closer this number is to 1, the better the separation between the classes; that is, each new sample of a particular class is closer to samples of the same class than to samples of other classes.
The proposed feature selection method can be used for any type of data as long as the data are quantitative (can be expressed in numbers); there is no further limitation in this regard. It can be applied both to signal data (one-dimensional or multidimensional) and to discrete data with no specific temporal-spatial order.
The SmI determines, for each input, how close the output of its nearest input is to its own output. Closer outputs result in an SmI close to 1, i.e., the data are smooth and nearby inputs produce nearby outputs; more distant outputs decrease the SmI. The SmI can be calculated in two ways: the linear SmI, which applies a linear punishment to the distance between nearby outputs relative to the mean of the output distances, and the exponential SmI, which applies an exponential punishment to that distance. The exponential SmI is therefore stricter and approaches 0 faster for more distant outputs.
The first step in calculating the smoothness of a set of vector points is to normalize the scattered data with the z-score (anomalous data do not hurt the z-score). The input is a matrix $X$ with $Q$ rows and $n$ columns (each vector point appears as a row), and the output is a matrix $Y$ with $Q$ rows and $m$ columns. For each input $x_q$, the SmI considers the nearest input $x_{q'}$ and their outputs $y_q$ and $y_{q'}$. It then divides the norm of the output difference, $\|y_q - y_{q'}\|$, by the mean of the output distances $\bar{d}_q$, obtained by averaging the distances between the output of instance $q$ and the other outputs. The linear SmI averages the rectified result over all instances:

$$\mathrm{SmI} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{ReLU}\!\left(1 - \frac{\|y_q - y_{q'}\|}{\bar{d}_q}\right), \qquad \bar{d}_q = \frac{1}{Q-1} \sum_{p \neq q} \|y_q - y_p\|.$$

If the ratio of the distance between $y_q$ and $y_{q'}$ to the mean of the output distances is low, the $\mathrm{ReLU}$ (rectified linear unit) term gets close to 1. $\mathrm{ReLU}(z) = \max(0, z)$ is a non-linear activation function used in many types of multi-layer and deep neural networks: it outputs its input directly if the input is positive, and 0 otherwise. It has become the default activation function because models that use it are easier to train and often achieve better performance.
The SmI matters for a prediction problem: inputs with a higher SmI are more suitable and lead to better feature selection, because a higher SmI makes interpolation more feasible and predictable (provided the data diversity is high enough). More smoothness and better interpolation yield more predictive power. One advantage of the SmI is that it does not force the map to be linear or Gaussian; it only requires the map to be continuous and smooth. A function is called smooth if it is differentiable over its domain. Smoothness indicates the extent to which the closeness of two inputs results in the closeness of their outputs: a limited distance between inputs leads to a limited distance between outputs, preserving the continuity of the output. Furthermore, the SmI can be increased indirectly through loss functions, including MSE and entropy, or directly by finding a proper combination of inputs that makes the smoothness grow.
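The following NumPy sketch illustrates the linear SmI and the forward-selection loop as we read the description above; it is an illustration under those assumptions, not the authors' reference implementation:

```python
import numpy as np

def linear_smi(X: np.ndarray, y: np.ndarray) -> float:
    """Linear smoothness index for inputs X (Q x n) and outputs y (Q or Q x m):
    z-score the inputs, find each point's nearest input, and compare the
    corresponding output distance with that point's mean output distance."""
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    y = y.reshape(len(y), -1).astype(float)
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dx, np.inf)
    nearest = dx.argmin(axis=1)                    # index q' of the nearest input
    dy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    mean_dy = dy.sum(axis=1) / (len(y) - 1)        # mean distance to the other outputs
    ratio = dy[np.arange(len(y)), nearest] / (mean_dy + 1e-12)
    return float(np.maximum(0.0, 1.0 - ratio).mean())   # ReLU, then average

def fssmi(X: np.ndarray, y: np.ndarray, n_features: int) -> list:
    """Forward selection driven by the SmI: greedily add the feature whose
    inclusion gives the smoothest input-output map."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best = max(remaining, key=lambda j: linear_smi(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```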
Six different subsets are analyzed for feature selection in an illustrative example. As shown in Table 4, an appropriate feature space is created out of the primary input space (the two main inputs X1 and X2, alongside two non-related inputs X3 and X4) by a feature selection stage using the smoothness index.
Different subsets of the two main inputs (X1, X2) and two non-related inputs (X3, X4), with the feature smoothness index of each subset:

| Subsets/Inputs | X1 | X2 | X3 | X4 | Linear | Exponential |
|---|---|---|---|---|---|---|
| 1 |  |  |  |  | 0.9783 | 0.9788 |
| 2 |  |  |  |  | 0.9159 | 0.9249 |
| 3 |  |  |  |  | 0.8314 | 0.8615 |
| 4 |  |  |  |  | 0.4781 | 0.5972 |
| 5 |  |  |  |  | 0.4711 | 0.5888 |
| 6 |  |  |  |  | 0.3464 | 0.4929 |
The primary inputs are selected because their subset attains the topmost linear and exponential SmI values. In such an example, for each input data point, the data points in its closest vicinity are expected to produce an output very close to the output of the given data point, differing only by a minor error; using smoothness thus provides the ability to predict. Conversely, when one or both primary inputs are removed and replaced with unrelated or noisy inputs, the SmI worsens and decreases, meaning that the mapping of the data becomes unsmooth and jagged. Because of the high sensitivity of the output to quantitative alterations in the feature space, the generalization of such a function is expected to decrease.
Feature selection with other methods
Wrapper methods select or eliminate a subset of features based on the prediction model's performance; i.e., they recursively use an evaluation metric that depends on the prediction algorithm to determine the usefulness of the selected features. Because of their high computational cost, methods in this category are not very popular. RFE, from this category, is used in this study.
Embedded methods, like wrapper methods, work in interaction with features and models, but the models in this category embed the feature selection process in themselves. Lasso, Ridge, and Elastic Net are ML models with this category of feature selection.
A high correlation between a variable and the target is usually useful, because one can predict the other. However, a high correlation between two feature variables indicates information that is redundant with respect to the target, and removing one of them reduces the problem dimensions.
Other feature selection techniques are as follows: RFE, which recursively fits a model and removes the weakest feature until a specified number of features is reached; VT, which removes features whose variation falls below a specific cutoff; KBS, which selects features based on the k highest scores; and GUS, a univariate selector with a configurable strategy. A univariate selector runs statistical tests on features against the target and keeps the high-scoring features; i.e., it selects the best features based on univariate statistical tests. All feature selection methods are implemented in Python using the ML library scikit-learn (Buitinck et al. 2013), as sketched below.
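As an illustration, the scikit-learn counterparts of these selectors might be configured as follows; the numeric settings mirror those reported later under 'Model settings', while the univariate score function (f_regression) is an assumption:

```python
from sklearn.feature_selection import (
    RFE, SelectKBest, SelectPercentile, VarianceThreshold,
    f_regression, mutual_info_regression,
)
from sklearn.linear_model import LinearRegression

vt = VarianceThreshold(threshold=0.16).fit(X)                     # VT
kbs = SelectKBest(f_regression, k=12).fit(X, y)                   # KBS
gus = SelectPercentile(f_regression, percentile=20).fit(X, y)     # GUS
mi = SelectKBest(mutual_info_regression, k=14).fit(X, y)          # MI
rfe = RFE(LinearRegression(), n_features_to_select=15, step=1).fit(X, y)
kept = rfe.support_   # boolean mask of the features RFE retains
```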
ML methods
Potable WDF can be considered a regression problem, and several ML methods have been studied to solve it (Antunes et al. 2018). This paper explores seven ML methods and reports their results: k-nearest neighbors (KNN), RFR, SVR, MLR, least absolute shrinkage and selection operator (Lasso), Ridge, and Elastic Net regression.
KNN is a type of instance-based learning that predicts the target from the values of the k nearest neighbors in the available data, combined through a weight function; several weight functions are available for this purpose. RFR is an ensemble learning method that aggregates many decision trees, each trained on a different data sample drawn with replacement. SVR's objective is to minimize the L2 norm of the coefficient vector while ensuring that the absolute error is less than or equal to $\varepsilon$, the defined maximum error. MLR is the extension of ordinary least-squares regression to several explanatory variables, predicting the target by fitting a linear equation to the data.
The other three ML methods embed a feature selection procedure. Lasso regression performs L1 regularization, Ridge regression uses L2 regularization, and Elastic Net regression combines L1 and L2 priors as its regularizer.
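As a sketch of embedded selection: with L1 regularization, uninformative coefficients shrink exactly to zero, so the fitted model itself marks the retained features.

```python
import numpy as np
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0).fit(X, y)        # alpha as reported under 'Model settings'
selected = np.flatnonzero(lasso.coef_)    # indices of features with nonzero weight
```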
Model performance evaluation
Model validation is carried out with a k-fold cross-validation approach (Stone 1974) with $k = 5$. The randomized samples are split into five equal-size folds; for each fold, a model is trained on the data outside that fold and validated on the fold's data, and the errors of these five validations are averaged into the final model score. After evaluating different values of k, it was concluded that, based on minimum error and prediction performance, k = 5 is suitable for the cross-validation.
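A minimal sketch of this validation scheme with scikit-learn, using MLR as an example model:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # five equal-size folds
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores.mean())   # average of the five validation scores
```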
RESULTS AND DISCUSSION
In this section, the proposed FSSmI method, alongside the previously described regression and feature selection methods, is tested on a newly collected rural water consumption dataset. After evaluating the models' performance using the evaluation metrics mentioned in Section 'Model performance evaluation', the models were improved by setting appropriate parameters, and their performance was analyzed and compared.
Selected features
Feature selection is an essential part of an ML method; it can be defined as determining the relevant features and eliminating the irrelevant or redundant ones to reach a subset of features that defines the problem well.
The ML methods, in conjunction with the feature selection methods studied in this paper, result in 34 different models. This section explores the features selected for each model. Each feature selection method shares the same features among the ML methods, except RFE, which selects different features for each model. The most anticipated observation is that subscriptions (which is tightly proportional to population) appears in all models. Other important features, selected by more than 60 percent of the models, include the temperature features (CDD, maximum, and average summer temperature) and demographic features (households, non-owner/owner-occupied housings, female ratio, and the housing area distribution features 51-75, 76-80, 101-150, 201-300, and 301-500 m2). Surprisingly, every age distribution group except age 70+ appears in less than 50 percent of the models. Figure 3 illustrates each feature's appearance frequency in the outputs of the feature selection methods.
It is inferable from Figure 3 that the population features, especially subscriptions, show a strong relation to water consumption, whereas the age distribution features do not affect water consumption behavior much. Based on the combinations of MLR, RFR, and SVR with GUS, RFE, and FSSmI, the 70+ age feature has a stronger relationship with WD than the other age distribution features. The six most important features, appearing in more than 90% of the models, are subscriptions, female ratio, CDD, households, owner-occupied housings, and average summer temperature, respectively. It can also be inferred from Figure 3 that temperature has a significant impact on the water consumption pattern, whereas the literacy rate, family size, and employment rate do not show a meaningful effect.
PCA and the filter methods are combined with the ML methods, creating 28 models; the selected features of each filter method are the same across all ML methods. The wrapper method is used with MLR, RFR, and SVR, selecting 15 (but not necessarily the same) features for each of them. The embedded ML methods perform feature selection themselves. Table 5 shows the selected features for all models.
| Feature | KBS | VT | PCC | MI | GUS | FSSmI | RFE on MLR | RFE on RFR | RFE on SVR | Lasso | Ridge | Elastic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Subscriptions | | | | | | | | | | | | |
| Households | | | | | | | | | | | | |
| Average family size | | | | | | | | | | | | |
| Female ratio | | | | | | | | | | | | |
| Age 0 to 9 | | | | | | | | | | | | |
| Age 10 to 19 | | | | | | | | | | | | |
| Age 20 to 29 | | | | | | | | | | | | |
| Age 30 to 39 | | | | | | | | | | | | |
| Age 40 to 49 | | | | | | | | | | | | |
| Age 50 to 59 | | | | | | | | | | | | |
| Age 60 to 69 | | | | | | | | | | | | |
| Age 70+ | | | | | | | | | | | | |
| Literacy rate | | | | | | | | | | | | |
| Employment rate | | | | | | | | | | | | |
| Owner-occupied housings | | | | | | | | | | | | |
| Non-owner-occupied housings | | | | | | | | | | | | |
| Non-apartment housings | | | | | | | | | | | | |
| Area −50 m2 | | | | | | | | | | | | |
| Area 51 to 75 m2 | | | | | | | | | | | | |
| Area 76 to 80 m2 | | | | | | | | | | | | |
| Area 81 to 100 m2 | | | | | | | | | | | | |
| Area 101 to 150 m2 | | | | | | | | | | | | |
| Area 151 to 200 m2 | | | | | | | | | | | | |
| Area 201 to 300 m2 | | | | | | | | | | | | |
| Area 301 to 500 m2 | | | | | | | | | | | | |
| Area 501+ m2 | | | | | | | | | | | | |
| Max temperature | | | | | | | | | | | | |
| Summer temperature | | | | | | | | | | | | |
| CDD | | | | | | | | | | | | |
| Number of features | 12 | 17 | 15 | 14 | 14 | 15 | 15 | 15 | 15 | 11 | 16 | 18 |
Households, subscriptions, and female ratio are selected by all the feature selection methods, which shows their influence on the regression.
Figure 4 illustrates feature importance based on the mutual information feature selection method; accordingly, the most important features are subscriptions, female ratio, households, and owner-occupied housings, respectively.
Figure 5 presents the SmI as a function of the number of dataset dimensions. The point of maximum smoothness indicates the optimum number of features, beyond which adding further features probably yields no more information. It is evident from Figure 5 that the optimum number of features is 15. The method selects subscriptions, owner-occupied housings, average family size, area 301 to 500 m2, summer temperature, CDD, female ratio, households, age 70+, non-apartment housings, area 501+ m2, employment rate, literacy rate, max temperature, and age 0 to 9.
Model settings
Parameter settings for feature selection methods
For PCA, the number of components (those with the highest variance) is selected from the range two to ten. Determining how many components are required to explain the data is a critical part of using PCA in practice; this number can be estimated from the cumulative explained variance ratio as a function of the number of components. The explained variance (the amount of variance captured by each selected component) states how much of the data's variance can be attributed to each principal component. For instance, Figure 6 reveals that the first two components constitute approximately 99% of the variance, while describing approximately 100% of the variance requires about ten components. For GUS, the feature selection mode is 'Percentile', and the corresponding mode parameter is 20.
For RFE (recursive feature elimination ranking), the number of selected features is 15, and the number of features removed at each iteration, termed 'step', is 1. RFE fits a model and eliminates the weakest feature(s) until the specified number of features is reached; features are ranked by the model's coef_ or feature_importances_ attributes, and RFE aims to remove dependencies and collinearity in the model by recursively removing features in each loop. For VT, the threshold is set to 0.16, which excludes features with a training-set variance below this value. PCC is adjusted to select only features whose correlation with water consumption exceeds 0.5, excluding one from each pair of features that correlate above 0.9 with each other. For MI, discrete_features is set to false for dense inputs and true for sparse inputs, and features are selected based on the 14 highest scores measured with KBest; MI is a non-negative score of the dependency between two random variables that equals 0 if and only if the variables are independent, with higher values standing for higher dependency. For FSSmI, the L2 norm, i.e., the Euclidean distance, is used as the distance measure.
The Pearson correlation matrix shows that the highest correlation coefficient, 0.98, holds between subscriptions and water consumption; all correlation results are found in Figure 7. A high correlation among the features (independent variables) indicates multicollinearity, which can bias the regression analysis results. For the Pearson correlation, only features with a correlation factor above 0.5 with water consumption are chosen, and if two features (other than water consumption) correlate above 0.9 with each other, one of them is eliminated. Figure 7 is the heatmap of the feature correlations.
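A sketch of this correlation-based filter under the stated thresholds, assuming a DataFrame whose columns are the features plus a hypothetically named water_consumption target:

```python
import pandas as pd

def pcc_filter(df: pd.DataFrame, target: str = "water_consumption",
               target_thr: float = 0.5, pair_thr: float = 0.9) -> list:
    """Keep features correlated with the target above 0.5, then drop one
    feature from every pair whose mutual correlation exceeds 0.9."""
    corr = df.corr()
    candidates = [c for c in df.columns
                  if c != target and abs(corr.loc[c, target]) > target_thr]
    selected = []
    for c in candidates:
        if all(abs(corr.loc[c, s]) <= pair_thr for s in selected):
            selected.append(c)
    return selected
```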
Parameter settings for forecasting methods
For MLR, the ordinary least-squares method is adopted in this paper, minimizing the sum of squared errors (SSE) to obtain the optimal regression function.

In the case of the RFR models, increasing the number of trees in the forest and the choice of performance evaluation metric positively affect the predictions' quality. In this paper, the number of trees in the forest is set to 20, and the maximum tree depth is left open, i.e., nodes expand until all leaves are pure; MSE is selected as the performance evaluation metric.

The choice of the SVR kernel function is one of the most critical parameters affecting its performance, more important than the tolerance for the predictions. In this paper, a linear function is selected as the kernel, the regularization parameter C is set to 15, and the stopping tolerance is 0.001; shrinking heuristics are not used in the training process.

The performance of KNN models depends strongly on the number of neighbors, so this number is obtained optimally in this paper. The results show no significant difference between weight functions, although the uniform weight function performs slightly better.

Ridge regression solves a regression model whose loss function is linear least squares with L2-norm regularization, which improves the conditioning of the problem and reduces the variance of the estimates; here, alpha (the regularization strength) is set to 1, and the solver used in the computational procedures is set to auto (chosen based on data type). For Elastic Net regression, the alpha value is 1 and the L1 ratio is 0.5. For Lasso regression, alpha is also set to 1. Because five-fold cross-validation is already applied in this study, LassoCV is unnecessary: cross-validation is part of the LassoCV method, so applying it again would nest the validation loops.
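Collected as a sketch, the seven regressors with the settings reported above (anything not stated in the text is left at its scikit-learn default; n_neighbors for KNN is a placeholder for the tuned value):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

models = {
    "MLR": LinearRegression(),                        # ordinary least squares
    "RFR": RandomForestRegressor(n_estimators=20, max_depth=None,
                                 criterion="squared_error"),  # MSE criterion
    "SVR": SVR(kernel="linear", C=15, tol=0.001, shrinking=False),
    "KNN": KNeighborsRegressor(n_neighbors=5, weights="uniform"),
    "Lasso": Lasso(alpha=1.0),
    "Ridge": Ridge(alpha=1.0, solver="auto"),
    "Elastic": ElasticNet(alpha=1.0, l1_ratio=0.5),
}
```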
Model performance
The important point of every forecasting method is its performance evaluation. After normalizing the data and determining the essential features, each prediction model (MLR, RFR, SVR, KNN, Lasso, Ridge, and Elastic) is run based on k-fold cross-validation with k = 5 and the metrics introduced in Section 'Model performance evaluation', including R2 and MSE. In this section, the collected data are used to compare the models' performance for WDF. R2 shows how close the predicted values are to the observed data: a value of 0 indicates that no prediction is close to the observed data, and 1 indicates exact prediction. MSE is the mean of the squared error, the difference between predicted and observed values. For the error metrics, values close to 0 indicate more accurate forecasting. The performance comparison of the forecasting methods is presented in Tables 6-9. The MLR forecasting model shows the best performance on the four metrics, meaning it has the least forecasting error and the highest accuracy for WD prediction; thus, the MLR model has the best forecasting performance for residential potable WDF in rural Isfahan. SVR has the second-best performance, while KNN shows the lowest accuracy and largest error of all methods and performs poorly in WDF.
| Model | PCA | GUS | RFE | KBS | VT | PCC | MI | FSSmI | Embedded |
|---|---|---|---|---|---|---|---|---|---|
| MLR | 0.9657 | 0.9645 | 0.9658 | 0.9653 | ***0.9684*** | 0.9659 | 0.9670 | 0.9669 | * |
| RFR | 0.9387 | 0.9637 | 0.9669 | 0.9550 | 0.9605 | 0.9564 | 0.9568 | 0.9682 | * |
| SVR | 0.9523 | 0.9549 | 0.9556 | 0.9661 | 0.9674 | 0.9652 | 0.9658 | 0.9671 | * |
| KNN | 0.9179 | 0.9383 | 0.9545 | 0.9547 | 0.9524 | 0.958 | 0.9541 | 0.9604 | * |
| Lasso | * | * | * | * | * | * | * | * | 0.9658 |
| Ridge | * | * | * | * | * | * | * | * | 0.9675 |
| Elastic | * | * | * | * | * | * | * | * | 0.9654 |
*The methods are not comparable.
| Model | PCA | GUS | RFE | KBS | VT | PCC | MI | FSSmI | Embedded |
|---|---|---|---|---|---|---|---|---|---|
| MLR | 0.5232 | 0.5499 | 0.5220 | 0.5219 | 0.4969 | 0.5179 | 0.5105 | 0.5081 | * |
| RFR | 0.6965 | 0.4816 | 0.4796 | 0.5203 | 0.5014 | 0.5495 | 0.5351 | 0.4986 | * |
| SVR | 0.5386 | 0.5066 | 0.4975 | 0.4857 | ***0.4781*** | 0.4874 | 0.4868 | 0.5162 | * |
| KNN | 0.8168 | 0.7123 | 0.5381 | 0.5727 | 0.5707 | 0.5555 | 0.5728 | 0.5167 | * |
| Lasso | * | * | * | * | * | * | * | * | 0.5290 |
| Ridge | * | * | * | * | * | * | * | * | 0.4914 |
| Elastic | * | * | * | * | * | * | * | * | 0.4919 |
*The methods are not comparable.
| Model | PCA | GUS | RFE | KBS | VT | PCC | MI | FSSmI | Embedded |
|---|---|---|---|---|---|---|---|---|---|
| MLR | 0.2786 | 0.3031 | 0.2764 | 0.2726 | 0.2470 | 0.2687 | 0.2655 | 0.2495 | * |
| RFR | 0.4912 | 0.2347 | 0.2306 | 0.2714 | 0.2520 | 0.3034 | 0.2901 | ***0.2301*** | * |
| SVR | 0.2916 | 0.2600 | 0.2501 | 0.2365 | ***0.2301*** | 0.2400 | 0.2391 | 0.2555 | * |
| KNN | 0.6696 | 0.5080 | 0.2576 | 0.3290 | 0.3264 | 0.3100 | 0.3337 | 0.2581 | * |
| Lasso | * | * | * | * | * | * | * | * | 0.2815 |
| Ridge | * | * | * | * | * | * | * | * | 0.2457 |
| Elastic | * | * | * | * | * | * | * | * | 0.2447 |
*The methods are not comparable.
| Model | PCA | GUS | RFE | KBS | VT | PCC | MI | FSSmI | Embedded |
|---|---|---|---|---|---|---|---|---|---|
| MLR | 0.5279 | 0.5506 | 0.5257 | 0.5221 | 0.4970 | 0.5184 | 0.5153 | 0.4994 | * |
| RFR | 0.7008 | 0.4845 | 0.4802 | 0.5210 | 0.5020 | 0.5508 | 0.5386 | ***0.4796*** | * |
| SVR | 0.5400 | 0.5099 | 0.5001 | 0.4864 | 0.4797 | 0.4899 | 0.4890 | 0.5054 | * |
| KNN | 0.8183 | 0.7127 | 0.5075 | 0.5736 | 0.5713 | 0.5567 | 0.5777 | 0.5080 | * |
| Lasso | * | * | * | * | * | * | * | * | 0.5305 |
| Ridge | * | * | * | * | * | * | * | * | 0.4957 |
| Elastic | * | * | * | * | * | * | * | * | 0.4947 |
*The methods are not comparable.
The results presented in Tables 6-9 can be interpreted from two different aspects: (a) the best model-prediction performance for each feature selection method, independently for each evaluation criterion, and (b) the feature selection methods delivering the best model-prediction performance across the different evaluation criteria.
- (a)
MLR prediction has the best result for PCA feature selection on all evaluation criteria. RFR prediction has the best result for RFE feature selection on all evaluation criteria. For GUS feature selection, MLR prediction has the best result on one evaluation criterion and RFR prediction on the other criteria. SVR prediction has the best result for KBS feature selection on all evaluation criteria. For VT, PCC, and MI feature selections, MLR prediction has the best result on one evaluation criterion and SVR prediction on the other criteria. RFR prediction has the best result for FSSmI feature selection on all evaluation criteria. For embedded feature selection, Ridge has the best result on two of the evaluation criteria and Elastic on the other two. KNN and Lasso predictions, in combination with the various feature selection methods, were not among the best on any evaluation criterion.
- (b)
VT and FSSmI feature selections had the best results across the four evaluation criteria presented in this paper (marked in italic bold in the tables): the VT feature selection had the best results on two of the evaluation criteria and the FSSmI feature selection on the other two.
As previous studies suggest, feature selection improves the prediction model in the related metrics, prediction performance and accuracy, and computation time; thus, feature selection is employed in this paper. The results indicate that the best feature selection method is the proposed FSSmI: based on the results of all performance evaluation metrics across all ML methods, FSSmI leads to better outcomes, and thus better predictions.
In all cases, the ML methods can produce accurate predictions, which makes them a standard choice for forecasting, although for different datasets different models make the best prediction. Thus, testing various methods and models when analyzing new datasets is necessary to find the best outcome.
CONCLUSIONS
As one of the most vital resources for fulfilling life's needs, water plays a critical role in daily life. Residential WDF and the identification of its influential factors are among the most vital steps of water crisis management and control, because supplying more water is not always a practical solution: it is necessary to adopt new policies and approaches based on the water consumption pattern and use them in planning and decision-making for foreseen WD changes in upcoming years. The novel method developed in this paper provides a practical tool for policymakers to achieve sustainable management of potable WD. The importance of the water consumption pattern led this paper to adopt a new feature selection approach for residential WD prediction. Several prediction methods, such as SVR, MLR, and RFR, combined with various feature selection methods, including RFE, GUS, PCA, and MI, were tuned, implemented, analyzed, and compared in predicting potable residential WD. Seven feature selection methods resulted in different subsets of predicting parameters. Although the resulting regression coefficients relate to the rural areas of a specific province, the method can be employed to analyze the residential water consumption of different locations with various climatic and socioeconomic conditions. The outcomes of this study are:
- 1.
Water consumption is treated on a long-term yearly scale as a function of climatic and socioeconomic parameters. Various ML-based forecasting and feature selection methods are implemented in this study, and their performance is assessed for a specific problem.
- 2.
Recent developments in feature selection have addressed the performance challenge of multivariate input spaces, so that datasets with many features are relieved of the curse of dimensionality, under which the data may be overfitted. The outputs of this study indicate that adding more combinations of features does not necessarily improve the models' performance.
- 3.
A novel feature selection framework, a combination of forward selection and the SmI, is proposed to improve the forecasting models and indicate the most appropriate input combination. The RFR regression method, together with SmI-based feature selection, improves WDF performance. In general terms, greater smoothness and better interpolation make for better predictions.
- 4.
Compared with other methods, the presented method has a significant role in identifying the features effective in modeling WDF. The feature selection methods are used to reduce the feature dimensions by eliminating redundant features, and the reduced subset of features results in more accurate predictions. The FSSmI method selects effective features, including CDD, average family size, and age.
- 5.
The performance evaluation metrics (RMSE, R2, and MSE) are improved in this study: the FSSmI method shows an improvement of about 1.5%, 9%, and 8% on these measures, respectively, compared with previous methods. The results confirm the accuracy and advantage of this method.
- 6.
The results of feature selection using different methods show a close relationship between water consumption and characteristics such as subscriptions, housing ownership status, ambient temperature, and housing area distribution. Other features, such as literacy rate, housing type, unemployment rate, and age distribution, impact water consumption to a lesser extent. Among the temperature-related features, CDD, and among the housing-area features, housings with an area between 101 and 150 square metres, have the strongest relationship with water consumption.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.
REFERENCES
Author notes
Authors contributed equally to this work.