Abstract
The reliability of each stage of a process in many chemical process industries relies largely on water and energy supplies. There is therefore a strong need for an improved and controlled smart energy distribution network (SEDN) in industry. In SEDNs, sensor information related to flow control and optimization serves as the basis for modelling energy management systems, so it is important to ensure that sensor data are accurate and precise. However, sensor data are affected by random noise and measurement biases, which compromise the quality of the measurements. Data reconciliation (DR) is an approach widely used in industry to reduce the adverse impact of random errors present in pipe flow measurements. In this study, Python-based simulations of weighted least squares (WLS) and principal component analysis (PCA) based DR techniques are implemented on selected flow streams of an SEDN, and reconciled estimates are obtained. The results show that Root Mean Square Error (RMSE) is the best performance metric, since it is more sensitive to small changes in the measurement values and the reconciled estimates. Further, it is observed that PCA-DR performs better than WLS-DR in reducing the random error (and thereby achieves greater precision of the measured values).
HIGHLIGHTS
Application of data reconciliation (DR) techniques to treat random errors present in flow sensor data used by water distribution networks.
Selection of the best-performing metric to evaluate DR techniques.
Analysis of the performance of selected DR techniques for small- and large-scale networks using Python-based simulation.
INTRODUCTION
In most chemical industries, utilities such as water and energy play an important role. A sensor-based smart energy distribution network (SEDN) is required to monitor the consumption of these supplies. An SEDN can help to provide better process quality, more efficient operation, and more accurate forecasting of supply and demand. SEDN operation is usually supported by Supervisory Control and Data Acquisition (SCADA) systems (Park & Jung 2014; Quevedo et al. 2014; Kročová 2016). Measurement of process flow variables is therefore an essential part of this process, and the precision of the measurements is very important: without it, modelling and analysis can be misleading. However, measured variables are usually contaminated by fixed and random errors (Schönenberger 2015; Tran et al. 2019). The random errors creep into measurements from various sources such as high-frequency pickups, low resolution, and signal converters (Câmara et al. 2017). Data reconciliation (DR) is an approach usually applied to treat random errors present in measured variables under a constrained process environment. The most important difference between DR and other signal processing techniques is that DR uses process constraints, that is, the mass and energy balances of the process network, whereas the other techniques do not.
At times it is not possible to measure all the variables in a process because of practical difficulties. In such situations, unmeasured variables can be estimated through soft sensors or by solving a DR problem (Miao et al. 2009; Rieger et al. 2010; Quevedo et al. 2014; Narasimhan & Bhatt 2015; Xu et al. 2020). DR is a simple approach and has become an integral part of software packages such as ASPEN, VALI, VisualMesa, SimSci DATACON, and Sigmafine (Narasimhan & Bhatt 2015; Câmara et al. 2017). Techniques for dealing with random errors in the data reconciliation landscape include Weighted Least Squares Data Reconciliation (WLS-DR), Quasi-Weighted Least Squares Data Reconciliation (QWLS-DR), Robust-DR, and a few more recent techniques (Rieger et al. 2010; Zhang et al. 2010; Fuente et al. 2015; Lin et al. 2019; Xie et al. 2019).
Principal Component Analysis (PCA) is a multivariate data processing technique used extensively for dimension reduction where data on a large number of variables are available. It is also widely used in behavioural modelling of large water management systems, monitoring of water distribution (including leakage, abnormal use of water, and illegal connections), process monitoring for multi-input multi-output (MIMO) processes, soft sensor modelling, data reconciliation (DR), and gross error detection (GED). The use of PCA in data reconciliation and gross error detection is illustrated in Narasimhan & Shah (2008), Park & Jung (2014), Fuente et al. (2015), Narasimhan & Bhatt (2015), Bhattacharyya et al. (2017), Helness et al. (2019), and Varshith et al. (2019).
In order to prove that the DR techniques are actually accomplishing the task of improving the precision of measured data, performance metrics are needed. In Spuler et al. (2015), various performance metrics were explained to evaluate regression methods applied for decoding neural signals. From this, Correlation Coefficient (CC), Global deviation (GD), Signal to Noise Ratio (SNR), and Root Mean Square Error (RMSE) are identified as the most suitable metrics to evaluate DR techniques. The other performance measures are Relative Error Reduction (RER), Measurement Relative Error (MRE), and Reconciled Relative Reduction Error (RRE), which are explained in Zhang et al. (2010). In order to find the best metric, factors that lead to deterioration of data should be looked at, and that is discussed in the following section of this study.
In this paper, integrated water supply networks are considered whose pipe flow streams are assumed to be contaminated by Gaussian noise. Two DR techniques, WLS-DR and PCA-based DR, are applied to the measurements, and reconciled estimates are obtained. The selected metrics are applied to evaluate the performance of the DR techniques, and the best metric is then identified. Further, the same procedure is implemented and evaluated for other datasets that have different variances and serially correlated errors.
DATA RECONCILIATION (DR) TECHNIQUES








DR produces optimised estimates of a system's measurements that are less affected by random noise. The following prerequisites are essential in order to apply DR to a dataset.
- i. The process constraints, which consist of material and energy balance equations, should be defined.
- ii. The set of measured process variables must be specified, and the inaccuracies in these measurements must be specified in terms of the associated variances and covariances.





The measurement model shown in Equation (3) is realistic since, in practice, over a long period, measurements may change only because of random error. Here, the true value vector, the variance of the measurements, and the process incidence matrix are considered to be fixed, which is a realistic assumption.
A few further assumptions are made about the random errors.
For independent and identically distributed (i.i.d.) data:
- i. random errors are normally distributed, i.e., r(j) ~ N(0, Σ),
- ii. process variables and errors are not correlated, i.e., E[a(j)r(j)] = 0, and
- iii. errors are not serially correlated, i.e., E[r(j)r(j − 1)] = 0.
Sometimes, data are serially correlated because of process dynamics, recycling, the addition of controller loops, etc. Serial correlation may affect the performance of DR techniques. Hence, the performance of DR techniques for identically distributed but non-independent (non-i.i.d.) measurements is also evaluated here (Lin et al. 2019; Jeyanthi & Devanathan 2020).
The assumptions considered for non-i.i.d. data are:
- i. random errors are normally distributed, i.e., r(j) ~ N(0, Σ),
- ii. process variables and errors are not correlated, i.e., E[a(j)r(j)] = 0, and
- iii. errors are serially correlated, i.e., E[r(j)r(j − 1)] ≠ 0.
Weighted least squares data reconciliation (WLS-DR)






In this formulation, the reconciled estimates for m(j) are obtained from the measurements, the error covariance matrix, and the constraint matrix.
The obtained estimates are normally distributed and satisfy the process constraints.
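As a minimal sketch of how such estimates can be computed, the block below uses the standard closed-form WLS solution for linear constraints, in which the measurements are projected onto the constraint set with adjustments weighted by the error covariance. The symbols `A` (constraint matrix), `Sigma` (error covariance), and `m_hat` (reconciled estimates), and the three-stream splitter with F1 = F2 + F3, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def wls_reconcile(m, A, Sigma):
    """Standard WLS-DR closed form for linear constraints A x = 0.

    m may be a single measurement vector (n,) or a data matrix (n, N).
    """
    residual = A @ m  # constraint violation of the raw measurements
    correction = Sigma @ A.T @ np.linalg.solve(A @ Sigma @ A.T, residual)
    return m - correction

# Hypothetical splitter: stream 1 splits into streams 2 and 3.
A = np.array([[1.0, -1.0, -1.0]])
Sigma = np.eye(3)                 # equal, uncorrelated error variances
m = np.array([10.3, 5.9, 3.8])    # noisy readings of true flows 10, 6, 4

m_hat = wls_reconcile(m, A, Sigma)
print("reconciled estimates:", m_hat)
print("mass balance residual:", A @ m_hat)
```

With equal variances, the 0.6 imbalance in the raw readings is shared equally among the three streams, so the reconciled flows close the mass balance exactly.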
Principal component analysis (PCA) based data reconciliation
PCA is a statistical technique that is generally used to reduce the dimensionality of data. It can also be used to identify linear relationships between variables. PCA extracts a small number of uncorrelated variables, called principal components (PCs), from a large set of data, with the goal of explaining the maximum amount of variance with the fewest PCs. Narasimhan & Bhatt (2015) describe an approach for applying PCA to DR, a recent technique for obtaining reconciled estimates. PCA-DR proves effective for large data sets and can be deployed when the error covariance matrix is known. The data matrix M is transformed as in Equation (7)
(7)
where the scaling factor is obtained from a decomposition of the error covariance matrix (Σ) of the process network.
(8)
The positive square roots of the eigenvalues of the covariance matrix of the transformed data are termed the singular values of the transformed data matrix. The ɤ largest singular values are positive, and their number corresponds to the rank of the underlying data, while the remaining n − ɤ singular values are small and are used to define the dependent variables in the process. A scree plot (Narasimhan & Bhatt 2015) can be used to select the number of largest eigenvalues (ɤ).


The reconciled estimates of the measurements are derived from the first part of the decomposition, that is, the singular vectors associated with the ɤ largest singular values. The second part, associated with the remaining n − ɤ singular values, expresses the interaction among variables, so the constraint matrix of the process can be identified from it. It may be noted that the PCA-DR estimates do not satisfy the original constraints, but rather satisfy the identified constraint matrix.
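The following is a sketch of PCA-based DR along the lines of the approach described by Narasimhan & Bhatt (2015) for a known error covariance; it is not the authors' code. The names `L` (Cholesky factor of Σ), `U1`, `U2` (leading and trailing left singular vectors), `A_hat` (identified constraint matrix), and the splitter network are assumptions for illustration. The true flows are deliberately varied across samples, another assumption, so that the constraint direction can be identified from the data.

```python
import numpy as np

def pca_reconcile(M, Sigma, n_components):
    """PCA-based DR sketch. M has shape (n_variables, n_samples)."""
    L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T
    M_s = np.linalg.solve(L, M)              # scaled data, L^{-1} M
    U, s, _ = np.linalg.svd(M_s / np.sqrt(M.shape[1]), full_matrices=False)
    U1, U2 = U[:, :n_components], U[:, n_components:]
    M_hat = L @ (U1 @ (U1.T @ M_s))          # keep the leading subspace only
    A_hat = U2.T @ np.linalg.inv(L)          # identified constraint matrix
    return M_hat, A_hat, s

rng = np.random.default_rng(1)
N = 500
f2 = rng.uniform(4.0, 8.0, N)                # hypothetical varying flows
f3 = rng.uniform(2.0, 6.0, N)
truth = np.vstack([f2 + f3, f2, f3])         # F1 = F2 + F3 holds exactly
Sigma = 0.04 * np.eye(3)
M = truth + rng.multivariate_normal(np.zeros(3), Sigma, size=N).T

M_hat, A_hat, s = pca_reconcile(M, Sigma, n_components=2)
print("singular values:", s)                 # trailing value flags the constraint
print("identified constraint row:", A_hat)
print("max residual of identified constraint:", np.abs(A_hat @ M_hat).max())
```

The estimates satisfy the identified constraint to numerical precision, while the original balance is satisfied only as closely as `A_hat` approximates the true constraint direction, which is consistent with the remark above.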
Performance metrics
It is important to evaluate any technique so as to ensure that it is reliable even for a large set of data. The various performance metrics must capture the error properties like bias, scaling, and other types of errors present in the measurement. They should be sensitive enough to capture these error properties and perform accurately. Among various performance metrics, Correlation Coefficient (CC), Root Mean Square Error (RMSE), Global Deviation (GD), Signal to Noise Ratio (SNR), and Relative Error Reduction (RER) are selected (Chai & Draxler 2014; Narasimhan & Bhatt 2015; Spuler et al. 2015; Xie et al. 2019). In this paper, two cases are taken for the study. The first is where only random errors (both i.i.d and non-i.i.d) are present in the measurements, and the second is where both random error and measurement bias are present in measurements.
The performance metrics are explained below.
Root mean square error (RMSE)
Global deviation (GD)
Correlation coefficient (CC)
Signal to noise ratio (SNR)
Relative error reduction (RER)
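Of the metrics listed above, RMSE and CC have widely used standard definitions, sketched below with NumPy. This is an illustration under that assumption: the exact expressions used in the study are not reproduced in this text, and GD, SNR, and RER are omitted because their definitions vary between sources. The sample values are hypothetical.

```python
import numpy as np

def rmse(estimate, truth):
    """Root mean square error between estimates and reference values."""
    estimate = np.asarray(estimate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

def correlation_coefficient(estimate, truth):
    """Pearson correlation; undefined (NaN) if either input is constant."""
    return float(np.corrcoef(estimate, truth)[0, 1])

# Hypothetical reference values with raw and reconciled errors added.
truth = np.array([10.0, 10.5, 11.0, 10.5, 10.0])
raw = truth + np.array([0.4, -0.3, 0.5, -0.4, 0.2])
reconciled = truth + np.array([0.2, -0.1, 0.2, -0.2, 0.1])

print("RMSE raw:       ", rmse(raw, truth))
print("RMSE reconciled:", rmse(reconciled, truth))
print("CC raw:         ", correlation_coefficient(raw, truth))
print("CC reconciled:  ", correlation_coefficient(reconciled, truth))
```

A lower RMSE for the reconciled series than for the raw series indicates that reconciliation has moved the estimates closer to the reference values.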


SIMULATION
To analyse the performance of DR techniques, two benchmark process systems, a small scale recycle network and a large scale process network, have been chosen for the study. The selection of the benchmark systems (Valle et al. 2018; Varshith et al. 2019; Jeyanthi & Devanathan 2020) is based on the number of variables and interacting nodes in the process. This would lend credence to the performance evaluation of the techniques included in the study.
Example 1
A small recycle network (Xie et al. 2019; Jeyanthi & Devanathan 2020) shown in Figure 1 consists of five interacting nodes with seven flow variables (F1, F2, F3, F4, F5, F6, and F7). The process constraints are linear and estimated from Equation (17).

Input and output flow variables are denoted as '1' and '−1' respectively. The process constraint matrix is derived from the incidence matrix by removing node '5', which has non-recycled variables in the network.
In order to evaluate the performance of each technique, a few flow variables with specific magnitudes are identified. The performance indices are calculated as explained in the previous section. The reconciled estimates are obtained using corresponding DR techniques as explained above, and the results are compared with raw data. The performance index calculation procedure is explained as follows:
Step 1: Obtain the raw measurements
Step 2: Apply the DR technique
Step 3: Calculate the reconciled estimates
Step 4: Calculate the performance index
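The four steps above can be sketched end to end as follows. The splitter network, its base values, the sample count, and the choice of WLS-DR with RMSE as the index are illustrative assumptions; the recycle network of Figure 1 is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical network: stream 1 splits into streams 2 and 3.
A = np.array([[1.0, -1.0, -1.0]])
truth = np.array([10.0, 6.0, 4.0])   # assumed base values
Sigma = np.eye(3)                    # assumed error covariance
N = 5000                             # number of simulated samples

# Step 1: raw measurements = base values + Gaussian noise, shape (3, N).
M = truth[:, None] + rng.multivariate_normal(np.zeros(3), Sigma, size=N).T

# Steps 2 and 3: apply WLS-DR to every sample to get reconciled estimates.
M_hat = M - Sigma @ A.T @ np.linalg.solve(A @ Sigma @ A.T, A @ M)

# Step 4: performance index (RMSE per flow variable).
def rmse_per_variable(X):
    return np.sqrt(np.mean((X - truth[:, None]) ** 2, axis=1))

rmse_raw = rmse_per_variable(M)
rmse_rec = rmse_per_variable(M_hat)
print("RMSE raw:       ", rmse_raw)
print("RMSE reconciled:", rmse_rec)
```

Comparing the two printed vectors variable by variable mirrors the raw-data versus reconciled rows of a table such as Table 1.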
The performance of the DR techniques for the recycle network is shown in Table 1. For all variables F1 to F7, SNR shows little variation among the DR techniques, and CC and GD show no significant change that would indicate a performance improvement. RER changes prominently for the WLS-DR and PCA-DR techniques, but it takes negative values for poorly reconciled estimates. RMSE changes significantly across the DR techniques and is always non-negative. From these observations, RMSE is the most suitable measure for assessing the performance of DR techniques. PCA-DR is sensitive to magnitude and performs poorly for the variables with higher base values, F2, F3, and F5. WLS-DR performs well for the feedback variables F4 and F6, and poorly for the other variables.
Table 1. Performance of DR techniques for the recycle network

| Flow variable | DR technique | SNR | CC | GD | RER | RMSE |
|---|---|---|---|---|---|---|
| 1 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0235 | 0.0000 | 0.0000 | 0.3288 | 0.6753 |
| | PCA-DR | 1.0482 | 0.0000 | 0.0003 | 0.5280 | 0.4745 |
| 2 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0062 | 0.0000 | 0.0001 | −0.3301 | 1.3390 |
| | PCA-DR | 1.0057 | 0.0000 | 0.0007 | −0.3810 | 1.3904 |
| 3 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0062 | 0.0000 | 0.0001 | −0.3301 | 1.3390 |
| | PCA-DR | 1.0057 | 0.0000 | 0.0007 | −0.3810 | 1.3904 |
| 4 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0919 | 0.0000 | 0.0000 | 0.6529 | 0.3487 |
| | PCA-DR | 1.0482 | 0.0000 | 0.0003 | 0.5280 | 0.4745 |
| 5 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | PCA-DR | 1.0125 | 0.0000 | 0.0000 | 0.0755 | 0.9305 |
| 6 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0919 | 0.0000 | 0.0000 | 0.6529 | 0.3487 |
| | PCA-DR | 1.0482 | 0.0000 | 0.0003 | 0.5280 | 0.4745 |
| 7 | Raw data | 1.0107 | 0.0000 | 0.0001 | 0.0000 | 1.0065 |
| | WLS-DR | 1.0235 | 0.0000 | 0.0000 | 0.3288 | 0.6753 |
| | PCA-DR | 1.0482 | 0.0000 | 0.0003 | 0.5280 | 0.4745 |
Example 2
The large process network (Varshith et al. 2019) shown in Figure 2 consists of 11 nodes representing the balance equations and 28 flow variables. The base value of each variable is given in Table 2. The constraint matrix for this process is calculated as in Equation (17).
Table 2. Base values for flow variables

| Flow variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base value | 10 | 10 | 30 | 10 | 20 | 20 | 30 | 20 | 10 | 10 | 10 | 20 | 10 | 10 |
| Flow variable | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 |
| Base value | 10 | 5 | 5 | 40 | 5 | 5 | 10 | 20 | 20 | 10 | 5 | 30 | 45 | 15 |
Figure 3 shows the performance indices of the selected metrics for flow variables F7, F16, F22, and F27. In Figure 3(a), RMSE is the best metric for capturing minute changes in the noise present in the data, followed by RER and CC. SNR and GD remain constant throughout and are therefore of no value in distinguishing the different data sets. For variables F16 and F22, shown in Figures 3(b) and 3(c) respectively, SNR and GD again remain constant, whereas RMSE, RER, and CC vary: RMSE and RER show an improvement for the PCA-DR data, and CC shows no improvement for the WLS-DR data. Variable F27, shown in Figure 3(d), also shows variation in RMSE, RER, and CC. Since RMSE is largely consistent and is sensitive to small changes in noise, it is used as the main evaluation metric. RER almost complements RMSE, but is less sensitive at times.
To analyse the performance of PCA-DR and WLS-DR on the process data, flow variables F16, F1, F28, F5, F3, F18, and F27 are considered in order of increasing base magnitude. Figure 4(a) shows the performance of the DR techniques when Σ = I. The WLS-DR estimates are consistent and independent of base magnitude. The estimates obtained by DR are maximum likelihood estimates and are more accurate than the raw process measurements. The performance of PCA-DR varies linearly with magnitude; for high magnitudes, it is less accurate than WLS-DR. Thus, the performance of PCA-DR decreases as the base magnitude of the flow variables increases.
Figure 4(b) compares the performance of WLS-DR and PCA-DR for different variances in the data for variable F16. As the variance increases, the performance of WLS-DR decreases, whereas PCA-DR continues to perform well. This shows that an increase in variance does not affect PCA-DR performance in the way that an increase in magnitude does.
Figure 4(c) shows the performance of the DR techniques for different variances when a bias (δ) of 5σ is present in variable F16. As the variance increases, the RMSE of the WLS-DR estimates also increases. The PCA-DR estimates vary only slightly, and their RMSE remains around the gross error value. This indicates that PCA-DR can be used for detecting the presence of bias even when the variance changes.
Figure 4(d) shows that as the value of the gross error increases, the RMSE of PCA-DR varies linearly with the bias, and its magnitude is equal to that of the gross error. The RMSE of WLS-DR also varies linearly with the gross error, but is not equal to its magnitude. PCA-DR can therefore be combined with gross error detection techniques to identify gross errors present in measurements.
Measurements are usually contaminated by random and gross errors, so it is important to analyse the effect of gross errors on the performance of DR techniques. Figure 5 shows two scenarios based on the presence of gross error (bias) of magnitude 5σ, in 5(a) variables F16 and F27 and 5(b) F16 and F27. The DR techniques did not increase the accuracy of the estimates: WLS-DR distributed the gross error across the estimates, and PCA-DR brought no improvement. As seen previously, PCA-DR depends on the magnitude of the variable for i.i.d. data; in the presence of bias, irrespective of base value, its RMSE equalled the magnitude of the gross error present.
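The way WLS-DR distributes a gross error can be seen in a small noise-free sketch. The three-stream series network, the base values, and the bias of 5 units are hypothetical and chosen only to isolate the effect; they do not correspond to Figure 5.

```python
import numpy as np

# Hypothetical series network: F1 -> node -> F2 -> node -> F3.
A = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
Sigma = np.eye(3)
truth = np.array([10.0, 10.0, 10.0])
bias = np.array([5.0, 0.0, 0.0])   # gross error on F1 only
m = truth + bias                   # random noise omitted to isolate the bias

# Standard WLS-DR closed form for linear constraints A x = 0.
m_hat = m - Sigma @ A.T @ np.linalg.solve(A @ Sigma @ A.T, A @ m)

print("error before DR:", m - truth)
print("error after DR: ", m_hat - truth)
```

Before reconciliation only F1 is in error; afterwards the bias is shared equally among all three streams, so the unbiased measurements F2 and F3 become less accurate than they were.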
To analyse the performance of the DR techniques on non-i.i.d. data as well, data with Auto Regressive Moving Average (ARMA) noise were simulated. The ARMA(1, 1) model had φ1 and θ1 values of 0.4 and 0.2, while the ARMA(2, 2) model had φ1, φ2 values of 0.5, 0.4 and θ1, θ2 values of −0.4, 0.2 respectively. The DR techniques were then applied to these data, and RMSE was calculated to evaluate the techniques. Figure 6 shows the comparison between the DR techniques applied to i.i.d. and non-i.i.d. data. The PCA-DR estimates of non-i.i.d. data are similar for ARMA(1, 1) and ARMA(2, 2) noise, and serial correlation degrades PCA-DR because of erroneous variance estimation. WLS-DR, when applied to the non-i.i.d. data, shows high inconsistency in the estimates and gives inaccurate results, which indicates that WLS-DR is not suitable for non-i.i.d. data. Overall, both PCA-DR and WLS-DR perform better on i.i.d. data than on non-i.i.d. data.
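A sketch of how ARMA(1, 1) noise with φ1 = 0.4 and θ1 = 0.2 might be generated is given below. The sign convention r(j) = φ1 r(j − 1) + e(j) + θ1 e(j − 1), the unit innovation variance, and the burn-in length are assumptions; the simulation code used in the study is not shown in this text.

```python
import numpy as np

def arma11_noise(n, phi, theta, sigma=1.0, burn_in=500, seed=0):
    """Simulate r(j) = phi*r(j-1) + e(j) + theta*e(j-1), e ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    e = rng.normal(0.0, sigma, size=n + burn_in)
    r = np.zeros(n + burn_in)
    for j in range(1, n + burn_in):
        r[j] = phi * r[j - 1] + e[j] + theta * e[j - 1]
    return r[burn_in:]               # drop the start-up transient

phi, theta = 0.4, 0.2
r = arma11_noise(50_000, phi, theta)

# Stationary variance of ARMA(1,1) with unit innovation variance.
theory_var = (1 + 2 * phi * theta + theta**2) / (1 - phi**2)
sample_var = float(r.var())
lag1_corr = float(np.corrcoef(r[1:], r[:-1])[0, 1])

print("theoretical variance:", theory_var)
print("sample variance:     ", sample_var)
print("lag-1 correlation:   ", lag1_corr)
```

Under these assumptions the noise variance is about 1.43 times the innovation variance, and successive errors are clearly correlated. A covariance matrix specified as if the errors were i.i.d. with the innovation variance would therefore misstate the actual error variance, which is one way the variance estimation mentioned above can go wrong.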
Performance of DR techniques for i.i.d. and non-i.i.d. for varying magnitudes of flow variable.
CONCLUSION
The work presented here can be applied to the monitoring of water inflows in urban smart water network management, wastewater treatment, and other areas where water supply is vital. Although the present work focuses on techniques that contribute to the body of knowledge in metrology, these techniques find significant and direct applications in resource reconciliation in water supply networks. The specific challenges of commercial (process), domestic, power, mining, and utility water use may differ, but in the context of water supply networks they share a common structure: such networks consist of two basic elements, streams (water flows) and nodes (convergence/divergence points of flows). The techniques outlined in this paper are therefore generally applicable to all such networks, as they exploit a useful relationship constraint, namely the mass balances. In such networks, reconciliation must address physical leaks as well as measurement biases, both of which form the core focus of data reconciliation and gross error detection techniques.
This work presents an approach that integrates the process of selecting a suitable data reconciliation technique with the process of selecting the best performance evaluation metric under different scenarios. The study reveals that the RMSE is the best metric (among those considered), since it is very sensitive to small changes in the measurement and estimates. SNR, GD, and CC were seen to not capture significant changes in reconciliation. RER performs in a very different way from that of RMSE, and has negative values for poorly reconciled estimates.
From the two examples discussed in this paper, PCA-DR performs better than WLS-DR for i.i.d. data when the variables have smaller magnitudes. As the difference in magnitudes between process variables increases, PCA-DR becomes less accurate than WLS-DR. Under changing variance, PCA-DR is better than WLS-DR. For biased data, both techniques failed to reduce the random errors. For serially correlated (non-i.i.d.) data, both PCA-DR and WLS-DR perform poorly because of the change in variance.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.