Abstract

Because of the static nature of conventional principal component analysis (PCA), natural process variations may be interpreted as faults when it is applied to processes with time-varying behavior. In this paper, therefore, we propose a complete adaptive process monitoring framework based on incremental principal component analysis (IPCA). This framework updates the eigenspace by incrementing new data to the PCA at a low computational cost. Moreover, the contribution of variables is recursively provided using complete decomposition contribution (CDC). To impute missing values, the empirical best linear unbiased prediction (EBLUP) method is incorporated into this framework. The effectiveness of this framework is evaluated using benchmark simulation model No. 2 (BSM2). Our simulation results show the ability of the proposed approach to distinguish between time-varying behavior and faulty events while correctly isolating the sensor faults even when these faults are relatively small.

HIGHLIGHTS

  • A novel framework is proposed for adaptive fault detection and diagnosis based on IPCA.

  • The validity and applicability of the proposed framework are examined using benchmark simulation model No.2 (BSM2).

  • The realtime imputation of intricate patterns of missing values is performed using EBLUP.

  • The proposed framework can detect small-magnitude drift faults while tracking time-varying behavior.

Graphical Abstract

Graphical Abstract
Graphical Abstract

ABBREVIATIONS

     
  • X

    Data matrix

  •  
  • T

    Score matrix

  •  
  • P

    Loading matrix

  •  
  • E

    Residual matrix

  •  
  • S

    Covariance matrix

  •  
  • n

    Data matrix row number

  •  
  • m

    Data matrix column number

  •  
  • Eigenvector matrix for kth sample

  •  
  • Eigenvalues matrix for kth sample

  •  
  • jth eigenvalues

  •  
  • Number of principal components to be retained for kth sample

  •  
  • Hotelling's T-squared for kth sample

  •  
  • Square prediction error for kth sample

  •  
  • kth sample

  •  
  • Standardized kth sample

  •  
  • Mean of kth sample

  •  
  • f

    Forgetting factor

  •  
  • Standard deviation for kth sample

  •  
  • Projection of kth sample using the current eigenvector

  •  
  • Residual vector for kth sample

  •  
  • Normalized residual vector for kth sample

  •  
  • Euclidean norm of matrix

  •  
  • Rotation matrix for kth sample

  •  
  • A matrix that decomposed to obtain

  •  
  • Cumulative percent variance with retained principal components

  •  
  • Chi-distribution with degrees of freedom and a confidence limit

  •  
  • T2 threshold for kth sample

  •  
  • SPE threshold for kth sample

  •  
  • Normal deviate corresponding to the (1 − α) percentile

  •  
  • Complete decomposition contribution for kth sample using Hotelling's T-squared

  •  
  • Complete decomposition contribution for kth sample using SPE

  •  
  • Vector of kth sample with observed values

  •  
  • Vector of kth sample with missing values

  •  
  • Mean of observed values of k − 1th sample

  •  
  • Mean of missing values of k − 1th sample

  •  
  • Eigenvector matrix of observed values of k − 1th sample

  •  
  • Eigenvector matrix of missing values of k − 1th sample

  •  
  • Imputed missing values of kth sample

  •  
  • Conditional expectation

  •  
  • Sin

    BSM2 inorganic nitrogen concentration in anaerobic digester

  •  
  • BSM2 secondary clarifier exponential settling velocity function

  •  
  • BSM2 growth rate for the autotrophs

  •  
  • PCA

    Principal component analysis

  •  
  • IPCA

    Incremental principal component analysis

  •  
  • EBLUP

    Empirical best linear unbiased prediction

  •  
  • BSM2

    Benchmark simulation model No.2

  •  
  • WRRF

    Water resource recovery facilities

  •  
  • WWTP

    Wastewater treatment plant

  •  
  • MSPCs

    Multivariate statistical process controls

  •  
  • RPCA

    Recursive principal component analysis

  •  
  • EWMA

    Exponential weighted moving average

  •  
  • FOP

    First-order perturbation

  •  
  • PCs

    Principal components

  •  
  • TSS

    Total suspended solid

  •  
  • PI

    Proportional integral

  •  
  • AD

    Anaerobic digestion

  •  
  • COD

    Carbon-oxygen demand

  •  
  • FAR

    False alarm rate

  •  
  • MDR

    Missed detection rate

INTRODUCTION

Because of the complexity of modern industrial processes, the demand has increased to implement fault detection and a diagnosis framework for those processes. Water resource recovery facilities (WRRF), formerly called wastewater treatment plants (WWTP), are no exception (Sánchez-Fernández et al. 2018). Abnormal or faulty events are those that occur when the process indicates a deviation from its normal behavior. In WRRF, abnormal events include changes in influent quality (e.g. rainfall, industrial discharge), outbreaks of microorganisms (e.g. filamentous bacteria, algae) that impact treatment quality, irregularities or damage to treatment units (e.g. membranes, clarifiers), mechanical failures (e.g. pumps, air blowers) and sensor failure (e.g. drift, bias, electrical interference) (Newhart et al. 2019). All these faults can undermine process performance, so they must be detected and diagnosed as soon as they occur. Fault detection techniques are generally divided into three main groups: model-based methods, knowledge-based methods, and data-driven methods. To implement techniques in the first two groups, an in-depth knowledge of the system's behavior is required. However, obtaining in-depth knowledge of reactions or pathways phenomena (especially for WRRFs, which are very complicated processes) is both time-consuming and challenging.

Data-driven methods, on the other hand, rely only on historical and online data and do not require in-depth prior knowledge of the process (Kazemi et al. 2020). Thanks to the advanced process control framework, WRRFs collect large amounts of data using measurements from numerous online sensors (Wang & Shi 2010). Data-driven techniques have therefore received significant attention in process monitoring over the last few years. Multivariate statistical process controls (MSPCs) are a subset of data-driven methods used to analyze the performance of the processes and identify the parameters that govern them. One of the most widely used techniques for MSPC is principal component analysis (PCA) (Newhart et al. 2019). Various types of PCA methods, such as conventional PCA (Garcia-Alvarez et al. 2009; Sanchez-Fernández et al. 2015; Xiao et al. 2017), kernel PCA (Lee et al. 2004; Jun et al. 2006; Xiao et al. 2017) and dynamic PCA (Lee et al. 2006; Mina & Verde 2006) have been successfully used to monitor processes and detect faults in time-invariant processes. However, because of the complexity of the biological reactions and non-stationary plant influent, WRRFs have time-varying characteristics. Using conventional PCA to monitor such processes may therefore lead to excessive rates of false alarms and missing detections. An adaptive process monitoring framework is therefore needed. Although adaptive PCAs are more suitable for non-stationary processes, due to high computational costs, not every adaptive approach is applicable for realtime fault detection. Several adaptive approaches include recursive PCA (RPCA), moving window PCA and exponential weighted moving average (EWMA) have recently been used for fault detection (Li et al. 2000; Rosen & Lennox 2001; Choi et al. 2006; Elshenawy et al. 2010; Shang et al. 2015). Haimi et al. implemented a moving window PCA technique to detect and isolate the faults in WRRF (Haimi et al. 2016). These authors took into account the variable length of the historical data in the model's construction to prevent sub-optimal monitoring performance. Li et al. applied an RPCA monitoring framework that recursively updates the correlation matrix. These authors used two approaches, namely rank-one modification and Lanczos tridiagonalization, to compute the eigenvalues of the updated correlation matrix (Li et al. 2000). Recently, Elshenawy et al. proposed RPCA based on first-order perturbation analysis (FOP), which is a rank-one update of the eigenvalues and their corresponding eigenvectors from a sample covariance matrix (Elshenawy et al. 2010; Elshenawy & Mahmoud 2018). These authors stated that the computational cost of their approach is lower than that of previously used methods such as RPCA using Lanczos tridiagonalization and MWPCA.

In this paper we propose a new low-computational-cost adaptive fault detection framework for WRRF that uses incremental PCA (IPCA). This approach is based on the IPCA proposed by Hall et al., which was motivated by developments in the field of computer vision (Hall et al. 1998; Brand 2002; Arora et al. 2012). The main benefit of IPCA over other adaptive PCA approaches is that it does not require all eigenvalues to be computed since only the largest ones are needed. This considerably accelerates computation time. Cardot and Degras (Cardot & Degras 2018) compared the computation time of adaptive PCA methods such as perturbation and stochastic approximation and concluded that IPCA outperforms other methods, particularly when the data have many features (sensor measurements). These authors also studied the accuracy of adaptive methods in determining the eigenspace. Their results revealed that IPCA is more accurate than methods such as RPCA, which is based on FOP theory. The IPCA algorithms offer an excellent compromise between statistical accuracy and computational speed. Another major drawback of PCA approaches are missing data during realtime monitoring. In practice, due to sudden mechanical breakdown, maintenance, hardware sensor failure or malfunction of the data acquisition system, etc., some sensor signals may become unavailable (Zhang & Dong 2014). To meet the requirements of realtime fault detection and diagnosis, therefore, the problem of missing data needs to be considered. To handle missing data, the empirical best linear unbiased prediction (EBLUP) approach is incorporated into the proposed IPCA method (Cardot & Degras 2018).

Once a fault is detected, it is essential to find out its primary source. The most common fault isolation method are contribution charts (Nomikos & MacGregor 1995). With this method, the variables that contribute to the T2 (Hotelling's T-squared) and SPE (square prediction error) statistics are calculated and the variable with the highest contribution is considered the primary cause of the fault. Because of the recursive nature of IPCA, the contribution plots of T2 and SPE are updated in accordance with the adaptive eigendecomposition. To test the proposed framework, several types of process parameters and sensor faults were simulated using Benchmark Simulation Model No. 2 (BSM2) (Jeppsson et al. 2006). This paper is organized as follows: first we provide a brief review of the theoretical background behind the conventional PCA method; then we discuss the IPCA algorithm and its realtime application framework for fault detection and diagnosis; and finally we analyze the performance of the monitoring framework on the complete and incomplete (i.e. with missing values) data sets using BSM2.

CONVENTIONAL PCA

Principle component analysis (PCA) is a well-known method designed to convert a data set with possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PCs) (Xiao et al. 2017). Let a data matrix ∈ ℜn×m be composed of n samples and m sensors. The X matrix is scaled to zero mean and unit variance (Z-score normalization) in order to avoid the scaling problem. The PCA model can be defined as:
formula
(1)
where T ∈ ℜn×m, P ∈ ℜm×m, and E ∈ ℜn×m are score (PCs), loading, and residual matrices, respectively. The scores are generated by projecting X onto to the loading matrix. To solve the PCA model the covariance matrix, S ∈ ℜm×m, of should be decomposed as follows:
formula
(2)
The columns of P are eigenvectors of S, and ∈ ℜm×m is a diagonal matrix containing the eigenvalues arranged in descending order. Usually, the number of (β) principal components, which sufficiently explain the variability of the process, are selected. Therefore, P ∈ ℜβ and its corresponding eigenvalues are ∈ ℜβ×β. For fault detection, two statistics (mainly T² and SPE) are normally used. Variations in the mean and covariance in PCs are measured using T² while the variation in the residual subspace is measured using SPE (Elshenawy & Awad 2012). These are given by:
formula
(3)
formula
(4)

Here, as mentioned earlier, is the number of PCs to retain. To perform fault detection, suitable thresholds for these two statistical indices must be obtained. This enables small faults to be detected with a minimum rate of false alarms. Process operation is normal if these statistical indices remain below these thresholds. The procedure for estimating the thresholds of T² and SPE will be discussed later.

INCREMENTAL PCA

Detecting and monitoring faults using conventional PCA is suitable for time-invariant processes. If it were used for time-variant processes, it would be difficult to monitor the typical time-varying characteristics caused by external disturbances and changes in operating conditions. To handle time-varying issues, an adaptive fault detection framework must be developed (Hu et al. 2012). The key idea behind IPCA is to update the eigenvector by incrementing the new data to the PCA. To update process monitoring, both the mean mk ∈ ℜ1×m and the standard deviation δk ∈ ℜ1×m need to be re-estimated whenever a new sample xk ∈ ℜ1×m data vector becomes available (Hall et al. 1998; Artač et al. 2002; Cardot & Degras 2018). These are given by:
formula
(5)
formula
(6)
where is a forgetting factor. Let us assume that eigenvector has already been derived using the data from. Also, is a diagonal matrix of the corresponding eigenvalues , and and are the mean and variance vector of a new sample, respectively. Each sample must be standardized according to:
formula
(7)
where is the standardized sample. Next, the projection of the new sample is computed using the current eigenvector :
formula
(8)
Finally, residual vector is estimated using the feature vector. The residual vector is orthogonal to the eigenvector.
formula
(9)
To update the eigenvector, needs to be normalized according to the condition outlined in the equation below:
formula
(10)
where is the Euclidean norm of matrix .
The new eigenvector can be estimated by adding to the current eigenvector and rotating the result using rotation matrix:
formula
(11)
is calculated by solving the eigenvector decomposition problem as shown below:
formula
(12)
where matrix is composed as follow:
formula
(13)
where is a vector of zero and . By updating the eigenvector and the eigenvalue matrix for each new sample, the process monitoring statistics must also be updated according to Equations (3) and (4).

Estimating the number of PCs

To fully implement IPCA, the number of PCs must be updated after a new sample becomes available. Numerous methods are available for calculating the number of PCs (Li et al. 2000). In this study we used cumulative percent variance (CPV), which can be obtained from:
formula
(14)
where is a number of selected PCs. The number of PCs is chosen when CPV reaches a predetermined limit, e.g. 95%.

Process monitoring statistics thresholds

Because of the non-stationary behavior of water resource recovery facilities, the thresholds for the detection statistics (T2 and SPE) change over time. For realtime monitoring, therefore, it is necessary to adapt these limits (Li et al. 2000). The T2 threshold is approximated using the chi-distribution with degrees of freedom and a confidence limit :
formula
(15)
and the threshold for the SPE statistic is given by:
formula
(16)
where is the normal deviate corresponding to the (1-α) percentile, and
formula
(17)
formula
(18)

The fault is detected once the T2 and SPE statistics exceed their respective adaptive thresholds, and .

Fault isolation

Once a fault has been detected, the variables that contribute to the deviation must be identified in order to determine the location of primary causes. The idea behind the contribution chart is that variables with the largest contributions to the fault detection statistics are probably the faulty variables (Alcala & Qin 2009). Many methods are available for calculating the contributions of variables. In this study we used complete decomposition contribution (CDC), which is widely used in the industry. The CDC for the T2 and SPE statistics is defined as:
formula
(19)
formula
(20)
where and are the vectors whose elements are the variables that contribute to the T2 and SPE statistics for each column (sensor) of a sample, respectively.

Missing data imputation

Methods such as mean, regression, hot-deck, maximum likelihood, multiple imputations, etc., are suggested for imputing missing values in the realtime application of PCA (Cardot & Degras 2018). In this paper we used the empirical best linear unbiased prediction (EBLUP) method described by Brand (2002) to impute missing values in the new observation vector . With this method, the missing observations in vector are approximated using the mean value and the eigenvector decomposition ( and ) of the current sample. The can be split into two sub-vectors: (observed values) and (missing values), respectively. Similarly, and are split into , and , respectively. Let be the diagonal matrix containing the square roots of the diagonal eigenvalues arranged in descending order (Brand 2002; Cardot & Degras 2018). By applying the conditional expectation for multivariate normal distribution, the EBLUP can be obtained as follows:
formula
(21)
where is the imputed missing value.

ADAPTIVE FAULT DETECTION AND ISOLATION

In this section we discuss the complete structure of the adaptive fault detection and isolation scheme. The proposed adaptive fault detection and isolation framework can be divided into two stages: (i) offline training and (ii) online monitoring. The following steps summarize the overall procedure.

Offline training

  • (1)

    Collect the training data matrix ∈ ℜn×m and normalize it to zero mean and unit variance

  • (2)

    Use Equation (2) to compute S and its corresponding eigenpairs , .

  • (3)

    Calculate the number of retaining PCs () by applying Equation (14).

  • (4)

    Calculate the thresholds of the fault detection statistics and using Equations (15) and (16).

Online monitoring

  • (1)

    Collect a new data vector and normalize it to zero mean and unit variance using Equations (5)–(7). If there are any missing values, compute them using Equation (21).

  • (2)

    Calculate the monitoring statistics and using Equations (3) and (4).

  • (3)

    If the values of the monitoring statistics are below their corresponding thresholds, the process status is normal. The updating procedure continues as follows:

    • (i)

      Update and using Equations (5) and (6).

    • (ii)

      Calculate the updated eigenpairs and using Equations (8)–(13).

    • (iii)

      Calculate the number of retained PCs, , using Equation (14).

    • (iv)

      Update the monitoring statistics thresholds and using Equations (15) and (16).

    • (v)

      Return to step one.

  • (4)

    If the values of the monitoring statistics exceed their corresponding thresholds, the process status is faulty and the updating procedure should be stopped. The fault isolation procedure (CDC) needs to be run to identify the process variables responsible for the detected fault.

    • (i)

      The contribution statistics for the faulty points need to be calculated according to Equations (19) and (20).

    • (ii)

      The process variables with the largest contributions are responsible for the fault.

The complete diagram for the proposed adaptive IPCA monitoring procedure is shown in Figure 1.

Figure 1

Adaptive IPCA monitoring diagram.

Figure 1

Adaptive IPCA monitoring diagram.

SIMULATION RESULTS

Description of the water resource recovery facility

In this section, the proposed adaptive IPCA framework is applied to Benchmark Simulation No. 2 (BSM2) (Jeppsson et al. 2006; Nopens et al. 2010). BSM2 is a well-known and powerful dynamic mathematical simulation model able to simulate the physicochemical and biological phenomena in WRRF. BSM2 comprises numerous unit operations, including primary clarifier, activated sludge biological reactor, secondary clarifier, thickener, anaerobic digester, dewatering unit, and storage tank. Five reactors are included in the activated sludge process: two anoxic reactors (used for nitrification) with a total volume of 3,000 m3 and three aerobic reactors (used for predenitrification) with a total volume of 9,000 m3. The plant capacity is 20,648 m3 d−1 with an average influent dry weather flow rate of 592 mg L−1 of average biodegradable COD. Activated Sludge Model No. 1 (ASM1) and the Anaerobic Digestion Model (ADM1) were used to describe the biological phenomena that take place in the activated sludge and AD reactor, respectively. The influent characteristics consist of a 609-day dynamic influent data file (with sampling frequency equal to a data point every 15 min) that includes rainfall and seasonal temperature variations over the year (Jeppsson et al. 2006; Nopens et al. 2010).

Fault detection and isolation

BSM2 simulation was modified to obtain sensor data from various parts of the process by simulating different types of fault. By performing this modification, 320 data measurements (16 state variables × 20 measurement points) can be obtained. However, measuring all these variables is not common practice in a real WRRF. In this paper, therefore, we considered only realistic and commonly available sensor measurements. Twenty-eight process measurements obtained from various parts of BSM2 simulation (Figure 2) were recorded every 15 min as the inputs for the proposed monitoring framework. All these variables were corrupted with white noise to simulate real sensor measurements.

Figure 2

Schematic diagram of BSM2 and locations of the measured variables.

Figure 2

Schematic diagram of BSM2 and locations of the measured variables.

To build the offline adaptive IPCA process monitoring framework, 21 days of data measurements (days 435–456) were recorded under normal operating conditions. We chose this period because there were no rain or storm events on those days. The measurements from days 457 to 530 were used to test the performance of the IPCA during time-varying characteristics and abnormal behavior. On those days, there were three storm events, which we will discuss in the next section. The main sources of time-variant behavior in this simulation are drift and abrupt changes in the total suspended solid (TSS) concentration of the reject flow of the dewatering process. These changes in the TSS concentration of the dewatering unit are due to seasonal and diurnal patterns. The number of retained PCs, , was estimated recursively using the CPV method in such a way that the retained PCs explained almost 99% of total variance. Several varieties of faults in the sensors, process variables and process parameters were simulated according to Table 1. Each fault began on day 470 and continued until day 530. The thresholds confidence limit for all charts was considered equal to 99%.

Table 1

Description of the simulated faults

Fault#DescriptionFault type
Sensor fault Sensor no. 7 Drift fault (0.1 g·m−3·d−1
Change in inorganic nitrogen of anaerobic digester Sin Added 0.2 kmol·m−3 to its normal value 
Secondary clarifier parameter  Decreased by 50% 
Step change in bioreactor parameters  Decreased from 0.5 to 0.1 d−1 
Fault#DescriptionFault type
Sensor fault Sensor no. 7 Drift fault (0.1 g·m−3·d−1
Change in inorganic nitrogen of anaerobic digester Sin Added 0.2 kmol·m−3 to its normal value 
Secondary clarifier parameter  Decreased by 50% 
Step change in bioreactor parameters  Decreased from 0.5 to 0.1 d−1 

Normal operation of the WRRF

First, we simulated the normal operation of the process (i.e. with no fault added) to show that conventional PCA is unable to deal with the time-variant characteristics of WRRF. The monitoring statistics obtained by conventional PCA and IPCA are shown in Figure 3.

Figure 3

Fault detection results of PCA and IPCA for normal operation (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot in the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Figure 3

Fault detection results of PCA and IPCA for normal operation (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot in the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Before we discuss the monitoring results, note that the three severe peaks in this figure for both PCA and IPCA represent storm events (around days 490, 505 and 523) during normal operation of the WRRF. These storm events, which exist in all the fault detections studied in this paper, are classified as faults. As Figure 3 shows, the statistics for conventional PCA exceeded their thresholds for several normal operational points in addition to these storm events. In other words, these results indicate the excessive rate of false alarms, which illustrates a major limitation of using conventional PCA for time-varying processes such as WRRF. What conventional PCA captures in this case is only the dynamic and statistical characteristics of the process supplied by the training data set. However, in time-varying processes, these dynamic or statistical characteristics change over time and must therefore be updated rather than considered constant. Unlike PCA (which produces an excessive rate of false alarms), IPCA produces a very low rate of false alarms. Apart from storm events, which are correctly labeled as faults, data points mislabeled by the T2 and SPE statistics of IPCA are very few (shown in red). Also, since the correlation for T2 did not change, the threshold for T2 remained constant. On the other hand, due to adaptation, the threshold for SPE varied more although its range remained quite low, i.e. between 0.115 and 0.121 (see Figure 3).

Drift fault in the dissolved oxygen sensor

The first fault, shown in Table 1, is simulated by applying a drift in the oxygen sensor of the fourth reactor at day 470. This sensor is used in combination with a proportional-integral (PI) controller to manipulate the concentration of dissolved oxygen in the third, fourth, and fifth reactors. The size of the drift was 0.1 g·m−3·d−1. This was intentionally small in order to evaluate the adaptive behavior of IPCA during such a fault. The monitoring statistics and results of diagnosis for IPCA are shown in Figure 4. As we can see, both monitoring statistics detected the fault, though there were some delays. These delays in detection were due to the fact that, as the drift starts, the PI controller tries to compensate for them and brings the dissolved oxygen concentration back to the setpoint by decreasing the blower speed. The fault therefore cannot be detected directly from the faulty sensor at the initial stage of its occurrence. The impact of the PI controller action is more significant on dissolved oxygen in the third and fifth reactors at the initial stage of the fault occurrence.

Figure 4

Fault detection and diagnosis results of IPCA for drift fault in the dissolved oxygen sensor (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Figure 4

Fault detection and diagnosis results of IPCA for drift fault in the dissolved oxygen sensor (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

The SPE statistic therefore begins to violate the threshold around day 475. This is due to the change in the dissolved oxygen concentrations in the third and fifth reactors. However, it is difficult to allocate the drift because it is very small. The T2 statistic begins to violate the threshold significantly around day 483, which indicates that the PI controller reduces the blower speed to its minimum. At this point, most of the variables begin to deviate and the fault becomes more visible and easier to detect. The contribution chart for T2 shows that the ammonia, H2, TSS, and O2 sensors (numbers 16, 25, 17, and 11, respectively; see Figure 2 for the type and location of sensors) contribute the most. Of these sensors, sensor 11, which represents the dissolved oxygen in the fifth reactor, is isolated correctly. However, the other sensors, such as the concentration of ammonia in the effluent stream (number 16), may provide indirect clues as to the root of the fault. As the concentration of dissolved oxygen decreases due to the action of the PI controller, the nitrification rate decreases, so the concentration of ammonia increases. The most contributing sensors for the SPE statistic are numbers 6 and 7, which represent the concentration of dissolved oxygen in the third and fourth reactors, respectively. Sensor 6 contributes more than faulty sensor number 7, which shows that the PI controller has more impact on the concentration of dissolved oxygen in the third reactor at the initial stage of the fault.

Step change in inorganic nitrogen in the anaerobic digester

The nitrogen level in the anaerobic digestion (AD) process is critical because of its inhibitory impact on microbial activity and needs to be monitored carefully. Fault number 2 (Table 1) is simulated by inducing a step change at day 470 equal to +0.2 kmol·m−3 in the AD inorganic nitrogen level. The monitoring statistics contribution charts are shown in Figure 5.

Figure 5

Fault detection and diagnosis results of IPCA for a step change in AD inorganic nitrogen (dashed blue dashed: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Figure 5

Fault detection and diagnosis results of IPCA for a step change in AD inorganic nitrogen (dashed blue dashed: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

As we can see, both monitoring statistics detect the fault, though with some delays. The delay is shorter in T2 than in SPE. These delays are due to the small size of the faults and their lesser impact at the initial stage of their occurrence. Figure 5 shows that sensor 20 has the largest contribution to the T2 statistic. Sensor 20 is the gas flow of AD, which is correctly isolated. Due to the inhibition phenomena caused by the increase in inorganic nitrogen, the activity of microbial communities decreases, which accounts for the reduction in gas flow. The contribution chart for SPE could not isolate the fault correctly.

Step change in the settling velocity of the secondary clarifier

To simulate fault number 3 (Table 1), the double exponential settling velocity function () of the second clarifier was reduced by 50%. This fault can be classified as a change in the process parameters. Figure 6 shows the detection performance and contribution charts for the T2 and SPE statistics.

Figure 6

Fault detection and diagnosis results of IPCA for a step-change in the settling velocity of the secondary clarifier (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Figure 6

Fault detection and diagnosis results of IPCA for a step-change in the settling velocity of the secondary clarifier (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Figure 6 shows that both statistics can detect the fault instantly after its occurrence. However, SPE has a stronger detection than T2. Clearly, both statistics provide accurate isolation. Sensors 13 (carbon-oxygen demand (COD)) and 14 (TSS) contribute the most to T2 and SPE, respectively. Since there is no sensor to measure settling velocity, COD and TSS are considered the most relevant sensors (these variables are directly related to the settling velocity). Note that the smaller magnitude of this fault could still be detected, though the detection rate may be lower due to its lesser impact on the process.

Step change in the bioreactor parameters

One of the most crucial parameters in WRRF is the specific growth rate for the autotrophs (), which determines the speed of ammonia conversion into nitrite (nitrification rate). If the inhibition (due to toxicity or changes in pH, etc.) occurs in the activated sludge reactors, the nitrification rate is altered due to the lower microbial activity. To simulate fault number 4 (Table 1), the value of was changed from 0.5 to 0.1 d−1. The monitoring statistics and contribution charts are shown in Figure 7.

Figure 7

Fault detection and diagnosis results of IPCA for a step change in the specific growth rate of the autotrophs (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

Figure 7

Fault detection and diagnosis results of IPCA for a step change in the specific growth rate of the autotrophs (dashed blue line: 99% confidence limit; solid vertical blue line: onset of the fault). The small plot inside the IPCA chart shows the adaptive threshold for SPE. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/wst.2020.368.

This figure shows that the T2 statistic begins to violate the threshold earlier than the SPE statistic. Although T2 begins violating the threshold in the initial stage of the fault occurrence, it took almost 30 days to be completely above the threshold. As the nature of this fault is very similar to the drift fault, the detection delay for both methods is very high due to the lesser impact of the faults at the initial stage of their occurrence. The variable contribution chart shows that sensor 10 (the concentration of ammonia in reactor 5) has clearly contributed the most to the deviation in the T2 statistic. As the nitrification rate decreases, the concentration of ammonia increases. The contribution chart of the SPE statistic provides no information about the root cause of the fault.

Storm events

As we mentioned earlier, there are three storm events in the simulated data set. These events are visible in Figure 3 around days 490, 505, and 523. Figure 3 also shows that both the T2 and SPE statistics were able to detect these events accurately. Moreover, as the storm events end, the statistical indices immediately go back to their normal values, which indicates the robustness of the method during such a fault. Figure 8 shows the variables responsible for the deviation in the T2 and SPE statistics during the storm events. The sensors' contributions to these three storm events are quite similar. As well as the increasing influent flowrate, many other variables are influenced by storm events; therefore, it is difficult to isolate the fault just by looking at the contribution charts. The trends in all correlated variables therefore need to be examined by experienced process operators. The contribution chart shows that sensors 13 and 19 (effluent COD and thickener underflow flow rate) contribute the most to the deviation in T2. During the storm events, the COD also decreases due to the dilution effect caused by the increase in plant influent flow.

Figure 8

Sensors' contribution to T2 and SPE statistics at different days of operation under storm condition.

Figure 8

Sensors' contribution to T2 and SPE statistics at different days of operation under storm condition.

The thickener underflow flow rate also increases as a result of the increase in plant influent flow. Sensor 9 (inlet flow rate of the second clarifier) contributes the most to SPE deviation, which is directly related to the plant influent flow rate. Therefore, as the plant influent flow rate increases due to the storm, the thickener underflow flowrate also increases.

FAULT DETECTION WITH MISSING VALUES

Missing values in the input data vector of the monitoring framework is a complex and challenging issue. However, in practice, missing values due to sensor failure or maintenance activities are common. A robust monitoring framework must therefore be designed to handle these missing values. In this study, missing values are imputed using EBLUP. The advantage of this method is that these values can be introduced as they appear during the fault detection procedure, which is useful for realtime applications. To assess the accuracy of our proposed method in the presence of missing sensor signals, several measurements were deleted randomly from the data set. Figure 9 shows the pattern of missing values from day 457 to day 488 during normal operation of the WRRF. This period was chosen to avoid the storm events during the normal running of the plant. Sampled data are shown in a continuous grey/black color scheme (the darker the color, the higher the value of the sensor), while missing values are shown in red. As we can see, the pattern of the missing values is extremely intricate since many sensor signals can become unavailable at the same time, which is challenging for any fault detection and monitoring framework.

Figure 9

Pattern of missing values during the normal running of the WRRF (from day 457 to day 488).

Figure 9

Pattern of missing values during the normal running of the WRRF (from day 457 to day 488).

Table 2 shows the false alarm rate (FAR%) and missed detection rate (MDR%) of the monitoring framework for the complete and incomplete data sets during normal and faulty operation of the WRRF. These rates can be estimated as follows:
formula
(22)
formula
(23)
Table 2

Fault detection performance for complete and uncompleted data set

0%a
2%a
6%a
10%a
Fault#T2SPET2SPET2SPET2SPE
FAR % normalb 0.45 0.26 0.59 3.87 0.49 14.05 0.52 26.49 
MDR% – – – – – – – – 
FAR% 0.07 0.22 3.20 0.15 15.70 0.15 30.10 
MDR% 18.53 14.70 18.4 12.99 18.5 10.33 18.58 8.80 
FAR% 0.07 0.22 3.20 0.15 15.70 0.15 30.10 
MDR% 8.56 34.26 8.57 29.95 8.44 24.07 8.68 19.86 
FAR% 0.07 0.22 3.20 0.15 15.7 0.15 30.10 
MDR% 7.91 0.06 8.45 0.08 9.72 0.08 11.41 0.06 
FAR% 0.07 0.22 3.20 0.15 15.70 0.15 30.10 
MDR% 37.86 55.99 37.84 55.40 38.10 54.16 38.34 53.90 
0%a
2%a
6%a
10%a
Fault#T2SPET2SPET2SPET2SPE
FAR % normalb 0.45 0.26 0.59 3.87 0.49 14.05 0.52 26.49 
MDR% – – – – – – – – 
FAR% 0.07 0.22 3.20 0.15 15.70 0.15 30.10 
MDR% 18.53 14.70 18.4 12.99 18.5 10.33 18.58 8.80 
FAR% 0.07 0.22 3.20 0.15 15.70 0.15 30.10 
MDR% 8.56 34.26 8.57 29.95 8.44 24.07 8.68 19.86 
FAR% 0.07 0.22 3.20 0.15 15.7 0.15 30.10 
MDR% 7.91 0.06 8.45 0.08 9.72 0.08 11.41 0.06 
FAR% 0.07 0.22 3.20 0.15 15.70 0.15 30.10 
MDR% 37.86 55.99 37.84 55.40 38.10 54.16 38.34 53.90 

aPercent of missing values; bfrom day 457 to day 488 during normal operation; MDR%, missed detection rate; FAR%, false alarm rate.

We can see that the FAR% for the T2 statistic is not affected by the missing values. On the other hand, the FAR% for the SPE statistic increases significantly in the presence of missing values. The most significant changes to the FAR% occurred with the largest level of missingness (10%). This may be because the imputation of missing values has a more significant impact on the residual subspace compared to the mean and covariance in PCs. Since the variation in the residual subspace is measured by SPE, it is more sensitive to the amount of imputed missing values.

CONCLUSION

We have proposed a novel fault detection and isolating framework based on incremental PCA (IPCA) and complete decomposition contribution (CDC). The advantage of IPCA over other adaptive PCA approaches is that it does not require all eigenvalues to be computed. This characteristic speeds up computation time. The proposed adaptive IPCA monitoring framework consists of (i) adaptive estimation of the mean and variance of data, (ii) adaptive estimation of selected PCs and threshold, (iii) estimation of eigenvalue decomposition by incrementing the new data to the PCA, and (iv) adaptive estimation of the contribution charts for T2 and SPE. This framework was applied to BSM2. Our results, demonstrated with simulated faulty scenarios, show that IPCA is able to adapt time-varying process behavior while detecting and isolating faults. Although there are some delays in detecting the faults, the performance is acceptable since the faults under study were small at the initial stage of their occurrence. The contribution charts showed that this method was able to correctly isolate the faults. Sometimes, however, due to a lack of sensor measurements, it was not possible to isolate the sensor directly and only the impact of the fault on the other measurements could be detected. Our proposed framework can also handle in realtime highly intricate patterns of missing values (e.g. more than one sensor signal failure at the same time) by applying the EBLUP method. Our study of fault detection performance with missing values shows that SPE is more sensitive to the imputation of missing values than T2. This is due to the significant impact of the imputed values on the residual subspace.

ACKNOWLEDGEMENTS

We gratefully thank the Spanish Ministry of Economy and Competitiveness for providing financial support through grants CTM2015-67970-P and for funding the doctoral scholarship (BES-2012-059675). Some authors are recognized by the Comissioner for Universities and Research of the DIUE (Department of Innovation, Universities and Business) of the autonomous government of Catalonia (2017 SGR 396) and supported by the Universitat Rovira i Virgili (2017PFR-URV-B2-33) and Fundació Bancària ‘la Caixa’ (2017ARES-06). The authors are also grateful to Dr Ulf Jeppsson of Lund University, Sweden, for kindly providing the BSM2 Matlab code.

DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

REFERENCES

REFERENCES
Alcala
C. F.
Qin
S. J.
2009
Reconstruction-based contribution for process monitoring
.
Automatica
45
,
1593
1600
.
www.elsevier.com/locate/automatica (Accessed April 10, 2020)
.
Arora
R.
Cotter
A.
Livescu
K.
Srebro
N.
2012
Stochastic optimization for PCA and PLS. 2012 50th Annual Allerton Conference on Communication, Control, and Computing
.
Allerton
2012
,
861
868
.
Artač
M.
Jogan
M.
Leonardis
A.
2002
Incremental PCA for online visual learning and recognition
. In
Proceedings – International Conference on Pattern Recognition
, pp.
781
784
.
Brand
M.
2002
Incremental singular value decomposition of uncertain data with missing values. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2350, 707–720
.
Cardot
H.
Degras
D.
2018
Online principal component analysis in high dimension: which algorithm to choose?
International Statistical Review
86
(
1
),
29
50
.
Choi
S. W.
Martin
E. B.
Morris
A. J.
Lee
I. B.
2006
Adaptive multivariate statistical process control for monitoring time-varying processes
.
Industrial and Engineering Chemistry Research
45
(
9
),
3108
3118
.
Elshenawy
L. M.
Awad
H. A.
2012
Recursive fault detection and isolation approaches of time-varying processes
.
Industrial and Engineering Chemistry Research
51
(
29
),
9812
9824
.
Elshenawy
L. M.
Mahmoud
T. A.
2018
Fault diagnosis of time-varying processes using modified reconstruction-based contributions
.
Journal of Process Control
70
,
12
23
.
https://doi.org/10.1016/j.jprocont.2018.07.017
.
Elshenawy
L. M.
Yin
S.
Naik
A. S.
Ding
S. X.
2010
Efficient recursive principal component analysis algorithms for process monitoring
.
Industrial and Engineering Chemistry Research
49
(
1
),
252
259
.
Garcia-Alvarez
D.
Fuente
M. J.
Vega
P.
Sainz
G.
2009
Fault detection and diagnosis using multivariate statistical techniques in a wastewater treatment plant
.
IFAC Proceedings Volumes
42
(
11
),
952
957
.
Haimi
H.
Mulas
M.
Corona
F.
Marsili-Libelli
S.
Lindell
P.
Heinonen
M.
Vahala
R.
2016
Adaptive data-derived anomaly detection in the activated sludge process of a large-scale wastewater treatment plant
.
Engineering Applications of Artificial Intelligence
52
,
65
80
.
http://dx.doi.org/10.1016/j.engappai.2016.02.003
.
Hall
P. M.
Marshall
D.
Martin
R. R.
1998
Incremental eigenanalysis for classification. In: Proceedings of the British Machine Conference 1998. BMVA Press, September 1998, pp. 29.1-29.10. Available from: http://www.bmva.org/bmvc/1998/papers/d186/h186.htm (accessed 5 May 2020)
.
Hu
Z.
Chen
Z.
Hua
C.
Gui
W.
Yang
C.
Ding
S. X.
2012
A simplified recursive dynamic PCA based monitoring scheme for imperial smelting process
.
International Journal of Innovative Computing, Information and Control
8
(
4
),
2551
2561
.
Jeppsson
U.
Rosen
C.
Alex
J.
Copp
J.
Gernaey
K. V.
Pons
M.-N.
Vanrolleghem
P. A.
2006
Towards a benchmark simulation model for plant-wide control strategy performance evaluation of WWTPs
.
Water Science and Technology
53
(
1
),
287
295
. .
Jun
B. H.
Park
J. H.
Lee
S. I.
Chun
M. G.
2006
Kernel PCA based faults diagnosis for wastewater treatment system
. In
Lecture Notes in Computer Science (including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
.
Springer Verlag
,
Berlin, Germany
, pp.
426
431
.
Kazemi
P.
Steyer
J. P.
Bengoa
C.
Font
J.
Giralt
J.
2020
Robust data-driven soft sensors for online monitoring of volatile fatty acids in anaerobic digestion processes
.
Processes
8
(
1
),
67
.
Lee
J. M.
Yoo
C. K.
Choi
S. W.
Vanrolleghem
P. A.
Lee
I. B.
2004
Nonlinear process monitoring using kernel principal component analysis
.
Chemical Engineering Science
59
(
1
),
223
234
.
Lee
C.
Choi
S. W.
Lee
I. B.
2006
Sensor fault diagnosis in a wastewater treatment process
.
Water Science and Technology
53
(
1
),
251
257
. .
Li
W.
Yue
H. H.
Valle-Cervantes
S.
Qin
S. J.
2000
Recursive PCA for adaptive process monitoring
.
Journal of Process Control
10
(
5
),
471
486
.
Mina
J.
Verde
C.
2006
Fault detection for large scale systems using dynamic principal components analysis with adaptation
. In
IFAC Proceedings Volumes (IFAC-PapersOnline)
. pp.
220
225
.
Newhart
K. B.
Holloway
R. W.
Hering
A. S.
Cath
T. Y.
2019
Data-driven performance analyses of wastewater treatment plants: a review
.
Water Research
157
,
498
513
.
https://doi.org/10.1016/j.watres.2019.03.030
.
Nomikos
P.
MacGregor
J. F.
1995
Multivariate SPC charts for monitoring batch processes
.
Technometrics
37
(
1
),
41
59
.
Nopens
I.
Benedetti
L.
Jeppsson
U.
Pons
M.-N.
Alex
J.
Copp
J. B.
Gernaey
K. V.
Rosen
C.
Steyer
J.-P.
Vanrolleghem
P. A.
2010
Benchmark simulation model No. 2: finalisation of plant layout and default control strategy
.
Water Science and Technology
62
(
9
),
1967
1974
.
Available from: http://www.ncbi.nlm.nih.gov/pubmed/21045320 (accessed 22 August 2018)
.
Rosen
C.
Lennox
J. A.
2001
Multivariate and multiscale monitoring of wastewater treatment operation
.
Water Research
35
(
14
),
3402
3410
.
Sanchez-Fernández
A.
Fuente
M. J.
Sainz-Palmero
G. I.
2015
Fault detection in wastewater treatment plants using distributed PCA methods
. In:
IEEE International Conference on Emerging Technologies and Factory Automation
,
2015 October 1
.
ETFA
,
Luxembourg
.
Sánchez-Fernández
A.
Baldán
F. J.
Sainz-Palmero
G. I.
Benítez
J. M.
Fuente
M. J.
2018
Fault detection based on time series modeling and multivariate statistical process control
.
Chemometrics and Intelligent Laboratory Systems
182
(
July
),
57
69
.
Shang
L.
Liu
J.
Turksoy
K.
Min Shao
Q.
Cinar
A.
2015
Stable recursive canonical variate state space modeling for time-varying processes
.
Control Engineering Practice
36
,
113
119
.
Wang
L.
Shi
H.
2010
Multivariate statistical process monitoring using an improved independent component analysis
.
Chemical Engineering Research and Design
88
(
4
),
403
414
.
Xiao
H.
Huang
D.
Pan
Y.
Liu
Y.
Song
K.
2017
Fault diagnosis and prognosis of wastewater processes with incomplete data by the auto-associative neural networks and ARMA model
.
Chemometrics and Intelligent Laboratory Systems
161
,
96
107
. .
Zhang
Z.
Dong
F.
2014
Fault detection and diagnosis for missing data systems with a three time-slice dynamic Bayesian network approach
.
Chemometrics and Intelligent Laboratory Systems
138
,
30
40
.