## Abstract

Because of the static nature of conventional principal component analysis (PCA), natural process variations may be interpreted as faults when it is applied to processes with time-varying behavior. In this paper, therefore, we propose a complete adaptive process monitoring framework based on incremental principal component analysis (IPCA). This framework updates the eigenspace by incrementing new data to the PCA at a low computational cost. Moreover, the contribution of variables is recursively provided using complete decomposition contribution (CDC). To impute missing values, the empirical best linear unbiased prediction (EBLUP) method is incorporated into this framework. The effectiveness of this framework is evaluated using benchmark simulation model No. 2 (BSM2). Our simulation results show the ability of the proposed approach to distinguish between time-varying behavior and faulty events while correctly isolating the sensor faults even when these faults are relatively small.

## HIGHLIGHTS

A novel framework is proposed for adaptive fault detection and diagnosis based on IPCA.

The validity and applicability of the proposed framework are examined using benchmark simulation model No.2 (BSM2).

The realtime imputation of intricate patterns of missing values is performed using EBLUP.

The proposed framework can detect small-magnitude drift faults while tracking time-varying behavior.

### Graphical Abstract

## ABBREVIATIONS

- X
Data matrix

- T
Score matrix

- P
Loading matrix

- E
Residual matrix

- S
Covariance matrix

- n
Data matrix row number

- m
Data matrix column number

Eigenvector matrix for k

^{th}sampleEigenvalues matrix for k

^{th}samplej

^{th}eigenvaluesNumber of principal components to be retained for k

^{th}sampleHotelling's T-squared for k

^{th}sampleSquare prediction error for k

^{th}samplek

^{th}sampleStandardized k

^{th}sampleMean of k

^{th}sample*f*Forgetting factor

Standard deviation for k

^{th}sampleProjection of k

^{th}sample using the current eigenvectorResidual vector for k

^{th}sampleNormalized residual vector for k

^{th}sampleEuclidean norm of matrix

Rotation matrix for k

^{th}sampleA matrix that decomposed to obtain

Cumulative percent variance with retained principal components

Chi-distribution with degrees of freedom and a confidence limit

T

^{2}threshold for k^{th}sampleSPE threshold for k

^{th}sampleNormal deviate corresponding to the (1 −

*α*) percentileComplete decomposition contribution for k

^{th}sample using Hotelling's T-squaredComplete decomposition contribution for k

^{th}sample using SPEVector of k

^{th}sample with observed valuesVector of k

^{th}sample with missing valuesMean of observed values of k − 1th sample

Mean of missing values of k − 1th sample

Eigenvector matrix of observed values of k − 1th sample

Eigenvector matrix of missing values of k − 1th sample

Imputed missing values of k

^{th}sampleConditional expectation

- S
_{in}BSM2 inorganic nitrogen concentration in anaerobic digester

BSM2 secondary clarifier exponential settling velocity function

BSM2 growth rate for the autotrophs

- PCA
Principal component analysis

- IPCA
Incremental principal component analysis

- EBLUP
Empirical best linear unbiased prediction

- BSM2
Benchmark simulation model No.2

- WRRF
Water resource recovery facilities

- WWTP
Wastewater treatment plant

- MSPCs
Multivariate statistical process controls

- RPCA
Recursive principal component analysis

- EWMA
Exponential weighted moving average

- FOP
First-order perturbation

- PCs
Principal components

- TSS
Total suspended solid

- PI
Proportional integral

- AD
Anaerobic digestion

- COD
Carbon-oxygen demand

- FAR
False alarm rate

- MDR
Missed detection rate

## INTRODUCTION

Because of the complexity of modern industrial processes, the demand has increased to implement fault detection and a diagnosis framework for those processes. Water resource recovery facilities (WRRF), formerly called wastewater treatment plants (WWTP), are no exception (Sánchez-Fernández *et al.* 2018). Abnormal or faulty events are those that occur when the process indicates a deviation from its normal behavior. In WRRF, abnormal events include changes in influent quality (e.g. rainfall, industrial discharge), outbreaks of microorganisms (e.g. filamentous bacteria, algae) that impact treatment quality, irregularities or damage to treatment units (e.g. membranes, clarifiers), mechanical failures (e.g. pumps, air blowers) and sensor failure (e.g. drift, bias, electrical interference) (Newhart *et al.* 2019). All these faults can undermine process performance, so they must be detected and diagnosed as soon as they occur. Fault detection techniques are generally divided into three main groups: model-based methods, knowledge-based methods, and data-driven methods. To implement techniques in the first two groups, an in-depth knowledge of the system's behavior is required. However, obtaining in-depth knowledge of reactions or pathways phenomena (especially for WRRFs, which are very complicated processes) is both time-consuming and challenging.

Data-driven methods, on the other hand, rely only on historical and online data and do not require in-depth prior knowledge of the process (Kazemi *et al.* 2020). Thanks to the advanced process control framework, WRRFs collect large amounts of data using measurements from numerous online sensors (Wang & Shi 2010). Data-driven techniques have therefore received significant attention in process monitoring over the last few years. Multivariate statistical process controls (MSPCs) are a subset of data-driven methods used to analyze the performance of the processes and identify the parameters that govern them. One of the most widely used techniques for MSPC is principal component analysis (PCA) (Newhart *et al.* 2019). Various types of PCA methods, such as conventional PCA (Garcia-Alvarez *et al.* 2009; Sanchez-Fernández *et al.* 2015; Xiao *et al.* 2017), kernel PCA (Lee *et al.* 2004; Jun *et al.* 2006; Xiao *et al.* 2017) and dynamic PCA (Lee *et al.* 2006; Mina & Verde 2006) have been successfully used to monitor processes and detect faults in time-invariant processes. However, because of the complexity of the biological reactions and non-stationary plant influent, WRRFs have time-varying characteristics. Using conventional PCA to monitor such processes may therefore lead to excessive rates of false alarms and missing detections. An adaptive process monitoring framework is therefore needed. Although adaptive PCAs are more suitable for non-stationary processes, due to high computational costs, not every adaptive approach is applicable for realtime fault detection. Several adaptive approaches include recursive PCA (RPCA), moving window PCA and exponential weighted moving average (EWMA) have recently been used for fault detection (Li *et al.* 2000; Rosen & Lennox 2001; Choi *et al.* 2006; Elshenawy *et al.* 2010; Shang *et al.* 2015). Haimi *et al.* implemented a moving window PCA technique to detect and isolate the faults in WRRF (Haimi *et al.* 2016). These authors took into account the variable length of the historical data in the model's construction to prevent sub-optimal monitoring performance. Li *et al.* applied an RPCA monitoring framework that recursively updates the correlation matrix. These authors used two approaches, namely rank-one modification and Lanczos tridiagonalization, to compute the eigenvalues of the updated correlation matrix (Li *et al.* 2000). Recently, Elshenawy *et al.* proposed RPCA based on first-order perturbation analysis (FOP), which is a rank-one update of the eigenvalues and their corresponding eigenvectors from a sample covariance matrix (Elshenawy *et al.* 2010; Elshenawy & Mahmoud 2018). These authors stated that the computational cost of their approach is lower than that of previously used methods such as RPCA using Lanczos tridiagonalization and MWPCA.

In this paper we propose a new low-computational-cost adaptive fault detection framework for WRRF that uses incremental PCA (IPCA). This approach is based on the IPCA proposed by Hall *et al.*, which was motivated by developments in the field of computer vision (Hall *et al.* 1998; Brand 2002; Arora *et al.* 2012). The main benefit of IPCA over other adaptive PCA approaches is that it does not require all eigenvalues to be computed since only the largest ones are needed. This considerably accelerates computation time. Cardot and Degras (Cardot & Degras 2018) compared the computation time of adaptive PCA methods such as perturbation and stochastic approximation and concluded that IPCA outperforms other methods, particularly when the data have many features (sensor measurements). These authors also studied the accuracy of adaptive methods in determining the eigenspace. Their results revealed that IPCA is more accurate than methods such as RPCA, which is based on FOP theory. The IPCA algorithms offer an excellent compromise between statistical accuracy and computational speed. Another major drawback of PCA approaches are missing data during realtime monitoring. In practice, due to sudden mechanical breakdown, maintenance, hardware sensor failure or malfunction of the data acquisition system, etc., some sensor signals may become unavailable (Zhang & Dong 2014). To meet the requirements of realtime fault detection and diagnosis, therefore, the problem of missing data needs to be considered. To handle missing data, the empirical best linear unbiased prediction (EBLUP) approach is incorporated into the proposed IPCA method (Cardot & Degras 2018).

Once a fault is detected, it is essential to find out its primary source. The most common fault isolation method are contribution charts (Nomikos & MacGregor 1995). With this method, the variables that contribute to the T^{2} (Hotelling's T-squared) and SPE (square prediction error) statistics are calculated and the variable with the highest contribution is considered the primary cause of the fault. Because of the recursive nature of IPCA, the contribution plots of T^{2} and SPE are updated in accordance with the adaptive eigendecomposition. To test the proposed framework, several types of process parameters and sensor faults were simulated using Benchmark Simulation Model No. 2 (BSM2) (Jeppsson *et al.* 2006). This paper is organized as follows: first we provide a brief review of the theoretical background behind the conventional PCA method; then we discuss the IPCA algorithm and its realtime application framework for fault detection and diagnosis; and finally we analyze the performance of the monitoring framework on the complete and incomplete (i.e. with missing values) data sets using BSM2.

## CONVENTIONAL PCA

*et al.*2017). Let a data matrix ∈ ℜ

*be composed of*

^{n×m}*n*samples and

*m*sensors. The

*X*matrix is scaled to zero mean and unit variance (Z-score normalization) in order to avoid the scaling problem. The PCA model can be defined as:where T ∈ ℜ

*, P ∈ ℜ*

^{n×m}*, and E ∈ ℜ*

^{m×m}*are score (PCs), loading, and residual matrices, respectively. The scores are generated by projecting*

^{n×m}*X*onto to the loading matrix. To solve the PCA model the covariance matrix, S ∈ ℜ

*, of should be decomposed as follows:*

^{m×m}*P*are eigenvectors of

*S,*and ∈ ℜ

^{m×m}is a diagonal matrix containing the eigenvalues arranged in descending order. Usually, the number of (

*β*) principal components, which sufficiently explain the variability of the process, are selected. Therefore, P ∈ ℜ

^{m×β}and its corresponding eigenvalues are ∈ ℜ

^{β}^{×β}. For fault detection, two statistics (mainly T² and SPE) are normally used. Variations in the mean and covariance in PCs are measured using T² while the variation in the residual subspace is measured using SPE (Elshenawy & Awad 2012). These are given by:

Here, as mentioned earlier, is the number of PCs to retain. To perform fault detection, suitable thresholds for these two statistical indices must be obtained. This enables small faults to be detected with a minimum rate of false alarms. Process operation is normal if these statistical indices remain below these thresholds. The procedure for estimating the thresholds of T² and SPE will be discussed later.

## INCREMENTAL PCA

*et al.*2012). The key idea behind IPCA is to update the eigenvector by incrementing the new data to the PCA. To update process monitoring, both the mean m

_{k}∈ ℜ

^{1×m}and the standard deviation

*δ*

_{k}∈ ℜ

^{1×m}need to be re-estimated whenever a new sample x

_{k}∈ ℜ

^{1×m}data vector becomes available (Hall

*et al.*1998; Artač

*et al.*2002; Cardot & Degras 2018). These are given by:where is a forgetting factor. Let us assume that eigenvector has already been derived using the data from. Also, is a diagonal matrix of the corresponding eigenvalues , and and are the mean and variance vector of a new sample, respectively. Each sample must be standardized according to:where is the standardized sample. Next, the projection of the new sample is computed using the current eigenvector :

### Estimating the number of PCs

*et al.*2000). In this study we used cumulative percent variance (CPV), which can be obtained from:where is a number of selected PCs. The number of PCs is chosen when CPV reaches a predetermined limit, e.g. 95%.

### Process monitoring statistics thresholds

^{2}and SPE) change over time. For realtime monitoring, therefore, it is necessary to adapt these limits (Li

*et al.*2000). The T

^{2}threshold is approximated using the chi-distribution with degrees of freedom and a confidence limit :and the threshold for the SPE statistic is given by:where is the normal deviate corresponding to the (1-

*α*) percentile, and

The fault is detected once the T^{2} and SPE statistics exceed their respective adaptive thresholds, and .

### Fault isolation

^{2}and SPE statistics is defined as:where and are the vectors whose elements are the variables that contribute to the T

^{2}and SPE statistics for each column (sensor) of a sample, respectively.

### Missing data imputation

## ADAPTIVE FAULT DETECTION AND ISOLATION

In this section we discuss the complete structure of the adaptive fault detection and isolation scheme. The proposed adaptive fault detection and isolation framework can be divided into two stages: (i) offline training and (ii) online monitoring. The following steps summarize the overall procedure.

### Offline training

- (1)
Collect the training data matrix ∈ ℜ

and normalize it to zero mean and unit variance^{n×m} - (2)
Use Equation (2) to compute

*S*and its corresponding eigenpairs , . - (3)
Calculate the number of retaining PCs () by applying Equation (14).

- (4)
Calculate the thresholds of the fault detection statistics and using Equations (15) and (16).

### Online monitoring

- (1)
Collect a new data vector and normalize it to zero mean and unit variance using Equations (5)–(7). If there are any missing values, compute them using Equation (21).

- (2)
Calculate the monitoring statistics and using Equations (3) and (4).

- (3)
If the values of the monitoring statistics are below their corresponding thresholds, the process status is normal. The updating procedure continues as follows:

- (4)
If the values of the monitoring statistics exceed their corresponding thresholds, the process status is faulty and the updating procedure should be stopped. The fault isolation procedure (CDC) needs to be run to identify the process variables responsible for the detected fault.

The complete diagram for the proposed adaptive IPCA monitoring procedure is shown in Figure 1.

## SIMULATION RESULTS

### Description of the water resource recovery facility

In this section, the proposed adaptive IPCA framework is applied to Benchmark Simulation No. 2 (BSM2) (Jeppsson *et al.* 2006; Nopens *et al.* 2010). BSM2 is a well-known and powerful dynamic mathematical simulation model able to simulate the physicochemical and biological phenomena in WRRF. BSM2 comprises numerous unit operations, including primary clarifier, activated sludge biological reactor, secondary clarifier, thickener, anaerobic digester, dewatering unit, and storage tank. Five reactors are included in the activated sludge process: two anoxic reactors (used for nitrification) with a total volume of 3,000 m^{3} and three aerobic reactors (used for predenitrification) with a total volume of 9,000 m^{3}. The plant capacity is 20,648 m^{3} d^{−1} with an average influent dry weather flow rate of 592 mg L^{−1} of average biodegradable COD. Activated Sludge Model No. 1 (ASM1) and the Anaerobic Digestion Model (ADM1) were used to describe the biological phenomena that take place in the activated sludge and AD reactor, respectively. The influent characteristics consist of a 609-day dynamic influent data file (with sampling frequency equal to a data point every 15 min) that includes rainfall and seasonal temperature variations over the year (Jeppsson *et al.* 2006; Nopens *et al.* 2010).

### Fault detection and isolation

BSM2 simulation was modified to obtain sensor data from various parts of the process by simulating different types of fault. By performing this modification, 320 data measurements (16 state variables × 20 measurement points) can be obtained. However, measuring all these variables is not common practice in a real WRRF. In this paper, therefore, we considered only realistic and commonly available sensor measurements. Twenty-eight process measurements obtained from various parts of BSM2 simulation (Figure 2) were recorded every 15 min as the inputs for the proposed monitoring framework. All these variables were corrupted with white noise to simulate real sensor measurements.

To build the offline adaptive IPCA process monitoring framework, 21 days of data measurements (days 435–456) were recorded under normal operating conditions. We chose this period because there were no rain or storm events on those days. The measurements from days 457 to 530 were used to test the performance of the IPCA during time-varying characteristics and abnormal behavior. On those days, there were three storm events, which we will discuss in the next section. The main sources of time-variant behavior in this simulation are drift and abrupt changes in the total suspended solid (TSS) concentration of the reject flow of the dewatering process. These changes in the TSS concentration of the dewatering unit are due to seasonal and diurnal patterns. The number of retained PCs, , was estimated recursively using the CPV method in such a way that the retained PCs explained almost 99% of total variance. Several varieties of faults in the sensors, process variables and process parameters were simulated according to Table 1. Each fault began on day 470 and continued until day 530. The thresholds confidence limit for all charts was considered equal to 99%.

Fault# . | Description . | . | Fault type . |
---|---|---|---|

1 | Sensor fault | Sensor no. 7 | Drift fault (0.1 g·m^{−3}·d^{−1}) |

2 | Change in inorganic nitrogen of anaerobic digester | S_{in} | Added 0.2 kmol·m^{−3} to its normal value |

3 | Secondary clarifier parameter | Decreased by 50% | |

4 | Step change in bioreactor parameters | Decreased from 0.5 to 0.1 d^{−1} |

Fault# . | Description . | . | Fault type . |
---|---|---|---|

1 | Sensor fault | Sensor no. 7 | Drift fault (0.1 g·m^{−3}·d^{−1}) |

2 | Change in inorganic nitrogen of anaerobic digester | S_{in} | Added 0.2 kmol·m^{−3} to its normal value |

3 | Secondary clarifier parameter | Decreased by 50% | |

4 | Step change in bioreactor parameters | Decreased from 0.5 to 0.1 d^{−1} |

#### Normal operation of the WRRF

First, we simulated the normal operation of the process (i.e. with no fault added) to show that conventional PCA is unable to deal with the time-variant characteristics of WRRF. The monitoring statistics obtained by conventional PCA and IPCA are shown in Figure 3.

Before we discuss the monitoring results, note that the three severe peaks in this figure for both PCA and IPCA represent storm events (around days 490, 505 and 523) during normal operation of the WRRF. These storm events, which exist in all the fault detections studied in this paper, are classified as faults. As Figure 3 shows, the statistics for conventional PCA exceeded their thresholds for several normal operational points in addition to these storm events. In other words, these results indicate the excessive rate of false alarms, which illustrates a major limitation of using conventional PCA for time-varying processes such as WRRF. What conventional PCA captures in this case is only the dynamic and statistical characteristics of the process supplied by the training data set. However, in time-varying processes, these dynamic or statistical characteristics change over time and must therefore be updated rather than considered constant. Unlike PCA (which produces an excessive rate of false alarms), IPCA produces a very low rate of false alarms. Apart from storm events, which are correctly labeled as faults, data points mislabeled by the T^{2} and SPE statistics of IPCA are very few (shown in red). Also, since the correlation for T^{2} did not change, the threshold for T^{2} remained constant. On the other hand, due to adaptation, the threshold for SPE varied more although its range remained quite low, i.e. between 0.115 and 0.121 (see Figure 3).

#### Drift fault in the dissolved oxygen sensor

The first fault, shown in Table 1, is simulated by applying a drift in the oxygen sensor of the fourth reactor at day 470. This sensor is used in combination with a proportional-integral (PI) controller to manipulate the concentration of dissolved oxygen in the third, fourth, and fifth reactors. The size of the drift was 0.1 g·m^{−3}·d^{−1}. This was intentionally small in order to evaluate the adaptive behavior of IPCA during such a fault. The monitoring statistics and results of diagnosis for IPCA are shown in Figure 4. As we can see, both monitoring statistics detected the fault, though there were some delays. These delays in detection were due to the fact that, as the drift starts, the PI controller tries to compensate for them and brings the dissolved oxygen concentration back to the setpoint by decreasing the blower speed. The fault therefore cannot be detected directly from the faulty sensor at the initial stage of its occurrence. The impact of the PI controller action is more significant on dissolved oxygen in the third and fifth reactors at the initial stage of the fault occurrence.

The SPE statistic therefore begins to violate the threshold around day 475. This is due to the change in the dissolved oxygen concentrations in the third and fifth reactors. However, it is difficult to allocate the drift because it is very small. The T^{2} statistic begins to violate the threshold significantly around day 483, which indicates that the PI controller reduces the blower speed to its minimum. At this point, most of the variables begin to deviate and the fault becomes more visible and easier to detect. The contribution chart for T^{2} shows that the ammonia, H_{2,} TSS, and O_{2} sensors (numbers 16, 25, 17, and 11, respectively; see Figure 2 for the type and location of sensors) contribute the most. Of these sensors, sensor 11, which represents the dissolved oxygen in the fifth reactor, is isolated correctly. However, the other sensors, such as the concentration of ammonia in the effluent stream (number 16), may provide indirect clues as to the root of the fault. As the concentration of dissolved oxygen decreases due to the action of the PI controller, the nitrification rate decreases, so the concentration of ammonia increases. The most contributing sensors for the SPE statistic are numbers 6 and 7, which represent the concentration of dissolved oxygen in the third and fourth reactors, respectively. Sensor 6 contributes more than faulty sensor number 7, which shows that the PI controller has more impact on the concentration of dissolved oxygen in the third reactor at the initial stage of the fault.

#### Step change in inorganic nitrogen in the anaerobic digester

The nitrogen level in the anaerobic digestion (AD) process is critical because of its inhibitory impact on microbial activity and needs to be monitored carefully. Fault number 2 (Table 1) is simulated by inducing a step change at day 470 equal to +0.2 kmol·m^{−3} in the AD inorganic nitrogen level. The monitoring statistics contribution charts are shown in Figure 5.

As we can see, both monitoring statistics detect the fault, though with some delays. The delay is shorter in T^{2} than in SPE. These delays are due to the small size of the faults and their lesser impact at the initial stage of their occurrence. Figure 5 shows that sensor 20 has the largest contribution to the T^{2} statistic. Sensor 20 is the gas flow of AD, which is correctly isolated. Due to the inhibition phenomena caused by the increase in inorganic nitrogen, the activity of microbial communities decreases, which accounts for the reduction in gas flow. The contribution chart for SPE could not isolate the fault correctly.

#### Step change in the settling velocity of the secondary clarifier

To simulate fault number 3 (Table 1), the double exponential settling velocity function () of the second clarifier was reduced by 50%. This fault can be classified as a change in the process parameters. Figure 6 shows the detection performance and contribution charts for the T^{2} and SPE statistics.

Figure 6 shows that both statistics can detect the fault instantly after its occurrence. However, SPE has a stronger detection than T^{2}. Clearly, both statistics provide accurate isolation. Sensors 13 (carbon-oxygen demand (COD)) and 14 (TSS) contribute the most to T^{2} and SPE, respectively. Since there is no sensor to measure settling velocity, COD and TSS are considered the most relevant sensors (these variables are directly related to the settling velocity). Note that the smaller magnitude of this fault could still be detected, though the detection rate may be lower due to its lesser impact on the process.

#### Step change in the bioreactor parameters

One of the most crucial parameters in WRRF is the specific growth rate for the autotrophs (), which determines the speed of ammonia conversion into nitrite (nitrification rate). If the inhibition (due to toxicity or changes in pH, etc.) occurs in the activated sludge reactors, the nitrification rate is altered due to the lower microbial activity. To simulate fault number 4 (Table 1), the value of was changed from 0.5 to 0.1 d^{−1}. The monitoring statistics and contribution charts are shown in Figure 7.

This figure shows that the T^{2} statistic begins to violate the threshold earlier than the SPE statistic. Although T^{2} begins violating the threshold in the initial stage of the fault occurrence, it took almost 30 days to be completely above the threshold. As the nature of this fault is very similar to the drift fault, the detection delay for both methods is very high due to the lesser impact of the faults at the initial stage of their occurrence. The variable contribution chart shows that sensor 10 (the concentration of ammonia in reactor 5) has clearly contributed the most to the deviation in the T^{2} statistic. As the nitrification rate decreases, the concentration of ammonia increases. The contribution chart of the SPE statistic provides no information about the root cause of the fault.

#### Storm events

As we mentioned earlier, there are three storm events in the simulated data set. These events are visible in Figure 3 around days 490, 505, and 523. Figure 3 also shows that both the T^{2} and SPE statistics were able to detect these events accurately. Moreover, as the storm events end, the statistical indices immediately go back to their normal values, which indicates the robustness of the method during such a fault. Figure 8 shows the variables responsible for the deviation in the T^{2} and SPE statistics during the storm events. The sensors' contributions to these three storm events are quite similar. As well as the increasing influent flowrate, many other variables are influenced by storm events; therefore, it is difficult to isolate the fault just by looking at the contribution charts. The trends in all correlated variables therefore need to be examined by experienced process operators. The contribution chart shows that sensors 13 and 19 (effluent COD and thickener underflow flow rate) contribute the most to the deviation in T^{2}. During the storm events, the COD also decreases due to the dilution effect caused by the increase in plant influent flow.

The thickener underflow flow rate also increases as a result of the increase in plant influent flow. Sensor 9 (inlet flow rate of the second clarifier) contributes the most to SPE deviation, which is directly related to the plant influent flow rate. Therefore, as the plant influent flow rate increases due to the storm, the thickener underflow flowrate also increases.

## FAULT DETECTION WITH MISSING VALUES

Missing values in the input data vector of the monitoring framework is a complex and challenging issue. However, in practice, missing values due to sensor failure or maintenance activities are common. A robust monitoring framework must therefore be designed to handle these missing values. In this study, missing values are imputed using EBLUP. The advantage of this method is that these values can be introduced as they appear during the fault detection procedure, which is useful for realtime applications. To assess the accuracy of our proposed method in the presence of missing sensor signals, several measurements were deleted randomly from the data set. Figure 9 shows the pattern of missing values from day 457 to day 488 during normal operation of the WRRF. This period was chosen to avoid the storm events during the normal running of the plant. Sampled data are shown in a continuous grey/black color scheme (the darker the color, the higher the value of the sensor), while missing values are shown in red. As we can see, the pattern of the missing values is extremely intricate since many sensor signals can become unavailable at the same time, which is challenging for any fault detection and monitoring framework.

0%^{a}. | 2%^{a}. | 6%^{a}. | 10%^{a}. | ||||||
---|---|---|---|---|---|---|---|---|---|

. | Fault# . | T^{2}
. | SPE . | T^{2}
. | SPE . | T^{2}
. | SPE . | T^{2}
. | SPE . |

FAR % | normal^{b} | 0.45 | 0.26 | 0.59 | 3.87 | 0.49 | 14.05 | 0.52 | 26.49 |

MDR% | – | – | – | – | – | – | – | – | |

FAR% | 1 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.70 | 0.15 | 30.10 |

MDR% | 18.53 | 14.70 | 18.4 | 12.99 | 18.5 | 10.33 | 18.58 | 8.80 | |

FAR% | 2 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.70 | 0.15 | 30.10 |

MDR% | 8.56 | 34.26 | 8.57 | 29.95 | 8.44 | 24.07 | 8.68 | 19.86 | |

FAR% | 3 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.7 | 0.15 | 30.10 |

MDR% | 7.91 | 0.06 | 8.45 | 0.08 | 9.72 | 0.08 | 11.41 | 0.06 | |

FAR% | 4 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.70 | 0.15 | 30.10 |

MDR% | 37.86 | 55.99 | 37.84 | 55.40 | 38.10 | 54.16 | 38.34 | 53.90 |

0%^{a}. | 2%^{a}. | 6%^{a}. | 10%^{a}. | ||||||
---|---|---|---|---|---|---|---|---|---|

. | Fault# . | T^{2}
. | SPE . | T^{2}
. | SPE . | T^{2}
. | SPE . | T^{2}
. | SPE . |

FAR % | normal^{b} | 0.45 | 0.26 | 0.59 | 3.87 | 0.49 | 14.05 | 0.52 | 26.49 |

MDR% | – | – | – | – | – | – | – | – | |

FAR% | 1 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.70 | 0.15 | 30.10 |

MDR% | 18.53 | 14.70 | 18.4 | 12.99 | 18.5 | 10.33 | 18.58 | 8.80 | |

FAR% | 2 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.70 | 0.15 | 30.10 |

MDR% | 8.56 | 34.26 | 8.57 | 29.95 | 8.44 | 24.07 | 8.68 | 19.86 | |

FAR% | 3 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.7 | 0.15 | 30.10 |

MDR% | 7.91 | 0.06 | 8.45 | 0.08 | 9.72 | 0.08 | 11.41 | 0.06 | |

FAR% | 4 | 0.07 | 0 | 0.22 | 3.20 | 0.15 | 15.70 | 0.15 | 30.10 |

MDR% | 37.86 | 55.99 | 37.84 | 55.40 | 38.10 | 54.16 | 38.34 | 53.90 |

^{a}Percent of missing values; ^{b}from day 457 to day 488 during normal operation; MDR%, missed detection rate; FAR%, false alarm rate.

We can see that the FAR% for the T^{2} statistic is not affected by the missing values. On the other hand, the FAR% for the SPE statistic increases significantly in the presence of missing values. The most significant changes to the FAR% occurred with the largest level of missingness (10%). This may be because the imputation of missing values has a more significant impact on the residual subspace compared to the mean and covariance in PCs. Since the variation in the residual subspace is measured by SPE, it is more sensitive to the amount of imputed missing values.

## CONCLUSION

We have proposed a novel fault detection and isolating framework based on incremental PCA (IPCA) and complete decomposition contribution (CDC). The advantage of IPCA over other adaptive PCA approaches is that it does not require all eigenvalues to be computed. This characteristic speeds up computation time. The proposed adaptive IPCA monitoring framework consists of (i) adaptive estimation of the mean and variance of data, (ii) adaptive estimation of selected PCs and threshold, (iii) estimation of eigenvalue decomposition by incrementing the new data to the PCA, and (iv) adaptive estimation of the contribution charts for T^{2} and SPE. This framework was applied to BSM2. Our results, demonstrated with simulated faulty scenarios, show that IPCA is able to adapt time-varying process behavior while detecting and isolating faults. Although there are some delays in detecting the faults, the performance is acceptable since the faults under study were small at the initial stage of their occurrence. The contribution charts showed that this method was able to correctly isolate the faults. Sometimes, however, due to a lack of sensor measurements, it was not possible to isolate the sensor directly and only the impact of the fault on the other measurements could be detected. Our proposed framework can also handle in realtime highly intricate patterns of missing values (e.g. more than one sensor signal failure at the same time) by applying the EBLUP method. Our study of fault detection performance with missing values shows that SPE is more sensitive to the imputation of missing values than T^{2}. This is due to the significant impact of the imputed values on the residual subspace.

## ACKNOWLEDGEMENTS

We gratefully thank the Spanish Ministry of Economy and Competitiveness for providing financial support through grants CTM2015-67970-P and for funding the doctoral scholarship (BES-2012-059675). Some authors are recognized by the Comissioner for Universities and Research of the DIUE (Department of Innovation, Universities and Business) of the autonomous government of Catalonia (2017 SGR 396) and supported by the Universitat Rovira i Virgili (2017PFR-URV-B2-33) and Fundació Bancària ‘la Caixa’ (2017ARES-06). The authors are also grateful to Dr Ulf Jeppsson of Lund University, Sweden, for kindly providing the BSM2 Matlab code.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.