In this work, clustering analysis of two partial nitritation–anammox (PN-A) moving bed biofilm reactors (MBBR) containing different types of carrier material was explored for the identification of patterns and operational conditions that may benefit process performance. The systems ran for two years under fluctuations of temperature and organic matter. Ex situ batch activity tests were performed every other week during the operation of these reactors. These datasets and the parameters, which were monitored online and in the laboratory, were combined and analyzed applying clustering analysis to identify non-obvious information regarding the performance of the systems. The initial results were consistent with the literature and from an operational perspective, which allowed the parameters to be explored further. The new information revealed that the oxidation reduction potential (ORP) and the anaerobic ammonium oxidizing bacteria (AnAOB) activity correlated well. ORP also dropped when the reactors were exposed to real wastewater (presence of organic matter). Moreover, operating conditions during nitrite accumulation were identified through clustering, and also revealed inhibition of anammox bacteria already at low nitrite concentrations.
In any biological process for wastewater treatment (wwt), monitoring certain process parameters is essential for the control and evaluation of the performance of the system. Thereby, some parameters are more relevant than others. However, due to (i) the high density of data (from once per day down even to every second), (ii) highly dimensional systems (multiple variables monitored) and (iii) the dynamic nature of the processes involved, most of the data is commonly not studied or analyzed in depth. The high dimensionality of the datasets (many parameters) and large amount of data makes it challenging to process information in a timely enough fashion and use it for better understanding or proper decision-making (Corominas et al. 2018). Accordingly, more efficient and robust statistical tools for data analysis are required to extract knowledge from these datasets. In unsupervised machine learning (UML), a subfield of machine learning, plenty of data analysis methods to extract and discover novel knowledge from data exist. These include methods for extracting direct patterns from sequence items (common items in transactions) and for discovering structure and relationships between data without previous training. One of the many analysis techniques is clustering, an unsupervised data analysis method focused on finding groups of related items (aka. clusters) in a dataset. This is achieved by partitioning data into distinct groups so that the items (observations) within each group are similar to each other (Jain 2010). Different studies have applied UML methods in wwt applications. Aguado et al. (2008) applied self-organizing maps (SOM) and principal component analysis (PCA) for the analysis and interpretation of enhanced biological phosphorus removal processes. The size of their dataset comprised 11 parameters and 328 observations. The data was gathered over three months of operation which comprised also the start-up of the process. The main results demonstrated the feasibility of the application of unsupervised graphical methods to analyze multidimensional systems. Similarly, Garciá & González (2004) applied SOM and k-means to estimate and monitor the diverse states of waste treatment in a chromic acid wastewater treatment plant (WWTP). In this process seven parameters were analyzed. The methodology was similar to the work of Aguado et al. (2008). Just recently, Di et al. (2019) explored two clustering methods to analyze more than 15 WWTP in China and further explored patterns within the data. They employed partitioning around medoids (PAM) and expectation–maximization (EM) analyzing four parameters in 18 monitoring sections in seven administrative regions in the Yangtze River Basin (China) from 2016 to 2017. Their results indicate that UML can be applied for the identification of heavily polluted wastewater discharges and heavily polluted surface water. In most of these studies, a uniform dataset was used for analysis, which means that the resolution of the parameters analyzed was similar. However, the application of UML methods, such as clustering analysis, has not been applied for datasets which share different amounts of observations. In this work, clustering analysis was conducted to find relevant patterns within the data of two laboratory-scale PN-A moving bed biofilm reactors (MBBR). These systems were operated for two years and exposed to seasonal temperature fluctuations (10–20 °C) in the presence and absence of organic matter (Gilbert et al. 2015; Lackner et al. 2015). In PN-A processes, the influence of temperature, nitrogen concentration and organic matter on the activity and composition of the biomass is significant (Bi et al. 2015; Lackner et al. 2015). Aerobic ammonium oxidizing bacteria (AOB), nitrite oxidizing bacteria (NOB), AnAOB and heterotrophic bacteria (HB) comprise the main microbial community in PN-A systems. Our main hypothesis was that by using UML (i.e., data clustering), unseen operational conditions that may benefit the performance of the reactors while studying both laboratory (high amount of observations) and batch-activity (low amount of observations) information can be discovered. Accordingly, the main contribution of our work dealt with clustering analysis of high-dimensional and large datasets to discover and explore hidden patterns.
Two PN-A MBBR containing different carrier materials (K3 and Biofilm ChipTM M, AnoxKaldnes, Sweden) were operated for around two years. In the first year the systems were fed with synthetic feed (NH4-N only) and exposed to seasonal temperature fluctuations back and forth from 20 °C to 10 °C, and in the second year, they ran with real municipal wastewater (organic matter) and the same seasonal temperature fluctuations. Both reactors were monitored with online sensors (online data) and also sampled daily or weekly for further influent and effluent monitoring; around 25 parameters were analyzed in this work (laboratory and online datasets). Additionally, along with the operation of both reactors, biomass was taken regularly from the laboratory-scale reactors to measure the specific rates for nitritation, nitratation, anammox and heterotrophic activity in ex situ batch experiments (batch activity datasets). Further information on the reactor operation and batch tests is available in Agrawal (2018) and Gilbert et al. (2014).
A key issue with these datasets was the high number of features (i.e., variables) involved in the laboratory-scale systems, which may significantly affect the results and quality of any data analysis task such as clustering. In order to address this, feature selection (FS) methods were also used to reduce data dimensions and to make the clustering task more efficient. In our research, we used it to select key variables for the reactor performance: removal of ammonium (NH4+), nitrate (NO3−), nitrite (NO2−) and organic matter (COD).
There are three main approaches to FS in supervised machine learning: wrapper, filter and embedded methods (Chandrashekar & Sahin 2014). Wrapper methods evaluate subsets of features, which allows possible interactions to be detected between them, by using learning algorithms. Filter methods are used as a pre-processing task to rank features wherein highly ranked features are selected and applied to a predictor, however, no learning algorithm is used. Finally, embedded methods are a combination of wrapper and filter methods, where the learning task and the ranking task cannot be separated, such as in random forest methods (Granitto et al. 2006). In random forest, hundreds to thousands of decision trees are built, each of which is independently built, and therefore, the trees are de-correlated. Each tree is built over binary decisions (yes–no questions) and a heuristic is involved to build the tree (wrapper method characteristic). FS in this method is based on an importance score, which is a value representing the average ranking of the features, the features at the top of the list being the most relevant (Kirkpatrick et al. 1983).
Since the size of the batch activity dataset is small compared with the laboratory and online datasets, these datasets were not used in the FS process. Thus, once the FS was performed with the laboratory and online datasets, the selected feature subset was coupled to the batch activity datasets (Figure 1), and from thereon, it is referred to as the Clustering dataset. The FS was performed with the randomForest library in R (Kuhn 2008; R Foundation for Statistical Computing 2016).
Each point (i.e. tuple) in the Clustering dataset (Figure 1) was built so that the dates when the batch activity test were performed matched an average of ±5 days of the laboratory-scale dataset (after FS). After building the Clustering dataset, the k-means clustering algorithm was applied to build related clusters; k-means estimates the unknown cluster centers (aka. centroids) for each of k clusters by minimizing each item of data to the closest centroid, and at the same time, each cluster's centroid is defined when the distance between a data point in the dataset is far from a previous centroid. Clustering analysis with k-means was implemented with the stats (a default library in R) and cluster libraries in the R system (Maechler et al. 2018).
The optimal numbers of clusters were determined by using the elbow-criterion; the number of clusters should be selected such that by adding other clusters, no new information is gained, i.e., from the information theory perspective, it does not reduce entropy. The within-cluster sum of squares (Wk) is a parameter from k-means, which was plotted against an increasing number of clusters. Graphically, Wk decreases monotonically as the number of clusters increases, but from a certain number of clusters, the decrease flattens markedly, resulting in the optimal number of clusters (Tibshirani et al. 2001). For this purpose, a Wk was computed for a different number of clusters for both datasets (see Figure S1 in the supplementary information).
RESULTS AND DISCUSSION
This work aimed at uncovering hidden patterns through clustering analysis for two MBBRs. For this purpose, pre-processing and FS tasks were first performed as seen in Figure 2. The FS method selected the most relevant variables within the computed ranking through the random forest algorithm while minimizing the root mean squared error from the importance score. The relevance of a feature (variable) is defined as strong interactions between the parameters within a dataset, for example, in a prediction task, the relevant parameters will share strong interactions with the output parameters (Blum & Langley 1997). Here, the set of features in Figure 2 were the most influencing parameters for the effluent parameters analyzed in this work: organic matter concentration (COD Eff.), nitrate concentration (NO3-N Eff.), nitrite concentration (NO2-N Eff.), and ammonium concentration (NH4-N Eff.). Data dimension was reduced by up to 50% of the original (i.e., from ∼25 down to around 12 features). Even though the biofilm characteristics between both reactors were different (2 mm vs 10 mm thickness of the carrier), both reactors shared almost the same relevant features: temperature (T Reac.), oxidation reduction potential (ORP), dissolved oxygen concentration (DO), conductivity (Cond. Reac.) (more relevant in BiofilmChip M-MBBR than K3-MBBR), hydraulic retention time (HRT) and ammonium (NH4-N Inf.), nitrite (NO2-N Inf.), biological oxygen demand (BOD5 Inf.) and chemical oxygen demand (COD Inf.) concentrations in the influent. The influence of the concentrations of the nitrogen species is undisputable in deammonification processes since they are target pollutants to be biologically removed. The influence that organic matter has on deammonification systems is also expected; the literature supports that COD/N ratios greater than 1 have a marked influence on the efficiency of ammonium removal (Van Hulle et al. 2010). The impact of temperature on the nitrite concentration in the effluent also agrees with related research (Gilbert et al. 2014). The ORP is a parameter used in wwt to monitor the occurrence of specific biological reactions. The presence of an oxidizing agent such as oxygen increases the ORP value, while the presence of a reducing agent such as a substrate decreases the ORP value (Wouters-Wasiak et al. 1994; Zipper et al. 1998). Accordingly, the values of ORP correlated well with the oxygen concentration just before both reactors were exposed to different feed compositions (i.e., real wastewater, containing organic matter). The ORP value dropped then to values below and around 50 mV in both reactors; the same results were found in the clustering analysis.
The HRT was identified as a relevant parameter (NH4-N Inf. and NH4-N Eff. were kept constant around 50 and 6–8 mg N L−1, respectively). However, this operational condition was only valid for the first year (i.e., synthetic feed only). In PN-A systems, the conductivity signal is an important parameter for controlling the ammonium concentration. Joss et al. (2009) found that both signals correlated linearly. Accordingly, in this work, the correlation coefficients between the conductivity and the ammonium concentration were above 0.7.
In general, DO plays a key role in the out-competition of NOB while maintaining high anammox activity in any PN-A system (Hao et al. 2002; Wyffels et al. 2004; Vangsgaard et al. 2012; Mattei et al. 2015). Mattei et al. (2015) proposed different scenarios where both the DO and shear stress influenced the microbial distribution of a biofilm PN-A system; prolonged exposure of the biofilm to DO concentrations around 5 mg DO L−1 resulted in the inhibition of AnAOB. Our results suggest (also further seen in the clustering analysis) that an average value of 0.3 mg DO L−1 would benefit the activity of AOB and at the same time guarantee no further inhibition of AnAOB.
After FS, clustering analysis was conducted by using the k-means method. Following the elbow-criterion, three was the optimal number of clusters for both datasets (see Figure S1). Figure 3 shows the identified clusters. Cluster 1 in both reactors grouped the data where the highest efficiency of ammonium removal and highest AOB and AnAOB activity was found (Figure 4 and Table S1 in the supplementary information), whereas Cluster 3 (K3) and cluster 2 (BiofilmChip M) enclosed the lowest activity of the biomass and lowest efficiencies. The remaining clusters grouped the data where the systems were exposed to real municipal wastewater (i.e., presence of organic matter).
The results are consistent from an operational perspective. Higher relative activity of the AOB resulted in accumulation of nitrite in the effluent (Figure 4). This accumulation was due to the temperature drop in both systems from 20 to 10 °C (Gilbert et al. 2014). Moreover, in the K3-MBBR reactor, the ORP value dropped from around 300 mV in cluster 1 to an average value of 200 mV in cluster 3, whereas for the BiofilmChip M-MBBR, the ORP decreased progressively (Figure 5).
The AOB to nitrite oxidizing bacteria (NOB) ratios (AOB/NOB) are analyzed in Figure 5. For clusters 1 and 3 in the K3-MBBR, the AOB/NOB ratio changed uniformly while a clear difference was observed in the BiofilmChip M-MBBR. The AOB/NOB ratio was characterized by values above 1 for the BiofilmChip M-MBBR in cluster 2, whereas for the other two clusters, the AOB/NOB was on average below 1, the latter leading to nitrite accumulation in cluster 2. Cluster 2 also revealed a drop in AnAOB activity suggesting inhibition by nitrite.
Table S1 summarizes the centroids of each cluster in both datasets (a centroid being the average value of the data points enclosed in each cluster). By analyzing the centroids of the clusters (Table S1), further patterns were detected; during the operational period with synthetic feed, both MBBRs showed higher values of ORP (around 200 mV), whereas in the presence of organic matter, the ORP dropped to values on average of lower than 100 mV. Moreover, while the MBBRs were fed with synthetic wastewater, higher values of ORP (∼ 200 mV) resulted in higher activity of AnAOB; this trend was clearer for the BiofilmChipM-MBBR than for the K3-MBBR.
On the other hand, the DO concentration correlated well with the conductivity just before the system was exposed to organic matter (real wastewater). The average HRT in the K3-MBBR fluctuated around 1.9 days in clusters 1 and 3 whereas in cluster 2 (operation with real wastewater, i.e. organic matter), this value dropped to 0.95 days. Finally, the HRT in the BiofilmChip M-MBBR was very dynamic, jumping from values of 1.3 to 2.5 days and 1.1 days in cluster 1 to 3, respectively.
A new approach for extracting patterns out of wastewater treatment plant data
This work introduces a set of steps to combine three types of datasets (Figure 6): online, laboratory and batch experiment data. These datasets were used to conduct clustering analysis and extract unknown patterns and information. These three main categories of datasets (online, laboratory and batch experiment data) are common in wwt. The online datasets include parameters that are monitored online by sensors and analysers with the highest amount of observations (readings) in all categories (readings every second to every hour). Parameters measured in the laboratory constitute the laboratory datasets. These parameters are measured daily, weekly or even only monthly; in our case, this dataset contained the second highest amount of readings and observations with almost daily measurements. Finally, in this study, batch experiments were performed also to monitor the activity of the biomass. These tests were measured every other week, thus, they show the lowest number of readings.
These different categories of datasets mentioned were studied through clustering analysis. First, average daily values of the online datasets were considered and coupled to the laboratory dataset. Afterwards, FS was applied to this dataset with the aim of (i) extracting the parameters that have the highest impact in the effluent (i.e., in the process performance) and (ii) reducing the dimensionality of the dataset so only key parameters would be analyzed with the batch activity data. After FS, the batch activity and the selected features from online and laboratory datasets were merged according to the procedure described in Figure 1.
The results found not only agree with the literature but in addition, new hints to improve reactor performance were discovered. Therefore, the methodology illustrated here arises as an opportunity for the application to study similar dynamic and complex biological systems. Moreover, other methods from UML can be applied instead of k-means clustering for discovery of patterns, such as association rules and PCA (Clark & Ma'ayan 2011; James et al. 2013). The selection of methods for exploratory analysis (application of UML) depends on the goal of the study. Clustering analysis can be divided into two groups: hierarchical and partitional. k-means is part of partitional clustering. Although large amounts of clustering algorithms have been created after k-means, this method is still widely applied. The main advantages of k-means are: (i) it is easy to implement and interpret the results, (ii) it allows the organization of data into sensible groupings, (iii) it is adaptable to new data, and (iv) the data to be analyzed does not require labels that tag the examples with prior identifiers. However, some disadvantages of k-means and other similar clustering algorithms are: (i) the final results depend on the initial centroids (when random initialization is not applied), (ii) it is a method that is sensible to scaling, and (iii) selecting the number of clusters is still somewhat ambiguous (it is subject to an expert point of view). Particular to our work, the advantages are the interpretability of the results and the goal matched the method; we aimed to find groups in a high dimensional dataset and we did. Methods such as PCA analysis are directed more towards a graphical representation and reduction of dimensions. In k-means clustering, on the other hand, we look for groups within the database that can elucidate relationships between sets of points. Another popular UML method is SOM, which is a method built towards unsupervised neural networks. However, one of the main disadvantages of this method and its applicability here is the amount of data. SOM requires sufficient data in order to develop meaningful clusters. The weight vectors must be based on data that can successfully group and distinguish inputs. Lack of data or irrelevant data in the weight vectors will add randomness to the groupings. Finding the correct data involves determining which factors are relevant and can be a difficult or even impossible task for several problems. The ability to determine a good dataset is a deciding factor in determining whether to use a SOM or not.
In this work, clustering analysis and FS methods were employed to find relevant patterns within the data of two laboratory-scale PN-A MBBR. The results showed that relevant and actionable knowledge related to parameters such as DO, ORP, HRT and the conductivity can efficiently be extracted. Overall, DO and ORP in both reactors were discovered as relevant parameters for monitoring AOB and AnAOB activity.
We also proposed a new approach to explore correlations and relations among parameters from online, laboratory and batch experiment datasets while coupling the information from all three groups to conduct clustering analysis. Furthermore, our work contributes to promoting the usage of clustering analysis for high-dimensional and large datasets to discover and explore hidden patterns. Hence, this innovative approach provides relevant knowledge to further improve reactor performance.
This work was partly funded by the German Federal Ministry of Education and Research (BMBF) as part of the project EiVeN-G (project number 01DN17004) and FONDECYT (Chile) under grant number 1170002 and CONICYT-Basal Project FB0008.
The Supplementary Material for this paper is available at https://dx.doi.org/10.2166/wst.2020.029.