## ABSTRACT

This article presents the development of virtual sensors for estimation of phosphates (PO_{4}) and soluble COD profiles in a novel, continuous flow, moving bed bioreactor with enhanced biological phosphorus removal and simultaneous nitrification and denitrification, the Hias process. The virtual sensors combine online measurements with additional electrical conductivity (EC), oxidation–reduction potential (ORP) measurements, and state-space models at inlet, zone 3 and zone 7. The data were collected from Hias municipal WRRF, Norway from March to July 2023, and include both online data and laboratory data. Input variables were selected using correlation plots. Linear measurement equations were fitted to relate PO_{4} and COD concentrations in the laboratory data set with the online measurements including EC/ORP measurements. The state-space models were identified from the online data with model accuracy from moderate to strong. The estimated PO_{4} and COD concentrations correspond to most of the scarce laboratory data points at inlet and zone 3, whereas the model in zone 7 requires more work. A Kalman filter was developed for zone 3 and implemented in KYB IIoT platform. Future work is suggested on improvement of the model accuracy in zone 7, and development of energy-efficient control strategies using the virtual sensors.

## HIGHLIGHTS

Virtual sensors for estimation of PO

_{4}and COD in biological wastewater treatmentCombination of EC and ORP sensors with online measurements and models enable estimation of PO

_{4}and COD in different process zones.Virtual sensors in process zones can enable energy-efficient predictive control.

## INTRODUCTION

As the European Commission has proposed an updated urban waste water treatment directive (EuropeanCommission 2022), stricter requirements will be set for the removal of nutrients such as PO_{4}, organic substances and nitrogen compounds. The European Commission also requires energy-neutral operations in the water industry within the next two decades (EuropeanCommission 2021). Currently, the water industry uses approximately 1% of the total energy consumption in the European union. Hence, the development of novel biological nutrient removal processes and energy-efficient control methods for these are essential to meet the strict nutrient recovery requirements while minimizing the energy consumption at water resource recovery facilities (WRRFs).

In general, at municipal WRRFs, the primary wastewater treatment, clarification, is followed by a secondary treatment process that removes nutrients (Henze *et al.* 2008). Primary wastewater treatment encompasses a series of physical processes designed to remove settleable and floatable solids from wastewater. These processes include screening, grit removal, and sedimentation (Shewa & Dagnew 2020). In the secondary treatment process, nutrient recovery relies either on chemical additions or biological processes (Rezai & Allahkarami 2021) to achieve total nitrogen (TN) and total phosphorus (TP) concentrations below legal limits. Notably, every process step requires accurate monitoring, and real-time measurements using a broad range of sensors is a preferred choice. Water quality monitoring includes measurements of physical characteristics (e.g. temperature, conductivity), chemical parameters (e.g. pH, dissolved oxygen, alkalinity, nitrogen and PO_{4} compounds), and abundance of certain biological taxa (Korostynska *et al.* 2012). Monitoring could also include assays of biological activity such as alkaline phosphatase, tests for toxins and direct measurements of pollutants such as heavy metals or hydrocarbons (Frau *et al.* 2019, 2021). Phosphates can exist in wastewater in several forms depending on the source/nature of the discharge, but are generally grouped within three broad classes: orthophosphates, condensed phosphates (pyro-, meta- and poly-) and organic phosphorus (Korostynska *et al.* 2012). Municipal wastewater contains both dissolved and particulate forms of phosphorus. The majority, which is present as dissolved phosphate, comprises about 50% orthophosphate, 35% condensed phosphates, and 15% organic phosphates. Municipal wastewater typically contains between 6 and 8 mg/L of total phosphorus (Owodunni *et al.* 2023).

Conventional biological nutrient removal processes include active sludge systems such as anaerobic–anoxic–oxic (AAO) and sequencing batch reactor (SBR) (Zhou *et al.* 2022). A novel approach to biological nutrient removal is the Hias process, a moving bed bioreactor with enhanced biological phosphorus removal and simultaneous nitrification and denitrification developed by HiasHow2O (Rudi *et al.* 2019; Hias 2022). In this compact process, large amounts of small biofilm carriers, submerged in wastewater, circulate through three anaerobic and seven aerobic process zones, while nutrients such as phosphorus, organic substances and nitrogen compounds are removed from the wastewater in simultaneous biological processes in different layers of the biofilm. The phosphorus-rich biofilm carriers are recirculated from the last aerobic process stage to the first anaerobic process stage by a conveyor belt. The treated effluent flows through a disc filter, where the phosphorus rich biomass sludge is removed. Hence, the Hias process is more compact and does not require energy for sludge circulation (Hias 2022). The aeration rates in the Hias process are controlled in real-time using PID controller-based strategy. This work aims to develop virtual sensors that can enable multivariate control strategy to minimize energy consumption and reduce pollution to aquatic environment.

Real-time monitoring of water quality parameters is crucial for optimal operations, but the industry faces several challenges. While some parameters can be efficiently monitored using physical sensors, there are no technologically and economically feasible sensors for others, such as phosphates. An alternative is to use costly online-analyzers, which are prone to bio-fouling and clogging in solutions with a high content of suspended solids (Zidaoui *et al.* 2023). A virtual (or soft) sensor is a software algorithm that estimates the value of a water quality parameter without the need for the physical sensor itself. It does this by using mathematical equations or machine learning algorithms and the values of other available physical sensor data (Paepae *et al.* 2021, 2022, 2023). Nair *et al.* (2019, 2020) have demonstrated a model-based soft-sensor concept with extended Kalman filter and conductivity sensors for anaerobic basins in a laboratory-scale MBBR pilot system. Further, Wang *et al.* (2022) have demonstrated the use of ORP sensors for virtual measurements of nutrients in aerobic basins. In our previous work, we have studied data-driven models of the effluent PO_{4} in the Hias process (Komulainen *et al.* 2023; Nermo 2023).

In this article, first, the Hias process and instrumentation are presented. The tools and methods applied to develop virtual sensors based on state-space models and online measurements with additional EC and ORP sensors are described. Then, the results for virtual sensors estimating the PO_{4} and COD concentrations in three zones of the Hias process are explained in detail. The virtual sensor approach developed and tested is elaborated in the discussion section. Finally, the conclusions and further work are presented.

## MATERIALS AND METHODS

### Hias process description

The Hias process instrumentation includes continuous measurements of flowrate, temperature, aeration rate, dissolved oxygen, nutrient compositions, and suspended solids. Additional online measurements electrical conductivity and oxidation–reduction potential sensors were installed to the Hias process during the PACBAL research project 2022–2023. Hias laboratory determines nutrient composition of PO_{4} and soluble COD at inlet, zones 3,4,7,10 and outlet from daily grab samples 5 days a week. Hence, there are 5 samples from laboratory and 1,008 samples of online data for each variable per week. As most phosphorus (P) in wastewater is bound into different phosphate compounds, and the available laboratory measurements were ortho-phosphates, for simplicity in this article, the phosphate concentration is marked as (PO_{4}). The soluble organic substances, or carbon parameters, were measured online at inlet and zone 7 with an optical instrument, which corresponds to laboratory measurements of soluble chemical oxygen demand (COD). Further in this article, the soluble organic substances were marked as COD. The online and laboratory measurements used in this study are listed in Table 1.

Symbol . | Description . | Unit . | Online . | Laboratory . |
---|---|---|---|---|

PO | PO_{4} inlet | – | x | |

PO_{4z3} | PO_{4} zone 3 | – | x | |

PO | PO_{4} zone 4 | – | x | |

PO | PO_{4} zone 7 | – | x | |

PO | PO_{4} effluent after disc filter | x | x | |

COD_{in} | COD inlet | x | x | |

COD_{3} | COD zone 3 | – | x | |

COD_{4} | COD zone 4 | – | x | |

COD_{7} | COD zone 7 | x | x | |

NH_{3in} | NH_{3} inlet | – | x | |

NO_{2in} | NO_{2} inlet | x | – | |

NO | NO_{2} zone 7 | x | – | |

NO_{3in} | NO_{3} inlet | x | – | |

NO | NO_{3} zone 7 | x | – | |

NOX_{in} | inlet | x | – | |

NOX_{7} | zone 7 | x | – | |

EC_{in} | El. conductivity at inlet | x | – | |

EC_{3} | El. conductivity at zone 3 | x | – | |

ORP_{4} | Ox.Red. potential zone 4 | x | – | |

ORP_{7} | Ox.Red. potential zone 7 | x | – | |

Water flowrate inlet | x | – | ||

Water temperature inlet | x | – | ||

FO | Aeration rate zones 4–10 | x | – | |

DO | Dissolved oxygen zones 4,5,6,8,9 | x | – | |

SS_{10} | Suspended solids zone10 | x | – | |

SS | Suspended solids out after disc filter | x | – |

Symbol . | Description . | Unit . | Online . | Laboratory . |
---|---|---|---|---|

PO | PO_{4} inlet | – | x | |

PO_{4z3} | PO_{4} zone 3 | – | x | |

PO | PO_{4} zone 4 | – | x | |

PO | PO_{4} zone 7 | – | x | |

PO | PO_{4} effluent after disc filter | x | x | |

COD_{in} | COD inlet | x | x | |

COD_{3} | COD zone 3 | – | x | |

COD_{4} | COD zone 4 | – | x | |

COD_{7} | COD zone 7 | x | x | |

NH_{3in} | NH_{3} inlet | – | x | |

NO_{2in} | NO_{2} inlet | x | – | |

NO | NO_{2} zone 7 | x | – | |

NO_{3in} | NO_{3} inlet | x | – | |

NO | NO_{3} zone 7 | x | – | |

NOX_{in} | inlet | x | – | |

NOX_{7} | zone 7 | x | – | |

EC_{in} | El. conductivity at inlet | x | – | |

EC_{3} | El. conductivity at zone 3 | x | – | |

ORP_{4} | Ox.Red. potential zone 4 | x | – | |

ORP_{7} | Ox.Red. potential zone 7 | x | – | |

Water flowrate inlet | x | – | ||

Water temperature inlet | x | – | ||

FO | Aeration rate zones 4–10 | x | – | |

DO | Dissolved oxygen zones 4,5,6,8,9 | x | – | |

SS_{10} | Suspended solids zone10 | x | – | |

SS | Suspended solids out after disc filter | x | – |

Electical conductivity (EC) is a measure of the total amount of dissolved ions in water. The EC sensors in the Hias process are Xylem TetraCon 700 IQ, rugged electrode-style conductivity probe. The oxidation–reduction potential (ORP) is a measure of the tendency of a solution to gain or lose electrons. ORP can be used to monitor water quality in industrial and wastewater treatment applications (Weißbach *et al.* 2018; Wang *et al.* 2022). The ORP sensors in the Hias process are Xylem SensoLyt 700 IQ with impedance converter and temperature sensor. The optical measurements of COD, NO_{2} and NO_{3} are based on UV-spectrometry. The COD/NO_{2}/NO_{3} sensors in the Hias process are Xylem WTW NiCaVis 701 IQ NI. The optical measurement of dissolved oxygen (DO) is based on photoluminescence. The sensor type is Xylem WTW FDO 70x IQ F. The effluent PO_{4} is measured using an online HACH Phosphax sc phosphate analyzer with sampling time of 10 min and minimum detection limit of 0.05 mg/L.

### Software tools, data collection and preprocessing methods

An Industrial IoT platform, KYB, developed by Digitread Connect, was used for online data collection and Kalman implementation of the virtual sensors. The Industrial IoT platform KYB was used for uploading and standardizing operational data. The online data sets were collected in .csv format. The laboratory data set was obtained in .xlsx format from the process engineers. Matlab was used for development of data preprocessing and measurement equations, state-space model identification and Kalman filter tuning. Matlab functions ’fillmissing’ and ’filloutliers’ were applied to pre-process the data. Function ’fitlm’ multivariate regression was used to estimate measurement equation parameters. Function ’corrplot’ was applied to find the correlation between the variables. Function ’ssest’ was used to estimate the state-space models. Kalman filter was implemented for the real-time estimation of PO_{4} and COD in the KYB IIoT platform. Data preprocessing methods suitable for real-time use included outlier removal of values above/below three standard deviations and filling of missing values with the previous feasible value.

### Modeling approach

The case study is a continuous large-scale municipal WRRF process, the Hias process. The modeling challenge is to estimate PO_{4} and COD profile in the Hias process zones using extra instrumentation of two electrical conductivity (EC) and two oxidation–reduction potential (ORP) sensors. Direct online measurements of PO_{4} are not available in the process zones (only effluent analyzer) and direct online measurements of COD are available at two zones. Possible modeling approach would need to include an equation between the EC and ORP sensor measurements and nutrient concentrations, and a dynamic model accounting for the biological phenomena in the Hias process.

In previous work with the pilot-scale MBBR process, a state-space model structure has been applied in an Extended Kalman filter to estimate the PO_{4} and COD concentrations at the inlet and anaerobic zones (Nair *et al.* 2019, 2020). State-space models with input–output delay were applied to estimate the effluent PO_{4} concentration for the large-scale municipal Hias process in Komulainen *et al.* (2023).

*k*as described in Equation (1). A linear measurement model with coefficient matrices C and D relates measurements and states at current time step

*k*as described in Equation (2). Further, the state-space model structure is used to develop a methodology that combines measurement equations between EC, ORP, and nutrients together with dynamic state-space models of the Hias process.

### Model error indices

*R*

^{2}index based on the real measurements , the mean value of the real measurements , the model calculated output for N data points as presented in the following equation.The

*R*

^{2}index can be interpreted as value between 0 and 0.3 is weak, 0.3 and 0.5 is moderate, 0.5 and 0.7 is sufficient, and value between 0.7 and 1 is strong explanation of variability between the measurements and the model.

### Measurement equations between EC, ORP, PO_{4} and COD

*et al.*(2007), Aguado

*et al.*(2006), and Nair

*et al.*(2020), the relationship between electrical conductivity EC and nutrient concentrations (PO

_{4}and COD) in anaerobic conditions can be modeled using multivariate regression with linear, quadratic and interaction terms between the input variables (predictors). In our work, we look into first order models to fit the C matrix in the state-space model for anaerobic zone 3 and inlet, as presented in Equation (4). Effect of temperature and other variables will be taken into account by other parts of the dynamic model. The laboratory data set was used to fit the model, as PO

_{4}and COD are not measured online in anaerobic zones (except COD at inlet). Matlab function fitlm was used to fit the

*c*parameters of the following multivariate regression:In the aerobic part of the Hias process, oxidation–reduction (ORP) sensors are placed at zone 4 and zone 7. The ORP sensor at aerobic zone 4 was defective during the data collection period. Nair

_{i}*et al.*(2019) have used pH as input for measurement equation in aerobic conditions. However, at Hias WRRF two ORP sensors are available instead of pH sensors. Wang

*et al.*(2022) have developed linear multivariate regression for nutrient monitoring using ORP sensors in anoxic denitrification process. In our work, we look into first order models to fit the C matrix in the state-space model for aerobic zone 7, as presented in Equation (5). Effect of temperature and other variables will be taken into account by other parts of the dynamic model. The laboratory data set was used to fit the model, as PO

_{4}is not measured online in aerobic zones. COD and NO

_{2}and DO are measured at zone 7. Matlab function fitlm was used to fit the

*c*parameters of the following multivariate regression:

_{i}### Input variable selection using correlation plots

As the Hias process has much more variables available than a state-space model identification algorithm can handle, the input variables (features) were selected using correlation pair plots between the input and output variables. Time delays were first applied for each variable, and then the variables were plotted using the corrplot function in Matlab. In total, three to six input variables with highest correlation to the measured output variables were chosen. High correlation between two input variables would rule out one of the input variables (Ljung 1999).

### State-space model identification

_{4}and COD can be formulated as a state-space system with two states (

*x*

_{1}and

*x*

_{2}), multiple inputs (

*u*), output measurements (

_{i}*y*

_{1}and

*y*

_{2}) and input–output delays θ

_{i}. State-space model identification requires scaled continuous data of the inputs and outputs, but not the states. The state-space model will estimate the values of the state variables. The state-space model for states

*x*

_{1}and

*x*

_{2}, corresponding to scaled values of PO

_{4}and COD can be represented as:Theory on identification of state-space systems is presented in Ljung (1999) and for state-delayed state-space systems in Gu & Li (2023). Matlab function ’ssest’ uses the subspace identification algorithm to identify the model parameter matrices A and B. Model parameters in matrices C and D are defined by the measurement equation.

State-space models are estimated from the selected input–output variable set with input–output time delays. The data needs to be scaled, so that the mean values and standard deviations become similar between the variables. The data set is divided into two, where first half is used for model estimation (estimation/training data) and the second half for validation (validation/testing data). The ’ssest’ Matlab function with input–output delay and two states is used with the estimation data set. The resulting model is applied to the validation data set, and error between the measurement and model are compared numerically using the *R*^{2} error index and visually using a trend figure.

### PO_{4} and COD calculation using state estimates

*x*

_{1}and

*x*

_{2}can be used to calculate the PO

_{4}and COD concentrations in different zones of the Hias process. The mean and standard deviation of PO

_{4}and COD were calculated from the laboratory data. The standard deviation of the state estimates

*x*

_{1}and

*x*

_{2}was calculated from the state-space model results, and used as scaling factors

*w*

_{1}and

*w*

_{2}for the calculation of the PO

_{4}and COD estimates with the laboratory data mean and standard deviation values.The scaling factors

*w*

_{1}and

*w*

_{2}were calculated from the estimation data set as:

### Kalman filter implementation in IoT platform

A Kalman filter improves the future estimates of PO_{4} and COD by adjusting an error variable in the algorithm. Kalman filter ensures that the error between the model and measurements is very close to zero. A description of a linear Kalman filter algorithm can be found from text books, for example Ljung (1999).

The Kalman filter code was developed in Matlab m-script and manually translated to Python. NumPy (https://numpy.org/) was used for implementing the Kalman filter, and pandas (https://pandas.pydata.org/) was used for handling and pre-processing data. The Python driver for MongoDB (https://pypi.org/project/pymongo/) was used for fetching online data from the database, and for updating the database with the estimation results. The Kalman filter script in Python was containerized using Docker (https://www.docker.com/), and deployed to KYB as a serverless function in AWS Lambda (https://aws.amazon.com/lambda/). The script was then set up to run every 10 min, pushing the estimation results back into the database as virtual sensor updates.

## RESULTS

### Data collection and preprocessing

The online data and laboratory data were collected for a period of 21 March–31 July 2023. Prior to the measurement equation work, an extended laboratory data set was created. The laboratory data set with daily measurements was manually filled with values for EC, ORP and COD from the online data set with matching time stamps. During period of 1 April–15 May 2023 many online measurements were missing, hence, for most of the results, the online data set is for period 15 May 2023–31 July 2023, whereas and extended period of 21 March–31 July 2023 is used for the laboratory data set.

As the data is not normally distributed, the outliers in the data set were identified based on more than 1.5 interquartile ranges above the upper quartile (75%) or below the lower quartile (25%). Then, the missing values were filled in using fillmissing based on previous feasible value. In the Kalman filter online implementation, the outlier detection limit was set to three standard deviations and outliers and missing values were filled with previous feasible value.

### Input–output time delays θ

*N*zones between these, divided by average flow rate and the sampling time

*T*of 10 min, as presented in Equation (10). The resulting time delays in are given in Table 2.

_{s}Symbol . | Zones in between . | Delay in samples . |
---|---|---|

θ_{8} | 10 | 29 |

θ_{7} | 7 | 26 |

θ_{4} | 4 | 15 |

θ_{3} | 3 | 11 |

θ_{2} | 2 | 7 |

θ_{1} | 1 | 4 |

Symbol . | Zones in between . | Delay in samples . |
---|---|---|

θ_{8} | 10 | 29 |

θ_{7} | 7 | 26 |

θ_{4} | 4 | 15 |

θ_{3} | 3 | 11 |

θ_{2} | 2 | 7 |

θ_{1} | 1 | 4 |

### Nutrient estimation at inlet

A state-space system at inlet can be defined between:

the measured outputs () are and

the states () consist of unmeasured and measured

the measured inputs () at inlet are , , , and

#### Selection of input variables

_{in}and COD

_{in}measurements to choose a set of input variables with high correlation. It is desired to have two to six variables as inputs to the state-space model. Based on the correlation plot, Figure 2, the following input variables with absolute correlation value over 0.4 were considered:

EC

_{in}correlation over 0.4 : NO_{2in}(0.61), NO_{3in}(0.52), NOX_{in}(0.66).COD

_{in}correlation over 0.4:*T*(-0.42), NO_{3in}(0.78), NOX_{in}(0.72).

*F*was not correlated to either of the output variables. Temperature

*T*was negatively correlated with COD. As NOX is strongly correlated with NO

_{2}(0.74) and NO

_{3}(0.90) and all the NO measurements are correlated with the output variables, NOX with was chosen. Hence, the final selection of input variables includes:

*T*, NOX

_{in}. Applying the state-space approach of Equation (6) gives the following state-space presentation of the inputs, states and outputs:

#### Measurement equation

#### State-space model identification

The state-space model was identified between the inputs and outputs with two state-variables. The resulting state-space model coefficients and error indices are presented in Table 3. The state-space model comparison to the estimation data set gave *R*^{2} index of 0.62 and 0.64 for EC and COD, and the comparison to the validation data set gave *R*^{2} index of 0.52 and 0.66 for EC and COD. The model accuracy is sufficient.

States . | a_{11}
. | a_{12}
. |
---|---|---|

x_{1} | 1.078 | -0.2512 |

x_{2} | 0.4884 | -0.3337 |

Inputs | ||

−0.1591 | −0.9459 | |

NOX_{in} | 0.1156 | 0.5399 |

Measurements | ||

EC_{in} | 0.3488 | 0.6138 |

COD_{in} | 0 | 1 |

Error | Estimation | Validation |

index | data | data |

0.62 | 0.52 | |

0.64 | 0.66 |

States . | a_{11}
. | a_{12}
. |
---|---|---|

x_{1} | 1.078 | -0.2512 |

x_{2} | 0.4884 | -0.3337 |

Inputs | ||

−0.1591 | −0.9459 | |

NOX_{in} | 0.1156 | 0.5399 |

Measurements | ||

EC_{in} | 0.3488 | 0.6138 |

COD_{in} | 0 | 1 |

Error | Estimation | Validation |

index | data | data |

0.62 | 0.52 | |

0.64 | 0.66 |

#### Inlet PO_{4} and COD estimates

The inlet PO_{4} and COD were estimated using the state variables *x*_{1} and *x*_{2} from the state-space models. The mean and standard deviation of these variables were calculated on the laboratory data. The standard deviation of the state estimates was close to one and the scaling factors *w*_{1} and *w*_{2} were also 1. The PO_{4} and COD estimates were calculated by scaling the state estimates *x*_{1} and *x*_{2} with the laboratory data set mean and standard deviation values as presented in Table 4.

Variable . | N
. | Mean . | Stdev . | w
. |
---|---|---|---|---|

20 | 5.812 | 1.1426 | 1 | |

COD_{in} | 13 | 478.92 | 115.68 | 1 |

Variable . | N
. | Mean . | Stdev . | w
. |
---|---|---|---|---|

20 | 5.812 | 1.1426 | 1 | |

COD_{in} | 13 | 478.92 | 115.68 | 1 |

_{4}and COD are presented in Figures 3 and 4. The PO

_{4}state estimates fit with about half of the laboratory samples. The COD estimates are following the trends in the online COD measurement, and most of the laboratory samples match with the COD estimates and the online data. It should be noted that the online data and laboratory data do not match at all points of time due to different measurement principles.

### Nutrient estimation zone 3

A state-space system in zone 3 can be defined as:

the measured outputs () are

the states () consist of unmeasured and unmeasured

the measured inputs () at inlet with delay θ

_{3}are , , , , ,the measured input () at inlet with delay θ

_{4}is

#### Selection of input variables with EC_{3} correlation plot

The scaled and delayed input variables were plotted against the EC_{3} measurements, given in Figure 5, to choose a set of input variables with high absolute correlation with EC_{3}.

inlet variable NO

_{2in}(0.59): Nitrite has high correlation with EC_{3}, but low correlation with the other variables.inlet variable SS

_{10}(0.57): Suspended solids in zone10 can represent the biomass capacity on releasing PO_{4}and binding COD on the next round in the Hias process zones 1–3. As SS_{10}and SS are highly correlated, only SS10 is selected.inlet variable COD

_{in}(0.73): COD at inlet has high correlation with NO_{3}, SS_{10}, SS.inlet variable EC

_{in}(0.98): is highly correlated with the output variable and most of the input variables.

*F*, and temperature,

*T*, have both low absolute correlation values with EC

_{3}and are discarded from the input variables. NO

_{2}has higher correlation than NO

_{3}and is chosen. SS

_{10}has higher correlation than SS, and is chosen. The final combination of input variables are: COD

_{in}, NO

_{2}, SS

_{10}and EC

_{in}Applying the state-space approach of Equation (6) gives the following state-space presentation of the inputs, states and outputs:

#### Measurement equation EC3

#### State-space model identification

The state-space model was identified between the inputs and outputs with two state-variables. The first half of the data set was used for model estimation and the second half for model validation. The resulting matrix coefficients are presented in Table 5 . The state-space model fitness to estimation data had a *R*^{2} value of 0.96 and to validation data a *R*^{2} value of 0.97. The model fitness for the measured variable EC3 is very high because the correlations with the input variables are very high.

State . | . | a_{11}
. | a_{12}
. |
---|---|---|---|

x_{1} | 0.7556 | −0.7581 | |

x_{2} | −0.3866 | −0.4218 | |

Inputs | Delay | ||

COD_{in} | 11 | 0.04415 | 0.06767 |

NO | 11 | 0.05976 | 0.1121 |

SS_{10} | 15 | -0.02631 | -0.04808 |

EC_{in} | 11 | 0.8372 | 1.49 |

Measurement | |||

EC_{3} | 0.648 | 0.3332 | |

Error | Estimation | Validation | |

index | data | data | |

R^{2} EC_{3} | 0.9578 | 0.9711 |

State . | . | a_{11}
. | a_{12}
. |
---|---|---|---|

x_{1} | 0.7556 | −0.7581 | |

x_{2} | −0.3866 | −0.4218 | |

Inputs | Delay | ||

COD_{in} | 11 | 0.04415 | 0.06767 |

NO | 11 | 0.05976 | 0.1121 |

SS_{10} | 15 | -0.02631 | -0.04808 |

EC_{in} | 11 | 0.8372 | 1.49 |

Measurement | |||

EC_{3} | 0.648 | 0.3332 | |

Error | Estimation | Validation | |

index | data | data | |

R^{2} EC_{3} | 0.9578 | 0.9711 |

#### Zone 3 PO_{4} and COD estimates

The PO_{4} and COD at zone 3 were estimated using the state variables x1 and x2 from the state-space model. The mean and standard deviation of these variables were calculated on the laboratory data. The inverse standard deviation of the state estimates *x*_{1} and *x*_{2} was used as scaling factor *w*_{1} and *w*_{2}. The PO_{4} and COD estimates were calculated by scaling the state estimates *x*_{1} and *x*_{2} with the laboratory data set mean and standard deviation values as presented in Table 6.

Variable . | N
. | mean . | stdev . | w
. |
---|---|---|---|---|

PO_{4z3} | 20 | 35.36 | 10.91 | 0.99 |

COD_{3} | 13 | 221.77 | 44.21 | 1.42 |

Variable . | N
. | mean . | stdev . | w
. |
---|---|---|---|---|

PO_{4z3} | 20 | 35.36 | 10.91 | 0.99 |

COD_{3} | 13 | 221.77 | 44.21 | 1.42 |

_{4}zone 3 estimates with the training and validation data were plotted in the same figure with the 20 laboratory data points, as presented in Figure 6. Due to the scarceness of the laboratory data (20 points), a statistical evaluation of model fitness would not give reliable results. The PO

_{4}estimate seems to fit with half of the laboratory data points.

### Nutrient estimation zone 7

A state-space system in zone 7 can be defined as:

the measured outputs () are ORP and

the states () consist of unmeasured and measured

the measured input () at the end of Hias process with delay θ

_{8}is .the measured inputs () at inlet with delay θ

_{7}are , , , EC, , , .the measured input () at zone 3 with delay θ

_{4}is EC.the measured inputs () at zone 4 with delay θ

_{3}are DO, FO.the measured inputs () at zone 5 with delay θ

_{2}are DO, FO.the measured inputs () at zone 6 with delay θ

_{1}are DO, FO.the measured inputs () at zone 7 without delay are ,, , .

#### Selection of data set

The ORP_{7} measurements have poor data quality for most of the available online data set period. Therefore, the correlations with input variables and ORP_{7} were close to zero when the whole data set was used for correlation analysis. Hence, two data periods with sufficient ORP_{7} data quality were chosen, samples 7,000–8,000 for estimation data and samples 9,500–10,500 for validation data. With these data periods, the correlations between input variables and ORP_{7} were increased up to 0.56.

#### Selection of input variables with ORP_{7} and COD_{7} correlation plots

_{7}and COD

_{7}to choose a set of input variables with high correlation to the output variables. As there are 19 input variables, the correlation matrices were plotted for subsets of input variables. As an example, the selected input–output pair plots are illustrated in Figure 8.

_{7}or COD

_{7}were chosen to form a set of 6 input variables. The following input variables were chosen for the state-space model at zone 7, correlations are given, respectively, for ORP

_{7}and COD

_{7}:

inlet variable

*T*(0.56 and 0.65): Temperature has high correlation with both output variables, but also with most input variables.inlet variable SS (0.52 and 0.63): suspended solids from the Hias process outlet are highly correlated with output variables, but also with most input variables. Suspended solids can represent the biomass capacity on releasing/binding PO

_{4}and binding COD in the Hias process. As SS_{10}and SS are highly correlated, only the variable with highest correlation is selected.zone 3 variable EC

_{3}(0.38 and 0.72): electrical conductivity at zone 3 is highly correlated especially with COD_{7}. As EC_{in}and EC_{3}are highly correlated, EC_{3}with highest correlation to COD_{7}is chosen. The EC difference variable EC_{3}-EC_{in}has low correlation to both output variables.zone 7 variable FO

_{7}(0.40 and 0.64): The aeration rate in zone 7 is highly correlated with the output variables.zone 7 variable NO (−0.50 and −0.53): Nitrite is negatively correlated with both output variables.

zone 7 variable DO

_{7E}(0.11 and −0.03): Dissolved oxygen in zone 7 is a variable in the measurement equation, and hence it should be included to the state-space model even though it has low correlation on both output variables.

#### Measurement equation

_{4}and COD measurements in zone 7 was extended with online measurements of ORP

_{7}, COD

_{7}, , NO, DO. Total number of laboratory measurements with all these variables was 20, but with poor ORP

_{7}measurement quality for 15 of 20 measurements. These measurements were used to fit a linear regression. The parameters were . The model fitness is much lower than for the EC measurement equations, because issues with the ORP

_{7}data quality for most of the period.

#### State-space model identification

The state-space model was identified between the inputs and outputs with two state-variables. The first half of the data set was used for model estimation and the second half for model validation. The resulting state-space model coefficients and error indices are presented in Table 7. The state-space model comparison to estimation data set gave *R*^{2} index of 0.54 and 0.73 for ORP_{7} and COD_{7}, and the comparison to validation data set gave *R*^{2} index of -0.14 and 0.62 for ORP_{7} and COD_{7}. The *R*^{2} values are sufficient for COD_{7}, but the fit to validation data for ORP_{7} is insufficient.

State . | – . | a_{11}
. | a_{12}
. |
---|---|---|---|

x_{1} | 0.4838 | −3.076 | |

a_{21} | a_{22} | ||

x_{2} | -0.009 | 0.9097 | |

Input | delay | ||

4.223 | 0.08713 | ||

SS | 2.952 | 0.0516 | |

EC_{3} | −0.05261 | 0.01237 | |

FO_{7} | −1.227 | −0.02149 | |

NO | −2.535 | −0.05085 | |

DO_{7E} | 0.1727 | −0.001416 | |

Output | |||

ORP_{7} | 0.126 | −0.452 | |

COD_{7} | 0 | 1 | |

Input | delay | ||

0 | 0 | ||

SS | 0 | 0 | |

EC_{3} | 0 | 0 | |

FO_{7} | 0 | 0 | |

NO | 0.353 | 0 | |

DO_{7E} | 0.293 | 0 | |

Error | Estimation | Validation | |

index | data | data | |

0.5403 | −0.1429 | ||

0.7282 | 0.6168 |

State . | – . | a_{11}
. | a_{12}
. |
---|---|---|---|

x_{1} | 0.4838 | −3.076 | |

a_{21} | a_{22} | ||

x_{2} | -0.009 | 0.9097 | |

Input | delay | ||

4.223 | 0.08713 | ||

SS | 2.952 | 0.0516 | |

EC_{3} | −0.05261 | 0.01237 | |

FO_{7} | −1.227 | −0.02149 | |

NO | −2.535 | −0.05085 | |

DO_{7E} | 0.1727 | −0.001416 | |

Output | |||

ORP_{7} | 0.126 | −0.452 | |

COD_{7} | 0 | 1 | |

Input | delay | ||

0 | 0 | ||

SS | 0 | 0 | |

EC_{3} | 0 | 0 | |

FO_{7} | 0 | 0 | |

NO | 0.353 | 0 | |

DO_{7E} | 0.293 | 0 | |

Error | Estimation | Validation | |

index | data | data | |

0.5403 | −0.1429 | ||

0.7282 | 0.6168 |

#### Zone 7 PO_{4} and COD estimates

Zone 7 PO_{4} and COD were estimated using the state variables *x*_{1} and *x*_{2} from the state-space models. The mean and standard deviation of these variables were calculated on the laboratory data. The scaling factors *w*_{1} and *w*_{2} were calculated as inverse of the standard deviation in the state-variables. The PO_{4} and COD estimates were calculated by scaling the state estimates *x*_{1} and *x*_{2} with the laboratory data set mean and standard deviation values as presented in Table 8.

Variable . | N
. | Mean . | Stdev . | w
. |
---|---|---|---|---|

PO | 20 | 3.1 | 3.56 | 1/8.53 |

COD_{7} | 13 | 80.3 | 10.5 | 1/0.56 |

Variable . | N
. | Mean . | Stdev . | w
. |
---|---|---|---|---|

PO | 20 | 3.1 | 3.56 | 1/8.53 |

COD_{7} | 13 | 80.3 | 10.5 | 1/0.56 |

_{4}and COD are presented in Figures 9 and 10. For the period of reasonable ORP

_{7}data quality, very few laboratory measurement points were available, three (3) for PO

_{4}and two (2) for COD. As negative values of PO

_{4}are not possible, more work should be done on estimation and scaling the model parameters. The COD model estimates have bias, but follow some of the peaks in the online COD measurement. More work is needed to improve accuracy. To improve the estimation approach with state-space model and measurement model in zone 7, a larger laboratory data set and a new online data set after the maintenance of the ORP sensor is required.

### Kalman filter

The model accuracy was strongest at zone 3, and Kalman filter was implemented at the IIoT platform for zone 3. Kalman filter parameters *Q* and *R* were the same as the variance, respectively, in states and measurements. Since the inputs, state variables *x*_{1} and *x*_{2} and outputs are normalized with zero mean and standard deviation of 1, *Q* and *R* became an identity matrices. Initial guess for covariance matrix Pp diagonal elements were 0.1, and initial values for states were chosen 0.5. The Kalman estimated states *x*_{1} and *x*_{2} were then scaled to PO_{4} and COD values using the mean and standard deviation from the laboratory measurements.

## DISCUSSION

The reduced ASM2d modeling approach of Nair *et al.* (2020) with Extended Kalman filter was considered first. We developed reduced order ASM2d models using the available laboratory measurements and online data. However, ammonia (NH_{3}) was not included in the online measurements and the model accuracy decreased just within days. The laboratory data could not be made available in the KYB IIoT platform due to Microsoft two-factor authentication, and therefore, it was not possible to calibrate the reduced order ASM2d models online.

In our data-driven approach, each zone with EC or ORP measurement becomes it’s own virtual sensor. The virtual sensor consists of selected physical online measurements, a state-space model and a Kalman filter that continuously updates the error between the model predictions and real measurements. The state-space model consists of dynamic equation and measurement equation.

Input variable selection

The input variables were selected using correlation matrix with time-delayed variables and the EC/ORP measurement. The selection of input variables affects greatly the state-space model accuracy. Selection of too many input variables deteriorates the accuracy of the state-space model. For example in zone 3, selection of 6 input variables gives *R*^{2} value of 0.6 at best, where as selection of 4 input variables increases the *R*^{2} value up to 0.97. Other input variable selection methods, such as Shapley additive explanations (Paepae *et al.* 2022) or Matlab stepwise regression could be used in future work.

State-space model – Measurement equation

The measurement equation, a linear multivariate regression was developed, in the anaerobic zones between EC and nutrients PO_{4} and COD, and, in the anaerobic zones, between ORP and nutrients PO_{4}, COD, NO_{2} and dissolved oxygen DO. The measurement equation coefficients give values for the C and D matrices. The challenge with the measurement equation is small number of laboratory samples where all the PO_{4}, COD were present. For the inlet and zone 3, the measurement equation accuracy was reasonably strong, but for zone 7, the ORP sensor quality caused lower modeling accuracy. It is possible to increase the modeling accuracy by adding quadratic and interaction terms, but these would not fit with the linear state-space model.

State-space model – dynamic part

The system identification algorithm was forced to use the measurement equation C and D matrix values, and only find optimal parameters of A and B matrices for the state-space model using online data between measured input and output variables. This implementation enables meaningful relationship between the inputs, states and outputs, and hence information of the unmeasured states, or the nutrient concentrations.

Comparison of results

The model comparisons to online and laboratory measurements is illustrated for inlet, zone 3 and zone 7 only with the state-space models, not with Kalman filter. The Kalman filter will minimize the error between the model and output measurements, and not give a realistic picture of the model accuracy. The model accuracy was strongest at zone 3, and Kalman filter as implemented at the IIoT platform for zone 3 (and effluent). The main weakness of the modeling approach is that it is not directly related to material balances as the Extended Kalman Filter approach in Nair *et al.* (2019, 2020). If the laboratory data could be made available in the IIoT platform, more advanced modeling approaches can be developed. The modeling work for zone 7 was challenging due to very limited period of feasible ORP sensor data, and a few laboratory measurements. However, it was desired to test the same approach as for inlet and zone 3. Hence, the results are not strong.

## CONCLUSIONS AND FURTHER WORK

This article presents development work for virtual sensors estimating PO_{4} and COD using online measurements and state-space models of the Hias process. As the laboratory data set was not available online, the modeling approach utilizing simplified ASM2d models could not be applied. Hence, data-driven approach of state-space models was chosen. Input variables for the dynamic part of the state-space models were selected using correlation plots. Linear measurement equations were developed between the online EC and ORP measurements and phosphate PO_{4} and carbon COD measurements using aggregated laboratory and online data sets. The modeling accuracy *R*^{2} was strong at inlet (0.86), and zone 3 (0.91), but moderate at zone 7 (0.66). The measurement equations were further used to formulate state-space models. The dynamic part of the state-space models were identified from the online data. The *R*^{2} error index for the state-space models was 0.6 for EC and COD at inlet, 0.9 for EC zone 3 and 0.6 for COD at zone 7, but less than 0.5 for ORP at zone 7. The estimated states were scaled, and the estimates of the PO_{4} and carbon concentrations were plotted against the laboratory measurements.The estimated values correspond to most of the laboratory data for inlet and zone 3, whereas for zone 7, there are too few laboratory data points to evaluate the result. There were too few laboratory data points to give statistical evaluation.

Kalman filter was implemented in KYB platform to provide online estimates of the phosphate PO_{4} and carbon COD concentrations at zone 3 with the best model accuracy. The ORP sensor data quality in zones 4 and 7 was insufficient. For zone 4 no measurements were available, whereas zone 7 had a limited data period of 2,000 online samples and 3 laboratory samples. This was challenging for modeling of the PO_{4} and COD in zone 7.

Future work is suggested on improvement of the model accuracy in zone 7 with a new data set, and development of design for energy-efficient control strategies using the virtual sensors.

## ACKNOWLEDGEMENTS

RFF Innlandet is gratefully acknowledged for funding the PACBAL project (nr 337727). Torgeir Saltnes at Hias How2O, Katrine Marsteng Jansen from Hias IKS water resource recovery facility have provided valuable insights in the Hias process. Katrine Marsteng Jansen and Tobias Korten from Hias IKS have supported data collection. Axel Tveiten Bech and Ola Solli Grønningsæter from Digitread Connect have performed online data collection. Axel Tveiten Bech has implemented the state-space models and Kalman filter in the KYB IIoT platform. Professor Tiina Komulainen is the principal investigator of the project. She has developed the modeling approach, supervised and revised the dynamic modeling work, and prepared the article draft. Professor Olga Korostynska has supervised the sensor part of the work. Research assistant A.Malik Baqeri has worked on testing different modeling approaches. Professor Tiina Komulainen has prepared the final implementation of the models, virtual sensors and figures. Katrine Marsteng Jansen (Hias IKS) and Professor Olga Korostynska have done critical review of this article.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Proceedings of the 64th International Conference of Scandinavian Simulation Society*. Linköping University Electronic Press, Västerås, Sverige September 26-27 2023. https://doi.org/10.3384/ecp200037

*Mpc of Nutrient Removal in Wastewater Treatment Process*. Master Thesis, Oslo Metropolitan University

*Soft Computing Techniques in Solid Waste and Wastewater Management*(R. R. Karri, G. Ravindran & M. H. Dehghani edn.). Elsevier, pp. 35–53. doi: https://doi.org/10.1016/B978-0-12-824463-0.00004-5