## Abstract

Soft-sensor applications for wastewater management can provide valuable information for intelligent monitoring and process control above and beyond what is available from conventional hard sensors and laboratory measurements. To realize these benefits, it is important to know how to manage gaps in the data time series, which could result from the failure of hard sensors, errors in laboratory measurements, or low-frequency monitoring schedules. A robust soft-sensor system needs to include a plan to address missing data and efficiently select variable(s) to make the most use of the available information. In this study, we developed and applied an enhanced iterated stepwise multiple linear regression (ISMLR) method through a MATLAB-based package to predict the next day's influent flowrate at the Kirie water reclamation plant (WRP). The method increased the data retention from 77% to 93% and achieved an adjusted R^{2} up to 0.83 by integrating with a typical artificial neural network.

## INTRODUCTION

*Soft sensors* are mathematical or statistical approaches that can be used to provide useful information to advance process efficiency and resilience in environmental engineering systems. The term ‘soft sensors’ is used to contrast with conventional hard sensors, because soft sensors are commonly used to predict or estimate information based on mathematical or/and statistical methods. Data-derived soft-sensor approaches can be classified as multivariate statistics, machine learning, fuzzy logic, or hybrid (Haimi *et al.* 2013). Kadlec *et al.* (2011) reported that typical offline soft sensors include multiple linear regression (MLR), principal component analysis (PCA), partial least squares (PLS), and support vector machine (SVR). In addition, recent soft-sensor studies have explored soft sensors based on neural networks (Choi & Park 2001; Basant *et al.* 2010; Mohammadpour *et al.* 2016; Gebler *et al.* 2018; Zhu *et al.* 2018). Potential benefits associated with soft sensors include low cost and fast response times, and integrated soft and hard sensor systems are possible (Fortuna *et al.* 2007). Soft sensors have been investigated for water and wastewater engineering, including surface water (Basant *et al.* 2010; Khan *et al.* 2013), groundwater (He *et al.* 2008; Burow *et al.* 2010), and stormwater (Ki *et al.* 2011; Tota-Maharaj & Scholz 2012). Soft sensors are also used in designed water resources systems (Kuo *et al.* 2006; Yang *et al.* 2011), engineered treatment systems (Choi & Park 2001; Perelman *et al.* 2012), and water distribution systems (Smeti *et al.* 2009; Oliker & Ostfeld 2014). Studies of using soft sensors to predict influent wastewater flowrate are also available. For example, Fernandez *et al.* (2009) developed a soft-sensor model based on the neuro-fuzzy approach to forecast average daily wastewater flowrate at two facilities using only two input variables. They believed that the model achieved good results based on average errors (<8%). Wei *et al.* (2013) took one more step to predict future influent flowrates based on a multilayer perceptron artificial neural network (ANN) using hourly input data, including rainfall and radar reflectivity. Their model was able to predict the flow 1 h ahead with a mean square error (MSE) of about 10%.

The above studies demonstrate the rapid development of soft sensors for intelligent water resources monitoring and control. One issue that can complicate soft-sensor applications is how to manage missing data, especially in a large time-series dataset with many regressors. Among the studies cited above, Wei *et al.* (2013) is the only one that mentioned potential problems with missing data (replacement of average value of neighbors) perhaps because the other studies did not have a problem related to missing values. However, missing data were frequently found in our datasets, include N/A (not available), N/S (not sampled), O/S (equipment out of service), and S/X (sample canceled). Typical missing data management methods include simple deletion, replacement, and imputation. Simple deletion can result in the loss of a large fraction of the data (Scheffer 2002); replacement could reduce variability and alter the covariance and correlation among input regressors (Schafer & Graham 2002); and the computations in imputation can be very time-consuming (Kadlec 2009). These potential drawbacks are especially relevant for a time-series dataset with a large number of time-series regressors where many values can be missing. We developed a simple and effective way to provide missing data management based on a modified variable selection method.

Variable selection, an important step in soft-sensor development, is primarily used to identify relatively important variables for building a concise model and reducing the risk of overfitting. However, the variable selection method could also be used simultaneously to help manage missing data because exclusion of irrelevant variables can reduce the chance of deleting data frames. There are many variable selection methods available, such as PLS, PCA, and SVR (Souza *et al.* 2016). In our previous study (Zhu & Anderson 2016), we described a modified offline soft-sensor based on stepwise regression; that method, *iterated stepwise multiple linear regression* (*ISMLR*), can maximize the utilization of raw historical data in terms of periodicity, minimize the loss of useful data, and minimize the risk of overfitting by shrinking the group of regressors. The flexibility of the ISMLR method makes it possible to rebuild a reliable model even when an important regressor is missing, for example, due to sensor failure. Another potential benefit is that the ISMLR method can be used as a data preprocessing tool. For example, in another study (Zhu *et al.* 2018), we used ISMLR as a pretreatment step, integrated with MLR, ANN, and compromise programming to minimize the fraction of underestimations of predicted five-day carbonaceous biochemical oxygen demand (CBOD_{5}) at a large water reclamation plant (WRP). To realize these benefits, however, two major limitations need to be overcome to promote the application and further development of the ISMLR method.

Although the concept of ISMLR is relatively simple, previous investigations focused on testing the approach, rather than streamlining the implementation method. For example, Zhu & Anderson (2016) employed a manual modeling process that required a substantial amount of time for a single prediction. Datasets with a large number of regressors might need several iterations, which could require long processing times.

In addition, a conventional ISMLR method relies on linear regression and default

*p*-values to detect a subset of important regressors, which offers little flexibility for solving different types of problems. If other regression approaches, such as interactions and quadratic methods were included, it would significantly increase the complexity of the ISMLR method and require much more time to conduct a single modeling task.

To address these limitations and to encourage further development of the ISMLR method, we developed a MATLAB-based, graphical user interface standalone, *ISMLR package*. Advantages of the ISMLR package include customized missing data management, flexible integration with other soft-sensor methods, fast computation, and a friendly user interface. Missing data are common in water and wastewater datasets, and we believe that this work demonstrates that the ISMLR package is able to predict future wastewater flowrates, and demonstrates the possibility of its applications in other areas of water and wastewater engineering systems.

This study is part of a collaborative project involving the Metropolitan Water Reclamation District of Greater Chicago (MWRDGC) and the Illinois Institute of Technology, to reduce energy demands and improve control of nutrient loading in the Chicago Area Waterway System by developing and deploying cyber-physical system (CPS 2010). Wet weather conditions can have a negative effect on wastewater treatment processes. They can create short-duration, high load conditions that reduce hydraulic residence time and dilute or washout microorganisms from the reactors. An additional concern in Chicago is the deep tunnel system that can lead to lower magnitude but more frequent flowrates, which can adversely affect process resilience. The ability to predict characteristics of future flows can help to ensure effluent quality and minimize energy consumption. This paper describes our assessment of how the ISMLR package performs for predicting the next day's influent flowrate at a WRP.

## METHODOLOGY

### ISMLR package

The ISMLR package (Zhu 2019) was designed to develop an offline soft sensor that integrates missing data management and prediction. The package includes four main applications: outlier detection, periodogram analysis, dataset development, and ISMLR application. Briefly, the outlier detection application provides technical outlier detection and statistical outlier detection, which identify outliers based on domain knowledge and statistical distribution, respectively. Periodogram analysis application was developed to identify possible periodic patterns of variables, as revealed by a Fourier series function. Periodic patterns can be important, because they help to develop time-series regressors from the original variables in the dataset development application (details of regressor development will be described in the section of flow prediction). The ISMLR application (Figure S1, details in supporting information, available with the online version of this paper), the core of the ISMLR package, was based on an enhanced ISMLR method. More details of the original ISMLR method are described by Zhu & Anderson (2016). Briefly, SMLR is applied to identify important regressor(s) and irrelevant regressor(s) is(are) excluded at each iteration; simultaneously, a dataset is rebuilt based on the subset of important regressors at each iteration. The iteration will stop when all the regressors in the subset are statistically important. Compared to the original method, the enhanced ISMLR method provides much more flexibility:

First, the ISMLR modeling process is split into primary and subsequent SMLR steps (Figure 1). The primary SMLR step is based on the treated dataset with all initial regressors, whereas each of the subsequent SMLR step(s) is based on a rebuilt, treated dataset with a subset of selected, important regressors. The *pre-ISMLR* results (Figure 1, lower left) are based on the initial treated dataset and the *post-ISMLR* results (lower right) are based on the final treated dataset. The ISMLR application tests the hypothesis that a variable makes a significant contribution to the regression model.

Second, selection of each important variable is based on its *p*-value, which provides a measurement of the smallest level of significance at which the hypothesis would be rejected. The ISMLR application offers options of *p*-value customization and default *p*-values for adding (‘enter’ in the main interface) and subtracting (‘remove’) a regressor are 0.05 and 0.10, respectively (MathWorks 2017). A regressor that has a lower *p*-value in a regression model is more important for predicting variance of the response variable. In the main interface, *p*-values can be customized, subject to two common rules for SMLR:

The range is 0<

*p*-value <1.The

*p*-value for adding must be lower than the*p*-value for subtracting.

Third, in addition to setting the *p*-value, one of four regression types can be selected: *linear*, *interactions*, *purequadratic*, and *quadratic*. *Linear* considers the individual regressors and a constant. *Interactions* includes two-way interactions besides individual regressors and a constant. *Purequadratic* involves quadratic terms, individual regressors, and a constant. *Quadratic* is comprised of constant, individual regressors, two-way interactions, and quadratic terms (MathWorks 2017). Pseudocode of the key part of the enhanced ISMLR method is presented as Algorithm 1 in supporting information.

*p*-values, and dataset size, on the modeling process and results. Three types of evaluation criteria are computation time, retention level of observations, and prediction performance; major performance criteria include adjusted R

^{2}(adj. R

^{2}; Equations (1) and (2), where

*n*is the number of observations;

*k*is the number of regressors; , , and are measured, predicted, and average of measured values, respectively) and mean absolute percentage error (MAPE; Equation (3)). Near the end of this paper, we discuss potential benefits of using the ISMLR as part of a hybrid soft-sensor model.

### The MWRDGC Kirie WRP

The James C. Kirie WRP is one of the seven MWRDGC WRPs in the Greater Chicago area. The plant has provided wastewater treatment service since 1980 for about 270,000 people (in 2000) in a service area of about 169 km^{2}. MWRDGC uses the tunnel and reservoir plan (TARP) to better manage stormwater during wet weather conditions and to reduce the occurrence of combined sewage overflows. As a result, part of the raw wastewater to the Kirie WRP is TARP flow from the Chicago underflow plan (CUP) reservoir. The ability to predict the next day's influent flow would help to better implement process controls, decreasing the risk of effluent violations and increasing energy efficiency.

### Data collection and preliminary treatment

The present study of the Kirie WRP was based on daily data from 2001 to 2010 describing the following:

Influent conditions at the Kirie WRP that include 11 variables.

Meteorological variables from the weather station at the Chicago O'Hare International Airport (GHCND: USW00094846). The data, including air temperature, precipitation, snowfall, and snow depth, were downloaded from National Oceanic and Atmospheric Administration (NOAA 2017) website.

Three additional precipitation variables from stations 1, 3, and 5 of the Cook County Precipitation Network program; these data were obtained from the Illinois State Water Survey (ISWS 2017).

The O'Hare and ISWS monitoring stations were selected because they are geographically close to the Kirie WRP (Figure S2). In total, there were 18 primary variables for this analysis (Table 1).

Variable . | Symbol . | N_{outlier}
. | P_{md} (%)
. | Average, SD (unit) . |
---|---|---|---|---|

Flowrate | Q | 0 | 0 | 1.58, 0.92 (m^{3}/s) |

pH | pH | 0 | 0.1 | 7.3, 0.2 (−) |

Total solids | TS | 0 | 0.1 | 963.4, 223.1 (mg/L) |

Suspended solids | SS | 5 | 0.1 | 190.1, 133.6 (mg/L) |

BOD_{5} | BOD | 2 | 1.0 | 165.6, 77.7 (mg O_{2}/L) |

TKN | TKN | 1 | 0.1 | 25.9, 7.8 (mg N/L) |

NH_{3}-N | NH | 0 | 3.3 | 14.6, 4.8 (mg N/L) |

NO_{2}-N | NO | 131 | 3.6 | 0.17, 0.26 (mg N/L) |

Total P | TP | 1 | 0.1 | 4.3, 1.6 (mg P/L) |

Free barium ion | Ba | 4 | 0.1 | 0.06, 0.03 (mg/L) |

Free manganese ion | Mn | 2 | 0.1 | 0.08, 0.03 (mg/L) |

Air temperature (O'Hare) | T _{a} | 0 | 0 | 10.2, 10.9 (°C) |

Precipitation (O'Hare) | PPT | 0 | 0 | 2.6, 8.1 (mm) |

Snowfall (O'Hare) | SF | 0 | 0 | 2.6, 15.1 (mm) |

Snow depth (O'Hare) | SD | 0 | 0.9 | 12.3, 40.0 (mm) |

Precipitation (ISWS station 1) | PPT1 | 0 | 0 | 2.3, 7.3 (mm) |

Precipitation (ISWS station 3) | PPT3 | 0 | 0 | 2.5, 7.6 (mm) |

Precipitation (ISWS station 5) | PPT5 | 0 | 0 | 2.4, 7.6 (mm) |

Variable . | Symbol . | N_{outlier}
. | P_{md} (%)
. | Average, SD (unit) . |
---|---|---|---|---|

Flowrate | Q | 0 | 0 | 1.58, 0.92 (m^{3}/s) |

pH | pH | 0 | 0.1 | 7.3, 0.2 (−) |

Total solids | TS | 0 | 0.1 | 963.4, 223.1 (mg/L) |

Suspended solids | SS | 5 | 0.1 | 190.1, 133.6 (mg/L) |

BOD_{5} | BOD | 2 | 1.0 | 165.6, 77.7 (mg O_{2}/L) |

TKN | TKN | 1 | 0.1 | 25.9, 7.8 (mg N/L) |

NH_{3}-N | NH | 0 | 3.3 | 14.6, 4.8 (mg N/L) |

NO_{2}-N | NO | 131 | 3.6 | 0.17, 0.26 (mg N/L) |

Total P | TP | 1 | 0.1 | 4.3, 1.6 (mg P/L) |

Free barium ion | Ba | 4 | 0.1 | 0.06, 0.03 (mg/L) |

Free manganese ion | Mn | 2 | 0.1 | 0.08, 0.03 (mg/L) |

Air temperature (O'Hare) | T _{a} | 0 | 0 | 10.2, 10.9 (°C) |

Precipitation (O'Hare) | PPT | 0 | 0 | 2.6, 8.1 (mm) |

Snowfall (O'Hare) | SF | 0 | 0 | 2.6, 15.1 (mm) |

Snow depth (O'Hare) | SD | 0 | 0.9 | 12.3, 40.0 (mm) |

Precipitation (ISWS station 1) | PPT1 | 0 | 0 | 2.3, 7.3 (mm) |

Precipitation (ISWS station 3) | PPT3 | 0 | 0 | 2.5, 7.6 (mm) |

Precipitation (ISWS station 5) | PPT5 | 0 | 0 | 2.4, 7.6 (mm) |

One type (type I) of outlier was identified based on the distance-based outlier detection method, which is briefly summarized in Table S1 and Figure S3. Other outliers (type II) were identified because they were below detection limits. There was a large number (=131) of outliers for nitrite, but the majority of those outliers (=129) were type II. Although 17 type I outliers were detected, ultimately there were only eight days of outliers identified because outliers for different variables can occur on the same day. In general, only about 4% of the total days (=3,651) of data were identified as outliers. Sixteen of the 18 variables have an extremely low percentage of missing data (≤ 1%); the exceptions are nitrite and ammonia. Nitrite has a relatively higher percentage of missing data because of type II outliers, whereas ammonia has a similar amount of missing data because data were not sampled from January to April 2001. However, all variables have nearly complete datasets (>96%).

A periodogram analysis was subsequently applied to identify periodic patterns that might be associated with each variable. Three groups of periodic patterns were found among these variables. Meteorological variables (including air temperature, flowrate, the precipitation variables, snowfall, and snow depth) directly related to the climate or weather change do not exhibit any significant periodic patterns in the short-term (Figure S4). Another group of variables, including pH, BOD_{5}, NO_{2}-N, TP, Ba, and Mn, shows a clear and strong periodic signal, indicating that similar information for each of these variables repeats every seven days. Periodic patterns for the other variables, such as TS, SS, TKN, and NH_{3}-N, are not strong enough to identify among a substantial amount of noise.

To apply a consistent approach for all the primary variables, a seven-day sequence was adopted to develop a cluster of primary regressors. For example, for variables representing typical wastewater compositions, such as TKN and NH_{3}-N, the most recent seven days of data from seven days before the current day (*t**−**7*) to one day before the current day (*t**−**1*), were selected as independent regressors. Because data describing BOD_{5} result from a five-day measurement and the most recent data are not readily available, the ‘window’ was set four days earlier than typical wastewater compositions: *BOD(t**−**11)*, *BOD(t**−**10)*, …, *BOD(t**−**5)*. Regressors based on the easily-acquirable variables, such as pH, flowrate, and free metals, included both historical data (from ‘*t**−**6*’ to ‘*t**−**1*’) and real-time data (the current day's data, ‘*t*’). The remaining meteorological regressors were developed to cover historical, real-time, and anticipated data (*t**−**5*, *t**−**4*, …, *t*, *t**+**1*). In summary, to predict the next day's flowrate, *Q(t**+**1)*, 3,651 days of data and 126 regressors comprised the input to the ISMLR application; eight years (2001–2008) of data were used as the training dataset and two years (2009–2010) of data were used as the testing dataset. The primary prediction was based on the default *p*-values (‘enter’ = 0.05 and ‘remove’ = 0.10) and regression types (*linear* for all primary and subsequent steps); additional tests that involve different *p*-values and regression types are described below.

### Effect of modeling parameters and dataset size

Regression types, *p*-values, and dataset size (number of regressors or observations) can affect the modeling process and results. To investigate these effects, we followed a one-factor-at-a-time approach where only the investigated parameter was changed while the other parameters remained fixed.

There are 4^{2} (=16) different combinations of the four regression types in the primary and subsequent regression steps. We used the first letter of each regression type to simplify the names of these combinations. For example, the default setting is the *LL*-model (*linear* + *linear*) while a combination of *purequdratic* and *interactions* is the *PI*-model.

The number of combinations of *p*-values is theoretically infinite. In this study, we illustrate the effect of 16 selected options made up of four different pairs of *p*-values in both primary and subsequent regression steps (4^{2} = 16). The four pairs for adding and subtracting of *p*-values range from 0.05 and 0.10 to 0.80 and 0.85, respectively; the interval between pairs was fixed at 0.25 and the difference between *p*-values for ‘enter’ and ‘remove’ was fixed at 0.05. (Table S2). Although these options represent only a small part of the whole parameter space, they provide a general picture of the effect of *p*-values. Our investigation of *p*-values considered both *LL*- and *LP*-models.

When the ISMLR application is applied to other cases that may have different dataset sizes, the correlation between dataset size and computation time will be useful to provide efficient controlling strategies if time is an important factor. The number of regressors and the number of observations can significantly affect the computation time and prediction performance. To understand their effects, seven different datasets that have different numbers of initial regressors and another seven datasets that have different numbers of observations were examined. More details about the methods and results are provided in the supporting information.

### Comparison of ISMLR-based approaches with typical methods

The ISMLR package can be used as a pretreatment tool to define a subset of important regressors or to pretreat a dataset for subsequent use with other more advanced prediction algorithms. This kind of hybrid approach can increase the flexibility of the ISMLR package, by providing effective management of missing data. It also offers the ability to achieve better prediction performance relative to an individual ISMLR method or other algorithms. To test the enhanced ISMLR-based hybrid approach, we compared nine soft-sensor methods, including ISMLR based on *LL*-model (denoted as ISMLR (*LL*)), ISMLR (*LP*), principal component regression (PCR), PLS, SVR, ANN, PCA + ANN, and two ISMLR-based hybrid soft sensors (ISMLR (*LP*) + SVR, ISMLR (*LP*) + ANN). Specifically, ten-fold cross-validation was adopted to access the performance of different numbers of principal components in PCR- and PLS-based approaches and MSE, a typical evaluation score, was used to quantify the performance. Average values of ten-fold cross-validation MSEs in different numbers of principal components were used to determine the best models. Hyperparameters, such as *C*, gamma, and epsilon, can affect the prediction performance of the SVR model (refer to the following references for more details about SVR and its hyperparameters). We consulted relevant wastewater studies (Sousa *et al.* 2014; Granata *et al.* 2017; Wang *et al.* 2018) that used SVR, so different hyperparameter values were selected to compare by applying grid search to determine the best model parameters (more description about the SVR model details can be found in supporting information, section 9). The ANN used a typical network type (feed-forward backpropagation) and a training function based on Levenberg-Marquardt (LM) optimization. LM optimization was selected because it has been investigated in the pursuit of better environmental management in many studies (Tota-Maharaj & Scholz 2012; Mustafa *et al.* 2013). For the present study of the Kirie facility, the ANN algorithm was comprised of three layers where the first (input) layer was fed by the treated dataset from ISMLR, the second (hidden) layer had ten neurons based on a log-sigmoid transfer function, and the third (output) layer had one neuron based on a hyperbolic tangent sigmoid transfer function.

## RESULTS AND DISCUSSION

### Prediction results using default values

The computation time for a prediction based on default values was less than a minute using the ISMLR application. The modeling process involved three iterations (Table S3). About 77.3% of all observations (=3,651) were retained in the initial treated dataset comprised of 126 regressors; specifically, 73.4% of observations in the training dataset (2,145 of 2,922 days) and 92.9% of observations in the testing dataset (677 of 729 days) were retained. With the help of ISMLR, the retention level increased to about 93.0% when 27 final important regressors were selected in the third iteration, which includes 91.7% of observations from training dataset (2,680 of 2,922 days) and 97.9% of observations from testing dataset (714 of 729 days). The difference in data management between pre-ISMLR and post-ISMLR can be seen in Figure S5, where many observations that were lost from 2002 and 2003 using the pre-ISMLR model were retained in the post-ISMLR model. Both models do not retain data from early 2001 because NH_{3}-N has a large fraction of missing data during that time, and *NH* regressors were included in both models. In fact, differences in missing data management (15.7% in this study) can be much bigger when the raw variables do not have complete datasets.

The top five important regressors are *Q(t)*, *PPT1(t)*, *PPT1(t**+**1)*, *SD(t**+**1)*, and *PPT1(t**−**4)*, suggesting that, as expected, hydrologic components provide important information for predictions, whereas wastewater composition does not play such a crucial role (Table S4). Precipitation data for ISWS station 1 (*PPT1*) are also among the most important regressors, perhaps because this station is geographically close to the TARP system connected to the Kirie WRP. In contrast, ISWS stations 3 and 5 as well as the O'Hare station are close to the TARP system for the Stickney WRP, another MWRDGC facility. In general, the default model provides good predictions for *Q(t**+**1)* (Table S5); results from testing (adj. R^{2} ≈ 0.784 and MAPE ≈ 14.0%) are slightly better than those from training (0.776 and 14.6%).

### Effect of regression types

Significant differences in computation time can be found using different combinations of regression types in primary and subsequent steps, and these differences follow a logical pattern (Figure 2). First, selection of a primary regression type determines the general computation time. For example, using modeling types *linear* or *purequadratic* uses much less time (<60 min) relative to types *interactions* or *quadratic* (>180 min). Second, after the primary regression step, using *linear* or *purequadratic* models in the subsequent regression steps also uses less time, but this effect is smaller. Overall, the four options (green columns, Figure 2) that combine *linear* and *purequadratic* types require less than 1.5 min; modeling using ‘*linear* or *purequadratic* (primary) + *interactions* or *quadratic* (subsequent)’ (blue columns) requires about 30 min. Interestingly, a much bigger group of regressor candidates in a primary regression step that uses *interactions* or *quadratic* does not provide a significant improvement in prediction performance (Table S6). Among the 16 options, the *LP*-model achieves the highest adj. R^{2} (≈0.813), followed by the *IP*-model (≈0.807) and the *QP*-model (≈0.807). The best results based on MAPE were the *QQ*-model (≈13.08%), followed by the *II*-model (≈13.11%), the *QI*-model (≈13.21%), and the *LP*-model (≈13.47%). Differences among computation times were more substantial, ranging from 320 min for the *QQ*-model to less than 1 min for the *LP*-model. Based on the computation time and prediction performance, it seems that completing the primary regression step with *linear* or *purequadratic* types could be a better choice. Exceptions may occur when the soft-sensor modeling starts from a small number of regressors, so it may be promising to use *interactions* or *quadratic*.

The *LP*-model, which provides the best predictions for future flows, involves three iterations (Table S7). The subset of important regressors includes 22 individual regressors and ten quadratic regressors. Relative to the *LL*-model, the number of variables in the *LP*-model is reduced, and the retention level is increased from 93% to 96%. In the list of important regressors, the top five items are *PPT1(t**+**1)*, *Q(t) ^{2}*,

*PPT1(t)*,

*SD(t*

*+*

*1)*, and

*SD(t*

*−*

*4)*. The regressors and their sequence are different from the

*LL*-model, but meteorological variables and

*PPT1*are still the most important regressors. Overall, the

*LP*-model provides better predictions of the next day's flow (Table S8) for the training dataset (adj. R

^{2}≈ 0.798; MAPE ≈ 14.0%) and the testing dataset (0.813; 13.5%) (Table S9), probably because the model includes information contributed by non-linear terms, such as

*Q(t)*and

^{2}*PPT1(t*

*+*

*1)*(the sixth most important regressor).

^{2}Measured flowrates and predicted values based on the testing dataset (Figure 3) are in good agreement, and storm events (flows >3 m^{3}/s account for 10% of the testing data) are well anticipated. Several of the peak flows, however, are substantially underestimated, indicating there is a room for improvement. For example, the highest residual value is observed for an extreme high-flow day (≈4.8 m^{3}/s, January 23, 2010) (Figure S6). This residual might be reduced by including TARP flow data; TARP flow played an important role in the Calumet-prediction model (Zhu & Anderson 2016). Interestingly, in contrast to these high residual values associated with high flowrates, the three highest MAPE (130, 125, and 120%) occur on the days with medium-low flowrates (1.76, 0.75, and 1.62 m^{3}/s, respectively).

### Effect of *p*-values

Results from the *LL*-model can be obtained in a very short computation time (<250 seconds) (Figure 4(a)). In general, modeling using larger *p*-values requires longer computation time. This result is not surprising because larger *p*-values mean there will be more regressors included in the initial iteration. Ultimately, more relevant regressors may be found in the final model, which leads to a longer computation time (Table S10). For example, *p*-value option 2 requires about 30 s with 28 final regressors, whereas option 14 needs about 140 s and 44 final regressors are included. Similarly, compared to option 14, more regressors (=84) are retained in option 16, which requires more than four minutes. It is worth noting that more regressors does not guarantee better prediction performance. In many cases, models with fewer regressors, such as options 1, 2, 3, 4, 9, and 13, are more likely to achieve slightly better results (Table S10). There are at least two reasons for this effect: one is that the number of regressors affects the adj. R^{2} value and the other is that with fewer retained data there is less valuable information (Figure 4(b)). Apparently, *p*-value options that include the default values (*p _{E}* = 0.05,

*p*= 0.10) can retain more information (>93% of data). It is also worth noting that the retention level is not necessarily proportional to the computation time. Examples include options 11 and 12, which have relatively low regressor retention but moderate computation time, because the number of iterations has a substantial effect on the time (Table S10). Compared to the

_{R}*LL*-model, the

*LP*-model exhibits a similar pattern for the effect of different

*p*-value options on computation time, number of regressors, and performance. The

*LP*-model, however, takes more time and yields better prediction performance (Table S11). Most of the methods investigated provide results within 7 min, but options 15 and 16 require much longer times (>800 s). Approaches that include the

*purequadratic*method lead to models that include both individual regressors and quadratic terms. For this reason, the number of regressors based on the

*LP*-model is bigger (up to 157) relative to the

*LL*-model. A model that includes a large number of regressors based on the training dataset can suffer from overfitting, leading to poor predictions when applied to the testing dataset. Nevertheless, there is no overfitting issue with the

*LP*-model (Table S11). Probably because the

*LP*-model can capture nonlinear information, in general, better prediction results are observed. Specifically, the highest adj. R

^{2}(≈ 0.813) and the lowest MAPE (≈13.5%) is obtained in option 1. Overall, models based on lower

*p*-values and the

*LP*-model appear to be more efficient and effective, in terms of computation time and prediction performance.

### Effect of dataset size

Computation time increases as the number of initial regressors increases, and that relationship can be described in a linear form for *LL*- and *LP*-models (Figure S7(a)). When taking into consideration all computed regressors, it appears there is no significant difference in the relationship between computation time and the number of regressors between *LL*- and *LP*-models (Figure S7(b)). The effect of the number of observations on the computation time usually follows a linear relationship (Figure S8, Table S14). Quadrupling the number of observations (from two to eight years) only results in a doubling of the computation time (from about 20 to 40 s). In general, a prediction based on a smaller number of regressors and observations can be advantageous, especially when computation time is a concern. On the other hand, a more detailed investigation may be required if prediction performance is a priority.

### Comparison of different soft-sensor approaches

Among the seven soft-sensor approaches, ISMLR (*LP*) + ANN achieves the best performance (adj. R^{2} ≈ 0.834; MAPE ≈ 13.3%) for predicting *Q(t**+**1)* based on the testing dataset, followed by ISMLR (*LP*) (0.813; 13.5%) and ISMLR (*LL*) (0.784; 14.0%). Higher performance in these methods is probably because ISMLR leads to a greater reduction in dimensions, and retention of more useful information (Figure 5; Table S15). Low performance in SVR and ANN is probably caused by an overfitting issue, because these machine learning approaches strongly depend on the input information. As a result, these methods can perform very well with the training dataset (adj. R^{2}_{SVR} ≈ 0.866; adj. R^{2}_{ANN} ≈ 0.810), but not as well based on the testing dataset (0.629; 0.670). PCR was able to identify 64 major principal components from 126 regressors and the model was able to explain about 95% of the variance with adj. R^{2} and MAPE results of 0.657 and 18.2%, respectively, based on the testing dataset. The lowest MSE (≈0.194) in PLS modeling was found when the number of principal components was 10, which helps the PLS model to achieve an adj. R^{2} of 0.702 and MAPE of 17.5%. Surprisingly, these two dimension-reduction methods also suffer from the overfitting problem (Table S15). Similarly, PCA + ANN does a good job based on the training dataset (adj. R^{2} ≈ 0.799; MAPE ≈ 15.6%), but provides much lower performance based on the testing dataset (adj. R^{2} ≈ 0.650; MAPE ≈ 18.3%). The ISMLR-based hybrid methods take advantage of the ISMLR ability to initially reduce the number of regressors and manage missing data, so the subsequent machine learning approaches can reduce or avoid the overfitting problem, and obtain better predictions based on the testing dataset. Specifically, compared to the SVR model that has a huge difference (0.799 vs. 0.650) in adj. R^{2} between training and testing, the ISMLR + SVR model (more details about SVR hyperparameters used in SVR only and ISMLR + SVR models can be found in section 9 of the supporting information) significantly reduces the difference and improves the performance based on the testing dataset (0.790 vs. 0.724). More importantly, to search the best combination of hyperparameters, the SVR model required more than 61 h to complete the search using an 18-core desktop. In contrast, the hybrid model significantly reduced the computation time of the search to less than 11 h. Compared to SVR models that have extremely high MAPE values (>40%), the ISMLR + ANN model exhibits excellent results in both adj. R^{2} and MAPE. In addition to the overall performance, the ISMLR + ANN model can provide better results when there are relatively high flowrates. For example, ISMLR (*LL*) underestimates many high flowrate data (>5 m^{3}/s, average residual ≈1.35 m^{3}/s), but the hybrid method can reduce many of these residual values (≈1.01 m^{3}/s) (Figure S10). Therefore, development of hybrid soft sensors using ISMLR and other more advanced algorithms (such as ANN or deep learnings) is promising and worth pursuing.

## CONCLUSION

We applied an ISMLR package, as a simple, effective, and flexible tool for a soft-sensor application to predict the next day's influent flow at the Kirie WRP. The ISMLR application helped to simplify the complicated modeling processes, providing customization choices for regression type and the *p*-values used to identify important regressors. The default *LL*-model performed well; modified models (such as the *LP*-model) can achieve even better predictions. Not surprisingly, meteorological regressors, such as *Q(t)*, *PPT1(t)*, and *PPT1(t**+**1)*, provided crucial information to help to predict the next day's flowrate with both the *LL-* and *LP*-models. *Interactions* or *quadratic* methods in the primary regression step can require substantial computation times, but these methods do not significantly improve prediction performance. Also, lower *p*-values reduce computation time and lead to higher adj. R^{2} and lower MAPE values, based on both *LL*- and *LP*-models. Investigations of dataset size suggest that the performance does not improve as the number of regressors or observations increases, however, computation time will increase. Overall, simpler regression models and low *p*-values provide an advantage when treating a big dataset. Finally, use of the ISMLR package as part of a hybrid soft sensor can serve multiple purposes, including missing data management, overfitting prevention, and higher prediction performance. In the future, this information will help us to design a flow prediction tool with an adaptive learning capability. Future work will include integration with deep learning approaches and hyperparameter optimization.

## SUPPLEMENTARY DATA

Supplementary data related to this article can be found with the online version of this paper, at http://dx.doi.org/10.2166/wst.2019.309.