Sign-constrained linear regression for prediction of microbe concentration based on water quality datasets

This study presents a novel methodology for estimating the concentration of environmental pollutants in water, such as pathogens, based on environmental parameters. The scientific uniqueness of this study is the prevention of excess conformity in the model fitting by applying domain knowledge, which is the accumulated scientific knowledge regarding the correlations between response and explanatory variables. Sign constraints were used to express domain knowledge, and the effect of the sign constraints on the prediction performance using censored datasets was investigated. As a result, we confirmed that sign constraints made prediction more accurate compared to conventional sign-free approaches. The most remarkable technical contribution of this study is the finding that the sign constraints can be incorporated in the estimation of the correlation coefficient in Tobit analysis. We developed effective and numerically stable algorithms for fitting a model to datasets under the sign constraints. This novel algorithm is applicable to a wide variety of problems in predicting pollutant contamination levels, including the pathogen concentrations in water.


INTRODUCTION
Water safety plans (WSP) and sanitation safety plans (SSP) are international planning schemes for water and wastewater developments (Goodwin et al. ). Hazard analysis and critical control point (HACCP) is a basic concept of these plans, in which some operational parameters are identified as critical control points (CCPs), and critical limit values are monitored for ensuring safety in water usage (Garner et al. ). A possible option for monitoring microbial risks in water usage is the use of on-line sensors for pathogens. Sensitive sensors for pathogens in water have been proposed (Kitajima et al. ), but they are still too expensive to use in daily operation, and the quantification limit is still not low enough for applying to pathogens in environmental water.
This study focuses on an alternative approach for monitoring microbial risks, which is the employment of other water quality parameters that have been proved to have a statistically significant relationship with the concentration of pathogens (Cho et al. ; Jurzik et al. ). To exploit other water quality parameters, the prediction model must be fitted in advance to a training sample by regression analysis. An obstacle in the model fitting stage is that pathogen concentration in water is frequently found to be below the quantification limit of an analytical method (Kato et al. ). Such data are referred to as non-detects in this paper.
To cope with this censoring problem in regression analysis, several approaches have been attempted. One approach is substitution of non-detects (data under a quantification limit) with zero or a value of quantification limit (Antweiler ). However, value substitution is not always appropriate in a statistical treatment of a dataset because it can distort the statistical estimation of parameters (Helsel , ; Huynh et al. , ). Another approach is Tobit analysis (Amemiya ) that employs a probabilistic model. This approach suffers from overfitting to samples for training when the sample sizes are small.
If abundant training data were available, combining multiple independent variables could boost the prediction performance of regression models (Harwood et al. ). When the correlations between pathogen concentration and explanatory variables are weaker, more explanatory variables are needed to obtain high prediction accuracy. However, too many independent variables would result in excess conformity if the training sample size were small. The scientific uniqueness of this study is to alleviate the excess conformity by applying domain knowledge, which is the accumulated scientific knowledge regarding the correlations between pathogen concentration and the other variables. The present study investigated whether the performance of regression analyses using censored datasets is increased by the application of domain knowledge in the form of sign constraints. The sign constraints fix the signs (non-negative or non-positive) of the regression coefficients in regression analysis when the signs have been shown to be statistically significant in previous studies. For example, we expect a positive correlation between indicator microorganisms (such as coliforms) and pathogenic bacteria (such as enterohemorrhagic Escherichia coli) in water, but the sign of the sample correlation may be reversed when the sample size is too small.
The main purpose of this study was to evaluate the effect of the sign constraints on the performance of regression analyses using censored datasets. In our simulation, water quality data acquired from a watershed were used as explanatory variables, and the common logarithmic value of E. coli concentration was used as an alternative response variable to the pathogen concentration. E. coli data were used for this simulation because E. coli can be measured with little censoring, so the data can easily be used to generate datasets with various given quantification limits, which reflect different levels of censoring. Six left-censored datasets with different quantification limit values were prepared for investigating the power of the sign constraints.

Water quality parameters
E. coli concentration, water temperature, pH, electrical conductivity, dissolved oxygen, suspended solids, biological oxygen demand, total nitrogen, and total phosphorus in water samples were measured according to Standard Methods for the Examination of Water and Wastewater (APHA ). The flow rate of the river water was also measured at each sampling event. We acquired 96 observations of these water quality parameters. In this study, the common logarithmic value of E. coli concentration was used as an alternative for the pathogen concentration. Six left-censored datasets were generated by manually setting different quantification limit values (1.5, 2.0, 2.5, 3.0, 3.5, and 4.0-log MPN/100 mL), in which the E. coli concentration values less than the quantification limit were regarded as non-detects. These left-censored datasets were used as response variables in the following regression analysis.
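The generation of the six left-censored datasets can be illustrated with a minimal Python sketch. The function names and the example values are hypothetical and are not taken from the study's data; only the six quantification limits come from the text.

```python
# Quantification limits listed in the text (log10 MPN/100 mL).
QUANT_LIMITS = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]


def left_censor(log_conc, quant_limit):
    """Replace values below the quantification limit with None (non-detect)."""
    return [y if y >= quant_limit else None for y in log_conc]


def censoring_rate(censored):
    """Fraction of observations that are non-detects."""
    return sum(v is None for v in censored) / len(censored)


# Hypothetical log10 E. coli concentrations, for illustration only.
example = [0.89, 2.1, 3.2, 3.9, 5.38]

# One left-censored copy of the response variable per quantification limit.
datasets = {u: left_censor(example, u) for u in QUANT_LIMITS}
```

A higher quantification limit leaves fewer detected values, which is how the different levels of censoring arise from a single fully observed series.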

Sign-constrained regression after deletion/substitution
Linear regression analysis is a task for determining the value of d regression coefficients $w_1, \dots, w_d \in \mathbb{R}$ that approximate the response variable by a linear function of the explanatory variables:

$$ y_i \approx \langle w, x_i \rangle, \quad i = 1, \dots, n, $$

where $(x_1, y_1), \dots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ are n observations in a given dataset. Typically, in the water quality engineering field, the signs of the true correlations of some explanatory variables with the response variable are known in advance. Requiring the corresponding coefficients to be non-negative or non-positive defines the feasible region. The sign-constrained least square estimation is a task to find the regression coefficients that minimize the mean square error in the feasible region. In other words, the sign-constrained least square problem can be expressed as:

$$ \min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \langle w, x_i \rangle \right)^2 \quad \text{subject to } w_h \ge 0 \text{ for each non-negatively constrained } h \text{ and } w_h \le 0 \text{ for each non-positively constrained } h. $$

When some E. coli concentration values in a given dataset are not detected due to a quantification limit, the mean square error cannot be assessed. Simple solutions to this issue are substitution or deletion of non-detects before the sign-constrained least square estimation. In the substitution method, a constant value (the quantification limit value or half the quantification limit value) was substituted into a non-detect. Then, the estimation task was reduced to the sign-constrained linear regression problem with dataset size n. The deletion method deletes non-detects to obtain the sign-constrained linear regression problem with dataset size $n_v$, where $n_v$ is the number of detected E. coli concentration values.
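The substitution and deletion methods followed by sign-constrained least squares can be sketched in Python as below. This is an illustration under stated assumptions, not the authors' implementation: the solver is a plain cyclic coordinate descent chosen for brevity, the encoding of the constraints as `signs` (+1, -1, or 0 for unconstrained) is hypothetical, and no intercept is handled unless a column of ones is included in `X`.

```python
import numpy as np


def sign_constrained_lstsq(X, y, signs, n_sweeps=500):
    """Minimise the mean square error subject to sign constraints.

    signs[h] > 0 forces w[h] >= 0, signs[h] < 0 forces w[h] <= 0,
    and signs[h] == 0 leaves w[h] unconstrained (assumed encoding).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    d = X.shape[1]
    w = np.zeros(d)
    residual = y.copy()            # equals y - X @ w for w = 0
    col_sq = (X ** 2).sum(axis=0)  # squared norm of each column
    for _ in range(n_sweeps):
        for h in range(d):
            if col_sq[h] == 0.0:
                continue
            # Exact minimiser along coordinate h, then clipped to the sign.
            w_new = w[h] + X[:, h] @ residual / col_sq[h]
            if signs[h] > 0:
                w_new = max(w_new, 0.0)
            elif signs[h] < 0:
                w_new = min(w_new, 0.0)
            residual -= X[:, h] * (w_new - w[h])
            w[h] = w_new
    return w


def prepare_censored(X, y, detected, u, method):
    """Apply deletion ("Del") or substitution ("DL", "DL/2") to non-detects."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    if method == "Del":
        return X[detected], y[detected]
    fill = u if method == "DL" else u / 2.0
    return X, np.where(detected, y, fill)
```

With all entries of `signs` set to 0 the routine reduces to ordinary least squares, which corresponds to the sign-free variants of the same methods.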

Sign-constrained Tobit
In the Tobit model (Amemiya ), E. coli detection failure is described with a mass probability, and the regression coef- Note that all E. coli concentration values y v i in the n v data pairs are over the quantification limit u. Denote n h (:¼ n À n v ) data pairs with When E. coli is not detected, the information available is that the true E.
coli concentration values do not exceed the quantification limit. For such a dataset, the log-likelihood function of the Tobit model is given by: where β is referred to as the precision parameter. In the classical Tobit model, the log-likelihood function L(w, β) is maximized without any constraints.
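For reference, the standard left-censored Tobit log-likelihood can be written in the notation of this section as follows. This is a reconstruction under assumptions, not the paper's own display: it takes $\beta$ to be the inverse variance of Gaussian noise, writes $\Phi$ for the standard normal cumulative distribution function, and uses $(x^v_i, y^v_i)$ for detected pairs and $x^h_{i'}$ for the explanatory variables of non-detects.

```latex
L(w, \beta)
  = \sum_{i=1}^{n_v} \log \mathcal{N}\!\left( y^{v}_{i} \,\middle|\, \langle w, x^{v}_{i} \rangle,\ \beta^{-1} \right)
  + \sum_{i'=1}^{n_h} \log \Phi\!\left( \sqrt{\beta}\, \bigl( u - \langle w, x^{h}_{i'} \rangle \bigr) \right)
```

The first sum is the usual Gaussian regression term for detected values, and the second sum is the probability mass assigned to the event that a concentration does not exceed the quantification limit $u$.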
In this study, the sign constraints were introduced for maximum likelihood estimation, that is, $L(w, \beta)$ is maximized subject to the sign constraints on $w$. To solve this maximization problem, a new expectation-maximization (EM) algorithm (MacKay ) was developed. In the EM algorithm, a posterior distribution $q_t(y^h_{i'})$ is introduced for each of the non-detects at each iteration, and the E-step and M-step are repeated alternately until convergence. Denoting by $(w^{(t)}, \beta^{(t)})$ the pair of the regression coefficient vector and the precision parameter at the t-th iteration, the posterior distribution of each non-detect is defined in the E-step from this pair. In the M-step, the value of $(w, \beta)$ is updated so that the Q-function is increased, where the Q-function is defined through $E_{q_t}$, the operator taking the expectation over the posterior distribution defined in the E-step of the t-th iteration. With this Q-function, $w$ and $\beta$ are updated in turn. The definition of the Q-function implies that the update rule of $w$ is reduced to the sign-constrained least square problem, which can be solved efficiently. More detailed information on the sign-constrained Tobit, including the explicit formulas of the EM steps, is given in Appendix E of the Supplementary materials (available online).
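The EM iteration can be sketched in Python as below. This is a hedged illustration, not the algorithm of Appendix E. It assumes Gaussian noise with precision `beta`, so that the posterior of each non-detect is a normal distribution truncated to values not exceeding `u`. The function names, the `signs` encoding, the coordinate-descent solver, the initialisation, and the fixed iteration counts are all hypothetical choices.

```python
import math
import numpy as np


def sign_constrained_lstsq(X, y, signs, n_sweeps=200):
    """Coordinate descent for least squares with sign constraints on w."""
    d = X.shape[1]
    w = np.zeros(d)
    residual = y.copy()
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for h in range(d):
            if col_sq[h] == 0.0:
                continue
            w_new = w[h] + X[:, h] @ residual / col_sq[h]
            if signs[h] > 0:
                w_new = max(w_new, 0.0)
            elif signs[h] < 0:
                w_new = min(w_new, 0.0)
            residual -= X[:, h] * (w_new - w[h])
            w[h] = w_new
    return w


def truncated_moments(mu, sigma, u):
    """Mean and variance of N(mu, sigma^2) truncated to (-inf, u]."""
    alpha = (u - mu) / sigma
    # Crude guard against underflow of the normal CDF (a simplification).
    cdf = max(0.5 * math.erfc(-alpha / math.sqrt(2.0)), 1e-300)
    lam = math.exp(-0.5 * alpha ** 2) / math.sqrt(2.0 * math.pi) / cdf
    return mu - sigma * lam, sigma ** 2 * (1.0 - alpha * lam - lam ** 2)


def sc_tobit_em(X_v, y_v, X_h, u, signs, n_iter=100):
    """Sign-constrained Tobit fit; X_v and X_h must be 2-D arrays."""
    X_v, y_v, X_h = (np.asarray(a, dtype=float) for a in (X_v, y_v, X_h))
    X_all = np.vstack([X_v, X_h])
    n = X_all.shape[0]
    # Initialise by substituting the quantification limit for non-detects.
    y_fill = np.concatenate([y_v, np.full(X_h.shape[0], u)])
    w = sign_constrained_lstsq(X_all, y_fill, signs)
    beta = 1.0 / max(float(np.mean((y_fill - X_all @ w) ** 2)), 1e-12)
    for _ in range(n_iter):
        sigma = 1.0 / math.sqrt(beta)
        # E-step: posterior mean and variance of each non-detect.
        moments = [truncated_moments(m, sigma, u) for m in X_h @ w]
        mean_h = np.array([m for m, _ in moments])
        var_h = np.array([v for _, v in moments])
        # M-step for w: sign-constrained least squares on the filled response.
        y_fill = np.concatenate([y_v, mean_h])
        w = sign_constrained_lstsq(X_all, y_fill, signs)
        # M-step for beta: inverse of the expected mean squared residual.
        sse = float(np.sum((y_fill - X_all @ w) ** 2) + var_h.sum())
        beta = n / max(sse, 1e-12)
    return w, beta
```

Because the posterior variance of a non-detect does not depend on $w$, the update of $w$ only needs the posterior means, which is why it collapses to a sign-constrained least square problem.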

Evaluation process
In order to evaluate the generalized prediction performance for unseen data, 96 data pairs were divided into training and evaluation datasets. The size of the training datasets, say n, was between 10 and 58. There were four approaches to deal with the censored dataset: Tobit analysis (Tobit), the substitution of non-detect with the quantification limit value (DL), the substitution of non-detect with half the quantification limit value (DL/2), and the deletion of non-detects (Del).
For each approach, the generalization performance of the conventional sign-free regression was compared with that of the sign-constrained regression. Sign-constrained Tobit, DL, DL/2, and Del were expressed as SC-Tobit, SC-DL, SC-DL/2, and SC-Del, respectively, whereas those without sign constraints were SF-Tobit, SF-DL, SF-DL/2, and SF-Del, respectively. In total, eight methods were examined.
In the evaluation, the regression coefficients in w were determined using a training dataset with each of the eight approaches (SC-Tobit, SF-Tobit, SC-DL, SF-DL, SC-DL/2, SF-DL/2, SC-Del, and SF-Del). Then, the root mean square deviation (RMSD) was calculated using an evaluation dataset as follows:

$$ \mathrm{RMSD} = \sqrt{ \frac{1}{n_{\mathrm{tst}}} \sum_{i=1}^{n_{\mathrm{tst}}} \left( y^{\mathrm{tst}}_i - \langle w, x^{\mathrm{tst}}_i \rangle \right)^2 } $$

where $(x^{\mathrm{tst}}_1, y^{\mathrm{tst}}_1), \dots, (x^{\mathrm{tst}}_{n_{\mathrm{tst}}}, y^{\mathrm{tst}}_{n_{\mathrm{tst}}}) \in \mathbb{R}^d \times \mathbb{R}$ are $n_{\mathrm{tst}}$ data pairs in the evaluation dataset. The size of every evaluation dataset is $n_{\mathrm{tst}} = 25$, and the points are chosen from the $(96 - n)$ data points not used for training. This process of training and evaluation was repeated 100 times for each approach, and the average and standard deviation of RMSD were calculated.
To make the evaluation process reproducible, a step-by-step description of the evaluation process is given as follows.
Two subsets with sizes n and $n_{\mathrm{tst}} = 25$ were chosen from the 96 data points to obtain the training and evaluation datasets, respectively. The intersection of the two subsets was empty. This step was repeated 100 times, and 100 different training/evaluation datasets were generated for each n ∈ {10, 13, 16, . . . , 58}. For each of the 100 different training/evaluation datasets and each of the eight methods, the regression coefficients were estimated with the training dataset and RMSD was evaluated with the evaluation dataset.
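The split-and-score loop described above can be sketched in Python as follows. The names `split_indices`, `evaluate`, and the callable `fit` are hypothetical; `fit` stands in for any of the eight methods and is assumed to take a training design matrix and response and return a coefficient vector. Censoring of the training response is assumed to happen inside `fit` and is not shown here.

```python
import numpy as np

# Training sizes n in {10, 13, 16, ..., 58}, as listed in the text.
TRAIN_SIZES = list(range(10, 59, 3))


def rmsd(w, X_tst, y_tst):
    """Root mean square deviation of linear predictions on evaluation data."""
    return float(np.sqrt(np.mean((y_tst - X_tst @ w) ** 2)))


def split_indices(n_total, n_train, n_test, rng):
    """Draw disjoint training and evaluation index sets."""
    perm = rng.permutation(n_total)
    return perm[:n_train], perm[n_train:n_train + n_test]


def evaluate(fit, X, y, n_train, n_test=25, n_repeats=100, seed=0):
    """Average and standard deviation of RMSD over random splits."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        train, test = split_indices(len(y), n_train, n_test, rng)
        w = fit(X[train], y[train])
        scores.append(rmsd(w, X[test], y[test]))
    return float(np.mean(scores)), float(np.std(scores))
```

As a usage example, ordinary least squares can be passed as a stand-in method with `fit=lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]`, and the loop can then be run once for each value in `TRAIN_SIZES`.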

RESULTS
Before examining the effects of the sign constraints, we give an overview of the dataset that was used. The minimum and maximum values of the common logarithm of E. coli concentration were 0.89 and 5.38, respectively, and the median was 3.20. To visualize the relationship between the response variable and each explanatory variable, non-parametric density estimation was performed using the Parzen window (Figure 1(a)).
Weak but statistically significant correlations were observed between the response variable and each explanatory variable except for pH− (pH values less than 7.0). The Pearson's correlation coefficients and p values in t-test with sample size 96, shown in Table 1, indicate that there were statistically significant positive correlations (p < 0.05) with water temperature, electric conductivity, suspended solids, biological oxygen demand, total nitrogen, and total phosphorus, but statistically significant negative correlations (p < 0.05) with pH+ (pH values higher than 7.0), dissolved oxygen, and flow rate.
Before reporting the effects of the sign constraints, we demonstrate what happens when the sample size is small. The positive correlations between the response variable and six explanatory variables (water temperature, electric conductivity, suspended solids, biological oxygen demand, total nitrogen, and total phosphorus) were statistically significant when all 96 data were used for the computation. However, these correlations would not be significant when the sample size was too small. Similarly, significant negative correlations with pH+, dissolved oxygen, and flow rate would not be detected when the sample size was too small. In order to confirm this intuition, a Pearson's sample correlation coefficient was computed using five randomly selected data out of 96 for each explanatory variable. This simulation was repeated 10,000 times, and 10,000 values of the sample correlation coefficient were obtained for each explanatory variable. Then, the histograms of the sample correlation coefficients were plotted (Figure 1(b)), and the percentages of positive and negative signs of the sample correlation coefficient were calculated (Table 2). As shown in Figure 1(b), the correlation coefficients for all explanatory variables were scattered in a broad interval, which is due to the small size of the sample. The strongest positive correlation was obtained with total nitrogen (Table 2). The flow rate had a significant negative correlation (Table 1).

The sign-free approaches were compared among Tobit, DL, DL/2, and Del (Figure 4(a)-4(f)). When the quantification limit was 3.0-log MPN/100 mL, the sign-free Tobit (square) performed best in terms of the regression accuracy if the training sample size was larger than 40 (Figure 4(d)).
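The small-sample experiment on the sign of the sample correlation (five points drawn out of 96, repeated 10,000 times) can be sketched in Python as below. The function name and the synthetic data in the usage note are hypothetical and do not reproduce the study's measurements.

```python
import numpy as np


def sign_frequencies(x, y, sample_size=5, n_trials=10_000, seed=0):
    """Sample correlations on random subsets and the share of each sign."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    r = np.empty(n_trials)
    for t in range(n_trials):
        # Draw a subset without replacement, as in the described simulation.
        idx = rng.choice(len(x), size=sample_size, replace=False)
        r[t] = np.corrcoef(x[idx], y[idx])[0, 1]
    return r, float(np.mean(r > 0)), float(np.mean(r < 0))
```

For a synthetic pair of variables with a weak positive correlation over 96 points, a noticeable share of the five-point subsets typically yields a negative sample correlation, which is the sign reversal that the sign constraints are meant to rule out.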
The sign-constraint approaches were also compared among Tobit, DL, DL/2, and Del (Figure 4(g)-4(l)). The performance of SC-DL/2 was the best when the quantification limit value was smaller than 2.5-log MPN/100 mL, but the difference among SC-DL, SC-DL/2, and SC-Del was almost negligible. When the training sample size was 21, the RMSD of SC-Tobit was larger than those of SC-DL, SC-DL/2, and SC-Del. The difference of RMSD among SC-Tobit, SC-DL, SC-DL/2, and SC-Del was very small when the training sample size was large. SC-Tobit performed best when the training sample size was large and the quantification limit value was higher than 3.0-log MPN/100 mL, whereas SC-DL/2 was the best when the quantification limit value was less than 2.5-log MPN/100 mL. These results indicate that the best approach (the smallest RMSD) is determined by the sample size when the quantification limit value is larger than 3.0-log MPN/100 mL.

DISCUSSION
When there is a significant correlation between pathogen concentration in water and explanatory variables, predic[…] The sign constraints play a role in removing the explanatory variables in a given dataset that violate the domain knowledge.

Table 1 (excerpt)
Explanatory variable        Pearson's correlation coefficient    p value
Biological oxygen demand    0.381                                8.14 × 10^−6
Total nitrogen              0.691                                9.17 × 10^−27
a pH+: max(0, pH − 7.0). b pH−: max(0, 7.0 − pH).

CONCLUSIONS
In this study, new approaches introducing sign constraints to express domain knowledge were attempted. Effective and numerically stable algorithms for fitting a model to left-censored datasets under the sign constraints were developed, which should be applicable to a wide variety of environmental prediction problems, including the real-time monitoring of pathogen concentration in water. It was confirmed that the prediction performance of the regression was improved by the employment of sign-constraint approaches compared to conventional sign-free approaches. In particular, more significant improvements were observed when the training sample was small, implying that the sign-constraint techniques are a powerful option for practical analysts when they choose a statistical tool. Another contribution of this paper is a novel algorithm for fitting the Tobit model under sign constraints. The presented algorithm is an implementation of the EM algorithm for this fitting problem.