## Abstract

This study presents a novel methodology for estimating the concentration of environmental pollutants in water, such as pathogens, based on environmental parameters. The scientific uniqueness of this study is the prevention of excess conformity in the model fitting by applying domain knowledge, which is the accumulated scientific knowledge regarding the correlations between response and explanatory variables. Sign constraints were used to express domain knowledge, and the effect of the sign constraints on the prediction performance using censored datasets was investigated. As a result, we confirmed that sign constraints made prediction more accurate compared to conventional sign-free approaches. The most remarkable technical contribution of this study is the finding that the sign constraints can be incorporated in the estimation of the correlation coefficient in Tobit analysis. We developed effective and numerically stable algorithms for fitting a model to datasets under the sign constraints. This novel algorithm is applicable to a wide variety of the prediction of pollutant contamination level, including the pathogen concentrations in water.

*This article has been made Open Access thanks to the generous support of a global network of libraries as part of the Knowledge Unlatched Select initiative.*

## INTRODUCTION

Water safety plans (WSP) and sanitation safety plans (SSP) are international planning schemes for water and wastewater developments (Goodwin *et al.* 2015). Hazard analysis and critical control point (HACCP) is a basic concept of these plans, in which some operational parameters are identified as critical control points (CCPs), and critical limit values are monitored for ensuring safety in water usage (Garner *et al.* 2016). A possible option for monitoring microbial risks in water usage is the use of on-line sensors for pathogens. Sensitive sensors for pathogens in water have been proposed (Kitajima *et al.* 2016), but they are still too expensive to use in daily operation, and the quantification limit is still not low enough for applying to pathogens in environmental water.

This study focuses on an alternative approach for monitoring microbial risks, which is the employment of other water quality parameters that have been proved to have a statistically significant relationship with the concentration of pathogens (Cho *et al.* 2010; Jurzik *et al.* 2010). To exploit other water quality parameters, the prediction model must be fitted in advance to a training sample by regression analysis. An obstacle in the model fitting stage is that pathogen concentration in water is frequently found to be below the quantification limit of an analytical method (Kato *et al.* 2016). Such data are referred to as non-detects in this paper.

To cope with this censoring problem in regression analysis, several approaches have been attempted. One approach is substitution of non-detects (data under a quantification limit) with zero or a value of quantification limit (Antweiler 2015). However, value substitution is not always appropriate in a statistical treatment of a dataset because it can distort the statistical estimation of parameters (Helsel 2006, 2010; Huynh *et al.* 2014, 2016). Another approach is Tobit analysis (Amemiya 1984) that employs a probabilistic model. This approach suffers from overfitting to samples for training when the sample sizes are small.

If abundant training data were available, combining more multiple independent variables could boost the prediction performance of regression models (Harwood *et al.* 2005). When the correlations between pathogen concentration and explanatory variables are weaker, more explanatory variables are needed to obtain high prediction accuracy. However, too many independent variables would result in excess conformity if the training sample size were small. The scientific uniqueness of this study is to alleviate the excess conformity by applying domain knowledge, which is the accumulated scientific knowledge regarding the correlations between pathogen concentration and the other variables. The present study investigated whether the performance of regression analyses using censored datasets is increased by the application of domain knowledge in the form of *sign constraints*. The sign constraints are to fix signs (non-negative or non-positive) of the regression coefficients in regression analysis, when the signs have been shown to be statistically significant in previous studies. For example, we expect a positive correlation between indicator microorganisms (such as coliforms) and pathogenic bacteria (such as enterohemorrhagic *Escherichia coli*) in water, but the sign of the sample correlation may be reversed when the sample size is too small.

The main purpose of this study was to evaluate the effect of the sign constraints on the performance of regression analyses using censored datasets. In our simulation, water quality data acquired from a watershed were used as explanatory variables, and the common logarithmic value of *E. coli* concentration was used as an alternative response variable to the pathogen concentration. The reason why *E. coli* data were used for this simulation is that *E. coli* can be measured with little censoring that could be easily used to generate datasets with various given quantification limits, which reflects different level of censoring. Six left-censored datasets with different qualification limit values were prepared for investigating the power of the sign constraints.

## METHODS

### Water quality parameters

River water sampling was conducted about twice a month from January 2012 to April 2013 in four rivers: Toyohira River (Site A), Nopporo River (Site B), Atsubetsu River (Site C), and Motsukisamu River (Site D) in Sapporo City, Japan (Figure SD.1 in the Supplementary materials, available with the online version of this paper). *E. coli* concentration, water temperature, pH, electrical conductivity, dissolved oxygen, suspended solids, biological oxygen demand, total nitrogen, and total phosphorus in water samples were measured according to *Standard Methods for the Examination of Water and Wastewater* (APHA 2005). The flow rate of the river water was also measured at each sampling event. We acquired 96 observations of these water quality parameters (Table SD.1 in the Supplementary materials, available online).

In this study, the common logarithmic value of *E. coli* concentration was used as an alternative for the pathogen concentration. Six left-censored datasets were generated by manually setting different quantification limit values (1.5, 2.0, 2.5, 3.0, 3.5, and 4.0-log MPN/100 mL), in which the *E. coli* concentration values less than the quantification limit were regarded as non-detects. These left-censored datasets were used as response variables in the following regression analysis.

### Sign-constrained regression after deletion/substitution

*d*regression coefficients that approximate the linear relationship between

*d*water quality parameters, and an

*E. coli*concentration , where and Regression coefficients in are usually determined by minimizing the mean square error where are

*n*observations in a given dataset. Typically, in the water quality engineering field, the signs of the true correlations of some explanatory variables to a pathogen concentration can be expected in advance. However, in the case of a small dataset or weak correlations, the signs of the corresponding regression coefficients are often opposite to those of the true correlations. To alleviate this phenomenon, sign constraints were introduced in this study. Let be a constant vector where

*h*-th entry is if the

*h*-th explanatory variable is known to have a positive correlation to the

*E. coli*concentration; if the negativity of the correlation is known; 0 otherwise. The feasible region is then expressed as:

When some *E. coli* concentration values in a given dataset are not detected due to a quantification limit, the mean square error cannot be assessed. Simple solutions to this issue are substitution or deletion of non-detects before the sign-constrained least square estimation. In the substitution method, a constant value (the quantification limit value or half the quantification limit value) was substituted into a non-detect. Then, the estimation task was reduced to the sign-constrained linear regression problem with dataset size *n*. The deletion method deletes non-detects to obtain the sign-constrained linear regression problem with dataset size where is the number of detected *E. coli* concentration.

### Sign-constrained Tobit

*E. coli*detection failure is described with a mass probability, and the regression coefficients are determined by maximum likelihood estimation. Let us denote the data pairs of water quality parameter vectors and

*E. coli*concentration values by , …,. Note that all

*E. coli*concentration values in the data pairs are over the quantification limit

*u*. Denote data pairs with non-detects by , … ,. When

*E. coli*is not detected, the information available is that the true

*E. coli*concentration values do not exceed the quantification limit. For such a dataset, the log-likelihood function of the Tobit model is given by: where is referred to as the precision parameter. In the classical Tobit model, the log-likelihood function is maximized without any constraints.

*t*-th iteration, the posterior distribution is defined in the E-step as: where

The definition of the Q-function implies that the update rule of is reduced to the sign-constrained least square problem, which can be solved efficiently. More detailed information of sign-constrained Tobit, including EM steps is indicated in Appendix E of the Supplementary materials (available online).

### Evaluation process

In order to evaluate the generalized prediction performance for unseen data, 96 data pairs were divided into training and evaluation datasets. The size of the training datasets, say was between 10 and 58. There were four approaches to deal with the censored dataset: Tobit analysis (Tobit), the substitution of non-detect with the quantification limit value (DL), the substitution of non-detect with half the quantification limit value (DL/2), and the deletion of non-detects (Del). For each approach, the generalization performance of the conventional sign-free regression was compared with that of the sign-constrained regression. Sign-constrained Tobit, DL, DL/2, and Del were expressed as SC-Tobit, SC-DL, SC-DL/2, and SC-Del, respectively, whereas those without sign constraints were SF-Tobit, SF-DL, SF-DL/2, and SF-Del, respectively. In total, eight methods were examined.

To make the evaluation process reproducible, a step-by-step description of the evaluation process is given as follows. Two subsets with size *n* and were chosen from 96 data points, to obtain the training and evaluation datasets, respectively. The intersection of the two subsets were empty. This step is repeated 100 times, and 100 different training/evaluation datasets were generated for each . For each of 100 different training/evaluation datasets and each of eight methods, the regression coefficients were estimated with the training dataset and RMSD was evaluated with the evaluation dataset.

## RESULTS

Before examining the effects of the sign constraints, the dataset that was used for examining the sign constraints was overviewed. The minimum and maximum values of the common logarithm of *E. coli* concentration were 0.89 and 5.38, respectively, and the median was 3.20. To visualize the relationship between the response variable and each explanatory variable, the non-parametric density estimation was performed using the Parzen window (Figure 1(a)). Weak but statistically significant correlations were observed between the response variable and each explanatory variable except for pH (pH values less than 7.0). The Pearson's correlation coefficients and *p* values in t-test with sample size 96, shown in Table 1, indicate that there were statistically significant positive correlations (*p* < 0.05) with water temperature, electric conductivity, suspended solids, biological oxygen demand, total nitrogen, and total phosphorus, but statistically significant negative correlations (*p* < 0.05) with pH_{+} (pH values higher than 7.0), dissolved oxygen, and flow rate.

Explanatory variable | Correlation coefficient | p value |
---|---|---|

Flow rate | −0.287 | 5.23 × 10^{−5} |

Dissolved oxygen | −0.683 | 5.67 × 10^{−26} |

pH^{a}_{+} | −0.293 | 3.88 × 10^{−5} |

pH^{b}_{−} | −0.019 | 4.02 × 10^{−1} |

Water temperature | 0.548 | 1.41 × 10^{−15} |

Electrical conductivity | 0.562 | 1.89 × 10^{−16} |

Suspended solids | 0.369 | 2.17 × 10^{−7} |

Biological oxygen demand | 0.381 | 8.14 × 10^{−6} |

Total nitrogen | 0.691 | 9.17 × 10^{−27} |

Explanatory variable | Correlation coefficient | p value |
---|---|---|

Flow rate | −0.287 | 5.23 × 10^{−5} |

Dissolved oxygen | −0.683 | 5.67 × 10^{−26} |

pH^{a}_{+} | −0.293 | 3.88 × 10^{−5} |

pH^{b}_{−} | −0.019 | 4.02 × 10^{−1} |

Water temperature | 0.548 | 1.41 × 10^{−15} |

Electrical conductivity | 0.562 | 1.89 × 10^{−16} |

Suspended solids | 0.369 | 2.17 × 10^{−7} |

Biological oxygen demand | 0.381 | 8.14 × 10^{−6} |

Total nitrogen | 0.691 | 9.17 × 10^{−27} |

^{a}pH_{+}: max (0, pH – 7.0).

^{b}pH_{−}: max (0, 7.0 – pH).

What happens when the sample size is small is demonstrated preliminary to reporting the effects of sign constraints. The positive correlations between the response variable and six explanatory variables (water temperature, electric conductivity, suspended solids, biological oxygen demand, total nitrogen, and total phosphorus) were statistically significant when all 96 data were used for the computation. However, these correlations would not be significant when the sample size was too small. Similarly, significant negative correlations with pH_{+}, dissolved oxygen, and flow rate would not be detected when the sample size was too small. In order to confirm this intuition, a Pearson's sample correlation coefficient was computed using five randomly selected data out of 96 for each explanatory variable. This simulation was repeated 10,000 times, and 10,000 values of the sample correlation coefficient were obtained for each explanatory variable. Then, the histograms of the sample correlation coefficients were plotted (Figure 1(b)), and the percentage values of positive and negative sign of the sample correlation coefficient were calculated (Table 2). As shown in Figure 1(b), the correlation coefficients for all explanatory variables were scattered in a broad interval, which is due to the small size of the sample. The strongest positive correlation was obtained with total nitrogen (Table 1), but 3.9% of the sample correlation coefficient values were negative (Table 2). The strongest negative correlation was obtained with dissolved oxygen (Table 1), but 5.5% of the sample correlation coefficient values were positive (Table 2). Suspended solids had a significant positive correlation (Table 1), but 22.0% of the sample correlation coefficient values were negative (Table 2). The flow rate had a significant negative correlation (Table 1), but 28.0% of the sample correlation coefficient values were positive (Table 2). These results clearly indicate that the sign reversal between sample and population correlation coefficient values occurs when the sample size is small, which exacerbates the performance of conventional regression analysis based on the standard least square estimation.

Explanatory variable | Positive correlation % | Negative correlation % |
---|---|---|

Flow rate | 28.0 | 72.0 |

Dissolved oxygen | 5.4 | 94.6 |

pH^{a}_{+} | 25.0 | 73.5 |

pH^{b}_{−} | 51.2 | 36.6 |

Water temperature | 88.4 | 11.6 |

Electrical conductivity | 90.3 | 9.7 |

Suspended solids | 78.1 | 21.9 |

Biological oxygen demand | 83.6 | 16.4 |

Total nitrogen | 96.1 | 3.9 |

Explanatory variable | Positive correlation % | Negative correlation % |
---|---|---|

Flow rate | 28.0 | 72.0 |

Dissolved oxygen | 5.4 | 94.6 |

pH^{a}_{+} | 25.0 | 73.5 |

pH^{b}_{−} | 51.2 | 36.6 |

Water temperature | 88.4 | 11.6 |

Electrical conductivity | 90.3 | 9.7 |

Suspended solids | 78.1 | 21.9 |

Biological oxygen demand | 83.6 | 16.4 |

Total nitrogen | 96.1 | 3.9 |

^{a}pH_{+}: max (0, pH – 7.0).

^{b}pH_{−}: max (0, 7.0 – pH).

The effects of sign constraints in Tobit, DL, DL/2, and Del were compared with sign-free approaches when the quantification limit value was 1.5, 2.0, or 2.5-log MPN/100 mL (Figure 2) and 3.0, 3.5, or 4.0-log MPN/100 mL (Figure 3). RMSD between sign-constraint and sign-free approaches and *p*-values at each number of training examples is indicated in Tables SA.1–SA.24 of the Supplementary materials (available with the online version of this paper). The Tobit model is one of the statistically sophisticated models that include a truncated probabilistic distribution, in which true values in non-detects are expressed using mass probabilities and estimated using maximum likelihood estimation in the marginal distributions. It was remarkable that the sign-constraint approaches (triangles) always gave smaller RMSD values than the sign-free approaches (squares), particularly when the training data size was small. This means that the accuracy of the regression is improved by the employment of sign-constraint approaches compared to conventional sign-free approaches.

The sign-free approaches were compared among Tobit, DL, DL/2, and Del (Figure 4(a)–4(f)). When the quantification limit was 3.0-log MPN/100 mL, the sign-free Tobit (square) performed best in terms of the regression accuracy if the training sample size was larger than 40 (Figure 4(d)). The sign-constraint approaches were also compared among Tobit, DL, DL/2, and Del (Figure 4(g)–4(l)). The performance of SC-DL/2 was the best when the quantification limit value was smaller than 2.5-log MPN/100 mL, but the difference among SC-DL, SC-DL/2, and SC-Del was almost negligible. When the training sample size was 21, the RMSD of SC-Tobit was larger than those of SC-DL, SC-DL/2, and SC-Del. The difference of RMSD among SC-Tobit, SC-DL, SC-DL/2, and SC-Del was very small when the training sample size was large. SC-Tobit performed best when the training sample size was large and the quantification limit value was higher than 3.0-log MPN/100 mL, whereas SC-DL/2 was the best when the quantification limit value was less than 2.5-log MPN/100 mL. These results indicate that the best approach (the smallest RMSD) is determined by the sample size when the quantification limit value is larger than 3.0-log MPN/100 mL.

## DISCUSSION

When there is a significant correlation between pathogen concentration in water and explanatory variables, prediction accuracy is improved by the linear combination of multiple explanatory variables, if the training sample size is large enough (Chatterjee & Price 1977). However, the sign of the sample correlation coefficient may become opposite when the training sample size is not large enough, as shown in Table 2, which results in performance deterioration of the predictor. In this study, we propose the following simple assignment on the sign of correlation coefficient: (1) non-negative regression coefficient () of an explanatory variable if it is common to have a positive correlation coefficient; and (2) non-positive regression coefficient () of an explanatory variable if it is common to have a negative correlation coefficient. The effect of the sign constraints was clear, as can be seen in Figures 2 and 3: the averages of RMSD values were always lower (the prediction performance is always higher) in the sign-constraint approaches compared with the sign-free approaches.

The sign constraints play a role in removing the explanatory variables in a given dataset that violates the domain knowledge by zeroing the corresponding regression coefficients. The box plots in Figures SB.1–SB.24 of the Supplementary materials (available with the online version of this paper) show the distributions of the regression coefficient values obtained in regression analysis. The coefficient values were distributed widely in the feasible region associated with the sign constraints for each explanatory variable. The coefficient values frequently vanished, especially when the training dataset size is smaller. Note that for a coefficient value that did not vanish, the estimated coefficient values would not change even if the corresponding sign constraints were excluded from the optimization problem for model fitting. On the other hand, the zero coefficients suggest that the existence of the sign constraints might change the estimated solution from the sign-free approach. The frequencies of trials of 100-time repeated experiments in which specified numbers of regression coefficients vanished were plotted in a stacked bar format in Figures SC.1–SC.24 of the Supplementary materials (available online). Each bar is a stack of three sections at most. The lowest sections indicated with ‘ < ’ are the frequencies with which RMSD was reduced by imposing the sign constraints; the middle sections indicated with ‘ = ’ are the frequencies with which no difference in RMSD between the sign-constrained and the sign-free approaches was observed; and the highest sections indicated with ‘ > ’ are the frequencies with which RMSD was increased with the sign constraints. For Tobit methods, RMSD was often reduced when the training sample size was small and the quantification limit was low. For the other three approaches, the sign constraints tended to decrease RMSD. The averaged numbers of vanished coefficients against the training dataset size are plotted in Figure SC.25 of the Supplementary materials and these averaged numbers and the standard deviations are shown in Tables SC.1–SC.4 of the Supplementary materials (available online). For Tobit, DL, and DL/2 methods, more vanished coefficients were observed when the quantification limit is lower, reflecting the fact that the domain knowledge disagreed less frequently in training datasets containing more censored data. For Del method, higher quantification limit reduced the data fed to model fitting (recall that Del method uses only uncensored data), which produced more explanatory variables contradicting domain knowledge and thereby increased the number of zeroed coefficients. These results also suggest that the explanatory variables are discarded if the corresponding sign constraints are contradicted with a given sample, and the sign constraints are effective even with a small number of wrong constraints given.

The most remarkable technological contribution of this study is the finding that it is possible to incorporate the sign constraints in the estimation of the correlation coefficient in the EM algorithm for Tobit analysis (Amemiya 1984). The sign-constrained regression exists and the Tobit model exists, although no algorithm for the combination has been developed so far. The new discovery was that the M-step in the EM algorithm boiled down to a mathematical problem of non-negative least square estimation when the maximum likelihood estimation is conducted under sign-constraint. Algorithms for the non-negative least square estimation have been studied extensively and an efficient algorithm exists (Lawson & Hanson 1987). By introducing the efficient algorithm for non-negative least square estimation to the M-step of the EM algorithm, an effective and numerically stable algorithm for fitting a model to datasets under the non-negative constraint was obtained. The novel algorithm is applicable to environmental correlation issues, including the prediction of pathogen concentration in water.

The EM approach employed in this study was not a unique choice for finding the maximum likelihood estimator. Under an invertible variable transformation, the Hessian matrix of the negative log-likelihood function is guaranteed to be positive definite, which implies the global concavity (Amemiya 1984). This fact suggests another approach such as gradient-based methods for maximum likelihood estimation under sign constraints. An advantage of the EM algorithm is the ease of extensions to more complicated models and Bayesian approaches. If performing Bayesian inference instead of maximum likelihood estimation, the prior distribution is designed so that the prior probabilistic densities outside the feasible region of the sign constraints are zero, and the variational approximation (MacKay 2003) which is a natural extension of the EM algorithm may be needed to make the Bayesian algorithm tractable.

## CONCLUSIONS

In this study, new approaches introducing sign constraints to express such domain knowledge were attempted. Effective and numerically stable algorithms for fitting a model to left-censored datasets under the sign constraints were developed, which must be applicable to a wide variety of environmental prediction problems, including the real-time monitoring of pathogen concentration in water. It was confirmed that the prediction performance of the regression was improved by the employment of sign-constraint approaches compared to conventional sign-free approaches. In particular, more significant improvements were observed when the training sample is small, implying that the sign-constraint techniques are a powerful option for practical analysts when they choose a statistical tool. Another contribution of this paper is to present a novel algorithm for fitting of Tobit model under sign constraints. The presented algorithm was an implementation of the algorithm for the fitting problem.

## ACKNOWLEDGEMENTS

This study was supported by JSPS KAKENHI Grant Number JP15K00591. No conflict of interest is declared.