This study aims to identify important well characteristics associated with increased odds of bacterial contamination in the Wellington-Dufferin-Guelph public health unit of Southern Ontario. Identifying risk factors associated with bacterial contamination can aid in the mandate of public health units to promote the safety, and facilitate the testing, of drinking water systems to help minimize the risk of illness. Logistic regression models for adverse bacterial test results based on physical well characteristics were created. Models with the lowest Akaike Information Criterion values were examined for consistently identified characteristics. The odds of bacterial contamination in the Wellington-Dufferin-Guelph region are most associated with the age of the well, the season of testing, having a treatment system on the well, and the presence of potential point contamination sources within 50 feet (15.24 m) of the well. While this information can support the design of targeted public health education campaigns, the current model leaves room for improvement, as the predictive abilities of the models based solely on well characteristic data are limited.

  • Increased odds of bacterial contamination in private wells were associated with the age of the well, lack of treatment system on the well, and nearby point contamination source.

  • Season of testing affected the odds of bacterial contamination, with increased odds of contamination in warmer seasons.

  • Public health units can utilize these results to inform well water testing campaigns, to reduce the occurrence of waterborne illness.

Approximately 4 million Canadians, including 1.5 million living in Ontario, use private wells as their primary drinking water source, most commonly in rural areas or those areas not serviced by municipal water systems (Hynds et al. 2014; Ugas et al. 2019; Latchmore et al. 2020). While maintenance of municipal water supplies is the responsibility of individual municipalities, private well owners are responsible for maintaining and testing their own wells for contamination (Krolik et al. 2013; Ugas et al. 2019; Latchmore et al. 2020). Contamination can occur by infiltration of chemicals and hazardous microorganisms, including bacteria, viruses, and parasites, into the drinking water (Simpson 2004). Pathogenic bacteria, such as Campylobacter jejuni, Salmonella, Shigella, and Escherichia coli (E. coli), can lead to enteric diseases such as gastroenteritis (Simpson 2004; Krolik et al. 2013; Hynds et al. 2014). Previous studies have found that private well owners and people drinking from small water systems are estimated to be at a higher risk of gastric illness than those serviced by municipal water supplies (Uhlmann et al. 2009), and these drinking water sources have been identified as potential sources of gastrointestinal illness (Murphy et al. 2016).

The risk of contamination of well water has been linked to numerous factors. Well characteristics, including type, age, depth, casing material, and height of the well, have all been identified as risk factors (Simpson 2004; Ugas et al. 2019; Latchmore et al. 2020). Improperly maintained wells are more likely to have pathways for bacteria to enter the drinking water supply, and point contamination sources such as on-site septic systems and run-off from agricultural practices like livestock grazing and manure applications can introduce bacteria into surface water and subsequent drinking water supplies (Simpson 2004; Hynds et al. 2014). Underlying hydrogeological factors, including elevation, soil depth and type, and bedrock depth and type, may also influence the movement of bacteria into the water supply, while seasonal effects such as increased precipitation can exacerbate bacterial contamination (Simpson 2004; Krolik et al. 2013; O'Dwyer et al. 2018; Invik et al. 2019; Latchmore et al. 2020). Even though there are complex factors that can increase the risk of bacteria entering drinking water supplies, well owners can protect their drinking water through frequent testing of well water to allow for timely identification of any contamination, as well as by using a water treatment system on their drinking water to eliminate contamination (Simpson 2004; Public Health Ontario 2021).

Public Health Ontario (PHO) recommends that private well owners test their well water ‘often’ (Ugas et al. 2019; Public Health Ontario 2021). Testing is offered free of charge by Public Health Ontario Laboratories (PHOL) to identify the presence of indicator bacteria, including coliforms and E. coli (Simpson 2004; Public Health Ontario 2021). Coliforms are a group of bacteria that are found in both the environment and animal wastes, and indicate surface water contamination, whereas E. coli, a species of coliform bacteria, are found in digestive tracts and indicate human or animal waste contamination (Public Health Ontario 2021).

The Wellington-Dufferin-Guelph Public Health (WDGPH; Public Health) unit in Southern Ontario includes the counties of Wellington (including the City of Guelph) and Dufferin (WDG). One of the mandates of WDGPH is to promote and maintain clean drinking water standards (Ontario Ministry of Health & Long-term Care 2019; Ugas et al. 2019). The primary goal of this study was to identify prominent risk factors of bacterial contamination specific to privately owned wells in the WDG region through logistic regression models, to help achieve this mandate. Data from a survey of private well owners and associated laboratory test results from 2018 to 2019 were analyzed. The results from these models can be used by WDGPH to formulate recommendations and generate targeted education for well testing practices for high-risk wells, as well as by private well owners to help them assess their risk of contamination.

Data for this project were provided by the WDGPH unit. Data were re-structured and used to model the odds of bacterial contamination via logistic regression. Important well characteristics, as selected by the Akaike Information Criterion (AIC), were examined. Data analysis was carried out using R statistical software and RStudio (R Core Development Team 2020; RStudio Team 2020). A list of specific R packages and their primary use in this study are included in Supplementary material, Appendix A.

Dataset

Data contained survey responses and laboratory results collected by WDGPH and described in Ugas et al. (2019). In brief, a survey of private well owners was conducted by WDGPH in 2018 and 2019 regarding characteristics of private wells located on the respondents' properties, as well as respondents' attitudes about well testing for bacterial contamination. The survey consisted of 23 questions and was available online, as well as in paper format at well water testing and bottle pickup locations (Ugas et al. 2019). Surveys were also mailed out to addresses with wells that had not been tested by PHOL more recently than 2006 (Ugas et al. 2019).

Public Health facilitates testing of wells by private well owners in the region: sample collection bottles available from WDGPH offices are filled by well owners with well water, based on sample collection instructions and acceptance criteria guidelines provided by PHO (Public Health Ontario 2021, 2025), and returned to WDGPH. Samples are then forwarded to PHOL for microbiological testing. PHOL uses standard methods in accordance with those listed by the Ontario Ministry of the Environment, Conservation and Parks to detect and count total coliforms and E. coli concentrations in sample bottles (Public Health Ontario 2019). Some wells in the database had more than one test result per year since PHO recommends multiple tests per year, and that samples be re-submitted following remedial actions after a positive test result (Public Health Ontario 2021). This increases the frequency of adverse results for some wells.

For data analysis, the dataset shared by WDGPH contained PHOL test results from 2018 to 2019 for privately owned wells in WDG, which were then linked to the corresponding WDGPH survey responses via the address provided on the survey and PHOL sample identification, as described by Ugas et al. (2019).

Data management

Records with a missing or inconclusive test result were removed. Inconclusive test results consisted of laboratory and sample errors, including samples overgrown with non-target bacteria. Laboratory test results indicating unsafe drinking water due to evidence of bacterial contamination (total coliform count >5 colony forming units per 100 mL) or evidence of fecal contamination (E. coli count ≥1 CFU per 100 mL), or that were heavily contaminated with environmental and indicator bacteria (‘No Data: overgrown with target’) were classified as a positive (adverse) result. Laboratory samples with no significant evidence of bacterial contamination (total coliforms ≤5 CFU per 100 mL and E. coli = 0 CFU per 100 mL) were classified as a negative (non-adverse) result.
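The classification rule above can be sketched in a few lines of code. This is an illustrative sketch only, not the authors' implementation (which was written in R); the function name, argument names, and the use of `None` for missing counts are assumptions made for the example.

```python
from typing import Optional


def classify_result(total_coliforms: Optional[int],
                    e_coli: Optional[int],
                    overgrown_with_target: bool = False) -> Optional[str]:
    """Classify one laboratory record using the thresholds stated in the text.

    Returns 'adverse', 'non-adverse', or None when the record has no usable
    result (missing or inconclusive) and would be removed before analysis.
    """
    # 'No Data: overgrown with target' is treated as an adverse result.
    if overgrown_with_target:
        return "adverse"
    # Missing counts (hypothetical encoding: None) mean the record is dropped.
    if total_coliforms is None or e_coli is None:
        return None
    # Adverse: total coliforms > 5 CFU/100 mL or E. coli >= 1 CFU/100 mL.
    if total_coliforms > 5 or e_coli >= 1:
        return "adverse"
    return "non-adverse"
```

Note that a total coliform count of exactly 5 CFU per 100 mL with no E. coli falls on the non-adverse side of the threshold.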

Variables were created for the laboratory result year, month, and meteorological season, which was defined as: winter (December–February), spring (March–May), summer (June–August), and autumn (September–November). The season variable was used to account for climate and precipitation trends over the course of the year (as done by Latchmore et al. 2020).
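As a small illustration of the season definition, the mapping from calendar month to meteorological season could be written as follows. The function name is hypothetical.

```python
def meteorological_season(month: int) -> str:
    """Map a calendar month (1-12) to the meteorological season defined in the text."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "autumn"
    raise ValueError(f"month must be between 1 and 12, got {month}")
```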

The remaining variables were defined from the survey directly, using levels for categorical variables that reflected the survey options presented to the respondents as closely as possible. A binary indicator variable was generated for the presence or absence of a treatment system (at least one of reverse osmosis, chlorine injection system, or ultraviolet). Another binary variable indicated the presence or absence of potential point contamination sources (at least one of septic system, livestock, manure storage and spreading, or surface water) within 50 feet (15.24 m) of the well. Binary variables were used because few wells fell into each individual level of these variables when expanded. They were computed by combining survey responses across the three levels of treatment type and the four levels of contamination sources, respectively. To reduce the risk of misclassification bias, each binary variable was classified as absent if the response was one of ‘none’, ‘other’, or ‘unsure’, or was left blank. As done by Ugas et al. (2019), an explanatory variable for subregion was also created to account for geological differences across the region at a broader scale than a municipality; counts at the municipality level were small and might have led to small number problems in the data analysis.
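The derivation of the two binary indicators might look like the sketch below. The option strings are hypothetical stand-ins for the survey wording, which is not reproduced in this paper, and the helper name is an assumption.

```python
# Hypothetical labels for the survey options; the real survey wording may differ.
TREATMENT_OPTIONS = {"reverse osmosis", "chlorine injection system", "ultraviolet"}
SOURCE_OPTIONS = {"septic system", "livestock", "manure storage and spreading",
                  "surface water"}
# Responses that are counted as 'absent', including a blank response.
ABSENT_RESPONSES = {"none", "other", "unsure", ""}


def is_present(responses, recognised_options):
    """Return True if at least one response names a recognised option.

    'none', 'other', 'unsure', and blank responses never count as present.
    """
    cleaned = {r.strip().lower() for r in responses}
    return any(r in recognised_options for r in cleaned - ABSENT_RESPONSES)
```

For example, `is_present(["Ultraviolet", "other"], TREATMENT_OPTIONS)` evaluates to `True`, whereas a respondent who only ticked 'unsure' is coded as absent.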

Preliminary data cleaning included adjusting survey responses, when needed, to account for indefinite responses or multiple responses to single-answer questions (see Supplementary material, Table B1 for details). Thirty-seven wells were linked to two or more survey submissions with differing responses; to improve data accuracy and reduce the risk of misclassification, all observations associated with these wells were removed from the dataset. As well, since individual test results should be given a unique barcode, for 175 records with repeated laboratory barcodes, only the first observation was kept. A detailed flowchart of data cleaning steps can be found in Supplementary material, Figure B1.

Variables of interest

Ten explanatory variables describing well characteristics were considered, all of which were categorical: age of the well, type of well, the height of the top of the well relative to the ground (height), depth of the well, type of cap on the well, surface water drainage ability around the well, subregion, meteorological season, and the binary variables of treatment (yes/no) and presence of known potential contamination sources within 50 feet (15.24 m) (yes/no). The response variable selected for the logistic regression model was the first test result on each well per season (positive/negative). As some wells had multiple test results for a single season, with others having only one or no results in some seasons, using only the first test result per season allowed us to mitigate this effect of multiple test results per season on some wells.
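Selecting the first test result per well per season can be sketched as follows. The record layout (dictionaries with `well_id`, `date`, `season`, and `result` keys) is an assumption for illustration. Grouping by calendar year is a further simplification: it splits a winter that spans December to February into two groups, and the text does not say how the authors handled this.

```python
from datetime import date


def first_test_per_season(records):
    """Keep the earliest record for each (well, calendar year, season) group.

    `records` is an iterable of dicts with the hypothetical keys
    'well_id', 'date' (datetime.date), 'season', and 'result'.
    """
    first = {}
    for rec in sorted(records, key=lambda r: r["date"]):
        key = (rec["well_id"], rec["date"].year, rec["season"])
        # setdefault stores only the first (earliest) record seen for a key.
        first.setdefault(key, rec)
    return list(first.values())


example = [
    {"well_id": "W1", "date": date(2018, 7, 20), "season": "summer", "result": "adverse"},
    {"well_id": "W1", "date": date(2018, 6, 5), "season": "summer", "result": "non-adverse"},
    {"well_id": "W1", "date": date(2018, 10, 1), "season": "autumn", "result": "non-adverse"},
]
kept = first_test_per_season(example)
```

In this toy example, the July retest is dropped and the June and October results are retained.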

Modeling preparations

Variable exploration revealed that data separation would cause problems for model parameter estimation (Mansournia et al. 2018). Explanatory variables were therefore further examined for separation from the response variable using two-way tables; to avoid data separation issues, levels of categorical variables that contained either no positive or no negative test results were amalgamated with the next closest level (Mansournia et al. 2018). Variables affected in this way were the age of the well, height of the well (relative to the ground), and type of cap on the well (Supplementary material, Table B2).
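A two-way table check for separation can be sketched as below. The function flags levels of a categorical variable in which either no positive or no negative results were observed; these are the levels that would be candidates for amalgamation. The names and the 0/1 coding of the outcome are assumptions for the example.

```python
from collections import Counter


def levels_with_separation(levels, outcomes):
    """Return the levels that contain only positive or only negative outcomes.

    `levels` holds the category of each observation and `outcomes` the
    matching test result coded as 1 (adverse) or 0 (non-adverse).
    """
    counts = Counter(zip(levels, outcomes))  # cells of the two-way table
    flagged = []
    for level in sorted(set(levels)):
        positives = counts[(level, 1)]
        negatives = counts[(level, 0)]
        if positives == 0 or negatives == 0:
            flagged.append(level)
    return flagged
```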

To assess pairwise correlation between variables, Cramer's V was estimated, with a bias correction to reduce overestimation (Bergsma 2013). Cramer's V (minimum = 0, maximum = 1) is a measure of the strength of association between nominal categorical variables for contingency tables with more than two rows and/or columns, with values greater than 0.3 typically indicating a strong association (Bergsma 2013; Marchant-Shapiro 2015). Furthermore, the variance inflation factor (VIF) was assessed as a measure of multicollinearity between variables in a model, with values greater than 5 considered large (James et al. 2021). Variables with large Cramer's V and VIF were removed from modeling when also associated with multiple other variables.
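For readers unfamiliar with the bias-corrected statistic, a minimal sketch of the Bergsma (2013) correction is given below. It is written from the published formula using only the standard library and is not the R routine used in the study; the function name and input layout are assumptions.

```python
import math


def cramers_v_corrected(table):
    """Bias-corrected Cramer's V for an r x k contingency table of counts.

    `table` is a list of rows, each a list of non-negative counts, with a
    total count greater than 1.
    """
    r, k = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(k)]
    n = sum(row_tot)

    # Pearson chi-square statistic, skipping cells with zero expected count.
    chi2 = 0.0
    for i in range(r):
        for j in range(k):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi-squared and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    if denom <= 0:
        return 0.0
    return math.sqrt(phi2_corr / denom)
```

An independent table such as `[[10, 10], [10, 10]]` gives a value of 0, while a perfectly associated table such as `[[20, 0], [0, 20]]` gives a value of 1 (up to floating-point error).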

Logistic regression models

Logistic regression was applied to assess the association between well contamination and the well characteristics. A detailed explanation of logistic regression can be found in Hosmer et al. (2013). Initial models considered only the main effects. A forward-backward stepwise regression procedure was carried out, using the AIC as a relative measure of model fit for model comparisons. In addition to the AIC, the area under the receiver operating characteristic curve (AUROC), a common performance measure for classification problems, was estimated as a model evaluation metric for the top 10 fitted models with the lowest AIC scores. The AUROC (minimum = 0, maximum = 1) represents the ability of the model to discriminate between different categorical outcomes (here, adverse and non-adverse well test results), with 0.5 representing no discriminatory capacity compared to random chance and values >0.7 indicating adequate discriminative capacity (Hosmer et al. 2013). Further explanations of the AIC and AUROC can be found in Hosmer et al. (2013).
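The AUROC has a direct probabilistic reading: it is the probability that a randomly chosen adverse result receives a higher predicted probability than a randomly chosen non-adverse result, with ties counted as one half. The sketch below computes it by comparing all such pairs. It is an illustration of the metric, not the routine used in the study, and the function name is an assumption.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    `scores` are predicted probabilities; `labels` are 1 (adverse) or
    0 (non-adverse). Runs in O(n_pos * n_neg) time, which is acceptable
    for a dataset of a few hundred records.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a concordant pair
    return wins / (len(pos) * len(neg))
```

A model that assigns the same probability to every record scores 0.5 under this definition, which is the 'no discriminatory capacity' reference point mentioned above.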

The deviance for each model was also calculated as a goodness of fit metric. This indicates how much the proposed model deviates from a saturated model, with a smaller deviance indicating a better fit (Hosmer et al. 2013; James et al. 2021). While the predictive performance of the fitted models was not the primary objective of this analysis, confusion matrices (tables of the actual and predicted values from the estimated logistic regression model) were also examined for each model to determine whether a basic classification ability existed within the fitted models (Hosmer et al. 2013). Predicted probabilities of >0.5 were assigned a predicted adverse well water result and probabilities ≤0.5 were assigned a predicted non-adverse result.
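The thresholding step can be sketched as follows; the function name and the dictionary layout of the output are assumptions for illustration.

```python
def confusion_counts(probabilities, actual, threshold=0.5):
    """Cross-tabulate predicted and actual outcomes.

    A predicted probability strictly greater than `threshold` is classed as
    adverse (1); a probability at or below it is classed as non-adverse (0).
    """
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for p, y in zip(probabilities, actual):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1
        elif predicted == 1 and y == 0:
            counts["FP"] += 1
        elif predicted == 0 and y == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts
```

With a rare outcome, few fitted probabilities are likely to exceed 0.5, so a fixed 0.5 cut-off tends to classify almost every record as non-adverse.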

Further models were fitted including interactions between seasons and other explanatory variables. The explanatory variables were chosen from the 10 best-fitting models according to their AIC.

Secondary response variable analysis

A second way to mitigate the effect of multiple tests per season on some wells was to define the response variable as whether a well had at least one positive test result per season (yes/no). This also allowed for better capture of contamination patterns for wells that had intermittent contamination throughout the season. As such, the above-described modeling preparation procedures and logistic regression analysis were repeated under this second definition of the response variable (Supplementary material, Appendix C).
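Under this second definition, the response for a well in a given season is positive if any of its tests in that season was adverse. A minimal sketch, assuming records with hypothetical `well_id`, `year`, `season`, and `result` fields:

```python
def any_adverse_per_season(records):
    """Map each (well, year, season) group to True if any test was adverse."""
    status = {}
    for rec in records:
        key = (rec["well_id"], rec["year"], rec["season"])
        status[key] = status.get(key, False) or rec["result"] == "adverse"
    return status
```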

Descriptive statistics

The initial data provided by WDGPH included 4,646 laboratory (PHOL) test results for 592 unique wells. After filtering and cleaning these data, the final dataset used for statistical modeling included 642 laboratory test results from 2018 to 2019, with 70 adverse results (10.90%) and 572 non-adverse results (89.10%). These test results originated from 376 unique privately owned wells in the WDGPH region, of which 16 (4.26%) had only adverse test results, 322 (85.64%) had only non-adverse test results, and 38 (10.11%) had at least one adverse and one non-adverse test result over the study period. A breakdown of the number of test results and the number of unique wells for each level of the well characteristic variables is found in Supplementary material, Table B3.

The type of well was not considered in the regression models due to correlation with other variables (VIF = 6.33, Cramer's V > 0.3 with age, depth, height of well, and type of cap on well). Therefore, the variables used in the modeling process were: the age of the well, the height of the well, the depth of the well, the type of cap on the well, surface water drainage ability around the well, subregion, meteorological season, and the binary variables of treatment (yes/no) and presence of known potential contamination sources within 50 feet (15.24 m) (yes/no).

Logistic regression modeling

The logistic regression modeling results presented here are for only the response variable of the first test on a well per season. The results for the models based on the second definition of the response variable are presented in Supplementary material, Appendix C.

Table 1 presents the 10 best-fitting models according to their AIC, with an indicator for variables included in each model. Age, season, presence of a treatment system, and point contamination sources within 50 feet (15.24 m) were in ≥50% of the top 10 models. Depth and height of the well were only identified as important variables in two of the top 10 models.

Table 1

Top 10 models with the lowest AIC values, ordered by increasing AIC, after stepwise regression

Model    Age  Treatment  Season  Contamination source nearby  Depth  Height  Type of cap  Surface water  Subregion
Full     X    X          X       X                            X      X       X            X              X
[Rows for Models 1–10: the per-model ‘X’ marks are missing from this copy of the table.]
Total^a  10   8          8       5                            2      2       0            0              0

Note. An ‘X’ indicates that a well characteristic variable was identified as important in the corresponding model.

^a Number of models containing the variable out of the top 10 models with the lowest AIC values, excluding the full model.

The AIC, AUROC, and deviance of the top 10 models were examined (Table 2). While the AIC decreased as a result of the variable selection procedures, the AUROC values were also slightly lower in the reduced models, ranging from 0.64 to 0.70 (relative to 0.75 for the full model). Therefore, Model 2, which had one of the lower AIC and deviance values balanced against one of the higher AUROC values, was chosen as the preferred model and selected for further investigation of the impact of well characteristics on bacterial contamination. This model contained variables found in 50% or more of the 10 examined models, indicating that it identified variables associated with bacterial contamination consistent with most of the other top 10 models.

Table 2

AIC, AUROC, and deviance (Dev) for the full model and the 10 models with the lowest AIC values

Model  AIC     AUROC  Dev
Full   450.23  0.75   390.23
1      430.63  0.68   412.63
2      430.95  0.69   410.95
3      432.75  0.67   416.75
4      432.82  0.70   406.82
5      433.46  0.70   405.46
6      433.58  0.67   415.58
7      433.62  0.64   421.62
8      433.64  0.69   409.64
9      433.93  0.66   419.93
10     434.07  0.70   408.07

Note. The two maximum AUROC values, and minimum AIC and deviance values (prior to rounding), are indicated in bold.
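The AIC and deviance columns of Table 2 are linked by a simple identity that can serve as a consistency check. The AIC is defined as −2 log L + 2k, where k is the number of estimated parameters. For a logistic regression fitted to ungrouped binary outcomes, the residual deviance equals −2 log L, so the difference between AIC and deviance is 2k. The sketch below assumes that the reported deviance is this residual deviance; the function name is hypothetical.

```python
def implied_parameter_count(aic, deviance):
    """Number of estimated parameters implied by AIC = deviance + 2k."""
    return (aic - deviance) / 2.0


# AIC and deviance values as reported in Table 2.
table2 = {
    "Full": (450.23, 390.23),
    "Model 1": (430.63, 412.63),
    "Model 2": (430.95, 410.95),
}
implied = {name: round(implied_parameter_count(a, d)) for name, (a, d) in table2.items()}
```

For Model 2 this gives 10 parameters, which matches an intercept plus the nine non-referent levels listed in Table 3 (four for age, one for treatment, one for point contamination source, and three for season). The full model implies 30 parameters.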

The odds ratios (ORs) and corresponding 95% confidence intervals derived from Model 2 are shown in Table 3. The results indicate that the odds of bacterial contamination increased with the age of the well and with the presence of point contamination sources close to the well, and decreased when a treatment system was present on the well. Summer was found to have increased odds of contamination compared to spring. The confusion matrix for Model 2 is shown in Table 4, where only one of the 70 adverse test results was correctly predicted, suggesting a high misclassification rate of adverse results. This high misclassification rate was similar for all models, with confusion matrices for all 10 models provided in Supplementary material, Table B4.

Table 3

Associations between well characteristics and the odds of bacterial contamination using Model 2

Variable                                            Odds ratio  95% confidence interval
Age
  ≤10 years                                         Referent    Referent
  11–35 years                                       2.48        0.83, 7.41
  36–75 years                                       2.78        0.92, 8.45
  >75 years                                         24.11       2.31, 251.84
  Unsure                                            6.18        1.95, 19.55
Treatment
  No                                                Referent    Referent
  Yes                                               0.51        0.26, 0.97
Point contamination source within 50 ft (15.24 m)
  No                                                Referent    Referent
  Yes                                               1.48        0.83, 2.65
Season
  Spring                                            Referent    Referent
  Winter                                            0.77        0.29, 2.05
  Summer                                            2.34        1.20, 4.56
  Autumn                                            1.09        0.56, 2.13

Note. Estimated ORs and 95% confidence intervals for each category level are provided.
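An odds ratio is the exponential of a logistic regression coefficient, and a Wald-type 95% confidence interval is obtained by exponentiating the coefficient plus or minus 1.96 standard errors. The paper does not state which interval method was used; the sketch below assumes Wald intervals, which is consistent with the reported intervals in Table 3 being approximately symmetric around the OR on the log scale. Function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_from_ci(lower, upper):
    """Recover the log-odds coefficient and standard error from a Wald CI."""
    log_lo, log_hi = math.log(lower), math.log(upper)
    beta = (log_lo + log_hi) / 2.0          # midpoint on the log scale
    se = (log_hi - log_lo) / (2.0 * Z_95)   # half-width divided by z
    return beta, se


def or_with_ci(beta, se):
    """Return (odds ratio, lower bound, upper bound) for a coefficient."""
    return (math.exp(beta),
            math.exp(beta - Z_95 * se),
            math.exp(beta + Z_95 * se))


# Reported interval for wells older than 75 years (Table 3): 2.31, 251.84.
beta_old, se_old = wald_from_ci(2.31, 251.84)
or_old, lo_old, hi_old = or_with_ci(beta_old, se_old)
```

The recovered odds ratio for wells older than 75 years is approximately 24.1, in line with the reported value of 24.11. The implied standard error of about 1.2 on the log scale reflects how few wells were in this category.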

Table 4

Confusion matrix for Model 2

Results                Actual non-adverse  Actual adverse
Predicted non-adverse  571                 69
Predicted adverse      1                   1
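The misclassification pattern in Table 4 can be summarized with standard classification metrics. The counts used below combine the reported cells (571 and 69) with values implied by the text: one of the 70 adverse results was correctly predicted, and the 572 non-adverse results leave one false positive. The function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),            # adverse results detected
        "specificity": tn / (tn + fp),            # non-adverse results detected
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Counts for Model 2: one true positive and one false positive (derived
# from the text), with 571 true negatives and 69 false negatives.
model2 = classification_metrics(tp=1, fp=1, tn=571, fn=69)
```

Sensitivity is 1/70 (about 1.4%) and specificity is 571/572 (about 99.8%). The overall accuracy of 572/642 (about 89.1%) is the same as would be obtained by predicting every result as non-adverse, which illustrates why accuracy alone is uninformative for a rare outcome.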

The inclusion of interaction terms in the model did not improve model fit (as measured by the AIC), discriminatory ability (as measured by the AUROC), or predictive ability (as observed in the resulting confusion matrix) relative to the main effects model. Results for the interaction models are presented in Supplementary material, Appendix C.

Results

This study found that, for wells in the WDG region, the age of the well, season of testing, not having a treatment system on the well, and having a point contamination source within 50 feet (15.24 m) of the well were all associated with higher odds of bacterial contamination in the first test result of a season. These variables appeared in at least 50% of the 10 models in our study with the lowest AIC values. Comparing the models in Table 2 by their AIC, AUROC, and deviance suggests that Model 2, based on four explanatory variables, represents the data well while also providing one of the higher discriminative abilities. However, the AUROC for each of the top 10 models was low (≤0.70). This suggests that, while the well characteristic variables identified in the models were important for determining the odds of bacterial contamination in the WDGPH region, the fitted models themselves do not classify wells as contaminated particularly well and might not fully explain the reasons for contamination.

The factors identified as important in this study are consistent with other results found in the literature and provide insight into the specific factors influencing contamination in the WDGPH region. For example, the age of the well can influence the risk of contamination, as older wells may be naturally deteriorating or be constructed with less reliable materials (Simpson 2004; Krolik et al. 2013; Hynds et al. 2014). It is now recommended that older wells be checked for seepage and that wells be upgraded to current recommendations, such as including a water-tight casing and beginning at least 6 m deep (Simpson 2004; Ontario Ministry of the Environment, Conservation & Parks 2021).

The point contamination sources referenced in the survey (surface water, septic systems, livestock grazing, and manure spreading and storage) have all been indicated as contamination sources that can increase the number of pathogens either traveling through the soil or bypassing the filtration of the soil (Simpson 2004; Krolik et al. 2013; Hynds et al. 2014; Borchardt et al. 2021). The further a well is from a contamination source, the less likely it is that contaminants can infiltrate the well (Simpson 2004). On-site sewage disposal systems can introduce fecal matter by reducing the travel distance and therefore the amount of filtering of contaminants through the soil (Simpson 2004). Septic tanks that leak or are improperly maintained can further increase this risk (Simpson 2004). One study has traced sources of human fecal pollution in wells, following increased precipitation, back to septic systems (Krolik et al. 2014). Surface water can act as a reservoir for bacterial contamination by allowing pathogens to bypass the natural filtration of the soil and can submerge a wellhead when flooding occurs, increasing the risk of contamination, especially if the well is improperly capped (Simpson 2004). Coleman et al. (2013) suggest that surface waters may be contaminated with human sewage, pet or wildlife droppings, and livestock run-off. Livestock have been implicated in many studies regarding bacterial contamination; grazing animals, land spreading of organic waste, and point agricultural sources, including silage pits and animal housing, are all known risk factors for well contamination (Simpson 2004; Coleman et al. 2013; Hynds et al. 2014; Invik et al. 2019; Borchardt et al. 2021). Agricultural run-off can introduce both chemical and microbiological contaminants into the soil and surrounding surface waters (Krolik et al. 2013; Borchardt et al. 2021). Cattle-based run-off with Shiga toxin-producing E. coli led to an outbreak in Walkerton, ON, resulting in the deaths of seven individuals (Krolik et al. 2014).

Treatment systems on water from private wells may be used to mitigate contamination from various sources (Murphy et al. 2016). It has been suggested that those who use untreated water from wells could be at greater risk for gastrointestinal illness (Murphy et al. 2016; Ugas et al. 2019). Furthermore, untreated water can act as a reservoir for antimicrobial-resistant bacteria (Coleman et al. 2013). Previous research has suggested that water from systems treated with UV disinfection to inactivate pathogens is at a decreased risk of contamination, while treating with chlorination alone is less effective against protozoa such as Cryptosporidium and Giardia (Murphy et al. 2016). It should be noted that, as water for testing is drawn from faucets inside a residence and not from the well itself, a negative sample from a treated water system does not necessarily indicate that the well itself is uncontaminated.

Season has been implicated as a key factor for bacteriological contamination, potentially due to seasonal temperatures and weather patterns, which were not assessed in this study. For example, precipitation levels can affect well contamination (Invik et al. 2019), where decreased levels of precipitation can cause wells to run dry and increased precipitation can lead to flooding around the wellhead (Simpson 2004), both scenarios increasing the risk of contamination and/or the concentration of contaminants. Increased precipitation can also move water and contaminants more quickly through the soil, which reduces filtering time for contaminants and increases the risk of groundwater contamination (Simpson 2004; Krolik et al. 2013). One study found that fecal coliforms almost doubled during a wet season, potentially due to poor soil drainage and seasonally high water tables (Arnade 1999). While the current study shows an increased odds of bacterial contamination in the summer, it does not investigate when the wells were tested relative to the wet spring season or relative to summer precipitation events; there may be a lag effect between rainfall events and when surface water enters a well, and between when surface water enters the well and when bacteria start to grow. However, this study supports previous findings in the literature that seasonal effects are important to consider when evaluating the risk of bacterial contamination.

The addition of interaction terms between season and the three other variables in Model 2 did not improve model fit, suggesting interaction between variables is not a significant contributing factor to bacterial contamination. The analysis under the second definition of the response variable provides insight into the consistency of well characteristics as being associated with bacterial contamination. While this analysis captured a slightly higher proportion of contaminated test results (12.31%), the results showed no major changes to selected variables or the predictive ability of the model. These results are relevant to public health officials, as there is currently no standard method for modeling well contamination. The similarity of results also demonstrates that the important well characteristics identified by the model are not an artifact of how the response variable was defined.

The low discriminatory ability of the models indicates that important factors are missing that may be related to the odds of bacterial contamination, for example, the local hydrogeology of the area. Factors such as soil type and depth, bedrock type, depth to surface water, and elevation, among other hydrogeological settings, may play a role in bacterial contamination (Simpson 2004; Krolik et al. 2013; O'Dwyer et al. 2018). For example, thinner, more porous soils and fractured bedrock have a higher risk of bacteriological contamination, as they allow bacteria to move more quickly to the groundwater and reduce the amount of filtration provided by the soil (Simpson 2004; Hynds et al. 2014). As well, information on the physical setting of the well was not available. Factors associated with the land use (e.g. residential property versus active farm) where a well is located may also be related to the odds of bacterial contamination (see for example, Invik et al. (2019) and Borchardt et al. (2021)) and should be considered when such data are available.

Limitations

The response variable in this study was represented as one observation per season per well, to reduce the effects of multiple testing from the same wells. However, defining the variable in this way might not capture the true prevalence of bacteriological contamination that might be detected through more frequent tests. It has been shown that detection rates of E. coli will increase as the testing rate per year increases (Latchmore et al. 2020). By only accounting for one test result per season, there may be a less accurate reflection of the actual risk of bacterial contamination in the WDG region and of bacterial contamination patterns.

Completion of the survey by respondents and submission of well water to PHOL for testing over the period covered by this study were both done on a voluntary basis, which can be considered passive sampling of the study population, potentially leading to sampling bias. For example, although the survey was sent out to ‘known non-testers,’ as well as to owners in the test result database, it was not recorded what proportion of those invited to participate from each of these groups actually responded to the survey. More diligent well owners may have been more likely to participate in the survey, and those who had had previous well contamination may have been more likely to send multiple water samples for testing. In addition, Ugas et al. (2019) argue that self-reported survey responses are poor indicators of true testing frequencies, possibly because respondents may inflate the degree to which they test their water. Similarly, it is likely that responses to survey questions about well characteristics may not accurately reflect the true features of the well in question, as some owners may not be sufficiently familiar with their well. Categories of ‘unsure’ and ‘other’ were included in this analysis to reduce the number of excluded observations, but this inherently introduces some uncertainty about the characteristics of a well and could skew the importance of a variable. Likewise, limited responses in some well characteristic categories (for example, the small number of wells over 75 years old) may affect the differences in odds of contamination between variable levels. Verifying the well characteristics of a subset of the survey respondents via site visits was considered during the survey period, but this proved infeasible and was not done.

Response categories for each of the variables treatment type and the presence of a point contamination source were amalgamated to create binary classifications (present/absent). It is unclear what effect this may have had on the resulting models. Future analyses could look at retaining the specific treatment types and specific point contamination sources included in the survey during the model fitting process.

Many issues encountered in modeling the survey and PHOL data may have stemmed from the lack of balanced data: the response variable had a low proportion of contaminated results (10.90%) on which to build a model. Unbalanced data can affect the performance of models and is especially important to consider when handling rare events (Garcia et al. 2012). A low number of events per variable (EPV), including fewer than 10 EPV as in this study (Supplementary material, Table B3), can lead to biased regression coefficients and large variance estimates, and can produce more conservative Wald estimates in which associations and significance are diminished (Peduzzi et al. 1996). It can also lead to separation or quasi-separation, which can cause convergence errors when the model is fitted (Mansournia et al. 2018). The current data include observations for rare covariate patterns, or rare combinations of well characteristics. For example, there were no adverse results in wells that were less than one year old. Combining (collapsing) levels of categorical variables helped, to some extent, to alleviate these issues; however, there were still collapsed levels of variables in which the observed numbers of wells were low. Future studies could explore different techniques to create a more balanced dataset, such as using bootstrapping to under- or over-sample the unbalanced variables (Garcia et al. 2012).
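The two diagnostics discussed above, the EPV and the presence of predictor levels with no events, can be checked with a few lines of code. The sketch below uses invented counts and hypothetical function names; it takes the EPV as the count of the rarer outcome divided by the number of model parameters excluding the intercept, which is a common convention and an assumption here.

```python
from collections import Counter


def events_per_variable(outcomes, n_parameters):
    """EPV: count of the rarer outcome divided by the number of
    model parameters (excluding the intercept)."""
    n_events = sum(1 for y in outcomes if y)
    n_rare = min(n_events, len(outcomes) - n_events)
    return n_rare / n_parameters


def levels_without_variation(levels, outcomes):
    """Return predictor levels in which every observation has the same
    outcome, a pattern that can produce (quasi-)separation."""
    counts = Counter(zip(levels, outcomes))
    flagged = []
    for level in sorted(set(levels)):
        events = counts[(level, True)]
        non_events = counts[(level, False)]
        if events == 0 or non_events == 0:
            flagged.append(level)
    return flagged


# Invented data: well age category and adverse test result for 100 wells.
age = ["<1"] * 5 + ["1-25"] * 40 + [">25"] * 55
adverse = [False] * 5 + [True] * 3 + [False] * 37 + [True] * 8 + [False] * 47

epv = events_per_variable(adverse, n_parameters=12)  # 11 events over 12 parameters
flagged = levels_without_variation(age, adverse)      # levels with zero events or zero non-events
```

A flagged level is a candidate for collapsing with a neighboring category, which is the remedy described in the paragraph above.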

In this research, there were challenges in assessing the fit of the model, as there is no set of standards for evaluating logistic regression model fit. The Hosmer–Lemeshow test could be used as an additional goodness-of-fit test; however, it was not used here due to the influence of the defined number of groupings on the p-value (Hosmer et al. 1997).
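The dependence of the Hosmer–Lemeshow result on the number of groups can be explored with a short sketch. The code below computes the test statistic only (not the p-value), using equal-sized groups of sorted predicted probabilities; the simulated probabilities and outcomes are invented for illustration and the function name is hypothetical.

```python
import random


def hosmer_lemeshow_statistic(probs, outcomes, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic based on groups of
    observations sorted by predicted probability."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(group)
        if size == 0:
            continue
        expected = sum(p for p, _ in group)   # expected number of events
        observed = sum(y for _, y in group)   # observed number of events
        stat += (observed - expected) ** 2 / (expected * (1 - expected / size))
    return stat


# Simulated, well-calibrated predictions with a low event rate (invented data).
rng = random.Random(1)
probs = [rng.uniform(0.02, 0.30) for _ in range(500)]
outcomes = [1 if rng.random() < p else 0 for p in probs]

# The same predictions evaluated with different numbers of groups.
stats = {g: hosmer_lemeshow_statistic(probs, outcomes, g) for g in (5, 10, 20)}
```

The statistic is usually compared with a chi-square distribution on (number of groups − 2) degrees of freedom, so both the statistic and its reference distribution change with the grouping, which is why the resulting p-value can be sensitive to that choice.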

In the exploratory data analysis, Cramer's V was used to assess related variables. However, the estimate of Cramer's V can be influenced by the definition of category levels and by non-random sampling (Sun et al. 2010). This means that if levels were amalgamated or removed, such as the ‘unsure’ survey responses, the estimates of Cramer's V might differ. In addition, Cramer's V, although an extension of Pearson's correlation coefficient, cannot be interpreted similarly as the proportion of variance explained (Sun et al. 2010). In the work described here, Cramer's V was used in an exploratory capacity as a consistent metric between different nominal and ordinal variables. It was also used in conjunction with the VIF to explore multicollinearity that might be present in the model, and it acted as the criterion used to select correlated variables for removal from the model fitting process.
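The sensitivity of Cramer's V to the definition of category levels can be illustrated with a small calculation. The sketch below implements the uncorrected statistic, V = sqrt(chi-square / (n × (min(r, c) − 1))); whether the study used this or a bias-corrected version is not stated here, so the uncorrected form is an assumption, and the counts are invented.

```python
import math


def cramers_v(table):
    """Uncorrected Cramer's V for an r x c contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Invented counts: rows are a well characteristic with levels
# (level A, level B, 'unsure'); columns are treatment system (yes, no).
full = [[60, 40], [20, 30], [10, 10]]
without_unsure = full[:2]  # same data with the 'unsure' row removed

v_full = cramers_v(full)
v_reduced = cramers_v(without_unsure)
```

The two estimates differ even though the underlying responses for the remaining levels are unchanged, which is the behavior described in the paragraph above.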

Random effect models, discussed in Hosmer et al. (2013), could be used to further explore spatial and temporal predictors for well water contamination. However, limited sample sizes prevented this strategy in the current study.

PHU importance

Waterborne illnesses from bacterial contamination are a public health issue in North America (Hynds et al. 2014). Sensitive sub-populations such as the immune-compromised, the very young, and the elderly are more susceptible to the effects of illnesses from pathogen contamination of groundwater (Hynds et al. 2014). Educating the public on risk factors for groundwater contamination can increase recognition within the community of the importance of maintaining and routinely testing drinking water sources (Murphy et al. 2016). Private well owners need to be educated about proper well maintenance, suggested testing guidelines, and treatment systems for contaminated water sources (Simpson 2004; Ontario Ministry of Health and Long-Term Care 2019). The Government of Ontario recommends that well owners test their water at least three times per year and provides guidelines on the placement of new wells relative to sources of contamination (for example, a minimum distance of 50 feet (15.24 m) between a well and a septic tank) (Ontario Ministry of the Environment, Conservation and Parks 2021); however, continued advocacy by public health units of these, and other, government recommendations is necessary. Based on the results of this research, education about the need to test wells frequently could be targeted at those who have older wells, those with no treatment systems, or those with point contamination sources around the well, to minimize the risk of waterborne illness to private well owners and operators. Education about these risk factors could also be presented to the general public.

Private wells are the most common source of water in rural Ontario households but can contain waterborne pathogens that increase the risk of gastrointestinal, and specifically bacterial, illness (Simpson 2004; Ugas et al. 2019). It is important to identify and assess locally relevant risk factors for contamination so that education and water testing programs can be targeted at the owners of private wells at higher risk of contamination. The data showed that 10.9% of the PHOL well test results linked to the WDGPH survey in 2018–2019 were contaminated, based on the first test per season. Using a logistic regression model, the age of a well, season of testing, not having a treatment system on the well, and point contamination sources surrounding the well were associated with bacterial contamination of private well water in the Wellington-Dufferin-Guelph region. These characteristics were consistently present in models selected by the AIC.
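The idea of examining the lowest-AIC models for consistently appearing characteristics can be sketched as follows. All values below are hypothetical and are not results from this study: the candidate variable sets and log-likelihoods are invented, and the sketch assumes one parameter per variable plus an intercept, whereas categorical variables in a real model contribute one parameter per non-reference level.

```python
from collections import Counter


def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical candidate models: (variables included, maximized log-likelihood).
candidates = [
    (("age", "season"), -205.1),
    (("age", "season", "treatment"), -201.8),
    (("age", "season", "treatment", "point_source"), -199.0),
    (("age", "treatment", "point_source", "well_type"), -203.5),
    (("season", "well_type"), -210.4),
]


def rank_by_aic(models):
    """Return (AIC, variables) pairs sorted from lowest to highest AIC."""
    # +1 accounts for the intercept; one parameter per variable is assumed.
    scored = [(aic(ll, len(variables) + 1), variables) for variables, ll in models]
    return sorted(scored)


# Count how often each variable appears among the three lowest-AIC models.
top = rank_by_aic(candidates)[:3]
frequency = Counter(v for _, variables in top for v in variables)
```

Variables with the highest counts in `frequency` are those that appear most consistently among the best-ranked models.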

Overall, information collected by the well owner survey used here is, on its own, insufficient for developing a predictive model that can classify adverse results when there is a low prevalence of contaminated wells in the sample data and so few independent variables. Future work should explore additional factors, such as the hydrogeological characteristics where the well is located; see, for example, O'Dwyer et al. (2018), who incorporated both hydrogeological and well-based characteristics in their model to achieve promising predictive performance. Models with both well-based and external factors may have better discriminatory ability and be able to identify geographical areas within a region that may be at higher risk of bacteriological contamination.
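One common way to quantify the discriminatory ability mentioned above is the area under the ROC curve (AUC), the probability that a randomly chosen contaminated well receives a higher predicted risk than a randomly chosen uncontaminated one. The AUC is used here as an illustrative assumption; the sketch below is a minimal pairwise implementation with invented scores.

```python
def auc(scores, outcomes):
    """Area under the ROC curve computed by comparing every
    positive/negative pair; ties count as half a win."""
    positives = [s for s, y in zip(scores, outcomes) if y]
    negatives = [s for s, y in zip(scores, outcomes) if not y]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented predicted risks and observed results (1 = adverse, 0 = not adverse).
example_scores = [0.10, 0.40, 0.35, 0.80]
example_outcomes = [0, 0, 1, 1]
example_auc = auc(example_scores, example_outcomes)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, so this measure is informative even when the event rate is low, unlike raw classification accuracy.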

The authors of this study wish to acknowledge the hard work and contributions of the team at WDGPH in preparing and administering the survey and gathering responses, and those at PHOL who assisted in providing laboratory test results. We wish to acknowledge that the Wellington-Dufferin-Guelph region spans the traditional lands of the Anishinaabe, Haudenosaunee, Tionontati, Huron-Wendat, and Attawandaron peoples and their many treaty lands. The University of Guelph campus is located on the lands of the Dish with One Spoon Wampum and the Between the Lakes Treaty, with long-standing ties to the Mississaugas, Haudenosaunee, and Attawandaron peoples.

This work was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) (2018-04701 to LE Deeth) and a Canadian Foundation for Infectious Disease (CFID) Undergraduate Summer Research Grant (2020).

L.G.F. conceptualized the work, rendered support in data curation and formal analysis, investigated the project, developed the methodology, analyzed the software, validated the work, visualized the study, wrote the original draft, and reviewed and edited the article. L.E.D. conceptualized the work, developed the methodology, reviewed and edited the article, and supervised the work. O.B. conceptualized the work, developed the methodology, reviewed and edited the article, and supervised the work. L.A.T.W. conceptualized the work, developed the methodology, and reviewed and edited the article.

For the survey, free and informed consent of the participants, or their legal representatives, was obtained. The study protocol was approved by the appropriate Committee for the Protection of Human Participants: The Research Ethics Committee of WDGPH, Guelph, Ontario, Canada on 27 March 2018. The use of the data in the current study was approved by the University of Guelph Research Ethics Board (REB #20-04-005; June 10, 2020). A copy of the survey is available from the authors upon request.

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Arnade L. J. (1999) Seasonal correlation of well contamination and septic tank distance, Groundwater, 37 (6), 920–923. https://doi.org/10.1111/j.1745-6584.1999.tb01191.x

Bergsma W. (2013) A bias-correction for Cramer's V and Tschuprow's T, Journal of the Korean Statistical Society, 42 (3), 323–328. https://doi.org/10.1016/j.jkss.2012.10.002

Borchardt M. A., Stokdyk J. P., Kieke B. A., Muldoon M. A., Spencer S. K., Firnstahl A. D., Bonness D. E., Hunt R. J. & Burch T. R. (2021) Sources and risk factors for nitrate and microbial contamination of private household wells in the fractured dolomite aquifer of northeastern Wisconsin, Environmental Health Perspectives, 129 (6), 067004.1–067004.18.

Coleman B. L., Louie M., Salvadori M. I., McEwen S. A., Neumann N., Sibley K. & Braithwaite S. (2013) Contamination of Canadian private drinking water sources with antimicrobial resistant Escherichia coli, Water Research, 47 (9), 3026–3036. https://doi.org/10.1016/j.watres.2013.03.008

Garcia V., Sanchez J. S. & Mollineda R. A. (2012) On the effectiveness of preprocessing methods when dealing with different levels of class imbalance, Knowledge-Based Systems, 25 (1), 13–21. https://doi.org/10.1016/j.knosys.2011.06.013

Hosmer D. W. Jr, Lemeshow S. & Sturdivant R. X. (2013) Applied Logistic Regression, 3rd edn. Hoboken, NJ: Wiley. https://doi.org/10.1002/9781118548387

Hynds P. D., Thomas M. K. & Pintar K. D. M. (2014) Contamination of groundwater systems in the US and Canada by enteric pathogens, 1990–2013: a review and pooled-analysis, PLoS One, 9 (5), e93301. https://doi.org/10.1371/journal.pone.0093301

Invik J., Barkema H. W., Massolo A., Neumann N. F., Cey E. & Checkley S. (2019) Escherichia coli contamination of rural well water in Alberta, Canada is associated with soil properties, density of livestock and precipitation, Canadian Water Resources Journal/Revue Canadienne Des Ressources, 44 (3), 248–262. https://doi.org/10.1080/07011784.2019.1595157

James G., Witten D., Hastie T. & Tibshirani R. (2021) An Introduction to Statistical Learning: with Applications in R, 2nd edn. New York: Springer.

Krolik J., Maier A., Evans G., Belanger P., Hall G. & Joyce A. (2013) A spatial analysis of private well water Escherichia coli contamination in southern Ontario, Geospatial Health, 8 (1), 65–75. https://doi.org/10.4081/gh.2013.55

Krolik J., Evans G., Belanger P., Maier A., Hall G., Joyce A., Guimont S., Pelot A. & Majury A. (2014) Microbial source tracking and spatial analysis of E. coli contaminated private well waters in southeastern Ontario, Journal of Water and Health, 12 (2), 348–357. https://doi.org/10.2166/wh.2013.192

Latchmore T., Hynds P., Brown S. R., Schuster-Wallace C., Dickson-Anderson S., McDermott K. & Majury A. (2020) Analysis of a large spatiotemporal groundwater quality dataset, Ontario 2010–2017: informing human health risk assessment and testing guidance for private drinking water wells, Science of The Total Environment, 738, 140382. https://doi.org/10.1016/j.scitotenv.2020.140382

Mansournia M. A., Geroldinger A., Greenland S. & Heinze G. (2018) Separation in logistic regression: causes, consequences, and control, American Journal of Epidemiology, 187 (4), 864–870. https://doi.org/10.1093/aje/kwx299

Marchant-Shapiro T. (2015) Chi-Square and Cramer's V: What do You Expect? In: Statistics for Political Analysis: Understanding the Number, London: SAGE Publications, Inc, pp. 245–272. https://doi.org/10.4135/9781483395418.n9

Murphy H. M., Thomas M. K., Schmidt P. J., Medeiros D. T., McFadyen S. & Pintar K. D. M. (2016) Estimating the burden of acute gastrointestinal illness due to Giardia, Cryptosporidium, Campylobacter, E. coli O157 and norovirus associated with private wells and small water systems in Canada, Epidemiology & Infection, 144 (7), 1355–1370. https://doi.org/10.1017/S0950268815002071

O'Dwyer J., Hynds P. D., Byrne K. A., Ryan M. P. & Adley C. C. (2018) Development of a hierarchical model for predicting microbiological contamination of private groundwater supplies in a geologically heterogeneous region, Environmental Pollution, 237, 329–338. https://doi.org/10.1016/j.envpol.2018.02.052

Ontario Ministry of Health and Long-Term Care (2019) Safe Drinking Water and Fluoride Monitoring Protocol. Available at: https://www.health.gov.on.ca/en/pro/programs/publichealth/oph_standards/docs/protocols_guidelines/Safe_Water_Fluoride_Protocol_2019_en.pdf (Accessed 20 April 2022).

Ontario Ministry of the Environment, Conservation and Parks (2021) Water Supply Wells: Requirements and Best Practices. Available at: https://www.ontario.ca/document/water-supply-wells-requirements-and-best-practices (Accessed 20 April 2022, 13 April 2025).

Peduzzi P., Concato J., Kemper E., Holford T. R. & Feinstein A. R. (1996) A simulation study of the number of events per variable in logistic regression analysis, Journal of Clinical Epidemiology, 49 (12), 1373–1379. https://doi.org/10.1016/S0895-4356(96)00236-3

Public Health Ontario (2019) Water Sample Analysis. Available at: https://www.publichealthontario.ca/en/laboratory-services/public-health-inspectors-guide/phi-water?tab=1 (Accessed 20 December 2021, 11 April 2025).

Public Health Ontario (2021) Well Water Testing – Private Drinking Water. Available at: https://www.publichealthontario.ca/en/laboratory-services/well-water-testing?tab=0 (Accessed 20 December 2021).

Public Health Ontario (2025) Bacteriological Analysis of Drinking Water for Private Citizen, Single Household Only. Available at: https://www.publichealthontario.ca/-/media/Documents/Lab/drinking-water-private-citizen.pdf?sc_lang=en&rev=79428191483a41c1b848f142a5f31c3e&hash=284818E87AFC7EE090D5D823E88D7DE5 (Accessed 11 April 2025).

R Core Development Team (2020) R: A Language and Environment for Statistical Computing. Vienna, Austria: R Core Development Team. https://www.R-project.org/ (Accessed 20 May 2021).

RStudio Team (2020) RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. Retrieved from: http://www.rstudio.com/ (Accessed 20 May 2021).

Simpson H. (2004) Promoting the management and protection of private water wells, Journal of Toxicology and Environmental Health, Part A, 67 (20–22), 1679–1704. https://doi.org/10.1080/15287390490492296

Sun S., Pan W. & Wang L. L. (2010) A comprehensive review of effect size reporting and practices in academic journals in education and psychology, Journal of Educational Psychology, 102 (4), 989–1004. https://doi.org/10.1037/a0019507

Ugas M., Pearl D. L., Zentner S., Tschritter D., Briggs W., Manser D. & Trotz-Williams L. A. (2019) Examining the factors related to bacteriological testing of private wells in southern Ontario, Journal of Water and Health, 17 (6), 944–956. https://doi.org/10.2166/wh.2019.164

Uhlmann S., Galanis E., Takaro T., Mak S., Gustafson L., Embree G., Bellack N., Corbett K. & Isaac-Renton J. (2009) Where's the pump? Associating sporadic enteric disease with drinking water using a geographic information system, in British Columbia, Canada, 1996–2005, Journal of Water and Health, 7 (4), 692–698. https://doi.org/10.2166/wh.2009.108

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).
