ABSTRACT
Household water treatment (HWT) is recommended when safe drinking water is limited. To understand determinants of HWT adoption, we conducted a cross-sectional survey with 650 households across different regions in Haiti. Data were collected on 71 demographic and psychosocial factors and 2 outcomes (self-reported and confirmed HWT use). Data were transformed into 169 possible determinants of adoption across nine categories. We assessed determinants using logistic regression and, as machine learning methods are increasingly used, random forest analyses. Overall, 376 (58%) respondents self-reported treating or purchasing water, and 123 (19%) respondents had residual chlorine in stored household water. Both logistic regression and machine learning analyses had high accuracy (area under the receiver operating characteristic curve (AUC): 0.77–0.82), and the strongest determinants in models were in the demographics and socioeconomics, risk belief, and WASH practice categories. Determinants that can be influenced inform HWT promotion in Haiti. It is recommended to increase access to HWT products, provide cash and education on water treatment to emergency-impacted populations, and focus future surveys on known determinants of adoption. We found both regression and machine learning methods need informed, thoughtful, and trained analysts to ensure meaningful results and discuss the benefits/drawbacks of analysis methods herein.
HIGHLIGHTS
One-fifth of respondents had treated their water with chlorine-based HWT.
Regression analyses and machine learning identified determinants of HWT adoption with high accuracy.
Determinants that were influenceable in promotion of HWT were identified.
It is recommended to increase access to HWT and provide cash and education to emergency-affected populations.
Machine learning and regression analysis both need trained operators to conduct analysis, but machine learning analysis may be faster/easier.
INTRODUCTION
One aim of the United Nations Sustainable Development Goals is to ensure universal and equitable access to safely managed drinking water (WHO, UNICEF, and World Bank 2022). However, an estimated 703 million people did not have access to a basic water source in 2022 (WHO and UNICEF 2023). In areas without safe drinking water, using household water treatment (HWT) has been shown to improve microbiological water quality, decrease risks of recontamination, and reduce diarrheal disease (Wright et al. 2004; Clasen et al. 2007; Lantagne & Clasen 2012, 2013). Thus, HWT products (e.g., chlorine and filters) are recommended as an interim measure in contexts such as humanitarian emergencies (Global Task Force on Cholera Control 2017; WHO 2019).
Haiti is the poorest country in the Americas and has been affected by successive emergencies. HWT with chlorine was widely promoted in response to the 2010 earthquake and cholera outbreak (Gelting et al. 2013; Patrick et al. 2013). In an evaluation from 2012, 60% of urban and 78% of rural respondents self-reported using HWT, with >96% of respondents reporting using chlorine-based products (Cayemittes et al. 2012). Actual HWT use, however, has varied across contexts. In a program that included household education visits, 77% and 90% of households had free chlorine residual (FCR) in stored drinking water 2 and 10 months after the 2010 earthquake, respectively (Lantagne & Clasen 2012, 2013). In a different 2011 program without household visits, while 70% of respondents self-reported using HWT, only 26% of households had sufficient FCR in stored household drinking water (PSI Haiti 2011).
Recent emergencies in Haiti (another large earthquake, the COVID-19 pandemic, socioeconomic crisis, and political insecurity) have further destabilized the country (UNICEF 2022). In addition, international investment has shifted away from Haiti, with only 29% (68.2 million USD) of funding needs covered in Haiti in 2021, compared with 58% covered (220.8 million USD) in 2011 (OCHA 2011, 2021). To our knowledge, no recent research on the use and adoption of HWT products by the Haitian population has been published. Therefore, there is a need to understand current household water, sanitation, and hygiene (WASH) practices in general, and knowledge, attitudes, and practices around HWT in particular, in Haiti during the ongoing emergency.
Previous research assessing behavioral determinants of HWT use and adoption has primarily relied on qualitative epidemiological analysis and statistical methodologies such as logistic regression (Rainey & Harding 2005; Figueroa & Kincaid 2010; Wood et al. 2012; Dreibelbis et al. 2013), which require trained personnel and time to complete. Recently, machine learning methods have been increasingly applied to understanding health behavior and may offer resource-saving benefits for analyzing data collected in humanitarian emergencies (Conway & O'Connor 2016; dos Santos et al. 2019; Elamrani Abou Elassad et al. 2020; Zhu et al. 2022). Machine learning programs have the potential to reduce the time and training level needed to complete analysis. However, machine learning methods also have potential drawbacks for public health analysis. A concerning drawback is the ‘black box’ aspect of machine learning programs, including how analysis, variable prioritization, and default imputation occur.
The objective of this study was to assess the current determinants of HWT adoption in Haiti, to inform future programming, and to reduce the burden of diarrheal diseases. A secondary objective was to compare traditional regression and machine learning analyses and to understand model results, accuracy, time, and operator training efficiency.
METHODS
The non-governmental organizations (NGOs) Haiti Outreach (HO) and Deep Springs International (DSI) have long-term operations in the Hinche/Pignon and Léogâne regions of Haiti, respectively. These two NGOs have different aims and interventions. HO (haitioutreach.org) began in 1997 and works to support the drilling of wells, installation of local water systems, and training of community water management committees. DSI (deepspringsinternational.org) began manufacturing and selling chlorine solutions in 2008 and promotes HWT.
Data collection
We initially estimated a minimum sample size of 400 households (95% confidence interval and 5% error margin) to ensure a representative sample of the estimated populations in the locations surveyed. We then added 250 households to our sample size (total of 650) to have extra test data and account for the humanitarian context. Based on our sample size, we can detect a difference of 10–20% in outcomes (by NGO and community type) with 95% confidence and 80% power.
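The text does not state which formula produced the minimum sample size. As a minimal sketch, assuming Cochran's formula for a proportion in a large population with maximum variability (p = 0.5, an assumption not given in the text), a 95% confidence level and 5% margin of error give a value slightly below the 400 households reported, which is consistent with rounding up:

```python
import math

def cochran_sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> int:
    """Minimum sample size for estimating a proportion in a large population.

    z: z-score for the confidence level (1.96 for 95%)
    p: assumed proportion (0.5 is the most conservative choice; an assumption here)
    e: margin of error
    """
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

n_min = cochran_sample_size()
print(n_min)  # 385
```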
The target respondents were male or female heads of households and primary caregivers in urban, rural, or mountainous households served by the two partner NGOs to account for different environmental contexts and exposure to previous public health interventions. Due to the security situation, Tufts University staff were not able to travel to Haiti. A remote 2-day Zoom training was conducted to train data collectors from HO and DSI on ethics for human subjects research, project objectives, recruitment materials, informed consent forms, survey administration, and water quality testing.
On each survey day, data collectors from HO and DSI visited households in accessible zones of targeted areas and obtained oral informed consent from the respondents (if they were willing to participate). The work was approved by the Tufts University Institutional Review Board (STUDY00002129), the Haitian Ministry of Health (2122-8), and the DEVCOM ARL Human Research Protections Official (ARL 22-010). The survey was administered orally in Haitian Creole, with responses recorded in mWater (New York, USA, mwater.co) installed on enumerator cell phones. mWater is an open-access data management platform used to collect and analyze WASH data. Throughout data collection, Tufts University and local research team members met daily to review data, clarify concerns, and ensure data accuracy.
Enumerators measured total residual chlorine (TRC), an indicator of water quality and HWT use (PSI Haiti 2011; Cayemittes et al. 2012; Gelting et al. 2013; Patrick et al. 2013), at the household during the survey on stored household water provided by the respondent. TRC was measured using pool test kits (WWD Group, https://wwdevices.com/) that were shipped to Haiti via missionary air flights. Note that while the standard for effective disinfection is >0.2 mg/L free chlorine residual (FCR) (World Health Organization 2017), the presence of TRC (>0 mg/L) was used in this study because TRC kits were simpler to ship, train on, and use. TRC may slightly overestimate chlorination compared to FCR. TRC results were provided to the household and NGO to inform them about future water safety.
Data cleaning and preparation
The dataset was downloaded in ‘code’ format from mWater. Open-ended responses were translated from Haitian Creole into English, the community type (urban, rural, or mountainous) was assigned to each household based on Global Positioning System (GPS) coordinates collected by mWater, and all data were reviewed for accuracy.
Nine categories of determinants were developed in the behavior model (Figure 1): demographics and socioeconomics, WASH practices, HWT promotion, risk beliefs, social support, access and availability of HWT, user attitude and experiences with HWT, self-reported HWT use, and attitudes and feelings about WASH (SI Table S1). Each category was divided into sub-categories, for example, age and sex in demographics. The 71 survey questions were then transformed into 169 individual determinants within those categories and sub-categories.
Because a potential drawback of the easiest-to-use machine learning analysis programs is that they automatically impute missing data using default assumptions, we developed a second version of the dataset that was manually imputed. We developed an imputation scheme to assign rules to replace missing data with substituted values based on whether the survey answer was binary, categorical, continuous, ordinal, or checkbox (SI Table S2) to maintain the statistical properties of the dataset (Dixon 1979). For any household that did not provide a water sample, TRC was imputed as 0 mg/L. The risks of imputation, particularly with large datasets, are further described in the Discussion section.
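A type-based imputation scheme of this kind can be sketched as below. The actual rules are given in SI Table S2 and are not reproduced here, so the fill rules in this sketch (mode for binary and categorical answers, median for continuous and ordinal answers, unchecked for checkboxes) are placeholder assumptions. Only the TRC rule (missing water sample imputed as 0 mg/L) comes from the text. The function names are hypothetical.

```python
from statistics import median, mode

def impute(values, kind):
    """Replace None entries using a rule chosen by answer type (placeholder rules)."""
    observed = [v for v in values if v is not None]
    if kind in ("binary", "categorical"):
        fill = mode(observed)      # assumption: most frequent answer
    elif kind in ("continuous", "ordinal"):
        fill = median(observed)    # assumption: median of observed answers
    elif kind == "checkbox":
        fill = 0                   # assumption: a missing checkbox means unchecked
    else:
        raise ValueError(f"unknown answer type: {kind}")
    return [fill if v is None else v for v in values]

def impute_trc(trc_mg_per_l):
    """Rule stated in the text: no water sample is treated as TRC = 0 mg/L."""
    return [0.0 if v is None else v for v in trc_mg_per_l]

print(impute([1, None, 1, 0], "binary"))
print(impute_trc([0.5, None, 0.0]))
```

Writing the rules down explicitly, rather than relying on a tool's defaults, keeps the substitution auditable for each determinant.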
Data analysis
Descriptive statistics were calculated using R version 3.3.3 (R Foundation, Vienna, Austria), with differences between NGOs assessed using the Chi-squared test. Multicollinearity between independent numerical determinants was assessed using a Spearman correlation matrix (confirming the correlation if the coefficient is below −0.5 [negative correlation] or above 0.5 [positive correlation]). The correlation matrix is available in the SI.
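The multicollinearity screen can be illustrated with a short sketch. The analysis in the text was run in R; the version below is a plain-Python illustration that computes Spearman's coefficient as the Pearson correlation of average ranks and flags pairs beyond the ±0.5 threshold. The column names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def flag_correlated(columns, threshold=0.5):
    """Return (name_a, name_b, rho) for every pair with |rho| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        rho = spearman(a, b)
        if abs(rho) > threshold:
            flagged.append((name_a, name_b, round(rho, 2)))
    return flagged

# Hypothetical numerical determinants for five households.
columns = {
    "age": [25, 40, 31, 58, 47],
    "children_under5": [2, 1, 2, 0, 1],
    "liters_per_day": [10, 30, 12, 20, 15],
}
for pair in flag_correlated(columns):
    print(pair)
```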
Then, determinants of HWT adoption were determined using stepwise logistic regression and machine learning tools. Four analyses were initially conducted: (1) regression on the non-imputed dataset; (2) regression on the manually imputed dataset; (3) machine learning on the manually imputed dataset; and (4) machine learning on the non-imputed dataset (with default imputation in machine learning). Overall, 25 determinants (of the possible 169) were retained in each model. This number was chosen because it was the smallest number of determinants for which the models included at least two different determinant categories.
Stepwise logistic regression was conducted (using R) to determine which of the 169 determinants to include in modeling, with both the non-imputed and imputed datasets. Determinants with the highest p-values were successively removed from the model if their removal did not widely reduce (>0.5) McFadden's pseudo R2. For the final set of 25 determinants included in the model, the Akaike information criterion (AIC) and R2 were both available; we present R2 herein for ease of reading. To compare logistic regression results with the default indicator from machine learning, we also computed the average area under the receiver operating characteristic curve (AUC) from 100-times-repeated 10-fold cross-validation of the logistic regression with the R package caret (Kuhn 2008). The AUC provides a common indicator to report and compare logistic regression with machine learning (Christodoulou et al. 2019). The closer the AUC value is to 1, the better the model discriminates between households with and without the outcome (higher sensitivity and fewer false positives). Herein, we present categories of all 25 included determinants and descriptions of all statistically significant determinants within the overall group of 25.
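The two fit indicators named above can be made concrete with a small sketch. This is an illustration only: it computes McFadden's pseudo R2 and a rank-based AUC from predicted probabilities on a single toy dataset, and it does not reproduce the repeated cross-validation performed with caret. The outcome vector and probabilities are hypothetical.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of binary outcomes y given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi) for yi, pi in zip(y, p))

def mcfadden_r2(y, p_model):
    """1 - LL(model) / LL(null), where the null model predicts the outcome prevalence."""
    prevalence = sum(y) / len(y)
    ll_null = log_likelihood(y, [prevalence] * len(y))
    return 1 - log_likelihood(y, p_model) / ll_null

def auc(y, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = TRC present) and model-predicted probabilities.
y = [1, 0, 1, 0, 1, 0]
p = [0.9, 0.2, 0.7, 0.65, 0.6, 0.3]

print(round(mcfadden_r2(y, p), 2))
print(round(auc(y, p), 2))
```

An AUC of 0.5 corresponds to chance-level discrimination, which is why it serves as a common scale for comparing the regression and random forest models.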
Machine learning analyses were conducted using the WEKA (ml.cms.waikato.ac.nz/weka) free software package. WEKA was selected because it is freely available and easy to use and includes many different models, eight of which were applied to this dataset (BayesNet, J48, JRip, KStar, Logistic, Multilayer Perceptron, SMO, and RandomForest). Each of these eight models was pre-tested with default settings to estimate which model best fit the data, using the confirmed HWT use outcome from the dataset with manual imputation (to avoid default imputation). The RandomForest model was selected as the primary machine learning model for further analysis because this model is known to require less tuning than other models (it performs well with default settings). Relatedly, this model provided the highest accuracy for balanced outcomes in the 50:50 ratio (as the sample size was calculated for 40:60) with 10-fold cross-validation. The random forest model's relative attribute importance (defined as the product of the ‘mean decrease in impurity’ (Breiman et al. 2017; Parr et al. 2018) with the ‘number of nodes using that attribute’) was used as the indicator for selecting the most important determinants. Determinants were ranked from higher to lower relative importance, and the top 25 were selected for inclusion. A new random forest model was then built with these determinants. The accuracy and 10-fold cross-validation AUC were then calculated to allow comparison with the regression analysis. The same procedure was then followed with the dataset without manual imputation. In that case, data imputations for the determinants, but not the outcome, were performed using WEKA default tool settings. Note that the survey, data, code, and correlation tables are available in the SI.
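The ranking step described above can be sketched as follows. This is an illustration of the stated definition (mean decrease in impurity multiplied by the number of nodes using the attribute), not WEKA's own code. The attribute names and values are hypothetical, and the function name is an assumption.

```python
def top_k_by_relative_importance(attribute_stats, k=25):
    """Rank attributes by (mean decrease in impurity) x (number of nodes using them).

    attribute_stats maps an attribute name to a tuple
    (mean_decrease_in_impurity, nodes_using_attribute).
    Returns the k attribute names with the highest product, highest first.
    """
    relative_importance = {
        name: mdi * n_nodes for name, (mdi, n_nodes) in attribute_stats.items()
    }
    ranked = sorted(relative_importance, key=relative_importance.get, reverse=True)
    return ranked[:k]

# Hypothetical per-attribute output from a fitted random forest.
attribute_stats = {
    "respondent_age": (0.30, 120),
    "water_source": (0.25, 200),
    "radio_promotion": (0.40, 30),
    "attends_church": (0.10, 50),
}

print(top_k_by_relative_importance(attribute_stats, k=2))
# ['water_source', 'respondent_age']
```

Weighting by node count favors attributes that are used often across the forest, so an attribute with a high impurity decrease in only a few nodes (here, the hypothetical `radio_promotion`) can rank below a more frequently used one.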
Lastly, based on the initial results, three additional analyses were conducted in machine learning only. As initial results appeared to show community type and NGO serving the area were determinants of adoption, we conducted additional analysis to determine if the populations served by the NGO were different and/or if the NGOs had a different effect on the populations using outcome variables of community type and NGO. In addition, as the initially identified strongest determinants of adoption were demographic and socioeconomic in the machine learning results, we repeated the machine learning analysis (with default imputation) for confirmed HWT use, removing the demographic and socioeconomic determinants to test the strength of the other types of determinants.
RESULTS
DSI and HO approached 654 households between 7 and 17 February 2022, and 650 respondents completed the survey in urban, rural, and mountainous areas of Léogâne, Pignon, and Hinche, Haiti. This high response rate is attributed to community trust in experienced NGOs. Respondents were roughly evenly divided across the two organizations and the three community types (15–18% of households in each organization-by-community-type group) (Table 1). Most respondents were female, with an average age of 40.7 years. Just over half (56–59%) of female and male heads of households (HOH) reported being able to read and write. Under half of respondents reported that someone in the family worked for a salary (39%) or that the family received remittances (41%).
| Category | Sub-category | Characteristic of the determinant | Total (n = 650) | Mountain (n = 211): DSI (n = 112) | Mountain (n = 211): HO (n = 99) | Rural (n = 209): DSI (n = 106) | Rural (n = 209): HO (n = 103) | Urban (n = 230): DSI (n = 119) | Urban (n = 230): HO (n = 111) |
|---|---|---|---|---|---|---|---|---|---|
| Outcome | | n (%) reporting treating water | 357 (55%) | 63 (56%) | 58 (59%) | 47 (44%) | 53 (51%) | 70 (59%) | 66 (59%) |
| | | n (%) purchasing water | 19 (3%) | 6 (5%) | 2 (2%) | 2 (2%) | 3 (3%) | 6 (5%) | 0 (0%) |
| | | n (%) with TRC > 0 mg/L | 123 (19%) | 35 (31%) | 3 (3%) | 31 (29%) | 5 (5%) | 49 (41%) | 0 (0%) |
| | | n (%) with TRC ≥ 0.2 mg/L | 81 (12%) | 19 (17%) | 3 (3%) | 21 (20%) | 5 (5%) | 33 (28%) | 0 (0%) |
| | | n (%) with TRC = 0 mg/L | 289 (44%) | 59 (53%) | 50 (51%) | 40 (38%) | 38 (37%) | 49 (41%) | 53 (48%) |
| | | n (%) willing to test TRC | 412 (63%) | 94 (89%) | 53 (54%) | 71 (67%) | 43 (42%) | 98 (82%) | 53 (48%) |
| | | n (%) who did not treat because they reported ‘I buy water’ | 96 (15%) | 20 (18%) | 3 (3%) | 11 (10%) | 13 (13%) | 29 (24%) | 20 (18%) |
| | | Average number of times per week respondents reported treated water | 1.9 | 2.4 | 1.8 | 1.8 | 2.0 | 2.0 | 1.6 |
| Demographics and socioeconomics | Age | Average age of respondents | 40.7 | 41.9 | 41.9 | 41.4 | 43 | 40.3 | 35.8 |
| | Sex | n (%) female | 476 (73%) | 93 (83%) | 83 (84%) | 64 (60%) | 75 (73%) | 77 (65%) | 84 (76%) |
| | Education | n (%) with female HOH able to read and write | 383 (59%) | 76 (68%) | 44 (44%) | 66 (62%) | 49 (48%) | 89 (75%) | 59 (53%) |
| | Education | n (%) with male HOH able to read and write | 361 (56%) | 54 (48%) | 58 (59%) | 57 (54%) | 61 (59%) | 66 (55%) | 65 (59%) |
| | Income | n (%) with anyone in the family who worked for a salary | 251 (39%) | 44 (39%) | 35 (35%) | 35 (33%) | 32 (31%) | 63 (53%) | 42 (38%) |
| | Income | n (%) whom family has received money from someone outside of Haiti in the past 12 months | 264 (41%) | 48 (43%) | 31 (31%) | 56 (53%) | 19 (18%) | 84 (70%) | 26 (23%) |
| | Possession | n (%) with wired electricity | 349 (54%) | 16 (14%) | 54 (55%) | 41 (39%) | 73 (71%) | 81 (68%) | 84 (76%) |
| | Cultural belief | n (%) who attended church | 558 (86%) | 94 (84%) | 96 (97%) | 80 (75%) | 94 (91%) | 98 (82%) | 100 (90%) |
| WASH practices | WASH situation | n (%) for which data collectors observed a dedicated space for washing hands | 124 (19%) | 13 (12%) | 30 (30%) | 29 (27%) | 26 (25%) | 14 (12%) | 10 (9%) |
| | | n (%) for which data collectors observed soap for hands | 104 (16%) | 14 (13%) | 28 (28%) | 20 (19%) | 25 (24%) | 7 (6%) | 10 (9%) |
| | | n (%) who paid for water | 212 (51%) | 33 (29%) | 46 (46%) | 14 (13%) | 35 (34%) | 34 (29%) | 50 (45%) |
| | Water use | Average number of liters of water that households drank per day | 21.6 | 29.5 | 9.0 | 36.4 | 12.3 | 30.8 | 9.7 |
| HWT promotion | Promotion | n (%) who had heard promotion about treating their water at the ‘Radio’ | 340 (52%) | 80 (71%) | 21 (21%) | 69 (65%) | 25 (24%) | 98 (82%) | 47 (42%) |
| Risk beliefs | Events on diarrhea | n (%) with someone in the home who had diarrhea or cholera in the last week | 111 (17%) | 28 (25%) | 11 (11%) | 19 (18%) | 11 (11%) | 33 (27%) | 10 (9%) |
| | Risk of diarrhea | n (%) who thought people get diarrhea because of ‘Poor Hygiene’ | 305 (47%) | 51 (46%) | 26 (26%) | 55 (52%) | 48 (47%) | 70 (59%) | 55 (50%) |
| | | n (%) who thought people get diarrhea because of ‘Dirty Environment’ | 139 (21%) | 22 (20%) | 9 (9%) | 29 (27%) | 12 (12%) | 45 (38%) | 22 (20%) |
| | | n (%) who thought people get diarrhea because of ‘Bad Water’ | 514 (79%) | 103 (92%) | 52 (53%) | 100 (94%) | 74 (71%) | 109 (92%) | 76 (68%) |
| | Risk from drinking water | n (%) who knew water is safe to drink because ‘Water is treated’ | 506 (78%) | 101 (90%) | 48 (48%) | 95 (90%) | 71 (69%) | 109 (92%) | 82 (74%) |
| | | n (%) who knew water is not safe to drink because it ‘Looks Dirty’ | 321 (49%) | 83 (74%) | 37 (37%) | 89 (84%) | 12 (12%) | 79 (66%) | 21 (19%) |
| | | n (%) who knew water is not safe to drink because it ‘Smells Bad’ | 189 (29%) | 41 (37%) | 8 (8%) | 48 (45%) | 22 (21%) | 48 (40%) | 22 (19%) |
| Social support | Encouragement to treat water | n (%) who was encouraged by ‘NGOs’ to drink only treated water | 138 (21%) | 27 (27%) | 13 (13%) | 25 (24%) | 26 (25%) | 24 (20%) | 23 (21%) |
| | | n (%) who was encouraged by ‘Clinics’ to drink only treated water | 127 (20%) | 8 (7%) | 31 (31%) | 7 (7%) | 34 (33%) | 8 (7%) | 39 (35%) |
About one-fifth of households (19%) had a dedicated space for washing hands, with 16% of households having soap (Table 1). About half of respondents had heard promotion about water treatment on the radio, and one-fifth (20–21%) had been encouraged by NGOs and clinics to use treated drinking water. Respondents knew transmission routes for diarrhea, including bad water (79%), poor hygiene (47%), and dirty environment (21%). Respondents also knew water is safe to drink when it is treated (78%). Respondents in DSI-served areas were more likely to have heard radio promotion (p < 0.001), less likely to have NGO/clinic promotion (p < 0.001), and more likely to know transmission routes for diarrhea (p < 0.001–0.004).
Overall, 55% of respondents reported treating their water, mostly (93%) with chlorine-based products (Table 1). Of the total, 63% consented to water testing, and 30% of those (19% overall) had TRC in stored household drinking water. Respondents who did not provide their water were more likely to be surveyed by HO (p < 0.001) or be from rural areas (p = 0.001). Households in DSI areas were more likely to have TRC (29–41% compared to 0–5% across community types; p < 0.001). Respondents who reported treating water said they treated an average of 1.9 times per week.
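As a worked illustration of the DSI versus HO comparison, the sketch below runs a Pearson Chi-squared test on a 2×2 table built from the Table 1 counts (households with TRC > 0 mg/L, summed across community types). Two choices here are assumptions rather than the authors' stated procedure: all 650 households are used as the denominator, and no continuity correction is applied. The exact statistic may therefore differ from the one computed in R.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared test (1 degree of freedom, no continuity correction).

    Table layout:        outcome yes | outcome no
        group 1               a      |     b
        group 2               c      |     d
    """
    n = a + b + c + d
    observed = ((a, b), (c, d))
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Upper-tail probability of a Chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts summed from Table 1 (mountain + rural + urban).
dsi_with_trc, dsi_total = 35 + 31 + 49, 112 + 106 + 119   # 115 of 337
ho_with_trc, ho_total = 3 + 5 + 0, 99 + 103 + 111          # 8 of 313

chi2, p_value = chi2_2x2(
    dsi_with_trc, dsi_total - dsi_with_trc,
    ho_with_trc, ho_total - ho_with_trc,
)
print(round(chi2, 1), p_value < 0.001)
```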
In the multicollinearity analysis, few correlations between independent numerical determinants were found (see SI). In logistic regression, significant correlations were considered during the analysis to avoid multicollinearity. In machine learning, any large correlations should generally result in only one correlated determinant being identified as a strong determinant, since including the second in a model would provide relatively little additional discriminatory power, but this is not guaranteed.
As described, we performed analyses with both imputed and not-imputed datasets to see if this impacted the machine learning results. Our results did not widely vary between imputed and not (SI Tables S3 and S4). Therefore, only the results obtained for the logistic regression without imputation (default) and the random forest model with imputation (default) are presented herein. These are highlighted because this would be the simplest analysis for a data operator to complete.
Self-reported use
. | Self-reported use . | Confirmed use . | |||
---|---|---|---|---|---|
Regression not-imputed . | Random forest default imputed . | Regression not-imputed . | Random forest default imputed . | Random forest default imputed – D&S removed . | |
Model characteristics | |||||
Total N in model | 507 | 650 | 354 | 650 | 650 |
R2 | 0.34 | – | 0.46 | – | – |
Balanced accuracy | – | 62.8 | – | 83.2 | 83.2 |
AUC | 0.76 | 0.75 | 0.78 | 0.84 | 0.82 |
Results by determinant categories | |||||
Demographics and socioeconomics (A) | 6 | 18 | 6 | 15 | |
Community type (B) | 1 | 1 | 1 | ||
Organization (C) | 1 | 1 | 1 | ||
WASH practices (D) | 7 | 5 | 5 | 6 | 9 |
HWT promotion (E) | 1 | 2 | 1 | ||
Risk beliefs (F) | 9 | 8 | 1 | 13 | |
Social support (G) | 1 | 1 | |||
Access and availability of HWT (H) | 1 | ||||
User attitude and experience with HWT (I) | 1 | ||||
Self-reported HWT use (J) | 1 | 1 | 1 | ||
Attitudes and feelings about WASH (K) |
Note: Each model has 25 determinants. The number of determinants per category is provided.
In the random forest analysis, all 650 respondents and 131 determinants were included because the machine learning program completed default imputation. The model had an accuracy value of 62.8% and an AUC value of 0.63 (Table 2; SI Tables S7 and S8). This model included 20 demographics and socioeconomics determinants (including community type and organization) and 5 WASH practice determinants (Figure 2). Both the community type and NGO serving the area were determinants, with more respondents from DSI reporting treating their water.
Overall, the two self-reported use models had similar predictive performance but selected different determinants (SI Tables S5 and S8). Eleven determinants (of 25, 44%) were selected by both models: the community type, six from the demographics and socioeconomics category (respondent sex, education of male or female head of household, household religion, and income through salary or remittances in the last 12 months), and four from the WASH practice category (water source, primary person collecting water, frequency of water collection, and volume of water the household drinks per day).
Confirmed HWT use
Overall, 412 households (63%) and 138 determinants were included in the confirmed HWT use logistic regression model, as 238 households (37%) did not provide a water sample (Table 2, SI Tables S9 and S10). The 25-determinant model had an R2 value of 0.46 and an AUC value of 0.78 and included eight risk belief, seven demographics and socioeconomics (including organization), five WASH practices, two HWT promotion, and one each of social support, user attitude and experience with HWT, and self-reported HWT use determinants (SI Table S9). Among the demographics and socioeconomics category, results suggest that older respondents (p < 0.001), respondents earning income through a salary (p < 0.001), respondents of a certain religion (p = 0.019), and respondents with plastic/tarp observed for household walls (p = 0.025) were more likely to have TRC in household drinking water. In addition, respondents who thought people get diarrhea because of poor hygiene (risk belief, p = 0.005) and dirty environment (risk belief, p = 0.003), who thought they or family members can get dehydration (risk belief, p = 0.011) and infection (risk belief, p = 0.001) from drinking water, who stored water in smaller containers (WASH practices, p = 0.021), and who received training on HWT (HWT promotion, p = 0.029) were more likely to have TRC in drinking water. Respondents were less likely to have water with TRC if they thought diarrhea can be prevented by storing water in closed containers (risk belief, p = 0.023), if they thought they or family members can get diarrhea from drinking water (risk belief, p = 0.042), if a young boy was the primary person collecting water (WASH practices, p = 0.004), or if they stored water in open/large containers such as buckets, clay pots, drums, or cisterns (WASH practices, p = 0.004). The community type was not a determinant in this model, but NGO serving the area (p = 0.021) was, with more respondents from DSI having TRC.
Overall, 650 respondent answers and 169 determinants were included in the random forest model (Table 2, SI Tables S11 and S12). The final model had an accuracy value of 83.2% and an AUC value of 0.84, with 17 demographics and socioeconomics (including community type and organization), 6 WASH practice determinants, and 1 each of risk belief and access and availability of HWT. Both the community type and NGO serving the area were determinants, with more respondents from DSI having treated water.
Nine (of 25) determinants were shared between the regression and initial random forest with imputation analyses (SI Tables S9 and S12), including the NGO serving the area, five from the demographics and socioeconomics category (respondent age, marital status, number of children <5 living in household, remittances received in the past 12 months, and religious denomination), two from the WASH practice category (primary person collecting water or treating water), and one from the access and availability of HWT category (the place where respondents obtain products to treat water).
When the random forest model was re-run with demographics and socioeconomics determinants manually removed, the model had an accuracy value of 83.2% and an AUC value of 0.82, with 13 risk beliefs, 9 WASH practices, and 1 each of HWT promotion, access and availability of HWT, and self-reported HWT use determinants (SI Table S15).
NGO and community-type analysis
NGO serving the area was found to be a determinant in three models (Table 2). When the random forest model was run with the NGO serving the area as the outcome, the NGO was modeled with an accuracy value of 97.8%. Therefore, respondents from DSI and HO areas could be distinguished with high accuracy based on 16 demographic and socioeconomic factors, 5 WASH practices, 2 risk beliefs, and 1 each of access and availability of HWT and user attitude and experiences with HWT determinants (SI Table S13). Community type was also found to be a determinant in three models (Table 2). When the random forest model was run with community type as the outcome, community type was modeled with an accuracy value of 71.5%. The 25 included determinants were 18 demographics and socioeconomics, 5 WASH practices, and 1 each of access and availability of HWT and user attitude and experiences with HWT (SI Table S14).
DISCUSSION
To assess and understand the current WASH knowledge, attitudes, and practices in Haiti, and to identify determinants of HWT adoption, we conducted a cross-sectional study involving 650 households with two local NGOs in three different community types. We measured self-reported and confirmed HWT use with 71 demographic and psychosocial questions transformed into 169 potential determinants organized into nine major categories. We observed that (1) fewer respondents reported treating or purchasing water than in previous studies, although this was influenced by NGOs operating in the area; (2) determinants of adoption were mostly in demographics and socioeconomics, risk belief, and WASH practice categories, although exact determinants varied by analysis type; (3) determinants of adoption that are influenceable (as opposed to fixed) can provide insight for promoting HWT in Haiti; and (4) there are benefits and drawbacks to both regression and machine learning analysis to consider in future analysis.
Overall, 55% of respondents reported treating water, and 19% had TRC > 0 mg/L in drinking water during the survey visit. While these percentages are lower than or equal to those observed previously (Cayemittes et al. 2012; Lantagne & Clasen 2012, 2013), they indicate that about one-fifth of respondents are treating water. In this study, about the same proportion of respondents reported using chlorine-based liquid products to treat water (39%, compared to 21–40% in 2010–2012), but fewer reported using chlorine tablets or powder products (10% herein, 61–74% in 2010–2012). These changes over time could be attributable to market and HWT promotion disruptions that have occurred during the ongoing complex emergency in Haiti. In addition, we observed that more respondents from DSI self-reported treating their water (8–12% for DSI and 9–10% for HO), were willing to provide a water sample for testing (17–23% for DSI and 10–13% for HO), and had TRC in their water (8–12% for DSI and 0–1% for HO). Including the NGO serving the area as a determinant improved the models, especially with random forest modeling. As DSI focuses on HWT manufacture and promotion, and HO on water supply and community development, this result highlights the benefits of long-term, consistent educational programming within public health interventions to promote HWT. This result was also observed in focus group discussions conducted as part of the larger study. During these discussions, participants in DSI locations had more scientific/empirical beliefs about distinguishing safe and unsafe water and the cause-and-effect relationship between unsafe water and illness. It is expected that if we were investigating water supply (and not HWT), respondents in HO areas would have more knowledge and improved practices.
We assessed HWT using two outcomes: self-reported and confirmed use. Models of the determinants of adoption for these outcomes had AUCs of 0.63–0.77 for self-reported use (with higher AUC values with regression analysis) and 0.78–0.84 for confirmed use (with higher values with random forest). These values are slightly lower than those found in other studies, in which AUC values of 0.85 and 0.94 were estimated for models of HWT practices in Nepal and Indonesia, respectively (Daniel et al. 2019, 2021). However, these results suggest that the models developed in this study are moderately accurate (Greiner et al. 2000) and can distinguish sufficiently well respondents who report, or use, HWT. In addition, we observed that self-reported use was not included as a determinant of confirmed use in models. However, some related determinants (e.g., ‘who is the primary person that treats the water’) were included in confirmed use models. Therefore, some determinants related to self-reported use help inform confirmed use.
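To make the AUC values above concrete, the following minimal sketch computes AUC directly from its rank-based definition: the probability that a randomly chosen HWT user receives a higher model score than a randomly chosen non-user, with ties counted as one half. It is an illustration only, not the code used in this study; the function name `auc` and the toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = HWT use, 0 = no HWT use (not survey data)
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

Under this definition, a value of 0.5 corresponds to a model that ranks users and non-users no better than chance, and a value of 1.0 to a model that ranks every user above every non-user.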
We found that determinants from demographics and socioeconomics, risk beliefs, and WASH practice categories were most frequently included in self-reported and confirmed HWT use models. These results support existing research on the importance of socioeconomic characteristics and WASH practices as determinants of HWT uptake (Boateng et al. 2013; Tsai et al. 2020; Daniel et al. 2021). However, while demographics and socioeconomics dominated as a category, dropping these determinants did not reduce the accuracy of the confirmed use random forest model; instead, risk belief determinants then dominated. Therefore, beliefs and attitudes do have importance in determining confirmed HWT use. It is hypothesized that, overall, the determinants in the category of risk belief are correlated with the determinants in the category of demographics and socioeconomics, but we did not see a correlation between individual determinants.
When considering how these results can inform the NGO strategy for increasing HWT uptake in Haiti, the determinants can be separated into ‘fixed’ determinants (such as age) and ‘influenceable’ determinants. The determinants found in both regression and random forest modeling that were influenceable included: income, place where HWT treatment products are obtained, and (when including the models with demographics and socioeconomics excluded and an NGO operating in the area) risk beliefs and HWT knowledge. It is recommended for NGOs to focus on strategies to get cash to families (currently a common humanitarian response; WHO, UNICEF, and World Bank 2022), make HWT products available in the market, and provide education on the risks of untreated water and the benefits of treated water. In addition, to ease the time and cost burden of surveying, it is recommended to focus future survey questions on demographic (for targeting purposes) and influenceable (for knowledge of change) determinants of adoption.
We used WEKA, an easy-to-use open-source program, for machine learning analysis. There are benefits and drawbacks of machine learning compared to regression analysis. The perceived benefits include reduced need for human data manipulation and reduced time for analysis. Ideally, a less-trained analyst could provide results quickly to inform humanitarian response. The main perceived drawback is the ‘black box’ aspect of machine learning. For example, it is not clear how the analysis is conducted, how imputation happens, or why some determinants are identified as stronger than others. While our models had similar accuracy (0.63–0.84 for random forest and 0.76–0.78 for regression), different determinants were included in each model. We believe, but cannot confirm, that part of the reason machine learning models returned more demographic and socioeconomic determinants is that continuous determinants (which the demographics and socioeconomics category had more of) are estimated to be more important in machine learning models because there are more possible values. Ultimately, the black box nature of machine learning makes it difficult to understand how determinants were selected. Relatedly, in our analysis using manual imputation, we did not find imputation to substantially affect results (with an accuracy value of 61.4–80.0% for non-imputed and of 62.8–83.2% for imputed in machine learning and an R2 value of 0.34–0.46 for non-imputed and of 0.24–0.59 for imputed in regression). However, imputation (particularly in datasets with high amounts of missing data) often impacts results, and our imputation results herein should not be generalized. Overall, our results suggest that machine learning models should not be run quickly with default settings but applied thoughtfully by trained analysts (which may reduce some of the time savings of machine learning). In our experience conducting this work, machine learning analysis required less time to complete than regression analysis.
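The hypothesis above, that determinants with more possible values can appear more important in tree-based models, can be illustrated with a small simulation. The sketch below is not WEKA's implementation and does not use the survey data; it assumes a simple Gini-impurity split criterion and compares a pure-noise binary feature with a pure-noise continuous feature. All names (`gini`, `best_split_gain`, `mean_gains`) and parameter values are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_gain(values, labels):
    """Largest Gini impurity decrease over all candidate thresholds of one feature."""
    n = len(labels)
    parent = gini(labels)
    best = 0.0
    # Every distinct value except the maximum is a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def mean_gains(trials=100, n=100, seed=0):
    """Average best-split gain for a noise binary vs. a noise continuous feature."""
    rng = random.Random(seed)
    binary_total = 0.0
    continuous_total = 0.0
    for _ in range(trials):
        labels = [rng.randint(0, 1) for _ in range(n)]      # outcome unrelated to features
        binary = [rng.randint(0, 1) for _ in range(n)]      # one possible threshold
        continuous = [rng.random() for _ in range(n)]       # up to n - 1 thresholds
        binary_total += best_split_gain(binary, labels)
        continuous_total += best_split_gain(continuous, labels)
    return binary_total / trials, continuous_total / trials


binary_gain, continuous_gain = mean_gains()
```

Neither feature carries any information about the outcome, yet the continuous feature offers many more candidate thresholds, so its best split tends to show a larger apparent impurity decrease by chance alone. This is one general mechanism consistent with the explanation proposed above; whether it accounts for the results in this study cannot be confirmed from the sketch.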
This study had limitations. First, we measured TRC as the use outcome, which excludes confirmation of non-chlorine methods of treatment. However, <10% of respondents self-reported using other HWT methods (such as boiling or filtration), so we do not feel that this limited our analysis. Second, 37% of respondents did not provide a water sample for testing, because they did not have water for drinking at the time, had a limited amount, or did not want to provide a sample. Third, while the survey was designed to include questions on contextual, demographic, and psychosocial determinants, some categories (demographics and socioeconomics, risk beliefs) were tested with more determinants and thus had more weight. As there was always more than one determinant tested for each category, and determinants in less-represented categories were never all selected, we do not believe this impacted our results. Lastly, this was a collaboration between computer scientists and public health epidemiologists, and sometimes language differences between sectors led to different interpretations. For example, computer science would use the term ‘prediction’ for this work, but, in public health, we have avoided ‘prediction’ (based on reviewer comments) and used ‘determinants of adoption’.
CONCLUSIONS
Approximately one-fifth of respondents in our survey had treated their water with chlorine-based HWT. We were able to use machine learning and regression analyses to identify determinants of HWT adoption in Haiti, with a high degree of accuracy. We identified determinants that are influenceable (as opposed to fixed determinants such as ‘age’). These influenceable determinants can be used in promoting HWT in Haiti in the future. Based on our results, it is recommended to increase access to HWT and provide cash and education to emergency-affected populations. Lastly, we found that machine learning and regression analysis both need trained operators to conduct analysis, although machine learning analysis may offer the benefit of reduced time for analysis.
ACKNOWLEDGEMENTS
We thank the Defense Advanced Research Projects Agency (DARPA) for funding this research (Award W911NF21C0001) as well as the ‘BecauseWe’ Research Team (Raytheon BBN Technologies, Kairos Research, and George Washington University) for assistance in designing the tools. We are also grateful to Deep Springs International and Haiti Outreach for their contribution and logistical support, to the enumerators who conducted all household surveys, and to respondents for providing information regarding their experiences with HWT use and for welcoming us into their homes for water sampling. This material is based upon work supported by the DARPA and the Army Research Office (ARO) under Contract No. W911NF-21-C-0001. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of DARPA or the US Government.
DATA AVAILABILITY STATEMENT
All relevant data are available from https://tufts.box.com/s/mkorujb08ew48vrrcxmjzny5j1m8enuv.
CONFLICT OF INTEREST
The authors declare there is no conflict.