Microbial source tracking of fecal contamination in Laguna Lake, Philippines using the library-dependent method, rep-PCR

Laguna Lake is an economically important resource in the Philippines, with reports of declining water quality due to fecal pollution. Currently, monitoring methods rely on counting fecal indicator bacteria, which does not supply information on potential sources of contamination. In this study, we predicted sources of Escherichia coli in lake stations and tributaries by establishing a fecal source library composed of rep-PCR DNA fingerprints of human, cattle, swine, poultry, and sewage samples (n = 1,408). We also evaluated three statistical methods for predicting fecal contamination sources in surface waters. Random forest (RF) outperformed k-nearest neighbors (kNN) and discriminant analysis of principal components (DAPC) in terms of average rates of correct classification (ARCC) in two- (84.85%), three- (82.45%), and five-way (74.77%) categorical splits. Overall, RF exhibited the most balanced prediction, which is crucial for disproportionate libraries. Source tracking of environmental isolates (n = 332) revealed the dominance of sewage (47.59%) followed by human sources (29.22%), poultry (12.65%), swine (7.23%), and cattle (3.31%) using RF. This study demonstrates the promising utility of a library-dependent method in augmenting current monitoring systems for source attribution of fecal contamination in Laguna Lake. This is also the first known report of microbial source tracking using rep-PCR conducted in surface waters of the Laguna Lake watershed.


INTRODUCTION
supporting information for policy-making and implementation for the rehabilitation and improvement of Laguna Lake and its tributaries.

Site description
Laguna Lake has a total surface area of 900 km² and an estimated holding capacity of 2.19 billion m³. Its watershed area of 3,820 km² extends across the provinces of Rizal and Laguna, some towns in Batangas and Cavite, and some cities in Metro Manila (Santos-Borja & Nepomuceno 2006). Stations in Laguna Lake and its tributaries were selected using the monitoring information of the Laguna Lake Development Authority (LLDA) (Figure 1). From a total of nine lake stations and 27 tributaries monitored by the agency, three lake stations and eight tributaries were shortlisted on the basis of having (1) the highest coliform count and (2) the greatest concentration of possible host sources (e.g., farms near the tributaries). For the lake stations, these were the East Bay (LS2), Northwest Bay (LS5), and South Bay (LS8). These stations are near the mouths of the tributaries, which accounts for their high levels of fecal contamination. For the tributaries, the following were selected to represent regions adjacent to the lake: Bagumbayan (TR1), Mangangate (TR2), and Tunasan (TR4) in the National Capital Region (NCR); Sapang Baho (TR3) in Rizal; and Biñan (TR5), Pila (TR6), San Cristobal (TR7), and Sta. Rosa (TR8) in Laguna. Land-use assessment in the vicinity of the lake reveals wide built-up areas around the northern tributaries (TR1-TR4), as they are situated in Metro Manila. The southern tributaries (TR5-TR8) have fewer industrialized areas and are surrounded by agricultural lands, but these are increasingly being converted to residential and commercial areas (Tanganco et al. 2019).

Figure 1 | Sampling location. Agricultural (red; feces from chickens, cattle, and swine) and domestic (blue; sewage and human feces) sources were used to construct the host source library which was used to source track thermotolerant E. coli isolates obtained from the environment (LS, Laguna Lake stations; TR, tributaries). Inset shows the geographical location relative to the island of Luzon. Please refer to the online version of this paper to see this figure in colour: doi: http://dx.doi.org/10.2166/wh.2021.119.

After the environmental sites were selected, nearby locations of possible host sources were identified. Domestic host sources were obtained from sewage treatment facilities of Manila Water and Laguna Water, as well as from sewage waters of Metro Manila draining toward Laguna Lake. In addition, human feces were collected from municipalities near the tributaries. These were Barangay Cupang, Muntinlupa, and the rural health unit of the municipality of Pila, Laguna. Human feces were obtained with clearance from the University of the Philippines Manila Research Ethics Board (UPMREB code: 2018-356-01). Meanwhile, agricultural host sources were collected in regions that focused on agricultural activities. These were Rizal and Laguna, located in the northern and southern boundaries of the lake, respectively. Fecal samples were collected from small backyard farms, specialized farms (e.g., piggeries, poultry farms, and cattle farms), and pastures.

Sample collection
Samples were categorized as (1) environmental (i.e., water from lake and tributaries), (2) agricultural (feces from chickens, ducks, cattle, and swine), and (3) domestic (sewage and human feces). Water and sewage samples (1 L) were obtained from the surface using grab sampling and were transferred to sterile wide-mouth water bottles (Nalgene, USA). Meanwhile, fecal samples were collected using a spatula and then transferred into either a stool container or a polypropylene bag. All samples were stored in ice and were transported to the laboratory for processing within 24 h after collection. Sample collection was done monthly from July 2017 to July 2019 during the wet and dry seasons.

Isolation and characterization of thermotolerant E. coli
Samples were serially diluted in sterile conical tubes using 0.9% saline solution as the diluent. Prior to serial dilution, fecal samples from each host source at a given site were pooled and 10 g were aseptically transferred into a sterile flask containing the diluent. After vigorous mixing, the mixture was serially diluted up to 10⁻⁷. Selected dilutions (human and animal feces = 10⁻⁷ to 10⁻¹⁰, sewage = 10⁻⁴ to 10⁻⁷, tributary water = 10⁻⁴ to 10⁻⁷, lake water = undiluted), done in duplicate, were filtered through a GN-6 Metricel membrane (47 mm diameter, 0.45 μm pore size; Pall Corp., USA) using a vacuum pump (Millipore, USA). The membrane filters were placed on modified membrane-thermotolerant E. coli (mTEC) agar (BD Difco, USA) and incubated at 37°C for 2 h, then at 42°C for 18-24 h. Presumptive E. coli isolates, characterized by blue to violet colonies on mTEC plates, were further streaked on eosin methylene blue agar (EMBA; BD BBL, USA) for confirmation. All isolates that exhibited a green metallic sheen on EMBA were selected for DNA extraction and molecular fingerprinting.

Source tracking
Analysis of DNA fingerprints, construction of the host library, and source tracking were done in R v.3.6.3 (R Core Team 2020), following Labrador et al. (2020) with modifications. Band positions were binned by rounding up their molecular weight to the nearest 50 bp. These were then converted into a binary sequence based on their presence (1) or absence (0) across samples. Binary sequences from both primers were integrated to form a composite profile to be used in the analysis. Isolates from a specific host source having identical profiles were collapsed into a single observation. The resulting data were partitioned into two sets: (1) the training dataset, which contained profiles from agricultural and domestic sources; and (2) the unknown dataset, which contained profiles from environmental sources. The training dataset was used to create the library. This was prepared depending on how the sources were categorized: (1) a two-way split had agricultural-domestic; (2) a three-way split had agricultural-human-sewage; and lastly, (3) a five-way split had cattle-poultry-swine-human-sewage.
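The profile-building steps described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names, the dictionary keys ('source', 'box', 'gtg'), and the band sizes are hypothetical and serve only to show rounding up to 50 bp bins, presence/absence coding, concatenation of the two primer profiles, and collapsing of identical profiles within a host source.

```python
import math

BIN_WIDTH = 50  # bin width in bp, as stated in the text


def bin_band(size_bp, width=BIN_WIDTH):
    """Round a band size up to the next multiple of `width`."""
    return int(math.ceil(size_bp / width) * width)


def binary_profile(band_sizes, all_bins):
    """Return a 0/1 tuple marking which bins contain at least one band."""
    present = {bin_band(size) for size in band_sizes}
    return tuple(1 if b in present else 0 for b in all_bins)


def composite_profiles(isolates):
    """Build BOX+GTG composite profiles and collapse duplicates per source.

    `isolates` is a list of dicts with hypothetical keys 'source', 'box'
    and 'gtg' (band sizes in bp for each primer).
    """
    box_bins = sorted({bin_band(s) for iso in isolates for s in iso["box"]})
    gtg_bins = sorted({bin_band(s) for iso in isolates for s in iso["gtg"]})
    seen = set()
    library = []
    for iso in isolates:
        profile = (binary_profile(iso["box"], box_bins)
                   + binary_profile(iso["gtg"], gtg_bins))
        key = (iso["source"], profile)
        if key not in seen:  # identical profiles from one source count once
            seen.add(key)
            library.append(key)
    return library


# Hypothetical band sizes, for illustration only
isolates = [
    {"source": "sewage", "box": [412, 980], "gtg": [655]},
    {"source": "sewage", "box": [430, 951], "gtg": [660]},  # falls in the same bins
    {"source": "human", "box": [412, 1500], "gtg": [655]},
]
library = composite_profiles(isolates)
print(library)
```

In this toy input the two sewage isolates fall into the same bins and are therefore kept as a single library entry, while the human isolate keeps its own profile.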
Models for the source classifier were constructed using three different statistical techniques, namely, DAPC, kNN, and RF. Library accuracy was externally assessed by holding out 20% of the library as a 'challenge' or test dataset, which was excluded from model training. Samples were chosen using stratified sampling in order to preserve the overall class distribution. A 10-fold cross-validation technique repeated five times was adopted for model training. In order to reduce the prediction bias resulting from the class imbalance in the library, an oversampling technique was used, as implemented in the package caret. For comparative evaluation of classifiers, the subsampling techniques were maintained across all three methods; thus, the exact same data points were used for training.
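As a rough illustration of the holdout and oversampling ideas, the following Python sketch performs a stratified 80/20 split and then randomly resamples the minority classes of the training portion up to the size of the largest class. It is an assumption-laden stand-in, not caret's implementation: the function names, the seed, and the toy class sizes are hypothetical, and the resampling here is applied once to the training indices rather than inside each cross-validation resample.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split indices into train/test so each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def oversample(indices, labels, seed=0):
    """Resample minority classes with replacement up to the majority size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for label in sorted(by_class):
        members = by_class[label]
        balanced.extend(members)
        balanced.extend(rng.choice(members) for _ in range(target - len(members)))
    return balanced


# Toy library with hypothetical class sizes
labels = ["sewage"] * 50 + ["human"] * 20 + ["cattle"] * 10
train_idx, test_idx = stratified_holdout(labels)
balanced_idx = oversample(train_idx, labels)
print(len(train_idx), len(test_idx), len(balanced_idx))
```

Only the training indices are oversampled, so the held-out test set keeps the original class proportions and remains a fair check of the library.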
DAPC was performed as implemented in the package adegenet (Jombart 2008). The number of PCs with the least mean square error was carried over for discriminant analysis. Both kNN and RF were implemented using the package caret (Kuhn 2008). For RF, forest sizes ranging from 500 to 2,500 trees were explored, but accuracy rates were not significantly different (data not shown); thus, all subsequent analyses used 500 trees. The number of randomly selected predictors (mtry) for RF and the number of neighbors to consider (k) for kNN were optimized for each categorical split scheme using accuracy as the metric.
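To make the idea of tuning k by accuracy concrete, here is a small pure-Python sketch of kNN on binary profiles with a simple cross-validation loop. Several choices are assumptions made for illustration and are not taken from the study: the Hamming distance, the interleaved 5-fold scheme (the study used 10-fold cross-validation repeated five times), the candidate values of k, and the toy profiles.

```python
from collections import Counter


def hamming(a, b):
    """Number of band positions at which two binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Majority vote among the k nearest training profiles.

    `train` is a list of (profile, source) pairs. Sorting is stable, so ties
    in distance keep library order; ties in votes go to the source seen first.
    """
    neighbors = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(source for _, source in neighbors)
    return votes.most_common(1)[0][0]


def cv_accuracy(data, k, n_folds=5):
    """Accuracy of kNN under a simple interleaved n-fold cross-validation."""
    correct = 0
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct += sum(knn_predict(train, profile, k) == source
                       for profile, source in test)
    return correct / len(data)


def tune_k(data, candidates=(1, 3, 5)):
    """Return the candidate k with the highest cross-validated accuracy."""
    scores = {k: cv_accuracy(data, k) for k in candidates}
    best = max(candidates, key=lambda k: scores[k])
    return best, scores


# Toy library: two well-separated hypothetical profiles, five copies each
data = [((1, 1, 0, 0), "human"), ((0, 0, 1, 1), "sewage")] * 5
best_k, scores = tune_k(data)
print(best_k, scores)
```

The same loop structure would apply to tuning mtry for a random forest, with the classifier swapped out and the candidate grid changed accordingly.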
A confusion matrix between the observed and predicted categories was then created to calculate the ARCC of the library. We explored how the disproportionate library affects the ARCC by calculating the percentage of known samples incorrectly predicted (% IP) as sewage for each method using the five-way split. To see how class imbalance affects the % IP in each model, sewage isolates were sampled at intervals of 100, from 200 up to 600.
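The quantities derived from the confusion matrix can be sketched as follows. The text does not spell out the ARCC formula, so this Python example computes both common conventions (overall accuracy and the mean of the per-source RCCs) and labels them separately; which one the study used is an assumption left open here. All function names and the toy observations are hypothetical.

```python
from collections import Counter


def confusion_matrix(observed, predicted):
    """Count (observed, predicted) pairs."""
    return Counter(zip(observed, predicted))


def source_total(matrix, source):
    """Number of isolates whose true source is `source`."""
    return sum(n for (obs, _), n in matrix.items() if obs == source)


def rcc(matrix, source):
    """Rate of correct classification for one source, in percent."""
    return 100.0 * matrix[(source, source)] / source_total(matrix, source)


def mean_rcc(matrix):
    """Unweighted mean of the per-source RCCs, in percent."""
    sources = sorted({obs for obs, _ in matrix})
    return sum(rcc(matrix, s) for s in sources) / len(sources)


def overall_accuracy(matrix):
    """Percent of all isolates assigned to their true source."""
    correct = sum(n for (obs, pred), n in matrix.items() if obs == pred)
    return 100.0 * correct / sum(matrix.values())


def percent_ip_as(matrix, source, target="sewage"):
    """Percent of isolates from `source` wrongly predicted as `target`."""
    return 100.0 * matrix[(source, target)] / source_total(matrix, source)


# Hypothetical known-source isolates and their predicted categories
observed = ["sewage"] * 5 + ["human"] * 4 + ["cattle"] * 2
predicted = (["sewage"] * 4 + ["human"]     # sewage: 4 of 5 correct
             + ["human"] * 3 + ["sewage"]   # human: 3 of 4 correct
             + ["cattle"] + ["sewage"])     # cattle: 1 of 2 correct
matrix = confusion_matrix(observed, predicted)

print(rcc(matrix, "sewage"))            # 80.0
print(percent_ip_as(matrix, "cattle"))  # 50.0
print(overall_accuracy(matrix), mean_rcc(matrix))
```

With an imbalanced library the two conventions diverge, because overall accuracy is dominated by the largest class while the mean of RCCs weights every source equally.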
Afterwards, the library was used to categorize the isolates in the unknown dataset. The posterior probability of each isolate to belong to a defined category was calculated. The category with the highest posterior probability was considered as the most probable identity of an isolate. Once environmental isolates were categorized, the percent composition of contamination of each source was determined.
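The assignment rule and the percent composition described in this paragraph amount to an argmax followed by a tally. The Python sketch below uses hypothetical posterior probabilities for four isolates; the function names are illustrative and the numbers are not data from the study.

```python
from collections import Counter


def assign_sources(posteriors):
    """Pick, for each isolate, the category with the highest posterior."""
    return [max(probs, key=probs.get) for probs in posteriors]


def percent_composition(assignments):
    """Percent of isolates attributed to each source."""
    counts = Counter(assignments)
    total = len(assignments)
    return {source: round(100.0 * n / total, 2) for source, n in counts.items()}


# Hypothetical posterior probabilities for four environmental isolates
posteriors = [
    {"sewage": 0.70, "human": 0.20, "poultry": 0.10},
    {"sewage": 0.30, "human": 0.60, "poultry": 0.10},
    {"sewage": 0.55, "human": 0.25, "poultry": 0.20},
    {"sewage": 0.10, "human": 0.15, "poultry": 0.75},
]
assignments = assign_sources(posteriors)
composition = percent_composition(assignments)
print(assignments)
print(composition)
```

Because every isolate is forced into its most probable category, even weakly supported assignments contribute to the composition; an "unclassified" category with a probability threshold would change these percentages.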

Library performance
The reference library consisted of 1,408 thermotolerant E. coli isolates, 444 of which were from agricultural sources while 964 were from domestic sources (Table 1). These isolates were placed into pre-defined categories, and DAPC, RF, and kNN were used to assess library performance in classification.
Overall, good classification was obtained using the BOX-GTG composite profiles in either a two-way or three-way categorical split using DAPC and RF (Table 2). In terms of library accuracy, ARCC decreased as the number of categories increased. Comparably high accuracies were calculated for both two-way (RF ARCC = 84.35%; DAPC ARCC = 82.92%) and three-way (RF ARCC = 82.45%; DAPC ARCC = 81.77%) libraries using RF and DAPC. Meanwhile, accuracy was lower for the five-way library (RF ARCC = 74.77%; DAPC ARCC = 74.55%). Moreover, kNN yielded the lowest ARCC for all categorical schemes (two-way = 80.41%, three-way = 73.12%, five-way = 68.48%). Human sources had a consistently high relative rate of correct classification wherever they were used as a category in DAPC and RF classification (RCC = 92.13-93.19%). This was followed by sewage (RCC = 77.92-84.9%). Agricultural isolates were successfully classified when lumped together in a single category (RCC = 73.32-79.32%); further partitioning them into exclusive subcategories increased the probability of misclassification.

Library validation
An external validation of the library was conducted using a library 'challenge' which was composed of known source isolates (n = 278) excluded from the training set. The accuracy obtained for the training set (80% of the library) was comparable to ARCCs derived when using the full library.
To see how class imbalance affects the accuracy of classification, we looked at the trend of the percentage of samples incorrectly predicted (% IP) as sewage as we populated the library with an increasing number of sewage samples (Figure 2). In general, a higher % IP can be seen in agricultural categories when increasing the sewage library using DAPC or kNN. A lower % IP was achieved by RF compared to DAPC and kNN, showing more balanced composition. Human classification tends to be more robust and has lower % IP, which may indicate sufficient distinction from sewage, supporting the high RCC rates previously achieved.

Source-tracking environmental isolates
The environmental dataset was made up of 332 isolates from various locations, 220 of which were isolated during the wet season (May-October) and the remaining 112 during the dry season (November-April). Source prediction of unknown isolates using kNN differed greatly from DAPC and RF, which had similar percentages in all categorical schemes (Table 4). The kNN method was also not able to supply consistent source attribution across the three different schemes, resulting in varying percentages in each source category. Based on this finding, as well as the generally lower ARCC obtained from kNN, we focused the study on using RF and DAPC models. Overall, fecal contamination in the environment was heterogeneous in that isolates were attributed to agricultural, human, and sewage sources across all models used (Table 4). In all classification schemes, sewage was the dominant source of pollution (37.65-49.4%). This was followed by human sources (24.4-29.22% in DAPC and RF). Among the agricultural sources, poultry was the largest contributor of contamination (12.65%), followed by swine (7.23-9.94%), and lastly, cattle (3.31-3.61%).

DISCUSSION
Fecal contamination is among the factors that cause deterioration in the water quality of Laguna Lake. Total coliform counts monitored by the LLDA provide information on the extent of contamination. However, there is a need to discern the origin of contamination for appropriate mitigation measures to be implemented and protect public health (Griffith et al. 2003; McLellan et al. 2003). Here, we utilized rep-PCR, a library-dependent MST method, to identify the dominant source of fecal contamination in Laguna Lake and its tributaries.

MST fingerprint library
Several rep-PCR markers for source-tracking E. coli are available in the literature. Each generates different fingerprint profiles that ultimately affect library accuracy (Mohapatra et al. 2007). Labrador et al. (2020) assessed the performance of three of these markers and proposed combining profiles of BOX and GTG to attain the highest accuracy for source tracking in Laguna Lake. By combining the banding patterns of the two markers, more variability was introduced in the dataset, which allowed the isolates to be classified to their respective host sources. Improvement in the overall discriminatory power of the source-tracking library was also reported after band profiles from different primers were combined (Yoke-Kqueen    Sukhumungoon et al. 2016). However, partitioning the library into more categorical splits lowered the accuracy, an observation that was concordant with previous reports (Mohapatra et al. 2007; Mott & Smith 2011). The library used as the training set also has disproportionate sizes for each point source. The sewage samples are the most abundant, with as many as four times those of the fecal libraries. Library-dependent methods (LDMs) of MST often rely on the representation of each source candidate in the library, as over- and under-representation may lead to biases in classification. In order to overcome biases, we integrated clonal isolates from the training set to improve prediction and representativeness (Hassan et al. 2005).
Our library had more difficulty in classifying the different agricultural host sources: chickens/ducks, cattle, and swine. In contrast, other genotypic libraries had higher accuracies when dealing with nonhuman host sources (Dombek et al. 2000; Mohapatra et al. 2007; Somarelli et al. 2007). Although the addition of representative isolates for each host source was reported to improve the library (Harwood et al. 2003; Wiggins et al. 2003; Mott & Smith 2011), the fingerprint profiles of the agricultural sources used in this study may not be variable enough to be classified accordingly. Such a limitation can be attributed to the inherent fingerprint profiles rather than the library size. Therefore, the addition of more isolates would no longer improve classification. In this case, different profiling methods such as antibiotic resistance assays, or direct detection using library-independent methods (LIMs), can be further explored to assess their efficiency in identifying fecal contamination from agricultural sources.
Another limitation that needed to be addressed was the frequent false positives that plague genotypic libraries (Griffith et al. 2003; Myoda et al. 2003). Although library-based genotypic methods are able to correctly identify dominant sources of contamination in a given sample, they tend to incorrectly identify absent host sources as being present. A threshold percentage of ≥15% was suggested to minimize the severity of false positives (Griffith et al. 2003; Harwood et al. 2003). Here, we opted not to impose a single threshold for our classification, as the appropriate level of probability of the isolate belonging to the same category may vary with the three different methods. Furthermore, omitting unknown sources in the prediction may be problematic when paired with our goal of determining percentages of fecal contribution in the environmental samples (Ritter et al. 2003; Robinson et al. 2007). The application of an unclassified category for our samples while comparing the different classification methods may be further explored in the future.
We recommend further assessments of the library including (1) temporal stability of the library, (2) applicability to other water bodies, including other watersheds in the Philippines, (3) addition of wildlife sources, and (4) replicability of results in other laboratories, as a precedent for the use of MST as a regular monitoring tool.

Prediction models
Harwood et al. (2000) proposed that ARCCs of around 60-70% are suitable for MST studies with the objective of pollution control. Our model was able to predict source categories with an accuracy of 74.77% for a five-way split using RF, which is comparable to other reported ARCCs obtained through rep-PCR of E. coli. Lyautey et al. (2010) reported an accuracy of 77% using BOX and ERIC libraries, while Mohapatra et al. (2007) achieved 79.89% with a GTG(5) library. Other gel fingerprinting studies have used different classifiers such as discriminant analysis (Dombek et al. 2000; Mohapatra et al. 2007; Somarelli et al. 2007), support vector machines (Garabetian et al. 2020), and kNN and neural networks (Carlos et al. 2012), with lower or similar accuracy rates. Robinson et al. (2007) suggested kNN as a compromise between the strengths of maximum similarity (MS) and discriminant analysis in terms of accuracy and prediction bias when dealing with disproportionate libraries. However, our results show that compared with DAPC and RF, kNN is the least suitable for source classification in our library and is more prone to incorrectly predicting isolates as sewage-derived as the sewage library grows. This is also consistent with Lyautey et al. (2010), who reported kNN as being less sensitive and specific than another model, MS.
Comparison of different statistical algorithms shows that DAPC and RF yield similar results in terms of prediction accuracy, while kNN may not be as suitable for our fingerprint library. RF was observed to be more robust when tackling unbalanced datasets and can be used in the quantification of source attribution of fecal pollution. Smith et al. (2010) also reported that RF outperformed DA using antibiotic resistance profiles for bacterial source tracking. The agreement between the predictions from DAPC and RF lends more support to, and increases confidence in, our result.
It can be observed that for DAPC, the test set ARCC is much lower compared with the training set. This is perhaps due to overfitting, wherein the model learns the training set well but underperforms when looking at unknowns. The major source of incorrect predictions during test set validation came from isolates wrongly classified as sewage, particularly those from agricultural sources (cattle, swine, and poultry). The prediction bias toward sewage may be caused by the inherent class imbalance present in the fecal source library, with sewage samples populating most of the dataset at almost four times as much as other fecal sources (Table 1). This resulted in lower RCCs for these classes, particularly in DAPC and kNN. In contrast, RF exhibited more balanced RCC rates for the agricultural sources by minimizing incorrect assignments without compromising accuracy in sewage prediction. RF has been said to be less prone to overfitting (Breiman 2001), making it useful for our study. RF has also been implemented in several source-tracking studies using other library-dependent methodologies such as antibiotic resistance profiling (Smith et al. 2010), microarray (Dubinsky et al. 2016), and 16S amplicon sequences (Roguet et al. 2018, 2020).

Fecal source attribution in Laguna Lake
All three statistical methods used in this study unanimously show that sewage is the dominant source of fecal pollution in Laguna Lake. The contamination of lake water by sewage poses a health risk to the populace residing in the vicinity. This is because sewage is reported to contain chemical contaminants, such as heavy metals and organic compounds (Lamastra et al. 2018), and biological contaminants, such as pathogens, contaminants of emerging concern, and antimicrobial resistance genes (Alygizakis et al. 2020). Fecal pollution due to sewage may stem from the inadequate wastewater treatment system in the Philippines. Specifically, in Metro Manila, about 15% of households are linked to sewage systems, and only half of this wastewater is treated before release (Palanca-Tan 2017; Jalilov 2018). The majority of the wastewater in the megacity flows into septic tanks, which are not government-regulated but privately maintained by each household (Palanca-Tan 2013). A more alarming situation is the case of informal settlements, where wastewater is directly dumped into drainage without any treatment (Palanca-Tan 2017). According to Santo Domingo & Edge (2010), untreated municipal wastewater may also contain fecal contamination derived from animal sources. Similarly, Griffith et al. (2003) reported that sewage is not purely composed of human contamination; rather, it contains fecal contamination from other sources that infiltrate sewage systems.
In several studies (Wiggins et al. 1999; Griffith et al. 2003; Ahmed et al. 2009; Kon et al. 2009), MST often used FIB from sewage as a substitute for human-related contamination. However, our study demonstrated that sewage isolates can be discriminated from human isolates more successfully than from agricultural isolates (Figure 2). This has several implications. First, human contribution to sewage may not be as high as was previously expected (Labrador et al. 2020), and the former is not a good representative of the latter. Second, animal-derived contamination may have a higher contribution to sewage contamination than previously known. Lastly, the distinct sewage category suggests that it is a mixture of contamination from various sources, such as industrial wastes and non-point sources, that were not accounted for in this study.
Although the use of library-independent host-specific markers has gained more popularity in recent source-tracking studies, the advantage of the LDM is that it can supplement the methodologies that monitoring systems already have in place. Our LDM utilizes indicator bacteria that the LLDA is also testing, which allows for congruence in data gathering, ease of sampling, and comparability of results (Mott & Smith 2011). Furthermore, the established methods, which include bacterial culture and PCR, are more easily adapted in laboratories which may lack the equipment and expertise required for LIM studies.
Overall, our results show that there is evidence of fecal contamination in Laguna Lake. This implies that management guidelines should be improved and strictly implemented to improve the water quality of the lake. Moreover, we identified the dominant sources of pollution in the watershed, which can aid government units and monitoring agencies in focusing their efforts and enacting policies for water quality management, such as establishing more stringent standards for wastewater disposal and total maximum daily loads of E. coli and other bacteria. Proper rehabilitation of bodies of water is important because it can prevent outbreaks of human and ecosystem diseases (Santo Domingo et al. 2007), leading to a positive impact in the field of public health.

CONCLUSION
MST of fecal contamination is an important aspect of water quality management. We demonstrate that DNA fingerprinting of E. coli coupled with RF is a promising method for source attribution in Laguna Lake. The validation of the prediction algorithm showed that RF yields the highest accuracy compared with DAPC and kNN and is shown to be more reliable when dealing with disproportionate datasets. RF found the dominant source of fecal contamination to be from sewage, a result that was also supported by DAPC. This may lead to public health implications; thus, more stringent measures are