Selecting rep-PCR markers to source track fecal contamination in Laguna Lake, Philippines

Fecal contamination is one of the factors causing deterioration of Laguna Lake. Although total coliform levels are constantly monitored, no protocol is in place to identify their origin. This can be addressed using the library-dependent microbial source tracking (MST) method, repetitive element sequence-based polymerase chain reaction (rep-PCR) fingerprinting. As a prerequisite to developing the host-origin library, we assessed the discriminatory power of three fingerprinting primers, namely BOX-A1R, (GTG)5, and REP1R-1/2-1. Fingerprint profiles were obtained from 290 thermotolerant Escherichia coli isolated from sewage waters and fecal samples of cows, chickens, and pigs from regions surrounding the lake. Band patterns were converted into binary profiles and were classified using the discriminant analysis of principal components. Results show that: (1) REP1R-1/2-1 has a low genotyping success rate and information content; (2) increasing the library size led to more precise estimates of library accuracy; (3) combining fingerprint profiles from BOX-A1R and (GTG)5 gave the best discrimination (average rate of correct classification (ARCC) = 0.82 ± 0.06) in a two-way categorical split; and (4) no significant difference was found between the combined profiles (0.74 ± 0.15) and BOX-A1R alone (0.76 ± 0.09) in a four-way split. Testing the library by identifying known isolates from a separate dataset showed that a two-way classification performed better (ARCC = 0.66) than a four-way split (ARCC = 0.29). The library can be developed further by adding more representative isolates per host source. Nevertheless, our results show that combining profiles from BOX-A1R and (GTG)5 is recommended in developing the MST library for Laguna Lake.

doi: 10.2166/wh.2019.042
://iwaponline.com/jwh/article-pdf/18/1/19/720450/jwh0180019.pdf

Kevin L. Labrador, Mae Ashley G. Nacario, Gicelle T. Malajacan, Joseth Jermaine M. Abello, Luiza H. Galarion, Windell L. Rivera (corresponding author)
Pathogen-Host-Environment Interactions Research Laboratory, Natural Sciences Research Institute, University of the Philippines Diliman, Quezon City, Philippines
E-mail: wlrivera@science.upd.edu.ph

Mae Ashley G. Nacario, Gicelle T. Malajacan, Joseth Jermaine M. Abello, Windell L. Rivera
Institute of Biology, College of Science, University of the Philippines Diliman, Quezon City, Philippines

Christopher Rensing
Fujian Provincial Key Laboratory of Soil Environmental Health and Regulation, College of Resources and Environment, Fujian Agriculture and Forestry University, Fuzhou, China


INTRODUCTION
Laguna Lake, the largest lake in the Philippines, has experienced continued deterioration over the past years. The surrounding communities, with an estimated population of around 15 million people, benefit from the lake through various agricultural, industrial, and domestic uses (Laguna Lake Development Authority). However, these communities introduce waste into the system, and the intensive demand coupled with a high pollution load has contributed to the decline in water quality (Santos-Borja & Nepomuceno; WAVES). In fact, monitoring activities by the Laguna Lake Development Authority, the agency primarily responsible for the lake's development, revealed fecal contamination in the lake and high total coliform counts in most of its tributaries. These findings raise serious public health concerns, since pathogens from infected sources can be introduced into the environment through feces.

Although the extent of the contamination is known, there is no information on its origin. Knowledge of the sources of fecal contamination is critical because it can (1) allow the development of management schemes to minimize their input and (2) aid in restoration and remediation efforts.

Microbial source tracking (MST) is a collection of methods that utilize microorganisms to identify and quantify dominant sources of fecal contamination in a given system (Scott et al.; Stoeckel & Harwood). The methods are broadly categorized as either (1) library-dependent (LDM) or (2) library-independent (LIM), depending on whether a culture library of known host sources must be developed to serve as a database for identifying unknown isolates (Gomi et al.). Genotypic methods, a subclass of LDM that utilizes genetic fingerprinting techniques, rely on the genetic variation of indicator bacteria to develop the library (Scott et al.). Among these methods is repetitive element sequence-based polymerase chain reaction (rep-PCR), a technique that targets repetitive palindromic sequences that are widespread in bacterial genomes (Versalovic et al.).

Compared to other genetic fingerprinting methods, rep-PCR is easier, less technically demanding, faster, cheaper, and more reproducible, while generating relatively more efficient and reliable results that are easy to interpret, making it a practical approach for source tracking fecal contamination. The recommendation is to profile more than 600 isolates per host source to account for the genetic diversity of the microbial indicator and to refine the size and representativeness of the library (Mott & Smith). Developing a library that meets the recommended size while using many primers demands considerable resources. Therefore, as a screening step, it is necessary to first evaluate the discriminatory power of several primers on a small number of samples. This ensures that resources are directed toward the development of an accurate fingerprint library.
Our objective was to assess the discriminatory power of three rep-PCR fingerprinting primers in source tracking fecal contamination in Laguna Lake. The primers, namely BOX-A1R, (GTG)5, and REP1R-1/2-1, were selected because they had the highest average rate of correct classification (ARCC) in a comparative study by Mohapatra et al. We also employed composite analysis to assess the accuracy of the different primer combinations. Using thermotolerant Escherichia coli as the microbial indicator, we populated the reference library with isolates from sewage waters and fecal samples from cows, chickens, and pigs. Initial site surveys and consultations with stakeholders identified these host sources as major contributors of fecal contamination to the lake. Furthermore, we assessed library performance based on its categorical split, that is, the number of categories into which the isolates in the library are classified. In this regard, we designed a two-way split (Agricultural-Domestic) and a four-way split (Chicken-Cow-Pig-Sewage) library. We aimed to answer the following questions: (1) How did an increase in library size affect its performance? (2) Was a two-way split better than a four-way split? (3) Which of the primers or primer combinations was best suited for source tracking fecal contamination in Laguna Lake? (4) How accurate was the library in source tracking known isolates? To the best of our knowledge, this is the first initiative to perform MST on the Philippines' largest lake.

Sample collection
Fecal samples were collected from potential agricultural animal host sources (i.e., chickens, cows, and pigs) in backyard farms located in the provinces of Laguna and Rizal.

The resulting amplicons were subjected to agarose gel electrophoresis (2%, 190 V, 60 min), and the gels were visualized using a gel documentation system (Bio-print ST4, Vilber Lourmat, UK). The analysis of banding patterns was done with an imaging system (SuperMegaCapt ST4 v.16.08 g, Vilber Lourmat, UK); band positions were normalized using a 1 kb molecular ladder (Hyperladder, Bioline, USA) as an external reference. A single observer performed the gel analysis to minimize variability attributed to multiple observers.

Fingerprint analysis
The statistical analysis was performed in R v.3.5.0 (R Core Team). Band positions were adjusted by binning their molecular weight (MW) to the nearest 20 bp. The binned MWs were then converted into a binary sequence based on band presence (1) or absence (0).

The trend in ARCC as a function of the library size was also considered. To remove biases associated with disproportionate libraries (Mott & Smith), an equal number of isolates from each host source (n_host) was randomly selected from the sample pool before subjecting them to DAPC. The value of n_host was increased in multiples of ten, and the classification was iterated 100 times. Afterwards, the mean and 95% confidence interval (CI) of the ARCC were calculated. Lastly, the variation was compared (1) among rep-PCR markers within a categorical split and (2) between categorical splits within a rep-PCR marker. For the former, the Kruskal-Wallis test was used, followed by the post-hoc pairwise Wilcoxon test (p-values were adjusted using Bonferroni correction). For the latter, the Wilcoxon rank-sum test was used. The level of significance (α) was set to 0.05 for all statistical tests.
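The binning and binary conversion described at the start of this section can be illustrated with a minimal sketch. It is written in Python rather than the R used in the study, and the function names, the half-up tie rule, and the example band sizes are all assumptions made for illustration; they are not taken from the original analysis.

```python
def bin_band(mw_bp, bin_width=20):
    """Round a band's molecular weight (bp) to the nearest multiple of bin_width.

    Assumption: ties are rounded up; the paper does not state a tie rule.
    """
    return int((mw_bp + bin_width / 2) // bin_width) * bin_width


def to_binary_profiles(band_sizes_by_isolate, bin_width=20):
    """Convert per-isolate band sizes into presence (1) / absence (0) vectors.

    Returns the sorted list of binned positions and, for each isolate,
    a 0/1 list aligned with those positions.
    """
    binned = {
        isolate: {bin_band(mw, bin_width) for mw in bands}
        for isolate, bands in band_sizes_by_isolate.items()
    }
    positions = sorted(set().union(*binned.values()))
    profiles = {
        isolate: [1 if pos in bins else 0 for pos in positions]
        for isolate, bins in binned.items()
    }
    return positions, profiles


# Hypothetical band sizes (bp) for two isolates, for illustration only.
example_bands = {"iso_A": [512, 1003, 1498], "iso_B": [497, 1510]}
positions, profiles = to_binary_profiles(example_bands)
print(positions)
print(profiles)
```

Each isolate ends up as a fixed-length 0/1 vector over the same set of binned positions, which is the kind of multivariate input a classifier such as DAPC expects.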

Library testing
Once the optimal marker combination was determined, we tested the constructed library using a separate dataset. The test dataset was composed of E. coli from known sources that were subjected to the same assays and data preparation used in library construction. We used the predict.dapc function to categorize the test dataset based on the fingerprint library. This function calculated the posterior probability of each isolate in the test dataset belonging to each category; the category with the highest posterior probability was considered the most probable identity of an isolate. This was done for both the two-way and four-way categorical splits.
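The assignment rule described above (take the category with the highest posterior probability) can be sketched as follows. This is an illustrative Python fragment, not the predict.dapc implementation; the function name and the posterior values are hypothetical.

```python
def assign_sources(posteriors):
    """Assign each isolate to the category with the highest posterior probability.

    posteriors: dict mapping isolate name -> dict of {category: probability}.
    """
    return {
        isolate: max(probs, key=probs.get)
        for isolate, probs in posteriors.items()
    }


# Hypothetical posterior probabilities for a two-way split.
example_posteriors = {
    "test_01": {"Agricultural": 0.81, "Domestic": 0.19},
    "test_02": {"Agricultural": 0.35, "Domestic": 0.65},
}
print(assign_sources(example_posteriors))
```

Comparing these assignments with the known sources of the test isolates gives the per-category rates of correct classification.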

DNA fingerprinting
Rep-PCR profiles of 290 thermotolerant E. coli from various host sources were obtained using three primers. The host sources were domestic sewage (n = 110) and agricultural animals (chicken, cow, pig; n = 60 each) (Figure 2). [...] amplified using REP (70%). In addition, more complex profiles were generated by BOX and GTG, while only a few unique REP profiles remained after identical profiles were collapsed (40%). Since REP had a low genotyping success rate and minimal information content, it was considered ineffective and was omitted from the succeeding analyses.

Library performance
The library was assessed by increasing the number of representatives per host source (n_host) in multiples of ten and then subjecting the dataset to 100 iterations of the classification model. Given how the isolates in the current sample pool are distributed among categories, a larger n_host could be evaluated for the two-way split (maximum n_host = 100) than for the four-way split (maximum n_host = 40).
Two generalizations could be inferred from the results (Figure 3). First, increasing n_host decreased the mean ARCC and narrowed its 95% CI. Second, for the two-way categorical split, the mean and CI of ARCC stabilized as sample size increased. However, the same cannot be said for the four-way split, since n_host was not large enough to observe stability in the distribution. This preliminary assessment provides an estimate of the ARCC of the rep-PCR marker used prior to maximizing the library.
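The resampling scheme behind this assessment can be outlined in a short sketch. It is in Python rather than R, and several parts are assumptions: `score_library` is a hypothetical placeholder for fitting the classifier (DAPC in the study) and returning an ARCC, and the 95% CI uses a normal approximation because the text does not say how the interval was computed.

```python
import random
import statistics


def balanced_subsample(pool, n_host, rng):
    """Draw n_host isolates at random, without replacement, from every host source."""
    return {host: rng.sample(isolates, n_host) for host, isolates in pool.items()}


def arcc_trend(pool, score_library, step=10, n_iter=100, seed=0):
    """Estimate the mean ARCC and an approximate 95% CI for each library size.

    pool: dict mapping host source -> list of isolate profiles.
    score_library: hypothetical callable taking a balanced library and
        returning its ARCC (stands in for the DAPC classification step).
    """
    rng = random.Random(seed)
    max_n = min(len(isolates) for isolates in pool.values())
    trend = {}
    for n_host in range(step, max_n + 1, step):
        scores = [
            score_library(balanced_subsample(pool, n_host, rng))
            for _ in range(n_iter)
        ]
        mean = statistics.mean(scores)
        # Assumption: normal-approximation CI of the mean over iterations.
        half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
        trend[n_host] = (mean, mean - half_width, mean + half_width)
    return trend
```

Drawing the same number of isolates from every host source at each step keeps the library proportionate, which is the bias the balanced design is meant to avoid.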

Marker selection
In assessing the differences among markers, the ARCC distributions were compared within each categorical split. In the two-way split, the combined BOX-GTG profile gave the best discrimination (0.82 ± 0.06). In the four-way split (Table 3), no significant difference was found between BOX-GTG (0.74 ± 0.15) and BOX (0.76 ± 0.09). In contrast, GTG (0.63 ± 0.12) was significantly lower than the other two. These results suggest that for a two-way split, the best marker is BOX-GTG, whereas for a four-way split, it could be either BOX or BOX-GTG.

Figure 3 | Change in the ARCC as a function of the sample size. The dataset was partitioned based on the categorical split used. The shaded region represents the 95% CI.

Library testing
To further assess its accuracy, the library was used to predict the categories of a separate dataset consisting of composite BOX-GTG fingerprint profiles of E. coli from known sources. With the two-way categorical split, the library correctly predicted the isolates from agricultural and domestic sources with classification rates of 67% and 64%, respectively (ARCC = 65.50%; Table 4). This accuracy decreased when using the four-way categorical split (ARCC = 29.17%; Table 5), with correct classification rates ranging from 13.33% (cow) to 53.33% (sewage).

[...] revealed the absence of discrete clusters that could be useful for discriminating host sources (not shown). Given the binary nature of the data, the DA was also not usable since the correlation among variables was quite high.
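As a quick arithmetic check, the two-way ARCC reported above is consistent with an unweighted mean of the per-category rates of correct classification. The Python sketch below assumes that definition; the function name is hypothetical, and only the two-way rates are used because the four-way rates are not all given in the text.

```python
def arcc(rates_by_category):
    """Average rate of correct classification: unweighted mean of per-category rates."""
    return sum(rates_by_category.values()) / len(rates_by_category)


# Per-category rates (%) reported for the two-way split.
two_way_rates = {"Agricultural": 67.0, "Domestic": 64.0}
print(f"{arcc(two_way_rates):.2f}%")  # prints 65.50%
```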

DISCUSSION
DAPC subjects the data to PCA to summarize the overall variability among individuals and then utilizes the PC scores for DA; this maximizes between-group variation while minimizing within-group variation, allowing for the best discrimination of samples to pre-defined groups.
Although the classification model was originally developed for genetic data, it can be used with any dataset that is multivariate in nature (Jombart et al. ).
In terms of library performance, increasing library size improved the precision of estimates at the expense of accuracy. Although smaller libraries tend to have higher ARCCs, they suffer from lower representativeness, which prevents the classification of isolates that are not in the library (Mott & Smith). As much as possible, the library needs to include all the isolates that are representative of a given system for it to be effective in source tracking. Furthermore, the two-way categorical split performed better than the four-way split. This agrees with previous reports showing that accuracy decreased as the number of categories increased (Carson et al.; Mohapatra et al.). With fewer categories, an isolate is more likely to be assigned to its true category by chance, and this probability decreases as the number of categories increases. For example, in a two-way split, a random classification leads to 50% ARCC by chance, while for a four-way split, this becomes 25% (Mott & Smith).
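The chance-level argument can be made concrete with a small calculation. The Python sketch below assumes equally likely categories under random classification and compares the chance level with the test-set ARCC values reported in this paper; the function names are hypothetical, and the margin is only a descriptive figure, not a statistical test.

```python
def chance_arcc(n_categories):
    """Expected ARCC (%) if isolates were assigned to categories uniformly at random."""
    return 100.0 / n_categories


def margin_over_chance(observed_arcc, n_categories):
    """Difference (percentage points) between an observed ARCC and the chance level."""
    return observed_arcc - chance_arcc(n_categories)


# Test-set ARCC values (%) reported for the BOX-GTG library.
for label, observed, k in [("two-way", 65.50, 2), ("four-way", 29.17, 4)]:
    print(f"{label}: chance = {chance_arcc(k):.2f}%, "
          f"margin = {margin_over_chance(observed, k):.2f} points")
```

Reading an ARCC against its chance level makes splits with different numbers of categories easier to compare.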
Testing the library using a separate dataset has shown its utility in actual source tracking. The two-way categorical split identified the isolates correctly at a greater rate (ARCC = 65.50%) than the four-way categorical split (29.17%); this suggests that, in its current form, the library performs well only with the two-way split. However, we argue that the poor performance in the four-way split is a [...]

CONCLUSION
We evaluated the discriminatory power of three rep-PCR markers on E. coli from host sources identified as contributors of fecal contamination to Laguna Lake. REP had a low genotyping success rate and information content and was therefore dropped from the analyses. BOX had higher discriminatory power than GTG; however, a combined profile from these two markers had a higher ARCC. Our results indicate that BOX-GTG can be used as a rep-PCR marker in further developing the fingerprint library for MST in the lake. This future development includes increasing the number of isolates per host source and adding host sources when necessary. Since the optimal primer combination has been selected, focus can now shift to assessing library representativeness, size, stability, and performance in identifying environmental isolates. It should be noted, however, that the LDM is not without limitations. MST should be performed using a toolbox approach; hence, there is a need to implement phenotypic fingerprinting methods (e.g., antibiotic resistance analysis) as well as LIMs.