Potential applications of next generation DNA sequencing of 16S rRNA gene amplicons in microbial water quality monitoring

The applicability of next generation DNA sequencing (NGS) methods for water quality assessment has so far not been broadly investigated. This study set out to evaluate the potential of an NGS-based approach in a complex catchment with importance for drinking water abstraction. In this multicompartment investigation, total bacterial communities in water, faeces, soil, and sediment samples were investigated by 454 pyrosequencing of bacterial 16S rRNA gene amplicons to assess the capabilities of this NGS method for (i) the development and evaluation of environmental molecular diagnostics, (ii) direct screening of the bulk bacterial communities, and (iii) the detection of faecal pollution in water. Results indicate that NGS methods can highlight potential target populations for diagnostics and will prove useful for the evaluation of existing and the development of novel DNA-based detection methods in the field of water microbiology. The approach used revealed dominant bacterial populations but failed to detect populations with low abundances such as faecal indicators in surface waters. In combination with metadata, NGS data will also allow the identification of drivers of bacterial community composition during water treatment and distribution, highlighting the power of this approach for monitoring of bacterial regrowth and contamination in technical systems.

doi: 10.2166/wst.2015.407

J. Vierheilig, D. Savio, A. H. Farnleitner, G. H. Reischer (corresponding author): Research Group Environmental Microbiology and Molecular Ecology, Institute for Chemical Engineering, Vienna University of Technology, Gumpendorfer Straße 1a, A-1060 Vienna, Austria. E-mail: georg.reischer@tuwien.ac.at
J. Vierheilig, D. Savio: Centre for Water Resource Systems (CWRS), Vienna University of Technology, Karlsplatz 13/222, A-1040 Vienna, Austria
J. Vierheilig, present address: Division of Microbial Ecology, Department of Microbiology and Ecosystem Science, University of Vienna, Althanstraße 14, A-1090 Vienna, Austria
R. E. Ley: Department of Microbiology, Cornell University, Ithaca, NY 14853, USA
R. L. Mach: Gene Technology Group, Institute for Chemical Engineering, Vienna University of Technology, Gumpendorfer Straße 1a, A-1060 Vienna, Austria
A. H. Farnleitner, G. H. Reischer: Interuniversity Cooperation Centre Water & Health, Institute for Chemical Engineering, Vienna University of Technology, Gumpendorfer Straße 1a, A-1060 Vienna, Austria


INTRODUCTION
During the last 30 years, molecular biological methods have contributed vastly to our understanding of ecosystems and their functions (Zinger et al.). In the field of microbial water quality assessment, detection methods targeting nucleic acids have expanded our view on the microbial world beyond the minority of bacterial taxa cultivable by classical microbiological methods. Microscopy-based fluorescence in situ hybridisation (FISH) (DeLong et al.) and related techniques allow the detection of single cells in their native habitat (Amann et al.) and the investigation of the distribution and dynamics of specific bacterial populations in natural and engineered systems with relevance for water quality (Farnleitner et al.; Wilhartitz et al.). Polymerase chain reaction (PCR)-based methods allow for the highly specific and sensitive detection and amplification of genes in environmental samples (Bej et al.) and the investigation of complete microbiomes including viruses and phages (Schwab et al.). Based on PCR amplification, various typing methods such as denaturing gradient gel electrophoresis (DGGE) (Muyzer et al.) or terminal restriction fragment length polymorphism (T-RFLP) (Cancilla et al.) have been developed to characterise marker gene communities and thus the corresponding bacterial, archaeal, protozoan, or viral populations.
All these molecular biological methods are based on the utilisation of DNA sequence information to target the respective desired nucleic acid (DNA or RNA). During the last 30 years, Sanger DNA sequencing was the method of choice for obtaining sequence information from target cells. Despite a high degree of automation, especially during the sequencing of the human genome, this approach is time consuming and laborious, especially in terms of sample preparation (gene cloning) (Venter et al.). These limitations became particularly evident in metagenomic studies conducted with Sanger approaches (Venter et al.). The advent of next generation sequencing (NGS) platforms, which started in 2005 (Margulies et al.) with the introduction of the 454 GS 20 sequencer, revolutionised DNA sequencing by allowing massively parallel sequencing with millions of reactions running in the same experiment. In the course of the last decade, numerous other platforms have been introduced (e.g. Illumina, Pacific Biosciences, Ion Torrent, SOLiD), yielding up to 600 gigabases of sequence information and up to 4 billion sequence reads per instrument run, usually with relatively short read lengths of about 150 bases (Quail et al.). The use of multiplex identifiers for sample-specific labelling of nucleic acids allows the analysis of hundreds of samples in parallel in a single instrument run (Hamady et al.).
Since their inception, NGS approaches have been used extensively for the detailed investigation of the microbial consortia of the human microbiome (Hamady & Knight), marine ecosystems (Sogin et al.), or environmental microbiomes in general (Shokralla et al.). Two main approaches are used in these studies: (1) the elucidation of microbial community structure in an environmental sample by deep amplicon sequencing, i.e. the in-depth sequencing of PCR amplicons of a marker gene (most often the 16S rRNA gene); and (2) the analysis of the complete DNA or RNA content of an environmental sample, referred to as metagenomics or metatranscriptomics, respectively. The second approach allows surveying for the presence of gene families with distinct metabolic potential in a community ('What can the community do?'), while amplicon sequencing permits a detailed census of the microbial communities with unprecedented resolution ('Who is there?'). In the water sector, applications of NGS are currently rather limited and there are no studies assessing the applicability of these methods in water quality investigations in general. This lack of investigations is most probably due to the novelty of NGS methods and the technical challenges associated with them (molecular biological and bioinformatic know-how). Also, molecular methods are only beginning to be broadly applied in the field of water quality. To remedy this lack of information, this study was initiated to assess the potential applicability and the limitations of 16S rRNA gene amplicon sequencing for water quality assessment in general and the specific detection of faecal pollution in particular. A complex river backwater catchment, which serves multiple purposes (drinking water source, recreation, national park), was selected as the model catchment to sample the compartments of surface water, sediment, and soil, supplemented by faecal sampling. The research questions of this study were: (i) Is the NGS approach used here useful in the development and evaluation of molecular tools for the detection of microbial pollution (e.g. faecal pollution, source tracking)? (ii) Can NGS tools serve as affordable, direct molecular monitoring tools for bulk bacterial communities and changes in their composition from source to tap? (iii) What is the potential of NGS methods for the detection of faecal pollution in environmental waters?

METHODS
Sampling, sample processing, and DNA extraction
Samples (n = 29) of different types were collected between June 2010 and May 2011 (Table 1). The main sampling area was the backwater catchment area of the porous groundwater well aquifer (PGWA), where surface water (n = 11), soil (n = 2), sediment (n = 2), and animal faeces (n = 5) were sampled. This riverine wetland is a national park and an important water resource located to the north of the Danube River at the south-eastern border of the city of Vienna, Austria. In addition, faecal samples were collected from a broad range of vertebrate animals at the Vienna Zoo, Austria (n = 9). For soil and sediment samples, material from three cores, taken within an area of 1 m², was pooled. All samples were aseptically collected in sterile 1,000 ml glass bottles (surface water) or 50 ml plastic vials (soil, sediment, faeces) and kept cool and dark during transport to the laboratory. Samples were stored at −20 °C until DNA from soil, sediment, and faecal samples (each approximately 250 mg) was extracted using the PowerSoil DNA isolation kit (MoBio Laboratories, Carlsbad, USA) in combination with bead-beating. In order to test the effect of modifications in the DNA extraction procedure, an additional experiment was performed, in which the method was modified by adding glass beads to the kit's extraction tubes before bead-beating. The DNA of the nine faecal samples from the zoo was extracted both with and without these additional glass beads, totalling 18 DNA extracts. The water samples (250 ml) were filtered immediately after arrival in the laboratory on 0.2 μm polycarbonate membrane filters (Millipore, Bedford, MA). These were stored at −20 °C [...] included as controls. All DNA extracts were stored at −80 °C until analysis within less than 4 months.

PCR amplification, amplicon processing, and pyrosequencing
The DNA extracts were used as templates in PCR to amplify the variable regions V1-V2 of the 16S rRNA gene for 25 cycles. All reactions were run in triplicate with the bacterial-specific primers S-D-Bact-0008-a-S-20 (5′-AGAGTTTGATCCTGGCTCAG-3′, as described by Edwards et al.) and S-D-Bact-0338-a-A-19 (5′-TGCTGCCTCCCGTAGGAGT-3′, as described by Etchebehere & Tiedje), the latter equipped with a distinct 12-nucleotide error-correcting Golay barcode for each extract as a multiplex tag (Hamady et al.). The nomenclature for the PCR primers was standardised according to Alm et al. Amplicons were visualised on a 0.8% agarose gel. All samples gave positive results; all controls (filtration, extraction, PCR) were negative and thus not analysed further. Subsequently, the sample amplicons (n = 38) were purified, pooled in equimolar amounts and sent to Selah Clinical Genomic Center, formerly EnGenCore (Columbia, SC, USA) for 454 pyrosequencing (titanium chemistry) (Figure 1).
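The role of the barcode as a multiplex tag can be illustrated with a minimal demultiplexing sketch. It is not the procedure used in the study (which relied on QIIME): it assumes that each read starts with the 12-nucleotide barcode, it accepts exact matches only and therefore ignores the error-correcting property of Golay codes, and the barcodes and sample names are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample, barcode_len=12):
    """Assign each read to a sample by its leading barcode (exact match only)."""
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        barcode, remainder = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            unassigned.append(read)             # unknown or erroneous barcode
        else:
            by_sample[sample].append(remainder)  # barcode stripped from the read
    return dict(by_sample), unassigned

# Hypothetical barcodes and sample names, for illustration only
barcodes = {"AACCGGTTAACC": "W.example", "TTGGCCAATTGG": "F.example"}
primer = "TGCTGCCTCCCGTAGGAGT"  # barcoded reverse primer from the text
reads = [
    "AACCGGTTAACC" + primer,
    "TTGGCCAATTGG" + primer,
    "GGGGGGGGGGGG" + primer,    # barcode not in the mapping
]
assigned, unassigned = demultiplex(reads, barcodes)
```

In practice a Golay-aware decoder would also recover reads whose barcode contains a small number of sequencing errors, rather than discarding them as this sketch does.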

Sequence analysis
Sequence analysis was performed using the software package Quantitative Insights Into Microbial Ecology, QIIME (Caporaso et al.). Raw sequences were quality filtered and assigned to the samples according to their barcodes. The remaining flowgrams were denoised to reduce sequencing noise. After removing the primer sequences, chimeric sequences identified by de novo (abundance-based) and reference-based chimera detection with UCHIME were filtered out (Edgar et al.). The remaining sequences were binned into operational taxonomic units (OTUs) using USEARCH, with a minimum pairwise identity of 97%. Greengenes OTUs (97%; version August 2013) were specified as a reference database at the previous two steps. Rare OTUs represented by fewer than four sequences were filtered out. Samples yielding fewer than 1,994 sequences (i.e. the fourth smallest number of sequences per sample) were not subjected to further analyses (n = 3). The most abundant sequence in each OTU was chosen as a representative and aligned using PyNAST and the Greengenes reference alignment (DeSantis et al.) trimmed to the V1-V2 region of the 16S rRNA gene with a minimum percent identity of 75%. The hypervariable regions were filtered out with the V1-V2 trimmed version of the lanemask, and a phylogenetic tree was constructed using FastTree (Price et al.). Taxonomy was assigned with the Ribosomal Database Project classifier with a minimum confidence of 80% and the Greengenes taxonomy (August 2013). A total of 1,994 sequences were randomly selected from each sample for further analyses (rarefaction). In order to compare the bacterial communities between the samples, we calculated the pairwise unweighted UniFrac distance metric (Lozupone & Knight) and clustered the resulting matrix using principal coordinate analysis to visualise the phylogenetic relatedness of the bacterial communities (Figure 1). Sequence data from this project are available in the Sequence Read Archive of the National Center for Biotechnology Information under the study accession number SRP055404.
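The filtering and rarefaction steps described above can be sketched in a few lines. This is an illustration, not the QIIME implementation used in the study. It assumes that the rare-OTU threshold applies to the total read count of an OTU across all samples, and it uses a toy OTU table with a toy depth of 3 in place of the 1,994 reads used here.

```python
import random
from collections import Counter

MIN_OTU_READS = 4  # OTUs with fewer than four reads are dropped

def drop_rare_otus(table, min_reads=MIN_OTU_READS):
    """Remove OTUs whose total count across all samples is below min_reads."""
    totals = Counter()
    for counts in table.values():
        totals.update(counts)
    keep = {otu for otu, n in totals.items() if n >= min_reads}
    return {s: {o: n for o, n in c.items() if o in keep} for s, c in table.items()}

def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement; None if the sample is too shallow."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None  # such samples are excluded from further analysis
    return dict(Counter(rng.sample(pool, depth)))

# Toy OTU table (hypothetical counts): sample -> {OTU: reads}
table = {
    "sampleA": {"otu1": 6, "otu2": 3, "otu3": 1},
    "sampleB": {"otu1": 2, "otu2": 1, "otu4": 2},
}
filtered = drop_rare_otus(table)  # otu3 (1 read) and otu4 (2 reads) are removed
rng = random.Random(0)            # fixed seed for reproducibility
even = {s: rarefy(c, 3, rng) for s, c in filtered.items()}
```

Rarefying every sample to the same depth keeps richness and UniFrac comparisons from being driven by differences in sequencing effort, at the cost of discarding reads from deeply sequenced samples.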

RESULTS AND DISCUSSION
This study set out to assess the suitability of an amplicon sequencing approach targeting bacterial 16S rRNA genes using 454 pyrosequencing for the evaluation and development of molecular biological methods in water quality testing as well as a direct tool for monitoring water quality. The test sample set comprised water, sediment, soil, and faecal samples from a backwater study area influenced by the river Danube, as well as faecal samples from various zoo animals. Sequencing yielded 240,944 raw sequence reads assigned to the 38 DNA samples, which were reduced to 136,821 high quality sequences by quality filtering. Subsequent identification and removal of chimeric sequences and rare OTUs further decreased the number to 126,720 reads. The samples W.42, W.1B, and F.orangutan yielded fewer than 1,994 filtered sequences and were excluded from further analysis.

NGS as a tool for the development and evaluation of molecular detection methods?
Using NGS to evaluate DNA extraction bias
DNA isolation is a (highly) critical step, especially in the application of (semi-)quantitative molecular biological methods on environmental samples. Inappropriate DNA extraction efficiency will bias all subsequent analysis, preventing meaningful biological insights (Feinstein et al.). In this study a commercial kit (MoBio PowerSoil DNA Isolation Kit) was used to extract DNA from faecal samples. To test the effect of different DNA extraction procedures on the community composition detected in the DNA extract, the extraction protocol was modified by adding glass beads to the extraction vials before the initial bead-beating step. This leads to higher mechanical stress on particles, cells, and molecules. Figure 2 shows that the change in procedure led to a distinct shift in the detected community composition on the level of bacterial phyla. The abundance in terms of read number of the dominant phyla Bacteroidetes and Proteobacteria decreased significantly across all samples. Conversely, members of the phylum Firmicutes became much more dominant in the extracts obtained with the harsher extraction, often reaching proportions of greater than 90% of the total community. These results might indicate that the Gram-positive and often spore-forming Firmicutes are more efficiently lysed with the modified procedure, leading to an elevated representation in the results. However, the shift could also be explained by increased shearing of DNA from less resilient bacterial clades, destroying their DNA and making it unavailable for downstream analysis. This experiment demonstrates that the chosen NGS approach was sensitive enough to detect dramatic changes in the composition of DNA extracts caused by relatively minor changes in extraction procedures. The observed effects are most likely to cause biases in all kinds of downstream analysis such as PCR-based methods, and are particularly critical for quantitative approaches (Feinstein et al.).

NGS as a tool for evaluation of existing molecular methods
The high-resolution, sequence-based picture of community composition in the samples is also useful for the evaluation of existing PCR-based methods targeting dominant populations of the investigated marker gene. The NGS approach used here provides a sample-specific 'sequence database' that can be searched for binding sites of PCR primers and probes or FISH probes, giving an indication of whether the targets of the assays are present and abundant in a sample or a group of samples. Among other applications, this allows in silico evaluation of the source-sensitivity of microbial source tracking assays (Newton et al.) or assays for faecal indication (Vierheilig et al.).
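Such a search of a sample-specific sequence database can be sketched as follows. The example is a simplification under stated assumptions: it requires a perfect match (allowing only IUPAC degenerate positions in the oligonucleotide), it checks both strands, and the probe and the sequences are hypothetical rather than taken from the study.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence, including IUPAC codes."""
    return seq.translate(COMPLEMENT)[::-1]

def oligo_regex(oligo):
    """Compile an oligonucleotide with IUPAC codes into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in oligo.upper()))

def fraction_with_site(sequences, oligo):
    """Fraction of sequences carrying a perfect binding site on either strand."""
    fwd = oligo_regex(oligo)
    rev = oligo_regex(reverse_complement(oligo.upper()))
    hits = sum(1 for s in sequences if fwd.search(s) or rev.search(s))
    return hits / len(sequences) if sequences else 0.0

# Hypothetical probe and sample-derived sequences, for illustration only
probe = "GGRTAC"
sample_sequences = ["TTGGATACCC", "TTGGGTACCC", "AAAAAAAAAA", "CCGTATCCAA"]
coverage = fraction_with_site(sample_sequences, probe)
```

Computing this fraction separately for target and non-target sample groups gives a first, purely in silico impression of assay sensitivity and specificity. A realistic evaluation would also tolerate a limited number of mismatches, which this sketch does not.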

NGS as a tool for the development of novel molecular methods
An extensive sample-derived sequence database is ideally suited for the development of novel methods targeting bacterial populations that are represented in the respective sample-derived NGS database. NGS results reveal the relative abundances of the dominant bacterial populations in each sample and thereby give a semi-quantitative indication of potential target populations to inform assay design. Figure 3 shows the bacterial phyla abundances found in the different sample types investigated in this study. It becomes evident that different habitats were dominated by distinct bacterial populations. While faecal communities were dominated by Firmicutes and (to a lesser degree) Bacteroidetes and Proteobacteria, soil samples were dominated by Proteobacteria and Acidobacteria (Figures 2 and 3). Sediment and water communities were more similar to each other and mainly contained members of the phyla Proteobacteria, Bacteroidetes, and significant populations of Cyanobacteria. It should be mentioned that, depending on the read length and quality, the taxonomic composition of a sample can often be resolved down to the level of bacterial genera (Kuczynski et al.). Bacterial populations that are characteristic for a group of samples and highly abundant in that group are thereby considered as ideal targets for molecular diagnostics (Eren et al.). Assay design (primers, probes) can be directly based on the sequence information retrieved by the NGS approach, and a preliminary testing of assay specificity against non-target samples can be performed in silico, as mentioned above (Newton et al.). Alternative methods for the characterisation of bacterial community structures such as DGGE and T-RFLP also allow the highlighting of target populations, but have much lower resolution and do not directly provide sequence information. In contrast, the unprecedented depth and information density provided by NGS approaches form a much more stable basis for state-of-the-art assay development.

NGS for the characterisation of microbial diversity
The high-resolution insight into community composition provided by deep amplicon sequencing makes it a valuable tool for monitoring of microbial communities in natural as well as technical aquatic systems (Lin et al.). It is particularly suited for the investigation of temporal or spatial changes in the community composition as well as for the identification of drivers triggering changes along environmental gradients (Fierer et al.).
To assess whether the applied sequencing depth in this study (minimum of 1,994 sequence reads per sample) was sufficient to give a representative impression of the community in the samples, α-diversity measures were estimated by rarefaction analysis (Figure 4). The analysis made evident that the complete diversity was unveiled in none of the four sample types: faeces, soil, sediment, and water. Diversity was much lower in water samples than in faeces, soil, and sediment, indicating that for this habitat lower sequencing depth might be sufficient. This finding is in accordance with literature data showing that soil habitats have very high bacterial diversity when compared to water or intestinal systems (Ley et al.). One of the strengths of NGS approaches is that sequencing depth and effort can and indeed have to be adapted to the complexity of the investigated environment in order to ensure efficient use of resources and provide meaningful results.
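The idea behind a rarefaction analysis can be sketched as follows: reads are repeatedly subsampled at increasing depths and the mean number of observed OTUs is recorded, so that a curve still rising at the maximum depth indicates incompletely sampled diversity. The sketch uses observed OTU richness as the α-diversity measure and a hypothetical OTU count vector; the measures and settings actually used for Figure 4 are not specified here and may differ.

```python
import random

def rarefaction_curve(counts, depths, n_iter=10, seed=0):
    """Mean number of observed OTUs at each subsampling depth."""
    rng = random.Random(seed)
    # One list entry per read, labelled with the OTU it belongs to
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    curve = {}
    for depth in depths:
        if depth > len(pool):
            break  # cannot subsample more reads than the sample contains
        richness = [len(set(rng.sample(pool, depth))) for _ in range(n_iter)]
        curve[depth] = sum(richness) / n_iter
    return curve

# Hypothetical OTU counts for a single sample (100 reads in total)
counts = {"otu1": 50, "otu2": 30, "otu3": 15, "otu4": 4, "otu5": 1}
curve = rarefaction_curve(counts, depths=[10, 50, 100])
```

At the full depth of 100 reads every OTU of this toy sample is necessarily observed, whereas shallower subsamples tend to miss the rare OTUs first.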

NGS revealing bulk microbial community structure and dynamics
As shown in Figure 3, the taxonomic composition of the microbial community can be derived directly from the NGS results. That in itself can give crucial insights into the constitution and status of the investigated sample by identifying signature taxa or monitoring quantitative shifts between dominant taxa with known traits. Beyond and independent of taxonomic identification, a deep amplicon sequencing database allows the investigation of the relatedness of communities in different samples based on the phylogenetic history of their members.
In order to investigate the diversity between samples (β-diversity) in this study, the sequence reads of each sample were aligned to a 16S rRNA gene reference alignment, from which a phylogenetic tree was constructed. This tree, representing the phylogenetic composition of the samples, was used to calculate the so-called UniFrac metric, which serves as a distance measure for β-diversity, i.e. a measure for assessing how closely related two communities are in terms of shared evolutionary ancestry of their constituents. The resulting UniFrac distance matrix was subjected to principal coordinate analysis to visualise which communities are more closely related and which are more distinct (Figure 5). Faecal communities were clearly set apart from other sample types while soil and sediment communities were closely related. This is not surprising because the study area is a backwater area regularly inundated during flooding. Interestingly, two of the water samples exhibited communities more closely related to soil and sediment samples, namely the Danube River water sample W.2017 and the PGWA sample W.2016, which was taken directly adjacent to the Danube in a branch of the river. All other water-sampling sites are only connected to the river during flood events. Taken together, these results demonstrate that the chosen NGS approach is indeed able to resolve spatial heterogeneities in community composition of the dominant bacterial populations in a sample and will therefore be a useful tool for the monitoring of both spatial and temporal changes in community composition in the environment. This has been demonstrated previously in several studies which successfully monitored microbial community changes in drinking water treatment and distribution systems (Hong et al.).

NGS for the detection of faecal pollution in water
In contrast to bulk bacterial community analysis, other investigators suggested the use of NGS-derived community signatures as a tool for identifying faecal pollution sources in water (Unno et al.; Newton et al.). Other scientists proposed the use of NGS approaches for the direct detection of pathogens in wastewater treatment plants (Ye & Zhang; Cai & Zhang). These applications highlight one of the basic restrictions of deep amplicon sequencing of total bacterial communities, which is the problem of relative abundances of target and background populations. Wastewater or faecal bacterial communities become rapidly diluted when entering environmental waters. In addition, pathogens constitute only a very minor portion of wastewater bacterial communities in the first place. To exemplify this, we searched for the commonly used faecal indicator Escherichia coli in the sequencing results of this study. Although E. coli can be detected by cultivation in high abundances in most faecal samples (Farnleitner et al.) and was consistently cultivated in the water samples included in this study (concentrations ranging from 7 to >300 colony forming units per 100 ml), we were unable to find a single sequence read related to that species or even the genus Escherichia in the entire dataset. Faecal indicators and, to an even greater degree, pathogens are quantitatively very minor constituents even of faecal communities. These results highlight that NGS amplicon pyrosequencing using general bacterial primers is indeed able to detect abundant bulk populations in a community (e.g. Bacteroidetes or Firmicutes in faeces) but, at the applied sequencing depth, is not able to detect very low abundant populations that often are of relevance for the microbiological quality assessment of water (faecal indicators and pathogens) (Cai & Zhang). Ipso facto, it is evident that the dilution in water resources limits the capability of any NGS method to find these target populations in a background of autochthonous, i.e. 'native' populations (Farnleitner et al.). Other possible methodological restrictions that should be considered are the relatively short read length of current NGS methods and the sequencing error rate, which might suggest a level of diversity that is actually not present in the sample ('rare biosphere problem') (Reeder & Knight). These problems can be overcome by conservative and careful data analysis and interpretation. Furthermore, one also has to keep in mind that results of NGS amplicon sequencing do not provide quantitative concentrations but relative quantities as related to the total gene community. Additionally, all biases associated with the application of PCR methods also apply to deep amplicon sequencing (von Wintzingerode et al.). In contrast, this is not the case when applying metagenomic sequencing approaches, which do not employ gene-specific primers for DNA amplification and therefore avoid the respective biases (Cai & Zhang). However, these approaches require much higher sequencing effort and bioinformatic analysis resources. With NGS sequencing services getting cheaper by the month, concomitant with increasing sequencing yield and quality, the main technical bottleneck will be the handling of the enormous amounts of data provided by these methods. Today there are precious few ready-made tools for data analysis available, and bioinformatics expertise is in short supply. This topic highlights the necessity to formulate clear hypotheses and research questions before starting an NGS-based investigation.
The currently used applications of NGS in water quality monitoring are still rather demanding in terms of necessary expertise and equipment, i.e. they require the availability of a molecular biological laboratory for sample processing. Although sequencing facilities offer full service packages for amplicon sequencing, metagenomic sequencing, and even preliminary bioinformatic analysis of the results, the costs of NGS analysis remain rather high and the high-throughput NGS methods are not well suited to small-scale investigations. Despite current attempts to tailor NGS-based analysis pipelines to the needs of the water industry (Unno et al.), it remains one of the main challenges of the coming years to make these technologies accessible also to facilities for practical application in the water sector.

CONCLUSIONS
The results of this study demonstrate that deep amplicon sequencing of the 16S rRNA marker gene using NGS methods could be a valuable tool for many applications in water quality monitoring. It is useful for the development and evaluation of molecular diagnostic tools to detect abundant bacterial indicators in water resources (McLellan & Eren). For the detection of pathogens or faecal indicators with low abundances in the environment, however, amplicon sequencing of bulk bacterial communities is not sufficiently sensitive at the applied sequencing depth. Deeper sequencing as provided by new NGS approaches and technologies might prove up to this task. Nevertheless, the results of the present study demonstrate that this method is indeed capable of unveiling the dominant or bulk bacterial communities in water samples. The approach, applied to environmental water samples in this study, can be directly translated to the monitoring of bacterial communities in water treatment plants and distribution systems, where it can provide unprecedented insights into efficiency of measures and biostability of water (Roeselers et al.). At present, NGS approaches remain mainly a highly powerful research tool with the potential to fundamentally revolutionise our knowledge about microbial content and dynamics in water resources and treatment. In order to translate the novel methods and findings into useful and accessible solutions for the practitioner in the water field, research and future development will have to supply standardised laboratory procedures and, in particular, data analysis pipelines and software tools as well as specialised sequence databases tailored to the requirements of water quality assessment.

F Pongo pygmaeus/abelii (Bornean/Sumatran orangutan)
F.lemur b Faeces Zoo; Lemur catta (Ring-tailed lemur)
a Porous groundwater well aquifer (PGWA) area.
b DNA extraction of these nine samples both with and without additional glass beads to test the effect of modifications in the DNA extraction procedure (resulting in 18 DNA extracts).

Figure 1 | NGS pipeline followed in this study in the laboratory and in silico.

Figure 2 | Phylum-level bacterial community composition of faecal samples extracted in parallel with the original DNA extraction procedure (e1) and with the modified, harsher extraction procedure (e2). Results are given in absolute read numbers per sample. Phyla represented by 3 sequences are not shown.

Figure 3 | Bacterial phyla found in the faeces (n = 5), soil (n = 2), sediment (n = 2) and water (n = 9) samples from the PGWA study area. Results are average abundances as a percentage of the complete community. Phyla that could be detected with an average abundance of <1% are not shown.

Figure 5 | Visualisation of a principal coordinate analysis of the β-diversity (between-sample diversity) as calculated by the phylogeny-based, unweighted UniFrac metric. In this analysis, samples with phylogenetically more similar communities cluster more closely together.
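The unweighted UniFrac metric underlying this ordination can be illustrated on a toy tree. The distance between two communities is the summed length of the branches that lead to members of only one community, divided by the summed length of all branches that lead to members of either community. The sketch below is not the QIIME implementation; the tree, branch lengths, and OTU names are hypothetical, and the tree is stored as a simple child-to-parent mapping.

```python
def covered_branches(otus, parent):
    """Branches (named by their child node) on the paths from the OTUs to the root."""
    branches = set()
    for node in otus:
        while node in parent:        # the root has no parent entry
            branches.add(node)
            node = parent[node][0]
    return branches

def unweighted_unifrac(otus_a, otus_b, parent):
    """Unique branch length divided by total branch length of both communities."""
    a = covered_branches(otus_a, parent)
    b = covered_branches(otus_b, parent)

    def length(nodes):
        return sum(parent[n][1] for n in nodes)

    total = length(a | b)
    return length(a ^ b) / total if total else 0.0

# Hypothetical tree ((otu1, otu2)X, (otu3, otu4)Y)root, all branch lengths 1.0
parent = {
    "otu1": ("X", 1.0), "otu2": ("X", 1.0),
    "otu3": ("Y", 1.0), "otu4": ("Y", 1.0),
    "X": ("root", 1.0), "Y": ("root", 1.0),
}
d_distinct = unweighted_unifrac({"otu1", "otu2"}, {"otu3", "otu4"}, parent)
d_partial = unweighted_unifrac({"otu1", "otu2"}, {"otu1", "otu3"}, parent)
d_same = unweighted_unifrac({"otu1"}, {"otu1"}, parent)
```

Because the metric is unweighted, only the presence or absence of an OTU matters, not its read count, which is why all samples are rarefied to the same depth before the distances are calculated.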

Table 1 | Overview of samples (n = 29)
a; Sus scrofa (Wild boar)
F.reddeer Faeces PGWA a; Cervus elaphus (Red deer)
F.roedeer Faeces PGWA a; Capreolus capreolus (Roe deer)
F.fallowdeer Faeces PGWA a; Dama dama (Fallow deer)
F.mouflon Faeces PGWA a; Ovis orientalis musimon (European mouflon)

al.; Ye & Zhang). One way to circumvent this problem is to use group-specific primers instead of general primers targeting most bacteria (Unno et al.), although this sacrifices the general overview and broad focus provided by total community analysis. Another possibility is the substantial increase of sequencing depth by at least two orders of magnitude. Novel, very recently emerging sequencing technologies and platforms (Liang & Zhang) might offer sequencing depths that are also able to detect (very) low abundant populations of interest.
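The effect of sequencing depth on the chance of finding a rare population can be illustrated with a simple sampling model. Under the simplifying assumptions that reads are drawn independently and that a target makes up a fraction p of the amplified marker genes (ignoring extraction and PCR bias), the probability of obtaining at least one target read among N reads is 1 − (1 − p)^N. The relative abundance used below is a hypothetical value chosen for illustration, not a measurement from this study.

```python
import math

def detection_probability(rel_abundance, depth):
    """Probability of at least one target read among `depth` independent reads."""
    return 1.0 - (1.0 - rel_abundance) ** depth

def depth_for_detection(rel_abundance, confidence=0.95):
    """Smallest depth at which the detection probability reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rel_abundance))

p = 1e-5  # hypothetical relative abundance of a faecal indicator in a water sample
p_at_study_depth = detection_probability(p, 1994)       # depth used in this study
p_at_100x_depth = detection_probability(p, 1994 * 100)  # two orders of magnitude deeper
reads_needed = depth_for_detection(p)                   # for 95% detection probability
```

For this hypothetical abundance the detection probability is roughly 2% at 1,994 reads and roughly 86% at a 100-fold greater depth, and about 300,000 reads per sample would be needed to reach 95%. This is consistent with the statement that an increase in depth by at least two orders of magnitude would be required.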