The Qarhan Salt Lake is the second largest salt lake in the world and contains a rich and unique range of extremophiles that merit in-depth exploration. Halophilic microorganisms are promising resources for biotechnology because of their flexibility and survivability. In the present study, a novel strain, Halobacillus trueperi S61, was first isolated from the Qarhan Salt Lake, and whole-genome sequencing and comparative genomics were then performed using third-generation PacBio combined with second-generation Illumina technology. Sequencing of H. trueperi S61 yielded 57,549 reads, and the genome consists of a complete circular chromosome of 4,047,887 bp with 43.86% guanine-cytosine (GC) content and no gaps. A total of 139 non-coding ribonucleic acids (RNAs) (including 86 tRNAs, 30 rRNAs, and 23 sRNAs), 16 gene islands covering 260,275 bp, and two prophages (82,682 bp in length) were predicted. A total of 3,982 protein-coding genes were predicted, of which 3,980, 3,667, 2,998, and 2,303 were annotated in the Nr, SwissProt, KOG, and KEGG databases, respectively. In addition, 561 carbohydrate-active enzymes and 4,416 pathogen–host interaction-related genes were identified. The protein functions of H. trueperi S61 were concentrated in biological processes, particularly gene transcription and amino acid and carbohydrate metabolism. The annotation of the novel strain H. trueperi S61 isolated from the Qarhan Salt Lake pointed mainly to biological processes and antibiotic resistance, providing a potential resource for biotechnology.
The whole genome of Halobacillus trueperi S61 isolated from the Qarhan Salt Lake was sequenced and characterized.
The Halobacillus trueperi S61 genome was predicted to contain 3,982 genes with a total length of 3,567,510 bp and 44.57% GC content.
Basic annotation was summarized for the 3,982 protein-coding genes of Halobacillus trueperi S61.
The annotated functions of Halobacillus trueperi S61 pointed mainly to biological processes and antibiotic resistance.
The marine environment is a rich pool of resources and contains numerous halotolerant or psychrophilic microorganisms that have evolved physiological and genomic adaptations to extreme conditions. Among them, halophilic microbes are flexible and resilient, which makes them valuable and promising for biotechnology (Poli et al. 2017; Hong et al. 2019; Zhang et al. 2022). For instance, Halobacillus members are an important source of halotolerant extracellular enzymes for industrial production, and Halobacillus trueperi RSK CAS9 was optimized for lipase production in the marine fish industry (Sathishkumar et al. 2015; Treves et al. 2018; Park et al. 2020).
Notably, H. trueperi is a moderately halophilic (salt concentration of 0.5–2.5 mol·L⁻¹), aerobic, heterotrophic bacterium that was first isolated from the Great Salt Lake (Utah) (Spring et al. 1996). Since then, H. trueperi has attracted increasing attention and further study. Lu et al. (2004) isolated H. trueperi from saltwater in the western Himalayas and reported that H. trueperi DSM10404 was able to accumulate glycine, glutamate, and betaine as salt-tolerant compatible solutes. Gupta et al. (2019) isolated H. trueperi SS1 from Lunsu saltwater, and Kharangate-Lad & Bhosle (2016) isolated H. trueperi MXM-16 from mangrove plant litter; this strain is capable of producing hydroxamate siderophore and carotenoid pigments to chelate iron. Rivadeneyra et al. (2004) isolated H. trueperi ATCC 700077 from solid and liquid saline environments in the southern Sahara region of Tunisia and reported it as a major ecosystem-adaptive microorganism. Although many halophilic bacteria have been reported, their features differ among natural environments. Importantly, the Qarhan Salt Lake is the second-largest salt lake in the world and the largest in China (Shen et al. 2022), and it contains rich and unique halophilic microbial resources that merit in-depth exploration (Li et al. 2020).
With the development of biotechnology, researchers have used emerging techniques to identify organisms (Shaikh et al. 2020). Currently, whole-genome sequencing is used as a novel and culture-independent technique to explore the genetic diversity and evolutionary history of microorganisms, and various genomic projects have been performed (Thirugnanasambandam et al. 2017; Edward et al. 2018; Xu et al. 2020; Zhang et al. 2020; Chen et al. 2021; Wang et al. 2021a, 2021b). Although second-generation Illumina sequencing offers large throughput, high accuracy, and low cost, its read length is relatively short (150–400 bp). Third-generation single-molecule real-time PacBio sequencing offers ultra-long reads (average of 10–12 kb), high throughput (5–10 Gb of data), no GC bias, and direct detection of various types of DNA methylation. These advanced sequencing technologies provide novel insights into the metabolic profiles of microorganisms (Buermans & den Dunnen 2014; Kang et al. 2015; Williamson et al. 2016).
Based on a novel strain isolated for the first time from the Qarhan Salt Lake and identified as H. trueperi S61 (Shen et al. 2022), this study performed whole-genome sequencing and comparative genomics by combining the advantages of third-generation PacBio and second-generation Illumina technology. The complete and accurate genome assembly, together with the predicted active secondary metabolites, helps to clarify the genome properties of H. trueperi S61 and contributes to genetics and potential biocontrol applications.
MATERIALS AND METHODS
Sample collection and pretreatment as well as microbe isolation
Fresh water and soil were collected from the Qarhan Salt Lake on the Qinghai-Tibet Plateau, China (36°18′–36°45′N, 99°02′E). Fifteen sampling points were sampled according to the five-point sampling method. All samples were kept in portable freezers and transported to the laboratory for pretreatment. The water samples were mixed and filtered through a 0.22 μm filter, and bacteria were enriched on the filter membrane under sterile conditions. Once 30 ml of water had been filtered, the filter membrane was removed and placed in a glass test tube containing 3 ml of sterile seawater; this suspension was designated the 10⁻¹ water sample. For soil, an appropriate amount of sample was air-dried and treated at 120 °C for 1 h. Then 10 g of treated soil was weighed into a sterilized triangular flask containing glass beads, 90 ml of sterile seawater was added, and the flask was shaken thoroughly for 30 min, after which the supernatant was collected as the 10⁻¹ soil sample. The 10⁻¹ samples were serially diluted to 10⁻² and 10⁻³, and 150 μl of each dilution was spread on ATCC213 medium plates (10 g MgSO4·7H2O, 0.2 g CaCl2·2H2O, 2.5 g peptone, 10 g yeast extract, 5 g KCl, 30 g NaCl, and 12 g agar powder, made up to 1,000 ml with distilled water, with the pH adjusted to 7.2–7.4), with three replicates. The plates were inverted and incubated at 28 and 37 °C. After colonies appeared, single colonies were picked and purified to obtain pure strains.
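The dilution series described above determines how a colony count on a plate converts back to a cell density in the original suspension. The following is a minimal sketch of that arithmetic, not part of the reported protocol: the function name `cfu_per_ml` and the colony count used in the example are hypothetical, since no plate counts are reported here.

```python
def cfu_per_ml(colonies: int, dilution: float, plated_volume_ml: float) -> float:
    """Colony-forming units per ml of the undiluted suspension.

    colonies         -- colonies counted on one plate
    dilution         -- dilution of the plated sample (e.g. 1e-2 for the 10^-2 sample)
    plated_volume_ml -- volume spread on the plate, in ml (150 ul = 0.150 ml)
    """
    if colonies < 0 or dilution <= 0 or plated_volume_ml <= 0:
        raise ValueError("colonies must be >= 0; dilution and volume must be > 0")
    return colonies / (dilution * plated_volume_ml)


# Hypothetical count: 45 colonies on a plate spread with 150 ul of the 10^-2 sample.
density = cfu_per_ml(colonies=45, dilution=1e-2, plated_volume_ml=0.150)
print(f"{density:.3g} CFU per ml of undiluted suspension")
```

With these illustrative inputs the estimate is roughly 3 × 10⁴ CFU per ml. For the soil samples, where 10 g of soil in 90 ml gives the 10⁻¹ suspension, the same calculation is conventionally read as CFU per gram of soil.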
Molecular identification and genome analysis of H. trueperi S61
H. trueperi S61 was isolated and purified by dilution-plate enrichment culture. Morphologically, it was Gram-positive, spherical, and capsulated, with peritrichous flagella and a cell size of 0.6–0.8 × 0.4–0.6 μm. For molecular identification, DNA was extracted with a column bacterial DNA extraction kit (Sangon Biotech Co., Ltd, Shanghai, China) following the manufacturer's instructions. The bacterial universal primers were F27 (5′-AGAGTTTGATCCTGGCTCAGG-3′) and P1541 (5′-AAGGAGGTGGTGATGCCGCA-3′). The PCR was run in a 50 μl reaction system for 30 cycles of denaturation at 94 °C for 45 s, annealing at 50 °C for 45 s, and extension at 72 °C for 75 s. The PCR products were checked by agarose gel electrophoresis, cloned, and sequenced, and the sequence was uploaded to https://www.ezbiocloud.net for comparison. Finally, the strain was assigned to the matched taxon and named H. trueperi S61 (preservation number GDMCC No: 60078).
Genome sequencing was carried out on third-generation PacBio and second-generation Illumina platforms with the assistance of Genedenovo Biotechnology Co., Ltd (Guangzhou, China). First, DNA was extracted and its quality was checked by concentration measurement and electrophoresis. The RS II and Sequel systems from Pacific Biosciences were then used for single-molecule real-time (SMRT) sequencing: after library construction, Qubit was used for quality detection and an Agilent 2100 was used to evaluate insert sizes before PacBio sequencing was performed. In parallel, Illumina sequencing was performed on a HiSeq X Ten after library construction and quality detection.
Genome assembly and function annotation
Genome assembly was performed using the third-generation sequencing data, and the second-generation data were then used to correct the assembly. Genome component analysis and functional annotation were based on the corrected assembly. First, the raw PacBio and Illumina data were filtered for quality control to obtain clean data. Then, the third-generation reads were assembled with Falcon, and the coverage and GC distribution were calculated. Based on the assembled genome sequence and the predicted coding genes, a circular genome map was drawn to display the features of the genome comprehensively.
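The coverage and GC distribution mentioned above are simple summary statistics of an assembly. The sketch below shows one common way to compute them and is not the pipeline used in this study: the window size, step, and function names are assumptions chosen for illustration, and the toy sequence is invented.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def windowed_gc(seq: str, window: int, step: int):
    """Yield (start, GC%) for each full window along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])


def mean_coverage(total_read_bases: int, genome_length: int) -> float:
    """Average sequencing depth = sequenced bases / assembled genome length."""
    return total_read_bases / genome_length


# Toy sequence (invented): a GC-rich stretch followed by an AT-rich stretch.
toy = "GGCCGGCC" + "ATATATAT"
for start, gc in windowed_gc(toy, window=8, step=8):
    print(start, gc)
```

Plotting the windowed values along the chromosome is what typically produces the GC track of a circular genome map; regions whose GC content departs from the genome average are often candidates for horizontally acquired DNA.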
Genome component analysis followed. Coding genes were predicted using the National Center for Biotechnology Information (NCBI) database; ribosomal RNA (rRNA) and transfer RNA (tRNA) were predicted with RNAmmer and tRNAscan, and small RNA (sRNA) was predicted with cmscan by comparison against the Rfam database. Furthermore, interspersed and tandem repeat sequences of the bacterial genome were predicted using RepeatMasker and TRF software. CRISPRFinder was used to predict clustered regularly interspaced short palindromic repeats (CRISPR) in the genome. TransposonPSI (version: 20100822), IslandViewer 4, and Phage_Finder were used to predict transposons, gene islands (GIs), and prophages, respectively.
In addition, basic and advanced functional annotations were carried out. For basic annotation, the amino acid sequences encoded by the genes were compared with the non-redundant protein (Nr) and SwissProt databases using blastp and diamond. The Kyoto Encyclopedia of Genes and Genomes (KEGG), Gene Ontology (GO), and Clusters of Orthologous Groups (COG) databases were likewise used to obtain the corresponding annotations and classify the genes accordingly. For advanced analysis, Pfam Scan (https://www.ebi.ac.uk/Tools/pfa/pfamscan/) was run against the Pfam database of protein family alignments and hidden Markov models to provide protein family and domain classification information. blastp was used for the pathogen–host interactions (PHI-base) and carbohydrate-active enzymes (CAZy) database analyses. Protein sequences of the predicted genes were analyzed with SignalP 4.1 to identify signal peptides, and transmembrane proteins and effector proteins were predicted with TMHMM and EffectiveT3, respectively. Resistance Gene Identifier (RGI), blastn, and antiSMASH 4.1.0 were used to predict antibiotic resistance ontology terms, virulence factors of pathogenic bacteria (VFDB), and secondary metabolism gene clusters, respectively.
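Database annotation with blastp or diamond usually ends with choosing one hit per gene from a tabular result file. The sketch below illustrates that step for the standard 12-column tabular format (outfmt 6); it is not the authors' script. The e-value cutoff of 1e-5, the function name `best_hits`, and all gene and subject identifiers in the example are assumptions made for illustration.

```python
import csv
import io

# Standard 12 columns of BLAST/DIAMOND tabular output (outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query: row}, keeping the highest-bitscore hit per query.

    max_evalue is an assumed cutoff; the thresholds used in the study are not stated.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Made-up example rows; gene and subject identifiers are hypothetical.
example = io.StringIO(
    "gene_0001\tsubjA\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550.0\n"
    "gene_0001\tsubjB\t70.0\t280\t80\t2\t5\t284\t3\t282\t1e-60\t210.0\n"
    "gene_0002\tsubjC\t35.0\t90\t55\t3\t1\t90\t10\t99\t0.5\t30.1\n"
)
hits = best_hits(example)
for gene, row in hits.items():
    print(gene, row["sseqid"], row["bitscore"])
```

In this toy input, `gene_0001` keeps its stronger hit and `gene_0002` is left unannotated because its only hit fails the assumed cutoff, which is how a database can annotate fewer genes than were predicted.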
RESULTS AND DISCUSSION
Genome assembly and component features of the H. trueperi S61 genome
The genome of H. trueperi S61 was predicted to contain 3,982 genes with a total length of 3,567,510 bp and a GC content of 44.57%. Non-coding RNA (ncRNA) does not encode protein but performs various biological functions at the RNA level (Chen et al. 2021). In this study, 139 ncRNAs were identified, including 86 tRNAs, 30 rRNAs, and 23 sRNAs (Table 1). Among them, tRNA was the most abundant, accounting for 0.16% of the total sequence length, which indicates an important role of tRNA in expression and regulation in H. trueperi S61 cells (Figure 2(b)). Repeated sequences, as components of gene regulatory networks, affect evolution, heredity, and mutation (Li et al. 2021a). This study predicted 58 interspersed repeats of five types totalling 3,909 bp. LINEs and SINEs were the most numerous, with 28 and 18 elements covering 1,856 and 1,149 bp, respectively, while DNA and LTR elements covered smaller proportions (497 and 218 bp, respectively). In addition, three types of transposons (including helitronORF and LINE) were predicted.
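The repeat figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers reported in the text, with the 4,047,887 bp chromosome length from the abstract taken as the denominator; that choice of denominator and the variable names are assumptions for illustration.

```python
GENOME_BP = 4_047_887      # chromosome length reported in the abstract
TOTAL_REPEAT_BP = 3_909    # all interspersed repeats, as reported

# Base pairs per repeat class itemised in the text (four of the five types).
repeat_bp = {"LINE": 1_856, "SINE": 1_149, "DNA": 497, "LTR": 218}


def percent(part: float, whole: float) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


listed_bp = sum(repeat_bp.values())
unlisted_bp = TOTAL_REPEAT_BP - listed_bp  # fifth class, not itemised in the text

shares = {name: percent(bp, TOTAL_REPEAT_BP) for name, bp in repeat_bp.items()}
genome_share = percent(TOTAL_REPEAT_BP, GENOME_BP)

print(f"itemised: {listed_bp} bp, not itemised: {unlisted_bp} bp")
print(f"interspersed repeats cover {genome_share:.3f}% of the chromosome")
```

The four itemised classes sum to 3,720 bp, so the remaining 189 bp belong to the fifth type that the text does not name; in total, interspersed repeats make up less than 0.1% of the chromosome.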
| Type | Number | Average length (bp) | Total length (bp) | In genome (%) |
Clustered regularly interspaced short palindromic repeats (CRISPR) act as a genetic weapon, or natural immune system, of most bacteria and archaea because they confer resistance to foreign plasmids and phage sequences (Zhang et al. 2021). Two CRISPRs were predicted in the genome of H. trueperi S61, Crispr 1 (AGAAAACAAAACCAACAATCAGCTG) and Crispr 2 (TGATGGGAATCGAACCCACGACAT), suggesting that the strain can acquire immunity through the CRISPR pathway. Gene islands (GIs) are considered mobile genetic elements because of their association with various biological functions, especially horizontal gene transfer (Lekota et al. 2018). The predicted GI regions of H. trueperi S61 may contain antibiotic resistance genes and bacteriostatic gene fragments. A total of 16 gene islands covering 260,275 bp were predicted in the whole genome of H. trueperi S61, which may support microbial adaptation to distinct abiotic stresses and to environments with antimicrobial pressure. In addition, a prophage, as a carrier of genetic information, can integrate into the genome of the infected microbe. Previous studies have found that bacteriophages can lyse certain pathogenic microorganisms, which may benefit disease treatment, but can also lyse beneficial or other harmful microorganisms. As a result, they are widely used as carriers for horizontal transfer in beneficial microorganisms (Zhang et al. 2020). The present study identified two prophages with a length of 82,682 bp in H. trueperi S61, containing 44 and 61 coding sequences (CDSs) with 44.85% and 38.43% GC content, respectively. It is therefore speculated that H. trueperi S61 may be able to lyse pathogens, although this requires further validation.
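A CRISPR array consists of copies of a direct repeat separated by spacers, so once a repeat sequence is known its copies can be located and the spacers read off. The sketch below does this by exact string matching only; it is not the algorithm of the CRISPR prediction software used in this study, which also tolerates imperfect repeats. The toy locus, its flanks, and its spacers are invented, and only the repeat sequence is taken from the text.

```python
import re

CRISPR1_REPEAT = "AGAAAACAAAACCAACAATCAGCTG"  # Crispr 1 sequence from the text


def find_repeat_array(genome: str, repeat: str):
    """Return start positions of exact repeat copies and the spacers between them."""
    starts = [m.start() for m in re.finditer(re.escape(repeat), genome)]
    spacers = [genome[a + len(repeat):b] for a, b in zip(starts, starts[1:])]
    return starts, spacers


# Toy locus: flanking sequence and spacers are invented for illustration only.
spacer_a = "TTGACCGTAAGCTTGGCATCCGTAGGCTAACG"
spacer_b = "CCATGGTTACGGATCCAAGTCTGACTTAGGCA"
toy = ("ACGT" * 5 + CRISPR1_REPEAT + spacer_a + CRISPR1_REPEAT
       + spacer_b + CRISPR1_REPEAT + "TGCA" * 5)

starts, spacers = find_repeat_array(toy, CRISPR1_REPEAT)
print(len(starts), "repeat copies;", len(spacers), "spacers")
```

Recovered spacers can then be compared against phage and plasmid sequences, since a match indicates a past encounter with that element.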
Essential functional annotation of the genome of H. trueperi S61
The advanced function annotation of the genome of H. trueperi S61
The pathogen–host interaction database (PHI-base) includes diverse pathogenic genes related to different types of hosts and is crucial for finding target genes for drug intervention (Zhang et al. 2020). By gene annotation, strain H. trueperi S61 has 4,416 PHI-related genes. The dominant pathogen species among the matches was Burkholderia glumae, which causes bacterial grain rot (DNA gyrase, bacterial topoisomerase II), followed by Flavobacterium psychrophilum (DNA gyrase), Cryptococcus neoformans (GTP biosynthesis), Bacillus anthracis (tellurite resistance), and Pectobacterium wasabiae (posttranscriptional regulator), which cause bacterial cold-water disease, meningoencephalitis, anthrax, and soft rot, respectively. Among these, 742 pathogenic factor genes were derived from Magnaporthe oryzae (related to Magnaporthe grisea), and 367, 262, 216, and 208 pathogenic genes were related to Fusarium graminearum (related to Gibberella zeae), Aspergillus fumigatus, Alternaria alternata, and Candida albicans, respectively. Moreover, the virulence factors of pathogenic bacteria database (VFDB) annotated 15 factors associated with Listeria monocytogenes, Legionella pneumophila Philadelphia, Chlamydia trachomatis, Salmonella enterica, Escherichia coli, Bacillus anthracis, and Mycobacterium tuberculosis. In addition, the prediction of secondary metabolism gene clusters showed ten gene cluster types, including nrps, terpene, nrps-transatpks-otherks, t3pks, lantipeptide, and sactipeptide head_to_tail.
The Comprehensive Antibiotic Resistance Database (CARD) was used to associate antibiotic modules with their targets, resistance mechanisms, and genetic mutations (Lekota et al. 2018). Eleven efflux pump complexes or subunits conferring antibiotic resistance were predicted, including lmrB, ykkD, TaeA, sav1866, ykkC, lmrD, TriC, bmr, and blt. Four antibiotic inactivation enzymes included aadK, VgbC, rphB, BLA1, and mphI, and one antibiotic target protection protein (mfd) was also predicted. Further predicted resistance determinants included Enterococcus faecium cls conferring resistance to daptomycin, fabI, mecA, Bacillus subtilis mprF, Escherichia coli EF-Tu mutants conferring resistance to kirromycin, Staphylococcus aureus rpoB mutants conferring resistance to rifampicin, Mycobacterium tuberculosis intrinsic murA conferring resistance to fosfomycin, and tmrB, a determinant of resistance to nucleoside antibiotics. Treves et al. (2018) evaluated the draft genome of Halobacillus sp. BBL2006 and identified 4,331 open reading frames, which included heavy metal and antibiotic resistance genes. Although the encoded genes were annotated with different databases, the results were consistent and pointed mainly to biological processes and antibiotic resistance, which provides a potential resource for biotechnology.
Whole-genome assembly and annotation were performed on the novel strain H. trueperi S61 isolated from the Qarhan Salt Lake. The genome was predicted to contain 3,982 genes with a total length of 3,567,510 bp and a GC content of 44.57%. A total of 3,668 genes were annotated with COG, with the largest functional share (9.95%, 365 genes) in amino acid transport and metabolism. A total of 7,829 genes were annotated with GO, of which biological processes accounted for 52%, dominated by metabolic process (1,016 genes). A total of 3,672 genes were annotated with KEGG, dominated by metabolism (80.94%), and the Nr database annotated 3,980 genes, 88.32% of which matched Bacillus subtilis. Bacillus velezensis was most closely associated with Rossellomorea vietnamensis. Overall, the annotated functions of strain H. trueperi S61 were concentrated in biological processes.
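The annotation counts in this paragraph can be turned into coverage percentages for a quick consistency check. The sketch below uses only the Nr, KEGG, and COG counts reported above against the 3,982 predicted genes; the function and variable names are illustrative assumptions.

```python
TOTAL_GENES = 3_982  # protein-coding genes reported for H. trueperi S61

# Genes annotated per database, as reported in this paragraph.
annotated = {"Nr": 3_980, "KEGG": 3_672, "COG": 3_668}


def coverage(n_annotated: int, n_total: int = TOTAL_GENES) -> float:
    """Percentage of predicted genes with an annotation in a database."""
    return 100.0 * n_annotated / n_total


report = {db: round(coverage(n), 2) for db, n in annotated.items()}

# Share of COG-annotated genes assigned to amino acid transport and metabolism.
cog_amino_acid_share = 100.0 * 365 / annotated["COG"]

for db, pct in report.items():
    print(f"{db}: {pct}% of predicted genes annotated")
print(f"COG amino acid transport and metabolism: {cog_amino_acid_share:.2f}%")
```

The last value reproduces the 9.95% stated for amino acid transport and metabolism, which confirms that this percentage is expressed relative to the COG-annotated genes rather than to all predicted genes.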
We are grateful for the funding support of the General Project of the Natural Science Foundation of Qinghai Science and Technology Department (2019-ZJ-914) and the National Modern Agricultural Technology System (CARS-10).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.