Advancing analytical algorithms and pipelines for billions of microbial sequences
Highlights
► New technologies permit dramatic increases in number of sequences/sites per study.
► Multisite spatial/temporal studies with hundreds of millions of sequences are possible.
► New software pipelines are required to analyze these vast datasets.
► Field is increasingly moving from ad hoc scripts towards integrated pipelines.
► Multivariate techniques often introduce computational bottlenecks.
► Cloud computing and improved approximation methods are needed to avoid bottlenecks.
Introduction
Recent innovations in sequencing technologies have allowed microbial ecologists to advance from analyzing a few hundred sequences per study to hundreds of millions [1••, 2••]. These quantitative differences in the amount of sequence data produce qualitative differences in the types of studies that can be performed. For example, 10 years ago, characterization of a single clone library from a single body site in one subject represented a substantial advance in knowledge about the human body. A few years ago, quantifying interpersonal differences in one body site, for example, the gut, represented a major advance [3, 4]. Three years ago, a multi-site microbial scan of the body showed that the microbial communities living on the same person's body are clearly separated by body site, primarily skin, mouth, and stool [5]. Now, with higher-throughput sequencing technologies, we can observe the dynamics of the human microbiota across multiple sites and individuals through time, demonstrating that our microbial guests are highly volatile day-to-day even in healthy adults [6••]. These examples also illustrate the daunting analytical challenges that microbial researchers face in handling datasets of ever-increasing size. These challenges range from simply finding the right hypotheses to test, to finding the correct analytical tools and computational power to test them, to finding methods for visualizing the key results. Here we review computational tools developed in the last three years, as well as algorithms conceived over the last few decades but only recently applied in microbial ecology; we conclude with suggestions for computational tool developers who wish to help the field continue its rapid pace of development over the next few years.
Microbial diversity analysis tools
As 16S rRNA and shotgun metagenomic datasets grow dramatically, the need for easily accessible, well-documented and well-tested tools in the form of a pipeline becomes increasingly critical. In particular, the complexity of what is considered a ‘standard’ analysis has increased rapidly, from small trees and pie charts to advanced analyses incorporating multivariate statistics, machine learning, and, increasingly, explicitly spatial and/or temporal analysis (Figure 1). These new challenges, and
Summarizing and understanding microbial diversity
The democratization of sequencing technology allows researchers to sequence large numbers of samples from diverse environments [1••, 2••]. Large-scale collaborative projects have taken advantage of this possibility. For example, the Human Microbiome Project [23] sampled 250 individuals two to three times, in five main sites (the GI tract, the mouth, the vagina, the skin, and the nasal cavity), and the Earth Microbiome Project [24] will sequence up to 200 000 diverse environmental samples. A new
Conclusions
We are currently faced with daunting bioinformatics and computational challenges because of the large numbers of sequences and samples now examined in microbial ecology studies, which require the use of defined software engineering methods to create pipelines that are user-driven and well-tested. Although these pipelines integrate many different techniques for visualizing and understanding data, dimensionality reduction techniques such as PCoA have proven especially valuable for understanding
References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgements
We thank Greg Caporaso, Jesse Stombaugh and Meg Pirrung for assistance in creating Figure 1, and Jessica Metcalf for helpful comments and edits on the manuscript. The work described in this review was supported by the National Institutes of Health, the Bill and Melinda Gates Foundation, the Crohn's and Colitis Foundation of America, the Colorado Center for Biofuels and Biorefining and the Howard Hughes Medical Institute.
References (42)
- et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A (2011)
- et al. Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end Illumina reads. Appl Environ Microbiol (2011)
- et al. Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases. Proc Natl Acad Sci U S A (2007)
- et al. A core gut microbiome in obese and lean twins. Nature (2009)
- et al. Bacterial community variation in human body habitats across space and time. Science (2009)
- et al. Moving pictures of the human microbiome. Genome Biol (2011)
- R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria;...
- MATLAB. version 7.10.0 (R2010a) (2010)
- et al. vegan: Community Ecology Package (2009)
- et al. The ade4 package-II: Two-table and K-table Methods. R News (2007)
- APE: analyses of phylogenetics and evolution in R language. Bioinformatics
- Host-bacterial mutualism in the human intestine. Science
- Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol
- Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa. Proc Natl Acad Sci U S A
- QIIME allows analysis of high-throughput community sequencing data. Nat Methods
- Diversity and dynamics of rare and of resident bacterial populations in coastal sands. ISME J
- The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics
- SmashCommunity: a metagenomic annotation and analysis tool. Bioinformatics
- CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics