Advancing analytical algorithms and pipelines for billions of microbial sequences

doi:10.1016/j.copbio.2011.11.028

Current Opinion in Biotechnology

Volume 23, Issue 1, February 2012, Pages 64-71

https://doi.org/10.1016/j.copbio.2011.11.028

The vast number of microbial sequences resulting from sequencing efforts using new technologies requires us to re-assess currently available analysis methodologies and tools. Here we describe trends in the development and distribution of software for analyzing microbial sequence data. We then focus on one widely used set of methods, dimensionality reduction techniques, which allow users to summarize and compare these vast datasets. We conclude by emphasizing the utility of formal software engineering methods for the development of computational biology tools, and the need for new algorithms for comparing microbial communities. Such large-scale comparisons will allow us to fulfill the dream of rapid integration and comparison of microbial sequence data sets, in a replicable analytical environment, in order to describe the microbial world we inhabit.

Highlights

► New technologies permit dramatic increases in number of sequences/sites per study.
► Multisite spatial/temporal studies with hundreds of millions of sequences are possible.
► New software pipelines are required to analyze these vast datasets.
► Field is increasingly moving from ad hoc scripts towards integrated pipelines.
► Multivariate techniques often introduce computational bottlenecks.
► Cloud computing and improved approximation methods are needed to avoid bottlenecks.

Introduction

Recent innovations in sequencing technologies have allowed microbial ecologists to advance from analyzing a few hundred sequences per study to hundreds of millions [1••, 2••]. These quantitative differences in the amount of sequence data produce qualitative differences in the types of studies that can be performed. For example, 10 years ago, characterization of a single clone library from a single body site in one subject represented a substantial advance in knowledge about the human body. A few years ago, quantifying interpersonal differences in one body site, for example, the gut, represented a major advance [3, 4]. Three years ago, a multi-site microbial scan of the body showed how the microbial communities that live on the same person's body are clearly separated by body site, primarily skin, mouth, and stool [5]. Now, with higher throughput sequencing technologies, we can observe the dynamics of the human microbiota across multiple sites and individuals through time, demonstrating that our microbial guests are highly volatile day-to-day even in healthy adults [6••]. These examples also illustrate the daunting analytical challenges that microbial researchers face in handling datasets that are ever increasing in size. These challenges range from simply finding the right hypotheses to test, to finding the correct analytical tools and computational power to test them, to finding the methods for visualizing the key results. Here we review computational tools developed in the last three years and algorithms conceived over the last few decades, but only recently applied in microbial ecology; we conclude with suggestions for computational tool developers who wish to help the field continue its rapid pace of development over the next few years.

Section snippets

Microbial diversity analysis tools

As 16S rRNA and shotgun metagenomic datasets grow dramatically, the need for easily accessible, well-documented and well-tested tools in the form of a pipeline becomes increasingly critical. In particular, the complexity of what is considered a ‘standard’ analysis has increased rapidly, from small trees and pie charts to advanced analyses incorporating multivariate statistics, machine learning, and, increasingly, explicitly spatial and/or temporal analysis (Figure 1). These new challenges, and

Summarizing and understanding microbial diversity

The democratization of sequencing technology allows researchers to sequence large numbers of samples from diverse environments [1••, 2••]. Large-scale collaborative projects have taken advantage of this possibility. For example, the Human Microbiome Project [23] sampled 250 individuals two to three times, in five main sites (the GI tract, the mouth, the vagina, the skin, and the nasal cavity), and the Earth Microbiome Project [24] will sequence up to 200 000 diverse environmental samples. A new

Conclusions

We are currently faced with daunting bioinformatics and computational challenges because of the large numbers of sequences and samples now examined in microbial ecology studies, which require the use of defined software engineering methods to create pipelines that are user-driven and well-tested. Although these pipelines integrate many different techniques for visualizing and understanding data, dimensionality reduction techniques such as PCoA have proven especially valuable for understanding
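As an illustration of the dimensionality reduction mentioned above, the following is a minimal sketch of principal coordinates analysis (PCoA, classical multidimensional scaling) applied to a Bray-Curtis dissimilarity matrix. It is not taken from the review or from any of the pipelines it cites: the function names `bray_curtis` and `pcoa` are hypothetical, the choice of Bray-Curtis is an assumption made for the example, and only NumPy is used. The full eigendecomposition shown here scales roughly cubically with the number of samples, which is the kind of bottleneck that motivates approximation methods for very large studies.

```python
import numpy as np


def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between the rows (samples) of a count table."""
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[0]
    dist = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            shared_total = (counts[i] + counts[j]).sum()
            difference = np.abs(counts[i] - counts[j]).sum()
            # Two empty samples are treated as identical (illustrative choice).
            value = difference / shared_total if shared_total > 0 else 0.0
            dist[i, j] = dist[j, i] = value
    return dist


def pcoa(dist, n_axes=2):
    """Classical PCoA: return sample coordinates and the share of variation per axis."""
    dist = np.asarray(dist, dtype=float)
    n_samples = dist.shape[0]
    # Gower double-centering of the squared distance matrix.
    centering = np.eye(n_samples) - np.ones((n_samples, n_samples)) / n_samples
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh is appropriate because the centered matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean dissimilarities can yield small negative eigenvalues; clip them.
    kept = np.clip(eigvals[:n_axes], 0.0, None)
    coords = eigvecs[:, :n_axes] * np.sqrt(kept)
    positive_total = eigvals[eigvals > 0].sum()
    explained = kept / positive_total if positive_total > 0 else np.zeros_like(kept)
    return coords, explained
```

Given a table with one row per sample and one column per taxon, `pcoa(bray_curtis(table))` returns one low-dimensional point per sample; samples with similar community composition land close together, which is what makes such plots useful for summarizing many samples at once.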

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

• of special interest
•• of outstanding interest

Acknowledgements

We thank Greg Caporaso, Jesse Stombaugh and Meg Pirrung for assistance in creating Figure 1, and Jessica Metcalf for helpful comments and edits on the manuscript. The work described in this review was supported by the National Institutes of Health, the Bill and Melinda Gates Foundation, the Crohn's and Colitis Foundation of America, the Colorado Center for Biofuels and Biorefining and the Howard Hughes Medical Institute.

References (42)

J.G. Caporaso et al.
Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample
Proc Natl Acad Sci U S A
(2011)
A.K. Bartram et al.
Generation of multimillion-sequence 16S rRNA gene libraries from complex microbial communities by assembling paired-end illumina reads
Appl Environ Microbiol
(2011)
D.N. Frank et al.
Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases
Proc Natl Acad Sci U S A
(2007)
P.J. Turnbaugh et al.
A core gut microbiome in obese and lean twins
Nature
(2009)
E.K. Costello et al.
Bacterial community variation in human body habitats across space and time
Science
(2009)
J.G. Caporaso et al.
Moving pictures of the human microbiome
Genome Biol
(2011)
R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria;...
MATLAB. version 7.10.0 (R2010a)
(2010)
J. Oksanen et al.
vegan: Community Ecology Package
(2009)
S. Dray et al.
The ade4 package-II: Two-table and K-table Methods
R News
(2007)

E. Paradis et al.
APE: analyses of phylogenetics and evolution in R language
Bioinformatics
(2004)
F. Backhed et al.
Host-bacterial mutualism in the human intestine
Science
(2005)
P.D. Schloss et al.
Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities
Appl Environ Microbiol
(2009)
C. De Filippo et al.
Impact of diet in shaping gut microbiota revealed by a comparative study in children from Europe and rural Africa
Proc Natl Acad Sci U S A
(2010)
J.G. Caporaso et al.
QIIME allows analysis of high-throughput community sequencing data
Nat Methods
(2010)
Sogin M, Welch DM: VAMPS: Visualization and Analysis of Microbial Population Structure. 2008. [cited]; Available from:...
A. Gobet et al.
Diversity and dynamics of rare and of resident bacterial populations in coastal sands
ISME J
(2011)
F. Meyer et al.
The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes
BMC Bioinformatics
(2008)
M. Arumugam et al.
SmashCommunity: a metagenomic annotation and analysis tool
Bioinformatics
(2010)
S. Angiuoli et al.
CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing
BMC Bioinformatics
(2011)
P. Abrahamsson et al.
(2002)
