Date of Award

Spring 1-1-2011

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

First Advisor

Robin Dowell

Second Advisor

Mike Yarus

Third Advisor

Rob Knight

Fourth Advisor

Kevin Jones

Fifth Advisor

Norm Pace

Abstract

Studies of microbial communities, including those found on and within humans and those found in both natural and engineered environments, have revealed the enormous levels of diversity contained within those communities. The vast majority of this diversity cannot be observed using cultivation-based techniques. However, advances in DNA sequencing technology have created the opportunity to survey microbial diversity in unprecedented detail, through direct sequencing of the small ribosomal subunit rRNA gene. Modern datasets from a single study may contain hundreds of thousands to millions of 16S rRNA sequences, drawn from hundreds to thousands of biological samples. Such sequences are obtained without the biases inherent in culture-dependent methods, and typically include many sequences representing undescribed and uncharacterized species. The ability to obtain such extensive data relatively easily and inexpensively has revealed important constraints in our ability to detect patterns in these increasingly large and complex datasets, and to relate community composition to measures of human health or environmental function.

To facilitate the analysis of sequence-based community ecology surveys by researchers such as myself, I (in collaboration with others) developed a software tool entitled Quantitative Insights Into Microbial Ecology (QIIME). QIIME's extensive testing validates the analyses it performs, and its scalability guarantees its continued usefulness despite the trend towards studies of ever larger numbers of sequences and biological samples.

In addition, I embarked on investigations of the most appropriate methods for analyzing such data, and the quantity of DNA sequences and biological samples required. I tested a large set of commonly used measures of microbial community resemblance for their efficacy on data typical of large-scale microbial ecology studies. By applying the community resemblance measures to a combination of empirical results, as well as simulated results generated with a computational framework I designed, I was able to identify measures that are most useful, and the conditions under which they are most applicable.

The extent of sequencing required in community ecology studies in order to have confidence in the conclusions drawn from that data remains an open question, and is dependent on features of the particular communities often not known in advance, as well as the specific research goals. Although researchers with finite budgets must always grapple with the tradeoff between more biological samples and deeper sequencing of fewer biological samples, I show that in many instances deeper sequencing, or obtaining larger numbers of sequences per sample, is of limited use, and a fixed sequencing budget is better applied to acquiring and sequencing more biological samples.

Lastly, I investigated the effects of incomplete sequencing of microbial communities, and the effects that incomplete data has on estimates of the resemblance (β diversity) of those communities. To address the longstanding issue in microbial community ecology that comparisons of only limited samples of microbial communities are frequently biased estimates of the true resemblance of the full communities, I have developed a measure of community resemblance that is relatively unbiased when applied to incompletely sequenced microbial communities.
