Date of Award

Spring 1-1-2012

Document Type


Degree Name

Doctor of Philosophy (PhD)


Applied Mathematics

First Advisor

Manuel E. Lladser

Second Advisor

Rob Knight

Third Advisor

Jem Corcoran


We study an ensemble of urns with unknown compositions inferred from initial samples with replacement from each urn. This model fits diverse situations. For instance, in microbial ecology studies each urn represents an environment, each ball within an urn corresponds to an individual bacterium, and a ball's color represents its taxonomic label. In a different context, each urn could represent a random RNA pool and each colored ball a possible solution to a particular binding site problem over that pool. The main parameter of this study is dissimilarity, which we define as the probability that a draw from one urn is not seen in a sample of size k from a possibly different urn. We estimate this parameter with a U-statistic, shown to be the uniformly minimum variance unbiased estimator (UMVUE) of dissimilarity over a range of k determined by initial sample sizes. Furthermore, despite the non-Markovian nature of our estimator when applied sequentially over k, we provide conditions that guarantee uniformly consistent estimates of variances via a jackknife method, and show uniform convergence in probability as well as approximately normal marginal distributions. We apply our U-statistics and a restricted exponential regression to extrapolate dissimilarity over a range beyond that determined by initial sample sizes, which we use to identify an allocation of draws for subsequent sampling that minimizes a measure of pairwise dissimilarities over the whole ensemble. This is motivated by the challenge faced by microbiome projects worldwide to effectively allocate additional samples for a more robust and reliable estimation of UniFrac distances between pairs of environments. Similar methods are applied to measures of sample quality of the ensemble derived from alpha-diversity and coverage.
We test our methods against simulated data, where we compare optimal and inferred draw allocations when considering these three measures, and analyze 16S ribosomal RNA data from the Human Microbiome Project.

Matlab code to execute Algorithms (7 kB)
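
For readers who want a concrete picture of the dissimilarity parameter defined in the abstract, the following is a minimal Python sketch of one natural U-statistic for it. It is not the thesis's Matlab code and is not claimed to match its algorithms. The function name `dissimilarity_estimate` and its arguments are hypothetical, and the kernel is assumed to be the indicator that a draw from the first urn does not appear among k draws from the second urn. Averaging that indicator over every draw in the first sample and every size-k subset of the second sample gives a closed form in terms of binomial coefficients, valid only for k no larger than the second sample size.

```python
from collections import Counter
from math import comb


def dissimilarity_estimate(x_sample, y_sample, k):
    """Estimate P(a draw from urn X is not seen in k draws from urn Y).

    Illustrative sketch (an assumption, not the thesis's implementation):
    averages the indicator kernel 1[x not in {y_1, ..., y_k}] over all
    draws x in x_sample and all size-k subsets of y_sample.
    """
    n = len(x_sample)
    m = len(y_sample)
    if n == 0:
        raise ValueError("x_sample must be non-empty")
    if not 1 <= k <= m:
        raise ValueError("k must satisfy 1 <= k <= len(y_sample)")

    y_counts = Counter(y_sample)   # how often each color occurs in y_sample
    total_subsets = comb(m, k)     # number of size-k subsets of y_sample

    # For a draw of color c, the subsets that avoid c are chosen from the
    # m - y_counts[c] balls of other colors; comb returns 0 if too few remain.
    avoiding = sum(comb(m - y_counts[c], k) for c in x_sample)
    return avoiding / (n * total_subsets)


if __name__ == "__main__":
    x = ["a", "b"]        # colors drawn from the first urn
    y = ["a", "a", "c"]   # colors drawn from the second urn
    for k in range(1, len(y) + 1):
        print(k, dissimilarity_estimate(x, y, k))
```

Because the samples are drawn with replacement, each term in the average has expectation equal to the dissimilarity at k, so the average is unbiased in this range. Values of k beyond the second sample size are outside the sketch's scope, which is where the abstract's regression-based extrapolation comes in.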