Date of Award: 2015
Doctor of Philosophy (PhD)
Chemical & Biochemical Engineering
Tremendous advances in genetic and sequencing technology are enabling unprecedented insight into human disease, forensics, and cellular mechanisms, among other areas. Conclusions drawn from these studies are strongly influenced by the interpretation of their associated massive data sets. The goal of this thesis is to understand, develop, and apply algorithms that help overcome challenges common to ecological and biological sequencing studies: contamination, differences in sampling effort, a very large number of zeros, and compositionality.
We use simulations and experimental data to understand how different matrix normalization strategies mitigate the effects of these challenges on downstream analyses, particularly principal coordinate analysis (PCoA). PCoA is very useful to researchers as a summary of overall differences between the studied populations, e.g. case vs. control. To determine specifically which taxa differ between the studied populations, we focus on methods for differential expression/abundance testing. Our benchmarking of nonparametric and parametric models, designed to increase the power to detect rare taxa, leads to recommendations for which strategy to use depending on a specific data set’s properties. We apply these normalization and differential abundance detection guidelines in a forensics study of how carcass mass influences the resulting microbial community, which has implications for post-mortem interval calculation. We then move from studying changes in the abundance of individual taxa to changes in the abundance of multiple taxa, ultimately deriving how taxa inter-relate in pairwise and even higher-order interactions. Correlation analysis is critical because all microbial communities and biological systems are highly interconnected; however, correlations are especially adversely affected by sparsity and compositionality.
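As a rough illustration of why normalization matters before PCoA, the sketch below compares Bray-Curtis dissimilarities on raw counts and on proportions, then computes a classical PCoA. It is not the thesis's own pipeline: the toy count table, the choice of total-sum scaling, and the helper names `total_sum_scale`, `bray_curtis`, and `pcoa` are all assumptions made for this example.

```python
import numpy as np

def total_sum_scale(counts):
    """Convert each sample (row) of raw counts to proportions summing to 1."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def bray_curtis(x):
    """Pairwise Bray-Curtis dissimilarity between the rows of x."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.abs(x[i] - x[j]).sum() / (x[i] + x[j]).sum()
    return d

def pcoa(d, k=2):
    """Classical PCoA: double-center squared distances, then eigendecompose."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:k]          # keep the k largest eigenvalues
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.clip(vals, 0, None))  # negative eigenvalues -> 0

# Hypothetical toy table (rows = samples, columns = taxa).
# Samples 0 and 1 share one composition but differ 10x in library size.
counts = np.array([
    [10, 0, 30, 0, 60],
    [100, 0, 300, 0, 600],
    [0, 50, 0, 45, 5],
    [0, 40, 5, 50, 5],
])

raw_d = bray_curtis(counts)                     # dominated by sequencing depth
norm_d = bray_curtis(total_sum_scale(counts))   # depth effect removed
coords = pcoa(norm_d)                           # one row of coordinates per sample

print("raw      d(0,1):", raw_d[0, 1])
print("scaled   d(0,1):", norm_d[0, 1])
print("PCoA coordinates:\n", coords)
```

On raw counts, samples 0 and 1 appear very different purely because of sampling effort; after scaling to proportions they coincide. Total-sum scaling is only one of several normalization strategies, and it does not by itself address sparsity or compositionality.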
Finally, while this thesis focuses on improved analysis in the context of microbial communities, the same methodologies apply to any highly multi-dimensional and sparse high-throughput sequencing data set. In particular, we turn to data arising from individual microbes, such as those generated by the Trackable Multiplex Recombineering (TRMR) approach used in strain engineering. We adapt the TRMR approach from outdated microarray technology to high-throughput sequencing and integrate it with streamlined bioinformatics software. This approach can be used to study the bacterial response to any inhibitory chemical. Here we focus on the alleles contributing to antimicrobial resistance and susceptibility, and identify a unique allelic and proteomic fingerprint for each antibiotic. Collectively, we present advances towards addressing the major sequencing data set challenges of contamination, uneven library sizes, a plethora of zeros, and compositionality, and apply them to a wide range of topics in microbial ecology and biological engineering.
Weiss, Sophie J., "Improved Methods for Understanding Sparse, Multi-Dimensional, High Throughput Sequencing Data" (2015). Chemical & Biological Engineering Graduate Theses & Dissertations. 86.