Undergraduate Honors Theses

Thesis Defended

Spring 2013

Document Type



Applied Mathematics

First Advisor

Dr. Rob Knight


Hundreds of studies have addressed whether the presence or absence of certain bacteria is linked with a particular phenotype. However, it is plausible that the causative agent (or the consequence) of a given phenotype is not a single type of microbe, but groups of them, perhaps in specific combinations. Rule Induction is a commonly used machine learning method that infers structure within observational data and builds rules to represent that structure. In this thesis I introduce the application of Rule Induction to infer co-occurrence patterns in microbial data. First, I benchmark the methods within Rule Induction to assess how rules are generated with regard to several parameters, such as table density, support, and confidence. I then subsample the data over multiple iterations to assess how robust the generated rules are to variation due to sampling. Next, I provide insight into different biological variables and examine their effect on the rules produced. I compare 16S rRNA regions, specifically the V1-3 and V3-5 regions. I compare sequencing technologies, specifically 454 and Illumina. Finally, I compare samples across time, specifically over a time frame of 400 days. Within all these comparisons I aim to understand the differences in the generated rules, but more importantly what is conserved, when the samples are stratified by these variables. Finally, I explore Rule Induction using two microbial datasets and compare the rules to already-known associations. The first dataset I interpret identifies a correlation between HIV and the Gut Microbiome. The second dataset distinguishes the Gut Microbiome across varying geographical locations. I link the rules produced from each dataset with taxonomic information and consolidate those rules to reveal the underlying structure within the biological data.
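As an illustration of the support and confidence parameters mentioned in the abstract, the sketch below computes both for pairwise co-occurrence rules on a small presence/absence table. It uses the standard association-rule definitions and is not the thesis's own implementation; the sample data, the function names (`support`, `confidence`, `pairwise_rules`), and the threshold values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical presence/absence table: each sample is the set of taxa observed in it.
samples = [
    {"Bacteroides", "Prevotella", "Faecalibacterium"},
    {"Bacteroides", "Faecalibacterium"},
    {"Prevotella", "Faecalibacterium"},
    {"Bacteroides", "Faecalibacterium"},
    {"Bacteroides"},
]


def support(itemset, samples):
    """Fraction of samples that contain every taxon in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= s for s in samples) / len(samples)


def confidence(antecedent, consequent, samples):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    denom = support(antecedent, samples)
    if denom == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), samples) / denom


def pairwise_rules(samples, min_support=0.4, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence) for single-taxon
    rules that meet both (illustrative) thresholds."""
    taxa = sorted(set().union(*samples))
    rules = []
    for a, b in combinations(taxa, 2):
        # Each pair yields two candidate rules, one per direction.
        for ante, cons in ((a, b), (b, a)):
            s = support({ante, cons}, samples)
            c = confidence({ante}, {cons}, samples)
            if s >= min_support and c >= min_confidence:
                rules.append((ante, cons, s, c))
    return rules


if __name__ == "__main__":
    for ante, cons, s, c in pairwise_rules(samples):
        print(f"{ante} -> {cons}: support={s:.2f}, confidence={c:.2f}")
```

In this toy table, a rule such as "Prevotella implies Faecalibacterium" passes because every sample containing Prevotella also contains Faecalibacterium, while the reverse rule fails the confidence threshold. Raising or lowering the thresholds changes how many rules survive, which is the kind of parameter sensitivity the benchmarking step examines.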