Date of Award

Spring 1-1-2017

Document Type


Degree Name

Doctor of Philosophy (PhD)


Computer Science

First Advisor

Robin Dowell

Second Advisor

Debra S. Goldberg

Third Advisor

Larry E. Hunter

Fourth Advisor

Jordan Boyd-Graber

Fifth Advisor

Daniel Weaver


This thesis examines the role of computational methods in identifying the motifs recognized by RNA binding proteins (RBPs). RBPs play an important role in post-transcriptional regulation and identify their targets in a highly specific fashion through recognition of primary sequence, secondary structure, or both, which makes predicting their binding sites a complex problem.

I applied the existing k-spectrum kernel method with a support vector machine and verified the published binding sites of two RBPs: Human antigen R (HuR) and Tristetraprolin (TTP). These RBPs have opposing effects on the bound messenger RNA (mRNA) transcript but similar binding preferences. Additional feature engineering highlighted the U-rich binding preference of HuR and the AU-rich binding preference of TTP. I extended the k-spectrum kernel method to incorporate domain adaptation, which identified binding preferences specific to each RBP as well as binding sites shared by both. The predictions correctly recovered U/CU-rich k-mers specific to HuR and AU-rich k-mers specific to TTP, and exposed the k-mers shared by the two RBPs.
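
For readers unfamiliar with the k-spectrum kernel referred to above, the following is a minimal sketch of the general technique, not the implementation used in the thesis. It represents each sequence by its counts of overlapping k-mers and takes the inner product of two such count vectors. The function names `spectrum` and `spectrum_kernel` and the example sequences are illustrative assumptions.

```python
from collections import Counter


def spectrum(seq, k):
    """Count every overlapping k-mer in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """k-spectrum kernel: inner product of the k-mer count vectors of a and b."""
    sa, sb = spectrum(a, k), spectrum(b, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * sb[kmer] for kmer, count in sa.items())


if __name__ == "__main__":
    # Illustrative sequences only; they are not taken from the thesis data.
    print(spectrum_kernel("AUUUA", "UUUUU", 3))
```

A kernel matrix built from such pairwise values could then be passed to any support vector machine implementation that accepts precomputed kernels.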

To assess the performance of computational methods used to predict RBP targets, I developed a Python framework called PySG. The framework generates custom, biologically relevant datasets by embedding either primary sequence-based or secondary structure-based motifs, and then benchmarks computational methods on the generated sequences. To test-drive PySG, the k-spectrum kernel method and Discriminative Regular Expression Motif Elicitation (DREME) were included in the framework and benchmarked on generated datasets. The benchmarking results demonstrate that the k-spectrum kernel method correctly identifies primary sequence binding sites, whereas DREME performs better when the binding sites involve secondary structure. The k-spectrum kernel considers only k-mers, which are indicative of primary sequence, while DREME allows wildcard characters when identifying motifs, which implicitly captures the covariation present in secondary structure binding sites. The novelty of the framework is its ability to generate sequences that mimic various biologically relevant conditions. This allows bioinformaticians to select the right computational method for the type of dataset they have on hand, making PySG a valuable contribution to the field of bioinformatics.
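
The idea of generating benchmark data by embedding a motif can be illustrated with a short sketch. This is not PySG's API: the names `embed_motif` and `make_dataset`, the uniform random background, and the example motif are all assumptions made for illustration, and only the primary sequence case is covered.

```python
import random


def embed_motif(length, motif, rng, alphabet="ACGU"):
    """Return a random RNA sequence with motif planted at a random position."""
    background = [rng.choice(alphabet) for _ in range(length)]
    start = rng.randrange(length - len(motif) + 1)
    # Overwrite a window of the background with the motif (length is unchanged).
    background[start:start + len(motif)] = motif
    return "".join(background), start


def make_dataset(n, length, motif, seed=0):
    """Build n positive (motif-bearing) and n negative (background-only) sequences."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    positives = [embed_motif(length, motif, rng)[0] for _ in range(n)]
    negatives = ["".join(rng.choice("ACGU") for _ in range(length)) for _ in range(n)]
    return positives, negatives


if __name__ == "__main__":
    # "UAUUUAU" is an illustrative AU-rich motif, not one reported in the thesis.
    pos, neg = make_dataset(n=5, length=40, motif="UAUUUAU")
    print(pos[0])
```

In this simplified version a negative sequence can contain the motif by chance; a realistic generator would need to control for that and for background composition.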