Date of Award
Spring 1-1-2016
Document Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Lawrence Hunter
Second Advisor
Kevin B. Cohen
Third Advisor
James H. Martin
Abstract
Recent advances in unsupervised language processing methods have made it possible to use large unannotated corpora, such as the biomedical PubMed Central Open Access subset (PMC-OA), to generate word embeddings, which can then be used for many natural language tasks such as text classification and named entity recognition. In this study, we carry out a thorough investigation of word embeddings generated from PMC-OA using the word2vec model. We investigate their quality and explore the domain-specific challenges. We perform three tests (odd-one-out, word similarity, and word analogy) and evaluate embedding quality against human-curated gold-standard datasets. We also compare parameter settings and report test accuracies, training time, and computational resources. Our results show that domain-specific training is important for generating quality word embeddings, and that although accuracy improves as the text window and dimension size increase, beyond a certain point the gain is insignificant while the time complexity continues to increase linearly.
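As a rough illustration of the three tests named in the abstract, the sketch below implements one common formulation of each on a handful of toy vectors. Everything in it is an assumption made for illustration: the 4-dimensional vectors are invented (not trained on PMC-OA), the function names are hypothetical, the analogy test uses the vector-offset rule often paired with word2vec, and the similarity test is scored with a Spearman rank correlation that assumes no tied values. The thesis's actual datasets and scoring procedures are not described in this record.

```python
import math

# Toy 4-dimensional vectors, invented purely for illustration.
# Real embeddings would be trained on a corpus and have far more dimensions.
TOY_VECTORS = {
    "kidney":    [1.0, 0.0, 0.0, 0.0],
    "renal":     [1.0, 0.0, 1.0, 0.0],
    "liver":     [0.0, 1.0, 0.0, 0.0],
    "hepatic":   [0.0, 1.0, 1.0, 0.0],
    "aspirin":   [0.0, 0.0, 0.0, 1.0],
    "ibuprofen": [0.1, 0.0, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def odd_one_out(vectors, words):
    """Return the word with the lowest mean cosine similarity to the others."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)
    return min(words, key=mean_similarity)


def analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the vector-offset rule b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def spearman_no_ties(xs, ys):
    """Spearman rank correlation; valid only when neither list has ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))
```

On this toy data, `analogy(TOY_VECTORS, "kidney", "renal", "liver")` selects "hepatic" and `odd_one_out(TOY_VECTORS, ["kidney", "renal", "hepatic", "aspirin"])` selects "aspirin". For the word similarity test, the model's cosine scores for a list of word pairs would be passed to `spearman_no_ties` together with the human ratings for the same pairs; a gold-standard dataset with tied ratings would need a tie-aware correlation instead.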
Recommended Citation
Wadhawan, Kahini, "Investigation of Word Representation Methods for Biomedical Domain" (2016). Computer Science Graduate Theses & Dissertations. 132.
https://scholar.colorado.edu/csci_gradetds/132