Date of Award
Master of Science (MS)
Kevin B. Cohen
James H. Martin
Due to recent advances in unsupervised language processing methods, it is now possible to use large unannotated corpora, such as the biomedical PubMed Central Open Access subset (PMC-OA), to generate word embeddings, which can then be used for many natural language tasks such as text classification and named entity recognition. In this study, we carry out a thorough investigation of word embeddings generated from PMC-OA using the word2vec model. We investigate their quality and explore the domain-specific challenges. We perform three tests (odd-one-out, word similarity, and word analogy) and compare embedding quality on human-curated gold-standard datasets. We also compare parameter settings and report results on test accuracy, training time, and computational resources. Our results show that domain-specific training is significant for generating quality word embeddings, and that although increasing the text window and dimension size improves accuracy, beyond a certain point the gain becomes insignificant while the time complexity continues to increase linearly.
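The three tests named in the abstract can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the vectors below are hypothetical 5-dimensional toy values invented for illustration (real word2vec embeddings would be learned from PMC-OA), the helper names `similarity`, `odd_one_out`, and `analogy` are assumptions, and the analogy test is assumed to use the common vector-offset formulation.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
# Real embeddings would be trained with word2vec on a corpus such as PMC-OA.
VECTORS = {
    "aspirin":   [1.0, 0.0, 0.9, 0.0, 0.0],
    "ibuprofen": [1.0, 0.0, 1.0, 0.0, 0.0],
    "metformin": [1.0, 0.0, 0.0, 1.0, 0.0],
    "pain":      [0.0, 1.0, 1.0, 0.0, 0.0],
    "diabetes":  [0.0, 1.0, 0.0, 1.0, 0.0],
    "femur":     [0.0, 0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity(w1, w2):
    """Word similarity test: score a word pair by cosine similarity."""
    return cosine(VECTORS[w1], VECTORS[w2])


def odd_one_out(words):
    """Odd-one-out test: return the word least similar, on average, to the rest."""
    def mean_similarity(word):
        others = [o for o in words if o != word]
        return sum(similarity(word, o) for o in others) / len(others)
    return min(words, key=mean_similarity)


def analogy(a, b, c):
    """Word analogy test (a : b :: c : ?) using the vector offset b - a + c."""
    target = [y - x + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(similarity("aspirin", "ibuprofen") > similarity("aspirin", "metformin"))
print(odd_one_out(["aspirin", "ibuprofen", "metformin", "femur"]))  # femur
print(analogy("ibuprofen", "pain", "metformin"))                    # diabetes
```

In an evaluation against a gold-standard dataset, each test item's predicted answer (or similarity score) would be compared with the human-curated one to produce an accuracy or correlation figure.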
Wadhawan, Kahini, "Investigation of Word Representation Methods for Biomedical Domain" (2016). Computer Science Graduate Theses & Dissertations. 132.
Available for download on Sunday, September 27, 2020