Date of Award

Spring 1-1-2016

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Lawrence Hunter

Second Advisor

Kevin B. Cohen

Third Advisor

James H. Martin

Abstract

Owing to recent advances in unsupervised language processing methods, it is now possible to use large unannotated corpora, such as the biomedical PubMed Central Open Access subset (PMC-OA), to generate word embeddings, which can in turn be used for many natural language tasks such as text classification and named entity recognition. In this study, we carry out a thorough investigation of word embeddings generated from PMC-OA using the word2vec model. We investigate their quality and explore the domain-specific challenges. We perform three tests (odd-one-out, word similarity, and word analogy) and compare embedding quality against human-curated gold-standard datasets. We also compare parameter settings and report results on test accuracy, training time, and computational resources. Our results show that domain-specific training is important for generating high-quality word embeddings. Accuracy improves as the context window and vector dimension increase, but beyond a certain point the gain becomes insignificant, while training time continues to grow linearly.
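As an illustration of the three tests named above, the following is a minimal sketch and not the thesis's actual evaluation code. It assumes cosine similarity as the similarity measure, vector offset (b - a + c) for analogies, and Spearman rank correlation for comparison with gold-standard scores; none of these choices is stated in the abstract. The toy vectors, the gold scores, and all function names are hypothetical.

```python
import math

# Hypothetical 4-dimensional toy embeddings (not taken from PMC-OA).
emb = {
    "aspirin":   [1.0, 0.0, 0.0, 0.1],
    "ibuprofen": [0.9, 0.1, 0.0, 0.1],
    "naproxen":  [0.95, 0.0, 0.1, 0.0],
    "heart":     [0.0, 1.0, 0.0, 0.0],
    "cardiac":   [0.0, 1.0, 0.0, 1.0],
    "kidney":    [0.0, 0.0, 1.0, 0.0],
    "renal":     [0.0, 0.0, 1.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(words, emb):
    """Return the word with the lowest mean similarity to the others."""
    def mean_sim(w):
        others = [o for o in words if o != w]
        return sum(cosine(emb[w], emb[o]) for o in others) / len(others)
    return min(words, key=mean_sim)

def analogy(a, b, c, emb):
    """Solve a : b :: c : ? by vector offset, excluding the query words."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

def spearman(xs, ys):
    """Spearman rank correlation; assumes no tied values for simplicity."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((p - q) ** 2 for p, q in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical human similarity judgments for three word pairs.
gold = [("aspirin", "ibuprofen", 9.0),
        ("heart", "cardiac", 7.5),
        ("aspirin", "kidney", 1.0)]

odd = odd_one_out(["aspirin", "ibuprofen", "naproxen", "kidney"], emb)
answer = analogy("heart", "cardiac", "kidney", emb)
model_scores = [cosine(emb[w1], emb[w2]) for w1, w2, _ in gold]
rho = spearman(model_scores, [score for _, _, score in gold])
```

With these toy vectors, `odd` is `"kidney"`, `answer` is `"renal"`, and `rho` is 1.0 because the model ranks the three pairs in the same order as the hypothetical gold scores. On real embeddings, each test would be run over a full gold-standard dataset and summarized as an accuracy or correlation.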

Available for download on Sunday, September 27, 2020
