Doctor of Philosophy (PhD)
Short snippets of written text play a central role in our day-to-day communication: SMS and email messages, news headlines, tweets, and image captions are some of the many forms in which we see them used every day. Natural language processing (NLP) techniques provide the means to process such textual data automatically at scale, supporting key applications in areas like education, law, healthcare, and security. This dissertation explores automatic identification of semantic similarity for short text: given two snippets, neither longer than a few sentences, the goal is to develop algorithms that can quantify the degree of their semantic similarity.
Short text similarity (STS) is an important problem in contemporary NLP, with applications in numerous real-life tasks. In academic tests, for example, student responses to short-answer questions can be automatically graded based on their semantic similarity with expert-provided correct answers to those questions. Automatic question answering (QA) is another example, where textual similarity with the question is used to evaluate candidate answer snippets retrieved from larger documents.
Semantic analysis of short text, however, is a challenging task: complex human expressions can be encoded in just a few words, and sentences that look quite different on the surface can express very similar meanings. This research contributes to the automatic identification of STS through the development and application of algorithms that align semantically similar concepts in the two snippets. The proposed STS algorithms are applied to the real-life tasks of short answer grading and question answering, and they achieve state-of-the-art results in their respective tasks.
In view of the high utility of STS, statistical domain adaptation techniques are also explored for the proposed STS algorithms. Given training examples from different domains, these techniques enable (1) joint learning of per-domain parameters (i.e., a separate set of model parameters for each domain), and (2) inductive transfer among the domains for a supervised STS model. Across text from different sources and applications, domain adaptation improves the overall performance of the proposed STS models.
Sultan, Md Arafat, "Short-Text Semantic Similarity: Algorithms and Applications" (2016). Computer Science Graduate Theses & Dissertations. 121.