Doctor of Philosophy (PhD)
James H Martin
This dissertation presents a framework for statistically modeling words and sentences, focusing on the role of context in learning semantic representations from a corpus. In recent years, approaches such as Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer et al., 1997) and probabilistic topic models such as LDA (Blei et al., 2003; Griffiths and Steyvers, 2002, 2007; Hofmann, 1999) have found success in the psycholinguistics community as theories of meaning and models of language understanding. They also serve as important components of information retrieval, machine translation, and document summarization systems, among other applications. However, sentences have a rich set of semantic and syntactic features that these models cannot represent accurately, because they rest on an order-independent bag-of-words assumption. This dissertation develops a model that takes syntagmatic and paradigmatic constraints into account and provides a better model of sentence processing.
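To make the order-independence point concrete, the following minimal Python sketch shows two sentences with opposite meanings collapsing to the same bag-of-words representation. The helper name `bag_of_words` and the example sentences are illustrative assumptions, not taken from the dissertation.

```python
from collections import Counter


def bag_of_words(sentence):
    """Return word counts for a sentence, discarding word order.

    Hypothetical helper for illustration: lowercases the text and
    splits on whitespace, with no further tokenization.
    """
    return Counter(sentence.lower().split())


first = "the dog bit the man"
second = "the man bit the dog"

# The two bags are identical, although the sentences mean different things.
print(bag_of_words(first) == bag_of_words(second))

# The ordered token sequences still differ; that is the information lost.
print(first.split() == second.split())
```

Any model that only sees the counts cannot tell who bit whom, which is the kind of limitation that motivates adding syntactic constraints.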
The Construction Integration II (CI-II) model of Kintsch and Mangalath (2010) is a cognitively plausible computational account of how language is acquired and stored as representations in long-term memory, which are then retrieved contextually to generate meaning in working memory. Semantic constraints are modeled using LSA, the Topics Model, and context co-occurrence probabilities. Syntactic constraints are modeled using n-grams and dependency grammars (De Marneffe et al., 2006; Collins, 1999; Covington, 2001; Eisner, 1996; Hall et al., 2004). In short, I show how text is structurally decomposed and combined with the comprehender's prior knowledge in order to understand the text. I also demonstrate how the expressiveness gained from explicitly modeling context leads to better word sense disambiguation.
This dissertation develops a metric based on tree edit distance (Bille, 2005; Kouylekov and Magnini, 2005), called Dependency Edit Distance, that structurally decomposes sentences into dependency relations and measures similarity in terms of the semantic and syntactic cost of transforming one sentence into the other. It further applies supervised machine learning techniques to these measures, computed between labelled pairs of sentences, to build models whose predictive accuracy matches that of human raters. The long-term goal of this research is to turn this model into software that helps students learn in an instructional environment capable of assessing their comprehension. I show data from two experiments in which student responses were automatically graded; the results show strong potential for such a practical application.
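As a rough illustration of the underlying idea, the sketch below computes a plain ordered-tree edit distance between two small dependency trees, using the standard rightmost-root recursion surveyed by Bille (2005). It is not the dissertation's Dependency Edit Distance: every operation here has unit cost, whereas the metric described above weights operations by semantic and syntactic cost. The function names, the `relation:word` node labels, and the example trees are all assumptions made for this sketch.

```python
from functools import lru_cache


def tree(label, *children):
    """Build an immutable tree as (label, tuple_of_child_trees)."""
    return (label, tuple(children))


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, kids = f[-1]
        return forest_distance(f[:-1] + kids, g) + 1
    if not f:
        # Insert the rightmost root of g.
        _, kids = g[-1]
        return forest_distance(f, g[:-1] + kids) + 1

    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + v_kids, g) + 1
    insert = forest_distance(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, relabelling if their labels differ.
    match = (
        forest_distance(v_kids, w_kids)
        + forest_distance(f[:-1], g[:-1])
        + (0 if v_label == w_label else 1)
    )
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Edit distance between two single trees."""
    return forest_distance((t1,), (t2,))


# Hypothetical dependency trees for "dog bites man" and "man bites dog".
dog_bites_man = tree("root:bites", tree("nsubj:dog"), tree("dobj:man"))
man_bites_dog = tree("root:bites", tree("nsubj:man"), tree("dobj:dog"))

print(tree_edit_distance(dog_bites_man, dog_bites_man))  # 0
print(tree_edit_distance(dog_bites_man, man_bites_dog))  # 2
```

Because each node label carries its dependency relation, swapping subject and object costs two relabel operations even though the two sentences contain exactly the same words. Replacing the unit costs with similarity-based costs would be the natural next step toward a metric of the kind the abstract describes.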
Mangalath, Praful, "The Construction of Meaning: The role of context in corpus-based approaches to language modeling" (2010). Computer Science Graduate Theses & Dissertations. 10.