Date of Award

Spring 1-1-2011

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Electrical, Computer & Energy Engineering

First Advisor

Shannon M. Hughes

Second Advisor

Peter Mathys

Third Advisor

Youjian Liu

Abstract

The quantity of information in the world is soaring. We live in an information age in which abundant sources generate data continuously. While this data has the potential to transform every aspect of our lives, it is very difficult to analyze such a huge volume and draw inferences from it. One example of this data deluge is the massive amount of text generated daily by sources such as newspapers, blogs, tweets and other social network posts, emails, research papers, product descriptions and reviews, online discussion forums, digital libraries, and knowledge databases. This thesis tackles the problem of finding the hidden structure behind these collections of text documents and of organizing them so that this sea of textual data becomes easier to navigate.

Traditionally, the techniques used to classify documents by topic include probabilistic methods such as the naive Bayes classifier, margin-based learning techniques such as support vector machines, and statistical manifold learning methods such as the Fisher information non-parametric embedding. We believe that a set of documents can be represented by points in a high-dimensional space that lie on or near a low-dimensional manifold. Hence, manifold learning or dimensionality reduction techniques can help recover the underlying manifold and retrieve the inherent modes of variability in the set of documents, which in turn aids effective organization of those documents. Indeed, we find that many popular manifold learning methods perform well at organizing test datasets.
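
As a rough illustration of treating documents as points in a high-dimensional space, the sketch below builds normalized term-frequency vectors and computes pairwise Euclidean distances between them. The abstract does not say which document representation the thesis uses, so the whitespace tokenizer, the term-frequency weighting, and the function names here are assumptions made purely for illustration.

```python
from collections import Counter
from math import sqrt


def term_frequency_vectors(docs):
    """Map each document to a normalized term-frequency vector.

    Assumption: lowercase whitespace tokenization and plain term
    frequencies; the thesis may use a different representation.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # One dimension per distinct word, so the space is high-dimensional.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[word] / total for word in vocab])
    return vocab, vectors


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy corpus, not taken from the thesis.
docs = [
    "markets fall as banks report losses",
    "banks report record losses as markets fall",
    "team wins final match in extra time",
]
vocab, vectors = term_frequency_vectors(docs)
distances = [[euclidean_distance(u, v) for v in vectors] for u in vectors]
```

A distance matrix of this kind is the typical input to neighborhood-based manifold learning methods, which use the small entries to decide which documents count as local neighbors.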

We also propose a different view of the local similarity of documents and introduce the Earth Mover's distance as a local distance metric, in place of the Euclidean distance, for measuring the distance between documents. The manifold learning methods, modified to incorporate the Earth Mover's distance, improve the results, as expected. Finally, we show that spectral clustering promises to be a useful technique for text organization.
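
To give a feel for how the Earth Mover's distance can differ from the Euclidean distance, here is a minimal sketch for the simplest case: histograms of equal total mass over ordered, unit-spaced bins, where the Earth Mover's distance reduces to the sum of absolute differences between cumulative sums. This one-dimensional setting is an assumption chosen to keep the example short; the abstract does not specify the ground distance between terms used in the thesis, and a general ground distance would require solving a transportation problem instead.

```python
from itertools import accumulate
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two equal-length histograms."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def emd_1d(p, q):
    """Earth Mover's distance for the 1-D, unit-spaced special case.

    Assumption: bins are ordered on a line one unit apart and both
    histograms carry the same total mass. Under that assumption the
    minimum transport cost equals the sum of absolute differences
    between the running (cumulative) sums.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")
    cumulative_gap = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(gap) for gap in cumulative_gap)


# Hypothetical histograms: all mass sits in a single bin.
p = [1.0, 0.0, 0.0, 0.0]
q = [0.0, 1.0, 0.0, 0.0]  # mass one bin away from p
r = [0.0, 0.0, 0.0, 1.0]  # mass three bins away from p

same_euclidean = abs(euclidean(p, q) - euclidean(p, r)) < 1e-12
near = emd_1d(p, q)
far = emd_1d(p, r)
```

In this toy case the Euclidean distance cannot tell `q` and `r` apart relative to `p`, while the Earth Mover's distance grows with how far the mass has to move, which is the kind of sensitivity that motivates swapping the local metric.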
