Date of Award

Spring 1-1-2012

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

James H. Martin

Second Advisor

Dan Jurafsky

Third Advisor

Peter Norvig

Abstract

Deception is a pervasive psycholinguistic phenomenon, ranging from lies during legal trials to fabricated online reviews. Its identification has been studied for centuries, from the ancient Chinese method of spitting dry rice to the modern polygraph. The recent proliferation of deceptive online reviews has increased the need for automatic deception filtering systems. Although humans generally perform at chance level when detecting deception, previous research suggests that the linguistic signals resulting from conscious deception are sufficient for building automatic systems capable of distinguishing deceptive documents from truthful ones. Our interest is in identifying the invariant traits of deception in text, and we argue that these encouraging results in automatic deception detection are mainly due to the side effects of corpus-specific features. This does no harm to practical applications, but it does not foster a deeper investigation of deception. To demonstrate this and to allow researchers and practitioners to share results, we have developed the largest publicly available shared multidimensional deception corpus for online reviews, the BLT-C (Boulder Lies and Truths Corpus). In an attempt to overcome the inherent lack of ground truth, we have also developed a set of semi-automatic techniques to ensure corpus validity. This thesis shows that detecting deception using supervised machine learning methods is brittle. Experiments conducted using this corpus show that accuracy changes across different kinds of deception (e.g., lying vs. fabrication) and across text content dimensions (e.g., sentiment), demonstrating the limitations of previous studies. Preliminary results confirm a statistical separation between fabricated and truthful reviews (although not as large as in other studies), but we do not observe any separation between truths and lies, which suggests that lying is a much more difficult class of deception to identify than fabricated spam reviews.
