Optimization of Natural Language Processing Components for Robustness and Scalability

Choi, Jinho D.

Graduate Thesis Or Dissertation

Citations

Citeable URL: https://scholar.colorado.edu/concern/graduate_thesis_or_dissertations/vq27zn63h

Abstract

This thesis focuses on the optimization of NLP components for robustness and scalability. Three kinds of NLP components are used in our experiments: a part-of-speech tagger, a dependency parser, and a semantic role labeler. For part-of-speech tagging, dynamic model selection is introduced. Our dynamic model selection approach builds two models, a domain-specific model and a generalized model, and selects one of them during decoding by comparing the similarity between the lexical items used to build these models and the input sentences. As a result, it gives robust tagging accuracy across corpora and tags quickly. For dependency parsing, a new transition-based parsing algorithm and a bootstrapping technique are introduced. Our parsing algorithm learns both projective and non-projective transitions, so it can generate both projective and non-projective dependency trees, yet it parses in linear time on average. Our bootstrapping technique bootstraps parse information used as features for transition-based parsing and significantly improves parsing accuracy. For semantic role labeling, a conditional higher-order argument pruning algorithm is introduced. The higher-order pruning algorithm improves the coverage of argument candidates and raises the overall F1-score. The conditional higher-order pruning algorithm also noticeably reduces average labeling complexity with minimal reduction in F1-score. For all experiments, two sets of training data are used: one from the Wall Street Journal corpus and the other from the OntoNotes corpora. All components are evaluated on 9 different genres, which are grouped separately for in-genre and out-of-genre experiments. Our experiments show that our approach gives higher accuracy than other state-of-the-art NLP components and runs fast, taking about 3-4 milliseconds per sentence to process all three components.
All components are publicly available as an open source project called ClearNLP. We believe that this project is beneficial for many NLP tasks that need to process large-scale heterogeneous data.
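
The dynamic model selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the decision rule, or any names, so everything here is an assumption made for illustration: the sketch uses cosine similarity over bag-of-words counts, a fixed threshold of 0.2, and the hypothetical functions `cosine_similarity` and `select_model`. It is not the ClearNLP implementation.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_model(sentence_tokens, domain_lexicon: Counter,
                 threshold: float = 0.2) -> str:
    """Pick a model name for one input sentence.

    Assumption: the sentence is routed to the domain-specific model when
    it is lexically similar enough to the data that model was built from,
    and to the generalized model otherwise. The threshold is hypothetical.
    """
    sentence_vector = Counter(token.lower() for token in sentence_tokens)
    similarity = cosine_similarity(sentence_vector, domain_lexicon)
    return "domain-specific" if similarity >= threshold else "generalized"


# Hypothetical lexical counts standing in for the domain-specific training data.
domain_lexicon = Counter({"the": 5, "stocks": 3, "shares": 2})

in_domain = select_model(["The", "stocks", "fell"], domain_lexicon)
out_of_domain = select_model(["lol", "omg"], domain_lexicon)
```

In this sketch a sentence that shares vocabulary with the domain lexicon is sent to the domain-specific model, while a sentence with no overlap falls back to the generalized model. The selection costs one pass over the sentence tokens, which is consistent with the abstract's emphasis on tagging speed.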

Creator