Date of Award

Spring 1-1-2018

Document Type

Thesis

Degree Name

Master of Science (MS)

First Advisor

Manuel E. Lladser

Second Advisor

Jem Corcoran

Third Advisor

Anne Dougherty

Abstract

In many applications, the need arises to search a text for appearances of a given set of keywords. As an example, in bioinformatics one may wish to search a DNA sequence to find so-called biological motifs. A standard approach to this problem is to leverage a deterministic finite automaton---a graph structure which is traversed as letters of the text are read in. However, depending on the number and length of the keywords being sought in the text, the graph may be too large to fit in computer memory, making this approach fruitless.

In this thesis, we first present a novel algorithm that, under the assumption that the keywords take the form of a so-called generalized string, constructs the minimal DFA recognizing those keywords. Importantly, the algorithm is iterative and allows one to build the automaton directly, without any use of buffer memory. Not only does this mean that the algorithm is efficient regarding memory consumption, but it also provides useful insight to help facilitate analysis for the size of such DFA.

Using this new algorithm and pairing it with the assumption that the generalized strings are drawn at random from some class of probability distributions, we develop bounds on the size of the minimal automaton that are true with high probability. Furthermore, using synthetic data, we provide evidence that the size of the minimal automaton grows linearly in expectation for many cases.

Share

COinS