Communities of lesser-resourced languages like North Sámi benefit from language tools such as spell checkers and grammar checkers to improve literacy. Accurate error feedback depends on well-tokenised input, but traditional tokenisation as a shallow preprocessing step is inadequate for the challenges of real-world language usage. We present an alternative in which tokenisation remains ambiguous until linguistic context information is available. This lets us accurately detect sentence boundaries, multiwords and compound errors. We describe a North Sámi grammar checker with such a tokenisation system, and show the results of its evaluation.
Wiechetek, Linda; Unhammer, Kevin B.; and Moshagen, Sjur N.
"Seeing More Than Whitespace — Tokenisation and Disambiguation in a North Sámi Grammar Checker,"
Proceedings of the Workshop on Computational Methods for Endangered Languages: Vol. 1, Article 7.
Available at: https://scholar.colorado.edu/scil-cmel/vol1/iss1/7