Volume 4, Article 4, 2016

Low-Resource Active Learning of Morphological Segmentation

Author: Stig-Arne Grönroos*
Katri Hiovain**
Peter Smit*
Ilona Rauhala**
Kristiina Jokinen**
Mikko Kurimo*
Sami Virpioja***
Affiliation: *Department of Signal Processing and Acoustics, Aalto University, Finland
**Institute of Behavioural Sciences, University of Helsinki, Finland
***Department of Computer Science, Aalto University, Finland
DOI: 10.3384/nejlt.2000-1533.1644
Volume: 4
Article No.: 4
Available: 2016-03-13
No. of pages: 26
Pages: 47-72
Abstract: Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.

Publishing host: Linköping University Electronic Press, Linköpings universitet