Ripple Down Rule learning for automated word lemmatisation
Title:
Ripple Down Rule learning for automated word lemmatisation
Authors:
Plisson, Joël; Lavrač, Nada; Mladenić, Dunja; Erjavec, Tomaž
Published in:
AI communications
Pagination:
Volume 21 (2008), no. 1, pages 15-26
Year:
2008-03-10
Abstract:
Lemmatisation is the process of finding the normalised forms of wordforms as they appear in text. It is a useful pre-processing step for a large number of language engineering tasks, and especially important for languages with rich inflection morphology. This paper presents a machine learning approach to automated word lemmatisation using a Ripple Down Rule learning algorithm, specially adapted to this task. By focusing on word suffixes, the induced Ripple Down Rules determine which wordform suffix should be removed and/or added to generate the lemma. The rules, induced from a lexicon of lemmatised Slovene words, were evaluated by cross-validation in the lexicon and on a hand-validated annotated corpus, and compared to previous work using two other inductive lemmatisers, ATRIS and CLOG. We show that RDR outperforms ATRIS and is more flexible than CLOG, as it can, unlike CLOG, also work without prior part-of-speech tagging. The RDR lemmatiser is easy to train and use for new languages and is, together with CLOG, available via a Web service.
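The suffix-based idea described in the abstract can be illustrated with a minimal sketch. It assumes a generic if-then-except rule tree in which each rule fires on a wordform suffix, names the suffix to remove and the suffix to add, and may be overridden by more specific exceptions. The class `Rule`, its fields, and the toy English rules in `ROOT` are hypothetical and chosen only for readability; they are not the rules or the learning algorithm from the paper, which induces its rules from a Slovene lexicon.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One node of a hypothetical if-then-except rule tree."""
    suffix: str        # condition: the wordform ends with this suffix
    remove: str = ""   # suffix to strip from the wordform
    add: str = ""      # suffix to append to obtain the lemma
    exceptions: list["Rule"] = field(default_factory=list)

    def lemmatise(self, word: str) -> str:
        # The first exception whose condition matches overrides this rule.
        for exc in self.exceptions:
            if word.endswith(exc.suffix):
                return exc.lemmatise(word)
        # No exception fired: apply this rule's own transformation.
        if self.remove and word.endswith(self.remove):
            word = word[: len(word) - len(self.remove)]
        return word + self.add


# Hand-written toy rules (assumption, for illustration only).
ROOT = Rule(suffix="", exceptions=[
    Rule(suffix="s", remove="s", exceptions=[
        Rule(suffix="ies", remove="ies", add="y"),
        Rule(suffix="ss"),  # exception: leave words such as "glass" unchanged
    ]),
    Rule(suffix="ing", remove="ing"),
])

if __name__ == "__main__":
    for w in ["cats", "ponies", "glass", "walking", "tree"]:
        print(w, "->", ROOT.lemmatise(w))
```

In this sketch the default rule at the root leaves a word unchanged, and each level of exceptions refines the decision for a longer suffix, so a wordform is lemmatised by the most specific rule that matches it. Learning such a tree from a lexicon is the part the paper addresses and is not shown here.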