RSLP Stemmer

RSLP Stemmer (Removedor de Sufixos da Língua Portuguesa)

This page describes the algorithm of the RSLP Stemmer, which was presented in the paper A Stemming Algorithm for the Portuguese Language, by Viviane Moreira Orengo and Christian Huyck, in Proceedings of the SPIRE Conference, Laguna de San Raphael, Chile, November 13-15, 2001.

A new version of the algorithm was implemented by Alexandre Ramos Coelho and can be obtained here.

Stemming is the process of conflating the variant forms of a word into a common representation, the stem. It basically consists of removing suffixes from words. This procedure is widely used in information retrieval with the aim of enhancing recall. For example, the words presentation, presented and presenting could all be reduced to the common root "present". This is based on the assumption that posing a query with the term presenting implies an interest in documents containing the words presentation and presented.

The RSLP stemmer is a rule-based suffix-stripping algorithm for Portuguese. Each rule is expressed in the following way:
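A minimal Python sketch of how a single suffix-stripping rule of this kind might be represented and applied. It assumes a rule carries a suffix to remove, a minimum length for the remaining stem, a replacement suffix and a list of exception words. The names `Rule` and `apply_rule` are hypothetical, and the example rule and its exception list are illustrative rather than taken from the RSLP rule files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One suffix-stripping rule (assumed layout, for illustration only)."""
    suffix: str                          # suffix to be removed
    min_stem_len: int                    # minimum length of the stem left behind
    replacement: str = ""                # text appended after removal
    exceptions: frozenset = frozenset()  # words the rule must not touch


def apply_rule(word: str, rule: Rule) -> Optional[str]:
    """Return the rewritten word, or None if the rule does not apply."""
    if not word.endswith(rule.suffix):
        return None
    if word in rule.exceptions:
        return None
    stem = word[: len(word) - len(rule.suffix)]
    if len(stem) < rule.min_stem_len:
        return None
    return stem + rule.replacement


# Illustrative diminutive rule: strip "inho" unless the word is an exception.
diminutive = Rule("inho", 3, "", frozenset({"caminho", "carinho", "vizinho"}))

print(apply_rule("cachorrinho", diminutive))  # cachorr
print(apply_rule("caminho", diminutive))      # None (listed exception)
print(apply_rule("linho", diminutive))        # None (stem "l" is too short)
```

The exception list is what keeps a word such as caminho, which merely happens to end in the suffix, from being stripped.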

The RSLP algorithm was implemented in C and is composed of 8 steps that must be executed in a fixed order. The figure below shows the sequence of those steps. Each step has a set of rules, which are examined in sequence, and only one rule in a step can apply. The rules within a step are ordered so that the longest possible suffix is always removed first, e.g. the plural suffix -es should be tested before the suffix -s.
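The control flow described above (steps run in a fixed order, rules within a step tried in sequence, at most one rule applied per step) can be sketched in Python as follows. This is not the original C implementation: it uses only two toy steps instead of eight, ignores exception lists and any conditions for skipping a step, and the rule values and the names `STEPS`, `apply_step` and `stem` are assumptions made for illustration.

```python
# Each rule is (suffix, minimum stem length, replacement). Values are illustrative.
PLURAL_RULES = [
    ("es", 3, ""),  # listed before "s" so the longer suffix is tested first
    ("s", 2, ""),
]
DIMINUTIVE_RULES = [
    ("inho", 3, ""),
]

# Steps are applied in this fixed order.
STEPS = [
    ("plural", PLURAL_RULES),
    ("diminutive", DIMINUTIVE_RULES),
]


def apply_step(word, rules):
    """Try the rules in order and apply only the first one that matches."""
    for suffix, min_len, replacement in rules:
        stem_part = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem_part) >= min_len:
            return stem_part + replacement
    return word  # no rule in this step applied


def stem(word):
    """Run every step once, in order, feeding each step's output to the next."""
    for _name, rules in STEPS:
        word = apply_step(word, rules)
    return word


print(stem("flores"))    # flor
print(stem("gatinhos"))  # gat
print(stem("casas"))     # casa
```

Because the "es" rule comes first, flores loses the whole plural suffix and becomes flor; if "s" were tested first, the result would be flore.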

Publications about the RSLP Stemmer:

· Flores, F., V.P. Moreira, and C.A. Heuser. Assessing the Impact of Stemming Accuracy on Information Retrieval, in International Conference on Computational Processing of Portuguese Language (PROPOR 2010). 2010. p. 11-20. Link to paper

Abstract:
The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval. In this paper, we evaluate different Portuguese stemming algorithms in terms of accuracy and in terms of their aid to Information Retrieval. The aim is to assess whether the most accurate stemmers are also the ones that bring the biggest gain in Information Retrieval. Our results show that some kind of correlation does exist, but it is not as strong as one might have expected.

· Orengo, V.M., L. Buriol, and A. Coelho. A study on the use of Stemming for Monolingual Ad-Hoc Portuguese Information Retrieval, in Evaluation of Multilingual and Multi-modal Information Retrieval, C. Peters, et al., Editors. 2007, Springer Berlin / Heidelberg. p. 91-98. CLEF 2006, Alicante. Link to paper

Abstract:
For UFRGS’s first participation on CLEF our goal was to compare the performance of heavier and lighter stemming strategies using the Portuguese data collections for Monolingual Ad-hoc retrieval. The results show that the safest strategy was to use the lighter alternative (reducing plural forms only). On a query-by-query analysis, full stemming achieved the highest improvement but also the biggest decrease in performance when compared to no stemming. In addition, statistical tests showed that the only significant improvement both in terms of mean average precision and precision at ten was achieved by our lighter stemmer.

· Orengo, V.M. and C.R. Huyck, A Stemming Algorithm for the Portuguese Language, in 8th International Symposium on String Processing and Information Retrieval (SPIRE). 2001: Laguna de San Raphael, Chile. p. 183-193. Link to paper

Abstract:
Stemming algorithms are traditionally used in Information Retrieval with the goal of enhancing recall, as they conflate the variant forms of a word into a common representation. This paper describes the development of a simple and effective suffix-stripping algorithm for Portuguese. The stemmer is evaluated using a method proposed by Paice [9]. The results show that it performs significantly better than the Portuguese version of the Porter algorithm.