RSLP Stemmer (Removedor de Sufixos da Língua Portuguesa)
This page describes the RSLP Stemmer algorithm, which was presented in the paper
A Stemming Algorithm for the Portuguese Language, in Proceedings of the SPIRE
Conference. A new version of the algorithm was implemented by Alexandre Ramos
Coelho and can be obtained here.
Stemming is the process of conflating the variant forms of a word into a common
representation, the stem. It essentially consists of removing suffixes from
words. The procedure is widely used in information retrieval to improve recall.
For example, the words presentation, presented, and presenting could all be
reduced to the common stem "present". This rests on the assumption that a query
containing the term presenting implies an interest in documents containing the
words presentation and presented.
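The effect on recall can be illustrated with a small Python sketch. It is not RSLP: the suffix list, the length guard, and the names toy_stem and build_index are assumptions made up for this English example. Documents are indexed by stem, so a query for presenting also finds documents that contain presentation or presented.

```python
from collections import defaultdict

# Toy English suffix list, for illustration only; it is not part of RSLP.
TOY_SUFFIXES = ("ation", "ing", "ed")


def toy_stem(word):
    """Strip the first matching toy suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


docs = {
    1: "the presentation was short",
    2: "she presented the results",
    3: "unrelated text",
}
index = build_index(docs)

# The query term is stemmed the same way as the indexed tokens.
matches = sorted(index[toy_stem("presenting")])
print(matches)  # [1, 2]
```

Without stemming, the same query would match neither document, since neither contains the exact token presenting.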
The RSLP stemmer is a rule-based suffix-stripping algorithm for Portuguese.
Each rule is expressed in the following way:
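In the 2001 paper listed under the publications below, a rule has four parts: the suffix to be removed, the minimum size of the remaining stem, a replacement suffix (possibly empty), and a list of exception words. The following is a minimal Python sketch of that layout. The names Rule and apply_rule, the use of >= for the minimum-size check, and the sample rule values are assumptions made for illustration and are not taken from the C implementation.

```python
from typing import FrozenSet, NamedTuple, Optional


class Rule(NamedTuple):
    suffix: str                 # suffix to be removed
    min_stem_len: int           # minimum size of the stem left after removal
    replacement: str            # appended to the stem; may be empty
    exceptions: FrozenSet[str]  # words the rule must not touch


def apply_rule(word: str, rule: Rule) -> Optional[str]:
    """Return the rewritten word, or None when the rule does not apply."""
    if not word.endswith(rule.suffix) or word in rule.exceptions:
        return None
    stem = word[: len(word) - len(rule.suffix)]
    if len(stem) < rule.min_stem_len:
        return None
    return stem + rule.replacement


# Illustrative diminutive rule; the values are examples, not the official table.
diminutive = Rule("inho", 3, "", frozenset({"caminho", "carinho", "vizinho"}))

print(apply_rule("livrinho", diminutive))  # livr
print(apply_rule("caminho", diminutive))   # None (listed as an exception)
print(apply_rule("linho", diminutive))     # None (stem "l" is too short)
```

The exception list protects words that merely end in the same letters as a suffix, and the minimum stem size prevents very short words from being stripped down to almost nothing.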
The RSLP algorithm was implemented in C and is composed of 8 steps that must be
executed in a fixed order. The figure below shows the sequence of those steps.
Each step has a set of rules; the rules in a step are examined in sequence, and
only one rule in a step can apply. The longest possible suffix is always removed
first because of the order of the rules within a step, e.g. the plural suffix
-es should be tested before the suffix -s.
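The role of rule order inside a step can be sketched in Python as follows. The function name apply_step, the tuple layout (suffix, minimum stem size, replacement), and the two-rule plural step are hypothetical simplifications; exception lists are left out and the real rule tables are larger.

```python
def apply_step(word, rules):
    """Examine the rules in order and apply only the first one that matches."""
    for suffix, min_stem_len, replacement in rules:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement  # at most one rule fires per step
    return word  # no rule matched; the word passes through unchanged


# Hypothetical plural step: the longer suffix "es" is listed before "s".
PLURAL_STEP = [("es", 3, ""), ("s", 2, "")]
# Same rules in the wrong order, to show why the ordering matters.
REVERSED_STEP = [("s", 2, ""), ("es", 3, "")]

print(apply_step("flores", PLURAL_STEP))    # flor
print(apply_step("flores", REVERSED_STEP))  # flore
print(apply_step("casas", PLURAL_STEP))     # casa
```

Because the loop returns at the first match, listing the longer suffix first is what guarantees that the longest possible suffix is the one removed.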
Publications about the RSLP Stemmer:
· Flores, F., V.P. Moreira, and C.A. Heuser. Assessing the Impact of Stemming
Accuracy on Information Retrieval, in International Conference on Computational
Processing of Portuguese Language (PROPOR 2010). 2010, p. 11-20. Link to paper
Abstract:
The quality of stemming algorithms is typically
measured in two different ways: (i) how accurately they map the variant forms
of a word to the same stem; or (ii) how much improvement they bring to
Information Retrieval. In this paper, we evaluate different Portuguese stemming
algorithms in terms of accuracy and in terms of their aid to Information
Retrieval. The aim is to assess whether the most accurate stemmers are also the
ones that bring the biggest gain in Information Retrieval. Our results show
that some kind of correlation does exist, but it is not as strong as one might
have expected.
· Orengo, V.M., L. Buriol, and A. Coelho. A Study on the Use of Stemming for
Monolingual Ad-Hoc Portuguese Information Retrieval, in Evaluation of
Multilingual and Multi-modal Information Retrieval, C. Peters, et al., Editors.
2007, Springer.
Abstract:
For UFRGS’s first participation on CLEF our goal was
to compare the performance of heavier and lighter stemming strategies using the
Portuguese data collections for Monolingual Ad-hoc retrieval. The results show
that the safest strategy was to use the lighter alternative (reducing plural
forms only). On a query-by-query analysis, full stemming achieved the highest
improvement but also the biggest decrease in performance when compared to no
stemming. In addition, statistical tests showed that the only significant
improvement both in terms of mean average precision and precision at ten was
achieved by our lighter stemmer.
· Orengo, V.M. and C.R. Huyck. A Stemming Algorithm for the Portuguese Language,
in 8th International Symposium on String Processing and Information Retrieval
(SPIRE). 2001.
Abstract:
Stemming algorithms are traditionally used in
Information Retrieval with the goal of enhancing recall, as they conflate the
variant forms of a word into a common representation. This paper describes the
development of a simple and effective suffix-stripping algorithm for
Portuguese. The stemmer is evaluated using a method proposed by Paice [9]. The
results show that it performs significantly better than the Portuguese version
of the Porter algorithm.