An empirical study of smoothing techniques for language modeling

SF Chen, J Goodman - Computer Speech & Language, 1999 - Elsevier
We survey the most widely used algorithms for smoothing n-gram language models.
We then present an extensive empirical comparison of several of these smoothing
techniques, including those described by Jelinek and Mercer (1980); Katz (1987); Bell,
Cleary and Witten (1990); Ney, Essen and Kneser (1994); and Kneser and Ney (1995). We
investigate how factors such as training data size, training corpus (e.g., Brown vs. Wall Street
Journal), count cutoffs, and n-gram order (bigram vs. trigram) affect the relative performance …