tion of questions matters. When evaluating text
criteria, answers to questions about different cri-
teria tend to correlate when they are presented si-
multaneously for a given item. When participants
are shown an item multiple times and questioned
about each text criterion separately, this correla-
tion is reduced.
4.5 Statistics and data analysis
Within behavioral sciences, it is standard to evalu-
ate hypotheses based on whether findings are sta-
tistically significant or not (typically, in published
papers, they are), although a majority of NLG pa-
pers do not report statistical tests (see Section 3.6).
However, there is a growing awareness that statistical tests are often conducted incorrectly. Moreover, one may wonder whether standard null-hypothesis significance testing (NHST) is applicable or helpful in human NLG evaluation.
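To make the setting concrete, here is a minimal sketch of how NHST is commonly applied to human evaluation data; the ratings are invented and the use of an independent-samples t-test (via SciPy) is an illustrative assumption, not a recommendation:

```python
from scipy import stats

# Hypothetical 1-5 fluency ratings for outputs of two NLG systems.
ratings_baseline = [3, 2, 4, 3, 3, 2, 4, 3, 2, 3]
ratings_new = [4, 3, 5, 4, 4, 3, 5, 4, 3, 4]

# H0: both systems receive the same mean rating.
# NHST rejects H0 when the p-value falls below the threshold alpha.
t_stat, p_value = stats.ttest_ind(ratings_new, ratings_baseline)
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

Whether rejecting H0 here licenses the claim "the new system is better" is exactly the kind of question the remainder of this section takes up.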
In a common scenario, NLG researchers may want to compare various versions of their own novel system (e.g., with or without output variation, or relying on different word embedding models, to give just two more or less random examples) to each other, to some other (‘state-of-the-art’) systems, and/or to one or more baselines. Notice that this quickly
gives rise to a rather complex statistical design
with multiple factors and multiple levels. Ironi-
cally, with every system or baseline that is added
to the evaluation, the comparison becomes more
interesting but the statistical model becomes more
complex, and power issues become more pressing. Yet questions about statistical power, that is, the probability that the statistical test will reject the null hypothesis (H0) when the alternative hypothesis (H1, e.g., that your new NLG system is the best) is true, are seldom (if ever) discussed in the NLG literature.
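Power is easy to estimate by Monte Carlo simulation. The sketch below is illustrative only (the function name, normally distributed ratings, a two-sample t-test, and a true mean difference of `effect_size` standard deviations are all assumptions): it estimates the probability of rejecting H0 when H1 is in fact true.

```python
import numpy as np
from scipy import stats

def simulated_power(effect_size, n, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated experiments (n ratings per system, true
    mean difference = effect_size SDs) in which a two-sample t-test
    rejects H0 at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        baseline = rng.normal(0.0, 1.0, n)     # baseline-system ratings
        new = rng.normal(effect_size, 1.0, n)  # new-system ratings
        if stats.ttest_ind(new, baseline).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

# A large true effect is detected far more reliably than a small one
# at the same sample size.
print(simulated_power(0.8, n=30))  # high power
print(simulated_power(0.2, n=30))  # low power: most true effects missed
```

Running such a simulation before the evaluation shows how many participants or items are needed to detect an effect of plausible size, rather than discovering afterwards that the study was underpowered.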
A related issue is that clear hypotheses are often lacking: NLG researchers generally assume that their system will be rated higher than the comparison systems, but they will not necessarily assume that it will perform better on all dependent variables. Moreover, they may have no specific hypotheses about which variant of their own system will perform best.
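Comparing several system variants on several dependent variables multiplies the number of significance tests, and with it the chance of at least one false positive. Under the textbook assumption of independent tests (an idealization, since evaluation criteria often correlate), the inflation is easy to compute; the function name here is made up for illustration:

```python
def family_wise_error_rate(alpha, m):
    """Probability of at least one false positive (Type I error)
    across m independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** m

print(round(family_wise_error_rate(0.05, 1), 4))        # → 0.05
print(round(family_wise_error_rate(0.05, 10), 4))       # → 0.4013
# Lowering the per-test threshold to alpha / m (the Bonferroni
# correction discussed below) brings the family-wise rate back
# under the nominal alpha:
print(round(family_wise_error_rate(0.05 / 10, 10), 4))  # → 0.0489
```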
In fact, in the scenario sketched above there may be multiple (implicit) hypotheses: the new system is better than the state of the art, the new system is better than the baseline, and so on. When testing multiple hypotheses, the probability of making at least one false claim (incorrectly rejecting a true H0) increases; such errors are known as false positives or Type I errors. Various remedies for this particular problem exist, one being an application of the simple
Bonferroni correction, which amounts to lowering
the significance threshold α—commonly .05, but