Automatic MT Evaluation

Lecture 4: Automatic MT Evaluation
Lecture video:	web TODO ; Youtube
Supplementary materials:	File:Bleu.pdf

Youtube: https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V

Reference Translations

The following picture^[1] illustrates the issue of reference translations:

Out of all possible sequences of words in the given language, only some are grammatically correct sentences ($G$). An overlapping set is formed by understandable translations ($T$) of the source sentence (note that these are not necessarily grammatical). Possible reference translations can then be viewed as a subset of $G\cap T$. Only some of these can be reached by the MT system. Typically, we have only a few reference translations at our disposal; often we have just a single reference.

PER

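Position-independent error rate^[2] (PER) is a simple measure which counts the number of words which are identical in the MT output and the reference translation, regardless of their position, and divides this count by the length of the reference. The error rate is one minus this ratio.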
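A minimal sketch of PER, assuming the count of matching words is normalized by the reference length and the error rate is one minus that ratio (some formulations additionally penalize outputs that are longer than the reference). Whitespace tokenization, lowercasing and the function name `per` are assumptions of this sketch, not part of the lecture.

```python
from collections import Counter


def per(hypothesis: str, reference: str) -> float:
    """Position-independent error rate under the assumed definition:
    1 - (matched words / reference length). Word order is ignored."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    ref_len = sum(ref.values())
    if ref_len == 0:
        raise ValueError("reference must not be empty")
    # Multiset intersection: each reference word can be matched at most once.
    matched = sum((hyp & ref).values())
    return 1.0 - matched / ref_len


# A complete reordering of the reference still has zero error under PER.
print(per("camels dreaming of happiness the", "the happiness of dreaming camels"))
# Two of the four reference words are matched.
print(per("the cat", "the cat sat down"))
```

Because only bags of words are compared, the first call reports no error even though the word order is scrambled, which is the main weakness of this metric.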

BLEU

BLEU^[3] (Bilingual evaluation understudy) remains the most popular metric for automatic evaluation of MT output quality.

While PER only looks at individual words, BLEU also considers sequences of words. Informally, we can describe BLEU as the amount of overlap of $n$-grams between the candidate translation and the reference (more specifically unigrams, bigrams, trigrams and 4-grams in the standard formulation).

The formal definition is as follows:

${\text{BLEU}}={\text{BP}}\cdot \exp \sum _{i=1}^{n}(\lambda _{i}\log p_{i})$

where (almost always) $\lambda _{i}=1/n$ and $n=4$. $p_{i}$ stands for the $i$-gram precision, i.e. the number of $i$-grams in the candidate translation which are confirmed by the reference, divided by the total number of $i$-grams in the candidate translation.

Each reference $n$-gram can be used to confirm a candidate $n$-gram only once (clipping), making it impossible to game BLEU by producing many occurrences of a single common word (such as "the").
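Clipping can be illustrated with a short sketch. It assumes pre-tokenized input (lists of words); the helper names `ngrams` and `clipped_precision` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams confirmed by the reference,
    where each reference n-gram can confirm only one candidate n-gram."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Clip each candidate count by the count of the same n-gram in the reference.
    confirmed = sum(min(count, ref_counts[gram])
                    for gram, count in cand_counts.items())
    return confirmed / total


candidate = "the the the the".split()
reference = "the cat is on the mat".split()
print(clipped_precision(candidate, reference, 1))  # 0.5
```

The reference contains "the" twice, so only two of the four candidate occurrences are confirmed. Without clipping, the unigram precision of this degenerate output would be 1.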

BP stands for brevity penalty. Since BLEU is a kind of precision, short outputs (which only contain words that the system is sure about) would score highly without BP. With $c$ denoting the length of the candidate translation and $r$ the length of the reference, the penalty is defined simply as:

${\text{BP}}={\begin{cases}1,&{\mbox{if }}c>r\\\exp(1-r/c),&{\mbox{if }}c\leq r.\end{cases}}$
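The case distinction translates directly into code. In this sketch, `c` is taken to be the candidate length and `r` the reference length, both in words; the function name `brevity_penalty` is an assumption of the example.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and reference length r."""
    if c <= 0:
        raise ValueError("candidate length must be positive")
    if c > r:
        return 1.0  # outputs longer than the reference are not penalized here
    return math.exp(1.0 - r / c)


print(round(brevity_penalty(5, 6), 2))  # 0.82
print(brevity_penalty(7, 6))            # 1.0
```

Note that the penalty never rewards long outputs: it only scales the score down when the candidate is shorter than the reference.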

Example

Consider the following situation:

Role	Sentence	Confirmed 1-grams	Confirmed 2-grams	Confirmed 3-grams	Confirmed 4-grams
Source	Vom Glück der träumenden Kamele	-	-	-	-
Reference	On the happiness of dreaming camels	-	-	-	-
MT Output	The happiness of dreaming camels	5	4	3	2

The MT output contains 5, 4, 3 and 2 $n$-grams for unigrams, bigrams, trigrams and 4-grams respectively. It is one word shorter than the reference (5 words against 6), therefore:

${\text{BP}}=\exp(1-6/5)\approx 0.82$

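All $n$-grams of the MT output are confirmed by the reference (ignoring the capitalization of "The"), and each precision is computed over the candidate's own $n$-gram counts. The geometric mean of precisions is:

$\exp(1/4\log(5/5)+1/4\log(4/4)+1/4\log(3/3)+1/4\log(2/2))=1$

The resulting score is therefore ${\text{BLEU}}\approx 0.82\cdot 1=0.82$. Dividing by the reference $n$-gram counts (6, 5, 4, 3) instead would measure recall rather than precision and give $\approx 0.76$.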
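The example can be checked with a compact single-reference sketch of the whole metric. It assumes whitespace tokenization and case-insensitive matching (so that "The" is confirmed by "the"), uniform weights $\lambda _{i}=1/4$, and precision computed over the candidate's own $n$-grams as in the definition of $p_{i}$. The function name `bleu` is hypothetical, and no smoothing is applied, so any zero precision yields a score of 0.

```python
import math
from collections import Counter


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level, single-reference BLEU without smoothing (a sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped number of candidate n-grams confirmed by the reference.
        confirmed = sum(min(k, ref_counts[g]) for g, k in cand_counts.items())
        total = sum(cand_counts.values())
        if confirmed == 0:
            return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 here
        log_sum += math.log(confirmed / total) / max_n
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_sum)


score = bleu("The happiness of dreaming camels",
             "On the happiness of dreaming camels")
print(round(score, 2))  # 0.82
```

Under these assumptions every candidate $n$-gram is confirmed, so the score equals the brevity penalty.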

Multiple Reference Translations

Other Metrics

Translation Error Rate (TER)

METEOR

References

[1] Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. Scratching the Surface of Possible Translations.
[2] C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf. Accelerated DP Based Search for Statistical Translation.
[3] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation.
