Automatic MT Evaluation

Lecture 5: Automatic MT Evaluation
Lecture video: web TODO
Supplementary materials: File:Bleu.pdf
Exercises: BLEU

Reference Translations

The issue of reference translations can be viewed in terms of sets of word sequences[1].


Out of all possible sequences of words in the given language, only some are grammatically correct sentences (G). An overlapping set is formed by understandable translations (T) of the source sentence (note that these are not necessarily grammatical). Possible reference translations can then be viewed as a subset of G\cap T. Only some of these can be reached by the MT system. Typically, we only have several reference translations at our disposal; often we have just a single reference.


PER

Position-independent error rate[2] (PER) is a simple measure based on the number of correct words in the MT output, i.e. words that also occur in the reference, regardless of their position. It is calculated using the following formula:

\text{PER} = 1 - \frac{\text{correct} - \max(0, c - r)}{r}

where r and c are the numbers of tokens in the reference translation and the candidate translation, respectively.
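
The formula can be tried out with a short Python sketch. It assumes that "correct" is the multiset overlap between candidate and reference tokens (each reference token can be matched at most once) and that the input is already tokenized and lowercased. The function name per is only illustrative.

```python
from collections import Counter

def per(candidate, reference):
    """Position-independent error rate for two lists of tokens."""
    c, r = len(candidate), len(reference)
    # Assumption: "correct" is the size of the multiset intersection,
    # so word order plays no role and repeated words are not over-counted.
    correct = sum((Counter(candidate) & Counter(reference)).values())
    return 1 - (correct - max(0, c - r)) / r

reference = "on the happiness of dreaming camels".split()
candidate = "the happiness of dreaming camels".split()

print(per(candidate, reference))        # one reference word is missing
print(per(candidate[::-1], reference))  # same value: position is ignored
```

Reversing the candidate does not change the result, which is exactly the position independence described above.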


BLEU

BLEU[3] (Bilingual evaluation understudy) remains the most popular metric for automatic evaluation of MT output quality.

While PER only looks at individual words, BLEU also considers sequences of words. Informally, we can describe BLEU as the amount of n-gram overlap between the candidate translation and the reference (more specifically, the overlap of unigrams, bigrams, trigrams and 4-grams in the standard formulation).

The formal definition is as follows:

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{i=1}^{n} \lambda_i \log p_i\right)

where (almost always) \lambda_i = 1/n and n = 4. p_i stands for the i-gram precision, i.e. the fraction of i-grams in the candidate translation which are confirmed by the reference.

Each reference n-gram can be used to confirm the candidate n-gram only once (clipping), making it impossible to game BLEU by producing many occurrences of a single common word (such as "the").
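
Clipping can be illustrated with a small Python sketch. The helper names ngrams and clipped_precision are hypothetical, and the input is assumed to be tokenized and lowercased already.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """Return (confirmed, total) n-gram counts for the candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clipping: an n-gram is confirmed at most as many times
    # as it occurs in the reference.
    confirmed = sum(min(count, ref_counts[gram])
                    for gram, count in cand_counts.items())
    return confirmed, sum(cand_counts.values())

reference = "the cat is on the mat".split()
candidate = "the the the the".split()

print(clipped_precision(candidate, reference, 1))
```

The reference contains "the" twice, so only two of the four candidate unigrams are confirmed and the unigram precision is 2/4 rather than 4/4.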

BP stands for brevity penalty. Since BLEU is a kind of precision, short outputs (which only contain words that the system is sure about) would score highly without BP. This penalty is defined simply as:

\text{BP} = \begin{cases} 1, & \text{if } c > r \\ \exp(1 - r/c), & \text{if } c \leq r \end{cases}

where r and c are again the numbers of tokens in the reference translation and the candidate translation, respectively.
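
Putting the pieces together, the following Python sketch computes BLEU for a single candidate and a single reference with uniform weights \lambda_i = 1/n. It is only an illustration under stated assumptions: tokens are split on whitespace, no smoothing is applied (so the score is 0 as soon as one n-gram order has no match), and the function names and example sentences are made up for this sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def brevity_penalty(c, r):
    """BP as defined above: 1 if c > r, exp(1 - r/c) otherwise."""
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one candidate against one reference."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Counter intersection takes the minimum count, i.e. clipping.
        confirmed = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if confirmed == 0:
            return 0.0  # log(0) is undefined; the geometric mean is 0
        log_sum += math.log(confirmed / total) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_sum)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on the red mat".split()

# Precisions are 6/7, 4/6, 3/5 and 2/4; BP is 1 because c > r.
print(bleu(candidate, reference))
```

Note that the denominator of each precision is the number of n-grams in the candidate, not in the reference.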


Consider the following situation:

            Sentence                               Confirmed n-grams
                                                   1   2   3   4
Source      Vom Glück der träumenden Kamele
Reference   On the happiness of dreaming camels
MT Output   The happiness of dreaming camels       5   4   3   2

The number of confirmed MT n-grams is 5, 4, 3, 2 respectively for unigrams, bigrams etc. (ignoring case, so that "The" matches "the"). The MT output contains exactly 5 unigrams, 4 bigrams, 3 trigrams and 2 4-grams, so every n-gram precision is 1. The MT output is one word shorter than the reference, therefore:

\text{BP} = \exp(1 - 6/5) \doteq 0.82

The geometric mean of precisions is:

\exp\left(\frac{1}{4}\log\frac{5}{5} + \frac{1}{4}\log\frac{4}{4} + \frac{1}{4}\log\frac{3}{3} + \frac{1}{4}\log\frac{2}{2}\right) = 1

Note that you can equivalently take the fourth root of the product of the precisions, i.e. \sqrt[4]{\frac{5}{5} \cdot \frac{4}{4} \cdot \frac{3}{3} \cdot \frac{2}{2}}

The final BLEU score is then 0.82 \cdot 1 \doteq 0.82. In this example, the score is determined by the brevity penalty alone.

BLEU is often multiplied by 100 for readability.

BLEU is a document-level metric. This means that counts of confirmed n-grams are collected for all sentences in the translated document and then the geometric mean of n-gram precisions is computed from the accumulated counts. For a single sentence, BLEU is often zero (since there is frequently no matching 4-gram or even trigram).
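
The document-level accumulation can be sketched in Python as follows. The sketch assumes one reference per sentence, whitespace tokenization and no smoothing. It also assumes that the brevity penalty is computed from the summed candidate and reference lengths, which the text above does not spell out. The name corpus_bleu and the toy sentences are illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, max_n=4):
    """Unsmoothed BLEU from counts accumulated over all sentences."""
    confirmed = [0] * max_n
    total = [0] * max_n
    c = r = 0
    for cand, ref in zip(candidates, references):
        c += len(cand)
        r += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            confirmed[n - 1] += sum((cand_counts & ref_counts).values())
            total[n - 1] += sum(cand_counts.values())
    if min(confirmed) == 0:
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in zip(confirmed, total)) / max_n
    # Assumption: BP uses the total lengths of the whole document.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

references = ["the cat sat on the mat".split(), "it is raining".split()]
candidates = ["the cat sat on the mat".split(), "it rains".split()]

print(corpus_bleu(candidates[1:], references[1:]))  # second sentence alone
print(corpus_bleu(candidates, references))          # whole document
```

The second sentence on its own scores zero because it has no matching bigram, while the document-level score is non-zero because the counts are pooled before the precisions are computed.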

Multiple Reference Translations

BLEU supports multiple references. In that case, if an n-gram in the MT output is confirmed by any of the reference translations, it is counted as correct. If an n-gram occurs multiple times in the MT output, it has to be seen multiple times in one of the references as well.

The original paper is not clear about BP in this case. The usual practice is to take the reference translation which is closest in length to the MT output and calculate BP from that. (Note that even this specification is not unambiguous since there can be two closest references to the given hypothesis, the longer and the shorter one.)
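
A minimal Python sketch of both conventions follows. Clipping takes, for each n-gram, the maximum count found in any single reference. For the brevity penalty, the closest reference length is used and ties are broken in favour of the shorter reference. The tie-breaking rule is an assumption of this sketch, not something the text above prescribes, and the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def confirmed_counts(candidate, references, n):
    """Return (confirmed, total) n-gram counts against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        # Counter union keeps the maximum count seen in any one reference.
        max_ref_counts |= Counter(ngrams(ref, n))
    confirmed = sum((cand_counts & max_ref_counts).values())
    return confirmed, sum(cand_counts.values())

def closest_ref_length(c, references):
    """Reference length closest to c; ties go to the shorter reference."""
    return min((len(ref) for ref in references),
               key=lambda r: (abs(r - c), r))

references = ["the cat".split(), "the the dog sat".split()]
candidate = "the the cat".split()

print(confirmed_counts(candidate, references, 1))
print(closest_ref_length(len(candidate), references))
```

Here the second "the" in the candidate is confirmed only because the second reference contains "the" twice, and the two references are equally close in length to the candidate, which is the ambiguous case mentioned above.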

Other Metrics

  • WMT Metrics Shared Task -- an annual shared task in automatic evaluation of MT; see the task web page and, for example, the Results of the WMT14 Metrics Shared Task[4]
  • Translation Edit Rate[5] (TER) -- an edit-distance-based metric on the level of phrases
  • METEOR[6] -- a robust metric with support for paraphrasing



References

  1. Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. Scratching the Surface of Possible Translations
  2. C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf. Accelerated DP Based Search for Statistical Translation
  3. Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation
  4. Matouš Macháček, Ondřej Bojar. Results of the WMT14 Metrics Shared Task
  5. Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation
  6. Alon Lavie, Michael Denkowski. The METEOR Metric for Automatic Evaluation of Machine Translation