Automatic MT Evaluation
| Lecture video: | web TODO Youtube |
|---|---|
| Supplementary materials: | File:Bleu.pdf |
| Exercises: | BLEU PER |
Reference Translations
The following picture[1] illustrates the issue of reference translations:
Out of all possible sequences of words in the given language, only some are grammatically correct sentences ($G$). An overlapping set is formed by understandable translations ($T$) of the source sentence (note that these are not necessarily grammatical). Possible reference translations can then be viewed as a subset of $G \cap T$. Only some of these can be reached by the MT system. Typically, we only have several reference translations at our disposal; often we have just a single reference.
PER
Position-independent error rate[2] (PER) is a simple measure based on the number of correct words in the MT output, i.e. words that also occur in the reference, regardless of their position. It is calculated using the following formula:
$$\text{PER} = 1 - \frac{\text{correct} - \max(0, c - r)}{r}$$
where $r$ and $c$ are the numbers of tokens in the reference translation and the candidate translation, respectively.
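The formula can be sketched in Python roughly as follows. The function name `per`, the whitespace tokenization and the lowercasing are assumptions made for this illustration, and "correct" is interpreted here as the size of the bag-of-words overlap between the candidate and the reference. The sentence pair is only illustrative, and the reference is assumed to be non-empty.

```python
from collections import Counter


def per(candidate: str, reference: str) -> float:
    """Position-independent error rate for one sentence pair (illustrative sketch)."""
    # Assumption: lowercase and split on whitespace; real tokenizers differ.
    cand = candidate.lower().split()
    ref = reference.lower().split()
    c, r = len(cand), len(ref)
    # "correct" = words shared by both sentences, counted as multisets,
    # so word order is ignored but repetitions are not over-counted.
    correct = sum((Counter(cand) & Counter(ref)).values())
    return 1 - (correct - max(0, c - r)) / r


score = per("The happiness of dreaming camels",
            "On the happiness of dreaming camels")
print(round(score, 3))  # 1 - 5/6, i.e. about 0.167
```

The `max(0, c - r)` term only matters when the candidate is longer than the reference: surplus words are subtracted from the correct count, so padding the output cannot lower the error rate.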
BLEU
BLEU[3] (Bilingual evaluation understudy) remains the most popular metric for automatic evaluation of MT output quality.
While PER looks only at individual words, BLEU also considers sequences of words. Informally, we can describe BLEU as the amount of overlap of $n$-grams between the candidate translation and the reference (more specifically unigrams, bigrams, trigrams and 4-grams, in the standard formulation).
The formal definition is as follows:
$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{i=1}^{n} \lambda_i \log p_i\right)$$
where (almost always) $\lambda_i = 1/n$ and $n = 4$. Each $p_i$ stands for the $i$-gram precision, i.e. the number of $i$-grams in the candidate translation which are confirmed by the reference, divided by the total number of $i$-grams in the candidate translation.
Each reference n-gram can be used to confirm the candidate n-gram only once (clipping), making it impossible to game BLEU by producing many occurrences of a single common word (such as "the").
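Clipping can be illustrated with a minimal Python sketch. The helper names `ngrams` and `clipped_matches` are hypothetical, and tokens are assumed to be already split and lowercased.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(cand_tokens, ref_tokens, n):
    """Count candidate n-grams confirmed by the reference, with clipping."""
    cand_counts = Counter(ngrams(cand_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    return sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())


cand = "the the the the the the the".split()
ref = "the cat is on the mat".split()
confirmed = clipped_matches(cand, ref, 1)
total = len(ngrams(cand, 1))
print(confirmed, total)  # 2 7
```

Without clipping, all seven occurrences of "the" would count as confirmed and the unigram precision would be 7/7; with clipping it drops to 2/7.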
BP stands for brevity penalty. Since BLEU is a kind of precision, short outputs (which only contain words that the system is sure about) would score highly without BP. This penalty is defined simply as:
$$\text{BP} = \begin{cases} 1, & \text{if } c > r \\ \exp(1 - r/c), & \text{if } c \leq r \end{cases}$$
where $r$ and $c$ are again the numbers of tokens in the reference translation and the candidate translation, respectively.
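As a small sketch, the penalty translates directly into Python. The function name `brevity_penalty` and the handling of an empty candidate are assumptions of this illustration, not part of the definition above.

```python
import math


def brevity_penalty(c: int, r: int) -> float:
    """Brevity penalty for candidate length c and reference length r."""
    if c > r:
        return 1.0
    # Assumption: an empty candidate gets penalty 0 to avoid division by zero.
    if c == 0:
        return 0.0
    return math.exp(1 - r / c)


print(round(brevity_penalty(5, 6), 2))  # 0.82
print(brevity_penalty(6, 6))            # 1.0
```

A candidate of the same length as the reference is not penalized, since $\exp(1 - r/c) = \exp(0) = 1$ when $c = r$.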
Example
Consider the following situation:
| | Sentence | Confirmed 1-grams | Confirmed 2-grams | Confirmed 3-grams | Confirmed 4-grams |
|---|---|---|---|---|---|
| Source | Vom Glück der träumenden Kamele | | | | |
| Reference | On the happiness of dreaming camels | | | | |
| MT Output | The happiness of dreaming camels | 5 | 4 | 3 | 2 |
The number of confirmed MT n-grams is 5, 4, 3, 2 respectively for unigrams, bigrams, trigrams and 4-grams (matching is case-insensitive here, so "The" is confirmed by "the"). The MT output is one word shorter than the reference, therefore:
$$\text{BP} = \exp(1 - 6/5) \doteq 0.82$$
Each precision divides the confirmed count by the total number of n-grams in the MT output. The five-word MT output contains 5 unigrams, 4 bigrams, 3 trigrams and 2 4-grams, so the geometric mean of precisions is:
$$\exp\left(\frac{1}{4} \log\frac{5}{5} + \frac{1}{4} \log\frac{4}{4} + \frac{1}{4} \log\frac{3}{3} + \frac{1}{4} \log\frac{2}{2}\right) = 1$$
Note that you can equivalently take the fourth root of the product of the precisions, i.e. $\sqrt[4]{\frac{5}{5} \cdot \frac{4}{4} \cdot \frac{3}{3} \cdot \frac{2}{2}}$.
The final BLEU score is then $0.82 \cdot 1 = 0.82$. Every n-gram in the MT output is confirmed, so the score is lowered only by the brevity penalty.
BLEU is often multiplied by 100 for readability.
BLEU is a document-level metric. This means that counts of confirmed n-grams are collected for all sentences in the translated document and then the geometric mean of n-gram precisions is computed from the accumulated counts. For a single sentence, BLEU is often zero (since there is frequently no matching 4-gram or even trigram).
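The accumulation of counts over a document can be sketched as follows. This is a minimal illustration assuming one reference per sentence, whitespace tokenization and lowercasing; the function name `corpus_bleu` and the two example sentence pairs are hypothetical. Each precision is the accumulated number of confirmed n-grams divided by the accumulated number of candidate n-grams.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def corpus_bleu(candidates, references, max_n=4):
    """Document-level BLEU with one reference per sentence (illustrative sketch)."""
    confirmed = [0] * max_n  # clipped matches per n-gram order
    total = [0] * max_n      # candidate n-grams per n-gram order
    c_len = r_len = 0
    for cand, ref in zip(candidates, references):
        cand, ref = cand.lower().split(), ref.lower().split()
        c_len += len(cand)
        r_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            confirmed[n - 1] += sum(min(k, ref_counts[g]) for g, k in cand_counts.items())
            total[n - 1] += sum(cand_counts.values())
    # An order with no confirmed n-gram makes the geometric mean zero.
    if min(confirmed) == 0:
        return 0.0
    log_mean = sum(math.log(m / t) for m, t in zip(confirmed, total)) / max_n
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_mean)


cands = ["the cat sat on the mat", "he reads a book every day"]
refs = ["the cat is on the mat", "he reads a book every day"]
first_only = corpus_bleu(cands[:1], refs[:1])  # no 4-gram matches
whole_doc = corpus_bleu(cands, refs)           # counts pooled over both sentences
print(first_only, round(whole_doc, 3))
```

In this sketch the first sentence on its own scores zero because none of its 4-grams is confirmed, while the pooled counts (11/12, 8/10, 5/8 and 3/6) give a non-zero document-level score.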
Multiple Reference Translations
BLEU supports multiple references. In that case, if an n-gram in the MT output is confirmed by any of the reference translations, it is counted as correct. If an n-gram occurs multiple times, it has to be seen in one of the references multiple times as well.
The original paper is not clear about BP in this case. The usual practice is to take the reference translation which is closest in length to the MT output and calculate BP from that.
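Both conventions can be sketched in Python. The helper names are hypothetical, tokens are assumed to be already split, and breaking length ties in favour of the shorter reference is an assumption of this sketch rather than something stated above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches_multi(cand_tokens, refs_tokens, n):
    """Clipped matches when several references are available."""
    cand_counts = Counter(ngrams(cand_tokens, n))
    # For each n-gram, the clipping limit is its highest count
    # in any single reference.
    max_ref_counts = Counter()
    for ref in refs_tokens:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    return sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())


def closest_ref_length(c, ref_lengths):
    """Reference length closest to c; ties go to the shorter one (assumption)."""
    return min(ref_lengths, key=lambda r: (abs(r - c), r))


cand = "the the cat".split()
refs = ["the cat sat".split(), "the dog saw the cat".split()]
print(clipped_matches_multi(cand, refs, 1))      # 3
print(clipped_matches_multi(cand, refs[:1], 1))  # 2
print(closest_ref_length(len(cand), [len(r) for r in refs]))  # 3
```

The second "the" in the candidate is confirmed only because the second reference contains "the" twice; with the first reference alone it would be clipped.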
Other Metrics
- Translation Edit Rate[4] (TER) -- an edit-distance based metric on the level of phrases
- METEOR[5] -- a robust metric with support for paraphrasing
References
- [1] Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. Scratching the Surface of Possible Translations
- [2] C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf. Accelerated DP Based Search for Statistical Translation
- [3] Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation
- [4] Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation
- [5] Alon Lavie, Michael Denkowski. The METEOR Metric for Automatic Evaluation of Machine Translation

