Scoring and Optimization

Features of MT Models

So far we haven't fully described the actual model (most commonly) used in phrase-based and syntactic MT, the log-linear model. For MT, it can be formulated as follows:

$P(e|f)\propto \exp \sum _{i}w_{i}f_{i}(e,f)$

Essentially, the probability (or, less ambitiously, the score) of a translation is a weighted sum of features $f_{i}$ . Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.

Our goal is then to find such a translation hypothesis that maximizes this score, formally:

$e^{*}={\text{argmax}}_{e}P(e|f)$

Typically, feature functions are evaluated on partial translations during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.

We describe how to obtain the weights $w_{i}$ in the last section of this lecture.

Phrase Translation Probabilities

Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:

$P(\mathbf {e} |\mathbf {f} )$
$P(\mathbf {f} |\mathbf {e} )$

These probabilities are estimated by simply counting how many times (for the first formula) we saw $\mathbf {e}$ aligned to $\mathbf {f}$ and how many times we saw $\mathbf {f}$ in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that $P({\text{naznačena v programu}}|{\text{estimated in the programme}})=3/9$ .

estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| odhadován v programu
estimated in the programme ||| odhadovány v programu
estimated in the programme ||| odhadovány v programu 
estimated in the programme ||| předpokládal program
estimated in the programme ||| v programu uvedeným
estimated in the programme ||| v programu uvedeným

Lexical Weights

Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable probability estimates; for instance many long phrases occur together only once in the corpus, resulting in $P(\mathbf {e} |\mathbf {f} )=P(\mathbf {f} |\mathbf {e} )=1$ . Several methods exist for computing lexical weights. The most common one is based on word alignment inside the phrase. The probability of each foreign word $f_{j}$ is estimated as the average of lexical translation probabilities $w(f_{j},e_{i})$ over the English words aligned to it. Thus for the phrase $(\mathbf {e} ,\mathbf {f} )$ with the set of alignment points $a$ , the lexical weight is:

${\text{lex}}(\mathbf {f} |\mathbf {e} ,a)=\prod _{j=1}^{l_{f}}{\frac {1}{|{i|(i,j)\in a}|}}\sum _{\forall (i,j)\in a}w(f_{j},e_{i})$

Language Model

The task of language modeling in machine translation is to estimate how likely a sequence of words $\mathbf {w} =(w_{1},\ldots ,w_{l})$ is in the target language.

When translating, the decoder generates translation hypotheses which are probable according to the translation model (i.e. the phrase table). The language model then scores these hypotheses according to how probable (common, fluent) they are in the target language. The final translation is then something like a compromise -- the sentence that is both fluent and a good translation of the input.

Similarly to the translation model, sequence probabilities are learned from data using maximum likelihood estimation. For language modeling, only monolingual data are needed (a resource available in much larger amounts than parallel texts).

Naturally, the prediction of the whole sequence $\mathbf {e}$ has to be decomposed, so that it can be reliably estimated. The most common approach are n-gram language models which build upon the Markov assumption: a word depends only on a limited, fixed number of preceding words. The decomposition is done as follows:

${\begin{aligned}P(\mathbf {w} )&=P(w_{1})P(w_{2}|w_{1})P(w_{3}|w_{1},w_{2})\ldots P(w_{l}|w_{1},\ldots ,w_{l-1})\\&\approx P(w_{1})P(w_{2}|w_{1})\ldots P(w_{l}|w_{l-n},\ldots ,w_{l-1})\end{aligned}}$

The first equality follows from the chain rule and the second from n-th order Markov assumption. Each word is then modeled by at most n preceding words and the probability of the whole sequence is the product of probabilities of individual words. Smoothing is further used to supply probability estimates to unseen n-grams.

A great introduction to language modeling is the video lecture by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on Coursera; videos from the Coursera course can be found on YouTube.

Word and Phrase Penalty

For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the "penalty" can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).

Distortion Penalty

The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. There are many definitions possible, the following is commonly used: for each phrase, its value is the distance (measured in words) between its beginning and the end of the preceding phrase. This distance-based reordering can be replaced by more sophisticated models, such as lexicalized reordering.

Decoding

Phrase-Based Search

We have already described the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.

Some of the feature functions that we have described are local, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.

The most prominent example of a non-local feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase $\mathbf {e} =(e_{1},\ldots ,e_{K})$ without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase ( $e_{1}$ ) given the previous context.

Phrase-based search uses hypothesis recombination to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis if no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.

This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a state -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.

Decoding in SCFG

With syntactic MT, the situation is more complicated because hypotheses are not constructed left-to-right. That means that while there was only a single boundary between the current partial translation and its extension, SCFG rules can apply anywhere and we may need to look at words both preceding and following the target-side of the rule. This makes state tracking more complicated than in PBMT.

Optimization of Feature Weights

We now focus on how to find a good set of weights $w_{i}$ for the features. There are many methods for tuning model parameters in MT, such as MERT (Minimum Error Rate Training, described here), PRO (Pairwise Ranked Optimization), or MIRA (Margin Infused Relaxed Algorithm, a general online optimization algorithm applied successfully to MT).

TODO references to papers!

All of them require a tuning set (development set, held-out set) -- a small parallel corpus separated from the training data on which the performance of the proposed weights is evaluated. Choosing a suitable tuning set is black magic (as are many decisions in MT system development). As a general guideline, it should be as similar to the expected test data as possible and the larger, the better (too large tuning sets can take too long to tune on, though).

Minimum Error Rate Training (MERT) and has become a de-facto standard algorithm for tuning. The tuning process is iterative:

Set all weights to some initial values.
Translate the tuning set using the current weights; for each sentence, output n best translations and their feature scores.
Run one iteration of MERT to get a new set of weights.
If the n-best lists are identical to the previous iteration, return the current weights and exit. Else go back to 2.

The input for MERT is a set of n-best lists -- the n best translations for each sentence in the tuning set. A vector of feature scores is associated with each sentence.

First, each translation is scored by the objective function (such as BLEU). In each n-best list, the sentence with the best score is assumed to be the best translation. The goal of MERT then is to find a set of weights that will maximize the overall score, i.e. move good translations to the top of the n-best lists.

MERT addresses the dimensionality of the weight space (the space is effectively $\mathbb {R} ^{n}$ for n weights) by optimizing each weight separately.

While the line search is globally optimal (in the one dimension), overall, the procedure is likely to reach a local optimum. MERT is therefore usually run from a number of different starting positions and the best set of weights is used.

After convergence (or reaching a pre-set maximum number of iterations), the weights for log-linear model are known and the system training is finished.

Note that there have even been shared tasks in model optimization. One, by invitation only, in 2011 and one in 2015: WMT15 Tuning Task.