Scoring and Optimization: Difference between revisions

From MT Talks
=== Language Model ===


The task of language modeling in machine translation is to estimate how likely a
sequence of words <math>\mathbf{w} = (w_1, \ldots, w_l)</math> is in the target language.


When translating, the decoder generates translation hypotheses which are
probable according to the translation model (i.e. the phrase table). The
language model then scores these hypotheses according to how probable (common,
fluent) they are in English. The final translation is then a compromise -- the
sentence that is both fluent and a good translation of the input.


Similarly to the translation model, sequence probabilities are learned from data
using maximum likelihood estimation. For language modeling, only monolingual
data are needed (a resource available in much larger amounts than parallel texts).
 
Naturally, the probability of the whole sequence <math>\mathbf{w}</math> has to be
decomposed so that it can be estimated reliably. The most common approach is
''n-gram'' language models, which build upon the Markov assumption: a word
depends only on a limited, fixed number of preceding words (<math>n-1</math> words
for an ''n''-gram model). The decomposition is done as follows:

<math>
\begin{align}
P(\mathbf{w}) &= P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\
  & \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n+1}, \ldots, w_{l-1})
\end{align}
</math>
 
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].


=== Word and Phrase Penalty ===

Revision as of 11:10, 25 August 2015

Lecture 13: Scoring and Optimization
Lecture video: web TODO
Youtube

{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&index=11&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V%7C800%7Ccenter}}

Features of MT Models

Phrase Translation Probabilities

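Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:

<math>P(\mathbf{f}|\mathbf{e}) = \frac{\mathrm{count}(\mathbf{f},\mathbf{e})}{\mathrm{count}(\mathbf{e})} \qquad P(\mathbf{e}|\mathbf{f}) = \frac{\mathrm{count}(\mathbf{f},\mathbf{e})}{\mathrm{count}(\mathbf{f})}</math>

These probabilities are estimated by simply counting how many times (for the first formula) we saw <math>\mathbf{f}</math> aligned to <math>\mathbf{e}</math> and how many times we saw <math>\mathbf{e}</math> in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that <math>P(\text{naznačena v programu}|\text{estimated in the programme}) = \frac{3}{9} = \frac{1}{3}</math>.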

estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| odhadován v programu
estimated in the programme ||| odhadovány v programu
estimated in the programme ||| odhadovány v programu 
estimated in the programme ||| předpokládal program
estimated in the programme ||| v programu uvedeným
estimated in the programme ||| v programu uvedeným
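
The counting described above can be sketched in Python. This is a minimal illustration only: it assumes the excerpt is the complete list of extractions for this English phrase, and the function name `phrase_translation_probs` is hypothetical, not part of any MT toolkit.

```python
from collections import Counter
from fractions import Fraction

# The excerpt of extracted phrase pairs, in "english ||| foreign" format.
EXTRACTED = """\
estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| odhadován v programu
estimated in the programme ||| odhadovány v programu
estimated in the programme ||| odhadovány v programu
estimated in the programme ||| předpokládal program
estimated in the programme ||| v programu uvedeným
estimated in the programme ||| v programu uvedeným
"""


def phrase_translation_probs(lines):
    """Relative-frequency estimate P(f | e) = count(f, e) / count(e)."""
    pair_counts = Counter()
    e_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        e, f = (part.strip() for part in line.split("|||"))
        pair_counts[(e, f)] += 1
        e_counts[e] += 1
    # Exact fractions keep the estimates easy to check by hand.
    return {(e, f): Fraction(c, e_counts[e]) for (e, f), c in pair_counts.items()}


probs = phrase_translation_probs(EXTRACTED.splitlines())
for (e, f), p in sorted(probs.items()):
    print(f"P({f} | {e}) = {p}")
```

The reverse direction, P(e | f), follows the same pattern with the roles of the two phrases swapped, so the denominator becomes the count of the foreign phrase.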

Lexical Weights

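Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable probability estimates; for instance, many long phrases occur together only once in the corpus, resulting in <math>P(\mathbf{f}|\mathbf{e}) = P(\mathbf{e}|\mathbf{f}) = 1</math>. Several methods exist for computing lexical weights. The most common one is based on word alignment inside the phrase. The probability of each foreign word <math>f_i</math> is estimated as the average of the lexical translation probabilities <math>w(f_i|e_j)</math> over the English words aligned to it. Thus for the phrase pair <math>(\mathbf{f},\mathbf{e})</math> with the set of alignment points <math>a</math>, the lexical weight is:

<math>\mathrm{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{i=1}^{|\mathbf{f}|} \frac{1}{|\{j : (i,j) \in a\}|} \sum_{(i,j) \in a} w(f_i|e_j)</math>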
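
The averaging over aligned words can be sketched as follows. This is a hedged illustration, not a reference implementation: the function name `lexical_weight`, the 0-based `(foreign_index, english_index)` alignment convention, and the toy probabilities are all assumptions made for the example. Scoring unaligned foreign words against a `NULL` token is also an assumption that the text above does not mention.

```python
def lexical_weight(f_words, e_words, alignment, w):
    """Product over foreign words of the average w(f_i | e_j) over the
    English words aligned to f_i.

    alignment: set of (i, j) pairs, i indexing f_words, j indexing e_words.
    w: dict mapping (foreign_word, english_word) -> lexical probability.
    """
    weight = 1.0
    for i, f in enumerate(f_words):
        aligned = [j for (fi, j) in alignment if fi == i]
        if aligned:
            avg = sum(w.get((f, e_words[j]), 0.0) for j in aligned) / len(aligned)
        else:
            # Assumption: an unaligned foreign word is scored against NULL.
            avg = w.get((f, "NULL"), 0.0)
        weight *= avg
    return weight


# Toy lexical translation probabilities (made up for illustration).
w = {
    ("v", "in"): 0.5,
    ("programu", "the"): 0.1,
    ("programu", "programme"): 0.7,
}
f_words = ["v", "programu"]
e_words = ["in", "the", "programme"]
alignment = {(0, 0), (1, 1), (1, 2)}

score = lexical_weight(f_words, e_words, alignment, w)
print(score)
```

Here "programu" is aligned to two English words, so its factor is the mean of the two lexical probabilities, while "v" contributes its single aligned probability directly.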

Language Model

The task of language modeling in machine translation is to estimate how likely a sequence of words <math>\mathbf{w} = (w_1, \ldots, w_l)</math> is in the target language.

When translating, the decoder generates translation hypotheses which are probable according to the translation model (i.e. the phrase table). The language model then scores these hypotheses according to how probable (common, fluent) they are in English. The final translation is then a compromise -- the sentence that is both fluent and a good translation of the input.

Similarly to the translation model, sequence probabilities are learned from data using maximum likelihood estimation. For language modeling, only monolingual data are needed (a resource available in much larger amounts than parallel texts).

Naturally, the probability of the whole sequence <math>\mathbf{w}</math> has to be decomposed so that it can be estimated reliably. The most common approach is n-gram language models, which build upon the Markov assumption: a word depends only on a limited, fixed number of preceding words (<math>n-1</math> words for an n-gram model). The decomposition is done as follows:

<math>
\begin{align}
P(\mathbf{w}) &= P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\
  & \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n+1}, \ldots, w_{l-1})
\end{align}
</math>
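
A maximum likelihood n-gram model can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the function names, the three-sentence corpus, and the `<s>` / `</s>` padding tokens are all hypothetical additions, and no smoothing is applied, so any unseen n-gram gives the whole sentence probability zero.

```python
from collections import Counter


def pad(sentence, n):
    """Add n-1 start symbols and one end symbol (an assumed convention)."""
    return ["<s>"] * (n - 1) + sentence + ["</s>"]


def train_ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-word contexts in tokenized sentences."""
    ngrams, contexts = Counter(), Counter()
    for sent in sentences:
        tokens = pad(sent, n)
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            ngrams[context + (tokens[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts


def sentence_prob(sentence, ngrams, contexts, n):
    """Product of MLE estimates count(context, word) / count(context)."""
    tokens = pad(sentence, n)
    prob = 1.0
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        if contexts[context] == 0:
            return 0.0  # context never seen in training
        prob *= ngrams[context + (tokens[i],)] / contexts[context]
    return prob


corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "sleeps"],
    ["the", "cat", "eats"],
]
ngrams, contexts = train_ngram_counts(corpus, n=2)

p_seen = sentence_prob(["the", "cat", "sleeps"], ngrams, contexts, n=2)
p_unseen = sentence_prob(["the", "dog", "eats"], ngrams, contexts, n=2)
print(p_seen, p_unseen)
```

The second sentence contains the bigram "dog eats", which never occurs in the toy corpus, so its unsmoothed probability is zero even though every word is known. Practical language models address this with smoothing.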

A great introduction to language modeling is the video lecture by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on Coursera; videos from the Coursera course can be found on YouTube.

Word and Phrase Penalty

Distortion Penalty

Decoding

Phrase-Based Search

Decoding in SCFG

Optimization of Feature Weights

Note that there have even been shared tasks in model optimization: one, by invitation only, in 2011, and another in 2015, the WMT15 Tuning Task.