Scoring and Optimization
Lecture video: |
web TODO Youtube |
---|
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&index=11&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V%7C800%7Ccenter}}
Features of MT Models
Phrase Translation Probabilities
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P(\mathbf{e}|\mathbf{f})}
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P(\mathbf{f}|\mathbf{e})}
These probabilities are estimated by simply counting how many times (for the first formula) we saw Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbf{e}} aligned to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbf{f}} and how many times we saw Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbf{f}} in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9} .
estimated in the programme ||| naznačena v programu estimated in the programme ||| naznačena v programu estimated in the programme ||| naznačena v programu estimated in the programme ||| odhadován v programu estimated in the programme ||| odhadovány v programu estimated in the programme ||| odhadovány v programu estimated in the programme ||| předpokládal program estimated in the programme ||| v programu uvedeným estimated in the programme ||| v programu uvedeným
Lexical Weights
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable probability estimates; for instance many long phrases occur together only once in the corpus, resulting in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e}) = 1} . Several methods exist for computing lexical weights. The most common one is based on word alignment inside the phrase. The probability of each foreign word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_j} is estimated as the average of lexical translation probabilities Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle w(f_j, e_i)} over the English words aligned to it. Thus for the phrase Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (\mathbf{e},\mathbf{f})} with the set of alignment points Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a} , the lexical weight is:
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f} \frac{1}{|{i|(i,j) \in a}|} \sum_{\forall(i,j) \in a}w(f_j, e_i) }
Language Model
http://videolectures.net/hltss2010_eisner_plm/
https://www.coursera.org/course/nlp
https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi
Word and Phrase Penalty
Distortion Penalty
Decoding
Phrase-Based Search
Decoding in SCFG
Optimization of Feature Weights
Note that there have even been shared tasks in model optimization. One, by invitation only, in 2011 and one in 2015: WMT15 Tuning Task.