Phrase-based Model
Revision as of 15:03, 7 April 2015
Lecture video: https://www.youtube.com/watch?v=aA4jFayPNeQ
Phrase-based machine translation (PBMT) is probably the most widely used approach to MT today. It is relatively simple and easy to adapt to new languages.
Phrase Extraction
PBMT uses phrases as the basic unit of translation. Phrases are simply contiguous sequences of words that have been observed in the training data; they do not correspond to any linguistic notion of a phrase.
To obtain a phrase table (a probabilistic dictionary of phrases), we need word-aligned parallel data. Using the alignment links, a simple heuristic is applied to extract consistent phrase pairs. Consider the word-aligned example sentence:
[Figure: phrase-extraction.png]
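A phrase pair is consistent with the word alignment if it covers a contiguous span on each side and every alignment point from the source side of the span falls within its target side, and vice versa. These are examples of phrase pairs consistent with this word alignment:
[Figures: phrase-extraction-okay.png, phrase-extraction-okay2.png]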
On the other hand, if a source word or a target word inside the span is aligned to a word outside it, the phrase pair cannot be extracted:
[Figures: phrase-extraction-short.png, phrase-extraction-long.png]
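The consistency heuristic described above can be sketched in a few lines of Python. This is only a minimal brute-force illustration, not the implementation used by any particular toolkit. The function names, the `max_len` limit, and the toy German-English sentence pair are all assumptions made for this example, and the requirement that a pair contain at least one alignment link is a common convention that the text above does not state.

```python
from typing import List, Set, Tuple

# An alignment is a set of (source index, target index) links.
Alignment = Set[Tuple[int, int]]


def is_consistent(alignment: Alignment,
                  s_start: int, s_end: int,
                  t_start: int, t_end: int) -> bool:
    """Check a candidate span pair (inclusive indices) against the alignment.

    The pair is rejected as soon as one link has exactly one end inside
    the spans. Assumption: at least one link must lie inside the pair.
    """
    has_link_inside = False
    for s, t in alignment:
        s_in = s_start <= s <= s_end
        t_in = t_start <= t <= t_end
        if s_in != t_in:
            return False  # a word is aligned outside of the current span
        if s_in and t_in:
            has_link_inside = True
    return has_link_inside


def extract_phrases(source: List[str], target: List[str],
                    alignment: Alignment,
                    max_len: int = 4) -> Set[Tuple[str, str]]:
    """Enumerate all span pairs up to max_len words and keep the consistent ones."""
    pairs = set()
    for s_start in range(len(source)):
        for s_end in range(s_start, min(s_start + max_len, len(source))):
            for t_start in range(len(target)):
                for t_end in range(t_start, min(t_start + max_len, len(target))):
                    if is_consistent(alignment, s_start, s_end, t_start, t_end):
                        pairs.add((" ".join(source[s_start:s_end + 1]),
                                   " ".join(target[t_start:t_end + 1])))
    return pairs


# Hypothetical toy example: "assumes" is aligned to three source words.
source = ["michael", "geht", "davon", "aus"]
target = ["michael", "assumes"]
alignment = {(0, 0), (1, 1), (2, 1), (3, 1)}

pairs = extract_phrases(source, target, alignment)
for pair in sorted(pairs):
    print(pair)
```

In this toy example the candidate pair ("geht", "assumes") is rejected, because "assumes" is also aligned to "davon" and "aus", which lie outside the source span. The pair ("geht davon aus", "assumes") is accepted. Enumerating every span pair is quadratic in each sentence length; it is chosen here for readability rather than speed.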