Rich Vocabulary
Lecture video:

{{#ev:youtube|https://www.youtube.com/watch?v=eSIbNT-yjdg|800|center}}
Examples of Languages with a Rich Vocabulary
German -- compounding
While German has some degree of inflection, it is the German fondness for complex word compounds that causes the large-vocabulary problem for MT. Consider a compound such as "Donaudampfschifffahrtsgesellschaft" ('Danube steam shipping company'), built by stringing together Donau + Dampf + Schiff + Fahrt + Gesellschaft.
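A back-of-the-envelope sketch of why compounding inflates the vocabulary: even a handful of noun stems yields a combinatorial number of candidate compounds. The stems and the splice-without-linking-morpheme rule below are a deliberate simplification (real German often inserts linking elements such as "-s-"), so treat this as an illustration, not a morphological model.

```python
from itertools import product

# Toy stem inventory; real vocabularies contain tens of thousands of stems.
stems = ["Donau", "Dampf", "Schiff", "Fahrt", "Kapitän"]

def compounds(stems, max_parts):
    """Enumerate all stem concatenations of length 2..max_parts.

    Simplification: stems are glued directly, ignoring German linking
    morphemes (Fugenelemente) and capitalization of non-initial parts.
    """
    forms = set()
    for n in range(2, max_parts + 1):
        for parts in product(stems, repeat=n):
            forms.add("".join(parts))
    return forms

vocab = compounds(stems, 3)
print(len(vocab))  # 5**2 + 5**3 = 150 candidate forms from just 5 stems
```

With a realistic stem inventory the count of well-formed compounds is effectively unbounded, which is why an MT system can always meet a compound it has never seen.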
Finnish -- agglutination
Agglutinative languages (such as Finnish, Turkish or Hungarian) often attach many affixes (prefixes or suffixes) to words. These affixes can describe grammatical properties or change the word meaning, as in the Finnish form "talossanikin" 'also in my house', which decomposes as talo+ssa+ni+kin (house + in + my + also).
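The vocabulary blow-up comes from the fact that each (optional) affix slot multiplies the number of surface forms. The sketch below stacks a few genuine Finnish suffixes onto the stem talo 'house'; the slot inventory is a small illustrative subset, not a full paradigm.

```python
# Each slot is optional ("" = absent); suffixes attach in a fixed order.
stem = "talo"                              # 'house'
case_suffixes = ["", "ssa", "sta", "lla"]  # in / from inside / on-at
possessives   = ["", "ni", "si"]           # my / your
clitics       = ["", "kin"]                # 'also'

forms = {stem + c + p + k
         for c in case_suffixes
         for p in possessives
         for k in clitics}

print(len(forms))  # 4 * 3 * 2 = 24 distinct surface forms from one stem
```

Three small slots already yield 24 forms; with the full Finnish case, number, possessive and clitic inventory the per-noun count runs into the thousands, matching the figure cited below.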
Finnish nouns are said to have over 2000 possible inflected forms. The number of unique word forms in Finnish can therefore be astronomical.
Czech -- fusional inflection
Fusional languages differ from agglutinative languages in that they fuse multiple properties into a single affix. In Czech, one suffix can describe case, gender and number at the same time. On the other hand, fusional affixes tend to be ambiguous (e.g. an identical suffix can be used for multiple morphological cases).
Morphologically rich languages also tend to impose strong agreement constraints on these suffixes: adjective inflection must agree with its governing noun, and subjects and objects must agree with the verb inflection. In Czech, for example, "nové auto" 'new car' (neuter singular) becomes "nová auta" 'new cars' (neuter plural); the adjective ending must change together with the noun's.
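The ambiguity of fusional endings can be made concrete with a tiny lookup table. The readings listed for the Czech ending "-a" below are a hand-picked illustrative subset of real analyses, not a complete paradigm table.

```python
# One surface suffix maps to several competing (case, number, gender) bundles.
ANALYSES = {
    "a": [
        ("nominative", "singular", "feminine"),   # žen-a   'woman'
        ("genitive",   "singular", "masculine"),  # pán-a   'of the man'
        ("nominative", "plural",   "neuter"),     # měst-a  'towns'
    ],
}

def analyses(word_form, suffix):
    """Return candidate morphological readings if the form ends in suffix."""
    if word_form.endswith(suffix):
        return ANALYSES.get(suffix, [])
    return []

print(analyses("města", "a"))
# one ending -> three competing readings; disambiguation needs context
```

A statistical MT system sees only the surface string, so every such ambiguous ending multiplies the hypotheses it must weigh.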
Large Vocabulary Sizes in MT Pipeline
Let us describe how large vocabulary size affects each step in the standard MT pipeline.
Word Alignment
Word alignment treats different inflections of one word as unrelated units. This prevents the algorithm from sharing statistics across forms and results in sparse observations for inflected forms.
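The fragmentation can be shown with an invented co-occurrence table: each English–Finnish form pair gets its own sparse count, while pooling the counts by lemma would recover one well-supported statistic. All numbers below are made up for illustration.

```python
from collections import Counter

# Surface-level co-occurrence counts as an aligner sees them (invented data).
cooccurrence = Counter({
    ("house",  "talo"):    3,
    ("houses", "talot"):   2,
    ("house",  "talossa"): 1,   # 'in the house'
    ("house",  "talosta"): 1,   # 'from the house'
})

# Hypothetical lemmatization tables for both languages.
src_lemma = {"house": "house", "houses": "house"}
tgt_lemma = {"talo": "talo", "talot": "talo",
             "talossa": "talo", "talosta": "talo"}

lemma_counts = Counter()
for (src, tgt), n in cooccurrence.items():
    lemma_counts[(src_lemma[src], tgt_lemma[tgt])] += n

print(max(cooccurrence.values()), lemma_counts[("house", "talo")])
# best surface count is 3, but the lemma pair is supported by 7 observations
```

A surface-form aligner only ever sees the fragmented counts on the left; the pooled evidence on the right is exactly the shared statistic the text says is lost.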
Phrase Extraction
Rich vocabulary combined with Zipf's law implies that we cannot expect to see all possible word inflections in our training data. We may simply be unable to produce the correct word form because we have never observed it -- even though we may have sufficient statistics for its lemma.
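A minimal sketch of this coverage gap, using invented data: the test set requires an inflection that never occurred in training, even though its lemma is well attested, so a surface-level phrase table cannot produce it.

```python
# Invented training and test vocabularies over the Finnish stem talo 'house'.
train_forms = {"talo", "talot", "talossa"}
test_forms  = {"talo", "talosta"}          # 'from the house' is needed

def lemma(form):
    """Toy lemmatizer: every talo-derived form maps to the lemma talo."""
    return "talo" if form.startswith("talo") else form

oov = {f for f in test_forms if f not in train_forms}
train_lemmas = {lemma(f) for f in train_forms}
oov_with_known_lemma = {f for f in oov if lemma(f) in train_lemmas}

print(oov, oov_with_known_lemma)
# 'talosta' is out-of-vocabulary on the surface level despite a familiar lemma
```

This is precisely the case the paragraph describes: the lemma statistics exist, but phrase extraction over surface forms has no entry to offer.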
Decoding
Each extra word form creates an additional branching point which the decoder must evaluate during search. There is an inherent trade-off in decoding between time (or even tractability) and quality: we can increase pruning limits to avoid discarding potentially correct word forms, but this comes at a high computational cost.
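A back-of-the-envelope calculation makes the branching cost tangible. Assuming, purely for illustration, that every source word admits a fixed number of candidate target forms, the number of form combinations grows exponentially with sentence length:

```python
def search_space(n_words, forms_per_word):
    """Number of target-form combinations for a sentence,
    assuming (hypothetically) a fixed candidate count per word."""
    return forms_per_word ** n_words

# A 10-word sentence: 2 candidate forms per word vs. 20 (rich morphology).
print(search_space(10, 2), search_space(10, 20))
# 1024 vs 10240000000000 -- a ten-billion-fold larger space to prune
```

No decoder enumerates this space exhaustively; beam pruning keeps it tractable, which is exactly why the correct but low-scoring form can be discarded early.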
In short, when dealing with a rich vocabulary, the correct form might not be available at all -- and even when it is, we still might not find it due to pruning.
Evaluation
Rich vocabulary presents a problem even for MT evaluation. Automatic metrics are (mostly) defined in a language-agnostic way and do not look beyond surface forms. BLEU, for instance, counts only exact n-gram matches, so a hypothesis with the correct lemma but a wrong inflection receives no credit.
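The surface-matching behaviour can be shown with a toy unigram precision. This is deliberately not full BLEU (which combines modified n-gram precisions up to 4-grams with a brevity penalty); the Czech sentence pair is an invented example where the hypothesis uses the right lemmas with wrong case endings.

```python
from collections import Counter

def unigram_precision(hyp, ref):
    """Clipped unigram precision, the 1-gram component of BLEU."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    matches = sum(min(n, ref_counts[w]) for w, n in hyp_counts.items())
    return matches / max(len(hyp), 1)

ref = "šel do velkého města".split()   # 'he went to the big town'
hyp = "šel do velké město".split()     # wrong case on adjective and noun

print(unigram_precision(hyp, ref))
# 0.5: only 'šel' and 'do' match; the nearly-correct inflected words score 0
```

A human judge would rate the hypothesis as close to correct, but the surface metric treats "velké"/"velkého" and "město"/"města" as entirely different words.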