Rich Vocabulary: Difference between revisions

From MT Talks

Revision as of 13:34, 12 August 2015

Lecture 12: Rich Vocabulary
Lecture video: web TODO
Youtube

{{#ev:youtube|https://www.youtube.com/watch?v=eSIbNT-yjdg|800|center}}

Examples of Languages with a Rich Vocabulary

German -- compounding

While German has some degree of inflection, it is the German fondness for complex compound words that causes the large-vocabulary problem for MT. Consider the following compound:

[[File:rindfleish-prezi.png|600px]]
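To make the effect of compounding concrete, here is a minimal sketch of dictionary-based compound splitting, one common way to map an unseen compound back onto known words. The lexicon, the list of linking elements and the function name `split_compound` are all assumptions made for illustration; a real system would derive its lexicon from corpus statistics.

```python
from functools import lru_cache

# Toy lexicon of known stems (hypothetical; normally taken from a corpus vocabulary).
LEXICON = {"rind", "fleisch", "gesetz", "arbeit", "zimmer", "haus"}
# Linking elements that German may insert between compound parts (simplified list).
LINKERS = ("", "s", "es", "n", "en")


def split_compound(word):
    """Return a list of lexicon stems that covers `word`, or None if there is none."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # The whole word has been consumed: successful (empty) split of the rest.
        if start == len(word):
            return ()
        # Try the longest candidate stem first.
        for end in range(len(word), start, -1):
            stem = word[start:end]
            if stem not in LEXICON:
                continue
            # Allow an optional linking element after the stem.
            for linker in LINKERS:
                if word.startswith(linker, end):
                    rest = solve(end + len(linker))
                    if rest is not None:
                        return (stem,) + rest
        return None

    parts = solve(0)
    return list(parts) if parts is not None else None


print(split_compound("Rindfleischgesetz"))  # ['rind', 'fleisch', 'gesetz']
print(split_compound("Arbeitszimmer"))      # ['arbeit', 'zimmer']
print(split_compound("Computer"))           # None
```

With only six stems in the lexicon, the splitter already covers compounds that would otherwise each count as a separate vocabulary item. This is the sense in which compounding inflates the vocabulary.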

Finnish -- agglutination

Agglutinative languages (such as Finnish, Turkish or Hungarian) often attach many affixes (prefixes or suffixes) to a word. These affixes can express grammatical properties or change the word's meaning, as shown in the example:

[[File:finnish-prezi.png|500px]]

Finnish nouns are said to have over 2,000 possible inflected forms. Because each of these forms counts as a separate word type, the number of unique word forms in Finnish text can be astronomical.
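The multiplicative growth behind such numbers can be sketched in a few lines. The example below is a deliberately simplified illustration, not a Finnish morphological generator: it uses a hand-picked subset of suffixes for the stem "talo" (house), omits plural marking, and ignores vowel harmony and consonant gradation. The slot inventory and the function name `generate_forms` are assumptions.

```python
from itertools import product

STEM = "talo"  # "house"

# One list per suffix slot; the empty string means "slot not filled".
# Only a small, hand-picked subset of Finnish suffixes is shown.
SLOTS = [
    ["", "ssa", "sta", "lla", "lta"],  # case: in, out of, on/at, from
    ["", "ni", "si"],                  # possessive: my, your
    ["", "kin", "ko"],                 # clitic: "also", question marker
]


def generate_forms(stem, slots):
    """Concatenate the stem with every combination of one suffix per slot."""
    return [stem + "".join(suffixes) for suffixes in product(*slots)]


forms = generate_forms(STEM, SLOTS)
print(len(forms))               # 45 = 5 * 3 * 3
print("talossanikin" in forms)  # True: talo-ssa-ni-kin, "also in my house"
```

Three small slots already yield 45 distinct surface forms from a single stem. Each additional slot or suffix multiplies the total, which is how counts in the thousands arise.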

Czech -- fusional inflection

Fusional languages differ from agglutinative languages in that they fuse multiple properties into a single affix. In Czech, one suffix can describe case, gender and number at the same time. On the other hand, fusional affixes tend to be ambiguous (e.g. an identical suffix can be used for multiple morphological cases).

Morphologically rich languages tend to impose strong agreement constraints on the suffixes (the inflection of an adjective must agree with its governing noun, and subjects and objects must agree with the verb inflection). Consider the following example:

[[File:czech-inflection-prezi.png|500px]]
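As a further illustration of the ambiguity of fusional affixes, the sketch below stores the paradigm of the Czech noun "žena" (woman) and inverts it, listing for every surface form all the case and number combinations it can express. The data layout and the function name `analyses_by_form` are assumptions chosen for this example.

```python
from collections import defaultdict

# Paradigm of the Czech noun "žena" (woman): (case, number) -> surface form.
PARADIGM = {
    ("nom", "sg"): "žena",  ("nom", "pl"): "ženy",
    ("gen", "sg"): "ženy",  ("gen", "pl"): "žen",
    ("dat", "sg"): "ženě",  ("dat", "pl"): "ženám",
    ("acc", "sg"): "ženu",  ("acc", "pl"): "ženy",
    ("voc", "sg"): "ženo",  ("voc", "pl"): "ženy",
    ("loc", "sg"): "ženě",  ("loc", "pl"): "ženách",
    ("ins", "sg"): "ženou", ("ins", "pl"): "ženami",
}


def analyses_by_form(paradigm):
    """Invert the paradigm: surface form -> list of (case, number) readings."""
    readings = defaultdict(list)
    for features, form in paradigm.items():
        readings[form].append(features)
    return dict(readings)


readings = analyses_by_form(PARADIGM)
print(len(PARADIGM), len(readings))  # 14 paradigm cells, 10 distinct forms
print(sorted(readings["ženy"]))      # four readings share the form "ženy"
```

The form "ženy" alone covers genitive singular and nominative, accusative and vocative plural, so a translation system cannot read the case off the suffix without looking at the surrounding words.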

Large Vocabulary Sizes in MT Pipeline

Word Alignment

Phrase Extraction

Decoding

Evaluation

Possible Solutions