Pre-processing: Difference between revisions

Lecture 3: Pre-processing
Lecture video:	web TODO ; Youtube
Exercises:	Lowercasing ; Deaccenting

Latest revision as of 10:03, 11 March 2015

Overall, the task of MT data pre-processing is to drop any distinctions that are not important for the output.

Inspecting Text Data

Text Encoding

Two texts that look the same might not be identical. MT systems do not see the strings as humans do but instead, they work with the actual byte representation. Therefore, data pre-processing is a very important step in system development.

Unicode includes a number of special characters which can complicate text processing for an MT system developer. The following table contains examples of some of the more devious characters:

Code	Name	Description
U+200B	Zero-width space	An invisible space.
U+200E	Left-to-right mark	An invisible character used in texts with mixed scripts (e.g. Latin and Arabic) to indicate reading direction.
U+2028	Line separator	A Unicode newline which is often not interpreted by text editors (and can be invisible).
U+2029	Paragraph separator	Separates paragraphs, implies a new line (also often ignored).

Decode Unicode is a useful webpage with information on Unicode characters.

Often, a hexdump of the file is the most useful diagnostic tool. On Linux, for example, the xxd command provides the necessary functionality.
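When the suspicious text is already loaded in a script, a similar inspection can be done without leaving the program. The following is a minimal sketch in Python, using only the standard unicodedata module. The function name describe_chars is an assumption made for illustration and does not come from the lecture materials.

```python
import unicodedata

def describe_chars(text):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each character."""
    rows = []
    for ch in text:
        code = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
        raw = ch.encode("utf-8").hex(" ")          # bytes as they would appear in a hexdump
        rows.append((code, name, raw))
    return rows

# A string that looks like "ab" but hides a zero-width space in the middle.
for code, name, raw in describe_chars("a\u200bb"):
    print(code, name, raw, sep="\t")
```

The byte column corresponds to what xxd would show for a UTF-8 encoded file, so the two tools can be cross-checked against each other.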

Script/Characters

Unicode often provides several ways to write what looks like a single character. For example, the letter "a" might be written in Latin or in Cyrillic script. A nice summary of Latin-like alphabets is available at homoglyphs.net.

Aside from seemingly identical, but differently encoded characters, problems commonly seen in data include:

Confusion of 0 (zero) and O (capital letter)
Inconsistent letter case: English word I written in lowercase etc. (notorious e.g. in movie subtitles)
Various systematic mis-spellings -- all of these variants of "I'll" (I will) were observed in movie subtitles: i'll Ill l'll 1'll 1'11
Different symbols for various punctuation (quotes, dashes, apostrophe etc.)
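Words that mix look-alike letters from different scripts can be flagged automatically. The sketch below is a rough heuristic, not a complete solution: it takes the first word of each letter's Unicode name (e.g. LATIN, CYRILLIC) as a script label, which is an assumption that works for these two alphabets but is not an official script property. The helper names scripts_in and mixed_script_words are hypothetical.

```python
import unicodedata

def scripts_in(word):
    """Collect a rough script label for every letter in the word."""
    scripts = set()
    for ch in word:
        if ch.isalpha():  # skip digits, punctuation and combining marks
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def mixed_script_words(text):
    """Return whitespace-separated words whose letters come from more than one script."""
    return [word for word in text.split() if len(scripts_in(word)) > 1]

# The first word contains a Latin "a" (U+0061) between two Cyrillic letters.
print(mixed_script_words("\u0447\u0061\u0439 tea \u0447\u0430\u0439"))
```

A word flagged this way is only a candidate for inspection; legitimate mixed-script tokens (such as product names) do occur in real data.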

VIM tips

A good text editor is an essential prerequisite for inspecting text data and implementing suitable pre-processing. Here are a few tips for the VIM editor:

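Set VIM's internal encoding to UTF-8 (the encoding in which the current file is written is controlled separately by the fileencoding option):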

:set encoding=utf8

Show the code of character under cursor:

ga

Set or remove BOM (byte-order mark) for current file:

:set bomb
:set nobomb

Spot Five Differences

A text file with the Russian word "чай" (tea) can be written in seemingly identical ways which, however, differ significantly at the byte level.

The first difference is at the very beginning of the file, which may or may not include the Unicode byte-order mark (BOM). In UTF-8, the BOM is 3 bytes long.

The second and third differences are the presence of two Unicode non-printing characters, namely the zero-width space and the left-to-right mark.

The fourth difference is the code for the letter "a", which can be written in either Latin or Cyrillic script (the two look identical).

The fifth and final difference is the representation of the last letter "й". It can be written either as one precomposed letter or as "и" followed by a combining breve (the diacritic mark).
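The five differences can be reproduced in a few lines. The following Python sketch builds both variants of the word from explicit code points and applies a partial clean-up. The set of characters to strip and the function name clean are assumptions chosen for this example. Note that Unicode normalization (NFC) merges "и" and the combining mark into one letter but does not touch the Latin "a", which needs a separate homoglyph mapping.

```python
import unicodedata

# Plain variant: three Cyrillic letters, precomposed final letter.
plain = "\u0447\u0430\u0439"

# Tricky variant: BOM, Cyrillic che, zero-width space, LATIN a,
# left-to-right mark, Cyrillic i followed by a combining breve (U+0306).
tricky = "\ufeff\u0447\u200b\u0061\u200e\u0438\u0306"

# Invisible characters removed in this sketch (an illustrative, incomplete list).
INVISIBLE = {"\ufeff", "\u200b", "\u200e"}

def clean(text):
    """Drop the invisible characters above and compose base letter + mark (NFC)."""
    kept = "".join(ch for ch in text if ch not in INVISIBLE)
    return unicodedata.normalize("NFC", kept)

print(len(plain.encode("utf-8")), len(tricky.encode("utf-8")))  # byte lengths differ
print(clean(tricky) == plain)                                    # Latin "a" still differs
print(clean(tricky).replace("\u0061", "\u0430") == plain)        # after mapping the homoglyph
```

Blindly replacing every Latin "a" is only acceptable here because the word is known to be Russian; in mixed-language data such a mapping has to be applied with more care.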

Tokenization

The most suitable tokenization can be task-dependent. For example, in parsing, we would like to keep adjectives such as "red-haired" as one word, while for phrase-based MT, it is useful to split such words.

A basic but quite robust approach is to split whenever the Unicode character category changes. Imagine reading the input character by character. When we observe that so far there have been letters (category L) and suddenly there is punctuation (category P), we insert a space. During the same process, it is useful to convert all whitespace (tabs, spaces, non-breaking spaces and sequences of these) to a single space character.
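The category-change approach can be sketched in Python as follows. The function name tokenize is hypothetical, and one detail goes beyond the description above: combining marks (category M) are kept attached to the preceding character, an assumption made here so that decomposed accented letters are not split apart.

```python
import unicodedata

def tokenize(text):
    """Insert a space wherever the major Unicode category (L, N, P, S, Z, ...) changes."""
    pieces = []
    prev = None
    for ch in text:
        cat = unicodedata.category(ch)[0]  # first letter is the major category
        if cat == "M" and prev is not None:
            cat = prev  # combining marks stay with their base character
        if prev is not None and cat != prev:
            pieces.append(" ")
        pieces.append(ch)
        prev = cat
    # split() with no arguments treats tabs and non-breaking spaces as whitespace,
    # so joining again collapses every whitespace run into a single space.
    return " ".join("".join(pieces).split())

print(tokenize("Hello, world!"))
print(tokenize("red-haired"))
```

Because adjacent characters of the same category stay together, a sequence such as "..." remains one token, while a number like "1.96" is split into three.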

However, in many situations, a more sophisticated, linguistically motivated tokenization scheme is useful. E.g. for words such as "don't", "couldn't", "shouldn't", we can obtain a nice generalization by splitting off "n't":

don't -> do n't

shouldn't -> should n't

couldn't -> could n't
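A rule of this kind can be written as a single regular expression. The sketch below is one possible formulation in Python, with the hypothetical function name split_nt. It assumes the plain ASCII apostrophe, so typographic apostrophes would have to be normalized beforehand.

```python
import re

# A word character followed by "n't" at the end of a word.
NT_PATTERN = re.compile(r"(\w)(n't)\b", flags=re.IGNORECASE)

def split_nt(text):
    """Split the negation clitic "n't" off the preceding word."""
    return NT_PATTERN.sub(r"\1 \2", text)

print(split_nt("I don't think she shouldn't go."))
```

This should run before any generic punctuation splitting, since a tokenizer that separates the apostrophe first would leave nothing for the rule to match.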

It is essential for data tokenization to be consistent. All of our training data should conform to the same pre-processing scheme and an identical pipeline should be applied at test time (when our system runs and we translate new data).

Related Material

Data Cleaning and Tokenization (Moses tutorial)

Exercises

This is the first lecture accompanied by programming exercises. Before starting, you should follow the instructions on how to use the CodEx submission system.

Follow the links to see the description of each task and a submission interface with automatic evaluation of your solutions.
