Pre-processing
Lecture video:
{{#ev:youtube|ucSv4S4sCjs|800|center}}
Overall, the task of data pre-processing is to drop any distinctions that are not important for the output.
Inspecting Text Data
Text Encoding
Two texts that look the same might not be identical. MT systems do not see strings as humans do; instead, they work with the actual byte representation. Data pre-processing is therefore a very important step in system development.
Unicode includes a number of special characters which can complicate text processing for an MT system developer. The following table contains examples of some of the more devious characters:
| Code | Name | Description |
|---|---|---|
| U+200B | Zero-width space | An invisible space. |
| U+200E | Left-to-right mark | An invisible character used in texts with mixed scripts (e.g. Latin and Arabic) to indicate reading direction. |
| U+2028 | Line separator | A Unicode newline which is often not interpreted by text editors (and can be invisible). |
| U+2029 | Paragraph separator | Separates paragraphs, implies a new line (also often ignored). |
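Because these characters are invisible, they are easy to miss when reading the data, so a quick programmatic scan helps. A minimal Python sketch (the character set is illustrative, not exhaustive):

    import unicodedata

    # Characters from the table above; category Cf (format) additionally
    # covers most other invisible control-like characters.
    SUSPICIOUS = {"\u200B", "\u200E", "\u2028", "\u2029"}

    def report_suspicious(line):
        for i, ch in enumerate(line):
            if ch in SUSPICIOUS or unicodedata.category(ch) == "Cf":
                print("position %d: U+%04X %s"
                      % (i, ord(ch), unicodedata.name(ch, "UNKNOWN")))

    report_suspicious("hello\u200Bworld")
    # prints: position 5: U+200B ZERO WIDTH SPACE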
Decode Unicode is a useful webpage with information on Unicode characters.
Often, a file hexdump is the most useful diagnostic tool; the Linux command xxd, for example, provides the necessary functionality.
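When no hexdump tool is at hand, the raw bytes can also be inspected directly from Python. A tiny sketch (the separator argument to hex() requires Python 3.8 or newer):

    # The string contains a zero-width space that is invisible when printed.
    text = "a\u200Bb"
    print(text.encode("utf-8").hex(" "))   # 61 e2 80 8b 62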
Script/Characters
Unicode often provides several ways to write what looks like a single character. For example, the letter "a" might be written with Latin or Cyrillic script.
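For illustration, the two strings below may render identically yet compare as different, which silently fragments training data:

    import unicodedata

    latin = "a"          # U+0061
    cyrillic = "\u0430"  # U+0430
    print(latin == cyrillic)           # False
    print(unicodedata.name(latin))     # LATIN SMALL LETTER A
    print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER A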
Aside from seemingly identical but differently encoded characters, problems commonly seen in data include:
- Confusion of 0 (zero) and O (the capital letter)
- Inconsistent letter case: the English word "I" written in lowercase, etc. (notorious e.g. in movie subtitles)
- Various systematic misspellings -- all of these variants of "I'll" (I will) were observed in movie subtitles: i'll Ill l'll 1'll 1'11
- Different symbols for the same punctuation (quotes, dashes, apostrophes, etc.); a simple normalization sketch follows below
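A common remedy for the punctuation issue is a normalization pass over the data. The mapping below is a minimal, illustrative sketch; a real pipeline would use a much larger, language-specific table:

    # Map typographic punctuation variants to plain ASCII equivalents (illustrative).
    PUNCT_MAP = str.maketrans({
        "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
        "\u201C": '"', "\u201D": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en dash, em dash
    })

    def normalize_punct(line):
        return line.translate(PUNCT_MAP)

    print(normalize_punct("\u201CI\u2019ll go\u201D"))   # "I'll go"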
VIM tips
A good text editor is an essential prerequisite for successful inspection of text data and the implementation of suitable pre-processing. We provide a few assorted tips for the VIM editor:
Set encoding to UTF-8 (Vim's internal encoding and the encoding used when writing the current file, respectively):
:set encoding=utf-8
:set fileencoding=utf-8
Show the code of character under cursor:
ga
Set BOM (byte-order mark) for current file:
:set bomb
Tokenization
The most suitable tokenization can be task-dependent. For example, in parsing, we would like to keep adjectives such as "red-haired" as one word, while for phrase-based MT, it is useful to split such words.
A basic but quite robust approach is to split whenever the Unicode character category changes. However, in many situations, a more sophisticated, linguistically motivated tokenization scheme is useful. E.g. for words such as "don't", "couldn't", "shouldn't", we can obtain a nice generalization by splitting off "n't" (a tokenizer sketch follows the examples below):
- don't -> do n't
- shouldn't -> should n't
- couldn't -> could n't
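A minimal Python sketch of both ideas, splitting on changes of the major Unicode category and peeling off the "n't" clitic (simplified; a production tokenizer handles many more exceptions):

    import itertools
    import unicodedata

    def split_on_category(word):
        # Start a new token whenever the major Unicode category (L, N, P, ...)
        # changes, e.g. "U.S." -> ["U", ".", "S", "."]. Using the major class
        # keeps a case change inside a word from splitting it.
        return ["".join(g) for _, g in
                itertools.groupby(word, key=lambda c: unicodedata.category(c)[0])]

    def tokenize(text):
        tokens = []
        for word in text.split():
            if word.lower().endswith("n't"):
                # Linguistically motivated exception: split off the "n't" clitic.
                tokens.extend([word[:-3], word[-3:]])
            else:
                tokens.extend(split_on_category(word))
        return tokens

    print(tokenize("I shouldn't go."))   # ['I', 'should', "n't", 'go', '.']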
It is essential for data tokenization to be consistent. All of our training data should conform to the same pre-processing scheme and an identical pipeline should be applied at test time (when our system runs and we translate new data).