MT Talks - User contributions [en]

MT Talks

2019-11-06T11:01:10Z

Bojar: a paragraph on NMT

[[File:banner.png]]

MT Talks is a series of mini-lectures on machine translation.

Our goal is to hit just the right level of detail and technicality to make the talks interesting and attractive to people who are not yet familiar with the field but mix in new observations and insights so that even old pals will have a reason to watch us.

MT Talks and the expanded notes on this wiki will never be the ultimate resource for MT, but we would be very happy to serve as an ultimate commented ''directory'' of good pointers.

By the way, this is indeed a Wiki, so your contributions are very welcome! Please register and feel free to add comments, corrections or links to useful resources.

== Relation to Neural MT ==

MT Talks were created '''before''' neural MT (NMT) was seriously considered. Some of the talks have thus lost their relevance when describing pre-neural solutions and some problems (e.g. morphological richness) have become substantially less severe.

For an example of top-performing neural MT systems, see e.g. our [http://lindat.cz/services/translation/ Demo at Lindat].

== Our Talks ==

01 '''[[Intro]]''': Why is MT difficult, approaches to MT.

02 '''[[MT that Deceives]]''': Serious translation errors even for short and simple inputs.

03 '''[[Pre-processing]]''': Normalization and other technical tricks bound to help your MT system.

04 '''[[MT Evaluation in General]]''': Techniques of judging MT quality, dimensions of translation quality, number of possible translations.

05 '''[[Automatic MT Evaluation]]''': Two common automatic MT evaluation methods: PER and BLEU.

06 '''[[Data Acquisition]]''': The need for training data for MT and its possible sources, and the diminishing utility of additional data due to Zipf's law.

07 '''[[Sentence Alignment]]''': An introduction to the Gale & Church sentence alignment algorithm.

08 '''[[Word Alignment]]''': Cutting the chicken-egg problem.

09 '''[[Phrase-based Model]]''': Copy if you can.

10 '''[[Constituency Trees]]''': Divide and conquer.

11 '''[[Dependency Trees]]''': Trees with gaps.

12 '''[[Rich Vocabulary]]''': Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.

13 '''[[Scoring and Optimization]]''': Features your model features.

14 '''[[Deep Syntax]]''': Prague Family Jewels.



== Contributing ==

Due to spamming, we had to restrict permissions for editing the Wiki. If you're interested in contributing, please write an email to '''tamchyna -at- ufal.mff.cuni.cz''' to obtain a username.

== Other Videolectures on MT ==

* [http://www.upc.edu/learning/courses/mooc/2014-2015/approaches-to-machine/approaches-to-machine Approaches to Machine Translation: Rule-Based, Statistical, Hybrid] (an online course on MT by UPC Barcelona)
* [https://www.coursera.org/course/nlangp Natural Language Processing at Coursera] by Michael Collins, includes lectures on word-based and phrase-based models. [http://www.cs.columbia.edu/~mcollins/notes-spring2013.html Further notes]
* [https://www.youtube.com/playlist?list=PLVjXYOjST-AokmIxpCr4GexcdtpeOliBc TAUS Machine Translation and Moses Tutorial] (a series of commented slides, MT overview and practical aspects of the Moses Toolkit)

== Acknowledgement ==

The work on this project has been supported by the grant FP7-ICT-2011-7-288487 ([http://www.statmt.org/mosescore/ MosesCore]).

CodEx-Introduction

2019-05-23T11:34:39Z

Bojar: codex discontinued

When reading this page, you've probably already gone a long way in learning about machine translation. Nice work!

Our MT Talks are occasionally complemented with programming exercises. We invite you (and strongly recommend) to go beyond watching our videos and try solving some or all of these exercises. Pick one of the supported programming languages, write a short program and submit it to our system for evaluation by a set of fully automatic tests.

The exercises are implemented in The Code Examiner ('''CodEx''', https://codex3.ms.mff.cuni.cz/codex-trans/).

'''UNFORTUNATELY, CODEX HAS BEEN REPLACED WITH A NEWER VERSION''' but we don't have the capacity to redo our exercises in the new version. Please get in touch if you would be able to help us in porting the old exercises.

This page briefly describes how to use CodEx in general:

* How to get a CodEx account
* How to login to CodEx
* How to pick a task to solve
* How to submit a solution for evaluation

The individual exercises are described both in the CodEx system and on the corresponding MT talk page here.

== How to get a CodEx account ==

Before starting your journey through all the tasks, you need to get an account. There are two ways to obtain an account in CodEx:

=== For CUNI students ===

[[File:codex-registration.png|thumb|200px|'''Codex Registration''' CUNI students]]

Please access the SIS registration page: https://codex3.ms.mff.cuni.cz/codex-trans/?module=sisregistration. You will be asked to verify your account; click '''verify'''. If everything is fine, you can proceed to create your own account by following the instructions.

=== For non-CUNI students ===

Please send an email to [mailto:mttalks@ufal.mff.cuni.cz?Subject=Request%20for%20MT%20Talks%20CodEx%20account&body=Hello!%0D%0A%0D%0APlease%20create%20a%20CodEx%20account%20for%20me.%0D%0A%0D%0AMy%20name:%09%0D%0AInstitution:%09%0D%0A%0D%0A%20Thank%20you. mttalks@ufal.mff.cuni.cz] mentioning your name and institution to request an account. We will create the account for you and add it to the '''MT talks''' CodEx group right away.

== How to login and join a group ==

[[File:codex-welcome-page.png|thumb|200px|'''Codex Welcome''']]

Once you have your login alias/password, come back to the login page: https://codex3.ms.mff.cuni.cz/codex-trans. After logging in, you are directed to the welcome page which displays all documentation and news related to your account.

In the left-hand column, there is an internal link '''group'''. It leads to the list of all groups that you can join. When you join a group, you are expected to do all the exercises of that group.

[[File:codex-group.png|thumb|200px|'''List of groups''']]

For MT talks exercises, please join the group '''MT talks''' if you have not done it yet (shown in picture: list of groups).

== How to pick an exercise, solve it and submit your solution ==

After joining a group, you are able to see all the exercises assigned to that group.
In the left-hand sidebar, under '''group -> task''', there are three options: ''specification, new submit, submits''. They mean ''read the specification, submit a new solution'' and ''manage old submissions'', respectively.

[[File:codex-submit.png|thumb|200px|'''Submit a new solution''']]
[[File:codex-eval.png|thumb|200px|'''Manage your submissions''']]

For every exercise, please read the specification carefully. You are asked to write a complete program (not just a function). You can pick any of these programming languages: ''Pascal, C, C++, C#, Haskell, Python and Java''.

Your solution has to fit in one single file and process standard input to standard output.

To submit a solution, there are two ways:
* Upload from text area: You write your solution into the text box directly on the web page, select the extension according to your programming language, then submit.
* Upload from file: Simply write your solution into a file with an appropriate extension, upload and submit it.

In the evaluation process, your program is run several times with several inputs to validate its correctness. There are also built-in time and memory limits, which any sensible solution should easily meet. You will pass the exercise if your program passes a given number of these tests; we generally require passing all of them.

In the left-hand sidebar, under '''group''', there are links to the pages '''results''' and '''bonus points''', where you can keep track of your results throughout the course.

=== Example ===

Exercise '''Hello World!''': Your task is to write a program which reads names of people and says 'Hello' to each of them. Each input line should be turned into a greeting.

'''Input''': << standard input >> < sample.in

 John
 Marry
 Marry and Kate

'''Output''': << standard output >>

 Hello John!
 Hello Marry!
 Hello Marry and Kate!

''' Sample solution''': Read the input file line-by-line, trim the string, concatenate the line with "Hello " and "!" then print it.

'''Python'''
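 #!/usr/bin/env python
 import fileinput
 if __name__ == '__main__':
     # fileinput reads the files named on the command line, or standard input if none are given
     for line in fileinput.input():
         # the parenthesized form works in both Python 2 and Python 3
         print("Hello " + line.strip() + "!")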

To test it manually, run: ./helloworld.py sample.in

'''Java'''

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.io.IOException;
 public class CodEx{
     public static void main(String[] args) {
         BufferedReader br = null;
         try {
             // read standard input line by line
             br = new BufferedReader(new InputStreamReader(System.in));
             String line;
             while ((line = br.readLine()) != null) {
                 // trim surrounding whitespace, as the sample solution describes
                 System.out.println("Hello " + line.trim() + "!");
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

To test it manually, run: javac CodEx.java; java CodEx < sample.in

'''Note''': If you choose Java as your programming language, your program must not declare any package and the main class must be named "CodEx". For CodEx limitations for other languages, please read the CodEx manual.

MT Talks

2019-05-23T11:32:23Z

Bojar: codex no longer avaliable


Scoring and Optimization

2015-08-25T08:03:59Z

Bojar: /* Optimization of Feature Weights */ links to tuning tasks

{{Infobox
|title = Lecture 13: Scoring and Optimization
|image = [[File:features.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [https://www.youtube.com/watch?v=rDkZOINdPhw&index=11&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}

{{#ev:youtube|https://www.youtube.com/watch?v=rDkZOINdPhw&index=11&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}

== Features of MT Models ==

=== Phrase Translation Probabilities ===

Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:

* <math>P(\mathbf{e}|\mathbf{f})</math>
* <math>P(\mathbf{f}|\mathbf{e})</math>

These probabilities are estimated by simply counting how many times (for the first formula) we saw <math>\mathbf{e}</math> aligned to <math>\mathbf{f}</math> and how many times we saw <math>\mathbf{f}</math> in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that <math>P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9</math>.

 estimated in the programme ||| naznačena v programu
 estimated in the programme ||| naznačena v programu
 estimated in the programme ||| naznačena v programu
 estimated in the programme ||| odhadován v programu
 estimated in the programme ||| odhadovány v programu
 estimated in the programme ||| odhadovány v programu
 estimated in the programme ||| předpokládal program
 estimated in the programme ||| v programu uvedeným
 estimated in the programme ||| v programu uvedeným
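
The estimate above can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the extracted phrase pairs are available one per line with the two sides separated by <code>|||</code>, as in the excerpt, and the function name <code>estimate_translation_probs</code> is our own illustrative choice, not part of any toolkit.

```python
from collections import Counter

def estimate_translation_probs(lines):
    """Relative-frequency estimate of P(right side | left side) from 'left ||| right' lines."""
    pair_counts = Counter()
    left_counts = Counter()
    for line in lines:
        left, right = [side.strip() for side in line.split("|||")]
        pair_counts[(left, right)] += 1
        left_counts[left] += 1
    # Divide each pair count by the total count of its left-hand phrase.
    return {pair: count / left_counts[pair[0]] for pair, count in pair_counts.items()}

excerpt = """\
estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| naznačena v programu
estimated in the programme ||| odhadován v programu
estimated in the programme ||| odhadovány v programu
estimated in the programme ||| odhadovány v programu
estimated in the programme ||| předpokládal program
estimated in the programme ||| v programu uvedeným
estimated in the programme ||| v programu uvedeným
""".splitlines()

probs = estimate_translation_probs(excerpt)
# Three of the nine occurrences of the English phrase are paired with this translation.
print(probs[("estimated in the programme", "naznačena v programu")])
```

Swapping the two sides of each line before counting gives the estimate in the opposite direction.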

=== Lexical Weights ===

Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable
probability estimates; for instance many long phrases occur together only once
in the corpus, resulting in <math>P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})
= 1</math>. Several methods exist for computing lexical weights. The most common one
is based on word alignment inside the phrase. The
probability of each ''foreign'' word <math>f_j</math> is estimated as the average of
lexical translation probabilities <math>w(f_j, e_i)</math> over the English words aligned
to it. Thus for the phrase <math>(\mathbf{e},\mathbf{f})</math> with the set of alignment
points <math>a</math>, the lexical weight is:

<math>
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}
\frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)
</math>
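
The formula can be sketched in Python as follows. The function name, the 0-based indices and the toy values of <math>w</math> are our own assumptions for illustration. The sketch simply skips unaligned foreign words, whereas the standard formulation scores them against a special NULL token.

```python
def lexical_weight(f_words, e_words, alignment, w):
    """lex(f | e, a): product over foreign words of the average w(f_j, e_i)
    taken over the English words aligned to f_j.

    alignment: set of (i, j) pairs; i indexes e_words, j indexes f_words (0-based).
    w: dict mapping (foreign word, English word) to a lexical translation probability.
    """
    weight = 1.0
    for j, f in enumerate(f_words):
        aligned = [i for (i, jj) in alignment if jj == j]
        if not aligned:
            # Assumption of this sketch: unaligned foreign words are skipped.
            continue
        weight *= sum(w[(f, e_words[i])] for i in aligned) / len(aligned)
    return weight

# Hypothetical lexical probabilities, made up for the example.
w = {("v", "in"): 0.8, ("programu", "the"): 0.1, ("programu", "programme"): 0.5}
e_words = ["in", "the", "programme"]
f_words = ["v", "programu"]
alignment = {(0, 0), (1, 1), (2, 1)}

# "v" contributes 0.8; "programu" contributes the average of 0.1 and 0.5.
print(lexical_weight(f_words, e_words, alignment, w))
```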

=== Language Model ===

https://www.coursera.org/course/nlp

https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi

=== Word and Phrase Penalty ===

=== Distortion Penalty ===

== Decoding ==

=== Phrase-Based Search ===

=== Decoding in SCFG ===

== Optimization of Feature Weights ==

Note that there have even been shared tasks in model optimization: one, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].

Phrase-based Model

2015-04-07T21:24:14Z

Bojar: /* Phrase Scoring */ drobnost

{{Infobox
|title = Lecture 8: Phrase-based model
|image = [[File:computer-copies.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [https://www.youtube.com/watch?v=aA4jFayPNeQ Youtube]
}}

{{#ev:youtube|https://www.youtube.com/watch?v=aA4jFayPNeQ|800|center}}

Phrase-based machine translation (PBMT) is probably the most widely used approach to MT today. It is relatively simple and easy to adapt to new languages.

== Phrase Extraction ==

PBMT uses '''phrases''' as the basic unit of translation. Phrases are simply contiguous sequences of words which have been observed in the training data; they don't correspond to any linguistic notion of phrases.

In order to obtain a '''phrase table''' (a probabilistic dictionary of phrases), we need [[Word Alignment|word-aligned]] parallel data. Using the alignment links, a simple heuristic is applied to extract '''consistent''' phrase pairs. Consider the word-aligned example sentence:

[[File:phrase-extraction.png|400px]]

Phrase pairs are contiguous spans where all alignment points from the source side of the span fall within its target side and vice versa. These are examples of phrases consistent with this word alignment:

[[File:phrase-extraction-okay.png|400px]] [[File:phrase-extraction-okay2.png|400px]]

On the other hand, if either a source word or a target word is aligned outside of the current span, the phrase cannot be extracted. The conflicting alignment points are drawn in yellow:

[[File:phrase-extraction-short.png|400px]] [[File:phrase-extraction-long.png|400px]]

In practice, only phrases up to a certain length are extracted (e.g. 7 words). Longer phrases would hardly ever be used by the translation model (unless it was presented with a sentence from the training data) and the phrase table would be extremely large.
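
The consistency check can be sketched in Python. This is a simplified illustration, not the extraction code of any particular toolkit: the function name is ours, the alignment in the example is made up, and for each source span we only consider the tightest target span (full implementations also extend the target span over neighbouring unaligned words).

```python
def extract_phrases(src_len, alignment, max_len=7):
    """Return consistent phrase pairs as ((s_start, s_end), (t_start, t_end)), inclusive.

    alignment: set of (s, t) word-index pairs (0-based).
    """
    pairs = []
    for s_start in range(src_len):
        for s_end in range(s_start, min(s_start + max_len, src_len)):
            # Target positions linked to any word of the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # a phrase pair needs at least one alignment point
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency: no target word inside the span may align outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.append(((s_start, s_end), (t_start, t_end)))
    return pairs

# Hypothetical three-word sentence pair in which the first two words swap places.
alignment = {(0, 1), (1, 0), (2, 2)}
pairs = extract_phrases(3, alignment)
for source_span, target_span in pairs:
    print(source_span, target_span)
```

In this toy example the source span (1, 2) is rejected: its tightest target span (0, 2) contains target word 1, which is aligned to source word 0 outside the span.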

== Phrase Scoring ==

Once we have extracted all consistent phrase pairs from our training data, we can assign translation probabilities to them using maximum likelihood estimation. To estimate the probability of phrase <math>e</math> being the translation of phrase <math>f</math>, we simply count:

<math>
P(e|f) = \frac{\text{count}(e,f)}{\text{count}(f)}
</math>

The formula tells us to simply count how many times we saw <math>f</math> translated as <math>e</math> in our training data and divide that by the number of times we saw <math>f</math> in total.

In practice, several other scores are also computed (including the reverse phrase translation probability <math>P(f|e)</math>) but that's a topic for another lecture.

== Decoding ==

When we get an input sentence for translation, the first step is to look up '''translation options''' (possible translations) for each source span in the phrase table. These can be thought of as jigsaw puzzle pieces which are combined to get as good a final translation as possible. The task for the '''decoder''' (the translation program) is to find a combination which covers the whole input sentence and is the most probable according to the model (this procedure is usually called decoding).

Here we describe the stack-based '''beam search''' algorithm which is commonly used for phrase-based decoding, although other algorithms exist.

=== Overview ===

The search begins with an empty hypothesis: no part of the input is covered and nothing has been produced. We can start covering the input sentence by translating any span, and for each span, we can choose any of its possible translations. The decoder produces all of these partial hypotheses.

In the next step, we try to expand our partial translations further by covering the remaining parts of the input sentence. We continue expanding our partial hypotheses until we cover the whole source sentence. We choose the most probable translation.

In the following example, we try to translate the Czech sentence:

: ''Honza miluje Marii''

With the following phrase table:

 Honza ||| John
 Honza ||| Johny
 miluje ||| loves
 miluje ||| is fond of
 miluje ||| likes
 Marii ||| Mary

This is an excerpt of the space where the decoder searches for the best translation. One (the most reasonable) full translation is illustrated but the decoder theoretically needs to evaluate all possible combinations.

[[File:beam-search.png|350px]]

Note that we keep track of covered input spans in a Boolean array (1 for covered words, 0 for untranslated ones) -- we know that we have a full translation once the '''coverage vector''' has no more zeroes.
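
The expansion of hypotheses with a coverage vector can be sketched as follows. The representation of a hypothesis as a pair (coverage, output) and the function names are our own assumptions for illustration. The sketch enumerates everything exhaustively, which is only feasible for toy inputs like this one.

```python
PHRASE_TABLE = {
    ("Honza",): ["John", "Johny"],
    ("miluje",): ["loves", "is fond of", "likes"],
    ("Marii",): ["Mary"],
}

def expand(hypothesis, source):
    """Yield every hypothesis reachable by translating one more uncovered span.

    A hypothesis is a pair (coverage, output): coverage is a tuple of 0/1 flags,
    one per source word; output is the list of target phrases produced so far.
    """
    coverage, output = hypothesis
    n = len(source)
    for start in range(n):
        for end in range(start + 1, n + 1):
            if any(coverage[start:end]):
                continue  # the span overlaps words that are already translated
            for translation in PHRASE_TABLE.get(tuple(source[start:end]), []):
                new_coverage = coverage[:start] + (1,) * (end - start) + coverage[end:]
                yield new_coverage, output + [translation]

def all_translations(source):
    """Expand from the empty hypothesis until the coverage vector has no zeroes."""
    frontier = [((0,) * len(source), [])]
    finished = []
    while frontier:
        hypothesis = frontier.pop()
        if all(hypothesis[0]):
            finished.append(" ".join(hypothesis[1]))
        else:
            frontier.extend(expand(hypothesis, source))
    return finished

translations = all_translations("Honza miluje Marii".split())
print(len(translations))  # 6 orderings times 2 * 3 * 1 translation choices
```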

=== Stacks, Pruning ===

It is obvious that with realistic sentence lengths and translation dictionaries, the search space very quickly explodes and it becomes impossible to go through all the possible combinations of span translations and their orderings.

First, we need to impose a limit on reordering (how far we can jump in the input), otherwise the search would be intractable. In practice, the limit is set to roughly 6 words.

Furthermore, we need to prune our partial hypotheses as there are still too many combinations. Our model assigns a score to each of them (more on that in later lectures) so we can sort our translations according to it and only keep several most promising candidates.

However, it would be unfair to compare a score of a full translation with a partial hypothesis that e.g. only had to translate one phrase. The full hypothesis contains more decisions => more uncertainty => almost surely, the full translation has a lower score. For this reason, partial hypotheses are organized into '''stacks''' according to the number of covered input words. For each number, we have a separate stack. That way, only hypotheses which cover the same "amount" of input are compared.

The algorithm can then proceed from stack "0" (which contains the empty hypothesis) to the final stack where full hypotheses compete for the best score. Each stack has a limited size which keeps the search tractable.
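
A stack decoder along these lines can be sketched in a few dozen lines of Python. Everything numeric here is an assumption made for illustration: the phrase probabilities are invented, the score is just the sum of log-probabilities minus a simple distortion penalty, and there is no language model. The function name <code>decode</code> and its parameters are ours.

```python
import math

# Hypothetical phrase table with made-up probabilities (not taken from the lecture).
PHRASE_TABLE = {
    ("Honza",): [("John", 0.7), ("Johny", 0.3)],
    ("miluje",): [("loves", 0.6), ("likes", 0.3), ("is fond of", 0.1)],
    ("Marii",): [("Mary", 1.0)],
}

def decode(source, stack_size=10, distortion_limit=6, distortion_weight=0.1):
    n = len(source)
    # stacks[k] holds hypotheses that cover exactly k source words.
    # A hypothesis is (log_score, coverage, end_of_last_span, output).
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, (0,) * n, 0, []))
    for covered in range(n):
        # Pruning: keep only the best `stack_size` hypotheses of this stack.
        stacks[covered].sort(key=lambda h: h[0], reverse=True)
        del stacks[covered][stack_size:]
        for score, coverage, last_end, output in stacks[covered]:
            for start in range(n):
                jump = abs(start - last_end)
                if jump > distortion_limit:
                    continue  # reordering limit: do not jump too far in the input
                for end in range(start + 1, n + 1):
                    if any(coverage[start:end]):
                        break  # longer spans from this start would overlap as well
                    for target, prob in PHRASE_TABLE.get(tuple(source[start:end]), []):
                        new_cov = coverage[:start] + (1,) * (end - start) + coverage[end:]
                        new_score = score + math.log(prob) - distortion_weight * jump
                        stacks[covered + end - start].append(
                            (new_score, new_cov, end, output + [target]))
    if not stacks[n]:
        return None  # no full translation was found
    return " ".join(max(stacks[n], key=lambda h: h[0])[3])

source = "Honza miluje Marii".split()
print(decode(source))
print(decode(source, stack_size=3))
```

With these made-up numbers, the default stack size yields ''John loves Mary''. With <code>stack_size=3</code> the partial hypothesis ''John loves'' is pruned from the two-word stack and the decoder ends up with ''John Mary loves'', a small example of a search error caused by pruning.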

There is one more caveat -- not all input words (or spans -- phrases) are created equal. Some are difficult to translate and some only have one or a few possible translations. To account for this discrepancy, phrase-based decoders estimate the '''future cost''' of translating the remainder of the input sentence. This estimate is based on the score of the most probable translation of the remaining spans and can be efficiently pre-computed using dynamic programming. Without accounting for future cost, hypotheses which translate the easiest words first would dominate the stacks and the search would lead to heavily suboptimal solutions.
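
The dynamic programme for future cost can be sketched as follows. We assume, for illustration only, that the best translation score of each span is given as a log-probability in a dictionary; the function name and the example values are hypothetical. Real decoders typically also include a language model estimate in these scores.

```python
import math

def future_cost_table(source, best_logprob):
    """cost[i][j] is the best achievable log-score for translating source[i:j].

    best_logprob: dict mapping a tuple of source words to the log-probability
    of its most probable translation; spans missing from the dict have no
    direct translation.
    """
    n = len(source)
    cost = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][i] = 0.0  # an empty span costs nothing
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = best_logprob.get(tuple(source[i:j]), -math.inf)
            # Alternatively, cover the span by two shorter spans.
            for k in range(i + 1, j):
                best = max(best, cost[i][k] + cost[k][j])
            cost[i][j] = best
    return cost

source = "Honza miluje Marii".split()
# Made-up scores of the most probable translation of each single word.
best_logprob = {("Honza",): math.log(0.7), ("miluje",): math.log(0.6), ("Marii",): 0.0}
cost = future_cost_table(source, best_logprob)
print(cost[1][3])  # estimate for the still untranslated span "miluje Marii"
```

During the search, the decoder would add the table entries for the uncovered spans of a hypothesis to its score before comparing it with the other hypotheses in the same stack.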

== See Also ==

* [http://www.statmt.org/book/slides/05-phrase-based-models.pdf Philipp Koehn's slides on PBMT]
* [http://www.statmt.org/book/slides/06-decoding.pdf Decoding in PBMT]

Admin RootPage

2015-03-12T13:12:48Z

Bojar: youtube playlist

0x : How to get started with CodEx MT exercises

Our [https://www.youtube.com/playlist?list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V YouTube playlist] -- shows some total number of views, although different from individual video views.

MT Talks

2015-03-12T13:05:05Z

Bojar: /* Other Videolectures on MT */ TAUS lectures

Admin RootPage

2015-03-10T16:36:44Z

Bojar: published 07

0x : How to get started with CodEx MT exercises

MT Talks

2015-03-10T15:49:51Z

Bojar: publishing 07

Data Acquisition

2015-02-24T22:00:24Z

Bojar: updated link

{{Infobox
|title = Lecture 6: Data Acquisition
|image = [[File:bigbrother.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [https://www.youtube.com/watch?v=7obaii5xldQ&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V&index=6 Youtube]
}}

{{#ev:youtube|https://www.youtube.com/watch?v=7obaii5xldQ|800|center}}

There seems to be a universal rule for (not only) statistical methods in NLP: '''more data is better data.''' <ref name=effectiveness>A. Halevy, P. Norvig, F. Pereira. [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4804817&tag=1 ''The Unreasonable Effectiveness of Data'']</ref><ref name=friends>Jan Hajič, Eva Hajičová. [http://link.springer.com/chapter/10.1007%2F978-3-540-74628-7_2#page-1 ''Some of Our Best Friends Are Statisticians'']</ref>

Translation systems have at their disposal (orders of magnitude) more training data than a person reads in a lifetime.<ref name="inaug">Philipp Koehn. [https://www.youtube.com/watch?v=6UVgFjJeFGY Inaugural lecture.]</ref>

== Available Sources of Large Data ==

This is definitely not a list of all possible sources, just a few of the interesting ones.

=== Monolingual ===

Google released [http://googleresearch.blogspot.cz/2006/08/all-our-n-gram-are-belong-to-you.html n-grams of the whole web] and of [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Google Books].

[http://commoncrawl.org/ Common Crawl] is an initiative which builds an open repository of crawled web. [http://www.statmt.org/ngrams/ Moses n-grams] are similar to Google n-grams but have been computed on the Common Crawl. There is no pruning, so the data are much larger but can offer more detailed statistics.

=== Parallel ===

Politics has been the motivation for a lot of parallel corpora. Canadian Hansard is published both in French and English and is one of the best-known parallel corpora. [https://ec.europa.eu/jrc/en/language-technologies EU regulations] are published in all official European languages, providing an invaluable language resource (if nothing else).

[http://opus.lingfil.uu.se/ OPUS] is a repository which contains most of the publicly available parallel corpora. The data is available for download in several formats, cleaned and processed with a unified pipeline.

== Obtaining More Data ==

More data, or more data of a specific kind, can be obtained e.g. via '''crowdsourcing'''.

People also create large amounts of data every day and a good part of this is published via '''social media'''. It is therefore not surprising that some research in NLP focuses on leveraging these interesting new data sources.

== Zipf's Law ==

[http://en.wikipedia.org/wiki/Zipf%27s_law Zipf's law] states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.

Implications of this law can be observed everywhere in NLP. While just a few dozen most frequent words (''types'') will cover half of all ''tokens'' (word occurrences) in a natural language corpus, the tail (infrequent words) is extremely long.

Moreover, even if we collect many times more data than we have at the moment, we will not cover much more of the tail and many infrequent words will remain ''out-of-vocabulary'' (OOV) for our NLP system.
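
These effects are easy to measure on any tokenized corpus. The helpers below are a minimal sketch with names of our own choosing; they assume the corpus is already split into a list of tokens, and the tiny token lists are only placeholders for real data.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (word, count) pairs sorted from the most to the least frequent type."""
    return Counter(tokens).most_common()

def coverage_of_top_types(tokens, k):
    """Fraction of all tokens accounted for by the k most frequent types."""
    ranked = rank_frequency(tokens)
    return sum(count for _, count in ranked[:k]) / len(tokens)

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose type never occurs in the training tokens."""
    vocabulary = set(train_tokens)
    return sum(1 for token in test_tokens if token not in vocabulary) / len(test_tokens)

# Placeholder data; substitute the tokens of a real corpus.
train = "a a a a b b c d".split()
test = "a c e a f b".split()

print(rank_frequency(train))
print(coverage_of_top_types(train, 1))
print(oov_rate(train, test))
```

On real data, plotting the counts from <code>rank_frequency</code> against the rank on log-log axes gives the roughly straight line predicted by Zipf's law.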

== References ==

<references />

Admin RootPage

2015-02-24T16:21:08Z

Bojar: published 06

0x : How to get started with CodEx MT exercises

MT Talks

2015-02-24T16:19:21Z

Bojar: /* Our Talks */ publishing 06

MT Talks

2015-02-11T08:56:25Z

Bojar: /* CodEx -- Coding Exercises */ endash

Automatic MT Evaluation

2015-02-11T08:54:07Z

Bojar: /* Other Metrics */ typos

{{Infobox
|title = Lecture 5: Automatic MT Evaluation
|image = [[File:camel.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]
|label2 = Supplementary materials:
|data2 = [[File:bleu.pdf]]
|label3 = Exercises:
|data3 = [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=8&module=groups%2Ftasks&page=specification BLEU] [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=9&module=groups%2Ftasks&page=specification PER]
}}

{{#ev:youtube|https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}

== Reference Translations ==

The following picture<ref name="deprefset">Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. ''[https://ufal.mff.cuni.cz/~tamchyna/papers/2013-tsd.pdf Scratching the Surface of Possible Translations]''</ref> illustrates the issue of reference translations:

[[File:references.png|650px]]

Out of all possible sequences of words in the given language, only some are ''grammatically correct sentences'' (<math>G</math>). An overlapping set is formed by ''understandable translations'' (<math>T</math>) of the source sentence (note that these are not necessarily grammatical). Possible ''reference translations'' can then be viewed as a subset of <math>G \cap T</math>. Only some of these can be reached by the MT system. Typically, we only have several reference translations at our disposal; often we have just a single reference.

== PER ==

Position-independent error rate<ref name="per">C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf. ''[https://www-i6.informatik.rwth-aachen.de/publications/download/203/TillmannC.VogelS.NeyH.SawafH.ZubiagaA.--AcceleratedDP-basedSearchforStatisticalTranslation--1997.pdf Accelerated DP Based Search for Statistical Translation]''</ref> (PER) is a simple measure which counts the number of correct words in the MT output, regardless of their position. It is calculated using the following formula:

<math>\text{PER} = 1 - \frac{\text{correct} - \max(0, c - r)}{r}</math>

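Where <math>r</math> and <math>c</math> are the numbers of tokens in the reference translation and candidate translation, respectively, and <math>\text{correct}</math> is the number of candidate words that also occur in the reference.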
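
The formula translates directly into Python. In this sketch we assume whitespace tokenization and count a repeated word as correct only as many times as it occurs in the reference; both choices, as well as the function name, are our assumptions and are not prescribed by the formula itself.

```python
from collections import Counter

def per(candidate, reference):
    """Position-independent error rate of a candidate against a single reference."""
    c_tokens, r_tokens = candidate.split(), reference.split()
    c, r = len(c_tokens), len(r_tokens)
    # Multiset intersection: words matched regardless of their position.
    correct = sum((Counter(c_tokens) & Counter(r_tokens)).values())
    return 1 - (correct - max(0, c - r)) / r

print(per("the happiness of dreaming camels",
          "on the happiness of dreaming camels"))
```

In this example, all five candidate words are found in the six-word reference, so the error rate is <math>1 - 5/6</math>.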

== BLEU ==

BLEU<ref name="bleu">Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. ''[http://www.aclweb.org/anthology/P02-1040.pdf BLEU: a Method for Automatic Evaluation of Machine Translation]''</ref> (Bilingual evaluation understudy) remains the most popular metric for automatic evaluation of MT output quality.

While PER only looks at individual words, BLEU also considers sequences of words. Informally, we can describe BLEU as the amount of overlap of <math>n</math>-grams between the candidate translation and the reference (more specifically unigrams, bigrams, trigrams and 4-grams, in the standard formulation).

The formal definition is as follows:

<math>
\text{BLEU} = \text{BP} \cdot \exp \sum_{i=1}^{n}(\lambda_i \log p_i)
</math>

Where (almost always) <math>\lambda_i = 1/n</math> and <math>n = 4</math>. <math>p_i</math> stands for <math>i</math>-gram precision, i.e. the number of <math>i</math>-grams in the candidate translation which are confirmed by the reference, divided by the total number of <math>i</math>-grams in the candidate translation.

Each reference n-gram can be used to confirm the candidate n-gram only once (''clipping''), making it impossible to game BLEU by producing many occurrences of a single common word (such as ''"the"'').

BP stands for ''brevity penalty''. Since BLEU is a kind of precision, short outputs (which only contain words that the system is sure about) would score highly without BP. This penalty is defined simply as:

<math>
\text{BP} = \begin{cases} 1, & \mbox{if } c > r \\ \exp(1 - r/c), & \mbox{if } c \leq r. \end{cases}
</math>

Where <math>r</math> and <math>c</math> are again the numbers of tokens in the reference translation and candidate translation, respectively.
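
Putting the definitions together, a bare-bones BLEU can be sketched as follows. This is an illustration under simplifying assumptions (whitespace tokenization, one reference per sentence, uniform weights, no smoothing), not a replacement for the standard evaluation scripts; the function names and the example sentences are ours.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, references, max_n=4):
    """BLEU over a list of candidate sentences, one reference per sentence."""
    confirmed = [0] * max_n
    total = [0] * max_n
    c = r = 0
    for cand, ref in zip(candidates, references):
        cand_tokens, ref_tokens = cand.split(), ref.split()
        c += len(cand_tokens)
        r += len(ref_tokens)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: the intersection keeps the smaller count of each n-gram.
            confirmed[n - 1] += sum((cand_counts & ref_counts).values())
            total[n - 1] += sum(cand_counts.values())
    if c == 0 or min(confirmed) == 0:
        return 0.0  # some n-gram order has no match, so the geometric mean is zero
    log_mean = sum(math.log(m / t) for m, t in zip(confirmed, total)) / max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_mean)

# Made-up sentence pair: precisions are 5/6, 3/5, 2/4 and 1/3, and BP is 1.
print(bleu(["the cat sat on the mat"], ["the cat sat on a mat"]))
```

Note how the second ''the'' in the candidate is not confirmed: the reference contains the word only once, so clipping allows only one match.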

=== Example ===

Consider the following situation:

{|
!Source
|Vom Glück der träumenden Kamele
|colspan="4"|Confirmed
|-
!Reference
|On the happiness of dreaming camels
|1
|2
|3
|4
|-
!MT Output
|The happiness of dreaming camels
|5
|4
|3
|2
|}

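The number of confirmed MT n-grams is 5, 4, 3, 2 respectively for unigrams, bigrams etc. The MT output is one word shorter than the reference, therefore:

<math>\text{BP} = \exp(1 - 6/5) \doteq 0.82</math>

The MT output contains 5 unigrams, 4 bigrams, 3 trigrams and 2 4-grams in total. Precision is measured against these candidate counts (not against the 6, 5, 4 and 3 n-grams of the reference), and all n-grams of the MT output are confirmed (ignoring letter case). The geometric mean of precisions is therefore:

<math>\exp(\frac{1}{4} \log(\frac{5}{5}) + \frac{1}{4} \log (\frac{4}{4}) + \frac{1}{4} \log(\frac{3}{3}) + \frac{1}{4} \log(\frac{2}{2})) = 1</math>

Note that you can equivalently take the fourth root of the product of the precisions, i.e. <math>\sqrt[4]{\frac{5}{5} \cdot \frac{4}{4} \cdot \frac{3}{3} \cdot \frac{2}{2}}</math>

The final BLEU score is then <math>0.82 \cdot 1 \doteq 0.82</math>. The only penalty in this example comes from the missing word, via BP.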

BLEU is often multiplied by 100 for readability.

BLEU is a document-level metric. This means that counts of confirmed n-grams are collected for all sentences in the translated document and then the geometric mean of n-gram precisions is computed from the accumulated counts. For a single sentence, BLEU is often zero (since there is frequently no matching 4-gram or even trigram).

=== Multiple Reference Translations ===

BLEU supports multiple references. In that case, if an n-gram in the MT output is confirmed by ''any'' of the reference translations, it is counted as correct. If an n-gram occurs multiple times, it has to be seen in one of the references multiple times as well.

The original paper is not clear about BP in this case. The usual practice is to take the reference translation which is closest in length to the MT output and calculate BP from that. (Note that even this specification is not unambiguous since there can be two closest references to the given hypothesis, the longer and the shorter one.)
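The two rules for multiple references can be sketched in Python as follows. Breaking ties in favour of the shorter reference is an assumption of this sketch, and it is exactly the ambiguity mentioned above. Both function names are made up for the illustration.

```python
from collections import Counter


def closest_ref_length(c, ref_lengths):
    """Pick the reference length closest to the candidate length c.

    Assumption: on a tie, the shorter reference is chosen.
    """
    return min(ref_lengths, key=lambda r: (abs(r - c), r))


def clipped_matches(candidate, references, n):
    """Number of candidate n-grams confirmed by any of the references."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    # For each n-gram, remember the highest count seen in a single reference.
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngram_counts(ref).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    return sum(min(cnt, max_ref[gram])
               for gram, cnt in ngram_counts(candidate).items())
```

Taking the maximum over references (rather than the sum) means a repeated n-gram is confirmed twice only if some single reference contains it twice.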

== Other Metrics ==

* Results of the WMT14 Metrics Shared Task<ref name="wmtmetrics">Matouš Macháček and Ondřej Bojar. ''[http://www.statmt.org/wmt14/pdf/W14-3336.pdf Results of the WMT14 Metrics Shared Task]''</ref> (WMT metrics) -- an annual shared task in automatic evaluation of MT, see the [http://www.statmt.org/wmt15/metrics-task/ task web page].

* Translation Edit Rate<ref name="ter">Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, John Makhoul. ''[https://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf A Study of Translation Edit Rate with Targeted Human Annotation]''</ref> (TER) -- an edit-distance based metric on the level of phrases

* METEOR<ref name="meteor">Alon Lavie, Michael Denkowski. ''[http://www.cs.cmu.edu/afs/cs.cmu.edu/project/mteval-1/Papers/MT-Journal-2009/meteor-mtj-2009.pdf The METEOR Metric for Automatic Evaluation of Machine Translation]''</ref> -- a robust metric with support for paraphrasing

== Exercises ==

* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=8&module=groups%2Ftasks&page=specification Implement BLEU]
* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=9&module=groups%2Ftasks&page=specification Implement PER]

== References ==

<references />

MT Talks

2015-02-11T08:52:20Z

Bojar: /* CodEx */ title improved

Admin RootPage

2015-02-11T08:45:34Z

Bojar: released 05

0x : How to get started with CodEx MT exercises

MT Talks

2015-02-11T08:38:44Z

Bojar: /* Our Talks */ releasing 05

Automatic MT Evaluation

2015-02-11T08:26:18Z

Bojar: /* Other Metrics */ link to metrics task

{{Infobox
|title = Lecture 5: Automatic MT Evaluation
|image = [[File:camel.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]
|label2 = Supplementary materials:
|data2 = [[File:bleu.pdf]]
|label3 = Exercises:
|data3 = [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=8&module=groups%2Ftasks&page=specification BLEU] [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=9&module=groups%2Ftasks&page=specification PER]
}}

{{#ev:youtube|https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}

== Reference Translations ==

The following picture<ref name="deprefset">Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. ''[https://ufal.mff.cuni.cz/~tamchyna/papers/2013-tsd.pdf Scratching the Surface of Possible Translations]''</ref> illustrates the issue of reference translations:

[[File:references.png|650px]]

Out of all possible sequences of words in the given language, only some are ''grammatically correct sentences'' (<math>G</math>). An overlapping set is formed by ''understandable translations'' (<math>T</math>) of the source sentence (note that these are not necessarily grammatical). Possible ''reference translations'' can then be viewed as a subset of <math>G \cap T</math>. Only some of these can be reached by the MT system. Typically, we only have several reference translations at our disposal; often we have just a single reference.

== PER ==

Position-independent error rate<ref name="per">C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf. ''[https://www-i6.informatik.rwth-aachen.de/publications/download/203/TillmannC.VogelS.NeyH.SawafH.ZubiagaA.--AcceleratedDP-basedSearchforStatisticalTranslation--1997.pdf Accelerated DP Based Search for Statistical Translation]''</ref> (PER) is a simple measure which counts the number of correct words in the MT output, regardless of their position. It is calculated using the following formula:

<math>\text{PER} = 1 - \frac{\text{correct} - \max(0, c - r)}{r}</math>

Where <math>r</math> and <math>c</math> are the number of tokens in the reference translation and candidate translation, respectively.
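A minimal Python sketch of the formula above follows. It assumes that "correct" is the size of the bag-of-words overlap between candidate and reference, counting repeated words with their multiplicity. The function name is made up for this example.

```python
from collections import Counter


def per(candidate, reference):
    """Position-independent error rate for two tokenized sentences."""
    c, r = len(candidate), len(reference)
    # Multiset intersection: each reference word can be matched only once.
    overlap = Counter(candidate) & Counter(reference)
    correct = sum(overlap.values())
    return 1.0 - (correct - max(0, c - r)) / r
```

Since only word counts are compared, reordering the candidate does not change the result.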

== BLEU ==

BLEU<ref name="bleu">Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. ''[http://www.aclweb.org/anthology/P02-1040.pdf BLEU: a Method for Automatic Evaluation of Machine Translation]''</ref> (Bilingual evaluation understudy) remains the most popular metric for automatic evaluation of MT output quality.

While PER only looks at individual words, BLEU also considers sequences of words. Informally, we can describe BLEU as the amount of overlap of <math>n</math>-grams between the candidate translation and the reference (more specifically unigrams, bigrams, trigrams and 4-grams, in the standard formulation).

The formal definition is as follows:

<math>
\text{BLEU} = \text{BP} \cdot \exp \sum_{i=1}^{n}(\lambda_i \log p_i)
</math>

Where (almost always) <math>\lambda_i = 1/n</math> and <math>n = 4</math>. <math>p_i</math> stands for <math>i</math>-gram precision, i.e. the number of <math>i</math>-grams in the candidate translation which are confirmed by the reference, divided by the total number of <math>i</math>-grams in the candidate translation.

Each reference n-gram can be used to confirm the candidate n-gram only once (''clipping''), making it impossible to game BLEU by producing many occurrences of a single common word (such as ''"the"'').

BP stands for ''brevity penalty''. Since BLEU is a kind of precision, short outputs (which only contain words that the system is sure about) would score highly without BP. This penalty is defined simply as:

<math>
\text{BP} = \begin{cases} 1, & \mbox{if } c > r \\ \exp(1 - r/c), & \mbox{if } c \leq r. \end{cases}
</math>

Where <math>r</math> and <math>c</math> are again the number of tokens in the reference translation and candidate translation, respectively.

=== Example ===

Consider the following situation:

{|
!Source
|Vom Glück der träumenden Kamele
|colspan="4"|Confirmed
|-
!Reference
|On the happiness of dreaming camels
|1
|2
|3
|4
|-
!MT Output
|The happiness of dreaming camels
|5
|4
|3
|2
|}

The number of confirmed MT n-grams is 5, 4, 3, 2 respectively for unigrams, bigrams etc. The MT output is one word shorter than the reference, therefore:

<math>\text{BP} = \exp(1 - 6/5) \doteq 0.82</math>

The MT output contains 5 unigrams, 4 bigrams, 3 trigrams and 2 4-grams, and all of them are confirmed by the reference (ignoring letter case), so each precision equals 1. The geometric mean of precisions is:

<math>\exp(\frac{1}{4} \log(\frac{5}{5}) + \frac{1}{4} \log (\frac{4}{4}) + \frac{1}{4} \log(\frac{3}{3}) + \frac{1}{4} \log(\frac{2}{2})) = 1</math>

Note that you can equivalently take the fourth root of the product of the precisions, i.e. <math>\sqrt[4]{\frac{5}{5} \cdot \frac{4}{4} \cdot \frac{3}{3} \cdot \frac{2}{2}}</math>

The final BLEU score is then <math>0.82 \cdot 1 = 0.82</math>.

BLEU is often multiplied by 100 for readability.

BLEU is a document-level metric. This means that counts of confirmed n-grams are collected for all sentences in the translated document and then the geometric mean of n-gram precisions is computed from the accumulated counts. For a single sentence, BLEU is often zero (since there is frequently no matching 4-gram or even trigram).

=== Multiple Reference Translations ===

BLEU supports multiple references. In that case, if an n-gram in the MT output is confirmed by ''any'' of the reference translations, it is counted as correct. If an n-gram occurs multiple times, it has to be seen in one of the references multiple times as well.

The original paper is not clear about BP in this case. The usual practice is to take the reference translation which is closest in length to the MT output and calculate BP from that. (Note that even this specification is not unambiguous since there can be two closest references to the given hypothesis, the longer and the shorter one.)

== Other Metrics ==

* Results of the WMT14 Metrics Shared Task<ref name="wmtmetrics">Matouš Macháček and Ondřej Bojar. ''[http://www.statmt.org/wmt14/pdf/W14-3336.pdf Results of the WMT14 Metrics Shared Task]''</ref> (WMT metrics) -- an annual shared task in automatic evaluation of MT, see the [http://www.statmt.org/wmt14/metrics-task/ task web page].

* Translation Edit Rate<ref name="ter">Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, John Makhoul. ''[https://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf A Study of Translation Edit Rate with Targeted Human Annotation]''</ref> (TER) -- an edit-distance based metric on the level of phrases

* METEOR<ref name="meteor">Alon Lavie, Michael Denkowski. ''[http://www.cs.cmu.edu/afs/cs.cmu.edu/project/mteval-1/Papers/MT-Journal-2009/meteor-mtj-2009.pdf The METEOR Metric for Automatic Evaluation of Machine Translation]''</ref> -- a robust metric with support for paraphrasing

== Exercises ==

* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=8&module=groups%2Ftasks&page=specification Implement BLEU]
* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=9&module=groups%2Ftasks&page=specification Implement PER]

== References ==

<references />

Automatic MT Evaluation

2015-02-11T08:18:35Z

Bojar: /* Multiple Reference Translations */ closest ref is ambiguous

{{Infobox
|title = Lecture 5: Automatic MT Evaluation
|image = [[File:camel.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]
|label2 = Supplementary materials:
|data2 = [[File:bleu.pdf]]
|label3 = Exercises:
|data3 = [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=8&module=groups%2Ftasks&page=specification BLEU] [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=9&module=groups%2Ftasks&page=specification PER]
}}

{{#ev:youtube|https://www.youtube.com/watch?v=Bj_Hxi91GUM&index=5&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}

== Reference Translations ==

The following picture<ref name="deprefset">Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. ''[https://ufal.mff.cuni.cz/~tamchyna/papers/2013-tsd.pdf Scratching the Surface of Possible Translations]''</ref> illustrates the issue of reference translations:

[[File:references.png|650px]]

Out of all possible sequences of words in the given language, only some are ''grammatically correct sentences'' (<math>G</math>). An overlapping set is formed by ''understandable translations'' (<math>T</math>) of the source sentence (note that these are not necessarily grammatical). Possible ''reference translations'' can then be viewed as a subset of <math>G \cap T</math>. Only some of these can be reached by the MT system. Typically, we only have several reference translations at our disposal; often we have just a single reference.

== PER ==

Position-independent error rate<ref name="per">C. Tillmann, S. Vogel, H. Ney, A. Zubiaga, H. Sawaf. ''[https://www-i6.informatik.rwth-aachen.de/publications/download/203/TillmannC.VogelS.NeyH.SawafH.ZubiagaA.--AcceleratedDP-basedSearchforStatisticalTranslation--1997.pdf Accelerated DP Based Search for Statistical Translation]''</ref> (PER) is a simple measure which counts the number of correct words in the MT output, regardless of their position. It is calculated using the following formula:

<math>\text{PER} = 1 - \frac{\text{correct} - \max(0, c - r)}{r}</math>

Where <math>r</math> and <math>c</math> are the number of tokens in the reference translation and candidate translation, respectively.

== BLEU ==

BLEU<ref name="bleu">Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. ''[http://www.aclweb.org/anthology/P02-1040.pdf BLEU: a Method for Automatic Evaluation of Machine Translation]''</ref> (Bilingual evaluation understudy) remains the most popular metric for automatic evaluation of MT output quality.

While PER only looks at individual words, BLEU also considers sequences of words. Informally, we can describe BLEU as the amount of overlap of <math>n</math>-grams between the candidate translation and the reference (more specifically unigrams, bigrams, trigrams and 4-grams, in the standard formulation).

The formal definition is as follows:

<math>
\text{BLEU} = \text{BP} \cdot \exp \sum_{i=1}^{n}(\lambda_i \log p_i)
</math>

Where (almost always) <math>\lambda_i = 1/n</math> and <math>n = 4</math>. <math>p_i</math> stands for <math>i</math>-gram precision, i.e. the number of <math>i</math>-grams in the candidate translation which are confirmed by the reference, divided by the total number of <math>i</math>-grams in the candidate translation.

Each reference n-gram can be used to confirm the candidate n-gram only once (''clipping''), making it impossible to game BLEU by producing many occurrences of a single common word (such as ''"the"'').
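The effect of clipping is easiest to see on unigrams. The following Python sketch compares clipped and unclipped unigram precision. The function name and the <code>clip</code> switch are made up for this illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference, clip=True):
    """Unigram precision of a tokenized candidate, with or without clipping."""
    cand = Counter(candidate)
    ref = Counter(reference)
    if clip:
        # Each reference word can confirm a candidate word only once.
        confirmed = sum(min(cnt, ref[w]) for w, cnt in cand.items())
    else:
        # Without clipping, every occurrence of a known word counts.
        confirmed = sum(cnt for w, cnt in cand.items() if w in ref)
    return confirmed / len(candidate)
```

For a candidate consisting of six copies of ''"the"'' and a reference that contains ''"the"'' twice, the unclipped precision is 6/6 while the clipped one is only 2/6.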

BP stands for ''brevity penalty''. Since BLEU is a kind of precision, short outputs (which only contain words that the system is sure about) would score highly without BP. This penalty is defined simply as:

<math>
\text{BP} = \begin{cases} 1, & \mbox{if } c > r \\ \exp(1 - r/c), & \mbox{if } c \leq r. \end{cases}
</math>

Where <math>r</math> and <math>c</math> are again the number of tokens in the reference translation and candidate translation, respectively.

=== Example ===

Consider the following situation:

{|
!Source
|Vom Glück der träumenden Kamele
|colspan="4"|Confirmed
|-
!Reference
|On the happiness of dreaming camels
|1
|2
|3
|4
|-
!MT Output
|The happiness of dreaming camels
|5
|4
|3
|2
|}

The number of confirmed MT n-grams is 5, 4, 3, 2 respectively for unigrams, bigrams etc. The MT output is one word shorter than the reference, therefore:

<math>\text{BP} = \exp(1 - 6/5) \doteq 0.82</math>

The MT output contains 5 unigrams, 4 bigrams, 3 trigrams and 2 4-grams, and all of them are confirmed by the reference (ignoring letter case), so each precision equals 1. The geometric mean of precisions is:

<math>\exp(\frac{1}{4} \log(\frac{5}{5}) + \frac{1}{4} \log (\frac{4}{4}) + \frac{1}{4} \log(\frac{3}{3}) + \frac{1}{4} \log(\frac{2}{2})) = 1</math>

Note that you can equivalently take the fourth root of the product of the precisions, i.e. <math>\sqrt[4]{\frac{5}{5} \cdot \frac{4}{4} \cdot \frac{3}{3} \cdot \frac{2}{2}}</math>

The final BLEU score is then <math>0.82 \cdot 1 = 0.82</math>.

BLEU is often multiplied by 100 for readability.

BLEU is a document-level metric. This means that counts of confirmed n-grams are collected for all sentences in the translated document and then the geometric mean of n-gram precisions is computed from the accumulated counts. For a single sentence, BLEU is often zero (since there is frequently no matching 4-gram or even trigram).

=== Multiple Reference Translations ===

BLEU supports multiple references. In that case, if an n-gram in the MT output is confirmed by ''any'' of the reference translations, it is counted as correct. If an n-gram occurs multiple times, it has to be seen in one of the references multiple times as well.

The original paper is not clear about BP in this case. The usual practice is to take the reference translation which is closest in length to the MT output and calculate BP from that. (Note that even this specification is not unambiguous since there can be two closest references to the given hypothesis, the longer and the shorter one.)

== Other Metrics ==

* Translation Edit Rate<ref name="ter">Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, John Makhoul. ''[https://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf A Study of Translation Edit Rate with Targeted Human Annotation]''</ref> (TER) -- an edit-distance based metric on the level of phrases

* METEOR<ref name="meteor">Alon Lavie, Michael Denkowski. ''[http://www.cs.cmu.edu/afs/cs.cmu.edu/project/mteval-1/Papers/MT-Journal-2009/meteor-mtj-2009.pdf The METEOR Metric for Automatic Evaluation of Machine Translation]''</ref> -- a robust metric with support for paraphrasing

== Exercises ==

* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=8&module=groups%2Ftasks&page=specification Implement BLEU]
* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=9&module=groups%2Ftasks&page=specification Implement PER]

== References ==

<references />

MT Evaluation in General

2015-01-28T08:06:59Z

Bojar: new link to video

{{Infobox
|title = Lecture 4: General MT Evaluation
|image = [[File:worker.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [http://youtu.be/kSVb4-xI0Fw Youtube]
}}

{{#ev:youtube|kSVb4-xI0Fw|800|center}}

== Data Splits ==

Available training data is usually split into several parts, e.g. '''training''', '''development''' (held-out) and '''(dev-)test'''. Training data is used to estimate model parameters, the development set can be used for model selection, hyperparameter tuning etc., and the dev-test set is used for continuous evaluation of progress (are we doing better than before?).

However, you should always keep an additional '''(final) test set''' which is used only very rarely. Evaluating your system on the final test set can then be used as a rough estimate of its true performance because you do not use it in the development process at all, and therefore do not bias your system towards it.

The "golden rule" of (MT) evaluation: '''Evaluate on unseen data!'''
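One possible way to produce such splits is sketched below in Python. The split sizes, the fixed random seed and the function name are assumptions of this example. In practice, the test sets are often taken from a different source than the training data.

```python
import random


def split_corpus(pairs, dev=1000, devtest=1000, test=1000, seed=0):
    """Shuffle sentence pairs once and cut them into disjoint parts."""
    pairs = list(pairs)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(pairs)
    held_out = dev + devtest + test
    if held_out >= len(pairs):
        raise ValueError("not enough data left for training")
    return {
        "dev": pairs[:dev],
        "devtest": pairs[dev:dev + devtest],
        "test": pairs[dev + devtest:held_out],  # final test set, used rarely
        "train": pairs[held_out:],
    }
```

Because all parts are slices of one shuffled list, no sentence pair at a given position can end up in both the training data and a test set. Exact duplicates within the corpus would still need to be removed beforehand.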

== Approaches to Evaluation ==

Let us first introduce the example that we will use throughout the section:

=== Example Sentence + Translations ===

Original German sentence:

: ''Arbeiter stürzte von Leiter: schwer verletzt''

English reference translation:

: ''Worker falls from ladder: seriously injured''

{|
! Translation Candidate
! Notes
|-
| '''A''' ''Workers rushed from director: Seriously injured''
| plural (workers), bad choice of verb (rushed), ''Leiter'' mistranslated as ''director''
|-
| '''B''' ''Workers fell from ladder: hurt''
| plural (workers), intensifier missing
|-
| '''C''' ''Worker rushed from ladder: schwer verletzt''
| bad choice of verb (rushed), tail is left untranslated
|-
| '''D''' ''Worker fell from leader: heavily injures''
| ''Leiter'' translated as ''leader'' (not a typo, a bad lexical choice), poor morphological choices
|}

=== Absolute Ranking ===

We put each translation into a category that best describes its quality. The following categories can be used:

{|
| '''Worth publishing'''
| Translation is almost perfect, can be published as-is.
|-
| '''Worth editing'''
| Translation contains minor errors which can be quickly fixed by a human post-editor.
|-
| '''Worth reading'''
| Translation contains major errors but can be used for rough understanding of the text (''gisting'').
|}

If we define our categories like this, probably all example translations fall in the ''worth editing'' bin.

We can also separate our assessment of translation quality into different aspects (or dimensions). One division that has been used extensively for MT evaluation is:

* '''Adequacy''' -- how faithfully does the translation capture the meaning of the source sentence
* '''Fluency''' -- is the translation a grammatical, fluent sentence in the target language? (regardless of meaning)

In this case, e.g. candidate '''A''' could be marked as ''worth publishing'' in terms of fluency, while it is ''worth reading'' at best in terms of adequacy.

=== Relative Ranking ===

In this case, we avoid assigning translations into categories and instead ask the human judge(s) to rank the possible translations relative to one another. (Human [http://en.wikipedia.org/wiki/Inter-rater_reliability inter-annotator agreement] can be surprisingly low in both scenarios, though.)

In our example, we would probably order the systems (from best to worst): '''B > D > C > A'''

Different annotators could come up with different rankings. Ranking can also differ according to the intended '''purpose''' of the translations -- if a human translator is supposed to post-edit the translation, major errors in adequacy (such as spurious/missing negation) might be easy to fix and therefore such translations could be ranked higher than factually correct translations with lots of small errors.

== Dimensions of Translation Quality ==

Multidimensional Quality Metrics (MQM <ref name="mqm">Arle Richard Lommel, Aljoscha Burchardt, Hans Uszkoreit. ''[http://www.mt-archive.info/10/Aslib-2013-Lommel.pdf Multidimensional Quality Metrics: A Flexible System for Assessing Translation Quality]''</ref>) provides probably the greatest level of detail for various aspects (or dimensions) of translation quality:

[[File:mqm.png|800px]]

== Space of Possible Translations ==

An inherent issue with MT evaluation is the fact that there is usually more than one correct translation. In fact, several experiments<ref name="deprefset">Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. ''[https://ufal.mff.cuni.cz/~tamchyna/papers/2013-tsd.pdf Scratching the Surface of Possible Translations]''</ref><ref name="hyter">Markus Dreyer, Daniel Marcu. ''[http://www.aclweb.org/anthology/N12-1017 HyTER: Meaning-Equivalent Semantics for Translation Evaluation]''</ref> show that there can be as many as hundreds of thousands or even millions of correct translations for a single sentence.

Such a high number of possible translations is mainly caused by the flexibility of lexical choice and word order. (In our example, the German word "''Arbeiter''" can be translated into English as "''worker''" or "''employee''".) Every such decision multiplies the number of translations, which thus grows exponentially.
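The multiplication of choices can be illustrated with a small Python sketch. Only the ''worker''/''employee'' alternative comes from the text above. The other alternatives are hypothetical and serve only to show how the count grows.

```python
import itertools
import math

# Alternatives per position; everything except worker/employee is hypothetical.
slots = [
    ["worker", "employee"],
    ["falls", "fell"],
    ["from", "off"],
    ["ladder"],
]


def count_translations(slots):
    """Number of combinations when the choices are independent."""
    return math.prod(len(options) for options in slots)


# Enumerating is feasible only for tiny examples like this one.
variants = [" ".join(words) for words in itertools.product(*slots)]
```

Three binary choices already give 8 variants. Twenty independent binary choices would give <math>2^{20}</math>, i.e. over a million.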

Despite this fact, when we train or evaluate translation systems, we often rely on just a single reference translation.

== Translation Evaluation Campaigns ==

There are several academic workshops where the quality of various translation systems is compared. Such "competitions" require manual evaluation. Their methodology evolves to make the results as fair and statistically sound as possible. The most prominent ones include:

[http://www.statmt.org/wmt14/ Workshop on Statistical Machine Translation (WMT)]

[http://workshop2014.iwslt.org/ International Workshop on Spoken Language Translation (IWSLT)]

== References ==

<references/>

MT Talks

2015-01-27T17:35:37Z

Bojar: /* Our Talks */ rephrased

MT Talks

2015-01-27T17:34:04Z

Bojar: /* Our Talks */ revealing 04

Admin RootPage

2015-01-27T16:39:46Z

Bojar: renamed

0x : How to get started with CodEx MT exercises

04 '''[[MT Evaluation in General]]''': TODO popisek

MT Evaluation in General

2015-01-27T16:39:08Z

Bojar: Bojar moved page General MT Evaluation to MT Evaluation in General without leaving a redirect

{{Infobox
|title = Lecture 4: General MT Evaluation
|image = [[File:worker.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [http://www.youtube.com/watch?v=_QL-BUxIIhU Youtube]
}}

{{#ev:youtube|_QL-BUxIIhU|800|center}}

== Data Splits ==

Available training data is usually split into several parts, e.g. '''training''', '''development''' (held-out) and '''(dev-)test'''. Training data is used to estimate model parameters, development set can be used for model selection, hyperparameter tuning etc. and dev-test is used for continuous evaluation of progress (are we doing better than before?).

However, you should always keep an additional '''(final) test set''' which is used only very rarely. Evaluating your system on the final test set can then be used as a rough estimate of its true performance because you do not use it in the development process at all, and therefore do not bias your system towards it.

The "golden rule" of (MT) evaluation: '''Evaluate on unseen data!'''

== Approaches to Evaluation ==

Let us first introduce the example that we will use throughout the section:

=== Example Sentence + Translations ===

Original German sentence:

: ''Arbeiter stürzte von Leiter: schwer verletzt''

English reference translation:

: ''Worker falls from ladder: seriously injured''

{|
! Translation Candidate
! Notes
|-
| '''A''' ''Workers rushed from director: Seriously injured''
| plural (workers), bad choice of verb (rushed), ''Leiter'' mistranslated as ''director''
|-
| '''B''' ''Workers fell from ladder: hurt''
| plural (workers), intensifier missing
|-
| '''C''' ''Worker rushed from ladder: schwer verletzt''
| bad choice of verb (rushed), tail is left untranslated
|-
| '''D''' ''Worker fell from leader: heavily injures''
| ''Leiter'' translated as ''leader'' (not a typo, a bad lexical choice), poor morphological choices
|}

=== Absolute Ranking ===

We put each translation into a category that best describes its quality. The following categories can be used:

{|
| '''Worth publishing'''
| Translation is almost perfect, can be published as-is.
|-
| '''Worth editing'''
| Translation contains minor errors which can be quickly fixed by a human post-editor.
|-
| '''Worth reading'''
| Translation contains major errors but can be used for rough understanding of the text (''gisting'').
|}

If we define our categories like this, probably all example translations fall in the ''worth editing'' bin.

We can also separate our assessment of translation quality into different aspects (or dimensions). One division that has been used extensively for MT evaluation is:

* '''Adequacy''' -- how faithfully does the translation capture the meaning of the source sentence
* '''Fluency''' -- is the translation a grammatical, fluent sentence in the target language? (regardless of meaning)

In this case, e.g. candidate '''A''' could be marked as ''worth publishing'' in terms of fluency, while it is ''worth reading'' at best in terms of adequacy.

=== Relative Ranking ===

In this case, we avoid assigning translations into categories and instead ask the human judge(s) to rank the possible translations relative to one another. (Human [http://en.wikipedia.org/wiki/Inter-rater_reliability inter-annotator agreement] can be surprisingly low in both scenarios, though.)

In our example, we would probably order the systems (from best to worst): '''B > D > C > A'''

Different annotators could come up with different rankings. Ranking can also differ according to the intended '''purpose''' of the translations -- if a human translator is supposed to post-edit the translation, major errors in adequacy (such as spurious/missing negation) might be easy to fix and therefore such translations could be ranked higher than factually correct translations with lots of small errors.

== Dimensions of Translation Quality ==

Multidimensional Quality Metrics (MQM <ref name="mqm">Arle Richard Lommel, Aljoscha Burchardt, Hans Uszkoreit. ''[http://www.mt-archive.info/10/Aslib-2013-Lommel.pdf Multidimensional Quality Metrics: A Flexible System for Assessing Translation Quality]''</ref>) provides probably the greatest level of detail for various aspects (or dimensions) of translation quality:

[[File:mqm.png|800px]]

== Space of Possible Translations ==

An inherent issue with MT evaluation is the fact that there is usually more than one correct translation. In fact, several experiments<ref name="deprefset">Ondřej Bojar, Matouš Macháček, Aleš Tamchyna, Daniel Zeman. ''[https://ufal.mff.cuni.cz/~tamchyna/papers/2013-tsd.pdf Scratching the Surface of Possible Translations]''</ref><ref name="hyter">Markus Dreyer, Daniel Marcu. ''[http://www.aclweb.org/anthology/N12-1017 HyTER: Meaning-Equivalent Semantics for Translation Evaluation]''</ref> show that there can be as many as hundreds of thousands or even millions of correct translations for a single sentence.

Such a high number of possible translations is mainly caused by the flexibility of lexical choice and word order. (In our example, the German word "''Arbeiter''" can be translated into English as "''worker''" or "''employee''".) Every such decision multiplies the number of translations, which thus grows exponentially.

Despite this fact, when we train or evaluate translation systems, we often rely on just a single reference translation.

== Translation Evaluation Campaigns ==

There are several academic workshops where the quality of various translation systems is compared. Such "competitions" require manual evaluation. Their methodology evolves to make the results as fair and statistically sound as possible. The most prominent ones include:

[http://www.statmt.org/wmt14/ Workshop on Statistical Machine Translation (WMT)]

[http://workshop2014.iwslt.org/ International Workshop on Spoken Language Translation (IWSLT)]

== References ==

<references/>

Pre-processing

2015-01-20T09:51:00Z

Bojar: /* Script/Characters */ link to homoglyphs.net

{{Infobox
|title = Lecture 3: Pre-processing
|image = [[File:bear-with-us.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [http://www.youtube.com/watch?v=GDij7urWeOk Youtube]
|label2 = Exercises:
|data2 = [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=6&module=groups%2Ftasks&page=specification Lowercasing] [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=7&module=groups%2Ftasks&page=specification Deaccenting]
}}

{{#ev:youtube|GDij7urWeOk|800|center}}

Overall, the task of MT data pre-processing is to ''drop any distinctions that are not important for the output''.

== Inspecting Text Data ==

=== Text Encoding ===

Two texts that look the same might not be identical. MT systems do not see the strings as humans do but instead, they work with the actual byte representation. Therefore, data pre-processing is a very important step in system development.

[http://en.wikipedia.com/wiki/Unicode Unicode] includes a number of special characters which can complicate text processing for an MT system developer. The following table contains examples of some of the more devious characters:

{|
! Code
! Name
! Description
|-
|'''U+200B'''
|Zero-width space
|An invisible space.
|-
|'''U+200E'''
|Left-to-right mark
|An invisible character used in texts with mixed scripts (e.g. Latin and Arabic) to indicate reading direction.
|-
|'''U+2028'''
|Line separator
|A Unicode newline which is often not interpreted by text editors (and can be invisible).
|-
|'''U+2029'''
|Paragraph separator
|Separates paragraphs, implies a new line (also often ignored).
|}

[http://www.decodeunicode.org/ Decode Unicode] is a useful webpage with information on Unicode characters.

Often, a file [http://en.wikipedia.org/wiki/Hex_dump hexdump] is the most useful diagnostic tool. E.g. the Linux command '''xxd''' provides the necessary functionality.
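When a hexdump is too low-level, a character-by-character listing can help as well. The following Python sketch uses the standard <code>unicodedata</code> module. The function name <code>describe</code> is made up for this example.

```python
import unicodedata


def describe(text):
    """List code point, Unicode category and name for every character."""
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.category(ch),
         unicodedata.name(ch, "<unnamed>"))  # control characters have no name
        for ch in text
    ]


if __name__ == "__main__":
    # The second character is an invisible zero-width space.
    for row in describe("a\u200bb"):
        print(*row)
```

Invisible characters such as the zero-width space then show up as separate rows with their own code point and name.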

=== Script/Characters ===

Unicode often provides several ways to write what looks like a single character. For example, the letter "'''a'''" might be written with Latin or Cyrillic script. A nice summary of Latin-like alphabets is available at [http://homoglyphs.net/ homoglyphs.net]

Aside from seemingly identical, but differently encoded characters, problems commonly seen in data include:

* Confusion of '''0''' (zero) and '''O''' (capital letter)
* Inconsistent letter case: English word '''I''' written in lowercase etc. (notorious e.g. in movie subtitles)
* Various systematic mis-spellings -- all of these variants of "I'll" (I will) were observed in movie subtitles: '''i'll Ill l'll 1'll 1'11'''
* Different symbols for various punctuation (quotes, dashes, apostrophe etc.)

=== VIM tips ===

A good text editor is an essential pre-requisite for successful inspection of text data and the implementation of suitable pre-processing. We provide several random tips for the VIM editor:

Set file encoding to UTF-8:

''':set encoding=utf8'''

Show the code of character under cursor:

'''ga'''

Set or remove BOM (byte-order mark) for current file:

''':set bomb'''
''':set nobomb'''

=== Spot Five Differences ===

A text file with the Russian word ''"чай"'' (tea) can be written in several seemingly identical ways which nevertheless differ significantly at the byte level.

The first difference is at the very beginning of the file, which may or may not include the Unicode byte-order mark (BOM); in UTF-8, the BOM is 3 bytes long.

The second and third differences are the presence of two Unicode non-printing characters, namely the zero-width space and the left-to-right mark.

The fourth difference is the code for the letter ''"a"'', which can be written either in Latin or in Cyrillic script (the two look identical).

The fifth and final difference is the representation of the last letter ''"й"''. It can be written either as a single precomposed letter or as ''"и"'' followed by a combining breve (the diacritic).
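
The five differences can be made visible with a few lines of Python 3. This is only a sketch: the variable names <code>clean</code> and <code>tricky</code> and the helper <code>describe</code> are ours, and the position of the invisible characters inside the word is an arbitrary choice for illustration.

```python
import unicodedata

# Two renderings of the Russian word for "tea" that look alike on screen.
clean = "\u0447\u0430\u0439"  # Cyrillic che, Cyrillic a, precomposed short i
tricky = (
    "\ufeff"        # 1: byte-order mark at the very beginning
    "\u0447"        #    Cyrillic che
    "\u200b"        # 2: zero-width space
    "a"             # 4: Latin a instead of Cyrillic a
    "\u200e"        # 3: left-to-right mark
    "\u0438\u0306"  # 5: Cyrillic i followed by a combining breve
)

def describe(text):
    """Return one 'U+XXXX NAME' string per code point of the text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]

if __name__ == "__main__":
    for label, text in (("clean", clean), ("tricky", tricky)):
        size = len(text.encode("utf-8"))
        print("%s: %d code points, %d bytes in UTF-8" % (label, len(text), size))
        for item in describe(text):
            print("    " + item)
```

The same information can be read from a hexdump of the file; listing the code points is just more convenient when you already suspect Unicode trouble.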

== Tokenization ==

The most suitable tokenization can be task-dependent. For example, in parsing, we would like to keep adjectives such as "red-haired" as one word, while for phrase-based MT, it is useful to split such words.

A basic but quite robust approach is to split whenever the [http://www.regular-expressions.info/unicode.html Unicode character category] changes. Imagine reading the input character by character. When we observe that so far, there have been letters (category '''L''') and suddenly, there is punctuation (category '''P'''), we insert a space. During the same process, it is useful to convert all whitespace (tabulators, spaces, non-breaking spaces and sequences of such) to a single space character.
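
The character-by-character procedure described above could be sketched as follows in Python 3. The function names are ours, and the sketch compares only the first letter of the Unicode general category (L, N, P, S, ...), which is one possible reading of "category changes", not the only one.

```python
import unicodedata

def major_category(ch):
    """First letter of the Unicode general category, e.g. 'L', 'N' or 'P'."""
    return unicodedata.category(ch)[0]

def tokenize(text):
    """Split on whitespace and wherever the major character category changes."""
    tokens = []
    current = ""
    previous = None
    for ch in text:
        if ch.isspace():
            # Any run of whitespace (tabs, non-breaking spaces, ...) just
            # closes the current token; tokens are later joined by one space.
            if current:
                tokens.append(current)
            current = ""
            previous = None
            continue
        cat = major_category(ch)
        if current and cat != previous:
            # Category changed, e.g. letters followed by punctuation.
            tokens.append(current)
            current = ""
        current += ch
        previous = cat
    if current:
        tokens.append(current)
    return " ".join(tokens)

if __name__ == "__main__":
    print(tokenize("Hello,  world!"))
    print(tokenize("It costs\t42\u00a0USD."))
```

Note that this simple version also separates combining marks (category '''M''') from their base letters, so it is safer to normalize the text to precomposed characters before tokenizing.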

However, in many situations, a more sophisticated, linguistically motivated tokenization scheme is useful. E.g. for words such as "don't", "couldn't", "shouldn't", we can obtain a nice generalization by splitting off "n't":

: ''don't -> do n't''

: ''shouldn't -> should n't''

: ''couldn't -> could n't''
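
A minimal sketch of such a rule in Python 3, assuming a plain ASCII apostrophe (the names <code>NT_PATTERN</code> and <code>split_nt</code> are ours):

```python
import re

# A word that ends in "n't": capture the stem and the "n't" part separately.
NT_PATTERN = re.compile(r"\b(\w+)(n't)\b", re.IGNORECASE)

def split_nt(text):
    """Insert a space before every word-final "n't"."""
    return NT_PATTERN.sub(r"\1 \2", text)

if __name__ == "__main__":
    for word in ("don't", "shouldn't", "couldn't"):
        print(word, "->", split_nt(word))
```

The typographic apostrophe (U+2019) is not matched here, which is one more reason to normalize punctuation before tokenization.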

It is essential for data tokenization to be '''consistent'''. All of our training data should conform to the same pre-processing scheme and an identical pipeline should be applied at test time (when our system runs and we translate new data).

== Exercises ==

This is the first lecture accompanied by programming exercises. Before starting, you should follow the [[CodEx-Introduction|instructions]] on how to use the CodEx submission system.

Follow the links to see the description of each task and a submission interface with automatic evaluation of your solutions.

* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=6&module=groups%2Ftasks&page=specification Lowercasing]
* [https://codex3.ms.mff.cuni.cz/codex-trans/?groupId=3&taskId=7&module=groups%2Ftasks&page=specification Deaccenting]

CodEx-Introduction

2015-01-19T09:09:31Z

Bojar: /* How to get a CodEx account */ updated the mailto form

When reading this page, you've probably already gone a long way in learning about machine translation. Nice work!

Our MT Talks are occasionally complemented with programming exercises. We invite you (and strongly recommend) to go beyond watching our videos and try solving some or all of these exercises. Pick one of the supported programming languages, write a short program and submit it to our system for evaluation by a set of fully automatic tests.

The exercises are implemented in The Code Examiner ('''CodEx''', https://codex3.ms.mff.cuni.cz/codex-trans/). This page briefly describes how to use CodEx in general:

* How to get a CodEx account
* How to login to CodEx
* How to pick a task to solve
* How to submit a solution for evaluation

The individual exercises are described both in the CodEx system and on the corresponding MT talk page here.

== How to get a CodEx account ==

Before setting out on your journey through all the tasks, you need to get an account. There are two ways to obtain an account in CodEx:

=== For CUNI students ===

[[File:codex-registration.png|thumb|200px|'''Codex Registration''' CUNI students]]

Please access the SIS registration page: https://codex3.ms.mff.cuni.cz/codex-trans/?module=sisregistration. You will be asked to verify your account; click '''verify'''. If everything is fine, you can proceed to create your own account by following the instructions.

=== For non-CUNI students ===

Please send an email to [mailto:mttalks@ufal.mff.cuni.cz?Subject=Request%20for%20MT%20Talks%20CodEx%20account&body=Hello!%0D%0A%0D%0APlease%20create%20a%20CodEx%20account%20for%20me.%0D%0A%0D%0AMy%20name:%09%0D%0AInstitution:%09%0D%0A%0D%0A%20Thank%20you. mttalks@ufal.mff.cuni.cz] mentioning your name and institution to request an account. An administrator will create the account for you and add it to the '''MT talks''' CodEx group right away.

== How to login and join a group ==

[[File:codex-welcome-page.png|thumb|200px|'''Codex Welcome''']]

Once you have your login alias/password, come back to the login page: https://codex3.ms.mff.cuni.cz/codex-trans. After logging in, you are directed to the welcome page which displays all documentation and news related to your account.

In the left-hand column, there is an internal link '''group'''. It leads to the list of all groups that you can join. When you join a group, you are expected to do all the exercises of the group.

[[File:codex-group.png|thumb|200px|'''List of groups''']]

For MT talks exercises, please join the group '''MT talks''' if you have not done it yet (shown in picture: list of groups).

== How to pick an exercise, solve it and submit your solution ==

After joining a group, you are able to see all the exercises assigned to that group.
In the left-hand sidebar, under '''group -> task''', there are three options: ''specification, new submit, submits''. They mean ''read the specification, submit a new solution'' and ''manage old submissions'', respectively.

[[File:codex-submit.png|thumb|200px|'''Submit a new solution''']]
[[File:codex-eval.png|thumb|200px|'''Manage your submissions''']]

For every exercise, please read the specification carefully. You are asked to write a complete program (not just a function). You can pick any of these programming languages: ''Pascal, C, C++, C#, Haskell, Python and Java''.

Your solution has to fit in one single file and process standard input to standard output.

To submit a solution, there are two ways:
* Upload from text area: You write your solution into the text box directly on the web page, select the extension according to your programming language, then submit.
* Upload from file: Simply write your solution into a file with an appropriate extension, upload and submit it.

In the evaluation process, your program is run several times with several inputs to validate its correctness. There are also built-in time and memory limits, which any sensible solution should easily meet. You pass the exercise if your program passes a given number of these tests; we generally require passing all of them.

In the left-hand sidebar, under '''group''', there are links to the pages '''results''' and '''bonus points''' where you can keep track of your results throughout the course.

=== Example ===

Exercise '''Hello World!''': Your task is to write a program which reads names of people and says 'Hello' to each of them. Each input line should be turned into a greeting.

'''Input''': << standard input >> < sample.in

John
Marry
Marry and Kate

'''Output''': << standard output >>

Hello John!
Hello Marry!
Hello Marry and Kate!

'''Sample solution''': Read the input line by line, trim each line, concatenate it with "Hello " and "!", then print it.

'''Python'''
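 #!/usr/bin/env python
 import fileinput
 if __name__ == '__main__':
     # fileinput reads the files named on the command line, or standard input if none are given.
     for line in fileinput.input():
         # Strip the newline and surrounding whitespace, then print the greeting.
         print("Hello " + line.strip() + "!")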

To test it manually, run: ./helloworld.py sample.in

'''Java'''

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.io.IOException;
 public class CodEx {
     public static void main(String[] args) {
         BufferedReader br = null;
         try {
             // Read standard input line by line.
             br = new BufferedReader(new InputStreamReader(System.in));
             String line;
             while ((line = br.readLine()) != null) {
                 // Trim surrounding whitespace and print the greeting.
                 System.out.println("Hello " + line.trim() + "!");
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

To test it manually, run: javac CodEx.java; java CodEx < sample.in

'''Note''': If you choose Java as your programming language, your program must not declare any package and the main class must be named "CodEx". For CodEx limitations for other languages, please read the CodEx manual.

Pre-processing

2015-01-14T14:07:15Z

Bojar: 'MT' data preprocessing

{{Infobox
|title = Lecture 3: Pre-processing
|image = [[File:bear-with-us.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [http://www.youtube.com/watch?v=GDij7urWeOk Youtube]
|label2 = Exercises:
|data2 = [https://codex3.ms.mff.cuni.cz/codex-trans/?exerciseId=1&module=exercises&page=specification Lowercasing] [https://codex3.ms.mff.cuni.cz/codex-trans/?exerciseId=2&module=exercises&page=specification Deaccenting]
}}

{{#ev:youtube|GDij7urWeOk|800|center}}

Overall, the task of MT data pre-processing is to ''drop any distinctions that are not important for the output''.

== Inspecting Text Data ==

=== Text Encoding ===

Two texts that look the same might not be identical. MT systems do not see the strings as humans do but instead, they work with the actual byte representation. Therefore, data pre-processing is a very important step in system development.

[http://en.wikipedia.com/wiki/Unicode Unicode] includes a number of special characters which can complicate text processing for an MT system developer. The following table contains examples of some of the more devious characters:

{|
! Code
! Name
! Description
|-
|'''U+200B'''
|Zero-width space
|An invisible space.
|-
|'''U+200E'''
|Left-to-right mark
|An invisible character used in texts with mixed scripts (e.g. Latin and Arabic) to indicate reading direction.
|-
|'''U+2028'''
|Line separator
|A Unicode newline which is often not interpreted by text editors (and can be invisible).
|-
|'''U+2029'''
|Paragraph separator
|Separates paragraphs, implies a new line (also often ignored).
|}

[http://www.decodeunicode.org/ Decode Unicode] is a useful webpage with information on Unicode characters.

Often, a file [http://en.wikipedia.org/wiki/Hex_dump hexdump] is the most useful diagnostic tool. E.g. the Linux command '''xxd''' provides the necessary functionality.

=== Script/Characters ===

Unicode often provides several ways to write what looks like a single character. For example, the letter "'''a'''" might be written in Latin or Cyrillic script.

Aside from seemingly identical, but differently encoded characters, problems commonly seen in data include:

* Confusion of '''0''' (zero) and '''O''' (capital letter)
* Inconsistent letter case: English word '''I''' written in lowercase etc. (notorious e.g. in movie subtitles)
* Various systematic mis-spellings -- all of these variants of "I'll" (I will) were observed in movie subtitles: '''i'll Ill l'll 1'll 1'11'''
* Different symbols for various punctuation (quotes, dashes, apostrophe etc.)

=== VIM tips ===

A good text editor is an essential pre-requisite for successful inspection of text data and the implementation of suitable pre-processing. We provide several random tips for the VIM editor:

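Set the encoding used inside VIM to UTF-8 (the encoding in which the current file is written is controlled separately by the option '''fileencoding'''):

''':set encoding=utf8'''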

Show the code of the character under the cursor:

'''ga'''

Set or remove the BOM (byte-order mark) for the current file:

''':set bomb'''
''':set nobomb'''

=== Spot Five Differences ===

A text file with the Russian word ''"чай"'' (tea) can be written in several seemingly identical ways which nevertheless differ significantly at the byte level.

The first difference is at the very beginning of the file, which may or may not include the Unicode byte-order mark (BOM); in UTF-8, the BOM is 3 bytes long.

The second and third differences are the presence of two Unicode non-printing characters, namely the zero-width space and the left-to-right mark.

The fourth difference is the code for the letter ''"a"'', which can be written either in Latin or in Cyrillic script (the two look identical).

The fifth and final difference is the representation of the last letter ''"й"''. It can be written either as a single precomposed letter or as ''"и"'' followed by a combining breve (the diacritic).

== Tokenization ==

The most suitable tokenization can be task-dependent. For example, in parsing, we would like to keep adjectives such as "red-haired" as one word, while for phrase-based MT, it is useful to split such words.

A basic but quite robust approach is to split whenever the [http://www.regular-expressions.info/unicode.html Unicode character category] changes. Imagine reading the input character by character. When we observe that so far, there have been letters (category '''L''') and suddenly, there is punctuation (category '''P'''), we insert a space. During the same process, it is useful to convert all whitespace (tabulators, spaces, non-breaking spaces and sequences of such) to a single space character.

However, in many situations, a more sophisticated, linguistically motivated tokenization scheme is useful. E.g. for words such as "don't", "couldn't", "shouldn't", we can obtain a nice generalization by splitting off "n't":

: ''don't -> do n't''

: ''shouldn't -> should n't''

: ''couldn't -> could n't''

It is essential for data tokenization to be '''consistent'''. All of our training data should conform to the same pre-processing scheme and an identical pipeline should be applied at test time (when our system runs and we translate new data).

Admin RootPage

2015-01-13T09:57:49Z

Bojar: links to codex

0x : How to get started with CodEx MT exercises

03 '''[[Pre-processing]]''': Normalization and other technical tricks bound to help your MT system.

== CodEx ==

* [https://codex3.ms.mff.cuni.cz/codex-trans/ Log in to CodEx] and solve programming exercises that complement our talks.
* [[CodEx-Introduction|Brief description of CodEx]]: how to get an account and submit a solution.

CodEx-Introduction

2015-01-13T09:50:49Z

Bojar: /* How to pick an exercise, solve it and submit your solution */ various small changes

When reading this page, you've probably already gone a long way in learning about machine translation. Nice work!

Our MT Talks are occasionally complemented with programming exercises. We invite you (and strongly recommend) to go beyond watching our videos and try solving some or all of these exercises. Pick one of the supported programming languages, write a short program and submit it to our system for evaluation by a set of fully automatic tests.

The exercises are implemented in The Code Examiner ('''CodEx''', https://codex3.ms.mff.cuni.cz/codex-trans/). This page briefly describes how to use CodEx in general:

* How to get a CodEx account
* How to login to CodEx
* How to pick a task to solve
* How to submit a solution for evaluation

The individual exercises are described both in the CodEx system and on the corresponding MT talk page here.

== How to get a CodEx account ==

Before setting out on your journey through all the tasks, you need to get an account. There are two ways to obtain an account in CodEx:

=== For CUNI students ===

[[File:codex-registration.png|thumb|200px|'''Codex Registration''' CUNI students]]

Please access the SIS registration page: https://codex3.ms.mff.cuni.cz/codex-trans/?module=sisregistration. You will be asked to verify your account; click '''verify'''. If everything is fine, you can proceed to create your own account by following the instructions.

=== For non-CUNI students ===

Please send an email to [mailto:mttalks@ufal.mff.cuni.cz?Subject=Request%20for%20CodEx%20account!%20MT%20talks&body=Hello!%0D%0A%0D%0APlease%20create%20a%20CodEx%20account%20for%20me.%20Thank%20you. Admin] to request an account. The admin will create the account for you. Your account is added to the '''MT talks''' group by default.

== How to login and join a group ==

[[File:codex-welcome-page.png|thumb|200px|'''Codex Welcome''']]

Once you have your login alias/password, come back to the login page: https://codex3.ms.mff.cuni.cz/codex-trans. After logging in, you are directed to the welcome page which displays all documentation and news related to your account.

In the left-hand column, there is an internal link '''group'''. It leads to the list of all groups that you can join. When you join a group, you are expected to do all the exercises of the group.

[[File:codex-group.png|thumb|200px|'''List of groups''']]

For MT talks exercises, please join the group '''MT talks''' if you have not done it yet (shown in picture: list of groups).

== How to pick an exercise, solve it and submit your solution ==

After joining a group, you are able to see all the exercises assigned to that group.
In the left-hand sidebar, under '''group -> task''', there are three options: ''specification, new submit, submits''. They mean ''read the specification, submit a new solution'' and ''manage old submissions'', respectively.

[[File:codex-submit.png|thumb|200px|'''Submit a new solution''']]
[[File:codex-eval.png|thumb|200px|'''Manage your submissions''']]

For every exercise, please read the specification carefully. You are asked to write a complete program (not just a function). You can pick any of these programming languages: ''Pascal, C, C++, C#, Haskell, Python and Java''.

Your solution has to fit in one single file and process standard input to standard output.

To submit a solution, there are two ways:
* Upload from text area: You write your solution into the text box directly on the web page, select the extension according to your programming language, then submit.
* Upload from file: Simply write your solution into a file with an appropriate extension, upload and submit it.

In the evaluation process, your program is run several times with several inputs to validate its correctness. There are also built-in time and memory limits, which any sensible solution should easily meet. You pass the exercise if your program passes a given number of these tests; we generally require passing all of them.

In the left-hand sidebar, under '''group''', there are links to the pages '''results''' and '''bonus points''' where you can keep track of your results throughout the course.

=== Example ===

Exercise '''Hello World!''': Your task is to write a program which reads names of people and says 'Hello' to each of them. Each input line should be turned into a greeting.

'''Input''': << standard input >> < sample.in

John
Marry
Marry and Kate

'''Output''': << standard output >>

Hello John!
Hello Marry!
Hello Marry and Kate!

'''Sample solution''': Read the input line by line, trim each line, concatenate it with "Hello " and "!", then print it.

'''Python'''
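 #!/usr/bin/env python
 import fileinput
 if __name__ == '__main__':
     # fileinput reads the files named on the command line, or standard input if none are given.
     for line in fileinput.input():
         # Strip the newline and surrounding whitespace, then print the greeting.
         print("Hello " + line.strip() + "!")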

To test it manually, run: ./helloworld.py sample.in

'''Java'''

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.io.IOException;
 public class CodEx {
     public static void main(String[] args) {
         BufferedReader br = null;
         try {
             // Read standard input line by line.
             br = new BufferedReader(new InputStreamReader(System.in));
             String line;
             while ((line = br.readLine()) != null) {
                 // Trim surrounding whitespace and print the greeting.
                 System.out.println("Hello " + line.trim() + "!");
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

To test it manually, run: javac CodEx.java; java CodEx < sample.in

'''Note''': If you choose Java as your programming language, your program must not declare any package and the main class must be named "CodEx". For CodEx limitations for other languages, please read the CodEx manual.

CodEx-Introduction

2015-01-13T09:42:52Z

Bojar: /* How to login and join a group */ polishing English

When reading this page, you've probably already gone a long way in learning about machine translation. Nice work!

Our MT Talks are occasionally complemented with programming exercises. We invite you (and strongly recommend) to go beyond watching our videos and try solving some or all of these exercises. Pick one of the supported programming languages, write a short program and submit it to our system for evaluation by a set of fully automatic tests.

The exercises are implemented in The Code Examiner ('''CodEx''', https://codex3.ms.mff.cuni.cz/codex-trans/). This page briefly describes how to use CodEx in general:

* How to get a CodEx account
* How to login to CodEx
* How to pick a task to solve
* How to submit a solution for evaluation

The individual exercises are described both in the CodEx system and on the corresponding MT talk page here.

== How to get a CodEx account ==

Before setting out on your journey through all the tasks, you need to get an account. There are two ways to obtain an account in CodEx:

=== For CUNI students ===

[[File:codex-registration.png|thumb|200px|'''Codex Registration''' CUNI students]]

Please access the SIS registration page: https://codex3.ms.mff.cuni.cz/codex-trans/?module=sisregistration. You will be asked to verify your account; click '''verify'''. If everything is fine, you can proceed to create your own account by following the instructions.

=== For non-CUNI students ===

Please send an email to [mailto:mttalks@ufal.mff.cuni.cz?Subject=Request%20for%20CodEx%20account!%20MT%20talks&body=Hello!%0D%0A%0D%0APlease%20create%20a%20CodEx%20account%20for%20me.%20Thank%20you. Admin] to request an account. The admin will create the account for you. Your account is added to the '''MT talks''' group by default.

== How to login and join a group ==

[[File:codex-welcome-page.png|thumb|200px|'''Codex Welcome''']]

Once you have your login alias/password, come back to the login page: https://codex3.ms.mff.cuni.cz/codex-trans. After logging in, you are directed to the welcome page which displays all documentation and news related to your account.

In the left-hand column, there is an internal link '''group'''. It leads to the list of all groups that you can join. When you join a group, you are expected to do all the exercises of the group.

[[File:codex-group.png|thumb|200px|'''List of groups''']]

For MT talks exercises, please join the group '''MT talks''' if you have not done it yet (shown in picture: list of groups).

== How to pick an exercise, solve it and submit your solution ==

After joining a group, you are able to see all the exercises assigned to that group.
In the left-hand sidebar, under '''group -> task''', you will see three options: ''specification, new submit, submits''. They mean ''read the specification, submit a new solution'' and ''manage old submissions'', respectively.

[[File:codex-submit.png|thumb|200px|'''Submit a new solution''']]
[[File:codex-eval.png|thumb|200px|'''Manage your submissions''']]

For every exercise, please read the specification carefully. You are asked to write a complete program (not just a function). The supported programming languages are: ''Pascal, C, C++, C#, Haskell, Python and Java''.

Your solution has to fit in a single file and use standard input/output.

To submit a solution, there are two ways:
* Upload from text area: You write your solution into the text box, select the extension according to your programming language, then submit.
* Upload from file: Simply write your solution into a file with an appropriate extension and submit it.

In the evaluation process, your program is run several times with several inputs to validate its correctness. You pass if your program succeeds in at least a ''threshold'' number of these runs.

In the left-hand sidebar, under '''group''', there are links to the pages '''results''' and '''bonus points''' where you can keep track of your results throughout the course.

=== Example ===

Exercise '''Hello World!''': Your task is to write a program which reads the name of a person and says 'Hello' to them.

'''Input''': << standard input >> < sample.in

John
Marry
Marry and Kate

'''Output''': << standard output >>

Hello John!
Hello Marry!
Hello Marry and Kate!

'''Sample solution''': Read the input line by line, trim each line, concatenate it with "Hello " and "!", then print it.

'''Python'''
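 #!/usr/bin/env python
 import fileinput
 if __name__ == '__main__':
     # fileinput reads the files named on the command line, or standard input if none are given.
     for line in fileinput.input():
         # Strip the newline and surrounding whitespace, then print the greeting.
         print("Hello " + line.strip() + "!")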

Run: ./helloworld.py sample.in

'''Java'''

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.io.IOException;
 public class CodEx {
     public static void main(String[] args) {
         BufferedReader br = null;
         try {
             // Read standard input line by line.
             br = new BufferedReader(new InputStreamReader(System.in));
             String line;
             while ((line = br.readLine()) != null) {
                 // Trim surrounding whitespace and print the greeting.
                 System.out.println("Hello " + line.trim() + "!");
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

Run: javac CodEx.java; java CodEx < sample.in

'''Notes''': If you choose Java as your programming language, your program must not declare any package and the main class must be named "CodEx". For other languages, please read the CodEx manual.

CodEx-Introduction

2015-01-13T09:40:21Z

Bojar: typo

When reading this page, you've probably already gone a long way in learning about machine translation. Nice work!

Our MT Talks are occasionally complemented with programming exercises. We invite you (and strongly recommend) to go beyond watching our videos and try solving some or all of these exercises. Pick one of the supported programming languages, write a short program and submit it to our system for evaluation by a set of fully automatic tests.

The exercises are implemented in The Code Examiner ('''CodEx''', https://codex3.ms.mff.cuni.cz/codex-trans/). This page briefly describes how to use CodEx in general:

* How to get a CodEx account
* How to login to CodEx
* How to pick a task to solve
* How to submit a solution for evaluation

The individual exercises are described both in the CodEx system and on the corresponding MT talk page here.

== How to get a CodEx account ==

Before setting out on your journey through all the tasks, you need to get an account. There are two ways to obtain an account in CodEx:

=== For CUNI students ===

[[File:codex-registration.png|thumb|200px|'''Codex Registration''' CUNI students]]

Please access the SIS registration page: https://codex3.ms.mff.cuni.cz/codex-trans/?module=sisregistration. You will be asked to verify your account; click '''verify'''. If everything is fine, you can proceed to create your own account by following the instructions.

=== For non-CUNI students ===

Please send an email to [mailto:mttalks@ufal.mff.cuni.cz?Subject=Request%20for%20CodEx%20account!%20MT%20talks&body=Hello!%0D%0A%0D%0APlease%20create%20a%20CodEx%20account%20for%20me.%20Thank%20you. Admin] to request an account. The admin will create the account for you. Your account is added to the '''MT talks''' group by default.

== How to login and join a group ==

[[File:codex-welcome-page.png|thumb|200px|'''Codex Welcome''']]

Once you have your login alias/password, come back to the login page: https://codex3.ms.mff.cuni.cz/codex-trans. After logging in, you are directed to the welcome page which displays all documentation and news related to your account.

In the left-hand column, there is an internal link '''group'''. It leads to the list of all groups that you can join. When you join a group, you are expected to do all the exercises which are assigned to that group.

[[File:codex-group.png|thumb|200px|'''List of groups''']]

For MT talks exercises, please join the group '''MT talks''' if you have not done so yet (shown in the picture: list of groups).

== How to pick an exercise, solve it and submit your solution ==

After joining a group, you are able to see all the exercises assigned to that group.
In the left-hand sidebar, under '''group -> task''', you will see three options: ''specification, new submit, submits''. They mean ''read the specification, submit a new solution'' and ''manage old submissions'', respectively.

[[File:codex-submit.png|thumb|200px|'''Submit a new solution''']]
[[File:codex-eval.png|thumb|200px|'''Manage your submissions''']]

For every exercise, please read the specification carefully. You are asked to write a complete program (not just a function). The supported programming languages are: ''Pascal, C, C++, C#, Haskell, Python and Java''.

Your solution has to fit in a single file and use standard input/output.

To submit a solution, there are two ways:
* Upload from text area: You write your solution into the text box, select the extension according to your programming language, then submit.
* Upload from file: Simply write your solution into a file with an appropriate extension and submit it.

In the evaluation process, your program is run several times with several inputs to validate its correctness. You pass if your program succeeds in at least a ''threshold'' number of these runs.

In the left-hand sidebar, under '''group''', there are links to the pages '''results''' and '''bonus points''' where you can keep track of your results throughout the course.

=== Example ===

Exercise '''Hello World!''': Your task is to write a program which reads the name of a person and says 'Hello' to them.

'''Input''': << standard input >> < sample.in

John
Marry
Marry and Kate

'''Output''': << standard output >>

Hello John!
Hello Marry!
Hello Marry and Kate!

'''Sample solution''': Read the input line by line, trim each line, concatenate it with "Hello " and "!", then print it.

'''Python'''
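 #!/usr/bin/env python
 import fileinput
 if __name__ == '__main__':
     # fileinput reads the files named on the command line, or standard input if none are given.
     for line in fileinput.input():
         # Strip the newline and surrounding whitespace, then print the greeting.
         print("Hello " + line.strip() + "!")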

Run: ./helloworld.py sample.in

'''Java'''

 import java.io.BufferedReader;
 import java.io.InputStreamReader;
 import java.io.IOException;
 public class CodEx {
     public static void main(String[] args) {
         BufferedReader br = null;
         try {
             // Read standard input line by line.
             br = new BufferedReader(new InputStreamReader(System.in));
             String line;
             while ((line = br.readLine()) != null) {
                 // Trim surrounding whitespace and print the greeting.
                 System.out.println("Hello " + line.trim() + "!");
             }
         } catch (IOException e) {
             e.printStackTrace();
         }
     }
 }

Run: javac CodEx.java; java CodEx < sample.in

'''Notes''': If you choose Java as your programming language, your program must not declare any package and the main class must be named "CodEx". For other languages, please read the CodEx manual.

CodEx-Introduction

2015-01-13T09:39:34Z

Bojar: polishing intro paragraph

When reading this page, you've probably already gone a long way in learning about machine translation. Nice work!

Our MT Talks are occasionally complemented with programming exercises. We invite you (and strongly recommend) to go beyond watching our videos and try solving some or all of these exercises. Pick one of the supported programming languages, write a short program and submit it to our system for evaluation by a set of fully automatic tests.

The exercises are implemented in The Code Examiner ('''CodEx''', https://codex3.ms.mff.cuni.cz/codex-trans/). This page briefly describes how to use CodEx in general:

* How to get a CodEx account
* How to login to CodEx
* How to pick a task to solve
* How to submit a solution for evaluation

The individual exercises are described both in the CodEx system and on the corresponding MT talk page here.

== How to get a CodEx account ==

Before setting out on your journey through all the tasks, you need to get an account. There are two ways to obtain an account in CodEx:

=== For CUNI students ===

[[File:codex-registration.png|thumb|200px|'''Codex Registration''' CUNI students]]

Please access the SIS registration page: https://codex3.ms.mff.cuni.cz/codex-trans/?module=sisregistration. You will be asked to verify your account; click '''verify'''. If everything is fine, you can proceed to create your own account by following the instructions.

=== For non-CUNI students ===

Please send an email to [mailto:mttalks@ufal.mff.cuni.cz?Subject=Request%20for%20CodEx%20account!%20MT%20talks&body=Hello!%0D%0A%0D%0APlease%20create%20a%20CodEx%20account%20for%20me.%20Thank%20you. Admin] to request an account. The admin will create the account for you. Your account is added to the '''MT talks''' group by default.

== How to login and join a group ==

[[File:codex-welcome-page.png|thumb|200px|'''Codex Welcome''']]

Once you have your login alias/password, come back to the login page: https://codex3.ms.mff.cuni.cz/codex-trans. After logging in, you are directed to the welcome page which displays all documentation and news related to your account.

In the left-hand column, there is an internal link '''group'''. It leads to the list of all groups that you can join. When you join a group, you are expected to do all the exercises which are assigned to that group.

[[File:codex-group.png|thumb|200px|'''List of groups''']]

For the sake of MT talks, please join the group '''MT talks''' if you have not done so already (see the picture ''List of groups'').

== How to pick an exercise, solve it and submit your solution ==

After joining a group, you are able to see all the exercises assigned to that group.
In the left-hand sidebar, under '''group -> task''', you will see three options: ''specification, new submit, submits''. They mean ''read the specification, submit a new solution'' and ''manage old submissions'', respectively.

[[File:codex-submit.png|thumb|200px|'''Submit a new solution''']]
[[File:codex-eval.png|thumb|200px|'''Manage your submissions''']]

For every exercise, please read the specification carefully. You are asked to write a complete program (not just a function). The supported programming languages are: ''Pascal, C, C++, C#, Haskell, Python and Java''.

Your solution has to fit in a single file and use standard input and output.

To submit a solution, there are two ways:
* Upload from text area: Write your solution into the text box, select the extension that matches your programming language, then submit.
* Upload from file: Write your solution into a file with the appropriate extension and submit it.

During evaluation, your program is run several times on different inputs to check its correctness. You pass if your program succeeds in at least a ''threshold'' number of these runs.

In the left-hand sidebar, under '''group''', there are links to the pages '''results''' and '''bonus points''', where you can keep track of your results throughout the course.

=== Example ===

Exercise '''Hello World!''': Your task is to write a program which reads the name of a person from each input line and says 'Hello' to him/her.

'''Input''': << standard input >> < sample.in

John
Marry
Marry and Kate

'''Output''': << standard output >>

Hello John!
Hello Marry!
Hello Marry and Kate!

'''Sample solution''': Read the input line by line, trim each line, wrap it in "Hello " and "!", and print the result.

'''Python'''
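#!/usr/bin/env python
# Reads from the files given as arguments, or from standard input if there are none.
import fileinput
if __name__ == '__main__':
    for line in fileinput.input():
        # strip() removes the trailing newline and any surrounding whitespace
        print("Hello " + line.strip() + "!")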

Run: ./helloworld.py sample.in

'''Java'''

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
public class CodEx {
    public static void main(String[] args) {
        BufferedReader br = null;
        try {
            // Read standard input line by line
            br = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = br.readLine()) != null) {
                // trim() removes surrounding whitespace, as in the Python version
                System.out.println("Hello " + line.trim() + "!");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Run: javac CodEx.java; java CodEx < sample.in

'''Notes''': If you choose Java as your programming language, your program must not declare any package and the main class must be called "CodEx". For other languages, please read the CodEx manual.

Pre-processing

2015-01-13T09:26:19Z

Bojar: /* VIM tips */ set nomomb added

{{Infobox
|title = Lecture 3: Pre-processing
|image = [[File:bear-with-us.png|200px]]
|label1 = Lecture video:
|data1 = [http://example.com web '''TODO'''] [http://www.youtube.com/watch?v=ucSv4S4sCjs Youtube]
}}

{{#ev:youtube|ucSv4S4sCjs|800|center}}

Overall, the task of data pre-processing is to ''drop any distinctions that are not important for the output''.

== Inspecting Text Data ==

=== Text Encoding ===

Two texts that look the same might not be identical. MT systems do not see the strings as humans do but instead, they work with the actual byte representation. Therefore, data pre-processing is a very important step in system development.

[http://en.wikipedia.com/wiki/Unicode Unicode] includes a number of special characters which can complicate text processing for an MT system developer. The following table contains examples of some of the more devious characters:

{|
! Code
! Name
! Description
|-
|'''U+200B'''
|Zero-width space
|An invisible space.
|-
|'''U+200E'''
|Left-to-right mark
|An invisible character used in texts with mixed scripts (e.g. Latin and Arabic) to indicate reading direction.
|-
|'''U+2028'''
|Line separator
|A Unicode newline which is often not interpreted by text editors (and can be invisible).
|-
|'''U+2029'''
|Paragraph separator
|Separates paragraphs, implies a new line (also often ignored).
|}

[http://www.decodeunicode.org/ Decode Unicode] is a useful webpage with information on Unicode characters.

Often, a file [http://en.wikipedia.org/wiki/Hex_dump hexdump] is the most useful diagnostic tool. E.g. the Linux command '''xxd''' provides the necessary functionality.
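When a hexdump is too low-level, a few lines of code can report the problematic characters directly. The following is only a minimal sketch, assuming Python 3 and text that is already decoded to a Unicode string; the function name ''find_suspicious'' and the set ''SUSPICIOUS'' are made up for this illustration and cover just the four characters from the table above.

```python
import unicodedata

# The four code points from the table above; extend the set as needed.
SUSPICIOUS = {"\u200b", "\u200e", "\u2028", "\u2029"}

def find_suspicious(text):
    """Return (offset, 'U+XXXX', Unicode name) for each suspicious character."""
    hits = []
    for offset, ch in enumerate(text):
        if ch in SUSPICIOUS:
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((offset, "U+%04X" % ord(ch), name))
    return hits

if __name__ == "__main__":
    sample = "zero\u200bwidth and a line\u2028separator"
    for hit in find_suspicious(sample):
        print(hit)
```

The offsets are counted in characters, not bytes, so they will generally differ from the positions shown by '''xxd''' for UTF-8 encoded files.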

=== Script/Characters ===

Unicode often provides several ways to write what looks like a single character. For example, the letter "'''a'''" might be written in Latin (U+0061) or Cyrillic (U+0430) script.

Aside from seemingly identical, but differently encoded characters, problems commonly seen in data include:

* Confusion of '''0''' (zero) and '''O''' (capital letter)
* Inconsistent letter case: the English word '''I''' written in lowercase etc. (notorious e.g. in movie subtitles)
* Various systematic mis-spellings -- all of these variants of "I'll" (I will) were observed in movie subtitles: '''i'll Ill l'll 1'll 1'11'''
* Different symbols for various punctuation (quotes, dashes, apostrophe etc.)
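The problem of identical-looking characters can be illustrated with a short sketch, assuming Python 3. The helper ''describe'' is a hypothetical name chosen for this example; it relies only on the standard ''unicodedata'' module.

```python
import unicodedata

def describe(text):
    """Return a list of (character, code point, Unicode name) triples."""
    return [(ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, looks like Latin "a" in most fonts

if __name__ == "__main__":
    # The two strings look the same but do not compare equal.
    print(latin_a == cyrillic_a)
    for triple in describe(latin_a + cyrillic_a):
        print(triple)
```

Printing the Unicode names makes the difference visible: one character is a Latin letter, the other a Cyrillic one.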

=== VIM tips ===

A good text editor is an essential pre-requisite for successful inspection of text data and the implementation of suitable pre-processing. We provide several random tips for the VIM editor:

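Set the encoding VIM uses internally to UTF-8 (to change the encoding in which the current file is saved, use ''':set fileencoding=utf8''' instead):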

''':set encoding=utf8'''

Show the code of the character under the cursor:

'''ga'''

Set or remove the BOM (byte-order mark) for the current file:

''':set bomb'''
''':set nobomb'''

== Tokenization ==

The most suitable tokenization can be task-dependent. For example, in parsing, we would like to keep adjectives such as "red-haired" as one word, while for phrase-based MT, it is useful to split such words.

A basic but quite robust approach is to split whenever the [http://www.regular-expressions.info/unicode.html Unicode character category] changes. Imagine reading the input character by character. When we observe that we have been reading letters (category '''L''') and suddenly there is punctuation (category '''P'''), we insert a space. During the same process, it is useful to convert all whitespace (tabs, spaces, non-breaking spaces and sequences of such) to a single space character.
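The category-change approach can be sketched as follows. This is an illustration only, assuming Python 3; the function names ''major_category'' and ''tokenize'' are invented for this example. One simplification is stated in the code: all whitespace is treated as category '''Z''', although tabs and newlines formally belong to the control characters.

```python
import unicodedata

def major_category(ch):
    """First letter of the Unicode general category, e.g. 'L', 'P' or 'N'.

    Assumption of this sketch: any whitespace counts as 'Z', although tabs
    and newlines are formally control characters (category 'Cc').
    """
    if ch.isspace():
        return "Z"
    return unicodedata.category(ch)[0]

def tokenize(text):
    """Insert a space wherever the major category changes."""
    out = []
    prev = None
    for ch in text:
        cat = major_category(ch)
        if cat == "Z":
            # Collapse any run of whitespace into a single space.
            if out and out[-1] != " ":
                out.append(" ")
        else:
            if prev is not None and prev != "Z" and cat != prev:
                out.append(" ")
            out.append(ch)
        prev = cat
    return "".join(out).strip()

if __name__ == "__main__":
    print(tokenize("Hello,\tworld!"))
```

Note that this simple scheme also splits "red-haired" into three tokens and separates digits from a decimal point, which may or may not be what a particular task needs.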

However, in many situations, a more sophisticated, linguistically motivated tokenization scheme is useful. E.g. for words such as "don't", "couldn't", "shouldn't", we can obtain a nice generalization by splitting off "n't":

: ''don't -> do n't''

: ''shouldn't -> should n't''

: ''couldn't -> could n't''
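A rule of this kind can be written as a single regular expression. The sketch below assumes Python 3 and the plain ASCII apostrophe; the names ''NT_PATTERN'' and ''split_nt'' are hypothetical. Typographic apostrophes would need to be normalized first.

```python
import re

# A word ending in "n't", e.g. "don't" or "shouldn't" (case-insensitive).
# Only the plain ASCII apostrophe is handled in this sketch.
NT_PATTERN = re.compile(r"\b(\w+)(n't)\b", re.IGNORECASE)

def split_nt(text):
    """Split off the contracted negation: "don't" becomes "do n't"."""
    return NT_PATTERN.sub(r"\1 \2", text)

if __name__ == "__main__":
    print(split_nt("She shouldn't go, and I couldn't."))
```

The same rule turns "can't" into "ca n't", so the remaining stem is not always an existing word form.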

It is essential for data tokenization to be '''consistent'''. All of our training data should conform to the same pre-processing scheme and an identical pipeline should be applied at test time (when our system runs and we translate new data).

MT Talks

2014-12-16T22:38:10Z

Bojar: links to other videolectures