<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://mttalks.ufal.ms.mff.cuni.cz/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tamchyna</id>
	<title>MT Talks - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://mttalks.ufal.ms.mff.cuni.cz/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Tamchyna"/>
	<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php/Special:Contributions/Tamchyna"/>
	<updated>2026-04-28T16:16:33Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53659</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53659"/>
		<updated>2015-11-17T17:25:26Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: /* Using Deep Syntax to Achieve State of the Art in MT */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960s. It formally describes language as a system of layers, ranging from the most basic (phonology) to the most abstract (deep syntax/semantics -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention of capturing language computationally, and indeed much of it has been implemented as computer programs. However, the system of layers was gradually simplified and currently only four layers are used (here we refer to the annotation scheme of the Prague Dependency Treebank).&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD. An example of the layered description is shown in the following image (taken from the PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
One of the central notions in FGD and PDT is (verb) valency. Essentially, valency is the ability of verbs to require arguments (for example, most verbs require an actor, or subject, while only some also require an object, etc.). VALLEX is a fine-grained valency dictionary of Czech verbs. The assumption underlying this dictionary is that different &#039;&#039;valency frames&#039;&#039; roughly correspond to different verb senses.&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;br /&gt;
&lt;br /&gt;
TectoMT is an implementation of the FGD framework for machine translation. It uses the analysis-transfer-synthesis approach and was developed primarily for English-Czech translation, although it has recently been extended to support other languages such as Dutch, German or Basque.&lt;br /&gt;
&lt;br /&gt;
The input sentence is first analysed up to the tectogrammatical layer (deep syntax). This layer is assumed to be abstract enough that the structure of the dependency tree is language independent. This allows the transfer phase to only &amp;quot;relabel&amp;quot; the tree nodes instead of doing a full tree-to-tree transfer, which would include structural transformations. Once a deep syntactic representation of the translation is produced, the synthesis (generation) phase constructs the surface representation in the target language.&lt;br /&gt;
&lt;br /&gt;
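To make the &amp;quot;relabel&amp;quot; step concrete, here is a minimal sketch of lexical transfer over a toy deep tree (this is not TectoMT&#039;s actual code; the tuple-based node representation and the tiny dictionary are invented purely for illustration):&lt;br /&gt;
&lt;br /&gt;
 # toy t-layer node: (lemma, functor, children)&lt;br /&gt;
 def transfer(node, lemma_dict):&lt;br /&gt;
     lemma, functor, children = node&lt;br /&gt;
     # keep the tree structure, replace only the lemma (lexical transfer)&lt;br /&gt;
     new_lemma = lemma_dict.get(lemma, lemma)&lt;br /&gt;
     return (new_lemma, functor, [transfer(c, lemma_dict) for c in children])&lt;br /&gt;
 &lt;br /&gt;
 # velky dum (= big house) as a toy deep tree&lt;br /&gt;
 tree = (&#039;dum&#039;, &#039;PAT&#039;, [(&#039;velky&#039;, &#039;RSTR&#039;, [])])&lt;br /&gt;
 print(transfer(tree, {&#039;dum&#039;: &#039;house&#039;, &#039;velky&#039;: &#039;big&#039;}))&lt;br /&gt;
 # (&#039;house&#039;, &#039;PAT&#039;, [(&#039;big&#039;, &#039;RSTR&#039;, [])])&lt;br /&gt;
&lt;br /&gt;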
The following picture shows an example of Czech-English translation.&lt;br /&gt;
&lt;br /&gt;
[[File:tectomt-example.png|800px]]&lt;br /&gt;
&lt;br /&gt;
== Using Deep Syntax to Achieve State of the Art in MT ==&lt;br /&gt;
&lt;br /&gt;
By itself, deep syntactic MT does not reach the performance of statistical methods (e.g. phrase-based). However, the outputs of TectoMT are usually grammatical sentences (as they are generated from a deep representation, preserving agreement constraints) and they can contain word forms not observed in the training data (thanks to the morphological generator). As such, they are a useful complement to statistical systems.&lt;br /&gt;
&lt;br /&gt;
[http://ufal.mff.cuni.cz/chimera Chimera] is a system combination of a standard phrase-based MT system and TectoMT. The development and test data are translated with TectoMT and the outputs are added as a separate (synthetic) parallel corpus. An extra phrase table is extracted from this synthetic set and added to Moses. The system can therefore choose to use either the standard parallel data or the outputs of TectoMT. Standard MERT is used to set the weights.&lt;br /&gt;
&lt;br /&gt;
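A rough sketch of the data-preparation step just described (the file names below are made up and the real pipeline relies on the standard Moses phrase-extraction tools rather than a script like this): each dev/test source sentence is paired with its TectoMT translation to form a small synthetic parallel corpus, from which the extra phrase table is extracted.&lt;br /&gt;
&lt;br /&gt;
 # pair each dev/test source sentence with its TectoMT translation,&lt;br /&gt;
 # producing a tiny synthetic parallel corpus (two aligned files)&lt;br /&gt;
 with open(&#039;dev.src&#039;) as src, open(&#039;dev.tectomt&#039;) as hyp:&lt;br /&gt;
     pairs = list(zip(src, hyp))&lt;br /&gt;
 with open(&#039;synth.en&#039;, &#039;w&#039;) as en, open(&#039;synth.cs&#039;, &#039;w&#039;) as cs:&lt;br /&gt;
     for s, t in pairs:&lt;br /&gt;
         en.write(s)&lt;br /&gt;
         cs.write(t)&lt;br /&gt;
 # a phrase table extracted from synth.en/synth.cs is then added to Moses&lt;br /&gt;
 # as an extra translation model; MERT assigns it its own weights&lt;br /&gt;
&lt;br /&gt;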
The following figure illustrates the structure of Chimera -- the final system is CH2, but CH1 (Moses + TectoMT, without automatic post-editing) and CH0 (plain Moses only) have also been evaluated.&lt;br /&gt;
&lt;br /&gt;
[[File:chimera.png|400px]]&lt;br /&gt;
&lt;br /&gt;
Chimera currently represents the state of the art in English-Czech MT; it was ranked first by human judges in three consecutive years of the WMT shared Translation Task (2013, 2014, 2015).&lt;br /&gt;
&lt;br /&gt;
The following table shows the improvements from adding extra data and from including the TectoMT outputs. The results suggest that the improvements provided by TectoMT are complementary to adding more data and significantly help translation quality. The constrained setup used only 15 million parallel sentence pairs, as opposed to the full system trained on over 52 million sentence pairs; in terms of monolingual data, the difference was 44 vs. 392 million sentences.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
! BLEU&lt;br /&gt;
! Constrained&lt;br /&gt;
! Full&lt;br /&gt;
! Delta&lt;br /&gt;
|-&lt;br /&gt;
!CH0&lt;br /&gt;
|21.28&lt;br /&gt;
|22.59&lt;br /&gt;
|1.31&lt;br /&gt;
|-&lt;br /&gt;
!CH1&lt;br /&gt;
|23.37&lt;br /&gt;
|24.24&lt;br /&gt;
|0.87&lt;br /&gt;
|-&lt;br /&gt;
!Delta&lt;br /&gt;
|2.09&lt;br /&gt;
|1.65&lt;br /&gt;
|&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53649</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53649"/>
		<updated>2015-11-10T13:46:43Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: /* See Also */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far, we haven&#039;t fully described the model most commonly used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is determined by a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at both the translation and the source sentence, and each outputs a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find the translation hypothesis that maximizes this score; formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
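&lt;br /&gt;
As a minimal illustration (the feature names and numbers below are toy values, not taken from any real system), the score of a hypothesis is simply the weighted sum of its feature values, and decoding keeps the highest-scoring hypothesis:&lt;br /&gt;
&lt;br /&gt;
 import math&lt;br /&gt;
 &lt;br /&gt;
 weights = {&#039;tm&#039;: 1.0, &#039;lm&#039;: 0.6, &#039;penalty&#039;: -0.3}&lt;br /&gt;
 &lt;br /&gt;
 def score(features):&lt;br /&gt;
     # weighted sum of feature values, as in the formula above&lt;br /&gt;
     return sum(weights[name] * value for name, value in features.items())&lt;br /&gt;
 &lt;br /&gt;
 hyps = {&lt;br /&gt;
     &#039;this is a small house&#039;: {&#039;tm&#039;: -1.2, &#039;lm&#039;: -4.1, &#039;penalty&#039;: 5.0},&lt;br /&gt;
     &#039;this is small house&#039;: {&#039;tm&#039;: -0.9, &#039;lm&#039;: -5.6, &#039;penalty&#039;: 4.0},&lt;br /&gt;
 }&lt;br /&gt;
 best = max(hyps, key=lambda e: score(hyps[e]))&lt;br /&gt;
 print(best, math.exp(score(hyps[best])))  # unnormalized probability&lt;br /&gt;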
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simple counting: for the first formula, we count how many times we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and divide by how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total (and symmetrically for the second formula). For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt; (here we condition on the English phrase).&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
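The counting itself is easy to sketch in a few lines (the file name and the simplified two-field line format are assumptions made for this example; real extract files carry additional fields such as word alignments):&lt;br /&gt;
&lt;br /&gt;
 from collections import Counter&lt;br /&gt;
 &lt;br /&gt;
 pair_counts, en_counts = Counter(), Counter()&lt;br /&gt;
 with open(&#039;extract.sorted&#039;) as f:  # one extracted phrase pair per line&lt;br /&gt;
     for line in f:&lt;br /&gt;
         en, cs = [part.strip() for part in line.split(&#039;|||&#039;)[:2]]&lt;br /&gt;
         pair_counts[(en, cs)] += 1&lt;br /&gt;
         en_counts[en] += 1&lt;br /&gt;
 &lt;br /&gt;
 def p_cs_given_en(cs, en):&lt;br /&gt;
     # relative frequency, e.g. 3/9 for the example above&lt;br /&gt;
     return pair_counts[(en, cs)] / en_counts[en]&lt;br /&gt;
&lt;br /&gt;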
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall (i,j) \in a} w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
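&lt;br /&gt;
A small Python sketch of this computation follows; the alignment points and the lexical translation probabilities are invented for illustration, and unaligned foreign words (which would be paired with NULL) are ignored for brevity:&lt;br /&gt;
&lt;br /&gt;
 # alignment points (i, j): English word i is aligned to foreign word j&lt;br /&gt;
 alignment = [(0, 0), (1, 1), (2, 1)]&lt;br /&gt;
 &lt;br /&gt;
 # invented lexical translation probabilities w(f_j, e_i)&lt;br /&gt;
 w = {(0, 0): 0.5, (1, 1): 0.4, (2, 1): 0.2}&lt;br /&gt;
 &lt;br /&gt;
 num_foreign_words = 2&lt;br /&gt;
 &lt;br /&gt;
 lex = 1.0&lt;br /&gt;
 for j in range(num_foreign_words):&lt;br /&gt;
     aligned = [i for (i, jj) in alignment if jj == j]&lt;br /&gt;
     # average the lexical probabilities over the English words aligned to f_j&lt;br /&gt;
     lex *= sum(w[(i, j)] for i in aligned) / len(aligned)&lt;br /&gt;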
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach is to use&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models, which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
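&lt;br /&gt;
As a toy example, the following Python snippet scores the sentence &amp;quot;the house is small&amp;quot; with a trigram model, i.e. each word is conditioned on at most the two preceding words; all probabilities are invented:&lt;br /&gt;
&lt;br /&gt;
 # invented n-gram probabilities&lt;br /&gt;
 p_unigram = {&amp;quot;the&amp;quot;: 0.06}&lt;br /&gt;
 p_bigram  = {(&amp;quot;the&amp;quot;, &amp;quot;house&amp;quot;): 0.02}&lt;br /&gt;
 p_trigram = {(&amp;quot;the&amp;quot;, &amp;quot;house&amp;quot;, &amp;quot;is&amp;quot;): 0.15,&lt;br /&gt;
              (&amp;quot;house&amp;quot;, &amp;quot;is&amp;quot;, &amp;quot;small&amp;quot;): 0.10}&lt;br /&gt;
 &lt;br /&gt;
 # chain rule with the Markov assumption:&lt;br /&gt;
 # P(w) = P(the) * P(house|the) * P(is|the, house) * P(small|house, is)&lt;br /&gt;
 prob = (p_unigram[&amp;quot;the&amp;quot;]&lt;br /&gt;
         * p_bigram[(&amp;quot;the&amp;quot;, &amp;quot;house&amp;quot;)]&lt;br /&gt;
         * p_trigram[(&amp;quot;the&amp;quot;, &amp;quot;house&amp;quot;, &amp;quot;is&amp;quot;)]&lt;br /&gt;
         * p_trigram[(&amp;quot;house&amp;quot;, &amp;quot;is&amp;quot;, &amp;quot;small&amp;quot;)])&lt;br /&gt;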
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or, on the other hand, with a large phrase penalty, to outputs consisting of few, very long phrases (which is usually desirable, since longer phrases carry more context).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following one is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
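&lt;br /&gt;
The distance-based variant is easy to compute; here is a small Python sketch with made-up source-side phrase positions (word indices):&lt;br /&gt;
&lt;br /&gt;
 # source-side spans (first word, last word) of the phrases, in the order&lt;br /&gt;
 # in which the decoder translates them&lt;br /&gt;
 phrase_spans = [(0, 1), (4, 5), (2, 3)]&lt;br /&gt;
 &lt;br /&gt;
 total_distortion = 0&lt;br /&gt;
 previous_end = -1    # pretend the previous phrase ended just before the sentence&lt;br /&gt;
 for start, end in phrase_spans:&lt;br /&gt;
     # distance between the start of this phrase and the end of the previous one&lt;br /&gt;
     total_distortion += abs(start - previous_end - 1)&lt;br /&gt;
     previous_end = end&lt;br /&gt;
 &lt;br /&gt;
 # here: 0 for (0, 1), then 2 for the jump to (4, 5), then 4 for the jump back&lt;br /&gt;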
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
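&lt;br /&gt;
A small Python sketch of this situation: scoring a new phrase with a 4-gram LM requires the last three words of the partial translation and produces a new three-word state. The log-probabilities below are invented and unseen 4-grams get a crude constant penalty:&lt;br /&gt;
&lt;br /&gt;
 # invented 4-gram log-probabilities, keyed by (three-word context, word)&lt;br /&gt;
 logprob = {&lt;br /&gt;
     ((&amp;quot;output&amp;quot;, &amp;quot;of&amp;quot;, &amp;quot;the&amp;quot;), &amp;quot;system&amp;quot;): -1.2,&lt;br /&gt;
     ((&amp;quot;of&amp;quot;, &amp;quot;the&amp;quot;, &amp;quot;system&amp;quot;), &amp;quot;is&amp;quot;): -0.7,&lt;br /&gt;
 }&lt;br /&gt;
 &lt;br /&gt;
 def lm_score_phrase(state, phrase):&lt;br /&gt;
     score = 0.0&lt;br /&gt;
     for word in phrase:&lt;br /&gt;
         score += logprob.get((state, word), -10.0)   # crude unseen penalty&lt;br /&gt;
         state = state[1:] + (word,)                  # slide the context window&lt;br /&gt;
     return score, state                              # state = last three words&lt;br /&gt;
 &lt;br /&gt;
 score, new_state = lm_score_phrase((&amp;quot;output&amp;quot;, &amp;quot;of&amp;quot;, &amp;quot;the&amp;quot;), (&amp;quot;system&amp;quot;, &amp;quot;is&amp;quot;))&lt;br /&gt;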
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
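&lt;br /&gt;
The following Python sketch illustrates the recombination check under these assumptions: hypotheses are simplified objects with a coverage vector, an LM state (the last three words) and a score, and only the best hypothesis per (coverage, state) pair is kept. The data structure is hypothetical and much simpler than in a real decoder:&lt;br /&gt;
&lt;br /&gt;
 from collections import namedtuple&lt;br /&gt;
 &lt;br /&gt;
 # a simplified partial hypothesis&lt;br /&gt;
 Hypothesis = namedtuple(&amp;quot;Hypothesis&amp;quot;, [&amp;quot;coverage&amp;quot;, &amp;quot;lm_state&amp;quot;, &amp;quot;score&amp;quot;])&lt;br /&gt;
 &lt;br /&gt;
 def recombine(hypotheses):&lt;br /&gt;
     best = {}   # (coverage, lm_state) maps to the best hypothesis seen so far&lt;br /&gt;
     for hyp in hypotheses:&lt;br /&gt;
         key = (hyp.coverage, hyp.lm_state)&lt;br /&gt;
         kept = best.get(key)&lt;br /&gt;
         # identical coverage and state: no future feature can distinguish&lt;br /&gt;
         # the two hypotheses, so only the higher-scoring one is kept&lt;br /&gt;
         if kept is None or hyp.score &amp;gt; kept.score:&lt;br /&gt;
             best[key] = hyp&lt;br /&gt;
     return list(best.values())&lt;br /&gt;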
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
With syntactic MT, the situation is more complicated because hypotheses are not constructed left-to-right. That means that while in phrase-based search there was only a single boundary between the current partial translation and its extension, SCFG rules can apply anywhere and we may need to look at words both preceding and following the target side of the rule. This makes state tracking more complicated than in PBMT.&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
We now focus on how to find a good set of weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; for the features. There are many methods for tuning model parameters in MT, such as MERT (Minimum Error Rate Training, described here), PRO (Pairwise Ranked Optimization), or MIRA (Margin Infused Relaxed Algorithm, a general online optimization algorithm applied successfully to MT).&lt;br /&gt;
&lt;br /&gt;
TODO references to papers!&lt;br /&gt;
&lt;br /&gt;
All of them require a tuning set (development set, held-out set) -- a small parallel corpus separated from the training data on which the performance of the proposed weights is evaluated. Choosing a suitable tuning set is black magic (as are many decisions in MT system development). As a general guideline, it should be as similar to the expected test data as possible and the larger, the better (too large tuning sets can take too long to tune on, though).&lt;br /&gt;
&lt;br /&gt;
Minimum Error Rate Training (MERT) has become the de facto standard algorithm for tuning. The tuning process is&lt;br /&gt;
iterative:&lt;br /&gt;
&lt;br /&gt;
# Set all weights to some initial values.&lt;br /&gt;
# Translate the tuning set using the current weights; for each sentence, output &#039;&#039;n&#039;&#039; best translations and their feature scores.&lt;br /&gt;
# Run one iteration of MERT to get a new set of weights.&lt;br /&gt;
# If the n-best lists are identical to the previous iteration, return the current weights and exit. Else go back to 2.&lt;br /&gt;
&lt;br /&gt;
The input for MERT is a set of &#039;&#039;&#039;n-best lists&#039;&#039;&#039; -- the &#039;&#039;n&#039;&#039; best translations&lt;br /&gt;
for each sentence in the tuning set. A vector of feature scores is associated&lt;br /&gt;
with each sentence.&lt;br /&gt;
&lt;br /&gt;
First, each translation is scored by the objective function (such as BLEU). In&lt;br /&gt;
each n-best list, the sentence with the best score is assumed to be the best&lt;br /&gt;
translation. The goal of MERT then is to find a set of weights that will&lt;br /&gt;
maximize the overall score, i.e. move good translations to the top of the n-best&lt;br /&gt;
lists.&lt;br /&gt;
&lt;br /&gt;
MERT addresses the dimensionality of the weight space (the space is effectively&lt;br /&gt;
&amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; for &#039;&#039;n&#039;&#039; weights) by optimizing each weight separately with a line search, keeping the other weights fixed.&lt;br /&gt;
&lt;br /&gt;
While the line search is globally optimal (in the one dimension), overall, the&lt;br /&gt;
procedure is likely to reach a local optimum. MERT is therefore usually run from&lt;br /&gt;
a number of different starting positions and the best set of weights is used.&lt;br /&gt;
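&lt;br /&gt;
The inner optimization can be illustrated with the following deliberately simplified Python sketch: it replaces the exact line search with a grid of candidate values and corpus-level BLEU with a simple sum of sentence-level quality scores, but it shows the overall idea of re-ranking the fixed n-best lists under different weights:&lt;br /&gt;
&lt;br /&gt;
 import random&lt;br /&gt;
 &lt;br /&gt;
 def corpus_score(nbest, weights):&lt;br /&gt;
     # pick the model-best hypothesis of each sentence and add up its quality&lt;br /&gt;
     total = 0.0&lt;br /&gt;
     for hyps in nbest:   # hyps is a list of (feature_vector, quality) pairs&lt;br /&gt;
         model_best = max(hyps, key=lambda h: sum(w * f for w, f in zip(weights, h[0])))&lt;br /&gt;
         total += model_best[1]&lt;br /&gt;
     return total&lt;br /&gt;
 &lt;br /&gt;
 def tune(nbest, num_weights, grid, iterations=10):&lt;br /&gt;
     weights = [random.uniform(-1.0, 1.0) for _ in range(num_weights)]&lt;br /&gt;
     for _ in range(iterations):&lt;br /&gt;
         for i in range(num_weights):          # optimize one weight at a time&lt;br /&gt;
             best_value = weights[i]&lt;br /&gt;
             best_score = corpus_score(nbest, weights)&lt;br /&gt;
             for value in grid:                # crude stand-in for the exact line search&lt;br /&gt;
                 weights[i] = value&lt;br /&gt;
                 score = corpus_score(nbest, weights)&lt;br /&gt;
                 if score &amp;gt; best_score:&lt;br /&gt;
                     best_value, best_score = value, score&lt;br /&gt;
             weights[i] = best_value&lt;br /&gt;
     return weights&lt;br /&gt;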
&lt;br /&gt;
After convergence (or reaching a pre-set maximum number of iterations), the&lt;br /&gt;
weights for the log-linear model are known and the system training is finished.&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
* Bojar, O. 2012. [http://www.cupress.cuni.cz/ink2_ext/index.jsp?include=podrobnosti&amp;amp;id=224545 Čeština a strojový překlad]. Ústav formální a aplikované lingvistiky MFF UK 2012.&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53648</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53648"/>
		<updated>2015-11-10T12:48:57Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is determined by a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt; (exponentiated and normalized to turn it into a probability). Feature functions look at the translation and the source and output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a translation hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall (i,j) \in a} w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach is to use&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models, which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or, on the other hand, with a large phrase penalty, to outputs consisting of few, very long phrases (which is usually desirable, since longer phrases carry more context).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following one is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
With syntactic MT, the situation is more complicated because hypotheses are not constructed left-to-right. That means that while in phrase-based search there was only a single boundary between the current partial translation and its extension, SCFG rules can apply anywhere and we may need to look at words both preceding and following the target side of the rule. This makes state tracking more complicated than in PBMT.&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
We now focus on how to find a good set of weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; for the features. There are many methods for tuning model parameters in MT, such as MERT (Minimum Error Rate Training, described here), PRO (Pairwise Ranked Optimization), or MIRA (Margin Infused Relaxed Algorithm, a general online optimization algorithm applied successfully to MT).&lt;br /&gt;
&lt;br /&gt;
TODO references to papers!&lt;br /&gt;
&lt;br /&gt;
All of them require a tuning set (development set, held-out set) -- a small parallel corpus separated from the training data on which the performance of the proposed weights is evaluated. Choosing a suitable tuning set is black magic (as are many decisions in MT system development). As a general guideline, it should be as similar to the expected test data as possible and the larger, the better (too large tuning sets can take too long to tune on, though).&lt;br /&gt;
&lt;br /&gt;
Minimum Error Rate Training (MERT) has become the de facto standard algorithm for tuning. The tuning process is&lt;br /&gt;
iterative:&lt;br /&gt;
&lt;br /&gt;
# Set all weights to some initial values.&lt;br /&gt;
# Translate the tuning set using the current weights; for each sentence, output &#039;&#039;n&#039;&#039; best translations and their feature scores.&lt;br /&gt;
# Run one iteration of MERT to get a new set of weights.&lt;br /&gt;
# If the n-best lists are identical to the previous iteration, return the current weights and exit. Else go back to 2.&lt;br /&gt;
&lt;br /&gt;
The input for MERT is a set of &#039;&#039;&#039;n-best lists&#039;&#039;&#039; -- the &#039;&#039;n&#039;&#039; best translations&lt;br /&gt;
for each sentence in the tuning set. A vector of feature scores is associated&lt;br /&gt;
with each sentence.&lt;br /&gt;
&lt;br /&gt;
First, each translation is scored by the objective function (such as BLEU). In&lt;br /&gt;
each n-best list, the sentence with the best score is assumed to be the best&lt;br /&gt;
translation. The goal of MERT then is to find a set of weights that will&lt;br /&gt;
maximize the overall score, i.e. move good translations to the top of the n-best&lt;br /&gt;
lists.&lt;br /&gt;
&lt;br /&gt;
MERT addresses the dimensionality of the weight space (the space is effectively&lt;br /&gt;
&amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; for &#039;&#039;n&#039;&#039; weights) by optimizing each weight separately with a line search, keeping the other weights fixed.&lt;br /&gt;
&lt;br /&gt;
While the line search is globally optimal (in the one dimension), overall, the&lt;br /&gt;
procedure is likely to reach a local optimum. MERT is therefore usually run from&lt;br /&gt;
a number of different starting positions and the best set of weights is used.&lt;br /&gt;
&lt;br /&gt;
After convergence (or reaching a pre-set maximum number of iterations), the&lt;br /&gt;
weights for the log-linear model are known and the system training is finished.&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;br /&gt;
&lt;br /&gt;
== See Also ==&lt;br /&gt;
&lt;br /&gt;
* Bojar, O. 2012. [http://www.cupress.cuni.cz/ink2_ext/index.jsp?include=podrobnosti&amp;amp;id=224545 Čeština a strojový překlad]. Ústav formální a aplikované lingvistiky MFF UK 2012.&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=MT_Talks&amp;diff=53647</id>
		<title>MT Talks</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=MT_Talks&amp;diff=53647"/>
		<updated>2015-10-10T14:21:15Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:banner.png]]&lt;br /&gt;
&lt;br /&gt;
MT Talks is a series of mini-lectures on machine translation.&lt;br /&gt;
&lt;br /&gt;
Our goal is to hit just the right level of detail and technicality to make the talks interesting and attractive to people who are not yet familiar with the field, but also to mix in new observations and insights so that even old pals will have a reason to watch us.&lt;br /&gt;
&lt;br /&gt;
MT Talks and the expanded notes on this wiki will never be the ultimate resource for MT, but we would be very happy to serve as an ultimate commented &#039;&#039;directory&#039;&#039; of good pointers.&lt;br /&gt;
&lt;br /&gt;
By the way, this is indeed a Wiki, so your contributions are very welcome! Please register and feel free to add comments, corrections or links to useful resources.&lt;br /&gt;
&lt;br /&gt;
== Our Talks ==&lt;br /&gt;
&lt;br /&gt;
01 &#039;&#039;&#039;[[Intro]]&#039;&#039;&#039;: Why is MT difficult, approaches to MT.&lt;br /&gt;
&lt;br /&gt;
02 &#039;&#039;&#039;[[MT that Deceives]]&#039;&#039;&#039;: Serious translation errors even for short and simple inputs.&lt;br /&gt;
&lt;br /&gt;
03 &#039;&#039;&#039;[[Pre-processing]]&#039;&#039;&#039;: Normalization and other technical tricks bound to help your MT system.&lt;br /&gt;
&lt;br /&gt;
04 &#039;&#039;&#039;[[MT Evaluation in General]]&#039;&#039;&#039;: Techniques of judging MT quality, dimensions of translation quality, number of possible translations.&lt;br /&gt;
&lt;br /&gt;
05 &#039;&#039;&#039;[[Automatic MT Evaluation]]&#039;&#039;&#039;: Two common automatic MT evaluation methods: PER and BLEU&lt;br /&gt;
&lt;br /&gt;
06 &#039;&#039;&#039;[[Data Acquisition]]&#039;&#039;&#039;: The need and possible sources of training data for MT. And the diminishing utility of the new data additions due to Zipf&#039;s law.&lt;br /&gt;
&lt;br /&gt;
07 &#039;&#039;&#039;[[Sentence Alignment]]&#039;&#039;&#039;: An introduction to the Gale &amp;amp; Church sentence alignment algorithm.&lt;br /&gt;
&lt;br /&gt;
08 &#039;&#039;&#039;[[Word Alignment]]&#039;&#039;&#039;: Cutting the chicken-egg problem.&lt;br /&gt;
&lt;br /&gt;
09 &#039;&#039;&#039;[[Phrase-based Model]]&#039;&#039;&#039;: Copy if you can.&lt;br /&gt;
&lt;br /&gt;
10 &#039;&#039;&#039;[[Constituency Trees]]&#039;&#039;&#039;: Divide and conquer.&lt;br /&gt;
&lt;br /&gt;
11 &#039;&#039;&#039;[[Dependency Trees]]&#039;&#039;&#039;: Trees with gaps.&lt;br /&gt;
&lt;br /&gt;
12 &#039;&#039;&#039;[[Rich Vocabulary]]&#039;&#039;&#039;: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.&lt;br /&gt;
&lt;br /&gt;
13 &#039;&#039;&#039;[[Scoring and Optimization]]&#039;&#039;&#039;: Features your model features.&lt;br /&gt;
&lt;br /&gt;
14 &#039;&#039;&#039;[[Deep Syntax]]&#039;&#039;&#039;: Prague Family Jewels.&lt;br /&gt;
&lt;br /&gt;
== CodEx – Coding Exercises ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [https://codex3.ms.mff.cuni.cz/codex-trans/ Log in to CodEx] and solve programming exercises that complement our talks.&lt;br /&gt;
* [[CodEx-Introduction|Brief description of CodEx]]: how to get an account and submit a solution.&lt;br /&gt;
* [[CodEx - Important Notes|Important Notes]] on technical issues&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Due to spamming, we had to restrict permissions for editing the Wiki. If you&#039;re interested in contributing, please write an email to &#039;&#039;&#039;tamchyna -at- ufal.mff.cuni.cz&#039;&#039;&#039; to obtain a username.&lt;br /&gt;
&lt;br /&gt;
== Other Videolectures on MT ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.upc.edu/learning/courses/mooc/2014-2015/approaches-to-machine/approaches-to-machine Approaches to Machine Translation: Rule-Based, Statistical, Hybrid] (an online course on MT by UPC Barcelona)&lt;br /&gt;
* [https://www.coursera.org/course/nlangp Natural Language Processing at Coursera] by Michael Collins, includes lectures on word-based and phrase-based models. [http://www.cs.columbia.edu/~mcollins/notes-spring2013.html Further notes]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PLVjXYOjST-AokmIxpCr4GexcdtpeOliBc TAUS Machine Translation and Moses Tutorial] (a series of commented slides, MT overview and practical aspects of the Moses Toolkit)&lt;br /&gt;
&lt;br /&gt;
== Acknowledgement ==&lt;br /&gt;
&lt;br /&gt;
The work on this project has been supported by the grant FP7-ICT-2011-7-288487 ([http://www.statmt.org/mosescore/ MosesCore]).&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53646</id>
		<title>Admin RootPage</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53646"/>
		<updated>2015-10-10T14:21:01Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;0x : How to get started with CodEx MT exercises&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Our [https://www.youtube.com/playlist?list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V YouTube playlist] -- shows a total number of views, although it differs from the sum of the individual video views.&lt;br /&gt;
&lt;br /&gt;
[[CodEx - Important Notes]]&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53645</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53645"/>
		<updated>2015-10-07T14:28:22Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). &lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD. An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
One of the central notions in FGD and PDT is (verb) valency. Essentially, valency is the ability of verbs to require arguments (for example, most verbs require an actor, or subject, only some require an object etc.) VALLEX is a fine-grained valency dictionary of Czech verbs. The assumption underlying this dictionary is that different &#039;&#039;valency frames&#039;&#039; roughly correspond to different verb senses.&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;br /&gt;
&lt;br /&gt;
TectoMT is an implementation of the FGD framework for machine translation. It uses the analysis-transfer-synthesis approach and it was developed primarily for English-Czech translation, although recently it has been extended to support other languages such as Dutch, German or Basque.&lt;br /&gt;
&lt;br /&gt;
The input sentence is first analysed up to the tectogrammatical layer (deep syntax). This layer is assumed to be abstract enough that the structure of the dependency tree is language independent. This allows for the transfer phase to only &amp;quot;relabel&amp;quot; the tree nodes instead of doing full tree-to-tree transfer which would include structural transformations. Once a deep syntactic representation of the translation is produced, the generation phase proceeds to construct the surface representation in the target language.&lt;br /&gt;
&lt;br /&gt;
The following picture shows an example of Czech-English translation.&lt;br /&gt;
&lt;br /&gt;
[[File:tectomt-example.png|800px]]&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=File:Tectomt-example.png&amp;diff=53644</id>
		<title>File:Tectomt-example.png</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=File:Tectomt-example.png&amp;diff=53644"/>
		<updated>2015-10-07T14:27:59Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53643</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53643"/>
		<updated>2015-10-07T14:27:47Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). &lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD. An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
One of the central notions in FGD and PDT is (verb) valency. Essentially, valency is the ability of verbs to require arguments (for example, most verbs require an actor, or subject, only some require an object etc.) VALLEX is a fine-grained valency dictionary of Czech verbs. The assumption underlying this dictionary is that different &#039;&#039;valency frames&#039;&#039; roughly correspond to different verb senses.&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;br /&gt;
&lt;br /&gt;
TectoMT is an implementation of the FGD framework for machine translation. It uses the analysis-transfer-synthesis approach and it was developed primarily for English-Czech translation, although recently it has been extended to support other languages such as Dutch, German or Basque.&lt;br /&gt;
&lt;br /&gt;
The input sentence is first analysed up to the tectogrammatical layer (deep syntax). This layer is assumed to be abstract enough that the structure of the dependency tree is language independent. This allows for the transfer phase to only &amp;quot;relabel&amp;quot; the tree nodes instead of doing full tree-to-tree transfer which would include structural transformations. Once a deep syntactic representation of the translation is produced, the generation phase proceeds to construct the surface representation in the target language.&lt;br /&gt;
&lt;br /&gt;
The following picture shows an example of Czech-English translation.&lt;br /&gt;
&lt;br /&gt;
[[File:tectomt-example.png]]&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53642</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53642"/>
		<updated>2015-10-07T14:18:22Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). &lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD. An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
One of the central notions in FGD and PDT is (verb) valency. Essentially, valency is the ability of verbs to require arguments (for example, most verbs require an actor, or subject, only some require an object etc.) VALLEX is a fine-grained valency dictionary of Czech verbs. The assumption underlying this dictionary is that different &#039;&#039;valency frames&#039;&#039; roughly correspond to different verb senses.&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;br /&gt;
&lt;br /&gt;
TectoMT is an implementation of the FGD framework for machine translation. It uses the analysis-transfer-synthesis approach and it was developed primarily for English-Czech translation, although recently it has been extended to support other languages such as Dutch, German or Basque.&lt;br /&gt;
&lt;br /&gt;
The input sentence is first analysed up to the tectogrammatical layer (deep syntax). This layer is assumed to be abstract enough that the structure of the dependency tree is language independent. This allows for the transfer phase to only &amp;quot;relabel&amp;quot; the tree nodes instead of doing full tree-to-tree transfer which would include structural transformations. Once a deep syntactic representation of the translation is produced, the generation phase proceeds to construct the surface representation in the target language.&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53641</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53641"/>
		<updated>2015-10-07T14:03:21Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). &lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD. An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
One of the central notions in FGD and PDT is (verb) valency. Essentially, valency is the ability of verbs to require arguments (for example, most verbs require an actor, or subject, only some require an object etc.) VALLEX is a fine-grained valency dictionary of Czech verbs. The assumption underlying this dictionary is that different &#039;&#039;valency frames&#039;&#039; roughly correspond to different verb senses.&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53640</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53640"/>
		<updated>2015-10-07T13:41:27Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD.&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53639</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53639"/>
		<updated>2015-10-07T12:47:59Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
The lowest layer contains the sentence &amp;quot;as is&amp;quot;, without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53638</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53638"/>
		<updated>2015-10-07T12:44:30Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The theory was developed with the intention to capture the language using a computer and indeed, much of the theory has been implemented as computer programs. However, the system of layers was gradually simplified and currently, only four layers are used (we refer to the annotation scheme for the Prague Dependency Treebank). An example of the layered description is shown on the following image (taken from PDT-2.0 documentation):&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53637</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53637"/>
		<updated>2015-10-07T12:41:51Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
The functional generative description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960&#039;s. It formally describes the language as a system of layers, ranging from the most basic layers (phonology) to abstract ones (deep syntax/semantic -- the &#039;&#039;tectogrammatical layer&#039;&#039;). The following image (taken from PDT-2.0 documentation) shows an example of this description: &lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53636</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53636"/>
		<updated>2015-10-07T12:38:08Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|300px]]&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=File:I-layer-links.png&amp;diff=53635</id>
		<title>File:I-layer-links.png</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=File:I-layer-links.png&amp;diff=53635"/>
		<updated>2015-10-07T12:37:50Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53634</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53634"/>
		<updated>2015-10-07T12:37:39Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
[[File:i-layer-links.png|center|300px]]&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53633</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53633"/>
		<updated>2015-10-07T09:55:21Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;br /&gt;
&lt;br /&gt;
== Prague Dependency Treebank ==&lt;br /&gt;
&lt;br /&gt;
== VALLEX ==&lt;br /&gt;
&lt;br /&gt;
== MT Using Deep Syntax: TectoMT ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=File:Family-jewels.png&amp;diff=53632</id>
		<title>File:Family-jewels.png</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=File:Family-jewels.png&amp;diff=53632"/>
		<updated>2015-10-07T09:51:33Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53631</id>
		<title>Deep Syntax</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Deep_Syntax&amp;diff=53631"/>
		<updated>2015-10-07T09:51:18Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: Created page with &amp;quot;{{Infobox |title = Lecture 14: Deep Syntax |image = 200px |label1 = Lecture video: |data1 = [http://example.com web &amp;#039;&amp;#039;&amp;#039;TODO&amp;#039;&amp;#039;&amp;#039;] &amp;lt;br/&amp;gt; [https://www.y...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 14: Deep Syntax&lt;br /&gt;
|image = [[File:family-jewels.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=lJwCW2mFk2M&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Functional Generative Description ==&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53630</id>
		<title>Admin RootPage</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53630"/>
		<updated>2015-10-07T09:47:31Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;0x : How to get started with CodEx MT exercises&lt;br /&gt;
&lt;br /&gt;
14 &#039;&#039;&#039;[[Deep Syntax]]&#039;&#039;&#039;: Prague Family Jewels.&lt;br /&gt;
&lt;br /&gt;
Our [https://www.youtube.com/playlist?list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V YouTube playlist] -- shows some total number of views, although different from individual video views.&lt;br /&gt;
&lt;br /&gt;
[[CodEx - Important Notes]]&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53629</id>
		<title>Admin RootPage</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53629"/>
		<updated>2015-10-07T09:47:25Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;0x : How to get started with CodEx MT exercises&lt;br /&gt;
&lt;br /&gt;
13 &#039;&#039;&#039;[[Deep Syntax]]&#039;&#039;&#039;: Prague Family Jewels.&lt;br /&gt;
&lt;br /&gt;
Our [https://www.youtube.com/playlist?list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V YouTube playlist] -- shows some total number of views, although different from individual video views.&lt;br /&gt;
&lt;br /&gt;
[[CodEx - Important Notes]]&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53628</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53628"/>
		<updated>2015-08-27T17:26:13Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a translation hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
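&lt;br /&gt;
As a minimal sketch (with invented weights and feature values), the score of a single hypothesis is computed like this:&lt;br /&gt;
&lt;br /&gt;
 # Minimal sketch: log-linear scoring of one translation hypothesis.&lt;br /&gt;
 # The weights and feature values are invented for illustration.&lt;br /&gt;
 import math&lt;br /&gt;
 &lt;br /&gt;
 weights  = [0.3, 0.2, 0.5]       # w_i, obtained by tuning (see the last section)&lt;br /&gt;
 features = [-4.2, -3.1, -1.0]    # f_i(e,f): log-probabilities, penalties, ...&lt;br /&gt;
 &lt;br /&gt;
 score = sum(w * f for w, f in zip(weights, features))&lt;br /&gt;
 unnormalized_prob = math.exp(score)   # P(e|f) is proportional to this value&lt;br /&gt;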
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
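&lt;br /&gt;
In code, this estimation is just relative-frequency counting; the sketch below uses a toy list of extracted phrase pairs rather than a real extraction run:&lt;br /&gt;
&lt;br /&gt;
 # Sketch: relative-frequency estimate of a phrase translation probability.&lt;br /&gt;
 from collections import Counter&lt;br /&gt;
 &lt;br /&gt;
 pairs = [                            # toy extracted phrase pairs&lt;br /&gt;
     (&#039;estimated in the programme&#039;, &#039;naznačena v programu&#039;),&lt;br /&gt;
     (&#039;estimated in the programme&#039;, &#039;naznačena v programu&#039;),&lt;br /&gt;
     (&#039;estimated in the programme&#039;, &#039;odhadován v programu&#039;),&lt;br /&gt;
 ]&lt;br /&gt;
 pair_counts  = Counter(pairs)&lt;br /&gt;
 first_counts = Counter(first for first, second in pairs)&lt;br /&gt;
 &lt;br /&gt;
 def phrase_prob(second, first):      # count(first, second) / count(first)&lt;br /&gt;
     return pair_counts[(first, second)] / first_counts[first]&lt;br /&gt;
 &lt;br /&gt;
 print(phrase_prob(&#039;naznačena v programu&#039;, &#039;estimated in the programme&#039;))  # 2/3&lt;br /&gt;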
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i|(i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
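&lt;br /&gt;
The following sketch is a direct transcription of the formula; it assumes every foreign word has at least one alignment point (unaligned words are normally paired with a special NULL token) and uses invented values for w(f_j, e_i):&lt;br /&gt;
&lt;br /&gt;
 # Lexical weight lex(f|e,a): for each foreign position j, average w(f_j, e_i)&lt;br /&gt;
 # over the English positions i aligned to it, then multiply over all j.&lt;br /&gt;
 def lexical_weight(alignment, w, foreign_length):&lt;br /&gt;
     lex = 1.0&lt;br /&gt;
     for j in range(foreign_length):&lt;br /&gt;
         aligned = [i for (i, jj) in alignment if jj == j]&lt;br /&gt;
         lex *= sum(w[(j, i)] for i in aligned) / len(aligned)&lt;br /&gt;
     return lex&lt;br /&gt;
 &lt;br /&gt;
 w = {(0, 0): 0.6, (1, 0): 0.1, (1, 1): 0.4}          # invented w(f_j, e_i), keyed by (j, i)&lt;br /&gt;
 print(lexical_weight([(0, 0), (0, 1), (1, 1)], w, foreign_length=2))   # 0.6 * 0.25 = 0.15&lt;br /&gt;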
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n&#039;&#039;-gram language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
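&lt;br /&gt;
As a toy illustration of the decomposition, the sketch below scores a three-word sentence with a bigram (first-order Markov) model; words are integer ids and all probabilities are invented:&lt;br /&gt;
&lt;br /&gt;
 # Toy bigram model: P(w) = P(w_1) * P(w_2|w_1) * P(w_3|w_2); probabilities invented.&lt;br /&gt;
 import math&lt;br /&gt;
 &lt;br /&gt;
 p_uni = {1: 0.2}                        # P(w_1)&lt;br /&gt;
 p_bi  = {(1, 2): 0.5, (2, 3): 0.4}      # P(w_i | w_i-1)&lt;br /&gt;
 sentence = [1, 2, 3]&lt;br /&gt;
 &lt;br /&gt;
 logprob = math.log(p_uni[sentence[0]]) + sum(&lt;br /&gt;
     math.log(p_bi[(sentence[i - 1], sentence[i])]) for i in range(1, len(sentence)))&lt;br /&gt;
 print(math.exp(logprob))                # 0.2 * 0.5 * 0.4 = 0.04&lt;br /&gt;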
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following one is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
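&lt;br /&gt;
A minimal sketch of the distance-based variant, with phrase spans given as (start, end) source-word positions (end exclusive):&lt;br /&gt;
&lt;br /&gt;
 # Distance-based distortion: for each phrase, add the distance between its start&lt;br /&gt;
 # and the end of the previously translated phrase.&lt;br /&gt;
 def distortion(phrase_spans):&lt;br /&gt;
     total, previous_end = 0, 0&lt;br /&gt;
     for start, end in phrase_spans:&lt;br /&gt;
         total += abs(start - previous_end)&lt;br /&gt;
         previous_end = end&lt;br /&gt;
     return total&lt;br /&gt;
 &lt;br /&gt;
 print(distortion([(0, 2), (4, 6), (2, 4)]))   # 0 + 2 + 4 = 6&lt;br /&gt;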
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: non-local features complicate recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- the information required by the non-local features (e.g. the last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
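&lt;br /&gt;
A minimal sketch of such a recombination key, assuming a 4-gram LM is the only non-local feature:&lt;br /&gt;
&lt;br /&gt;
 # Hypotheses sharing this key cover the same source words and are&lt;br /&gt;
 # indistinguishable to a 4-gram LM, so only the best-scoring one needs to be kept.&lt;br /&gt;
 def recombination_key(coverage, target_words, lm_order=4):&lt;br /&gt;
     return (tuple(coverage), tuple(target_words[-(lm_order - 1):]))&lt;br /&gt;
 &lt;br /&gt;
 best = {}&lt;br /&gt;
 def recombine(coverage, target_words, score):&lt;br /&gt;
     key = recombination_key(coverage, target_words)&lt;br /&gt;
     if key not in best or score &gt; best[key][0]:&lt;br /&gt;
         best[key] = (score, coverage, target_words)&lt;br /&gt;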
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
With syntactic MT, the situation is more complicated because hypotheses are not constructed left-to-right. While in phrase-based search there was only a single boundary between the current partial translation and its extension, SCFG rules can apply anywhere, so we may need to look at words both preceding and following the target side of the rule. This makes state tracking more complicated than in PBMT.&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
We now focus on how to find a good set of weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; for the features. There are many methods for tuning model parameters in MT, such as MERT (Minimum Error Rate Training, described here), PRO (Pairwise Ranked Optimization), or MIRA (Margin Infused Relaxed Algorithm, a general online optimization algorithm applied successfully to MT).&lt;br /&gt;
&lt;br /&gt;
TODO references to papers!&lt;br /&gt;
&lt;br /&gt;
All of them require a tuning set (development set, held-out set) -- a small parallel corpus separated from the training data on which the performance of the proposed weights is evaluated. Choosing a suitable tuning set is black magic (as are many decisions in MT system development). As a general guideline, it should be as similar to the expected test data as possible and the larger, the better (too large tuning sets can take too long to tune on, though).&lt;br /&gt;
&lt;br /&gt;
Minimum Error Rate Training (MERT) has become the de facto standard algorithm for tuning. The tuning process is&lt;br /&gt;
iterative (a minimal code sketch of the loop follows the list):&lt;br /&gt;
&lt;br /&gt;
# Set all weights to some initial values.&lt;br /&gt;
# Translate the tuning set using the current weights; for each sentence, output &#039;&#039;n&#039;&#039; best translations and their feature scores.&lt;br /&gt;
# Run one iteration of MERT to get a new set of weights.&lt;br /&gt;
# If the n-best lists are identical to the previous iteration, return the current weights and exit. Else go back to 2.&lt;br /&gt;
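&lt;br /&gt;
The loop can be sketched as follows; decode() and run_mert() are hypothetical placeholders, not a real toolkit API:&lt;br /&gt;
&lt;br /&gt;
 # Outer tuning loop; decode() and run_mert() are placeholders for illustration.&lt;br /&gt;
 def tune(tuning_set, initial_weights, max_iterations=25):&lt;br /&gt;
     weights, previous_nbest = initial_weights, None&lt;br /&gt;
     for _ in range(max_iterations):&lt;br /&gt;
         nbest = decode(tuning_set, weights)    # n-best lists with feature vectors&lt;br /&gt;
         if nbest == previous_nbest:            # nothing changed: we have converged&lt;br /&gt;
             break&lt;br /&gt;
         weights = run_mert(nbest, weights)     # one MERT iteration&lt;br /&gt;
         previous_nbest = nbest&lt;br /&gt;
     return weights&lt;br /&gt;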
&lt;br /&gt;
The input for MERT is a set of &#039;&#039;&#039;n-best lists&#039;&#039;&#039; -- the &#039;&#039;n&#039;&#039; best translations&lt;br /&gt;
for each sentence in the tuning set. A vector of feature scores is associated&lt;br /&gt;
with each sentence.&lt;br /&gt;
&lt;br /&gt;
First, each translation is scored by the objective function (such as BLEU). In&lt;br /&gt;
each n-best list, the sentence with the best score is assumed to be the best&lt;br /&gt;
translation. The goal of MERT then is to find a set of weights that will&lt;br /&gt;
maximize the overall score, i.e. move good translations to the top of the n-best&lt;br /&gt;
lists.&lt;br /&gt;
&lt;br /&gt;
MERT addresses the dimensionality of the weight space (the space is effectively&lt;br /&gt;
&amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; for &#039;&#039;n&#039;&#039; weights) by optimizing each weight separately.&lt;br /&gt;
&lt;br /&gt;
While the line search is globally optimal (in the one dimension), overall, the&lt;br /&gt;
procedure is likely to reach a local optimum. MERT is therefore usually run from&lt;br /&gt;
a number of different starting positions and the best set of weights is used.&lt;br /&gt;
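&lt;br /&gt;
To illustrate the idea of optimizing a single dimension, the sketch below replaces MERT&#039;s exact line search with a simple grid search over one weight; each n-best entry is a (feature_vector, quality) pair with an invented sentence-level quality score, whereas real MERT optimizes corpus-level BLEU over exact threshold points:&lt;br /&gt;
&lt;br /&gt;
 # Illustration only: optimize weight k by grid search, re-ranking the n-best lists.&lt;br /&gt;
 def optimize_one_weight(nbest_lists, weights, k, grid):&lt;br /&gt;
     def objective(candidate):&lt;br /&gt;
         total = 0.0&lt;br /&gt;
         for nbest in nbest_lists:             # pick the new 1-best under candidate weights&lt;br /&gt;
             feats, quality = max(nbest, key=lambda entry: sum(&lt;br /&gt;
                 wi * fi for wi, fi in zip(candidate, entry[0])))&lt;br /&gt;
             total += quality&lt;br /&gt;
         return total&lt;br /&gt;
     candidates = [weights[:k] + [value] + weights[k + 1:] for value in grid]&lt;br /&gt;
     return max(candidates, key=objective)&lt;br /&gt;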
&lt;br /&gt;
After convergence (or reaching a pre-set maximum number of iterations), the&lt;br /&gt;
weights for the log-linear model are known and the system training is finished.&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization: one, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53627</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53627"/>
		<updated>2015-08-27T16:58:13Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a translation hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i|(i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n&#039;&#039;-gram language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following one is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: non-local features complicate recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- the information required by the non-local features (e.g. the last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
We now focus on how to find a good set of weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; for the features. There are many methods for tuning model parameters in MT, such as MERT (Minimum Error Rate Training, described here), PRO (Pairwise Ranked Optimization), or MIRA (Margin Infused Relaxed Algorithm, a general online optimization algorithm applied successfully to MT).&lt;br /&gt;
&lt;br /&gt;
TODO references to papers!&lt;br /&gt;
&lt;br /&gt;
All of them require a tuning set (development set, held-out set) -- a small parallel corpus separated from the training data on which the performance of the proposed weights is evaluated. Choosing a suitable tuning set is black magic (as are many decisions in MT system development). As a general guideline, it should be as similar to the expected test data as possible and the larger, the better (too large tuning sets can take too long to tune on, though).&lt;br /&gt;
&lt;br /&gt;
Minimum Error Rate Training (MERT) has become the de facto standard algorithm for tuning. The tuning process is&lt;br /&gt;
iterative:&lt;br /&gt;
&lt;br /&gt;
# Set all weights to some initial values.&lt;br /&gt;
# Translate the tuning set using the current weights; for each sentence, output &#039;&#039;n&#039;&#039; best translations and their feature scores.&lt;br /&gt;
# Run one iteration of MERT to get a new set of weights.&lt;br /&gt;
# If the n-best lists are identical to the previous iteration, return the current weights and exit. Else go back to 2.&lt;br /&gt;
&lt;br /&gt;
The input for MERT is a set of &#039;&#039;&#039;n-best lists&#039;&#039;&#039; -- the &#039;&#039;n&#039;&#039; best translations&lt;br /&gt;
for each sentence in the tuning set. A vector of feature scores is associated&lt;br /&gt;
with each sentence.&lt;br /&gt;
&lt;br /&gt;
First, each translation is scored by the objective function (such as BLEU). In&lt;br /&gt;
each n-best list, the sentence with the best score is assumed to be the best&lt;br /&gt;
translation. The goal of MERT then is to find a set of weights that will&lt;br /&gt;
maximize the overall score, i.e. move good translations to the top of the n-best&lt;br /&gt;
lists.&lt;br /&gt;
&lt;br /&gt;
MERT addresses the dimensionality of the weight space (the space is effectively&lt;br /&gt;
&amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; for &#039;&#039;n&#039;&#039; weights) by optimizing each weight separately.&lt;br /&gt;
&lt;br /&gt;
While the line search is globally optimal (in the one dimension), overall, the&lt;br /&gt;
procedure is likely to reach a local optimum. MERT is therefore usually run from&lt;br /&gt;
a number of different starting positions and the best set of weights is used.&lt;br /&gt;
&lt;br /&gt;
After convergence (or reaching a pre-set maximum number of iterations), the&lt;br /&gt;
weights for the log-linear model are known and the system training is finished.&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization: one, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53626</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53626"/>
		<updated>2015-08-27T16:57:45Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a translation hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i|(i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n&#039;&#039;-gram language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following one is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: non-local features complicate recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- the information required by the non-local features (e.g. the last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
We now focus on how to find a good set of weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; for the features. There are many methods for tuning model parameters in MT, such as MERT (Minimum Error Rate Training, described here), PRO (Pairwise Ranked Optimization), or MIRA (Margin Infused Relaxed Algorithm, a general online optimization algorithm applied successfully to MT).&lt;br /&gt;
&lt;br /&gt;
TODO references to papers!&lt;br /&gt;
&lt;br /&gt;
All of them require a tuning set (development set, held-out set) -- a small parallel corpus separated from the training data on which the performance of the proposed weights is evaluated. Choosing a suitable tuning set is black magic (as are many decisions in MT system development). As a general guideline, it should be as similar to the expected test data as possible and the larger, the better (too large tuning sets can take too long to tune on, though).&lt;br /&gt;
&lt;br /&gt;
Minimum Error Rate Training (MERT) has become the de facto standard algorithm for tuning. The tuning process is&lt;br /&gt;
iterative:&lt;br /&gt;
&lt;br /&gt;
# Set all weights to some initial values.&lt;br /&gt;
# Translate the tuning set using the current weights; for each sentence, output &#039;&#039;n&#039;&#039; best translations and their feature scores.&lt;br /&gt;
# Run one iteration of MERT to get a new set of weights.&lt;br /&gt;
# If the n-best lists are identical to the previous iteration, return the current weights and exit. Else go back to 2.&lt;br /&gt;
&lt;br /&gt;
The input for MERT is a set of &#039;&#039;&#039;n-best lists&#039;&#039;&#039; -- the &#039;&#039;n&#039;&#039; best translations&lt;br /&gt;
for each sentence in the tuning set. A vector of feature scores is associated&lt;br /&gt;
with each sentence.&lt;br /&gt;
&lt;br /&gt;
First, each translation is scored by the objective function (such as BLEU). In&lt;br /&gt;
each n-best list, the sentence with the best score is assumed to be the best&lt;br /&gt;
translation. The goal of MERT then is to find a set of weights that will&lt;br /&gt;
maximize the overall score, i.e. move good translations to the top of the n-best&lt;br /&gt;
lists.&lt;br /&gt;
&lt;br /&gt;
MERT addresses the dimensionality of the weight space (the space is effectively&lt;br /&gt;
&amp;lt;math&amp;gt;\mathbb{R}^n&amp;lt;/math&amp;gt; for &#039;&#039;n&#039;&#039; weights) by optimizing each weight separately.&lt;br /&gt;
&lt;br /&gt;
While the line search is globally optimal (in the one dimension), overall, the&lt;br /&gt;
procedure is likely to reach a local optimum. MERT is therefore usually run from&lt;br /&gt;
a number of different starting positions and the best set of weights is used.&lt;br /&gt;
&lt;br /&gt;
After convergence (or reaching a pre-set maximum number of iterations), the&lt;br /&gt;
weights for the log-linear model are known and the system training is finished.&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization: one, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53625</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53625"/>
		<updated>2015-08-27T16:48:36Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a translation hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the search. That means that each partial translation has a score associated with it and we gradually add the values of features for each extension of the partial translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall (i,j) \in a} w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
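&lt;br /&gt;
The following Python sketch mirrors this formula; the word-level probabilities w(f_j, e_i) and the alignment are invented purely for illustration (in practice, unaligned foreign words are handled by aligning them to a special NULL token, which is omitted here for brevity):&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical lexical translation probabilities w(f_j, e_i).&lt;br /&gt;
 w = {(&#039;naznačena&#039;, &#039;estimated&#039;): 0.2, (&#039;v&#039;, &#039;in&#039;): 0.9,&lt;br /&gt;
      (&#039;programu&#039;, &#039;the&#039;): 0.1, (&#039;programu&#039;, &#039;programme&#039;): 0.8}&lt;br /&gt;
 def lexical_weight(f_words, e_words, alignment):&lt;br /&gt;
     # alignment is a set of (i, j) points: English position i aligned to foreign position j.&lt;br /&gt;
     prob = 1.0&lt;br /&gt;
     for j, f_word in enumerate(f_words):&lt;br /&gt;
         aligned = [i for (i, jj) in alignment if jj == j]&lt;br /&gt;
         if aligned:  # average w(f_j, e_i) over the English words aligned to f_j&lt;br /&gt;
             prob *= sum(w.get((f_word, e_words[i]), 0.0) for i in aligned) / len(aligned)&lt;br /&gt;
     return prob&lt;br /&gt;
 f = &#039;naznačena v programu&#039;.split()&lt;br /&gt;
 e = &#039;estimated in the programme&#039;.split()&lt;br /&gt;
 a = {(0, 0), (1, 1), (2, 2), (3, 2)}  # (English i, foreign j) alignment points&lt;br /&gt;
 print(lexical_weight(f, e, a))  # 0.2 * 0.9 * ((0.1 + 0.8) / 2) = 0.081&lt;br /&gt;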
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n&#039;&#039;-gram language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n+1}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the Markov&lt;br /&gt;
assumption of order &#039;&#039;n&#039;&#039;-1. Each word is then conditioned on at most &#039;&#039;n&#039;&#039;-1 preceding&lt;br /&gt;
words and the probability of the whole sequence is the product of the probabilities&lt;br /&gt;
of the individual words. Smoothing is further used to supply probability estimates for unseen n-grams.&lt;br /&gt;
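&lt;br /&gt;
As a toy illustration, the Python sketch below estimates an unsmoothed bigram model by maximum likelihood from a two-sentence corpus and scores a sentence with it; real language models use higher orders, much more data and smoothing (e.g. Kneser-Ney):&lt;br /&gt;
&lt;br /&gt;
 from collections import Counter&lt;br /&gt;
 # BOS and EOS mark the sentence beginning and end.&lt;br /&gt;
 corpus = [&#039;BOS this is a house EOS&#039;, &#039;BOS this is a small house EOS&#039;]&lt;br /&gt;
 sents = [sent.split() for sent in corpus]&lt;br /&gt;
 unigrams = Counter(word for sent in sents for word in sent)&lt;br /&gt;
 bigrams = Counter((sent[i], sent[i + 1]) for sent in sents for i in range(len(sent) - 1))&lt;br /&gt;
 def p_bigram(word, prev):&lt;br /&gt;
     # Maximum-likelihood estimate P(word | prev) = c(prev, word) / c(prev).&lt;br /&gt;
     return bigrams[(prev, word)] / unigrams[prev]&lt;br /&gt;
 def sentence_prob(sentence):&lt;br /&gt;
     words = (&#039;BOS &#039; + sentence + &#039; EOS&#039;).split()&lt;br /&gt;
     prob = 1.0&lt;br /&gt;
     for prev, word in zip(words, words[1:]):&lt;br /&gt;
         prob *= p_bigram(word, prev)  # Markov assumption: condition on one preceding word&lt;br /&gt;
     return prob&lt;br /&gt;
 print(sentence_prob(&#039;this is a house&#039;))  # 1 * 1 * 1 * 0.5 * 1 = 0.5&lt;br /&gt;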
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative, i.e. a reward). Changes to the phrase penalty can lead either to outputs built from word-by-word translations (a small or negative phrase penalty encourages using as many phrases as possible) or, on the other hand, to outputs built from very long phrases (which is usually desirable).&lt;br /&gt;
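&lt;br /&gt;
As a sketch, both features can be read off the phrases used in a hypothesis; the sign convention varies between decoders, so here plain counts are used and the tuned weight decides whether each acts as a penalty or a reward:&lt;br /&gt;
&lt;br /&gt;
 # Target phrases chosen for one hypothesis (illustrative).&lt;br /&gt;
 target_phrases = [&#039;this is&#039;, &#039;a small&#039;, &#039;house&#039;]&lt;br /&gt;
 word_penalty = sum(len(phrase.split()) for phrase in target_phrases)  # 5 words produced&lt;br /&gt;
 phrase_penalty = len(target_phrases)  # 3 phrases used&lt;br /&gt;
 print(word_penalty, phrase_penalty)&lt;br /&gt;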
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
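&lt;br /&gt;
A minimal Python sketch of this distance-based distortion cost for one hypothesis (the source spans are invented for the example):&lt;br /&gt;
&lt;br /&gt;
 # Source spans (start, end) covered by successive phrases, in translation order.&lt;br /&gt;
 source_spans = [(0, 1), (4, 5), (2, 3)]  # the decoder jumps ahead and then back&lt;br /&gt;
 distortion = 0&lt;br /&gt;
 prev_end = -1  # treat the position before the sentence as the end of a dummy phrase&lt;br /&gt;
 for start, end in source_spans:&lt;br /&gt;
     distortion += abs(start - prev_end - 1)  # distance between this start and the previous end&lt;br /&gt;
     prev_end = end&lt;br /&gt;
 print(distortion)  # 0 + 2 + 4 = 6&lt;br /&gt;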
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of hypotheses that have to be explored. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair, so we only need to worry about non-local features: e.g. a 4-gram LM will consider the partial hypotheses equivalent only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: non-local features complicate recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- the information required by the non-local features (e.g. the last three words for a 4-gram LM). We can then safely recombine only hypotheses which have an identical coverage vector and state.&lt;br /&gt;
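&lt;br /&gt;
To make the notion of state concrete, here is a hypothetical Python sketch of the recombination key and of keeping only the best hypothesis per key; the hypothesis representation is simplified to (coverage, target words, score):&lt;br /&gt;
&lt;br /&gt;
 def recombination_key(coverage, target_words, lm_order=4):&lt;br /&gt;
     # Two hypotheses may be recombined only if this key is identical:&lt;br /&gt;
     # same covered source positions and same last (n-1) target words.&lt;br /&gt;
     return (frozenset(coverage), tuple(target_words[-(lm_order - 1):]))&lt;br /&gt;
 def recombine(hypotheses):&lt;br /&gt;
     best = {}&lt;br /&gt;
     for hyp in hypotheses:  # hyp = (coverage, target_words, score)&lt;br /&gt;
         key = recombination_key(hyp[0], hyp[1])&lt;br /&gt;
         # Keep the higher-scoring hypothesis for each key.&lt;br /&gt;
         best[key] = max(best.get(key, hyp), hyp, key=lambda h: h[2])&lt;br /&gt;
     return list(best.values())&lt;br /&gt;
 h1 = ({0, 1, 2}, [&#039;we&#039;, &#039;must&#039;, &#039;also&#039;, &#039;consider&#039;], -7.2)&lt;br /&gt;
 h2 = ({0, 1, 2}, [&#039;one&#039;, &#039;must&#039;, &#039;also&#039;, &#039;consider&#039;], -7.9)&lt;br /&gt;
 print(len(recombine([h1, h2])))  # 1 -- same coverage and last three words, keep the better one&lt;br /&gt;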
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
We now focus on how to find a good set of weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; for the features.&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks devoted to model optimization: one, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011], and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53624</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53624"/>
		<updated>2015-08-27T16:46:07Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the construction of the translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|{i|(i,j) \in a}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach are&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. There are many definitions possible, the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53623</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53623"/>
		<updated>2015-08-27T16:45:36Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Essentially, the &#039;&#039;probability&#039;&#039; (or, less ambitiously, the &#039;&#039;score&#039;&#039;) of a translation is a weighted sum of features &amp;lt;math&amp;gt;f_i&amp;lt;/math&amp;gt;. Feature functions can look at the translation and the source and they output a number. We introduce the common types of features in the following subsections.&lt;br /&gt;
&lt;br /&gt;
Our goal is then to find such a hypothesis that maximizes this score, formally:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;e^* = \text{argmax}_e P(e|f) \propto \exp \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Typically, feature functions are evaluated on &#039;&#039;partial translations&#039;&#039; during the construction of the translation.&lt;br /&gt;
&lt;br /&gt;
We describe how to obtain the weights &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt; in the last section of this lecture.&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|{i|(i,j) \in a}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach are&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. There are many definitions possible, the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53622</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53622"/>
		<updated>2015-08-27T16:24:18Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
So far we haven&#039;t fully described the actual model (most commonly) used in phrase-based and syntactic MT, the &#039;&#039;&#039;log-linear model&#039;&#039;&#039;. For MT, it can be formulated as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;\text{score(e,f)} = \sum_i w_i f_i(e,f)&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|{i|(i,j) \in a}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach are&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. There are many definitions possible, the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53621</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53621"/>
		<updated>2015-08-25T12:39:37Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|{i|(i,j) \in a}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed, so that it can be reliably estimated. The most common approach are&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty can lead to outputs consisting of word-by-word translations (small or negative phrase penalty -- use as many phrases as possible) or on the other hand, to outputs consisting of very long phrases (as is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. There are many definitions possible, the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: e.g. a 4-gram LM which will consider the partial hypotheses identical only if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: it complicates recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- information for the non-local features (e.g. last three words for the LM). We can then only safely recombine hypotheses which have an identical coverage vector and state.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53620</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53620"/>
		<updated>2015-08-25T12:37:29Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|{i|(i,j) \in a}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
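As a toy illustration of this decomposition, the sketch below (ours; it uses naive add-one smoothing, whereas real LMs use techniques such as Kneser-Ney) trains and applies a bigram model:&lt;br /&gt;
&lt;br /&gt;
 from collections import Counter&lt;br /&gt;
 &lt;br /&gt;
 BOS = &#039;BOS&#039;              # beginning-of-sentence marker&lt;br /&gt;
 bigram_count = Counter()&lt;br /&gt;
 unigram_count = Counter()&lt;br /&gt;
 vocab = set()&lt;br /&gt;
 &lt;br /&gt;
 def train(sentences):&lt;br /&gt;
     # sentences: lists of target-language words&lt;br /&gt;
     for sentence in sentences:&lt;br /&gt;
         words = [BOS] + sentence&lt;br /&gt;
         vocab.update(sentence)&lt;br /&gt;
         for prev, word in zip(words, words[1:]):&lt;br /&gt;
             bigram_count[(prev, word)] += 1&lt;br /&gt;
             unigram_count[prev] += 1&lt;br /&gt;
 &lt;br /&gt;
 def p_word(word, prev):&lt;br /&gt;
     # P(word | prev) with add-one smoothing for unseen bigrams&lt;br /&gt;
     return (bigram_count[(prev, word)] + 1.0) / (unigram_count[prev] + len(vocab))&lt;br /&gt;
 &lt;br /&gt;
 def p_sequence(words):&lt;br /&gt;
     # P(w) under the first-order Markov (bigram) assumption&lt;br /&gt;
     prob = 1.0&lt;br /&gt;
     for prev, word in zip([BOS] + words, words):&lt;br /&gt;
         prob *= p_word(word, prev)&lt;br /&gt;
     return prob&lt;br /&gt;
&lt;br /&gt;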
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can push the output towards very short or very long sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Similarly, a small or negative phrase penalty encourages the decoder to use as many phrases as possible, leading to word-by-word translations, while a large phrase penalty favours outputs built from few, long phrases (which is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
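The three penalties above are trivial to compute from the sequence of phrase applications chosen by the decoder. A sketch (with made-up data structures, only to illustrate the definitions):&lt;br /&gt;
&lt;br /&gt;
 def penalty_features(phrase_applications):&lt;br /&gt;
     # phrase_applications: list of (source_start, source_end, target_words)&lt;br /&gt;
     # in the order in which the decoder used the phrases; positions are&lt;br /&gt;
     # 0-based word indices into the source sentence, source_end exclusive&lt;br /&gt;
     word_penalty = sum(len(target) for (start, end, target) in phrase_applications)&lt;br /&gt;
     phrase_penalty = len(phrase_applications)&lt;br /&gt;
     distortion = 0&lt;br /&gt;
     previous_end = 0&lt;br /&gt;
     for (start, end, target) in phrase_applications:&lt;br /&gt;
         # distance between the start of this phrase and the end of the&lt;br /&gt;
         # previously translated phrase, measured in source words&lt;br /&gt;
         distortion += abs(start - previous_end)&lt;br /&gt;
         previous_end = end&lt;br /&gt;
     return word_penalty, phrase_penalty, distortion&lt;br /&gt;
&lt;br /&gt;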
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
Phrase-based search uses &#039;&#039;&#039;hypothesis recombination&#039;&#039;&#039; to reduce the number of possible translations. The basic idea is that when we have two partial hypotheses with an identical coverage vector (they have translated identical portions of the source sentence), we can discard the lower-scoring hypothesis &#039;&#039;&#039;if&#039;&#039;&#039; no future feature function can distinguish between them. Local features do not look outside the current phrase pair so we only need to worry about non-local features: a 4-gram LM will consider the partial hypotheses identical if their last three words do not differ.&lt;br /&gt;
&lt;br /&gt;
This is where the notion of locality comes into play: non-local features complicate recombination during search because partial translations need to maintain a &#039;&#039;&#039;state&#039;&#039;&#039; -- the information required by the non-local features (e.g. the last three words for a 4-gram LM). We can then only safely recombine hypotheses which have an identical coverage vector and an identical state.&lt;br /&gt;
&lt;br /&gt;
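A small sketch of how this recombination can be organised (hypothetical data structures, only to illustrate the idea): hypotheses are grouped by their coverage vector and state -- here the state is just the LM context -- and only the best hypothesis of each group is kept:&lt;br /&gt;
&lt;br /&gt;
 def recombination_key(hypothesis, lm_order=4):&lt;br /&gt;
     # hypothesis.coverage: tuple of booleans, one per source word&lt;br /&gt;
     # hypothesis.target: list of target words produced so far&lt;br /&gt;
     lm_state = tuple(hypothesis.target[-(lm_order - 1):])   # last n-1 words&lt;br /&gt;
     return (hypothesis.coverage, lm_state)&lt;br /&gt;
 &lt;br /&gt;
 def recombine(hypotheses):&lt;br /&gt;
     # keep only the highest-scoring hypothesis for each (coverage, state) pair&lt;br /&gt;
     best = {}&lt;br /&gt;
     for hyp in hypotheses:&lt;br /&gt;
         key = recombination_key(hyp)&lt;br /&gt;
         if key not in best or hyp.score &amp;gt; best[key].score:&lt;br /&gt;
             best[key] = hyp&lt;br /&gt;
     return list(best.values())&lt;br /&gt;
&lt;br /&gt;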
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53619</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53619"/>
		<updated>2015-08-25T12:26:41Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the [http://videolectures.net/hltss2010_eisner_plm/ video lecture] by Jason Eisner. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can push the output towards very short or very long sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Similarly, a small or negative phrase penalty encourages the decoder to use as many phrases as possible, leading to word-by-word translations, while a large phrase penalty favours outputs built from few, long phrases (which is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53618</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53618"/>
		<updated>2015-08-25T12:25:29Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can push the output towards very short or very long sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Similarly, a small or negative phrase penalty encourages the decoder to use as many phrases as possible, leading to word-by-word translations, while a large phrase penalty favours outputs built from few, long phrases (which is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model. If we have a 4-gram LM, for example, we cannot score our new target phrase &amp;lt;math&amp;gt;\mathbf{e} = (e_1,\ldots,e_K)&amp;lt;/math&amp;gt; without knowing the three words that precede it in our translation. The reason is that we need to compute the probability of the first word in that phrase (&amp;lt;math&amp;gt;e_1&amp;lt;/math&amp;gt;) &#039;&#039;given&#039;&#039; the previous context.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53617</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53617"/>
		<updated>2015-08-25T12:19:13Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can push the output towards very short or very long sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Similarly, a small or negative phrase penalty encourages the decoder to use as many phrases as possible, leading to word-by-word translations, while a large phrase penalty favours outputs built from few, long phrases (which is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are &#039;&#039;&#039;local&#039;&#039;&#039;, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local (word penalty is simply the count of words in the target phrase). As we build the translation, we simply add the scores of these local feature functions to the current translation score.&lt;br /&gt;
&lt;br /&gt;
The most prominent example of a &#039;&#039;&#039;non-local&#039;&#039;&#039; feature is the language model.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53616</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53616"/>
		<updated>2015-08-25T12:05:38Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can push the output towards very short or very long sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Similarly, a small or negative phrase penalty encourages the decoder to use as many phrases as possible, leading to word-by-word translations, while a large phrase penalty favours outputs built from few, long phrases (which is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
Some of the feature functions that we have described are local, i.e. their value only depends on the current phrase pair. For example, lexical weights, phrase translation probabilities or word penalty are local.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=MT_Talks&amp;diff=53615</id>
		<title>MT Talks</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=MT_Talks&amp;diff=53615"/>
		<updated>2015-08-25T11:48:07Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[File:banner.png]]&lt;br /&gt;
&lt;br /&gt;
MT Talks is a series of mini-lectures on machine translation.&lt;br /&gt;
&lt;br /&gt;
Our goal is to hit just the right level of detail and technicality to make the talks interesting and attractive to people who are not yet familiar with the field, while mixing in new observations and insights so that even old pals will have a reason to watch us.&lt;br /&gt;
&lt;br /&gt;
MT Talks and the expanded notes on this wiki will never be the ultimate resource for MT, but we would be very happy to serve as an ultimate commented &#039;&#039;directory&#039;&#039; of good pointers.&lt;br /&gt;
&lt;br /&gt;
By the way, this is indeed a Wiki, so your contributions are very welcome! Please register and feel free to add comments, corrections or links to useful resources.&lt;br /&gt;
&lt;br /&gt;
== Our Talks ==&lt;br /&gt;
&lt;br /&gt;
01 &#039;&#039;&#039;[[Intro]]&#039;&#039;&#039;: Why is MT difficult, approaches to MT.&lt;br /&gt;
&lt;br /&gt;
02 &#039;&#039;&#039;[[MT that Deceives]]&#039;&#039;&#039;: Serious translation errors even for short and simple inputs.&lt;br /&gt;
&lt;br /&gt;
03 &#039;&#039;&#039;[[Pre-processing]]&#039;&#039;&#039;: Normalization and other technical tricks bound to help your MT system.&lt;br /&gt;
&lt;br /&gt;
04 &#039;&#039;&#039;[[MT Evaluation in General]]&#039;&#039;&#039;: Techniques of judging MT quality, dimensions of translation quality, number of possible translations.&lt;br /&gt;
&lt;br /&gt;
05 &#039;&#039;&#039;[[Automatic MT Evaluation]]&#039;&#039;&#039;: Two common automatic MT evaluation methods: PER and BLEU.&lt;br /&gt;
&lt;br /&gt;
06 &#039;&#039;&#039;[[Data Acquisition]]&#039;&#039;&#039;: The need for training data in MT and its possible sources, and the diminishing utility of additional data due to Zipf&#039;s law.&lt;br /&gt;
&lt;br /&gt;
07 &#039;&#039;&#039;[[Sentence Alignment]]&#039;&#039;&#039;: An introduction to the Gale &amp;amp; Church sentence alignment algorithm.&lt;br /&gt;
&lt;br /&gt;
08 &#039;&#039;&#039;[[Word Alignment]]&#039;&#039;&#039;: Cutting the chicken-egg problem.&lt;br /&gt;
&lt;br /&gt;
09 &#039;&#039;&#039;[[Phrase-based Model]]&#039;&#039;&#039;: Copy if you can.&lt;br /&gt;
&lt;br /&gt;
10 &#039;&#039;&#039;[[Constituency Trees]]&#039;&#039;&#039;: Divide and conquer.&lt;br /&gt;
&lt;br /&gt;
11 &#039;&#039;&#039;[[Dependency Trees]]&#039;&#039;&#039;: Trees with gaps.&lt;br /&gt;
&lt;br /&gt;
12 &#039;&#039;&#039;[[Rich Vocabulary]]&#039;&#039;&#039;: Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz.&lt;br /&gt;
&lt;br /&gt;
13 &#039;&#039;&#039;[[Scoring and Optimization]]&#039;&#039;&#039;: Features your model features.&lt;br /&gt;
&lt;br /&gt;
== CodEx – Coding Exercises ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
* [https://codex3.ms.mff.cuni.cz/codex-trans/ Log in to CodEx] and solve programming exercises that complement our talks.&lt;br /&gt;
* [[CodEx-Introduction|Brief description of CodEx]]: how to get an account and submit a solution.&lt;br /&gt;
* [[CodEx - Important Notes|Important Notes]] on technical issues&lt;br /&gt;
&lt;br /&gt;
== Contributing ==&lt;br /&gt;
&lt;br /&gt;
Due to spamming, we had to restrict permissions for editing the Wiki. If you&#039;re interested in contributing, please write an email to &#039;&#039;&#039;tamchyna -at- ufal.mff.cuni.cz&#039;&#039;&#039; to obtain a username.&lt;br /&gt;
&lt;br /&gt;
== Other Videolectures on MT ==&lt;br /&gt;
&lt;br /&gt;
* [http://www.upc.edu/learning/courses/mooc/2014-2015/approaches-to-machine/approaches-to-machine Approaches to Machine Translation: Rule-Based, Statistical, Hybrid] (an online course on MT by UPC Barcelona)&lt;br /&gt;
* [https://www.coursera.org/course/nlangp Natural Language Processing at Coursera] by Michael Collins, includes lectures on word-based and phrase-based models. [http://www.cs.columbia.edu/~mcollins/notes-spring2013.html Further notes]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PLVjXYOjST-AokmIxpCr4GexcdtpeOliBc TAUS Machine Translation and Moses Tutorial] (a series of commented slides, MT overview and practical aspects of the Moses Toolkit)&lt;br /&gt;
&lt;br /&gt;
== Acknowledgement ==&lt;br /&gt;
&lt;br /&gt;
The work on this project has been supported by the grant FP7-ICT-2011-7-288487 ([http://www.statmt.org/mosescore/ MosesCore]).&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53614</id>
		<title>Admin RootPage</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Admin_RootPage&amp;diff=53614"/>
		<updated>2015-08-25T11:47:52Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;0x : How to get started with CodEx MT exercises&lt;br /&gt;
&lt;br /&gt;
Our [https://www.youtube.com/playlist?list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V YouTube playlist] -- shows a total number of views, although it differs from the individual video view counts.&lt;br /&gt;
&lt;br /&gt;
[[CodEx - Important Notes]]&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53613</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53613"/>
		<updated>2015-08-25T11:47:01Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simply counting how many times (for the first formula) we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance many long phrases occur together only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i \mid (i,j) \in a\}|} \sum_{\forall(i,j) \in a}w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach is the&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language model, which builds upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then modeled by at most &#039;&#039;n&#039;&#039; preceding words and&lt;br /&gt;
the probability of the whole sequence is the product of probabilities of&lt;br /&gt;
individual words. Smoothing is further used to supply probability estimates to unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can push the output towards very short or very long sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Similarly, a small or negative phrase penalty encourages the decoder to use as many phrases as possible, leading to word-by-word translations, while a large phrase penalty favours outputs built from few, long phrases (which is usually desirable).&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; the following is commonly used: for each phrase, its value is&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization. One, by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011] and one in 2015: [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53612</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53612"/>
		<updated>2015-08-25T11:46:21Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simple counting: for the first formula, we count how many times we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and divide by how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
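As a minimal sketch of this relative-frequency estimation, the following Python snippet recomputes the 3/9 value from a hard-coded copy of the pairs above:&lt;br /&gt;
&lt;br /&gt;
 from collections import Counter&lt;br /&gt;
 # The extracted pairs from the excerpt above, as (English, Czech) tuples.&lt;br /&gt;
 pairs = [('estimated in the programme', 'naznačena v programu')] * 3&lt;br /&gt;
 pairs = pairs + [('estimated in the programme', 'odhadován v programu')]&lt;br /&gt;
 pairs = pairs + [('estimated in the programme', 'odhadovány v programu')] * 2&lt;br /&gt;
 pairs = pairs + [('estimated in the programme', 'předpokládal program')]&lt;br /&gt;
 pairs = pairs + [('estimated in the programme', 'v programu uvedeným')] * 2&lt;br /&gt;
 pair_counts = Counter(pairs)&lt;br /&gt;
 e_counts = Counter(e for e, f in pairs)&lt;br /&gt;
 def p_f_given_e(f, e):&lt;br /&gt;
     # P(f|e) = count(e, f) / count(e)&lt;br /&gt;
     return pair_counts[(e, f)] / e_counts[e]&lt;br /&gt;
 print(p_f_given_e('naznačena v programu', 'estimated in the programme'))   # 3/9&lt;br /&gt;
&lt;br /&gt;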
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance, many long phrase pairs occur only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i|(i,j) \in a\}|} \sum_{\forall(i,j) \in a} w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
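A small Python sketch of this computation; the word-level probabilities and the alignment points below are invented for illustration, and unaligned foreign words (which the full formula pairs with a NULL token) are not handled.&lt;br /&gt;
&lt;br /&gt;
 # Invented word translation probabilities w(f_j, e_i).&lt;br /&gt;
 w = {('das', 'the'): 0.6, ('haus', 'house'): 0.8}&lt;br /&gt;
 def lexical_weight(f_words, e_words, alignment, w):&lt;br /&gt;
     # lex(f|e, a): for each foreign word, average w over the English words&lt;br /&gt;
     # aligned to it, then multiply these averages over all foreign words.&lt;br /&gt;
     weight = 1.0&lt;br /&gt;
     for j, f_word in enumerate(f_words):&lt;br /&gt;
         aligned = [i for (i, jj) in alignment if jj == j]&lt;br /&gt;
         avg = sum(w[(f_word, e_words[i])] for i in aligned) / len(aligned)&lt;br /&gt;
         weight = weight * avg&lt;br /&gt;
     return weight&lt;br /&gt;
 # 'das haus' aligned to 'the house' with alignment points das-the and haus-house.&lt;br /&gt;
 print(lexical_weight(['das', 'haus'], ['the', 'house'], [(0, 0), (1, 1)], w))   # 0.6 * 0.8 = 0.48&lt;br /&gt;
&lt;br /&gt;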
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach uses&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models, which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then conditioned on at most &#039;&#039;n&#039;&#039; preceding words, and&lt;br /&gt;
the probability of the whole sequence is the product of the probabilities of the&lt;br /&gt;
individual words. Smoothing is then used to provide probability estimates for unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty push the output either towards word-by-word translations (a small or negative phrase penalty encourages using as many phrases as possible) or towards translations composed of very long phrases, which is usually desirable.&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; a commonly used one assigns to each phrase&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization: one, held by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011], and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53611</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53611"/>
		<updated>2015-08-25T11:46:08Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simple counting: for the first formula, we count how many times we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and divide by how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance, many long phrase pairs occur only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i|(i,j) \in a\}|} \sum_{\forall(i,j) \in a} w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach uses&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models, which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then conditioned on at most &#039;&#039;n&#039;&#039; preceding words, and&lt;br /&gt;
the probability of the whole sequence is the product of the probabilities of the&lt;br /&gt;
individual words. Smoothing is then used to provide probability estimates for unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty push the output either towards word-by-word translations (a small or negative phrase penalty encourages using as many phrases as possible) or towards translations composed of very long phrases, which is usually desirable.&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; a commonly used one assigns to each phrase&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization: one, held by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011], and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
	<entry>
		<id>https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53610</id>
		<title>Scoring and Optimization</title>
		<link rel="alternate" type="text/html" href="https://mttalks.ufal.ms.mff.cuni.cz/index.php?title=Scoring_and_Optimization&amp;diff=53610"/>
		<updated>2015-08-25T11:45:45Z</updated>

		<summary type="html">&lt;p&gt;Tamchyna: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;{{Infobox&lt;br /&gt;
|title = Lecture 13: Scoring and Optimization&lt;br /&gt;
|image = [[File:features.png|200px]]&lt;br /&gt;
|label1 = Lecture video:&lt;br /&gt;
|data1 = [http://example.com web &#039;&#039;&#039;TODO&#039;&#039;&#039;] &amp;lt;br/&amp;gt; [https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V Youtube]}}&lt;br /&gt;
&lt;br /&gt;
{{#ev:youtube|https://www.youtube.com/watch?v=oxhc0Nv_ySw&amp;amp;index=11&amp;amp;list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V|800|center}}&lt;br /&gt;
&lt;br /&gt;
== Features of MT Models ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase Translation Probabilities ===&lt;br /&gt;
&lt;br /&gt;
Phrase translation probabilities are calculated from occurrences of phrase pairs extracted from the parallel training data. Usually, MT systems work with the following two conditional probabilities:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f})&amp;lt;/math&amp;gt;&lt;br /&gt;
* &amp;lt;math&amp;gt;P(\mathbf{f}|\mathbf{e})&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
These probabilities are estimated by simple counting: for the first formula, we count how many times we saw &amp;lt;math&amp;gt;\mathbf{e}&amp;lt;/math&amp;gt; aligned to &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; and divide by how many times we saw &amp;lt;math&amp;gt;\mathbf{f}&amp;lt;/math&amp;gt; in total. For example, based on the following excerpt from (sorted) extracted phrase pairs, we estimate that &amp;lt;math&amp;gt;P(\text{naznačena v programu} | \text{estimated in the programme}) = 3/9&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| naznačena v programu&lt;br /&gt;
 estimated in the programme ||| odhadován v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu&lt;br /&gt;
 estimated in the programme ||| odhadovány v programu &lt;br /&gt;
 estimated in the programme ||| předpokládal program&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
 estimated in the programme ||| v programu uvedeným&lt;br /&gt;
&lt;br /&gt;
=== Lexical Weights ===&lt;br /&gt;
&lt;br /&gt;
Lexical weights are a method for smoothing the phrase table. Infrequent phrases have unreliable&lt;br /&gt;
probability estimates; for instance, many long phrase pairs occur only once&lt;br /&gt;
in the corpus, resulting in &amp;lt;math&amp;gt;P(\mathbf{e}|\mathbf{f}) = P(\mathbf{f}|\mathbf{e})&lt;br /&gt;
= 1&amp;lt;/math&amp;gt;. Several methods exist for computing lexical weights. The most common one&lt;br /&gt;
is based on word alignment inside the phrase. The&lt;br /&gt;
probability of each &#039;&#039;foreign&#039;&#039; word &amp;lt;math&amp;gt;f_j&amp;lt;/math&amp;gt; is estimated as the average of&lt;br /&gt;
lexical translation probabilities &amp;lt;math&amp;gt;w(f_j, e_i)&amp;lt;/math&amp;gt; over the English words aligned&lt;br /&gt;
to it.  Thus for the phrase &amp;lt;math&amp;gt;(\mathbf{e},\mathbf{f})&amp;lt;/math&amp;gt; with the set of alignment&lt;br /&gt;
points &amp;lt;math&amp;gt;a&amp;lt;/math&amp;gt;, the lexical weight is:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\text{lex}(\mathbf{f}|\mathbf{e},a) = \prod_{j=1}^{l_f}&lt;br /&gt;
  \frac{1}{|\{i|(i,j) \in a\}|} \sum_{\forall(i,j) \in a} w(f_j, e_i)&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Language Model ===&lt;br /&gt;
&lt;br /&gt;
The task of language modeling in machine translation is to estimate how likely a&lt;br /&gt;
sequence of words &amp;lt;math&amp;gt;\mathbf{w} = (w_1, \ldots, w_l)&amp;lt;/math&amp;gt; is in the target language.&lt;br /&gt;
&lt;br /&gt;
When translating, the decoder generates translation hypotheses which are&lt;br /&gt;
probable according to the translation model (i.e. the phrase table). The&lt;br /&gt;
language model then scores these hypotheses according to how probable (common,&lt;br /&gt;
fluent) they are in the target language. The final translation is then something like a compromise -- the&lt;br /&gt;
sentence that is both fluent and a good translation of the input.&lt;br /&gt;
&lt;br /&gt;
Similarly to the translation model, sequence probabilities are learned from data&lt;br /&gt;
using maximum likelihood estimation. For language modeling, only monolingual&lt;br /&gt;
data are needed (a resource available in much larger amounts than parallel texts). &lt;br /&gt;
&lt;br /&gt;
Naturally, the prediction of the whole sequence &amp;lt;math&amp;gt;\mathbf{w}&amp;lt;/math&amp;gt; has to be&lt;br /&gt;
decomposed so that it can be reliably estimated. The most common approach uses&lt;br /&gt;
&#039;&#039;n-gram&#039;&#039; language models, which build upon the Markov assumption: a word&lt;br /&gt;
depends only on a limited, fixed number of preceding words. The decomposition is&lt;br /&gt;
done as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
P(\mathbf{w}) &amp;amp; = P(w_1)P(w_2|w_1)P(w_3|w_1,w_2) \ldots P(w_l|w_1,\ldots,w_{l-1}) \\&lt;br /&gt;
 &amp;amp; \approx P(w_1)P(w_2|w_1) \ldots P(w_l|w_{l-n}, \ldots, w_{l-1})&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The first equality follows from the chain rule and the second from the &#039;&#039;n&#039;&#039;-th order&lt;br /&gt;
Markov assumption. Each word is then conditioned on at most &#039;&#039;n&#039;&#039; preceding words, and&lt;br /&gt;
the probability of the whole sequence is the product of the probabilities of the&lt;br /&gt;
individual words. Smoothing is then used to provide probability estimates for unseen n-grams.&lt;br /&gt;
&lt;br /&gt;
A great introduction to language modeling is the video lecture by [http://videolectures.net/hltss2010_eisner_plm/ Jason Eisner]. LMs are covered in more depth in the Stanford NLP lectures on [https://www.coursera.org/course/nlp Coursera]; videos from the Coursera course can be found on [https://www.youtube.com/playlist?list=PLaRKlIqjjguC-20Glu7XVAXm6Bd6Gs7Qi YouTube].&lt;br /&gt;
&lt;br /&gt;
=== Word and Phrase Penalty ===&lt;br /&gt;
&lt;br /&gt;
For each word and for each phrase produced, the decoder pays a constant cost. Tweaking the word penalty can lead to either very short or very long output sentences (the &amp;quot;penalty&amp;quot; can also be negative -- a reward). Changes to the phrase penalty push the output either towards word-by-word translations (a small or negative phrase penalty encourages using as many phrases as possible) or towards translations composed of very long phrases, which is usually desirable.&lt;br /&gt;
&lt;br /&gt;
=== Distortion Penalty ===&lt;br /&gt;
&lt;br /&gt;
The distortion penalty is the cost which the MT system pays for shuffling words (or phrases) around. Many definitions are possible; a commonly used one assigns to each phrase&lt;br /&gt;
the distance (measured in words) between its beginning and the end of the preceding phrase. This &#039;&#039;&#039;distance-based&#039;&#039;&#039; reordering can be replaced by more sophisticated models, such as [http://www.statmt.org/moses/?n=Advanced.Models#ntoc1 lexicalized reordering].&lt;br /&gt;
&lt;br /&gt;
== Decoding ==&lt;br /&gt;
&lt;br /&gt;
=== Phrase-Based Search ===&lt;br /&gt;
&lt;br /&gt;
We have [[Phrase-based Model#Decoding|already described]] the decoding algorithm for phrase-based MT. Here we discuss how feature values are calculated in the search.&lt;br /&gt;
&lt;br /&gt;
=== Decoding in SCFG ===&lt;br /&gt;
&lt;br /&gt;
== Optimization of Feature Weights ==&lt;br /&gt;
&lt;br /&gt;
Note that there have even been shared tasks in model optimization: one, held by invitation only, in [http://www.statmt.org/wmt11/tunable-metrics-task.html 2011], and another in 2015, the [http://www.statmt.org/wmt15/tuning-task/ WMT15 Tuning Task].&lt;/div&gt;</summary>
		<author><name>Tamchyna</name></author>
	</entry>
</feed>