Sentence Alignment: Difference between revisions

From MT Talks
Jump to navigation Jump to search
No edit summary
No edit summary
Line 10: Line 10:
{{#ev:youtube|https://www.youtube.com/watch?v=_4lnyoC3mtQ|800|center}}
{{#ev:youtube|https://www.youtube.com/watch?v=_4lnyoC3mtQ|800|center}}


Sentence alignment is an essential step in building a translation system. Often, we have some parallel data (texts in the source and target language which are translations of each other) but we don't know exactly, which sentences correspond to each other. The task here is to find this correspondence (alignment).


Gale & Church algorithm<ref name="galechurch">William Gale, Kenneth Church. ''[http://www.aclweb.org/anthology/J93-1004.pdf A Program for Aligning Sentences in Bilingual Corpora]''</ref>
Once sentence alignment is available, we can proceed further by finding word or phrase correspondences within the aligned sentences, but that's a topic for another lecture.
 
Gale & Church algorithm<ref name="galechurch">William Gale, Kenneth Church. ''[http://www.aclweb.org/anthology/J93-1004.pdf A Program for Aligning Sentences in Bilingual Corpora]''</ref> is an algorithm for sentence alignment. It assumes that documents are already aligned on the level of paragraphs. For each paragraph, it finds which sentences correspond to each other.
 
It is formulated as a [http://en.wikipedia.org/wiki/Dynamic_programming dynamic programming] algorithm, quite analogous to [Levenshtein distance http://en.wikipedia.org/wiki/Levenshtein_distance].
 
== Possible Operations ==
 
Similarly to string edit distance, a sentence can be '''inserted''', '''deleted''' or '''substituted'''. However, Gale & Church define a few more operations:
 
* '''contraction''' -- two source-side sentences correspond to one target sentence
* '''expansion''' -- one source-side sentence corresponds to two target sentences
* '''merge''' -- two source-side sentences correspond to two target sentences (but there is not 1-1 correspondence)
 
== Distance Function ==
 
A distance measure (or a cost function) is required so that we can look for a minimal solution. First, let us define some notation (identical to the original paper):
 
* <math>d(x_1, y_1, 0, 0)</math> -- the cost of ''substituting'' <math>x_1?</math> with <math>y_1</math>
* <math>d(x_1, 0, 0, 0)</math> -- the cost of ''deleting'' <math>x_1</math>
* <math>d(0, y_1, 0, 0)</math> -- the cost of ''inserting'' <math>y_1</math>
* <math>d(x_1, y_1, x_2, 0)</math> -- the cost of ''contracting'' <math>x_1</math> and <math>x_2</math> to <math>y_1</math>
* <math>d(x_1, y_1, 0, y_2)</math> -- the cost of ''expanding'' <math>x_1</math> to <math>y_1</math> and <math>y_2</math>
* <math>d(x_1, y_1, x_2, y_2)</math> -- the cost of ''merging'' <math>x_1, x_2</math> with <math>y_1, y_2</math>


== Other Algorithms & Tools ==
== Other Algorithms & Tools ==

Revision as of 13:03, 10 March 2015

Lecture 7: Sentence Alignment
Lecture video: web TODO
Youtube
Exercises: Gale & Church algorithm

{{#ev:youtube|https://www.youtube.com/watch?v=_4lnyoC3mtQ%7C800%7Ccenter}}

Sentence alignment is an essential step in building a translation system. Often, we have some parallel data (texts in the source and target language which are translations of each other) but we don't know exactly, which sentences correspond to each other. The task here is to find this correspondence (alignment).

Once sentence alignment is available, we can proceed further by finding word or phrase correspondences within the aligned sentences, but that's a topic for another lecture.

Gale & Church algorithm[1] is an algorithm for sentence alignment. It assumes that documents are already aligned on the level of paragraphs. For each paragraph, it finds which sentences correspond to each other.

It is formulated as a dynamic programming algorithm, quite analogous to [Levenshtein distance http://en.wikipedia.org/wiki/Levenshtein_distance].

Possible Operations

Similarly to string edit distance, a sentence can be inserted, deleted or substituted. However, Gale & Church define a few more operations:

  • contraction -- two source-side sentences correspond to one target sentence
  • expansion -- one source-side sentence corresponds to two target sentences
  • merge -- two source-side sentences correspond to two target sentences (but there is not 1-1 correspondence)

Distance Function

A distance measure (or a cost function) is required so that we can look for a minimal solution. First, let us define some notation (identical to the original paper):

  • -- the cost of substituting with
  • -- the cost of deleting
  • -- the cost of inserting
  • -- the cost of contracting and to
  • -- the cost of expanding to and
  • -- the cost of merging with

Other Algorithms & Tools

Exercises

References