Sentence Alignment
Latest revision as of 14:23, 10 March 2015
Lecture video: web TODO Youtube
Exercises: Gale & Church algorithm
{{#ev:youtube|https://www.youtube.com/watch?v=_4lnyoC3mtQ|800|center}}
Sentence alignment is an essential step in building a translation system. Often, we have some parallel data (texts in the source and target language which are translations of each other) but we don't know exactly which sentences correspond to each other. The task here is to find this correspondence (alignment).
Once sentence alignment is available, we can proceed further by finding word or phrase correspondences within the aligned sentences, but that's a topic for another lecture.
The Gale & Church algorithm[1] is an algorithm for sentence alignment. It assumes that documents are already aligned at the level of paragraphs. For each paragraph, it finds which sentences correspond to each other.
It is formulated as a dynamic programming algorithm, quite analogous to Levenshtein distance.
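To make the analogy concrete, here is a minimal, illustrative Python implementation of Levenshtein distance (not part of the lecture); the Gale & Church algorithm fills an analogous table, only over sentences instead of characters:

```python
def levenshtein(a, b):
    """Classic dynamic-programming string edit distance."""
    # D[i][j] = edit distance between a[:i] and b[:j]
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        D[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        D[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                           # deletion
                D[i][j - 1] + 1,                           # insertion
                D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # substitution
            )
    return D[len(a)][len(b)]

print(levenshtein("kitten", "sitting"))  # 3
```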
Possible Operations
Similarly to string edit distance, a sentence can be:
- deleted -- a source-side sentence with no corresponding target-side sentence
- inserted -- a target-side sentence with no corresponding source-side sentence
- substituted -- a pair of source- and target-side sentences which correspond to each other 1-1 (ideally, the most frequent scenario)
However, Gale & Church define a few more operations:
- contraction -- two source-side sentences correspond to one target sentence
- expansion -- one source-side sentence corresponds to two target sentences
- merge -- two source-side sentences correspond to two target sentences (but there is no 1-1 correspondence)
Distance Function
A distance measure (or a cost function) is required so that we can look for a minimal solution. Gale & Church observe that length differences (measured in characters) between matching sentences tend to be normally distributed. Let <math>c</math> be the average ratio between sentence lengths (for zero mean, <math>c</math> would be 1), <math>s^2</math> the observed variance, and <math>l_1, l_2</math> the lengths of the source and target sentence, respectively. Then we define:

<math>
\delta = (l_2 - l_1 c) / \sqrt{l_1 s^2}
</math>
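This quantity is straightforward to compute. A minimal sketch in Python; the defaults <math>c = 1</math> and <math>s^2 = 6.8</math> are the values used in the Gale & Church paper, while the guard against a zero-length source sentence is my own addition:

```python
import math

def delta(l1, l2, c=1.0, s2=6.8):
    """delta = (l2 - l1*c) / sqrt(l1 * s2), the normalized difference
    between source length l1 and target length l2 (in characters).
    c=1 and s2=6.8 are the values estimated by Gale & Church."""
    if l1 == 0:
        l1 = 1  # avoid division by zero for an empty source (assumption)
    return (l2 - l1 * c) / math.sqrt(l1 * s2)

print(round(delta(100, 110), 3))  # 0.383
```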
<math>\delta</math> is a zero-mean, unit-variance, normally distributed random variable. We can use it to define our ''distance measure'' as the inverse of the conditional probability of a match given a difference <math>\delta</math>. Following Bayes' rule and dropping the (constant) denominator, we obtain:

<math>P(\text{match} \mid \delta) \propto P(\delta \mid \text{match}) \cdot P(\text{match})</math>
We use <math>-\log P(\text{match} \mid \delta)</math> so that lower cost is better and so that the algorithm can sum the values (which corresponds to multiplying the underlying probabilities).
Gale & Church estimate the prior <math>P(\text{match})</math> empirically from the data; see Table 5 in the paper.
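For reference, the priors from Table 5 can be stored as a simple lookup keyed by the number of source and target sentences involved. A sketch; note that the paper reports combined probabilities for the 1-0/0-1 and 2-1/1-2 categories, which I assign to each direction here as a simplification:

```python
import math

# Priors Pr(match) based on Table 5 of Gale & Church (1993).
PRIOR = {
    (1, 1): 0.89,    # substitution
    (1, 0): 0.0099,  # deletion
    (0, 1): 0.0099,  # insertion
    (2, 1): 0.089,   # contraction
    (1, 2): 0.089,   # expansion
    (2, 2): 0.011,   # merge
}

def prior_cost(n_src, n_tgt):
    """Negative log-prior, to be added to the length-based cost."""
    return -math.log(PRIOR[(n_src, n_tgt)])
```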
The likelihood can be formulated as:

<math>
P(\delta \mid \text{match}) = 2(1 - P(|\delta|))
</math>

where <math>P(|\delta|)</math> is the cumulative distribution function of a zero-mean, unit-variance normal distribution.
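Putting the likelihood and the prior together gives the cost of a match. A sketch in Python: the standard normal CDF can be expressed via the error function, and the clamping of tiny likelihoods (to avoid taking the log of zero) is my own implementation detail:

```python
import math

def norm_cdf(z):
    """CDF of the zero-mean, unit-variance normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def match_cost(delta_value, prior):
    """-log(P(delta | match) * P(match)),
    with P(delta | match) = 2 * (1 - P(|delta|))."""
    likelihood = 2.0 * (1.0 - norm_cdf(abs(delta_value)))
    likelihood = max(likelihood, 1e-12)  # clamp to avoid log(0)
    return -math.log(likelihood) - math.log(prior)

print(match_cost(0.0, 0.89) < match_cost(2.0, 0.89))  # True: smaller delta is cheaper
```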
Algorithm Formulation
Let us define some notation (identical to the original paper):

- <math>d(x_1, y_1, 0, 0)</math> -- the cost of ''substituting'' <math>x_1</math> with <math>y_1</math>
- <math>d(x_1, 0, 0, 0)</math> -- the cost of ''deleting'' <math>x_1</math>
- <math>d(0, y_1, 0, 0)</math> -- the cost of ''inserting'' <math>y_1</math>
- <math>d(x_1, y_1, x_2, 0)</math> -- the cost of ''contracting'' <math>x_1</math> and <math>x_2</math> to <math>y_1</math>
- <math>d(x_1, y_1, 0, y_2)</math> -- the cost of ''expanding'' <math>x_1</math> to <math>y_1</math> and <math>y_2</math>
- <math>d(x_1, y_1, x_2, y_2)</math> -- the cost of ''merging'' <math>x_1, x_2</math> with <math>y_1, y_2</math>
Then, the algorithm can be defined very simply using the following recursive formula. Let source-side sentences (within a paragraph) be <math>x_i, i = 1 \ldots I</math> and target-side sentences <math>y_j, j = 1 \ldots J</math>:

<math>
D(i, j) = \min \begin{cases} D(i, j - 1) + d(0, y_j, 0, 0) \\
D(i - 1, j) + d(x_i, 0, 0, 0) \\
D(i - 1, j - 1) + d(x_i, y_j, 0, 0) \\
D(i - 1, j - 2) + d(x_i, y_j, 0, y_{j-1}) \\
D(i - 2, j - 1) + d(x_i, y_j, x_{i-1}, 0) \\
D(i - 2, j - 2) + d(x_i, y_j, x_{i-1}, y_{j-1}) \end{cases}
</math>
Again, similarly to string edit distance, the minimum total distance can be read off the table cell <math>(I, J)</math> and backtracking can be used to find the actual alignment.
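The recursion translates almost directly into code. Below is a self-contained Python sketch: representing sentences by their character lengths, the function names, and the clamping of tiny likelihoods are my assumptions, while <math>c = 1</math>, <math>s^2 = 6.8</math> and the priors follow the paper (combined 1-0/0-1 and 2-1/1-2 probabilities assigned to each direction as a simplification):

```python
import math

PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def cost(src_lens, tgt_lens, c=1.0, s2=6.8):
    """Cost d(...) of matching the given groups of source/target sentences,
    represented by character lengths (an empty group encodes insert/delete)."""
    l1, l2 = sum(src_lens), sum(tgt_lens)
    delta = (l2 - l1 * c) / math.sqrt(max(l1, 1) * s2)
    cdf = 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0)))
    likelihood = max(2.0 * (1.0 - cdf), 1e-12)  # clamp to avoid log(0)
    return -math.log(likelihood) - math.log(PRIOR[(len(src_lens), len(tgt_lens))])

def align(src, tgt):
    """Gale & Church DP over one paragraph: src/tgt are lists of sentence
    lengths (characters). Returns the alignment as (n_src, n_tgt) steps."""
    I, J = len(src), len(tgt)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    # the six operations: insert, delete, substitute, expand, contract, merge
    moves = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
    for i in range(I + 1):
        for j in range(J + 1):
            if D[i][j] == INF:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > I or nj > J:
                    continue
                d = D[i][j] + cost(src[i:ni], tgt[j:nj])
                if d < D[ni][nj]:
                    D[ni][nj] = d
                    back[ni][nj] = (di, dj)
    # backtrack from (I, J) to recover the actual alignment
    steps, i, j = [], I, J
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        steps.append((di, dj))
        i, j = i - di, j - dj
    return steps[::-1]

print(align([40, 41, 40], [42, 39, 43]))  # [(1, 1), (1, 1), (1, 1)]
```

Note how the six `moves` mirror the six cases of the recursive formula; the length-based cost makes a 1-2 expansion far cheaper than a match followed by an implausible insertion whenever the concatenated target lengths fit the source.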
Other Algorithms & Tools
- A comparison and evaluation of various approaches to sentence alignment[2]
- Hunalign[3]
- Gargantua[4]
- Bleualign[5]
Exercises
References
- ↑ William Gale, Kenneth Church. A Program for Aligning Sentences in Bilingual Corpora
- ↑ Alexandr Rosen. In Search of the Best Method for Sentence Alignment in Parallel Texts
- ↑ D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy. Parallel corpora for medium density languages
- ↑ Fabienne Braune, Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora
- ↑ Rico Sennrich, Martin Volk. Iterative, MT-based sentence alignment of parallel texts