Sentence Alignment
![]() | |
| Lecture video: |
web TODO Youtube |
|---|---|
| Exercises: | Gale & Church algorithm |
Sentence alignment is an essential step in building a translation system. Often, we have some parallel data (texts in the source and target language which are translations of each other) but we don't know exactly which sentences correspond to each other. The task here is to find this correspondence (alignment).
Once sentence alignment is available, we can proceed further by finding word or phrase correspondences within the aligned sentences, but that's a topic for another lecture.
Gale & Church algorithm[1] is an algorithm for sentence alignment. It assumes that documents are already aligned on the level of paragraphs. For each paragraph, it finds which sentences correspond to each other.
It is formulated as a dynamic programming algorithm, quite analogous to [Levenshtein distance http://en.wikipedia.org/wiki/Levenshtein_distance].
Possible Operations
Similarly to string edit distance, a sentence can be:
- deleted -- a source-side sentence with no corresponding target-side sentence
- inserted -- a target-side sentence with no corresponding source-side sentence
- substituted -- a pair of source- and target-side sentences which correspond to each other 1-1 (ideally, the most frequent scenario)
However, Gale & Church define a few more operations:
- contraction -- two source-side sentences correspond to one target sentence
- expansion -- one source-side sentence corresponds to two target sentences
- merge -- two source-side sentences correspond to two target sentences (but there is not 1-1 correspondence)
Distance Function
A distance measure (or a cost function) is required so that we can look for a minimal solution. Gale & Church use a probabilistic view so that the costs can be combined in a principled way (by summing log-probabilities, i.e. by multiplying probabilities). We therefore look for:
Where
First, let us define some notation (identical to the original paper):
- -- the cost of substituting with
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d(x_1, 0, 0, 0)} -- the cost of deleting Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_1}
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d(0, y_1, 0, 0)} -- the cost of inserting Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y_1}
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d(x_1, y_1, x_2, 0)} -- the cost of contracting Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_1} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_2} to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y_1}
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d(x_1, y_1, 0, y_2)} -- the cost of expanding Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_1} to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y_1} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y_2}
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle d(x_1, y_1, x_2, y_2)} -- the cost of merging Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_1, x_2} with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y_1, y_2}
Then, the algorithm can be defined very simply using the following recursive formula. Let source-side sentences (within a paragraph) be Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x_i, i = 1 \ldots I} and let target-side sentences be Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle y_j, j = 1 \ldots J} :
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle D(i, j) = \min \begin{cases} D(i, j - 1) + d(0, y_j, 0, 0) \\ D(i - 1, j) + d(x_i, 0, 0, 0) \\ D(i - 1, j - 1) + d(x_i, y_j, 0, 0) \\ D(i - 1, j - 2) + d(x_i, y_j, 0, y_{j-1}) \\ D(i - 2, j - 1) + d(x_i, y_j, x_{i-1}, 0) \\ D(i - 2, j - 2) + d(x_i, y_j, x_{i-1}, y_{j-1}) \end{cases} }
Other Algorithms & Tools
- A comparison and evaluation of various approaches to sentence alignment[2]
- Hunalign[3]
- Gargantua[4]
- Bleualign[5]
Exercises
References
- ↑ William Gale, Kenneth Church. A Program for Aligning Sentences in Bilingual Corpora
- ↑ Alexandr Rosen. In Search of the Best Method for Sentence Alignment in Parallel Texts
- ↑ D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy. Parallel corpora for medium density languages
- ↑ Fabienne Braune, Alexander Fraser. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora
- ↑ Rico Sennrich, Martin Volk. Iterative, MT-based sentence alignment of parallel texts
