Intro: Difference between revisions

From MT Talks
(31 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Infobox
|title = Lecture 1: Intro
|image = [[File:sociable_tank.png|200px]]
|label1 = Lecture video:
|data1 = [http://lectures.ms.mff.cuni.cz/view.php?rec=239 web] <br/> [https://www.youtube.com/watch?v=kOY_F1UTySs&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V&index=1 Youtube]
|label2 = Supplementary materials:
|data2 = [http://prezi.com/lschzdbtqhts/?utm_campaign=share&utm_medium=copy Prezi]
}}
{{#ev:youtube|https://www.youtube.com/watch?v=kOY_F1UTySs&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V&index=1|800|center}}
== Ambiguity in language ==

Line 20: Line 31:
== Types of MT systems ==

[[File:pyramid.png|thumb|500px|'''Vauquois triangle.''' Illustrates the possible approaches to linguistic abstraction in MT.]]


Approaches to MT can be categorized by whether they work directly with surface words or whether they utilize some (linguistic) abstraction. Many successful MT systems disregard any linguistic information and treat all words as unrelated, indivisible units. Other systems perform linguistic '''analysis''' on the source side and then do '''transfer''' -- either to some abstract representation or directly to target-side surface words. In the first case, target-side '''generation''' is needed to create the surface words of the translation.
[[File:en-tlayer.png|thumb|300px|'''A deep-syntactic parse of an English sentence.''' "However, he tried to find refuge in Brazil".]]


Another possible distinction is how the systems are "trained" -- in the past, linguistic experts would manually develop rules to describe the analysis, transfer or generation for a particular language pair. Such '''rule-based''' systems sometimes grew to very mature, complex systems. However, they can be very costly to build and difficult to adapt -- either to a new genre/domain or to different languages. The other end of this continuum is occupied by purely '''statistical''' systems which only require data and utilize statistical models or machine learning to capture the knowledge required for translation. Finally, many flavors of '''hybrid''' systems have been developed, which combine data-driven and rule-based components in some way.
Line 28: Line 41:
=== System combination ===

Different (types of) MT systems are prone to different errors. Their outputs can thus hopefully be combined to obtain a better translation than any of the individual translation hypotheses.
 
== Pre-processing ==
 
Text data in the wild come in all kinds of forms: documents with different encodings, mark-up or annotation, and articles and discussions on the web full of abbreviations and typos. Pre-processing is the essential subtask of converting all this data into a unified form that the MT system can handle.
 
== MT evaluation ==
 
Evaluation of translation quality is essential for system development. '''Manual''' evaluation seems ideal at first glance; however, human judges disagree surprisingly often when comparing outputs of different MT systems. Moreover, such evaluation is labor-intensive and not easily reproducible. '''Automatic''' measures have therefore been developed -- in essence, these compare the MT output to some ''reference translation''.
 
Additionally, '''quality estimation''' is a field that develops methods to recognize whether a translation is good ''without'' a reference translation or manual judgement. Such a score can help estimate the amount of work that a professional translator needs to do -- just confirm that a translation is correct, make some minor edits or re-write it from scratch.
 
== Bird's Eye Overview of MT ==
 
[[File:mt-overview.png|1000px|'''A broad overview of MT'''.]]

Latest revision as of 15:20, 27 January 2015

Lecture 1: Intro
Lecture video: web, Youtube
Supplementary materials: Prezi

{{#ev:youtube|https://www.youtube.com/watch?v=kOY_F1UTySs&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V&index=1|800|center}}

Ambiguity in language

Unusual grammatical constructions with unexpected meaning can be used to (deliberately) mislead a human reader. These are called garden path sentences. Consider some of the best-known examples:

  • Fat people eat accumulates.
  • The horse raced past the barn fell.
  • The government plans to raise taxes were defeated.

But everyday sentences actually contain countless ambiguities which humans resolve so naturally that they do not even notice them. Knowledge of the world and context are essential.

The plant is next to the bank.

  • plant
    • factory?
    • flower?
  • bank
    • financial institution?
    • river side?
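
To make the size of the problem concrete, the readings of the example sentence can be enumerated mechanically. The following is only a minimal sketch: the sense inventory `SENSES` and the function `readings` are hypothetical names introduced for illustration, and a real system would need world knowledge and context to choose among the readings rather than just list them.

```python
from itertools import product

# Hypothetical sense inventory covering only the two ambiguous words above.
SENSES = {
    "plant": ["factory", "flower"],
    "bank": ["financial institution", "river side"],
}

def readings(sentence):
    """Return every combination of senses for the ambiguous words in a sentence."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    ambiguous = [t for t in tokens if t in SENSES]
    return [
        dict(zip(ambiguous, combo))
        for combo in product(*(SENSES[t] for t in ambiguous))
    ]

all_readings = readings("The plant is next to the bank.")
for reading in all_readings:
    print(reading)
```

Two words with two senses each already give four candidate readings; the count multiplies with every further ambiguous word.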

Types of MT systems

Vauquois triangle. Illustrates the possible approaches to linguistic abstraction in MT.

Approaches to MT can be categorized by whether they work directly with surface words or whether they utilize some (linguistic) abstraction. Many successful MT systems disregard any linguistic information and treat all words as unrelated, indivisible units. Other systems perform linguistic analysis on the source side and then do transfer -- either to some abstract representation or directly to target-side surface words. In the first case, target-side generation is needed to create the surface words of the translation.

A deep-syntactic parse of an English sentence. "However, he tried to find refuge in Brazil".

Another possible distinction is how the systems are "trained" -- in the past, linguistic experts would manually develop rules to describe the analysis, transfer or generation for a particular language pair. Such rule-based systems sometimes grew to very mature, complex systems. However, they can be very costly to build and difficult to adapt -- either to a new genre/domain or to different languages. The other end of this continuum is occupied by purely statistical systems which only require data and utilize statistical models or machine learning to capture the knowledge required for translation. Finally, many flavors of hybrid systems have been developed, which combine data-driven and rule-based components in some way.

System combination

Different (types of) MT systems are prone to different errors. Their outputs can thus hopefully be combined to obtain a better translation than any of the individual translation hypotheses.
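
One very simple way to exploit several systems is to select the hypothesis that agrees most with the others, on the assumption that independent systems rarely make the same mistake. The sketch below does only that: it picks one whole hypothesis by word-set overlap and does not merge outputs at the word level. The function names `overlap` and `select_consensus` and the example hypotheses are illustrative assumptions, not a description of any particular system.

```python
def overlap(a, b):
    """Jaccard overlap between the word sets of two hypotheses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def select_consensus(hypotheses):
    """Pick the hypothesis that is, on average, most similar to all the others."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    if len(hypotheses) == 1:
        return hypotheses[0]

    def average_similarity(i):
        others = [h for j, h in enumerate(hypotheses) if j != i]
        return sum(overlap(hypotheses[i], h) for h in others) / len(others)

    best = max(range(len(hypotheses)), key=average_similarity)
    return hypotheses[best]

# Made-up outputs of three hypothetical systems for the same source sentence.
outputs = [
    "he tried to find refuge in Brazil",
    "he tried to find a refuge in Brazil",
    "he attempted finding shelter at Brazil",
]
print(select_consensus(outputs))
```

Selection can never be better than the best individual hypothesis; obtaining a translation better than all of them requires recombining parts of different outputs.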

Pre-processing

Text data in the wild come in all kinds of forms: documents with different encodings, mark-up or annotation, and articles and discussions on the web full of abbreviations and typos. Pre-processing is the essential subtask of converting all this data into a unified form that the MT system can handle.
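
A minimal sketch of such a unification step is shown below, using only the Python standard library. The function name `preprocess`, the choice of UTF-8 as the default encoding, and the naive regular-expression tokenizer are all assumptions made for illustration; real pipelines typically need language-specific tokenization and more careful mark-up handling.

```python
import html
import re
import unicodedata

def preprocess(raw, encoding="utf-8"):
    """Decode, strip simple mark-up, normalize Unicode and whitespace, tokenize."""
    # Accept either bytes (decoded with the assumed encoding) or text.
    text = raw.decode(encoding, errors="replace") if isinstance(raw, bytes) else raw
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML-like tags
    text = html.unescape(text)                  # turn entities such as &amp; into characters
    text = unicodedata.normalize("NFC", text)   # unify composed and decomposed characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    # Naive tokenization: words, with each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(preprocess(b"<p>Caf\xc3\xa9  is   open!</p>"))
```

Normalizing to a single Unicode form matters because visually identical strings with different code points would otherwise be treated as different words.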

MT evaluation

Evaluation of translation quality is essential for system development. Manual evaluation seems ideal at first glance; however, human judges disagree surprisingly often when comparing outputs of different MT systems. Moreover, such evaluation is labor-intensive and not easily reproducible. Automatic measures have therefore been developed -- in essence, these compare the MT output to some reference translation.
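
The idea of comparing MT output to a reference can be illustrated with clipped unigram precision: the fraction of output words that also appear in the reference, where each reference word can be matched only as often as it occurs. This is a deliberately simplified sketch, not a full metric; established measures such as BLEU additionally use longer n-grams and a brevity penalty. The function name `unigram_precision` and the example sentences are assumptions made for illustration.

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Fraction of hypothesis words found in the reference, with clipped counts."""
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    # A hypothesis word is credited at most as many times as the reference contains it.
    matched = sum(
        min(count, ref_counts[word])
        for word, count in Counter(hyp_words).items()
    )
    return matched / len(hyp_words)

reference = "however he tried to find refuge in brazil"
score = unigram_precision("however he tried to find shelter in brazil", reference)
print(score)  # 7 of 8 words match the reference: 0.875
```

Note that "shelter" is penalized although it is a plausible translation, which shows why a score against a single reference is only a rough proxy for quality.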

Additionally, quality estimation is a field that develops methods to recognize whether a translation is good without a reference translation or manual judgement. Such a score can help estimate the amount of work that a professional translator needs to do -- just confirm that a translation is correct, make some minor edits or re-write it from scratch.

Bird's Eye Overview of MT

A broad overview of MT.