Deep Syntax

From MT Talks
Revision as of 14:27, 7 October 2015 by Tamchyna (talk | contribs)
Lecture 14: Deep Syntax
Lecture video: web TODO
YouTube: https://www.youtube.com/watch?v=lJwCW2mFk2M&index=11&list=PLpiLOsNLsfmbeH-b865BwfH15W0sat02V

Functional Generative Description

The Functional Generative Description (FGD) is a linguistic theory developed by Petr Sgall in Prague in the 1960s. It formally describes language as a system of layers, ranging from the most basic (phonology) to the most abstract (deep syntax/semantics, the tectogrammatical layer). The theory was developed with the intention of capturing language computationally, and indeed much of it has been implemented as computer programs. However, the system of layers was gradually simplified, and currently only four layers are used (we refer to the annotation scheme of the Prague Dependency Treebank).

Prague Dependency Treebank

The Prague Dependency Treebank (PDT) is a corpus of Czech sentences manually annotated according to the FGD. An example of the layered description is shown in the following image (taken from the PDT-2.0 documentation):

The lowest layer contains the sentence "as is", without any annotation. The m-layer provides a morphological analysis for each word (and also fixes typing errors). The a-layer is a dependency tree which describes the surface syntax of the sentence. Finally, the t-layer is a more abstract dependency tree which describes the deep syntax of the sentence.
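The layering described above can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class names, the toy English sentence, the part-of-speech tags and the particular labels (`Pred`, `Sb`, `AuxV`, `AuxP`, `Adv`, `PRED`, `ACT`, `LOC`) are chosen for illustration and do not reproduce the actual PDT file format or its full tag sets. It assumes that a t-layer node may cover several a-layer nodes, so that function words such as auxiliaries and prepositions have no t-node of their own.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MNode:
    form: str    # word form as it appears in the raw sentence
    lemma: str   # morphological lemma
    tag: str     # toy part-of-speech tag (not the real PDT tag set)

@dataclass
class ANode:
    m: MNode                # link down to the m-layer
    afun: str               # surface-syntactic function (illustrative label)
    parent: Optional[int]   # index of the governing a-node, None for the root

@dataclass
class TNode:
    t_lemma: str            # deep lemma
    functor: str            # deep-syntactic role (illustrative label)
    parent: Optional[int]   # index of the governing t-node, None for the root
    a_nodes: List[int]      # indices of the a-layer nodes this t-node covers

# Lowest layer: the sentence "as is".
raw = "Mary will sleep in Prague"

m_layer = [
    MNode("Mary", "Mary", "NOUN"),
    MNode("will", "will", "AUX"),
    MNode("sleep", "sleep", "VERB"),
    MNode("in", "in", "ADP"),
    MNode("Prague", "Prague", "NOUN"),
]

# a-layer: one node per token, surface dependency tree.
a_layer = [
    ANode(m_layer[0], "Sb", 2),
    ANode(m_layer[1], "AuxV", 2),
    ANode(m_layer[2], "Pred", None),
    ANode(m_layer[3], "AuxP", 2),
    ANode(m_layer[4], "Adv", 3),
]

# t-layer: fewer nodes, function words are folded into content words.
t_layer = [
    TNode("sleep", "PRED", None, [1, 2]),
    TNode("Mary", "ACT", 0, [0]),
    TNode("Prague", "LOC", 0, [3, 4]),
]

def surface_forms(t_node: TNode, a_nodes: List[ANode]) -> List[str]:
    """Return the surface word forms covered by one t-layer node."""
    return [a_nodes[i].m.form for i in t_node.a_nodes]
```

In this toy example, `surface_forms(t_layer[0], a_layer)` gives the auxiliary together with the main verb, which illustrates how the deep tree abstracts away from individual surface tokens.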

VALLEX

One of the central notions in FGD and PDT is (verb) valency. Essentially, valency is the ability of verbs to require arguments: for example, most verbs require an actor (subject), but only some require an object. VALLEX is a fine-grained valency dictionary of Czech verbs. The assumption underlying this dictionary is that different valency frames roughly correspond to different verb senses.
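To make the link between valency frames and verb senses concrete, here is a minimal sketch of a valency lexicon as a lookup table. Everything in it is hypothetical: the entries are toy English verbs rather than real VALLEX records, the sense names are invented, and the matching rule (exact equality between a frame's obligatory slots and the observed argument labels) is a deliberate simplification.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# A frame is (sense name, set of obligatory argument labels).
Frame = Tuple[str, FrozenSet[str]]

# Hypothetical toy lexicon; NOT taken from VALLEX.
TOY_LEXICON: Dict[str, List[Frame]] = {
    "sleep": [("rest", frozenset({"ACT"}))],
    "give": [("hand_over", frozenset({"ACT", "PAT", "ADDR"}))],
    "run": [
        ("move_fast", frozenset({"ACT"})),
        ("manage", frozenset({"ACT", "PAT"})),
    ],
}

def matching_senses(lemma: str, observed: Set[str]) -> List[str]:
    """Return the senses whose obligatory slots equal the observed arguments."""
    frames = TOY_LEXICON.get(lemma, [])
    return [sense for sense, slots in frames if slots == observed]

# "She runs a company": an actor and a patient are present.
print(matching_senses("run", {"ACT", "PAT"}))
# "She runs": only an actor is present.
print(matching_senses("run", {"ACT"}))
```

Under these assumptions, the set of arguments observed in a sentence is enough to pick one sense of the toy verb "run", which is the intuition behind treating frames as approximate sense distinctions.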

MT Using Deep Syntax: TectoMT

TectoMT is an implementation of the FGD framework for machine translation. It follows the analysis-transfer-synthesis approach and was developed primarily for English-Czech translation, although it has recently been extended to support other languages such as Dutch, German and Basque.

The input sentence is first analysed up to the tectogrammatical layer (deep syntax). This layer is assumed to be abstract enough that the structure of the dependency tree is language independent. This allows the transfer phase to only "relabel" the tree nodes instead of performing a full tree-to-tree transfer, which would include structural transformations. Once a deep syntactic representation of the translation is produced, the synthesis (generation) phase constructs the surface representation in the target language.
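The "relabel only" idea behind the transfer phase can be sketched in a few lines. This is an illustration under stated assumptions, not TectoMT's actual code: the `TNode` class, the three-entry dictionary and the functor labels are all hypothetical, and the sketch covers only transfer. Analysis is assumed to have already produced the source tree, and synthesis (word order, inflection, function words) is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TNode:
    t_lemma: str                 # deep lemma of the node
    functor: str                 # deep-syntactic role (illustrative label)
    children: List["TNode"] = field(default_factory=list)

# Hypothetical English-to-Czech lemma dictionary with three entries.
TOY_DICT: Dict[str, str] = {"sleep": "spát", "Mary": "Marie", "Prague": "Praha"}

def transfer(node: TNode, lexicon: Dict[str, str]) -> TNode:
    """Copy the tree node by node, translating lemmas and keeping the structure."""
    return TNode(
        t_lemma=lexicon.get(node.t_lemma, node.t_lemma),  # unknown lemmas pass through
        functor=node.functor,
        children=[transfer(child, lexicon) for child in node.children],
    )

def shape(node: TNode):
    """Return the tree structure with lemmas removed, for comparing two trees."""
    return (node.functor, [shape(child) for child in node.children])

# Assumed output of the analysis phase for "Mary will sleep in Prague".
source = TNode("sleep", "PRED", [TNode("Mary", "ACT"), TNode("Prague", "LOC")])
target = transfer(source, TOY_DICT)

# The target tree has new labels but exactly the same structure.
assert shape(source) == shape(target)
```

The design choice worth noting is that `transfer` never adds, removes or reattaches nodes. That is only viable if the deep trees of the two languages really are isomorphic, which is the assumption the paragraph above describes.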

The following picture shows an example of Czech-English translation.