In this presentation, I will describe the key cross-language annotation guidelines developed to support state-of-the-art machine translation systems. The guidelines aim to improve the quality of statistical machine translation output through linguistically informed and motivated annotation of special-case multiword units and semantico-syntactic translation units. The guidelines were based on the alignment of bilingual texts from the common test set of the Europarl corpus. The bilingual texts cover all language pairs among English, Spanish, French, and Portuguese. The major challenges I will discuss fall into four classes: lexical and semantico-syntactic (multiword units, compound verbs, and prepositional predicates); morphological (lexical versus non-lexical realization, such as determiners and zero determiners, the pro-drop phenomenon, including subject pronoun drop and empty relative pronouns, and contracted forms); morpho-syntactic (free noun adjuncts); and semantico-discursive (emphatic linguistic constructions such as tautology, pleonasm, and repetition, and focus constructions). I will also present CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and to facilitate the alignment of meaning and translation units. The inter-annotator agreement for English-Portuguese is 0.98 for word alignment and 0.54 for multiword and semantico-syntactic unit alignment, for an overall agreement of 0.87. The gold collection and the alignment tool are publicly available.
Cross-Language Alignments: Challenges, Guidelines and Gold Sets
July 3, 2012
1:00 pm