Cross-Language Alignments: Challenges, Guidelines and Gold Sets

July 3, 2012

1:00 pm

In this presentation I will describe the key cross-language annotation guidelines designed to support state-of-the-art machine translation systems. The guidelines aim to improve the quality of statistical machine translation output through linguistically informed and motivated annotation of special-case multiwords and semantico-syntactic translation units. The guidelines are based on the alignment of bilingual texts from the common test set of the Europarl corpus. The bilingual texts cover all possible pairings of English, Spanish, French, and Portuguese. The major challenges I will discuss are grouped into four classes: lexical and semantico-syntactic (multiword units, compound verbs, and prepositional predicates); morphological (lexical versus non-lexical realization, such as determiners and zero determiners, the pro-drop phenomenon, including subject pronoun drop and empty relative pronouns, and contracted forms); morpho-syntactic (free noun adjuncts); and semantico-discursive (emphatic linguistic constructions such as tautology, pleonasm and repetition, and focus constructions). I will also present CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and to facilitate the alignment of meaning and translation units. Inter-annotator agreement for English-Portuguese is 0.98 for word alignment and 0.54 for multiword and semantico-syntactic unit alignment, which represents a total agreement of 0.87. The gold collection and the alignment tool are publicly available.
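The abstract does not say which agreement measure was used or how the total figure was derived, so the following is only an illustrative sketch under stated assumptions. It treats each annotator's output as a set of (source index, target index) links and scores the overlap with a Dice-style ratio. The function names, the toy link sets, and the 0.75 share are hypothetical. The share is chosen only because a weighted mean with that split happens to reproduce the reported 0.87; it is not a figure taken from the talk.

```python
from typing import Set, Tuple

# A link pairs a source-token index with a target-token index (assumed representation).
Link = Tuple[int, int]


def link_agreement(a: Set[Link], b: Set[Link]) -> float:
    """Dice-style overlap between two annotators' alignment link sets."""
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def pooled_agreement(word_score: float, unit_score: float, word_share: float) -> float:
    """Weighted mean of two agreement scores; word_share is a hypothetical weight."""
    return word_share * word_score + (1 - word_share) * unit_score


# Toy data: the annotators agree on three of their four links.
annotator_1 = {(0, 0), (1, 1), (2, 3), (3, 2)}
annotator_2 = {(0, 0), (1, 1), (2, 3), (3, 4)}
toy_score = link_agreement(annotator_1, annotator_2)

# Assumption: word alignments carry 75% of the weight, unit alignments 25%.
total = pooled_agreement(0.98, 0.54, word_share=0.75)
print(toy_score, round(total, 2))
```

Under that assumed weighting, the two reported scores combine to 0.87. A different measure, such as a chance-corrected coefficient, would need the full annotation data rather than the summary figures.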

Anabela Barreiro

Anabela Barreiro is an invited researcher at the Spoken Language Systems Laboratory of INESC-ID Lisbon. She holds a PhD in Linguistics and works in the areas of machine translation and paraphrasing applied to authoring aids, text production and revision, and cross-language tasks. Her post-doctoral work consists of developing a new hybrid machine translation system that applies linguistically enhanced natural language processing resources (semantico-syntactic knowledge) to statistical machine translation. She has over 7 years of experience in the development of commercial machine translation systems at Logos Corporation, USA. More recently, she has been endorsing the OpenLogos open-source machine translation system initiative. She has substantial experience in the development of linguistic resources (monolingual and multilingual) and natural language processing tools. She is the author of several journal publications on machine translation, paraphrases, and linguistic resources.

L2F, INESC-ID
