In this presentation I will describe the key cross-language annotation guidelines designed to support state-of-the-art machine translation systems. The guidelines aim to improve the quality of statistical machine translation output through linguistically informed and motivated annotation of special-case multiword units and semantico-syntactic translation units. The guidelines are based on the alignment of bilingual texts from the common test set of the Europarl corpus. The bilingual texts cover all possible pairings of English, Spanish, French, and Portuguese. The major challenges I will discuss fall into four classes: lexical and semantico-syntactic (multiword units, compound verbs, and prepositional predicates); morphological (lexical versus non-lexical realization, such as determiners and zero determiners; the pro-drop phenomenon, including subject pronoun drop and empty relative pronouns; and contracted forms); morpho-syntactic (free noun adjuncts); and semantico-discursive (emphatic linguistic constructions such as tautology, pleonasm, and repetition, and focus constructions). I will also present CLUE-Aligner, a tool developed to reduce ambiguity in the alignment process and to facilitate the alignment of meaning and translation units. The inter-annotator agreement for English-Portuguese word alignment is 0.98, and for multiword and semantico-syntactic unit alignment it is 0.54, which represents a total agreement of 0.87. The gold collection and the alignment tool are publicly available.
Cross-Language Alignments: Challenges, Guidelines and Gold Sets
July 3, 2012
1:00 pm
Anabela Barreiro
Anabela Barreiro is an invited researcher at the Spoken Language Systems Laboratory of INESC-ID Lisbon. She holds a PhD in Linguistics and works in the areas of machine translation and paraphrasing applied to authoring aids, text production and revision, and cross-language tasks. Her post-doctoral work consists of developing a new hybrid machine translation system that applies linguistically enhanced natural language processing resources (semantico-syntactic knowledge) to statistical machine translation. She has over 7 years of experience in the development of commercial machine translation systems at Logos Corporation, USA. More recently, she has been supporting the OpenLogos open-source machine translation system initiative. She has substantial experience in the development of linguistic resources (monolingual and multilingual) and natural language processing tools. She is the author of several journal publications on machine translation, paraphrases, and linguistic resources.
L2F, INESC-ID