In natural language processing, we often lack enough data in the language of interest to build a model, even though data for the same problem exists in another language. A line of research addresses this by devising multilingual models that transfer information from resource-rich languages to resource-poor ones. One way to perform this cross-lingual transfer is to leverage parallel data (which often already exists), in which the same text is written in different languages. A recent approach with promising results uses parallel data to obtain distributed representations of text (embeddings). These embeddings capture the semantics of text as dense real-valued vectors, which makes them useful for a wide array of tasks. Since handcrafting universal, language-independent features is difficult, generating such features automatically through embeddings is desirable. In this talk, we explore how parallel data and text embeddings can be used to perform cross-lingual document classification.
Cross-lingual Document Classification
March 15, 2016
1:00 pm
Daniel Ferreira
Daniel Ferreira is currently doing research at Priberam Labs. He received his BSc and MSc degrees in applied mathematics from Instituto Superior Técnico (IST), Portugal, in 2013 and 2015, respectively. He has been working in natural language processing since 2014.



