Cross-lingual Document Classification

March 15, 2016

1:00 pm

In natural language processing, we often lack enough data to build a model in the language we are interested in, while data for the same problem exists in another language. A line of research addresses this by devising multilingual models that transfer information from resource-rich languages to resource-poor ones. One way to perform this cross-lingual transfer is to leverage parallel data (which often already exists), in which the same text is written in different languages. A recent approach with promising results uses parallel data to obtain distributed representations of text (embeddings). These embeddings capture the semantics of text as dense real-valued vectors, which makes them useful for a wide range of tasks. Since handcrafting universal, language-independent features is hard, generating such features automatically through embeddings is desirable. In this talk, we explore how to use parallel data and text embeddings to perform cross-lingual document classification.

Daniel Ferreira

Daniel Ferreira is currently doing research at Priberam Labs. He received his BSc and MSc degrees in applied mathematics from Instituto Superior Técnico (IST), Portugal, in 2013 and 2015, respectively. He has been working in natural language processing since 2014.
