Cross-lingual Document Classification

In natural language processing, we often lack enough data to build a model in the language we are interested in, while data for the same problem exists in another language. A line of research has therefore focused on multilingual models that transfer information from resource-rich languages to resource-poor ones. One way to perform this cross-lingual transfer is to leverage parallel data (which often already exists), in which the same text is written in different languages. A recent approach with promising results uses parallel data to obtain distributed representations of text (embeddings). These embeddings capture the semantics of text as dense real-valued vectors, which makes them useful for a wide array of tasks. Since handcrafting universal, language-independent features is difficult, generating such features automatically through embeddings is desirable. In this talk, we explore how parallel data and text embeddings can be used to perform cross-lingual document classification.
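To make the idea concrete, here is a minimal sketch of cross-lingual transfer through a shared embedding space. It is not the method presented in the talk. It assumes that word vectors for English and Portuguese already live in one space (in practice they would be learned from parallel data); the vectors, the vocabulary, the averaging of word vectors into a document vector, and the nearest-centroid classifier are all illustrative assumptions.

```python
# Hypothetical shared embedding space: translation pairs get nearby vectors.
# In practice these vectors would be learned from parallel data; the values
# below are invented for illustration only.
EMBEDDINGS = {
    "en": {"goal": [0.90, 0.10], "match": [0.80, 0.20],
           "bank": [0.10, 0.90], "market": [0.20, 0.80]},
    "pt": {"golo": [0.88, 0.12], "jogo": [0.79, 0.21],
           "banco": [0.12, 0.90], "mercado": [0.20, 0.79]},
}


def embed_document(tokens, lang):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[lang][t] for t in tokens if t in EMBEDDINGS[lang]]
    if not vectors:
        raise ValueError("no known tokens in document")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def train_centroids(labeled_docs, lang):
    """Compute one mean document vector (centroid) per class label."""
    sums, counts = {}, {}
    for tokens, label in labeled_docs:
        vec = embed_document(tokens, lang)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(tokens, lang, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    vec = embed_document(tokens, lang)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))


# Train on labeled English documents only (the resource-rich language).
english_train = [
    (["goal", "match"], "sports"),
    (["bank", "market"], "finance"),
]
centroids = train_centroids(english_train, "en")

# Classify Portuguese documents without any Portuguese training labels.
prediction_sports = classify(["golo", "jogo"], "pt", centroids)
prediction_finance = classify(["banco", "mercado"], "pt", centroids)
```

Because both languages share one vector space, the classifier trained on English documents can be applied to Portuguese documents unchanged; the quality of the transfer depends entirely on how well the embeddings align translations.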

Daniel Ferreira

Daniel Ferreira is currently doing research at Priberam Labs. He received his BSc and MSc degrees in applied mathematics from Instituto Superior Técnico (IST), Portugal, in 2013 and 2015, respectively. He has been working in natural language processing since 2014.