Priberam

Seminars

Event-based Multi-document Summarization

The daily volume of news reporting real-world events is growing exponentially. At the same time, organizations are looking for event information in a fast, summarized form to support their decisions. Event-based summarization systems offer an efficient solution to this problem.

We proposed multi-document summarization methods based on the hierarchical combination of single-document summaries, and improved them using event information. Our approach builds on a two-stage single-document method that first extracts a collection of key phrases and then uses them in a centrality-as-relevance model.
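As a rough illustration of the two-stage idea, the sketch below ranks sentences by how often they appear in the "support sets" of other passages, where the passages are the sentences plus a list of key phrases. It is a simplified toy under stated assumptions, not the system presented in the talk: the function names, the bag-of-words cosine similarity, and the `support_size` parameter are all hypothetical choices, and the key phrases are assumed to come from a separate extraction stage.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a deliberately naive representation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences, key_phrases, support_size=2):
    """Return sentence indices, most central first.

    Every passage (sentence or key phrase) selects its `support_size`
    most similar sentences as its support set. A sentence's score is
    the number of support sets that contain it.
    """
    passages = list(sentences) + list(key_phrases)
    vectors = [bag_of_words(p) for p in passages]
    votes = Counter()
    for i, vec in enumerate(vectors):
        # Only sentences are candidates; a sentence never supports itself.
        scored = [(cosine(vec, vectors[j]), j)
                  for j in range(len(sentences)) if j != i]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for score, j in scored[:support_size]:
            if score > 0:
                votes[j] += 1
    return sorted(range(len(sentences)), key=lambda j: (-votes[j], j))


if __name__ == "__main__":
    docs = [
        "The earthquake struck the coastal city at dawn.",
        "Rescue teams reached the coastal city after the earthquake.",
        "The local football team won its match.",
    ]
    print(rank_sentences(docs, ["earthquake", "coastal city"]))
```

In this toy setup the key phrases act as extra "voters", so sentences that share vocabulary with them rise in the ranking. A multi-document variant could apply the same ranking again to the concatenated single-document summaries, which is one way to read the hierarchical combination described above.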

To adapt centrality-as-relevance single-document summarization into a multi-document method that can use event information, we needed a good and adaptable baseline system. Because key phrase extraction is important in summarization, we improved a state-of-the-art key phrase extraction toolkit using four additional sets of semantic features. The event detection method is based on Fuzzy Fingerprints and is trained on documents with annotated event tags. We explored three different ways to integrate event information, achieving state-of-the-art results in both single- and multi-document summarization using filtering and event-based features. We complemented event information with word embeddings.

The automatic evaluation and the user study show that these methods improve upon current state-of-the-art multi-document summarization systems on the DUC 2007 and TAC 2009 evaluation datasets. We show a relative improvement in ROUGE-1 scores of 16% for TAC 2009 and 17% for DUC 2007. We also obtained improvements in ROUGE-1 over current state-of-the-art single-document summarization systems of 32% on clean data and 19% on noisy data. These improvements derive from the inclusion of key phrases and event information. Key phrase extraction was also refined with additional pre-processing steps and features, which led to a relative improvement in NDCG scores of 9%. Event detection based on Fuzzy Fingerprints detected all event types, while an SVM detected only roughly 85% of them.
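For readers unfamiliar with the metric, the sketch below shows how a ROUGE-1 recall score and a relative improvement can be computed. It is a minimal illustration that assumes plain lowercased word tokens and a single reference summary, whereas the official ROUGE toolkit supports options such as stemming and multiple references. The function names and the example scores are hypothetical and are not taken from the evaluation reported above.

```python
import re
from collections import Counter


def unigrams(text):
    """Lowercased word counts for a summary."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rouge1_recall(candidate, reference):
    """Fraction of reference unigrams that also occur in the candidate.

    Counts are clipped, so a word repeated in the candidate is credited
    at most as many times as it occurs in the reference.
    """
    cand, ref = unigrams(candidate), unigrams(reference)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, e.g. 0.16 for 16%."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    print(rouge1_recall("the cat sat", "the cat sat on the mat"))
    # Illustrative scores only: a baseline of 0.400 and a system at 0.464.
    print(round(relative_improvement(0.400, 0.464), 2))
```

A relative improvement of 16% therefore means the new score is 1.16 times the baseline score, not that the score rose by 16 percentage points.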

Luís Marujo

Luís Marujo is a Data Scientist at Feedzai Research. He completed his dual-degree Ph.D. in Language Technologies in 2015 at Carnegie Mellon University (CMU) and Instituto Superior Técnico (IST), Portugal. He also obtained an MSc (2012) in Language Technologies from CMU, and holds an MSc (2009) and a BSc (2007) in Computer Science and Engineering from IST. He received the best poster award at S3MR 2011.
