Priberam Compressive Summarization Corpus

This page provides links and information about the Priberam Compressive Summarization Corpus (PCSC). The corpus contains 801 documents split into 80 topics; each topic has 10 documents, except one, which has 11. The documents are news stories from major Portuguese newspapers, radio and TV stations. Each topic also has two human-generated summaries of up to 100 words. The human summaries are compressive: the annotators performed only sentence and word deletion operations.
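To make the notion of a compressive summary concrete, here is a minimal Python sketch of a check that a summary could have been produced by sentence and word deletion alone. It is an illustration under stated assumptions, not the procedure the annotators or authors used: the function names are hypothetical, tokens are obtained by naive whitespace splitting, and the example sentences are invented English text rather than corpus data.

```python
def is_subsequence(short, long):
    """Return True if `short` can be obtained from `long` by deleting tokens only."""
    it = iter(long)
    # `tok in it` advances the iterator, so token order is preserved.
    return all(tok in it for tok in short)


def is_compressive(summary_sentences, source_sentences):
    """Check that every summary sentence is a word-deletion of some source sentence.

    Assumption: sentences are pre-split and tokens are whitespace-separated.
    """
    sources = [s.split() for s in source_sentences]
    return all(
        any(is_subsequence(summ.split(), src) for src in sources)
        for summ in summary_sentences
    )


def within_word_limit(summary_sentences, limit=100):
    """Check the summary length against the corpus limit of 100 words."""
    return sum(len(s.split()) for s in summary_sentences) <= limit


# Toy sentences for illustration only; they are not taken from the corpus.
source = [
    "The government announced new measures on Monday .",
    "Opposition parties criticised the plan .",
]
summary = ["The government announced measures ."]
```

With these toy inputs, `is_compressive(summary, source)` holds because the summary sentence drops words from the first source sentence without reordering or adding any. A sentence that rearranges or introduces words would fail the check.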

Download the Priberam Compressive Summarization Corpus here.


The corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. For the legal licensing terms, please see the LICENSE.txt file in the archive. You can find a human-readable summary of the license (which is not a substitute for the license) here.


If you use this corpus in your research, please cite the following paper: Miguel B. Almeida, Mariana S. C. Almeida, André F. T. Martins, Helena Figueira, Pedro Mendes and Cláudia Pinto, A New Multi-Document Summarization Corpus for European Portuguese, Language Resources and Evaluation Conference (LREC’14), Reykjavik, Iceland, May 2014.


Priberam would like to thank Cofina, Controlinveste, and RTP for their collaboration in providing the news articles that were processed to build the corpus. This work was partially supported by the EU/FEDER programme, QREN/POR Lisboa (Portugal), under the Discooperio project (contract 2011/18501), and by FCT grant PTDC/EEI-SII/2312/2012.