Priberam

Extracting geographic entities with Conditional Random Fields

Geographic Information Retrieval systems rely on identifying place names in documents to determine the region to which each document is relevant. Extracting location names from text is a common Natural Language Processing task. A simple approach is to use manually coded rules supported by dictionaries of place names, or gazetteers. Although these methods achieve good results, the rules are usually too restrictive and tied to a specific type of text.
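As a rough illustration of the rule-plus-gazetteer approach, the sketch below accepts a token as a place name only when it is capitalized, appears in a gazetteer, and follows a trigger preposition. The gazetteer entries, the trigger list, and the function name `find_places` are assumptions made for this example; they are not taken from the work being presented.

```python
import re

# Hypothetical toy gazetteer and trigger words, for illustration only.
GAZETTEER = {"lisboa", "porto", "coimbra"}
TRIGGERS = {"em", "de", "para"}


def find_places(text):
    """Return tokens that satisfy one hand-written rule:
    capitalized, listed in the gazetteer, and preceded by a trigger word."""
    tokens = re.findall(r"\w+", text)
    found = []
    for i, tok in enumerate(tokens):
        in_gazetteer = tok.lower() in GAZETTEER
        after_trigger = i > 0 and tokens[i - 1].lower() in TRIGGERS
        if tok[:1].isupper() and in_gazetteer and after_trigger:
            found.append(tok)
    return found


result = find_places("Ele vive em Lisboa e trabalha no Porto.")
print(result)
```

In this toy sentence the rule finds "Lisboa" but misses "Porto", because "no" is not in the trigger list. That is the kind of restrictiveness hand-written rules tend to show.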

Another approach is to use machine learning, extracting features from texts in which the geographic entities are annotated. Features can be the surrounding words or properties of the word itself, such as its capitalization or its frequency in the corpus. A probabilistic model is then built from these features to decide whether or not a given word is a geographic entity.
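The feature types mentioned above can be sketched as a per-token feature dictionary, a shape that many CRF toolkits accept as input. The feature names, the `B-LOC`/`O` label scheme, and the toy sentence are assumptions made for illustration, not the feature set used in the work presented.

```python
from collections import Counter


def token_features(tokens, i, corpus_freq):
    """Build a feature dict for tokens[i]: the word itself, its
    capitalization, its corpus frequency, and its neighbouring words."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.is_capitalized": word[:1].isupper(),
        "word.corpus_freq": corpus_freq[word.lower()],
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens, corpus_freq):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i, corpus_freq) for i in range(len(tokens))]


# Hypothetical annotated sentence: one label per token.
tokens = ["Ele", "vive", "em", "Lisboa", "."]
labels = ["O", "O", "O", "B-LOC", "O"]

# In practice the counts would come from the whole corpus, not one sentence.
corpus_freq = Counter(t.lower() for t in tokens)
features = sentence_features(tokens, corpus_freq)
```

A sequence model such as a CRF would then be trained on pairs of `features` and `labels` drawn from many annotated sentences, so that the label for each word can depend on its context as well as on the word itself.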

The talk will present work on training and using Conditional Random Fields to extract geographic references from a crawl of the Portuguese web, along with resources available for research, such as a geographic ontology of Portugal.

David Batista

David holds an MSc in Informatics Engineering from the Faculty of Sciences, University of Lisbon, and is part of the XLDB group at LaSIGE. He is currently working on the GREASE (Geographic Reasoning for Search Engines) project, which researches methods for accessing information in large collections of documents with geographically rich text and metadata, with an emphasis on the web.

XLDB, FCUL