Unsupervised feature discretization and selection for sparse data

Many applications involve high-dimensional, sparse datasets, in which many features are zero with high probability. In text classification and information retrieval, for instance, we work with large collections of documents. Each text is usually represented by a bag-of-words or similar representation, with a large number of features (terms). Many of these features may be irrelevant, or even detrimental, for the learning task. Such a large number of features also makes these collections costly to store and process. This calls for adequate methods for feature representation, reduction, and selection, both to improve classification accuracy and to reduce the memory needed to store these datasets.

This talk focuses on techniques for unsupervised Feature Discretization (FD) and Feature Selection (FS). The proposed FD technique uses the Lloyd-Max algorithm and is paired with a new FS criterion based on the discretized features. The FS methods rely on dispersion measures to compute feature relevance. The recent topic of compressed learning (CL), i.e., learning in a domain of reduced dimensionality obtained by random projections (RP), is explored under the framework of feature reduction. We present experimental results on standard datasets.
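As a rough illustration of the three ingredients named above, the following Python sketch shows a scalar Lloyd-Max quantizer, a dispersion-based feature ranking, and a Gaussian random projection. It is not the speaker's implementation: the function names, the number of quantization levels, the use of variance as the dispersion measure, and the synthetic sparse matrix are all assumptions made for this example, and the new FS criterion mentioned in the abstract is not reproduced here.

```python
import numpy as np


def lloyd_max(x, n_levels=4, n_iter=50):
    """Scalar Lloyd-Max quantizer for one feature (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialization: levels spread uniformly over the feature range.
    levels = np.linspace(x.min(), x.max(), n_levels)
    for _ in range(n_iter):
        # Decision thresholds are the midpoints between adjacent levels.
        thresholds = (levels[:-1] + levels[1:]) / 2.0
        codes = np.digitize(x, thresholds)
        new_levels = levels.copy()
        for k in range(n_levels):
            members = x[codes == k]
            if members.size:  # empty cells keep their previous level
                new_levels[k] = members.mean()
        if np.allclose(new_levels, levels):
            break
        levels = new_levels
    thresholds = (levels[:-1] + levels[1:]) / 2.0
    return levels, np.digitize(x, thresholds)


def select_by_dispersion(X, k):
    """Rank features by variance (an assumed dispersion measure), keep top k."""
    scores = X.var(axis=0)
    return np.argsort(scores)[::-1][:k]


def random_project(X, m, seed=0):
    """Project X onto m dimensions with a Gaussian random matrix."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], m)) / np.sqrt(m)
    return X @ R


# Synthetic sparse data: about 10% of the entries are nonzero.
rng = np.random.default_rng(0)
X = rng.random((100, 20)) * (rng.random((100, 20)) < 0.1)

levels, codes = lloyd_max(X[:, 0], n_levels=4)  # discretize one feature
top = select_by_dispersion(X, k=5)              # indices of 5 kept features
Z = random_project(X, m=8)                      # 100 x 8 reduced representation
```

Variance is used here only because it is a simple, familiar dispersion measure; any other measure could be swapped into `select_by_dispersion` without changing the rest of the sketch.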

Artur Ferreira

Artur Ferreira is an adjunct professor at ISEL (Instituto Superior de Engenharia de Lisboa) and a PhD student in Electrical and Computer Engineering at IST-IT (Instituto Superior Técnico – Instituto de Telecomunicações), under the supervision of Prof. Mário Figueiredo. He holds an MSc in Electrical and Computer Engineering from IST. His main research interests are data compression, pattern recognition, and machine learning.