Adaptive sparse attention mechanisms have emerged as a powerful alternative to dense attention in transformers, offering greater interpretability for sequence modeling. However, their widespread adoption has been limited by computational inefficiencies and by an insufficient understanding of their theoretical properties relative to dense attention models.
In this talk, I will present recent advancements in adaptive sparse attention, exploring its expressivity, generalization ability, and hardware-aware optimizations.
First, I’ll examine the expressivity of sparsemax attention, showing how it relates to linear attention with selective updates, and why entmax with α=1.5 offers even greater expressive power.
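As background, the standard definitions of these transformations from the entmax literature are sketched below (this is reference notation, not a result from the talk); sparsemax is the α = 2 case and softmax is recovered in the limit α → 1.

```latex
% alpha-entmax over the probability simplex \Delta^{d-1}, with Tsallis entropy H_alpha^T
\alpha\text{-entmax}(\mathbf{z})
  = \operatorname*{arg\,max}_{\mathbf{p} \in \Delta^{d-1}}
    \mathbf{p}^{\top}\mathbf{z} + H_{\alpha}^{\mathsf{T}}(\mathbf{p}),
\qquad
H_{\alpha}^{\mathsf{T}}(\mathbf{p})
  = \frac{1}{\alpha(\alpha-1)} \sum_{j} \bigl(p_j - p_j^{\alpha}\bigr), \quad \alpha \neq 1,

% closed-form solution up to a normalization threshold \tau(\mathbf{z})
\bigl[\alpha\text{-entmax}(\mathbf{z})\bigr]_j
  = \bigl[(\alpha-1)\,z_j - \tau(\mathbf{z})\bigr]_{+}^{1/(\alpha-1)},
```

where τ(z) is the threshold that makes the entries sum to one. Scores below the threshold receive exactly zero probability, which is what makes the resulting attention adaptively sparse.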
Second, I’ll discuss our findings on generalization: sparse attention outperforms dense attention on longer sequences, particularly when an appropriate scaling is applied.
Finally, I’ll introduce AdaSplash, our hardware-aware implementation of α-entmax attention, which outperforms FlashAttention-2 at high levels of sparsity. Throughout the talk, I’ll highlight how these advances collectively establish adaptive sparse attention as a robust alternative that can reshape long-sequence modeling.
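To make the α = 1.5 case concrete, here is a minimal, unoptimized PyTorch sketch of entmax-1.5 attention that finds the normalization threshold τ by bisection. The function names are my own, and this toy version materializes the full score matrix, unlike the hardware-aware kernel presented in the talk.

```python
import torch


def entmax15_bisect(scores: torch.Tensor, dim: int = -1, n_iter: int = 50) -> torch.Tensor:
    """entmax with alpha = 1.5 via bisection on the threshold tau.

    Finds tau such that sum_j [0.5 * z_j - tau]_+^2 = 1 and returns
    p_j = [0.5 * z_j - tau]_+^2; entries below the threshold are exactly zero.
    """
    s = 0.5 * scores                                # (alpha - 1) * z with alpha = 1.5
    s_max = s.max(dim=dim, keepdim=True).values
    tau_lo = s_max - 1.0                            # mass(tau_lo) >= 1: the top entry alone contributes 1
    tau_hi = s_max                                  # mass(tau_hi) = 0
    for _ in range(n_iter):                         # mass(tau) is monotone decreasing in tau
        tau = 0.5 * (tau_lo + tau_hi)
        mass = torch.clamp(s - tau, min=0.0).pow(2).sum(dim=dim, keepdim=True)
        tau_lo = torch.where(mass >= 1.0, tau, tau_lo)
        tau_hi = torch.where(mass < 1.0, tau, tau_hi)
    p = torch.clamp(s - tau_lo, min=0.0).pow(2)
    return p / p.sum(dim=dim, keepdim=True)         # renormalize away residual bisection error


def entmax_attention(q, k, v):
    """Single-head attention with entmax-1.5 replacing softmax (toy, dense compute)."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    weights = entmax15_bisect(scores, dim=-1)       # many weights come out exactly zero
    return weights @ v


# Example usage on random tensors of shape (batch, sequence length, head dim).
q, k, v = (torch.randn(2, 16, 64) for _ in range(3))
out = entmax_attention(q, k, v)
```

A hardware-aware kernel can exploit the exact zeros produced by the threshold to skip entire blocks of the attention computation, which is where the speedups over dense FlashAttention-style kernels come from at high sparsity.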