Adaptive sparse attention mechanisms have emerged as a powerful alternative to dense attention in transformers, offering greater interpretability for sequence modeling. However, their widespread adoption has been limited by computational inefficiencies and by an insufficient understanding of their theoretical properties relative to dense attention models.
In this talk, I will present recent advancements in adaptive sparse attention, exploring its expressivity, generalization ability, and hardware-aware optimizations.
First, I’ll examine the expressivity of sparsemax attention, showing how it relates to linear attention with selective updates, and why entmax with α=1.5 offers even greater expressive power.
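As background, the standard definitions of these transformations from the entmax literature are sketched below (this is reference notation, not a result from the talk); sparsemax is the α = 2 case and softmax is recovered in the limit α → 1.

```latex
% alpha-entmax over the probability simplex \Delta^{d-1}, with Tsallis entropy H_alpha^T
\alpha\text{-entmax}(\mathbf{z})
  = \operatorname*{arg\,max}_{\mathbf{p} \in \Delta^{d-1}}
    \mathbf{p}^{\top}\mathbf{z} + H_{\alpha}^{\mathsf{T}}(\mathbf{p}),
\qquad
H_{\alpha}^{\mathsf{T}}(\mathbf{p})
  = \frac{1}{\alpha(\alpha-1)} \sum_{j} \bigl(p_j - p_j^{\alpha}\bigr), \quad \alpha \neq 1,

% closed-form solution up to a normalization threshold \tau(\mathbf{z})
\bigl[\alpha\text{-entmax}(\mathbf{z})\bigr]_j
  = \bigl[(\alpha-1)\,z_j - \tau(\mathbf{z})\bigr]_{+}^{1/(\alpha-1)},
```

where τ(z) is the threshold that makes the entries sum to one. Scores below the threshold receive exactly zero probability, which is what makes the resulting attention adaptively sparse.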
Second, I’ll discuss our findings on generalization: sparse attention outperforms dense attention on longer sequences, particularly when an appropriate scaling is applied.
Finally, I’ll introduce AdaSplash, our hardware-aware implementation of α-entmax attention, which outperforms FlashAttention-2 at high levels of sparsity. Throughout the talk, I’ll highlight how these advances collectively establish adaptive sparse attention as a robust alternative that can reshape long-sequence modeling.
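To make the α = 1.5 case concrete, here is a minimal, unoptimized PyTorch sketch of entmax-1.5 attention that finds the normalization threshold τ by bisection. The function names are my own, and this toy version materializes the full score matrix, unlike the hardware-aware kernel presented in the talk.

```python
import torch


def entmax15_bisect(scores: torch.Tensor, dim: int = -1, n_iter: int = 50) -> torch.Tensor:
    """entmax with alpha = 1.5 via bisection on the threshold tau.

    Finds tau such that sum_j [0.5 * z_j - tau]_+^2 = 1 and returns
    p_j = [0.5 * z_j - tau]_+^2; entries below the threshold are exactly zero.
    """
    s = 0.5 * scores                                # (alpha - 1) * z with alpha = 1.5
    s_max = s.max(dim=dim, keepdim=True).values
    tau_lo = s_max - 1.0                            # mass(tau_lo) >= 1: the top entry alone contributes 1
    tau_hi = s_max                                  # mass(tau_hi) = 0
    for _ in range(n_iter):                         # mass(tau) is monotone decreasing in tau
        tau = 0.5 * (tau_lo + tau_hi)
        mass = torch.clamp(s - tau, min=0.0).pow(2).sum(dim=dim, keepdim=True)
        tau_lo = torch.where(mass >= 1.0, tau, tau_lo)
        tau_hi = torch.where(mass < 1.0, tau, tau_hi)
    p = torch.clamp(s - tau_lo, min=0.0).pow(2)
    return p / p.sum(dim=dim, keepdim=True)         # renormalize away residual bisection error


def entmax_attention(q, k, v):
    """Single-head attention with entmax-1.5 replacing softmax (toy, dense compute)."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    weights = entmax15_bisect(scores, dim=-1)       # many weights come out exactly zero
    return weights @ v


# Example usage on random tensors of shape (batch, sequence length, head dim).
q, k, v = (torch.randn(2, 16, 64) for _ in range(3))
out = entmax_attention(q, k, v)
```

A hardware-aware kernel can exploit the exact zeros produced by the threshold to skip entire blocks of the attention computation, which is where the speedups over dense FlashAttention-style kernels come from at high sparsity.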