Multipole Attention for Efficient Long Context Reasoning
Coleman Hooper*, Sebastian Zhao*, Luca Manolache, and 5 more authors
NeurIPS, 2025
Reasoning models have shown promising accuracy gains through long chain-of-thought decoding, but they incur substantial inference overhead due to the need to generate thousands of tokens. Sparse attention methods can reduce the KV cache pressure induced by this long autoregressive reasoning, yet they can introduce errors that disrupt the reasoning process. In this work, we introduce Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens while maintaining approximate representations for the remaining tokens, achieving up to a 4.5x attention speedup for long-context reasoning.
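The abstract only sketches the mechanism, so below is a minimal single-query illustration of the idea in PyTorch: keys in the KV cache are clustered offline, attention is computed exactly for the keys in the clusters whose centroids score highest against the query, and every remaining cluster is collapsed into one centroid term weighted by its size. The function names, the k-means preprocessing, and the top-cluster selection heuristic are all illustrative assumptions, not the authors' implementation.

```python
import torch


def build_centroids(K, V, num_clusters, iters=10):
    """Assumed offline preprocessing: cluster keys with a few k-means steps,
    keeping the mean key, mean value, and size of each cluster."""
    n, d = K.shape
    cent_k = K[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(K, cent_k).argmin(dim=1)   # nearest centroid per key
        for c in range(num_clusters):
            mask = assign == c
            if mask.any():
                cent_k[c] = K[mask].mean(dim=0)
    assign = torch.cdist(K, cent_k).argmin(dim=1)        # final assignment
    counts = torch.bincount(assign, minlength=num_clusters).float()
    cent_v = torch.zeros(num_clusters, V.shape[1])
    cent_v.index_add_(0, assign, V)                      # sum values per cluster
    cent_v = cent_v / counts.clamp(min=1).unsqueeze(1)   # mean value per cluster
    return assign, cent_k, cent_v, counts


def multipole_attention(q, K, V, assign, cent_k, cent_v, counts, exact_clusters=4):
    """Exact attention for keys in the clusters whose centroids score highest
    against q; a single size-weighted centroid term for each far cluster."""
    d = q.shape[-1]
    cent_scores = cent_k @ q / d**0.5
    top = cent_scores.topk(exact_clusters).indices
    near = torch.isin(assign, top)                       # keys attended exactly
    exact_scores = K[near] @ q / d**0.5
    far = torch.ones(counts.shape[0], dtype=torch.bool)
    far[top] = False
    far &= counts > 0                                    # skip empty clusters
    # a far cluster of n_c similar keys contributes roughly n_c * exp(q.k_c)
    far_logits = cent_scores[far] + counts[far].log()
    probs = torch.softmax(torch.cat([exact_scores, far_logits]), dim=0)
    return probs @ torch.cat([V[near], cent_v[far]], dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    n, d = 1024, 64
    K, V, q = torch.randn(n, d), torch.randn(n, d), torch.randn(d)
    assign, ck, cv, cnt = build_centroids(K, V, num_clusters=32)
    out = multipole_attention(q, K, V, assign, ck, cv, cnt)
    exact = torch.softmax(K @ q / d**0.5, dim=0) @ V
    print((out - exact).norm())  # small when far clusters carry little mass
```

The speedup in this scheme comes from scoring 32 centroids plus a handful of exact clusters instead of all 1024 keys; the approximation error stays small as long as the attention mass concentrates on the exactly-attended clusters.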