Multipole Attention for Efficient Long Context Reasoning
Coleman Hooper*, Sebastian Zhao*, Luca Manolache, and 5 more authors
NeurIPS, 2025
Reasoning models have shown promising accuracy gains through long chain-of-thought decoding, but they incur substantial inference overhead due to the need to generate thousands of tokens. Sparse attention methods can reduce the KV cache pressure induced by this long autoregressive reasoning, yet they can introduce errors that disrupt the reasoning process. In this work, we introduce Multipole Attention, which accelerates autoregressive reasoning by computing exact attention only for the most important tokens while maintaining approximate representations for the remaining tokens, achieving up to a 4.5x attention speedup for long-context reasoning.
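The abstract only sketches the mechanism, so below is a minimal single-query illustration of the idea in PyTorch: keys in the KV cache are clustered offline, attention is computed exactly for the keys in the clusters whose centroids score highest against the query, and every remaining cluster is collapsed into one centroid term weighted by its size. The function names, the k-means preprocessing, and the top-cluster selection heuristic are all illustrative assumptions, not the authors' implementation.

```python
import torch


def build_centroids(K, V, num_clusters, iters=10):
    """Assumed offline preprocessing: cluster keys with a few k-means steps,
    keeping the mean key, mean value, and size of each cluster."""
    n, d = K.shape
    cent_k = K[torch.randperm(n)[:num_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(K, cent_k).argmin(dim=1)   # nearest centroid per key
        for c in range(num_clusters):
            mask = assign == c
            if mask.any():
                cent_k[c] = K[mask].mean(dim=0)
    assign = torch.cdist(K, cent_k).argmin(dim=1)        # final assignment
    counts = torch.bincount(assign, minlength=num_clusters).float()
    cent_v = torch.zeros(num_clusters, V.shape[1])
    cent_v.index_add_(0, assign, V)                      # sum values per cluster
    cent_v = cent_v / counts.clamp(min=1).unsqueeze(1)   # mean value per cluster
    return assign, cent_k, cent_v, counts


def multipole_attention(q, K, V, assign, cent_k, cent_v, counts, exact_clusters=4):
    """Exact attention for keys in the clusters whose centroids score highest
    against q; a single size-weighted centroid term for each far cluster."""
    d = q.shape[-1]
    cent_scores = cent_k @ q / d**0.5
    top = cent_scores.topk(exact_clusters).indices
    near = torch.isin(assign, top)                       # keys attended exactly
    exact_scores = K[near] @ q / d**0.5
    far = torch.ones(counts.shape[0], dtype=torch.bool)
    far[top] = False
    far &= counts > 0                                    # skip empty clusters
    # a far cluster of n_c similar keys contributes roughly n_c * exp(q.k_c)
    far_logits = cent_scores[far] + counts[far].log()
    probs = torch.softmax(torch.cat([exact_scores, far_logits]), dim=0)
    return probs @ torch.cat([V[near], cent_v[far]], dim=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    n, d = 1024, 64
    K, V, q = torch.randn(n, d), torch.randn(n, d), torch.randn(d)
    assign, ck, cv, cnt = build_centroids(K, V, num_clusters=32)
    out = multipole_attention(q, K, V, assign, ck, cv, cnt)
    exact = torch.softmax(K @ q / d**0.5, dim=0) @ V
    print((out - exact).norm())  # small when far clusters carry little mass
```

The speedup in this scheme comes from scoring 32 centroids plus a handful of exact clusters instead of all 1024 keys; the approximation error stays small as long as the attention mass concentrates on the exactly-attended clusters.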