Publications

Publications by category in reverse chronological order. Generated by jekyll-scholar.
2025
- NeurIPS
- arXiv: Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models. arXiv preprint, 2025.
- arXiv: FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference. arXiv preprint, 2025.
- arXiv: XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization. arXiv preprint, 2025.
- arXiv
- ICML
- Springer: SPEED: Speculative Pipelined Execution for Efficient Decoding. In Enhancing LLM Performance: Efficacy, Fine-Tuning, and Inference Techniques, 2025.
2024
- EMNLP Demo
- IEEE Micro
- ICML Workshop
- MLSys
2023
- CHiME-7 Workshop: Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation. In CHiME-7 Workshop, 2023.
- ISSCC: A 12nm 18.1 TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management. In ISSCC, 2023.
- ISCA Workshop: Full Stack Optimization of Transformer Inference: A Survey. In ISCA Workshop on Architecture and System Support for Transformer Models (ASSYST), 2023.
2022
- JSSC: A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference with Bayesian Sound Source Separation and Attention-Based DNNs. In JSSC, 2022.
2021
- MICRO: EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference. In MICRO, 2021.
- ISSCC: A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET. In ISSCC, 2021.
- Hot Chips: SM6: A 16nm System-on-Chip for Accurate and Noise-Robust Attention-Based NLP Applications. In Hot Chips Symposium, 2021.