Token Pruning / Reduction Methods in Transformers (2021-2026)

A comprehensive survey of token pruning, merging, and compression methods published at top-tier AI/ML venues.


Category 1: Vision Transformer Token Pruning/Merging

1. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

  • Authors: Yongming Rao et al.
  • Venue: NeurIPS 2021 (also extended in T-PAMI)
  • Model type: ViT (DeiT)
  • Core method: Lightweight prediction module estimates importance score of each token; added at different layers for hierarchical pruning
  • Key innovation: First dynamic token sparsification framework for ViTs with learnable prediction modules and attention masking for end-to-end training
  • Benchmarks: Prunes 66% tokens, reduces 31%-37% FLOPs, improves throughput by 40%+ with <0.5% accuracy drop on ImageNet
  • ArXiv: 2106.02034

2. TokenLearner: Adaptive Space-Time Tokenization for Videos

  • Authors: Michael S. Ryoo et al. (Google Research)
  • Venue: NeurIPS 2021
  • Model type: ViT (video)
  • Core method: Learns to generate a small set of tokens from input via spatial attention maps; replaces fixed tokenization with adaptive learned tokenization
  • Key innovation: Data-driven token generation that reduces token count while improving accuracy; applicable to both images and videos
  • Benchmarks: SOTA on Kinetics-400, Kinetics-600, Charades, AViD
  • ArXiv: 2106.11297

3. IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers

  • Authors: Bowen Pan et al.
  • Venue: NeurIPS 2021
  • Model type: ViT
  • Core method: Learns to identify and remove redundant tokens using a policy network that determines which tokens to keep
  • Key innovation: Interpretability-aware approach that provides visual explanations while reducing computation
  • ArXiv: 2106.12620

4. EViT: Expediting Vision Transformers via Token Reorganizations

  • Authors: Youwei Liang et al.
  • Venue: ICLR 2022 (Spotlight)
  • Model type: ViT (DeiT)
  • Core method: Identifies attentive tokens using CLS token attention scores; reorganizes tokens by keeping top-k attentive tokens and fusing inattentive ones
  • Key innovation: Parameter-free token reorganization guided by class token attention (the model is still trained or fine-tuned with the reorganization in place); fuses rather than discards inattentive tokens to preserve information
  • Benchmarks: 50% speedup on DeiT-S with only 0.3% accuracy drop on ImageNet
  • ArXiv: 2202.07800
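
The keep-and-fuse step described for EViT can be illustrated with a small NumPy sketch. The function name `evit_reorganize`, its arguments, and the choice of a single fused token weighted by CLS attention are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def evit_reorganize(tokens, cls_attn, keep):
    """Keep the `keep` most attentive patch tokens and fuse the rest into one.

    tokens:   (N, D) patch tokens, CLS token excluded
    cls_attn: (N,) attention from the CLS token to each patch (e.g. head-averaged)
    """
    order = np.argsort(-cls_attn)                 # most attentive first
    kept_idx, rest_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx]
    if rest_idx.size == 0:
        return kept
    # Attention-weighted average of the inattentive tokens (assumed fusion rule)
    w = cls_attn[rest_idx] / cls_attn[rest_idx].sum()
    fused = (w[:, None] * tokens[rest_idx]).sum(axis=0, keepdims=True)
    return np.concatenate([kept, fused], axis=0)  # (keep + 1, D)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))
cls_attn = rng.random(8)
out = evit_reorganize(tokens, cls_attn, keep=5)
print(out.shape)  # (6, 4)
```

Five tokens are kept as-is and the remaining three collapse into one fused token, so the layer passes on six patch tokens instead of eight.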

5. A-ViT: Adaptive Tokens for Efficient Vision Transformer

  • Authors: Hongxu Yin et al. (NVIDIA)
  • Venue: CVPR 2022 (Oral)
  • Model type: ViT
  • Core method: Adaptively adjusts the number of tokens per image based on input complexity using a halting mechanism inspired by Adaptive Computation Time (ACT)
  • Key innovation: Per-token halting score that automatically reduces tokens for simpler inputs; no need for predefined pruning ratios
  • Benchmarks: Improves throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop on ImageNet
  • ArXiv: 2112.07658

6. Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

  • Authors: Yifan Xu et al.
  • Venue: AAAI 2022
  • Model type: ViT (DeiT, LeViT)
  • Core method: Separates tokens into informative (slow) and less informative (fast) groups; slow tokens go through full computation, fast tokens are updated with a lightweight mechanism
  • Key innovation: Slow-fast token evolution preserves global information flow while reducing computation
  • Benchmarks: 40%-60% throughput improvement on DeiT; further accelerates LeViT
  • ArXiv: 2108.01390

7. SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

  • Authors: Zhenglun Kong et al.
  • Venue: ECCV 2022
  • Model type: ViT (DeiT, Swin)
  • Core method: Soft token pruning with latency-aware optimization; prunes tokens by learning a soft mask and considers real hardware latency
  • Key innovation: Latency-aware pruning applicable to both flat (DeiT) and hierarchical (Swin) architectures; optimizes for actual speed rather than FLOPs
  • ArXiv: 2112.13890

8. ATS: Adaptive Token Sampling for Efficient Vision Transformers

  • Authors: Mohsen Fayyaz et al.
  • Venue: ECCV 2022 (Oral)
  • Model type: ViT
  • Core method: Differentiable parameter-free module that scores tokens and adaptively samples significant ones; can be plugged into any ViT
  • Key innovation: Parameter-free adaptive sampling based on inverse transform sampling; no additional learnable parameters
  • Benchmarks: 2x GFLOPs reduction while preserving accuracy on ImageNet, Kinetics-400/600
  • ArXiv: 2111.15667
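
To make the inverse transform sampling idea behind ATS concrete, here is a minimal sketch. It assumes the significance scores are already computed and uses evenly spaced quantiles instead of random draws; `adaptive_token_sampling` and its arguments are hypothetical names, not the paper's API.

```python
import numpy as np

def adaptive_token_sampling(scores, k):
    """Sample at most k token indices via inverse transform sampling.

    scores: (N,) non-negative significance scores
    k:      maximum number of tokens to keep
    """
    p = scores / scores.sum()
    cdf = np.cumsum(p)
    u = (np.arange(k) + 0.5) / k                  # fixed quantiles (assumption)
    idx = np.minimum(np.searchsorted(cdf, u), len(scores) - 1)
    return np.unique(idx)                         # duplicates collapse, so <= k tokens

scores = np.array([0.7, 0.1, 0.1, 0.1])
print(adaptive_token_sampling(scores, k=4))  # [0 2]
```

Because several quantiles land on the same dominant token, the number of retained tokens adapts to how concentrated the scores are, which is the behaviour the entry describes.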

9. Token Merging (ToMe): Your ViT But Faster

  • Authors: Daniel Bolya et al. (Meta/Facebook Research)
  • Venue: ICLR 2023 (Oral)
  • Model type: ViT
  • Core method: Merges similar tokens using bipartite soft matching on key similarity (cosine similarity of K vectors); combines rather than prunes
  • Key innovation: Training-free token merging that preserves information by combining similar tokens; applicable off-the-shelf to any ViT
  • Benchmarks: 2x throughput on ViT-L@512, ViT-H@518 with only 0.2-0.3% accuracy drop; 2.2x on video ViT-L
  • ArXiv: 2210.09461
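
Bipartite soft matching, as summarized above, can be sketched as follows. This is a simplified single-layer version that assumes every token has size 1 (no size tracking across layers); the function name and signature are illustrative, not taken from the ToMe code base.

```python
import numpy as np

def bipartite_soft_matching(x, keys, r):
    """Merge away r tokens. x: (N, D) tokens, keys: (N, Dk) attention keys."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    a_idx = np.arange(0, len(x), 2)               # alternate tokens into set A ...
    b_idx = np.arange(1, len(x), 2)               # ... and set B
    sim = k[a_idx] @ k[b_idx].T                   # cosine similarity, A x B
    best_b = sim.argmax(axis=1)                   # closest B token for each A token
    best_sim = sim.max(axis=1)
    merge_a = np.argsort(-best_sim)[:r]           # the r most similar A->B edges
    keep_a = np.setdiff1d(np.arange(len(a_idx)), merge_a)
    b_sum = x[b_idx].copy()
    b_cnt = np.ones(len(b_idx))
    for a in merge_a:                             # average merged tokens into their partner
        b_sum[best_b[a]] += x[a_idx[a]]
        b_cnt[best_b[a]] += 1
    merged_b = b_sum / b_cnt[:, None]
    return np.concatenate([x[a_idx[keep_a]], merged_b], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
merged = bipartite_soft_matching(x, keys=x, r=3)
print(merged.shape)  # (7, 4)
```

Each call removes exactly r tokens, so applying it in every block gives a predictable token schedule without any training.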

10. Joint Token Pruning and Squeezing (TPS)

  • Authors: Siyuan Wei et al.
  • Venue: CVPR 2023
  • Model type: ViT
  • Core method: Combines pruning with “squeezing” — pruned tokens’ information is injected into retained tokens via unidirectional nearest-neighbor matching and similarity-oriented fusing
  • Key innovation: Recovers information from pruned tokens by squeezing their content into similar retained tokens, enabling more aggressive compression
  • Benchmarks: More aggressive compression than pure pruning or merging with comparable accuracy
  • ArXiv: 2304.10716

11. DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

  • Authors: Mengzhao Chen et al.
  • Venue: ICCV 2023
  • Model type: ViT
  • Core method: Makes compression ratio differentiable; jointly optimizes token pruning and merging decisions and their ratios per layer via gradient descent
  • Key innovation: First method to make the compression rate itself a learnable parameter optimized end-to-end
  • Benchmarks: Outperforms ToMe and other methods at the same FLOPs budget on ImageNet
  • ArXiv: 2305.17997

12. Making Vision Transformers Efficient from A Token Sparsification View

  • Authors: Shuning Chang et al.
  • Venue: CVPR 2023
  • Model type: ViT
  • Core method: Comprehensive analysis and improved token sparsification strategy
  • Key innovation: Unified view of token sparsification methods with improved strategy

13. Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph

  • Authors: Hongjie Wang et al. (Princeton)
  • Venue: CVPR 2024
  • Model type: ViT
  • Core method: Uses Weighted Page Rank (WPR) on the attention graph to compute token importance; combines importance and similarity for pruning decisions
  • Key innovation: First zero-shot method that jointly considers token importance (via PageRank on attention) and similarity; no training or fine-tuning needed
  • ArXiv: 2305.17328
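
The PageRank-style importance score can be approximated by a short power iteration over the attention matrix. The sketch below is a simplified version under stated assumptions: a single head, row-stochastic attention, and renormalization at every step. The name `attention_pagerank` is hypothetical.

```python
import numpy as np

def attention_pagerank(attn, iters=50, tol=1e-8):
    """attn: (N, N) row-stochastic attention; returns (N,) importance scores."""
    n = attn.shape[0]
    s = np.full(n, 1.0 / n)
    for _ in range(iters):
        s_new = attn.T @ s                        # token j collects votes A[i, j] * s[i]
        s_new /= s_new.sum()
        converged = np.abs(s_new - s).max() < tol
        s = s_new
        if converged:
            break
    return s

attn = np.array([[0.8, 0.1, 0.1],
                 [0.8, 0.1, 0.1],
                 [0.8, 0.1, 0.1]])
scores = attention_pagerank(attn)
print(scores.argmax())  # 0
```

Tokens that receive attention from other important tokens end up with high scores; a pruning step would then drop the lowest-scoring tokens, optionally after a similarity check.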

14. Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning

  • Authors: Shibo Jie et al.
  • Venue: ECCV 2024
  • Model type: ViT
  • Core method: Compensates for information loss from token reduction by learning a lightweight compensator module
  • Key innovation: Allows changing the compression ratio at inference time without retraining

15. Agglomerative Token Clustering

  • Authors: (Various)
  • Venue: ECCV 2024
  • Model type: ViT
  • Core method: Hierarchical agglomerative clustering of tokens based on feature similarity
  • Key innovation: Bottom-up clustering approach for token merging

16. Exploring Token Pruning in Vision State Space Models

  • Authors: (Various)
  • Venue: NeurIPS 2024
  • Model type: Vision SSM (Mamba-based)
  • Core method: Investigates token pruning in state space models; finds naive application of ViT pruning methods fails
  • Key innovation: First study of token pruning for vision state space models; proposes SSM-specific pruning strategies

17. Spectrum-Preserving Token Merging (Accelerating Transformers)

  • Authors: (Various)
  • Venue: NeurIPS 2024
  • Model type: ViT
  • Core method: Token merging that preserves the spectral properties of the token representation
  • Key innovation: Frequency-domain aware merging strategy

18. Token Cropr: Faster ViTs for Quite a Few Tasks

  • Authors: (Various)
  • Venue: CVPR 2025
  • Model type: ViT
  • Core method: Crops/selects tokens for multiple downstream tasks beyond classification
  • Key innovation: Task-agnostic token selection applicable to detection, segmentation, and other dense tasks

Category 2: Vision-Language Model Token Pruning

19. MADTP: Multimodal Alignment-Guided Dynamic Token Pruning

  • Authors: Jianjian Cao et al.
  • Venue: CVPR 2024
  • Model type: Vision-Language Transformer (BLIP, BLIP-2)
  • Core method: Multi-modality Alignment Guidance (MAG) aligns features across modalities; Dynamic Token Pruning (DTP) adaptively adjusts compression ratio per layer per instance
  • Key innovation: First dynamic token pruning for VL transformers that considers cross-modal alignment; instance-adaptive pruning ratios
  • Benchmarks: 80% GFLOPs reduction on BLIP with <4% performance drop on NLVR2
  • ArXiv: 2403.02991

20. FastV: An Image is Worth 1/2 Tokens After Layer 2

  • Authors: Liang Chen et al.
  • Venue: ECCV 2024 (Oral)
  • Model type: Large Vision-Language Model (LLaVA)
  • Core method: Prunes visual tokens in early LLM layers based on attention scores; plug-and-play approach
  • Key innovation: Discovers that visual token attention becomes sparse after layer 2 in VLLMs; first to use LLM’s own attention signal for visual token pruning
  • Benchmarks: 50% visual tokens pruned after layer 2 without accuracy loss on LLaVA-1.5-13B; 45% FLOPs reduction
  • ArXiv: 2403.06764
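
A minimal sketch of attention-based visual token pruning in the spirit of FastV: rank visual tokens by the average attention they receive at one layer and keep the top fraction. The function name, the `vis_start`/`vis_end` span arguments, and averaging over heads and queries are assumptions for illustration.

```python
import numpy as np

def fastv_keep_indices(attn, vis_start, vis_end, keep_ratio=0.5):
    """attn: (heads, seq, seq) attention at the chosen layer.

    Returns sorted sequence positions to keep; non-visual tokens are always kept.
    """
    received = attn.mean(axis=0).mean(axis=0)     # avg attention each token receives
    vis_scores = received[vis_start:vis_end]
    n_keep = int(round(keep_ratio * len(vis_scores)))
    top = np.sort(np.argsort(-vis_scores)[:n_keep]) + vis_start
    return np.concatenate([np.arange(0, vis_start), top,
                           np.arange(vis_end, attn.shape[-1])])

rng = np.random.default_rng(0)
attn = rng.random((2, 10, 10))
keep = fastv_keep_indices(attn, vis_start=2, vis_end=8, keep_ratio=0.5)
print(len(keep))  # 7
```

Here 3 of the 6 visual tokens survive alongside the 4 text tokens; later layers would then run on the shortened sequence.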

21. IVTP: Instruction-Guided Visual Token Pruning for Large Vision-Language Models

  • Authors: Kai Huang et al.
  • Venue: ECCV 2024
  • Model type: Large Vision-Language Model
  • Core method: Uses instruction/text query to guide visual token importance estimation via Group-wise Token Pruning (GTP) with attention rollout
  • Key innovation: Instruction-aware pruning that considers the task/question when deciding which visual tokens to keep

22. Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

  • Authors: Chen Ju et al.
  • Venue: ECCV 2024 (Oral)
  • Model type: Vision-Language Model
  • Core method: Prunes tokens based on “information degree” that combines mutual redundancy (data duplication between sequential tokens) and semantic value (each token’s contribution to overall semantics)
  • Key innovation: Dual-criterion informativity metric; applicable as plug-in to both visual and textual tokens
  • ArXiv: 2407.11717

23. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

  • Authors: Yuzhang Shang et al.
  • Venue: ICCV 2025
  • Model type: Large Multimodal Model (LLaVA)
  • Core method: Exploits sparsity in CLIP visual encoder attention to identify crucial tokens; prunes less important tokens then merges pruned token information into retained tokens via key-similarity clustering
  • Key innovation: Adaptive pruning ratio via outlier detection on attention scores; combines pruning + merging for multimodal models
  • Benchmarks: 18x average visual token compression with comparable VQA/reasoning performance
  • ArXiv: 2403.15388

24. Dynamic-LLaVA: Efficient Multimodal LLMs via Dynamic Vision-Language Context Sparsification

  • Authors: (Various)
  • Venue: ICLR 2025
  • Model type: Multimodal LLM (LLaVA)
  • Core method: Dynamically reduces vision context redundancy in prefill stage and language context overhead during decoding
  • Key innovation: Joint vision-language token sparsification; handles both modalities
  • Benchmarks: ~50% computation reduction in decoding; ~50% GPU memory savings with KV cache

25. ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

  • Authors: Xubing Ye et al.
  • Venue: CVPR 2025
  • Model type: Large Vision-Language Model
  • Core method: Adaptive Token Pruning (ATP) module computes instance-specific importance scores and pruning thresholds per LLM layer; Spatial Augmented Pruning (SAP) considers both redundancy and spatial relationships
  • Key innovation: Instance-adaptive per-layer pruning ratio; spatial-aware pruning that preserves spatial structure
  • Benchmarks: 75% average token reduction with only 1.9% performance degradation across 7 benchmarks
  • ArXiv: 2412.00447

26. PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

  • Authors: Mohamed Dhouib et al.
  • Venue: CVPR 2025
  • Model type: Vision-Language Model
  • Core method: Prunes irrelevant tokens and merges visually redundant ones at an early LLM layer via clustering
  • Key innovation: Combined pruning and clustering at a single early layer for efficiency
  • Benchmarks: 71.3% token reduction, 31% GPU memory reduction, 225% speedup
  • ArXiv: 2504.08966

27. TopV: Compatible Token Pruning with Inference Time Optimization

  • Authors: Cheng Yang et al.
  • Venue: CVPR 2025
  • Model type: Multimodal VLM
  • Core method: Formulates token pruning as an optimization problem rather than relying on attention scores; compatible with FlashAttention
  • Key innovation: Optimization-based token selection that works with FlashAttention (which doesn’t output attention weights); training-free
  • ArXiv: 2503.18278

28. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

  • Authors: Saeed Ranjbar Alvar et al.
  • Venue: CVPR 2025
  • Model type: Large Multimodal Model
  • Core method: Formulates token pruning as a Max-Min Diversity Problem (MMDP); selects token subset that maximizes diversity among retained tokens
  • Key innovation: Diversity-maximization approach to token selection; training-free; works well at high pruning ratios
  • Benchmarks: SOTA across 16 image- and video-language datasets
  • ArXiv: 2503.02175
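
The max-min diversity objective can be illustrated with a common greedy heuristic: repeatedly add the token farthest from everything already selected. This is a sketch assuming cosine distance and a greedy solver, which may differ from the paper's exact procedure; `greedy_max_min` is a hypothetical name.

```python
import numpy as np

def greedy_max_min(features, k):
    """Pick k token indices that are mutually far apart. features: (N, D), k <= N."""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T                          # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)                # ignore self-distance
    selected = [int(dist.min(axis=1).argmax())]   # start from the most isolated token
    min_d = dist[selected[0]].copy()              # distance of each token to the selected set
    min_d[selected[0]] = -np.inf
    while len(selected) < k:
        nxt = int(min_d.argmax())                 # farthest from everything selected so far
        selected.append(nxt)
        min_d = np.minimum(min_d, dist[nxt])
        min_d[nxt] = -np.inf
    return sorted(selected)

# Two tight clusters: tokens 0-1 point along x, tokens 2-3 along y
features = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
picked = greedy_max_min(features, k=2)
```

With a budget of two, the selection takes one token from each cluster rather than two near-duplicates, which is the behaviour that matters at high pruning ratios.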

29. PyramidDrop: Accelerating LVLMs via Pyramid Visual Redundancy Reduction

  • Authors: Long Xing et al. (code released under the Cooperx521 GitHub account)
  • Venue: CVPR 2025
  • Model type: Large Vision-Language Model
  • Core method: Progressively drops more visual tokens in deeper layers, forming a pyramid shape (fewer tokens in deeper layers)
  • Key innovation: Observation that visual redundancy is low in shallow layers but increases in deeper layers; pyramid-shaped pruning schedule
  • Benchmarks: 40% training time reduction, 55% inference FLOPs reduction on LLaVA-NeXT with comparable performance
  • ArXiv: 2410.17247

30. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

  • Authors: Keda Tao et al.
  • Venue: CVPR 2025
  • Model type: Video LLM
  • Core method: Dynamic compression of video tokens based on temporal and spatial redundancy
  • Key innovation: Video-specific token compression for Video LLMs

31. AdaptMerge: Inference Time Adaptive Visual and Language Token Merging

  • Authors: (Various)
  • Venue: EMNLP 2025 Findings
  • Model type: Vision-Language Model
  • Core method: Two-stage reduction: Adaptive Visual Token Merging (AVTM) in the vision encoder, then Adaptive Language-Guided Visual Token Merging (ALVTM) at LLM input
  • Key innovation: Dual-stage merging that considers both visual similarity and language guidance

32. CoViPAL: Layer-wise Contextualized Visual Token Pruning

  • Authors: (Various)
  • Venue: EMNLP 2025 Findings
  • Model type: Large Vision-Language Model
  • Core method: Layer-wise contextualized pruning of visual tokens
  • Key innovation: Context-aware layer-wise pruning strategy

Category 3: LLM Token Pruning (Prompt/Context Compression)

33. LLMLingua: Compressing Prompts for Accelerated Inference of LLMs

  • Authors: Huiqiang Jiang et al. (Microsoft)
  • Venue: EMNLP 2023
  • Model type: LLM (general)
  • Core method: Coarse-to-fine prompt compression with budget controller, token-level iterative compression, and instruction tuning for distribution alignment
  • Key innovation: First systematic prompt compression framework using a small LM to identify removable tokens; up to 20x compression
  • ArXiv: 2310.05736

34. Learning to Compress Prompts with Gist Tokens

  • Authors: Jesse Mu et al.
  • Venue: NeurIPS 2023
  • Model type: LLM
  • Core method: Trains LM to compress prompts into smaller sets of “gist” tokens that can be cached and reused
  • Key innovation: Learned compression into reusable gist tokens; up to 26x prompt compression
  • ArXiv: 2304.08467

35. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

  • Authors: (Various)
  • Venue: NeurIPS 2023
  • Model type: LLM
  • Core method: Dynamically prunes context during autoregressive generation
  • Key innovation: Per-head, per-step context pruning that is also interpretable

36. In-context Autoencoder for Context Compression in a Large Language Model

  • Authors: (Various)
  • Venue: ICLR 2024
  • Model type: LLM
  • Core method: Compresses long context into compact representation using an in-context autoencoder
  • Key innovation: Autoencoder-based context compression within the LLM framework

37. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

  • Authors: Huiqiang Jiang et al. (Microsoft)
  • Venue: ACL 2024
  • Model type: LLM
  • Core method: Extends LLMLingua for long-context scenarios; mitigates “lost in the middle” problem
  • Key innovation: Question-aware prompt compression that prioritizes tokens relevant to the query
  • ArXiv: 2310.06839

38. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

  • Authors: Qichen Fu et al. (Apple)
  • Venue: ICML 2024 Workshop (ES-FoMo)
  • Model type: LLM (LLaMA 2)
  • Core method: Selectively computes KV for tokens important for next token prediction; dynamically selects different token subsets at each generation step
  • Key innovation: Training-free, universal dynamic token selection that allows previously pruned tokens to be reconsidered in later steps
  • Benchmarks: 2.34x prefilling acceleration on LLaMA 2 7B for multi-document QA
  • ArXiv: 2407.14057

39. MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

  • Authors: Julie Kallini et al.
  • Venue: ICLR 2025
  • Model type: Byte-level LM (ByT5)
  • Core method: Integrates a learned delete gate in the encoder to dynamically shorten byte sequences; gate determines which tokens to remove
  • Key innovation: Applies token merging to byte-level models, reducing the fundamental tokenization overhead
  • Benchmarks: Comparable accuracy to ByT5 while reducing sequence lengths by up to 75% on XNLI, TyDi QA
  • ArXiv: 2410.20771

Category 4: KV Cache Compression/Pruning for LLMs

40. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of LLMs

  • Authors: Zhenyu Zhang et al.
  • Venue: NeurIPS 2023
  • Model type: LLM (OPT)
  • Core method: KV cache eviction policy that dynamically retains “heavy hitter” tokens (those with high accumulated attention scores) plus recent tokens
  • Key innovation: Discovery that attention scores follow power-law distribution; small set of heavy-hitter tokens dominate attention computation
  • Benchmarks: With 20% heavy hitters, up to 29x throughput improvement over DeepSpeed Zero-Inference
  • ArXiv: 2306.14048
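
The heavy-hitter plus recent-token policy can be sketched as a one-shot selection over accumulated attention scores. The real method evicts greedily at every decoding step; the function below only shows the resulting cache composition, and its name and budget arguments are assumptions.

```python
import numpy as np

def h2o_keep(acc_scores, n_heavy, n_recent):
    """acc_scores: (T,) attention each cached token has accumulated so far.

    Returns sorted cache positions to keep: recent tokens plus heavy hitters.
    """
    t = len(acc_scores)
    split = max(0, t - n_recent)
    recent = np.arange(split, t)                  # always keep the latest tokens
    older = np.arange(0, split)
    heavy = older[np.argsort(-acc_scores[older])[:n_heavy]]
    return np.sort(np.concatenate([heavy, recent]))

acc = np.array([5.0, 0.1, 3.0, 0.2, 0.3, 0.1])
print(h2o_keep(acc, n_heavy=2, n_recent=2))  # [0 2 4 5]
```

Positions 0 and 2 are retained because of their accumulated attention, positions 4 and 5 because they are recent; everything else is evicted.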

41. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression

  • Authors: Zichang Liu et al.
  • Venue: NeurIPS 2023
  • Model type: LLM
  • Core method: Based on “persistence of importance” hypothesis — tokens important at one step remain important; maintains fixed KV cache budget by evicting non-pivotal tokens
  • Key innovation: Observation that token importance persists across generation steps; up to 5x KV cache memory reduction without fine-tuning
  • ArXiv: 2305.17118

42. StreamingLLM: Efficient Streaming Language Models with Attention Sinks

  • Authors: Guangxuan Xiao et al. (MIT Han Lab)
  • Venue: ICLR 2024
  • Model type: LLM (LLaMA-2, MPT, Falcon, Pythia)
  • Core method: Retains KV cache of initial “attention sink” tokens plus a sliding window of recent tokens; discards intermediate tokens
  • Key innovation: Discovery of “attention sink” phenomenon — initial tokens receive disproportionate attention regardless of semantic content; enables infinite-length generation
  • Benchmarks: Stable language modeling with up to 4M+ tokens on models trained with finite windows; no fine-tuning needed
  • ArXiv: 2309.17453
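
The sink-plus-window cache layout is simple enough to write down directly. The sketch below returns which original positions stay in the cache and the positions assigned inside the cache; the function name and default sizes are illustrative assumptions.

```python
def streaming_cache_positions(seq_len, n_sink=4, window=8):
    """Return (kept original positions, positions used inside the cache)."""
    if seq_len <= n_sink + window:
        kept = list(range(seq_len))
    else:
        kept = list(range(n_sink)) + list(range(seq_len - window, seq_len))
    cache_pos = list(range(len(kept)))            # positions are relative to the cache
    return kept, cache_pos

kept, pos = streaming_cache_positions(20)
print(kept)  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```

The cache size stays fixed at `n_sink + window` regardless of how long generation runs, which is what allows streaming over very long inputs.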

43. Model Tells You What to Discard (FastGen): Adaptive KV Cache Compression for LLMs

  • Authors: Suyu Ge et al.
  • Venue: ICLR 2024 (Oral)
  • Model type: LLM (LLaMA)
  • Core method: Profiles attention head patterns and applies per-head adaptive compression strategies: evicts long-range on local heads, discards non-special on special-token heads, keeps full cache on broad-attention heads
  • Key innovation: Per-head adaptive compression policies based on attention structure profiling; no fine-tuning
  • Benchmarks: ~40% memory reduction on LLaMA-65B with negligible quality loss
  • ArXiv: 2310.01801

44. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

  • Authors: Zirui Liu et al.
  • Venue: ICML 2024
  • Model type: LLM (LLaMA-2, Falcon, Mistral)
  • Core method: Asymmetric quantization: quantizes Key cache per-channel and Value cache per-token to 2-bit precision
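  • Key innovation: Exploits the different distribution patterns of keys (a few channels with large-magnitude outliers) vs. values (no clear outlier pattern, so per-token quantization confines the error to individual tokens) for an asymmetric quantization strategy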
  • Benchmarks: 2.6x less peak memory, up to 4x larger batch size, 2.35x-3.47x throughput improvement
  • ArXiv: 2402.02750
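
The asymmetric scheme amounts to choosing a different axis for the quantization statistics of keys and values. The sketch below shows only that choice; it omits grouping and the full-precision residual window, and the helper names are hypothetical.

```python
import numpy as np

def quantize_asym(x, bits, axis):
    """Asymmetric uniform quantization with min/max taken along `axis`."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale = np.where(scale == 0, 1.0, scale)      # guard against constant slices
    q = np.clip(np.round((x - lo) / scale), 0, 2 ** bits - 1).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q * scale + lo

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))                      # (tokens, channels)
V = rng.normal(size=(16, 8))

# Keys: one scale per channel, so statistics run over the token axis
qk, sk, zk = quantize_asym(K, bits=2, axis=0)
# Values: one scale per token, so statistics run over the channel axis
qv, sv, zv = quantize_asym(V, bits=2, axis=1)
print(sk.shape, sv.shape)  # (1, 8) (16, 1)
```

Per-channel statistics keep an outlier channel from inflating the scale of every other channel in the key cache, while per-token statistics let each new value vector be quantized as soon as it arrives.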

45. SnapKV: LLM Knows What You Are Looking for Before Generation

  • Authors: Yuhong Li et al.
  • Venue: NeurIPS 2024
  • Model type: LLM
  • Core method: Compresses KV cache by selecting/clustering significant KV positions based on attention patterns observed in a small observation window at the end of the prompt
  • Key innovation: Discovery that attention patterns in the prompt’s final window predict which KV entries will be important during generation; clustering-based compression
  • ArXiv: 2404.14469
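
The observation-window selection can be sketched as follows: queries in the final window vote for prefix positions, the votes are pooled so that neighbours of strong positions survive together, and the top positions per head are kept. Max pooling, the kernel size, and the function name are assumptions for illustration.

```python
import numpy as np

def snapkv_select(attn, obs_window, budget, kernel=3):
    """attn: (heads, L, L) prompt attention. Returns per-head kept positions."""
    L = attn.shape[-1]
    prefix = L - obs_window
    votes = attn[:, prefix:, :prefix].sum(axis=1)     # (heads, prefix)
    pad = kernel // 2
    padded = np.pad(votes, ((0, 0), (pad, pad)), mode="edge")
    # 1D max pooling over neighbouring positions (assumed pooling choice)
    pooled = np.stack([padded[:, i:i + prefix] for i in range(kernel)]).max(axis=0)
    top = np.sort(np.argsort(-pooled, axis=1)[:, :budget], axis=1)
    window = np.arange(prefix, L)                     # the window itself is always kept
    return [np.concatenate([t, window]) for t in top]

rng = np.random.default_rng(0)
attn = rng.random((2, 12, 12))
kept = snapkv_select(attn, obs_window=4, budget=3)
print([len(k) for k in kept])  # [7, 7]
```

Each head keeps its own 3 prefix positions plus the 4 window positions, so the compressed cache has a fixed size that does not grow with prompt length.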

46. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

  • Authors: Coleman Hooper et al. (UC Berkeley)
  • Venue: NeurIPS 2024
  • Model type: LLM (LLaMA)
  • Core method: Multi-technique KV cache quantization: per-channel key quantization, pre-RoPE key quantization, sensitivity-weighted non-uniform datatypes, dense-and-sparse quantization
  • Key innovation: Enables sub-4-bit KV cache quantization via multiple complementary techniques; targets ultra-long context (10M tokens)
  • Benchmarks: <0.1 perplexity degradation at 3-bit on LLaMA-7B; 4.8x memory reduction
  • ArXiv: 2401.18079

47. MiniCache: KV Cache Compression in Depth Dimension for LLMs

  • Authors: Akide Liu et al.
  • Venue: NeurIPS 2024
  • Model type: LLM
  • Core method: Compresses KV cache across layers (depth dimension) by exploiting high similarity between adjacent layers’ KV states; disentangles into magnitude and direction, interpolates directions
  • Key innovation: Novel depth-wise (cross-layer) KV cache compression; retains highly distinct state pairs unmerged
  • Benchmarks: Up to 5.02x compression, 5x throughput improvement, 41% memory reduction vs FP16 baseline
  • ArXiv: 2405.14366

48. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

  • Authors: Dongjie Yang et al.
  • Venue: ACL 2024 Findings
  • Model type: LLM
  • Core method: Assigns varying KV cache budgets across layers, forming a pyramid shape (more cache in lower layers, less in upper)
  • Key innovation: Layer-aware cache budget allocation based on observation that different layers need different cache sizes
  • ArXiv: 2405.12532 (PyramidKV, entry 49, is closely related concurrent work)

49. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

  • Authors: Zefan Cai et al.
  • Venue: ACL 2025 (acceptance status unverified)
  • Model type: LLM
  • Core method: Dynamic layer-wise KV cache allocation following pyramidal pattern; lower layers get larger cache, upper layers smaller
  • Key innovation: Pyramidal information funneling observation; dynamic rather than static pyramid allocation
  • ArXiv: 2406.02069

50. PALU: KV-Cache Compression with Low-Rank Projection

  • Authors: Chi-Chih Chang et al.
  • Venue: ICLR 2025
  • Model type: LLM
  • Core method: Low-rank decomposition of KV cache; medium-grained decomposition scheme with efficient rank search and low-rank-aware quantization
  • Key innovation: Combines low-rank projection with quantization; optimized GPU kernels with matrix fusion
  • Benchmarks: Up to 1.19 lower perplexity compared to quantization-only methods at the same compression ratio
  • ArXiv: 2407.21118

51. RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

  • Authors: Payman Behnam et al. (NVIDIA)
  • Venue: ICML 2025
  • Model type: LLM
  • Core method: Two-stage compression: (1) coarse-grain permanent KV cache eviction, (2) fine-grain top-k sparse attention with hybrid sparse method
  • Key innovation: Two-stage pipeline enabling extreme compression ratios; combines eviction with sparse attention
  • Benchmarks: Up to 400x compression ratio, 3.7x speedup, 32.6% peak memory reduction on A100
  • ArXiv: not listed (code available on GitHub: NVlabs/RocketKV)

52. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

  • Authors: Hao Kang et al.
  • Venue: ICML 2024 (unverified; may be arXiv-only, 2024)
  • Model type: LLM
  • Core method: Three-component approach: ultra-low-precision quantization for bulk entries + low-rank matrix for quantization error + sparse matrix for outlier errors
  • Key innovation: Residual-based compression combining quantization, low-rank, and sparse correction
  • Benchmarks: Near-lossless 4-bit compression, 2.38x throughput, 2.29x memory reduction
  • ArXiv: 2403.05527
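
The three-component idea (quantized bulk, low-rank correction of the quantization error, sparse outliers) can be illustrated with a small sketch. The order of the steps, the single global quantization range, and the name `gear_like_compress` are assumptions; the paper's actual recipe may differ in detail.

```python
import numpy as np

def gear_like_compress(x, bits=4, rank=2, outlier_frac=0.02):
    """Return (full approximation, quantization + sparse only) of matrix x."""
    # 1. Pull the largest-magnitude entries out as a sparse matrix
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    sparse = np.where(np.abs(x) >= thresh, x, 0.0)
    dense = x - sparse
    # 2. Uniform low-bit quantization of the remaining bulk
    lo, hi = dense.min(), dense.max()
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    deq = np.round((dense - lo) / scale) * scale + lo
    # 3. Low-rank approximation of the quantization residual
    u, s, vt = np.linalg.svd(dense - deq, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return deq + low_rank + sparse, deq + sparse

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))
full, quant_only = gear_like_compress(x)
err_full = np.linalg.norm(x - full)
err_quant = np.linalg.norm(x - quant_only)
```

Since the low-rank term is a truncated SVD of the residual, `err_full` can never exceed `err_quant`; how much it helps depends on how structured the quantization error is.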

53. Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

  • Authors: (Various)
  • Venue: MLSys 2024
  • Model type: LLM
  • Core method: Combines token selection (sparsity) with quantization for KV cache compression
  • Key innovation: Joint sparse-quantized approach outperforming either technique alone

Category 5: Speculative Decoding

54. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

  • Authors: Tianle Cai et al.
  • Venue: ICML 2024
  • Model type: LLM
  • Core method: Adds extra lightweight decoding heads to predict multiple future tokens in parallel; uses tree-based attention for verification
  • Key innovation: Eliminates need for separate draft model; learns multiple parallel prediction heads (each a single feed-forward layer with a residual connection)
  • Benchmarks: Over 2.2x speedup (Medusa-1) and 2.3-3.6x (Medusa-2) without quality loss
  • ArXiv: 2401.10774

55. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

  • Authors: Yuhui Li et al.
  • Venue: ICML 2024 (EAGLE-1), EMNLP 2024 (EAGLE-2), NeurIPS 2025 (EAGLE-3)
  • Model type: LLM
  • Core method: Single trainable decoder layer operates at the feature level of the target model; autoregressive generation at feature level during drafting
  • Key innovation: Feature-level rather than token-level draft generation; better captures uncertainty
  • ArXiv: 2401.15077

Category 6: Token Pruning for Diffusion Transformers

56. Token Merging for Fast Stable Diffusion

  • Authors: Daniel Bolya et al. (Meta)
  • Venue: CVPR 2023 Workshop
  • Model type: Diffusion model (Stable Diffusion)
  • Core method: Extends ToMe to U-Net based diffusion models; merges similar tokens in self-attention layers
  • Key innovation: Adapts token merging from classification ViTs to generative diffusion models

57. Dynamic Diffusion Transformer (DyDiT)

  • Authors: (Various)
  • Venue: ICLR 2025
  • Model type: Diffusion Transformer (DiT)
  • Core method: Dynamic token pruning and channel selection for DiT; adapts computation based on timestep and spatial complexity
  • Key innovation: Timestep-aware dynamic architecture for diffusion transformers
  • Benchmarks: 51% FLOPs reduction, 1.73x generation acceleration on ImageNet

58. FreqTS: Frequency-Aware Token Selection for Accelerating Diffusion Models

  • Authors: (Various)
  • Venue: AAAI 2025
  • Model type: Diffusion model
  • Core method: Frequency-domain analysis for token importance; selects tokens based on frequency characteristics
  • Key innovation: Frequency-aware criteria for token selection in diffusion generation

Category 7: NLP Token Reduction (Non-KV-Cache)

59. TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

  • Authors: (Various)
  • Venue: EMNLP 2023
  • Model type: Video-Language model
  • Core method: Aggregates tokens along both temporal and spatial dimensions for long videos
  • Key innovation: Joint temporal-spatial aggregation strategy for video understanding

60. Efficient Transformers with Dynamic Token Pooling

  • Authors: (Various)
  • Venue: ACL 2023
  • Model type: Transformer (NLP)
  • Core method: Dynamic pooling of tokens during processing based on learned pooling decisions
  • Key innovation: Adaptive token pooling for NLP transformers

61. Revisiting Token Dropping Strategy in Efficient BERT Pretraining

  • Authors: (Various)
  • Venue: ACL 2023
  • Model type: BERT
  • Core method: Analyzes and improves token dropping strategies during BERT pretraining
  • Key innovation: Improved understanding of when and how to drop tokens during pretraining

62. TokenSkip: Controllable Chain-of-Thought Compression in LLMs

  • Authors: (Various)
  • Venue: EMNLP 2025
  • Model type: LLM (reasoning)
  • Core method: Compresses chain-of-thought reasoning by skipping intermediate tokens while preserving reasoning quality
  • Key innovation: Controllable compression of reasoning traces; balances efficiency vs. reasoning depth

63. CoT-Valve: Length-Compressible Chain-of-Thought Tuning

  • Authors: (Various)
  • Venue: ACL 2025
  • Model type: LLM (reasoning)
  • Core method: Tunes models to produce variable-length chain-of-thought; learns to compress reasoning length
  • Key innovation: Adjustable reasoning length via a single valve parameter

Summary Statistics

Category                          Count
ViT Token Pruning/Merging            18
VLM Token Pruning                    14
LLM Prompt/Context Compression        7
KV Cache Compression                 14
Speculative Decoding                  2
Diffusion Transformer                 3
NLP Token Reduction                   5
Total                                63

Year                              Count
2021                                  3
2022                                  5
2023                                 13
2024                                 23
2025                                 19
Total                                63