Token Pruning / Reduction Methods in Transformers (2021-2026)

A comprehensive survey of token pruning, merging, and compression methods published at top-tier AI/ML venues.

Category 1: Vision Transformer Token Pruning/Merging

1. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

Authors: Yongming Rao et al.
Venue: NeurIPS 2021 (also extended in T-PAMI)
Model type: ViT (DeiT)
Core method: Lightweight prediction module estimates importance score of each token; added at different layers for hierarchical pruning
Key innovation: First dynamic token sparsification framework for ViTs with learnable prediction modules and attention masking for end-to-end training
Benchmarks: Prunes 66% tokens, reduces 31%-37% FLOPs, improves throughput by 40%+ with <0.5% accuracy drop on ImageNet
ArXiv: 2106.02034

2. TokenLearner: Adaptive Space-Time Tokenization for Videos

Authors: Michael S. Ryoo et al. (Google Research)
Venue: NeurIPS 2021
Model type: ViT (video)
Core method: Learns to generate a small set of tokens from input via spatial attention maps; replaces fixed tokenization with adaptive learned tokenization
Key innovation: Data-driven token generation that reduces token count while improving accuracy; applicable to both images and videos
Benchmarks: SOTA on Kinetics-400, Kinetics-600, Charades, AViD
ArXiv: 2106.11297

3. IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers

Authors: Bowen Pan et al.
Venue: NeurIPS 2021
Model type: ViT
Core method: Learns to identify and remove redundant tokens using a policy network that determines which tokens to keep
Key innovation: Interpretability-aware approach that provides visual explanations while reducing computation
ArXiv: 2106.12620

4. EViT: Expediting Vision Transformers via Token Reorganizations

Authors: Youwei Liang et al.
Venue: ICLR 2022 (Spotlight)
Model type: ViT (DeiT)
Core method: Identifies attentive tokens using CLS token attention scores; reorganizes tokens by keeping top-k attentive tokens and fusing inattentive ones
Key innovation: Token reorganization guided by class token attention that adds no extra parameters and is applied during training; fuses rather than discards inattentive tokens to preserve information
Benchmarks: 50% speedup on DeiT-S with only 0.3% accuracy drop on ImageNet
ArXiv: 2202.07800
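
The keep-and-fuse step described for EViT can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name reorganize_tokens, the list-based features, and the attention-weighted average used for fusion are assumptions made for the example.

```python
def reorganize_tokens(tokens, cls_attn, keep):
    """Keep the `keep` most attended tokens and fuse the rest into one token.

    tokens:   list of feature vectors (lists of floats), CLS token excluded
    cls_attn: attention weight from the CLS token to each token (assumed > 0)
    """
    order = sorted(range(len(tokens)), key=lambda i: cls_attn[i], reverse=True)
    kept_idx = sorted(order[:keep])      # preserve the original token order
    dropped_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if not dropped_idx:
        return kept, kept_idx
    # Fuse the inattentive tokens with an attention-weighted average.
    total = sum(cls_attn[i] for i in dropped_idx)
    dim = len(tokens[0])
    fused = [
        sum(cls_attn[i] * tokens[i][d] for i in dropped_idx) / total
        for d in range(dim)
    ]
    return kept + [fused], kept_idx
```

The output has keep + 1 tokens: the attentive tokens plus a single fused token that summarizes everything else.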

5. A-ViT: Adaptive Tokens for Efficient Vision Transformer

Authors: Hongxu Yin et al. (NVIDIA)
Venue: CVPR 2022 (Oral)
Model type: ViT
Core method: Adaptively adjusts the number of tokens per image based on input complexity using a halting mechanism inspired by Adaptive Computation Time (ACT)
Key innovation: Per-token halting score that automatically reduces tokens for simpler inputs; no need for predefined pruning ratios
Benchmarks: Reduces computation with minimal accuracy loss on ImageNet
ArXiv: 2112.07658
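
A rough sketch of ACT-style halting, assuming each token emits a halting score in [0, 1] at every layer and stops once its cumulative score reaches 1 - eps. The function name halting_depths, the nested-list layout, and the eps value are illustrative assumptions rather than the paper's code.

```python
def halting_depths(halting_scores, eps=0.01):
    """Return, per token, the number of layers it runs before halting.

    halting_scores[l][k] is the halting score of token k at layer l.
    """
    num_layers = len(halting_scores)
    num_tokens = len(halting_scores[0])
    depths = []
    for k in range(num_tokens):
        cumulative = 0.0
        depth = num_layers               # default: token never halts early
        for layer in range(num_layers):
            cumulative += halting_scores[layer][k]
            if cumulative >= 1.0 - eps:
                depth = layer + 1        # token is dropped after this layer
                break
        depths.append(depth)
    return depths
```

Tokens with high early scores leave the network sooner, so simple images spend less computation without a preset pruning ratio.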

6. Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Authors: Yifan Xu et al.
Venue: AAAI 2022
Model type: ViT (DeiT, LeViT)
Core method: Separates tokens into informative (slow) and less informative (fast) groups; slow tokens go through full computation, fast tokens are updated with a lightweight mechanism
Key innovation: Slow-fast token evolution preserves global information flow while reducing computation
Benchmarks: 40%-60% throughput improvement on DeiT; further accelerates LeViT
ArXiv: 2108.01390

7. SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

Authors: Zhenglun Kong et al.
Venue: ECCV 2022
Model type: ViT (DeiT, Swin)
Core method: Soft token pruning with latency-aware optimization; prunes tokens by learning a soft mask and considers real hardware latency
Key innovation: Latency-aware pruning applicable to both flat (DeiT) and hierarchical (Swin) architectures; optimizes for actual speed rather than FLOPs
ArXiv: 2112.13890

8. ATS: Adaptive Token Sampling for Efficient Vision Transformers

Authors: Mohsen Fayyaz et al.
Venue: ECCV 2022 (Oral)
Model type: ViT
Core method: Differentiable parameter-free module that scores tokens and adaptively samples significant ones; can be plugged into any ViT
Key innovation: Parameter-free adaptive sampling based on inverse transform sampling; no additional learnable parameters
Benchmarks: 2x GFLOPs reduction while preserving accuracy on ImageNet, Kinetics-400/600
ArXiv: 2111.15667
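
Inverse transform sampling over token scores can be sketched as below. This is a simplified illustration under stated assumptions: the scores are taken as given (non-negative), the sampling points are evenly spaced rather than random, and each point picks the first token whose cumulative mass reaches it. The name adaptive_token_sample is hypothetical.

```python
def adaptive_token_sample(scores, num_samples):
    """Select token indices by inverse transform sampling on the score CDF."""
    total = sum(scores)
    cdf, running = [], 0.0
    for s in scores:
        running += s / total
        cdf.append(running)
    selected = set()
    for k in range(num_samples):
        u = (k + 0.5) / num_samples      # evenly spaced points in (0, 1)
        # Inverse CDF: first token whose cumulative mass reaches u.
        idx = next((i for i, c in enumerate(cdf) if c >= u), len(cdf) - 1)
        selected.add(idx)                # duplicates collapse, so the count adapts
    return sorted(selected)
```

When one token dominates the scores, several sampling points land on it and fewer distinct tokens survive; with flat scores, close to num_samples tokens are kept.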

9. Token Merging (ToMe): Your ViT But Faster

Authors: Daniel Bolya et al. (Meta/Facebook Research)
Venue: ICLR 2023 (Oral)
Model type: ViT
Core method: Merges similar tokens using bipartite soft matching on key similarity (cosine similarity of K vectors); combines rather than prunes
Key innovation: Training-free token merging that preserves information by combining similar tokens; applicable off-the-shelf to any ViT
Benchmarks: 2x throughput on ViT-L@512, ViT-H@518 with only 0.2-0.3% accuracy drop; 2.2x on video ViT-L
ArXiv: 2210.09461
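
Bipartite soft matching, as summarized above, can be sketched as follows. This is a simplified single-layer illustration in plain Python, not the released ToMe code: the even/odd split, the plain averaging (without tracking token sizes across layers), and the function names are assumptions for the example. It expects at least two tokens and r no larger than the number of even-position tokens.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bipartite_soft_matching(tokens, keys, r):
    """Merge r tokens; returns len(tokens) - r feature vectors."""
    a_idx = list(range(0, len(tokens), 2))   # set A: even positions
    b_idx = list(range(1, len(tokens), 2))   # set B: odd positions
    # Each A token proposes one edge to its most similar B token.
    edges = []
    for i in a_idx:
        best = max(b_idx, key=lambda j: cosine(keys[i], keys[j]))
        edges.append((cosine(keys[i], keys[best]), i, best))
    edges.sort(reverse=True)
    merged_edges = edges[:r]                 # keep the r most similar edges
    # Accumulate sums and counts so that merged tokens are averages.
    sums = {j: list(tokens[j]) for j in b_idx}
    size = {j: 1 for j in b_idx}
    merged_a = set()
    for _, i, j in merged_edges:
        sums[j] = [s + t for s, t in zip(sums[j], tokens[i])]
        size[j] += 1
        merged_a.add(i)
    out = [tokens[i] for i in a_idx if i not in merged_a]
    out += [[s / size[j] for s in sums[j]] for j in b_idx]
    return out
```

Because every A token proposes exactly one edge, the matching needs no iterative clustering, which is what keeps the step cheap enough to run in every block.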

10. Joint Token Pruning and Squeezing (TPS)

Authors: Siyuan Wei et al.
Venue: CVPR 2023
Model type: ViT
Core method: Combines pruning with “squeezing” — pruned tokens’ information is injected into retained tokens via unidirectional nearest-neighbor matching and similarity-oriented fusing
Key innovation: Recovers information from pruned tokens by squeezing their content into similar retained tokens, enabling more aggressive compression
Benchmarks: More aggressive compression than pure pruning or merging with comparable accuracy
ArXiv: 2304.10716

11. DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

Authors: Mengzhao Chen et al.
Venue: ICCV 2023
Model type: ViT
Core method: Makes compression ratio differentiable; jointly optimizes token pruning and merging decisions and their ratios per layer via gradient descent
Key innovation: First method to make the compression rate itself a learnable parameter optimized end-to-end
Benchmarks: Outperforms ToMe and other methods at the same FLOPs budget on ImageNet
ArXiv: 2305.17997

12. Making Vision Transformers Efficient from A Token Sparsification View

Authors: Shuning Chang et al.
Venue: CVPR 2023
Model type: ViT
Core method: Comprehensive analysis and improved token sparsification strategy
Key innovation: Unified view of token sparsification methods with improved strategy

13. Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph

Authors: Hongjie Wang et al. (Princeton)
Venue: CVPR 2024
Model type: ViT
Core method: Uses Weighted Page Rank (WPR) on the attention graph to compute token importance; combines importance and similarity for pruning decisions
Key innovation: First zero-shot method that jointly considers token importance (via PageRank on attention) and similarity; no training or fine-tuning needed
ArXiv: 2305.17328
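
The PageRank-style importance score can be illustrated with a short power iteration over an attention matrix. This is a simplified sketch, assuming a single row-stochastic attention matrix attn where attn[i][j] is how much token i attends to token j; the function name and the fixed iteration count are assumptions, and the similarity-based stage of the method is not shown.

```python
def weighted_page_rank(attn, iters=50):
    """Iterate s[j] <- sum_i attn[i][j] * s[i], renormalizing each round."""
    n = len(attn)
    score = [1.0 / n] * n
    for _ in range(iters):
        score = [sum(attn[i][j] * score[i] for i in range(n)) for j in range(n)]
        total = sum(score)
        score = [s / total for s in score]
    return score
```

A token ends up important when important tokens attend to it, which is the property that distinguishes this score from a plain column average of the attention matrix.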

14. Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning

Authors: (OpenGVLab team)
Venue: ECCV 2024
Model type: ViT
Core method: Compensates for information loss from token reduction by learning a lightweight compensator module
Key innovation: Allows changing the compression ratio at inference time without retraining

15. Agglomerative Token Clustering

Authors: (Various)
Venue: ECCV 2024
Model type: ViT
Core method: Hierarchical agglomerative clustering of tokens based on feature similarity
Key innovation: Bottom-up clustering approach for token merging

16. Exploring Token Pruning in Vision State Space Models

Authors: (Various)
Venue: NeurIPS 2024
Model type: Vision SSM (Mamba-based)
Core method: Investigates token pruning in state space models; finds naive application of ViT pruning methods fails
Key innovation: First study of token pruning for vision state space models; proposes SSM-specific pruning strategies

17. Accelerating Transformers with Spectrum-Preserving Token Merging

Authors: (Various)
Venue: NeurIPS 2024
Model type: ViT
Core method: Token merging that preserves the spectral properties of the token representation
Key innovation: Frequency-domain aware merging strategy

18. Token Cropr: Faster ViTs for Quite a Few Tasks

Authors: (Various)
Venue: CVPR 2025
Model type: ViT
Core method: Crops/selects tokens for multiple downstream tasks beyond classification
Key innovation: Task-agnostic token selection applicable to detection, segmentation, and other dense tasks

Category 2: Vision-Language Model Token Pruning

19. MADTP: Multimodal Alignment-Guided Dynamic Token Pruning

Authors: Jianjian Cao et al.
Venue: CVPR 2024
Model type: Vision-Language Transformer (BLIP, BLIP-2)
Core method: Multi-modality Alignment Guidance (MAG) aligns features across modalities; Dynamic Token Pruning (DTP) adaptively adjusts compression ratio per layer per instance
Key innovation: First dynamic token pruning for VL transformers that considers cross-modal alignment; instance-adaptive pruning ratios
Benchmarks: 80% GFLOPs reduction on BLIP with <4% performance drop on NLVR2
ArXiv: 2403.02991

20. FastV: An Image is Worth 1/2 Tokens After Layer 2

Authors: Liang Chen et al.
Venue: ECCV 2024 (Oral)
Model type: Large Vision-Language Model (LLaVA)
Core method: Prunes visual tokens in early LLM layers based on attention scores; plug-and-play approach
Key innovation: Discovers that visual token attention becomes sparse after layer 2 in VLLMs; first to use LLM’s own attention signal for visual token pruning
Benchmarks: 50% visual tokens pruned after layer 2 without accuracy loss on LLaVA-1.5-13B; 45% FLOPs reduction
ArXiv: 2403.06764
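
A minimal sketch of attention-based visual token pruning at one layer. It assumes attn_received[p] holds the average attention that position p receives at the chosen layer, and that visual_positions lists which sequence positions are image tokens; both names and the helper fastv_keep_indices are hypothetical.

```python
def fastv_keep_indices(attn_received, visual_positions, keep_ratio=0.5):
    """Return the sequence positions that survive pruning, in order."""
    n_keep = max(1, int(len(visual_positions) * keep_ratio))
    ranked = sorted(visual_positions, key=lambda p: attn_received[p], reverse=True)
    kept_visual = set(ranked[:n_keep])
    visual = set(visual_positions)
    # Text and system tokens are always kept; only visual tokens are ranked.
    return [
        p for p in range(len(attn_received))
        if p not in visual or p in kept_visual
    ]
```

All later layers then run on the shortened sequence, which is where the FLOPs saving comes from.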

21. IVTP: Instruction-Guided Visual Token Pruning for Large Vision-Language Models

Authors: Kai Huang et al.
Venue: ECCV 2024
Model type: Large Vision-Language Model
Core method: Uses instruction/text query to guide visual token importance estimation via Group-wise Token Pruning (GTP) with attention rollout
Key innovation: Instruction-aware pruning that considers the task/question when deciding which visual tokens to keep

22. Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models

Authors: Chen Ju et al.
Venue: ECCV 2024 (Oral)
Model type: Vision-Language Model
Core method: Prunes tokens based on “information degree” that combines mutual redundancy (data duplication between sequential tokens) and semantic value (each token’s contribution to overall semantics)
Key innovation: Dual-criterion informativity metric; applicable as plug-in to both visual and textual tokens
ArXiv: 2407.11717

23. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Authors: Yuzhang Shang et al.
Venue: ICCV 2025
Model type: Large Multimodal Model (LLaVA)
Core method: Exploits sparsity in CLIP visual encoder attention to identify crucial tokens; prunes less important tokens then merges pruned token information into retained tokens via key-similarity clustering
Key innovation: Adaptive pruning ratio via outlier detection on attention scores; combines pruning + merging for multimodal models
Benchmarks: 18x average visual token compression with comparable VQA/reasoning performance
ArXiv: 2403.15388

24. Dynamic-LLaVA: Efficient Multimodal LLMs via Dynamic Vision-Language Context Sparsification

Authors: (Various)
Venue: ICLR 2025
Model type: Multimodal LLM (LLaVA)
Core method: Dynamically reduces vision context redundancy in prefill stage and language context overhead during decoding
Key innovation: Joint vision-language token sparsification; handles both modalities
Benchmarks: ~50% computation reduction in decoding; ~50% GPU memory savings with KV cache

25. ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Authors: Xubing Ye et al.
Venue: CVPR 2025
Model type: Large Vision-Language Model
Core method: Adaptive Token Pruning (ATP) module computes instance-specific importance scores and pruning thresholds per LLM layer; Spatial Augmented Pruning (SAP) considers both redundancy and spatial relationships
Key innovation: Instance-adaptive per-layer pruning ratio; spatial-aware pruning that preserves spatial structure
Benchmarks: 75% average token reduction with only 1.9% performance degradation across 7 benchmarks
ArXiv: 2412.00447

26. PACT: Pruning and Clustering-Based Token Reduction for Faster Visual Language Models

Authors: Mohamed Dhouib et al.
Venue: CVPR 2025
Model type: Vision-Language Model
Core method: Prunes irrelevant tokens and merges visually redundant ones at an early LLM layer via clustering
Key innovation: Combined pruning and clustering at a single early layer for efficiency
Benchmarks: 71.3% token reduction, 31% GPU memory reduction, 225% speedup
ArXiv: 2504.08966

27. TopV: Compatible Token Pruning with Inference Time Optimization

Authors: Cheng Yang et al.
Venue: CVPR 2025
Model type: Multimodal VLM
Core method: Formulates token pruning as an optimization problem rather than relying on attention scores; compatible with FlashAttention
Key innovation: Optimization-based token selection that works with FlashAttention (which doesn’t output attention weights); training-free
ArXiv: 2503.18278

28. DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Authors: Saeed Ranjbar Alvar et al.
Venue: CVPR 2025
Model type: Large Multimodal Model
Core method: Formulates token pruning as a Max-Min Diversity Problem (MMDP); selects token subset that maximizes diversity among retained tokens
Key innovation: Diversity-maximization approach to token selection; training-free; works well at high pruning ratios
Benchmarks: SOTA across 16 image- and video-language datasets
ArXiv: 2503.02175
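
The max-min diversity objective can be approximated with a greedy farthest-point heuristic, sketched below. This is a standard heuristic for the problem and is not claimed to match the paper's exact procedure; starting from token 0, using cosine distance, and the function names are assumptions. It expects k to be at most the number of tokens.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / norm if norm else 0.0)

def max_min_diverse_subset(tokens, k):
    """Greedily pick k tokens, each time adding the one farthest from the set."""
    selected = [0]                           # arbitrary starting token
    # min_dist[i]: distance from token i to its nearest selected token.
    min_dist = [cosine_distance(t, tokens[0]) for t in tokens]
    while len(selected) < k:
        candidates = (i for i in range(len(tokens)) if i not in selected)
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], cosine_distance(t, tokens[nxt]))
    return sorted(selected)
```

Near-duplicate tokens are never both chosen while a more distinct candidate remains, which is why this style of selection holds up at high pruning ratios.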

29. PyramidDrop: Accelerating LVLMs via Pyramid Visual Redundancy Reduction

Authors: Long Xing et al. (code released under the Cooperx521 GitHub account)
Venue: CVPR 2025
Model type: Large Vision-Language Model
Core method: Progressively drops more visual tokens in deeper layers, forming a pyramid shape (fewer tokens in deeper layers)
Key innovation: Observation that visual redundancy is low in shallow layers but increases in deeper layers; pyramid-shaped pruning schedule
Benchmarks: 40% training time reduction, 55% inference FLOPs reduction on LLaVA-NeXT with comparable performance
ArXiv: 2410.17247
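
The pyramid schedule can be made concrete with a small counting sketch. The stage count, the keep ratio of 0.5, and the function name pyramid_schedule are illustrative assumptions, and the layer count is assumed to be divisible by the number of stages.

```python
def pyramid_schedule(num_visual, num_layers, num_stages=4, keep_ratio=0.5):
    """Return the number of visual tokens processed by each layer."""
    layers_per_stage = num_layers // num_stages
    counts = []
    tokens = num_visual
    for layer in range(num_layers):
        counts.append(tokens)
        # Drop tokens at each stage boundary, except after the final layer.
        if (layer + 1) % layers_per_stage == 0 and layer + 1 < num_layers:
            tokens = max(1, int(tokens * keep_ratio))
    return counts
```

For 576 visual tokens and 32 layers this gives 576, 288, 144, and 72 tokens in the four stages, or 8640 token-layer slots in total against 18432 without dropping. That figure counts tokens per layer only; it is not a FLOPs measurement.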

30. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

Authors: Jin Tao et al.
Venue: CVPR 2025
Model type: Video LLM
Core method: Dynamic compression of video tokens based on temporal and spatial redundancy
Key innovation: Video-specific token compression for Video LLMs

31. AdaptMerge: Inference Time Adaptive Visual and Language Token Merging

Authors: (Various)
Venue: EMNLP 2025 Findings
Model type: Vision-Language Model
Core method: Two-stage reduction: Adaptive Visual Token Merging (AVTM) in the vision encoder, then Adaptive Language-Guided Visual Token Merging (ALVTM) at LLM input
Key innovation: Dual-stage merging that considers both visual similarity and language guidance

32. CoViPAL: Layer-wise Contextualized Visual Token Pruning

Authors: (Various)
Venue: EMNLP 2025 Findings
Model type: Large Vision-Language Model
Core method: Layer-wise contextualized pruning of visual tokens
Key innovation: Context-aware layer-wise pruning strategy

Category 3: LLM Token Pruning (Prompt/Context Compression)

33. LLMLingua: Compressing Prompts for Accelerated Inference of LLMs

Authors: Huiqiang Jiang et al. (Microsoft)
Venue: EMNLP 2023
Model type: LLM (general)
Core method: Coarse-to-fine prompt compression with budget controller, token-level iterative compression, and instruction tuning for distribution alignment
Key innovation: First systematic prompt compression framework using a small LM to identify removable tokens; up to 20x compression
ArXiv: 2310.05736

34. Learning to Compress Prompts with Gist Tokens

Authors: Jesse Mu et al.
Venue: NeurIPS 2023
Model type: LLM
Core method: Trains LM to compress prompts into smaller sets of “gist” tokens that can be cached and reused
Key innovation: Learned compression into reusable gist tokens; up to 26x prompt compression
ArXiv: 2304.08467

35. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Authors: (Various)
Venue: NeurIPS 2023
Model type: LLM
Core method: Dynamically prunes context during autoregressive generation
Key innovation: Per-head, per-step context pruning that is also interpretable

36. In-context Autoencoder for Context Compression in a Large Language Model

Authors: (Various)
Venue: ICLR 2024
Model type: LLM
Core method: Compresses long context into compact representation using an in-context autoencoder
Key innovation: Autoencoder-based context compression within the LLM framework

37. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Authors: Huiqiang Jiang et al. (Microsoft)
Venue: ACL 2024
Model type: LLM
Core method: Extends LLMLingua for long-context scenarios; mitigates “lost in the middle” problem
Key innovation: Question-aware prompt compression that prioritizes tokens relevant to the query
ArXiv: 2310.06839

38. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Authors: Qichen Fu et al. (Apple)
Venue: ICML 2024 Workshop (ES-FoMo)
Model type: LLM (LLaMA 2)
Core method: Selectively computes KV for tokens important for next token prediction; dynamically selects different token subsets at each generation step
Key innovation: Training-free, universal dynamic token selection that allows previously pruned tokens to be reconsidered in later steps
Benchmarks: 2.34x prefilling acceleration on LLaMA 2 7B for multi-document QA
ArXiv: 2407.14057

39. MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Authors: Julie Kallini et al.
Venue: ICLR 2025
Model type: Byte-level LM (ByT5)
Core method: Integrates a learned delete gate in the encoder to dynamically shorten byte sequences; gate determines which tokens to remove
Key innovation: Applies token merging to byte-level models, reducing the fundamental tokenization overhead
Benchmarks: Comparable accuracy to ByT5 while reducing sequence lengths by up to 75% on XNLI, TyDi QA
ArXiv: 2410.20771

Category 4: KV Cache Compression/Pruning for LLMs

40. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of LLMs

Authors: Zhenyu Zhang et al.
Venue: NeurIPS 2023
Model type: LLM (OPT)
Core method: KV cache eviction policy that dynamically retains “heavy hitter” tokens (those with high accumulated attention scores) plus recent tokens
Key innovation: Discovery that attention scores follow power-law distribution; small set of heavy-hitter tokens dominate attention computation
Benchmarks: With a 20% heavy-hitter budget, up to 29x throughput improvement over DeepSpeed Zero-Inference
ArXiv: 2306.14048
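
The eviction policy can be sketched as a selection over accumulated attention scores. This is a minimal illustration, assuming acc_scores[i] already holds the attention mass that cached position i has received so far; the function name h2o_keep and the two budget parameters are hypothetical.

```python
def h2o_keep(acc_scores, budget_heavy, budget_recent):
    """Return cache positions to keep: recent tokens plus heavy hitters."""
    n = len(acc_scores)
    recent = set(range(max(0, n - budget_recent), n))
    older = [i for i in range(n) if i not in recent]
    # Heavy hitters: older positions with the largest accumulated attention.
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:budget_heavy]
    return sorted(recent | set(heavy))
```

In a decoding loop, the scores would be updated with each new attention row before calling the function again, so the cache size stays fixed at budget_heavy + budget_recent.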

41. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression

Authors: Zichang Liu et al.
Venue: NeurIPS 2023
Model type: LLM
Core method: Based on “persistence of importance” hypothesis — tokens important at one step remain important; maintains fixed KV cache budget by evicting non-pivotal tokens
Key innovation: Observation that token importance persists across generation steps; up to 5x KV cache memory reduction without fine-tuning
ArXiv: 2305.17118

42. StreamingLLM: Efficient Streaming Language Models with Attention Sinks

Authors: Guangxuan Xiao et al. (MIT Han Lab)
Venue: ICLR 2024
Model type: LLM (LLaMA-2, MPT, Falcon, Pythia)
Core method: Retains KV cache of initial “attention sink” tokens plus a sliding window of recent tokens; discards intermediate tokens
Key innovation: Discovery of “attention sink” phenomenon — initial tokens receive disproportionate attention regardless of semantic content; enables infinite-length generation
Benchmarks: Stable language modeling with up to 4M+ tokens on models trained with finite windows; no fine-tuning needed
ArXiv: 2309.17453
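
Which positions stay in the cache under a sink-plus-window policy can be shown with a short helper. The defaults of 4 sink tokens and a 1020-token window, and the name streaming_cache_positions, are assumptions chosen for illustration.

```python
def streaming_cache_positions(seq_len, num_sinks=4, window=1020):
    """Return the original positions whose KV entries are retained."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))          # nothing to evict yet
    sinks = list(range(num_sinks))           # initial attention-sink tokens
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent
```

The cache size is bounded by num_sinks + window regardless of how long generation runs, which is what makes streaming over very long inputs feasible.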

43. Model Tells You What to Discard (FastGen): Adaptive KV Cache Compression for LLMs

Authors: Suyu Ge et al.
Venue: ICLR 2024 (Oral)
Model type: LLM (LLaMA)
Core method: Profiles attention head patterns and applies per-head adaptive compression strategies: evicts long-range on local heads, discards non-special on special-token heads, keeps full cache on broad-attention heads
Key innovation: Per-head adaptive compression policies based on attention structure profiling; no fine-tuning
Benchmarks: ~40% memory reduction on LLaMA-65B with negligible quality loss
ArXiv: 2310.01801

44. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Authors: Zirui Liu et al.
Venue: ICML 2024
Model type: LLM (LLaMA-2, Falcon, Mistral)
Core method: Asymmetric quantization: quantizes Key cache per-channel and Value cache per-token to 2-bit precision
Key innovation: Exploits different distribution patterns of keys (per-channel outliers) vs. values (per-token outliers) for asymmetric quantization strategy
Benchmarks: 2.6x less peak memory, up to 4x larger batch size, 2.35x-3.47x throughput improvement
ArXiv: 2402.02750
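
The per-channel versus per-token grouping can be illustrated with a small asymmetric quantizer. This is a simplified sketch, not the KIVI kernels: it quantizes whole columns or rows as single groups, omits the grouping by group size and any full-precision residual, and all function names are assumptions.

```python
def quantize_group(values, bits=2):
    """Asymmetric uniform quantization; returns (codes, scale, zero_point)."""
    levels = (1 << bits) - 1                 # 3 for 2-bit codes 0..3
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # avoid a zero scale for flat groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, zero_point):
    return [c * scale + zero_point for c in codes]

def quantize_keys_per_channel(keys):
    """keys[t][c]: quantize each channel (column) across tokens."""
    return [quantize_group(list(col)) for col in zip(*keys)]

def quantize_values_per_token(values):
    """values[t][c]: quantize each token (row) across channels."""
    return [quantize_group(row) for row in values]
```

Grouping keys by channel means a channel with large outliers gets its own scale and zero point instead of stretching the range of every other channel.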

45. SnapKV: LLM Knows What You Are Looking for Before Generation

Authors: Yuhong Li et al.
Venue: NeurIPS 2024
Model type: LLM
Core method: Compresses KV cache by selecting/clustering significant KV positions based on attention patterns observed in a small observation window at the end of the prompt
Key innovation: Discovery that attention patterns in the prompt’s final window predict which KV entries will be important during generation; clustering-based compression
ArXiv: 2404.14469
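
The observation-window idea can be sketched as voting followed by 1D pooling. This is an illustrative simplification for a single head: obs_attn[q][p] is assumed to be the attention from observation-window query q to prefix position p, max pooling stands in for the clustering step, and the name snapkv_select is hypothetical.

```python
def snapkv_select(obs_attn, prefix_len, budget, kernel=3):
    """Return prefix positions whose KV entries are kept."""
    # Vote: total attention each prefix position receives from the window.
    votes = [sum(row[p] for row in obs_attn) for p in range(prefix_len)]
    # Pool: spread each peak to its neighbours so nearby context is kept too.
    half = kernel // 2
    pooled = [
        max(votes[max(0, p - half): p + half + 1])
        for p in range(prefix_len)
    ]
    top = sorted(range(prefix_len), key=lambda p: pooled[p], reverse=True)
    return sorted(top[:budget])
```

The observation-window tokens themselves are kept in addition to the selected prefix positions; that part is not shown here.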

46. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Authors: Coleman Hooper et al. (UC Berkeley)
Venue: NeurIPS 2024
Model type: LLM (LLaMA)
Core method: Multi-technique KV cache quantization: per-channel key quantization, pre-RoPE key quantization, sensitivity-weighted non-uniform datatypes, dense-and-sparse quantization
Key innovation: Enables sub-4-bit KV cache quantization via multiple complementary techniques; targets ultra-long context (10M tokens)
Benchmarks: <0.1 perplexity degradation at 3-bit on LLaMA-7B; 4.8x memory reduction
ArXiv: 2401.18079

47. MiniCache: KV Cache Compression in Depth Dimension for LLMs

Authors: Akide Liu et al.
Venue: NeurIPS 2024
Model type: LLM
Core method: Compresses KV cache across layers (depth dimension) by exploiting high similarity between adjacent layers’ KV states; disentangles into magnitude and direction, interpolates directions
Key innovation: Novel depth-wise (cross-layer) KV cache compression; retains highly distinct state pairs unmerged
Benchmarks: Up to 5.02x compression, 5x throughput improvement, 41% memory reduction vs FP16 baseline
ArXiv: 2405.14366
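
The magnitude and direction split can be sketched for one pair of state vectors from adjacent layers. This is a hedged illustration rather than the authors' code: spherical interpolation (SLERP) with t = 0.5 is an assumed choice, the function names are hypothetical, and nearly antiparallel pairs are not handled because such dissimilar pairs are meant to stay unmerged.

```python
import math

def merge_adjacent_layers(x_lo, x_hi, t=0.5):
    """Share one unit direction across two layers; keep both magnitudes."""
    n_lo = math.sqrt(sum(v * v for v in x_lo))
    n_hi = math.sqrt(sum(v * v for v in x_hi))
    d_lo = [v / n_lo for v in x_lo]
    d_hi = [v / n_hi for v in x_hi]
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(d_lo, d_hi))))
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        shared = d_lo                        # directions already coincide
    else:
        w_lo = math.sin((1 - t) * omega) / math.sin(omega)
        w_hi = math.sin(t * omega) / math.sin(omega)
        shared = [w_lo * a + w_hi * b for a, b in zip(d_lo, d_hi)]
    return shared, n_lo, n_hi

def restore(shared, magnitude):
    """Rebuild an approximate state from the shared direction."""
    return [magnitude * v for v in shared]
```

Storage drops from two full vectors to one direction plus two scalars per pair; calling restore(shared, n_lo) or restore(shared, n_hi) recovers an approximation of either layer's state.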

48. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference

Authors: Dongjie Yang et al.
Venue: ACL 2024 Findings
Model type: LLM
Core method: Assigns varying KV cache budgets across layers, forming a pyramid shape (more cache in lower layers, less in upper)
Key innovation: Layer-aware cache budget allocation based on observation that different layers need different cache sizes
ArXiv: 2405.12532 (note: PyramidKV, arXiv 2406.02069, is a closely related concurrent work)

49. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Authors: Zefan Cai et al.
Venue: ACL 2025 (acceptance status unconfirmed)
Model type: LLM
Core method: Dynamic layer-wise KV cache allocation following pyramidal pattern; lower layers get larger cache, upper layers smaller
Key innovation: Pyramidal information funneling observation; dynamic rather than static pyramid allocation
ArXiv: 2406.02069

50. PALU: KV-Cache Compression with Low-Rank Projection

Authors: Chi-Chih Chang et al.
Venue: ICLR 2025
Model type: LLM
Core method: Low-rank decomposition of KV cache; medium-grained decomposition scheme with efficient rank search and low-rank-aware quantization
Key innovation: Combines low-rank projection with quantization; optimized GPU kernels with matrix fusion
Benchmarks: Up to 1.19 lower perplexity compared to quantization-only methods at the same compression ratio
ArXiv: 2407.21118

51. RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Authors: Payman Behnam et al. (NVIDIA)
Venue: ICML 2025
Model type: LLM
Core method: Two-stage compression: (1) coarse-grain permanent KV cache eviction, (2) fine-grain top-k sparse attention with hybrid sparse method
Key innovation: Two-stage pipeline enabling extreme compression ratios; combines eviction with sparse attention
Benchmarks: Up to 400x compression ratio, 3.7x speedup, 32.6% peak memory reduction on A100
ArXiv: not listed (code on GitHub: NVlabs/RocketKV)

52. GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Authors: Hao Kang et al.
Venue: ICML 2024 (unverified; may be arXiv-only, 2024)
Model type: LLM
Core method: Three-component approach: ultra-low-precision quantization for bulk entries + low-rank matrix for quantization error + sparse matrix for outlier errors
Key innovation: Residual-based compression combining quantization, low-rank, and sparse correction
Benchmarks: Near-lossless 4-bit compression, 2.38x throughput, 2.29x memory reduction
ArXiv: 2403.05527

53. Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

Authors: (Various)
Venue: MLSys 2024
Model type: LLM
Core method: Combines token selection (sparsity) with quantization for KV cache compression
Key innovation: Joint sparse-quantized approach outperforming either technique alone

Category 5: Speculative Decoding (Related Token-Level Efficiency)

54. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Authors: Tianle Cai et al.
Venue: ICML 2024
Model type: LLM
Core method: Adds extra lightweight decoding heads to predict multiple future tokens in parallel; uses tree-based attention for verification
Key innovation: Eliminates need for separate draft model; learns multiple parallel prediction heads (each a 2-layer FFN)
Benchmarks: 2.2x speedup (Medusa-1) to 2.3-2.8x (Medusa-2) without quality loss
ArXiv: 2401.10774

55. EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Authors: Yuhui Li et al.
Venue: ICML 2024 (EAGLE-1), EMNLP 2024 (EAGLE-2), NeurIPS 2025 (EAGLE-3)
Model type: LLM
Core method: Single trainable decoder layer operates at the feature level of the target model; autoregressive generation at feature level during drafting
Key innovation: Feature-level rather than token-level draft generation; better captures uncertainty
ArXiv: 2401.15077

Category 6: Token Pruning for Diffusion Transformers

56. Token Merging for Fast Stable Diffusion

Authors: Daniel Bolya et al. (Meta)
Venue: CVPR 2023 Workshop
Model type: Diffusion model (Stable Diffusion)
Core method: Extends ToMe to U-Net based diffusion models; merges similar tokens in self-attention layers
Key innovation: Adapts token merging from classification ViTs to generative diffusion models

57. Dynamic Diffusion Transformer (DyDiT)

Authors: (Various)
Venue: ICLR 2025
Model type: Diffusion Transformer (DiT)
Core method: Dynamic token pruning and channel selection for DiT; adapts computation based on timestep and spatial complexity
Key innovation: Timestep-aware dynamic architecture for diffusion transformers
Benchmarks: 51% FLOPs reduction, 1.73x generation acceleration on ImageNet

58. FreqTS: Frequency-Aware Token Selection for Accelerating Diffusion Models

Authors: (Various)
Venue: AAAI 2025
Model type: Diffusion model
Core method: Frequency-domain analysis for token importance; selects tokens based on frequency characteristics
Key innovation: Frequency-aware criteria for token selection in diffusion generation

Category 7: NLP Token Reduction (Non-KV-Cache)

59. TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Authors: (Various)
Venue: EMNLP 2023
Model type: Video-Language model
Core method: Aggregates tokens along both temporal and spatial dimensions for long videos
Key innovation: Joint temporal-spatial aggregation strategy for video understanding

60. Efficient Transformers with Dynamic Token Pooling

Authors: (Various)
Venue: ACL 2023
Model type: Transformer (NLP)
Core method: Dynamic pooling of tokens during processing based on learned pooling decisions
Key innovation: Adaptive token pooling for NLP transformers

61. Revisiting Token Dropping Strategy in Efficient BERT Pretraining

Authors: (Various)
Venue: ACL 2023
Model type: BERT
Core method: Analyzes and improves token dropping strategies during BERT pretraining
Key innovation: Improved understanding of when and how to drop tokens during pretraining

62. TokenSkip: Controllable Chain-of-Thought Compression in LLMs

Authors: (Various)
Venue: EMNLP 2025
Model type: LLM (reasoning)
Core method: Compresses chain-of-thought reasoning by skipping intermediate tokens while preserving reasoning quality
Key innovation: Controllable compression of reasoning traces; balances efficiency vs. reasoning depth

63. CoT-Valve: Length-Compressible Chain-of-Thought Tuning

Authors: (Various)
Venue: ACL 2025
Model type: LLM (reasoning)
Core method: Tunes models to produce variable-length chain-of-thought; learns to compress reasoning length
Key innovation: Adjustable reasoning length via a single valve parameter

Summary Statistics

Category	Count
ViT Token Pruning/Merging	18
VLM Token Pruning	14
LLM Prompt/Context Compression	7
KV Cache Compression	14
Speculative Decoding	2
Diffusion Transformer	3
NLP Token Reduction	5
Total	63

Year (first listed venue)	Count
2021	3
2022	5
2023	13
2024	23
2025	19
Total	63
