1. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
Authors: Yongming Rao et al.
Venue: NeurIPS 2021 (also extended in T-PAMI)
Model type: ViT (DeiT)
Core method: A lightweight prediction module estimates an importance score for each token; modules are inserted at several layers so that tokens are pruned hierarchically
Key innovation: First dynamic token sparsification framework for ViTs with learnable prediction modules and attention masking for end-to-end training
Benchmarks: Prunes 66% tokens, reduces 31%-37% FLOPs, improves throughput by 40%+ with <0.5% accuracy drop on ImageNet
ArXiv: 2106.02034
2. TokenLearner: Adaptive Space-Time Tokenization for Videos
Authors: Michael S. Ryoo et al. (Google Research)
Venue: NeurIPS 2021
Model type: ViT (video)
Core method: Learns to generate a small set of tokens from input via spatial attention maps; replaces fixed tokenization with adaptive learned tokenization
Key innovation: Data-driven token generation that reduces token count while improving accuracy; applicable to both images and videos
Benchmarks: SOTA on Kinetics-400, Kinetics-600, Charades, AViD
ArXiv: 2106.11297
3. IA-RED^2: Interpretability-Aware Redundancy Reduction for Vision Transformers
Authors: Bowen Pan et al.
Venue: NeurIPS 2021
Model type: ViT
Core method: Learns to identify and remove redundant tokens using a policy network that determines which tokens to keep
Key innovation: Interpretability-aware approach that provides visual explanations while reducing computation
ArXiv: 2106.12620
4. EViT: Expediting Vision Transformers via Token Reorganizations
Authors: Youwei Liang et al.
Venue: ICLR 2022 (Spotlight)
Model type: ViT (DeiT)
Core method: Identifies attentive tokens using CLS token attention scores; reorganizes tokens by keeping top-k attentive tokens and fusing inattentive ones
Key innovation: Parameter-free token reorganization guided by class token attention and integrated into ViT training; fuses rather than discards inattentive tokens to preserve information
Benchmarks: 50% speedup on DeiT-S with only 0.3% accuracy drop on ImageNet
ArXiv: 2202.07800
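The keep-and-fuse step described for EViT can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `reorganize_tokens`, the list-of-lists token layout, and the attention-weighted average used for fusing are assumptions made for illustration.

```python
def reorganize_tokens(tokens, cls_attn, keep):
    """Keep the `keep` most attended tokens and fuse the rest into one token.

    tokens:   list of feature vectors (lists of floats), CLS token excluded
    cls_attn: attention weight from the CLS token to each token
    Returns (new_tokens, kept_indices).
    """
    # Rank tokens by how much attention the CLS token pays to them.
    order = sorted(range(len(tokens)), key=lambda i: cls_attn[i], reverse=True)
    kept_idx = sorted(order[:keep])  # restore original spatial order
    rest_idx = order[keep:]

    kept = [tokens[i] for i in kept_idx]
    if not rest_idx:
        return kept, kept_idx

    # Fuse inattentive tokens with an attention-weighted average
    # (assumed fusion rule; falls back to a plain mean if all weights are 0).
    weights = [cls_attn[i] for i in rest_idx]
    total = sum(weights)
    if total == 0:
        weights = [1.0] * len(rest_idx)
        total = float(len(rest_idx))
    dim = len(tokens[0])
    fused = [
        sum(w * tokens[i][d] for w, i in zip(weights, rest_idx)) / total
        for d in range(dim)
    ]
    return kept + [fused], kept_idx
```

With four tokens and `keep=2`, the result has three tokens: the two most attended ones plus a single fused token that summarizes the other two.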
5. A-ViT: Adaptive Tokens for Efficient Vision Transformer
Authors: Hongxu Yin et al. (NVIDIA)
Venue: CVPR 2022 (Oral)
Model type: ViT
Core method: Adaptively adjusts the number of tokens per image based on input complexity using a halting mechanism inspired by Adaptive Computation Time (ACT)
Key innovation: Per-token halting score that automatically reduces tokens for simpler inputs; no need for predefined pruning ratios
Benchmarks: Improves throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3% accuracy drop on ImageNet
ArXiv: 2112.07658
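The ACT-style halting idea in A-ViT can be sketched as follows. This is a simplified illustration rather than the paper's code: the name `halting_depths`, the input layout, and the threshold `eps` are assumptions. Each token accumulates its per-layer halting score and stops being processed once the running sum reaches `1 - eps`.

```python
def halting_depths(halting_scores, eps=0.01):
    """Return, per token, the number of layers it passes through before halting.

    halting_scores[l][k] is the halting score of token k at layer l
    (hypothetical layout; values are assumed to lie in [0, 1]).
    """
    num_layers = len(halting_scores)
    num_tokens = len(halting_scores[0])
    depths = []
    for k in range(num_tokens):
        cumulative = 0.0
        depth = num_layers  # tokens that never halt use every layer
        for layer in range(num_layers):
            cumulative += halting_scores[layer][k]
            if cumulative >= 1.0 - eps:
                depth = layer + 1
                break
        depths.append(depth)
    return depths
```

Tokens with large halting scores stop early and tokens with small scores run to the last layer, so the token count adapts per input without a fixed pruning ratio.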
6. Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer
Authors: Yifan Xu et al.
Venue: AAAI 2022
Model type: ViT (DeiT, LeViT)
Core method: Separates tokens into informative (slow) and less informative (fast) groups; slow tokens go through full computation, fast tokens are updated with a lightweight mechanism
Key innovation: Slow-fast token evolution preserves global information flow while reducing computation
Benchmarks: 40%-60% throughput improvement on DeiT; further accelerates LeViT
Core method: Soft token pruning with latency-aware optimization; prunes tokens by learning a soft mask and considers real hardware latency
Key innovation: Latency-aware pruning applicable to both flat (DeiT) and hierarchical (Swin) architectures; optimizes for actual speed rather than FLOPs
ArXiv: 2112.13890
8. ATS: Adaptive Token Sampling for Efficient Vision Transformers
Authors: Mohsen Fayyaz et al.
Venue: ECCV 2022 (Oral)
Model type: ViT
Core method: Differentiable parameter-free module that scores tokens and adaptively samples significant ones; can be plugged into any ViT
Key innovation: Parameter-free adaptive sampling based on inverse transform sampling; no additional learnable parameters
Benchmarks: 2x GFLOPs reduction while preserving accuracy on ImageNet, Kinetics-400/600
ArXiv: 2111.15667
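Inverse transform sampling over token scores, as named in the ATS entry, can be sketched in a few lines. The function name `adaptive_token_sample` and the use of evenly spaced sampling points are assumptions for illustration; the scoring of tokens is taken as given.

```python
from bisect import bisect_left
from itertools import accumulate


def adaptive_token_sample(scores, k):
    """Select token indices by inverse transform sampling.

    scores: non-negative significance score per token (at least one > 0)
    k:      maximum number of tokens to sample
    Tokens hit by several sampling points are kept only once, so the
    number of returned tokens adapts to the score distribution.
    """
    total = sum(scores)
    cdf = list(accumulate(s / total for s in scores))
    # Evenly spaced points in (0, 1); an assumed deterministic choice.
    points = [(2 * i + 1) / (2 * k) for i in range(k)]
    picked = {min(bisect_left(cdf, u), len(scores) - 1) for u in points}
    return sorted(picked)
```

When one token dominates the scores, most sampling points map to it and fewer than `k` distinct tokens are returned; with uniform scores all `k` points map to different tokens.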
9. Token Merging (ToMe): Your ViT But Faster
Authors: Daniel Bolya et al. (Meta/Facebook Research)
Venue: ICLR 2023 (Oral)
Model type: ViT
Core method: Merges similar tokens using bipartite soft matching on key similarity (cosine similarity of K vectors); combines rather than prunes
Key innovation: Training-free token merging that preserves information by combining similar tokens; applicable off-the-shelf to any ViT
Benchmarks: 2x throughput on ViT-L@512, ViT-H@518 with only 0.2-0.3% accuracy drop; 2.2x on video ViT-L
ArXiv: 2210.09461
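A minimal sketch of bipartite soft matching on key similarity, in plain Python, is given below. It simplifies the method described above: tokens are split into two alternating sets, each merged token counts with weight one (no token-size tracking), and the names `cosine` and `bipartite_soft_matching` are illustrative, not the library's API.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def bipartite_soft_matching(tokens, keys, r):
    """Merge the r most similar (A -> B) token pairs by averaging."""
    if len(tokens) < 2 or r <= 0:
        return [list(t) for t in tokens]
    a_idx = list(range(0, len(tokens), 2))  # set A: even positions
    b_idx = list(range(1, len(tokens), 2))  # set B: odd positions

    # One edge per A token, to its most similar B token.
    edges = []
    for i in a_idx:
        best_sim, best_j = max((cosine(keys[i], keys[j]), j) for j in b_idx)
        edges.append((best_sim, i, best_j))
    edges.sort(key=lambda e: e[0], reverse=True)
    merged_edges = edges[:r]
    merged_a = {i for _, i, _ in merged_edges}

    # Average each B token with the A tokens merged into it.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in merged_edges:
        sums[j] = [s + t for s, t in zip(sums[j], tokens[i])]
        counts[j] += 1

    unmerged = [list(tokens[i]) for i in a_idx if i not in merged_a]
    merged = [[s / counts[j] for s in sums[j]] for j in b_idx]
    return unmerged + merged
```

Each call removes at most `r` tokens, and no learned parameters are involved, which is consistent with the entry's description of the method as training-free.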
10. Joint Token Pruning and Squeezing (TPS)
Authors: Siyuan Wei et al.
Venue: CVPR 2023
Model type: ViT
Core method: Combines pruning with “squeezing” — pruned tokens’ information is injected into retained tokens via unidirectional nearest-neighbor matching and similarity-oriented fusing
Key innovation: Recovers information from pruned tokens by squeezing their content into similar retained tokens, enabling more aggressive compression
Benchmarks: More aggressive compression than pure pruning or merging with comparable accuracy
ArXiv: 2304.10716
11. DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
Authors: Mengzhao Chen et al.
Venue: ICCV 2023
Model type: ViT
Core method: Makes compression ratio differentiable; jointly optimizes token pruning and merging decisions and their ratios per layer via gradient descent
Key innovation: First method to make the compression rate itself a learnable parameter optimized end-to-end
Benchmarks: Outperforms ToMe and other methods at the same FLOPs budget on ImageNet
ArXiv: 2305.17997
12. Making Vision Transformers Efficient from A Token Sparsification View
Authors: Shuning Chang et al.
Venue: CVPR 2023
Model type: ViT
Core method: Comprehensive analysis and improved token sparsification strategy
Key innovation: Unified view of token sparsification methods with improved strategy
13. Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph
Authors: Hongjie Wang et al. (Princeton)
Venue: CVPR 2024
Model type: ViT
Core method: Uses Weighted Page Rank (WPR) on the attention graph to compute token importance; combines importance and similarity for pruning decisions
Key innovation: First zero-shot method that jointly considers token importance (via PageRank on attention) and similarity; no training or fine-tuning needed
ArXiv: 2305.17328
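The idea of ranking tokens with PageRank on the attention graph can be sketched with a simple power iteration. This is an illustrative simplification, not the paper's algorithm: the function name, the fixed iteration count, and the per-step normalization are assumptions, and the similarity-based stage is omitted.

```python
def attention_pagerank(attn, iters=50):
    """Importance scores from a row-stochastic attention matrix.

    attn[i][j] is the attention token i pays to token j. A token is
    important if important tokens attend to it, so scores are propagated
    along attention edges until they stabilize.
    """
    n = len(attn)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(attn[i][j] * scores[i] for i in range(n)) for j in range(n)]
        total = sum(new)
        scores = [v / total for v in new]  # keep scores summing to 1
    return scores
```

Pruning would then drop the lowest-scoring tokens; because the scores come only from a pretrained model's attention, no training or fine-tuning is involved.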
14. Token Compensator: Altering Inference Cost of Vision Transformer Without Re-tuning
Authors: Shibo Jie et al.
Venue: ECCV 2024
Model type: ViT
Core method: Compensates for information loss from token reduction by learning a lightweight compensator module
Key innovation: Allows changing the compression ratio at inference time without retraining
15. Agglomerative Token Clustering
Authors: (Various)
Venue: ECCV 2024
Model type: ViT
Core method: Hierarchical agglomerative clustering of tokens based on feature similarity
Key innovation: Bottom-up clustering approach for token merging
16. Exploring Token Pruning in Vision State Space Models
Authors: (Various)
Venue: NeurIPS 2024
Model type: Vision SSM (Mamba-based)
Core method: Investigates token pruning in state space models; finds naive application of ViT pruning methods fails
Key innovation: First study of token pruning for vision state space models; proposes SSM-specific pruning strategies
Model type: Vision-Language Transformer (BLIP, BLIP-2)
Core method: Multi-modality Alignment Guidance (MAG) aligns features across modalities; Dynamic Token Pruning (DTP) adaptively adjusts compression ratio per layer per instance
Key innovation: First dynamic token pruning for VL transformers that considers cross-modal alignment; instance-adaptive pruning ratios
Benchmarks: 80% GFLOPs reduction on BLIP with <4% performance drop on NLVR2
ArXiv: 2403.02991
20. FastV: An Image is Worth 1/2 Tokens After Layer 2
Authors: Liang Chen et al.
Venue: ECCV 2024 (Oral)
Model type: Large Vision-Language Model (LLaVA)
Core method: Prunes visual tokens in early LLM layers based on attention scores; plug-and-play approach
Key innovation: Observes that attention to visual tokens becomes sparse after layer 2 in LVLMs; first to use the LLM’s own attention signal for visual token pruning
Benchmarks: 50% visual tokens pruned after layer 2 without accuracy loss on LLaVA-1.5-13B; 45% FLOPs reduction
ArXiv: 2403.06764
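Attention-based pruning of visual tokens at an early LLM layer can be sketched as below. The function name `fastv_keep`, the flat index layout, and the default ratio are assumptions for illustration; the attention each token receives is taken as an input.

```python
def fastv_keep(attn_received, visual_idx, ratio=0.5):
    """Return the sequence positions kept after pruning visual tokens.

    attn_received: average attention each position receives at the chosen layer
    visual_idx:    positions that hold visual tokens
    ratio:         fraction of visual tokens to drop
    Text tokens are never pruned.
    """
    visual = set(visual_idx)
    n_keep = len(visual_idx) - int(len(visual_idx) * ratio)
    ranked = sorted(visual_idx, key=lambda i: attn_received[i], reverse=True)
    keep_visual = set(ranked[:n_keep])
    return [
        i for i in range(len(attn_received))
        if i not in visual or i in keep_visual
    ]
```

Later layers would then run only on the returned positions, which is where the compute saving comes from.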
21. IVTP: Instruction-Guided Visual Token Pruning for Large Vision-Language Models
Authors: Kai Huang et al.
Venue: ECCV 2024
Model type: Large Vision-Language Model
Core method: Uses instruction/text query to guide visual token importance estimation via Group-wise Token Pruning (GTP) with attention rollout
Key innovation: Instruction-aware pruning that considers the task/question when deciding which visual tokens to keep
22. Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Large Models
Authors: Chen Ju et al.
Venue: ECCV 2024 (Oral)
Model type: Vision-Language Model
Core method: Prunes tokens based on “information degree” that combines mutual redundancy (data duplication between sequential tokens) and semantic value (each token’s contribution to overall semantics)
Key innovation: Dual-criterion informativity metric; applicable as plug-in to both visual and textual tokens
ArXiv: 2407.11717
23. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Authors: Yuzhang Shang et al.
Venue: ICCV 2025
Model type: Large Multimodal Model (LLaVA)
Core method: Exploits sparsity in CLIP visual encoder attention to identify crucial tokens; prunes less important tokens then merges pruned token information into retained tokens via key-similarity clustering
Key innovation: Adaptive pruning ratio via outlier detection on attention scores; combines pruning + merging for multimodal models
Benchmarks: 18x average visual token compression with comparable VQA/reasoning performance
ArXiv: 2403.15388
24. Dynamic-LLaVA: Efficient Multimodal LLMs via Dynamic Vision-Language Context Sparsification
Authors: (Various)
Venue: ICLR 2025
Model type: Multimodal LLM (LLaVA)
Core method: Dynamically reduces vision context redundancy in prefill stage and language context overhead during decoding
Key innovation: Joint vision-language token sparsification; handles both modalities
Benchmarks: ~50% computation reduction in decoding; ~50% GPU memory savings with KV cache
25. ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
Authors: Xubing Ye et al.
Venue: CVPR 2025
Model type: Large Vision-Language Model
Core method: Adaptive Token Pruning (ATP) module computes instance-specific importance scores and pruning thresholds per LLM layer; Spatial Augmented Pruning (SAP) considers both redundancy and spatial relationships
33. LLMLingua: Compressing Prompts for Accelerated Inference of LLMs
Authors: Huiqiang Jiang et al. (Microsoft)
Venue: EMNLP 2023
Model type: LLM (general)
Core method: Coarse-to-fine prompt compression with budget controller, token-level iterative compression, and instruction tuning for distribution alignment
Key innovation: First systematic prompt compression framework using a small LM to identify removable tokens; up to 20x compression
ArXiv: 2310.05736
34. Learning to Compress Prompts with Gist Tokens
Authors: Jesse Mu et al.
Venue: NeurIPS 2023
Model type: LLM
Core method: Trains LM to compress prompts into smaller sets of “gist” tokens that can be cached and reused
Key innovation: Learned compression into reusable gist tokens; up to 26x prompt compression
ArXiv: 2304.08467
35. Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Authors: (Various)
Venue: NeurIPS 2023
Model type: LLM
Core method: Dynamically prunes context during autoregressive generation
Key innovation: Per-head, per-step context pruning that is also interpretable
36. In-context Autoencoder for Context Compression in a Large Language Model
Authors: (Various)
Venue: ICLR 2024
Model type: LLM
Core method: Compresses long context into compact representation using an in-context autoencoder
Key innovation: Autoencoder-based context compression within the LLM framework
37. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
Authors: Huiqiang Jiang et al. (Microsoft)
Venue: ACL 2024
Model type: LLM
Core method: Extends LLMLingua for long-context scenarios; mitigates “lost in the middle” problem
Key innovation: Question-aware prompt compression that prioritizes tokens relevant to the query
ArXiv: 2310.06839
38. LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Authors: Qichen Fu et al. (Apple)
Venue: ICML 2024 Workshop (ES-FoMo)
Model type: LLM (LLaMA 2)
Core method: Selectively computes KV for tokens important for next token prediction; dynamically selects different token subsets at each generation step
Key innovation: Training-free, universal dynamic token selection that allows previously pruned tokens to be reconsidered in later steps
Benchmarks: 2.34x prefilling acceleration on LLaMA 2 7B for multi-document QA
ArXiv: 2407.14057
39. MrT5: Dynamic Token Merging for Efficient Byte-level Language Models
Authors: Julie Kallini et al.
Venue: ICLR 2025
Model type: Byte-level LM (ByT5)
Core method: Integrates a learned delete gate in the encoder to dynamically shorten byte sequences; gate determines which tokens to remove
Key innovation: Applies token merging to byte-level models, reducing the fundamental tokenization overhead
Benchmarks: Comparable accuracy to ByT5 while reducing sequence lengths by up to 75% on XNLI, TyDi QA
ArXiv: 2410.20771
Category 4: KV Cache Compression/Pruning for LLMs
40. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of LLMs
Authors: Zhenyu Zhang et al.
Venue: NeurIPS 2023
Model type: LLM (OPT)
Core method: KV cache eviction policy that dynamically retains “heavy hitter” tokens (those with high accumulated attention scores) plus recent tokens
Key innovation: Discovery that attention scores follow power-law distribution; small set of heavy-hitter tokens dominate attention computation
Benchmarks: With 20% heavy hitters, up to 29x throughput improvement over DeepSpeed Zero-Inference
ArXiv: 2306.14048
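The eviction policy described for H2O (heavy hitters plus recent tokens) can be sketched as a single selection step. The function name `h2o_keep` and the split of the budget into `num_heavy` and `num_recent` are assumptions for illustration.

```python
def h2o_keep(acc_attn, num_heavy, num_recent):
    """Choose which cached positions to retain.

    acc_attn:   accumulated attention score per cached position
    num_heavy:  number of heavy-hitter positions to keep
    num_recent: number of most recent positions to keep
    """
    n = len(acc_attn)
    recent = set(range(max(0, n - num_recent), n))
    older = [i for i in range(n) if i not in recent]
    # Heavy hitters: older positions with the highest accumulated attention.
    heavy = sorted(older, key=lambda i: acc_attn[i], reverse=True)[:num_heavy]
    return sorted(recent | set(heavy))
```

In a decoding loop this would run after each step, once the new attention row has been added to `acc_attn`, so the cache size stays at `num_heavy + num_recent`.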
41. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression
Authors: Zichang Liu et al.
Venue: NeurIPS 2023
Model type: LLM
Core method: Based on “persistence of importance” hypothesis — tokens important at one step remain important; maintains fixed KV cache budget by evicting non-pivotal tokens
Key innovation: Observation that token importance persists across generation steps; up to 5x KV cache memory reduction without fine-tuning
ArXiv: 2305.17118
42. StreamingLLM: Efficient Streaming Language Models with Attention Sinks
Authors: Guangxuan Xiao et al. (MIT Han Lab)
Venue: ICLR 2024
Model type: LLM (LLaMA-2, MPT, Falcon, Pythia)
Core method: Retains KV cache of initial “attention sink” tokens plus a sliding window of recent tokens; discards intermediate tokens
Key innovation: Discovery of “attention sink” phenomenon — initial tokens receive disproportionate attention regardless of semantic content; enables infinite-length generation
Benchmarks: Stable language modeling with up to 4M+ tokens on models trained with finite windows; no fine-tuning needed
ArXiv: 2309.17453
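The cache layout described above (attention sinks plus a sliding window) reduces to a small index computation. The sketch below uses hypothetical names and an arbitrary default window size.

```python
def streaming_keep(seq_len, num_sinks=4, window=1020):
    """Positions whose KV entries are retained for a sequence of seq_len tokens.

    Keeps the first `num_sinks` tokens (attention sinks) and the last
    `window` tokens; everything in between is discarded.
    """
    if seq_len <= num_sinks + window:
        return list(range(seq_len))
    return list(range(num_sinks)) + list(range(seq_len - window, seq_len))
```

The cache never holds more than `num_sinks + window` entries, regardless of how long generation runs.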
43. Model Tells You What to Discard (FastGen): Adaptive KV Cache Compression for LLMs
Authors: Suyu Ge et al.
Venue: ICLR 2024 (Oral)
Model type: LLM (LLaMA)
Core method: Profiles attention head patterns and applies per-head adaptive compression strategies: evicts long-range on local heads, discards non-special on special-token heads, keeps full cache on broad-attention heads
Key innovation: Per-head adaptive compression policies based on attention structure profiling; no fine-tuning
Benchmarks: ~40% memory reduction on LLaMA-65B with negligible quality loss
ArXiv: 2310.01801
44. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Authors: Zirui Liu et al.
Venue: ICML 2024
Model type: LLM (LLaMA-2, Falcon, Mistral)
Core method: Asymmetric quantization: quantizes Key cache per-channel and Value cache per-token to 2-bit precision
Key innovation: Exploits different distribution patterns of keys (per-channel outliers) vs. values (per-token outliers) for asymmetric quantization strategy
Benchmarks: 2.6x less peak memory, up to 4x larger batch size, 2.35x-3.47x throughput improvement
ArXiv: 2402.02750
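The per-channel versus per-token grouping can be made concrete with a small pure-Python sketch of asymmetric min/max quantization. It is a simplification under stated assumptions: all function names are illustrative, each group spans a full column or row, and details such as group sizes or keeping recent tokens in full precision are left out.

```python
def quantize_group(xs, bits=2):
    """Asymmetric min/max quantization of one group of floats."""
    lo, hi = min(xs), max(xs)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [round((x - lo) / scale) for x in xs]
    return codes, lo, scale


def dequantize_group(codes, lo, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]


def quantize_keys_per_channel(K, bits=2):
    """K is tokens x channels; each channel (column) forms one group."""
    return [quantize_group([row[c] for row in K], bits) for c in range(len(K[0]))]


def quantize_values_per_token(V, bits=2):
    """V is tokens x channels; each token (row) forms one group."""
    return [quantize_group(row, bits) for row in V]
```

Grouping keys by channel keeps a channel with large-magnitude outliers from inflating the quantization range of the other channels, while values are grouped by token.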
45. SnapKV: LLM Knows What You Are Looking for Before Generation
Authors: Yuhong Li et al.
Venue: NeurIPS 2024
Model type: LLM
Core method: Compresses KV cache by selecting/clustering significant KV positions based on attention patterns observed in a small observation window at the end of the prompt
Key innovation: Discovery that attention patterns in the prompt’s final window predict which KV entries will be important during generation; clustering-based compression
ArXiv: 2404.14469
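Selecting KV positions from an observation window can be sketched as a vote followed by local pooling. The function name `snapkv_select`, the max-pooling kernel used as the clustering step, and the input layout are assumptions for illustration.

```python
def snapkv_select(window_attn, prefix_len, budget, kernel=3):
    """Pick prefix positions to keep, based on observation-window attention.

    window_attn: one row per query in the observation window; row[j] is the
                 attention that query pays to prefix position j
    budget:      number of prefix positions to keep
    kernel:      width of the 1D max pooling that keeps neighbors of a
                 strongly attended position together (assumed clustering step)
    """
    # Each window query votes for prefix positions with its attention weights.
    votes = [sum(row[j] for row in window_attn) for j in range(prefix_len)]
    half = kernel // 2
    pooled = [
        max(votes[max(0, j - half): j + half + 1]) for j in range(prefix_len)
    ]
    top = sorted(range(prefix_len), key=lambda j: pooled[j], reverse=True)
    return sorted(top[:budget])
```

The KV entries at the returned positions, together with those of the observation window itself, would form the compressed cache used during generation.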
46. KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Benchmarks: <0.1 perplexity degradation at 3-bit on LLaMA-7B; 4.8x memory reduction
ArXiv: 2401.18079
47. MiniCache: KV Cache Compression in Depth Dimension for LLMs
Authors: Akide Liu et al.
Venue: NeurIPS 2024
Model type: LLM
Core method: Compresses KV cache across layers (depth dimension) by exploiting high similarity between adjacent layers’ KV states; disentangles into magnitude and direction, interpolates directions
Core method: Single trainable decoder layer operates at the feature level of the target model; autoregressive generation at feature level during drafting
Key innovation: Feature-level rather than token-level draft generation; better captures uncertainty
ArXiv: 2401.15077
Category 6: Token Pruning for Diffusion Transformers
56. Token Merging for Fast Stable Diffusion
Authors: Daniel Bolya et al. (Meta)
Venue: CVPR 2023 Workshop
Model type: Diffusion model (Stable Diffusion)
Core method: Extends ToMe to U-Net based diffusion models; merges similar tokens in self-attention layers
Key innovation: Adapts token merging from classification ViTs to generative diffusion models
57. Dynamic Diffusion Transformer (DyDiT)
Authors: (Various)
Venue: ICLR 2025
Model type: Diffusion Transformer (DiT)
Core method: Dynamic token pruning and channel selection for DiT; adapts computation based on timestep and spatial complexity
Key innovation: Timestep-aware dynamic architecture for diffusion transformers
Benchmarks: 51% FLOPs reduction, 1.73x generation acceleration on ImageNet
58. FreqTS: Frequency-Aware Token Selection for Accelerating Diffusion Models
Authors: (Various)
Venue: AAAI 2025
Model type: Diffusion model
Core method: Frequency-domain analysis for token importance; selects tokens based on frequency characteristics
Key innovation: Frequency-aware criteria for token selection in diffusion generation
Category 7: NLP Token Reduction (Non-KV-Cache)
59. TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Authors: (Various)
Venue: EMNLP 2023
Model type: Video-Language model
Core method: Aggregates tokens along both temporal and spatial dimensions for long videos
Key innovation: Joint temporal-spatial aggregation strategy for video understanding
60. Efficient Transformers with Dynamic Token Pooling
Authors: (Various)
Venue: ACL 2023
Model type: Transformer (NLP)
Core method: Dynamic pooling of tokens during processing based on learned pooling decisions
Key innovation: Adaptive token pooling for NLP transformers
61. Revisiting Token Dropping Strategy in Efficient BERT Pretraining
Authors: (Various)
Venue: ACL 2023
Model type: BERT
Core method: Analyzes and improves token dropping strategies during BERT pretraining
Key innovation: Improved understanding of when and how to drop tokens during pretraining
62. TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Authors: (Various)
Venue: EMNLP 2025
Model type: LLM (reasoning)
Core method: Compresses chain-of-thought reasoning by skipping intermediate tokens while preserving reasoning quality
Key innovation: Controllable compression of reasoning traces; balances efficiency vs. reasoning depth