CDPruner

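Two caveats first:

AdaCM2 is more accurately described as query-aware memory reduction, not just ordinary patch pruning.

VLTP targets task-oriented segmentation rather than general-purpose VQA/chat MLLMs, but it fits well in the "dedicated query-aware module" category. (CVF Open Access)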

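Paper	Query signal	Signal source type	Where it acts / where pruning happens	Granularity	Training-free / training regime	One-line summary
SparseVLM	First selects the text tokens relevant to the visual input, then scores visual tokens with the decoder's visual-text self-attention	Internal attention, read directly	Inside multiple LLM decoder layers, with layer-by-layer progressive sparsification	image/video visual tokens	training-free	Uses the model's own attention for text-guided layer-wise pruning, plus token recycling. (arXiv)
PruneVid	Uses question-to-video attention to estimate each token's relevance to the question	Internal attention, read directly	Spatio-temporal merging first, then pruning at intermediate LLM layers / during prefill	video tokens, both spatial and temporal	training-free	"Remove redundancy first, then filter a second time with intermediate-layer question-to-video attention." (arXiv)
FlexSelect	Reads token relevance from the cross-modal attention of a reference transformer layer	Internal attention, read directly	Filters tokens before the expensive inference pass; the core evidence is the reference layer	long-video video tokens	The core ranking is training-free; a lightweight selector can additionally be trained with supervision to reproduce that ranking	The key question is not whether attention is used but which layer's attention is most trustworthy. (arXiv)
DyToK	Extracts a query-conditioned keyframe prior from the VLLM's internal attention and dynamically allocates a retention ratio per frame	Internal attention, read directly	Mainly frame-level token budget allocation, combined with an underlying compressor that prunes tokens within each frame	frame-level budget + intra-frame token compression	training-free	Closer to "query-aware dynamic per-frame quotas" than to direct per-patch ranking. (arXiv)
AdaCM2	Uses cross-modality attention to measure how relevant visual tokens are to the text prompt	Internal attention / cross-modal module	At the Q-Former / video cache memory reduction stage, compressing memory layer by layer	video cache / memory tokens	The paper does not present it as a training-free plug-in; it is a purpose-built memory reduction framework	More accurately query-aware memory pruning: each layer keeps the visual memory most relevant to the text. (CVF Open Access)
LVPruning	Inserts cross-attention decision modules in which vision tokens attend to language tokens to compute importance	Dedicated query-aware module	Inserted into multiple LLM layers for progressive pruning	vision tokens	The inserted decision modules must be trained; the original model stays frozen	Unlike SparseVLM, it does not read the existing attention but explicitly learns a language-guided pruner.
VLTP	The MLLM first generates a SEG token / reasoning guidance, which the prune decoder uses to predict token relevance	Dedicated query-aware module	Inserted into multiple ViT layers, with multi-stage pruning and reactivation	image patch tokens	The prune decoder must be trained and can be trained jointly with the mask decoder	The representative example of moving query guidance forward into the visual backbone for pruning. (arXiv)
HICom	Injects the instruction directly into the compression process as a condition	Dedicated query-aware module	hybrid-level: injected locally into grouped visual tokens and globally into learnable tokens	video tokens	Not training-free; also uses conditional pre-training	Rather than reading query relevance off attention, it feeds the instruction explicitly into the compression module. (arXiv)
TRIM	Scores visual tokens with external CLIP text-image similarity, then selects tokens with an IQR / outlier rule	External signal	Token reduction before the LLM	image tokens	training-free	Does not rely on the MLLM's internal attention; uses external CLIP for query-aware pre-filtering. (arXiv)
CDPruner	Combines instruction relevance and token similarity into conditional diversity, then selects a subset with a determinantal point process (DPP)	Structured optimization / external modeling	Mostly subset selection before the LLM, at the visual-embedding level	image/video visual tokens	training-free, model-agnostic	Casts query-aware pruning as a conditional subset selection problem instead of reading attention directly. (arXiv)
D-CoDe	First decomposes the original question into sub-questions, which then guide the compression process	Structured query rewriting	Inference-level control wrapped around compression, combined with dynamic frame/spatial compression	representative frames + spatial tokens	training-free	Its most distinctive trait: rewrite the query first, and thereby indirectly shift what compression focuses on. (arXiv)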

In a survey, you can split these methods directly into three categories

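1. Read internal attention directly

SparseVLM / PruneVid / FlexSelect / DyToK / AdaCM2

What these methods share:
query relevance is "read off" the cross-modal attention that the model has already formed internally.
They differ mainly in what they read:

SparseVLM and PruneVid are closer to token-level relevance ranking;
FlexSelect emphasizes which layer's attention is most trustworthy;
DyToK leans toward frame-level allocation;
AdaCM2 leans toward memory/cache reduction. (arXiv)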
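
To make the "read it off the attention" idea concrete, here is a minimal sketch in plain Python. It is not the procedure of any paper listed above. The layout `attn[i][j]` (attention from text token i to visual token j), the function names, and the proportional frame-budget rule are all assumptions chosen for illustration. Real methods differ in which layer and heads they read and in how they normalize.

```python
from typing import List


def visual_token_scores(attn: List[List[float]]) -> List[float]:
    """Average the attention each visual token receives from the text tokens.

    attn[i][j] is assumed to be the weight from text token i to visual token j.
    """
    num_text = len(attn)
    num_visual = len(attn[0])
    return [sum(attn[i][j] for i in range(num_text)) / num_text
            for j in range(num_visual)]


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Token-level ranking: indices of the k best tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def frame_budgets(scores: List[float], tokens_per_frame: int,
                  total_budget: int) -> List[int]:
    """Frame-level allocation: split the budget in proportion to each frame's
    summed relevance (hypothetical rule; assumes scores sum to a positive value).
    """
    frame_mass = [sum(scores[s:s + tokens_per_frame])
                  for s in range(0, len(scores), tokens_per_frame)]
    total = sum(frame_mass)
    shares = [total_budget * m / total for m in frame_mass]
    budgets = [int(s) for s in shares]
    # Hand the slots lost to rounding to the frames with the largest remainders.
    leftover = total_budget - sum(budgets)
    by_remainder = sorted(range(len(shares)),
                          key=lambda f: shares[f] - budgets[f], reverse=True)
    for f in by_remainder[:leftover]:
        budgets[f] += 1
    return budgets


# Two text tokens attending to four visual tokens (two frames of two tokens).
attn = [[0.1, 0.5, 0.3, 0.1],
        [0.2, 0.6, 0.1, 0.1]]
scores = visual_token_scores(attn)
print(keep_top_k(scores, 2))        # [1, 2]
print(frame_budgets(scores, 2, 3))  # [2, 1]
```

The two helpers mirror the distinction drawn above: `keep_top_k` ranks individual tokens, while `frame_budgets` only decides how many tokens each frame may keep and leaves the within-frame choice to some other compressor.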

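2. Add a dedicated query-aware module

LVPruning / VLTP / HICom

These methods do not read the original model's attention. They explicitly design a conditioning module so that the query / instruction takes part directly in the keep-or-drop decision.
The conditioning is therefore usually stronger, but training is more often required, so plug-in compatibility and training cost have to be weighed separately.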
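
As a rough illustration of what such a conditioning module computes, the toy sketch below lets each vision token attend over the language tokens and then maps the result to a keep probability. It is not the architecture of LVPruning, VLTP, or HICom. The elementwise-product feature and the parameters `head_w` and `head_b` are hypothetical, and they are hand-set here where a real module would learn them.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def keep_probabilities(vision: List[List[float]], language: List[List[float]],
                       head_w: List[float], head_b: float) -> List[float]:
    """Return a keep probability per vision token, conditioned on the query.

    Each vision token attends to the language tokens (scaled dot-product),
    the attended context is multiplied elementwise with the token, and a
    linear head plus sigmoid turns that feature into a probability.
    head_w and head_b stand in for trained parameters (assumption).
    """
    dim = len(vision[0])
    probs = []
    for v in vision:
        weights = softmax([dot(v, t) / math.sqrt(dim) for t in language])
        context = [sum(w * t[d] for w, t in zip(weights, language))
                   for d in range(dim)]
        feature = [v[d] * context[d] for d in range(dim)]
        logit = dot(head_w, feature) + head_b
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# One language token pointing along the first axis; the first vision token
# aligns with it and the second does not.
vision = [[2.0, 0.0], [0.0, 2.0]]
language = [[1.0, 0.0]]
probs = keep_probabilities(vision, language, head_w=[1.0, 1.0], head_b=-1.0)
keep_mask = [p >= 0.5 for p in probs]
print(keep_mask)  # [True, False]
```

The point of the sketch is the dependency structure: the decision is a function of both the token and the instruction, and the parameters that shape it have to come from training, which is the cost noted above.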

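3. External signals / structured approaches

TRIM / CDPruner / D-CoDe

These methods treat query-awareness as a broader problem:

TRIM: external CLIP relevance;
CDPruner: conditional diversity optimization;
D-CoDe: decompose the query into sub-questions first, then use them to guide compression.
In other words, they do not treat the attention score as the only answer. (arXiv)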
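
To show how a subset-selection view differs from plain top-k by relevance, here is a small sketch of relevance-weighted diversity selection. It is a naive greedy determinant maximization written for readability, not CDPruner's actual implementation. The kernel form `L[i][j] = r_i * cos(f_i, f_j) * r_j`, the function names, and the toy inputs are assumptions for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def conditional_kernel(features: List[List[float]],
                       relevance: List[float]) -> List[List[float]]:
    """Similarity kernel scaled on both sides by per-token query relevance."""
    n = len(features)
    return [[relevance[i] * cosine(features[i], features[j]) * relevance[j]
             for j in range(n)] for i in range(n)]


def determinant(m: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det


def greedy_select(kernel: List[List[float]], k: int) -> List[int]:
    """Greedily add the token that maximizes the determinant of the selected
    sub-kernel, which rewards relevant tokens that are not near-duplicates."""
    selected: List[int] = []
    for _ in range(k):
        best, best_det = -1, -1.0
        for c in range(len(kernel)):
            if c in selected:
                continue
            idx = selected + [c]
            sub = [[kernel[i][j] for j in idx] for i in idx]
            d = determinant(sub)
            if d > best_det:
                best, best_det = c, d
        selected.append(best)
    return sorted(selected)


# Tokens 0 and 1 are near-duplicates; token 2 is less relevant but different.
features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
relevance = [0.9, 0.8, 0.5]
print(greedy_select(conditional_kernel(features, relevance), 2))  # [0, 2]
```

Top-2 by relevance alone would keep tokens 0 and 1. The determinant objective instead keeps 0 and 2, because the near-duplicate pair contributes almost no additional volume.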

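The most useful takeaway from this table for your own method

If you later want to design your own query-aware video pruning, you can pick one of these three approaches as the main line:

Least effort, closest to a training-free plug-in: follow PruneVid / FlexSelect / DyToK and read internal attention directly. (arXiv)
The most "orthodox" query-conditioned pruning: follow LVPruning / HICom and build a separate query-aware selector.
Easiest route to methodological novelty: follow CDPruner / D-CoDe and turn the query into a structured constraint rather than a single attention score. (arXiv)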

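I can also condense this into a leaner table better suited to a paper's related-work section, keeping only: method | query signal | pruning location | training-free or not | defining trait.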
