VisionZip
Research background:
- “Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs”
- Question raised: “Are all visual tokens necessary?” (Yang et al., 2024, p. 1)
- Observation: visual tokens contain a large amount of redundancy
- Existing VLMs: convert the image into vision tokens, which the LLM decoder then processes
VisionZip:
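The core idea is to identify and keep the most informative visual tokens. The selection is text-agnostic: it does not depend on the text prompt.
“reducing visual token redundancy and improving efficiency while maintaining model performance.”
Overall framework: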
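Dominant Token Selection: from the visual encoder's output, select the most informative tokens. The attention score of each head is
$$S_h = \mathrm{Softmax}\left(\frac{Q_h K_h^{\top}}{\sqrt{D_h}}\right)$$
where $S_h$ is the attention score of each head, $D_h$ is the head dimension, and $Q_h$ and $K_h$ represent query and key respectively.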
output = vision_tower(images, output_hidden_states=True, output_attentions=True)
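# Extract the features and the attention matrix of an intermediate layer
attn = output.attentions[SELECT_LAYER]
vanilla_tokens = output.hidden_states[SELECT_LAYER]
# attn has shape (B, H, S, S): B is the batch size, H the number of attention heads, S the sequence length; vanilla_tokens holds the token representations of the same layer
attn_rec = attn[:, :, cls_idx, cls_idx+1:].sum(dim=1)
# attn[b, h, i, j] is the attention weight that token i pays to token j in head h of sample b, so attn_rec sums over heads the attention that the CLS token pays to every token after it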
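As a minimal, framework-free sketch of the selection step above: the functions below sum, over heads, the attention that the CLS token pays to each later token and keep the k highest-scoring tokens. They work on plain Python lists for a single sample rather than on tensors. The names `cls_attention_received`, `select_dominant`, and `k` are illustrative assumptions, not identifiers from the VisionZip code.

```python
def cls_attention_received(attn, cls_idx=0):
    """attn: nested lists of shape (H, S, S) for one sample.

    Returns one score per token after the CLS token: the attention the CLS
    token pays to that token, summed over all heads.
    """
    seq_len = len(attn[0])
    scores = [0.0] * (seq_len - cls_idx - 1)
    for head in attn:
        cls_row = head[cls_idx]  # attention weights with CLS as the query
        for offset, weight in enumerate(cls_row[cls_idx + 1:]):
            scores[offset] += weight
    return scores


def select_dominant(attn, k, cls_idx=0):
    """Return the sequence indices of the k highest-scoring tokens, in order."""
    scores = cls_attention_received(attn, cls_idx)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(i + cls_idx + 1 for i in ranked[:k])


# Toy input: 2 heads, 4 tokens (CLS at index 0 plus 3 patch tokens).
uniform = [0.25, 0.25, 0.25, 0.25]
attn = [
    [[0.1, 0.6, 0.1, 0.2], uniform, uniform, uniform],  # head 0
    [[0.2, 0.2, 0.1, 0.5], uniform, uniform, uniform],  # head 1
]
print(select_dominant(attn, k=2))  # [1, 3]
```

In a tensor implementation the same ranking would typically come from a top-k call on `attn_rec`; the list version only makes the indexing explicit.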
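Contextual Tokens Merging: merge the remaining tokens by similarity into a small number of contextual tokens, so that small details are not lost.
remaining = vanilla_tokens.mask(dominant_tokens)
targets, merge = uniform_split(remaining, M)
# First sample the targets uniformly from the remaining tokens; the rest become the merge tokens that will be merged into the targets
# M is the final number of contextual tokens
similarity = bmm(merge.K, targets.K.transpose(1, 2))
assign_idx = similarity.argmax(dim=2)
# Compute similarity to decide which target each merge token is assigned to
context_tokens = avg_merge(assign_idx, targets, merge)
# Average each target with the merge tokens assigned to it to obtain the new contextual tokens
Efficient Tuning: fine-tune the projector so that the compressed visual tokens align better with the LLM space.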
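The Contextual Tokens Merging steps above can be sketched in plain Python for a single sample. This is an illustration under stated assumptions, not the reference implementation: similarity is a raw dot product on the token vectors themselves (the pseudocode uses the `.K` features), targets are picked at evenly spaced positions, and the helper names `uniform_split`, `dot`, and `merge_contextual` are chosen here for readability.

```python
def uniform_split(tokens, m):
    """Pick m evenly spaced tokens as targets; the rest are merge candidates.

    Assumes 1 <= m <= len(tokens), so the sampled positions are distinct.
    """
    step = len(tokens) / m
    target_idx = [int(i * step) for i in range(m)]
    chosen = set(target_idx)
    targets = [tokens[i] for i in target_idx]
    to_merge = [tok for i, tok in enumerate(tokens) if i not in chosen]
    return targets, to_merge


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def merge_contextual(tokens, m):
    """Merge the remaining tokens into m contextual tokens by averaging."""
    targets, to_merge = uniform_split(tokens, m)
    groups = [[t] for t in targets]  # each group starts with its target
    for tok in to_merge:
        # Assign the token to its most similar target (argmax over targets).
        best = max(range(m), key=lambda j: dot(tok, targets[j]))
        groups[best].append(tok)
    dim = len(tokens[0])
    # Average every group, target included, component by component.
    return [[sum(v[d] for v in g) / len(g) for d in range(dim)] for g in groups]


remaining = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
context_tokens = merge_contextual(remaining, m=2)
print(context_tokens)
```

With this toy input the targets are the first and third vectors; the second vector is merged into the first target and the fourth into the second, so two averaged contextual tokens come out.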
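Advantages:
- Broadly applicable to image and video understanding tasks
- Well suited to multi-turn dialogue in real-world scenarios
- Outperforms the previous SOTA by 5%
- Improves inference speed
- Points to a future direction: developing visual encoders with lower redundancy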
Conventional VLM architecture:
visual encoder + modality projector + LLM
- “visual encoder, typically a pre-trained image encoder like CLIP’s vision model, converts input images into visual tokens”
- “The projector module aligns these visual tokens with the LLM’s word embedding space, enabling the LLM to process visual data effectively”
- “LLM then integrates the aligned visual and textual information to generate responses.”
Computational complexity
Accounts for the self-attention mechanism and the FFN.
T: number of transformer layers; n: sequence length; d: hidden dimension size; m: intermediate size of the FFN
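To make the role of the sequence length n concrete, here is a small sketch that assumes the commonly used estimate FLOPs = T × (4nd² + 2n²d + 2ndm) for these symbols. The notes above do not state the formula, so treat it as an assumption. The function name `transformer_flops` and the example configuration (32 layers, d = 4096, m = 11008, 64 text tokens, 576 versus 64 visual tokens) are illustrative values, not results from the paper.

```python
def transformer_flops(num_layers, seq_len, hidden_dim, ffn_dim):
    """Estimate FLOPs as T * (4*n*d^2 + 2*n^2*d + 2*n*d*m) (assumed formula)."""
    attn_proj = 4 * seq_len * hidden_dim ** 2    # Q, K, V and output projections
    attn_scores = 2 * seq_len ** 2 * hidden_dim  # QK^T and the weighted sum over V
    ffn = 2 * seq_len * hidden_dim * ffn_dim     # FFN up and down projections
    return num_layers * (attn_proj + attn_scores + ffn)


# Hypothetical comparison: same text prompt, fewer visual tokens.
text_tokens = 64
full = transformer_flops(32, 576 + text_tokens, 4096, 11008)
reduced = transformer_flops(32, 64 + text_tokens, 4096, 11008)
print(f"reduced / full = {reduced / full:.3f}")
```

Because n appears in every term, and quadratically in the attention-score term, cutting visual tokens lowers the estimate at least proportionally to the reduction in n.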