VisionZip

Research background:

  1. “Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs”
  2. Raises the question: “Are all visual tokens necessary?” (Yang et al., 2024, p. 1)
  3. Observes that visual tokens contain a large amount of redundancy
  4. Existing VLMs convert the image into vision tokens, which are then processed by the LLM decoder

VisionZip:

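The core idea is to identify and keep the most informative visual tokens; the selection is text-agnostic (it does not depend on the prompt).

“reducing visual token redundancy and improving efficiency while maintaining model performance.”

Overall framework: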

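Dominant Token Selection: from the visual encoder's output, select the tokens that carry the most information. The attention is $S_h = \mathrm{Softmax}\left(Q_h K_h^\top / \sqrt{D_h}\right)$, where $S_h$ is the attention score of each head, $D_h$ is the head dimension, and $Q_h$ and $K_h$ represent the query and key respectively.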

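# vision_tower, images, SELECT_LAYER, cls_idx and K (number of dominant tokens to keep) are defined elsewhere
output = vision_tower(images, output_hidden_states=True, output_attentions=True)
# Extract the features and the attention matrix of the selected intermediate layer
attn = output.attentions[SELECT_LAYER]
vanilla_tokens = output.hidden_states[SELECT_LAYER]
# attn has shape (B, H, S, S): B is the batch size, H the number of attention heads, S the sequence length;
# vanilla_tokens holds the token representations of the same layer
# attn[b, h, i, j] is the attention weight from token i to token j in head h of sample b
# Attention from the CLS token to every later token, summed over heads -> shape (B, S - cls_idx - 1)
attn_rec = attn[:, :, cls_idx, cls_idx+1:].sum(dim=1)
# One way to finish the selection: keep the K highest-scoring tokens (indices shifted back to positions in vanilla_tokens)
topk_idx = attn_rec.topk(K, dim=1).indices + cls_idx + 1
dominant_tokens = vanilla_tokens.gather(1, topk_idx.unsqueeze(-1).expand(-1, -1, vanilla_tokens.size(-1)))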
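
To make the selection rule concrete, here is a minimal dependency-free sketch for a single image. It uses nested Python lists in place of tensors, and the helper name `select_dominant`, the parameter `k`, and the toy attention values are assumptions made for illustration, not part of the paper's code.

```python
def select_dominant(attn, k, cls_idx=0):
    """Return the indices of the k tokens that receive the most CLS attention.

    attn: nested list of shape [H][S][S] for one image, where attn[h][i][j]
          is the weight token i puts on token j in head h.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    # Sum over heads the attention flowing from CLS to every later token,
    # mirroring attn[:, :, cls_idx, cls_idx+1:].sum(dim=1) for one sample.
    scores = [
        (sum(attn[h][cls_idx][j] for h in range(num_heads)), j)
        for j in range(cls_idx + 1, seq_len)
    ]
    # Keep the k highest-scoring positions, returned in sequence order.
    top = sorted(scores, key=lambda s: s[0], reverse=True)[:k]
    return sorted(j for _, j in top)


# Toy input (hypothetical values): 2 heads, 4 tokens, index 0 is CLS.
toy_attn = [
    [[0.1, 0.6, 0.1, 0.2]] + [[0.25] * 4] * 3,
    [[0.1, 0.5, 0.1, 0.3]] + [[0.25] * 4] * 3,
]
dominant_idx = select_dominant(toy_attn, k=2)  # tokens 1 and 3 get the most CLS attention
```

Only the CLS row of each head matters here; the other rows are filled with uniform weights just to keep the shape valid.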
 

Contextual Tokens Merging: merge the remaining tokens by similarity into a small number of contextual tokens, so that small details are not thrown away

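	# Pseudocode: mask, uniform_split and avg_merge are not real tensor methods
	remaining = vanilla_tokens.mask(dominant_tokens)
	# Uniformly sample the targets from the remaining tokens; the rest become the merge tokens that will be merged into the targets
	# M is the final number of contextual tokens
	targets, merge = uniform_split(remaining, M)
	# Compute the similarity between keys (.K) to decide which target each merge token is assigned to
	similarity = bmm(merge.K, targets.K.transpose(1, 2))
	assign_idx = similarity.argmax(dim=2)
	# Average each target with the merge tokens assigned to it to obtain the new contextual tokens
	context_tokens = avg_merge(assign_idx, targets, merge)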
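
The merging steps above can be sketched without any tensor library. This is only an illustration under stated assumptions: `merge_contextual` and `dot` are hypothetical helpers, the uniform split is taken to mean "every `n // m`-th token is a target", and similarity is the raw dot product of keys as in the pseudocode (a real implementation may normalize the keys first).

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def merge_contextual(tokens, keys, m):
    """Merge the remaining tokens of one image into m contextual tokens.

    tokens, keys: lists of n vectors (lists of floats), with n >= m >= 1.
    """
    n = len(tokens)
    step = n // m
    # Uniformly sample m target positions; every other token will be merged.
    target_pos = [i * step for i in range(m)]
    target_set = set(target_pos)
    merge_pos = [i for i in range(n) if i not in target_set]
    # Each group starts with its target token.
    groups = [[tokens[p]] for p in target_pos]
    for p in merge_pos:
        # Assign the token to the target whose key is most similar (argmax).
        sims = [dot(keys[p], keys[t]) for t in target_pos]
        groups[sims.index(max(sims))].append(tokens[p])
    # Average every group element-wise to get one contextual token per target.
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]


# Toy input (hypothetical values): 4 remaining tokens, keys equal to the tokens.
toy_tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
context_tokens = merge_contextual(toy_tokens, toy_tokens, m=2)
# Tokens 0 and 1 merge into [2.0, 0.0]; tokens 2 and 3 merge into [0.0, 2.0].
```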

Efficient Tuning: fine-tune the projector so that the compressed visual tokens align better with the LLM space

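Advantages:

  1. Can be applied broadly to image and video understanding tasks
  2. Well suited to multi-turn dialogue in real-world scenarios
  3. Outperforms the previous SOTA method by 5% (as reported in the paper)
  4. Improves inference speed
  5. Points to a future direction: developing visual encoders that produce less redundant features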

Standard VLM architecture:

visual encoder + modality projector + LLM

  1. “visual encoder, typically a pre-trained image encoder like CLIP’s vision model, converts input images into visual tokens”
  2. “The projector module aligns these visual tokens with the LLM’s word embedding space, enabling the LLM to process visual data effectively”
  3. “LLM then integrates the aligned visual and textual information to generate responses.”
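
As a rough picture of how the three components chain together, here is a toy sketch. Everything in it is a stand-in: `vlm_forward` and the three stub callables are hypothetical, and real encoders, projectors and LLMs operate on tensors rather than lists. The point is only that the LLM sees the visual tokens and the text tokens as one sequence, which is why the number of visual tokens drives the cost.

```python
def vlm_forward(image, text_embeds, visual_encoder, projector, llm):
    """Toy data flow: image -> visual tokens -> aligned tokens -> LLM."""
    visual_tokens = visual_encoder(image)                   # one vector per patch
    aligned = [projector(tok) for tok in visual_tokens]     # map into the LLM embedding space
    return llm(aligned + text_embeds)                       # LLM consumes one joint sequence


# Stub components (assumptions for illustration only).
toy_encoder = lambda image: [[pixel] for pixel in image]    # 1-dim "token" per patch
toy_projector = lambda tok: tok * 2                         # list repetition: widen 1-dim to 2-dim
toy_llm = lambda sequence: len(sequence)                    # report the sequence length it received

toy_image = [0.1, 0.2, 0.3, 0.4]
toy_text = [[0.0, 0.0], [1.0, 1.0]]
seq_len = vlm_forward(toy_image, toy_text, toy_encoder, toy_projector, toy_llm)
```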

Computational complexity

Takes the self-attention mechanism and the FFN into account

T: number of transformer layers; n: sequence length; d: hidden dimension size; m: intermediate size of the FFN
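
With these symbols, a common per-layer estimate counts the multiply-accumulates of the four attention projections (4nd²), the attention score and weighted-sum products (2n²d), and a two-matrix FFN (2ndm), giving T × (4nd² + 2n²d + 2ndm) in total. The sketch below evaluates that estimate. The function name and the example sizes are assumptions chosen for illustration (roughly 7B-scale dimensions, 576 versus 64 visual tokens plus 40 text tokens); they are not measurements.

```python
def transformer_flops(T, n, d, m):
    """Estimated cost of T transformer layers on a sequence of length n."""
    attn_proj = 4 * n * d * d   # Q, K, V and output projections
    attn_mix = 2 * n * n * d    # QK^T scores plus the weighted sum over V
    ffn = 2 * n * d * m         # two FFN matrices (d x m and m x d)
    return T * (attn_proj + attn_mix + ffn)


# Illustrative sizes only (assumed, not taken from the notes).
T, d, m = 32, 4096, 11008
text_len = 40
full = transformer_flops(T, 576 + text_len, d, m)
reduced = transformer_flops(T, 64 + text_len, d, m)
saving = 1 - reduced / full   # fraction of the estimated cost removed
```

Because n appears in every term, and quadratically in the attention term, cutting the visual tokens reduces the estimate by at least the same proportion as the sequence length.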