Direct Preference Optimization

Abstract

Background:

For large LMs:

“achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training” (Rafailov et al., 2024, p. 1)

Previous approaches:

“often with reinforcement learning from human feedback (RLHF)”

Difficulties:

“RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.”
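
The two stages named in the quote can be illustrated with a small numeric sketch. The quoted abstract does not spell out the formulas, so the following assumes the commonly used formulation: a Bradley-Terry style pairwise loss for the reward model, and a reward penalized by `beta` times the log-ratio between the policy and the original (reference) model to discourage drift. The function names and the `beta` default are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 1 (assumed Bradley-Terry form): negative log-likelihood that
    the human-preferred response scores higher than the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """Stage 2 (assumed form): the estimated reward minus beta times the
    log-ratio of policy to reference model. For a response sampled from
    the policy, this log-ratio is a single-sample estimate of the KL term
    that keeps the fine-tuned model close to the original one."""
    return reward - beta * (logp_policy - logp_ref)


# The loss is small when the reward model agrees with the human preference.
agree = reward_model_loss(r_chosen=2.0, r_rejected=0.0)
disagree = reward_model_loss(r_chosen=0.0, r_rejected=2.0)

# A response the policy finds more likely than the reference does is penalized.
penalized = kl_penalized_reward(reward=1.0, logp_policy=-1.0,
                                logp_ref=-2.0, beta=0.5)
```

Under these assumptions, `agree` is smaller than `disagree`, and `penalized` is lower than the raw reward because the policy has moved away from the reference on that response. The RL step then has to balance these two terms, which is part of what makes the procedure delicate.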
