Direct Preference Optimization

Abstract

Background:

For large LMs:

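“achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training” (Rafailov et al., 2024, p. 1). In short: because pretraining involves no supervision at all, there is no direct handle for steering what the model does.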

Prior approach:

“often with reinforcement learning from human feedback (RLHF)”

Difficulty:

“RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.”
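
The two stages named in the quote can be sketched numerically. The quoted passage gives no formulas, so the code below assumes the common formulation: a Bradley-Terry loss for fitting the reward model, and a reward penalized by beta times the log-ratio between the policy and the original (reference) model to limit drift. The function names and the default `beta=0.1` are illustrative choices, not taken from the paper.

```python
import math


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 1 (assumed Bradley-Terry form): negative log-likelihood that
    the human-preferred response gets the higher reward.

    Computes -log(sigmoid(r_chosen - r_rejected)) in a numerically
    stable way.
    """
    diff = r_chosen - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """Stage 2 (assumed KL-penalized form): the quantity the RL step
    maximizes for one sampled response.

    The log-ratio term is a single-sample estimate of the KL divergence
    from the reference model; beta controls how strongly drift is
    penalized.
    """
    return reward - beta * (logp_policy - logp_ref)
```

When the two rewards are equal, the stage 1 loss is log 2, and it shrinks as the preferred response pulls ahead. In stage 2, a response that the policy makes more likely than the reference model does has its reward reduced, which is the "without drifting too far" constraint in the quote.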