Direct Preference Optimization
Abstract
Background:
For large LMs:
“achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training” (Rafailov et al., 2024, p. 1) In short: because training is entirely unsupervised, it is hard to control the models' behavior precisely.
Prior approaches:
“often with reinforcement learning from human feedback (RLHF)”
Difficulty:
“RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model.”
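The two stages named in the quote above can be written out as objectives. This is only a sketch in the standard RLHF formulation, and none of the symbols are defined in these notes, so all of them are assumptions here: r_\phi is the reward model, \pi_\theta the policy being fine-tuned, \pi_ref the original (reference) model, D a dataset of prompts x with a preferred response y_w and a dispreferred response y_l, \sigma the logistic function, and \beta > 0 the weight of the KL penalty.

```latex
% Stage 1: fit a reward model to human preferences
% (Bradley-Terry negative log-likelihood over preference pairs)
\mathcal{L}_R(r_\phi) =
  -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% Stage 2: fine-tune the LM with RL to maximize the estimated reward,
% with a KL penalty that keeps it close to the original model
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}
  \left[ r_\phi(x, y) \right]
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

The KL term is what "without drifting too far from the original model" corresponds to: a larger \beta keeps the fine-tuned policy closer to the reference model.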