RLHF 2006
Ch9. Instruction Fine-Tuning (IFT/SFT)
Data Preparation in RLHF -- Ch6 (Preference Data) vs Ch9 (SFT Data)
PPO in LLMs vs PPO in Walker2D
PPO vs DPO in RLHF
The Complete InstructGPT Recipe (Ch 4.2.1)