PPO in LLMs vs PPO in Walker2D

🤖🦿 Understanding PPO: From Language Generation to Robot Control — Code, Concepts, and Comparisons #

This post compares Proximal Policy Optimization (PPO) in large language models (LLMs, e.g., GPT-style) and in classical control environments (e.g., Walker2D), focusing on the structure of the PPO update and on how actions are selected during inference.


1. 🧾 PPO step() Call — Argument-by-Argument Breakdown #

# TRL-style PPO update on a single (prompt, response, reward) triple
ppo_trainer.step(
    [input_ids[0]],       # queries: tokenized prompt — represents the current state
    [response_ids[0]],    # responses: generated tokens — represents the action taken
    [reward],             # rewards: scalar tensor from the reward model — score for that action
)
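
Where does the scalar `reward` come from? A minimal sketch of one common setup, reusing `tokenizer` and `response_ids` from the snippet above; the `lvwerra/distilbert-imdb` sentiment classifier is an illustrative stand-in for a trained preference/reward model, not part of the original example.

import torch
from transformers import pipeline

# Illustrative only: a sentiment classifier standing in for a trained reward model
reward_pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

text = tokenizer.decode(response_ids[0], skip_special_tokens=True)   # decoded response
reward = torch.tensor(reward_pipe(text)[0]["score"])                 # scalar tensor passed to step()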

Mapping to Classic RL (Walker2D) #

| PPO Argument | 🤖 LLM (RLHF) | 🦿 Walker2D (Classic RL) |
|---|---|---|
| `queries = [input_ids[0]]` | Prompt as input (discrete tokenized state) | Robot's continuous state (joint angles, velocities) |
| `responses = [response_ids[0]]` | Generated tokens (sequence of actions) | Applied joint torques (vector of real numbers) |
| `rewards = [reward]` | Reward model output (alignment score) | Environment reward (e.g., distance walked) |
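
Whether the action is a token or a torque, the update behind this call is built around the same clipped surrogate objective. A minimal PyTorch sketch; tensor names such as `log_probs`, `old_log_probs`, and `advantages` are illustrative, not taken from the code above.

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped PPO policy loss; works for token or torque log-probabilities."""
    ratio = torch.exp(log_probs - old_log_probs)                       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                       # maximize surrogate => minimize the negative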

2. 🎯 Action Selection in PPO #

How does the agent choose its next action, given a state/prompt?

🤖 LLMs (Text Generation) #

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # any causal LM works here
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Given a prompt (state)
input_ids = tokenizer("What causes rain?", return_tensors="pt").input_ids

# Model outputs logits over the vocabulary; the last position scores the next token (action)
outputs = model(input_ids=input_ids)
logits = outputs.logits[:, -1, :]
probs = torch.softmax(logits, dim=-1)

# Sample the next token (action) from the distribution
next_token = torch.multinomial(probs, num_samples=1)

# Append next_token to input_ids and repeat to generate the full response (see the loop below)
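
Putting the sampling step in a loop yields the full response. A minimal sketch reusing `model`, `tokenizer`, and `input_ids` from above; the response length of 20 tokens is arbitrary.

# Autoregressive generation: sample one token, append it, repeat
for _ in range(20):
    logits = model(input_ids=input_ids).logits[:, -1, :]
    probs = torch.softmax(logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))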

🦿 Walker2D (Physical Control) #

import torch
import gymnasium as gym

env = gym.make("Walker2d-v4")

# Get current robot state: a vector like [θ1, θ2, v1, v2, ...]
state, info = env.reset()

# Policy network outputs action distribution parameters
# (policy_net: an assumed small MLP mapping the state to per-joint mean and std)
mean, std = policy_net(torch.as_tensor(state, dtype=torch.float32))

# Sample a continuous action (e.g., torque values)
action = torch.normal(mean, std)

# Apply action to the environment (gymnasium's step returns a 5-tuple)
next_state, reward, terminated, truncated, info = env.step(action.detach().numpy())
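
Repeating this state-action-reward cycle until the episode ends produces the trajectory that PPO trains on. A minimal sketch reusing `env` and the assumed `policy_net` from above.

# Roll out one episode with the same action rule, collecting the trajectory
states, actions, rewards = [], [], []
state, info = env.reset()
done = False
while not done:
    mean, std = policy_net(torch.as_tensor(state, dtype=torch.float32))
    action = torch.normal(mean, std)
    next_state, reward, terminated, truncated, info = env.step(action.detach().numpy())
    states.append(state)
    actions.append(action)
    rewards.append(reward)
    state, done = next_state, terminated or truncated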

🔁 Comparison of Action Logic #

| Component | 🤖 LLM (RLHF) | 🦿 Walker2D (Classic RL) |
|---|---|---|
| State | Prompt text | Robot's physical state |
| Action | Next token (discrete) | Joint torques (continuous) |
| Policy output | Token logits (softmaxed) | Mean & std dev of a Gaussian per action dim |
| Sampling method | Multinomial over the vocabulary | Sample from the Gaussian |
| Result | Extend the response with the chosen token | Step to a new physical state |
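
The sampling difference carries straight into the PPO update: the probability ratio is built from the log-probability of the chosen action under each distribution type. A small sketch; the tensors and their shapes are illustrative.

import torch
from torch.distributions import Categorical, Normal

# Illustrative shapes: one prompt, a GPT-2-sized vocabulary, 6 actuated joints
logits = torch.randn(1, 50257)                 # LLM policy output (per-token logits)
mean, std = torch.zeros(6), torch.ones(6)      # Walker2D policy output (Gaussian parameters)

# LLM: discrete distribution over the vocabulary
token_dist = Categorical(logits=logits)
token = token_dist.sample()
logp_token = token_dist.log_prob(token)                  # log pi(token | prompt)

# Walker2D: independent Gaussian per joint, summed into one action log-prob
torque_dist = Normal(mean, std)
torques = torque_dist.sample()
logp_torques = torque_dist.log_prob(torques).sum(-1)     # log pi(torques | state)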

🔁 PPO Mapping in LLMs (RLHF) vs Classical RL #

| Category | 🤖 PPO in LLMs (RLHF) | 🦿 PPO in Walker2D (Classic RL) |
|---|---|---|
| Agent | Language model (e.g., GPT-2, o1) | Control policy network |
| Environment | Static or semi-static prompt context | Physics-based simulator (e.g., MuJoCo) |
| State | Prompt (or full token context so far) | Robot's current physical state (joint angles, velocities, etc.) |
| Action | Next token in the sequence (discrete, vocabulary-sized) | Torque values for each joint (continuous, multi-dimensional) |
| Trajectory | Sequence of tokens (prompt + response) | Sequence of joint states and actions over time |
| Reward signal | Given after the full response (from a reward model trained on human preferences) | Immediate reward at each time step (distance walked, balance maintained, etc.) |
| Reward nature | Sparse, episodic, scalar (usually one reward per episode) | Dense, frequent, per-step (continuous feedback at every step) |
| Goal | Generate text aligned with human values/preferences | Learn to walk forward efficiently without falling |
| Policy network | Transformer LM (large, ~billions of params) | Feedforward or RNN-based controller (small, e.g., an MLP) |
| Reference model | Frozen copy of the base LM (used for KL-penalty regularization) | Usually none (a KL penalty is uncommon in Walker2D PPO) |
| Training stability | Needs a KL penalty to prevent mode collapse / nonsense generations | PPO alone is usually enough, thanks to dense, continuous feedback |
| Evaluation | Human evals, reward model scores (e.g., helpfulness, safety) | Distance walked, steps survived, control energy used |
| | 🗣️ "Say the right thing, the way a human likes" | 🦿 "Move the right way, so you don't fall" |
| | Actions are words; optimize the sequence to match human preference | Actions are forces; optimize to physically walk and stay balanced |
| Reference | https://rlhfbook.com/ | https://gymnasium.farama.org/ |
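
The "Reference model" and "Training stability" rows are where the two setups diverge most. In RLHF-style PPO, the per-token reward is typically shaped with a KL penalty against the frozen reference model, with the reward-model score added on the final token. A minimal sketch; the log-probability tensors and the `kl_coef` value are illustrative.

import torch

def shaped_rewards(reward_model_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """Per-token rewards for RLHF-style PPO: KL penalty at every token,
    plus the reward-model score on the final token of the response."""
    kl = policy_logprobs - ref_logprobs      # per-token KL estimate vs. the frozen reference LM
    rewards = -kl_coef * kl                  # penalize drifting away from the reference model
    rewards[-1] += reward_model_score        # the episodic score arrives at the end of the response
    return rewards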