Cold-start, then climb
Pure RL from a base model on a hard task mostly produces high-variance garbage — and the policy-gradient math says exactly why. The fix I use is a two-stage recipe: a small SFT cold-start to give the policy a shape, then GRPO to climb. The recipe, the math, and the failure modes that actually bite.
Reinforcement learning improves a policy I already have. When I point it at a base model and a hard task (long-horizon planning under hard constraints), I hit a hard limit: if the policy almost never stumbles onto a good trajectory, there’s nothing for RL to amplify. I get high variance, slow progress, and reward curves that look like noise. The fix is to not start cold. The underlying math motivates this recipe and tells me when it is necessary.
Why I cold-start before RL: the gradient dictates it
Every policy-gradient method I implement, from REINFORCE to GRPO, is a variant of

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right],$$

an expectation over the policy’s own samples. I read it as a search budget: a behavior contributes gradient only in proportion to how often the policy currently produces it. If a good trajectory has probability $p \approx 10^{-4}$ under the base model and I sample 8 rollouts per prompt, I see one roughly every 1,250 prompts. The other 9,999 gradient contributions out of every 10,000 are noise pushing in arbitrary directions. Probability ≈ 0 means gradient ≈ 0, no matter how large the reward I attached to it.
With sparse 0/1 rewards, the relative variance of the gradient estimate scales roughly like $(1-p)/(pN)$ for success probability $p$ and a budget of $N$ samples, so it is worst exactly in the regime where success is rare and I need signal most. A small, clean SFT cold-start fixes my starting point: it moves $p$ from roughly $10^{-4}$ up to a pass rate where a group of 8 samples contains a usable contrast almost every prompt. The cold-start buys sample efficiency, not capability.
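The search-budget arithmetic can be checked in a few lines. This is a minimal sketch, assuming rollouts are independent 0/1 draws with success probability `p`; the function names are illustrative, and the relative-variance formula describes a plain Monte Carlo success-rate estimate, used here as a rough proxy for the gradient estimate.

```python
def prompts_per_success(p: float, rollouts_per_prompt: int) -> float:
    """Expected number of prompts sampled per successful rollout."""
    return 1.0 / (p * rollouts_per_prompt)


def relative_variance(p: float, n_samples: int) -> float:
    """Squared coefficient of variation of a success-rate estimate
    built from n_samples independent 0/1 rewards."""
    return (1.0 - p) / (p * n_samples)


if __name__ == "__main__":
    # Assumed success probabilities: a cold base model, then two warmer starts.
    for p in (1e-4, 1e-2, 1e-1):
        print(
            f"p={p:g}  prompts/success={prompts_per_success(p, 8):.1f}  "
            f"rel. variance (N=8)={relative_variance(p, 8):.1f}"
        )
```

The relative variance falls by orders of magnitude as `p` rises, which is the sense in which the cold-start buys sample efficiency.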
My four-step recipe
This mirrors the frontier-scale recipe detailed in DeepSeek-R1 (DeepSeek-AI, 2025), where R1-Zero acted as an ablation showing the chaos of pure RL from a base model. I adapted it to a vertical task:
- Explore on the base with GRPO. I run GRPO directly on the base to push out long chain-of-thought planning — letting it discover what reasoning paths reach valid plans. I expect this stage to be ugly; I’m mining for rare good trajectories.
- Rejection-sample the seed. I keep only the high-correctness reasoning → plan samples verified by my code. This is my SFT seed: small (thousands, not hundreds of thousands), clean, and in the model’s own voice.
- SFT cold-start. I fine-tune the base on the seed for one to two epochs. The model now reliably produces the shape I want.
- GRPO, for real. I run the main GRPO stage with a reward that scores constraint satisfaction and feasibility, letting it climb.
SFT gives the policy a shape; GRPO sharpens it against a reward.
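The rejection-sampling step can be sketched as a filter over rollouts. This is an illustration under stated assumptions: `Rollout`, `build_sft_seed`, the `verify_plan` callable, and the `min_score` threshold are hypothetical names, not the actual verifier or data format behind the recipe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Rollout:
    """One sampled trajectory (hypothetical schema)."""
    prompt: str
    reasoning: str
    plan: str


def build_sft_seed(
    rollouts: Iterable[Rollout],
    verify_plan: Callable[[Rollout], float],
    min_score: float = 1.0,
) -> List[Dict[str, str]]:
    """Keep only verified rollouts, dropping exact duplicates."""
    seed: List[Dict[str, str]] = []
    seen = set()
    for rollout in rollouts:
        if rollout in seen:
            continue
        seen.add(rollout)
        if verify_plan(rollout) >= min_score:
            seed.append(
                {
                    "prompt": rollout.prompt,
                    "response": rollout.reasoning + "\n" + rollout.plan,
                }
            )
    return seed


if __name__ == "__main__":
    samples = [
        Rollout("p1", "think...", "valid plan"),
        Rollout("p1", "think...", "valid plan"),  # exact duplicate
        Rollout("p1", "guess", "broken plan"),
    ]
    # Toy verifier: a stand-in for a real constraint checker.
    toy_verifier = lambda r: 1.0 if r.plan.startswith("valid") else 0.0
    print(build_sft_seed(samples, toy_verifier))
```

Keeping the reasoning and the plan together in the response is what lets the SFT stage teach the full shape rather than only the final answer.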
How the two stages map to my code
Both stages are the same config object — SwiftConfig in orbit/training/config.py — with train_type flipped. The cold-start is train_type="sft"; the climb is train_type="rlhf", rlhf_type="grpo". SwiftConfig.to_yaml_dict() emits the GRPO-specific knobs:
    if self.train_type == "rlhf":
        d["rlhf_type"] = self.rlhf_type
        if self.beta is not None:
            d["beta"] = self.beta
        # ...
        if self.rlhf_type == "grpo":
            # Group size K, mapping to the 8 rollouts per prompt
            d["num_generations"] = self.num_generations
            if self.reward_funcs:
                # The verifier program feeding RL
                d["reward_funcs"] = self.reward_funcs
num_generations is the group size (default 8), determining my sampling budget. beta is the KL-penalty coefficient $\beta$. reward_funcs is my verifier producing the per-rollout reward that GRPO standardizes into an advantage.
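To make the flip between stages concrete, here is a hypothetical stand-in for the config object. `MiniConfig` is not the real `SwiftConfig`: it only mirrors the fields named above, and the emitted `train_type` key, the `beta=0.04` value, and the `plan_verifier` reward name are assumptions for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List, Optional


@dataclass
class MiniConfig:
    """Hypothetical stand-in mirroring only the fields discussed above."""
    train_type: str = "sft"
    rlhf_type: Optional[str] = None
    beta: Optional[float] = None
    num_generations: int = 8
    reward_funcs: List[str] = field(default_factory=list)

    def to_yaml_dict(self) -> Dict[str, Any]:
        d: Dict[str, Any] = {"train_type": self.train_type}
        if self.train_type == "rlhf":
            d["rlhf_type"] = self.rlhf_type
            if self.beta is not None:
                d["beta"] = self.beta
            if self.rlhf_type == "grpo":
                d["num_generations"] = self.num_generations
                if self.reward_funcs:
                    d["reward_funcs"] = self.reward_funcs
        return d


# Stage 1: cold-start. Stage 2: same object with the training type flipped.
cold_start = MiniConfig(train_type="sft")
climb = replace(
    cold_start,
    train_type="rlhf",
    rlhf_type="grpo",
    beta=0.04,                       # assumed value, not a recommendation
    reward_funcs=["plan_verifier"],  # hypothetical reward name
)

if __name__ == "__main__":
    print(cold_start.to_yaml_dict())
    print(climb.to_yaml_dict())
```

Deriving the second stage from the first with `replace` keeps every shared setting identical, so the only differences between the two runs are the ones written out.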
GRPO math in practice
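PPO needs a separate critic network to estimate a per-token value baseline. GRPO, introduced in DeepSeekMath (Shao et al., 2024), deletes it. For each prompt $q$, I sample a group of $K$ responses $o_1, \dots, o_K$ from the current policy and score each with my reward, giving $r_1, \dots, r_K$. The advantage of response $i$ is its score standardized against its own group:

$$A_i = \frac{r_i - \mu}{\sigma}, \qquad \mu = \frac{1}{K}\sum_{j=1}^{K} r_j, \qquad \sigma = \operatorname{std}(r_1, \dots, r_K).$$

I plug that into the clipped surrogate, plus an explicit KL penalty to a reference model:

$$J(\theta) = \mathbb{E}\left[ \frac{1}{K} \sum_{i=1}^{K} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Big( \min\big( \rho_{i,t} A_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, A_i \big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \right], \qquad \rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}.$$

For one group of four, the mean-centering step looks like this (the division by $\sigma$ is left out to keep the numbers readable):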
                 [ prompt ]
                     |  sample K = 4
             +-------+-------+
             |               |
        [ans: 0.9]      [ans: 0.1]
        [ans: 0.6]      [ans: 0.4]

    mean = 0.5
    advantage = score - mean

    [ans: 0.9] -> +0.4 (up)
    [ans: 0.6] -> +0.1 (up)
    [ans: 0.4] -> -0.1 (down)
    [ans: 0.1] -> -0.4 (down)
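The diagram's numbers can be reproduced with a short sketch. It assumes the population standard deviation and a small `eps` guard against division by zero; real implementations differ on both choices, so treat those as assumptions rather than the definition.

```python
from statistics import fmean, pstdev
from typing import List


def centered_advantages(rewards: List[float]) -> List[float]:
    """Score minus group mean, as in the diagram."""
    mu = fmean(rewards)
    return [r - mu for r in rewards]


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score standardized against its own group: (r - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.6, 0.4]
    print([round(a, 2) for a in centered_advantages(scores)])
    print([round(a, 2) for a in group_advantages(scores)])
```

Standardizing changes the scale of the advantages but not their sign or ordering within the group.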
The group mean does the job of PPO’s critic. I trade a second model for $K\times$ sampling per prompt. For a verifiable reward, scoring is practically free and sampling is the dominant cost (eating 70–90% of my wall-clock time in vLLM). The policy update is the cheap part.
I know before I tune that the $\sigma$ in the denominator is a bias source. As analyzed in Dr. GRPO (2025), dividing by the group’s std up-weights prompts where the policy is consistent (rewards nearly all alike), and the per-response length normalization makes long wrong answers cheaper per token than short wrong ones.
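A toy comparison shows the std effect. This sketch does not reproduce the Dr. GRPO analysis: the "update mass" proxy (sum of absolute advantages) and the two example groups are assumed simplifications, chosen only to show how dividing by the std shifts weight toward a nearly consistent group.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def mean_centered(rewards: Sequence[float]) -> List[float]:
    mu = fmean(rewards)
    return [r - mu for r in rewards]


def standardized(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def update_mass(advantages: Sequence[float]) -> float:
    """Rough proxy for how hard one group pushes the policy."""
    return sum(abs(a) for a in advantages)


lopsided = [1, 0, 0, 0, 0, 0, 0, 0]  # policy nearly consistent on this prompt
balanced = [1, 1, 1, 1, 0, 0, 0, 0]  # policy split on this prompt

# Weight of the lopsided group relative to the balanced one, both ways.
ratio_centered = update_mass(mean_centered(lopsided)) / update_mass(mean_centered(balanced))
ratio_standardized = update_mass(standardized(lopsided)) / update_mass(standardized(balanced))

if __name__ == "__main__":
    print(f"mean-centered only: {ratio_centered:.3f}")
    print(f"divided by std:     {ratio_standardized:.3f}")
```

The second ratio comes out larger than the first, which is the up-weighting of low-variance prompts in miniature.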
The things that bite my runs
The dead-group problem. If all $K$ samples fail, every reward equals the group mean, so $A_i = 0$ for every $i$. The same happens when all $K$ succeed. The group contributes zero gradient and I still pay the full sampling cost. This is why practitioners introduced dynamic sampling in DAPO (2025), to resample or skip degenerate groups.
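Detecting dead groups before the update is cheap. The sketch below is in the spirit of dynamic sampling but is not DAPO's implementation; `split_live_groups`, the prompt-id keys, and the tolerance are hypothetical.

```python
from typing import Dict, List, Tuple


def split_live_groups(
    groups: Dict[str, List[float]], tol: float = 1e-8
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Separate groups with reward contrast from groups with none.

    A group is dead when all of its rewards are (numerically) identical,
    because every mean-centered advantage is then zero.
    """
    live: Dict[str, List[float]] = {}
    dead: List[str] = []
    for prompt_id, rewards in groups.items():
        if max(rewards) - min(rewards) > tol:
            live[prompt_id] = rewards
        else:
            dead.append(prompt_id)
    return live, dead


if __name__ == "__main__":
    batch = {
        "prompt-a": [0.0, 0.0, 0.0, 0.0],  # all fail: dead
        "prompt-b": [1.0, 1.0, 1.0, 1.0],  # all succeed: also dead
        "prompt-c": [0.0, 1.0, 0.0, 0.0],  # contrast: live
    }
    live, dead = split_live_groups(batch)
    print("train on:", list(live))
    print("resample or skip:", dead)
```

Whether to resample the dead prompts or simply drop them is a budget decision; logging the dead fraction per step is a useful health metric either way.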
Group size is a knob. $K$ controls the variance of the group mean as a baseline estimate. Too small (2–4), the advantage is noisy; too large, I burn rollout budget. At pass rate $p = 0.1$ and $K = 4$, the chance a group is live, $1 - p^K - (1-p)^K$, is ~34%. At $K = 8$, it’s ~57%. My cold-start’s job is quantified: I raise $p$ until a modest $K$ keeps most groups alive.
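The live-group probability is easy to tabulate. This is a minimal sketch assuming independent 0/1 rollouts with a fixed pass rate per prompt; `p_live`, `min_group_size`, and the 90% target are illustrative choices.

```python
from typing import Optional


def p_live(p: float, k: int) -> float:
    """Probability that k independent 0/1 rollouts are not all identical."""
    return 1.0 - p**k - (1.0 - p) ** k


def min_group_size(p: float, target: float = 0.9, k_max: int = 256) -> Optional[int]:
    """Smallest group size whose live probability reaches the target, if any."""
    for k in range(2, k_max + 1):
        if p_live(p, k) >= target:
            return k
    return None


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.3):
        row = "  ".join(f"K={k}: {p_live(p, k):.2f}" for k in (4, 8, 16))
        print(f"p={p:<4}  {row}  min K for 90% live: {min_group_size(p)}")
```

Reading the table the other way around gives the cold-start target: pick the group size the rollout budget allows, then find the pass rate at which most groups stay live.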
By running this SFT cold-start → GRPO pipeline aligned with a constraint-aware reward, I lifted complex-constraint satisfaction ~12% on my internal benchmark and cut hallucinated plans without requiring a massive human-labeled dataset. The optimizer did the climbing, but the cold start built the ladder.