Post-Training in Practice
June 10, 2026 · 13 min read

Cold-start, then climb

Pure RL from a base model on a hard task mostly produces high-variance garbage — and the policy-gradient math says exactly why. The fix I use is a two-stage recipe: a small SFT cold-start to give the policy a shape, then GRPO to climb. The recipe, the math, and the failure modes that actually bite.

Tags: llm, rl, grpo, sft, reasoning
Contents
  1. Why I cold-start before RL: the gradient dictates it
  2. My four-step recipe
  3. How the two stages map to my code
  4. GRPO math in practice
  5. The things that bite my runs

Reinforcement learning improves a policy I already have. When I point it at a base model and a hard task (long-horizon planning under hard constraints), I hit a basic limit: if the policy almost never stumbles onto a good trajectory, there’s nothing for RL to amplify. I get high variance, slow progress, and reward curves that look like noise. The fix is to not start cold. I rely on this recipe because the underlying math tells me exactly when it is necessary.

Why I cold-start before RL: the gradient dictates it

Every policy-gradient method I implement, from REINFORCE to GRPO, is a variant of

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{y \sim \pi_\theta}\big[\, A(y)\, \nabla_\theta \log \pi_\theta(y \mid x) \,\big],$$

an expectation over the policy’s own samples. I read it as a search budget: a behavior contributes gradient only in proportion to how often the policy currently produces it. If a good trajectory has probability $10^{-4}$ under the base model and I sample 8 rollouts per prompt, I see one roughly every 1,250 prompts. The other 9,999 of those 10,000 rollouts contribute noise pushing in arbitrary directions. Probability ≈ 0 means gradient ≈ 0, no matter how large the reward I attached to it.
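
The search-budget arithmetic is easy to check. This is a minimal sketch that assumes rollouts are independent draws; the helper name `prompts_per_hit` is illustrative and not part of my training code.

```python
def prompts_per_hit(p_good: float, rollouts_per_prompt: int) -> float:
    """Expected number of prompts sampled per good trajectory.

    Assumes each rollout is independently good with probability p_good.
    """
    expected_hits_per_prompt = p_good * rollouts_per_prompt
    return 1.0 / expected_hits_per_prompt


prompts = prompts_per_hit(1e-4, 8)
rollouts = prompts * 8
print(f"{prompts:.0f} prompts, {rollouts:.0f} rollouts per good trajectory")
# prints: 1250 prompts, 10000 rollouts per good trajectory
```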

With sparse 0/1 rewards, the variance of a Monte Carlo estimate built from $n$ samples scales like $p(1-p)/n$, so its noise relative to the signal scales like $\sqrt{(1-p)/(pn)}$: worst exactly in the regime where success is rare and I need signal most. A small, clean SFT cold-start fixes my starting point. I’m moving $p(\text{good trajectory})$ from $10^{-4}$ to $10^{-1}$. At that point, a group of 8 samples contains at least one success, and therefore a usable contrast, on more than half of prompts ($1 - 0.9^8 \approx 57\%$) instead of fewer than one in a thousand. The cold-start buys sample efficiency, not capability.
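
To put numbers on the rare-success regime: for a 0/1 reward with success rate $p$, the mean over $n$ samples has standard error $\sqrt{p(1-p)/n}$, and dividing by $p$ gives the noise relative to the signal. A small sketch under that Bernoulli assumption (the function name is illustrative):

```python
import math


def relative_std_error(p: float, n: int) -> float:
    """Standard error of a Bernoulli(p) sample mean over n draws, divided by p."""
    return math.sqrt(p * (1.0 - p) / n) / p


for p in (1e-4, 1e-1):
    print(f"p={p:g}: relative error at n=8 is {relative_std_error(p, 8):.2f}")
```

At $p = 10^{-4}$ the noise is roughly 35 times the signal; at $p = 10^{-1}$ the two are about the same size, which is the regime where a group of 8 starts to carry information.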

My four-step recipe

This mirrors the frontier-scale recipe detailed in DeepSeek-R1 (DeepSeek-AI, 2025), where R1-Zero showed what pure RL from a base model looks like: reasoning emerges, but with poor readability and language mixing, which motivated adding cold-start data before RL. I adapted it to a vertical task:

  1. Explore on the base with GRPO. I run GRPO directly on the base to push out long chain-of-thought planning — letting it discover what reasoning paths reach valid plans. I expect this stage to be ugly; I’m mining for rare good trajectories.
  2. Rejection-sample the seed. I keep only the high-correctness reasoning → plan samples verified by my code. This is my SFT seed: small (thousands, not hundreds of thousands), clean, and in the model’s own voice.
  3. SFT cold-start. I fine-tune the base on the seed for one to two epochs. The model now reliably produces the shape I want.
  4. GRPO, for real. I run the main GRPO stage with a reward that scores constraint satisfaction and feasibility, letting it climb.
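
Step 2 is the part that is easiest to get subtly wrong, so here is a minimal sketch of a rejection-sampling filter. The `Rollout` record, the `verify` callback, and the per-prompt cap are all assumptions made for illustration; the recipe above only specifies "keep what the verifier accepts."

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Rollout:
    # Hypothetical record layout, not the format used in my pipeline.
    prompt: str
    reasoning: str
    plan: str


def build_sft_seed(
    rollouts: Iterable[Rollout],
    verify: Callable[[Rollout], bool],
    max_per_prompt: int = 2,
) -> List[Rollout]:
    """Keep verified rollouts, capped per prompt so easy prompts don't dominate."""
    kept: List[Rollout] = []
    per_prompt: Dict[str, int] = {}
    for r in rollouts:
        if not verify(r):
            continue  # rejected by the verifier
        if per_prompt.get(r.prompt, 0) >= max_per_prompt:
            continue  # this prompt already has enough accepted samples
        per_prompt[r.prompt] = per_prompt.get(r.prompt, 0) + 1
        kept.append(r)
    return kept


rollouts = [
    Rollout("p1", "think...", "valid"),
    Rollout("p1", "think...", "invalid"),
    Rollout("p1", "think again...", "valid"),
    Rollout("p1", "third try...", "valid"),
    Rollout("p2", "think...", "invalid"),
]
seed = build_sft_seed(rollouts, verify=lambda r: r.plan == "valid")
print(len(seed))  # 2: two accepted for p1 (cap reached), none for p2
```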

SFT gives the policy a shape; GRPO sharpens it against a reward.

How the two stages map to my code

Both stages are the same config object — SwiftConfig in orbit/training/config.py — with train_type flipped. The cold-start is train_type="sft"; the climb is train_type="rlhf", rlhf_type="grpo". SwiftConfig.to_yaml_dict() emits the GRPO-specific knobs:

if self.train_type == "rlhf":
    d["rlhf_type"] = self.rlhf_type
    if self.beta is not None:
        d["beta"] = self.beta
    # ...
    if self.rlhf_type == "grpo":
        # Group size K, mapping to the 8 rollouts per prompt
        d["num_generations"] = self.num_generations 
        if self.reward_funcs:
            # The verifier program feeding RL
            d["reward_funcs"] = self.reward_funcs

num_generations is the group size $K$ (default 8), determining my sampling budget. beta is the KL-penalty coefficient $\beta$. reward_funcs is my verifier producing the per-rollout reward $r_i$ that GRPO standardizes into an advantage.
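
For a sense of what a verifier-style reward can look like, here is a toy sketch that scores the fraction of constraints a plan satisfies. Everything in it is hypothetical: the `Plan` format, the constraint lambdas, and the function name. It does not reproduce the signature a training framework expects for `reward_funcs`, which depends on the framework.

```python
from typing import Callable, Dict, List

Plan = Dict[str, int]            # hypothetical format: task name -> start hour
Constraint = Callable[[Plan], bool]


def constraint_reward(plan: Plan, constraints: List[Constraint]) -> float:
    """Fraction of constraints satisfied, in [0, 1]; 0.0 if none are given."""
    if not constraints:
        return 0.0
    satisfied = sum(1 for check in constraints if check(plan))
    return satisfied / len(constraints)


constraints = [
    lambda p: "pack" in p and "depart" in p,             # both tasks scheduled
    lambda p: p.get("pack", 99) < p.get("depart", 0),    # pack before departing
    lambda p: p.get("depart", 99) <= 18,                 # leave by 18:00
]
print(constraint_reward({"pack": 9, "depart": 17}, constraints))   # 1.0
print(constraint_reward({"pack": 19, "depart": 17}, constraints))  # 2 of 3 satisfied
```

A graded score like this gives GRPO more within-group contrast than a strict 0/1 pass, at the cost of rewarding partially valid plans; which is appropriate depends on the task.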

GRPO math in practice

PPO needs a separate critic network to estimate a per-token value baseline. GRPO, introduced in DeepSeekMath (Shao et al., 2024), deletes it. For each prompt $x$, I sample a group of $K$ responses $y_1,\dots,y_K$ from the current policy and score each with my reward. The advantage of response $i$ is its score standardized against its own group:

$$\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_K)}{\operatorname{std}(r_1,\dots,r_K)}$$

I plug that $\hat{A}_i$ into the clipped surrogate, where $\rho_t$ is the per-token probability ratio between the current policy and the policy that generated the samples, plus an explicit KL penalty to a reference model:

$$\mathcal{L} \;=\; -\,\mathbb{E}\!\left[\min\!\big(\rho_t \hat{A}_i,\; \operatorname{clip}(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon)\, \hat{A}_i\big)\right] \;+\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$$
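
The loss above can be sketched for a single token in plain Python. This is an illustration, not my training code: the function name, the log-probability arguments, and the defaults for `eps` and `beta` are assumptions. The KL term uses the non-negative estimator $r - \log r - 1$ with $r = \pi_{\mathrm{ref}} / \pi_\theta$, as in the GRPO formulation of DeepSeekMath.

```python
import math


def grpo_token_loss(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    eps: float = 0.2,    # illustrative clip range
    beta: float = 0.04,  # illustrative KL coefficient
) -> float:
    """Negative clipped surrogate plus beta times a KL estimate, for one token."""
    ratio = math.exp(logp_new - logp_old)            # rho_t
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)  # clip(rho_t, 1-eps, 1+eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    log_r = logp_ref - logp_new                      # log(pi_ref / pi_theta)
    kl = math.exp(log_r) - log_r - 1.0               # >= 0, zero when policies agree
    return -surrogate + beta * kl


# Positive advantage with a ratio above 1 + eps: the gain is capped at (1 + eps) * A.
print(grpo_token_loss(-0.5, -1.0, -0.5, advantage=1.0))
# Negative advantage with the same ratio: min() keeps the unclipped, more pessimistic term.
print(grpo_token_loss(-0.5, -1.0, -0.5, advantage=-1.0))
```

The asymmetry is the point of the `min`: clipping limits how much credit a token can collect, but never hides a penalty.
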
      [ prompt ]
          | sample K = 4
  +-------+-------+
  |               |
[ans: 0.9]    [ans: 0.1]
[ans: 0.6]    [ans: 0.4]

mean = 0.5
advantage = score - mean   (division by the group std omitted for readability)

[ans: 0.9] -> +0.4 (up)
[ans: 0.6] -> +0.1 (up)
[ans: 0.4] -> -0.1 (down)
[ans: 0.1] -> -0.4 (down)
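
The same four scores can be run through a few lines of Python. This sketch assumes the population standard deviation and a small `eps` in the denominator to avoid division by zero; real implementations differ on both choices, and the function name is illustrative.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(
    rewards: List[float], normalize_std: bool = True, eps: float = 1e-8
) -> List[float]:
    """Center rewards on the group mean; optionally divide by the group std."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)  # population std (an assumption of this sketch)
    return [c / (sigma + eps) for c in centered]


scores = [0.9, 0.1, 0.6, 0.4]
print([round(a, 2) for a in group_advantages(scores, normalize_std=False)])
# [0.4, -0.4, 0.1, -0.1]
print([round(a, 2) for a in group_advantages(scores)])
# [1.37, -1.37, 0.34, -0.34]
```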

The group mean does the job of PPO’s critic. I trade a second model for $K\times$ sampling per prompt. For a verifiable reward, scoring is practically free and sampling is the dominant cost (eating 70–90% of my wall-clock time in vLLM). The policy update is the cheap part.

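The $\operatorname{std}$ in the denominator is a known bias source, so I account for it before I tune. As analyzed in Dr. GRPO (2025), dividing by the group’s std up-weights prompts where the rewards within a group are nearly uniform, meaning very easy or very hard prompts. The same analysis identifies a separate length bias: averaging the loss over each response’s length makes long wrong answers cheaper per token than short wrong ones.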
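
The per-token effect is plain arithmetic once you assume the sequence loss is averaged over response length, which is the $1/|y_i|$ factor in the original GRPO objective. A tiny illustration with made-up lengths:

```python
def per_token_weight(advantage: float, length: int) -> float:
    """Weight each token receives when the loss is averaged over response length."""
    return advantage / length


short_wrong = per_token_weight(-1.0, 50)   # 50-token wrong answer
long_wrong = per_token_weight(-1.0, 500)   # 500-token wrong answer
print(short_wrong, long_wrong)
```

With the same negative advantage, each token of the 500-token answer is penalized a tenth as hard as each token of the 50-token one, so rambling is the cheaper way to be wrong.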

The things that bite my runs

The dead-group problem. If all $K$ samples fail (or all succeed), every $r_i - \operatorname{mean}(r) = 0$. The group contributes zero gradient and I still pay the full sampling cost. This is why DAPO (2025) introduced dynamic sampling, which resamples or skips degenerate groups.
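
A minimal sketch of the filtering half of that idea: drop groups whose rewards are all identical before they reach the optimizer. The function names and the dict layout are assumptions for illustration. DAPO additionally keeps sampling until the batch is full again, which this sketch does not do.

```python
from typing import Dict, List


def is_live(rewards: List[float], tol: float = 1e-9) -> bool:
    """A group is live if its rewards are not all (numerically) identical."""
    return max(rewards) - min(rewards) > tol


def filter_live_groups(groups: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Keep only prompts whose reward group carries a non-zero advantage signal."""
    return {prompt: rs for prompt, rs in groups.items() if is_live(rs)}


groups = {
    "all_fail": [0.0, 0.0, 0.0, 0.0],
    "mixed": [0.0, 1.0, 0.0, 0.0],
    "all_pass": [1.0, 1.0, 1.0, 1.0],
}
print(list(filter_live_groups(groups)))  # ['mixed']
```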

Group size is a knob. $K$ controls the variance of $\operatorname{mean}(r)$ as a baseline estimate. Too small (2–4), the advantage is noisy; too large, I burn rollout budget. At pass rate $p=0.05$ and $K=8$, the chance a group is live (at least one success and at least one failure) is ~34%. At $K=16$, it’s ~56%. My cold-start’s job is quantified: I raise $p$ until a modest $K$ keeps most groups alive.
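
Those percentages come from one line of probability: with a 0/1 reward, a group is dead only if all $K$ rollouts fail or all $K$ succeed. A short sketch, assuming independent rollouts (the helper name is illustrative):

```python
def p_live(p: float, k: int) -> float:
    """Probability that k independent 0/1 rollouts are neither all 0 nor all 1."""
    return 1.0 - (1.0 - p) ** k - p ** k


for p, k in [(0.05, 8), (0.05, 16), (0.10, 8)]:
    print(f"p={p:.2f}, K={k}: live with probability {p_live(p, k):.0%}")
# p=0.05, K=8: live with probability 34%
# p=0.05, K=16: live with probability 56%
# p=0.10, K=8: live with probability 57%
```

Doubling $K$ at $p = 0.05$ and doubling $p$ at $K = 8$ buy about the same liveness, but only the second one doesn't double the rollout bill.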

By running this SFT cold-start → GRPO pipeline aligned with a constraint-aware reward, I lifted complex-constraint satisfaction ~12% on my internal benchmark and cut hallucinated plans without requiring a massive human-labeled dataset. The optimizer did the climbing, but the cold start built the ladder.
