Post-Training in Practice
June 10, 2026 · 12 min read

Post-training is a data problem

PPO, GRPO, and DPO are commoditized. In my engineering iterations, the only variable that structurally improved alignment was the synthetic data engine.

llmpost-trainingsynthetic-datarejection-sampling
Cover illustration for “Post-training is a data problem”

I used to spend days tuning PPO hyperparameters. I eventually realized that the loss function is mostly irrelevant. Eliciting latent behaviors and shaping specific trajectories requires a massive volume of highly constrained, faithful demonstrations. You cannot crowd-source this. You have to synthesize it.

The Generation Engines

I built four discrete data-manufacturing modules under orbit/data/. They all output a uniform JSONL schema (messages, env, score, task_id).

1. Deterministic Synthetic Trajectories. I bypass the LLM entirely for scaffolding. In orbit/data/liveweb_teacher_gen.py, my TeacherGenerator replays cached web topologies:

gen = TeacherGenerator(cache_dir=cache_dir, include_plugins=include_plugins)
result = await gen.generate_composite_trajectory(
    seed=seed, num_subtasks=n_sub, templates=selected,
)
for record in result.records:
    record["env"] = "LIVEWEB"
    record["score"] = record.get("metadata", {}).get("score", 1.0)

I generate rigid, multi-tool trajectories deterministically. This seeds the model with structural syntax before it ever attempts to hallucinate reasoning.

2. Self-Play. For well-defined environments, I implemented an OpenSpiel registry in orbit/data/game_gen.py. MCTS search or CFR policy snapshots play out matches. The generators only keep the winning trajectories.

3. Rejection Sampling. The core pipeline mechanism. I sample massively, enforce a strict verifier, and discard the failures. In orbit/data/sft.py, filter_quality executes the dedup logic:

filtered = [r for r in records if r.get("score", 0.0) >= min_score]
if dedup:
    best = {}
    for r in filtered:
        key = (r.get("env"), r.get("task_id"))
        if key not in best or r.get("score", 0) > best[key].get("score", 0):
            best[key] = r
    filtered = list(best.values())

Rejection sampling implicitly executes a KL-regularized policy improvement. If the pass rate is pp, best-of-NN guarantees at least one success with probability 1(1p)N1-(1-p)^N. The resulting distribution is bounded at a KL divergence of roughly logN\log N. I get the policy improvement of RL without the rollout volatility, mirroring the dataset distillation mechanics proven in LLM alignment pipelines (Touvron et al., 2023).

4. The Verifier. Human grading is geometrically impossible at scale. I rely entirely on programmatic constraints. StaticTraceVerifier in orbit/verifiers/static.py maps trajectories to terminal scores.

The Yield Flywheel

The components form an aggressive compounding loop.

+-------------------+        +--------------------+        +--------------------+
| Generator         |        | Verifier / Judge   |        | Train              |
| (synthetic,       | -----> | (filter, rank,     | -----> | (SFT, GRPO, DPO)   |
|  self-play)       |        |  label)            |        |                    |
+-------------------+        +--------------------+        +--------------------+
         ^                                                           |
         |                                                           |
         +-----------------------------------------------------------+
                              Better Model

I treat the pass rate ptp_t as the fundamental metric of my system. When I train the survivors, the model improves, and the generator yield for round t+1t+1 increases. If I model the per-round improvement as a multiplier g>1g > 1 on the odds ratio, the growth is geometric:

pt1pt=gtp01p0\frac{p_t}{1-p_t} = g^t \cdot \frac{p_0}{1-p_0}

If I start with a 5%5\% yield (p0=0.05p_0=0.05) and g=2g=2, four passes push the yield to 46%46\%. But if the verifier is noisy, gg collapses to 11.

I dedicate zero engineering time to the training code; build_ms_swift_dataset is a static mapping function. I spend 90% of my compute and engineering budget aggressively optimizing the verifier rubric. Yield directly dictates compute cost. At a 5% pass rate, extracting 10,000 clean trajectories costs 200,000 generation passes. Tuning the verifier to halve that ratio is significantly more impactful than optimizing GPU utilization.

Comments