Post-training is a data problem
PPO, GRPO, and DPO are commoditized. In my engineering iterations, the only variable that structurally improved alignment was the synthetic data engine.
I used to spend days tuning PPO hyperparameters. I eventually realized that the loss function is mostly irrelevant. Eliciting latent behaviors and shaping specific trajectories requires a massive volume of highly constrained, faithful demonstrations. You cannot crowd-source this. You have to synthesize it.
The Generation Engines
I built four discrete data-manufacturing modules under orbit/data/. They all output a uniform JSONL schema (messages, env, score, task_id).
1. Deterministic Synthetic Trajectories. I bypass the LLM entirely for scaffolding. In orbit/data/liveweb_teacher_gen.py, my TeacherGenerator replays cached web topologies:
gen = TeacherGenerator(cache_dir=cache_dir, include_plugins=include_plugins)
result = await gen.generate_composite_trajectory(
seed=seed, num_subtasks=n_sub, templates=selected,
)
for record in result.records:
record["env"] = "LIVEWEB"
record["score"] = record.get("metadata", {}).get("score", 1.0)
I generate rigid, multi-tool trajectories deterministically. This seeds the model with structural syntax before it ever attempts to hallucinate reasoning.
2. Self-Play. For well-defined environments, I implemented an OpenSpiel registry in orbit/data/game_gen.py. MCTS search or CFR policy snapshots play out matches. The generators only keep the winning trajectories.
3. Rejection Sampling. The core pipeline mechanism. I sample massively, enforce a strict verifier, and discard the failures. In orbit/data/sft.py, filter_quality executes the dedup logic:
filtered = [r for r in records if r.get("score", 0.0) >= min_score]
if dedup:
best = {}
for r in filtered:
key = (r.get("env"), r.get("task_id"))
if key not in best or r.get("score", 0) > best[key].get("score", 0):
best[key] = r
filtered = list(best.values())
Rejection sampling implicitly executes a KL-regularized policy improvement. If the pass rate is , best-of- guarantees at least one success with probability . The resulting distribution is bounded at a KL divergence of roughly . I get the policy improvement of RL without the rollout volatility, mirroring the dataset distillation mechanics proven in LLM alignment pipelines (Touvron et al., 2023).
4. The Verifier. Human grading is geometrically impossible at scale. I rely entirely on programmatic constraints. StaticTraceVerifier in orbit/verifiers/static.py maps trajectories to terminal scores.
The Yield Flywheel
The components form an aggressive compounding loop.
+-------------------+ +--------------------+ +--------------------+
| Generator | | Verifier / Judge | | Train |
| (synthetic, | -----> | (filter, rank, | -----> | (SFT, GRPO, DPO) |
| self-play) | | label) | | |
+-------------------+ +--------------------+ +--------------------+
^ |
| |
+-----------------------------------------------------------+
Better Model
I treat the pass rate as the fundamental metric of my system. When I train the survivors, the model improves, and the generator yield for round increases. If I model the per-round improvement as a multiplier on the odds ratio, the growth is geometric:
If I start with a yield () and , four passes push the yield to . But if the verifier is noisy, collapses to .
I dedicate zero engineering time to the training code; build_ms_swift_dataset is a static mapping function. I spend 90% of my compute and engineering budget aggressively optimizing the verifier rubric. Yield directly dictates compute cost. At a 5% pass rate, extracting 10,000 clean trajectories costs 200,000 generation passes. Tuning the verifier to halve that ratio is significantly more impactful than optimizing GPU utilization.
Comments