Wang Tong · 王通
LLM post-training & agent algorithm engineer — alignment, RL, reward modeling.
I'm an LLM algorithm engineer focused on post-training and alignment — SFT, DPO, RLHF / RLVR, PPO, GRPO and reward modeling — and on the data engines that feed them: synthetic data, self-play, rejection sampling and LLM-as-judge. I like owning the whole loop, from data generation → training → offline/online evaluation → deployment, and I care as much about the eval harness and the verifier as about the training run.
Lately. RL for reasoning & planning agents (a GRPO + reward-model pipeline that lifted complex-constraint satisfaction ~12% on an internal planning benchmark); online distillation (a Qwen 35B→4B Forward-KL setup on 8×H200); evaluation harnesses for browser and software-engineering agents (LiveWeb, SWE trajectories); and the training/eval orchestration to run all of it on remote GPUs. Tooling I reach for: PyTorch, DeepSpeed / Megatron, vLLM / SGLang, ms-swift, and a lot of CUDA.
About this blog. The name is a small joke: a mixture of experts, rerouted toward whatever I'm currently nerd-sniped by. The writing is long-form field notes that try to carry their own weight: the actual loss function, not a hand-wave at it; the exact API or attestation field, not "the system handles it." Among them are a post-training series that derives why GRPO and DPO work; an OpenVINO series that rebuilds the CUDA serving stack from the bandwidth math up; and an Android series that traces, channel by channel, how an app detects a modified OS and where userspace finally hits silicon. Different layers, one habit: take the thing apart until you know exactly why it behaves the way it does.
Prove the failure before you prescribe the fix; know which wall you're standing at before you spend a week pushing on it.
Hiring for LLM post-training / agents, building something at this layer, or just want to compare notes? Reach me on GitHub, LinkedIn, or by email.