---
title: "Mixture of Insights"
description: "Deep technical notes on LLM post-training, RL, agents, and the systems underneath, by Wang Tong."
language: "en"
---

# Mixture of Insights

Deep technical notes on LLM post-training, RL, agents, and the systems underneath, by Wang Tong.

Language: [中文](/zh/index.md)

## Agent resources

- [API catalog](/.well-known/api-catalog)
- [OpenAPI description](/openapi.json)
- [Agent skills index](/.well-known/agent-skills/index.json)
- [MCP server card](/.well-known/mcp/server-card.json)
- [llms.txt](/llms.txt)

## Articles

- [A control plane for renting GPUs](https://mixtureofinsights.com/blog/a-control-plane-for-renting-gpus/) ([Markdown](https://mixtureofinsights.com/blog/a-control-plane-for-renting-gpus.md)): The orchestration mess around ephemeral, rented GPUs is the actual bottleneck of model iteration. Here is my bet with ORBIT: treat a run as a reproducible artifact, splitting control from execution.
- [A task-agnostic core, and plugins that earn their keep](https://mixtureofinsights.com/blog/orbit-a-task-agnostic-core/) ([Markdown](https://mixtureofinsights.com/blog/orbit-a-task-agnostic-core.md)): I designed ORBIT's execution core to be completely oblivious to the tasks it runs. By pushing task-specific logic up into plugins, I prevented new tasks from mutating and breaking the executor.
- [Auditing from the app's eyes](https://mixtureofinsights.com/blog/04-auditing-from-the-apps-eyes/) ([Markdown](https://mixtureofinsights.com/blog/04-auditing-from-the-apps-eyes.md)): You can't tell what a normal app sees from adb shell — shell has privileges an app never does. Three lenses for looking through the app's eyes, and the blind spot of each.
- [Cold-start, then climb](https://mixtureofinsights.com/blog/cold-start-then-climb/) ([Markdown](https://mixtureofinsights.com/blog/cold-start-then-climb.md)): Pure RL from a base model on a hard task mostly produces high-variance garbage — and the policy-gradient math says exactly why. The fix I use is a two-stage recipe: a small SFT cold-start to give the policy a shape, then GRPO to climb. The recipe, the math, and the failure modes that actually bite.
- [DPO when I can't afford RLHF](https://mixtureofinsights.com/blog/dpo-when-you-cant-afford-rlhf/) ([Markdown](https://mixtureofinsights.com/blog/dpo-when-you-cant-afford-rlhf.md)): RLHF is powerful and heavy — a reward model, an online rollout loop, instability. DPO gets most of the way with a fraction of the machinery. The derivation, the gradient that explains why I use it, and the catch — which is always the data.
- [How Qwen3-TTS makes a frame of sound](https://mixtureofinsights.com/blog/how-qwen3-tts-makes-a-frame/) ([Markdown](https://mixtureofinsights.com/blog/how-qwen3-tts-makes-a-frame.md)): A TTS model isn't one graph — it's a small pipeline of graphs with wildly different compute shapes. The key design move in porting Qwen3-TTS to OpenVINO is cutting it at the seams: a talker graph for long-context attention, a cached subcode graph for the rest of each multi-codebook frame, and a chunked streaming decoder.
- [Paged-KV, U8, and batching where vLLM isn't](https://mixtureofinsights.com/blog/paged-kv-batching-without-vllm/) ([Markdown](https://mixtureofinsights.com/blog/paged-kv-batching-without-vllm.md)): You have the model graphs. Now serve them — long-context, concurrent, inside an iGPU's memory budget, with none of vLLM's machinery. Four decisions that compose: paged-KV over fixed buckets, a U8 cache, full-context generation, and online batching that lives in the scheduler so one IR set serves everyone.
- [Post-training is a data problem](https://mixtureofinsights.com/blog/post-training-is-a-data-problem/) ([Markdown](https://mixtureofinsights.com/blog/post-training-is-a-data-problem.md)): PPO, GRPO, and DPO are commoditized. In my engineering iterations, the only variable that structurally improved alignment was the synthetic data engine.
- [Self-play, and the games my models teach themselves](https://mixtureofinsights.com/blog/self-play-and-the-games-models-teach-themselves/) ([Markdown](https://mixtureofinsights.com/blog/self-play-and-the-games-models-teach-themselves.md)): There's no dataset of good game-play. But in a game with a clear outcome, I manufacture one — I let a strong sampler play out games, filter by who won, and the transcripts become the strategy data. How the data engine, the verifier, and emergent strategy all meet, grounded in my GAME pipeline.
- [The bundle is the contract](https://mixtureofinsights.com/blog/orbit-the-bundle-is-the-contract/) ([Markdown](https://mixtureofinsights.com/blog/orbit-the-bundle-is-the-contract.md)): When a rented machine evaporates, the only evidence left is what you collected. I enforced a strict directory contract for bundles to ensure exact dependency provenance and runtime observability.
- [The logcat leak](https://mixtureofinsights.com/blog/03-the-logcat-leak/) ([Markdown](https://mixtureofinsights.com/blog/03-the-logcat-leak.md)): You hid the packages and the features. Then you notice fifteen apps quietly holding READ_LOGS — reading the whole device log, where every stray Magisk and lineage string is sitting in plain text.
- [What am I actually rewarding?](https://mixtureofinsights.com/blog/what-are-you-rewarding/) ([Markdown](https://mixtureofinsights.com/blog/what-are-you-rewarding.md)): RL doesn't optimize what I want — it optimizes exactly what I wrote down. The gap between the two is reward hacking, and closing it is most of the real work. Verifiers vs reward models, and how a constraint reward earned its +12%.
- [What you can and can't hide](https://mixtureofinsights.com/blog/05-what-you-can-and-cant-hide/) ([Markdown](https://mixtureofinsights.com/blog/05-what-you-can-and-cant-hide.md)): The full map of how a non-privileged app detects a rooted custom ROM, what closes each channel, and the two walls that nothing in userspace will move.
- [When the GPU isn't an NVIDIA](https://mixtureofinsights.com/blog/when-the-gpu-isnt-an-nvidia/) ([Markdown](https://mixtureofinsights.com/blog/when-the-gpu-isnt-an-nvidia.md)): The whole LLM stack assumes CUDA. The GPU in front of you is often an Intel iGPU or a CPU. Getting a real, low-latency autoregressive TTS to stream there means rebuilding the parts you usually pip-install — the decode loop, the KV cache, the batching scheduler — on OpenVINO.
- [StockMask: a stock illusion without touching a single app](https://mixtureofinsights.com/blog/02-stockmask/) ([Markdown](https://mixtureofinsights.com/blog/02-stockmask.md)): HideMyApplist hides package names. Apps still detected the custom ROM. The fix was a 200-line module that filters system_server responses by who's asking — never injecting into the app itself.
- [The Google Wallet Wall](https://mixtureofinsights.com/blog/01-the-google-wallet-wall/) ([Markdown](https://mixtureofinsights.com/blog/01-the-google-wallet-wall.md)): Play Integrity passes STRONG. Google Wallet still refuses to add a card. Here is why — proven, not guessed — and why it can't be forced onto an unlocked device.
- [Neovim: yank to the system clipboard (OSC 52)](https://mixtureofinsights.com/blog/nvim-yank-osc52/) ([Markdown](https://mixtureofinsights.com/blog/nvim-yank-osc52.md)): How I make Neovim's yank reach the system clipboard over SSH / WSL, using Neovim ≥ 0.10's native OSC 52 support.