Switching to Secondary Is Faster

A practical workflow that routes boilerplate and spec generation to a smaller model, reserving the flagship for review and novel problems.

Wayne Lau · 2 min read

Remember, switching to your pistol is always faster than reloading.

The same idea applies to LLM workflows.

Most of the time, you don’t need a flagship model to scaffold a project. Boilerplate, spec drafts, and initial plans are all tasks where a smaller model can do the heavy lifting. Then you pass the result to a larger model for review.

Why this works #

Prefill (processing the prompt) happens in roughly a single forward pass, setting aside optimizations like chunked prefill and sequence parallelism. Decoding is different: each output token costs one more forward pass, since fundamentally the next token is just model.forward(). So for long outputs, generation speed dominates the wall-clock time, and that's where a small model pays off.
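
As a toy illustration of that asymmetry (the `forward` function here is a deterministic stand-in, not a real model), counting forward passes makes the cost split explicit:

```python
# Toy sketch: prefill handles the whole prompt in one forward pass,
# while decode pays one forward pass per generated token.

def forward(tokens):
    """Stand-in for model.forward(): returns a toy 'next token'."""
    return sum(tokens) % 100  # hypothetical placeholder logic

def generate(prompt_tokens, n_new):
    passes = 1                  # prefill: the entire prompt in a single pass
    tokens = list(prompt_tokens)
    for _ in range(n_new):      # decode: one forward pass per output token
        tokens.append(forward(tokens))
        passes += 1
    return tokens, passes

out, passes = generate([1, 2, 3], n_new=5)
# 1 prefill pass + 5 decode passes = 6 total
```

The forward-pass count grows linearly with output length, which is why output throughput, not prompt size, is the thing worth optimizing.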

Say your initial prompt is 16k tokens (a rough ballpark for a Claude Code session) and you need to generate another 16k tokens of output (tool calls, reads, and edits included). If your large model generates at 50 tokens/s while a small model can easily hit 200 tokens/s, that's 320 seconds versus 80 seconds for the same 16k tokens.
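
The back-of-envelope arithmetic, using the throughput figures above:

```python
output_tokens = 16_000

large_tps = 50   # tokens/sec, flagship model (ballpark from the text)
small_tps = 200  # tokens/sec, smaller model

large_time = output_tokens / large_tps  # 320 seconds
small_time = output_tokens / small_tps  # 80 seconds
speedup = large_time / small_time       # 4x faster on generation alone
```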

The concept is the same as speculative decoding. Modern decoders use a small draft model to propose several tokens at once, which the large model then verifies in a single parallel pass. Using a secondary model for the first pass is speculative decoding scaled up to 16k tokens, with the large model's review playing the role of verification.

The workflow #

Here’s what I’ve been doing:

  1. Plan — either with a small model for speed, or directly with a large model for precision. The large model is more accurate but spends more tokens on the planning stage.
  2. Review — pass the plan to a large model, fix what’s wrong.
  3. Generate code — small model implements from the refined spec.
  4. Review again — catch what the small model missed.
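
The four steps above can be sketched as a simple pipeline. The `call_small`/`call_large` helpers are hypothetical placeholders for whichever small and large models you actually wire in:

```python
# Sketch of the plan -> review -> generate -> review workflow.
# call_small / call_large are stand-ins, not a real API.

def call_small(prompt):
    return f"[small] {prompt}"   # placeholder for a fast local model

def call_large(prompt):
    return f"[large] {prompt}"   # placeholder for a flagship model

def build(task):
    plan = call_small(f"Draft a plan for: {task}")          # 1. plan (fast)
    plan = call_large(f"Review and fix this plan: {plan}")  # 2. review
    code = call_small(f"Implement this spec: {plan}")       # 3. generate code
    return call_large(f"Review this code: {code}")          # 4. review again
```

The large model appears only at the two review gates, so it reads far more tokens than it writes, which is exactly the cheap side of the trade.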

For the small model, I’ve been using Qwen 3.6 35b MoE. It’s fast enough to run locally and produces reasonable boilerplate. The large model then acts as a reviewer rather than a first-pass generator.

This hasn’t been tested on novel codebases. For truly new problems, I write the code myself and use the small model for repetitive tasks like generating tests and boilerplate.