Keep the orchestrator dumb
There is a quiet assumption running underneath most agent architectures right now: that the path to capable systems is one model that does everything. Make it bigger, make it multimodal, bolt on tools, and eventually the single monolith reasons, perceives, plans, and acts. Mixture-of-experts is the industry's one concession to specialization, and even that is a routing trick, not a design.
I think this is backwards. The better structure is a slow, discrete, almost boring orchestrator sitting on top of fast, specialized submodels that each work in their own native regime — and you should resist every temptation to collapse the two into one.
The nervous system already made this bet. Your prefrontal cortex is a slow serial planner. It does not process the raw waveform hitting your ear or the photon stream landing on your retina. Sensory cortices do that, in parallel, in a temporal regime the prefrontal cortex could never operate in, and they hand up a compressed summary — "a voice, saying your name, to your left." The executive layer traffics in those summaries, not in the signal. Nobody would design a brain where the part planning your afternoon is also phase-locking to microsecond delays between your ears. Yet that is roughly what "one big model does it all" proposes.
The engineering case, without the biology
You do not need the neuroscience to make the argument. It stands on three unglamorous observations.
Transformers are bad at continuous signals. A frame-based model chops audio or sensor telemetry into 10–25 ms buckets and reasons over the buckets. Inside one of those windows, the temporal structure that matters (the sub-millisecond timing, the phase relationships) has typically been discarded by the feature extraction before the model ever sees it. This is not a scale problem you fix with more parameters. It is baked into the representation. A submodel built for continuous dynamics avoids the problem, because it never quantizes into frames in the first place.
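To make the framing point concrete, here is a minimal sketch. It assumes a 16 kHz sample rate, non-overlapping 10 ms frames, and a deliberately crude per-frame energy feature. Real front ends use overlapping windows and richer spectral features, so treat this as an illustration of the failure mode rather than a model of any particular system. The names `frame_energies` and `click` are made up for the example.

```python
FRAME_LEN = 160  # 10 ms at an assumed 16 kHz sample rate


def frame_energies(signal, frame_len=FRAME_LEN):
    """Sum of squares per non-overlapping frame (a deliberately crude feature)."""
    return [
        sum(x * x for x in signal[i:i + frame_len])
        for i in range(0, len(signal), frame_len)
    ]


def click(at, length=320):
    """A single unit impulse at sample index `at`, zeros elsewhere."""
    samples = [0.0] * length
    samples[at] = 1.0
    return samples


early = click(at=3)    # about 0.19 ms into the first frame
late = click(at=100)   # 6.25 ms into the same frame

# The raw signals differ, but the frame-level features do not.
signals_differ = early != late
features_match = frame_energies(early) == frame_energies(late)
print(signals_differ, features_match)
```

The two clicks are about six milliseconds apart, which is a large gap by auditory standards, yet anything downstream of this feature cannot tell them apart.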
Tool use works better than in-weights simulation. When an LLM needs an FFT, a matrix solve, or an exact arithmetic result, the reliable move is to have it write the code and run it, not to have it approximate the computation inside its weights. We already know this. It is the whole reason code-interpreter tool use exists. But notice what it implies: the model is best when it stays high-level and delegates the precise work to a system built for precise work. That is the hierarchy in miniature. We just usually stop at "call a Python interpreter" instead of asking what else should be delegated.
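A small sketch of that delegation, under stated assumptions: the expression string stands in for whatever a model would emit, and `exact_eval` is a hypothetical tool I wrote for this post, not part of any framework. It only handles `+`, `-`, `*`, and `/` over integer literals, and it computes with exact rationals instead of approximating.

```python
import ast
import operator
from fractions import Fraction

# Only these binary operators are allowed; everything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def exact_eval(expr: str) -> Fraction:
    """Evaluate +, -, *, / over integer literals exactly, as a Fraction."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr, mode="eval"))


# The "model" stays high-level: it decides what to compute, not how.
emitted_by_model = "1/3 + 1/6"
result = exact_eval(emitted_by_model)
print(result)
```

The orchestrating layer never needs to be good at arithmetic. It needs to know that arithmetic is what is called for, and who to hand it to.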
MoE is not specialization. In a mixture-of-experts model, a router sends tokens to sub-networks, but nobody tells those experts what to specialize in. Whatever division of labor emerges is a side effect of optimization, uninterpretable and unassignable. That is a fine efficiency hack. It is not the same as building a submodel you chose to be an audio system, or a spatial reasoner, or a temporal-dynamics engine, and can reason about, swap out, and improve independently. Functional specialization you designed is worth more than functional specialization you hope emerged.
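For readers who have not looked inside one, here is a stripped-down sketch of top-k routing. The router weights are made-up numbers, and real MoE layers add things this omits (learned gating inside a larger network, load balancing, capacity limits). The function name `route` is mine.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(token, router_weights, k=1):
    """Return the indices of the k highest-scoring experts, plus all gate probabilities."""
    scores = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k], probs


# One row of (hypothetical) router weights per expert.
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
chosen, probs = route([0.2, 0.9], router_weights)
print(chosen)
```

Notice what is missing: the experts are integer indices. Nothing in the code says that expert 1 is the audio expert or the geometry expert, and nothing could, because that assignment is whatever training happened to produce.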
Where this costs you
I would not trust a post that only listed the upsides, so here is the bill.
The orchestrator/submodel boundary is a real seam, and seams are lossy. Every time the perceptual layer hands a summary upward, information is discarded — that is the point, but it means the executive layer is reasoning over an abstraction it cannot see behind. Get the abstraction wrong and the whole system is confidently wrong. A monolith, whatever its faults, does not have this particular failure mode.
And the strongest objection is the end-to-end one. If you believe in a single differentiable system, hierarchy looks like a step backward: you have cut the gradient path, so the orchestrator can never learn to reshape what the submodels extract for it. The bitter-lesson crowd will say every hand-drawn boundary I am praising is a bias, and that a bigger model would learn a better division on its own. Historically, that crowd has won more arguments than it has lost. I do not think they are right here: continuous perception and discrete planning are different enough computational regimes that forcing them into one substrate is a mismatch, not a bias that scale will learn its way past. But the honest version of this thesis has to sit next to that objection rather than pretend it away.
The shape of the claim
So the claim is narrow and I want to keep it narrow. Not "decomposition is good" — everyone already does microservices and tool calls. The claim is about depth: the submodels should go deep in their own native regime rather than being forced into the orchestrator's discrete, synchronous, frame-based world. The orchestrator's job is to stay dumb — high-level, serial, symbolic — precisely so the parts underneath it can be brilliant at exactly one thing.
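As a rough sketch of the shape I mean, and nothing more: the two submodels below are trivial placeholders (a peak detector and a nearest-obstacle check) standing in for systems that would really be deep in their own regime, and every name here (`Summary`, `audio_submodel`, `range_submodel`, `orchestrate`) is invented for illustration. The part to look at is the orchestrator, which is serial, symbolic, and only ever reads summaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Summary:
    source: str        # which submodel produced it
    label: str         # compact symbolic description
    confidence: float  # 0.0 to 1.0


def audio_submodel(samples: Sequence[float]) -> Summary:
    """Placeholder for a continuous-time audio system (here: a peak detector)."""
    peak = min(max((abs(s) for s in samples), default=0.0), 1.0)
    return Summary("audio", "loud_event" if peak > 0.5 else "quiet", peak)


def range_submodel(distances_m: Sequence[float]) -> Summary:
    """Placeholder for a spatial reasoner (here: a nearest-obstacle check)."""
    nearest = min(distances_m, default=float("inf"))
    return Summary("range", "obstacle_near" if nearest < 1.0 else "clear", 1.0)


def orchestrate(
    raw_inputs: Dict[str, Sequence[float]],
    submodels: Dict[str, Callable[[Sequence[float]], Summary]],
) -> List[str]:
    """Serial and dumb on purpose: decisions depend only on Summary fields."""
    plan: List[str] = []
    for name in sorted(raw_inputs):  # fixed order, one step at a time
        # Raw data is handed to the submodel and never inspected here.
        summary = submodels[name](raw_inputs[name])
        if summary.label == "obstacle_near":
            plan.append("stop")
        elif summary.label == "loud_event" and summary.confidence > 0.8:
            plan.append("turn_toward_sound")
    return plan or ["continue"]


plan = orchestrate(
    {"audio": [0.0, 0.1, -0.9], "range": [4.2, 0.6, 3.0]},
    {"audio": audio_submodel, "range": range_submodel},
)
print(plan)
```

Either placeholder could be swapped for something far more capable without touching `orchestrate`, as long as it still returns a `Summary`. That is also exactly where the seam described earlier lives: if a label is wrong, the orchestrator has no way to look behind it.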
Build the smart parts to be narrow. Build the coordinating part to be simple. Keep them separate on purpose.