
Sakana AI's KAME architecture lets voice models think while they speak

The Tokyo lab pairs Moshi with an async LLM and claims a big MT-Bench jump without sacrificing low-latency responses.

Oliver Senti, Senior AI Editor
April 30, 2026 · 4 min read
[Illustration: two parallel AI systems running in tandem, one fast and one deliberate]

Sakana AI just released KAME, a tandem architecture that pairs a fast speech-to-speech model with a slower backend LLM running in parallel. Posted to arXiv in September by the Tokyo lab, the work has been accepted at ICASSP 2026. The headline number: Moshi's MT-Bench score jumps from 2.05 to 6.43, with no change in median latency.

The latency problem they're going after

Voice AI has been stuck on the same tradeoff for a while. End-to-end speech-to-speech models like the one described in the original Moshi paper answer in roughly 200ms, which feels like a real conversation. But ask Moshi anything that needs actual reasoning and it falls apart. (The original paper was upfront about this, and the demos make it clear within thirty seconds.)

The cascaded alternative does the opposite. Speech-to-text, then a real LLM, then text-to-speech. You get the LLM's full intelligence and a dead pause every time you finish speaking.
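For a sense of where the dead pause comes from, here's a toy latency model; the millisecond figures are illustrative guesses, not numbers from the paper.

```python
# Toy latency model for a cascaded voice pipeline (illustrative numbers,
# not from the paper): every stage waits for the previous one to finish.

def cascaded_response_latency(asr_ms=300, llm_first_token_ms=600, tts_ms=200):
    """Time from end of user speech to first audible audio, in milliseconds."""
    # Speech-to-text has to see the whole utterance, the LLM has to produce
    # its first tokens, and TTS has to synthesize the first chunk.
    # The delays add up serially.
    return asr_ms + llm_first_token_ms + tts_ms

def s2s_response_latency(first_audio_ms=200):
    """An end-to-end speech-to-speech model streams audio almost immediately."""
    return first_audio_ms

print(cascaded_response_latency())  # ~1100 ms of dead air
print(s2s_response_latency())       # ~200 ms, conversational
```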

KAME, which means turtle in Japanese, tries to do both at once. The front-end is the Moshi S2S model, which begins responding immediately. A backend LLM runs asynchronously on a slower cycle and injects what the authors call "oracle" signals into the front-end as the user keeps talking. As the partial transcript grows, the backend gets called repeatedly with progressively more context. The front-end is trained specifically to condition its output on those incoming signals.

Sakana's blog post frames it as "speak while thinking" rather than "think then speak." That's a useful way to describe it.
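Sakana doesn't publish pseudocode, but the description maps onto a fairly simple concurrent loop. A minimal sketch, where every name (s2s_step, backend_llm, oracle_queue) is hypothetical rather than KAME's actual API:

```python
# Minimal sketch of the tandem loop as described above. All component names
# are hypothetical placeholders, not KAME's real interfaces.
import asyncio

async def backend_thinker(transcript, oracle_queue, backend_llm):
    """Slow cycle: re-query the backend LLM as the partial transcript grows."""
    last_len = 0
    while True:
        await asyncio.sleep(0.5)                       # slower, asynchronous cadence
        if len(transcript.text) > last_len:            # more user context arrived
            last_len = len(transcript.text)
            hint = await backend_llm(transcript.text)  # the "oracle" text signal
            oracle_queue.put_nowait(hint)              # inject it toward the front-end

async def frontend_speaker(audio_in, audio_out, oracle_queue, s2s_step):
    """Fast cycle: the S2S model streams audio immediately, conditioned on
    whatever oracle signal has arrived so far (possibly none)."""
    latest_hint = None
    async for frame in audio_in:
        while not oracle_queue.empty():
            latest_hint = oracle_queue.get_nowait()    # newest backend guess wins
        audio_out.write(s2s_step(frame, oracle=latest_hint))
```

In a real system the two coroutines would be launched together (asyncio.gather) and the front-end would also produce the live transcript the backend reads; the point is only that the slow and fast cycles never block each other.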

The numbers (and what they're hiding)

According to the technical report, KAME goes from Moshi's 2.05 baseline to 6.43 on a speech-synthesized MT-Bench variant. Median latency stayed the same.

A 4-point MT-Bench gain looks dramatic, but the baseline matters. Moshi at 2.05 is genuinely bad at question-answering. Bolting GPT-4.1 onto anything would help. The more useful comparison is against full cascaded systems like Unmute, where the gap narrows considerably; the paper's Figure 1 shows KAME landing in roughly that quality band while keeping S2S-style response times. Whether the gain is "free" depends on what you do with it.

There's also a practical question: what does the front-end do with a hint that arrives while it's already generating? The paper handles it by training the front-end specifically to listen for the oracle signal. Without that retraining, you'd just have a Moshi that occasionally gets an unsolicited GPT memo it doesn't know what to do with. The retraining is the actual engineering work here, not the LLM call.
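What that conditioning looks like under the hood isn't spelled out in the blog post. One plausible shape, offered purely as an assumption for illustration:

```python
# Schematic of what "training the front-end to listen for the oracle" could
# look like. Every detail here is assumed, not taken from the paper.
from dataclasses import dataclass

@dataclass
class TandemExample:
    user_audio_tokens: list[int]    # what the user has said so far
    oracle_text_tokens: list[int]   # the backend LLM's injected hint (may be empty)
    target_audio_tokens: list[int]  # the response the front-end should produce

def build_input(example: TandemExample, sep_token: int = 0) -> list[int]:
    # One plausible scheme: splice the oracle hint into the front-end's
    # context as an extra token stream, so the S2S model learns to carry
    # its content into speech. Loss would be taken on the target only.
    return (example.user_audio_tokens
            + [sep_token]
            + example.oracle_text_tokens
            + [sep_token]
            + example.target_audio_tokens)
```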

Different brain for different tasks

What's mildly more interesting is the backend swap. KAME was trained primarily with GPT-4.1-nano as the backend, but you can hot-swap in Claude Opus 4.1 or Gemini 2.5 Flash without retraining the front-end. The paper says category-wise advantages shift depending on the choice.

In the posted demos, Claude scores 10 on a basic reasoning puzzle ("David has three sisters, each has one brother") where GPT scores 8. On a humanities prompt about evaluating arguments, though, Claude's answer is noticeably worse, partly because the speech synthesis seems to mangle the output. GPT's humanities answer is cleaner. Gemini comes in last on both demos.

Fine.

The implication isn't that one backend wins, but that you can match the LLM to the task without rebuilding the speech stack. For an assistant deployed across coding help, customer service, and casual chat, that's actually useful.
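In deployment terms, backend choice becomes a routing decision made per turn or per product surface. A toy dispatcher, with the routing table invented for illustration (the model names mirror the ones Sakana tested):

```python
# Toy backend router: pick an LLM per task without touching the speech
# front-end. The routing logic is hypothetical; only the model names come
# from the demos discussed above.
BACKENDS = {
    "reasoning":  "claude-opus-4.1",   # stronger on the demo's logic puzzle
    "humanities": "gpt-4.1",           # cleaner answer in the posted demo
    "default":    "gpt-4.1-nano",      # what KAME was primarily trained with
}

def pick_backend(task_category: str) -> str:
    """Return the backend LLM to query for this conversation turn."""
    return BACKENDS.get(task_category, BACKENDS["default"])

# The front-end keeps streaming speech either way; only the source of the
# oracle signal changes.
print(pick_backend("reasoning"))   # claude-opus-4.1
```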

What's missing

The paper doesn't say much about how oracle injection handles disagreement. If the front-end has already started saying one thing and the backend's improved guess contradicts it mid-sentence, what happens? I read the abstract and skimmed the appendix; the answer isn't obvious to me. There's also no public latency-vs-quality curve for partial transcripts, which would tell you whether the backend's late guesses ever actually matter.

And the test set is MT-Bench rendered through TTS, not real human speech with hesitations and crosstalk. The Moshi front-end is built for full-duplex conversation. The benchmark isn't.

Inference code is on GitHub, the model weights are on Hugging Face, and there's a separate finetune repo for the training pipeline. ICASSP 2026 happens this spring; presumably the conference talk fills in some of the gaps.

Tags: Sakana AI, KAME, voice AI, speech-to-speech, Moshi, ICASSP 2026, conversational AI, real-time AI, LLM
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.


