2026-03-08 08:00 UTC — Stage 5 S2S ~0% (not meaningful), FDB v3 10K+15K running, V3 training at 15K steps
Key aggregate scores across all stages. Higher is better (0–100). Sorted by most data available, then by score.
| Model | 1. Transcription | 2. Comprehension | 3. Reasoning | 4. Synthesis | 5. Spoken Resp. | 6. Conversation |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-3B | 91.6 | 49.5 | 51.9 | — | — | — |
| Whisper-Turbo | 93.7 | — | — | — | — | — |
| Whisper-LargeV3 | 93.2 | — | — | — | — | — |
| Whisper-Medium | 91.1 | — | — | — | — | — |
| Whisper-Small | 88.4 | — | — | — | — | — |
| Whisper-Base | 80.5 | — | — | — | — | — |
| Qwen2.5-7B-Instruct | — | — | 70.7 | — | — | — |
| Phi-4-mini-instruct | — | — | 67.0 | — | — | — |
| Qwen2.5-3B-Instruct | — | — | 58.6 | — | — | — |
| Qwen2.5-1.5B-Instruct | — | — | 47.0 | — | — | — |
| Qwen3.5-0.8Bnew | — | — | 39.3 | — | — | — |
| Qwen3.5-2Bnew | — | — | 37.2 | — | — | — |
| Qwen2.5-0.5B-Instruct | — | — | 32.5 | — | — | — |
| tts-qwen3 | — | — | — | 82.4 | — | — |
| tts-moss | — | — | — | 80.8 | — | — |
| tts-xtts-v2 | — | — | — | 71.4 | — | — |
| tts-kokoro | — | — | — | 61.4 | — | — |
| PersonaPlex Paper | — | — | — | — | — | 73.8 |
| Our PersonaPlex | — | — | — | — | — | 68.2 |
| Moshi Base | — | — | — | — | — | 56.5 |
Stage 1: Transcription (Speech → Text) — Word Error Rate on standard ASR benchmarks. Score = 100 − WER%. Higher is better.
| Model | Avg Score | LibriSpeech Clean | LibriSpeech Other | TED-LIUM3 en | CommonVoice en | CommonVoice fr | FLEURS en | FLEURS fr | MLS fr | MLS es |
|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-Turbo | 93.7 | 98.1 | 96.0 | 96.0 | 87.6 | — | 93.8 | 92.7 | 95.1 | 96.8 |
| Whisper-LargeV3 | 93.2 | 97.2 | 95.6 | 95.7 | 85.0 | — | 94.2 | 93.0 | 93.6 | 95.9 |
| Qwen2.5-Omni-3B | 91.6 | 97.3 | 94.5 | 95.3 | 88.8 | 85.5 | 93.0 | 84.6 | 91.4 | 94.2 |
| Whisper-Medium | 91.1 | 97.3 | 93.1 | 95.2 | 80.9 | — | 93.6 | 89.8 | 91.6 | 94.7 |
| Whisper-Small | 88.4 | 96.6 | 91.5 | 94.8 | 78.8 | — | 91.7 | 84.9 | 87.0 | 92.6 |
| Whisper-Base | 80.5 | 94.8 | 87.4 | 93.3 | 72.4 | 61.6 | 88.1 | 71.0 | 75.6 | — |
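The Stage 1 scoring rule (Score = 100 − WER%) can be sketched as follows. This is an illustrative implementation, not the eval harness itself; the actual pipeline presumably normalizes text (lowercasing, punctuation stripping) before scoring.

```python
# Minimal sketch of Stage 1 scoring: Score = 100 - WER%.
# WER = word-level edit distance (subs + ins + dels) / reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

def asr_score(reference: str, hypothesis: str) -> float:
    """Stage 1 score: 100 minus WER expressed as a percentage."""
    return 100.0 - 100.0 * word_error_rate(reference, hypothesis)

# One wrong word out of four -> WER 25% -> score 75.0
```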
Stage 2: Comprehension (Speech → Reasoned Text) — Can it hear and think? Accuracy % on speech understanding tasks.
| Model | Overall Avg | MELD Emotion | MELD Sentiment | Web Questions | TriviaQA | VoiceBench MCQ |
|---|---|---|---|---|---|---|
| cascade-large-3b | 50.0 | 46.7 | 52.2 | 44.1 | 38.2 | 68.8 |
| Qwen2.5-Omni-3B | 49.5 | 53.5 | 46.3 | 39.4 | 33.9 | 74.3 |
| cascade-large-1.5b | 41.5 | 37.2 | 39.8 | 37.7 | 30.9 | 61.8 |
| sensevoice-cascade-1.5b | 25.7 | 24.2 | 46.1 | 13.6 | 12.3 | 32.1 |
| emotion2vec-cascade-large-1.5b | 22.5 | 13.2 | 31.9 | 3.9 | 7.1 | 56.5 |
Stage 3: Reasoning (Text → Text) — Ceiling performance on text-only benchmarks. Tests the LLM backbone independent of speech.
| Model | Params | MMLU (0-shot) | GSM8K (5-shot CoT) | IFEval (0-shot) | Avg Score |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7B | 71.8 | 83.6 | 56.7 | 70.7 |
| Phi-4-mini-instruct | 3.8B | 66.8 | 81.8 | 52.3 | 67.0 |
| Qwen2.5-3B-Instruct | 3B | 65.5 | 62.0 | 48.2 | 58.6 |
| Qwen2.5-Omni-3B | 3B | 65.3 | 59.4 | 31.1 | 51.9 |
| Qwen2.5-1.5B-Instruct | 1.5B | 60.3 | 52.7 | 27.9 | 47.0 |
| Qwen3.5-0.8Bnew | 0.8B | 48.5 | 34.3 | 35.1 | 39.3 |
| Qwen3.5-2Bnew | 2B | 59.6 | 19.4 | 32.7 | 37.2 |
| Qwen2.5-0.5B-Instruct | 0.5B | 45.7 | 31.5 | 20.3 | 32.5 |
Qwen3.5-2B shows strong MMLU (59.6%) but very weak GSM8K (19.4%) and moderate IFEval (32.7%). Surprisingly, the smaller 0.8B model outperforms 2B on both GSM8K (34.3% vs 19.4%) and IFEval (35.1% vs 32.7%), suggesting the 2B Gated DeltaNet architecture specifically struggles with chain-of-thought generation despite better knowledge recall (MMLU). Both models were evaluated using AutoModelForCausalLM with float32 on A10G (v7 worker).
Repro complete: MMLU 65.4% (ref 65.5% — Δ0.1%), IFEval 49.4% (ref 48.2% — Δ1.2%). Both match closely. Infrastructure validated.
Stage 4: Synthesis (Text → Speech) — Can it speak? Seed TTS Eval benchmark.
| Model | Overall | WER Score | Acoustic | Similarity | WER % | UTMOS | SIMO |
|---|---|---|---|---|---|---|---|
| tts-qwen3 | 82.4 | 98.2 | 78.5 | 70.6 | 1.76 | 4.16 | 0.706 |
| tts-moss | 80.8 | 97.8 | 75.8 | 68.6 | 2.16 | 3.97 | 0.686 |
| tts-xtts-v2 | 71.4 | 96.2 | 71.8 | 46.2 | 3.76 | 3.54 | 0.462 |
| tts-kokoro | 61.4 | 98.4 | 83.2 | 2.6 | 1.62 | 4.50 | 0.026 |
Kokoro has the best audio quality (UTMOS 4.50, lowest WER) but near-zero speaker similarity (0.026) because it doesn't support voice cloning. The Overall score penalizes this heavily. Without similarity, Kokoro's acoustic quality is best-in-class.
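The Overall column is consistent with an unweighted mean of the three 0–100 subscores, where WER Score = 100 − WER% and Similarity = 100 × SIMO (the exact Acoustic mapping from UTMOS is not documented here, so it is taken as given). A minimal sketch under that assumption:

```python
# Sketch of the Stage 4 Overall score, assuming it is the unweighted
# mean of the three subscores in the table. The Acoustic subscore is
# passed through as-is because its mapping from UTMOS is not shown.

def overall_tts_score(wer_pct: float, acoustic: float, simo: float) -> float:
    wer_score = 100.0 - wer_pct   # e.g. WER 1.76% -> 98.2
    similarity = 100.0 * simo     # e.g. SIMO 0.706 -> 70.6
    return (wer_score + acoustic + similarity) / 3.0

# tts-qwen3: WER 1.76%, Acoustic 78.5, SIMO 0.706 -> ~82.4
# tts-kokoro: WER 1.62%, Acoustic 83.2, SIMO 0.026 -> ~61.4
```

This reproduces every Overall value in the table to within rounding, which is why Kokoro's near-zero similarity drags its Overall down despite top acoustic quality.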
Stage 5: Spoken Response (Speech → Reasoned Speech) — Can it understand speech and respond intelligently?
| Model | VoiceBench (auto) Overall % | Big Bench Audio | VoiceBench MCQ | VoiceBench SD-QA | VoiceBench BBH |
|---|---|---|---|---|---|
| Cascade (Whisper+Qwen3.5-2B) | 37.2 | 37.9 | 30.1 | 38.1 | 40.7 |
| Cascade (Whisper+Qwen3.5-0.8B) | — | 34.1 | — | — | — |
| Moshi Base | ~0 | — | 0.0 | 0.4 | 0.0 |
| PersonaPlex | ~0 | — | 0.0 | 0.4 | 0.0 |
Full-duplex S2S models (Moshi, PersonaPlex) are conversational — they generate streaming audio responses like "Thank you." regardless of the question. They cannot answer MCQ, QA, or instruction-following tasks. VoiceBench and BigBench Audio test instruction-following ability, which is a cascade (ASR→LLM) capability, not an S2S capability. FDB (Stage 6) is the correct benchmark for comparing S2S models.
Cascade Whisper+Qwen3.5-2B: VoiceBench auto-scored 37.2% (MCQ 30.1%, SD-QA 38.1%, BBH 40.7%). BigBench Audio 37.9%. For reference: GPT-4o achieves 66% on BBA speech-to-speech, 92% text-only.
Stage 6: Conversation (Speech → Speech) — Full end-to-end duplex interaction quality via Full-Duplex-Bench (FDB v1).
| Model | Pause TOR ↓ (Synth) | Pause TOR ↓ (Candor) | Backchannel JSD ↓ | Turn-Taking TOR ↑ | Turn-Taking Latency ↓ | Interruption TOR ↑ | Interruption Rating ↑ | Interruption Latency ↓ |
|---|---|---|---|---|---|---|---|---|
| PersonaPlex Paper | 0.394 | 0.454 | 0.662 | 0.908 | 0.170 | 0.950 | 4.290 | 0.240 |
| Our PersonaPlex | 0.423 | 0.667 | 0.736 | 0.975 | 0.250 | 1.000 | 3.400 | 0.000 |
| Moshi Base | 1.000 | 1.000 | 0.701 | 1.000 | 0.000 | 1.000 | 1.690 | 0.009 |
725 test samples across 5 tasks. Our eval uses Whisper base.en for ASR (FDB uses Parakeet TDT) and Bedrock Claude Sonnet for user interruption rating (FDB uses GPT-4o).
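The Backchannel JSD column is presumably Jensen–Shannon divergence between the model's and the reference backchannel timing distributions (lower = closer match); that interpretation is an assumption. A minimal stdlib sketch over binned timing histograms:

```python
import math

# Jensen-Shannon divergence between two discrete distributions,
# e.g. histograms of backchannel timings binned over a dialogue.
# Assumes both inputs are normalized (sum to 1) and equal length;
# natural-log units, so the maximum (disjoint support) is ln(2).

def jsd(p: list[float], q: list[float]) -> float:
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions -> 0.0; fully disjoint -> ln(2) ~ 0.693
```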
Current evaluation pipeline status as of 2026-03-07 08:30 UTC (instance i-04a771ebcfa7c0458).
| Task | Instance | Region | State | Notes |
|---|---|---|---|---|
| Stage 5 S2S (all models) | terminated | us-west-2 | DONE (~0%) | All 4 S2S models score ~0% on VoiceBench. S2S models can't do instruction-following QA. FDB is the correct benchmark. |
| V3 Training | trues2s-p4d-v3-train-r2 | us-west-2 | RUNNING | p4d.24xlarge, 25K steps curriculum. At step 10K+. ETA ~Saturday March 8. |
| Reasoning v7 (Qwen3.5) | terminated | us-east-1 | DONE | 2B: MMLU 59.6%, GSM8K 19.4%, IFEval 32.7%, Avg 37.2%. 0.8B: MMLU 48.5%, GSM8K 34.3%, IFEval 35.1%, Avg 39.3%. |
| BigBench Audio v5 | i-01fb70abe4331971d | us-east-1 | DONE | Cascade 2B: 37.9%, Cascade 0.8B: 34.1%. 1000 samples, Bedrock Claude Sonnet judge. |
| VoiceBench v6 | terminated | us-east-1 | DONE (raw) | Cascade Qwen3.5-2B raw JSONL outputs on S3 (11 datasets). Needs post-processing for accuracy scores. |
| Qwen2.5-3B Repro | terminated | us-east-1 | DONE | MMLU 65.4% (ref 65.5%), IFEval 49.4% (ref 48.2%). Infrastructure validated. |
| ASR v3 (Whisper) | terminated | us-west-2 | DONE (4/9) | LibriSpeech clean/other + FLEURS en/fr. TED-LIUM/CommonVoice/MLS blocked. |
| FDB v1 (Stage 6) | — | — | DONE | PersonaPlex + Moshi Base × 725 samples. |
| Reasoning (Qwen2.5) | terminated | — | DONE | All Qwen2.5 models (0.5B-7B) + Phi-4-mini on MMLU+GSM8K+IFEval. |