S2S Eval Leaderboard

v0.6

2026-03-08 08:00 UTC — Stage 5 S2S ~0% (not meaningful), FDB v3 10K+15K running, V3 training at 15K steps

Summary
1. Transcription
2. Comprehension
3. Reasoning
4. Synthesis
5. Spoken Response
6. Conversation
Eval Status

Key aggregate scores across all stages. Higher is better (0–100). Sorted by most data available, then by score.

| Model | 1. Transcription | 2. Comprehension | 3. Reasoning | 4. Synthesis | 5. Spoken Resp. | 6. Conversation |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-3B | 91.6 | 49.5 | 51.9 | | | |
| Whisper-Turbo | 93.7 | | | | | |
| Whisper-LargeV3 | 93.2 | | | | | |
| Whisper-Medium | 91.1 | | | | | |
| Whisper-Small | 88.4 | | | | | |
| Whisper-Base | 80.5 | | | | | |
| Qwen2.5-7B-Instruct | | | 70.7 | | | |
| Phi-4-mini-instruct | | | 67.0 | | | |
| Qwen3.5-2B (new) | | | 59.6 (MMLU only) | | | |
| Qwen2.5-3B-Instruct | | | 58.6 | | | |
| Qwen3.5-0.8B (new) | | | 48.5 (MMLU only) | | | |
| Qwen2.5-1.5B-Instruct | | | 47.0 | | | |
| Qwen2.5-0.5B-Instruct | | | 32.5 | | | |
| tts-qwen3 | | | | 82.4 | | |
| tts-moss | | | | 80.8 | | |
| tts-xtts-v2 | | | | 71.4 | | |
| tts-kokoro | | | | 61.4 | | |
| PersonaPlex Paper | | | | | | 73.8 |
| Our PersonaPlex | | | | | | 68.2 |
| Moshi Base | | | | | | 56.5 |

Stage 1: Transcription (Speech → Text) — Word Error Rate on standard ASR benchmarks. Score = 100 − WER%. Higher is better.

| Model | Avg Score | LibriSpeech Clean | LibriSpeech Other | TED-LIUM3 | CommonVoice en | CommonVoice fr | FLEURS en | FLEURS fr | MLS fr | MLS es |
|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-Turbo | 93.7 | 98.1 | 96.0 | 96.0 | 87.6 | 93.8 | 92.7 | | 95.1 | 96.8 |
| Whisper-LargeV3 | 93.2 | 97.2 | 95.6 | 95.7 | 85.0 | 94.2 | 93.0 | | 93.6 | 95.9 |
| Qwen2.5-Omni-3B | 91.6 | 97.3 | 94.5 | 95.3 | 88.8 | 85.5 | 93.0 | 84.6 | 91.4 | 94.2 |
| Whisper-Medium | 91.1 | 97.3 | 93.1 | 95.2 | 80.9 | 93.6 | 89.8 | | 91.6 | 94.7 |
| Whisper-Small | 88.4 | 96.6 | 91.5 | 94.8 | 78.8 | 91.7 | 84.9 | | 87.0 | 92.6 |
| Whisper-Base | 80.5 | 94.8 | 87.4 | 93.3 | 72.4 | 61.6 | 88.1 | | 71.0 | 75.6 |
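The Score = 100 − WER% convention can be sanity-checked with a minimal word-level WER (an illustrative sketch only; the real pipeline uses standard ASR tooling and text normalization, which this omits):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate in percent: edit distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(r)][len(h)] / max(len(r), 1)

def asr_score(ref: str, hyp: str) -> float:
    """Leaderboard convention: Score = 100 - WER%."""
    return 100.0 - wer(ref, hyp)
```

One substitution in a four-word reference gives WER 25% and a score of 75.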

Stage 2: Comprehension (Speech → Reasoned Text) — Can it hear and think? Accuracy % on speech understanding tasks.

| Model | Overall Avg | MELD Emotion | MELD Sentiment | Web Questions | TriviaQA | VoiceBench MCQ |
|---|---|---|---|---|---|---|
| cascade-large-3b | 50.0 | 46.7 | 52.2 | 44.1 | 38.2 | 68.8 |
| Qwen2.5-Omni-3B | 49.5 | 53.5 | 46.3 | 39.4 | 33.9 | 74.3 |
| cascade-large-1.5b | 41.5 | 37.2 | 39.8 | 37.7 | 30.9 | 61.8 |
| sensevoice-cascade-1.5b | 25.7 | 24.2 | 46.1 | 13.6 | 12.3 | 32.1 |
| emotion2vec-cascade-large-1.5b | 22.5 | 13.2 | 31.9 | 3.9 | 7.1 | 56.5 |

Stage 3: Reasoning (Text → Text) — Ceiling performance on text-only benchmarks. Tests the LLM backbone independent of speech.

Qwen3.5-2B and 0.8B: MMLU complete; GSM8K and IFEval remain TODO in this table pending instruct variants (base-model GSM8K/IFEval numbers appear in the analysis below). Qwen2.5-3B-Instruct repro validated: MMLU 65.4% (ref 65.5%), IFEval 49.4% (ref 48.2%).

| Model | Params | MMLU (0-shot) | GSM8K (5-shot CoT) | IFEval (0-shot) | Avg Score |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7B | 71.8 | 83.6 | 56.7 | 70.7 |
| Phi-4-mini-instruct | 3.8B | 66.8 | 81.8 | 52.3 | 67.0 |
| Qwen2.5-3B-Instruct | 3B | 65.5 | 62.0 | 48.2 | 58.6 |
| Qwen2.5-Omni-3B | 3B | 65.3 | 59.4 | 31.1 | 51.9 |
| Qwen2.5-1.5B-Instruct | 1.5B | 60.3 | 52.7 | 27.9 | 47.0 |
| Qwen3.5-2B (new) | 2B | 59.6 | TODO | TODO | |
| Qwen3.5-0.8B (new) | 0.8B | 48.5 | TODO | TODO | |
| Qwen2.5-0.5B-Instruct | 0.5B | 45.7 | 31.5 | 20.3 | 32.5 |

Qwen3.5 Results Analysis

Qwen3.5-2B shows strong MMLU (59.6%) but very weak GSM8K (19.4%) and moderate IFEval (32.7%). Surprisingly, the smaller 0.8B model outperforms 2B on both GSM8K (34.3% vs 19.4%) and IFEval (35.1% vs 32.7%), suggesting the 2B Gated DeltaNet architecture specifically struggles with chain-of-thought generation despite better knowledge recall (MMLU). Both models were evaluated using AutoModelForCausalLM with float32 on A10G (v7 worker).

Qwen2.5-3B Repro

Repro complete: MMLU 65.4% (ref 65.5% — Δ0.1%), IFEval 49.4% (ref 48.2% — Δ1.2%). Both match closely. Infrastructure validated.

Stage 4: Synthesis (Text → Speech) — Can it speak? Seed TTS Eval benchmark.

Aggregate scores 0–100. WER Score = 100 − WER%. Acoustic = MOS × 20. Similarity = max(0, simo) × 100.
| Model | Overall | WER Score | Acoustic | Similarity | WER % | UTMOS | SIMO |
|---|---|---|---|---|---|---|---|
| tts-qwen3 | 82.4 | 98.2 | 78.5 | 70.6 | 1.76 | 4.16 | 0.706 |
| tts-moss | 80.8 | 97.8 | 75.8 | 68.6 | 2.16 | 3.97 | 0.686 |
| tts-xtts-v2 | 71.4 | 96.2 | 71.8 | 46.2 | 3.76 | 3.54 | 0.462 |
| tts-kokoro | 61.4 | 98.4 | 83.2 | 2.6 | 1.62 | 4.50 | 0.026 |
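The aggregation above can be sketched as follows. This assumes Overall is the unweighted mean of the three component scores, which matches every row of the table; `tts_scores` is an illustrative helper, not part of the eval code, and the MOS fed into Acoustic is taken here to be the reported UTMOS, which the table only approximately reflects:

```python
def tts_scores(wer_pct: float, mos: float, simo: float) -> dict:
    """Aggregate Seed TTS Eval metrics to 0-100 leaderboard scores."""
    wer_score = 100.0 - wer_pct            # WER Score = 100 - WER%
    acoustic = mos * 20.0                  # Acoustic = MOS x 20 (5-point MOS -> 0-100)
    similarity = max(0.0, simo) * 100.0    # Similarity = max(0, SIMO) x 100
    overall = (wer_score + acoustic + similarity) / 3.0  # assumed unweighted mean
    return {"overall": round(overall, 1), "wer_score": round(wer_score, 1),
            "acoustic": round(acoustic, 1), "similarity": round(similarity, 1)}
```

For example, WER 2.0%, MOS 4.0, and SIMO 0.5 give component scores 98.0 / 80.0 / 50.0 and an Overall of 76.0.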

Note on Kokoro

Kokoro has the best audio quality (UTMOS 4.50, lowest WER at 1.62%) but near-zero speaker similarity (0.026) because it doesn't support voice cloning; the Overall score penalizes this heavily. Similarity aside, Kokoro's acoustic quality is best-in-class.

Stage 5: Spoken Response (Speech → Reasoned Speech) — Can it understand speech and respond intelligently?

S2S models (Moshi, PersonaPlex) score ~0% on VoiceBench/BBA — they generate conversational audio, not instruction-following text answers. These benchmarks are only meaningful for cascade (ASR→LLM) architectures. FDB (Stage 6) is the correct benchmark for S2S model comparison.
| Model | VoiceBench (auto) Overall % | Big Bench Audio | VoiceBench MCQ | VoiceBench SD-QA | VoiceBench BBH |
|---|---|---|---|---|---|
| Cascade (Whisper+Qwen3.5-2B) | 37.2 | 37.9 | 30.1 | 38.1 | 40.7 |
| Cascade (Whisper+Qwen3.5-0.8B) | | 34.1 | | | |
| Moshi Base | ~0 | | 0.0 | 0.4 | 0.0 |
| PersonaPlex | ~0 | | 0.0 | 0.4 | 0.0 |

Why S2S models score ~0%

Full-duplex S2S models (Moshi, PersonaPlex) are conversational — they generate streaming audio responses like "Thank you." regardless of the question. They cannot answer MCQ, QA, or instruction-following tasks. VoiceBench and BigBench Audio test instruction-following ability, which is a cascade (ASR→LLM) capability, not an S2S capability. FDB (Stage 6) is the correct benchmark for comparing S2S models.
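The cascade control flow can be sketched with the ASR and LLM stages injected as plain callables (a minimal illustration, not the eval harness; the real pipeline uses Whisper for ASR and a Qwen model for text generation):

```python
from typing import Callable

def cascade_answer(audio_path: str,
                   asr: Callable[[str], str],
                   llm: Callable[[str], str]) -> str:
    """Cascade (ASR -> LLM): transcribe speech, then answer in text.

    Instruction-following lives entirely in the LLM stage, which is why
    cascades can score on VoiceBench/BBA while full-duplex S2S models cannot.
    """
    transcript = asr(audio_path)
    prompt = f"Answer the question: {transcript}"
    return llm(prompt)
```

In the eval harness the returned text is then scored by the judge (Bedrock Claude Sonnet in our setup).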

Cascade Results

Cascade Whisper+Qwen3.5-2B: VoiceBench auto-scored 37.2% (MCQ 30.1%, SD-QA 38.1%, BBH 40.7%). BigBench Audio 37.9%. For reference: GPT-4o achieves 66% on BBA speech-to-speech, 92% text-only.

Stage 6: Conversation (Speech → Speech) — Full end-to-end duplex interaction quality via Full-Duplex-Bench (FDB v1).

↓ = lower is better. ↑ = higher is better. TOR = Turn Occurrence Rate. JSD = Jensen-Shannon Divergence. Latency in seconds.
| Model | Pause Synth (TOR ↓) | Pause Candor (TOR ↓) | Backchannel (JSD ↓) | Turn-Taking (TOR ↑) | Turn-Taking (Latency ↓) | Interruption (TOR ↑) | Interruption (Rating ↑) | Interruption (Latency ↓) |
|---|---|---|---|---|---|---|---|---|
| PersonaPlex Paper | 0.394 | 0.454 | 0.662 | 0.908 | 0.170 | 0.950 | 4.290 | 0.240 |
| Our PersonaPlex | 0.423 | 0.667 | 0.736 | 0.975 | 0.250 | 1.000 | 3.400 | 0.000 |
| Moshi Base | 1.000 | 1.000 | 0.701 | 1.000 | 0.000 | 1.000 | 1.690 | 0.009 |
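For reference, the two headline metrics can be sketched as follows. This is a simplified illustration of the definitions, not FDB's implementation: turn/event detection and histogram binning happen upstream, and JSD here uses log base 2 so it falls in [0, 1].

```python
import math

def turn_occurrence_rate(took_turn: list[bool]) -> float:
    """TOR: fraction of evaluation points where the model took a turn.
    Lower is better for pause handling (don't barge in during a pause);
    higher is better for smooth turn-taking and user interruptions."""
    return sum(took_turn) / len(took_turn)

def jensen_shannon_divergence(p: list[float], q: list[float]) -> float:
    """JSD between two discrete distributions (e.g. model vs. human
    backchannel timing histograms). 0 means identical distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give JSD 0; disjoint distributions give the maximum of 1 under log base 2.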

Full-Duplex-Bench (FDB v1)

725 test samples across 5 tasks. Our eval uses Whisper base.en for ASR (FDB uses Parakeet TDT) and Bedrock Claude Sonnet for user interruption rating (FDB uses GPT-4o).

Key Findings

v1 25K: great pause handling, poor content
Candor Pause TOR 0.097 vs PersonaPlex 0.454 — nearly 5× better at staying silent during pauses. But User Interrupt Rating 1.665 vs PP 4.290 — catastrophic forgetting from 100% automotive data.
v3 10K: curriculum fix (eval running)
V3 training uses curriculum data mix (general + automotive) to fix the content quality gap. FDB eval running on i-04a771ebcfa7c0458.
Moshi Base baseline
Extreme pause handling failure (TOR 1.000) but fast latency. Responds to everything immediately without pause awareness.

Model Key

PersonaPlex Paper
arXiv:2602.06053 Table 6 (GPT-4o judge, Parakeet ASR)
Our PersonaPlex
Our reproduction with HF weights, Whisper ASR, Claude Sonnet judge
v1 25K
First training run, 25K steps, automotive-only data (unfused FP16)
v3 10K
Third training run, 10K steps, curriculum data mix (balanced)
v3 15K
Third training run, 15K steps, curriculum data mix (balanced)
v3 25K
Pending — training still running on p4d

Current evaluation pipeline status as of 2026-03-07 08:30 UTC.

| Task | Instance | Region | State | Notes |
|---|---|---|---|---|
| Stage 5 S2S (all models) | terminated | us-west-2 | DONE (~0%) | All 4 S2S models score ~0% on VoiceBench. S2S models can't do instruction-following QA. FDB is the correct benchmark. |
| V3 Training | trues2s-p4d-v3-train-r2 | us-west-2 | RUNNING | p4d.24xlarge, 25K steps curriculum. At step 10K+. ETA ~Saturday March 8. |
| Reasoning v7 (Qwen3.5) | terminated | us-east-1 | DONE | 2B: MMLU 59.6%, GSM8K 19.4%, IFEval 32.7%, Avg 37.2%. 0.8B: MMLU 48.5%, GSM8K 34.3%, IFEval 35.1%, Avg 39.3%. |
| BigBench Audio v5 | i-01fb70abe4331971d | us-east-1 | DONE | Cascade 2B: 37.9%, Cascade 0.8B: 34.1%. 1000 samples, Bedrock Claude Sonnet judge. |
| VoiceBench v6 | terminated | us-east-1 | DONE (raw) | Cascade Qwen3.5-2B raw JSONL outputs on S3 (11 datasets). Needs post-processing for accuracy scores. |
| Qwen2.5-3B Repro | terminated | us-east-1 | DONE | MMLU 65.4% (ref 65.5%), IFEval 49.4% (ref 48.2%). Infrastructure validated. |
| ASR v3 (Whisper) | terminated | us-west-2 | DONE (4/9) | LibriSpeech clean/other + FLEURS en/fr. TED-LIUM/CommonVoice/MLS blocked. |
| FDB v1 (Stage 6) | | | DONE | PersonaPlex + Moshi Base × 725 samples. |
| Reasoning (Qwen2.5) | terminated | | DONE | All Qwen2.5 models (0.5B-7B) + Phi-4-mini on MMLU+GSM8K+IFEval. |