S2S Eval Leaderboard

v0.6

2026-03-08 08:00 UTC — Stage 5 S2S ~0% (not meaningful), FDB v3 10K+15K running, V3 training at 15K steps

Summary
1. Transcription
2. Comprehension
3. Reasoning
4. Synthesis
5. Spoken Response
6. Conversation
Eval Status

Key aggregate scores across all stages. Higher is better (0–100). Sorted by most data available, then by score.

| Model | 1. Transcription | 2. Comprehension | 3. Reasoning | 4. Synthesis | 5. Spoken Resp. | 6. Conversation |
|---|---|---|---|---|---|---|
| Qwen2.5-Omni-3B | 91.6 | 49.5 | 51.9 | | | |
| Whisper-Turbo | 93.7 | | | | | |
| Whisper-LargeV3 | 93.2 | | | | | |
| Whisper-Medium | 91.1 | | | | | |
| Whisper-Small | 88.4 | | | | | |
| Whisper-Base | 80.5 | | | | | |
| Qwen2.5-7B-Instruct | | | 70.7 | | | |
| Phi-4-mini-instruct | | | 67.0 | | | |
| Qwen3.5-2B (new) | | | 59.6 (MMLU only) | | | |
| Qwen2.5-3B-Instruct | | | 58.6 | | | |
| Qwen3.5-0.8B (new) | | | 48.5 (MMLU only) | | | |
| Qwen2.5-1.5B-Instruct | | | 47.0 | | | |
| Qwen2.5-0.5B-Instruct | | | 32.5 | | | |
| tts-qwen3 | | | | 82.4 | | |
| tts-moss | | | | 80.8 | | |
| tts-xtts-v2 | | | | 71.4 | | |
| tts-kokoro | | | | 61.4 | | |
| PersonaPlex Paper | | | | | | 73.8 |
| Our PersonaPlex | | | | | | 68.2 |
| Moshi Base | | | | | | 56.5 |

Stage 1: Transcription (Speech → Text) — Word Error Rate on standard ASR benchmarks. Score = 100 − WER%. Higher is better.

| Model | Avg Score | LibriSpeech Clean | LibriSpeech Other | TED-LIUM3 | CommonVoice en | CommonVoice fr | FLEURS en | FLEURS fr | MLS fr | MLS es |
|---|---|---|---|---|---|---|---|---|---|---|
| Whisper-Turbo | 93.7 | 98.1 | 96.0 | 96.0 | 87.6 | 93.8 | 92.7 | | 95.1 | 96.8 |
| Whisper-LargeV3 | 93.2 | 97.2 | 95.6 | 95.7 | 85.0 | 94.2 | 93.0 | | 93.6 | 95.9 |
| Qwen2.5-Omni-3B | 91.6 | 97.3 | 94.5 | 95.3 | 88.8 | 85.5 | 93.0 | 84.6 | 91.4 | 94.2 |
| Whisper-Medium | 91.1 | 97.3 | 93.1 | 95.2 | 80.9 | 93.6 | 89.8 | | 91.6 | 94.7 |
| Whisper-Small | 88.4 | 96.6 | 91.5 | 94.8 | 78.8 | 91.7 | 84.9 | | 87.0 | 92.6 |
| Whisper-Base | 80.5 | 94.8 | 87.4 | 93.3 | 72.4 | 61.6 | 88.1 | | 71.0 | 75.6 |
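The Score = 100 − WER% convention can be sanity-checked with a minimal word-level WER (an illustrative sketch only; the real pipeline uses standard ASR tooling and text normalization, which this omits):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate in percent: edit distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(r)][len(h)] / max(len(r), 1)

def asr_score(ref: str, hyp: str) -> float:
    """Leaderboard convention: Score = 100 - WER%."""
    return 100.0 - wer(ref, hyp)
```

One substitution in a four-word reference gives WER 25% and a score of 75.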

Stage 2: Comprehension (Speech → Reasoned Text) — Can it hear and think? Accuracy % on speech understanding tasks.

| Model | Overall Avg | MELD Emotion | MELD Sentiment | Web Questions | TriviaQA | VoiceBench MCQ |
|---|---|---|---|---|---|---|
| cascade-large-3b | 50.0 | 46.7 | 52.2 | 44.1 | 38.2 | 68.8 |
| Qwen2.5-Omni-3B | 49.5 | 53.5 | 46.3 | 39.4 | 33.9 | 74.3 |
| cascade-large-1.5b | 41.5 | 37.2 | 39.8 | 37.7 | 30.9 | 61.8 |
| sensevoice-cascade-1.5b | 25.7 | 24.2 | 46.1 | 13.6 | 12.3 | 32.1 |
| emotion2vec-cascade-large-1.5b | 22.5 | 13.2 | 31.9 | 3.9 | 7.1 | 56.5 |

Stage 3: Reasoning (Text → Text) — Ceiling performance on text-only benchmarks. Tests the LLM backbone independent of speech.

Qwen3.5-2B and 0.8B: MMLU complete; GSM8K and IFEval remain TODO in this table pending instruct variants (base-model GSM8K/IFEval numbers appear in the analysis below). Qwen2.5-3B-Instruct repro validated: MMLU 65.4% (ref 65.5%), IFEval 49.4% (ref 48.2%).

| Model | Params | MMLU (0-shot) | GSM8K (5-shot CoT) | IFEval (0-shot) | Avg Score |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 7B | 71.8 | 83.6 | 56.7 | 70.7 |
| Phi-4-mini-instruct | 3.8B | 66.8 | 81.8 | 52.3 | 67.0 |
| Qwen2.5-3B-Instruct | 3B | 65.5 | 62.0 | 48.2 | 58.6 |
| Qwen2.5-Omni-3B | 3B | 65.3 | 59.4 | 31.1 | 51.9 |
| Qwen2.5-1.5B-Instruct | 1.5B | 60.3 | 52.7 | 27.9 | 47.0 |
| Qwen3.5-2B (new) | 2B | 59.6 | TODO | TODO | |
| Qwen3.5-0.8B (new) | 0.8B | 48.5 | TODO | TODO | |
| Qwen2.5-0.5B-Instruct | 0.5B | 45.7 | 31.5 | 20.3 | 32.5 |

Qwen3.5 Results Analysis

Qwen3.5-2B shows strong MMLU (59.6%) but very weak GSM8K (19.4%) and moderate IFEval (32.7%). Surprisingly, the smaller 0.8B model outperforms 2B on both GSM8K (34.3% vs 19.4%) and IFEval (35.1% vs 32.7%), suggesting the 2B Gated DeltaNet architecture specifically struggles with chain-of-thought generation despite better knowledge recall (MMLU). Both models were evaluated using AutoModelForCausalLM with float32 on A10G (v7 worker).

Qwen2.5-3B Repro

Repro complete: MMLU 65.4% (ref 65.5% — Δ0.1%), IFEval 49.4% (ref 48.2% — Δ1.2%). Both match closely. Infrastructure validated.

Stage 4: Synthesis (Text → Speech) — Can it speak? Seed TTS Eval benchmark.

Aggregate scores 0–100. WER Score = 100 − WER%. Acoustic = MOS × 20. Similarity = max(0, simo) × 100.
| Model | Overall | WER Score | Acoustic | Similarity | WER % | UTMOS | SIMO |
|---|---|---|---|---|---|---|---|
| tts-qwen3 | 82.4 | 98.2 | 78.5 | 70.6 | 1.76 | 4.16 | 0.706 |
| tts-moss | 80.8 | 97.8 | 75.8 | 68.6 | 2.16 | 3.97 | 0.686 |
| tts-xtts-v2 | 71.4 | 96.2 | 71.8 | 46.2 | 3.76 | 3.54 | 0.462 |
| tts-kokoro | 61.4 | 98.4 | 83.2 | 2.6 | 1.62 | 4.50 | 0.026 |
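The aggregation above can be sketched as follows. This assumes Overall is the unweighted mean of the three component scores, which matches every row of the table; `tts_scores` is an illustrative helper, not part of the eval code, and the MOS fed into Acoustic is taken here to be the reported UTMOS, which the table only approximately reflects:

```python
def tts_scores(wer_pct: float, mos: float, simo: float) -> dict:
    """Aggregate Seed TTS Eval metrics to 0-100 leaderboard scores."""
    wer_score = 100.0 - wer_pct            # WER Score = 100 - WER%
    acoustic = mos * 20.0                  # Acoustic = MOS x 20 (5-point MOS -> 0-100)
    similarity = max(0.0, simo) * 100.0    # Similarity = max(0, SIMO) x 100
    overall = (wer_score + acoustic + similarity) / 3.0  # assumed unweighted mean
    return {"overall": round(overall, 1), "wer_score": round(wer_score, 1),
            "acoustic": round(acoustic, 1), "similarity": round(similarity, 1)}
```

For example, WER 2.0%, MOS 4.0, and SIMO 0.5 give component scores 98.0 / 80.0 / 50.0 and an Overall of 76.0.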

Note on Kokoro

Kokoro has the best audio quality (UTMOS 4.50, lowest WER at 1.62%) but near-zero speaker similarity (0.026) because it doesn't support voice cloning; the Overall score penalizes this heavily. Similarity aside, Kokoro's acoustic quality is best-in-class.

Stage 5: Spoken Response (Speech → Reasoned Speech) — Can it understand speech and respond intelligently?

S2S models (Moshi, PersonaPlex) score ~0% on VoiceBench/BBA — they generate conversational audio, not instruction-following text answers. These benchmarks are only meaningful for cascade (ASR→LLM) architectures. FDB (Stage 6) is the correct benchmark for S2S model comparison.
| Model | VoiceBench (auto) Overall % | Big Bench Audio | VoiceBench MCQ | VoiceBench SD-QA | VoiceBench BBH |
|---|---|---|---|---|---|
| Cascade (Whisper+Qwen3.5-2B) | 37.2 | 37.9 | 30.1 | 38.1 | 40.7 |
| Cascade (Whisper+Qwen3.5-0.8B) | | 34.1 | | | |
| Moshi Base | ~0 | | 0.0 | 0.4 | 0.0 |
| PersonaPlex | ~0 | | 0.0 | 0.4 | 0.0 |

Why S2S models score ~0%

Full-duplex S2S models (Moshi, PersonaPlex) are conversational — they generate streaming audio responses like "Thank you." regardless of the question. They cannot answer MCQ, QA, or instruction-following tasks. VoiceBench and BigBench Audio test instruction-following ability, which is a cascade (ASR→LLM) capability, not an S2S capability. FDB (Stage 6) is the correct benchmark for comparing S2S models.
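The cascade control flow can be sketched with the ASR and LLM stages injected as plain callables (a minimal illustration, not the eval harness; the real pipeline uses Whisper for ASR and a Qwen model for text generation):

```python
from typing import Callable

def cascade_answer(audio_path: str,
                   asr: Callable[[str], str],
                   llm: Callable[[str], str]) -> str:
    """Cascade (ASR -> LLM): transcribe speech, then answer in text.

    Instruction-following lives entirely in the LLM stage, which is why
    cascades can score on VoiceBench/BBA while full-duplex S2S models cannot.
    """
    transcript = asr(audio_path)
    prompt = f"Answer the question: {transcript}"
    return llm(prompt)
```

In the eval harness the returned text is then scored by the judge (Bedrock Claude Sonnet in our setup).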

Cascade Results

Cascade Whisper+Qwen3.5-2B: VoiceBench auto-scored 37.2% (MCQ 30.1%, SD-QA 38.1%, BBH 40.7%). BigBench Audio 37.9%. For reference: GPT-4o achieves 66% on BBA speech-to-speech, 92% text-only.

Stage 6: Conversation (Speech → Speech) — Full end-to-end duplex interaction quality via Full-Duplex-Bench (FDB v1).

↓ = lower is better. ↑ = higher is better. TOR = Turn Occurrence Rate. JSD = Jensen-Shannon Divergence. Latency in seconds.
| Model | Pause Synth (TOR ↓) | Pause Candor (TOR ↓) | Backchannel (JSD ↓) | Turn-Taking (TOR ↑) | Turn-Taking (Latency ↓) | Interruption (TOR ↑) | Interruption (Rating ↑) | Interruption (Latency ↓) |
|---|---|---|---|---|---|---|---|---|
| PersonaPlex Paper | 0.394 | 0.454 | 0.662 | 0.908 | 0.170 | 0.950 | 4.290 | 0.240 |
| Our PersonaPlex | 0.423 | 0.667 | 0.736 | 0.975 | 0.250 | 1.000 | 3.400 | 0.000 |
| Moshi Base | 1.000 | 1.000 | 0.701 | 1.000 | 0.000 | 1.000 | 1.690 | 0.009 |
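For reference, the two headline metrics can be sketched as follows. This is a simplified illustration of the definitions, not FDB's implementation: turn/event detection and histogram binning happen upstream, and JSD here uses log base 2 so it falls in [0, 1].

```python
import math

def turn_occurrence_rate(took_turn: list[bool]) -> float:
    """TOR: fraction of evaluation points where the model took a turn.
    Lower is better for pause handling (don't barge in during a pause);
    higher is better for smooth turn-taking and user interruptions."""
    return sum(took_turn) / len(took_turn)

def jensen_shannon_divergence(p: list[float], q: list[float]) -> float:
    """JSD between two discrete distributions (e.g. model vs. human
    backchannel timing histograms). 0 means identical distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give JSD 0; disjoint distributions give the maximum of 1 under log base 2.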

Full-Duplex-Bench (FDB v1)

725 test samples across 5 tasks. Our eval uses Whisper base.en for ASR (FDB uses Parakeet TDT) and Bedrock Claude Sonnet for user interruption rating (FDB uses GPT-4o).

Key Findings

v1 25K: great pause handling, poor content
Candor Pause TOR 0.097 vs PersonaPlex 0.454 — nearly 5× better at staying silent during pauses. But User Interrupt Rating 1.665 vs PP 4.290 — catastrophic forgetting from 100% automotive data.
v3 10K: curriculum fix (eval running)
V3 training uses curriculum data mix (general + automotive) to fix the content quality gap. FDB eval running on i-04a771ebcfa7c0458.
Moshi Base baseline
Extreme pause handling failure (TOR 1.000) but fast latency. Responds to everything immediately without pause awareness.

Model Key

PersonaPlex Paper
arXiv:2602.06053 Table 6 (GPT-4o judge, Parakeet ASR)
Our PersonaPlex
Our reproduction with HF weights, Whisper ASR, Claude Sonnet judge
v1 25K
First training run, 25K steps, automotive-only data (unfused FP16)
v3 10K
Third training run, 10K steps, curriculum data mix (balanced)
v3 15K
Third training run, 15K steps, curriculum data mix (balanced)
v3 25K
Pending — training still running on p4d

Current evaluation pipeline status as of 2026-03-07 08:30 UTC.

| Task | Instance | Region | State | Notes |
|---|---|---|---|---|
| Stage 5 S2S (all models) | terminated | us-west-2 | DONE (~0%) | All 4 S2S models score ~0% on VoiceBench. S2S models can't do instruction-following QA. FDB is the correct benchmark. |
| V3 Training | trues2s-p4d-v3-train-r2 | us-west-2 | RUNNING | p4d.24xlarge, 25K steps curriculum. At step 10K+. ETA ~Saturday March 8. |
| Reasoning v7 (Qwen3.5) | terminated | us-east-1 | DONE | 2B: MMLU 59.6%, GSM8K 19.4%, IFEval 32.7%, Avg 37.2%. 0.8B: MMLU 48.5%, GSM8K 34.3%, IFEval 35.1%, Avg 39.3%. |
| BigBench Audio v5 | i-01fb70abe4331971d | us-east-1 | DONE | Cascade 2B: 37.9%, Cascade 0.8B: 34.1%. 1000 samples, Bedrock Claude Sonnet judge. |
| VoiceBench v6 | terminated | us-east-1 | DONE (raw) | Cascade Qwen3.5-2B raw JSONL outputs on S3 (11 datasets). Needs post-processing for accuracy scores. |
| Qwen2.5-3B Repro | terminated | us-east-1 | DONE | MMLU 65.4% (ref 65.5%), IFEval 49.4% (ref 48.2%). Infrastructure validated. |
| ASR v3 (Whisper) | terminated | us-west-2 | DONE (4/9) | LibriSpeech clean/other + FLEURS en/fr. TED-LIUM/CommonVoice/MLS blocked. |
| FDB v1 (Stage 6) | | | DONE | PersonaPlex + Moshi Base × 725 samples. |
| Reasoning (Qwen2.5) | terminated | | DONE | All Qwen2.5 models (0.5B-7B) + Phi-4-mini on MMLU+GSM8K+IFEval. |