Benchmark Claims vs Routed Reality: Measuring LLM Latency via OpenRouter
Abstract
Public latency benchmarks for large language models (LLMs) are typically measured under idealized, provider-native conditions. In real applications, however, developers often access models through routing layers that introduce additional constraints such as shared inference pools, cold starts, regional routing, and rate prioritization.
This study evaluates time-to-first-token (TTFT) and end-to-end latency of multiple state-of-the-art LLMs when accessed via OpenRouter, and compares the observed results with widely cited internet benchmarks. The findings show that while relative performance rankings remain directionally consistent, absolute latency claims frequently diverge—sometimes by multiples—once models are routed through a real production aggregation layer.
1. Introduction
As LLM adoption shifts from experimentation to production, latency—not raw capability—has become the dominant UX constraint for chatbots, sales agents, and conversational systems.
Most published benchmarks assume:
- Direct access to provider APIs
- Dedicated or priority GPUs
- Warm model instances
- Provider-native regions (e.g., GCP for Gemini, Azure for GPT)
This paper asks a simple but underexplored question:
How do “fast” LLMs behave when accessed the way developers actually use them?
2. Methodology
2.1 Execution Layer
All tests were executed through OpenRouter, not direct provider APIs.
Implications:
- Shared, multi-tenant inference
- Router-level authentication & policy checks
- Cross-region routing
- No guaranteed warm starts
- Free-tier and paid-tier coexistence
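A routed call under these conditions can be sketched as follows. This is a minimal illustration, assuming OpenRouter's OpenAI-compatible chat-completions endpoint; the model slug and prompt are placeholders, not confirmed identifiers from this study.

```python
# Minimal sketch of a routed, streamed, single-request call via OpenRouter.
# The endpoint is OpenRouter's OpenAI-compatible chat-completions API;
# the model slug below is illustrative and may differ in practice.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Payload matching the methodology: identical prompt structure,
    streaming enabled, one request per call (non-batched)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,    # streaming is required to observe TTFT
        "max_tokens": 64,  # short response, per the methodology
    }

payload = build_request("mistralai/ministral-3b", "Reply with one short sentence.")
```

Sending this payload (with an `Authorization: Bearer <key>` header) yields a server-sent-event stream whose first chunk marks the TTFT measurement point.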
2.2 Metrics
Two latency metrics were captured:
- TTFT (Time to First Token) — the dominant factor in perceived responsiveness
- Total Latency — time to completion for a short response

Test conditions:
- Identical prompt structure across all models
- Streaming enabled
- Single-request (non-batched) calls
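Both metrics can be captured with a small harness that consumes a token stream and timestamps the first non-empty chunk. The sketch below is illustrative, not the study's actual instrumentation; `fake_stream` stands in for a real streaming response so the logic is self-contained.

```python
import time

def measure_latency(stream):
    """Consume a token stream; return (ttft_ms, total_ms).

    `stream` is any iterable yielding text chunks, e.g. the SSE
    deltas of a streaming chat-completion response.
    """
    start = time.monotonic()
    ttft_ms = None
    for chunk in stream:
        if ttft_ms is None and chunk:
            # First non-empty chunk: time to first token.
            ttft_ms = (time.monotonic() - start) * 1000
    total_ms = (time.monotonic() - start) * 1000
    return ttft_ms, total_ms

def fake_stream(first_delay_s, n_tokens, inter_token_s):
    """Simulated provider: a delay before the first token, then steady tokens."""
    time.sleep(first_delay_s)
    for _ in range(n_tokens):
        yield "tok"
        time.sleep(inter_token_s)

ttft, total = measure_latency(fake_stream(0.05, 5, 0.01))
```

The same `measure_latency` works unchanged whether the stream comes from a simulated provider or a live routed response.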
3. Empirical Results (via OpenRouter)
3.1 Measured Latency (Sorted by TTFT)
| Model | Provider | TTFT (ms) | Total (ms) |
| :--- | :--- | :--- | :--- |
| Ministral 3B (2512) | Mistral AI | 495 | 509 |
| Ministral 3B | Mistral AI | 1,119 | 1,128 |
| Ministral 8B | Mistral AI | 1,575 | 1,591 |
| Nemotron Nano 9B | NVIDIA | 2,268 | 2,593 |
| Trinity Mini | Arcee AI | 2,606 | 2,625 |
| Nemotron Nano 12B VL | NVIDIA | 2,625 | 2,781 |
| Nemotron Nano 30B | NVIDIA | 3,582 | 3,652 |
| Gemini 2.0 Flash | Google | ❌ | ❌ |
4. Comparison with Internet Benchmarks
4.1 Mistral (Ministral Series)
Benchmark narrative:
- Frequently cited TTFT: 200–400 ms
- Optimized for streaming and low-latency inference

Observed via OpenRouter:
- 3B (2512): 495 ms
- 8B: ~1.6 s
While absolute latency increased, relative leadership remained intact. Mistral’s architectural efficiency survives routing, cold starts, and shared GPU environments better than competitors.
4.2 NVIDIA Nemotron
Benchmark narrative:
- Marketed as “edge-ready” and “low-latency”
- Limited public TTFT disclosures

Observed via OpenRouter:
- 9B–12B: 2.2–2.8 s
- 30B: ~3.6 s
Nemotron models appear optimized for throughput and batch workloads, not interactive latency. Routing overhead exposes this mismatch clearly.
4.3 Arcee Trinity Mini
Benchmark narrative:
- Lightweight, fast inference model

Observed via OpenRouter:
- ~2.6 s TTFT (stable, but slow)
Performance is predictable but non-competitive for conversational UX.
4.4 Gemini Flash
Benchmark narrative:
- Often cited as one of the fastest proprietary models
- Sub-400 ms TTFT claims

Observed via OpenRouter:
- Hard failure: requests did not complete
Gemini Flash’s latency claims appear highly infrastructure-dependent. Outside Google-native environments, reliability and accessibility degrade sharply.
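The gap between narrative and observation can be expressed as a simple divergence factor: measured TTFT divided by the top of the claimed range. The figures below are the ones cited in this section; the factor is a rough illustration of "divergence by multiples," not a rigorous estimate.

```python
# Divergence factor: measured TTFT vs. the upper end of the claimed range.
# Claimed figures are the narrative numbers cited above; measured values
# come from the OpenRouter results table in Section 3.
def divergence_factor(measured_ms: float, claimed_high_ms: float) -> float:
    return measured_ms / claimed_high_ms

ministral_3b = divergence_factor(495, 400)    # ~1.2x — close to the claimed range
ministral_8b = divergence_factor(1575, 400)   # ~3.9x — divergence "by multiples"
```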
5. Why OpenRouter Changes the Results
OpenRouter introduces realities most benchmarks ignore:
| Benchmark Assumption | OpenRouter Reality |
| :--- | :--- |
| Dedicated GPU | Shared pool |
| Warm instance | Cold start |
| Single provider | Multi-provider routing |
| Native region | Global |
| Priority traffic | Fair-share scheduling |
In these measurements, such factors added roughly 500–2,000 ms to TTFT, revealing which models are genuinely latency-robust.
6. Key Findings
- Latency leadership is architectural, not marketing: Small, dense models dominate when routing is involved.
- OpenRouter amplifies real differences: Models that rely on ideal infrastructure degrade fastest.
- TTFT matters more than tokens/sec: Throughput cannot compensate for slow first-token response.
- “Fastest model” claims are context-fragile: Gemini Flash exemplifies this.
7. Practical Recommendations
Model selection when using OpenRouter:
| Use Case | Recommendation |
| :--- | :--- |
| Real-time chat / sales agents | Ministral 3B (2512) |
| Messaging bots | Ministral 3B |
| Heavier reasoning | Ministral 8B |
| Batch / offline | Nemotron, Arcee |
| Gemini via router | Avoid |
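The table above can be encoded as a small selection helper for routing-aware applications. This is a sketch: the keys are illustrative use-case labels, and the values are the human-readable model names from this paper, not API slugs.

```python
# Recommended model per use case, mirroring the table above.
# Gemini via router is deliberately absent ("Avoid").
RECOMMENDATIONS = {
    "real_time_chat": "Ministral 3B (2512)",
    "messaging_bot": "Ministral 3B",
    "heavier_reasoning": "Ministral 8B",
    "batch_offline": "Nemotron / Arcee",
}

def pick_model(use_case: str) -> str:
    """Return the recommended model, defaulting to the fastest measured option."""
    return RECOMMENDATIONS.get(use_case, "Ministral 3B (2512)")
```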
8. Conclusion
When evaluated through OpenRouter, real-world LLM latency diverges significantly from published benchmarks. While benchmark charts remain useful for relative comparisons, only empirical routed testing reveals production-grade behavior. Developers building latency-sensitive systems should treat routing layers as first-class citizens in performance evaluation—not afterthoughts.