1/25/2026 · 8 min read
Infrastructure

Benchmark Claims vs Routed Reality: Measuring LLM Latency via OpenRouter

Public latency benchmarks for LLMs often diverge from real-world performance. This study measures TTFT and end-to-end latency via OpenRouter, revealing the true cost of production routing.


Abstract

Public latency benchmarks for large language models (LLMs) are typically measured under idealized, provider-native conditions. In real applications, however, developers often access models through routing layers that introduce additional constraints such as shared inference pools, cold starts, regional routing, and rate prioritization.
This study evaluates time-to-first-token (TTFT) and end-to-end latency of multiple state-of-the-art LLMs when accessed via OpenRouter, and compares the observed results with widely cited internet benchmarks. The findings show that while relative performance rankings remain directionally consistent, absolute latency claims frequently diverge—sometimes by multiples—once models are routed through a real production aggregation layer.

1. Introduction

As LLM adoption shifts from experimentation to production, latency—not raw capability—has become the dominant UX constraint for chatbots, sales agents, and conversational systems.

Most published benchmarks assume:

  • Direct access to provider APIs

  • Dedicated or priority GPUs

  • Warm model instances

  • Provider-native regions (e.g., GCP for Gemini, Azure for GPT)

In contrast, many real systems—including this study—use OpenRouter, which aggregates multiple providers behind a single API and dynamically routes traffic.

This paper asks a simple but underexplored question:

How do “fast” LLMs behave when accessed the way developers actually use them?

2. Methodology

2.1 Execution Layer

All tests were executed through OpenRouter, not direct provider APIs.

Implications:

  • Shared, multi-tenant inference

  • Router-level authentication & policy checks

  • Cross-region routing

  • No guaranteed warm starts

  • Free-tier and paid-tier coexistence

This setup intentionally mirrors real application conditions, not lab benchmarks.
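As a concrete illustration, a single routed request against OpenRouter's OpenAI-compatible chat completions endpoint can be built as in the sketch below. The model slug and parameter values are illustrative, not necessarily the exact ones used in this study:

```python
import json

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Build a single-request, streaming chat payload (no batching)."""
    return {
        "model": model,  # illustrative slug, e.g. "mistralai/ministral-3b"
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,      # streaming enabled, as in this study
        "max_tokens": 64,    # short response keeps total latency comparable
    }

payload = build_request("mistralai/ministral-3b", "Say hello in one sentence.")
body = json.dumps(payload)  # POST this with an Authorization: Bearer header
```

The same payload shape works for every model in Section 3; only the slug changes, which is what makes cross-model latency comparisons fair.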

2.2 Metrics

Two latency metrics were captured:

  • TTFT (Time to First Token) — dominant factor in perceived responsiveness

  • Total Latency — time to completion for a short response

All models were tested using:

  • Identical prompt structure

  • Streaming enabled

  • Single-request (non-batched) calls
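Both metrics can be captured with a simple timing wrapper around the token stream. The sketch below substitutes a simulated stream for a live SSE response (the timings in `fake_stream` are hypothetical); the measurement logic is identical either way:

```python
import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable[str]) -> Tuple[float, float]:
    """Return (TTFT, total latency) in milliseconds for a token iterator.

    TTFT is stamped when the first token arrives; total latency when the
    stream is exhausted.
    """
    start = time.perf_counter()
    ttft_ms = 0.0
    first_seen = False
    for _ in token_stream:
        if not first_seen:
            ttft_ms = (time.perf_counter() - start) * 1000.0
            first_seen = True
    total_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, total_ms

def fake_stream():
    """Stand-in for a live streaming response (hypothetical timings)."""
    time.sleep(0.05)   # pretend the first token takes ~50 ms
    yield "Hello"
    time.sleep(0.01)
    yield " world"

ttft, total = measure_latency(fake_stream())
```

Measuring TTFT at the first yielded chunk, rather than at HTTP response headers, is what makes the metric reflect perceived responsiveness rather than connection setup.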

3. Empirical Results (via OpenRouter)

3.1 Measured Latency (Sorted by TTFT)

| Model | Provider | TTFT (ms) | Total (ms) |
| :--- | :--- | :--- | :--- |
| Ministral 3B (2512) | Mistral AI | 495 | 509 |
| Ministral 3B | Mistral AI | 1,119 | 1,128 |
| Ministral 8B | Mistral AI | 1,575 | 1,591 |
| Nemotron Nano 9B | NVIDIA | 2,268 | 2,593 |
| Trinity Mini | Arcee AI | 2,606 | 2,625 |
| Nemotron Nano 12B VL | NVIDIA | 2,625 | 2,781 |
| Nemotron Nano 30B | NVIDIA | 3,582 | 3,652 |
| Gemini 2.0 Flash | Google | ❌ | ❌ |

4. Comparison with Internet Benchmarks

4.1 Mistral (Ministral Series)

Benchmark narrative

  • Frequently cited TTFT: 200–400 ms

  • Optimized for streaming and low-latency inference

Observed via OpenRouter

  • 3B (2512): 495 ms

  • 8B: ~1.6 s

Interpretation

While absolute latency increased, relative leadership remained intact. Mistral’s architectural efficiency survives routing, cold starts, and shared GPU environments better than competitors.

4.2 NVIDIA Nemotron

Benchmark narrative

  • Marketed as “edge-ready” and “low-latency”

  • Limited public TTFT disclosures

Observed via OpenRouter

  • 9B–12B: 2.2–2.8 s

  • 30B: ~3.6 s

Interpretation

Nemotron models appear optimized for throughput and batch workloads, not interactive latency. Routing overhead exposes this mismatch clearly.

4.3 Arcee Trinity Mini

Benchmark narrative

  • Lightweight, fast inference model

Observed via OpenRouter

  • ~2.6 s TTFT (stable, but slow)

Interpretation

Performance is predictable but non-competitive for conversational UX.

4.4 Gemini Flash

Benchmark narrative

  • Often cited as one of the fastest proprietary models

  • Sub-400 ms TTFT claims

Observed via OpenRouter

  • Hard failure

Interpretation

Gemini Flash’s latency claims appear highly infrastructure-dependent. Outside Google-native environments, reliability and accessibility degrade sharply.

5. Why OpenRouter Changes the Results

OpenRouter introduces realities most benchmarks ignore:

| Benchmark Assumption | OpenRouter Reality |
| :--- | :--- |
| Dedicated GPU | Shared pool |
| Warm instance | Cold start |
| Single provider | Multi-provider routing |
| Native region | Global |
| Priority traffic | Fair-share scheduling |

These factors consistently add 500–2000 ms to TTFT, revealing which models are genuinely latency-robust.

6. Key Findings

  • Latency leadership is architectural, not marketing: Small, dense models dominate when routing is involved.

  • OpenRouter amplifies real differences: Models that rely on ideal infrastructure degrade fastest.

  • TTFT matters more than tokens/sec: Throughput cannot compensate for slow first-token response.

  • “Fastest model” claims are context-fragile: Gemini Flash exemplifies this.

7. Practical Recommendations

Model selection when using OpenRouter:

| Use Case | Recommendation |
| :--- | :--- |
| Real-time chat / sales agents | Ministral 3B (2512) |
| Messaging bots | Ministral 3B |
| Heavier reasoning | Ministral 8B |
| Batch / offline | Nemotron, Arcee |
| Gemini via router | Avoid |
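These preferences can be encoded directly in the request. OpenRouter accepts a `models` list of fallbacks alongside the primary `model`; the ordering below mirrors the TTFT ranking measured in Section 3 (a sketch with illustrative slugs — verify exact slugs against OpenRouter's model catalog):

```python
def chat_payload(prompt: str) -> dict:
    """Streaming payload for real-time chat, ordered by measured TTFT."""
    return {
        "model": "mistralai/ministral-3b",   # fastest measured TTFT
        "models": [                          # fallbacks, fastest first
            "mistralai/ministral-3b",
            "mistralai/ministral-8b",
        ],
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

payload = chat_payload("Summarize this in one sentence.")
```

Listing fallbacks in latency order keeps worst-case TTFT bounded even when the primary model's pool is cold or saturated.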

8. Conclusion

When evaluated through OpenRouter, real-world LLM latency diverges significantly from published benchmarks. While benchmark charts remain useful for relative comparisons, only empirical routed testing reveals production-grade behavior. Developers building latency-sensitive systems should treat routing layers as first-class citizens in performance evaluation—not afterthoughts.
