1/25/2026 · 8 min read
Infrastructure

Benchmark Claims vs Routed Reality: Measuring LLM Latency via OpenRouter

Public latency benchmarks for LLMs often diverge from real-world performance. This study measures TTFT and end-to-end latency via OpenRouter, revealing the true cost of production routing.


Abstract

Public latency benchmarks for large language models (LLMs) are typically measured under idealized, provider-native conditions. In real applications, however, developers often access models through routing layers that introduce additional constraints such as shared inference pools, cold starts, regional routing, and rate prioritization.
This study evaluates time-to-first-token (TTFT) and end-to-end latency of multiple state-of-the-art LLMs when accessed via OpenRouter, and compares the observed results with widely cited internet benchmarks. The findings show that while relative performance rankings remain directionally consistent, absolute latency claims frequently diverge—sometimes by multiples—once models are routed through a real production aggregation layer.

1. Introduction

As LLM adoption shifts from experimentation to production, latency—not raw capability—has become the dominant UX constraint for chatbots, sales agents, and conversational systems.

Most published benchmarks assume:

  • Direct access to provider APIs

  • Dedicated or priority GPUs

  • Warm model instances

  • Provider-native regions (e.g., GCP for Gemini, Azure for GPT)

In contrast, many real systems—including this study—use OpenRouter, which aggregates multiple providers behind a single API and dynamically routes traffic.

This paper asks a simple but underexplored question:

How do “fast” LLMs behave when accessed the way developers actually use them?

2. Methodology

2.1 Execution Layer

All tests were executed through OpenRouter, not direct provider APIs.

Implications:

  • Shared, multi-tenant inference

  • Router-level authentication & policy checks

  • Cross-region routing

  • No guaranteed warm starts

  • Free-tier and paid-tier coexistence

This setup intentionally mirrors real application conditions, not lab benchmarks.
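As a concrete illustration, a single routed request against OpenRouter's OpenAI-compatible chat completions endpoint can be built as in the sketch below. The model slug and parameter values are illustrative, not necessarily the exact ones used in this study:

```python
import json

# OpenRouter exposes an OpenAI-compatible chat completions endpoint.
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(model: str, prompt: str) -> dict:
    """Build a single-request, streaming chat payload (no batching)."""
    return {
        "model": model,  # illustrative slug, e.g. "mistralai/ministral-3b"
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,      # streaming enabled, as in this study
        "max_tokens": 64,    # short response keeps total latency comparable
    }

payload = build_request("mistralai/ministral-3b", "Say hello in one sentence.")
body = json.dumps(payload)  # POST this with an Authorization: Bearer header
```

The same payload shape works for every model in Section 3; only the slug changes, which is what makes cross-model latency comparisons fair.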

2.2 Metrics

Two latency metrics were captured:

  • TTFT (Time to First Token) — dominant factor in perceived responsiveness

  • Total Latency — time to completion for a short response

All models were tested using:

  • Identical prompt structure

  • Streaming enabled

  • Single-request (non-batched) calls
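Both metrics can be captured with a simple timing wrapper around the token stream. The sketch below substitutes a simulated stream for a live SSE response (the timings in `fake_stream` are hypothetical); the measurement logic is identical either way:

```python
import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable[str]) -> Tuple[float, float]:
    """Return (TTFT, total latency) in milliseconds for a token iterator.

    TTFT is stamped when the first token arrives; total latency when the
    stream is exhausted.
    """
    start = time.perf_counter()
    ttft_ms = 0.0
    first_seen = False
    for _ in token_stream:
        if not first_seen:
            ttft_ms = (time.perf_counter() - start) * 1000.0
            first_seen = True
    total_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, total_ms

def fake_stream():
    """Stand-in for a live streaming response (hypothetical timings)."""
    time.sleep(0.05)   # pretend the first token takes ~50 ms
    yield "Hello"
    time.sleep(0.01)
    yield " world"

ttft, total = measure_latency(fake_stream())
```

Measuring TTFT at the first yielded chunk, rather than at HTTP response headers, is what makes the metric reflect perceived responsiveness rather than connection setup.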

3. Empirical Results (via OpenRouter)

3.1 Measured Latency (Sorted by TTFT)

| Model | Provider | TTFT (ms) | Total (ms) |
| :--- | :--- | :--- | :--- |
| Ministral 3B (2512) | Mistral AI | 495 | 509 |
| Ministral 3B | Mistral AI | 1,119 | 1,128 |
| Ministral 8B | Mistral AI | 1,575 | 1,591 |
| Nemotron Nano 9B | NVIDIA | 2,268 | 2,593 |
| Trinity Mini | Arcee AI | 2,606 | 2,625 |
| Nemotron Nano 12B VL | NVIDIA | 2,625 | 2,781 |
| Nemotron Nano 30B | NVIDIA | 3,582 | 3,652 |
| Gemini 2.0 Flash | Google | ❌ | ❌ |

4. Comparison with Internet Benchmarks

4.1 Mistral (Ministral Series)

Benchmark narrative

  • Frequently cited TTFT: 200–400 ms

  • Optimized for streaming and low-latency inference

Observed via OpenRouter

  • 3B (2512): 495 ms

  • 8B: ~1.6 s

Interpretation

While absolute latency increased, relative leadership remained intact. Mistral’s architectural efficiency survives routing, cold starts, and shared GPU environments better than competitors.

4.2 NVIDIA Nemotron

Benchmark narrative

  • Marketed as “edge-ready” and “low-latency”

  • Limited public TTFT disclosures

Observed via OpenRouter

  • 9B–12B: 2.2–2.8 s

  • 30B: ~3.6 s

Interpretation

Nemotron models appear optimized for throughput and batch workloads, not interactive latency. Routing overhead exposes this mismatch clearly.

4.3 Arcee Trinity Mini

Benchmark narrative

  • Lightweight, fast inference model

Observed via OpenRouter

  • ~2.6 s TTFT (stable, but slow)

Interpretation

Performance is predictable but non-competitive for conversational UX.

4.4 Gemini Flash

Benchmark narrative

  • Often cited as one of the fastest proprietary models

  • Sub-400 ms TTFT claims

Observed via OpenRouter

  • Hard failure

Interpretation

Gemini Flash’s latency claims appear highly infrastructure-dependent. Outside Google-native environments, reliability and accessibility degrade sharply.

5. Why OpenRouter Changes the Results

OpenRouter introduces realities most benchmarks ignore:

| Benchmark Assumption | OpenRouter Reality |
| :--- | :--- |
| Dedicated GPU | Shared pool |
| Warm instance | Cold start |
| Single provider | Multi-provider routing |
| Native region | Global |
| Priority traffic | Fair-share scheduling |

These factors consistently add 500–2000 ms to TTFT, revealing which models are genuinely latency-robust.

6. Key Findings

  • Latency leadership is architectural, not marketing: Small, dense models dominate when routing is involved.

  • OpenRouter amplifies real differences: Models that rely on ideal infrastructure degrade fastest.

  • TTFT matters more than tokens/sec: Throughput cannot compensate for slow first-token response.

  • “Fastest model” claims are context-fragile: Gemini Flash exemplifies this.

7. Practical Recommendations

Model selection when using OpenRouter:

| Use Case | Recommendation |
| :--- | :--- |
| Real-time chat / sales agents | Ministral 3B (2512) |
| Messaging bots | Ministral 3B |
| Heavier reasoning | Ministral 8B |
| Batch / offline | Nemotron, Arcee |
| Gemini via router | Avoid |
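These preferences can be encoded directly in the request. OpenRouter accepts a `models` list of fallbacks alongside the primary `model`; the ordering below mirrors the TTFT ranking measured in Section 3 (a sketch with illustrative slugs — verify exact slugs against OpenRouter's model catalog):

```python
def chat_payload(prompt: str) -> dict:
    """Streaming payload for real-time chat, ordered by measured TTFT."""
    return {
        "model": "mistralai/ministral-3b",   # fastest measured TTFT
        "models": [                          # fallbacks, fastest first
            "mistralai/ministral-3b",
            "mistralai/ministral-8b",
        ],
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

payload = chat_payload("Summarize this in one sentence.")
```

Listing fallbacks in latency order keeps worst-case TTFT bounded even when the primary model's pool is cold or saturated.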

8. Conclusion

When evaluated through OpenRouter, real-world LLM latency diverges significantly from published benchmarks. While benchmark charts remain useful for relative comparisons, only empirical routed testing reveals production-grade behavior. Developers building latency-sensitive systems should treat routing layers as first-class citizens in performance evaluation—not afterthoughts.
