How to Benchmark LLM Inference Performance: TTFT, ITL, and Throughput Metrics

Practical guide to LLM performance benchmarking with accurate metrics. Covers TTFT, ITL, and throughput testing, and includes a tool comparison (vLLM, TensorRT-LLM) plus real examples using llmperf-rs.

Wayne Lau

  ·  8 min read

When deploying large language models to production, measuring performance accurately is critical. Whether you’re using vLLM, SGLang, TensorRT-LLM, or a custom inference stack, you need to understand:

  • Throughput testing: How many requests per second can your system handle?
  • Latency metrics: Time to First Token (TTFT), Inter-Token Latency (ITL), and end-to-end latency
  • Token generation speed: Tokens per second under different concurrency levels
  • Tail latency: P95 and P99 values that affect user experience

In this post, I’ll walk through the key metrics for benchmarking language models and share why I built llmperf-rs, a Rust-based benchmarking tool that takes a different approach to measuring these metrics.

The Problem with Existing Tools #

While working with ray-project/llmperf, which is now archived, I noticed that Inter-Token Latency (ITL) was calculated by averaging within each request first, then aggregating those averages. This approach works well for many use cases, but I needed to preserve individual latency spikes during testing.

Another option is genai-perf, which is a very in-depth tool. My only problem was that I could not run it on Ubuntu 22.04 and had to use a Docker container just to run it. Installing from source might have worked, but I didn't explore it further.

Edit: As of updating this article (15 Apr 2026), it seems that genai-perf has been sunset and replaced by aiperf.

I have not tried aiperf personally, but it definitely looks very comprehensive.

vllm-bench is also a good option, but it requires installing vllm.

The goal was to build a simple binary that runs almost anywhere (I have not figured out musl-linux yet) and does not require many dependencies. It was also a learning project.

Motivation #

The starting idea was the original LLMPerf, where I found an issue with how they aggregate ITLs. I also found that the time taken to spin up an environment and start a job was very long, because they had to spawn Ray workers. My goal for this project is simple: I want to start an LLM benchmark as fast as possible.

Metrics #

This is a summary of the metrics documentation.

Time To First Token (TTFT) #

TTFT measures how quickly the model begins responding after receiving your request. For interactive applications, this is the perceived latency before the user sees any output. It is also important for RAG-based applications, where a large chunk of the processing happens during the prefill stage.

$$ \text{TTFT} = \text{first\_token\_timestamp} - \text{request\_start\_timestamp} $$

Lower is better.
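If you want to measure this yourself outside of a benchmarking tool, a minimal sketch with the openai Python client against a streaming OpenAI-compatible endpoint looks like this (the base URL, API key, and model name are placeholders for your own deployment):

import time
from openai import OpenAI

# Placeholder endpoint and model; point these at your own deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=128,
    stream=True,
)

for chunk in stream:
    # The first streamed chunk that carries content marks the first token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
        break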

Inter-Token Latency (ITL) #

ITL is the time between consecutive tokens during generation. Spikes can result from multiple issues, most commonly network problems. ITL is usually consistent because of how the KV cache and the decode computation work.

When testing against vLLM, I noticed that high ITL spikes happen when you benchmark close to the context limit. I suspect this is due to vLLM preempting requests when they exceed the available KV cache space.

For example, if 3 requests each come in with $0.8x$ of context and need $0.2x$ for generation ($3x$ in total), but the GPU only has room for $2.8x$ of context, one of the requests will be preempted. Read more here: vLLM preemption

Aggregation: Concatenate ALL ITL values across all responses, then compute statistics:

// Request 1 ITL: [1, 2, 10]
// Request 2 ITL: [2, 3, 5]
// Combined: [1, 2, 10, 2, 3, 5]
// Now calculate percentiles, max, stddev on the combined array

Each response produces $(N-1)$ ITL values (where $N$ is the token count). By aggregating raw values instead of per-request averages, you preserve the true distribution including outliers.
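As a small illustration of why the aggregation order matters (this is a toy sketch, not llmperf-rs code), compare the two approaches on the values above:

import numpy as np

# Toy per-request ITL lists in milliseconds, matching the example above.
request_itls = [
    [1, 2, 10],  # Request 1 contains a 10 ms spike
    [2, 3, 5],   # Request 2
]

# Averaging per request first smooths the spike away.
per_request_means = [np.mean(itls) for itls in request_itls]  # [4.33, 3.33]
print("p99 over per-request means:", np.percentile(per_request_means, 99))

# Concatenating all raw values first preserves the spike in the tail.
combined = np.concatenate(request_itls)
print("p99 over combined values:", np.percentile(combined, 99))
print("max over combined values:", combined.max())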

Throughput Metrics #

Prefill Throughput (TPS) #

Tokens processed per second during the prefill phase (processing your input prompt):

$$ \text{Prefill TPS} = \frac{\text{input\_tokens}}{\text{TTFT}} $$

This is how fast the model processes your input.

However, prefill TPS doesn’t accurately reflect system performance. The issue is that TTFT includes queue wait time, not just actual processing time.

Request timeline:
┌─────────┬────────────────────┬─────────────────┐
│ Queuing │ Actual Prefill     │ Decode Phase    │
│ Wait    │ Processing         │                 │
└─────────┴────────────────────┴─────────────────┘
└─────────┬────────────────────┘
   TTFT (includes queuing wait)

When a server is under load, your request might sit in a queue waiting for resources. TTFT measures the time from when you sent the request to when you receive the first token—which includes this waiting period.

Example: 1000 input tokens

Scenario A (no queue wait):
  Queue: 0ms → Actual Prefill: 100ms
  TTFT = 100ms
  Prefill TPS = 1000 / 0.1s = 10,000 tokens/s

Scenario B (with queue wait):
  Queue: 200ms → Actual Prefill: 100ms
  TTFT = 300ms
  Prefill TPS = 1000 / 0.3s = 3,333 tokens/s

TTFT and prefill TPS look worse despite the same actual prefill time.

The actual processing speed (10,000 tokens/s) is identical in both scenarios. The lower prefill TPS in Scenario B reflects queue contention, not the system’s processing capability.

Decode Throughput (TPS) #

Tokens generated per second during the decode phase:

$$ \text{Decode TPS} = \frac{\text{output\_tokens}}{\text{final\_time} - \text{decode\_start\_time}} $$

This is the generation speed—how fast the model produces output.
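Put together, here is a rough sketch of how both throughput numbers fall out of client-side timestamps and token counts (the values are made up, and the first-token timestamp is used as a proxy for the decode start):

# Hypothetical client-side timestamps (seconds) for a single request.
request_start = 0.0      # POST sent
first_token_time = 0.3   # first streamed token received
final_token_time = 5.3   # last token received (finish_reason present)

input_tokens = 1000
output_tokens = 500

ttft = first_token_time - request_start
prefill_tps = input_tokens / ttft                                   # 1000 / 0.3 ≈ 3333 tok/s (includes queue wait)
decode_tps = output_tokens / (final_token_time - first_token_time)  # 500 / 5.0 = 100 tok/s

print(f"TTFT: {ttft:.2f} s, prefill TPS: {prefill_tps:.0f}, decode TPS: {decode_tps:.0f}")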

End-to-End Latency #

Total time from POST request to final token (when the response returns with finish_reason). This is the complete request duration including both prefill and decode phases.

What Matters Most #

For production serving, focus on TTFT, ITL statistics, and possibly requests per minute (RPM).

TTFT measures how quickly users see their first token—this is the perceived responsiveness of your system.

ITL statistics reveal decode-phase issues that throughput metrics hide. The 99th percentile and max ITL values expose:

  • Preemption events from KV cache limits
  • Network issues between components

ITL matters less for batch jobs or non-streaming APIs where users don’t watch tokens arrive in real-time.

Requests per minute (RPM) is the more traditional metric; it applies everywhere, not just to LLMs, and can be used for both interactive and non-interactive applications.

Token Counting #

Accurate metrics require accurate token counts. llmperf-rs handles this in two ways:

  1. API response: Most OpenAI-compatible endpoints return token counts in the usage field (prompt_tokens, completion_tokens, total_tokens). By default, llmperf-rs prioritizes these counts as long as they are available.

  2. Tokenizer (optional): For exact input counts, pass a HuggingFace tokenizer via --tokenizer <path>. Note that chat templates may still cause a variance of under 10 tokens (see the sketch after this list).
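Roughly, the two sources of token counts look like this in Python (the endpoint, model, and tokenizer names are placeholders, and the usage field may be missing on some servers):

from openai import OpenAI
from transformers import AutoTokenizer

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")
prompt = "Summarize the benefits of KV caching."

# 1. API-reported counts from the usage field.
resp = client.chat.completions.create(
    model="Qwen/Qwen3-4B-Instruct-2507",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64,
)
if resp.usage is not None:
    print("prompt_tokens:", resp.usage.prompt_tokens)
    print("completion_tokens:", resp.usage.completion_tokens)

# 2. Local tokenizer count; may differ slightly because the chat template adds tokens.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")
print("local prompt token count:", len(tokenizer.encode(prompt)))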

Per-Model Tokenizer Support #

The original llmperf uses a single tokenizer (hf-internal-testing/llama-tokenizer) for all models. Different models use different tokenizers, so llmperf-rs lets you specify the correct tokenizer or rely on API-reported counts for accuracy.

For example, Llama-2 has a vocabulary size of $32000$, while a more recent model like Qwen/Qwen3-4B-Instruct-2507 has $151936$. While many of these additional tokens are multilingual, some of them are English tokens.

This means that what tokenizes as two tokens, foo and bar, in Llama-2 might become a single token foo bar in Qwen. Qwen even has distinct tokens for \n\n\n and \n.
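You can see the difference directly by counting the same string with both tokenizers via transformers (the model IDs below are assumptions; the Llama-2 repo is gated on the Hub, so any Llama-2-compatible tokenizer will do):

from transformers import AutoTokenizer

text = "foo bar\n\n\nbaz"

llama_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
qwen_tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507")

# The same text usually produces fewer tokens with the larger Qwen vocabulary.
print("Llama-2 tokens:", len(llama_tok.encode(text, add_special_tokens=False)))
print("Qwen3 tokens:  ", len(qwen_tok.encode(text, add_special_tokens=False)))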

In my own testing, setting input tokens to $8192$ against a Qwen endpoint while using the default Llama tokenizer returned these values:

input_tokens: DetailedStats {
    quantiles_p25: 7363.0,
    quantiles_p50: 7365.0,
    quantiles_p75: 7372.083333333333,
    quantiles_p90: 7374.9,
    quantiles_p95: 7376.0,
    quantiles_p99: 7376.0,
    mean: 7367.4,
    median: 7365.0,
    stddev: 5.081557066709201,
    min: 7362,
    max: 7376,
},

Validating Your Results: Finish Reason #

All benchmark runs should end with finish_reason = length (meaning the model hit the max_tokens limit). If you see finish_reason = stop, the model stopped early. This skews metrics like RPM and E2E latency: more early stops produce higher RPM and lower latency because the responses are shorter.
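A quick sanity check before a long run is to send one request at your benchmark's max_tokens and inspect finish_reason (the endpoint and model are placeholders):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-key")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a long story about GPUs."}],
    max_tokens=256,
)

reason = resp.choices[0].finish_reason
# "length" means the model hit max_tokens; "stop" means it finished early,
# which inflates RPM and shortens E2E latency in the benchmark.
print("finish_reason:", reason)
assert reason == "length", f"expected 'length', got '{reason}'"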

Getting Started with llmperf-rs #

Installation #

Installing from releases is preferred—you get a single binary with no dependencies.

# Download the binary
# I suggest checking the releases page above for the latest version
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/wheynelau/llmperf-rs/releases/download/v0.5.0/llmperf-installer.sh | sh

# Or build from source
cargo install --git https://github.com/wheynelau/llmperf-rs

Running a Basic Benchmark #

# Set your endpoint
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=your-key

# Run a benchmark
llmperf --model meta-llama/Llama-3.1-8B-Instruct

What You Get #

The tool outputs:

  • Console summary: Human-readable metrics right in your terminal
  • JSON files: Percentiles (p50, p90, p95, p99), mean, min/max for TTFT, ITL, throughput
  • Compressed JSON: Per-response metrics for detailed analysis in Python/pandas

Example output:

Note: the values are usually floats; integers are shown for brevity.

"itl_ms_quantiles_p25": 11,
"itl_ms_quantiles_p50": 12,
"itl_ms_quantiles_p75": 13,
"itl_ms_quantiles_p90": 13,
"itl_ms_quantiles_p95": 14,
"itl_ms_quantiles_p99": 16,
"itl_ms_mean": 12,
"itl_ms_median": 12,
"itl_ms_stddev": 1,
"itl_ms_min": 0,
"itl_ms_max": 150, <- Observed a spike here

When to Use llmperf-rs #

Use llmperf-rs when:

  • You want to run benchmarks with minimal dependencies (single binary or Docker image)
  • You are testing OpenAI-compatible endpoints (works with vLLM, Ollama, local APIs)
  • You want low overhead (Rust performance, no Ray/ZMQ)
  • You are debugging latency spikes during production incidents

Consider alternatives when:

  • You need GPU-level metrics (use trtllm-bench or aiperf)
  • You are testing vLLM-specific features (use the vLLM benchmark)
  • You need extensive reporting dashboards (check out GuideLLM)
  • You require a Python-first workflow
  • You need distributed testing (aiperf supports this)
  • You need support for other workloads, like Hugging Face datasets or arbitrary JSON inputs

Why ITL Matters Even When Throughput Looks Good #

High throughput with bad ITL means tokens arrive in bursts, and chat users notice the choppy streaming. ITL spikes (p99 > 100 ms) often indicate preemption, network problems, or some other issue. It ultimately depends on the use case: for non-user-facing workloads such as agentic coding, throughput may matter more than ITL specifics.

Known Limitations #

  • I currently don’t have a way to test the prefix caching of endpoints
  • The first ITL is always a very small number; I have not figured out why

Wrapping Up #

When benchmarking LLM systems, different use cases require different approaches. llmperf-rs focuses on preserving raw ITL values to capture latency spikes, using API-reported token counts when available, and providing deterministic test inputs. My intention for llmperf-rs is to build something that is quick to start.


Check out the source code at github.com/wheynelau/llmperf-rs and see the full metrics documentation for more details. You can also open issues and contribute.