Benchmarking LLM inference backends with Rust

Guide to LLM performance benchmarking with Rust, covering TTFT, ITL, throughput metrics, and llmperf-rs’s approach to metrics aggregation

Wayne Lau

  ·  6 min read

When evaluating LLM inference systems, accurate performance metrics are essential. In this post, I’ll walk through the key metrics for benchmarking language models, compare different approaches to aggregating them, and explain why I built llmperf-rs, a Rust-based benchmarking tool.

The Problem with Existing Tools #

While working with ray-project/llmperf, which is now archived, I noticed that Inter-Token Latency (ITL) was calculated by first averaging within each request and then aggregating those per-request averages. This approach works well for many use cases, but I needed to preserve individual latency spikes during testing.

Another option is genai-perf. I have no issue with the tool itself; it is very in-depth. My only problem was that I couldn’t run it on Ubuntu 22.04 and needed a Docker container just to run it. Installing from source might have worked, but I didn’t explore that further.

vllm-bench is also a good option, but it requires installing vllm.

The goal was to build a simple binary that runs almost anywhere (I have not figured out musl-linux yet) and does not pull in too many dependencies. It was also a learning project.

Different Approaches to ITL Aggregation #

Consider two requests with these ITL values (in milliseconds):

  • Request 1: [1, 2, 10] → average = 4.33
  • Request 2: [2, 3, 5] → average = 3.33

llmperf reports the max of these averages as 4.33 ms, while llmperf-rs captures the actual max of 10ms by aggregating raw values.
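
Here’s a minimal Rust sketch of the two strategies using the numbers above (illustrative only, not the actual llmperf or llmperf-rs code):

fn main() {
    // Per-request ITL values from the example above, in milliseconds.
    let requests: Vec<Vec<f64>> = vec![vec![1.0, 2.0, 10.0], vec![2.0, 3.0, 5.0]];

    // llmperf-style: average each request first, then take the max of the averages.
    let max_of_averages = requests
        .iter()
        .map(|r| r.iter().sum::<f64>() / r.len() as f64)
        .fold(f64::MIN, f64::max);

    // llmperf-rs-style: look at all raw values, then take the max.
    let max_of_raw = requests
        .iter()
        .flatten()
        .fold(f64::MIN, |acc, &v| acc.max(v));

    println!("max of per-request averages: {max_of_averages:.2} ms"); // 4.33 ms
    println!("max over raw values:         {max_of_raw:.2} ms");      // 10.00 ms
}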

Metrics #

This is a summary of the full metrics documentation.

Time To First Token (TTFT) #

TTFT measures how quickly the model begins responding after receiving your request. For interactive applications, this is the perceived latency before the user sees any output. It is also important for RAG-based applications where a large chunk of processing is happening at the prefill stage.

$$ \text{TTFT} = \text{first\_token\_timestamp} - \text{request\_start\_timestamp} $$

Lower is better.
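
As a rough sketch of where the clock starts and stops, you can time the gap between sending the request and receiving the first streamed chunk. The streaming client below is a hypothetical stand-in, not a real API:

use std::time::{Duration, Instant};

// Hypothetical helper: `send_request` stands in for whatever client sends the
// POST and returns an iterator of streamed chunks.
fn measure_ttft<I, F>(send_request: F) -> Option<Duration>
where
    I: Iterator<Item = String>,
    F: FnOnce() -> I,
{
    let start = Instant::now(); // request_start_timestamp
    let mut chunks = send_request(); // send the request, get the chunk stream
    // first_token_timestamp - request_start_timestamp
    chunks.next().map(|_first_chunk| start.elapsed())
}

fn main() {
    // Fake stream for demonstration only.
    let ttft = measure_ttft(|| vec!["Hello".to_string(), " world".to_string()].into_iter());
    println!("TTFT: {ttft:?}");
}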

Inter-Token Latency (ITL) #

ITL is the time between consecutive tokens during generation. Spikes can result from multiple issues, most commonly network problems. ITL is usually consistent because of how the KV cache and the decode computation work.

When testing against vLLM, I noticed that high ITL spikes happen when you attempt to benchmark close to the context limit. I suspect this is due to vLLM preempting requests when they no longer fit in the KV cache.

For example, if 3 requests come in with $0.8x$ context length each and $0.2x$ reserved for generation, they need $3 \times (0.8x + 0.2x) = 3.0x$ of context in total. If the GPU only has space for $2.8x$, one of the requests will be preempted. Read more here: vLLM preemption

Aggregation: Concatenate ALL ITL values across all responses, then compute statistics:

// Request 1 ITL: [1, 2, 10]
// Request 2 ITL: [2, 3, 5]
// Combined: [1, 2, 10, 2, 3, 5]
// Now calculate percentiles, max, stddev on the combined array

Each response produces $(N-1)$ ITL values (where $N$ is the token count). By aggregating raw values instead of per-request averages, you preserve the true distribution including outliers.
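
A small Rust sketch of this idea, deriving the gaps from per-token arrival timestamps and flattening them before computing stats (the percentile logic here is a simplified nearest-rank version, not the actual llmperf-rs implementation):

// Token arrival times for one response, in milliseconds since request start.
fn itl_values(token_timestamps: &[f64]) -> Vec<f64> {
    // N timestamps yield N-1 inter-token gaps.
    token_timestamps.windows(2).map(|w| w[1] - w[0]).collect()
}

fn main() {
    let responses = vec![
        vec![50.0, 51.0, 53.0, 63.0], // gaps: [1, 2, 10]
        vec![40.0, 42.0, 45.0, 50.0], // gaps: [2, 3, 5]
    ];

    // Concatenate ALL ITL values across responses before aggregating.
    let mut all_itl: Vec<f64> = responses.iter().flat_map(|r| itl_values(r)).collect();
    all_itl.sort_by(|a, b| a.partial_cmp(b).unwrap());

    let max = *all_itl.last().unwrap();
    // Naive nearest-rank p99 for illustration.
    let p99_idx = ((all_itl.len() as f64) * 0.99).ceil() as usize - 1;
    let p99 = all_itl[p99_idx.min(all_itl.len() - 1)];

    println!("combined ITLs: {:?}", all_itl);
    println!("max = {max} ms, p99 = {p99} ms");
}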

Throughput Metrics #

Prefill Throughput (TPS) #

Tokens processed per second during the prefill phase (processing your input prompt):

$$ \text{Prefill TPS} = \frac{\text{input\_tokens}}{\text{TTFT}} $$

This is how fast the model processes your input.

However, prefill TPS doesn’t accurately reflect system performance. The issue is that TTFT includes queue wait time, not just actual processing time.

Request timeline:
┌─────────┬──────────────────┬───────────────┐
│ Queuing │ Actual Prefill   │ Decode Phase  │
│ Wait    │ Processing       │               │
└─────────┴──────────────────┴───────────────┘
└──────────────┬─────────────┘
 TTFT (includes queuing wait)

When a server is under load, your request might sit in a queue waiting for resources. TTFT measures the time from when you sent the request to when you receive the first token—which includes this waiting period.

Example: 1000 input tokens

Scenario A (no queue wait):
  Queue: 0ms → Actual Prefill: 100ms
  TTFT = 100ms
  Prefill TPS = 1000 / 0.1s = 10,000 tokens/s

Scenario B (with queue wait):
  Queue: 200ms → Actual Prefill: 100ms
  TTFT = 300ms
  Prefill TPS = 1000 / 0.3s = 3,333 tokens/s

In Scenario B, TTFT and prefill TPS look worse despite the same amount of prefill work.

The actual processing speed (10,000 tokens/s) is identical in both scenarios. The lower prefill TPS in Scenario B reflects queue contention, not the system’s processing capability.
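
The same arithmetic in code form, to make the deflation explicit (illustrative only):

fn prefill_tps(input_tokens: f64, ttft_seconds: f64) -> f64 {
    input_tokens / ttft_seconds
}

fn main() {
    let input_tokens = 1000.0;

    // Scenario A: no queue wait, 100 ms of actual prefill.
    let tps_a = prefill_tps(input_tokens, 0.100);
    // Scenario B: 200 ms queue wait plus the same 100 ms of prefill.
    let tps_b = prefill_tps(input_tokens, 0.200 + 0.100);

    println!("Scenario A: {tps_a:.0} tokens/s"); // 10000
    println!("Scenario B: {tps_b:.0} tokens/s"); // ~3333
}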

Decode Throughput (TPS) #

Tokens generated per second during the decode phase:

$$ \text{Decode TPS} = \frac{\text{output\_tokens}}{\text{final\_time} - \text{decode\_start\_time}} $$

This is the generation speed—how fast the model produces output.
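
A quick sketch with made-up timestamps, assuming the decode phase is taken to start at the first token:

fn main() {
    // Illustrative per-request timestamps, in seconds since the request was sent.
    let first_token_time = 0.30; // decode starts once the first token arrives
    let final_token_time = 2.30; // response finished (finish_reason returned)
    let output_tokens = 256.0;

    let decode_tps = output_tokens / (final_token_time - first_token_time);
    println!("Decode TPS: {decode_tps:.1} tokens/s"); // 128.0
}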

End-to-End Latency #

Total time from POST request to final token (when the response returns with finish_reason). This is the complete request duration including both prefill and decode phases.

What Matters Most #

For production serving, focus on TTFT, ITL stats and maybe RPM.

TTFT measures how quickly users see their first token—this is the perceived responsiveness of your system.

ITL statistics reveal decode-phase issues that throughput metrics hide. The 99th percentile and max ITL values expose:

  • Preemption events from KV cache limits
  • Network issues between components
  • GPU memory pressure

ITL matters less for batch jobs or non-streaming APIs where users don’t watch tokens arrive in real-time.

Requests Per Minute (RPM) is a more traditional metric that applies everywhere outside of LLMs and can be used for both interactive and non-interactive applications.

Token Counting #

Accurate metrics require accurate token counts. llmperf-rs handles this in two ways:

  1. API Response: Most OpenAI-compatible endpoints return token counts in the usage field (prompt_tokens, completion_tokens, total_tokens). By default, llmperf-rs prioritizes these values as long as they are available (a sketch of this fallback order follows the list).

  2. Tokenizer (optional): For exact input counts, pass a HuggingFace tokenizer via --tokenizer <path>. Note that chat templates may cause <10 token variance even then.
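
A rough sketch of that priority order, preferring the API-reported usage and falling back to a local tokenizer. It assumes the HuggingFace tokenizers crate and an OpenAI-style usage shape; it is not the actual llmperf-rs code:

use tokenizers::Tokenizer;

// OpenAI-style usage block, as returned by most compatible endpoints.
struct Usage {
    prompt_tokens: Option<usize>,
}

fn input_token_count(usage: &Usage, prompt: &str, tokenizer: Option<&Tokenizer>) -> Option<usize> {
    // 1. Prefer the API-reported count when it is available.
    if let Some(n) = usage.prompt_tokens {
        return Some(n);
    }
    // 2. Otherwise fall back to counting locally with the user-supplied tokenizer.
    tokenizer.and_then(|t| t.encode(prompt, false).ok().map(|enc| enc.get_ids().len()))
}

fn main() {
    let usage = Usage { prompt_tokens: Some(8192) };
    // With an API-reported count, no local tokenizer is needed.
    assert_eq!(input_token_count(&usage, "ignored prompt", None), Some(8192));
}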

A Different Approach to Token Counting #

The original llmperf uses a single tokenizer (hf-internal-testing/llama-tokenizer) for all models. Different models use different tokenizers, so llmperf-rs lets you specify the correct tokenizer or rely on API-reported counts for accuracy.

For example,

Llama-2 has a vocab size of $32000$, while a more recent model like Qwen/Qwen3-4B-Instruct-2507 has $151936$. While many of the extra tokens cover multilingual text, some of them are additional English tokens.

This means that what the Llama tokenizer encodes as two tokens, foo and bar, might become a single token foo bar under Qwen’s tokenizer.

In my own testing, setting the input tokens to $8192$ against a Qwen endpoint while using the default Llama tokenizer to build the prompts returned these API-reported values:

input_tokens: DetailedStats {
    quantiles_p25: 7363.0,
    quantiles_p50: 7365.0,
    quantiles_p75: 7372.083333333333,
    quantiles_p90: 7374.9,
    quantiles_p95: 7376.0,
    quantiles_p99: 7376.0,
    mean: 7367.4,
    median: 7365.0,
    stddev: 5.081557066709201,
    min: 7362,
    max: 7376,
},
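
To see the mismatch directly, you can encode the same text with both tokenizers. A hedged sketch using the HuggingFace tokenizers crate, with placeholder tokenizer.json paths:

use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Placeholder paths: download tokenizer.json for each model first.
    let llama = Tokenizer::from_file("llama-2/tokenizer.json")?;
    let qwen = Tokenizer::from_file("qwen3-4b/tokenizer.json")?;

    let text = "The quick brown fox jumps over the lazy dog. ".repeat(100);

    let llama_tokens = llama.encode(text.as_str(), false)?.get_ids().len();
    let qwen_tokens = qwen.encode(text.as_str(), false)?.get_ids().len();

    // The larger Qwen vocabulary typically needs fewer tokens for the same text.
    println!("llama: {llama_tokens} tokens, qwen: {qwen_tokens} tokens");
    Ok(())
}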

Validating Your Results: Finish Reason #

All benchmark runs should end with finish_reason = length (meaning the model hit the max_tokens limit). If you see finish_reason = stop, the model stopped early.
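
A minimal check when parsing responses, using a struct that mirrors the OpenAI-style chat completion shape (serde and serde_json assumed; not llmperf-rs internals):

use serde::Deserialize;

#[derive(Deserialize)]
struct Choice {
    finish_reason: Option<String>,
}

#[derive(Deserialize)]
struct ChatCompletion {
    choices: Vec<Choice>,
}

fn main() {
    // Example response body; a real run would check every response.
    let body = r#"{"choices":[{"finish_reason":"stop"}]}"#;
    let resp: ChatCompletion = serde_json::from_str(body).expect("valid JSON");

    for choice in &resp.choices {
        if choice.finish_reason.as_deref() != Some("length") {
            eprintln!(
                "warning: finish_reason = {:?}, the model stopped before max_tokens",
                choice.finish_reason
            );
        }
    }
}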

Getting Started with llmperf-rs #

Installing from releases is preferred.

# Run a benchmark
export OPENAI_API_BASE=http://localhost:8000/v1
llmperf --model Qwen/Qwen3-4B-Instruct-2507

The tool outputs:

  • Parquet files with per-response metrics (for detailed analysis)
  • JSON summaries with aggregated statistics (percentiles, mean, min/max)
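
For a quick look at a JSON summary without extra tooling, something like the sketch below works. The file name and field names (e.g. inter_token_latency, quantiles_p99) are assumptions based on the stats shown earlier, so adjust them to the actual output:

use serde_json::Value;
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Path and field names are guesses; adjust them to the actual summary layout.
    let summary: Value = serde_json::from_str(&fs::read_to_string("summary.json")?)?;

    let p99 = &summary["inter_token_latency"]["quantiles_p99"];
    let max = &summary["inter_token_latency"]["max"];
    println!("ITL p99: {p99} ms, max: {max} ms");
    Ok(())
}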

Differences from llmperf #

Issue             llmperf                 llmperf-rs
ITL aggregation   Averages per-request    Concatenates all values
Token counting    Hardcoded tokenizer     API-reported or configurable
Prompts           Multiple selections     Sonnet text only
Language          Python + Ray            Rust (faster, lower overhead)
Maintenance       Archived                Maintained

Issues #

  • I currently don’t have a way to test the prefix caching of endpoints

Wrapping Up #

When benchmarking LLM systems, different use cases require different approaches. llmperf-rs focuses on preserving raw ITL values to capture latency spikes, using API-reported token counts when available, and providing deterministic test inputs. Whether you’re evaluating vLLM, TensorRT-LLM, or your own inference stack, these metrics will give you the real picture of your system’s performance.


Check out the source code at github.com/wheynelau/llmperf-rs and see the full metrics documentation for deeper details. You can also create issues and contribute.