Recently, I benchmarked vLLM on a GPU to better understand how much throughput can realistically be expected in an LLM serving setup.
One thing surprised me early on:
High GPU utilization does not necessarily mean high throughput.
Initial Setup and Confusion
Once the model and vLLM server were running, the OpenAI-compatible endpoint was ready for testing. I used a benchmarking script to vary:
- Number of connections
- Concurrency levels
- Request patterns
I mainly measured:
- Requests per second
- Token throughput
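For context, the load generator looked roughly like this. It's a simplified sketch rather than my exact script: the endpoint URL and model name are placeholders, and I'm assuming aiohttp as the async HTTP client.

```python
# Rough sketch of the kind of load generator I used (not the exact script).
# Assumptions: a local vLLM server with the OpenAI-compatible API at BASE_URL,
# the aiohttp library, and a model name matching whatever the server loaded.
import asyncio
import time

import aiohttp

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed endpoint
MODEL = "my-model"                                       # placeholder model name

async def one_request(session: aiohttp.ClientSession, prompt: str) -> int:
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": 128,
    }
    async with session.post(BASE_URL, json=payload) as resp:
        data = await resp.json()
        # Completion tokens reported by the server, used for token throughput.
        return data.get("usage", {}).get("completion_tokens", 0)

async def run_benchmark(num_requests: int, concurrency: int) -> None:
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests

    async def bounded(session: aiohttp.ClientSession, prompt: str) -> int:
        async with sem:
            return await one_request(session, prompt)

    async with aiohttp.ClientSession() as session:
        start = time.perf_counter()
        tokens = await asyncio.gather(
            *(bounded(session, f"Question #{i}") for i in range(num_requests))
        )
        elapsed = time.perf_counter() - start

    print(f"requests/s : {num_requests / elapsed:.2f}")
    print(f"tokens/s   : {sum(tokens) / elapsed:.2f}")

if __name__ == "__main__":
    asyncio.run(run_benchmark(num_requests=200, concurrency=32))
```

Requests per second here is simply total requests over wall-clock time, and token throughput comes from the completion token counts the server reports back.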
The first results were confusing.
GPU utilization stayed around 99%, yet overall throughput looked lower than expected. At first, I wasn’t sure whether this was normal behavior or a flaw in my benchmarking method.
It felt like the system was working hard, but not producing as much output as I expected.
Benchmark Method Matters More Than I Thought
After several rounds of testing and discussions with teammates, I adjusted the methodology to make it more consistent:
- Fix the total request count
- Control the number of connections
- Send requests in bursts instead of spaced batches
- Increase connection count step by step
This changed everything.
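In code, the adjusted loop looked something like the sketch below, building on the hypothetical run_benchmark coroutine from the earlier sketch: the total request count stays fixed, every request is queued at once (the burst), and only the number of in-flight connections steps up between runs.

```python
# Sketch of the adjusted methodology (reuses the run_benchmark coroutine
# sketched earlier; the specific connection counts are just example values).
import asyncio

TOTAL_REQUESTS = 512  # fixed across all runs so results stay comparable

async def sweep() -> None:
    for connections in (1, 2, 4, 8, 16, 32, 64, 128):
        # All TOTAL_REQUESTS are queued immediately; the semaphore inside
        # run_benchmark keeps exactly `connections` requests in flight, so the
        # server-side queue never drains mid-run.
        print(f"--- {connections} connections ---")
        await run_benchmark(num_requests=TOTAL_REQUESTS, concurrency=connections)

if __name__ == "__main__":
    asyncio.run(sweep())
```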
In my earlier tests, I was sending requests in batches with time gaps. If the number of active connections wasn’t enough to fully load the server, the measured throughput appeared artificially low.
Once I switched to burst-style traffic and ensured the request queue stayed full, throughput increased significantly.
Requests per second and token throughput both improved as connection count increased — up to a point.
After reaching a certain threshold, adding more connections actually reduced performance. That was when the true capacity limit became visible.
This was my first major takeaway:
If the server isn’t fully loaded, your benchmark results don’t reflect its real capacity.
Misunderstanding “Batching”
Another thing I misunderstood was what “batching” actually means in practice.
My initial prompt structure was:
- One fixed system message
- One variable user message
Since the system message remained constant, prefix caching worked well. The prefix cache hit rate stayed around 50–60%.
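Concretely, each request looked something like this (illustrative values only):

```python
# Shape of my original requests (placeholder model name and system prompt).
# Because the system message is identical across requests, every prompt starts
# with the same token sequence, which is what vLLM's prefix cache can reuse.
SYSTEM_PROMPT = "You are a helpful assistant. Follow the rules below."  # fixed

def build_request(user_message: str) -> dict:
    return {
        "model": "my-model",  # placeholder
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # shared prefix
            {"role": "user", "content": user_message},     # varies per request
        ],
    }
```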
Later, I tried a different approach:
One system message with multiple user messages combined into a single request.
I expected this to process more inputs more efficiently. But the results were different from what I anticipated.
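In request terms, the change looked roughly like this (continuing the sketch above, so SYSTEM_PROMPT and the placeholder model name carry over):

```python
# The "bulk" variant: many user messages packed behind the same system message
# in a single request, instead of one request per user message.
def build_bulk_request(user_messages: list[str]) -> dict:
    return {
        "model": "my-model",  # placeholder
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # same fixed prefix
            *({"role": "user", "content": msg} for msg in user_messages),
        ],
    }
```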
Here’s what happened:
- Context window limits were reached more easily
- Prefix cache hit rate dropped
- KV cache memory filled up faster
- `vllm:num_preemptions_total` increased
Preemptions meant some sequences were evicted from GPU memory and had to be recomputed later.
Even though each request carried more text, the number of independent user messages handled efficiently did not grow proportionally.
The shared prefix became a smaller fraction of the total prompt, reducing the benefits of prefix caching. At the same time, memory pressure increased significantly.
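A back-of-the-envelope calculation (with made-up token counts) shows why the cacheable share shrinks so quickly:

```python
# Illustration only: how the shared system prompt becomes a smaller fraction
# of the total prompt as more user messages are combined into one request.
system_tokens = 300          # fixed system message
user_tokens_each = 150       # average length per user message

for n_users in (1, 5, 20):
    total = system_tokens + n_users * user_tokens_each
    print(f"{n_users:>2} user messages: prefix is {system_tokens / total:.0%} of the prompt")
# 1 -> 67%, 5 -> 29%, 20 -> 9%: the cacheable share drops quickly.
```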
What vLLM Is Actually Doing
Looking more closely at vLLM metrics helped clarify what was happening. Metrics such as:
- Prefix cache hit rate
- KV cache usage
- Number of running requests
- Prefill time
- Decode time
- Preemptions
gave visibility into the internal behavior of the scheduler.
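I pulled these from the Prometheus endpoint the vLLM server exposes at /metrics. The exact metric names vary a bit between vLLM versions, so the substrings below are just the ones that happened to matter for me:

```python
# Quick way to check scheduler-level metrics from the vLLM server's
# Prometheus endpoint. Metric names are version-dependent; treat the
# substrings below as examples, not a definitive list.
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed server address
INTERESTING = ("num_requests_running", "cache_usage", "prefix_cache",
               "num_preemptions", "prefill", "decode")

with urllib.request.urlopen(METRICS_URL) as resp:
    for line in resp.read().decode().splitlines():
        if line.startswith("vllm:") and any(key in line for key in INTERESTING):
            print(line)
```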
What I eventually learned is that vLLM uses continuous batching, not static batching.
Continuous batching dynamically groups sequences at each decoding step. The scheduler decides how many tokens and sequences can run together, constrained by limits like:
- Maximum number of batched tokens
- Maximum number of sequences
Batching in vLLM is token-based and dynamic. It is not simply about combining multiple prompts into a larger request.
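Those limits correspond to engine arguments you can set explicitly. The sketch below uses the offline LLM API for brevity; as far as I can tell the same options exist as flags on the OpenAI-compatible server, and the values shown are just examples, not recommendations.

```python
# The scheduler limits mentioned above, spelled out as vLLM engine arguments
# (example values only; tune them for your model and GPU).
from vllm import LLM

llm = LLM(
    model="my-model",              # placeholder
    max_num_batched_tokens=8192,   # cap on tokens the scheduler batches per step
    max_num_seqs=256,              # cap on sequences running concurrently
    enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
)
```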
This clarified why my “bulk prompt” approach didn’t improve throughput the way I expected.
Key Takeaways
- 99% GPU utilization does not guarantee optimal throughput
- Under-loading the server leads to misleading benchmark results
- Connection count strongly affects measured capacity
- Larger prompts can reduce prefix cache efficiency
- Memory pressure can trigger preemptions and recomputation
- Continuous batching changes how we should think about batching
Final Thoughts
This wasn’t a deep research study. I’m still relatively new to LLM serving systems.
But going through this process helped me build a clearer mental model of how load, memory, caching, and scheduling interact in practice.
Benchmarking is not just about generating numbers — it’s about understanding how the system behaves under pressure.
It turned out to be a much more interesting learning experience than I initially expected.