Recently I had a chance to benchmark vLLM running on a GPU, to understand how much throughput could realistically be expected in a serving setup.
Once the model and vLLM server were ready, the OpenAI-compatible endpoint was available for testing. I used a benchmarking script to try different connection counts and concurrency levels. I mainly focused on requests per second and token throughput.
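The core of that first script looked roughly like this. This is a simplified sketch, not my exact code: the endpoint URL, model name, and prompt are placeholders, and it sends requests in fixed-size batches with a pause between them.

```python
# Simplified sketch of my first benchmark pass (placeholder endpoint, model,
# and prompt): send requests in fixed-size batches with a short gap between
# batches, then report requests/s and output tokens/s.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request() -> int:
    """One chat completion; returns the number of generated tokens."""
    resp = await client.chat.completions.create(
        model="my-model",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain KV caching in one paragraph."},
        ],
        max_tokens=128,
    )
    return resp.usage.completion_tokens


async def main(batches: int = 10, batch_size: int = 8, gap_s: float = 1.0) -> None:
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(batches):
        results = await asyncio.gather(*(one_request() for _ in range(batch_size)))
        total_tokens += sum(results)
        await asyncio.sleep(gap_s)  # the gap that later turned out to skew the numbers
    elapsed = time.perf_counter() - start
    print(f"requests/s:      {batches * batch_size / elapsed:.2f}")
    print(f"output tokens/s: {total_tokens / elapsed:.2f}")


if __name__ == "__main__":
    asyncio.run(main())
```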
The first results surprised me.
- Throughput looked lower than I expected.
- GPU utilization stayed around 99%.
At the time, I was not sure whether this was normal. It felt like request processing was simply very resource intensive, and I wondered whether I was testing it the right way.
Later, I adjusted the benchmark method to make it more consistent.
I changed the setup to:
- Control the number of connections
- Fix the total request count
- Send requests in bursts
One thing I learned is that connection count matters a lot.
In my earlier tests, I was sending requests in batches with time gaps. If the number of active connections was not enough to keep the server fully loaded, the measured throughput understated what the server could actually deliver.
So I tried a different approach:
- Fix the number of connections.
- Send requests quickly in a burst so the server queue stays full.
- Increase the connection count step by step.
- Observe token throughput and requests per second.
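In code, the revised approach was roughly this: a fixed pool of worker tasks (one per connection) pulls requests from a shared queue so the server queue never drains, and the run is repeated for increasing connection counts. Again a sketch with placeholder endpoint, model, and prompt.

```python
# Sketch of the revised approach: a fixed pool of N "connections" (worker
# tasks), each pulling work from a shared queue so the server stays saturated,
# then sweeping N upward. Endpoint, model name, and prompt are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MESSAGES = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short haiku about GPUs."},
]


async def worker(queue: asyncio.Queue, tokens_out: list[int]) -> None:
    while True:
        try:
            queue.get_nowait()  # take the next pending request, if any
        except asyncio.QueueEmpty:
            return
        resp = await client.chat.completions.create(
            model="my-model", messages=MESSAGES, max_tokens=128
        )
        tokens_out.append(resp.usage.completion_tokens)


async def run_burst(connections: int, total_requests: int) -> tuple[float, float]:
    queue: asyncio.Queue = asyncio.Queue()
    for _ in range(total_requests):
        queue.put_nowait(None)  # each item represents one request to send

    tokens_out: list[int] = []
    start = time.perf_counter()
    await asyncio.gather(*(worker(queue, tokens_out) for _ in range(connections)))
    elapsed = time.perf_counter() - start
    return total_requests / elapsed, sum(tokens_out) / elapsed


async def main() -> None:
    for connections in (1, 2, 4, 8, 16, 32, 64):
        rps, tps = await run_burst(connections, total_requests=200)
        print(f"connections={connections:3d}  req/s={rps:6.2f}  tokens/s={tps:8.1f}")


if __name__ == "__main__":
    asyncio.run(main())
```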
With this setup, I saw that requests per second increased as connection count increased. Token throughput also improved. But after a certain point, adding more connections caused performance to drop. That helped me see the real capacity limit of the system.
I also misunderstood what “Batch API” really meant.
In my test, the prompt format was one fixed system message and one variable user message, with the two parts about the same length. Because the system message stayed the same and made up roughly half of each prompt, prefix caching worked and the prefix cache hit rate stayed around 50 to 60 percent.
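Concretely, each request looked something like this (the text is placeholder content, not my real prompts); the fixed system message is the shared prefix that the server can cache. Depending on the vLLM version, prefix caching may need to be enabled on the server with --enable-prefix-caching.

```python
# Each request: a fixed system message (shared, cacheable prefix) plus a user
# message that changes every time. Text is placeholder content.
FIXED_SYSTEM = "You are a support assistant. Answer briefly and politely."


def build_messages(user_text: str) -> list[dict]:
    return [
        {"role": "system", "content": FIXED_SYSTEM},  # identical across requests
        {"role": "user", "content": user_text},       # varies per request
    ]


# One hundred requests that all share the same cacheable prefix.
requests = [build_messages(f"Question #{i}: how do I reset my password?") for i in range(100)]
```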
Later, I tried sending one system message with multiple user messages in a single request. I thought this would be a more efficient way to process many inputs, but the results were different from what I expected.
- Context window limits were reached more easily.
- Prefix cache hit rate dropped because the shared prefix became a smaller part of the total prompt.
- KV cache memory filled up faster.
- vllm:num_preemptions_total increased, which means some requests were removed from GPU memory and had to be recomputed.
Even though each request packed in more text, the number of independent user messages the server handled efficiently did not really increase.
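For comparison, the bulk variant packed many user messages behind the same system message in a single request (again with placeholder content), which is why the shared prefix became a much smaller fraction of each prompt.

```python
# The "bulk" variant: one request carrying many user messages. The prompt is
# much longer, the shared system message is now a small fraction of it, and a
# single request consumes far more KV cache memory. Text is placeholder content.
FIXED_SYSTEM = "You are a support assistant. Answer briefly and politely."
user_questions = [f"Question #{i}: how do I reset my password?" for i in range(20)]

bulk_request = [{"role": "system", "content": FIXED_SYSTEM}] + [
    {"role": "user", "content": q} for q in user_questions
]
```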
Looking at vLLM metrics helped me understand a bit more. Metrics like prefix cache hits, KV cache usage, number of running requests, prefill time, decode time, and preemptions gave more visibility into what was happening inside the server.
With single system-plus-user prompts and increasing connection counts, the prefix cache hit rate stayed stable, KV cache usage increased, and the number of successful requests kept climbing.
With bulk prompts, the prefix cache hit rate dropped, KV cache usage grew faster, and preemptions increased once memory became tight.
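vLLM's OpenAI-compatible server exposes these counters in Prometheus format at the /metrics endpoint, so they can be watched while a benchmark runs. Here is a small polling sketch; the exact metric names vary between vLLM versions, so it simply filters for the ones mentioned above.

```python
# Poll the server's Prometheus metrics and print the counters I cared about.
# vllm:num_preemptions_total is the name quoted above; the other entries are
# substring guesses, since exact names differ across vLLM versions.
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # placeholder host/port
WATCH = ("prefix_cache", "cache_usage", "num_requests_running", "num_preemptions_total")


def snapshot() -> None:
    body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    for line in body.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        if any(key in line for key in WATCH):
            print(line)


if __name__ == "__main__":
    while True:
        snapshot()
        print("-" * 40)
        time.sleep(5)
```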
After reading the documentation and discussing with the team, I learned that vLLM uses continuous batching instead of simple static batching.
Continuous batching means requests are grouped dynamically during each decoding step. The scheduler decides how many sequences and tokens can run together, based on limits like max number of batched tokens and max number of sequences.
So batching in vLLM is dynamic and token-based; it is not a matter of combining multiple prompts into one big request.
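Those scheduler limits are ordinary configuration knobs. Here is a sketch using vLLM's offline Python API, with a placeholder model name and untuned example values; the server CLI exposes the same limits as --max-num-seqs and --max-num-batched-tokens.

```python
# The knobs the continuous-batching scheduler works with. At every decoding
# step it packs as many sequences and tokens as fit under these caps.
# Model name and values below are placeholders, not tuned settings.
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-org/my-model",        # placeholder
    max_num_seqs=256,               # cap on sequences running in one step
    max_num_batched_tokens=8192,    # cap on tokens scheduled per step
    enable_prefix_caching=True,     # reuse KV blocks for shared prefixes
)

prompts = [f"Question #{i}: how do I reset my password?" for i in range(64)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
for out in outputs[:3]:
    print(out.outputs[0].text)
```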
This was not a deep technical study. I am still very new to this area. But going through this process helped me build a clearer mental model of how LLM serving works under load.
For me, it was a simple but very interesting learning experience.