LLM inference server performances comparison llama.cpp / TGI / vLLM #6730

phymbert · 2024-04-17T19:28:57Z

phymbert
Apr 17, 2024
Collaborator

Performances and improvment area

This thread objective is to gather llama.cpp performance 📈 and improvement ideas💡against other popular LLM inference
frameworks, especially on the CUDA backend. Let's try to fill the gap 🚀

Hugging Face TGI: A Rust, Python and gRPC server for text
generation inference.
vLLM: Easy, fast, and cheap LLM serving for everyone.

I have run a couple of benchmarks from the OpenAI /chat/completions endpoint client point of view
using JMeter on 2 A100 with mixtral8x7b and a fine tune llama70b models.

Note 1: from the client point of view, it is not possible to get accurate PP and TG because, first you need steaming
enabled and then PP will always include one generated token. So easier to compare the total tokens of the transactions
in completions.usage.

Note 2: from a performance tests server point of view, we generally consider following metrics:

iterations: total request successfully completed during the test
prompt tokens: average prompt tokens per request, same by iteration number for all tests
generated tokens: average generated tokens per request
RPM: Requests Per Minute
latency: Duration of the http request in seconds
PP+TG: total tokens http clients send and receive per second
errors: number of request in errors during the test from the client point of view, it can be http timeout,
connection close. It is not necessarily caused by the server.

Context size

The transaction tokens context here is:

	min	avg	max
Prompt tokens	12	155	512
Generated tokens	38	133	996

Results

llama70b @ `eedd42e`

metric	users	duration	llama.cpp	vLLM	TGI
iterations	32	30m	1514	4 734	4 448
prompt tokens	32	30m	155.19	125.58	138.66
generated tokens	32	30m	132.71	144.61	125.66
RPM	32	30m	40.92	147.94	139.00
latency	32	30m	65.10	17.06	16.79
PP+TG/s	32	30m	196.34	649.46	612.33
errors	32	30m	0.19%	0.09%	0.05%
iterations	1	10m	48	99	85
prompt tokens	1	10m	120.31	116.07	135.53
generated tokens	1	10m	115.56	121.28	124.69
RPM	1	10m	4.36	9.00	7.73
latency	1	10m	12.83	6.41	7.30
PP+TG	1	10m	17.15	35.60	33.51
errors	1	10m	0%	0%	0%

llama.cpp configuration

server  --model myllama70b-f16-00001-of-00010.gguf \
        --ctx-size 32768 \
        --n-predict 4096 \
        --n-gpu-layers 81 \
        --batch-size 4096 \
        --ubatch-size 256 \
        --parallel 1|32 \
        --metrics \
        --log-format text

vLLM configuration

python -m vllm.entrypoint.openai.api_server \
   --model /models/myllama70b \
   --tensor-parallel-size 2

TGI Configuration

Please note how it is easy:

text-generation-launch --model-id /models/myllama70b

mixtral8x7b

metric	users	duration	llama.cpp	vLLM	TGI
iterations	32	30m	4 152	10 541	10 849
prompt tokens	32	30m	83.56	97.63	98.63
generated tokens	32	30m	110.27	166.31	98.18
RPM	32	30m	129.75	329.41	339.03
latency	32	30m	20.90	6.04	5.83
PP+TG/s	32	30m	409.81	1 449.13	1 487.45
errors	32	30m	1.29%	0.06%	0.05%
iterations	1	10m	219	439	430
prompt tokens	1	10m	79.24	94.09	98.64
generated tokens	1	10m	103.02	107.65	108.14
RPM	1	10m	19.91	39.91	39.09
PP+TG	1	10m	60.48	134.19	134.72
errors	1	10m	0%	0%	0%

llama.cpp configuration @ `137fbb8`

server  --model mixtral-8x7b-instruct-f16-00001-of-00010.gguf \
        --ctx-size 131072 \
        --n-predict 4096 \
        --n-gpu-layers 33 \
        --batch-size 4096 \
        --ubatch-size 256 \
        --parallel 1|32 \
        --metrics \
        --log-format text

Magically vLLM and TGI configuration are not changed.

Area of improvements of `llama.cpp`

Please @ggerganov edit as will

ggml : group all experts in a single ggml_mul_mat_id #6505
ggml : add Flash Attention #5021
Implement automatic NGL detection #6502
Automatic KV Cache size detection
Automatic batch size based on the underlying hardware
server: process prompt fairly accross slots #6607

phymbert · 2024-04-17T19:30:02Z

phymbert
Apr 17, 2024
Collaborator Author

@JohannesGaessler @slaren as you are the main contributors on CUDA backend feel free to highlight or amend any hypothesis. Thanks a lot for your impressive job here.

25 replies

phymbert Apr 18, 2024
Collaborator Author

What are you calling file format, please ? I just took the files on HF, safetensors bfloat16

phymbert Apr 18, 2024
Collaborator Author

basically was wondering what is the fastest way to do inference with what file format

llama.cpp on GGUF of course

ggerganov Apr 19, 2024
Maintainer

If you guide me a little on how I can convert a bfloat16 HF model to the quantized version, I can give it a try on my spare time.

I'm not familiar with the other frameworks, so would be of little help here. No worries, not very important

OB-SPrince May 2, 2024

I think the llama.cpp need to use the real model, and not a quantized gguf version. Then it would be proper comparison.

phymbert May 2, 2024
Collaborator Author

Please develop

aneeshmb02 · 2024-05-10T09:21:18Z

aneeshmb02
May 10, 2024

I have the same objective. What data/prompt did you use and can I use the same with llama-bench?
Started the same discussion at #7195.

0 replies

uicomponent · 2024-09-10T17:12:42Z

uicomponent
Sep 10, 2024

@phymbert Can you please add updated test results? Based on details in linked threads inference server performance was significantly improved. It would be nice to compare results before and after here.

1 reply

phymbert Sep 10, 2024
Collaborator Author

It supposed to have a continuuous performance github action but it has been disabled. Feel free to fix it

hitdra · 2024-09-15T22:24:30Z

hitdra
Sep 15, 2024

Hello, I'm new to this area and recently wanted to measure llamacpp throughput on an a100 video card. I would like to ask you, what about sending a request to llamacpp server after opening its service and then measuring the throughput of llamacpp?

If so, can you refer me to your code for measuring these metrics?

3 replies

VJHack Sep 16, 2024

Start llama-server with --metrics then hit /metrics endpoint.

hitdra Sep 16, 2024

Thank you very much for your reply, I still don't quite get it, could you please give me a specific example if you can?

hitdra Sep 16, 2024

Start llama-server with --metrics then hit /metrics endpoint.

Sorry, I have another query for you. How do I send so many requests in the dataset to the llamacpp server? With a one by one curl command?

OB-SPrince · 2024-09-17T00:15:27Z

OB-SPrince
Sep 17, 2024

Use an AI to develop a load testing script, by showing an example of how you use curl. I have a repo for this https://GitHub.com/SolidRusT/srt-inference-perf Get Outlook for Android<https://aka.ms/AAb9ysg>

…

________________________________ From: hitdra ***@***.***> Sent: Monday, September 16, 2024 6:31:06 AM To: ggerganov/llama.cpp ***@***.***> Cc: Shaun Prince ***@***.***>; Comment ***@***.***> Subject: Re: [ggerganov/llama.cpp] LLM inference server performances comparison llama.cpp / TGI / vLLM (Discussion #6730) EXTERNAL Start llama-server with --metrics then hit /metrics endpoint.<https://github.com/ggerganov/llama.cpp/tree/master/examples/server#:~:text=GET%20/metrics%3A%20Prometheus%20compatible%20metrics%20exporter> Sorry, I have another query for you. How do I send so many requests in the dataset to the llamacpp server? With a one by one curl command? — Reply to this email directly, view it on GitHub<#6730 (reply in thread)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/BBLN4BPL7UVKKDRPQCCAGH3ZW3MRVAVCNFSM6AAAAABGL5PSISVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTANRVHE4DGMQ>. You are receiving this because you commented.Message ID: ***@***.***> This communication (including any attachments) is intended for the use of the intended recipient(s) only and may contain information that is confidential, privileged or legally protected. Any unauthorised use or dissemination of this communication is strictly prohibited. The content of this email and traffic data may be monitored for employment, security, compliance and other legally authorised purposes. OpenBet Ltd, Registered no. 3134634. OpenBet Technologies Ltd, Registered no. 6712030. Both companies registered in England and Wales. Registered Office: Building 6 Chiswick Park, 566 Chiswick High Road, London, W4 5HR. NYX Digital Gaming (USA), LLC, Registered no. E0228962018-2. Registered in Nevada. Registered Office: c/o Corporation Service Company, 112 North Curry Street, Carson City, NV 89703. OpenBet Singapore Pte Limited. Co. Reg. No. 201529435R. Incorporated under the laws of the Republic of Singapore. Registered Address: 16, Raffles Quay, #33-03, Singapore 048581.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLM inference server performances comparison llama.cpp / TGI / vLLM #6730

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 29 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

LLM inference server performances comparison llama.cpp / TGI / vLLM #6730

phymbert Apr 17, 2024 Collaborator

Performances and improvment area

Context size

Results

llama70b @ eedd42e

llama.cpp configuration

vLLM configuration

TGI Configuration

mixtral8x7b

llama.cpp configuration @ 137fbb8

Area of improvements of llama.cpp

Replies: 5 comments · 29 replies

phymbert Apr 17, 2024 Collaborator Author

phymbert Apr 18, 2024 Collaborator Author

phymbert Apr 18, 2024 Collaborator Author

ggerganov Apr 19, 2024 Maintainer

phymbert May 2, 2024 Collaborator Author

phymbert Sep 10, 2024 Collaborator Author

phymbert
Apr 17, 2024
Collaborator

llama70b @ `eedd42e`

llama.cpp configuration @ `137fbb8`

Area of improvements of `llama.cpp`

Replies: 5 comments 29 replies

phymbert
Apr 17, 2024
Collaborator Author

phymbert Apr 18, 2024
Collaborator Author

phymbert Apr 18, 2024
Collaborator Author

ggerganov Apr 19, 2024
Maintainer

phymbert May 2, 2024
Collaborator Author

phymbert Sep 10, 2024
Collaborator Author