Concurrent load testing of vLLM models with hey
docker run --rm --network=knowledge_network \
  registry.cn-shanghai.aliyuncs.com/zhph-server/hey:latest \
  -n 200 -c 200 -m POST \
  -H "Content-Type: application/json" \
  -H "Authorization: xxx" \
  -d '{
        "model": "codechat",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello!"}
        ],
        "stream": false,
        "max_tokens": 100,
        "temperature": 0.0
      }' \
  http://vllm-openai:80/v1/chat/completions
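Here -n 200 is the total number of requests and -c 200 the number of concurrent workers, so all 200 chat requests are fired at once against /v1/chat/completions. Before launching a full run, it can be worth sanity-checking the endpoint with a single request; a minimal curl sketch, reusing the same payload and the placeholder Authorization value from above:

curl -s http://vllm-openai:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: xxx" \
  -d '{"model": "codechat", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 100}'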
docker run --rm --network=knowledge_network \
  registry.cn-shanghai.aliyuncs.com/zhph-server/hey:latest \
  -n 200 -c 200 -m POST \
  -H "Content-Type: application/json" \
  -H "Authorization: xxx" \
  -d '{
        "model": "codebase",
        "prompt": "# write a python code to print hello world",
        "stream": false,
        "max_tokens": 100,
        "temperature": 0.5
      }' \
  http://vllm-openai:80/v1/completions
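This second command runs the same test against the plain /v1/completions endpoint, this time with the "codebase" model and a code prompt. To see how latency degrades as load grows, the concurrency level can be swept in a loop; a minimal shell sketch, assuming the same image, network, and endpoint:

for c in 10 50 100 200; do
  echo "== concurrency: $c =="
  docker run --rm --network=knowledge_network \
    registry.cn-shanghai.aliyuncs.com/zhph-server/hey:latest \
    -n 200 -c "$c" -m POST \
    -H "Content-Type: application/json" \
    -H "Authorization: xxx" \
    -d '{"model": "codebase", "prompt": "# write a python code to print hello world", "stream": false, "max_tokens": 100, "temperature": 0.5}' \
    http://vllm-openai:80/v1/completions
done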
Results
Summary:
  Total:        2.2220 secs
  Slowest:      1.3603 secs
  Fastest:      0.7641 secs
  Average:      1.0815 secs
  Requests/sec: 43.2034

  Total data:   28992 bytes
  Size/request: 302 bytes

Response time histogram:
  0.764 [1]  |■
  0.824 [5]  |■■■■■■■
  0.883 [4]  |■■■■■■
  0.943 [7]  |■■■■■■■■■■
  1.003 [11] |■■■■■■■■■■■■■■■■
  1.062 [7]  |■■■■■■■■■■
  1.122 [28] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  1.181 [7]  |■■■■■■■■■■
  1.241 [9]  |■■■■■■■■■■■■■
  1.301 [9]  |■■■■■■■■■■■■■
  1.360 [8]  |■■■■■■■■■■■

Latency distribution:
  10% in 0.9175 secs
  25% in 0.9570 secs
  50% in 1.0721 secs
  75% in 1.2131 secs
  90% in 1.2790 secs
  95% in 1.3599 secs
  0% in 0.0000 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0036 secs, 0.7641 secs, 1.3603 secs
  DNS-lookup: 0.0013 secs, 0.0000 secs, 0.0075 secs
  req write:  0.0003 secs, 0.0000 secs, 0.0051 secs
  resp wait:  1.0774 secs, 0.7640 secs, 1.3533 secs
  resp read:  0.0001 secs, 0.0000 secs, 0.0002 secs

Status code distribution:
  [200] 96 responses
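Two things stand out in this output. First, a rough upper bound on generation throughput: at 43.2034 requests/sec with max_tokens set to 100, the server emitted at most about 43.2 × 100 ≈ 4,320 generated tokens per second (assuming every response used its full 100-token budget; real responses may stop earlier). Second, the status code distribution records only 96 responses against 200 issued requests; hey reports failed requests in a separate "Error distribution" section, so the fate of the remaining requests is worth checking before trusting the latency numbers.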