Consider re-working the vLLM Gauge exposing the currently active LoRAs #354
Comments
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```
As discussed in vllm-project#13303, this metric perhaps isn't the most ideal solution for the use case but, given there is an existing user, we should retain compatibility in V1 and deprecate it if we replace it with a different metric. See also kubernetes-sigs/gateway-api-inference-extension#354

Signed-off-by: Mark McLoughlin <[email protected]>
The metrics are currently used in the gateway-api-inference-extension project. While I agree the current implementation is not ideal, I'd like to make sure the new implementation doesn't break the dependency. How would we get the same information from the new implementation?
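(For context, a minimal consumer-side sketch - not the extension's actual code, and the helper name is illustrative - of how the adapter names can be pulled out of the current label-encoded gauge after scraping /metrics, using the prometheus_client text parser:)

```
from prometheus_client.parser import text_string_to_metric_families

def lora_request_samples(metrics_text: str):
    """Yield the adapter state encoded in each vllm:lora_requests_info sample."""
    for family in text_string_to_metric_families(metrics_text):
        for sample in family.samples:
            if sample.name != "vllm:lora_requests_info":
                continue
            # running/waiting adapters are comma-separated strings in the labels
            yield {
                "running": [a for a in sample.labels.get("running_lora_adapters", "").split(",") if a],
                "waiting": [a for a in sample.labels.get("waiting_lora_adapters", "").split(",") if a],
                "observed_at": sample.value,  # the gauge's value is a timestamp
            }
```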
See #13504 - V1 will have a compatible implementation. However, it is quite likely we will deprecate and replace it sooner or later.
Good question, the proposal requires more work. But see the further discussion in #13303 - the current metric really isn't doing what you want right now either, AFAICT. If a LoRA had no scheduled requests in the most recent batch, its weights may still be loaded in GPU memory, and the vLLM instance should still be considered a good target for requests.
The current implementation provides a good proxy in an environment where most LoRA adapters are actively serving traffic. But I agree with you. If we can get the "real" active LoRA metric in v1, that will be awesome.
CC: @kfswain @kaushikmitr
Great discussion here and in #13303! @courageJ pointed out that using a timestamp as the value and some metadata as labels is an existing pattern in k8s, and the semantics is "this is the last timestamp the server observes this state". I think we can clarify this with better documentation. @markmc I saw vllm-project/vllm#13504 reimplements this in v1 - is this the current plan? If so, then I think we should be fine. Keep us in the loop if you'd like to pursue further refactoring. And thanks again for finding us as one of the consumers and bringing this to our attention!
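(To illustrate that last-observed-timestamp reading: among the sample series quoted earlier, the series whose value - a Unix timestamp - is largest reflects the state the server reported most recently. A small sketch using the sample values from above:)

```
# Parsed (labels, value) pairs from the vllm:lora_requests_info samples above;
# the value is the timestamp at which the server last observed that state.
samples = [
    ({"running_lora_adapters": "", "waiting_lora_adapters": ""}, 1.7395575657589855e+09),
    ({"running_lora_adapters": "test-lora", "waiting_lora_adapters": ""}, 1.7395575723949368e+09),
    ({"running_lora_adapters": "test-lora", "waiting_lora_adapters": "test-lora"}, 1.7395575717647147e+09),
]

# The most recently observed state is the series with the largest timestamp.
labels, observed_at = max(samples, key=lambda s: s[1])
running = [a for a in labels["running_lora_adapters"].split(",") if a]
print(running, observed_at)  # ['test-lora'] 1739557572.3949368
```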
In vLLM, we are finishing an effort (known as "V1") to re-architect the core LLM engine - see vllm-project/vllm#10582
One of the last remaining feature gaps versus V0 is metrics - see vllm-project/vllm#10582
As we carry over the V0 metrics into V1, we are re-evaluating what makes sense and considering deprecating or re-working some metrics - see vllm-project/vllm#12745
The vllm:lora_requests_info metric added by vllm-project/vllm#9477 looks particularly unorthodox - its value being a timestamp, and the status of adapters encoded as a comma-separated string in a label (see the sample output quoted above). At first blush, it seems like this would achieve the same thing in a more standard representation:
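(As one hypothetical sketch of such a representation - the metric and label names below are illustrative, not an actual vLLM API or what the issue originally proposed - per-adapter gauges defined via prometheus_client might look like this:)

```
from prometheus_client import Gauge

# Hypothetical per-adapter gauges (names are illustrative):
lora_running = Gauge(
    "vllm:num_lora_requests_running",
    "Number of requests currently running for each LoRA adapter.",
    labelnames=["lora_name"],
)
lora_waiting = Gauge(
    "vllm:num_lora_requests_waiting",
    "Number of requests currently waiting for each LoRA adapter.",
    labelnames=["lora_name"],
)

# e.g. lora_running.labels(lora_name="test-lora").set(1) would be exposed as:
#   vllm:num_lora_requests_running{lora_name="test-lora"} 1.0
```

With per-adapter gauges like these, max_lora would no longer need to ride along as a label on every sample; it would move to the separate lora_config Info metric mentioned below.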
This sort of thing is separately proposed in vllm-project/vllm#11091. There would also be a lora_config Info metric which exposes max_lora.
Feel free to discuss either here or in vllm-project/vllm#13303. Thanks!