[Serve] Deployment names containing #s cause parse failure during checkpoint recovery. #48260

Open
chmeyers opened this issue Oct 24, 2024 · 0 comments · May be fixed by #51003
Labels
bug Something that is supposed to be working; but isn't P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue

Comments

@chmeyers

What happened + What you expected to happen

Ray Serve failed to recover from a checkpoint with the stack trace below. One of the deployments being recovered was named "model1#infer_actor". That was accepted as a deployment name, but it apparently breaks the delimiter splitting in from_full_id_str(), which expects '#' to appear only as a field separator.

ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::SERVE_CONTROLLER_ACTOR:ServeController.__init__() (pid=22146, ip=x.x.x.x, actor_id=9a9a8479cde4a0a033e57ee602000000, repr=<ray.serve._private.controller.ServeController object at 0x7f6dfad5de40>)
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/controller.py", line 177, in __init__
    self.deployment_state_manager = DeploymentStateManager(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 2348, in __init__
    self._recover_from_checkpoint(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 2457, in _recover_from_checkpoint
    deployment_to_current_replicas = self._map_actor_names_to_deployment(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 2390, in _map_actor_names_to_deployment
    replica_id = ReplicaID.from_full_id_str(replica_name)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/common.py", line 70, in from_full_id_str
    raise ValueError(
ValueError: Given replica ID string SERVE_REPLICA::model1#infer_actor#model1#infer_actor#15k12vcw didn't match expected pattern, ensure it has either two or three fields with delimiter '#'.
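
To illustrate why the string in the error message fails, here is a minimal sketch of a delimiter-based parser. It is not the actual Ray code: `parse_replica_id` is a hypothetical stand-in for `ReplicaID.from_full_id_str()`, and the field order (app name, deployment name, unique ID) is an assumption inferred from the "two or three fields" wording in the error. The `app#model1#15k12vcw` input is also made up for contrast.

```python
REPLICA_PREFIX = "SERVE_REPLICA::"  # prefix as it appears in the error message
DELIMITER = "#"


def parse_replica_id(full_id: str) -> tuple[str, str, str]:
    """Hypothetical stand-in for ReplicaID.from_full_id_str().

    Assumes the layout is [app#]deployment#unique_id, which is a guess
    based on the error text, not a copy of the Ray implementation.
    """
    body = full_id.removeprefix(REPLICA_PREFIX)
    fields = body.split(DELIMITER)
    if len(fields) == 3:
        app, deployment, unique_id = fields
    elif len(fields) == 2:
        app = ""
        deployment, unique_id = fields
    else:
        # A '#' inside a name inflates the field count and lands here.
        raise ValueError(
            f"{full_id!r} split into {len(fields)} fields, expected 2 or 3"
        )
    return app, deployment, unique_id


# A name without '#' splits cleanly into three fields.
print(parse_replica_id("SERVE_REPLICA::app#model1#15k12vcw"))

# The replica name from the stack trace splits into five fields instead.
try:
    parse_replica_id(
        "SERVE_REPLICA::model1#infer_actor#model1#infer_actor#15k12vcw"
    )
except ValueError as exc:
    print(exc)
```

Because the split cannot tell a '#' inside a name from a '#' between fields, any parser of this shape has to either reject such names when the deployment is created or use a delimiter that cannot appear in a name.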

Versions / Dependencies

Ray 2.36.0, python 3.10, linux

Reproduction script

  1. Create deployment with a # in the name.
  2. Crash your cluster in some horrible way that makes you have to recover from a checkpoint.
  3. Profit?

Issue Severity

Low: It annoys or frustrates me.

@chmeyers chmeyers added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 24, 2024
@edoakes edoakes added P1 Issue that should be fixed within a few weeks serve Ray Serve Related Issue and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 24, 2024
@jcotant1 jcotant1 added serve Ray Serve Related Issue and removed serve Ray Serve Related Issue labels Nov 20, 2024
@abrarsheikh abrarsheikh linked a pull request Feb 28, 2025 that will close this issue