[Bug] RayJob does not shut down the submitter pod properly #2359
Labels
bug
Something isn't working
external-author-action-required
P1
Issue that should be fixed within a few weeks
rayjob
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
With KubeRay v1.0.0, in some cases, especially when a RayJob requests a lot of resources and runs for a long time (more than half an hour), the job completes but the submitter's log output does not: no normal success message is printed, although the final output of the job is visible in the dashboard. The RayJob then stays stuck in that state, and the submitter pod is not cleaned up as expected.
The status information returned by KubeRay is shown in the figure below.
After I upgraded to v1.1.1, not only was the submitter pod not cleaned up, but the head node was not cleaned up either. The jobDeploymentStatus field stayed at Running, and nothing else changed.
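For anyone trying to confirm the same state on their cluster, here is a rough sketch of a check over the RayJob status. It assumes the status has been loaded as a dict (for example from `kubectl get rayjob <name> -o json`, taking the `status` key). The report above only confirms that jobDeploymentStatus stays at Running; treating a terminal jobStatus value as "the Ray job has finished" is an assumption, and `looks_stuck` is a hypothetical helper name, not part of KubeRay.

```python
from typing import Any, Mapping

# Assumption: these are the terminal values that .status.jobStatus can take.
TERMINAL_JOB_STATUSES = {"SUCCEEDED", "FAILED", "STOPPED"}


def looks_stuck(status: Mapping[str, Any]) -> bool:
    """Return True if the Ray job has ended but the RayJob still reports Running.

    `status` is the `.status` object of a RayJob custom resource.
    """
    job_status = status.get("jobStatus", "")
    deployment_status = status.get("jobDeploymentStatus", "")
    # Stuck pattern described in this issue: the job itself is done,
    # yet jobDeploymentStatus never leaves "Running".
    return job_status in TERMINAL_JOB_STATUSES and deployment_status == "Running"
```

A single True result is not conclusive, since the operator needs a short time to move from Running to its final state after the job ends. The check is only meaningful if it keeps returning True well after the job's final output has appeared in the dashboard.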
Reproduction script
This is easy to reproduce with any RayJob that occupies a lot of resources and runs for a long time.
Anything else
No response
Are you willing to submit a PR?