
Don't wait forever in cc-worker #256

Open

sethboyles opened this issue Aug 1, 2022 · 6 comments · May be fixed by #464

Comments

@sethboyles
Member

sethboyles commented Aug 1, 2022

Thanks for submitting an issue to capi-release. We are always trying to improve! To help us, please fill out the following template.

Issue

cc-worker can block bosh from cancelling a deployment, since the post-start script waits forever:

while ! nc -z localhost <%= p("cc.readiness_port.cloud_controller_worker") %>

Steps to Reproduce

CC migrations fail, or the pre-start script fails in another way

Expected result

the deployment fails, the task is cancelled, and the lock is deleted

Current result

the deployment fails, the task hangs when cancelled, and the lock is not deleted

Possible Fix

introduce some kind of timeout in the post-start script
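
A minimal sketch of what a bounded wait could look like, keeping the existing nc-based check; the 300-second limit is an arbitrary placeholder, not a value taken from the release:

# Give up after ~300s instead of looping forever, so a failed pre-start
# (e.g. failed migrations) no longer blocks bosh from cancelling the task.
timeout_secs=300
elapsed=0
while ! nc -z localhost <%= p("cc.readiness_port.cloud_controller_worker") %>; do
  if [ "$elapsed" -ge "$timeout_secs" ]; then
    echo "cc-worker readiness port not reachable after ${timeout_secs}s" >&2
    exit 1
  fi
  sleep 1
  elapsed=$((elapsed + 1))
done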


@sethboyles
Member Author

I think this was originally added because of this issue: #165, though I don't quite understand how it addresses it. See cloudfoundry/cloud_controller_ng@85b7ac8

BackgroundJobEnvironment still waits for the DB to be migrated, so it seems to me that the original issue would still occur: a cc-worker would never progress to the readiness port. I suppose it limits the failure to a single cc-worker?

@Samze
Contributor

Samze commented Aug 9, 2024

Just documenting my understanding of this.

The current behaviour when pre-start succeeds:

  1. CC API starts and begins running migrations, which eventually succeed
  2. CC worker starts
    1. Creates BackgroundJobEnvironment
    2. That then loads models, which waits on migrations to complete. This has a timeout of 2 weeks.
    3. Migrations run successfully, so we stop waiting
    4. If we're not the first worker process,
      1. We start processing jobs
    5. If we're the first worker process on the vm,
      1. We listen on the readiness port
      2. When we receive a connection, we unblock
      3. We start processing jobs
  3. CC worker post-start starts
    1. Loops trying to connect to the readiness port
    2. When we establish a connection, the post-start script succeeds (shell sketch of this handshake below)
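
A rough shell sketch of the handshake described above; the real worker side is Ruby in cloud_controller_ng, port 8888 is a placeholder, and the nc flags differ between netcat variants:

# Worker side (Ruby in reality): after migrations complete, the first worker
# process listens once on the readiness port, then goes on to process jobs.
nc -l 8888 >/dev/null &   # BSD nc; traditional netcat needs `nc -l -p 8888`

# post-start side: poll until a TCP connection to the readiness port succeeds.
while ! nc -z localhost 8888; do
  sleep 1
done
echo "readiness port reachable; post-start succeeds"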

Now in the case when pre-start fails:

  1. CC API starts and begins running migrations, which fail
  2. CC worker starts
    1. Creates BackgroundJobEnvironment
    2. That then loads models, which waits on migrations to complete... we wait forever.
  3. CC worker post-start starts
    1. Loops trying to connect to the readiness port
    2. Waits forever

Thoughts/Questions:

  1. The issue described in CC-Worker should fail/stop if a migration is not completed yet #165 states that the CC worker started and began using tables that did not exist or had been modified because migrations hadn't run yet. Looking at the code, it seems like all CC workers call waiting_on_migrations!, so I don't understand how this is possible. The original issue also refers to a check_migrations method which no longer exists.
  2. This logic in the post-start script basically stops bosh from reporting the vm as ready until the worker vm has confirmed that migrations have run. Maybe this is important because migrations could take a long time to run? However, this is not blocking any worker from picking up jobs.
    1. I think it makes sense to only mark the bosh job as successful if migrations have passed. Though if there is an issue you will notice this status on the API vm, since pre-start failures will only surface there. So it's certainly an option to just remove this check from the post-start logic of the worker.
    2. To simplify this logic, could we not just effectively call waiting_for_migrations in the post-start script instead of doing this port dance? (Rough sketch after this list.)
  3. The property cc.readiness_port.cloud_controller_worker is described as being used for k8s deployments, but we use it here for regular bosh deployments. Is there something I'm missing about why this is important for k8s?
  4. The question remains: the current contract for api -> worker readiness seems to be that migrations have completed. This may not be the case if migrations failed or anything else in the pre-start script failed. So how can the worker post-start script determine a terminal failure in the api vm's pre-start script?
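
A rough sketch of the idea from 2.2, assuming a hypothetical one-off task (it does not exist in cloud_controller_ng today) that boots the app configuration and waits on migrations with a bound:

# Hypothetical: wait on migrations directly with a bound instead of the port
# dance. `db:wait_for_migrations` would need to be added to cloud_controller_ng.
if ! timeout 600 bundle exec rake db:wait_for_migrations; then
  echo "migrations did not complete within 10 minutes" >&2
  exit 1
fi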

Note: I'm probably missing some context here; I'll update this comment as I poke around more.

@johha
Contributor

johha commented Aug 12, 2024

Side note: There are parameters like max_migration_duration_in_minutes and max_migration_statement_runtime_in_seconds (only for psql) which allow operators to limit the total duration of migrations.

@Samze
Contributor

Samze commented Aug 27, 2024

Another interesting observation:

From this ticket https://www.pivotaltracker.com/story/show/171417894 it seems like the purpose of this was to allow non-HTTP jobs to have their readiness checked for k8s, and this readiness property was added to worker/clock/deployment_updater. However, it is disabled by default everywhere except the worker; see https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cc_deployment_updater/spec#L87C21-L87C39 and https://github.com/cloudfoundry/cloud_controller_ng/blob/main/lib/cloud_controller/background_job_environment.rb#L23.

This commit added it: 56a7240.

One option here would be to set the port back to the default of -1 and update the post-start logic to only perform the check if the port is > 0 (rough sketch below). 🤔 The consequence is that post-start will succeed and bosh will report success for the cc_worker job even when the process is stuck waiting for migrations. However, this is exactly what we do for deployment_updater and clock, so I'm not sure if there is something specific about the worker that would make that an issue.
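
A rough sketch of that option in the post-start template, guarding the existing check on the port property (mirroring how the -1 default effectively disables it for deployment_updater and clock):

<% if p("cc.readiness_port.cloud_controller_worker").to_i > 0 %>
# Only wait on the readiness port when it is explicitly enabled; with the
# default of -1 the post-start script would exit immediately.
while ! nc -z localhost <%= p("cc.readiness_port.cloud_controller_worker") %>; do
  sleep 1
done
<% end %>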

Another option is to be more aggressive and remove this entire check; however, I'm not sure if there is still a k8s use case for it.

Samze added a commit that referenced this issue Aug 27, 2024
Samze linked a pull request Aug 27, 2024 that will close this issue
@Samze
Contributor

Samze commented Aug 27, 2024

Another note: this is more important to fix now because it is much more likely to occur, given the increased chance of the cc api pre-start script failing due to the addition of the stack checker https://github.com/cloudfoundry/capi-release/blob/develop/jobs/cloud_controller_ng/templates/pre-start.sh.erb#L145. If the stack checker fails on a deprecated stack and there is a new migration in the upgrade, this bug will occur 100% of the time.

@tcdowney
Member

Another option is to be more aggressive and remove this entire check; however, I'm not sure if there is still a k8s use case for it.

I don't think there is a k8s use-case for this anymore. Cf-for-k8s, which included capi-k8s-release, was deprecated in 2021, and cloudfoundry/cloud_controller_ng#3140 pretty much ensured that Cloud Controller's native k8s support was removed. I think on Kubernetes we couldn't control the ordering of how these jobs started as well as we could on BOSH, so some of these weird things were put in place to make it more tolerant of that sort of rollout.

I don't remember much more than that, unfortunately, because I was mostly involved with the networking team at the time.
