core: fix unrecoverable freezes of rabbit's consumer #10594

Merged on Feb 13, 2025 (3 commits)

bougue-pe (Contributor) commented Jan 30, 2025

Hard fix (of sorts): kill the process if a thread's exception propagates up to main(), or if rabbit's cancel/shutdown notification is received (even if other threads are still running), and let the orchestrator restart it.

Bump amqp-client along the way, as it doesn't hurt.

Fixes #8621
Also reproduced and fixed the case that leads to the following (different) core logs:

[11:16:04,880] [INFO]          [WorkerCommand] consume shutdown: amq.ctag-LsXBfmFdL6n758icthlCYA, com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - broker forced connection closure with reason 'shutdown', class-id=0, method-id=0)
[11:16:04,883] [WARN]  [ForgivingExceptionHandler] An unexpected connection driver error occurred (Exception message: Connection reset by peer)
Exception in thread "main" com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - broker forced connection closure with reason 'shutdown', class-id=0, method-id=0)
        at com.rabbitmq.client.impl.AMQConnection.startShutdown(AMQConnection.java:1012)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1127)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1056)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1040)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1032)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.close(AutorecoveringConnection.java:289)
        at kotlin.io.CloseableKt.closeFinally(Closeable.kt:56)
        at fr.sncf.osrd.cli.WorkerCommand.run(WorkerCommand.kt:319)
        at fr.sncf.osrd.App.main(App.java:44)
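
The top-level behaviour described above (exit as soon as an exception reaches main(), even if other threads are still alive) can be sketched roughly as follows. This is a minimal illustration, not the actual App.java code: `FailFastMain`, `runAndExit`, and the injectable `exit` parameter are hypothetical names, introduced so the exit path can be exercised without stopping the JVM.

```java
import java.util.concurrent.Callable;
import java.util.function.IntConsumer;

// Hypothetical helper for illustration; not the project's real code.
final class FailFastMain {
    /**
     * Runs the worker and always reports an exit code through `exit`.
     * With System::exit as `exit`, the JVM stops even if non-daemon
     * threads (for example consumer threads) are still running.
     */
    static void runAndExit(Callable<Integer> worker, IntConsumer exit) {
        int code;
        try {
            code = worker.call();
        } catch (Throwable t) {
            // Anything that reaches the top level is treated as fatal.
            System.err.println("worker failed, exiting: " + t);
            code = 1;
        }
        exit.accept(code);
    }
}
```

In a real entry point the second argument would presumably be `System::exit`, leaving the restart to the orchestrator.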

Hand-tested

  • Run classic (one per infra) core with rabbitmq up, without editoast to load infra from: ✅core crashes.
  • Run single-worker core with the whole stack running: ✅core works and OSRD does its job.
  • Run single-worker core with the whole stack, then stop rabbitmq: ✅core exits (on shutdown notification).
  • Run full stack single-worker mode, then remove the core-req-all queue to initiate cancel notification: ✅core logs "consumer cancelled" then stops (should cover Core message consumer fails and blocks the scenario #8621).
  • Run full stack + single-worker editoast (no core), initiate some core requests (using front).
    • Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✅core crashes and releases unacked messages.
    • Then start editoast, then start core: ✅core works and OSRD does its job.
  • Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✔️❓Exceptions in threads when trying to load infra for pending requests, core stays alive (leaving pending requests unacked until core is stopped - after step below).
    ➡️We can try/catch in callback function and exitProcess to force shutdown.
    DONE in last commit.
  • Then start editoast (keep core as-is) and initiate some new core request (using front): ✅core "correctly" processing only new requests.
  • Stop core: ✅Unacked requests are back to ready.
  • Start core: ✅Ready requests are processed.
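
The "try/catch in callback function and exit" idea from the list above can be sketched like this. It is only an illustration under assumptions: the real handler is a RabbitMQ DeliverCallback in Kotlin, whereas this sketch uses a plain `Consumer<String>` and a hypothetical `FailFastCallback.wrap` helper with an injectable `exit` action.

```java
import java.util.function.Consumer;
import java.util.function.IntConsumer;

// Hypothetical wrapper for illustration; names are not from the project.
final class FailFastCallback {
    /**
     * Wraps a message handler so that a failure triggers `exit`
     * instead of being lost inside a consumer thread.
     */
    static Consumer<String> wrap(Consumer<String> handler, IntConsumer exit) {
        return message -> {
            try {
                handler.accept(message);
            } catch (RuntimeException e) {
                // Exiting lets the broker requeue messages that were never acked.
                System.err.println("message handling failed, exiting: " + e);
                exit.accept(1);
            }
        };
    }
}
```

Passing `System::exit` as `exit` would end the process on the first failed delivery, which matches the observed effect of unacked messages going back to ready.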

Understanding of previous and current work

Previous:

Current:

  • ShutdownCallback appears to run in the main thread (triggering the newly added catch in main(), then the final return), so the shutdown notification might be a way to exit cleanly
  • The thread executors interfere with exception handling
  • CancelCallback doesn't seem to trigger a "join" or a shutdown process (because of threads? or, more likely, because it's up to the implementation to decide what's graceful?)
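
One common way executors hide failures, which may or may not be the exact mechanism at play here, is that `ExecutorService.submit` stores a task's exception in the returned `Future` instead of propagating it. The sketch below uses only the JDK; `ExecutorFailureDemo` and `failureOf` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only; not taken from the project's code.
final class ExecutorFailureDemo {
    /** Returns the exception captured by the Future, or null if the task succeeded. */
    static Throwable failureOf(Callable<Void> task) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // submit() does not rethrow: the failure is stored inside the Future.
            Future<Void> future = pool.submit(task);
            future.get(); // only get() surfaces it, wrapped in ExecutionException
            return null;
        } catch (ExecutionException e) {
            return e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            pool.shutdown();
        }
    }
}
```

If nothing ever calls `get()`, the failure stays invisible to main(), which is one reason to handle errors inside the task itself.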

Some improvements look possible (to be explored later, sorted by ROI)

  • Improve or make explicit some of the work done in "core: improve worker lifetime" #9439 (use isRecoverable, clean up consumer logs, properly close channels and connections)
  • Avoid sharing channels between threads, as stated in the dedicated documentation
  • Handle shutdown in a more standard way (on exceptions or on notification) by issuing a shutdown notification, or by sharing a signal (or a common variable?)
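
The "shared signal" option in the last item could look roughly like the sketch below. It is an assumption about one possible design, not something this PR implements: `ShutdownSignal`, `request`, and `awaitExitCode` are hypothetical names, and only JDK classes are used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical design sketch; not part of the PR.
final class ShutdownSignal {
    private static final int UNSET = -1;
    private final CountDownLatch latch = new CountDownLatch(1);
    private final AtomicInteger exitCode = new AtomicInteger(UNSET);

    /** Callable from any thread or callback; only the first request sets the code. */
    void request(int code) {
        exitCode.compareAndSet(UNSET, code);
        latch.countDown();
    }

    /** Blocks until a shutdown is requested, then returns the first requested code. */
    int awaitExitCode() {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 1; // treat interruption as an abnormal exit
        }
        return exitCode.get();
    }
}
```

With such a signal, the cancel, shutdown, and delivery-failure paths would each call `request`, while the main thread waits on `awaitExitCode()`, closes channels and connections, and then exits with the returned code.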

@bougue-pe bougue-pe requested review from Khoyo and ElysaSrc January 30, 2025 10:16
@bougue-pe bougue-pe requested a review from a team as a code owner January 30, 2025 10:16
@github-actions github-actions bot added the area:core Work on Core Service label Jan 30, 2025
codecov-commenter commented Jan 30, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.93%. Comparing base (b3a6f01) to head (e6c5f8c).
Report is 93 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev   #10594      +/-   ##
==========================================
- Coverage   81.93%   81.93%   -0.01%     
==========================================
  Files        1079     1079              
  Lines      107380   107376       -4     
  Branches      737      737              
==========================================
- Hits        87984    87978       -6     
- Misses      19356    19358       +2     
  Partials       40       40              
Flag Coverage Δ
editoast 74.28% <ø> (-0.01%) ⬇️
front 89.47% <ø> (-0.01%) ⬇️
gateway 2.18% <ø> (ø)
osrdyne 3.28% <ø> (ø)
railjson_generator 87.50% <ø> (ø)
tests 88.14% <ø> (ø)

@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch from 06726c5 to ee9d8aa Compare January 30, 2025 10:42
@bougue-pe bougue-pe self-assigned this Feb 5, 2025
@bougue-pe bougue-pe marked this pull request as draft February 5, 2025 07:14
@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch 2 times, most recently from b9647fc to c8bd86f Compare February 5, 2025 15:31
@bougue-pe bougue-pe requested a review from woshilapin February 5, 2025 17:49
@bougue-pe bougue-pe marked this pull request as ready for review February 5, 2025 17:49
@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch 2 times, most recently from 54ad82d to 17f2e86 Compare February 6, 2025 14:57
Hard fix (kind of) to kill the process (and let orchestrator restart) if:
* rabbit shuts down (triggering consumer's ShutdownCallback)
* or an exception before starting basicConsume() goes up to main()
  (even if other threads may run)

Bump amqp-client on the way as it doesn't hurt

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
From hand-tests, shutdown is already covered by the System.exit in
App.java::main().

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
…llback

Hard fix (kind of) to kill the process (and let orchestrator restart) if an
exception goes all the way up to the DeliverCallback.
For example when not able to reach editoast for infra reload.
This will release unacked messages and move them back to ready (instead
of keeping them unacked until the worker exits).

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
@bougue-pe bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch from 3ecda5e to e6c5f8c Compare February 6, 2025 15:37
@bougue-pe bougue-pe changed the title core: fix unrecoverable freeze when rabbit shuts down core: fix unrecoverable freezes of rabbit's consumer Feb 6, 2025
@bougue-pe
Contributor Author

A third commit was pushed, the main comment was updated, and everything was rebased on dev.

No more work is planned on this. Please read the main comment; any feedback is welcome, and then we should be good to go 🙏

@bougue-pe
Contributor Author

bougue-pe commented Feb 6, 2025

There is a bit more work, actually (for me): test whether it improves the case described in #10704
EDIT: It does not improve things. It looks like a different issue, so I'm not investigating it further.

@bougue-pe bougue-pe requested a review from eckter February 11, 2025 09:39
Contributor

@eckter eckter left a comment

Thanks for handling this!

@bougue-pe bougue-pe removed their assignment Feb 11, 2025
Contributor

@Khoyo Khoyo left a comment

Great work!

@Khoyo Khoyo added this pull request to the merge queue Feb 13, 2025
Merged via the queue into dev with commit 2752095 Feb 13, 2025
27 checks passed
@Khoyo Khoyo deleted the peb/core/fix_core_freeze_on_rabbit_shutdown branch February 13, 2025 10:11
bougue-pe added a commit that referenced this pull request Feb 13, 2025
Improve the expressiveness when handling exceptions in rabbit worker.
After #10594 and #9439.

Also minor code/log improvements.

Signed-off-by: Pierre-Etienne Bougué <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Feb 14, 2025
Labels
area:core Work on Core Service
Development

Successfully merging this pull request may close these issues.

Core message consumer fails and blocks the scenario
5 participants