core: fix unrecoverable freezes of rabbit's consumer #10594

bougue-pe · 2025-01-30T10:16:48Z

Hard fix (kind of): kill the process if a thread's exception propagates up to main(), or if rabbit's cancel/shutdown notification is received (even if other threads are still running), and let the orchestrator restart it.

Also bump amqp-client while at it, as it doesn't hurt.

Fixes #8621
Also reproduced and fixed the case that leads to the following (different) core logs:

[11:16:04,880] [INFO]          [WorkerCommand] consume shutdown: amq.ctag-LsXBfmFdL6n758icthlCYA, com.rabbitmq.client.ShutdownSignalException: connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - broker forced connection closure with reason 'shutdown', class-id=0, method-id=0)
[11:16:04,883] [WARN]  [ForgivingExceptionHandler] An unexpected connection driver error occurred (Exception message: Connection reset by peer)
Exception in thread "main" com.rabbitmq.client.AlreadyClosedException: connection is already closed due to connection error; protocol method: #method<connection.close>(reply-code=320, reply-text=CONNECTION_FORCED - broker forced connection closure with reason 'shutdown', class-id=0, method-id=0)
        at com.rabbitmq.client.impl.AMQConnection.startShutdown(AMQConnection.java:1012)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1127)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1056)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1040)
        at com.rabbitmq.client.impl.AMQConnection.close(AMQConnection.java:1032)
        at com.rabbitmq.client.impl.recovery.AutorecoveringConnection.close(AutorecoveringConnection.java:289)
        at kotlin.io.CloseableKt.closeFinally(Closeable.kt:56)
        at fr.sncf.osrd.cli.WorkerCommand.run(WorkerCommand.kt:319)
        at fr.sncf.osrd.App.main(App.java:44)
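
The trace above shows the exception being thrown by close() at the end of a Kotlin `use` block (`closeFinally`), after the broker had already closed the connection. Below is a minimal sketch of that propagation, and of a top-level catch turning it into an exit code, using only the standard library. `BrokenConnection` and `runWorker` are hypothetical stand-ins for illustration, not the actual WorkerCommand code or amqp-client classes.

```kotlin
import java.io.Closeable

// Hypothetical stand-in for a connection that the broker has already closed:
// calling close() on it throws, as AMQConnection.close() does in the trace.
class BrokenConnection : Closeable {
    override fun close() {
        throw IllegalStateException("connection is already closed")
    }
}

// Runs `body` with the connection and returns a process exit code.
// `use` rethrows an exception thrown by close() when the body itself succeeded,
// so the catch below is what turns that failure into a non-zero code.
fun runWorker(connection: Closeable, body: () -> Unit): Int =
    try {
        connection.use { body() }
        0
    } catch (e: Exception) {
        System.err.println("worker failed: ${e.message}")
        1
    }
```

A caller such as main() could pass the returned code to `kotlin.system.exitProcess`, which terminates the JVM even if non-daemon threads are still alive.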

Hand-tested

Run classic (one per infra) core with rabbitmq up, but without editoast to load the infra from: ✅core crashes.
Run single-worker core with the whole stack running: ✅core works and OSRD does its job.
Run single-worker core with the whole stack, then stop rabbitmq: ✅core exits (on shutdown notification).
Run full stack in single-worker mode, then remove the core-req-all queue to trigger a cancel notification: ✅core logs "consumer cancelled" then stops (should cover "Core message consumer fails and blocks the scenario" #8621).
Run full stack + single-worker editoast (no core), initiate some core requests (using front).
- Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✅core crashes and releases unacked messages.
- Then start editoast, then start core: ✅core works and OSRD does its job.

Then stop editoast and start single-worker core (so impossible to load infra inside DeliverCallback): ✔️❓Exceptions are thrown in threads when trying to load infra for pending requests, but core stays alive (leaving pending requests unacked until core is stopped, after the step below).
➡️We can try/catch in the callback function and call exitProcess to force shutdown. DONE in the last commit.

~~Then start editoast (keep core as-is) and initiate some new core request (using front): ✅core "correctly" processing only new requests.~~

~~Stop core: ✅Unacked requests are back to ready.~~

~~Start core: ✅Ready requests are processed.~~
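
The try/catch-and-exit approach mentioned above could look roughly like the following sketch. `exitOnFailure` is a hypothetical helper, not the code from this PR. It is written against plain function types so it runs without amqp-client; in the worker, the wrapped function would be the body of the DeliverCallback (an assumption about how it would be wired).

```kotlin
import kotlin.system.exitProcess

// Wraps a message handler so that any failure terminates the process instead of
// leaving the worker alive with unacked messages.
// `exit` is injectable so the behavior can be tested without killing the JVM.
fun <T> exitOnFailure(
    exit: (Int) -> Unit = { code -> exitProcess(code) },
    handler: (T) -> Unit
): (T) -> Unit = { message ->
    try {
        handler(message)
    } catch (t: Throwable) {
        System.err.println("handler failed, forcing shutdown: $t")
        exit(1)
    }
}
```

Exiting the process closes the connection, which is what lets the broker move unacked messages back to ready, as observed in the hand-tests.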

Understanding of previous and current work

core: multithread core workers #9591 broke the expected behavior of core crashing when an exception is raised up to main (I still don't understand exactly what prevents the app from stopping).
In particular, core: improve worker lifetime #9439 was worked on (and, as far as I remember, correctly hand-tested) at the same time, but once merged (after core: multithread core workers #9591) it "silently" didn't work 😞.

Current:

- ShutdownCallback appears to be executed in the main thread (triggering the newly added catch in main(), then the final return), so it may be worth using the shutdown notification to exit cleanly.
- The thread executors interfere with exception handling.
- The CancelCallback doesn't seem to trigger a "join" or a shutdown process (because of threads? or rather because it's up to the implementation to decide what's graceful?).
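
One general JVM behavior that may be related to the executor remark above (an assumption, not a confirmed diagnosis of the core code): an exception thrown inside a task submitted to an `ExecutorService` is stored in the returned `Future` rather than propagated to the submitting thread. A minimal sketch, where `failureSeenThroughFuture` is a hypothetical name:

```kotlin
import java.util.concurrent.ExecutionException
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Submits a task that always fails and returns the exception as seen by the caller.
// The submitting thread is not interrupted by the failure: the exception is stored
// in the Future and only surfaces, wrapped, if someone calls get().
fun failureSeenThroughFuture(): Throwable? {
    val pool = Executors.newFixedThreadPool(1)
    try {
        val future = pool.submit(Runnable { throw IllegalStateException("cannot load infra") })
        return try {
            future.get(5, TimeUnit.SECONDS)
            null
        } catch (e: ExecutionException) {
            e.cause
        }
    } finally {
        // Pool threads are non-daemon by default: without shutdown() they keep the JVM alive.
        pool.shutdown()
    }
}
```

If nothing ever calls `get()`, the failure stays invisible to main(), and the idle pool threads can keep the process running.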

Looks like some improvements could be made (to be explored later, sorted by ROI)

- Improve or make more explicit some of the work done in core: improve worker lifetime #9439 (use isRecoverable, clean up consumer logs, properly close channels and connections)
- Avoid sharing channels between threads, as stated in the dedicated documentation
- Handle shutdown in a more standard way (on exceptions or on notification) by issuing a shutdown notification, or by sharing a signal (or a common variable?)
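
As an illustration of the "shared signal" idea in the last item, here is a minimal sketch built on `CountDownLatch`. `ShutdownSignal` is a hypothetical class, not something that exists in core. The assumption is that the consumer callbacks and worker threads would call `request()` and the main thread would block on `await()` before closing resources and exiting.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicReference

// Hypothetical shared signal: any thread or callback can request shutdown,
// and the main thread waits on it before closing resources and exiting.
class ShutdownSignal {
    private val latch = CountDownLatch(1)
    private val reason = AtomicReference<String?>(null)

    // Records the first reason only; later requests are ignored.
    fun request(why: String) {
        if (reason.compareAndSet(null, why)) latch.countDown()
    }

    // Blocks until shutdown is requested or the timeout elapses.
    // Returns the recorded reason, or null on timeout.
    fun await(timeoutMs: Long): String? =
        if (latch.await(timeoutMs, TimeUnit.MILLISECONDS)) reason.get() else null
}
```

Keeping the first reason makes the logged cause of the exit unambiguous when several threads fail at roughly the same time.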

codecov-commenter · 2025-01-30T10:19:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.93%. Comparing base (b3a6f01) to head (e6c5f8c).
Report is 93 commits behind head on dev.

Additional details and impacted files

@@            Coverage Diff             @@
##              dev   #10594      +/-   ##
==========================================
- Coverage   81.93%   81.93%   -0.01%     
==========================================
  Files        1079     1079              
  Lines      107380   107376       -4     
  Branches      737      737              
==========================================
- Hits        87984    87978       -6     
- Misses      19356    19358       +2     
  Partials       40       40

| Flag | Coverage Δ | |
|---|---|---|
| editoast | `74.28% <ø> (-0.01%)` | ⬇️ |
| front | `89.47% <ø> (-0.01%)` | ⬇️ |
| gateway | `2.18% <ø> (ø)` | |
| osrdyne | `3.28% <ø> (ø)` | |
| railjson_generator | `87.50% <ø> (ø)` | |
| tests | `88.14% <ø> (ø)` | |

core/src/main/java/fr/sncf/osrd/cli/WorkerCommand.kt

Hard fix (kind of) to kill the process (and let orchestrator restart) if:
* rabbit shuts down (triggering consumer's ShutdownCallback)
* or an exception before starting basicConsume() goes up to main() (even if other threads may run)

Bump amqp-client on the way as it doesn't hurt

Signed-off-by: Pierre-Etienne Bougué <[email protected]>

From hand-tests, shutdown is already covered by the System.exit in App.java::main().

Signed-off-by: Pierre-Etienne Bougué <[email protected]>

…llback

Hard fix (kind of) to kill the process (and let orchestrator restart) if an exception goes all the way up to the DeliverCallback. For example when not able to reach editoast for infra reload. This will release unacked messages and move them back to ready (instead of keeping them unacked until the worker exits).

Signed-off-by: Pierre-Etienne Bougué <[email protected]>

bougue-pe · 2025-02-06T15:42:20Z

A third commit was pushed, and the main comment updated (and all rebased on dev).

No more work is planned on this. Please read the main comment; any feedback is welcome, and then we should be good to go 🙏

bougue-pe · 2025-02-06T15:43:56Z

~~There is a bit more work, actually (for me): test if it improves the case described in #10704~~
EDIT: It does not improve things. It looks like a different issue, so I'm not investigating it further.

eckter

Thanks for handling this!

Khoyo

Great work!

Improve the expressiveness when handling exceptions in rabbit worker. After #10594 and #9439. Also minor code/log improvements. Signed-off-by: Pierre-Etienne Bougué <[email protected]>

bougue-pe requested review from Khoyo and ElysaSrc January 30, 2025 10:16

bougue-pe requested a review from a team as a code owner January 30, 2025 10:16

github-actions bot added the area:core Work on Core Service label Jan 30, 2025

eckter reviewed Jan 30, 2025

bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch from 06726c5 to ee9d8aa Compare January 30, 2025 10:42

bougue-pe self-assigned this Feb 5, 2025

bougue-pe marked this pull request as draft February 5, 2025 07:14

bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch 2 times, most recently from b9647fc to c8bd86f Compare February 5, 2025 15:31

bougue-pe requested a review from woshilapin February 5, 2025 17:49

bougue-pe marked this pull request as ready for review February 5, 2025 17:49

woshilapin approved these changes Feb 6, 2025

bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch 2 times, most recently from 54ad82d to 17f2e86 Compare February 6, 2025 14:57

bougue-pe added 3 commits February 6, 2025 16:33

core: force process exit on rabbit's cancel notifications

d0121d2

From hand-tests, shutdown is already covered by the System.exit in App.java::main().

Signed-off-by: Pierre-Etienne Bougué <[email protected]>

bougue-pe force-pushed the peb/core/fix_core_freeze_on_rabbit_shutdown branch from 3ecda5e to e6c5f8c Compare February 6, 2025 15:37

bougue-pe changed the title ~~core: fix unrecoverable freeze when rabbit shuts down~~ core: fix unrecoverable freezes of rabbit's consumer Feb 6, 2025

bougue-pe requested a review from eckter February 11, 2025 09:39

eckter approved these changes Feb 11, 2025

bougue-pe removed their assignment Feb 11, 2025

Khoyo approved these changes Feb 13, 2025

Khoyo added this pull request to the merge queue Feb 13, 2025

Merged via the queue into dev with commit 2752095 Feb 13, 2025
27 checks passed

Khoyo deleted the peb/core/fix_core_freeze_on_rabbit_shutdown branch February 13, 2025 10:11

bougue-pe mentioned this pull request Feb 13, 2025

core: extend use of osrdErrorType.isRecoverable #10801

Merged

flomonster mentioned this pull request Feb 19, 2025

timeouts on train schedules summaries #10704

Open
