Retry on slack failures #4191

Open
KatieMSB opened this issue Dec 12, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@KatieMSB
Collaborator

Describe the Bug:
When updating User Groups for on-call schedule notifications, if Slack returns an error, the following message is posted in the schedule's on-call notification Slack channel:

Hey everyone! I couldn't update @<user_group> because I ran into a problem. Maybe touch base with the GoAlert admin(s) to see if they can help? I'm sorry for the inconvenience!

Here's the ID I left with the error in my logs so they can find it:
SlackUGErrorID=<service_id>

This behavior occurs when encountering a fatal_error response from the Slack API. We should implement retry logic in this scenario, or at least offer the user an option to manually trigger a retry.

Expected Behavior:
GoAlert should either automatically retry the request when encountering a fatal_error response from Slack or provide users with the option to trigger a retry within the error message.
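To illustrate the automatic option, here is a minimal sketch of how a Slack response could be classified as retryable. It assumes the standard Slack Web API response shape (`ok` plus an `error` string); `slackResponse`, `retryableSlackErrors`, and `shouldRetry` are hypothetical names for illustration, not existing GoAlert code, and the allowlist contains only the `fatal_error` case described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// slackResponse holds the two fields shared by Slack Web API responses.
type slackResponse struct {
	OK    bool   `json:"ok"`
	Error string `json:"error"`
}

// retryableSlackErrors is an assumed allowlist for this sketch; only
// fatal_error is taken from the issue description.
var retryableSlackErrors = map[string]bool{
	"fatal_error": true,
}

// shouldRetry reports whether a raw Slack response body describes a
// failure that is worth retrying.
func shouldRetry(body []byte) (bool, error) {
	var r slackResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false, fmt.Errorf("decode slack response: %w", err)
	}
	if r.OK {
		return false, nil // success, nothing to retry
	}
	return retryableSlackErrors[r.Error], nil
}

func main() {
	retry, err := shouldRetry([]byte(`{"ok":false,"error":"fatal_error"}`))
	fmt.Println(retry, err)
}
```

Errors outside the allowlist (for example `invalid_auth`) would not be retried in this sketch, since repeating the same call is unlikely to help.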

Application Version:
All versions as of v0.33.0

@KatieMSB KatieMSB added the bug Something isn't working label Dec 12, 2024
@mastercactapus
Member

Let's handle this as part of the job queue transition -- we'll get easier retry logic that way for free, and can have the final error (if it fails) show up with this message.

The main complexity is that I don't think today's system has a way to know "is this the final try?", so if the group update call fails it falls back to sending a message, and marks it as delivered if that works. We should instead implement something where it only tries to update the user group and, if that fails the maximum number of times, schedules a separate job to send the error message (which can then retry).
