Fatal error - Another operation of type DeregisterInstance is in progress #5134

Open
stefaneg opened this issue Feb 28, 2025 · 0 comments · May be fixed by #5135
Labels
kind/bug Categorizes issue or PR as related to a bug.

What happened:
With the DNS controller configured to use the aws-sd provider, it occasionally exits with a fatal error, apparently because of a race condition, and eventually enters a crash loop.

{"level":"fatal","msg":"Failed to do run once: operation error ServiceDiscovery: RegisterInstance, https response error StatusCode: 400, RequestID: d270d6a7-a36e-44a7-b88d-72d197c38578, DuplicateRequest: Another operation of type DeregisterInstance and id tr2szldps72jcdtoj2oyb3ckwplbl55r-6buit117 is in progress","time":"2025-02-25T13:51:21Z"}
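For anyone triaging similar crash loops, the following is a minimal Go sketch that checks whether a JSON log line is this particular failure and extracts the conflicting operation type and operation id. It assumes only the log shape shown above; the names `logEntry`, `duplicateRe`, and `parseDuplicate` are hypothetical helpers and are not part of External-DNS or the AWS SDK.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// logEntry mirrors the three fields visible in the log line above.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
	Time  string `json:"time"`
}

// duplicateRe matches the DuplicateRequest wording from the log line above and
// captures the in-flight operation type and its operation id.
var duplicateRe = regexp.MustCompile(`DuplicateRequest: Another operation of type (\w+) and id (\S+) is in progress`)

// parseDuplicate reports the in-flight operation type and id when the line is
// a fatal DuplicateRequest entry; ok is false for any other line.
func parseDuplicate(line string) (opType, opID string, ok bool) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil || e.Level != "fatal" {
		return "", "", false
	}
	m := duplicateRe.FindStringSubmatch(e.Msg)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := `{"level":"fatal","msg":"Failed to do run once: operation error ServiceDiscovery: RegisterInstance, https response error StatusCode: 400, RequestID: d270d6a7-a36e-44a7-b88d-72d197c38578, DuplicateRequest: Another operation of type DeregisterInstance and id tr2szldps72jcdtoj2oyb3ckwplbl55r-6buit117 is in progress","time":"2025-02-25T13:51:21Z"}`
	opType, opID, ok := parseDuplicate(line)
	fmt.Println(opType, opID, ok)
}
```

Counting how often `parseDuplicate` matches across pod restarts could help confirm that the crash loop is driven by this one error rather than by several different ones.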

What you expected to happen:
The aws-sd provider should complete registration and deregistration successfully without exiting with a fatal error.

How to reproduce it (as minimally and precisely as possible):
The error seems to appear when pods are rescheduled between nodes, usually because Karpenter is rebalancing the cluster. Rescheduling changes the pods' IP addresses, so the Route53 records have to be recreated.
A minimal reproduction has not been attempted, as we have a fix.

Anything else we need to know?:
We have been running a patched version of the DNS controller for our private namespace for approximately 3 years. The patch was believed to be only an optimisation, but it turns out that it also fixes this issue when interacting with the AWS API.

A PR that fixes this issue is forthcoming.

A PR for this fix was filed previously: #3123

Environment:

  • External-DNS version (use external-dns --version):
    0.15.1

  • DNS provider:
    AWS Route53

  • Others:
    Deployment configuration:

    - args:
        - --log-level
        - info
        - --log-format
        - json
        - --provider
        - aws-sd
        - --registry
        - aws-sd
        - --policy
        - sync
        - --interval
        - 10s
        - --source
        - service
        - --aws-api-retries
        - "3"
        - --domain-filter
        - company.local
        - --aws-zone-type
        - private
        - --annotation-filter
        - dns.company.com/type=internal
        - --fqdn-template
        - '{{index .ObjectMeta.Annotations "dns.company.com/label"}}.company.local'
@stefaneg stefaneg added the kind/bug Categorizes issue or PR as related to a bug. label Feb 28, 2025
@stefaneg stefaneg linked a pull request Feb 28, 2025 that will close this issue