
Kubernetes nodes cannot be provisioned any more in subnets tagged with sigs.k8s.io/cluster-api-provider-aws/association: secondary #5227

Open
cellux opened this issue Nov 24, 2024 · 3 comments
Labels
kind/bug · lifecycle/stale · needs-priority · needs-triage

Comments


cellux commented Nov 24, 2024

/kind bug

What steps did you take and what happened:

Upgraded the CAPA provider to v2.7.1 and then tried to upgrade one of my AWS clusters to a newer Kubernetes version.

During the rolling update of MachineDeployments, CAPA v2.7.1 rejected creation of new EC2 instances, saying "subnet XXXX belongs to a secondary CIDR block which won't be used to create instances."

What did you expect to happen:

The new EC2 instances should have been provisioned, as they were before the upgrade to v2.7.1.

Anything else you would like to add:

Downgrading CAPA provider to v2.6.1 resolved the issue.

The problem might be around this code block in pkg/cloud/services/ec2/instance.go:

			tags := converters.TagsToMap(subnet.Tags)
			if tags[infrav1.NameAWSSubnetAssociation] == infrav1.SecondarySubnetTagValue {
				errMessage += fmt.Sprintf(" subnet %q belongs to a secondary CIDR block which won't be used to create instances.", *subnet.SubnetId)
				continue
			}
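To make the effect of that check concrete, here is a small self-contained sketch of how subnets tagged this way get skipped (this is not the CAPA code itself; the tag key and value are my assumption of what the infrav1.NameAWSSubnetAssociation and infrav1.SecondarySubnetTagValue constants expand to):

    // Self-contained sketch of the v2.7.x behaviour, not the CAPA code itself.
    // Tag key/value are assumed to match infrav1.NameAWSSubnetAssociation and
    // infrav1.SecondarySubnetTagValue.
    package main

    import "fmt"

    const (
        associationTagKey = "sigs.k8s.io/cluster-api-provider-aws/association"
        secondaryTagValue = "secondary"
    )

    // eligibleSubnets mimics the new filter: any subnet carrying the
    // secondary-association tag is skipped when picking a subnet for an instance.
    func eligibleSubnets(subnetTags map[string]map[string]string) []string {
        var ids []string
        for id, tags := range subnetTags {
            if tags[associationTagKey] == secondaryTagValue {
                continue // rejected: "belongs to a secondary CIDR block"
            }
            ids = append(ids, id)
        }
        return ids
    }

    func main() {
        subnets := map[string]map[string]string{
            "subnet-routable-1": {},                                     // company-network subnet, still eligible
            "subnet-pods-1":     {associationTagKey: secondaryTagValue}, // 100.64.0.0/16 subnet, now rejected
        }
        fmt.Println(eligibleSubnets(subnets)) // only the routable subnet remains
    }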

Environment:

  • Cluster-api-provider-aws version: v2.7.1
  • Kubernetes version (use kubectl version): v1.29.10-eks-7f9249a
  • OS (e.g. from /etc/os-release): Ubuntu 20.04.6 LTS

We use four private subnets in AWS which are pre-provisioned by our IT team:

  • two for Transit and NAT gateways, VPC endpoints, etc. - these are connected to the company network
  • two for Kubernetes nodes and the pod network - nonrouted subnets sliced from 100.64.0.0/16

We followed the docs at https://cluster-api-aws.sigs.k8s.io/topics/eks/pod-networking#unmanaged-static-vpc:

  • custom VPC CNI configuration
  • secondary CIDR subnets tagged with sigs.k8s.io/cluster-api-provider-aws/association=secondary

We do not want to use the first two subnets for Kubernetes nodes as those are pretty small and could be easily exhausted when we scale out.
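For what it's worth, here is roughly how one can list which subnets currently carry that tag (a sketch using the AWS SDK for Go v2 with the default credentials and region; it is not part of our actual tooling):

    // Sketch: list the subnets that carry the secondary-association tag.
    // Assumes default AWS credentials/region; not part of our actual setup scripts.
    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
        "github.com/aws/aws-sdk-go-v2/service/ec2/types"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.TODO())
        if err != nil {
            log.Fatal(err)
        }
        client := ec2.NewFromConfig(cfg)

        // Filter subnets by the tag that CAPA v2.7.x treats as "pod network only".
        out, err := client.DescribeSubnets(context.TODO(), &ec2.DescribeSubnetsInput{
            Filters: []types.Filter{{
                Name:   aws.String("tag:sigs.k8s.io/cluster-api-provider-aws/association"),
                Values: []string{"secondary"},
            }},
        })
        if err != nil {
            log.Fatal(err)
        }
        for _, s := range out.Subnets {
            fmt.Printf("%s %s\n", aws.ToString(s.SubnetId), aws.ToString(s.CidrBlock))
        }
    }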

@k8s-ci-robot added the kind/bug and needs-priority labels on Nov 24, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label on Nov 24, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 22, 2025

cellux commented Feb 27, 2025

In the meantime we investigated our setup, and there is a good chance that the error is on our side.

If I understand correctly, subnets tagged with sigs.k8s.io/cluster-api-provider-aws/association: secondary should never be used for EC2 instances, only for the pod network. The new check in v2.7.x just codifies this contract.

Our mistake is most likely that we use the same subnets both for the EC2 instances and as secondary subnets for the pod network.

We should just remove the sigs.k8s.io/cluster-api-provider-aws/association: secondary tag from the EC2/pod subnets and replace it with the kubernetes.io/cluster/<cluster-name> and kubernetes.io/role/internal-elb tags as described here. Then we wouldn't need the custom networking or the ENIConfigs; everything would work with the default setup.
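For reference, the re-tagging we have in mind would look roughly like this (a sketch using the AWS SDK for Go v2; the subnet ID and cluster name are placeholders, and the tag values are the commonly used ones, "shared" and "1"):

    // Sketch of the planned re-tagging: drop the secondary-association tag from
    // the node subnets and add the standard cluster/role tags instead.
    // The subnet ID and cluster name below are placeholders.
    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/ec2"
        "github.com/aws/aws-sdk-go-v2/service/ec2/types"
    )

    func main() {
        ctx := context.TODO()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        client := ec2.NewFromConfig(cfg)

        subnets := []string{"subnet-0123456789abcdef0"} // placeholder node/pod subnet

        // 1. Remove the tag that makes CAPA v2.7.x skip the subnet for EC2 instances.
        if _, err := client.DeleteTags(ctx, &ec2.DeleteTagsInput{
            Resources: subnets,
            Tags: []types.Tag{{
                Key:   aws.String("sigs.k8s.io/cluster-api-provider-aws/association"),
                Value: aws.String("secondary"),
            }},
        }); err != nil {
            log.Fatal(err)
        }

        // 2. Add the usual cluster-ownership and internal-ELB role tags instead.
        if _, err := client.CreateTags(ctx, &ec2.CreateTagsInput{
            Resources: subnets,
            Tags: []types.Tag{
                {Key: aws.String("kubernetes.io/cluster/my-cluster"), Value: aws.String("shared")}, // placeholder cluster name
                {Key: aws.String("kubernetes.io/role/internal-elb"), Value: aws.String("1")},
            },
        }); err != nil {
            log.Fatal(err)
        }
    }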

We'll verify these assumptions and report back.
