Firewall fails to recognize TLS version and SNI when EBS CSI controller talks to AWS API endpoints #2364

Open
gblues opened this issue Feb 25, 2025 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

gblues commented Feb 25, 2025

/kind bug

What happened?
After upgrading the EBS CSI driver from 1.34.0, EBS volume mounts fail because the connection to the AWS API endpoint is blocked by the firewall due to missing/incorrect SNI in TLS communication.

The exact error varies by version:

v1.35.0

E0225 20:15:21.006813       1 driver.go:108] "GRPC error" err="rpc error: code = Internal desc = Could not detach volume \"REDACTED-VOL-ID\" from node \"REDACTED-EC2-INSTANCE-ID\": error listing AWS instances: operation error EC2: DescribeInstances, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post \"STS_ENDPOINT/\": read tcp REDACTED-EBS-CSI-CONTROLLER-IP:52326->STS-IP:443: read: connection reset by peer"

v1.40.0

E0225 19:05:24.424137       1 batcher.go:161] "execute: error executing batch" err="error listing AWS instances: operation error EC2: DescribeInstances, request canceled, context deadline exceeded"
E0225 19:05:24.424212       1 driver.go:108] "GRPC error" err="rpc error: code = Internal desc = Could not attach volume \"REDACTED-VOL-ID\" to node \"REDACTED-EC2-INSTANCE-ID\": error listing AWS instances: operation error EC2: DescribeInstances, request canceled, context deadline exceeded"

What you expected to happen?
I expected the EBS volume mounts to succeed.

How to reproduce it (as minimally and precisely as possible)?

  • have firewall configured to terminate TLS sessions with missing or invalid SNI between EKS and AWS endpoint
  • deploy EBS CSI 1.34.0 and a StatefulSet or something to mount an EBS volume, notice that it works
  • update the EBS CSI to any newer version (1.35.0, 1.37.0, and 1.40.0 have been explicitly tested)
  • re-deploy your statefulset from before to trigger a mount action
  • check the logs for the leader of the ebs-csi-controller deployment and note all the "connection reset by peer" errors

Anything else we need to know?:

  1. This is a FedRAMP environment so I cannot provide more precise details, and the SNI security policy is a part of the FedRAMP environment and so is not subject to change.
  2. Having done a little digging on my own, this is most likely a regression introduced during a dependency update--the aws go v2 API being the most obvious candidate. However, I don't know if this is a client error (insufficient configuration).
  3. I don't have eyes on the firewall output to identify the precise rule that is getting triggered, so "wrong or missing" is as precise as I can be at the moment

Environment

  • Kubernetes version (use kubectl version): 1.31
  • Driver version: 1.35.0 - 1.40.0 are all impacted

k8s-ci-robot added the kind/bug label Feb 25, 2025

@torredil
Member

Hi there, thanks for the detailed bug report.

I looked and saw no obvious relevant code changes or updates to how the driver handles connections between 1.34 and 1.35, so the behavior described here is interesting. Can you confirm if rolling back fixes the networking issue? I was actually thinking that driver version shouldn't be relevant here, but we need to validate that.

I’ll run some tests on my end to dig deeper, but in the meantime, could you enable SDK debug logs and share them? That will give us a clearer picture of what's happening with the requests.

> However, I don't know if this is a client error (insufficient configuration).

The two relevant configuration options that come to mind here are proxy settings, and supplying the controller with the right CA certificate (if you have a custom certificate) via volumeMounts (see example).

gblues commented Feb 26, 2025 via email

gblues commented Feb 26, 2025

I followed the linked instructions, and I can confirm the debug log argument is being added to the deployment, but the logs don't have any new information. Now, assuming the SDK logging isn't completely broken, my takeaway from this is that none of the SDK queries are actually completing.

gblues commented Feb 26, 2025

I apologize for being unclear. When I wrote:

> I don't know if this is a client error

What I meant was: I don't know if this was an intentional breaking change in the API that got missed in the dependency upgrade, or if this is an actual regression in the API that needs to be fixed by AWS. I just don't have the golang expertise to make that determination.

@torredil
Member

Thank you for the updates, @gblues!

It's unlikely that this issue is caused by a regression in the SDK itself. We suspect the culprit is upgrading Go from 1.22 to 1.23 (which took place in driver version 1.35). Notably, Go 1.23 includes several updates to the crypto/tls package.

Would you be able to check with your network team to ensure the firewall allows connections with a ClientHello fragmented over multiple packets? See golang/go#70139 for more context on this.

Assuming the above is true, you should be able to temporarily work around this issue by setting the GODEBUG environment variable to tlskyber=0:

gblues commented Feb 27, 2025 via email

gblues changed the title from "SNI is not being set/set correctly when talking to AWS API endpoints" to "Firewall fails to recognize TLS version and SNI when EBS CSI controller talks to AWS API endpoints" Feb 27, 2025

gblues commented Feb 27, 2025

Update!

  • adding GODEBUG=tlskyber=0 allowed me to upgrade as far as 1.39.0, but 1.40.0 was still failing
  • working with the network admin, we see the TLS version and SNI showing as "unknown"
  • While I didn't see it mentioned in the driver's release notes, we noticed that, according to the Go 1.24 release notes, Go 1.24 removes the draft Kyber implementation and the tlskyber GODEBUG flag, and adds a new tlsmlkem flag. Setting GODEBUG=tlsmlkem=0 got 1.40.0 working.
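
The findings above suggest the right GODEBUG value depends on the Go version the driver binary was built with. The following Go sketch captures that mapping. The function name hybridKexGodebug is hypothetical, and treating every release from Go 1.24 onward as using tlsmlkem is an assumption based on the state of things at the time of this thread; later Go releases may change it.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// hybridKexGodebug (hypothetical helper) returns the GODEBUG setting that
// disables the default hybrid post-quantum TLS key exchange for a Go version
// string such as "go1.23.4". It returns "" when no setting applies or the
// version string is not in the "go1.N..." form.
func hybridKexGodebug(goVersion string) string {
	v := strings.TrimPrefix(goVersion, "go")
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 || parts[0] != "1" {
		return ""
	}
	// Parse the leading digits of the minor version so "24rc1" yields 24.
	minor, seen := 0, false
	for _, r := range parts[1] {
		if r < '0' || r > '9' {
			break
		}
		minor = minor*10 + int(r-'0')
		seen = true
	}
	if !seen {
		return ""
	}
	switch {
	case minor >= 24:
		// Assumption: tlsmlkem remains the relevant flag after Go 1.24.
		return "tlsmlkem=0"
	case minor == 23:
		return "tlskyber=0"
	default:
		// Go 1.22 and earlier did not enable the hybrid exchange by default.
		return ""
	}
}

func main() {
	v := runtime.Version()
	fmt.Printf("built with %s, suggested GODEBUG: %q\n", v, hybridKexGodebug(v))
}
```

In terms of the versions tested in this thread, driver releases 1.35.0 through 1.39.0 responded to tlskyber=0 and 1.40.0 needed tlsmlkem=0.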
