Create EBS CSI Driver scale-test tool #2292

Merged

AndrewSirenko
Contributor

@AndrewSirenko AndrewSirenko commented Jan 16, 2025

What type of PR is this?

/kind documentation

What is this PR about? / Why do we need it?

Validate that the EBS CSI Driver is capable of managing volumes at the scale of your largest EKS clusters with our scale-test tool. See /hack/ebs-scale-test/README.md for more info.

Support for a Karpenter cluster type, additional scalability test types, and more observability features for scalability tests is coming soon.

Scope of this PR: running the pre-allocated scale-sts scalability test in one step.

Running ./scale-test create && ./scale-test run && ./scale-test clean with no extra environment variables set will:

  1. Create an EKS cluster
  2. Install EBS CSI Driver with metrics enabled
  3. Deploy StorageClass and StatefulSet, then scale that sts to 1000 replicas (which will provision and publish 1000 volumes)
  4. Delete sts (unpublishing and deleting volumes)
  5. Export metrics, ebs-plugin logs, driver deployment + daemonset yaml, and other test artifacts to a local directory and an S3 bucket
  6. Clean up all scalability test resources from your AWS account
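
The one-liner above chains the three subcommands with `&&`, so a failure in `create` or `run` skips `clean` and can leave test resources in the account. A minimal sketch of a wrapper that always attempts cleanup is shown here. It assumes `clean` is safe to call after a partial or failed run, which this PR does not state. `SCALE_TEST_BIN` and `run_scale_loop` are hypothetical names introduced for illustration and are not part of the tool.

```shell
# Hypothetical wrapper, not part of the scale-test tool.
# SCALE_TEST_BIN is an assumed variable pointing at the scale-test script.
SCALE_TEST_BIN="${SCALE_TEST_BIN:-./scale-test}"

run_scale_loop() {
  local status=0 clean_status=0
  # Run create, then run; remember the first failure instead of exiting.
  "$SCALE_TEST_BIN" create && "$SCALE_TEST_BIN" run || status=$?
  # Always attempt cleanup (assumes clean tolerates a partial setup).
  "$SCALE_TEST_BIN" clean || clean_status=$?
  # Report the create/run failure first, otherwise the cleanup result.
  if [ "$status" -ne 0 ]; then
    return "$status"
  fi
  return "$clean_status"
}
```

Calling `run_scale_loop` from the directory that contains `scale-test` would then run the same three steps, with the difference that `clean` is attempted even when an earlier step fails.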

How was this change tested?

Revision 2 edit: Ran through the scale-test create, run, and clean loop again. Also manually tested the extra features and included the results in the review comments on those features. Thanks.

Please follow the README.md to run these scale tests yourself.

Timing:

Cluster Setup: ~18 min
Run: ~8 min for 1000 replica test
Cluster Cleanup: ~10 min

❯ ./scale-test setup && ./scale-test run && ./scale-test clean
Deploying EKS cluster. See configuration in /tmp/tmp.FEdux6Eoal/cluster-config.yaml
2025-01-16 14:17:05 [ℹ]  eksctl version 0.194.0-dev+02ef28ee3.2024-10-22T18:42:34Z
...
Applying /workplace/andsirey/aws-ebs-csi-driver/hack/ebs-scale-test/helpers/scale-test/scale-sts-test/scale-sts.yaml. Exported to /tmp/tmp.FGNUXsiFTA/scale-manifest.yaml
...
partitioned roll out complete: 101 new pods have been updated...
Deleting StatefulSet
statefulset.apps "ebs-scale-test" deleted
storageclass.storage.k8s.io "ebs-scale-test" deleted
Waiting for all PVs to be deleted
101 PVs still exist, waiting...
...
Exporting everything in /tmp/tmp.FGNUXsiFTA to S3
Metrics exported to s3://ebs-scale-tests/ebs-scale-pre-allocated-scale-sts-101-2025-01-16T14:35UTC/
2025-01-16 14:44:57 [ℹ]  deleting EKS cluster "ebs-scale-pre-allocated"
2025-01-16 14:52:04 [✔]  all cluster resources were deleted
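
The "Waiting for all PVs to be deleted" step in the log above amounts to polling the PersistentVolume count until it reaches zero. A rough sketch of such a loop follows; it is not the tool's actual implementation. The function name `wait_for_pv_deletion`, the default timeout and interval, and the return codes are assumptions for illustration. It counts every PV in the cluster, which only makes sense on a dedicated test cluster.

```shell
# Hypothetical helper, not taken from the scale-test source.
# Polls until no PersistentVolumes remain or the timeout (seconds) elapses.
wait_for_pv_deletion() {
  local timeout="${1:-600}" interval="${2:-5}" waited=0 out count
  while :; do
    # --no-headers prints one line per PV; bail out if kubectl itself fails.
    out=$(kubectl get pv --no-headers 2>/dev/null) || return 2
    # Count non-empty lines; grep -c exits 1 on zero matches, hence "|| true".
    count=$(printf '%s\n' "$out" | grep -c . || true)
    [ "$count" -eq 0 ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out with $count PVs remaining" >&2
      return 1
    fi
    echo "$count PVs still exist, waiting..."
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

For example, `wait_for_pv_deletion 900 10` would poll every 10 seconds for up to 15 minutes.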

Does this PR introduce a user-facing change?

Add `scale-test` tool for running EBS CSI Driver scalability tests

@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Jan 16, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 16, 2025

Code Coverage Diff

This PR does not change the code coverage

@AndrewSirenko AndrewSirenko marked this pull request as draft January 27, 2025 21:38
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2025
@AndrewSirenko AndrewSirenko force-pushed the scale-tests-pre-allocated branch from 49cbf49 to 4ae77bb Compare January 28, 2025 21:55
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 28, 2025
@AndrewSirenko AndrewSirenko marked this pull request as ready for review January 28, 2025 21:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 28, 2025
Member

@ElijahQuinones ElijahQuinones left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 29, 2025
@AndrewSirenko AndrewSirenko force-pushed the scale-tests-pre-allocated branch from 94c9338 to 2502c6d Compare January 30, 2025 16:30
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2025
@AndrewSirenko AndrewSirenko force-pushed the scale-tests-pre-allocated branch from 2502c6d to 3acb83d Compare January 30, 2025 16:32
@ElijahQuinones
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 30, 2025
@AndrewSirenko
Contributor Author

/retest

Flake: Error: validation failed: wait time exceeded during validation

@ElijahQuinones
Member

/approve

Ran through the scale tests locally and confirmed that they run as intended, send metrics and logs to S3, and clean up all resources when done.

./scale-test create && ./scale-test run && ./scale-test clean;
...
2025-01-31 16:11:46 [✔]  saved kubeconfig as "/home/elijahlq/.kube/config"
...
2025-01-31 16:11:46 [✔]  all EKS cluster resources for "ebs-scale-pre-allocated" have been created
...
2025-01-31 16:33:15 [✔]  all cluster resources were deleted

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ElijahQuinones

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2025
@k8s-ci-robot k8s-ci-robot merged commit 99a8727 into kubernetes-sigs:master Jan 31, 2025
16 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/documentation Categorizes issue or PR as related to documentation.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants