Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NCCL integration tests #3697

Open
wants to merge 12 commits into
base: develop
Choose a base branch
from

Conversation

annuay-google
Copy link
Contributor

@annuay-google annuay-google commented Feb 21, 2025

What?

NCCL integration tests for A+ machine families

Why?

  1. NCCL tests help validate functionality and baseline performance of the cluster
  2. Regressions have crept in due to our current integration tests only validating provisioning and nvidia-smi. An example was the change to rename kubernetes network objects was not caught for breaking the NCCL tests we ship to customers

Testing

Validated by running the tests via cloud build:
A3 Ultra: https://pantheon.corp.google.com/cloud-build/builds;region=global/234a1ddf-b194-4129-a640-0b5b081373a1?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev
A3 Mega: https://pantheon.corp.google.com/cloud-build/triggers;region=global/edit/e2eb58fd-613e-4f32-b107-debed1cd573b?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev
A3 High: https://pantheon.corp.google.com/cloud-build/builds;region=global/6ee8fc8f-f718-4446-b11b-01bfa8cc647d?e=13803378&mods=monitoring_api_prod&project=hpc-toolkit-dev

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch #
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines #

@glasnt glasnt removed request for a team February 25, 2025 05:34
@@ -39,8 +39,6 @@ steps:
BUILD_ID_SHORT=$${BUILD_ID_FULL:0:6}
EXAMPLE_BP=tools/cloud-build/daily-tests/blueprints/gke-a2-highgpu.yaml

# Replacing the static subnet name to prevent collisions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer required since the subnet names are now prefixed with deployment-name

@annuay-google annuay-google marked this pull request as ready for review February 26, 2025 15:55
@annuay-google annuay-google added the release-improvements Added to release notes under the "Improvements" heading. label Feb 26, 2025
Copy link
Contributor

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-improvements Added to release notes under the "Improvements" heading.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants