
MultiKueue when Creating a multikueue admission check Should run an appwrapper containing a job on worker if admitted #4378

Open
PBundyra opened this issue Feb 24, 2025 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@PBundyra
Contributor

What happened:
This e2e multikueue test flaked:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4301/pull-kueue-test-e2e-multikueue-main/1894013562272092160

What you expected to happen:
I expected the test to succeed.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@PBundyra PBundyra added the kind/bug Categorizes issue or PR as related to a bug. label Feb 24, 2025
@tenzen-y
Member

/kind flake

This might be the same flake as #4376, but the two have different error messages.

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Feb 24, 2025
@tenzen-y
Member

/retitle MultiKueue when Creating a multikueue admission check Should run an appwrapper containing a job on worker if admitted

@k8s-ci-robot k8s-ci-robot changed the title Flaky e2e multikueue test MultiKueue when Creating a multikueue admission check Should run an appwrapper containing a job on worker if admitted Feb 24, 2025
@tenzen-y
Member

cc: @dgrove-oss

@dgrove-oss
Contributor

dgrove-oss commented Feb 25, 2025

In both cases, the appwrapper controller on worker1 (where the job is expected to run) hits an odd problem during startup, shown below, when it tries to load a ConfigMap to get the operator configuration.

2025-02-24T12:42:12.42429078Z stderr F 2025-02-24T12:42:12.423929135Z	INFO	setup	log/deleg.go:127	Build info	{"version": "v1.0.4", "date": "2025-02-12 14:01"}
2025-02-24T12:42:12.425600879Z stderr F 2025-02-24T12:42:12.425383746Z	ERROR	setup	log/deleg.go:142	unable to initialise configuration	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.96.0.1:443/api/v1\": dial tcp 10.96.0.1:443: connect: network is unreachable"}
2025-02-24T12:42:12.425614359Z stderr F sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error
2025-02-24T12:42:12.425619789Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:142
2025-02-24T12:42:12.425624829Z stderr F github.com/go-logr/logr.Logger.Error
2025-02-24T12:42:12.425629739Z stderr F 	/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:301
2025-02-24T12:42:12.425634239Z stderr F main.exitOnError
2025-02-24T12:42:12.425638979Z stderr F 	/workspace/cmd/main.go:221
2025-02-24T12:42:12.425643839Z stderr F main.main
2025-02-24T12:42:12.425648099Z stderr F 	/workspace/cmd/main.go:108
2025-02-24T12:42:12.425652289Z stderr F runtime.main
2025-02-24T12:42:12.42565638Z stderr F 	/usr/local/go/src/runtime/proc.go:272

I wonder if the waitForOperatorAvailability function in the e2e utility package isn't stringent enough, and we are trying to run the multikueue test before all the controllers on the worker cluster are really ready.
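
One possible direction, sketched below (assuming the Gomega and controller-runtime client the e2e utilities already use; the function name and constants are hypothetical, not the existing helper): instead of only waiting for the Deployment's Available condition, also require that every replica is updated and ready and that every container in the operator's pods reports Ready.

package util

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	operatorReadyTimeout  = 3 * time.Minute
	operatorReadyInterval = time.Second
)

// waitForOperatorReadiness (hypothetical name) is stricter than waiting for
// the Deployment's Available condition: it requires the rollout to be fully
// done and every container in the matching pods to be Ready, which would
// catch an operator pod that is still crash-looping during startup.
func waitForOperatorReadiness(ctx context.Context, c client.Client, key types.NamespacedName) {
	gomega.Eventually(func(g gomega.Gomega) {
		deploy := &appsv1.Deployment{}
		g.Expect(c.Get(ctx, key, deploy)).To(gomega.Succeed())

		replicas := int32(1)
		if deploy.Spec.Replicas != nil {
			replicas = *deploy.Spec.Replicas
		}
		g.Expect(deploy.Status.UpdatedReplicas).To(gomega.Equal(replicas))
		g.Expect(deploy.Status.ReadyReplicas).To(gomega.Equal(replicas))
		g.Expect(deploy.Status.AvailableReplicas).To(gomega.Equal(replicas))

		// Also look at the pods themselves, since the Deployment conditions
		// can lag behind a container that is restarting.
		pods := &corev1.PodList{}
		g.Expect(c.List(ctx, pods, client.InNamespace(key.Namespace),
			client.MatchingLabels(deploy.Spec.Selector.MatchLabels))).To(gomega.Succeed())
		for _, p := range pods.Items {
			for _, cs := range p.Status.ContainerStatuses {
				g.Expect(cs.Ready).To(gomega.BeTrue())
			}
		}
	}, operatorReadyTimeout, operatorReadyInterval).Should(gomega.Succeed())
}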

@mimowo
Contributor

mimowo commented Feb 27, 2025

@dgrove-oss just a speculation, but could this be due to interference with this test:

ginkgo.When("The connection to a worker cluster is unreliable", func() {
.

This message is what makes me think so: "dial tcp 10.96.0.1:443: connect: network is unreachable".

Maybe this somehow makes the AppWrapper controller crash?

IIRC the tests don't run in parallel, but even so, maybe the previous test could give the AppWrapper controller a hard time?

@dgrove-oss
Contributor

dgrove-oss commented Feb 28, 2025

When the AppWrapper controller is initializing, it is written to exit on errors. In this particular case, the Get inside of loadConfig (here) returned a network error.

I'm open to other ways of structuring AppWrapper's startup code, but it seemed like exiting with an error and letting the pod restart was more robust than trying to handle it or masking it with a retry loop. Kueue's main seems to be structured similarly.
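
For reference, the retry-loop alternative mentioned above could look roughly like the sketch below (names, intervals, and the load callback are hypothetical, not taken from the AppWrapper code base):

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// loadConfigWithRetry (hypothetical) retries the one-time configuration load
// for a bounded period before giving up, instead of exiting on the first
// transient error. The load callback stands in for the ConfigMap read that
// the controller performs at startup.
func loadConfigWithRetry(ctx context.Context, load func(context.Context) error) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			if err := load(ctx); err != nil {
				// Treat errors like "network is unreachable" as transient and
				// keep polling; PollUntilContextTimeout returns an error once
				// the 30s budget is exhausted, at which point main can still
				// exit and let the pod restart.
				fmt.Printf("retrying configuration load: %v\n", err)
				return false, nil
			}
			return true, nil
		})
}

The trade-off versus exit-and-restart is mostly CrashLoopBackOff noise versus a slightly longer startup path; both converge once the API server is reachable.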
