
MultiKueue when Creating a multikueue admission check Should run an appwrapper containing a job on worker if admitted #4378

Open
PBundyra opened this issue Feb 24, 2025 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test.

Comments

@PBundyra
Contributor

What happened:
This e2e multikueue test flaked:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/4301/pull-kueue-test-e2e-multikueue-main/1894013562272092160

What you expected to happen:
I expected the test to succeed.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Kueue version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@PBundyra PBundyra added the kind/bug Categorizes issue or PR as related to a bug. label Feb 24, 2025
@tenzen-y
Member

/kind flake

This might be the same flake as #4376, but the two have different error messages.

@k8s-ci-robot k8s-ci-robot added the kind/flake Categorizes issue or PR as related to a flaky test. label Feb 24, 2025
@tenzen-y
Member

/retitle MultiKueue when Creating a multikueue admission check Should run an appwrapper containing a job on worker if admitted

@k8s-ci-robot k8s-ci-robot changed the title Flaky e2e multikueue test MultiKueue when Creating a multikueue admission check Should run an appwrapper containing a job on worker if admitted Feb 24, 2025
@tenzen-y
Member

cc: @dgrove-oss

@dgrove-oss
Contributor

dgrove-oss commented Feb 25, 2025

In both cases, the appwrapper controller on worker1 (where the job is expected to run) hits an odd problem during startup, shown below, when it tries to load a ConfigMap to get the operator configuration.

2025-02-24T12:42:12.42429078Z stderr F 2025-02-24T12:42:12.423929135Z	INFO	setup	log/deleg.go:127	Build info	{"version": "v1.0.4", "date": "2025-02-12 14:01"}
2025-02-24T12:42:12.425600879Z stderr F 2025-02-24T12:42:12.425383746Z	ERROR	setup	log/deleg.go:142	unable to initialise configuration	{"error": "failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get \"https://10.96.0.1:443/api/v1\": dial tcp 10.96.0.1:443: connect: network is unreachable"}
2025-02-24T12:42:12.425614359Z stderr F sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Error
2025-02-24T12:42:12.425619789Z stderr F 	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:142
2025-02-24T12:42:12.425624829Z stderr F github.com/go-logr/logr.Logger.Error
2025-02-24T12:42:12.425629739Z stderr F 	/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:301
2025-02-24T12:42:12.425634239Z stderr F main.exitOnError
2025-02-24T12:42:12.425638979Z stderr F 	/workspace/cmd/main.go:221
2025-02-24T12:42:12.425643839Z stderr F main.main
2025-02-24T12:42:12.425648099Z stderr F 	/workspace/cmd/main.go:108
2025-02-24T12:42:12.425652289Z stderr F runtime.main
2025-02-24T12:42:12.42565638Z stderr F 	/usr/local/go/src/runtime/proc.go:272

I wonder if the waitForOperatorAvailability function in the e2e utility package isn't stringent enough, and we are trying to run the multikueue test before all the controllers on the worker cluster are really ready.
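
One possible direction, sketched below (assuming the Gomega and controller-runtime client the e2e utilities already use; the function name and constants are hypothetical, not the existing helper): instead of only waiting for the Deployment's Available condition, also require that every replica is updated and ready and that every container in the operator's pods reports Ready.

package util

import (
	"context"
	"time"

	"github.com/onsi/gomega"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	operatorReadyTimeout  = 3 * time.Minute
	operatorReadyInterval = time.Second
)

// waitForOperatorReadiness (hypothetical name) is stricter than waiting for
// the Deployment's Available condition: it requires the rollout to be fully
// done and every container in the matching pods to be Ready, which would
// catch an operator pod that is still crash-looping during startup.
func waitForOperatorReadiness(ctx context.Context, c client.Client, key types.NamespacedName) {
	gomega.Eventually(func(g gomega.Gomega) {
		deploy := &appsv1.Deployment{}
		g.Expect(c.Get(ctx, key, deploy)).To(gomega.Succeed())

		replicas := int32(1)
		if deploy.Spec.Replicas != nil {
			replicas = *deploy.Spec.Replicas
		}
		g.Expect(deploy.Status.UpdatedReplicas).To(gomega.Equal(replicas))
		g.Expect(deploy.Status.ReadyReplicas).To(gomega.Equal(replicas))
		g.Expect(deploy.Status.AvailableReplicas).To(gomega.Equal(replicas))

		// Also look at the pods themselves, since the Deployment conditions
		// can lag behind a container that is restarting.
		pods := &corev1.PodList{}
		g.Expect(c.List(ctx, pods, client.InNamespace(key.Namespace),
			client.MatchingLabels(deploy.Spec.Selector.MatchLabels))).To(gomega.Succeed())
		for _, p := range pods.Items {
			for _, cs := range p.Status.ContainerStatuses {
				g.Expect(cs.Ready).To(gomega.BeTrue())
			}
		}
	}, operatorReadyTimeout, operatorReadyInterval).Should(gomega.Succeed())
}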

@mimowo
Contributor

mimowo commented Feb 27, 2025

@dgrove-oss just a speculation, but could this be due to interference with this test:

ginkgo.When("The connection to a worker cluster is unreliable", func() {
.

This message is what makes me think so: "dial tcp 10.96.0.1:443: connect: network is unreachable".

Maybe this somehow makes the AppWrapper controller crash?

IIRC the tests don't run in parallel, but even so, maybe the previous test could give the AppWrapper controller a hard time?

@dgrove-oss
Contributor

dgrove-oss commented Feb 28, 2025

When the AppWrapper controller is initializing, it is written to exit on errors. In this particular case, the Get inside of loadConfig (here) returned a network error.

I'm open to other ways of structuring AppWrapper's startup code, but it seemed like exiting with an error and letting the pod restart was more robust than trying to handle it or masking it with a retry loop. Kueue's main seems to be structured similarly.
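
For reference, the retry-loop alternative mentioned above could look roughly like the sketch below (names, intervals, and the load callback are hypothetical, not taken from the AppWrapper code base):

package main

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// loadConfigWithRetry (hypothetical) retries the one-time configuration load
// for a bounded period before giving up, instead of exiting on the first
// transient error. The load callback stands in for the ConfigMap read that
// the controller performs at startup.
func loadConfigWithRetry(ctx context.Context, load func(context.Context) error) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			if err := load(ctx); err != nil {
				// Treat errors like "network is unreachable" as transient and
				// keep polling; PollUntilContextTimeout returns an error once
				// the 30s budget is exhausted, at which point main can still
				// exit and let the pod restart.
				fmt.Printf("retrying configuration load: %v\n", err)
				return false, nil
			}
			return true, nil
		})
}

The trade-off versus exit-and-restart is mostly CrashLoopBackOff noise versus a slightly longer startup path; both converge once the API server is reachable.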
