Consolidating all storage behind datastore #350
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g.
} else {
	c.Datastore.pods.Store(pod, true)
This is another bug where we could be overwriting the metrics state of an existing pod.
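A minimal sketch of the kind of guard that avoids this, assuming the datastore keeps a sync.Map of per-pod state; the type and method names below are hypothetical, not the exact code in the PR:

```go
package backend

import "sync"

// PodMetrics holds the probed per-pod state (fields elided in this sketch).
type PodMetrics struct{ /* ... */ }

type datastore struct {
	pods sync.Map // key: pod name, value: *PodMetrics
}

// PodAddIfNotExist creates a fresh entry only when the pod is not already
// tracked, so existing metrics state is never overwritten on re-reconcile.
func (ds *datastore) PodAddIfNotExist(podName string, pm *PodMetrics) bool {
	_, loaded := ds.pods.LoadOrStore(podName, pm)
	return !loaded // true when a new entry was actually added
}
```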
Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
}
if !k8sPod.DeletionTimestamp.IsZero() || !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
	c.Datastore.pods.Delete(pod)
This is another bug introduced by using backend.Pod as the key, which takes the address and the port into account. If either of those changes (e.g., the pod is deleted and recreated with the same name, as with StatefulSets, or the target port on the InferencePool changes), we wouldn't be able to look up and delete the previous version of the pod.
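A sketch of the alternative this comment points at: key the cache by a stable pod identity rather than by a struct that embeds the address and port. The names are illustrative, not the exact code in the PR:

```go
package backend

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

type datastore struct {
	pods sync.Map // key: types.NamespacedName, value: cached pod state
}

// podKey derives a stable key from the Kubernetes Pod. Unlike a key that
// embeds the IP and target port, it stays valid when the pod is recreated
// with a new address or when the InferencePool's target port changes.
func podKey(pod *corev1.Pod) types.NamespacedName {
	return types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
}

// PodDelete removes the cached entry by its stable key, so stale versions
// of a recreated pod can always be found and dropped.
func (ds *datastore) PodDelete(pod *corev1.Pod) {
	ds.pods.Delete(podKey(pod))
}
```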
time.Sleep(refreshPrometheusMetricsInterval)
p.flushPrometheusMetricsOnce(logger)
select {
case <-ctx.Done():
In integration tests, those routines were leaking across test cases and causing a lot of problems (a provider from a previous test case overriding the state of a different test case). With this, we force them to shut down with the manager.
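A sketch of tying such a loop to the controller-runtime manager so its context is cancelled on shutdown; registerMetricsFlusher and its parameters are illustrative, not the PR's exact API:

```go
package backend

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// registerMetricsFlusher registers a periodic flush loop as a manager
// Runnable. The manager cancels ctx when it stops, so the goroutine cannot
// leak across integration test cases (each test case runs its own manager).
func registerMetricsFlusher(mgr manager.Manager, interval time.Duration, flush func()) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return nil // manager is shutting down; exit cleanly
			case <-ticker.C:
				flush()
			}
		}
	}))
}
```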
Force-pushed from a0cea4f to f9113b3.
logger.V(logutil.DEFAULT).Info("Request handled",
	"model", llmReq.Model, "targetModel", llmReq.ResolvedTargetModel, "endpoint", targetPod)

// Insert target endpoint to instruct Envoy to route requests to the specified target pod.
This lets us avoid storing the port number with the pod in the datastore; the port number is only needed when the EPP responds with the endpoint.
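In other words, the endpoint can be assembled at response time from the cached pod address and the pool's target port. A minimal sketch (the helper name is hypothetical):

```go
package handlers

import (
	"net"
	"strconv"
)

// buildTargetEndpoint joins the pod address (stored in the datastore) with
// the target port read from the InferencePool spec at response time, so the
// port never needs to be cached alongside the pod.
func buildTargetEndpoint(podAddress string, targetPort int32) string {
	return net.JoinHostPort(podAddress, strconv.Itoa(int(targetPort)))
}
```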
}

// Update the pod status parts.
existing.(*PodMetrics).Pod = new.Pod
This allows us to update pod properties; currently only the address.
Going to do this piecewise due to the size of the PR. I've made it through the datastore, which I think is a large portion of this PR. Left some comments; will continue to review.
}

func (ds *K8sDatastore) flushPodsAndRefetch(ctx context.Context, ctrlClient client.Client, newServerPool *v1alpha1.InferencePool) {
func (ds *datastore) PodResyncAll(ctx context.Context, ctrlClient client.Client) {
	// Pool must exist to invoke this function.
This might be true now, but I could see this eventually being extracted into a lib to be used in custom EPPs. We may want to think about capturing the error and returning it to protect future callers.
Not a blocking comment for this PR, however.
+1
We can consider making this method take a pod selector, and having the PoolReconciler send in the pod selector.
I think the pool will always be required since it has the selector; otherwise we wouldn't know which pods to cache. The other option is to pass in the selector, which I am not sure is better since I view the datastore as a per-pool cache.
This significantly cleans up the codebase. Thanks!
(Not a blocker and can be follow-ups.)
We should further break down the backend package. Right now it is a giant package and we are accessing internal fields in many places, making it bug-prone. A refactor may look like this:
- Make the datastore its own package
- Make the types their own package, and add helper functions to properly manage the lifecycle of the objects (e.g., in the future, creating a PodMetrics object will involve starting a metrics refresh goroutine; see PR refresh metrics per pod periodically #223)
- Make each reconciler its own package
}

func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool {
	new := &PodMetrics{
(This can be a follow-up.)
Consider making a NewPodMetrics helper function here.
We should hide the internal fields of the PodMetrics object and add helper functions. This will make PR #223 much easier. Perhaps we need to make PodMetrics an interface instead; I imagine with PR #223 the PodMetrics will need to manage the lifecycle of the refresher goroutines.
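A rough sketch of what such a helper could look like, with illustrative field names; the real PodMetrics struct in the backend package may differ:

```go
package backend

import corev1 "k8s.io/api/core/v1"

// Pod captures the endpoint-identifying properties of a pod.
type Pod struct {
	Name    string
	Address string
}

// Metrics holds the probed values (illustrative subset).
type Metrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
}

// PodMetrics pairs a pod with its probed metrics.
type PodMetrics struct {
	Pod     Pod
	Metrics Metrics
}

// NewPodMetrics builds the object from the Kubernetes Pod. With #223 this
// constructor would also be the natural place to start (and own) the
// per-pod metrics refresh goroutine.
func NewPodMetrics(pod *corev1.Pod) *PodMetrics {
	return &PodMetrics{
		Pod: Pod{
			Name:    pod.Name,
			Address: pod.Status.PodIP,
		},
	}
}
```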
Let's do a follow-up on that. I think we also need to rename the object as well; perhaps we call it Endpoint? wdyt?
Yeah, renaming sounds good too.
I agree. In other projects we had one pkg for all controllers; I think this should be sufficient (vs. a pkg per controller). I think the datastore should have its own pkg with an internal one. The types should be part of pkg/datastore, since the types go hand in hand with the datastore, so the ext-proc/datastore pkg would include the types as well.
Overall this lgtm. This PR is rather massive, and so relies somewhat on testing to validate that it still operates the same. Will LGTM as the PR description mentions reasonable validation. Will hold to let others review/comment; unhold at your own discretion. Thanks for the big effort here!
/lgtm
if test.req == nil {
	test.req = &ctrl.Request{NamespacedName: namespacedName}
}
if _, err := podReconciler.Reconcile(context.Background(), *test.req); err != nil {
Nice! We are testing the Reconcile method instead of testing a private helper method.
Yeah, we should do that for all controllers.
/lgtm
/lgtm
/hold cancel Thanks all for the review. We still have a number of follow-up items:
This PR removes the intermediate cache in the provider and consolidates all storage behind the datastore component. This reduces what is now called the provider to a pure probing controller. To validate the change, I tried to increase test coverage, especially for the controllers, which uncovered a couple of bugs (I added comments pointing them out).
The PR is long, but unfortunately there is no way to do this refactor without such a significant change. The best way to approach this PR is to start by looking at the Datastore interface, which defines the contract for accessing all cached state related to the pool, the models, and the pods. All accesses to the datastore go only through this interface. The components that access the datastore are the controllers (pool, model, and pod) and the provider (which, as mentioned above, is now in practice a probing controller). As a follow-up, I plan to move the datastore implementation into an internal pkg and make only the interface public.
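For orientation, a rough sketch of the shape of such a contract is below; the method set and the v1alpha1 import path are approximations based on this PR's discussion, not the exact interface:

```go
package datastore

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha1" // assumed import path
)

// PodMetrics is the cached per-pod state (fields elided in this sketch).
type PodMetrics struct{ /* ... */ }

// Datastore is the single contract through which the controllers (pool,
// model, pod) and the probing provider access cached state.
type Datastore interface {
	// InferencePool
	PoolSet(pool *v1alpha1.InferencePool)
	PoolGet() (*v1alpha1.InferencePool, error)

	// InferenceModels
	ModelSet(model *v1alpha1.InferenceModel)
	ModelGet(modelName string) (*v1alpha1.InferenceModel, bool)
	ModelDelete(modelName string)

	// Pods
	PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
	PodDelete(pod *corev1.Pod)
	PodResyncAll(ctx context.Context, ctrlClient client.Client)
	PodGetAll() []*PodMetrics
}
```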
I think we are still lacking proper test coverage overall. As a next step, we should prioritize migrating our integration tests to Ginkgo. Moreover, as it stands, the integration test setup is not properly testing things end-to-end (e.g., pods are being injected via a side channel on the datastore instead of being created on the API server and having the controllers populate the datastore).
I'm hoping this PR will allow us to improve the probing logic through a clearer separation of responsibilities. I think #223 is compatible with this direction with the exception of the extra cache it introduces.
Testing: in addition to increasing test coverage, I ran the integration tests and the e2e test. I also deployed it on a real cluster with an increased log level and verified the probing logs while sending a constant stream of requests.
Fixes #346 #349 #310