
Consolidating all storage behind datastore #350

Merged: 8 commits on Feb 18, 2025

Conversation

Contributor

@ahg-g ahg-g commented Feb 17, 2025

This PR removes the intermediate cache in the provider and consolidates all storage behind the datastore component. This reduces what is currently called the provider to a pure probing controller. To validate the change, I increased test coverage, especially for the controllers, which uncovered a couple of bugs (I added comments pointing them out).

The PR is long, but unfortunately there is no way to do this refactor without such a significant change. The best way to approach it is to start with the Datastore interface, which defines the contract for accessing all cached state related to the pool, the models, and the pods. All access to the datastore goes through this interface. The components that access the datastore are the controllers (pool, model, and pod) and the provider (which, as mentioned above, is now in practice a probing controller). As a follow-up, I plan to move the datastore implementation into an internal pkg and make only the interface public.
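To make the contract concrete, here is a minimal sketch of what such an interface could look like. The pod-related methods (PodUpdateOrAddIfNotExist, PodResyncAll, LabelsMatch) appear in this PR's diff; the pool and model accessors, and the v1alpha1 import path, are illustrative assumptions rather than the actual interface.

package datastore

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Module path is an assumption for illustration.
	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha1"
)

// Datastore is the single entry point to all cached state related to the
// pool, the models, and the pods.
type Datastore interface {
	// Pool state.
	PoolSet(pool *v1alpha1.InferencePool)
	PoolGet() (*v1alpha1.InferencePool, error)
	LabelsMatch(podLabels map[string]string) bool

	// Model state.
	ModelSet(model *v1alpha1.InferenceModel)
	ModelGet(modelName string) (*v1alpha1.InferenceModel, bool)

	// Pod state.
	PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
	PodResyncAll(ctx context.Context, ctrlClient client.Client)
}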

I think we are still lacking proper test coverage overall. As a next step we should prioritize migrating our integration tests to ginkgo; moreover, the integration test setup as it stands is not properly testing things end-to-end (e.g., pods are injected via a side channel on the datastore instead of being created on the api-server and letting the controllers populate the datastore).
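For illustration, an end-to-end style test would create the pod through the api-server client and let the controllers do the rest. This is only a sketch: it assumes an envtest-backed suite where k8sClient, ctx, and the datastore handle ds are wired up elsewhere, and PodExists is a hypothetical accessor.

var _ = ginkgo.It("populates the datastore when a matching pod is created", func() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "vllm-0",
			Namespace: "default",
			// Labels must match the pool's selector for the controllers to pick it up.
			Labels: map[string]string{"app": "vllm"},
		},
		Spec: corev1.PodSpec{Containers: []corev1.Container{{Name: "vllm", Image: "vllm"}}},
	}
	// Create the pod on the api-server instead of injecting it into the datastore.
	gomega.Expect(k8sClient.Create(ctx, pod)).To(gomega.Succeed())

	// Wait for the pod reconciler to observe the event and populate the datastore.
	gomega.Eventually(func() bool {
		return ds.PodExists(pod.Name)
	}, "5s", "100ms").Should(gomega.BeTrue())
})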

I'm hoping this PR will allow us to improve the probing logic through a clearer separation of responsibilities. I think #223 is compatible with this direction with the exception of the extra cache it introduces.

Testing: in addition to increasing test coverage, I ran the integration tests and the e2e test. I also deployed it on a real cluster with an increased log level and verified the probing logs while sending a constant stream of requests.

Fixes #346 #349 #310

@k8s-ci-robot k8s-ci-robot requested a review from Jeffwan February 17, 2025 10:48
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 17, 2025

netlify bot commented Feb 17, 2025

Deploy Preview for gateway-api-inference-extension ready!

Latest commit: b30c90d
Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67b4f8f88daa3b0009a5cff3
Deploy Preview: https://deploy-preview-350--gateway-api-inference-extension.netlify.app

	} else {
		c.Datastore.pods.Store(pod, true)
Contributor Author

This is another bug: here we could be overwriting the metrics state of an existing pod.

		Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
	}
	if !k8sPod.DeletionTimestamp.IsZero() || !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
		c.Datastore.pods.Delete(pod)
Contributor Author

This is another bug, introduced by using backend.Pod as a key, which takes the address and the port into account. If either of those changes (e.g., the pod is deleted and recreated with the same name, as with StatefulSets, or the port on the inferencePool changes), we wouldn't be able to look up and delete the previous version of the pod.
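One way to avoid this class of bug is to key the pods map by the pod's namespaced name, which is stable across address and port changes, instead of by the full backend.Pod value. A fragment-level sketch, with the surrounding reconciler details assumed:

// Key by namespaced name rather than by the backend.Pod value.
key := types.NamespacedName{Namespace: k8sPod.Namespace, Name: k8sPod.Name}
if !k8sPod.DeletionTimestamp.IsZero() || !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
	// The lookup succeeds even if the pod's IP or the pool's target port
	// changed, because neither is part of the key.
	c.Datastore.pods.Delete(key)
	return ctrl.Result{}, nil
}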

		time.Sleep(refreshPrometheusMetricsInterval)
		p.flushPrometheusMetricsOnce(logger)
		select {
		case <-ctx.Done():
Contributor Author

In integration tests, those goroutines were leaking across test cases, causing a lot of problems (a provider from a previous test case overwriting the state of a different test case). With this, we force them to shut down with the manager.
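For reference, a ticker-based variant of the same pattern (the PR itself keeps the time.Sleep plus select shown above): the loop's lifetime is tied to the context passed in by the manager, so the goroutine exits on shutdown.

func (p *Provider) refreshPrometheusMetrics(ctx context.Context, logger logr.Logger) {
	ticker := time.NewTicker(refreshPrometheusMetricsInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// The manager cancelled the context; stop the goroutine.
			logger.Info("Shutting down Prometheus metrics goroutine")
			return
		case <-ticker.C:
			p.flushPrometheusMetricsOnce(logger)
		}
	}
}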

@ahg-g ahg-g force-pushed the cache branch 2 times, most recently from a0cea4f to f9113b3 Compare February 17, 2025 21:49
	logger.V(logutil.DEFAULT).Info("Request handled",
		"model", llmReq.Model, "targetModel", llmReq.ResolvedTargetModel, "endpoint", targetPod)

	// Insert target endpoint to instruct Envoy to route requests to the specified target pod.
Contributor Author

This lets us avoid storing the port number with the pod in the datastore; the port number is only needed when the EPP responds with the endpoint.
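A sketch of what that looks like when building the response, assuming a PoolGet accessor on the datastore (the cached pod entry carries only the address):

pool, err := ds.PoolGet()
if err != nil {
	return err
}
// The target port comes from the pool spec, not from the cached pod.
endpoint := net.JoinHostPort(targetPod.Address, strconv.Itoa(int(pool.Spec.TargetPortNumber)))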

	}

	// Update the pod status parts.
	existing.(*PodMetrics).Pod = new.Pod
Contributor Author

This allows us to update pod properties (currently only the address) without clobbering the existing metrics state.
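A sketch of the surrounding update-or-add flow; the signature matches the one in this PR, while the field names and the choice of key are assumptions:

func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool {
	new := &PodMetrics{
		Pod: Pod{Name: pod.Name, Address: pod.Status.PodIP},
	}
	existing, ok := ds.pods.Load(pod.Name)
	if !ok {
		ds.pods.Store(pod.Name, new)
		return true
	}
	// Update the pod status parts only, keeping the existing metrics state intact.
	existing.(*PodMetrics).Pod = new.Pod
	return false
}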

Collaborator

@kfswain kfswain left a comment

Gonna do this piecewise due to the size of the PR. I've made it through the datastore, which I think is a large portion of this PR. Left some comments; will continue to review.

}

func (ds *K8sDatastore) flushPodsAndRefetch(ctx context.Context, ctrlClient client.Client, newServerPool *v1alpha1.InferencePool) {
func (ds *datastore) PodResyncAll(ctx context.Context, ctrlClient client.Client) {
	// Pool must exist to invoke this function.
Collaborator

This might be true now, but I could see this eventually being extracted to a lib to be used in custom EPPs. We may want to think about capturing the error and returning it to protect future callers.

Not a blocking comment for this PR, however.

Contributor

+1

We could consider making this method take a pod selector, and have the PoolReconciler pass the selector in.
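A sketch of that signature, with the selector supplied by the caller (body details assumed):

func (ds *datastore) PodResyncAll(ctx context.Context, ctrlClient client.Client, selector labels.Selector) {
	podList := &corev1.PodList{}
	if err := ctrlClient.List(ctx, podList, &client.ListOptions{LabelSelector: selector}); err != nil {
		return
	}
	// Repopulate the cache; a full resync would also flush entries that no longer match.
	for i := range podList.Items {
		ds.PodUpdateOrAddIfNotExist(&podList.Items[i])
	}
}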

Contributor Author

I think the pool will always be required since it has the selector; otherwise we wouldn't know which pods to cache. The other option is to pass in the selector, but I am not sure that is better, since I view the datastore as a per-pool cache.

Contributor

@liu-cong liu-cong left a comment

This significantly cleans up the codebase. Thanks!

(Not a blocker and can be follow ups)
We should further break down the backend package. Right now it is a giant package and we are accessing internal fields in many places, making it bug-prone. A refactor may look like this:

  • Make the datastore its own package
  • Make the types its own package, and add helper functions to properly manage lifecycle of the objects (e.g., in the future creating a PodMetrics object will involve starting a metrics refresh goroutine, in PR refresh metrics per pod periodically #223)
  • Make each reconciler its own package


}

func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool {
	new := &PodMetrics{
Contributor

(This can be a follow-up.)
Consider making a NewPodMetrics helper function here.

We should hide the internal fields of the PodMetrics object and add helper functions; this will make PR #223 much easier. Perhaps we need to make PodMetrics an interface instead. I imagine that with PR #223 PodMetrics will need to manage the lifecycle of the refresher goroutines.
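A sketch of the suggested helper; the field names and zero values are assumptions:

func NewPodMetrics(pod *corev1.Pod) *PodMetrics {
	return &PodMetrics{
		Pod: Pod{Name: pod.Name, Address: pod.Status.PodIP},
		// Start with empty metrics; with PR #223 this constructor could also
		// start the per-pod metrics refresh goroutine.
		Metrics: Metrics{ActiveModels: map[string]int{}},
	}
}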

Contributor Author

Let's do a follow-up on that; I think we also need to rename the object. Perhaps we call it Endpoint? wdyt?

Contributor

Yeah renaming sounds good too.

@ahg-g
Contributor Author

ahg-g commented Feb 18, 2025

This significantly cleans up the codebase. Thanks!

(Not a blocker and can be follow ups) We should further break down the backend package. Right now it is a giant package and we are accessing internal fields in many places, making it bug-prone. A refactor may look like this:

  • Make the datastore its own package
  • Make the types its own package, and add helper functions to properly manage lifecycle of the objects (e.g., in the future creating a PodMetrics object will involve starting a metrics refresh goroutine, in PR refresh metrics per pod periodically #223)
  • Make each reconciler its own package

I agree. In other projects we had one pkg for all controllers, and I think that should be sufficient (vs. a pkg per controller). I think the datastore should have its own pkg with an internal one, so it looks like this: ext-proc/datastore/internal and ext-proc/datastore/interface.go.

I think the types should be part of pkg/datastore; the types go hand in hand with the datastore.

So the ext-proc/datastore pkg would include:

  • types.go
  • inference.go
  • internal/implementation.go

@kfswain
Collaborator

kfswain commented Feb 18, 2025

Overall this lgtm. The PR is rather massive, and so relies somewhat on testing to validate that it still operates the same. Will LGTM as the PR description mentions reasonable validation.
/lgtm

Will hold to let others review/comment, unhold at your own discretion.
/hold

Thanks for the big effort here!

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 18, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2025
Contributor

@liu-cong liu-cong left a comment

/lgtm

	if test.req == nil {
		test.req = &ctrl.Request{NamespacedName: namespacedName}
	}
	if _, err := podReconciler.Reconcile(context.Background(), *test.req); err != nil {
Contributor

Nice! We are testing the Reconcile method instead of a private helper method.

Contributor Author

Yeah, we should do that for all controllers.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2025
@liu-cong
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 18, 2025
@kfswain
Collaborator

kfswain commented Feb 18, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 18, 2025
@k8s-ci-robot k8s-ci-robot merged commit 5705c58 into kubernetes-sigs:main Feb 18, 2025
8 checks passed
Successfully merging this pull request may close these issues.

Remove the pod cache in datastore