Consolidating all storage behind datastore #350
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g.
} else {
	c.Datastore.pods.Store(pod, true)
This is another bug where we could be overwriting the metrics state of an existing pod.
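A minimal sketch of the kind of guard that avoids this, assuming the datastore keeps a sync.Map of per-pod state; the type and method names below are hypothetical, not the exact code in the PR:

```go
package backend

import "sync"

// PodMetrics holds the probed per-pod state (fields elided in this sketch).
type PodMetrics struct{ /* ... */ }

type datastore struct {
	pods sync.Map // key: pod name, value: *PodMetrics
}

// PodAddIfNotExist creates a fresh entry only when the pod is not already
// tracked, so existing metrics state is never overwritten on re-reconcile.
func (ds *datastore) PodAddIfNotExist(podName string, pm *PodMetrics) bool {
	_, loaded := ds.pods.LoadOrStore(podName, pm)
	return !loaded // true when a new entry was actually added
}
```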
Address: k8sPod.Status.PodIP + ":" + strconv.Itoa(int(inferencePool.Spec.TargetPortNumber)),
}
if !k8sPod.DeletionTimestamp.IsZero() || !c.Datastore.LabelsMatch(k8sPod.ObjectMeta.Labels) || !podIsReady(k8sPod) {
	c.Datastore.pods.Delete(pod)
This is another bug introduced by using backend.Pod as the key, which takes the address and the port into account. If either of those changes (e.g., the pod is deleted and recreated with the same name, as with StatefulSets, or the target port on the InferencePool changes), we wouldn't be able to look up and delete the previous version of the pod.
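A sketch of the alternative this comment points at: key the cache by a stable pod identity rather than by a struct that embeds the address and port. The names are illustrative, not the exact code in the PR:

```go
package backend

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

type datastore struct {
	pods sync.Map // key: types.NamespacedName, value: cached pod state
}

// podKey derives a stable key from the Kubernetes Pod. Unlike a key that
// embeds the IP and target port, it stays valid when the pod is recreated
// with a new address or when the InferencePool's target port changes.
func podKey(pod *corev1.Pod) types.NamespacedName {
	return types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}
}

// PodDelete removes the cached entry by its stable key, so stale versions
// of a recreated pod can always be found and dropped.
func (ds *datastore) PodDelete(pod *corev1.Pod) {
	ds.pods.Delete(podKey(pod))
}
```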
time.Sleep(refreshPrometheusMetricsInterval)
p.flushPrometheusMetricsOnce(logger)
select {
case <-ctx.Done():
In integration tests, those routines were leaking across test cases and causing a lot of problems (a provider from a previous test case overriding the state of a different test case). With this, we force them to shut down with the manager.
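A sketch of tying such a loop to the controller-runtime manager so its context is cancelled on shutdown; registerMetricsFlusher and its parameters are illustrative, not the PR's exact API:

```go
package backend

import (
	"context"
	"time"

	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// registerMetricsFlusher registers a periodic flush loop as a manager
// Runnable. The manager cancels ctx when it stops, so the goroutine cannot
// leak across integration test cases (each test case runs its own manager).
func registerMetricsFlusher(mgr manager.Manager, interval time.Duration, flush func()) error {
	return mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return nil // manager is shutting down; exit cleanly
			case <-ticker.C:
				flush()
			}
		}
	}))
}
```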
Force-pushed from a0cea4f to f9113b3.
logger.V(logutil.DEFAULT).Info("Request handled",
	"model", llmReq.Model, "targetModel", llmReq.ResolvedTargetModel, "endpoint", targetPod)

// Insert target endpoint to instruct Envoy to route requests to the specified target pod.
This lets us avoid storing the port number with the pod in the datastore; the port number is only needed when the EPP responds with the endpoint.
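In other words, the endpoint can be assembled at response time from the cached pod address and the pool's target port. A minimal sketch (the helper name is hypothetical):

```go
package handlers

import (
	"net"
	"strconv"
)

// buildTargetEndpoint joins the pod address (stored in the datastore) with
// the target port read from the InferencePool spec at response time, so the
// port never needs to be cached alongside the pod.
func buildTargetEndpoint(podAddress string, targetPort int32) string {
	return net.JoinHostPort(podAddress, strconv.Itoa(int(targetPort)))
}
```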
}

// Update the pod status parts.
existing.(*PodMetrics).Pod = new.Pod
This allows us to update pod properties; currently only the address.
Going to do this piecewise due to the size of the PR. I've made it through the datastore, which I think is a large portion of this PR. Left some comments; will continue to review.
}

func (ds *K8sDatastore) flushPodsAndRefetch(ctx context.Context, ctrlClient client.Client, newServerPool *v1alpha1.InferencePool) {
func (ds *datastore) PodResyncAll(ctx context.Context, ctrlClient client.Client) {
	// Pool must exist to invoke this function.
This might be true now, but I could see this eventually being extracted into a lib to be used in custom EPPs. We may want to think about capturing the error and returning it to protect future callers.
Not a blocking comment for this PR, however.
+1
We can consider making this method take a pod selector, and having the PoolReconciler send in the pod selector.
I think the pool will always be required since it has the selector; otherwise we wouldn't know which pods to cache. The other option is to pass in the selector, which I am not sure is better since I view the datastore as a per-pool cache.
This significantly cleans up the codebase. Thanks!
(Not a blocker and can be follow-ups.)
We should further break down the backend package. Right now it is a giant package and we are accessing internal fields in many places, making it bug-prone. A refactor may look like this:
- Make the datastore its own package
- Make the types their own package, and add helper functions to properly manage the lifecycle of the objects (e.g., in the future, creating a PodMetrics object will involve starting a metrics refresh goroutine; see PR refresh metrics per pod periodically #223)
- Make each reconciler its own package
}

func (ds *datastore) PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool {
	new := &PodMetrics{
(This can be a follow-up.)
Consider making a NewPodMetrics helper function here.
We should hide the internal fields of the PodMetrics object and add helper functions. This will make PR #223 much easier. Perhaps we need to make PodMetrics an interface instead; I imagine with PR #223 the PodMetrics will need to manage the lifecycle of the refresher goroutines.
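A rough sketch of what such a helper could look like, with illustrative field names; the real PodMetrics struct in the backend package may differ:

```go
package backend

import corev1 "k8s.io/api/core/v1"

// Pod captures the endpoint-identifying properties of a pod.
type Pod struct {
	Name    string
	Address string
}

// Metrics holds the probed values (illustrative subset).
type Metrics struct {
	WaitingQueueSize    int
	KVCacheUsagePercent float64
}

// PodMetrics pairs a pod with its probed metrics.
type PodMetrics struct {
	Pod     Pod
	Metrics Metrics
}

// NewPodMetrics builds the object from the Kubernetes Pod. With #223 this
// constructor would also be the natural place to start (and own) the
// per-pod metrics refresh goroutine.
func NewPodMetrics(pod *corev1.Pod) *PodMetrics {
	return &PodMetrics{
		Pod: Pod{
			Name:    pod.Name,
			Address: pod.Status.PodIP,
		},
	}
}
```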
Let's do a follow-up on that. I think we also need to rename the object as well; perhaps we call it Endpoint? wdyt?
Yeah, renaming sounds good too.
I agree. In other projects we had one pkg for all controllers; I think this should be sufficient (vs. a pkg per controller). I think the datastore should have its own pkg with an internal one. The types should be part of pkg/datastore, since the types go hand in hand with the datastore, so the ext-proc/datastore pkg would include the types as well.
Overall this lgtm. This PR is rather massive, and so relies somewhat on testing to validate that it still operates the same. Will LGTM as the PR description mentions reasonable validation. Will hold to let others review/comment; unhold at your own discretion. Thanks for the big effort here!
/lgtm
if test.req == nil {
	test.req = &ctrl.Request{NamespacedName: namespacedName}
}
if _, err := podReconciler.Reconcile(context.Background(), *test.req); err != nil {
Nice! We are testing the Reconcile method instead of testing a private helper method.
Yeah, we should do that for all controllers.
/lgtm
/lgtm
/hold cancel Thanks all for the review. We still have a number of follow-up items:
This PR removes the intermediate cache in the provider and consolidates all storage behind the datastore component. This reduces what is now called the provider to a pure probing controller. To validate the change, I tried to increase test coverage, especially for the controllers, which uncovered a couple of bugs (I added comments pointing them out).
The PR is long, but unfortunately there is no way to do this refactor without such a significant change. The best way to approach this PR is to start by looking at the Datastore interface, which defines the contract for accessing all cached state related to the pool, the models, and the pods. All accesses to the datastore go only through this interface. The components that access the datastore are the controllers (pool, model, and pod) and the provider (which, as mentioned above, is now in practice a probing controller). As a follow-up, I plan to move the datastore implementation into an internal pkg and make only the interface public.
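For orientation, a rough sketch of the shape of such a contract is below; the method set and the v1alpha1 import path are approximations based on this PR's discussion, not the exact interface:

```go
package datastore

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/gateway-api-inference-extension/api/v1alpha1" // assumed import path
)

// PodMetrics is the cached per-pod state (fields elided in this sketch).
type PodMetrics struct{ /* ... */ }

// Datastore is the single contract through which the controllers (pool,
// model, pod) and the probing provider access cached state.
type Datastore interface {
	// InferencePool
	PoolSet(pool *v1alpha1.InferencePool)
	PoolGet() (*v1alpha1.InferencePool, error)

	// InferenceModels
	ModelSet(model *v1alpha1.InferenceModel)
	ModelGet(modelName string) (*v1alpha1.InferenceModel, bool)
	ModelDelete(modelName string)

	// Pods
	PodUpdateOrAddIfNotExist(pod *corev1.Pod) bool
	PodDelete(pod *corev1.Pod)
	PodResyncAll(ctx context.Context, ctrlClient client.Client)
	PodGetAll() []*PodMetrics
}
```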
I think we are still lacking proper test coverage overall. As a next step, we should prioritize migrating our integration tests to Ginkgo. Moreover, as it stands, the integration test setup is not properly testing things end-to-end (e.g., pods are being injected via a side channel on the datastore instead of being created on the API server and having the controllers populate the datastore).
I'm hoping this PR will allow us to improve the probing logic through a clearer separation of responsibilities. I think #223 is compatible with this direction with the exception of the extra cache it introduces.
Testing: in addition to increasing test coverage, I ran the integration tests and the e2e test. I also deployed it on a real cluster with an increased log level and verified the probing logs while sending a constant stream of requests.
Fixes #346 #349 #310