stats/opentelemetry: add trace event for name resolution delay #8074

Open
wants to merge 30 commits into
base: master

vinothkumarr227
Contributor

@vinothkumarr227 vinothkumarr227 commented Feb 10, 2025

stats/opentelemetry: add trace event for name resolution delay.

RELEASE NOTES: None

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 75.86207% with 7 lines in your changes missing coverage. Please review.

Project coverage is 82.10%. Comparing base (e0d191d) to head (ec567ff).
Report is 29 commits behind head on master.

Files with missing lines Patch % Lines
clientconn.go 33.33% 3 Missing and 1 partial ⚠️
stats/opentelemetry/trace.go 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8074      +/-   ##
==========================================
- Coverage   82.29%   82.10%   -0.19%     
==========================================
  Files         387      387              
  Lines       39065    38947     -118     
==========================================
- Hits        32150    31979     -171     
- Misses       5584     5635      +51     
- Partials     1331     1333       +2     
Files with missing lines Coverage Δ
stats/opentelemetry/client_metrics.go 89.44% <100.00%> (+0.05%) ⬆️
stats/opentelemetry/opentelemetry.go 75.00% <ø> (ø)
stream.go 81.55% <100.00%> (-0.14%) ⬇️
stats/opentelemetry/trace.go 78.43% <0.00%> (-4.91%) ⬇️
clientconn.go 90.55% <33.33%> (-1.63%) ⬇️

... and 56 files with indirect coverage changes

@arjan-bal arjan-bal requested a review from purnesh42H February 17, 2025 07:16
@arjan-bal arjan-bal added this to the 1.71 Release milestone Feb 17, 2025
@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Area: Observability Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability labels Feb 17, 2025
@@ -38,6 +38,8 @@ type RPCTagInfo struct {
// FailFast indicates if this RPC is failfast.
// This field is only valid on client side, it's always false on server side.
FailFast bool
// NameResolutionDelay indicates if the RPC was delayed due to address resolution.
Contributor

// NameResolutionDelay indicates if there was a delay in the name resolution.
// This field is only valid on client side, it's always false on server side.

Contributor Author

Done

stream.go Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
}
// Provide an opportunity for the first RPC to see the first service config
// provided by the resolver.
- if err := cc.waitForResolvedAddrs(ctx); err != nil {
+ isDelayed, err := cc.waitForResolvedAddrs(ctx)
Contributor

nit: isDelayed -> nameResDelayed/nameResolutionDelayed

Contributor Author

Done

stream.go Outdated
@@ -416,8 +424,9 @@ func (cs *clientStream) newAttemptLocked(isTransparent bool) (*csAttempt, error)
method := cs.callHdr.Method
var beginTime time.Time
shs := cs.cc.dopts.copts.StatsHandlers
isDelayed, _ := ctx.Value(nameResolutionDelayKey).(bool)
Contributor

same comment about naming the variable as above

Contributor Author

Done

stream.go Outdated
@@ -212,9 +216,13 @@ func newClientStream(ctx context.Context, desc *StreamDesc, cc *ClientConn, meth
}
// Provide an opportunity for the first RPC to see the first service config
// provided by the resolver.
- if err := cc.waitForResolvedAddrs(ctx); err != nil {
+ isDelayed, err := cc.waitForResolvedAddrs(ctx)
Contributor

Returning a bool and an error is not good practice in Go. It breaks the established error-handling pattern, because a returned bool generally indicates success or failure. Can we do something better? It may be fine if we can't, but we should look for better ways.

Contributor Author

I have an approach as follows:

Add a nameResolutionDelay field: Add a new nameResolutionDelay field to the ClientConn struct to store the delay state.
Modify waitForResolvedAddrs: Set the nameResolutionDelay field directly in ClientConn instead of returning a boolean.
Access in newAttemptLocked: Use the nameResolutionDelay field from ClientConn within newAttemptLocked.

Contributor

I don't think we want to add the field to ClientConn, because a ClientConn is not restricted to a single RPC. Returning a struct sounds better, but we don't have any field other than the boolean. Let's keep (bool, error) for now, but make sure the docstring is updated to describe what the bool means.

@purnesh42H
Contributor

just to give you more context

  • The firstResolveEvent is a grpcsync.Event (a custom synchronization primitive). It's used to signal whether the name resolver has sent at least one address update. HasFired() is a method on this event that tells you if the signal has already been sent.
  • If firstResolveEvent.HasFired() returns true, it means the resolver has already provided addresses at least once. In this case, there is no need to wait, so we can immediately return nil to signal that resolution has already occurred in the past. This is the fast path, indicating no delay in name resolution.
  • case <-cc.firstResolveEvent.Done() is part of a select statement. Done() returns a channel that is closed when the event fires, so this case blocks until the resolver sends its first address update. This is the waiting path: the code reaches the select block only if cc.firstResolveEvent.HasFired() was false in the previous step. Because the RPC was blocked, this path indicates delayed name resolution.

@purnesh42H
Contributor

Test Case 1: Fast Path (Line 699)

Setup:
Create a ClientConn with a manual resolver that immediately returns a set of addresses.
Make one RPC call (or call a function that uses waitForResolvedAddrs).
Verify:
The first RPC call should pass without blocking, because the resolver returned addresses quickly.
Make another RPC call. By the time the second RPC call happens, the firstResolveEvent will already have fired.

Test Case 2: Waiting Path (Line 703)

Setup:
Create a ClientConn with a manual resolver that delays returning any addresses.
Start an RPC call (or call a function that uses waitForResolvedAddrs).
Verify:
The RPC should block (we can use a channel mechanism to block/unblock; there are examples in the resolver tests).
After a timeout or a manual trigger, make the manual resolver return addresses.
The blocked RPC should now continue and complete.

@vinothkumarr227
Contributor Author

Done

stream.go Outdated
var mc serviceconfig.MethodConfig
var onCommit func()
newStream := func(ctx context.Context, done func()) (iresolver.ClientStream, error) {
- return newClientStreamWithParams(ctx, desc, cc, method, mc, onCommit, done, opts...)
+ return newClientStreamWithParams(ctx, desc, cc, method, mc, onCommit, done, rpcInfo, opts...)
Contributor

What if we just pass the bool value instead of a struct? Does that work? If so, it's simpler.

Contributor Author

Done


// EmptyCall is a simple RPC that returns an empty response.
func (s *server) EmptyCall(_ context.Context, _ *testgrpc.Empty) (*testgrpc.Empty, error) {
return &testgrpc.Empty{}, nil
Contributor

same. It should be testpb.Empty{}

Contributor Author

Done

defer cancel()
client := testgrpc.NewTestServiceClient(clientConn)
// First RPC call should succeed immediately.
if _, err := client.EmptyCall(ctx, &testgrpc.Empty{}); err != nil {
Contributor

It should be a testpb.Empty{}

Contributor Author

Done

defer cleanup()

statsHandler := &testStatsHandler{}
resolverBuilder := manual.NewBuilderWithScheme("instant")
Contributor

keep it simple like rb := manual.NewBuilderWithScheme("")

Contributor Author

done

resolverBuilder := manual.NewBuilderWithScheme("instant")
resolverBuilder.InitialState(resolver.State{Addresses: []resolver.Address{{Addr: stub.Address}}})
// Create a ClientConn using the manual resolver.
clientConn, err := grpc.NewClient(resolverBuilder.Scheme()+":///test.server",
Contributor

keep it simple like cc, err := grpc.NewClient(resolverBuilder.Scheme()+":///test.server",

Contributor Author

Done

close(resolutionReady)
case <-rpcCompleted:
t.Fatal("RPC completed prematurely before resolution was updated!")
case <-time.After(5 * time.Second):
Contributor

Use test context for timeout

case <-ctx.Done():
       t.Fatalf("Test setup timed out: %v", ctx.Err())
   }

Contributor Author

Done

t.Fatalf("RPC failed after resolution: %v", err)
}
t.Log("RPC completed successfully after resolution.")
case <-time.After(5 * time.Second):
Contributor

same here, use test context for timeout

Contributor Author

Done

grpc.WithStatsHandler(statsHandler),
)
if err != nil {
t.Fatalf("grpc.NewClient error: %v", err)
Contributor

this can be t.Fatalf("NewClient() failed: %v", err)

defer cleanup()

statsHandler := &testStatsHandler{}
clientConn, resolverBuilder := createTestClient(t, "delayed", statsHandler)
Contributor

@janardhanvissa janardhanvissa Feb 25, 2025

this can be cc, rb := createTestClient(t, "delayed", statsHandler)

Contributor Author

Done

}()
// Simulate delayed resolution and unblock it via resolutionReady
go func() {
<-resolutionReady
Contributor

Before <-resolutionReady, you can add t.Logf("RPC waiting for resolved addresses")

Contributor Author

Done

@@ -52,6 +52,9 @@ func populateSpan(rs stats.RPCStats, ai *attemptInfo) {
)
// increment previous rpc attempts applicable for next attempt
atomic.AddUint32(&ai.previousRPCAttempts, 1)
if ai.nameResolutionDelayed {
Contributor

This is wrong. As per https://github.com/grpc/proposal/blob/master/A72-open-telemetry-tracing.md#tracing-information, "Delayed name resolution" should be an event in the call span, not the attempt span.

This should be the right place: https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/client_tracing.go#L34. Before creating the attempt span, retrieve the current call span using trace.SpanFromContext and add an event to that span. Before adding it, check whether the event already exists, and add it only if it does not.

if _, err := client.EmptyCall(ctx, &testpb.Empty{}); err != nil {
t.Fatalf("First RPC failed unexpectedly: %v", err)
}
// Verify that name resolution did not happen.
Contributor

this doesn't mean "name resolution did not happen". Either it should be "name resolution did not happen again" or "Verify that RPC was not blocked on waiting for resolver to return addresses indicating no name resolution delay". I prefer the latter.

if err != nil {
t.Fatalf("RPC failed after resolution: %v", err)
}
if !statsHandler.nameResolutionDelayed {
Contributor

Add a comment here explaining that this verifies the RPC was blocked waiting for the resolver to return addresses, which indicates a name resolution delay.
