Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestChangeReplicasDescriptorInvariant got stuck #2289

Closed
tbg opened this issue Aug 28, 2015 · 3 comments
Closed

TestChangeReplicasDescriptorInvariant got stuck #2289

tbg opened this issue Aug 28, 2015 · 3 comments
Labels
C-test-failure Broken test (automatically or manually discovered).

Comments

@tbg
Copy link
Member

tbg commented Aug 28, 2015

The following test appears to have failed:

#6620:

W0828 15:03:57.947676 949 multiraft/multiraft.go:930  node 200000002 failed to send message to 100000001: multiraft/transport.go:157: transport is closed
W0828 15:03:58.028524 949 multiraft/multiraft.go:930  node 200000002 failed to send message to 100000001: multiraft/transport.go:157: transport is closed
W0828 15:03:58.028688 949 multiraft/multiraft.go:930  node 200000002 failed to send message to 300000003: multiraft/transport.go:157: transport is closed
W0828 15:03:58.048768 949 multiraft/multiraft.go:930  node 300000003 failed to send message to 100000001: multiraft/transport.go:157: transport is closed
W0828 15:03:58.048940 949 multiraft/multiraft.go:930  node 300000003 failed to send message to 200000002: multiraft/transport.go:157: transport is closed
panic: test timed out after 5m0s

goroutine 3512 [running]:
testing.startAlarm.func1()
    /tmp/workdir/go/src/testing/testing.go:703 +0x16b
created by time.goFunc
    /tmp/workdir/go/src/time/sleep.go:129 +0x6e

goroutine 1 [chan receive, 4 minutes]:
testing.RunTests(0x14c3bd0, 0x1a38c00, 0xb9, 0xb9, 0x1252001)
    /tmp/workdir/go/src/testing/testing.go:562 +0xafa
--
goroutine 3429 [select]:
github.com/coreos/etcd/raft.(*multiNode).run(0xc8204728a0)
    /go/src/github.com/coreos/etcd/raft/multinode.go:180 +0x2b32
created by github.com/coreos/etcd/raft.StartMultiNode
    /go/src/github.com/coreos/etcd/raft/multinode.go:56 +0x338
FAIL    github.com/cockroachdb/cockroach/storage    300.076s
=== RUN   TestBatchBasics
I0828 14:59:01.153953 959 storage/engine/rocksdb.go:122  closing in-memory rocksdb instance
--- PASS: TestBatchBasics (0.00s)
=== RUN   TestBatchGet
I0828 14:59:01.156977 959 storage/engine/rocksdb.go:122  closing in-memory rocksdb instance
--- PASS: TestBatchGet (0.00s)
=== RUN   TestBatchMerge
I0828 14:59:01.161901 959 storage/engine/rocksdb.go:122  closing in-memory rocksdb instance
--- PASS: TestBatchMerge (0.00s)
=== RUN   TestBatchProto

Please assign, take a look and update the issue accordingly.

@tbg tbg added the C-test-failure Broken test (automatically or manually discovered). label Aug 28, 2015
@tbg tbg changed the title Test failure in CI build 6620 TestChangeReplicasDescriptorInvariant got stuck Aug 30, 2015
@tbg
Copy link
Member Author

tbg commented Aug 30, 2015

@bdarnell any ideas?

@tbg
Copy link
Member Author

tbg commented Aug 30, 2015

The issue is a worker not shutting down (Quiesce() is through). My bet's on this one:

goroutine 3400 [runnable]:
net.(_OpError).Temporary(0xc82010acd0, 0x7f5383c6cc28)
/tmp/workdir/go/src/net/net.go:450 +0xef
github.com/cockroachdb/cockroach/rpc/codec.(_baseConn).write(0xc82039cc60, 0x7f5383c85b00, 0xc820289d00, 0xc82039cc80, 0x1, 0xa, 0x0, 0x0)

@bdarnell
Copy link
Contributor

If that goroutine was runnable when time ran out, my guess is that it got stuck in a loop with a "temporary" error occurring over and over.

According to https://github.com/golang/go/blob/34695c4742dd8055ed88b409014353e99288c43e/src/syscall/syscall_unix.go#L112, ECONNRESET and ECONNABORTED count as Temporary(), which doesn't look right to me. Looking back through blame, I see golang/go#6163 and https://codereview.appspot.com/141600043, which added them to the Temporary list to fix an issue with net/http's accept loop (but as dvyukov noted this is not correct in other contexts).

I think we should just remove the Temporary check here - I don't think any of the errors that could be returned by Write() could actually be temporary, and we should just kill the connection no matter what kind of error we get.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered).
Projects
None yet
Development

No branches or pull requests

2 participants