rpc: don't leave poison zero-nodeID connections in pool #37204

tbg · 2019-04-30T10:59:57Z

An optimization to share the (target,remoteNodeID) connection under a
second name (target,0) backfired because we were never unregistering
the latter, meaning that clients requesting (target,0) would be handed
an eternally broken connection.

See #37200.

Release note (bug fix): Avoid a source of internal connectivity problems
that would resolve after restarting the affected node.
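
The failure mode described above can be sketched with a toy connection pool. This is a minimal illustration, not the actual CockroachDB `rpc` package code: the `pool`, `connKey`, `conn`, `dial`, and `remove` names and the `removeAlias` switch are all hypothetical, and the real fix may be structured differently. The sketch only shows why a connection registered under two keys has to be unregistered under both.

```go
package main

import (
	"fmt"
	"sync"
)

// connKey identifies a pooled connection. In this sketch a nodeID of 0
// means "whichever node is listening at target". (Hypothetical type.)
type connKey struct {
	target string
	nodeID int
}

// conn is a stand-in for a real RPC connection.
type conn struct {
	healthy bool
}

// pool is a heavily simplified, hypothetical connection pool.
type pool struct {
	mu    sync.Mutex
	conns map[connKey]*conn
}

func newPool() *pool { return &pool{conns: map[connKey]*conn{}} }

// dial returns the pooled connection for (target, nodeID), creating one if
// needed. A connection created for a specific node ID is also registered
// under the (target, 0) alias so callers without a node ID can share it.
func (p *pool) dial(target string, nodeID int) *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	k := connKey{target, nodeID}
	if c, ok := p.conns[k]; ok {
		return c
	}
	c := &conn{healthy: true}
	p.conns[k] = c
	if nodeID != 0 {
		alias := connKey{target, 0}
		if _, ok := p.conns[alias]; !ok {
			p.conns[alias] = c
		}
	}
	return c
}

// remove marks the connection as broken and unregisters it. With
// removeAlias=false only the primary key is dropped, which leaves the
// broken connection reachable through (target, 0).
func (p *pool) remove(target string, nodeID int, removeAlias bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	k := connKey{target, nodeID}
	c, ok := p.conns[k]
	if !ok {
		return
	}
	c.healthy = false
	delete(p.conns, k)
	if removeAlias {
		alias := connKey{target, 0}
		// Only drop the alias if it still points at this connection.
		if p.conns[alias] == c {
			delete(p.conns, alias)
		}
	}
}

func main() {
	for _, fix := range []bool{false, true} {
		p := newPool()
		p.dial("n2:26257", 2)        // registers (n2,2) and the alias (n2,0)
		p.remove("n2:26257", 2, fix) // connection breaks and is unregistered
		c := p.dial("n2:26257", 0)   // a client asks for (n2,0)
		fmt.Printf("removeAlias=%t healthy=%t\n", fix, c.healthy)
	}
}
```

With `removeAlias=false`, the second `dial` returns the broken connection forever. With `removeAlias=true`, it finds no entry and creates a fresh one.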

cockroach-teamcity · 2019-04-30T11:00:06Z

This change is

An optimization to share the `(target,remoteNodeID)` connection under a second name `(target,0)` backfired because we were never unregistering the latter, meaning that clients requesting `(target,0)` would be handed an eternally broken connection. See cockroachdb#37200. Release note (bug fix): Avoid a source of internal connectivity problems that would resolve after restarting the affected node.

These tests are pretty janky, and can end up failing with a timeout and a deadlocked test, which is not something roachtest can really ever handle gracefully. Sprinkle more contexts around and set a statement timeout for the central query that is most likely to get stuck under the crucial lock that we think "causes" most of the deadlocks. Of course there is likely a real problem with CRDB, which this PR does nothing about. All that is (hopefully) achieved here is a clean failure mode. The failure prompting this PR is fixed by cockroachdb#37204. Unfortunately, it also turns out that the statement timeout added in this PR did not prevent the statement from hanging. It is probably still worth merging this. Release note: None

knz

Nice find and elegant fix. I suppose this was precisely the kind of problem Peter was foreseeing when I initially made the change. I'm sorry I did not consider this before. LGTM in any case. Thank you!

Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained

tbg · 2019-04-30T13:01:11Z

> I'm sorry I did not consider this before

Well, I'm sorry I didn't see it in review :-) It happens.

bors r=knz

37204: rpc: don't leave poison zero-nodeID connections in pool r=knz a=tbg An optimization to share the `(target,remoteNodeID)` connection under a second name `(target,0)` backfired because we were never unregistering the latter, meaning that clients requesting `(target,0)` would be handed an eternally broken connection. See #37200. Release note (bug fix): Avoid a source of internal connectivity problems that would resolve after restarting the affected node. Co-authored-by: Tobias Schottdorf <[email protected]>

craig · 2019-04-30T13:28:06Z

Build succeeded

GitHub CI (Cockroach)

37205: roachtest: reduce hangs in acceptance-chaos tests r=andreimatei a=tbg These tests are pretty janky, and can end up failing with a timeout and a deadlocked test, which is not something roachtest can really ever handle gracefully. Sprinkle more contexts around and set a statement timeout for the central query that is most likely to get stuck under the crucial lock that we think "causes" most of the deadlocks. Of course there is likely a real problem with CRDB, which this PR does nothing about. All that is (hopefully) achieved here is a clean failure mode. The failure prompting this PR is fixed by #37204. Unfortunately, it also turns out that the statement timeout added in this PR did not prevent the statement from hanging. It is probably still worth merging this. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>

tbg requested a review from a team April 30, 2019 10:59

tbg requested a review from knz April 30, 2019 11:00

tbg force-pushed the fix/rpc-poison branch from 93098e5 to 0dd9ca7 Compare April 30, 2019 11:10

tbg mentioned this pull request Apr 30, 2019

roachtest: TC timing out with test downloading debug.zip after acceptance/bank/cluster-recovery timeout #37200

Closed

tbg mentioned this pull request Apr 30, 2019

roachtest: reduce hangs in acceptance-chaos tests #37205

Merged

knz approved these changes Apr 30, 2019

craig bot merged commit 0dd9ca7 into cockroachdb:master Apr 30, 2019

tbg mentioned this pull request May 1, 2019

roachtest: gossip/chaos/nodes=9 failed #37118

Closed

knz mentioned this pull request Nov 10, 2019

User-facing changes in 19.2 that were not picked up in release notes cockroachdb/docs#5819

Closed
