rpc: don't leave poison zero-nodeID connections in pool #37204

Merged
merged 1 commit into from
Apr 30, 2019

Conversation

tbg
Member

@tbg tbg commented Apr 30, 2019

An optimization to share the (target,remoteNodeID) connection under a
second name (target,0) backfired because we were never unregistering
the latter, meaning that clients requesting (target,0) would be handed
an eternally broken connection.

See #37200.

Release note (bug fix): Avoid a source of internal connectivity problems
that would resolve after restarting the affected node.

@tbg tbg requested a review from a team April 30, 2019 10:59

@tbg tbg requested a review from knz April 30, 2019 11:00
@tbg tbg force-pushed the fix/rpc-poison branch from 93098e5 to 0dd9ca7 Compare April 30, 2019 11:10
tbg added a commit to tbg/cockroach that referenced this pull request Apr 30, 2019
These tests are pretty janky, and can end up failing with a timeout and
a deadlocked test, which is not something roachtest can really ever
handle gracefully. Sprinkle more contexts around and set a statement
timeout for the central query that is most likely to get stuck under the
crucial lock that we think "causes" most of the deadlocks.

Of course there is likely a real problem with CRDB, which this PR does
nothing about. All that is (hopefully) achieved here is a clean failure
mode. The failure prompting this PR is fixed by cockroachdb#37204. Unfortunately,
it also turns out that the statement timeout added in this PR did not
prevent the statement from hanging. It is probably still worth merging
this.

Release note: None
Contributor

@knz knz left a comment

Nice find and elegant fix. I suppose this was precisely the kind of problem Peter was foreseeing when I initially made the change. I'm sorry I did not consider this before. LGTM in any case. Thank you!

Reviewed 2 of 2 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tbg
Member Author

tbg commented Apr 30, 2019

> I'm sorry I did not consider this before

Well, I'm sorry I didn't see it in review :-) It happens.

bors r=knz

craig bot pushed a commit that referenced this pull request Apr 30, 2019
37204: rpc: don't leave poison zero-nodeID connections in pool r=knz a=tbg

Co-authored-by: Tobias Schottdorf <[email protected]>
@craig
Contributor

craig bot commented Apr 30, 2019

Build succeeded

@craig craig bot merged commit 0dd9ca7 into cockroachdb:master Apr 30, 2019
craig bot pushed a commit that referenced this pull request May 1, 2019
37205: roachtest: reduce hangs in acceptance-chaos tests r=andreimatei a=tbg

Co-authored-by: Tobias Schottdorf <[email protected]>