-
Notifications
You must be signed in to change notification settings - Fork 3.9k
rpc: don't leave poison zero-nodeID connections in pool #37204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
An optimiziation to share the `(target,remoteNodeID)` connection under a second name `(target,0)` backfired because we were never unregistering the latter, meaning that clients requesting `(target,0)` would be handed an eternally broken connection. See cockroachdb#37200. Release note (bug fix): Avoid a source of internal connectivity problems that would resolve after restarting the affected node.
These tests are pretty janky, and can end up failing with a timeout and a deadlocked test, which is not something roachtest can really ever handle gracefully. Sprinkle more contexts around and set a statement timeout for the central query that is most likely to get stuck under the crucial lock that we think "causes" most of the deadlocks. Of course there is likely a real problem with CRDB, which this PR does nothing about. All that is (hopefully) achieved here is a clean failure mode. The failure prompting this PR is fixed by cockroachdb#37204, unfortunately it also turns out that the statement timeout added in this PR did not prevent the statement from hanging. It is probably still worth merging this. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice find and elegant fix. I suppose this was precisely the kind of problem Peter was foreseeing when I initially made the change. I'm sorry I did not consider this before. LGTM in any case. Thank you!
Reviewed 2 of 2 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained
Well, I'm sorry I didn't see it in review :-) It happens. bors r=knz |
37204: rpc: don't leave poison zero-nodeID connections in pool r=knz a=tbg An optimiziation to share the `(target,remoteNodeID)` connection under a second name `(target,0)` backfired because we were never unregistering the latter, meaning that clients requesting `(target,0)` would be handed an eternally broken connection. See #37200. Release note (bug fix): Avoid a source of internal connectivity problems that would resolve after restarting the affected node. Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded |
37205: roachtest: reduce hangs in acceptance-chaos tests r=andreimatei a=tbg These tests are pretty janky, and can end up failing with a timeout and a deadlocked test, which is not something roachtest can really ever handle gracefully. Sprinkle more contexts around and set a statement timeout for the central query that is most likely to get stuck under the crucial lock that we think "causes" most of the deadlocks. Of course there is likely a real problem with CRDB, which this PR does nothing about. All that is (hopefully) achieved here is a clean failure mode. The failure prompting this PR is fixed by #37204, unfortunately it also turns out that the statement timeout added in this PR did not prevent the statement from hanging. It is probably still worth merging this. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
An optimiziation to share the
(target,remoteNodeID)
connection under asecond name
(target,0)
backfired because we were never unregisteringthe latter, meaning that clients requesting
(target,0)
would be handedan eternally broken connection.
See #37200.
Release note (bug fix): Avoid a source of internal connectivity problems
that would resolve after restarting the affected node.