distsqlrun: Ignore breaker when outbox dials node #40691

Merged 1 commit on Sep 13, 2019

Conversation

rohany
Contributor

@rohany rohany commented Sep 11, 2019

Fixes #38602.

Release justification: Fixes an outstanding bug.

Release note: None

@rohany rohany requested review from tbg, asubiotto and a team September 11, 2019 20:32
@cockroach-teamcity
Member

This change is Reviewable

@rohany
Contributor Author

rohany commented Sep 11, 2019

Let me know if this approach makes sense, and the best way to test a change like this.

Collaborator

@rafiss rafiss left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @rohany, and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 95 at r1 (raw file):

// DialNoBreaker ignores the breaker if there is an error dialing. This function
// should only be used when there is good reason to believe that the node is reachable.
func (n *Dialer) DialNoBreaker(

it seems like this is implemented the same as the above function apart from the breaker stuff. to keep the code DRY, would it make sense to have Dial and DialNoBreaker call out to the same helper function? the only line we would want to skip in the latter case is breaker.Fail(err) i think, so the helper function could just accept a flag for that.
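
A minimal sketch of the shared-helper idea suggested above, assuming a toy breaker that only counts failures. The names `toyBreaker`, `dialHelper`, `recordFailure`, and `alwaysFail` are hypothetical, and the function signatures are simplified for illustration; this is not the code in `nodedialer.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// toyBreaker is a hypothetical stand-in for a circuit breaker. It only
// counts reported failures and is not the breaker type used in the PR.
type toyBreaker struct{ failures int }

// Fail records one failed attempt.
func (b *toyBreaker) Fail(err error) { b.failures++ }

// alwaysFail is a hypothetical connection attempt that never succeeds.
func alwaysFail() error { return errors.New("connection refused") }

// dialHelper holds the logic shared by both entry points. When
// recordFailure is false, a failed dial is returned to the caller
// without being reported to the breaker.
func dialHelper(b *toyBreaker, connect func() error, recordFailure bool) error {
	err := connect()
	if err != nil && recordFailure {
		b.Fail(err)
	}
	return err
}

// Dial reports failures to the breaker.
func Dial(b *toyBreaker, connect func() error) error {
	return dialHelper(b, connect, true)
}

// DialNoBreaker leaves the breaker untouched on failure.
func DialNoBreaker(b *toyBreaker, connect func() error) error {
	return dialHelper(b, connect, false)
}

func main() {
	b := &toyBreaker{}
	_ = Dial(b, alwaysFail)          // counted by the breaker
	_ = DialNoBreaker(b, alwaysFail) // not counted
	fmt.Println("failures recorded:", b.failures) // prints "failures recorded: 1"
}
```

The boolean keeps the two exported functions as one-line wrappers, at the cost of a flag parameter whose meaning is only visible at the helper's definition.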

Contributor Author

@rohany rohany left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @rafiss, and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 95 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

it seems like this is implemented the same as the above function apart from the breaker stuff. to keep the code DRY, would it make sense to have Dial and DialNoBreaker call out to the same helper function? the only line we would want to skip in the latter case is breaker.Fail(err) i think, so the helper function could just accept a flag for that.

There's some other breaker-related stuff happening in the unexported node.dial that we want to avoid too. However, it turned out to be cleaner than I thought to just handle a nil breaker there.
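
The nil-breaker approach described in this reply could be sketched roughly as follows. This assumes a toy breaker with `Ready`, `Fail`, and `Success` methods; `toyBreaker`, `connectOK`, `connectFail`, and the simplified signatures are hypothetical and only illustrate the shape of the idea, not the actual change.

```go
package main

import (
	"errors"
	"fmt"
)

// toyBreaker is a hypothetical breaker that trips on the first failure.
type toyBreaker struct{ tripped bool }

func (b *toyBreaker) Ready() bool    { return !b.tripped }
func (b *toyBreaker) Fail(err error) { b.tripped = true }
func (b *toyBreaker) Success()       { b.tripped = false }

var (
	errBreakerOpen = errors.New("breaker open")
	errRefused     = errors.New("connection refused")
)

// Hypothetical connection attempts used for illustration.
func connectOK() error   { return nil }
func connectFail() error { return errRefused }

// dial is the shared unexported path. A nil breaker means "ignore the
// breaker": it is neither consulted before dialing nor updated afterwards.
func dial(b *toyBreaker, connect func() error) error {
	if b != nil && !b.Ready() {
		return errBreakerOpen
	}
	if err := connect(); err != nil {
		if b != nil {
			b.Fail(err)
		}
		return err
	}
	if b != nil {
		b.Success()
	}
	return nil
}

// Dial consults and updates the breaker.
func Dial(b *toyBreaker, connect func() error) error { return dial(b, connect) }

// DialNoBreaker passes nil so every breaker interaction is skipped.
func DialNoBreaker(connect func() error) error { return dial(nil, connect) }

func main() {
	b := &toyBreaker{tripped: true}
	fmt.Println("Dial:", Dial(b, connectOK))
	fmt.Println("DialNoBreaker:", DialNoBreaker(connectOK))
}
```

With a tripped breaker, `Dial` is rejected before any connection attempt, while `DialNoBreaker` still tries the connection and leaves the breaker state as it was.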

@rohany rohany requested a review from a team September 11, 2019 20:51
Contributor

@asubiotto asubiotto left a comment

:lgtm: from my perspective, might be good to add a test

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @rafiss, @rohany, and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 193 at r2 (raw file):

	// as a stop-gap before the reconciliation occurs.
	if breaker != nil {
		breaker.Success()

Maybe make these nil checks part of the Fail and Success methods?
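
For comparison, the alternative raised in this comment could look something like the sketch below, which relies on Go allowing pointer-receiver methods to be called on a nil pointer. The `toyBreaker` type, its counters, `errBoom`, and the `report` helper are hypothetical and stand in for the real breaker.

```go
package main

import (
	"errors"
	"fmt"
)

// toyBreaker is a hypothetical breaker that counts outcomes.
type toyBreaker struct{ successes, failures int }

// Fail is a no-op on a nil receiver, so callers need no nil check.
func (b *toyBreaker) Fail(err error) {
	if b == nil {
		return
	}
	b.failures++
}

// Success is a no-op on a nil receiver as well.
func (b *toyBreaker) Success() {
	if b == nil {
		return
	}
	b.successes++
}

// errBoom is a hypothetical dial error.
var errBoom = errors.New("boom")

// report forwards a dial outcome to the breaker without checking for nil.
func report(b *toyBreaker, err error) {
	if err != nil {
		b.Fail(err)
		return
	}
	b.Success()
}

func main() {
	var none *toyBreaker // nil: outcomes are silently dropped
	report(none, errBoom)

	real := &toyBreaker{}
	report(real, nil)
	report(real, errBoom)
	fmt.Println(real.successes, real.failures) // prints "1 1"
}
```

This removes the `if breaker != nil` checks from call sites, but the call sites then no longer show that the breaker may be ignored.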


pkg/sql/distsqlrun/outbox.go, line 226 at r2 (raw file):

			// a critical part of query execution: if this step doesn't work, the
			// receiving side might end up hanging or timing out.
			// TODO(asubiotto): We should retry a failed Dial. This rests on the

I think we can remove this TODO, if we attempt to connect to a node that was healthy at plan time and it fails even while ignoring the breaker, that's probably a good enough reason to exit.

Contributor Author

@rohany rohany left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @rafiss, and @tbg)


pkg/rpc/nodedialer/nodedialer.go, line 193 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Maybe make these nil checks part of the Fail and Success methods?

I think it is more explicit this way that we are ignoring the breaker.


pkg/sql/distsqlrun/outbox.go, line 226 at r2 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

I think we can remove this TODO, if we attempt to connect to a node that was healthy at plan time and it fails even while ignoring the breaker, that's probably a good enough reason to exit.

Done.

@rohany rohany force-pushed the outbox-dialing branch 2 times, most recently from 615b2bd to db7caab Compare September 12, 2019 15:31
Collaborator

@rafiss rafiss left a comment

lgtm too, and i agree about adding a test if you can

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto, @rafiss, and @tbg)

@rohany
Contributor Author

rohany commented Sep 12, 2019

PTAL -- I added a test for DialNoBreaker. I was originally confused about how to test this because I thought we would need to recreate the situation in the issue.

@rohany rohany requested a review from rafiss September 12, 2019 17:43
@rohany
Contributor Author

rohany commented Sep 12, 2019

Jk, the test passed locally but not under stress. Let me try again.

@rohany
Contributor Author

rohany commented Sep 12, 2019

Alright, thanks to some help from @ajwerner (tysm) I was able to test this.

Collaborator

@rafiss rafiss left a comment

the test looks great!

Reviewed 3 of 3 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto, @rafiss, and @tbg)

@rohany
Contributor Author

rohany commented Sep 13, 2019

bors r+

@craig
Contributor

craig bot commented Sep 13, 2019

Build failed

@rohany
Contributor Author

rohany commented Sep 13, 2019

I think this is a test flake. Retrying.

bors r+

@craig
Contributor

craig bot commented Sep 13, 2019

Build failed

@rafiss
Collaborator

rafiss commented Sep 13, 2019

The flake should be resolved by #40757

@rohany
Contributor Author

rohany commented Sep 13, 2019

run it back -- some other PRs got in

bors r+

craig bot pushed a commit that referenced this pull request Sep 13, 2019
40691: distsqlrun: Ignore breaker when outbox dials node r=rohany a=rohany

Fixes #38602.

Release justification: Fixes an outstanding bug.

Release note: None

Co-authored-by: Rohan Yadav <[email protected]>
@craig
Contributor

craig bot commented Sep 13, 2019

Build succeeded

Development

Successfully merging this pull request may close these issues.

distsqlrun: ignore breaker when outbox dials node
4 participants