storage: prepare for kv.atomic_replication_changes=true #40370
@nvanbenschoten what are some roachtests I should run before I merge this? I'm thinking definitely the "release whitelist" (whatever that is these days, definitely the headroom tests), clearrange (if that even passes these days), and what else?
Force-pushed from 0221cd2 to 54ab647.
Running some roachtests. From mixed-headroom:
This happened on two nodes for different preemptive snaps, so I think it's very frequent (edit: yes, all the time basically). The recipient nodes were running 19.2, the sender 19.1. I don't know that this was caused by this PR, though; I doubt it was. Going to go back to master to make sure the test passes reliably there first. I also saw #39460 (comment), which I find puzzling. We very explicitly massage that error so that it doesn't bubble up to the client in that form. Something must be going wrong there.
I reproduced the above fatal error on c78d4db (master), so it's not new here. It's probably easy to fix; I will take a look. I also got this one again (on master): Error: restoring fixture: pq: importing 1528 ranges: split at key /T
> @nvanbenschoten what are some roachtests I should run before I merge this? I'm thinking definitely the "release whitelist" (whatever that is these days, definitely the headroom tests), clearrange (if that even passes these days), and what else?
I'd include the `kv/splits/...` tests and the `splits/...` tests. Also, the `import` and `decommission` tests will be interesting.
Reviewed all files across r1–r17.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
pkg/storage/replica_command.go, line 250 at r9 (raw file):
```go
if desc.GetStickyBit().Less(args.ExpirationTime) {
	err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
		dbDescValue, err := conditionalGetDescValueFromDB(ctx, txn, desc)
```
Is a conditional get followed by a conditional put ever necessary with serializable isolation? Shouldn't the conditional get be enough?
pkg/storage/replica_command.go, line 1474 at r16 (raw file):
```go
check := func(kvDesc *roachpb.RangeDescriptor) bool {
	if len(chgs) == 0 {
```
I know you don't want to litter this code with `leaveJoint := len(chgs) == 0` variables, but I still fear that this adds a lot of subtle branching throughout this code that's going to be a lot less clear a year from now than it looks today. Do you have any thoughts on how to improve that?
Perhaps:

```go
type internalReplicationChanges []internalReplicationChange

func (c internalReplicationChanges) leaveJoint() bool { return len(c) == 0 }
```
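To illustrate the suggestion above, here is a minimal, self-contained sketch of how a named slice type with a `leaveJoint()` method reads at call sites. The `typ` field and the `describe` helper are hypothetical stand-ins, not the real CockroachDB types:

```go
package main

import "fmt"

// internalReplicationChange is a hypothetical stand-in for the real struct.
type internalReplicationChange struct{ typ string }

// internalReplicationChanges gives the slice a name so that an
// intent-revealing helper can hang off it, as suggested in the review.
type internalReplicationChanges []internalReplicationChange

// leaveJoint reports whether this request is a pure "leave the joint
// config" operation, which is encoded as an empty change list.
func (c internalReplicationChanges) leaveJoint() bool { return len(c) == 0 }

// describe shows how a call site reads with the helper instead of a bare
// len(chgs) == 0 check scattered through the code.
func describe(chgs internalReplicationChanges) string {
	if chgs.leaveJoint() {
		return "leaving joint config"
	}
	return fmt.Sprintf("applying %d changes", len(chgs))
}

func main() {
	fmt.Println(describe(nil))
	fmt.Println(describe(internalReplicationChanges{{typ: "ADD_REPLICA"}}))
}
```

The branching is still there, but each branch now names its intent rather than repeating a length comparison whose meaning a future reader would have to reconstruct.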
Force-pushed from 54ab647 to 8c76b84.
TFTR! Off to run some roachtests. Will report here.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten)
pkg/storage/replica_command.go, line 250 at r9 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
> Is a conditional get followed by a conditional put ever necessary with serializable isolation? Shouldn't the conditional get be enough?
I had the same thought the other day. Pretty sure a Get and then Puts would be enough (assuming they're in the same txn, which they are).
import/* is too flaky to run a dozen or so times without having to expect at least one failure -- I'm going to run import/tpcc/warehouses=1000/nodes=4, which was the most stable of the bunch with a success rate of 95%. kv/splits and splits/* are at 100%, so I'm going to run them all. acceptance/decommission is too flaky, but decommission/nodes=4/duration=1h0m0s looks good. Final roster:
Going to try master first (now) and this branch tomorrow.
Ok, that invocation (master) ran through five iterations of each test, but with one (known) crash in the import test:
which is likely #39796
Force-pushed from 526a39d to ec1cf20.
My new plan is to merge everything here except flipping the default to on, just to have a smaller diff staged.
This allows simplifying the code because we get to use crt.Desc anywhere we know the trigger was created locally (which is almost everywhere). Release note: None
We don't support removing multiple learners atomically just yet, though #40268 will fix this (likely in raft). That PR, however, is obstructed by #40207 because we'll need that first to be able to bump raft again (assuming we don't want to fork). Instead of dealing with all that upfront, let's just not remove multiple learners at once right now so that we can flip the default for atomic replication changes to on. If anyone is still trying to remove only learners atomically, they will fail. However, the replicate queue places a high priority on removing stray learners whenever it finds them, so this wouldn't be a permanent problem. Besides, we never add multiple learners at once, so it's difficult to get into that state in the first place. Without this commit, TestLearnerAdminRelocateRange fails once atomic replication changes are enabled. Release note: None
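One way to picture the workaround in the commit above is to split a batch of changes so that each step removes at most one learner, leaving the rest for follow-up changes. The `replChange` type and `splitLearnerRemovals` helper below are hypothetical, not the real implementation:

```go
package main

import "fmt"

// replChange is a hypothetical stand-in for a replication change request.
type replChange struct {
	typ    string // e.g. "remove-learner", "add"
	target int    // store ID of the affected replica
}

// splitLearnerRemovals keeps at most one learner removal per batch,
// deferring further removals to subsequent change batches.
func splitLearnerRemovals(chgs []replChange) [][]replChange {
	var batches [][]replChange
	var cur []replChange
	seenRemoval := false
	for _, c := range chgs {
		if c.typ == "remove-learner" {
			if seenRemoval {
				// Start a new batch before a second learner removal.
				batches = append(batches, cur)
				cur = nil
				seenRemoval = false
			}
			seenRemoval = true
		}
		cur = append(cur, c)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	chgs := []replChange{
		{typ: "remove-learner", target: 1},
		{typ: "remove-learner", target: 2},
	}
	fmt.Println(len(splitLearnerRemovals(chgs))) // two sequential batches
}
```

Even if a batch is rejected outright rather than split, the replicate queue cleaning up stray learners gives the same eventual outcome, which is what makes the simple approach acceptable.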
This separates concerns nicely, but it's also a required refactor to be more nuanced about the descriptor changing out from under us. Release note: None
This will allow simplifying the implementation by loading the descriptor from KV (instead of using r.Desc() which makes an assumption that the caller picked the right `r`), and it's already mostly true today. (There are no remote production callers of AdminMerge, so no migration concerns). Release note: None
This is in preparation for not loading the descriptor from `r.Desc()` but using conditionalGetDescValueFromDB to obtain it. Release note: None
We'll soon need to allocate the RangeID only once, but potentially create the descriptor multiple times (for each split txn restart). Release note: None
Prepare for getting an updated descriptor handed back from conditionalGetDescValueFromDB in each txn attempt. Release note: None
We're not going to allow parallel writes anytime soon. Release note: None
I don't think those have been useful in a long time. Release note: None
That way it will be easier to avoid confusion between all of the descriptors floating around. Release note: None
Release note: None
It now takes a closure that gets to decide whether the descriptor returned from KV is "acceptable". This descriptor is then returned, and in turn the method doesn't accept a descriptor, just a key from which to look up the descriptor. The point of all this is to allow different callers to check different things. For instance, a split doesn't care whether the set of replicas changed, and a replication change shouldn't fail if the range has since split. Release note: None
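A minimal sketch of the closure-based shape described in this commit, with invented simplified types (the real code operates on roachpb.RangeDescriptor inside a KV transaction). The point is that each caller supplies its own notion of "acceptable": a split only cares that the key still falls in the range, while a replication change cares about the replica set:

```go
package main

import (
	"errors"
	"fmt"
)

// rangeDesc is a hypothetical, heavily simplified range descriptor.
type rangeDesc struct {
	startKey string
	replicas []int
}

// store plays the role of the descriptor stored in KV.
var store = map[string]*rangeDesc{
	"a": {startKey: "a", replicas: []int{1, 2, 3}},
}

// conditionalGetDesc mirrors the refactor: the caller passes a key and a
// closure deciding whether the descriptor read from KV is acceptable,
// instead of passing in an expected descriptor to compare against.
func conditionalGetDesc(startKey string, check func(*rangeDesc) bool) (*rangeDesc, error) {
	desc, ok := store[startKey]
	if !ok || !check(desc) {
		return nil, errors.New("descriptor changed")
	}
	return desc, nil
}

func main() {
	// A split doesn't care whether the replica set changed meanwhile.
	desc, err := conditionalGetDesc("a", func(d *rangeDesc) bool {
		return d.startKey == "a"
	})
	fmt.Println(desc != nil, err)

	// A replication change checks the replica set and fails on a mismatch.
	_, err = conditionalGetDesc("a", func(d *rangeDesc) bool {
		return len(d.replicas) == 4 // expected replica count no longer holds
	})
	fmt.Println(err)
}
```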
We only want to return early on the "stop after learners" knob if we actually did add learners (and not, for example, on a replica removal). Release note: None
ChangeReplicas (and AdminSplit, and AdminMerge) take a RangeDescriptor that they use as a basis for a CPut to make sure the operations mutating the range serialize. This is great for correctness but generally unfortunate for usability, since on a mismatch the caller usually wanted to do the thing they were trying to do anyway, using the new descriptor. The fact that every split (replication change, merge) basically needs a retry loop is a constant trickle of test flakes and UX papercuts. It became more pressing to do something about this as we routinely use joint configs when atomic replication changes are enabled. A joint configuration is transitioned out of opportunistically whenever it is encountered, but this frequently causes a race in which actor A finds a joint config and begins a transition out of it, but is raced by actor B getting there first. The end result is that what actor A wanted to achieve has been achieved, though by someone else, and the result is a spurious error. This commit fixes that behavior in the targeted case of wanting to leave a joint configuration: actor A will get a successful result. Before this change, `make stress PKG=./pkg/sql TESTS=TestShowRangesWithLocal` would fail immediately when `kv.atomic_replication_changes.enabled` was true, because the splits this test carries out would run into the joint configuration changes of the initial upreplication and would race the replicate queue to transition out of them, which at least one split would typically lose. This still happens, but now it's not an error any more.
I do think that it makes sense to use a similar strategy in general (fail replication changes only if the *replicas* changed, allow all splits except when the split key moves out of the current descriptor, etc.), but in the process of coding this up I got reminded of all of the problems relating to range merges and also found what I think is a long-standing, pretty fatal bug, #40367, so I don't want to do anything until the release is out of the door. But I'm basically convinced that if we did it, it wouldn't cause a new "bug", because any replication change carried out in that way is just one that could be triggered just the same by a user under the old checks. Release note: None
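The race described above can be modeled with a toy: two actors both try to transition a range out of a joint configuration, and with the fix the loser observes the config is no longer joint and treats that as success rather than a spurious error. The `rangeState` type is invented for this sketch:

```go
package main

import (
	"fmt"
	"sync"
)

// rangeState is a toy stand-in for a range's joint-config status.
type rangeState struct {
	mu    sync.Mutex
	joint bool
}

// maybeLeaveJoint transitions out of the joint config. If another actor
// already did so, the desired outcome has been achieved, so it reports
// success instead of a "descriptor changed" error.
func (r *rangeState) maybeLeaveJoint() error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.joint {
		return nil // someone else got there first; that's fine
	}
	r.joint = false
	return nil
}

func main() {
	r := &rangeState{joint: true}
	var wg sync.WaitGroup
	errs := make([]error, 2)
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs[i] = r.maybeLeaveJoint()
		}(i)
	}
	wg.Wait()
	fmt.Println(errs[0], errs[1], r.joint)
}
```

Both actors succeed and the range ends up out of the joint config, which is exactly the behavior the commit wants for the leave-joint case.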
This hopefully makes it easier to reason about the code. Release note: None
Force-pushed from ec1cf20 to 2ce5bb7.
Removed the default=on commit. Found a buglet in the process that would cause crashes when atomic replication changes were turned off. bors r=nvanbenschoten
40370: storage: prepare for kv.atomic_replication_changes=true r=nvanbenschoten a=tbg

First three commits are #40363.

----

This PR enables atomic replication changes by default. But most of it is just dealing with the fallout of doing so:

1. We don't handle removal of multiple learners well at the moment. This will be fixed more holistically in #40268, but it's not worth waiting for that because it's easy for us to just avoid the problem.
2. Tests that carry out splits become quite flaky because at the beginning of a split, we transition out of a joint config if we see one, and due to the initial upreplication we often do. If we lose the race against the replicate queue, the split catches an error for no good reason. I took this as an opportunity to refactor the descriptor comparisons and to make this specific case a noop, while making it easier to avoid this general class of conflict where it's avoidable in the future.

There are probably some more problems that will only become apparent over time, but it's quite simple to turn the cluster setting off again and to patch things up if we do.

Release note (general change): atomic replication changes are now enabled by default.

Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded
40464: storage: kv.atomic_replication_changes=true r=nvanbenschoten a=tbg I ran the experiments in #40370 (comment) on (essentially) this branch and everything passed. Going to run another five instances of mixed-headroom and headroom with this change to shake out anything else that I might've missed. Release note (general change): atomic replication changes are now enabled by default. Co-authored-by: Tobias Schottdorf <[email protected]>