
Status updates occurring close together result in an exception #409


Closed
shawkins opened this issue Apr 25, 2021 · 20 comments


@shawkins
Collaborator

We have a custom resource with several operands. Our controller returns UpdateControl.updateStatusSubResource on every call to ResourceController.createOrUpdateResource. When several operand updates are processed close together, we sometimes see:

ERROR [io.jav.ope.pro.EventDispatcher] Error during event processing ExecutionScope
...
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT
...
the object has been modified; please apply your changes to the latest version and try again

I need to run in debug mode to check whether the operator SDK is retrying this. If it is retrying, can this be logged at debug level? If it's not, is this something that the operator SDK or fabric8 should be handling?
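
For context, the controller side is shaped roughly like the sketch below. This is against the 1.x ResourceController API as we use it; ManagedService and its status field are placeholder names rather than our actual types.

```java
import io.javaoperatorsdk.operator.api.Context;
import io.javaoperatorsdk.operator.api.Controller;
import io.javaoperatorsdk.operator.api.DeleteControl;
import io.javaoperatorsdk.operator.api.ResourceController;
import io.javaoperatorsdk.operator.api.UpdateControl;

@Controller
public class ManagedServiceController implements ResourceController<ManagedService> {

  @Override
  public UpdateControl<ManagedService> createOrUpdateResource(
      ManagedService resource, Context<ManagedService> context) {
    // reconcile the operands (Deployments, Secrets, ConfigMaps, ...) here,
    // then reflect their observed state on the status
    resource.getStatus().setReady(true); // placeholder status change

    // the status subresource is written back on every call, which is where
    // the PUT against /status can hit the conflict shown above
    return UpdateControl.updateStatusSubResource(resource);
  }

  @Override
  public DeleteControl deleteResource(ManagedService resource, Context<ManagedService> context) {
    // nothing special for this sketch
    return DeleteControl.DEFAULT_DELETE;
  }
}
```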

@shawkins
Collaborator Author

In debug it does appear that the retry handling is kicking in: the logic looks for any new events and runs the update again. So this does appear to be benign. @metacosm, are you open to a change that would lower the amount of logging related to this exception?

@metacosm
Collaborator

Yep.

shawkins added a commit to shawkins/java-operator-sdk that referenced this issue Apr 27, 2021
@shawkins
Collaborator Author

I've submitted a PR to change this to an info log without a stack trace.

shawkins added a commit to shawkins/java-operator-sdk that referenced this issue Apr 27, 2021
this addressed operator-framework#409 where the log contains verbose error messages
related to status update conflicts
metacosm added a commit that referenced this issue Apr 28, 2021
fix for #409 adjusting logging around kubernetes exceptions
@metacosm
Collaborator

Can we close this issue then? Or are there still things that need to be investigated?

@shawkins
Collaborator Author

shawkins commented Apr 28, 2021

No, it seems that this is how it's expected to work. The only alternative/addition would be to add locking per resource passed to createOrUpdate, so that its processing could be serialized; at least for the single-operator case that would typically eliminate these conflicts.

@metacosm
Collaborator

What do you think, @csviri?

@csviri
Collaborator

csviri commented Apr 28, 2021

Hi @metacosm and @shawkins
I'm not sure I understand the situation correctly: processing is already serialized within the operator for a given custom resource. If I understand correctly, what happens is that another process (or processes) changes the CR frequently, which leads to an optimistic locking error.

By locking I guess you mean pessimistic locking; I'm not aware that Kubernetes supports such a thing. What can be done, however, is to not use optimistic locking for updates to the status subresource, since the controller is the only process that should change the status anyway.
See:
https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/EventDispatcher.java#L203-L207

I think this needs to be changed in the fabric8 client. @metacosm

@shawkins
Collaborator Author

If I understand correctly, what happens is that another process (or processes) changes the CR frequently, which leads to an optimistic locking error.

The CR is not changing frequently. It does, however, have quite a few operands, which are event sources. My test to reproduce this creates the CRs, sleeps for a bit, then makes an update to the CRs that triggers the deletion of the operands. This is just a single operator, and a single process creating and then modifying the spec of the CRs.

From what I can now gather from the code and this discussion, you do expect the updates against a single CR to effectively be serialized, so that the cache update happens in time for the next batch of events. So the question becomes: why would there be a conflict in this scenario? I see comments and checks in the caching logic https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/DefaultEventHandler.java#L211 indicating that it's expected for things to get out of sync, but presumably only when there are more actors.

@csviri
Collaborator

csviri commented Apr 28, 2021

@shawkins, just to clarify one thing: what do you mean by operand? Is it a resource that is created by the controller and watched by an event source?

Regarding this comment:
https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/DefaultEventHandler.java#L211

This is an edge case with event sources: as described, if we do an update and then execute the controller again because there were already events from an event source other than the custom resource event source, the cached custom resource could be out of date (since we only cache it when a change arrives from the watch). But this code should solve that problem in all cases. So if by actor you mean an event source, then yes, this situation only applies when there are additional event sources.

@shawkins
Collaborator Author

just to clarify one thing: what do you mean by operand? Is it a resource that is created by the controller and watched by an event source?

That is correct. The CR our operator manages creates another custom resource, ConfigMaps, Secrets, Deployments, etc. We watch those resources via an informer and register them as event sources.
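
Roughly, the wiring looks like the sketch below. Treat it as an illustration rather than our actual ResourceEventSource: the informer factory signature varies between fabric8 versions, and ownerUidOf / forwardEventFor stand in for the code that maps an operand back to its owning CR and hands an event to the SDK.

```java
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentList;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class OperandWatcher {

  void watchDeployments(KubernetesClient client) {
    // one informer per operand type (Deployments, Secrets, ConfigMaps, ...)
    SharedIndexInformer<Deployment> informer =
        client.informers().sharedIndexInformerFor(Deployment.class, DeploymentList.class, 60_000L);

    informer.addEventHandler(new ResourceEventHandler<Deployment>() {
      @Override
      public void onAdd(Deployment d) { forwardEventFor(ownerUidOf(d)); }

      @Override
      public void onUpdate(Deployment oldD, Deployment newD) { forwardEventFor(ownerUidOf(newD)); }

      @Override
      public void onDelete(Deployment d, boolean unknownFinalState) { forwardEventFor(ownerUidOf(d)); }
    });

    client.informers().startAllRegisteredInformers();
  }

  private String ownerUidOf(Deployment d) {
    // placeholder: resolve the owning custom resource, e.g. via ownerReferences or labels
    return d.getMetadata().getOwnerReferences().get(0).getUid();
  }

  private void forwardEventFor(String customResourceUid) {
    // placeholder: hand an event for that CR to the SDK's event handler so the
    // controller runs again; this is what ultimately triggers the status update
  }
}
```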

For a given resource R we see logs similar to:

2021-04-24 21:43:12,309 DEBUG [org.bf2.ope.eve.ResourceEventSource] Update event received for io.fabric8.kubernetes.api.model.apps.Deployment x for resource R
...
2021-04-24 21:43:12,433 ERROR [io.jav.ope.pro.EventDispatcher] Error during event processing ExecutionScope "with different operand events for R than the previous"

@adam-sandor
Collaborator

I checked in on this, and I agree that you can't assume other processes won't be changing the status. It's just how Kubernetes works. So optimistic locking with just an info log is the right solution, IMHO.

@csviri
Collaborator

csviri commented Apr 29, 2021

It's not restricted in any way; it's just meant that only the controller updates the status. But I haven't seen this stated anywhere explicitly (maybe I'll ask around on the K8s Slack just out of curiosity). If you have a nice reference about this, please share it, @adam-sandor.

The thing is that optimistic locking basically won't help in any scenario here, since the status is overridden by the controller on the next run anyway. So while it makes sense to use optimistic locking on the spec, it does not make sense for the status. (Note that status is a subresource; overriding it does not affect other parts of the resource.) In other words, the lost-update problem does not apply to the status subresource.

I think this is quite a detail; it's more interesting from the philosophical aspect of k8s. The positive effect of not failing on this, on the other hand, is that no unnecessary retry happens:

Let's say that after a controller run we want to change the status, but just before that somebody adds an annotation to the CR or even changes the spec. Now we get an exception (a conflict because of optimistic locking), which results in exceptions in the logs, and the retry process kicks in.
Whereas if we just updated the status (without optimistic locking) to reflect the actual world we just reconciled (see the sketch after this list), then either:

  1. If there was no spec change (just metadata, for example), there will be no reconciliation at all (whereas now there will be). This is what we want.
  2. If the spec has changed in the meantime, the controller reconciles and we change the status anyway. This is also what we want.
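
To make that concrete, a status write without the optimistic-locking precondition would look roughly like the following. MyResource, namespace, name and reconciledStatus are placeholders, resourceClient stands for however the typed client for the CR was obtained, and whether the API server actually accepts a PUT to /status without a resourceVersion is a separate question (it comes up again below).

```java
// illustration only: drop the optimistic-locking precondition before writing the status
MyResource latest = resourceClient.inNamespace(namespace).withName(name).get();
latest.setStatus(reconciledStatus);
// with no resourceVersion in the request body the API server has no version to
// compare against, so a concurrent metadata/spec change cannot produce a 409 here
latest.getMetadata().setResourceVersion(null);
resourceClient.inNamespace(namespace).withName(name).updateStatus(latest);
```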

@shawkins
Collaborator Author

shawkins commented Apr 29, 2021

As best I can tell, we are observing this sequence:

  1. Custom Resource CR Revision 1 is in cache
  2. ExecutionConsumer to update Custom Resource CR Revision 1 is queued
  3. CR is updated to Revision 2 by some other action
  4. ExecutionConsumer runs and fails due to the conflict

This typically doesn't happen because there is little time between 2 and 4, but in some scenarios, such as rolling upgrades or mass deletions, we see a flood of events such that some of the ExecutionConsumers have been queued with a stale CR.

It seems like a way to narrow this case even further would be to do a cache get that refreshes, or immediately rejects, the ExecutionScope at the start of the run (roughly as sketched below): https://github.com/java-operator-sdk/java-operator-sdk/blob/112a8268155a9f6262e1238b592e9813b09e1557/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/ExecutionConsumer.java#L25
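
This is only a sketch; the method names on the cache and the scope are illustrative, not necessarily the real SDK API.

```java
@Override
public void run() {
  // sketch of the guard before dispatching; getLatestResource and the scope
  // accessors are illustrative names
  Optional<CustomResource> latest =
      customResourceCache.getLatestResource(executionScope.getCustomResourceUid());
  if (latest.isPresent()
      && !latest.get().getMetadata().getResourceVersion()
          .equals(executionScope.getCustomResource().getMetadata().getResourceVersion())) {
    // the scope was built against a stale revision: either rebuild it around the
    // cached resource, or skip dispatch and let the pending event re-trigger execution
    return;
  }
  PostExecutionControl postExecutionControl = eventDispatcher.handleExecution(executionScope);
  eventHandler.eventProcessingFinished(executionScope, postExecutionControl);
}
```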

@csviri
Collaborator

csviri commented Apr 29, 2021

@shawkins, yes, this can naturally happen, and not necessarily between 2 and 4; it can actually happen any time the controller is executing. And usually the controller execution takes much longer than the time between 2 and 4 (by 4 I mean the point where it actually gets picked up by the executor thread).

So if this is the case, there is actually nothing to solve; this is expected. You are right that getting the latest custom resource from the cache at this point:
https://github.com/java-operator-sdk/java-operator-sdk/blob/112a8268155a9f6262e1238b592e9813b09e1557/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/ExecutionConsumer.java#L25

would be a little bit better in this respect, but it would also make the event list and the actual custom resource in the scope inconsistent. (Which is not great, not terrible, but I'm pretty sure it would raise some eyebrows if somebody were debugging it.)
Basically, this situation is one of the main reasons why we implemented automatic retries, since it can naturally happen.

@csviri
Collaborator

csviri commented Apr 29, 2021

And again, the situation where we retry because of optimistic locking on the status is more concerning, because there we do a retry that we could actually avoid while still staying consistent. Typically this is the situation where there is a change on the CR but not on .spec.

@shawkins
Collaborator Author

I'm fine with the info logging if further attempts at remediation would be more trouble than they are worth. So this can be closed.

because of optimistic locking on the status is more concerning.

From what I'm seeing, the fabric8 updateStatus call is implemented as a PUT against /status, and this seems to require a resourceVersion per the API server.

Fabric8 or you would need to change the status update to be a patch instead, but it seems like you're aware of the fabric8 patch issues: fabric8io/kubernetes-client#2166
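
For reference, the conceptual difference is that a JSON merge patch against /status carries only the status and no resourceVersion, so it cannot hit the optimistic-locking conflict the full PUT does. How fabric8 should expose that for custom resources is the open part; the field names below are placeholders.

```java
// what a status-only merge patch would carry: no resourceVersion, no spec
String statusMergePatch = "{\"status\":{\"ready\":true,\"observedGeneration\":3}}";
// sent as: PATCH .../namespaces/<ns>/<plural>/<name>/status
// with Content-Type: application/merge-patch+json
```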

csviri closed this as completed Apr 29, 2021
@metacosm
Collaborator

Do you think we should re-open the fabric8 issue to support PATCH?

@csviri
Collaborator

csviri commented Apr 29, 2021

@metacosm, that issue was mainly because of the finalizers: since adding a finalizer is always an "additive" operation, we could do it with a patch request ("just add this finalizer to the finalizer list") without optimistic locking. But to be honest, I'm also quite happy with the current solution; the patch approach would maybe be a little bit nicer and more straightforward.

A patch also increases the resourceVersion, so we have to think about that after the request too.

I guess we can create an issue for this if you think it's worth spending time on this part, but I'm also fine with the current approach.

@csviri
Collaborator

csviri commented Apr 29, 2021

I will create an issue about the subresource update; I guess the k8s API does not require optimistic locking there (will verify). Then we can discuss the pros and cons under that issue.

@shawkins
Collaborator Author

I guess the k8s API does not require optimistic locking there (will verify)

This seems to require a resourceVersion per the API server. I've filed fabric8io/kubernetes-client#3066 for fabric8 to consider this.
