
Status updates occurring close together result in an exception #409


Closed
shawkins opened this issue Apr 25, 2021 · 20 comments


@shawkins
Collaborator

We have a custom resource with several operands. Our controller returns UpdateControl.updateStatusSubResource on every call to ResourceController.createOrUpdateResource. When several operand updates are processed close together, we sometimes see:

ERROR [io.jav.ope.pro.EventDispatcher] Error during event processing ExecutionScope
...
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT
...
the object has been modified; please apply your changes to the latest version and try again

I need to run in debug mode to check whether the operator SDK is retrying this. If it is retrying, can this be logged at debug level? If it's not, is this something that the operator SDK or fabric8 should be handling?
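
For context, the controller side is shaped roughly like the sketch below. This is against the 1.x ResourceController API as we use it; ManagedService and its status field are placeholder names rather than our actual types.

```java
import io.javaoperatorsdk.operator.api.Context;
import io.javaoperatorsdk.operator.api.Controller;
import io.javaoperatorsdk.operator.api.DeleteControl;
import io.javaoperatorsdk.operator.api.ResourceController;
import io.javaoperatorsdk.operator.api.UpdateControl;

@Controller
public class ManagedServiceController implements ResourceController<ManagedService> {

  @Override
  public UpdateControl<ManagedService> createOrUpdateResource(
      ManagedService resource, Context<ManagedService> context) {
    // reconcile the operands (Deployments, Secrets, ConfigMaps, ...) here,
    // then reflect their observed state on the status
    resource.getStatus().setReady(true); // placeholder status change

    // the status subresource is written back on every call, which is where
    // the PUT against /status can hit the conflict shown above
    return UpdateControl.updateStatusSubResource(resource);
  }

  @Override
  public DeleteControl deleteResource(ManagedService resource, Context<ManagedService> context) {
    // nothing special for this sketch
    return DeleteControl.DEFAULT_DELETE;
  }
}
```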

@shawkins
Collaborator Author

In debug it does appear that the retry handling is kicking in: the logic looks for any new events and runs the update again. So this does appear to be benign. @metacosm, are you open to a change that would lower the amount of logging related to this exception?

@metacosm
Collaborator

Yep.

shawkins added a commit to shawkins/java-operator-sdk that referenced this issue Apr 27, 2021
@shawkins
Collaborator Author

I've submitted a PR to change this to an info log without a stack trace.

shawkins added a commit to shawkins/java-operator-sdk that referenced this issue Apr 27, 2021
this addressed operator-framework#409 where the log contains verbose error messages
related to status update conflicts
metacosm added a commit that referenced this issue Apr 28, 2021
fix for #409 adjusting logging around kubernetes exceptions
@metacosm
Collaborator

Can we close this issue then? Or are there still things that need to be investigated?

@shawkins
Collaborator Author

shawkins commented Apr 28, 2021

No, it seems that this is how it's expected to work. The only alternative/addition would be to add locking per resource passed to createOrUpdate, so that its processing could be serialized; at least for the single-operator case that would typically eliminate these conflicts.

@metacosm
Collaborator

What do you think, @csviri?

@csviri
Collaborator

csviri commented Apr 28, 2021

Hi @metacosm and @shawkins
I'm not sure I understand the situation correctly: processing is already serialized within the operator for a given custom resource. If I understand correctly, what happens is that another process (or processes) changes the CR frequently, which leads to an optimistic locking error.

By locking I guess you mean pessimistic locking; I'm not aware that Kubernetes supports such a thing. What can be done, however, is to not use optimistic locking for updates to the status subresource, since the controller is the only process that should change the status anyway.
See:
https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/EventDispatcher.java#L203-L207

I think this needs to be changed in the fabric8 client. @metacosm

@shawkins
Collaborator Author

If I understand correctly, what happens is that another process (or processes) changes the CR frequently, which leads to an optimistic locking error.

The CR is not changing frequently. It does, however, have quite a few operands, which are event sources. My test to reproduce this creates the CRs, sleeps for a bit, then makes an update to the CRs that triggers the deletion of the operands. This is just a single operator, and a single process creating and then modifying the spec of the CRs.

From what I can now gather from the code and this discussion, you do expect the updates against a single CR to effectively be serialized, so that the cache update happens in time for the next batch of events. So the question becomes: why would there be a conflict in this scenario? I see comments and checks in the caching logic https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/DefaultEventHandler.java#L211 indicating that it's expected for things to get out of sync, but presumably only when there are more actors.

@csviri
Collaborator

csviri commented Apr 28, 2021

@shawkins, just to clarify one thing: what do you mean by operand? Is it a resource that is created by the controller and watched by an event source?

Regarding this comment:
https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/DefaultEventHandler.java#L211

This is an edge case with event sources: as described, if we do an update and then execute the controller again because there were already events from an event source other than the custom resource event source, the cached custom resource could be out of date (since we only cache it when a change arrives from the watch). But this code should solve that problem in all cases. So if by actor you mean an event source, then yes, this situation only applies when there are additional event sources.

@shawkins
Collaborator Author

just to clarify one thing: what do you mean by operand? Is it a resource that is created by the controller and watched by an event source?

That is correct. The CR our operator manages creates another custom resource, ConfigMaps, Secrets, Deployments, etc. We watch those resources via an informer and register them as event sources.
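
Roughly, the wiring looks like the sketch below. Treat it as an illustration rather than our actual ResourceEventSource: the informer factory signature varies between fabric8 versions, and ownerUidOf / forwardEventFor stand in for the code that maps an operand back to its owning CR and hands an event to the SDK.

```java
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentList;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class OperandWatcher {

  void watchDeployments(KubernetesClient client) {
    // one informer per operand type (Deployments, Secrets, ConfigMaps, ...)
    SharedIndexInformer<Deployment> informer =
        client.informers().sharedIndexInformerFor(Deployment.class, DeploymentList.class, 60_000L);

    informer.addEventHandler(new ResourceEventHandler<Deployment>() {
      @Override
      public void onAdd(Deployment d) { forwardEventFor(ownerUidOf(d)); }

      @Override
      public void onUpdate(Deployment oldD, Deployment newD) { forwardEventFor(ownerUidOf(newD)); }

      @Override
      public void onDelete(Deployment d, boolean unknownFinalState) { forwardEventFor(ownerUidOf(d)); }
    });

    client.informers().startAllRegisteredInformers();
  }

  private String ownerUidOf(Deployment d) {
    // placeholder: resolve the owning custom resource, e.g. via ownerReferences or labels
    return d.getMetadata().getOwnerReferences().get(0).getUid();
  }

  private void forwardEventFor(String customResourceUid) {
    // placeholder: hand an event for that CR to the SDK's event handler so the
    // controller runs again; this is what ultimately triggers the status update
  }
}
```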

For a given resource R we see logs similar to:

2021-04-24 21:43:12,309 DEBUG [org.bf2.ope.eve.ResourceEventSource] Update event received for io.fabric8.kubernetes.api.model.apps.Deployment x for resource R
...
2021-04-24 21:43:12,433 ERROR [io.jav.ope.pro.EventDispatcher] Error during event processing ExecutionScope "with different operand events for R than the previous"

@adam-sandor
Collaborator

I checked in on this, and I agree that you can't assume other processes won't be changing the status. It's just how Kubernetes works. So optimistic locking with just an info log is the right solution, IMHO.

@csviri
Collaborator

csviri commented Apr 29, 2021

It's not restricted in any way; it's just meant that only the controller updates the status. But I haven't seen this stated anywhere explicitly (maybe I'll ask around on the K8s Slack just out of curiosity). If you have a nice reference about this, please share it, @adam-sandor.

The thing is that optimistic locking basically won't help in any scenario here, since the status is overridden by the controller on the next run anyway. So while it makes sense to use optimistic locking on the spec, it does not make sense for the status. (Note that status is a subresource; overriding it does not affect other parts of the resource.) In other words, the lost-update problem does not apply to the status subresource.

I think this is quite a detail; it's more interesting from the philosophical aspect of k8s. The positive effect of not failing on this, on the other hand, is that no unnecessary retry happens:

Let's say that after a controller run we want to change the status, but just before that somebody adds an annotation to the CR or even changes the spec. Now we get an exception (a conflict because of optimistic locking), which results in exceptions in the logs, and the retry process kicks in.
Whereas if we just updated the status (without optimistic locking) to reflect the actual world we just reconciled (see the sketch after this list), then either:

  1. If there was no spec change (just metadata, for example), there will be no reconciliation at all (whereas now there will be). This is what we want.
  2. If the spec has changed in the meantime, the controller reconciles and we change the status anyway. This is also what we want.
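
To make that concrete, a status write without the optimistic-locking precondition would look roughly like the following. MyResource, namespace, name and reconciledStatus are placeholders, resourceClient stands for however the typed client for the CR was obtained, and whether the API server actually accepts a PUT to /status without a resourceVersion is a separate question (it comes up again below).

```java
// illustration only: drop the optimistic-locking precondition before writing the status
MyResource latest = resourceClient.inNamespace(namespace).withName(name).get();
latest.setStatus(reconciledStatus);
// with no resourceVersion in the request body the API server has no version to
// compare against, so a concurrent metadata/spec change cannot produce a 409 here
latest.getMetadata().setResourceVersion(null);
resourceClient.inNamespace(namespace).withName(name).updateStatus(latest);
```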

@shawkins
Collaborator Author

shawkins commented Apr 29, 2021

As best I can tell, we are observing this sequence:

  1. Custom Resource CR Revision 1 is in cache
  2. ExecutionConsumer to update Custom Resource CR Revision 1 is queued
  3. CR is updated to Revision 2 by some other action
  4. ExecutionConsumer runs and fails due to the conflict

This typically doesn't happen because there is little time between 2 and 4, but in some scenarios, such as rolling upgrades or mass deletions, we see a flood of events such that some of the ExecutionConsumers have been queued with a stale CR.

It seems like a way to narrow this case even further would be to do a cache get that refreshes, or immediately rejects, the ExecutionScope at the start of the run (roughly as sketched below): https://github.com/java-operator-sdk/java-operator-sdk/blob/112a8268155a9f6262e1238b592e9813b09e1557/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/ExecutionConsumer.java#L25
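
This is only a sketch; the method names on the cache and the scope are illustrative, not necessarily the real SDK API.

```java
@Override
public void run() {
  // sketch of the guard before dispatching; getLatestResource and the scope
  // accessors are illustrative names
  Optional<CustomResource> latest =
      customResourceCache.getLatestResource(executionScope.getCustomResourceUid());
  if (latest.isPresent()
      && !latest.get().getMetadata().getResourceVersion()
          .equals(executionScope.getCustomResource().getMetadata().getResourceVersion())) {
    // the scope was built against a stale revision: either rebuild it around the
    // cached resource, or skip dispatch and let the pending event re-trigger execution
    return;
  }
  PostExecutionControl postExecutionControl = eventDispatcher.handleExecution(executionScope);
  eventHandler.eventProcessingFinished(executionScope, postExecutionControl);
}
```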

@csviri
Collaborator

csviri commented Apr 29, 2021

@shawkins, yes, this can naturally happen, and not necessarily between 2 and 4; it can actually happen any time the controller is executing. And usually the controller execution takes much longer than the time between 2 and 4 (by 4 I mean the point where it actually gets picked up by the executor thread).

So if this is the case, there is actually nothing to solve; this is expected. You are right that getting the latest custom resource from the cache at this point:
https://github.com/java-operator-sdk/java-operator-sdk/blob/112a8268155a9f6262e1238b592e9813b09e1557/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/ExecutionConsumer.java#L25

would be a little bit better in this respect, but it would also make the event list and the actual custom resource in the scope inconsistent. (Which is not great, not terrible, but I'm pretty sure it would raise some eyebrows if somebody were debugging it.)
Basically, this situation is one of the main reasons why we implemented automatic retries, since it can naturally happen.

@csviri
Collaborator

csviri commented Apr 29, 2021

And again, the situation where we retry because of optimistic locking on the status is more concerning, because there we do a retry that we could actually avoid while still staying consistent. Typically this is the situation where there is a change on the CR but not on .spec.

@shawkins
Collaborator Author

I'm fine with the info logging if further attempts at remediation would be more trouble than they are worth. So this can be closed.

because of optimistic locking on the status is more concerning.

From what I'm seeing, the fabric8 updateStatus call is implemented as a PUT against /status, and this seems to require a resourceVersion per the API server.

Fabric8 or you would need to change the status update to be a patch instead, but it seems like you're aware of the fabric8 patch issues: fabric8io/kubernetes-client#2166
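
For reference, the conceptual difference is that a JSON merge patch against /status carries only the status and no resourceVersion, so it cannot hit the optimistic-locking conflict the full PUT does. How fabric8 should expose that for custom resources is the open part; the field names below are placeholders.

```java
// what a status-only merge patch would carry: no resourceVersion, no spec
String statusMergePatch = "{\"status\":{\"ready\":true,\"observedGeneration\":3}}";
// sent as: PATCH .../namespaces/<ns>/<plural>/<name>/status
// with Content-Type: application/merge-patch+json
```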

csviri closed this as completed Apr 29, 2021
@metacosm
Collaborator

Do you think we should re-open the fabric8 issue to support PATCH?

@csviri
Collaborator

csviri commented Apr 29, 2021

@metacosm, that issue was mainly because of the finalizers: since adding a finalizer is always an "additive" operation, we could do it with a patch request ("just add this finalizer to the finalizer list") without optimistic locking. But to be honest, I'm also quite happy with the current solution; the patch approach would maybe be a little bit nicer and more straightforward.

A patch also increases the resourceVersion, so we have to think about that after the request too.

I guess we can create an issue for this if you think it's worth spending time on this part, but I'm also fine with the current approach.

@csviri
Collaborator

csviri commented Apr 29, 2021

I will create an issue about the subresource update; I guess the k8s API does not require optimistic locking there (will verify). Then we can discuss the pros and cons under that issue.

@shawkins
Collaborator Author

I guess the k8s API does not require optimistic locking there (will verify)

This seems to require a resourceVersion per the API server. I've filed fabric8io/kubernetes-client#3066 for fabric8 to consider this.
