Status updates occurring close together result in an exception #409
It does appear in debug that the retry handling is kicking in - the logic will look for any new events, and run the update again. So this does appear to be benign. @metacosm are you open to a change that would lower the amount of logging related to this exception? |
Yep. |
I've submitted a PR to change this to an info log without a stack trace. |
this addressed operator-framework#409 where the log contains verbose error messages related to status update conflicts
fix for #409 adjusting logging around kubernetes exceptions
Can we close this issue then? Or are there still things that need to be investigated? |
No, it seems that this is how it's expected to work. The only alternative/addition would be to add locking per resource being passed to createOrUpdate such that you could serialize its processing - at least for the single-operator case that would typically eliminate these conflicts. |
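A minimal sketch of the per-resource serialization idea mentioned above; the helper class and its names are hypothetical and not part of the SDK:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper: serializes reconciliation per custom resource UID so that
// two createOrUpdateResource calls for the same resource never overlap.
final class PerResourceSerializer {

  private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

  <T> T withLock(String resourceUid, Supplier<T> reconcile) {
    ReentrantLock lock = locks.computeIfAbsent(resourceUid, uid -> new ReentrantLock());
    lock.lock();
    try {
      return reconcile.get();
    } finally {
      lock.unlock();
    }
  }
}
```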
What do you think, @csviri? |
Hi @metacosm and @shawkins. By locking I guess you mean pessimistic locking; I'm not aware that Kubernetes supports such a thing. What can be done, however, is to not use optimistic locking for updates on the status subresource, since the controller is anyway the only process that should change the status. I think this needs to be changed in the fabric8 client. @metacosm |
The cr is not changing frequently. It does, however, have quite a few operands, which are event sources. My test to reproduce this will create the crs, sleep for a bit, then make an update to the crs that triggers the deletion of the operands. This is just a single operator, and a single process creating and then modifying the spec of the crs. From what I can now gather in the code and here, you do expect the updates against a single cr to effectively be serialized so that the cache update happens in time for the next batch of events. So the question becomes: why would there be a conflict in this scenario? I see comments and checks in the caching logic https://github.com/java-operator-sdk/java-operator-sdk/blob/b4c6483ab3c4ddffcdf030681d03c4f3258556f8/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/DefaultEventHandler.java#L211 that it's expected for things to get out of sync - but presumably only when there are more actors. |
@shawkins just to clarify one thing: what do you mean by operand? Is it a resource that is created by the controller and watched by an event source? Regarding this comment: this is an edge case with event sources where, as described, if we do an update and execute the controller again because there were already events from an event source other than the custom resource event source, the cached custom resource could be out of date (since we cache it when a change arrives from a watch). But this code should solve the problem in all cases. So if by actor you mean an event source, then yes, this situation only applies when there are additional event sources. |
That is correct. Our cr that the operator is managing creates another custom resource, configmaps, secrets, deployments, etc. We watch those resources via an informer and register them as event sources. For a given resource R we see logs similar to: 2021-04-24 21:43:12,309 DEBUG [org.bf2.ope.eve.ResourceEventSource] Update event received for io.fabric8.kubernetes.api.model.apps.Deployment x for resource R |
I checked in on this, and I agree with the opinion that you can't expect that other processes won't be changing the status. It's just how Kubernetes works. So optimistic locking with just an info log is the right solution imho. |
It's not restricted in any way; it's just meant that only the controller updates the status. But I did not see this stated explicitly anywhere (maybe I will do a round on the K8s Slack about this, just out of curiosity). But if you have some nice reference about this, please share @adam-sandor. The thing is that optimistic locking basically won't help in any scenario here, since the status is anyway overridden by the controller on the next run. So while it makes sense to do optimistic locking on specs, it does not make sense for status. (Note that status is a subresource; overriding it does not affect other parts of the resource.) In other words, the lost-update problem does not apply to the status subresource. I think this is quite a detail; it's more interesting from the philosophical aspect of k8s. The positive effect of not failing on this, on the other hand, would be that no unnecessary retry happens: let's say after a controller run we want to change the status, but just before that somebody adds an annotation to the CR or even changes the spec. Now we would end up with an exception (a conflict because of optimistic locking), which would result in exceptions in the logs and a retry process.
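To illustrate why the conflict is harmless but the retry is redundant for a controller-owned status: on each attempt the controller just re-reads the resource and reapplies its own status, so nothing is merged or lost. A rough sketch, where fetchLatest/applyStatus/pushStatus are hypothetical callbacks standing in for whatever client calls the operator makes; only KubernetesClientException and its 409 code are real fabric8 API:

```java
import io.fabric8.kubernetes.client.KubernetesClientException;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Sketch of the conflict/retry cycle discussed above (names are illustrative).
final class StatusUpdateRetry {

  static <T> T updateStatusWithRetry(Supplier<T> fetchLatest,
                                     UnaryOperator<T> applyStatus,
                                     UnaryOperator<T> pushStatus,
                                     int maxAttempts) {
    for (int attempt = 1; ; attempt++) {
      T latest = fetchLatest.get(); // re-read to pick up the new resourceVersion
      try {
        // the controller simply reapplies the status it owns and pushes it
        return pushStatus.apply(applyStatus.apply(latest));
      } catch (KubernetesClientException e) {
        if (e.getCode() != 409 || attempt >= maxAttempts) {
          throw e; // only optimistic-locking conflicts are worth retrying
        }
      }
    }
  }
}
```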
|
As best I can tell, we are observing this with
This typically doesn't happen because there is little time between 2 and 4, but in some scenarios, such as rolling upgrades or mass deletions, we see a flood of events such that some of the ExecutionConsumers have been queued with a stale cr. It seems like a way to narrow this case even further would be to do a cache get to refresh, or immediately reject, the ExecutionScope at the start of the run https://github.com/java-operator-sdk/java-operator-sdk/blob/112a8268155a9f6262e1238b592e9813b09e1557/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/processing/ExecutionConsumer.java#L25 |
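A sketch of the "refresh or reject" check suggested above, based on comparing resourceVersions; StaleScopeGuard and its parameter names are made up for illustration, only HasMetadata is real fabric8 API:

```java
import io.fabric8.kubernetes.api.model.HasMetadata;

// Hypothetical guard for the start of an execution run: cachedLatest would come
// from the SDK's resource cache, scopeResource is the copy the execution scope
// was queued with. If the versions differ, the scope is stale and could either be
// refreshed from the cache or dropped in favour of the newer event.
final class StaleScopeGuard {

  static boolean isStale(HasMetadata cachedLatest, HasMetadata scopeResource) {
    if (cachedLatest == null || scopeResource == null) {
      return false; // nothing to compare against; let the run proceed
    }
    String cachedVersion = cachedLatest.getMetadata().getResourceVersion();
    String scopeVersion = scopeResource.getMetadata().getResourceVersion();
    return cachedVersion != null && !cachedVersion.equals(scopeVersion);
  }
}
```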
@shawkins, yes, this can naturally happen; it's not necessarily between 2 and 4, but can actually happen anytime the controller is getting executed. And usually the controller execution takes much longer than the time between 2 and 4 (by 4 I mean when it's actually picked up by the executor thread). So if this is the case, there is actually nothing to solve - this is expected. You are right that getting the latest custom resource from the cache at this point would be a little bit better from this aspect, but it would also make the event list and the actual custom resource in the scope inconsistent. (Which is not great, not terrible, but I'm pretty sure it would raise some eyebrows if somebody were debugging it.) |
And again, the situation where we do retry because of optimistic locking on status is more concerning, because there we do retry while we could actually avoid it and still be consistent. Typically in the situation when there is a change on the CR but not on
I'm fine with the info logging if further attempts at remediation would be more trouble than they are worth. So this can be closed.
From what I'm seeing, the Fabric8 updateStatus call is implemented with a PUT against /status - this seems to require a resource version per the API server. Fabric8 or you would need to change the status update to be a patch instead - but it seems like you're aware of fabric8 patch issues fabric8io/kubernetes-client#2166 |
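For illustration, the kind of merge-patch body a PATCH against /status could carry, built here with Jackson; the "phase" field is invented for the example. Unlike a full PUT, such a patch names only the fields it touches and does not have to carry metadata.resourceVersion:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Builds {"status":{"phase":"Ready"}}, which would be sent as
// application/merge-patch+json to the resource's /status endpoint.
public class StatusPatchExample {

  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    ObjectNode patch = mapper.createObjectNode();
    patch.putObject("status").put("phase", "Ready");
    System.out.println(mapper.writeValueAsString(patch));
  }
}
```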
Do you think we should re-open the fabric8 issue to support PATCH? |
@metacosm that issue was mainly because of the finalizers; since adding a finalizer is always an "additive" operation, we could do it with a patch request ("just add this finalizer to the finalizer list") without optimistic locking! But to be honest, I'm also quite happy with the current solution; maybe the patch would be a little bit nicer and more straightforward. The patch also increases the resource version, so we have to think about that after the request, too. I guess we can create an issue on this if you think it's worth spending time on this part, but I'm fine with the current approach as well. |
I will create an issue about the sub-resource update; I guess the k8s API does not require optimistic locking there (will verify). So we can discuss the pros and cons regarding that under the issue. |
This seems to require a resource version per the API server. I've captured fabric8io/kubernetes-client#3066 for fabric8 to consider this. |
We have a custom resource with several operands. The current code will return UpdateControl.updateStatusSubResource on every call to ResourceController.createOrUpdateResource. When several operand updates are processed close together we sometimes see:
I need to run more in debug to check whether the operator SDK is retrying this - if it is retrying, can this be logged at debug? If it's not, is this something that the operator or fabric8 should be handling?
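For reference, a rough sketch of the controller shape described in this report, assuming the v1-era ResourceController API of the time; MyResource and its status handling are placeholders:

```java
import io.javaoperatorsdk.operator.api.Context;
import io.javaoperatorsdk.operator.api.Controller;
import io.javaoperatorsdk.operator.api.DeleteControl;
import io.javaoperatorsdk.operator.api.ResourceController;
import io.javaoperatorsdk.operator.api.UpdateControl;

// MyResource is a placeholder custom resource type with several operands.
@Controller
public class MyResourceController implements ResourceController<MyResource> {

  @Override
  public UpdateControl<MyResource> createOrUpdateResource(MyResource resource,
                                                          Context<MyResource> context) {
    // ... reconcile the operands (deployments, secrets, config maps, ...) and
    // record the outcome on resource.getStatus() ...
    return UpdateControl.updateStatusSubResource(resource);
  }

  @Override
  public DeleteControl deleteResource(MyResource resource, Context<MyResource> context) {
    return DeleteControl.DEFAULT_DELETE;
  }
}
```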