-
Notifications
You must be signed in to change notification settings - Fork 218
AbstractDependentResource should be able to pass his turn #1105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I wanted to have an |
How can it happen that the secret is not created (I mean other than an error state?). The secret is a managed dependent resource that is created before the schema. Managed dependent resources are created in order. It's true that lately the behavior changed, that we reconcile all the resources event if the previous ones failed. This makes issues when the first operation fails, but only if the resources are not independent. In this case a fail fast approach would be better in the current scope. So not reconcile resources if one fails before. |
The |
This is the issue, right here, I think :) |
Again, what are the reason that are not satisfied yet in your case? @scrocquesel |
Maybe with a nice description, definition of the problem in general.. |
TBH I would revert this into a fail fast behavior, that would solve this problem, and would be much more easier to follow what happening from the logs. |
The problem is as explained in the description of the issue: people want to be able to make decisions based on the reconcile result of dependents and right now, the reconcile result doesn't cover all the possibilities. This should be part of a more general approach, of course, but that's the core of the issue. It might just be that the dependent is not yet ready, for example. It shouldn't be an error that triggers a retry because this might cause retries to get exhausted before your dependent ever gets ready, so managing this situation via exceptions is not correct, imo. |
|
There is no decision to make in this case. It is basically should be just retries if there is an issue with secret creation. Can you describe actual use case when people want to make decision bases on error? and how would that look like? So in this case the actual thing we want to do just retry if the secret creation fails, don't go further. |
Note that all this discussion is out of scope for the next release but it's still interesting to talk about this problem now. |
The problem is that the sample currently throws an exception when the secret is not present in the context. When it's not present, it's usually not because we couldn't create it but rather because the associated schema is not ready. Handling this use case via error handling is wrong because it's not an error condition and that might lead to retries being exhausted despite the fact that nothing is actually wrong with the system. To be clear, I'm not saying that this should be fixed for the current scope but throwing an exception in this situation is not the proper way to deal with this because this is not an error, just part of the reconciliation process. Note that this problem wouldn't be solved with an |
Will take a look, why it might happen, but if the secret if created and it's not in a context is simply a bug withing the current implementation. Will take a look on this because this actually should not happen, since we putting the resource into the event source if we created it) However @scrocquesel clearly saying in the issue:
|
@scrocquesel could you please describe more in detail the actual situation what is happening regarding the secret, so it is really not created at that point, or how is this happening? |
Relates to #1107 |
Actually, if SecretDependentResource create the secret in the reconcile, it doesn't not update its cache. So when in the same loop, the This happens more in the equivalent Quarkus test because Quarkus wait for the database to be ready. In the java one, It may be a bug or something to improve but still, it shows up that there is time when a dependent is not able to participate in the reconcile loop. To some extent, this is certainly related to what workflow will address in the future. Meanwhile, it is not clear how it can be addressed and the NPE when logging a null desired returned value should be fixed somehow. There is also an another use case which can be related. Think of a flag in a primary resource to create or delete a secondary resource (or activate/deactivate a feature somewhere). It means that on update, you should be able to handle a null desired state (which is not possible due to NPE in logForOperation. Then, |
Thank you for response @scrocquesel !
This sounds like a bug, I will take a look today. (Although we have a Integration test that never failed )
In this case the reconcile should not be called. See: https://github.com/java-operator-sdk/java-operator-sdk/blob/4301fe3e2b1e2d49753e4876854f8084680f1234/sample-operators/webpage/src/main/java/io/javaoperatorsdk/operator/sample/WebPageStandaloneDependentsReconciler.java#L68-L72 There is a separate delete for this. It should be used like this IMO. Managed dependent resources simply don't support this use case now.
My opinion is the same as mentioned before. If its not "not able to participate in the reconcile loop" then the reconcile should not be called. So let's say we fix the bug that the resources is properly available in the context. We have 3 options:
But we should not even really solve the case if desired state is null/optional - this seems to me a hack. That problem should be done on different level. And basically we should not call We can take a look on the current managed dependent resources as a simple linear workflow, and IMO if the workflow breaks because of an exception we should simply not proceed, just retry the reconciliation (we some delay). This will be the logic probably in more complex workflows too. |
Just to make sure @scrocquesel are you sure you are using the up to date version from the main? It would be great if you could provide some logs. This does not seems to fail on our github actions runs at all. |
I think that this and other use cases should be handled by more complete |
Yes, there IS a draft PR showing this issue.
|
I checked the code and there is an integration test that tests the creation / access, see: I wonder if this can be false negative. But so this never happened that it failed. |
I think found the issue, at least one issue :) Lately the managed dependent were resources changed to have names, so it is stored in a HashMap. But that does not guarantee any order. So it is really works only if the resource are independent. Porbably it is just accidentally right in this sample. IMO MySQLSchema sample should be migrated to standalone mode, since does not really fits therefore to this use case. (Or we should implement some ordering guarantees at least. No strong opinion on this.) cc @metacosm |
I think it's better to implement ordering for now since that's how managed dependent resources were supposed to work initially (with the understanding that this ordering is pretty weak and that we would probably implement stronger ordering in future releases). |
That's right, I actually thought the order of declaration in annotation was used to call the dependent and the hashmap doesn't guarantee it anymore. Nevertheless, it happens for me that for mysql-schema the order of call is respected for all test runs. What is important in your comment is that actually if two dependents are interdependent (one require the other for eg), then we should better use standalone mode for now. I agree the sample actually is misleading as I thought, it should have some very simple logic right now to handle a simple chain logic, even if it implies more code in the dependent resource. |
|
@scrocquesel thank you!! We found a potentially related bug, will take a look on this withing that issue asap! |
This is very strange since it cleary says:
What means when Secret is created and it wants to put it into the temp cache, it does not since informer already knows about it. See: One thing came to my mind is that the secret could be deleted? Probably not. But then on retry is seems to be succeeding:
Nevertheless we should retest this after: #1129 |
@scrocquesel we merged the mentioned issue, could you pls retest this? |
@csviri It's seems the error is still in JOSDK 3.0.0.RC3 but the log changed. here with the quarkus flavor:
And what should interest you is that later on, when retrying, it actually tries to get it from the informer.
I was able to remove the orElseThrow in the reconciler (quarkiverse/quarkus-operator-sdk#331) but I don't remember if it tested without it. |
thx @scrocquesel we will take a look later, there is now a small update regarding the mysql sample, we will test that also. This might be caused by this line (what is bad implementation): |
@metacosm @scrocquesel could we retest this with RC4 ? now the issue from my previous comment is addressed there. |
Still the same log result. Could there be a race condition between the Informer thread ( |
That is indeed possible, will take a look on it ASAP. |
@scrocquesel I created an integration test for this, try to reproduce it: One thing I noticed that you use |
@csviri You can see the problem in action at https://github.com/quarkiverse/quarkus-operator-sdk/runs/6357781975?check_suite_focus=true#step:5:2781, I haven't looked in details yet but basically the reconciler doesn't really like when users are not throwing exceptions when the dependent is not ready :) |
will check thx, but I was not able to reproduce this with minikube + core framework. So suspecting this is an issue with mockWebServer maybe. But will reproduce/debug it with quarkus + mockWebserver. |
Checked it, and after some time of debugging, it turns out the resources are reconciled in wrong order (the problem is in quarkus extension), just not trivial to get it from logs. @metacosm will prepare the fix. |
The fix is in in the Quarkus extension (with a quite neat short commit hash!): quarkiverse/quarkus-operator-sdk@cccccec 😁 |
Cool, will add the assertion to the test soon. Thanks for debunking this. |
Actually, when the secret is not yet created, the mysql schema sample is throwing some errors in SchemaDependentResource due to
orElseThrow
.https://github.com/java-operator-sdk/java-operator-sdk/blob/2eac098db872cedcefecb6036ec98c77b58cf60f/sample-operators/mysql-schema/src/main/java/io/javaoperatorsdk/operator/sample/dependent/SchemaDependentResource.java#L56
The same apply with the reconciler
https://github.com/java-operator-sdk/java-operator-sdk/blob/2eac098db872cedcefecb6036ec98c77b58cf60f/sample-operators/mysql-schema/src/main/java/io/javaoperatorsdk/operator/sample/MySQLSchemaReconciler.java#L39
I'm trying to avoid these exceptions because there are not really exceptions, but just the fact that some requirement are not yet satisfied. So I tried,
But
AbstractDependentResource.logForOperation
produces an NPE and even if it doesn't, I think thatcreate
should not be call.It is not clear what to do in such a case. I don't feel like exception are the way to go because it raise a call to
updateErrorStatus
when it should definitely not. Should desired return anOptional<R>
?The text was updated successfully, but these errors were encountered: