Error fallback on router for faulty connections #298
Gerold103 added a commit that referenced this issue on Dec 4, 2021

RO requests use the replica with the highest prio as specified in the weights matrix. If the best replica is not available now and failover hasn't happened yet, then RO requests used to fall back to the master, even if there were other RO replicas with better prio.

This patch makes an RO call first try the currently selected most prio replica. If it is not available (no other connected replicas at all, or failover hasn't happened yet), the call will try to walk the prio list starting from this replica until it finds an available one. If that also fails, the call will try to walk the list from the beginning, hoping that the unavailable replica wasn't the best one and there might be a better option on the other side of the prio list.

The patch was done in the scope of the task about replica backoff (#298), because the problem would additionally exist when the best replica is in backoff, not only disconnected. It would get worse.

Closes #288
Gerold103 added a commit that referenced this issue on Dec 4, 2021

RO requests use the replica with the highest prio as specified in the weights matrix. If the best replica is not available now and failover hasn't happened yet, then RO requests used to fall back to the master, even if there were other RO replicas with better prio.

This patch makes an RO call first try the currently selected most prio replica. If it is not available (no other connected replicas at all, or failover hasn't happened yet), the call will try to walk the prio list starting from this replica until it finds an available one. If that also fails, the call will try to walk the list from the beginning, hoping that the unavailable replica wasn't the best one and there might be a better option on the other side of the prio list.

The patch was done in the scope of the task about replica backoff (#298), because the problem would additionally exist when the best replica is in backoff, not only disconnected. It would get worse.

Closes #288
Needed for #298
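The walk order can be illustrated with a minimal Lua sketch; all names here are assumptions, not vshard internals:

```Lua
-- A minimal sketch of the walk described above, with assumed names:
-- 'replicas' is an array ordered from the highest to the lowest priority,
-- 'start' is the index of the currently preferred replica, and
-- is_available() is a hypothetical availability check.
local function find_ro_replica(replicas, start)
    -- First walk from the preferred replica towards lower priorities.
    for i = start, #replicas do
        if replicas[i]:is_available() then
            return replicas[i]
        end
    end
    -- Nothing found there - maybe a higher-priority replica recovered in
    -- the meantime, so retry from the top of the list.
    for i = 1, start - 1 do
        if replicas[i]:is_available() then
            return replicas[i]
        end
    end
    return nil
end
```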
Gerold103 added a commit that referenced this issue on Dec 16, 2021

Storage configuration takes time. Firstly, box.cfg{}, which can be called before vshard.storage.cfg(). Secondly, vshard.storage.cfg() is not immediate either.

During that time accessing the storage is not safe. Attempts to call vshard.storage functions can return weird errors, or the functions can even be not available yet - they need to be created in _func and get access rights in _priv before becoming public.

Routers used to forward errors like 'access denied' and 'no such function' to users as is, treating them as critical. Not only was it confusing for users, it could also make an entire replicaset unavailable for requests - the connection to it is alive, so the router would send all requests to it and they all would fail, even if the replicaset has another instance which is perfectly functional.

This patch handles such specific errors inside of the router. The faulty replicas are put into a 'backoff' state. They remain in it for some fixed time (5 seconds for now), and new requests won't be sent to them until the time passes. The router will use other instances.

Backoff is activated only for vshard.* functions. If the errors are about some user's function, it is considered a regular error, because the router can't tell whether any side effects were done on the remote instance before the error happened, and hence can't retry on another node. For example, if access was denied to 'vshard.storage.call', then it is backoff. If inside of vshard.storage.call the access was denied to 'user_test_func', then it is not backoff.

It all works for read-only requests exclusively, of course, because for read-write requests the instance is just one - the master. The router does not have other options, so backoff here wouldn't help.

Part of #298
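The decision about which errors trigger backoff could look roughly like the sketch below; the field names and the box.error constants used here are assumptions for illustration, not vshard's actual code:

```Lua
local fiber = require('fiber')

local BACKOFF_INTERVAL = 5 -- Seconds, as mentioned in the message above.

-- Only 'access denied' / 'no such function' errors about vshard.* functions
-- are a reason for backoff: they mean the storage is not configured yet.
local function error_is_backoff(err)
    if err.code ~= box.error.ACCESS_DENIED and
       err.code ~= box.error.NO_SUCH_PROC then
        return false
    end
    -- Errors about the user's own functions are treated as regular errors.
    return err.message ~= nil and err.message:find('vshard%.') ~= nil
end

local function on_request_error(replica, err)
    if error_is_backoff(err) then
        -- Do not use this replica again until the deadline passes.
        replica.backoff_deadline = fiber.clock() + BACKOFF_INTERVAL
    end
end
```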
Gerold103 added a commit that referenced this issue on Dec 16, 2021

While vshard.storage.cfg() is not done, accessing vshard functions is not safe. It will fail with low level errors like 'access denied' or 'no such function'. However, there can be even worse cases. The user can have universe access rights, and vshard can already be in the global namespace after require(), so the vshard.storage functions are already visible.

The previous patch fixed only the case when function access was restricted properly, and even that just partially. New problems are:

- box.cfg{} is already called, but the instance is still 'loading'. Then the data is not fully recovered yet. Accessing it is not safe from the data consistency perspective.
- vshard.storage.cfg() is not started, or is not finished yet. In the end it might be doing something on which the public functions depend.

This patch addresses these issues. Now all non-trivial vshard.storage functions are disabled until vshard.storage.cfg() is finished and the instance is fully recovered. They raise an error with a special code.

Returning it via the 'nil, err' pair wouldn't work. Firstly, some functions return a boolean value and are not documented as ever failing, so people would miss this new error. The second reason is that vshard.storage.call() needs to signal the remote caller that the storage is disabled and that this was detected before the user's function was called. If it were done via 'nil, err', then the user's function could emulate the storage being disabled. Or, even worse, it could make some changes and then get that error accidentally by remotely going to another storage which happens to be disabled. Hence it is not allowed - too easy to break something.

It was an option to change the vshard.storage.call() signature to return 'true, retvals...' when the user's function was called and 'false, err' when it wasn't, but that would break backward compatibility. Supporting it only for new routers does not seem possible.

Part of #298
Closes #123
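A rough sketch of the 'raise with a special code' approach, with hypothetical names; the error name follows the later patches in this thread, and the real vshard error machinery is richer than this:

```Lua
local storage_is_enabled = false

local function check_is_enabled()
    if not storage_is_enabled then
        -- Raised as a Lua exception with a dedicated name so that callers
        -- and routers can tell it apart from errors raised by user code.
        error({name = 'STORAGE_IS_DISABLED',
               message = 'storage is disabled: configuration is not finished'})
    end
end

-- Hypothetical wrapper applied to the non-trivial public functions.
local function make_public(func)
    return function(...)
        check_is_enabled()
        return func(...)
    end
end
```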
Gerold103 added a commit that referenced this issue on Dec 16, 2021

The patch introduces the functions vshard.storage.enable()/disable(). They allow to control manually whether the instance can accept requests. This solves the following problems which were not covered by the previous patches:

- Even if box.cfg() is done, the status is 'running', and vshard.storage.cfg() is finished, the user's application can still be not ready to accept requests. For instance, it needs to create more functions and users on top of vshard. Then it wants to disable public requests until all the preliminary work is done.
- After everything is enabled, fine, and dandy, the instance might still want to disable itself in case of an emergency, such as its config getting broken or too outdated, desynced with a centric storage.

vshard.storage.enable()/disable() can be called any time - before, during, and after vshard.storage.cfg() - to solve these issues.

Part of #298
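The emergency use case could look like the following; the config check is a hypothetical application function, and only vshard.storage.enable()/disable() come from the patch:

```Lua
local vshard = require('vshard')

-- Hypothetical application-level check, only a stub here.
local function app_config_is_in_sync()
    return false
end

-- Suppose the application notices its config desynced from the central
-- storage and wants to stop serving requests until it is repaired.
if not app_config_is_in_sync() then
    vshard.storage.disable()
end

-- ... later, once the config is fixed ...
vshard.storage.enable()
```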
Gerold103 added a commit that referenced this issue on Dec 16, 2021

vshard.storage.call() and most of the other vshard.storage.* functions now raise an exception STORAGE_IS_DISABLED when the storage is disabled. The router wants to catch it to handle it in a special way. But unfortunately:

- error(obj) in a Lua function is wrapped into a LuajitError. 'obj' is saved into 'message' using its __tostring meta-method.
- It is not possible to create your own error type in a sane way.

These 2 facts mean that the router needs to be able to extract the original error from LuajitError's message. In vshard the errors are serialized into JSON, so a valid vshard error, such as STORAGE_IS_DISABLED, can be extracted from LuajitError's message if it wasn't truncated due to being too long. For this particular error that won't happen.

The patch introduces a new method vshard.error.from_string() to perform this extraction for its further usage in the router.

Part of #298
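The extraction idea can be sketched like this; it is not the actual vshard.error.from_string() implementation, and the accepted field set is an assumption:

```Lua
local json = require('json')

-- Try to decode a LuajitError message back into a vshard error table.
local function error_from_string(msg)
    if type(msg) ~= 'string' then
        return nil
    end
    local ok, err = pcall(json.decode, msg)
    if not ok or type(err) ~= 'table' then
        return nil
    end
    -- Accept it only if it looks like a serialized vshard error.
    if err.type ~= 'ShardingError' or err.name == nil or err.code == nil then
        return nil
    end
    return err
end
```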
Gerold103 added a commit that referenced this issue on Dec 16, 2021

If a storage reports it is disabled, then it will probably take some time before it can accept new requests. This patch makes the STORAGE_IS_DISABLED error cause the connection's backoff, in line with the 'access denied' and 'no such function' errors, because the reason for all 3 is the same - the storage is not ready to accept requests yet. Such requests are transparently retried now.

Closes #298
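Continuing the earlier backoff sketch, the new error could be recognized with a check like the following (field names are assumptions):

```Lua
-- STORAGE_IS_DISABLED now counts as a reason for connection backoff,
-- together with 'access denied' and 'no such function'.
local function error_is_storage_disabled(err)
    return type(err) == 'table' and err.type == 'ShardingError' and
           err.name == 'STORAGE_IS_DISABLED'
end
```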
Gerold103 added a commit that referenced this issue on Dec 17, 2021

While vshard.storage.cfg() is not done, accessing vshard functions is not safe. It will fail with low level errors like 'access denied' or 'no such function'. However, there can be even worse cases. The user can have universe access rights, and vshard can already be in the global namespace after require(), so the vshard.storage functions are already visible.

The previous patch fixed only the case when function access was restricted properly, and even that just partially. New problems are:

- box.cfg{} is already called, but the instance is still 'loading'. Then the data is not fully recovered yet. Accessing it is not safe from the data consistency perspective.
- vshard.storage.cfg() is not started, or is not finished yet. In the end it might be doing something on which the public functions depend.

This patch addresses these issues. Now all non-trivial vshard.storage functions are disabled until vshard.storage.cfg() is finished and the instance is fully recovered. They raise an error with a special code.

Returning it via the 'nil, err' pair wouldn't work. Firstly, some functions return a boolean value and are not documented as ever failing, so people would miss this new error. The second reason is that vshard.storage.call() needs to signal the remote caller that the storage is disabled and that this was detected before the user's function was called. If it were done via 'nil, err', then the user's function could emulate the storage being disabled. Or, even worse, it could make some changes and then get that error accidentally by remotely going to another storage which happens to be disabled. Hence it is not allowed - too easy to break something.

It was an option to change the vshard.storage.call() signature to return 'true, retvals...' when the user's function was called and 'false, err' when it wasn't, but that would break backward compatibility. Supporting it only for new routers does not seem possible.

The patch also drops the 'memtx_memory' setting from the config, because an attempt to apply it after calling box.cfg() (for example, via boot_like_vshard()) raises an error - the default memory is bigger than this setting. It messed up the new tests.

Part of #298
Closes #123
Gerold103 added a commit that referenced this issue on Dec 17, 2021

If a storage reports it is disabled, then it will probably take some time before it can accept new requests. This patch makes the STORAGE_IS_DISABLED error cause the connection's backoff, in line with the 'access denied' and 'no such function' errors, because the reason for all 3 is the same - the storage is not ready to accept requests yet. Such requests are transparently retried now.

Closes #298

@TarantoolBot document
Title: vshard.storage.enable/disable()

`vshard.storage.disable()` makes most of the `vshard.storage` functions throw an error - as a Lua exception, not via the `nil, err` pattern. `vshard.storage.enable()` reverts the disable. By default the storage is enabled.

Additionally, the storage is forcefully disabled automatically until `vshard.storage.cfg()` is finished and the instance has finished recovery (its `box.info.status` is `'running'`, for example). Auto-disable protects from usage of vshard functions before the storage's global state is fully created. Manual `vshard.storage.disable()` helps to achieve the same for the user's application. For instance, a user might want to do some preparatory work after `vshard.storage.cfg` before the application is ready for requests. Then the flow would be:

```Lua
vshard.storage.disable()
vshard.storage.cfg(...)
-- Do your preparatory work here ...
vshard.storage.enable()
```

The routers handle the errors signaling that the storage is disabled in a special way. They put connections to such instances into a backoff state for some time and try to use other replicas. For example, assume a replicaset has replicas 'replica_1' and 'replica_2', and 'replica_1' is disabled for any reason. If a router tries to talk to 'replica_1', it will get a special error and will transparently retry on 'replica_2'. When 'replica_1' is enabled again, the router will notice it too and will send requests to it again.

It all works exclusively for read-only requests. Read-write requests can only be sent to the master, which is one per replicaset, so they are not retried.
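From the router's point of view the retry described above is invisible to the caller. A hedged usage example, with a placeholder bucket id, function name, and timeout:

```Lua
local log = require('log')
local vshard = require('vshard')

-- If the replica chosen for this request is disabled, the router gets the
-- STORAGE_IS_DISABLED error, puts that connection into backoff, and
-- transparently retries the read-only call on another replica.
local res, err = vshard.router.callro(1, 'my_read_func', {}, {timeout = 10})
if res == nil and err ~= nil then
    -- Either the whole replicaset is unavailable or the function failed.
    log.error(err)
end
```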
olegrok added a commit to tarantool/cartridge that referenced this issue on Jan 10, 2022

In case of an OperationError (the config was unsuccessfully applied on a storage) we shouldn't perform requests to such a storage. After this feature was implemented in vshard (tarantool/vshard#298) we could just disable the vshard storage on such instances. For this purpose a simple on_apply_config trigger was implemented.

Closes #1411
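The shape of such a trigger could be roughly the following. The trigger name comes from the commit message, but its signature and return value here are assumptions, not the documented cartridge API; only vshard.storage.enable()/disable() are real:

```Lua
local vshard = require('vshard')

-- Disable the vshard storage while the instance is in OperationError, so
-- that routers stop sending requests to it, and re-enable it once the
-- config is applied successfully.
local function on_apply_config(_conf, state) -- The signature is assumed.
    if state == 'OperationError' then
        vshard.storage.disable()
    else
        vshard.storage.enable()
    end
    return true
end

return {on_apply_config = on_apply_config}
```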
yngvar-antonsson added a commit to tarantool/cartridge that referenced this issue on May 31, 2022

In case of an OperationError (the config was unsuccessfully applied on a storage) we shouldn't perform requests to such a storage. After this feature was implemented in vshard (tarantool/vshard#298) we could just disable the vshard storage on such instances. For this purpose a simple on_apply_config trigger was implemented.

Co-authored-by: Igor Zolotarev <[email protected]>
Router continues to send requests to replicas which are proven to be broken. These are orphan nodes which didn't finish recovery/bootstrap yet, or did finish but with an error and are now broken. It also includes instances which didn't do `vshard.storage.cfg`, or did but didn't finish it yet.

In case of an unfinished boot, all kinds of bad behaviour are possible. Among the worst: some `vshard.storage` functions are recovered in `_func`, some are not, so the storage is half-usable.

It seems reasonable to rely on `box.info.status ~= 'running'` as a sign of the node not being ready to do anything. This can be used right in the storage functions (see the sketch below). Once they see the instance is running, the storage can reload itself to a version without these checks (so as not to call the expensive `box.info` when it is no longer necessary).

In case the storage functions are not available yet, netbox will return something nasty like:

- `error: Execute access to function 'test' is denied for user 'guest'`;
- `error: Procedure 'test' is not defined`.

If the router encounters these errors for any of the `vshard.storage` functions, or the `vshard.storage` functions explicitly return an error about the instance being not `'running'`, the router must put such connections into a backoff state for some time before retrying. At the same time, the retry to another instance on any of these errors must be automatic, regardless of the request mode - read or write. These are not network errors, so they can be freely retried.

See also #198 and #123.
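A sketch of the proposed check, with assumed wrapper names; the reload-to-a-version-without-checks step is only indicated by a comment, since it depends on vshard's reload machinery:

```Lua
-- Wrap a storage function so that it refuses to work until the instance
-- has fully recovered.
local function with_recovery_check(func)
    return function(...)
        if box.info.status ~= 'running' then
            error('the storage is not ready yet: ' .. box.info.status)
        end
        -- Once 'running' is observed, the storage could reload itself to a
        -- version of the functions without this check, so box.info is not
        -- called on every request forever.
        return func(...)
    end
end
```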