Error fallback on router for faulty connections #298
Gerold103 added a commit that referenced this issue on Dec 4, 2021

RO requests use the replica with the highest prio as specified in the weights matrix. If the best replica is not available now and failover hasn't happened yet, then RO requests used to fall back to the master, even if there were other RO replicas with better prio.

This patch makes an RO call first try the currently selected most prio replica. If it is not available (no other connected replicas at all, or failover hasn't happened yet), the call will try to walk the prio list starting from this replica until it finds an available one. If that also fails, the call will try to walk the list from the beginning, hoping that the unavailable replica wasn't the best one and there might be a better option on the other side of the prio list.

The patch was done in the scope of the task about replica backoff (#298), because the problem would additionally exist when the best replica is in backoff, not only disconnected. It would get worse.

Closes #288
Gerold103 added a commit that referenced this issue on Dec 4, 2021

RO requests use the replica with the highest prio as specified in the weights matrix. If the best replica is not available now and failover hasn't happened yet, then RO requests used to fall back to the master, even if there were other RO replicas with better prio.

This patch makes an RO call first try the currently selected most prio replica. If it is not available (no other connected replicas at all, or failover hasn't happened yet), the call will try to walk the prio list starting from this replica until it finds an available one. If that also fails, the call will try to walk the list from the beginning, hoping that the unavailable replica wasn't the best one and there might be a better option on the other side of the prio list.

The patch was done in the scope of the task about replica backoff (#298), because the problem would additionally exist when the best replica is in backoff, not only disconnected. It would get worse.

Closes #288
Needed for #298
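The walk order can be illustrated with a minimal Lua sketch; all names here are assumptions, not vshard internals:

```Lua
-- A minimal sketch of the walk described above, with assumed names:
-- 'replicas' is an array ordered from the highest to the lowest priority,
-- 'start' is the index of the currently preferred replica, and
-- is_available() is a hypothetical availability check.
local function find_ro_replica(replicas, start)
    -- First walk from the preferred replica towards lower priorities.
    for i = start, #replicas do
        if replicas[i]:is_available() then
            return replicas[i]
        end
    end
    -- Nothing found there - maybe a higher-priority replica recovered in
    -- the meantime, so retry from the top of the list.
    for i = 1, start - 1 do
        if replicas[i]:is_available() then
            return replicas[i]
        end
    end
    return nil
end
```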
Gerold103 added a commit that referenced this issue on Dec 16, 2021

Storage configuration takes time. Firstly, box.cfg{}, which can be called before vshard.storage.cfg(). Secondly, vshard.storage.cfg() is not immediate either.

During that time accessing the storage is not safe. Attempts to call vshard.storage functions can return weird errors, or the functions can even be not available yet - they need to be created in _func and get access rights in _priv before becoming public.

Routers used to forward errors like 'access denied' and 'no such function' to users as is, treating them as critical. Not only was it confusing for users, it could also make an entire replicaset unavailable for requests - the connection to it is alive, so the router would send all requests to it and they all would fail, even if the replicaset has another instance which is perfectly functional.

This patch handles such specific errors inside of the router. The faulty replicas are put into a 'backoff' state. They remain in it for some fixed time (5 seconds for now), and new requests won't be sent to them until the time passes. The router will use other instances.

Backoff is activated only for vshard.* functions. If the errors are about some user's function, it is considered a regular error, because the router can't tell whether any side effects were done on the remote instance before the error happened, and hence can't retry on another node. For example, if access was denied to 'vshard.storage.call', then it is backoff. If inside of vshard.storage.call the access was denied to 'user_test_func', then it is not backoff.

It all works for read-only requests exclusively, of course, because for read-write requests the instance is just one - the master. The router does not have other options, so backoff here wouldn't help.

Part of #298
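The decision about which errors trigger backoff could look roughly like the sketch below; the field names and the box.error constants used here are assumptions for illustration, not vshard's actual code:

```Lua
local fiber = require('fiber')

local BACKOFF_INTERVAL = 5 -- Seconds, as mentioned in the message above.

-- Only 'access denied' / 'no such function' errors about vshard.* functions
-- are a reason for backoff: they mean the storage is not configured yet.
local function error_is_backoff(err)
    if err.code ~= box.error.ACCESS_DENIED and
       err.code ~= box.error.NO_SUCH_PROC then
        return false
    end
    -- Errors about the user's own functions are treated as regular errors.
    return err.message ~= nil and err.message:find('vshard%.') ~= nil
end

local function on_request_error(replica, err)
    if error_is_backoff(err) then
        -- Do not use this replica again until the deadline passes.
        replica.backoff_deadline = fiber.clock() + BACKOFF_INTERVAL
    end
end
```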
Gerold103 added a commit that referenced this issue on Dec 16, 2021

While vshard.storage.cfg() is not done, accessing vshard functions is not safe. It will fail with low level errors like 'access denied' or 'no such function'. However, there can be even worse cases. The user can have universe access rights, and vshard can already be in the global namespace after require(), so the vshard.storage functions are already visible.

The previous patch fixed only the case when function access was restricted properly, and even that just partially. New problems are:

- box.cfg{} is already called, but the instance is still 'loading'. Then the data is not fully recovered yet. Accessing it is not safe from the data consistency perspective.
- vshard.storage.cfg() is not started, or is not finished yet. In the end it might be doing something on which the public functions depend.

This patch addresses these issues. Now all non-trivial vshard.storage functions are disabled until vshard.storage.cfg() is finished and the instance is fully recovered. They raise an error with a special code.

Returning it via the 'nil, err' pair wouldn't work. Firstly, some functions return a boolean value and are not documented as ever failing, so people would miss this new error. The second reason is that vshard.storage.call() needs to signal the remote caller that the storage is disabled and that this was detected before the user's function was called. If it were done via 'nil, err', then the user's function could emulate the storage being disabled. Or, even worse, it could make some changes and then get that error accidentally by remotely going to another storage which happens to be disabled. Hence it is not allowed - too easy to break something.

It was an option to change the vshard.storage.call() signature to return 'true, retvals...' when the user's function was called and 'false, err' when it wasn't, but that would break backward compatibility. Supporting it only for new routers does not seem possible.

Part of #298
Closes #123
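A rough sketch of the 'raise with a special code' approach, with hypothetical names; the error name follows the later patches in this thread, and the real vshard error machinery is richer than this:

```Lua
local storage_is_enabled = false

local function check_is_enabled()
    if not storage_is_enabled then
        -- Raised as a Lua exception with a dedicated name so that callers
        -- and routers can tell it apart from errors raised by user code.
        error({name = 'STORAGE_IS_DISABLED',
               message = 'storage is disabled: configuration is not finished'})
    end
end

-- Hypothetical wrapper applied to the non-trivial public functions.
local function make_public(func)
    return function(...)
        check_is_enabled()
        return func(...)
    end
end
```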
Gerold103 added a commit that referenced this issue on Dec 16, 2021

The patch introduces the functions vshard.storage.enable()/disable(). They allow to control manually whether the instance can accept requests. This solves the following problems which were not covered by the previous patches:

- Even if box.cfg() is done, the status is 'running', and vshard.storage.cfg() is finished, the user's application can still be not ready to accept requests. For instance, it needs to create more functions and users on top of vshard. Then it wants to disable public requests until all the preliminary work is done.
- After everything is enabled, fine, and dandy, the instance might still want to disable itself in case of an emergency, such as its config getting broken or too outdated, desynced with a centric storage.

vshard.storage.enable()/disable() can be called any time - before, during, and after vshard.storage.cfg() - to solve these issues.

Part of #298
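The emergency use case could look like the following; the config check is a hypothetical application function, and only vshard.storage.enable()/disable() come from the patch:

```Lua
local vshard = require('vshard')

-- Hypothetical application-level check, only a stub here.
local function app_config_is_in_sync()
    return false
end

-- Suppose the application notices its config desynced from the central
-- storage and wants to stop serving requests until it is repaired.
if not app_config_is_in_sync() then
    vshard.storage.disable()
end

-- ... later, once the config is fixed ...
vshard.storage.enable()
```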
Gerold103 added a commit that referenced this issue on Dec 16, 2021

vshard.storage.call() and most of the other vshard.storage.* functions now raise an exception STORAGE_IS_DISABLED when the storage is disabled. The router wants to catch it to handle it in a special way. But unfortunately:

- error(obj) in a Lua function is wrapped into a LuajitError. 'obj' is saved into 'message' using its __tostring meta-method.
- It is not possible to create your own error type in a sane way.

These 2 facts mean that the router needs to be able to extract the original error from LuajitError's message. In vshard the errors are serialized into JSON, so a valid vshard error, such as STORAGE_IS_DISABLED, can be extracted from LuajitError's message if it wasn't truncated due to being too long. For this particular error that won't happen.

The patch introduces a new method vshard.error.from_string() to perform this extraction for its further usage in the router.

Part of #298
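The extraction idea can be sketched like this; it is not the actual vshard.error.from_string() implementation, and the accepted field set is an assumption:

```Lua
local json = require('json')

-- Try to decode a LuajitError message back into a vshard error table.
local function error_from_string(msg)
    if type(msg) ~= 'string' then
        return nil
    end
    local ok, err = pcall(json.decode, msg)
    if not ok or type(err) ~= 'table' then
        return nil
    end
    -- Accept it only if it looks like a serialized vshard error.
    if err.type ~= 'ShardingError' or err.name == nil or err.code == nil then
        return nil
    end
    return err
end
```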
Gerold103 added a commit that referenced this issue on Dec 16, 2021

If a storage reports it is disabled, then it will probably take some time before it can accept new requests. This patch makes the STORAGE_IS_DISABLED error cause the connection's backoff, in line with the 'access denied' and 'no such function' errors, because the reason for all 3 is the same - the storage is not ready to accept requests yet. Such requests are transparently retried now.

Closes #298
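Continuing the earlier backoff sketch, the new error could be recognized with a check like the following (field names are assumptions):

```Lua
-- STORAGE_IS_DISABLED now counts as a reason for connection backoff,
-- together with 'access denied' and 'no such function'.
local function error_is_storage_disabled(err)
    return type(err) == 'table' and err.type == 'ShardingError' and
           err.name == 'STORAGE_IS_DISABLED'
end
```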
Gerold103 added a commit that referenced this issue on Dec 17, 2021

While vshard.storage.cfg() is not done, accessing vshard functions is not safe. It will fail with low level errors like 'access denied' or 'no such function'. However, there can be even worse cases. The user can have universe access rights, and vshard can already be in the global namespace after require(), so the vshard.storage functions are already visible.

The previous patch fixed only the case when function access was restricted properly, and even that just partially. New problems are:

- box.cfg{} is already called, but the instance is still 'loading'. Then the data is not fully recovered yet. Accessing it is not safe from the data consistency perspective.
- vshard.storage.cfg() is not started, or is not finished yet. In the end it might be doing something on which the public functions depend.

This patch addresses these issues. Now all non-trivial vshard.storage functions are disabled until vshard.storage.cfg() is finished and the instance is fully recovered. They raise an error with a special code.

Returning it via the 'nil, err' pair wouldn't work. Firstly, some functions return a boolean value and are not documented as ever failing, so people would miss this new error. The second reason is that vshard.storage.call() needs to signal the remote caller that the storage is disabled and that this was detected before the user's function was called. If it were done via 'nil, err', then the user's function could emulate the storage being disabled. Or, even worse, it could make some changes and then get that error accidentally by remotely going to another storage which happens to be disabled. Hence it is not allowed - too easy to break something.

It was an option to change the vshard.storage.call() signature to return 'true, retvals...' when the user's function was called and 'false, err' when it wasn't, but that would break backward compatibility. Supporting it only for new routers does not seem possible.

The patch also drops the 'memtx_memory' setting from the config, because an attempt to apply it after calling box.cfg() (for example, via boot_like_vshard()) raises an error - the default memory is bigger than this setting. It messed up the new tests.

Part of #298
Closes #123
Gerold103 added a commit that referenced this issue on Dec 17, 2021

If a storage reports it is disabled, then it will probably take some time before it can accept new requests. This patch makes the STORAGE_IS_DISABLED error cause the connection's backoff, in line with the 'access denied' and 'no such function' errors, because the reason for all 3 is the same - the storage is not ready to accept requests yet. Such requests are transparently retried now.

Closes #298

@TarantoolBot document
Title: vshard.storage.enable/disable()

`vshard.storage.disable()` makes most of the `vshard.storage` functions throw an error - as a Lua exception, not via the `nil, err` pattern. `vshard.storage.enable()` reverts the disable. By default the storage is enabled.

Additionally, the storage is forcefully disabled automatically until `vshard.storage.cfg()` is finished and the instance has finished recovery (its `box.info.status` is `'running'`, for example). Auto-disable protects from usage of vshard functions before the storage's global state is fully created. Manual `vshard.storage.disable()` helps to achieve the same for the user's application. For instance, a user might want to do some preparatory work after `vshard.storage.cfg` before the application is ready for requests. Then the flow would be:

```Lua
vshard.storage.disable()
vshard.storage.cfg(...)
-- Do your preparatory work here ...
vshard.storage.enable()
```

The routers handle the errors signaling that the storage is disabled in a special way. They put connections to such instances into a backoff state for some time and try to use other replicas. For example, assume a replicaset has replicas 'replica_1' and 'replica_2', and 'replica_1' is disabled for any reason. If a router tries to talk to 'replica_1', it will get a special error and will transparently retry on 'replica_2'. When 'replica_1' is enabled again, the router will notice it too and will send requests to it again.

It all works exclusively for read-only requests. Read-write requests can only be sent to the master, which is one per replicaset, so they are not retried.
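From the router's point of view the retry described above is invisible to the caller. A hedged usage example, with a placeholder bucket id, function name, and timeout:

```Lua
local log = require('log')
local vshard = require('vshard')

-- If the replica chosen for this request is disabled, the router gets the
-- STORAGE_IS_DISABLED error, puts that connection into backoff, and
-- transparently retries the read-only call on another replica.
local res, err = vshard.router.callro(1, 'my_read_func', {}, {timeout = 10})
if res == nil and err ~= nil then
    -- Either the whole replicaset is unavailable or the function failed.
    log.error(err)
end
```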
olegrok added a commit to tarantool/cartridge that referenced this issue on Jan 10, 2022

In case of an OperationError (the config was unsuccessfully applied on a storage) we shouldn't perform requests to such a storage. After this feature was implemented in vshard (tarantool/vshard#298) we could just disable the vshard storage on such instances. For this purpose a simple on_apply_config trigger was implemented.

Closes #1411
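The shape of such a trigger could be roughly the following. The trigger name comes from the commit message, but its signature and return value here are assumptions, not the documented cartridge API; only vshard.storage.enable()/disable() are real:

```Lua
local vshard = require('vshard')

-- Disable the vshard storage while the instance is in OperationError, so
-- that routers stop sending requests to it, and re-enable it once the
-- config is applied successfully.
local function on_apply_config(_conf, state) -- The signature is assumed.
    if state == 'OperationError' then
        vshard.storage.disable()
    else
        vshard.storage.enable()
    end
    return true
end

return {on_apply_config = on_apply_config}
```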
yngvar-antonsson added a commit to tarantool/cartridge that referenced this issue on May 31, 2022

In case of an OperationError (the config was unsuccessfully applied on a storage) we shouldn't perform requests to such a storage. After this feature was implemented in vshard (tarantool/vshard#298) we could just disable the vshard storage on such instances. For this purpose a simple on_apply_config trigger was implemented.

Co-authored-by: Igor Zolotarev <[email protected]>
Router continues to send requests to replicas which are proven to be broken. These are orphan nodes which didn't finish recovery/bootstrap yet, or did finish but with an error and are now broken. It also includes instances which didn't do `vshard.storage.cfg`, or did but didn't finish it yet.

In case of an unfinished boot, all kinds of bad behaviour are possible. Among the worst: some `vshard.storage` functions are recovered in `_func`, some are not, so the storage is half-usable.

It seems reasonable to rely on `box.info.status ~= 'running'` as a sign of the node not being ready to do anything. This can be used right in the storage functions (see the sketch below). Once they see the instance is running, the storage can reload itself to a version without these checks (so as not to call the expensive `box.info` when it is no longer necessary).

In case the storage functions are not available yet, netbox will return something nasty like:

- `error: Execute access to function 'test' is denied for user 'guest'`;
- `error: Procedure 'test' is not defined`.

If the router encounters these errors for any of the `vshard.storage` functions, or the `vshard.storage` functions explicitly return an error about the instance being not `'running'`, the router must put such connections into a backoff state for some time before retrying. At the same time, the retry to another instance on any of these errors must be automatic, regardless of the request mode - read or write. These are not network errors, so they can be freely retried.

See also #198 and #123.
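A sketch of the proposed check, with assumed wrapper names; the reload-to-a-version-without-checks step is only indicated by a comment, since it depends on vshard's reload machinery:

```Lua
-- Wrap a storage function so that it refuses to work until the instance
-- has fully recovered.
local function with_recovery_check(func)
    return function(...)
        if box.info.status ~= 'running' then
            error('the storage is not ready yet: ' .. box.info.status)
        end
        -- Once 'running' is observed, the storage could reload itself to a
        -- version of the functions without this check, so box.info is not
        -- called on every request forever.
        return func(...)
    end
end
```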