Commit 3854753

Document replication_timeout + heartbeats + idle
close gh-290 (Document replication_timeout variable)
1 parent 0534ab2 commit 3854753

3 files changed, +65 -15 lines changed

3 files changed

+65
-15
lines changed

doc/1.7/book/box/box_info.rst

Lines changed: 15 additions & 8 deletions
@@ -34,7 +34,7 @@ variables.
 .. _box_info_replication:

 **replication** part contains statistics for all instances in the replica
-set in regard to the current instance (see also an example in the section
+set in regard to the current instance (see also
 :ref:`"Monitoring a replica set" <replication-monitoring>`):

 * **replication.id** is a short numeric identifier of the instance within the
@@ -63,23 +63,30 @@ set in regard to the current instance (see also an example in the section
 * ``stopped`` means that replication was stopped due to a replication error
 (e.g. :ref:`duplicate key <error_codes>`).

+.. _box_info_replication_upstream_idle:
+
 * **replication.upstream.idle** is the time (in seconds) since the instance
 received the last event from a master.
+This is the primary indicator of replication health.
+See more in :ref:`Monitoring a replica set <replication-monitoring>`.
+
+.. _box_info_replication_upstream_peer:
+
 * **replication.upstream.peer** contains the replication user name, host IP
 address and port number used for the instance.
+See more in :ref:`Monitoring a replica set <replication-monitoring>`.
+
+.. _box_info_replication_upstream_lag:
+
 * **replication.upstream.lag** is the time difference between the local time at
 the instance, recorded when the event was received, and the local time at
 another master recorded when the event was written to the
 :ref:`write ahead log <internals-wal>` on that master.
-
-Since ``lag`` calculation uses operating system clock from two different
-machines, don’t be surprised if it’s negative: a time drift may lead to the
-remote master clock being consistently behind the local instance's clock.
-
-For multi-master configurations, this is the maximal lag.
+See more in :ref:`Monitoring a replica set <replication-monitoring>`.

 * **replication.downstream** contains statistics for the replication
 data requested and downloaded from the instance.
+
 * **replication.downstream.vclock** is the instance's
 :ref:`vector clock <internals-vector>`, which contains a pair '**id**, **lsn**'.

@@ -90,7 +97,7 @@ set in regard to the current instance (see also an example in the section
 builds and returns a Lua table with all keys and values provided in the
 submodule.

-:return: keys and values in the submodule.
+:return: keys and values in the submodule
 :rtype: table

 **Example:**

doc/1.7/book/replication/repl_monitoring.rst

Lines changed: 26 additions & 3 deletions
@@ -5,7 +5,7 @@ Monitoring a replica set
 ================================================================================

 To learn what instances belong in the replica set, and obtain statistics for all
-these instances, use ``box.info.replication`` request:
+these instances, use :ref:`box.info.replication <box_info_replication>` request:

 .. code-block:: tarantoolsession

@@ -49,5 +49,28 @@ its own instance id, UUID and log sequence number.
 The request was issued at master #1, and the reply includes statistics for the
 other two masters, given in regard to master #1.

-The primary indicators of replication health are ``idle`` and ``lag`` parameters
-(see reference on :ref:`box.info.replication <box_info_replication>` for details).
+The primary indicators of replication health are:
+
+* :ref:`idle <box_info_replication_upstream_idle>`, the time (in seconds) since
+the instance received the last event from a master.
+
+A replica sends heartbeat messages to the master every second, and the master
+is programmed to reconnect automatically if it doesn’t see heartbeat messages
+more often than :ref:`replication_timeout <cfg_replication-replication_timeout>`
+seconds.
+
+Therefore, in a healthy replication setup, ``idle`` should never exceed
+``replication_timeout``: if it does, either your replication is lagging
+seriously behind, because the master is running ahead of the replica, or the
+network link between the instances is down.
+
+* :ref:`lag <box_info_replication_upstream_lag>`, the time difference between
+the local time at the instance, recorded when the event was received, and the
+local time at another master recorded when the event was written to the
+:ref:`write ahead log <internals-wal>` on that master.
+
+Since ``lag`` calculation uses operating system clock from two different
+machines, don’t be surprised if it’s negative: a time drift may lead to the
+remote master clock being consistently behind the local instance's clock.
+
+For multi-master configurations, this is the maximal lag.
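
The rule added in this file (``idle`` should never exceed ``replication_timeout``) can be sketched as a small check. This is only an illustration, not part of the commit: the helper name ``stale_upstreams`` is hypothetical, and it assumes the input table is shaped like ``box.info.replication``, with an optional ``upstream`` section per entry that carries an ``idle`` value in seconds.

```lua
-- Hypothetical helper (not part of Tarantool): return the ids of the
-- replica set members whose upstream has been idle longer than `timeout`.
-- `replication` is assumed to be shaped like box.info.replication.
local function stale_upstreams(replication, timeout)
    local stale = {}
    for id, peer in pairs(replication) do
        local upstream = peer.upstream
        -- Entries without an upstream section are skipped.
        if upstream ~= nil and upstream.idle ~= nil
                and upstream.idle > timeout then
            table.insert(stale, id)
        end
    end
    table.sort(stale)  -- pairs() order is unspecified, so sort the ids
    return stale
end

-- Sample data standing in for box.info.replication (values are made up).
local sample = {
    [1] = {id = 1},
    [2] = {id = 2, upstream = {status = 'follow', idle = 0.3, lag = 0.01}},
    [3] = {id = 3, upstream = {status = 'follow', idle = 5.2, lag = 0.02}},
}
local stale = stale_upstreams(sample, 1)
```

On a running instance, the same function could presumably be called as ``stale_upstreams(box.info.replication, box.cfg.replication_timeout)``. A non-empty result would point at the situations named above: replication lagging seriously behind, or a network link that is down.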

doc/1.7/reference/configuration/cfg_replication.rst

Lines changed: 24 additions & 4 deletions
@@ -1,4 +1,5 @@
 * :ref:`replication <cfg_replication-replication>`
+* :ref:`replication_timeout <cfg_replication-replication_timeout>`

 .. _cfg_replication-replication:

@@ -12,17 +13,22 @@
 :samp:`{konstantin}:{secret_password}@{tarantool.org}:{3301}`

 If there is more than one replication source in a replica set, specify an
-array of URIs, for example: (replace 'uri' and 'uri2' in this example with valid URIs):
+array of URIs, for example (replace 'uri' and 'uri2' in this example with
+valid URIs):

 :extsamp:`box.cfg{ replication = { {*{'uri1'}*}, {*{'uri2'}*} } }`

 If one of the URIs is "self" -- that is, if one of the URIs is for the
 instance where ``box.cfg{}`` is being executed on -- then it is ignored.
 Thus it is possible to use the same ``replication`` specification on
-multiple server instances.
+multiple server instances, as shown in
+:ref:`these examples <replication-bootstrap>`.
+
+The default user name is 'guest'.
+
+A read-only replica does not accept data-change requests on the
+:ref:`listen <cfg_basic-listen>` port.

-The default user name is ‘guest’. A replica does not accept
-data-change requests on the :ref:`listen <cfg_basic-listen>` port.
 The ``replication`` parameter is dynamic, that is, to enter master
 mode, simply set ``replication`` to an empty string and issue:

@@ -31,3 +37,17 @@
 | Type: string
 | Default: null
 | Dynamic: **yes**
+
+.. _cfg_replication-replication_timeout:
+
+.. confval:: replication_timeout
+
+A replica sends heartbeat messages to the master every second, and the
+master is programmed to reconnect automatically if it doesn’t see heartbeat
+messages more often than ``replication_timeout`` seconds.
+
+See more in :ref:`Monitoring a replica set <replication-monitoring>`.
+
+| Type: integer
+| Default: 1
+| Dynamic: **yes**
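
For illustration only, here is a minimal configuration sketch of the new parameter. The URIs, user name, password and port are hypothetical placeholders, not values from this commit. It relies only on what the added text states: ``replication_timeout`` is an integer, defaults to 1, and is dynamic.

```lua
-- Hypothetical replica set members; replace with valid URIs.
box.cfg{
    listen = 3301,
    replication = {
        'replicator:password@192.168.0.101:3301',
        'replicator:password@192.168.0.102:3301',
    },
    replication_timeout = 1,  -- seconds; 1 is the documented default
}

-- The parameter is documented as dynamic, so it can be changed at runtime.
box.cfg{replication_timeout = 2}
```

Since ``idle`` is compared against ``replication_timeout``, a larger timeout would also seem to mean that a broken link is noticed later.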
