Document replication_timeout + heartbeats + idle

lenkis · lenkis · commit 3854753470a9 · 2017-12-08T11:48:44.000+03:00
close gh-290 (Document replication_timeout variable)
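
For illustration only, here is a minimal sketch of the health check this change documents: flag every peer whose ``upstream.idle`` exceeds ``replication_timeout``. The helper name ``stale_peers`` and the sample values are hypothetical; only the field names come from the ``box.info.replication`` reference touched below.

```lua
-- Hypothetical helper (not part of Tarantool): returns a sorted list of
-- instance ids whose upstream.idle is greater than the given timeout.
local function stale_peers(replication, timeout)
    local stale = {}
    for id, peer in pairs(replication) do
        local upstream = peer.upstream
        -- The current instance has no upstream section, so skip it.
        if upstream ~= nil and upstream.idle ~= nil
                and upstream.idle > timeout then
            table.insert(stale, id)
        end
    end
    table.sort(stale)
    return stale
end

-- Made-up sample data shaped like the box.info.replication output.
local sample = {
    [1] = { id = 1, lsn = 10 },
    [2] = { id = 2, upstream = { status = 'follow', idle = 0.3, lag = 0.01 } },
    [3] = { id = 3, upstream = { status = 'follow', idle = 4.2, lag = 0.02 } },
}

local stale = stale_peers(sample, 1)
```

On a running instance, the same helper would presumably be called as ``stale_peers(box.info.replication, box.cfg.replication_timeout)``.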
diff --git a/doc/1.7/book/box/box_info.rst b/doc/1.7/book/box/box_info.rst
@@ -34,7 +34,7 @@ variables.
 .. _box_info_replication:
 
 **replication** part contains statistics for all instances in the replica
-set in regard to the current instance (see also an example in the section
+set in regard to the current instance (see also
 :ref:`"Monitoring a replica set" <replication-monitoring>`):
 
 * **replication.id** is a short numeric identifier of the instance within the
@@ -63,23 +63,30 @@ set in regard to the current instance (see also an example in the section
   * ``stopped`` means that replication was stopped due to a replication error
     (e.g. :ref:`duplicate key <error_codes>`).
 
+.. _box_info_replication_upstream_idle:
+
 * **replication.upstream.idle** is the time (in seconds) since the instance
   received the last event from a master.
+  This is the primary indicator of replication health.
+  See more in :ref:`Monitoring a replica set <replication-monitoring>`.
+
+.. _box_info_replication_upstream_peer:
+
 * **replication.upstream.peer** contains the replication user name, host IP
   address and port number used for the instance.
+  See more in :ref:`Monitoring a replica set <replication-monitoring>`.
+
+.. _box_info_replication_upstream_lag:
+
 * **replication.upstream.lag** is the time difference between the local time at
   the instance, recorded when the event was received, and the local time at
   another master recorded when the event was written to the
   :ref:`write ahead log <internals-wal>` on that master.
-
-  Since ``lag`` calculation uses operating system clock from two different
-  machines, don’t be surprised if it’s negative: a time drift may lead to the
-  remote master clock being consistently behind the local instance's clock.
-
-  For multi-master configurations, this is the maximal lag.
+  See more in :ref:`Monitoring a replica set <replication-monitoring>`.
 
 * **replication.downstream** contains statistics for the replication
   data requested and downloaded from the instance.
+
 * **replication.downstream.vclock** is the instance's
   :ref:`vector clock <internals-vector>`, which contains a pair '**id**, **lsn**'.
 
@@ -90,7 +97,7 @@ set in regard to the current instance (see also an example in the section
     builds and returns a Lua table with all keys and values provided in the
     submodule.
 
-    :return: keys and values in the submodule.
+    :return: keys and values in the submodule
     :rtype:  table
 
     **Example:**
diff --git a/doc/1.7/book/replication/repl_monitoring.rst b/doc/1.7/book/replication/repl_monitoring.rst
@@ -5,7 +5,7 @@ Monitoring a replica set
 ================================================================================
 
 To learn what instances belong in the replica set, and obtain statistics for all
-these instances, use ``box.info.replication`` request:
+these instances, use the :ref:`box.info.replication <box_info_replication>` request:
 
 .. code-block:: tarantoolsession
 
@@ -49,5 +49,28 @@ its own instance id, UUID and log sequence number.
 The request was issued at master #1, and the reply includes statistics for the
 other two masters, given in regard to master #1.
 
-The primary indicators of replication health are ``idle`` and ``lag`` parameters
-(see reference on :ref:`box.info.replication <box_info_replication>` for details).
+The primary indicators of replication health are:
+
+* :ref:`idle <box_info_replication_upstream_idle>`, the time (in seconds) since
+  the instance received the last event from a master.
+
+  A replica sends heartbeat messages to the master every second, and the master
+  is programmed to reconnect automatically if it receives no heartbeat message
+  for longer than :ref:`replication_timeout <cfg_replication-replication_timeout>`
+  seconds.
+
+  Therefore, in a healthy replication setup, ``idle`` should never exceed
+  ``replication_timeout``: if it does, either your replication is lagging
+  seriously behind, because the master is running ahead of the replica, or the
+  network link between the instances is down.
+
+* :ref:`lag <box_info_replication_upstream_lag>`, the time difference between
+  the local time at the instance, recorded when the event was received, and the
+  local time at another master recorded when the event was written to the
+  :ref:`write ahead log <internals-wal>` on that master.
+
+  Since the ``lag`` calculation uses the operating system clocks of two
+  different machines, don’t be surprised if it’s negative: time drift may leave
+  the remote master clock consistently behind the local instance's clock.
+
+  For multi-master configurations, this is the maximal lag.
diff --git a/doc/1.7/reference/configuration/cfg_replication.rst b/doc/1.7/reference/configuration/cfg_replication.rst
@@ -1,4 +1,5 @@
 * :ref:`replication <cfg_replication-replication>`
+* :ref:`replication_timeout <cfg_replication-replication_timeout>`
 
 .. _cfg_replication-replication:
 
@@ -12,17 +13,22 @@
     :samp:`{konstantin}:{secret_password}@{tarantool.org}:{3301}`
 
     If there is more than one replication source in a replica set, specify an
-    array of URIs, for example: (replace 'uri' and 'uri2' in this example with valid URIs):
+    array of URIs, for example (replace 'uri1' and 'uri2' in this example with
+    valid URIs):
 
     :extsamp:`box.cfg{ replication = { {*{'uri1'}*}, {*{'uri2'}*} } }`
 
     If one of the URIs is "self" -- that is, if one of the URIs is for the
     instance where ``box.cfg{}`` is being executed on -- then it is ignored.
     Thus it is possible to use the same ``replication`` specification on
-    multiple server instances.
+    multiple server instances, as shown in
+    :ref:`these examples <replication-bootstrap>`.
+
+    The default user name is 'guest'.
+
+    A read-only replica does not accept data-change requests on the
+    :ref:`listen <cfg_basic-listen>` port.
 
-    The default user name is ‘guest’. A replica does not accept
-    data-change requests on the :ref:`listen <cfg_basic-listen>` port.
     The ``replication`` parameter is dynamic, that is, to enter master
     mode, simply set ``replication`` to an empty string and issue:
 
@@ -31,3 +37,17 @@
     | Type: string
     | Default: null
     | Dynamic: **yes**
+
+.. _cfg_replication-replication_timeout:
+
+.. confval:: replication_timeout
+
+    A replica sends heartbeat messages to the master every second, and the
+    master is programmed to reconnect automatically if it receives no heartbeat
+    message for longer than ``replication_timeout`` seconds.
+
+    See more in :ref:`Monitoring a replica set <replication-monitoring>`.
+
+    | Type: integer
+    | Default: 1
+    | Dynamic: **yes**