
Commit c78ca39

Document join retry
1 parent 01cdb23 commit c78ca39

File tree

2 files changed: +129 -96 lines

doc/concepts/replication/repl_architecture.rst

Lines changed: 96 additions & 65 deletions
@@ -8,62 +8,95 @@ Replication architecture
 Replication mechanism
 ---------------------

-A pack of instances which operate on copies of the same databases make up a
-**replica set**. Each instance in a replica set has a role, **master** or
-**replica**.
+.. _replication_overview:
+
+Replication overview
+~~~~~~~~~~~~~~~~~~~~
+
+A pack of instances that operate on copies of the same databases makes up a **replica set**.
+Each instance in a replica set has a role: **master** or **replica**.

 A replica gets all updates from the master by continuously fetching and applying
-its :ref:`write ahead log (WAL) <internals-wal>`. Each record in the WAL represents a single
+its :ref:`write-ahead log (WAL) <internals-wal>`. Each record in the WAL represents a single
 Tarantool data-change request such as :ref:`INSERT <box_space-insert>`,
-:ref:`UPDATE <box_space-update>` or :ref:`DELETE <box_space-delete>`, and is assigned
+:ref:`UPDATE <box_space-update>`, or :ref:`DELETE <box_space-delete>`, and is assigned
 a monotonically growing log sequence number (**LSN**). In essence, Tarantool
 replication is **row-based**: each data-change request is fully deterministic
 and operates on a single :ref:`tuple <index-box_tuple>`. However, unlike a classical row-based log, which
 contains entire copies of the changed rows, Tarantool's WAL contains copies of the requests.
 For example, for UPDATE requests, Tarantool only stores the primary key of the row and
-the update operations, to save space.
+the update operations to save space.
+
+.. NOTE::
+
+    `WAL extensions <https://www.tarantool.io/en/enterprise_doc/wal_extensions/>`_ available in Tarantool Enterprise enable you to add auxiliary information to each write-ahead log record.
+    This information might be helpful for implementing a CDC (Change Data Capture) utility that transforms a data replication stream.
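For illustration, here is a minimal Lua sketch of a data-change request whose WAL record stores only the primary key and the update operations, as described in the paragraph above. The ``accounts`` space and its fields are hypothetical, not part of this documentation change.

.. code-block:: lua

    -- Hypothetical space with a numeric primary key in field 1 and a balance in field 2.
    box.schema.space.create('accounts', {if_not_exists = true})
    box.space.accounts:create_index('pk', {parts = {1, 'unsigned'}, if_not_exists = true})
    box.space.accounts:insert{1, 100}

    -- For this UPDATE, the WAL record carries only the key {1} and the operation
    -- list {{'+', 2, 50}}, not a full copy of the changed tuple.
    box.space.accounts:update({1}, {{'+', 2, 50}})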
 
-Invocations of **stored programs** are not written to the WAL. Instead, records
-of the actual **data-change requests, performed by the Lua code**, are
-written to the WAL. This ensures that possible non-determinism of Lua does not
-cause replication to go out of sync.
+The following are specifics of adding different types of information to the WAL:
 
-Data definition operations on **temporary spaces**, such as creating/dropping, adding
-indexes, truncating, etc., are written to the WAL, since information about
-temporary spaces is stored in non-temporary
-system spaces, such as :ref:`box.space._space <box_space-space>`. Data change
-operations on temporary spaces are not written to the WAL and are not replicated.
+* Invocations of **stored programs** are not written to the WAL.
+  Instead, records of the actual **data-change requests, performed by the Lua code**, are written to the WAL.
+  This ensures that the possible non-determinism of Lua does not cause replication to go out of sync.
+
+* Data definition operations on **temporary spaces** (:doc:`created </reference/reference_lua/box_schema/space_create>` with ``temporary = true``), such as creating/dropping, adding indexes, and truncating, are written to the WAL, since information about temporary spaces is stored in non-temporary system spaces, such as :ref:`box.space._space <box_space-space>`.
+
+* Data change operations on temporary spaces are not written to the WAL and are not replicated.
 
 .. _replication-local:
 
-Data change operations on **replication-local** spaces
-(spaces :doc:`created </reference/reference_lua/box_schema/space_create>`
-with ``is_local = true``)
-are written to the WAL but are not replicated.
-
-To create a valid initial state, to which WAL changes can be applied, every
-instance of a replica set requires a start set of
-:ref:`checkpoint files <index-box_persistence>`, such as .snap files for memtx
-and .run files for vinyl. A replica joining an existing replica set, chooses an
-existing master and automatically downloads the initial state from it. This is
-called an **initial join**.
-
-When an entire replica set is bootstrapped for the first time, there is no
-master which could provide the initial checkpoint. In such a case, replicas
-connect to each other and elect a master, which then creates the starting set of
-checkpoint files, and distributes it to all the other replicas. This is called
-an **automatic bootstrap** of a replica set.
-
-When a replica contacts a master (there can be many masters) for the first time,
-it becomes part of a replica set. On subsequent occasions, it should always
-contact a master in the same replica set. Once connected to the master, the
-replica requests all changes that happened after the latest local LSN (there
-can be many LSNs -- each master has its own LSN).
-
-Each replica set is identified by a globally unique identifier, called the
-**replica set UUID**. The identifier is created by the master which creates the
-very first checkpoint, and is part of the checkpoint file. It is stored in
-system space :ref:`box.space._schema <box_space-schema>`. For example:
+* Data change operations on **replication-local** spaces (:doc:`created </reference/reference_lua/box_schema/space_create>` with ``is_local = true``) are written to the WAL but are not replicated.
+
+
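As a rough illustration of the space options referenced in the list above, here is a Lua sketch; the space names are hypothetical.

.. code-block:: lua

    -- A temporary space: its definition is written to the WAL (it is recorded in
    -- box.space._space), but data changes in it are neither logged nor replicated.
    box.schema.space.create('cache', {temporary = true, if_not_exists = true})

    -- A replication-local space: data changes are written to the WAL
    -- but are not sent to replicas.
    box.schema.space.create('local_log', {is_local = true, if_not_exists = true})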
+To learn how to enable replication, check the :ref:`Bootstrapping a replica set <replication-setup>` guide.
+
+
+.. _replication_stages:
+
+Replication stages
+~~~~~~~~~~~~~~~~~~
+
+To create a valid initial state, to which WAL changes can be applied, every instance of a replica set requires a start set of :ref:`checkpoint files <index-box_persistence>`, such as ``.snap`` files for memtx and ``.run`` files for vinyl.
+A replica uses the fiber called :ref:`applier <memtx-replication>` to receive the changes from a remote node and apply them to the replica's arena.
+The applier goes through the following stages:
+
+1. **Connect**
+
+   This is the first stage of connecting a replica to other nodes.
+
+2. **Auth** (optional)
+
+   This is an optional step performed if the :ref:`URI <index-uri>` contains :ref:`authentication <authentication>` information.
+
+3. **Bootstrap** (optional)
+
+   When an entire replica set is bootstrapped for the first time, there is no master that could provide the initial checkpoint.
+   In such a case, replicas connect to each other and elect a master.
+   The master creates the starting set of checkpoint files and distributes them to all the other replicas.
+   This is called an **automatic bootstrap** of a replica set.
+
+4. **Join**
+
+   At this stage, a replica downloads the initial state from the master.
+   If the join fails with a non-critical :ref:`error <error_codes>`, for example, ``ER_READONLY``, ``ER_ACCESS_DENIED``, or a network-related issue, the instance tries to find a new master to join.
+
+   .. NOTE::
+
+      On subsequent connections, a replica downloads all changes that happened after the latest local LSN (there can be many LSNs -- each master has its own LSN).
+
+5. **Subscribe**
+
+   At this stage, a replica fetches and applies updates from the master's write-ahead log.
+
+You can use the :ref:`box.info.replication[n].upstream.status <box_info_replication>` property to monitor the status of a replica.
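For example, a minimal Lua sketch that walks ``box.info.replication`` and reports the applier status of each peer (output format is illustrative):

.. code-block:: lua

    -- Print the applier (upstream) status for every known peer.
    -- The upstream field is absent for the instance itself.
    for id, peer in pairs(box.info.replication) do
        if peer.upstream ~= nil then
            print(('peer %d (%s): %s'):format(id, peer.uuid, peer.upstream.status))
        end
    end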
+
+
+.. _replication_uuid:
+
+Replica set and instance UUIDs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Each replica set is identified by a globally unique identifier, called the **replica set UUID**.
+The identifier is created by the master that creates the very first checkpoint, and it is part of the checkpoint file. It is stored in the :ref:`box.space._schema <box_space-schema>` system space, for example:

 .. code-block:: tarantoolsession

@@ -79,14 +112,14 @@ joins the replica set. It is called an **instance UUID** and is a globally uniqu
 identifier. The instance UUID is checked to ensure that instances do not join a different
 replica set, e.g. because of a configuration error. A unique instance identifier
 is also necessary to apply rows originating from different masters only once,
-that is, to implement multi-master replication. This is why each row in the write
-ahead log, in addition to its log sequence number, stores the instance identifier
+that is, to implement multi-master replication. This is why each row in the write-ahead log,
+in addition to its log sequence number, stores the instance identifier
 of the instance on which it was created. But using a UUID as such an identifier
-would take too much space in the write ahead log, thus a shorter integer number
+would take too much space in the write-ahead log, thus a shorter integer number
 is assigned to the instance when it joins a replica set. This number is then
-used to refer to the instance in the write ahead log. It is called
-**instance id**. All identifiers are stored in system space
-:ref:`box.space._cluster <box_space-cluster>`. For example:
+used to refer to the instance in the write-ahead log. It is called
+**instance id**. All identifiers are stored in the system space
+:ref:`box.space._cluster <box_space-cluster>`, for example:

 .. code-block:: tarantoolsession

@@ -112,12 +145,10 @@ describes the state of replication in regard to each connected peer.
 Here ``vclock`` contains log sequence numbers (827 and 584) for instances with
 instance IDs 1 and 2.

-Starting in Tarantool 1.7.7, it is possible for administrators to assign
-the instance UUID and the replica set UUID values, rather than let the system
-generate them -- see the description of the
-:ref:`replicaset_uuid <cfg_replication-replicaset_uuid>` configuration parameter.
+If required, you can explicitly specify the instance and the replica set UUID values rather than letting Tarantool generate them.
+To learn more, see the :ref:`replicaset_uuid <cfg_replication-replicaset_uuid>` configuration parameter description.
+
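A minimal Lua sketch of assigning the identifiers explicitly; the UUID values below are placeholders, and the companion ``instance_uuid`` parameter is assumed from the configuration reference.

.. code-block:: lua

    -- Assign the replica set and instance UUIDs explicitly
    -- instead of letting Tarantool generate them.
    box.cfg{
        replicaset_uuid = '7b853d13-508b-4b8e-82e6-806f088ea6e9',  -- placeholder value
        instance_uuid   = '037fec43-18a9-4e12-a684-a42b716fcd02',  -- placeholder value
    }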
 
-To learn how to enable replication, check the :ref:`how-to guide <replication-setup>`.

 .. _replication-roles:

@@ -137,7 +168,7 @@ be visible on the replicas, but not vice versa.
 A simple two-instance replica set with the master on one machine and the replica
 on a different machine provides two benefits:

-* **failover**, because if the master goes down then the replica can take over,
+* **failover**, because if the master goes down, then the replica can take over,
   and
 * **load balancing**, because clients can connect to either the master or the
   replica for read requests.
@@ -168,8 +199,8 @@ order on all replicas (e.g. the DELETE is used to prune expired data),
 a master-master configuration is also safe.

 UPDATE operations, however, can easily go out of sync. For example, assignment
-and increment are not commutative, and may yield different results if applied
-in different order on different instances.
+and increment are not commutative and may yield different results if applied
+in a different order on different instances.

 More generally, it is only safe to use Tarantool master-master replication if
 all database changes are **commutative**: the end result does not depend on the
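As a worked example of the non-commutativity mentioned above, consider two UPDATE operations on a hypothetical ``counters`` space whose tuple starts as ``{1, 10}``:

.. code-block:: lua

    -- Applied as A then B: assign field 2 to 5, then increment -> final value 6.
    -- Applied as B then A: increment to 11, then assign to 5   -> final value 5.
    -- The results differ, so assignment and increment do not commute.
    box.space.counters:update({1}, {{'=', 2, 5}})   -- operation A
    box.space.counters:update({1}, {{'+', 2, 1}})   -- operation B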
@@ -179,15 +210,15 @@ conflict-free replicated data types
 
 .. _replication-topologies:

-Replication topologies: cascade, ring and full mesh
----------------------------------------------------
+Replication topologies: cascade, ring, and full mesh
+----------------------------------------------------

 Replication topology is set by the :ref:`replication <cfg_replication-replication>`
-configuration parameter. The recommended topology is a **full mesh**, because it
+configuration parameter. The recommended topology is a **full mesh** because it
 makes potential failover easy.
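For instance, a full-mesh setup gives every instance the same ``replication`` list in its ``box.cfg{}`` call; the URIs and credentials below are placeholders.

.. code-block:: lua

    -- Every member of the replica set lists all members, including itself.
    box.cfg{
        listen = 3301,
        replication = {
            'replicator:password@192.168.0.101:3301',
            'replicator:password@192.168.0.102:3301',
            'replicator:password@192.168.0.103:3301',
        },
    }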
 
 Some database products offer **cascading replication** topologies: creating a
-replica on a replica. Tarantool does not recommend such setup.
+replica on a replica. Tarantool does not recommend such a setup.

 .. image:: images/no-cascade.svg
    :align: center
@@ -204,14 +235,14 @@ such instances when replication topology changes. Here is how this can happen:
 
 We have a chain of three instances. Instance #1 contains entries for instances
 #1 and #2 in its ``_cluster`` space. Instances #2 and #3 contain entries for
-instances #1, #2 and #3 in their ``_cluster`` spaces.
+instances #1, #2, and #3 in their ``_cluster`` spaces.

 .. image:: images/cascade-problem-2.svg
    :align: center

 Now instance #2 is faulty. Instance #3 tries connecting to instance #1 as its
-new master, but the master refuses the connection since it has no entry for
-instance #3.
+new master, but the master refuses the connection since it has no entry for
+instance #3 in its ``_cluster`` space.

 **Ring replication** topology is, however, supported:

@@ -244,6 +275,6 @@ Orphan status
 
 During ``box.cfg()``, an instance tries to join all nodes listed
 in :ref:`box.cfg.replication <cfg_replication-replication>`.
-If the instance does not succeed with connecting to the required number of nodes
+If the instance does not succeed in connecting to the required number of nodes
 (see :ref:`bootstrap_strategy <cfg_replication-bootstrap_strategy>`),
 it switches to the :ref:`orphan status <internals-replication-orphan_status>`.
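For illustration, a Lua sketch of the behavior described in this hunk; the peer URIs are placeholders, and ``bootstrap_strategy`` is assumed to be supported by the Tarantool version in use.

.. code-block:: lua

    -- If the required number of listed peers cannot be reached during box.cfg(),
    -- the instance stays read-only and reports the orphan status.
    box.cfg{
        replication = {
            'replicator:password@192.168.0.101:3301',
            'replicator:password@192.168.0.102:3301',
        },
        bootstrap_strategy = 'auto',
    }
    print(box.info.status)  -- 'orphan' if not enough peers were reached, otherwise 'running'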
