============
HRoller tool
============

.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.


Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti starting from version 2.7. This tool helps
parallelize offline cluster maintenance by calculating sets of nodes in
which no two nodes are the primary and the secondary of the same DRBD
instance; the nodes of such a set can be rebooted at the same time when
all instances are down.

The way this is done is documented in the :manpage:`hroller(1)` manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (``-G`` flag) or
  select its target nodes through some other means (e.g. via a tag, or a
  regexp). (Note that individual node selection is already possible via
  the ``-O`` flag, which makes hroller ignore a node altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to select its behavior between "it's
  ok to reboot a node when a non-redundant instance is on it" and "skip
  nodes with non-redundant instances". This will only be selectable
  globally, and not per instance.
- HRoller will make sure to keep any instance which is up in its current
  state, via live migrations, unless explicitly overridden. The
  algorithm that will be used to calculate the rolling reboot with live
  migrations is described below, and any override on considering the
  instance status will only be possible for the whole run, and not
  per instance.
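
As an illustration of how these selection options could combine, here is
a minimal Python sketch. The ``Node`` class, the ``select_nodes``
function and all of its parameter names are hypothetical and are not
part of hroller; only the idea of filtering by nodegroup, tag, regexp
and ignore list, plus a global non-redundant policy, comes from the list
above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified view of a cluster node."""
    name: str
    group: str
    tags: set = field(default_factory=set)
    has_non_redundant: bool = False  # hosts a non-redundant instance


def select_nodes(nodes, group=None, tag=None, pattern=None,
                 ignored=(), skip_non_redundant=True):
    """Return the names of the nodes that qualify for a rolling reboot."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for node in nodes:
        if node.name in ignored:
            continue  # explicitly ignored node
        if group is not None and node.group != group:
            continue  # restrict to a single nodegroup
        if tag is not None and tag not in node.tags:
            continue  # restrict by tag
        if regex is not None and not regex.search(node.name):
            continue  # restrict by name pattern
        if skip_non_redundant and node.has_non_redundant:
            continue  # global policy, never decided per instance
        selected.append(node.name)
    return selected


cluster = [
    Node("node1", "rack-a", {"reboot"}),
    Node("node2", "rack-a", {"reboot"}, has_non_redundant=True),
    Node("node3", "rack-b", {"reboot"}),
    Node("node4", "rack-a"),
]
print(select_nodes(cluster, group="rack-a", tag="reboot"))  # ['node1']
```

Keeping the non-redundant policy as a single argument of the whole
selection mirrors the requirement that it is only selectable globally.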


Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances off
the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
taken into account to avoid rebooting its primary and secondary nodes
at the same time, but it will not be considered as a target for the node
evacuation. This avoids needlessly moving its primary around, since it
won't suffer any downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for rolling reboot if these are down (or the checking of
status is overridden) and an explicit option to allow it is set.
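
The two roles an instance plays can be sketched as follows. This is only
an illustration: the ``Instance`` class and both function names are
hypothetical, and the ``treat_all_as_down`` argument reflects our reading
of the status override (all instances are handled as if they were down).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical, simplified view of a DRBD instance."""
    name: str
    primary: str
    secondary: str
    running: bool


def reboot_conflicts(instances):
    """Node pairs that must not be rebooted together.

    Down instances are included: their data must stay reachable too.
    """
    return {frozenset((i.primary, i.secondary)) for i in instances}


def instances_to_migrate(instances, node, treat_all_as_down=False):
    """Instances to migrate off `node` before it is rebooted.

    Down instances are skipped, since they suffer no downtime anyway.
    """
    if treat_all_as_down:
        return []
    return [i.name for i in instances if i.primary == node and i.running]


instances = [
    Instance("web", "node1", "node2", running=True),
    Instance("batch", "node1", "node3", running=False),
]
conflicts = reboot_conflicts(instances)
to_migrate = instances_to_migrate(instances, "node1")
print(to_migrate)  # ['web']
```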

DRBD
++++

Each node must migrate all instances off to their secondaries, and then
can either be rebooted, or the secondaries can be evacuated as well.

Since doing a ``replace-disks`` on DRBD currently breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries on
it (citation needed). As such, for now we'll implement just the
"migrate+reboot" mode, and focus on replace-disks later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary of any instance, and also don't contain the primary
   nodes of two instances that have the same node as secondary. These
   can be obtained by computing a coloring of the graph with nodes
   as vertices and an edge between two nodes if either condition
   prevents simultaneous maintenance. (This is the current algorithm of
   :manpage:`hroller(1)` with the extension that the graph to be colored
   has additional edges between the primary nodes of two instances sharing
   their secondary node.)
2) It is then possible to migrate in parallel all instances off the
   nodes in a set created at step 1, then reboot/perform maintenance on
   them, and migrate their original primaries back. This allows the
   computation above to be reused for each following set without N+1
   failures being triggered, if none were present before. See below about
   the actual execution of the maintenance.
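
Step 1 can be sketched in Python as shown here. This is a minimal
illustration under stated assumptions, not the hroller implementation:
the function names are hypothetical, instances are reduced to
``(primary, secondary)`` pairs, and a simple highest-degree-first greedy
coloring stands in for whatever coloring heuristics the tool really uses.

```python
from collections import defaultdict
from itertools import combinations


def build_conflict_graph(nodes, instances):
    """Build the graph to be colored.

    instances: iterable of (primary, secondary) node-name pairs.
    """
    graph = {node: set() for node in nodes}
    primaries_by_secondary = defaultdict(set)
    for primary, secondary in instances:
        # Condition 1: primary and secondary of the same instance.
        graph[primary].add(secondary)
        graph[secondary].add(primary)
        primaries_by_secondary[secondary].add(primary)
    # Condition 2: primaries of two instances sharing their secondary.
    for primaries in primaries_by_secondary.values():
        for a, b in combinations(sorted(primaries), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Give each node the smallest color unused by its neighbours."""
    colors = {}
    # Highest degree first; ties broken by name to stay deterministic.
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[m] for m in graph[node] if m in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


def reboot_groups(graph):
    """Group nodes by color; each group can be maintained in parallel."""
    groups = defaultdict(list)
    for node, color in sorted(greedy_coloring(graph).items()):
        groups[color].append(node)
    return [groups[color] for color in sorted(groups)]


nodes = ["n1", "n2", "n3", "n4"]
instances = [("n1", "n2"), ("n3", "n2"), ("n4", "n1")]
graph = build_conflict_graph(nodes, instances)
groups = reboot_groups(graph)
print(groups)  # [['n1'], ['n2', 'n4'], ['n3']]
```

In the example, ``n1`` and ``n3`` are kept apart only because of the
additional edge: both are primaries of instances whose secondary is
``n2``, so migrating them at the same time could overload ``n2``.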

Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such, instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  among all nodes in a group). Nodes in different nodegroups can be
  handled in parallel.
- Perform migrations on one node at a time, but without waiting for the
  first node to come back before proceeding. This allows us to continue,
  shrinking the available cluster capacity, until no more capacity in the
  nodegroup is available, and then wait for some nodes to come back so
  that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, with the
  migrate-back approach discussed for DRBD to perform the calculation
  only once.

Note that for non-DRBD disks that still use local storage (e.g. RBD and
plain) redundancy might break anyway, and nothing except the first
algorithm might be safe. This would perhaps be a good reason to consider
managing RBD pools better, if those are implemented on top of node
storage rather than on dedicated storage machines.
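
The third option (pre-calculated sets) could look roughly like the
following Python sketch. Everything in it is an assumption made for
illustration: capacity is reduced to a single number per node, instances
to their sizes, and a first-fit-decreasing check replaces the placement
decisions that hail would make in practice.

```python
def fits(evacuated, capacity, instances):
    """Check whether all instances on `evacuated` fit on the other nodes.

    capacity:  node -> total capacity (one abstract resource)
    instances: node -> list of instance sizes currently hosted there
    """
    free = {node: capacity[node] - sum(instances[node])
            for node in capacity if node not in evacuated}
    moving = sorted((size for node in evacuated for size in instances[node]),
                    reverse=True)
    for size in moving:
        target = next((n for n in sorted(free) if free[n] >= size), None)
        if target is None:
            return False
        free[target] -= size
    return True


def greedy_evacuation_sets(capacity, instances):
    """Greedily put each node into the first set it still fits in."""
    sets = []
    for node in sorted(capacity):
        for group in sets:
            if fits(group | {node}, capacity, instances):
                group.add(node)
                break
        else:
            if not fits({node}, capacity, instances):
                raise ValueError(f"{node} cannot be evacuated on its own")
            sets.append({node})
    return [sorted(group) for group in sets]


capacity = {"n1": 8, "n2": 8, "n3": 8, "n4": 8}
instances = {"n1": [4], "n2": [4], "n3": [2], "n4": [2]}
evacuation_sets = greedy_evacuation_sets(capacity, instances)
print(evacuation_sets)  # [['n1', 'n2'], ['n3', 'n4']]
```

Because every set is checked against the original layout, the result is
only reusable if instances are migrated back after each set, as in the
DRBD case.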

Full-Evacuation
+++++++++++++++

If full evacuation of the nodes to be rebooted is desired, a simple
migration is not enough for the DRBD instances. To keep the number of
disk operations small, we restrict moves to ``migrate, replace-secondary``.
That is, after migrating instances out of the nodes to be rebooted,
replacement secondaries are searched for all instances that then have
their secondary on one of the nodes to be rebooted. This is done by a
greedy algorithm, refining the initial reboot partition if necessary.
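
One possible shape of such a greedy refinement is sketched here in
Python. It is not the algorithm hroller uses: the function name, the
"free slots" capacity model, and the choice to plan every sub-group
against the original layout are all assumptions made to keep the example
small.

```python
def refine_reboot_group(group, instances, free_slots):
    """Split `group` into sub-groups whose nodes can be fully evacuated.

    instances:  name -> (primary, secondary) before any migration
    free_slots: node -> number of extra secondaries the node can take
    """
    sub_groups = []
    pending = sorted(group)
    while pending:
        slots = dict(free_slots)  # assumption: plan against original layout
        accepted, deferred = [], []
        for node in pending:
            trial = dict(slots)
            feasible = True
            for name in sorted(instances):
                primary, secondary = instances[name]
                if node not in (primary, secondary):
                    continue
                # After migration the instance runs on the node staying up.
                keeper = secondary if primary == node else primary
                target = next(
                    (t for t in sorted(trial)
                     if t not in pending and t != keeper and trial[t] > 0),
                    None,
                )
                if target is None:
                    feasible = False
                    break
                trial[target] -= 1
            if feasible:
                accepted.append(node)
                slots = trial
            else:
                deferred.append(node)
        if not accepted:
            raise ValueError(
                "no replacement secondaries for: " + ", ".join(deferred))
        sub_groups.append(accepted)
        pending = deferred
    return sub_groups


instances = {"a": ("n1", "n3"), "b": ("n2", "n4")}
tight = refine_reboot_group(
    ["n1", "n2"], instances, {"n1": 1, "n2": 1, "n3": 0, "n4": 1})
print(tight)  # [['n1'], ['n2']]
```

With enough free capacity the initial group is returned unchanged; here
the lack of a free slot for the secondary of ``b`` forces ``n2`` into a
later sub-group.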

Future work
===========

HRoller should become able to execute rolling maintenances, rather than
just calculate them. For this to succeed properly one of the following
must happen:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at each
  step
- HRoller can selectively drain the cluster so it's sure that only the
  rolling maintenance can be going on
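
The first alternative amounts to a loop that re-reads the cluster state
and re-plans before every step. The Python sketch below only shows that
control flow; ``rolling_maintenance``, its three callback parameters and
``toy_planner`` are hypothetical names, and the cluster is reduced to a
mapping from each node to the nodes it conflicts with.

```python
def rolling_maintenance(read_cluster, plan_groups, maintain):
    """Maintain every node, re-planning from fresh data at each step."""
    done = []
    while True:
        cluster = read_cluster()  # may have changed since the last step
        remaining = {n: peers for n, peers in cluster.items()
                     if n not in done}
        groups = plan_groups(remaining)
        if not groups:
            return done
        for node in groups[0]:  # only the first group is acted upon
            maintain(node)
            done.append(node)


def toy_planner(graph):
    """Hypothetical planner: one greedily built conflict-free group."""
    group = []
    for node in sorted(graph):
        if all(node not in graph[m] and m not in graph[node] for m in group):
            group.append(node)
    return [group] if group else []


snapshots = iter([
    {"n1": {"n2"}, "n2": {"n1"}, "n3": set()},
    # An unrelated job changed the layout between the two steps.
    {"n1": {"n2"}, "n2": {"n1", "n3"}, "n3": {"n2"}},
])
log = []
order = rolling_maintenance(lambda: next(snapshots, {}), toy_planner,
                            log.append)
print(order)  # ['n1', 'n3', 'n2']
```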

The ``replace-disks`` functionality for DRBD nodes should be
implemented. Note that once we support a DRBD version that allows
multiple secondaries, this can be done safely, without losing
replication at any time, by adding a temporary secondary and dropping
the previous one only when the sync is finished.

Non-redundant (plain or file) instances should have a way to be moved
off as well, via plain storage live migration or ``gnt-instance move``
(which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is evacuated from a node as well before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by HRoller.
This requires RPC/RAPI support for master failover. HRoller should also
be modified to better support running on the master itself and
continuing on the new master.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End:
