========================================
Automatized Upgrade Procedure for Ganeti
========================================

.. contents:: :depth: 4

This is a design document detailing the proposed changes to the
upgrade process, in order to make it more automatic.

Current state and shortcomings
==============================

Ganeti requires the same version of Ganeti to be run on all
nodes of a cluster, and this requirement is unlikely to go away in the
foreseeable future. Also, the configuration may change between minor
versions (and in the past has proven to do so). This requires a quite
involved manual upgrade process of draining the queue, stopping
ganeti, changing the binaries, upgrading the configuration, starting
ganeti, distributing the configuration, and undraining the queue.

Proposed changes
================

While we will not remove the requirement of the same Ganeti
version running on all nodes, the transition from one version
to the other will be made more automatic. It will be possible
to install new binaries ahead of time, and the actual switch
between versions will be a single command.

While changing the file layout anyway, we install the python
code, which is architecture independent, under ``${prefix}/share``,
in a way that properly separates the Ganeti libraries of the
various versions.

Path changes to allow multiple versions installed
-------------------------------------------------

Currently, Ganeti installs to ``${PREFIX}/bin``, ``${PREFIX}/sbin``,
and so on, as well as to ``${pythondir}/ganeti``.

These paths will be changed in the following way.

- The python package will be installed to
  ``${PREFIX}/share/ganeti/${VERSION}/ganeti``.
  Here ``${VERSION}`` is, depending on configure options, either the fully
  qualified version number, consisting of major, minor, revision, and suffix,
  or just a major.minor pair. All python executables will be installed under
  ``${PREFIX}/share/ganeti/${VERSION}`` so that they see their respective
  Ganeti library. ``${PREFIX}/share/ganeti/default`` is a symbolic link to
  ``${sysconfdir}/ganeti/share`` which, in turn, is a symbolic link to
  ``${PREFIX}/share/ganeti/${VERSION}``. For all python executables (like
  ``gnt-cluster``, ``gnt-node``, etc.), symbolic links going through
  ``${PREFIX}/share/ganeti/default`` are added under ``${PREFIX}/sbin``.

- All other files will be installed to the corresponding path under
  ``${libdir}/ganeti/${VERSION}`` instead of under ``${PREFIX}``
  directly, where ``${libdir}`` defaults to ``${PREFIX}/lib``.
  ``${libdir}/ganeti/default`` will be a symlink to ``${sysconfdir}/ganeti/lib``
  which, in turn, is a symlink to ``${libdir}/ganeti/${VERSION}``.
  Symbolic links to the files installed under ``${libdir}/ganeti/${VERSION}``
  will be added under ``${PREFIX}/bin``, ``${PREFIX}/sbin``, and so on. These
  symbolic links will go through ``${libdir}/ganeti/default`` so that the
  version can easily be changed by updating the symbolic link in
  ``${sysconfdir}``.

The set of links for ganeti binaries might change between the versions.
However, as the file structure under ``${libdir}/ganeti/${VERSION}`` reflects
that of ``/``, two links of different versions will never conflict. Similarly,
the symbolic links for the python executables will never conflict, as they
always point to a file with the same basename directly under
``${PREFIX}/share/ganeti/default``. Therefore, each version will make sure
that enough symbolic links are present in ``${PREFIX}/bin``,
``${PREFIX}/sbin``, and so on, even though some might be dangling if a
different version of ganeti is currently installed.

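To make the scheme concrete, here is a minimal sketch in Python (the helper
names and the example value of ``PREFIX`` are purely illustrative) that
derives the public link locations and their version-independent targets::

  import os

  PREFIX = "/usr"  # example value, matching the layout shown below

  def python_executable_link(name):
      """Public symlink for a python executable, e.g. 'gnt-cluster'.

      The link goes through ${PREFIX}/share/ganeti/default, which (via
      ${sysconfdir}/ganeti/share) points at the active version; the
      target never mentions a version, so links created by different
      versions cannot conflict.
      """
      link = os.path.join(PREFIX, "sbin", name)
      target = os.path.join(PREFIX, "share/ganeti/default", name)
      return link, target

  def other_file_link(relpath):
      """Public symlink for any other installed file, e.g. 'usr/bin/harep'.

      The tree under ${libdir}/ganeti/${VERSION} mirrors that of '/',
      so the target is again version independent.
      """
      link = os.path.join("/", relpath)
      target = os.path.join(PREFIX, "lib/ganeti/default", relpath)
      return link, target

  # ('/usr/sbin/gnt-cluster', '/usr/share/ganeti/default/gnt-cluster')
  print(python_executable_link("gnt-cluster"))
  # ('/usr/bin/harep', '/usr/lib/ganeti/default/usr/bin/harep')
  print(other_file_link("usr/bin/harep"))
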
The extra indirection through ``${sysconfdir}`` allows installations that
choose to have ``${sysconfdir}`` and ``${localstatedir}`` outside ``${PREFIX}``
to mount ``${PREFIX}`` read-only. The latter is important for systems that
choose ``/usr`` as ``${PREFIX}`` and are following the Filesystem Hierarchy
Standard. For example, choosing ``/usr`` as ``${PREFIX}`` and ``/etc`` as
``${sysconfdir}``, the layout for version 2.10 will look as follows::

  /
  +-- etc
  |   +-- ganeti
  |       +-- lib -> /usr/lib/ganeti/2.10
  |       +-- share -> /usr/share/ganeti/2.10
  +-- usr
      +-- bin
      |   +-- harep -> /usr/lib/ganeti/default/usr/bin/harep
      |   +-- ...
      +-- sbin
      |   +-- gnt-cluster -> /usr/share/ganeti/default/gnt-cluster
      |   +-- ...
      +-- lib
      |   +-- ganeti
      |       +-- default -> /etc/ganeti/lib
      |       +-- 2.10
      |           +-- usr
      |               +-- bin
      |                   +-- htools
      |                   +-- harep -> htools
      |                   +-- ...
      +-- share
          +-- ganeti
              +-- default -> /etc/ganeti/share
              +-- 2.10
                  +-- ganeti (the python package)
                  +-- gnt-cluster
                  +-- ...

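With this layout, activating a different (already installed) version amounts
to re-pointing the two symbolic links under ``${sysconfdir}/ganeti``. A
minimal sketch, assuming a POSIX filesystem (the function name is invented
for illustration)::

  import os

  def switch_version(sysconfdir, prefix, libdir, version):
      """Point the ${sysconfdir}/ganeti symlinks at the given version.

      The replacement link is created under a temporary name and then
      renamed over the old one, making each switch atomic.
      """
      for link, target in [
              (os.path.join(sysconfdir, "ganeti/share"),
               os.path.join(prefix, "share/ganeti", version)),
              (os.path.join(sysconfdir, "ganeti/lib"),
               os.path.join(libdir, "ganeti", version))]:
          tmp = link + ".new"
          os.symlink(target, tmp)  # build the new link aside ...
          os.rename(tmp, link)     # ... and swap it in atomically

  # switch_version("/etc", "/usr", "/usr/lib", "2.11")
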
gnt-cluster upgrade
-------------------

The actual upgrade process will be carried out by a new command ``upgrade``
of ``gnt-cluster``. It takes one mandatory option, ``--to``, whose single
argument is the version to upgrade (or downgrade) to, given as a full string
with major, minor, revision, and suffix. To be compatible with current
configuration upgrade and downgrade procedures, the new version must be of
the same major version and either an equal or higher minor version, or
precisely the previous minor version.

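This restriction can be stated compactly; the following sketch (the function
name is invented) checks it on (major, minor) pairs::

  def change_allowed(current, target):
      """Same major version, and the target minor either equal or
      higher (upgrade), or precisely one lower (downgrade)."""
      cur_major, cur_minor = current
      new_major, new_minor = target
      return (new_major == cur_major and
              (new_minor >= cur_minor or new_minor == cur_minor - 1))

  assert change_allowed((2, 10), (2, 11))      # upgrade by one minor
  assert change_allowed((2, 11), (2, 10))      # downgrade to previous minor
  assert not change_allowed((2, 11), (2, 9))   # cannot skip minors downwards
  assert not change_allowed((2, 10), (3, 0))   # major version must match
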
When executed, ``gnt-cluster upgrade --to=<version>`` will perform the
following actions:

- It verifies that the version to change to is installed on all nodes
  of the cluster that are not marked as offline. If this is not the
  case, it aborts with an error. This initial testing is an
  optimization to allow for early feedback.

- An intent-to-upgrade file is created that contains the current
  version of ganeti, the version to change to, and the process ID of
  the ``gnt-cluster upgrade`` process. The latter is not used
  automatically, but allows manual detection if the upgrade process
  died unintentionally. The intent-to-upgrade file is persisted to
  disk before the actual upgrade procedure starts (a sketch of a
  possible format is shown after this list).

- The Ganeti job queue is drained, and the executable waits until there
  are no more jobs in the queue. Once :doc:`design-optables` is
  implemented, for upgrades, and only for upgrades, all jobs are paused
  instead (in the sense that the currently running opcode continues,
  but the next opcode is not started), and the procedure continues once
  all jobs are fully paused.

- All ganeti daemons on the master node are stopped.

- It is verified again that all nodes not marked as offline at this
  moment have the new version installed. If this is not the case,
  then all changes so far (stopping ganeti daemons and draining the
  queue) are undone and failure is reported. This second verification
  is necessary, as the set of online nodes might have changed during
  the initial steps.

- All ganeti daemons on all remaining (non-offline) nodes are stopped.

- A backup of all Ganeti-related status information is created for
  manual rollbacks. While the normal way of rolling back after an
  upgrade should be calling ``gnt-cluster upgrade`` from the newer version
  with the older version as argument, a full backup provides an
  additional safety net, especially for jump-upgrades (skipping
  intermediate minor versions).

- If the action is a downgrade to the previous minor version, the
  configuration is downgraded now, using ``cfgupgrade --downgrade``.

- The ``${sysconfdir}/ganeti/lib`` and ``${sysconfdir}/ganeti/share``
  symbolic links are updated.

- If the action is an upgrade to a higher minor version, the configuration
  is upgraded now, using ``cfgupgrade``.

- All daemons are started on all nodes.

- ``ensure-dirs --full-run`` is run on all nodes.

- ``gnt-cluster redist-conf`` is run on the master node.

- All daemons are restarted on all nodes.

- The Ganeti job queue is undrained.

- The intent-to-upgrade file is removed.

- ``post-upgrade`` is run with the original version as argument.

- ``gnt-cluster verify`` is run and the result reported.

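This document does not prescribe an on-disk format for the intent-to-upgrade
file mentioned above; a minimal sketch, assuming a JSON encoding and an
illustrative path, could look like this::

  import json
  import os

  # Illustrative path; the real file would live in Ganeti's state directory.
  INTENT_FILE = "/var/lib/ganeti/upgrade-intent"

  def write_intent(current_version, target_version):
      """Persist the intent-to-upgrade file before any destructive step."""
      data = {
          "current": current_version,  # version we are upgrading from
          "target": target_version,    # version we are changing to
          "pid": os.getpid(),          # for manual liveness checks only
      }
      tmp = INTENT_FILE + ".tmp"
      with open(tmp, "w") as f:
          json.dump(data, f)
          f.flush()
          os.fsync(f.fileno())         # make sure the intent hits the disk
      os.rename(tmp, INTENT_FILE)      # publish atomically

  def clear_intent():
      """Remove the file as the final step of a successful upgrade."""
      os.unlink(INTENT_FILE)
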
Considerations on unintended reboots of the master node
=======================================================

During the upgrade procedure, the only ganeti process still running is
the one instance of ``gnt-cluster upgrade``. This process is also responsible
for eventually removing the queue drain. Therefore, we have to provide
means to resume this process, if it dies unintentionally. The process
itself will handle SIGTERM gracefully by either undoing all changes
done so far, or by ignoring the signal altogether and continuing to
the end; the choice between these behaviors depends on whether the change
of the configuration has already started (in which case it goes
through to the end), or not (in which case the actions done so far are
undone).

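A minimal sketch of such a SIGTERM handler (the phase flag and the rollback
helper are hypothetical)::

  import signal

  config_change_started = False  # set to True right before cfgupgrade runs

  def rollback_and_exit():
      # Hypothetical: undo the queue drain and daemon stops, then exit.
      raise SystemExit(1)

  def handle_sigterm(signum, frame):
      if config_change_started:
          return            # past the point of no return: finish the upgrade
      rollback_and_exit()   # otherwise undo everything done so far

  signal.signal(signal.SIGTERM, handle_sigterm)
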
To achieve this, ``gnt-cluster upgrade`` will support a ``--resume``
option. It is recommended to have ``gnt-cluster upgrade --resume`` as an
at-reboot task in the crontab.
The ``gnt-cluster upgrade --resume`` command first verifies that
it is running on the master node, using the same requirement as for
starting the master daemon, i.e., confirmed by a majority of all
nodes. If it is not the master node, it will remove any possibly
existing intent-to-upgrade file and exit. If it is running on the
master node, it will check for the existence of an intent-to-upgrade
file. If no such file is found, it will simply exit. If found, it will
resume at the appropriate stage.

- If the configuration file still is at the initial version,
  ``gnt-cluster upgrade`` is resumed at the step immediately following the
  writing of the intent-to-upgrade file. It should be noted that
  all steps before changing the configuration are idempotent, so
  redoing them does not do any harm.

- If the configuration is already at the new version, all daemons on
  all nodes are stopped (as they might have been started again due
  to a reboot) and then it is resumed at the step immediately
  following the configuration change. All actions following the
  configuration change can be repeated without bringing the cluster
  into a worse state.

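Taken together, the resume dispatch could be sketched as follows; the stage
names and the callback structure are invented for illustration::

  def resume_upgrade(is_master, intent, config_version, steps):
      """Dispatch logic for gnt-cluster upgrade --resume.

      is_master:      confirmed by a majority of all nodes
      intent:         parsed intent-to-upgrade file, or None if absent
      config_version: version currently recorded in the configuration
      steps:          callbacks, keyed by the stage to resume from
      """
      if not is_master:
          steps["remove_intent_file"]()  # wrong node: clean up and stop
          return
      if intent is None:
          return                         # no interrupted upgrade
      if config_version == intent["current"]:
          # Configuration untouched: redo everything following the
          # writing of the intent file; those steps are idempotent.
          steps["resume_after_intent_written"]()
      elif config_version == intent["target"]:
          # Configuration already converted: stop daemons a reboot may
          # have restarted, then redo the post-conversion steps.
          steps["stop_all_daemons"]()
          steps["resume_after_config_change"]()
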
Caveats
=======

Since ``gnt-cluster upgrade`` drains the queue and undrains it later, any
information about a previous drain gets lost. This problem will
disappear once :doc:`design-optables` is implemented, as the
undrain will then be restricted to filters by gnt-upgrade.

Requirement of opcode backwards compatibility
==============================================

Since for upgrades we only pause jobs and do not fully drain the
queue, we need to be able to transform the job queue into a queue for
the new version. The way this is achieved is by keeping the
serialization format backwards compatible. This is in line with
current practice that opcodes do not change between versions, and at
most new fields are added. Whenever we add a new field to an opcode,
we will make sure that the deserialization function will provide a
default value if the field is not present.

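For example, when a new opcode field is introduced, its deserializer can
substitute a default for jobs written by the older version (a sketch, not
Ganeti's actual deserialization code; the ``priority`` field is just an
illustration)::

  def load_opcode(serialized):
      """Deserialize an opcode dict written by this or an older version."""
      return {
          "op_id": serialized["OP_ID"],
          # Field added in the newer version: jobs queued before the
          # upgrade lack it, so a default keeps the queue readable.
          "priority": serialized.get("priority", 0),
      }

  old_job = {"OP_ID": "OP_NODE_ADD"}  # serialized before 'priority' existed
  assert load_opcode(old_job)["priority"] == 0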