.. contents:: :depth: 4

This is a design document detailing the cluster maintenance scheduler,
HRoller.

Current state and shortcomings
==============================

To enable automating cluster-wide reboots, a new htool called HRoller
was added to Ganeti starting from version 2.7. This tool helps
parallelize offline cluster maintenance by calculating which nodes are
not both primary and secondary for any DRBD instance, and thus can be
rebooted at the same time while all instances are down.

The way this is done is documented in the :manpage:`hroller(1)`
manpage.

We would now like to perform online maintenance on the cluster by
rebooting nodes after evacuating their primary instances (rolling
reboots).

Proposed changes
================

New options
-----------

- HRoller should be able to operate on single nodegroups (``-G`` flag)
  or select its target nodes through some other means (e.g. via a tag
  or a regexp). (Note that individual node selection is already
  possible via the ``-O`` flag, which makes hroller ignore a node
  altogether.)
- HRoller should handle non-redundant instances: currently these are
  ignored, but there should be a way to choose between "it's ok to
  reboot a node when a non-redundant instance is on it" and "skip
  nodes with non-redundant instances". This will only be selectable
  globally, not per instance.
- HRoller will make sure to keep any instance which is up in its
  current state, via live migrations, unless explicitly overridden.
  The algorithm that will be used to calculate the rolling reboot with
  live migrations is described below; any override of the instance
  status check will only be possible for the whole run, not per
  instance.

Calculating rolling maintenances
--------------------------------

In order to perform rolling maintenance we need to migrate instances
off the nodes before a reboot. How this can be done depends on the
instance's disk template and status:

Down instances
++++++++++++++

If an instance was shut down when the maintenance started, it will be
taken into account to avoid rebooting its primary and secondary nodes
at the same time, but will *not* be considered as a target for the
node evacuation. This avoids needlessly moving its primary around,
since the instance won't suffer any downtime anyway.

Note that a node with non-redundant instances will only ever be
considered good for a rolling reboot if these are down (or the status
check is overridden) *and* an explicit option to allow it is set.

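To make this rule concrete, here is a minimal Python sketch of the
eligibility check; the function and parameter names are invented for
illustration and do not correspond to hroller's actual (Haskell)
implementation::

  def node_ok_for_reboot(instances, allow_non_redundant,
                         ignore_instance_status=False):
      """Illustrative reboot-eligibility check for a single node.

      instances: list of (is_redundant, is_up) pairs describing the
      instances whose primary is this node.
      """
      non_redundant_up = [up for redundant, up in instances
                          if not redundant]
      if not non_redundant_up:
          # Only redundant instances: the node is always a candidate.
          return True
      if not allow_non_redundant:
          # Nodes with non-redundant instances are skipped unless the
          # global option explicitly allows rebooting them.
          return False
      # All non-redundant instances must be down, unless the status
      # check is overridden for the whole run.
      return ignore_instance_status or not any(non_redundant_up)
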
DRBD
++++

Each node must migrate all instances off to their secondaries; then it
can either be rebooted, or the secondaries can be evacuated as well.

Since currently doing a ``replace-disks`` on DRBD breaks redundancy,
it's not any safer than temporarily rebooting a node with secondaries
on it (citation needed). As such we'll implement just the
"migrate+reboot" mode for now, and focus on replace-disks later.

In order to do that we can use the following algorithm:

1) Compute node sets that don't contain both the primary and the
   secondary of any instance. This can already be done by the current
   hroller graph coloring algorithm: nodes are in the same set (color)
   if and only if no edge (instance) exists between them (see the
   :manpage:`hroller(1)` manpage for more details).
2) Inside each node set calculate subsets that don't have any
   secondary node in common (this can be done by creating a graph of
   nodes that are connected if and only if an instance on both has the
   same secondary node, and coloring that graph).
3) It is then possible to migrate in parallel all nodes in a subset
   created at step 2, then reboot/perform maintenance on them, and
   migrate back their original primaries. This allows the computation
   above to be reused for each following subset without triggering N+1
   failures, if none were present before. See below about the actual
   execution of the maintenance.

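As an illustration of steps 1 and 2, the following Python sketch runs
both coloring passes over a toy model of the cluster. The data model
(a list of ``(primary, secondary)`` pairs) and the simple greedy
coloring are simplifications invented for this example, not the actual
htools implementation::

  from collections import defaultdict

  def greedy_color(nodes, edges):
      """Greedy graph coloring; returns a {node: color} mapping."""
      adj = defaultdict(set)
      for a, b in edges:
          adj[a].add(b)
          adj[b].add(a)
      colors = {}
      for node in nodes:
          used = {colors[n] for n in adj[node] if n in colors}
          colors[node] = next(c for c in range(len(nodes))
                              if c not in used)
      return colors

  def rolling_subsets(nodes, instances):
      """instances: (primary, secondary) node pairs of DRBD instances.

      Returns subsets of nodes whose primaries can be migrated away,
      and the nodes rebooted, in parallel.
      """
      def group(coloring):
          by_color = defaultdict(list)
          for node, color in coloring.items():
              by_color[color].append(node)
          return list(by_color.values())

      result = []
      # Step 1: no set may contain both ends of any instance.
      for node_set in group(greedy_color(nodes, instances)):
          members = set(node_set)
          # Step 2: connect primaries (in this set) that share a
          # secondary, and color that graph as well.
          by_secondary = defaultdict(set)
          for pri, sec in instances:
              if pri in members:
                  by_secondary[sec].add(pri)
          edges = [(a, b) for pris in by_secondary.values()
                   for a in pris for b in pris if a < b]
          result.extend(group(greedy_color(node_set, edges)))
      return result

For example, with nodes ``n1`` to ``n4`` and instances ``("n1",
"n2")``, ``("n3", "n2")`` and ``("n4", "n3")``, this yields the
subsets ``[n1]``, ``[n3]`` and ``[n2, n4]``: ``n1`` and ``n3`` share
``n2`` as a secondary, so their primaries must not be migrated away at
the same time.
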
Non-DRBD
++++++++

All non-DRBD disk templates that can be migrated have no "secondary"
concept. As such instances can be migrated to any node (in the same
nodegroup). In order to do the job we can either:

- Perform migrations on one node at a time, perform the maintenance on
  that node, and proceed (the node will then be targeted again to host
  instances automatically, as hail chooses targets for the instances
  among all nodes in a group; nodes in different nodegroups can be
  rolled in parallel, though).
- Perform migrations on one node at a time, but without waiting for
  the first node to come back before proceeding. This allows us to
  continue, restricting the cluster, until no more capacity is
  available in the nodegroup, and then wait for some nodes to come
  back so that capacity is available again for the last few nodes.
- Pre-calculate sets of nodes that can be migrated together (probably
  with a greedy algorithm) and parallelize between them, with the
  migrate-back approach discussed for DRBD to perform the calculation
  (a sketch of such a greedy grouping follows below).

Note that for non-DRBD disk templates that still use local storage
(e.g. RBD and plain) redundancy might break anyway, so nothing except
the first algorithm might be safe. This is perhaps a good reason to
consider better management of RBD pools, if those are implemented on
top of node storage rather than on dedicated storage machines.

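For the third option, a greedy grouping could look like the sketch
below. The capacity model (a single spare-instance count for the
whole nodegroup) is a deliberate oversimplification invented for this
example; a real implementation would have to ask hail whether the
evacuated instances actually fit on the remaining nodes::

  def greedy_parallel_groups(node_load, spare_capacity):
      """Group nodes so each group's load fits the spare capacity.

      node_load: {node: number of instances to migrate off it}
      spare_capacity: instances the rest of the nodegroup can absorb
      while a group is down (crude stand-in for hail's real checks).
      """
      groups, current, used = [], [], 0
      # Drain the most loaded nodes first, so they are not stranded
      # at the end when little capacity is left.
      for node in sorted(node_load, key=node_load.get, reverse=True):
          if current and used + node_load[node] > spare_capacity:
              groups.append(current)
              current, used = [], 0
          # A node larger than the whole spare capacity still gets
          # its own (serial) group; it just cannot be parallelized.
          current.append(node)
          used += node_load[node]
      if current:
          groups.append(current)
      return groups

For instance, ``greedy_parallel_groups({"n1": 5, "n2": 3, "n3": 2},
6)`` returns ``[["n1"], ["n2", "n3"]]``: ``n1`` must be drained on its
own, while ``n2`` and ``n3`` can be drained in parallel.
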
Future work
===========

HRoller should become able to execute rolling maintenances, rather
than just calculate them. For this to succeed properly, one of the
following must be true:

- HRoller handles rolling maintenances that happen at the same time as
  unrelated cluster jobs, and thus recalculates the maintenance at
  each step.
- HRoller can selectively drain the cluster, so it can be sure that
  only the rolling maintenance is going on.

DRBD nodes' ``replace-disks`` functionality should be implemented.
Note that when we support a DRBD version that allows multiple
secondaries, this can be done safely, without losing replication at
any time, by adding a temporary secondary and dropping the old one
only once the sync is finished.

Non-redundant (plain or file) instances should also have a way to be
moved off, via plain storage live migration or ``gnt-instance move``
(which requires downtime).

If/when RBD pools can be managed inside Ganeti, care can be taken so
that the pool is also evacuated from a node before it's put into
maintenance. This is equivalent to evacuating DRBD secondaries.

Master failovers during the maintenance should be performed by
HRoller. This requires RPC/RAPI support for master failover. HRoller
should also be modified to better support running on the master itself
and continuing on the new master.

.. vim: set textwidth=72 :