   1 Ganeti 2.0 disk handling changes
   2 ================================
   3
   4 Objective
   5 ---------
   6
   7 Change the storage options available and the details of the
   8 implementation such that we overcome some design limitations present
   9 in Ganeti 1.x.
  10
  11 Background
  12 ----------
  13
  14 The storage options available in Ganeti 1.x were introduced based on
  15 then-current software (DRBD 0.7 and later DRBD 8) and the estimated
  16 usage patterns. However, experience has later shown that some
  17 assumptions made initially are not true and that more flexibility is
  18 needed.
  19
  20 One main assumption made was that disk failures should be treated as 'rare'
  21 events, and that each of them needs to be manually handled in order to ensure
  22 data safety; however, both these assumptions are false:
  23
  24 - disk failures can be a common occurrence, based on usage patterns or cluster
  25   size
  26 - our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  27   automate more of the recovery
  28
  29 Note that we still don't have fully-automated disk recovery as a goal, but our
  30 goal is to reduce the manual work needed.
  31
  32 Overview
  33 --------
  34
  35 We plan the following main changes:
  36
  37 - DRBD8 is much more flexible and stable than its predecessor (0.7), so we
  38   will remove support for the ``remote_raid1`` template and focus only on
  39   DRBD8
  40
  41 - dynamic discovery of DRBD devices is not actually needed in a cluster
  42   where the DRBD namespace is controlled by Ganeti; switching to a static
  43   assignment (done at either instance creation time or change secondary time)
  44   will change the disk activation time from O(n) to O(1), which on big
  45   clusters is a significant gain
  46
  47 - remove the hard dependency on LVM (currently all available storage types are
  48   ultimately backed by LVM volumes) by introducing file-based storage
  49
  50 Additionally, a number of smaller enhancements are also planned:

  51 - support variable number of disks
  52 - support read-only disks
  53
  54 Future enhancements in the 2.x series, which do not require base design
  55 changes, might include:
  56
  57 - enhancement of the LVM allocation method in order to try to keep
  58   all of an instance's virtual disks on the same physical
  59   disks
  60
  61 - add support for DRBD8 authentication at handshake time in
  62   order to ensure each device connects to the correct peer
  63
  64 - remove the restriction that failover is possible only to the secondary,
  65   which imposes very strict rules on cluster allocation
  66
  67 Detailed Design
  68 ---------------
  69
  70 DRBD minor allocation
  71 ~~~~~~~~~~~~~~~~~~~~~
  72
  73 Currently, when trying to identify or activate a new DRBD (or MD)
  74 device, the code scans all in-use devices in order to see if we find
  75 one that looks similar to our parameters and is already in the desired
  76 state or not. Since this needs external commands to be run, it is very
  77 slow when more than a few devices are already present.
  78
  79 Therefore, we will change the discovery model from dynamic to
  80 static. When a new device is logically created (added to the
  81 configuration) a free minor number is computed from the list of
  82 devices that should exist on that node and assigned to that
  83 device.
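
A minimal sketch of the static assignment, assuming the configuration keeps, per node, the list of minors that should exist there. The names ``find_free_minor`` and ``node_minors`` are hypothetical and only illustrate the idea; they are not Ganeti's actual API:

```python
def find_free_minor(configured_minors):
    """Return the lowest minor number not present in the configured list."""
    used = set(configured_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor


# Hypothetical per-node view of the minors recorded in the configuration.
node_minors = {"node1.example.com": [0, 1, 3]}

# Logically creating a new device: compute the minor from the configuration
# alone, without running any external command on the node.
new_minor = find_free_minor(node_minors["node1.example.com"])
node_minors["node1.example.com"].append(new_minor)
```

The point of the sketch is that the computation uses only configuration data, so its cost does not depend on how many devices are already active on the node.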
  84
  85 At device activation, if the minor is already in use, we check whether
  86 it has our parameters; if not, we destroy the device (if
  87 possible, otherwise we abort) and start it with our own
  88 parameters.
  89
  90 This means that we in effect take ownership of the minor space for
  91 that device type; if there's a user-created DRBD minor, it will be
  92 automatically removed.
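
The activation rule described above could be sketched as follows. This is an illustration under stated assumptions, not the implementation: the ``active`` mapping (minor to current parameters), the ``can_destroy`` callback and the ``ActivationError`` class are all hypothetical names introduced for the example:

```python
class ActivationError(Exception):
    """Raised when a conflicting device cannot be shut down."""


def plan_activation(minor, wanted, active, can_destroy):
    """Decide what to do with `minor` before activating our device.

    `active` maps in-use minors to their current parameters (a hypothetical
    representation); `can_destroy` reports whether a minor can be torn down.
    """
    current = active.get(minor)
    if current is None:
        return "start"              # minor is free, just start the device
    if current == wanted:
        return "reuse"              # already set up with our parameters
    if not can_destroy(minor):
        # Taking over is not possible, so the activation is aborted.
        raise ActivationError("minor %d is busy and cannot be destroyed" % minor)
    return "destroy-and-start"      # we own the minor space: take it over


# A minor occupied by a device with different parameters is taken over.
active = {0: {"peer": "node2", "port": 11000}}
wanted = {"peer": "node3", "port": 11001}
action = plan_activation(0, wanted, active, can_destroy=lambda m: True)
```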
  93
  94 The change will have the effect of reducing the number of external
  95 commands run per device from a constant number times the index of the
  96 first free DRBD minor to just a constant number.
  97
  98 Removal of obsolete device types (md, drbd7)
  99 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 100
 101 We need to remove these device types because of two issues. First,
 102 drbd7 has bad failure modes in case of dual failures (both network and
 103 disk): it cannot propagate the error up the device stack and instead
 104 just panics. Second, due to the asymmetry between primary and
 105 secondary in md+drbd mode, we cannot do live failover (not even if we
 106 had md+drbd8).
 107
 108 File-based storage support
 109 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 110
 111 This is covered by a separate design doc (*Vinales*) and
 112 would allow us to get rid of the hard LVM requirement for testing
 113 clusters; it would also allow people who have SAN storage to do live
 114 failover taking advantage of their storage solution.
 115
 116 Variable number of disks
 117 ~~~~~~~~~~~~~~~~~~~~~~~~
 118
 119 In order to support high-security scenarios (for example read-only sda
 120 and read-write sdb), we need to make a fully flexible disk
 121 definition. This has less impact than it might look at first sight:
 122 only the instance creation has hardcoded number of disks, not the disk
 123 handling code. The block device handling and most of the instance
 124 handling code is already working with "the instance's disks" as
 125 opposed to "the two disks of the instance", but some pieces are not
 126 (e.g. import/export) and the code needs a review to ensure safety.
 127
 128 The objective is to be able to specify the number of disks at
 129 instance creation, and to be able to toggle a disk from read-only to
 130 read-write afterwards.
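
As an illustration only, a flexible disk definition might look like the sketch below. ``DiskSpec``, its fields and ``set_disk_mode`` are hypothetical names chosen for the example and do not describe Ganeti's configuration objects:

```python
from dataclasses import dataclass


@dataclass
class DiskSpec:
    """One disk of an instance (hypothetical structure)."""
    size_mb: int
    mode: str = "rw"   # "rw" (read-write) or "ro" (read-only)


def set_disk_mode(disks, index, mode):
    """Toggle the access mode of one disk after instance creation."""
    if mode not in ("ro", "rw"):
        raise ValueError("mode must be 'ro' or 'rw'")
    disks[index].mode = mode


# An instance is created with any number of disks, here a read-only
# first disk and a read-write second disk.
disks = [DiskSpec(size_mb=10240, mode="ro"), DiskSpec(size_mb=2048)]

# Later, the first disk is switched to read-write.
set_disk_mode(disks, 0, "rw")
```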
 131
 132 Better LVM allocation
 133 ~~~~~~~~~~~~~~~~~~~~~
 134
 135 Currently, the LV to PV allocation mechanism is a very simple one: at
 136 each new request for a logical volume, tell LVM to allocate the volume
 137 in order based on the amount of free space. This is good for
 138 simplicity and for keeping the usage equally spread over the available
 139 physical disks; however, it introduces the problem that an instance could
 140 end up with its (currently) two drives on two physical disks, or
 141 (worse) that the data and metadata for a DRBD device end up on
 142 different drives.
 143
 144 This is bad because it causes unneeded ``replace-disks`` operations in
 145 case of a physical failure.
 146
 147 The solution is to batch allocations for an instance and make the LVM
 148 handling code try to allocate all the storage of one instance as close
 149 together as possible. We will still allow the logical volumes to spill over to
 150 additional disks as needed.
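
One possible shape for such a batched allocation is sketched below. It is a simplification under assumptions that are not in the design: each logical volume is placed on a single physical volume, and the function name, the input dictionaries and the tie-breaking policy are all hypothetical:

```python
def allocate_instance_volumes(volumes_mb, pv_free_mb):
    """Map each logical volume of one instance to a physical volume.

    volumes_mb: LV name -> requested size in MB
    pv_free_mb: PV name -> free space in MB
    """
    free = dict(pv_free_mb)
    total = sum(volumes_mb.values())

    # First choice: a single PV that can hold the whole instance.
    fitting = [pv for pv in free if free[pv] >= total]
    if fitting:
        target = max(fitting, key=lambda pv: free[pv])
        return {lv: target for lv in volumes_mb}

    # Otherwise spill over, largest volume first, preferring PVs that
    # already hold another volume of this instance.
    placement = {}
    for lv, size in sorted(volumes_mb.items(), key=lambda kv: -kv[1]):
        candidates = [pv for pv in free if free[pv] >= size]
        if not candidates:
            raise ValueError("not enough free space for %s" % lv)
        already_used = [pv for pv in candidates if pv in placement.values()]
        pv = max(already_used or candidates, key=lambda p: free[p])
        placement[lv] = pv
        free[pv] -= size
    return placement


# DRBD data and metadata volumes end up together when one PV has room.
together = allocate_instance_volumes(
    {"disk0_data": 1000, "disk0_meta": 128},
    {"/dev/sda": 500, "/dev/sdb": 2000},
)
```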
 151
 152 Note that this clustered allocation can only be attempted at initial
 153 instance creation, or at change secondary node time. At add disk time,
 154 or at replacing individual disks, it's not easy enough to compute the
 155 current disk map so we'll not attempt the clustering.
 156
 157 DRBD8 peer authentication at handshake
 158 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 159
 160 DRBD8 has a new feature that allows authentication of the peer at
 161 connect time. We can use this more to prevent connecting to the wrong
 162 peer than to secure the connection. Even though we never had issues
 163 with wrong connections, it would be good to implement this.
 164
 165
 166 LVM self-repair (optional)
 167 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 168
 169 The complete failure of a physical disk is very tedious to
 170 troubleshoot, mainly because of the many failure modes and the many
 171 steps needed. We can safely automate some of the steps, more
 172 specifically the ``vgreduce --removemissing`` using the following
 173 method:
 174
 175 #. check if all nodes have consistent volume groups
 176 #. if yes, and previous status was yes, do nothing
 177 #. if yes, and previous status was no, save status and restart
 178 #. if no, and previous status was no, do nothing
 179 #. if no, and previous status was yes:
 180     #. if more than one node is inconsistent, do nothing
 181     #. if only one node is inconsistent:
 182         #. run ``vgreduce --removemissing``
 183         #. log this occurrence in the ganeti log in a form that
 184            can be used for monitoring
 185         #. [FUTURE] run ``replace-disks`` for all
 186            instances affected
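
The decision table in the numbered steps above can be written as a small pure function. This is only a sketch: the function name, the action strings and the way the previous status is passed in are assumptions made for the example, and the actual commands are deliberately not run here:

```python
def decide_repair(inconsistent_nodes, previous_consistent):
    """Return (action, node) for one run of the self-repair check.

    inconsistent_nodes: names of nodes whose volume groups are inconsistent
    previous_consistent: whether the previous check found all nodes consistent
    """
    if not inconsistent_nodes:
        if previous_consistent:
            return ("nothing", None)
        # The cluster recovered since the last check.
        return ("save-status-and-restart", None)
    if not previous_consistent:
        # Already known to be broken: leave it to the administrator.
        return ("nothing", None)
    if len(inconsistent_nodes) > 1:
        # Several nodes failing at once is not handled automatically.
        return ("nothing", None)
    # Exactly one node became inconsistent since the last check.
    return ("vgreduce-removemissing", inconsistent_nodes[0])


# A single node that just became inconsistent is the only repaired case.
action, node = decide_repair(["node3"], previous_consistent=True)
```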
 187
 188 Failover to any node
 189 ~~~~~~~~~~~~~~~~~~~~
 190
 191 With a modified disk activation sequence, we can implement the
 192 *failover to any* functionality, removing many of the layout
 193 restrictions of a cluster:
 194
 195 - the need to reserve memory on the current secondary: this gets reduced to
 196   a need to reserve memory anywhere on the cluster
 197
 198 - the need to first failover and then replace secondary for an
 199   instance: with failover-to-any, we can directly failover to
 200   another node, which also does the replace disks at the same
 201   step
 202
 203 In the following, we denote the current primary by P1, the current
 204 secondary by S1, and the new primary and secondary by P2 and S2. P2
 205 is fixed to the node the user chooses, but the choice of S2 can be
 206 made between P1 and S1. This choice can be constrained, depending on
 207 which of P1 and S1 has failed.
 208
 209 - if P1 has failed, then S1 must become S2, and live migration is not possible
 210 - if S1 has failed, then P1 must become S2, and live migration could be
 211   possible (in theory, but this is not a design goal for 2.0)
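
The constraints on the choice of S2 can be expressed compactly. The sketch below assumes node names are plain strings and that at most one of P1 and S1 has failed; both function names are hypothetical:

```python
def candidate_secondaries(p1, s1, failed=None):
    """Return the nodes that may be kept as the new secondary S2."""
    if failed == p1:
        return [s1]          # P1 is gone, so S1 must become S2
    if failed == s1:
        return [p1]          # S1 is gone, so P1 must become S2
    return [p1, s1]          # both are healthy: either may be kept


def live_migration_possible(p1, s2, failed=None):
    """Migration needs the running instance, so P1 must survive as S2."""
    return failed != p1 and s2 == p1


# With a failed primary there is only one choice and no live migration.
choices = candidate_secondaries("node1", "node2", failed="node1")
```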
 212
 213 The algorithm for performing the failover is straightforward:
 214
 215 - verify that S2 (the node the user has chosen to keep as secondary) has
 216   valid data (is consistent)
 217
 218 - tear down the current DRBD association and set up a DRBD pairing between
 219   P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
 220   start resyncing from S2
 221
 222 - as soon as P2 is in state SyncTarget (i.e. after the resync has started
 223   but before it has finished), we can promote it to primary role (r/w)
 224   and start the instance on P2
 225
 226 - as soon as the P2⇐S2 sync has finished, we can remove
 227   the old data on the old node that has not been chosen for
 228   S2
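
The ordering of these steps can be made explicit as a plan, sketched below. The step names are hypothetical labels rather than Ganeti operations, and the sketch assumes that P2 is a third node, distinct from both P1 and S1:

```python
def failover_steps(p1, s1, p2, s2):
    """Return the ordered steps of a failover-to-any as (name, argument)."""
    if s2 not in (p1, s1):
        raise ValueError("S2 must be either P1 or S1")
    if p2 in (p1, s1):
        raise ValueError("this sketch assumes P2 is a new, empty node")
    # The node not kept as S2 holds the old copy that is removed at the end.
    discarded = s1 if s2 == p1 else p1
    return [
        ("verify-consistent", s2),
        ("teardown-drbd", (p1, s1)),
        ("setup-drbd", (p2, s2)),
        ("wait-for-state", (p2, "SyncTarget")),   # resync started, not finished
        ("promote-primary", p2),
        ("start-instance", p2),
        ("wait-sync-finished", (p2, s2)),
        ("remove-old-data", discarded),
    ]


# P1 (node1) has failed: S1 (node2) is kept as S2, node3 becomes P2.
plan = failover_steps("node1", "node2", "node3", "node2")
```

Postponing the instance start until the resync is done, as discussed in the caveats that follow, would amount to moving the ``start-instance`` step after ``wait-sync-finished``.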
 229
 230 Caveats: during the P2⇐S2 sync, a (non-transient) network error
 231 will cause I/O errors on the instance, so (if a longer instance
 232 downtime is acceptable) we can postpone the restart of the instance
 233 until the resync is done. However, disk I/O errors on S2 will cause
 234 data loss, since we don't have a good copy of the data anymore, so in
 235 this case waiting for the sync to complete is not an option. As such,
 236 it is recommended that this feature is used only in conjunction with
 237 proper disk monitoring.
 238
 239
 240 Live migration note: While failover-to-any is possible for all choices
 241 of S2, migration-to-any is possible only if we keep P1 as S2.
 242
 243 Caveats
 244 -------
 245
 246 The dynamic device model, while more complex, has an advantage: it
 247 will not reuse by mistake another instance's DRBD device, since it
 248 always looks for either our own or a free one.
 249
 250 The static one, in contrast, will assume that given a minor number N,
 251 it's ours and we can take over. This needs careful implementation such
 252 that if the minor is in use, either we are able to cleanly shut it
 253 down, or we abort the startup. Otherwise, it could be that we start
 254 syncing between two instances' disks, causing data loss.
 255
 256 Security Considerations
 257 -----------------------
 258
 259 The changes will not affect the security model of Ganeti.