code.grnet.gr Git - ganeti-local/blob - doc/design-2.0-locking.rst

   1 Ganeti 2.0 Granular Locking
   2 ===========================
   3
   4 Objective
   5 ---------
   6
   7 We want to make sure that multiple operations can run in parallel on a Ganeti
   8 Cluster. In order for this to happen we need to make sure concurrently run
   9 operations don't step on each other toes and break the cluster.
  10
  11 This design addresses how we are going to deal with locking so that:
  12
  13 - high urgency operations are not stopped by long length ones
  14 - long length operations can run in parallel
  15 - we preserve safety (data coherency) and liveness (no deadlock, no work
  16   postponed indefinitely) on the cluster
  17
  18 Reaching the maximum possible parallelism is a Non-Goal. We have identified a
  19 set of operations that are currently bottlenecks and need to be parallelised
  20 and have worked on those. In the future it will be possible to address other
  21 needs, thus making the cluster more and more parallel one step at a time.
  22
  23 This document only talks about parallelising Ganeti level operations, aka
  24 Logical Units, and the locking needed for that. Any other synchronisation lock
  25 needed internally by the code is outside its scope.
  26
  27 Background
  28 ----------
  29
  30 Ganeti 1.2 has a single global lock, which is used for all cluster operations.
  31 This has been painful at various times, for example:
  32
  33 - It is impossible for two people to efficiently interact with a cluster
  34   (for example for debugging) at the same time.
  35 - When batch jobs are running it's impossible to do other work (for example
  36   failovers/fixes) on a cluster.
  37
  38 This also poses scalability problems: as clusters grow in node and instance
  39 size it's a lot more likely that operations which one could conceive should run
  40 in parallel (for example because they happen on different nodes) are actually
  41 stalling each other while waiting for the global lock, without a real reason
  42 for that to happen.
  43
  44 Overview
  45 --------
  46
  47 This design doc is best read in the context of the accompanying design
  48 docs for Ganeti 2.0: Master daemon design and Job queue design.
  49
  50 We intend to implement a Ganeti locking library, which can be used by the
  51 various ganeti code components in order to easily, efficiently and correctly
  52 grab the locks they need to perform their function.
  53
  54 The proposed library has these features:
  55
  56 - Internally managing all the locks, making the implementation transparent
  57   from their usage
  58 - Automatically grabbing multiple locks in the right order (avoid deadlock)
  59 - Ability to transparently handle conversion to more granularity
  60 - Support asynchronous operation (future goal)
  61
  62 Locking will be valid only on the master node and will not be a distributed
  63 operation. In case of master failure, though, if some locks were held it means
  64 some opcodes were in progress, so when recovery of the job queue is done it
  65 will be possible to determine by the interrupted opcodes which operations could
  66 have been left half way through and thus which locks could have been held. It
  67 is then the responsibility either of the master failover code, of the cluster
  68 verification code, or of the admin to do what's necessary to make sure that any
  69 leftover state is dealt with. This is not an issue from a locking point of view
  70 because the fact that the previous master has failed means that it cannot do
  71 any job.
  72
  73 A corollary of this is that a master-failover operation with both masters alive
  74 needs to happen while no other locks are held.
  75
  76 Detailed Design
  77 ---------------
  78
  79 The Locks
  80 ~~~~~~~~~
  81 At the first stage we have decided to provide the following locks:
  82
  83 - One "config file" lock
  84 - One lock per node in the cluster
  85 - One lock per instance in the cluster
  86
  87 All the instance locks will need to be taken before the node locks, and the
  88 node locks before the config lock. Locks will need to be acquired at the same
  89 time for multiple instances and nodes, and internal ordering will be dealt
  90 within the locking library, which, for simplicity, will just use alphabetical
  91 order.
  92
  93 Handling conversion to more granularity
  94 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  95
  96 In order to convert to a more granular approach transparently each time we
  97 split a lock into more we'll create a "metalock", which will depend on those
  98 sublocks and live for the time necessary for all the code to convert (or
  99 forever, in some conditions). When a metalock exists all converted code must
 100 acquire it in shared mode, so it can run concurrently, but still be exclusive
 101 with old code, which acquires it exclusively.
 102
 103 In the beginning the only such lock will be what replaces the current "command"
 104 lock, and will acquire all the locks in the system, before proceeding. This
 105 lock will be called the "Big Ganeti Lock" because holding that one will avoid
 106 any other concurrent ganeti operations.
 107
 108 We might also want to devise more metalocks (eg. all nodes, all nodes+config)
 109 in order to make it easier for some parts of the code to acquire what it needs
 110 without specifying it explicitly.
 111
 112 In the future things like the node locks could become metalocks, should we
 113 decide to split them into an even more fine grained approach, but this will
 114 probably be only after the first 2.0 version has been released.
 115
 116 Library API
 117 ~~~~~~~~~~~
 118
 119 All the locking will be its own class, and the locks will be created at
 120 initialisation time, from the config file.
 121
 122 The API will have a way to grab one or more than one locks at the same time.
 123 Any attempt to grab a lock while already holding one in the wrong order will be
 124 checked for, and fail.
 125
 126 Adding/Removing locks
 127 ~~~~~~~~~~~~~~~~~~~~~
 128
 129 When a new instance or a new node is created an associated lock must be added
 130 to the list. The relevant code will need to inform the locking library of such
 131 a change.
 132
 133 This needs to be compatible with every other lock in the system, especially
 134 metalocks that guarantee to grab sets of resources without specifying them
 135 explicitly. The implementation of this will be handled in the locking library
 136 itself.
 137
 138 Of course when instances or nodes disappear from the cluster the relevant locks
 139 must be removed. This is easier than adding new elements, as the code which
 140 removes them must own them exclusively or can queue for their ownership, and
 141 thus deals with metalocks exactly as normal code acquiring those locks. Any
 142 operation queueing on a removed lock will fail after its removal.
 143
 144 Asynchronous operations
 145 ~~~~~~~~~~~~~~~~~~~~~~~
 146
 147 For the first version the locking library will only export synchronous
 148 operations, which will block till the needed lock are held, and only fail if
 149 the request is impossible or somehow erroneous.
 150
 151 In the future we may want to implement different types of asynchronous
 152 operations such as:
 153
 154 - Try to acquire this lock set and fail if not possible
 155 - Try to acquire one of these lock sets and return the first one you were
 156   able to get (or after a timeout) (select/poll like)
 157
 158 These operations can be used to prioritize operations based on available locks,
 159 rather than making them just blindly queue for acquiring them. The inherent
 160 risk, though, is that any code using the first operation, or setting a timeout
 161 for the second one, is susceptible to starvation and thus may never be able to
 162 get the required locks and complete certain tasks. Considering this
 163 providing/using these operations should not be among our first priorities.
 164
 165 Locking granularity
 166 ~~~~~~~~~~~~~~~~~~~
 167
 168 For the first version of this code we'll convert each Logical Unit to
 169 acquire/release the locks it needs, so locking will be at the Logical Unit
 170 level.  In the future we may want to split logical units in independent
 171 "tasklets" with their own locking requirements. A different design doc (or mini
 172 design doc) will cover the move from Logical Units to tasklets.
 173
 174 Lock acquisition code path
 175 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 176
 177 In general when acquiring locks we should use a code path equivalent to::
 178
 179   lock.acquire()
 180   try:
 181     ...
 182     # other code
 183   finally:
 184     lock.release()
 185
 186 This makes sure we release all locks, and avoid possible deadlocks. Of course
 187 extra care must be used not to leave, if possible locked structures in an
 188 unusable state.
 189
 190 In order to avoid this extra indentation and code changes everywhere in the
 191 Logical Units code, we decided to allow LUs to declare locks, and then execute
 192 their code with their locks acquired. In the new world LUs are called like
 193 this::
 194
 195   # user passed names are expanded to the internal lock/resource name,
 196   # then known needed locks are declared
 197   lu.ExpandNames()
 198   ... some locking/adding of locks may happen ...
 199   # late declaration of locks for one level: this is useful because sometimes
 200   # we can't know which resource we need before locking the previous level
 201   lu.DeclareLocks() # for each level (cluster, instance, node)
 202   ... more locking/adding of locks can happen ...
 203   # these functions are called with the proper locks held
 204   lu.CheckPrereq()
 205   lu.Exec()
 206   ... locks declared for removal are removed, all acquired locks released ...
 207
 208 The Processor and the LogicalUnit class will contain exact documentation on how
 209 locks are supposed to be declared.
 210
 211 Caveats
 212 -------
 213
 214 This library will provide an easy upgrade path to bring all the code to
 215 granular locking without breaking everything, and it will also guarantee
 216 against a lot of common errors. Code switching from the old "lock everything"
 217 lock to the new system, though, needs to be carefully scrutinised to be sure it
 218 is really acquiring all the necessary locks, and none has been overlooked or
 219 forgotten.
 220
 221 The code can contain other locks outside of this library, to synchronise other
 222 threaded code (eg for the job queue) but in general these should be leaf locks
 223 or carefully structured non-leaf ones, to avoid deadlock race conditions.
 224