==========================
Ganeti daemons refactoring
==========================

.. contents:: :depth: 2

This is a design document detailing the plan for refactoring the internal
structure of Ganeti, and particularly the set of daemons it is divided into.


Current state and shortcomings
==============================

Ganeti is composed of a growing number of daemons, each dealing with part of
the tasks the cluster has to face, and communicating with the other daemons
using a variety of protocols.

Specifically, as of Ganeti 2.8, the situation is as follows:

``Master daemon (MasterD)``
  It is responsible for managing the entire cluster, and it is written in
  Python. It is executed on a single node (the master node). It receives the
  commands given by the cluster administrator (through the remote API daemon
  or the command line tools) over the LUXI protocol. The master daemon is
  responsible for creating and managing the jobs that will execute such
  commands, and for managing the locks that ensure the cluster does not run
  into race conditions.

  Each job is managed by a separate Python thread that interacts with the node
  daemons via RPC calls.

  The master daemon is also responsible for managing the configuration of the
  cluster, changing it when required by some job. It is also responsible for
  copying the configuration to the other master candidates after updating it.

``RAPI daemon (RapiD)``
  It is written in Python and runs on the master node only. It waits for
  requests issued remotely through the remote API protocol. Then, it forwards
  them, using the LUXI protocol, to the master daemon (if they are commands) or
  to the query daemon (if they are queries about the configuration, including
  live status, of the cluster).

``Node daemon (NodeD)``
  It is written in Python. It runs on all the nodes. It is responsible for
  receiving the master's requests over RPC and executing them, using the
  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives requests
  over RPC for the execution of queries gathering live data on behalf of the
  query daemon.

``Configuration daemon (ConfD)``
  It is written in Haskell. It runs on all the master candidates. Since the
  configuration is replicated only on the master candidates, this daemon exists
  in order to provide information about the configuration to the nodes needing
  it. Requests are made through ConfD's own protocol, which is HMAC-signed and
  implemented over UDP. Clients are meant to query all the master candidates
  (or a subset thereof) in parallel and keep the most up-to-date answer. This
  is meant as a way to provide a robust service even when the master is
  temporarily unavailable.

``Query daemon (QueryD)``
  It is written in Haskell. It runs on all the master candidates. It replies
  to LUXI queries about the current status of the system, including live data
  it obtains by querying the node daemons through RPCs.

``Monitoring daemon (MonD)``
  It is written in Haskell. It runs on all nodes, including the ones that are
  not vm-capable. It is meant to provide information on the status of the
  system. Such information is related only to the specific node the daemon is
  running on, and it is provided as JSON-encoded data over HTTP, to be easily
  readable by external tools.
  The monitoring daemon communicates with ConfD to get information about the
  configuration of the cluster. The choice of communicating with ConfD instead
  of MasterD allows it to obtain configuration information even when the cluster
  is heavily degraded (e.g. when the master and some, but not all, of the master
  candidates are unreachable).
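
The query pattern described for ConfD (sign each request with an HMAC, ask
several master candidates, keep the most up-to-date answer) can be sketched as
follows. This is only an illustration, not Ganeti's actual wire format: the
JSON envelope, the ``serial`` field, the choice of SHA-256, and all function
names are assumptions made for the example, and the UDP transport is left out.

```python
import hashlib
import hmac
import json


def sign(key: bytes, payload: dict) -> dict:
    """Wrap a payload in an envelope carrying its HMAC digest."""
    msg = json.dumps(payload, sort_keys=True)
    digest = hmac.new(key, msg.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"msg": msg, "hmac": digest}


def verify(key: bytes, envelope: dict):
    """Return the decoded payload, or None if the signature does not match."""
    expected = hmac.new(key, envelope["msg"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(expected, envelope["hmac"]):
        return None
    return json.loads(envelope["msg"])


def freshest_reply(key: bytes, envelopes: list):
    """Pick the valid reply with the highest (hypothetical) serial number."""
    valid = [p for p in (verify(key, e) for e in envelopes) if p is not None]
    return max(valid, key=lambda p: p["serial"], default=None)
```

A client would collect one envelope per master candidate that answered in time
and pass the list to ``freshest_reply``; replies with a bad signature are
simply ignored.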

The current structure of the Ganeti daemons is inefficient because there are
many different protocols involved, and each daemon needs to be able to use
several of them and has to deal with many different things. This sometimes
makes it unclear which daemon is responsible for performing a specific task.

Also, with the current configuration, jobs are managed by the master daemon
using Python threads. This makes terminating a job after it has started a
difficult operation, and it is the main reason why this is not possible yet.

The master daemon currently has too many different tasks, which could be
handled better if split among different daemons.


Proposed changes
================

In order to improve on the current situation, a new daemon subdivision is
proposed, and presented hereafter.

.. digraph:: "new-daemons-structure"

  {rank=same; RConfD LuxiD;}
  {rank=same; Jobs rconfigdata;}
  node [shape=box]
  RapiD [label="RapiD [M]"]
  LuxiD [label="LuxiD [M]"]
  WConfD [label="WConfD [M]"]
  Jobs [label="Jobs [M]"]
  RConfD [label="RConfD [MC]"]
  MonD [label="MonD [All]"]
  NodeD [label="NodeD [All]"]
  Clients [label="gnt-*\nclients [M]"]
  p1 [shape=none, label=""]
  p2 [shape=none, label=""]
  p3 [shape=none, label=""]
  p4 [shape=none, label=""]
  configdata [shape=none, label="config.data"]
  rconfigdata [shape=none, label="config.data\n[MC copy]"]
  locksdata [shape=none, label="locks.data"]

  RapiD -> LuxiD [label="LUXI"]
  LuxiD -> WConfD [label="WConfD\nproto"]
  LuxiD -> Jobs [label="fork/exec"]
  Jobs -> WConfD [label="WConfD\nproto"]
  Jobs -> NodeD [label="RPC"]
  LuxiD -> NodeD [label="RPC"]
  rconfigdata -> RConfD
  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
  WConfD -> NodeD [label="RPC"]
  WConfD -> configdata
  WConfD -> locksdata
  MonD -> RConfD [label="RConfD\nproto"]
  Clients -> LuxiD [label="LUXI"]
  p1 -> MonD [label="MonD proto"]
  p2 -> RapiD [label="RAPI"]
  p3 -> RConfD [label="RConfD\nproto"]
  p4 -> Clients [label="CLI"]

``LUXI daemon (LuxiD)``
  It will be written in Haskell. It will run on the master node and it will be
  the only LUXI server, replying to all the LUXI queries. These include both
  the queries about the live configuration of the cluster, previously served by
  QueryD, and the commands actually changing the status of the cluster by
  submitting jobs. Therefore, this daemon will also be the one responsible for
  managing the job queue. When a job needs to be executed, LuxiD will spawn
  a separate process tasked with the execution of that specific job, thus making
  it easier to terminate the job itself, if needed. When a job requires locks,
  LuxiD will request them from WConfD.
  In order to preserve the availability of the cluster in case of failure of
  the master node, LuxiD will replicate the job queue to the other master
  candidates, by RPCs to the NodeD running there (the choice of RPCs for this
  task might be revisited later, after implementing this design).

``Configuration management daemon (WConfD)``
  It will run on the master node and it will be responsible for the management
  of the authoritative copy of the cluster configuration (that is, it will be
  the daemon actually modifying the ``config.data`` file). All requests for
  configuration changes will have to pass through this daemon, and will be
  performed using a LUXI-like protocol ("WConfD proto" in the graph; the exact
  protocol will be defined in the separate design document that will detail the
  WConfD separation). Having a single point of configuration management will
  also allow Ganeti to get rid of possible race conditions due to concurrent
  modifications of the configuration.

  When the configuration is updated, WConfD will have to push the received
  changes to the other master candidates, via RPCs, so that the RConfD daemons
  and (in case of a failure on the master node) the WConfD daemon on the new
  master can access an up-to-date version of it (the choice of RPCs for this
  task might be revisited later).

  This daemon will also be the one responsible for managing the locks, granting
  them to the jobs requesting them, and taking care of freeing them up if the
  jobs holding them crash or are terminated before releasing them. In order to
  do this, each job, after being spawned by LuxiD, will open a local Unix
  socket that will be used to communicate with it, and that will be destroyed
  when the job terminates. LuxiD will be able to check, after a timeout,
  whether the job is still running by connecting to this socket, and to ask
  WConfD to forcefully remove the locks if the socket is closed.

  Also, WConfD should hold a serialized list of the locks and their owners in a
  file (``locks.data``), so that it can keep track of their status in case it
  crashes and needs to be restarted (by asking LuxiD which of the owners are
  still running).

  Interaction with this daemon will be performed using Unix sockets.

``Configuration query daemon (RConfD)``
  It is written in Haskell, and it corresponds to the old ConfD. It will run on
  all the master candidates and it will serve information about the static
  configuration of the cluster (the one contained in ``config.data``). The
  provided information will be highly available (as in: a response will be
  available as long as a stable-enough connection between the client and at
  least one working master candidate is available) and its freshness will be
  best effort (the most recent reply from any of the master candidates will be
  returned, but it might still be older than the one available through WConfD).
  The information will be served through the ConfD protocol.

``Rapi daemon (RapiD)``
  It remains basically unchanged, with the only difference that all of its LUXI
  queries are directed towards LuxiD instead of being split between MasterD and
  QueryD.

``Monitoring daemon (MonD)``
  It remains unaffected by the changes in this design document. It will just get
  some of the data it needs from RConfD instead of the old ConfD, but the
  interfaces of the two are identical.

``Node daemon (NodeD)``
  It remains unaffected by the changes proposed in the design document. The only
  difference is that it will receive its RPCs from LuxiD (for job queue
  replication), from WConfD (for configuration replication) and from the
  processes executing single jobs (for all the operations to be performed by
  nodes) instead of receiving them just from MasterD.
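
The lock bookkeeping attributed to WConfD above (grant locks to jobs, keep
them in ``locks.data``, and free whatever a dead job still holds) might look
roughly like the sketch below. The class name, the JSON file layout, and the
restriction to exclusive locks are simplifying assumptions for illustration;
the real protocol and data format are left to the separate WConfD design
document.

```python
import json
import os
import tempfile


class LockTable:
    """Exclusive locks keyed by name, persisted after every change."""

    def __init__(self, path):
        self.path = path
        self.owners = {}  # lock name -> ID of the job holding it
        if os.path.exists(path):
            with open(path) as f:
                self.owners = json.load(f)

    def _save(self):
        # Write to a temporary file, then rename it over the old one, so a
        # crash never leaves a half-written lock file behind.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.owners, f)
        os.replace(tmp, self.path)

    def acquire(self, job_id, names):
        """Grant all requested locks to job_id, or none of them."""
        if any(self.owners.get(n, job_id) != job_id for n in names):
            return False
        for n in names:
            self.owners[n] = job_id
        self._save()
        return True

    def release_all(self, job_id):
        """Free every lock held by job_id (e.g. after the job died)."""
        freed = sorted(n for n, o in self.owners.items() if o == job_id)
        for n in freed:
            del self.owners[n]
        self._save()
        return freed

    def drop_dead_owners(self, running_jobs):
        """After a restart, keep only locks whose owner is still running."""
        dead = {o for o in self.owners.values() if o not in running_jobs}
        for job_id in dead:
            self.release_all(job_id)
```

In this sketch, a restarted daemon would reload the table from the file and
call ``drop_dead_owners`` with the set of job IDs that LuxiD reports as still
running.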

This restructuring will allow us to reorganize and improve the codebase,
introducing cleaner interfaces and giving well-defined and more restricted
tasks to each daemon.

Furthermore, having better-defined interfaces will allow us to have easier
upgrade procedures, and to work towards the possibility of upgrading single
components of a cluster one at a time, without the need for immediately
upgrading the entire cluster in a single step.


Implementation
==============

While performing this refactoring, we aim to increase the amount of
Haskell code, thus benefiting from the additional type safety provided by its
extensive compile-time checks. In particular, all the job queue management and
the configuration management daemon will be written in Haskell, taking over the
role currently fulfilled by Python code executed as part of MasterD.

The changes described by this design document are quite extensive, therefore
they will not be implemented all at the same time, but through a sequence of
steps, each leaving the codebase in a consistent and usable state.

#. Rename QueryD to LuxiD.
   A part of LuxiD, the one replying to configuration
   queries including live information about the system, already exists in the
   form of QueryD. This is being renamed to LuxiD, and will form the first part
   of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
   beginning, only the already existing queries will be replied to by LuxiD.
   More queries will be implemented in the next versions.

#. Let LuxiD be the interface for the queries and MasterD be their executor.
   Currently, MasterD is solely responsible for receiving and executing LUXI
   queries, and for managing the jobs they create.
   Receiving the queries and managing the job queue will be extracted from
   MasterD into LuxiD.
   Actually executing jobs will still be done by MasterD, which contains all
   the logic for doing that and for properly managing locks and the
   configuration.
   A separate design document will detail how the system will decide which jobs
   to send over for execution, and how to rate-limit them.

#. Extract WConfD from MasterD.
   The logic for managing the configuration file is factored out to the
   dedicated WConfD daemon. All configuration changes, currently executed
   directly by MasterD, will be changed to be IPC requests sent to the new
   daemon.

#. Extract locking management from MasterD.
   The logic for managing and granting locks is extracted to WConfD as well.
   Locks will no longer be taken directly, but requested from WConfD via IPC.
   This step can be executed on its own or at the same time as the previous one.

#. Jobs are executed as processes.
   The logic for running jobs is rewritten so that each job can be managed by an
   independent process. LuxiD will spawn a new (Python) process for every single
   job. The RPCs will remain unchanged, and the LU code will stay as is as much
   as possible.
   MasterD will cease to exist as a daemon on its own at this point, but not
   before.
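
The last step is what makes job termination practical: a job running in its
own process can be stopped from the outside. A minimal sketch using only the
Python standard library follows. The function names and the sleeping stand-in
for a job are hypothetical, and the sketch uses Python on the spawning side
purely for illustration, since LuxiD itself is planned in Haskell.

```python
import subprocess
import sys


def spawn_job(job_id: int) -> subprocess.Popen:
    """Start a stand-in job as a separate Python process."""
    # The child just sleeps; a real job would perform the job's operations.
    code = "import time; time.sleep(60)"
    return subprocess.Popen([sys.executable, "-c", code, str(job_id)])


def cancel_job(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the job to stop, escalating to a forced kill if it does not."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()
```

``cancel_job`` returns the exit status of the terminated process, which the
caller can record alongside the job before asking for its locks to be freed.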

Further considerations
======================

There is a possibility that a job will finish performing its task while LuxiD
and/or WConfD are not available.
In order to deal with this situation, each job will write the results of its
execution to a file. The name of this file will be known to LuxiD before
starting the job, and will be stored together with the job ID and the
name of the job-unique socket.

The job, upon ending its execution, will signal LuxiD (through the socket), so
that it can read the result of the execution and release the locks as needed.

In case LuxiD is not available at that time, the job will just terminate without
signalling it, after writing the results to the file as usual. When a new LuxiD
becomes available, it will have the most up-to-date list of running jobs
(received via replication from the former LuxiD), and go through it, cleaning up
all the terminated jobs.
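
The recovery path described above can be sketched as follows: a newly started
LuxiD walks its list of job records, probes each job's socket, and collects the
result file of every job that is no longer running. The record layout, the
function names, and the JSON result format are assumptions made for
illustration only.

```python
import json
import os
import socket


def job_is_alive(sock_path: str) -> bool:
    """A job counts as alive if something accepts connections on its socket."""
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(sock_path)
        return True
    except OSError:
        # Missing socket file or nobody listening: the job has terminated.
        return False
    finally:
        probe.close()


def collect_finished(jobs: dict) -> dict:
    """Return {job_id: result} for terminated jobs and drop them from jobs.

    jobs maps a job ID to a (socket path, result file path) pair. A job that
    died without writing a result is reported with a result of None.
    """
    finished = {}
    for job_id, (sock_path, result_path) in list(jobs.items()):
        if job_is_alive(sock_path):
            continue
        result = None
        if os.path.exists(result_path):
            with open(result_path) as f:
                result = json.load(f)
        finished[job_id] = result
        del jobs[job_id]
    return finished
```

For every job returned by ``collect_finished``, the caller would then ask
WConfD to release any locks the job still holds.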


.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: