code.grnet.gr Git - ganeti-local/blob - doc/design-2.0-master-daemon.rst

   1 Ganeti 2.0 Master daemon
   2 ========================
   3
   4 Objective
   5 ---------
   6
   7 Many of the important features of Ganeti 2.0 — job queue, granular
   8 locking, external API, etc. — will be integrated via a master
   9 daemon. While not absolutely necessary, it is the best way to
  10 integrate all these components.
  11
  12 Background
  13 ----------
  14
  15 Currently there is no "master" daemon in Ganeti (1.2). Each command
  16 tries to acquire the so called *cmd* lock and when it succeeds, it
  17 takes complete ownership of the cluster configuration and state. The
  18 scheduled improvements to Ganeti require or can use a daemon that
  19 coordinates the activities/jobs scheduled/etc.
  20
  21 Overview
  22 --------
  23
  24 The master daemon will be the central point of the cluster; command
  25 line tools and the external API will interact with the cluster via
  26 this daemon; it will be one coordinating the node daemons.
  27
  28 This design doc is best read in the context of the accompanying design
  29 docs for Ganeti 2.0: Granular locking design and Job queue design.
  30
  31
  32 Detailed Design
  33 ---------------
  34
  35 In Ganeti 2.0, we will have the following *entities*:
  36
  37 - the master daemon (on master node)
  38 - the node daemon (all nodes)
  39 - the command line tools (master node)
  40 - the RAPI daemon (master node)
  41
  42 Interaction paths are between:
  43
  44 - (CLI tools/RAPI daemon) and the master daemon, via the so called *luxi* API
  45 - the master daemon and the node daemons, via the node RPC
  46
  47 The protocol between the master daemon and the node daemons will be
  48 changed to HTTP(S), using a simple PUT/GET of JSON-encoded
  49 messages. This is done due to difficulties in working with the twisted
  50 protocols in a multithreaded environment, which we can overcome by
  51 using a simpler stack (see the caveats section). The protocol between
  52 the CLI/RAPI and the master daemon will be a custom one: on a UNIX
  53 socket on the master node, with rights restricted by filesystem
  54 permissions, the CLI/API will speak HTTP to the master daemon.
  55
  56 The operations supported over this internal protocol will be encoded
  57 via a python library that will expose a simple API for its
  58 users. Internally, the protocol will simply encode all objects in JSON
  59 format and decode them on the receiver side.
  60
  61 The LUXI protocol
  62 ~~~~~~~~~~~~~~~~~
  63
  64 We will have two main classes of operations over the master daemon API:
  65
  66 - cluster query functions
  67 - job related functions
  68
  69 The cluster query functions are usually short-duration, and are the
  70 equivalent of the OP_QUERY_* opcodes in ganeti 1.2 (and they are
  71 internally implemented still with these opcodes). The clients are
  72 guaranteed to receive the response in a reasonable time via a timeout.
  73
  74 The job-related functions will be:
  75
  76 - submit job
  77 - query job (which could also be categorized in the query-functions)
  78 - archive job (see the job queue design doc)
  79 - wait for job change, which allows a client to wait without polling
  80
  81 Daemon implementation
  82 ~~~~~~~~~~~~~~~~~~~~~
  83
  84 The daemon will be based around a main I/O thread that will wait for
  85 new requests from the clients, and that does the setup/shutdown of the
  86 other thread (pools).
  87
  88
  89 There will two other classes of threads in the daemon:
  90
  91 - job processing threads, part of a thread pool, and which are
  92   long-lived, started at daemon startup and terminated only at shutdown
  93   time
  94 - client I/O threads, which are the ones that talk the local protocol
  95   to the clients
  96
  97 Master startup/failover
  98 ~~~~~~~~~~~~~~~~~~~~~~~
  99
 100 In Ganeti 1.x there is no protection against failing over the master
 101 to a node with stale configuration. In effect, the responsibility of
 102 correct failovers falls on the admin. This is true both for the new
 103 master and for when an old, offline master startup.
 104
 105 Since in 2.x we are extending the cluster state to cover the job queue
 106 and have a daemon that will execute by itself the job queue, we want
 107 to have more resilience for the master role.
 108
 109 The following algorithm will happen whenever a node is ready to
 110 transition to the master role, either at startup time or at node
 111 failover:
 112
 113 #. read the configuration file and parse the node list
 114    contained within
 115
 116 #. query all the nodes and make sure we obtain an agreement via
 117    a quorum of at least half plus one nodes for the following:
 118
 119     - we have the latest configuration and job list (as
 120       determined by the serial number on the configuration and
 121       highest job ID on the job queue)
 122
 123     - there is not even a single node having a newer
 124       configuration file
 125
 126     - if we are not failing over (but just starting), the
 127       quorum agrees that we are the designated master
 128
 129 #. at this point, the node transitions to the master role
 130
 131 #. for all the in-progress jobs, mark them as failed, with
 132    reason unknown or something similar (master failed, etc.)
 133
 134
 135 Logging
 136 ~~~~~~~
 137
 138 The logging system will be switched completely to the logging module;
 139 currently it's logging-based, but exposes a different API, which is
 140 just overhead. As such, the code will be switched over to standard
 141 logging calls, and only the setup will be custom.
 142
 143 With this change, we will remove the separate debug/info/error logs,
 144 and instead have always one logfile per daemon model:
 145
 146 - master-daemon.log for the master daemon
 147 - node-daemon.log for the node daemon (this is the same as in 1.2)
 148 - rapi-daemon.log for the RAPI daemon logs
 149 - rapi-access.log, an additional log file for the RAPI that will be
 150   in the standard http log format for possible parsing by other tools
 151
 152 Since the watcher will only submit jobs to the master for startup of
 153 the instances, its log file will contain less information than before,
 154 mainly that it will start the instance, but not the results.
 155
 156 Caveats
 157 -------
 158
 159 A discussed alternative is to keep the current individual processes
 160 touching the cluster configuration model. The reasons we have not
 161 chosen this approach is:
 162
 163 - the speed of reading and unserializing the cluster state
 164   today is not small enough that we can ignore it; the addition of
 165   the job queue will make the startup cost even higher. While this
 166   runtime cost is low, it can be on the order of a few seconds on
 167   bigger clusters, which for very quick commands is comparable to
 168   the actual duration of the computation itself
 169
 170 - individual commands would make it harder to implement a
 171   fire-and-forget job request, along the lines "start this
 172   instance but do not wait for it to finish"; it would require a
 173   model of backgrounding the operation and other things that are
 174   much better served by a daemon-based model
 175
 176 Another area of discussion is moving away from Twisted in this new
 177 implementation. While Twisted hase its advantages, there are also many
 178 disatvantanges to using it:
 179
 180 - first and foremost, it's not a library, but a framework; thus, if
 181   you use twisted, all the code needs to be 'twiste-ized'; we were able
 182   to keep the 1.x code clean by hacking around twisted in an
 183   unsupported, unrecommended way, and the only alternative would have
 184   been to make all the code be written for twisted
 185 - it has some weaknesses in working with multiple threads, since its base
 186   model is designed to replace thread usage by the deffered, so while it can
 187   use threads, it's not less flexible in doing so
 188
 189 And, since we already have an http server library (for the RAPI), we
 190 can just reuse that for inter-node communication.