Revision 5c0c1eeb

b/Makefile.am
106 106
	doc/iallocator.sgml
107 107

  
108 108
docrst = \
109
	doc/design-2.0-cluster-parameters.rst \
110
	doc/design-2.0-commandline-parameters.rst \
111
	doc/design-2.0-disk-handling.rst \
112
	doc/design-2.0-index.rst \
113
	doc/design-2.0-job-queue.rst \
114
	doc/design-2.0-locking.rst \
115
	doc/design-2.0-master-daemon.rst \
116
	doc/design-2.0-os-interface.rst \
117
	doc/design-2.0-rapi-changes.rst \
109
	doc/design-2.0.rst \
118 110
	doc/security.rst
119 111

  
120 112
doc_DATA = \
/dev/null
1
Ganeti 2.0 cluster parameters
2
=============================
3

  
4
.. contents::
5

  
6
Objective
7
---------
8

  
9
We need to enhance the way attributes for instances and other cluster
10
parameters are handled internally within Ganeti in order to have
11
better flexibility in the following cases:
12

  
13
- introducing new parameters
14
- writing command line interfaces or APIs for these parameters
15
- supporting new 2.0 features
16

  
17
Background
18
----------
19

  
20
When the HVM hypervisor was introduced in Ganeti 1.2, the additional
21
instance parameters needed for it were simply added to the instance
22
namespace, as were additional parameters for the PVM hypervisor.
23

  
24
As a result of this, whether a particular parameter is valid for the
25
actual hypervisor could at best be guessed from the name, and only
26
really be checked by following the code using it. Similarly,
27
other parameters are not valid in all cases, and were simply added to
28
the top-level instance objects.
29

  
30
Overview
31
--------
32

  
33
Across all cluster configuration data, we have multiple classes of
34
parameters:
35

  
36
A. cluster-wide parameters (e.g. name of the cluster, the master);
37
   these are the ones that we have today, and are unchanged from the
38
   current model
39

  
40
#. node parameters
41

  
42
#. instance specific parameters, e.g. the names of the disks (LVs), that
43
   cannot be shared with other instances
44

  
45
#. instance parameters, that are or can be the same for many
46
   instances, but are not hypervisor related; e.g. the number of VCPUs,
47
   or the size of memory
48

  
49
#. instance parameters that are hypervisor specific (e.g. kernel_path
50
   or PAE mode)
51

  
52

  
53

  
54
Detailed Design
55
---------------
56

  
57
The following definitions for instance parameters will be used below:
58

  
59
hypervisor parameter
60
  a hypervisor parameter (or hypervisor specific parameter) is defined
61
  as a parameter that is interpreted by the hypervisor support code in
62
  Ganeti and usually is specific to a particular hypervisor (like the
63
  kernel path for PVM which makes no sense for HVM).
64

  
65
backend parameter
66
  a backend parameter is defined as an instance parameter that can be
67
  shared among a list of instances, and is either generic enough not
68
  to be tied to a given hypervisor or cannot influence at all the
69
  hypervisor behaviour.
70

  
71
  For example: memory, vcpus, auto_balance
72

  
73
  All these parameters will be encoded into constants.py with the prefix "BE\_"
74
  and the whole list of parameters will exist in the set "BES_PARAMETERS"
75

  
76
proper parameter
77
  a parameter whose value is unique to the instance (e.g. the name of a LV,
78
  or the MAC of a NIC)
79

  
80
As a general rule, for all kinds of parameters, “None” (or in
81
JSON-speak, “nil”) will no longer be a valid value for a parameter. As
82
such, only non-default parameters will be saved as part of objects in
83
the serialization step, reducing the size of the serialized format.
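
A minimal sketch of this rule (the helper and the parameter names are
illustrative, not the actual serialization code)::

  def SerializeParams(params, defaults):
    """Keep only parameters that actually need to be stored."""
    return dict((key, value) for key, value in params.items()
                if value is not None and value != defaults.get(key))

  # SerializeParams({"memory": 512, "vcpus": None, "auto_balance": True},
  #                 {"memory": 512, "auto_balance": False})
  # -> {"auto_balance": True}
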
84

  
85
Cluster parameters
86
~~~~~~~~~~~~~~~~~~
87

  
88
Cluster parameters remain as today, attributes at the top level of the
89
Cluster object. In addition, two new attributes at this level will
90
hold defaults for the instances:
91

  
92
- hvparams, a dictionary indexed by hypervisor type, holding default
93
  values for hypervisor parameters that are not defined/overridden by
94
  the instances of this hypervisor type
95

  
96
- beparams, a dictionary holding (for 2.0) a single element 'default',
97
  which holds the default value for backend parameters
98

  
99
Node parameters
100
~~~~~~~~~~~~~~~
101

  
102
Node-related parameters are very few, and we will continue using the
103
same model for these as previously (attributes on the Node object).
104

  
105
Instance parameters
106
~~~~~~~~~~~~~~~~~~~
107

  
108
As described before, the instance parameters are split in three:
109
instance proper parameters, unique to each instance, instance
110
hypervisor parameters and instance backend parameters.
111

  
112
The “hvparams” and “beparams” are kept in two dictionaries at instance
113
level. Only non-default parameters are stored (but once customized, a
114
parameter will be kept, even with the same value as the default one,
115
until reset).
116

  
117
The names for hypervisor parameters in the instance.hvparams subtree
118
should be chosen to be as generic as possible, especially if specific
119
parameters could conceivably be useful for more than one hypervisor,
120
e.g. instance.hvparams.vnc_console_port instead of using both
121
instance.hvparams.hvm_vnc_console_port and
122
instance.hvparams.kvm_vnc_console_port.
123

  
124
There are some special cases related to disks and NICs (for example):
125
a disk has both ganeti-related parameters (e.g. the name of the LV)
126
and hypervisor-related parameters (how the disk is presented to/named
127
in the instance). The former parameters remain as proper-instance
128
parameters, while the latter values are migrated to the hvparams
129
structure. In 2.0, we will have only globally-per-instance such
130
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
131
exported as of the same type).
132

  
133
Starting from the 1.2 list of instance parameters, here is how they
134
will be mapped to the three classes of parameters:
135

  
136
- name (P)
137
- primary_node (P)
138
- os (P)
139
- hypervisor (P)
140
- status (P)
141
- memory (BE)
142
- vcpus (BE)
143
- nics (P)
144
- disks (P)
145
- disk_template (P)
146
- network_port (P)
147
- kernel_path (HV)
148
- initrd_path (HV)
149
- hvm_boot_order (HV)
150
- hvm_acpi (HV)
151
- hvm_pae (HV)
152
- hvm_cdrom_image_path (HV)
153
- hvm_nic_type (HV)
154
- hvm_disk_type (HV)
155
- vnc_bind_address (HV)
156
- serial_no (P)
157

  
158

  
159
Parameter validation
160
~~~~~~~~~~~~~~~~~~~~
161

  
162
To support the new cluster parameter design, additional features will
163
be required from the hypervisor support implementations in Ganeti.
164

  
165
The hypervisor support implementation API will be extended with the
166
following features (a sketch follows the list):
167

  
168
:PARAMETERS: class-level attribute holding the list of valid parameters
169
  for this hypervisor
170
:CheckParamSyntax(hvparams): checks that the given parameters are
171
  valid (as in the names are valid) for this hypervisor; usually just
172
  comparing hvparams.keys() and cls.PARAMETERS; this is a class method
173
  that can be called from within master code (i.e. cmdlib) and should
174
  be safe to do so
175
:ValidateParameters(hvparams): verifies the values of the provided
176
  parameters against this hypervisor; this is a method that will be
177
  called on the target node, from backend.py code, and as such can
178
  make node-specific checks (e.g. kernel_path checking)
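
A sketch of what a hypervisor support class implementing this API could
look like (the class name and the example parameter checks are
assumptions; only the three API elements above come from this design)::

  import os

  class FakeHypervisor(object):
    """Illustrative hypervisor support class."""

    # class-level attribute with the valid parameter names
    PARAMETERS = frozenset(["kernel_path", "initrd_path"])

    @classmethod
    def CheckParamSyntax(cls, hvparams):
      """Check only the parameter names; safe to call on the master."""
      invalid = set(hvparams.keys()) - cls.PARAMETERS
      if invalid:
        raise ValueError("Invalid hypervisor parameters: %s" %
                         ", ".join(sorted(invalid)))

    def ValidateParameters(self, hvparams):
      """Check the parameter values; meant to run on the target node."""
      kernel = hvparams.get("kernel_path")
      if kernel is not None and not os.path.isfile(kernel):
        raise ValueError("Kernel %s not found on this node" % kernel)
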
179

  
180
Default value application
181
~~~~~~~~~~~~~~~~~~~~~~~~~
182

  
183
The application of defaults to an instance is done in the Cluster
184
object, via two new methods as follows:
185

  
186
- ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on
187
  instance's hvparams and cluster's ``hvparams[instance.hypervisor]``
188

  
189
- ``Cluster.FillBE(instance, be_type="default")``, which returns the
190
  beparams dict, based on the instance and cluster beparams
191

  
192
The FillHV/BE transformations will be used, for example, in the RpcRunner
193
when sending an instance for activation/stop, and the sent instance
194
hvparams/beparams will have the final value (noded code doesn't know
195
about defaults).
196

  
197
LU code will need to self-call the transformation, if needed.
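
The filling itself is a plain dictionary overlay; a minimal sketch of
the idea (not the actual method bodies)::

  def _FillDict(defaults, custom):
    """Overlay instance-level values on top of the cluster defaults."""
    filled = defaults.copy()
    filled.update(custom)
    return filled

  # FillHV(instance) would roughly return:
  #   _FillDict(cluster.hvparams[instance.hypervisor], instance.hvparams)
  # and FillBE(instance):
  #   _FillDict(cluster.beparams["default"], instance.beparams)
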
198

  
199
Opcode changes
200
~~~~~~~~~~~~~~
201

  
202
The parameter changes will have impact on the OpCodes, especially on
203
the following ones:
204

  
205
- OpCreateInstance, where the new hv and be parameters will be sent as
206
  dictionaries; note that all hv and be parameters are now optional, as
207
  the values can be instead taken from the cluster
208
- OpQueryInstances, where we have to be able to query these new
209
  parameters; the syntax for names will be ``hvparam/$NAME`` and
210
  ``beparam/$NAME`` for querying an individual parameter out of one
211
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
212
  dictionaries
213
- OpModifyInstance, where the modified parameters are sent as
214
  dictionaries
215

  
216
Additionally, we will need new OpCodes to modify the cluster-level
217
defaults for the be/hv sets of parameters.
218

  
219
Caveats
220
-------
221

  
222
One problem that might appear is that our classification is not
223
complete or not good enough, and we'll need to change this model. As
224
the last resort, we will need to rollback and keep 1.2 style.
225

  
226
Another problem is that the classification of a parameter may be unclear
227
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
228
the risk of having to move parameters later between classes.
229

  
230
Security
231
--------
232

  
233
The only security issue that we foresee is that some new parameters may
234
have sensitive values. If so, we will need a way to export the
235
config data while purging the sensitive values.
236

  
237
E.g. for the drbd shared secrets, we could export these with the
238
values replaced by an empty string.
/dev/null
1
Ganeti 2.0 commandline arguments
2
================================
3

  
4
.. contents::
5

  
6
Objective
7
---------
8

  
9
Ganeti 2.0 introduces several new features as well as new ways to
10
handle instance resources like disks or network interfaces. This
11
requires some noticeable changes in the way commandline arguments are
12
handled.
13

  
14
- extend and modify commandline syntax to support new features
15
- ensure consistent patterns in commandline arguments to reduce cognitive load
16

  
17
Background
18
----------
19

  
20
Ganeti 2.0 introduces several changes in handling instance resources
21
such as disks and network cards as well as some new features. Due to
22
these changes, the commandline syntax needs to be changed
23
significantly since the existing commandline syntax is not able to
24
cover the changes.
25

  
26
Overview
27
--------
28

  
29
Design changes for Ganeti 2.0 that require changes for the commandline
30
syntax, in no particular order:
31

  
32
- flexible instance disk handling: support a variable number of disks
33
  with varying properties per instance,
34
- flexible instance network interface handling: support a variable
35
  number of network interfaces with varying properties per instance
36
- multiple hypervisors: multiple hypervisors can be active on the same
37
  cluster, each supporting different parameters,
38
- support for device type CDROM (via ISO image)
39

  
40
Detailed Design
41
---------------
42

  
43
There are several areas of Ganeti where the commandline arguments will change:
44

  
45
- Cluster configuration
46

  
47
  - cluster initialization
48
  - cluster default configuration
49

  
50
- Instance configuration
51

  
52
  - handling of network cards for instances,
53
  - handling of disks for instances,
54
  - handling of CDROM devices and
55
  - handling of hypervisor specific options.
56

  
57
Notes about device removal/addition
58
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
59

  
60
To avoid problems with device location changes (e.g. second network
61
interface of the instance becoming the first or third and the like)
62
the list of network/disk devices is treated as a stack, i.e. devices
63
can only be added/removed at the end of the list of devices of each
64
class (disk or network) for each instance.
65

  
66
gnt-instance commands
67
~~~~~~~~~~~~~~~~~~~~~
68

  
69
The commands for gnt-instance will be modified and extended to allow
70
for the new functionality:
71

  
72
- the add command will be extended to support the new device and
73
  hypervisor options,
74
- the modify command continues to handle all modifications to
75
  instances, but will be extended with new arguments for handling
76
  devices.
77

  
78
Network Device Options
79
~~~~~~~~~~~~~~~~~~~~~~
80

  
81
The generic format of the network device option is:
82

  
83
  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]
84

  
85
:$DEVNUM: device number, unsigned integer, starting at 0,
86
:$OPTION: device option, string,
87
:$VALUE: device option value, string.
88

  
89
Currently, the following device options will be defined (open to
90
further changes):
91

  
92
:mac: MAC address of the network interface, accepts either a valid
93
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
94
  address will be generated randomly. If the mac device option is not
95
  specified, the default value 'auto' is assumed.
96
:bridge: network bridge the network interface is connected
97
  to. Accepts either a valid bridge name (the specified bridge must
98
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
99
  specified, the default bridge is used. If the bridge option is not
100
  specified, the default value 'auto' is assumed.
101

  
102
Disk Device Options
103
~~~~~~~~~~~~~~~~~~~
104

  
105
The generic format of the disk device option is:
106

  
107
  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]
108

  
109
:$DEVNUM: device number, unsigned integer, starting at 0,
110
:$OPTION: device option, string,
111
:$VALUE: device option value, string.
112

  
113
Currently, the following device options will be defined (open to
114
further changes):
115

  
116
:size: size of the disk device, either a positive number, specifying
117
  the disk size in mebibytes, or a number followed by a magnitude suffix
118
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
119
  which case the default disk size will be used. If the size option is
120
  not specified, 'auto' is assumed. This option is not valid for all
121
  disk layout types.
122
:access: access mode of the disk device, a single letter, valid values
123
  are:
124

  
125
  - w: read/write access to the disk device or
126
  - r: read-only access to the disk device.
127

  
128
  If the access mode is not specified, the default mode of read/write
129
  access will be configured.
130
:path: path to the image file for the disk device, string. No default
131
  exists. This option is not valid for all disk layout types.
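
Both the ``--net`` and the ``--disk`` option values follow the same
``$DEVNUM[:$OPTION=$VALUE,...]`` pattern; a sketch of how such a value
could be parsed (hypothetical helper, not the actual option-handling
code)::

  def ParseDeviceOption(value):
    """Parse 'DEVNUM[:opt=val[,opt=val...]]' into (index, options)."""
    if ":" in value:
      devnum, rest = value.split(":", 1)
      options = dict(item.split("=", 1) for item in rest.split(","))
    else:
      devnum, options = value, {}
    return devnum, options

  # ParseDeviceOption("0:size=10G,access=r")
  # -> ("0", {"size": "10G", "access": "r"})
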
132

  
133
Adding devices
134
~~~~~~~~~~~~~~
135

  
136
To add devices to an already existing instance, use the device type
137
specific option to gnt-instance modify. Currently, there are two
138
device type specific options supported:
139

  
140
:--net: for network interface cards
141
:--disk: for disk devices
142

  
143
The syntax of the device specific options is similar to the generic
144
device options, but instead of specifying a device number like for
145
gnt-instance add, you specify the magic string add. The new device
146
will always be appended at the end of the list of devices of this type
147
for the specified instance, e.g. if the instance has disk devices 0, 1
148
and 2, the newly added disk device will be disk device 3.
149

  
150
Example: gnt-instance modify --net add:mac=auto test-instance
151

  
152
Removing devices
153
~~~~~~~~~~~~~~~~
154

  
155
Removing devices from an instance is done via gnt-instance
156
modify. The same device specific options as for adding devices are
157
used. Instead of a device number and further device options, only the
158
magic string remove is specified. It will always remove the last
159
device in the list of devices of this type for the instance specified,
160
e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device
161
number 3 will be removed.
162

  
163
Example: gnt-instance modify --net remove test-instance
164

  
165
Modifying devices
166
~~~~~~~~~~~~~~~~~
167

  
168
Modifying devices is also done with device type specific options to
169
the gnt-instance modify command. There are currently two device type
170
options supported:
171

  
172
:--net: for network interface cards
173
:--disk: for disk devices
174

  
175
The syntax of the device specific options is similar to the generic
176
device options. The device number you specify identifies the device to
177
be modified.
178

  
179
Example: gnt-instance modify --disk 2:access=r
180

  
181
Hypervisor Options
182
~~~~~~~~~~~~~~~~~~
183

  
184
Ganeti 2.0 will support more than one hypervisor. Different
185
hypervisors have various options that only apply to a specific
186
hypervisor. Those hypervisor specific options are treated specially
187
via the --hypervisor option. The generic syntax of the hypervisor
188
option is as follows:
189

  
190
  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
191

  
192
:$HYPERVISOR: symbolic name of the hypervisor to use, string,
193
  has to match the supported hypervisors. Example: xen-pvm
194

  
195
:$OPTION: hypervisor option name, string
196
:$VALUE: hypervisor option value, string
197

  
198
The hypervisor option for an instance can be set on instance creation
199
time via the gnt-instance add command. If the hypervisor for an
200
instance is not specified upon instance creation, the default
201
hypervisor will be used.
202

  
203
Modifying hypervisor parameters
204
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
205

  
206
The hypervisor parameters of an existing instance can be modified
207
using the --hypervisor option of the gnt-instance modify command. However,
208
the hypervisor type of an existing instance cannot be changed; only
209
the particular hypervisor specific options can be changed. Therefore,
210
the format of the option parameters has been simplified to omit the
211
hypervisor name and only contain the comma separated list of
212
option-value pairs.
213

  
214
Example: gnt-instance modify --hypervisor
215
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance
216

  
217
gnt-cluster commands
218
~~~~~~~~~~~~~~~~~~~~
219

  
220
The commands for gnt-cluster will be extended to allow setting and
221
changing the default parameters of the cluster:
222

  
223
- The init command will be extended to support the --defaults option to
224
  set the cluster defaults upon cluster initialization.
225
- The modify command will be added to modify the cluster
226
  parameters. It will support the --defaults option to change the
227
  cluster defaults.
228

  
229
Cluster defaults
~~~~~~~~~~~~~~~~
230

  
231
The generic format of the cluster default setting option is:
232

  
233
  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]
234

  
235
:$OPTION: cluster default option, string,
236
:$VALUE: cluster default option value, string.
237

  
238
Currently, the following cluster default options are defined (open to
239
further changes):
240

  
241
:hypervisor: the default hypervisor to use for new instances,
242
  string. Must be a valid hypervisor known to and supported by the
243
  cluster.
244
:disksize: the disksize for newly created instance disks, where
245
  applicable. Must be either a positive number, in which case the unit
246
  of mebibytes is assumed, or a positive number followed by a supported
247
  magnitude symbol (M for mebibytes or G for gibibytes).
248
:bridge: the default network bridge to use for newly created instance
249
  network interfaces, string. Must be a valid bridge name of a bridge
250
  existing on the node(s).
251

  
252
Hypervisor cluster defaults
253
~~~~~~~~~~~~~~~~~~~~~~~~~~~
254

  
255
The generic format of the hypervisor clusterwide default setting option is:
256

  
257
  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
258

  
259
:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
260
  to set, string
261
:$OPTION: cluster default option, string,
262
:$VALUE: cluster default option value, string.
/dev/null
1
Ganeti 2.0 disk handling changes
2
================================
3

  
4
Objective
5
---------
6

  
7
Change the storage options available and the details of the
8
implementation such that we overcome some design limitations present
9
in Ganeti 1.x.
10

  
11
Background
12
----------
13

  
14
The storage options available in Ganeti 1.x were introduced based on
15
then-current software (DRBD 0.7 and later DRBD 8) and the estimated
16
usage patterns. However, experience has since shown that some
17
assumptions made initially are not true and that more flexibility is
18
needed.
19

  
20
One main assumption made was that disk failures should be treated as 'rare'
21
events, and that each of them needs to be manually handled in order to ensure
22
data safety; however, both these assumptions are false:
23

  
24
- disk failures can be a common occurence, based on usage patterns or cluster
25
  size
26
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
27
  automate more of the recovery
28

  
29
Note that we still don't have fully-automated disk recovery as a goal, but our
30
goal is to reduce the manual work needed.
31

  
32
Overview
33
--------
34

  
35
We plan the following main changes:
36

  
37
- DRBD8 is much more flexible and stable than its previous version (0.7),
38
  which makes it feasible to remove support for the ``remote_raid1`` template
39
  and focus only on DRBD8
40

  
41
- dynamic discovery of DRBD devices is not actually needed in a cluster
42
  where the DRBD namespace is controlled by Ganeti; switching to a static
43
  assignment (done at either instance creation time or change secondary time)
44
  will change the disk activation time from O(n) to O(1), which on big
45
  clusters is a significant gain
46

  
47
- remove the hard dependency on LVM (currently all available storage types are
48
  ultimately backed by LVM volumes) by introducing file-based storage
49

  
50
Additionally, a number of smaller enhancements are planned:
51
- support variable number of disks
52
- support read-only disks
53

  
54
Future enhancements in the 2.x series, which do not require base design
55
changes, might include:
56

  
57
- enhancement of the LVM allocation method in order to try to keep
58
  all of an instance's virtual disks on the same physical
59
  disks
60

  
61
- add support for DRBD8 authentication at handshake time in
62
  order to ensure each device connects to the correct peer
63

  
64
- remove the restrictions on failover only to the secondary
65
  which creates very strict rules on cluster allocation
66

  
67
Detailed Design
68
---------------
69

  
70
DRBD minor allocation
71
~~~~~~~~~~~~~~~~~~~~~
72

  
73
Currently, when trying to identify or activate a new DRBD (or MD)
74
device, the code scans all in-use devices in order to see if we find
75
one that looks similar to our parameters and is already in the desired
76
state or not. Since this needs external commands to be run, it is very
77
slow when more than a few devices are already present.
78

  
79
Therefore, we will change the discovery model from dynamic to
80
static. When a new device is logically created (added to the
81
configuration) a free minor number is computed from the list of
82
devices that should exist on that node and assigned to that
83
device.
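
Finding a free minor then becomes a purely local computation over the
devices recorded in the configuration; a sketch (hypothetical helper)::

  def FindFreeMinor(used_minors):
    """Return the first DRBD minor not in the set of configured minors."""
    minor = 0
    while minor in used_minors:
      minor += 1
    return minor

  # FindFreeMinor({0, 1, 3}) -> 2
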
84

  
85
At device activation, if the minor is already in use, we check if
86
it has our parameters; if not so, we just destroy the device (if
87
possible, otherwise we abort) and start it with our own
88
parameters.
89

  
90
This means that we in effect take ownership of the minor space for
91
that device type; if there's a user-created drbd minor, it will be
92
automatically removed.
93

  
94
The change will have the effect of reducing the number of external
95
commands run per device from a constant number times the index of the
96
first free DRBD minor to just a constant number.
97

  
98
Removal of obsolete device types (md, drbd7)
99
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
100

  
101
We need to remove these device types because of two issues. First,
102
drbd7 has bad failure modes in case of dual failures (both network and
103
disk): it cannot propagate the error up the device stack and instead
104
just panics. Second, due to the asymmetry between primary and
105
secondary in md+drbd mode, we cannot do live failover (not even if we
106
had md+drbd8).
107

  
108
File-based storage support
109
~~~~~~~~~~~~~~~~~~~~~~~~~~
110

  
111
This is covered by a separate design doc (*Vinales*) and
112
would allow us to get rid of the hard requirement for testing
113
clusters; it would also allow people who have SAN storage to do live
114
failover taking advantage of their storage solution.
115

  
116
Variable number of disks
117
~~~~~~~~~~~~~~~~~~~~~~~~
118

  
119
In order to support high-security scenarios (for example read-only sda
120
and read-write sdb), we need to make a fully flexible disk
121
definition. This has less impact than it might seem at first sight:
122
only the instance creation has a hardcoded number of disks, not the disk
123
handling code. The block device handling and most of the instance
124
handling code is already working with "the instance's disks" as
125
opposed to "the two disks of the instance", but some pieces are not
126
(e.g. import/export) and the code needs a review to ensure safety.
127

  
128
The objective is to be able to specify the number of disks at
129
instance creation, and to be able to toggle a disk from read-only to
130
read-write afterwards.
131

  
132
Better LVM allocation
133
~~~~~~~~~~~~~~~~~~~~~
134

  
135
Currently, the LV to PV allocation mechanism is a very simple one: at
136
each new request for a logical volume, tell LVM to allocate the volume
137
in order based on the amount of free space. This is good for
138
simplicity and for keeping the usage equally spread over the available
139
physical disks, however it introduces a problem that an instance could
140
end up with its (currently) two drives on two physical disks, or
141
(worse) that the data and metadata for a DRBD device end up on
142
different drives.
143

  
144
This is bad because it causes unneeded ``replace-disks`` operations in
145
case of a physical failure.
146

  
147
The solution is to batch allocations for an instance and make the LVM
148
handling code try to allocate as close as possible all the storage of
149
one instance. We will still allow the logical volumes to spill over to
150
additional disks as needed.
151

  
152
Note that this clustered allocation can only be attempted at initial
153
instance creation, or at change secondary node time. At add disk time,
154
or when replacing individual disks, it's not easy to compute the
155
current disk map so we'll not attempt the clustering.
156

  
157
DRBD8 peer authentication at handshake
158
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
159

  
160
DRBD8 has a new feature that allows authentication of the peer at
161
connect time. We can use this to prevent connecting to the wrong peer
162
more than to secure the connection. Even though we never had issues
163
with wrong connections, it would be good to implement this.
164

  
165

  
166
LVM self-repair (optional)
167
~~~~~~~~~~~~~~~~~~~~~~~~~~
168

  
169
The complete failure of a physical disk is very tedious to
170
troubleshoot, mainly because of the many failure modes and the many
171
steps needed. We can safely automate some of the steps, more
172
specifically the ``vgreduce --removemissing`` using the following
173
method (a small sketch follows the list):
174

  
175
#. check if all nodes have consistent volume groups
176
#. if yes, and previous status was yes, do nothing
177
#. if yes, and previous status was no, save status and restart
178
#. if no, and previous status was no, do nothing
179
#. if no, and previous status was yes:
180
    #. if more than one node is inconsistent, do nothing
181
    #. if only one node is inconsistent:
182
        #. run ``vgreduce --removemissing``
183
        #. log this occurrence in the ganeti log in a form that
184
           can be used for monitoring
185
        #. [FUTURE] run ``replace-disks`` for all
186
           instances affected
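
The decision table above boils down to a small check; a sketch
(hypothetical helper, showing only the repair decision)::

  def ShouldRunVgreduce(inconsistent_nodes, was_consistent):
    """Decide whether 'vgreduce --removemissing' may be run automatically.

    Only act when the state changed from consistent to inconsistent and
    exactly one node is affected; every other combination is a no-op.
    """
    if not inconsistent_nodes:        # everything is consistent now
      return False
    if not was_consistent:            # the problem was already known
      return False
    return len(inconsistent_nodes) == 1
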
187

  
188
Failover to any node
189
~~~~~~~~~~~~~~~~~~~~
190

  
191
With a modified disk activation sequence, we can implement the
192
*failover to any* functionality, removing many of the layout
193
restrictions of a cluster:
194

  
195
- the need to reserve memory on the current secondary: this gets reduced to
196
  the need to reserve memory somewhere on the cluster
197

  
198
- the need to first failover and then replace secondary for an
199
  instance: with failover-to-any, we can directly failover to
200
  another node, which also does the replace disks at the same
201
  step
202

  
203
In the following, we denote the current primary by P1, the current
204
secondary by S1, and the new primary and secondaries by P2 and S2. P2
205
is fixed to the node the user chooses, but the choice of S2 can be
206
made between P1 and S1. This choice can be constrained, depending on
207
which of P1 and S1 has failed.
208

  
209
- if P1 has failed, then S1 must become S2, and live migration is not possible
210
- if S1 has failed, then P1 must become S2, and live migration could be
211
  possible (in theory, but this is not a design goal for 2.0)
212

  
213
The algorithm for performing the failover is straightforward:
214

  
215
- verify that S2 (the node the user has chosen to keep as secondary) has
216
  valid data (is consistent)
217

  
218
- tear down the current DRBD association and setup a drbd pairing between
219
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
220
  start resyncing from S2
221

  
222
- as soon as P2 is in state SyncTarget (i.e. after the resync has started
223
  but before it has finished), we can promote it to primary role (r/w)
224
  and start the instance on P2
225

  
226
- as soon as the P2⇐S2 sync has finished, we can remove
227
  the old data on the old node that has not been chosen for
228
  S2
229

  
230
Caveats: during the P2⇐S2 sync, a (non-transient) network error
231
will cause I/O errors on the instance, so (if a longer instance
232
downtime is acceptable) we can postpone the restart of the instance
233
until the resync is done. However, disk I/O errors on S2 will cause
234
data loss, since we don't have a good copy of the data anymore, so in
235
this case waiting for the sync to complete is not an option. As such,
236
it is recommended that this feature is used only in conjunction with
237
proper disk monitoring.
238

  
239

  
240
Live migration note: While failover-to-any is possible for all choices
241
of S2, migration-to-any is possible only if we keep P1 as S2.
242

  
243
Caveats
244
-------
245

  
246
The dynamic device model, while more complex, has an advantage: it
247
will not reuse by mistake another instance's DRBD device, since it
248
always looks for either our own or a free one.
249

  
250
The static one, in contrast, will assume that given a minor number N,
251
it's ours and we can take over. This needs careful implementation such
252
that if the minor is in use, either we are able to cleanly shut it
253
down, or we abort the startup. Otherwise, it could be that we start
254
syncing between two instances' disks, causing data loss.
255

  
256
Security Considerations
257
-----------------------
258

  
259
The changes will not affect the security model of Ganeti.
/dev/null
1
Ganeti 2.0 design documents
2
===========================
3

  
4

  
5
The 2.x versions of Ganeti will constitute a rewrite of the 'core'
6
architecture, plus some additional features (however 2.0 is geared
7
toward the core changes).
8

  
9
Core changes
10
------------
11

  
12
The main changes will be switching from a per-process model to a
13
daemon based model, where the individual gnt-* commands will be
14
clients that talk to this daemon (see the design-2.0-master-daemon
15
document). This will allow us to get rid of the global cluster lock
16
for most operations, having instead a per-object lock (see
17
design-2.0-granular-locking). Also, the daemon will be able to queue
18
jobs, and this will allow the individual clients to submit jobs without
19
waiting for them to finish, and also see the result of old requests
20
(see design-2.0-job-queue).
21

  
22
Besides these major changes, another 'core' change, though not as
23
visible to the users, will be changing the model of object attribute
24
storage and separating it into namespaces (such that a Xen PVM
25
instance will not have the Xen HVM parameters). This will allow future
26
flexibility in defining additional parameters. More details in the
27
design-2.0-cluster-parameters document.
28

  
29
The various changes brought in by the master daemon model and the
30
read-write RAPI will require changes to the cluster security; we move
31
away from Twisted and use http(s) for intra- and extra-cluster
32
communications. For more details, see the security document in the
33
doc/ directory.
34

  
35

  
36
Functionality changes
37
---------------------
38

  
39
The disk storage will receive some changes, and will also remove
40
support for the drbd7 and md disk types. See the
41
design-2.0-disk-changes document.
42

  
43
The configuration storage will be changed, with the effect that more
44
data will be available on the nodes for access from outside ganeti
45
(e.g. from shell scripts) and that nodes will get slightly more
46
awareness of the cluster configuration.
47

  
48
The RAPI will enable modify operations (beside the read-only queries
49
that are available today), so in effect almost all the operations
50
available today via the ``gnt-*`` commands will be available via the
51
remote API.
52

  
53
A change in the hypervisor support area will be that we will support
54
multiple hypervisors in parallel in the same cluster, so one could run
55
Xen HVM side-by-side with Xen PVM on the same cluster.
56

  
57
New features
58
------------
59

  
60
There will be a number of minor feature enhancements targeted to
61
either 2.0 or subsequent 2.x releases:
62

  
63
- multiple disks, with custom properties (read-only/read-write, exportable,
64
  etc.)
65
- multiple NICs
66

  
67
These changes will require OS API changes, details are in the
68
design-2.0-os-interface document. And they will also require many
69
command line changes, see the design-2.0-commandline-parameters
70
document.
/dev/null
1
Job Queue
2
=========
3

  
4
.. contents::
5

  
6
Overview
7
--------
8

  
9
In Ganeti 1.2, operations in a cluster have to be done in a serialized way.
10
Virtually any operation locks the whole cluster by grabbing the global lock.
11
Other commands can't return before all work has been done.
12

  
13
By implementing a job queue and granular locking, we can lower the latency of
14
command execution inside a Ganeti cluster.
15

  
16

  
17
Detailed Design
18
---------------
19

  
20
Job execution—“Life of a Ganeti job”
21
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
22

  
23
#. Job gets submitted by the client. A new job identifier is generated and
24
   assigned to the job. The job is then automatically replicated [#replic]_
25
   to all nodes in the cluster. The identifier is returned to the client.
26
#. A pool of worker threads waits for new jobs. If all are busy, the job has
27
   to wait and the first worker finishing its work will grab it. Otherwise any
28
   of the waiting threads will pick up the new job.
29
#. Client waits for job status updates by calling a waiting RPC function.
30
   Log message may be shown to the user. Until the job is started, it can also
31
   be cancelled.
32
#. As soon as the job is finished, its final result and status can be retrieved
33
   from the server.
34
#. If the client archives the job, it gets moved to a history directory.
35
   There will be a method to archive all jobs older than a given age.
36

  
37
.. [#replic] We need replication in order to maintain the consistency across
38
   all nodes in the system; the master node only differs in the fact that
39
   now it is running the master daemon, but if it fails and we do a master
40
   failover, the jobs are still visible on the new master (even though they
41
   will be marked as failed).
42

  
43
Failures to replicate a job to other nodes will be only flagged as
44
errors in the master daemon log if more than half of the nodes failed,
45
otherwise we ignore the failure, and rely on the fact that the next
46
update (for still running jobs) will retry the update. For finished
47
jobs, it is less of a problem.
48

  
49
Future improvements will look into checking the consistency of the job
50
list and jobs themselves at master daemon startup.
51

  
52

  
53
Job storage
54
~~~~~~~~~~~
55

  
56
Jobs are stored in the filesystem as individual files, serialized
57
using JSON (standard serialization mechanism in Ganeti).
58

  
59
The choice of storing each job in its own file was made because:
60

  
61
- a file can be atomically replaced
62
- a file can easily be replicated to other nodes
63
- checking consistency across nodes can be implemented very easily, since
64
  all job files should be (at a given moment in time) identical
65

  
66
The other possible choices that were discussed and discounted were:
67

  
68
- single big file with all job data: not feasible due to difficult updates
69
- in-process databases: hard to replicate the entire database to the
70
  other nodes, and replicating individual operations does not mean we keep
71
  consistency
72

  
73

  
74
Queue structure
75
~~~~~~~~~~~~~~~
76

  
77
All file operations have to be done atomically by writing to a temporary file
78
and subsequent renaming. Except for log messages, every change in a job is
79
stored and replicated to other nodes.
80

  
81
::
82

  
83
  /var/lib/ganeti/queue/
84
    job-1 (JSON encoded job description and status)
85
    […]
86
    job-37
87
    job-38
88
    job-39
89
    lock (Queue managing process opens this file in exclusive mode)
90
    serial (Last job ID used)
91
    version (Queue format version)
92

  
93

  
94
Locking
95
~~~~~~~
96

  
97
Locking in the job queue is a complicated topic. It is called from more than
98
one thread and must be thread-safe. For simplicity, a single lock is used for
99
the whole job queue.
100

  
101
A more detailed description can be found in doc/locking.txt.
102

  
103

  
104
Internal RPC
105
~~~~~~~~~~~~
106

  
107
RPC calls available between Ganeti master and node daemons:
108

  
109
jobqueue_update(file_name, content)
110
  Writes a file in the job queue directory.
111
jobqueue_purge()
112
  Cleans the job queue directory completely, including archived jobs.
113
jobqueue_rename(old, new)
114
  Renames a file in the job queue directory.
115

  
116

  
117
Client RPC
118
~~~~~~~~~~
119

  
120
RPC between Ganeti clients and the Ganeti master daemon supports the following
121
operations:
122

  
123
SubmitJob(ops)
124
  Submits a list of opcodes and returns the job identifier. The identifier is
125
  guaranteed to be unique during the lifetime of a cluster.
126
WaitForJobChange(job_id, fields, […], timeout)
127
  This function waits until a job changes or a timeout expires. The condition
128
  for when a job changed is defined by the fields passed and the last log
129
  message received.
130
QueryJobs(job_ids, fields)
131
  Returns field values for the job identifiers passed.
132
CancelJob(job_id)
133
  Cancels the job specified by identifier. This operation may fail if the job
134
  is already running, canceled or finished.
135
ArchiveJob(job_id)
136
  Moves a job into the …/archive/ directory. This operation will fail if the
137
  job has not been canceled or finished.
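
Put together, this API could be used roughly as follows (pseudo-usage;
the client class, socket path and field names are assumptions, not the
final interface)::

  # all names below are placeholders for the future client library
  client = LuxiClient("/var/run/ganeti/master.sock")

  job_id = client.SubmitJob([op_start_instance])
  client.WaitForJobChange(job_id, ["status"], [], timeout=None)
  ((status, result),) = client.QueryJobs([job_id], ["status", "opresult"])
  if status == "success":
    client.ArchiveJob(job_id)
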
138

  
139

  
140
Job and opcode status
141
~~~~~~~~~~~~~~~~~~~~~
142

  
143
Each job and each opcode has, at any time, one of the following states:
144

  
145
Queued
146
  The job/opcode was submitted, but did not yet start.
147
Waiting
148
  The job/opcode is waiting for a lock to proceed.
149
Running
150
  The job/opcode is running.
151
Canceled
152
  The job/opcode was canceled before it started.
153
Success
154
  The job/opcode ran and finished successfully.
155
Error
156
  The job/opcode was aborted with an error.
157

  
158
If the master is aborted while a job is running, the job will be set to the
159
Error status once the master has started again.
160

  
161

  
162
History
163
~~~~~~~
164

  
165
Archived jobs are kept in a separate directory,
166
/var/lib/ganeti/queue/archive/.  This is done in order to speed up the
167
queue handling: by default, the jobs in the archive are not touched by
168
any functions. Only the current (unarchived) jobs are parsed, loaded,
169
and verified (if implemented) by the master daemon.
170

  
171

  
172
Ganeti updates
173
~~~~~~~~~~~~~~
174

  
175
The queue has to be completely empty for Ganeti updates with changes
176
in the job queue structure. In order to allow this, there will be a
177
way to prevent new jobs entering the queue.
/dev/null
1
Ganeti 2.0 Granular Locking
2
===========================
3

  
4
.. contents::
5

  
6
Objective
7
---------
8

  
9
We want to make sure that multiple operations can run in parallel on a Ganeti
10
Cluster. In order for this to happen we need to make sure concurrently run
11
operations don't step on each other's toes and break the cluster.
12

  
13
This design addresses how we are going to deal with locking so that:
14

  
15
- high urgency operations are not stopped by long running ones
16
- long running operations can run in parallel
17
- we preserve safety (data coherency) and liveness (no deadlock, no work
18
  postponed indefinitely) on the cluster
19

  
20
Reaching the maximum possible parallelism is a Non-Goal. We have identified a
21
set of operations that are currently bottlenecks and need to be parallelised
22
and have worked on those. In the future it will be possible to address other
23
needs, thus making the cluster more and more parallel one step at a time.
24

  
25
This document only talks about parallelising Ganeti level operations, aka
26
Logical Units, and the locking needed for that. Any other synchronisation lock
27
needed internally by the code is outside its scope.
28

  
29
Background
30
----------
31

  
32
Ganeti 1.2 has a single global lock, which is used for all cluster operations.
33
This has been painful at various times, for example:
34

  
35
- It is impossible for two people to efficiently interact with a cluster
36
  (for example for debugging) at the same time.
37
- When batch jobs are running it's impossible to do other work (for example
38
  failovers/fixes) on a cluster.
39

  
40
This also poses scalability problems: as clusters grow in node and instance
41
size it's a lot more likely that operations which one could conceive should run
42
in parallel (for example because they happen on different nodes) are actually
43
stalling each other while waiting for the global lock, without a real reason
44
for that to happen.
45

  
46
Overview
47
--------
48

  
49
This design doc is best read in the context of the accompanying design
50
docs for Ganeti 2.0: Master daemon design and Job queue design.
51

  
52
We intend to implement a Ganeti locking library, which can be used by the
53
various ganeti code components in order to easily, efficiently and correctly
54
grab the locks they need to perform their function.
55

  
56
The proposed library has these features:
57

  
58
- Internally managing all the locks, making the implementation transparent
59
  from their usage
60
- Automatically grabbing multiple locks in the right order (avoid deadlock)
61
- Ability to transparently handle conversion to more granularity
62
- Support asynchronous operation (future goal)
63

  
64
Locking will be valid only on the master node and will not be a distributed
65
operation. In case of master failure, though, if some locks were held it means
66
some opcodes were in progress, so when recovery of the job queue is done it
67
will be possible to determine by the interrupted opcodes which operations could
68
have been left half way through and thus which locks could have been held. It
69
is then the responsibility either of the master failover code, of the cluster
70
verification code, or of the admin to do what's necessary to make sure that any
71
leftover state is dealt with. This is not an issue from a locking point of view
72
because the fact that the previous master has failed means that it cannot do
73
any job.
74

  
75
A corollary of this is that a master-failover operation with both masters alive
76
needs to happen while no other locks are held.
77

  
78
Detailed Design
79
---------------
80

  
81
The Locks
82
~~~~~~~~~
83
At the first stage we have decided to provide the following locks:
84

  
85
- One "config file" lock
86
- One lock per node in the cluster
87
- One lock per instance in the cluster
88

  
89
All the instance locks will need to be taken before the node locks, and the
90
node locks before the config lock. Locks will need to be acquired at the same
91
time for multiple instances and nodes, and internal ordering will be dealt
92
within the locking library, which, for simplicity, will just use alphabetical
93
order.
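
A sketch of this ordering rule (the lock manager object and its
``acquire`` method are placeholders for the future locking library)::

  LEVEL_ORDER = ["instance", "node", "config"]

  def AcquireLocks(manager, wanted):
    """Acquire a set of locks respecting level and alphabetical order.

    'wanted' maps a level name to the resource names needed at that
    level; instances are taken before nodes, nodes before the config.
    """
    for level in LEVEL_ORDER:
      for name in sorted(wanted.get(level, [])):
        manager.acquire(level, name)
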
94

  
95
Handling conversion to more granularity
96
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
97

  
98
In order to convert to a more granular approach transparently each time we
99
split a lock into more we'll create a "metalock", which will depend on those
100
sublocks and live for the time necessary for all the code to convert (or
101
forever, in some conditions). When a metalock exists all converted code must
102
acquire it in shared mode, so it can run concurrently, but still be exclusive
103
with old code, which acquires it exclusively.
104

  
105
In the beginning the only such lock will be what replaces the current "command"
106
lock, and will acquire all the locks in the system, before proceeding. This
107
lock will be called the "Big Ganeti Lock" because holding that one will avoid
108
any other concurrent ganeti operations.
109

  
110
We might also want to devise more metalocks (eg. all nodes, all nodes+config)
111
in order to make it easier for some parts of the code to acquire what it needs
112
without specifying it explicitly.
113

  
114
In the future things like the node locks could become metalocks, should we
115
decide to split them into an even more fine grained approach, but this will
116
probably be only after the first 2.0 version has been released.
117

  
118
Library API
119
~~~~~~~~~~~
120

  
121
All the locking will be its own class, and the locks will be created at
122
initialisation time, from the config file.
123

  
124
The API will have a way to grab one or more than one locks at the same time.
125
Any attempt to grab a lock while already holding one in the wrong order will be
126
checked for, and fail.
127

  
128
Adding/Removing locks
129
~~~~~~~~~~~~~~~~~~~~~
130

  
131
When a new instance or a new node is created an associated lock must be added
132
to the list. The relevant code will need to inform the locking library of such
133
a change.
134

  
135
This needs to be compatible with every other lock in the system, especially
136
metalocks that guarantee to grab sets of resources without specifying them
137
explicitly. The implementation of this will be handled in the locking library
138
itself.
139

  
140
Of course when instances or nodes disappear from the cluster the relevant locks
141
must be removed. This is easier than adding new elements, as the code which
142
removes them must own them exclusively or can queue for their ownership, and
143
thus deals with metalocks exactly as normal code acquiring those locks. Any
144
operation queueing on a removed lock will fail after its removal.
145

  
146
Asynchronous operations
147
~~~~~~~~~~~~~~~~~~~~~~~
148

  
149
For the first version the locking library will only export synchronous
150
operations, which will block till the needed locks are held, and only fail if
151
the request is impossible or somehow erroneous.
152

  
153
In the future we may want to implement different types of asynchronous
154
operations such as:
155

  
156
- Try to acquire this lock set and fail if not possible
157
- Try to acquire one of these lock sets and return the first one you were
158
  able to get (or after a timeout) (select/poll like)
159

  
160
These operations can be used to prioritize operations based on available locks,
161
rather than making them just blindly queue for acquiring them. The inherent
162
risk, though, is that any code using the first operation, or setting a timeout
163
for the second one, is susceptible to starvation and thus may never be able to
164
get the required locks and complete certain tasks. Considering this,
165
providing/using these operations should not be among our first priorities.
166

  
167
Locking granularity
168
~~~~~~~~~~~~~~~~~~~
169

  
170
For the first version of this code we'll convert each Logical Unit to
171
acquire/release the locks it needs, so locking will be at the Logical Unit
172
level.  In the future we may want to split logical units in independent
173
"tasklets" with their own locking requirements. A different design doc (or mini
174
design doc) will cover the move from Logical Units to tasklets.
175

  
176
Lock acquisition code path
177
~~~~~~~~~~~~~~~~~~~~~~~~~~
178

  
179
In general when acquiring locks we should use a code path equivalent to::
180

  
181
  lock.acquire()
182
  try:
183
    ...
184
    # other code
185
  finally:
186
    lock.release()
187

  
188
This makes sure we release all locks, and avoid possible deadlocks. Of course
189
extra care must be used not to leave, if possible, locked structures in an
190
unusable state.
191

  
192
In order to avoid this extra indentation and code changes everywhere in the
193
Logical Units code, we decided to allow LUs to declare locks, and then execute
194
their code with their locks acquired. In the new world LUs are called like
195
this::
196

  
197
  # user passed names are expanded to the internal lock/resource name,
198
  # then known needed locks are declared
199
  lu.ExpandNames()
200
  ... some locking/adding of locks may happen ...
201
  # late declaration of locks for one level: this is useful because sometimes
202
  # we can't know which resource we need before locking the previous level
203
  lu.DeclareLocks() # for each level (cluster, instance, node)
204
  ... more locking/adding of locks can happen ...
205
  # these functions are called with the proper locks held
206
  lu.CheckPrereq()
207
  lu.Exec()
208
  ... locks declared for removal are removed, all acquired locks released ...
209

  
210
The Processor and the LogicalUnit class will contain exact documentation on how
211
locks are supposed to be declared.
212

  
213
Caveats
214
-------
215

  
216
This library will provide an easy upgrade path to bring all the code to
217
granular locking without breaking everything, and it will also guarantee
218
against a lot of common errors. Code switching from the old "lock everything"
219
lock to the new system, though, needs to be carefully scrutinised to be sure it
220
is really acquiring all the necessary locks, and none has been overlooked or
221
forgotten.
222

  
223
The code can contain other locks outside of this library, to synchronise other
224
threaded code (eg for the job queue) but in general these should be leaf locks
225
or carefully structured non-leaf ones, to avoid deadlock race conditions.
226

  
/dev/null
1
Ganeti 2.0 Master daemon
2
========================
3

  
4
.. contents::
5

  
6
Objective
7
---------
8

  
9
Many of the important features of Ganeti 2.0 — job queue, granular
10
locking, external API, etc. — will be integrated via a master
11
daemon. While not absolutely necessary, it is the best way to
12
integrate all these components.
13

  
14
Background
15
----------
16

  
17
Currently there is no "master" daemon in Ganeti (1.2). Each command
18
tries to acquire the so called *cmd* lock and when it succeeds, it
19
takes complete ownership of the cluster configuration and state. The
20
scheduled improvements to Ganeti require or can use a daemon that
21
coordinates the activities/jobs scheduled/etc.
22

  
23
Overview
24
--------
25

  
26
The master daemon will be the central point of the cluster; command
27
line tools and the external API will interact with the cluster via
28
this daemon; it will be the one coordinating the node daemons.
29

  
30
This design doc is best read in the context of the accompanying design
31
docs for Ganeti 2.0: Granular locking design and Job queue design.
32

  
33

  
34
Detailed Design
35
---------------
36

  
37
In Ganeti 2.0, we will have the following *entities*:
38

  
39
- the master daemon (on master node)
40
- the node daemon (all nodes)
41
- the command line tools (master node)
42
- the RAPI daemon (master node)
43

  
44
Interaction paths are between:
45

  
46
- (CLI tools/RAPI daemon) and the master daemon, via the so called *luxi* API
47
- the master daemon and the node daemons, via the node RPC
48

  
49
The protocol between the master daemon and the node daemons will be
50
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
51
messages. This is done due to difficulties in working with the Twisted
52
framework and its protocols in a multithreaded environment, which we can
53
overcome by using a simpler stack (see the caveats section). The protocol
54
between the CLI/RAPI and the master daemon will be a custom one: on a UNIX
55
socket on the master node, with rights restricted by filesystem
56
permissions, the CLI/RAPI will talk to the master daemon using JSON-encoded
57
messages.
58

  
59
The operations supported over this internal protocol will be encoded
60
via a python library that will expose a simple API for its
61
users. Internally, the protocol will simply encode all objects in JSON
62
format and decode them on the receiver side.
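
A sketch of such a JSON exchange over the UNIX socket, using only the
standard library (the request fields shown are assumptions, not the
final protocol)::

  import json
  import socket

  def CallMaster(sock_path, method, args):
    """Send one JSON request to the master daemon and read the reply."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(sock_path)
    try:
      request = json.dumps({"method": method, "args": args}) + "\n"
      sock.sendall(request.encode("utf-8"))
      reply = sock.makefile().readline()
    finally:
      sock.close()
    return json.loads(reply)
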
63

  
64
The LUXI protocol
65
~~~~~~~~~~~~~~~~~
66

  
67
We will have two main classes of operations over the master daemon API:
68

  
69
- cluster query functions
70
- job related functions
71

  
72
The cluster query functions are usually short-duration, and are the
73
equivalent of the OP_QUERY_* opcodes in ganeti 1.2 (and they are
74
still implemented internally with these opcodes). The clients are
75
guaranteed to receive the response in a reasonable time via a timeout.
76

  
77
The job-related functions will be:
78

  
79
- submit job
80
- query job (which could also be categorized in the query-functions)
81
- archive job (see the job queue design doc)
82
- wait for job change, which allows a client to wait without polling
83

  
84
For more details, see the job queue design document.
85

  
86
Daemon implementation
87
~~~~~~~~~~~~~~~~~~~~~
88

  
89
The daemon will be based around a main I/O thread that will wait for
90
new requests from the clients, and that does the setup/shutdown of the
91
other thread (pools).
92

  
93
There will be two other classes of threads in the daemon:
94

  
95
- job processing threads, part of a thread pool, and which are
96
  long-lived, started at daemon startup and terminated only at shutdown
97
  time
98
- client I/O threads, which are the ones that talk the local protocol
99
  to the clients
100

  
101
Master startup/failover
102
~~~~~~~~~~~~~~~~~~~~~~~
103

  
104
In Ganeti 1.x there is no protection against failing over the master
105
to a node with stale configuration. In effect, the responsibility of
106
correct failovers falls on the admin. This is true both for the new
107
master and for when an old, offline master starts up.
108

  
109
Since in 2.x we are extending the cluster state to cover the job queue
110
and have a daemon that will execute by itself the job queue, we want
111
to have more resilience for the master role.
112

  
113
The following algorithm will happen whenever a node is ready to
114
transition to the master role, either at startup time or at node
115
failover (a sketch of the quorum check follows the list):
116

  
117
#. read the configuration file and parse the node list
118
   contained within
119

  
120
#. query all the nodes and make sure we obtain an agreement via
121
   a quorum of at least half plus one nodes for the following:
122

  
123
    - we have the latest configuration and job list (as
124
      determined by the serial number on the configuration and
125
      highest job ID on the job queue)
126

  
127
    - there is not even a single node having a newer
128
      configuration file
129

  
130
    - if we are not failing over (but just starting), the
131
      quorum agrees that we are the designated master
132

  
133
#. at this point, the node transitions to the master role
134

  
135
#. for all the in-progress jobs, mark them as failed, with
136
   reason unknown or something similar (master failed, etc.)
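
The quorum check above is simple arithmetic; a sketch (hypothetical
helper, leaving out the actual RPC that gathers the answers)::

  def HasQuorum(my_serial, reported_serials):
    """Check the 'half plus one' agreement on the configuration serial.

    'reported_serials' maps each node to the configuration serial number
    it reports, or None if it did not answer.
    """
    votes = sum(1 for serial in reported_serials.values()
                if serial is not None and serial <= my_serial)
    return votes >= len(reported_serials) // 2 + 1
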
137

  
138

  
139
Logging
140
~~~~~~~
141

  
142
The logging system will be switched completely to the logging module;
143
currently it's logging-based, but exposes a different API, which is
144
just overhead. As such, the code will be switched over to standard
145
logging calls, and only the setup will be custom.
146

  
147
With this change, we will remove the separate debug/info/error logs,
148
and instead have always one logfile per daemon model:
149

  
150
- master-daemon.log for the master daemon
151
- node-daemon.log for the node daemon (this is the same as in 1.2)
152
- rapi-daemon.log for the RAPI daemon logs
153
- rapi-access.log, an additional log file for the RAPI that will be
154
  in the standard http log format for possible parsing by other tools
155

  
156
Since the watcher will only submit jobs to the master for startup of
157
the instances, its log file will contain less information than before,
158
mainly that it will start the instance, but not the results.
159

  
160
Caveats
161
-------
162

  
163
A discussed alternative is to keep the current individual processes
164
touching the cluster configuration model. The reasons we have not
165
chosen this approach are:
166

  
167
- the speed of reading and unserializing the cluster state
168
  today is not small enough that we can ignore it; the addition of
169
  the job queue will make the startup cost even higher. While this
170
  runtime cost is low, it can be on the order of a few seconds on
171
  bigger clusters, which for very quick commands is comparable to
172
  the actual duration of the computation itself
173

  
174
- individual commands would make it harder to implement a
175
  fire-and-forget job request, along the lines "start this
176
  instance but do not wait for it to finish"; it would require a
177
  model of backgrounding the operation and other things that are
178
  much better served by a daemon-based model
179

  
180
Another area of discussion is moving away from Twisted in this new
181
implementation. While Twisted has its advantages, there are also many
182
disadvantages to using it:
183

  
184
- first and foremost, it's not a library, but a framework; thus, if
185
  you use twisted, all the code needs to be 'twisted-ized'; we were able
186
  to keep the 1.x code clean by hacking around twisted in an
187
  unsupported, unrecommended way, and the only alternative would have
188
  been to make all the code be written for twisted
189
- it has some weaknesses in working with multiple threads, since its base
190
  model is designed to replace thread usage by using deferred calls, so while
191
  it can use threads, it's not very flexible in doing so
... This diff was truncated because it exceeds the maximum size that can be displayed.
