=================
Ganeti 2.1 design
=================

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.

Background
==========

Overview
========

Detailed design
===============

As for 2.0, we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
------------

Storage units modelling
~~~~~~~~~~~~~~~~~~~~~~~

Currently, Ganeti has a good model of the block devices for instances
(e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the
storage pools that provide the space for these front-end
devices. For example, there are hardcoded inter-node RPC calls for
volume group listing, file storage creation/deletion, etc.

The storage units framework will implement a generic handling for all
kinds of storage backends:

- LVM physical volumes
- LVM volume groups
- File-based storage directories
- any other future storage method

There will be a generic list of methods that each storage unit type
will provide, like:

- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage unit
type, for example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
- VG consistency fixing

This will allow a much better modeling and unification of the various
RPC calls related to backend storage pools in the future. Ganeti 2.1 is
intended to add the basics of the framework, and not necessarily move
all the current VG/file-based operations to it.

Note that while we model both LVM PVs and LVM VGs, the framework will
**not** model any relationship between the different types. In other
words, we model neither inheritance nor stacking, since this is
too complex for our needs. While a ``vgreduce`` operation on an LVM VG
could actually remove a PV from it, this will not be handled at the
framework level, but at the individual operation level. The goal is that
this is a lightweight framework for abstracting the different storage
operations, and not for modelling the storage hierarchy.
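
As an illustration, such a framework could expose each storage unit type
through a small class interface along the following lines. This is only
a sketch: the class and method names are hypothetical and not part of
the actual 2.1 code.

::

  class StorageUnit(object):
    """Base interface for one kind of storage backend."""

    def List(self):
      """Return all storage units of this type present on the node."""
      raise NotImplementedError

    def GetStatus(self, name):
      """Return status information (size, free space, degradation)."""
      raise NotImplementedError


  class LvmPvUnit(StorageUnit):
    """LVM physical volumes, with their type-specific extra methods."""

    def List(self):
      # e.g. wrap "pvs --noheadings -o pv_name,vg_name,pv_size,pv_free"
      raise NotImplementedError

    def GetStatus(self, name):
      raise NotImplementedError

    def SetAllocatable(self, name, allocatable):
      # Type-specific method: enable/disable allocations on a given PV,
      # e.g. by wrapping "pvchange -x y|n <pv-name>".
      raise NotImplementedError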


Locking improvements
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one or
many ``SharedLock`` instances. It provides an interface to add/remove
locks and to acquire and subsequently release any number of those locks
contained in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
the way we're using locks for nodes and instances (the single cluster
lock isn't affected by this issue) this can lead to long delays when
acquiring locks if another operation tries to acquire multiple locks but
has to wait for yet another operation.

In the following demonstration we assume we have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs the lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic order,
   but it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the lock
   again.
#. Operation A finishes and releases its lock on ``inst4``. Operation B
   can continue and eventually releases all locks.
#. Operation C can get the ``inst1`` lock and finishes.

Technically there's no need for Operation C to wait for Operation A, and
subsequently Operation B, to finish. Operation B can't continue until
Operation A is done (it has to wait for ``inst4``), anyway.

Proposed changes
++++++++++++++++

Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode.
The acquire calls won't return until the lock has been successfully
acquired (or an error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a
non-blocking way. They must support a timeout and abort trying to
acquire the lock(s) after the specified amount of time.

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as
described in the demonstration above, ``LockSet`` must not keep locks
for a prolonged period of time when trying to acquire two or more locks.
Instead it should, with an increasing timeout for acquiring all locks,
release all locks again and sleep some time if it fails to acquire all
requested locks.

A good timeout value needs to be determined. In any case, ``LockSet``
should proceed to acquire locks in blocking mode after a few
(unsuccessful) attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where
``tries`` is the number of unsuccessful tries.

In the demonstration above this would allow Operation C to continue
after Operation B unsuccessfully tried to acquire all locks and released
all acquired locks (``inst1``, ``inst2`` and ``inst3``) again.
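
A minimal sketch of this retry scheme follows. The exact ``LockSet``
interface is still to be defined; the ``acquire(names, timeout=...)``
signature and the helper below are assumptions for illustration only.

::

  import time

  # Assumption: after this many timed-out attempts, fall back to a
  # plain blocking acquire.
  MAX_NONBLOCKING_TRIES = 5

  def AcquireWithBackoff(lockset, names):
    """Acquire all given locks, backing off after failed attempts."""
    for tries in range(MAX_NONBLOCKING_TRIES):
      # Hypothetical semantics: returns True only if *all* locks were
      # acquired within the timeout, releasing any partial acquisitions
      # otherwise.
      if lockset.acquire(names, timeout=2 ** tries):
        return
      # Sleep a bit so the operations holding the locks can progress.
      time.sleep(1)
    # Give up on timeouts and wait in blocking mode.
    lockset.acquire(names)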

Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion on going one step further and extending
the job queue (see ``lib/jqueue.py``) to select the next task for a
worker depending on whether it can acquire the necessary locks. While
this may reduce the number of necessary worker threads and/or increase
throughput on large clusters with many jobs, it also brings many
potential problems, such as contention and increased memory usage, with
it. As this would be an extension of the changes proposed above, it
could be implemented at a later point in time, but we decided to stay
with the simpler solution for now.

Implementation details
++++++++++++++++++++++

``SharedLock`` redesign
^^^^^^^^^^^^^^^^^^^^^^^

The current design of ``SharedLock`` is not good for supporting timeouts
when acquiring a lock and there are also minor fairness issues in it. We
plan to address both with a redesign. A proof of concept implementation
was written and resulted in significantly simpler code.

Currently ``SharedLock`` uses two separate queues for shared and
exclusive acquires and waiters get to run in turns. This means if an
exclusive acquire is released, the lock will allow shared waiters to run
and vice versa. Although it's still fair in the end, there is a slight
bias towards shared waiters in the current implementation. The same
implementation with two separate queues cannot support timeouts without
adding a lot of complexity.

Our proposed redesign changes ``SharedLock`` to have only a single
queue. There will be one condition (see Condition_ for a note about
performance) in the queue per exclusive acquire and two for all shared
acquires (see below for an explanation). The maximum queue length will
always be ``2 + (number of exclusive acquires waiting)``. The number of
queue entries for shared acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be
used in turn. When the lock is instantiated, no conditions are in the
queue. As soon as the first shared acquire arrives (and there are
holder(s) or waiting acquires; see Acquire_), the active condition is
added to the queue. Until it becomes the topmost condition in the queue
and has been notified, any shared acquire is added to this active
condition. When the active condition is notified, the conditions are
swapped and further shared acquires are added to the previously inactive
condition (which has now become the active condition). After all waiters
on the previously active (now inactive) and now notified condition
received the notification, it is removed from the queue of pending
acquires.

This means shared acquires will skip any exclusive acquire in the queue.
We believe it's better to improve parallelization on operations only
asking for shared (or read-only) locks. Exclusive operations holding the
same lock cannot be parallelized.


Acquire
*******

For exclusive acquires a new condition is created and appended to the
queue. Shared acquires are added to the active condition for shared
acquires and if the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the queue
(to guarantee fairness). If the timeout expires, we return to the caller
without acquiring the lock. On every notification we check whether the
lock has been deleted, in which case an error is returned to the caller.

The lock can be acquired if we're on top of the queue (there is no one
else ahead of us). For an exclusive acquire, there must not be other
exclusive or shared holders. For a shared acquire, there must not be an
exclusive holder. If these conditions are all true, the lock is
acquired and we return to the caller. In any other case we wait again on
the condition.

If it was the last waiter on a condition, the condition is removed from
the queue.

Optimization: There's no need to touch the queue if there are no pending
acquires and no current holders. The caller can have the lock
immediately.

.. image:: design-2.1-lock-acquire.png


Release
*******

First the lock removes the caller from the internal owner list. If there
are pending acquires in the queue, the first (the oldest) condition is
notified.

If the first condition was the active condition for shared acquires, the
inactive condition will be made active. This ensures fairness with
exclusive locks by forcing consecutive shared acquires to wait in the
queue.

.. image:: design-2.1-lock-release.png


Delete
******

The caller must either hold the lock in exclusive mode already or the
lock must be acquired in exclusive mode. Trying to delete a lock while
it's held in shared mode must fail.

After ensuring the lock is held in exclusive mode, the lock will mark
itself as deleted and continue to notify all pending acquires. They will
wake up, notice the deleted lock and return an error to the caller.


Condition
^^^^^^^^^

Note: This is not necessary for the locking changes above, but it may be
a good optimization (pending performance tests).

The existing locking code in Ganeti 2.0 uses Python's built-in
``threading.Condition`` class. Unfortunately ``Condition`` implements
timeouts by sleeping 1ms to 20ms between tries to acquire the condition
lock in non-blocking mode. This requires unnecessary context switches
and contention on the CPython GIL (Global Interpreter Lock).

By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
support for timeouts on file descriptors (see ``select(2)``). A custom
condition class will have to be written for this.

On instantiation the class creates a pipe. After each notification the
previous pipe is abandoned and re-created (technically the old pipe
needs to stay around until all notifications have been delivered).

All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
wait for notifications, optionally with a timeout. A notification will
be signalled to the waiting clients by closing the pipe. If the pipe
wasn't closed during the timeout, the waiting function returns to its
caller nonetheless.
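
A minimal sketch of such a pipe-based condition is shown below; it is
only meant to illustrate the notification mechanism (the locking around
the shared state, waiter tracking and cleanup of old pipes are left
out):

::

  import os
  import select

  class PipeCondition(object):
    """Sketch of a condition variable built on a POSIX pipe."""

    def __init__(self):
      self._read_fd, self._write_fd = os.pipe()

    def wait(self, timeout=None):
      """Wait until notified, or until the timeout (seconds) expires."""
      # select() returns as soon as the write end has been closed,
      # because the read end then becomes "readable" (EOF).
      select.select([self._read_fd], [], [], timeout)

    def notify_all(self):
      """Wake up all current waiters by closing the pipe."""
      os.close(self._write_fd)
      # The old read end must stay open until every waiter has returned
      # from select(); a real implementation would track the waiters and
      # close it afterwards. New waiters get a fresh pipe.
      self._read_fd, self._write_fd = os.pipe()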


Node daemon availability
~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently, when a Ganeti node suffers serious system disk damage, the
migration/failover of an instance may not correctly shut down the
virtual machine on the broken node, causing instance duplication. The
``gnt-node powercycle`` command can be used to force a node reboot and
thus to avoid duplicated instances. This command relies on node daemon
availability, though, and thus can fail if the node daemon has some
pages swapped out of RAM, for example.


Proposed changes
++++++++++++++++

The proposed solution forces the node daemon to run exclusively in RAM.
It uses Python ctypes to call ``mlockall(MCL_CURRENT | MCL_FUTURE)`` on
the node daemon process and all its children. In addition another log
handler has been implemented for the node daemon to redirect to
``/dev/console`` messages that cannot be written to the logfile.
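
A sketch of the ``mlockall`` call via ctypes is shown below. The wrapper
function name is illustrative; the ``MCL_*`` constants are the values
used on Linux.

::

  import ctypes
  import ctypes.util
  import os

  MCL_CURRENT = 1
  MCL_FUTURE = 2

  def LockProcessMemory():
    """Lock the current process (and future allocations) into RAM."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
      errno = ctypes.get_errno()
      raise OSError(errno, os.strerror(errno), "mlockall(2) failed")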

With these changes the node daemon can successfully run basic tasks such
as a powercycle request even when the system disk is heavily damaged and
reading/writing to disk fails constantly.


Feature changes
---------------

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than others.
In particular they are divided between "master", "master candidates" and
"normal". (Moreover they can be offline or drained, but this is not
important for the current discussion). In general the whole
configuration is only replicated to master candidates, and some partial
information is spread to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations didn't
need to contact all nodes, and so clusters could become bigger. If we
want more information to be available on all nodes, we need to add more
ssconf values, which counter-balances the change, or to talk with
the master node, which is not designed to happen now, and requires its
availability.

Information such as the instance->primary_node mapping will be needed on
all nodes, and we also want to make sure services external to the
cluster can query this information as well. This information must be
available at all times, so we can't query it through RAPI, which would
be a single point of failure, as it's only available on the master.


Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which will
run on master candidates. This daemon will talk via UDP, and
authenticate messages using HMAC with a cluster-wide shared key. This
key will be generated at cluster init time, and stored on the cluster
alongside the ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset
of the cluster master candidates. It will then wait to get a few
responses, and use the one with the highest configuration serial number.
Since the configuration serial number is increased each time the ganeti
config is updated, and the serial number is included in all answers,
this can be used to make sure to use the most recent answer, in case
some master candidates are stale or in the middle of a configuration
update.

In order to prevent replay attacks, queries will contain the current
unix timestamp according to the client, and the server will verify that
its own timestamp is within the same 5-minute range (this requires
synchronized clocks, which is a good idea anyway). Queries will also
contain a "salt" which they expect the answers to be sent with, and
clients are supposed to accept only answers which contain the salt
generated by them.

The configuration daemon will be able to answer simple queries such as:

- master candidates list
- master node
- offline nodes
- instance list
- instance primary nodes

Wire protocol
^^^^^^^^^^^^^

A confd query will look like this, on the wire::

  plj0{
    "msg": "{\"type\": 1,
             \"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\",
             \"protocol\": 1,
             \"query\": \"node1.example.com\"}\n",
    "salt": "1249637704",
    "hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f"
  }

"plj0" is a fourcc that details the message content. It stands for plain
json 0, and can be changed as we move on to different types of protocols
(for example protocol buffers, or encrypted json). What follows is a
JSON-encoded string, with the following fields:

- 'msg' contains a JSON-encoded query, its fields are:

  - 'protocol', integer, is the confd protocol version (initially just
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
  - 'type', integer, is the query type. For example "node role by name"
    or "node primary ip by instance ip". Constants will be provided for
    the actual available query types.
  - 'query', string, is the search key. For example an IP, or a node
    name.
  - 'rsalt', string, is the required response salt. The client must use
    it to recognize which answer it's getting.

- 'salt' must be the current unix timestamp, according to the client.
  Servers can refuse messages which have a wrong timing, according to
  their configuration and clock.
- 'hmac' is an HMAC signature of salt+msg, made with the cluster HMAC key

If an answer comes back (which is optional, since confd works over UDP)
it will be in this format::

  plj0{
    "msg": "{\"status\": 0,
             \"answer\": 0,
             \"serial\": 42,
             \"protocol\": 1}\n",
    "salt": "9aa6ce92-8336-11de-af38-001d093e835f",
    "hmac": "aaeccc0dff9328fdf7967cb600b6a80a6a9332af"
  }

Where:

- 'plj0' is the message type magic fourcc, as discussed above
- 'msg' contains a JSON-encoded answer, its fields are:

  - 'protocol', integer, is the confd protocol version (initially just
    constants.CONFD_PROTOCOL_VERSION, with a value of 1)
  - 'status', integer, is the error code. Initially just 0 for 'ok' or
    1 for 'error' (in which case answer contains an error detail,
    rather than an answer), but in the future it may be expanded to have
    more meanings (e.g. 2: the answer is compressed)
  - 'answer' is the actual answer. Its type and meaning is query
    specific. For example for "node primary ip by instance ip" queries
    it will be a string containing an IP address, for "node role by
    name" queries it will be an integer which encodes the role (master,
    candidate, drained, offline) according to constants.

- 'salt' is the requested salt from the query. A client can use it to
  recognize what query the answer is answering.
- 'hmac' is an HMAC signature of salt+msg, made with the cluster HMAC
  key (a client-side sketch of this signing scheme is shown below)

    
457

    
458
Redistribute Config
459
~~~~~~~~~~~~~~~~~~~
460

    
461
Current State and shortcomings
462
++++++++++++++++++++++++++++++
463

    
464
Currently LURedistributeConfig triggers a copy of the updated
465
configuration file to all master candidates and of the ssconf files to
466
all nodes. There are other files which are maintained manually but which
467
are important to keep in sync. These are:
468

    
469
- rapi SSL key certificate file (rapi.pem) (on master candidates)
470
- rapi user/password file rapi_users (on master candidates)
471

    
472
Furthermore there are some files which are hypervisor specific but we
473
may want to keep in sync:
474

    
475
- the xen-hvm hypervisor uses one shared file for all vnc passwords, and
476
  copies the file once, during node add. This design is subject to
477
  revision to be able to have different passwords for different groups
478
  of instances via the use of hypervisor parameters, and to allow
479
  xen-hvm and kvm to use an equal system to provide password-protected
480
  vnc sessions. In general, though, it would be useful if the vnc
481
  password files were copied as well, to avoid unwanted vnc password
482
  changes on instance failover/migrate.
483

    
484
Optionally the admin may want to also ship files such as the global
485
xend.conf file, and the network scripts to all nodes.
486

    
487
Proposed changes
488
++++++++++++++++
489

    
490
RedistributeConfig will be changed to copy also the rapi files, and to
491
call every enabled hypervisor asking for a list of additional files to
492
copy. Users will have the possibility to populate a file containing a
493
list of files to be distributed; this file will be propagated as well.
494
Such solution is really simple to implement and it's easily usable by
495
scripts.
496

    
497
This code will be also shared (via tasklets or by other means, if
498
tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs
499
(so that the relevant files will be automatically shipped to new master
500
candidates as they are set).

VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to
connect to the instances' VNC console, and has one common password
stored in a file.

This doesn't allow different passwords for different instances/groups of
instances, and makes it necessary to remember to copy the file around
the cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor
parameter. This way it can have a cluster default, but also a different
value for each instance. The VNC enabled hypervisors (xen and kvm) will
publish all the password files in use through the cluster so that a
redistribute-config will ship them to all nodes (see the Redistribute
Config proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value
will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining
backwards compatibility with 2.0.

The code to export the list of VNC password files from the hypervisors
to RedistributeConfig will be shared between the KVM and xen-hvm
hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options and
all the rest is left to a default we chose. We're finding that we need
more and more to tweak some of these parameters, for example to disable
barriers for DRBD devices, or allow striping for the LVM volumes.

Moreover for many of these parameters it would be nice to have
cluster-wide defaults, and then be able to change them per
disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster level diskparams and netparams, which will
contain all the tweakable parameters. All values which have a sensible
cluster-wide default will go into this new structure while parameters
which have unique values will not.

Example of network parameters:
  - mode: bridge/route
  - link: for mode "bridge" the bridge to connect to, for mode route it
    can contain the routing table, or the destination interface

Example of disk parameters:
  - stripe: lvm stripes
  - stripe_size: lvm stripe size
  - meta_flushes: drbd, enable/disable metadata "barriers"
  - data_flushes: drbd, enable/disable data "barriers"

Some parameters are bound to be disk-type specific (drbd vs. lvm vs.
files) or hypervisor specific (nic models for example), but for now they
will all live in the same structure. Each component is supposed to
validate only the parameters it knows about, and ganeti itself will make
sure that no "globally unknown" parameters are added, and that no
parameters have overridden meanings for different components.
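
To illustrate, the cluster-level defaults could end up looking roughly
like the following structures (the keys follow the examples above, while
the values and their units are made up for this sketch and are not the
final parameter definitions):

::

  # Hypothetical cluster-wide defaults
  netparams = {
    "mode": "bridge",       # or "route"
    "link": "xen-br0",      # bridge name, or routing table/interface
    }

  diskparams = {
    "stripe": 1,            # number of LVM stripes
    "stripe_size": 64,      # LVM stripe size (e.g. in MiB)
    "meta_flushes": True,   # DRBD metadata "barriers"
    "data_flushes": True,   # DRBD data "barriers"
    }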

The parameters will be kept, as for the BEPARAMS, in a "default"
category, which will allow us to expand on them by creating instance
"classes" in the future. Instance classes are not a feature we plan
to implement in 2.1, though.


Global hypervisor parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently all hypervisor parameters are modifiable both globally
(cluster level) and at instance level. However, there is no other
framework to hold hypervisor-specific parameters, so if we want to add
a new class of hypervisor parameters that only makes sense on a global
level, we have to change the hvparams framework.

Proposed changes
++++++++++++++++

We add a new (global, not per-hypervisor) list of parameters which are
not changeable on a per-instance level. The create, modify and query
instance operations are changed to not allow/show these parameters.

Furthermore, to allow transition of parameters to the global list, and
to allow cleanup of inadvertently-customised parameters, the
``UpgradeConfig()`` method of instances will drop any such parameters
from their list of hvparams, such that a restart of the master daemon
is all that is needed for cleaning these up.
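
A minimal sketch of such a cleanup step in ``UpgradeConfig()`` could
look like the following; the set of global-only parameter names shown
here is purely illustrative:

::

  # Hypothetical set of hypervisor parameters handled only at cluster
  # level; the real list is to be defined in constants.py.
  GLOBAL_HVPARAMS = frozenset(["migration_port"])

  def UpgradeConfig(self):
    """Drop instance-level values of parameters that are now global."""
    if self.hvparams:
      for name in GLOBAL_HVPARAMS:
        self.hvparams.pop(name, None)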

Also, the framework is simple enough that if we need to replicate it
at beparams level we can do so easily.


Non bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the
bridge is not specified the default cluster one is used. This makes it
impossible to use the vif-route xen network scripts, or other
alternative mechanisms that don't need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged
interfaces and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the
instance should be connected to, effectively making things work as they
do today. The value has been migrated from a nic field to a parameter to
allow for an easier manipulation of the cluster default.

When mode is "route" the ip field of the interface will become
mandatory, to allow for a route to be set. In the future we may also
want to accept multiple IPs or IP/mask values for this purpose. We will
evaluate possible meanings of the link parameter to signify a routing
table to be used, which would allow for isolation between instance
groups (as today happens for different bridges).

For now we won't add a parameter to specify which network script gets
called for which instance, so in a mixed cluster the network script must
be able to handle both cases. The default kvm vif script will be changed
to do so. (Xen doesn't have a ganeti-provided script, so nothing will be
done for that hypervisor)

Introducing persistent UUIDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Some objects in the Ganeti configuration are tracked by their name
while also supporting renames. This creates an extra difficulty,
because neither Ganeti nor external management tools can then track
the actual entity, and due to the name change it behaves like a new
one.

Proposed changes part 1
+++++++++++++++++++++++

We will change Ganeti to use UUIDs for entity tracking, but in a
staggered way. In 2.1, we will simply add a “uuid” attribute to each
of the instances, nodes and the cluster itself. This will be reported
on instance creation for instances, and on node add for nodes. It will
of course be available for querying via the OpQueryNodes/Instance and
cluster information, and via RAPI as well.

Note that Ganeti will not provide any way to change this attribute.

Upgrading from Ganeti 2.0 will automatically add a ‘uuid’ attribute
to all entities missing it.


Proposed changes part 2
+++++++++++++++++++++++

In the next release (e.g. 2.2), the tracking of objects will change
from the name to the UUID internally, and externally Ganeti will
accept both forms of identification; e.g. an RAPI call would be made
either against ``/2/instances/foo.bar`` or against
``/2/instances/bb3b2e42…``. Since an FQDN must have at least a dot,
and dots are not valid characters in UUIDs, we will not have namespace
issues.

Another change here is that node identification (during cluster
operations/queries like master startup, “am I the master?” and
similar) could be done via UUIDs, which is more stable than the current
hostname-based scheme.

Internal tracking refers to the way the configuration is stored; a
DRBD disk of an instance refers to the node name (so that IPs can be
changed easily), but this is still a problem for name changes; thus
these will be changed to point to the node UUID to ease renames.

The advantage of this change (after the second round of changes) is
that node rename becomes trivial, whereas today node rename would
require a complete lock of all instances.


Automated disk repairs infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Replacing defective disks in an automated fashion is quite difficult
with the current version of Ganeti. These changes will introduce
additional functionality and interfaces to simplify automating disk
replacements on a Ganeti node.

Fix node volume group
+++++++++++++++++++++

This is the most difficult addition, as it can lead to data loss if it's
not properly safeguarded.

The operation must be done only when all the other nodes that have
instances in common with the target node are fine, i.e. this is the only
node with problems, and also we have to double-check that all instances
on this node have at least a good copy of the data.

This might mean that we have to enhance the GetMirrorStatus calls, and
introduce a smarter version that can tell us more about the status
of an instance.

Stop allocation on a given PV
+++++++++++++++++++++++++++++

This is somewhat simple. First we need a "list PVs" opcode (and its
associated logical unit) and then a set PV status opcode/LU. These in
combination should allow both checking and changing the disk/PV status.

Instance disk status
++++++++++++++++++++

This new opcode or opcode change must list the instance-disk-index and
node combinations of the instance together with their status. This will
allow determining what part of the instance is broken (if any).

Repair instance
+++++++++++++++

This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in
order to fix the instance status. It only affects primary instances;
secondaries can just be moved away.

Migrate node
++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
migrate`` code and run migrate for all instances on the node.

Evacuate node
+++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
evacuate`` code and run replace-secondary with an iallocator script for
all instances on the node.


User-id pool
~~~~~~~~~~~~

In order to allow running different processes under unique user-ids
on a node, we introduce the user-id pool concept.

The user-id pool is a cluster-wide configuration parameter.
It is a list of user-ids and/or user-id ranges that are reserved
for running Ganeti processes (including KVM instances).
The code guarantees that on a given node a given user-id is only
handed out if there is no other process running with that user-id.

Please note that this can only be guaranteed if all processes in
the system - that run under a user-id belonging to the pool - are
started by reserving a user-id first. That can be accomplished
either by using the RequestUnusedUid() function to get an unused
user-id or by implementing the same locking mechanism.

Implementation
++++++++++++++

The functions that are specific to the user-id pool feature are located
in a separate module: ``lib/uidpool.py``.

Storage
^^^^^^^

The user-id pool is a single cluster parameter. It is stored in the
*Cluster* object under the ``uid_pool`` name as a list of integer
tuples. These tuples represent the boundaries of user-id ranges.
For single user-ids, the boundaries are equal.

The internal user-id pool representation is converted into a
string: a newline separated list of user-ids or user-id ranges.
This string representation is distributed to all the nodes via the
*ssconf* mechanism. This means that the user-id pool can be
accessed in a read-only way on any node without consulting the master
node or master candidate nodes.
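
For example, a pool containing the single uid 4000 and the range
4005-4009 could be stored as ``[(4000, 4000), (4005, 4009)]`` in the
configuration, and converted for ssconf roughly as follows (the helper
name is hypothetical and the actual formatting code may differ):

::

  def FormatUidPool(uid_pool):
    """Convert the internal range list into the ssconf string form."""
    lines = []
    for lower, higher in uid_pool:
      if lower == higher:
        lines.append(str(lower))                 # single user-id
      else:
        lines.append("%s-%s" % (lower, higher))  # inclusive range
    return "\n".join(lines)

  # FormatUidPool([(4000, 4000), (4005, 4009)]) == "4000\n4005-4009"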

Initial value
^^^^^^^^^^^^^

The value of the user-id pool cluster parameter can be initialized
at cluster initialization time using the

``gnt-cluster init --uid-pool <uid-pool definition> ...``

command.

As there is no sensible default value for the user-id pool parameter,
it is initialized to an empty list if no ``--uid-pool`` option is
supplied at cluster init time.

If the user-id pool is empty, the user-id pool feature is considered
to be disabled.

Manipulation
^^^^^^^^^^^^

The user-id pool cluster parameter can be modified from the
command-line with the following commands:

- ``gnt-cluster modify --uid-pool <uid-pool definition>``
- ``gnt-cluster modify --add-uids <uid-pool definition>``
- ``gnt-cluster modify --remove-uids <uid-pool definition>``

The ``--uid-pool`` option overwrites the current setting with the
supplied ``<uid-pool definition>``, while
``--add-uids``/``--remove-uids`` adds/removes the listed uids
or uid-ranges from the pool.

The ``<uid-pool definition>`` should be a comma-separated list of
user-ids or user-id ranges. A range should be defined by a lower and
a higher boundary. The boundaries should be separated with a dash.
The boundaries are inclusive.

The ``<uid-pool definition>`` is parsed into the internal
representation, sanity-checked and stored in the ``uid_pool``
attribute of the *Cluster* object.

It is also immediately converted into a string (formatted in the
input format) and distributed to all nodes via the *ssconf* mechanism.

Inspection
^^^^^^^^^^

The current value of the user-id pool cluster parameter is printed
by the ``gnt-cluster info`` command.

The output format is accepted by the ``gnt-cluster modify --uid-pool``
command.

Locking
^^^^^^^

The ``uidpool.py`` module provides a function (``RequestUnusedUid``)
for requesting an unused user-id from the pool.

This will try to find a random user-id that is not currently in use.
The algorithm is the following (a sketch of an implementation is shown
after this list):

1) Randomize the list of user-ids in the user-id pool
2) Iterate over this randomized UID list
3) Create a lock file (it doesn't matter if it already exists)
4) Acquire an exclusive POSIX lock on the file, to provide mutual
   exclusion for the following non-atomic operations
5) Check if there is a process in the system with the given UID
6) If there isn't, return the UID, otherwise unlock the file and
   continue the iteration over the user-ids
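
The following is a rough sketch of that algorithm, under the assumption
that one lock file per user-id is kept in a well-known directory and
that process ownership can be checked by scanning ``/proc`` (the helper
and path names are illustrative, not the actual ``uidpool.py`` code):

::

  import fcntl
  import os
  import random

  UIDPOOL_LOCKDIR = "/var/run/ganeti/uid-pool"  # hypothetical location

  def _UidInUse(uid):
    """Return True if any process currently runs under the given uid."""
    for entry in os.listdir("/proc"):
      if entry.isdigit():
        try:
          if os.stat("/proc/%s" % entry).st_uid == uid:
            return True
        except OSError:
          continue  # the process exited while we were looking
    return False

  def RequestUnusedUidSketch(all_uids):
    """Return a (uid, lockfile) pair for an unused uid from the pool."""
    uids = list(all_uids)
    random.shuffle(uids)
    for uid in uids:
      lockfile = open(os.path.join(UIDPOOL_LOCKDIR, str(uid)), "a")
      # The exclusive lock serializes the non-atomic "check and hand
      # out" step across processes.
      fcntl.flock(lockfile, fcntl.LOCK_EX)
      if not _UidInUse(uid):
        return uid, lockfile  # caller unlocks after starting the process
      fcntl.flock(lockfile, fcntl.LOCK_UN)
      lockfile.close()
    raise RuntimeError("all user-ids in the pool are in use")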

The user can then start a new process with this user-id.
Once a process is successfully started, the exclusive POSIX lock can
be released, but the lock file will remain in the filesystem.
The presence of such a lock file means that the given user-id is most
probably in use. The lack of a uid lock file does not guarantee that
there are no processes with that user-id.

After acquiring the exclusive POSIX lock, ``RequestUnusedUid``
always performs a check to see if there is a process running with the
given uid.

A user-id can be returned to the pool by calling the
``ReleaseUid`` function. This will remove the corresponding lock file.
Note that it doesn't check if there is any process still running
with that user-id. The removal of the lock file only means that there
are most probably no processes with the given user-id. This helps
in speeding up the process of finding a user-id that is guaranteed to
be unused.

There is a convenience function, called ``ExecWithUnusedUid``, that
wraps the execution of a function (or any callable) that requires a
unique user-id. ``ExecWithUnusedUid`` takes care of requesting an
unused user-id and unlocking the lock file. It also automatically
returns the user-id to the pool if the callable raises an exception.

Code examples
+++++++++++++

Requesting a user-id from the pool:

::

  from ganeti import ssconf
  from ganeti import uidpool

  # Get list of all user-ids in the uid-pool from ssconf
  ss = ssconf.SimpleStore()
  uid_pool = uidpool.ParseUidPool(ss.GetUidPool(), separator="\n")
  all_uids = set(uidpool.ExpandUidPool(uid_pool))

  uid = uidpool.RequestUnusedUid(all_uids)
  try:
    <start a process with the UID>
    # Once the process is started, we can release the file lock
    uid.Unlock()
  except Exception:
    # Return the UID to the pool
    uidpool.ReleaseUid(uid)


Releasing a user-id:

::

  from ganeti import uidpool

  uid = <get the UID the process is running under>
  <stop the process>
  uidpool.ReleaseUid(uid)


External interface changes
--------------------------

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind.
Since we pass everything as environment variables it's a lot easier to
send new information to the OSes without breaking backwards
compatibility. This section of the design outlines the proposed
extensions to the API and their implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (e.g. 15), which should be
mostly compatible with API 10, except for some newly added variables.
Since it's easy not to pass some variables we'll be able to handle
Ganeti 2.0 OSes by just filtering out the newly added pieces of
information. We will still encourage OSes to declare support for the new
API after checking that the new variables don't provide any conflict for
them, and we will drop API 10 support after ganeti 2.1 has been
released.

New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS API but would definitely
be useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable
to allow the OS to make changes relevant to the virtualization the
instance is going to use. Since this field is immutable for each
instance, the OS can tailor the install to it, without having to make
sure the instance can run under any virtualization technology.

We also want the OS to know the particular hypervisor parameters, to be
able to customize the install even more. Since the parameters can
change, though, we will pass them only as an "FYI": if an OS ties some
instance functionality to the value of a particular hypervisor
parameter, manual changes or a reinstall may be needed to adapt the
instance to the new environment. This is not a regression as of today,
because even if the OSes are left blind about this information,
sometimes they still need to make compromises and cannot satisfy all
possible parameter values.

OS Variants
+++++++++++

Currently we are seeing some degree of "OS proliferation" just to
change a simple installation behavior. This means that the same OS gets
installed on the cluster multiple times, with different names, to
customize just one installation behavior. Usually such OSes try to share
as much as possible through symlinks, but this still causes
complications on the user side, especially when multiple parameters must
be cross-matched.

For example today if you want to install debian etch, lenny or squeeze
you probably need to install the debootstrap OS multiple times, changing
its configuration file, and calling it debootstrap-etch,
debootstrap-lenny or debootstrap-squeeze. Furthermore if you have for
example a "server" and a "development" environment which install
different packages/configuration files and must be available for all
installs you'll probably end up with debootstrap-etch-server,
debootstrap-etch-dev, debootstrap-lenny-server, debootstrap-lenny-dev,
etc. Crossing more than two parameters quickly becomes unmanageable.

In order to avoid this we plan to make OSes more customizable, by
allowing each OS to declare a list of variants which can be used to
customize it. The variants list is mandatory and must be written, one
variant per line, in the new "variants.list" file inside the main os
dir. At least one variant must be supported. When choosing the
OS exactly one variant will have to be specified, and will be encoded in
the os name as <OS-name>+<variant>. As for today it will be possible to
change an instance's OS at creation or install time.
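
For example, a single debootstrap OS definition could then ship a
"variants.list" file like the following (the variant names are just an
illustration)::

  etch
  lenny
  squeeze

and instances would be created with OS names such as
``debootstrap+etch`` or ``debootstrap+lenny``, instead of maintaining
three separately named copies of the OS definition.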

The 2.1 OS list will be the combination of each OS, plus its supported
variants. This will cause the name proliferation to remain, but at
least the internal OS code will be simplified to just parsing the passed
variant, without the need for symlinks or code duplication.

Also we expect the OSes to declare only "interesting" variants, but to
accept some non-declared ones which a user will be able to pass in by
overriding the checks ganeti does. This will be useful for allowing some
variations to be used without polluting the OS list (per-OS
documentation should list all supported variants). If a variant which is
not internally supported is forced through, the OS scripts should abort.

In the future (post 2.1) we may want to move to full fledged parameters
all orthogonal to each other (for example "architecture" (i386, amd64),
"suite" (lenny, squeeze, ...), etc). (As opposed to the variant, which
is a single parameter, and you need a different variant for each
combination you want to support). In this case we envision the
variants to be moved inside of Ganeti and be associated with lists of
parameter->value associations, which will then be passed to the OS.


IAllocator changes
~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The iallocator interface allows creation of instances without manually
specifying nodes, but instead by specifying plugins which will do the
required computations and produce a valid node list.

However, the interface is quite awkward to use:

- one cannot set a 'default' iallocator script
- one cannot use it to easily test if allocation would succeed
- some new functionality, such as rebalancing clusters and calculating
  capacity estimates is needed

Proposed changes
++++++++++++++++

There are two areas of improvement proposed:

- improving the use of the current interface
- extending the IAllocator API to cover more automation


Default iallocator names
^^^^^^^^^^^^^^^^^^^^^^^^

The cluster will hold, for each type of iallocator, a (possibly empty)
list of modules that will be used automatically.

If the list is empty, the behaviour will remain the same.

If the list has one entry, then ganeti will behave as if
'--iallocator' was specified on the command line, i.e. use this
allocator by default. If the user however passed nodes, those will be
used in preference.

If the list has multiple entries, they will be tried in order until
one gives a successful answer.
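
A sketch of this fallback behaviour, assuming the configured list of
default allocators is available as ``default_iallocators`` and a
``TryAllocation(name, request)`` helper runs one plugin (both names are
hypothetical):

::

  def RunDefaultAllocators(default_iallocators, request):
    """Try the configured allocators in order, using the first success."""
    for name in default_iallocators:
      # TryAllocation is assumed to return None when the plugin fails
      result = TryAllocation(name, request)
      if result is not None:
        return result
    raise RuntimeError("no default iallocator could satisfy the request")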

Dry-run allocation
^^^^^^^^^^^^^^^^^^

The create instance LU will get a new 'dry-run' option that will just
simulate the placement, and return the chosen node-lists after running
all the usual checks.

Cluster balancing
^^^^^^^^^^^^^^^^^

Instance adds/removals/moves can create a situation where load on the
nodes is not spread equally. For this, a new iallocator mode will be
implemented called ``balance`` in which the plugin, given the current
cluster state, and a maximum number of operations, will need to
compute the instance relocations needed in order to achieve a "better"
(for whatever the script believes is better) cluster.

Cluster capacity calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this mode, called ``capacity``, given an instance specification and
the current cluster state (similar to the ``allocate`` mode), the
plugin needs to return:

- how many instances can be allocated on the cluster with that
  specification
- on which nodes these will be allocated (in order)

.. vim: set textwidth=72 :