=================
Ganeti 2.1 design
=================

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Ganeti 2.1 will add features to help further automate cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.

Detailed design
===============

As for 2.0, we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (e.g. command line, OS API, hooks, ...)

Core changes
------------

Storage units modelling
~~~~~~~~~~~~~~~~~~~~~~~

Currently, Ganeti has a good model of the block devices for instances
(e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the
storage pools that provide the space for these front-end
devices. For example, there are hardcoded inter-node RPC calls for
volume group listing, file storage creation/deletion, etc.

The storage units framework will implement a generic handling for all
kinds of storage backends:

- LVM physical volumes
- LVM volume groups
- File-based storage directories
- any other future storage method

There will be a generic list of methods that each storage unit type
will provide, like:

- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage unit
type, for example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
- VG consistency fixing

This will allow a much better modelling and unification of the various
RPC calls related to backend storage pools in the future. Ganeti 2.1 is
intended to add the basics of the framework, and not necessarily move
all the current VG/file-based operations to it.

Note that while we model both LVM PVs and LVM VGs, the framework will
**not** model any relationship between the different types. In other
words, we model neither inheritance nor stacking, since this is
too complex for our needs. While a ``vgreduce`` operation on an LVM VG
could actually remove a PV from it, this will not be handled at the
framework level, but at the individual operation level. The goal is
that this is a lightweight framework, for abstracting the different
storage operations, and not for modelling the storage hierarchy.


Locking improvements
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one or
many ``SharedLock`` instances. It provides an interface to add/remove
locks and to acquire and subsequently release any number of those locks
contained in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
the way we're using locks for nodes and instances (the single cluster
lock isn't affected by this issue) this can lead to long delays when
acquiring locks if another operation tries to acquire multiple locks but
has to wait for yet another operation.

In the following demonstration we assume we have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs the lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic order,
   but it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the lock
   again.
#. Operation A finishes and releases the lock on ``inst4``. Operation B
   can continue and eventually releases all locks.
#. Operation C can get the ``inst1`` lock and finishes.

Technically there's no need for Operation C to wait for Operation A, and
subsequently Operation B, to finish. Operation B can't continue until
Operation A is done (it has to wait for ``inst4``), anyway.

Proposed changes
++++++++++++++++

Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode.
The acquire calls won't return until the lock has successfully been
acquired (or an error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a
non-blocking way. They must support a timeout and abort trying to
acquire the lock(s) after the specified amount of time.

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as
described in the demonstration above, ``LockSet`` must not keep locks
for a prolonged period of time when trying to acquire two or more locks.
Instead it should, with an increasing timeout for acquiring all locks,
release all locks again and sleep for some time if it fails to acquire
all requested locks.

A good timeout value needs to be determined. In any case, ``LockSet``
should proceed to acquire locks in blocking mode after a few
(unsuccessful) attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where
``tries`` is the number of unsuccessful tries.

In the demonstration above this would allow Operation C to continue
after Operation B unsuccessfully tried to acquire all locks and released
all acquired locks (``inst1``, ``inst2`` and ``inst3``) again.
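
A minimal sketch of such a retry loop, assuming a hypothetical
``lockset.acquire(names, timeout=...)`` that returns ``None`` when not
all locks could be acquired within the timeout (the timeout-capable
interface itself is only proposed above)::

  import time

  def AcquireWithBackoff(lockset, names, max_tries=5):
    """Try to acquire all locks, backing off between attempts."""
    for tries in range(max_tries):
      # Hypothetical non-blocking interface: None means the timeout
      # expired and all partially acquired locks were released again
      if lockset.acquire(names, timeout=2 ** tries) is not None:
        return
      # Sleep a bit so other, smaller operations can make progress
      time.sleep(1)
    # After a few unsuccessful attempts, fall back to blocking mode
    lockset.acquire(names)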

Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion on going one step further and extending
the job queue (see ``lib/jqueue.py``) to select the next task for a
worker depending on whether it can acquire the necessary locks. While
this may reduce the number of necessary worker threads and/or increase
throughput on large clusters with many jobs, it also brings many
potential problems with it, such as contention and increased memory
usage. As this would be an extension of the changes proposed above, it
could be implemented at a later point in time; for now we decided to
stay with the simpler solution.

Implementation details
++++++++++++++++++++++

``SharedLock`` redesign
^^^^^^^^^^^^^^^^^^^^^^^

The current design of ``SharedLock`` is not well suited to supporting
timeouts when acquiring a lock, and there are also minor fairness
issues in it. We plan to address both with a redesign. A proof of
concept implementation was written and resulted in significantly
simpler code.

Currently ``SharedLock`` uses two separate queues for shared and
exclusive acquires, and waiters get to run in turns. This means that if
an exclusive acquire is released, the lock will allow shared waiters to
run, and vice versa. Although it's still fair in the end, there is a
slight bias towards shared waiters in the current implementation. The
same implementation with two queues cannot support timeouts without
adding a lot of complexity.

Our proposed redesign changes ``SharedLock`` to have a single queue.
There will be one condition (see Condition_ for a note about
performance) in the queue per exclusive acquire and two for all shared
acquires (see below for an explanation). The maximum queue length will
always be ``2 + (number of exclusive acquires waiting)``. The number of
queue entries for shared acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be
used in turn. When the lock is instantiated, no conditions are in the
queue. As soon as the first shared acquire arrives (and there are
holder(s) or waiting acquires; see Acquire_), the active condition is
added to the queue. Until it becomes the topmost condition in the queue
and has been notified, any shared acquire is added to this active
condition. When the active condition is notified, the conditions are
swapped and further shared acquires are added to the previously inactive
condition (which has now become the active condition). After all waiters
on the previously active (now inactive) and now notified condition
received the notification, it is removed from the queue of pending
acquires.

This means shared acquires will skip any exclusive acquire in the queue.
We believe it's better to improve parallelization on operations only
asking for shared (or read-only) locks. Exclusive operations holding the
same lock cannot be parallelized.


Acquire
*******

For exclusive acquires a new condition is created and appended to the
queue. Shared acquires are added to the active condition for shared
acquires and, if the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the queue
(to guarantee fairness). If the timeout expires, we return to the caller
without acquiring the lock. On every notification we check whether the
lock has been deleted, in which case an error is returned to the caller.

The lock can be acquired if we're on top of the queue (there is no one
else ahead of us). For an exclusive acquire, there must not be other
exclusive or shared holders. For a shared acquire, there must not be an
exclusive holder. If these conditions are all true, the lock is
acquired and we return to the caller. In any other case we wait again on
the condition.

If it was the last waiter on a condition, the condition is removed from
the queue.

Optimization: there's no need to touch the queue if there are no pending
acquires and no current holders. The caller can have the lock
immediately.

.. digraph:: "design-2.1-lock-acquire"

  graph[fontsize=8, fontname="Helvetica"]
  node[fontsize=8, fontname="Helvetica", width="0", height="0"]
  edge[fontsize=8, fontname="Helvetica"]

  /* Actions */
  abort[label="Abort\n(couldn't acquire)"]
  acquire[label="Acquire lock"]
  add_to_queue[label="Add condition to queue"]
  wait[label="Wait for notification"]
  remove_from_queue[label="Remove from queue"]

  /* Conditions */
  alone[label="Empty queue\nand can acquire?", shape=diamond]
  have_timeout[label="Do I have\ntimeout?", shape=diamond]
  top_of_queue_and_can_acquire[
    label="On top of queue and\ncan acquire lock?",
    shape=diamond,
    ]

  /* Lines */
  alone->acquire[label="Yes"]
  alone->add_to_queue[label="No"]

  have_timeout->abort[label="Yes"]
  have_timeout->wait[label="No"]

  top_of_queue_and_can_acquire->acquire[label="Yes"]
  top_of_queue_and_can_acquire->have_timeout[label="No"]

  add_to_queue->wait
  wait->top_of_queue_and_can_acquire
  acquire->remove_from_queue

Release
*******

First the lock removes the caller from the internal owner list. If there
are pending acquires in the queue, the first (the oldest) condition is
notified.

If the first condition was the active condition for shared acquires, the
inactive condition will be made active. This ensures fairness with
exclusive locks by forcing consecutive shared acquires to wait in the
queue.

.. digraph:: "design-2.1-lock-release"

  graph[fontsize=8, fontname="Helvetica"]
  node[fontsize=8, fontname="Helvetica", width="0", height="0"]
  edge[fontsize=8, fontname="Helvetica"]

  /* Actions */
  remove_from_owners[label="Remove from owner list"]
  notify[label="Notify topmost"]
  swap_shared[label="Swap shared conditions"]
  success[label="Success"]

  /* Conditions */
  have_pending[label="Any pending\nacquires?", shape=diamond]
  was_active_queue[
    label="Was active condition\nfor shared acquires?",
    shape=diamond,
    ]

  /* Lines */
  remove_from_owners->have_pending

  have_pending->notify[label="Yes"]
  have_pending->success[label="No"]

  notify->was_active_queue

  was_active_queue->swap_shared[label="Yes"]
  was_active_queue->success[label="No"]

  swap_shared->success


Delete
******

The caller must either hold the lock in exclusive mode already or the
lock must be acquired in exclusive mode. Trying to delete a lock while
it's held in shared mode must fail.

After ensuring the lock is held in exclusive mode, the lock will mark
itself as deleted and continue to notify all pending acquires. They will
wake up, notice the deleted lock and return an error to the caller.


Condition
^^^^^^^^^

Note: this is not necessary for the locking changes above, but it may be
a good optimization (pending performance tests).

The existing locking code in Ganeti 2.0 uses Python's built-in
``threading.Condition`` class. Unfortunately ``Condition`` implements
timeouts by sleeping 1ms to 20ms between tries to acquire the condition
lock in non-blocking mode. This requires unnecessary context switches
and contention on the CPython GIL (Global Interpreter Lock).

By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
support for timeouts on file descriptors (see ``select(2)``). A custom
condition class will have to be written for this.

On instantiation the class creates a pipe. After each notification the
previous pipe is abandoned and re-created (technically the old pipe
needs to stay around until all notifications have been delivered).

All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
wait for notifications, optionally with a timeout. A notification will
be signalled to the waiting clients by closing the pipe. If the pipe
wasn't closed during the timeout, the waiting function returns to its
caller nonetheless.
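
A minimal sketch of such a pipe-based condition, assuming the usual
``threading.Condition`` convention that an external lock is held around
``wait()`` and ``notifyAll()`` (names and details are illustrative, not
the final implementation)::

  import os
  import select

  class PipeCondition(object):
    """Condition variable using a pipe and select(2) for timeouts."""

    def __init__(self, lock):
      self._lock = lock
      (self._read_fd, self._write_fd) = os.pipe()

    def wait(self, timeout=None):
      """Wait until notified or until the timeout expires."""
      read_fd = self._read_fd
      self._lock.release()
      try:
        # The pipe becomes readable (EOF) once the write side is closed
        select.select([read_fd], [], [], timeout)
      finally:
        self._lock.acquire()

    def notifyAll(self):
      """Notify all waiters by closing the pipe, then re-create it."""
      os.close(self._write_fd)
      # Note: the old read end must stay open until all waiters have
      # woken up; a real implementation needs to track and close it
      (self._read_fd, self._write_fd) = os.pipe()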


Node daemon availability
~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently, when a Ganeti node suffers serious system disk damage, the
migration/failover of an instance may not correctly shut down the
virtual machine on the broken node, causing instance duplication. The
``gnt-node powercycle`` command can be used to force a node reboot and
thus to avoid duplicated instances. This command relies on node daemon
availability, though, and thus can fail if the node daemon has some
pages swapped out of RAM, for example.


Proposed changes
++++++++++++++++

The proposed solution forces the node daemon to run exclusively in RAM.
It uses Python ctypes to call ``mlockall(MCL_CURRENT | MCL_FUTURE)`` on
the node daemon process and all its children. In addition, another log
handler has been implemented for the node daemon to redirect to
``/dev/console`` messages that cannot be written to the logfile.
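
A minimal sketch of that ctypes call, using the standard Linux ``MCL_*``
flag values (error handling kept deliberately simple)::

  import ctypes
  import ctypes.util

  # Flag values from <sys/mman.h>
  MCL_CURRENT = 1
  MCL_FUTURE = 2

  def LockProcessMemory():
    """Lock all current and future pages of this process into RAM."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
      raise OSError(ctypes.get_errno(), "mlockall(2) failed")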

With these changes the node daemon can successfully run basic tasks such
as a powercycle request even when the system disk is heavily damaged and
reading/writing to it fails constantly.


New Features
------------

Automated Ganeti Cluster Merger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current situation
+++++++++++++++++

Currently there's no easy way to merge two or more clusters together.
But in order to optimize resources this is a needed missing piece. The
goal of this design doc is to come up with an easy-to-use solution which
allows you to merge two or more clusters together.

Initial contact
+++++++++++++++

As the design of Ganeti is based on an autonomous system, Ganeti by
itself has no way to reach nodes outside of its cluster. To overcome
this situation we're required to prepare the cluster before we can go
ahead with the actual merge: we have to replace at least the SSH keys on
the affected nodes before we can do any operation with the ``gnt-``
commands.

To make this an automated process we'll ask the user to provide us with
the root password of every cluster we have to merge. We use the password
to grab the current ``id_dsa`` key and then rely on that SSH key for any
further communication until the clusters are fully merged.

Cluster merge
+++++++++++++

After initial contact we do the cluster merge:

1. Grab the list of nodes
2. On all nodes add our own ``id_dsa.pub`` key to ``authorized_keys``
3. Stop all instances running on the merging cluster
4. Disable ``ganeti-watcher`` as it tries to restart Ganeti daemons
5. Stop all Ganeti daemons on all merging nodes
6. Grab the ``config.data`` from the master of the merging cluster
7. Stop the local ``ganeti-masterd``
8. Merge the config:

   1. Open our own cluster ``config.data``
   2. Open the cluster ``config.data`` of the merging cluster
   3. Grab all nodes of the merging cluster
   4. Set ``master_candidate`` to false on all merging nodes
   5. Add the nodes to our own cluster ``config.data``
   6. Grab all the instances on the merging cluster
   7. Adjust the port if the instance has a drbd layout:

      1. In ``logical_id`` (index 2)
      2. In ``physical_id`` (index 1 and 3)

   8. Add the instances to our own cluster ``config.data``

9. Start ``ganeti-masterd`` with ``--no-voting`` ``--yes-do-it``
10. ``gnt-node add --readd`` on all merging nodes
11. ``gnt-cluster redist-conf``
12. Restart ``ganeti-masterd`` normally
13. Enable ``ganeti-watcher`` again
14. Start all merging instances again

Rollback
++++++++

Until we actually (re)add any nodes we can abort and roll back the merge
at any point. After merging the config, though, we have to get the
backup copy of ``config.data`` (from another master candidate node). For
security reasons it's also a good idea to undo the ``id_dsa.pub``
distribution by going to every affected node and removing the
``id_dsa.pub`` key again. Also, keep in mind that we have to start the
Ganeti daemons and the instances again.

Verification
++++++++++++

Last but not least we should verify that the merge was successful.
Therefore we run ``gnt-cluster verify``, which ensures that the cluster
overall is in a healthy state. Additionally, it's also possible to
compare the list of instances/nodes with a list made prior to the merge
to make sure we didn't lose any data/instance/node.

Appendix
++++++++

cluster-merge.py
^^^^^^^^^^^^^^^^

Used to merge the cluster config. This is a POC and might differ from
the actual production code.

::

  #!/usr/bin/python

  import sys
  from ganeti import config
  from ganeti import constants

  c_mine = config.ConfigWriter(offline=True)
  c_other = config.ConfigWriter(sys.argv[1])

  fake_id = 0
  for node in c_other.GetNodeList():
    node_info = c_other.GetNodeInfo(node)
    node_info.master_candidate = False
    c_mine.AddNode(node_info, str(fake_id))
    fake_id += 1

  for instance in c_other.GetInstanceList():
    instance_info = c_other.GetInstanceInfo(instance)
    for dsk in instance_info.disks:
      if dsk.dev_type in constants.LDS_DRBD:
        port = c_mine.AllocatePort()
        logical_id = list(dsk.logical_id)
        logical_id[2] = port
        dsk.logical_id = tuple(logical_id)
        physical_id = list(dsk.physical_id)
        physical_id[1] = physical_id[3] = port
        dsk.physical_id = tuple(physical_id)
    c_mine.AddInstance(instance_info, str(fake_id))
    fake_id += 1


Feature changes
---------------

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than others.
In particular they are divided between "master", "master candidates" and
"normal" nodes. (Moreover they can be offline or drained, but this is
not important for the current discussion.) In general the whole
configuration is only replicated to master candidates, and some partial
information is spread to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations didn't
need to contact all nodes, and so clusters could become bigger. If we
want more information to be available on all nodes, we either need to
add more ssconf values, which counter-balances that change, or to talk
to the master node, which is not designed to happen now and requires its
availability.

Information such as the instance->primary_node mapping will be needed on
all nodes, and we also want to make sure services external to the
cluster can query this information as well. This information must be
available at all times, so we can't query it through RAPI, which would
be a single point of failure, as it's only available on the master.


Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which will
run on master candidates. This daemon will talk via UDP, and
authenticate messages using HMAC with a cluster-wide shared key. This
key will be generated at cluster init time, and stored on the cluster
alongside the Ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset
of the cluster master candidates. It will then wait to get a few
responses, and use the one with the highest configuration serial number.
Since the configuration serial number is increased each time the ganeti
config is updated, and the serial number is included in all answers,
this can be used to make sure the most recent answer is used, in case
some master candidates are stale or in the middle of a configuration
update.

In order to prevent replay attacks, queries will contain the current
unix timestamp according to the client, and the server will verify that
its own timestamp is within a 5-minute range of it (this requires
synchronized clocks, which is a good idea anyway). Queries will also
contain a "salt" which they expect the answers to be sent with, and
clients are supposed to accept only answers which contain the salt
generated by them.

The configuration daemon will be able to answer simple queries such as:

- master candidates list
- master node
- offline nodes
- instance list
- instance primary nodes

Wire protocol
^^^^^^^^^^^^^

A confd query will look like this, on the wire::

  plj0{
    "msg": "{\"type\": 1,
             \"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\",
             \"protocol\": 1,
             \"query\": \"node1.example.com\"}\n",
    "salt": "1249637704",
    "hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f"
  }

``plj0`` is a fourcc that describes the message content. It stands for
plain json 0, and can be changed as we move on to different types of
protocols (for example protocol buffers, or encrypted JSON). What
follows is a JSON-encoded string, with the following fields:

- ``msg`` contains a JSON-encoded query, its fields are:

  - ``protocol``, integer, is the confd protocol version (initially
    just ``constants.CONFD_PROTOCOL_VERSION``, with a value of 1)
  - ``type``, integer, is the query type. For example "node role by
    name" or "node primary ip by instance ip". Constants will be
    provided for the actual available query types
  - ``query`` is a multi-type field (depending on the ``type`` field):

    - it can be missing, when the request is fully determined by the
      ``type`` field
    - it can contain a string which denotes the search key: for
      example an IP, or a node name
    - it can contain a dictionary, in which case the actual details
      vary further per request type

  - ``rsalt``, string, is the required response salt; the client must
    use it to recognize which answer it's getting

- ``salt`` must be the current unix timestamp, according to the
  client; servers should refuse messages which have a wrong timing,
  according to their configuration and clock
- ``hmac`` is an HMAC signature of salt+msg, made with the cluster HMAC
  key

If an answer comes back (which is optional, since confd works over UDP)
it will be in this format::

  plj0{
    "msg": "{\"status\": 0,
             \"answer\": 0,
             \"serial\": 42,
             \"protocol\": 1}\n",
    "salt": "9aa6ce92-8336-11de-af38-001d093e835f",
    "hmac": "aaeccc0dff9328fdf7967cb600b6a80a6a9332af"
  }

Where:

- ``plj0`` is the message type magic fourcc, as discussed above
- ``msg`` contains a JSON-encoded answer, its fields are:

  - ``protocol``, integer, is the confd protocol version (initially
    just ``constants.CONFD_PROTOCOL_VERSION``, with a value of 1)
  - ``status``, integer, is the error code; initially just ``0`` for
    'ok' or ``1`` for 'error' (in which case the answer contains an
    error detail, rather than an answer), but in the future it may be
    expanded to have more meanings (e.g. ``2`` if the answer is
    compressed)
  - ``answer`` is the actual answer; its type and meaning are query
    specific: for example for "node primary ip by instance ip" queries
    it will be a string containing an IP address, for "node role by
    name" queries it will be an integer which encodes the role
    (master, candidate, drained, offline) according to constants

- ``salt`` is the requested salt from the query; a client can use it
  to recognize what query the answer is answering
- ``hmac`` is an HMAC signature of salt+msg, made with the cluster HMAC
  key
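
A minimal client-side sketch of serializing and signing a query in this
format (how the cluster HMAC key is obtained, and the numeric query
type, are placeholders)::

  import hashlib
  import hmac
  import json
  import time

  def PackConfdQuery(hmac_key, query_type, query, rsalt):
    """Build a ``plj0`` confd request as described above."""
    msg = json.dumps({
      "protocol": 1,
      "type": query_type,
      "query": query,
      "rsalt": rsalt,
      })
    # The outer salt is the current unix timestamp, used by the server
    # to reject replayed or badly timed messages
    salt = str(int(time.time()))
    signature = hmac.new(hmac_key, salt + msg, hashlib.sha1).hexdigest()
    return "plj0" + json.dumps({
      "msg": msg,
      "salt": salt,
      "hmac": signature,
      })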


Redistribute Config
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently LUClusterRedistConf triggers a copy of the updated
configuration file to all master candidates and of the ssconf files to
all nodes. There are other files which are maintained manually but which
are important to keep in sync. These are:

- the RAPI SSL key and certificate file (rapi.pem) (on master
  candidates)
- the RAPI user/password file rapi_users (on master candidates)

Furthermore, there are some files which are hypervisor-specific but
which we may want to keep in sync:

- the xen-hvm hypervisor uses one shared file for all vnc passwords, and
  copies the file once, during node add. This design is subject to
  revision to be able to have different passwords for different groups
  of instances via the use of hypervisor parameters, and to allow
  xen-hvm and kvm to use the same system to provide password-protected
  vnc sessions. In general, though, it would be useful if the vnc
  password files were copied as well, to avoid unwanted vnc password
  changes on instance failover/migrate.

Optionally the admin may also want to ship files such as the global
xend.conf file and the network scripts to all nodes.

Proposed changes
++++++++++++++++

RedistributeConfig will be changed to also copy the RAPI files, and to
call every enabled hypervisor asking for a list of additional files to
copy. Users will have the possibility to populate a file containing a
list of files to be distributed; this file will be propagated as well.
Such a solution is really simple to implement and easily usable by
scripts.

This code will also be shared (via tasklets or by other means, if
tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs
(so that the relevant files will be automatically shipped to new master
candidates as they are set).

VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to
connect to the instances' VNC console, and it has one common password
stored in a file.

This doesn't allow different passwords for different instances/groups of
instances, and makes it necessary to remember to copy the file around
the cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor
parameter. This way it can have a cluster default, but also a different
value for each instance. The VNC-enabled hypervisors (xen and kvm) will
publish all the password files in use through the cluster so that a
redistribute-config will ship them to all nodes (see the Redistribute
Config proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value
will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining
backwards compatibility with 2.0.

The code to export the list of VNC password files from the hypervisors
to RedistributeConfig will be shared between the KVM and xen-hvm
hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options and
all the rest is left to a default we chose. We increasingly find that we
need to tweak some of these parameters, for example to disable barriers
for DRBD devices, or to allow striping for the LVM volumes.

Moreover, for many of these parameters it would be nice to have
cluster-wide defaults, and then be able to change them per
disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster-level diskparams and netparams, which will
contain all the tweakable parameters. All values which have a sensible
cluster-wide default will go into this new structure while parameters
which have unique values will not.

Example of network parameters:
  - mode: bridge/route
  - link: for mode "bridge" the bridge to connect to, for mode "route"
    it can contain the routing table, or the destination interface

Example of disk parameters (a sketch of the structures follows):
  - stripe: lvm stripes
  - stripe_size: lvm stripe size
  - meta_flushes: drbd, enable/disable metadata "barriers"
  - data_flushes: drbd, enable/disable data "barriers"
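
A sketch of what the resulting cluster-level structures could look like
(key names and values are purely illustrative; this design does not fix
them)::

  # Hypothetical cluster-wide defaults, keyed by disk/NIC type
  diskparams = {
    "lvm": {"stripe": 1, "stripe_size": "64k"},
    "drbd": {"meta_flushes": True, "data_flushes": True},
    }

  netparams = {
    "default": {"mode": "bridge", "link": "xen-br0"},
    }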

Some parameters are bound to be disk-type specific (drbd vs. lvm vs.
files) or hypervisor-specific (NIC models, for example), but for now
they will all live in the same structure. Each component is supposed to
validate only the parameters it knows about, and Ganeti itself will make
sure that no "globally unknown" parameters are added, and that no
parameter has conflicting meanings for different components.

The parameters will be kept, as for the BEPARAMS, in a "default"
category, which will allow us to expand on them by creating instance
"classes" in the future. Instance classes are not a feature we plan to
implement in 2.1, though.


Global hypervisor parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently all hypervisor parameters are modifiable both globally
(cluster level) and at instance level. However, there is no other
framework to hold hypervisor-specific parameters, so if we want to add
a new class of hypervisor parameters that only makes sense on a global
level, we have to change the hvparams framework.

Proposed changes
++++++++++++++++

We add a new (global, not per-hypervisor) list of parameters which are
not changeable on a per-instance level. The create, modify and query
instance operations are changed to not allow/show these parameters.

Furthermore, to allow transition of parameters to the global list, and
to allow cleanup of inadvertently-customised parameters, the
``UpgradeConfig()`` method of instances will drop any such parameters
from their list of hvparams, such that a restart of the master daemon
is all that is needed for cleaning these up.
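
A minimal sketch of that cleanup, assuming a hypothetical
``GLOBAL_HV_PARAMETERS`` list in ``constants`` naming the
globally-managed parameters::

  from ganeti import constants

  class Instance(object):
    # ... other fields and methods elided ...

    def UpgradeConfig(self):
      """Drop globally-managed hypervisor parameters from the instance."""
      for name in constants.GLOBAL_HV_PARAMETERS:  # hypothetical constant
        # Any per-instance value for a global parameter is stale and
        # must not survive a master daemon restart
        self.hvparams.pop(name, None)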

Also, the framework is simple enough that if we need to replicate it
at beparams level we can do so easily.


Non bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the
bridge is not specified the default cluster one is used. This makes it
impossible to use the vif-route xen network scripts, or other
alternative mechanisms that don't need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged
interfaces and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the
instance should be connected to, effectively making things work as they
do today. The value has been migrated from a nic field to a parameter to
allow for an easier manipulation of the cluster default.

When mode is "route" the ip field of the interface will become
mandatory, to allow for a route to be set. In the future we may also
want to accept multiple IPs or IP/mask values for this purpose. We will
evaluate possible meanings of the link parameter to signify a routing
table to be used, which would allow for isolation between instance
groups (as happens today for different bridges).

For now we won't add a parameter to specify which network script gets
called for which instance, so in a mixed cluster the network script must
be able to handle both cases. The default kvm vif script will be changed
to do so. (Xen doesn't have a Ganeti-provided script, so nothing will be
done for that hypervisor.)

Introducing persistent UUIDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Some objects in the Ganeti configuration are tracked by their name
while also supporting renames. This creates an extra difficulty,
because neither Ganeti nor external management tools can then track
the actual entity, and due to the name change it behaves like a new
one.

Proposed changes part 1
+++++++++++++++++++++++

We will change Ganeti to use UUIDs for entity tracking, but in a
staggered way. In 2.1, we will simply add a “uuid” attribute to each
of the instances, nodes and the cluster itself. This will be reported
at instance creation time for instances, and at node add time for
nodes. It will of course be available for querying via the
OpNodeQuery/Instance and cluster information, and via RAPI as well.

Note that Ganeti will not provide any way to change this attribute.

Upgrading from Ganeti 2.0 will automatically add a ‘uuid’ attribute
to all entities missing it.


Proposed changes part 2
+++++++++++++++++++++++

In the next release (e.g. 2.2), the tracking of objects will change
from the name to the UUID internally, and externally Ganeti will
accept both forms of identification; e.g. a RAPI call could be made
either against ``/2/instances/foo.bar`` or against
``/2/instances/bb3b2e42…``. Since an FQDN must have at least a dot,
and dots are not valid characters in UUIDs, we will not have namespace
issues.

Another change here is that node identification (during cluster
operations/queries like master startup, “am I the master?” and
similar) could be done via UUIDs, which are more stable than the
current hostname-based scheme.

Internal tracking refers to the way the configuration is stored; a
DRBD disk of an instance refers to the node name (so that IPs can be
changed easily), but this is still a problem for name changes; thus
these references will be changed to point to the node UUID to ease
renames.

The advantage of this change (after the second round of changes) is
that node rename becomes trivial, whereas today node rename would
require a complete lock of all instances.


Automated disk repairs infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Replacing defective disks in an automated fashion is quite difficult
with the current version of Ganeti. These changes will introduce
additional functionality and interfaces to simplify automating disk
replacements on a Ganeti node.

Fix node volume group
+++++++++++++++++++++

This is the most difficult addition, as it can lead to data loss if it's
not properly safeguarded.

The operation must be done only when all the other nodes that have
instances in common with the target node are fine, i.e. this is the only
node with problems, and also we have to double-check that all instances
on this node have at least a good copy of the data.

This might mean that we have to enhance the GetMirrorStatus calls, and
introduce a smarter version that can tell us more about the status
of an instance.

Stop allocation on a given PV
+++++++++++++++++++++++++++++

This is somewhat simple. First we need a "list PVs" opcode (and its
associated logical unit) and then a set PV status opcode/LU. These in
combination should allow both checking and changing the disk/PV status.

Instance disk status
++++++++++++++++++++

This new opcode or opcode change must list the instance-disk-index and
node combinations of the instance together with their status. This will
allow determining what part of the instance is broken (if any).

Repair instance
+++++++++++++++

This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in
order to fix the instance status. It only affects primary instances;
secondaries can just be moved away.

Migrate node
++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
migrate`` code and run migrate for all instances on the node.

Evacuate node
++++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
evacuate`` code and run replace-secondary with an iallocator script for
all instances on the node.


User-id pool
~~~~~~~~~~~~

In order to allow running different processes under unique user-ids
on a node, we introduce the user-id pool concept.

The user-id pool is a cluster-wide configuration parameter.
It is a list of user-ids and/or user-id ranges that are reserved
for running Ganeti processes (including KVM instances).
The code guarantees that on a given node a given user-id is only
handed out if there is no other process running with that user-id.

Please note that this can only be guaranteed if all processes in
the system that run under a user-id belonging to the pool are
started by reserving a user-id first. That can be accomplished
either by using the RequestUnusedUid() function to get an unused
user-id or by implementing the same locking mechanism.

Implementation
++++++++++++++

The functions that are specific to the user-id pool feature are located
in a separate module: ``lib/uidpool.py``.

Storage
^^^^^^^

The user-id pool is a single cluster parameter. It is stored in the
*Cluster* object under the ``uid_pool`` name as a list of integer
tuples. These tuples represent the boundaries of user-id ranges.
For single user-ids, the boundaries are equal.

The internal user-id pool representation is converted into a
string: a newline-separated list of user-ids or user-id ranges.
This string representation is distributed to all the nodes via the
*ssconf* mechanism. This means that the user-id pool can be
accessed in a read-only way on any node without consulting the master
node or master candidate nodes.
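
For example, a pool consisting of the range 4000-4019 plus the single
user-id 5000 could look like this (values illustrative)::

  # Internal representation in the Cluster object
  uid_pool = [(4000, 4019), (5000, 5000)]

  # String representation distributed via ssconf
  "4000-4019\n5000"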

Initial value
^^^^^^^^^^^^^

The value of the user-id pool cluster parameter can be initialized
at cluster initialization time using the

``gnt-cluster init --uid-pool <uid-pool definition> ...``

command.

As there is no sensible default value for the user-id pool parameter,
it is initialized to an empty list if no ``--uid-pool`` option is
supplied at cluster init time.

If the user-id pool is empty, the user-id pool feature is considered
to be disabled.

Manipulation
^^^^^^^^^^^^

The user-id pool cluster parameter can be modified from the
command-line with the following commands:

- ``gnt-cluster modify --uid-pool <uid-pool definition>``
- ``gnt-cluster modify --add-uids <uid-pool definition>``
- ``gnt-cluster modify --remove-uids <uid-pool definition>``

The ``--uid-pool`` option overwrites the current setting with the
supplied ``<uid-pool definition>``, while
``--add-uids``/``--remove-uids`` adds/removes the listed uids
or uid-ranges from the pool.

The ``<uid-pool definition>`` should be a comma-separated list of
user-ids or user-id ranges. A range should be defined by a lower and
a higher boundary. The boundaries should be separated with a dash.
The boundaries are inclusive.
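
For example, the following reserves the single user-ids 2000 and 2001
plus the range 4000-4019 (values illustrative)::

  gnt-cluster modify --add-uids 2000,2001,4000-4019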

The ``<uid-pool definition>`` is parsed into the internal
representation, sanity-checked and stored in the ``uid_pool``
attribute of the *Cluster* object.

It is also immediately converted into a string (formatted in the
input format) and distributed to all nodes via the *ssconf* mechanism.

Inspection
^^^^^^^^^^

The current value of the user-id pool cluster parameter is printed
by the ``gnt-cluster info`` command.

The output format is accepted by the ``gnt-cluster modify --uid-pool``
command.

Locking
^^^^^^^

The ``uidpool.py`` module provides a function (``RequestUnusedUid``)
for requesting an unused user-id from the pool.

This will try to find a random user-id that is not currently in use.
The algorithm is the following (a sketch of it follows the list):

1) Randomize the list of user-ids in the user-id pool
2) Iterate over this randomized UID list
3) Create a lock file (it doesn't matter if it already exists)
4) Acquire an exclusive POSIX lock on the file, to provide mutual
   exclusion for the following non-atomic operations
5) Check if there is a process in the system with the given UID
6) If there isn't, return the UID, otherwise unlock the file and
   continue the iteration over the user-ids
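
A minimal sketch of steps 3-6, assuming lock files live in a
hypothetical per-cluster directory and that ``_UidInUse`` checks the
process table (both are illustrative, not the actual ``uidpool.py``
internals)::

  import fcntl
  import os
  import random

  def FindUnusedUid(all_uids, lockdir):
    """Return (uid, locked fd) for a user-id with no running processes."""
    uids = list(all_uids)
    random.shuffle(uids)
    for uid in uids:
      fd = os.open(os.path.join(lockdir, str(uid)), os.O_CREAT | os.O_RDWR)
      try:
        # Exclusive lock guards the non-atomic process-table check below;
        # non-blocking, so uids locked by somebody else are simply skipped
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
      except IOError:
        os.close(fd)
        continue
      if not _UidInUse(uid):  # e.g. scan /proc for processes with this uid
        # The caller keeps the lock until its process has been started
        return (uid, fd)
      fcntl.flock(fd, fcntl.LOCK_UN)
      os.close(fd)
    raise RuntimeError("No unused user-id available in the pool")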

The user can then start a new process with this user-id.
Once a process is successfully started, the exclusive POSIX lock can
be released, but the lock file will remain in the filesystem.
The presence of such a lock file means that the given user-id is most
probably in use. The lack of a uid lock file does not guarantee that
there are no processes with that user-id.

After acquiring the exclusive POSIX lock, ``RequestUnusedUid``
always performs a check to see if there is a process running with the
given uid.

A user-id can be returned to the pool by calling the
``ReleaseUid`` function. This will remove the corresponding lock file.
Note that it doesn't check if there is any process still running
with that user-id. The removal of the lock file only means that there
are most probably no processes with the given user-id. This helps
in speeding up the process of finding a user-id that is guaranteed to
be unused.

There is a convenience function, called ``ExecWithUnusedUid``, that
wraps the execution of a function (or any callable) that requires a
unique user-id. ``ExecWithUnusedUid`` takes care of requesting an
unused user-id and unlocking the lock file. It also automatically
returns the user-id to the pool if the callable raises an exception.

Code examples
+++++++++++++

Requesting a user-id from the pool:

::

  from ganeti import ssconf
  from ganeti import uidpool

  # Get list of all user-ids in the uid-pool from ssconf
  ss = ssconf.SimpleStore()
  uid_pool = uidpool.ParseUidPool(ss.GetUidPool(), separator="\n")
  all_uids = set(uidpool.ExpandUidPool(uid_pool))

  uid = uidpool.RequestUnusedUid(all_uids)
  try:
    <start a process with the UID>
    # Once the process is started, we can release the file lock
    uid.Unlock()
  except ..., err:
    # Return the UID to the pool
    uidpool.ReleaseUid(uid)


Releasing a user-id:

::

  from ganeti import uidpool

  uid = <get the UID the process is running under>
  <stop the process>
  uidpool.ReleaseUid(uid)


External interface changes
--------------------------

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind.
Since we pass everything as environment variables it's a lot easier to
send new information to the OSes without breaking backwards
compatibility. This section of the design outlines the proposed
extensions to the API and their implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (e.g. 15), which should be
mostly compatible with API 10, except for some newly added variables.
Since it's easy not to pass some variables, we'll be able to handle
Ganeti 2.0 OSes by just filtering out the newly added pieces of
information. We will still encourage OSes to declare support for the new
API after checking that the new variables don't create any conflict for
them, and we will drop API 10 support after Ganeti 2.1 has been
released.

New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS API but would definitely
be useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable
to allow the OS to make changes relevant to the virtualization the
instance is going to use. Since this field is immutable for each
instance, the OS can tailor the install to it, without having to make
sure the instance can run under any virtualization technology.

We also want the OS to know the particular hypervisor parameters, to be
able to customize the install even more. Since the parameters can
change, though, we will pass them only as an "FYI": if an OS ties some
instance functionality to the value of a particular hypervisor
parameter, manual changes or a reinstall may be needed to adapt the
instance to the new environment. This is not a regression compared to
today, because even if the OSes are left blind to this information,
sometimes they still need to make compromises and cannot satisfy all
possible parameter values.

OS Variants
+++++++++++

Currently we are witnessing some degree of "OS proliferation" just to
change a simple installation behavior. This means that the same OS gets
installed on the cluster multiple times, with different names, to
customize just one installation behavior. Usually such OSes try to share
as much as possible through symlinks, but this still causes
complications on the user side, especially when multiple parameters must
be cross-matched.

For example, today if you want to install debian etch, lenny or squeeze
you probably need to install the debootstrap OS multiple times, changing
its configuration file, and calling it debootstrap-etch,
debootstrap-lenny or debootstrap-squeeze. Furthermore, if you have for
example a "server" and a "development" environment which install
different packages/configuration files and must be available for all
installs, you'll probably end up with debootstrap-etch-server,
debootstrap-etch-dev, debootstrap-lenny-server, debootstrap-lenny-dev,
etc. Crossing more than two parameters quickly becomes unmanageable.

In order to avoid this we plan to make OSes more customizable, by
allowing each OS to declare a list of variants which can be used to
customize it. The variants list is mandatory and must be written, one
variant per line, in the new "variants.list" file inside the main OS
dir. At least one variant must be supported. When choosing the OS,
exactly one variant will have to be specified, and it will be encoded in
the OS name as <OS-name>+<variant>. As today, it will be possible to
change an instance's OS at creation or install time.
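
For example, a single debootstrap OS supporting the three Debian
releases above would ship a "variants.list" with one variant per line,
and instances would then use an OS name such as ``debootstrap+lenny``::

  etch
  lenny
  squeeze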

The 2.1 OS list will be the combination of each OS, plus its supported
variants. This will cause the name proliferation to remain, but at
least the internal OS code will be simplified to just parsing the passed
variant, without the need for symlinks or code duplication.

Also, we expect the OSes to declare only "interesting" variants, but to
accept some non-declared ones which a user will be able to pass in by
overriding the checks Ganeti does. This will be useful for allowing some
variations to be used without polluting the OS list (per-OS
documentation should list all supported variants). If a variant which is
not internally supported is forced through, the OS scripts should abort.

In the future (post 2.1) we may want to move to full-fledged parameters,
all orthogonal to each other (for example "architecture" (i386, amd64),
"suite" (lenny, squeeze, ...), etc). (As opposed to the variant, which
is a single parameter, so you need a different variant for every
combination you want to support.) In this case we envision the variants
being moved inside Ganeti and associated with lists of parameter->value
associations, which will then be passed to the OS.


IAllocator changes
~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The iallocator interface allows creation of instances without manually
specifying nodes, but instead by specifying plugins which will do the
required computations and produce a valid node list.

However, the interface is quite awkward to use:

- one cannot set a 'default' iallocator script
- one cannot use it to easily test if allocation would succeed
- some new functionality, such as rebalancing clusters and calculating
  capacity estimates, is needed

Proposed changes
++++++++++++++++

There are two areas of improvement proposed:

- improving the use of the current interface
- extending the IAllocator API to cover more automation


Default iallocator names
^^^^^^^^^^^^^^^^^^^^^^^^

The cluster will hold, for each type of iallocator, a (possibly empty)
list of modules that will be used automatically.

If the list is empty, the behaviour will remain the same.

If the list has one entry, then Ganeti will behave as if
'--iallocator' was specified on the command line, i.e. use this
allocator by default. If the user however passed nodes, those will be
used in preference.

If the list has multiple entries, they will be tried in order until
one gives a successful answer.

Dry-run allocation
^^^^^^^^^^^^^^^^^^

The create instance LU will get a new 'dry-run' option that will just
simulate the placement, and return the chosen node-lists after running
all the usual checks.

Cluster balancing
^^^^^^^^^^^^^^^^^

Instance adds/removals/moves can create a situation where load on the
nodes is not spread equally. For this, a new iallocator mode will be
implemented called ``balance`` in which the plugin, given the current
cluster state, and a maximum number of operations, will need to
compute the instance relocations needed in order to achieve a "better"
(for whatever the script believes is better) cluster.

Cluster capacity calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this mode, called ``capacity``, given an instance specification and
the current cluster state (similar to the ``allocate`` mode), the
plugin needs to return:

- how many instances can be allocated on the cluster with that
  specification
- on which nodes these will be allocated (in order)
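
A hypothetical example of a ``capacity`` answer, reusing the generic
iallocator response envelope (the exact result format is not fixed by
this document)::

  {
    "success": true,
    "info": "can allocate 2 instances with the given specification",
    "result": [
      ["node1.example.com", "node2.example.com"],
      ["node3.example.com", "node4.example.com"]
    ]
  }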

.. vim: set textwidth=72 :