=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
       and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that one is often tempted
to remove the lock just to do a simple operation like start instance
while an OS installation is running.

Scalability problems
--------------------

Ganeti 1.2 has a single global lock, which is used for all cluster
operations.  This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which could conceivably
run in parallel (for example because they happen on different
nodes) are actually stalling each other while waiting for the global
lock, without a real reason for that to happen.

One of the main causes of this global lock (besides the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no long-lived process in Ganeti
that can coordinate multiple operations. Each command tries to acquire
the so-called *cmd* lock and when it succeeds, it takes complete
ownership of the cluster configuration and state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
instances per node, which is no longer true with today's hardware.

Artificial restrictions
-----------------------

Ganeti 1.2 (and previous versions) have a fixed two-disks, one-NIC per
instance model. This is a purely artificial restriction, but it
touches so many areas (configuration, import/export, command line)
that removing it is better suited to a major release than a minor one.

Architecture issues
-------------------

The fact that each command is a separate process that reads the
cluster state, executes the command, and saves the new state is also
an issue on big clusters where the configuration data for the cluster
begins to be non-trivial in size.

Overview
========

In order to solve the scalability problems, a rewrite of the core
design of Ganeti is required. While the cluster operations themselves
won't change (e.g. start instance will do the same things), the way
these operations are scheduled internally will change radically.

The new design will change the cluster architecture to:

.. image:: arch-2.0.png

This differs from the 1.2 architecture by the addition of the master
daemon, which will be the only entity to talk to the node daemons.


Detailed design
===============

The changes for 2.0 can be split into roughly three areas:

- core changes that affect the design of the software
- features (or restriction removals) which do not have a wide
  impact on the design
- user-level and API-level changes which translate into differences for
  the operation of the cluster

Core changes
------------

The main change will be switching from a per-process model to a
daemon-based model, where the individual gnt-* commands will be
clients that talk to this daemon (see `Master daemon`_). This will
allow us to get rid of the global cluster lock for most operations,
having instead a per-object lock (see `Granular locking`_). Also, the
daemon will be able to queue jobs, and this will allow the individual
clients to submit jobs without waiting for them to finish, and also
see the result of old requests (see `Job Queue`_).

Besides these major changes, another 'core' change, though not as
visible to the users, will be changing the model of object attribute
storage, separating it into namespaces (such that a Xen PVM
instance will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. For more details see
`Object parameters`_.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(S) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Master daemon
~~~~~~~~~~~~~

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so-called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

There are also some additional interaction paths for exceptional cases:

- CLI tools might access the nodes via SSH (for ``gnt-cluster copyfile``
  and ``gnt-cluster command``)
- master failover is a special case in which a non-master node will SSH
  and do node-RPC calls to the current master

The protocol between the master daemon and the node daemons will be
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
using a simple PUT/GET of JSON-encoded messages. This is done due to
difficulties in working with the Twisted framework and its protocols
in a multithreaded environment, which we can overcome by using a
simpler stack (see the caveats section).

The protocol between the CLI/RAPI and the master daemon will be a
custom one (called *LUXI*): on a UNIX socket on the master node, with
rights restricted by filesystem permissions, the CLI/RAPI will talk to
the master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a Python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

For more details about the RAPI daemon see `Remote API changes`_, and
for the node daemon see `Node daemon changes`_.

.. _luxi:

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

UNIX sockets were chosen in order to get rid of the need for
authentication and authorisation inside Ganeti; for 2.0, the
permissions on the Unix socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
still implemented internally with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized in the query-functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or error message in case of failure

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method of
  passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between the master daemon being hung and the job simply
  not having changed

Users of the API that don't use the provided Python library should
take care of the above two cases.
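
To make the framing concrete, here is a minimal sketch of how a
request could be serialized and a response parsed; it is not the
actual client library, and the socket path and method call in the
commented-out usage are only illustrative::

  import json    # on Python < 2.6 an external library such as simplejson is needed
  import socket

  ETX = chr(3)  # ASCII decimal 3, the message delimiter

  def send_request(sock, method, args):
    """Encode a request as {"method": ..., "args": [...]} plus ETX."""
    sock.sendall(json.dumps({"method": method, "args": args}) + ETX)

  def read_response(sock):
    """Read until ETX, decode the JSON payload and unpack success/result."""
    buf = ""
    while not buf.endswith(ETX):
      data = sock.recv(4096)
      if not data:
        raise EOFError("Connection closed before message delimiter")
      buf += data
    reply = json.loads(buf[:-1])
    if not reply["success"]:
      raise RuntimeError(reply["result"])
    return reply["result"]

  # Illustrative usage (the socket path is hypothetical):
  #   sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  #   sock.connect("/var/run/ganeti/master.sock")
  #   send_request(sock, "QueryJobs", [[1, 2], ["status"]])
  #   print read_response(sock)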


Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other threads (pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, and which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with stale configuration. In effect, the responsibility for
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - there is not even a single node having a newer
      configuration file

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

    - if any of the above is false, we prevent the current operation
      (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since, under exceptional conditions, we could have a situation in which
no node can become the master because of inconsistent data, we will have
an override switch for the master daemon startup that will assume the
current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.
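
As an illustration of the quorum confirmation step above, here is a
minimal sketch; ``query_node_version`` is a hypothetical helper that
returns a remote node's (config serial, highest job ID) pair or
``None`` on failure, and the designated-master check done at plain
startup is omitted::

  def confirm_master_role(my_version, other_nodes, query_node_version):
    """Return True if a quorum agrees that this node may take the master role.

    my_version is the local (config_serial, max_job_id) tuple; other_nodes
    is the node list from the configuration, minus ourselves.
    """
    votes = 1  # the local node counts towards the quorum
    for node in other_nodes:
      remote = query_node_version(node)
      if remote is None:
        continue  # unreachable nodes do not contribute votes
      if remote > my_version:
        return False  # another node has newer data: refuse the role
      votes += 1
    # we need an agreement of at least half plus one of all nodes
    return votes >= (len(other_nodes) + 1) // 2 + 1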

Logging
+++++++

The logging system will be switched completely to the standard Python
logging module; currently it's logging-based, but exposes a different
API, which is just overhead. As such, the code will be switched over
to standard logging calls, and only the setup will be custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one log file per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools

Since the :term:`watcher` will only submit jobs to the master for
startup of the instances, its log file will contain less information
than before, mainly that it will start the instance, but not the
results.
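
A minimal sketch of what the custom setup around the standard logging
module could look like (the file name and format string below are only
illustrative, not the final values)::

  import logging

  def setup_logging(logfile, debug=False):
    """Send all log records of one daemon to a single log file."""
    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter(
      "%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger("")
    root.addHandler(handler)
    if debug:
      root.setLevel(logging.DEBUG)
    else:
      root.setLevel(logging.INFO)

  # e.g. in the master daemon:
  #   setup_logging("/var/log/ganeti/master-daemon.log")
  #   logging.info("master daemon started")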

Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but we fork and process each request in a separate
child.

Since we don't have many calls, and we only fork (not exec), the
overhead should be minimal.
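
A rough sketch of the fork-per-request model follows; ``handle_request``
stands in for the actual RPC dispatching code, and error handling is
omitted::

  import os

  def process_request(connection, handle_request):
    """Serve one inter-node RPC request in a forked child."""
    pid = os.fork()
    if pid == 0:
      # child: handle this single request, then exit immediately
      try:
        handle_request(connection)
      finally:
        os._exit(0)
    else:
      # parent: close its copy of the connection and reap finished children
      connection.close()
      try:
        while os.waitpid(-1, os.WNOHANG)[0] > 0:
          pass
      except OSError:
        pass  # no more children to reap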

Caveats
+++++++

A discussed alternative is to keep the current individual processes
touching the cluster configuration model. The reasons we have not
chosen this approach are:

- the cost of reading and unserializing the cluster state
  today is not small enough that we can ignore it; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (the Twisted name for a
  main loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  request, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTP(S) protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g. ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is the Twisted API stability: during the Ganeti
1.x lifetime, we had to implement workarounds many times for changes
in the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti Cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a Non-Goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent to their usage
- automatically grabbing multiple locks in the right order (avoiding
  deadlock)
- ability to transparently handle conversion to more granularity
- support for asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains up to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same
time.  Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and fail.


The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and
the node locks before the config lock. Locks will need to be acquired at
the same time for multiple instances and nodes, and internal ordering
will be dealt with within the locking library, which, for simplicity,
will just use alphabetical order.
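
A minimal sketch of this ordering rule, using plain mutexes instead of
the real ``SharedLock`` objects (the class below is illustrative, not
the proposed library API)::

  import threading

  class LockSet(object):
    """Simplified example of acquiring locks in a fixed global order."""

    def __init__(self, instance_names, node_names):
      # one lock per instance, one per node, plus the config lock
      self._instances = dict([(n, threading.Lock()) for n in instance_names])
      self._nodes = dict([(n, threading.Lock()) for n in node_names])
      self._config = threading.Lock()

    def acquire(self, instances=(), nodes=(), config=False):
      """Grab locks level by level (instances, nodes, config),
      alphabetically within each level, releasing on failure."""
      acquired = []
      try:
        for name in sorted(instances):
          self._instances[name].acquire()
          acquired.append(self._instances[name])
        for name in sorted(nodes):
          self._nodes[name].acquire()
          acquired.append(self._nodes[name])
        if config:
          self._config.acquire()
          acquired.append(self._config)
      except Exception:
        for lock in reversed(acquired):
          lock.release()
        raise
      return acquired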

Each lock has the following three possible statuses:

- unlocked (anyone can grab the lock)
- shared (anyone can grab/have the lock but only in shared mode)
- exclusive (no one else can grab/have the lock)

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time
we split a lock into more sub-locks we'll create a "metalock", which will
depend on those sub-locks and live for the time necessary for all the
code to convert (or forever, in some conditions). When a metalock exists
all converted code must acquire it in shared mode, so it can run
concurrently, but still be exclusive with old code, which acquires it
exclusively.

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

We might also want to devise more metalocks (e.g. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what they need without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine-grained approach, but
this will probably be only after the first 2.0 version has been
released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be
added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block till the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available
locks, rather than making them just blindly queue for acquiring them.
The inherent risk, though, is that any code using the first operation,
or setting a timeout for the second one, is susceptible to starvation
and thus may never be able to get the required locks and complete
certain tasks. Considering this, providing/using these operations should
not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level.  In the future we may want to split logical units in
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoids possible deadlocks. Of
course extra care must be taken not to leave, if possible, locked
structures in an unusable state. Note that with Python 2.5 a simpler
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere in
the Logical Units code, we decided to allow LUs to declare locks, and
then execute their code with their locks acquired. In the new world LUs
are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation
on how locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guard
against a lot of common errors. Code switching from the old "lock
everything" lock to the new system, though, needs to be carefully
scrutinised to be sure it is really acquiring all the necessary locks,
and none has been overlooked or forgotten.

The code can contain other locks outside of this library, to synchronise
other threaded code (e.g. for the job queue) but in general these should
be leaf locks or carefully structured non-leaf ones, to avoid deadlock
race conditions.


Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations; we also need a
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes`` which are the basic
element of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).


Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned to
   the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job
   has to wait and the first worker finishing its work will grab it.
   Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will only be flagged as
errors in the master daemon log if more than half of the nodes failed,
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.


Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- a single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency

    
683

    
684
Queue structure
685
+++++++++++++++
686

    
687
All file operations have to be done atomically by writing to a temporary
688
file and subsequent renaming. Except for log messages, every change in a
689
job is stored and replicated to other nodes.
690

    
691
::
692

    
693
  /var/lib/ganeti/queue/
694
    job-1 (JSON encoded job description and status)
695
    […]
696
    job-37
697
    job-38
698
    job-39
699
    lock (Queue managing process opens this file in exclusive mode)
700
    serial (Last job ID used)
701
    version (Queue format version)
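
A minimal sketch of the write-to-temporary-and-rename pattern used for
these files (the helper name is illustrative)::

  import os
  import tempfile

  def write_queue_file(path, data):
    """Atomically replace a queue file, so readers never see partial data."""
    fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(path))
    try:
      try:
        os.write(fd, data)
        os.fsync(fd)
      finally:
        os.close(fd)
      os.rename(tmp_name, path)  # atomic replacement on POSIX filesystems
    except Exception:
      os.unlink(tmp_name)
      raise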


Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more
than one thread and must be thread-safe. For simplicity, a single lock
is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.


Internal RPC
++++++++++++

RPC calls available between the Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if
  the job has not been canceled or finished.
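
As an example of how a client would typically combine these calls,
here is a hedged sketch on top of a hypothetical ``luxi_call(method,
args)`` helper; the real client library wraps this differently and the
result shapes are simplified::

  def run_job(luxi_call, opcodes):
    """Submit a job and wait until it reaches a final status."""
    job_id = luxi_call("SubmitJob", [opcodes])
    final = ("success", "error", "canceled")
    while True:
      # block server-side until something changes instead of busy-polling
      luxi_call("WaitForJobChange", [job_id, ["status"], [], 60])
      status = luxi_call("QueryJobs", [[job_id], ["status"]])[0][0]
      if status in final:
        return status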


Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to
the Error status once the master has started again.


History
+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``.  This is done in order to speed up
the queue handling: by default, the jobs in the archive are not
touched by any functions. Only the current (unarchived) jobs are
parsed, loaded, and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates with changes
in the job queue structure. In order to allow this, there will be a
way to prevent new jobs from entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for :term:`PVM` which makes no sense for :term:`HVM`).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence at all the
  hypervisor behaviour.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix
  "BE\_" and the whole list of parameters will exist in the set
  "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a
  LV, or the MAC of a NIC)

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “null”) will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default value for backend parameters

Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.

Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The “hvparams” and “beparams” are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as of the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)


Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features:

:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
  method that can be called from within master code (i.e. cmdlib) and
  should be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)
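
A minimal sketch of how a hypervisor support class could provide these
two hooks (the class name, parameter list and exception type are
illustrative, not the real Ganeti code)::

  import os.path

  class ExampleHypervisor(object):
    """Illustrative hypervisor support class with parameter checking."""

    PARAMETERS = ["kernel_path", "initrd_path"]

    @classmethod
    def CheckParamSyntax(cls, hvparams):
      """Name-only check; safe to run on the master (cmdlib) side."""
      unknown = [name for name in hvparams if name not in cls.PARAMETERS]
      if unknown:
        raise ValueError("Unknown hypervisor parameters: %s" % unknown)

    def ValidateParameters(self, hvparams):
      """Value check; runs on the target node, so it may inspect the
      local filesystem (e.g. whether kernel_path actually exists)."""
      kernel = hvparams.get("kernel_path")
      if kernel and not os.path.isfile(kernel):
        raise ValueError("Kernel %s not found on this node" % kernel)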

Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on
  instance's hvparams and cluster's ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final value (noded code doesn't
know about defaults).

LU code will need to self-call the transformation, if needed.

    
960
Opcode changes
961
++++++++++++++
962

    
963
The parameter changes will have impact on the OpCodes, especially on
964
the following ones:
965

    
966
- ``OpInstanceCreate``, where the new hv and be parameters will be sent
967
  as dictionaries; note that all hv and be parameters are now optional,
968
  as the values can be instead taken from the cluster
969
- ``OpInstanceQuery``, where we have to be able to query these new
970
  parameters; the syntax for names will be ``hvparam/$NAME`` and
971
  ``beparam/$NAME`` for querying an individual parameter out of one
972
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
973
  dictionaries
974
- ``OpModifyInstance``, where the the modified parameters are sent as
975
  dictionaries
976

    
977
Additionally, we will need new OpCodes to modify the cluster-level
978
defaults for the be/hv sets of parameters.
979

    
980
Caveats
981
+++++++
982

    
983
One problem that might appear is that our classification is not
984
complete or not good enough, and we'll need to change this model. As
985
the last resort, we will need to rollback and keep 1.2 style.
986

    
987
Another problem is that classification of one parameter is unclear
988
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
989
the risk of having to move parameters later between classes.
990

    
991
Security
992
++++++++
993

    
994
The only security issue that we foresee is if some new parameters will
995
have sensitive value. If so, we will need to have a way to export the
996
config data while purging the sensitive value.
997

    
998
E.g. for the drbd shared secrets, we could export these with the
999
values replaced by an empty string.

Node flags
~~~~~~~~~~

Ganeti 2.0 adds three node flags that change the way nodes are handled
within Ganeti and the related infrastructure (iallocator interaction,
RAPI data export).

*master candidate* flag
+++++++++++++++++++++++

Ganeti 2.0 allows more scalability in operation by introducing
parallelization. However, a new bottleneck is reached, namely the
synchronization and replication of cluster configuration to all nodes
in the cluster.

This breaks scalability as the speed of the replication decreases
roughly with the number of nodes in the cluster. The goal of the
master candidate flag is to change this O(n) into O(1) with respect to
job and configuration data propagation.

Only nodes having this flag set (let's call this set of nodes the
*candidate pool*) will have jobs and configuration data replicated.

The cluster will have a new parameter (runtime changeable) called
``candidate_pool_size`` which represents the number of candidates the
cluster tries to maintain (preferably automatically).

This will impact the cluster operations as follows:

- jobs and config data will be replicated only to a fixed set of nodes
- master fail-over will only be possible to a node in the candidate pool
- cluster verify needs changing to account for these two roles
- external scripts will no longer have access to the configuration
  file (this is not recommended anyway)


The caveats of this change are:

- if all candidates are lost (completely), cluster configuration is
  lost (but it should be backed up external to the cluster anyway)

- failed nodes which are candidates must be dealt with properly, so
  that we don't lose too many candidates at the same time; this will be
  reported in cluster verify

- the 'all equal' concept of Ganeti is no longer true

- the partial distribution of config data means that all nodes will
  have to revert to ssconf files for master info (as in 1.2)

Advantages:

- speed on a 100+ nodes simulated cluster is greatly enhanced, even
  for a simple operation; ``gnt-instance remove`` on a diskless instance
  goes from ~9 seconds to ~2 seconds

- node failure of non-candidates will be less impacting on the cluster

The default value for the candidate pool size will be set to 10 but
this can be changed at cluster creation and modified any time later.

Testing on simulated big clusters with sequential and parallel jobs
shows that this value (10) is a sweet spot from a performance and load
point of view.

*offline* flag
++++++++++++++

In order to better support the situation in which nodes are offline
(e.g. for repair) without altering the cluster configuration, Ganeti
needs to be told about this state and needs to handle it properly for
such nodes.

This will result in simpler procedures, and fewer mistakes, when the
amount of node failures is high on an absolute scale (either due to a
high failure rate or simply big clusters).

Nodes having this attribute set will not be contacted for inter-node
RPC calls, will not be master candidates, and will not be able to host
instances as primaries.

Setting this attribute on a node:

- will not be allowed if the node is the master
- will not be allowed if the node has primary instances
- will cause the node to be demoted from the master candidate role (if
  it was), possibly causing another node to be promoted to that role

This attribute will impact the cluster operations as follows:

- querying these nodes for anything will fail instantly in the RPC
  library, with a specific RPC error (RpcResult.offline == True)

- they will be listed in the Other section of cluster verify

The code is changed in the following ways:

- RPC calls were converted to skip such nodes:

  - RpcRunner-instance-based RPC calls are easy to convert

  - static/classmethod RPC calls are harder to convert, and were left
    alone

- the RPC results were unified so that this new result state (offline)
  can be differentiated

- master voting still queries in-repair nodes, as we need to ensure
  consistency in case the (wrong) masters have old data, and nodes have
  come back from repairs

Caveats:

- some operation semantics are less clear (e.g. what to do on instance
  start with offline secondary?); for now, these will just fail as if
  the flag is not set (but faster)
- 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get a
  quorum there)

One of the advantages of implementing this flag is that it will allow
automation tools in the future to automatically put the node in
repairs and recover from this state, and the code (should/will) handle
this much better than just timing out. So, future possible
improvements (for later versions):

- watcher will detect nodes which fail RPC calls, will attempt to ssh
  to them, and if this fails will put them offline
- watcher will try to ssh and query the offline nodes, and if successful
  will take them off the repair list

Alternatives considered: The RPC call model in 2.0 is, by default,
much nicer - errors are logged in the background, and job/opcode
execution is clearer, so we could simply not introduce this. However,
having this state will make both the codepaths clearer (offline
vs. temporary failure) and the operational model (it's not a node with
errors, but an offline node).


*drained* flag
++++++++++++++

Due to parallel execution of jobs in Ganeti 2.0, we could have the
following situation:

- gnt-node migrate + failover is run
- gnt-node evacuate is run, which schedules a long-running 6-opcode
  job for the node
- partway through, a new job comes in that runs an iallocator script,
  which finds the above node as empty and a very good candidate
- gnt-node evacuate has finished, but now it has to be run again, to
  clean the above instance(s)

In order to prevent this situation, and to be able to get nodes into
proper offline status easily, a new *drained* flag was added to the
nodes.

This flag (which actually means "is being, or was, drained, and is
expected to go offline"), will prevent allocations on the node, but
otherwise all other operations (start/stop instance, query, etc.) are
working without any restrictions.

Interaction between flags
+++++++++++++++++++++++++

While these flags are implemented as separate flags, they are
mutually-exclusive and are acting together with the master node role
as a single *node status* value. In other words, a node is only in one
of these roles at a given time. The lack of any of these flags denotes
a regular node.

The current node status is visible in the ``gnt-cluster verify``
output, and the individual flags can be examined via separate flags in
the ``gnt-node list`` output.

These new flags will be exported in both the iallocator input message
and via RAPI; see the respective man pages for the exact names.

Feature changes
---------------

The main feature-level changes will be:

- a number of disk related changes
- removal of fixed two-disk, one-nic per instance limitation

Disk handling changes
~~~~~~~~~~~~~~~~~~~~~

The storage options available in Ganeti 1.x were introduced based on
then-current software (first DRBD 0.7 then later DRBD 8) and the
estimated usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as
'rare' events, and that each of them needs to be manually handled in
order to ensure data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or
  cluster size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
  could automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal,
but our goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version
  (0.7), such that removing the support for the ``remote_raid1``
  template and focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a
  static assignment (done at either instance creation time or change
  secondary time) will change the disk activation time from O(n) to
  O(1), which on big clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage
  types are ultimately backed by LVM volumes) by introducing file-based
  storage

Additionally, a number of smaller enhancements are also planned:

- support variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restrictions on failover only to the secondary
  which creates very strict rules on cluster allocation

DRBD minor allocation
+++++++++++++++++++++

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
device.

At device activation, if the minor is already in use, we check if
it has our parameters; if not, we just destroy the device (if
possible, otherwise we abort) and start it with our own
parameters.

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created DRBD minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.
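
The static allocation itself is a simple first-gap computation over
the configuration; a minimal sketch (the list of used minors is
assumed to come from the config, not from scanning devices)::

  def compute_free_minor(used_minors):
    """Return the lowest DRBD minor not assigned to this node yet."""
    used = set(used_minors)
    minor = 0
    while minor in used:
      minor += 1
    return minor

  # e.g. compute_free_minor([0, 1, 3]) == 2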

Removal of obsolete device types (MD, DRBD7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
DRBD7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in MD+DRBD mode, we cannot do live failover (not even if we
had MD+DRBD8).

File-based storage support
++++++++++++++++++++++++++

Using files instead of logical volumes for instance storage would
allow us to get rid of the hard requirement for volume groups for
testing clusters and it would also allow usage of SAN storage to do
live failover taking advantage of this storage solution.

Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks, however it introduces the problem that an instance could
end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate as close as possible all the storage of
one instance. We will still allow the logical volumes to spill over to
additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it's not easy enough to compute the
current disk map so we'll not attempt the clustering.
1310

    
1311
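As an illustration of the batched allocation, the sketch below plans
all of an instance's logical volumes against a single physical volume
when one has enough free space, and otherwise falls back to LVM's
default spreading; the helper name and its inputs are illustrative
assumptions, not the final implementation::

  def plan_instance_lvs(vg_name, lvs, pvs):
      """Plan ``lvcreate`` invocations for one instance's volumes.

      lvs is a list of (lv_name, size_mb) pairs and pvs a list of
      (pv_name, free_mb) pairs; both are illustrative inputs.
      """
      total = sum(size for _, size in lvs)
      # Prefer a single PV that can hold all volumes of the instance,
      # so the data and metadata of a DRBD disk share a physical disk.
      fitting = [name for name, free in pvs if free >= total]
      pin = [fitting[0]] if fitting else []
      return [["lvcreate", "-L", "%dM" % size, "-n", name, vg_name] + pin
              for name, size in lvs]

  # Both the data and the meta LV of disk 0 are targeted at /dev/sdb1.
  print(plan_instance_lvs("xenvg",
                          [("disk0_data", 10240), ("disk0_meta", 128)],
                          [("/dev/sda1", 8192), ("/dev/sdb1", 20480)]))
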
DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this more to prevent connecting to the wrong
peer than to secure the connection. Even though we have never had
issues with wrong connections, it would be good to implement this.

LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` operation, using the
following method (a sketch of the decision logic is given after the
list):

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected

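The sketch below captures the same decision logic; the inputs (the
per-node consistency result and the previously saved status) and the
function name are illustrative assumptions::

  def lvm_self_repair_action(consistency, previously_consistent):
      """Decide the self-repair action for one verification round.

      consistency maps node name to True if its volume group is
      consistent; previously_consistent is the saved result of the
      previous round. Returns (new_status, node_to_repair_or_None).
      """
      if all(consistency.values()):
          return True, None          # everything consistent, save status
      if not previously_consistent:
          return False, None         # already known to be inconsistent
      bad = [node for node, ok in consistency.items() if not ok]
      if len(bad) != 1:
          return False, None         # more than one bad node, hands off
      # Exactly one newly inconsistent node: run ``vgreduce
      # --removemissing`` on it and log the occurrence for monitoring.
      return False, bad[0]

  # node2 has just lost a disk, so it is selected for repair.
  print(lvm_self_repair_action({"node1": True, "node2": False}, True))
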
Failover to any node
++++++++++++++++++++

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets reduced
  to a need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to
  another node, which also does the replace disks at the same
  step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no data,
  it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary role
  (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
  S2

Caveats: during the P2-S2 sync, a (non-transient) network error
will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.

Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
+++++++

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake the DRBD device of another instance, since
it always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.

Variable number of disk/NICs per instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variable number of disks
++++++++++++++++++++++++

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make a fully flexible disk
definition. This has less impact than it might seem at first sight:
only instance creation has a hard-coded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code is already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk from read-only to
read-write afterwards.

Variable number of NICs
+++++++++++++++++++++++

Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API which currently can export/import only one
instance, and the command line interface.

Interface changes
-----------------

There are two areas of interface changes: API-level changes (the OS
interface and the RAPI interface) and the command line interface
changes.

OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2.
The interface is composed of a series of scripts which get called with
certain parameters to perform OS-dependent operations on the cluster.
The current scripts are:

create
  called when a new instance is added to the cluster
export
  called to export an instance disk to a stream
import
  called to import from a stream to a new instance
rename
  called to perform the OS-specific operations necessary for renaming an
  instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example, they accept exactly one block and one swap device to operate
on rather than any number of generic block devices, they blindly assume
that an instance will have just one network interface, and they cannot
be configured to optimise the instance for a particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors and a
non-fixed number of network interfaces and disks, the OS interface
needs to change to transmit the appropriate amount of information
about an instance to its managing operating system, when operating on
it. Moreover, since some old assumptions commonly made in OS scripts
are no longer valid, we need to re-establish a common understanding of
what can and cannot be assumed about the Ganeti environment.

When designing the new OS API, our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to
support an OS, and make it easy to share code between them by making
their input uniform. We will also leave the current script structure
unchanged, as far as we can, and make a few of the scripts (import,
export and rename) optional. Most information will be passed to the
scripts through environment variables, for ease of access and at the
same time ease of using only the information a script needs.

The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs
to support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should
  prepare the block devices, and install them so that the new OS can
  boot under the specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format
  which can be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is
  similar to create, but the new instance should have the content of the
  export, rather than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename
  an instance.

If any optional script is not implemented, Ganeti will refuse to perform
the given operation on instances using the non-implementing OS. Of
course the create script is mandatory, and it doesn't make sense to
support either the export or the import operation but not both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2
and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0
  we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater
  than or equal to that of the export.
- Some scripts are not compulsory: if such a script is missing, the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).

Input
_____

Rather than using command line flags, as they do now, scripts will
accept their input from environment variables. We expect the following
input values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti can send the proper version in this
  parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
  'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to touch
  read-only disks, but they are passed so that the scripts know about
  them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide',
  'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop' or
  'file:blktap'
NIC_<N>_MAC
  MAC address for the Nth network interface
NIC_<N>_IP
  IP address for the Nth network interface, if available
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes.
  Currently the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
:manpage:`ganeti-os-api` man page. All these variables will be
available to all scripts.

Some scripts will need a bit more information to work. These will have
per-script variables, such as for example:

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The
  data must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The data
  must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance
name is always needed and we could pass it on the command line. On the
other hand, though, this would force scripts to both access the
environment and parse the command line, so we'll pass it via the
environment for uniformity.)

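As an illustration of this environment-based input, a fragment of a
hypothetical ``create`` script written in Python could read its
parameters as follows; the script body only prints what it would do
and is not part of the design itself::

  import os
  import sys

  def read_disks():
      """Collect the disk descriptions from the environment."""
      disks = []
      for idx in range(int(os.environ["DISK_COUNT"])):
          disks.append({
              "path": os.environ["DISK_%d_PATH" % idx],
              "access": os.environ.get("DISK_%d_ACCESS" % idx, "W"),
          })
      return disks

  if __name__ == "__main__":
      name = os.environ["INSTANCE_NAME"]
      for disk in read_disks():
          if disk["access"] != "W":
              continue  # never touch read-only disks
          # User-targeted messages belong on stderr, not stdout.
          sys.stderr.write("formatting %s for %s\n" % (disk["path"], name))
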
Output/Behaviour
________________

As discussed, scripts should send user-targeted information only to
stderr. The create and import scripts are supposed to format/initialise
the given block devices and install the correct instance data. The
export script is supposed to export instance data to stdout in a format
understandable by the import script. The data will be compressed by
Ganeti, so no compression should be done. The rename script should only
modify the instance's knowledge of what its name is.

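For example, a hypothetical ``export`` script could simply stream the
snapshot device to stdout, keeping status messages on stderr; this is
only a sketch of the intended calling convention::

  import os
  import shutil
  import sys

  device = os.environ["EXPORT_DEVICE"]
  index = os.environ["EXPORT_INDEX"]
  sys.stderr.write("exporting disk %s from %s\n" % (index, device))
  # Dump the raw device to stdout; Ganeti takes care of compression.
  with open(device, "rb") as src:
      shutil.copyfileobj(src, sys.stdout.buffer)
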
Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing a list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be
compatible with one version of the API and may maintain backwards
compatibility if it's feasible to do so. The numbers are one-per-line,
so an OS supporting both version 5 and version 20 will have a file
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

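A sketch of how such a file could be checked against the version(s)
Ganeti itself is compatible with is shown below; the function name and
the calling convention are illustrative assumptions::

  def os_api_compatible(version_file, ganeti_versions):
      """Check a 'ganeti_api_version' file, one number per line."""
      with open(version_file) as handle:
          os_versions = set(int(line) for line in handle if line.strip())
      # The OS is usable if it declares at least one API version that
      # this Ganeti release is compatible with.
      return bool(os_versions & set(ganeti_versions))
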
In addition to that, an OS will be able to declare that it supports
only a subset of the Ganeti hypervisors, by declaring them in the
'hypervisors' file.

Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just
dumps all disks and restores them. This can save work as most systems
will just do this, while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will
be enough space to store the information we need. If we discover that
this is not the case we may want to go to a more complex API such as
storing that information on the filesystem and providing the OS script
with the path to a file where it is encoded in some format.

Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti remote API (RAPI) was designed and deployed with the
Ganeti 1.2.5 release.  That version provided read-only access to the
cluster state. A fully functional read-write API demands significant
internal changes, which will be implemented in version 2.0.

We decided to implement the Ganeti RAPI in a RESTful way, which is
aligned with the key features we are looking for: it is a simple,
stateless, scalable and extensible paradigm for API implementation. As
transport it uses HTTP over SSL, and we are implementing it with JSON
encoding, but in a way that makes it possible to extend it and provide
any other encoding.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on
the same node and with the same permission level as the Ganeti master
daemon. Communication is done through the LUXI library to the master
daemon. In order to keep communication asynchronous, RAPI processes
two types of client requests:

- queries: the server is able to answer immediately
- job submission: some time is required for a useful response

In the query case, the requested data is sent back to the client in
the HTTP response body. Typical examples of queries would be: list of
nodes, instances, cluster info, etc.

In the case of job submission, the client receives a job ID, an
identifier which allows it to query the job's progress in the job
queue (see `Job Queue`_).

Internally, each exported object has a version identifier, which is
used as a state identifier in the HTTP ETag header field for
requests/responses, to avoid race conditions.

Resource representation
+++++++++++++++++++++++

The key difference of using REST instead of other API styles is that
REST requires separation of services via resources with unique
URIs. Each of them should have a limited amount of state and support
the standard HTTP methods: GET, POST, DELETE, PUT.

For example in Ganeti's case we can have a set of URIs:

 - ``/{clustername}/instances``
 - ``/{clustername}/instances/{instancename}``
 - ``/{clustername}/instances/{instancename}/tag``
 - ``/{clustername}/tag``

A GET request to ``/{clustername}/instances`` will return the list of
instances, a POST to ``/{clustername}/instances`` should create a new
instance, a DELETE ``/{clustername}/instances/{instancename}`` should
delete the instance, and a GET ``/{clustername}/tag`` should return the
cluster tags.

Each resource URI will have a version prefix. The resource IDs are to
be determined.

Internal encoding might be JSON, XML, or any other. The JSON encoding
fits nicely with the Ganeti RAPI needs. The client can request a
specific representation via the Accept field in the HTTP header.

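As an illustration of the query side, a client can use any HTTP
library; the sketch below uses the third-party ``requests`` package,
and the host name, port, credentials, certificate path, reply shape
and exact URI layout (the version prefix is omitted) are illustrative
assumptions::

  import requests

  RAPI = "https://cluster1.example.com:5080"

  # A query: the answer is returned directly in the response body.
  reply = requests.get(RAPI + "/cluster1/instances",
                       auth=("rapi-user", "secret"),
                       headers={"Accept": "application/json"},
                       verify="/etc/ganeti/rapi.pem")
  for instance in reply.json():
      print(instance)
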
REST uses HTTP as its transport and application protocol for resource
access. The set of possible responses is a subset of standard HTTP
responses.

The statelessness model provides additional reliability and
transparency to operations (e.g. only one request needs to be analyzed
to understand the in-progress operation, not a sequence of multiple
requests/responses).

Security
++++++++

With the write functionality, security becomes a much bigger issue.
The Ganeti RAPI uses basic HTTP authentication on top of an
SSL-secured connection to grant access to an exported resource. The
password is stored locally in an Apache-style ``.htpasswd`` file. Only
one level of privileges is supported.

Caveats
+++++++

The model detailed above for job submission requires the client to
poll periodically for updates to the job; an alternative would be to
allow the client to request a callback, or a 'wait for updates' call.

The callback model was not considered due to the following two issues:

- callbacks would require a new model of allowed callback URLs,
  together with a method of managing these
- callbacks only work when the client and the master are in the same
  security domain, and they fail in the other cases (e.g. when there is
  a firewall between the client and the RAPI daemon that only allows
  client-to-RAPI calls, which is usual in DMZ cases)

The 'wait for updates' method is not suited to the HTTP protocol,
where requests are supposed to be short-lived.

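Under the polling model, a client that has submitted a job would
simply re-read the job resource until it reaches a final state; a
sketch (assuming a job URI layout and a set of final status values
that are not part of this design) could look like this::

  import time

  import requests

  def wait_for_job(job_uri, auth, poll_interval=5.0):
      """Poll a submitted job until it reaches a final state."""
      while True:
          job = requests.get(job_uri, auth=auth,
                             headers={"Accept": "application/json"}).json()
          # The exact status values are an assumption of this sketch.
          if job["status"] in ("success", "error", "canceled"):
              return job
          time.sleep(poll_interval)
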
Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled.

- extend and modify command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that require these changes are, in no particular
order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. the second network
interface of the instance becoming the first or third and the like),
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.

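In other words, each device class of an instance behaves like a stack;
a trivial sketch of the intended semantics::

  disks = ["disk/0", "disk/1", "disk/2"]  # existing disks of an instance
  disks.append("disk/3")                  # 'add' always appends at the end
  disks.pop()                             # 'remove' always drops the last one
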
gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is::

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.

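Both the ``--net`` and the ``--disk`` option values follow the same
``$DEVNUM:$OPTION=$VALUE,...`` pattern and can be parsed with one small
helper; the sketch below is illustrative only and ignores the
``add``/``remove`` magic values described further down::

  def parse_device_option(value):
      """Parse a '$DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]' string."""
      if ":" in value:
          devnum, rest = value.split(":", 1)
          options = dict(item.split("=", 1) for item in rest.split(","))
      else:
          devnum, options = value, {}
      return int(devnum), options

  # --net 0:mac=auto,bridge=auto
  print(parse_device_option("0:mac=auto,bridge=auto"))
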
Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is::

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude suffix
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
  which case the default disk size will be used. If the size option is
  not specified, 'auto' is assumed. This option is not valid for all
  disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.

Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example::

  gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding devices are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the instance specified,
e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device
number 3 will be removed.

Example::

  gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example::

  gnt-instance modify --disk 2:access=r test-instance

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the ``--hypervisor`` option. The generic syntax of the hypervisor
option is as follows::

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the ``gnt-instance add`` command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.

Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the ``--hypervisor`` option of the ``gnt-instance modify``
command. However, the hypervisor type of an existing instance cannot
be changed; only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example::

  gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the defaults option to
  set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is::

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disk size for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).

Hypervisor cluster defaults
+++++++++++++++++++++++++++

The generic format of the hypervisor cluster wide default setting
option is::

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

.. vim: set textwidth=72 :