=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical
design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
       and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible code base for future developments.

Among these problems, the single-operation-at-a-time restriction is
the biggest issue with the current version of Ganeti. It is such a big
impediment in operating bigger clusters that one is often tempted to
remove the lock just to do a simple operation like start instance
while an OS installation is running.

    
Scalability problems
--------------------

Ganeti 1.2 has a single global lock, which is used for all cluster
operations.  This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for
  example failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
should run in parallel (for example because they happen on different
nodes) are actually stalling each other while waiting for the global
lock, without a real reason for that to happen.

One of the main causes of this global lock (besides the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no long-lived process in Ganeti
that can coordinate multiple operations. Each command tries to acquire
the so-called *cmd* lock and when it succeeds, it takes complete
ownership of the cluster configuration and state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
instances per node, which is no longer true with today's hardware.

Artificial restrictions
-----------------------

Ganeti 1.2 (and previous versions) have a fixed two-disk, one-NIC per
instance model. This is a purely artificial restriction, but it
touches so many areas (configuration, import/export, command line)
that removing it is better suited to a major release than a minor one.

Architecture issues
-------------------

The fact that each command is a separate process that reads the
cluster state, executes the command, and saves the new state is also
an issue on big clusters where the configuration data for the cluster
begins to be non-trivial in size.

Overview
========

In order to solve the scalability problems, a rewrite of the core
design of Ganeti is required. While the cluster operations themselves
won't change (e.g. start instance will do the same things), the way
these operations are scheduled internally will change radically.

    
The new design will change the cluster architecture to:

.. digraph:: "ganeti-2.0-architecture"

  compound=false
  concentrate=true
  mclimit=100.0
  nslimit=100.0
  edge[fontsize="8" fontname="Helvetica-Oblique"]
  node[width="0" height="0" fontsize="12" fontcolor="black" shape=rect]

  subgraph outside {
    rclient[label="external clients"]
    label="Outside the cluster"
  }

  subgraph cluster_inside {
    label="ganeti cluster"
    labeljust=l
    subgraph cluster_master_node {
      label="master node"
      rapi[label="RAPI daemon"]
      cli[label="CLI"]
      watcher[label="Watcher"]
      burnin[label="Burnin"]
      masterd[shape=record style=filled label="{ <luxi> luxi endpoint | master I/O thread | job queue | {<w1> worker| <w2> worker | <w3> worker }}"]
      {rapi;cli;watcher;burnin} -> masterd:luxi [label="LUXI" labelpos=100]
    }

    subgraph cluster_nodes {
        label="nodes"
        noded1 [shape=record label="{ RPC listener | Disk management | Network management | Hypervisor } "]
        noded2 [shape=record label="{ RPC listener | Disk management | Network management | Hypervisor } "]
        noded3 [shape=record label="{ RPC listener | Disk management | Network management | Hypervisor } "]
    }
    masterd:w2 -> {noded1;noded2;noded3} [label="node RPC"]
    cli -> {noded1;noded2;noded3} [label="SSH"]
  }

  rclient -> rapi [label="RAPI protocol"]

This differs from the 1.2 architecture by the addition of the master
daemon, which will be the only entity to talk to the node daemons.


Detailed design
===============

The changes for 2.0 can be split into roughly three areas:

    
- core changes that affect the design of the software
- features (or restriction removals) which do not have a wide
  impact on the design
- user-level and API-level changes which translate into differences for
  the operation of the cluster

Core changes
------------

The main changes will be switching from a per-process model to a
daemon-based model, where the individual gnt-* commands will be
clients that talk to this daemon (see `Master daemon`_). This will
allow us to get rid of the global cluster lock for most operations,
having instead a per-object lock (see `Granular locking`_). Also, the
daemon will be able to queue jobs, and this will allow the individual
clients to submit jobs without waiting for them to finish, and also
see the result of old requests (see `Job Queue`_).

Besides these major changes, another 'core' change, though not as
visible to the users, will be changing the model of object attribute
storage, separating it into namespaces (such that a Xen PVM instance
will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. For more details see
`Object parameters`_.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(S) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Master daemon
~~~~~~~~~~~~~

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

The master-daemon related interaction paths are:

- (CLI tools/RAPI daemon) and the master daemon, via the so-called
  *LUXI* API
- the master daemon and the node daemons, via the node RPC

    
There are also some additional interaction paths for exceptional cases:

- CLI tools might access the nodes via SSH (for ``gnt-cluster copyfile``
  and ``gnt-cluster command``)
- master failover is a special case when a non-master node will SSH
  and do node-RPC calls to the current master

The protocol between the master daemon and the node daemons will be
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
using a simple PUT/GET of JSON-encoded messages. This is done due to
difficulties in working with the Twisted framework and its protocols
in a multithreaded environment, which we can overcome by using a
simpler stack (see the caveats section).

The protocol between the CLI/RAPI and the master daemon will be a
custom one (called *LUXI*): on a UNIX socket on the master node, with
rights restricted by filesystem permissions, the CLI/RAPI will talk to
the master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a Python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.

For more details about the RAPI daemon see `Remote API changes`_, and
for the node daemon see `Node daemon changes`_.

.. _luxi:

The LUXI protocol
+++++++++++++++++

As described above, the protocol for making requests or queries to the
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
messages.

The choice of UNIX sockets was made in order to get rid of the need
for authentication and authorisation inside Ganeti; for 2.0, the
permissions on the Unix socket itself will determine the access
rights.

We will have two main classes of operations over this API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
still internally implemented with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized in the query-functions)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details of the actual operation list, see the `Job Queue`_.

Both requests and responses will consist of a JSON-encoded message
followed by the ``ETX`` character (ASCII decimal 3), which is not a
valid character in JSON messages and thus can serve as a message
delimiter. The contents of the messages will be a dictionary with two
fields:

:method:
  the name of the method called
:args:
  the arguments to the method, as a list (no keyword arguments allowed)

Responses will follow the same format, with the two fields being:

:success:
  a boolean denoting the success of the operation
:result:
  the actual result, or error message in case of failure

There are two special values for the result field:

- in the case that the operation failed, and this field is a list of
  length two, the client library will try to interpret it as an
  exception, the first element being the exception type and the second
  one the actual exception arguments; this will allow a simple method of
  passing Ganeti-related exceptions across the interface
- for the *WaitForChange* call (that waits on the server for a job to
  change status), if the result is equal to ``nochange`` instead of the
  usual result for this call (a list of changes), then the library will
  internally retry the call; this is done in order to differentiate
  internally between a hung master daemon and a job that simply has not
  changed

Users of the API that don't use the provided Python library should
take care of the above two cases.

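For such users, the framing is simple enough to implement directly. The
following is a minimal sketch of a LUXI-style client, assuming only what
is described above; the socket path is an illustrative placeholder and
the error handling is simplified::

  import json
  import socket

  ETX = b"\x03"  # ASCII decimal 3, the message delimiter

  def luxi_call(socket_path, method, args):
    """Send one request over the master socket and return the result."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(socket_path)
    try:
      request = json.dumps({"method": method, "args": args})
      sock.sendall(request.encode("utf-8") + ETX)
      buf = b""
      while ETX not in buf:  # read until the delimiter is seen
        data = sock.recv(4096)
        if not data:
          raise EnvironmentError("connection closed before end of message")
        buf += data
      response = json.loads(buf.split(ETX, 1)[0].decode("utf-8"))
    finally:
      sock.close()
    if not response["success"]:
      # a two-element list in the result encodes (exception type, arguments)
      raise RuntimeError(response["result"])
    return response["result"]

A complete client would additionally retry the *WaitForChange* call when
the result equals ``nochange``, as described above.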
    
Master daemon implementation
++++++++++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other threads (pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, and which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  (LUXI) to the clients, and are short-lived

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with stale configuration. In effect, the responsibility of
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

    - if any of the above is false, we prevent the current operation
      (i.e. we don't become the master)

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)

Since, due to exceptional conditions, we could have a situation in which
no node can become the master due to inconsistent data, we will have
an override switch for the master daemon startup that will assume the
current node has the right data and will replicate all the
configuration files to the other nodes.

**Note**: the above algorithm is by no means an election algorithm; it
is a *confirmation* of the master role currently held by a node.

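To illustrate the confirmation step, here is a simplified sketch of the
quorum check; ``query_node_fn`` stands in for the real node RPC and is
an assumption of this example::

  def confirm_master_role(node_list, my_name, my_serial, my_max_job_id,
                          is_failover, query_node_fn):
    """Return True if a quorum of nodes confirms we may assume mastership.

    query_node_fn(node) is assumed to return a dict with the keys
    'config_serial', 'max_job_id' and 'master', or None if unreachable.
    """
    quorum = len(node_list) // 2 + 1  # at least half plus one nodes
    votes = 0
    for node in node_list:
      answer = query_node_fn(node)
      if answer is None:
        continue  # unreachable nodes don't contribute votes
      data_ok = (answer["config_serial"] <= my_serial and
                 answer["max_job_id"] <= my_max_job_id)
      master_ok = is_failover or answer["master"] == my_name
      if data_ok and master_ok:
        votes += 1
    return votes >= quorum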
    
Logging
+++++++

The logging system will be switched completely to the standard Python
``logging`` module; currently it's logging-based, but exposes a
different API, which is just overhead. As such, the code will be
switched over to standard logging calls, and only the setup will be
custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one log file per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools

Since the :term:`watcher` will only submit jobs to the master for
startup of the instances, its log file will contain less information
than before, mainly that it will start the instance, but not the
results.

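As an illustration, the custom setup reduces to something along these
lines (the file name and format shown are examples only)::

  import logging

  def setup_logging(logfile, debug=False):
    """Configure the root logger to write to a single per-daemon file."""
    if debug:
      level = logging.DEBUG
    else:
      level = logging.INFO
    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger("")
    root.setLevel(level)
    root.addHandler(handler)

After such a setup, the daemons simply use the standard calls, e.g.
``logging.info("instance %s started", name)``.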
    
Node daemon changes
+++++++++++++++++++

The only change to the node daemon is that, since we need better
concurrency, we don't process the inter-node RPC calls in the node
daemon itself, but we fork and process each request in a separate
child.

Since we don't have many calls, and we only fork (not exec), the
overhead should be minimal.

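A sketch of this fork-per-request model, with the request handler being
a placeholder for the real backend function::

  import os
  import signal

  # let finished children be reaped automatically, avoiding zombies
  signal.signal(signal.SIGCHLD, signal.SIG_IGN)

  def handle_rpc_request(process_fn, request):
    """Process one inter-node RPC request in a forked child."""
    pid = os.fork()
    if pid == 0:
      # child: run the backend function, then exit without cleanup
      try:
        process_fn(request)
      finally:
        os._exit(0)
    # parent: return immediately and keep accepting new requests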
    
Caveats
+++++++

A discussed alternative is to keep the current individual processes
touching the cluster configuration model. The reasons we have not
chosen this approach are:

- the time spent reading and unserializing the cluster state
  today is not small enough that we can ignore it; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines of "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use Twisted, all the code needs to be 'twisted-ized' and written
  in an asynchronous manner, using deferreds; while this method works,
  it's not a common way to code and it requires that the entire process
  workflow is based around a single *reactor* (Twisted's name for a
  main loop)
- the more advanced granular locking that we want to implement would
  require, if written in the async manner, deep integration with the
  Twisted stack, to such an extent that business logic is inseparable
  from the protocol coding; we felt that this is an unreasonable
  request, and that a good protocol library should allow complete
  separation of low-level protocol calls and business logic; by
  comparison, the threaded approach combined with the HTTP(S) protocol
  required (for the first iteration) absolutely no changes from the 1.2
  code, and later changes for optimizing the inter-node RPC calls
  required just syntactic changes (e.g.  ``rpc.call_...`` to
  ``self.rpc.call_...``)

Another issue is with the Twisted API stability - during the Ganeti
1.x lifetime, we had to implement workarounds many times for changes
in the Twisted version, so that for example 1.2 is able to use both
Twisted 2.x and 8.x.

In the end, since we already had an HTTP server library for the RAPI,
we just reused that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a
Ganeti Cluster. In order for this to happen we need to make sure
concurrently run operations don't step on each other's toes and break
the cluster.

This design addresses how we are going to deal with locking so that:

- we preserve data coherency
- we prevent deadlocks
- we prevent job starvation

Reaching the maximum possible parallelism is a non-goal. We have
identified a set of operations that are currently bottlenecks and need
to be parallelised and have worked on those. In the future it will be
possible to address other needs, thus making the cluster more and more
parallel one step at a time.

This section only talks about parallelising Ganeti-level operations, aka
Logical Units, and the locking needed for that. Any other
synchronization lock needed internally by the code is outside its scope.

Library details
+++++++++++++++

The proposed library has these features:

- internally managing all the locks, making the implementation
  transparent from their usage
- automatically grabbing multiple locks in the right order (avoiding
  deadlocks)
- ability to transparently handle conversion to more granularity
- support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a
distributed operation. Therefore, in case of master failure, the
operations currently running will be aborted and the locks will be
lost; it remains to the administrator to clean up (if needed) the
operation result (e.g. make sure an instance is either installed
correctly or removed).

A corollary of this is that a master-failover operation with both
masters alive needs to happen while no operations are running, and
therefore no locks are held.

All the locks will be represented by objects (like
``lockings.SharedLock``), and the individual locks for each object
will be created at initialisation time, from the config file.

The API will have a way to grab one or more locks at the same time.
Any attempt to grab a lock while already holding one in the wrong
order will be checked for, and fail.


The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and
the node locks before the config lock. Locks will need to be acquired at
the same time for multiple instances and nodes, and internal ordering
will be dealt with within the locking library, which, for simplicity,
will just use alphabetical order.

Each lock has the following three possible statuses:

- unlocked (anyone can grab the lock)
- shared (anyone can grab/have the lock but only in shared mode)
- exclusive (no one else can grab/have the lock)

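The ordering rules can be illustrated with a short sketch; the lock
objects and their ``acquire(shared=...)`` signature are stand-ins for
the actual locking library::

  def acquire_locks(instance_locks, node_locks, config_lock, shared=False):
    """Acquire a set of locks in the globally defined order.

    Instance locks are taken before node locks, node locks before the
    config lock; within one level, alphabetical order of the lock names
    is used.
    """
    acquired = []
    try:
      for level in (instance_locks, node_locks):
        for name in sorted(level):
          level[name].acquire(shared=shared)
          acquired.append(level[name])
      config_lock.acquire(shared=shared)
      acquired.append(config_lock)
      return acquired
    except:
      # on failure, release whatever was already taken, in reverse order
      for lock in reversed(acquired):
        lock.release()
      raise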
    
Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time
we split a lock into smaller ones we'll create a "metalock", which will
depend on those sub-locks and live for the time necessary for all the
code to convert (or forever, in some conditions). When a metalock exists
all converted code must acquire it in shared mode, so it can run
concurrently, but still be exclusive with old code, which acquires it
exclusively.

In the beginning the only such lock will be what replaces the current
"command" lock, and will acquire all the locks in the system, before
proceeding. This lock will be called the "Big Ganeti Lock" because
holding that one will avoid any other concurrent Ganeti operations.

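As an illustration of how old and converted code would interact through
such a metalock, consider the following sketch (the lock objects and the
operation functions are stand-ins, not real Ganeti code)::

  # old, unconverted code: takes the Big Ganeti Lock exclusively and is
  # therefore serialised against every other operation
  bgl.acquire(shared=False)
  try:
    run_old_style_operation()
  finally:
    bgl.release()

  # converted code: holds the metalock in shared mode (so several such
  # operations can run concurrently) plus only the locks it really needs
  bgl.acquire(shared=True)
  try:
    instance_lock.acquire(shared=False)
    try:
      run_converted_operation()
    finally:
      instance_lock.release()
  finally:
    bgl.release()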
    
We might also want to devise more metalocks (e.g. all nodes, all
nodes+config) in order to make it easier for some parts of the code to
acquire what they need without specifying it explicitly.

In the future things like the node locks could become metalocks, should
we decide to split them into an even more fine-grained approach, but
this will probably be only after the first 2.0 version has been
released.

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be
added to the list. The relevant code will need to inform the locking
library of such a change.

This needs to be compatible with every other lock in the system,
especially metalocks that guarantee to grab sets of resources without
specifying them explicitly. The implementation of this will be handled
in the locking library itself.

When instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code
which removes them must own them exclusively already, and thus deals
with metalocks exactly as normal code acquiring those locks. Any
operation queuing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block until the needed locks are held, and only
fail if the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- try to acquire this lock set and fail if not possible
- try to acquire one of these lock sets and return the first one you
  were able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available
locks, rather than making them just blindly queue for acquiring them.
The inherent risk, though, is that any code using the first operation,
or setting a timeout for the second one, is susceptible to starvation
and thus may never be able to get the required locks and complete
certain tasks. Considering this, providing/using these operations should
not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical
Unit level.  In the future we may want to split logical units into
independent "tasklets" with their own locking requirements. A different
design doc (or mini design doc) will cover the move from Logical Units
to tasklets.

Code examples
+++++++++++++

In general when acquiring locks we should use a code path equivalent
to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoid possible deadlocks. Of
course extra care must be taken not to leave, if possible, locked
structures in an unusable state. Note that with Python 2.5 a simpler
syntax will be possible, but we want to keep compatibility with Python
2.4 so the new constructs should not be used.

In order to avoid this extra indentation and code changes everywhere in
the Logical Units code, we decided to allow LUs to declare locks, and
then execute their code with their locks acquired. In the new world LUs
are called like this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation
on how locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock
everything" lock to the new system, though, needs to be carefully
scrutinised to be sure it is really acquiring all the necessary locks,
and none has been overlooked or forgotten.

The code can contain other locks outside of this library, to synchronise
other threaded code (e.g. for the job queue) but in general these should
be leaf locks or carefully structured non-leaf ones, to avoid deadlock
race conditions.


.. _jqueue-original-design:

Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations; we also need a
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes`` which are the basic
element of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).


Job execution - "Life of a Ganeti job"
++++++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated
   and assigned to the job. The job is then automatically replicated
   [#replic]_ to all nodes in the cluster. The identifier is returned to
   the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job
   has to wait and the first worker finishing its work will grab it.
   Otherwise any of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC
   function. Log messages may be shown to the user. Until the job is
   started, it can also be canceled.
#. As soon as the job is finished, its final result and status can be
   retrieved from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency
   across all nodes in the system; the master node only differs in the
   fact that now it is running the master daemon, but if it fails and we
   do a master failover, the jobs are still visible on the new master
   (though marked as failed).

Failures to replicate a job to other nodes will only be flagged as
errors in the master daemon log if more than half of the nodes failed,
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.


Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily,
  since all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- single big file with all job data: not feasible due to difficult
  updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we
  keep consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary
file and subsequent renaming. Except for log messages, every change in a
job is stored and replicated to other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    [...]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)

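The atomic-replacement rule above is the usual write-to-a-temporary-file
followed by rename pattern; a minimal sketch (the temporary naming and
the explicit fsync are illustrative choices, not a prescription of the
actual implementation)::

  import os

  def write_job_file(file_name, data):
    """Atomically replace a job file with new serialized contents."""
    tmp_name = file_name + ".tmp"
    fd = open(tmp_name, "w")
    try:
      fd.write(data)
      fd.flush()
      os.fsync(fd.fileno())  # make sure the data hits the disk first
    finally:
      fd.close()
    # rename() is atomic on POSIX filesystems, so readers see either the
    # old or the new file contents, never a partially written file
    os.rename(tmp_name, file_name)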
    
Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more
than one thread and must be thread-safe. For simplicity, a single lock
is used for the whole job queue.

A more detailed description can be found in doc/locking.rst.


Internal RPC
++++++++++++

RPC calls available between Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the
following operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The
  identifier is guaranteed to be unique during the lifetime of a
  cluster.
WaitForJobChange(job_id, fields, [...], timeout)
  This function waits until a job changes or a timeout expires. The
  condition for when a job changed is defined by the fields passed and
  the last log message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if
  the job is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the .../archive/ directory. This operation will fail
  if the job has not been canceled or finished.

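A schematic client-side flow combining these calls could look as
follows; the loop is intentionally simplified, since the exact return
values (and the elided arguments of ``WaitForJobChange``) belong to the
client library rather than to this document, and the helper shown is
hypothetical::

  # submit a job made of a list of opcodes and remember its identifier
  job_id = SubmitJob(ops)

  # wait without polling until the job reaches a final status
  finished = False
  while not finished:
    job_info = WaitForJobChange(job_id, ["status"], timeout=60)
    finished = job_in_final_state(job_info)  # hypothetical helper

  # fetch the final result, then move the job out of the active queue
  (job_result,) = QueryJobs([job_id], ["status"])
  ArchiveJob(job_id)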
    
Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to
the Error status once the master starts again.


History
+++++++

Archived jobs are kept in a separate directory,
``/var/lib/ganeti/queue/archive/``.  This is done in order to speed up
the queue handling: by default, the jobs in the archive are not
touched by any functions. Only the current (unarchived) jobs are
parsed, loaded, and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates with changes
in the job queue structure. In order to allow this, there will be a
way to prevent new jobs entering the queue.


Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for :term:`PVM` which makes no sense for :term:`HVM`).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence at all the
  hypervisor behaviour.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix
  "BE\_" and the whole list of parameters will exist in the set
  "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of a
  LV, or the MAC of a NIC)

As a general rule, for all kinds of parameters, "None" (or in
JSON-speak, "nil") will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

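As an illustration of this separation into namespaces, a serialized
instance could conceptually look like the following sketch (the
parameter names and values are made up for the example)::

  {
    "name": "instance1.example.com",      # proper parameter
    "primary_node": "node1.example.com",  # proper parameter
    "hypervisor": "xen-hvm",              # proper parameter
    "beparams": {                         # backend parameters (BE_*)
      "memory": 512,
      "vcpus": 1,
    },
    "hvparams": {                         # hypervisor-specific parameters
      "boot_order": "cd",
      "acpi": True,
    },
  }

Following the rule above, only the non-default entries of ``beparams``
and ``hvparams`` would actually be present in the serialized object.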
    
Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default value for backend parameters

Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.

Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split into three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The "hvparams" and "beparams" are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
``instance.hvparams.hvm_vnc_console_port`` and
``instance.hvparams.kvm_vnc_console_port``.

There are some special cases related to disks and NICs (for example):
a disk has both Ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper-instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have such hypervisor parameters only
globally per instance, and not per disk (e.g. all NICs will be
exported as being of the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)


Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features:

:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
  method that can be called from within master code (i.e. cmdlib) and
  should be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)

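A sketch of a hypervisor support class implementing this extended API;
the class and parameter names are examples, not the real
implementations::

  import os

  class ExampleHypervisor:
    """Illustrative hypervisor support class."""

    PARAMETERS = ["kernel_path", "acpi"]

    @classmethod
    def CheckParamSyntax(cls, hvparams):
      # name-level check; safe to run on the master (cmdlib)
      invalid = [name for name in hvparams if name not in cls.PARAMETERS]
      if invalid:
        raise ValueError("Invalid hypervisor parameters: %s" % invalid)

    def ValidateParameters(self, hvparams):
      # value-level check; runs on the target node (backend.py), so it
      # may look at node-local state such as the kernel's existence
      kernel = hvparams.get("kernel_path")
      if kernel is not None and not os.path.isfile(kernel):
        raise ValueError("Kernel %s not found on this node" % kernel)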
    
Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on
  instance's hvparams and cluster's ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the
  beparams dict, based on the instance and cluster beparams

The FillHV/BE transformations will be used, for example, in the
RpcRunner when sending an instance for activation/stop, and the sent
instance hvparams/beparams will have the final value (noded code doesn't
know about defaults).

LU code will need to self-call the transformation, if needed.

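The 'filling' is essentially a dictionary merge in which instance-level
values override the cluster defaults; conceptually (this is a sketch,
not the actual methods)::

  def fill_dict(defaults, custom):
    """Return a copy of the defaults, overridden by the custom values."""
    result = defaults.copy()
    result.update(custom)
    return result

  # conceptually:
  #   filled_hv = fill_dict(cluster.hvparams[instance.hypervisor],
  #                         instance.hvparams)
  #   filled_be = fill_dict(cluster.beparams["default"], instance.beparams)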
    
Opcode changes
++++++++++++++

The parameter changes will have an impact on the OpCodes, especially on
the following ones:

- ``OpInstanceCreate``, where the new hv and be parameters will be sent
  as dictionaries; note that all hv and be parameters are now optional,
  as the values can be instead taken from the cluster
- ``OpInstanceQuery``, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
  dictionaries
- ``OpModifyInstance``, where the modified parameters are sent as
  dictionaries

Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.

Caveats
+++++++

One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
the last resort, we will need to roll back and keep the 1.2 style.

Another problem is that the classification of some parameters is
unclear (e.g. ``network_port``, is this BE or HV?); in this case we'll
take the risk of having to move parameters later between classes.

Security
++++++++

The only security issue that we foresee is if some new parameters will
have sensitive values. If so, we will need to have a way to export the
config data while purging the sensitive values.

E.g. for the DRBD shared secrets, we could export these with the
values replaced by an empty string.

Node flags
~~~~~~~~~~

Ganeti 2.0 adds three node flags that change the way nodes are handled
within Ganeti and the related infrastructure (iallocator interaction,
RAPI data export).

*master candidate* flag
+++++++++++++++++++++++

Ganeti 2.0 allows more scalability in operation by introducing
parallelization. However, a new bottleneck is reached: the
synchronization and replication of cluster configuration to all nodes
in the cluster.

This breaks scalability, as the speed of the replication decreases
roughly with the number of nodes in the cluster. The goal of the
master candidate flag is to change this O(n) into O(1) with respect to
job and configuration data propagation.

Only nodes having this flag set (let's call this set of nodes the
*candidate pool*) will have jobs and configuration data replicated.

The cluster will have a new parameter (runtime changeable) called
``candidate_pool_size`` which represents the number of candidates the
cluster tries to maintain (preferably automatically).

This will impact the cluster operations as follows:

- jobs and config data will be replicated only to a fixed set of nodes
- master fail-over will only be possible to a node in the candidate pool
- cluster verify needs changing to account for these two roles
- external scripts will no longer have access to the configuration
  file (this is not recommended anyway)

The caveats of this change are:

- if all candidates are lost (completely), cluster configuration is
  lost (but it should be backed up external to the cluster anyway)

- failed nodes which are candidates must be dealt with properly, so
  that we don't lose too many candidates at the same time; this will be
  reported in cluster verify

- the 'all equal' concept of Ganeti is no longer true

- the partial distribution of config data means that all nodes will
  have to revert to ssconf files for master info (as in 1.2)

Advantages:

- speed on a simulated 100+ node cluster is greatly enhanced, even
  for a simple operation; ``gnt-instance remove`` on a diskless instance
  goes from ~9 seconds to ~2 seconds

- node failure of non-candidates will have less impact on the cluster

The default value for the candidate pool size will be set to 10 but
this can be changed at cluster creation and modified any time later.

Testing on simulated big clusters with sequential and parallel jobs
shows that this value (10) is a sweet spot from a performance and load
point of view.
1101

    
1102
*offline* flag
1103
++++++++++++++
1104

    
1105
In order to support better the situation in which nodes are offline
1106
(e.g. for repair) without altering the cluster configuration, Ganeti
1107
needs to be told and needs to properly handle this state for nodes.
1108

    
1109
This will result in simpler procedures, and less mistakes, when the
1110
amount of node failures is high on an absolute scale (either due to
1111
high failure rate or simply big clusters).
1112

    
1113
Nodes having this attribute set will not be contacted for inter-node
1114
RPC calls, will not be master candidates, and will not be able to host
1115
instances as primaries.
1116

    
1117
Setting this attribute on a node:
1118

    
1119
- will not be allowed if the node is the master
1120
- will not be allowed if the node has primary instances
1121
- will cause the node to be demoted from the master candidate role (if
1122
  it was), possibly causing another node to be promoted to that role
1123

    
1124
This attribute will impact the cluster operations as follows:
1125

    
1126
- querying these nodes for anything will fail instantly in the RPC
1127
  library, with a specific RPC error (RpcResult.offline == True)
1128

    
1129
- they will be listed in the Other section of cluster verify
1130

    
1131
The code is changed in the following ways:
1132

    
1133
- RPC calls were be converted to skip such nodes:
1134

    
1135
  - RpcRunner-instance-based RPC calls are easy to convert
1136

    
1137
  - static/classmethod RPC calls are harder to convert, and were left
1138
    alone
1139

    
1140
- the RPC results were unified so that this new result state (offline)
1141
  can be differentiated
1142

    
1143
- master voting still queries in repair nodes, as we need to ensure
1144
  consistency in case the (wrong) masters have old data, and nodes have
1145
  come back from repairs
1146

    
1147
Caveats:
1148

    
1149
- some operation semantics are less clear (e.g. what to do on instance
1150
  start with offline secondary?); for now, these will just fail as if
1151
  the flag is not set (but faster)
1152
- 2-node cluster with one node offline needs manual startup of the
1153
  master with a special flag to skip voting (as the master can't get a
1154
  quorum there)
1155

    
1156
One of the advantages of implementing this flag is that it will allow
1157
in the future automation tools to automatically put the node in
1158
repairs and recover from this state, and the code (should/will) handle
1159
this much better than just timing out. So, future possible
1160
improvements (for later versions):
1161

    
1162
- watcher will detect nodes which fail RPC calls, will attempt to ssh
1163
  to them, if failure will put them offline
1164
- watcher will try to ssh and query the offline nodes, if successful
1165
  will take them off the repair list
1166

    
1167
Alternatives considered: The RPC call model in 2.0 is, by default,
1168
much nicer - errors are logged in the background, and job/opcode
1169
execution is clearer, so we could simply not introduce this. However,
1170
having this state will make both the codepaths clearer (offline
1171
vs. temporary failure) and the operational model (it's not a node with
1172
errors, but an offline node).
1173

    
1174

    
1175
*drained* flag
1176
++++++++++++++
1177

    
1178
Due to parallel execution of jobs in Ganeti 2.0, we could have the
1179
following situation:
1180

    
1181
- gnt-node migrate + failover is run
1182
- gnt-node evacuate is run, which schedules a long-running 6-opcode
1183
  job for the node
1184
- partway through, a new job comes in that runs an iallocator script,
1185
  which finds the above node as empty and a very good candidate
1186
- gnt-node evacuate has finished, but now it has to be run again, to
1187
  clean the above instance(s)
1188

    
1189
In order to prevent this situation, and to be able to get nodes into
1190
proper offline status easily, a new *drained* flag was added to the
1191
nodes.
1192

    
1193
This flag (which actually means "is being, or was drained, and is
1194
expected to go offline"), will prevent allocations on the node, but
1195
otherwise all other operations (start/stop instance, query, etc.) are
1196
working without any restrictions.
1197

    
1198
Interaction between flags
1199
+++++++++++++++++++++++++
1200

    
1201
While these flags are implemented as separate flags, they are
1202
mutually-exclusive and are acting together with the master node role
1203
as a single *node status* value. In other words, a flag is only in one
1204
of these roles at a given time. The lack of any of these flags denote
1205
a regular node.
1206

    
1207
The current node status is visible in the ``gnt-cluster verify``
1208
output, and the individual flags can be examined via separate flags in
1209
the ``gnt-node list`` output.
1210

    
1211
These new flags will be exported in both the iallocator input message
1212
and via RAPI, see the respective man pages for the exact names.
1213

    
1214
Feature changes
1215
---------------
1216

    
1217
The main feature-level changes will be:
1218

    
1219
- a number of disk related changes
1220
- removal of fixed two-disk, one-nic per instance limitation
1221

    
1222
Disk handling changes
1223
~~~~~~~~~~~~~~~~~~~~~
1224

    
1225
The storage options available in Ganeti 1.x were introduced based on
1226
then-current software (first DRBD 0.7 then later DRBD 8) and the
1227
estimated usage patters. However, experience has later shown that some
1228
assumptions made initially are not true and that more flexibility is
1229
needed.
1230

    
1231
One main assumption made was that disk failures should be treated as
1232
'rare' events, and that each of them needs to be manually handled in
1233
order to ensure data safety; however, both these assumptions are false:
1234

    
1235
- disk failures can be a common occurrence, based on usage patterns or
1236
  cluster size
1237
- our disk setup is robust enough (referring to DRBD8 + LVM) that we
1238
  could automate more of the recovery
1239

    
1240
Note that we still don't have fully-automated disk recovery as a goal,
1241
but our goal is to reduce the manual work needed.
1242

    
1243
As such, we plan the following main changes:
1244

    
1245
- DRBD8 is much more flexible and stable than its previous version
1246
  (0.7), such that removing the support for the ``remote_raid1``
1247
  template and focusing only on DRBD8 is easier
1248

    
1249
- dynamic discovery of DRBD devices is not actually needed in a cluster
1250
  that where the DRBD namespace is controlled by Ganeti; switching to a
1251
  static assignment (done at either instance creation time or change
1252
  secondary time) will change the disk activation time from O(n) to
1253
  O(1), which on big clusters is a significant gain
1254

    
1255
- remove the hard dependency on LVM (currently all available storage
1256
  types are ultimately backed by LVM volumes) by introducing file-based
1257
  storage
1258

    
1259
Additionally, a number of smaller enhancements are also planned:
1260
- support variable number of disks
1261
- support read-only disks
1262

    
1263
Future enhancements in the 2.x series, which do not require base design
1264
changes, might include:
1265

    
1266
- enhancement of the LVM allocation method in order to try to keep
1267
  all of an instance's virtual disks on the same physical
1268
  disks
1269

    
1270
- add support for DRBD8 authentication at handshake time in
1271
  order to ensure each device connects to the correct peer
1272

    
1273
- remove the restrictions on failover only to the secondary
1274
  which creates very strict rules on cluster allocation
1275

    
1276
DRBD minor allocation
1277
+++++++++++++++++++++
1278

    
1279
Currently, when trying to identify or activate a new DRBD (or MD)
1280
device, the code scans all in-use devices in order to see if we find
1281
one that looks similar to our parameters and is already in the desired
1282
state or not. Since this needs external commands to be run, it is very
1283
slow when more than a few devices are already present.
1284

    
1285
Therefore, we will change the discovery model from dynamic to
1286
static. When a new device is logically created (added to the
1287
configuration) a free minor number is computed from the list of
1288
devices that should exist on that node and assigned to that
1289
device.
1290
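Computing a free minor then becomes a purely local operation on the
configuration, without running any external commands; a sketch::

  def find_free_minor(used_minors):
    """Return the smallest DRBD minor not used by configured devices.

    used_minors is the set of minors of all devices that should exist
    on the node, taken from the cluster configuration.
    """
    minor = 0
    while minor in used_minors:
      minor += 1
    return minor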

    
1291
At device activation, if the minor is already in use, we check if
1292
it has our parameters; if not so, we just destroy the device (if
1293
possible, otherwise we abort) and start it with our own
1294
parameters.
1295

    
1296
This means that we in effect take ownership of the minor space for
1297
that device type; if there's a user-created DRBD minor, it will be
1298
automatically removed.
1299

    
1300
The change will have the effect of reducing the number of external
1301
commands run per device from a constant number times the index of the
1302
first free DRBD minor to just a constant number.
1303

    
Removal of obsolete device types (MD, DRBD7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
DRBD7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in MD+DRBD mode, we cannot do live failover (not even if we
had MD+DRBD8).

File-based storage support
++++++++++++++++++++++++++

Using files instead of logical volumes for instance storage would
allow us to get rid of the hard requirement for volume groups for
testing clusters, and it would also allow usage of SAN storage to do
live failover taking advantage of this storage solution.

Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks; however, it introduces the problem that an instance
could end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate as close as possible all the storage of
one instance. We will still allow the logical volumes to spill over to
additional disks as needed.

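A minimal sketch of what such batched allocation could look like:
collect all logical volumes of one instance and pass the same
preferred physical volume to LVM (``lvcreate`` restricts allocation to
the PVs listed after the volume group), falling back to an
unrestricted allocation if that fails. The helper name is illustrative
and the real implementation would live in the block device layer::

  import subprocess

  def create_instance_lvs(vg_name, lv_specs, preferred_pv=None):
    """Create all LVs of one instance as close together as possible.

    lv_specs is a list of (lv_name, size_in_mebibytes) tuples;
    preferred_pv, if given, restricts allocation to that physical
    volume so data and metadata end up on the same disk.
    """
    for lv_name, size_mb in lv_specs:
      cmd = ["lvcreate", "-L", "%dM" % size_mb, "-n", lv_name, vg_name]
      if preferred_pv:
        cmd.append(preferred_pv)
      result = subprocess.run(cmd)
      if result.returncode != 0 and preferred_pv:
        # spill over to the other disks if the preferred PV is full
        subprocess.run(["lvcreate", "-L", "%dM" % size_mb,
                        "-n", lv_name, vg_name], check=True)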
    
Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or when replacing individual disks, it's not easy enough to compute the
current disk map, so we'll not attempt the clustering.

DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this more to prevent connecting to the wrong
peer than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.

LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` step, using the following
method (a sketch of the decision logic follows the list):

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:

   #. if more than one node is inconsistent, do nothing
   #. if only one node is inconsistent:

      #. run ``vgreduce --removemissing``
      #. log this occurrence in the Ganeti log in a form that
         can be used for monitoring
      #. [FUTURE] run ``replace-disks`` for all
         instances affected

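A minimal sketch of the decision logic above, assuming a watcher-like
component that remembers the consistency status from its previous run
(all names here are illustrative, not existing Ganeti functions)::

  def check_vg_self_repair(current_ok, previous_ok, inconsistent_nodes):
    """Decide whether to run ``vgreduce --removemissing``.

    current_ok/previous_ok: booleans, whether all volume groups are
    consistent now and at the previous check; inconsistent_nodes is
    the list of nodes with a degraded volume group.
    """
    if current_ok:
      # nothing to repair; just record the (possibly changed) status
      return "save-status" if not previous_ok else "do-nothing"
    if not previous_ok:
      # it was already broken at the last check, don't act twice
      return "do-nothing"
    if len(inconsistent_nodes) != 1:
      # more than one node degraded: too risky to act automatically
      return "do-nothing"
    # exactly one node became inconsistent since the last check
    return "run-vgreduce-on-%s" % inconsistent_nodes[0]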
    
Failover to any node
++++++++++++++++++++

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this is reduced
  to the need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to
  another node, which also does the replace-disks at the same
  step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondary by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not
  possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward (a sketch
follows the list):

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing
  between P2 (P2 is indicated by the user) and S2; since P2 has no data,
  it will start re-syncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has
  started but before it has finished), we can promote it to primary role
  (r/w) and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
  S2

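The sequence above, as a minimal sketch; every helper used here is a
placeholder for the corresponding backend/RPC operation and is not an
existing Ganeti function::

  def failover_to_any(backend, instance, p2, s2, old_node):
    """Fail an instance over to an arbitrary node P2, keeping S2.

    backend is any object providing the low-level operations used
    below (placeholders for the real RPC/blockdev calls); p2 is the
    user-chosen new primary, s2 the surviving node kept as secondary,
    old_node the node whose data will be discarded.
    """
    if not backend.disks_consistent(instance, s2):
      raise RuntimeError("S2 does not have valid data, aborting")
    # re-wire the DRBD pairing; P2 starts empty and resyncs from S2
    backend.teardown_drbd(instance)
    backend.setup_drbd_pair(instance, primary=p2, secondary=s2)
    # once P2 reaches SyncTarget we can already promote it and start
    # the instance (or postpone this until the resync has finished)
    backend.wait_for_state(instance, p2, "SyncTarget")
    backend.promote_to_primary(instance, p2)
    backend.start_instance(instance, p2)
    # only after the resync is done is it safe to drop the old data
    backend.wait_for_sync(instance)
    backend.remove_old_disks(instance, old_node)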
    
Caveats: during the P2-S2 sync, a (non-transient) network error
will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.

Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
+++++++

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake the DRBD device of another instance, since
it always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation, such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.

    
Variable number of disk/NICs per instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variable number of disks
++++++++++++++++++++++++

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make the disk definition fully
flexible. This has less impact than it might seem at first sight: only
the instance creation code has a hard-coded number of disks, not the
disk handling code. The block device handling and most of the instance
handling code is already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not
(e.g. import/export) and the code needs a review to ensure safety.

    
The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk from read-only to
read-write afterward.

Variable number of NICs
+++++++++++++++++++++++

Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API which currently can export/import only one
instance, and the command line interface.

Interface changes
-----------------

There are two areas of interface changes: API-level changes (the OS
interface and the RAPI interface) and the command line interface
changes.

    
OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2.
The interface is composed of a series of scripts which get called with
certain parameters to perform OS-dependent operations on the cluster.
The current scripts are:

create
  called when a new instance is added to the cluster
export
  called to export an instance disk to a stream
import
  called to import from a stream to a new instance
rename
  called to perform the OS-specific operations necessary for renaming an
  instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for
example they accept exactly one block and one swap device to operate
on, rather than any number of generic block devices, they blindly
assume that an instance will have just one network interface, and they
cannot be configured to optimise the instance for a particular
hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors and a
non-fixed number of networks and disks, the OS interface needs to
change to transmit the appropriate amount of information about an
instance to its managing operating system, when operating on it.
Moreover, since some old assumptions usually made in OS scripts are no
longer valid, we need to re-establish a common understanding of what
can and cannot be assumed about the Ganeti environment.

When designing the new OS API, our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to
support an OS, and make it easy to share code between them by making
their input uniform. We will also leave the current script structure
unchanged, as far as we can, and make a few of the scripts (import,
export and rename) optional. Most information will be passed to the
scripts through environment variables, for ease of access and at the
same time ease of using only the information a script needs.

The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs
to support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should
  prepare the block devices, and install them so that the new OS can
  boot under the specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format
  which can be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is
  similar to create, but the new instance should have the content of the
  export, rather than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename
  an instance.

If any optional script is not implemented, Ganeti will refuse to perform
the given operation on instances using the non-implementing OS. Of
course the create script is mandatory, and it doesn't make sense to
support either the export or the import operation but not both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2
and the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0
  we'll use environment variables, as there will be a lot more
  information and not all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device
  the instance has, and import scripts once for every exported disk.
  Imported instances will be forced to have a number of disks greater
  than or equal to that of the export.
- Some scripts are not compulsory: if such a script is missing, the
  relevant operations will be forbidden for instances of that OS. This
  makes it easier to distinguish between unsupported operations and
  no-op ones (if any).

    
Input
_____

Rather than using command line flags, as they do now, scripts will
accept inputs from environment variables. We expect the following input
values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti can send the proper version in this
  parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm',
  'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to touch
  read-only disks, but they are passed so the scripts know about them.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide',
  'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop' or
  'file:blktap'
NIC_<N>_MAC
  MAC address for the Nth network interface
NIC_<N>_IP
  IP address for the Nth network interface, if available
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes.
  Currently the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more
may come during the implementation and they will be documented in the
:manpage:`ganeti-os-interface(7)` man page. All these variables will be
available to all scripts.

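As an illustration, a create script could consume these variables
along the following lines (a sketch of a hypothetical OS definition,
not a script shipped with Ganeti)::

  #!/usr/bin/env python
  # Hypothetical 'create' script: read the Ganeti-provided environment
  # and report what it would act on; a real script would partition,
  # format and populate the disks instead.
  import os
  import sys

  def main():
    name = os.environ["INSTANCE_NAME"]
    hypervisor = os.environ["HYPERVISOR"]
    disk_count = int(os.environ["DISK_COUNT"])
    nic_count = int(os.environ["NIC_COUNT"])
    # user-targeted messages go to stderr, never to stdout
    sys.stderr.write("Creating %s for %s\n" % (name, hypervisor))
    for idx in range(disk_count):
      path = os.environ["DISK_%d_PATH" % idx]
      access = os.environ["DISK_%d_ACCESS" % idx]
      if access == "W":
        sys.stderr.write("  would format %s\n" % path)
    for idx in range(nic_count):
      mac = os.environ["NIC_%d_MAC" % idx]
      sys.stderr.write("  NIC %d: %s\n" % (idx, mac))
    return 0

  if __name__ == "__main__":
    sys.exit(main())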
    
Some scripts will need a bit more information to work. These will have
per-script variables, such as for example:

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The
  data must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The data
  must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance
name is always needed and we could pass it on the command line. On the
other hand, though, this would force scripts to both access the
environment and parse the command line, so we'll move it to the
environment for uniformity.)

Output/Behaviour
________________

As discussed, scripts should send user-targeted information only to
stderr. The create and import scripts are supposed to format/initialise
the given block devices and install the correct instance data. The
export script is supposed to export instance data to stdout in a format
understandable by the import script. The data will be compressed by
Ganeti, so no compression should be done. The rename script should only
modify the instance's knowledge of what its name is.

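A matching sketch for the export side, using the per-script variables
above (again a hypothetical script, not part of Ganeti itself)::

  #!/usr/bin/env python
  # Hypothetical 'export' script: stream the snapshot device to stdout
  # uncompressed (Ganeti compresses the stream itself) and keep all
  # human-readable messages on stderr.
  import os
  import shutil
  import sys

  def main():
    device = os.environ["EXPORT_DEVICE"]
    index = os.environ["EXPORT_INDEX"]
    sys.stderr.write("Exporting disk %s (%s)\n" % (index, device))
    out = getattr(sys.stdout, "buffer", sys.stdout)
    with open(device, "rb") as src:
      shutil.copyfileobj(src, out)
    return 0

  if __name__ == "__main__":
    sys.exit(main())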
    
Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing the list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be
compatible with one version of the API and may maintain backwards
compatibility if it's feasible to do so. The numbers are one per line,
so an OS supporting both version 5 and version 20 will have a file
containing two lines. This is different from Ganeti 1.2, which only
supported one version number.

In addition to that, an OS will be able to declare that it supports
only a subset of the Ganeti hypervisors, by declaring them in the
'hypervisors' file.

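A minimal sketch of how these declarative files could be read when
loading an OS definition, assuming one entry per line in the
'hypervisors' file as well (the helper itself is illustrative)::

  import os

  def load_os_declarations(os_dir, our_api_versions, cluster_hvs):
    """Parse 'ganeti_api_version' and 'hypervisors' for one OS.

    Returns (api_ok, allowed_hvs): whether the OS implements an API
    version we speak, and which of the cluster's hypervisors it
    accepts (all of them if no 'hypervisors' file is present).
    """
    with open(os.path.join(os_dir, "ganeti_api_version")) as fd:
      declared = set(int(line) for line in fd if line.strip())
    api_ok = bool(declared & set(our_api_versions))

    hv_file = os.path.join(os_dir, "hypervisors")
    if os.path.exists(hv_file):
      with open(hv_file) as fd:
        wanted = set(line.strip() for line in fd if line.strip())
      allowed_hvs = [hv for hv in cluster_hvs if hv in wanted]
    else:
      allowed_hvs = list(cluster_hvs)
    return api_ok, allowed_hvs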
    
Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just
dumps all disks and restores them. This can save work as most systems
will just do this, while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will
be enough space to store the information we need. If we discover that
this is not the case, we may want to go to a more complex API, such as
storing this information on the filesystem and providing the OS script
with the path to a file where it is encoded in some format.

Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti remote API (RAPI) was designed and deployed with the
Ganeti 1.2.5 release. That version provides read-only access to the
cluster state. A fully functional read-write API demands significant
internal changes, which will be implemented in version 2.0.

We decided to implement the Ganeti RAPI in a RESTful way, which is
aligned with the key features we are looking for: it is a simple,
stateless, scalable and extensible paradigm for API implementation. As
transport it uses HTTP over SSL, and we are implementing it with JSON
encoding, but in a way that makes it possible to extend it and provide
any other encoding.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on the
same node and with the same permission level as the Ganeti master
daemon. Communication is done through the LUXI library to the master
daemon. In order to keep communication asynchronous, RAPI processes two
types of client requests:

- queries: the server is able to answer immediately
- job submission: some time is required for a useful response

In the query case, the requested data is sent back to the client in the
HTTP response body. Typical examples of queries would be: list of
nodes, instances, cluster info, etc.

In the case of job submission, the client receives a job ID, an
identifier which allows one to query the job progress in the job queue
(see `Job Queue`_).

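From a client's point of view, the two request types could look
roughly like this (a sketch only: the exact URIs, port and response
fields are illustrative, and the ``/jobs/<id>`` resource in particular
is hypothetical)::

  import json
  import time
  from urllib.request import Request, urlopen

  BASE = "https://cluster.example.com:5080"

  def rapi_get(path):
    # query: the answer is directly in the response body
    with urlopen(BASE + path) as resp:
      return json.loads(resp.read())

  def rapi_post(path, payload):
    # job submission: the response body only carries the job ID
    req = Request(BASE + path, data=json.dumps(payload).encode("utf-8"))
    with urlopen(req) as resp:
      return json.loads(resp.read())

  instances = rapi_get("/mycluster/instances")
  job_id = rapi_post("/mycluster/instances", {"name": "web1"})
  # poll the (hypothetical) job resource until a final status is seen
  while rapi_get("/jobs/%s" % job_id)["status"] not in ("success", "error"):
    time.sleep(1)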
    
Internally, each exported object has a version identifier, which is
used as a state identifier in the HTTP header E-Tag field for
requests/responses to avoid race conditions.

Resource representation
+++++++++++++++++++++++

The key difference of using REST instead of other API styles is that
REST requires the separation of services into resources with unique
URIs. Each of them should have a limited amount of state and support
the standard HTTP methods: GET, POST, DELETE, PUT.

For example, in Ganeti's case we can have a set of URIs:

 - ``/{clustername}/instances``
 - ``/{clustername}/instances/{instancename}``
 - ``/{clustername}/instances/{instancename}/tag``
 - ``/{clustername}/tag``

A GET request to ``/{clustername}/instances`` will return the list of
instances, a POST to ``/{clustername}/instances`` should create a new
instance, a DELETE of ``/{clustername}/instances/{instancename}`` should
delete the instance, and a GET of ``/{clustername}/tag`` should return
the cluster tags.

    
Each resource URI will have a version prefix. The resource IDs are to
be determined.

The internal encoding might be JSON, XML, or any other. The JSON
encoding fits Ganeti RAPI needs nicely. The client can request a
specific representation via the Accept field in the HTTP header.

REST uses HTTP as its transport and application protocol for resource
access. The set of possible responses is a subset of standard HTTP
responses.

The statelessness model provides additional reliability and
transparency to operations (e.g. only one request needs to be analyzed
to understand the in-progress operation, not a sequence of multiple
requests/responses).

Security
++++++++

With the write functionality, security becomes a much bigger issue.
The Ganeti RAPI uses basic HTTP authentication on top of an
SSL-secured connection to grant access to an exported resource. The
password is stored locally in an Apache-style ``.htpasswd`` file. Only
one level of privileges is supported.

Caveats
+++++++

The model detailed above for job submission requires the client to
poll periodically for updates to the job; an alternative would be to
allow the client to request a callback, or a 'wait for updates' call.

The callback model was not considered due to the following two issues:

- callbacks would require a new model of allowed callback URLs,
  together with a method of managing these
- callbacks only work when the client and the master are in the same
  security domain, and they fail in the other cases (e.g. when there is
  a firewall between the client and the RAPI daemon that only allows
  client-to-RAPI calls, which is usual in DMZ cases)

    
The 'wait for updates' method is not suited to the HTTP protocol,
where requests are supposed to be short-lived.

Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled:

- extend and modify the command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that require these changes are, in no particular
order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line
arguments will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

    
Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. the second network
interface of the instance becoming the first or third and the like),
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.

gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is::

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.

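The option value shares the same shape for ``--net`` and ``--disk``
(and, with ``gnt-instance modify``, the magic strings described
later). A small sketch of how such a value could be parsed, purely as
an illustration of the syntax::

  def parse_device_option(value):
    """Parse '<devnum>[:opt=val[,opt=val...]]' into (index, options).

    The index part may also be the magic strings 'add' or 'remove'
    when used with gnt-instance modify.
    """
    if ":" in value:
      index_part, opts_part = value.split(":", 1)
      options = dict(item.split("=", 1) for item in opts_part.split(","))
    else:
      index_part, options = value, {}
    if index_part in ("add", "remove"):
      index = index_part
    else:
      index = int(index_part)
    return index, options

  # --net 0:mac=auto,bridge=br0 -> (0, {'mac': 'auto', 'bridge': 'br0'})
  print(parse_device_option("0:mac=auto,bridge=br0"))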
    
Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is::

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude suffix
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
  which case the default disk size will be used. If the size option is
  not specified, 'auto' is assumed. This option is not valid for all
  disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.

    
Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example::

  gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding devices are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the instance specified,
e.g. if the instance has disk devices 0, 1, 2 and 3, disk device
number 3 will be removed.

Example::

  gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example::

  gnt-instance modify --disk 2:access=r test-instance

    
Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the ``--hypervisor`` option. The generic syntax of the hypervisor
option is as follows::

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the ``gnt-instance add`` command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.

Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the ``--hypervisor`` option of the ``gnt-instance modify``
command. However, the hypervisor type of an existing instance cannot
be changed; only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example::

  gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

    
gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the defaults option to
  set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is::

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disk size for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).

Hypervisor cluster defaults
+++++++++++++++++++++++++++

The generic format of the hypervisor cluster wide default setting
option is::

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

.. vim: set textwidth=72 :