=================
Ganeti 2.0 design
=================

This document describes the major changes in Ganeti 2.0 compared to
the 1.2 version.

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents::

Objective
=========

Ganeti 1.2 has many scalability issues and restrictions due to its
roots as software for managing small and 'static' clusters.

Version 2.0 will attempt to remedy first the scalability issues and
then the restrictions.

Background
==========

While Ganeti 1.2 is usable, it severely limits the flexibility of the
cluster administration and imposes a very rigid model. It has the
following main scalability issues:

- only one operation at a time on the cluster [#]_
- poor handling of node failures in the cluster
- mixing hypervisors in a cluster not allowed

It also has a number of artificial restrictions, due to historical design:

- fixed number of disks (two) per instance
- fixed number of NICs

.. [#] Replace disks will release the lock, but this is an exception
       and not a recommended way to operate

The 2.0 version is intended to address some of these problems, and
create a more flexible codebase for future developments.

Scalability problems
--------------------

Ganeti 1.2 has a single global lock, which is used for all cluster
operations.  This has been painful at various times, for example:

- It is impossible for two people to efficiently interact with a cluster
  (for example for debugging) at the same time.
- When batch jobs are running it's impossible to do other work (for example
  failovers/fixes) on a cluster.

This poses scalability problems: as clusters grow in node and instance
size it's a lot more likely that operations which one could conceive
should run in parallel (for example because they happen on different
nodes) are actually stalling each other while waiting for the global
lock, without a real reason for that to happen.

One of the main causes of this global lock (besides the higher
difficulty of ensuring data consistency in a more granular lock model)
is the fact that currently there is no "master" daemon in Ganeti. Each
command tries to acquire the so called *cmd* lock and when it
succeeds, it takes complete ownership of the cluster configuration and
state.

Other scalability problems are due to the design of the DRBD device
model, which assumed at its creation a low (one to four) number of
instances per node, which is no longer true with today's hardware.

Artificial restrictions
-----------------------

Ganeti 1.2 (and previous versions) have a fixed two-disk, one-NIC per
instance model. This is a purely artificial restriction, but it
touches so many areas (configuration, import/export, command line)
that removing it is better suited to a major release than a minor one.

Overview
========

In order to solve the scalability problems, a rewrite of the core
design of Ganeti is required. While the cluster operations themselves
won't change (e.g. start instance will do the same things), the way
these operations are scheduled internally will change radically.

Detailed design
===============

The changes for 2.0 can be split into roughly three areas:

- core changes that affect the design of the software
- features (or restriction removals) which do not have a wide
  impact on the design
- user-level and API-level changes which translate into differences for
  the operation of the cluster

Core changes
------------

The main change will be switching from a per-process model to a
daemon-based model, where the individual gnt-* commands will be
clients that talk to this daemon (see the design-2.0-master-daemon
document). This will allow us to get rid of the global cluster lock
for most operations, having instead a per-object lock (see
design-2.0-granular-locking). Also, the daemon will be able to queue
jobs, and this will allow the individual clients to submit jobs without
waiting for them to finish, and also see the result of old requests
(see design-2.0-job-queue).

Besides these major changes, another 'core' change, though not as
visible to the users, will be changing the model of object attribute
storage and separating it into namespaces (such that a Xen PVM
instance will not have the Xen HVM parameters). This will allow future
flexibility in defining additional parameters. More details are in the
design-2.0-cluster-parameters document.

The various changes brought in by the master daemon model and the
read-write RAPI will require changes to the cluster security; we move
away from Twisted and use HTTP(S) for intra- and extra-cluster
communications. For more details, see the security document in the
doc/ directory.

Master daemon
~~~~~~~~~~~~~

In Ganeti 2.0, we will have the following *entities*:

- the master daemon (on the master node)
- the node daemon (on all nodes)
- the command line tools (on the master node)
- the RAPI daemon (on the master node)

Interaction paths are between:

- (CLI tools/RAPI daemon) and the master daemon, via the so called *luxi* API
- the master daemon and the node daemons, via the node RPC

The protocol between the master daemon and the node daemons will be
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
messages. This is done due to difficulties in working with the Twisted
framework and its protocols in a multithreaded environment, which we
can overcome by using a simpler stack (see the caveats section). The
protocol between the CLI/RAPI and the master daemon will be a custom
one (called *luxi*): on a UNIX socket on the master node, with rights
restricted by filesystem permissions, the CLI/RAPI will talk to the
master daemon using JSON-encoded messages.

The operations supported over this internal protocol will be encoded
via a python library that will expose a simple API for its
users. Internally, the protocol will simply encode all objects in JSON
format and decode them on the receiver side.
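
As an illustration only, the JSON-over-UNIX-socket exchange could look like
the following sketch (the socket path, framing and method name used here are
placeholders, not the final protocol definition)::

  # Minimal sketch of a luxi-style client call; names are illustrative.
  import json
  import socket

  SOCKET_PATH = "/var/run/ganeti/master.sock"   # assumed location

  def call_method(method, args):
    """Send one JSON-encoded request and read back the JSON-encoded reply."""
    request = json.dumps({"method": method, "args": args})
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(SOCKET_PATH)
    try:
      sock.sendall(request.encode("utf-8") + b"\n")   # newline-delimited frame
      reply = sock.makefile().readline()              # one reply per request
      return json.loads(reply)
    finally:
      sock.close()

  # example: call_method("QueryClusterInfo", [])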

The LUXI protocol
+++++++++++++++++

We will have two main classes of operations over the master daemon API:

- cluster query functions
- job related functions

The cluster query functions are usually short-duration, and are the
equivalent of the OP_QUERY_* opcodes in Ganeti 1.2 (and they are
internally still implemented with these opcodes). The clients are
guaranteed to receive the response in a reasonable time via a timeout.

The job-related functions will be:

- submit job
- query job (which could also be categorized as a query function)
- archive job (see the job queue design doc)
- wait for job change, which allows a client to wait without polling

For more details, see the job queue design document.

Daemon implementation
+++++++++++++++++++++

The daemon will be based around a main I/O thread that will wait for
new requests from the clients, and that does the setup/shutdown of the
other thread (pools).

There will be two other classes of threads in the daemon:

- job processing threads, part of a thread pool, which are
  long-lived, started at daemon startup and terminated only at shutdown
  time
- client I/O threads, which are the ones that talk the local protocol
  to the clients
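
A minimal sketch of this thread layout, using only the standard library (the
pool size and names here are illustrative, not the final implementation),
could be::

  # Sketch: a fixed pool of long-lived workers fed by the main I/O thread.
  import threading
  import Queue  # 'queue' on Python 3

  job_queue = Queue.Queue()

  def job_worker():
    """Long-lived worker: take jobs off the queue and execute their opcodes."""
    while True:
      job = job_queue.get()
      if job is None:        # shutdown marker posted by the main thread
        break
      job.Execute()          # placeholder for the real job processing

  # started once at daemon startup, terminated only at shutdown time
  workers = [threading.Thread(target=job_worker) for _ in range(4)]
  for thread in workers:
    thread.start()

  # The main I/O thread accepts client connections and hands each one to a
  # client I/O thread, which speaks the local protocol and enqueues submitted
  # jobs via job_queue.put(job).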

Master startup/failover
+++++++++++++++++++++++

In Ganeti 1.x there is no protection against failing over the master
to a node with a stale configuration. In effect, the responsibility for
correct failovers falls on the admin. This is true both for the new
master and for when an old, offline master starts up.

Since in 2.x we are extending the cluster state to cover the job queue
and have a daemon that will execute the job queue by itself, we want
to have more resilience for the master role.

The following algorithm will run whenever a node is ready to
transition to the master role, either at startup time or at node
failover:

#. read the configuration file and parse the node list
   contained within

#. query all the nodes and make sure we obtain an agreement via
   a quorum of at least half plus one nodes for the following:

    - we have the latest configuration and job list (as
      determined by the serial number on the configuration and
      highest job ID on the job queue)

    - there is not even a single node having a newer
      configuration file

    - if we are not failing over (but just starting), the
      quorum agrees that we are the designated master

#. at this point, the node transitions to the master role

#. for all the in-progress jobs, mark them as failed, with
   reason unknown or something similar (master failed, etc.)
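
The quorum check in the second step boils down to simple vote counting; a
sketch of that logic, with a hypothetical ``query_fn`` standing in for the
actual per-node RPC, is::

  def has_quorum(my_serial, my_max_job_id, other_nodes, query_fn, cluster_size):
    """query_fn(node) -> (config_serial, max_job_id), or None if unreachable."""
    votes = 1                        # our own copy counts as one vote
    for node in other_nodes:
      result = query_fn(node)        # hypothetical RPC wrapper
      if result is None:
        continue                     # unreachable nodes simply do not vote
      serial, max_job_id = result
      if serial > my_serial or max_job_id > my_max_job_id:
        return False                 # newer data exists elsewhere: don't take over
      if serial == my_serial and max_job_id == my_max_job_id:
        votes += 1
    return votes >= cluster_size // 2 + 1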


Logging
+++++++

The logging system will be switched completely to the standard logging
module; currently it's logging-based, but exposes a different API,
which is just overhead. As such, the code will be switched over to
standard logging calls, and only the setup will be custom.

With this change, we will remove the separate debug/info/error logs,
and instead always have one logfile per daemon:

- master-daemon.log for the master daemon
- node-daemon.log for the node daemon (this is the same as in 1.2)
- rapi-daemon.log for the RAPI daemon logs
- rapi-access.log, an additional log file for the RAPI that will be
  in the standard HTTP log format for possible parsing by other tools

Since the watcher will only submit jobs to the master for startup of
the instances, its log file will contain less information than before,
mainly recording that it will start an instance, but not the results.
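
The custom part of the setup is thus reduced to pointing the standard logging
machinery at the right file; a sketch (file names and format are illustrative
only)::

  # Sketch of a per-daemon logging setup using only the standard module.
  import logging

  def setup_logging(logfile, debug=False):
    """Configure the root logger to write to the daemon's single log file."""
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger("")
    root.addHandler(handler)
    root.setLevel(logging.DEBUG if debug else logging.INFO)

  # e.g. in the master daemon startup code:
  #   setup_logging("/var/log/ganeti/master-daemon.log")
  #   logging.info("master daemon startup")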

Caveats
+++++++

A discussed alternative is to keep the current model of individual
processes touching the cluster configuration. The reasons we have not
chosen this approach are:

- the speed of reading and unserializing the cluster state
  today is not small enough that we can ignore it; the addition of
  the job queue will make the startup cost even higher. While this
  runtime cost is low, it can be on the order of a few seconds on
  bigger clusters, which for very quick commands is comparable to
  the actual duration of the computation itself

- individual commands would make it harder to implement a
  fire-and-forget job request, along the lines "start this
  instance but do not wait for it to finish"; it would require a
  model of backgrounding the operation and other things that are
  much better served by a daemon-based model

Another area of discussion is moving away from Twisted in this new
implementation. While Twisted has its advantages, there are also many
disadvantages to using it:

- first and foremost, it's not a library, but a framework; thus, if
  you use twisted, all the code needs to be 'twisted-ized'; we were able
  to keep the 1.x code clean by hacking around twisted in an
  unsupported, unrecommended way, and the only alternative would have
  been to make all the code be written for twisted
- it has some weaknesses in working with multiple threads, since its base
  model is designed to replace thread usage by using deferred calls, so while
  it can use threads, it's less flexible in doing so

And, since we already have an HTTP server library for the RAPI, we
can just reuse that for inter-node communication.


Granular locking
~~~~~~~~~~~~~~~~

We want to make sure that multiple operations can run in parallel on a Ganeti
Cluster. In order for this to happen we need to make sure concurrently run
operations don't step on each other's toes and break the cluster.

This design addresses how we are going to deal with locking so that:

- high urgency operations are not stopped by long-running ones
- long-running operations can run in parallel
- we preserve safety (data coherency) and liveness (no deadlock, no work
  postponed indefinitely) on the cluster

Reaching the maximum possible parallelism is a non-goal. We have identified a
set of operations that are currently bottlenecks and need to be parallelised
and have worked on those. In the future it will be possible to address other
needs, thus making the cluster more and more parallel one step at a time.

This document only talks about parallelising Ganeti level operations, aka
Logical Units, and the locking needed for that. Any other synchronisation lock
needed internally by the code is outside its scope.

Ganeti 1.2
++++++++++

We intend to implement a Ganeti locking library, which can be used by the
various ganeti code components in order to easily, efficiently and correctly
grab the locks they need to perform their function.

The proposed library has these features:

- Internally managing all the locks, making the implementation transparent
  to their usage
- Automatically grabbing multiple locks in the right order (avoid deadlock)
- Ability to transparently handle conversion to more granularity
- Support asynchronous operation (future goal)

Locking will be valid only on the master node and will not be a distributed
operation. In case of master failure, though, if some locks were held it means
some opcodes were in progress, so when recovery of the job queue is done it
will be possible to determine by the interrupted opcodes which operations could
have been left half way through and thus which locks could have been held. It
is then the responsibility either of the master failover code, of the cluster
verification code, or of the admin to do what's necessary to make sure that any
leftover state is dealt with. This is not an issue from a locking point of view
because the fact that the previous master has failed means that it cannot do
any job.

A corollary of this is that a master-failover operation with both masters alive
needs to happen while no other locks are held.

The Locks
+++++++++

At the first stage we have decided to provide the following locks:

- One "config file" lock
- One lock per node in the cluster
- One lock per instance in the cluster

All the instance locks will need to be taken before the node locks, and the
node locks before the config lock. Locks will need to be acquired at the same
time for multiple instances and nodes, and the internal ordering will be dealt
with by the locking library, which, for simplicity, will just use alphabetical
order.

Handling conversion to more granularity
+++++++++++++++++++++++++++++++++++++++

In order to convert to a more granular approach transparently, each time we
split a lock into more fine-grained ones we'll create a "metalock", which will
depend on those sublocks and live for the time necessary for all the code to
convert (or forever, in some conditions). When a metalock exists all converted
code must acquire it in shared mode, so it can run concurrently, but still be
exclusive with old code, which acquires it exclusively.

In the beginning the only such lock will be what replaces the current "command"
lock, and will acquire all the locks in the system, before proceeding. This
lock will be called the "Big Ganeti Lock" because holding that one will avoid
any other concurrent ganeti operations.

We might also want to devise more metalocks (eg. all nodes, all nodes+config)
in order to make it easier for some parts of the code to acquire what they need
without specifying it explicitly.

In the future things like the node locks could become metalocks, should we
decide to split them into an even more fine grained approach, but this will
probably be only after the first 2.0 version has been released.

Library API
+++++++++++

All the locking will be its own class, and the locks will be created at
initialisation time, from the config file.

The API will have a way to grab one or more than one lock at the same time.
Any attempt to grab a lock while already holding one in the wrong order will be
checked for, and fail.
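
A sketch of how the library could combine these two points (grabbing several
locks at once, in the alphabetical order described earlier) is shown below;
the class and method names are illustrative, not the final API::

  # Sketch: acquire a subset of same-level locks in a fixed global order.
  import threading

  class LockSet(object):
    """Holds one lock per resource name; grabs subsets in sorted order."""

    def __init__(self, names):
      self._locks = dict((name, threading.Lock()) for name in names)

    def acquire(self, names):
      # sorting gives the alphabetical ordering that avoids deadlocks
      for name in sorted(names):
        self._locks[name].acquire()

    def release(self, names):
      for name in sorted(names, reverse=True):
        self._locks[name].release()

  # instance_locks = LockSet(["inst1.example.com", "inst2.example.com"])
  # instance_locks.acquire(["inst2.example.com", "inst1.example.com"])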

Adding/Removing locks
+++++++++++++++++++++

When a new instance or a new node is created an associated lock must be added
to the list. The relevant code will need to inform the locking library of such
a change.

This needs to be compatible with every other lock in the system, especially
metalocks that guarantee to grab sets of resources without specifying them
explicitly. The implementation of this will be handled in the locking library
itself.

Of course when instances or nodes disappear from the cluster the relevant locks
must be removed. This is easier than adding new elements, as the code which
removes them must own them exclusively or can queue for their ownership, and
thus deals with metalocks exactly as normal code acquiring those locks. Any
operation queueing on a removed lock will fail after its removal.

Asynchronous operations
+++++++++++++++++++++++

For the first version the locking library will only export synchronous
operations, which will block till the needed locks are held, and only fail if
the request is impossible or somehow erroneous.

In the future we may want to implement different types of asynchronous
operations such as:

- Try to acquire this lock set and fail if not possible
- Try to acquire one of these lock sets and return the first one you were
  able to get (or after a timeout) (select/poll like)

These operations can be used to prioritize operations based on available locks,
rather than making them just blindly queue for acquiring them. The inherent
risk, though, is that any code using the first operation, or setting a timeout
for the second one, is susceptible to starvation and thus may never be able to
get the required locks and complete certain tasks. Considering this,
providing/using these operations should not be among our first priorities.

Locking granularity
+++++++++++++++++++

For the first version of this code we'll convert each Logical Unit to
acquire/release the locks it needs, so locking will be at the Logical Unit
level.  In the future we may want to split logical units into independent
"tasklets" with their own locking requirements. A different design doc (or mini
design doc) will cover the move from Logical Units to tasklets.

Lock acquisition code path
++++++++++++++++++++++++++

In general when acquiring locks we should use a code path equivalent to::

  lock.acquire()
  try:
    ...
    # other code
  finally:
    lock.release()

This makes sure we release all locks, and avoid possible deadlocks. Of course,
extra care must be used not to leave, if possible, locked structures in an
unusable state.

In order to avoid this extra indentation and code changes everywhere in the
Logical Units code, we decided to allow LUs to declare locks, and then execute
their code with their locks acquired. In the new world LUs are called like
this::

  # user passed names are expanded to the internal lock/resource name,
  # then known needed locks are declared
  lu.ExpandNames()
  ... some locking/adding of locks may happen ...
  # late declaration of locks for one level: this is useful because sometimes
  # we can't know which resource we need before locking the previous level
  lu.DeclareLocks() # for each level (cluster, instance, node)
  ... more locking/adding of locks can happen ...
  # these functions are called with the proper locks held
  lu.CheckPrereq()
  lu.Exec()
  ... locks declared for removal are removed, all acquired locks released ...

The Processor and the LogicalUnit class will contain exact documentation on how
locks are supposed to be declared.

Caveats
+++++++

This library will provide an easy upgrade path to bring all the code to
granular locking without breaking everything, and it will also guarantee
against a lot of common errors. Code switching from the old "lock everything"
lock to the new system, though, needs to be carefully scrutinised to be sure it
is really acquiring all the necessary locks, and none has been overlooked or
forgotten.

The code can contain other locks outside of this library, to synchronise other
threaded code (eg for the job queue) but in general these should be leaf locks
or carefully structured non-leaf ones, to avoid deadlock race conditions.


Job Queue
~~~~~~~~~

Granular locking is not enough to speed up operations, we also need a
queue to store these and to be able to process as many as possible in
parallel.

A Ganeti job will consist of multiple ``OpCodes``, which are the basic
elements of operation in Ganeti 1.2 (and will remain as such). Most
command-level commands are equivalent to one OpCode, or in some cases
to a sequence of opcodes, all of the same type (e.g. evacuating a node
will generate N opcodes of type replace disks).


Job execution—“Life of a Ganeti job”
++++++++++++++++++++++++++++++++++++

#. Job gets submitted by the client. A new job identifier is generated and
   assigned to the job. The job is then automatically replicated [#replic]_
   to all nodes in the cluster. The identifier is returned to the client.
#. A pool of worker threads waits for new jobs. If all are busy, the job has
   to wait and the first worker finishing its work will grab it. Otherwise any
   of the waiting threads will pick up the new job.
#. Client waits for job status updates by calling a waiting RPC function.
   Log messages may be shown to the user. Until the job is started, it can also
   be cancelled.
#. As soon as the job is finished, its final result and status can be retrieved
   from the server.
#. If the client archives the job, it gets moved to a history directory.
   There will be a method to archive all jobs older than a given age.

.. [#replic] We need replication in order to maintain the consistency across
   all nodes in the system; the master node only differs in the fact that
   now it is running the master daemon, but if it fails and we do a master
   failover, the jobs are still visible on the new master (though marked as
   failed).

Failures to replicate a job to other nodes will only be flagged as
errors in the master daemon log if more than half of the nodes failed;
otherwise we ignore the failure, and rely on the fact that the next
update (for still running jobs) will retry the update. For finished
jobs, it is less of a problem.

Future improvements will look into checking the consistency of the job
list and jobs themselves at master daemon startup.


Job storage
+++++++++++

Jobs are stored in the filesystem as individual files, serialized
using JSON (the standard serialization mechanism in Ganeti).

The choice of storing each job in its own file was made because:

- a file can be atomically replaced
- a file can easily be replicated to other nodes
- checking consistency across nodes can be implemented very easily, since
  all job files should be (at a given moment in time) identical

The other possible choices that were discussed and discounted were:

- a single big file with all job data: not feasible due to difficult updates
- in-process databases: hard to replicate the entire database to the
  other nodes, and replicating individual operations does not mean we keep
  consistency


Queue structure
+++++++++++++++

All file operations have to be done atomically by writing to a temporary file
and subsequently renaming it. Except for log messages, every change in a job is
stored and replicated to other nodes.

::

  /var/lib/ganeti/queue/
    job-1 (JSON encoded job description and status)
    […]
    job-37
    job-38
    job-39
    lock (Queue managing process opens this file in exclusive mode)
    serial (Last job ID used)
    version (Queue format version)
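
The atomic update of a job file maps directly to a write-to-temporary-file
and rename sequence, sketched here (the helper name is illustrative)::

  # Sketch of the atomic replacement of a job file.
  import os
  import tempfile

  QUEUE_DIR = "/var/lib/ganeti/queue"

  def write_job_file(job_id, data):
    """Atomically replace the job file with new serialized contents (bytes)."""
    final_path = os.path.join(QUEUE_DIR, "job-%s" % job_id)
    fd, tmp_path = tempfile.mkstemp(dir=QUEUE_DIR)
    try:
      os.write(fd, data)
      os.fsync(fd)
    finally:
      os.close(fd)
    # rename() within one filesystem is atomic, so readers see either the
    # old or the new file, never a partially written one
    os.rename(tmp_path, final_path)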


Locking
+++++++

Locking in the job queue is a complicated topic. It is called from more than
one thread and must be thread-safe. For simplicity, a single lock is used for
the whole job queue.

A more detailed description can be found in doc/locking.txt.


Internal RPC
++++++++++++

RPC calls available between the Ganeti master and node daemons:

jobqueue_update(file_name, content)
  Writes a file in the job queue directory.
jobqueue_purge()
  Cleans the job queue directory completely, including archived jobs.
jobqueue_rename(old, new)
  Renames a file in the job queue directory.


Client RPC
++++++++++

RPC between Ganeti clients and the Ganeti master daemon supports the following
operations:

SubmitJob(ops)
  Submits a list of opcodes and returns the job identifier. The identifier is
  guaranteed to be unique during the lifetime of a cluster.
WaitForJobChange(job_id, fields, […], timeout)
  This function waits until a job changes or a timeout expires. The condition
  for when a job has changed is defined by the fields passed and the last log
  message received.
QueryJobs(job_ids, fields)
  Returns field values for the job identifiers passed.
CancelJob(job_id)
  Cancels the job specified by identifier. This operation may fail if the job
  is already running, canceled or finished.
ArchiveJob(job_id)
  Moves a job into the …/archive/ directory. This operation will fail if the
  job has not been canceled or finished.
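
As an example of how a command line client could drive these calls (a sketch
only; ``client`` stands for a hypothetical wrapper object exposing the
operations above, and the status literals are illustrative)::

  # Sketch: submit a job and wait for it to reach a final state.
  def run_and_wait(client, opcodes):
    job_id = client.SubmitJob(opcodes)
    while True:
      # block until the "status" field changes (or the timeout expires)
      client.WaitForJobChange(job_id, ["status"], [], 60)
      (status,) = client.QueryJobs([job_id], ["status"])[0]
      if status in ("success", "error", "canceled"):
        return status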


Job and opcode status
+++++++++++++++++++++

Each job and each opcode has, at any time, one of the following states:

Queued
  The job/opcode was submitted, but did not yet start.
Waiting
  The job/opcode is waiting for a lock to proceed.
Running
  The job/opcode is running.
Canceled
  The job/opcode was canceled before it started.
Success
  The job/opcode ran and finished successfully.
Error
  The job/opcode was aborted with an error.

If the master is aborted while a job is running, the job will be set to the
Error status once the master is started again.


History
+++++++

Archived jobs are kept in a separate directory,
/var/lib/ganeti/queue/archive/.  This is done in order to speed up the
queue handling: by default, the jobs in the archive are not touched by
any functions. Only the current (unarchived) jobs are parsed, loaded,
and verified (if implemented) by the master daemon.


Ganeti updates
++++++++++++++

The queue has to be completely empty for Ganeti updates with changes
in the job queue structure. In order to allow this, there will be a
way to prevent new jobs from entering the queue.



Object parameters
~~~~~~~~~~~~~~~~~

Across all cluster configuration data, we have multiple classes of
parameters:

A. cluster-wide parameters (e.g. name of the cluster, the master);
   these are the ones that we have today, and are unchanged from the
   current model

#. node parameters

#. instance specific parameters, e.g. the name of disks (LV), that
   cannot be shared with other instances

#. instance parameters, that are or can be the same for many
   instances, but are not hypervisor related; e.g. the number of VCPUs,
   or the size of memory

#. instance parameters that are hypervisor specific (e.g. kernel_path
   or PAE mode)


The following definitions for instance parameters will be used below:

:hypervisor parameter:
  a hypervisor parameter (or hypervisor specific parameter) is defined
  as a parameter that is interpreted by the hypervisor support code in
  Ganeti and usually is specific to a particular hypervisor (like the
  kernel path for PVM which makes no sense for HVM).

:backend parameter:
  a backend parameter is defined as an instance parameter that can be
  shared among a list of instances, and is either generic enough not
  to be tied to a given hypervisor or cannot influence at all the
  hypervisor behaviour.

  For example: memory, vcpus, auto_balance

  All these parameters will be encoded into constants.py with the prefix "BE\_"
  and the whole list of parameters will exist in the set "BES_PARAMETERS"

:proper parameter:
  a parameter whose value is unique to the instance (e.g. the name of an LV,
  or the MAC of a NIC)

As a general rule, for all kinds of parameters, “None” (or in
JSON-speak, “nil”) will no longer be a valid value for a parameter. As
such, only non-default parameters will be saved as part of objects in
the serialization step, reducing the size of the serialized format.

Cluster parameters
++++++++++++++++++

Cluster parameters remain as today, attributes at the top level of the
Cluster object. In addition, two new attributes at this level will
hold defaults for the instances:

- hvparams, a dictionary indexed by hypervisor type, holding default
  values for hypervisor parameters that are not defined/overridden by
  the instances of this hypervisor type

- beparams, a dictionary holding (for 2.0) a single element 'default',
  which holds the default values for backend parameters

Node parameters
+++++++++++++++

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

Instance parameters
+++++++++++++++++++

As described before, the instance parameters are split in three:
instance proper parameters, unique to each instance, instance
hypervisor parameters and instance backend parameters.

The “hvparams” and “beparams” are kept in two dictionaries at instance
level. Only non-default parameters are stored (but once customized, a
parameter will be kept, even with the same value as the default one,
until reset).

The names for hypervisor parameters in the instance.hvparams subtree
should be chosen to be as generic as possible, especially if specific
parameters could conceivably be useful for more than one hypervisor,
e.g. instance.hvparams.vnc_console_port instead of using both
instance.hvparams.hvm_vnc_console_port and
instance.hvparams.kvm_vnc_console_port.

There are some special cases related to disks and NICs (for example):
a disk has both ganeti-related parameters (e.g. the name of the LV)
and hypervisor-related parameters (how the disk is presented to/named
in the instance). The former parameters remain as proper instance
parameters, while the latter values are migrated to the hvparams
structure. In 2.0, we will have only globally-per-instance such
hypervisor parameters, and not per-disk ones (e.g. all NICs will be
exported as of the same type).

Starting from the 1.2 list of instance parameters, here is how they
will be mapped to the three classes of parameters:

- name (P)
- primary_node (P)
- os (P)
- hypervisor (P)
- status (P)
- memory (BE)
- vcpus (BE)
- nics (P)
- disks (P)
- disk_template (P)
- network_port (P)
- kernel_path (HV)
- initrd_path (HV)
- hvm_boot_order (HV)
- hvm_acpi (HV)
- hvm_pae (HV)
- hvm_cdrom_image_path (HV)
- hvm_nic_type (HV)
- hvm_disk_type (HV)
- vnc_bind_address (HV)
- serial_no (P)


Parameter validation
++++++++++++++++++++

To support the new cluster parameter design, additional features will
be required from the hypervisor support implementations in Ganeti.

The hypervisor support implementation API will be extended with the
following features:

:PARAMETERS: class-level attribute holding the list of valid parameters
  for this hypervisor
:CheckParamSyntax(hvparams): checks that the given parameters are
  valid (as in the names are valid) for this hypervisor; usually just
  comparing hvparams.keys() and cls.PARAMETERS; this is a class method
  that can be called from within master code (i.e. cmdlib) and should
  be safe to do so
:ValidateParameters(hvparams): verifies the values of the provided
  parameters against this hypervisor; this is a method that will be
  called on the target node, from backend.py code, and as such can
  make node-specific checks (e.g. kernel_path checking)
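
A sketch of how a hypervisor implementation could provide these hooks (the
parameter names and the error handling are placeholders)::

  # Sketch of the per-hypervisor validation hooks described above.
  import os

  class ExampleHypervisor(object):
    PARAMETERS = ["kernel_path", "initrd_path"]   # illustrative list

    @classmethod
    def CheckParamSyntax(cls, hvparams):
      """Name-level check, safe to run on the master (no node access)."""
      invalid = set(hvparams.keys()) - set(cls.PARAMETERS)
      if invalid:
        raise ValueError("Unknown parameters: %s" % ", ".join(sorted(invalid)))

    def ValidateParameters(self, hvparams):
      """Value-level check, run on the target node (may inspect the system)."""
      kernel = hvparams.get("kernel_path")
      if kernel and not os.path.isfile(kernel):
        raise ValueError("Kernel %s not found on this node" % kernel)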

Default value application
+++++++++++++++++++++++++

The application of defaults to an instance is done in the Cluster
object, via two new methods as follows:

- ``Cluster.FillHV(instance)``, which returns a 'filled' hvparams dict, based
  on the instance's hvparams and the cluster's ``hvparams[instance.hypervisor]``

- ``Cluster.FillBE(instance, be_type="default")``, which returns the filled
  beparams dict, based on the instance and cluster beparams
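
The 'filling' is essentially a dictionary merge of cluster defaults with the
instance's own overrides; a minimal sketch (attribute names follow the
description above, the rest is illustrative)::

  # Sketch of the default-filling logic: cluster defaults first, then the
  # instance's own (non-default) parameters on top.
  def FillHV(cluster, instance):
    filled = dict(cluster.hvparams.get(instance.hypervisor, {}))
    filled.update(instance.hvparams)
    return filled

  def FillBE(cluster, instance, be_type="default"):
    filled = dict(cluster.beparams.get(be_type, {}))
    filled.update(instance.beparams)
    return filled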

The FillHV/BE transformations will be used, for example, in the RpcRunner
when sending an instance for activation/stop, and the sent instance
hvparams/beparams will have the final values (the noded code doesn't know
about defaults).

LU code will need to call the transformation itself, if needed.

Opcode changes
++++++++++++++

The parameter changes will have an impact on the OpCodes, especially on
the following ones:

- OpCreateInstance, where the new hv and be parameters will be sent as
  dictionaries; note that all hv and be parameters are now optional, as
  the values can instead be taken from the cluster
- OpQueryInstances, where we have to be able to query these new
  parameters; the syntax for names will be ``hvparam/$NAME`` and
  ``beparam/$NAME`` for querying an individual parameter out of one
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
  dictionaries
- OpModifyInstance, where the modified parameters are sent as
  dictionaries

Additionally, we will need new OpCodes to modify the cluster-level
defaults for the be/hv sets of parameters.

Caveats
+++++++

One problem that might appear is that our classification is not
complete or not good enough, and we'll need to change this model. As
a last resort, we will need to roll back and keep the 1.2 style.

Another problem is that the classification of a parameter may be unclear
(e.g. ``network_port``, is this BE or HV?); in this case we'll take
the risk of having to move parameters later between classes.

Security
++++++++

The only security issue that we foresee is if some new parameters will
have sensitive values. If so, we will need to have a way to export the
config data while purging the sensitive values.

E.g. for the DRBD shared secrets, we could export these with the
values replaced by an empty string.

Feature changes
---------------

The main feature-level changes will be:

- a number of disk related changes
- removal of the fixed two-disk, one-NIC per instance limitation

Disk handling changes
~~~~~~~~~~~~~~~~~~~~~

The storage options available in Ganeti 1.x were introduced based on
then-current software (first DRBD 0.7, then later DRBD 8) and the
estimated usage patterns. However, experience has later shown that some
assumptions made initially are not true and that more flexibility is
needed.

One main assumption made was that disk failures should be treated as 'rare'
events, and that each of them needs to be manually handled in order to ensure
data safety; however, both these assumptions are false:

- disk failures can be a common occurrence, based on usage patterns or cluster
  size
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
  automate more of the recovery

Note that we still don't have fully-automated disk recovery as a goal, but our
goal is to reduce the manual work needed.

As such, we plan the following main changes:

- DRBD8 is much more flexible and stable than its previous version (0.7),
  such that removing the support for the ``remote_raid1`` template and
  focusing only on DRBD8 is easier

- dynamic discovery of DRBD devices is not actually needed in a cluster
  where the DRBD namespace is controlled by Ganeti; switching to a static
  assignment (done at either instance creation time or change secondary time)
  will change the disk activation time from O(n) to O(1), which on big
  clusters is a significant gain

- remove the hard dependency on LVM (currently all available storage types are
  ultimately backed by LVM volumes) by introducing file-based storage

Additionally, a number of smaller enhancements are also planned:

- support a variable number of disks
- support read-only disks

Future enhancements in the 2.x series, which do not require base design
changes, might include:

- enhancement of the LVM allocation method in order to try to keep
  all of an instance's virtual disks on the same physical
  disks

- add support for DRBD8 authentication at handshake time in
  order to ensure each device connects to the correct peer

- remove the restriction of failover only to the secondary,
  which creates very strict rules on cluster allocation

DRBD minor allocation
+++++++++++++++++++++

Currently, when trying to identify or activate a new DRBD (or MD)
device, the code scans all in-use devices in order to see if we find
one that looks similar to our parameters and is already in the desired
state or not. Since this needs external commands to be run, it is very
slow when more than a few devices are already present.

Therefore, we will change the discovery model from dynamic to
static. When a new device is logically created (added to the
configuration) a free minor number is computed from the list of
devices that should exist on that node and assigned to that
device.
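
Computing the free minor is then a purely local operation on the
configuration data; a sketch::

  # Sketch: pick the first minor number not used by any device that the
  # configuration says should exist on this node.
  def find_free_minor(used_minors):
    """used_minors: iterable of minor numbers already assigned on the node."""
    used = set(used_minors)
    minor = 0
    while minor in used:
      minor += 1
    return minor

  # e.g. find_free_minor([0, 1, 3]) == 2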

At device activation, if the minor is already in use, we check if
it has our parameters; if not, we just destroy the device (if
possible, otherwise we abort) and start it with our own
parameters.

This means that we in effect take ownership of the minor space for
that device type; if there's a user-created drbd minor, it will be
automatically removed.

The change will have the effect of reducing the number of external
commands run per device from a constant number times the index of the
first free DRBD minor to just a constant number.

Removal of obsolete device types (md, drbd7)
++++++++++++++++++++++++++++++++++++++++++++

We need to remove these device types because of two issues. First,
drbd7 has bad failure modes in case of dual failures (both network and
disk): it cannot propagate the error up the device stack and instead
just panics. Second, due to the asymmetry between primary and
secondary in md+drbd mode, we cannot do live failover (not even if we
had md+drbd8).

File-based storage support
++++++++++++++++++++++++++

This is covered by a separate design doc (*Vinales*) and
would allow us to get rid of the hard requirement for testing
clusters; it would also allow people who have SAN storage to do live
failover taking advantage of their storage solution.

Better LVM allocation
+++++++++++++++++++++

Currently, the LV to PV allocation mechanism is a very simple one: at
each new request for a logical volume, tell LVM to allocate the volume
in order based on the amount of free space. This is good for
simplicity and for keeping the usage equally spread over the available
physical disks, however it introduces a problem that an instance could
end up with its (currently) two drives on two physical disks, or
(worse) that the data and metadata for a DRBD device end up on
different drives.

This is bad because it causes unneeded ``replace-disks`` operations in
case of a physical failure.

The solution is to batch allocations for an instance and make the LVM
handling code try to allocate as close as possible all the storage of
one instance. We will still allow the logical volumes to spill over to
additional disks as needed.

Note that this clustered allocation can only be attempted at initial
instance creation, or at change secondary node time. At add disk time,
or at replacing individual disks, it's not easy enough to compute the
current disk map so we'll not attempt the clustering.

DRBD8 peer authentication at handshake
++++++++++++++++++++++++++++++++++++++

DRBD8 has a new feature that allows authentication of the peer at
connect time. We can use this to prevent connecting to the wrong peer
more than to secure the connection. Even though we never had issues
with wrong connections, it would be good to implement this.


LVM self-repair (optional)
++++++++++++++++++++++++++

The complete failure of a physical disk is very tedious to
troubleshoot, mainly because of the many failure modes and the many
steps needed. We can safely automate some of the steps, more
specifically the ``vgreduce --removemissing`` part, using the following
method:

#. check if all nodes have consistent volume groups
#. if yes, and previous status was yes, do nothing
#. if yes, and previous status was no, save status and restart
#. if no, and previous status was no, do nothing
#. if no, and previous status was yes:
    #. if more than one node is inconsistent, do nothing
    #. if only one node is inconsistent:
        #. run ``vgreduce --removemissing``
        #. log this occurrence in the ganeti log in a form that
           can be used for monitoring
        #. [FUTURE] run ``replace-disks`` for all
           instances affected
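
The decision part of this method is a small state machine keyed on the
previous and current consistency status; a simplified sketch (the restart
step is left out, and the repair and logging helpers are assumed)::

  # Sketch: act only on a consistent -> inconsistent transition affecting
  # exactly one node; return the new status so the caller can save it.
  def maybe_self_repair(prev_consistent, node_status, repair_fn, log_fn):
    """node_status: dict mapping node name -> volume group consistent?"""
    now_consistent = all(node_status.values())
    if now_consistent or not prev_consistent:
      # no repair action in these cases (the caller just saves the new status)
      return now_consistent
    bad_nodes = [name for name, ok in node_status.items() if not ok]
    if len(bad_nodes) == 1:
      repair_fn(bad_nodes[0])         # would run "vgreduce --removemissing"
      log_fn("vgreduce --removemissing run on %s" % bad_nodes[0])
    return now_consistent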

Failover to any node
++++++++++++++++++++

With a modified disk activation sequence, we can implement the
*failover to any* functionality, removing many of the layout
restrictions of a cluster:

- the need to reserve memory on the current secondary: this gets reduced to
  a need to reserve memory anywhere on the cluster

- the need to first failover and then replace the secondary for an
  instance: with failover-to-any, we can directly failover to
  another node, which also does the replace disks in the same
  step

In the following, we denote the current primary by P1, the current
secondary by S1, and the new primary and secondaries by P2 and S2. P2
is fixed to the node the user chooses, but the choice of S2 can be
made between P1 and S1. This choice can be constrained, depending on
which of P1 and S1 has failed.

- if P1 has failed, then S1 must become S2, and live migration is not possible
- if S1 has failed, then P1 must become S2, and live migration could be
  possible (in theory, but this is not a design goal for 2.0)

The algorithm for performing the failover is straightforward:

- verify that S2 (the node the user has chosen to keep as secondary) has
  valid data (is consistent)

- tear down the current DRBD association and set up a DRBD pairing between
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
  start resyncing from S2

- as soon as P2 is in state SyncTarget (i.e. after the resync has started
  but before it has finished), we can promote it to primary role (r/w)
  and start the instance on P2

- as soon as the P2-S2 sync has finished, we can remove
  the old data on the old node that has not been chosen for
  S2

Caveats: during the P2-S2 sync, a (non-transient) network error
will cause I/O errors on the instance, so (if a longer instance
downtime is acceptable) we can postpone the restart of the instance
until the resync is done. However, disk I/O errors on S2 will cause
data loss, since we don't have a good copy of the data anymore, so in
this case waiting for the sync to complete is not an option. As such,
it is recommended that this feature is used only in conjunction with
proper disk monitoring.


Live migration note: while failover-to-any is possible for all choices
of S2, migration-to-any is possible only if we keep P1 as S2.

Caveats
+++++++

The dynamic device model, while more complex, has an advantage: it
will not reuse by mistake another instance's DRBD device, since it
always looks for either our own or a free one.

The static one, in contrast, will assume that given a minor number N,
it's ours and we can take over. This needs careful implementation, such
that if the minor is in use, either we are able to cleanly shut it
down, or we abort the startup. Otherwise, it could be that we start
syncing between two instances' disks, causing data loss.


Variable number of disks/NICs per instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Variable number of disks
++++++++++++++++++++++++

In order to support high-security scenarios (for example read-only sda
and read-write sdb), we need to make a fully flexible disk
definition. This has less impact than it might seem at first sight:
only the instance creation has a hardcoded number of disks, not the disk
handling code. The block device handling and most of the instance
handling code is already working with "the instance's disks" as
opposed to "the two disks of the instance", but some pieces are not
(e.g. import/export) and the code needs a review to ensure safety.

The objective is to be able to specify the number of disks at
instance creation, and to be able to toggle a disk from read-only to
read-write afterwards.

Variable number of NICs
+++++++++++++++++++++++

Similar to the disk change, we need to allow multiple network
interfaces per instance. This will affect the internal code (some
functions will have to stop assuming that ``instance.nics`` is a list
of length one), the OS API which currently can export/import only one
instance, and the command line interface.

Interface changes
-----------------

There are two areas of interface changes: API-level changes (the OS
interface and the RAPI interface) and the command line interface
changes.

OS interface
~~~~~~~~~~~~

The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. The
interface is composed of a series of scripts which get called with certain
parameters to perform OS-dependent operations on the cluster. The current
scripts are:

create
  called when a new instance is added to the cluster
export
  called to export an instance disk to a stream
import
  called to import from a stream to a new instance
rename
  called to perform the os-specific operations necessary for renaming an
  instance

Currently these scripts suffer from the limitations of Ganeti 1.2: for example
they accept exactly one block and one swap device to operate on, rather than
any amount of generic block devices, they blindly assume that an instance will
have just one network interface to operate, and they cannot be configured to
optimise the instance for a particular hypervisor.

Since in Ganeti 2.0 we want to support multiple hypervisors, and a non-fixed
number of network interfaces and disks, the OS interface needs to change to
transmit the appropriate amount of information about an instance to its
managing operating system, when operating on it. Moreover, since some old
assumptions usually used in OS scripts are no longer valid, we need to
re-establish a common knowledge on what can be assumed and what cannot be
regarding the Ganeti environment.


When designing the new OS API our priorities are:

- ease of use
- future extensibility
- ease of porting from the old API
- modularity

As such we want to limit the number of scripts that must be written to support
an OS, and make it easy to share code between them by making their input
uniform. We will also leave the current script structure unchanged, as far as
we can, and make a few of the scripts (import, export and rename) optional.
Most information will be passed to the scripts through environment variables,
for ease of access and at the same time ease of using only the information a
script needs.


The Scripts
+++++++++++

As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs to
support the following functionality, through scripts:

create:
  used to create a new instance running that OS. This script should prepare the
  block devices, and install them so that the new OS can boot under the
  specified hypervisor.
export (optional):
  used to export an installed instance using the given OS to a format which can
  be used to import it back into a new instance.
import (optional):
  used to import an exported instance into a new one. This script is similar to
  create, but the new instance should have the content of the export, rather
  than contain a pristine installation.
rename (optional):
  used to perform the internal OS-specific operations needed to rename an
  instance.

If any optional script is not implemented, Ganeti will refuse to perform the
given operation on instances using the non-implementing OS. Of course the
create script is mandatory, and it doesn't make sense to support either the
export or the import operation but not both.

Incompatibilities with 1.2
__________________________

We expect the following incompatibilities between the OS scripts for 1.2 and
the ones for 2.0:

- Input parameters: in 1.2 those were passed on the command line, in 2.0 we'll
  use environment variables, as there will be a lot more information and not
  all OSes may care about all of it.
- Number of calls: export scripts will be called once for each device the
  instance has, and import scripts once for every exported disk. Imported
  instances will be forced to have a number of disks greater than or equal to
  that of the export.
- Some scripts are not compulsory: if such a script is missing the relevant
  operations will be forbidden for instances of that OS. This makes it easier
  to distinguish between unsupported operations and no-op ones (if any).


Input
_____

Rather than using command line flags, as they do now, scripts will accept
inputs from environment variables.  We expect the following input values:

OS_API_VERSION
  The version of the OS API that the following parameters comply with;
  this is used so that in the future we could have OSes supporting
  multiple versions and thus Ganeti can send the proper version in this
  parameter
INSTANCE_NAME
  Name of the instance acted on
HYPERVISOR
  The hypervisor the instance should run on (eg. 'xen-pvm', 'xen-hvm', 'kvm')
DISK_COUNT
  The number of disks this instance will have
NIC_COUNT
  The number of NICs this instance will have
DISK_<N>_PATH
  Path to the Nth disk.
DISK_<N>_ACCESS
  W if read/write, R if read only. OS scripts are not supposed to touch
  read-only disks, but they are passed to the scripts nevertheless.
DISK_<N>_FRONTEND_TYPE
  Type of the disk as seen by the instance. Can be 'scsi', 'ide', 'virtio'
DISK_<N>_BACKEND_TYPE
  Type of the disk as seen from the node. Can be 'block', 'file:loop' or
  'file:blktap'
NIC_<N>_MAC
  MAC address for the Nth network interface
NIC_<N>_IP
  IP address for the Nth network interface, if available
NIC_<N>_BRIDGE
  Node bridge the Nth network interface will be connected to
NIC_<N>_FRONTEND_TYPE
  Type of the Nth NIC as seen by the instance. For example 'virtio',
  'rtl8139', etc.
DEBUG_LEVEL
  Whether more output should be produced, for debugging purposes. Currently
  the only valid values are 0 and 1.

These are only the basic variables we are thinking of now, but more may come
during the implementation and they will be documented in the ganeti-os-api man
page. All these variables will be available to all scripts.
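
For illustration, a create script written in Python could consume these
variables as sketched below (the mkfs call merely stands in for whatever
installation work a real OS definition performs)::

  #!/usr/bin/env python
  # Sketch of an OS 'create' script reading the 2.0 environment variables.
  import os
  import subprocess
  import sys

  def main():
    instance = os.environ["INSTANCE_NAME"]
    disk_count = int(os.environ["DISK_COUNT"])
    for idx in range(disk_count):
      path = os.environ["DISK_%d_PATH" % idx]
      access = os.environ["DISK_%d_ACCESS" % idx]
      if access != "W":
        continue                      # never touch read-only disks
      # placeholder for the real formatting/installation work
      subprocess.check_call(["mkfs.ext3", path])
    sys.stderr.write("created instance %s\n" % instance)
    return 0

  if __name__ == "__main__":
    sys.exit(main())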

Some scripts will need some additional information to work. These will have
per-script variables, such as:

OLD_INSTANCE_NAME
  rename: the name the instance should be renamed from.
EXPORT_DEVICE
  export: device to be exported, a snapshot of the actual device. The data
  must be exported to stdout.
EXPORT_INDEX
  export: sequential number of the instance device targeted.
IMPORT_DEVICE
  import: device to send the data to, part of the new instance. The data
  must be imported from stdin.
IMPORT_INDEX
  import: sequential number of the instance device targeted.

(Rationale for INSTANCE_NAME as an environment variable: the instance name is
always needed and we could pass it on the command line. On the other hand,
though, this would force scripts to both access the environment and parse the
command line, so we'll move it to the environment for uniformity.)
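
As an illustration, here is a minimal sketch of what a create script could
look like when written in Python (OS scripts can be any executable). Only the
environment variable names listed above are taken from this design; the rest
of the code is purely hypothetical::

  #!/usr/bin/env python3
  # Hypothetical OS create script sketch: read the Ganeti-provided
  # environment and report (to stderr) what it would act on.
  import os
  import sys

  def main():
      instance = os.environ["INSTANCE_NAME"]
      disk_count = int(os.environ["DISK_COUNT"])
      nic_count = int(os.environ["NIC_COUNT"])
      debug = os.environ.get("DEBUG_LEVEL", "0") == "1"

      for idx in range(disk_count):
          path = os.environ["DISK_%d_PATH" % idx]
          access = os.environ["DISK_%d_ACCESS" % idx]
          if access == "R":
              # OS scripts must not touch read-only disks
              continue
          if debug:
              # user-targeted/debug output goes to stderr only
              sys.stderr.write("would format and populate %s for %s\n"
                               % (path, instance))
          # ... actual formatting and OS installation would happen here ...

      for idx in range(nic_count):
          mac = os.environ["NIC_%d_MAC" % idx]
          # ... the MAC could be used when configuring the guest network ...

      return 0

  if __name__ == "__main__":
      sys.exit(main())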

Output/Behaviour
________________

As discussed, scripts should only send user-targeted information to stderr.
The create and import scripts are supposed to format/initialise the given
block devices and install the correct instance data. The export script is
supposed to export instance data to stdout in a format understandable by the
import script. The data will be compressed by Ganeti, so no compression
should be done by the scripts. The rename script should only modify the
instance's knowledge of what its name is.
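
For example, a (hypothetical) export script that simply dumps the snapshot
device to stdout could look roughly like the sketch below; a matching import
script would do the reverse, reading from stdin and writing to the device
named by IMPORT_DEVICE::

  #!/usr/bin/env python3
  # Hypothetical export script sketch: copy the snapshot device to stdout.
  import os
  import sys

  BLOCK_SIZE = 1024 * 1024  # copy in 1 MiB chunks

  def main():
      device = os.environ["EXPORT_DEVICE"]
      with open(device, "rb") as src:
          while True:
              data = src.read(BLOCK_SIZE)
              if not data:
                  break
              # write raw, uncompressed data; Ganeti handles compression
              sys.stdout.buffer.write(data)
      return 0

  if __name__ == "__main__":
      sys.exit(main())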

Other declarative style features
++++++++++++++++++++++++++++++++

Similar to Ganeti 1.2, OS specifications will need to provide a
'ganeti_api_version' file containing a list of numbers matching the
version(s) of the API they implement. Ganeti itself will always be compatible
with one version of the API and may maintain backwards compatibility if it's
feasible to do so. The numbers are one per line, so an OS supporting both
version 5 and version 20 will have a file containing two lines. This is
different from Ganeti 1.2, which only supported one version number.

In addition, an OS will be able to declare that it supports only a subset of
the Ganeti hypervisors, by declaring them in the 'hypervisors' file.
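
For example, an OS supporting API versions 5 and 20 would ship a
'ganeti_api_version' file with the following content::

  5
  20

Assuming the 'hypervisors' file uses the same one-entry-per-line layout (the
exact format is not fixed here), restricting the OS to the Xen hypervisors
could look like::

  xen-pvm
  xen-hvm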

Caveats/Notes
+++++++++++++

We might want to have a "default" import/export behaviour that just dumps all
disks and restores them. This can save work as most systems will just do this,
while allowing flexibility for different systems.

Environment variables are limited in size, but we expect that there will be
enough space to store the information we need. If we discover that this is not
the case we may want to move to a more complex API, such as storing that
information on the filesystem and providing the OS script with the path to a
file where it is encoded in some format.

Remote API changes
~~~~~~~~~~~~~~~~~~

The first Ganeti RAPI was designed and deployed with the Ganeti 1.2.5
release. That version provides read-only access to the cluster state. A fully
functional read-write API demands significant internal changes, which are in
the pipeline for the Ganeti 2.0 release.

We decided to implement the Ganeti RAPI in a RESTful way, which is aligned
with the key features we are looking for: it is a simple, stateless, scalable
and extensible paradigm for API implementation. As transport it uses HTTP
over SSL, and we are implementing it with JSON encoding, but in a way that
makes it possible to extend and provide any other encoding.

Design
++++++

The Ganeti RAPI is implemented as an independent daemon, running on the same
node and with the same permission level as the Ganeti master daemon.
Communication is done through the UNIX socket protocol provided by the Ganeti
luxi library. In order to keep communication asynchronous, the RAPI processes
two types of client requests:

- queries: the server is able to answer immediately
- jobs: some time is needed to complete the request.

In the query case the requested data is sent back to the client in the HTTP
body. Typical examples of queries would be: list of nodes, instances, cluster
info, etc. When dealing with jobs, instead of waiting until the job
completes, the client receives a job id, an identifier which allows it to
query the job's progress in the job queue (see the job queue design document
for details).

Internally, each exported object has a version identifier, which is used as a
state stamp in the HTTP header E-Tag field for requests/responses, in order
to avoid race conditions.

Resource representation
+++++++++++++++++++++++

The key difference between the REST approach and other APIs is that, instead
of having one URI for all requests, REST demands that each resource is served
under its own, unique URI. Each resource should support a limited set of
stateless, standard HTTP methods: GET, POST, DELETE, PUT.

For example, in the Ganeti case we can have a set of URIs such as:

 - /{clustername}/instances
 - /{clustername}/instances/{instancename}
 - /{clustername}/instances/{instancename}/tag
 - /{clustername}/tag

A GET request to /{clustername}/instances will return the list of instances,
a POST to /{clustername}/instances should create a new instance, a DELETE on
/{clustername}/instances/{instancename} should delete that instance, and a
GET on /{clustername}/tag should return the cluster tags.

Each resource URI has a version prefix. The complete list of resources is
TBD.

The internal encoding might be JSON, XML, or any other format; the JSON
encoding fits the Ganeti RAPI needs nicely. The client can request a specific
representation via the Accept field in the HTTP header.

REST uses standard HTTP as the application protocol (not just as a transport)
for resource access. The set of possible result codes is a subset of the
standard HTTP result codes. Statelessness provides additional reliability and
transparency to operations.
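
A minimal sketch of how a client might interact with such an API is shown
below. The host name, port, credentials, cluster name and payload fields are
invented for the example (and the version prefix is omitted); only the
general query/job split and the use of basic authentication over SSL follow
this design::

  # Hypothetical RAPI client sketch (Python, standard library only).
  import base64
  import json
  import http.client

  conn = http.client.HTTPSConnection("cluster.example.com", 5080)
  auth = base64.b64encode(b"admin:secret").decode("ascii")
  headers = {
      "Authorization": "Basic " + auth,
      "Accept": "application/json",
  }

  # Query: answered immediately, the data is returned in the HTTP body.
  conn.request("GET", "/example-cluster/instances", headers=headers)
  instances = json.loads(conn.getresponse().read())

  # Job: the operation is queued and only a job id is returned, which can
  # then be used to poll the job's progress in the job queue.
  body = json.dumps({"instance_name": "test-instance"})
  conn.request("POST", "/example-cluster/instances", body=body,
               headers=headers)
  job_id = json.loads(conn.getresponse().read())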

Security
++++++++

With the write functionality, security becomes a much bigger issue. The
Ganeti RAPI uses basic HTTP authentication on top of an SSL connection to
grant access to an exported resource. The password is stored locally in an
Apache-style .htpasswd file. Only one level of privileges is supported.

Command line changes
~~~~~~~~~~~~~~~~~~~~

Ganeti 2.0 introduces several new features as well as new ways to
handle instance resources like disks or network interfaces. This
requires some noticeable changes in the way command line arguments are
handled:

- extend and modify the command line syntax to support new features
- ensure consistent patterns in command line arguments to reduce
  cognitive load

The design changes that drive these requirements are, in no particular
order:

- flexible instance disk handling: support a variable number of disks
  with varying properties per instance,
- flexible instance network interface handling: support a variable
  number of network interfaces with varying properties per instance,
- multiple hypervisors: multiple hypervisors can be active on the same
  cluster, each supporting different parameters,
- support for device type CDROM (via ISO image)

As such, there are several areas of Ganeti where the command line arguments
will change:

- Cluster configuration

  - cluster initialization
  - cluster default configuration

- Instance configuration

  - handling of network cards for instances,
  - handling of disks for instances,
  - handling of CDROM devices and
  - handling of hypervisor specific options.

Notes about device removal/addition
+++++++++++++++++++++++++++++++++++

To avoid problems with device location changes (e.g. second network
interface of the instance becoming the first or third and the like)
the list of network/disk devices is treated as a stack, i.e. devices
can only be added/removed at the end of the list of devices of each
class (disk or network) for each instance.

gnt-instance commands
+++++++++++++++++++++

The commands for gnt-instance will be modified and extended to allow
for the new functionality:

- the add command will be extended to support the new device and
  hypervisor options,
- the modify command continues to handle all modifications to
  instances, but will be extended with new arguments for handling
  devices.

Network Device Options
++++++++++++++++++++++

The generic format of the network device option is:

  --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:mac: MAC address of the network interface, accepts either a valid
  MAC address or the string 'auto'. If 'auto' is specified, a new MAC
  address will be generated randomly. If the mac device option is not
  specified, the default value 'auto' is assumed.
:bridge: network bridge the network interface is connected
  to. Accepts either a valid bridge name (the specified bridge must
  exist on the node(s)) as string or the string 'auto'. If 'auto' is
  specified, the default bridge is used. If the bridge option is not
  specified, the default value 'auto' is assumed.

Disk Device Options
+++++++++++++++++++

The generic format of the disk device option is:

  --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]

:$DEVNUM: device number, unsigned integer, starting at 0,
:$OPTION: device option, string,
:$VALUE: device option value, string.

Currently, the following device options will be defined (open to
further changes):

:size: size of the disk device, either a positive number, specifying
  the disk size in mebibytes, or a number followed by a magnitude suffix
  (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in
  which case the default disk size will be used. If the size option is
  not specified, 'auto' is assumed. This option is not valid for all
  disk layout types.
:access: access mode of the disk device, a single letter, valid values
  are:

  - *w*: read/write access to the disk device or
  - *r*: read-only access to the disk device.

  If the access mode is not specified, the default mode of read/write
  access will be configured.
:path: path to the image file for the disk device, string. No default
  exists. This option is not valid for all disk layout types.
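
Since the --net and --disk options share the same value syntax, a single
helper can turn either of them into a device index and an option dictionary.
The sketch below only illustrates the grammar described above; it is not the
actual Ganeti implementation::

  # Illustrative parser for the value syntax
  #   $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]
  def parse_device_option(value):
      """Return (devnum, options) for e.g. '0:size=10G,access=w'."""
      if ":" in value:
          devnum, rest = value.split(":", 1)
      else:
          devnum, rest = value, ""
      options = {}
      if rest:
          for pair in rest.split(","):
              name, _, val = pair.partition("=")
              options[name] = val
      if devnum.isdigit():
          devnum = int(devnum)
      # otherwise devnum may be a magic string such as 'add' or 'remove',
      # as used by gnt-instance modify (see below)
      return devnum, options

  print(parse_device_option("0:size=10G,access=w"))
  # -> (0, {'size': '10G', 'access': 'w'})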

Adding devices
++++++++++++++

To add devices to an already existing instance, use the device type
specific option to gnt-instance modify. Currently, there are two
device type specific options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options, but instead of specifying a device number like for
gnt-instance add, you specify the magic string add. The new device
will always be appended at the end of the list of devices of this type
for the specified instance, e.g. if the instance has disk devices 0, 1
and 2, the newly added disk device will be disk device 3.

Example: gnt-instance modify --net add:mac=auto test-instance

Removing devices
++++++++++++++++

Removing devices from an instance is done via gnt-instance
modify. The same device specific options as for adding instances are
used. Instead of a device number and further device options, only the
magic string remove is specified. It will always remove the last
device in the list of devices of this type for the instance specified,
e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device
number 3 will be removed.

Example: gnt-instance modify --net remove test-instance

Modifying devices
+++++++++++++++++

Modifying devices is also done with device type specific options to
the gnt-instance modify command. There are currently two device type
options supported:

:--net: for network interface cards
:--disk: for disk devices

The syntax of the device specific options is similar to the generic
device options. The device number you specify identifies the device to
be modified.

Example: gnt-instance modify --disk 2:access=r test-instance

Hypervisor Options
++++++++++++++++++

Ganeti 2.0 will support more than one hypervisor. Different
hypervisors have various options that only apply to a specific
hypervisor. Those hypervisor specific options are treated specially
via the --hypervisor option. The generic syntax of the hypervisor
option is as follows:

  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor to use, string,
  has to match the supported hypervisors. Example: xen-pvm

:$OPTION: hypervisor option name, string
:$VALUE: hypervisor option value, string

The hypervisor option for an instance can be set at instance creation
time via the gnt-instance add command. If the hypervisor for an
instance is not specified upon instance creation, the default
hypervisor will be used.
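
The --hypervisor value follows a very similar grammar, with the hypervisor
name in place of the device number. Again, the helper below is only an
illustrative sketch of the syntax, not the actual Ganeti code::

  # Illustrative parser for the value syntax
  #   $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
  # (values containing commas are not handled by this sketch)
  def parse_hypervisor_option(value):
      """Return (hypervisor, params) for a full --hypervisor value."""
      name, _, rest = value.partition(":")
      params = {}
      if rest:
          for pair in rest.split(","):
              key, _, val = pair.partition("=")
              params[key] = val
      return name, params

  print(parse_hypervisor_option(
      "xen-hvm:cdrom=/srv/boot.iso,boot_order=cdrom:network"))
  # -> ('xen-hvm', {'cdrom': '/srv/boot.iso', 'boot_order': 'cdrom:network'})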

Modifying hypervisor parameters
+++++++++++++++++++++++++++++++

The hypervisor parameters of an existing instance can be modified
using the --hypervisor option of the gnt-instance modify command.
However, the hypervisor type of an existing instance cannot be
changed; only the particular hypervisor specific options can be
changed. Therefore, the format of the option parameters has been
simplified to omit the hypervisor name and only contain the comma
separated list of option-value pairs.

Example: gnt-instance modify --hypervisor
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance

gnt-cluster commands
++++++++++++++++++++

The commands for gnt-cluster will be extended to allow setting and
changing the default parameters of the cluster:

- The init command will be extended to support the --defaults option to
  set the cluster defaults upon cluster initialization.
- The modify command will be added to modify the cluster
  parameters. It will support the --defaults option to change the
  cluster defaults.

Cluster defaults
++++++++++++++++

The generic format of the cluster default setting option is:

  --defaults $OPTION=$VALUE[,$OPTION=$VALUE]

:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.

Currently, the following cluster default options are defined (open to
further changes):

:hypervisor: the default hypervisor to use for new instances,
  string. Must be a valid hypervisor known to and supported by the
  cluster.
:disksize: the disksize for newly created instance disks, where
  applicable. Must be either a positive number, in which case the unit
  of megabyte is assumed, or a positive number followed by a supported
  magnitude symbol (M for megabyte or G for gigabyte).
:bridge: the default network bridge to use for newly created instance
  network interfaces, string. Must be a valid bridge name of a bridge
  existing on the node(s).
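
As an illustration only (the exact syntax may still change, and the bridge
name is just a placeholder):

Example: gnt-cluster modify --defaults
hypervisor=xen-pvm,disksize=10G,bridge=xen-br0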

Hypervisor cluster defaults
+++++++++++++++++++++++++++

The generic format of the hypervisor clusterwide default setting option is:

  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]

:$HYPERVISOR: symbolic name of the hypervisor whose defaults you want
  to set, string
:$OPTION: cluster default option, string,
:$VALUE: cluster default option value, string.
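
Analogously, and again purely as an illustration (the option names are taken
from the instance-level example above, and whether this is passed to
gnt-cluster init or modify is not fixed here):

Example: gnt-cluster modify --hypervisor-defaults
xen-hvm:cdrom=/srv/boot.iso,boot_order=cdrom:network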

Functionality changes
---------------------

The disk storage will receive some changes, and will also remove
support for the drbd7 and md disk types. See the
design-2.0-disk-changes document.

The configuration storage will be changed, with the effect that more
data will be available on the nodes for access from outside Ganeti
(e.g. from shell scripts) and that nodes will get slightly more
awareness of the cluster configuration.

The RAPI will enable modify operations (besides the read-only queries
that are available today), so in effect almost all the operations
available today via the ``gnt-*`` commands will be available via the
remote API.

A change in the hypervisor support area will be that we will support
multiple hypervisors in parallel in the same cluster, so one could run
Xen HVM side-by-side with Xen PVM on the same cluster.

New features
------------

There will be a number of minor feature enhancements targeted to
either 2.0 or subsequent 2.x releases:

- multiple disks, with custom properties (read-only/read-write,
  exportable, etc.)
- multiple NICs

These changes will require OS API changes; details are in the
design-2.0-os-interface document. They will also require many
command line changes, which are described in the
design-2.0-commandline-parameters document.