Revision 6c2d0b44

/dev/null
1
Notes about the client API
2
~~~~~~~~~~~~~~~~~~~~~~~~~~
3

  
4
Starting with Ganeti 1.3, the individual commands (gnt-...) do not
5
execute code directly, but instead do so via a master daemon. The
6
communication between these commands and the daemon will be language
7
agnostic, and we will be providing a python library implementing all
8
operations.
9

  
10
TODO: add tags support, add gnt-instance info implementation, document
11
all results from query and all opcode fields
12

  
13
Protocol
14
========
15

  
16
The protocol for communication will consist of passing JSON-encoded
17
messages over a UNIX socket. The exchange alternates strictly: send a
18
message, receive a message, and so on. Each message (either request or
19
response) will end (after the JSON payload) with an ``ETX`` character
20
(ASCII decimal 3), which is not a valid character in JSON messages and
21
thus can serve as a message delimiter. Quoting from the
22
http://www.json.org grammar::
23

  
24
  char: any unicode character except " or \ or control character
25

  
26
There are three types of request that can be made:
27

  
28
  - submit job; a job is a sequence of opcodes that modify the cluster
29
  - abort a job; in some circumstances, a job can be aborted; the exact
30
    conditions depend on the master daemon implementation and clients
31
    should not rely on being able to abort jobs
32
  - query objects; this is a generic form of query that works for all
33
    object types
34

  
35
All requests will be encoded as a JSON object, having three fields:
36

  
37
  - ``request``: string, one of ``submit``, ``abort``, ``query``
38
  - ``data``: the payload of the request, variable type based on request
39
  - ``version``: the protocol version spoken by the client; we are
40
    describing version 0 here
41

  
42
The response to any request will be a JSON object, with two fields:
43

  
44
  - ``success``: either ``true`` or ``false`` denoting whether the
45
    request was successful or not
46
  - ``result``: the result of the request (depends on request type) if
47
    successful, otherwise the error message (describing the failure)
48

  
49
The server has no defined upper limit on the time it will take to
50
respond to a request, so the clients should implement their own timeout
51
handling. Note though that most requests should be answered quickly, if
52
the cluster is in a normal state.
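
For illustration, the framing described above could be implemented
along the following lines (a minimal sketch, not the actual client
library; it uses the standard ``json`` module and assumes an already
connected socket object)::

  import json

  ETX = b"\x03"  # ASCII decimal 3, the message delimiter

  def send_message(sock, payload):
    # one message: the JSON-encoded object followed by ETX
    sock.sendall(json.dumps(payload).encode("utf-8") + ETX)

  def recv_message(sock):
    # read until the ETX delimiter, then decode the JSON payload
    buf = b""
    while ETX not in buf:
      chunk = sock.recv(4096)
      if not chunk:
        raise EOFError("connection closed before ETX")
      buf += chunk
    return json.loads(buf.split(ETX, 1)[0].decode("utf-8"))

A client would then strictly alternate ``send_message`` and
``recv_message`` calls on a connected UNIX socket, wrapping
``recv_message`` in its own timeout handling.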
53

  
54
Submit
55
------
56

  
57
The submit ``data`` field will be a JSON object - a (partial) Job
58
description. It will have the following fields:
59

  
60
  - ``opcode_list``: a list of opcode objects, see below
61

  
62
The opcode objects supported will mostly be the ones supported by the
63
internal Ganeti implementation; currently there is no formal
64
specification of them (work in progress).
65

  
66
Each opcode will be represented in the ``opcode_list`` as an object:
67

  
68
  - ``OP_ID``: an opcode id; this will be equivalent to the ``OP_ID``
69
    class attribute on opcodes (in lib/opcodes.py)
70
  - other fields: as needed by the opcode in question
71

  
72
Small example, request::
73

  
74
  {
75
    "opcode_list": [
76
      {
77
        "instance_name": "instance1.example.com",
78
        "OP_ID": "OP_INSTANCE_SHUTDOWN"
79
      }
80
    ]
81
  }
82

  
83
And response::
84

  
85
  {
86
    "result": "1104",
87
    "success": true
88
  }
89

  
90
The result of the submit request will be, if successful, a single JSON
91
value denoting the job ID. While the job ID might be (or look like) an
92
integer, clients should not depend on this and should treat the ID as an
93
opaque identifier.
94

  
95
Abort
96
-----
97

  
98
The abort request data will be a single job ID (as returned by submit or
99
query jobs). The result will hold no data (i.e. it will be a JSON
100
``null`` value), if successful, and will be the error message if it
101
fails.
102

  
103
Query
104
-----
105

  
106
The ``data`` argument to the query request is a JSON object containing:
107

  
108
  - ``object``: the object type to be queried
109
  - ``names``: if querying a list of objects, this can restrict the
110
    query to a subset of the entire list
111
  - ``fields``: the list of fields that we are interested in
112

  
113
The valid values for the ``object`` field are:
114

  
115
  - ``cluster``
116
  - ``node``
117
  - ``instance``
118

  
119
For the ``cluster`` object, the ``names`` parameter is unused and must
120
be null.
121

  
122
The result value will be a list of lists: each row in the top-level list
123
will hold the results for a single object, and each entry in a per-object
124
row will hold one field value, in the same order as the ``fields``
125
list.
126

  
127
Small example, request::
128

  
129
  {
130
    "data": {
131
      "fields": [
132
        "name",
133
        "admin_memory"
134
      ],
135
      "object": "instance"
136
    },
137
    "request": "query"
138
  }
139

  
140
And response::
141

  
142
  {
143
    "result": [
144
      [
145
        "instance1.example.com",
146
        "128"
147
      ],
148
      [
149
        "instance2.example.com",
150
        "4096"
151
      ]
152
    ],
153
    "success": true
154
  }
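
Putting the pieces together, the query above could be issued from a
client roughly as follows (a sketch only: the socket path is a
placeholder, and ``send_message``/``recv_message`` are the ETX-framing
helpers sketched in the Protocol section)::

  import socket

  sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  sock.connect("/path/to/master.sock")  # placeholder path

  request = {
    "request": "query",
    "version": 0,
    "data": {
      "object": "instance",
      "names": None,
      "fields": ["name", "admin_memory"],
    },
  }
  send_message(sock, request)
  response = recv_message(sock)

  if not response["success"]:
    raise RuntimeError("query failed: %s" % response["result"])
  for name, memory in response["result"]:
    # one row per instance, fields in the same order as requested
    print("%s: %s" % (name, memory))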
155

  
b/doc/design-2.0.rst
22 22
Background
23 23
==========
24 24

  
25
While Ganeti 1.2 is usable, it severly limits the flexibility of the
25
While Ganeti 1.2 is usable, it severely limits the flexibility of the
26 26
cluster administration and imposes a very rigid model. It has the
27 27
following main scalability issues:
28 28

  
......
33 33
It also has a number of artificial restrictions, due to historical design:
34 34

  
35 35
- fixed number of disks (two) per instance
36
- fixed number of nics
36
- fixed number of NICs
37 37

  
38 38
.. [#] Replace disks will release the lock, but this is an exception
39 39
       and not a recommended way to operate
40 40

  
41 41
The 2.0 version is intended to address some of these problems, and
42
create a more flexible codebase for future developments.
42
create a more flexible code base for future developments.
43

  
44
Among these problems, the single-operation-at-a-time restriction is
45
the biggest issue with the current version of Ganeti. It is such a big
46
impediment when operating bigger clusters that one is often tempted
47
to remove the lock just to do a simple operation like starting an instance
48
while an OS installation is running.
43 49

  
44 50
Scalability problems
45 51
--------------------
......
60 66

  
61 67
One of the main causes of this global lock (beside the higher
62 68
difficulty of ensuring data consistency in a more granular lock model)
63
is the fact that currently there is no "master" daemon in Ganeti. Each
64
command tries to acquire the so called *cmd* lock and when it
65
succeeds, it takes complete ownership of the cluster configuration and
66
state.
69
is the fact that currently there is no long-lived process in Ganeti
70
that can coordinate multiple operations. Each command tries to acquire
71
the so-called *cmd* lock and, when it succeeds, it takes complete
72
ownership of the cluster configuration and state.
67 73

  
68 74
Other scalability problems are due the design of the DRBD device
69 75
model, which assumed at its creation a low (one to four) number of
......
77 83
touches multiple areas (configuration, import/export, command line)
78 84
that it's more fitted to a major release than a minor one.
79 85

  
86
Architecture issues
87
-------------------
88

  
89
The fact that each command is a separate process that reads the
90
cluster state, executes the command, and saves the new state is also
91
an issue on big clusters where the configuration data for the cluster
92
begins to be non-trivial in size.
93

  
80 94
Overview
81 95
========
82 96

  
......
109 123

  
110 124
The main changes will be switching from a per-process model to a
111 125
daemon based model, where the individual gnt-* commands will be
112
clients that talk to this daemon (see the design-2.0-master-daemon
113
document). This will allow us to get rid of the global cluster lock
114
for most operations, having instead a per-object lock (see
115
design-2.0-granular-locking). Also, the daemon will be able to queue
116
jobs, and this will allow the invidual clients to submit jobs without
117
waiting for them to finish, and also see the result of old requests
118
(see design-2.0-job-queue).
126
clients that talk to this daemon (see `Master daemon`_). This will
127
allow us to get rid of the global cluster lock for most operations,
128
having instead a per-object lock (see `Granular locking`_). Also, the
129
daemon will be able to queue jobs, and this will allow the individual
130
clients to submit jobs without waiting for them to finish, and also
131
see the result of old requests (see `Job Queue`_).
119 132

  
120 133
Beside these major changes, another 'core' change, though one not as
121 134
visible to the users, will be changing the model of object attribute
122
storage, and separate that into namespaces (such that an Xen PVM
135
storage, and separate that into name spaces (such that a Xen PVM
123 136
instance will not have the Xen HVM parameters). This will allow future
124
flexibility in defining additional parameters. More details in the
125
design-2.0-cluster-parameters document.
137
flexibility in defining additional parameters. For more details see
138
`Object parameters`_.
126 139

  
127 140
The various changes brought in by the master daemon model and the
128 141
read-write RAPI will require changes to the cluster security; we move
129
away from Twisted and use http(s) for intra- and extra-cluster
142
away from Twisted and use HTTP(S) for intra- and extra-cluster
130 143
communications. For more details, see the security document in the
131 144
doc/ directory.
132 145

  
......
140 153
- the command line tools (on the master node)
141 154
- the RAPI daemon (on the master node)
142 155

  
143
Interaction paths are between:
156
The master-daemon related interaction paths are:
144 157

  
145
- (CLI tools/RAPI daemon) and the master daemon, via the so called *luxi* API
158
- (CLI tools/RAPI daemon) and the master daemon, via the so-called *LUXI* API
146 159
- the master daemon and the node daemons, via the node RPC
147 160

  
161
There are also some additional interaction paths for exceptional cases:
162

  
163
- CLI tools might access via SSH the nodes (for ``gnt-cluster copyfile``
164
  and ``gnt-cluster command``)
165
- master failover is a special case when a non-master node will SSH
166
  and do node-RPC calls to the current master
167

  
148 168
The protocol between the master daemon and the node daemons will be
149
changed to HTTP(S), using a simple PUT/GET of JSON-encoded
150
messages. This is done due to difficulties in working with the Twisted
151
framework and its protocols in a multithreaded environment, which we
152
can overcome by using a simpler stack (see the caveats section). The
153
protocol between the CLI/RAPI and the master daemon will be a custom
154
one (called *luxi*): on a UNIX socket on the master node, with rights
155
restricted by filesystem permissions, the CLI/RAPI will talk to the
156
master daemon using JSON-encoded messages.
169
changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S),
170
using a simple PUT/GET of JSON-encoded messages. This is done due to
171
difficulties in working with the Twisted framework and its protocols
172
in a multithreaded environment, which we can overcome by using a
173
simpler stack (see the caveats section).
174

  
175
The protocol between the CLI/RAPI and the master daemon will be a
176
custom one (called *LUXI*): on a UNIX socket on the master node, with
177
rights restricted by filesystem permissions, the CLI/RAPI will talk to
178
the master daemon using JSON-encoded messages.
157 179

  
158 180
The operations supported over this internal protocol will be encoded
159 181
via a python library that will expose a simple API for its
160 182
users. Internally, the protocol will simply encode all objects in JSON
161 183
format and decode them on the receiver side.
162 184

  
185
For more details about the RAPI daemon see `Remote API changes`_, and
186
for the node daemon see `Node daemon changes`_.
187

  
163 188
The LUXI protocol
164 189
+++++++++++++++++
165 190

  
166
We will have two main classes of operations over the master daemon API:
191
As described above, the protocol for making requests or queries to the
192
master daemon will be a UNIX-socket based simple RPC of JSON-encoded
193
messages.
194

  
195
The choice of a UNIX socket was made in order to get rid of the need for
196
authentication and authorisation inside Ganeti; for 2.0, the
197
permissions on the Unix socket itself will determine the access
198
rights.
199

  
200
We will have two main classes of operations over this API:
167 201

  
168 202
- cluster query functions
169 203
- job related functions
170 204

  
171 205
The cluster query functions are usually short-duration, and are the
172
equivalent of the OP_QUERY_* opcodes in ganeti 1.2 (and they are
206
equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are
173 207
still implemented internally with these opcodes). The clients are
174 208
guaranteed to receive the response in a reasonable time via a timeout.
175 209

  
......
180 214
- archive job (see the job queue design doc)
181 215
- wait for job change, which allows a client to wait without polling
182 216

  
183
For more details, see the job queue design document.
217
For more details on the actual operation list, see the `Job Queue`_.
184 218

  
185
Daemon implementation
186
+++++++++++++++++++++
219
Both requests and responses will consist of a JSON-encoded message
220
followed by the ``ETX`` character (ASCII decimal 3), which is not a
221
valid character in JSON messages and thus can serve as a message
222
delimiter. The contents of the messages will be a dictionary with two
223
fields:
224

  
225
:method:
226
  the name of the method called
227
:args:
228
  the arguments to the method, as a list (no keyword arguments allowed)
229

  
230
Responses will follow the same format, with the two fields being:
231

  
232
:success:
233
  a boolean denoting the success of the operation
234
:result:
235
  the actual result, or error message in case of failure
236

  
237
There are two special values for the result field:
238

  
239
- in the case that the operation failed, and this field is a list of
240
  length two, the client library will try to interpret it as an exception,
241
  the first element being the exception type and the second one the
242
  actual exception arguments; this will allow a simple method of passing
243
  Ganeti-related exceptions across the interface
244
- for the *WaitForChange* call (that waits on the server for a job to
245
  change status), if the result is equal to ``nochange`` instead of the
246
  usual result for this call (a list of changes), then the library will
247
  internally retry the call; this is done in order to differentiate
248
  internally between a hung master daemon and a job that simply has not changed
249

  
250
Users of the API that don't use the provided python library should
251
take care of the above two cases.
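
As an illustration of how such users might handle these two cases, a
minimal sketch follows (the function and exception names are
assumptions, not part of the protocol; ``send_message`` and
``recv_message`` are the ETX-framing helpers sketched in the client
API notes above)::

  class RemoteError(Exception):
    # assumed client-side exception class, not defined by the protocol
    pass

  def call_method(sock, method, args):
    while True:
      send_message(sock, {"method": method, "args": args})
      response = recv_message(sock)
      if not response["success"]:
        err = response["result"]
        if isinstance(err, list) and len(err) == 2:
          # an (exception type, exception arguments) pair from the master
          raise RemoteError(err[0], err[1])
        raise RemoteError(err)
      result = response["result"]
      if method == "WaitForChange" and result == "nochange":
        # master daemon is alive but the job has not changed yet: retry
        continue
      return result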
252

  
253

  
254
Master daemon implementation
255
++++++++++++++++++++++++++++
187 256

  
188 257
The daemon will be based around a main I/O thread that will wait for
189 258
new requests from the clients, and that does the setup/shutdown of the
......
195 264
  long-lived, started at daemon startup and terminated only at shutdown
196 265
  time
197 266
- client I/O threads, which are the ones that talk the local protocol
198
  to the clients
267
  (LUXI) to the clients, and are short-lived
199 268

  
200 269
Master startup/failover
201 270
+++++++++++++++++++++++
......
229 298
    - if we are not failing over (but just starting), the
230 299
      quorum agrees that we are the designated master
231 300

  
301
    - if any of the above is false, we prevent the current operation
302
      (i.e. we don't become the master)
303

  
232 304
#. at this point, the node transitions to the master role
233 305

  
234 306
#. for all the in-progress jobs, mark them as failed, with
235 307
   reason unknown or something similar (master failed, etc.)
236 308

  
309
Since, due to exceptional conditions, we could have a situation in which
310
no node can become the master due to inconsistent data, we will have
311
an override switch for the master daemon startup that will assume the
312
current node has the right data and will replicate all the
313
configuration files to the other nodes.
314

  
315
**Note**: the above algorithm is by no means an election algorithm; it
316
is a *confirmation* of the master role currently held by a node.
237 317

  
238 318
Logging
239 319
+++++++
240 320

  
241
The logging system will be switched completely to the logging module;
242
currently it's logging-based, but exposes a different API, which is
243
just overhead. As such, the code will be switched over to standard
244
logging calls, and only the setup will be custom.
321
The logging system will be switched completely to the standard python
322
logging module; currently it's logging-based, but exposes a different
323
API, which is just overhead. As such, the code will be switched over
324
to standard logging calls, and only the setup will be custom.
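
A minimal sketch of what the custom setup could look like (the file
name and format below are illustrative only)::

  import logging

  def setup_logging(logfile, debug=False):
    # one log file per daemon; no separate debug/info/error files
    handler = logging.FileHandler(logfile)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(message)s"))
    root = logging.getLogger("")
    root.addHandler(handler)
    if debug:
      root.setLevel(logging.DEBUG)
    else:
      root.setLevel(logging.INFO)

Daemon code would then use plain ``logging.info()``,
``logging.error()`` and similar calls.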
245 325

  
246 326
With this change, we will remove the separate debug/info/error logs,
247 327
and instead always have one log file per daemon:
......
250 330
- node-daemon.log for the node daemon (this is the same as in 1.2)
251 331
- rapi-daemon.log for the RAPI daemon logs
252 332
- rapi-access.log, an additional log file for the RAPI that will be
253
  in the standard http log format for possible parsing by other tools
333
  in the standard HTTP log format for possible parsing by other tools
334

  
335
Since the `watcher`_ will only submit jobs to the master for startup
336
of the instances, its log file will contain less information than
337
before; it will mainly record that it started an instance, but not the result.
338

  
339
Node daemon changes
340
+++++++++++++++++++
341

  
342
The only change to the node daemon is that, since we need better
343
concurrency, we don't process the inter-node RPC calls in the node
344
daemon itself, but we fork and process each request in a separate
345
child.
254 346

  
255
Since the watcher will only submit jobs to the master for startup of
256
the instances, its log file will contain less information than before,
257
mainly that it will start the instance, but not the results.
347
Since we don't have many calls, and we only fork (not exec), the
348
overhead should be minimal.
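
As a rough illustration of this model (``process_call`` and
``send_reply`` below are placeholders for the real dispatch and
transport code, not existing functions)::

  import os

  def handle_request(request):
    # fork one child per inter-node RPC call; the parent returns to its
    # accept loop immediately, so slow calls do not block the daemon
    pid = os.fork()
    if pid != 0:
      return pid  # parent
    try:
      result = process_call(request)  # placeholder: run the backend code
      send_reply(request, result)     # placeholder: send the response
    finally:
      os._exit(0)  # never fall back into the parent's code path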
258 349

  
259 350
Caveats
260 351
+++++++
......
277 368
  much better served by a daemon-based model
278 369

  
279 370
Another area of discussion is moving away from Twisted in this new
280
implementation. While Twisted hase its advantages, there are also many
281
disatvantanges to using it:
371
implementation. While Twisted has its advantages, there are also many
372
disadvantages to using it:
282 373

  
283 374
- first and foremost, it's not a library, but a framework; thus, if
284
  you use twisted, all the code needs to be 'twiste-ized'; we were able
285
  to keep the 1.x code clean by hacking around twisted in an
286
  unsupported, unrecommended way, and the only alternative would have
287
  been to make all the code be written for twisted
288
- it has some weaknesses in working with multiple threads, since its base
289
  model is designed to replace thread usage by using deferred calls, so while
290
  it can use threads, it's not less flexible in doing so
291

  
292
And, since we already have an HTTP server library for the RAPI, we
293
can just reuse that for inter-node communication.
375
  you use twisted, all the code needs to be 'twiste-ized' and written
376
  in an asynchronous manner, using deferreds; while this method works,
377
  it's not a common way to code and it requires that the entire process
378
  workflow is based around a single *reactor* (Twisted name for a main
379
  loop)
380
- the more advanced granular locking that we want to implement would
381
  require, if written in the asynchronous manner, deep integration with the
382
  Twisted stack, to such an extent that business logic is inseparable
383
  from the protocol coding; we felt that this is an unreasonable requirement,
384
  and that a good protocol library should allow complete separation of
385
  low-level protocol calls and business logic; by comparison, the threaded
386
  approach combined with the HTTP(S) protocol required (for the first iteration)
387
  absolutely no changes from the 1.2 code, and later changes for optimizing
388
  the inter-node RPC calls required just syntactic changes (e.g.
389
  ``rpc.call_...`` to ``self.rpc.call_...``)
390

  
391
Another issue is with the Twisted API stability - during the Ganeti
392
1.x lifetime, we repeatedly had to implement workarounds for changes
393
in the Twisted version, so that for example 1.2 is able to use both
394
Twisted 2.x and 8.x.
395

  
396
In the end, since we already had an HTTP server library for the RAPI,
397
we just reused that for inter-node communication.
294 398

  
295 399

  
296 400
Granular locking
......
302 406

  
303 407
This design addresses how we are going to deal with locking so that:
304 408

  
305
- high urgency operations are not stopped by long length ones
306
- long length operations can run in parallel
307
- we preserve safety (data coherency) and liveness (no deadlock, no work
308
  postponed indefinitely) on the cluster
409
- we preserve data coherency
410
- we prevent deadlocks
411
- we prevent job starvation
309 412

  
310 413
Reaching the maximum possible parallelism is a Non-Goal. We have identified a
311 414
set of operations that are currently bottlenecks and need to be parallelised
312 415
and have worked on those. In the future it will be possible to address other
313 416
needs, thus making the cluster more and more parallel one step at a time.
314 417

  
315
This document only talks about parallelising Ganeti level operations, aka
316
Logical Units, and the locking needed for that. Any other synchronisation lock
418
This section only talks about parallelising Ganeti-level operations, aka
419
Logical Units, and the locking needed for that. Any other synchronization lock
317 420
needed internally by the code is outside its scope.
318 421

  
319
Ganeti 1.2
320
++++++++++
321

  
322
We intend to implement a Ganeti locking library, which can be used by the
323
various ganeti code components in order to easily, efficiently and correctly
324
grab the locks they need to perform their function.
422
Library details
423
+++++++++++++++
325 424

  
326 425
The proposed library has these features:
327 426

  
328
- Internally managing all the locks, making the implementation transparent
427
- internally managing all the locks, making the implementation transparent
329 428
  from their usage
330
- Automatically grabbing multiple locks in the right order (avoid deadlock)
331
- Ability to transparently handle conversion to more granularity
332
- Support asynchronous operation (future goal)
333

  
334
Locking will be valid only on the master node and will not be a distributed
335
operation. In case of master failure, though, if some locks were held it means
336
some opcodes were in progress, so when recovery of the job queue is done it
337
will be possible to determine by the interrupted opcodes which operations could
338
have been left half way through and thus which locks could have been held. It
339
is then the responsibility either of the master failover code, of the cluster
340
verification code, or of the admin to do what's necessary to make sure that any
341
leftover state is dealt with. This is not an issue from a locking point of view
342
because the fact that the previous master has failed means that it cannot do
343
any job.
344

  
345
A corollary of this is that a master-failover operation with both masters alive
346
needs to happen while no other locks are held.
429
- automatically grabbing multiple locks in the right order (avoid deadlock)
430
- ability to transparently handle conversion to more granularity
431
- support asynchronous operation (future goal)
432

  
433
Locking will be valid only on the master node and will not be a
434
distributed operation. Therefore, in case of master failure, the
435
operations currently running will be aborted and the locks will be
436
lost; it remains for the administrator to clean up (if needed) the
437
operation result (e.g. make sure an instance is either installed
438
correctly or removed).
439

  
440
A corollary of this is that a master-failover operation with both
441
masters alive needs to happen while no operations are running, and
442
therefore no locks are held.
443

  
444
All the locks will be represented by objects (like
445
``lockings.SharedLock``), and the individual locks for each object
446
will be created at initialisation time, from the config file.
447

  
448
The API will have a way to grab one or more locks at the same time.
449
Any attempt to grab a lock while already holding one in the wrong order will be
450
checked for, and fail.
451

  
347 452

  
348 453
The Locks
349 454
+++++++++
......
360 465
within the locking library, which, for simplicity, will just use alphabetical
361 466
order.
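
As a sketch of what the alphabetical ordering means in practice (the
names and the ``acquire``/``release`` signatures below are
illustrative, not the final library API)::

  def acquire_locks(locks, names, shared=False):
    # locks: mapping of lock name to a SharedLock-like object; acquiring
    # in sorted (alphabetical) order gives every caller the same global
    # order, which is what prevents deadlocks between concurrent users
    acquired = []
    try:
      for name in sorted(names):
        locks[name].acquire(shared=shared)
        acquired.append(name)
    except Exception:
      for name in reversed(acquired):
        locks[name].release()
      raise
    return acquired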
362 467

  
468
Each lock has the following three possible statuses:
469

  
470
- unlocked (anyone can grab the lock)
471
- shared (anyone can grab/have the lock but only in shared mode)
472
- exclusive (no one else can grab/have the lock)
473

  
363 474
Handling conversion to more granularity
364 475
+++++++++++++++++++++++++++++++++++++++
365 476

  
366 477
In order to convert to a more granular approach transparently each time we
367 478
split a lock into more granular ones we'll create a "metalock", which will depend on those
368
sublocks and live for the time necessary for all the code to convert (or
479
sub-locks and live for the time necessary for all the code to convert (or
369 480
forever, in some conditions). When a metalock exists all converted code must
370 481
acquire it in shared mode, so it can run concurrently, but still be exclusive
371 482
with old code, which acquires it exclusively.
......
373 484
In the beginning the only such lock will be what replaces the current "command"
374 485
lock, and will acquire all the locks in the system, before proceeding. This
375 486
lock will be called the "Big Ganeti Lock" because holding that one will avoid
376
any other concurrent ganeti operations.
487
any other concurrent Ganeti operations.
377 488

  
378 489
We might also want to devise more metalocks (e.g. all nodes, all nodes+config)
379 490
in order to make it easier for some parts of the code to acquire what they need
......
383 494
decide to split them into an even more fine grained approach, but this will
384 495
probably be only after the first 2.0 version has been released.
385 496

  
386
Library API
387
+++++++++++
388

  
389
All the locking will be its own class, and the locks will be created at
390
initialisation time, from the config file.
391

  
392
The API will have a way to grab one or more than one locks at the same time.
393
Any attempt to grab a lock while already holding one in the wrong order will be
394
checked for, and fail.
395

  
396 497
Adding/Removing locks
397 498
+++++++++++++++++++++
398 499

  
......
405 506
explicitly. The implementation of this will be handled in the locking library
406 507
itself.
407 508

  
408
Of course when instances or nodes disappear from the cluster the relevant locks
409
must be removed. This is easier than adding new elements, as the code which
410
removes them must own them exclusively or can queue for their ownership, and
411
thus deals with metalocks exactly as normal code acquiring those locks. Any
412
operation queueing on a removed lock will fail after its removal.
509
When instances or nodes disappear from the cluster the relevant locks
510
must be removed. This is easier than adding new elements, as the code
511
which removes them must own them exclusively already, and thus deals
512
with metalocks exactly as normal code acquiring those locks. Any
513
operation queuing on a removed lock will fail after its removal.
413 514

  
414 515
Asynchronous operations
415 516
+++++++++++++++++++++++
......
421 522
In the future we may want to implement different types of asynchronous
422 523
operations such as:
423 524

  
424
- Try to acquire this lock set and fail if not possible
425
- Try to acquire one of these lock sets and return the first one you were
525
- try to acquire this lock set and fail if not possible
526
- try to acquire one of these lock sets and return the first one you were
426 527
  able to get (or after a timeout) (select/poll like)
427 528

  
428 529
These operations can be used to prioritize operations based on available locks,
......
441 542
"tasklets" with their own locking requirements. A different design doc (or mini
442 543
design doc) will cover the move from Logical Units to tasklets.
443 544

  
444
Lock acquisition code path
445
++++++++++++++++++++++++++
545
Code examples
546
+++++++++++++
446 547

  
447 548
In general when acquiring locks we should use a code path equivalent to::
448 549

  
......
453 554
  finally:
454 555
    lock.release()
455 556

  
456
This makes sure we release all locks, and avoid possible deadlocks. Of course
457
extra care must be used not to leave, if possible locked structures in an
458
unusable state.
557
This makes sure we release all locks, and avoid possible deadlocks. Of
558
course extra care must be used not to leave, if possible locked
559
structures in an unusable state. Note that with Python 2.5 a simpler
560
syntax will be possible, but we want to keep compatibility with Python
561
2.4 so the new constructs should not be used.
459 562

  
460 563
In order to avoid this extra indentation and code changes everywhere in the
461 564
Logical Units code, we decided to allow LUs to declare locks, and then execute
......
500 603
queue to store these and to be able to process as many as possible in
501 604
parallel.
502 605

  
503
A ganeti job will consist of multiple ``OpCodes`` which are the basic
606
A Ganeti job will consist of multiple ``OpCodes`` which are the basic
504 607
element of operation in Ganeti 1.2 (and will remain as such). Most
505 608
command-level commands are equivalent to one OpCode, or in some cases
506 609
to a sequence of opcodes, all of the same type (e.g. evacuating a node
......
518 621
   of the waiting threads will pick up the new job.
519 622
#. Client waits for job status updates by calling a waiting RPC function.
520 623
   Log messages may be shown to the user. Until the job is started, it can also
521
   be cancelled.
624
   be canceled.
522 625
#. As soon as the job is finished, its final result and status can be retrieved
523 626
   from the server.
524 627
#. If the client archives the job, it gets moved to a history directory.
......
653 756
+++++++
654 757

  
655 758
Archived jobs are kept in a separate directory,
656
/var/lib/ganeti/queue/archive/.  This is done in order to speed up the
657
queue handling: by default, the jobs in the archive are not touched by
658
any functions. Only the current (unarchived) jobs are parsed, loaded,
659
and verified (if implemented) by the master daemon.
759
``/var/lib/ganeti/queue/archive/``.  This is done in order to speed up
760
the queue handling: by default, the jobs in the archive are not
761
touched by any functions. Only the current (unarchived) jobs are
762
parsed, loaded, and verified (if implemented) by the master daemon.
660 763

  
661 764

  
662 765
Ganeti updates
......
667 770
way to prevent new jobs entering the queue.
668 771

  
669 772

  
670

  
671 773
Object parameters
672 774
~~~~~~~~~~~~~~~~~
673 775

  
......
697 799
  a hypervisor parameter (or hypervisor specific parameter) is defined
698 800
  as a parameter that is interpreted by the hypervisor support code in
699 801
  Ganeti and usually is specific to a particular hypervisor (like the
700
  kernel path for PVM which makes no sense for HVM).
802
  kernel path for `PVM`_ which makes no sense for `HVM`_).
701 803

  
702 804
:backend parameter:
703 805
  a backend parameter is defined as an instance parameter that can be
......
727 829
hold defaults for the instances:
728 830

  
729 831
- hvparams, a dictionary indexed by hypervisor type, holding default
730
  values for hypervisor parameters that are not defined/overrided by
832
  values for hypervisor parameters that are not defined/overridden by
731 833
  the instances of this hypervisor type
732 834

  
733 835
- beparams, a dictionary holding (for 2.0) a single element 'default',
......
754 856
The names for hypervisor parameters in the instance.hvparams subtree
755 857
should be chosen to be as generic as possible, especially if specific
756 858
parameters could conceivably be useful for more than one hypervisor,
757
e.g. instance.hvparams.vnc_console_port instead of using both
758
instance.hvparams.hvm_vnc_console_port and
759
instance.hvparams.kvm_vnc_console_port.
859
e.g. ``instance.hvparams.vnc_console_port`` instead of using both
860
``instance.hvparams.hvm_vnc_console_port`` and
861
``instance.hvparams.kvm_vnc_console_port``.
760 862

  
761 863
There are some special cases related to disks and NICs (for example):
762
a disk has both ganeti-related parameters (e.g. the name of the LV)
864
a disk has both Ganeti-related parameters (e.g. the name of the LV)
763 865
and hypervisor-related parameters (how the disk is presented to/named
764 866
in the instance). The former parameters remain as proper-instance
765 867
parameters, while the latter value are migrated to the hvparams
......
806 908
  for this hypervisor
807 909
:CheckParamSyntax(hvparams): checks that the given parameters are
808 910
  valid (as in the names are valid) for this hypervisor; usually just
809
  comparing hvparams.keys() and cls.PARAMETERS; this is a class method
810
  that can be called from within master code (i.e. cmdlib) and should
811
  be safe to do so
911
  comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class
912
  method that can be called from within master code (i.e. cmdlib) and
913
  should be safe to do so
812 914
:ValidateParameters(hvparams): verifies the values of the provided
813 915
  parameters against this hypervisor; this is a method that will be
814 916
  called on the target node, from backend.py code, and as such can
......
839 941
The parameter changes will have impact on the OpCodes, especially on
840 942
the following ones:
841 943

  
842
- OpCreateInstance, where the new hv and be parameters will be sent as
944
- ``OpCreateInstance``, where the new hv and be parameters will be sent as
843 945
  dictionaries; note that all hv and be parameters are now optional, as
844 946
  the values can be instead taken from the cluster
845
- OpQueryInstances, where we have to be able to query these new
947
- ``OpQueryInstances``, where we have to be able to query these new
846 948
  parameters; the syntax for names will be ``hvparam/$NAME`` and
847 949
  ``beparam/$NAME`` for querying an individual parameter out of one
848 950
  dictionary, and ``hvparams``, respectively ``beparams``, for the whole
849 951
  dictionaries
850
- OpModifyInstance, where the the modified parameters are sent as
952
- ``OpModifyInstance``, where the modified parameters are sent as
851 953
  dictionaries
852 954

  
853 955
Additionally, we will need new OpCodes to modify the cluster-level
......
891 993
assumptions made initially are not true and that more flexibility is
892 994
needed.
893 995

  
894
One main assupmtion made was that disk failures should be treated as 'rare'
996
One main assumption made was that disk failures should be treated as 'rare'
895 997
events, and that each of them needs to be manually handled in order to ensure
896 998
data safety; however, both these assumptions are false:
897 999

  
898
- disk failures can be a common occurence, based on usage patterns or cluster
1000
- disk failures can be a common occurrence, based on usage patterns or cluster
899 1001
  size
900 1002
- our disk setup is robust enough (referring to DRBD8 + LVM) that we could
901 1003
  automate more of the recovery
......
956 1058
parameters.
957 1059

  
958 1060
This means that we in effect take ownership of the minor space for
959
that device type; if there's a user-created drbd minor, it will be
1061
that device type; if there's a user-created DRBD minor, it will be
960 1062
automatically removed.
961 1063

  
962 1064
The change will have the effect of reducing the number of external
963 1065
commands run per device from a constant number times the index of the
964 1066
first free DRBD minor to just a constant number.
965 1067

  
966
Removal of obsolete device types (md, drbd7)
1068
Removal of obsolete device types (MD, DRBD7)
967 1069
++++++++++++++++++++++++++++++++++++++++++++
968 1070

  
969 1071
We need to remove these device types because of two issues. First,
970
drbd7 has bad failure modes in case of dual failures (both network and
1072
DRBD7 has bad failure modes in case of dual failures (both network and
971 1073
disk): it cannot propagate the error up the device stack and instead
972
just panics. Second, due to the assymetry between primary and
973
secondary in md+drbd mode, we cannot do live failover (not even if we
974
had md+drbd8).
1074
just panics. Second, due to the asymmetry between primary and
1075
secondary in MD+DRBD mode, we cannot do live failover (not even if we
1076
had MD+DRBD8).
975 1077

  
976 1078
File-based storage support
977 1079
++++++++++++++++++++++++++
978 1080

  
979
This is covered by a separate design doc (<em>Vinales</em>) and
980
would allow us to get rid of the hard requirement for testing
981
clusters; it would also allow people who have SAN storage to do live
982
failover taking advantage of their storage solution.
1081
Using files instead of logical volumes for instance storage would
1082
allow us to get rid of the hard requirement for volume groups for
1083
testing clusters; it would also allow the use of SAN storage to do
1084
live failover taking advantage of this storage solution.
983 1085

  
984 1086
Better LVM allocation
985 1087
+++++++++++++++++++++
......
1030 1132
#. if no, and previous status was no, do nothing
1031 1133
#. if no, and previous status was yes:
1032 1134
    #. if more than one node is inconsistent, do nothing
1033
    #. if only one node is incosistent:
1135
    #. if only one node is inconsistent:
1034 1136
        #. run ``vgreduce --removemissing``
1035
        #. log this occurence in the ganeti log in a form that
1137
        #. log this occurrence in the Ganeti log in a form that
1036 1138
           can be used for monitoring
1037 1139
        #. [FUTURE] run ``replace-disks`` for all
1038 1140
           instances affected
......
1067 1169
- verify that S2 (the node the user has chosen to keep as secondary) has
1068 1170
  valid data (is consistent)
1069 1171

  
1070
- tear down the current DRBD association and setup a drbd pairing between
1172
- tear down the current DRBD association and setup a DRBD pairing between
1071 1173
  P2 (P2 is indicated by the user) and S2; since P2 has no data, it will
1072
  start resyncing from S2
1174
  start re-syncing from S2
1073 1175

  
1074 1176
- as soon as P2 is in state SyncTarget (i.e. after the resync has started
1075 1177
  but before it has finished), we can promote it to primary role (r/w)
......
1083 1185
will cause I/O errors on the instance, so (if a longer instance
1084 1186
downtime is acceptable) we can postpone the restart of the instance
1085 1187
until the resync is done. However, disk I/O errors on S2 will cause
1086
dataloss, since we don't have a good copy of the data anymore, so in
1188
data loss, since we don't have a good copy of the data anymore, so in
1087 1189
this case waiting for the sync to complete is not an option. As such,
1088 1190
it is recommended that this feature is used only in conjunction with
1089 1191
proper disk monitoring.
......
1096 1198
+++++++
1097 1199

  
1098 1200
The dynamic device model, while more complex, has an advantage: it
1099
will not reuse by mistake another's instance DRBD device, since it
1100
always looks for either our own or a free one.
1201
will not reuse by mistake the DRBD device of another instance, since
1202
it always looks for either our own or a free one.
1101 1203

  
1102 1204
The static one, in contrast, will assume that given a minor number N,
1103 1205
it's ours and we can take over. This needs careful implementation such
1104 1206
that if the minor is in use, either we are able to cleanly shut it
1105 1207
down, or we abort the startup. Otherwise, it could be that we start
1106
syncing between two instance's disks, causing dataloss.
1208
syncing between two instances' disks, causing data loss.
1107 1209

  
1108 1210

  
1109 1211
Variable number of disk/NICs per instance
......
1115 1217
In order to support high-security scenarios (for example read-only sda
1116 1218
and read-write sdb), we need to make a fully flexible disk
1117 1219
definition. This has less impact than it might look at first sight:
1118
only the instance creation has hardcoded number of disks, not the disk
1220
only the instance creation has a hard-coded number of disks, not the disk
1119 1221
handling code. The block device handling and most of the instance
1120 1222
handling code is already working with "the instance's disks" as
1121 1223
opposed to "the two disks of the instance", but some pieces are not
......
1123 1225

  
1124 1226
The objective is to be able to specify the number of disks at
1125 1227
instance creation, and to be able to toggle a disk from read-only to
1126
read-write a disk afterwards.
1228
read-write afterward.
1127 1229

  
1128 1230
Variable number of NICs
1129 1231
+++++++++++++++++++++++
......
1131 1233
Similar to the disk change, we need to allow multiple network
1132 1234
interfaces per instance. This will affect the internal code (some
1133 1235
function will have to stop assuming that ``instance.nics`` is a list
1134
of length one), the OS api which currently can export/import only one
1236
of length one), the OS API which currently can export/import only one
1135 1237
instance, and the command line interface.
1136 1238

  
1137 1239
Interface changes
......
1176 1278
When designing the new OS API our priorities are:
1177 1279
- ease of use
1178 1280
- future extensibility
1179
- ease of porting from the old api
1281
- ease of porting from the old API
1180 1282
- modularity
1181 1283

  
1182 1284
As such we want to limit the number of scripts that must be written to support
......
1228 1330
  instances will be forced to have a number of disks greater than or equal to
1229 1331
  that of the export.
1230 1332
- Some scripts are not compulsory: if such a script is missing the relevant
1231
  operations will be forbidden for instances of that os. This makes it easier
1333
  operations will be forbidden for instances of that OS. This makes it easier
1232 1334
  to distinguish between unsupported operations and no-op ones (if any).
1233 1335

  
1234 1336

  
......
1239 1341
inputs from environment variables.  We expect the following input values:
1240 1342

  
1241 1343
OS_API_VERSION
1242
  The version of the OS api that the following parameters comply with;
1344
  The version of the OS API that the following parameters comply with;
1243 1345
  this is used so that in the future we could have OSes supporting
1244 1346
  multiple versions and thus Ganeti send the proper version in this
1245 1347
  parameter
1246 1348
INSTANCE_NAME
1247 1349
  Name of the instance acted on
1248 1350
HYPERVISOR
1249
  The hypervisor the instance should run on (eg. 'xen-pvm', 'xen-hvm', 'kvm')
1351
  The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm')
1250 1352
DISK_COUNT
1251 1353
  The number of disks this instance will have
1252 1354
NIC_COUNT
1253
  The number of nics this instance will have
1355
  The number of NICs this instance will have
1254 1356
DISK_<N>_PATH
1255 1357
  Path to the Nth disk.
1256 1358
DISK_<N>_ACCESS
......
1268 1370
NIC_<N>_BRIDGE
1269 1371
  Node bridge the Nth network interface will be connected to
1270 1372
NIC_<N>_FRONTEND_TYPE
1271
  Type of the Nth nic as seen by the instance. For example 'virtio', 'rtl8139', etc.
1373
  Type of the Nth NIC as seen by the instance. For example 'virtio',
1374
  'rtl8139', etc.
1272 1375
DEBUG_LEVEL
1273 1376
  Whether more output should be produced, for debugging purposes. Currently the
1274 1377
  only valid values are 0 and 1.
1275 1378

  
1276
These are only the basic variables we are thinking of now, but more may come
1277
during the implementation and they will be documented in the ganeti-os-api man
1278
page. All these variables will be available to all scripts.
1379
These are only the basic variables we are thinking of now, but more
1380
may come during the implementation and they will be documented in the
1381
``ganeti-os-api`` man page. All these variables will be available to
1382
all scripts.
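
Purely as an illustration, a create script written in Python could
consume these variables along the following lines (this assumes that
``<N>`` counts from 0, which is not specified above)::

  import os

  api_version = os.environ["OS_API_VERSION"]
  instance_name = os.environ["INSTANCE_NAME"]
  disk_count = int(os.environ["DISK_COUNT"])

  disks = []
  for idx in range(disk_count):
    disks.append({
      "path": os.environ["DISK_%d_PATH" % idx],
      "access": os.environ["DISK_%d_ACCESS" % idx],
    })
  # ... format the disks and install the instance data ...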
1279 1383

  
1280 1384
Some scripts will need a few more pieces of information to work. These will have
1281 1385
per-script variables, such as for example:
......
1304 1408
create and import scripts are supposed to format/initialise the given block
1305 1409
devices and install the correct instance data. The export script is supposed to
1306 1410
export instance data to stdout in a format understandable by the import
1307
script. The data will be compressed by ganeti, so no compression should be
1411
script. The data will be compressed by Ganeti, so no compression should be
1308 1412
done. The rename script should only modify the instance's knowledge of what
1309 1413
its name is.
1310 1414

  
......
1312 1416
++++++++++++++++++++++++++++++++
1313 1417

  
1314 1418
Similar to Ganeti 1.2, OS specifications will need to provide a
1315
'ganeti_api_version' containing list of numbers matching the version(s) of the
1316
api they implement. Ganeti itself will always be compatible with one version of
1317
the API and may maintain retrocompatibility if it's feasible to do so. The
1318
numbers are one-per-line, so an OS supporting both version 5 and version 20
1319
will have a file containing two lines. This is different from Ganeti 1.2, which
1320
only supported one version number.
1419
'ganeti_api_version' file containing a list of numbers matching the
1420
version(s) of the API they implement. Ganeti itself will always be
1421
compatible with one version of the API and may maintain backwards
1422
compatibility if it's feasible to do so. The numbers are one-per-line,
1423
so an OS supporting both version 5 and version 20 will have a file
1424
containing two lines. This is different from Ganeti 1.2, which only
1425
supported one version number.
1321 1426

  
1322 1427
In addition to that, an OS will be able to declare that it supports only a
1323
subset of the ganeti hypervisors, by declaring them in the 'hypervisors' file.
1428
subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file.
1324 1429

  
1325 1430

  
1326 1431
Caveats/Notes
......
1341 1446
Remote API changes
1342 1447
~~~~~~~~~~~~~~~~~~
1343 1448

  
1344
The first Ganeti RAPI was designed and deployed with the Ganeti 1.2.5 release.
1345
That version provide Read-Only access to a cluster state. Fully functional
1346
read-write API demand significant internal changes which are in a pipeline for
1347
Ganeti 2.0 release.
1449
The first Ganeti remote API (RAPI) was designed and deployed with the
1450
Ganeti 1.2.5 release.  That version provides read-only access to the
1451
cluster state. A fully functional read-write API demands significant
1452
internal changes which will be implemented in version 2.0.
1348 1453

  
1349
We decided to go with implementing the Ganeti RAPI in a RESTful way, which is
1350
aligned with key features we looking. It is simple, stateless, scalable and
1351
extensible paradigm of API implementation. As transport it uses HTTP over SSL,
1352
and we are implementing it in JSON encoding, but in a way it possible to extend
1353
and provide any other one.
1454
We decided to go with implementing the Ganeti RAPI in a RESTful way,
1455
which is aligned with the key features we are looking for. It is a simple,
1456
stateless, scalable and extensible paradigm of API implementation. As
1457
transport it uses HTTP over SSL, and we are implementing it with JSON
1458
encoding, but in a way that makes it possible to extend it and provide any other
1459
one.
1354 1460

  
1355 1461
Design
1356 1462
++++++
1357 1463

  
1358
The Ganeti API implemented as independent daemon, running on the same node
1359
with the same permission level as Ganeti master daemon. Communication done
1360
through unix socket protocol provided by Ganeti luxi library.
1361
In order to keep communication asynchronous RAPI process two types of client
1362
requests:
1464
The Ganeti RAPI is implemented as an independent daemon, running on the
1465
same node with the same permission level as the Ganeti master
1466
daemon. Communication is done through the LUXI library to the master
1467
daemon. In order to keep communication asynchronous, the RAPI processes two
1468
types of client requests:
1363 1469

  
1364
- queries: sever able to answer immediately
1365
- jobs: some time needed.
1470
- queries: server is able to answer immediately
1471
- job submission: some time is required for a useful response
1366 1472

  
1367
In the query case requested data send back to client in http body. Typical
1368
examples of queries would be list of nodes, instances, cluster info, etc.
1369
Dealing with jobs client instead of waiting until job completes receive a job
1370
id, the identifier which allows to query the job progress in the job queue.
1371
(See job queue design doc for details)
1473
In the query case, the requested data is sent back to the client in the HTTP
1474
response body. Typical examples of queries would be: list of nodes,
1475
instances, cluster info, etc.
1372 1476

  
1373
Internally, each exported object has an version identifier, which is used as a
1374
state stamp in the http header E-Tag field for request/response to avoid a race
1375
condition.
1477
In the case of job submission, the client receives a job ID, the
1478
identifier which allows it to query the job's progress in the job queue
1479
(see `Job Queue`_).
1480

  
1481
Internally, each exported object has a version identifier, which is
1482
used as a state identifier in the HTTP ``ETag`` header field for
1483
requests/responses to avoid race conditions.
1376 1484

  
1377 1485

  
1378 1486
Resource representation
1379 1487
+++++++++++++++++++++++
1380 1488

  
1381
The key difference of REST approach from others API is instead having one URI
1382
for all our requests, REST demand separate service by resources with unique
1383
URI. Each of them should have limited amount of stateless and standard HTTP
1489
The key difference of using REST instead of other API styles is that REST
1490
requires separation of services via resources with unique URIs. Each
1491
of them should have a limited amount of state and support the standard HTTP
1384 1492
methods: GET, POST, DELETE, PUT.
1385 1493

  
1386
For example in Ganeti case we can have a set of URI:
1387
 - /{clustername}/instances
1388
 - /{clustername}/instances/{instancename}
1389
 - /{clustername}/instances/{instancename}/tag
1390
 - /{clustername}/tag
1494
For example in Ganeti's case we can have a set of URIs:
1495

  
1496
 - ``/{clustername}/instances``
1497
 - ``/{clustername}/instances/{instancename}``
1498
 - ``/{clustername}/instances/{instancename}/tag``
1499
 - ``/{clustername}/tag``
1391 1500

  
1392
A GET request to /{clustername}/instances will return list of instances, a POST
1393
to /{clustername}/instances should create new instance, a DELETE
1394
/{clustername}/instances/{instancename} should delete instance, a GET
1395
/{clustername}/tag get cluster tag
1501
A GET request to ``/{clustername}/instances`` will return the list of
1502
instances, a POST to ``/{clustername}/instances`` should create a new
1503
instance, a DELETE ``/{clustername}/instances/{instancename}`` should
1504
delete the instance, and a GET to ``/{clustername}/tag`` should return the
1505
cluster tags.
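
Purely as an illustration of the intended usage (the host, port,
credentials and certificate path below are placeholders, and the
third-party ``requests`` module is used only for brevity)::

  import requests

  base = "https://cluster1.example.com:5080"  # placeholder host and port
  auth = ("rapi-user", "secret")              # basic HTTP authentication

  # GET the instance list (the version prefix is omitted for brevity)
  resp = requests.get(base + "/cluster1/instances", auth=auth,
                      verify="/path/to/cluster-ca.pem")
  instances = resp.json()

A POST to the same collection would ask for a new instance to be
created, and the response would carry a job ID to be polled via the
job queue.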
1396 1506

  
1397
Each resource URI has a version prefix. The complete list of resources id TBD.
1507
Each resource URI will have a version prefix. The resource IDs are to
1508
be determined.
1398 1509

  
1399
Internal encoding might be JSON, XML, or any other. The JSON encoding fits
1400
nicely in Ganeti RAPI needs. Specific representation client can request with
1401
Accept field in the HTTP header.
1510
Internal encoding might be JSON, XML, or any other. The JSON encoding
1511
fits the Ganeti RAPI's needs nicely. The client can request a specific
1512
representation via the Accept field in the HTTP header.
1402 1513

  
1403
The REST uses standard HTTP as application protocol (not just as a transport)
1404
for resources access. Set of possible result codes is a subset of standard HTTP
1405
results. The stateless provide additional reliability and transparency to
1406
operations.
1514
REST uses HTTP as its transport and application protocol for resource
1515
access. The set of possible responses is a subset of standard HTTP
1516
responses.
1517

  
1518
The statelessness model provides additional reliability and
1519
transparency to operations (e.g. only one request needs to be analyzed
1520
to understand the in-progress operation, not a sequence of multiple
1521
requests/responses).
1407 1522

  
1408 1523

  
1409 1524
Security
1410 1525
++++++++
1411 1526

  
1412
With the write functionality security becomes much bigger an issue.  The Ganeti
1413
RAPI uses basic HTTP authentication on top of SSL connection to grant access to
1414
an exported resource. The password stores locally in Apache-style .htpasswd
1415
file. Only one level of privileges is supported.
1527
With the write functionality, security becomes a much bigger issue.
1528
The Ganeti RAPI uses basic HTTP authentication on top of an
1529
SSL-secured connection to grant access to an exported resource. The
1530
password is stored locally in an Apache-style ``.htpasswd`` file. Only
1531
one level of privileges is supported.
1532

  
1533
Caveats
1534
+++++++
1535

  
1536
The model detailed above for job submission requires the client to
1537
poll periodically for updates to the job; an alternative would be to
1538
allow the client to request a callback, or a 'wait for updates' call.
1539

  
1540
The callback model was not considered due to the following two issues:
1416 1541

  
1542
- callbacks would require a new model of allowed callback URLs,
1543
  together with a method of managing these
1544
- callbacks only work when the client and the master are in the same
1545
  security domain, and they fail in the other cases (e.g. when there is
1546
  a firewall between the client and the RAPI daemon that only allows
1547
  client-to-RAPI calls, which is usual in DMZ cases)
1548

  
1549
The 'wait for updates' method is not suited to the HTTP protocol,
1550
where requests are supposed to be short-lived.
1417 1551

  
1418 1552
Command line changes
1419 1553
~~~~~~~~~~~~~~~~~~~~
1420 1554

  
1421 1555
Ganeti 2.0 introduces several new features as well as new ways to
1422 1556
handle instance resources like disks or network interfaces. This
1423
requires some noticable changes in the way commandline arguments are
1557
requires some noticeable changes in the way command line arguments are
1424 1558
handled.
1425 1559

  
1426
- extend and modify commandline syntax to support new features
1427
- ensure consistent patterns in commandline arguments to reduce cognitive load
1560
- extend and modify command line syntax to support new features
1561
- ensure consistent patterns in command line arguments to reduce
1562
  cognitive load
1428 1563

  
1429 1564
The design changes that require these changes are, in no particular
1430 1565
order:
......
1437 1572
  cluster, each supporting different parameters,
1438 1573
- support for device type CDROM (via ISO image)
1439 1574

  
1440
As such, there are several areas of Ganeti where the commandline
1575
As such, there are several areas of Ganeti where the command line
1441 1576
arguments will change:
1442 1577

  
1443 1578
- Cluster configuration
......
1452 1587
  - handling of CDROM devices and
1453 1588
  - handling of hypervisor specific options.
1454 1589

  
1455
There are several areas of Ganeti where the commandline arguments will change:
1590
There are several areas of Ganeti where the command line arguments
1591
will change:
1456 1592

  
1457 1593
- Cluster configuration
1458 1594

  
......
1552 1688
:--net: for network interface cards
1553 1689
:--disk: for disk devices
1554 1690

  
1555
The syntax to the device specific options is similiar to the generic
1691
The syntax to the device specific options is similar to the generic
1556 1692
device options, but instead of specifying a device number like for
1557 1693
gnt-instance add, you specify the magic string add. The new device
1558 1694
will always be appended at the end of the list of devices of this type
......
1584 1720
:--net: for network interface cards
1585 1721
:--disk: for disk devices
1586 1722

  
1587
The syntax to the device specific options is similiar to the generic
1723
The syntax to the device specific options is similar to the generic
1588 1724
device options. The device number you specify identifies the device to
1589 1725
be modified.
1590 1726

  
1591
Example: gnt-instance modify --disk 2:access=r
1727
Example::
1728

  
1729
  gnt-instance modify --disk 2:access=r
1592 1730

  
1593 1731
Hypervisor Options
1594 1732
++++++++++++++++++
......
1596 1734
Ganeti 2.0 will support more than one hypervisor. Different
1597 1735
hypervisors have various options that only apply to a specific
1598 1736
hypervisor. Those hypervisor specific options are treated specially
1599
via the --hypervisor option. The generic syntax of the hypervisor
1600
option is as follows:
1737
via the ``--hypervisor`` option. The generic syntax of the hypervisor
1738
option is as follows::
1601 1739

  
1602 1740
  --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
1603 1741

  
......
1608 1746
:$VALUE: hypervisor option value, string
1609 1747

  
1610 1748
The hypervisor option for an instance can be set on instance creation
1611
time via the gnt-instance add command. If the hypervisor for an
1749
time via the ``gnt-instance add`` command. If the hypervisor for an
1612 1750
instance is not specified upon instance creation, the default
1613 1751
hypervisor will be used.
1614 1752

  
......
1616 1754
+++++++++++++++++++++++++++++++
1617 1755

  
1618 1756
The hypervisor parameters of an existing instance can be modified
1619
using --hypervisor option of the gnt-instance modify command. However,
1620
the hypervisor type of an existing instance can not be changed, only
1621
the particular hypervisor specific option can be changed. Therefore,
1622
the format of the option parameters has been simplified to omit the
1623
hypervisor name and only contain the comma separated list of
1624
option-value pairs.
1757
using the ``--hypervisor`` option of the ``gnt-instance modify``
1758
command. However, the hypervisor type of an existing instance can not
1759
be changed, only the particular hypervisor specific option can be
1760
changed. Therefore, the format of the option parameters has been
1761
simplified to omit the hypervisor name and only contain the comma
1762
separated list of option-value pairs.
1625 1763

  
1626
Example: gnt-instance modify --hypervisor
1627
cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance
1764
Example::
1765

  
1766
  gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance
1628 1767

  
1629 1768
gnt-cluster commands
1630 1769
++++++++++++++++++++
......
1664 1803
Hypervisor cluster defaults
1665 1804
+++++++++++++++++++++++++++
1666 1805

  
1667
The generic format of the hypervisor clusterwide default setting option is:
1806
The generic format of the hypervisor cluster-wide default setting
1807
option is::
1668 1808

  
1669 1809
  --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]
1670 1810

  
......
1673 1813
:$OPTION: cluster default option, string,
1674 1814
:$VALUE: cluster default option value, string.
1675 1815

  
1816
Glossary
1817
========
1676 1818

  
1677
Functionality changes
1678
---------------------
1819
Since this document is only a delta from the Ganeti 1.2 design, there are
1820
some unexplained terms. Here is a non-exhaustive list.
1679 1821

  
1680
The disk storage will receive some changes, and will also remove
1681
support for the drbd7 and md disk types. See the
1682
design-2.0-disk-changes document.
1822
.. _HVM:
1683 1823

  
1684
The configuration storage will be changed, with the effect that more
1685
data will be available on the nodes for access from outside ganeti
1686
(e.g. from shell scripts) and that nodes will get slightly more
1687
awareness of the cluster configuration.
1824
HVM
1825
  hardware virtualization mode, where the virtual machine is oblivious
1826
  to the fact that it's being virtualized and all the hardware is emulated
1688 1827

  
1689
The RAPI will enable modify operations (beside the read-only queries
1690
that are available today), so in effect almost all the operations
1691
available today via the ``gnt-*`` commands will be available via the
1692
remote API.
1828
.. _LU:
1693 1829

  
1694
A change in the hypervisor support area will be that we will support
1695
multiple hypervisors in parallel in the same cluster, so one could run
1696
Xen HVM side-by-side with Xen PVM on the same cluster.
1830
LogicalUnit
1831
  the code associated with an OpCode, e.g. the code that implements the
1832
  startup of an instance
1697 1833

  
1698
New features
1699
------------
1834
.. _opcode:
1835

  
1836
OpCode
1837
  a data structure encapsulating a basic cluster operation; for example,
1838
  start instance, add instance, etc.
1839

  
1840
.. _PVM:
1700 1841

  
1701
There will be a number of minor feature enhancements targeted to
1702
either 2.0 or subsequent 2.x releases:
1842
PVM
1843
  para-virtualization mode, where the virtual machine knows it's being
1844
  virtualized and as such there is no need for hardware emulation
1703 1845

  
1704
- multiple disks, with custom properties (read-only/read-write, exportable,
1705
  etc.)
1706
- multiple NICs
1846
.. _watcher:
1707 1847

  
1708
These changes will require OS API changes, details are in the
1709
design-2.0-os-interface document. And they will also require many
1710
command line changes, see the design-2.0-commandline-parameters
1711
document.
1848
watcher
1849
  ``ganeti-watcher`` is a tool that should be run regularly from cron
1850
  and takes care of restarting failed instances, restarting secondary
1851
  DRBD devices, etc. For more details, see the man page
1852
  ``ganeti-watcher(8)``.
