Revision ac932df1 doc/design-2.1.rst

b/doc/design-2.1.rst
reading/writing to disk fails constantly.


New Features
------------

Automated Ganeti Cluster Merger
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current situation
+++++++++++++++++

Currently there's no easy way to merge two or more clusters together,
even though merging is needed to optimize resource usage. The goal of
this design doc is to come up with an easy-to-use solution which allows
you to merge two or more clusters.

Initial contact
+++++++++++++++

As the design of Ganeti is based on an autonomous system, Ganeti by
itself has no way to reach nodes outside of its cluster. To overcome
this situation we're required to prepare the cluster before we can go
ahead with the actual merge: we have to replace at least the ssh keys on
the affected nodes before we can do any operation with ``gnt-``
commands.

To make this an automated process we'll ask the user to provide us with
the root password of every cluster we have to merge. We use the password
to grab the current ``id_dsa`` key and then rely on that ssh key for any
further communication until the cluster is fully merged.
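
As a minimal sketch of this step, the helpers below only build the
command lines that would copy the key and later use it. The remote key
location ``/root/.ssh/id_dsa``, the host names and the helper names are
assumptions made for illustration. ``scp`` would prompt for the root
password interactively, and nothing is executed here.

```python
def build_fetch_key_command(master_host, local_path,
                            remote_key="/root/.ssh/id_dsa"):
    """Return the argv that copies the remote root key to local_path.

    remote_key is an assumed location, not taken from the design.
    """
    source = "root@%s:%s" % (master_host, remote_key)
    return ["scp", "-q", source, local_path]


def build_ssh_command(host, key_path, remote_command):
    """Return the argv that runs remote_command on host using the key."""
    return ["ssh", "-i", key_path, "-o", "BatchMode=yes",
            "root@%s" % host, remote_command]


# Hypothetical host and paths, for illustration only
fetch_cmd = build_fetch_key_command("master.example.com", "/tmp/merge_key")
list_cmd = build_ssh_command("master.example.com", "/tmp/merge_key",
                             "gnt-node list")
```

The resulting lists could be handed to ``subprocess`` once the user has
confirmed the merge. ``BatchMode=yes`` makes ssh fail instead of asking
for a password if the key is not accepted.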

Cluster merge
+++++++++++++

After initial contact we do the cluster merge:

1. Grab the list of nodes
2. On all nodes add our own ``id_dsa.pub`` key to ``authorized_keys``
3. Stop all instances running on the merging cluster
4. Disable ``ganeti-watcher`` as it tries to restart Ganeti daemons
5. Stop all Ganeti daemons on all merging nodes
6. Grab the ``config.data`` from the master of the merging cluster
7. Stop local ``ganeti-masterd``
8. Merge the config:

   1. Open our own cluster ``config.data``
   2. Open cluster ``config.data`` of the merging cluster
   3. Grab all nodes of the merging cluster
   4. Set ``master_candidate`` to false on all merging nodes
   5. Add the nodes to our own cluster ``config.data``
   6. Grab all the instances on the merging cluster
   7. Adjust the port if the instance has drbd layout:

      1. In ``logical_id`` (index 2)
      2. In ``physical_id`` (index 1 and 3)

   8. Add the instances to our own cluster ``config.data``

9. Start ``ganeti-masterd`` with ``--no-voting`` ``--yes-do-it``
10. ``gnt-node add --readd`` on all merging nodes
11. ``gnt-cluster redist-conf``
12. Restart ``ganeti-masterd`` normally
13. Enable ``ganeti-watcher`` again
14. Start all merging instances again
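
The port adjustment in the config merge is presumably needed because
both clusters allocated their DRBD ports independently, so the same
port may be in use on both sides. The sketch below shows the idea on
plain tuples, without the Ganeti libraries. The tuple layouts, the
starting port and the helper names are assumptions for illustration;
only the indices (2 in ``logical_id``, 1 and 3 in ``physical_id``) come
from the steps above.

```python
def allocate_port(used_ports, first_port=11000):
    """Return the lowest port >= first_port not in used_ports and reserve it."""
    port = first_port
    while port in used_ports:
        port += 1
    used_ports.add(port)
    return port


def remap_drbd_disk(logical_id, physical_id, used_ports):
    """Give a merging DRBD disk a port that is free in our own cluster."""
    port = allocate_port(used_ports)
    logical = list(logical_id)
    logical[2] = port                    # index 2 of logical_id
    physical = list(physical_id)
    physical[1] = physical[3] = port     # index 1 and 3 of physical_id
    return tuple(logical), tuple(physical)


# Ports already taken in our own cluster (illustrative values)
used = {11000, 11001}
# A disk of the merging cluster whose port collides with ours
logical_id = ("node3", "node4", 11000, 0, 0, "secret")
physical_id = ("192.0.2.3", 11000, "192.0.2.4", 11000, 0, "secret")

new_logical, new_physical = remap_drbd_disk(logical_id, physical_id, used)
print(new_logical[2])  # 11002
```

Reserving the port inside ``allocate_port`` ensures that two merging
disks never receive the same new port.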

Rollback
++++++++

Until we actually (re)add any nodes we can abort and roll back the merge
at any point. After merging the config, though, we have to restore the
backup copy of ``config.data`` (from another master candidate node). For
security reasons it's also a good idea to undo the ``id_dsa.pub``
distribution by going to every affected node and removing the
``id_dsa.pub`` key again. We also have to keep in mind that we have to
start the Ganeti daemons and the instances again.
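
One way to keep track of what has to be undone is to record an undo
action for every step and replay them in reverse order. The following
is a sketch only; the ``MergePlan`` class and the step names are
hypothetical and not part of Ganeti. Calling ``commit()`` marks the
point where nodes have been (re)added and rollback is no longer
possible.

```python
class MergePlan:
    """Run merge steps and remember how to undo them."""

    def __init__(self):
        self._undo = []
        self.committed = False

    def run(self, action, undo=None):
        """Execute a step and record its undo action, if it has one."""
        action()
        if undo is not None:
            self._undo.append(undo)

    def commit(self):
        """Mark the point of no return (nodes have been re-added)."""
        self.committed = True

    def rollback(self):
        """Undo all recorded steps, most recent first."""
        if self.committed:
            raise RuntimeError("nodes were already re-added")
        while self._undo:
            self._undo.pop()()


log = []
plan = MergePlan()
plan.run(lambda: log.append("add key"),
         undo=lambda: log.append("remove key"))
plan.run(lambda: log.append("stop daemons"),
         undo=lambda: log.append("start daemons"))
plan.rollback()
print(log)
# ['add key', 'stop daemons', 'start daemons', 'remove key']
```

Undoing in reverse order means the daemons are started again before the
key is removed, so the key is still available while it is needed.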

Verification
++++++++++++

Last but not least we should verify that the merge was successful.
Therefore we run ``gnt-cluster verify``, which ensures that the cluster
overall is in a healthy state. Additionally it's possible to compare
the list of instances/nodes with a list made prior to the merge to
make sure we didn't lose any data/instance/node.
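
The list comparison can be sketched as a simple set difference. The
helper names and the sample names below are made up for illustration.
The input is assumed to be one name per line, for example as printed by
``gnt-node list -o name --no-headers`` or the equivalent
``gnt-instance list`` call.

```python
def parse_names(output):
    """Turn command output with one name per line into a set of names."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def find_missing(before, after):
    """Return the names present before the merge but absent afterwards."""
    return sorted(set(before) - set(after))


# Instances of both clusters, recorded prior to the merge (sample data)
before = parse_names("web1.example.com\ndb1.example.com\nmail1.example.com\n")
# Instances reported by the merged cluster (sample data)
after = parse_names("web1.example.com\nmail1.example.com\n")

missing = find_missing(before, after)
print(missing)  # ['db1.example.com']
```

An empty result means that every node or instance recorded before the
merge is still known to the merged cluster.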

Appendix
++++++++

cluster-merge.py
^^^^^^^^^^^^^^^^

This script is used to merge the cluster config. It is a proof of
concept (POC) and might differ from the actual production code.

::

  #!/usr/bin/python

  import sys
  from ganeti import config
  from ganeti import constants

  # Our own cluster config and the config of the merging cluster
  # (path to its config.data is given as the first argument)
  c_mine = config.ConfigWriter(offline=True)
  c_other = config.ConfigWriter(sys.argv[1])

  # Counter passed as the second argument of AddNode/AddInstance
  fake_id = 0

  # Add all merging nodes, demoted from master candidate
  for node in c_other.GetNodeList():
    node_info = c_other.GetNodeInfo(node)
    node_info.master_candidate = False
    c_mine.AddNode(node_info, str(fake_id))
    fake_id += 1

  # Add all merging instances, giving DRBD disks a port from our cluster
  for instance in c_other.GetInstanceList():
    instance_info = c_other.GetInstanceInfo(instance)
    for dsk in instance_info.disks:
      if dsk.dev_type in constants.LDS_DRBD:
        port = c_mine.AllocatePort()
        # logical_id: port is at index 2
        logical_id = list(dsk.logical_id)
        logical_id[2] = port
        dsk.logical_id = tuple(logical_id)
        # physical_id: port is at index 1 and 3
        physical_id = list(dsk.physical_id)
        physical_id[1] = physical_id[3] = port
        dsk.physical_id = tuple(physical_id)
    c_mine.AddInstance(instance_info, str(fake_id))
    fake_id += 1


Feature changes
---------------