Revision ee414f1c

b/doc/admin.rst
  /cluster foo
  /instances/instance1 owner:bar

  
Autorepair
----------

  
The tool ``harep`` can be used to automatically fix some problems that are
present in the cluster.

  
It is mainly meant to be executed regularly and automatically as a cron job.
A single run does not immediately fix all the issues of the instances in the
cluster. Instead, each ``harep`` execution moves every affected instance one
step through a series of states, and every state performs a step towards the
resolution of the problem. This process goes on until the instance is brought
back to a healthy state, or until the tool determines that it is not able to
fix the instance and therefore marks it as being in a failure state.

  
Allowing harep to act on the cluster
++++++++++++++++++++++++++++++++++++

  
By default, ``harep`` checks the status of the cluster but is not allowed to
perform any modification. Modifications must be explicitly allowed through the
appropriate use of tags. Tags can be applied at various levels and can enable
different kinds of autorepair, as described below.

  
All the tags that authorize ``harep`` to perform modifications follow this
syntax::

  ganeti:watcher:autorepair:<type>

  
where ``<type>`` indicates the kind of intervention that can be performed. Every
possible value of ``<type>`` includes all the authorizations of the previous
one, plus its own. The possible values, in increasing order of severity, are:

  
- ``fix-storage`` allows a disk replacement or another operation that
  fixes the instance backend storage without affecting the instance
  itself. This can for example recover from a broken DRBD secondary, but
  risks data loss if something is wrong on the primary but the secondary
  was somehow recoverable.
- ``migrate`` allows an instance migration. This can recover from a
  drained primary, but can cause an instance crash in some cases (due to
  bugs).
- ``failover`` allows instance reboot on the secondary. This can recover
  from an offline primary, but the instance will lose its running state.
- ``reinstall`` allows disks to be recreated and an instance to be
  reinstalled. This can recover from both the primary and the secondary
  being offline, or from an offline primary in the case of non-redundant
  instances. It causes data loss.

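The cumulative nature of the repair types can be pictured with a short Python
sketch. It only illustrates the ordering described above and is not code taken
from ``harep``; the names ``REPAIR_TYPES`` and ``permits`` are hypothetical.

.. code-block:: python

  # Repair types in increasing order of severity, as listed above.
  REPAIR_TYPES = ["fix-storage", "migrate", "failover", "reinstall"]


  def permits(granted, required):
      """Return True if a tag of type `granted` authorizes a `required` repair.

      Each type includes the authorizations of all the less severe ones.
      """
      return REPAIR_TYPES.index(required) <= REPAIR_TYPES.index(granted)


  print(permits("failover", "migrate"))   # True
  print(permits("migrate", "reinstall"))  # False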
  
These autorepair tags can be applied to a cluster, a node group or an instance,
and will act where they are applied and on everything in the entity's sub-tree
(e.g. a tag applied to a node group will apply to all the instances contained in
that node group, but not to the rest of the cluster).

  
If there are multiple ``ganeti:watcher:autorepair:<type>`` tags in an
object (cluster, node group or instance), the least destructive tag
takes precedence. When tags are present on several objects, the nearest
tag wins. For example, if in a cluster with two instances, *I1* and
*I2*, *I1* has ``failover``, and the cluster itself has both
``fix-storage`` and ``reinstall``, *I1* will end up with ``failover``
and *I2* with ``fix-storage``.

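The precedence rules above can be sketched in a few lines of Python. This is
an illustrative model under the assumption that the tags of each object are
available as plain lists of strings; ``repair_types`` and ``effective_type``
are hypothetical helpers, not part of ``harep``.

.. code-block:: python

  SEVERITY = ["fix-storage", "migrate", "failover", "reinstall"]
  PREFIX = "ganeti:watcher:autorepair:"


  def repair_types(tags):
      """Extract the autorepair types from a collection of tag strings."""
      found = []
      for tag in tags:
          if tag.startswith(PREFIX) and tag[len(PREFIX):] in SEVERITY:
              found.append(tag[len(PREFIX):])
      return found


  def effective_type(instance_tags, group_tags, cluster_tags):
      """Return the repair type in force for an instance, or None."""
      # Nearest object first: instance, then node group, then cluster.
      for tags in (instance_tags, group_tags, cluster_tags):
          types = repair_types(tags)
          if types:
              # Within one object the least destructive type wins.
              return min(types, key=SEVERITY.index)
      return None


  # The example from the text: I1 carries its own tag, I2 inherits.
  cluster = [PREFIX + "fix-storage", PREFIX + "reinstall"]
  i1 = effective_type([PREFIX + "failover"], [], cluster)
  i2 = effective_type([], [], cluster)
  print(i1, i2)  # failover fix-storage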
  
Limiting harep
++++++++++++++

  
Sometimes it is useful to temporarily stop ``harep`` from performing its task
without disrupting its configuration, that is, without removing the
authorization tags. Suspend tags are provided for this purpose.

  
Suspend tags can be added to a cluster, a node group or an instance, and act on
the entire sub-tree of that entity. No operation will be performed by ``harep``
on the instances protected by a suspend tag. Their syntax is as follows::

  ganeti:watcher:autorepair:suspend[:<timestamp>]

  
If an object has multiple suspend tags, the form without a timestamp takes
precedence (permanent suspension); if all of the object's suspend tags have a
timestamp, the one with the highest timestamp takes precedence.

  
Tags with a timestamp will be automatically removed when the time indicated by
the timestamp has passed. Indefinite suspension tags have to be removed manually.

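As a rough illustration of how the suspend tags on a single object combine,
consider the following Python sketch. It assumes that the timestamp is a plain
number that can be compared with the current time, which the text above does
not specify; ``suspension`` and ``is_suspended`` are hypothetical names.

.. code-block:: python

  SUSPEND = "ganeti:watcher:autorepair:suspend"


  def suspension(tags):
      """Summarize the suspend tags found on a single object.

      Returns None when there is no suspend tag, the string "permanent" when
      a tag without timestamp is present, and otherwise the highest timestamp.
      """
      stamps = []
      for tag in tags:
          if tag == SUSPEND:
              return "permanent"
          if tag.startswith(SUSPEND + ":"):
              # Assumption: the timestamp is a plain number.
              stamps.append(float(tag[len(SUSPEND) + 1:]))
      return max(stamps) if stamps else None


  def is_suspended(tags, now):
      """Tell whether the object is still suspended at time `now`."""
      state = suspension(tags)
      if state is None:
          return False
      return state == "permanent" or state > now


  tags = [SUSPEND + ":100", SUSPEND + ":200"]
  print(suspension(tags))                         # 200.0
  print(is_suspended(tags, now=150))              # True
  print(is_suspended(tags + [SUSPEND], now=500))  # True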
  
Result reporting
++++++++++++++++

  
``harep`` reports the result of its actions both through its CLI and by
adding tags to the instances it operated on. These tags follow this syntax::

  ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs>

  
If this tag is present, a repair of type ``type`` has been performed on
the instance and was completed by ``timestamp``. The result is
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a
``+``-separated list of jobs that were executed for this repair.

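A monitoring script could split such a tag into its fields along the lines of
the following Python sketch. It assumes that no field contains a colon, and
the tag used in the example is made up for illustration; ``parse_result_tag``
and ``RepairResult`` are hypothetical names.

.. code-block:: python

  from collections import namedtuple

  RESULT_PREFIX = "ganeti:watcher:autorepair:result:"
  RepairResult = namedtuple("RepairResult", "type id timestamp result jobs")


  def parse_result_tag(tag):
      """Split a result tag into its fields, or return None if it does not match."""
      if not tag.startswith(RESULT_PREFIX):
          return None
      fields = tag[len(RESULT_PREFIX):].split(":")
      if len(fields) != 5:
          return None
      rtype, rid, timestamp, result, jobs = fields
      if result not in ("success", "failure", "enoperm"):
          return None
      # The jobs field is a "+"-separated list.
      return RepairResult(rtype, rid, timestamp, result, jobs.split("+"))


  # Made-up tag, for illustration only.
  tag = RESULT_PREFIX + "failover:example-id:1400000000:success:12+13"
  parsed = parse_result_tag(tag)
  print(parsed.result, parsed.jobs)  # success ['12', '13']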
  
An ``enoperm`` result is an error state due to permission problems. It
is returned when the repair cannot proceed because it would require
an operation that is not allowed by the ``ganeti:watcher:autorepair:<type>`` tag
that defines the instance's autorepair permissions.

  
NB: if an instance repair ends up in a failure state, it will not be touched
again by ``harep`` until it has been manually fixed by the system administrator
and the ``ganeti:watcher:autorepair:result:failure:*`` tag has been manually
removed.

  
Job operations
--------------
