Revision ee414f1c
b/doc/admin.rst | ||
---|---|---|
1410 | 1410 |
/cluster foo |
1411 | 1411 |
/instances/instance1 owner:bar |
1412 | 1412 |
|
1413 |
Autorepair |
|
1414 |
---------- |
|
1415 |
|
|
1416 |
The tool ``harep`` can be used to automatically fix some problems that are |
|
1417 |
present in the cluster. |
|
1418 |
|
|
1419 |
It is mainly meant to be regularly and automatically executed |
|
1420 |
as a cron job: when executed, it does |
|
1421 |
not immediately fix all the issues of the instances of the cluster, but it |
|
1422 |
cycles the instances through a series of states, one at every ``harep`` |
|
1423 |
execution. Every state performs a step towards the resolution of the problem. |
|
1424 |
This process goes on until the instance is brought back to the healthy state, |
|
1425 |
or the tool realizes that it is not able to fix the instance, and |
|
1426 |
therefore marks it as in failure state. |
|
1427 |
|
|
1428 |
Allowing harep to act on the cluster |
|
1429 |
++++++++++++++++++++++++++++++++++++ |
|
1430 |
|
|
1431 |
By default, ``harep`` checks the status of the cluster but it is not allowed to |
|
1432 |
perform any modification. Modification must be explicitly allowed by an |
|
1433 |
appropriate use of tags. Tagging can be applied at various levels, and can |
|
1434 |
enable different kinds of autorepair, as described below. |
|
1435 |
|
|
1436 |
All the tags that authorize ``harep`` to perform modifications follow this |
|
1437 |
syntax:: |
|
1438 |
|
|
1439 |
ganeti:watcher:autorepair:<type> |
|
1440 |
|
|
1441 |
where ``<type>`` indicates the kind of intervention that can be performed. Every |
|
1442 |
possible value of ``<type>`` includes at least all the authorizations of the |
|
1443 |
previous one, plus its own. The possible values, in increasing order of |
|
1444 |
severity, are: |
|
1445 |
|
|
1446 |
- ``fix-storage`` allows a disk replacement or another operation that |
|
1447 |
fixes the instance backend storage without affecting the instance |
|
1448 |
itself. This can, for example, recover from a broken DRBD secondary, but |
|
1449 |
risks data loss if something is wrong on the primary but the secondary |
|
1450 |
was somehow recoverable. |
|
1451 |
- ``migrate`` allows an instance migration. This can recover from a |
|
1452 |
drained primary, but can cause an instance crash in some cases (bugs). |
|
1453 |
- ``failover`` allows instance reboot on the secondary. This can recover |
|
1454 |
from an offline primary, but the instance will lose its running state. |
|
1455 |
- ``reinstall`` allows disks to be recreated and an instance to be |
|
1456 |
reinstalled. This can recover from the primary and secondary both being |
|
1457 |
offline, or from an offline primary in the case of non-redundant |
|
1458 |
instances. It causes data loss. |
|
1459 |
|
|
1460 |
These autorepair tags can be applied to a cluster, a nodegroup or an instance, |
|
1461 |
and will act where they are applied and on everything in the entity's sub-tree |
|
1462 |
(e.g. a tag applied to a nodegroup will apply to all the instances contained in |
|
1463 |
that nodegroup, but not to the rest of the cluster). |
|
1464 |
|
|
1465 |
If there are multiple ``ganeti:watcher:autorepair:<type>`` tags in an |
|
1466 |
object (cluster, node group or instance), the least destructive tag |
|
1467 |
takes precedence. When multiplicity happens across objects, the nearest |
|
1468 |
tag wins. For example, if in a cluster with two instances, *I1* and |
|
1469 |
*I2*, *I1* has ``failover``, and the cluster itself has both |
|
1470 |
``fix-storage`` and ``reinstall``, *I1* will end up with ``failover`` |
|
1471 |
and *I2* with ``fix-storage``. |
|
1472 |
|
|
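The precedence rules above can be illustrated with a minimal Python sketch. It is not ``harep``'s actual implementation: the function names ``least_destructive`` and ``effective_type`` are hypothetical, and the sketch assumes the tags of each object are available as plain lists of strings.

```python
from typing import List, Optional

PREFIX = "ganeti:watcher:autorepair:"
# Repair types in increasing order of severity, as listed in the text.
SEVERITY = ["fix-storage", "migrate", "failover", "reinstall"]


def least_destructive(tags: List[str]) -> Optional[str]:
    """Return the least destructive autorepair type among one object's tags.

    Tags that are not authorization tags (for example suspend or result
    tags) are ignored. Returns None if the object has no such tag.
    """
    types = [t[len(PREFIX):] for t in tags if t.startswith(PREFIX)]
    types = [t for t in types if t in SEVERITY]
    return min(types, key=SEVERITY.index) if types else None


def effective_type(instance_tags: List[str],
                   group_tags: List[str],
                   cluster_tags: List[str]) -> Optional[str]:
    """Nearest object wins: instance first, then node group, then cluster."""
    for tags in (instance_tags, group_tags, cluster_tags):
        found = least_destructive(tags)
        if found is not None:
            return found
    return None


# The example from the text: I1 is tagged failover, the cluster is tagged
# with both fix-storage and reinstall, and I2 has no tags of its own.
cluster = [PREFIX + "fix-storage", PREFIX + "reinstall"]
i1 = effective_type([PREFIX + "failover"], [], cluster)
i2 = effective_type([], [], cluster)
```

Here ``i1`` evaluates to ``failover`` (the nearest tag wins) and ``i2`` to ``fix-storage`` (the least destructive of the cluster's two tags).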
1473 |
Limiting harep |
|
1474 |
++++++++++++++ |
|
1475 |
|
|
1476 |
Sometimes it is useful to stop harep from performing its task temporarily, |
|
1477 |
and it is useful to be able to do so without disrupting its configuration, that |
|
1478 |
is, without removing the authorization tags. In order to do this, suspend tags |
|
1479 |
are provided. |
|
1480 |
|
|
1481 |
Suspend tags can be added to cluster, nodegroup or instances, and act on the |
|
1482 |
entity's entire sub-tree. No operation will be performed by ``harep`` on the |
|
1483 |
instances protected by a suspend tag. Their syntax is as follows:: |
|
1484 |
|
|
1485 |
ganeti:watcher:autorepair:suspend[:<timestamp>] |
|
1486 |
|
|
1487 |
If there are multiple suspend tags in an object, the form without timestamp |
|
1488 |
takes precedence (permanent suspension); or, if all object tags have a |
|
1489 |
timestamp, the one with the highest timestamp. |
|
1490 |
|
|
1491 |
Tags with a timestamp will be automatically removed when the time indicated by |
|
1492 |
the timestamp has passed. Indefinite suspension tags have to be removed manually. |
|
1493 |
|
|
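As a rough illustration of how the suspend tags of a single object combine, consider the following Python sketch. It is not taken from ``harep``: the helper names are hypothetical, and it assumes that ``<timestamp>`` is a number of seconds since the Unix epoch, which the text above does not state.

```python
import time
from typing import List, Optional

SUSPEND = "ganeti:watcher:autorepair:suspend"


def suspended_until(tags: List[str]) -> Optional[float]:
    """Return the effective suspension deadline for one object's tags.

    None means no suspend tag is present, float("inf") means a permanent
    suspension (a tag without timestamp), otherwise the highest timestamp.
    Assumption: timestamps are Unix epoch seconds.
    """
    deadline = None
    for tag in tags:
        if tag == SUSPEND:
            # The form without timestamp takes precedence over all others.
            return float("inf")
        if tag.startswith(SUSPEND + ":"):
            ts = float(tag[len(SUSPEND) + 1:])
            deadline = ts if deadline is None else max(deadline, ts)
    return deadline


def is_suspended(tags: List[str], now: Optional[float] = None) -> bool:
    """True if the tags suspend autorepair at time `now` (default: current time)."""
    now = time.time() if now is None else now
    deadline = suspended_until(tags)
    return deadline is not None and deadline > now
```

With the tags ``suspend:100`` and ``suspend:200`` on the same object, ``suspended_until`` returns ``200.0``; adding a bare ``suspend`` tag makes it return infinity.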
1494 |
Result reporting |
|
1495 |
++++++++++++++++ |
|
1496 |
|
|
1497 |
The ``harep`` tool reports the result of its actions both through its CLI, and by |
|
1498 |
adding tags to the instances it operated on. Such tags will follow the syntax |
|
1499 |
described below:: |
|
1500 |
|
|
1501 |
ganeti:watcher:autorepair:result:<type>:<id>:<timestamp>:<result>:<jobs> |
|
1502 |
|
|
1503 |
If this tag is present, a repair of type ``type`` has been performed on |
|
1504 |
the instance and has been completed by ``timestamp``. The result is |
|
1505 |
either ``success``, ``failure`` or ``enoperm``, and ``jobs`` is a |
|
1506 |
*+*-separated list of jobs that were executed for this repair. |
|
1507 |
|
|
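A monitoring script could split such a result tag into its fields along the following lines. This is only a sketch: ``RepairResult`` and ``parse_result_tag`` are hypothetical names, the sketch assumes that no field contains a colon, and it keeps ``<id>`` and ``<timestamp>`` as strings because their exact format is not specified here.

```python
from typing import List, NamedTuple

RESULT_PREFIX = "ganeti:watcher:autorepair:result:"
VALID_RESULTS = ("success", "failure", "enoperm")


class RepairResult(NamedTuple):
    repair_type: str
    repair_id: str
    timestamp: str
    result: str
    jobs: List[str]


def parse_result_tag(tag: str) -> RepairResult:
    """Split a result tag into its five colon-separated fields.

    Raises ValueError if the tag does not match the documented syntax.
    """
    if not tag.startswith(RESULT_PREFIX):
        raise ValueError("not an autorepair result tag: %r" % tag)
    fields = tag[len(RESULT_PREFIX):].split(":")
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d" % len(fields))
    repair_type, repair_id, timestamp, result, jobs = fields
    if result not in VALID_RESULTS:
        raise ValueError("unknown result: %r" % result)
    # The jobs field is a '+'-separated list; an empty field means no jobs.
    job_list = jobs.split("+") if jobs else []
    return RepairResult(repair_type, repair_id, timestamp, result, job_list)


# Made-up example values, for illustration only.
parsed = parse_result_tag(
    "ganeti:watcher:autorepair:result:failover:abc123:1700000000:success:41+42"
)
```

In this made-up example, ``parsed.result`` is ``success`` and ``parsed.jobs`` is ``["41", "42"]``.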
1508 |
An ``enoperm`` result is an error state due to permission problems. It |
|
1509 |
is returned when the repair cannot proceed because it would require performing |
|
1510 |
an operation that is not allowed by the ``ganeti:watcher:autorepair:<type>`` tag |
|
1511 |
that defines the instance's autorepair permissions. |
|
1512 |
|
|
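The permission check behind an ``enoperm`` result can be pictured as a comparison of severities, since every ``<type>`` includes the authorizations of the less severe ones. The sketch below only illustrates that idea; the function ``is_allowed`` is hypothetical and does not reflect how ``harep`` is actually implemented.

```python
from typing import Optional

# Repair types in increasing order of severity, as documented.
SEVERITY = ["fix-storage", "migrate", "failover", "reinstall"]


def is_allowed(granted: Optional[str], required: str) -> bool:
    """True if the granted autorepair type authorizes the required operation.

    `granted` is the effective type for the instance, or None when no
    authorization tag applies (no modification is allowed in that case).
    """
    if granted is None:
        return False
    return SEVERITY.index(required) <= SEVERITY.index(granted)


# An instance tagged with migrate whose repair would need a failover:
# under the rules above such a repair would be reported as enoperm.
blocked = not is_allowed("migrate", "failover")
```

Here ``blocked`` is ``True``, whereas ``is_allowed("failover", "fix-storage")`` returns ``True`` because ``failover`` includes the less severe authorizations.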
1513 |
NB: if an instance repair ends up in a failure state, it will not be touched |
|
1514 |
again by ``harep`` until it has been manually fixed by the system administrator |
|
1515 |
and the ``ganeti:watcher:autorepair:result:failure:*`` tag has been manually |
|
1516 |
removed. |
|
1413 | 1517 |
|
1414 | 1518 |
Job operations |
1415 | 1519 |
-------------- |