Revision a6ab004b
ID: a6ab004bb9e1789599bf904390e5a3f2bad00beb
LUDiagnoseOS: change locking and error handling
Since the “list OSes” call is exported via RAPI, this can be used pretty
easily to DoS the master daemon during long jobs.
The implementation of LUDiagnoseOS makes an RPC call to all nodes; we
lock nodes here in order to prevent node removal.
However, after closer examination, the worst case is:
- we get the list of nodes from the config
- another thread removes a node
- our RPC queries reach the removed node
At this point, if ganeti-noded is stopped or doesn't accept our queries,
the RPC call will return failed, and in the current implementation all
OSes will become invalid.
If we change the ‘failed RPC’ handling to ignore such nodes, this allows
us to both remove locking, and to handle transient RPC failures better
(not invalidating all OSes).
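The idea of ignoring failed nodes can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from this patch: the NodeResult class, its "failed" and "data" fields, and both function names are hypothetical stand-ins for whatever the RPC layer actually returns, and the rule "an OS is valid if every responding node reports it" is a simplification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeResult:
    """Hypothetical stand-in for a per-node RPC result (not the Ganeti class)."""
    failed: bool = False
    data: List[str] = field(default_factory=list)  # OS names reported by the node


def collect_os_per_node(results: Dict[str, NodeResult]) -> Dict[str, List[str]]:
    """Map each OS name to the nodes that reported it, skipping failed nodes."""
    os_to_nodes: Dict[str, List[str]] = {}
    for node_name in sorted(results):
        result = results[node_name]
        if result.failed:
            # Ignore the node entirely instead of invalidating every OS.
            continue
        for os_name in set(result.data):
            os_to_nodes.setdefault(os_name, []).append(node_name)
    return os_to_nodes


def valid_oses(results: Dict[str, NodeResult]) -> List[str]:
    """Return OSes reported by every node that answered the RPC (assumed rule)."""
    responding = [name for name, res in results.items() if not res.failed]
    per_os = collect_os_per_node(results)
    return sorted(name for name, nodes in per_os.items()
                  if len(nodes) == len(responding))
```

With this shape, a node that was removed or is temporarily unreachable simply drops out of the result, which is why the node list no longer needs to be locked for the duration of the call.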
This patch does both these things, with a single drawback: in gnt-os
diagnose, the down nodes do not appear at all. I think this is a small
drawback, and the alternative is to add them with status failed; this
works (3-line patch), but then the output of “list” and “diagnose” will
no longer be consistent. As such, my proposal is to not list the nodes.
Reviewed-by: ultrotter