=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. The status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

The choice of JSON is obvious, as we already depend on it in Ganeti and
thus don't need to add extra libraries to use it, as opposed to what
would happen for XML or some other markup format.

Location of agent report
------------------------

The report will be available from all nodes, and will cover all
node-local resources. This allows more real-time information to be
available, at the cost of querying all nodes.

Information reported
--------------------

The monitoring agent system will report on the following basic information:

- Instance status
- Instance disk status
- Status of storage for instances
- Ganeti daemons status, CPU usage, memory footprint
- Hypervisor resources report (memory, CPU, network interfaces)
- Node OS resources report (memory, CPU, network interfaces)
- Information from a plugin system

Instance status
+++++++++++++++

At the moment each node knows which instances are running on it and
which instances it is primary for, but not the reason why an instance
might not be running. On the other hand we don't want to distribute
full instance "admin" status information to all nodes, because of the
performance impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used, for
  example, to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti will
  just use the information it has about the source of the request: for
  example a cli shutdown operation will have "cli:shutdown" as a reason,
  and a cli failover operation will have "cli:failover". Operations
  coming from the remote API will use "rapi" instead of "cli". Of course
  setting a real site-specific reason is still preferred.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on are passed
  to them. They will then export the new expected instance status,
  together with the associated reason and object version, to the status
  report system, which will then export them.

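The fallback rule for the reason can be sketched as follows (a minimal
illustration in Python; the actual opcode-level field and its handling
are not specified by this document):

```python
def op_reason(source, operation, explicit_reason=None):
    """Return the reason to record for an instance status change.

    Sketch of the fallback rule described above: an explicitly
    passed reason wins, otherwise "<source>:<operation>" is used,
    e.g. "cli:shutdown" or "rapi:failover".
    """
    if explicit_reason is not None:
        return explicit_reason
    return "%s:%s" % (source, operation)
```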
Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the object version to
understand the freshness of their data even in the absence of atomic
cross-node reporting: for example, if they see an instance "up" on a
node after seeing it running on a previous one, they can compare these
values to understand which data is freshest, and repoll the "older"
node. Of course if they keep seeing this status it represents an error
(either an instance continuously "flapping" between nodes, or an
instance constantly up on more than one node), which should be reported
and acted upon.

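The version-based freshness check could be as simple as the following
sketch (the "config_version" field name is illustrative, not part of
any final report format):

```python
def freshest(status_a, status_b):
    """Of two status records for the same instance, as seen from two
    different nodes, return the one produced against the newer config
    object version. "config_version" is an assumed field name."""
    if status_a["config_version"] >= status_b["config_version"]:
        return status_a
    return status_b
```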
The instance status will be present on each node, for the instances it
is primary for, and will contain at least:

- The instance name
- The instance UUID (stable on name change)
- The instance running status (up or down)
- The timestamp of the last known change
- The timestamp of when the status was last checked (see caching, below)
- The last known reason for change, if any

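A record with the fields above might look like the following sketch
(field names and layout are illustrative only; the actual format is
deferred to the "Format of the report" section):

```python
import json
import time

def instance_status(name, uuid, running, reason=None,
                    mtime=None, check_time=None):
    """Build one instance status record with the minimal field set
    listed above. All field names here are placeholders."""
    now = time.time()
    return {
        "name": name,
        "uuid": uuid,                       # stable on rename
        "status": "up" if running else "down",
        "mtime": mtime if mtime is not None else now,
        "check_time": check_time if check_time is not None else now,
        "reason": reason,                   # may be None
    }

# Such records serialize directly to the JSON the report is encoded in:
record = instance_status("web1", "some-uuid", True, reason="cli:startup")
encoded = json.dumps(record)
```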
More information about all the fields and their types will be available
in the "Format of the report" section.

Note that as soon as a node knows it's not the primary anymore for an
instance it will stop reporting status for it: this means the instance
will either disappear, if it has been deleted, or appear on another
node, if it has been moved.

Instance disk status
++++++++++++++++++++

As with the instance status, Ganeti currently has only partial
information about its instance disks: in particular each node is
unaware of the disk-to-instance mapping, which exists only on the
master.

For this design doc we plan to fix this by changing all RPCs that
create backend storage, or that put an already existing one in use, so
that the relevant instance is passed to the node. The node can then
export this to the status reporting tool.

While we haven't implemented these RPC changes yet, we'll use confd to
fetch this information in the data collector.

Since Ganeti supports many types of disks for instances (drbd, rbd,
plain, file) we will export both a "generic" status, which will work
for any type of disk and will be very opaque (at minimum just a
"healthy" or "error" state, plus perhaps some human readable comment),
and a "per-type" status, which will explain more about the internal
details but will not be compatible between different storage types (and
will for example export the drbd connection status, sync progress, and
so on).

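For example, a disk status entry could combine the two levels like this
(a sketch; the drbd sub-field names are assumptions, not the final
format):

```python
def disk_status(state, comment=None, per_type=None):
    """Combine the generic, storage-agnostic state ("healthy" or
    "error") with an optional per-type detail blob that consumers
    may ignore if they don't know the storage type."""
    assert state in ("healthy", "error")
    status = {"state": state, "comment": comment}
    if per_type is not None:
        status["per_type"] = per_type
    return status

# A drbd disk might add connection and sync details (assumed names):
drbd_disk = disk_status("healthy", per_type={
    "type": "drbd",
    "cstate": "Connected",
    "sync_percent": 100.0,
})
```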
Status of storage for instances
+++++++++++++++++++++++++++++++

The node will also report on all storage types it knows about for the
current node (this is right now hardcoded to the enabled storage types,
and in the future will be tied to the enabled storage pools for the
nodegroup). For this kind of information, too, we will report both a
generic health status (healthy or error) for each type of storage, and
some more generic statistics (free space, used space, total visible
space). In addition, type-specific information can be exported: for
example, in case of error, the nature of the error can be disclosed as
type-specific information. Examples of these are "backend pv
unavailable" for lvm storage, "unreachable" for network-based storage
or "filesystem error" for filesystem-based implementations.

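A per-storage-type entry carrying the generic statistics above might be
built as follows (field names are illustrative):

```python
def storage_report(storage_type, healthy, total_space, used_space,
                   details=None):
    """One entry of the node storage report: a generic health state,
    the generic space statistics, and optional type-specific details
    (e.g. "backend pv unavailable" for lvm)."""
    return {
        "type": storage_type,
        "state": "healthy" if healthy else "error",
        "total_space": total_space,
        "used_space": used_space,
        "free_space": total_space - used_space,
        "details": details,
    }
```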
Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons: this
includes memory usage, uptime and CPU usage. This should allow
identifying possible problems with the Ganeti system itself: for
example memory leaks, crashes and high resource utilization should be
evident from analyzing this information.

Ganeti daemons will also be able to export extra internal information
to the status reporting, through the plugin system (see below).

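On Linux, one way to obtain a daemon's memory footprint is procfs; this
is only an illustration of the kind of data collection involved, not a
prescribed mechanism:

```python
import os

def parse_statm(statm, page_size):
    """The second field of /proc/<pid>/statm is the resident set
    size in pages; convert it to bytes."""
    return int(statm.split()[1]) * page_size

def daemon_memory(pid, page_size=None):
    """Return the resident set size, in bytes, of the process with
    the given pid, by reading /proc/<pid>/statm (Linux only)."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    with open("/proc/%d/statm" % pid) as f:
        return parse_statm(f.read(), page_size)
```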
Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that is sometimes
different from the one the OS sees (for example in Xen the node OS,
running as Dom0, has access to only part of those resources). In this
section we'll report all the information we can in a "non hypervisor
specific" way. Each hypervisor can then add extra specific information
that is not generic enough to be abstracted.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system. This includes the number
and status of CPUs, memory, filesystems and network interfaces, as well
as the versions of the components Ganeti interacts with (Linux, drbd,
hypervisor, etc.).

Note that we won't go into any hardware-specific details (e.g. querying
a node's RAID is outside the scope of this, and can be implemented as a
plugin), but we can easily just report the information above, since
it's standard enough across all systems.

Plugin system
+++++++++++++

The monitoring system will be equipped with a plugin system through
which specific local information can be exported. The plugins will take
the form of either scripts whose output will be inserted in the report,
plain text files which will be inserted into the report, or local unix
or network sockets from which the information has to be read. This
should allow the most flexibility for implementing an efficient system,
while keeping it as simple as possible.

The plugin system is expected to be used by local installations to
export any installation-specific information that they want to be
monitored, about either hardware or software on their systems.

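The script-based flavour of the plugin system could work roughly as
sketched below: each executable in a plugin directory is run and its
standard output collected under its name. The directory layout and the
"stdout goes into the report verbatim" contract are assumptions of this
sketch, not decisions of this document.

```python
import os
import subprocess

def run_plugins(plugin_dir, timeout=30):
    """Run every executable file in plugin_dir and collect its
    stdout, keyed by plugin name. Missing directories simply yield
    an empty result, so the agent degrades gracefully."""
    results = {}
    if not os.path.isdir(plugin_dir):
        return results
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            proc = subprocess.run([path], capture_output=True,
                                  text=True, timeout=timeout)
            results[name] = proc.stdout
    return results
```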
Format of the query
-------------------

The query will be an HTTP GET request on a particular port. At the
beginning it will only be possible to query the full status report.

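From a client's point of view the query could look as follows; note
that the port number and URL path below are placeholders, since neither
is fixed by this document:

```python
import json
import urllib.request

# Assumed values, not part of this design:
DEFAULT_PORT = 1815
REPORT_PATH = "/report"

def report_url(node, port=DEFAULT_PORT):
    """Build the URL of a node's full status report."""
    return "http://%s:%d%s" % (node, port, REPORT_PATH)

def fetch_report(node, port=DEFAULT_PORT, timeout=10):
    """GET and JSON-decode the full status report from one node."""
    with urllib.request.urlopen(report_url(node, port),
                                timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```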
Format of the report
--------------------

TBD (this part needs to be completed with the format of the JSON and
the types of the various variables exported, as they get evaluated and
decided).

Data collectors
---------------

In order to ease testing, as well as to make it simple to reuse this
subsystem, it will be possible to run just the "data collectors" on
each node without passing through the agent daemon. Each data collector
will report specific data about its subsystem and will be documented
separately.

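The collector contract could be as small as a name plus a method
returning JSON-serializable data; a standalone runner then just prints
one collector's output without involving the daemon. All names here are
illustrative, not the final interface:

```python
import json

class DataCollector:
    """Minimal collector contract (a sketch)."""
    name = "unnamed"

    def collect(self):
        """Return a JSON-serializable report for this subsystem."""
        raise NotImplementedError

class InstanceCountCollector(DataCollector):
    """Toy example: a real collector would query the hypervisor."""
    name = "instance-count"

    def __init__(self, instances):
        self.instances = instances

    def collect(self):
        return {"count": len(self.instances)}

def run_standalone(collector):
    """Run one collector without the agent daemon, for testing."""
    return json.dumps({collector.name: collector.collect()})
```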
Mode of operation
-----------------

In order to be able to report information fast, the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which
will be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different speeds.
These will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation specific, and while we'll
try to provide sensible defaults, it will be configurable. The first
two, instead, we can use adaptively to query a certain resource faster
or slower depending on those two parameters.

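One possible adaptive rule, purely as an illustration of the idea:
cheap, frequently-changing resources are polled more often, within
configurable bounds.

```python
def poll_interval(base, query_cost, change_rate,
                  min_interval=1.0, max_interval=300.0):
    """Sketch of an adaptive polling period.

    base:        default period in seconds
    query_cost:  how long the query itself takes, in seconds
    change_rate: observed changes per polling period (0.0 = static)

    Expensive queries stretch the period; frequent changes shrink
    it. The formula itself is illustrative, not prescribed.
    """
    interval = base * (1.0 + query_cost) / (1.0 + change_rate)
    return max(min_interval, min(max_interval, interval))
```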
Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon.
In the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out that it saves resources and
doesn't impact functionality.

The libekg library should be looked at for easily providing metrics in
JSON format.

Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (eg. for drbd and instance status)
- initial daemon for exporting data
- RPC updates for instance status reasons and disk to instance mapping
- more data collectors
- cache layer for the daemon (if needed)

Future work
===========

As a future step it can be useful to "centralize" all this reporting
data in a single place. This could be, for example, just the master
node, or all the master candidates. We will evaluate doing this after
the first node-local version has been developed and tested.

Another possible change is replacing the "read-only" RPCs with queries
to the agent system, thus having only one way of collecting information
from the nodes, both for a monitoring system and for Ganeti itself.

One extra feature we may need is a way to query for only sub-parts of
the report (eg. instance status only). This can be done by passing
arguments to the HTTP GET, which will be defined when we get to this
functionality.

Finally, the :doc:`autorepair system design <design-autorepair>` can be
expanded to use the monitoring agent system as a source of information
to decide which repairs it can perform.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: