=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system, that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. Such a status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

The choice of JSON is obvious, as we already depend on it in Ganeti and
thus don't need to add extra libraries to use it, as opposed to what
would happen for XML or some other markup format.
30 |
|
31 |
Location of agent report |
32 |
------------------------ |
33 |
|
34 |
The report will be available from all nodes, and be concerned for all |
35 |
node-local resources. This allows more real-time information to be |
36 |
available, at the cost of querying all nodes. |
37 |
|
38 |
Information reported |
39 |
-------------------- |
40 |
|
41 |
The monitoring agent system will report on the following basic information: |
42 |
|
43 |
- Instance status |
44 |
- Instance disk status |
45 |
- Status of storage for instances |
46 |
- Ganeti daemons status, CPU usage, memory footprint |
47 |
- Hypervisor resources report (memory, CPU, network interfaces) |
48 |
- Node OS resources report (memory, CPU, network interfaces) |
49 |
- Node OS CPU load average report |
50 |
- Information from a plugin system |
51 |
|
.. _monitoring-agent-format-of-the-report:

Format of the report
--------------------

The report will be in JSON format, and it will present an array
of report objects.
Each report object will be produced by a specific data collector.
Each report object includes some mandatory fields, to be provided by all
the data collectors:

``name``
  The name of the data collector that produced this part of the report.
  It is supposed to be unique inside a report.

``version``
  The version of the data collector that produces this part of the
  report. Built-in data collectors (as opposed to those implemented as
  plugins) should have "B" as the version number.

``format_version``
  The format of what is represented in the "data" field for each data
  collector might change over time. Every time this happens, the
  format_version should be changed, so that whoever reads the report knows
  what format to expect, and how to correctly interpret it.

``timestamp``
  The time when the reported data were gathered. It has to be expressed
  in nanoseconds since the unix epoch (0:00:00 January 01, 1970). If not
  enough precision is available (or needed) it can be padded with
  zeroes. If a report object needs multiple timestamps, it can add more
  and/or override this one inside its own "data" section.

``category``
  A collector can belong to a given category of collectors (e.g.: storage
  collectors, daemon collector). This means that it will have to provide a
  minimum set of prescribed fields, as documented for each category.
  This field will contain the name of the category the collector belongs to,
  if any, or just the ``null`` value.

``kind``
  Two kinds of collectors are possible:
  `Performance reporting collectors`_ and `Status reporting collectors`_.
  The respective paragraphs will describe them and the value of this field.

``data``
  This field contains all the data generated by the specific data collector,
  in its own independently defined format. The monitoring agent could check
  this syntactically (according to the JSON specifications) but not
  semantically.

Here follows a minimal example of a report::

  [
  {
      "name" : "TheCollectorIdentifier",
      "version" : "1.2",
      "format_version" : 1,
      "timestamp" : 1351607182000000000,
      "category" : null,
      "kind" : 0,
      "data" : { "plugin_specific_data" : "go_here" }
  },
  {
      "name" : "AnotherDataCollector",
      "version" : "B",
      "format_version" : 7,
      "timestamp" : 1351609526123854000,
      "category" : "storage",
      "kind" : 1,
      "data" : { "status" : { "code" : 1,
                              "message" : "Error on disk 2"
                            },
                 "plugin_specific" : "data",
                 "some_late_data" : { "timestamp" : 1351609526123942720,
                                      ...
                                    }
               }
  }
  ]

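A consumer of the report can check the mandatory fields mechanically. The
following sketch shows one way to do so; the helper name and the exact error
handling are our own, not part of Ganeti:

```python
# Minimal validation of the mandatory report fields listed above.
# The helper name and error handling are illustrative, not part of Ganeti.
MANDATORY_FIELDS = ("name", "version", "format_version", "timestamp",
                    "category", "kind", "data")

def validate_report(report):
    """Check the mandatory fields of every report object in the array."""
    seen_names = set()
    for obj in report:
        missing = [f for f in MANDATORY_FIELDS if f not in obj]
        if missing:
            raise ValueError("missing fields: %s" % ", ".join(missing))
        if obj["kind"] not in (0, 1):
            raise ValueError("unknown collector kind: %r" % obj["kind"])
        if obj["name"] in seen_names:
            # collector names are supposed to be unique inside a report
            raise ValueError("duplicate collector name: %s" % obj["name"])
        seen_names.add(obj["name"])

report = [{"name": "TheCollectorIdentifier", "version": "1.2",
           "format_version": 1, "timestamp": 1351607182000000000,
           "category": None, "kind": 0,
           "data": {"plugin_specific_data": "go_here"}}]
validate_report(report)  # a well-formed report raises no exception
```
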
Performance reporting collectors
++++++++++++++++++++++++++++++++

These collectors only provide data about some component of the system,
without giving any interpretation of their meaning.

The value of the ``kind`` field of the report will be ``0``.

Status reporting collectors
+++++++++++++++++++++++++++

These collectors will provide information about the status of some
component of Ganeti, or of some component managed by Ganeti.

The value of their ``kind`` field will be ``1``.

The rationale behind this kind of collector is that there are some situations
where exporting data about the underlying subsystems would expose potential
issues. But if Ganeti itself is able (and going) to fix the problem, conflicts
might arise between Ganeti and something or somebody else trying to fix the
same problem.
Also, some external monitoring systems might not be aware of the internals of
a particular subsystem (e.g.: DRBD) and might only exploit the high level
response of its data collector, alerting an administrator if anything is
wrong. Still, completely hiding the underlying data is not a good idea, as it
might still be of use in some cases. So status reporting collectors will
provide two output modes: one exporting only high level information about the
status, and one also exporting all the data they gathered.
The default output mode will be the status-only one. The verbose output mode,
providing all the data, can be selected through a command line parameter (for
stand-alone data collectors) or through the HTTP request to the monitoring
agent (when collectors are executed as part of it).

When exporting just the status, each status reporting collector will provide,
in its ``data`` section, at least the following field:

``status``
  summarizes the status of the component being monitored and consists of two
  subfields:

  ``code``
    It assumes a numeric value, encoded in such a way to allow using a bitset
    to easily distinguish which states are currently present in the whole
    cluster. If the bitwise OR of all the ``status`` fields is 0, the cluster
    is completely healthy.
    The status codes are as follows:

    ``0``
      The collector can determine that everything is working as
      intended.

    ``1``
      Something is temporarily wrong, but it is being automatically fixed by
      Ganeti.
      There is no need of external intervention.

    ``2``
      The collector has failed to understand whether the status is good or
      bad. Further analysis is required. Interpret this status as a
      potentially dangerous situation.

    ``4``
      The collector can determine that something is wrong and Ganeti has no
      way to fix it autonomously. External intervention is required.

  ``message``
    A message to better explain the reason of the status.
    The exact format of the message string is data collector dependent.

    The field is mandatory, but the content can be an empty string if the
    ``code`` is ``0`` (working as intended) or ``1`` (being fixed
    automatically).

    If the status code is ``2``, the message should explain why it was not
    possible to determine a proper status.
    If the status code is ``4``, the message should specify what has gone
    wrong.

The ``data`` section will also contain all the fields describing the gathered
data, according to a collector-specific format.

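As a sketch of the bitset encoding described above, ORing the ``code`` values
of all status reports collapses them into a single cluster-health value (the
function name is ours, not part of Ganeti):

```python
# Sketch of the bitset aggregation described above: the bitwise OR of all
# ``status.code`` values tells which states are present cluster-wide.
# A result of 0 means the whole cluster is healthy.
def aggregate_status(codes):
    cluster = 0
    for code in codes:
        cluster |= code
    return cluster

# One collector reports "being fixed" (1), another "needs intervention" (4):
combined = aggregate_status([0, 1, 0, 4])
assert combined == 5                   # both bit 0 and bit 2 are set
assert combined & 4                    # intervention required somewhere
assert aggregate_status([0, 0]) == 0   # completely healthy cluster
```
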
Instance status
+++++++++++++++

At the moment each node knows which instances are running on it and which
instances it is primary for, but not the reason why an instance might not
be running. On the other hand, we don't want to distribute full instance
"admin" status information to all nodes, because of the performance
impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used for
  example to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti will
  just use the information it has about the source of the request.
  This reason information will be structured according to the
  :doc:`Ganeti reason trail <design-reason-trail>` design document.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on is passed to
  them. They will then export the new expected instance status, together
  with the associated reason and object version, to the status report
  system, which will in turn export them.

Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the timestamp to
understand the freshness of their data even in the absence of atomic
cross-node reporting: for example, if they see an instance "up" on a node
after seeing it running on a previous one, they can compare these values
to understand which data is freshest, and repoll the "older" node. Of
course, if they keep seeing this status it represents an error (either an
instance continuously "flapping" between nodes, or an instance constantly
up on more than one), which should be reported and acted upon.

The instance status will be reported on each node, for the instances it is
primary for, and the ``data`` section of the report will contain a list
of instances, named ``instances``, with at least the following fields for
each instance:

``name``
  The name of the instance.

``uuid``
  The UUID of the instance (stable on name change).

``admin_state``
  The status of the instance (up/down/offline) as requested by the admin.

``actual_state``
  The actual status of the instance. It can be ``up``, ``down``, or
  ``hung`` if the instance is up but it appears to be completely stuck.

``uptime``
  The uptime of the instance (if it is up, ``null`` otherwise).

``mtime``
  The timestamp of the last known change to the instance state.

``state_reason``
  The last known reason for the state change of the instance, described
  according to the JSON representation of a reason trail, as detailed in the
  :doc:`reason trail design document <design-reason-trail>`.

``status``
  It represents the status of the instance, and its format is the same as that
  of the ``status`` field of `Status reporting collectors`_.

Each hypervisor should provide its own instance status data collector,
possibly with the addition of more specific fields.
The ``category`` field of all of them will be ``instance``.
The ``kind`` field will be ``1``.

Note that as soon as a node knows it's not the primary anymore for an
instance, it will stop reporting status for it: this means the instance
will either disappear, if it has been deleted, or appear on another
node, if it has been moved.

The ``code`` of the ``status`` field of the report of the Instance status data
collector will be:

``0``
  if ``status`` is ``0`` for all the instances it is reporting about.

``1``
  otherwise.

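The rule above can be sketched as follows (the function name is ours, not part
of Ganeti):

```python
# Sketch of the status code rule stated above: the Instance status
# collector reports 0 only if every instance it covers has status code 0.
def instance_collector_code(instances):
    return 0 if all(i["status"]["code"] == 0 for i in instances) else 1

healthy = [{"name": "web1", "status": {"code": 0, "message": ""}}]
degraded = healthy + [{"name": "db1",
                       "status": {"code": 1, "message": "being fixed"}}]
assert instance_collector_code(healthy) == 0
assert instance_collector_code(degraded) == 1
```
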
Storage collectors
++++++++++++++++++

The storage collectors will be a series of data collectors
that will gather data about storage for the current node. The collection
will be performed at different granularity and abstraction levels, from
the physical disks, to partitions, logical volumes, and to the specific
storage types used by Ganeti itself (drbd, rbd, plain, file).

The ``name`` of each of these collectors will reflect the storage type each
of them refers to.

The ``category`` field of these collectors will be ``storage``.

The ``kind`` field will depend on the specific collector.

Each ``storage`` collector's ``data`` section will provide collector-specific
fields.

The various storage collectors will provide keys to join the data they
provide, in order to allow the user to get a better understanding of the
system, e.g. through device names or instance names.

Diskstats collector
*******************

This storage data collector will gather information about the status of the
disks installed in the system, as listed in the ``/proc/diskstats`` file.
This means that not only physical hard drives, but also ramdisks and
loopback devices, will be listed.

Its ``kind`` in the report will be ``0`` (`Performance reporting collectors`_).

Its ``category`` field in the report will contain the value ``storage``.

When executed in verbose mode, the ``data`` section of the report of this
collector will be a list of items, each representing one disk and providing
the following fields:

``major``
  The major number of the device.

``minor``
  The minor number of the device.

``name``
  The name of the device.

``readsNum``
  This is the total number of reads completed successfully.

``mergedReads``
  Reads which are adjacent to each other may be merged for efficiency. Thus
  two 4K reads may become one 8K read before it is ultimately handed to the
  disk, and so it will be counted (and queued) as only one I/O. This field
  specifies how often this was done.

``secRead``
  This is the total number of sectors read successfully.

``timeRead``
  This is the total number of milliseconds spent by all reads.

``writes``
  This is the total number of writes completed successfully.

``mergedWrites``
  Writes which are adjacent to each other may be merged for efficiency. Thus
  two 4K writes may become one 8K write before it is ultimately handed to the
  disk, and so it will be counted (and queued) as only one I/O. This field
  specifies how often this was done.

``secWritten``
  This is the total number of sectors written successfully.

``timeWrite``
  This is the total number of milliseconds spent by all writes.

``ios``
  The number of I/Os currently in progress.
  The only field that should go to zero, it is incremented as requests are
  given to the appropriate struct request_queue and decremented as they
  finish.

``timeIO``
  The number of milliseconds spent doing I/Os. This field increases so long
  as field ``ios`` is nonzero.

``wIOmillis``
  The weighted number of milliseconds spent doing I/Os.
  This field is incremented at each I/O start, I/O completion, I/O merge,
  or read of these stats, by the number of I/Os in progress (field ``ios``)
  times the number of milliseconds spent doing I/O since the last update of
  this field. This can provide an easy measure of both I/O completion time
  and the backlog that may be accumulating.

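Mapping one ``/proc/diskstats`` line to the field names above can be sketched
as follows (the parser and its sample input are illustrative; the column order
is the one used by Linux for the first fourteen columns):

```python
# Sketch: map the columns of one /proc/diskstats line to the fields above.
# Column order: major minor name, then the per-device I/O counters.
FIELDS = ("major", "minor", "name", "readsNum", "mergedReads", "secRead",
          "timeRead", "writes", "mergedWrites", "secWritten", "timeWrite",
          "ios", "timeIO", "wIOmillis")

def parse_diskstats_line(line):
    parts = line.split()
    # the device name (column 3) stays a string, everything else is numeric
    values = [p if i == 2 else int(p) for i, p in enumerate(parts)]
    return dict(zip(FIELDS, values[:len(FIELDS)]))

sample = "8 0 sda 120 30 5000 800 90 10 4000 600 0 1200 1400"
stats = parse_diskstats_line(sample)
assert stats["name"] == "sda" and stats["readsNum"] == 120
assert stats["ios"] == 0   # no I/Os currently in progress
```
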
Logical Volume collector
************************

This data collector will gather information about the attributes of logical
volumes present in the system.

Its ``kind`` in the report will be ``0`` (`Performance reporting collectors`_).

Its ``category`` field in the report will contain the value ``storage``.

The ``data`` section of the report of this collector will be a list of items,
each representing one logical volume and providing the following fields:

``uuid``
  The UUID of the logical volume.

``name``
  The name of the logical volume.

``attr``
  The attributes of the logical volume.

``major``
  Persistent major number or -1 if not persistent.

``minor``
  Persistent minor number or -1 if not persistent.

``kernel_major``
  Currently assigned major number or -1 if LV is not active.

``kernel_minor``
  Currently assigned minor number or -1 if LV is not active.

``size``
  Size of LV in bytes.

``seg_count``
  Number of segments in LV.

``tags``
  Tags, if any.

``modules``
  Kernel device-mapper modules required for this LV, if any.

``vg_uuid``
  Unique identifier of the volume group.

``vg_name``
  Name of the volume group.

``segtype``
  Type of LV segment.

``seg_start``
  Offset within the LV to the start of the segment in bytes.

``seg_start_pe``
  Offset within the LV to the start of the segment in physical extents.

``seg_size``
  Size of the segment in bytes.

``seg_tags``
  Tags for the segment, if any.

``seg_pe_ranges``
  Ranges of Physical Extents of underlying devices in lvs command line format.

``devices``
  Underlying devices used, with starting extent numbers.

``instance``
  The name of the instance this LV is used by, or ``null`` if it was not
  possible to determine it.

DRBD status
***********

This data collector will run only on nodes where DRBD is actually
present, and it will gather information about DRBD devices.

Its ``kind`` in the report will be ``1`` (`Status reporting collectors`_).

Its ``category`` field in the report will contain the value ``storage``.

When executed in verbose mode, the ``data`` section of the report of this
collector will provide the following fields:

``versionInfo``
  Information about the DRBD version number, given by a combination of
  any (but at least one) of the following fields:

  ``version``
    The DRBD driver version.

  ``api``
    The API version number.

  ``proto``
    The protocol version.

  ``srcversion``
    The version of the source files.

  ``gitHash``
    Git hash of the source files.

  ``buildBy``
    Who built the binary, and, optionally, when.

``device``
  A list of structures, each describing a DRBD device (a minor) and containing
  the following fields:

  ``minor``
    The device minor number.

  ``connectionState``
    The state of the connection. If it is "Unconfigured", all the following
    fields are not present.

  ``localRole``
    The role of the local resource.

  ``remoteRole``
    The role of the remote resource.

  ``localState``
    The status of the local disk.

  ``remoteState``
    The status of the remote disk.

  ``replicationProtocol``
    The replication protocol being used.

  ``ioFlags``
    The input/output flags.

  ``perfIndicators``
    The performance indicators. This field will contain the following
    sub-fields:

    ``networkSend``
      KiB of data sent on the network.

    ``networkReceive``
      KiB of data received from the network.

    ``diskWrite``
      KiB of data written on the local disk.

    ``diskRead``
      KiB of data read from the local disk.

    ``activityLog``
      Number of updates of the activity log.

    ``bitMap``
      Number of updates to the bitmap area of the metadata.

    ``localCount``
      Number of open requests to the local I/O subsystem.

    ``pending``
      Number of requests sent to the partner but not yet answered.

    ``unacknowledged``
      Number of requests received by the partner but still to be answered.

    ``applicationPending``
      Number of block input/output requests forwarded to DRBD but not yet
      answered.

    ``epochs``
      (Optional) Number of epoch objects. Not provided by all DRBD versions.

    ``writeOrder``
      (Optional) Currently used write ordering method. Not provided by all
      DRBD versions.

    ``outOfSync``
      (Optional) KiB of storage currently out of sync. Not provided by all
      DRBD versions.

  ``syncStatus``
    (Optional) The status of the synchronization of the disk. This is present
    only if the disk is being synchronized, and includes the following fields:

    ``percentage``
      The percentage of synchronized data.

    ``progress``
      How far the synchronization is. Written as "x/y", where x and y are
      integer numbers expressed in the measurement unit stated in
      ``progressUnit``.

    ``progressUnit``
      The measurement unit for the progress indicator.

    ``timeToFinish``
      The expected time before finishing the synchronization.

    ``speed``
      The speed of the synchronization.

    ``want``
      The desired speed of the synchronization.

    ``speedUnit``
      The measurement unit of the ``speed`` and ``want`` values. Expressed
      as "size/time".

  ``instance``
    The name of the Ganeti instance this disk is associated to.

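As an illustration of how a status reporting collector could derive its
``status.code`` from DRBD data, the sketch below maps a device's
``connectionState`` to the codes of this design. The state names come from
DRBD itself; the mapping is our own example, not the one Ganeti prescribes:

```python
# Illustrative mapping from a DRBD connection state to the status codes of
# this design. The state names are DRBD's; the mapping itself is ours.
HEALTHY = {"Connected"}
SELF_HEALING = {"SyncSource", "SyncTarget", "WFConnection", "WFSyncUUID"}

def drbd_status_code(connection_state):
    if connection_state in HEALTHY:
        return 0   # working as intended
    if connection_state in SELF_HEALING:
        return 1   # temporarily wrong, being fixed without intervention
    if connection_state == "Unconfigured":
        return 2   # cannot tell whether this is good or bad
    return 4       # e.g. StandAlone: external intervention required

assert drbd_status_code("Connected") == 0
assert drbd_status_code("SyncTarget") == 1
assert drbd_status_code("StandAlone") == 4
```
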
Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons.
This should allow identifying possible problems with the Ganeti system itself:
for example memory leaks, crashes and high resource utilization should be
evident by analyzing this information.

The ``kind`` field will be ``1`` (`Status reporting collectors`_).

Each daemon will have its own data collector, and each of them will have
a ``category`` field valued ``daemon``.

When executed in verbose mode, their ``data`` section will include at least:

``memory``
  The amount of used memory.

``size_unit``
  The measurement unit used for the memory.

``uptime``
  The uptime of the daemon.

``CPU usage``
  How much CPU the daemon is using (percentage).

Any other daemon-specific information can be included as well in the ``data``
section.

Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that sometimes is
different than the one the OS sees (for example in Xen the Node OS,
running as Dom0, has access to only part of those resources). In this
section we'll report all information we can in a "non hypervisor
specific" way. Each hypervisor can then add extra specific information
that is not generic enough to be abstracted.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

Each of the hypervisor data collectors will be of ``category``:
``hypervisor``.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system.

The ``category`` field of the report will be ``null``.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

The ``data`` section will include:

``cpu_number``
  The number of available cpus.

``cpus``
  A list with one element per cpu, showing its average load.

``memory``
  The current view of memory (free, used, cached, etc.)

``filesystem``
  A list with one element per filesystem, showing a summary of the
  total/available space.

``NICs``
  A list with one element per network interface, showing the amount of
  sent/received data, error rate, IP address of the interface, etc.

``versions``
  A map using the name of a component Ganeti interacts with (Linux, drbd,
  hypervisor, etc.) as the key and its version number as the value.

Note that we won't go into any hardware specific details (e.g. querying a
node RAID is outside the scope of this, and can be implemented as a
plugin), but we can easily just report the information above, since it's
standard enough across all systems.

Node OS CPU load average report
+++++++++++++++++++++++++++++++

This data collector will export CPU load statistics as seen by the host
system. Apart from using the data from an external monitoring system, we
can also use the data to improve instance allocation and/or the Ganeti
cluster balance. To compute the CPU load average we will use a number of
values collected inside a time window. The collection process will be
done by an independent thread (see `Mode of operation`_).

This report is a subset of the previous report (`Node OS resources
report`_) and they might eventually get merged, once reporting for the
other fields (memory, filesystem, NICs) gets implemented too.

Specifically:

The ``category`` field of the report will be ``null``.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

The ``data`` section will include:

``cpu_number``
  The number of available cpus.

``cpus``
  A list with one element per cpu, showing its average load.

``cpu_total``
  The total CPU load average as a sum of all the separate cpus.

The CPU load report function will get N values, collected by the
CPU load collection function, and calculate the above averages. Please
see the section `Mode of operation`_ for more information on how the
two functions of the data collector interact.

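The split between a collection function and a report function can be sketched
as follows; the function names, the window size, and the use of plain Python
structures are our own assumptions for illustration:

```python
# Sketch of the two-function split described above: a collection function
# appends samples inside a time window, and a report function averages the
# last n of them. Names and window size are illustrative.
from collections import deque

WINDOW = 60   # keep at most 60 samples per cpu

def collect(buffers, per_cpu_loads):
    """Collection function: called periodically with one load value per cpu."""
    for cpu, load in enumerate(per_cpu_loads):
        buffers.setdefault(cpu, deque(maxlen=WINDOW)).append(load)

def report(buffers, n):
    """Report function: average the last n samples per cpu."""
    cpus = [sum(list(buf)[-n:]) / min(n, len(buf))
            for _, buf in sorted(buffers.items())]
    return {"cpu_number": len(cpus), "cpus": cpus, "cpu_total": sum(cpus)}

buffers = {}
collect(buffers, [0.2, 0.4])
collect(buffers, [0.4, 0.6])
data = report(buffers, 2)
assert data["cpu_number"] == 2
assert abs(data["cpus"][0] - 0.3) < 1e-9
assert abs(data["cpu_total"] - 0.8) < 1e-9
```
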
Format of the query
-------------------

.. include:: monitoring-query-format.rst

Instance disk status propagation
--------------------------------

As with the instance status, Ganeti currently has only partial information
about its instance disks: in particular, each node is unaware of the disk to
instance mapping, which exists only on the master.

For this design doc we plan to fix this by changing all RPCs that create
a backend storage or that put an already existing one in use, passing
the relevant instance to the node. The node can then export these to the
status reporting tool.

While we haven't implemented these RPC changes yet, we'll use Confd to
fetch this information in the data collectors.

Plugin system
-------------

The monitoring system will be equipped with a plugin system that can
be used to export specific local information.

The plugin system is expected to be used by local installations to
export any installation-specific information that they want to be
monitored, about either hardware or software on their systems.

The plugins will be in the form of either scripts or binaries whose output
will be inserted in the report.

Eventually support for other kinds of plugins might be added as well, such as
plain text files which will be inserted into the report, or local unix or
network sockets from which the information has to be read. This should allow
most flexibility for implementing an efficient system, while being able to
keep it as simple as possible.

Data collectors
---------------

In order to ease testing, as well as to make it simple to reuse this
subsystem, it will be possible to run just the "data collectors" on each
node without passing through the agent daemon.

If a data collector is run independently, it should print its report on
stdout, according to the format corresponding to a single data collector
report object, as described in the previous paragraphs.

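A stand-alone collector could look like the sketch below: it builds a single
report object with the mandatory fields and prints it to stdout. The collector
name and the ``data`` payload are illustrative, not real Ganeti collectors:

```python
# Sketch of a stand-alone data collector: run directly, it prints one
# report object (not an array) to stdout. All values are illustrative.
import json
import sys
import time

def make_report_object(name, version, format_version, category, kind, data):
    """Build a single data collector report object."""
    return {"name": name, "version": version,
            "format_version": format_version,
            # nanoseconds since the unix epoch, as required by the design
            "timestamp": int(time.time() * 1e9),
            "category": category, "kind": kind, "data": data}

if __name__ == "__main__":
    obj = make_report_object("ExampleCollector", "1.0", 1, None, 0,
                             {"example_value": 42})
    json.dump(obj, sys.stdout)
```
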
Mode of operation
-----------------

In order to be able to report information fast, the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which will
be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different speeds. These
will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation specific, and while we'll
try to have defaults, it will be configurable. The first two instead we
can use adaptively to query a certain resource faster or slower
depending on those two parameters.

When run as stand-alone binaries, the data collectors will not use any
caching system; they will just fetch and return the data immediately.

Since some performance collectors have to operate on a number of values
collected at previous times, we need a mechanism, independent of the data
collectors, which will trigger the collection of those values and also
store them, so that they are available for calculation by the data
collectors.

To collect data periodically, a thread will be created by the monitoring
agent which will run the collection function of every data collector
that provides one. The values returned by the collection function of
the data collector will be saved in an appropriate map, associating each
value to the corresponding collector, using the collector's name as the
key of the map. This map will be stored in mond's memory.

The collectors are divided in two categories:

- stateless collectors, which have immediate access to the
  reported information
- stateful collectors, whose report is based on data collected
  in a previous time window

For example: the collection function of the CPU load collector will
collect a CPU load value and save it in the map mentioned above. The
collection function will be called by the collector thread every t
milliseconds. When the report function of the collector is called, it
will process the last N values of the map and calculate the
corresponding average.

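The collection thread described above can be sketched as follows; the loop,
its interval, and the stop event are our own illustration of the mechanism,
not the actual mond implementation (which is written in Haskell):

```python
# Sketch of the periodic collection thread: every interval it calls each
# collector's collection function and appends the value to a map keyed by
# collector name, as the in-memory map held by mond would be.
import threading

def collection_loop(collectors, store, interval, stop):
    while not stop.is_set():
        for name, collect_fn in collectors.items():
            store.setdefault(name, []).append(collect_fn())
        stop.wait(interval)

# Illustrative run: one stateful "cpu" collector sampled every 10 ms.
store = {}
stop = threading.Event()
thread = threading.Thread(target=collection_loop,
                          args=({"cpu": lambda: 0.5}, store, 0.01, stop))
thread.start()
stop.wait(0.05)   # let a few samples accumulate
stop.set()
thread.join()
assert store["cpu"] and store["cpu"][0] == 0.5
```
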
Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon. In
the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out it saves resources and doesn't
impact functionality.

The libekg library should be looked at for easily providing metrics in
JSON format.

Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (e.g. for DRBD and instance status)
- initial daemon for exporting data, integrating the existing collectors
- plugin system
- RPC updates for instance status reasons and disk to instance mapping
- cache layer for the daemon
- more data collectors


Future work
===========

As a future step it can be useful to "centralize" all this reporting
data in a single place. This can be, for example, just the master node,
or all the master candidates. We will evaluate doing this after the
first node-local version has been developed and tested.

Another possible change is replacing the "read-only" RPCs with queries
to the agent system, thus having only one way of collecting information
from the nodes, both for a monitoring system and for Ganeti itself.

One extra feature we may need is a way to query for only sub-parts of
the report (e.g. instance status only). This can be done by passing
arguments to the HTTP GET, which will be defined when we get to this
functionality.

Finally, the :doc:`autorepair system <design-autorepair>` can be
expanded to use the monitoring agent system as a source of information
to decide which repairs it can perform.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: