=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. Such a status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

The choice of JSON is obvious as we already depend on it in Ganeti and
thus we don't need to add extra libraries to use it, as opposed to what
would happen for XML or some other markup format.

Location of agent report
------------------------

The report will be available from all nodes, and will cover all
node-local resources. This allows more real-time information to be
available, at the cost of querying all nodes.

Information reported
--------------------

The monitoring agent system will report on the following basic information:

- Instance status
- Instance disk status
- Status of storage for instances
- Ganeti daemons status, CPU usage, memory footprint
- Hypervisor resources report (memory, CPU, network interfaces)
- Node OS resources report (memory, CPU, network interfaces)
- Node OS CPU load average report
- Information from a plugin system

Format of the report
--------------------

The report will be in JSON format, and it will present an array of
report objects.
Each report object will be produced by a specific data collector.
Each report object includes some mandatory fields, to be provided by all
the data collectors:

``name``
  The name of the data collector that produced this part of the report.
  It is supposed to be unique inside a report.

``version``
  The version of the data collector that produces this part of the
  report. Built-in data collectors (as opposed to those implemented as
  plugins) should have "B" as the version number.

``format_version``
  The format of what is represented in the "data" field for each data
  collector might change over time. Every time this happens, the
  format_version should be changed, so that whoever reads the report
  knows what format to expect, and how to correctly interpret it.

``timestamp``
  The time when the reported data were gathered. It has to be expressed
  in nanoseconds since the unix epoch (0:00:00 January 01, 1970). If not
  enough precision is available (or needed) it can be padded with
  zeroes. If a report object needs multiple timestamps, it can add more
  and/or override this one inside its own "data" section.

``category``
  A collector can belong to a given category of collectors (e.g.: storage
  collectors, daemon collector). This means that it will have to provide a
  minimum set of prescribed fields, as documented for each category.
  This field will contain the name of the category the collector belongs to,
  if any, or just the ``null`` value.

``kind``
  Two kinds of collectors are possible:
  `Performance reporting collectors`_ and `Status reporting collectors`_.
  The respective paragraphs will describe them and the value of this field.

``data``
  This field contains all the data generated by the specific data collector,
  in its own independently defined format. The monitoring agent could check
  this syntactically (according to the JSON specifications) but not
  semantically.

Here follows a minimal example of a report::

  [
  {
      "name" : "TheCollectorIdentifier",
      "version" : "1.2",
      "format_version" : 1,
      "timestamp" : 1351607182000000000,
      "category" : null,
      "kind" : 0,
      "data" : { "plugin_specific_data" : "go_here" }
  },
  {
      "name" : "AnotherDataCollector",
      "version" : "B",
      "format_version" : 7,
      "timestamp" : 1351609526123854000,
      "category" : "storage",
      "kind" : 1,
      "data" : { "status" : { "code" : 1,
                              "message" : "Error on disk 2"
                            },
                 "plugin_specific" : "data",
                 "some_late_data" : { "timestamp" : 1351609526123942720,
                                      ...
                                    }
               }
  }
  ]

Performance reporting collectors
++++++++++++++++++++++++++++++++

These collectors only provide data about some component of the system,
without giving any interpretation of their meaning.

The value of the ``kind`` field of the report will be ``0``.

Status reporting collectors
+++++++++++++++++++++++++++

These collectors will provide information about the status of some
component of Ganeti, or of something managed by Ganeti.

The value of their ``kind`` field will be ``1``.

The rationale behind this kind of collector is that there are some
situations where exporting data about the underlying subsystems would
expose potential issues. But if Ganeti itself is able (and going) to fix
the problem, conflicts might arise between Ganeti and something/somebody
else trying to fix the same problem.
Also, some external monitoring systems might not be aware of the internals of a
particular subsystem (e.g.: DRBD) and might only exploit the high level
response of its data collector, alerting an administrator if anything is wrong.
Still, completely hiding the underlying data is not a good idea, as they might
still be of use in some cases. So status reporting plugins will provide two
output modes: one just exporting high level information about the status,
and one also exporting all the data they gathered.
The default output mode will be the status-only one. The verbose output
mode, providing all the data, can be selected through a command line
parameter (for stand-alone data collectors) or through the HTTP request
to the monitoring agent (when collectors are executed as part of it).

When exporting just the status, each status reporting collector will
provide, in its ``data`` section, at least the following field:

``status``
  summarizes the status of the component being monitored and consists of two
  subfields:

  ``code``
    It assumes a numeric value, encoded in such a way as to allow using a
    bitset to easily distinguish which states are currently present in the
    whole cluster. If the bitwise OR of all the ``status`` fields is 0, the
    cluster is completely healthy (e.g.: if one collector reports ``1`` and
    another reports ``4``, their bitwise OR is ``5``, showing that both a
    self-healing problem and one requiring intervention are present).
    The status codes are as follows:

    ``0``
      The collector can determine that everything is working as
      intended.

    ``1``
      Something is temporarily wrong but it is being automatically fixed by
      Ganeti.
      There is no need of external intervention.

    ``2``
      The collector has failed to understand whether the status is good or
      bad. Further analysis is required. Interpret this status as a
      potentially dangerous situation.

    ``4``
      The collector can determine that something is wrong and Ganeti has no
      way to fix it autonomously. External intervention is required.

  ``message``
    A message to better explain the reason of the status.
    The exact format of the message string is data collector dependent.

    The field is mandatory, but the content can be an empty string if the
    ``code`` is ``0`` (working as intended) or ``1`` (being fixed
    automatically).

    If the status code is ``2``, the message should explain why it was not
    possible to determine a proper status.
    If the status code is ``4``, the message should specify what has gone
    wrong.

The ``data`` section will also contain all the fields describing the gathered
data, according to a collector-specific format.
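
As an illustration only (the fields beyond ``status`` depend on the
specific collector), the status-only ``data`` section of such a
collector could look like::

  { "status" : { "code" : 1,
                 "message" : "Device being resynchronized by Ganeti"
               }
  }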

Instance status
+++++++++++++++

At the moment each node knows which instances are running on it, and
which instances it is primary for, but not why an instance might not be
running. On the other hand we don't want to distribute full instance
"admin" status information to all nodes, because of the performance
impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used for
  example to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti will
  just use the information it has about the source of the request.
  This reason information will be structured according to the
  :doc:`Ganeti reason trail <design-reason-trail>` design document.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on is passed to
  them. They will then export the new expected instance status, together
  with the associated reason and object version, to the status report
  system, which will then export them itself.

Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the timestamp to
understand the freshness of their data even in the absence of atomic
cross-node reporting: for example if they see an instance "up" on a node
after seeing it running on a previous one, they can compare these values
to understand which data is freshest, and repoll the "older" node. Of
course if they keep seeing this status this represents an error (either
an instance continuously "flapping" between nodes, or an instance
constantly up on more than one), which should be reported and acted
upon.

The instance status will be reported by each node, for the instances it
is primary for, and the ``data`` section of the report will contain a
list of instances, named ``instances``, with at least the following
fields for each instance:

``name``
  The name of the instance.

``uuid``
  The UUID of the instance (stable on name change).

``admin_state``
  The status of the instance (up/down/offline) as requested by the admin.

``actual_state``
  The actual status of the instance. It can be ``up``, ``down``, or
  ``hung`` if the instance is up but it appears to be completely stuck.

``uptime``
  The uptime of the instance (if it is up, "null" otherwise).

``mtime``
  The timestamp of the last known change to the instance state.

``state_reason``
  The last known reason for state change of the instance, described according
  to the JSON representation of a reason trail, as detailed in the :doc:`reason
  trail design document <design-reason-trail>`.

``status``
  It represents the status of the instance, and its format is the same as that
  of the ``status`` field of `Status reporting collectors`_.
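
For illustration, one entry of the ``instances`` list could look like
the following (all values, including the exact shape of the reason
trail, are made up for the example)::

  { "name" : "instance1.example.com",
    "uuid" : "4ec22b4a-bc43-4e2d-873f-9072bf1604fb",
    "admin_state" : "up",
    "actual_state" : "up",
    "uptime" : 17280.5,
    "mtime" : 1351607182000000000,
    "state_reason" : [["gnt:client:cli", "instance start", 1351607181000000000]],
    "status" : { "code" : 0, "message" : "" }
  }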
280

    
281
Each hypervisor should provide its own instance status data collector, possibly
282
with the addition of more, specific, fields.
283
The ``category`` field of all of them will be ``instance``.
284
The ``kind`` field will be ``1``.
285

    
286
Note that as soon as a node knows it's not the primary anymore for an
287
instance it will stop reporting status for it: this means the instance
288
will either disappear, if it has been deleted, or appear on another
289
node, if it's been moved.
290

    
291
The ``code`` of the ``status`` field of the report of the Instance status data
292
collector will be:
293

    
294
``0``
295
  if ``status`` is ``0`` for all the instances it is reporting about.
296

    
297
``1``
298
  otherwise.
299

    

Storage collectors
++++++++++++++++++

The storage collectors will be a series of data collectors
that will gather data about storage for the current node. The collection
will be performed at different granularity and abstraction levels, from
the physical disks, to partitions, logical volumes and to the specific
storage types used by Ganeti itself (drbd, rbd, plain, file).

The ``name`` of each of these collectors will reflect what storage type
each of them refers to.

The ``category`` field of these collectors will be ``storage``.

The ``kind`` field will depend on the specific collector.

Each ``storage`` collector's ``data`` section will provide collector-specific
fields.

The various storage collectors will provide keys to join the data they provide,
in order to allow the user to get a better understanding of the system. E.g.:
through device names, or instance names.

Diskstats collector
*******************

This storage data collector will gather information about the status of
the disks installed in the system, as listed in the ``/proc/diskstats``
file. This means that not only physical hard drives, but also ramdisks
and loopback devices will be listed.

Its ``kind`` in the report will be ``0`` (`Performance reporting collectors`_).

Its ``category`` field in the report will contain the value ``storage``.

When executed in verbose mode, the ``data`` section of the report of this
collector will be a list of items, each representing one disk, each providing
the following fields:

``major``
  The major number of the device.

``minor``
  The minor number of the device.

``name``
  The name of the device.

``readsNum``
  This is the total number of reads completed successfully.

``mergedReads``
  Reads which are adjacent to each other may be merged for efficiency. Thus
  two 4K reads may become one 8K read before it is ultimately handed to the
  disk, and so it will be counted (and queued) as only one I/O. This field
  specifies how often this was done.

``secRead``
  This is the total number of sectors read successfully.

``timeRead``
  This is the total number of milliseconds spent by all reads.

``writes``
  This is the total number of writes completed successfully.

``mergedWrites``
  Writes which are adjacent to each other may be merged for efficiency. Thus
  two 4K writes may become one 8K write before it is ultimately handed to the
  disk, and so it will be counted (and queued) as only one I/O. This field
  specifies how often this was done.

``secWritten``
  This is the total number of sectors written successfully.

``timeWrite``
  This is the total number of milliseconds spent by all writes.

``ios``
  The number of I/Os currently in progress.
  The only field that should go to zero, it is incremented as requests are
  given to the appropriate struct request_queue and decremented as they
  finish.

``timeIO``
  The number of milliseconds spent doing I/Os. This field increases so long
  as field ``ios`` is nonzero.

``wIOmillis``
  The weighted number of milliseconds spent doing I/Os.
  This field is incremented at each I/O start, I/O completion, I/O merge,
  or read of these stats by the number of I/Os in progress (field ``ios``)
  times the number of milliseconds spent doing I/O since the last update of
  this field. This can provide an easy measure of both I/O completion time
  and the backlog that may be accumulating.
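
Purely as an illustration (all numbers are invented), one entry of that
list could look like::

  { "major" : 8,
    "minor" : 0,
    "name" : "sda",
    "readsNum" : 12345,
    "mergedReads" : 123,
    "secRead" : 456789,
    "timeRead" : 9876,
    "writes" : 54321,
    "mergedWrites" : 321,
    "secWritten" : 987654,
    "timeWrite" : 13579,
    "ios" : 0,
    "timeIO" : 7654,
    "wIOmillis" : 22109
  }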
394

    
395
Logical Volume collector
396
************************
397

    
398
This data collector will gather information about the attributes of logical
399
volumes present in the system.
400

    
401
Its ``kind`` in the report will be ``0`` (`Performance reporting collectors`_).
402

    
403
Its ``category`` field in the report will contain the value ``storage``.
404

    
405
The ``data`` section of the report of this collector will be a list of items,
406
each representing one logical volume and providing the following fields:
407

    
408
``uuid``
409
  The UUID of the logical volume.
410

    
411
``name``
412
  The name of the logical volume.
413

    
414
``attr``
415
  The attributes of the logical volume.
416

    
417
``major``
418
  Persistent major number or -1 if not persistent.
419

    
420
``minor``
421
  Persistent minor number or -1 if not persistent.
422

    
423
``kernel_major``
424
  Currently assigned major number or -1 if LV is not active.
425

    
426
``kernel_minor``
427
  Currently assigned minor number or -1 if LV is not active.
428

    
429
``size``
430
  Size of LV in bytes.
431

    
432
``seg_count``
433
  Number of segments in LV.
434

    
435
``tags``
436
  Tags, if any.
437

    
438
``modules``
439
  Kernel device-mapper modules required for this LV, if any.
440

    
441
``vg_uuid``
442
  Unique identifier of the volume group.
443

    
444
``vg_name``
445
  Name of the volume group.
446

    
447
``segtype``
448
  Type of LV segment.
449

    
450
``seg_start``
451
  Offset within the LVto the start of the segment in bytes.
452

    
453
``seg_start_pe``
454
  Offset within the LV to the start of the segment in physical extents.
455

    
456
``seg_size``
457
  Size of the segment in bytes.
458

    
459
``seg_tags``
460
  Tags for the segment, if any.
461

    
462
``seg_pe_ranges``
463
  Ranges of Physical Extents of underlying devices in lvs command line format.
464

    
465
``devices``
466
  Underlying devices used with starting extent numbers.
467

    
468
``instance``
469
  The name of the instance this LV is used by, or ``null`` if it was not
470
  possible to determine it.
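
As an example only (every value here is invented), one such item could
look like::

  { "uuid" : "pXpW6Y-cPdX-DBMR-intS-sJ0w-Qnwf-cxNSC5",
    "name" : "4ec22b4a-bc43-4e2d-873f-9072bf1604fb.disk0",
    "attr" : "-wi-ao",
    "major" : -1,
    "minor" : -1,
    "kernel_major" : 253,
    "kernel_minor" : 0,
    "size" : 1073741824,
    "seg_count" : 1,
    "tags" : "",
    "modules" : "",
    "vg_uuid" : "YDt5vE-Nfzz-dJcH-vCQ5-qJ0Y-1bJ3-4LYJ9V",
    "vg_name" : "xenvg",
    "segtype" : "linear",
    "seg_start" : 0,
    "seg_start_pe" : 0,
    "seg_size" : 1073741824,
    "seg_tags" : "",
    "seg_pe_ranges" : "/dev/sda3:0-255",
    "devices" : "/dev/sda3(0)",
    "instance" : "instance1.example.com"
  }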

DRBD status
***********

This data collector will run only on nodes where DRBD is actually
present and it will gather information about DRBD devices.

Its ``kind`` in the report will be ``1`` (`Status reporting collectors`_).

Its ``category`` field in the report will contain the value ``storage``.

When executed in verbose mode, the ``data`` section of the report of this
collector will provide the following fields:

``versionInfo``
  Information about the DRBD version number, given by a combination of
  any (but at least one) of the following fields:

  ``version``
    The DRBD driver version.

  ``api``
    The API version number.

  ``proto``
    The protocol version.

  ``srcversion``
    The version of the source files.

  ``gitHash``
    Git hash of the source files.

  ``buildBy``
    Who built the binary, and, optionally, when.

``device``
  A list of structures, each describing a DRBD device (a minor) and containing
  the following fields:

  ``minor``
    The device minor number.

  ``connectionState``
    The state of the connection. If it is "Unconfigured", all the following
    fields are not present.

  ``localRole``
    The role of the local resource.

  ``remoteRole``
    The role of the remote resource.

  ``localState``
    The status of the local disk.

  ``remoteState``
    The status of the remote disk.

  ``replicationProtocol``
    The replication protocol being used.

  ``ioFlags``
    The input/output flags.

  ``perfIndicators``
    The performance indicators. This field will contain the following
    sub-fields:

    ``networkSend``
      KiB of data sent on the network.

    ``networkReceive``
      KiB of data received from the network.

    ``diskWrite``
      KiB of data written on the local disk.

    ``diskRead``
      KiB of data read from the local disk.

    ``activityLog``
      Number of updates of the activity log.

    ``bitMap``
      Number of updates to the bitmap area of the metadata.

    ``localCount``
      Number of open requests to the local I/O subsystem.

    ``pending``
      Number of requests sent to the partner but not yet answered.

    ``unacknowledged``
      Number of requests received by the partner but still to be answered.

    ``applicationPending``
      Number of block input/output requests forwarded to DRBD but that have
      not yet been answered.

    ``epochs``
      (Optional) Number of epoch objects. Not provided by all DRBD versions.

    ``writeOrder``
      (Optional) Currently used write ordering method. Not provided by all DRBD
      versions.

    ``outOfSync``
      (Optional) KiB of storage currently out of sync. Not provided by all DRBD
      versions.

  ``syncStatus``
    (Optional) The status of the synchronization of the disk. This is present
    only if the disk is being synchronized, and includes the following fields:

    ``percentage``
      The percentage of synchronized data.

    ``progress``
      How far the synchronization is. Written as "x/y", where x and y are
      integer numbers expressed in the measurement unit stated in
      ``progressUnit``.

    ``progressUnit``
      The measurement unit for the progress indicator.

    ``timeToFinish``
      The expected time before finishing the synchronization.

    ``speed``
      The speed of the synchronization.

    ``want``
      The desired speed of the synchronization.

    ``speedUnit``
      The measurement unit of the ``speed`` and ``want`` values. Expressed
      as "size/time".

  ``instance``
    The name of the Ganeti instance this disk is associated with.
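
For illustration, the verbose ``data`` section of this collector for a
healthy, connected device could look like the following (all values are
invented, and the exact set of ``versionInfo`` fields depends on the
DRBD version)::

  { "versionInfo" : { "version" : "8.3.11",
                      "api" : 88,
                      "proto" : "86-96"
                    },
    "device" : [
      { "minor" : 0,
        "connectionState" : "Connected",
        "localRole" : "Primary",
        "remoteRole" : "Secondary",
        "localState" : "UpToDate",
        "remoteState" : "UpToDate",
        "replicationProtocol" : "C",
        "ioFlags" : "r-----",
        "perfIndicators" : { "networkSend" : 102400,
                             "networkReceive" : 51200,
                             "diskWrite" : 204800,
                             "diskRead" : 10240,
                             "activityLog" : 12,
                             "bitMap" : 0,
                             "localCount" : 0,
                             "pending" : 0,
                             "unacknowledged" : 0,
                             "applicationPending" : 0
                           },
        "instance" : "instance1.example.com"
      }
    ],
    "status" : { "code" : 0, "message" : "" }
  }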

Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons.
This should allow identifying possible problems with the Ganeti system itself:
for example memory leaks, crashes and high resource utilization should be
evident by analyzing this information.

The ``kind`` field will be ``1`` (`Status reporting collectors`_).

Each daemon will have its own data collector, and each of them will have
a ``category`` field valued ``daemon``.

When executed in verbose mode, their data section will include at least:

``memory``
  The amount of used memory.

``size_unit``
  The measurement unit used for the memory.

``uptime``
  The uptime of the daemon.

``CPU usage``
  How much CPU the daemon is using (percentage).

Any other daemon-specific information can be included as well in the ``data``
section.
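
As a purely illustrative sketch (all values invented), the verbose
``data`` section of one such collector could look like::

  { "memory" : 10240,
    "size_unit" : "KiB",
    "uptime" : 97840.2,
    "CPU usage" : 0.7,
    "status" : { "code" : 0, "message" : "" }
  }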

Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that sometimes is
different from the one the OS sees (for example in Xen the Node OS,
running as Dom0, has access to only part of those resources). In this
section we'll report all information we can in a "non hypervisor
specific" way. Each hypervisor can then add extra specific information
that is not generic enough to be abstracted.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

Each of the hypervisor data collectors will be of ``category``:
``hypervisor``.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system.

The ``category`` field of the report will be ``null``.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

The ``data`` section will include:

``cpu_number``
  The number of available cpus.

``cpus``
  A list with one element per cpu, showing its average load.

``memory``
  The current view of memory (free, used, cached, etc.)

``filesystem``
  A list with one element per filesystem, showing a summary of the
  total/available space.

``NICs``
  A list with one element per network interface, showing the amount of
  sent/received data, error rate, IP address of the interface, etc.

``versions``
  A map using the name of a component Ganeti interacts with (Linux, drbd,
  hypervisor, etc) as the key and its version number as the value.

Note that we won't go into any hardware specific details (e.g. querying a
node RAID is outside the scope of this, and can be implemented as a
plugin) but we can easily just report the information above, since it's
standard enough across all systems.
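
Since the exact layout of the ``memory``, ``filesystem`` and ``NICs``
entries is not fixed by this document, the following is only a sketch of
what such a ``data`` section could look like (all sub-field names and
values are illustrative assumptions)::

  { "cpu_number" : 4,
    "cpus" : [0.12, 0.35, 0.07, 0.23],
    "memory" : { "total" : 16777216, "used" : 6291456,
                 "free" : 8388608, "cached" : 2097152 },
    "filesystem" : [ { "mountpoint" : "/", "total" : 41943040,
                       "available" : 31457280 } ],
    "NICs" : [ { "name" : "eth0", "ip" : "192.0.2.10",
                 "sent" : 123456, "received" : 654321, "errors" : 0 } ],
    "versions" : { "linux" : "3.2.0", "drbd" : "8.3.11", "xen" : "4.1.2" }
  }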

Node OS CPU load average report
+++++++++++++++++++++++++++++++

This data collector will export CPU load statistics as seen by the host
system. Apart from using the data from an external monitoring system we
can also use the data to improve instance allocation and/or the Ganeti
cluster balance. To compute the CPU load average we will use a number of
values collected inside a time window. The collection process will be
done by an independent thread (see `Mode of operation`_).

This report is a subset of the previous report (`Node OS resources
report`_) and they might eventually get merged, once reporting for the
other fields (memory, filesystem, NICs) gets implemented too.

Specifically:

The ``category`` field of the report will be ``null``.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

The ``data`` section will include:

``cpu_number``
  The number of available cpus.

``cpus``
  A list with one element per cpu, showing its average load.

``cpu_total``
  The total CPU load average as the sum of all the separate cpus.

The CPU load report function will get N values, collected by the
CPU load collection function, and calculate the above averages. Please
see the section `Mode of operation`_ for more information on how the
two functions of the data collector interact.
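
For example (with invented numbers), on a two-cpu node the ``data``
section could look like::

  { "cpu_number" : 2,
    "cpus" : [0.31, 0.25],
    "cpu_total" : 0.56
  }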

Format of the query
-------------------

.. include:: monitoring-query-format.rst

Instance disk status propagation
--------------------------------

As for the instance status, Ganeti currently has only partial
information about its instance disks: in particular each node is
unaware of the disk to instance mapping, which exists only on the
master.

For this design doc we plan to fix this by changing all RPCs that create
a backend storage or that put an already existing one in use, so that
the relevant instance is passed to the node. The node can then export
these to the status reporting tool.

While we haven't implemented these RPC changes yet, we'll use Confd to
fetch this information in the data collectors.

Plugin system
-------------

The monitoring system will be equipped with a plugin system through
which specific local information can be exported.

The plugin system is expected to be used by local installations to
export any installation specific information that they want to be
monitored, about either hardware or software on their systems.

The plugins themselves will be either scripts or binaries whose output
will be inserted in the report.

Eventually support for other kinds of plugins might be added as well, such as
plain text files which will be inserted into the report, or local unix or
network sockets from which the information has to be read. This should allow
most flexibility for implementing an efficient system, while being able to keep
it as simple as possible.

Data collectors
---------------

In order to ease testing, as well as to make it simple to reuse this
subsystem, it will be possible to run just the "data collectors" on each
node without passing through the agent daemon.

If a data collector is run independently, it should print on stdout its
report, according to the format corresponding to a single data collector
report object, as described in the previous paragraphs.

Mode of operation
-----------------

In order to be able to report information fast the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which will
be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different speeds. These
will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation specific, and while we'll
try to have defaults, it will be configurable. The first two, instead,
we can use adaptively to query a certain resource faster or slower
depending on those two parameters.

When run as stand-alone binaries, the data collectors will not use any
caching system, and will just fetch and return the data immediately.

Since some performance collectors have to operate on a number of values
collected at previous times, we need a mechanism, independent of the data
collectors, which will trigger the collection of those values and also
store them, so that they are available for calculation by the data
collectors.

To collect data periodically, a thread will be created by the monitoring
agent which will run the collection function of every data collector
that provides one. The values returned by the collection function of
the data collector will be saved in an appropriate map, associating each
value to the corresponding collector, using the collector's name as the
key of the map. This map will be stored in mond's memory.

For example: the collection function of the CPU load collector will
collect a CPU load value and save it in the map mentioned above. The
collection function will be called by the collector thread every t
milliseconds. When the report function of the collector is called, it
will process the last N values of the map and calculate the
corresponding average.
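
The following Haskell fragment is only a sketch of this interaction,
under the assumption of a single shared map protected by an ``MVar``;
all names and the sampling function are illustrative, not the actual
mond implementation::

  import Control.Concurrent (forkIO, threadDelay)
  import Control.Concurrent.MVar
  import Control.Monad (forever)
  import qualified Data.Map as Map

  -- Buffer of collected values, keyed by data collector name.
  type Buffer = MVar (Map.Map String [Double])

  -- Periodically run every collection function, prepending the new
  -- value to the list stored under the collector's name. (A real
  -- implementation would also prune values older than the window.)
  collectLoop :: Buffer -> Int -> [(String, IO Double)] -> IO ()
  collectLoop buf intervalMs collectors = forever $ do
    mapM_ collectOne collectors
    threadDelay (intervalMs * 1000)  -- threadDelay takes microseconds
    where
      collectOne (name, collect) = do
        v <- collect
        modifyMVar_ buf (return . Map.insertWith (++) name [v])

  -- The report function only reads the buffer: it averages the last
  -- n values gathered for the given collector, if there are any.
  reportAverage :: Buffer -> String -> Int -> IO (Maybe Double)
  reportAverage buf name n = do
    m <- readMVar buf
    return $ case take n <$> Map.lookup name m of
      Just vs | not (null vs) -> Just (sum vs / fromIntegral (length vs))
      _                       -> Nothing

  main :: IO ()
  main = do
    buf <- newMVar Map.empty
    -- Hypothetical sampler; a real one would read e.g. /proc/stat.
    let sampleCpuLoad = return 0.42
    _ <- forkIO (collectLoop buf 1000 [("cpu-load", sampleCpuLoad)])
    threadDelay 3500000          -- let a few samples accumulate
    print =<< reportAverage buf "cpu-load" 3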

Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon. In
the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out it saves resources and doesn't
impact functionality.

The libekg library should be looked at for easily providing metrics in
JSON format.

Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (e.g. for DRBD and instance status)
- initial daemon for exporting data, integrating the existing collectors
- plugin system
- RPC updates for instance status reasons and disk to instance mapping
- cache layer for the daemon
- more data collectors

Future work
===========

As a future step it can be useful to "centralize" all this reporting
data in a single place. This can for example be just the master node, or
all the master candidates. We will evaluate doing this after the first
node-local version has been developed and tested.

Another possible change is replacing the "read-only" RPCs with queries
to the agent system, thus having only one way of collecting information
from the nodes, both for a monitoring system and for Ganeti itself.

One extra feature we may need is a way to query for only sub-parts of
the report (e.g. instance status only). This can be done by passing
arguments to the HTTP GET, which will be defined when we get to this
functionality.

Finally, the :doc:`autorepair system <design-autorepair>` can be
expanded to use the monitoring agent system as a source of information
to decide which repairs it can perform.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: