=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. This status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

JSON is the natural choice because Ganeti already depends on it, so we
don't need to add extra libraries to use it, as opposed to what would
happen for XML or some other markup format.

Location of agent report
------------------------

The report will be available from all nodes, and will cover all
node-local resources. This allows more real-time information to be
available, at the cost of querying all nodes.

Information reported
--------------------

The monitoring agent system will report on the following basic information:

- Instance status
- Instance disk status
- Status of storage for instances
- Ganeti daemons status, CPU usage, memory footprint
- Hypervisor resources report (memory, CPU, network interfaces)
- Node OS resources report (memory, CPU, network interfaces)
- Information from a plugin system

Format of the report
--------------------

The report will be in JSON format, and it will present an array
of report objects.
Each report object will be produced by a specific data collector.
Each report object includes some mandatory fields, to be provided by all
the data collectors:

``name``
  The name of the data collector that produced this part of the report.
  It is supposed to be unique inside a report.

``version``
  The version of the data collector that produces this part of the
  report. Built-in data collectors (as opposed to those implemented as
  plugins) should have "B" as the version number.

``formatVersion``
  The format of what is represented in the "data" field for each data
  collector might change over time. Every time this happens, the
  ``formatVersion`` should be changed, so that whoever reads the report
  knows what format to expect, and how to correctly interpret it.

``timestamp``
  The time when the reported data were gathered. It has to be expressed
  in nanoseconds since the Unix epoch (0:00:00 January 01, 1970). If not
  enough precision is available (or needed) it can be padded with
  zeroes. If a report object needs multiple timestamps, it can add more
  and/or override this one inside its own "data" section.

``category``
  A collector can belong to a given category of collectors (e.g.: storage
  collectors, daemon collector). This means that it will have to provide a
  minimum set of prescribed fields, as documented for each category.
  This field will contain the name of the category the collector belongs to,
  if any, or just the ``null`` value.

``kind``
  Two kinds of collectors are possible:
  `Performance reporting collectors`_ and `Status reporting collectors`_.
  The respective paragraphs will describe them and the value of this field.

``data``
  This field contains all the data generated by the specific data collector,
  in its own independently defined format. The monitoring agent could check
  this syntactically (according to the JSON specifications) but not
  semantically.

Here follows a minimal example of a report::

  [
  {
      "name" : "TheCollectorIdentifier",
      "version" : "1.2",
      "formatVersion" : 1,
      "timestamp" : 1351607182000000000,
      "category" : null,
      "kind" : 0,
      "data" : { "plugin_specific_data" : "go_here" }
  },
  {
      "name" : "AnotherDataCollector",
      "version" : "B",
      "formatVersion" : 7,
      "timestamp" : 1351609526123854000,
      "category" : "storage",
      "kind" : 1,
      "data" : { "status" : { "code" : 1,
                              "message" : "Error on disk 2"
                            },
                 "plugin_specific" : "data",
                 "some_late_data" : { "timestamp" : 1351609526123942720,
                                      ...
                                    }
               }
  }
  ]
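
As an illustration of the syntactic check a consumer of the report could
perform, the following Python sketch parses a report and lists the
mandatory fields missing from each report object. The function name
``missing_fields`` and the sample collector names are assumptions made
for this example; they are not part of the design.

```python
import json

# Mandatory fields of every report object, as listed in this section.
MANDATORY_FIELDS = ("name", "version", "formatVersion", "timestamp",
                    "category", "kind", "data")


def missing_fields(report_text):
    """Map the index of each incomplete report object to its missing fields."""
    reports = json.loads(report_text)
    if not isinstance(reports, list):
        raise ValueError("a report must be a JSON array of report objects")
    problems = {}
    for index, obj in enumerate(reports):
        if isinstance(obj, dict):
            missing = [f for f in MANDATORY_FIELDS if f not in obj]
        else:
            # Not a JSON object at all: every field is missing.
            missing = list(MANDATORY_FIELDS)
        if missing:
            problems[index] = missing
    return problems


# Hypothetical sample: the second object lacks "kind" and "data".
sample = json.dumps([
    {"name": "ExampleCollector", "version": "B", "formatVersion": 1,
     "timestamp": 1351607182000000000, "category": None, "kind": 0,
     "data": {}},
    {"name": "IncompleteCollector", "version": "1.0", "formatVersion": 1,
     "timestamp": 1351607182000000000, "category": None},
])
print(missing_fields(sample))
```

The check is deliberately limited to the presence of the fields: the
content of ``data`` is collector-specific and is not inspected.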

Performance reporting collectors
++++++++++++++++++++++++++++++++

These collectors only provide data about some component of the system, without
giving any interpretation of its meaning.

The value of the ``kind`` field of the report will be ``0``.

Status reporting collectors
+++++++++++++++++++++++++++

These collectors will provide information about the status of some
component of Ganeti, or managed by Ganeti.

The value of their ``kind`` field will be ``1``.

The rationale behind this kind of collector is that there are some situations
where exporting data about the underlying subsystems would expose potential
issues. But if Ganeti itself is able (and going) to fix the problem, conflicts
might arise between Ganeti and something/somebody else trying to fix the same
problem.
Also, some external monitoring systems might not be aware of the internals of a
particular subsystem (e.g.: DRBD) and might only exploit the high-level
response of its data collector, alerting an administrator if anything is wrong.
Still, completely hiding the underlying data is not a good idea, as they might
still be of use in some cases. So status reporting collectors will provide two
output modes: one exporting just high-level information about the status,
and one also exporting all the data they gathered.
The default output mode will be the status-only one. Through a command line
parameter (for stand-alone data collectors) or through the HTTP request to the
monitoring agent (when collectors are executed as part of it) the verbose
output mode providing all the data can be selected.

When exporting just the status each status reporting collector will provide,
in its ``data`` section, at least the following field:

``status``
  summarizes the status of the component being monitored and consists of two
  subfields:

  ``code``
    It takes a numeric value, encoded in such a way as to allow using a bitset
    to easily distinguish which states are currently present in the whole
    cluster. If the bitwise OR of all the ``code`` fields is 0, the cluster is
    completely healthy.
    The status codes are as follows:

    ``0``
      The collector can determine that everything is working as
      intended.

    ``1``
      Something is temporarily wrong but it is being automatically fixed by
      Ganeti.
      There is no need for external intervention.

    ``2``
      The collector can determine that something is wrong and Ganeti has no
      way to fix it autonomously. External intervention is required.

    ``4``
      The collector has failed to understand whether the status is good or
      bad. Further analysis is required. Interpret this status as a
      potentially dangerous situation.

  ``message``
    A message to better explain the reason of the status.
    The exact format of the message string is data collector dependent.

    The field is mandatory, but the content can be ``null`` if the code is
    ``0`` (working as intended) or ``1`` (being fixed automatically).

    If the status code is ``2``, the message should specify what has gone
    wrong.
    If the status code is ``4``, the message should explain why it was not
    possible to determine a proper status.

The ``data`` section will also contain all the fields describing the gathered
data, according to a collector-specific format.
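
Because the non-zero codes are distinct powers of two, a monitoring
system can combine them with a bitwise OR and still tell which states
occur. The Python sketch below shows one way to do this; the function
names, the state labels and the sample reports are assumptions made for
the example, while the numeric codes are the ones defined above.

```python
from functools import reduce
from operator import or_

# Non-zero status codes as defined above; the labels are illustrative.
STATES = {1: "auto-fixing", 2: "needs intervention", 4: "unknown"}


def cluster_code(reports):
    """OR together the status codes of all status reporting collectors."""
    codes = (r["data"]["status"]["code"] for r in reports if r["kind"] == 1)
    return reduce(or_, codes, 0)


def states_present(code):
    """List the non-healthy states encoded in a combined code."""
    return [label for bit, label in sorted(STATES.items()) if code & bit]


# Hypothetical reports, reduced to the fields this sketch needs.
reports = [
    {"kind": 1, "data": {"status": {"code": 0, "message": None}}},
    {"kind": 1, "data": {"status": {"code": 1, "message": None}}},
    {"kind": 1, "data": {"status": {"code": 4, "message": "no data"}}},
    {"kind": 0, "data": {"example_metric": 42}},
]
combined = cluster_code(reports)
print(combined, states_present(combined))
```

Performance reporting collectors (``kind`` equal to ``0``) are skipped,
since they carry no ``status`` field.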

Instance status
+++++++++++++++

At the moment each node knows which instances are running on it and which
instances it is primary for, but not why an instance might not
be running. On the other hand we don't want to distribute full instance
"admin" status information to all nodes, because of the performance
impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used for
  example to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti will
  just use the information it has about the source of the request: for
  example a cli shutdown operation will have "cli:shutdown" as a reason,
  a cli failover operation will have "cli:failover". Operations coming
  from the remote API will use "rapi" instead of "cli". Of course
  setting a real site-specific reason is still preferred.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on are passed to
  them. They will then export the new expected instance status, together
  with the associated reason and object version, to the status report
  system, which will then export them in turn.

Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the timestamp to
understand the freshness of their data even in the absence of an atomic
cross-node reporting: for example if they see an instance "up" on a node
after seeing it running on a previous one, they can compare these values
to understand which data is freshest, and repoll the "older" node. Of
course if they keep seeing this status this represents an error (either
an instance continuously "flapping" between nodes, or an instance that is
constantly up on more than one node), which should be reported and acted
upon.

The instance status will be reported by each node for the instances it is
primary for. The ``data`` section of its report will contain a list
of instances, with at least the following fields for each instance:

``name``
  The name of the instance.

``uuid``
  The UUID of the instance (stable on name change).

``admin_state``
  The status of the instance (up/down/offline) as requested by the admin.

``actual_state``
  The actual status of the instance. It can be ``up``, ``down``, or
  ``hung`` if the instance is up but it appears to be completely stuck.

``uptime``
  The uptime of the instance (if it is up, "null" otherwise).

``mtime``
  The timestamp of the last known change to the instance state.

``state_reason``
  The last known reason for state change, described according to the
  following subfields:

  ``text``
    Either a user-provided reason (if any), or the name of the command that
    triggered the state change, as a fallback.

  ``jobID``
    The ID of the job that caused the state change.

  ``source``
    Where the state change was triggered (RAPI, CLI).

``status``
  It represents the status of the instance, and its format is the same as that
  of the ``status`` field of `Status reporting collectors`_.

Each hypervisor should provide its own instance status data collector, possibly
with the addition of more, specific, fields.
The ``category`` field of all of them will be ``instance``.
The ``kind`` field will be ``1``.

Note that as soon as a node knows it's not the primary anymore for an
instance it will stop reporting status for it: this means the instance
will either disappear, if it has been deleted, or appear on another
node, if it's been moved.

The ``code`` of the ``status`` field of the report of the Instance status data
collector will be:

``0``
  if ``status`` is ``0`` for all the instances it is reporting about.

``1``
  otherwise.
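
The rule above can be written down in a couple of lines. The Python
sketch below assumes that "``status`` is ``0``" refers to the ``code``
subfield of each instance's ``status``; the function name and the sample
instances are hypothetical.

```python
def collector_status_code(instances):
    """Return 0 if every instance reports status code 0, and 1 otherwise."""
    return 0 if all(i["status"]["code"] == 0 for i in instances) else 1


# Hypothetical instance entries, reduced to the fields used here.
instances = [
    {"name": "instance1.example.com",
     "status": {"code": 0, "message": None}},
    {"name": "instance2.example.com",
     "status": {"code": 2, "message": "example failure"}},
]
print(collector_status_code(instances))
```

With this reading, a node that is primary for no instances reports
``0``, because the condition holds vacuously for an empty list.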

Storage status
++++++++++++++

The storage status collectors will be a series of data collectors
(drbd, rbd, plain, file) that will gather data about all the storage types
for the current node (this is right now hardcoded to the enabled storage
types, and in the future tied to the enabled storage pools for the nodegroup).

The ``name`` of each of these collectors will reflect what storage type each of
them refers to.

The ``category`` field of these collectors will be ``storage``.

The ``kind`` field will be ``1`` (`Status reporting collectors`_).

The ``data`` section of the report will provide at least the following fields:

``free``
  The amount of free space (in KBytes).

``used``
  The amount of used space (in KBytes).

``total``
  The total visible space (in KBytes).

Each specific storage type might provide more type-specific fields.

In case of error, the ``message`` subfield of the ``status`` field of the
report of the storage status collector will disclose the nature of the error
as type-specific information. Examples of these are "backend pv unavailable"
for lvm storage, "unreachable" for network based storage or "filesystem error"
for filesystem based implementations.

DRBD status
***********

This data collector will run only on nodes where DRBD is actually
present and it will gather information about DRBD devices.

Its ``kind`` in the report will be ``1`` (`Status reporting collectors`_).

Its ``category`` field in the report will contain the value ``storage``.

When executed in verbose mode, the ``data`` section of the report of this
collector will provide the following fields:

``versionInfo``
  Information about the DRBD version number, given by a combination of
  any (but at least one) of the following fields:

  ``version``
    The DRBD driver version.

  ``api``
    The API version number.

  ``proto``
    The protocol version.

  ``srcversion``
    The version of the source files.

  ``gitHash``
    Git hash of the source files.

  ``buildBy``
    Who built the binary, and, optionally, when.

``device``
  A list of structures, each describing a DRBD device (a minor) and containing
  the following fields:

  ``minor``
    The device minor number.

  ``connectionState``
    The state of the connection. If it is "Unconfigured", none of the
    following fields are present.

  ``localRole``
    The role of the local resource.

  ``remoteRole``
    The role of the remote resource.

  ``localState``
    The status of the local disk.

  ``remoteState``
    The status of the remote disk.

  ``replicationProtocol``
    The replication protocol being used.

  ``ioFlags``
    The input/output flags.

  ``perfIndicators``
    The performance indicators. This field will contain the following
    sub-fields:

    ``networkSend``
      KiB of data sent on the network.

    ``networkReceive``
      KiB of data received from the network.

    ``diskWrite``
      KiB of data written on local disk.

    ``diskRead``
      KiB of data read from the local disk.

    ``activityLog``
      Number of updates of the activity log.

    ``bitMap``
      Number of updates to the bitmap area of the metadata.

    ``localCount``
      Number of open requests to the local I/O subsystem.

    ``pending``
      Number of requests sent to the partner but not yet answered.

    ``unacknowledged``
      Number of requests received by the partner but still to be answered.

    ``applicationPending``
      Number of block input/output requests forwarded to DRBD but that have
      not yet been answered.

    ``epochs``
      (Optional) Number of epoch objects. Not provided by all DRBD versions.

    ``writeOrder``
      (Optional) Currently used write ordering method. Not provided by all DRBD
      versions.

    ``outOfSync``
      (Optional) KiB of storage currently out of sync. Not provided by all DRBD
      versions.

  ``syncStatus``
    (Optional) The status of the synchronization of the disk. This is present
    only if the disk is being synchronized, and includes the following fields:

    ``percentage``
      The percentage of synchronized data.

    ``progress``
      How far the synchronization is. Written as "x/y", where x and y are
      integer numbers expressed in the measurement unit stated in
      ``progressUnit``.

    ``progressUnit``
      The measurement unit for the progress indicator.

    ``timeToFinish``
      The expected time before finishing the synchronization.

    ``speed``
      The speed of the synchronization.

    ``want``
      The desired speed of the synchronization.

    ``speedUnit``
      The measurement unit of the ``speed`` and ``want`` values. Expressed
      as "size/time".

  ``instance``
    The name of the Ganeti instance this disk is associated to.
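
Since ``progress`` is a string rather than a pair of numbers, a consumer
has to split it before using it. The Python sketch below shows one way
to do so; the function name and the sample values (including the unit)
are assumptions made for the example. The meaning of x and y is not
interpreted here, as it is not specified above.

```python
def parse_progress(sync_status):
    """Split the "x/y" progress string into integers and keep the unit."""
    x, y = (int(part) for part in sync_status["progress"].split("/"))
    return x, y, sync_status["progressUnit"]


# Hypothetical syncStatus fragment.
sync_status = {"progress": "1024/4096", "progressUnit": "K"}
print(parse_progress(sync_status))
```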


Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons.
This should allow identifying possible problems with the Ganeti system itself:
for example memory leaks, crashes and high resource utilization should be
evident by analyzing this information.

The ``kind`` field will be ``1`` (`Status reporting collectors`_).

Each daemon will have its own data collector, and each of them will have
a ``category`` field valued ``daemon``.

When executed in verbose mode, their data section will include at least:

``memory``
  The amount of used memory.

``size_unit``
  The measurement unit used for the memory.

``uptime``
  The uptime of the daemon.

``CPU usage``
  How much CPU the daemon is using (percentage).

Any other daemon-specific information can be included as well in the ``data``
section.

Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that sometimes is
different than the one the OS sees (for example in Xen the Node OS,
running as Dom0, has access to only part of those resources). In this
section we'll report all information we can in a "non hypervisor
specific" way. Each hypervisor can then add extra specific information
that is not generic enough to be abstracted.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

Each of the hypervisor data collectors will be of ``category``: ``hypervisor``.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system.

The ``category`` field of the report will be ``null``.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

The ``data`` section will include:

``cpu_number``
  The number of available CPUs.

``cpus``
  A list with one element per CPU, showing its average load.

``memory``
  The current view of memory (free, used, cached, etc.)

``filesystem``
  A list with one element per filesystem, showing a summary of the
  total/available space.

``NICs``
  A list with one element per network interface, showing the amount of
  sent/received data, error rate, IP address of the interface, etc.

``versions``
  A map using the name of a component Ganeti interacts with (Linux, drbd,
  hypervisor, etc) as the key and its version number as the value.

Note that we won't go into any hardware specific details (e.g. querying a
node RAID is outside the scope of this, and can be implemented as a
plugin) but we can easily just report the information above, since it's
standard enough across all systems.

Format of the query
-------------------

The queries to the monitoring agent will be HTTP GET requests on port 1815.
The answer will be encoded in JSON format and will depend on the specific
accessed resource.

If a request is sent to a non-existing resource, a 404 error will be returned by
the HTTP server.

The following paragraphs will present the existing resources supported by the
current protocol version, that is version 1.

``/``
+++++
The root resource. It will return the list of the supported protocol version
numbers.

Currently, this will include only version 1.

``/1``
++++++
Not an actual resource per se, it is the root of all the resources of protocol
version 1.

If requested through GET, the null JSON value will be returned.

``/1/full``
+++++++++++
The full report of all the data collectors, as described in the section
`Format of the report`_.

`Status reporting collectors`_ will provide their output in non-verbose format.
The verbose format can be requested by adding the parameter ``verbose=1`` to the
request.

``/[category]/[collector_name]``
++++++++++++++++++++++++++++++++
Returns the report of the collector ``[collector_name]`` that belongs to the
specified ``[category]``.

If a collector does not belong to any category, ``collector`` will be used as
the value for ``[category]``.

`Status reporting collectors`_ will provide their output in non-verbose format.
The verbose format can be requested by adding the parameter ``verbose=1`` to the
request.
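
To make the resource layout concrete, the following Python sketch builds
the URLs a client would request. The helper names and the host name are
assumptions made for the example, and the per-collector path is taken
literally from the heading above. Only the port, the resource paths and
the ``verbose=1`` parameter come from this document.

```python
from urllib.parse import quote, urlencode

AGENT_PORT = 1815  # port stated in this section


def agent_url(host, path, verbose=False):
    """Build the URL of a monitoring agent resource."""
    url = "http://%s:%d%s" % (host, AGENT_PORT, path)
    if verbose:
        # Ask status reporting collectors for their verbose output.
        url += "?" + urlencode({"verbose": 1})
    return url


def collector_path(name, category=None):
    """Path of a single collector; uncategorized ones use "collector"."""
    return "/%s/%s" % (quote(category or "collector"), quote(name))


# Hypothetical host and collector names.
print(agent_url("node1.example.com", "/1/full", verbose=True))
print(agent_url("node1.example.com", collector_path("drbd", "storage")))
```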

Instance disk status propagation
--------------------------------

As with the instance status, Ganeti currently has only partial information
about its instance disks: in particular each node is unaware of the disk to
instance mapping, which exists only on the master.

In this design we plan to fix this by changing all RPCs that create
a backend storage, or that put an already existing one in use, so that they
pass the relevant instance to the node. The node can then export this
information to the status reporting tool.

Until these RPC changes are implemented, we'll use Confd to
fetch this information in the data collectors.

Plugin system
-------------

The monitoring system will be equipped with a plugin system that can
export specific local information through it.

The plugin system is expected to be used by local installations to
export any installation specific information that they want to be
monitored, about either hardware or software on their systems.

The plugin system will be in the form of either scripts or binaries whose output
will be inserted in the report.

Eventually support for other kinds of plugins might be added as well, such as
plain text files which will be inserted into the report, or local unix or
network sockets from which the information has to be read. This should allow
the most flexibility for implementing an efficient system, while being able to
keep it as simple as possible.

Data collectors
---------------

In order to ease testing as well as to make it simple to reuse this
subsystem it will be possible to run just the "data collectors" on each
node without passing through the agent daemon.

If a data collector is run independently, it should print its report on
stdout, according to the format corresponding to a single data collector
report object, as described in the previous paragraphs.
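
A stand-alone collector following this convention could look like the
Python sketch below. The collector name, the ``--verbose`` flag and the
``example_detail`` field are all hypothetical; the design only says that
a command line parameter selects the verbose mode, without naming it.

```python
import argparse
import json
import sys
import time


def collect(verbose):
    """Return one report object; all names and values are placeholders."""
    data = {"status": {"code": 0, "message": None}}
    if verbose:
        # Verbose mode also exports the gathered data.
        data["example_detail"] = "placeholder"
    return {
        "name": "ExampleCollector",
        "version": "1.0",
        "formatVersion": 1,
        "timestamp": time.time_ns(),  # nanoseconds since the Unix epoch
        "category": None,
        "kind": 1,
        "data": data,
    }


def main(argv):
    parser = argparse.ArgumentParser(description="Example data collector")
    parser.add_argument("--verbose", action="store_true",
                        help="also export the gathered data (assumed flag)")
    args = parser.parse_args(argv)
    # A single report object, not an array, is printed on stdout.
    json.dump(collect(args.verbose), sys.stdout)
    sys.stdout.write("\n")
    return 0


main([])
```

A real entry point would pass ``sys.argv[1:]`` to ``main`` instead of
the empty argument list used here.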

Mode of operation
-----------------

In order to be able to report information fast the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which will
be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different rates. These
will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation specific, and while we'll
try to have defaults, it will be configurable. The first two, instead,
can be used adaptively to query a certain resource more or less often.

When run as stand-alone binaries, the data collectors will not use any
caching system, and will just fetch and return the data immediately.
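
The caching behaviour described above can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
class name, the fixed per-collector interval and the injectable clock
are choices made for the example, and the adaptive tuning of the
interval is left out.

```python
import time


class CachedCollector:
    """Keep the last result of a collector and refresh it periodically."""

    def __init__(self, collect, interval, clock=time.monotonic):
        self._collect = collect      # function gathering fresh data
        self._interval = interval    # seconds between two refreshes
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def refresh_if_stale(self):
        """Called by the periodic loop: re-run the collector when due."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._interval:
            self._value = self._collect()
            self._fetched_at = now

    def report(self):
        """Called when answering a query: return the cached value only."""
        return self._value


# Hypothetical collector refreshed at most every 30 seconds.
cached = CachedCollector(lambda: {"example": time.time_ns()}, interval=30)
cached.refresh_if_stale()
print(cached.report())
```

Queries never trigger a collection themselves, which keeps answers fast
at the cost of data that may be up to one interval old.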

Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon. In
the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out it saves resources and doesn't
impact functionality.

The libekg library should be looked at for easily providing metrics in
JSON format.


Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (e.g. for DRBD and instance status)
- initial daemon for exporting data, integrating the existing collectors
- plugin system
- RPC updates for instance status reasons and disk to instance mapping
- cache layer for the daemon
- more data collectors


Future work
===========

As a future step it can be useful to "centralize" all this reporting
data in a single place. This for example can be just the master node, or
all the master candidates. We will evaluate doing this after the first
node-local version has been developed and tested.

Another possible change is replacing the "read-only" RPCs with queries
to the agent system, thus having only one way of collecting information
from the nodes, both for a monitoring system and for Ganeti itself.

One extra feature we may need is a way to query for only sub-parts of
the report (e.g. instances status only). This can be done by passing
arguments to the HTTP GET, which will be defined when we get to this
functionality.

Finally, the autorepair system (see the
:doc:`autorepair system design <design-autorepair>`) can be expanded to
use the monitoring agent system as a source of information to decide
which repairs it can perform.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: