=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. This status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

The choice of JSON is natural, as Ganeti already depends on it and thus
needs no extra libraries to use it, as opposed to what would be required
for XML or some other markup format.

Location of agent report
------------------------

The report will be available from all nodes, and will cover all
node-local resources. This allows more real-time information to be
available, at the cost of querying all nodes.

Information reported
--------------------

The monitoring agent system will report on the following basic
information:

- Instance status
- Instance disk status
- Status of storage for instances
- Ganeti daemons status, CPU usage, memory footprint
- Hypervisor resources report (memory, CPU, network interfaces)
- Node OS resources report (memory, CPU, network interfaces)
- Information from a plugin system

Instance status
+++++++++++++++

At the moment each node knows which instances are running on it and
which instances it is primary for, but not why an instance might not be
running. On the other hand we don't want to distribute full instance
"admin" status information to all nodes, because of the performance
impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used for
  example to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti will
  just use the information it has about the source of the request: for
  example a cli shutdown operation will have "cli:shutdown" as a reason,
  and a cli failover operation will have "cli:failover". Operations
  coming from the remote API will use "rapi" instead of "cli". Of course
  setting a real site-specific reason is still preferred.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on are passed
  to them. They will then export the new expected instance status,
  together with the associated reason and object version, to the status
  report system, which will then export them.

Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the object version to
understand the freshness of their data even in the absence of atomic
cross-node reporting: for example, if they see an instance "up" on a
node after seeing it running on a previous one, they can compare these
values to understand which data is freshest, and repoll the "older"
node. Of course, if this status persists, it represents an error (either
an instance continuously "flapping" between nodes, or an instance
constantly up on more than one node), which should be reported and acted
upon.
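
As a sketch of the freshness comparison above, assuming hypothetical
field names (the actual report format is defined only later, in "Format
of the report", and is still TBD):

```python
# Sketch of how a monitoring system could pick the freshest of two
# node-local reports about the same instance. The "config_version" and
# "node" field names are assumptions, not the final report format.
def fresher_report(report_a, report_b):
    """Return the report carrying the higher config object version.

    Equal versions on two different nodes mean we cannot tell which
    data is newer: signal that with None so the caller can repoll.
    """
    if report_a["config_version"] > report_b["config_version"]:
        return report_a
    if report_b["config_version"] > report_a["config_version"]:
        return report_b
    return None  # tie: repoll both nodes (and alert if it persists)

seen_first = {"node": "node1", "status": "up", "config_version": 41}
seen_later = {"node": "node2", "status": "up", "config_version": 42}
assert fresher_report(seen_first, seen_later) is seen_later
```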

The instance status will be reported by each node, for the instances it
is primary for, and will contain at least:

- The instance name
- The instance UUID (stable across name changes)
- The instance running status (up or down)
- The timestamp of the last known change
- The timestamp of when the status was last checked (see caching, below)
- The last known reason for change, if any

More information about all the fields and their types will be available
in the "Format of the report" section.
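
Purely as an illustration (the exact JSON is specified in "Format of
the report", which is still TBD), a single entry built from the fields
above might look like:

```python
import json

# Hypothetical per-instance entry; every field name is an assumption
# mirroring the list above, and the values are made up.
entry = {
    "name": "instance1.example.com",
    "uuid": "6d2b5312-0000-0000-0000-000000000000",  # stable on rename
    "status": "up",                  # "up" or "down"
    "mtime": 1368622200.0,           # timestamp of last known change
    "ctime": 1368622260.0,           # last time the status was checked
    "reason": "cli:shutdown",        # last known reason, if any
}
print(json.dumps(entry, indent=2))
```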

Note that as soon as a node knows it is no longer the primary for an
instance it will stop reporting status for it: this means the instance
will either disappear, if it has been deleted, or appear on another
node, if it has been moved.

Instance Disk status
++++++++++++++++++++

As with the instance status, Ganeti currently has only partial
information about its instance disks: in particular each node is unaware
of the disk-to-instance mapping, which exists only on the master.

For this design doc we plan to fix this by changing all RPCs that create
a backend storage or that put an already existing one in use, passing
the relevant instance to the node. The node can then export these to the
status reporting tool.

While we haven't implemented these RPC changes yet, we'll use confd to
fetch this information in the data collector.

Since Ganeti supports many types of disks for instances (drbd, rbd,
plain, file) we will export both a "generic" status, which will work for
any type of disk and will be very opaque (at minimum just a "healthy" or
"error" state, plus perhaps a human-readable comment), and a "per-type"
status, which will explain more about the internal details but will not
be compatible between different storage types (and will for example
export the drbd connection status, sync progress, and so on).
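
To make the generic/per-type split concrete, here is a hypothetical
entry for a drbd-backed disk; the field names are assumptions (the
report format is still TBD), and the drbd values merely echo the kind of
state drbd exposes:

```python
import json

# Hypothetical disk status entry: an opaque generic part usable for any
# disk type, plus a drbd-specific part that other backends won't share.
disk_status = {
    "generic": {
        "state": "healthy",         # at minimum "healthy" or "error"
        "comment": "all mirrors in sync",
    },
    "drbd": {
        "connection": "Connected",  # drbd connection state
        "disk_state": "UpToDate",   # local disk state
        "sync_percent": None,       # only set while resynchronizing
    },
}
print(json.dumps(disk_status, indent=2))
```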

Status of storage for instances
+++++++++++++++++++++++++++++++

The node will also report on all storage types it knows about for the
current node (right now this is hardcoded to the enabled storage types,
and in the future will be tied to the enabled storage pools for the
nodegroup). For this kind of information too we will report both a
generic health status (healthy or error) for each type of storage and
some generic statistics (free space, used space, total visible space).
In addition, type-specific information can be exported: for example, in
case of error, the nature of the error can be disclosed as type-specific
information. Examples are "backend pv unavailable" for lvm storage,
"unreachable" for network-based storage or "filesystem error" for
filesystem-based implementations.
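
A sketch of building such an entry, with made-up field names; the
generic statistics are kept consistent by construction:

```python
# Illustrative helper assembling one storage report entry; "type",
# "detail" and the other keys are assumed names, not the final format.
def storage_entry(kind, total_bytes, free_bytes, error=None):
    return {
        "type": kind,                      # e.g. "lvm-vg" or "file"
        "state": "error" if error else "healthy",
        "detail": error,                   # type-specific error text
        "total": total_bytes,              # total visible space
        "free": free_bytes,
        "used": total_bytes - free_bytes,
    }

healthy = storage_entry("lvm-vg", total_bytes=500 * 1024**3,
                        free_bytes=120 * 1024**3)
failed = storage_entry("lvm-vg", 500 * 1024**3, 0,
                       error="backend pv unavailable")
```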

Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons: this
includes memory usage, uptime and CPU usage. This should allow
identifying possible problems with the Ganeti system itself: for example
memory leaks, crashes and high resource utilization should be evident
when analyzing this information.

Ganeti daemons will also be able to export extra internal information to
the status reporting system, through the plugin system (see below).

Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that sometimes differs
from the one the OS sees (for example in Xen the node OS, running as
Dom0, has access to only part of those resources). In this section we'll
report all the information we can in a "non hypervisor specific" way.
Each hypervisor can then add extra specific information that is not
generic enough to be abstracted.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system. This includes number and
status of CPUs, memory, filesystems and network interfaces, as well as
the versions of the components Ganeti interacts with (Linux, drbd, the
hypervisor, etc.).

Note that we won't go into any hardware-specific details (e.g. querying
a node's RAID is outside the scope of this, and can be implemented as a
plugin), but we can easily report the information above, since it's
standard enough across all systems.

Plugin system
+++++++++++++

The monitoring system will be equipped with a plugin system through
which specific local information can be exported. The plugins will take
the form of either scripts whose output will be inserted in the report,
plain text files which will be inserted into the report, or local unix
or network sockets from which the information has to be read. This
should allow the most flexibility for implementing an efficient system,
while keeping it as simple as possible.

The plugin system is expected to be used by local installations to
export any installation-specific information that they want monitored,
about either hardware or software on their systems.
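
As a sketch of the simplest flavour, the plain-text-file one, a
collector could insert every file found in a drop-in directory into the
report, keyed by file name (the directory layout here is a made-up
example, not a decided interface):

```python
import os

# Sketch of the plain-text-file plugin flavour: each regular file in a
# drop-in directory becomes one report entry. The directory path used
# by the real agent is not decided in this document.
def collect_file_plugins(plugin_dir):
    report = {}
    if not os.path.isdir(plugin_dir):
        return report
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        if os.path.isfile(path):
            with open(path) as plugin_file:
                report[name] = plugin_file.read()
    return report
```

Script plugins would follow the same pattern, running the executable
and capturing its standard output instead of reading file contents.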


Format of the query
-------------------

The query will be an HTTP GET request on a particular port. At the
beginning it will only be possible to query the full status report.
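
The exchange can be sketched end to end with a stand-in server; the
port is picked at random here and the report contents are placeholders,
since neither is fixed by this document:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Minimal stand-in for the agent daemon: serves the full status report
# on every GET. The report contents are illustrative only.
REPORT = {"instances": [], "storage": []}

class AgentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(REPORT).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass

server = HTTPServer(("127.0.0.1", 0), AgentHandler)  # any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# A monitoring system just issues a plain HTTP GET for the full report:
url = "http://127.0.0.1:%d/" % server.server_port
report = json.loads(urlopen(url).read().decode("utf-8"))
server.shutdown()
```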


Format of the report
--------------------

TBD (this part needs to be completed with the format of the JSON and
the types of the various variables exported, as they get evaluated and
decided).

Data collectors
---------------

In order to ease testing, as well as to make it simple to reuse this
subsystem, it will be possible to run just the "data collectors" on
each node without going through the agent daemon. Each data collector
will report specific data about its subsystem and will be documented
separately.
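
A standalone collector would then be a small program that gathers its
data and prints a JSON document, for example (names and fields are
assumptions; the real collectors are documented separately):

```python
import json
import sys
import time

# Sketch of a standalone "instance status" collector runnable without
# the agent daemon; the field names are illustrative only.
def collect_instance_status(instances):
    """Build a node-local status list from a name -> state mapping."""
    checked_at = time.time()
    return [
        {"name": name, "state": state, "ctime": checked_at}
        for name, state in sorted(instances.items())
    ]

if __name__ == "__main__":
    # Running the collector directly just dumps its data to stdout.
    sample = {"web1": "up", "db1": "down"}
    json.dump(collect_instance_status(sample), sys.stdout, indent=2)
```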


Mode of operation
-----------------

In order to be able to report information quickly, the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which
will be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different speeds.
These will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation specific, and while we'll
try to provide sensible defaults, it will be configurable. The first
two we can instead use adaptively, to query a certain resource faster
or slower depending on them.
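
The cache plus per-part refresh periods can be sketched as follows
(class and parameter names are assumptions; the adaptive part would
adjust the intervals at runtime):

```python
import time

# Minimal sketch of the caching layer, assuming each collector is a
# callable paired with a refresh interval in seconds; this is not the
# actual daemon's API.
class StatusCache:
    def __init__(self, collectors):
        # collectors: {name: (collect_fn, interval_seconds)}
        self._collectors = collectors
        self._cache = {}  # name -> (timestamp, data)

    def get_report(self, now=None):
        """Serve cached data, refreshing entries past their interval."""
        now = time.time() if now is None else now
        for name, (collect, interval) in self._collectors.items():
            ts, _ = self._cache.get(name, (None, None))
            if ts is None or now - ts >= interval:
                self._cache[name] = (now, collect())
        return {name: data for name, (_, data) in self._cache.items()}
```

A query then only pays the collection cost for the parts whose data has
expired; everything else is answered straight from the cache.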


Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon.
In the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out that this saves resources and
doesn't impact functionality.

The libekg library should be looked at for easily providing metrics in
JSON format.


Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (e.g. for drbd and instance status)
- initial daemon for exporting data
- RPC updates for instance status reasons and disk-to-instance mapping
- more data collectors
- cache layer for the daemon (if needed)
264 |
|
265 |
Future work |
266 |
=========== |
267 |
|
268 |
As a future step it can be useful to "centralize" all this reporting |
269 |
data on a single place. This for example can be just the master node, or |
270 |
all the master candidates. We will evaluate doing this after the first |
271 |
node-local version has been developed and tested. |
272 |
|
273 |
Another possible change is replacing the "read-only" RPCs with queries |
274 |
to the agent system, thus having only one way of collecting information |
275 |
from the nodes from a monitoring system and for Ganeti itself. |
276 |
|
277 |
One extra feature we may need is a way to query for only sub-parts of |
278 |
the report (eg. instances status only). This can be done by passing |
279 |
arguments to the HTTP GET, which will be defined when we get to this |
280 |
funtionality. |
281 |
|
282 |
Finally the :doc:`autorepair system design <design-autorepair>`. system |
283 |
(see its design) can be expanded to use the monitoring agent system as a |
284 |
source of information to decide which repairs it can perform. |
285 |
|
286 |
.. vim: set textwidth=72 : |
287 |
.. Local Variables: |
288 |
.. mode: rst |
289 |
.. fill-column: 72 |
290 |
.. End: |