=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. This status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

The choice of JSON is natural, as we already depend on it in Ganeti and
thus don't need to add extra libraries to use it, as opposed to what
would happen for XML or some other markup format.

Location of agent report
------------------------

The report will be available from all nodes, and will cover all
node-local resources. This allows more real-time information to be
available, at the cost of querying all nodes.

Information reported
--------------------

The monitoring agent system will report on the following basic
information:

- Instance status
- Instance disk status
- Status of storage for instances
- Ganeti daemons status, CPU usage, memory footprint
- Hypervisor resources report (memory, CPU, network interfaces)
- Node OS resources report (memory, CPU, network interfaces)
- Information from a plugin system

Instance status
+++++++++++++++

At the moment each node knows which instances are running on it and
which instances it is primary for, but not why an instance might not be
running. On the other hand we don't want to distribute full instance
"admin" status information to all nodes, because of the performance
impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used for
  example to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti
  will just use the information it has about the source of the request:
  for example a cli shutdown operation will have "cli:shutdown" as a
  reason, and a cli failover operation will have "cli:failover".
  Operations coming from the remote API will use "rapi" instead of
  "cli" (see the sketch after this list). Of course setting a real
  site-specific reason is still preferred.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on are passed
  to them. They will then export the new expected instance status,
  together with the associated reason and object version, to the
  status report system, which will then export them itself.
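
When no explicit reason is passed, the fallback reason can be derived
mechanically from the request source and the operation name. A minimal
sketch of that derivation follows; the function and parameter names are
hypothetical, since the opcode-level field is yet to be defined::

  def GetDefaultReason(source, operation):
    """Build a fallback reason string such as "cli:shutdown".

    @param source: origin of the request, e.g. "cli" or "rapi"
    @param operation: the operation name, e.g. "shutdown" or "failover"

    """
    # Illustrative only: the real derivation happens where the opcode
    # is created, once the "reason" field exists
    return "%s:%s" % (source, operation)

  # A shutdown submitted via the command line without an explicit reason:
  assert GetDefaultReason("cli", "shutdown") == "cli:shutdown"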

Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the object version to
understand the freshness of their data even in the absence of atomic
cross-node reporting: for example if they see an instance "up" on a
node after seeing it running on a previous one, they can compare these
values to understand which data is freshest, and repoll the "older"
node. Of course if this situation persists it represents an error
(either an instance continuously "flapping" between nodes, or an
instance constantly up on more than one), which should be reported and
acted upon.

Each node will report the status of the instances it is primary for,
containing at least the fields below (an illustrative entry is sketched
after this list):

- The instance name
- The instance UUID (stable on name change)
- The instance running status (up or down)
- The uptime, as detected by the hypervisor
- The timestamp of last known change
- The timestamp of when the status was last checked (see caching, below)
- The last known reason for change, if any
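
For illustration, one such per-instance entry could be serialized as
follows. All field names and values here are placeholders, pending the
"Format of the report" section::

  import json

  # Hypothetical instance status entry; field names are illustrative
  instance_status = {
    "name": "instance1.example.com",
    "uuid": "a37b1e7c-0000-0000-0000-000000000000",  # stable on rename
    "status": "up",                       # or "down"
    "uptime": 86542,                      # seconds, from the hypervisor
    "mtime": 1366725132.4,                # last known change
    "check_time": 1366725190.1,           # when last checked (see caching)
    "reason": "cli:startup",              # last known reason, if any
  }

  print(json.dumps(instance_status))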

More information about all the fields and their types will be available
in the "Format of the report" section.

Note that as soon as a node knows it's no longer the primary for an
instance it will stop reporting status for it: this means the instance
will either disappear, if it has been deleted, or appear on another
node, if it's been moved.

Instance disk status
++++++++++++++++++++

As with the instance status, Ganeti currently has only partial
information about its instance disks: in particular each node is
unaware of the disk-to-instance mapping, which exists only on the
master.

For this design doc we plan to fix this by changing all RPCs that
create backend storage or that put an already existing one in use, so
that the relevant instance is passed to the node. The node can then
export these data to the status reporting tool.

While we haven't implemented these RPC changes yet, we'll use confd to
fetch this information in the data collector.

Since Ganeti supports many types of disks for instances (drbd, rbd,
plain, file) we will export both a "generic" status, which will work
for any type of disk and will be very opaque (at minimum just a
"healthy" or "error" state, plus perhaps some human-readable comment),
and a "per-type" status, which will explain more about the internal
details but will not be compatible between different storage types (and
will for example export the drbd connection status, sync progress, and
so on).
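
Here is a sketch of how the two levels could look for a drbd-backed
disk; the key names are illustrative, not a committed format::

  # Hypothetical per-disk entry with a generic and a per-type section
  disk_status = {
    "generic": {
      "status": "healthy",            # "healthy" or "error"
      "comment": "",                  # optional human-readable detail
    },
    "drbd": {                         # per-type section, drbd example
      "connection": "Connected",
      "disk_state": "UpToDate/UpToDate",
      "sync_percent": None,           # only set while resyncing
    },
  }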

Status of storage for instances
+++++++++++++++++++++++++++++++

The node will also report on all storage types it knows about for the
current node (this is right now hardcoded to the enabled storage types,
and in the future will be tied to the enabled storage pools for the
nodegroup). For this kind of information too we will report both a
generic health status (healthy or error) for each type of storage, and
some more generic statistics (free space, used space, total visible
space). In addition type-specific information can be exported: for
example, in case of error, the nature of the error can be disclosed as
type-specific information. Examples of these are "backend pv
unavailable" for lvm storage, "unreachable" for network-based storage
or "filesystem error" for filesystem-based implementations.

Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons: this
includes memory usage, uptime and CPU usage. This should allow
identifying possible problems with the Ganeti system itself: for
example memory leaks, crashes and high resource utilization should be
evident from analyzing this information.

Ganeti daemons will also be able to export extra internal information
to the status reporting, through the plugin system (see below).
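
On Linux the basic per-daemon figures can be read from ``/proc``. A
minimal sketch, collecting only the memory footprint (CPU usage would
additionally require sampling over time)::

  import os

  def GetDaemonMemory(pid):
    """Return the memory footprint of a daemon process, in bytes."""
    with open("/proc/%d/statm" % pid) as f:
      # First two fields: total program size and resident set size,
      # both expressed in pages
      size, resident = [int(x) for x in f.read().split()[:2]]
    page_size = os.sysconf("SC_PAGE_SIZE")
    return {"vsize": size * page_size, "rss": resident * page_size}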

Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that is sometimes
different from the one the OS sees (for example in Xen the Node OS,
running as Dom0, has access to only part of those resources). In this
section we'll report all information we can in a "non hypervisor
specific" way. Each hypervisor can then add extra specific information
that is not generic enough to be abstracted.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system. This includes number and
status of CPUs, memory, filesystems and network interfaces as well as
the version of components Ganeti interacts with (Linux, drbd,
hypervisor, etc).

Note that we won't go into any hardware-specific details (e.g. querying
a node's RAID is outside the scope of this, and can be implemented as a
plugin) but we can easily just report the information above, since it's
standard enough across all systems.
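
Most of this is available from standard Linux interfaces, as in the
following sketch (the report keys are again illustrative)::

  import os

  def GetNodeOsReport():
    """Gather a few of the node OS figures mentioned above."""
    with open("/proc/meminfo") as f:
      meminfo = dict(line.split(":", 1) for line in f if ":" in line)
    with open("/proc/version") as f:
      kernel = f.read().strip()
    return {
      "cpu_count": os.cpu_count(),
      "memory_total": meminfo["MemTotal"].strip(),  # e.g. "16384 kB"
      "kernel_version": kernel,
    }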

Plugin system
+++++++++++++

The monitoring system will be equipped with a plugin system through
which specific local information can be exported. Plugins will take the
form of either scripts whose output will be inserted in the report,
plain text files whose contents will be inserted into the report, or
local Unix or network sockets from which the information has to be
read. This should allow the most flexibility for implementing an
efficient system, while keeping it as simple as possible.

The plugin system is expected to be used by local installations to
export any installation-specific information that they want to be
monitored, about either hardware or software on their systems.
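
For the script flavour, the agent could simply run every executable in
a plugin directory and embed each output in the report under the
plugin's name. A minimal sketch, with an assumed directory location::

  import os
  import subprocess

  PLUGIN_DIR = "/etc/ganeti/monitoring-plugins"  # assumed location

  def RunPlugins():
    """Run all executable plugins and collect their raw output."""
    report = {}
    for name in sorted(os.listdir(PLUGIN_DIR)):
      path = os.path.join(PLUGIN_DIR, name)
      if os.access(path, os.X_OK):
        output = subprocess.check_output([path])
        report[name] = output.decode("utf-8")
    return report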

Format of the query
-------------------

The query will be an HTTP GET request on a particular port. At the
beginning it will only be possible to query the full status report.
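
A client would then fetch and decode the report as in the following
sketch; the port number and the empty URL path are placeholders, as
neither has been fixed yet::

  import json
  import urllib.request

  MOND_PORT = 1815  # hypothetical port, to be decided

  def GetNodeReport(node):
    """Fetch and decode the full status report of one node."""
    url = "http://%s:%d/" % (node, MOND_PORT)
    with urllib.request.urlopen(url) as resp:
      return json.loads(resp.read().decode("utf-8"))

  report = GetNodeReport("node1.example.com")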

Format of the report
--------------------

TBD (this part needs to be completed with the format of the JSON and
the types of the various variables exported, as they get evaluated and
decided)

Data collectors
---------------

In order to ease testing, as well as to make it simple to reuse this
subsystem, it will be possible to run just the "data collectors" on
each node without passing through the agent daemon. Each data collector
will report specific data about its subsystem and will be documented
separately.
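
A standalone collector could then be as small as the following sketch:
a callable producing a JSON fragment, printed directly when invoked
from the command line (the collector name and its content are
hypothetical)::

  import json

  def CollectDrbdStatus():
    """Example collector body; a real one would parse /proc/drbd."""
    return {"minor_0": {"connection": "Connected"}}

  if __name__ == "__main__":
    print(json.dumps(CollectDrbdStatus()))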

Mode of operation
-----------------

In order to be able to report information quickly, the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which
will be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different speeds.
These will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation-specific, and while we'll
try to have defaults, it will be configurable. The first two we can
instead use adaptively, to query a certain resource faster or slower.
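
The caching behaviour could then look like the following sketch, where
each collector carries its own polling interval and queries are
answered from the cached values; class and parameter names are
illustrative::

  import time

  class StatusCache:
    def __init__(self, collectors):
      """collectors: dict mapping name -> (callback, interval_seconds)."""
      self._collectors = collectors
      self._cache = {}  # name -> (timestamp, value)

    def Get(self, name):
      """Return the cached value, refreshing it if it has become stale."""
      callback, interval = self._collectors[name]
      now = time.time()
      cached = self._cache.get(name)
      if cached is None or now - cached[0] > interval:
        self._cache[name] = (now, callback())
      return self._cache[name][1]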

Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon.
In the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out it saves resources and
doesn't impact functionality.

The libekg library should be evaluated for easily providing metrics in
JSON format.

Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (e.g. for drbd and instance status)
- initial daemon for exporting data
- RPC updates for instance status reasons and disk-to-instance mapping
- more data collectors
- cache layer for the daemon (if needed)

Future work
===========

As a future step it can be useful to "centralize" all this reporting
data in a single place. This could for example be just the master node,
or all the master candidates. We will evaluate doing this after the
first node-local version has been developed and tested.

Another possible change is replacing the "read-only" RPCs with queries
to the agent system, thus having only one way of collecting information
from the nodes, both for a monitoring system and for Ganeti itself.

One extra feature we may need is a way to query for only sub-parts of
the report (e.g. instance status only). This can be done by passing
arguments to the HTTP GET, which will be defined when we get to this
functionality.

Finally, the :doc:`autorepair system <design-autorepair>` can be
expanded to use the monitoring agent system as a source of information
to decide which repairs it can perform.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: