=======================
Ganeti monitoring agent
=======================

.. contents:: :depth: 4

This is a design document detailing the implementation of a Ganeti
monitoring agent report system that can be queried by a monitoring
system to calculate health information for a Ganeti cluster.

Current state and shortcomings
==============================

There is currently no monitoring support in Ganeti. While we don't want
to build something like Nagios or Pacemaker as part of Ganeti, it would
be useful if such tools could easily extract information from a Ganeti
machine in order to take actions (example actions include logging an
outage for future reporting or alerting a person or system about it).

Proposed changes
================

Each Ganeti node should export a status page that can be queried by a
monitoring system. This status page will be exported on a network port
and will be encoded in JSON (simple text) over HTTP.

The choice of JSON is obvious as we already depend on it in Ganeti and
thus we don't need to add extra libraries to use it, as opposed to what
would happen for XML or some other markup format.

Location of agent report
------------------------

The report will be available from all nodes, and will cover all
node-local resources. This allows more real-time information to be
available, at the cost of querying all nodes.

Information reported
--------------------

The monitoring agent system will report on the following basic information:

- Instance status
- Instance disk status
- Status of storage for instances
- Ganeti daemons status, CPU usage, memory footprint
- Hypervisor resources report (memory, CPU, network interfaces)
- Node OS resources report (memory, CPU, network interfaces)
- Information from a plugin system

Format of the report
--------------------

The report will be in JSON format, and it will consist of an array of
report objects. Each report object will be produced by a specific data
collector and includes some mandatory fields, to be provided by all the
data collectors:

``name``
  The name of the data collector that produced this part of the report.
  It is expected to be unique within a report.

``version``
  The version of the data collector that produced this part of the
  report. Built-in data collectors (as opposed to those implemented as
  plugins) should have ``"B"`` as the version number.

``format_version``
  The format of what is represented in the ``data`` field of each data
  collector might change over time. Every time this happens, the
  ``format_version`` should be changed, so that whoever reads the report
  knows what format to expect and how to correctly interpret it.

``timestamp``
  The time when the reported data were gathered. It has to be expressed
  in nanoseconds since the Unix epoch (00:00:00 UTC, January 1, 1970). If
  not enough precision is available (or needed), it can be padded with
  zeroes. If a report object needs multiple timestamps, it can add more
  and/or override this one inside its own ``data`` section.

``category``
  A collector can belong to a given category of collectors (e.g. storage
  collectors, daemon collectors). This means that it will have to provide
  a minimum set of prescribed fields, as documented for each category.
  This field will contain the name of the category the collector belongs
  to, if any, or the ``null`` value otherwise.

``kind``
  Two kinds of collectors are possible:
  `Performance reporting collectors`_ and `Status reporting collectors`_.
  The respective sections describe them and the value of this field.

``data``
  This field contains all the data generated by the specific data
  collector, in its own independently defined format. The monitoring
  agent can check this syntactically (according to the JSON
  specification) but not semantically.
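
As an illustration of the mandatory fields, the following is a minimal
Python sketch that assembles one report object. The helpers
``timestamp_ns`` and ``make_report`` are hypothetical names used only for
this example and are not part of Ganeti; the sketch mainly shows how a
time known only to the second is padded with zeroes to the required
nanosecond resolution.

.. code-block:: python

  import json
  import time

  NS_PER_SECOND = 10 ** 9


  def timestamp_ns(seconds=None):
      """Return a timestamp in nanoseconds since the Unix epoch.

      Without an argument the current time is used at full resolution;
      a value known only to the second is padded with zeroes.
      """
      if seconds is None:
          return time.time_ns()
      return int(seconds) * NS_PER_SECOND


  def make_report(name, version, format_version, kind, data,
                  category=None, timestamp=None):
      """Assemble a report object carrying all the mandatory fields."""
      return {
          "name": name,
          "version": version,
          "format_version": format_version,
          "timestamp": timestamp_ns() if timestamp is None else timestamp,
          "category": category,  # serialized as JSON null when None
          "kind": kind,
          "data": data,
      }


  report = make_report("TheCollectorIdentifier", "1.2", 1, 0,
                       {"plugin_specific_data": "go_here"},
                       timestamp=timestamp_ns(1351607182))
  print(json.dumps([report], indent=2))

Padding a seconds-precision value this way yields the same
``1351607182000000000`` used in the first object of the example report.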

Here follows a minimal example of a report::

  [
    {
      "name" : "TheCollectorIdentifier",
      "version" : "1.2",
      "format_version" : 1,
      "timestamp" : 1351607182000000000,
      "category" : null,
      "kind" : 0,
      "data" : { "plugin_specific_data" : "go_here" }
    },
    {
      "name" : "AnotherDataCollector",
      "version" : "B",
      "format_version" : 7,
      "timestamp" : 1351609526123854000,
      "category" : "storage",
      "kind" : 1,
      "data" : { "status" : { "code" : 1,
                              "message" : "Error on disk 2"
                            },
                 "plugin_specific" : "data",
                 "some_late_data" : { "timestamp" : 1351609526123942720,
                                      ...
                                    }
               }
    }
  ]
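
A consumer of the report can check the structure shared by all
collectors before looking at any collector-specific ``data``. The Python
sketch below is one possible way to do so; ``validate_report`` is a
hypothetical helper, not a Ganeti API. Note that the example above is not
literally valid JSON, since ``...`` stands for omitted collector-specific
content.

.. code-block:: python

  import json

  MANDATORY_FIELDS = ("name", "version", "format_version", "timestamp",
                      "category", "kind", "data")
  KNOWN_KINDS = (0, 1)  # performance reporting, status reporting


  def validate_report(text):
      """Check the structure common to all collectors; return problems.

      The content of each "data" section is collector-specific, so it is
      only required to be present, not interpreted.
      """
      report = json.loads(text)
      if not isinstance(report, list):
          return ["report is not a JSON array"]
      problems = []
      seen_names = set()
      for index, obj in enumerate(report):
          if not isinstance(obj, dict):
              problems.append("item %d is not a JSON object" % index)
              continue
          missing = [f for f in MANDATORY_FIELDS if f not in obj]
          if missing:
              problems.append("item %d lacks: %s"
                              % (index, ", ".join(missing)))
              continue
          name = obj["name"]
          if not isinstance(name, str):
              problems.append("item %d has a non-string name" % index)
              continue
          if name in seen_names:
              problems.append("duplicate collector name %r" % name)
          seen_names.add(name)
          if obj["kind"] not in KNOWN_KINDS:
              problems.append("%s: unknown kind %r" % (name, obj["kind"]))
          if type(obj["timestamp"]) is not int:  # rejects floats and bools
              problems.append("%s: timestamp is not an integer" % name)
      return problems

An empty list means every report object carries the mandatory fields with
plausible types; semantic checks remain the job of each collector's
consumer.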

Performance reporting collectors
++++++++++++++++++++++++++++++++

These collectors only provide data about some component of the system,
without giving any interpretation of their meaning.

The value of the ``kind`` field of the report will be ``0``.

Status reporting collectors
+++++++++++++++++++++++++++

These collectors will provide information about the status of some
component of Ganeti, or managed by Ganeti.

The value of their ``kind`` field will be ``1``.

The rationale behind this kind of collector is that there are some
situations where exporting data about the underlying subsystems would
expose potential issues. But if Ganeti itself is able (and going) to fix
the problem, conflicts might arise between Ganeti and something/somebody
else trying to fix the same problem.
Also, some external monitoring systems might not be aware of the internals
of a particular subsystem (e.g. DRBD) and might only exploit the
high-level response of its data collector, alerting an administrator if
anything is wrong.
Still, completely hiding the underlying data is not a good idea, as they
might still be of use in some cases. So status reporting collectors will
provide two output modes: one just exporting high-level information about
the status, and one also exporting all the data they gathered.
The default output mode will be the status-only one. The verbose output
mode, providing all the data, can be selected through a command line
parameter (for stand-alone data collectors) or through the HTTP request to
the monitoring agent (when collectors are executed as part of it).
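
The two output modes can be sketched in Python as a single function that
trims the gathered data down to its ``status`` entry unless verbose
output was requested. ``render_data`` is a hypothetical name, and how the
``verbose`` flag is obtained (command line parameter or HTTP request) is
not modeled here.

.. code-block:: python

  def render_data(full_data, verbose=False):
      """Return the "data" section for the requested output mode.

      full_data holds everything the collector gathered, including the
      mandatory "status" entry. The default (status-only) mode keeps just
      "status"; verbose mode returns a copy of everything.
      """
      if "status" not in full_data:
          raise ValueError("a status reporting collector must provide "
                           "'status'")
      if verbose:
          return dict(full_data)
      return {"status": full_data["status"]}


  gathered = {"status": {"code": 0, "message": ""},
              "free": 1048576, "used": 524288, "total": 1572864}
  print(render_data(gathered))                # status only
  print(render_data(gathered, verbose=True))  # everything

Keeping the status-only view as the default means a monitoring system
unaware of the subsystem internals never has to parse them.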

When exporting just the status, each status reporting collector will
provide, in its ``data`` section, at least the following field:

``status``
  Summarizes the status of the component being monitored and consists of
  two subfields:

  ``code``
    A numeric value, encoded so that a bitset can be used to easily
    distinguish which states are currently present in the whole cluster.
    If the bitwise OR of the ``code`` values of all the ``status`` fields
    is ``0``, the cluster is completely healthy.
    The status codes are as follows:

    ``0``
      The collector can determine that everything is working as
      intended.

    ``1``
      Something is temporarily wrong but it is being automatically fixed
      by Ganeti.
      There is no need for external intervention.

    ``2``
      The collector can determine that something is wrong and Ganeti has
      no way to fix it autonomously. External intervention is required.

    ``4``
      The collector has failed to understand whether the status is good
      or bad. Further analysis is required. Interpret this status as a
      potentially dangerous situation.

  ``message``
    A message to better explain the reason for the status.
    The exact format of the message string is data collector dependent.

    The field is mandatory, but the content can be an empty string if the
    ``code`` is ``0`` (working as intended) or ``1`` (being fixed
    automatically).

    If the status code is ``2``, the message should specify what has gone
    wrong.
    If the status code is ``4``, the message should explain why it was not
    possible to determine a proper status.
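
Because every non-zero code is a distinct power of two, combining the
codes with a bitwise OR keeps track of which states occur anywhere in the
cluster, while losing how many times each occurs. The Python sketch below
illustrates this; the constant and function names are hypothetical, only
the numeric values come from the list above.

.. code-block:: python

  STATUS_OK = 0            # working as intended
  STATUS_AUTOFIXING = 1    # temporarily wrong, being fixed by Ganeti
  STATUS_NEEDS_ADMIN = 2   # external intervention required
  STATUS_UNKNOWN = 4       # status could not be determined

  STATE_NAMES = {
      STATUS_AUTOFIXING: "being fixed automatically",
      STATUS_NEEDS_ADMIN: "external intervention required",
      STATUS_UNKNOWN: "status unknown",
  }


  def combine_codes(codes):
      """Bitwise OR of the status codes collected from many reports."""
      combined = STATUS_OK
      for code in codes:
          combined |= code
      return combined


  def states_present(combined):
      """Names of the non-OK states set in a combined code."""
      return [name for bit, name in sorted(STATE_NAMES.items())
              if combined & bit]


  codes = [0, 1, 0, 4]  # e.g. one code per collector per node
  combined = combine_codes(codes)
  print(combined, states_present(combined))

An empty result from ``states_present`` corresponds to a combined code of
``0``, i.e. a completely healthy cluster.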

The ``data`` section will also contain all the fields describing the
gathered data, according to a collector-specific format.

Instance status
+++++++++++++++

At the moment each node knows which instances are running on it and which
instances it is primary for, but not the reason why an instance might not
be running. On the other hand, we don't want to distribute full instance
"admin" status information to all nodes, because of the performance
impact this would have.

As such we propose that:

- Any operation that can affect instance status will have an optional
  "reason" attached to it (at opcode level). This can be used, for
  example, to distinguish an admin request from a scheduled maintenance
  or an automated tool's work. If this reason is not passed, Ganeti will
  just use the information it has about the source of the request: for
  example, a CLI shutdown operation will have "cli:shutdown" as a reason,
  and a CLI failover operation will have "cli:failover". Operations coming
  from the remote API will use "rapi" instead of "cli". Of course,
  setting a real site-specific reason is still preferred.
- RPCs that affect the instance status will be changed so that the
  "reason" and the version of the config object they ran on are passed to
  them. They will then export the new expected instance status, together
  with the associated reason and object version, to the status report
  system, which will then export those itself.
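
The fallback rule for the reason can be sketched in Python as follows.
``default_reason`` is a hypothetical helper, and restricting the source to
``"cli"`` and ``"rapi"`` is an assumption of the sketch based on the two
sources mentioned above.

.. code-block:: python

  KNOWN_SOURCES = ("cli", "rapi")


  def default_reason(source, operation, user_reason=None):
      """Reason attached to an operation that can change instance status.

      A site-specific reason supplied by the user always wins. Otherwise
      the reason is derived from where the request came from and what it
      does, e.g. "cli:shutdown" or "rapi:failover".
      """
      if user_reason:
          return user_reason
      if source not in KNOWN_SOURCES:
          raise ValueError("unknown request source: %r" % source)
      return "%s:%s" % (source, operation)


  print(default_reason("cli", "shutdown"))
  print(default_reason("rapi", "failover"))
  print(default_reason("cli", "shutdown", "scheduled maintenance"))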

Monitoring and auditing systems can then use the reason to understand
the cause of an instance status, and they can use the timestamp to
understand the freshness of their data even in the absence of atomic
cross-node reporting: for example, if they see an instance "up" on a node
after seeing it running on a previous one, they can compare these values
to understand which data is freshest, and repoll the "older" node. Of
course, if they keep seeing this status, this represents an error (either
an instance continuously "flapping" between nodes, or an instance
constantly up on more than one), which should be reported and acted
upon.
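
The freshness comparison described above amounts to picking the report
with the latest timestamp and re-polling the other nodes. The Python
sketch below assumes the monitoring system keeps, per instance, a mapping
from node name to the timestamp of the report in which the instance was
seen running; ``freshest_report`` and the node names are hypothetical.

.. code-block:: python

  def freshest_report(sightings):
      """Decide which node's data to trust when an instance shows up twice.

      sightings maps node name -> timestamp (ns since the epoch) of the
      report in which the instance was seen running. Returns the node
      with the freshest data and the sorted list of nodes to re-poll.
      """
      if not sightings:
          raise ValueError("no sightings to compare")
      freshest = max(sightings, key=sightings.get)
      stale = sorted(node for node in sightings if node != freshest)
      return freshest, stale


  node, to_repoll = freshest_report({
      "node1.example.com": 1351607182000000000,
      "node2.example.com": 1351609526123854000,
  })
  print(node, to_repoll)

If the instance is still reported on more than one node after re-polling,
the sketch does not resolve anything: that is the error case described
above, to be reported and acted upon.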

Each node will report the instance status for the instances it is
primary for. The ``data`` section of the report will contain a list of
instances, with at least the following fields for each instance:

``name``
  The name of the instance.

``uuid``
  The UUID of the instance (stable on name change).

``admin_state``
  The status of the instance (up/down/offline) as requested by the admin.

``actual_state``
  The actual status of the instance. It can be ``up``, ``down``, or
  ``hung`` if the instance is up but it appears to be completely stuck.

``uptime``
  The uptime of the instance (if it is up, ``null`` otherwise).

``mtime``
  The timestamp of the last known change to the instance state.

``state_reason``
  The last known reason for state change, described according to the
  following subfields:

  ``text``
    Either a user-provided reason (if any), or the name of the command
    that triggered the state change, as a fallback.

  ``jobID``
    The ID of the job that caused the state change.

  ``source``
    Where the state change was triggered (RAPI, CLI).

``status``
  The status of the instance; its format is the same as that of the
  ``status`` field of `Status reporting collectors`_.

Each hypervisor should provide its own instance status data collector,
possibly with additional hypervisor-specific fields.
The ``category`` field of all of them will be ``instance``.
The ``kind`` field will be ``1``.
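
A single element of the instance list could be built as in the Python
sketch below. ``instance_entry`` is a hypothetical helper and the values
are illustrative only; in particular, mapping an instance that is down
while the admin wants it up to status code ``2`` is an assumption of the
example, not something this design prescribes.

.. code-block:: python

  ACTUAL_STATES = ("up", "down", "hung")


  def instance_entry(name, uuid, admin_state, actual_state, uptime, mtime,
                     reason, status):
      """Build one element of the instance list in the "data" section.

      reason is a (text, job_id, source) tuple; status is a dict with the
      "code" and "message" subfields of a status reporting collector.
      """
      if actual_state not in ACTUAL_STATES:
          raise ValueError("unknown actual_state: %r" % actual_state)
      text, job_id, source = reason
      return {
          "name": name,
          "uuid": uuid,
          "admin_state": admin_state,
          "actual_state": actual_state,
          # A hung instance is still up, so only "down" has no uptime.
          "uptime": None if actual_state == "down" else uptime,
          "mtime": mtime,
          "state_reason": {"text": text, "jobID": job_id,
                           "source": source},
          "status": status,
      }


  entry = instance_entry(
      "instance1.example.com", "hypothetical-uuid", "up", "down", 3600,
      1351607182000000000, ("cli:shutdown", 42, "CLI"),
      {"code": 2, "message": "instance is down but should be up"})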

Note that as soon as a node knows it is no longer the primary for an
instance, it will stop reporting status for it: this means the instance
will either disappear, if it has been deleted, or appear on another
node, if it has been moved.

The ``code`` of the ``status`` field of the report of the Instance status
data collector will be:

``0``
  If the status ``code`` is ``0`` for all the instances it is reporting
  about.

``1``
  Otherwise.
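
The rule above reduces to a single ``all()`` over the reported instances,
as in this Python sketch. ``instance_collector_code`` is a hypothetical
name, and treating a node that is primary for no instances as healthy is
an assumption of the sketch.

.. code-block:: python

  def instance_collector_code(instances):
      """Top-level status code of the instance status collector.

      0 if every reported instance has status code 0, 1 otherwise. An
      empty list (no primary instances) is treated as healthy.
      """
      if all(inst["status"]["code"] == 0 for inst in instances):
          return 0
      return 1

As specified, a per-instance code of ``2`` or ``4`` still yields ``1`` at
collector level; consumers that need the distinction should inspect the
``status`` of each instance entry.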

Storage status
++++++++++++++

The storage status collectors will be a series of data collectors
(drbd, rbd, plain, file) that will gather data about all the storage types
for the current node (this is right now hardcoded to the enabled storage
types, and will in the future be tied to the enabled storage pools for the
nodegroup).

The ``name`` of each of these collectors will reflect what storage type
each of them refers to.

The ``category`` field of these collectors will be ``storage``.

The ``kind`` field will be ``1`` (`Status reporting collectors`_).

The ``data`` section of the report will provide at least the following
fields:

``free``
  The amount of free space (in KBytes).

``used``
  The amount of used space (in KBytes).

``total``
  The total visible space (in KBytes).

Each specific storage type might provide more type-specific fields.

In case of error, the ``message`` subfield of the ``status`` field of the
report of the storage status collector will disclose the nature of the
error as type-specific information. Examples of these are "backend pv
unavailable" for LVM storage, "unreachable" for network-based storage or
"filesystem error" for filesystem-based implementations.
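
For file-based storage, the three mandatory fields could be derived from
the filesystem statistics. The Python sketch below uses ``os.statvfs``
(Unix only); the helper names are hypothetical, KBytes are taken to mean
1024 bytes, and reporting status code ``2`` when the path cannot be
examined is an assumption of the example.

.. code-block:: python

  import os


  def file_storage_data(path):
      """free/used/total space, in KiB, of the filesystem holding path.

      "free" counts blocks available to unprivileged users and "used"
      counts blocks that are not free at all, so free + used can be less
      than total when the filesystem reserves blocks for root.
      """
      st = os.statvfs(path)
      return {
          "free": st.f_bavail * st.f_frsize // 1024,
          "used": (st.f_blocks - st.f_bfree) * st.f_frsize // 1024,
          "total": st.f_blocks * st.f_frsize // 1024,
      }


  def file_storage_report_data(path):
      """The "data" section of a hypothetical file storage collector."""
      try:
          data = file_storage_data(path)
      except OSError as err:
          # Assumption: an unreadable storage path needs an administrator.
          return {"status": {"code": 2,
                             "message": "filesystem error: %s"
                                        % err.strerror}}
      data["status"] = {"code": 0, "message": ""}
      return data


  print(file_storage_report_data("/"))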
342 | 3301805f | Michele Tartara | |
343 | 3301805f | Michele Tartara | DRBD status |
344 | 3301805f | Michele Tartara | *********** |
345 | 3301805f | Michele Tartara | |
346 | 3301805f | Michele Tartara | This data collector will run only on nodes where DRBD is actually |
347 | 3301805f | Michele Tartara | present and it will gather information about DRBD devices. |
348 | 3301805f | Michele Tartara | |
349 | 3301805f | Michele Tartara | Its ``kind`` in the report will be ``1`` (`Status reporting collectors`_). |
350 | 3301805f | Michele Tartara | |
351 | 3301805f | Michele Tartara | Its ``category`` field in the report will contain the value ``storage``. |
352 | 3301805f | Michele Tartara | |
353 | 3301805f | Michele Tartara | When executed in verbose mode, the ``data`` section of the report of this |
354 | 3301805f | Michele Tartara | collector will provide the following fields: |
355 | 3301805f | Michele Tartara | |
356 | 3301805f | Michele Tartara | ``versionInfo`` |
357 | 3301805f | Michele Tartara | Information about the DRBD version number, given by a combination of |
358 | 3301805f | Michele Tartara | any (but at least one) of the following fields: |
359 | 3301805f | Michele Tartara | |
360 | 3301805f | Michele Tartara | ``version`` |
361 | 3301805f | Michele Tartara | The DRBD driver version. |
362 | 3301805f | Michele Tartara | |
363 | 3301805f | Michele Tartara | ``api`` |
364 | 3301805f | Michele Tartara | The API version number. |
365 | 3301805f | Michele Tartara | |
366 | 3301805f | Michele Tartara | ``proto`` |
367 | 3301805f | Michele Tartara | The protocol version. |
368 | 3301805f | Michele Tartara | |
369 | 3301805f | Michele Tartara | ``srcversion`` |
370 | 3301805f | Michele Tartara | The version of the source files. |
371 | 3301805f | Michele Tartara | |
372 | 3301805f | Michele Tartara | ``gitHash`` |
373 | 3301805f | Michele Tartara | Git hash of the source files. |
374 | 3301805f | Michele Tartara | |
375 | 3301805f | Michele Tartara | ``buildBy`` |
376 | 3301805f | Michele Tartara | Who built the binary, and, optionally, when. |
377 | 3301805f | Michele Tartara | |
378 | 3301805f | Michele Tartara | ``device`` |
379 | 3301805f | Michele Tartara | A list of structures, each describing a DRBD device (a minor) and containing |
380 | 3301805f | Michele Tartara | the following fields: |
381 | 3301805f | Michele Tartara | |
382 | 3301805f | Michele Tartara | ``minor`` |
383 | 3301805f | Michele Tartara | The device minor number. |
384 | 3301805f | Michele Tartara | |
385 | 3301805f | Michele Tartara | ``connectionState`` |
386 | 3301805f | Michele Tartara | The state of the connection. If it is "Unconfigured", all the following |
387 | 3301805f | Michele Tartara | fields are not present. |
388 | 3301805f | Michele Tartara | |
389 | 3301805f | Michele Tartara | ``localRole`` |
390 | 3301805f | Michele Tartara | The role of the local resource. |
391 | 3301805f | Michele Tartara | |
392 | 3301805f | Michele Tartara | ``remoteRole`` |
393 | 3301805f | Michele Tartara | The role of the remote resource. |
394 | 3301805f | Michele Tartara | |
395 | 3301805f | Michele Tartara | ``localState`` |
396 | 3301805f | Michele Tartara | The status of the local disk. |
397 | 3301805f | Michele Tartara | |
398 | 3301805f | Michele Tartara | ``remoteState`` |
399 | 3301805f | Michele Tartara | The status of the remote disk. |
400 | 3301805f | Michele Tartara | |
401 | 3301805f | Michele Tartara | ``replicationProtocol`` |
402 | 3301805f | Michele Tartara | The replication protocol being used. |
403 | 3301805f | Michele Tartara | |
404 | 3301805f | Michele Tartara | ``ioFlags`` |
405 | 3301805f | Michele Tartara | The input/output flags. |
406 | 3301805f | Michele Tartara | |
407 | 3301805f | Michele Tartara | ``perfIndicators`` |
408 | 3301805f | Michele Tartara | The performance indicators. This field will contain the following |
409 | 3301805f | Michele Tartara | sub-fields: |
410 | 3301805f | Michele Tartara | |
411 | 3301805f | Michele Tartara | ``networkSend`` |
412 | 3301805f | Michele Tartara | KiB of data sent on the network. |
413 | 3301805f | Michele Tartara | |
414 | 3301805f | Michele Tartara | ``networkReceive`` |
415 | 3301805f | Michele Tartara | KiB of data received from the network. |
416 | 3301805f | Michele Tartara | |
417 | 3301805f | Michele Tartara | ``diskWrite`` |
418 | 3301805f | Michele Tartara | KiB of data written on local disk. |
419 | 3301805f | Michele Tartara | |
420 | 3301805f | Michele Tartara | ``diskRead`` |
421 | 3301805f | Michele Tartara | KiB of date read from the local disk. |
422 | 3301805f | Michele Tartara | |
423 | 3301805f | Michele Tartara | ``activityLog`` |
424 | 3301805f | Michele Tartara | Number of updates of the activity log. |
425 | 3301805f | Michele Tartara | |
426 | 3301805f | Michele Tartara | ``bitMap`` |
427 | 3301805f | Michele Tartara | Number of updates to the bitmap area of the metadata. |
428 | 3301805f | Michele Tartara | |
429 | 3301805f | Michele Tartara | ``localCount`` |
      Number of open requests to the local I/O subsystem.

    ``pending``
      Number of requests sent to the partner but not yet answered.

    ``unacknowledged``
      Number of requests received by the partner but still to be answered.

    ``applicationPending``
      Number of block input/output requests forwarded to DRBD that have not
      yet been answered.

    ``epochs``
      (Optional) Number of epoch objects. Not provided by all DRBD versions.

    ``writeOrder``
      (Optional) Currently used write ordering method. Not provided by all
      DRBD versions.

    ``outOfSync``
      (Optional) KiB of storage currently out of sync. Not provided by all
      DRBD versions.

  ``syncStatus``
    (Optional) The status of the synchronization of the disk. This is
    present only if the disk is being synchronized, and includes the
    following fields:

    ``percentage``
      The percentage of synchronized data.

    ``progress``
      How far the synchronization has progressed, written as "x/y", where
      x and y are integers expressed in the measurement unit stated in
      ``progressUnit``.

    ``progressUnit``
      The measurement unit for the progress indicator.

    ``timeToFinish``
      The expected time before the synchronization finishes.

    ``speed``
      The current speed of the synchronization.

    ``want``
      The desired speed of the synchronization.

    ``speedUnit``
      The measurement unit of the ``speed`` and ``want`` values, expressed
      as "size/time".

  ``instance``
    The name of the Ganeti instance this disk is associated with.
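
A monitoring client consuming ``syncStatus`` has to split the two
composite string fields described above. The following Python sketch
assumes that ``syncStatus`` has already been decoded from JSON into a
dictionary; the function names are hypothetical and not part of Ganeti.

.. code-block:: python

   from fractions import Fraction
   from typing import Any, Mapping, Tuple


   def progress_fraction(sync_status: Mapping[str, Any]) -> Fraction:
       """Return the completed share of a resync from its "x/y" progress."""
       done_str, total_str = sync_status["progress"].split("/")
       done, total = int(done_str), int(total_str)
       if total <= 0:
           raise ValueError("invalid progress: %r" % sync_status["progress"])
       # Fraction keeps the ratio exact; convert with float() for display.
       return Fraction(done, total)


   def split_speed_unit(sync_status: Mapping[str, Any]) -> Tuple[str, str]:
       """Split a "size/time" speedUnit into (size unit, time unit)."""
       size_unit, time_unit = sync_status["speedUnit"].split("/")
       return size_unit, time_unit


   def below_target_speed(sync_status: Mapping[str, Any]) -> bool:
       """True if the resync runs slower than the desired "want" speed."""
       # speed and want share the same unit (speedUnit): compare directly.
       return float(sync_status["speed"]) < float(sync_status["want"])

For a ``progress`` of ``"512/2048"``, ``progress_fraction`` returns
``Fraction(1, 4)``, independently of the unit given in ``progressUnit``.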

Ganeti daemons status
+++++++++++++++++++++

Ganeti will report what information it has about its own daemons.
This should allow identifying possible problems with the Ganeti system
itself: for example, memory leaks, crashes and high resource utilization
should be evident from analyzing this information.

The ``kind`` field will be ``1`` (`Status reporting collectors`_).

Each daemon will have its own data collector, and each of them will have
a ``category`` field valued ``daemon``.

When executed in verbose mode, their data section will include at least:

``memory``
  The amount of used memory.

``size_unit``
  The measurement unit used for the memory.

``uptime``
  The uptime of the daemon.

``CPU usage``
  How much CPU the daemon is using (percentage).

Any other daemon-specific information can be included as well in the
``data`` section.
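
This design does not specify how ``CPU usage`` is measured. One common
approach, assumed here rather than prescribed by this document, is to
sample the daemon's cumulative CPU time twice and divide the difference
by the elapsed wall-clock time. The Python sketch below samples the
current process through ``os.times()`` as a stand-in; a real collector
would read the counters of the monitored daemon instead (for example from
``/proc/<pid>/stat`` on Linux).

.. code-block:: python

   import os
   import time
   from typing import Tuple

   # A sample is (cumulative CPU seconds, monotonic wall-clock seconds).
   Sample = Tuple[float, float]


   def sample_self() -> Sample:
       """Sample the current process (a stand-in for the monitored daemon)."""
       t = os.times()
       return t.user + t.system, time.monotonic()


   def cpu_percent(before: Sample, after: Sample) -> float:
       """CPU usage between two samples, as a percentage of one CPU."""
       cpu_delta = after[0] - before[0]
       wall_delta = after[1] - before[1]
       if wall_delta <= 0:
           raise ValueError("samples must be taken at increasing times")
       return 100.0 * cpu_delta / wall_delta

A collector would take one sample, wait for its sampling interval, take a
second one and report ``cpu_percent(first, second)``. Because CPU time is
summed over all threads, a multi-threaded daemon can exceed 100%.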

Hypervisor resources report
+++++++++++++++++++++++++++

Each hypervisor has a view of system resources that sometimes differs
from the one the OS sees (for example, in Xen the node OS, running as
Dom0, has access to only part of those resources). In this section we'll
report all the information we can in a hypervisor-independent way. Each
hypervisor can then add extra specific information that is not generic
enough to be abstracted.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

Each of the hypervisor data collectors will have ``hypervisor`` as its
``category``.

Node OS resources report
++++++++++++++++++++++++

Since Ganeti assumes it's running on Linux, it's useful to export some
basic information as seen by the host system.

The ``category`` field of the report will be ``null``.

The ``kind`` field will be ``0`` (`Performance reporting collectors`_).

The ``data`` section will include:

``cpu_number``
  The number of available CPUs.

``cpus``
  A list with one element per CPU, showing its average load.

``memory``
  The current view of memory (free, used, cached, etc.).

``filesystem``
  A list with one element per filesystem, showing a summary of the
  total/available space.

``NICs``
  A list with one element per network interface, showing the amount of
  sent/received data, error rate, IP address of the interface, etc.

``versions``
  A map using the name of each component Ganeti interacts with (Linux,
  DRBD, hypervisor, etc.) as the key and its version number as the value.

Note that we won't go into any hardware-specific details (e.g. querying
a node's RAID is outside the scope of this design, and can be
implemented as a plugin), but we can easily report the information
above, since it's standard enough across all systems.
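
On Linux, the per-CPU load for the ``cpus`` list can be derived from two
snapshots of ``/proc/stat``: each ``cpuN`` line holds cumulative time
counters in clock ticks, and the busy fraction over the interval is one
minus the share spent idle. The Python sketch below is one possible
implementation, not necessarily Ganeti's. It assumes a kernel reporting
at least eight counters per line, counts ``iowait`` as idle, and ignores
the later ``guest`` counters because the kernel already includes guest
time in ``user`` and ``nice``.

.. code-block:: python

   from typing import Dict, List


   def parse_proc_stat(text: str) -> Dict[str, List[int]]:
       """Map "cpu0", "cpu1", ... to their first eight time counters."""
       counters = {}
       for line in text.splitlines():
           fields = line.split()
           # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
           if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
               counters[fields[0]] = [int(value) for value in fields[1:9]]
       return counters


   def per_cpu_load(before: str, after: str) -> List[float]:
       """Busy fraction (0.0 to 1.0) of each CPU between two /proc/stat reads."""
       old, new = parse_proc_stat(before), parse_proc_stat(after)
       loads = []
       for cpu in sorted(new, key=lambda name: int(name[3:])):
           if cpu not in old:  # CPU came online between the two reads
               continue
           delta = [n - o for n, o in zip(new[cpu], old[cpu])]
           total = sum(delta)
           idle = delta[3] + delta[4]  # idle + iowait
           loads.append((total - idle) / total if total else 0.0)
       return loads

A collector would read ``/proc/stat``, sleep for its sampling interval,
read it again and pass both texts to ``per_cpu_load``.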

Format of the query
-------------------

The queries to the monitoring agent will be HTTP GET requests on port
1815. The answer will be encoded in JSON format and will depend on the
specific resource accessed.

If a request is sent to a non-existing resource, the HTTP server will
return a 404 error.

The following paragraphs present the resources supported by the current
protocol version, that is, version 1.
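
As a sketch of the client side, the following Python code (standard
library only) issues such a GET request and decodes the JSON answer. The
function names, and the choice to raise ``LookupError`` on a 404, are
illustrative assumptions. A 404 is mapped to an exception rather than to
``None`` because some resources legitimately answer with the JSON null
value.

.. code-block:: python

   import json
   import urllib.error
   import urllib.parse
   import urllib.request
   from typing import Any

   AGENT_PORT = 1815  # port stated in this design


   def build_url(host: str, path: str, verbose: bool = False) -> str:
       """URL of an agent resource (host is assumed to be a name or IPv4)."""
       query = ("?" + urllib.parse.urlencode({"verbose": 1})) if verbose else ""
       return "http://%s:%d%s%s" % (host, AGENT_PORT, path, query)


   def query_agent(host: str, path: str, verbose: bool = False,
                   timeout: float = 10.0) -> Any:
       """GET a resource and return its decoded JSON answer."""
       url = build_url(host, path, verbose)
       try:
           with urllib.request.urlopen(url, timeout=timeout) as resp:
               return json.loads(resp.read().decode("utf-8"))
       except urllib.error.HTTPError as err:
           if err.code == 404:
               # Non-existing resource; raised because JSON null is a
               # legitimate answer for some resources.
               raise LookupError("no such resource: %s" % path) from err
           raise

For example, ``query_agent("node1.example.com", "/")`` (a hypothetical
host name) would return the list of supported protocol versions.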

``/``
+++++

The root resource. It will return the list of the supported protocol
version numbers.

Currently, this will include only version 1.

``/1``
++++++

Not an actual resource per se, it is the root of all the resources of
protocol version 1.

If requested through GET, the null JSON value will be returned.

``/1/list/collectors``
++++++++++++++++++++++

Returns a list of (kind, category, name) tuples showing all the
collectors available in the system.

``/1/report/all``
+++++++++++++++++

A list of the reports of all the data collectors, as described in the
section `Format of the report`_.

`Status reporting collectors`_ will provide their output in non-verbose
format. The verbose format can be requested by adding the parameter
``verbose=1`` to the request.

``/1/report/[category]/[collector_name]``
+++++++++++++++++++++++++++++++++++++++++

Returns the report of the collector ``[collector_name]`` that belongs to
the specified ``[category]``.

If a collector does not belong to any category, ``collector`` will be
used as the value for ``[category]``.

`Status reporting collectors`_ will provide their output in non-verbose
format. The verbose format can be requested by adding the parameter
``verbose=1`` to the request.
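
The resource layout above can be summarized as a small dispatch
function. The following Python sketch is illustrative only (the agent
itself is planned as a Haskell daemon). It assumes a hypothetical
in-memory registry mapping each (kind, category, name) tuple to a
function producing that collector's report, and it expects the query
string (``verbose=1``) to have been parsed already.

.. code-block:: python

   from typing import Any, Callable, Dict, Optional, Tuple

   # Hypothetical registry: (kind, category, name) -> report producer.
   # A producer takes the "verbose" flag; performance collectors ignore it.
   CollectorKey = Tuple[int, Optional[str], str]
   Registry = Dict[CollectorKey, Callable[[bool], Any]]


   def route(path: str, registry: Registry,
             verbose: bool = False) -> Tuple[int, Any]:
       """Map a protocol version 1 path to an (HTTP status, JSON body) pair."""
       parts = [part for part in path.split("/") if part]
       if not parts:
           return 200, [1]  # "/": the supported protocol versions
       if parts[0] != "1":
           return 404, None
       if len(parts) == 1:
           return 200, None  # "/1": not a real resource, answers null
       if parts[1:] == ["list", "collectors"]:
           return 200, [list(key) for key in registry]
       if parts[1:] == ["report", "all"]:
           return 200, [produce(verbose) for produce in registry.values()]
       if len(parts) == 4 and parts[1] == "report":
           category, name = parts[2], parts[3]
           for (_kind, cat, cname), produce in registry.items():
               # Collectors without a category are addressed as "collector".
               if (cat or "collector") == category and cname == name:
                   return 200, produce(verbose)
       return 404, None

Returning ``(status, body)`` keeps the routing independent of any
particular HTTP server library.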

Instance disk status propagation
--------------------------------

As with the instance status, Ganeti currently has only partial
information about its instance disks: in particular, each node is
unaware of the disk-to-instance mapping, which exists only on the
master.

For this design doc we plan to fix this by changing all the RPCs that
create backend storage, or that put already existing storage into use,
so that they pass the relevant instance to the node. The node can then
export this information to the status reporting tool.

Until these RPC changes are implemented, the data collectors will use
Confd to fetch this information.

Plugin system
-------------

The monitoring system will be equipped with a plugin system through
which specific local information can be exported.

The plugin system is expected to be used by local installations to
export any installation-specific information that they want to be
monitored, about either hardware or software on their systems.

Plugins will take the form of either scripts or binaries whose output
will be inserted in the report.

Eventually, support for other kinds of plugins might be added as well,
such as plain text files that will be inserted into the report, or local
Unix or network sockets from which the information has to be read. This
should allow the most flexibility for implementing an efficient system,
while keeping it as simple as possible.
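
The design does not yet fix how plugin output is structured. Assuming,
hypothetically, that a plugin prints one JSON document on stdout and
signals failure with a non-zero exit status, a runner could look like the
Python sketch below. Failures are returned as data instead of being
raised, so that one broken plugin does not prevent the rest of the report
from being produced.

.. code-block:: python

   import json
   import subprocess
   from typing import Any, Dict, Sequence


   def run_plugin(argv: Sequence[str], timeout: float = 30.0) -> Dict[str, Any]:
       """Run a plugin script or binary and wrap its output for the report."""
       try:
           proc = subprocess.run(list(argv), capture_output=True, text=True,
                                 timeout=timeout, check=False)
       except (OSError, subprocess.TimeoutExpired) as err:
           # Missing executable, permission problem, or plugin hung.
           return {"plugin": argv[0], "error": str(err)}
       if proc.returncode != 0:
           return {"plugin": argv[0], "error": "exit status %d" % proc.returncode}
       try:
           data = json.loads(proc.stdout)
       except ValueError as err:  # json.JSONDecodeError is a ValueError
           return {"plugin": argv[0], "error": "invalid JSON: %s" % err}
       return {"plugin": argv[0], "data": data}

The ``plugin``, ``data`` and ``error`` keys are placeholders; the actual
placement of plugin output in the report is not defined by this design.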

Data collectors
---------------

In order to ease testing, as well as to make it simple to reuse this
subsystem, it will be possible to run just the "data collectors" on each
node without passing through the agent daemon.

If a data collector is run independently, it should print its report on
stdout, according to the format corresponding to a single data
collector report object, as described in the previous paragraphs.
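
A stand-alone collector therefore only needs to wrap its data in a report
object and print it. The Python sketch below assumes the report fields
``name``, ``version``, ``format_version``, ``timestamp``, ``category``,
``kind`` and ``data`` from the `Format of the report`_ section; those
names, the nanosecond timestamp and the example ``load-average``
collector are assumptions for illustration, not a description of an
actual Ganeti collector.

.. code-block:: python

   import json
   import os
   import sys
   import time
   from typing import Any, Dict, Optional


   def make_report(name: str, category: Optional[str], kind: int, data: Any,
                   version: str = "0.1") -> Dict[str, Any]:
       """Wrap collector data in a single report object (field names assumed)."""
       return {
           "name": name,
           "version": version,            # assumed: version of the collector
           "format_version": 1,           # assumed: version of the data format
           "timestamp": time.time_ns(),   # assumed: nanoseconds since the epoch
           "category": category,
           "kind": kind,                  # 0: performance, 1: status reporting
           "data": data,
       }


   def main() -> None:
       # Hypothetical performance collector reporting the load averages.
       load1, load5, load15 = os.getloadavg()
       report = make_report("load-average", None, 0,
                            {"load1": load1, "load5": load5, "load15": load15})
       json.dump(report, sys.stdout)
       sys.stdout.write("\n")


   if __name__ == "__main__":
       main()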

Mode of operation
-----------------

In order to be able to report information quickly, the monitoring agent
daemon will keep an in-memory or on-disk cache of the status, which will
be returned when queries are made. The status system will then
periodically check resources to make sure the status is up to date.

Different parts of the report will be queried at different speeds. These
will depend on:

- how often they vary (or we expect them to vary)
- how fast they are to query
- how important their freshness is

Of course the last parameter is installation-specific, and while we'll
try to have defaults, it will be configurable. The first two, instead,
can be used adaptively to query a certain resource faster or slower.

When run as stand-alone binaries, the data collectors will not use any
caching system, and will just fetch and return the data immediately.
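
As an illustration of the adaptive part, the Python sketch below caches
one collector's latest result. The back-off rule is an assumption made
for the example: the refresh interval doubles while the data does not
change, resets when it does, is never shorter than ten times the measured
collection cost, and is capped by ``max_age``, which plays the role of
the configurable freshness bound. A periodic loop in the daemon would
call ``refresh_if_due``, while queries only call ``get``.

.. code-block:: python

   import time
   from typing import Any, Callable, Optional


   class CachedCollector:
       """Keep one collector's latest result and refresh it adaptively."""

       def __init__(self, collect: Callable[[], Any], min_interval: float,
                    max_age: float,
                    clock: Callable[[], float] = time.monotonic) -> None:
           self.collect = collect
           self.min_interval = min_interval
           self.max_age = max_age          # configurable freshness bound
           self.clock = clock
           self.interval = min_interval
           self.value: Any = None
           self.fetched_at: Optional[float] = None

       def refresh_if_due(self) -> bool:
           """Called periodically by the daemon; returns True if it collected."""
           now = self.clock()
           if self.fetched_at is not None and now - self.fetched_at < self.interval:
               return False
           new_value = self.collect()
           cost = self.clock() - now
           if new_value == self.value:
               self.interval *= 2                 # stable data: back off
           else:
               self.interval = self.min_interval  # changing data: poll fast
           # Expensive collectors run at most about once every 10x their cost,
           # but data is never allowed to get older than max_age.
           self.interval = min(max(self.interval, 10 * cost), self.max_age)
           self.value, self.fetched_at = new_value, self.clock()
           return True

       def get(self) -> Any:
           """Answer a query from the cache without collecting."""
           return self.value

Passing a fake ``clock`` makes the policy easy to test without sleeping.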

Implementation place
--------------------

The status daemon will be implemented as a standalone Haskell daemon. In
the future it should be easy to merge multiple daemons into one with
multiple entry points, should we find out that it saves resources and
doesn't impact functionality.

The libekg library should be looked at for easily providing metrics in
JSON format.

Implementation order
--------------------

We will implement the agent system in this order:

- initial example data collectors (e.g. for DRBD and instance status)
- initial daemon for exporting data, integrating the existing collectors
- plugin system
- RPC updates for instance status reasons and disk to instance mapping
- cache layer for the daemon
- more data collectors


Future work
===========

As a future step it can be useful to "centralize" all this reporting
data in a single place. This could be, for example, just the master
node, or all the master candidates. We will evaluate doing this after
the first node-local version has been developed and tested.

Another possible change is replacing the "read-only" RPCs with queries
to the agent system, thus having only one way of collecting information
from the nodes, both for a monitoring system and for Ganeti itself.

One extra feature we may need is a way to query for only sub-parts of
the report (e.g. instance status only). This can be done by passing
arguments to the HTTP GET, which will be defined when we get to this
functionality.

Finally, the autorepair system (see the :doc:`autorepair system design
<design-autorepair>`) can be expanded to use the monitoring agent system
as a source of information to decide which repairs it can perform.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: