root / doc / design-monitoring-agent.rst @ 9ef3e121
1 | 109e07c2 | Guido Trotter | ======================= |
---|---|---|---|
2 | 109e07c2 | Guido Trotter | Ganeti monitoring agent |
3 | 109e07c2 | Guido Trotter | ======================= |
4 | 109e07c2 | Guido Trotter | |
5 | 109e07c2 | Guido Trotter | .. contents:: :depth: 4 |
6 | 109e07c2 | Guido Trotter | |
7 | 109e07c2 | Guido Trotter | This is a design document detailing the implementation of a Ganeti |
8 | 109e07c2 | Guido Trotter | monitoring agent report system that can be queried by a monitoring |
9 | 109e07c2 | Guido Trotter | system to calculate health information for a Ganeti cluster. |
10 | 109e07c2 | Guido Trotter | |
11 | 109e07c2 | Guido Trotter | Current state and shortcomings |
12 | 109e07c2 | Guido Trotter | ============================== |
13 | 109e07c2 | Guido Trotter | |
14 | 109e07c2 | Guido Trotter | There is currently no monitoring support in Ganeti. While we don't want |
15 | 109e07c2 | Guido Trotter | to build something like Nagios or Pacemaker as part of Ganeti, it would |
16 | 109e07c2 | Guido Trotter | be useful if such tools could easily extract information from a Ganeti |
17 | 109e07c2 | Guido Trotter | machine in order to take actions (example actions include logging an |
18 | 109e07c2 | Guido Trotter | outage for future reporting or alerting a person or system about it). |
19 | 109e07c2 | Guido Trotter | |
20 | 109e07c2 | Guido Trotter | Proposed changes |
21 | 109e07c2 | Guido Trotter | ================ |
22 | 109e07c2 | Guido Trotter | |
23 | 109e07c2 | Guido Trotter | Each Ganeti node should export a status page that can be queried by a |
24 | 109e07c2 | Guido Trotter | monitoring system. Such status page will be exported on a network port |
25 | 109e07c2 | Guido Trotter | and will be encoded in JSON (simple text) over HTTP. |
26 | 109e07c2 | Guido Trotter | |
27 | 109e07c2 | Guido Trotter | The choice of JSON is obvious as we already depend on it in Ganeti and |
28 | 109e07c2 | Guido Trotter | thus we don't need to add extra libraries to use it, as opposed to what |
29 | 109e07c2 | Guido Trotter | would happen for XML or some other markup format. |
30 | 109e07c2 | Guido Trotter | |
31 | 109e07c2 | Guido Trotter | Location of agent report |
32 | 109e07c2 | Guido Trotter | ------------------------ |
33 | 109e07c2 | Guido Trotter | |
34 | 109e07c2 | Guido Trotter | The report will be available from all nodes, and will cover all |
35 | 109e07c2 | Guido Trotter | node-local resources. This allows more real-time information to be |
36 | 109e07c2 | Guido Trotter | available, at the cost of querying all nodes. |
37 | 109e07c2 | Guido Trotter | |
38 | 109e07c2 | Guido Trotter | Information reported |
39 | 109e07c2 | Guido Trotter | -------------------- |
40 | 109e07c2 | Guido Trotter | |
41 | 109e07c2 | Guido Trotter | The monitoring agent system will report on the following basic information: |
42 | 109e07c2 | Guido Trotter | |
43 | 109e07c2 | Guido Trotter | - Instance status |
44 | 109e07c2 | Guido Trotter | - Instance disk status |
45 | 109e07c2 | Guido Trotter | - Status of storage for instances |
46 | 109e07c2 | Guido Trotter | - Ganeti daemons status, CPU usage, memory footprint |
47 | 109e07c2 | Guido Trotter | - Hypervisor resources report (memory, CPU, network interfaces) |
48 | 109e07c2 | Guido Trotter | - Node OS resources report (memory, CPU, network interfaces) |
49 | 109e07c2 | Guido Trotter | - Information from a plugin system |
50 | 109e07c2 | Guido Trotter | |
51 | 109e07c2 | Guido Trotter | Instance status |
52 | 109e07c2 | Guido Trotter | +++++++++++++++ |
53 | 109e07c2 | Guido Trotter | |
54 | 109e07c2 | Guido Trotter | At the moment each node knows which instances are running on it, which |
55 | 109e07c2 | Guido Trotter | instances it is primary for, but not the reason why an instance might not |
56 | 109e07c2 | Guido Trotter | be running. On the other hand we don't want to distribute full instance |
57 | 109e07c2 | Guido Trotter | "admin" status information to all nodes, because of the performance |
58 | 109e07c2 | Guido Trotter | impact this would have. |
59 | 109e07c2 | Guido Trotter | |
60 | 109e07c2 | Guido Trotter | As such we propose that: |
61 | 109e07c2 | Guido Trotter | |
62 | 109e07c2 | Guido Trotter | - Any operation that can affect instance status will have an optional |
63 | 109e07c2 | Guido Trotter | "reason" attached to it (at opcode level). This can be used for |
64 | 109e07c2 | Guido Trotter | example to distinguish an admin request, from a scheduled maintenance |
65 | 109e07c2 | Guido Trotter | or an automated tool's work. If this reason is not passed, Ganeti will |
66 | 109e07c2 | Guido Trotter | just use the information it has about the source of the request: for |
67 | 109e07c2 | Guido Trotter | example a cli shutdown operation will have "cli:shutdown" as a reason, |
68 | 109e07c2 | Guido Trotter | a cli failover operation will have "cli:failover". Operations coming |
69 | 109e07c2 | Guido Trotter | from the remote API will use "rapi" instead of "cli". Of course |
70 | 109e07c2 | Guido Trotter | setting a real site-specific reason is still preferred. |
71 | 109e07c2 | Guido Trotter | - RPCs that affect the instance status will be changed so that the |
72 | 109e07c2 | Guido Trotter | "reason" and the version of the config object they ran on are passed to |
73 | 109e07c2 | Guido Trotter | them. They will then export the new expected instance status, together |
74 | 109e07c2 | Guido Trotter | with the associated reason and object version to the status report |
75 | 109e07c2 | Guido Trotter | system, which will then export them. |
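The defaulting rule for the "reason" can be sketched as follows. This is only an illustration in Python: the helper name ``effective_reason`` and its parameters are hypothetical, and the only facts taken from this design are the "cli:shutdown" style of the default and the use of "rapi" for remote API requests.

```python
from typing import Optional


def effective_reason(source: str, operation: str,
                     user_reason: Optional[str] = None) -> str:
    """Return the reason to attach to an operation (hypothetical helper).

    A site-specific reason, when given, is always preferred. Otherwise
    the reason is derived from the source of the request ("cli" or
    "rapi") and the operation name.
    """
    if user_reason:
        return user_reason
    return f"{source}:{operation}"


if __name__ == "__main__":
    print(effective_reason("cli", "shutdown"))
    print(effective_reason("rapi", "failover"))
    print(effective_reason("cli", "shutdown", "scheduled maintenance"))
```

A monitoring system can then tell a default reason from a site-specific one, assuming site-specific reasons do not reuse the "source:operation" form.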
76 | 109e07c2 | Guido Trotter | |
77 | 109e07c2 | Guido Trotter | Monitoring and auditing systems can then use the reason to understand |
78 | 109e07c2 | Guido Trotter | the cause of an instance status, and they can use the object version to |
79 | 109e07c2 | Guido Trotter | understand the freshness of their data even in the absence of atomic |
80 | 109e07c2 | Guido Trotter | cross-node reporting: for example if they see an instance "up" on a node |
81 | 109e07c2 | Guido Trotter | after seeing it running on a previous one, they can compare these values |
82 | 109e07c2 | Guido Trotter | to understand which data is freshest, and repoll the "older" node. Of |
83 | 109e07c2 | Guido Trotter | course if they keep seeing this status this represents an error (either |
84 | 109e07c2 | Guido Trotter | an instance continuously "flapping" between nodes, or an instance |
85 | 109e07c2 | Guido Trotter | constantly up on more than one node), which should be reported and acted |
86 | 109e07c2 | Guido Trotter | upon. |
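The freshness check described above could look like the following sketch on the monitoring system side. It assumes that the object version is an integer that grows with every config change; the ``Observation`` type, its field names and ``node_to_repoll`` are hypothetical and not part of the report format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What a monitoring system saw for one instance on one node."""
    node: str
    instance_uuid: str
    status: str           # "up" or "down"
    object_version: int   # assumption: increases with each config change


def node_to_repoll(first: Observation, second: Observation) -> Optional[str]:
    """Return the node holding the older data, or None if equally fresh."""
    if first.instance_uuid != second.instance_uuid:
        raise ValueError("observations refer to different instances")
    if first.object_version == second.object_version:
        return None
    older = first if first.object_version < second.object_version else second
    return older.node


if __name__ == "__main__":
    seen_on_a = Observation("node-a", "uuid-1", "up", 41)
    seen_on_b = Observation("node-b", "uuid-1", "up", 42)
    print(node_to_repoll(seen_on_a, seen_on_b))
```

If the same conflict is still present after repolling, the monitoring system should treat it as an error rather than keep retrying.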
87 | 109e07c2 | Guido Trotter | |
88 | 109e07c2 | Guido Trotter | The instance status will be reported by each node for the instances it is |
89 | 109e07c2 | Guido Trotter | primary for, and will contain at least: |
90 | 109e07c2 | Guido Trotter | |
91 | 109e07c2 | Guido Trotter | - The instance name |
92 | 109e07c2 | Guido Trotter | - The instance UUID (stable on name change) |
93 | 109e07c2 | Guido Trotter | - The instance running status (up or down) |
94 | 9805aa82 | Guido Trotter | - The uptime, as detected by the hypervisor |
95 | 109e07c2 | Guido Trotter | - The timestamp of last known change |
96 | 109e07c2 | Guido Trotter | - The timestamp of when the status was last checked (see caching, below) |
97 | 109e07c2 | Guido Trotter | - The last known reason for change, if any |
98 | 109e07c2 | Guido Trotter | |
99 | 109e07c2 | Guido Trotter | More information about all the fields and their type will be available |
100 | 109e07c2 | Guido Trotter | in the "Format of the report" section. |
101 | 109e07c2 | Guido Trotter | |
102 | 109e07c2 | Guido Trotter | Note that as soon as a node knows it's not the primary anymore for an |
103 | 109e07c2 | Guido Trotter | instance it will stop reporting status for it: this means the instance |
104 | 109e07c2 | Guido Trotter | will either disappear, if it has been deleted, or appear on another |
105 | 109e07c2 | Guido Trotter | node, if it's been moved. |
106 | 109e07c2 | Guido Trotter | |
107 | 109e07c2 | Guido Trotter | Instance Disk status |
108 | 109e07c2 | Guido Trotter | ++++++++++++++++++++ |
109 | 109e07c2 | Guido Trotter | |
110 | 109e07c2 | Guido Trotter | As with the instance status, Ganeti currently has only partial information |
111 | 109e07c2 | Guido Trotter | about its instance disks: in particular each node is unaware of the disk to |
112 | 109e07c2 | Guido Trotter | instance mapping, which exists only on the master. |
113 | 109e07c2 | Guido Trotter | |
114 | 109e07c2 | Guido Trotter | For this design doc we plan to fix this by changing all RPCs that create |
115 | 109e07c2 | Guido Trotter | a backend storage, or that put an already existing one in use, so that they |
116 | 109e07c2 | Guido Trotter | pass the relevant instance to the node. The node can then export these to the |
117 | 109e07c2 | Guido Trotter | status reporting tool. |
118 | 109e07c2 | Guido Trotter | |
119 | 109e07c2 | Guido Trotter | Until these RPC changes are implemented, we'll use confd to |
120 | 109e07c2 | Guido Trotter | fetch this information in the data collector. |
121 | 109e07c2 | Guido Trotter | |
122 | 109e07c2 | Guido Trotter | Since Ganeti supports many types of disks for instances (drbd, rbd, |
123 | 109e07c2 | Guido Trotter | plain, file) we will export both a "generic" status which will work for |
124 | 109e07c2 | Guido Trotter | any type of disk and will be very opaque (at minimum just a "healthy" |
125 | 109e07c2 | Guido Trotter | or "error" state, plus perhaps some human readable comment) and a |
126 | 109e07c2 | Guido Trotter | "per-type" status which will explain more about the internal details but |
127 | 109e07c2 | Guido Trotter | will not be compatible between different storage types (and will for |
128 | 109e07c2 | Guido Trotter | example export the drbd connection status, sync, and so on). |
129 | 109e07c2 | Guido Trotter | |
130 | 109e07c2 | Guido Trotter | Status of storage for instances |
131 | 109e07c2 | Guido Trotter | +++++++++++++++++++++++++++++++ |
132 | 109e07c2 | Guido Trotter | |
133 | 109e07c2 | Guido Trotter | The node will also be reporting on all storage types it knows about for |
134 | 109e07c2 | Guido Trotter | the current node (this is right now hardcoded to the enabled storage |
135 | 109e07c2 | Guido Trotter | types, and in the future tied to the enabled storage pools for the |
136 | 109e07c2 | Guido Trotter | nodegroup). For this kind of information also we will report both a |
137 | 109e07c2 | Guido Trotter | generic health status (healthy or error) for each type of storage, and |
138 | 109e07c2 | Guido Trotter | some more generic statistics (free space, used space, total visible |
139 | 109e07c2 | Guido Trotter | space). In addition type specific information can be exported: for |
140 | 109e07c2 | Guido Trotter | example, in case of error, the nature of the error can be disclosed as |
141 | 109e07c2 | Guido Trotter | type specific information. Examples of these are "backend pv |
142 | 109e07c2 | Guido Trotter | unavailable" for lvm storage, "unreachable" for network based storage or |
143 | 109e07c2 | Guido Trotter | "filesystem error" for filesystem based implementations. |
144 | 109e07c2 | Guido Trotter | |
145 | 109e07c2 | Guido Trotter | Ganeti daemons status |
146 | 109e07c2 | Guido Trotter | +++++++++++++++++++++ |
147 | 109e07c2 | Guido Trotter | |
148 | 109e07c2 | Guido Trotter | Ganeti will report what information it has about its own daemons: this |
149 | 109e07c2 | Guido Trotter | includes memory usage, uptime, CPU usage. This should allow identifying |
150 | 109e07c2 | Guido Trotter | possible problems with the Ganeti system itself: for example memory |
151 | 109e07c2 | Guido Trotter | leaks, crashes and high resource utilization should be evident by |
152 | 109e07c2 | Guido Trotter | analyzing this information. |
153 | 109e07c2 | Guido Trotter | |
154 | 109e07c2 | Guido Trotter | Ganeti daemons will also be able to export extra internal information to |
155 | 109e07c2 | Guido Trotter | the status reporting, through the plugin system (see below). |
156 | 109e07c2 | Guido Trotter | |
157 | 109e07c2 | Guido Trotter | Hypervisor resources report |
158 | 109e07c2 | Guido Trotter | +++++++++++++++++++++++++++ |
159 | 109e07c2 | Guido Trotter | |
160 | 109e07c2 | Guido Trotter | Each hypervisor has a view of system resources that sometimes is |
161 | 109e07c2 | Guido Trotter | different than the one the OS sees (for example in Xen the Node OS, |
162 | 109e07c2 | Guido Trotter | running as Dom0, has access to only part of those resources). In this |
163 | 109e07c2 | Guido Trotter | section we'll report all information we can in a "non hypervisor |
164 | 109e07c2 | Guido Trotter | specific" way. Each hypervisor can then add extra specific information |
165 | 109e07c2 | Guido Trotter | that is not generic enough to be abstracted. |
166 | 109e07c2 | Guido Trotter | |
167 | 109e07c2 | Guido Trotter | Node OS resources report |
168 | 109e07c2 | Guido Trotter | ++++++++++++++++++++++++ |
169 | 109e07c2 | Guido Trotter | |
170 | 109e07c2 | Guido Trotter | Since Ganeti assumes it's running on Linux, it's useful to export some |
171 | 109e07c2 | Guido Trotter | basic information as seen by the host system. This includes number and |
172 | 109e07c2 | Guido Trotter | status of CPUs, memory, filesystems and network interfaces as well as the |
173 | 109e07c2 | Guido Trotter | version of components Ganeti interacts with (Linux, drbd, hypervisor, |
174 | 109e07c2 | Guido Trotter | etc). |
175 | 109e07c2 | Guido Trotter | |
176 | 109e07c2 | Guido Trotter | Note that we won't go into any hardware specific details (e.g. querying a |
177 | 109e07c2 | Guido Trotter | node RAID is outside the scope of this, and can be implemented as a |
178 | 109e07c2 | Guido Trotter | plugin) but we can easily just report the information above, since it's |
179 | 109e07c2 | Guido Trotter | standard enough across all systems. |
180 | 109e07c2 | Guido Trotter | |
181 | 109e07c2 | Guido Trotter | Plugin system |
182 | 109e07c2 | Guido Trotter | +++++++++++++ |
183 | 109e07c2 | Guido Trotter | |
184 | 109e07c2 | Guido Trotter | The monitoring system will be equipped with a plugin system that can |
185 | 109e07c2 | Guido Trotter | export specific local information through it. The plugin system will be |
186 | 109e07c2 | Guido Trotter | in the form of either scripts whose output will be inserted in the |
187 | 109e07c2 | Guido Trotter | report, plain text files which will be inserted into the report, or |
188 | 109e07c2 | Guido Trotter | local unix or network sockets from which the information has to be read. |
189 | 109e07c2 | Guido Trotter | This should allow the most flexibility for implementing an efficient system, |
190 | 109e07c2 | Guido Trotter | while being able to keep it as simple as possible. |
191 | 109e07c2 | Guido Trotter | |
192 | 109e07c2 | Guido Trotter | The plugin system is expected to be used by local installations to |
193 | 109e07c2 | Guido Trotter | export any installation specific information that they want to be |
194 | 109e07c2 | Guido Trotter | monitored, about either hardware or software on their systems. |
195 | 109e07c2 | Guido Trotter | |
196 | 109e07c2 | Guido Trotter | |
197 | 109e07c2 | Guido Trotter | Format of the query |
198 | 109e07c2 | Guido Trotter | ------------------- |
199 | 109e07c2 | Guido Trotter | |
200 | 109e07c2 | Guido Trotter | The query will be an HTTP GET request on a particular port. At the |
201 | 109e07c2 | Guido Trotter | beginning it will only be possible to query the full status report. |
202 | 109e07c2 | Guido Trotter | |
203 | 109e07c2 | Guido Trotter | |
204 | 109e07c2 | Guido Trotter | Format of the report |
205 | 109e07c2 | Guido Trotter | -------------------- |
206 | 109e07c2 | Guido Trotter | |
207 | 9ef3e121 | Michele Tartara | The report will be in JSON format, and it will present an array |
208 | 9ef3e121 | Michele Tartara | of report objects. |
209 | 9ef3e121 | Michele Tartara | Each report object will be produced by a specific data collector. |
210 | 9ef3e121 | Michele Tartara | Each report object includes some mandatory fields, to be provided by all |
211 | 9ef3e121 | Michele Tartara | the data collectors, and a field to contain data collector-specific |
212 | 9ef3e121 | Michele Tartara | data. |
213 | 9ef3e121 | Michele Tartara | |
214 | 9ef3e121 | Michele Tartara | Here follows a minimal example of a report:: |
215 | 9ef3e121 | Michele Tartara | |
216 | 9ef3e121 | Michele Tartara | [ |
217 | 9ef3e121 | Michele Tartara | { |
218 | 9ef3e121 | Michele Tartara | "name" : "TheCollectorIdentifier", |
219 | 9ef3e121 | Michele Tartara | "version" : "1.2", |
220 | 9ef3e121 | Michele Tartara | "format_version" : 1, |
221 | 9ef3e121 | Michele Tartara | "timestamp" : 1351607182000000000, |
222 | 9ef3e121 | Michele Tartara | "data" : { "plugin_specific_data" : "go_here" } |
223 | 9ef3e121 | Michele Tartara | }, |
224 | 9ef3e121 | Michele Tartara | { |
225 | 9ef3e121 | Michele Tartara | "name" : "AnotherDataCollector", |
226 | 9ef3e121 | Michele Tartara | "version" : "B", |
227 | 9ef3e121 | Michele Tartara | "format_version" : 7, |
228 | 9ef3e121 | Michele Tartara | "timestamp" : 1351609526123854000, |
229 | 9ef3e121 | Michele Tartara | "data" : { "plugin_specific" : "data", |
230 | 9ef3e121 | Michele Tartara | "some_late_data" : { "timestamp" : "SPECIFIC_TIME", |
231 | 9ef3e121 | Michele Tartara | ... } |
232 | 9ef3e121 | Michele Tartara | } |
233 | 9ef3e121 | Michele Tartara | } |
234 | 9ef3e121 | Michele Tartara | ] |
235 | 9ef3e121 | Michele Tartara | |
236 | 9ef3e121 | Michele Tartara | Here is the description of the mandatory fields of each object: |
237 | 9ef3e121 | Michele Tartara | |
238 | 9ef3e121 | Michele Tartara | name |
239 | 9ef3e121 | Michele Tartara | the name of the data collector that produced this part of the report. |
240 | 9ef3e121 | Michele Tartara | It is supposed to be unique inside a report. |
241 | 9ef3e121 | Michele Tartara | |
242 | 9ef3e121 | Michele Tartara | version |
243 | 9ef3e121 | Michele Tartara | the version of the data collector that produces this part of the |
244 | 9ef3e121 | Michele Tartara | report. Built-in data collectors (as opposed to those implemented as |
245 | 9ef3e121 | Michele Tartara | plugins) should have "B" as the version number. |
246 | 9ef3e121 | Michele Tartara | |
247 | 9ef3e121 | Michele Tartara | format_version |
248 | 9ef3e121 | Michele Tartara | the format of what is represented in the "data" field for each data |
249 | 9ef3e121 | Michele Tartara | collector might change over time. Every time this happens, the |
250 | 9ef3e121 | Michele Tartara | format_version should be changed, so that whoever reads the report knows |
251 | 9ef3e121 | Michele Tartara | what format to expect, and how to correctly interpret it. |
252 | 9ef3e121 | Michele Tartara | |
253 | 9ef3e121 | Michele Tartara | timestamp |
254 | 9ef3e121 | Michele Tartara | the time when the reported data were gathered. It has to be expressed |
255 | 9ef3e121 | Michele Tartara | in nanoseconds since the unix epoch (0:00:00 January 01, 1970). If not |
256 | 9ef3e121 | Michele Tartara | enough precision is available (or needed) it can be padded with |
257 | 9ef3e121 | Michele Tartara | zeroes. If a report object needs multiple timestamps, it can add more |
258 | 9ef3e121 | Michele Tartara | and/or override this one inside its own "data" section. |
259 | 9ef3e121 | Michele Tartara | |
260 | 9ef3e121 | Michele Tartara | data |
261 | 9ef3e121 | Michele Tartara | this field contains all the data generated by the data collector, in |
262 | 9ef3e121 | Michele Tartara | its own independently defined format. The monitoring agent could check |
263 | 9ef3e121 | Michele Tartara | this syntactically (according to the JSON specifications) but not |
264 | 9ef3e121 | Michele Tartara | semantically. |
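As an illustration of the mandatory fields listed above, a consumer of the report could check its structure as in the following Python sketch. The function names ``check_report`` and ``timestamp_to_utc`` are hypothetical; the checks cover only what this section states (an array of objects, five mandatory fields, unique names, timestamps in nanoseconds since the unix epoch).

```python
import json
from datetime import datetime, timezone

MANDATORY_FIELDS = ("name", "version", "format_version", "timestamp", "data")


def check_report(text):
    """Parse a report and return a list of structural problems found."""
    report = json.loads(text)
    if not isinstance(report, list):
        return ["report is not an array of report objects"]
    problems = []
    seen_names = set()
    for idx, obj in enumerate(report):
        if not isinstance(obj, dict):
            problems.append(f"object {idx}: not a JSON object")
            continue
        for field in MANDATORY_FIELDS:
            if field not in obj:
                problems.append(f"object {idx}: missing field {field!r}")
        name = obj.get("name")
        if isinstance(name, str):
            # Names are supposed to be unique inside a report.
            if name in seen_names:
                problems.append(f"object {idx}: duplicate name {name!r}")
            seen_names.add(name)
    return problems


def timestamp_to_utc(nanoseconds):
    """Convert a report timestamp (ns since the epoch) to a UTC datetime."""
    seconds, nanos = divmod(nanoseconds, 10**9)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only has microsecond resolution; finer digits are dropped.
    return moment.replace(microsecond=nanos // 1000)
```

The content of "data" is deliberately not inspected here, since its meaning is defined by each data collector and identified by "format_version".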
265 | 109e07c2 | Guido Trotter | |
266 | 109e07c2 | Guido Trotter | |
267 | 109e07c2 | Guido Trotter | Data collectors |
268 | 109e07c2 | Guido Trotter | --------------- |
269 | 109e07c2 | Guido Trotter | |
270 | 109e07c2 | Guido Trotter | In order to ease testing as well as to make it simple to reuse this |
271 | 109e07c2 | Guido Trotter | subsystem it will be possible to run just the "data collectors" on each |
272 | 109e07c2 | Guido Trotter | node without passing through the agent daemon. Each data collector will |
273 | 109e07c2 | Guido Trotter | report specific data about its subsystem and will be documented |
274 | 109e07c2 | Guido Trotter | separately. |
275 | 109e07c2 | Guido Trotter | |
276 | 9ef3e121 | Michele Tartara | If a data collector is run independently, it should print on stdout its |
277 | 9ef3e121 | Michele Tartara | report, according to the format corresponding to a single data collector |
278 | 9ef3e121 | Michele Tartara | report object, as described in the previous paragraph. |
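A data collector run on its own could be as small as the following Python sketch. The collector name, its version and the content of "data" are made up for the example; only the five top-level fields and the nanosecond timestamp come from the report format.

```python
import json
import sys
import time


def build_report(name, version, format_version, data):
    """Build a single report object with the mandatory fields."""
    return {
        "name": name,
        "version": version,
        "format_version": format_version,
        # Nanoseconds since the unix epoch, as required by the format.
        "timestamp": time.time_ns(),
        "data": data,
    }


def main():
    # Hypothetical plugin collector; a built-in one would use "B" as version.
    report = build_report("ExampleCollector", "0.1", 1,
                          {"example_key": "example_value"})
    json.dump(report, sys.stdout)
    sys.stdout.write("\n")


if __name__ == "__main__":
    main()
```

Since the output is a single report object, the agent daemon (or a test) can collect the output of several collectors and join them into the array that makes up the full report.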
279 | 9ef3e121 | Michele Tartara | |
280 | 109e07c2 | Guido Trotter | |
281 | 109e07c2 | Guido Trotter | Mode of operation |
282 | 109e07c2 | Guido Trotter | ----------------- |
283 | 109e07c2 | Guido Trotter | |
284 | 109e07c2 | Guido Trotter | In order to be able to report information fast the monitoring agent |
285 | 109e07c2 | Guido Trotter | daemon will keep an in-memory or on-disk cache of the status, which will |
286 | 109e07c2 | Guido Trotter | be returned when queries are made. The status system will then |
287 | 109e07c2 | Guido Trotter | periodically check resources to make sure the status is up to date. |
288 | 109e07c2 | Guido Trotter | |
289 | 109e07c2 | Guido Trotter | Different parts of the report will be queried at different speeds. These |
290 | 109e07c2 | Guido Trotter | will depend on: |
291 | 109e07c2 | Guido Trotter | - how often they vary (or we expect them to vary) |
292 | 109e07c2 | Guido Trotter | - how fast they are to query |
293 | 109e07c2 | Guido Trotter | - how important their freshness is |
294 | 109e07c2 | Guido Trotter | |
295 | 109e07c2 | Guido Trotter | Of course the last parameter is installation specific, and while we'll |
296 | 109e07c2 | Guido Trotter | try to have defaults, it will be configurable. The first two, instead, |
297 | 109e07c2 | Guido Trotter | can be used adaptively, so that a given resource is queried more or |
298 | 109e07c2 | Guido Trotter | less frequently. |
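One possible way to adapt the polling interval is sketched below in Python. This is not a decided algorithm: the halving and doubling policy, the factor of 10 on the query duration and all names and default bounds are assumptions made for the sake of the example.

```python
def next_interval(current, changed, query_duration,
                  min_interval=1.0, max_interval=300.0):
    """Return the number of seconds to wait before the next query.

    current        -- interval used for the last query, in seconds
    changed        -- whether the data differed from the cached copy
    query_duration -- how long the last query took, in seconds
    """
    # Poll faster when the resource is changing, slower when it is stable.
    proposed = current / 2 if changed else current * 2
    # Never spend more than about a tenth of the time on a slow query.
    floor = max(min_interval, 10 * query_duration)
    return min(max_interval, max(floor, proposed))


if __name__ == "__main__":
    print(next_interval(10.0, changed=True, query_duration=0.1))
    print(next_interval(10.0, changed=False, query_duration=0.1))
```

The installation-specific freshness requirement would then map onto ``max_interval``, which caps how stale the cached value of a resource can become.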
299 | 109e07c2 | Guido Trotter | |
300 | 109e07c2 | Guido Trotter | |
301 | 109e07c2 | Guido Trotter | Implementation place |
302 | 109e07c2 | Guido Trotter | -------------------- |
303 | 109e07c2 | Guido Trotter | |
304 | 109e07c2 | Guido Trotter | The status daemon will be implemented as a standalone Haskell daemon. In |
305 | 109e07c2 | Guido Trotter | the future it should be easy to merge multiple daemons into one with |
306 | 109e07c2 | Guido Trotter | multiple entry points, should we find out it saves resources and doesn't |
307 | 109e07c2 | Guido Trotter | impact functionality. |
308 | 109e07c2 | Guido Trotter | |
309 | 109e07c2 | Guido Trotter | The libekg library should be looked at for easily providing metrics in |
310 | 109e07c2 | Guido Trotter | JSON format. |
311 | 109e07c2 | Guido Trotter | |
312 | 109e07c2 | Guido Trotter | |
313 | 109e07c2 | Guido Trotter | Implementation order |
314 | 109e07c2 | Guido Trotter | -------------------- |
315 | 109e07c2 | Guido Trotter | |
316 | 109e07c2 | Guido Trotter | We will implement the agent system in this order: |
317 | 109e07c2 | Guido Trotter | |
318 | 9ef3e121 | Michele Tartara | - initial example data collectors (eg. for drbd and instance status. |
319 | 9ef3e121 | Michele Tartara | Data collector-specific report format TBD). |
320 | 109e07c2 | Guido Trotter | - initial daemon for exporting data |
321 | 109e07c2 | Guido Trotter | - RPC updates for instance status reasons and disk to instance mapping |
322 | 109e07c2 | Guido Trotter | - more data collectors |
323 | 109e07c2 | Guido Trotter | - cache layer for the daemon (if needed) |
324 | 109e07c2 | Guido Trotter | |
325 | 109e07c2 | Guido Trotter | |
326 | 109e07c2 | Guido Trotter | Future work |
327 | 109e07c2 | Guido Trotter | =========== |
328 | 109e07c2 | Guido Trotter | |
329 | 109e07c2 | Guido Trotter | As a future step it can be useful to "centralize" all this reporting |
330 | 109e07c2 | Guido Trotter | data in a single place. This for example can be just the master node, or |
331 | 109e07c2 | Guido Trotter | all the master candidates. We will evaluate doing this after the first |
332 | 109e07c2 | Guido Trotter | node-local version has been developed and tested. |
333 | 109e07c2 | Guido Trotter | |
334 | 109e07c2 | Guido Trotter | Another possible change is replacing the "read-only" RPCs with queries |
335 | 109e07c2 | Guido Trotter | to the agent system, thus having only one way of collecting information |
336 | 109e07c2 | Guido Trotter | from the nodes, both for a monitoring system and for Ganeti itself. |
337 | 109e07c2 | Guido Trotter | |
338 | 109e07c2 | Guido Trotter | One extra feature we may need is a way to query for only sub-parts of |
339 | 109e07c2 | Guido Trotter | the report (eg. instances status only). This can be done by passing |
340 | 109e07c2 | Guido Trotter | arguments to the HTTP GET, which will be defined when we get to this |
341 | 109e07c2 | Guido Trotter | functionality. |
342 | 109e07c2 | Guido Trotter | |
343 | 109e07c2 | Guido Trotter | Finally the autorepair system (see the |
344 | 109e07c2 | Guido Trotter | :doc:`autorepair system design <design-autorepair>`) can be expanded to use |
345 | 109e07c2 | Guido Trotter | the monitoring agent system as a source of information to decide which repairs it can perform. |
346 | 109e07c2 | Guido Trotter | |
347 | 109e07c2 | Guido Trotter | .. vim: set textwidth=72 : |
348 | 109e07c2 | Guido Trotter | .. Local Variables: |
349 | 109e07c2 | Guido Trotter | .. mode: rst |
350 | 109e07c2 | Guido Trotter | .. fill-column: 72 |
351 | 109e07c2 | Guido Trotter | .. End: |