root / doc / design-2.0.rst @ b166ef84
1 | 5c0c1eeb | Iustin Pop | ================= |
---|---|---|---|
2 | 5c0c1eeb | Iustin Pop | Ganeti 2.0 design |
3 | 5c0c1eeb | Iustin Pop | ================= |
4 | 5c0c1eeb | Iustin Pop | |
5 | 5c0c1eeb | Iustin Pop | This document describes the major changes in Ganeti 2.0 compared to |
6 | 5c0c1eeb | Iustin Pop | the 1.2 version. |
7 | 5c0c1eeb | Iustin Pop | |
8 | 5c0c1eeb | Iustin Pop | The 2.0 version will constitute a rewrite of the 'core' architecture, |
9 | 5c0c1eeb | Iustin Pop | paving the way for additional features in future 2.x versions. |
10 | 5c0c1eeb | Iustin Pop | |
11 | e0eb13de | Iustin Pop | .. contents:: :depth: 3 |
12 | 5c0c1eeb | Iustin Pop | |
13 | 5c0c1eeb | Iustin Pop | Objective |
14 | 5c0c1eeb | Iustin Pop | ========= |
15 | 5c0c1eeb | Iustin Pop | |
16 | 5c0c1eeb | Iustin Pop | Ganeti 1.2 has many scalability issues and restrictions due to its |
17 | 5c0c1eeb | Iustin Pop | roots as software for managing small and 'static' clusters. |
18 | 5c0c1eeb | Iustin Pop | |
19 | 5c0c1eeb | Iustin Pop | Version 2.0 will attempt to remedy first the scalability issues and |
20 | 5c0c1eeb | Iustin Pop | then the restrictions. |
21 | 5c0c1eeb | Iustin Pop | |
22 | 5c0c1eeb | Iustin Pop | Background |
23 | 5c0c1eeb | Iustin Pop | ========== |
24 | 5c0c1eeb | Iustin Pop | |
25 | 6c2d0b44 | Iustin Pop | While Ganeti 1.2 is usable, it severely limits the flexibility of the |
26 | 5c0c1eeb | Iustin Pop | cluster administration and imposes a very rigid model. It has the |
27 | 5c0c1eeb | Iustin Pop | following main scalability issues: |
28 | 5c0c1eeb | Iustin Pop | |
29 | 5c0c1eeb | Iustin Pop | - only one operation at a time on the cluster [#]_ |
30 | 5c0c1eeb | Iustin Pop | - poor handling of node failures in the cluster |
31 | 5c0c1eeb | Iustin Pop | - mixing hypervisors in a cluster not allowed |
32 | 5c0c1eeb | Iustin Pop | |
33 | 7faf5110 | Michael Hanselmann | It also has a number of artificial restrictions, due to historical |
34 | 7faf5110 | Michael Hanselmann | design: |
35 | 5c0c1eeb | Iustin Pop | |
36 | 5c0c1eeb | Iustin Pop | - fixed number of disks (two) per instance |
37 | 6c2d0b44 | Iustin Pop | - fixed number of NICs |
38 | 5c0c1eeb | Iustin Pop | |
39 | 5c0c1eeb | Iustin Pop | .. [#] Replace disks will release the lock, but this is an exception |
40 | 5c0c1eeb | Iustin Pop | and not a recommended way to operate |
41 | 5c0c1eeb | Iustin Pop | |
42 | 5c0c1eeb | Iustin Pop | The 2.0 version is intended to address some of these problems, and |
43 | 6c2d0b44 | Iustin Pop | create a more flexible code base for future developments. |
44 | 6c2d0b44 | Iustin Pop | |
45 | 6c2d0b44 | Iustin Pop | Among these problems, the single-operation-at-a-time restriction is |
46 | 6c2d0b44 | Iustin Pop | the biggest issue with the current version of Ganeti. It is such a big |
47 | 6c2d0b44 | Iustin Pop | impediment in operating bigger clusters that many times one is tempted |
48 | 6c2d0b44 | Iustin Pop | to remove the lock just to do a simple operation like start instance |
49 | 6c2d0b44 | Iustin Pop | while an OS installation is running. |
50 | 5c0c1eeb | Iustin Pop | |
51 | 5c0c1eeb | Iustin Pop | Scalability problems |
52 | 5c0c1eeb | Iustin Pop | -------------------- |
53 | 5c0c1eeb | Iustin Pop | |
54 | 5c0c1eeb | Iustin Pop | Ganeti 1.2 has a single global lock, which is used for all cluster |
55 | 5c0c1eeb | Iustin Pop | operations. This has been painful at various times, for example: |
56 | 5c0c1eeb | Iustin Pop | |
57 | 5c0c1eeb | Iustin Pop | - It is impossible for two people to efficiently interact with a cluster |
58 | 5c0c1eeb | Iustin Pop | (for example for debugging) at the same time. |
59 | 7faf5110 | Michael Hanselmann | - When batch jobs are running it's impossible to do other work (for |
60 | 7faf5110 | Michael Hanselmann | example failovers/fixes) on a cluster. |
61 | 5c0c1eeb | Iustin Pop | |
62 | 5c0c1eeb | Iustin Pop | This poses scalability problems: as clusters grow in node and instance |
63 | 5c0c1eeb | Iustin Pop | count, it's a lot more likely that operations which one would expect |
64 | 5c0c1eeb | Iustin Pop | to run in parallel (for example because they happen on different |
65 | 5c0c1eeb | Iustin Pop | nodes) are actually stalling each other while waiting for the global |
66 | 5c0c1eeb | Iustin Pop | lock, without a real reason for that to happen. |
67 | 5c0c1eeb | Iustin Pop | |
68 | 5c0c1eeb | Iustin Pop | One of the main causes of this global lock (beside the higher |
69 | 5c0c1eeb | Iustin Pop | difficulty of ensuring data consistency in a more granular lock model) |
70 | 6c2d0b44 | Iustin Pop | is the fact that currently there is no long-lived process in Ganeti |
71 | 6c2d0b44 | Iustin Pop | that can coordinate multiple operations. Each command tries to acquire |
72 | 6c2d0b44 | Iustin Pop | the so called *cmd* lock and when it succeeds, it takes complete |
73 | 6c2d0b44 | Iustin Pop | ownership of the cluster configuration and state. |
74 | 5c0c1eeb | Iustin Pop | |
75 | 5c0c1eeb | Iustin Pop | Other scalability problems are due to the design of the DRBD device |
76 | 5c0c1eeb | Iustin Pop | model, which assumed at its creation a low (one to four) number of |
77 | 5c0c1eeb | Iustin Pop | instances per node, which is no longer true with today's hardware. |
78 | 5c0c1eeb | Iustin Pop | |
79 | 5c0c1eeb | Iustin Pop | Artificial restrictions |
80 | 5c0c1eeb | Iustin Pop | ----------------------- |
81 | 5c0c1eeb | Iustin Pop | |
82 | 5c0c1eeb | Iustin Pop | Ganeti 1.2 (and previous versions) has a fixed two-disk, one-NIC per |
83 | 5c0c1eeb | Iustin Pop | instance model. This is a purely artificial restriction, but it |
84 | 5c0c1eeb | Iustin Pop | touches so many areas (configuration, import/export, command line) |
85 | 5c0c1eeb | Iustin Pop | that removing it is better suited to a major release than a minor one. |
86 | 5c0c1eeb | Iustin Pop | |
87 | 6c2d0b44 | Iustin Pop | Architecture issues |
88 | 6c2d0b44 | Iustin Pop | ------------------- |
89 | 6c2d0b44 | Iustin Pop | |
90 | 6c2d0b44 | Iustin Pop | The fact that each command is a separate process that reads the |
91 | 6c2d0b44 | Iustin Pop | cluster state, executes the command, and saves the new state is also |
92 | 6c2d0b44 | Iustin Pop | an issue on big clusters where the configuration data for the cluster |
93 | 6c2d0b44 | Iustin Pop | begins to be non-trivial in size. |
94 | 6c2d0b44 | Iustin Pop | |
95 | 5c0c1eeb | Iustin Pop | Overview |
96 | 5c0c1eeb | Iustin Pop | ======== |
97 | 5c0c1eeb | Iustin Pop | |
98 | 5c0c1eeb | Iustin Pop | In order to solve the scalability problems, a rewrite of the core |
99 | 5c0c1eeb | Iustin Pop | design of Ganeti is required. While the cluster operations themselves |
100 | 5c0c1eeb | Iustin Pop | won't change (e.g. start instance will do the same things), the way |
101 | 5c0c1eeb | Iustin Pop | these operations are scheduled internally will change radically. |
102 | 5c0c1eeb | Iustin Pop | |
103 | f86e82ef | Iustin Pop | The new design will change the cluster architecture to: |
104 | f86e82ef | Iustin Pop | |
105 | f86e82ef | Iustin Pop | .. image:: arch-2.0.png |
106 | f86e82ef | Iustin Pop | |
107 | f86e82ef | Iustin Pop | This differs from the 1.2 architecture by the addition of the master |
108 | f86e82ef | Iustin Pop | daemon, which will be the only entity to talk to the node daemons. |
109 | f86e82ef | Iustin Pop | |
110 | f86e82ef | Iustin Pop | |
111 | 5c0c1eeb | Iustin Pop | Detailed design |
112 | 5c0c1eeb | Iustin Pop | =============== |
113 | 5c0c1eeb | Iustin Pop | |
114 | 5c0c1eeb | Iustin Pop | The changes for 2.0 can be split into roughly three areas: |
115 | 5c0c1eeb | Iustin Pop | |
116 | 5c0c1eeb | Iustin Pop | - core changes that affect the design of the software |
117 | 5c0c1eeb | Iustin Pop | - features (or restriction removals) which do not have a wide |
118 | 5c0c1eeb | Iustin Pop | impact on the design |
119 | 5c0c1eeb | Iustin Pop | - user-level and API-level changes which translate into differences for |
120 | 5c0c1eeb | Iustin Pop | the operation of the cluster |
121 | 5c0c1eeb | Iustin Pop | |
122 | 5c0c1eeb | Iustin Pop | Core changes |
123 | 5c0c1eeb | Iustin Pop | ------------ |
124 | 5c0c1eeb | Iustin Pop | |
125 | 5c0c1eeb | Iustin Pop | The main changes will be switching from a per-process model to a |
126 | 5c0c1eeb | Iustin Pop | daemon based model, where the individual gnt-* commands will be |
127 | 6c2d0b44 | Iustin Pop | clients that talk to this daemon (see `Master daemon`_). This will |
128 | 6c2d0b44 | Iustin Pop | allow us to get rid of the global cluster lock for most operations, |
129 | 6c2d0b44 | Iustin Pop | having instead a per-object lock (see `Granular locking`_). Also, the |
130 | 6c2d0b44 | Iustin Pop | daemon will be able to queue jobs, and this will allow the individual |
131 | 6c2d0b44 | Iustin Pop | clients to submit jobs without waiting for them to finish, and also |
132 | 6c2d0b44 | Iustin Pop | see the result of old requests (see `Job Queue`_). |
133 | 5c0c1eeb | Iustin Pop | |
134 | 5c0c1eeb | Iustin Pop | Beside these major changes, another 'core' change, one that will not be |
135 | 5c0c1eeb | Iustin Pop | as visible to the users, will be changing the model of object attribute |
136 | 6c2d0b44 | Iustin Pop | storage and separating it into namespaces (such that a Xen PVM |
137 | 5c0c1eeb | Iustin Pop | instance will not have the Xen HVM parameters). This will allow future |
138 | 6c2d0b44 | Iustin Pop | flexibility in defining additional parameters. For more details see |
139 | 6c2d0b44 | Iustin Pop | `Object parameters`_. |
140 | 5c0c1eeb | Iustin Pop | |
141 | 5c0c1eeb | Iustin Pop | The various changes brought in by the master daemon model and the |
142 | 5c0c1eeb | Iustin Pop | read-write RAPI will require changes to the cluster security; we move |
143 | 6c2d0b44 | Iustin Pop | away from Twisted and use HTTP(S) for intra- and extra-cluster |
144 | 5c0c1eeb | Iustin Pop | communications. For more details, see the security document in the |
145 | 5c0c1eeb | Iustin Pop | doc/ directory. |
146 | 5c0c1eeb | Iustin Pop | |
147 | 5c0c1eeb | Iustin Pop | Master daemon |
148 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~ |
149 | 5c0c1eeb | Iustin Pop | |
150 | 5c0c1eeb | Iustin Pop | In Ganeti 2.0, we will have the following *entities*: |
151 | 5c0c1eeb | Iustin Pop | |
152 | 5c0c1eeb | Iustin Pop | - the master daemon (on the master node) |
153 | 5c0c1eeb | Iustin Pop | - the node daemon (on all nodes) |
154 | 5c0c1eeb | Iustin Pop | - the command line tools (on the master node) |
155 | 5c0c1eeb | Iustin Pop | - the RAPI daemon (on the master node) |
156 | 5c0c1eeb | Iustin Pop | |
157 | 6c2d0b44 | Iustin Pop | The master-daemon related interaction paths are: |
158 | 5c0c1eeb | Iustin Pop | |
159 | 7faf5110 | Michael Hanselmann | - (CLI tools/RAPI daemon) and the master daemon, via the so called |
160 | 7faf5110 | Michael Hanselmann | *LUXI* API |
161 | 5c0c1eeb | Iustin Pop | - the master daemon and the node daemons, via the node RPC |
162 | 5c0c1eeb | Iustin Pop | |
163 | 6c2d0b44 | Iustin Pop | There are also some additional interaction paths for exceptional cases: |
164 | 6c2d0b44 | Iustin Pop | |
165 | 6c2d0b44 | Iustin Pop | - CLI tools might access the nodes via SSH (for ``gnt-cluster copyfile`` |
166 | 6c2d0b44 | Iustin Pop | and ``gnt-cluster command``) |
167 | 6c2d0b44 | Iustin Pop | - master failover is a special case in which a non-master node will SSH |
168 | 6c2d0b44 | Iustin Pop | and do node-RPC calls to the current master |
169 | 6c2d0b44 | Iustin Pop | |
170 | 5c0c1eeb | Iustin Pop | The protocol between the master daemon and the node daemons will be |
171 | 6c2d0b44 | Iustin Pop | changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S), |
172 | 6c2d0b44 | Iustin Pop | using a simple PUT/GET of JSON-encoded messages. This is done due to |
173 | 6c2d0b44 | Iustin Pop | difficulties in working with the Twisted framework and its protocols |
174 | 6c2d0b44 | Iustin Pop | in a multithreaded environment, which we can overcome by using a |
175 | 6c2d0b44 | Iustin Pop | simpler stack (see the caveats section). |
176 | 6c2d0b44 | Iustin Pop | |
177 | 6c2d0b44 | Iustin Pop | The protocol between the CLI/RAPI and the master daemon will be a |
178 | 6c2d0b44 | Iustin Pop | custom one (called *LUXI*): on a UNIX socket on the master node, with |
179 | 6c2d0b44 | Iustin Pop | rights restricted by filesystem permissions, the CLI/RAPI will talk to |
180 | 6c2d0b44 | Iustin Pop | the master daemon using JSON-encoded messages. |
181 | 5c0c1eeb | Iustin Pop | |
182 | 5c0c1eeb | Iustin Pop | The operations supported over this internal protocol will be encoded |
183 | 5c0c1eeb | Iustin Pop | via a python library that will expose a simple API for its |
184 | 5c0c1eeb | Iustin Pop | users. Internally, the protocol will simply encode all objects in JSON |
185 | 5c0c1eeb | Iustin Pop | format and decode them on the receiver side. |
186 | 5c0c1eeb | Iustin Pop | |
187 | 6c2d0b44 | Iustin Pop | For more details about the RAPI daemon see `Remote API changes`_, and |
188 | 6c2d0b44 | Iustin Pop | for the node daemon see `Node daemon changes`_. |
189 | 6c2d0b44 | Iustin Pop | |
190 | 5c0c1eeb | Iustin Pop | The LUXI protocol |
191 | 5c0c1eeb | Iustin Pop | +++++++++++++++++ |
192 | 5c0c1eeb | Iustin Pop | |
193 | 6c2d0b44 | Iustin Pop | As described above, the protocol for making requests or queries to the |
194 | 6c2d0b44 | Iustin Pop | master daemon will be a UNIX-socket based simple RPC of JSON-encoded |
195 | 6c2d0b44 | Iustin Pop | messages. |
196 | 6c2d0b44 | Iustin Pop | |
197 | 6c2d0b44 | Iustin Pop | A UNIX socket was chosen in order to remove the need for |
198 | 6c2d0b44 | Iustin Pop | authentication and authorisation inside Ganeti; for 2.0, the |
199 | 6c2d0b44 | Iustin Pop | permissions on the UNIX socket itself will determine the access |
200 | 6c2d0b44 | Iustin Pop | rights. |
201 | 6c2d0b44 | Iustin Pop | |
202 | 6c2d0b44 | Iustin Pop | We will have two main classes of operations over this API: |
203 | 5c0c1eeb | Iustin Pop | |
204 | 5c0c1eeb | Iustin Pop | - cluster query functions |
205 | 5c0c1eeb | Iustin Pop | - job related functions |
206 | 5c0c1eeb | Iustin Pop | |
207 | 5c0c1eeb | Iustin Pop | The cluster query functions are usually short-duration, and are the |
208 | 6c2d0b44 | Iustin Pop | equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are |
209 | 5c0c1eeb | Iustin Pop | internally implemented still with these opcodes). The clients are |
210 | 5c0c1eeb | Iustin Pop | guaranteed to receive the response in a reasonable time via a timeout. |
211 | 5c0c1eeb | Iustin Pop | |
212 | 5c0c1eeb | Iustin Pop | The job-related functions will be: |
213 | 5c0c1eeb | Iustin Pop | |
214 | 5c0c1eeb | Iustin Pop | - submit job |
215 | 5c0c1eeb | Iustin Pop | - query job (which could also be categorized in the query-functions) |
216 | 5c0c1eeb | Iustin Pop | - archive job (see the job queue design doc) |
217 | 5c0c1eeb | Iustin Pop | - wait for job change, which allows a client to wait without polling |
218 | 5c0c1eeb | Iustin Pop | |
219 | 6c2d0b44 | Iustin Pop | For more details of the actual operation list, see the `Job Queue`_. |
220 | 5c0c1eeb | Iustin Pop | |
221 | 6c2d0b44 | Iustin Pop | Both requests and responses will consist of a JSON-encoded message |
222 | 6c2d0b44 | Iustin Pop | followed by the ``ETX`` character (ASCII decimal 3), which is not a |
223 | 6c2d0b44 | Iustin Pop | valid character in JSON messages and thus can serve as a message |
224 | 6c2d0b44 | Iustin Pop | delimiter. The contents of the messages will be a dictionary with two |
225 | 6c2d0b44 | Iustin Pop | fields: |
226 | 6c2d0b44 | Iustin Pop | |
227 | 6c2d0b44 | Iustin Pop | :method: |
228 | 6c2d0b44 | Iustin Pop | the name of the method called |
229 | 6c2d0b44 | Iustin Pop | :args: |
230 | 6c2d0b44 | Iustin Pop | the arguments to the method, as a list (no keyword arguments allowed) |
231 | 6c2d0b44 | Iustin Pop | |
232 | 6c2d0b44 | Iustin Pop | Responses will follow the same format, with the two fields being: |
233 | 6c2d0b44 | Iustin Pop | |
234 | 6c2d0b44 | Iustin Pop | :success: |
235 | 6c2d0b44 | Iustin Pop | a boolean denoting the success of the operation |
236 | 6c2d0b44 | Iustin Pop | :result: |
237 | 6c2d0b44 | Iustin Pop | the actual result, or error message in case of failure |
238 | 6c2d0b44 | Iustin Pop | |
239 | 6c2d0b44 | Iustin Pop | There are two special values for the result field: |
240 | 6c2d0b44 | Iustin Pop | |
241 | 6c2d0b44 | Iustin Pop | - in the case that the operation failed, and this field is a list of |
242 | 7faf5110 | Michael Hanselmann | length two, the client library will try to interpret it as an |
243 | 7faf5110 | Michael Hanselmann | exception, the first element being the exception type and the second |
244 | 7faf5110 | Michael Hanselmann | one the actual exception arguments; this will allow a simple method of |
245 | 7faf5110 | Michael Hanselmann | passing Ganeti-related exceptions across the interface |
246 | 6c2d0b44 | Iustin Pop | - for the *WaitForChange* call (that waits on the server for a job to |
247 | 6c2d0b44 | Iustin Pop | change status), if the result is equal to ``nochange`` instead of the |
248 | 6c2d0b44 | Iustin Pop | usual result for this call (a list of changes), then the library will |
249 | 6c2d0b44 | Iustin Pop | internally retry the call; this is done in order to differentiate |
250 | 6c2d0b44 | Iustin Pop | between a hung master daemon and a job that has simply not changed |
251 | 6c2d0b44 | Iustin Pop | |
252 | 6c2d0b44 | Iustin Pop | Users of the API that don't use the provided python library should |
253 | 6c2d0b44 | Iustin Pop | take care of the above two cases. |
254 | 6c2d0b44 | Iustin Pop | |
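The message framing and the two special cases above can be sketched in a few lines of Python. This is only an illustration of the wire format described in this section, not the client library itself: the names ``encode_request``, ``split_messages``, ``decode_response``, ``RemoteError`` and ``wait_for_change`` are hypothetical, and the retry limit is an assumption that the design does not specify.

```python
import json

ETX = "\x03"  # ASCII 3, the message delimiter described above


def encode_request(method, args):
    """Serialize a request dictionary and append the ETX delimiter."""
    return json.dumps({"method": method, "args": list(args)}) + ETX


def split_messages(buffer):
    """Split a receive buffer into complete messages and the unfinished rest."""
    parts = buffer.split(ETX)
    return parts[:-1], parts[-1]


class RemoteError(Exception):
    """Hypothetical stand-in for an exception rebuilt on the client side."""

    def __init__(self, exc_type, exc_args):
        Exception.__init__(self, exc_type, exc_args)
        self.exc_type = exc_type
        self.exc_args = exc_args


def decode_response(message):
    """Return the result of a response, or raise if the call failed."""
    data = json.loads(message)
    result = data["result"]
    if data["success"]:
        return result
    # A failed call carrying a two-element list is read as (type, arguments).
    if isinstance(result, list) and len(result) == 2:
        raise RemoteError(result[0], result[1])
    raise RuntimeError(result)


def wait_for_change(call, max_retries=10):
    """Repeat a WaitForChange-style call while the answer is "nochange".

    The retry limit is an assumption made for this sketch.
    """
    for _ in range(max_retries):
        result = call()
        if result != "nochange":
            return result
    raise TimeoutError("job did not change")
```

A client that does not use the provided library would need equivalents of ``decode_response`` and ``wait_for_change`` to cover the two cases listed above.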
255 | 6c2d0b44 | Iustin Pop | |
256 | 6c2d0b44 | Iustin Pop | Master daemon implementation |
257 | 6c2d0b44 | Iustin Pop | ++++++++++++++++++++++++++++ |
258 | 5c0c1eeb | Iustin Pop | |
259 | 5c0c1eeb | Iustin Pop | The daemon will be based around a main I/O thread that will wait for |
260 | 5c0c1eeb | Iustin Pop | new requests from the clients, and that does the setup/shutdown of the |
261 | 5c0c1eeb | Iustin Pop | other thread (pools). |
262 | 5c0c1eeb | Iustin Pop | |
263 | 5c0c1eeb | Iustin Pop | There will be two other classes of threads in the daemon: |
264 | 5c0c1eeb | Iustin Pop | |
265 | 5c0c1eeb | Iustin Pop | - job processing threads, part of a thread pool, and which are |
266 | 5c0c1eeb | Iustin Pop | long-lived, started at daemon startup and terminated only at shutdown |
267 | 5c0c1eeb | Iustin Pop | time |
268 | 5c0c1eeb | Iustin Pop | - client I/O threads, which are the ones that talk the local protocol |
269 | 6c2d0b44 | Iustin Pop | (LUXI) to the clients, and are short-lived |
270 | 5c0c1eeb | Iustin Pop | |
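As a rough illustration of this thread layout, the sketch below uses a long-lived pool of job workers fed by short-lived client threads, with the main thread doing setup and shutdown. All names (``JobPool``, ``handle_client``, ``main_loop``) are hypothetical, and the "job" is a placeholder that merely upper-cases the request; the real daemon would read requests from the LUXI socket.

```python
import queue
import threading


class JobPool:
    """Long-lived worker threads, started once and stopped at shutdown."""

    def __init__(self, num_workers):
        self._jobs = queue.Queue()
        self._workers = [threading.Thread(target=self._run)
                         for _ in range(num_workers)]
        for worker in self._workers:
            worker.start()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:  # shutdown marker
                return
            job()

    def submit(self, job):
        self._jobs.put(job)

    def shutdown(self):
        # One marker per worker; queued jobs are processed first (FIFO).
        for _ in self._workers:
            self._jobs.put(None)
        for worker in self._workers:
            worker.join()


def handle_client(pool, request, results):
    """Short-lived client thread body: hand the request to the job pool."""
    pool.submit(lambda: results.append(request.upper()))


def main_loop(requests, num_workers=2):
    """Main thread: set up the pool, spawn one thread per client, shut down."""
    pool = JobPool(num_workers)
    results = []
    clients = [threading.Thread(target=handle_client, args=(pool, r, results))
               for r in requests]
    for client in clients:
        client.start()
    for client in clients:
        client.join()
    pool.shutdown()
    return results
```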
271 | 5c0c1eeb | Iustin Pop | Master startup/failover |
272 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
273 | 5c0c1eeb | Iustin Pop | |
274 | 5c0c1eeb | Iustin Pop | In Ganeti 1.x there is no protection against failing over the master |
275 | 5c0c1eeb | Iustin Pop | to a node with stale configuration. In effect, the responsibility of |
276 | 5c0c1eeb | Iustin Pop | correct failovers falls on the admin. This is true both for the new |
277 | 5c0c1eeb | Iustin Pop | master and for when an old, offline master starts up. |
278 | 5c0c1eeb | Iustin Pop | |
279 | 5c0c1eeb | Iustin Pop | Since in 2.x we are extending the cluster state to cover the job queue |
280 | 5c0c1eeb | Iustin Pop | and have a daemon that will execute by itself the job queue, we want |
281 | 5c0c1eeb | Iustin Pop | to have more resilience for the master role. |
282 | 5c0c1eeb | Iustin Pop | |
283 | 5c0c1eeb | Iustin Pop | The following algorithm will happen whenever a node is ready to |
284 | 5c0c1eeb | Iustin Pop | transition to the master role, either at startup time or at node |
285 | 5c0c1eeb | Iustin Pop | failover: |
286 | 5c0c1eeb | Iustin Pop | |
287 | 5c0c1eeb | Iustin Pop | #. read the configuration file and parse the node list |
288 | 5c0c1eeb | Iustin Pop | contained within |
289 | 5c0c1eeb | Iustin Pop | |
290 | 5c0c1eeb | Iustin Pop | #. query all the nodes and make sure we obtain an agreement via |
291 | 5c0c1eeb | Iustin Pop | a quorum of at least half plus one nodes for the following: |
292 | 5c0c1eeb | Iustin Pop | |
293 | 5c0c1eeb | Iustin Pop | - we have the latest configuration and job list (as |
294 | 5c0c1eeb | Iustin Pop | determined by the serial number on the configuration and |
295 | 5c0c1eeb | Iustin Pop | highest job ID on the job queue) |
296 | 5c0c1eeb | Iustin Pop | |
297 | 5c0c1eeb | Iustin Pop | - there is not even a single node having a newer |
298 | 5c0c1eeb | Iustin Pop | configuration file |
299 | 5c0c1eeb | Iustin Pop | |
300 | 5c0c1eeb | Iustin Pop | - if we are not failing over (but just starting), the |
301 | 5c0c1eeb | Iustin Pop | quorum agrees that we are the designated master |
302 | 5c0c1eeb | Iustin Pop | |
303 | 6c2d0b44 | Iustin Pop | - if any of the above is false, we prevent the current operation |
304 | 6c2d0b44 | Iustin Pop | (i.e. we don't become the master) |
305 | 6c2d0b44 | Iustin Pop | |
306 | 5c0c1eeb | Iustin Pop | #. at this point, the node transitions to the master role |
307 | 5c0c1eeb | Iustin Pop | |
308 | 5c0c1eeb | Iustin Pop | #. for all the in-progress jobs, mark them as failed, with |
309 | 5c0c1eeb | Iustin Pop | reason unknown or something similar (master failed, etc.) |
310 | 5c0c1eeb | Iustin Pop | |
311 | 6c2d0b44 | Iustin Pop | Since due to exceptional conditions we could have a situation in which |
312 | 6c2d0b44 | Iustin Pop | no node can become the master due to inconsistent data, we will have |
313 | 6c2d0b44 | Iustin Pop | an override switch for the master daemon startup that will assume the |
314 | 6c2d0b44 | Iustin Pop | current node has the right data and will replicate all the |
315 | 6c2d0b44 | Iustin Pop | configuration files to the other nodes. |
316 | 6c2d0b44 | Iustin Pop | |
317 | 6c2d0b44 | Iustin Pop | **Note**: the above algorithm is by no means an election algorithm; it |
318 | 6c2d0b44 | Iustin Pop | is a *confirmation* of the master role currently held by a node. |
319 | 5c0c1eeb | Iustin Pop | |
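The confirmation step can be sketched as a single function. This is one possible reading of the algorithm above, under stated assumptions: every reachable node reports its configuration serial number, its highest job ID and the node it believes to be master, and our own node is counted among the answers. The function name and argument layout are hypothetical.

```python
def can_become_master(my_name, my_serial, my_top_job, node_count, answers,
                      failing_over=False):
    """Confirm the master role against the answers of the cluster nodes.

    answers holds one (config_serial, top_job_id, believed_master) tuple
    per node that replied; nodes that could not be reached are missing.
    """
    quorum = node_count // 2 + 1  # at least half plus one nodes
    # A single node with a newer configuration vetoes the transition.
    if any(serial > my_serial for serial, _, _ in answers):
        return False
    agreeing = 0
    for _, top_job, believed_master in answers:
        data_ok = top_job <= my_top_job
        # When failing over, the other nodes still name the old master.
        role_ok = failing_over or believed_master == my_name
        if data_ok and role_ok:
            agreeing += 1
    return agreeing >= quorum
```

The override switch mentioned above would simply bypass this check; it is not modelled here.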
320 | 5c0c1eeb | Iustin Pop | Logging |
321 | 5c0c1eeb | Iustin Pop | +++++++ |
322 | 5c0c1eeb | Iustin Pop | |
323 | 6c2d0b44 | Iustin Pop | The logging system will be switched completely to the standard python |
324 | 6c2d0b44 | Iustin Pop | logging module; currently it's logging-based, but exposes a different |
325 | 6c2d0b44 | Iustin Pop | API, which is just overhead. As such, the code will be switched over |
326 | 6c2d0b44 | Iustin Pop | to standard logging calls, and only the setup will be custom. |
327 | 5c0c1eeb | Iustin Pop | |
328 | 5c0c1eeb | Iustin Pop | With this change, we will remove the separate debug/info/error logs, |
329 | 5c0c1eeb | Iustin Pop | and instead always have one logfile per daemon: |
330 | 5c0c1eeb | Iustin Pop | |
331 | 5c0c1eeb | Iustin Pop | - master-daemon.log for the master daemon |
332 | 5c0c1eeb | Iustin Pop | - node-daemon.log for the node daemon (this is the same as in 1.2) |
333 | 5c0c1eeb | Iustin Pop | - rapi-daemon.log for the RAPI daemon logs |
334 | 5c0c1eeb | Iustin Pop | - rapi-access.log, an additional log file for the RAPI that will be |
335 | 6c2d0b44 | Iustin Pop | in the standard HTTP log format for possible parsing by other tools |
336 | 6c2d0b44 | Iustin Pop | |
337 | e2078d28 | Iustin Pop | Since the :term:`watcher` will only submit jobs to the master for |
338 | e2078d28 | Iustin Pop | startup of the instances, its log file will contain less information |
339 | e2078d28 | Iustin Pop | than before, mainly that it will start the instance, but not the |
340 | e2078d28 | Iustin Pop | results. |
341 | 6c2d0b44 | Iustin Pop | |
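A minimal sketch of such a custom setup on top of the standard ``logging`` module follows. The function name ``setup_logging``, the log directory argument and the message format are assumptions made for illustration; only the "one file per daemon, standard logging calls everywhere else" idea comes from the text above.

```python
import logging
import os


def setup_logging(log_dir, daemon_name, debug=False):
    """Attach a single log file to the logger of one daemon."""
    logger = logging.getLogger(daemon_name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    handler = logging.FileHandler(os.path.join(log_dir, daemon_name + ".log"))
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

After this one-time setup, code would use plain calls such as ``logging.getLogger("master-daemon").info(...)``; debug, info and error messages all end up in the same file.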
342 | 6c2d0b44 | Iustin Pop | Node daemon changes |
343 | 6c2d0b44 | Iustin Pop | +++++++++++++++++++ |
344 | 6c2d0b44 | Iustin Pop | |
345 | 6c2d0b44 | Iustin Pop | The only change to the node daemon is that, since we need better |
346 | 6c2d0b44 | Iustin Pop | concurrency, we don't process the inter-node RPC calls in the node |
347 | 6c2d0b44 | Iustin Pop | daemon itself, but we fork and process each request in a separate |
348 | 6c2d0b44 | Iustin Pop | child. |
349 | 5c0c1eeb | Iustin Pop | |
350 | 6c2d0b44 | Iustin Pop | Since we don't have many calls, and we only fork (not exec), the |
351 | 6c2d0b44 | Iustin Pop | overhead should be minimal. |
352 | 5c0c1eeb | Iustin Pop | |
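The fork-without-exec idea can be illustrated as follows. This is a simplified sketch, not the node daemon code: ``handle_in_child`` is a hypothetical helper, the result is passed back through a pipe as JSON, and the parent waits for the child, whereas a real daemon would go on accepting requests while children run.

```python
import json
import os


def handle_in_child(request, handler):
    """Fork, run handler(request) in the child, and return its JSON result."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute the answer, send it through the pipe, and exit
        # immediately so that no parent cleanup code runs twice.
        status = 1
        try:
            os.close(read_fd)
            with os.fdopen(write_fd, "w") as pipe:
                json.dump(handler(request), pipe)
            status = 0
        finally:
            os._exit(status)
    # Parent: read the answer, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        payload = pipe.read()
    _, wait_status = os.waitpid(pid, 0)
    if wait_status != 0:
        raise RuntimeError("request failed in child process")
    return json.loads(payload)
```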
353 | 5c0c1eeb | Iustin Pop | Caveats |
354 | 5c0c1eeb | Iustin Pop | +++++++ |
355 | 5c0c1eeb | Iustin Pop | |
356 | 5c0c1eeb | Iustin Pop | A discussed alternative is to keep the current individual processes |
357 | 5c0c1eeb | Iustin Pop | touching the cluster configuration model. The reasons we have not |
358 | 5c0c1eeb | Iustin Pop | chosen this approach are: |
359 | 5c0c1eeb | Iustin Pop | |
360 | 5c0c1eeb | Iustin Pop | - the time needed to read and unserialize the cluster state |
361 | 5c0c1eeb | Iustin Pop | today is not small enough that we can ignore it; the addition of |
362 | 5c0c1eeb | Iustin Pop | the job queue will make the startup cost even higher. While this |
363 | 5c0c1eeb | Iustin Pop | runtime cost is low, it can be on the order of a few seconds on |
364 | 5c0c1eeb | Iustin Pop | bigger clusters, which for very quick commands is comparable to |
365 | 5c0c1eeb | Iustin Pop | the actual duration of the computation itself |
366 | 5c0c1eeb | Iustin Pop | |
367 | 5c0c1eeb | Iustin Pop | - individual commands would make it harder to implement a |
368 | 5c0c1eeb | Iustin Pop | fire-and-forget job request, along the lines of "start this |
369 | 5c0c1eeb | Iustin Pop | instance but do not wait for it to finish"; it would require a |
370 | 5c0c1eeb | Iustin Pop | model of backgrounding the operation and other things that are |
371 | 5c0c1eeb | Iustin Pop | much better served by a daemon-based model |
372 | 5c0c1eeb | Iustin Pop | |
373 | 5c0c1eeb | Iustin Pop | Another area of discussion is moving away from Twisted in this new |
374 | 6c2d0b44 | Iustin Pop | implementation. While Twisted has its advantages, there are also many |
375 | 6c2d0b44 | Iustin Pop | disadvantages to using it: |
376 | 5c0c1eeb | Iustin Pop | |
377 | 5c0c1eeb | Iustin Pop | - first and foremost, it's not a library, but a framework; thus, if |
378 | 6c2d0b44 | Iustin Pop | you use Twisted, all the code needs to be 'twisted-ized' and written |
379 | 6c2d0b44 | Iustin Pop | in an asynchronous manner, using deferreds; while this method works, |
380 | 6c2d0b44 | Iustin Pop | it's not a common way to code and it requires that the entire process |
381 | 6c2d0b44 | Iustin Pop | workflow is based around a single *reactor* (Twisted name for a main |
382 | 6c2d0b44 | Iustin Pop | loop) |
383 | 6c2d0b44 | Iustin Pop | - the more advanced granular locking that we want to implement would |
384 | 6c2d0b44 | Iustin Pop | require, if written in the async-manner, deep integration with the |
385 | 6c2d0b44 | Iustin Pop | Twisted stack, to such an extent that business-logic is inseparable |
386 | 7faf5110 | Michael Hanselmann | from the protocol coding; we felt that this is an unreasonable |
387 | 7faf5110 | Michael Hanselmann | request, and that a good protocol library should allow complete |
388 | 7faf5110 | Michael Hanselmann | separation of low-level protocol calls and business logic; by |
389 | 7faf5110 | Michael Hanselmann | comparison, the threaded approach combined with the HTTPS protocol |
390 | 7faf5110 | Michael Hanselmann | required (for the first iteration) absolutely no changes from the 1.2 |
391 | 7faf5110 | Michael Hanselmann | code, and later changes for optimizing the inter-node RPC calls |
392 | 7faf5110 | Michael Hanselmann | required just syntactic changes (e.g. ``rpc.call_...`` to |
393 | 7faf5110 | Michael Hanselmann | ``self.rpc.call_...``) |
394 | 6c2d0b44 | Iustin Pop | |
395 | 6c2d0b44 | Iustin Pop | Another issue is with the Twisted API stability - during the Ganeti |
396 | 6c2d0b44 | Iustin Pop | 1.x lifetime, we had to implement workarounds many times for changes |
397 | 6c2d0b44 | Iustin Pop | in the Twisted version, so that for example 1.2 is able to use both |
398 | 6c2d0b44 | Iustin Pop | Twisted 2.x and 8.x. |
399 | 6c2d0b44 | Iustin Pop | |
400 | 6c2d0b44 | Iustin Pop | In the end, since we already had an HTTP server library for the RAPI, |
401 | 6c2d0b44 | Iustin Pop | we just reused that for inter-node communication. |
402 | 5c0c1eeb | Iustin Pop | |
403 | 5c0c1eeb | Iustin Pop | |
404 | 5c0c1eeb | Iustin Pop | Granular locking |
405 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~ |
406 | 5c0c1eeb | Iustin Pop | |
407 | 7faf5110 | Michael Hanselmann | We want to make sure that multiple operations can run in parallel on a |
408 | 7faf5110 | Michael Hanselmann | Ganeti Cluster. In order for this to happen we need to make sure |
409 | 7faf5110 | Michael Hanselmann | concurrently run operations don't step on each other's toes and break the |
410 | 7faf5110 | Michael Hanselmann | cluster. |
411 | 5c0c1eeb | Iustin Pop | |
412 | 5c0c1eeb | Iustin Pop | This design addresses how we are going to deal with locking so that: |
413 | 5c0c1eeb | Iustin Pop | |
414 | 6c2d0b44 | Iustin Pop | - we preserve data coherency |
415 | 6c2d0b44 | Iustin Pop | - we prevent deadlocks |
416 | 6c2d0b44 | Iustin Pop | - we prevent job starvation |
417 | 5c0c1eeb | Iustin Pop | |
418 | 7faf5110 | Michael Hanselmann | Reaching the maximum possible parallelism is a Non-Goal. We have |
419 | 7faf5110 | Michael Hanselmann | identified a set of operations that are currently bottlenecks and need |
420 | 7faf5110 | Michael Hanselmann | to be parallelised and have worked on those. In the future it will be |
421 | 7faf5110 | Michael Hanselmann | possible to address other needs, thus making the cluster more and more |
422 | 7faf5110 | Michael Hanselmann | parallel one step at a time. |
423 | 5c0c1eeb | Iustin Pop | |
424 | 6c2d0b44 | Iustin Pop | This section only talks about parallelising Ganeti level operations, aka |
425 | 7faf5110 | Michael Hanselmann | Logical Units, and the locking needed for that. Any other |
426 | 7faf5110 | Michael Hanselmann | synchronization lock needed internally by the code is outside its scope. |
427 | 5c0c1eeb | Iustin Pop | |
428 | 6c2d0b44 | Iustin Pop | Library details |
429 | 6c2d0b44 | Iustin Pop | +++++++++++++++ |
430 | 5c0c1eeb | Iustin Pop | |
431 | 5c0c1eeb | Iustin Pop | The proposed library has these features: |
432 | 5c0c1eeb | Iustin Pop | |
433 | 7faf5110 | Michael Hanselmann | - internally managing all the locks, keeping the implementation |
434 | 7faf5110 | Michael Hanselmann | transparent to their users |
435 | 7faf5110 | Michael Hanselmann | - automatically grabbing multiple locks in the right order (avoid |
436 | 7faf5110 | Michael Hanselmann | deadlock) |
437 | 6c2d0b44 | Iustin Pop | - ability to transparently handle conversion to more granularity |
438 | 6c2d0b44 | Iustin Pop | - support asynchronous operation (future goal) |
439 | 6c2d0b44 | Iustin Pop | |
440 | 6c2d0b44 | Iustin Pop | Locking will be valid only on the master node and will not be a |
441 | 6c2d0b44 | Iustin Pop | distributed operation. Therefore, in case of master failure, the |
442 | 6c2d0b44 | Iustin Pop | operations currently running will be aborted and the locks will be |
443 | 6c2d0b44 | Iustin Pop | lost; it is left to the administrator to clean up (if needed) the |
444 | 6c2d0b44 | Iustin Pop | operation result (e.g. make sure an instance is either installed |
445 | 6c2d0b44 | Iustin Pop | correctly or removed). |
446 | 6c2d0b44 | Iustin Pop | |
447 | 6c2d0b44 | Iustin Pop | A corollary of this is that a master-failover operation with both |
448 | 6c2d0b44 | Iustin Pop | masters alive needs to happen while no operations are running, and |
449 | 6c2d0b44 | Iustin Pop | therefore no locks are held. |
450 | 6c2d0b44 | Iustin Pop | |
451 | 6c2d0b44 | Iustin Pop | All the locks will be represented by objects (like |
452 | 6c2d0b44 | Iustin Pop | ``lockings.SharedLock``), and the individual locks for each object |
453 | 6c2d0b44 | Iustin Pop | will be created at initialisation time, from the config file. |
454 | 6c2d0b44 | Iustin Pop | |
455 | 7faf5110 | Michael Hanselmann | The API will have a way to grab one or more locks at the same |
456 | 7faf5110 | Michael Hanselmann | time. Any attempt to grab a lock while already holding one in the wrong |
457 | 7faf5110 | Michael Hanselmann | order will be checked for, and fail. |
458 | 6c2d0b44 | Iustin Pop | |
459 | 5c0c1eeb | Iustin Pop | |
460 | 5c0c1eeb | Iustin Pop | The Locks |
461 | 5c0c1eeb | Iustin Pop | +++++++++ |
462 | 5c0c1eeb | Iustin Pop | |
463 | 5c0c1eeb | Iustin Pop | At the first stage we have decided to provide the following locks: |
464 | 5c0c1eeb | Iustin Pop | |
465 | 5c0c1eeb | Iustin Pop | - One "config file" lock |
466 | 5c0c1eeb | Iustin Pop | - One lock per node in the cluster |
467 | 5c0c1eeb | Iustin Pop | - One lock per instance in the cluster |
468 | 5c0c1eeb | Iustin Pop | |
469 | 7faf5110 | Michael Hanselmann | All the instance locks will need to be taken before the node locks, and |
470 | 7faf5110 | Michael Hanselmann | the node locks before the config lock. Locks will need to be acquired at |
471 | 7faf5110 | Michael Hanselmann | the same time for multiple instances and nodes, and internal ordering |
472 | 7faf5110 | Michael Hanselmann | will be dealt with inside the locking library, which, for simplicity, will |
473 | 7faf5110 | Michael Hanselmann | just use alphabetical order. |
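The ordering rule above (instances, then nodes, then the config lock, alphabetical within a level) can be illustrated with a minimal sketch. It is written for Python 3 purely for illustration and is not the Ganeti locking library: ``LockRegistry``, ``LEVEL_ORDER`` and the lock names are hypothetical, and plain ``threading.Lock`` objects stand in for the real lock type.

```python
import threading

# Hypothetical level ranks: instances first, then nodes, then the config lock.
LEVEL_ORDER = {"instance": 0, "node": 1, "config": 2}


class LockRegistry:
    """Toy registry that always hands out locks in one global order."""

    def __init__(self):
        self._locks = {}  # (level, name) -> threading.Lock

    def add(self, level, name):
        self._locks[(level, name)] = threading.Lock()

    def acquire(self, wanted):
        """Acquire all (level, name) pairs, sorted by level and then by name."""
        ordered = sorted(set(wanted), key=lambda k: (LEVEL_ORDER[k[0]], k[1]))
        taken = []
        try:
            for key in ordered:
                self._locks[key].acquire()
                taken.append(key)
        except BaseException:
            # Unknown lock or interruption: give back what was already taken.
            self.release(taken)
            raise
        return ordered

    def release(self, held):
        for key in reversed(held):
            self._locks[key].release()


registry = LockRegistry()
for name in ("node2", "node1"):
    registry.add("node", name)
registry.add("instance", "web1")
registry.add("config", "config")

# The caller may list the locks in any order; the registry sorts them.
held = registry.acquire([("config", "config"), ("node", "node2"),
                         ("instance", "web1"), ("node", "node1")])
registry.release(held)
```

Because every caller goes through the same sort, two operations that need overlapping lock sets can never wait on each other in a cycle.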
474 | 5c0c1eeb | Iustin Pop | |
475 | 6c2d0b44 | Iustin Pop | Each lock has the following three possible statuses: |
476 | 6c2d0b44 | Iustin Pop | |
477 | 6c2d0b44 | Iustin Pop | - unlocked (anyone can grab the lock) |
478 | 6c2d0b44 | Iustin Pop | - shared (anyone can grab/have the lock but only in shared mode) |
479 | 6c2d0b44 | Iustin Pop | - exclusive (no one else can grab/have the lock) |
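The three statuses can be modelled with a small shared/exclusive lock. The following is only a sketch under stated assumptions, in Python 3: ``SimpleSharedLock`` and its method names are hypothetical and do not describe the class the library will ship, and the sketch makes no attempt at fairness (a steady stream of shared holders could starve an exclusive request).

```python
import threading


class SimpleSharedLock:
    """Toy shared/exclusive lock; not the Ganeti implementation."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0        # number of holders in shared mode
        self._exclusive = False  # True while one holder owns it exclusively

    def status(self):
        with self._cond:
            if self._exclusive:
                return "exclusive"
            return "shared" if self._sharers else "unlocked"

    def acquire(self, shared=False):
        with self._cond:
            if shared:
                # Shared holders only have to wait for an exclusive owner.
                while self._exclusive:
                    self._cond.wait()
                self._sharers += 1
            else:
                # An exclusive holder needs the lock completely free.
                while self._exclusive or self._sharers:
                    self._cond.wait()
                self._exclusive = True

    def release(self):
        with self._cond:
            if self._exclusive:
                self._exclusive = False
            elif self._sharers:
                self._sharers -= 1
            else:
                raise RuntimeError("release() called on an unlocked lock")
            self._cond.notify_all()


lock = SimpleSharedLock()
states = [lock.status()]
lock.acquire(shared=True)
lock.acquire(shared=True)   # a second shared holder is allowed
states.append(lock.status())
lock.release()
lock.release()
lock.acquire()              # exclusive mode
states.append(lock.status())
lock.release()
```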
480 | 6c2d0b44 | Iustin Pop | |
481 | 5c0c1eeb | Iustin Pop | Handling conversion to more granularity |
482 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++++++++++ |
483 | 5c0c1eeb | Iustin Pop | |
484 | 7faf5110 | Michael Hanselmann | In order to convert to a more granular approach transparently each time |
485 | 7faf5110 | Michael Hanselmann | we split a lock into more we'll create a "metalock", which will depend |
486 | 7faf5110 | Michael Hanselmann | on those sub-locks and live for the time necessary for all the code to |
487 | 7faf5110 | Michael Hanselmann | convert (or forever, in some conditions). When a metalock exists all |
488 | 7faf5110 | Michael Hanselmann | converted code must acquire it in shared mode, so it can run |
489 | 7faf5110 | Michael Hanselmann | concurrently, but still be exclusive with old code, which acquires it |
490 | 7faf5110 | Michael Hanselmann | exclusively. |
491 | 5c0c1eeb | Iustin Pop | |
492 | 7faf5110 | Michael Hanselmann | In the beginning the only such lock will be what replaces the current |
493 | 7faf5110 | Michael Hanselmann | "command" lock, and will acquire all the locks in the system, before |
494 | 7faf5110 | Michael Hanselmann | proceeding. This lock will be called the "Big Ganeti Lock" because |
495 | 7faf5110 | Michael Hanselmann | holding that one will avoid any other concurrent Ganeti operations. |
496 | 5c0c1eeb | Iustin Pop | |
497 | 7faf5110 | Michael Hanselmann | We might also want to devise more metalocks (e.g. all nodes, all |
498 | 7faf5110 | Michael Hanselmann | nodes+config) in order to make it easier for some parts of the code to |
499 | 7faf5110 | Michael Hanselmann | acquire what it needs without specifying it explicitly. |
500 | 5c0c1eeb | Iustin Pop | |
501 | 7faf5110 | Michael Hanselmann | In the future things like the node locks could become metalocks, should |
502 | 7faf5110 | Michael Hanselmann | we decide to split them into an even more fine grained approach, but |
503 | 7faf5110 | Michael Hanselmann | this will probably be only after the first 2.0 version has been |
504 | 7faf5110 | Michael Hanselmann | released. |
505 | 5c0c1eeb | Iustin Pop | |
506 | 5c0c1eeb | Iustin Pop | Adding/Removing locks |
507 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
508 | 5c0c1eeb | Iustin Pop | |
509 | 7faf5110 | Michael Hanselmann | When a new instance or a new node is created an associated lock must be |
510 | 7faf5110 | Michael Hanselmann | added to the list. The relevant code will need to inform the locking |
511 | 7faf5110 | Michael Hanselmann | library of such a change. |
512 | 5c0c1eeb | Iustin Pop | |
513 | 7faf5110 | Michael Hanselmann | This needs to be compatible with every other lock in the system, |
514 | 7faf5110 | Michael Hanselmann | especially metalocks that guarantee to grab sets of resources without |
515 | 7faf5110 | Michael Hanselmann | specifying them explicitly. The implementation of this will be handled |
516 | 7faf5110 | Michael Hanselmann | in the locking library itself. |
517 | 5c0c1eeb | Iustin Pop | |
518 | 6c2d0b44 | Iustin Pop | When instances or nodes disappear from the cluster the relevant locks |
519 | 6c2d0b44 | Iustin Pop | must be removed. This is easier than adding new elements, as the code |
520 | 6c2d0b44 | Iustin Pop | which removes them must own them exclusively already, and thus deals |
521 | 6c2d0b44 | Iustin Pop | with metalocks exactly as normal code acquiring those locks. Any |
522 | 6c2d0b44 | Iustin Pop | operation queuing on a removed lock will fail after its removal. |
523 | 5c0c1eeb | Iustin Pop | |
524 | 5c0c1eeb | Iustin Pop | Asynchronous operations |
525 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
526 | 5c0c1eeb | Iustin Pop | |
527 | 5c0c1eeb | Iustin Pop | For the first version the locking library will only export synchronous |
528 | 7faf5110 | Michael Hanselmann | operations, which will block until the needed locks are held, and only |
529 | 7faf5110 | Michael Hanselmann | fail if the request is impossible or somehow erroneous. |
530 | 5c0c1eeb | Iustin Pop | |
531 | 5c0c1eeb | Iustin Pop | In the future we may want to implement different types of asynchronous |
532 | 5c0c1eeb | Iustin Pop | operations such as: |
533 | 5c0c1eeb | Iustin Pop | |
534 | 6c2d0b44 | Iustin Pop | - try to acquire this lock set and fail if not possible |
535 | 7faf5110 | Michael Hanselmann | - try to acquire one of these lock sets and return the first one you |
536 | 7faf5110 | Michael Hanselmann | were able to get (or after a timeout) (select/poll like) |
537 | 5c0c1eeb | Iustin Pop | |
538 | 7faf5110 | Michael Hanselmann | These operations can be used to prioritize operations based on available |
539 | 7faf5110 | Michael Hanselmann | locks, rather than making them just blindly queue for acquiring them. |
540 | 7faf5110 | Michael Hanselmann | The inherent risk, though, is that any code using the first operation, |
541 | 7faf5110 | Michael Hanselmann | or setting a timeout for the second one, is susceptible to starvation |
542 | 7faf5110 | Michael Hanselmann | and thus may never be able to get the required locks and complete |
543 | 7faf5110 | Michael Hanselmann | certain tasks. Considering this, providing/using these operations should |
544 | 7faf5110 | Michael Hanselmann | not be among our first priorities. |
545 | 5c0c1eeb | Iustin Pop | |
546 | 5c0c1eeb | Iustin Pop | Locking granularity |
547 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++ |
548 | 5c0c1eeb | Iustin Pop | |
549 | 5c0c1eeb | Iustin Pop | For the first version of this code we'll convert each Logical Unit to |
550 | 7faf5110 | Michael Hanselmann | acquire/release the locks it needs, so locking will be at the Logical |
551 | 7faf5110 | Michael Hanselmann | Unit level. In the future we may want to split logical units into |
552 | 7faf5110 | Michael Hanselmann | independent "tasklets" with their own locking requirements. A different |
553 | 7faf5110 | Michael Hanselmann | design doc (or mini design doc) will cover the move from Logical Units |
554 | 7faf5110 | Michael Hanselmann | to tasklets. |
555 | 5c0c1eeb | Iustin Pop | |
556 | 6c2d0b44 | Iustin Pop | Code examples |
557 | 6c2d0b44 | Iustin Pop | +++++++++++++ |
558 | 5c0c1eeb | Iustin Pop | |
559 | 7faf5110 | Michael Hanselmann | In general when acquiring locks we should use a code path equivalent |
560 | 7faf5110 | Michael Hanselmann | to:: |
561 | 5c0c1eeb | Iustin Pop | |
562 | 5c0c1eeb | Iustin Pop | lock.acquire() |
563 | 5c0c1eeb | Iustin Pop | try: |
564 | 5c0c1eeb | Iustin Pop | ... |
565 | 5c0c1eeb | Iustin Pop | # other code |
566 | 5c0c1eeb | Iustin Pop | finally: |
567 | 5c0c1eeb | Iustin Pop | lock.release() |
568 | 5c0c1eeb | Iustin Pop | |
569 | 6c2d0b44 | Iustin Pop | This makes sure we release all locks, and avoid possible deadlocks. Of |
570 | 6c2d0b44 | Iustin Pop | course extra care must be used not to leave, if possible, locked |
571 | 6c2d0b44 | Iustin Pop | structures in an unusable state. Note that with Python 2.5 a simpler |
572 | 6c2d0b44 | Iustin Pop | syntax will be possible, but we want to keep compatibility with Python |
573 | 6c2d0b44 | Iustin Pop | 2.4 so the new constructs should not be used. |
574 | 5c0c1eeb | Iustin Pop | |
575 | 7faf5110 | Michael Hanselmann | In order to avoid this extra indentation and code changes everywhere in |
576 | 7faf5110 | Michael Hanselmann | the Logical Units code, we decided to allow LUs to declare locks, and |
577 | 7faf5110 | Michael Hanselmann | then execute their code with their locks acquired. In the new world LUs |
578 | 7faf5110 | Michael Hanselmann | are called like this:: |
579 | 5c0c1eeb | Iustin Pop | |
580 | 5c0c1eeb | Iustin Pop | # user passed names are expanded to the internal lock/resource name, |
581 | 5c0c1eeb | Iustin Pop | # then known needed locks are declared |
582 | 5c0c1eeb | Iustin Pop | lu.ExpandNames() |
583 | 5c0c1eeb | Iustin Pop | ... some locking/adding of locks may happen ... |
584 | 5c0c1eeb | Iustin Pop | # late declaration of locks for one level: this is useful because sometimes |
585 | 5c0c1eeb | Iustin Pop | # we can't know which resource we need before locking the previous level |
586 | 5c0c1eeb | Iustin Pop | lu.DeclareLocks() # for each level (cluster, instance, node) |
587 | 5c0c1eeb | Iustin Pop | ... more locking/adding of locks can happen ... |
588 | 5c0c1eeb | Iustin Pop | # these functions are called with the proper locks held |
589 | 5c0c1eeb | Iustin Pop | lu.CheckPrereq() |
590 | 5c0c1eeb | Iustin Pop | lu.Exec() |
591 | 5c0c1eeb | Iustin Pop | ... locks declared for removal are removed, all acquired locks released ... |
592 | 5c0c1eeb | Iustin Pop | |
593 | 7faf5110 | Michael Hanselmann | The Processor and the LogicalUnit class will contain exact documentation |
594 | 7faf5110 | Michael Hanselmann | on how locks are supposed to be declared. |
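As a rough illustration of the calling sequence above, the sketch below drives a toy Logical Unit through the four phases while holding its declared locks. Only the method names ``ExpandNames``, ``DeclareLocks``, ``CheckPrereq`` and ``Exec`` and the level names come from the listing; ``ToyLU``, ``run_lu``, the ``needed_locks`` attribute and the lock table are assumptions made for the example, which is written for Python 3.

```python
import threading

# Levels as named in the listing's comment: cluster, instance, node.
LEVELS = ("cluster", "instance", "node")


class ToyLU:
    """Hypothetical Logical Unit that records the order of its calls."""

    def __init__(self, instance_name):
        self.instance_name = instance_name
        self.needed_locks = {}  # level -> list of lock names (assumed shape)
        self.trace = []

    def ExpandNames(self):
        self.trace.append("ExpandNames")
        self.needed_locks["instance"] = [self.instance_name]

    def DeclareLocks(self, level):
        self.trace.append("DeclareLocks(%s)" % level)
        if level == "node":
            # Late declaration: only known once the instance lock is held.
            self.needed_locks["node"] = ["node1"]

    def CheckPrereq(self):
        self.trace.append("CheckPrereq")

    def Exec(self):
        self.trace.append("Exec")
        return "done"


def run_lu(lu, locks):
    """Drive an LU through the declared phases with its locks held."""
    held = []
    lu.ExpandNames()
    try:
        for level in LEVELS:
            lu.DeclareLocks(level)
            for name in sorted(lu.needed_locks.get(level, [])):
                lock = locks[(level, name)]
                lock.acquire()
                held.append(lock)
        lu.CheckPrereq()
        return lu.Exec()
    finally:
        # Whatever happened, release everything that was acquired.
        for lock in reversed(held):
            lock.release()


locks = {("instance", "web1"): threading.Lock(),
         ("node", "node1"): threading.Lock()}
lu = ToyLU("web1")
result = run_lu(lu, locks)
```

The LU code itself contains no ``try``/``finally`` and no extra indentation; the acquire/release pattern lives in one place in the caller.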
595 | 5c0c1eeb | Iustin Pop | |
596 | 5c0c1eeb | Iustin Pop | Caveats |
597 | 5c0c1eeb | Iustin Pop | +++++++ |
598 | 5c0c1eeb | Iustin Pop | |
599 | 5c0c1eeb | Iustin Pop | This library will provide an easy upgrade path to bring all the code to |
600 | 5c0c1eeb | Iustin Pop | granular locking without breaking everything, and it will also guarantee |
601 | 7faf5110 | Michael Hanselmann | against a lot of common errors. Code switching from the old "lock |
602 | 7faf5110 | Michael Hanselmann | everything" lock to the new system, though, needs to be carefully |
603 | 7faf5110 | Michael Hanselmann | scrutinised to be sure it is really acquiring all the necessary locks, |
604 | 7faf5110 | Michael Hanselmann | and none has been overlooked or forgotten. |
605 | 5c0c1eeb | Iustin Pop | |
606 | 7faf5110 | Michael Hanselmann | The code can contain other locks outside of this library, to synchronise |
607 | 7faf5110 | Michael Hanselmann | other threaded code (e.g. for the job queue), but in general these should |
608 | 7faf5110 | Michael Hanselmann | be leaf locks or carefully structured non-leaf ones, to avoid deadlock |
609 | 7faf5110 | Michael Hanselmann | race conditions. |
610 | 5c0c1eeb | Iustin Pop | |
611 | 5c0c1eeb | Iustin Pop | |
612 | 5c0c1eeb | Iustin Pop | Job Queue |
613 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~ |
614 | 5c0c1eeb | Iustin Pop | |
615 | 5c0c1eeb | Iustin Pop | Granular locking is not enough to speed up operations; we also need a |
616 | 5c0c1eeb | Iustin Pop | queue to store these and to be able to process as many as possible in |
617 | 5c0c1eeb | Iustin Pop | parallel. |
618 | 5c0c1eeb | Iustin Pop | |
619 | 6c2d0b44 | Iustin Pop | A Ganeti job will consist of multiple ``OpCodes`` which are the basic |
620 | 5c0c1eeb | Iustin Pop | element of operation in Ganeti 1.2 (and will remain as such). Most |
621 | 5c0c1eeb | Iustin Pop | command-level commands are equivalent to one OpCode, or in some cases |
622 | 5c0c1eeb | Iustin Pop | to a sequence of opcodes, all of the same type (e.g. evacuating a node |
623 | 5c0c1eeb | Iustin Pop | will generate N opcodes of type replace disks). |
624 | 5c0c1eeb | Iustin Pop | |
625 | 5c0c1eeb | Iustin Pop | |
626 | 5c0c1eeb | Iustin Pop | Job execution: “Life of a Ganeti job” |
627 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++++++++ |
628 | 5c0c1eeb | Iustin Pop | |
629 | 7faf5110 | Michael Hanselmann | #. Job gets submitted by the client. A new job identifier is generated |
630 | 7faf5110 | Michael Hanselmann | and assigned to the job. The job is then automatically replicated |
631 | 7faf5110 | Michael Hanselmann | [#replic]_ to all nodes in the cluster. The identifier is returned to |
632 | 7faf5110 | Michael Hanselmann | the client. |
633 | 7faf5110 | Michael Hanselmann | #. A pool of worker threads waits for new jobs. If all are busy, the job |
634 | 7faf5110 | Michael Hanselmann | has to wait and the first worker finishing its work will grab it. |
635 | 7faf5110 | Michael Hanselmann | Otherwise any of the waiting threads will pick up the new job. |
636 | 7faf5110 | Michael Hanselmann | #. Client waits for job status updates by calling a waiting RPC |
637 | 7faf5110 | Michael Hanselmann | function. Log messages may be shown to the user. Until the job is |
638 | 7faf5110 | Michael Hanselmann | started, it can also be canceled. |
639 | 7faf5110 | Michael Hanselmann | #. As soon as the job is finished, its final result and status can be |
640 | 7faf5110 | Michael Hanselmann | retrieved from the server. |
641 | 5c0c1eeb | Iustin Pop | #. If the client archives the job, it gets moved to a history directory. |
642 | 5c0c1eeb | Iustin Pop | There will be a method to archive all jobs older than a given age. |
643 | 5c0c1eeb | Iustin Pop | |
644 | 7faf5110 | Michael Hanselmann | .. [#replic] We need replication in order to maintain the consistency |
645 | 7faf5110 | Michael Hanselmann | across all nodes in the system; the master node only differs in the |
646 | 7faf5110 | Michael Hanselmann | fact that now it is running the master daemon, but if it fails and we |
647 | 7faf5110 | Michael Hanselmann | do a master failover, the jobs are still visible on the new master |
648 | 7faf5110 | Michael Hanselmann | (though marked as failed). |
649 | 5c0c1eeb | Iustin Pop | |
650 | 5c0c1eeb | Iustin Pop | Failures to replicate a job to other nodes will be only flagged as |
651 | 5c0c1eeb | Iustin Pop | errors in the master daemon log if more than half of the nodes failed, |
652 | 5c0c1eeb | Iustin Pop | otherwise we ignore the failure, and rely on the fact that the next |
653 | 5c0c1eeb | Iustin Pop | update (for still running jobs) will retry the update. For finished |
654 | 5c0c1eeb | Iustin Pop | jobs, it is less of a problem. |
655 | 5c0c1eeb | Iustin Pop | |
656 | 5c0c1eeb | Iustin Pop | Future improvements will look into checking the consistency of the job |
657 | 5c0c1eeb | Iustin Pop | list and jobs themselves at master daemon startup. |
658 | 5c0c1eeb | Iustin Pop | |
659 | 5c0c1eeb | Iustin Pop | |
660 | 5c0c1eeb | Iustin Pop | Job storage |
661 | 5c0c1eeb | Iustin Pop | +++++++++++ |
662 | 5c0c1eeb | Iustin Pop | |
663 | 5c0c1eeb | Iustin Pop | Jobs are stored in the filesystem as individual files, serialized |
664 | 5c0c1eeb | Iustin Pop | using JSON (standard serialization mechanism in Ganeti). |
665 | 5c0c1eeb | Iustin Pop | |
666 | 5c0c1eeb | Iustin Pop | The choice of storing each job in its own file was made because: |
667 | 5c0c1eeb | Iustin Pop | |
668 | 5c0c1eeb | Iustin Pop | - a file can be atomically replaced |
669 | 5c0c1eeb | Iustin Pop | - a file can easily be replicated to other nodes |
670 | 7faf5110 | Michael Hanselmann | - checking consistency across nodes can be implemented very easily, |
671 | 7faf5110 | Michael Hanselmann | since all job files should be (at a given moment in time) identical |
672 | 5c0c1eeb | Iustin Pop | |
673 | 5c0c1eeb | Iustin Pop | The other possible choices that were discussed and discounted were: |
674 | 5c0c1eeb | Iustin Pop | |
675 | 7faf5110 | Michael Hanselmann | - single big file with all job data: not feasible due to difficult |
676 | 7faf5110 | Michael Hanselmann | updates |
677 | 5c0c1eeb | Iustin Pop | - in-process databases: hard to replicate the entire database to the |
678 | 7faf5110 | Michael Hanselmann | other nodes, and replicating individual operations does not mean we |
679 | 7faf5110 | Michael Hanselmann | keep consistency |
680 | 5c0c1eeb | Iustin Pop | |
681 | 5c0c1eeb | Iustin Pop | |
682 | 5c0c1eeb | Iustin Pop | Queue structure |
683 | 5c0c1eeb | Iustin Pop | +++++++++++++++ |
684 | 5c0c1eeb | Iustin Pop | |
685 | 7faf5110 | Michael Hanselmann | All file operations have to be done atomically by writing to a temporary |
686 | 7faf5110 | Michael Hanselmann | file and subsequent renaming. Except for log messages, every change in a |
687 | 7faf5110 | Michael Hanselmann | job is stored and replicated to other nodes. |
688 | 5c0c1eeb | Iustin Pop | |
689 | 5c0c1eeb | Iustin Pop | :: |
690 | 5c0c1eeb | Iustin Pop | |
691 | 5c0c1eeb | Iustin Pop | /var/lib/ganeti/queue/ |
692 | 5c0c1eeb | Iustin Pop | job-1 (JSON encoded job description and status) |
693 | 5c0c1eeb | Iustin Pop | […] |
694 | 5c0c1eeb | Iustin Pop | job-37 |
695 | 5c0c1eeb | Iustin Pop | job-38 |
696 | 5c0c1eeb | Iustin Pop | job-39 |
697 | 5c0c1eeb | Iustin Pop | lock (Queue managing process opens this file in exclusive mode) |
698 | 5c0c1eeb | Iustin Pop | serial (Last job ID used) |
699 | 5c0c1eeb | Iustin Pop | version (Queue format version) |
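The "write to a temporary file, then rename" rule can be sketched as follows. This is an illustrative Python 3 example, not the queue code: the helper name ``write_job_file`` and the job fields are assumptions, and a throwaway temporary directory stands in for the real queue directory.

```python
import json
import os
import tempfile


def write_job_file(queue_dir, job_id, job_data):
    """Serialize job_data to <queue_dir>/job-<id> via temp file plus rename."""
    final_path = os.path.join(queue_dir, "job-%d" % job_id)
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir, prefix=".job-tmp-")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(job_data, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path


queue_dir = tempfile.mkdtemp()  # stand-in for the real queue directory
path = write_job_file(queue_dir, 1, {"status": "queued", "ops": []})
write_job_file(queue_dir, 1, {"status": "running", "ops": []})
with open(path) as job_file:
    loaded = json.load(job_file)
```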
700 | 5c0c1eeb | Iustin Pop | |
701 | 5c0c1eeb | Iustin Pop | |
702 | 5c0c1eeb | Iustin Pop | Locking |
703 | 5c0c1eeb | Iustin Pop | +++++++ |
704 | 5c0c1eeb | Iustin Pop | |
705 | 7faf5110 | Michael Hanselmann | Locking in the job queue is a complicated topic. It is called from more |
706 | 7faf5110 | Michael Hanselmann | than one thread and must be thread-safe. For simplicity, a single lock |
707 | 7faf5110 | Michael Hanselmann | is used for the whole job queue. |
708 | 5c0c1eeb | Iustin Pop | |
709 | 9725b53d | Michael Hanselmann | A more detailed description can be found in doc/locking.rst. |
710 | 5c0c1eeb | Iustin Pop | |
711 | 5c0c1eeb | Iustin Pop | |
712 | 5c0c1eeb | Iustin Pop | Internal RPC |
713 | 5c0c1eeb | Iustin Pop | ++++++++++++ |
714 | 5c0c1eeb | Iustin Pop | |
715 | 5c0c1eeb | Iustin Pop | RPC calls available between Ganeti master and node daemons: |
716 | 5c0c1eeb | Iustin Pop | |
717 | 5c0c1eeb | Iustin Pop | jobqueue_update(file_name, content) |
718 | 5c0c1eeb | Iustin Pop | Writes a file in the job queue directory. |
719 | 5c0c1eeb | Iustin Pop | jobqueue_purge() |
720 | 5c0c1eeb | Iustin Pop | Cleans the job queue directory completely, including archived jobs. |
721 | 5c0c1eeb | Iustin Pop | jobqueue_rename(old, new) |
722 | 5c0c1eeb | Iustin Pop | Renames a file in the job queue directory. |
723 | 5c0c1eeb | Iustin Pop | |
724 | 5c0c1eeb | Iustin Pop | |
725 | 5c0c1eeb | Iustin Pop | Client RPC |
726 | 5c0c1eeb | Iustin Pop | ++++++++++ |
727 | 5c0c1eeb | Iustin Pop | |
728 | 7faf5110 | Michael Hanselmann | RPC between Ganeti clients and the Ganeti master daemon supports the |
729 | 7faf5110 | Michael Hanselmann | following operations: |
730 | 5c0c1eeb | Iustin Pop | |
731 | 5c0c1eeb | Iustin Pop | SubmitJob(ops) |
732 | 7faf5110 | Michael Hanselmann | Submits a list of opcodes and returns the job identifier. The |
733 | 7faf5110 | Michael Hanselmann | identifier is guaranteed to be unique during the lifetime of a |
734 | 7faf5110 | Michael Hanselmann | cluster. |
735 | 5c0c1eeb | Iustin Pop | WaitForJobChange(job_id, fields, […], timeout) |
736 | 7faf5110 | Michael Hanselmann | This function waits until a job changes or a timeout expires. The |
737 | 7faf5110 | Michael Hanselmann | condition for when a job changed is defined by the fields passed and |
738 | 7faf5110 | Michael Hanselmann | the last log message received. |
739 | 5c0c1eeb | Iustin Pop | QueryJobs(job_ids, fields) |
740 | 5c0c1eeb | Iustin Pop | Returns field values for the job identifiers passed. |
741 | 5c0c1eeb | Iustin Pop | CancelJob(job_id) |
742 | 7faf5110 | Michael Hanselmann | Cancels the job specified by identifier. This operation may fail if |
743 | 7faf5110 | Michael Hanselmann | the job is already running, canceled or finished. |
744 | 5c0c1eeb | Iustin Pop | ArchiveJob(job_id) |
745 | 7faf5110 | Michael Hanselmann | Moves a job into the …/archive/ directory. This operation will fail if |
746 | 7faf5110 | Michael Hanselmann | the job has not been canceled or finished. |
747 | 5c0c1eeb | Iustin Pop | |
748 | 5c0c1eeb | Iustin Pop | |
749 | 5c0c1eeb | Iustin Pop | Job and opcode status |
750 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
751 | 5c0c1eeb | Iustin Pop | |
752 | 5c0c1eeb | Iustin Pop | Each job and each opcode has, at any time, one of the following states: |
753 | 5c0c1eeb | Iustin Pop | |
754 | 5c0c1eeb | Iustin Pop | Queued |
755 | 5c0c1eeb | Iustin Pop | The job/opcode was submitted, but has not yet started. |
756 | 5c0c1eeb | Iustin Pop | Waiting |
757 | 5c0c1eeb | Iustin Pop | The job/opcode is waiting for a lock to proceed. |
758 | 5c0c1eeb | Iustin Pop | Running |
759 | 5c0c1eeb | Iustin Pop | The job/opcode is running. |
760 | 5c0c1eeb | Iustin Pop | Canceled |
761 | 5c0c1eeb | Iustin Pop | The job/opcode was canceled before it started. |
762 | 5c0c1eeb | Iustin Pop | Success |
763 | 5c0c1eeb | Iustin Pop | The job/opcode ran and finished successfully. |
764 | 5c0c1eeb | Iustin Pop | Error |
765 | 5c0c1eeb | Iustin Pop | The job/opcode was aborted with an error. |
766 | 5c0c1eeb | Iustin Pop | |
767 | 7faf5110 | Michael Hanselmann | If the master is aborted while a job is running, the job will be set to |
768 | 7faf5110 | Michael Hanselmann | the Error status once the master has started again. |
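The restart rule can be expressed as a small sketch. The status names follow the list above, but their lowercase spellings, the ``recover_after_restart`` helper and the in-memory job table are assumptions made for this Python 3 example; the text only specifies what happens to jobs that were running.

```python
# Status names follow the list above; the lowercase spellings are assumptions.
QUEUED, WAITING, RUNNING = "queued", "waiting", "running"
CANCELED, SUCCESS, ERROR = "canceled", "success", "error"


def recover_after_restart(jobs):
    """Return a copy of the job table with interrupted jobs marked as failed.

    Only jobs found in the Running state are changed; how other states are
    treated is not specified in the text and is left untouched here.
    """
    recovered = {}
    for job_id, status in jobs.items():
        recovered[job_id] = ERROR if status == RUNNING else status
    return recovered


before = {37: SUCCESS, 38: RUNNING, 39: QUEUED}
after = recover_after_restart(before)
```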
769 | 5c0c1eeb | Iustin Pop | |
770 | 5c0c1eeb | Iustin Pop | |
771 | 5c0c1eeb | Iustin Pop | History |
772 | 5c0c1eeb | Iustin Pop | +++++++ |
773 | 5c0c1eeb | Iustin Pop | |
774 | 5c0c1eeb | Iustin Pop | Archived jobs are kept in a separate directory, |
775 | 6c2d0b44 | Iustin Pop | ``/var/lib/ganeti/queue/archive/``. This is done in order to speed up |
776 | 6c2d0b44 | Iustin Pop | the queue handling: by default, the jobs in the archive are not |
777 | 6c2d0b44 | Iustin Pop | touched by any functions. Only the current (unarchived) jobs are |
778 | 6c2d0b44 | Iustin Pop | parsed, loaded, and verified (if implemented) by the master daemon. |
779 | 5c0c1eeb | Iustin Pop | |
780 | 5c0c1eeb | Iustin Pop | |
781 | 5c0c1eeb | Iustin Pop | Ganeti updates |
782 | 5c0c1eeb | Iustin Pop | ++++++++++++++ |
783 | 5c0c1eeb | Iustin Pop | |
784 | 5c0c1eeb | Iustin Pop | The queue has to be completely empty for Ganeti updates with changes |
785 | 5c0c1eeb | Iustin Pop | in the job queue structure. In order to allow this, there will be a |
786 | 5c0c1eeb | Iustin Pop | way to prevent new jobs entering the queue. |
787 | 5c0c1eeb | Iustin Pop | |
788 | 5c0c1eeb | Iustin Pop | |
789 | 5c0c1eeb | Iustin Pop | Object parameters |
790 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~ |
791 | 5c0c1eeb | Iustin Pop | |
792 | 5c0c1eeb | Iustin Pop | Across all cluster configuration data, we have multiple classes of |
793 | 5c0c1eeb | Iustin Pop | parameters: |
794 | 5c0c1eeb | Iustin Pop | |
795 | 5c0c1eeb | Iustin Pop | A. cluster-wide parameters (e.g. name of the cluster, the master); |
796 | 5c0c1eeb | Iustin Pop | these are the ones that we have today, and are unchanged from the |
797 | 5c0c1eeb | Iustin Pop | current model |
798 | 5c0c1eeb | Iustin Pop | |
799 | 5c0c1eeb | Iustin Pop | #. node parameters |
800 | 5c0c1eeb | Iustin Pop | |
801 | 5c0c1eeb | Iustin Pop | #. instance specific parameters, e.g. the name of disks (LV), that |
802 | 5c0c1eeb | Iustin Pop | cannot be shared with other instances |
803 | 5c0c1eeb | Iustin Pop | |
804 | 5c0c1eeb | Iustin Pop | #. instance parameters, that are or can be the same for many |
805 | 5c0c1eeb | Iustin Pop | instances, but are not hypervisor related; e.g. the number of VCPUs, |
806 | 5c0c1eeb | Iustin Pop | or the size of memory |
807 | 5c0c1eeb | Iustin Pop | |
808 | 5c0c1eeb | Iustin Pop | #. instance parameters that are hypervisor specific (e.g. kernel_path |
809 | 5c0c1eeb | Iustin Pop | or PAE mode) |
810 | 5c0c1eeb | Iustin Pop | |
811 | 5c0c1eeb | Iustin Pop | |
812 | 5c0c1eeb | Iustin Pop | The following definitions for instance parameters will be used below: |
813 | 5c0c1eeb | Iustin Pop | |
814 | 5c0c1eeb | Iustin Pop | :hypervisor parameter: |
815 | 5c0c1eeb | Iustin Pop | a hypervisor parameter (or hypervisor specific parameter) is defined |
816 | 5c0c1eeb | Iustin Pop | as a parameter that is interpreted by the hypervisor support code in |
817 | 5c0c1eeb | Iustin Pop | Ganeti and usually is specific to a particular hypervisor (like the |
818 | e2078d28 | Iustin Pop | kernel path for :term:`PVM` which makes no sense for :term:`HVM`). |
819 | 5c0c1eeb | Iustin Pop | |
820 | 5c0c1eeb | Iustin Pop | :backend parameter: |
821 | 5c0c1eeb | Iustin Pop | a backend parameter is defined as an instance parameter that can be |
822 | 5c0c1eeb | Iustin Pop | shared among a list of instances, and is either generic enough not |
823 | 5c0c1eeb | Iustin Pop | to be tied to a given hypervisor or cannot influence at all the |
824 | 5c0c1eeb | Iustin Pop | hypervisor behaviour. |
825 | 5c0c1eeb | Iustin Pop | |
826 | 5c0c1eeb | Iustin Pop | For example: memory, vcpus, auto_balance |
827 | 5c0c1eeb | Iustin Pop | |
828 | 7faf5110 | Michael Hanselmann | All these parameters will be encoded into constants.py with the prefix |
829 | 7faf5110 | Michael Hanselmann | "BE\_" and the whole list of parameters will exist in the set |
830 | 7faf5110 | Michael Hanselmann | "BES_PARAMETERS" |
831 | 5c0c1eeb | Iustin Pop | |
832 | 5c0c1eeb | Iustin Pop | :proper parameter: |
833 | 7faf5110 | Michael Hanselmann | a parameter whose value is unique to the instance (e.g. the name of a |
834 | 7faf5110 | Michael Hanselmann | LV, or the MAC of a NIC) |
835 | 5c0c1eeb | Iustin Pop | |
836 | 5c0c1eeb | Iustin Pop | As a general rule, for all kinds of parameters, “None” (or in |
837 | 5c0c1eeb | Iustin Pop | JSON-speak, “null”) will no longer be a valid value for a parameter. As |
838 | 5c0c1eeb | Iustin Pop | such, only non-default parameters will be saved as part of objects in |
839 | 5c0c1eeb | Iustin Pop | the serialization step, reducing the size of the serialized format. |
840 | 5c0c1eeb | Iustin Pop | |
841 | 5c0c1eeb | Iustin Pop | Cluster parameters |
842 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++ |
843 | 5c0c1eeb | Iustin Pop | |
844 | 5c0c1eeb | Iustin Pop | Cluster parameters remain as today, attributes at the top level of the |
845 | 5c0c1eeb | Iustin Pop | Cluster object. In addition, two new attributes at this level will |
846 | 5c0c1eeb | Iustin Pop | hold defaults for the instances: |
847 | 5c0c1eeb | Iustin Pop | |
848 | 5c0c1eeb | Iustin Pop | - hvparams, a dictionary indexed by hypervisor type, holding default |
849 | 6c2d0b44 | Iustin Pop | values for hypervisor parameters that are not defined/overridden by |
850 | 5c0c1eeb | Iustin Pop | the instances of this hypervisor type |
851 | 5c0c1eeb | Iustin Pop | |
852 | 5c0c1eeb | Iustin Pop | - beparams, a dictionary holding (for 2.0) a single element 'default', |
853 | 5c0c1eeb | Iustin Pop | which holds the default value for backend parameters |
854 | 5c0c1eeb | Iustin Pop | |
855 | 5c0c1eeb | Iustin Pop | Node parameters |
856 | 5c0c1eeb | Iustin Pop | +++++++++++++++ |
857 | 5c0c1eeb | Iustin Pop | |
858 | 5c0c1eeb | Iustin Pop | Node-related parameters are very few, and we will continue using the |
859 | 5c0c1eeb | Iustin Pop | same model for these as previously (attributes on the Node object). |
860 | 5c0c1eeb | Iustin Pop | |
861 | e0eb13de | Iustin Pop | There are three new node flags, described in a separate section "node |
862 | e0eb13de | Iustin Pop | flags" below. |
863 | e0eb13de | Iustin Pop | |
864 | 5c0c1eeb | Iustin Pop | Instance parameters |
865 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++ |
866 | 5c0c1eeb | Iustin Pop | |
867 | 5c0c1eeb | Iustin Pop | As described before, the instance parameters are split in three: |
868 | 5c0c1eeb | Iustin Pop | instance proper parameters, unique to each instance, instance |
869 | 5c0c1eeb | Iustin Pop | hypervisor parameters and instance backend parameters. |
870 | 5c0c1eeb | Iustin Pop | |
871 | 5c0c1eeb | Iustin Pop | The “hvparams” and “beparams” are kept in two dictionaries at instance |
872 | 5c0c1eeb | Iustin Pop | level. Only non-default parameters are stored (but once customized, a |
873 | 5c0c1eeb | Iustin Pop | parameter will be kept, even with the same value as the default one, |
874 | 5c0c1eeb | Iustin Pop | until reset). |
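To make the defaults-plus-overrides model concrete, here is a minimal Python 3 sketch of how effective instance parameters could be computed from the cluster-level ``hvparams`` and ``beparams``. The ``fill_params`` helper, the plain dictionaries standing in for the Cluster and Instance objects, and all concrete values (hypervisor name, kernel path, sizes) are hypothetical.

```python
def fill_params(defaults, overrides):
    """Return the effective parameters: defaults updated by overrides."""
    effective = dict(defaults)   # copy, so the cluster defaults stay untouched
    effective.update(overrides)
    return effective


# Hypothetical cluster-level defaults, shaped as described above:
# hvparams is indexed by hypervisor type, beparams has a single 'default'.
cluster_hvparams = {"xen-pvm": {"kernel_path": "/boot/vmlinuz-2.6-xenU"}}
cluster_beparams = {"default": {"memory": 128, "vcpus": 1}}

# Hypothetical instance that customizes only its memory.
instance = {
    "hypervisor": "xen-pvm",
    "hvparams": {},
    "beparams": {"memory": 512},
}

effective_be = fill_params(cluster_beparams["default"], instance["beparams"])
effective_hv = fill_params(cluster_hvparams[instance["hypervisor"]],
                           instance["hvparams"])
```

Only the ``{"memory": 512}`` override would need to be serialized with the instance; everything else is recovered from the cluster defaults at load time.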
875 | 5c0c1eeb | Iustin Pop | |
876 | 5c0c1eeb | Iustin Pop | The names for hypervisor parameters in the instance.hvparams subtree |
877 | 5c0c1eeb | Iustin Pop | should be chosen as generic as possible, especially if specific |
878 | 5c0c1eeb | Iustin Pop | parameters could conceivably be useful for more than one hypervisor, |
879 | 6c2d0b44 | Iustin Pop | e.g. ``instance.hvparams.vnc_console_port`` instead of using both |
880 | 6c2d0b44 | Iustin Pop | ``instance.hvparams.hvm_vnc_console_port`` and |
881 | 6c2d0b44 | Iustin Pop | ``instance.hvparams.kvm_vnc_console_port``. |
882 | 5c0c1eeb | Iustin Pop | |
883 | 5c0c1eeb | Iustin Pop | There are some special cases related to disks and NICs (for example): |
884 | 6c2d0b44 | Iustin Pop | a disk has both Ganeti-related parameters (e.g. the name of the LV) |
885 | 5c0c1eeb | Iustin Pop | and hypervisor-related parameters (how the disk is presented to/named |
886 | 5c0c1eeb | Iustin Pop | in the instance). The former parameters remain as proper-instance |
887 | 5c0c1eeb | Iustin Pop | parameters, while the latter values are migrated to the hvparams |
888 | 5c0c1eeb | Iustin Pop | structure. In 2.0, we will have only globally-per-instance such |
889 | 5c0c1eeb | Iustin Pop | hypervisor parameters, and not per-disk ones (e.g. all NICs will be |
890 | 5c0c1eeb | Iustin Pop | exported as of the same type). |
891 | 5c0c1eeb | Iustin Pop | |
892 | 5c0c1eeb | Iustin Pop | Starting from the 1.2 list of instance parameters, here is how they |
893 | 5c0c1eeb | Iustin Pop | will be mapped to the three classes of parameters: |
894 | 5c0c1eeb | Iustin Pop | |
895 | 5c0c1eeb | Iustin Pop | - name (P) |
896 | 5c0c1eeb | Iustin Pop | - primary_node (P) |
897 | 5c0c1eeb | Iustin Pop | - os (P) |
898 | 5c0c1eeb | Iustin Pop | - hypervisor (P) |
899 | 5c0c1eeb | Iustin Pop | - status (P) |
900 | 5c0c1eeb | Iustin Pop | - memory (BE) |
901 | 5c0c1eeb | Iustin Pop | - vcpus (BE) |
902 | 5c0c1eeb | Iustin Pop | - nics (P) |
903 | 5c0c1eeb | Iustin Pop | - disks (P) |
904 | 5c0c1eeb | Iustin Pop | - disk_template (P) |
905 | 5c0c1eeb | Iustin Pop | - network_port (P) |
906 | 5c0c1eeb | Iustin Pop | - kernel_path (HV) |
907 | 5c0c1eeb | Iustin Pop | - initrd_path (HV) |
908 | 5c0c1eeb | Iustin Pop | - hvm_boot_order (HV) |
909 | 5c0c1eeb | Iustin Pop | - hvm_acpi (HV) |
910 | 5c0c1eeb | Iustin Pop | - hvm_pae (HV) |
911 | 5c0c1eeb | Iustin Pop | - hvm_cdrom_image_path (HV) |
912 | 5c0c1eeb | Iustin Pop | - hvm_nic_type (HV) |
913 | 5c0c1eeb | Iustin Pop | - hvm_disk_type (HV) |
914 | 5c0c1eeb | Iustin Pop | - vnc_bind_address (HV) |
915 | 5c0c1eeb | Iustin Pop | - serial_no (P) |
916 | 5c0c1eeb | Iustin Pop | |
917 | 5c0c1eeb | Iustin Pop | |
918 | 5c0c1eeb | Iustin Pop | Parameter validation |
919 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++ |
920 | 5c0c1eeb | Iustin Pop | |
921 | 5c0c1eeb | Iustin Pop | To support the new cluster parameter design, additional features will |
922 | 5c0c1eeb | Iustin Pop | be required from the hypervisor support implementations in Ganeti. |
923 | 5c0c1eeb | Iustin Pop | |
924 | 5c0c1eeb | Iustin Pop | The hypervisor support implementation API will be extended with the |
925 | 5c0c1eeb | Iustin Pop | following features: |
926 | 5c0c1eeb | Iustin Pop | |
927 | 5c0c1eeb | Iustin Pop | :PARAMETERS: class-level attribute holding the list of valid parameters |
928 | 5c0c1eeb | Iustin Pop | for this hypervisor |
929 | 5c0c1eeb | Iustin Pop | :CheckParamSyntax(hvparams): checks that the given parameters are |
930 | 5c0c1eeb | Iustin Pop | valid (as in the names are valid) for this hypervisor; usually just |
931 | 6c2d0b44 | Iustin Pop | comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class |
932 | 6c2d0b44 | Iustin Pop | method that can be called from within master code (i.e. cmdlib) and |
933 | 6c2d0b44 | Iustin Pop | should be safe to do so |
934 | 5c0c1eeb | Iustin Pop | :ValidateParameters(hvparams): verifies the values of the provided |
935 | 5c0c1eeb | Iustin Pop | parameters against this hypervisor; this is a method that will be |
936 | 5c0c1eeb | Iustin Pop | called on the target node, from backend.py code, and as such can |
937 | 5c0c1eeb | Iustin Pop | make node-specific checks (e.g. kernel_path checking) |
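A minimal sketch of how these three features could fit together. The class name ``ExampleHypervisor``, the contents of ``PARAMETERS`` and the ``HypervisorError`` exception are illustrative assumptions, not the actual Ganeti code:

```python
import os


class HypervisorError(Exception):
    """Hypothetical error type for invalid hypervisor parameters."""


class ExampleHypervisor(object):
    """Hypothetical hypervisor implementation, for illustration only."""

    # Class-level list of the parameter names valid for this hypervisor.
    PARAMETERS = ["kernel_path", "initrd_path"]

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        """Check parameter names only; needs no node access."""
        unknown = sorted(set(hvparams.keys()) - set(cls.PARAMETERS))
        if unknown:
            raise HypervisorError("Unknown parameters: %s" % ", ".join(unknown))

    def ValidateParameters(self, hvparams):
        """Check parameter values; meant to run on the target node."""
        kernel = hvparams.get("kernel_path")
        if kernel is not None and not os.path.isfile(kernel):
            raise HypervisorError("Kernel %s not found on this node" % kernel)
```

The split mirrors the text: the name check is a class method that touches no node state, while the value check may inspect the local filesystem.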
938 | 5c0c1eeb | Iustin Pop | |
939 | 5c0c1eeb | Iustin Pop | Default value application |
940 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++ |
941 | 5c0c1eeb | Iustin Pop | |
942 | 5c0c1eeb | Iustin Pop | The application of defaults to an instance is done in the Cluster |
943 | 5c0c1eeb | Iustin Pop | object, via two new methods as follows: |
944 | 5c0c1eeb | Iustin Pop | |
945 | 5c0c1eeb | Iustin Pop | - ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on |
946 | 5c0c1eeb | Iustin Pop | instance's hvparams and cluster's ``hvparams[instance.hypervisor]`` |
947 | 5c0c1eeb | Iustin Pop | |
948 | 5c0c1eeb | Iustin Pop | - ``Cluster.FillBE(instance, be_type="default")``, which returns the |
949 | 5c0c1eeb | Iustin Pop | beparams dict, based on the instance and cluster beparams |
950 | 5c0c1eeb | Iustin Pop | |
951 | 7faf5110 | Michael Hanselmann | The FillHV/BE transformations will be used, for example, in the |
952 | 7faf5110 | Michael Hanselmann | RpcRunner when sending an instance for activation/stop, and the sent |
953 | 7faf5110 | Michael Hanselmann | instance hvparams/beparams will have the final value (noded code doesn't |
954 | 7faf5110 | Michael Hanselmann | know about defaults). |
955 | 5c0c1eeb | Iustin Pop | |
956 | 5c0c1eeb | Iustin Pop | LU code will need to self-call the transformation, if needed. |
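As a rough illustration of the fill semantics, the sketch below overlays instance values on cluster defaults. The ``Instance`` and ``Cluster`` classes are simplified stand-ins, and keying the cluster ``beparams`` by ``be_type`` is an assumption made for this example:

```python
class Instance(object):
    """Hypothetical stand-in for the instance configuration object."""

    def __init__(self, hypervisor, hvparams, beparams):
        self.hypervisor = hypervisor
        self.hvparams = hvparams
        self.beparams = beparams


class Cluster(object):
    """Hypothetical stand-in for the cluster configuration object."""

    def __init__(self, hvparams, beparams):
        self.hvparams = hvparams  # {hypervisor name: {param: default}}
        self.beparams = beparams  # {be_type: {param: default}} (assumed)

    def FillHV(self, instance):
        """Return cluster HV defaults overridden by the instance's values."""
        filled = dict(self.hvparams.get(instance.hypervisor, {}))
        filled.update(instance.hvparams)
        return filled

    def FillBE(self, instance, be_type="default"):
        """Return cluster BE defaults overridden by the instance's values."""
        filled = dict(self.beparams.get(be_type, {}))
        filled.update(instance.beparams)
        return filled
```

Both methods return a new dictionary, so neither the cluster defaults nor the instance's stored parameters are modified by the fill.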
957 | 5c0c1eeb | Iustin Pop | |
958 | 5c0c1eeb | Iustin Pop | Opcode changes |
959 | 5c0c1eeb | Iustin Pop | ++++++++++++++ |
960 | 5c0c1eeb | Iustin Pop | |
961 | 5c0c1eeb | Iustin Pop | The parameter changes will have impact on the OpCodes, especially on |
962 | 5c0c1eeb | Iustin Pop | the following ones: |
963 | 5c0c1eeb | Iustin Pop | |
964 | e1530b10 | Iustin Pop | - ``OpInstanceCreate``, where the new hv and be parameters will be sent |
965 | 7faf5110 | Michael Hanselmann | as dictionaries; note that all hv and be parameters are now optional, |
966 | 7faf5110 | Michael Hanselmann | as the values can be instead taken from the cluster |
967 | f2af0bec | Iustin Pop | - ``OpInstanceQuery``, where we have to be able to query these new |
968 | 5c0c1eeb | Iustin Pop | parameters; the syntax for names will be ``hvparam/$NAME`` and |
969 | 5c0c1eeb | Iustin Pop | ``beparam/$NAME`` for querying an individual parameter out of one |
970 | 5c0c1eeb | Iustin Pop | dictionary, and ``hvparams``, respectively ``beparams``, for the whole |
971 | 5c0c1eeb | Iustin Pop | dictionaries |
972 | 6c2d0b44 | Iustin Pop | - ``OpModifyInstance``, where the modified parameters are sent as |
973 | 5c0c1eeb | Iustin Pop | dictionaries |
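The query field naming described above could be resolved along these lines. ``ResolveField`` is a hypothetical helper written for this example and operates on already-filled parameter dictionaries:

```python
def ResolveField(name, hvparams, beparams):
    """Resolve one query field name against the parameter dictionaries."""
    if name == "hvparams":
        return hvparams
    if name == "beparams":
        return beparams
    if name.startswith("hvparam/"):
        return hvparams.get(name[len("hvparam/"):])
    if name.startswith("beparam/"):
        return beparams.get(name[len("beparam/"):])
    raise KeyError("Unknown field %r" % name)
```

For example, ``ResolveField("beparam/memory", {}, {"memory": 512})`` returns ``512``, while ``ResolveField("beparams", {}, {"memory": 512})`` returns the whole dictionary.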
974 | 5c0c1eeb | Iustin Pop | |
975 | 5c0c1eeb | Iustin Pop | Additionally, we will need new OpCodes to modify the cluster-level |
976 | 5c0c1eeb | Iustin Pop | defaults for the be/hv sets of parameters. |
977 | 5c0c1eeb | Iustin Pop | |
978 | 5c0c1eeb | Iustin Pop | Caveats |
979 | 5c0c1eeb | Iustin Pop | +++++++ |
980 | 5c0c1eeb | Iustin Pop | |
981 | 5c0c1eeb | Iustin Pop | One problem that might appear is that our classification is not |
982 | 5c0c1eeb | Iustin Pop | complete or not good enough, and we'll need to change this model. As |
983 | 5c0c1eeb | Iustin Pop | a last resort, we will need to roll back and keep the 1.2 style. |
984 | 5c0c1eeb | Iustin Pop | |
985 | 5c0c1eeb | Iustin Pop | Another problem is that classification of one parameter is unclear |
986 | 5c0c1eeb | Iustin Pop | (e.g. ``network_port``, is this BE or HV?); in this case we'll take |
987 | 5c0c1eeb | Iustin Pop | the risk of having to move parameters later between classes. |
988 | 5c0c1eeb | Iustin Pop | |
989 | 5c0c1eeb | Iustin Pop | Security |
990 | 5c0c1eeb | Iustin Pop | ++++++++ |
991 | 5c0c1eeb | Iustin Pop | |
992 | 5c0c1eeb | Iustin Pop | The only security issue that we foresee is if some new parameters |
993 | 5c0c1eeb | Iustin Pop | have sensitive values. If so, we will need to have a way to export the |
994 | 5c0c1eeb | Iustin Pop | config data while purging the sensitive values. |
995 | 5c0c1eeb | Iustin Pop | |
996 | 5c0c1eeb | Iustin Pop | E.g. for the drbd shared secrets, we could export these with the |
997 | 5c0c1eeb | Iustin Pop | values replaced by an empty string. |
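One possible shape for such an export is sketched below. ``PurgeSecrets`` and the key name ``secret`` are assumptions made for illustration; the real configuration layout may differ:

```python
def PurgeSecrets(config, sensitive_keys):
    """Return a copy of config with sensitive values replaced by "".

    New dicts and lists are built, so the input is left untouched.
    """
    if isinstance(config, dict):
        return dict((key, "" if key in sensitive_keys
                     else PurgeSecrets(value, sensitive_keys))
                    for key, value in config.items())
    if isinstance(config, list):
        return [PurgeSecrets(item, sensitive_keys) for item in config]
    return config
```

Blanking the value rather than dropping the key keeps the exported structure identical to the real one, which matches the empty-string approach suggested for the DRBD secrets.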
998 | 5c0c1eeb | Iustin Pop | |
999 | e0eb13de | Iustin Pop | Node flags |
1000 | e0eb13de | Iustin Pop | ~~~~~~~~~~ |
1001 | e0eb13de | Iustin Pop | |
1002 | e0eb13de | Iustin Pop | Ganeti 2.0 adds three node flags that change the way nodes are handled |
1003 | e0eb13de | Iustin Pop | within Ganeti and the related infrastructure (iallocator interaction, |
1004 | e0eb13de | Iustin Pop | RAPI data export). |
1005 | e0eb13de | Iustin Pop | |
1006 | e0eb13de | Iustin Pop | *master candidate* flag |
1007 | e0eb13de | Iustin Pop | +++++++++++++++++++++++ |
1008 | e0eb13de | Iustin Pop | |
1009 | e0eb13de | Iustin Pop | Ganeti 2.0 allows more scalability in operation by introducing |
1010 | e0eb13de | Iustin Pop | parallelization. However, this exposes a new bottleneck: the |
1011 | e0eb13de | Iustin Pop | synchronization and replication of cluster configuration to all nodes |
1012 | e0eb13de | Iustin Pop | in the cluster. |
1013 | e0eb13de | Iustin Pop | |
1014 | e0eb13de | Iustin Pop | This breaks scalability as the speed of the replication decreases |
1015 | e0eb13de | Iustin Pop | roughly with the number of nodes in the cluster. The goal of the |
1016 | e0eb13de | Iustin Pop | master candidate flag is to change this O(n) into O(1) with respect to |
1017 | e0eb13de | Iustin Pop | job and configuration data propagation. |
1018 | e0eb13de | Iustin Pop | |
1019 | e0eb13de | Iustin Pop | Only nodes having this flag set (let's call this set of nodes the |
1020 | e0eb13de | Iustin Pop | *candidate pool*) will have jobs and configuration data replicated. |
1021 | e0eb13de | Iustin Pop | |
1022 | e0eb13de | Iustin Pop | The cluster will have a new parameter (runtime changeable) called |
1023 | e0eb13de | Iustin Pop | ``candidate_pool_size`` which represents the number of candidates the |
1024 | e0eb13de | Iustin Pop | cluster tries to maintain (preferably automatically). |
1025 | e0eb13de | Iustin Pop | |
1026 | e0eb13de | Iustin Pop | This will impact the cluster operations as follows: |
1027 | e0eb13de | Iustin Pop | |
1028 | e0eb13de | Iustin Pop | - jobs and config data will be replicated only to a fixed set of nodes |
1029 | e0eb13de | Iustin Pop | - master fail-over will only be possible to a node in the candidate pool |
1030 | e0eb13de | Iustin Pop | - cluster verify needs changing to account for these two roles |
1031 | e0eb13de | Iustin Pop | - external scripts will no longer have access to the configuration |
1032 | e0eb13de | Iustin Pop | file (this is not recommended anyway) |
1033 | e0eb13de | Iustin Pop | |
1034 | e0eb13de | Iustin Pop | |
1035 | e0eb13de | Iustin Pop | The caveats of this change are: |
1036 | e0eb13de | Iustin Pop | |
1037 | e0eb13de | Iustin Pop | - if all candidates are lost (completely), cluster configuration is |
1038 | e0eb13de | Iustin Pop | lost (but it should be backed up external to the cluster anyway) |
1039 | e0eb13de | Iustin Pop | |
1040 | e0eb13de | Iustin Pop | - failed nodes which are candidate must be dealt with properly, so |
1041 | e0eb13de | Iustin Pop | that we don't lose too many candidates at the same time; this will be |
1042 | e0eb13de | Iustin Pop | reported in cluster verify |
1043 | e0eb13de | Iustin Pop | |
1044 | e0eb13de | Iustin Pop | - the 'all equal' concept of Ganeti is no longer true |
1045 | e0eb13de | Iustin Pop | |
1046 | e0eb13de | Iustin Pop | - the partial distribution of config data means that all nodes will |
1047 | e0eb13de | Iustin Pop | have to revert to ssconf files for master info (as in 1.2) |
1048 | e0eb13de | Iustin Pop | |
1049 | e0eb13de | Iustin Pop | Advantages: |
1050 | e0eb13de | Iustin Pop | |
1051 | e0eb13de | Iustin Pop | - speed on a simulated cluster of 100+ nodes is greatly enhanced, even |
1052 | e0eb13de | Iustin Pop | for a simple operation; ``gnt-instance remove`` on a diskless instance |
1053 | e0eb13de | Iustin Pop | goes from ~9 seconds to ~2 seconds |
1054 | e0eb13de | Iustin Pop | |
1055 | e0eb13de | Iustin Pop | - node failure of non-candidates will be less impacting on the cluster |
1056 | e0eb13de | Iustin Pop | |
1057 | e0eb13de | Iustin Pop | The default value for the candidate pool size will be set to 10 but |
1058 | e0eb13de | Iustin Pop | this can be changed at cluster creation and modified any time later. |
1059 | e0eb13de | Iustin Pop | |
1060 | e0eb13de | Iustin Pop | Testing on simulated big clusters with sequential and parallel jobs |
1061 | e0eb13de | Iustin Pop | shows that this value (10) is a sweet spot from a performance and load |
1062 | e0eb13de | Iustin Pop | point of view. |
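To make the idea of maintaining the pool concrete, here is a small sketch of how the nodes to promote might be selected. The function name and the dictionary keys are hypothetical, and skipping offline and drained nodes is an assumption based on the flag descriptions in this document:

```python
def NodesToPromote(nodes, candidate_pool_size):
    """Return names of nodes to promote so the pool reaches its target size.

    nodes is a list of dicts with the (hypothetical) keys "name",
    "master_candidate", "offline" and "drained".
    """
    current = [node for node in nodes if node["master_candidate"]]
    missing = candidate_pool_size - len(current)
    if missing <= 0:
        return []
    eligible = [node["name"] for node in nodes
                if not (node["master_candidate"] or node["offline"]
                        or node["drained"])]
    # May return fewer names than needed if too few nodes are eligible.
    return eligible[:missing]
```

If the cluster has fewer eligible nodes than the target size, the pool simply stays smaller than ``candidate_pool_size``.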
1063 | e0eb13de | Iustin Pop | |
1064 | e0eb13de | Iustin Pop | *offline* flag |
1065 | e0eb13de | Iustin Pop | ++++++++++++++ |
1066 | e0eb13de | Iustin Pop | |
1067 | e0eb13de | Iustin Pop | In order to better support the situation in which nodes are offline |
1068 | e0eb13de | Iustin Pop | (e.g. for repair) without altering the cluster configuration, Ganeti |
1069 | e0eb13de | Iustin Pop | needs to be told about this state and needs to handle it properly. |
1070 | e0eb13de | Iustin Pop | |
1071 | e0eb13de | Iustin Pop | This will result in simpler procedures, and fewer mistakes, when the |
1072 | e0eb13de | Iustin Pop | number of node failures is high on an absolute scale (either due to |
1073 | e0eb13de | Iustin Pop | a high failure rate or simply big clusters). |
1074 | e0eb13de | Iustin Pop | |
1075 | e0eb13de | Iustin Pop | Nodes having this attribute set will not be contacted for inter-node |
1076 | e0eb13de | Iustin Pop | RPC calls, will not be master candidates, and will not be able to host |
1077 | e0eb13de | Iustin Pop | instances as primaries. |
1078 | e0eb13de | Iustin Pop | |
1079 | e0eb13de | Iustin Pop | Setting this attribute on a node: |
1080 | e0eb13de | Iustin Pop | |
1081 | e0eb13de | Iustin Pop | - will not be allowed if the node is the master |
1082 | e0eb13de | Iustin Pop | - will not be allowed if the node has primary instances |
1083 | e0eb13de | Iustin Pop | - will cause the node to be demoted from the master candidate role (if |
1084 | e0eb13de | Iustin Pop | it was), possibly causing another node to be promoted to that role |
1085 | e0eb13de | Iustin Pop | |
1086 | e0eb13de | Iustin Pop | This attribute will impact the cluster operations as follows: |
1087 | e0eb13de | Iustin Pop | |
1088 | e0eb13de | Iustin Pop | - querying these nodes for anything will fail instantly in the RPC |
1089 | e0eb13de | Iustin Pop | library, with a specific RPC error (RpcResult.offline == True) |
1090 | e0eb13de | Iustin Pop | |
1091 | e0eb13de | Iustin Pop | - they will be listed in the Other section of cluster verify |
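The instant failure for offline nodes might look roughly like this. Only the ``offline`` attribute of ``RpcResult`` is taken from the text above; the other attributes and the ``CallNodes`` helper are simplified assumptions:

```python
class RpcResult(object):
    """Hypothetical, simplified RPC result container."""

    def __init__(self, data=None, failed=False, offline=False):
        self.data = data
        self.failed = failed
        self.offline = offline


def CallNodes(nodes, offline_nodes, do_call):
    """Call do_call(node) for each node, skipping offline ones.

    Offline nodes are never contacted; they get an immediate failed
    result with offline set to True.
    """
    results = {}
    for node in nodes:
        if node in offline_nodes:
            results[node] = RpcResult(failed=True, offline=True)
        else:
            results[node] = RpcResult(data=do_call(node))
    return results
```

Callers can then tell an expected absence (``result.offline``) apart from an unexpected failure without waiting for a timeout.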
1092 | e0eb13de | Iustin Pop | |
1093 | e0eb13de | Iustin Pop | The code is changed in the following ways: |
1094 | e0eb13de | Iustin Pop | |
1095 | e0eb13de | Iustin Pop | - RPC calls were converted to skip such nodes: |
1096 | e0eb13de | Iustin Pop | |
1097 | e0eb13de | Iustin Pop | - RpcRunner-instance-based RPC calls are easy to convert |
1098 | e0eb13de | Iustin Pop | |
1099 | e0eb13de | Iustin Pop | - static/classmethod RPC calls are harder to convert, and were left |
1100 | e0eb13de | Iustin Pop | alone |
1101 | e0eb13de | Iustin Pop | |
1102 | e0eb13de | Iustin Pop | - the RPC results were unified so that this new result state (offline) |
1103 | e0eb13de | Iustin Pop | can be differentiated |
1104 | e0eb13de | Iustin Pop | |
1105 | e0eb13de | Iustin Pop | - master voting still queries in-repair (offline) nodes, as we need to ensure |
1106 | e0eb13de | Iustin Pop | consistency in case the (wrong) masters have old data, and nodes have |
1107 | e0eb13de | Iustin Pop | come back from repairs |
1108 | e0eb13de | Iustin Pop | |
1109 | e0eb13de | Iustin Pop | Caveats: |
1110 | e0eb13de | Iustin Pop | |
1111 | e0eb13de | Iustin Pop | - some operation semantics are less clear (e.g. what to do on instance |
1112 | 7faf5110 | Michael Hanselmann | start with offline secondary?); for now, these will just fail as if |
1113 | 7faf5110 | Michael Hanselmann | the flag is not set (but faster) |
1114 | e0eb13de | Iustin Pop | - 2-node cluster with one node offline needs manual startup of the |
1115 | e0eb13de | Iustin Pop | master with a special flag to skip voting (as the master can't get a |
1116 | e0eb13de | Iustin Pop | quorum there) |
1117 | e0eb13de | Iustin Pop | |
1118 | e0eb13de | Iustin Pop | One of the advantages of implementing this flag is that it will allow |
1119 | e0eb13de | Iustin Pop | future automation tools to automatically put a node in |
1120 | e0eb13de | Iustin Pop | repairs and recover it from this state, and the code (should/will) handle |
1121 | e0eb13de | Iustin Pop | this much better than just timing out. So, future possible |
1122 | e0eb13de | Iustin Pop | improvements (for later versions): |
1123 | e0eb13de | Iustin Pop | |
1124 | e0eb13de | Iustin Pop | - watcher will detect nodes which fail RPC calls and will attempt to ssh |
1125 | e0eb13de | Iustin Pop | to them; if this fails, it will put them offline |
1126 | e0eb13de | Iustin Pop | - watcher will try to ssh to and query the offline nodes; if successful, |
1127 | e0eb13de | Iustin Pop | it will take them off the repair list |
1128 | e0eb13de | Iustin Pop | |
1129 | e0eb13de | Iustin Pop | Alternatives considered: The RPC call model in 2.0 is, by default, |
1130 | e0eb13de | Iustin Pop | much nicer - errors are logged in the background, and job/opcode |
1131 | e0eb13de | Iustin Pop | execution is clearer, so we could simply not introduce this. However, |
1132 | e0eb13de | Iustin Pop | having this state will make both the codepaths clearer (offline |
1133 | e0eb13de | Iustin Pop | vs. temporary failure) and the operational model (it's not a node with |
1134 | e0eb13de | Iustin Pop | errors, but an offline node). |
1135 | e0eb13de | Iustin Pop | |
1136 | e0eb13de | Iustin Pop | |
1137 | e0eb13de | Iustin Pop | *drained* flag |
1138 | e0eb13de | Iustin Pop | ++++++++++++++ |
1139 | e0eb13de | Iustin Pop | |
1140 | e0eb13de | Iustin Pop | Due to parallel execution of jobs in Ganeti 2.0, we could have the |
1141 | e0eb13de | Iustin Pop | following situation: |
1142 | e0eb13de | Iustin Pop | |
1143 | e0eb13de | Iustin Pop | - gnt-node migrate + failover is run |
1144 | e0eb13de | Iustin Pop | - gnt-node evacuate is run, which schedules a long-running 6-opcode |
1145 | e0eb13de | Iustin Pop | job for the node |
1146 | e0eb13de | Iustin Pop | - partway through, a new job comes in that runs an iallocator script, |
1147 | e0eb13de | Iustin Pop | which finds the above node as empty and a very good candidate |
1148 | e0eb13de | Iustin Pop | - gnt-node evacuate has finished, but now it has to be run again, to |
1149 | e0eb13de | Iustin Pop | clean the above instance(s) |
1150 | e0eb13de | Iustin Pop | |
1151 | e0eb13de | Iustin Pop | In order to prevent this situation, and to be able to get nodes into |
1152 | 7faf5110 | Michael Hanselmann | proper offline status easily, a new *drained* flag was added to the |
1153 | 7faf5110 | Michael Hanselmann | nodes. |
1154 | e0eb13de | Iustin Pop | |
1155 | e0eb13de | Iustin Pop | This flag (which actually means "is being, or was drained, and is |
1156 | e0eb13de | Iustin Pop | expected to go offline"), will prevent allocations on the node, but |
1157 | e0eb13de | Iustin Pop | otherwise all other operations (start/stop instance, query, etc.) |
1158 | e0eb13de | Iustin Pop | work without any restrictions. |
1159 | e0eb13de | Iustin Pop | |
1160 | e0eb13de | Iustin Pop | Interaction between flags |
1161 | e0eb13de | Iustin Pop | +++++++++++++++++++++++++ |
1162 | e0eb13de | Iustin Pop | |
1163 | e0eb13de | Iustin Pop | While these flags are implemented as separate flags, they are |
1164 | e0eb13de | Iustin Pop | mutually exclusive and act together with the master node role |
1165 | e0eb13de | Iustin Pop | as a single *node status* value. In other words, a node is in only one |
1166 | e0eb13de | Iustin Pop | of these roles at a given time. The lack of any of these flags denotes |
1167 | e0eb13de | Iustin Pop | a regular node. |
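A sketch of how the separate flags could be collapsed into one status value. The function and the status strings are hypothetical; letting the master role take precedence over the candidate flag is an assumption of this example:

```python
def NodeStatus(is_master, master_candidate, drained, offline):
    """Collapse the master role and the three flags into one status.

    The returned status strings are hypothetical names.
    """
    if sum(1 for flag in (master_candidate, drained, offline) if flag) > 1:
        raise ValueError("Node flags are mutually exclusive")
    if is_master:
        # Setting the offline flag on the master is not allowed.
        if offline:
            raise ValueError("The master node cannot be offline")
        return "master"
    if master_candidate:
        return "master candidate"
    if drained:
        return "drained"
    if offline:
        return "offline"
    return "regular"
```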
1168 | e0eb13de | Iustin Pop | |
1169 | e0eb13de | Iustin Pop | The current node status is visible in the ``gnt-cluster verify`` |
1170 | e0eb13de | Iustin Pop | output, and the individual flags can be examined via separate fields in |
1171 | e0eb13de | Iustin Pop | the ``gnt-node list`` output. |
1172 | e0eb13de | Iustin Pop | |
1173 | e0eb13de | Iustin Pop | These new flags will be exported in both the iallocator input message |
1174 | e0eb13de | Iustin Pop | and via RAPI, see the respective man pages for the exact names. |
1175 | e0eb13de | Iustin Pop | |
1176 | 5c0c1eeb | Iustin Pop | Feature changes |
1177 | 5c0c1eeb | Iustin Pop | --------------- |
1178 | 5c0c1eeb | Iustin Pop | |
1179 | 5c0c1eeb | Iustin Pop | The main feature-level changes will be: |
1180 | 5c0c1eeb | Iustin Pop | |
1181 | 5c0c1eeb | Iustin Pop | - a number of disk related changes |
1182 | 5c0c1eeb | Iustin Pop | - removal of fixed two-disk, one-nic per instance limitation |
1183 | 5c0c1eeb | Iustin Pop | |
1184 | 5c0c1eeb | Iustin Pop | Disk handling changes |
1185 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~ |
1186 | 5c0c1eeb | Iustin Pop | |
1187 | 5c0c1eeb | Iustin Pop | The storage options available in Ganeti 1.x were introduced based on |
1188 | 5c0c1eeb | Iustin Pop | then-current software (first DRBD 0.7 then later DRBD 8) and the |
1189 | 5c0c1eeb | Iustin Pop | estimated usage patterns. However, experience has later shown that some |
1190 | 5c0c1eeb | Iustin Pop | assumptions made initially are not true and that more flexibility is |
1191 | 5c0c1eeb | Iustin Pop | needed. |
1192 | 5c0c1eeb | Iustin Pop | |
1193 | 7faf5110 | Michael Hanselmann | One main assumption made was that disk failures should be treated as |
1194 | 7faf5110 | Michael Hanselmann | 'rare' events, and that each of them needs to be manually handled in |
1195 | 7faf5110 | Michael Hanselmann | order to ensure data safety; however, both these assumptions are false: |
1196 | 5c0c1eeb | Iustin Pop | |
1197 | 7faf5110 | Michael Hanselmann | - disk failures can be a common occurrence, based on usage patterns or |
1198 | 7faf5110 | Michael Hanselmann | cluster size |
1199 | 7faf5110 | Michael Hanselmann | - our disk setup is robust enough (referring to DRBD8 + LVM) that we |
1200 | 7faf5110 | Michael Hanselmann | could automate more of the recovery |
1201 | 5c0c1eeb | Iustin Pop | |
1202 | 7faf5110 | Michael Hanselmann | Note that we still don't have fully-automated disk recovery as a goal, |
1203 | 7faf5110 | Michael Hanselmann | but our goal is to reduce the manual work needed. |
1204 | 5c0c1eeb | Iustin Pop | |
1205 | 5c0c1eeb | Iustin Pop | As such, we plan the following main changes: |
1206 | 5c0c1eeb | Iustin Pop | |
1207 | 7faf5110 | Michael Hanselmann | - DRBD8 is much more flexible and stable than its previous version |
1208 | 7faf5110 | Michael Hanselmann | (0.7), such that removing the support for the ``remote_raid1`` |
1209 | 7faf5110 | Michael Hanselmann | template and focusing only on DRBD8 is easier |
1210 | 5c0c1eeb | Iustin Pop | |
1211 | 7faf5110 | Michael Hanselmann | - dynamic discovery of DRBD devices is not actually needed in a cluster |
1212 | 7faf5110 | Michael Hanselmann | where the DRBD namespace is controlled by Ganeti; switching to a |
1213 | 7faf5110 | Michael Hanselmann | static assignment (done at either instance creation time or change |
1214 | 7faf5110 | Michael Hanselmann | secondary time) will change the disk activation time from O(n) to |
1215 | 7faf5110 | Michael Hanselmann | O(1), which on big clusters is a significant gain |
1216 | 5c0c1eeb | Iustin Pop | |
1217 | 7faf5110 | Michael Hanselmann | - remove the hard dependency on LVM (currently all available storage |
1218 | 7faf5110 | Michael Hanselmann | types are ultimately backed by LVM volumes) by introducing file-based |
1219 | 7faf5110 | Michael Hanselmann | storage |
1220 | 5c0c1eeb | Iustin Pop | |
1221 | 5c0c1eeb | Iustin Pop | Additionally, a number of smaller enhancements are also planned: |
1222 | 5c0c1eeb | Iustin Pop | - support variable number of disks |
1223 | 5c0c1eeb | Iustin Pop | - support read-only disks |
1224 | 5c0c1eeb | Iustin Pop | |
1225 | 5c0c1eeb | Iustin Pop | Future enhancements in the 2.x series, which do not require base design |
1226 | 5c0c1eeb | Iustin Pop | changes, might include: |
1227 | 5c0c1eeb | Iustin Pop | |
1228 | 5c0c1eeb | Iustin Pop | - enhancement of the LVM allocation method in order to try to keep |
1229 | 5c0c1eeb | Iustin Pop | all of an instance's virtual disks on the same physical |
1230 | 5c0c1eeb | Iustin Pop | disks |
1231 | 5c0c1eeb | Iustin Pop | |
1232 | 5c0c1eeb | Iustin Pop | - add support for DRBD8 authentication at handshake time in |
1233 | 5c0c1eeb | Iustin Pop | order to ensure each device connects to the correct peer |
1234 | 5c0c1eeb | Iustin Pop | |
1235 | 5c0c1eeb | Iustin Pop | - remove the restriction on failover only to the secondary, |
1236 | 5c0c1eeb | Iustin Pop | which creates very strict rules on cluster allocation |
1237 | 5c0c1eeb | Iustin Pop | |
1238 | 5c0c1eeb | Iustin Pop | DRBD minor allocation |
1239 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
1240 | 5c0c1eeb | Iustin Pop | |
1241 | 5c0c1eeb | Iustin Pop | Currently, when trying to identify or activate a new DRBD (or MD) |
1242 | 5c0c1eeb | Iustin Pop | device, the code scans all in-use devices in order to see if we find |
1243 | 5c0c1eeb | Iustin Pop | one that looks similar to our parameters and is already in the desired |
1244 | 5c0c1eeb | Iustin Pop | state or not. Since this needs external commands to be run, it is very |
1245 | 5c0c1eeb | Iustin Pop | slow when more than a few devices are already present. |
1246 | 5c0c1eeb | Iustin Pop | |
1247 | 5c0c1eeb | Iustin Pop | Therefore, we will change the discovery model from dynamic to |
1248 | 5c0c1eeb | Iustin Pop | static. When a new device is logically created (added to the |
1249 | 5c0c1eeb | Iustin Pop | configuration) a free minor number is computed from the list of |
1250 | 5c0c1eeb | Iustin Pop | devices that should exist on that node and assigned to that |
1251 | 5c0c1eeb | Iustin Pop | device. |
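The static assignment amounts to picking the lowest number not yet recorded in the configuration for that node. A minimal sketch, where ``FindFreeMinor`` is a hypothetical helper and ``used_minors`` is assumed to come from the configuration rather than from scanning the node:

```python
def FindFreeMinor(used_minors):
    """Return the lowest minor number not present in used_minors."""
    used = set(used_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor
```

For example, with minors 0, 1 and 3 already assigned on a node, the next device gets minor 2. No external command is needed, since only configuration data is consulted.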
1252 | 5c0c1eeb | Iustin Pop | |
1253 | 5c0c1eeb | Iustin Pop | At device activation, if the minor is already in use, we check if |
1254 | 5c0c1eeb | Iustin Pop | it has our parameters; if not so, we just destroy the device (if |
1255 | 5c0c1eeb | Iustin Pop | possible, otherwise we abort) and start it with our own |
1256 | 5c0c1eeb | Iustin Pop | parameters. |
1257 | 5c0c1eeb | Iustin Pop | |
1258 | 5c0c1eeb | Iustin Pop | This means that we in effect take ownership of the minor space for |
1259 | 6c2d0b44 | Iustin Pop | that device type; if there's a user-created DRBD minor, it will be |
1260 | 5c0c1eeb | Iustin Pop | automatically removed. |
1261 | 5c0c1eeb | Iustin Pop | |
1262 | 5c0c1eeb | Iustin Pop | The change will have the effect of reducing the number of external |
1263 | 5c0c1eeb | Iustin Pop | commands run per device from a constant number times the index of the |
1264 | 5c0c1eeb | Iustin Pop | first free DRBD minor to just a constant number. |
1265 | 5c0c1eeb | Iustin Pop | |
1266 | 6c2d0b44 | Iustin Pop | Removal of obsolete device types (MD, DRBD7) |
1267 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++++++++++++++ |
1268 | 5c0c1eeb | Iustin Pop | |
1269 | 5c0c1eeb | Iustin Pop | We need to remove these device types because of two issues. First, |
1270 | 6c2d0b44 | Iustin Pop | DRBD7 has bad failure modes in case of dual failures (both network and |
1271 | 5c0c1eeb | Iustin Pop | disk): it cannot propagate the error up the device stack and instead |
1272 | 6c2d0b44 | Iustin Pop | just panics. Second, due to the asymmetry between primary and |
1273 | 6c2d0b44 | Iustin Pop | secondary in MD+DRBD mode, we cannot do live failover (not even if we |
1274 | 6c2d0b44 | Iustin Pop | had MD+DRBD8). |
1275 | 5c0c1eeb | Iustin Pop | |
1276 | 5c0c1eeb | Iustin Pop | File-based storage support |
1277 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++ |
1278 | 5c0c1eeb | Iustin Pop | |
1279 | 6c2d0b44 | Iustin Pop | Using files instead of logical volumes for instance storage would |
1280 | 6c2d0b44 | Iustin Pop | allow us to get rid of the hard requirement for volume groups for |
1281 | 6c2d0b44 | Iustin Pop | testing clusters and it would also allow usage of SAN storage to do |
1282 | 6c2d0b44 | Iustin Pop | live failover taking advantage of this storage solution. |
1283 | 5c0c1eeb | Iustin Pop | |
1284 | 5c0c1eeb | Iustin Pop | Better LVM allocation |
1285 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
1286 | 5c0c1eeb | Iustin Pop | |
1287 | 5c0c1eeb | Iustin Pop | Currently, the LV to PV allocation mechanism is a very simple one: at |
1288 | 5c0c1eeb | Iustin Pop | each new request for a logical volume, tell LVM to allocate the volume |
1289 | 5c0c1eeb | Iustin Pop | in order based on the amount of free space. This is good for |
1290 | 5c0c1eeb | Iustin Pop | simplicity and for keeping the usage equally spread over the available |
1291 | 5c0c1eeb | Iustin Pop | physical disks; however, it introduces the problem that an instance could |
1292 | 5c0c1eeb | Iustin Pop | end up with its (currently) two drives on two physical disks, or |
1293 | 5c0c1eeb | Iustin Pop | (worse) that the data and metadata for a DRBD device end up on |
1294 | 5c0c1eeb | Iustin Pop | different drives. |
1295 | 5c0c1eeb | Iustin Pop | |
1296 | 5c0c1eeb | Iustin Pop | This is bad because it causes unneeded ``replace-disks`` operations in |
1297 | 5c0c1eeb | Iustin Pop | case of a physical failure. |
1298 | 5c0c1eeb | Iustin Pop | |
1299 | 5c0c1eeb | Iustin Pop | The solution is to batch allocations for an instance and make the LVM |
1300 | 5c0c1eeb | Iustin Pop | handling code try to allocate all the storage of one instance as close |
1301 | 5c0c1eeb | Iustin Pop | together as possible. We will still allow the logical volumes to spill over to |
1302 | 5c0c1eeb | Iustin Pop | additional disks as needed. |
1303 | 5c0c1eeb | Iustin Pop | |
1304 | 5c0c1eeb | Iustin Pop | Note that this clustered allocation can only be attempted at initial |
1305 | 5c0c1eeb | Iustin Pop | instance creation, or at change secondary node time. At add disk time, |
1306 | 5c0c1eeb | Iustin Pop | or at replacing individual disks, it's not easy enough to compute the |
1307 | 5c0c1eeb | Iustin Pop | current disk map so we'll not attempt the clustering. |
1308 | 5c0c1eeb | Iustin Pop | |
1309 | 5c0c1eeb | Iustin Pop | DRBD8 peer authentication at handshake |
1310 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++++++++ |
1311 | 5c0c1eeb | Iustin Pop | |
1312 | 5c0c1eeb | Iustin Pop | DRBD8 has a new feature that allows authentication of the peer at |
1313 | 5c0c1eeb | Iustin Pop | connect time. We can use this more to prevent connecting to the wrong |
1314 | 5c0c1eeb | Iustin Pop | peer than to secure the connection. Even though we never had issues |
1315 | 5c0c1eeb | Iustin Pop | with wrong connections, it would be good to implement this. |
1316 | 5c0c1eeb | Iustin Pop | |
1317 | 5c0c1eeb | Iustin Pop | |
1318 | 5c0c1eeb | Iustin Pop | LVM self-repair (optional) |
1319 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++ |
1320 | 5c0c1eeb | Iustin Pop | |
1321 | 5c0c1eeb | Iustin Pop | The complete failure of a physical disk is very tedious to |
1322 | 5c0c1eeb | Iustin Pop | troubleshoot, mainly because of the many failure modes and the many |
1323 | 5c0c1eeb | Iustin Pop | steps needed. We can safely automate some of the steps, more |
1324 | 5c0c1eeb | Iustin Pop | specifically the ``vgreduce --removemissing`` using the following |
1325 | 5c0c1eeb | Iustin Pop | method: |
1326 | 5c0c1eeb | Iustin Pop | |
1327 | 5c0c1eeb | Iustin Pop | #. check if all nodes have consistent volume groups |
1328 | 5c0c1eeb | Iustin Pop | #. if yes, and previous status was yes, do nothing |
1329 | 5c0c1eeb | Iustin Pop | #. if yes, and previous status was no, save status and restart |
1330 | 5c0c1eeb | Iustin Pop | #. if no, and previous status was no, do nothing |
1331 | 5c0c1eeb | Iustin Pop | #. if no, and previous status was yes: |
1332 | 5c0c1eeb | Iustin Pop |     #. if more than one node is inconsistent, do nothing |
1333 | 6c2d0b44 | Iustin Pop |     #. if only one node is inconsistent: |
1334 | 5c0c1eeb | Iustin Pop |         #. run ``vgreduce --removemissing`` |
1335 | 6c2d0b44 | Iustin Pop |         #. log this occurrence in the Ganeti log in a form that |
1336 | 5c0c1eeb | Iustin Pop |            can be used for monitoring |
1337 | 5c0c1eeb | Iustin Pop |         #. [FUTURE] run ``replace-disks`` for all |
1338 | 5c0c1eeb | Iustin Pop |            instances affected |
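The decision procedure above can be expressed as a small pure function. This is a sketch under one reading of the list, in which the last steps are nested under "if no, and previous status was yes"; the function name and the action strings are hypothetical:

```python
def SelfRepairAction(inconsistent_nodes, previously_consistent):
    """Decide what the self-repair step should do.

    Returns (action, node), where action is one of the hypothetical
    strings "nothing", "save-and-restart" or "vgreduce".
    """
    if not inconsistent_nodes:
        # All nodes currently have consistent volume groups.
        if previously_consistent:
            return ("nothing", None)
        return ("save-and-restart", None)
    if not previously_consistent:
        return ("nothing", None)
    if len(inconsistent_nodes) > 1:
        return ("nothing", None)
    # Exactly one node became inconsistent since the last check.
    return ("vgreduce", inconsistent_nodes[0])
```

Only the last case triggers ``vgreduce --removemissing``; logging and the future ``replace-disks`` step would follow that action and are not modelled here.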
1339 | 5c0c1eeb | Iustin Pop | |
1340 | 5c0c1eeb | Iustin Pop | Failover to any node |
1341 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++ |
1342 | 5c0c1eeb | Iustin Pop | |
1343 | 5c0c1eeb | Iustin Pop | With a modified disk activation sequence, we can implement the |
1344 | 5c0c1eeb | Iustin Pop | *failover to any* functionality, removing many of the layout |
1345 | 5c0c1eeb | Iustin Pop | restrictions of a cluster: |
1346 | 5c0c1eeb | Iustin Pop | |
1347 | 7faf5110 | Michael Hanselmann | - the need to reserve memory on the current secondary: this gets reduced |
1348 | 7faf5110 | Michael Hanselmann | to a need to reserve memory anywhere on the cluster |
1349 | 5c0c1eeb | Iustin Pop | |
1350 | 5c0c1eeb | Iustin Pop | - the need to first failover and then replace secondary for an |
1351 | 5c0c1eeb | Iustin Pop | instance: with failover-to-any, we can directly failover to |
1352 | 5c0c1eeb | Iustin Pop | another node, which also does the replace disks at the same |
1353 | 5c0c1eeb | Iustin Pop | step |
1354 | 5c0c1eeb | Iustin Pop | |
1355 | 5c0c1eeb | Iustin Pop | In the following, we denote the current primary by P1, the current |
1356 | 5c0c1eeb | Iustin Pop | secondary by S1, and the new primary and secondaries by P2 and S2. P2 |
1357 | 5c0c1eeb | Iustin Pop | is fixed to the node the user chooses, but the choice of S2 can be |
1358 | 5c0c1eeb | Iustin Pop | made between P1 and S1. This choice can be constrained, depending on |
1359 | 5c0c1eeb | Iustin Pop | which of P1 and S1 has failed. |
1360 | 5c0c1eeb | Iustin Pop | |
1361 | 7faf5110 | Michael Hanselmann | - if P1 has failed, then S1 must become S2, and live migration is not |
1362 | 7faf5110 | Michael Hanselmann | possible |
1363 | 5c0c1eeb | Iustin Pop | - if S1 has failed, then P1 must become S2, and live migration could be |
1364 | 5c0c1eeb | Iustin Pop | possible (in theory, but this is not a design goal for 2.0) |
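The constraint on S2 can be written down as a two-case function. This is a minimal sketch; the function name and the use of the strings ``"P1"`` and ``"S1"`` to identify the failed node are assumptions made for illustration.

```python
def choose_new_secondary(failed_node):
    """Return (S2, live_migration_possible_in_theory) for a failed node."""
    if failed_node == "P1":
        # The old primary is gone: S1 holds the only good copy of the data.
        return "S1", False
    if failed_node == "S1":
        # The old secondary is gone: P1 keeps the data and becomes S2.
        return "P1", True
    raise ValueError("failed_node must be 'P1' or 'S1'")
```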
1365 | 5c0c1eeb | Iustin Pop | |
1366 | 5c0c1eeb | Iustin Pop | The algorithm for performing the failover is straightforward: |
1367 | 5c0c1eeb | Iustin Pop | |
1368 | 5c0c1eeb | Iustin Pop | - verify that S2 (the node the user has chosen to keep as secondary) has |
1369 | 5c0c1eeb | Iustin Pop | valid data (is consistent) |
1370 | 5c0c1eeb | Iustin Pop | |
1371 | 7faf5110 | Michael Hanselmann | - tear down the current DRBD association and setup a DRBD pairing |
1372 | 7faf5110 | Michael Hanselmann | between P2 (P2 is indicated by the user) and S2; since P2 has no data, |
1373 | 7faf5110 | Michael Hanselmann | it will start re-syncing from S2 |
1374 | 5c0c1eeb | Iustin Pop | |
1375 | 7faf5110 | Michael Hanselmann | - as soon as P2 is in state SyncTarget (i.e. after the resync has |
1376 | 7faf5110 | Michael Hanselmann | started but before it has finished), we can promote it to primary role |
1377 | 7faf5110 | Michael Hanselmann | (r/w) and start the instance on P2 |
1378 | 5c0c1eeb | Iustin Pop | |
1379 | 5c0c1eeb | Iustin Pop | - as soon as the sync between P2 and S2 has finished, we can remove |
1380 | 5c0c1eeb | Iustin Pop | the old data on the old node that has not been chosen for |
1381 | 5c0c1eeb | Iustin Pop | S2 |
1382 | 5c0c1eeb | Iustin Pop | |
1383 | 5c0c1eeb | Iustin Pop | Caveats: during the sync between P2 and S2, a (non-transient) network error |
1384 | 5c0c1eeb | Iustin Pop | will cause I/O errors on the instance, so (if a longer instance |
1385 | 5c0c1eeb | Iustin Pop | downtime is acceptable) we can postpone the restart of the instance |
1386 | 5c0c1eeb | Iustin Pop | until the resync is done. However, disk I/O errors on S2 will cause |
1387 | 6c2d0b44 | Iustin Pop | data loss, since we don't have a good copy of the data anymore, so in |
1388 | 5c0c1eeb | Iustin Pop | this case waiting for the sync to complete is not an option. As such, |
1389 | 5c0c1eeb | Iustin Pop | it is recommended that this feature is used only in conjunction with |
1390 | 5c0c1eeb | Iustin Pop | proper disk monitoring. |
1391 | 5c0c1eeb | Iustin Pop | |
1392 | 5c0c1eeb | Iustin Pop | |
1393 | 5c0c1eeb | Iustin Pop | Live migration note: While failover-to-any is possible for all choices |
1394 | 5c0c1eeb | Iustin Pop | of S2, migration-to-any is possible only if we keep P1 as S2. |
1395 | 5c0c1eeb | Iustin Pop | |
1396 | 5c0c1eeb | Iustin Pop | Caveats |
1397 | 5c0c1eeb | Iustin Pop | +++++++ |
1398 | 5c0c1eeb | Iustin Pop | |
1399 | 5c0c1eeb | Iustin Pop | The dynamic device model, while more complex, has an advantage: it |
1400 | 6c2d0b44 | Iustin Pop | will not reuse by mistake the DRBD device of another instance, since |
1401 | 6c2d0b44 | Iustin Pop | it always looks for either our own or a free one. |
1402 | 5c0c1eeb | Iustin Pop | |
1403 | 5c0c1eeb | Iustin Pop | The static one, in contrast, will assume that given a minor number N, |
1404 | 5c0c1eeb | Iustin Pop | it's ours and we can take over. This needs careful implementation such |
1405 | 5c0c1eeb | Iustin Pop | that if the minor is in use, either we are able to cleanly shut it |
1406 | 5c0c1eeb | Iustin Pop | down, or we abort the startup. Otherwise, it could be that we start |
1407 | 6c2d0b44 | Iustin Pop | syncing between two instances' disks, causing data loss. |
1408 | 5c0c1eeb | Iustin Pop | |
1409 | 5c0c1eeb | Iustin Pop | |
1410 | 5c0c1eeb | Iustin Pop | Variable number of disk/NICs per instance |
1411 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
1412 | 5c0c1eeb | Iustin Pop | |
1413 | 5c0c1eeb | Iustin Pop | Variable number of disks |
1414 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++ |
1415 | 5c0c1eeb | Iustin Pop | |
1416 | 5c0c1eeb | Iustin Pop | In order to support high-security scenarios (for example read-only sda |
1417 | 5c0c1eeb | Iustin Pop | and read-write sdb), we need to make a fully flexible disk |
1418 | 5c0c1eeb | Iustin Pop | definition. This has less impact than it might look at first sight: |
1419 | 6c2d0b44 | Iustin Pop | only the instance creation has a hard-coded number of disks, not the disk |
1420 | 5c0c1eeb | Iustin Pop | handling code. The block device handling and most of the instance |
1421 | 5c0c1eeb | Iustin Pop | handling code is already working with "the instance's disks" as |
1422 | 5c0c1eeb | Iustin Pop | opposed to "the two disks of the instance", but some pieces are not |
1423 | 5c0c1eeb | Iustin Pop | (e.g. import/export) and the code needs a review to ensure safety. |
1424 | 5c0c1eeb | Iustin Pop | |
1425 | 5c0c1eeb | Iustin Pop | The objective is to be able to specify the number of disks at |
1426 | 5c0c1eeb | Iustin Pop | instance creation, and to be able to toggle a disk between read-only |
1427 | 6c2d0b44 | Iustin Pop | and read-write afterward. |
1428 | 5c0c1eeb | Iustin Pop | |
1429 | 5c0c1eeb | Iustin Pop | Variable number of NICs |
1430 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
1431 | 5c0c1eeb | Iustin Pop | |
1432 | 5c0c1eeb | Iustin Pop | Similar to the disk change, we need to allow multiple network |
1433 | 5c0c1eeb | Iustin Pop | interfaces per instance. This will affect the internal code (some |
1434 | 5c0c1eeb | Iustin Pop | functions will have to stop assuming that ``instance.nics`` is a list |
1435 | 6c2d0b44 | Iustin Pop | of length one), the OS API which currently can export/import only one |
1436 | 5c0c1eeb | Iustin Pop | instance, and the command line interface. |
1437 | 5c0c1eeb | Iustin Pop | |
1438 | 5c0c1eeb | Iustin Pop | Interface changes |
1439 | 5c0c1eeb | Iustin Pop | ----------------- |
1440 | 5c0c1eeb | Iustin Pop | |
1441 | 5c0c1eeb | Iustin Pop | There are two areas of interface changes: API-level changes (the OS |
1442 | 5c0c1eeb | Iustin Pop | interface and the RAPI interface) and the command line interface |
1443 | 5c0c1eeb | Iustin Pop | changes. |
1444 | 5c0c1eeb | Iustin Pop | |
1445 | 5c0c1eeb | Iustin Pop | OS interface |
1446 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~ |
1447 | 5c0c1eeb | Iustin Pop | |
1448 | 7faf5110 | Michael Hanselmann | The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. |
1449 | 7faf5110 | Michael Hanselmann | The interface is composed of a series of scripts which get called with |
1450 | 7faf5110 | Michael Hanselmann | certain parameters to perform OS-dependent operations on the cluster. |
1451 | 7faf5110 | Michael Hanselmann | The current scripts are: |
1452 | 5c0c1eeb | Iustin Pop | |
1453 | 5c0c1eeb | Iustin Pop | create |
1454 | 5c0c1eeb | Iustin Pop | called when a new instance is added to the cluster |
1455 | 5c0c1eeb | Iustin Pop | export |
1456 | 5c0c1eeb | Iustin Pop | called to export an instance disk to a stream |
1457 | 5c0c1eeb | Iustin Pop | import |
1458 | 5c0c1eeb | Iustin Pop | called to import from a stream to a new instance |
1459 | 5c0c1eeb | Iustin Pop | rename |
1460 | 5c0c1eeb | Iustin Pop | called to perform the os-specific operations necessary for renaming an |
1461 | 5c0c1eeb | Iustin Pop | instance |
1462 | 5c0c1eeb | Iustin Pop | |
1463 | 7faf5110 | Michael Hanselmann | Currently these scripts suffer from the limitations of Ganeti 1.2: for |
1464 | 7faf5110 | Michael Hanselmann | example, they accept exactly one block device and one swap device to |
1465 | 7faf5110 | Michael Hanselmann | operate on, rather than any number of generic block devices; they blindly |
1466 | 7faf5110 | Michael Hanselmann | assume that an instance will have just one network interface; and they |
1467 | 7faf5110 | Michael Hanselmann | cannot be configured to optimise the instance for a particular |
1468 | 7faf5110 | Michael Hanselmann | hypervisor. |
1469 | 5c0c1eeb | Iustin Pop | |
1470 | 7faf5110 | Michael Hanselmann | Since in Ganeti 2.0 we want to support multiple hypervisors, and a |
1471 | 7faf5110 | Michael Hanselmann | non-fixed number of network interfaces and disks, the OS interface needs |
1472 | 7faf5110 | Michael Hanselmann | to change to transmit the appropriate amount of information about an |
1473 | 7faf5110 | Michael Hanselmann | instance to its managing operating system. Moreover, since some old |
1474 | 7faf5110 | Michael Hanselmann | assumptions usually used in OS scripts are no longer valid, we need to |
1475 | 7faf5110 | Michael Hanselmann | re-establish a common understanding of what can and cannot be assumed |
1476 | 7faf5110 | Michael Hanselmann | about the Ganeti environment. |
1477 | 5c0c1eeb | Iustin Pop | |
1478 | 5c0c1eeb | Iustin Pop | |
1479 | 5c0c1eeb | Iustin Pop | When designing the new OS API our priorities are: |
1480 | 5c0c1eeb | Iustin Pop | - ease of use |
1481 | 5c0c1eeb | Iustin Pop | - future extensibility |
1482 | 6c2d0b44 | Iustin Pop | - ease of porting from the old API |
1483 | 5c0c1eeb | Iustin Pop | - modularity |
1484 | 5c0c1eeb | Iustin Pop | |
1485 | 7faf5110 | Michael Hanselmann | As such we want to limit the number of scripts that must be written to |
1486 | 7faf5110 | Michael Hanselmann | support an OS, and make it easy to share code between them by making |
1487 | 7faf5110 | Michael Hanselmann | their input uniform. We will also leave the current script structure unchanged, |
1488 | 7faf5110 | Michael Hanselmann | as far as we can, and make a few of the scripts (import, export and |
1489 | 7faf5110 | Michael Hanselmann | rename) optional. Most information will be passed to the script through |
1490 | 7faf5110 | Michael Hanselmann | environment variables, for ease of access and at the same time ease of |
1491 | 7faf5110 | Michael Hanselmann | using only the information a script needs. |
1492 | 5c0c1eeb | Iustin Pop | |
1493 | 5c0c1eeb | Iustin Pop | |
1494 | 5c0c1eeb | Iustin Pop | The Scripts |
1495 | 5c0c1eeb | Iustin Pop | +++++++++++ |
1496 | 5c0c1eeb | Iustin Pop | |
1497 | 7faf5110 | Michael Hanselmann | As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs |
1498 | 7faf5110 | Michael Hanselmann | to support the following functionality, through scripts: |
1499 | 5c0c1eeb | Iustin Pop | |
1500 | 5c0c1eeb | Iustin Pop | create: |
1501 | 7faf5110 | Michael Hanselmann | used to create a new instance running that OS. This script should |
1502 | 7faf5110 | Michael Hanselmann | prepare the block devices, and install them so that the new OS can |
1503 | 7faf5110 | Michael Hanselmann | boot under the specified hypervisor. |
1504 | 5c0c1eeb | Iustin Pop | export (optional): |
1505 | 7faf5110 | Michael Hanselmann | used to export an installed instance using the given OS to a format |
1506 | 7faf5110 | Michael Hanselmann | which can be used to import it back into a new instance. |
1507 | 5c0c1eeb | Iustin Pop | import (optional): |
1508 | 7faf5110 | Michael Hanselmann | used to import an exported instance into a new one. This script is |
1509 | 7faf5110 | Michael Hanselmann | similar to create, but the new instance should have the content of the |
1510 | 7faf5110 | Michael Hanselmann | export, rather than contain a pristine installation. |
1511 | 5c0c1eeb | Iustin Pop | rename (optional): |
1512 | 7faf5110 | Michael Hanselmann | used to perform the internal OS-specific operations needed to rename |
1513 | 7faf5110 | Michael Hanselmann | an instance. |
1514 | 5c0c1eeb | Iustin Pop | |
1515 | 7faf5110 | Michael Hanselmann | If any optional script is not implemented Ganeti will refuse to perform |
1516 | 7faf5110 | Michael Hanselmann | the given operation on instances using the non-implementing OS. Of |
1517 | 7faf5110 | Michael Hanselmann | course the create script is mandatory, and it doesn't make sense to |
1518 | 7faf5110 | Michael Hanselmann | support either the export or the import operation but not both. |
1519 | 5c0c1eeb | Iustin Pop | |
1520 | 5c0c1eeb | Iustin Pop | Incompatibilities with 1.2 |
1521 | 5c0c1eeb | Iustin Pop | __________________________ |
1522 | 5c0c1eeb | Iustin Pop | |
1523 | 7faf5110 | Michael Hanselmann | We expect the following incompatibilities between the OS scripts for 1.2 |
1524 | 7faf5110 | Michael Hanselmann | and the ones for 2.0: |
1525 | 5c0c1eeb | Iustin Pop | |
1526 | 7faf5110 | Michael Hanselmann | - Input parameters: in 1.2 those were passed on the command line, in 2.0 |
1527 | 7faf5110 | Michael Hanselmann | we'll use environment variables, as there will be a lot more |
1528 | 7faf5110 | Michael Hanselmann | information and not all OSes may care about all of it. |
1529 | 7faf5110 | Michael Hanselmann | - Number of calls: export scripts will be called once for each device |
1530 | 7faf5110 | Michael Hanselmann | the instance has, and import scripts once for every exported disk. |
1531 | 7faf5110 | Michael Hanselmann | Imported instances will be forced to have a number of disks greater than |
1532 | 7faf5110 | Michael Hanselmann | or equal to that of the export. |
1533 | 7faf5110 | Michael Hanselmann | - Some scripts are not compulsory: if such a script is missing the |
1534 | 7faf5110 | Michael Hanselmann | relevant operations will be forbidden for instances of that OS. This |
1535 | 7faf5110 | Michael Hanselmann | makes it easier to distinguish between unsupported operations and |
1536 | 7faf5110 | Michael Hanselmann | no-op ones (if any). |
1537 | 5c0c1eeb | Iustin Pop | |
1538 | 5c0c1eeb | Iustin Pop | |
1539 | 5c0c1eeb | Iustin Pop | Input |
1540 | 5c0c1eeb | Iustin Pop | _____ |
1541 | 5c0c1eeb | Iustin Pop | |
1542 | 7faf5110 | Michael Hanselmann | Rather than using command line flags, as they do now, scripts will |
1543 | 7faf5110 | Michael Hanselmann | accept inputs from environment variables. We expect the following input |
1544 | 7faf5110 | Michael Hanselmann | values: |
1545 | 5c0c1eeb | Iustin Pop | |
1546 | 5c0c1eeb | Iustin Pop | OS_API_VERSION |
1547 | 6c2d0b44 | Iustin Pop | The version of the OS API that the following parameters comply with; |
1548 | 5c0c1eeb | Iustin Pop | this is used so that in the future we could have OSes supporting |
1549 | 5c0c1eeb | Iustin Pop | multiple versions, with Ganeti sending the proper version in this |
1550 | 5c0c1eeb | Iustin Pop | parameter |
1551 | 5c0c1eeb | Iustin Pop | INSTANCE_NAME |
1552 | 5c0c1eeb | Iustin Pop | Name of the instance acted on |
1553 | 5c0c1eeb | Iustin Pop | HYPERVISOR |
1554 | 7faf5110 | Michael Hanselmann | The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', |
1555 | 7faf5110 | Michael Hanselmann | 'kvm') |
1556 | 5c0c1eeb | Iustin Pop | DISK_COUNT |
1557 | 5c0c1eeb | Iustin Pop | The number of disks this instance will have |
1558 | 5c0c1eeb | Iustin Pop | NIC_COUNT |
1559 | 6c2d0b44 | Iustin Pop | The number of NICs this instance will have |
1560 | 5c0c1eeb | Iustin Pop | DISK_<N>_PATH |
1561 | 5c0c1eeb | Iustin Pop | Path to the Nth disk. |
1562 | 5c0c1eeb | Iustin Pop | DISK_<N>_ACCESS |
1563 | 5c0c1eeb | Iustin Pop | W if read/write, R if read only. OS scripts are not supposed to touch |
1564 | 5c0c1eeb | Iustin Pop | read-only disks, but are told about them for information. |
1565 | 5c0c1eeb | Iustin Pop | DISK_<N>_FRONTEND_TYPE |
1566 | 7faf5110 | Michael Hanselmann | Type of the disk as seen by the instance. Can be 'scsi', 'ide', |
1567 | 7faf5110 | Michael Hanselmann | 'virtio' |
1568 | 5c0c1eeb | Iustin Pop | DISK_<N>_BACKEND_TYPE |
1569 | 5c0c1eeb | Iustin Pop | Type of the disk as seen from the node. Can be 'block', 'file:loop' or |
1570 | 5c0c1eeb | Iustin Pop | 'file:blktap' |
1571 | 5c0c1eeb | Iustin Pop | NIC_<N>_MAC |
1572 | 5c0c1eeb | Iustin Pop | MAC address for the Nth network interface |
1573 | 5c0c1eeb | Iustin Pop | NIC_<N>_IP |
1574 | 5c0c1eeb | Iustin Pop | IP address for the Nth network interface, if available |
1575 | 5c0c1eeb | Iustin Pop | NIC_<N>_BRIDGE |
1576 | 5c0c1eeb | Iustin Pop | Node bridge the Nth network interface will be connected to |
1577 | 5c0c1eeb | Iustin Pop | NIC_<N>_FRONTEND_TYPE |
1578 | 6c2d0b44 | Iustin Pop | Type of the Nth NIC as seen by the instance. For example 'virtio', |
1579 | 6c2d0b44 | Iustin Pop | 'rtl8139', etc. |
1580 | 5c0c1eeb | Iustin Pop | DEBUG_LEVEL |
1581 | 7faf5110 | Michael Hanselmann | Whether more output should be produced, for debugging purposes. Currently |
1582 | 7faf5110 | Michael Hanselmann | the only valid values are 0 and 1. |
1583 | 5c0c1eeb | Iustin Pop | |
1584 | 6c2d0b44 | Iustin Pop | These are only the basic variables we are thinking of now, but more |
1585 | 6c2d0b44 | Iustin Pop | may come during the implementation and they will be documented in the |
1586 | fd07c6b3 | Iustin Pop | :manpage:`ganeti-os-api` man page. All these variables will be |
1587 | fd07c6b3 | Iustin Pop | available to all scripts. |
1588 | 5c0c1eeb | Iustin Pop | |
1589 | 5c0c1eeb | Iustin Pop | Some scripts will need some more information to work. These will have |
1590 | 5c0c1eeb | Iustin Pop | per-script variables, such as: |
1591 | 5c0c1eeb | Iustin Pop | |
1592 | 5c0c1eeb | Iustin Pop | OLD_INSTANCE_NAME |
1593 | 5c0c1eeb | Iustin Pop | rename: the name the instance should be renamed from. |
1594 | 5c0c1eeb | Iustin Pop | EXPORT_DEVICE |
1595 | 7faf5110 | Michael Hanselmann | export: device to be exported, a snapshot of the actual device. The |
1596 | 7faf5110 | Michael Hanselmann | data must be exported to stdout. |
1597 | 5c0c1eeb | Iustin Pop | EXPORT_INDEX |
1598 | 5c0c1eeb | Iustin Pop | export: sequential number of the instance device targeted. |
1599 | 5c0c1eeb | Iustin Pop | IMPORT_DEVICE |
1600 | 7faf5110 | Michael Hanselmann | import: device to send the data to, part of the new instance. The data |
1601 | 7faf5110 | Michael Hanselmann | must be imported from stdin. |
1602 | 5c0c1eeb | Iustin Pop | IMPORT_INDEX |
1603 | 5c0c1eeb | Iustin Pop | import: sequential number of the instance device targeted. |
1604 | 5c0c1eeb | Iustin Pop | |
1605 | 7faf5110 | Michael Hanselmann | (Rationale for INSTANCE_NAME as an environment variable: the instance |
1606 | 7faf5110 | Michael Hanselmann | name is always needed and we could pass it on the command line. On the |
1607 | 7faf5110 | Michael Hanselmann | other hand, though, this would force scripts to both access the |
1608 | 7faf5110 | Michael Hanselmann | environment and parse the command line, so we'll move it for |
1609 | 7faf5110 | Michael Hanselmann | uniformity.) |
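As an illustration of how an OS script might consume these variables, here is a minimal Python sketch. It assumes that ``<N>`` is counted from 0 and that optional values may simply be absent from the environment; neither point is fixed by this document. The function name and the keys of the returned dictionary are hypothetical.

```python
import os


def read_instance_env(env=None):
    """Collect the per-instance OS API variables into one dictionary.

    Assumption: disks and NICs are numbered from 0 to COUNT - 1.
    """
    env = os.environ if env is None else env
    disks = [
        {
            "path": env["DISK_%d_PATH" % n],
            # W means read/write, R means read-only
            "writable": env["DISK_%d_ACCESS" % n] == "W",
            "frontend": env.get("DISK_%d_FRONTEND_TYPE" % n),
            "backend": env.get("DISK_%d_BACKEND_TYPE" % n),
        }
        for n in range(int(env["DISK_COUNT"]))
    ]
    nics = [
        {
            "mac": env["NIC_%d_MAC" % n],
            "ip": env.get("NIC_%d_IP" % n),  # only set if available
            "bridge": env.get("NIC_%d_BRIDGE" % n),
            "frontend": env.get("NIC_%d_FRONTEND_TYPE" % n),
        }
        for n in range(int(env["NIC_COUNT"]))
    ]
    return {
        "name": env["INSTANCE_NAME"],
        "hypervisor": env["HYPERVISOR"],
        "disks": disks,
        "nics": nics,
    }
```

A script that only needs the writable disks can then filter on ``writable`` and ignore everything else, which is the "use only what you need" property the environment-based interface is meant to provide.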
1610 | 5c0c1eeb | Iustin Pop | |
1611 | 5c0c1eeb | Iustin Pop | |
1612 | 5c0c1eeb | Iustin Pop | Output/Behaviour |
1613 | 5c0c1eeb | Iustin Pop | ________________ |
1614 | 5c0c1eeb | Iustin Pop | |
1615 | 7faf5110 | Michael Hanselmann | As discussed, scripts should only send user-targeted information to |
1616 | 7faf5110 | Michael Hanselmann | stderr. The create and import scripts are supposed to format/initialise |
1617 | 7faf5110 | Michael Hanselmann | the given block devices and install the correct instance data. The |
1618 | 7faf5110 | Michael Hanselmann | export script is supposed to export instance data to stdout in a format |
1619 | 7faf5110 | Michael Hanselmann | understandable by the import script. The data will be compressed by |
1620 | 7faf5110 | Michael Hanselmann | Ganeti, so no compression should be done. The rename script should only |
1621 | 7faf5110 | Michael Hanselmann | modify the instance's knowledge of what its name is. |
1622 | 5c0c1eeb | Iustin Pop | |
1623 | 5c0c1eeb | Iustin Pop | Other declarative style features |
1624 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++ |
1625 | 5c0c1eeb | Iustin Pop | |
1626 | 5c0c1eeb | Iustin Pop | Similar to Ganeti 1.2, OS specifications will need to provide a |
1627 | 6c2d0b44 | Iustin Pop | 'ganeti_api_version' file containing a list of numbers matching the |
1628 | 6c2d0b44 | Iustin Pop | version(s) of the API they implement. Ganeti itself will always be |
1629 | 6c2d0b44 | Iustin Pop | compatible with one version of the API and may maintain backwards |
1630 | 6c2d0b44 | Iustin Pop | compatibility if it's feasible to do so. The numbers are one-per-line, |
1631 | 6c2d0b44 | Iustin Pop | so an OS supporting both version 5 and version 20 will have a file |
1632 | 6c2d0b44 | Iustin Pop | containing two lines. This is different from Ganeti 1.2, which only |
1633 | 6c2d0b44 | Iustin Pop | supported one version number. |
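A one-number-per-line file is simple to parse. The sketch below shows one possible way to read it and check it against the versions a given Ganeti release accepts; both function names are hypothetical, and skipping blank lines is an assumption rather than specified behaviour.

```python
def parse_api_versions(text):
    """Parse the contents of a 'ganeti_api_version' file (one number per line)."""
    return [int(line) for line in text.splitlines() if line.strip()]


def os_is_compatible(text, accepted_versions):
    """Return True if the OS declares at least one version Ganeti accepts."""
    return any(v in accepted_versions for v in parse_api_versions(text))
```

For the example in the text, an OS supporting versions 5 and 20 would ship a file whose contents are ``"5\n20\n"``.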
1634 | 5c0c1eeb | Iustin Pop | |
1635 | 7faf5110 | Michael Hanselmann | In addition to that, an OS will be able to declare that it supports |
1636 | 7faf5110 | Michael Hanselmann | only a subset of the Ganeti hypervisors, by declaring them in the |
1637 | 7faf5110 | Michael Hanselmann | 'hypervisors' file. |
1638 | 5c0c1eeb | Iustin Pop | |
1639 | 5c0c1eeb | Iustin Pop | |
1640 | 5c0c1eeb | Iustin Pop | Caveats/Notes |
1641 | 5c0c1eeb | Iustin Pop | +++++++++++++ |
1642 | 5c0c1eeb | Iustin Pop | |
1643 | 7faf5110 | Michael Hanselmann | We might want to have a "default" import/export behaviour that just |
1644 | 7faf5110 | Michael Hanselmann | dumps all disks and restores them. This can save work as most systems |
1645 | 7faf5110 | Michael Hanselmann | will just do this, while allowing flexibility for different systems. |
1646 | 5c0c1eeb | Iustin Pop | |
1647 | 7faf5110 | Michael Hanselmann | Environment variables are limited in size, but we expect that there will |
1648 | 7faf5110 | Michael Hanselmann | be enough space to store the information we need. If we discover that |
1649 | 7faf5110 | Michael Hanselmann | this is not the case we may want to go to a more complex API such as |
1650 | 7faf5110 | Michael Hanselmann | storing that information on the filesystem and providing the OS script |
1651 | 7faf5110 | Michael Hanselmann | with the path to a file where it is encoded in some format. |
1652 | 5c0c1eeb | Iustin Pop | |
1653 | 5c0c1eeb | Iustin Pop | |
1654 | 5c0c1eeb | Iustin Pop | |
1655 | 5c0c1eeb | Iustin Pop | Remote API changes |
1656 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~ |
1657 | 5c0c1eeb | Iustin Pop | |
1658 | 6c2d0b44 | Iustin Pop | The first Ganeti remote API (RAPI) was designed and deployed with the |
1659 | 6c2d0b44 | Iustin Pop | Ganeti 1.2.5 release. That version provides read-only access to the |
1660 | 6c2d0b44 | Iustin Pop | cluster state. A fully functional read-write API demands significant |
1661 | 6c2d0b44 | Iustin Pop | internal changes which will be implemented in version 2.0. |
1662 | 5c0c1eeb | Iustin Pop | |
1663 | 6c2d0b44 | Iustin Pop | We decided to go with implementing the Ganeti RAPI in a RESTful way, |
1664 | 6c2d0b44 | Iustin Pop | which is aligned with the key features we are looking for: it is a simple, |
1665 | 6c2d0b44 | Iustin Pop | stateless, scalable and extensible paradigm of API implementation. As |
1666 | 6c2d0b44 | Iustin Pop | transport it uses HTTP over SSL, and we are implementing it with JSON |
1667 | 6c2d0b44 | Iustin Pop | encoding, but in a way that makes it possible to extend it and provide |
1668 | 6c2d0b44 | Iustin Pop | any other one. |
1669 | 5c0c1eeb | Iustin Pop | |
1670 | 5c0c1eeb | Iustin Pop | Design |
1671 | 5c0c1eeb | Iustin Pop | ++++++ |
1672 | 5c0c1eeb | Iustin Pop | |
1673 | 6c2d0b44 | Iustin Pop | The Ganeti RAPI is implemented as an independent daemon, running on the |
1674 | 6c2d0b44 | Iustin Pop | same node with the same permission level as the Ganeti master |
1675 | 6c2d0b44 | Iustin Pop | daemon. Communication is done through the LUXI library to the master |
1676 | 6c2d0b44 | Iustin Pop | daemon. In order to keep communication asynchronous, RAPI processes two |
1677 | 6c2d0b44 | Iustin Pop | types of client requests: |
1678 | 5c0c1eeb | Iustin Pop | |
1679 | 6c2d0b44 | Iustin Pop | - queries: server is able to answer immediately |
1680 | 6c2d0b44 | Iustin Pop | - job submission: some time is required for a useful response |
1681 | 5c0c1eeb | Iustin Pop | |
1682 | 6c2d0b44 | Iustin Pop | In the query case, the requested data is sent back to the client in the HTTP |
1683 | 6c2d0b44 | Iustin Pop | response body. Typical examples of queries would be: list of nodes, |
1684 | 6c2d0b44 | Iustin Pop | instances, cluster info, etc. |
1685 | 5c0c1eeb | Iustin Pop | |
1686 | 6c2d0b44 | Iustin Pop | In the case of job submission, the client receives a job ID, an |
1687 | 6c2d0b44 | Iustin Pop | identifier which allows it to query the job progress in the job queue |
1688 | 6c2d0b44 | Iustin Pop | (see `Job Queue`_). |
1689 | 6c2d0b44 | Iustin Pop | |
1690 | 6c2d0b44 | Iustin Pop | Internally, each exported object has a version identifier, which is |
1691 | 6c2d0b44 | Iustin Pop | used as a state identifier in the HTTP header ETag field for |
1692 | 6c2d0b44 | Iustin Pop | requests/responses to avoid race conditions. |
1693 | 5c0c1eeb | Iustin Pop | |
1694 | 5c0c1eeb | Iustin Pop | |
1695 | 5c0c1eeb | Iustin Pop | Resource representation |
1696 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
1697 | 5c0c1eeb | Iustin Pop | |
1698 | 6c2d0b44 | Iustin Pop | The key difference of using REST instead of other APIs is that REST |
1699 | 6c2d0b44 | Iustin Pop | requires separation of services via resources with unique URIs. Each |
1700 | 6c2d0b44 | Iustin Pop | of them should have a limited amount of state and support standard HTTP |
1701 | 5c0c1eeb | Iustin Pop | methods: GET, POST, DELETE, PUT. |
1702 | 5c0c1eeb | Iustin Pop | |
1703 | 6c2d0b44 | Iustin Pop | For example, in Ganeti's case we can have a set of URIs: |
1704 | 6c2d0b44 | Iustin Pop | |
1705 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/instances`` |
1706 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/instances/{instancename}`` |
1707 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/instances/{instancename}/tag`` |
1708 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/tag`` |
1709 | 5c0c1eeb | Iustin Pop | |
1710 | 6c2d0b44 | Iustin Pop | A GET request to ``/{clustername}/instances`` will return the list of |
1711 | 6c2d0b44 | Iustin Pop | instances, a POST to ``/{clustername}/instances`` should create a new |
1712 | 6c2d0b44 | Iustin Pop | instance, a DELETE to ``/{clustername}/instances/{instancename}`` should |
1713 | 6c2d0b44 | Iustin Pop | delete the instance, and a GET to ``/{clustername}/tag`` should return the |
1714 | 6c2d0b44 | Iustin Pop | cluster tags. |
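The mapping from (method, URI) pairs to operations described above can be sketched as a small routing table. This is only an illustration of the resource model, not the RAPI daemon's implementation: the action labels and the ``resolve`` function are hypothetical, and the version prefix is omitted.

```python
import re

# (HTTP method, URI pattern, illustrative action label)
ROUTES = [
    ("GET", r"/(?P<cluster>[^/]+)/instances", "list-instances"),
    ("POST", r"/(?P<cluster>[^/]+)/instances", "create-instance"),
    ("DELETE", r"/(?P<cluster>[^/]+)/instances/(?P<instance>[^/]+)",
     "delete-instance"),
    ("GET", r"/(?P<cluster>[^/]+)/tag", "get-cluster-tags"),
]


def resolve(method, path):
    """Return (action, URI parameters) for a request, or None if unknown."""
    for route_method, pattern, action in ROUTES:
        match = re.fullmatch(pattern, path)
        if route_method == method and match:
            return action, match.groupdict()
    return None
```

Because the same URI appears with different methods, the method is part of the lookup key: ``GET`` and ``POST`` on ``/{clustername}/instances`` resolve to different actions.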
1715 | 5c0c1eeb | Iustin Pop | |
1716 | 6c2d0b44 | Iustin Pop | Each resource URI will have a version prefix. The resource IDs are to |
1717 | 6c2d0b44 | Iustin Pop | be determined. |
1718 | 5c0c1eeb | Iustin Pop | |
1719 | 6c2d0b44 | Iustin Pop | Internal encoding might be JSON, XML, or any other. The JSON encoding |
1720 | 6c2d0b44 | Iustin Pop | fits the Ganeti RAPI needs nicely. The client can request a specific |
1721 | 6c2d0b44 | Iustin Pop | representation via the Accept field in the HTTP header. |
1722 | 5c0c1eeb | Iustin Pop | |
1723 | 6c2d0b44 | Iustin Pop | REST uses HTTP as its transport and application protocol for resource |
1724 | 6c2d0b44 | Iustin Pop | access. The set of possible responses is a subset of standard HTTP |
1725 | 6c2d0b44 | Iustin Pop | responses. |
1726 | 6c2d0b44 | Iustin Pop | |
1727 | 6c2d0b44 | Iustin Pop | The statelessness model provides additional reliability and |
1728 | 6c2d0b44 | Iustin Pop | transparency to operations (e.g. only one request needs to be analyzed |
1729 | 6c2d0b44 | Iustin Pop | to understand the in-progress operation, not a sequence of multiple |
1730 | 6c2d0b44 | Iustin Pop | requests/responses). |
1731 | 5c0c1eeb | Iustin Pop | |
1732 | 5c0c1eeb | Iustin Pop | |
1733 | 5c0c1eeb | Iustin Pop | Security |
1734 | 5c0c1eeb | Iustin Pop | ++++++++ |
1735 | 5c0c1eeb | Iustin Pop | |
1736 | 6c2d0b44 | Iustin Pop | With the write functionality, security becomes a much bigger issue. |
1737 | 6c2d0b44 | Iustin Pop | The Ganeti RAPI uses basic HTTP authentication on top of an |
1738 | 6c2d0b44 | Iustin Pop | SSL-secured connection to grant access to an exported resource. The |
1739 | 6c2d0b44 | Iustin Pop | password is stored locally in an Apache-style ``.htpasswd`` file. Only |
1740 | 6c2d0b44 | Iustin Pop | one level of privileges is supported. |
1741 | 6c2d0b44 | Iustin Pop | |
1742 | 6c2d0b44 | Iustin Pop | Caveats |
1743 | 6c2d0b44 | Iustin Pop | +++++++ |
1744 | 6c2d0b44 | Iustin Pop | |
1745 | 6c2d0b44 | Iustin Pop | The model detailed above for job submission requires the client to |
1746 | 6c2d0b44 | Iustin Pop | poll periodically for updates to the job; an alternative would be to |
1747 | 6c2d0b44 | Iustin Pop | allow the client to request a callback, or a 'wait for updates' call. |
1748 | 6c2d0b44 | Iustin Pop | |
1749 | 6c2d0b44 | Iustin Pop | The callback model was not considered due to the following two issues: |
1750 | 5c0c1eeb | Iustin Pop | |
1751 | 6c2d0b44 | Iustin Pop | - callbacks would require a new model of allowed callback URLs, |
1752 | 6c2d0b44 | Iustin Pop | together with a method of managing these |
1753 | 6c2d0b44 | Iustin Pop | - callbacks only work when the client and the master are in the same |
1754 | 6c2d0b44 | Iustin Pop | security domain, and they fail in the other cases (e.g. when there is |
1755 | 6c2d0b44 | Iustin Pop | a firewall between the client and the RAPI daemon that only allows |
1756 | 6c2d0b44 | Iustin Pop | client-to-RAPI calls, which is usual in DMZ cases) |
1757 | 6c2d0b44 | Iustin Pop | |
1758 | 6c2d0b44 | Iustin Pop | The 'wait for updates' method is not suited to the HTTP protocol, |
1759 | 6c2d0b44 | Iustin Pop | where requests are supposed to be short-lived. |
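The polling model that remains can be sketched on the client side as follows. The status-fetching callable, the status strings and the polling interval are all assumptions made for illustration; the document does not define them.

```python
import time


def wait_for_job(get_status, job_id, interval=1.0, sleep=time.sleep,
                 final_states=("success", "error")):
    """Poll a job until it reaches a final state and return that state.

    get_status is any callable taking a job ID and returning a status
    string, e.g. a wrapper around a short-lived HTTP GET request.
    """
    while True:
        status = get_status(job_id)
        if status in final_states:
            return status
        sleep(interval)
```

Each call to ``get_status`` is an independent, short-lived request, which fits the stateless model described earlier.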
1760 | 5c0c1eeb | Iustin Pop | |
1761 | 5c0c1eeb | Iustin Pop | Command line changes |
1762 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~ |
1763 | 5c0c1eeb | Iustin Pop | |
1764 | 5c0c1eeb | Iustin Pop | Ganeti 2.0 introduces several new features as well as new ways to |
1765 | 5c0c1eeb | Iustin Pop | handle instance resources like disks or network interfaces. This |
1766 | 6c2d0b44 | Iustin Pop | requires some noticeable changes in the way command line arguments are |
1767 | 5c0c1eeb | Iustin Pop | handled. |
1768 | 5c0c1eeb | Iustin Pop | |
1769 | 6c2d0b44 | Iustin Pop | - extend and modify command line syntax to support new features |
1770 | 6c2d0b44 | Iustin Pop | - ensure consistent patterns in command line arguments to reduce |
1771 | 6c2d0b44 | Iustin Pop | cognitive load |
1772 | 5c0c1eeb | Iustin Pop | |
1773 | 5c0c1eeb | Iustin Pop | The design changes that require these changes are, in no particular |
1774 | 5c0c1eeb | Iustin Pop | order: |
1775 | 5c0c1eeb | Iustin Pop | |
1776 | 5c0c1eeb | Iustin Pop | - flexible instance disk handling: support a variable number of disks |
1777 | 5c0c1eeb | Iustin Pop | with varying properties per instance, |
1778 | 5c0c1eeb | Iustin Pop | - flexible instance network interface handling: support a variable |
1779 | 5c0c1eeb | Iustin Pop | number of network interfaces with varying properties per instance |
1780 | 5c0c1eeb | Iustin Pop | - multiple hypervisors: multiple hypervisors can be active on the same |
1781 | 5c0c1eeb | Iustin Pop | cluster, each supporting different parameters, |
1782 | 5c0c1eeb | Iustin Pop | - support for device type CDROM (via ISO image) |
1783 | 5c0c1eeb | Iustin Pop | |
1784 | 6c2d0b44 | Iustin Pop | As such, there are several areas of Ganeti where the command line |
1785 | 5c0c1eeb | Iustin Pop | arguments will change: |
1786 | 5c0c1eeb | Iustin Pop | |
1787 | 5c0c1eeb | Iustin Pop | - Cluster configuration |
1788 | 5c0c1eeb | Iustin Pop | |
1789 | 5c0c1eeb | Iustin Pop | - cluster initialization |
1790 | 5c0c1eeb | Iustin Pop | - cluster default configuration |
1791 | 5c0c1eeb | Iustin Pop | |
1792 | 5c0c1eeb | Iustin Pop | - Instance configuration |
1793 | 5c0c1eeb | Iustin Pop | |
1794 | 5c0c1eeb | Iustin Pop | - handling of network cards for instances, |
1795 | 5c0c1eeb | Iustin Pop | - handling of disks for instances, |
1796 | 5c0c1eeb | Iustin Pop | - handling of CDROM devices and |
1797 | 5c0c1eeb | Iustin Pop | - handling of hypervisor specific options. |
1798 | 5c0c1eeb | Iustin Pop | |
1814 | 5c0c1eeb | Iustin Pop | Notes about device removal/addition |
1815 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++++++ |
1816 | 5c0c1eeb | Iustin Pop | |
1817 | 5c0c1eeb | Iustin Pop | To avoid problems with device location changes (e.g. second network |
1818 | 5c0c1eeb | Iustin Pop | interface of the instance becoming the first or third and the like) |
1819 | 5c0c1eeb | Iustin Pop | the list of network/disk devices is treated as a stack, i.e. devices |
1820 | 5c0c1eeb | Iustin Pop | can only be added/removed at the end of the list of devices of each |
1821 | 5c0c1eeb | Iustin Pop | class (disk or network) for each instance. |
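
The stack rule above can be illustrated with a minimal Python sketch. The class name ``DeviceStack`` and its methods are hypothetical and only illustrate the rule that devices are added and removed at the end of the list, so existing device numbers never shift; this is not Ganeti's actual implementation.

```python
class DeviceStack:
    """Ordered devices of one class (disk or network) for one instance.

    Hypothetical helper: devices can only be appended or popped at the
    end, so the number of an existing device never changes.
    """

    def __init__(self, devices=None):
        self._devices = list(devices or [])

    def add(self, device):
        """Append a device and return the device number it received."""
        self._devices.append(device)
        return len(self._devices) - 1

    def remove(self):
        """Remove and return the last device of this class."""
        if not self._devices:
            raise ValueError("instance has no device of this class")
        return self._devices.pop()

    def numbered(self):
        """Return (device number, device) pairs in order."""
        return list(enumerate(self._devices))


# Placeholder device names, for illustration only.
nics = DeviceStack(["nic-a", "nic-b"])
new_index = nics.add("nic-c")   # appended at the end
removed = nics.remove()         # the last device is removed again
```

Because there is no operation that removes a device from the middle, the second network interface can never become the first or the third.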
1822 | 5c0c1eeb | Iustin Pop | |
1823 | 5c0c1eeb | Iustin Pop | gnt-instance commands |
1824 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
1825 | 5c0c1eeb | Iustin Pop | |
1826 | 5c0c1eeb | Iustin Pop | The commands for gnt-instance will be modified and extended to allow |
1827 | 5c0c1eeb | Iustin Pop | for the new functionality: |
1828 | 5c0c1eeb | Iustin Pop | |
1829 | 5c0c1eeb | Iustin Pop | - the add command will be extended to support the new device and |
1830 | 5c0c1eeb | Iustin Pop | hypervisor options, |
1831 | 5c0c1eeb | Iustin Pop | - the modify command continues to handle all modifications to |
1832 | 5c0c1eeb | Iustin Pop | instances, but will be extended with new arguments for handling |
1833 | 5c0c1eeb | Iustin Pop | devices. |
1834 | 5c0c1eeb | Iustin Pop | |
1835 | 5c0c1eeb | Iustin Pop | Network Device Options |
1836 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++ |
1837 | 5c0c1eeb | Iustin Pop | |
1838 | 5c0c1eeb | Iustin Pop | The generic format of the network device option is:: |
1839 | 5c0c1eeb | Iustin Pop | |
1840 | 5c0c1eeb | Iustin Pop | --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE] |
1841 | 5c0c1eeb | Iustin Pop | |
1842 | 5c0c1eeb | Iustin Pop | :$DEVNUM: device number, unsigned integer, starting at 0, |
1843 | 5c0c1eeb | Iustin Pop | :$OPTION: device option, string, |
1844 | 5c0c1eeb | Iustin Pop | :$VALUE: device option value, string. |
1845 | 5c0c1eeb | Iustin Pop | |
1846 | 5c0c1eeb | Iustin Pop | Currently, the following device options will be defined (open to |
1847 | 5c0c1eeb | Iustin Pop | further changes): |
1848 | 5c0c1eeb | Iustin Pop | |
1849 | 5c0c1eeb | Iustin Pop | :mac: MAC address of the network interface, accepts either a valid |
1850 | 5c0c1eeb | Iustin Pop | MAC address or the string 'auto'. If 'auto' is specified, a new MAC |
1851 | 5c0c1eeb | Iustin Pop | address will be generated randomly. If the mac device option is not |
1852 | 5c0c1eeb | Iustin Pop | specified, the default value 'auto' is assumed. |
1853 | 5c0c1eeb | Iustin Pop | :bridge: network bridge the network interface is connected |
1854 | 5c0c1eeb | Iustin Pop | to. Accepts either a valid bridge name (the specified bridge must |
1855 | 5c0c1eeb | Iustin Pop | exist on the node(s)) as string or the string 'auto'. If 'auto' is |
1856 | 5c0c1eeb | Iustin Pop | specified, the default bridge is used. If the bridge option is not |
1857 | 5c0c1eeb | Iustin Pop | specified, the default value 'auto' is assumed. |
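
As an illustration of the network device option format, here is a minimal parsing sketch in Python. The function names and the ``NET_DEFAULTS`` table are assumptions made for this example, not part of Ganeti; only the option names ``mac`` and ``bridge`` and their ``auto`` default come from the text above.

```python
# Documented defaults: both options fall back to 'auto' when omitted.
NET_DEFAULTS = {"mac": "auto", "bridge": "auto"}


def parse_device_option(value):
    """Split '$DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]' into (key, options).

    Only the first ':' separates the device number from the options, so
    option values that contain ':' (such as a MAC address) stay intact.
    """
    key, _, rest = value.partition(":")
    options = {}
    if rest:
        for pair in rest.split(","):
            name, sep, val = pair.partition("=")
            if not sep or not name:
                raise ValueError("malformed option %r" % pair)
            options[name] = val
    return key, options


def parse_net_option(value):
    """Return (device number, options with the defaults filled in)."""
    key, options = parse_device_option(value)
    unknown = set(options) - set(NET_DEFAULTS)
    if unknown:
        raise ValueError("unknown network options: %s" % sorted(unknown))
    devnum = int(key)  # raises ValueError for non-numeric input
    if devnum < 0:
        raise ValueError("device number must not be negative")
    merged = dict(NET_DEFAULTS)
    merged.update(options)
    return devnum, merged
```

With this sketch, ``parse_net_option("1")`` yields device number 1 with both options set to ``auto``, matching the documented defaults.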
1858 | 5c0c1eeb | Iustin Pop | |
1859 | 5c0c1eeb | Iustin Pop | Disk Device Options |
1860 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++ |
1861 | 5c0c1eeb | Iustin Pop | |
1862 | 5c0c1eeb | Iustin Pop | The generic format of the disk device option is:: |
1863 | 5c0c1eeb | Iustin Pop | |
1864 | 5c0c1eeb | Iustin Pop | --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE] |
1865 | 5c0c1eeb | Iustin Pop | |
1866 | 5c0c1eeb | Iustin Pop | :$DEVNUM: device number, unsigned integer, starting at 0, |
1867 | 5c0c1eeb | Iustin Pop | :$OPTION: device option, string, |
1868 | 5c0c1eeb | Iustin Pop | :$VALUE: device option value, string. |
1869 | 5c0c1eeb | Iustin Pop | |
1870 | 5c0c1eeb | Iustin Pop | Currently, the following device options will be defined (open to |
1871 | 5c0c1eeb | Iustin Pop | further changes): |
1872 | 5c0c1eeb | Iustin Pop | |
1873 | 5c0c1eeb | Iustin Pop | :size: size of the disk device, either a positive number, specifying |
1874 | 5c0c1eeb | Iustin Pop | the disk size in mebibytes, or a number followed by a magnitude suffix |
1875 | 5c0c1eeb | Iustin Pop | (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in |
1876 | 5c0c1eeb | Iustin Pop | which case the default disk size will be used. If the size option is |
1877 | 5c0c1eeb | Iustin Pop | not specified, 'auto' is assumed. This option is not valid for all |
1878 | 5c0c1eeb | Iustin Pop | disk layout types. |
1879 | 5c0c1eeb | Iustin Pop | :access: access mode of the disk device, a single letter, valid values |
1880 | 5c0c1eeb | Iustin Pop | are: |
1881 | 5c0c1eeb | Iustin Pop | |
1882 | 5c0c1eeb | Iustin Pop | - *w*: read/write access to the disk device or |
1883 | 5c0c1eeb | Iustin Pop | - *r*: read-only access to the disk device. |
1884 | 5c0c1eeb | Iustin Pop | |
1885 | 5c0c1eeb | Iustin Pop | If the access mode is not specified, the default mode of read/write |
1886 | 5c0c1eeb | Iustin Pop | access will be configured. |
1887 | 5c0c1eeb | Iustin Pop | :path: path to the image file for the disk device, string. No default |
1888 | 5c0c1eeb | Iustin Pop | exists. This option is not valid for all disk layout types. |
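
The size rules above can be sketched as a small Python helper. The function name ``parse_disk_size`` and its ``default_mib`` parameter are hypothetical; the sketch also assumes whole numbers and upper-case suffixes only, which the text does not state explicitly.

```python
def parse_disk_size(text, default_mib):
    """Return a disk size in mebibytes.

    Accepts 'auto' (use the default), a bare positive number (mebibytes),
    or a positive number followed by M (mebibytes) or G (gibibytes).
    """
    text = text.strip()
    if text == "auto":
        return default_mib
    multipliers = {"M": 1, "G": 1024}
    suffix = text[-1:]
    if suffix in multipliers:
        number, factor = text[:-1], multipliers[suffix]
    else:
        number, factor = text, 1
    # Assumption: only positive whole numbers are accepted.
    if not number.isdigit() or int(number) == 0:
        raise ValueError("invalid disk size %r" % text)
    return int(number) * factor
```

Normalizing every size to mebibytes at parse time keeps later comparisons and arithmetic independent of how the user wrote the value.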
1889 | 5c0c1eeb | Iustin Pop | |
1890 | 5c0c1eeb | Iustin Pop | Adding devices |
1891 | 5c0c1eeb | Iustin Pop | ++++++++++++++ |
1892 | 5c0c1eeb | Iustin Pop | |
1893 | 5c0c1eeb | Iustin Pop | To add devices to an already existing instance, use the device type |
1894 | 5c0c1eeb | Iustin Pop | specific option to gnt-instance modify. Currently, there are two |
1895 | 5c0c1eeb | Iustin Pop | device type specific options supported: |
1896 | 5c0c1eeb | Iustin Pop | |
1897 | 5c0c1eeb | Iustin Pop | :--net: for network interface cards |
1898 | 5c0c1eeb | Iustin Pop | :--disk: for disk devices |
1899 | 5c0c1eeb | Iustin Pop | |
1900 | 6c2d0b44 | Iustin Pop | The syntax of the device specific options is similar to the generic |
1901 | 5c0c1eeb | Iustin Pop | device options, but instead of specifying a device number as for |
1902 | 5c0c1eeb | Iustin Pop | gnt-instance add, you specify the magic string ``add``. The new device |
1903 | 5c0c1eeb | Iustin Pop | will always be appended at the end of the list of devices of this type |
1904 | 5c0c1eeb | Iustin Pop | for the specified instance, e.g. if the instance has disk devices 0, 1 |
1905 | 5c0c1eeb | Iustin Pop | and 2, the newly added disk device will be disk device 3. |
1906 | 5c0c1eeb | Iustin Pop | |
1907 | 5c0c1eeb | Iustin Pop | Example: ``gnt-instance modify --net add:mac=auto test-instance`` |
1908 | 5c0c1eeb | Iustin Pop | |
1909 | 5c0c1eeb | Iustin Pop | Removing devices |
1910 | 5c0c1eeb | Iustin Pop | ++++++++++++++++ |
1911 | 5c0c1eeb | Iustin Pop | |
1912 | 5c0c1eeb | Iustin Pop | Removing devices from an instance is done via gnt-instance |
1913 | 5c0c1eeb | Iustin Pop | modify. The same device specific options as for adding devices are |
1914 | 5c0c1eeb | Iustin Pop | used. Instead of a device number and further device options, only the |
1915 | 5c0c1eeb | Iustin Pop | magic string ``remove`` is specified. It will always remove the last |
1916 | 5c0c1eeb | Iustin Pop | device in the list of devices of this type for the instance specified, |
1917 | 5c0c1eeb | Iustin Pop | e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device |
1918 | 5c0c1eeb | Iustin Pop | number 3 will be removed. |
1919 | 5c0c1eeb | Iustin Pop | |
1920 | 5c0c1eeb | Iustin Pop | Example: ``gnt-instance modify --net remove test-instance`` |
1921 | 5c0c1eeb | Iustin Pop | |
1922 | 5c0c1eeb | Iustin Pop | Modifying devices |
1923 | 5c0c1eeb | Iustin Pop | +++++++++++++++++ |
1924 | 5c0c1eeb | Iustin Pop | |
1925 | 5c0c1eeb | Iustin Pop | Modifying devices is also done with device type specific options to |
1926 | 5c0c1eeb | Iustin Pop | the gnt-instance modify command. There are currently two device type |
1927 | 5c0c1eeb | Iustin Pop | options supported: |
1928 | 5c0c1eeb | Iustin Pop | |
1929 | 5c0c1eeb | Iustin Pop | :--net: for network interface cards |
1930 | 5c0c1eeb | Iustin Pop | :--disk: for disk devices |
1931 | 5c0c1eeb | Iustin Pop | |
1932 | 6c2d0b44 | Iustin Pop | The syntax of the device specific options is similar to the generic |
1933 | 5c0c1eeb | Iustin Pop | device options. The device number you specify identifies the device to |
1934 | 5c0c1eeb | Iustin Pop | be modified. |
1935 | 5c0c1eeb | Iustin Pop | |
1936 | 6c2d0b44 | Iustin Pop | Example:: |
1937 | 6c2d0b44 | Iustin Pop | |
1938 | 6c2d0b44 | Iustin Pop | gnt-instance modify --disk 2:access=r test-instance |
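
The three forms accepted by gnt-instance modify (``add``, ``remove`` and a device number) can be sketched together in Python. The function ``apply_device_change`` and the representation of devices as dictionaries of options are assumptions made for illustration; they do not describe Ganeti's internal data model.

```python
def apply_device_change(devices, spec):
    """Return a new device list after applying one --net/--disk value.

    devices: list of option dictionaries, indexed by device number.
    spec:    'add[:options]', 'remove', or '$DEVNUM:options'.
    """
    key, _, rest = spec.partition(":")
    # dict() raises ValueError if a pair has no '=' separator.
    options = dict(pair.split("=", 1) for pair in rest.split(",")) if rest else {}
    result = [dict(device) for device in devices]  # leave the input untouched
    if key == "add":
        result.append(options)          # always appended at the end
    elif key == "remove":
        if options:
            raise ValueError("remove takes no device options")
        if not result:
            raise ValueError("no device to remove")
        result.pop()                    # always removes the last device
    else:
        index = int(key)
        if not 0 <= index < len(result):
            raise ValueError("no device number %d" % index)
        result[index].update(options)   # modify an existing device in place
    return result


# Placeholder sizes, for illustration only.
disks = [{"size": "10G"}, {"size": "5G"}, {"size": "1G"}]
read_only = apply_device_change(disks, "2:access=r")
grown = apply_device_change(disks, "add:size=2G")
shrunk = apply_device_change(disks, "remove")
```

Returning a new list rather than mutating the argument makes it straightforward to validate a change before committing it to the configuration.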
1939 | 5c0c1eeb | Iustin Pop | |
1940 | 5c0c1eeb | Iustin Pop | Hypervisor Options |
1941 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++ |
1942 | 5c0c1eeb | Iustin Pop | |
1943 | 5c0c1eeb | Iustin Pop | Ganeti 2.0 will support more than one hypervisor. Different |
1944 | 5c0c1eeb | Iustin Pop | hypervisors have various options that only apply to a specific |
1945 | 5c0c1eeb | Iustin Pop | hypervisor. Those hypervisor specific options are treated specially |
1946 | 6c2d0b44 | Iustin Pop | via the ``--hypervisor`` option. The generic syntax of the hypervisor |
1947 | 6c2d0b44 | Iustin Pop | option is as follows:: |
1948 | 5c0c1eeb | Iustin Pop | |
1949 | 5c0c1eeb | Iustin Pop | --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
1950 | 5c0c1eeb | Iustin Pop | |
1951 | 5c0c1eeb | Iustin Pop | :$HYPERVISOR: symbolic name of the hypervisor to use, string, |
1952 | 5c0c1eeb | Iustin Pop | has to match the supported hypervisors. Example: xen-pvm |
1953 | 5c0c1eeb | Iustin Pop | |
1954 | 5c0c1eeb | Iustin Pop | :$OPTION: hypervisor option name, string |
1955 | 5c0c1eeb | Iustin Pop | :$VALUE: hypervisor option value, string |
1956 | 5c0c1eeb | Iustin Pop | |
1957 | 5c0c1eeb | Iustin Pop | The hypervisor option for an instance can be set at instance creation |
1958 | 6c2d0b44 | Iustin Pop | time via the ``gnt-instance add`` command. If the hypervisor for an |
1959 | 5c0c1eeb | Iustin Pop | instance is not specified upon instance creation, the default |
1960 | 5c0c1eeb | Iustin Pop | hypervisor will be used. |
1961 | 5c0c1eeb | Iustin Pop | |
1962 | 5c0c1eeb | Iustin Pop | Modifying hypervisor parameters |
1963 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++ |
1964 | 5c0c1eeb | Iustin Pop | |
1965 | 5c0c1eeb | Iustin Pop | The hypervisor parameters of an existing instance can be modified |
1966 | 6c2d0b44 | Iustin Pop | using the ``--hypervisor`` option of the ``gnt-instance modify`` |
1967 | 6c2d0b44 | Iustin Pop | command. However, the hypervisor type of an existing instance cannot |
1968 | 6c2d0b44 | Iustin Pop | be changed; only the hypervisor specific options can be |
1969 | 6c2d0b44 | Iustin Pop | changed. Therefore, the format of the option parameters has been |
1970 | 6c2d0b44 | Iustin Pop | simplified to omit the hypervisor name and only contain the comma |
1971 | 6c2d0b44 | Iustin Pop | separated list of option-value pairs. |
1972 | 5c0c1eeb | Iustin Pop | |
1973 | 6c2d0b44 | Iustin Pop | Example:: |
1974 | 6c2d0b44 | Iustin Pop | |
1975 | 6c2d0b44 | Iustin Pop | gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance |
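
The two forms of the hypervisor option (with the hypervisor name for ``gnt-instance add``, without it for ``gnt-instance modify``) can be sketched in Python as follows. The function name and the ``with_name`` flag are hypothetical. Note that option values may themselves contain a colon, as ``boot_order=cdrom:network`` does in the example above, so only the first colon can be treated as the name separator.

```python
def parse_hypervisor_option(value, with_name=True):
    """Parse a --hypervisor value into (hypervisor name, parameters).

    with_name=True  expects '$HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE]'
                    (the form used at instance creation).
    with_name=False expects '$OPTION=$VALUE[,$OPTION=$VALUE]'
                    (the simplified form used when modifying an instance).
    """
    if with_name:
        name, sep, rest = value.partition(":")  # split on the first ':' only
        if not sep or not name:
            raise ValueError("missing hypervisor name in %r" % value)
    else:
        name, rest = None, value
    params = {}
    for pair in rest.split(","):
        option, sep, val = pair.partition("=")
        if not sep or not option:
            raise ValueError("malformed option %r" % pair)
        params[option] = val
    return name, params


# The option names are taken from the example above; pairing them with
# xen-pvm only exercises the syntax and says nothing about which options
# that hypervisor actually supports.
created = parse_hypervisor_option("xen-pvm:boot_order=cdrom:network")
modified = parse_hypervisor_option(
    "cdrom=/srv/boot.iso,boot_order=cdrom:network", with_name=False)
```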
1976 | 5c0c1eeb | Iustin Pop | |
1977 | 5c0c1eeb | Iustin Pop | gnt-cluster commands |
1978 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++ |
1979 | 5c0c1eeb | Iustin Pop | |
1980 | 5c0c1eeb | Iustin Pop | The commands for gnt-cluster will be extended to allow setting and |
1981 | 5c0c1eeb | Iustin Pop | changing the default parameters of the cluster: |
1982 | 5c0c1eeb | Iustin Pop | |
1983 | 5c0c1eeb | Iustin Pop | - The init command will be extended to support the --defaults option to |
1984 | 5c0c1eeb | Iustin Pop | set the cluster defaults upon cluster initialization. |
1985 | 5c0c1eeb | Iustin Pop | - The modify command will be added to modify the cluster |
1986 | 5c0c1eeb | Iustin Pop | parameters. It will support the --defaults option to change the |
1987 | 5c0c1eeb | Iustin Pop | cluster defaults. |
1988 | 5c0c1eeb | Iustin Pop | |
1989 | 5c0c1eeb | Iustin Pop | Cluster defaults |
1990 | 5c0c1eeb | Iustin Pop | |
1991 | 5c0c1eeb | Iustin Pop | The generic format of the cluster default setting option is:: |
1992 | 5c0c1eeb | Iustin Pop | |
1993 | 5c0c1eeb | Iustin Pop | --defaults $OPTION=$VALUE[,$OPTION=$VALUE] |
1994 | 5c0c1eeb | Iustin Pop | |
1995 | 5c0c1eeb | Iustin Pop | :$OPTION: cluster default option, string, |
1996 | 5c0c1eeb | Iustin Pop | :$VALUE: cluster default option value, string. |
1997 | 5c0c1eeb | Iustin Pop | |
1998 | 5c0c1eeb | Iustin Pop | Currently, the following cluster default options are defined (open to |
1999 | 5c0c1eeb | Iustin Pop | further changes): |
2000 | 5c0c1eeb | Iustin Pop | |
2001 | 5c0c1eeb | Iustin Pop | :hypervisor: the default hypervisor to use for new instances, |
2002 | 5c0c1eeb | Iustin Pop | string. Must be a valid hypervisor known to and supported by the |
2003 | 5c0c1eeb | Iustin Pop | cluster. |
2004 | 5c0c1eeb | Iustin Pop | :disksize: the default disk size for newly created instance disks, where |
2005 | 5c0c1eeb | Iustin Pop | applicable. Must be either a positive number, in which case the unit |
2006 | 5c0c1eeb | Iustin Pop | of mebibytes is assumed, or a positive number followed by a supported |
2007 | 5c0c1eeb | Iustin Pop | magnitude suffix (M for mebibytes or G for gibibytes). |
2008 | 5c0c1eeb | Iustin Pop | :bridge: the default network bridge to use for newly created instance |
2009 | 5c0c1eeb | Iustin Pop | network interfaces, string. Must be a valid bridge name of a bridge |
2010 | 5c0c1eeb | Iustin Pop | existing on the node(s). |
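
A minimal Python sketch of how a ``--defaults`` value could be applied to the existing cluster defaults is shown below. The function name, the ``KNOWN_DEFAULTS`` tuple and the sample values are assumptions for illustration; only the three option names come from the list above.

```python
# Option names listed above; anything else is rejected in this sketch.
KNOWN_DEFAULTS = ("hypervisor", "disksize", "bridge")


def update_cluster_defaults(current, value):
    """Return a copy of current with '$OPTION=$VALUE[,...]' applied."""
    updated = dict(current)
    for pair in value.split(","):
        option, sep, val = pair.partition("=")
        if not sep:
            raise ValueError("malformed option %r" % pair)
        if option not in KNOWN_DEFAULTS:
            raise ValueError("unknown cluster default %r" % option)
        updated[option] = val
    return updated


# Hypothetical starting values, for illustration only.
defaults = {"hypervisor": "xen-pvm", "disksize": "10G", "bridge": "br0"}
changed = update_cluster_defaults(defaults, "disksize=20G,bridge=br1")
```

Options that are not mentioned in the ``--defaults`` value keep their previous setting, which is what allows the same syntax to serve both cluster initialization and later modification.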
2011 | 5c0c1eeb | Iustin Pop | |
2012 | 5c0c1eeb | Iustin Pop | Hypervisor cluster defaults |
2013 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++ |
2014 | 5c0c1eeb | Iustin Pop | |
2015 | 6c2d0b44 | Iustin Pop | The generic format of the hypervisor cluster wide default setting |
2016 | 6c2d0b44 | Iustin Pop | option is:: |
2017 | 5c0c1eeb | Iustin Pop | |
2018 | 5c0c1eeb | Iustin Pop | --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
2019 | 5c0c1eeb | Iustin Pop | |
2020 | 5c0c1eeb | Iustin Pop | :$HYPERVISOR: symbolic name of the hypervisor whose defaults you want |
2021 | 5c0c1eeb | Iustin Pop | to set, string |
2022 | 5c0c1eeb | Iustin Pop | :$OPTION: cluster default option, string, |
2023 | 5c0c1eeb | Iustin Pop | :$VALUE: cluster default option value, string. |
2024 | 558fd122 | Michael Hanselmann | |
2025 | 558fd122 | Michael Hanselmann | .. vim: set textwidth=72 : |