root / doc / design-2.0.rst @ e18def2a
1 | 5c0c1eeb | Iustin Pop | ================= |
---|---|---|---|
2 | 5c0c1eeb | Iustin Pop | Ganeti 2.0 design |
3 | 5c0c1eeb | Iustin Pop | ================= |
4 | 5c0c1eeb | Iustin Pop | |
5 | 5c0c1eeb | Iustin Pop | This document describes the major changes in Ganeti 2.0 compared to |
6 | 5c0c1eeb | Iustin Pop | the 1.2 version. |
7 | 5c0c1eeb | Iustin Pop | |
8 | 5c0c1eeb | Iustin Pop | The 2.0 version will constitute a rewrite of the 'core' architecture, |
9 | 5c0c1eeb | Iustin Pop | paving the way for additional features in future 2.x versions. |
10 | 5c0c1eeb | Iustin Pop | |
11 | e0eb13de | Iustin Pop | .. contents:: :depth: 3 |
12 | 5c0c1eeb | Iustin Pop | |
13 | 5c0c1eeb | Iustin Pop | Objective |
14 | 5c0c1eeb | Iustin Pop | ========= |
15 | 5c0c1eeb | Iustin Pop | |
16 | 5c0c1eeb | Iustin Pop | Ganeti 1.2 has many scalability issues and restrictions due to its |
17 | 5c0c1eeb | Iustin Pop | roots as software for managing small and 'static' clusters. |
18 | 5c0c1eeb | Iustin Pop | |
19 | 5c0c1eeb | Iustin Pop | Version 2.0 will attempt to remedy first the scalability issues and |
20 | 5c0c1eeb | Iustin Pop | then the restrictions. |
21 | 5c0c1eeb | Iustin Pop | |
22 | 5c0c1eeb | Iustin Pop | Background |
23 | 5c0c1eeb | Iustin Pop | ========== |
24 | 5c0c1eeb | Iustin Pop | |
25 | 6c2d0b44 | Iustin Pop | While Ganeti 1.2 is usable, it severely limits the flexibility of the |
26 | 5c0c1eeb | Iustin Pop | cluster administration and imposes a very rigid model. It has the |
27 | 5c0c1eeb | Iustin Pop | following main scalability issues: |
28 | 5c0c1eeb | Iustin Pop | |
29 | 5c0c1eeb | Iustin Pop | - only one operation at a time on the cluster [#]_ |
30 | 5c0c1eeb | Iustin Pop | - poor handling of node failures in the cluster |
31 | 5c0c1eeb | Iustin Pop | - mixing hypervisors in a cluster not allowed |
32 | 5c0c1eeb | Iustin Pop | |
33 | 5c0c1eeb | Iustin Pop | It also has a number of artificial restrictions, due to historical design: |
34 | 5c0c1eeb | Iustin Pop | |
35 | 5c0c1eeb | Iustin Pop | - fixed number of disks (two) per instance |
36 | 6c2d0b44 | Iustin Pop | - fixed number of NICs |
37 | 5c0c1eeb | Iustin Pop | |
38 | 5c0c1eeb | Iustin Pop | .. [#] Replace disks will release the lock, but this is an exception |
39 | 5c0c1eeb | Iustin Pop | and not a recommended way to operate |
40 | 5c0c1eeb | Iustin Pop | |
41 | 5c0c1eeb | Iustin Pop | The 2.0 version is intended to address some of these problems, and |
42 | 6c2d0b44 | Iustin Pop | create a more flexible code base for future developments. |
43 | 6c2d0b44 | Iustin Pop | |
44 | 6c2d0b44 | Iustin Pop | Among these problems, the single-operation-at-a-time restriction is |
45 | 6c2d0b44 | Iustin Pop | the biggest issue with the current version of Ganeti. It is such a big |
46 | 6c2d0b44 | Iustin Pop | impediment in operating bigger clusters that many times one is tempted |
47 | 6c2d0b44 | Iustin Pop | to remove the lock just to do a simple operation like start instance |
48 | 6c2d0b44 | Iustin Pop | while an OS installation is running. |
49 | 5c0c1eeb | Iustin Pop | |
50 | 5c0c1eeb | Iustin Pop | Scalability problems |
51 | 5c0c1eeb | Iustin Pop | -------------------- |
52 | 5c0c1eeb | Iustin Pop | |
53 | 5c0c1eeb | Iustin Pop | Ganeti 1.2 has a single global lock, which is used for all cluster |
54 | 5c0c1eeb | Iustin Pop | operations. This has been painful at various times, for example: |
55 | 5c0c1eeb | Iustin Pop | |
56 | 5c0c1eeb | Iustin Pop | - It is impossible for two people to efficiently interact with a cluster |
57 | 5c0c1eeb | Iustin Pop | (for example for debugging) at the same time. |
58 | 5c0c1eeb | Iustin Pop | - When batch jobs are running it's impossible to do other work (for example |
59 | 5c0c1eeb | Iustin Pop | failovers/fixes) on a cluster. |
60 | 5c0c1eeb | Iustin Pop | |
61 | 5c0c1eeb | Iustin Pop | This poses scalability problems: as clusters grow in node and instance |
62 | 5c0c1eeb | Iustin Pop | size it's a lot more likely that operations which one could conceive |
63 | 5c0c1eeb | Iustin Pop | should run in parallel (for example because they happen on different |
64 | 5c0c1eeb | Iustin Pop | nodes) are actually stalling each other while waiting for the global |
65 | 5c0c1eeb | Iustin Pop | lock, without a real reason for that to happen. |
66 | 5c0c1eeb | Iustin Pop | |
67 | 5c0c1eeb | Iustin Pop | One of the main causes of this global lock (beside the higher |
68 | 5c0c1eeb | Iustin Pop | difficulty of ensuring data consistency in a more granular lock model) |
69 | 6c2d0b44 | Iustin Pop | is the fact that currently there is no long-lived process in Ganeti |
70 | 6c2d0b44 | Iustin Pop | that can coordinate multiple operations. Each command tries to acquire |
71 | 6c2d0b44 | Iustin Pop | the so called *cmd* lock and when it succeeds, it takes complete |
72 | 6c2d0b44 | Iustin Pop | ownership of the cluster configuration and state. |
73 | 5c0c1eeb | Iustin Pop | |
74 | 5c0c1eeb | Iustin Pop | Other scalability problems are due to the design of the DRBD device |
75 | 5c0c1eeb | Iustin Pop | model, which assumed at its creation a low (one to four) number of |
76 | 5c0c1eeb | Iustin Pop | instances per node, which is no longer true with today's hardware. |
77 | 5c0c1eeb | Iustin Pop | |
78 | 5c0c1eeb | Iustin Pop | Artificial restrictions |
79 | 5c0c1eeb | Iustin Pop | ----------------------- |
80 | 5c0c1eeb | Iustin Pop | |
81 | 5c0c1eeb | Iustin Pop | Ganeti 1.2 (and previous versions) have a fixed two-disk, one-NIC per |
82 | 5c0c1eeb | Iustin Pop | instance model. This is a purely artificial restriction, but it |
83 | 5c0c1eeb | Iustin Pop | touches so many areas (configuration, import/export, command line) |
84 | 5c0c1eeb | Iustin Pop | that removing it is better suited to a major release than a minor one. |
85 | 5c0c1eeb | Iustin Pop | |
86 | 6c2d0b44 | Iustin Pop | Architecture issues |
87 | 6c2d0b44 | Iustin Pop | ------------------- |
88 | 6c2d0b44 | Iustin Pop | |
89 | 6c2d0b44 | Iustin Pop | The fact that each command is a separate process that reads the |
90 | 6c2d0b44 | Iustin Pop | cluster state, executes the command, and saves the new state is also |
91 | 6c2d0b44 | Iustin Pop | an issue on big clusters where the configuration data for the cluster |
92 | 6c2d0b44 | Iustin Pop | begins to be non-trivial in size. |
93 | 6c2d0b44 | Iustin Pop | |
94 | 5c0c1eeb | Iustin Pop | Overview |
95 | 5c0c1eeb | Iustin Pop | ======== |
96 | 5c0c1eeb | Iustin Pop | |
97 | 5c0c1eeb | Iustin Pop | In order to solve the scalability problems, a rewrite of the core |
98 | 5c0c1eeb | Iustin Pop | design of Ganeti is required. While the cluster operations themselves |
99 | 5c0c1eeb | Iustin Pop | won't change (e.g. start instance will do the same things), the way |
100 | 5c0c1eeb | Iustin Pop | these operations are scheduled internally will change radically. |
101 | 5c0c1eeb | Iustin Pop | |
102 | f86e82ef | Iustin Pop | The new design will change the cluster architecture to: |
103 | f86e82ef | Iustin Pop | |
104 | f86e82ef | Iustin Pop | .. image:: arch-2.0.png |
105 | f86e82ef | Iustin Pop | |
106 | f86e82ef | Iustin Pop | This differs from the 1.2 architecture by the addition of the master |
107 | f86e82ef | Iustin Pop | daemon, which will be the only entity to talk to the node daemons. |
108 | f86e82ef | Iustin Pop | |
109 | f86e82ef | Iustin Pop | |
110 | 5c0c1eeb | Iustin Pop | Detailed design |
111 | 5c0c1eeb | Iustin Pop | =============== |
112 | 5c0c1eeb | Iustin Pop | |
113 | 5c0c1eeb | Iustin Pop | The changes for 2.0 can be split into roughly three areas: |
114 | 5c0c1eeb | Iustin Pop | |
115 | 5c0c1eeb | Iustin Pop | - core changes that affect the design of the software |
116 | 5c0c1eeb | Iustin Pop | - features (or restriction removals) which do not have a wide |
117 | 5c0c1eeb | Iustin Pop | impact on the design |
118 | 5c0c1eeb | Iustin Pop | - user-level and API-level changes which translate into differences for |
119 | 5c0c1eeb | Iustin Pop | the operation of the cluster |
120 | 5c0c1eeb | Iustin Pop | |
121 | 5c0c1eeb | Iustin Pop | Core changes |
122 | 5c0c1eeb | Iustin Pop | ------------ |
123 | 5c0c1eeb | Iustin Pop | |
124 | 5c0c1eeb | Iustin Pop | The main changes will be switching from a per-process model to a |
125 | 5c0c1eeb | Iustin Pop | daemon based model, where the individual gnt-* commands will be |
126 | 6c2d0b44 | Iustin Pop | clients that talk to this daemon (see `Master daemon`_). This will |
127 | 6c2d0b44 | Iustin Pop | allow us to get rid of the global cluster lock for most operations, |
128 | 6c2d0b44 | Iustin Pop | having instead a per-object lock (see `Granular locking`_). Also, the |
129 | 6c2d0b44 | Iustin Pop | daemon will be able to queue jobs, and this will allow the individual |
130 | 6c2d0b44 | Iustin Pop | clients to submit jobs without waiting for them to finish, and also |
131 | 6c2d0b44 | Iustin Pop | see the result of old requests (see `Job Queue`_). |
132 | 5c0c1eeb | Iustin Pop | |
133 | 5c0c1eeb | Iustin Pop | Beside these major changes, another 'core' change, one that will not be |
134 | 5c0c1eeb | Iustin Pop | as visible to the users, will be changing the model of object attribute |
135 | 6c2d0b44 | Iustin Pop | storage and separating it into name spaces (such that a Xen PVM |
136 | 5c0c1eeb | Iustin Pop | instance will not have the Xen HVM parameters). This will allow future |
137 | 6c2d0b44 | Iustin Pop | flexibility in defining additional parameters. For more details see |
138 | 6c2d0b44 | Iustin Pop | `Object parameters`_. |
139 | 5c0c1eeb | Iustin Pop | |
140 | 5c0c1eeb | Iustin Pop | The various changes brought in by the master daemon model and the |
141 | 5c0c1eeb | Iustin Pop | read-write RAPI will require changes to the cluster security; we move |
142 | 6c2d0b44 | Iustin Pop | away from Twisted and use HTTP(S) for intra- and extra-cluster |
143 | 5c0c1eeb | Iustin Pop | communications. For more details, see the security document in the |
144 | 5c0c1eeb | Iustin Pop | doc/ directory. |
145 | 5c0c1eeb | Iustin Pop | |
146 | 5c0c1eeb | Iustin Pop | Master daemon |
147 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~ |
148 | 5c0c1eeb | Iustin Pop | |
149 | 5c0c1eeb | Iustin Pop | In Ganeti 2.0, we will have the following *entities*: |
150 | 5c0c1eeb | Iustin Pop | |
151 | 5c0c1eeb | Iustin Pop | - the master daemon (on the master node) |
152 | 5c0c1eeb | Iustin Pop | - the node daemon (on all nodes) |
153 | 5c0c1eeb | Iustin Pop | - the command line tools (on the master node) |
154 | 5c0c1eeb | Iustin Pop | - the RAPI daemon (on the master node) |
155 | 5c0c1eeb | Iustin Pop | |
156 | 6c2d0b44 | Iustin Pop | The master-daemon related interaction paths are: |
157 | 5c0c1eeb | Iustin Pop | |
158 | 6c2d0b44 | Iustin Pop | - (CLI tools/RAPI daemon) and the master daemon, via the so called *LUXI* API |
159 | 5c0c1eeb | Iustin Pop | - the master daemon and the node daemons, via the node RPC |
160 | 5c0c1eeb | Iustin Pop | |
161 | 6c2d0b44 | Iustin Pop | There are also some additional interaction paths for exceptional cases: |
162 | 6c2d0b44 | Iustin Pop | |
163 | 6c2d0b44 | Iustin Pop | - CLI tools might access the nodes via SSH (for ``gnt-cluster copyfile`` |
164 | 6c2d0b44 | Iustin Pop | and ``gnt-cluster command``) |
165 | 6c2d0b44 | Iustin Pop | - master failover is a special case when a non-master node will SSH |
166 | 6c2d0b44 | Iustin Pop | and do node-RPC calls to the current master |
167 | 6c2d0b44 | Iustin Pop | |
168 | 5c0c1eeb | Iustin Pop | The protocol between the master daemon and the node daemons will be |
169 | 6c2d0b44 | Iustin Pop | changed from (Ganeti 1.2) Twisted PB (perspective broker) to HTTP(S), |
170 | 6c2d0b44 | Iustin Pop | using a simple PUT/GET of JSON-encoded messages. This is done due to |
171 | 6c2d0b44 | Iustin Pop | difficulties in working with the Twisted framework and its protocols |
172 | 6c2d0b44 | Iustin Pop | in a multithreaded environment, which we can overcome by using a |
173 | 6c2d0b44 | Iustin Pop | simpler stack (see the caveats section). |
174 | 6c2d0b44 | Iustin Pop | |
175 | 6c2d0b44 | Iustin Pop | The protocol between the CLI/RAPI and the master daemon will be a |
176 | 6c2d0b44 | Iustin Pop | custom one (called *LUXI*): on a UNIX socket on the master node, with |
177 | 6c2d0b44 | Iustin Pop | rights restricted by filesystem permissions, the CLI/RAPI will talk to |
178 | 6c2d0b44 | Iustin Pop | the master daemon using JSON-encoded messages. |
179 | 5c0c1eeb | Iustin Pop | |
180 | 5c0c1eeb | Iustin Pop | The operations supported over this internal protocol will be encoded |
181 | 5c0c1eeb | Iustin Pop | via a python library that will expose a simple API for its |
182 | 5c0c1eeb | Iustin Pop | users. Internally, the protocol will simply encode all objects in JSON |
183 | 5c0c1eeb | Iustin Pop | format and decode them on the receiver side. |
184 | 5c0c1eeb | Iustin Pop | |
185 | 6c2d0b44 | Iustin Pop | For more details about the RAPI daemon see `Remote API changes`_, and |
186 | 6c2d0b44 | Iustin Pop | for the node daemon see `Node daemon changes`_. |
187 | 6c2d0b44 | Iustin Pop | |
188 | 5c0c1eeb | Iustin Pop | The LUXI protocol |
189 | 5c0c1eeb | Iustin Pop | +++++++++++++++++ |
190 | 5c0c1eeb | Iustin Pop | |
191 | 6c2d0b44 | Iustin Pop | As described above, the protocol for making requests or queries to the |
192 | 6c2d0b44 | Iustin Pop | master daemon will be a UNIX-socket based simple RPC of JSON-encoded |
193 | 6c2d0b44 | Iustin Pop | messages. |
194 | 6c2d0b44 | Iustin Pop | |
195 | 6c2d0b44 | Iustin Pop | The choice of a UNIX socket was made in order to remove the need for |
196 | 6c2d0b44 | Iustin Pop | authentication and authorisation inside Ganeti; for 2.0, the |
197 | 6c2d0b44 | Iustin Pop | permissions on the Unix socket itself will determine the access |
198 | 6c2d0b44 | Iustin Pop | rights. |
199 | 6c2d0b44 | Iustin Pop | |
200 | 6c2d0b44 | Iustin Pop | We will have two main classes of operations over this API: |
201 | 5c0c1eeb | Iustin Pop | |
202 | 5c0c1eeb | Iustin Pop | - cluster query functions |
203 | 5c0c1eeb | Iustin Pop | - job related functions |
204 | 5c0c1eeb | Iustin Pop | |
205 | 5c0c1eeb | Iustin Pop | The cluster query functions are usually short-duration, and are the |
206 | 6c2d0b44 | Iustin Pop | equivalent of the ``OP_QUERY_*`` opcodes in Ganeti 1.2 (and they are |
207 | 5c0c1eeb | Iustin Pop | internally implemented still with these opcodes). The clients are |
208 | 5c0c1eeb | Iustin Pop | guaranteed to receive the response in a reasonable time via a timeout. |
209 | 5c0c1eeb | Iustin Pop | |
210 | 5c0c1eeb | Iustin Pop | The job-related functions will be: |
211 | 5c0c1eeb | Iustin Pop | |
212 | 5c0c1eeb | Iustin Pop | - submit job |
213 | 5c0c1eeb | Iustin Pop | - query job (which could also be categorized in the query-functions) |
214 | 5c0c1eeb | Iustin Pop | - archive job (see the job queue design doc) |
215 | 5c0c1eeb | Iustin Pop | - wait for job change, which allows a client to wait without polling |
216 | 5c0c1eeb | Iustin Pop | |
217 | 6c2d0b44 | Iustin Pop | For more details of the actual operation list, see the `Job Queue`_. |
218 | 5c0c1eeb | Iustin Pop | |
219 | 6c2d0b44 | Iustin Pop | Both requests and responses will consist of a JSON-encoded message |
220 | 6c2d0b44 | Iustin Pop | followed by the ``ETX`` character (ASCII decimal 3), which is not a |
221 | 6c2d0b44 | Iustin Pop | valid character in JSON messages and thus can serve as a message |
222 | 6c2d0b44 | Iustin Pop | delimiter. The contents of the messages will be a dictionary with two |
223 | 6c2d0b44 | Iustin Pop | fields: |
224 | 6c2d0b44 | Iustin Pop | |
225 | 6c2d0b44 | Iustin Pop | :method: |
226 | 6c2d0b44 | Iustin Pop | the name of the method called |
227 | 6c2d0b44 | Iustin Pop | :args: |
228 | 6c2d0b44 | Iustin Pop | the arguments to the method, as a list (no keyword arguments allowed) |
229 | 6c2d0b44 | Iustin Pop | |
230 | 6c2d0b44 | Iustin Pop | Responses will follow the same format, with the two fields being: |
231 | 6c2d0b44 | Iustin Pop | |
232 | 6c2d0b44 | Iustin Pop | :success: |
233 | 6c2d0b44 | Iustin Pop | a boolean denoting the success of the operation |
234 | 6c2d0b44 | Iustin Pop | :result: |
235 | 6c2d0b44 | Iustin Pop | the actual result, or error message in case of failure |
236 | 6c2d0b44 | Iustin Pop | |
237 | 6c2d0b44 | Iustin Pop | There are two special values for the result field: |
238 | 6c2d0b44 | Iustin Pop | |
239 | 6c2d0b44 | Iustin Pop | - in the case that the operation failed, and this field is a list of |
240 | 6c2d0b44 | Iustin Pop | length two, the client library will try to interpret it as an exception, |
241 | 6c2d0b44 | Iustin Pop | the first element being the exception type and the second one the |
242 | 6c2d0b44 | Iustin Pop | actual exception arguments; this will allow a simple method of passing |
243 | 6c2d0b44 | Iustin Pop | Ganeti-related exceptions across the interface |
244 | 6c2d0b44 | Iustin Pop | - for the *WaitForChange* call (that waits on the server for a job to |
245 | 6c2d0b44 | Iustin Pop | change status), if the result is equal to ``nochange`` instead of the |
246 | 6c2d0b44 | Iustin Pop | usual result for this call (a list of changes), then the library will |
247 | 6c2d0b44 | Iustin Pop | internally retry the call; this is done in order to differentiate |
248 | 6c2d0b44 | Iustin Pop | internally between master daemon hung and job simply not changed |
249 | 6c2d0b44 | Iustin Pop | |
250 | 6c2d0b44 | Iustin Pop | Users of the API that don't use the provided python library should |
251 | 6c2d0b44 | Iustin Pop | take care of the above two cases. |
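
The framing and the two special cases described above can be sketched in a few lines of Python. This is only an illustration of the wire format, not the actual client library: the names ``encode_request``, ``split_messages``, ``decode_response``, ``wait_for_change`` and ``RemoteError`` are assumptions made for this example.

```python
import json

ETX = "\x03"  # ASCII 3, used as the message delimiter


class RemoteError(Exception):
    """Hypothetical client-side wrapper for a failure reported by the server."""


def encode_request(method, args):
    """Serialise a request as a JSON dictionary followed by ETX."""
    # json.dumps escapes control characters, so ETX never appears in the body.
    return json.dumps({"method": method, "args": list(args)}) + ETX


def split_messages(buffer):
    """Split received text into complete messages and the unfinished tail."""
    *complete, remainder = buffer.split(ETX)
    return complete, remainder


def decode_response(message):
    """Return the result of a successful call, or raise RemoteError."""
    data = json.loads(message)
    result = data["result"]
    if data["success"]:
        return result
    # A failed call whose result is a two-element list is interpreted as
    # (exception type, exception arguments).
    if isinstance(result, list) and len(result) == 2:
        raise RemoteError(result[0], result[1])
    raise RemoteError("error", result)


def wait_for_change(call):
    """Repeat a WaitForChange-style call while the server answers 'nochange'."""
    while True:
        result = call()
        if result != "nochange":
            return result
```

A client that does not use the provided library would need equivalents of ``decode_response`` and ``wait_for_change`` to cover the two special cases.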
252 | 6c2d0b44 | Iustin Pop | |
253 | 6c2d0b44 | Iustin Pop | |
254 | 6c2d0b44 | Iustin Pop | Master daemon implementation |
255 | 6c2d0b44 | Iustin Pop | ++++++++++++++++++++++++++++ |
256 | 5c0c1eeb | Iustin Pop | |
257 | 5c0c1eeb | Iustin Pop | The daemon will be based around a main I/O thread that will wait for |
258 | 5c0c1eeb | Iustin Pop | new requests from the clients, and that does the setup/shutdown of the |
259 | 5c0c1eeb | Iustin Pop | other thread (pools). |
260 | 5c0c1eeb | Iustin Pop | |
261 | 5c0c1eeb | Iustin Pop | There will be two other classes of threads in the daemon: |
262 | 5c0c1eeb | Iustin Pop | |
263 | 5c0c1eeb | Iustin Pop | - job processing threads, part of a thread pool, and which are |
264 | 5c0c1eeb | Iustin Pop | long-lived, started at daemon startup and terminated only at shutdown |
265 | 5c0c1eeb | Iustin Pop | time |
266 | 5c0c1eeb | Iustin Pop | - client I/O threads, which are the ones that talk the local protocol |
267 | 6c2d0b44 | Iustin Pop | (LUXI) to the clients, and are short-lived |
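
As a rough illustration of the long-lived job processing threads, the following sketch starts a fixed set of workers at construction time and stops them only at shutdown. The class name ``JobPool`` and its methods are assumptions for this example, and a real pool would also have to handle jobs that raise exceptions.

```python
import queue
import threading


class JobPool:
    """Worker threads that live from startup until shutdown (illustrative)."""

    _STOP = object()  # sentinel telling a worker to exit

    def __init__(self, num_workers):
        self._jobs = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run) for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def _run(self):
        # Each worker repeatedly takes a job from the queue and executes it.
        while True:
            job = self._jobs.get()
            if job is self._STOP:
                break
            job()

    def submit(self, job):
        """Queue a callable for execution by one of the workers."""
        self._jobs.put(job)

    def shutdown(self):
        """Let queued jobs finish, then stop and join every worker."""
        for _ in self._workers:
            self._jobs.put(self._STOP)
        for worker in self._workers:
            worker.join()
```

In this model the main I/O thread would create the pool at daemon startup and call ``shutdown`` when the daemon exits, while the short-lived client threads only call ``submit``.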
268 | 5c0c1eeb | Iustin Pop | |
269 | 5c0c1eeb | Iustin Pop | Master startup/failover |
270 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
271 | 5c0c1eeb | Iustin Pop | |
272 | 5c0c1eeb | Iustin Pop | In Ganeti 1.x there is no protection against failing over the master |
273 | 5c0c1eeb | Iustin Pop | to a node with stale configuration. In effect, the responsibility of |
274 | 5c0c1eeb | Iustin Pop | correct failovers falls on the admin. This is true both for the new |
275 | 5c0c1eeb | Iustin Pop | master and for when an old, offline master starts up. |
276 | 5c0c1eeb | Iustin Pop | |
277 | 5c0c1eeb | Iustin Pop | Since in 2.x we are extending the cluster state to cover the job queue |
278 | 5c0c1eeb | Iustin Pop | and have a daemon that will execute by itself the job queue, we want |
279 | 5c0c1eeb | Iustin Pop | to have more resilience for the master role. |
280 | 5c0c1eeb | Iustin Pop | |
281 | 5c0c1eeb | Iustin Pop | The following algorithm will happen whenever a node is ready to |
282 | 5c0c1eeb | Iustin Pop | transition to the master role, either at startup time or at node |
283 | 5c0c1eeb | Iustin Pop | failover: |
284 | 5c0c1eeb | Iustin Pop | |
285 | 5c0c1eeb | Iustin Pop | #. read the configuration file and parse the node list |
286 | 5c0c1eeb | Iustin Pop | contained within |
287 | 5c0c1eeb | Iustin Pop | |
288 | 5c0c1eeb | Iustin Pop | #. query all the nodes and make sure we obtain an agreement via |
289 | 5c0c1eeb | Iustin Pop | a quorum of at least half plus one nodes for the following: |
290 | 5c0c1eeb | Iustin Pop | |
291 | 5c0c1eeb | Iustin Pop | - we have the latest configuration and job list (as |
292 | 5c0c1eeb | Iustin Pop | determined by the serial number on the configuration and |
293 | 5c0c1eeb | Iustin Pop | highest job ID on the job queue) |
294 | 5c0c1eeb | Iustin Pop | |
295 | 5c0c1eeb | Iustin Pop | - there is not even a single node having a newer |
296 | 5c0c1eeb | Iustin Pop | configuration file |
297 | 5c0c1eeb | Iustin Pop | |
298 | 5c0c1eeb | Iustin Pop | - if we are not failing over (but just starting), the |
299 | 5c0c1eeb | Iustin Pop | quorum agrees that we are the designated master |
300 | 5c0c1eeb | Iustin Pop | |
301 | 6c2d0b44 | Iustin Pop | - if any of the above is false, we prevent the current operation |
302 | 6c2d0b44 | Iustin Pop | (i.e. we don't become the master) |
303 | 6c2d0b44 | Iustin Pop | |
304 | 5c0c1eeb | Iustin Pop | #. at this point, the node transitions to the master role |
305 | 5c0c1eeb | Iustin Pop | |
306 | 5c0c1eeb | Iustin Pop | #. for all the in-progress jobs, mark them as failed, with |
307 | 5c0c1eeb | Iustin Pop | reason unknown or something similar (master failed, etc.) |
308 | 5c0c1eeb | Iustin Pop | |
309 | 6c2d0b44 | Iustin Pop | Since due to exceptional conditions we could have a situation in which |
310 | 6c2d0b44 | Iustin Pop | no node can become the master due to inconsistent data, we will have |
311 | 6c2d0b44 | Iustin Pop | an override switch for the master daemon startup that will assume the |
312 | 6c2d0b44 | Iustin Pop | current node has the right data and will replicate all the |
313 | 6c2d0b44 | Iustin Pop | configuration files to the other nodes. |
314 | 6c2d0b44 | Iustin Pop | |
315 | 6c2d0b44 | Iustin Pop | **Note**: the above algorithm is by no means an election algorithm; it |
316 | 6c2d0b44 | Iustin Pop | is a *confirmation* of the master role currently held by a node. |
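
The confirmation step can be sketched as a pure function over the answers collected from the nodes. This is a minimal sketch under stated assumptions: the ``Vote`` tuple, the function name ``confirm_master`` and the convention that ``votes`` contains one entry per node that answered (including the candidate itself) are all invented for this example.

```python
from collections import namedtuple

# What each node reports: its configuration serial number, the highest job ID
# in its queue, and the node it believes to be the master.
Vote = namedtuple("Vote", ["serial", "top_job", "master"])


def confirm_master(my_name, my_serial, my_top_job, votes, total_nodes,
                   failover=False):
    """Return True if this node may take (or keep) the master role."""
    quorum = total_nodes // 2 + 1  # "half plus one" of the nodes

    # A quorum must agree that our configuration and job list are the latest.
    agree_latest = sum(
        1 for vote in votes.values()
        if vote.serial <= my_serial and vote.top_job <= my_top_job
    )
    if agree_latest < quorum:
        return False

    # Not even a single node may have a newer configuration.
    if any(vote.serial > my_serial for vote in votes.values()):
        return False

    # When just starting (not failing over), a quorum must name us as master.
    if not failover:
        agree_master = sum(
            1 for vote in votes.values() if vote.master == my_name
        )
        if agree_master < quorum:
            return False

    return True
```

With five nodes the quorum is three, so three consistent answers are enough to confirm the role, but one answer carrying a newer configuration serial blocks it.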
317 | 5c0c1eeb | Iustin Pop | |
318 | 5c0c1eeb | Iustin Pop | Logging |
319 | 5c0c1eeb | Iustin Pop | +++++++ |
320 | 5c0c1eeb | Iustin Pop | |
321 | 6c2d0b44 | Iustin Pop | The logging system will be switched completely to the standard python |
322 | 6c2d0b44 | Iustin Pop | logging module; currently it's logging-based, but exposes a different |
323 | 6c2d0b44 | Iustin Pop | API, which is just overhead. As such, the code will be switched over |
324 | 6c2d0b44 | Iustin Pop | to standard logging calls, and only the setup will be custom. |
325 | 5c0c1eeb | Iustin Pop | |
326 | 5c0c1eeb | Iustin Pop | With this change, we will remove the separate debug/info/error logs, |
327 | 5c0c1eeb | Iustin Pop | and instead always have one logfile per daemon: |
328 | 5c0c1eeb | Iustin Pop | |
329 | 5c0c1eeb | Iustin Pop | - master-daemon.log for the master daemon |
330 | 5c0c1eeb | Iustin Pop | - node-daemon.log for the node daemon (this is the same as in 1.2) |
331 | 5c0c1eeb | Iustin Pop | - rapi-daemon.log for the RAPI daemon logs |
332 | 5c0c1eeb | Iustin Pop | - rapi-access.log, an additional log file for the RAPI that will be |
333 | 6c2d0b44 | Iustin Pop | in the standard HTTP log format for possible parsing by other tools |
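
A custom setup on top of the standard ``logging`` module could look roughly like the sketch below. The function name ``setup_logging``, its parameters and the message format are assumptions for illustration; only the idea of one log file per daemon, written through standard logging calls, comes from the text above.

```python
import logging


def setup_logging(logfile, debug=False):
    """Send all messages of the process to a single log file."""
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    root = logging.getLogger()
    # One file receives every level; the threshold depends on the debug flag.
    root.setLevel(logging.DEBUG if debug else logging.INFO)
    root.addHandler(handler)
    return handler
```

After a call such as ``setup_logging("master-daemon.log")``, the rest of the code only needs plain ``logging.info(...)`` or ``logging.error(...)`` calls.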
334 | 6c2d0b44 | Iustin Pop | |
335 | 6c2d0b44 | Iustin Pop | Since the `watcher`_ will only submit jobs to the master for startup |
336 | 6c2d0b44 | Iustin Pop | of the instances, its log file will contain less information than |
337 | 6c2d0b44 | Iustin Pop | before, mainly that it will start the instance, but not the results. |
338 | 6c2d0b44 | Iustin Pop | |
339 | 6c2d0b44 | Iustin Pop | Node daemon changes |
340 | 6c2d0b44 | Iustin Pop | +++++++++++++++++++ |
341 | 6c2d0b44 | Iustin Pop | |
342 | 6c2d0b44 | Iustin Pop | The only change to the node daemon is that, since we need better |
343 | 6c2d0b44 | Iustin Pop | concurrency, we don't process the inter-node RPC calls in the node |
344 | 6c2d0b44 | Iustin Pop | daemon itself, but we fork and process each request in a separate |
345 | 6c2d0b44 | Iustin Pop | child. |
346 | 5c0c1eeb | Iustin Pop | |
347 | 6c2d0b44 | Iustin Pop | Since we don't have many calls, and we only fork (not exec), the |
348 | 6c2d0b44 | Iustin Pop | overhead should be minimal. |
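
The fork-per-request idea can be illustrated with ``os.fork`` and a pipe. This is a simplified sketch, not the node daemon's code: the function name ``handle_in_child`` is hypothetical, and the parent here waits for the child only to keep the example self-contained, whereas a real daemon would go back to accepting requests. It requires a UNIX-like system.

```python
import json
import os


def handle_in_child(handler, request):
    """Run handler(request) in a forked child and return its JSON result."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute the answer, send it through the pipe and exit
        # immediately, without running any of the parent's cleanup code.
        os.close(read_fd)
        try:
            payload = json.dumps(handler(request)).encode("utf-8")
            with os.fdopen(write_fd, "wb") as out:
                out.write(payload)
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent: read the child's answer, then collect its exit status.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as inp:
        data = inp.read()
    _, status = os.waitpid(pid, 0)
    if status != 0:
        raise RuntimeError("request handler failed in child process")
    return json.loads(data.decode("utf-8"))
```

Because the child is created with fork only (no exec), it inherits the already loaded code and state of the daemon, which is why the overhead is expected to be small.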
349 | 5c0c1eeb | Iustin Pop | |
350 | 5c0c1eeb | Iustin Pop | Caveats |
351 | 5c0c1eeb | Iustin Pop | +++++++ |
352 | 5c0c1eeb | Iustin Pop | |
353 | 5c0c1eeb | Iustin Pop | A discussed alternative is to keep the current individual processes |
354 | 5c0c1eeb | Iustin Pop | touching the cluster configuration model. The reasons we have not |
355 | 5c0c1eeb | Iustin Pop | chosen this approach are: |
356 | 5c0c1eeb | Iustin Pop | |
357 | 5c0c1eeb | Iustin Pop | - the time spent reading and unserializing the cluster state |
358 | 5c0c1eeb | Iustin Pop | today is not small enough that we can ignore it; the addition of |
359 | 5c0c1eeb | Iustin Pop | the job queue will make the startup cost even higher. While this |
360 | 5c0c1eeb | Iustin Pop | runtime cost is low, it can be on the order of a few seconds on |
361 | 5c0c1eeb | Iustin Pop | bigger clusters, which for very quick commands is comparable to |
362 | 5c0c1eeb | Iustin Pop | the actual duration of the computation itself |
363 | 5c0c1eeb | Iustin Pop | |
364 | 5c0c1eeb | Iustin Pop | - individual commands would make it harder to implement a |
365 | 5c0c1eeb | Iustin Pop | fire-and-forget job request, along the lines of "start this |
366 | 5c0c1eeb | Iustin Pop | instance but do not wait for it to finish"; it would require a |
367 | 5c0c1eeb | Iustin Pop | model of backgrounding the operation and other things that are |
368 | 5c0c1eeb | Iustin Pop | much better served by a daemon-based model |
369 | 5c0c1eeb | Iustin Pop | |
370 | 5c0c1eeb | Iustin Pop | Another area of discussion is moving away from Twisted in this new |
371 | 6c2d0b44 | Iustin Pop | implementation. While Twisted has its advantages, there are also many |
372 | 6c2d0b44 | Iustin Pop | disadvantages to using it: |
373 | 5c0c1eeb | Iustin Pop | |
374 | 5c0c1eeb | Iustin Pop | - first and foremost, it's not a library, but a framework; thus, if |
375 | 6c2d0b44 | Iustin Pop | you use Twisted, all the code needs to be 'twisted-ized' and written |
376 | 6c2d0b44 | Iustin Pop | in an asynchronous manner, using deferreds; while this method works, |
377 | 6c2d0b44 | Iustin Pop | it's not a common way to code and it requires that the entire process |
378 | 6c2d0b44 | Iustin Pop | workflow is based around a single *reactor* (Twisted name for a main |
379 | 6c2d0b44 | Iustin Pop | loop) |
380 | 6c2d0b44 | Iustin Pop | - the more advanced granular locking that we want to implement would |
381 | 6c2d0b44 | Iustin Pop | require, if written in the async-manner, deep integration with the |
382 | 6c2d0b44 | Iustin Pop | Twisted stack, to such an extent that business-logic is inseparable |
383 | 6c2d0b44 | Iustin Pop | from the protocol coding; we felt that this is an unreasonable request, |
384 | 6c2d0b44 | Iustin Pop | and that a good protocol library should allow complete separation of |
385 | 6c2d0b44 | Iustin Pop | low-level protocol calls and business logic; by comparison, the threaded |
386 | 6c2d0b44 | Iustin Pop | approach combined with the HTTPS protocol required (for the first iteration) |
387 | 6c2d0b44 | Iustin Pop | absolutely no changes from the 1.2 code, and later changes for optimizing |
388 | 6c2d0b44 | Iustin Pop | the inter-node RPC calls required just syntactic changes (e.g. |
389 | 6c2d0b44 | Iustin Pop | ``rpc.call_...`` to ``self.rpc.call_...``) |
390 | 6c2d0b44 | Iustin Pop | |
391 | 6c2d0b44 | Iustin Pop | Another issue is with the Twisted API stability - during the Ganeti |
392 | 6c2d0b44 | Iustin Pop | 1.x lifetime, we had to implement workarounds many times for changes |
393 | 6c2d0b44 | Iustin Pop | in the Twisted version, so that for example 1.2 is able to use both |
394 | 6c2d0b44 | Iustin Pop | Twisted 2.x and 8.x. |
395 | 6c2d0b44 | Iustin Pop | |
396 | 6c2d0b44 | Iustin Pop | In the end, since we already had an HTTP server library for the RAPI, |
397 | 6c2d0b44 | Iustin Pop | we just reused that for inter-node communication. |
398 | 5c0c1eeb | Iustin Pop | |
399 | 5c0c1eeb | Iustin Pop | |
400 | 5c0c1eeb | Iustin Pop | Granular locking |
401 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~ |
402 | 5c0c1eeb | Iustin Pop | |
403 | 5c0c1eeb | Iustin Pop | We want to make sure that multiple operations can run in parallel on a Ganeti |
404 | 5c0c1eeb | Iustin Pop | Cluster. In order for this to happen we need to make sure concurrently run |
405 | 5c0c1eeb | Iustin Pop | operations don't step on each other's toes and break the cluster. |
406 | 5c0c1eeb | Iustin Pop | |
407 | 5c0c1eeb | Iustin Pop | This design addresses how we are going to deal with locking so that: |
408 | 5c0c1eeb | Iustin Pop | |
409 | 6c2d0b44 | Iustin Pop | - we preserve data coherency |
410 | 6c2d0b44 | Iustin Pop | - we prevent deadlocks |
411 | 6c2d0b44 | Iustin Pop | - we prevent job starvation |
412 | 5c0c1eeb | Iustin Pop | |
413 | 5c0c1eeb | Iustin Pop | Reaching the maximum possible parallelism is a Non-Goal. We have identified a |
414 | 5c0c1eeb | Iustin Pop | set of operations that are currently bottlenecks and need to be parallelised |
415 | 5c0c1eeb | Iustin Pop | and have worked on those. In the future it will be possible to address other |
416 | 5c0c1eeb | Iustin Pop | needs, thus making the cluster more and more parallel one step at a time. |
417 | 5c0c1eeb | Iustin Pop | |
418 | 6c2d0b44 | Iustin Pop | This section only talks about parallelising Ganeti level operations, aka |
419 | 6c2d0b44 | Iustin Pop | Logical Units, and the locking needed for that. Any other synchronization lock |
420 | 5c0c1eeb | Iustin Pop | needed internally by the code is outside its scope. |
421 | 5c0c1eeb | Iustin Pop | |
422 | 6c2d0b44 | Iustin Pop | Library details |
423 | 6c2d0b44 | Iustin Pop | +++++++++++++++ |
424 | 5c0c1eeb | Iustin Pop | |
425 | 5c0c1eeb | Iustin Pop | The proposed library has these features: |
426 | 5c0c1eeb | Iustin Pop | |
427 | 6c2d0b44 | Iustin Pop | - internally managing all the locks, making the implementation transparent |
428 | 5c0c1eeb | Iustin Pop | to their users |
429 | 6c2d0b44 | Iustin Pop | - automatically grabbing multiple locks in the right order (avoid deadlock) |
430 | 6c2d0b44 | Iustin Pop | - ability to transparently handle conversion to more granularity |
431 | 6c2d0b44 | Iustin Pop | - support asynchronous operation (future goal) |
432 | 6c2d0b44 | Iustin Pop | |
433 | 6c2d0b44 | Iustin Pop | Locking will be valid only on the master node and will not be a |
434 | 6c2d0b44 | Iustin Pop | distributed operation. Therefore, in case of master failure, the |
435 | 6c2d0b44 | Iustin Pop | operations currently running will be aborted and the locks will be |
436 | 6c2d0b44 | Iustin Pop | lost; it is left to the administrator to clean up (if needed) the |
437 | 6c2d0b44 | Iustin Pop | operation result (e.g. make sure an instance is either installed |
438 | 6c2d0b44 | Iustin Pop | correctly or removed). |
439 | 6c2d0b44 | Iustin Pop | |
440 | 6c2d0b44 | Iustin Pop | A corollary of this is that a master-failover operation with both |
441 | 6c2d0b44 | Iustin Pop | masters alive needs to happen while no operations are running, and |
442 | 6c2d0b44 | Iustin Pop | therefore no locks are held. |
443 | 6c2d0b44 | Iustin Pop | |
444 | 6c2d0b44 | Iustin Pop | All the locks will be represented by objects (like |
445 | 6c2d0b44 | Iustin Pop | ``locking.SharedLock``), and the individual locks for each object |
446 | 6c2d0b44 | Iustin Pop | will be created at initialisation time, from the config file. |
447 | 6c2d0b44 | Iustin Pop | |
448 | 6c2d0b44 | Iustin Pop | The API will have a way to grab one or more locks at the same time. |
449 | 6c2d0b44 | Iustin Pop | Any attempt to grab a lock while already holding one in the wrong order will be |
450 | 6c2d0b44 | Iustin Pop | checked for, and fail. |
451 | 6c2d0b44 | Iustin Pop | |
452 | 5c0c1eeb | Iustin Pop | |
453 | 5c0c1eeb | Iustin Pop | The Locks |
454 | 5c0c1eeb | Iustin Pop | +++++++++ |
455 | 5c0c1eeb | Iustin Pop | |
456 | 5c0c1eeb | Iustin Pop | At the first stage we have decided to provide the following locks: |
457 | 5c0c1eeb | Iustin Pop | |
458 | 5c0c1eeb | Iustin Pop | - One "config file" lock |
459 | 5c0c1eeb | Iustin Pop | - One lock per node in the cluster |
460 | 5c0c1eeb | Iustin Pop | - One lock per instance in the cluster |
461 | 5c0c1eeb | Iustin Pop | |
462 | 5c0c1eeb | Iustin Pop | All the instance locks will need to be taken before the node locks, and the |
463 | 5c0c1eeb | Iustin Pop | node locks before the config lock. Locks will need to be acquired at the same |
464 | 5c0c1eeb | Iustin Pop | time for multiple instances and nodes, and internal ordering will be dealt |
465 | 5c0c1eeb | Iustin Pop | with inside the locking library, which, for simplicity, will just use alphabetical |
466 | 5c0c1eeb | Iustin Pop | order. |
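The ordering rule above (instances, then nodes, then the config lock, with names sorted alphabetically inside one level) can be sketched as follows. This is only an illustration: the level constants and both helper functions are hypothetical names, not part of the actual locking library, and the check is assumed to be strict even within a single level.

```python
# Hypothetical sketch: the level numbering and function names are
# assumptions made for illustration, not the real Ganeti locking API.
LEVEL_INSTANCE, LEVEL_NODE, LEVEL_CONFIG = 0, 1, 2


def acquisition_order(requests):
    """Sort (level, name) pairs into the order they must be acquired.

    Levels go instance -> node -> config; names within one level are
    sorted alphabetically. Duplicates are dropped.
    """
    return sorted(set(requests))


def check_order(held, wanted):
    """Raise if acquiring `wanted` while holding `held` breaks the order."""
    if held and wanted and min(wanted) <= max(held):
        raise RuntimeError("lock ordering violation: %r requested after %r"
                           % (min(wanted), max(held)))
```

Sorting tuples gives both rules at once: the level is compared first and the name second.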
467 | 5c0c1eeb | Iustin Pop | |
468 | 6c2d0b44 | Iustin Pop | Each lock has the following three possible statuses: |
469 | 6c2d0b44 | Iustin Pop | |
470 | 6c2d0b44 | Iustin Pop | - unlocked (anyone can grab the lock) |
471 | 6c2d0b44 | Iustin Pop | - shared (anyone can grab/have the lock but only in shared mode) |
472 | 6c2d0b44 | Iustin Pop | - exclusive (no one else can grab/have the lock) |
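As a rough illustration of the three statuses, the sketch below builds a toy shared/exclusive lock on top of ``threading.Condition``. The class name and its methods are assumptions made for this example; it uses modern Python syntax, does not implement any fairness policy (exclusive waiters can starve), and is not the library's real implementation.

```python
import threading


class SharedLockSketch:
    """Toy lock with unlocked / shared / exclusive states (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0        # number of holders in shared mode
        self._exclusive = False  # True while one holder owns it exclusively

    def status(self):
        with self._cond:
            if self._exclusive:
                return "exclusive"
            return "shared" if self._sharers else "unlocked"

    def acquire(self, shared=False):
        with self._cond:
            if shared:
                # Sharers only have to wait for an exclusive holder.
                while self._exclusive:
                    self._cond.wait()
                self._sharers += 1
            else:
                # An exclusive holder needs the lock completely free.
                while self._exclusive or self._sharers:
                    self._cond.wait()
                self._exclusive = True

    def release(self):
        with self._cond:
            if self._exclusive:
                self._exclusive = False
            elif self._sharers:
                self._sharers -= 1
            else:
                raise RuntimeError("release of an unlocked lock")
            # Wake every waiter so each can re-check its condition.
            self._cond.notify_all()
```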
473 | 6c2d0b44 | Iustin Pop | |
474 | 5c0c1eeb | Iustin Pop | Handling conversion to more granularity |
475 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++++++++++ |
476 | 5c0c1eeb | Iustin Pop | |
477 | 5c0c1eeb | Iustin Pop | In order to convert to a more granular approach transparently, each time we |
478 | 5c0c1eeb | Iustin Pop | split a lock into several we'll create a "metalock", which will depend on those |
479 | 6c2d0b44 | Iustin Pop | sub-locks and live for the time necessary for all the code to convert (or |
480 | 5c0c1eeb | Iustin Pop | forever, in some conditions). When a metalock exists all converted code must |
481 | 5c0c1eeb | Iustin Pop | acquire it in shared mode, so it can run concurrently, but still be exclusive |
482 | 5c0c1eeb | Iustin Pop | with old code, which acquires it exclusively. |
483 | 5c0c1eeb | Iustin Pop | |
484 | 5c0c1eeb | Iustin Pop | In the beginning the only such lock will be what replaces the current "command" |
485 | 5c0c1eeb | Iustin Pop | lock, and will acquire all the locks in the system, before proceeding. This |
486 | 5c0c1eeb | Iustin Pop | lock will be called the "Big Ganeti Lock" because holding that one will avoid |
487 | 6c2d0b44 | Iustin Pop | any other concurrent Ganeti operations. |
488 | 5c0c1eeb | Iustin Pop | |
489 | 5c0c1eeb | Iustin Pop | We might also want to devise more metalocks (e.g. all nodes, all nodes+config) |
490 | 5c0c1eeb | Iustin Pop | in order to make it easier for some parts of the code to acquire what they need |
491 | 5c0c1eeb | Iustin Pop | without specifying it explicitly. |
492 | 5c0c1eeb | Iustin Pop | |
493 | 5c0c1eeb | Iustin Pop | In the future things like the node locks could become metalocks, should we |
494 | 5c0c1eeb | Iustin Pop | decide to split them into an even more fine grained approach, but this will |
495 | 5c0c1eeb | Iustin Pop | probably be only after the first 2.0 version has been released. |
496 | 5c0c1eeb | Iustin Pop | |
497 | 5c0c1eeb | Iustin Pop | Adding/Removing locks |
498 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
499 | 5c0c1eeb | Iustin Pop | |
500 | 5c0c1eeb | Iustin Pop | When a new instance or a new node is created an associated lock must be added |
501 | 5c0c1eeb | Iustin Pop | to the list. The relevant code will need to inform the locking library of such |
502 | 5c0c1eeb | Iustin Pop | a change. |
503 | 5c0c1eeb | Iustin Pop | |
504 | 5c0c1eeb | Iustin Pop | This needs to be compatible with every other lock in the system, especially |
505 | 5c0c1eeb | Iustin Pop | metalocks that guarantee to grab sets of resources without specifying them |
506 | 5c0c1eeb | Iustin Pop | explicitly. The implementation of this will be handled in the locking library |
507 | 5c0c1eeb | Iustin Pop | itself. |
508 | 5c0c1eeb | Iustin Pop | |
509 | 6c2d0b44 | Iustin Pop | When instances or nodes disappear from the cluster the relevant locks |
510 | 6c2d0b44 | Iustin Pop | must be removed. This is easier than adding new elements, as the code |
511 | 6c2d0b44 | Iustin Pop | which removes them must own them exclusively already, and thus deals |
512 | 6c2d0b44 | Iustin Pop | with metalocks exactly as normal code acquiring those locks. Any |
513 | 6c2d0b44 | Iustin Pop | operation queuing on a removed lock will fail after its removal. |
514 | 5c0c1eeb | Iustin Pop | |
515 | 5c0c1eeb | Iustin Pop | Asynchronous operations |
516 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
517 | 5c0c1eeb | Iustin Pop | |
518 | 5c0c1eeb | Iustin Pop | For the first version the locking library will only export synchronous |
519 | 5c0c1eeb | Iustin Pop | operations, which will block until the needed locks are held, and only fail if |
520 | 5c0c1eeb | Iustin Pop | the request is impossible or somehow erroneous. |
521 | 5c0c1eeb | Iustin Pop | |
522 | 5c0c1eeb | Iustin Pop | In the future we may want to implement different types of asynchronous |
523 | 5c0c1eeb | Iustin Pop | operations such as: |
524 | 5c0c1eeb | Iustin Pop | |
525 | 6c2d0b44 | Iustin Pop | - try to acquire this lock set and fail if not possible |
526 | 6c2d0b44 | Iustin Pop | - try to acquire one of these lock sets and return the first one you were |
527 | 5c0c1eeb | Iustin Pop | able to get (or after a timeout) (select/poll like) |
528 | 5c0c1eeb | Iustin Pop | |
529 | 5c0c1eeb | Iustin Pop | These operations can be used to prioritize operations based on available locks, |
530 | 5c0c1eeb | Iustin Pop | rather than making them just blindly queue for acquiring them. The inherent |
531 | 5c0c1eeb | Iustin Pop | risk, though, is that any code using the first operation, or setting a timeout |
532 | 5c0c1eeb | Iustin Pop | for the second one, is susceptible to starvation and thus may never be able to |
533 | 5c0c1eeb | Iustin Pop | get the required locks and complete certain tasks. Considering this, |
534 | 5c0c1eeb | Iustin Pop | providing/using these operations should not be among our first priorities. |
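The first asynchronous operation ("try to acquire this lock set and fail if not possible") could look roughly like the sketch below. It uses plain ``threading.Lock`` objects from the Python standard library as stand-ins for the library's own locks; the function name is hypothetical.

```python
import threading


def try_acquire_all(locks):
    """Acquire every lock in `locks` without blocking, or none of them.

    Returns True if all locks are now held by the caller, False otherwise.
    On failure, any lock taken so far is released again.
    """
    acquired = []
    for lock in locks:
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            # Roll back so the caller never ends up with a partial set.
            for held in reversed(acquired):
                held.release()
            return False
    return True
```

A caller that simply retries this in a loop can be starved indefinitely, which is the risk described above.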
535 | 5c0c1eeb | Iustin Pop | |
536 | 5c0c1eeb | Iustin Pop | Locking granularity |
537 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++ |
538 | 5c0c1eeb | Iustin Pop | |
539 | 5c0c1eeb | Iustin Pop | For the first version of this code we'll convert each Logical Unit to |
540 | 5c0c1eeb | Iustin Pop | acquire/release the locks it needs, so locking will be at the Logical Unit |
541 | 5c0c1eeb | Iustin Pop | level. In the future we may want to split logical units into independent |
542 | 5c0c1eeb | Iustin Pop | "tasklets" with their own locking requirements. A different design doc (or mini |
543 | 5c0c1eeb | Iustin Pop | design doc) will cover the move from Logical Units to tasklets. |
544 | 5c0c1eeb | Iustin Pop | |
545 | 6c2d0b44 | Iustin Pop | Code examples |
546 | 6c2d0b44 | Iustin Pop | +++++++++++++ |
547 | 5c0c1eeb | Iustin Pop | |
548 | 5c0c1eeb | Iustin Pop | In general when acquiring locks we should use a code path equivalent to:: |
549 | 5c0c1eeb | Iustin Pop | |
550 | 5c0c1eeb | Iustin Pop |   lock.acquire() |
551 | 5c0c1eeb | Iustin Pop |   try: |
552 | 5c0c1eeb | Iustin Pop |     ... |
553 | 5c0c1eeb | Iustin Pop |     # other code |
554 | 5c0c1eeb | Iustin Pop |   finally: |
555 | 5c0c1eeb | Iustin Pop |     lock.release() |
556 | 5c0c1eeb | Iustin Pop | |
557 | 6c2d0b44 | Iustin Pop | This makes sure we release all locks, and avoids possible deadlocks. Of |
558 | 6c2d0b44 | Iustin Pop | course, extra care must be taken not to leave, if possible, locked |
559 | 6c2d0b44 | Iustin Pop | structures in an unusable state. Note that with Python 2.5 a simpler |
560 | 6c2d0b44 | Iustin Pop | syntax will be possible, but we want to keep compatibility with Python |
561 | 6c2d0b44 | Iustin Pop | 2.4 so the new constructs should not be used. |
562 | 5c0c1eeb | Iustin Pop | |
563 | 5c0c1eeb | Iustin Pop | In order to avoid this extra indentation and code changes everywhere in the |
564 | 5c0c1eeb | Iustin Pop | Logical Units code, we decided to allow LUs to declare locks, and then execute |
565 | 5c0c1eeb | Iustin Pop | their code with their locks acquired. In the new world LUs are called like |
566 | 5c0c1eeb | Iustin Pop | this:: |
567 | 5c0c1eeb | Iustin Pop | |
568 | 5c0c1eeb | Iustin Pop | # user passed names are expanded to the internal lock/resource name, |
569 | 5c0c1eeb | Iustin Pop | # then known needed locks are declared |
570 | 5c0c1eeb | Iustin Pop | lu.ExpandNames() |
571 | 5c0c1eeb | Iustin Pop | ... some locking/adding of locks may happen ... |
572 | 5c0c1eeb | Iustin Pop | # late declaration of locks for one level: this is useful because sometimes |
573 | 5c0c1eeb | Iustin Pop | # we can't know which resource we need before locking the previous level |
574 | 5c0c1eeb | Iustin Pop | lu.DeclareLocks() # for each level (cluster, instance, node) |
575 | 5c0c1eeb | Iustin Pop | ... more locking/adding of locks can happen ... |
576 | 5c0c1eeb | Iustin Pop | # these functions are called with the proper locks held |
577 | 5c0c1eeb | Iustin Pop | lu.CheckPrereq() |
578 | 5c0c1eeb | Iustin Pop | lu.Exec() |
579 | 5c0c1eeb | Iustin Pop | ... locks declared for removal are removed, all acquired locks released ... |
580 | 5c0c1eeb | Iustin Pop | |
581 | 5c0c1eeb | Iustin Pop | The Processor and the LogicalUnit class will contain exact documentation on how |
582 | 5c0c1eeb | Iustin Pop | locks are supposed to be declared. |
583 | 5c0c1eeb | Iustin Pop | |
584 | 5c0c1eeb | Iustin Pop | Caveats |
585 | 5c0c1eeb | Iustin Pop | +++++++ |
586 | 5c0c1eeb | Iustin Pop | |
587 | 5c0c1eeb | Iustin Pop | This library will provide an easy upgrade path to bring all the code to |
588 | 5c0c1eeb | Iustin Pop | granular locking without breaking everything, and it will also guarantee |
589 | 5c0c1eeb | Iustin Pop | against a lot of common errors. Code switching from the old "lock everything" |
590 | 5c0c1eeb | Iustin Pop | lock to the new system, though, needs to be carefully scrutinised to be sure it |
591 | 5c0c1eeb | Iustin Pop | is really acquiring all the necessary locks, and none has been overlooked or |
592 | 5c0c1eeb | Iustin Pop | forgotten. |
593 | 5c0c1eeb | Iustin Pop | |
594 | 5c0c1eeb | Iustin Pop | The code can contain other locks outside of this library, to synchronise other |
595 | 5c0c1eeb | Iustin Pop | threaded code (e.g. for the job queue), but in general these should be leaf locks |
596 | 5c0c1eeb | Iustin Pop | or carefully structured non-leaf ones, to avoid deadlock race conditions. |
597 | 5c0c1eeb | Iustin Pop | |
598 | 5c0c1eeb | Iustin Pop | |
599 | 5c0c1eeb | Iustin Pop | Job Queue |
600 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~ |
601 | 5c0c1eeb | Iustin Pop | |
602 | 5c0c1eeb | Iustin Pop | Granular locking is not enough to speed up operations; we also need a |
603 | 5c0c1eeb | Iustin Pop | queue to store these and to be able to process as many as possible in |
604 | 5c0c1eeb | Iustin Pop | parallel. |
605 | 5c0c1eeb | Iustin Pop | |
606 | 6c2d0b44 | Iustin Pop | A Ganeti job will consist of multiple ``OpCodes`` which are the basic |
607 | 5c0c1eeb | Iustin Pop | element of operation in Ganeti 1.2 (and will remain as such). Most |
608 | 5c0c1eeb | Iustin Pop | command-level commands are equivalent to one OpCode, or in some cases |
609 | 5c0c1eeb | Iustin Pop | to a sequence of opcodes, all of the same type (e.g. evacuating a node |
610 | 5c0c1eeb | Iustin Pop | will generate N opcodes of type replace disks). |
611 | 5c0c1eeb | Iustin Pop | |
612 | 5c0c1eeb | Iustin Pop | |
613 | 5c0c1eeb | Iustin Pop | Job execution—“Life of a Ganeti job” |
614 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++++++ |
615 | 5c0c1eeb | Iustin Pop | |
616 | 5c0c1eeb | Iustin Pop | #. Job gets submitted by the client. A new job identifier is generated and |
617 | 5c0c1eeb | Iustin Pop | assigned to the job. The job is then automatically replicated [#replic]_ |
618 | 5c0c1eeb | Iustin Pop | to all nodes in the cluster. The identifier is returned to the client. |
619 | 5c0c1eeb | Iustin Pop | #. A pool of worker threads waits for new jobs. If all are busy, the job has |
620 | 5c0c1eeb | Iustin Pop | to wait and the first worker finishing its work will grab it. Otherwise any |
621 | 5c0c1eeb | Iustin Pop | of the waiting threads will pick up the new job. |
622 | 5c0c1eeb | Iustin Pop | #. Client waits for job status updates by calling a waiting RPC function. |
623 | 5c0c1eeb | Iustin Pop | Log messages may be shown to the user. Until the job is started, it can also |
624 | 6c2d0b44 | Iustin Pop | be canceled. |
625 | 5c0c1eeb | Iustin Pop | #. As soon as the job is finished, its final result and status can be retrieved |
626 | 5c0c1eeb | Iustin Pop | from the server. |
627 | 5c0c1eeb | Iustin Pop | #. If the client archives the job, it gets moved to a history directory. |
628 | 5c0c1eeb | Iustin Pop | There will be a method to archive all jobs older than a given age. |
629 | 5c0c1eeb | Iustin Pop | |
630 | 5c0c1eeb | Iustin Pop | .. [#replic] We need replication in order to maintain the consistency across |
631 | 5c0c1eeb | Iustin Pop | all nodes in the system; the master node only differs in the fact that |
632 | 5c0c1eeb | Iustin Pop | now it is running the master daemon, but if it fails and we do a master |
633 | 5c0c1eeb | Iustin Pop | failover, the jobs are still visible on the new master (though marked as |
634 | 5c0c1eeb | Iustin Pop | failed). |
635 | 5c0c1eeb | Iustin Pop | |
636 | 5c0c1eeb | Iustin Pop | Failures to replicate a job to other nodes will be only flagged as |
637 | 5c0c1eeb | Iustin Pop | errors in the master daemon log if more than half of the nodes failed, |
638 | 5c0c1eeb | Iustin Pop | otherwise we ignore the failure, and rely on the fact that the next |
639 | 5c0c1eeb | Iustin Pop | update (for still running jobs) will retry the update. For finished |
640 | 5c0c1eeb | Iustin Pop | jobs, it is less of a problem. |
641 | 5c0c1eeb | Iustin Pop | |
642 | 5c0c1eeb | Iustin Pop | Future improvements will look into checking the consistency of the job |
643 | 5c0c1eeb | Iustin Pop | list and jobs themselves at master daemon startup. |
644 | 5c0c1eeb | Iustin Pop | |
645 | 5c0c1eeb | Iustin Pop | |
646 | 5c0c1eeb | Iustin Pop | Job storage |
647 | 5c0c1eeb | Iustin Pop | +++++++++++ |
648 | 5c0c1eeb | Iustin Pop | |
649 | 5c0c1eeb | Iustin Pop | Jobs are stored in the filesystem as individual files, serialized |
650 | 5c0c1eeb | Iustin Pop | using JSON (standard serialization mechanism in Ganeti). |
651 | 5c0c1eeb | Iustin Pop | |
652 | 5c0c1eeb | Iustin Pop | The choice of storing each job in its own file was made because: |
653 | 5c0c1eeb | Iustin Pop | |
654 | 5c0c1eeb | Iustin Pop | - a file can be atomically replaced |
655 | 5c0c1eeb | Iustin Pop | - a file can easily be replicated to other nodes |
656 | 5c0c1eeb | Iustin Pop | - checking consistency across nodes can be implemented very easily, since |
657 | 5c0c1eeb | Iustin Pop | all job files should be (at a given moment in time) identical |
658 | 5c0c1eeb | Iustin Pop | |
659 | 5c0c1eeb | Iustin Pop | The other possible choices that were discussed and discounted were: |
660 | 5c0c1eeb | Iustin Pop | |
661 | 5c0c1eeb | Iustin Pop | - single big file with all job data: not feasible due to difficult updates |
662 | 5c0c1eeb | Iustin Pop | - in-process databases: hard to replicate the entire database to the |
663 | 5c0c1eeb | Iustin Pop | other nodes, and replicating individual operations does not mean we keep |
664 | 5c0c1eeb | Iustin Pop | consistency |
665 | 5c0c1eeb | Iustin Pop | |
666 | 5c0c1eeb | Iustin Pop | |
667 | 5c0c1eeb | Iustin Pop | Queue structure |
668 | 5c0c1eeb | Iustin Pop | +++++++++++++++ |
669 | 5c0c1eeb | Iustin Pop | |
670 | 5c0c1eeb | Iustin Pop | All file operations have to be done atomically by writing to a temporary file |
671 | 5c0c1eeb | Iustin Pop | and subsequent renaming. Except for log messages, every change in a job is |
672 | 5c0c1eeb | Iustin Pop | stored and replicated to other nodes. |
673 | 5c0c1eeb | Iustin Pop | |
674 | 5c0c1eeb | Iustin Pop | :: |
675 | 5c0c1eeb | Iustin Pop | |
676 | 5c0c1eeb | Iustin Pop | /var/lib/ganeti/queue/ |
677 | 5c0c1eeb | Iustin Pop | job-1 (JSON encoded job description and status) |
678 | 5c0c1eeb | Iustin Pop | […] |
679 | 5c0c1eeb | Iustin Pop | job-37 |
680 | 5c0c1eeb | Iustin Pop | job-38 |
681 | 5c0c1eeb | Iustin Pop | job-39 |
682 | 5c0c1eeb | Iustin Pop | lock (Queue managing process opens this file in exclusive mode) |
683 | 5c0c1eeb | Iustin Pop | serial (Last job ID used) |
684 | 5c0c1eeb | Iustin Pop | version (Queue format version) |
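The "write to a temporary file, then rename" rule for job files can be sketched as below. The helper name and the shape of the job data are assumptions for the example; only the ``job-<id>`` naming and the JSON encoding come from the layout described here.

```python
import json
import os
import tempfile


def write_job_file(queue_dir, job_id, job_data):
    """Atomically (re)write queue_dir/job-<id> with JSON-encoded data."""
    final_path = os.path.join(queue_dir, "job-%d" % job_id)
    # The temporary file must live in the same directory (and therefore on
    # the same filesystem) for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=queue_dir, prefix=".job-tmp-")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(job_data, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # Readers see either the old or the new content, never a partial file.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Best-effort cleanup of the temporary file on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```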
685 | 5c0c1eeb | Iustin Pop | |
686 | 5c0c1eeb | Iustin Pop | |
687 | 5c0c1eeb | Iustin Pop | Locking |
688 | 5c0c1eeb | Iustin Pop | +++++++ |
689 | 5c0c1eeb | Iustin Pop | |
690 | 5c0c1eeb | Iustin Pop | Locking in the job queue is a complicated topic. It is called from more than |
691 | 5c0c1eeb | Iustin Pop | one thread and must be thread-safe. For simplicity, a single lock is used for |
692 | 5c0c1eeb | Iustin Pop | the whole job queue. |
693 | 5c0c1eeb | Iustin Pop | |
694 | 5c0c1eeb | Iustin Pop | A more detailed description can be found in doc/locking.txt. |
695 | 5c0c1eeb | Iustin Pop | |
696 | 5c0c1eeb | Iustin Pop | |
697 | 5c0c1eeb | Iustin Pop | Internal RPC |
698 | 5c0c1eeb | Iustin Pop | ++++++++++++ |
699 | 5c0c1eeb | Iustin Pop | |
700 | 5c0c1eeb | Iustin Pop | RPC calls available between Ganeti master and node daemons: |
701 | 5c0c1eeb | Iustin Pop | |
702 | 5c0c1eeb | Iustin Pop | jobqueue_update(file_name, content) |
703 | 5c0c1eeb | Iustin Pop | Writes a file in the job queue directory. |
704 | 5c0c1eeb | Iustin Pop | jobqueue_purge() |
705 | 5c0c1eeb | Iustin Pop | Cleans the job queue directory completely, including archived jobs. |
706 | 5c0c1eeb | Iustin Pop | jobqueue_rename(old, new) |
707 | 5c0c1eeb | Iustin Pop | Renames a file in the job queue directory. |
708 | 5c0c1eeb | Iustin Pop | |
709 | 5c0c1eeb | Iustin Pop | |
710 | 5c0c1eeb | Iustin Pop | Client RPC |
711 | 5c0c1eeb | Iustin Pop | ++++++++++ |
712 | 5c0c1eeb | Iustin Pop | |
713 | 5c0c1eeb | Iustin Pop | RPC between Ganeti clients and the Ganeti master daemon supports the following |
714 | 5c0c1eeb | Iustin Pop | operations: |
715 | 5c0c1eeb | Iustin Pop | |
716 | 5c0c1eeb | Iustin Pop | SubmitJob(ops) |
717 | 5c0c1eeb | Iustin Pop | Submits a list of opcodes and returns the job identifier. The identifier is |
718 | 5c0c1eeb | Iustin Pop | guaranteed to be unique during the lifetime of a cluster. |
719 | 5c0c1eeb | Iustin Pop | WaitForJobChange(job_id, fields, […], timeout) |
720 | 5c0c1eeb | Iustin Pop | This function waits until a job changes or a timeout expires. The condition |
721 | 5c0c1eeb | Iustin Pop | for when a job changed is defined by the fields passed and the last log |
722 | 5c0c1eeb | Iustin Pop | message received. |
723 | 5c0c1eeb | Iustin Pop | QueryJobs(job_ids, fields) |
724 | 5c0c1eeb | Iustin Pop | Returns field values for the job identifiers passed. |
725 | 5c0c1eeb | Iustin Pop | CancelJob(job_id) |
726 | 5c0c1eeb | Iustin Pop | Cancels the job specified by identifier. This operation may fail if the job |
727 | 5c0c1eeb | Iustin Pop | is already running, canceled or finished. |
728 | 5c0c1eeb | Iustin Pop | ArchiveJob(job_id) |
729 | 5c0c1eeb | Iustin Pop | Moves a job into the …/archive/ directory. This operation will fail if the |
730 | 5c0c1eeb | Iustin Pop | job has not been canceled or finished. |
731 | 5c0c1eeb | Iustin Pop | |
732 | 5c0c1eeb | Iustin Pop | |
733 | 5c0c1eeb | Iustin Pop | Job and opcode status |
734 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
735 | 5c0c1eeb | Iustin Pop | |
736 | 5c0c1eeb | Iustin Pop | Each job and each opcode has, at any time, one of the following states: |
737 | 5c0c1eeb | Iustin Pop | |
738 | 5c0c1eeb | Iustin Pop | Queued |
739 | 5c0c1eeb | Iustin Pop | The job/opcode was submitted, but did not yet start. |
740 | 5c0c1eeb | Iustin Pop | Waiting |
741 | 5c0c1eeb | Iustin Pop | The job/opcode is waiting for a lock to proceed. |
742 | 5c0c1eeb | Iustin Pop | Running |
743 | 5c0c1eeb | Iustin Pop | The job/opcode is running. |
744 | 5c0c1eeb | Iustin Pop | Canceled |
745 | 5c0c1eeb | Iustin Pop | The job/opcode was canceled before it started. |
746 | 5c0c1eeb | Iustin Pop | Success |
747 | 5c0c1eeb | Iustin Pop | The job/opcode ran and finished successfully. |
748 | 5c0c1eeb | Iustin Pop | Error |
749 | 5c0c1eeb | Iustin Pop | The job/opcode was aborted with an error. |
750 | 5c0c1eeb | Iustin Pop | |
751 | 5c0c1eeb | Iustin Pop | If the master is aborted while a job is running, the job will be set to the |
752 | 5c0c1eeb | Iustin Pop | Error status once the master is started again. |
753 | 5c0c1eeb | Iustin Pop | |
754 | 5c0c1eeb | Iustin Pop | |
755 | 5c0c1eeb | Iustin Pop | History |
756 | 5c0c1eeb | Iustin Pop | +++++++ |
757 | 5c0c1eeb | Iustin Pop | |
758 | 5c0c1eeb | Iustin Pop | Archived jobs are kept in a separate directory, |
759 | 6c2d0b44 | Iustin Pop | ``/var/lib/ganeti/queue/archive/``. This is done in order to speed up |
760 | 6c2d0b44 | Iustin Pop | the queue handling: by default, the jobs in the archive are not |
761 | 6c2d0b44 | Iustin Pop | touched by any functions. Only the current (unarchived) jobs are |
762 | 6c2d0b44 | Iustin Pop | parsed, loaded, and verified (if implemented) by the master daemon. |
763 | 5c0c1eeb | Iustin Pop | |
764 | 5c0c1eeb | Iustin Pop | |
765 | 5c0c1eeb | Iustin Pop | Ganeti updates |
766 | 5c0c1eeb | Iustin Pop | ++++++++++++++ |
767 | 5c0c1eeb | Iustin Pop | |
768 | 5c0c1eeb | Iustin Pop | The queue has to be completely empty for Ganeti updates with changes |
769 | 5c0c1eeb | Iustin Pop | in the job queue structure. In order to allow this, there will be a |
770 | 5c0c1eeb | Iustin Pop | way to prevent new jobs entering the queue. |
771 | 5c0c1eeb | Iustin Pop | |
772 | 5c0c1eeb | Iustin Pop | |
773 | 5c0c1eeb | Iustin Pop | Object parameters |
774 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~ |
775 | 5c0c1eeb | Iustin Pop | |
776 | 5c0c1eeb | Iustin Pop | Across all cluster configuration data, we have multiple classes of |
777 | 5c0c1eeb | Iustin Pop | parameters: |
778 | 5c0c1eeb | Iustin Pop | |
779 | 5c0c1eeb | Iustin Pop | A. cluster-wide parameters (e.g. name of the cluster, the master); |
780 | 5c0c1eeb | Iustin Pop | these are the ones that we have today, and are unchanged from the |
781 | 5c0c1eeb | Iustin Pop | current model |
782 | 5c0c1eeb | Iustin Pop | |
783 | 5c0c1eeb | Iustin Pop | #. node parameters |
784 | 5c0c1eeb | Iustin Pop | |
785 | 5c0c1eeb | Iustin Pop | #. instance specific parameters, e.g. the name of disks (LV), that |
786 | 5c0c1eeb | Iustin Pop | cannot be shared with other instances |
787 | 5c0c1eeb | Iustin Pop | |
788 | 5c0c1eeb | Iustin Pop | #. instance parameters, that are or can be the same for many |
789 | 5c0c1eeb | Iustin Pop | instances, but are not hypervisor related; e.g. the number of VCPUs, |
790 | 5c0c1eeb | Iustin Pop | or the size of memory |
791 | 5c0c1eeb | Iustin Pop | |
792 | 5c0c1eeb | Iustin Pop | #. instance parameters that are hypervisor specific (e.g. kernel_path |
793 | 5c0c1eeb | Iustin Pop | or PAE mode) |
794 | 5c0c1eeb | Iustin Pop | |
795 | 5c0c1eeb | Iustin Pop | |
796 | 5c0c1eeb | Iustin Pop | The following definitions for instance parameters will be used below: |
797 | 5c0c1eeb | Iustin Pop | |
798 | 5c0c1eeb | Iustin Pop | :hypervisor parameter: |
799 | 5c0c1eeb | Iustin Pop | a hypervisor parameter (or hypervisor specific parameter) is defined |
800 | 5c0c1eeb | Iustin Pop | as a parameter that is interpreted by the hypervisor support code in |
801 | 5c0c1eeb | Iustin Pop | Ganeti and usually is specific to a particular hypervisor (like the |
802 | 6c2d0b44 | Iustin Pop | kernel path for `PVM`_ which makes no sense for `HVM`_). |
803 | 5c0c1eeb | Iustin Pop | |
804 | 5c0c1eeb | Iustin Pop | :backend parameter: |
805 | 5c0c1eeb | Iustin Pop | a backend parameter is defined as an instance parameter that can be |
806 | 5c0c1eeb | Iustin Pop | shared among a list of instances, and is either generic enough not |
807 | 5c0c1eeb | Iustin Pop | to be tied to a given hypervisor or cannot influence at all the |
808 | 5c0c1eeb | Iustin Pop | hypervisor behaviour. |
809 | 5c0c1eeb | Iustin Pop | |
810 | 5c0c1eeb | Iustin Pop | For example: memory, vcpus, auto_balance |
811 | 5c0c1eeb | Iustin Pop | |
812 | 5c0c1eeb | Iustin Pop | All these parameters will be encoded into constants.py with the prefix "BE\_" |
813 | 5c0c1eeb | Iustin Pop | and the whole list of parameters will exist in the set "BES_PARAMETERS" |
814 | 5c0c1eeb | Iustin Pop | |
815 | 5c0c1eeb | Iustin Pop | :proper parameter: |
816 | 5c0c1eeb | Iustin Pop | a parameter whose value is unique to the instance (e.g. the name of a LV, |
817 | 5c0c1eeb | Iustin Pop | or the MAC of a NIC) |
818 | 5c0c1eeb | Iustin Pop | |
819 | 5c0c1eeb | Iustin Pop | As a general rule, for all kinds of parameters, “None” (or in |
820 | 5c0c1eeb | Iustin Pop | JSON-speak, “null”) will no longer be a valid value for a parameter. As |
821 | 5c0c1eeb | Iustin Pop | such, only non-default parameters will be saved as part of objects in |
822 | 5c0c1eeb | Iustin Pop | the serialization step, reducing the size of the serialized format. |
823 | 5c0c1eeb | Iustin Pop | |
824 | 5c0c1eeb | Iustin Pop | Cluster parameters |
825 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++ |
826 | 5c0c1eeb | Iustin Pop | |
827 | 5c0c1eeb | Iustin Pop | Cluster parameters remain as today, attributes at the top level of the |
828 | 5c0c1eeb | Iustin Pop | Cluster object. In addition, two new attributes at this level will |
829 | 5c0c1eeb | Iustin Pop | hold defaults for the instances: |
830 | 5c0c1eeb | Iustin Pop | |
831 | 5c0c1eeb | Iustin Pop | - hvparams, a dictionary indexed by hypervisor type, holding default |
832 | 6c2d0b44 | Iustin Pop | values for hypervisor parameters that are not defined/overridden by |
833 | 5c0c1eeb | Iustin Pop | the instances of this hypervisor type |
834 | 5c0c1eeb | Iustin Pop | |
835 | 5c0c1eeb | Iustin Pop | - beparams, a dictionary holding (for 2.0) a single element 'default', |
836 | 5c0c1eeb | Iustin Pop | which holds the default value for backend parameters |
837 | 5c0c1eeb | Iustin Pop | |
838 | 5c0c1eeb | Iustin Pop | Node parameters |
839 | 5c0c1eeb | Iustin Pop | +++++++++++++++ |
840 | 5c0c1eeb | Iustin Pop | |
841 | 5c0c1eeb | Iustin Pop | Node-related parameters are very few, and we will continue using the |
842 | 5c0c1eeb | Iustin Pop | same model for these as previously (attributes on the Node object). |
843 | 5c0c1eeb | Iustin Pop | |
844 | e0eb13de | Iustin Pop | There are three new node flags, described in a separate section "node |
845 | e0eb13de | Iustin Pop | flags" below. |
846 | e0eb13de | Iustin Pop | |
847 | 5c0c1eeb | Iustin Pop | Instance parameters |
848 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++ |
849 | 5c0c1eeb | Iustin Pop | |
850 | 5c0c1eeb | Iustin Pop | As described before, the instance parameters are split in three: |
851 | 5c0c1eeb | Iustin Pop | instance proper parameters, unique to each instance, instance |
852 | 5c0c1eeb | Iustin Pop | hypervisor parameters and instance backend parameters. |
853 | 5c0c1eeb | Iustin Pop | |
854 | 5c0c1eeb | Iustin Pop | The “hvparams” and “beparams” are kept in two dictionaries at instance |
855 | 5c0c1eeb | Iustin Pop | level. Only non-default parameters are stored (but once customized, a |
856 | 5c0c1eeb | Iustin Pop | parameter will be kept, even with the same value as the default one, |
857 | 5c0c1eeb | Iustin Pop | until reset). |
858 | 5c0c1eeb | Iustin Pop | |
859 | 5c0c1eeb | Iustin Pop | The names for hypervisor parameters in the instance.hvparams subtree |
860 | 5c0c1eeb | Iustin Pop | should be chosen as generic as possible, especially if specific |
861 | 5c0c1eeb | Iustin Pop | parameters could conceivably be useful for more than one hypervisor, |
862 | 6c2d0b44 | Iustin Pop | e.g. ``instance.hvparams.vnc_console_port`` instead of using both |
863 | 6c2d0b44 | Iustin Pop | ``instance.hvparams.hvm_vnc_console_port`` and |
864 | 6c2d0b44 | Iustin Pop | ``instance.hvparams.kvm_vnc_console_port``. |
865 | 5c0c1eeb | Iustin Pop | |
866 | 5c0c1eeb | Iustin Pop | There are some special cases related to disks and NICs (for example): |
867 | 6c2d0b44 | Iustin Pop | a disk has both Ganeti-related parameters (e.g. the name of the LV) |
868 | 5c0c1eeb | Iustin Pop | and hypervisor-related parameters (how the disk is presented to/named |
869 | 5c0c1eeb | Iustin Pop | in the instance). The former parameters remain as proper-instance |
870 | 5c0c1eeb | Iustin Pop | parameters, while the latter values are migrated to the hvparams |
871 | 5c0c1eeb | Iustin Pop | structure. In 2.0, we will have only globally-per-instance such |
872 | 5c0c1eeb | Iustin Pop | hypervisor parameters, and not per-disk ones (e.g. all NICs will be |
873 | 5c0c1eeb | Iustin Pop | exported as of the same type). |
874 | 5c0c1eeb | Iustin Pop | |
875 | 5c0c1eeb | Iustin Pop | Starting from the 1.2 list of instance parameters, here is how they |
876 | 5c0c1eeb | Iustin Pop | will be mapped to the three classes of parameters: |
877 | 5c0c1eeb | Iustin Pop | |
878 | 5c0c1eeb | Iustin Pop | - name (P) |
879 | 5c0c1eeb | Iustin Pop | - primary_node (P) |
880 | 5c0c1eeb | Iustin Pop | - os (P) |
881 | 5c0c1eeb | Iustin Pop | - hypervisor (P) |
882 | 5c0c1eeb | Iustin Pop | - status (P) |
883 | 5c0c1eeb | Iustin Pop | - memory (BE) |
884 | 5c0c1eeb | Iustin Pop | - vcpus (BE) |
885 | 5c0c1eeb | Iustin Pop | - nics (P) |
886 | 5c0c1eeb | Iustin Pop | - disks (P) |
887 | 5c0c1eeb | Iustin Pop | - disk_template (P) |
888 | 5c0c1eeb | Iustin Pop | - network_port (P) |
889 | 5c0c1eeb | Iustin Pop | - kernel_path (HV) |
890 | 5c0c1eeb | Iustin Pop | - initrd_path (HV) |
891 | 5c0c1eeb | Iustin Pop | - hvm_boot_order (HV) |
892 | 5c0c1eeb | Iustin Pop | - hvm_acpi (HV) |
893 | 5c0c1eeb | Iustin Pop | - hvm_pae (HV) |
894 | 5c0c1eeb | Iustin Pop | - hvm_cdrom_image_path (HV) |
895 | 5c0c1eeb | Iustin Pop | - hvm_nic_type (HV) |
896 | 5c0c1eeb | Iustin Pop | - hvm_disk_type (HV) |
897 | 5c0c1eeb | Iustin Pop | - vnc_bind_address (HV) |
898 | 5c0c1eeb | Iustin Pop | - serial_no (P) |
899 | 5c0c1eeb | Iustin Pop | |
900 | 5c0c1eeb | Iustin Pop | |
901 | 5c0c1eeb | Iustin Pop | Parameter validation |
902 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++ |
903 | 5c0c1eeb | Iustin Pop | |
904 | 5c0c1eeb | Iustin Pop | To support the new cluster parameter design, additional features will |
905 | 5c0c1eeb | Iustin Pop | be required from the hypervisor support implementations in Ganeti. |
906 | 5c0c1eeb | Iustin Pop | |
907 | 5c0c1eeb | Iustin Pop | The hypervisor support implementation API will be extended with the |
908 | 5c0c1eeb | Iustin Pop | following features: |
909 | 5c0c1eeb | Iustin Pop | |
910 | 5c0c1eeb | Iustin Pop | :PARAMETERS: class-level attribute holding the list of valid parameters |
911 | 5c0c1eeb | Iustin Pop | for this hypervisor |
912 | 5c0c1eeb | Iustin Pop | :CheckParamSyntax(hvparams): checks that the given parameters are |
913 | 5c0c1eeb | Iustin Pop | valid (as in the names are valid) for this hypervisor; usually just |
914 | 6c2d0b44 | Iustin Pop | comparing ``hvparams.keys()`` and ``cls.PARAMETERS``; this is a class |
915 | 6c2d0b44 | Iustin Pop | method that can be called from within master code (i.e. cmdlib) and |
916 | 6c2d0b44 | Iustin Pop | should be safe to do so |
917 | 5c0c1eeb | Iustin Pop | :ValidateParameters(hvparams): verifies the values of the provided |
918 | 5c0c1eeb | Iustin Pop | parameters against this hypervisor; this is a method that will be |
919 | 5c0c1eeb | Iustin Pop | called on the target node, from backend.py code, and as such can |
920 | 5c0c1eeb | Iustin Pop | make node-specific checks (e.g. kernel_path checking) |
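The split between the two checks can be illustrated with a minimal sketch. The class name ``FakeXenHypervisor``, the exception ``HypervisorParameterError`` and the choice of parameters are assumptions made for illustration only; they are not the actual Ganeti classes.

```python
import os


class HypervisorParameterError(Exception):
    """Raised when hypervisor parameters are invalid (hypothetical name)."""


class FakeXenHypervisor:
    """Illustrative hypervisor class; not the actual Ganeti implementation."""

    # Class-level list of the parameter names valid for this hypervisor.
    PARAMETERS = ["kernel_path", "initrd_path"]

    @classmethod
    def CheckParamSyntax(cls, hvparams):
        """Name-only check; touches no node state, so master code can call it."""
        unknown = sorted(set(hvparams.keys()) - set(cls.PARAMETERS))
        if unknown:
            raise HypervisorParameterError(
                "Unknown parameters: %s" % ", ".join(unknown))

    def ValidateParameters(self, hvparams):
        """Value check; meant to run on the target node."""
        kernel = hvparams.get("kernel_path")
        # Node-specific check: the kernel file must exist on this node.
        if kernel is not None and not os.path.isfile(kernel):
            raise HypervisorParameterError(
                "Kernel %s not found on this node" % kernel)
```

Under these assumptions, master code would call ``CheckParamSyntax`` before submitting a job, while ``ValidateParameters`` would only run where the file system of the target node is visible.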
921 | 5c0c1eeb | Iustin Pop | |
922 | 5c0c1eeb | Iustin Pop | Default value application |
923 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++ |
924 | 5c0c1eeb | Iustin Pop | |
925 | 5c0c1eeb | Iustin Pop | The application of defaults to an instance is done in the Cluster |
926 | 5c0c1eeb | Iustin Pop | object, via two new methods as follows: |
927 | 5c0c1eeb | Iustin Pop | |
928 | 5c0c1eeb | Iustin Pop | - ``Cluster.FillHV(instance)``, returns 'filled' hvparams dict, based on |
929 | 5c0c1eeb | Iustin Pop | instance's hvparams and cluster's ``hvparams[instance.hypervisor]`` |
930 | 5c0c1eeb | Iustin Pop | |
931 | 5c0c1eeb | Iustin Pop | - ``Cluster.FillBE(instance, be_type="default")``, which returns the |
932 | 5c0c1eeb | Iustin Pop | beparams dict, based on the instance and cluster beparams |
933 | 5c0c1eeb | Iustin Pop | |
934 | 5c0c1eeb | Iustin Pop | The FillHV/BE transformations will be used, for example, in the RpcRunner |
935 | 5c0c1eeb | Iustin Pop | when sending an instance for activation/stop, and the sent instance |
936 | 5c0c1eeb | Iustin Pop | hvparams/beparams will have the final value (noded code doesn't know |
937 | 5c0c1eeb | Iustin Pop | about defaults). |
938 | 5c0c1eeb | Iustin Pop | |
939 | 5c0c1eeb | Iustin Pop | LU code will need to self-call the transformation, if needed. |
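As a rough sketch of the fill step, the following assumes that both parameter sets are plain dictionaries and that the cluster-level ``beparams`` is keyed by ``be_type``; the ``Instance`` and ``Cluster`` classes here are simplified stand-ins, not the real configuration objects.

```python
class Instance:
    """Simplified stand-in for the instance object."""

    def __init__(self, hypervisor, hvparams=None, beparams=None):
        self.hypervisor = hypervisor
        self.hvparams = hvparams or {}
        self.beparams = beparams or {}


class Cluster:
    """Simplified stand-in for the cluster object."""

    def __init__(self, hvparams, beparams):
        # hvparams: {hypervisor name: {parameter: default value}}
        # beparams: {be_type: {parameter: default value}} (assumed layout)
        self.hvparams = hvparams
        self.beparams = beparams

    def FillHV(self, instance):
        """Return cluster defaults overridden by the instance's hvparams."""
        filled = dict(self.hvparams.get(instance.hypervisor, {}))
        filled.update(instance.hvparams)
        return filled

    def FillBE(self, instance, be_type="default"):
        """Return cluster defaults overridden by the instance's beparams."""
        filled = dict(self.beparams.get(be_type, {}))
        filled.update(instance.beparams)
        return filled
```

Both methods return a new dictionary, so the stored instance keeps only its explicit overrides and later changes to the cluster defaults still take effect.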
940 | 5c0c1eeb | Iustin Pop | |
941 | 5c0c1eeb | Iustin Pop | Opcode changes |
942 | 5c0c1eeb | Iustin Pop | ++++++++++++++ |
943 | 5c0c1eeb | Iustin Pop | |
944 | 5c0c1eeb | Iustin Pop | The parameter changes will have impact on the OpCodes, especially on |
945 | 5c0c1eeb | Iustin Pop | the following ones: |
946 | 5c0c1eeb | Iustin Pop | |
947 | 6c2d0b44 | Iustin Pop | - ``OpCreateInstance``, where the new hv and be parameters will be sent as |
948 | 5c0c1eeb | Iustin Pop | dictionaries; note that all hv and be parameters are now optional, as |
949 | 5c0c1eeb | Iustin Pop | the values can be instead taken from the cluster |
950 | 6c2d0b44 | Iustin Pop | - ``OpQueryInstances``, where we have to be able to query these new |
951 | 5c0c1eeb | Iustin Pop | parameters; the syntax for names will be ``hvparam/$NAME`` and |
952 | 5c0c1eeb | Iustin Pop | ``beparam/$NAME`` for querying an individual parameter out of one |
953 | 5c0c1eeb | Iustin Pop | dictionary, and ``hvparams``, respectively ``beparams``, for the whole |
954 | 5c0c1eeb | Iustin Pop | dictionaries |
955 | 6c2d0b44 | Iustin Pop | - ``OpModifyInstance``, where the modified parameters are sent as |
956 | 5c0c1eeb | Iustin Pop | dictionaries |
957 | 5c0c1eeb | Iustin Pop | |
958 | 5c0c1eeb | Iustin Pop | Additionally, we will need new OpCodes to modify the cluster-level |
959 | 5c0c1eeb | Iustin Pop | defaults for the be/hv sets of parameters. |
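The field-name syntax for queries could be resolved roughly as follows. The function name ``resolve_query_field`` is hypothetical, and the sketch assumes the already-filled parameter dictionaries are passed in.

```python
def resolve_query_field(name, hvparams, beparams):
    """Return the value selected by a query field name.

    Supports ``hvparams``/``beparams`` (whole dictionaries) and
    ``hvparam/$NAME``/``beparam/$NAME`` (one parameter each).
    """
    whole = {"hvparams": hvparams, "beparams": beparams}
    if name in whole:
        # Return a copy so callers cannot modify the source dictionary.
        return dict(whole[name])
    single = {"hvparam": hvparams, "beparam": beparams}
    prefix, sep, key = name.partition("/")
    if sep and prefix in single:
        return single[prefix].get(key)
    raise ValueError("Unknown field name: %s" % name)
```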
960 | 5c0c1eeb | Iustin Pop | |
961 | 5c0c1eeb | Iustin Pop | Caveats |
962 | 5c0c1eeb | Iustin Pop | +++++++ |
963 | 5c0c1eeb | Iustin Pop | |
964 | 5c0c1eeb | Iustin Pop | One problem that might appear is that our classification is not |
965 | 5c0c1eeb | Iustin Pop | complete or not good enough, and we'll need to change this model. As |
966 | 5c0c1eeb | Iustin Pop | the last resort, we will need to roll back and keep the 1.2 style. |
967 | 5c0c1eeb | Iustin Pop | |
968 | 5c0c1eeb | Iustin Pop | Another problem is that classification of one parameter is unclear |
969 | 5c0c1eeb | Iustin Pop | (e.g. ``network_port``, is this BE or HV?); in this case we'll take |
970 | 5c0c1eeb | Iustin Pop | the risk of having to move parameters later between classes. |
971 | 5c0c1eeb | Iustin Pop | |
972 | 5c0c1eeb | Iustin Pop | Security |
973 | 5c0c1eeb | Iustin Pop | ++++++++ |
974 | 5c0c1eeb | Iustin Pop | |
975 | 5c0c1eeb | Iustin Pop | The only security issue that we foresee is that some new parameters |
976 | 5c0c1eeb | Iustin Pop | may have sensitive values. If so, we will need a way to export the |
977 | 5c0c1eeb | Iustin Pop | config data while purging the sensitive values. |
978 | 5c0c1eeb | Iustin Pop | |
979 | 5c0c1eeb | Iustin Pop | E.g. for the drbd shared secrets, we could export these with the |
980 | 5c0c1eeb | Iustin Pop | values replaced by an empty string. |
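One possible shape for such an export is sketched below. The key name ``secret`` and the nested dict/list layout of the configuration are assumptions made for illustration; the actual configuration format may differ.

```python
# Keys whose values must not leave the cluster (hypothetical name).
SENSITIVE_KEYS = frozenset(["secret"])


def purge_sensitive(data):
    """Return a copy of data with sensitive values replaced by ''."""
    if isinstance(data, dict):
        return {
            key: "" if key in SENSITIVE_KEYS else purge_sensitive(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [purge_sensitive(item) for item in data]
    return data
```

The original data is left untouched, so the same in-memory configuration can still be used by the master after an export.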
981 | 5c0c1eeb | Iustin Pop | |
982 | e0eb13de | Iustin Pop | Node flags |
983 | e0eb13de | Iustin Pop | ~~~~~~~~~~ |
984 | e0eb13de | Iustin Pop | |
985 | e0eb13de | Iustin Pop | Ganeti 2.0 adds three node flags that change the way nodes are handled |
986 | e0eb13de | Iustin Pop | within Ganeti and the related infrastructure (iallocator interaction, |
987 | e0eb13de | Iustin Pop | RAPI data export). |
988 | e0eb13de | Iustin Pop | |
989 | e0eb13de | Iustin Pop | *master candidate* flag |
990 | e0eb13de | Iustin Pop | +++++++++++++++++++++++ |
991 | e0eb13de | Iustin Pop | |
992 | e0eb13de | Iustin Pop | Ganeti 2.0 allows more scalability in operation by introducing |
993 | e0eb13de | Iustin Pop | parallelization. However, a new bottleneck is reached that is the |
994 | e0eb13de | Iustin Pop | synchronization and replication of cluster configuration to all nodes |
995 | e0eb13de | Iustin Pop | in the cluster. |
996 | e0eb13de | Iustin Pop | |
997 | e0eb13de | Iustin Pop | This breaks scalability as the speed of the replication decreases |
998 | e0eb13de | Iustin Pop | roughly with the number of nodes in the cluster. The goal of the |
999 | e0eb13de | Iustin Pop | master candidate flag is to change this O(n) into O(1) with respect to |
1000 | e0eb13de | Iustin Pop | job and configuration data propagation. |
1001 | e0eb13de | Iustin Pop | |
1002 | e0eb13de | Iustin Pop | Only nodes having this flag set (let's call this set of nodes the |
1003 | e0eb13de | Iustin Pop | *candidate pool*) will have jobs and configuration data replicated. |
1004 | e0eb13de | Iustin Pop | |
1005 | e0eb13de | Iustin Pop | The cluster will have a new parameter (runtime changeable) called |
1006 | e0eb13de | Iustin Pop | ``candidate_pool_size`` which represents the number of candidates the |
1007 | e0eb13de | Iustin Pop | cluster tries to maintain (preferably automatically). |
1008 | e0eb13de | Iustin Pop | |
1009 | e0eb13de | Iustin Pop | This will impact the cluster operations as follows: |
1010 | e0eb13de | Iustin Pop | |
1011 | e0eb13de | Iustin Pop | - jobs and config data will be replicated only to a fixed set of nodes |
1012 | e0eb13de | Iustin Pop | - master fail-over will only be possible to a node in the candidate pool |
1013 | e0eb13de | Iustin Pop | - cluster verify needs changing to account for these two roles |
1014 | e0eb13de | Iustin Pop | - external scripts will no longer have access to the configuration |
1015 | e0eb13de | Iustin Pop | file (this is not recommended anyway) |
1016 | e0eb13de | Iustin Pop | |
1017 | e0eb13de | Iustin Pop | |
1018 | e0eb13de | Iustin Pop | The caveats of this change are: |
1019 | e0eb13de | Iustin Pop | |
1020 | e0eb13de | Iustin Pop | - if all candidates are lost (completely), cluster configuration is |
1021 | e0eb13de | Iustin Pop | lost (but it should be backed up external to the cluster anyway) |
1022 | e0eb13de | Iustin Pop | |
1023 | e0eb13de | Iustin Pop | - failed nodes which are candidate must be dealt with properly, so |
1024 | e0eb13de | Iustin Pop | that we don't lose too many candidates at the same time; this will be |
1025 | e0eb13de | Iustin Pop | reported in cluster verify |
1026 | e0eb13de | Iustin Pop | |
1027 | e0eb13de | Iustin Pop | - the 'all equal' concept of ganeti is no longer true |
1028 | e0eb13de | Iustin Pop | |
1029 | e0eb13de | Iustin Pop | - the partial distribution of config data means that all nodes will |
1030 | e0eb13de | Iustin Pop | have to revert to ssconf files for master info (as in 1.2) |
1031 | e0eb13de | Iustin Pop | |
1032 | e0eb13de | Iustin Pop | Advantages: |
1033 | e0eb13de | Iustin Pop | |
1034 | e0eb13de | Iustin Pop | - speed on a simulated cluster of 100+ nodes is greatly enhanced, even |
1035 | e0eb13de | Iustin Pop | for a simple operation; ``gnt-instance remove`` on a diskless instance |
1036 | e0eb13de | Iustin Pop | goes from ~9 seconds to ~2 seconds |
1037 | e0eb13de | Iustin Pop | |
1038 | e0eb13de | Iustin Pop | - node failure of non-candidates will be less impacting on the cluster |
1039 | e0eb13de | Iustin Pop | |
1040 | e0eb13de | Iustin Pop | The default value for the candidate pool size will be set to 10 but |
1041 | e0eb13de | Iustin Pop | this can be changed at cluster creation and modified any time later. |
1042 | e0eb13de | Iustin Pop | |
1043 | e0eb13de | Iustin Pop | Testing on simulated big clusters with sequential and parallel jobs |
1044 | e0eb13de | Iustin Pop | shows that this value (10) is a sweet spot from a performance and load |
1045 | e0eb13de | Iustin Pop | point of view. |
1046 | e0eb13de | Iustin Pop | |
1047 | e0eb13de | Iustin Pop | *offline* flag |
1048 | e0eb13de | Iustin Pop | ++++++++++++++ |
1049 | e0eb13de | Iustin Pop | |
1050 | e0eb13de | Iustin Pop | In order to better support the situation in which nodes are offline |
1051 | e0eb13de | Iustin Pop | (e.g. for repair) without altering the cluster configuration, Ganeti |
1052 | e0eb13de | Iustin Pop | needs to be told and needs to properly handle this state for nodes. |
1053 | e0eb13de | Iustin Pop | |
1054 | e0eb13de | Iustin Pop | This will result in simpler procedures, and fewer mistakes, when the |
1055 | e0eb13de | Iustin Pop | amount of node failures is high on an absolute scale (either due to |
1056 | e0eb13de | Iustin Pop | high failure rate or simply big clusters). |
1057 | e0eb13de | Iustin Pop | |
1058 | e0eb13de | Iustin Pop | Nodes having this attribute set will not be contacted for inter-node |
1059 | e0eb13de | Iustin Pop | RPC calls, will not be master candidates, and will not be able to host |
1060 | e0eb13de | Iustin Pop | instances as primaries. |
1061 | e0eb13de | Iustin Pop | |
1062 | e0eb13de | Iustin Pop | Setting this attribute on a node: |
1063 | e0eb13de | Iustin Pop | |
1064 | e0eb13de | Iustin Pop | - will not be allowed if the node is the master |
1065 | e0eb13de | Iustin Pop | - will not be allowed if the node has primary instances |
1066 | e0eb13de | Iustin Pop | - will cause the node to be demoted from the master candidate role (if |
1067 | e0eb13de | Iustin Pop | it was), possibly causing another node to be promoted to that role |
1068 | e0eb13de | Iustin Pop | |
1069 | e0eb13de | Iustin Pop | This attribute will impact the cluster operations as follows: |
1070 | e0eb13de | Iustin Pop | |
1071 | e0eb13de | Iustin Pop | - querying these nodes for anything will fail instantly in the RPC |
1072 | e0eb13de | Iustin Pop | library, with a specific RPC error (RpcResult.offline == True) |
1073 | e0eb13de | Iustin Pop | |
1074 | e0eb13de | Iustin Pop | - they will be listed in the Other section of cluster verify |
1075 | e0eb13de | Iustin Pop | |
1076 | e0eb13de | Iustin Pop | The code is changed in the following ways: |
1077 | e0eb13de | Iustin Pop | |
1078 | e0eb13de | Iustin Pop | - RPC calls were converted to skip such nodes: |
1079 | e0eb13de | Iustin Pop | |
1080 | e0eb13de | Iustin Pop | - RpcRunner-instance-based RPC calls are easy to convert |
1081 | e0eb13de | Iustin Pop | |
1082 | e0eb13de | Iustin Pop | - static/classmethod RPC calls are harder to convert, and were left |
1083 | e0eb13de | Iustin Pop | alone |
1084 | e0eb13de | Iustin Pop | |
1085 | e0eb13de | Iustin Pop | - the RPC results were unified so that this new result state (offline) |
1086 | e0eb13de | Iustin Pop | can be differentiated |
1087 | e0eb13de | Iustin Pop | |
1088 | e0eb13de | Iustin Pop | - master voting still queries offline (in-repair) nodes, as we need to ensure |
1089 | e0eb13de | Iustin Pop | consistency in case the (wrong) masters have old data, and nodes have |
1090 | e0eb13de | Iustin Pop | come back from repairs |
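The "fail instantly" behaviour can be pictured with the following sketch. Only the ``RpcResult.offline`` attribute comes from the text above; the constructor signature and the ``call_nodes`` helper are assumptions for illustration.

```python
class RpcResult:
    """Unified result of one RPC call to one node (simplified)."""

    def __init__(self, node, data=None, failed=False, offline=False):
        self.node = node
        self.data = data
        self.failed = failed
        self.offline = offline


def call_nodes(nodes, offline_nodes, call_fn):
    """Call call_fn(node) on each node, skipping the offline ones."""
    results = {}
    for node in nodes:
        if node in offline_nodes:
            # No network traffic: the failure is reported immediately.
            results[node] = RpcResult(node, failed=True, offline=True)
            continue
        try:
            results[node] = RpcResult(node, data=call_fn(node))
        except Exception:
            # A real failure, distinguishable from the offline state.
            results[node] = RpcResult(node, failed=True)
    return results
```

Callers can then tell an expected outage (``offline`` set) apart from an unexpected failure (``failed`` set, ``offline`` unset).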
1091 | e0eb13de | Iustin Pop | |
1092 | e0eb13de | Iustin Pop | Caveats: |
1093 | e0eb13de | Iustin Pop | |
1094 | e0eb13de | Iustin Pop | - some operation semantics are less clear (e.g. what to do on instance |
1095 | e0eb13de | Iustin Pop | start with offline secondary?); for now, these will just fail as if the |
1096 | e0eb13de | Iustin Pop | flag is not set (but faster) |
1097 | e0eb13de | Iustin Pop | - 2-node cluster with one node offline needs manual startup of the |
1098 | e0eb13de | Iustin Pop | master with a special flag to skip voting (as the master can't get a |
1099 | e0eb13de | Iustin Pop | quorum there) |
1100 | e0eb13de | Iustin Pop | |
1101 | e0eb13de | Iustin Pop | One of the advantages of implementing this flag is that it will allow |
1102 | e0eb13de | Iustin Pop | future automation tools to automatically put a node into repair |
1103 | e0eb13de | Iustin Pop | and recover it from this state, and the code (should/will) handle |
1104 | e0eb13de | Iustin Pop | this much better than just timing out. So, future possible |
1105 | e0eb13de | Iustin Pop | improvements (for later versions): |
1106 | e0eb13de | Iustin Pop | |
1107 | e0eb13de | Iustin Pop | - watcher will detect nodes which fail RPC calls and attempt to ssh |
1108 | e0eb13de | Iustin Pop | to them; on failure, it will put them offline |
1109 | e0eb13de | Iustin Pop | - watcher will try to ssh to and query the offline nodes; if successful, |
1110 | e0eb13de | Iustin Pop | it will take them off the repair list |
1111 | e0eb13de | Iustin Pop | |
1112 | e0eb13de | Iustin Pop | Alternatives considered: The RPC call model in 2.0 is, by default, |
1113 | e0eb13de | Iustin Pop | much nicer - errors are logged in the background, and job/opcode |
1114 | e0eb13de | Iustin Pop | execution is clearer, so we could simply not introduce this. However, |
1115 | e0eb13de | Iustin Pop | having this state will make both the codepaths clearer (offline |
1116 | e0eb13de | Iustin Pop | vs. temporary failure) and the operational model (it's not a node with |
1117 | e0eb13de | Iustin Pop | errors, but an offline node). |
1118 | e0eb13de | Iustin Pop | |
1119 | e0eb13de | Iustin Pop | |
1120 | e0eb13de | Iustin Pop | *drained* flag |
1121 | e0eb13de | Iustin Pop | ++++++++++++++ |
1122 | e0eb13de | Iustin Pop | |
1123 | e0eb13de | Iustin Pop | Due to parallel execution of jobs in Ganeti 2.0, we could have the |
1124 | e0eb13de | Iustin Pop | following situation: |
1125 | e0eb13de | Iustin Pop | |
1126 | e0eb13de | Iustin Pop | - gnt-node migrate + failover is run |
1127 | e0eb13de | Iustin Pop | - gnt-node evacuate is run, which schedules a long-running 6-opcode |
1128 | e0eb13de | Iustin Pop | job for the node |
1129 | e0eb13de | Iustin Pop | - partway through, a new job comes in that runs an iallocator script, |
1130 | e0eb13de | Iustin Pop | which finds the above node as empty and a very good candidate |
1131 | e0eb13de | Iustin Pop | - gnt-node evacuate has finished, but now it has to be run again, to |
1132 | e0eb13de | Iustin Pop | clean the above instance(s) |
1133 | e0eb13de | Iustin Pop | |
1134 | e0eb13de | Iustin Pop | In order to prevent this situation, and to be able to get nodes into |
1135 | e0eb13de | Iustin Pop | proper offline status easily, a new *drained* flag was added to the nodes. |
1136 | e0eb13de | Iustin Pop | |
1137 | e0eb13de | Iustin Pop | This flag (which actually means "is being, or was drained, and is |
1138 | e0eb13de | Iustin Pop | expected to go offline"), will prevent allocations on the node, but |
1139 | e0eb13de | Iustin Pop | otherwise all other operations (start/stop instance, query, etc.) are |
1140 | e0eb13de | Iustin Pop | working without any restrictions. |
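The effect on allocation can be sketched as a simple filter applied before any placement decision. The function name and the per-node flag dictionary are hypothetical; the real iallocator input format is not reproduced here.

```python
def allocation_candidates(nodes):
    """Return the names of nodes on which new instances may be placed.

    nodes maps a node name to a dictionary of boolean flags.
    Drained and offline nodes are excluded; all others remain eligible.
    """
    return [
        name
        for name, flags in sorted(nodes.items())
        if not flags.get("drained") and not flags.get("offline")
    ]
```

With such a filter in place, a node that is being evacuated no longer looks like an empty and attractive target to an allocation job running in parallel.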
1141 | e0eb13de | Iustin Pop | |
1142 | e0eb13de | Iustin Pop | Interaction between flags |
1143 | e0eb13de | Iustin Pop | +++++++++++++++++++++++++ |
1144 | e0eb13de | Iustin Pop | |
1145 | e0eb13de | Iustin Pop | While these flags are implemented as separate flags, they are |
1146 | e0eb13de | Iustin Pop | mutually-exclusive and are acting together with the master node role |
1147 | e0eb13de | Iustin Pop | as a single *node status* value. In other words, a node is only in one |
1148 | e0eb13de | Iustin Pop | of these roles at a given time. The lack of any of these flags denotes |
1149 | e0eb13de | Iustin Pop | a regular node. |
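A minimal sketch of how the separate flags collapse into one status value is given below. The status strings are illustrative, and the rule that the master can be neither drained nor offline is partly an assumption: the text only states this for the offline flag.

```python
def node_status(is_master=False, master_candidate=False,
                drained=False, offline=False):
    """Map the node flags and the master role to a single status string."""
    if sum(bool(f) for f in (master_candidate, drained, offline)) > 1:
        raise ValueError("node flags are mutually exclusive")
    if is_master:
        # Assumption: the master role excludes the drained/offline states.
        if drained or offline:
            raise ValueError("the master cannot be drained or offline")
        return "master"
    if master_candidate:
        return "master candidate"
    if drained:
        return "drained"
    if offline:
        return "offline"
    return "regular"
```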
1150 | e0eb13de | Iustin Pop | |
1151 | e0eb13de | Iustin Pop | The current node status is visible in the ``gnt-cluster verify`` |
1152 | e0eb13de | Iustin Pop | output, and the individual flags can be examined via separate flags in |
1153 | e0eb13de | Iustin Pop | the ``gnt-node list`` output. |
1154 | e0eb13de | Iustin Pop | |
1155 | e0eb13de | Iustin Pop | These new flags will be exported in both the iallocator input message |
1156 | e0eb13de | Iustin Pop | and via RAPI, see the respective man pages for the exact names. |
1157 | e0eb13de | Iustin Pop | |
1158 | 5c0c1eeb | Iustin Pop | Feature changes |
1159 | 5c0c1eeb | Iustin Pop | --------------- |
1160 | 5c0c1eeb | Iustin Pop | |
1161 | 5c0c1eeb | Iustin Pop | The main feature-level changes will be: |
1162 | 5c0c1eeb | Iustin Pop | |
1163 | 5c0c1eeb | Iustin Pop | - a number of disk related changes |
1164 | 5c0c1eeb | Iustin Pop | - removal of fixed two-disk, one-nic per instance limitation |
1165 | 5c0c1eeb | Iustin Pop | |
1166 | 5c0c1eeb | Iustin Pop | Disk handling changes |
1167 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~ |
1168 | 5c0c1eeb | Iustin Pop | |
1169 | 5c0c1eeb | Iustin Pop | The storage options available in Ganeti 1.x were introduced based on |
1170 | 5c0c1eeb | Iustin Pop | then-current software (first DRBD 0.7 then later DRBD 8) and the |
1171 | 5c0c1eeb | Iustin Pop | estimated usage patterns. However, experience has later shown that some |
1172 | 5c0c1eeb | Iustin Pop | assumptions made initially are not true and that more flexibility is |
1173 | 5c0c1eeb | Iustin Pop | needed. |
1174 | 5c0c1eeb | Iustin Pop | |
1175 | 6c2d0b44 | Iustin Pop | One main assumption made was that disk failures should be treated as 'rare' |
1176 | 5c0c1eeb | Iustin Pop | events, and that each of them needs to be manually handled in order to ensure |
1177 | 5c0c1eeb | Iustin Pop | data safety; however, both these assumptions are false: |
1178 | 5c0c1eeb | Iustin Pop | |
1179 | 6c2d0b44 | Iustin Pop | - disk failures can be a common occurrence, based on usage patterns or cluster |
1180 | 5c0c1eeb | Iustin Pop | size |
1181 | 5c0c1eeb | Iustin Pop | - our disk setup is robust enough (referring to DRBD8 + LVM) that we could |
1182 | 5c0c1eeb | Iustin Pop | automate more of the recovery |
1183 | 5c0c1eeb | Iustin Pop | |
1184 | 5c0c1eeb | Iustin Pop | Note that we still don't have fully-automated disk recovery as a goal, but our |
1185 | 5c0c1eeb | Iustin Pop | goal is to reduce the manual work needed. |
1186 | 5c0c1eeb | Iustin Pop | |
1187 | 5c0c1eeb | Iustin Pop | As such, we plan the following main changes: |
1188 | 5c0c1eeb | Iustin Pop | |
1189 | 5c0c1eeb | Iustin Pop | - DRBD8 is much more flexible and stable than its previous version (0.7), |
1190 | 5c0c1eeb | Iustin Pop | such that removing the support for the ``remote_raid1`` template and |
1191 | 5c0c1eeb | Iustin Pop | focusing only on DRBD8 is easier |
1192 | 5c0c1eeb | Iustin Pop | |
1193 | 5c0c1eeb | Iustin Pop | - dynamic discovery of DRBD devices is not actually needed in a cluster |
1194 | 5c0c1eeb | Iustin Pop | where the DRBD namespace is controlled by Ganeti; switching to a static |
1195 | 5c0c1eeb | Iustin Pop | assignment (done at either instance creation time or change secondary time) |
1196 | 5c0c1eeb | Iustin Pop | will change the disk activation time from O(n) to O(1), which on big |
1197 | 5c0c1eeb | Iustin Pop | clusters is a significant gain |
1198 | 5c0c1eeb | Iustin Pop | |
1199 | 5c0c1eeb | Iustin Pop | - remove the hard dependency on LVM (currently all available storage types are |
1200 | 5c0c1eeb | Iustin Pop | ultimately backed by LVM volumes) by introducing file-based storage |
1201 | 5c0c1eeb | Iustin Pop | |
1202 | 5c0c1eeb | Iustin Pop | Additionally, a number of smaller enhancements are also planned: |
1203 | 5c0c1eeb | Iustin Pop | - support variable number of disks |
1204 | 5c0c1eeb | Iustin Pop | - support read-only disks |
1205 | 5c0c1eeb | Iustin Pop | |
1206 | 5c0c1eeb | Iustin Pop | Future enhancements in the 2.x series, which do not require base design |
1207 | 5c0c1eeb | Iustin Pop | changes, might include: |
1208 | 5c0c1eeb | Iustin Pop | |
1209 | 5c0c1eeb | Iustin Pop | - enhancement of the LVM allocation method in order to try to keep |
1210 | 5c0c1eeb | Iustin Pop | all of an instance's virtual disks on the same physical |
1211 | 5c0c1eeb | Iustin Pop | disks |
1212 | 5c0c1eeb | Iustin Pop | |
1213 | 5c0c1eeb | Iustin Pop | - add support for DRBD8 authentication at handshake time in |
1214 | 5c0c1eeb | Iustin Pop | order to ensure each device connects to the correct peer |
1215 | 5c0c1eeb | Iustin Pop | |
1216 | 5c0c1eeb | Iustin Pop | - remove the restrictions on failover only to the secondary |
1217 | 5c0c1eeb | Iustin Pop | which creates very strict rules on cluster allocation |
1218 | 5c0c1eeb | Iustin Pop | |
1219 | 5c0c1eeb | Iustin Pop | DRBD minor allocation |
1220 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
1221 | 5c0c1eeb | Iustin Pop | |
1222 | 5c0c1eeb | Iustin Pop | Currently, when trying to identify or activate a new DRBD (or MD) |
1223 | 5c0c1eeb | Iustin Pop | device, the code scans all in-use devices in order to see if we find |
1224 | 5c0c1eeb | Iustin Pop | one that looks similar to our parameters and is already in the desired |
1225 | 5c0c1eeb | Iustin Pop | state or not. Since this needs external commands to be run, it is very |
1226 | 5c0c1eeb | Iustin Pop | slow when more than a few devices are already present. |
1227 | 5c0c1eeb | Iustin Pop | |
1228 | 5c0c1eeb | Iustin Pop | Therefore, we will change the discovery model from dynamic to |
1229 | 5c0c1eeb | Iustin Pop | static. When a new device is logically created (added to the |
1230 | 5c0c1eeb | Iustin Pop | configuration) a free minor number is computed from the list of |
1231 | 5c0c1eeb | Iustin Pop | devices that should exist on that node and assigned to that |
1232 | 5c0c1eeb | Iustin Pop | device. |
1233 | 5c0c1eeb | Iustin Pop | |
1234 | 5c0c1eeb | Iustin Pop | At device activation, if the minor is already in use, we check if |
1235 | 5c0c1eeb | Iustin Pop | it has our parameters; if not so, we just destroy the device (if |
1236 | 5c0c1eeb | Iustin Pop | possible, otherwise we abort) and start it with our own |
1237 | 5c0c1eeb | Iustin Pop | parameters. |
1238 | 5c0c1eeb | Iustin Pop | |
1239 | 5c0c1eeb | Iustin Pop | This means that we in effect take ownership of the minor space for |
1240 | 6c2d0b44 | Iustin Pop | that device type; if there's a user-created DRBD minor, it will be |
1241 | 5c0c1eeb | Iustin Pop | automatically removed. |
1242 | 5c0c1eeb | Iustin Pop | |
1243 | 5c0c1eeb | Iustin Pop | The change will have the effect of reducing the number of external |
1244 | 5c0c1eeb | Iustin Pop | commands run per device from a constant number times the index of the |
1245 | 5c0c1eeb | Iustin Pop | first free DRBD minor to just a constant number. |
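The static assignment needs no external commands, only the list of minors that the configuration already records for the node. A minimal sketch follows; picking the lowest free number is an assumption, since the text only requires a free one.

```python
def find_free_minor(used_minors):
    """Return the lowest minor number not present in used_minors.

    used_minors is the collection of minors that the configuration
    says should exist on the node; the node itself is not queried.
    """
    used = set(used_minors)
    minor = 0
    while minor in used:
        minor += 1
    return minor
```

Because the result depends only on configuration data, the cost per device no longer grows with the number of devices already active on the node.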
1246 | 5c0c1eeb | Iustin Pop | |
1247 | 6c2d0b44 | Iustin Pop | Removal of obsolete device types (MD, DRBD7) |
1248 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++++++++++++++ |
1249 | 5c0c1eeb | Iustin Pop | |
1250 | 5c0c1eeb | Iustin Pop | We need to remove these device types because of two issues. First, |
1251 | 6c2d0b44 | Iustin Pop | DRBD7 has bad failure modes in case of dual failures (both network and |
1252 | 5c0c1eeb | Iustin Pop | disk): it cannot propagate the error up the device stack and instead |
1253 | 6c2d0b44 | Iustin Pop | just panics. Second, due to the asymmetry between primary and |
1254 | 6c2d0b44 | Iustin Pop | secondary in MD+DRBD mode, we cannot do live failover (not even if we |
1255 | 6c2d0b44 | Iustin Pop | had MD+DRBD8). |
1256 | 5c0c1eeb | Iustin Pop | |
1257 | 5c0c1eeb | Iustin Pop | File-based storage support |
1258 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++ |
1259 | 5c0c1eeb | Iustin Pop | |
1260 | 6c2d0b44 | Iustin Pop | Using files instead of logical volumes for instance storage would |
1261 | 6c2d0b44 | Iustin Pop | allow us to get rid of the hard requirement for volume groups for |
1262 | 6c2d0b44 | Iustin Pop | testing clusters and it would also allow usage of SAN storage to do |
1263 | 6c2d0b44 | Iustin Pop | live failover taking advantage of this storage solution. |
1264 | 5c0c1eeb | Iustin Pop | |
1265 | 5c0c1eeb | Iustin Pop | Better LVM allocation |
1266 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
1267 | 5c0c1eeb | Iustin Pop | |
1268 | 5c0c1eeb | Iustin Pop | Currently, the LV to PV allocation mechanism is a very simple one: at |
1269 | 5c0c1eeb | Iustin Pop | each new request for a logical volume, tell LVM to allocate the volume |
1270 | 5c0c1eeb | Iustin Pop | in order based on the amount of free space. This is good for |
1271 | 5c0c1eeb | Iustin Pop | simplicity and for keeping the usage equally spread over the available |
1272 | 5c0c1eeb | Iustin Pop | physical disks, however it introduces a problem that an instance could |
1273 | 5c0c1eeb | Iustin Pop | end up with its (currently) two drives on two physical disks, or |
1274 | 5c0c1eeb | Iustin Pop | (worse) that the data and metadata for a DRBD device end up on |
1275 | 5c0c1eeb | Iustin Pop | different drives. |
1276 | 5c0c1eeb | Iustin Pop | |
1277 | 5c0c1eeb | Iustin Pop | This is bad because it causes unneeded ``replace-disks`` operations in |
1278 | 5c0c1eeb | Iustin Pop | case of a physical failure. |
1279 | 5c0c1eeb | Iustin Pop | |
1280 | 5c0c1eeb | Iustin Pop | The solution is to batch allocations for an instance and make the LVM |
1281 | 5c0c1eeb | Iustin Pop | handling code try to allocate as close as possible all the storage of |
1282 | 5c0c1eeb | Iustin Pop | one instance. We will still allow the logical volumes to spill over to |
1283 | 5c0c1eeb | Iustin Pop | additional disks as needed. |
1284 | 5c0c1eeb | Iustin Pop | |
1285 | 5c0c1eeb | Iustin Pop | Note that this clustered allocation can only be attempted at initial |
1286 | 5c0c1eeb | Iustin Pop | instance creation, or at change secondary node time. At add disk time, |
1287 | 5c0c1eeb | Iustin Pop | or at replacing individual disks, it's not easy enough to compute the |
1288 | 5c0c1eeb | Iustin Pop | current disk map so we'll not attempt the clustering. |
1289 | 5c0c1eeb | Iustin Pop | |
1290 | 5c0c1eeb | Iustin Pop | DRBD8 peer authentication at handshake |
1291 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++++++++ |
1292 | 5c0c1eeb | Iustin Pop | |
1293 | 5c0c1eeb | Iustin Pop | DRBD8 has a new feature that allows authentication of the peer at |
1294 | 5c0c1eeb | Iustin Pop | connect time. We can use this more to prevent connecting to the wrong peer |
1295 | 5c0c1eeb | Iustin Pop | than to secure the connection. Even though we never had issues |
1296 | 5c0c1eeb | Iustin Pop | with wrong connections, it would be good to implement this. |
1297 | 5c0c1eeb | Iustin Pop | |
1298 | 5c0c1eeb | Iustin Pop | |
1299 | 5c0c1eeb | Iustin Pop | LVM self-repair (optional) |
1300 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++ |
1301 | 5c0c1eeb | Iustin Pop | |
1302 | 5c0c1eeb | Iustin Pop | The complete failure of a physical disk is very tedious to |
1303 | 5c0c1eeb | Iustin Pop | troubleshoot, mainly because of the many failure modes and the many |
1304 | 5c0c1eeb | Iustin Pop | steps needed. We can safely automate some of the steps, more |
1305 | 5c0c1eeb | Iustin Pop | specifically the ``vgreduce --removemissing`` using the following |
1306 | 5c0c1eeb | Iustin Pop | method: |
1307 | 5c0c1eeb | Iustin Pop | |
1308 | 5c0c1eeb | Iustin Pop | #. check if all nodes have consistent volume groups |
1309 | 5c0c1eeb | Iustin Pop | #. if yes, and previous status was yes, do nothing |
1310 | 5c0c1eeb | Iustin Pop | #. if yes, and previous status was no, save status and restart |
1311 | 5c0c1eeb | Iustin Pop | #. if no, and previous status was no, do nothing |
1312 | 5c0c1eeb | Iustin Pop | #. if no, and previous status was yes: |
1313 | 5c0c1eeb | Iustin Pop | #. if more than one node is inconsistent, do nothing |
1314 | 6c2d0b44 | Iustin Pop | #. if only one node is inconsistent: |
1315 | 5c0c1eeb | Iustin Pop | #. run ``vgreduce --removemissing`` |
1316 | 6c2d0b44 | Iustin Pop | #. log this occurrence in the Ganeti log in a form that |
1317 | 5c0c1eeb | Iustin Pop | can be used for monitoring |
1318 | 5c0c1eeb | Iustin Pop | #. [FUTURE] run ``replace-disks`` for all |
1319 | 5c0c1eeb | Iustin Pop | instances affected |
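The decision steps above can be sketched as a pure function that only chooses an action; running ``vgreduce --removemissing`` and logging are left to the caller. The function name and the action strings are hypothetical.

```python
def self_repair_action(inconsistent_nodes, previous_ok):
    """Decide what the self-repair check should do.

    inconsistent_nodes: names of nodes whose volume groups are
    currently inconsistent; previous_ok: the status saved by the
    previous run (True if all volume groups were consistent).
    """
    now_ok = not inconsistent_nodes
    if now_ok:
        # Consistent now: only record the recovery if the state changed.
        return "nothing" if previous_ok else "save-status-and-restart"
    if not previous_ok:
        # Already known to be inconsistent: do not repeat the repair.
        return "nothing"
    if len(inconsistent_nodes) > 1:
        # Several nodes failing at once is too risky to automate.
        return "nothing"
    return "vgreduce-removemissing"
```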
1320 | 5c0c1eeb | Iustin Pop | |
1321 | 5c0c1eeb | Iustin Pop | Failover to any node |
1322 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++ |
1323 | 5c0c1eeb | Iustin Pop | |
1324 | 5c0c1eeb | Iustin Pop | With a modified disk activation sequence, we can implement the |
1325 | 5c0c1eeb | Iustin Pop | *failover to any* functionality, removing many of the layout |
1326 | 5c0c1eeb | Iustin Pop | restrictions of a cluster: |
1327 | 5c0c1eeb | Iustin Pop | |
1328 | 5c0c1eeb | Iustin Pop | - the need to reserve memory on the current secondary: this gets reduced to |
1329 | 5c0c1eeb | Iustin Pop | a need to reserve memory anywhere on the cluster |
1330 | 5c0c1eeb | Iustin Pop | |
1331 | 5c0c1eeb | Iustin Pop | - the need to first failover and then replace secondary for an |
1332 | 5c0c1eeb | Iustin Pop | instance: with failover-to-any, we can directly failover to |
1333 | 5c0c1eeb | Iustin Pop | another node, which also does the replace disks at the same |
1334 | 5c0c1eeb | Iustin Pop | step |
1335 | 5c0c1eeb | Iustin Pop | |
1336 | 5c0c1eeb | Iustin Pop | In the following, we denote the current primary by P1, the current |
1337 | 5c0c1eeb | Iustin Pop | secondary by S1, and the new primary and secondaries by P2 and S2. P2 |
1338 | 5c0c1eeb | Iustin Pop | is fixed to the node the user chooses, but the choice of S2 can be |
1339 | 5c0c1eeb | Iustin Pop | made between P1 and S1. This choice can be constrained, depending on |
1340 | 5c0c1eeb | Iustin Pop | which of P1 and S1 has failed. |
1341 | 5c0c1eeb | Iustin Pop | |
1342 | 5c0c1eeb | Iustin Pop | - if P1 has failed, then S1 must become S2, and live migration is not possible |
1343 | 5c0c1eeb | Iustin Pop | - if S1 has failed, then P1 must become S2, and live migration could be |
1344 | 5c0c1eeb | Iustin Pop | possible (in theory, but this is not a design goal for 2.0) |
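The constraints on the choice of S2 can be written down as a small helper. The function name, its arguments and the requirement of an explicit preference when both nodes are healthy are assumptions for illustration.

```python
def choose_new_secondary(p1, s1, failed, preferred=None):
    """Pick S2 from the current primary (p1) and secondary (s1).

    failed is the set of nodes known to have failed; preferred is the
    user's choice, used only when both p1 and s1 are healthy.
    """
    if p1 in failed and s1 in failed:
        raise ValueError("no node with valid data is left")
    if p1 in failed:
        return s1  # forced choice; live migration is not possible
    if s1 in failed:
        return p1  # forced choice
    if preferred not in (p1, s1):
        raise ValueError("S2 must be either the old primary or secondary")
    return preferred
```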
1345 | 5c0c1eeb | Iustin Pop | |
1346 | 5c0c1eeb | Iustin Pop | The algorithm for performing the failover is straightforward: |
1347 | 5c0c1eeb | Iustin Pop | |
1348 | 5c0c1eeb | Iustin Pop | - verify that S2 (the node the user has chosen to keep as secondary) has |
1349 | 5c0c1eeb | Iustin Pop | valid data (is consistent) |
1350 | 5c0c1eeb | Iustin Pop | |
1351 | 6c2d0b44 | Iustin Pop | - tear down the current DRBD association and setup a DRBD pairing between |
1352 | 5c0c1eeb | Iustin Pop | P2 (P2 is indicated by the user) and S2; since P2 has no data, it will |
1353 | 6c2d0b44 | Iustin Pop | start re-syncing from S2 |
1354 | 5c0c1eeb | Iustin Pop | |
1355 | 5c0c1eeb | Iustin Pop | - as soon as P2 is in state SyncTarget (i.e. after the resync has started |
1356 | 5c0c1eeb | Iustin Pop | but before it has finished), we can promote it to primary role (r/w) |
1357 | 5c0c1eeb | Iustin Pop | and start the instance on P2 |
1358 | 5c0c1eeb | Iustin Pop | |
1359 | 5c0c1eeb | Iustin Pop | - as soon as the P2⇐S2 sync has finished, we can remove |
1360 | 5c0c1eeb | Iustin Pop | the old data on the old node that has not been chosen for |
1361 | 5c0c1eeb | Iustin Pop | S2 |
1362 | 5c0c1eeb | Iustin Pop | |
1363 | 5c0c1eeb | Iustin Pop | Caveats: during the P2⇐S2 sync, a (non-transient) network error |
1364 | 5c0c1eeb | Iustin Pop | will cause I/O errors on the instance, so (if a longer instance |
1365 | 5c0c1eeb | Iustin Pop | downtime is acceptable) we can postpone the restart of the instance |
1366 | 5c0c1eeb | Iustin Pop | until the resync is done. However, disk I/O errors on S2 will cause |
1367 | 6c2d0b44 | Iustin Pop | data loss, since we don't have a good copy of the data anymore, so in |
1368 | 5c0c1eeb | Iustin Pop | this case waiting for the sync to complete is not an option. As such, |
1369 | 5c0c1eeb | Iustin Pop | it is recommended that this feature is used only in conjunction with |
1370 | 5c0c1eeb | Iustin Pop | proper disk monitoring. |
1371 | 5c0c1eeb | Iustin Pop | |
1372 | 5c0c1eeb | Iustin Pop | |
1373 | 5c0c1eeb | Iustin Pop | Live migration note: While failover-to-any is possible for all choices |
1374 | 5c0c1eeb | Iustin Pop | of S2, migration-to-any is possible only if we keep P1 as S2. |
1375 | 5c0c1eeb | Iustin Pop | |
1376 | 5c0c1eeb | Iustin Pop | Caveats |
1377 | 5c0c1eeb | Iustin Pop | +++++++ |
1378 | 5c0c1eeb | Iustin Pop | |
1379 | 5c0c1eeb | Iustin Pop | The dynamic device model, while more complex, has an advantage: it |
1380 | 6c2d0b44 | Iustin Pop | will not reuse by mistake the DRBD device of another instance, since |
1381 | 6c2d0b44 | Iustin Pop | it always looks for either our own or a free one. |
1382 | 5c0c1eeb | Iustin Pop | |
1383 | 5c0c1eeb | Iustin Pop | The static one, in contrast, will assume that given a minor number N, |
1384 | 5c0c1eeb | Iustin Pop | it's ours and we can take over. This needs careful implementation such |
1385 | 5c0c1eeb | Iustin Pop | that if the minor is in use, either we are able to cleanly shut it |
1386 | 5c0c1eeb | Iustin Pop | down, or we abort the startup. Otherwise, it could be that we start |
1387 | 6c2d0b44 | Iustin Pop | syncing between two instances' disks, causing data loss. |
1388 | 5c0c1eeb | Iustin Pop | |
1389 | 5c0c1eeb | Iustin Pop | |
1390 | 5c0c1eeb | Iustin Pop | Variable number of disk/NICs per instance |
1391 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
1392 | 5c0c1eeb | Iustin Pop | |
1393 | 5c0c1eeb | Iustin Pop | Variable number of disks |
1394 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++ |
1395 | 5c0c1eeb | Iustin Pop | |
1396 | 5c0c1eeb | Iustin Pop | In order to support high-security scenarios (for example read-only sda |
1397 | 5c0c1eeb | Iustin Pop | and read-write sdb), we need to make a fully flexible disk |
1398 | 5c0c1eeb | Iustin Pop | definition. This has less impact than it might seem at first sight: |
1399 | 6c2d0b44 | Iustin Pop | only the instance creation has a hard-coded number of disks, not the disk |
1400 | 5c0c1eeb | Iustin Pop | handling code. The block device handling and most of the instance |
1401 | 5c0c1eeb | Iustin Pop | handling code is already working with "the instance's disks" as |
1402 | 5c0c1eeb | Iustin Pop | opposed to "the two disks of the instance", but some pieces are not |
1403 | 5c0c1eeb | Iustin Pop | (e.g. import/export) and the code needs a review to ensure safety. |
1404 | 5c0c1eeb | Iustin Pop | |
1405 | 5c0c1eeb | Iustin Pop | The objective is to be able to specify the number of disks at |
1406 | 5c0c1eeb | Iustin Pop | instance creation, and to be able to toggle a disk between read-only |
1407 | 6c2d0b44 | Iustin Pop | and read-write afterward. |
1408 | 5c0c1eeb | Iustin Pop | |
1409 | 5c0c1eeb | Iustin Pop | Variable number of NICs |
1410 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
1411 | 5c0c1eeb | Iustin Pop | |
1412 | 5c0c1eeb | Iustin Pop | Similar to the disk change, we need to allow multiple network |
1413 | 5c0c1eeb | Iustin Pop | interfaces per instance. This will affect the internal code (some |
1414 | 5c0c1eeb | Iustin Pop | functions will have to stop assuming that ``instance.nics`` is a list |
1415 | 6c2d0b44 | Iustin Pop | of length one), the OS API which currently can export/import only one |
1416 | 5c0c1eeb | Iustin Pop | instance, and the command line interface. |
1417 | 5c0c1eeb | Iustin Pop | |
1418 | 5c0c1eeb | Iustin Pop | Interface changes |
1419 | 5c0c1eeb | Iustin Pop | ----------------- |
1420 | 5c0c1eeb | Iustin Pop | |
1421 | 5c0c1eeb | Iustin Pop | There are two areas of interface changes: API-level changes (the OS |
1422 | 5c0c1eeb | Iustin Pop | interface and the RAPI interface) and the command line interface |
1423 | 5c0c1eeb | Iustin Pop | changes. |
1424 | 5c0c1eeb | Iustin Pop | |
1425 | 5c0c1eeb | Iustin Pop | OS interface |
1426 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~ |
1427 | 5c0c1eeb | Iustin Pop | |
1428 | 5c0c1eeb | Iustin Pop | The current Ganeti OS interface, version 5, is tailored for Ganeti 1.2. The |
1429 | 5c0c1eeb | Iustin Pop | interface is composed of a series of scripts which get called with certain |
1430 | 5c0c1eeb | Iustin Pop | parameters to perform OS-dependent operations on the cluster. The current |
1431 | 5c0c1eeb | Iustin Pop | scripts are: |
1432 | 5c0c1eeb | Iustin Pop | |
1433 | 5c0c1eeb | Iustin Pop | create |
1434 | 5c0c1eeb | Iustin Pop | called when a new instance is added to the cluster |
1435 | 5c0c1eeb | Iustin Pop | export |
1436 | 5c0c1eeb | Iustin Pop | called to export an instance disk to a stream |
1437 | 5c0c1eeb | Iustin Pop | import |
1438 | 5c0c1eeb | Iustin Pop | called to import from a stream to a new instance |
1439 | 5c0c1eeb | Iustin Pop | rename |
1440 | 5c0c1eeb | Iustin Pop | called to perform the os-specific operations necessary for renaming an |
1441 | 5c0c1eeb | Iustin Pop | instance |
1442 | 5c0c1eeb | Iustin Pop | |
1443 | 5c0c1eeb | Iustin Pop | Currently these scripts suffer from the limitations of Ganeti 1.2: for example |
1444 | 5c0c1eeb | Iustin Pop | they accept exactly one block and one swap device to operate on, rather than |
1445 | 5c0c1eeb | Iustin Pop | any number of generic block devices; they blindly assume that an instance will |
1446 | 5c0c1eeb | Iustin Pop | have just one network interface to operate on; and they cannot be configured to |
1447 | 5c0c1eeb | Iustin Pop | optimise the instance for a particular hypervisor. |
1448 | 5c0c1eeb | Iustin Pop | |
1449 | 5c0c1eeb | Iustin Pop | Since in Ganeti 2.0 we want to support multiple hypervisors, and a non-fixed |
1450 | 5c0c1eeb | Iustin Pop | number of network interfaces and disks, the OS interface needs to change to transmit the |
1451 | 5c0c1eeb | Iustin Pop | appropriate amount of information about an instance to its managing operating |
1452 | 5c0c1eeb | Iustin Pop | system, when operating on it. Moreover, since some old assumptions usually used |
1453 | 5c0c1eeb | Iustin Pop | in OS scripts are no longer valid, we need to re-establish a common understanding of |
1454 | 5c0c1eeb | Iustin Pop | what can and cannot be assumed about the Ganeti environment. |
1455 | 5c0c1eeb | Iustin Pop | |
1456 | 5c0c1eeb | Iustin Pop | |
1457 | 5c0c1eeb | Iustin Pop | When designing the new OS API our priorities are: |
1458 | 5c0c1eeb | Iustin Pop | - ease of use |
1459 | 5c0c1eeb | Iustin Pop | - future extensibility |
1460 | 6c2d0b44 | Iustin Pop | - ease of porting from the old API |
1461 | 5c0c1eeb | Iustin Pop | - modularity |
1462 | 5c0c1eeb | Iustin Pop | |
1463 | 5c0c1eeb | Iustin Pop | As such we want to limit the number of scripts that must be written to support |
1464 | 5c0c1eeb | Iustin Pop | an OS, and make it easy to share code between them by making their input uniform. |
1465 | 5c0c1eeb | Iustin Pop | We will also leave the current script structure unchanged, as far as we can, |
1466 | 5c0c1eeb | Iustin Pop | and make a few of the scripts (import, export and rename) optional. Most |
1467 | 5c0c1eeb | Iustin Pop | information will be passed to the script through environment variables, for |
1468 | 5c0c1eeb | Iustin Pop | ease of access and at the same time ease of using only the information a script |
1469 | 5c0c1eeb | Iustin Pop | needs. |
1470 | 5c0c1eeb | Iustin Pop | |
1471 | 5c0c1eeb | Iustin Pop | |
1472 | 5c0c1eeb | Iustin Pop | The Scripts |
1473 | 5c0c1eeb | Iustin Pop | +++++++++++ |
1474 | 5c0c1eeb | Iustin Pop | |
1475 | 5c0c1eeb | Iustin Pop | As in Ganeti 1.2, every OS which wants to be installed in Ganeti needs to |
1476 | 5c0c1eeb | Iustin Pop | support the following functionality, through scripts: |
1477 | 5c0c1eeb | Iustin Pop | |
1478 | 5c0c1eeb | Iustin Pop | create: |
1479 | 5c0c1eeb | Iustin Pop | used to create a new instance running that OS. This script should prepare the |
1480 | 5c0c1eeb | Iustin Pop | block devices, and install them so that the new OS can boot under the |
1481 | 5c0c1eeb | Iustin Pop | specified hypervisor. |
1482 | 5c0c1eeb | Iustin Pop | export (optional): |
1483 | 5c0c1eeb | Iustin Pop | used to export an installed instance using the given OS to a format which can |
1484 | 5c0c1eeb | Iustin Pop | be used to import it back into a new instance. |
1485 | 5c0c1eeb | Iustin Pop | import (optional): |
1486 | 5c0c1eeb | Iustin Pop | used to import an exported instance into a new one. This script is similar to |
1487 | 5c0c1eeb | Iustin Pop | create, but the new instance should have the content of the export, rather |
1488 | 5c0c1eeb | Iustin Pop | than contain a pristine installation. |
1489 | 5c0c1eeb | Iustin Pop | rename (optional): |
1490 | 5c0c1eeb | Iustin Pop | used to perform the internal OS-specific operations needed to rename an |
1491 | 5c0c1eeb | Iustin Pop | instance. |
1492 | 5c0c1eeb | Iustin Pop | |
1493 | 5c0c1eeb | Iustin Pop | If any optional script is not implemented Ganeti will refuse to perform the |
1494 | 5c0c1eeb | Iustin Pop | given operation on instances using the non-implementing OS. Of course the |
1495 | 5c0c1eeb | Iustin Pop | create script is mandatory, and it doesn't make sense to support either the |
1496 | 5c0c1eeb | Iustin Pop | export or the import operation but not both. |
1497 | 5c0c1eeb | Iustin Pop | |
1498 | 5c0c1eeb | Iustin Pop | Incompatibilities with 1.2 |
1499 | 5c0c1eeb | Iustin Pop | __________________________ |
1500 | 5c0c1eeb | Iustin Pop | |
1501 | 5c0c1eeb | Iustin Pop | We expect the following incompatibilities between the OS scripts for 1.2 and |
1502 | 5c0c1eeb | Iustin Pop | the ones for 2.0: |
1503 | 5c0c1eeb | Iustin Pop | |
1504 | 5c0c1eeb | Iustin Pop | - Input parameters: in 1.2 those were passed on the command line, in 2.0 we'll |
1505 | 5c0c1eeb | Iustin Pop | use environment variables, as there will be a lot more information and not |
1506 | 5c0c1eeb | Iustin Pop | all OSes may care about all of it. |
1507 | 5c0c1eeb | Iustin Pop | - Number of calls: export scripts will be called once for each device the |
1508 | 5c0c1eeb | Iustin Pop | instance has, and import scripts once for every exported disk. Imported |
1509 | 5c0c1eeb | Iustin Pop | instances will be forced to have a number of disks greater than or equal to the |
1510 | 5c0c1eeb | Iustin Pop | number of disks in the export. |
1511 | 5c0c1eeb | Iustin Pop | - Some scripts are not compulsory: if such a script is missing the relevant |
1512 | 6c2d0b44 | Iustin Pop | operations will be forbidden for instances of that OS. This makes it easier |
1513 | 5c0c1eeb | Iustin Pop | to distinguish between unsupported operations and no-op ones (if any). |
1514 | 5c0c1eeb | Iustin Pop | |
1515 | 5c0c1eeb | Iustin Pop | |
1516 | 5c0c1eeb | Iustin Pop | Input |
1517 | 5c0c1eeb | Iustin Pop | _____ |
1518 | 5c0c1eeb | Iustin Pop | |
1519 | 5c0c1eeb | Iustin Pop | Rather than using command line flags, as they do now, scripts will accept |
1520 | 5c0c1eeb | Iustin Pop | inputs from environment variables. We expect the following input values: |
1521 | 5c0c1eeb | Iustin Pop | |
1522 | 5c0c1eeb | Iustin Pop | OS_API_VERSION |
1523 | 6c2d0b44 | Iustin Pop | The version of the OS API that the following parameters comply with; |
1524 | 5c0c1eeb | Iustin Pop | this is used so that in the future we could have OSes supporting |
1525 | 5c0c1eeb | Iustin Pop | multiple versions, with Ganeti sending the proper version in this |
1526 | 5c0c1eeb | Iustin Pop | parameter |
1527 | 5c0c1eeb | Iustin Pop | INSTANCE_NAME |
1528 | 5c0c1eeb | Iustin Pop | Name of the instance acted on |
1529 | 5c0c1eeb | Iustin Pop | HYPERVISOR |
1530 | 6c2d0b44 | Iustin Pop | The hypervisor the instance should run on (e.g. 'xen-pvm', 'xen-hvm', 'kvm') |
1531 | 5c0c1eeb | Iustin Pop | DISK_COUNT |
1532 | 5c0c1eeb | Iustin Pop | The number of disks this instance will have |
1533 | 5c0c1eeb | Iustin Pop | NIC_COUNT |
1534 | 6c2d0b44 | Iustin Pop | The number of NICs this instance will have |
1535 | 5c0c1eeb | Iustin Pop | DISK_<N>_PATH |
1536 | 5c0c1eeb | Iustin Pop | Path to the Nth disk. |
1537 | 5c0c1eeb | Iustin Pop | DISK_<N>_ACCESS |
1538 | 5c0c1eeb | Iustin Pop | W if read/write, R if read only. OS scripts are not supposed to touch |
1539 | 5c0c1eeb | Iustin Pop | read-only disks, but these are still passed so that the scripts know about them. |
1540 | 5c0c1eeb | Iustin Pop | DISK_<N>_FRONTEND_TYPE |
1541 | 5c0c1eeb | Iustin Pop | Type of the disk as seen by the instance. Can be 'scsi', 'ide', 'virtio' |
1542 | 5c0c1eeb | Iustin Pop | DISK_<N>_BACKEND_TYPE |
1543 | 5c0c1eeb | Iustin Pop | Type of the disk as seen from the node. Can be 'block', 'file:loop' or |
1544 | 5c0c1eeb | Iustin Pop | 'file:blktap' |
1545 | 5c0c1eeb | Iustin Pop | NIC_<N>_MAC |
1546 | 5c0c1eeb | Iustin Pop | MAC address for the Nth network interface |
1547 | 5c0c1eeb | Iustin Pop | NIC_<N>_IP |
1548 | 5c0c1eeb | Iustin Pop | IP address for the Nth network interface, if available |
1549 | 5c0c1eeb | Iustin Pop | NIC_<N>_BRIDGE |
1550 | 5c0c1eeb | Iustin Pop | Node bridge the Nth network interface will be connected to |
1551 | 5c0c1eeb | Iustin Pop | NIC_<N>_FRONTEND_TYPE |
1552 | 6c2d0b44 | Iustin Pop | Type of the Nth NIC as seen by the instance. For example 'virtio', |
1553 | 6c2d0b44 | Iustin Pop | 'rtl8139', etc. |
1554 | 5c0c1eeb | Iustin Pop | DEBUG_LEVEL |
1555 | 5c0c1eeb | Iustin Pop | Whether more output should be produced, for debugging purposes. Currently the |
1556 | 5c0c1eeb | Iustin Pop | only valid values are 0 and 1. |
1557 | 5c0c1eeb | Iustin Pop | |
1558 | 6c2d0b44 | Iustin Pop | These are only the basic variables we are thinking of now, but more |
1559 | 6c2d0b44 | Iustin Pop | may come during the implementation and they will be documented in the |
1560 | 6c2d0b44 | Iustin Pop | ``ganeti-os-api`` man page. All these variables will be available to |
1561 | 6c2d0b44 | Iustin Pop | all scripts. |
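
As an illustration of how a script could consume these variables, the following minimal sketch gathers them into one dictionary. The helper names are hypothetical, and the sketch assumes that ``<N>`` is numbered from 0 (as the command line device numbers are); the text above does not state the starting index.

```python
import os


def _device_vars(env, kind, index):
    """Collect the DISK_<N>_* or NIC_<N>_* variables of one device.

    Keys are returned without the prefix and in lower case, e.g. "path".
    """
    prefix = "%s_%d_" % (kind, index)
    return {key[len(prefix):].lower(): value
            for key, value in env.items() if key.startswith(prefix)}


def read_instance_env(env=None):
    """Build a description of the instance from the OS API environment."""
    env = os.environ if env is None else env
    disk_count = int(env["DISK_COUNT"])
    nic_count = int(env["NIC_COUNT"])
    return {
        "name": env["INSTANCE_NAME"],
        "hypervisor": env["HYPERVISOR"],
        # Assumption: devices are numbered 0 .. COUNT-1.
        "disks": [_device_vars(env, "DISK", i) for i in range(disk_count)],
        "nics": [_device_vars(env, "NIC", i) for i in range(nic_count)],
    }
```

A script that only cares about disks can ignore the ``nics`` entry, which matches the goal of letting each script use only the information it needs.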
1562 | 5c0c1eeb | Iustin Pop | |
1563 | 5c0c1eeb | Iustin Pop | Some scripts will need some more information to work. These will have |
1564 | 5c0c1eeb | Iustin Pop | per-script variables, for example: |
1565 | 5c0c1eeb | Iustin Pop | |
1566 | 5c0c1eeb | Iustin Pop | OLD_INSTANCE_NAME |
1567 | 5c0c1eeb | Iustin Pop | rename: the name the instance should be renamed from. |
1568 | 5c0c1eeb | Iustin Pop | EXPORT_DEVICE |
1569 | 5c0c1eeb | Iustin Pop | export: device to be exported, a snapshot of the actual device. The data must be exported to stdout. |
1570 | 5c0c1eeb | Iustin Pop | EXPORT_INDEX |
1571 | 5c0c1eeb | Iustin Pop | export: sequential number of the instance device targeted. |
1572 | 5c0c1eeb | Iustin Pop | IMPORT_DEVICE |
1573 | 5c0c1eeb | Iustin Pop | import: device to send the data to, part of the new instance. The data must be imported from stdin. |
1574 | 5c0c1eeb | Iustin Pop | IMPORT_INDEX |
1575 | 5c0c1eeb | Iustin Pop | import: sequential number of the instance device targeted. |
1576 | 5c0c1eeb | Iustin Pop | |
1577 | 5c0c1eeb | Iustin Pop | (Rationale for INSTANCE_NAME as an environment variable: the instance name is |
1578 | 5c0c1eeb | Iustin Pop | always needed and we could pass it on the command line. On the other hand, |
1579 | 5c0c1eeb | Iustin Pop | though, this would force scripts to both access the environment and parse the |
1580 | 5c0c1eeb | Iustin Pop | command line, so we'll move it for uniformity.) |
1581 | 5c0c1eeb | Iustin Pop | |
1582 | 5c0c1eeb | Iustin Pop | |
1583 | 5c0c1eeb | Iustin Pop | Output/Behaviour |
1584 | 5c0c1eeb | Iustin Pop | ________________ |
1585 | 5c0c1eeb | Iustin Pop | |
1586 | 5c0c1eeb | Iustin Pop | As discussed scripts should only send user-targeted information to stderr. The |
1587 | 5c0c1eeb | Iustin Pop | create and import scripts are supposed to format/initialise the given block |
1588 | 5c0c1eeb | Iustin Pop | devices and install the correct instance data. The export script is supposed to |
1589 | 5c0c1eeb | Iustin Pop | export instance data to stdout in a format understandable by the import |
1590 | 6c2d0b44 | Iustin Pop | script. The data will be compressed by Ganeti, so no compression should be |
1591 | 5c0c1eeb | Iustin Pop | done. The rename script should only modify the instance's knowledge of what |
1592 | 5c0c1eeb | Iustin Pop | its name is. |
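
A minimal sketch of an export script body that follows these rules is shown below: the disk data goes uncompressed to the output stream and the only user-targeted message goes to the log stream. The function name and its parameters are hypothetical; only the environment variable names come from this document.

```python
import shutil


def export_disk(env, out, log):
    """Copy the snapshot named by EXPORT_DEVICE to `out`, uncompressed.

    `out` is a binary stream (stdout in a real script) and `log` is a text
    stream (stderr in a real script).
    """
    log.write("exporting disk %s of instance %s\n"
              % (env["EXPORT_INDEX"], env["INSTANCE_NAME"]))
    with open(env["EXPORT_DEVICE"], "rb") as device:
        # No compression here: Ganeti compresses the stream itself.
        shutil.copyfileobj(device, out)
```

A real script would presumably call it as ``export_disk(os.environ, sys.stdout.buffer, sys.stderr)``.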
1593 | 5c0c1eeb | Iustin Pop | |
1594 | 5c0c1eeb | Iustin Pop | Other declarative style features |
1595 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++++++++++++ |
1596 | 5c0c1eeb | Iustin Pop | |
1597 | 5c0c1eeb | Iustin Pop | Similar to Ganeti 1.2, OS specifications will need to provide a |
1598 | 6c2d0b44 | Iustin Pop | 'ganeti_api_version' file containing a list of numbers matching the |
1599 | 6c2d0b44 | Iustin Pop | version(s) of the API they implement. Ganeti itself will always be |
1600 | 6c2d0b44 | Iustin Pop | compatible with one version of the API and may maintain backwards |
1601 | 6c2d0b44 | Iustin Pop | compatibility if it's feasible to do so. The numbers are one-per-line, |
1602 | 6c2d0b44 | Iustin Pop | so an OS supporting both version 5 and version 20 will have a file |
1603 | 6c2d0b44 | Iustin Pop | containing two lines. This is different from Ganeti 1.2, which only |
1604 | 6c2d0b44 | Iustin Pop | supported one version number. |
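
The one-number-per-line format is simple to check. The sketch below, with hypothetical function names, parses the contents of such a file and tests whether a given API version is declared:

```python
def parse_api_versions(text):
    """Parse the contents of a 'ganeti_api_version' file, one number per line."""
    return [int(line) for line in text.splitlines() if line.strip()]


def os_supports_version(text, api_version):
    """Return True if the OS declares support for `api_version`."""
    return api_version in parse_api_versions(text)
```

An OS supporting both version 5 and version 20 would thus ship a file whose contents parse to ``[5, 20]``.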
1605 | 5c0c1eeb | Iustin Pop | |
1606 | 5c0c1eeb | Iustin Pop | In addition to that an OS will be able to declare that it supports only a |
1607 | 6c2d0b44 | Iustin Pop | subset of the Ganeti hypervisors, by declaring them in the 'hypervisors' file. |
1608 | 5c0c1eeb | Iustin Pop | |
1609 | 5c0c1eeb | Iustin Pop | |
1610 | 5c0c1eeb | Iustin Pop | Caveats/Notes |
1611 | 5c0c1eeb | Iustin Pop | +++++++++++++ |
1612 | 5c0c1eeb | Iustin Pop | |
1613 | 5c0c1eeb | Iustin Pop | We might want to have a "default" import/export behaviour that just dumps all |
1614 | 5c0c1eeb | Iustin Pop | disks and restores them. This can save work as most systems will just do this, |
1615 | 5c0c1eeb | Iustin Pop | while allowing flexibility for different systems. |
1616 | 5c0c1eeb | Iustin Pop | |
1617 | 5c0c1eeb | Iustin Pop | Environment variables are limited in size, but we expect that there will be |
1618 | 5c0c1eeb | Iustin Pop | enough space to store the information we need. If we discover that this is not |
1619 | 5c0c1eeb | Iustin Pop | the case we may want to go to a more complex API such as storing this |
1620 | 5c0c1eeb | Iustin Pop | information on the filesystem and providing the OS script with the path to a |
1621 | 5c0c1eeb | Iustin Pop | file where it is encoded in some format. |
1622 | 5c0c1eeb | Iustin Pop | |
1623 | 5c0c1eeb | Iustin Pop | |
1624 | 5c0c1eeb | Iustin Pop | |
1625 | 5c0c1eeb | Iustin Pop | Remote API changes |
1626 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~ |
1627 | 5c0c1eeb | Iustin Pop | |
1628 | 6c2d0b44 | Iustin Pop | The first Ganeti remote API (RAPI) was designed and deployed with the |
1629 | 6c2d0b44 | Iustin Pop | Ganeti 1.2.5 release. That version provides read-only access to the |
1630 | 6c2d0b44 | Iustin Pop | cluster state. A fully functional read-write API demands significant |
1631 | 6c2d0b44 | Iustin Pop | internal changes which will be implemented in version 2.0. |
1632 | 5c0c1eeb | Iustin Pop | |
1633 | 6c2d0b44 | Iustin Pop | We decided to go with implementing the Ganeti RAPI in a RESTful way, |
1634 | 6c2d0b44 | Iustin Pop | which is aligned with the key features we are looking for: it is a simple, |
1635 | 6c2d0b44 | Iustin Pop | stateless, scalable and extensible paradigm of API implementation. As |
1636 | 6c2d0b44 | Iustin Pop | transport it uses HTTP over SSL, and we are implementing it with JSON |
1637 | 6c2d0b44 | Iustin Pop | encoding, but in a way that makes it possible to extend it and provide any other |
1638 | 6c2d0b44 | Iustin Pop | one. |
1639 | 5c0c1eeb | Iustin Pop | |
1640 | 5c0c1eeb | Iustin Pop | Design |
1641 | 5c0c1eeb | Iustin Pop | ++++++ |
1642 | 5c0c1eeb | Iustin Pop | |
1643 | 6c2d0b44 | Iustin Pop | The Ganeti RAPI is implemented as an independent daemon, running on the |
1644 | 6c2d0b44 | Iustin Pop | same node and with the same permission level as the Ganeti master |
1645 | 6c2d0b44 | Iustin Pop | daemon. Communication with the master daemon is done through the LUXI |
1646 | 6c2d0b44 | Iustin Pop | library. In order to keep communication asynchronous, RAPI processes two |
1647 | 6c2d0b44 | Iustin Pop | types of client requests: |
1648 | 5c0c1eeb | Iustin Pop | |
1649 | 6c2d0b44 | Iustin Pop | - queries: server is able to answer immediately |
1650 | 6c2d0b44 | Iustin Pop | - job submission: some time is required for a useful response |
1651 | 5c0c1eeb | Iustin Pop | |
1652 | 6c2d0b44 | Iustin Pop | In the query case, the requested data is sent back to the client in the HTTP |
1653 | 6c2d0b44 | Iustin Pop | response body. Typical examples of queries would be: list of nodes, |
1654 | 6c2d0b44 | Iustin Pop | instances, cluster info, etc. |
1655 | 5c0c1eeb | Iustin Pop | |
1656 | 6c2d0b44 | Iustin Pop | In the case of job submission, the client receives a job ID, an |
1657 | 6c2d0b44 | Iustin Pop | identifier which allows it to query the job progress in the job queue |
1658 | 6c2d0b44 | Iustin Pop | (see `Job Queue`_). |
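
From the client's point of view, job submission therefore turns into a polling loop on the job ID. The sketch below shows the shape of such a loop only: ``get_status`` stands for whatever call the client uses to query the job, and the status strings are invented for the example, since this document does not define them.

```python
import time

# Hypothetical terminal states; the real job states are defined by the job queue.
FINAL_STATES = ("success", "error")


def wait_for_job(get_status, job_id, interval=1.0, max_polls=60,
                 sleep=time.sleep):
    """Poll `get_status(job_id)` until the job reaches a final state."""
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in FINAL_STATES:
            return status
        sleep(interval)
    raise TimeoutError("job %s did not finish after %d polls"
                       % (job_id, max_polls))
```

Passing ``sleep`` as a parameter keeps the loop testable without real delays.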
1659 | 6c2d0b44 | Iustin Pop | |
1660 | 6c2d0b44 | Iustin Pop | Internally, each exported object has a version identifier, which is |
1661 | 6c2d0b44 | Iustin Pop | used as a state identifier in the HTTP header ETag field for |
1662 | 6c2d0b44 | Iustin Pop | requests/responses to avoid race conditions. |
1663 | 5c0c1eeb | Iustin Pop | |
1664 | 5c0c1eeb | Iustin Pop | |
1665 | 5c0c1eeb | Iustin Pop | Resource representation |
1666 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++ |
1667 | 5c0c1eeb | Iustin Pop | |
1668 | 6c2d0b44 | Iustin Pop | The key difference between REST and other APIs is that REST |
1669 | 6c2d0b44 | Iustin Pop | requires separation of services via resources with unique URIs. Each |
1670 | 6c2d0b44 | Iustin Pop | of them should have a limited amount of state and support the standard HTTP |
1671 | 5c0c1eeb | Iustin Pop | methods: GET, POST, DELETE, PUT. |
1672 | 5c0c1eeb | Iustin Pop | |
1673 | 6c2d0b44 | Iustin Pop | For example in Ganeti's case we can have a set of URIs: |
1674 | 6c2d0b44 | Iustin Pop | |
1675 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/instances`` |
1676 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/instances/{instancename}`` |
1677 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/instances/{instancename}/tag`` |
1678 | 6c2d0b44 | Iustin Pop | - ``/{clustername}/tag`` |
1679 | 5c0c1eeb | Iustin Pop | |
1680 | 6c2d0b44 | Iustin Pop | A GET request to ``/{clustername}/instances`` will return the list of |
1681 | 6c2d0b44 | Iustin Pop | instances, a POST to ``/{clustername}/instances`` should create a new |
1682 | 6c2d0b44 | Iustin Pop | instance, a DELETE to ``/{clustername}/instances/{instancename}`` should |
1683 | 6c2d0b44 | Iustin Pop | delete the instance, and a GET to ``/{clustername}/tag`` should return the |
1684 | 6c2d0b44 | Iustin Pop | cluster tags. |
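
The mapping from method and URI to action can be pictured as a small routing table. The following sketch is purely illustrative: the table, the action labels and the ``resolve`` function are assumptions, and the version prefix is left out because its form is not fixed here.

```python
import re

# (HTTP method, URI pattern, action label); a hypothetical table built from
# the example URIs above.
ROUTES = [
    ("GET", r"/(?P<cluster>[^/]+)/instances", "list instances"),
    ("POST", r"/(?P<cluster>[^/]+)/instances", "create instance"),
    ("DELETE", r"/(?P<cluster>[^/]+)/instances/(?P<instance>[^/]+)",
     "delete instance"),
    ("GET", r"/(?P<cluster>[^/]+)/tag", "get cluster tags"),
]


def resolve(method, path):
    """Return (action, URI parameters) for a request, or None if unknown."""
    for route_method, pattern, action in ROUTES:
        match = re.fullmatch(pattern, path)
        if route_method == method and match:
            return action, match.groupdict()
    return None
```

Because each resource has its own URI, the same path can map to different actions depending only on the HTTP method.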
1685 | 5c0c1eeb | Iustin Pop | |
1686 | 6c2d0b44 | Iustin Pop | Each resource URI will have a version prefix. The resource IDs are to |
1687 | 6c2d0b44 | Iustin Pop | be determined. |
1688 | 5c0c1eeb | Iustin Pop | |
1689 | 6c2d0b44 | Iustin Pop | Internal encoding might be JSON, XML, or any other. The JSON encoding |
1690 | 6c2d0b44 | Iustin Pop | fits the needs of the Ganeti RAPI nicely. The client can request a specific |
1691 | 6c2d0b44 | Iustin Pop | representation via the Accept field in the HTTP header. |
1692 | 5c0c1eeb | Iustin Pop | |
1693 | 6c2d0b44 | Iustin Pop | REST uses HTTP as its transport and application protocol for resource |
1694 | 6c2d0b44 | Iustin Pop | access. The set of possible responses is a subset of standard HTTP |
1695 | 6c2d0b44 | Iustin Pop | responses. |
1696 | 6c2d0b44 | Iustin Pop | |
1697 | 6c2d0b44 | Iustin Pop | The statelessness model provides additional reliability and |
1698 | 6c2d0b44 | Iustin Pop | transparency to operations (e.g. only one request needs to be analyzed |
1699 | 6c2d0b44 | Iustin Pop | to understand the in-progress operation, not a sequence of multiple |
1700 | 6c2d0b44 | Iustin Pop | requests/responses). |
1701 | 5c0c1eeb | Iustin Pop | |
1702 | 5c0c1eeb | Iustin Pop | |
1703 | 5c0c1eeb | Iustin Pop | Security |
1704 | 5c0c1eeb | Iustin Pop | ++++++++ |
1705 | 5c0c1eeb | Iustin Pop | |
1706 | 6c2d0b44 | Iustin Pop | With the write functionality, security becomes a much bigger issue. |
1707 | 6c2d0b44 | Iustin Pop | The Ganeti RAPI uses basic HTTP authentication on top of an |
1708 | 6c2d0b44 | Iustin Pop | SSL-secured connection to grant access to an exported resource. The |
1709 | 6c2d0b44 | Iustin Pop | password is stored locally in an Apache-style ``.htpasswd`` file. Only |
1710 | 6c2d0b44 | Iustin Pop | one level of privileges is supported. |
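
On the client side, basic HTTP authentication amounts to sending a base64-encoded ``user:password`` pair in the ``Authorization`` header, which is why the SSL layer is essential. A minimal sketch, with a hypothetical helper name:

```python
import base64


def basic_auth_header(user, password):
    """Build the Authorization header used by HTTP basic authentication."""
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}
```

The token is only encoded, not encrypted, so it must never travel over a connection that is not SSL-secured.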
1711 | 6c2d0b44 | Iustin Pop | |
1712 | 6c2d0b44 | Iustin Pop | Caveats |
1713 | 6c2d0b44 | Iustin Pop | +++++++ |
1714 | 6c2d0b44 | Iustin Pop | |
1715 | 6c2d0b44 | Iustin Pop | The model detailed above for job submission requires the client to |
1716 | 6c2d0b44 | Iustin Pop | poll periodically for updates to the job; an alternative would be to |
1717 | 6c2d0b44 | Iustin Pop | allow the client to request a callback, or a 'wait for updates' call. |
1718 | 6c2d0b44 | Iustin Pop | |
1719 | 6c2d0b44 | Iustin Pop | The callback model was not considered due to the following two issues: |
1720 | 5c0c1eeb | Iustin Pop | |
1721 | 6c2d0b44 | Iustin Pop | - callbacks would require a new model of allowed callback URLs, |
1722 | 6c2d0b44 | Iustin Pop | together with a method of managing these |
1723 | 6c2d0b44 | Iustin Pop | - callbacks only work when the client and the master are in the same |
1724 | 6c2d0b44 | Iustin Pop | security domain, and they fail in the other cases (e.g. when there is |
1725 | 6c2d0b44 | Iustin Pop | a firewall between the client and the RAPI daemon that only allows |
1726 | 6c2d0b44 | Iustin Pop | client-to-RAPI calls, which is usual in DMZ cases) |
1727 | 6c2d0b44 | Iustin Pop | |
1728 | 6c2d0b44 | Iustin Pop | The 'wait for updates' method is not suited to the HTTP protocol, |
1729 | 6c2d0b44 | Iustin Pop | where requests are supposed to be short-lived. |
1730 | 5c0c1eeb | Iustin Pop | |
1731 | 5c0c1eeb | Iustin Pop | Command line changes |
1732 | 5c0c1eeb | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~ |
1733 | 5c0c1eeb | Iustin Pop | |
1734 | 5c0c1eeb | Iustin Pop | Ganeti 2.0 introduces several new features as well as new ways to |
1735 | 5c0c1eeb | Iustin Pop | handle instance resources like disks or network interfaces. This |
1736 | 6c2d0b44 | Iustin Pop | requires some noticeable changes in the way command line arguments are |
1737 | 5c0c1eeb | Iustin Pop | handled. |
1738 | 5c0c1eeb | Iustin Pop | |
1739 | 6c2d0b44 | Iustin Pop | - extend and modify command line syntax to support new features |
1740 | 6c2d0b44 | Iustin Pop | - ensure consistent patterns in command line arguments to reduce |
1741 | 6c2d0b44 | Iustin Pop | cognitive load |
1742 | 5c0c1eeb | Iustin Pop | |
1743 | 5c0c1eeb | Iustin Pop | The design changes that require these changes are, in no particular |
1744 | 5c0c1eeb | Iustin Pop | order: |
1745 | 5c0c1eeb | Iustin Pop | |
1746 | 5c0c1eeb | Iustin Pop | - flexible instance disk handling: support a variable number of disks |
1747 | 5c0c1eeb | Iustin Pop | with varying properties per instance, |
1748 | 5c0c1eeb | Iustin Pop | - flexible instance network interface handling: support a variable |
1749 | 5c0c1eeb | Iustin Pop | number of network interfaces with varying properties per instance |
1750 | 5c0c1eeb | Iustin Pop | - multiple hypervisors: multiple hypervisors can be active on the same |
1751 | 5c0c1eeb | Iustin Pop | cluster, each supporting different parameters, |
1752 | 5c0c1eeb | Iustin Pop | - support for device type CDROM (via ISO image) |
1753 | 5c0c1eeb | Iustin Pop | |
1754 | 6c2d0b44 | Iustin Pop | As such, there are several areas of Ganeti where the command line |
1755 | 5c0c1eeb | Iustin Pop | arguments will change: |
1756 | 5c0c1eeb | Iustin Pop | |
1757 | 5c0c1eeb | Iustin Pop | - Cluster configuration |
1758 | 5c0c1eeb | Iustin Pop | |
1759 | 5c0c1eeb | Iustin Pop | - cluster initialization |
1760 | 5c0c1eeb | Iustin Pop | - cluster default configuration |
1761 | 5c0c1eeb | Iustin Pop | |
1762 | 5c0c1eeb | Iustin Pop | - Instance configuration |
1763 | 5c0c1eeb | Iustin Pop | |
1764 | 5c0c1eeb | Iustin Pop | - handling of network cards for instances, |
1765 | 5c0c1eeb | Iustin Pop | - handling of disks for instances, |
1766 | 5c0c1eeb | Iustin Pop | - handling of CDROM devices and |
1767 | 5c0c1eeb | Iustin Pop | - handling of hypervisor specific options. |
1768 | 5c0c1eeb | Iustin Pop | |
1784 | 5c0c1eeb | Iustin Pop | Notes about device removal/addition |
1785 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++++++ |
1786 | 5c0c1eeb | Iustin Pop | |
1787 | 5c0c1eeb | Iustin Pop | To avoid problems with device location changes (e.g. second network |
1788 | 5c0c1eeb | Iustin Pop | interface of the instance becoming the first or third and the like) |
1789 | 5c0c1eeb | Iustin Pop | the list of network/disk devices is treated as a stack, i.e. devices |
1790 | 5c0c1eeb | Iustin Pop | can only be added/removed at the end of the list of devices of each |
1791 | 5c0c1eeb | Iustin Pop | class (disk or network) for each instance. |
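
The stack rule can be sketched as a list wrapper that refuses to touch anything but the last element. The class below is a hypothetical illustration, not the Ganeti implementation:

```python
class DeviceList:
    """Devices of one class (disk or network) for one instance."""

    def __init__(self):
        self._devices = []

    def add(self, device):
        """Append a device and return its (stable) index."""
        self._devices.append(device)
        return len(self._devices) - 1

    def remove(self, index):
        """Remove a device; only the last one may be removed."""
        if index != len(self._devices) - 1:
            raise ValueError("only the last device can be removed")
        return self._devices.pop()

    def devices(self):
        """Return a copy of the current device list."""
        return list(self._devices)
```

Since no device in the middle can disappear, the index of every remaining device never changes.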
1792 | 5c0c1eeb | Iustin Pop | |
1793 | 5c0c1eeb | Iustin Pop | gnt-instance commands |
1794 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++ |
1795 | 5c0c1eeb | Iustin Pop | |
1796 | 5c0c1eeb | Iustin Pop | The commands for gnt-instance will be modified and extended to allow |
1797 | 5c0c1eeb | Iustin Pop | for the new functionality: |
1798 | 5c0c1eeb | Iustin Pop | |
1799 | 5c0c1eeb | Iustin Pop | - the add command will be extended to support the new device and |
1800 | 5c0c1eeb | Iustin Pop | hypervisor options, |
1801 | 5c0c1eeb | Iustin Pop | - the modify command continues to handle all modifications to |
1802 | 5c0c1eeb | Iustin Pop | instances, but will be extended with new arguments for handling |
1803 | 5c0c1eeb | Iustin Pop | devices. |
1804 | 5c0c1eeb | Iustin Pop | |
1805 | 5c0c1eeb | Iustin Pop | Network Device Options |
1806 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++++ |
1807 | 5c0c1eeb | Iustin Pop | |
1808 | 5c0c1eeb | Iustin Pop | The generic format of the network device option is: |
1809 | 5c0c1eeb | Iustin Pop | |
1810 | 5c0c1eeb | Iustin Pop | --net $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE] |
1811 | 5c0c1eeb | Iustin Pop | |
1812 | 5c0c1eeb | Iustin Pop | :$DEVNUM: device number, unsigned integer, starting at 0, |
1813 | 5c0c1eeb | Iustin Pop | :$OPTION: device option, string, |
1814 | 5c0c1eeb | Iustin Pop | :$VALUE: device option value, string. |
1815 | 5c0c1eeb | Iustin Pop | |
1816 | 5c0c1eeb | Iustin Pop | Currently, the following device options will be defined (open to |
1817 | 5c0c1eeb | Iustin Pop | further changes): |
1818 | 5c0c1eeb | Iustin Pop | |
1819 | 5c0c1eeb | Iustin Pop | :mac: MAC address of the network interface, accepts either a valid |
1820 | 5c0c1eeb | Iustin Pop | MAC address or the string 'auto'. If 'auto' is specified, a new MAC |
1821 | 5c0c1eeb | Iustin Pop | address will be generated randomly. If the mac device option is not |
1822 | 5c0c1eeb | Iustin Pop | specified, the default value 'auto' is assumed. |
1823 | 5c0c1eeb | Iustin Pop | :bridge: network bridge the network interface is connected |
1824 | 5c0c1eeb | Iustin Pop | to. Accepts either a valid bridge name (the specified bridge must |
1825 | 5c0c1eeb | Iustin Pop | exist on the node(s)) as string or the string 'auto'. If 'auto' is |
1826 | 5c0c1eeb | Iustin Pop | specified, the default bridge is used. If the bridge option is not |
1827 | 5c0c1eeb | Iustin Pop | specified, the default value 'auto' is assumed. |
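As a rough sketch of how such an option string could be split and completed with the documented 'auto' defaults, consider the following. The function names and the bridge name ``br0`` are assumptions made for this example, not the actual Ganeti implementation.

```python
import re

# Defaults taken from the option descriptions above.
NET_DEFAULTS = {"mac": "auto", "bridge": "auto"}


def parse_device_option(text):
    """Parse '$DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE]' into (devnum, options)."""
    devnum, colon, rest = text.partition(":")
    if not re.fullmatch(r"[0-9]+", devnum):
        raise ValueError("device number must be an unsigned integer: %r" % devnum)
    options = {}
    if colon:
        for pair in rest.split(","):
            name, equals, value = pair.partition("=")
            if not name or not equals:
                raise ValueError("expected OPTION=VALUE, got %r" % pair)
            options[name] = value
    return int(devnum), options


def with_net_defaults(options):
    """Reject unknown options and fill in 'auto' for unspecified ones."""
    unknown = set(options) - set(NET_DEFAULTS)
    if unknown:
        raise ValueError("unknown network device options: %s" % sorted(unknown))
    merged = dict(NET_DEFAULTS)
    merged.update(options)
    return merged


devnum, opts = parse_device_option("1:bridge=br0")  # 'br0' is a made-up bridge
nic = with_net_defaults(opts)
```

The same parser applies to the disk device option, since both share the generic format.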
1828 | 5c0c1eeb | Iustin Pop | |
1829 | 5c0c1eeb | Iustin Pop | Disk Device Options |
1830 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++ |
1831 | 5c0c1eeb | Iustin Pop | |
1832 | 5c0c1eeb | Iustin Pop | The generic format of the disk device option is: |
1833 | 5c0c1eeb | Iustin Pop | |
1834 | 5c0c1eeb | Iustin Pop | --disk $DEVNUM[:$OPTION=$VALUE][,$OPTION=$VALUE] |
1835 | 5c0c1eeb | Iustin Pop | |
1836 | 5c0c1eeb | Iustin Pop | :$DEVNUM: device number, unsigned integer, starting at 0, |
1837 | 5c0c1eeb | Iustin Pop | :$OPTION: device option, string, |
1838 | 5c0c1eeb | Iustin Pop | :$VALUE: device option value, string. |
1839 | 5c0c1eeb | Iustin Pop | |
1840 | 5c0c1eeb | Iustin Pop | Currently, the following device options will be defined (open to |
1841 | 5c0c1eeb | Iustin Pop | further changes): |
1842 | 5c0c1eeb | Iustin Pop | |
1843 | 5c0c1eeb | Iustin Pop | :size: size of the disk device, either a positive number, specifying |
1844 | 5c0c1eeb | Iustin Pop | the disk size in mebibytes, or a number followed by a magnitude suffix |
1845 | 5c0c1eeb | Iustin Pop | (M for mebibytes, G for gibibytes). Also accepts the string 'auto' in |
1846 | 5c0c1eeb | Iustin Pop | which case the default disk size will be used. If the size option is |
1847 | 5c0c1eeb | Iustin Pop | not specified, 'auto' is assumed. This option is not valid for all |
1848 | 5c0c1eeb | Iustin Pop | disk layout types. |
1849 | 5c0c1eeb | Iustin Pop | :access: access mode of the disk device, a single letter, valid values |
1850 | 5c0c1eeb | Iustin Pop | are: |
1851 | 5c0c1eeb | Iustin Pop | |
1852 | 5c0c1eeb | Iustin Pop | - *w*: read/write access to the disk device or |
1853 | 5c0c1eeb | Iustin Pop | - *r*: read-only access to the disk device. |
1854 | 5c0c1eeb | Iustin Pop | |
1855 | 5c0c1eeb | Iustin Pop | If the access mode is not specified, the default mode of read/write |
1856 | 5c0c1eeb | Iustin Pop | access will be configured. |
1857 | 5c0c1eeb | Iustin Pop | :path: path to the image file for the disk device, string. No default |
1858 | 5c0c1eeb | Iustin Pop | exists. This option is not valid for all disk layout types. |
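The size rules above could be normalised to mebibytes along these lines. This is only a sketch: the function name is hypothetical, the default is passed in by the caller, and it assumes sizes are whole numbers, which the text does not state explicitly.

```python
import re

# Multipliers relative to one mebibyte, per the suffixes listed above.
MULTIPLIERS = {"M": 1, "G": 1024}


def parse_disk_size(value, default_mib):
    """Return the disk size in mebibytes for a 'size' option value."""
    if value == "auto":
        return default_mib
    number, suffix = value, "M"  # a bare number is taken as mebibytes
    if value and value[-1] in MULTIPLIERS:
        number, suffix = value[:-1], value[-1]
    # Assumption: only positive whole numbers are accepted.
    if not re.fullmatch(r"[0-9]+", number) or int(number) == 0:
        raise ValueError("invalid disk size: %r" % value)
    return int(number) * MULTIPLIERS[suffix]


plain = parse_disk_size("512", default_mib=1024)
gibi = parse_disk_size("10G", default_mib=1024)
auto = parse_disk_size("auto", default_mib=1024)
```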
1859 | 5c0c1eeb | Iustin Pop | |
1860 | 5c0c1eeb | Iustin Pop | Adding devices |
1861 | 5c0c1eeb | Iustin Pop | ++++++++++++++ |
1862 | 5c0c1eeb | Iustin Pop | |
1863 | 5c0c1eeb | Iustin Pop | To add devices to an already existing instance, use the device type |
1864 | 5c0c1eeb | Iustin Pop | specific option to gnt-instance modify. Currently, there are two |
1865 | 5c0c1eeb | Iustin Pop | device type specific options supported: |
1866 | 5c0c1eeb | Iustin Pop | |
1867 | 5c0c1eeb | Iustin Pop | :--net: for network interface cards |
1868 | 5c0c1eeb | Iustin Pop | :--disk: for disk devices |
1869 | 5c0c1eeb | Iustin Pop | |
1870 | 6c2d0b44 | Iustin Pop | The syntax of the device specific options is similar to the generic |
1871 | 5c0c1eeb | Iustin Pop | device options, but instead of specifying a device number like for |
1872 | 5c0c1eeb | Iustin Pop | gnt-instance add, you specify the magic string add. The new device |
1873 | 5c0c1eeb | Iustin Pop | will always be appended at the end of the list of devices of this type |
1874 | 5c0c1eeb | Iustin Pop | for the specified instance, e.g. if the instance has disk devices 0,1 |
1875 | 5c0c1eeb | Iustin Pop | and 2, the newly added disk device will be disk device 3. |
1876 | 5c0c1eeb | Iustin Pop | |
1877 | 5c0c1eeb | Iustin Pop | Example: gnt-instance modify --net add:mac=auto test-instance |
1878 | 5c0c1eeb | Iustin Pop | |
1879 | 5c0c1eeb | Iustin Pop | Removing devices |
1880 | 5c0c1eeb | Iustin Pop | ++++++++++++++++ |
1881 | 5c0c1eeb | Iustin Pop | |
1882 | 5c0c1eeb | Iustin Pop | Removing devices from an instance is done via gnt-instance |
1883 | 5c0c1eeb | Iustin Pop | modify. The same device specific options as for adding devices are |
1884 | 5c0c1eeb | Iustin Pop | used. Instead of a device number and further device options, only the |
1885 | 5c0c1eeb | Iustin Pop | magic string remove is specified. It will always remove the last |
1886 | 5c0c1eeb | Iustin Pop | device in the list of devices of this type for the instance specified, |
1887 | 5c0c1eeb | Iustin Pop | e.g. if the instance has disk devices 0, 1, 2 and 3, the disk device |
1888 | 5c0c1eeb | Iustin Pop | number 3 will be removed. |
1889 | 5c0c1eeb | Iustin Pop | |
1890 | 5c0c1eeb | Iustin Pop | Example: gnt-instance modify --net remove test-instance |
1891 | 5c0c1eeb | Iustin Pop | |
1892 | 5c0c1eeb | Iustin Pop | Modifying devices |
1893 | 5c0c1eeb | Iustin Pop | +++++++++++++++++ |
1894 | 5c0c1eeb | Iustin Pop | |
1895 | 5c0c1eeb | Iustin Pop | Modifying devices is also done with device type specific options to |
1896 | 5c0c1eeb | Iustin Pop | the gnt-instance modify command. There are currently two device type |
1897 | 5c0c1eeb | Iustin Pop | options supported: |
1898 | 5c0c1eeb | Iustin Pop | |
1899 | 5c0c1eeb | Iustin Pop | :--net: for network interface cards |
1900 | 5c0c1eeb | Iustin Pop | :--disk: for disk devices |
1901 | 5c0c1eeb | Iustin Pop | |
1902 | 6c2d0b44 | Iustin Pop | The syntax of the device specific options is similar to the generic |
1903 | 5c0c1eeb | Iustin Pop | device options. The device number you specify identifies the device to |
1904 | 5c0c1eeb | Iustin Pop | be modified. |
1905 | 5c0c1eeb | Iustin Pop | |
1906 | 6c2d0b44 | Iustin Pop | Example:: |
1907 | 6c2d0b44 | Iustin Pop | |
1908 | 6c2d0b44 | Iustin Pop | gnt-instance modify --disk 2:access=r test-instance |
1909 | 5c0c1eeb | Iustin Pop | |
1910 | 5c0c1eeb | Iustin Pop | Hypervisor Options |
1911 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++ |
1912 | 5c0c1eeb | Iustin Pop | |
1913 | 5c0c1eeb | Iustin Pop | Ganeti 2.0 will support more than one hypervisor. Different |
1914 | 5c0c1eeb | Iustin Pop | hypervisors have various options that only apply to a specific |
1915 | 5c0c1eeb | Iustin Pop | hypervisor. Those hypervisor specific options are treated specially |
1916 | 6c2d0b44 | Iustin Pop | via the ``--hypervisor`` option. The generic syntax of the hypervisor |
1917 | 6c2d0b44 | Iustin Pop | option is as follows:: |
1918 | 5c0c1eeb | Iustin Pop | |
1919 | 5c0c1eeb | Iustin Pop | --hypervisor $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
1920 | 5c0c1eeb | Iustin Pop | |
1921 | 5c0c1eeb | Iustin Pop | :$HYPERVISOR: symbolic name of the hypervisor to use, string, |
1922 | 5c0c1eeb | Iustin Pop | has to match the supported hypervisors. Example: xen-pvm |
1923 | 5c0c1eeb | Iustin Pop | |
1924 | 5c0c1eeb | Iustin Pop | :$OPTION: hypervisor option name, string |
1925 | 5c0c1eeb | Iustin Pop | :$VALUE: hypervisor option value, string |
1926 | 5c0c1eeb | Iustin Pop | |
1927 | 5c0c1eeb | Iustin Pop | The hypervisor option for an instance can be set at instance creation |
1928 | 6c2d0b44 | Iustin Pop | time via the ``gnt-instance add`` command. If the hypervisor for an |
1929 | 5c0c1eeb | Iustin Pop | instance is not specified upon instance creation, the default |
1930 | 5c0c1eeb | Iustin Pop | hypervisor will be used. |
1931 | 5c0c1eeb | Iustin Pop | |
1932 | 5c0c1eeb | Iustin Pop | Modifying hypervisor parameters |
1933 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++++++ |
1934 | 5c0c1eeb | Iustin Pop | |
1935 | 5c0c1eeb | Iustin Pop | The hypervisor parameters of an existing instance can be modified |
1936 | 6c2d0b44 | Iustin Pop | using the ``--hypervisor`` option of the ``gnt-instance modify`` |
1937 | 6c2d0b44 | Iustin Pop | command. However, the hypervisor type of an existing instance cannot |
1938 | 6c2d0b44 | Iustin Pop | be changed, only the particular hypervisor specific option can be |
1939 | 6c2d0b44 | Iustin Pop | changed. Therefore, the format of the option parameters has been |
1940 | 6c2d0b44 | Iustin Pop | simplified to omit the hypervisor name and only contain the comma |
1941 | 6c2d0b44 | Iustin Pop | separated list of option-value pairs. |
1942 | 5c0c1eeb | Iustin Pop | |
1943 | 6c2d0b44 | Iustin Pop | Example:: |
1944 | 6c2d0b44 | Iustin Pop | |
1945 | 6c2d0b44 | Iustin Pop | gnt-instance modify --hypervisor cdrom=/srv/boot.iso,boot_order=cdrom:network test-instance |
1946 | 5c0c1eeb | Iustin Pop | |
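The difference between the two ``--hypervisor`` formats (with a hypervisor name on ``add``, without it on ``modify``) can be sketched as follows. The function names are hypothetical. Note that only the first colon separates the hypervisor name, because option values such as ``cdrom:network`` may themselves contain colons.

```python
def parse_option_pairs(text):
    """Parse '$OPTION=$VALUE[,$OPTION=$VALUE]' into a dict."""
    options = {}
    for pair in text.split(","):
        name, equals, value = pair.partition("=")
        if not name or not equals:
            raise ValueError("expected OPTION=VALUE, got %r" % pair)
        options[name] = value
    return options


def parse_hypervisor_for_add(text):
    """Parse '$HYPERVISOR:$OPTION=$VALUE[,...]' as used at creation time."""
    name, colon, rest = text.partition(":")  # split on the first colon only
    if not name or not colon:
        raise ValueError("expected HYPERVISOR:OPTION=VALUE, got %r" % text)
    return name, parse_option_pairs(rest)


def parse_hypervisor_for_modify(text):
    """Parse the simplified format, which omits the hypervisor name."""
    return parse_option_pairs(text)


created = parse_hypervisor_for_add("xen-pvm:boot_order=cdrom:network")
modified = parse_hypervisor_for_modify(
    "cdrom=/srv/boot.iso,boot_order=cdrom:network")
```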
1947 | 5c0c1eeb | Iustin Pop | gnt-cluster commands |
1948 | 5c0c1eeb | Iustin Pop | ++++++++++++++++++++ |
1949 | 5c0c1eeb | Iustin Pop | |
1950 | 5c0c1eeb | Iustin Pop | The commands for gnt-cluster will be extended to allow setting and |
1951 | 5c0c1eeb | Iustin Pop | changing the default parameters of the cluster: |
1952 | 5c0c1eeb | Iustin Pop | |
1953 | 5c0c1eeb | Iustin Pop | - The init command will be extended to support the --defaults option to |
1954 | 5c0c1eeb | Iustin Pop | set the cluster defaults upon cluster initialization. |
1955 | 5c0c1eeb | Iustin Pop | - The modify command will be added to modify the cluster |
1956 | 5c0c1eeb | Iustin Pop | parameters. It will support the --defaults option to change the |
1957 | 5c0c1eeb | Iustin Pop | cluster defaults. |
1958 | 5c0c1eeb | Iustin Pop | |
1959 | 5c0c1eeb | Iustin Pop | Cluster defaults |
1960 | 5c0c1eeb | Iustin Pop | |
1961 | 5c0c1eeb | Iustin Pop | The generic format of the cluster default setting option is: |
1962 | 5c0c1eeb | Iustin Pop | |
1963 | 5c0c1eeb | Iustin Pop | --defaults $OPTION=$VALUE[,$OPTION=$VALUE] |
1964 | 5c0c1eeb | Iustin Pop | |
1965 | 5c0c1eeb | Iustin Pop | :$OPTION: cluster default option, string, |
1966 | 5c0c1eeb | Iustin Pop | :$VALUE: cluster default option value, string. |
1967 | 5c0c1eeb | Iustin Pop | |
1968 | 5c0c1eeb | Iustin Pop | Currently, the following cluster default options are defined (open to |
1969 | 5c0c1eeb | Iustin Pop | further changes): |
1970 | 5c0c1eeb | Iustin Pop | |
1971 | 5c0c1eeb | Iustin Pop | :hypervisor: the default hypervisor to use for new instances, |
1972 | 5c0c1eeb | Iustin Pop | string. Must be a valid hypervisor known to and supported by the |
1973 | 5c0c1eeb | Iustin Pop | cluster. |
1974 | 5c0c1eeb | Iustin Pop | :disksize: the default disk size for newly created instance disks, where |
1975 | 5c0c1eeb | Iustin Pop | applicable. Must be either a positive number, in which case the unit |
1976 | 5c0c1eeb | Iustin Pop | of mebibytes is assumed, or a positive number followed by a supported |
1977 | 5c0c1eeb | Iustin Pop | magnitude suffix (M for mebibytes or G for gibibytes). |
1978 | 5c0c1eeb | Iustin Pop | :bridge: the default network bridge to use for newly created instance |
1979 | 5c0c1eeb | Iustin Pop | network interfaces, string. Must be a valid bridge name of a bridge |
1980 | 5c0c1eeb | Iustin Pop | existing on the node(s). |
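One way to picture how the cluster defaults relate to the per-device 'auto' values is the sketch below. The mapping table, the function name and the sample values are assumptions for illustration only; MAC addresses are left out because 'auto' there means random generation rather than a cluster default.

```python
# Hypothetical mapping from a device option to the cluster default backing it.
DEFAULT_KEY = {"bridge": "bridge", "size": "disksize"}


def resolve_auto(option, value, cluster_defaults):
    """Replace a missing or 'auto' value with the matching cluster default."""
    if value is None or value == "auto":
        return cluster_defaults[DEFAULT_KEY[option]]
    return value


# Sample values, made up for this example.
cluster_defaults = {"hypervisor": "xen-pvm", "disksize": "10G", "bridge": "br0"}

bridge = resolve_auto("bridge", "auto", cluster_defaults)
size = resolve_auto("size", None, cluster_defaults)
explicit = resolve_auto("bridge", "br1", cluster_defaults)
```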
1981 | 5c0c1eeb | Iustin Pop | |
1982 | 5c0c1eeb | Iustin Pop | Hypervisor cluster defaults |
1983 | 5c0c1eeb | Iustin Pop | +++++++++++++++++++++++++++ |
1984 | 5c0c1eeb | Iustin Pop | |
1985 | 6c2d0b44 | Iustin Pop | The generic format of the hypervisor cluster wide default setting |
1986 | 6c2d0b44 | Iustin Pop | option is:: |
1987 | 5c0c1eeb | Iustin Pop | |
1988 | 5c0c1eeb | Iustin Pop | --hypervisor-defaults $HYPERVISOR:$OPTION=$VALUE[,$OPTION=$VALUE] |
1989 | 5c0c1eeb | Iustin Pop | |
1990 | 5c0c1eeb | Iustin Pop | :$HYPERVISOR: symbolic name of the hypervisor whose defaults you want |
1991 | 5c0c1eeb | Iustin Pop | to set, string |
1992 | 5c0c1eeb | Iustin Pop | :$OPTION: cluster default option, string, |
1993 | 5c0c1eeb | Iustin Pop | :$VALUE: cluster default option value, string. |
1994 | 5c0c1eeb | Iustin Pop | |
1995 | 6c2d0b44 | Iustin Pop | Glossary |
1996 | 6c2d0b44 | Iustin Pop | ======== |
1997 | 5c0c1eeb | Iustin Pop | |
1998 | 6c2d0b44 | Iustin Pop | Since this document is only a delta from Ganeti 1.2, there are |
1999 | 6c2d0b44 | Iustin Pop | some unexplained terms. Here is a non-exhaustive list. |
2000 | 5c0c1eeb | Iustin Pop | |
2001 | 6c2d0b44 | Iustin Pop | .. _HVM: |
2002 | 5c0c1eeb | Iustin Pop | |
2003 | 6c2d0b44 | Iustin Pop | HVM |
2004 | 6c2d0b44 | Iustin Pop | hardware virtualization mode, where the virtual machine is oblivious |
2005 | 6c2d0b44 | Iustin Pop | to the fact that it is being virtualized and all the hardware is emulated |
2006 | 5c0c1eeb | Iustin Pop | |
2007 | 6c2d0b44 | Iustin Pop | .. _LU: |
2008 | 5c0c1eeb | Iustin Pop | |
2009 | 6c2d0b44 | Iustin Pop | LogicalUnit |
2010 | 6c2d0b44 | Iustin Pop | the code associated with an OpCode, e.g. the code that implements the |
2011 | 6c2d0b44 | Iustin Pop | startup of an instance |
2012 | 5c0c1eeb | Iustin Pop | |
2013 | 6c2d0b44 | Iustin Pop | .. _opcode: |
2014 | 6c2d0b44 | Iustin Pop | |
2015 | 6c2d0b44 | Iustin Pop | OpCode |
2016 | 6c2d0b44 | Iustin Pop | a data structure encapsulating a basic cluster operation; for example, |
2017 | 6c2d0b44 | Iustin Pop | start instance, add instance, etc. |
2018 | 6c2d0b44 | Iustin Pop | |
2019 | 6c2d0b44 | Iustin Pop | .. _PVM: |
2020 | 5c0c1eeb | Iustin Pop | |
2021 | 6c2d0b44 | Iustin Pop | PVM |
2022 | 6c2d0b44 | Iustin Pop | para-virtualization mode, where the virtual machine knows it's being |
2023 | 6c2d0b44 | Iustin Pop | virtualized and as such there is no need for hardware emulation |
2024 | 5c0c1eeb | Iustin Pop | |
2025 | 6c2d0b44 | Iustin Pop | .. _watcher: |
2026 | 5c0c1eeb | Iustin Pop | |
2027 | 6c2d0b44 | Iustin Pop | watcher |
2028 | 6c2d0b44 | Iustin Pop | ``ganeti-watcher`` is a tool that should be run regularly from cron |
2029 | 6c2d0b44 | Iustin Pop | and takes care of restarting failed instances, restarting secondary |
2030 | 6c2d0b44 | Iustin Pop | DRBD devices, etc. For more details, see the man page |
2031 | 6c2d0b44 | Iustin Pop | ``ganeti-watcher(8)``. |