=================
Ganeti 2.1 design
=================

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to avoid
changing too much of the core code, while addressing issues and adding new
features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 3

Objective
=========

Ganeti 2.1 will add features to help further automate cluster
operations, further improve scalability to even bigger clusters, and make it
easier to debug the Ganeti core.

Background
==========

Overview
========

Detailed design
===============

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or all/most
  logical units
- logical unit/feature changes
- external interface changes (eg. command line, OS API, hooks, ...)

Core changes
------------

Feature changes
---------------

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than others. In
particular they are divided between "master", "master candidates" and "normal"
nodes. (Moreover they can be offline or drained, but this is not important for
the current discussion.) In general the whole configuration is only replicated
to master candidates, and some partial information is spread to all nodes via
ssconf.

This change was done so that the most frequent Ganeti operations didn't need to
contact all nodes, and so clusters could become bigger. If we want more
information to be available on all nodes, we either need to add more ssconf
values, which counter-balances that change, or to talk with the master node,
which is not designed to happen now, and requires its availability.

Information such as the instance->primary_node mapping will be needed on all
nodes, and we also want to make sure services external to the cluster can query
this information as well. This information must be available at all times, so
we can't query it through RAPI, which would be a single point of failure, as
it's only available on the master.

Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which will run on
master candidates. This daemon will talk via UDP, and authenticate messages
using HMAC with a cluster-wide shared key.

An interested client can query a value by making a request to a subset of the
cluster master candidates. It will then wait to get a few responses, and use
the one with the highest configuration serial number (which will always be
included in the answer). If some candidates are stale, or we are in the middle
of a configuration update, various master candidates may return different
values, and this should make sure the most recent information is used.

In order to prevent replay attacks queries will contain the current Unix
timestamp according to the client, and the server will verify that its own
timestamp is within the same 5-minute range (this requires synchronized clocks,
which is a good idea anyway). Queries will also contain a "salt" which they
expect the answers to be sent with, and clients are supposed to accept only
answers which contain a salt generated by them.

The configuration daemon will be able to answer simple queries such as:

- master candidates list
- master node
- offline nodes
- instance list
- instance primary nodes
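
To make the interaction concrete, here is a minimal client sketch in Python.
The port number, message layout and field names are illustrative assumptions,
not the final wire format::

  import hmac, hashlib, json, socket, time

  def confd_query(candidates, query, hmac_key, port=1814, timeout=1.0):
      """Send a salted, HMAC-signed UDP query to several master
      candidates and keep the answer with the highest config serial."""
      # The timestamp doubles as replay protection (5-minute window)
      # and the salt ties answers to this specific query.
      salt = "%.6f" % time.time()
      payload = json.dumps({"query": query, "salt": salt})
      sig = hmac.new(hmac_key, payload.encode(), hashlib.sha1).hexdigest()
      msg = json.dumps({"payload": payload, "hmac": sig}).encode()

      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.settimeout(timeout)
      for node in candidates:
          sock.sendto(msg, (node, port))

      best = None
      try:
          while True:
              data, _ = sock.recvfrom(4096)
              answer = json.loads(data)
              if answer.get("salt") != salt:
                  continue  # not an answer to our query, drop it
              # Stale candidates lose: the highest serial number wins.
              if best is None or answer["serial"] > best["serial"]:
                  best = answer
      except socket.timeout:
          pass
      return best

A real client would also verify the HMAC on each answer before trusting it;
this is omitted here for brevity.
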
Redistribute Config
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently LURedistributeConfig triggers a copy of the updated configuration
file to all master candidates and of the ssconf files to all nodes. There are
other files which are maintained manually but which are important to keep in
sync. These are:

- rapi SSL key/certificate file (rapi.pem) (on master candidates)
- rapi user/password file rapi_users (on master candidates)

Furthermore there are some files which are hypervisor specific but which we may
want to keep in sync:

- the xen-hvm hypervisor uses one shared file for all vnc passwords, and copies
  the file once, during node add. This design is subject to revision to be able
  to have different passwords for different groups of instances via the use of
  hypervisor parameters, and to allow xen-hvm and kvm to use an equal system to
  provide password-protected vnc sessions. In general, though, it would be
  useful if the vnc password files were copied as well, to avoid unwanted vnc
  password changes on instance failover/migrate.

Optionally the admin may also want to ship files such as the global xend.conf
file, and the network scripts, to all nodes.

Proposed changes
++++++++++++++++

RedistributeConfig will be changed to also copy the rapi files, and to ask
every enabled hypervisor for a list of additional files to copy. We may also
want to add a global list of files on the cluster object, which will be
propagated as well, or a hook to calculate them. If we implement this feature
there should be a way to specify whether a file must be shipped to all nodes or
just master candidates.

This code will also be shared (via tasklets or by other means, if tasklets are
not ready for 2.1) with the AddNode and SetNodeParams LUs (so that the relevant
files will be automatically shipped to new master candidates as they are set).
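
A sketch of how hypervisors could advertise extra files to distribute; the
method name, class layout and file path are assumptions for illustration, not
the final internal API::

  class BaseHypervisor(object):
      """Base class sketch for the per-hypervisor redistribution hook."""

      def GetAncillaryFiles(self):
          """Return extra files this hypervisor wants kept in sync."""
          return []

  class XenHvmHypervisor(BaseHypervisor):
      def GetAncillaryFiles(self):
          # The shared VNC password file should follow config
          # redistribution, so failover/migration doesn't silently
          # change instance console passwords.
          return ["/etc/ganeti/vnc-cluster-password"]
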
VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to connect
to the instances' VNC console, and has one common password stored in a file.

This doesn't allow different passwords for different instances/groups of
instances, and makes it necessary to remember to copy the file around the
cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor parameter.
This way it can have a cluster default, but also a different value for each
instance. The VNC-enabled hypervisors (xen and kvm) will publish all the
password files in use through the cluster so that a redistribute-config will
ship them to all nodes (see the Redistribute Config proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value will be
used as the default HV_VNC_PASSWORD_FILE value, thus retaining backwards
compatibility with 2.0.
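
As a rough sketch, assuming the 2.0 constant value shown below, the migration
amounts to turning the old global constant into the cluster-level default of
the new hypervisor parameter::

  # Hypothetical sketch; the parameter name and path are assumptions.
  HV_VNC_PASSWORD_FILE = "vnc_password_file"

  # The old global VNC_PASSWORD_FILE value becomes the cluster default,
  # which individual instances may then override.
  HVC_DEFAULTS = {
      "xen-hvm": {HV_VNC_PASSWORD_FILE: "/etc/ganeti/vnc-cluster-password"},
      "kvm": {HV_VNC_PASSWORD_FILE: "/etc/ganeti/vnc-cluster-password"},
  }
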
The code to export the list of VNC password files from the hypervisors to
RedistributeConfig will be shared between the KVM and xen-hvm hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options and all the
rest is left to a default we chose. We're finding that we need to tweak more
and more of these parameters, for example to disable barriers for DRBD devices,
or to allow striping for the LVM volumes.

Moreover for many of these parameters it would be nice to have cluster-wide
defaults, and then be able to change them per disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster-level diskparams and netparams, which will contain all
the tweakable parameters. All values which have a sensible cluster-wide default
will go into this new structure, while parameters which have unique values will
not.

Example of network parameters:

- mode: bridge/route
- link: for mode "bridge" the bridge to connect to, for mode "route" it can
  contain the routing table, or the destination interface

Example of disk parameters (see the sketch below for how these could look as
cluster-wide defaults):

- stripe: lvm stripes
- stripe_size: lvm stripe size
- meta_flushes: drbd, enable/disable metadata "barriers"
- data_flushes: drbd, enable/disable data "barriers"
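
As an illustration, the cluster-level defaults could look roughly like this
(parameter names are taken from the examples above; the values are invented)::

  # Hypothetical cluster-wide defaults; one flat structure for now, as
  # described below, even though some keys are lvm- or drbd-specific.
  diskparams = {
      "stripe": 1,           # lvm
      "stripe_size": "64k",  # lvm
      "meta_flushes": True,  # drbd
      "data_flushes": True,  # drbd
  }

  netparams = {
      "mode": "bridge",
      "link": "xen-br0",
  }
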
Some parameters are bound to be disk-type specific (drbd, vs lvm, vs files) or
hypervisor specific (nic models for example), but for now they will all live in
the same structure. Each component is supposed to validate only the parameters
it knows about, and ganeti itself will make sure that no "globally unknown"
parameters are added, and that no parameters have overridden meanings for
different components.

The parameters will be kept, as for the BEPARAMS, in a "default" category,
which will allow us to expand on them by creating instance "classes" in the
future. Instance classes are not a feature we plan to implement in 2.1, though.

Non-bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the bridge is
not specified the default cluster one is used. This makes it impossible to use
the vif-route xen network scripts, or other alternative mechanisms that don't
need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged interfaces
and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the instance
should be connected to, effectively making things work as they do today. The
value has been migrated from a nic field to a parameter to allow for easier
manipulation of the cluster default.

When mode is "route" the ip field of the interface will become mandatory, to
allow for a route to be set. In the future we may also want to accept multiple
IPs or IP/mask values for this purpose. We will evaluate possible meanings of
the link parameter to signify a routing table to be used, which would allow for
insulation between instance groups (as happens today for different bridges).
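
A minimal sketch of the per-NIC validation this implies (the function and
error messages are illustrative, not actual Ganeti code)::

  def CheckNicParams(mode, link, ip):
      """Validate one NIC according to its mode."""
      if mode == "bridge":
          if not link:
              raise ValueError("bridged NICs need a bridge in 'link'")
      elif mode == "route":
          # Routed NICs must carry an IP so a route can be set up.
          if not ip:
              raise ValueError("routed NICs require the 'ip' field")
      else:
          raise ValueError("invalid NIC mode '%s'" % mode)
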
For now we won't add a parameter to specify which network script gets called
for which instance, so in a mixed cluster the network script must be able to
handle both cases. The default kvm vif script will be changed to do so. (Xen
doesn't have a ganeti-provided script, so nothing will be done for that
hypervisor.)

External interface changes
--------------------------

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind. Since we
pass everything as environment variables it's a lot easier to send new
information to the OSes without breaking backwards compatibility. This section
of the design outlines the proposed extensions to the API and their
implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (eg. 15), which should be mostly
compatible with api 10, except for some newly added variables. Since it's easy
not to pass some variables we'll be able to handle Ganeti 2.0 OSes by just
filtering out the newly added pieces of information. We will still encourage
OSes to declare support for the new API after checking that the new variables
don't create any conflict for them, and we will drop api 10 support after
ganeti 2.1 has been released.
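
The filtering could look roughly like this (the version number matches the
example above; the set of new variables is an assumption)::

  # Variables assumed to be new in the 2.1 OS API (version 15).
  NEW_IN_API_15 = frozenset(["INSTANCE_HYPERVISOR"])

  def FilterOsEnv(env, os_api_version):
      """Drop variables an older (api 10) OS doesn't know about."""
      if os_api_version >= 15:
          return env
      return dict((k, v) for k, v in env.items()
                  if k not in NEW_IN_API_15)
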
New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS api but would definitely be
useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable to allow
the OS to make changes relevant to the virtualization the instance is going to
use. Since this field is immutable for each instance, the OS can tailor the
install to it, without having to make sure the instance can run under any
virtualization technology.

We also want the OS to know the particular hypervisor parameters, to be able to
customize the install even more. Since the parameters can change, though, we
will pass them only as an "FYI": if an OS ties some instance functionality to
the value of a particular hypervisor parameter, manual changes or a reinstall
may be needed to adapt the instance to the new environment. This is not a
regression compared to today, because even if the OSes are left blind about
this information, sometimes they still need to make compromises and cannot
satisfy all possible parameter values.
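
A sketch of how the new variables could be exposed; the HVP_ prefix for
hypervisor parameters is an illustrative assumption::

  def ExtraOsEnv(instance, hvparams):
      """Build the additional OS install environment."""
      env = {"INSTANCE_HYPERVISOR": instance.hypervisor}  # e.g. "kvm"
      # Hypervisor parameters are passed FYI only; they may change
      # later without the OS scripts being re-run.
      for name, value in hvparams.items():
          env["HVP_%s" % name.upper()] = str(value)
      return env
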
OS Flavours
+++++++++++

Currently we are witnessing some degree of "os proliferation" just to change
a simple installation behavior. This means that the same OS gets installed on
the cluster multiple times, with different names, to customize just one
installation behavior. Usually such OSes try to share as much as possible
through symlinks, but this still causes complications on the user side,
especially when multiple parameters must be cross-matched.

For example today if you want to install debian etch, lenny or squeeze you
probably need to install the debootstrap OS multiple times, changing its
configuration file, and calling it debootstrap-etch, debootstrap-lenny or
debootstrap-squeeze. Furthermore if you have for example a "server" and a
"development" environment which install different packages/configuration files
and must be available for all installs you'll probably end up with
debootstrap-etch-server, debootstrap-etch-dev, debootstrap-lenny-server,
debootstrap-lenny-dev, etc. Crossing more than two parameters quickly becomes
unmanageable.

In order to avoid this we plan to make OSes more customizable, by allowing each
OS to declare a list of flavours which can be used to customize it. The
flavours list is mandatory for new API OSes and must contain at least one
supported flavour. When choosing the OS exactly one flavour will have to be
specified, and will be encoded in the os name as <OS-name>+<flavour>. As for
today it will be possible to change an instance's OS at creation or install
time.
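
Parsing the combined name is then trivial (the helper name is illustrative)::

  def SplitOsName(os_name):
      """Split e.g. "debootstrap+lenny" into OS and flavour."""
      if "+" in os_name:
          return tuple(os_name.split("+", 1))
      return os_name, None  # old-style name without a flavour

  # SplitOsName("debootstrap+lenny") == ("debootstrap", "lenny")
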
The 2.1 OS list will be the combination of each OS, plus its supported
flavours. This will cause the name proliferation to remain, but at least the
internal OS code will be simplified to just parsing the passed flavour, without
the need for symlinks or code duplication.

Also we expect the OSes to declare only "interesting" flavours, but to accept
some non-declared ones which a user will be able to pass in by overriding the
checks ganeti does. This will be useful for allowing some variations to be used
without polluting the OS list (per-OS documentation should list all supported
flavours). If a flavour which is not internally supported is forced through,
the OS scripts should abort.

In the future (post 2.1) we may want to move to full-fledged orthogonal
parameters for the OSes. In this case we envision the flavours to be moved
inside of Ganeti and be associated with lists of parameter->value associations,
which will then be passed to the OS.

IAllocator changes
~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

The iallocator interface allows creation of instances without manually
specifying nodes, but instead by specifying plugins which will do the
required computations and produce a valid node list.

However, the interface is quite awkward to use:

- one cannot set a 'default' iallocator script
- one cannot use it to easily test if allocation would succeed
- some new functionality, such as rebalancing clusters and calculating
  capacity estimates, is needed

Proposed changes
++++++++++++++++

There are two areas of improvement proposed:

- improving the use of the current interface
- extending the IAllocator API to cover more automation

Default iallocator names
^^^^^^^^^^^^^^^^^^^^^^^^

The cluster will hold, for each type of iallocator, a (possibly empty)
list of modules that will be used automatically.

If the list is empty, the behaviour will remain the same.

If the list has one entry, then ganeti will behave as if
'--iallocator' was specified on the command line, i.e. use this
allocator by default. If the user however passed nodes, those will be
used in preference.

If the list has multiple entries, they will be tried in order until
one gives a successful answer.
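
The selection logic then looks roughly like this (the function names and the
``RunIAllocator`` helper are assumptions for illustration)::

  def PickNodes(user_nodes, default_allocators, request):
      """Apply the default-iallocator fallback rules."""
      if user_nodes:
          return user_nodes  # explicitly passed nodes always win
      for allocator in default_allocators:
          result = RunIAllocator(allocator, request)  # assumed helper
          if result.success:
              return result.nodes
      raise Exception("no default iallocator produced a valid placement")
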
Dry-run allocation
^^^^^^^^^^^^^^^^^^

The create instance LU will get a new 'dry-run' option that will just
simulate the placement, and return the chosen node lists after running
all the usual checks.

Cluster balancing
^^^^^^^^^^^^^^^^^

Instance adds/removals/moves can create a situation where the load on the
nodes is not spread equally. For this, a new iallocator mode will be
implemented called ``balance``, in which the plugin, given the current
cluster state and a maximum number of operations, will need to
compute the instance relocations needed in order to achieve a "better"
(by whatever measure the script believes is better) cluster.

Cluster capacity calculation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this mode, called ``capacity``, given an instance specification and
the current cluster state (similar to the ``allocate`` mode), the
plugin needs to return:

- how many instances can be allocated on the cluster with that specification
- on which nodes these will be allocated (in order)
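
For illustration, a ``capacity`` answer could look roughly like this, written
as the Python dict mirroring the JSON the plugin would emit (the key names are
assumptions, not a fixed format)::

  capacity_reply = {
      "success": True,
      "info": "can fit 3 more instances of this specification",
      "instances": 3,  # how many instances of the given spec fit
      "nodes": [       # chosen node lists, in allocation order
          ["node1.example.com", "node2.example.com"],
          ["node3.example.com", "node1.example.com"],
          ["node2.example.com", "node3.example.com"],
      ],
  }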