Revision 7faf5110 doc/design-2.1.rst

This document describes the major changes in Ganeti 2.1 compared to
the 2.0 version.

The 2.1 version will be a relatively small release. Its main aim is to
avoid changing too much of the core code, while addressing issues and
adding new features and improvements over 2.0, in a timely fashion.

.. contents:: :depth: 4

Objective
=========

Ganeti 2.1 will add features to help further automation of cluster
operations, further improve scalability to even bigger clusters, and
make it easier to debug the Ganeti core.
|

Background
==========

As for 2.0 we divide the 2.1 design into three areas:

- core changes, which affect the master daemon/job queue/locking or
  all/most logical units
- logical unit/feature changes
- external interface changes (eg. command line, os api, hooks, ...)

- list of storage units of this type
- check status of the storage unit

Additionally, there will be specific methods for each storage type,
for example:

- enable/disable allocations on a specific PV
- file storage directory creation/deletion
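
As an illustration, the common interface plus type-specific extensions
could look like the following sketch (class and method names here are
illustrative assumptions, not the final API)::

  class StorageType(object):
    """Base class for one type of storage unit (sketch only)."""
    def List(self):
      """Returns a list of all storage units of this type."""
      raise NotImplementedError

    def GetStatus(self, name):
      """Returns the status of the given storage unit."""
      raise NotImplementedError

  class LvmPvStorage(StorageType):
    """LVM physical volumes, with PV-specific methods."""
    def SetAllocatable(self, name, allocatable):
      """Enables or disables allocations on a specific PV."""
      raise NotImplementedError

  class FileStorage(StorageType):
    """File-based storage, with directory management methods."""
    def CreateDirectory(self, path):
      raise NotImplementedError

    def DeleteDirectory(self, path):
      raise NotImplementedError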

Current State and shortcomings
++++++++++++++++++++++++++++++

The class ``LockSet`` (see ``lib/locking.py``) is a container for one
or many ``SharedLock`` instances. It provides an interface to
add/remove locks and to acquire and subsequently release any number of
those locks contained in it.

Locks in a ``LockSet`` are always acquired in alphabetic order. Due to
the way we're using locks for nodes and instances (the single cluster
lock isn't affected by this issue) this can lead to long delays when
acquiring locks if another operation tries to acquire multiple locks
but has to wait for yet another operation.

In the following demonstration we assume we have the instance locks
``inst1``, ``inst2``, ``inst3`` and ``inst4``.

#. Operation A grabs lock for instance ``inst4``.
#. Operation B wants to acquire all instance locks in alphabetic
   order, but it has to wait for ``inst4``.
#. Operation C tries to lock ``inst1``, but it has to wait until
   Operation B (which is trying to acquire all locks) releases the
   lock again.

Non-blocking lock acquiring
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Acquiring locks for OpCode execution is always done in blocking mode;
the calls won't return until the lock has successfully been acquired
(or an error occurred, although we won't cover that case here).

``SharedLock`` and ``LockSet`` must be able to be acquired in a
non-blocking way. They must support a timeout and abort trying to
acquire the lock(s) after the specified amount of time.

Retry acquiring locks
^^^^^^^^^^^^^^^^^^^^^

To prevent other operations from waiting for a long time, such as
described in the demonstration before, ``LockSet`` must not keep locks
for a prolonged period of time when trying to acquire two or more
locks. Instead it should, with an increasing timeout for acquiring all
locks, release all locks again and sleep some time if it fails to
acquire all requested locks.

A good timeout value needs to be determined. In any case, ``LockSet``
should proceed to acquire locks in blocking mode after a few
(unsuccessful) attempts to acquire all requested locks.

One proposal for the timeout is to use ``2**tries`` seconds, where
``tries`` is the number of unsuccessful tries.
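
As a sketch, the retry loop could look like this (the ``timeout``
argument is the non-blocking interface proposed above; the return
convention and the sleep length are assumptions)::

  import time

  def acquire_all_with_backoff(lockset, names, max_tries=3):
    """Tries to acquire all given locks, backing off between tries."""
    for tries in range(max_tries):
      # Assumed interface: returns the acquired names, or None if not
      # all locks could be acquired before the timeout expired (in
      # which case the locks acquired so far were released again).
      acquired = lockset.acquire(names, timeout=2 ** tries)
      if acquired is not None:
        return acquired
      # Sleep some time before the next attempt, so that operations
      # blocked on locks we were holding can make progress.
      time.sleep(0.1)
    # After a few unsuccessful attempts, fall back to blocking mode.
    return lockset.acquire(names)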

In the demonstration before this would allow Operation C to continue
after Operation B unsuccessfully tried to acquire all locks and
released all acquired locks (``inst1``, ``inst2`` and ``inst3``)
again.

Other solutions discussed
+++++++++++++++++++++++++

There was also some discussion on going one step further and extending
the job queue (see ``lib/jqueue.py``) to select the next task for a
worker depending on whether it can acquire the necessary locks. While
this may reduce the number of necessary worker threads and/or increase
throughput on large clusters with many jobs, it also brings many
potential problems, such as contention and increased memory usage,
with it. As this would be an extension of the changes proposed before
it could be implemented at a later point in time, but we decided to
stay with the simpler solution for now.

Implementation details
++++++++++++++++++++++

The current design of ``SharedLock`` is not good for supporting
timeouts when acquiring a lock and there are also minor fairness
issues in it. We plan to address both with a redesign. A proof of
concept implementation was written and resulted in significantly
simpler code.

Currently ``SharedLock`` uses two separate queues for shared and
exclusive acquires and waiters get to run in turns. This means if an
exclusive acquire is released, the lock will allow shared waiters to
run and vice versa. Although it's still fair in the end there is a
slight bias towards shared waiters in the current implementation. The
same implementation with two shared queues cannot support timeouts
without adding a lot of complexity.

Our proposed redesign changes ``SharedLock`` to have a single queue.
There will be one condition (see Condition_ for a note about
performance) in the queue per exclusive acquire and two for all shared
acquires (see below for an explanation). The maximum queue length will
always be ``2 + (number of exclusive acquires waiting)``. The number
of queue entries for shared acquires can vary from 0 to 2.

The two conditions for shared acquires are a bit special. They will be
used in turn. When the lock is instantiated, no conditions are in the
queue. As soon as the first shared acquire arrives (and there are
holder(s) or waiting acquires; see Acquire_), the active condition is
added to the queue. Until it becomes the topmost condition in the
queue and has been notified, any shared acquire is added to this
active condition. When the active condition is notified, the
conditions are swapped and further shared acquires are added to the
previously inactive condition (which has now become the active
condition). After all waiters on the previously active (now inactive)
and now notified condition received the notification, it is removed
from the queue of pending acquires.

This means shared acquires will skip any exclusive acquire in the
queue. We believe it's better to improve parallelization on operations
only asking for shared (or read-only) locks. Exclusive operations
holding the same lock cannot be parallelized.
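
The following sketch shows the state the redesigned lock would keep;
attribute names are illustrative, not the final implementation::

  import threading

  class SharedLock(object):
    """Sketch of the single-queue design."""
    def __init__(self):
      self.__lock = threading.Lock()
      # Queue of pending acquires: one condition per exclusive
      # acquire, at most two entries in total for all shared acquires
      self.__pending = []
      # The two conditions used in turn for shared acquires; only the
      # active one accepts new shared waiters
      self.__active_shr_c = threading.Condition(self.__lock)
      self.__inactive_shr_c = threading.Condition(self.__lock)
      # Current holders: a set of shared owners or one exclusive owner
      self.__shr = set()
      self.__exc = None
      self.__deleted = False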

Acquire
*******

For exclusive acquires a new condition is created and appended to the
queue. Shared acquires are added to the active condition for shared
acquires and, if the condition is not yet on the queue, it's appended.

The next step is to wait for our condition to be on the top of the
queue (to guarantee fairness). If the timeout expired, we return to
the caller without acquiring the lock. On every notification we check
whether the lock has been deleted, in which case an error is returned
to the caller.

The lock can be acquired if we're on top of the queue (there is no one
else ahead of us). For an exclusive acquire, there must not be other
exclusive or shared holders. For a shared acquire, there must not be
an exclusive holder. If these conditions are all true, the lock is
acquired and we return to the caller. In any other case we wait again
on the condition.

If it was the last waiter on a condition, the condition is removed
from the queue.

Optimization: There's no need to touch the queue if there are no
pending acquires and no current holders. The caller can have the lock
immediately.

.. image:: design-2.1-lock-acquire.png
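
Continuing the sketch above, the acquire path could read as follows in
Python-like pseudocode; all the double-underscore helpers are assumed
to exist, the custom condition's ``wait`` is assumed to report whether
a notification arrived before the timeout, and ``errors.LockError``
refers to ``lib/errors.py``::

  def __acquire(self, shared, timeout):
    """Sketch of the acquire algorithm, not the actual code."""
    # Optimization: no pending acquires and no holders, so the lock
    # can be taken immediately without touching the queue
    if not self.__pending and not self.__is_held():
      self.__do_acquire(shared)
      return True

    cond = self.__enqueue(shared)
    while True:
      if not cond.wait(timeout):
        # Timeout expired without acquiring the lock
        self.__remove_if_last_waiter(cond)
        return False
      if self.__deleted:
        raise errors.LockError("Lock has been deleted")
      # Fairness: only the topmost condition may proceed, and the
      # lock must be free for the requested mode
      if self.__pending[0] is cond and self.__can_acquire(shared):
        self.__do_acquire(shared)
        self.__remove_if_last_waiter(cond)
        return True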

Release
*******

First the lock removes the caller from the internal owner list. If
there are pending acquires in the queue, the first (the oldest)
condition is notified.

If the first condition was the active condition for shared acquires,
the inactive condition will be made active. This ensures fairness with
exclusive locks by forcing consecutive shared acquires to wait in the
queue.

.. image:: design-2.1-lock-release.png

Delete
******

The caller must either hold the lock in exclusive mode already or the
lock must be acquired in exclusive mode. Trying to delete a lock while
it's held in shared mode must fail.

After ensuring the lock is held in exclusive mode, the lock will mark
itself as deleted and continue to notify all pending acquires. They
will wake up, notice the deleted lock and return an error to the
caller.

Condition
^^^^^^^^^

Note: This is not necessary for the locking changes above, but it may
be a good optimization (pending performance tests).

The existing locking code in Ganeti 2.0 uses Python's built-in
``threading.Condition`` class. Unfortunately ``Condition`` implements
timeouts by sleeping 1ms to 20ms between tries to acquire the
condition lock in non-blocking mode. This requires unnecessary context
switches and contention on the CPython GIL (Global Interpreter Lock).

By using POSIX pipes (see ``pipe(2)``) we can use the operating
system's support for timeouts on file descriptors (see ``select(2)``).
A custom condition class will have to be written for this.

On instantiation the class creates a pipe. After each notification the
previous pipe is abandoned and re-created (technically the old pipe
needs to stay around until all notifications have been delivered).

All waiting clients of the condition use ``select(2)`` or ``poll(2)``
to wait for notifications, optionally with a timeout. A notification
will be signalled to the waiting clients by closing the pipe. If the
pipe wasn't closed during the timeout, the waiting function returns to
its caller nonetheless.
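
A minimal sketch of such a pipe-based condition, without error
handling or cleanup of old file descriptors, could look like this::

  import os
  import select

  class PipeCondition(object):
    """Condition variable using a pipe for notifications (sketch)."""
    def __init__(self, lock):
      self.__lock = lock
      self.__read_fd, self.__write_fd = os.pipe()

    def wait(self, timeout=None):
      """Waits for a notification or, optionally, a timeout."""
      # Grab the current pipe before releasing the lock, so that a
      # concurrent notification can't swap it under us
      read_fd = self.__read_fd
      self.__lock.release()
      try:
        # select(2) returns when the pipe is closed (the read end
        # becomes readable at EOF) or when the timeout expires
        select.select([read_fd], [], [], timeout)
      finally:
        self.__lock.acquire()

    def notifyAll(self):
      """Notifies all waiters by closing the current pipe."""
      os.close(self.__write_fd)
      # Technically the old read end must stay open until all waiters
      # have seen the notification; this sketch glosses over that
      self.__read_fd, self.__write_fd = os.pipe()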

Feature changes
===============

Ganeti Confd
~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

In Ganeti 2.0 all nodes are equal, but some are more equal than
others. In particular they are divided between "master", "master
candidates" and "normal". (Moreover they can be offline or drained,
but this is not important for the current discussion). In general the
whole configuration is only replicated to master candidates, and some
partial information is spread to all nodes via ssconf.

This change was done so that the most frequent Ganeti operations
didn't need to contact all nodes, and so clusters could become bigger.
If we want more information to be available on all nodes, we need to
add more ssconf values, which is counter-balancing the change, or to
talk with the master node, which is not designed to happen now, and
requires its availability.

Information such as the instance->primary_node mapping will be needed
on all nodes, and we also want to make sure services external to the
cluster can query this information as well. This information must be
available at all times, so we can't query it through RAPI, which would
be a single point of failure, as it's only available on the master.

Proposed changes
++++++++++++++++

In order to allow fast and highly available read-only access to some
configuration values, we'll create a new ganeti-confd daemon, which
will run on master candidates. This daemon will talk via UDP, and
authenticate messages using HMAC with a cluster-wide shared key. This
key will be generated at cluster init time, and stored on the cluster
alongside the ganeti SSL keys, and readable only by root.

An interested client can query a value by making a request to a subset
of the cluster master candidates. It will then wait to get a few
responses, and use the one with the highest configuration serial
number. Since the configuration serial number is increased each time
the ganeti config is updated, and the serial number is included in all
answers, this can be used to make sure to use the most recent answer,
in case some master candidates are stale or in the middle of a
configuration update.
|
331 | 343 |
|
332 | 344 |
In order to prevent replay attacks queries will contain the current unix |
333 | 345 |
timestamp according to the client, and the server will verify that its |
334 |
timestamp is in the same 5 minutes range (this requires synchronized clocks,
|
|
335 |
which is a good idea anyway). Queries will also contain a "salt" which they
|
|
336 |
expect the answers to be sent with, and clients are supposed to accept only
|
|
337 |
answers which contain salt generated by them. |
|
346 |
timestamp is in the same 5 minutes range (this requires synchronized |
|
347 |
clocks, which is a good idea anyway). Queries will also contain a "salt"
|
|
348 |
which they expect the answers to be sent with, and clients are supposed
|
|
349 |
to accept only answers which contain salt generated by them.
|
|
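
A sketch of the server-side checks (in the Python 2 style of the
codebase); the SHA-1 digest and the constant name are assumptions,
only the salt+msg signature and the 5-minute window come from the
design above::

  import hashlib
  import hmac
  import time

  MAX_CLOCK_SKEW = 300  # the 5-minute range described above

  def VerifyQuery(cluster_key, salt, msg, signature):
    """Checks the HMAC signature and timing of a received query."""
    expected = hmac.new(cluster_key, salt + msg,
                        hashlib.sha1).hexdigest()
    if expected != signature:
      return False  # bad signature, discard the message
    # The salt doubles as the client's unix timestamp
    if abs(time.time() - int(salt)) > MAX_CLOCK_SKEW:
      return False  # too old or too new, possibly a replay
    return True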

The configuration daemon will be able to answer simple queries such as
"node role by name" or "node primary ip by instance ip". A query
carries a JSON-encoded message ('msg') whose fields are:

- 'protocol', integer, is the confd protocol version (initially just
  constants.CONFD_PROTOCOL_VERSION, with a value of 1)
- 'type', integer, is the query type. For example "node role by name"
  or "node primary ip by instance ip". Constants will be provided for
  the actual available query types.
- 'query', string, is the search key. For example an ip, or a node
  name.
- 'rsalt', string, is the required response salt. The client must use
  it to recognize which answer it's getting.

Alongside 'msg', each query also carries:

- 'salt' must be the current unix timestamp, according to the client.
  Servers can refuse messages which have a wrong timing, according to
  their configuration and clock.
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key

If an answer comes back (which is optional, since confd works over
UDP) it will be in this format::

  {
    "msg": "{\"status\": 0,

The fields of the answer's 'msg' are:

- 'protocol', integer, is the confd protocol version (initially just
  constants.CONFD_PROTOCOL_VERSION, with a value of 1)
- 'status', integer, is the error code. Initially just 0 for 'ok' or
  '1' for 'error' (in which case answer contains an error detail,
  rather than an answer), but in the future it may be expanded to have
  more meanings (eg: 2, the answer is compressed)
- 'answer', is the actual answer. Its type and meaning is query
  specific. For example for "node primary ip by instance ip" queries
  it will be a string containing an IP address, for "node role by
  name" queries it will be an integer which encodes the role (master,
  candidate, drained, offline) according to constants.

As in the query, the answer also carries:

- 'salt' is the requested salt from the query. A client can use it to
  recognize what query the answer is answering.
- 'hmac' is an hmac signature of salt+msg, with the cluster hmac key
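
Putting the pieces together, a client could look like this sketch (in
the Python 2 style of the codebase); the SHA-1 digest and the 'serial'
field inside the answer's payload are assumptions based on the
serial-number discussion above::

  import hashlib
  import hmac
  import json
  import socket
  import time

  def ConfdQuery(candidates, port, key, qtype, query, rsalt):
    """Queries several master candidates and keeps the best answer."""
    msg = json.dumps({"protocol": 1, "type": qtype,
                      "query": query, "rsalt": rsalt})
    salt = str(int(time.time()))
    signature = hmac.new(key, salt + msg, hashlib.sha1).hexdigest()
    packet = json.dumps({"msg": msg, "salt": salt, "hmac": signature})

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    for node in candidates:
      sock.sendto(packet, (node, port))

    # Collect answers for a short while and keep the one with the
    # highest configuration serial number, dropping unauthenticated
    # or unexpected ones
    best = None
    sock.settimeout(1.0)
    try:
      while True:
        data, _ = sock.recvfrom(4096)
        answer = json.loads(data)
        expected = hmac.new(key, answer["salt"] + answer["msg"],
                            hashlib.sha1).hexdigest()
        if answer["salt"] != rsalt or answer["hmac"] != expected:
          continue
        payload = json.loads(answer["msg"])
        if best is None or payload["serial"] > best["serial"]:
          best = payload
    except socket.timeout:
      pass
    return best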

Redistribute Config
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently LURedistributeConfig triggers a copy of the updated
configuration file to all master candidates and of the ssconf files to
all nodes. There are other files which are maintained manually but
which are important to keep in sync. These are:

- rapi SSL key certificate file (rapi.pem) (on master candidates)
- rapi user/password file rapi_users (on master candidates)

Furthermore there are some files which are hypervisor specific but we
may want to keep in sync:

- the xen-hvm hypervisor uses one shared file for all vnc passwords,
  and copies the file once, during node add. This design is subject to
  revision to be able to have different passwords for different groups
  of instances via the use of hypervisor parameters, and to allow
  xen-hvm and kvm to use an equal system to provide password-protected
  vnc sessions. In general, though, it would be useful if the vnc
  password files were copied as well, to avoid unwanted vnc password
  changes on instance failover/migrate.

Optionally the admin may want to also ship files such as the global
xend.conf file, and the network scripts to all nodes.

Proposed changes
++++++++++++++++

RedistributeConfig will be changed to copy also the rapi files, and to
call every enabled hypervisor asking for a list of additional files to
copy. Users will have the possibility to populate a file containing a
list of files to be distributed; this file will be propagated as well.
Such a solution is really simple to implement and it's easily usable
by scripts.
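
For illustration, the per-hypervisor hook could be as simple as the
following sketch; the method name and file paths are assumptions::

  class XenHvmHypervisor(object):
    """Excerpt of a hypervisor class (illustrative only)."""
    @classmethod
    def GetAncillaryFiles(cls):
      """Returns the list of files to distribute to all nodes."""
      return ["/etc/ganeti/vnc-cluster-password"]  # example path

  def CollectRedistributedFiles(enabled_hypervisors, user_files):
    """Builds the full list of files RedistributeConfig should copy."""
    files = ["rapi.pem", "rapi_users"]
    for hv in enabled_hypervisors:
      files.extend(hv.GetAncillaryFiles())
    # Files listed by the administrator in the user-maintained list
    files.extend(user_files)
    return files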

This code will also be shared (via tasklets or by other means, if
tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs
(so that the relevant files will be automatically shipped to new
master candidates as they are set).

VNC Console Password
~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently just the xen-hvm hypervisor supports setting a password to
connect to the instances' VNC console, and has one common password
stored in a file.

This doesn't allow different passwords for different instances/groups
of instances, and makes it necessary to remember to copy the file
around the cluster when the password changes.

Proposed changes
++++++++++++++++

We'll change the VNC password file to a vnc_password_file hypervisor
parameter. This way it can have a cluster default, but also a
different value for each instance. The VNC enabled hypervisors (xen
and kvm) will publish all the password files in use through the
cluster so that a redistribute-config will ship them to all nodes (see
the Redistribute Config proposed changes above).

The current VNC_PASSWORD_FILE constant will be removed, but its value
will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining
backwards compatibility with 2.0.

The code to export the list of VNC password files from the hypervisors
to RedistributeConfig will be shared between the KVM and xen-hvm
hypervisors.

Disk/Net parameters
~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently disks and network interfaces have a few tweakable options
and all the rest is left to a default we chose. We're finding that we
need more and more to tweak some of these parameters, for example to
disable barriers for DRBD devices, or allow striping for the LVM
volumes.

Moreover for many of these parameters it will be nice to have
cluster-wide defaults, and then be able to change them per
disk/interface.

Proposed changes
++++++++++++++++

We will add new cluster level diskparams and netparams, which will
contain all the tweakable parameters. All values which have a sensible
cluster-wide default will go into this new structure, while parameters
which have unique values will not.

Example of network parameters:

- mode: bridge/route
- link: for mode "bridge" the bridge to connect to, for mode route it
  can contain the routing table, or the destination interface

Example of disk parameters:

- stripe: lvm stripes
- meta_flushes: drbd, enable/disable metadata "barriers"
- data_flushes: drbd, enable/disable data "barriers"

Some parameters are bound to be disk-type specific (drbd, vs lvm, vs
files) or hypervisor specific (nic models for example), but for now
they will all live in the same structure. Each component is supposed
to validate only the parameters it knows about, and ganeti itself will
make sure that no "globally unknown" parameters are added, and that no
parameters have overridden meanings for different components.

The parameters will be kept, as for the BEPARAMS, in a "default"
category, which will allow us to expand on by creating instance
"classes" in the future. Instance classes are not a feature we plan to
implement in 2.1, though.
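
To make the intended layout concrete, the cluster-level structures
could look like this sketch (the "default" category mirrors how
BEPARAMS are stored; the example values are assumptions)::

  netparams = {
    "default": {
      "mode": "bridge",       # bridge/route, as described above
      "link": "xen-br0",      # example bridge name
      },
    }

  diskparams = {
    "default": {
      "stripe": 1,            # lvm stripes
      "meta_flushes": True,   # drbd metadata "barriers"
      "data_flushes": True,   # drbd data "barriers"
      },
    }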

Non bridged instances support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current State and shortcomings
++++++++++++++++++++++++++++++

Currently each instance NIC must be connected to a bridge, and if the
bridge is not specified the default cluster one is used. This makes it
impossible to use the vif-route xen network scripts, or other
alternative mechanisms that don't need a bridge to work.

Proposed changes
++++++++++++++++

The new "mode" network parameter will distinguish between bridged
interfaces and routed ones.

When mode is "bridge" the "link" parameter will contain the bridge the
instance should be connected to, effectively making things as today.
The value has been migrated from a nic field to a parameter to allow
for an easier manipulation of the cluster default.

When mode is "route" the ip field of the interface will become
mandatory, to allow for a route to be set. In the future we may also
want to accept multiple IPs or IP/mask values for this purpose. We
will evaluate possible meanings of the link parameter to signify a
routing table to be used, which would allow for insulation between
instance groups (as today happens for different bridges).

For now we won't add a parameter to specify which network script gets
called for which instance, so in a mixed cluster the network script
must be able to handle both cases. The default kvm vif script will be
changed to do so. (Xen doesn't have a ganeti provided script, so
nothing will be done for that hypervisor.)

Introducing persistent UUIDs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Automated disk repairs infrastructure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Replacing defective disks in an automated fashion is quite difficult
with the current version of Ganeti. These changes will introduce
additional functionality and interfaces to simplify automating disk
replacements on a Ganeti node.

Fix node volume group
+++++++++++++++++++++

This is the most difficult addition, as it can lead to data loss if
it's not properly safeguarded.

The operation must be done only when all the other nodes that have
instances in common with the target node are fine, i.e. this is the
only node with problems, and also we have to double-check that all
instances on this node have at least a good copy of the data.

This might mean that we have to enhance the GetMirrorStatus calls, and
introduce a smarter version that can tell us more about the status of
an instance.

Stop allocation on a given PV
+++++++++++++++++++++++++++++

This is somewhat simple. First we need a "list PVs" opcode (and its
associated logical unit) and then a set PV status opcode/LU. These in
combination should allow both checking and changing the disk/PV
status.

Instance disk status
++++++++++++++++++++

This new opcode or opcode change must list the instance-disk-index and
node combinations of the instance together with their status. This
will allow determining what part of the instance is broken (if any).

Repair instance
+++++++++++++++

This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed,
in order to fix the instance status. It only affects primary
instances; secondaries can just be moved away.

Migrate node
++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
migrate`` code and run migrate for all instances on the node.

Evacuate node
++++++++++++++

This new opcode/LU/RAPI call will take over the current ``gnt-node
evacuate`` code and run replace-secondary with an iallocator script
for all instances on the node.

External interface changes
==========================

OS API
~~~~~~

The OS API of Ganeti 2.0 has been built with extensibility in mind.
Since we pass everything as environment variables it's a lot easier to
send new information to the OSes without breaking retrocompatibility.
This section of the design outlines the proposed extensions to the API
and their implementation.

API Version Compatibility Handling
++++++++++++++++++++++++++++++++++

In 2.1 there will be a new OS API version (eg. 15), which should be
mostly compatible with api 10, except for some new added variables.
Since it's easy not to pass some variables we'll be able to handle
Ganeti 2.0 OSes by just filtering out the newly added piece of
information. We will still encourage OSes to declare support for the
new API after checking that the new variables don't provide any
conflict for them, and we will drop api 10 support after ganeti 2.1
has been released.

New Environment variables
+++++++++++++++++++++++++

Some variables have never been added to the OS api but would
definitely be useful for the OSes. We plan to add an
INSTANCE_HYPERVISOR variable to allow the OS to make changes relevant
to the virtualization the instance is going to use. Since this field
is immutable for each instance, the OS can tailor the install to it
without having to make sure the instance can run under any
virtualization technology.

We also want the OS to know the particular hypervisor parameters, to
be able to customize the install even more. Since the parameters can
change, though, we will pass them only as an "FYI": if an OS ties some
instance functionality to the value of a particular hypervisor
parameter, manual changes or a reinstall may be needed to adapt the
instance to the new environment. This is not a regression as of today,
because even if the OSes are left blind about this information,
sometimes they still need to make compromises and cannot satisfy all
possible parameter values.

OS Variants
+++++++++++

Currently we are witnessing some degree of "os proliferation" just to
change a simple installation behavior. This means that the same OS
gets installed on the cluster multiple times, with different names, to
customize just one installation behavior. Usually such OSes try to
share as much as possible through symlinks, but this still causes
complications on the user side, especially when multiple parameters
must be cross-matched.

For example today if you want to install debian etch, lenny or squeeze
you probably need to install the debootstrap OS multiple times,
changing its configuration file, and calling it debootstrap-etch,
debootstrap-lenny or debootstrap-squeeze. Furthermore if you have for
example a "server" and a "development" environment which installs
different packages/configuration files and must be available for all
installs you'll probably end up with debootstrap-etch-server,
debootstrap-etch-dev, debootstrap-lenny-server, debootstrap-lenny-dev,
etc. Crossing more than two parameters quickly becomes unmanageable.

In order to avoid this we plan to make OSes more customizable, by
allowing each OS to declare a list of variants which can be used to
customize it. The variants list is mandatory and must be written, one
variant per line, in the new "variants.list" file inside the main os
dir. At least one variant must be supported. When choosing the OS
exactly one variant will have to be specified, and will be encoded in
the os name as <OS-name>+<variant>. As for today it will be possible
to change an instance's OS at creation or install time.

The 2.1 OS list will be the combination of each OS, plus its supported
variants. This will cause the name proliferation to remain, but at
least the internal OS code will be simplified to just parsing the
passed variant, without the need for symlinks or code duplication.

Also we expect the OSes to declare only "interesting" variants, but to
accept some non-declared ones which a user will be able to pass in by
overriding the checks ganeti does. This will be useful for allowing
some variations to be used without polluting the OS list (per-OS
documentation should list all supported variants). If a variant which
is not internally supported is forced through, the OS scripts should
abort.

In the future (post 2.1) we may want to move to full fledged
parameters all orthogonal to each other (for example "architecture"
(i386, amd64), "suite" (lenny, squeeze, ...), etc). (As opposed to the
variant, which is a single parameter, and you need a different variant
for all the set of combinations you want to support). In this case we
envision the variants to be moved inside of Ganeti and be associated
with lists of parameter->values associations, which will then be
passed to the OS.
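
A small sketch of the name handling and variant check; the helper
names and the ``force`` flag (for overriding the checks as described
above) are illustrative assumptions::

  import os

  def SplitOsName(name):
    """Splits e.g. "debootstrap+etch" into ("debootstrap", "etch")."""
    if "+" in name:
      return tuple(name.split("+", 1))
    return (name, None)

  def LoadVariants(os_dir):
    """Reads the declared variants, one per line, from variants.list."""
    path = os.path.join(os_dir, "variants.list")
    return [line.strip() for line in open(path) if line.strip()]

  def CheckOsVariant(os_dir, name, force=False):
    """Validates the variant encoded in an os name."""
    base, variant = SplitOsName(name)
    if variant is None:
      raise ValueError("Exactly one variant must be specified")
    if not force and variant not in LoadVariants(os_dir):
      raise ValueError("OS %r does not declare variant %r" %
                       (base, variant))
    return base, variant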

IAllocator changes
~~~~~~~~~~~~~~~~~~

the current cluster state (similar to the ``allocate`` mode), the
plugin needs to return:

- how many instances can be allocated on the cluster with that
  specification
- on which nodes these will be allocated (in order)
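
For illustration, such a capacity response could take the following
shape (the key names are assumptions, not the final protocol)::

  result = {
    "instances": 15,            # how many instances fit
    "nodes": [                  # allocation targets, in order
      "node1.example.com",
      "node2.example.com",
      ],
    }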

.. vim: set textwidth=72 :