root / doc / design-2.1.rst @ d2baa21d
1 | 82a1c938 | Guido Trotter | ================= |
---|---|---|---|
2 | 82a1c938 | Guido Trotter | Ganeti 2.1 design |
3 | 82a1c938 | Guido Trotter | ================= |
4 | 82a1c938 | Guido Trotter | |
5 | 82a1c938 | Guido Trotter | This document describes the major changes in Ganeti 2.1 compared to |
6 | 82a1c938 | Guido Trotter | the 2.0 version. |
7 | 82a1c938 | Guido Trotter | |
8 | 7faf5110 | Michael Hanselmann | The 2.1 version will be a relatively small release. Its main aim is to |
9 | 7faf5110 | Michael Hanselmann | avoid changing too much of the core code, while addressing issues and |
10 | 7faf5110 | Michael Hanselmann | adding new features and improvements over 2.0, in a timely fashion. |
11 | 82a1c938 | Guido Trotter | |
12 | 5ee09f03 | Michael Hanselmann | .. contents:: :depth: 4 |
13 | 82a1c938 | Guido Trotter | |
14 | 82a1c938 | Guido Trotter | Objective |
15 | 82a1c938 | Guido Trotter | ========= |
16 | 82a1c938 | Guido Trotter | |
17 | 82a1c938 | Guido Trotter | Ganeti 2.1 will add features to help further automation of cluster |
18 | 7faf5110 | Michael Hanselmann | operations, further improve scalability to even bigger clusters, and |
19 | 7faf5110 | Michael Hanselmann | make it easier to debug the Ganeti core. |
20 | 82a1c938 | Guido Trotter | |
21 | 82a1c938 | Guido Trotter | Background |
22 | 82a1c938 | Guido Trotter | ========== |
23 | 82a1c938 | Guido Trotter | |
24 | 82a1c938 | Guido Trotter | Overview |
25 | 82a1c938 | Guido Trotter | ======== |
26 | 82a1c938 | Guido Trotter | |
27 | 82a1c938 | Guido Trotter | Detailed design |
28 | 82a1c938 | Guido Trotter | =============== |
29 | 82a1c938 | Guido Trotter | |
30 | 82a1c938 | Guido Trotter | As for 2.0 we divide the 2.1 design into three areas: |
31 | 82a1c938 | Guido Trotter | |
32 | 7faf5110 | Michael Hanselmann | - core changes, which affect the master daemon/job queue/locking or |
33 | 7faf5110 | Michael Hanselmann | all/most logical units |
34 | 82a1c938 | Guido Trotter | - logical unit/feature changes |
35 | 82a1c938 | Guido Trotter | - external interface changes (e.g. command line, OS API, hooks, ...) |
36 | 82a1c938 | Guido Trotter | |
37 | 82a1c938 | Guido Trotter | Core changes |
38 | 82a1c938 | Guido Trotter | ------------ |
39 | 82a1c938 | Guido Trotter | |
40 | a392a6b8 | Michael Hanselmann | Storage units modelling |
41 | a392a6b8 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~~~~ |
42 | a392a6b8 | Michael Hanselmann | |
43 | a392a6b8 | Michael Hanselmann | Currently, Ganeti has a good model of the block devices for instances |
44 | a392a6b8 | Michael Hanselmann | (e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the |
45 | a392a6b8 | Michael Hanselmann | storage pools that are providing the space for these front-end |
46 | a392a6b8 | Michael Hanselmann | devices. For example, there are hardcoded inter-node RPC calls for |
47 | a392a6b8 | Michael Hanselmann | volume group listing, file storage creation/deletion, etc. |
48 | a392a6b8 | Michael Hanselmann | |
49 | a392a6b8 | Michael Hanselmann | The storage units framework will implement a generic handling for all |
50 | a392a6b8 | Michael Hanselmann | kinds of storage backends: |
51 | a392a6b8 | Michael Hanselmann | |
52 | a392a6b8 | Michael Hanselmann | - LVM physical volumes |
53 | a392a6b8 | Michael Hanselmann | - LVM volume groups |
54 | a392a6b8 | Michael Hanselmann | - File-based storage directories |
55 | a392a6b8 | Michael Hanselmann | - any other future storage method |
56 | a392a6b8 | Michael Hanselmann | |
57 | a392a6b8 | Michael Hanselmann | There will be a generic list of methods that each storage unit type |
58 | a392a6b8 | Michael Hanselmann | will provide, like: |
59 | a392a6b8 | Michael Hanselmann | |
60 | a392a6b8 | Michael Hanselmann | - list of storage units of this type |
61 | a392a6b8 | Michael Hanselmann | - check status of the storage unit |
62 | a392a6b8 | Michael Hanselmann | |
63 | 7faf5110 | Michael Hanselmann | Additionally, there will be specific methods for each storage unit |
64 | 7faf5110 | Michael Hanselmann | type, for example: |
65 | a392a6b8 | Michael Hanselmann | |
66 | a392a6b8 | Michael Hanselmann | - enable/disable allocations on a specific PV |
67 | a392a6b8 | Michael Hanselmann | - file storage directory creation/deletion |
68 | a392a6b8 | Michael Hanselmann | - VG consistency fixing |
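The split between generic and type-specific methods could be sketched roughly as follows. This is only an illustration in modern Python, not the Ganeti implementation: the names ``StorageUnit``, ``FileStorage``, ``list_units``, ``check_status``, ``create_dir`` and ``delete_dir`` are hypothetical, and the file storage type keeps its directories in memory instead of touching the filesystem.

```python
import abc


class StorageUnit(abc.ABC):
    """Hypothetical base class: methods every storage unit type provides."""

    @abc.abstractmethod
    def list_units(self):
        """Return the names of all storage units of this type."""

    @abc.abstractmethod
    def check_status(self, name):
        """Return True if the named storage unit is usable."""


class FileStorage(StorageUnit):
    """File-based storage directories, tracked in memory for illustration."""

    def __init__(self):
        self._dirs = set()

    def list_units(self):
        return sorted(self._dirs)

    def check_status(self, name):
        return name in self._dirs

    # Type-specific operations, not part of the generic interface.
    def create_dir(self, path):
        self._dirs.add(path)

    def delete_dir(self, path):
        self._dirs.discard(path)
```

Callers that only need listing or status checks can work against ``StorageUnit`` alone; operations such as directory creation stay on the concrete type, and no relationship between types is modelled.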
69 | a392a6b8 | Michael Hanselmann | |
70 | a392a6b8 | Michael Hanselmann | This will allow much better modelling and unification of the various |
71 | a392a6b8 | Michael Hanselmann | RPC calls related to backend storage pools in the future. Ganeti 2.1 is |
72 | a392a6b8 | Michael Hanselmann | intended to add the basics of the framework, and not necessarily move |
73 | a392a6b8 | Michael Hanselmann | all the current VG/FileBased operations to it. |
74 | a392a6b8 | Michael Hanselmann | |
75 | a392a6b8 | Michael Hanselmann | Note that while we model both LVM PVs and LVM VGs, the framework will |
76 | a392a6b8 | Michael Hanselmann | **not** model any relationship between the different types. In other |
77 | a392a6b8 | Michael Hanselmann | words, we model neither inheritance nor stacking, since this is |
78 | a392a6b8 | Michael Hanselmann | too complex for our needs. While a ``vgreduce`` operation on an LVM VG |
79 | a392a6b8 | Michael Hanselmann | could actually remove a PV from it, this will not be handled at the |
80 | a392a6b8 | Michael Hanselmann | framework level, but at individual operation level. The goal is that |
81 | a392a6b8 | Michael Hanselmann | this is a lightweight framework, for abstracting the different storage |
82 | a392a6b8 | Michael Hanselmann | operations, and not for modelling the storage hierarchy. |
83 | a392a6b8 | Michael Hanselmann | |
84 | 5ee09f03 | Michael Hanselmann | |
85 | 5ee09f03 | Michael Hanselmann | Locking improvements |
86 | 5ee09f03 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~ |
87 | 5ee09f03 | Michael Hanselmann | |
88 | 5ee09f03 | Michael Hanselmann | Current State and shortcomings |
89 | 5ee09f03 | Michael Hanselmann | ++++++++++++++++++++++++++++++ |
90 | 5ee09f03 | Michael Hanselmann | |
91 | a5b360e4 | Michael Hanselmann | The class ``LockSet`` (see ``lib/locking.py``) is a container for one or |
92 | 7faf5110 | Michael Hanselmann | many ``SharedLock`` instances. It provides an interface to add/remove |
93 | 7faf5110 | Michael Hanselmann | locks and to acquire and subsequently release any number of those locks |
94 | 7faf5110 | Michael Hanselmann | contained in it. |
95 | 5ee09f03 | Michael Hanselmann | |
96 | 7faf5110 | Michael Hanselmann | Locks in a ``LockSet`` are always acquired in alphabetic order. Due to |
97 | 7faf5110 | Michael Hanselmann | the way we're using locks for nodes and instances (the single cluster |
98 | 7faf5110 | Michael Hanselmann | lock isn't affected by this issue) this can lead to long delays when |
99 | 7faf5110 | Michael Hanselmann | acquiring locks if another operation tries to acquire multiple locks but |
100 | 7faf5110 | Michael Hanselmann | has to wait for yet another operation. |
101 | 5ee09f03 | Michael Hanselmann | |
102 | a5b360e4 | Michael Hanselmann | In the following demonstration we assume that the instance locks |
103 | a5b360e4 | Michael Hanselmann | ``inst1``, ``inst2``, ``inst3`` and ``inst4`` exist. |
104 | 5ee09f03 | Michael Hanselmann | |
105 | 5ee09f03 | Michael Hanselmann | #. Operation A grabs lock for instance ``inst4``. |
106 | 7faf5110 | Michael Hanselmann | #. Operation B wants to acquire all instance locks in alphabetic order, |
107 | 7faf5110 | Michael Hanselmann | but it has to wait for ``inst4``. |
108 | 5ee09f03 | Michael Hanselmann | #. Operation C tries to lock ``inst1``, but it has to wait until |
109 | a5b360e4 | Michael Hanselmann | Operation B (which is trying to acquire all locks) releases the lock |
110 | a5b360e4 | Michael Hanselmann | again. |
111 | 5ee09f03 | Michael Hanselmann | #. Operation A finishes and releases lock on ``inst4``. Operation B can |
112 | 5ee09f03 | Michael Hanselmann | continue and eventually releases all locks. |
113 | 5ee09f03 | Michael Hanselmann | #. Operation C can get ``inst1`` lock and finishes. |
114 | 5ee09f03 | Michael Hanselmann | |
115 | 5ee09f03 | Michael Hanselmann | Technically there's no need for Operation C to wait for Operation A, and |
116 | 5ee09f03 | Michael Hanselmann | subsequently Operation B, to finish. Operation B can't continue until |
117 | 5ee09f03 | Michael Hanselmann | Operation A is done (it has to wait for ``inst4``), anyway. |
118 | 5ee09f03 | Michael Hanselmann | |
119 | 5ee09f03 | Michael Hanselmann | Proposed changes |
120 | 5ee09f03 | Michael Hanselmann | ++++++++++++++++ |
121 | 5ee09f03 | Michael Hanselmann | |
122 | 5ee09f03 | Michael Hanselmann | Non-blocking lock acquiring |
123 | 5ee09f03 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
124 | 5ee09f03 | Michael Hanselmann | |
125 | 7faf5110 | Michael Hanselmann | Acquiring locks for OpCode execution is always done in blocking mode. |
126 | 7faf5110 | Michael Hanselmann | The acquire call won't return until the lock has successfully been |
127 | 7faf5110 | Michael Hanselmann | acquired (or an error occurred, although we won't cover that case here). |
128 | 5ee09f03 | Michael Hanselmann | |
129 | 7faf5110 | Michael Hanselmann | ``SharedLock`` and ``LockSet`` must be able to be acquired in a |
130 | 7faf5110 | Michael Hanselmann | non-blocking way. They must support a timeout and abort trying to |
131 | 7faf5110 | Michael Hanselmann | acquire the lock(s) after the specified amount of time. |
132 | 5ee09f03 | Michael Hanselmann | |
133 | 5ee09f03 | Michael Hanselmann | Retry acquiring locks |
134 | 5ee09f03 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^ |
135 | 5ee09f03 | Michael Hanselmann | |
136 | 7faf5110 | Michael Hanselmann | To prevent other operations from waiting for a long time, as |
137 | 7faf5110 | Michael Hanselmann | described in the demonstration above, ``LockSet`` must not keep locks |
138 | 7faf5110 | Michael Hanselmann | for a prolonged period of time when trying to acquire two or more locks. |
139 | 7faf5110 | Michael Hanselmann | Instead, if it fails to acquire all requested locks within a timeout, it |
140 | 7faf5110 | Michael Hanselmann | should release the locks it already holds, sleep for some time and try |
141 | 7faf5110 | Michael Hanselmann | again with an increased timeout. |
142 | 5ee09f03 | Michael Hanselmann | |
143 | 7faf5110 | Michael Hanselmann | A good timeout value needs to be determined. In any case, ``LockSet`` |
144 | 7faf5110 | Michael Hanselmann | should proceed to acquire locks in blocking mode after a few |
145 | 7faf5110 | Michael Hanselmann | (unsuccessful) attempts to acquire all requested locks. |
146 | 5ee09f03 | Michael Hanselmann | |
147 | 7faf5110 | Michael Hanselmann | One proposal for the timeout is to use ``2**tries`` seconds, where |
148 | 7faf5110 | Michael Hanselmann | ``tries`` is the number of unsuccessful tries. |
149 | 5ee09f03 | Michael Hanselmann | |
150 | 7faf5110 | Michael Hanselmann | In the demonstration before this would allow Operation C to continue |
151 | 7faf5110 | Michael Hanselmann | after Operation B unsuccessfully tried to acquire all locks and released |
152 | 7faf5110 | Michael Hanselmann | all acquired locks (``inst1``, ``inst2`` and ``inst3``) again. |
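The retry scheme described in this section can be sketched as below. This is a simplified illustration, not the ``LockSet`` code: it uses plain ``threading.Lock`` objects from modern Python (whose ``acquire`` accepts a ``timeout``), and the function name ``acquire_all`` as well as the ``max_tries`` and ``unit`` parameters are assumptions made for the example. With ``unit=1.0`` the timeout of each attempt is ``2**tries`` seconds, as proposed.

```python
import threading
import time


def acquire_all(locks, names, max_tries=3, unit=1.0):
    """Acquire all named locks, backing off instead of holding a partial set.

    `locks` maps names to threading.Lock objects. Returns the number of
    timed attempts that failed before every lock was acquired.
    """
    ordered = sorted(names)  # locks are always taken in alphabetic order
    for tries in range(max_tries):
        # One deadline covers the whole attempt: unit * 2**tries seconds.
        deadline = time.monotonic() + unit * 2 ** tries
        held = []
        for name in ordered:
            remaining = max(0.0, deadline - time.monotonic())
            if not locks[name].acquire(timeout=remaining):
                break
            held.append(name)
        if len(held) == len(ordered):
            return tries
        # Give up the partial set so other operations can make progress.
        for name in reversed(held):
            locks[name].release()
        time.sleep(unit)
    # After a few unsuccessful attempts, fall back to blocking mode.
    for name in ordered:
        locks[name].acquire()
    return max_tries
```

In the demonstration, Operation B would call something like ``acquire_all(locks, ["inst1", "inst2", "inst3", "inst4"])``; each time it times out on ``inst4`` it releases ``inst1`` to ``inst3``, which gives Operation C a window to run.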
153 | 5ee09f03 | Michael Hanselmann | |
154 | 5ee09f03 | Michael Hanselmann | Other solutions discussed |
155 | 5ee09f03 | Michael Hanselmann | +++++++++++++++++++++++++ |
156 | 5ee09f03 | Michael Hanselmann | |
157 | 7faf5110 | Michael Hanselmann | There was also some discussion on going one step further and extending |
158 | 7faf5110 | Michael Hanselmann | the job queue (see ``lib/jqueue.py``) to select the next task for a worker |
159 | 7faf5110 | Michael Hanselmann | depending on whether it can acquire the necessary locks. While this may |
160 | 7faf5110 | Michael Hanselmann | reduce the number of necessary worker threads and/or increase throughput |
161 | 7faf5110 | Michael Hanselmann | on large clusters with many jobs, it also brings many potential |
162 | 7faf5110 | Michael Hanselmann | problems, such as contention and increased memory usage, with it. As |
163 | 7faf5110 | Michael Hanselmann | this would be an extension of the changes proposed before it could be |
164 | 7faf5110 | Michael Hanselmann | implemented at a later point in time, but we decided to stay with the |
165 | 7faf5110 | Michael Hanselmann | simpler solution for now. |
166 | 5ee09f03 | Michael Hanselmann | |
167 | ca9ccea8 | Michael Hanselmann | Implementation details |
168 | ca9ccea8 | Michael Hanselmann | ++++++++++++++++++++++ |
169 | ca9ccea8 | Michael Hanselmann | |
170 | ca9ccea8 | Michael Hanselmann | ``SharedLock`` redesign |
171 | ca9ccea8 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^^^ |
172 | ca9ccea8 | Michael Hanselmann | |
173 | ca9ccea8 | Michael Hanselmann | The current design of ``SharedLock`` is not good for supporting timeouts |
174 | ca9ccea8 | Michael Hanselmann | when acquiring a lock and there are also minor fairness issues in it. We |
175 | 7faf5110 | Michael Hanselmann | plan to address both with a redesign. A proof of concept implementation |
176 | 7faf5110 | Michael Hanselmann | was written and resulted in significantly simpler code. |
177 | 7faf5110 | Michael Hanselmann | |
178 | 7faf5110 | Michael Hanselmann | Currently ``SharedLock`` uses two separate queues for shared and |
179 | 7faf5110 | Michael Hanselmann | exclusive acquires and waiters get to run in turns. This means if an |
180 | 7faf5110 | Michael Hanselmann | exclusive acquire is released, the lock will allow shared waiters to run |
181 | 7faf5110 | Michael Hanselmann | and vice versa. Although it's still fair in the end there is a slight |
182 | 7faf5110 | Michael Hanselmann | bias towards shared waiters in the current implementation. The same |
183 | 7faf5110 | Michael Hanselmann | implementation with two separate queues can not support timeouts without |
184 | 7faf5110 | Michael Hanselmann | adding a lot of complexity. |
185 | 7faf5110 | Michael Hanselmann | |
186 | 7faf5110 | Michael Hanselmann | Our proposed redesign changes ``SharedLock`` to have only one single |
187 | 7faf5110 | Michael Hanselmann | queue. There will be one condition (see Condition_ for a note about |
188 | 7faf5110 | Michael Hanselmann | performance) in the queue per exclusive acquire and two for all shared |
189 | 7faf5110 | Michael Hanselmann | acquires (see below for an explanation). The maximum queue length will |
190 | 7faf5110 | Michael Hanselmann | always be ``2 + (number of exclusive acquires waiting)``. The number of |
191 | 7faf5110 | Michael Hanselmann | queue entries for shared acquires can vary from 0 to 2. |
192 | 7faf5110 | Michael Hanselmann | |
193 | 7faf5110 | Michael Hanselmann | The two conditions for shared acquires are a bit special. They will be |
194 | 7faf5110 | Michael Hanselmann | used in turn. When the lock is instantiated, no conditions are in the |
195 | 7faf5110 | Michael Hanselmann | queue. As soon as the first shared acquire arrives (and there are |
196 | 7faf5110 | Michael Hanselmann | holder(s) or waiting acquires; see Acquire_), the active condition is |
197 | 7faf5110 | Michael Hanselmann | added to the queue. Until it becomes the topmost condition in the queue |
198 | 7faf5110 | Michael Hanselmann | and has been notified, any shared acquire is added to this active |
199 | 7faf5110 | Michael Hanselmann | condition. When the active condition is notified, the conditions are |
200 | 7faf5110 | Michael Hanselmann | swapped and further shared acquires are added to the previously inactive |
201 | 7faf5110 | Michael Hanselmann | condition (which has now become the active condition). After all waiters |
202 | 7faf5110 | Michael Hanselmann | on the previously active (now inactive) and now notified condition |
203 | 7faf5110 | Michael Hanselmann | received the notification, it is removed from the queue of pending |
204 | 7faf5110 | Michael Hanselmann | acquires. |
205 | 7faf5110 | Michael Hanselmann | |
206 | 7faf5110 | Michael Hanselmann | This means shared acquires will skip any exclusive acquire in the queue. |
207 | 7faf5110 | Michael Hanselmann | We believe it's better to improve parallelization on operations only |
208 | 7faf5110 | Michael Hanselmann | asking for shared (or read-only) locks. Exclusive operations holding the |
209 | 7faf5110 | Michael Hanselmann | same lock can not be parallelized. |
210 | ca9ccea8 | Michael Hanselmann | |
211 | ca9ccea8 | Michael Hanselmann | |
212 | ca9ccea8 | Michael Hanselmann | Acquire |
213 | ca9ccea8 | Michael Hanselmann | ******* |
214 | ca9ccea8 | Michael Hanselmann | |
215 | 7faf5110 | Michael Hanselmann | For exclusive acquires a new condition is created and appended to the |
216 | 7faf5110 | Michael Hanselmann | queue. Shared acquires are added to the active condition for shared |
217 | 7faf5110 | Michael Hanselmann | acquires and if the condition is not yet on the queue, it's appended. |
218 | ca9ccea8 | Michael Hanselmann | |
219 | 7faf5110 | Michael Hanselmann | The next step is to wait for our condition to be on the top of the queue |
220 | 7faf5110 | Michael Hanselmann | (to guarantee fairness). If the timeout expired, we return to the caller |
221 | 7faf5110 | Michael Hanselmann | without acquiring the lock. On every notification we check whether the |
222 | 7faf5110 | Michael Hanselmann | lock has been deleted, in which case an error is returned to the caller. |
223 | ca9ccea8 | Michael Hanselmann | |
224 | 7faf5110 | Michael Hanselmann | The lock can be acquired if we're on top of the queue (there is no one |
225 | 7faf5110 | Michael Hanselmann | else ahead of us). For an exclusive acquire, there must not be other |
226 | 7faf5110 | Michael Hanselmann | exclusive or shared holders. For a shared acquire, there must not be an |
227 | 7faf5110 | Michael Hanselmann | exclusive holder. If these conditions are all true, the lock is |
228 | 7faf5110 | Michael Hanselmann | acquired and we return to the caller. In any other case we wait again on |
229 | 7faf5110 | Michael Hanselmann | the condition. |
230 | ca9ccea8 | Michael Hanselmann | |
231 | 7faf5110 | Michael Hanselmann | If it was the last waiter on a condition, the condition is removed from |
232 | 7faf5110 | Michael Hanselmann | the queue. |
233 | ca9ccea8 | Michael Hanselmann | |
234 | ca9ccea8 | Michael Hanselmann | Optimization: There's no need to touch the queue if there are no pending |
235 | 7faf5110 | Michael Hanselmann | acquires and no current holders. The caller can have the lock |
236 | 7faf5110 | Michael Hanselmann | immediately. |
237 | ca9ccea8 | Michael Hanselmann | |
238 | ca9ccea8 | Michael Hanselmann | .. image:: design-2.1-lock-acquire.png |
239 | ca9ccea8 | Michael Hanselmann | |
240 | ca9ccea8 | Michael Hanselmann | |
241 | ca9ccea8 | Michael Hanselmann | Release |
242 | ca9ccea8 | Michael Hanselmann | ******* |
243 | ca9ccea8 | Michael Hanselmann | |
244 | 7faf5110 | Michael Hanselmann | First the lock removes the caller from the internal owner list. If there |
245 | 7faf5110 | Michael Hanselmann | are pending acquires in the queue, the first (the oldest) condition is |
246 | 7faf5110 | Michael Hanselmann | notified. |
247 | ca9ccea8 | Michael Hanselmann | |
248 | ca9ccea8 | Michael Hanselmann | If the first condition was the active condition for shared acquires, the |
249 | 7faf5110 | Michael Hanselmann | inactive condition will be made active. This ensures fairness with |
250 | 7faf5110 | Michael Hanselmann | exclusive locks by forcing consecutive shared acquires to wait in the |
251 | 7faf5110 | Michael Hanselmann | queue. |
252 | ca9ccea8 | Michael Hanselmann | |
253 | ca9ccea8 | Michael Hanselmann | .. image:: design-2.1-lock-release.png |
254 | ca9ccea8 | Michael Hanselmann | |
255 | ca9ccea8 | Michael Hanselmann | |
256 | ca9ccea8 | Michael Hanselmann | Delete |
257 | ca9ccea8 | Michael Hanselmann | ****** |
258 | ca9ccea8 | Michael Hanselmann | |
259 | 7faf5110 | Michael Hanselmann | The caller must either hold the lock in exclusive mode already or the |
260 | 7faf5110 | Michael Hanselmann | lock must be acquired in exclusive mode. Trying to delete a lock while |
261 | 7faf5110 | Michael Hanselmann | it's held in shared mode must fail. |
262 | ca9ccea8 | Michael Hanselmann | |
263 | 7faf5110 | Michael Hanselmann | After ensuring the lock is held in exclusive mode, the lock will mark |
264 | 7faf5110 | Michael Hanselmann | itself as deleted and continue to notify all pending acquires. They will |
265 | 7faf5110 | Michael Hanselmann | wake up, notice the deleted lock and return an error to the caller. |
266 | ca9ccea8 | Michael Hanselmann | |
267 | ca9ccea8 | Michael Hanselmann | |
268 | ca9ccea8 | Michael Hanselmann | Condition |
269 | ca9ccea8 | Michael Hanselmann | ^^^^^^^^^ |
270 | ca9ccea8 | Michael Hanselmann | |
271 | 7faf5110 | Michael Hanselmann | Note: This is not necessary for the locking changes above, but it may be |
272 | 7faf5110 | Michael Hanselmann | a good optimization (pending performance tests). |
273 | ca9ccea8 | Michael Hanselmann | |
274 | ca9ccea8 | Michael Hanselmann | The existing locking code in Ganeti 2.0 uses Python's built-in |
275 | ca9ccea8 | Michael Hanselmann | ``threading.Condition`` class. Unfortunately ``Condition`` implements |
276 | 7faf5110 | Michael Hanselmann | timeouts by sleeping 1ms to 20ms between tries to acquire the condition |
277 | 7faf5110 | Michael Hanselmann | lock in non-blocking mode. This requires unnecessary context switches |
278 | 7faf5110 | Michael Hanselmann | and contention on the CPython GIL (Global Interpreter Lock). |
279 | ca9ccea8 | Michael Hanselmann | |
280 | ca9ccea8 | Michael Hanselmann | By using POSIX pipes (see ``pipe(2)``) we can use the operating system's |
281 | ca9ccea8 | Michael Hanselmann | support for timeouts on file descriptors (see ``select(2)``). A custom |
282 | ca9ccea8 | Michael Hanselmann | condition class will have to be written for this. |
283 | ca9ccea8 | Michael Hanselmann | |
284 | ca9ccea8 | Michael Hanselmann | On instantiation the class creates a pipe. After each notification the |
285 | 7faf5110 | Michael Hanselmann | previous pipe is abandoned and a new one is created (technically the old |
286 | 7faf5110 | Michael Hanselmann | pipe needs to stay around until all notifications have been delivered). |
287 | ca9ccea8 | Michael Hanselmann | |
288 | ca9ccea8 | Michael Hanselmann | All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to |
289 | 7faf5110 | Michael Hanselmann | wait for notifications, optionally with a timeout. A notification will |
290 | 7faf5110 | Michael Hanselmann | be signalled to the waiting clients by closing the pipe. If the pipe |
291 | 7faf5110 | Michael Hanselmann | wasn't closed during the timeout, the waiting function returns to its |
292 | 7faf5110 | Michael Hanselmann | caller nonetheless. |
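A minimal sketch of such a pipe-based condition is shown below, written for modern Python on a POSIX system. It is not the Ganeti class: the name ``PipeCondition`` and its methods are hypothetical, and unlike ``threading.Condition`` it is not tied to a lock that is released while waiting. Each waiter duplicates the read end of the pipe, which is one way to keep the old pipe around until every waiter has seen the notification.

```python
import os
import select
import threading


class PipeCondition:
    """Condition whose notification is signalled by closing a pipe."""

    def __init__(self):
        self._lock = threading.Lock()
        self._read_fd, self._write_fd = os.pipe()

    def wait(self, timeout=None):
        """Return True if notified, False if the timeout expired first."""
        with self._lock:
            # A private duplicate stays valid even after notify_all()
            # abandons the pipe.
            fd = os.dup(self._read_fd)
        try:
            # Once the write end is closed the read end reports
            # end-of-file, which select() treats as readable.
            readable, _, _ = select.select([fd], [], [], timeout)
            return bool(readable)
        finally:
            os.close(fd)

    def notify_all(self):
        """Wake all current waiters and prepare a fresh pipe."""
        with self._lock:
            os.close(self._write_fd)
            os.close(self._read_fd)
            self._read_fd, self._write_fd = os.pipe()
```

The timeout is handled entirely by ``select(2)``, so a waiter sleeps in the kernel instead of polling the condition lock.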
293 | ca9ccea8 | Michael Hanselmann | |
294 | 5ee09f03 | Michael Hanselmann | |
295 | 82a1c938 | Guido Trotter | Feature changes |
296 | 82a1c938 | Guido Trotter | --------------- |
297 | 82a1c938 | Guido Trotter | |
298 | c0446a46 | Guido Trotter | Ganeti Confd |
299 | c0446a46 | Guido Trotter | ~~~~~~~~~~~~ |
300 | c0446a46 | Guido Trotter | |
301 | c0446a46 | Guido Trotter | Current State and shortcomings |
302 | c0446a46 | Guido Trotter | ++++++++++++++++++++++++++++++ |
303 | 7faf5110 | Michael Hanselmann | |
304 | 7faf5110 | Michael Hanselmann | In Ganeti 2.0 all nodes are equal, but some are more equal than others. |
305 | 7faf5110 | Michael Hanselmann | In particular they are divided between "master", "master candidates" and |
306 | 7faf5110 | Michael Hanselmann | "normal". (Moreover they can be offline or drained, but this is not |
307 | 7faf5110 | Michael Hanselmann | important for the current discussion). In general the whole |
308 | 7faf5110 | Michael Hanselmann | configuration is only replicated to master candidates, and some partial |
309 | 7faf5110 | Michael Hanselmann | information is spread to all nodes via ssconf. |
310 | 7faf5110 | Michael Hanselmann | |
311 | 7faf5110 | Michael Hanselmann | This change was made so that the most frequent Ganeti operations didn't |
312 | 7faf5110 | Michael Hanselmann | need to contact all nodes, and so clusters could become bigger. If we |
313 | 7faf5110 | Michael Hanselmann | want more information to be available on all nodes, we need to add more |
314 | 7faf5110 | Michael Hanselmann | ssconf values, which counteracts that change, or to talk with the |
315 | 7faf5110 | Michael Hanselmann | master node, which is not designed to happen now, and requires it to be |
316 | 7faf5110 | Michael Hanselmann | available. |
317 | 7faf5110 | Michael Hanselmann | |
318 | 7faf5110 | Michael Hanselmann | Information such as the instance->primary_node mapping will be needed on |
319 | 7faf5110 | Michael Hanselmann | all nodes, and we also want to make sure services external to the |
320 | 7faf5110 | Michael Hanselmann | cluster can query this information as well. This information must be |
321 | 7faf5110 | Michael Hanselmann | available at all times, so we can't query it through RAPI, which would |
322 | 7faf5110 | Michael Hanselmann | be a single point of failure, as it's only available on the master. |
323 | c0446a46 | Guido Trotter | |
324 | c0446a46 | Guido Trotter | |
325 | c0446a46 | Guido Trotter | Proposed changes |
326 | c0446a46 | Guido Trotter | ++++++++++++++++ |
327 | c0446a46 | Guido Trotter | |
328 | c0446a46 | Guido Trotter | In order to allow fast and highly available read-only access to some |
329 | 7faf5110 | Michael Hanselmann | configuration values, we'll create a new ganeti-confd daemon, which will |
330 | 7faf5110 | Michael Hanselmann | run on master candidates. This daemon will talk via UDP, and |
331 | 7faf5110 | Michael Hanselmann | authenticate messages using HMAC with a cluster-wide shared key. This |
332 | 7faf5110 | Michael Hanselmann | key will be generated at cluster init time, and stored on the cluster |
333 | 7faf5110 | Michael Hanselmann | alongside the Ganeti SSL keys, and readable only by root. |
334 | 7faf5110 | Michael Hanselmann | |
335 | 7faf5110 | Michael Hanselmann | An interested client can query a value by making a request to a subset |
336 | 7faf5110 | Michael Hanselmann | of the cluster master candidates. It will then wait to get a few |
337 | 7faf5110 | Michael Hanselmann | responses, and use the one with the highest configuration serial number. |
338 | 7faf5110 | Michael Hanselmann | Since the configuration serial number is increased each time the ganeti |
339 | 7faf5110 | Michael Hanselmann | config is updated, and the serial number is included in all answers, |
340 | 7faf5110 | Michael Hanselmann | this can be used to make sure to use the most recent answer, in case |
341 | 7faf5110 | Michael Hanselmann | some master candidates are stale or in the middle of a configuration |
342 | 7faf5110 | Michael Hanselmann | update. |
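On the client side, selecting among several replies could look like the sketch below. It assumes each reply has already been verified and decoded into a dictionary with a ``serial`` key, as in the wire format of this design; the function name ``pick_freshest`` is hypothetical.

```python
def pick_freshest(answers):
    """Return the answer with the highest config serial, or None if empty."""
    return max(answers, key=lambda answer: answer["serial"], default=None)
```

A reply from a stale master candidate carries a lower serial number and is simply outvoted by any more recent one.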
343 | c0446a46 | Guido Trotter | |
344 | c0446a46 | Guido Trotter | In order to prevent replay attacks, queries will contain the current Unix |
345 | c0446a46 | Guido Trotter | timestamp according to the client, and the server will verify that its |
346 | 7faf5110 | Michael Hanselmann | own timestamp is within the same 5-minute range (this requires synchronized |
347 | 7faf5110 | Michael Hanselmann | clocks, which is a good idea anyway). Queries will also contain a "salt" |
348 | 7faf5110 | Michael Hanselmann | which they expect the answers to be sent with, and clients are supposed |
349 | 7faf5110 | Michael Hanselmann | to accept only answers which contain the salt they generated. |
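The server-side freshness check might be sketched as follows. The 300-second window is one reading of the 5 minute range mentioned above, and both ``MAX_CLOCK_SKEW`` and ``timestamp_is_fresh`` are names invented for this example.

```python
import time

MAX_CLOCK_SKEW = 300  # seconds; assumed interpretation of the 5 minute range


def timestamp_is_fresh(salt, now=None, max_skew=MAX_CLOCK_SKEW):
    """Return True if `salt` is a Unix timestamp close enough to `now`."""
    if now is None:
        now = time.time()
    try:
        sent = int(salt)
    except (TypeError, ValueError):
        # Not a timestamp at all: treat the query as invalid.
        return False
    return abs(now - sent) <= max_skew
```

A captured query replayed after the window has passed is rejected even though its HMAC is still valid.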
350 | c0446a46 | Guido Trotter | |
351 | c0446a46 | Guido Trotter | The configuration daemon will be able to answer simple queries such as: |
352 | a9407509 | Guido Trotter | |
353 | c0446a46 | Guido Trotter | - master candidates list |
354 | c0446a46 | Guido Trotter | - master node |
355 | c0446a46 | Guido Trotter | - offline nodes |
356 | c0446a46 | Guido Trotter | - instance list |
357 | c0446a46 | Guido Trotter | - instance primary nodes |
358 | c0446a46 | Guido Trotter | |
359 | a9407509 | Guido Trotter | Wire protocol |
360 | a9407509 | Guido Trotter | ^^^^^^^^^^^^^ |
361 | a9407509 | Guido Trotter | |
362 | a9407509 | Guido Trotter | A confd query will look like this, on the wire:: |
363 | a9407509 | Guido Trotter | |
364 | f3448a3c | Guido Trotter | plj0{ |
365 | a9407509 | Guido Trotter | "msg": "{\"type\": 1, |
366 | a9407509 | Guido Trotter | \"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\", |
367 | a9407509 | Guido Trotter | \"protocol\": 1, |
368 | a9407509 | Guido Trotter | \"query\": \"node1.example.com\"}\n", |
369 | a9407509 | Guido Trotter | "salt": "1249637704", |
370 | a9407509 | Guido Trotter | "hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f" |
371 | a9407509 | Guido Trotter | } |
372 | a9407509 | Guido Trotter | |
373 | f3448a3c | Guido Trotter | "plj0" is a fourcc that details the message content. It stands for plain |
374 | f3448a3c | Guido Trotter | JSON 0, and can be changed as we move on to different types of protocols |
375 | f3448a3c | Guido Trotter | (for example protocol buffers, or encrypted JSON). What follows is a |
376 | f3448a3c | Guido Trotter | JSON-encoded string, with the following fields: |
377 | a9407509 | Guido Trotter | |
378 | a9407509 | Guido Trotter | - 'msg' contains a JSON-encoded query, its fields are: |
379 | a9407509 | Guido Trotter | |
380 | a9407509 | Guido Trotter | - 'protocol', integer, is the confd protocol version (initially just |
381 | a9407509 | Guido Trotter | constants.CONFD_PROTOCOL_VERSION, with a value of 1) |
382 | 7faf5110 | Michael Hanselmann | - 'type', integer, is the query type. For example "node role by name" |
383 | 7faf5110 | Michael Hanselmann | or "node primary ip by instance ip". Constants will be provided for |
384 | 7faf5110 | Michael Hanselmann | the actual available query types. |
385 | 7faf5110 | Michael Hanselmann | - 'query', string, is the search key. For example an ip, or a node |
386 | 7faf5110 | Michael Hanselmann | name. |
387 | 7faf5110 | Michael Hanselmann | - 'rsalt', string, is the required response salt. The client must use |
388 | 7faf5110 | Michael Hanselmann | it to recognize which answer it's getting. |
389 | 7faf5110 | Michael Hanselmann | |
390 | 7faf5110 | Michael Hanselmann | - 'salt' must be the current unix timestamp, according to the client. |
391 | 7faf5110 | Michael Hanselmann | Servers can refuse messages which have a wrong timing, according to |
392 | 7faf5110 | Michael Hanselmann | their configuration and clock. |
393 | a9407509 | Guido Trotter | - 'hmac' is an hmac signature of salt+msg, with the cluster hmac key |
394 | a9407509 | Guido Trotter | |
395 | 7faf5110 | Michael Hanselmann | If an answer comes back (which is optional, since confd works over UDP) |
396 | 7faf5110 | Michael Hanselmann | it will be in this format:: |
397 | a9407509 | Guido Trotter | |
398 | f3448a3c | Guido Trotter | plj0{ |
399 | a9407509 | Guido Trotter | "msg": "{\"status\": 0, |
400 | a9407509 | Guido Trotter | \"answer\": 0, |
401 | a9407509 | Guido Trotter | \"serial\": 42, |
402 | a9407509 | Guido Trotter | \"protocol\": 1}\n", |
403 | a9407509 | Guido Trotter | "salt": "9aa6ce92-8336-11de-af38-001d093e835f", |
404 | a9407509 | Guido Trotter | "hmac": "aaeccc0dff9328fdf7967cb600b6a80a6a9332af" |
405 | a9407509 | Guido Trotter | } |
406 | a9407509 | Guido Trotter | |
407 | a9407509 | Guido Trotter | Where: |
408 | a9407509 | Guido Trotter | |
409 | f3448a3c | Guido Trotter | - 'plj0' is the message type magic fourcc, as discussed above |
410 | a9407509 | Guido Trotter | - 'msg' contains a JSON-encoded answer, its fields are: |
411 | a9407509 | Guido Trotter | |
412 | a9407509 | Guido Trotter | - 'protocol', integer, is the confd protocol version (initially just |
413 | a9407509 | Guido Trotter | constants.CONFD_PROTOCOL_VERSION, with a value of 1) |
414 | 7faf5110 | Michael Hanselmann | - 'status', integer, is the error code. Initially just 0 for 'ok' or |
415 | 7faf5110 | Michael Hanselmann | 1 for 'error' (in which case answer contains an error detail, |
416 | 7faf5110 | Michael Hanselmann | rather than an answer), but in the future it may be expanded to have |
417 | 7faf5110 | Michael Hanselmann | more meanings (e.g. 2, the answer is compressed) |
418 | 7faf5110 | Michael Hanselmann | - 'answer' is the actual answer. Its type and meaning are query |
419 | 7faf5110 | Michael Hanselmann | specific. For example for "node primary ip by instance ip" queries |
420 | 7faf5110 | Michael Hanselmann | it will be a string containing an IP address, for "node role by |
421 | 7faf5110 | Michael Hanselmann | name" queries it will be an integer which encodes the role (master, |
422 | 7faf5110 | Michael Hanselmann | candidate, drained, offline) according to constants. |
423 | 7faf5110 | Michael Hanselmann | |
424 | 7faf5110 | Michael Hanselmann | - 'salt' is the requested salt from the query. A client can use it to |
425 | 7faf5110 | Michael Hanselmann | recognize what query the answer is answering. |
426 | a9407509 | Guido Trotter | - 'hmac' is an hmac signature of salt+msg, with the cluster hmac key |
427 | a9407509 | Guido Trotter | |
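The answer handling described above can be sketched as follows. This is only an illustration, not the confd implementation: the use of SHA-1 is an assumption (suggested by the 40 hex digit signature in the example), and the helper names ``sign``, ``build_answer`` and ``parse_answer`` are hypothetical.

```python
import hashlib
import hmac
import json

MAGIC = "plj0"  # message type magic fourcc, as in the format above


def sign(key: bytes, salt: str, msg: str) -> str:
    """Hex HMAC of salt+msg with the cluster key (SHA-1 is an assumption)."""
    return hmac.new(key, (salt + msg).encode("utf-8"), hashlib.sha1).hexdigest()


def build_answer(key: bytes, rsalt: str, answer, serial: int,
                 status: int = 0, protocol: int = 1) -> str:
    """Server side: wrap a JSON-encoded answer in a signed envelope."""
    msg = json.dumps({"status": status, "answer": answer,
                      "serial": serial, "protocol": protocol})
    envelope = {"msg": msg, "salt": rsalt, "hmac": sign(key, rsalt, msg)}
    return MAGIC + json.dumps(envelope)


def parse_answer(packet: str, key: bytes, expected_salt: str) -> dict:
    """Client side: check magic, signature and salt, then decode 'msg'."""
    if not packet.startswith(MAGIC):
        raise ValueError("unknown message magic")
    envelope = json.loads(packet[len(MAGIC):])
    salt, msg = envelope["salt"], envelope["msg"]
    if not hmac.compare_digest(sign(key, salt, msg), envelope["hmac"]):
        raise ValueError("bad HMAC signature")
    if salt != expected_salt:
        # The salt must be the 'rsalt' we sent, otherwise this is not our answer.
        raise ValueError("answer does not match our query")
    return json.loads(msg)
```

A client would keep the 'rsalt' of each outstanding query and pass it as ``expected_salt``; since confd works over UDP, it also needs its own timeout for answers that never arrive.
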
428 | c0446a46 | Guido Trotter | |
429 | d1268971 | Guido Trotter | Redistribute Config |
430 | d1268971 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~ |
431 | d1268971 | Guido Trotter | |
432 | d1268971 | Guido Trotter | Current State and shortcomings |
433 | d1268971 | Guido Trotter | ++++++++++++++++++++++++++++++ |
434 | 7faf5110 | Michael Hanselmann | |
435 | 7faf5110 | Michael Hanselmann | Currently LURedistributeConfig triggers a copy of the updated |
436 | 7faf5110 | Michael Hanselmann | configuration file to all master candidates and of the ssconf files to |
437 | 7faf5110 | Michael Hanselmann | all nodes. There are other files which are maintained manually but which |
438 | 7faf5110 | Michael Hanselmann | are important to keep in sync. These are: |
439 | d1268971 | Guido Trotter | |
440 | d1268971 | Guido Trotter | - rapi SSL key certificate file (rapi.pem) (on master candidates) |
441 | d1268971 | Guido Trotter | - rapi user/password file rapi_users (on master candidates) |
442 | d1268971 | Guido Trotter | |
443 | 7faf5110 | Michael Hanselmann | Furthermore there are some files which are hypervisor specific but we |
444 | 7faf5110 | Michael Hanselmann | may want to keep in sync: |
445 | d1268971 | Guido Trotter | |
446 | 7faf5110 | Michael Hanselmann | - the xen-hvm hypervisor uses one shared file for all vnc passwords, and |
447 | 7faf5110 | Michael Hanselmann | copies the file once, during node add. This design is subject to |
448 | 7faf5110 | Michael Hanselmann | revision to be able to have different passwords for different groups |
449 | 7faf5110 | Michael Hanselmann | of instances via the use of hypervisor parameters, and to allow |
450 | 7faf5110 | Michael Hanselmann | xen-hvm and kvm to use the same system to provide password-protected |
451 | 7faf5110 | Michael Hanselmann | vnc sessions. In general, though, it would be useful if the vnc |
452 | 7faf5110 | Michael Hanselmann | password files were copied as well, to avoid unwanted vnc password |
453 | 7faf5110 | Michael Hanselmann | changes on instance failover/migrate. |
454 | d1268971 | Guido Trotter | |
455 | 7faf5110 | Michael Hanselmann | Optionally the admin may want to also ship files such as the global |
456 | 7faf5110 | Michael Hanselmann | xend.conf file, and the network scripts to all nodes. |
457 | d1268971 | Guido Trotter | |
458 | d1268971 | Guido Trotter | Proposed changes |
459 | d1268971 | Guido Trotter | ++++++++++++++++ |
460 | d1268971 | Guido Trotter | |
461 | 7faf5110 | Michael Hanselmann | RedistributeConfig will be changed to also copy the rapi files, and to |
462 | 7faf5110 | Michael Hanselmann | call every enabled hypervisor asking for a list of additional files to |
463 | 7faf5110 | Michael Hanselmann | copy. Users will have the possibility to populate a file containing a |
464 | 7faf5110 | Michael Hanselmann | list of files to be distributed; this file will be propagated as well. |
465 | 7faf5110 | Michael Hanselmann | Such a solution is simple to implement and is easily usable by |
466 | 7faf5110 | Michael Hanselmann | scripts. |
467 | d1268971 | Guido Trotter | |
468 | 7faf5110 | Michael Hanselmann | This code will also be shared (via tasklets or by other means, if |
469 | 7faf5110 | Michael Hanselmann | tasklets are not ready for 2.1) with the AddNode and SetNodeParams LUs |
470 | 7faf5110 | Michael Hanselmann | (so that the relevant files will be automatically shipped to new master |
471 | 7faf5110 | Michael Hanselmann | candidates as they are set). |
472 | d1268971 | Guido Trotter | |
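As a rough sketch of how the file list could be assembled under this proposal: the function names, the one-path-per-line syntax of the user file (with ``#`` comments), and the choice to send hypervisor and user files to all nodes are assumptions made for illustration; the design only fixes that the rapi files go to master candidates.

```python
from typing import Callable, Dict, Iterable, List


def parse_user_list(text: str) -> List[str]:
    """Parse the user-maintained list: one path per line (assumed syntax)."""
    paths = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            paths.append(line)
    return paths


def files_to_distribute(rapi_files: Iterable[str],
                        hypervisor_hooks: Iterable[Callable[[], Iterable[str]]],
                        user_list_path: str,
                        user_list_text: str) -> Dict[str, List[str]]:
    """Collect every file RedistributeConfig would have to copy."""
    # The list file itself is propagated together with the files it names.
    all_nodes = {user_list_path}
    all_nodes.update(parse_user_list(user_list_text))
    # Each enabled hypervisor is asked for its additional files.
    for hook in hypervisor_hooks:
        all_nodes.update(hook())
    return {
        "master_candidates": sorted(set(rapi_files)),
        "all_nodes": sorted(all_nodes),
    }
```

Keeping the collection step separate from the copy step is what would let AddNode and SetNodeParams reuse it for a single new master candidate.
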
473 | 5b18ff3b | Guido Trotter | VNC Console Password |
474 | 5b18ff3b | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~ |
475 | 5b18ff3b | Guido Trotter | |
476 | 5b18ff3b | Guido Trotter | Current State and shortcomings |
477 | 5b18ff3b | Guido Trotter | ++++++++++++++++++++++++++++++ |
478 | 5b18ff3b | Guido Trotter | |
479 | 7faf5110 | Michael Hanselmann | Currently just the xen-hvm hypervisor supports setting a password to |
480 | 7faf5110 | Michael Hanselmann | connect to the instances' VNC console, and has one common password |
481 | 7faf5110 | Michael Hanselmann | stored in a file. |
482 | 5b18ff3b | Guido Trotter | |
483 | 5b18ff3b | Guido Trotter | This doesn't allow different passwords for different instances/groups of |
484 | 7faf5110 | Michael Hanselmann | instances, and makes it necessary to remember to copy the file around |
485 | 7faf5110 | Michael Hanselmann | the cluster when the password changes. |
486 | 5b18ff3b | Guido Trotter | |
487 | 5b18ff3b | Guido Trotter | Proposed changes |
488 | 5b18ff3b | Guido Trotter | ++++++++++++++++ |
489 | 5b18ff3b | Guido Trotter | |
490 | 7faf5110 | Michael Hanselmann | We'll change the VNC password file to a vnc_password_file hypervisor |
491 | 7faf5110 | Michael Hanselmann | parameter. This way it can have a cluster default, but also a different |
492 | 7faf5110 | Michael Hanselmann | value for each instance. The VNC enabled hypervisors (xen and kvm) will |
493 | 7faf5110 | Michael Hanselmann | publish all the password files in use through the cluster so that a |
494 | 7faf5110 | Michael Hanselmann | redistribute-config will ship them to all nodes (see the Redistribute |
495 | 7faf5110 | Michael Hanselmann | Config proposed changes above). |
496 | 5b18ff3b | Guido Trotter | |
497 | 7faf5110 | Michael Hanselmann | The current VNC_PASSWORD_FILE constant will be removed, but its value |
498 | 7faf5110 | Michael Hanselmann | will be used as the default HV_VNC_PASSWORD_FILE value, thus retaining |
499 | 7faf5110 | Michael Hanselmann | backwards compatibility with 2.0. |
500 | 5b18ff3b | Guido Trotter | |
501 | 7faf5110 | Michael Hanselmann | The code to export the list of VNC password files from the hypervisors |
502 | 7faf5110 | Michael Hanselmann | to RedistributeConfig will be shared between the KVM and xen-hvm |
503 | 7faf5110 | Michael Hanselmann | hypervisors. |
504 | 5b18ff3b | Guido Trotter | |
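The shared export code could look roughly like the following sketch. It assumes hypervisor parameters are plain dictionaries keyed by ``vnc_password_file`` and that an empty value means "no password"; the function name and the paths used in the usage note are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

HV_VNC_PASSWORD_FILE = "vnc_password_file"  # parameter name from the text above


def vnc_password_files(cluster_hvparams: Dict[str, str],
                       instances_hvparams: Iterable[Dict[str, str]]) -> List[str]:
    """Return all VNC password files in use, cluster default included."""
    default: Optional[str] = cluster_hvparams.get(HV_VNC_PASSWORD_FILE)
    in_use = set()
    if default:
        in_use.add(default)
    for hvparams in instances_hvparams:
        # An instance without its own value falls back to the cluster default.
        path = hvparams.get(HV_VNC_PASSWORD_FILE, default)
        if path:
            in_use.add(path)
    return sorted(in_use)
```

For example, with a cluster default of ``/path/to/default-password`` and one instance overriding it with ``/path/to/group-a-password``, both files would be reported and shipped by a redistribute-config.
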
505 | 76bb661b | Guido Trotter | Disk/Net parameters |
506 | 76bb661b | Guido Trotter | ~~~~~~~~~~~~~~~~~~~ |
507 | 76bb661b | Guido Trotter | |
508 | 76bb661b | Guido Trotter | Current State and shortcomings |
509 | 76bb661b | Guido Trotter | ++++++++++++++++++++++++++++++ |
510 | 76bb661b | Guido Trotter | |
511 | 7faf5110 | Michael Hanselmann | Currently disks and network interfaces have a few tweakable options and |
512 | 7faf5110 | Michael Hanselmann | all the rest is left to a default we chose. We're finding that we need |
513 | 7faf5110 | Michael Hanselmann | more and more to tweak some of these parameters, for example to disable |
514 | 7faf5110 | Michael Hanselmann | barriers for DRBD devices, or allow striping for the LVM volumes. |
515 | 76bb661b | Guido Trotter | |
516 | 7faf5110 | Michael Hanselmann | Moreover for many of these parameters it would be nice to have |
517 | 7faf5110 | Michael Hanselmann | cluster-wide defaults, and then be able to change them per |
518 | 7faf5110 | Michael Hanselmann | disk/interface. |
519 | 76bb661b | Guido Trotter | |
520 | 76bb661b | Guido Trotter | Proposed changes |
521 | 76bb661b | Guido Trotter | ++++++++++++++++ |
522 | 76bb661b | Guido Trotter | |
523 | 7faf5110 | Michael Hanselmann | We will add new cluster level diskparams and netparams, which will |
524 | 7faf5110 | Michael Hanselmann | contain all the tweakable parameters. All values which have a sensible |
525 | 7faf5110 | Michael Hanselmann | cluster-wide default will go into this new structure while parameters |
526 | 7faf5110 | Michael Hanselmann | which have unique values will not. |
527 | 76bb661b | Guido Trotter | |
528 | 76bb661b | Guido Trotter | Example of network parameters: |
529 | 76bb661b | Guido Trotter | - mode: bridge/route |
530 | 7faf5110 | Michael Hanselmann | - link: for mode "bridge" the bridge to connect to, for mode "route" it |
531 | 7faf5110 | Michael Hanselmann | can contain the routing table, or the destination interface |
532 | 76bb661b | Guido Trotter | |
533 | 76bb661b | Guido Trotter | Example of disk parameters: |
534 | 76bb661b | Guido Trotter | - stripe: lvm stripes |
535 | 76bb661b | Guido Trotter | - stripe_size: lvm stripe size |
536 | 76bb661b | Guido Trotter | - meta_flushes: drbd, enable/disable metadata "barriers" |
537 | 76bb661b | Guido Trotter | - data_flushes: drbd, enable/disable data "barriers" |
538 | 76bb661b | Guido Trotter | |
539 | 7faf5110 | Michael Hanselmann | Some parameters are bound to be disk-type specific (drbd, vs lvm, vs |
540 | 7faf5110 | Michael Hanselmann | files) or hypervisor specific (nic models for example), but for now they |
541 | 7faf5110 | Michael Hanselmann | will all live in the same structure. Each component is supposed to |
542 | 7faf5110 | Michael Hanselmann | validate only the parameters it knows about, and ganeti itself will make |
543 | 7faf5110 | Michael Hanselmann | sure that no "globally unknown" parameters are added, and that no |
544 | 7faf5110 | Michael Hanselmann | parameters have overridden meanings for different components. |
545 | 76bb661b | Guido Trotter | |
546 | 7faf5110 | Michael Hanselmann | The parameters will be kept, as for the BEPARAMS, in a "default" |
547 | 7faf5110 | Michael Hanselmann | category, which we will be able to expand on by creating instance |
548 | 7faf5110 | Michael Hanselmann | "classes" in the future. Instance classes are not a feature we plan |
549 | 7faf5110 | Michael Hanselmann | to implement in 2.1, though. |
550 | 76bb661b | Guido Trotter | |
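A minimal sketch of the proposed layering, assuming the cluster keeps its values under a "default" category and each disk carries only its overrides. The parameter names come from the examples above; the function names and the exact error handling are illustrative assumptions.

```python
from typing import Any, Dict, FrozenSet

# Globally known disk parameters (taken from the examples above).
KNOWN_DISK_PARAMS: FrozenSet[str] = frozenset(
    ["stripe", "stripe_size", "meta_flushes", "data_flushes"])


def check_known(params: Dict[str, Any], known: FrozenSet[str]) -> None:
    """Reject "globally unknown" parameters."""
    unknown = sorted(set(params) - known)
    if unknown:
        raise ValueError("unknown parameters: %s" % ", ".join(unknown))


def fill_params(cluster_params: Dict[str, Dict[str, Any]],
                overrides: Dict[str, Any],
                known: FrozenSet[str] = KNOWN_DISK_PARAMS) -> Dict[str, Any]:
    """Merge per-disk overrides on top of the cluster "default" category."""
    defaults = cluster_params.get("default", {})
    check_known(defaults, known)
    check_known(overrides, known)
    filled = dict(defaults)
    filled.update(overrides)
    return filled
```

Each component (DRBD, LVM, ...) would then validate only the keys it understands from the filled dictionary.
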
551 | e8a3bf18 | Iustin Pop | |
552 | e8a3bf18 | Iustin Pop | Global hypervisor parameters |
553 | e8a3bf18 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
554 | e8a3bf18 | Iustin Pop | |
555 | e8a3bf18 | Iustin Pop | Current State and shortcomings |
556 | e8a3bf18 | Iustin Pop | ++++++++++++++++++++++++++++++ |
557 | e8a3bf18 | Iustin Pop | |
558 | e8a3bf18 | Iustin Pop | Currently all hypervisor parameters are modifiable both globally |
559 | e8a3bf18 | Iustin Pop | (cluster level) and at instance level. However, there is no other |
560 | e8a3bf18 | Iustin Pop | framework to hold hypervisor-specific parameters, so if we want to add |
561 | e8a3bf18 | Iustin Pop | a new class of hypervisor parameters that only makes sense on a global |
562 | e8a3bf18 | Iustin Pop | level, we have to change the hvparams framework. |
563 | e8a3bf18 | Iustin Pop | |
564 | e8a3bf18 | Iustin Pop | Proposed changes |
565 | e8a3bf18 | Iustin Pop | ++++++++++++++++ |
566 | e8a3bf18 | Iustin Pop | |
567 | e8a3bf18 | Iustin Pop | We add a new (global, not per-hypervisor) list of parameters which are |
568 | e8a3bf18 | Iustin Pop | not changeable on a per-instance level. The create, modify and query |
569 | e8a3bf18 | Iustin Pop | instance operations are changed to not allow/show these parameters. |
570 | e8a3bf18 | Iustin Pop | |
571 | e8a3bf18 | Iustin Pop | Furthermore, to allow transition of parameters to the global list, and |
572 | e8a3bf18 | Iustin Pop | to allow cleanup of inadvertently-customised parameters, the |
573 | e8a3bf18 | Iustin Pop | ``UpgradeConfig()`` method of instances will drop any such parameters |
574 | e8a3bf18 | Iustin Pop | from their list of hvparams, such that a restart of the master daemon |
575 | e8a3bf18 | Iustin Pop | is all that is needed for cleaning these up. |
576 | e8a3bf18 | Iustin Pop | |
577 | e8a3bf18 | Iustin Pop | Also, the framework is simple enough that if we need to replicate it |
578 | e8a3bf18 | Iustin Pop | at beparams level we can do so easily. |
579 | e8a3bf18 | Iustin Pop | |
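The mechanism can be illustrated with the sketch below. The ``Instance`` class is a stand-in rather than the real configuration object, and the parameter name in ``GLOBAL_HVPARAMS`` is purely hypothetical.

```python
from typing import Any, Dict

# Global (not per-hypervisor) list of parameters that cannot be set per instance.
GLOBAL_HVPARAMS = frozenset(["example_global_param"])  # hypothetical entry


def check_instance_hvparams(hvparams: Dict[str, Any]) -> None:
    """Used by create/modify: refuse parameters that are global-only."""
    forbidden = sorted(GLOBAL_HVPARAMS.intersection(hvparams))
    if forbidden:
        raise ValueError("not modifiable per instance: %s" % ", ".join(forbidden))


class Instance:
    """Simplified stand-in for an instance configuration object."""

    def __init__(self, name: str, hvparams: Dict[str, Any]) -> None:
        self.name = name
        self.hvparams = dict(hvparams)

    def UpgradeConfig(self) -> None:
        """Drop global-only parameters that were customised in the past."""
        for key in GLOBAL_HVPARAMS:
            self.hvparams.pop(key, None)
```

Because the cleanup runs when the configuration is loaded, moving a parameter into the global list would only require a master daemon restart to take effect on existing instances.
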
580 | e8a3bf18 | Iustin Pop | |
581 | bff04b1b | Guido Trotter | Non bridged instances support |
582 | bff04b1b | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
583 | bff04b1b | Guido Trotter | |
584 | bff04b1b | Guido Trotter | Current State and shortcomings |
585 | bff04b1b | Guido Trotter | ++++++++++++++++++++++++++++++ |
586 | bff04b1b | Guido Trotter | |
587 | 7faf5110 | Michael Hanselmann | Currently each instance NIC must be connected to a bridge, and if the |
588 | 7faf5110 | Michael Hanselmann | bridge is not specified the default cluster one is used. This makes it |
589 | 7faf5110 | Michael Hanselmann | impossible to use the vif-route xen network scripts, or other |
590 | 7faf5110 | Michael Hanselmann | alternative mechanisms that don't need a bridge to work. |
591 | bff04b1b | Guido Trotter | |
592 | bff04b1b | Guido Trotter | Proposed changes |
593 | bff04b1b | Guido Trotter | ++++++++++++++++ |
594 | bff04b1b | Guido Trotter | |
595 | 7faf5110 | Michael Hanselmann | The new "mode" network parameter will distinguish between bridged |
596 | 7faf5110 | Michael Hanselmann | interfaces and routed ones. |
597 | bff04b1b | Guido Trotter | |
598 | 7faf5110 | Michael Hanselmann | When mode is "bridge" the "link" parameter will contain the bridge the |
599 | 7faf5110 | Michael Hanselmann | instance should be connected to, effectively keeping today's behaviour. The |
600 | 7faf5110 | Michael Hanselmann | value has been migrated from a nic field to a parameter to allow for an |
601 | 7faf5110 | Michael Hanselmann | easier manipulation of the cluster default. |
602 | bff04b1b | Guido Trotter | |
603 | 7faf5110 | Michael Hanselmann | When mode is "route" the ip field of the interface will become |
604 | 7faf5110 | Michael Hanselmann | mandatory, to allow for a route to be set. In the future we may also |
605 | 7faf5110 | Michael Hanselmann | want to accept multiple IPs or IP/mask values for this purpose. We will |
606 | 7faf5110 | Michael Hanselmann | evaluate possible meanings of the link parameter to signify a routing |
607 | 7faf5110 | Michael Hanselmann | table to be used, which would allow for isolation between instance |
608 | 7faf5110 | Michael Hanselmann | groups (as today happens for different bridges). |
609 | bff04b1b | Guido Trotter | |
610 | 7faf5110 | Michael Hanselmann | For now we won't add a parameter to specify which network script gets |
611 | 7faf5110 | Michael Hanselmann | called for which instance, so in a mixed cluster the network script must |
612 | 7faf5110 | Michael Hanselmann | be able to handle both cases. The default kvm vif script will be changed |
613 | 7faf5110 | Michael Hanselmann | to do so. (Xen doesn't have a ganeti provided script, so nothing will be |
614 | 7faf5110 | Michael Hanselmann | done for that hypervisor.) |
615 | 76bb661b | Guido Trotter | |
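The validation rules implied above could be sketched as follows. The function and constant names are illustrative, and the parameters are assumed to have already been filled with the cluster defaults.

```python
from typing import Dict, Optional

NIC_MODE_BRIDGE = "bridge"
NIC_MODE_ROUTE = "route"


def check_nic(ip: Optional[str], params: Dict[str, str]) -> None:
    """Validate one NIC against its (already filled) network parameters."""
    mode = params.get("mode")
    if mode == NIC_MODE_BRIDGE:
        # 'link' names the bridge, exactly as the old nic field did.
        if not params.get("link"):
            raise ValueError("bridged NIC needs a bridge in 'link'")
    elif mode == NIC_MODE_ROUTE:
        # Without an IP there is nothing to set a route for.
        if not ip:
            raise ValueError("routed NIC needs an IP address")
    else:
        raise ValueError("unknown NIC mode: %r" % (mode,))
```

A network script that handles both cases would branch on the same "mode" value.
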
616 | 0f828357 | Iustin Pop | Introducing persistent UUIDs |
617 | 0f828357 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
618 | 0f828357 | Iustin Pop | |
619 | 0f828357 | Iustin Pop | Current state and shortcomings |
620 | 0f828357 | Iustin Pop | ++++++++++++++++++++++++++++++ |
621 | 0f828357 | Iustin Pop | |
622 | 0f828357 | Iustin Pop | Some objects in the Ganeti configuration are tracked by their name |
623 | 0f828357 | Iustin Pop | while also supporting renames. This creates an extra difficulty, |
624 | 0f828357 | Iustin Pop | because neither Ganeti nor external management tools can then track |
625 | 0f828357 | Iustin Pop | the actual entity, and due to the name change it behaves like a new |
626 | 0f828357 | Iustin Pop | one. |
627 | 0f828357 | Iustin Pop | |
628 | 0f828357 | Iustin Pop | Proposed changes part 1 |
629 | 0f828357 | Iustin Pop | +++++++++++++++++++++++ |
630 | 0f828357 | Iustin Pop | |
631 | 0f828357 | Iustin Pop | We will change Ganeti to use UUIDs for entity tracking, but in a |
632 | 0f828357 | Iustin Pop | staggered way. In 2.1, we will simply add a “uuid” attribute to each |
633 | 0f828357 | Iustin Pop | of the instances, nodes and cluster itself. This will be reported on |
634 | 0f828357 | Iustin Pop | instance creation for instances, and on node adds for the nodes. It will |
635 | 0f828357 | Iustin Pop | of course be available for querying via the OpQueryNodes/Instance and |
636 | 0f828357 | Iustin Pop | cluster information, and via RAPI as well. |
637 | 0f828357 | Iustin Pop | |
638 | 0f828357 | Iustin Pop | Note that Ganeti will not provide any way to change this attribute. |
639 | 0f828357 | Iustin Pop | |
640 | 0f828357 | Iustin Pop | Upgrading from Ganeti 2.0 will automatically add a ‘uuid’ attribute |
641 | 0f828357 | Iustin Pop | to all entities missing it. |
642 | 0f828357 | Iustin Pop | |
643 | 0f828357 | Iustin Pop | |
644 | 0f828357 | Iustin Pop | Proposed changes part 2 |
645 | 0f828357 | Iustin Pop | +++++++++++++++++++++++ |
646 | 0f828357 | Iustin Pop | |
647 | 0f828357 | Iustin Pop | In the next release (e.g. 2.2), the tracking of objects will change |
648 | 0f828357 | Iustin Pop | from the name to the UUID internally, and externally Ganeti will |
649 | 0f828357 | Iustin Pop | accept both forms of identification; e.g. an RAPI call would be made |
650 | 0f828357 | Iustin Pop | either against ``/2/instances/foo.bar`` or against |
651 | 0f828357 | Iustin Pop | ``/2/instances/bb3b2e42…``. Since an FQDN must have at least one dot, |
652 | 0f828357 | Iustin Pop | and dots are not valid characters in UUIDs, we will not have namespace |
653 | 0f828357 | Iustin Pop | issues. |
654 | 0f828357 | Iustin Pop | |
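The dot-based disambiguation can be sketched like this; the function name is hypothetical, and relying on Python's ``uuid`` module for parsing is an assumption of the sketch, not something the design prescribes.

```python
import uuid


def classify_identifier(identifier: str) -> str:
    """Return "name" for an FQDN and "uuid" for a UUID string."""
    # An FQDN always contains a dot, a UUID never does.
    if "." in identifier:
        return "name"
    try:
        uuid.UUID(identifier)
    except ValueError:
        raise ValueError("neither an FQDN nor a UUID: %r" % identifier)
    return "uuid"
```

A RAPI handler could use such a check to decide whether to look an object up by name or by UUID.
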
655 | 0f828357 | Iustin Pop | Another change here is that node identification (during cluster |
656 | 0f828357 | Iustin Pop | operations/queries like master startup, “am I the master?” and |
657 | 0f828357 | Iustin Pop | similar) could be done via UUIDs, which is more stable than the current |
658 | 0f828357 | Iustin Pop | hostname-based scheme. |
659 | 0f828357 | Iustin Pop | |
660 | 0f828357 | Iustin Pop | Internal tracking refers to the way the configuration is stored; a |
661 | 0f828357 | Iustin Pop | DRBD disk of an instance refers to the node name (so that IPs can be |
662 | 0f828357 | Iustin Pop | changed easily), but this is still a problem for name changes; thus |
663 | 0f828357 | Iustin Pop | these will be changed to point to the node UUID to ease renames. |
664 | 0f828357 | Iustin Pop | |
665 | 0f828357 | Iustin Pop | The advantage of this change (after the second round of changes) is |
666 | 0f828357 | Iustin Pop | that node rename becomes trivial, whereas today node rename would |
667 | 0f828357 | Iustin Pop | require a complete lock of all instances. |
668 | 0f828357 | Iustin Pop | |
669 | 395aa879 | Michael Hanselmann | |
670 | 395aa879 | Michael Hanselmann | Automated disk repairs infrastructure |
671 | 395aa879 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
672 | 395aa879 | Michael Hanselmann | |
673 | 7faf5110 | Michael Hanselmann | Replacing defective disks in an automated fashion is quite difficult |
674 | 7faf5110 | Michael Hanselmann | with the current version of Ganeti. These changes will introduce |
675 | 7faf5110 | Michael Hanselmann | additional functionality and interfaces to simplify automating disk |
676 | 7faf5110 | Michael Hanselmann | replacements on a Ganeti node. |
677 | 395aa879 | Michael Hanselmann | |
678 | 395aa879 | Michael Hanselmann | Fix node volume group |
679 | 395aa879 | Michael Hanselmann | +++++++++++++++++++++ |
680 | 395aa879 | Michael Hanselmann | |
681 | 7faf5110 | Michael Hanselmann | This is the most difficult addition, as it can lead to data loss if it's |
682 | 7faf5110 | Michael Hanselmann | not properly safeguarded. |
683 | 395aa879 | Michael Hanselmann | |
684 | 7faf5110 | Michael Hanselmann | The operation must be done only when all the other nodes that have |
685 | 7faf5110 | Michael Hanselmann | instances in common with the target node are fine, i.e. this is the only |
686 | 7faf5110 | Michael Hanselmann | node with problems, and also we have to double-check that all instances |
687 | 7faf5110 | Michael Hanselmann | on this node have at least one good copy of the data. |
688 | 395aa879 | Michael Hanselmann | |
689 | 395aa879 | Michael Hanselmann | This might mean that we have to enhance the GetMirrorStatus calls, and |
690 | 7faf5110 | Michael Hanselmann | introduce a smarter version that can tell us more about the status |
691 | 7faf5110 | Michael Hanselmann | of an instance. |
692 | 395aa879 | Michael Hanselmann | |
693 | 395aa879 | Michael Hanselmann | Stop allocation on a given PV |
694 | 395aa879 | Michael Hanselmann | +++++++++++++++++++++++++++++ |
695 | 395aa879 | Michael Hanselmann | |
696 | 7faf5110 | Michael Hanselmann | This is somewhat simple. First we need a "list PVs" opcode (and its |
697 | 7faf5110 | Michael Hanselmann | associated logical unit) and then a set PV status opcode/LU. These in |
698 | 7faf5110 | Michael Hanselmann | combination should allow both checking and changing the disk/PV status. |
699 | 395aa879 | Michael Hanselmann | |
700 | 395aa879 | Michael Hanselmann | Instance disk status |
701 | 395aa879 | Michael Hanselmann | ++++++++++++++++++++ |
702 | 395aa879 | Michael Hanselmann | |
703 | 7faf5110 | Michael Hanselmann | This new opcode or opcode change must list the instance-disk-index and |
704 | 7faf5110 | Michael Hanselmann | node combinations of the instance together with their status. This will |
705 | 7faf5110 | Michael Hanselmann | allow determining what part of the instance is broken (if any). |
706 | 395aa879 | Michael Hanselmann | |
707 | 395aa879 | Michael Hanselmann | Repair instance |
708 | 395aa879 | Michael Hanselmann | +++++++++++++++ |
709 | 395aa879 | Michael Hanselmann | |
710 | 7faf5110 | Michael Hanselmann | This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in |
711 | 7faf5110 | Michael Hanselmann | order to fix the instance status. It only affects primary instances; |
712 | 7faf5110 | Michael Hanselmann | secondaries can just be moved away. |
713 | 395aa879 | Michael Hanselmann | |
714 | 395aa879 | Michael Hanselmann | Migrate node |
715 | 395aa879 | Michael Hanselmann | ++++++++++++ |
716 | 395aa879 | Michael Hanselmann | |
717 | 7faf5110 | Michael Hanselmann | This new opcode/LU/RAPI call will take over the current ``gnt-node |
718 | 7faf5110 | Michael Hanselmann | migrate`` code and run migrate for all instances on the node. |
719 | 395aa879 | Michael Hanselmann | |
720 | 395aa879 | Michael Hanselmann | Evacuate node |
721 | 395aa879 | Michael Hanselmann | ++++++++++++++ |
722 | 395aa879 | Michael Hanselmann | |
723 | 7faf5110 | Michael Hanselmann | This new opcode/LU/RAPI call will take over the current ``gnt-node |
724 | 7faf5110 | Michael Hanselmann | evacuate`` code and run replace-secondary with an iallocator script for |
725 | 7faf5110 | Michael Hanselmann | all instances on the node. |
726 | 395aa879 | Michael Hanselmann | |
727 | 395aa879 | Michael Hanselmann | |
728 | 82a1c938 | Guido Trotter | External interface changes |
729 | 82a1c938 | Guido Trotter | -------------------------- |
730 | 82a1c938 | Guido Trotter | |
731 | b6cc971c | Guido Trotter | OS API |
732 | b6cc971c | Guido Trotter | ~~~~~~ |
733 | b6cc971c | Guido Trotter | |
734 | 7faf5110 | Michael Hanselmann | The OS API of Ganeti 2.0 has been built with extensibility in mind. |
735 | 7faf5110 | Michael Hanselmann | Since we pass everything as environment variables it's a lot easier to |
736 | 7faf5110 | Michael Hanselmann | send new information to the OSes without breaking backwards compatibility. |
737 | 7faf5110 | Michael Hanselmann | This section of the design outlines the proposed extensions to the API |
738 | 7faf5110 | Michael Hanselmann | and their implementation. |
739 | b6cc971c | Guido Trotter | |
740 | b6cc971c | Guido Trotter | API Version Compatibility Handling |
741 | b6cc971c | Guido Trotter | ++++++++++++++++++++++++++++++++++ |
742 | b6cc971c | Guido Trotter | |
743 | 7faf5110 | Michael Hanselmann | In 2.1 there will be a new OS API version (e.g. 15), which should be |
744 | 7faf5110 | Michael Hanselmann | mostly compatible with API 10, except for some newly added variables. |
745 | 7faf5110 | Michael Hanselmann | Since it's easy not to pass some variables we'll be able to handle |
746 | 7faf5110 | Michael Hanselmann | Ganeti 2.0 OSes by just filtering out the newly added piece of |
747 | 7faf5110 | Michael Hanselmann | information. We will still encourage OSes to declare support for the new |
748 | 7faf5110 | Michael Hanselmann | API after checking that the new variables don't cause any conflict for |
749 | 7faf5110 | Michael Hanselmann | them, and we will drop API 10 support after Ganeti 2.1 has been released. |
750 | b6cc971c | Guido Trotter | |
751 | b6cc971c | Guido Trotter | New Environment variables |
752 | b6cc971c | Guido Trotter | +++++++++++++++++++++++++ |
753 | b6cc971c | Guido Trotter | |
754 | 7faf5110 | Michael Hanselmann | Some variables have never been added to the OS api but would definitely |
755 | 7faf5110 | Michael Hanselmann | be useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable |
756 | 7faf5110 | Michael Hanselmann | to allow the OS to make changes relevant to the virtualization the |
757 | 7faf5110 | Michael Hanselmann | instance is going to use. Since this field is immutable for each |
758 | 7faf5110 | Michael Hanselmann | instance, the OS can tailor the install without having to make sure the |
759 | 7faf5110 | Michael Hanselmann | instance can run under any virtualization technology. |
760 | 7faf5110 | Michael Hanselmann | |
761 | 7faf5110 | Michael Hanselmann | We also want the OS to know the particular hypervisor parameters, to be |
762 | 7faf5110 | Michael Hanselmann | able to customize the install even more. Since the parameters can |
763 | 7faf5110 | Michael Hanselmann | change, though, we will pass them only as an "FYI": if an OS ties some |
764 | 7faf5110 | Michael Hanselmann | instance functionality to the value of a particular hypervisor parameter, |
765 | 7faf5110 | Michael Hanselmann | manual changes or a reinstall may be needed to adapt the instance to the |
766 | 7faf5110 | Michael Hanselmann | new environment. This is not a regression as of today, because even if |
767 | 7faf5110 | Michael Hanselmann | the OSes are left blind about this information, sometimes they still |
768 | 7faf5110 | Michael Hanselmann | need to make compromises and cannot satisfy all possible parameter |
769 | 7faf5110 | Michael Hanselmann | values. |
770 | b6cc971c | Guido Trotter | |
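The compatibility handling could be sketched as below. The version number 15 is only the example value used in this document, the set of new variables is limited to ``INSTANCE_HYPERVISOR`` because that is the only name given here, and the function name is hypothetical.

```python
from typing import Dict, Iterable

OS_API_V10 = 10
OS_API_V15 = 15  # example value from the text, not a confirmed number

# Variables that only OSes declaring the new API version will receive.
NEW_VARIABLES = frozenset(["INSTANCE_HYPERVISOR"])


def build_os_env(full_env: Dict[str, str],
                 declared_api_versions: Iterable[int]) -> Dict[str, str]:
    """Return the environment to pass to the OS scripts."""
    if OS_API_V15 in declared_api_versions:
        return dict(full_env)
    # Ganeti 2.0 OS: filter out the newly added pieces of information.
    return {key: value for key, value in full_env.items()
            if key not in NEW_VARIABLES}
```

An OS that declares both versions would receive the full environment, which is why authors are encouraged to check the new variables for conflicts before declaring support.
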
771 | 4dfac6af | Guido Trotter | OS Variants |
772 | 00b66530 | Guido Trotter | +++++++++++ |
773 | b6cc971c | Guido Trotter | |
774 | 7faf5110 | Michael Hanselmann | Currently we are witnessing some degree of "OS proliferation" just to |
775 | 7faf5110 | Michael Hanselmann | change a simple installation behavior. This means that the same OS gets |
776 | 7faf5110 | Michael Hanselmann | installed on the cluster multiple times, with different names, to |
777 | 7faf5110 | Michael Hanselmann | customize just one installation behavior. Usually such OSes try to share |
778 | 7faf5110 | Michael Hanselmann | as much as possible through symlinks, but this still causes |
779 | 7faf5110 | Michael Hanselmann | complications on the user side, especially when multiple parameters must |
780 | 7faf5110 | Michael Hanselmann | be cross-matched. |
781 | 7faf5110 | Michael Hanselmann | |
782 | 7faf5110 | Michael Hanselmann | For example today if you want to install debian etch, lenny or squeeze |
783 | 7faf5110 | Michael Hanselmann | you probably need to install the debootstrap OS multiple times, changing |
784 | 7faf5110 | Michael Hanselmann | its configuration file, and calling it debootstrap-etch, |
785 | 7faf5110 | Michael Hanselmann | debootstrap-lenny or debootstrap-squeeze. Furthermore if you have for |
786 | 7faf5110 | Michael Hanselmann | example a "server" and a "development" environment which installs |
787 | 7faf5110 | Michael Hanselmann | different packages/configuration files and must be available for all |
788 | 7faf5110 | Michael Hanselmann | installs you'll probably end up with debootstrap-etch-server, |
789 | 7faf5110 | Michael Hanselmann | debootstrap-etch-dev, debootstrap-lenny-server, debootstrap-lenny-dev, |
790 | 7faf5110 | Michael Hanselmann | etc. Crossing more than two parameters quickly becomes unmanageable. |
791 | 7faf5110 | Michael Hanselmann | |
792 | 7faf5110 | Michael Hanselmann | In order to avoid this we plan to make OSes more customizable, by |
793 | 7faf5110 | Michael Hanselmann | allowing each OS to declare a list of variants which can be used to |
794 | 7faf5110 | Michael Hanselmann | customize it. The variants list is mandatory and must be written, one |
795 | 7faf5110 | Michael Hanselmann | variant per line, in the new "variants.list" file inside the main os |
796 | 7faf5110 | Michael Hanselmann | dir. At least one variant must be supported. When choosing the |
797 | 7faf5110 | Michael Hanselmann | OS exactly one variant will have to be specified, and will be encoded in |
798 | 7faf5110 | Michael Hanselmann | the os name as <OS-name>+<variant>. As is the case today, it will be possible to |
799 | 7faf5110 | Michael Hanselmann | change an instance's OS at creation or install time. |
800 | 00b66530 | Guido Trotter | |
801 | 00b66530 | Guido Trotter | The 2.1 OS list will be the combination of each OS, plus its supported |
802 | 7faf5110 | Michael Hanselmann | variants. This will cause the name proliferation to remain, but at |
803 | 7faf5110 | Michael Hanselmann | least the internal OS code will be simplified to just parsing the passed |
804 | 7faf5110 | Michael Hanselmann | variant, without the need for symlinks or code duplication. |
805 | 7faf5110 | Michael Hanselmann | |
806 | 7faf5110 | Michael Hanselmann | Also we expect the OSes to declare only "interesting" variants, but to |
807 | 7faf5110 | Michael Hanselmann | accept some non-declared ones which a user will be able to pass in by |
808 | 7faf5110 | Michael Hanselmann | overriding the checks ganeti does. This will be useful for allowing some |
809 | 7faf5110 | Michael Hanselmann | variations to be used without polluting the OS list (per-OS |
810 | 7faf5110 | Michael Hanselmann | documentation should list all supported variants). If a variant which is |
811 | 7faf5110 | Michael Hanselmann | not internally supported is forced through, the OS scripts should abort. |
812 | 7faf5110 | Michael Hanselmann | |
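The variant handling described above can be sketched as follows. All function names are hypothetical, and the ``force`` flag merely stands in for whatever mechanism will be used to override the checks Ganeti does.

```python
from typing import Dict, List, Tuple


def parse_variants_list(text: str) -> List[str]:
    """Parse the contents of "variants.list": one variant per line."""
    variants = [line.strip() for line in text.splitlines() if line.strip()]
    if not variants:
        raise ValueError("at least one variant must be supported")
    return variants


def split_os_name(full_name: str) -> Tuple[str, str]:
    """Split "<OS-name>+<variant>" into its two parts."""
    name, sep, variant = full_name.partition("+")
    if not sep or not name or not variant:
        raise ValueError("expected <OS-name>+<variant>: %r" % full_name)
    return name, variant


def expand_os_list(os_variants: Dict[str, List[str]]) -> List[str]:
    """The 2.1 OS list: each OS combined with its declared variants."""
    return ["%s+%s" % (name, variant)
            for name in sorted(os_variants)
            for variant in os_variants[name]]


def check_os(full_name: str, os_variants: Dict[str, List[str]],
             force: bool = False) -> Tuple[str, str]:
    """Accept declared variants; undeclared ones only when forced."""
    name, variant = split_os_name(full_name)
    if name not in os_variants:
        raise ValueError("unknown OS: %s" % name)
    if variant not in os_variants[name] and not force:
        raise ValueError("undeclared variant: %s" % variant)
    return name, variant
```

Even when a variant is forced through, the OS scripts remain the final authority and should abort if they do not support it.
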
813 | 7faf5110 | Michael Hanselmann | In the future (post 2.1) we may want to move to fully fledged parameters |
814 | 7faf5110 | Michael Hanselmann | all orthogonal to each other (for example "architecture" (i386, amd64), |
815 | 7faf5110 | Michael Hanselmann | "suite" (lenny, squeeze, ...), etc). (As opposed to the variant, which |
816 | 7faf5110 | Michael Hanselmann | is a single parameter, and you need a different variant for each of the |
817 | 7faf5110 | Michael Hanselmann | combinations you want to support). In this case we envision the |
818 | 7faf5110 | Michael Hanselmann | variants to be moved inside of Ganeti and be associated with lists of |
819 | 7faf5110 | Michael Hanselmann | parameter->values associations, which will then be passed to the OS. |
820 | 4dfac6af | Guido Trotter | |
821 | b6cc971c | Guido Trotter | |
822 | 3bd3d643 | Iustin Pop | IAllocator changes |
823 | 3bd3d643 | Iustin Pop | ~~~~~~~~~~~~~~~~~~ |
824 | 3bd3d643 | Iustin Pop | |
825 | 3bd3d643 | Iustin Pop | Current State and shortcomings |
826 | 3bd3d643 | Iustin Pop | ++++++++++++++++++++++++++++++ |
827 | 3bd3d643 | Iustin Pop | |
828 | 3bd3d643 | Iustin Pop | The iallocator interface allows creation of instances without manually |
829 | 3bd3d643 | Iustin Pop | specifying nodes, but instead by specifying plugins which will do the |
830 | 3bd3d643 | Iustin Pop | required computations and produce a valid node list. |
831 | 3bd3d643 | Iustin Pop | |
832 | 3bd3d643 | Iustin Pop | However, the interface is quite awkward to use: |
833 | 3bd3d643 | Iustin Pop | |
834 | 3bd3d643 | Iustin Pop | - one cannot set a 'default' iallocator script |
835 | 3bd3d643 | Iustin Pop | - one cannot use it to easily test if allocation would succeed |
836 | 3bd3d643 | Iustin Pop | - some new functionality, such as rebalancing clusters and calculating |
837 | 3bd3d643 | Iustin Pop | capacity estimates, is needed |
838 | 3bd3d643 | Iustin Pop | |
839 | 3bd3d643 | Iustin Pop | Proposed changes |
840 | 3bd3d643 | Iustin Pop | ++++++++++++++++ |
841 | 3bd3d643 | Iustin Pop | |
842 | 3bd3d643 | Iustin Pop | Two areas of improvement are proposed: |
843 | 3bd3d643 | Iustin Pop | |
844 | 3bd3d643 | Iustin Pop | - improving the use of the current interface |
845 | 3bd3d643 | Iustin Pop | - extending the IAllocator API to cover more automation |
846 | 3bd3d643 | Iustin Pop | |
847 | 3bd3d643 | Iustin Pop | |
848 | 3bd3d643 | Iustin Pop | Default iallocator names |
849 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^^^^^^^^ |
850 | 3bd3d643 | Iustin Pop | |
851 | 3bd3d643 | Iustin Pop | The cluster will hold, for each type of iallocator, a (possibly empty) |
852 | 3bd3d643 | Iustin Pop | list of modules that will be used automatically. |
853 | 3bd3d643 | Iustin Pop | |
854 | 3bd3d643 | Iustin Pop | If the list is empty, the behaviour will remain the same. |
855 | 3bd3d643 | Iustin Pop | |
856 | 3bd3d643 | Iustin Pop | If the list has one entry, then Ganeti will behave as if |
857 | 3bd3d643 | Iustin Pop | '--iallocator' had been specified on the command line, i.e. it will use |
858 | 3bd3d643 | Iustin Pop | this allocator by default. If, however, the user passed nodes, those |
859 | 3bd3d643 | Iustin Pop | will be used in preference. |
860 | 3bd3d643 | Iustin Pop | |
861 | 3bd3d643 | Iustin Pop | If the list has multiple entries, they will be tried in order until |
862 | 3bd3d643 | Iustin Pop | one gives a successful answer. |
863 | 3bd3d643 | Iustin Pop | |
864 | 3bd3d643 | Iustin Pop | Dry-run allocation |
865 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^^ |
866 | 3bd3d643 | Iustin Pop | |
867 | 3bd3d643 | Iustin Pop | The create instance LU will get a new 'dry-run' option that will just |
868 | 3bd3d643 | Iustin Pop | simulate the placement, and return the chosen node-lists after running |
869 | 3bd3d643 | Iustin Pop | all the usual checks. |
870 | 3bd3d643 | Iustin Pop | |
871 | 3bd3d643 | Iustin Pop | Cluster balancing |
872 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^ |
873 | 3bd3d643 | Iustin Pop | |
874 | 3bd3d643 | Iustin Pop | Instance additions, removals and moves can create a situation where load |
875 | 3bd3d643 | Iustin Pop | on the nodes is not spread equally. To address this, a new iallocator |
876 | 3bd3d643 | Iustin Pop | mode called ``balance`` will be implemented, in which the plugin, given |
877 | 3bd3d643 | Iustin Pop | the current cluster state and a maximum number of operations, will need |
878 | 3bd3d643 | Iustin Pop | to compute the instance relocations needed to achieve a "better" cluster |
879 | 3bd3d643 | Iustin Pop | (by whatever measure the script considers better). |
880 | 3bd3d643 | Iustin Pop | |
881 | 3bd3d643 | Iustin Pop | Cluster capacity calculation |
882 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
883 | 3bd3d643 | Iustin Pop | |
884 | 3bd3d643 | Iustin Pop | In this mode, called ``capacity``, given an instance specification and |
885 | 3bd3d643 | Iustin Pop | the current cluster state (similar to the ``allocate`` mode), the |
886 | 3bd3d643 | Iustin Pop | plugin needs to return: |
887 | 3bd3d643 | Iustin Pop | |
888 | 7faf5110 | Michael Hanselmann | - how many instances can be allocated on the cluster with that |
889 | 7faf5110 | Michael Hanselmann | specification |
890 | 3bd3d643 | Iustin Pop | - on which nodes these will be allocated (in order) |
891 | 558fd122 | Michael Hanselmann | |
892 | 558fd122 | Michael Hanselmann | .. vim: set textwidth=72 : |
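
As an illustration of the ``capacity`` mode described above, the following is a minimal sketch of what a plugin might compute. It is not the Ganeti implementation: the function name ``compute_capacity``, the dictionary keys, and the greedy "most free memory first" placement policy are all assumptions made for this example, and it only considers single-node instances. A real plugin is free to use whatever policy it considers best.

```python
def compute_capacity(nodes, spec):
    """Return (count, placements) for repeatedly allocating `spec`.

    Hypothetical data layout, chosen for this sketch only:
      nodes: {node_name: {"free_memory": MiB, "free_disk": MiB}}
      spec:  {"memory": MiB, "disk": MiB}
    """
    if spec["memory"] <= 0 and spec["disk"] <= 0:
        # An instance consuming nothing would fit infinitely often.
        raise ValueError("instance specification must consume resources")

    # Work on a copy so the caller's view of the cluster is untouched.
    free = {name: dict(res) for name, res in nodes.items()}
    placements = []

    while True:
        # Nodes that can still hold one more instance of this size.
        candidates = [
            name for name, res in free.items()
            if res["free_memory"] >= spec["memory"]
            and res["free_disk"] >= spec["disk"]
        ]
        if not candidates:
            break
        # Assumed policy: most free memory first, node name breaks ties.
        chosen = min(candidates, key=lambda n: (-free[n]["free_memory"], n))
        free[chosen]["free_memory"] -= spec["memory"]
        free[chosen]["free_disk"] -= spec["disk"]
        placements.append(chosen)

    # The two answers the mode asks for: how many, and where (in order).
    return len(placements), placements
```

For example, with ``node1`` offering 2048 MiB of free memory and ``node2`` offering 1024 MiB (both with ample disk), a 1024 MiB instance specification yields three allocations under this policy, placed on ``node1``, ``node1`` and ``node2`` in that order.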