root / doc / design-2.1.rst @ 558fd122
1 | 82a1c938 | Guido Trotter | ================= |
---|---|---|---|
2 | 82a1c938 | Guido Trotter | Ganeti 2.1 design |
3 | 82a1c938 | Guido Trotter | ================= |
4 | 82a1c938 | Guido Trotter | |
5 | 82a1c938 | Guido Trotter | This document describes the major changes in Ganeti 2.1 compared to |
6 | 82a1c938 | Guido Trotter | the 2.0 version. |
7 | 82a1c938 | Guido Trotter | |
8 | 82a1c938 | Guido Trotter | The 2.1 version will be a relatively small release. Its main aim is to avoid |
9 | 82a1c938 | Guido Trotter | changing too much of the core code, while addressing issues and adding new |
10 | 82a1c938 | Guido Trotter | features and improvements over 2.0, in a timely fashion. |
11 | 82a1c938 | Guido Trotter | |
12 | 5ee09f03 | Michael Hanselmann | .. contents:: :depth: 4 |
13 | 82a1c938 | Guido Trotter | |
14 | 82a1c938 | Guido Trotter | Objective |
15 | 82a1c938 | Guido Trotter | ========= |
16 | 82a1c938 | Guido Trotter | |
17 | 82a1c938 | Guido Trotter | Ganeti 2.1 will add features to help further automation of cluster |
18 | 82a1c938 | Guido Trotter | operations, further improve scalability to even bigger clusters, and make it |
19 | 82a1c938 | Guido Trotter | easier to debug the Ganeti core. |
20 | 82a1c938 | Guido Trotter | |
21 | 82a1c938 | Guido Trotter | Background |
22 | 82a1c938 | Guido Trotter | ========== |
23 | 82a1c938 | Guido Trotter | |
24 | 82a1c938 | Guido Trotter | Overview |
25 | 82a1c938 | Guido Trotter | ======== |
26 | 82a1c938 | Guido Trotter | |
27 | 82a1c938 | Guido Trotter | Detailed design |
28 | 82a1c938 | Guido Trotter | =============== |
29 | 82a1c938 | Guido Trotter | |
30 | 82a1c938 | Guido Trotter | As for 2.0 we divide the 2.1 design into three areas: |
31 | 82a1c938 | Guido Trotter | |
32 | 587ff6fa | Guido Trotter | - core changes, which affect the master daemon/job queue/locking or all/most |
33 | 587ff6fa | Guido Trotter | logical units |
34 | 82a1c938 | Guido Trotter | - logical unit/feature changes |
35 | 82a1c938 | Guido Trotter | - external interface changes (e.g. command line, OS API, hooks, ...) |
36 | 82a1c938 | Guido Trotter | |
37 | 82a1c938 | Guido Trotter | Core changes |
38 | 82a1c938 | Guido Trotter | ------------ |
39 | 82a1c938 | Guido Trotter | |
40 | a392a6b8 | Michael Hanselmann | Storage units modelling |
41 | a392a6b8 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~~~~ |
42 | a392a6b8 | Michael Hanselmann | |
43 | a392a6b8 | Michael Hanselmann | Currently, Ganeti has a good model of the block devices for instances |
44 | a392a6b8 | Michael Hanselmann | (e.g. LVM logical volumes, files, DRBD devices, etc.) but none of the |
45 | a392a6b8 | Michael Hanselmann | storage pools that are providing the space for these front-end |
46 | a392a6b8 | Michael Hanselmann | devices. For example, there are hardcoded inter-node RPC calls for |
47 | a392a6b8 | Michael Hanselmann | volume group listing, file storage creation/deletion, etc. |
48 | a392a6b8 | Michael Hanselmann | |
49 | a392a6b8 | Michael Hanselmann | The storage units framework will implement a generic handling for all |
50 | a392a6b8 | Michael Hanselmann | kinds of storage backends: |
51 | a392a6b8 | Michael Hanselmann | |
52 | a392a6b8 | Michael Hanselmann | - LVM physical volumes |
53 | a392a6b8 | Michael Hanselmann | - LVM volume groups |
54 | a392a6b8 | Michael Hanselmann | - File-based storage directories |
55 | a392a6b8 | Michael Hanselmann | - any other future storage method |
56 | a392a6b8 | Michael Hanselmann | |
57 | a392a6b8 | Michael Hanselmann | There will be a generic list of methods that each storage unit type |
58 | a392a6b8 | Michael Hanselmann | will provide, like: |
59 | a392a6b8 | Michael Hanselmann | |
60 | a392a6b8 | Michael Hanselmann | - list of storage units of this type |
61 | a392a6b8 | Michael Hanselmann | - check status of the storage unit |
62 | a392a6b8 | Michael Hanselmann | |
63 | a392a6b8 | Michael Hanselmann | Additionally, there will be methods specific to each storage type, for example: |
64 | a392a6b8 | Michael Hanselmann | |
65 | a392a6b8 | Michael Hanselmann | - enable/disable allocations on a specific PV |
66 | a392a6b8 | Michael Hanselmann | - file storage directory creation/deletion |
67 | a392a6b8 | Michael Hanselmann | - VG consistency fixing |
68 | a392a6b8 | Michael Hanselmann | |
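As an illustration only, the split between generic and type-specific methods described above could be sketched in Python roughly as follows. All class and method names here (``StorageUnitType``, ``list_units``, ``check_status``, ``LvmPvStorage``, ``set_allocatable``) are hypothetical and are not taken from the Ganeti code base; a real implementation would call the LVM tools rather than use an in-memory dictionary.

```python
from abc import ABC, abstractmethod


class StorageUnitType(ABC):
    """Hypothetical base class: generic methods every storage type provides."""

    @abstractmethod
    def list_units(self):
        """Return the names of all storage units of this type."""

    @abstractmethod
    def check_status(self, name):
        """Return a status string for the named storage unit."""


class LvmPvStorage(StorageUnitType):
    """Hypothetical LVM physical volume type, backed by a dict for illustration."""

    def __init__(self, pvs):
        # Maps PV name -> True if allocations are enabled on it.
        self._pvs = dict(pvs)

    def list_units(self):
        return sorted(self._pvs)

    def check_status(self, name):
        return "allocatable" if self._pvs[name] else "not allocatable"

    def set_allocatable(self, name, allocatable):
        """Type-specific method: enable/disable allocations on one PV."""
        if name not in self._pvs:
            raise KeyError(name)
        self._pvs[name] = bool(allocatable)
```

Callers that only need listing or status checks can work against ``StorageUnitType``, while operations such as ``set_allocatable`` remain available only on the type they apply to.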
69 | a392a6b8 | Michael Hanselmann | This will allow a much better modeling and unification of the various |
70 | a392a6b8 | Michael Hanselmann | RPC calls related to backend storage pool in the future. Ganeti 2.1 is |
71 | a392a6b8 | Michael Hanselmann | intended to add the basics of the framework, and not necessarily move |
72 | a392a6b8 | Michael Hanselmann | all the current VG/FileBased operations to it. |
73 | a392a6b8 | Michael Hanselmann | |
74 | a392a6b8 | Michael Hanselmann | Note that while we model both LVM PVs and LVM VGs, the framework will |
75 | a392a6b8 | Michael Hanselmann | **not** model any relationship between the different types. In other |
76 | a392a6b8 | Michael Hanselmann | words, we model neither inheritance nor stacking, since this is |
77 | a392a6b8 | Michael Hanselmann | too complex for our needs. While a ``vgreduce`` operation on an LVM VG |
78 | a392a6b8 | Michael Hanselmann | could actually remove a PV from it, this will not be handled at the |
79 | a392a6b8 | Michael Hanselmann | framework level, but at individual operation level. The goal is that |
80 | a392a6b8 | Michael Hanselmann | this is a lightweight framework, for abstracting the different storage |
81 | a392a6b8 | Michael Hanselmann | operations, and not for modelling the storage hierarchy. |
82 | a392a6b8 | Michael Hanselmann | |
83 | 5ee09f03 | Michael Hanselmann | |
84 | 5ee09f03 | Michael Hanselmann | Locking improvements |
85 | 5ee09f03 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~ |
86 | 5ee09f03 | Michael Hanselmann | |
87 | 5ee09f03 | Michael Hanselmann | Current State and shortcomings |
88 | 5ee09f03 | Michael Hanselmann | ++++++++++++++++++++++++++++++ |
89 | 5ee09f03 | Michael Hanselmann | |
90 | a5b360e4 | Michael Hanselmann | The class ``LockSet`` (see ``lib/locking.py``) is a container for one or |
91 | a5b360e4 | Michael Hanselmann | many ``SharedLock`` instances. It provides an interface to add/remove locks |
92 | a5b360e4 | Michael Hanselmann | and to acquire and subsequently release any number of those locks contained |
93 | a5b360e4 | Michael Hanselmann | in it. |
94 | 5ee09f03 | Michael Hanselmann | |
95 | a5b360e4 | Michael Hanselmann | Locks in a ``LockSet`` are always acquired in alphabetic order. Due to the |
96 | a5b360e4 | Michael Hanselmann | way we're using locks for nodes and instances (the single cluster lock isn't |
97 | 5ee09f03 | Michael Hanselmann | affected by this issue) this can lead to long delays when acquiring locks if |
98 | 5ee09f03 | Michael Hanselmann | another operation tries to acquire multiple locks but has to wait for yet |
99 | 5ee09f03 | Michael Hanselmann | another operation. |
100 | 5ee09f03 | Michael Hanselmann | |
101 | a5b360e4 | Michael Hanselmann | In the following demonstration we assume we have the instance locks |
102 | a5b360e4 | Michael Hanselmann | ``inst1``, ``inst2``, ``inst3`` and ``inst4``. |
103 | 5ee09f03 | Michael Hanselmann | |
104 | 5ee09f03 | Michael Hanselmann | #. Operation A grabs lock for instance ``inst4``. |
105 | a5b360e4 | Michael Hanselmann | #. Operation B wants to acquire all instance locks in alphabetic order, but |
106 | a5b360e4 | Michael Hanselmann | it has to wait for ``inst4``. |
107 | 5ee09f03 | Michael Hanselmann | #. Operation C tries to lock ``inst1``, but it has to wait until |
108 | a5b360e4 | Michael Hanselmann | Operation B (which is trying to acquire all locks) releases the lock |
109 | a5b360e4 | Michael Hanselmann | again. |
110 | 5ee09f03 | Michael Hanselmann | #. Operation A finishes and releases lock on ``inst4``. Operation B can |
111 | 5ee09f03 | Michael Hanselmann | continue and eventually releases all locks. |
112 | 5ee09f03 | Michael Hanselmann | #. Operation C can get ``inst1`` lock and finishes. |
113 | 5ee09f03 | Michael Hanselmann | |
114 | 5ee09f03 | Michael Hanselmann | Technically there's no need for Operation C to wait for Operation A, and |
115 | 5ee09f03 | Michael Hanselmann | subsequently Operation B, to finish. Operation B can't continue until |
116 | 5ee09f03 | Michael Hanselmann | Operation A is done (it has to wait for ``inst4``), anyway. |
117 | 5ee09f03 | Michael Hanselmann | |
118 | 5ee09f03 | Michael Hanselmann | Proposed changes |
119 | 5ee09f03 | Michael Hanselmann | ++++++++++++++++ |
120 | 5ee09f03 | Michael Hanselmann | |
121 | 5ee09f03 | Michael Hanselmann | Non-blocking lock acquiring |
122 | 5ee09f03 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
123 | 5ee09f03 | Michael Hanselmann | |
124 | 5ee09f03 | Michael Hanselmann | Acquiring locks for OpCode execution is always done in blocking mode. The |
125 | 5ee09f03 | Michael Hanselmann | acquire call won't return until the lock has successfully been acquired (or an error |
126 | 5ee09f03 | Michael Hanselmann | occurred, although we won't cover that case here). |
127 | 5ee09f03 | Michael Hanselmann | |
128 | a5b360e4 | Michael Hanselmann | ``SharedLock`` and ``LockSet`` must be able to be acquired in a non-blocking |
129 | a5b360e4 | Michael Hanselmann | way. They must support a timeout and abort trying to acquire the lock(s) |
130 | a5b360e4 | Michael Hanselmann | after the specified amount of time. |
131 | 5ee09f03 | Michael Hanselmann | |
132 | 5ee09f03 | Michael Hanselmann | Retry acquiring locks |
133 | 5ee09f03 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^ |
134 | 5ee09f03 | Michael Hanselmann | |
135 | a5b360e4 | Michael Hanselmann | To prevent other operations from waiting for a long time, such as described |
136 | a5b360e4 | Michael Hanselmann | in the demonstration before, ``LockSet`` must not keep locks for a prolonged |
137 | a5b360e4 | Michael Hanselmann | period of time when trying to acquire two or more locks. Instead, if it |
138 | a5b360e4 | Michael Hanselmann | fails to acquire all requested locks within the timeout, it should release |
139 | a5b360e4 | Michael Hanselmann | all locks again and sleep for some time before retrying with a longer timeout. |
140 | 5ee09f03 | Michael Hanselmann | |
141 | 5ee09f03 | Michael Hanselmann | A good timeout value needs to be determined. In any case, ``LockSet`` should |
142 | a5b360e4 | Michael Hanselmann | proceed to acquire locks in blocking mode after a few (unsuccessful) |
143 | a5b360e4 | Michael Hanselmann | attempts to acquire all requested locks. |
144 | 5ee09f03 | Michael Hanselmann | |
145 | 5ee09f03 | Michael Hanselmann | One proposal for the timeout is to use ``2**tries`` seconds, where ``tries`` |
146 | 5ee09f03 | Michael Hanselmann | is the number of unsuccessful tries. |
147 | 5ee09f03 | Michael Hanselmann | |
148 | 5ee09f03 | Michael Hanselmann | In the demonstration before this would allow Operation C to continue after |
149 | 5ee09f03 | Michael Hanselmann | Operation B unsuccessfully tried to acquire all locks and released all |
150 | 5ee09f03 | Michael Hanselmann | acquired locks (``inst1``, ``inst2`` and ``inst3``) again. |
151 | 5ee09f03 | Michael Hanselmann | |
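The retry scheme can be sketched as follows. This is a minimal sketch and not the ``LockSet`` implementation: it uses plain ``threading.Lock`` objects, and the function name ``acquire_all`` and the parameters ``max_tries``, ``unit`` and ``pause`` are assumptions made for illustration. With ``unit=1.0`` the timeout per attempt is the proposed ``2**tries`` seconds.

```python
import threading
import time


def acquire_all(locks, max_tries=3, unit=1.0, pause=0.0):
    """Acquire every lock in `locks` (a dict of name -> threading.Lock).

    Returns the number of unsuccessful timed attempts before success.
    """
    names = sorted(locks)  # alphabetic order, as in LockSet
    for tries in range(max_tries):
        # One overall deadline per attempt: unit * 2**tries seconds.
        deadline = time.monotonic() + unit * 2 ** tries
        held = []
        for name in names:
            remaining = deadline - time.monotonic()
            if remaining <= 0 or not locks[name].acquire(timeout=remaining):
                break
            held.append(name)
        else:
            return tries  # all locks acquired within the timeout
        # Attempt failed: release what we hold so others can make progress.
        for name in reversed(held):
            locks[name].release()
        time.sleep(pause)
    # After a few unsuccessful attempts, fall back to blocking mode.
    for name in names:
        locks[name].acquire()
    return max_tries
```

Between failed attempts no lock is held, which is what lets an operation that needs only ``inst1`` proceed in the meantime.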
152 | 5ee09f03 | Michael Hanselmann | Other solutions discussed |
153 | 5ee09f03 | Michael Hanselmann | +++++++++++++++++++++++++ |
154 | 5ee09f03 | Michael Hanselmann | |
155 | 5ee09f03 | Michael Hanselmann | There was also some discussion on going one step further and extending the job |
156 | a5b360e4 | Michael Hanselmann | queue (see ``lib/jqueue.py``) to select the next task for a worker depending |
157 | a5b360e4 | Michael Hanselmann | on whether it can acquire the necessary locks. While this may reduce the |
158 | a5b360e4 | Michael Hanselmann | number of necessary worker threads and/or increase throughput on large |
159 | a5b360e4 | Michael Hanselmann | clusters with many jobs, it also brings many potential problems, such as |
160 | a5b360e4 | Michael Hanselmann | contention and increased memory usage, with it. As this would be an |
161 | a5b360e4 | Michael Hanselmann | extension of the changes proposed before it could be implemented at a later |
162 | a5b360e4 | Michael Hanselmann | point in time, but we decided to stay with the simpler solution for now. |
163 | 5ee09f03 | Michael Hanselmann | |
164 | ca9ccea8 | Michael Hanselmann | Implementation details |
165 | ca9ccea8 | Michael Hanselmann | ++++++++++++++++++++++ |
166 | ca9ccea8 | Michael Hanselmann | |
167 | ca9ccea8 | Michael Hanselmann | ``SharedLock`` redesign |
168 | ca9ccea8 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^^^ |
169 | ca9ccea8 | Michael Hanselmann | |
170 | ca9ccea8 | Michael Hanselmann | The current design of ``SharedLock`` is not good for supporting timeouts |
171 | ca9ccea8 | Michael Hanselmann | when acquiring a lock and there are also minor fairness issues in it. We |
172 | ca9ccea8 | Michael Hanselmann | plan to address both with a redesign. A proof of concept implementation was |
173 | ca9ccea8 | Michael Hanselmann | written and resulted in significantly simpler code. |
174 | ca9ccea8 | Michael Hanselmann | |
175 | ca9ccea8 | Michael Hanselmann | Currently ``SharedLock`` uses two separate queues for shared and exclusive |
176 | ca9ccea8 | Michael Hanselmann | acquires and waiters get to run in turns. This means if an exclusive acquire |
177 | ca9ccea8 | Michael Hanselmann | is released, the lock will allow shared waiters to run and vice versa. |
178 | ca9ccea8 | Michael Hanselmann | Although it's still fair in the end there is a slight bias towards shared |
179 | ca9ccea8 | Michael Hanselmann | waiters in the current implementation. The same implementation with two |
180 | ca9ccea8 | Michael Hanselmann | separate queues cannot support timeouts without adding a lot of complexity. |
181 | ca9ccea8 | Michael Hanselmann | |
182 | ca9ccea8 | Michael Hanselmann | Our proposed redesign changes ``SharedLock`` to have a single queue. |
183 | ca9ccea8 | Michael Hanselmann | There will be one condition (see Condition_ for a note about performance) in |
184 | ca9ccea8 | Michael Hanselmann | the queue per exclusive acquire and two for all shared acquires (see below for |
185 | ca9ccea8 | Michael Hanselmann | an explanation). The maximum queue length will always be ``2 + (number of |
186 | ca9ccea8 | Michael Hanselmann | exclusive acquires waiting)``. The number of queue entries for shared acquires |
187 | ca9ccea8 | Michael Hanselmann | can vary from 0 to 2. |
188 | ca9ccea8 | Michael Hanselmann | |
189 | ca9ccea8 | Michael Hanselmann | The two conditions for shared acquires are a bit special. They will be used |
190 | ca9ccea8 | Michael Hanselmann | in turn. When the lock is instantiated, no conditions are in the queue. As |
191 | ca9ccea8 | Michael Hanselmann | soon as the first shared acquire arrives (and there are holder(s) or waiting |
192 | ca9ccea8 | Michael Hanselmann | acquires; see Acquire_), the active condition is added to the queue. Until |
193 | ca9ccea8 | Michael Hanselmann | it becomes the topmost condition in the queue and has been notified, any |
194 | ca9ccea8 | Michael Hanselmann | shared acquire is added to this active condition. When the active condition |
195 | ca9ccea8 | Michael Hanselmann | is notified, the conditions are swapped and further shared acquires are |
196 | ca9ccea8 | Michael Hanselmann | added to the previously inactive condition (which has now become the active |
197 | ca9ccea8 | Michael Hanselmann | condition). After all waiters on the previously active (now inactive) and |
198 | ca9ccea8 | Michael Hanselmann | now notified condition received the notification, it is removed from the |
199 | ca9ccea8 | Michael Hanselmann | queue of pending acquires. |
200 | ca9ccea8 | Michael Hanselmann | |
201 | ca9ccea8 | Michael Hanselmann | This means shared acquires will skip any exclusive acquire in the queue. We |
202 | ca9ccea8 | Michael Hanselmann | believe it's better to improve parallelization on operations only asking for |
203 | ca9ccea8 | Michael Hanselmann | shared (or read-only) locks. Exclusive operations holding the same lock can |
204 | ca9ccea8 | Michael Hanselmann | not be parallelized. |
205 | ca9ccea8 | Michael Hanselmann | |
206 | ca9ccea8 | Michael Hanselmann | |
207 | ca9ccea8 | Michael Hanselmann | Acquire |
208 | ca9ccea8 | Michael Hanselmann | ******* |
209 | ca9ccea8 | Michael Hanselmann | |
210 | ca9ccea8 | Michael Hanselmann | For exclusive acquires a new condition is created and appended to the queue. |
211 | ca9ccea8 | Michael Hanselmann | Shared acquires are added to the active condition for shared acquires and if |
212 | ca9ccea8 | Michael Hanselmann | the condition is not yet on the queue, it's appended. |
213 | ca9ccea8 | Michael Hanselmann | |
214 | ca9ccea8 | Michael Hanselmann | The next step is to wait for our condition to be on the top of the queue (to |
215 | ca9ccea8 | Michael Hanselmann | guarantee fairness). If the timeout expired, we return to the caller without |
216 | ca9ccea8 | Michael Hanselmann | acquiring the lock. On every notification we check whether the lock has been |
217 | ca9ccea8 | Michael Hanselmann | deleted, in which case an error is returned to the caller. |
218 | ca9ccea8 | Michael Hanselmann | |
219 | ca9ccea8 | Michael Hanselmann | The lock can be acquired if we're on top of the queue (there is no one else |
220 | ca9ccea8 | Michael Hanselmann | ahead of us). For an exclusive acquire, there must not be other exclusive or |
221 | ca9ccea8 | Michael Hanselmann | shared holders. For a shared acquire, there must not be an exclusive holder. |
222 | ca9ccea8 | Michael Hanselmann | If these conditions are all true, the lock is acquired and we return to the |
223 | ca9ccea8 | Michael Hanselmann | caller. In any other case we wait again on the condition. |
224 | ca9ccea8 | Michael Hanselmann | |
225 | ca9ccea8 | Michael Hanselmann | If it was the last waiter on a condition, the condition is removed from the |
226 | ca9ccea8 | Michael Hanselmann | queue. |
227 | ca9ccea8 | Michael Hanselmann | |
228 | ca9ccea8 | Michael Hanselmann | Optimization: There's no need to touch the queue if there are no pending |
229 | ca9ccea8 | Michael Hanselmann | acquires and no current holders. The caller can have the lock immediately. |
230 | ca9ccea8 | Michael Hanselmann | |
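The decision made on each wake-up can be written as a small predicate. This is only a sketch of the rules stated above; the function name ``can_acquire`` and its parameters are hypothetical and do not come from ``lib/locking.py``.

```python
def can_acquire(shared, at_top, exclusive_holder, shared_holders):
    """Return True if the caller may take the lock now.

    shared:           True for a shared acquire, False for an exclusive one
    at_top:           True if the caller's condition is first in the queue
                      (or the queue is empty)
    exclusive_holder: current exclusive holder, or None
    shared_holders:   collection of current shared holders
    """
    if not at_top:
        return False  # someone is ahead of us; wait again (fairness)
    if shared:
        # Shared acquires only conflict with an exclusive holder.
        return exclusive_holder is None
    # Exclusive acquires conflict with any holder at all.
    return exclusive_holder is None and not shared_holders
```

The optimization mentioned above corresponds to the case where the queue is empty and there are no holders, so the predicate is true on the first check.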
231 | ca9ccea8 | Michael Hanselmann | .. image:: design-2.1-lock-acquire.png |
232 | ca9ccea8 | Michael Hanselmann | |
233 | ca9ccea8 | Michael Hanselmann | |
234 | ca9ccea8 | Michael Hanselmann | Release |
235 | ca9ccea8 | Michael Hanselmann | ******* |
236 | ca9ccea8 | Michael Hanselmann | |
237 | ca9ccea8 | Michael Hanselmann | First the lock removes the caller from the internal owner list. If there are |
238 | ca9ccea8 | Michael Hanselmann | pending acquires in the queue, the first (the oldest) condition is notified. |
239 | ca9ccea8 | Michael Hanselmann | |
240 | ca9ccea8 | Michael Hanselmann | If the first condition was the active condition for shared acquires, the |
241 | ca9ccea8 | Michael Hanselmann | inactive condition will be made active. This ensures fairness with exclusive |
242 | ca9ccea8 | Michael Hanselmann | locks by forcing consecutive shared acquires to wait in the queue. |
243 | ca9ccea8 | Michael Hanselmann | |
244 | ca9ccea8 | Michael Hanselmann | .. image:: design-2.1-lock-release.png |
245 | ca9ccea8 | Michael Hanselmann | |
246 | ca9ccea8 | Michael Hanselmann | |
247 | ca9ccea8 | Michael Hanselmann | Delete |
248 | ca9ccea8 | Michael Hanselmann | ****** |
249 | ca9ccea8 | Michael Hanselmann | |
250 | ca9ccea8 | Michael Hanselmann | The caller must either hold the lock in exclusive mode already or the lock |
251 | ca9ccea8 | Michael Hanselmann | must be acquired in exclusive mode. Trying to delete a lock while it's held |
252 | ca9ccea8 | Michael Hanselmann | in shared mode must fail. |
253 | ca9ccea8 | Michael Hanselmann | |
254 | ca9ccea8 | Michael Hanselmann | After ensuring the lock is held in exclusive mode, the lock will mark itself |
255 | ca9ccea8 | Michael Hanselmann | as deleted and continue to notify all pending acquires. They will wake up, |
256 | ca9ccea8 | Michael Hanselmann | notice the deleted lock and return an error to the caller. |
257 | ca9ccea8 | Michael Hanselmann | |
258 | ca9ccea8 | Michael Hanselmann | |
259 | ca9ccea8 | Michael Hanselmann | Condition |
260 | ca9ccea8 | Michael Hanselmann | ^^^^^^^^^ |
261 | ca9ccea8 | Michael Hanselmann | |
262 | ca9ccea8 | Michael Hanselmann | Note: This is not necessary for the locking changes above, but it may be a |
263 | ca9ccea8 | Michael Hanselmann | good optimization (pending performance tests). |
264 | ca9ccea8 | Michael Hanselmann | |
265 | ca9ccea8 | Michael Hanselmann | The existing locking code in Ganeti 2.0 uses Python's built-in |
266 | ca9ccea8 | Michael Hanselmann | ``threading.Condition`` class. Unfortunately ``Condition`` implements |
267 | ca9ccea8 | Michael Hanselmann | timeouts by sleeping 1ms to 20ms between tries to acquire the condition lock |
268 | ca9ccea8 | Michael Hanselmann | in non-blocking mode. This requires unnecessary context switches and |
269 | ca9ccea8 | Michael Hanselmann | contention on the CPython GIL (Global Interpreter Lock). |
270 | ca9ccea8 | Michael Hanselmann | |
271 | ca9ccea8 | Michael Hanselmann | By using POSIX pipes (see ``pipe(2)``) we can use the operating system's |
272 | ca9ccea8 | Michael Hanselmann | support for timeouts on file descriptors (see ``select(2)``). A custom |
273 | ca9ccea8 | Michael Hanselmann | condition class will have to be written for this. |
274 | ca9ccea8 | Michael Hanselmann | |
275 | ca9ccea8 | Michael Hanselmann | On instantiation the class creates a pipe. After each notification the |
276 | ca9ccea8 | Michael Hanselmann | previous pipe is abandoned and re-created (technically the old pipe needs to |
277 | ca9ccea8 | Michael Hanselmann | stay around until all notifications have been delivered). |
278 | ca9ccea8 | Michael Hanselmann | |
279 | ca9ccea8 | Michael Hanselmann | All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to |
280 | ca9ccea8 | Michael Hanselmann | wait for notifications, optionally with a timeout. A notification will be |
281 | ca9ccea8 | Michael Hanselmann | signalled to the waiting clients by closing the pipe. If the pipe wasn't |
282 | ca9ccea8 | Michael Hanselmann | closed during the timeout, the waiting function returns to its caller |
283 | ca9ccea8 | Michael Hanselmann | nonetheless. |
284 | ca9ccea8 | Michael Hanselmann | |
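A minimal sketch of such a pipe-based condition is shown below, assuming a POSIX system. The class name ``PipeCondition`` and the methods ``register``, ``wait`` and ``notify_all`` are hypothetical. In this sketch each waiter duplicates the read end of the pipe with ``os.dup``, which is one possible way to keep the old pipe alive until every waiter has seen the notification. The sketch does no locking of its own; it assumes the caller serializes ``register`` and ``notify_all``, as the lock's internal mutex would.

```python
import os
import select


class PipeCondition:
    """Hypothetical condition that signals waiters by closing a pipe."""

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()

    def register(self):
        """Return a private descriptor for the current pipe's read end."""
        return os.dup(self._read_fd)

    def wait(self, ticket, timeout=None):
        """Wait on a descriptor from register().

        Returns True if notified, False if the timeout expired.
        """
        try:
            # Once every write end is closed, the read end reports EOF
            # and therefore becomes readable.
            readable, _, _ = select.select([ticket], [], [], timeout)
            return bool(readable)
        finally:
            os.close(ticket)

    def notify_all(self):
        """Wake all registered waiters and start over with a fresh pipe."""
        os.close(self._write_fd)
        os.close(self._read_fd)
        self._read_fd, self._write_fd = os.pipe()

    def close(self):
        os.close(self._write_fd)
        os.close(self._read_fd)
```

A waiter calls ``register()`` while holding the lock's mutex, releases the mutex, and then calls ``wait()`` with the returned descriptor; the timeout is handled entirely by ``select(2)``.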
285 | 5ee09f03 | Michael Hanselmann | |
286 | 82a1c938 | Guido Trotter | Feature changes |
287 | 82a1c938 | Guido Trotter | --------------- |
288 | 82a1c938 | Guido Trotter | |
289 | c0446a46 | Guido Trotter | Ganeti Confd |
290 | c0446a46 | Guido Trotter | ~~~~~~~~~~~~ |
291 | c0446a46 | Guido Trotter | |
292 | c0446a46 | Guido Trotter | Current State and shortcomings |
293 | c0446a46 | Guido Trotter | ++++++++++++++++++++++++++++++ |
294 | c0446a46 | Guido Trotter | In Ganeti 2.0 all nodes are equal, but some are more equal than others. In |
295 | c0446a46 | Guido Trotter | particular they are divided between "master", "master candidates" and "normal". |
296 | c0446a46 | Guido Trotter | (Moreover they can be offline or drained, but this is not important for the |
297 | c0446a46 | Guido Trotter | current discussion). In general the whole configuration is only replicated to |
298 | c0446a46 | Guido Trotter | master candidates, and some partial information is spread to all nodes via |
299 | c0446a46 | Guido Trotter | ssconf. |
300 | c0446a46 | Guido Trotter | |
301 | c0446a46 | Guido Trotter | This change was done so that the most frequent Ganeti operations didn't need to |
302 | c0446a46 | Guido Trotter | contact all nodes, and so clusters could become bigger. If we want more |
303 | c0446a46 | Guido Trotter | information to be available on all nodes, we need to add more ssconf values, |
304 | c0446a46 | Guido Trotter | which is counter-balancing the change, or to talk with the master node, which |
305 | c0446a46 | Guido Trotter | is not designed to happen now, and requires its availability. |
306 | c0446a46 | Guido Trotter | |
307 | c0446a46 | Guido Trotter | Information such as the instance->primary_node mapping will be needed on all |
308 | c0446a46 | Guido Trotter | nodes, and we also want to make sure services external to the cluster can query |
309 | c0446a46 | Guido Trotter | this information as well. This information must be available at all times, so |
310 | c0446a46 | Guido Trotter | we can't query it through RAPI, which would be a single point of failure, as |
311 | c0446a46 | Guido Trotter | it's only available on the master. |
312 | c0446a46 | Guido Trotter | |
313 | c0446a46 | Guido Trotter | |
314 | c0446a46 | Guido Trotter | Proposed changes |
315 | c0446a46 | Guido Trotter | ++++++++++++++++ |
316 | c0446a46 | Guido Trotter | |
317 | c0446a46 | Guido Trotter | In order to allow fast and highly available read-only access to some |
318 | c0446a46 | Guido Trotter | configuration values, we'll create a new ganeti-confd daemon, which will run on |
319 | c0446a46 | Guido Trotter | master candidates. This daemon will talk via UDP, and authenticate messages |
320 | a5bca3e9 | Guido Trotter | using HMAC with a cluster-wide shared key. This key will be generated at |
321 | a5bca3e9 | Guido Trotter | cluster init time, and stored on the cluster alongside the Ganeti SSL keys, |
322 | a5bca3e9 | Guido Trotter | and readable only by root. |
323 | c0446a46 | Guido Trotter | |
324 | c0446a46 | Guido Trotter | An interested client can query a value by making a request to a subset of the |
325 | c0446a46 | Guido Trotter | cluster master candidates. It will then wait to get a few responses, and use |
326 | 4a1821de | Guido Trotter | the one with the highest configuration serial number. Since the configuration |
327 | 4a1821de | Guido Trotter | serial number is increased each time the ganeti config is updated, and the |
328 | 4a1821de | Guido Trotter | serial number is included in all answers, this can be used to make sure to use |
329 | 4a1821de | Guido Trotter | the most recent answer, in case some master candidates are stale or in the |
330 | 4a1821de | Guido Trotter | middle of a configuration update. |
331 | c0446a46 | Guido Trotter | |
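On the client side, choosing among several responses reduces to taking the one with the highest serial number. The sketch below assumes each answer has already been authenticated and decoded into a dictionary with a ``serial`` key; the function name ``pick_most_recent`` is hypothetical.

```python
def pick_most_recent(answers):
    """Return the answer with the highest configuration serial, or None."""
    answers = list(answers)
    if not answers:
        return None  # no master candidate replied in time
    return max(answers, key=lambda answer: answer["serial"])
```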
332 | c0446a46 | Guido Trotter | In order to prevent replay attacks, queries will contain the current unix |
333 | c0446a46 | Guido Trotter | timestamp according to the client, and the server will verify that its |
334 | c0446a46 | Guido Trotter | timestamp is within the same 5-minute range (this requires synchronized clocks, |
335 | c0446a46 | Guido Trotter | which is a good idea anyway). Queries will also contain a "salt" which they |
336 | c0446a46 | Guido Trotter | expect the answers to be sent with, and clients are supposed to accept only |
337 | c0446a46 | Guido Trotter | answers which contain salt generated by them. |
338 | c0446a46 | Guido Trotter | |
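The server-side freshness check might look like the sketch below. It interprets the 5-minute range as a tolerance of 300 seconds in either direction, which is an assumption, since the exact window is not specified here. The function name ``timestamp_is_fresh`` and the parameter ``max_skew`` are hypothetical.

```python
import time


def timestamp_is_fresh(query_timestamp, now=None, max_skew=300):
    """Return True if the client's timestamp is close enough to our clock."""
    try:
        sent = int(query_timestamp)
    except (TypeError, ValueError):
        return False  # not a unix timestamp at all
    if now is None:
        now = time.time()
    return abs(now - sent) <= max_skew
```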
339 | c0446a46 | Guido Trotter | The configuration daemon will be able to answer simple queries such as: |
340 | a9407509 | Guido Trotter | |
341 | c0446a46 | Guido Trotter | - master candidates list |
342 | c0446a46 | Guido Trotter | - master node |
343 | c0446a46 | Guido Trotter | - offline nodes |
344 | c0446a46 | Guido Trotter | - instance list |
345 | c0446a46 | Guido Trotter | - instance primary nodes |
346 | c0446a46 | Guido Trotter | |
347 | a9407509 | Guido Trotter | Wire protocol |
348 | a9407509 | Guido Trotter | ^^^^^^^^^^^^^ |
349 | a9407509 | Guido Trotter | |
350 | a9407509 | Guido Trotter | A confd query will look like this, on the wire:: |
351 | a9407509 | Guido Trotter | |
352 | a9407509 | Guido Trotter | { |
353 | a9407509 | Guido Trotter | "msg": "{\"type\": 1, |
354 | a9407509 | Guido Trotter | \"rsalt\": \"9aa6ce92-8336-11de-af38-001d093e835f\", |
355 | a9407509 | Guido Trotter | \"protocol\": 1, |
356 | a9407509 | Guido Trotter | \"query\": \"node1.example.com\"}\n", |
357 | a9407509 | Guido Trotter | "salt": "1249637704", |
358 | a9407509 | Guido Trotter | "hmac": "4a4139b2c3c5921f7e439469a0a45ad200aead0f" |
359 | a9407509 | Guido Trotter | } |
360 | a9407509 | Guido Trotter | |
361 | a9407509 | Guido Trotter | Detailed explanation of the various fields: |
362 | a9407509 | Guido Trotter | |
363 | a9407509 | Guido Trotter | - 'msg' contains a JSON-encoded query, its fields are: |
364 | a9407509 | Guido Trotter | |
365 | a9407509 | Guido Trotter | - 'protocol', integer, is the confd protocol version (initially just |
366 | a9407509 | Guido Trotter | constants.CONFD_PROTOCOL_VERSION, with a value of 1) |
367 | a9407509 | Guido Trotter | - 'type', integer, is the query type. For example "node role by name" or |
368 | a9407509 | Guido Trotter | "node primary ip by instance ip". Constants will be provided for the actual |
369 | a9407509 | Guido Trotter | available query types. |
370 | a9407509 | Guido Trotter | - 'query', string, is the search key. For example an ip, or a node name. |
371 | a9407509 | Guido Trotter | - 'rsalt', string, is the required response salt. The client must use it to |
372 | a9407509 | Guido Trotter | recognize which answer it's getting. |
373 | a9407509 | Guido Trotter | |
374 | a9407509 | Guido Trotter | - 'salt' must be the current unix timestamp, according to the client. Servers |
375 | a9407509 | Guido Trotter | can refuse messages which have a wrong timing, according to their |
376 | a9407509 | Guido Trotter | configuration and clock. |
377 | a9407509 | Guido Trotter | - 'hmac' is an hmac signature of salt+msg, with the cluster hmac key |
378 | a9407509 | Guido Trotter | |
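A client could assemble such a query roughly as follows. This is a sketch under stated assumptions: the digest algorithm is not named in this document, and SHA-1 is assumed here only because the example signatures are 40 hexadecimal characters long. Generating ``rsalt`` as a random UUID is also an assumption. The function name ``build_query`` is hypothetical.

```python
import hashlib
import hmac
import json
import time
import uuid


def build_query(hmac_key, query_type, query, protocol=1, now=None):
    """Return (wire_payload, rsalt) for a confd-style query.

    hmac_key is the cluster-wide shared key, as bytes.
    """
    rsalt = str(uuid.uuid4())  # assumption: any unique string would do
    msg = json.dumps({
        "protocol": protocol,
        "type": query_type,
        "query": query,
        "rsalt": rsalt,
    }) + "\n"
    # 'salt' is the current unix timestamp according to the client.
    salt = str(int(time.time() if now is None else now))
    # The signature covers salt+msg. SHA-1 is an assumption (see above).
    signature = hmac.new(hmac_key, (salt + msg).encode("utf-8"),
                         hashlib.sha1).hexdigest()
    payload = json.dumps({"msg": msg, "salt": salt, "hmac": signature})
    return payload, rsalt
```

The caller keeps the returned ``rsalt`` so that it can later match answers to this query.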
379 | a9407509 | Guido Trotter | If an answer comes back (which is optional, since confd works over UDP) it will |
380 | a9407509 | Guido Trotter | be in this format:: |
381 | a9407509 | Guido Trotter | |
382 | a9407509 | Guido Trotter | { |
383 | a9407509 | Guido Trotter | "msg": "{\"status\": 0, |
384 | a9407509 | Guido Trotter | \"answer\": 0, |
385 | a9407509 | Guido Trotter | \"serial\": 42, |
386 | a9407509 | Guido Trotter | \"protocol\": 1}\n", |
387 | a9407509 | Guido Trotter | "salt": "9aa6ce92-8336-11de-af38-001d093e835f", |
388 | a9407509 | Guido Trotter | "hmac": "aaeccc0dff9328fdf7967cb600b6a80a6a9332af" |
389 | a9407509 | Guido Trotter | } |
390 | a9407509 | Guido Trotter | |
391 | a9407509 | Guido Trotter | Where: |
392 | a9407509 | Guido Trotter | |
393 | a9407509 | Guido Trotter | - 'msg' contains a JSON-encoded answer, its fields are: |
394 | a9407509 | Guido Trotter | |
395 | a9407509 | Guido Trotter | - 'protocol', integer, is the confd protocol version (initially just |
396 | a9407509 | Guido Trotter | constants.CONFD_PROTOCOL_VERSION, with a value of 1) |
397 | a9407509 | Guido Trotter | - 'status', integer, is the error code. Initially just 0 for 'ok' or 1 for |
398 | a9407509 | Guido Trotter | 'error' (in which case answer contains an error detail, rather than an |
399 | a9407509 | Guido Trotter | answer), but in the future it may be expanded to have more meanings (e.g. 2, |
400 | a9407509 | Guido Trotter | the answer is compressed) |
401 | a9407509 | Guido Trotter | - 'answer', is the actual answer. Its type and meaning are query specific. For |
402 | a9407509 | Guido Trotter | example for "node primary ip by instance ip" queries it will be a string |
403 | a9407509 | Guido Trotter | containing an IP address, for "node role by name" queries it will be an |
404 | a9407509 | Guido Trotter | integer which encodes the role (master, candidate, drained, offline) |
405 | a9407509 | Guido Trotter | according to constants. |
406 | a9407509 | Guido Trotter | |
407 | a9407509 | Guido Trotter | - 'salt' is the requested salt from the query. A client can use it to recognize |
408 | a9407509 | Guido Trotter | what query the answer is answering. |
409 | a9407509 | Guido Trotter | - 'hmac' is an hmac signature of salt+msg, with the cluster hmac key |
410 | a9407509 | Guido Trotter | |
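A minimal sketch of how a client might check an answer of this shape. Several things here are assumptions rather than part of the design: HMAC-SHA1 (guessed only from the 40-hex-digit example above), UTF-8 encoding of salt+msg, and the helper names ``sign_msg`` and ``verify_answer``, which are hypothetical.

```python
import hashlib
import hmac
import json


def sign_msg(key: bytes, salt: str, msg: str) -> str:
    """Hex HMAC over salt+msg. SHA-1 is an assumption, not taken from the design."""
    return hmac.new(key, (salt + msg).encode("utf-8"), hashlib.sha1).hexdigest()


def verify_answer(key: bytes, envelope: str, expected_salt: str) -> dict:
    """Check salt and hmac of an answer envelope, then decode the inner 'msg'."""
    outer = json.loads(envelope)
    # The salt ties the answer to the query the client actually sent.
    if outer["salt"] != expected_salt:
        raise ValueError("answer does not match the query salt")
    expected = sign_msg(key, outer["salt"], outer["msg"])
    # Constant-time comparison avoids leaking how many digits matched.
    if not hmac.compare_digest(expected, outer["hmac"]):
        raise ValueError("bad hmac signature")
    return json.loads(outer["msg"])
```

The client only decodes 'msg' after both checks pass, so an unsigned or replayed answer is never interpreted.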
411 | c0446a46 | Guido Trotter | |
412 | d1268971 | Guido Trotter | Redistribute Config |
413 | d1268971 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~ |
414 | d1268971 | Guido Trotter | |
415 | d1268971 | Guido Trotter | Current State and shortcomings |
416 | d1268971 | Guido Trotter | ++++++++++++++++++++++++++++++ |
417 | d1268971 | Guido Trotter | Currently LURedistributeConfig triggers a copy of the updated configuration |
418 | d1268971 | Guido Trotter | file to all master candidates and of the ssconf files to all nodes. There are |
419 | d1268971 | Guido Trotter | other files which are maintained manually but which are important to keep in |
420 | d1268971 | Guido Trotter | sync. These are: |
421 | d1268971 | Guido Trotter | |
422 | d1268971 | Guido Trotter | - rapi SSL key certificate file (rapi.pem) (on master candidates) |
423 | d1268971 | Guido Trotter | - rapi user/password file rapi_users (on master candidates) |
424 | d1268971 | Guido Trotter | |
425 | d1268971 | Guido Trotter | Furthermore there are some files which are hypervisor specific but we may want |
426 | d1268971 | Guido Trotter | to keep in sync: |
427 | d1268971 | Guido Trotter | |
428 | d1268971 | Guido Trotter | - the xen-hvm hypervisor uses one shared file for all vnc passwords, and copies |
429 | d1268971 | Guido Trotter | the file once, during node add. This design is subject to revision to be able |
430 | d1268971 | Guido Trotter | to have different passwords for different groups of instances via the use of |
431 | d1268971 | Guido Trotter | hypervisor parameters, and to allow xen-hvm and kvm to use the same system to |
432 | d1268971 | Guido Trotter | provide password-protected vnc sessions. In general, though, it would be |
433 | d1268971 | Guido Trotter | useful if the vnc password files were copied as well, to avoid unwanted vnc |
434 | d1268971 | Guido Trotter | password changes on instance failover/migrate. |
435 | d1268971 | Guido Trotter | |
436 | d1268971 | Guido Trotter | Optionally the admin may want to also ship files such as the global xend.conf |
437 | d1268971 | Guido Trotter | file, and the network scripts to all nodes. |
438 | d1268971 | Guido Trotter | |
439 | d1268971 | Guido Trotter | Proposed changes |
440 | d1268971 | Guido Trotter | ++++++++++++++++ |
441 | d1268971 | Guido Trotter | |
442 | d1268971 | Guido Trotter | RedistributeConfig will be changed to also copy the rapi files, and to call |
443 | 479a8cb8 | Luca Bigliardi | every enabled hypervisor asking for a list of additional files to copy. Users |
444 | 479a8cb8 | Luca Bigliardi | will have the possibility to populate a file containing a list of files to be |
445 | 479a8cb8 | Luca Bigliardi | distributed; this file will be propagated as well. Such a solution is simple |
446 | 479a8cb8 | Luca Bigliardi | to implement and is easily usable by scripts. |
447 | d1268971 | Guido Trotter | |
448 | d1268971 | Guido Trotter | This code will also be shared (via tasklets or by other means, if tasklets are |
449 | d1268971 | Guido Trotter | not ready for 2.1) with the AddNode and SetNodeParams LUs (so that the relevant |
450 | d1268971 | Guido Trotter | files will be automatically shipped to new master candidates as they are set). |
451 | d1268971 | Guido Trotter | |
452 | 5b18ff3b | Guido Trotter | VNC Console Password |
453 | 5b18ff3b | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~ |
454 | 5b18ff3b | Guido Trotter | |
455 | 5b18ff3b | Guido Trotter | Current State and shortcomings |
456 | 5b18ff3b | Guido Trotter | ++++++++++++++++++++++++++++++ |
457 | 5b18ff3b | Guido Trotter | |
458 | 5b18ff3b | Guido Trotter | Currently just the xen-hvm hypervisor supports setting a password to connect |
459 | 5b18ff3b | Guido Trotter | to the instances' VNC console, and has one common password stored in a file. |
460 | 5b18ff3b | Guido Trotter | |
461 | 5b18ff3b | Guido Trotter | This doesn't allow different passwords for different instances/groups of |
462 | 5b18ff3b | Guido Trotter | instances, and makes it necessary to remember to copy the file around the |
463 | 5b18ff3b | Guido Trotter | cluster when the password changes. |
464 | 5b18ff3b | Guido Trotter | |
465 | 5b18ff3b | Guido Trotter | Proposed changes |
466 | 5b18ff3b | Guido Trotter | ++++++++++++++++ |
467 | 5b18ff3b | Guido Trotter | |
468 | 5b18ff3b | Guido Trotter | We'll change the VNC password file to a vnc_password_file hypervisor parameter. |
469 | 5b18ff3b | Guido Trotter | This way it can have a cluster default, but also a different value for each |
470 | 5b18ff3b | Guido Trotter | instance. The VNC enabled hypervisors (xen and kvm) will publish all the |
471 | 5b18ff3b | Guido Trotter | password files in use through the cluster so that a redistribute-config will |
472 | 5b18ff3b | Guido Trotter | ship them to all nodes (see the Redistribute Config proposed changes above). |
473 | 5b18ff3b | Guido Trotter | |
474 | 5b18ff3b | Guido Trotter | The current VNC_PASSWORD_FILE constant will be removed, but its value will be |
475 | 5b18ff3b | Guido Trotter | used as the default HV_VNC_PASSWORD_FILE value, thus retaining backwards |
476 | 5b18ff3b | Guido Trotter | compatibility with 2.0. |
477 | 5b18ff3b | Guido Trotter | |
478 | 5b18ff3b | Guido Trotter | The code to export the list of VNC password files from the hypervisors to |
479 | 5b18ff3b | Guido Trotter | RedistributeConfig will be shared between the KVM and xen-hvm hypervisors. |
480 | 5b18ff3b | Guido Trotter | |
481 | 76bb661b | Guido Trotter | Disk/Net parameters |
482 | 76bb661b | Guido Trotter | ~~~~~~~~~~~~~~~~~~~ |
483 | 76bb661b | Guido Trotter | |
484 | 76bb661b | Guido Trotter | Current State and shortcomings |
485 | 76bb661b | Guido Trotter | ++++++++++++++++++++++++++++++ |
486 | 76bb661b | Guido Trotter | |
487 | 76bb661b | Guido Trotter | Currently disks and network interfaces have a few tweakable options and all the |
488 | 76bb661b | Guido Trotter | rest is left to a default we chose. We're finding that we need more and more to |
489 | 76bb661b | Guido Trotter | tweak some of these parameters, for example to disable barriers for DRBD |
490 | 76bb661b | Guido Trotter | devices, or allow striping for the LVM volumes. |
491 | 76bb661b | Guido Trotter | |
492 | 76bb661b | Guido Trotter | Moreover for many of these parameters it would be nice to have cluster-wide |
493 | 76bb661b | Guido Trotter | defaults, and then be able to change them per disk/interface. |
494 | 76bb661b | Guido Trotter | |
495 | 76bb661b | Guido Trotter | Proposed changes |
496 | 76bb661b | Guido Trotter | ++++++++++++++++ |
497 | 76bb661b | Guido Trotter | |
498 | 76bb661b | Guido Trotter | We will add new cluster level diskparams and netparams, which will contain all |
499 | 76bb661b | Guido Trotter | the tweakable parameters. All values which have a sensible cluster-wide default |
500 | 76bb661b | Guido Trotter | will go into this new structure while parameters which have unique values will not. |
501 | 76bb661b | Guido Trotter | |
502 | 76bb661b | Guido Trotter | Example of network parameters: |
503 | 76bb661b | Guido Trotter | - mode: bridge/route |
504 | 76bb661b | Guido Trotter | - link: for mode "bridge" the bridge to connect to, for mode route it can |
505 | 76bb661b | Guido Trotter | contain the routing table, or the destination interface |
506 | 76bb661b | Guido Trotter | |
507 | 76bb661b | Guido Trotter | Example of disk parameters: |
508 | 76bb661b | Guido Trotter | - stripe: lvm stripes |
509 | 76bb661b | Guido Trotter | - stripe_size: lvm stripe size |
510 | 76bb661b | Guido Trotter | - meta_flushes: drbd, enable/disable metadata "barriers" |
511 | 76bb661b | Guido Trotter | - data_flushes: drbd, enable/disable data "barriers" |
512 | 76bb661b | Guido Trotter | |
513 | 76bb661b | Guido Trotter | Some parameters are bound to be disk-type specific (drbd, vs lvm, vs files) or |
514 | 76bb661b | Guido Trotter | hypervisor specific (nic models for example), but for now they will all live in |
515 | 76bb661b | Guido Trotter | the same structure. Each component is supposed to validate only the parameters |
516 | 76bb661b | Guido Trotter | it knows about, and ganeti itself will make sure that no "globally unknown" |
517 | 76bb661b | Guido Trotter | parameters are added, and that no parameters have overridden meanings for |
518 | 76bb661b | Guido Trotter | different components. |
519 | 76bb661b | Guido Trotter | |
520 | 76bb661b | Guido Trotter | The parameters will be kept, as for the BEPARAMS, in a "default" category, |
521 | 76bb661b | Guido Trotter | which we will be able to expand on by creating instance "classes" in the future. |
522 | 76bb661b | Guido Trotter | Instance classes are not a feature we plan to implement in 2.1, though. |
523 | 76bb661b | Guido Trotter | |
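The layering described above (cluster-wide defaults, per-disk overrides, each component validating only what it knows) could be sketched roughly as follows. The ``PARAM_OWNERS`` registry and both function names are hypothetical; only the parameter names come from the examples in this section.

```python
# Hypothetical registry: which component is responsible for which parameter.
PARAM_OWNERS = {
    "stripe": "lvm",
    "stripe_size": "lvm",
    "meta_flushes": "drbd",
    "data_flushes": "drbd",
}


def fill_disk_params(cluster_defaults: dict, overrides: dict) -> dict:
    """Overlay per-disk overrides on the cluster-wide "default" values."""
    # Reject "globally unknown" parameters before merging anything.
    unknown = (set(cluster_defaults) | set(overrides)) - set(PARAM_OWNERS)
    if unknown:
        raise ValueError("globally unknown parameters: %s" % sorted(unknown))
    filled = dict(cluster_defaults)
    filled.update(overrides)
    return filled


def params_for(component: str, filled: dict) -> dict:
    """Select the parameters a given component (e.g. "drbd") should validate."""
    return {k: v for k, v in filled.items() if PARAM_OWNERS[k] == component}
```

Because each parameter has exactly one owner in the registry, no parameter can carry a different meaning for different components.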
524 | bff04b1b | Guido Trotter | Non bridged instances support |
525 | bff04b1b | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
526 | bff04b1b | Guido Trotter | |
527 | bff04b1b | Guido Trotter | Current State and shortcomings |
528 | bff04b1b | Guido Trotter | ++++++++++++++++++++++++++++++ |
529 | bff04b1b | Guido Trotter | |
530 | bff04b1b | Guido Trotter | Currently each instance NIC must be connected to a bridge, and if the bridge is |
531 | bff04b1b | Guido Trotter | not specified the default cluster one is used. This makes it impossible to use |
532 | bff04b1b | Guido Trotter | the vif-route xen network scripts, or other alternative mechanisms that don't |
533 | bff04b1b | Guido Trotter | need a bridge to work. |
534 | bff04b1b | Guido Trotter | |
535 | bff04b1b | Guido Trotter | Proposed changes |
536 | bff04b1b | Guido Trotter | ++++++++++++++++ |
537 | bff04b1b | Guido Trotter | |
538 | bff04b1b | Guido Trotter | The new "mode" network parameter will distinguish between bridged interfaces |
539 | bff04b1b | Guido Trotter | and routed ones. |
540 | bff04b1b | Guido Trotter | |
541 | bff04b1b | Guido Trotter | When mode is "bridge" the "link" parameter will contain the bridge the instance |
542 | bff04b1b | Guido Trotter | should be connected to, effectively making things as today. The value has been |
543 | bff04b1b | Guido Trotter | migrated from a nic field to a parameter to allow for an easier manipulation of |
544 | bff04b1b | Guido Trotter | the cluster default. |
545 | bff04b1b | Guido Trotter | |
546 | bff04b1b | Guido Trotter | When mode is "route" the ip field of the interface will become mandatory, to |
547 | bff04b1b | Guido Trotter | allow for a route to be set. In the future we may want also to accept multiple |
548 | bff04b1b | Guido Trotter | IPs or IP/mask values for this purpose. We will evaluate possible meanings of |
549 | bff04b1b | Guido Trotter | the link parameter to signify a routing table to be used, which would allow for |
550 | bff04b1b | Guido Trotter | isolation between instance groups (as today happens for different bridges). |
551 | bff04b1b | Guido Trotter | |
552 | bff04b1b | Guido Trotter | For now we won't add a parameter to specify which network script gets called |
553 | bff04b1b | Guido Trotter | for which instance, so in a mixed cluster the network script must be able to |
554 | bff04b1b | Guido Trotter | handle both cases. The default kvm vif script will be changed to do so. (Xen |
555 | bff04b1b | Guido Trotter | doesn't have a ganeti provided script, so nothing will be done for that |
556 | bff04b1b | Guido Trotter | hypervisor) |
557 | 76bb661b | Guido Trotter | |
558 | 0f828357 | Iustin Pop | Introducing persistent UUIDs |
559 | 0f828357 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
560 | 0f828357 | Iustin Pop | |
561 | 0f828357 | Iustin Pop | Current state and shortcomings |
562 | 0f828357 | Iustin Pop | ++++++++++++++++++++++++++++++ |
563 | 0f828357 | Iustin Pop | |
564 | 0f828357 | Iustin Pop | Some objects in the Ganeti configuration are tracked by their name |
565 | 0f828357 | Iustin Pop | while also supporting renames. This creates an extra difficulty, |
566 | 0f828357 | Iustin Pop | because neither Ganeti nor external management tools can then track |
567 | 0f828357 | Iustin Pop | the actual entity, and due to the name change it behaves like a new |
568 | 0f828357 | Iustin Pop | one. |
569 | 0f828357 | Iustin Pop | |
570 | 0f828357 | Iustin Pop | Proposed changes part 1 |
571 | 0f828357 | Iustin Pop | +++++++++++++++++++++++ |
572 | 0f828357 | Iustin Pop | |
573 | 0f828357 | Iustin Pop | We will change Ganeti to use UUIDs for entity tracking, but in a |
574 | 0f828357 | Iustin Pop | staggered way. In 2.1, we will simply add an “uuid” attribute to each |
575 | 0f828357 | Iustin Pop | of the instances, nodes and the cluster itself. This will be reported on |
576 | 0f828357 | Iustin Pop | instance creation for instances, and on node adds for the nodes. It will |
577 | 0f828357 | Iustin Pop | of course be available for querying via the OpQueryNodes/Instance and |
578 | 0f828357 | Iustin Pop | cluster information, and via RAPI as well. |
579 | 0f828357 | Iustin Pop | |
580 | 0f828357 | Iustin Pop | Note that Ganeti will not provide any way to change this attribute. |
581 | 0f828357 | Iustin Pop | |
582 | 0f828357 | Iustin Pop | Upgrading from Ganeti 2.0 will automatically add an “uuid” attribute |
583 | 0f828357 | Iustin Pop | to all entities missing it. |
584 | 0f828357 | Iustin Pop | |
585 | 0f828357 | Iustin Pop | |
586 | 0f828357 | Iustin Pop | Proposed changes part 2 |
587 | 0f828357 | Iustin Pop | +++++++++++++++++++++++ |
588 | 0f828357 | Iustin Pop | |
589 | 0f828357 | Iustin Pop | In the next release (e.g. 2.2), the tracking of objects will change |
590 | 0f828357 | Iustin Pop | from the name to the UUID internally, and externally Ganeti will |
591 | 0f828357 | Iustin Pop | accept both forms of identification; e.g. an RAPI call would be made |
592 | 0f828357 | Iustin Pop | either against ``/2/instances/foo.bar`` or against |
593 | 0f828357 | Iustin Pop | ``/2/instances/bb3b2e42…``. Since an FQDN must have at least a dot, |
594 | 0f828357 | Iustin Pop | and dots are not valid characters in UUIDs, we will not have namespace |
595 | 0f828357 | Iustin Pop | issues. |
596 | 0f828357 | Iustin Pop | |
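The disambiguation rule above can be illustrated with a small sketch. The function name ``identifier_kind`` is hypothetical, and the use of the standard textual UUID form (as parsed by Python's ``uuid`` module) is an assumption; the design only states that names contain a dot and UUIDs do not.

```python
import uuid


def identifier_kind(identifier: str) -> str:
    """Return "name" for an FQDN-style identifier, "uuid" otherwise."""
    # An FQDN has at least one dot; a UUID never contains one.
    if "." in identifier:
        return "name"
    # Raises ValueError if the string is not a well-formed UUID.
    uuid.UUID(identifier)
    return "uuid"
```

A request handler could call this once on the path component and then look the object up in the matching index.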
597 | 0f828357 | Iustin Pop | Another change here is that node identification (during cluster |
598 | 0f828357 | Iustin Pop | operations/queries like master startup, “am I the master?” and |
599 | 0f828357 | Iustin Pop | similar) could be done via UUIDs which is more stable than the current |
600 | 0f828357 | Iustin Pop | hostname-based scheme. |
601 | 0f828357 | Iustin Pop | |
602 | 0f828357 | Iustin Pop | Internal tracking refers to the way the configuration is stored; a |
603 | 0f828357 | Iustin Pop | DRBD disk of an instance refers to the node name (so that IPs can be |
604 | 0f828357 | Iustin Pop | changed easily), but this is still a problem for name changes; thus |
605 | 0f828357 | Iustin Pop | these will be changed to point to the node UUID to ease renames. |
606 | 0f828357 | Iustin Pop | |
607 | 0f828357 | Iustin Pop | The advantage of this change (after the second round of changes) is |
608 | 0f828357 | Iustin Pop | that node rename becomes trivial, whereas today node rename would |
609 | 0f828357 | Iustin Pop | require a complete lock of all instances. |
610 | 0f828357 | Iustin Pop | |
611 | 395aa879 | Michael Hanselmann | |
612 | 395aa879 | Michael Hanselmann | Automated disk repairs infrastructure |
613 | 395aa879 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
614 | 395aa879 | Michael Hanselmann | |
615 | 395aa879 | Michael Hanselmann | Replacing defective disks in an automated fashion is quite difficult with the |
616 | 395aa879 | Michael Hanselmann | current version of Ganeti. These changes will introduce additional |
617 | 395aa879 | Michael Hanselmann | functionality and interfaces to simplify automating disk replacements on a |
618 | 395aa879 | Michael Hanselmann | Ganeti node. |
619 | 395aa879 | Michael Hanselmann | |
620 | 395aa879 | Michael Hanselmann | Fix node volume group |
621 | 395aa879 | Michael Hanselmann | +++++++++++++++++++++ |
622 | 395aa879 | Michael Hanselmann | |
623 | 395aa879 | Michael Hanselmann | This is the most difficult addition, as it can lead to data loss if it's not |
624 | 395aa879 | Michael Hanselmann | properly safeguarded. |
625 | 395aa879 | Michael Hanselmann | |
626 | 395aa879 | Michael Hanselmann | The operation must be done only when all the other nodes that have instances in |
627 | 395aa879 | Michael Hanselmann | common with the target node are fine, i.e. this is the only node with problems, |
628 | 395aa879 | Michael Hanselmann | and also we have to double-check that all instances on this node have at least |
629 | 395aa879 | Michael Hanselmann | one good copy of the data. |
630 | 395aa879 | Michael Hanselmann | |
631 | 395aa879 | Michael Hanselmann | This might mean that we have to enhance the GetMirrorStatus calls, and |
632 | 395aa879 | Michael Hanselmann | introduce a smarter version that can tell us more about the status of an |
633 | 395aa879 | Michael Hanselmann | instance. |
634 | 395aa879 | Michael Hanselmann | |
635 | 395aa879 | Michael Hanselmann | Stop allocation on a given PV |
636 | 395aa879 | Michael Hanselmann | +++++++++++++++++++++++++++++ |
637 | 395aa879 | Michael Hanselmann | |
638 | 395aa879 | Michael Hanselmann | This is somewhat simple. First we need a "list PVs" opcode (and its associated |
639 | 395aa879 | Michael Hanselmann | logical unit) and then a set PV status opcode/LU. These in combination should |
640 | 395aa879 | Michael Hanselmann | allow both checking and changing the disk/PV status. |
641 | 395aa879 | Michael Hanselmann | |
642 | 395aa879 | Michael Hanselmann | Instance disk status |
643 | 395aa879 | Michael Hanselmann | ++++++++++++++++++++ |
644 | 395aa879 | Michael Hanselmann | |
645 | 395aa879 | Michael Hanselmann | This new opcode or opcode change must list the instance-disk-index and node |
646 | 395aa879 | Michael Hanselmann | combinations of the instance together with their status. This will allow |
647 | 395aa879 | Michael Hanselmann | determining what part of the instance is broken (if any). |
648 | 395aa879 | Michael Hanselmann | |
649 | 395aa879 | Michael Hanselmann | Repair instance |
650 | 395aa879 | Michael Hanselmann | +++++++++++++++ |
651 | 395aa879 | Michael Hanselmann | |
652 | 395aa879 | Michael Hanselmann | This new opcode/LU/RAPI call will run ``replace-disks -p`` as needed, in order |
653 | 395aa879 | Michael Hanselmann | to fix the instance status. It only affects primary instances; secondaries can |
654 | 395aa879 | Michael Hanselmann | just be moved away. |
655 | 395aa879 | Michael Hanselmann | |
656 | 395aa879 | Michael Hanselmann | Migrate node |
657 | 395aa879 | Michael Hanselmann | ++++++++++++ |
658 | 395aa879 | Michael Hanselmann | |
659 | 395aa879 | Michael Hanselmann | This new opcode/LU/RAPI call will take over the current ``gnt-node migrate`` |
660 | 395aa879 | Michael Hanselmann | code and run migrate for all instances on the node. |
661 | 395aa879 | Michael Hanselmann | |
662 | 395aa879 | Michael Hanselmann | Evacuate node |
663 | 395aa879 | Michael Hanselmann | ++++++++++++++ |
664 | 395aa879 | Michael Hanselmann | |
665 | 395aa879 | Michael Hanselmann | This new opcode/LU/RAPI call will take over the current ``gnt-node evacuate`` |
666 | 395aa879 | Michael Hanselmann | code and run replace-secondary with an iallocator script for all instances on |
667 | 395aa879 | Michael Hanselmann | the node. |
668 | 395aa879 | Michael Hanselmann | |
669 | 395aa879 | Michael Hanselmann | |
670 | 82a1c938 | Guido Trotter | External interface changes |
671 | 82a1c938 | Guido Trotter | -------------------------- |
672 | 82a1c938 | Guido Trotter | |
673 | b6cc971c | Guido Trotter | OS API |
674 | b6cc971c | Guido Trotter | ~~~~~~ |
675 | b6cc971c | Guido Trotter | |
676 | b6cc971c | Guido Trotter | The OS API of Ganeti 2.0 has been built with extensibility in mind. Since we |
677 | b6cc971c | Guido Trotter | pass everything as environment variables it's a lot easier to send new |
678 | b6cc971c | Guido Trotter | information to the OSes without breaking retrocompatibility. This section of |
679 | b6cc971c | Guido Trotter | the design outlines the proposed extensions to the API and their |
680 | b6cc971c | Guido Trotter | implementation. |
681 | b6cc971c | Guido Trotter | |
682 | b6cc971c | Guido Trotter | API Version Compatibility Handling |
683 | b6cc971c | Guido Trotter | ++++++++++++++++++++++++++++++++++ |
684 | b6cc971c | Guido Trotter | |
685 | b6cc971c | Guido Trotter | In 2.1 there will be a new OS API version (e.g. 15), which should be mostly |
686 | b6cc971c | Guido Trotter | compatible with api 10, except for some newly added variables. Since it's easy |
687 | b6cc971c | Guido Trotter | not to pass some variables we'll be able to handle Ganeti 2.0 OSes by just |
688 | b6cc971c | Guido Trotter | filtering out the newly added pieces of information. We will still encourage |
689 | b6cc971c | Guido Trotter | OSes to declare support for the new API after checking that the new variables |
690 | b6cc971c | Guido Trotter | don't cause any conflict for them, and we will drop api 10 support after |
691 | b6cc971c | Guido Trotter | Ganeti 2.1 has been released. |
692 | b6cc971c | Guido Trotter | |
693 | b6cc971c | Guido Trotter | New Environment variables |
694 | b6cc971c | Guido Trotter | +++++++++++++++++++++++++ |
695 | b6cc971c | Guido Trotter | |
696 | b6cc971c | Guido Trotter | Some variables have never been added to the OS api but would definitely be |
697 | b6cc971c | Guido Trotter | useful for the OSes. We plan to add an INSTANCE_HYPERVISOR variable to allow |
698 | b6cc971c | Guido Trotter | the OS to make changes relevant to the virtualization the instance is going to |
699 | b6cc971c | Guido Trotter | use. Since this field is immutable for each instance, the OS can tailor the |
700 | b6cc971c | Guido Trotter | install without having to make sure the instance can run under any |
701 | b6cc971c | Guido Trotter | virtualization technology. |
702 | b6cc971c | Guido Trotter | |
703 | b6cc971c | Guido Trotter | We also want the OS to know the particular hypervisor parameters, to be able to |
704 | b6cc971c | Guido Trotter | customize the install even more. Since the parameters can change, though, we |
705 | b6cc971c | Guido Trotter | will pass them only as an "FYI": if an OS ties some instance functionality to |
706 | b6cc971c | Guido Trotter | the value of a particular hypervisor parameter, manual changes or a reinstall |
707 | b6cc971c | Guido Trotter | may be needed to adapt the instance to the new environment. This is not a |
708 | b6cc971c | Guido Trotter | regression as of today, because even if the OSes are left blind about this |
709 | b6cc971c | Guido Trotter | information, sometimes they still need to make compromises and cannot satisfy |
710 | b6cc971c | Guido Trotter | all possible parameter values. |
711 | b6cc971c | Guido Trotter | |
712 | 4dfac6af | Guido Trotter | OS Variants |
713 | 00b66530 | Guido Trotter | +++++++++++ |
714 | b6cc971c | Guido Trotter | |
715 | b6cc971c | Guido Trotter | Currently we are witnessing some degree of "os proliferation" just to change |
716 | b6cc971c | Guido Trotter | a simple installation behavior. This means that the same OS gets installed on |
717 | b6cc971c | Guido Trotter | the cluster multiple times, with different names, to customize just one |
718 | b6cc971c | Guido Trotter | installation behavior. Usually such OSes try to share as much as possible |
719 | b6cc971c | Guido Trotter | through symlinks, but this still causes complications on the user side, |
720 | b6cc971c | Guido Trotter | especially when multiple parameters must be cross-matched. |
721 | b6cc971c | Guido Trotter | |
722 | b6cc971c | Guido Trotter | For example today if you want to install debian etch, lenny or squeeze you |
723 | b6cc971c | Guido Trotter | probably need to install the debootstrap OS multiple times, changing its |
724 | b6cc971c | Guido Trotter | configuration file, and calling it debootstrap-etch, debootstrap-lenny or |
725 | b6cc971c | Guido Trotter | debootstrap-squeeze. Furthermore if you have for example a "server" and a |
726 | b6cc971c | Guido Trotter | "development" environment which installs different packages/configuration files |
727 | b6cc971c | Guido Trotter | and must be available for all installs you'll probably end up with |
728 | b6cc971c | Guido Trotter | debootstrap-etch-server, debootstrap-etch-dev, debootstrap-lenny-server, |
729 | b6cc971c | Guido Trotter | debootstrap-lenny-dev, etc. Crossing more than two parameters quickly becomes |
730 | b6cc971c | Guido Trotter | unmanageable. |
731 | b6cc971c | Guido Trotter | |
732 | 00b66530 | Guido Trotter | In order to avoid this we plan to make OSes more customizable, by allowing each |
733 | 4dfac6af | Guido Trotter | OS to declare a list of variants which can be used to customize it. The |
734 | 4dfac6af | Guido Trotter | variants list is mandatory and must be written, one variant per line, in the |
735 | 4dfac6af | Guido Trotter | new "variants.list" file inside the main os dir. At least one variant |
736 | 4dfac6af | Guido Trotter | must be supported. When choosing the OS exactly one variant will have to be |
737 | 4dfac6af | Guido Trotter | specified, and will be encoded in the os name as <OS-name>+<variant>. As |
738 | 00b66530 | Guido Trotter | today, it will be possible to change an instance's OS at creation or install |
739 | 00b66530 | Guido Trotter | time. |
740 | 00b66530 | Guido Trotter | |
741 | 00b66530 | Guido Trotter | The 2.1 OS list will be the combination of each OS, plus its supported |
742 | 4dfac6af | Guido Trotter | variants. This will cause the name proliferation to remain, but at least |
743 | 4dfac6af | Guido Trotter | the internal OS code will be simplified to just parsing the passed variant, |
744 | 00b66530 | Guido Trotter | without the need for symlinks or code duplication. |
745 | 00b66530 | Guido Trotter | |
746 | 4dfac6af | Guido Trotter | Also we expect the OSes to declare only "interesting" variants, but to accept |
747 | 00b66530 | Guido Trotter | some non-declared ones which a user will be able to pass in by overriding the |
748 | 00b66530 | Guido Trotter | checks ganeti does. This will be useful for allowing some variations to be used |
749 | 00b66530 | Guido Trotter | without polluting the OS list (per-OS documentation should list all supported |
750 | 4dfac6af | Guido Trotter | variants). If a variant which is not internally supported is forced through, |
751 | 00b66530 | Guido Trotter | the OS scripts should abort. |
752 | 00b66530 | Guido Trotter | |
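As a rough illustration of the naming scheme above, the following sketch parses a "variants.list" file and splits an ``<OS-name>+<variant>`` string. The function names and the ``force`` flag (standing in for the user overriding Ganeti's checks) are hypothetical, and skipping blank lines in the file is an assumption.

```python
def parse_variants_list(text: str) -> list:
    """Parse the content of a "variants.list" file: one variant per line."""
    variants = [line.strip() for line in text.splitlines() if line.strip()]
    if not variants:
        raise ValueError("an OS must support at least one variant")
    return variants


def split_os_name(full_name: str, declared: list, force: bool = False) -> tuple:
    """Split "<OS-name>+<variant>" and check the variant against the declared list."""
    os_name, sep, variant = full_name.partition("+")
    if not sep or not os_name or not variant:
        raise ValueError("expected <OS-name>+<variant>, got %r" % full_name)
    # Non-declared variants are only let through when the check is overridden;
    # the OS scripts are then expected to abort if they cannot handle it.
    if variant not in declared and not force:
        raise ValueError("variant %r is not declared by %s" % (variant, os_name))
    return os_name, variant
```

For example, with a list declaring ``etch`` and ``lenny``, the name ``debootstrap+lenny`` is accepted as is, while ``debootstrap+squeeze`` is accepted only with the override.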
753 | 4dfac6af | Guido Trotter | In the future (post 2.1) we may want to move to full-fledged parameters, all |
754 | 4dfac6af | Guido Trotter | orthogonal to each other (for example "architecture" (i386, amd64), "suite" |
755 | 4dfac6af | Guido Trotter | (lenny, squeeze, ...), etc.), as opposed to the variant, which is a single |
756 | 4dfac6af | Guido Trotter | parameter, so that you need a different variant for each combination you |
757 | 4dfac6af | Guido Trotter | want to support. In this case we envision the variants to be moved inside of |
758 | 4dfac6af | Guido Trotter | Ganeti and be associated with lists of parameter->value associations, which will |
759 | 4dfac6af | Guido Trotter | then be passed to the OS. |
760 | 4dfac6af | Guido Trotter | |
761 | b6cc971c | Guido Trotter | |
762 | 3bd3d643 | Iustin Pop | IAllocator changes |
763 | 3bd3d643 | Iustin Pop | ~~~~~~~~~~~~~~~~~~ |
764 | 3bd3d643 | Iustin Pop | |
765 | 3bd3d643 | Iustin Pop | Current State and shortcomings |
766 | 3bd3d643 | Iustin Pop | ++++++++++++++++++++++++++++++ |
767 | 3bd3d643 | Iustin Pop | |
768 | 3bd3d643 | Iustin Pop | The iallocator interface allows creation of instances without manually |
769 | 3bd3d643 | Iustin Pop | specifying nodes, but instead by specifying plugins which will do the |
770 | 3bd3d643 | Iustin Pop | required computations and produce a valid node list. |
771 | 3bd3d643 | Iustin Pop | |
772 | 3bd3d643 | Iustin Pop | However, the interface is quite awkward to use: |
773 | 3bd3d643 | Iustin Pop | |
774 | 3bd3d643 | Iustin Pop | - one cannot set a 'default' iallocator script |
775 | 3bd3d643 | Iustin Pop | - one cannot use it to easily test if allocation would succeed |
776 | 3bd3d643 | Iustin Pop | - some new functionality, such as rebalancing clusters and calculating |
777 | 3bd3d643 | Iustin Pop | capacity estimates, is needed |
778 | 3bd3d643 | Iustin Pop | |
779 | 3bd3d643 | Iustin Pop | Proposed changes |
780 | 3bd3d643 | Iustin Pop | ++++++++++++++++ |
781 | 3bd3d643 | Iustin Pop | |
782 | 3bd3d643 | Iustin Pop | There are two areas of improvement proposed: |
783 | 3bd3d643 | Iustin Pop | |
784 | 3bd3d643 | Iustin Pop | - improving the use of the current interface |
785 | 3bd3d643 | Iustin Pop | - extending the IAllocator API to cover more automation |
786 | 3bd3d643 | Iustin Pop | |
787 | 3bd3d643 | Iustin Pop | |
788 | 3bd3d643 | Iustin Pop | Default iallocator names |
789 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^^^^^^^^ |
790 | 3bd3d643 | Iustin Pop | |
791 | 3bd3d643 | Iustin Pop | The cluster will hold, for each type of iallocator, a (possibly empty) |
792 | 3bd3d643 | Iustin Pop | list of modules that will be used automatically. |
793 | 3bd3d643 | Iustin Pop | |
794 | 3bd3d643 | Iustin Pop | If the list is empty, the behaviour will remain the same. |
795 | 3bd3d643 | Iustin Pop | |
796 | 3bd3d643 | Iustin Pop | If the list has one entry, then ganeti will behave as if |
797 | 3bd3d643 | Iustin Pop | '--iallocator' was specified on the command line. I.e. use this |
798 | 3bd3d643 | Iustin Pop | allocator by default. If the user however passed nodes, those will be |
799 | 3bd3d643 | Iustin Pop | used in preference. |
800 | 3bd3d643 | Iustin Pop | |
801 | 3bd3d643 | Iustin Pop | If the list has multiple entries, they will be tried in order until |
802 | 3bd3d643 | Iustin Pop | one gives a successful answer. |
803 | 3bd3d643 | Iustin Pop | |
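The selection rules above could look roughly like this. The function name ``choose_nodes`` and the ``run_allocator`` callback (which is assumed to return a node list on success and ``None`` on failure) are hypothetical and only illustrate the order of precedence.

```python
def choose_nodes(default_allocators, run_allocator, explicit_nodes=None):
    """Pick nodes: explicit nodes first, then each default allocator in order.

    Returns None when nothing applies, i.e. the caller keeps today's behaviour.
    """
    # Nodes passed by the user are used in preference to any allocator.
    if explicit_nodes:
        return list(explicit_nodes)
    # Try the cluster's default allocators until one gives a successful answer.
    for name in default_allocators:
        nodes = run_allocator(name)
        if nodes:
            return nodes
    return None
```

With an empty default list and no explicit nodes the function returns ``None``, which corresponds to the unchanged behaviour described for that case.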
804 | 3bd3d643 | Iustin Pop | Dry-run allocation |
805 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^^ |
806 | 3bd3d643 | Iustin Pop | |
807 | 3bd3d643 | Iustin Pop | The create instance LU will get a new 'dry-run' option that will just |
808 | 3bd3d643 | Iustin Pop | simulate the placement, and return the chosen node-lists after running |
809 | 3bd3d643 | Iustin Pop | all the usual checks. |
810 | 3bd3d643 | Iustin Pop | |
811 | 3bd3d643 | Iustin Pop | Cluster balancing |
812 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^ |
813 | 3bd3d643 | Iustin Pop | |
814 | 3bd3d643 | Iustin Pop | Instance add/removals/moves can create a situation where load on the |
815 | 3bd3d643 | Iustin Pop | nodes is not spread equally. For this, a new iallocator mode will be |
816 | 3bd3d643 | Iustin Pop | implemented called ``balance`` in which the plugin, given the current |
817 | 3bd3d643 | Iustin Pop | cluster state, and a maximum number of operations, will need to |
818 | 3bd3d643 | Iustin Pop | compute the instance relocations needed in order to achieve a "better" |
819 | 3bd3d643 | Iustin Pop | (by whatever measure the script considers better) cluster. |
820 | 3bd3d643 | Iustin Pop | |
821 | 3bd3d643 | Iustin Pop | Cluster capacity calculation |
822 | 3bd3d643 | Iustin Pop | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
823 | 3bd3d643 | Iustin Pop | |
824 | 3bd3d643 | Iustin Pop | In this mode, called ``capacity``, given an instance specification and |
825 | 3bd3d643 | Iustin Pop | the current cluster state (similar to the ``allocate`` mode), the |
826 | 3bd3d643 | Iustin Pop | plugin needs to return: |
827 | 3bd3d643 | Iustin Pop | |
828 | 3bd3d643 | Iustin Pop | - how many instances can be allocated on the cluster with that specification |
829 | 3bd3d643 | Iustin Pop | - on which nodes these will be allocated (in order) |
830 | 558fd122 | Michael Hanselmann | |
831 | 558fd122 | Michael Hanselmann | .. vim: set textwidth=72 : |