1 | e56bb0e8 | Guido Trotter | ================= |
---|---|---|---|
2 | e56bb0e8 | Guido Trotter | Ganeti 2.2 design |
3 | e56bb0e8 | Guido Trotter | ================= |
4 | e56bb0e8 | Guido Trotter | |
5 | e56bb0e8 | Guido Trotter | This document describes the major changes in Ganeti 2.2 compared to |
6 | e56bb0e8 | Guido Trotter | the 2.1 version. |
7 | e56bb0e8 | Guido Trotter | |
8 | e56bb0e8 | Guido Trotter | The 2.2 version will be a relatively small release. Its main aim is to |
9 | e56bb0e8 | Guido Trotter | avoid changing too much of the core code, while addressing issues and |
10 | e56bb0e8 | Guido Trotter | adding new features and improvements over 2.1, in a timely fashion. |
11 | e56bb0e8 | Guido Trotter | |
12 | e56bb0e8 | Guido Trotter | .. contents:: :depth: 4 |
13 | e56bb0e8 | Guido Trotter | |
14 | e56bb0e8 | Guido Trotter | Objective |
15 | e56bb0e8 | Guido Trotter | ========= |
16 | e56bb0e8 | Guido Trotter | |
17 | e56bb0e8 | Guido Trotter | Background |
18 | e56bb0e8 | Guido Trotter | ========== |
19 | e56bb0e8 | Guido Trotter | |
20 | e56bb0e8 | Guido Trotter | Overview |
21 | e56bb0e8 | Guido Trotter | ======== |
22 | e56bb0e8 | Guido Trotter | |
23 | e56bb0e8 | Guido Trotter | Detailed design |
24 | e56bb0e8 | Guido Trotter | =============== |
25 | e56bb0e8 | Guido Trotter | |
26 | e56bb0e8 | Guido Trotter | As for 2.1 we divide the 2.2 design into three areas: |
27 | e56bb0e8 | Guido Trotter | |
28 | e56bb0e8 | Guido Trotter | - core changes, which affect the master daemon/job queue/locking or |
29 | e56bb0e8 | Guido Trotter | all/most logical units |
30 | e56bb0e8 | Guido Trotter | - logical unit/feature changes |
31 | e56bb0e8 | Guido Trotter | - external interface changes (e.g. command line, OS API, hooks, ...) |
32 | e56bb0e8 | Guido Trotter | |
33 | e56bb0e8 | Guido Trotter | Core changes |
34 | e56bb0e8 | Guido Trotter | ------------ |
35 | e56bb0e8 | Guido Trotter | |
36 | c3c5dc77 | Guido Trotter | Master Daemon Scaling improvements |
37 | c3c5dc77 | Guido Trotter | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
38 | c3c5dc77 | Guido Trotter | |
39 | c3c5dc77 | Guido Trotter | Current state and shortcomings |
40 | c3c5dc77 | Guido Trotter | ++++++++++++++++++++++++++++++ |
41 | c3c5dc77 | Guido Trotter | |
42 | c3c5dc77 | Guido Trotter | Currently the Ganeti master daemon is based on four sets of threads: |
43 | c3c5dc77 | Guido Trotter | |
44 | c3c5dc77 | Guido Trotter | - The main thread (1 thread) just accepts connections on the master |
45 | c3c5dc77 | Guido Trotter | socket |
46 | c3c5dc77 | Guido Trotter | - The client worker pool (16 threads) handles those connections, |
47 | c3c5dc77 | Guido Trotter | one thread per connected socket, parses luxi requests, and sends data |
48 | c3c5dc77 | Guido Trotter | back to the clients |
49 | c3c5dc77 | Guido Trotter | - The job queue worker pool (25 threads) executes the actual jobs |
50 | c3c5dc77 | Guido Trotter | submitted by the clients |
51 | c3c5dc77 | Guido Trotter | - The rpc worker pool (10 threads) interacts with the nodes via |
52 | c3c5dc77 | Guido Trotter | http-based-rpc |
53 | c3c5dc77 | Guido Trotter | |
54 | c3c5dc77 | Guido Trotter | This means that every masterd currently runs 52 threads to do its job. |
55 | c3c5dc77 | Guido Trotter | Being able to reduce the number of thread sets would make the master's |
56 | c3c5dc77 | Guido Trotter | architecture a lot simpler. Moreover, having fewer threads can help |
57 | c3c5dc77 | Guido Trotter | decrease lock contention, log pollution and memory usage. |
58 | c3c5dc77 | Guido Trotter | Also, with the current architecture, masterd suffers from quite a few |
59 | c3c5dc77 | Guido Trotter | scalability issues: |
60 | c3c5dc77 | Guido Trotter | |
61 | 37e1e262 | Guido Trotter | Core daemon connection handling |
62 | 37e1e262 | Guido Trotter | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
63 | 37e1e262 | Guido Trotter | |
64 | 37e1e262 | Guido Trotter | Since the 16 client worker threads handle one connection each, it's very |
65 | 37e1e262 | Guido Trotter | easy to exhaust them just by connecting to masterd 16 times and not |
66 | 37e1e262 | Guido Trotter | sending any data. While we could perhaps make those pools resizable, |
67 | 37e1e262 | Guido Trotter | increasing the number of threads won't help with lock contention, nor |
68 | 37e1e262 | Guido Trotter | with handling long-running operations in a way that keeps the client |
69 | 37e1e262 | Guido Trotter | informed that everything is proceeding and that it needn't time out. |
70 | 37e1e262 | Guido Trotter | |
71 | 37e1e262 | Guido Trotter | Wait for job change |
72 | 37e1e262 | Guido Trotter | ^^^^^^^^^^^^^^^^^^^ |
73 | 37e1e262 | Guido Trotter | |
74 | 37e1e262 | Guido Trotter | The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client |
75 | 37e1e262 | Guido Trotter | thread block on its job for a relatively long time. This is another easy |
76 | 37e1e262 | Guido Trotter | way to exhaust the 16 client threads, and a place where clients often |
77 | 37e1e262 | Guido Trotter | time out. Moreover, this operation worsens the job queue lock |
78 | 37e1e262 | Guido Trotter | contention (see below). |
79 | 37e1e262 | Guido Trotter | |
80 | 37e1e262 | Guido Trotter | Job Queue lock |
81 | 37e1e262 | Guido Trotter | ^^^^^^^^^^^^^^ |
82 | 37e1e262 | Guido Trotter | |
83 | 37e1e262 | Guido Trotter | The job queue lock is quite heavily contended, and certain easily |
84 | 37e1e262 | Guido Trotter | reproducible workloads show that it's very easy to put masterd in |
85 | 37e1e262 | Guido Trotter | trouble: for example, running ~15 background instance reinstall jobs |
86 | 37e1e262 | Guido Trotter | results in a master daemon that, even without having exhausted the |
87 | 37e1e262 | Guido Trotter | client worker threads, can't answer simple job list requests or |
88 | 37e1e262 | Guido Trotter | submit more jobs. |
89 | 37e1e262 | Guido Trotter | |
90 | 37e1e262 | Guido Trotter | Currently the job queue lock is an exclusive non-fair lock protecting |
91 | 37e1e262 | Guido Trotter | the following job queue methods (called by the client workers): |
92 | 37e1e262 | Guido Trotter | |
93 | 37e1e262 | Guido Trotter | - AddNode |
94 | 37e1e262 | Guido Trotter | - RemoveNode |
95 | 37e1e262 | Guido Trotter | - SubmitJob |
96 | 37e1e262 | Guido Trotter | - SubmitManyJobs |
97 | 37e1e262 | Guido Trotter | - WaitForJobChanges |
98 | 37e1e262 | Guido Trotter | - CancelJob |
99 | 37e1e262 | Guido Trotter | - ArchiveJob |
100 | 37e1e262 | Guido Trotter | - AutoArchiveJobs |
101 | 37e1e262 | Guido Trotter | - QueryJobs |
102 | 37e1e262 | Guido Trotter | - Shutdown |
103 | 37e1e262 | Guido Trotter | |
104 | 37e1e262 | Guido Trotter | Moreover the job queue lock is acquired outside of the job queue in two |
105 | 37e1e262 | Guido Trotter | other classes: |
106 | 37e1e262 | Guido Trotter | |
107 | 37e1e262 | Guido Trotter | - jqueue._JobQueueWorker (in RunTask) before executing the opcode, after |
108 | 37e1e262 | Guido Trotter | finishing its execution and when handling an exception. |
109 | 37e1e262 | Guido Trotter | - jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the |
110 | 37e1e262 | Guido Trotter | processor (mcpu.Processor) is about to start working on the opcode |
111 | 37e1e262 | Guido Trotter | (after acquiring the necessary locks) and when any data is sent back |
112 | 37e1e262 | Guido Trotter | via the feedback function. |
113 | 37e1e262 | Guido Trotter | |
114 | 37e1e262 | Guido Trotter | Of those the major critical points are: |
115 | 37e1e262 | Guido Trotter | |
116 | 37e1e262 | Guido Trotter | - Submit[Many]Job, QueryJobs, WaitForJobChanges, which can easily slow |
117 | 37e1e262 | Guido Trotter | down and block client threads to the point of making the respective |
118 | 37e1e262 | Guido Trotter | clients time out. |
119 | 37e1e262 | Guido Trotter | - The code paths in NotifyStart, Feedback, and RunTask, which slow |
120 | 37e1e262 | Guido Trotter | down job processing between clients and otherwise unrelated jobs. |
121 | 37e1e262 | Guido Trotter | |
122 | 37e1e262 | Guido Trotter | To increase the pain: |
123 | 37e1e262 | Guido Trotter | |
124 | 37e1e262 | Guido Trotter | - WaitForJobChanges is a bad offender because it's implemented with a |
125 | 37e1e262 | Guido Trotter | notified condition that wakes waiting threads, which then try to |
126 | 37e1e262 | Guido Trotter | acquire the global lock again |
127 | 37e1e262 | Guido Trotter | - Many should-be-fast code paths are slowed down by replicating the |
128 | 37e1e262 | Guido Trotter | change to remote nodes, and thus waiting, with the lock held, on |
129 | 37e1e262 | Guido Trotter | remote rpcs to complete (starting, finishing, and submitting jobs) |
130 | c3c5dc77 | Guido Trotter | |
131 | c3c5dc77 | Guido Trotter | Proposed changes |
132 | c3c5dc77 | Guido Trotter | ++++++++++++++++ |
133 | c3c5dc77 | Guido Trotter | |
134 | c3c5dc77 | Guido Trotter | In order to be able to interact with the master daemon even when it's |
135 | c3c5dc77 | Guido Trotter | under heavy load, and to make it simpler to add core functionality |
136 | c3c5dc77 | Guido Trotter | (such as an asynchronous rpc client) we propose three subsequent levels |
137 | c3c5dc77 | Guido Trotter | of changes to the master core architecture. |
138 | c3c5dc77 | Guido Trotter | |
139 | c3c5dc77 | Guido Trotter | After making these changes we'll be able to re-evaluate the size of our |
140 | c3c5dc77 | Guido Trotter | thread pool, if we see that we can make most threads in the client |
141 | c3c5dc77 | Guido Trotter | worker pool always idle. In the future we should also investigate making |
142 | c3c5dc77 | Guido Trotter | the rpc client asynchronous as well, so that we can make masterd a lot |
143 | c3c5dc77 | Guido Trotter | smaller in number of threads, and memory size, and thus also easier to |
144 | c3c5dc77 | Guido Trotter | understand, debug, and scale. |
145 | c3c5dc77 | Guido Trotter | |
146 | c3c5dc77 | Guido Trotter | Connection handling |
147 | c3c5dc77 | Guido Trotter | ^^^^^^^^^^^^^^^^^^^ |
148 | c3c5dc77 | Guido Trotter | |
149 | c3c5dc77 | Guido Trotter | We'll move the main thread of ganeti-masterd to asyncore, so that it can |
150 | c3c5dc77 | Guido Trotter | share the mainloop code with all other Ganeti daemons. Then all luxi |
151 | c3c5dc77 | Guido Trotter | clients will be asyncore clients, and I/O to/from them will be handled |
152 | c3c5dc77 | Guido Trotter | by the master thread asynchronously. Data will be read from the client |
153 | c3c5dc77 | Guido Trotter | sockets as it becomes available, and kept in a buffer, then when a |
154 | c3c5dc77 | Guido Trotter | complete message is found, it's passed to a client worker thread for |
155 | c3c5dc77 | Guido Trotter | parsing and processing. The client worker thread is responsible for |
156 | c3c5dc77 | Guido Trotter | serializing the reply, which can then be sent asynchronously by the main |
157 | c3c5dc77 | Guido Trotter | thread on the socket. |
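The buffering step described above can be sketched as follows. This is a minimal illustration, not the masterd implementation: it leaves out asyncore and the sockets entirely, and the names ``MessageBuffer`` and ``handle_message``, the JSON payload layout and the end-of-message byte (ETX, ``0x03``) are all assumptions made for the example.

```python
import json

# Assumption: every message ends with one end-of-message byte. The value
# chosen here (ETX, 0x03) is illustrative only.
EOM = b"\x03"


class MessageBuffer:
    """Accumulates raw socket data and splits it into complete messages."""

    def __init__(self):
        self._pending = b""

    def feed(self, data):
        """Add newly read bytes and return the list of completed messages."""
        self._pending += data
        # Everything before the last terminator is complete; the remainder
        # stays buffered until more data arrives.
        *complete, self._pending = self._pending.split(EOM)
        return complete


def handle_message(raw):
    """Stand-in for a client worker: parse the request, serialize a reply."""
    request = json.loads(raw.decode("utf-8"))
    reply = {"success": True, "result": request["method"].upper()}
    return json.dumps(reply).encode("utf-8") + EOM


buf = MessageBuffer()
first = buf.feed(b'{"method": "que')            # incomplete, nothing to dispatch
second = buf.feed(b'ry"}\x03{"method": "x"}\x03{"me')
replies = [handle_message(m) for m in second]   # serialized by the "worker"
```

In this arrangement the main thread only calls ``feed`` and writes the already serialized replies, so a client that connects and sends nothing ties up a buffer rather than a worker thread.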
158 | c3c5dc77 | Guido Trotter | |
159 | c3c5dc77 | Guido Trotter | Wait for job change |
160 | c3c5dc77 | Guido Trotter | ^^^^^^^^^^^^^^^^^^^ |
161 | c3c5dc77 | Guido Trotter | |
162 | c3c5dc77 | Guido Trotter | The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be |
163 | c3c5dc77 | Guido Trotter | subscription-based, so that the executing thread doesn't have to be |
164 | c3c5dc77 | Guido Trotter | waiting for the changes to arrive. Threads producing messages (job queue |
165 | c3c5dc77 | Guido Trotter | executors) will make sure that when there is a change another thread is |
166 | c3c5dc77 | Guido Trotter | awakened and delivers it to the waiting clients. This can be either a |
167 | c3c5dc77 | Guido Trotter | dedicated "wait for job changes" thread or pool, or one of the client |
168 | c3c5dc77 | Guido Trotter | workers, depending on what's easier to implement. In either case the |
169 | c3c5dc77 | Guido Trotter | main asyncore thread will only be involved in pushing the actual |
170 | c3c5dc77 | Guido Trotter | data, and not in fetching/serializing it. |
171 | c3c5dc77 | Guido Trotter | |
172 | c3c5dc77 | Guido Trotter | Other features to look at when implementing this code are: |
173 | c3c5dc77 | Guido Trotter | |
174 | 37e1e262 | Guido Trotter | - Possibility not to need the job lock to know which updates to push: |
175 | 37e1e262 | Guido Trotter | if the thread producing the data pushed a copy of the update for the |
176 | 37e1e262 | Guido Trotter | waiting clients, the thread sending it won't need to acquire the |
177 | 37e1e262 | Guido Trotter | lock again to fetch the actual data. |
178 | c3c5dc77 | Guido Trotter | - Possibility to signal clients about to time out, when no update has |
179 | c3c5dc77 | Guido Trotter | been received, not to despair and to keep waiting (luxi level |
180 | c3c5dc77 | Guido Trotter | keepalive). |
181 | c3c5dc77 | Guido Trotter | - Possibility to defer updates if they are too frequent, providing |
182 | c3c5dc77 | Guido Trotter | them at a maximum rate (lower priority). |
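One possible shape for the subscription model is sketched below, under the assumption that each waiting client gets its own queue. ``JobChangeNotifier`` and ``wait_for_change`` are hypothetical names, not existing Ganeti code. The producer pushes a copy of each update, so the delivering thread never needs the job lock, and an empty wait returns ``None``, which a caller could translate into a luxi-level keepalive.

```python
import queue
import threading


class JobChangeNotifier:
    """Hypothetical registry mapping job IDs to subscriber queues."""

    def __init__(self):
        self._lock = threading.Lock()   # protects only the registry
        self._subscribers = {}          # job_id -> list of queue.Queue

    def subscribe(self, job_id):
        """Register interest in a job; returns the queue updates arrive on."""
        q = queue.Queue()
        with self._lock:
            self._subscribers.setdefault(job_id, []).append(q)
        return q

    def publish(self, job_id, update):
        """Called by the job executor: push a copy of the update to waiters."""
        with self._lock:
            targets = list(self._subscribers.get(job_id, []))
        for q in targets:
            q.put(dict(update))         # a copy, so no job lock is needed later


def wait_for_change(q, timeout):
    """Return the next update, or None when nothing changed in time."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        return None


notifier = JobChangeNotifier()
sub = notifier.subscribe(42)
keepalive = wait_for_change(sub, timeout=0.01)   # no update yet
notifier.publish(42, {"status": "running"})
change = wait_for_change(sub, timeout=1.0)
```

Rate limiting of frequent updates, the third item in the list, is not covered by this sketch.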
183 | c3c5dc77 | Guido Trotter | |
184 | c3c5dc77 | Guido Trotter | Job Queue lock |
185 | c3c5dc77 | Guido Trotter | ^^^^^^^^^^^^^^ |
186 | c3c5dc77 | Guido Trotter | |
187 | 37e1e262 | Guido Trotter | In order to decrease the job queue lock contention, we will change the |
188 | 37e1e262 | Guido Trotter | code paths in the following ways, initially: |
189 | 37e1e262 | Guido Trotter | |
190 | 37e1e262 | Guido Trotter | - A per-job lock will be introduced. All operations affecting only one |
191 | 37e1e262 | Guido Trotter | job (for example feedback, starting/finishing notifications, |
192 | 37e1e262 | Guido Trotter | subscribing to or watching a job) will only require the job lock. |
193 | 37e1e262 | Guido Trotter | This should be a leaf lock, but if a situation arises in which it |
194 | 37e1e262 | Guido Trotter | must be acquired together with the global job queue lock the global |
195 | 37e1e262 | Guido Trotter | one must always be acquired last (for the global section). |
196 | 37e1e262 | Guido Trotter | - The locks will be converted to a sharedlock. Any read-only operation |
197 | 37e1e262 | Guido Trotter | will be able to proceed in parallel. |
198 | 37e1e262 | Guido Trotter | - During remote update (which happens already per-job) we'll drop the |
199 | 37e1e262 | Guido Trotter | job lock level to shared mode, so that activities reading the lock |
200 | 37e1e262 | Guido Trotter | (for example job change notifications or QueryJobs calls) will be |
201 | 37e1e262 | Guido Trotter | able to proceed in parallel. |
202 | 37e1e262 | Guido Trotter | - The wait for job changes improvements proposed above will be |
203 | 37e1e262 | Guido Trotter | implemented. |
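The split between a global queue lock and per-job locks might look roughly like this. It is a sketch under stated assumptions: ``SharedLock`` below is a deliberately small, non-fair shared/exclusive lock written for the example (it is not Ganeti's own sharedlock), and ``Job``, ``submit_job``, ``feedback`` and ``query_job`` are simplified stand-ins for the real code paths.

```python
import threading
from contextlib import contextmanager


class SharedLock:
    """Minimal shared/exclusive lock; not fair, for illustration only."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


class Job:
    def __init__(self, job_id):
        self.id = job_id
        self.lock = SharedLock()    # per-job lock
        self.log = []


queue_lock = SharedLock()           # global job queue lock
jobs = {}


def submit_job(job_id):
    # Changes the set of jobs, so it needs the global lock exclusively.
    with queue_lock.exclusive():
        jobs[job_id] = Job(job_id)


def feedback(job, message):
    # Touches a single job only: the per-job lock is enough.
    with job.lock.exclusive():
        job.log.append(message)


def query_job(job_id):
    # Read-only lookups take the locks in shared mode, one at a time.
    with queue_lock.shared():
        job = jobs[job_id]
    with job.lock.shared():
        return list(job.log)


submit_job(1)
feedback(jobs[1], "started")
history = query_job(1)
```

With this layout, feedback from one job no longer blocks queries about another, and several ``query_job`` calls can hold the shared mode at the same time.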
204 | 37e1e262 | Guido Trotter | |
205 | 37e1e262 | Guido Trotter | In the future other improvements may include splitting off some of the |
206 | 37e1e262 | Guido Trotter | work (e.g. replication of a job to remote nodes) to a separate thread pool |
207 | 37e1e262 | Guido Trotter | or asynchronous thread, not tied to the code path for answering client |
208 | 37e1e262 | Guido Trotter | requests or the one executing the "real" work. This can be discussed |
209 | 37e1e262 | Guido Trotter | again after we have used the more granular job queue in production and |
210 | 37e1e262 | Guido Trotter | tested its benefits. |
211 | 37e1e262 | Guido Trotter | |
212 | c3c5dc77 | Guido Trotter | |
213 | 6e56e84a | Michael Hanselmann | Remote procedure call timeouts |
214 | 6e56e84a | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
215 | 6e56e84a | Michael Hanselmann | |
216 | 6e56e84a | Michael Hanselmann | Current state and shortcomings |
217 | 6e56e84a | Michael Hanselmann | ++++++++++++++++++++++++++++++ |
218 | 6e56e84a | Michael Hanselmann | |
219 | 6e56e84a | Michael Hanselmann | The current RPC protocol used by Ganeti is based on HTTP. Every request |
220 | 6e56e84a | Michael Hanselmann | consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``) |
221 | 6e56e84a | Michael Hanselmann | and doesn't return until the function called has returned. Parameters |
222 | 6e56e84a | Michael Hanselmann | and return values are encoded using JSON. |
223 | 6e56e84a | Michael Hanselmann | |
224 | 6e56e84a | Michael Hanselmann | On the server side, ``ganeti-noded`` handles every incoming connection |
225 | 6e56e84a | Michael Hanselmann | in a separate process by forking just after accepting the connection. |
226 | 6e56e84a | Michael Hanselmann | This process exits after sending the response. |
227 | 6e56e84a | Michael Hanselmann | |
228 | 6e56e84a | Michael Hanselmann | There is one major problem with this design: timeouts cannot be used on |
229 | 6e56e84a | Michael Hanselmann | a per-request basis. Neither client nor server knows how long a request |
230 | 6e56e84a | Michael Hanselmann | will take. Even if we might be able to group requests into different |
231 | 6e56e84a | Michael Hanselmann | categories (e.g. fast and slow), this is not reliable. |
232 | 6e56e84a | Michael Hanselmann | |
233 | 6e56e84a | Michael Hanselmann | If a node has an issue or the network connection fails while a request |
234 | 6e56e84a | Michael Hanselmann | is being handled, the master daemon can wait for a long time for the |
235 | 6e56e84a | Michael Hanselmann | connection to time out (e.g. due to the operating system's underlying |
236 | 6e56e84a | Michael Hanselmann | TCP keep-alive packets or timeouts). While the settings for keep-alive |
237 | 6e56e84a | Michael Hanselmann | packets can be changed using Linux-specific socket options, we prefer to |
238 | 6e56e84a | Michael Hanselmann | use application-level timeouts because these cover both machine down and |
239 | 6e56e84a | Michael Hanselmann | unresponsive node daemon cases. |
240 | 6e56e84a | Michael Hanselmann | |
241 | 6e56e84a | Michael Hanselmann | Proposed changes |
242 | 6e56e84a | Michael Hanselmann | ++++++++++++++++ |
243 | 6e56e84a | Michael Hanselmann | |
244 | 6e56e84a | Michael Hanselmann | RPC glossary |
245 | 6e56e84a | Michael Hanselmann | ^^^^^^^^^^^^ |
246 | 6e56e84a | Michael Hanselmann | |
247 | 6e56e84a | Michael Hanselmann | Function call ID |
248 | 6e56e84a | Michael Hanselmann | Unique identifier returned by ``ganeti-noded`` after invoking a |
249 | 6e56e84a | Michael Hanselmann | function. |
250 | 6e56e84a | Michael Hanselmann | Function process |
251 | 6e56e84a | Michael Hanselmann | Process started by ``ganeti-noded`` to call actual (backend) function. |
252 | 6e56e84a | Michael Hanselmann | |
253 | 6e56e84a | Michael Hanselmann | Protocol |
254 | 6e56e84a | Michael Hanselmann | ^^^^^^^^ |
255 | 6e56e84a | Michael Hanselmann | |
256 | 6e56e84a | Michael Hanselmann | Initially we chose HTTP as our RPC protocol because there were existing |
257 | 6e56e84a | Michael Hanselmann | libraries, which, unfortunately, turned out to lack important features |
258 | 6e56e84a | Michael Hanselmann | (such as SSL certificate authentication) and we had to write our own. |
259 | 6e56e84a | Michael Hanselmann | |
260 | 6e56e84a | Michael Hanselmann | This proposal can easily be implemented using HTTP, though it would |
261 | 6e56e84a | Michael Hanselmann | likely be more efficient and less complicated to use the LUXI protocol |
262 | 6e56e84a | Michael Hanselmann | already used to communicate between client tools and the Ganeti master |
263 | 6e56e84a | Michael Hanselmann | daemon. Switching to another protocol can occur at a later point. This |
264 | 6e56e84a | Michael Hanselmann | proposal should be implemented using HTTP as its underlying protocol. |
265 | 6e56e84a | Michael Hanselmann | |
266 | 6e56e84a | Michael Hanselmann | The LUXI protocol currently contains two functions, ``WaitForJobChange`` |
267 | 6e56e84a | Michael Hanselmann | and ``AutoArchiveJobs``, which can take a longer time. They both support |
268 | 6e56e84a | Michael Hanselmann | a parameter to specify the timeout. This timeout is usually chosen as |
269 | 6e56e84a | Michael Hanselmann | roughly half of the socket timeout, guaranteeing a response before the |
270 | 6e56e84a | Michael Hanselmann | socket times out. After the specified amount of time, |
271 | 6e56e84a | Michael Hanselmann | ``AutoArchiveJobs`` returns and reports the number of archived jobs. |
272 | 6e56e84a | Michael Hanselmann | ``WaitForJobChange`` returns and reports a timeout. In both cases, the |
273 | 6e56e84a | Michael Hanselmann | functions can be called again. |
274 | 6e56e84a | Michael Hanselmann | |
275 | 6e56e84a | Michael Hanselmann | A similar model can be used for the inter-node RPC protocol. In some |
276 | 6e56e84a | Michael Hanselmann | sense, the node daemon will implement a light variant of *"node daemon |
277 | 6e56e84a | Michael Hanselmann | jobs"*. When the function call is sent, it specifies an initial timeout. |
278 | 6e56e84a | Michael Hanselmann | If the function didn't finish within this timeout, a response is sent |
279 | 6e56e84a | Michael Hanselmann | with a unique identifier, the function call ID. The client can then |
280 | 6e56e84a | Michael Hanselmann | choose to wait for the function to finish again with a timeout. |
281 | 6e56e84a | Michael Hanselmann | Inter-node RPC calls would no longer be blocking indefinitely and there |
282 | 6e56e84a | Michael Hanselmann | would be an implicit ping-mechanism. |
283 | 6e56e84a | Michael Hanselmann | |
284 | 6e56e84a | Michael Hanselmann | Request handling |
285 | 6e56e84a | Michael Hanselmann | ^^^^^^^^^^^^^^^^ |
286 | 6e56e84a | Michael Hanselmann | |
287 | 6e56e84a | Michael Hanselmann | To support the protocol changes described above, the way the node daemon |
288 | 6e56e84a | Michael Hanselmann | handles requests will have to change. Instead of forking and handling |
289 | 6e56e84a | Michael Hanselmann | every connection in a separate process, there should be one child |
290 | 6e56e84a | Michael Hanselmann | process per function call and the master process will handle the |
291 | 6e56e84a | Michael Hanselmann | communication with clients and the function processes using asynchronous |
292 | 6e56e84a | Michael Hanselmann | I/O. |
293 | 6e56e84a | Michael Hanselmann | |
294 | 6e56e84a | Michael Hanselmann | Function processes communicate with the parent process via stdio and |
295 | 6e56e84a | Michael Hanselmann | possibly their exit status. Every function process has a unique |
296 | 6e56e84a | Michael Hanselmann | identifier, though it shouldn't be the process ID only (PIDs can be |
297 | 6e56e84a | Michael Hanselmann | recycled and are prone to race conditions for this use case). The |
298 | 6e56e84a | Michael Hanselmann | proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid`` |
299 | 6e56e84a | Michael Hanselmann | is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the |
300 | 6e56e84a | Michael Hanselmann | current Unix timestamp with decimal places and ``random`` at least 16 |
301 | 6e56e84a | Michael Hanselmann | random bits. |
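A function call ID in the proposed format could be built as shown below. The helper name ``make_function_call_id`` is hypothetical, and two details are choices made for this sketch rather than part of the proposal: the timestamp uses six decimal places and the 16 random bits are rendered as four hexadecimal digits.

```python
import os
import random
import time


def make_function_call_id(child_pid):
    """Build an ID of the form ppid:cpid:time:random."""
    ppid = os.getpid()                      # PID of the daemon process
    timestamp = "%.6f" % time.time()        # Unix timestamp with decimals
    # 16 random bits from the operating system's random source.
    rand = "%04x" % random.SystemRandom().getrandbits(16)
    return "%s:%s:%s:%s" % (ppid, child_pid, timestamp, rand)


example_id = make_function_call_id(child_pid=12345)
```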
302 | 6e56e84a | Michael Hanselmann | |
303 | 6e56e84a | Michael Hanselmann | The following operations will be supported: |
304 | 6e56e84a | Michael Hanselmann | |
305 | 6e56e84a | Michael Hanselmann | ``StartFunction(fn_name, fn_args, timeout)`` |
306 | 6e56e84a | Michael Hanselmann | Starts a function specified by ``fn_name`` with arguments in |
307 | 6e56e84a | Michael Hanselmann | ``fn_args`` and waits up to ``timeout`` seconds for the function |
308 | 6e56e84a | Michael Hanselmann | to finish. Fire-and-forget calls can be made by specifying a timeout |
309 | 6e56e84a | Michael Hanselmann | of 0 seconds (e.g. for powercycling the node). Returns three values: |
310 | 6e56e84a | Michael Hanselmann | function call ID (if not finished), whether function finished (or |
311 | 6e56e84a | Michael Hanselmann | timeout) and the function's return value. |
312 | 6e56e84a | Michael Hanselmann | ``WaitForFunction(fnc_id, timeout)`` |
313 | 6e56e84a | Michael Hanselmann | Waits up to ``timeout`` seconds for the function call to finish. Returns |
314 | 6e56e84a | Michael Hanselmann | the same values as ``StartFunction``. |
315 | 6e56e84a | Michael Hanselmann | |
316 | 6e56e84a | Michael Hanselmann | In the future, ``StartFunction`` could support an additional parameter |
317 | 6e56e84a | Michael Hanselmann | to specify after how long the function process should be aborted. |
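The two operations and the client-side polling loop can be modelled in a few lines. This is a toy model, not the proposed node daemon: threads from ``concurrent.futures`` stand in for forked function processes, a ``uuid`` stands in for the function call ID, there is no network layer, and ``NodeDaemonSketch`` and ``call_with_polling`` are hypothetical names.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError


class NodeDaemonSketch:
    """Toy stand-in for ganeti-noded; threads replace function processes."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._calls = {}    # function call ID -> Future

    def _result(self, fnc_id, timeout):
        """Return (function call ID, finished, return value)."""
        future = self._calls[fnc_id]
        try:
            value = future.result(timeout=timeout)
        except TimeoutError:
            return (fnc_id, False, None)    # still running, ask again later
        del self._calls[fnc_id]
        return (None, True, value)

    def start_function(self, fn, fn_args, timeout):
        fnc_id = uuid.uuid4().hex           # placeholder for the real ID format
        self._calls[fnc_id] = self._pool.submit(fn, *fn_args)
        return self._result(fnc_id, timeout)

    def wait_for_function(self, fnc_id, timeout):
        return self._result(fnc_id, timeout)


def slow_add(a, b):
    time.sleep(0.3)
    return a + b


def call_with_polling(node, fn, fn_args, timeout):
    """Client side: start the call, then keep waiting until it finishes."""
    rounds = 1
    fnc_id, finished, value = node.start_function(fn, fn_args, timeout)
    while not finished:
        rounds += 1
        fnc_id, finished, value = node.wait_for_function(fnc_id, timeout)
    return value, rounds


node = NodeDaemonSketch()
value, rounds = call_with_polling(node, slow_add, (2, 3), timeout=0.1)
```

Each round trip that comes back unfinished also tells the caller that the node daemon is still alive, which is the implicit ping mechanism mentioned earlier. Passing a timeout of 0 to ``start_function`` gives the fire-and-forget behaviour.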
318 | 6e56e84a | Michael Hanselmann | |
319 | 6e56e84a | Michael Hanselmann | Simplified timing diagram:: |
320 | 6e56e84a | Michael Hanselmann | |
321 | 6e56e84a | Michael Hanselmann | Master daemon Node daemon Function process |
322 | 6e56e84a | Michael Hanselmann | | |
323 | 6e56e84a | Michael Hanselmann | Call function |
324 | 6e56e84a | Michael Hanselmann | (timeout 10s) -----> Parse request and fork for ----> Start function |
325 | 6e56e84a | Michael Hanselmann | calling actual function, then | |
326 | 6e56e84a | Michael Hanselmann | wait up to 10s for function to | |
327 | 6e56e84a | Michael Hanselmann | finish | |
328 | 6e56e84a | Michael Hanselmann | | | |
329 | 6e56e84a | Michael Hanselmann | ... ... |
330 | 6e56e84a | Michael Hanselmann | | | |
331 | 6e56e84a | Michael Hanselmann | Examine return <---- | | |
332 | 6e56e84a | Michael Hanselmann | value and wait | |
333 | 6e56e84a | Michael Hanselmann | again -------------> Wait another 10s for function | |
334 | 6e56e84a | Michael Hanselmann | | | |
335 | 6e56e84a | Michael Hanselmann | ... ... |
336 | 6e56e84a | Michael Hanselmann | | | |
337 | 6e56e84a | Michael Hanselmann | Examine return <---- | | |
338 | 6e56e84a | Michael Hanselmann | value and wait | |
339 | 6e56e84a | Michael Hanselmann | again -------------> Wait another 10s for function | |
340 | 6e56e84a | Michael Hanselmann | | | |
341 | 6e56e84a | Michael Hanselmann | ... ... |
342 | 6e56e84a | Michael Hanselmann | | | |
343 | 6e56e84a | Michael Hanselmann | | Function ends, |
344 | 6e56e84a | Michael Hanselmann | Get return value and forward <-- process exits |
345 | 6e56e84a | Michael Hanselmann | Process return <---- it to caller |
346 | 6e56e84a | Michael Hanselmann | value and continue |
347 | 6e56e84a | Michael Hanselmann | | |
348 | 6e56e84a | Michael Hanselmann | |
349 | 6e56e84a | Michael Hanselmann | .. TODO: Convert diagram above to graphviz/dot graphic |
350 | 6e56e84a | Michael Hanselmann | |
351 | 6e56e84a | Michael Hanselmann | On process termination (e.g. after having been sent a ``SIGTERM`` or |
352 | 6e56e84a | Michael Hanselmann | ``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all |
353 | 6e56e84a | Michael Hanselmann | function processes and wait for all of them to terminate. |
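The shutdown behaviour can be illustrated with the standard ``subprocess`` module. This is a sketch assuming a POSIX system; the helper ``terminate_function_processes`` is a hypothetical name and the sleeping child interpreters merely stand in for function processes.

```python
import signal
import subprocess
import sys


def terminate_function_processes(children):
    """Send SIGTERM to every running child, then wait for all to exit."""
    for child in children:
        if child.poll() is None:            # still running
            child.send_signal(signal.SIGTERM)
    # wait() blocks until the child has terminated and returns its exit code.
    return [child.wait() for child in children]


# Two long-running children stand in for function processes.
children = [
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    for _ in range(2)
]
exit_codes = terminate_function_processes(children)
```

Signalling all children first and waiting afterwards lets them shut down in parallel instead of one after another.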
354 | 6e56e84a | Michael Hanselmann | |
355 | 6e56e84a | Michael Hanselmann | |
356 | 5b2069a9 | Michael Hanselmann | Inter-cluster instance moves |
357 | 5b2069a9 | Michael Hanselmann | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
358 | 5b2069a9 | Michael Hanselmann | |
359 | 5b2069a9 | Michael Hanselmann | Current state and shortcomings |
360 | 5b2069a9 | Michael Hanselmann | ++++++++++++++++++++++++++++++ |
361 | 5b2069a9 | Michael Hanselmann | |
362 | 5b2069a9 | Michael Hanselmann | With the current design of Ganeti, moving whole instances between |
363 | 5b2069a9 | Michael Hanselmann | different clusters involves a lot of manual work. There are several ways |
364 | 5b2069a9 | Michael Hanselmann | to move instances, one of them being to export the instance, manually |
365 | 5b2069a9 | Michael Hanselmann | copy all data to the new cluster and then import it again. Manual |
366 | 5b2069a9 | Michael Hanselmann | changes to the instance's configuration, such as the IP address, may be |
367 | 5b2069a9 | Michael Hanselmann | necessary in the new environment. The goal is to improve and automate |
368 | 5b2069a9 | Michael Hanselmann | this process in Ganeti 2.2. |
369 | 5b2069a9 | Michael Hanselmann | |
370 | 5b2069a9 | Michael Hanselmann | Proposed changes |
371 | 5b2069a9 | Michael Hanselmann | ++++++++++++++++ |
372 | 5b2069a9 | Michael Hanselmann | |
373 | 5b2069a9 | Michael Hanselmann | Authorization, Authentication and Security |
374 | 5b2069a9 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
375 | 5b2069a9 | Michael Hanselmann | |
376 | 5b2069a9 | Michael Hanselmann | Until now, each Ganeti cluster was a self-contained entity and wouldn't |
377 | 5b2069a9 | Michael Hanselmann | talk to other Ganeti clusters. Nodes within clusters only had to trust |
378 | 5b2069a9 | Michael Hanselmann | the other nodes in the same cluster and the network used for replication |
379 | 5b2069a9 | Michael Hanselmann | was trusted, too (hence the ability to use a separate, local network |
380 | 5b2069a9 | Michael Hanselmann | for replication). |
381 | 5b2069a9 | Michael Hanselmann | |
382 | 5b2069a9 | Michael Hanselmann | For inter-cluster instance transfers this model must be weakened. Nodes |
383 | 5b2069a9 | Michael Hanselmann | in one cluster will have to talk to nodes in other clusters, sometimes |
384 | 5b2069a9 | Michael Hanselmann | in other locations and, very importantly, via untrusted network |
385 | 5b2069a9 | Michael Hanselmann | connections. |
386 | 5b2069a9 | Michael Hanselmann | |
387 | 5b2069a9 | Michael Hanselmann | Various options have been considered for securing and authenticating the |
388 | 5b2069a9 | Michael Hanselmann | data transfer from one machine to another. To reduce the risk of |
389 | 5b2069a9 | Michael Hanselmann | accidentally overwriting data due to software bugs, authenticating the |
390 | 5b2069a9 | Michael Hanselmann | arriving data was considered critical. Eventually we decided to use |
391 | 5b2069a9 | Michael Hanselmann | socat's OpenSSL options (``OPENSSL:``, ``OPENSSL-LISTEN:`` et al), which |
392 | 5b2069a9 | Michael Hanselmann | provide us with encryption, authentication and authorization when used |
393 | 5b2069a9 | Michael Hanselmann | with separate keys and certificates. |
394 | 5b2069a9 | Michael Hanselmann | |
395 | 5b2069a9 | Michael Hanselmann | Combinations of OpenSSH, GnuPG and Netcat were deemed too complex to set |
396 | 5b2069a9 | Michael Hanselmann | up from within Ganeti. Any solution involving OpenSSH would require a |
397 | 5b2069a9 | Michael Hanselmann | dedicated user with a home directory and likely automated modifications |
398 | 5b2069a9 | Michael Hanselmann | to the user's ``$HOME/.ssh/authorized_keys`` file. When using Netcat, |
399 | 5b2069a9 | Michael Hanselmann | GnuPG or another encryption method would be necessary to transfer the |
400 | 5b2069a9 | Michael Hanselmann | data over an untrusted network. socat combines both in one program and |
401 | 5b2069a9 | Michael Hanselmann | is already a dependency. |
402 | 5b2069a9 | Michael Hanselmann | |
403 | 5b2069a9 | Michael Hanselmann | Each of the two clusters will have to generate an RSA key. The public |
404 | 5b2069a9 | Michael Hanselmann | parts are exchanged between the clusters by a third party, such as an |
405 | 5b2069a9 | Michael Hanselmann | administrator or a system interacting with Ganeti via the remote API |
406 | 5b2069a9 | Michael Hanselmann | ("third party" from here on). After receiving each other's public key, |
407 | 5b2069a9 | Michael Hanselmann | the clusters can start talking to each other. |
408 | 5b2069a9 | Michael Hanselmann | |
409 | 5b2069a9 | Michael Hanselmann | All encrypted connections must be verified on both sides. Neither side |
410 | 5b2069a9 | Michael Hanselmann | may accept unverified certificates. The generated certificate should |
411 | 5b2069a9 | Michael Hanselmann | only be valid for the time necessary to move the instance. |
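To make the mutual verification concrete, the helpers below assemble socat address strings for both sides. They are only a sketch: the function names, port, host name and file paths are hypothetical, and the design does not prescribe these exact option sets. The options used (``cert``, ``key``, ``cafile``, ``verify``, ``reuseaddr``) are standard socat OpenSSL and socket options; pointing ``cafile`` at the peer's certificate is one way to make each side accept only that peer.

```python
def socat_listen_address(port, cert, key, peer_cert):
    """Destination side: accept TLS connections and verify the peer."""
    opts = ["reuseaddr", "cert=%s" % cert, "key=%s" % key,
            "cafile=%s" % peer_cert, "verify=1"]
    return "OPENSSL-LISTEN:%d,%s" % (port, ",".join(opts))


def socat_connect_address(host, port, cert, key, peer_cert):
    """Source side: connect over TLS and verify the server certificate."""
    opts = ["cert=%s" % cert, "key=%s" % key,
            "cafile=%s" % peer_cert, "verify=1"]
    return "OPENSSL:%s:%d,%s" % (host, port, ",".join(opts))


# Hypothetical paths and endpoint, for illustration only.
listen = socat_listen_address(
    12345, "/tmp/dest.pem", "/tmp/dest.key", "/tmp/source.pem")
connect = socat_connect_address(
    "dest.example.com", 12345,
    "/tmp/source.pem", "/tmp/source.key", "/tmp/dest.pem")
```

Each string would be one of the two addresses on a socat command line; the other address (where the instance data is read from or written to) is left out here.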
412 | 5b2069a9 | Michael Hanselmann | |
413 | a7c6552d | Michael Hanselmann | For additional protection of the instance data, the two clusters can |
414 | f0476905 | Michael Hanselmann | verify the certificates and destination information exchanged via the |
415 | f0476905 | Michael Hanselmann | third party by checking an HMAC signature using a key shared among the |
416 | f0476905 | Michael Hanselmann | involved clusters. By default this secret key will be a random string |
417 | f0476905 | Michael Hanselmann | unique to the cluster, generated by running SHA1 over 20 bytes read from |
418 | f0476905 | Michael Hanselmann | ``/dev/urandom``. The administrator must synchronize the secrets |
419 | f0476905 | Michael Hanselmann | between clusters before instances can be moved. If the third party does |
420 | f0476905 | Michael Hanselmann | not know the secret, it can't forge the certificates or redirect the |
421 | f0476905 | Michael Hanselmann | data. Unless disabled by a new cluster parameter, verifying the HMAC |
422 | f0476905 | Michael Hanselmann | signatures must be mandatory. The HMAC signature for X509 certificates |
423 | f0476905 | Michael Hanselmann | will be prepended to the certificate similar to an RFC822 header and |
424 | f0476905 | Michael Hanselmann | only covers the certificate (from ``-----BEGIN CERTIFICATE-----`` to |
425 | f0476905 | Michael Hanselmann | ``-----END CERTIFICATE-----``). The header name will be |
426 | 68857643 | Michael Hanselmann | ``X-Ganeti-Signature`` and its value will have the format |
427 | 68857643 | Michael Hanselmann | ``$salt/$hash`` (salt and hash separated by slash). The salt may only |
428 | 68857643 | Michael Hanselmann | contain characters in the range ``[a-zA-Z0-9]``. |
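The signature header described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual implementation: the function names are hypothetical, and both the use of HMAC-SHA1 and the choice to feed the salt followed by the certificate text into the HMAC are assumptions. The design itself only fixes the header name, the ``$salt/$hash`` format and the part of the certificate that is covered.

```python
import hashlib
import hmac
import os
import re

HEADER = "X-Ganeti-Signature"
_SALT_RE = re.compile(r"[a-zA-Z0-9]+")
_CERT_RE = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----", re.S)


def generate_secret():
    # Default cluster secret: SHA1 over 20 random bytes
    return hashlib.sha1(os.urandom(20)).hexdigest()


def _digest(secret, salt, cert_body):
    # Assumption: salt and certificate text are concatenated before hashing
    msg = (salt + cert_body).encode("ascii")
    return hmac.new(secret.encode("ascii"), msg, hashlib.sha1).hexdigest()


def sign_cert(cert_pem, secret, salt):
    if not _SALT_RE.fullmatch(salt):
        raise ValueError("salt may only contain [a-zA-Z0-9]")
    match = _CERT_RE.search(cert_pem)
    if not match:
        raise ValueError("no PEM certificate found")
    body = match.group(0)
    # Prepend the signature like an RFC822 header
    return "%s: %s/%s\n%s\n" % (HEADER, salt, _digest(secret, salt, body), body)


def verify_cert(signed_pem, secret):
    header, _, rest = signed_pem.partition("\n")
    name, _, value = header.partition(": ")
    salt, _, digest = value.partition("/")
    match = _CERT_RE.search(rest)
    if name != HEADER or not match or not _SALT_RE.fullmatch(salt):
        return False
    expected = _digest(secret, salt, match.group(0))
    # Constant-time comparison of the two hex digests
    return hmac.compare_digest(expected, digest)
```

A third party that does not know the secret cannot produce a header that ``verify_cert`` accepts, which is the property the design relies on.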
429 | a7c6552d | Michael Hanselmann | |
430 | 5b2069a9 | Michael Hanselmann | On the web, the destination cluster would be equivalent to an HTTPS |
431 | 5b2069a9 | Michael Hanselmann | server requiring verifiable client certificates. The browser would be |
432 | 5b2069a9 | Michael Hanselmann | equivalent to the source cluster and must verify the server's |
433 | 5b2069a9 | Michael Hanselmann | certificate while providing a client certificate to the server. |
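To make the analogy concrete, the following sketch builds the two TLS configurations with Python's standard ``ssl`` module. It is an illustration only: the design uses socat for the actual transfer, the function name and its parameters are hypothetical, and disabling hostname checks (because peers are addressed by IP and use short-lived self-signed certificates) is an assumption.

```python
import ssl


def make_tls_context(server_side, cafile=None, certfile=None, keyfile=None):
    # Destination cluster acts as the server, source cluster as the client
    if server_side:
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    else:
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Assumption: identity is established by the exchanged certificate,
    # not by a hostname
    ctx.check_hostname = False
    # Neither side may accept an unverified peer certificate
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cafile:
        # The peer's certificate, as received via the third party
        ctx.load_verify_locations(cafile=cafile)
    if certfile:
        # This cluster's own certificate and private key
        ctx.load_cert_chain(certfile, keyfile)
    return ctx
```

Both contexts require a certificate from the peer, so a connection only succeeds when each side presents the certificate the other side was told to trust.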
434 | 5b2069a9 | Michael Hanselmann | |
435 | 5b2069a9 | Michael Hanselmann | Copying data |
436 | 5b2069a9 | Michael Hanselmann | ^^^^^^^^^^^^ |
437 | 5b2069a9 | Michael Hanselmann | |
438 | 5b2069a9 | Michael Hanselmann | To simplify the implementation, we decided to operate at a block-device |
439 | 5b2069a9 | Michael Hanselmann | level only, allowing us to easily support non-DRBD instance moves. |
440 | 5b2069a9 | Michael Hanselmann | |
441 | 5b2069a9 | Michael Hanselmann | Inter-cluster instance moves will re-use the existing export and import |
442 | 5b2069a9 | Michael Hanselmann | scripts supplied by instance OS definitions. Unlike simply copying the |
443 | 5b2069a9 | Michael Hanselmann | raw data, this allows using filesystem-specific utilities to dump only |
444 | 5b2069a9 | Michael Hanselmann | used parts of the disk and to exclude certain disks from the move. |
445 | 5b2069a9 | Michael Hanselmann | Compression should be used to further reduce the amount of data |
446 | 5b2069a9 | Michael Hanselmann | transferred. |
447 | 5b2069a9 | Michael Hanselmann | |
448 | 5b2069a9 | Michael Hanselmann | The export script writes all data to stdout and the import script reads |
449 | 5b2069a9 | Michael Hanselmann | it from stdin again. To avoid copying data and reduce disk space |
450 | 5b2069a9 | Michael Hanselmann | consumption, everything is read from the disk and sent over the network |
451 | 5b2069a9 | Michael Hanselmann | directly, where it'll be written to the new block device directly again. |
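The streaming approach can be illustrated with a small Python sketch that copies data chunk by chunk through a compressor, so that no intermediate copy is ever stored. The function names, the chunk size and the use of zlib are assumptions made for this example; in the design the data comes from the OS definition's export script and is consumed by its import script.

```python
import zlib

CHUNK_SIZE = 64 * 1024


def export_stream(disk, out):
    # Read the source in chunks and write compressed data to the "network"
    comp = zlib.compressobj()
    while True:
        chunk = disk.read(CHUNK_SIZE)
        if not chunk:
            break
        out.write(comp.compress(chunk))
    out.write(comp.flush())


def import_stream(inp, disk):
    # Read compressed data from the "network" and write it to the new disk
    decomp = zlib.decompressobj()
    while True:
        chunk = inp.read(CHUNK_SIZE)
        if not chunk:
            break
        disk.write(decomp.decompress(chunk))
    disk.write(decomp.flush())
```

Any pair of file-like objects works here, for example a block device on one side and a socket's file object on the other.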
452 | 5b2069a9 | Michael Hanselmann | |
453 | 5b2069a9 | Michael Hanselmann | Workflow |
454 | 5b2069a9 | Michael Hanselmann | ^^^^^^^^ |
455 | 5b2069a9 | Michael Hanselmann | |
456 | 5b2069a9 | Michael Hanselmann | #. Third party tells source cluster to shut down instance, asks for the |
457 | 5b2069a9 | Michael Hanselmann | instance specification and for the public part of an encryption key |
458 | f0476905 | Michael Hanselmann | |
459 | f0476905 | Michael Hanselmann | - Instance information can already be retrieved using an existing API |
460 | f0476905 | Michael Hanselmann | (``OpQueryInstanceData``). |
461 | f0476905 | Michael Hanselmann | - An RSA encryption key and a corresponding self-signed X509 |
462 | f0476905 | Michael Hanselmann | certificate are generated using the "openssl" command. This key will |
463 | f0476905 | Michael Hanselmann | be used to encrypt the data sent to the destination cluster. |
464 | f0476905 | Michael Hanselmann | |
465 | f0476905 | Michael Hanselmann | - Private keys never leave the cluster. |
466 | f0476905 | Michael Hanselmann | - The public part (the X509 certificate) is signed using HMAC with |
467 | f0476905 | Michael Hanselmann | salting and a secret shared between Ganeti clusters. |
468 | f0476905 | Michael Hanselmann | |
469 | 5b2069a9 | Michael Hanselmann | #. Third party tells destination cluster to create an instance with the |
470 | 5b2069a9 | Michael Hanselmann | same specifications as on source cluster and to prepare for an |
471 | 5b2069a9 | Michael Hanselmann | instance move with the key received from the source cluster and |
472 | 5b2069a9 | Michael Hanselmann | receives the public part of the destination's encryption key |
473 | f0476905 | Michael Hanselmann | |
474 | f0476905 | Michael Hanselmann | - The current API to create instances (``OpCreateInstance``) will be |
475 | f0476905 | Michael Hanselmann | extended to support an import from a remote cluster. |
476 | f0476905 | Michael Hanselmann | - A valid, unexpired X509 certificate signed with the destination |
477 | f0476905 | Michael Hanselmann | cluster's secret will be required. By verifying the signature, we |
478 | f0476905 | Michael Hanselmann | know the third party didn't modify the certificate. |
479 | f0476905 | Michael Hanselmann | |
480 | f0476905 | Michael Hanselmann | - The private keys never leave their cluster, hence the third party |
481 | f0476905 | Michael Hanselmann | cannot decrypt or intercept the instance's data by modifying the |
482 | f0476905 | Michael Hanselmann | IP address or port sent by the destination cluster. |
483 | f0476905 | Michael Hanselmann | |
484 | f0476905 | Michael Hanselmann | - The destination cluster generates another key and certificate, |
485 | f0476905 | Michael Hanselmann | signs and sends it to the third party, who will have to pass it to |
486 | f0476905 | Michael Hanselmann | the API for exporting an instance (``OpExportInstance``). This |
487 | f0476905 | Michael Hanselmann | certificate is used to ensure we're sending the disk data to the |
488 | f0476905 | Michael Hanselmann | correct destination cluster. |
489 | f0476905 | Michael Hanselmann | - Once a disk can be imported, the API sends the destination |
490 | f0476905 | Michael Hanselmann | information (IP address and TCP port) together with an HMAC |
491 | f0476905 | Michael Hanselmann | signature to the third party. |
492 | f0476905 | Michael Hanselmann | |
493 | 5b2069a9 | Michael Hanselmann | #. Third party hands public part of the destination's encryption key |
494 | 5b2069a9 | Michael Hanselmann | together with all necessary information to source cluster and tells |
495 | 5b2069a9 | Michael Hanselmann | it to start the move |
496 | f0476905 | Michael Hanselmann | |
497 | f0476905 | Michael Hanselmann | - The existing API for exporting instances (``OpExportInstance``) |
498 | f0476905 | Michael Hanselmann | will be extended to export instances to remote clusters. |
499 | f0476905 | Michael Hanselmann | |
500 | 5b2069a9 | Michael Hanselmann | #. Source cluster connects to destination cluster for each disk and |
501 | 5b2069a9 | Michael Hanselmann | transfers its data using the instance OS definition's export and |
502 | 5b2069a9 | Michael Hanselmann | import scripts |
503 | f0476905 | Michael Hanselmann | |
504 | f0476905 | Michael Hanselmann | - Before starting, the source cluster must verify the HMAC signature |
505 | f0476905 | Michael Hanselmann | of the certificate and destination information (IP address and TCP |
506 | f0476905 | Michael Hanselmann | port). |
507 | f0476905 | Michael Hanselmann | - When connecting to the remote machine, strong certificate checks |
508 | f0476905 | Michael Hanselmann | must be employed. |
509 | f0476905 | Michael Hanselmann | |
510 | 5b2069a9 | Michael Hanselmann | #. Due to the asynchronous nature of the whole process, the destination |
511 | 5b2069a9 | Michael Hanselmann | cluster checks whether all disks have been transferred every time |
512 | f0476905 | Michael Hanselmann | after transferring a single disk; if so, it destroys the encryption |
513 | 5b2069a9 | Michael Hanselmann | key |
514 | 5b2069a9 | Michael Hanselmann | #. After sending all disks, the source cluster destroys its key |
515 | 5b2069a9 | Michael Hanselmann | #. Destination cluster runs OS definition's rename script to adjust |
516 | 5b2069a9 | Michael Hanselmann | instance settings if needed (e.g. IP address) |
517 | 5b2069a9 | Michael Hanselmann | #. Destination cluster starts the instance if requested at the beginning |
518 | 5b2069a9 | Michael Hanselmann | by the third party |
519 | 5b2069a9 | Michael Hanselmann | #. Source cluster removes the instance if requested |
520 | 5b2069a9 | Michael Hanselmann | |
521 | f0476905 | Michael Hanselmann | Instance move in pseudo code |
522 | f0476905 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
523 | f0476905 | Michael Hanselmann | |
524 | f0476905 | Michael Hanselmann | .. highlight:: python |
525 | f0476905 | Michael Hanselmann | |
526 | f0476905 | Michael Hanselmann | The following pseudo code describes a script moving instances between |
527 | f0476905 | Michael Hanselmann | clusters and what happens on both clusters. |
528 | f0476905 | Michael Hanselmann | |
529 | f0476905 | Michael Hanselmann | #. Script is started, gets the instance name and destination cluster:: |
530 | f0476905 | Michael Hanselmann | |
531 | f0476905 | Michael Hanselmann | (instance_name, dest_cluster_name) = sys.argv[1:] |
532 | f0476905 | Michael Hanselmann | |
533 | f0476905 | Michael Hanselmann | # Get destination cluster object |
534 | f0476905 | Michael Hanselmann | dest_cluster = db.FindCluster(dest_cluster_name) |
535 | f0476905 | Michael Hanselmann | |
536 | f0476905 | Michael Hanselmann | # Use database to find source cluster |
537 | f0476905 | Michael Hanselmann | src_cluster = db.FindClusterByInstance(instance_name) |
538 | f0476905 | Michael Hanselmann | |
539 | f0476905 | Michael Hanselmann | #. Script tells source cluster to stop instance:: |
540 | f0476905 | Michael Hanselmann | |
541 | f0476905 | Michael Hanselmann | # Stop instance |
542 | f0476905 | Michael Hanselmann | src_cluster.StopInstance(instance_name) |
543 | f0476905 | Michael Hanselmann | |
544 | f0476905 | Michael Hanselmann | # Get instance specification (memory, disk, etc.) |
545 | f0476905 | Michael Hanselmann | inst_spec = src_cluster.GetInstanceInfo(instance_name) |
546 | f0476905 | Michael Hanselmann | |
547 | f0476905 | Michael Hanselmann | (src_key_name, src_cert) = src_cluster.CreateX509Certificate() |
548 | f0476905 | Michael Hanselmann | |
549 | f0476905 | Michael Hanselmann | #. ``CreateX509Certificate`` on source cluster:: |
550 | f0476905 | Michael Hanselmann | |
551 | f0476905 | Michael Hanselmann | key_file = mkstemp() |
552 | f0476905 | Michael Hanselmann | cert_file = "%s.cert" % key_file |
553 | f0476905 | Michael Hanselmann | RunCmd(["/usr/bin/openssl", "req", "-new", |
554 | f0476905 | Michael Hanselmann | "-newkey", "rsa:1024", "-days", "1", |
555 | f0476905 | Michael Hanselmann | "-nodes", "-x509", "-batch", |
556 | f0476905 | Michael Hanselmann | "-keyout", key_file, "-out", cert_file]) |
557 | f0476905 | Michael Hanselmann | |
558 | f0476905 | Michael Hanselmann | plain_cert = utils.ReadFile(cert_file) |
559 | f0476905 | Michael Hanselmann | |
560 | f0476905 | Michael Hanselmann | # HMAC sign using secret key, this adds a "X-Ganeti-Signature" |
561 | f0476905 | Michael Hanselmann | # header to the beginning of the certificate |
562 | f0476905 | Michael Hanselmann | signed_cert = utils.SignX509Certificate(plain_cert, |
563 | f0476905 | Michael Hanselmann | utils.ReadFile(constants.X509_SIGNKEY_FILE)) |
564 | f0476905 | Michael Hanselmann | |
565 | f0476905 | Michael Hanselmann | # The certificate now looks like the following: |
566 | f0476905 | Michael Hanselmann | # |
567 | f0476905 | Michael Hanselmann | # X-Ganeti-Signature: 1234/28676f0516c6ab68062b[…] |
568 | f0476905 | Michael Hanselmann | # -----BEGIN CERTIFICATE----- |
569 | f0476905 | Michael Hanselmann | # MIICsDCCAhmgAwIBAgI[…] |
570 | f0476905 | Michael Hanselmann | # -----END CERTIFICATE----- |
571 | f0476905 | Michael Hanselmann | |
572 | f0476905 | Michael Hanselmann | # Return name of key file and signed certificate in PEM format |
573 | f0476905 | Michael Hanselmann | return (os.path.basename(key_file), signed_cert) |
574 | f0476905 | Michael Hanselmann | |
575 | f0476905 | Michael Hanselmann | #. Script creates instance on destination cluster and waits for move to |
576 | f0476905 | Michael Hanselmann | finish:: |
577 | f0476905 | Michael Hanselmann | |
578 | f0476905 | Michael Hanselmann | dest_cluster.CreateInstance(mode=constants.REMOTE_IMPORT, |
579 | f0476905 | Michael Hanselmann | spec=inst_spec, |
580 | f0476905 | Michael Hanselmann | source_cert=src_cert) |
581 | f0476905 | Michael Hanselmann | |
582 | f0476905 | Michael Hanselmann | # Wait until destination cluster gives us its certificate |
583 | f0476905 | Michael Hanselmann | dest_cert = None |
584 | f0476905 | Michael Hanselmann | disk_info = [] |
585 | f0476905 | Michael Hanselmann | while not (dest_cert and len(disk_info) == len(inst_spec.disks)): |
586 | f0476905 | Michael Hanselmann | tmp = dest_cluster.WaitOutput() |
587 | f0476905 | Michael Hanselmann | if isinstance(tmp, Certificate): |
588 | f0476905 | Michael Hanselmann | dest_cert = tmp |
589 | f0476905 | Michael Hanselmann | elif isinstance(tmp, DiskInfo): |
590 | f0476905 | Michael Hanselmann | # DiskInfo contains destination address and port |
591 | f0476905 | Michael Hanselmann | disk_info[tmp.index] = tmp |
592 | f0476905 | Michael Hanselmann | |
593 | f0476905 | Michael Hanselmann | # Tell source cluster to export disks |
594 | f0476905 | Michael Hanselmann | for disk in disk_info: |
595 | f0476905 | Michael Hanselmann | src_cluster.ExportDisk(instance_name, disk=disk, |
596 | f0476905 | Michael Hanselmann | key_name=src_key_name, |
597 | f0476905 | Michael Hanselmann | dest_cert=dest_cert) |
598 | f0476905 | Michael Hanselmann | |
599 | f0476905 | Michael Hanselmann | print ("Instance %s successfully moved to %s" % |
600 | f0476905 | Michael Hanselmann | (instance_name, dest_cluster.name)) |
601 | f0476905 | Michael Hanselmann | |
602 | f0476905 | Michael Hanselmann | #. ``CreateInstance`` on destination cluster:: |
603 | f0476905 | Michael Hanselmann | |
604 | f0476905 | Michael Hanselmann | # … |
605 | f0476905 | Michael Hanselmann | |
606 | f0476905 | Michael Hanselmann | if mode == constants.REMOTE_IMPORT: |
607 | f0476905 | Michael Hanselmann | # Make sure certificate was not modified since it was generated by |
608 | f0476905 | Michael Hanselmann | # source cluster (which must use the same secret) |
609 | f0476905 | Michael Hanselmann | if (not utils.VerifySignedX509Cert(source_cert, |
610 | f0476905 | Michael Hanselmann | utils.ReadFile(constants.X509_SIGNKEY_FILE))): |
611 | f0476905 | Michael Hanselmann | raise Error("Certificate not signed with this cluster's secret") |
612 | f0476905 | Michael Hanselmann | |
613 | f0476905 | Michael Hanselmann | if utils.CheckExpiredX509Cert(source_cert): |
614 | f0476905 | Michael Hanselmann | raise Error("X509 certificate is expired") |
615 | f0476905 | Michael Hanselmann | |
616 | f0476905 | Michael Hanselmann | source_cert_file = utils.WriteTempFile(source_cert) |
617 | f0476905 | Michael Hanselmann | |
618 | f0476905 | Michael Hanselmann | # See above for X509 certificate generation and signing |
619 | f0476905 | Michael Hanselmann | (key_name, signed_cert) = CreateSignedX509Certificate() |
620 | f0476905 | Michael Hanselmann | |
621 | f0476905 | Michael Hanselmann | SendToClient("x509-cert", signed_cert) |
622 | f0476905 | Michael Hanselmann | |
623 | f0476905 | Michael Hanselmann | for disk in instance.disks: |
624 | f0476905 | Michael Hanselmann | # Start socat |
625 | f0476905 | Michael Hanselmann | RunCmd(("socat" |
626 | f0476905 | Michael Hanselmann | " OPENSSL-LISTEN:%s,…,key=%s,cert=%s,cafile=%s,verify=1" |
627 | f0476905 | Michael Hanselmann | " stdout > /dev/disk…") % |
628 | f0476905 | Michael Hanselmann | (port, GetRsaKeyPath(key_name, private=True), |
629 | f0476905 | Michael Hanselmann | GetRsaKeyPath(key_name, private=False), source_cert_file)) |
630 | f0476905 | Michael Hanselmann | SendToClient("send-disk-to", disk, ip_address, port) |
631 | f0476905 | Michael Hanselmann | |
632 | f0476905 | Michael Hanselmann | DestroyX509Cert(key_name) |
633 | f0476905 | Michael Hanselmann | |
634 | f0476905 | Michael Hanselmann | RunRenameScript(instance_name) |
635 | f0476905 | Michael Hanselmann | |
636 | f0476905 | Michael Hanselmann | #. ``ExportDisk`` on source cluster:: |
637 | f0476905 | Michael Hanselmann | |
638 | f0476905 | Michael Hanselmann | # Make sure certificate was not modified since it was generated by |
639 | f0476905 | Michael Hanselmann | # destination cluster (which must use the same secret) |
640 | f0476905 | Michael Hanselmann | if (not utils.VerifySignedX509Cert(cert_pem, |
641 | f0476905 | Michael Hanselmann | utils.ReadFile(constants.X509_SIGNKEY_FILE))): |
642 | f0476905 | Michael Hanselmann | raise Error("Certificate not signed with this cluster's secret") |
643 | f0476905 | Michael Hanselmann | |
644 | f0476905 | Michael Hanselmann | if utils.CheckExpiredX509Cert(cert_pem): |
645 | f0476905 | Michael Hanselmann | raise Error("X509 certificate is expired") |
646 | f0476905 | Michael Hanselmann | |
647 | f0476905 | Michael Hanselmann | dest_cert_file = utils.WriteTempFile(cert_pem) |
648 | f0476905 | Michael Hanselmann | |
649 | f0476905 | Michael Hanselmann | # Start socat |
650 | f0476905 | Michael Hanselmann | RunCmd(("socat stdin" |
651 | f0476905 | Michael Hanselmann | " OPENSSL:%s:%s,…,key=%s,cert=%s,cafile=%s,verify=1" |
652 | f0476905 | Michael Hanselmann | " < /dev/disk…") % |
653 | f0476905 | Michael Hanselmann | (disk.host, disk.port, |
654 | f0476905 | Michael Hanselmann | GetRsaKeyPath(key_name, private=True), |
655 | f0476905 | Michael Hanselmann | GetRsaKeyPath(key_name, private=False), dest_cert_file)) |
656 | f0476905 | Michael Hanselmann | |
657 | f0476905 | Michael Hanselmann | if instance.all_disks_done: |
658 | f0476905 | Michael Hanselmann | DestroyX509Cert(key_name) |
659 | f0476905 | Michael Hanselmann | |
660 | f0476905 | Michael Hanselmann | .. highlight:: text |
661 | f0476905 | Michael Hanselmann | |
662 | 5b2069a9 | Michael Hanselmann | Miscellaneous notes |
663 | 5b2069a9 | Michael Hanselmann | ^^^^^^^^^^^^^^^^^^^ |
664 | 5b2069a9 | Michael Hanselmann | |
665 | 5b2069a9 | Michael Hanselmann | - A very similar system could also be used for instance exports within |
666 | 5b2069a9 | Michael Hanselmann | the same cluster. Currently OpenSSH is being used, but could be |
667 | 5b2069a9 | Michael Hanselmann | replaced by socat and SSL/TLS. |
668 | 5b2069a9 | Michael Hanselmann | - During the design of inter-cluster instance moves we also discussed |
669 | 5b2069a9 | Michael Hanselmann | encrypting instance exports using GnuPG. |
670 | 5b2069a9 | Michael Hanselmann | - While most instances should have exactly the same configuration as |
671 | 5b2069a9 | Michael Hanselmann | on the source cluster, setting them up with a different disk layout |
672 | 5b2069a9 | Michael Hanselmann | might be helpful in some use-cases. |
673 | 5b2069a9 | Michael Hanselmann | - A cleanup operation, similar to the one available for failed instance |
674 | 5b2069a9 | Michael Hanselmann | migrations, should be provided. |
675 | 5b2069a9 | Michael Hanselmann | - ``ganeti-watcher`` should remove instances pending a move from another |
676 | 5b2069a9 | Michael Hanselmann | cluster after a certain amount of time. This takes care of failures |
677 | 5b2069a9 | Michael Hanselmann | somewhere in the process. |
678 | 5b2069a9 | Michael Hanselmann | - RSA keys can be generated using the existing |
679 | 5b2069a9 | Michael Hanselmann | ``bootstrap.GenerateSelfSignedSslCert`` function, though it might be |
680 | 5b2069a9 | Michael Hanselmann | useful to not write both parts into a single file, requiring small |
681 | 5b2069a9 | Michael Hanselmann | changes to the function. The public part always starts with |
682 | 5b2069a9 | Michael Hanselmann | ``-----BEGIN CERTIFICATE-----`` and ends with ``-----END |
683 | 5b2069a9 | Michael Hanselmann | CERTIFICATE-----``. |
684 | 5b2069a9 | Michael Hanselmann | - The source and destination cluster might be different when it comes |
685 | 5b2069a9 | Michael Hanselmann | to available hypervisors, kernels, etc. The destination cluster should |
686 | 5b2069a9 | Michael Hanselmann | refuse to accept an instance move if it can't fulfill an instance's |
687 | 5b2069a9 | Michael Hanselmann | requirements. |
688 | 5b2069a9 | Michael Hanselmann | |
689 | 5b2069a9 | Michael Hanselmann | |
690 | e56bb0e8 | Guido Trotter | Feature changes |
691 | e56bb0e8 | Guido Trotter | --------------- |
692 | e56bb0e8 | Guido Trotter | |
693 | 8388e9ff | Guido Trotter | KVM Security |
694 | 8388e9ff | Guido Trotter | ~~~~~~~~~~~~ |
695 | 8388e9ff | Guido Trotter | |
696 | 8388e9ff | Guido Trotter | Current state and shortcomings |
697 | 8388e9ff | Guido Trotter | ++++++++++++++++++++++++++++++ |
698 | 8388e9ff | Guido Trotter | |
699 | 8388e9ff | Guido Trotter | Currently all kvm processes run as root. Taking ownership of the |
700 | 8388e9ff | Guido Trotter | hypervisor process, from inside a virtual machine, would mean a full |
701 | 8388e9ff | Guido Trotter | compromise of the whole Ganeti cluster, knowledge of all Ganeti |
702 | 8388e9ff | Guido Trotter | authentication secrets, full access to all running instances, and the |
703 | 8388e9ff | Guido Trotter | option of subverting other basic services on the cluster (eg: ssh). |
704 | 8388e9ff | Guido Trotter | |
705 | 8388e9ff | Guido Trotter | Proposed changes |
706 | 8388e9ff | Guido Trotter | ++++++++++++++++ |
707 | 8388e9ff | Guido Trotter | |
708 | 8388e9ff | Guido Trotter | We would like to decrease the attack surface available if a |
709 | 8388e9ff | Guido Trotter | hypervisor is compromised. We can do so by adding different features to |
710 | 8388e9ff | Guido Trotter | Ganeti, which will restrict the ways in which a compromised hypervisor |
711 | 8388e9ff | Guido Trotter | can subvert the node, in the absence of a local privilege escalation |
712 | 8388e9ff | Guido Trotter | attack. |
713 | 8388e9ff | Guido Trotter | |
714 | 8388e9ff | Guido Trotter | Dropping privileges in kvm to a single user (easy) |
715 | 8388e9ff | Guido Trotter | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
716 | 8388e9ff | Guido Trotter | |
717 | 8388e9ff | Guido Trotter | By passing the ``-runas`` option to kvm, we can make it drop privileges. |
718 | 8388e9ff | Guido Trotter | The user can be chosen by a hypervisor parameter, so that each instance |
719 | 8388e9ff | Guido Trotter | can have its own user, but by default they will all run under the same |
720 | 8388e9ff | Guido Trotter | one. It should be very easy to implement, and can easily be backported |
721 | 8388e9ff | Guido Trotter | to 2.1.X. |
722 | 8388e9ff | Guido Trotter | |
723 | 8388e9ff | Guido Trotter | This mode protects the Ganeti cluster from a subverted hypervisor, but |
724 | 8388e9ff | Guido Trotter | doesn't protect the instances from each other, unless care is taken |
725 | 8388e9ff | Guido Trotter | to specify a different user for each. This would prevent the worst |
726 | 8388e9ff | Guido Trotter | attacks, including: |
727 | 8388e9ff | Guido Trotter | |
728 | 8388e9ff | Guido Trotter | - logging in to other nodes |
729 | 8388e9ff | Guido Trotter | - administering the Ganeti cluster |
730 | 8388e9ff | Guido Trotter | - subverting other services |
731 | 8388e9ff | Guido Trotter | |
732 | 8388e9ff | Guido Trotter | But the following would remain an option: |
733 | 8388e9ff | Guido Trotter | |
734 | 8388e9ff | Guido Trotter | - terminate other VMs (but not start them again, as that requires root |
735 | 8388e9ff | Guido Trotter | privileges to set up networking) (unless different users are used) |
736 | 8388e9ff | Guido Trotter | - trace other VMs, and probably subvert them and access their data |
737 | 8388e9ff | Guido Trotter | (unless different users are used) |
738 | 8388e9ff | Guido Trotter | - send network traffic from the node |
739 | 8388e9ff | Guido Trotter | - read unprotected data on the node filesystem |
740 | 8388e9ff | Guido Trotter | |
741 | 8388e9ff | Guido Trotter | Running kvm in a chroot (slightly harder) |
742 | 8388e9ff | Guido Trotter | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
743 | 8388e9ff | Guido Trotter | |
744 | 8388e9ff | Guido Trotter | By passing the ``-chroot`` option to kvm, we can restrict the kvm |
745 | 8388e9ff | Guido Trotter | process in its own (possibly empty) root directory. We need to set this |
746 | 8388e9ff | Guido Trotter | area up so that the instance disks and control sockets are accessible, |
747 | 8388e9ff | Guido Trotter | so it would require slightly more work at the Ganeti level. |
748 | 8388e9ff | Guido Trotter | |
749 | 8388e9ff | Guido Trotter | Breaking out of the VM into a chrooted hypervisor would mean: |
750 | 8388e9ff | Guido Trotter | |
751 | 8388e9ff | Guido Trotter | - far fewer options for finding a local privilege escalation vector |
752 | 8388e9ff | Guido Trotter | - no ability to write local data, if the chroot is set up |
753 | 8388e9ff | Guido Trotter | correctly |
754 | 8388e9ff | Guido Trotter | - no ability to read filesystem data on the host |
755 | 8388e9ff | Guido Trotter | |
756 | 8388e9ff | Guido Trotter | It would still be possible though to: |
757 | 8388e9ff | Guido Trotter | |
758 | 8388e9ff | Guido Trotter | - terminate other VMs |
759 | 8388e9ff | Guido Trotter | - trace other VMs, and possibly subvert them (if a tracer can be |
760 | 8388e9ff | Guido Trotter | installed in the chroot) |
761 | 8388e9ff | Guido Trotter | - send network traffic from the node |
762 | 8388e9ff | Guido Trotter | |
763 | 8388e9ff | Guido Trotter | |
764 | 8388e9ff | Guido Trotter | Running kvm with a pool of users (slightly harder) |
765 | 8388e9ff | Guido Trotter | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
766 | 8388e9ff | Guido Trotter | |
767 | 8388e9ff | Guido Trotter | If, rather than passing a single user as a hypervisor parameter, we have |
768 | 8388e9ff | Guido Trotter | a pool of usable ones, we can dynamically choose a free one to use and |
769 | 8388e9ff | Guido Trotter | thus guarantee that each machine will be separate from the others, |
770 | 8388e9ff | Guido Trotter | without putting the burden of this on the cluster administrator. |
771 | 8388e9ff | Guido Trotter | |
772 | 8388e9ff | Guido Trotter | This would make interference between machines impossible, and |
773 | 8388e9ff | Guido Trotter | can still be combined with the chroot benefits. |
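Choosing a user from the pool could look roughly like the sketch below. The function names and the data structures (a list of pool users and a mapping from running instances to their users) are hypothetical and only serve the example; the only element taken from the design is the ``-runas`` option passed to kvm.

```python
def pick_free_user(pool, running):
    """Return the first pool user not used by any running instance.

    pool: list of user names reserved for kvm processes
    running: dict mapping instance name to the user it runs as
    """
    in_use = set(running.values())
    for user in pool:
        if user not in in_use:
            return user
    raise RuntimeError("no free user left in the pool")


def kvm_command(base_cmd, user):
    # Make kvm drop privileges to the chosen user
    return list(base_cmd) + ["-runas", user]
```

For example, with a pool of three users of which two are taken, a newly started instance gets the remaining one, and no administrator action is needed.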
774 | 8388e9ff | Guido Trotter | |
775 | 8388e9ff | Guido Trotter | Running iptables rules to limit network interaction (easy) |
776 | 8388e9ff | Guido Trotter | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
777 | 8388e9ff | Guido Trotter | |
778 | 8388e9ff | Guido Trotter | These don't need to be handled by Ganeti, but we can ship examples. If |
779 | 8388e9ff | Guido Trotter | the users used to run VMs are blocked from sending some or all |
780 | 8388e9ff | Guido Trotter | network traffic, it becomes impossible for a compromised hypervisor |
781 | 8388e9ff | Guido Trotter | to send arbitrary data on the node network, which is especially useful |
782 | 8388e9ff | Guido Trotter | when the instance and the node network are separated (using ganeti-nbma |
783 | 8388e9ff | Guido Trotter | or a separate set of network interfaces), or when a separate replication |
784 | 8388e9ff | Guido Trotter | network is maintained. We need to experiment to see how much restriction |
785 | 8388e9ff | Guido Trotter | we can properly apply without limiting the instance's legitimate traffic. |
786 | 8388e9ff | Guido Trotter | |
787 | 8388e9ff | Guido Trotter | |
788 | 8388e9ff | Guido Trotter | Running kvm inside a container (even harder) |
789 | 8388e9ff | Guido Trotter | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
790 | 8388e9ff | Guido Trotter | |
791 | 8388e9ff | Guido Trotter | Recent Linux kernels support different process namespaces through |
792 | 8388e9ff | Guido Trotter | control groups. PIDs, users, filesystems and even network interfaces can |
793 | 8388e9ff | Guido Trotter | be separated. If we can set up Ganeti to run kvm in a separate container |
794 | 8388e9ff | Guido Trotter | we could insulate all the host processes from being even visible if the |
795 | 8388e9ff | Guido Trotter | hypervisor gets broken into. Most probably separating the network |
796 | 8388e9ff | Guido Trotter | namespace would require one extra hop in the host, through a veth |
797 | 8388e9ff | Guido Trotter | interface, thus reducing performance, so we may want to avoid that, and |
798 | 8388e9ff | Guido Trotter | just rely on iptables. |
799 | 8388e9ff | Guido Trotter | |
800 | 8388e9ff | Guido Trotter | Implementation plan |
801 | 8388e9ff | Guido Trotter | +++++++++++++++++++ |
802 | 8388e9ff | Guido Trotter | |
803 | 8388e9ff | Guido Trotter | We will first implement dropping privileges for kvm processes as a |
804 | 8388e9ff | Guido Trotter | single user, and most probably backport it to 2.1. Then we'll ship |
805 | 8388e9ff | Guido Trotter | example iptables rules to show how the user can be limited in its |
806 | 8388e9ff | Guido Trotter | network activities. After that we'll implement chroot restriction for |
807 | 8388e9ff | Guido Trotter | kvm processes, and extend the user limitation to use a user pool. |
808 | 8388e9ff | Guido Trotter | |
809 | 8388e9ff | Guido Trotter | Finally we'll look into namespaces and containers, although that might |
810 | 8388e9ff | Guido Trotter | slip after the 2.2 release. |
811 | 8388e9ff | Guido Trotter | |
812 | 545d1f1a | Iustin Pop | |
813 | e56bb0e8 | Guido Trotter | External interface changes |
814 | e56bb0e8 | Guido Trotter | -------------------------- |
815 | e56bb0e8 | Guido Trotter | |
816 | 545d1f1a | Iustin Pop | |
817 | 545d1f1a | Iustin Pop | OS API |
818 | 545d1f1a | Iustin Pop | ~~~~~~ |
819 | 545d1f1a | Iustin Pop | |
820 | 545d1f1a | Iustin Pop | The OS variants implementation in Ganeti 2.1 didn't prove to be useful |
821 | 545d1f1a | Iustin Pop | enough to alleviate the need to hack around the Ganeti API in order to |
822 | 545d1f1a | Iustin Pop | provide flexible OS parameters. |
823 | 545d1f1a | Iustin Pop | |
824 | 545d1f1a | Iustin Pop | As such, for Ganeti 2.2 we will provide support for arbitrary OS |
825 | 545d1f1a | Iustin Pop | parameters. However, since OSes are not registered in Ganeti, but |
826 | 545d1f1a | Iustin Pop | instead discovered at runtime, the interface is not entirely |
827 | 545d1f1a | Iustin Pop | straightforward. |
828 | 545d1f1a | Iustin Pop | |
829 | 545d1f1a | Iustin Pop | Furthermore, to support the system administrator in keeping OSes |
830 | 545d1f1a | Iustin Pop | properly in sync across the nodes of a cluster, Ganeti will also verify |
831 | 545d1f1a | Iustin Pop | the consistency of a new ``os_version`` file, if present. |
832 | 545d1f1a | Iustin Pop | |
833 | 545d1f1a | Iustin Pop | These changes to the OS API will bump the API version to 20. |
834 | 545d1f1a | Iustin Pop | |
835 | 545d1f1a | Iustin Pop | |
836 | 545d1f1a | Iustin Pop | OS version |
837 | 545d1f1a | Iustin Pop | ++++++++++ |
838 | 545d1f1a | Iustin Pop | |
839 | 545d1f1a | Iustin Pop | A new ``os_version`` file will be supported by Ganeti. This file is not |
840 | 545d1f1a | Iustin Pop | required, but if it exists, its contents will be checked for consistency |
841 | 545d1f1a | Iustin Pop | across nodes. The file should hold only one line of text (any extra data |
842 | 545d1f1a | Iustin Pop | will be discarded), and its contents will be shown in the OS information |
843 | 545d1f1a | Iustin Pop | and diagnose commands. |
844 | 545d1f1a | Iustin Pop | |
845 | 545d1f1a | Iustin Pop | It is recommended that OS authors increase the version in this file for |
846 | 545d1f1a | Iustin Pop | any changes; at a minimum, modifications that change the behaviour of |
847 | 545d1f1a | Iustin Pop | import/export scripts must increase the version, since they break |
848 | 545d1f1a | Iustin Pop | intra-cluster migration. |
849 | 545d1f1a | Iustin Pop | |
850 | 545d1f1a | Iustin Pop | Parameters |
851 | 545d1f1a | Iustin Pop | ++++++++++ |
852 | 545d1f1a | Iustin Pop | |
853 | 545d1f1a | Iustin Pop | The interface between Ganeti and the OS scripts will be based on |
854 | 545d1f1a | Iustin Pop | environment variables, and as such the parameters and their values will |
855 | 545d1f1a | Iustin Pop | need to be valid in this context. |
856 | 545d1f1a | Iustin Pop | |
857 | 545d1f1a | Iustin Pop | Names |
858 | 545d1f1a | Iustin Pop | ^^^^^ |
859 | 545d1f1a | Iustin Pop | |
860 | 545d1f1a | Iustin Pop | The parameter names will be declared in a new file, ``parameters.list``, |
861 | 545d1f1a | Iustin Pop | together with a one-line documentation (whitespace-separated). Example:: |
862 | 545d1f1a | Iustin Pop | |
863 | 545d1f1a | Iustin Pop | $ cat parameters.list |
864 | 545d1f1a | Iustin Pop | ns1 Specifies the first name server to add to /etc/resolv.conf |
865 | 545d1f1a | Iustin Pop | extra_packages Specifies additional packages to install |
866 | 545d1f1a | Iustin Pop | rootfs_size Specifies the root filesystem size (the rest will be left unallocated) |
867 | 545d1f1a | Iustin Pop | track Specifies the distribution track, one of 'stable', 'testing' or 'unstable' |
868 | 545d1f1a | Iustin Pop | |
869 | 545d1f1a | Iustin Pop | As seen above, the documentation can be separated from the names by |
870 | 545d1f1a | Iustin Pop | multiple spaces/tabs. |
871 | 545d1f1a | Iustin Pop | |
872 | 545d1f1a | Iustin Pop | The parameter names as read from the file will be used for the command |
873 | 545d1f1a | Iustin Pop | line interface in lowercased form; as such, there shouldn't be any two |
874 | 545d1f1a | Iustin Pop | parameters which differ in case only. |
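
As a rough illustration of the rules above, the following sketch parses the contents of a ``parameters.list`` file and rejects names that differ in case only. The function name ``parse_parameters_list`` is hypothetical and not part of Ganeti; the error handling is an assumption.

```python
def parse_parameters_list(text):
    """Parse parameters.list contents into a {lowercased name: doc} dict.

    Hypothetical helper, not Ganeti's actual implementation.
    """
    params = {}
    original_names = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The name and its documentation are separated by any run of
        # spaces/tabs; the documentation itself may contain whitespace.
        parts = line.split(None, 1)
        name = parts[0]
        doc = parts[1] if len(parts) > 1 else ""
        key = name.lower()
        if key in original_names:
            raise ValueError("parameters %r and %r differ in case only"
                             % (original_names[key], name))
        original_names[key] = name
        params[key] = doc
    return params
```

With the example file shown earlier, the result would map ``ns1``, ``extra_packages``, ``rootfs_size`` and ``track`` to their one-line descriptions.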
875 | 545d1f1a | Iustin Pop | |
876 | 545d1f1a | Iustin Pop | Values |
877 | 545d1f1a | Iustin Pop | ^^^^^^ |
878 | 545d1f1a | Iustin Pop | |
879 | 545d1f1a | Iustin Pop | The values of the parameters are, from Ganeti's point of view, |
880 | 545d1f1a | Iustin Pop | completely freeform. If a given parameter has, from the OS' point of |
881 | 545d1f1a | Iustin Pop | view, a fixed set of valid values, these should be documented as such |
882 | 545d1f1a | Iustin Pop | and verified by the OS, but Ganeti will not handle such parameters |
883 | 545d1f1a | Iustin Pop | specially. |
884 | 545d1f1a | Iustin Pop | |
885 | 545d1f1a | Iustin Pop | An empty value must be handled identically to a missing parameter. In |
886 | 545d1f1a | Iustin Pop | other words, the validation script should only test for non-empty |
887 | 545d1f1a | Iustin Pop | values, and not for declared versus undeclared parameters. |
888 | 545d1f1a | Iustin Pop | |
889 | 545d1f1a | Iustin Pop | Furthermore, each parameter should have an (internal to the OS) default |
890 | 545d1f1a | Iustin Pop | value, that will be used if not passed from Ganeti. More precisely, it |
891 | 545d1f1a | Iustin Pop | should be possible for any parameter to specify a value that will have |
892 | 545d1f1a | Iustin Pop | the same effect as not passing the parameter, and in no case should |
893 | 545d1f1a | Iustin Pop | the absence of a parameter be treated as an exceptional case (outside |
894 | 545d1f1a | Iustin Pop | the value space). |
895 | 545d1f1a | Iustin Pop | |
896 | 545d1f1a | Iustin Pop | |
897 | 545d1f1a | Iustin Pop | Environment variables |
898 | 545d1f1a | Iustin Pop | +++++++++++++++++++++ |
899 | 545d1f1a | Iustin Pop | |
900 | 545d1f1a | Iustin Pop | The parameters will be exposed in the environment upper-case and |
901 | 545d1f1a | Iustin Pop | prefixed with the string ``OSP_``. For example, a parameter declared in |
902 | 545d1f1a | Iustin Pop | the ``parameters.list`` file as ``ns1`` will appear in the environment as the |
903 | 545d1f1a | Iustin Pop | variable ``OSP_NS1``. |
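
The mapping from parameter names to environment variables could be sketched as follows. The helper name ``build_os_env`` is an assumption made for illustration only.

```python
def build_os_env(params):
    """Return the environment entries for a {name: value} parameter dict.

    Hypothetical helper: each name is upper-cased and prefixed with OSP_.
    """
    return dict(("OSP_" + name.upper(), value)
                for name, value in params.items())
```

For example, ``build_os_env({"ns1": "192.0.2.1"})`` would yield a dictionary with the single key ``OSP_NS1``.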
904 | 545d1f1a | Iustin Pop | |
905 | 545d1f1a | Iustin Pop | Validation |
906 | 545d1f1a | Iustin Pop | ++++++++++ |
907 | 545d1f1a | Iustin Pop | |
908 | 545d1f1a | Iustin Pop | For the purpose of parameter name/value validation, the OS scripts |
909 | 545d1f1a | Iustin Pop | *must* provide an additional script, named ``verify``. This script will |
910 | 545d1f1a | Iustin Pop | be called with the argument ``parameters``, and all the parameters will |
911 | 545d1f1a | Iustin Pop | be passed in via environment variables, as described above. |
912 | 545d1f1a | Iustin Pop | |
913 | 545d1f1a | Iustin Pop | The script should signify success/failure based on its exit code, and |
914 | 545d1f1a | Iustin Pop | show explanatory messages either on its standard output or standard |
915 | 545d1f1a | Iustin Pop | error. These messages will be passed on to the master, and stored in |
916 | 545d1f1a | Iustin Pop | the OpCode result/error message. |
917 | 545d1f1a | Iustin Pop | |
918 | 545d1f1a | Iustin Pop | The parameters must be constructed to be independent of the instance |
919 | 545d1f1a | Iustin Pop | specifications. In general, the validation script will only be called |
920 | 545d1f1a | Iustin Pop | with the parameter variables set, but not with the normal per-instance |
921 | 545d1f1a | Iustin Pop | variables, in order for Ganeti to be able to validate default parameters |
922 | 545d1f1a | Iustin Pop | too, when they change. Validation will only be performed on one cluster |
923 | 545d1f1a | Iustin Pop | node, and it will be up to the Ganeti administrator to keep the OS |
924 | 545d1f1a | Iustin Pop | scripts in sync between all nodes. |
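
A ``verify`` script following these rules might look like the sketch below. It is written in Python only for illustration (OS scripts can be written in any language) and checks the ``track`` parameter from the ``parameters.list`` example. The function names and the message wording are assumptions.

```python
import sys

# Valid values for the example 'track' parameter (taken from the
# parameters.list example; a real OS defines its own rules).
VALID_TRACKS = ("stable", "testing", "unstable")


def verify_parameters(environ):
    """Return a list of error messages for the OSP_* variables in environ."""
    errors = []
    track = environ.get("OSP_TRACK", "")
    # An empty value is treated exactly like a missing parameter.
    if track and track not in VALID_TRACKS:
        errors.append("track must be one of %s, got %r"
                      % (", ".join(VALID_TRACKS), track))
    return errors


def main(argv, environ):
    """Return the exit code: 0 on success, 1 on failure."""
    if len(argv) != 2 or argv[1] != "parameters":
        sys.stderr.write("Usage: %s parameters\n" % argv[0])
        return 1
    errors = verify_parameters(environ)
    for msg in errors:
        sys.stderr.write(msg + "\n")
    return 1 if errors else 0

# A real script would end with: sys.exit(main(sys.argv, os.environ))
```

Note that the check only looks at the ``OSP_`` variables and not at any per-instance variable, so it can also be used to validate cluster-level defaults.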
925 | 545d1f1a | Iustin Pop | |
926 | 545d1f1a | Iustin Pop | Instance operations |
927 | 545d1f1a | Iustin Pop | +++++++++++++++++++ |
928 | 545d1f1a | Iustin Pop | |
929 | 545d1f1a | Iustin Pop | The parameters will be passed, as described above, to all the other |
930 | 545d1f1a | Iustin Pop | instance operations (creation, import, export). Ideally, these scripts |
931 | 545d1f1a | Iustin Pop | will not abort with parameter validation errors, if the ``verify`` |
932 | 545d1f1a | Iustin Pop | script has verified them correctly. |
933 | 545d1f1a | Iustin Pop | |
934 | 545d1f1a | Iustin Pop | Note: when changing an instance's OS type, any OS parameters defined at |
935 | 545d1f1a | Iustin Pop | instance level will be kept as-is. If the parameters differ between the |
936 | 545d1f1a | Iustin Pop | new and the old OS, the user should manually remove/update them as |
937 | 545d1f1a | Iustin Pop | needed. |
938 | 545d1f1a | Iustin Pop | |
939 | 545d1f1a | Iustin Pop | Declaration and modification |
940 | 545d1f1a | Iustin Pop | ++++++++++++++++++++++++++++ |
941 | 545d1f1a | Iustin Pop | |
942 | 545d1f1a | Iustin Pop | Since the OSes are not registered in Ganeti, we will only make a 'weak' |
943 | 545d1f1a | Iustin Pop | link between the parameters as declared in Ganeti and the actual OSes |
944 | 545d1f1a | Iustin Pop | existing on the cluster. |
945 | 545d1f1a | Iustin Pop | |
946 | 545d1f1a | Iustin Pop | It will be possible to declare parameters either globally, per cluster |
947 | 545d1f1a | Iustin Pop | (where they are indexed per OS/variant), or individually, per |
948 | 545d1f1a | Iustin Pop | instance. The declaration of parameters will not be tied to currently |
949 | 545d1f1a | Iustin Pop | existing OSes. When specifying a parameter, if the OS exists, it will be |
950 | 545d1f1a | Iustin Pop | validated; if not, then it will simply be stored as-is. |
951 | 545d1f1a | Iustin Pop | |
952 | 545d1f1a | Iustin Pop | A special note is that it will not be possible to 'unset' at instance |
953 | 545d1f1a | Iustin Pop | level a parameter that is declared globally. Instead, at instance level |
954 | 545d1f1a | Iustin Pop | the parameter should be given an explicit value, or the default value as |
955 | 545d1f1a | Iustin Pop | explained above. |
956 | 545d1f1a | Iustin Pop | |
957 | 545d1f1a | Iustin Pop | CLI interface |
958 | 545d1f1a | Iustin Pop | +++++++++++++ |
959 | 545d1f1a | Iustin Pop | |
960 | 545d1f1a | Iustin Pop | The modification of global (default) parameters will be done via the |
961 | 545d1f1a | Iustin Pop | ``gnt-os`` command, and the per-instance parameters via the |
962 | 545d1f1a | Iustin Pop | ``gnt-instance`` command. Both these commands will take an additional |
963 | 545d1f1a | Iustin Pop | ``--os-parameters`` or ``-O`` flag that specifies the parameters in the |
964 | 545d1f1a | Iustin Pop | familiar comma-separated, key=value format. For removing a parameter, a |
965 | 545d1f1a | Iustin Pop | ``-key`` syntax will be used, e.g.:: |
966 | 545d1f1a | Iustin Pop | |
967 | 545d1f1a | Iustin Pop | # initial modification |
968 | 545d1f1a | Iustin Pop | $ gnt-instance modify -O use_dhcp=true instance1 |
969 | 545d1f1a | Iustin Pop | # later revert (to the cluster default, or the OS default if not |
970 | 545d1f1a | Iustin Pop | # defined at cluster level) |
971 | 545d1f1a | Iustin Pop | $ gnt-instance modify -O -use_dhcp instance1 |
972 | 545d1f1a | Iustin Pop | |
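
To make the syntax concrete, here is a minimal sketch of how the value of the ``-O`` flag could be split into parameters to set and parameters to remove. This is not Ganeti's actual option parser; the name ``parse_os_parameters`` is hypothetical, and values containing commas are not handled.

```python
def parse_os_parameters(spec):
    """Split a comma-separated spec into (to_set, to_remove).

    'key=value' items are collected in the to_set dict and '-key' items
    in the to_remove list. Hypothetical helper for illustration only.
    """
    to_set = {}
    to_remove = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("-"):
            to_remove.append(item[1:])
        elif "=" in item:
            key, value = item.split("=", 1)
            to_set[key] = value
        else:
            raise ValueError("expected key=value or -key, got %r" % item)
    return to_set, to_remove
```

In this sketch, the two commands above would correspond to the specs ``use_dhcp=true`` and ``-use_dhcp`` respectively.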
973 | 545d1f1a | Iustin Pop | Internal storage |
974 | 545d1f1a | Iustin Pop | ++++++++++++++++ |
975 | 545d1f1a | Iustin Pop | |
976 | 545d1f1a | Iustin Pop | Internally, the OS parameters will be stored in a new ``osparams`` |
977 | 545d1f1a | Iustin Pop | attribute. The global parameters will be stored on the cluster object, |
978 | 545d1f1a | Iustin Pop | and the value of this attribute will be a dictionary indexed by OS name |
979 | 545d1f1a | Iustin Pop | (this also accepts an OS+variant name, which will override a simple OS |
980 | 545d1f1a | Iustin Pop | name, see below), and for values the parameter name/value dictionary. For the |
981 | 545d1f1a | Iustin Pop | instances, the value will be directly the parameter name/value dictionary. |
982 | 545d1f1a | Iustin Pop | |
983 | 545d1f1a | Iustin Pop | Overriding rules |
984 | 545d1f1a | Iustin Pop | ++++++++++++++++ |
985 | 545d1f1a | Iustin Pop | |
986 | 545d1f1a | Iustin Pop | Any instance-specific parameters will override any variant-specific |
987 | 545d1f1a | Iustin Pop | parameters, which in turn will override any global parameters. The |
988 | 545d1f1a | Iustin Pop | global parameters, in turn, override the built-in defaults (of the OS |
989 | 545d1f1a | Iustin Pop | scripts). |
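
The precedence described above can be sketched as a simple merge of dictionaries. The sketch assumes that OS+variant keys are written as ``name+variant`` in the cluster-level ``osparams`` dictionary; both this key format and the function name ``effective_os_params`` are assumptions. The built-in defaults of the OS scripts apply to any parameter absent from the result.

```python
def effective_os_params(cluster_osparams, os_name, instance_osparams):
    """Compute the parameters passed to the OS scripts for one instance.

    cluster_osparams: dict indexed by OS name or 'name+variant' (assumed
    key format), each value a {parameter name: value} dict.
    os_name: the instance's OS, optionally with a '+variant' suffix.
    instance_osparams: the instance-level {parameter name: value} dict.
    """
    base_name = os_name.split("+", 1)[0]
    result = {}
    # Lowest precedence: global parameters for the plain OS name.
    result.update(cluster_osparams.get(base_name, {}))
    # Next: variant-specific parameters, if the OS has a variant.
    if os_name != base_name:
        result.update(cluster_osparams.get(os_name, {}))
    # Highest precedence: instance-specific parameters.
    result.update(instance_osparams)
    return result
```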
990 | 545d1f1a | Iustin Pop | |
991 | 545d1f1a | Iustin Pop | |
992 | e56bb0e8 | Guido Trotter | .. vim: set textwidth=72 : |
993 | 545d1f1a | Iustin Pop | .. Local Variables: |
994 | 545d1f1a | Iustin Pop | .. mode: rst |
995 | 545d1f1a | Iustin Pop | .. fill-column: 72 |
996 | 545d1f1a | Iustin Pop | .. End: |