Revision f0f7293f doc/design-2.2.rst
b/doc/design-2.2.rst | ||
---|---|---|
199 | 199 |
its benefits. |
200 | 200 |
|
201 | 201 |
|
202 |
Remote procedure call timeouts |
|
203 |
------------------------------ |
|
204 |
|
|
205 |
Current state and shortcomings |
|
206 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
207 |
|
|
208 |
The current RPC protocol used by Ganeti is based on HTTP. Every call |
|
209 |
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``) |
|
210 |
and does not return until the called function has returned. Parameters |
|
211 |
and return values are encoded using JSON. |
|
212 |
|
|
213 |
On the server side, ``ganeti-noded`` handles every incoming connection |
|
214 |
in a separate process by forking just after accepting the connection. |
|
215 |
This process exits after sending the response. |
|
216 |
|
|
217 |
There is one major problem with this design: timeouts cannot be used on |
|
218 |
a per-request basis. Neither client nor server knows how long a request |
|
219 |
will take. Even if we were able to group requests into different |
|
220 |
categories (e.g. fast and slow), this would not be reliable. |
|
221 |
|
|
222 |
If a node has an issue or the network connection fails while a request |
|
223 |
is being handled, the master daemon can wait for a long time for the |
|
224 |
connection to time out (e.g. due to the operating system's underlying |
|
225 |
TCP keep-alive packets or timeouts). While the settings for keep-alive |
|
226 |
packets can be changed using Linux-specific socket options, we prefer to |
|
227 |
use application-level timeouts because these cover both machine down and |
|
228 |
unresponsive node daemon cases. |
|
229 |
|
|
230 |
Proposed changes |
|
231 |
~~~~~~~~~~~~~~~~ |
|
232 |
|
|
233 |
RPC glossary |
|
234 |
++++++++++++ |
|
235 |
|
|
236 |
Function call ID |
|
237 |
Unique identifier returned by ``ganeti-noded`` after invoking a |
|
238 |
function. |
|
239 |
Function process |
|
240 |
Process started by ``ganeti-noded`` to call actual (backend) function. |
|
241 |
|
|
242 |
Protocol |
|
243 |
++++++++ |
|
244 |
|
|
245 |
Initially we chose HTTP as our RPC protocol because there were existing |
|
246 |
libraries, which, unfortunately, turned out to lack important features |
|
247 |
(such as SSL certificate authentication), so we had to write our own. |
|
248 |
|
|
249 |
This proposal can easily be implemented using HTTP, though it would |
|
250 |
likely be more efficient and less complicated to use the LUXI protocol |
|
251 |
already used to communicate between client tools and the Ganeti master |
|
252 |
daemon. Switching to another protocol can occur at a later point. For now, |
|
253 |
this proposal should be implemented using HTTP as its underlying protocol. |
|
254 |
|
|
255 |
The LUXI protocol currently contains two functions, ``WaitForJobChange`` |
|
256 |
and ``AutoArchiveJobs``, which can take a long time. They both support |
|
257 |
a parameter to specify the timeout. This timeout is usually chosen as |
|
258 |
roughly half of the socket timeout, guaranteeing a response before the |
|
259 |
socket times out. After the specified amount of time, |
|
260 |
``AutoArchiveJobs`` returns and reports the number of archived jobs. |
|
261 |
``WaitForJobChange`` returns and reports a timeout. In both cases, the |
|
262 |
functions can be called again. |
|
263 |
|
|
264 |
A similar model can be used for the inter-node RPC protocol. In some |
|
265 |
sense, the node daemon will implement a light variant of *"node daemon |
|
266 |
jobs"*. When the function call is sent, it specifies an initial timeout. |
|
267 |
If the function does not finish within this timeout, a response is sent |
|
268 |
with a unique identifier, the function call ID. The client can then |
|
269 |
choose to wait again, with a new timeout, for the function to finish. |
|
270 |
Inter-node RPC calls would no longer block indefinitely, and there |
|
271 |
would be an implicit ping mechanism. |
|
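
The client side of this model can be sketched as a simple polling loop. This is only an illustration, not Ganeti code: ``call_with_timeouts``, the ``start`` and ``wait`` stubs and the ``max_rounds`` limit are hypothetical names, and each stub is assumed to return a ``(call_id, finished, result)`` tuple.

```python
def call_with_timeouts(start, wait, fn_name, fn_args, timeout=10.0,
                       max_rounds=None):
    """Run a remote function without blocking longer than ``timeout`` per round.

    ``start`` and ``wait`` are hypothetical client stubs. Both are assumed to
    return a tuple ``(call_id, finished, result)``.
    """
    call_id, finished, result = start(fn_name, fn_args, timeout)
    rounds = 1
    while not finished:
        if max_rounds is not None and rounds >= max_rounds:
            raise TimeoutError(
                "call %s still running after %d rounds" % (call_id, rounds))
        # Every round trip doubles as a ping: if the node is down, this call
        # fails quickly instead of blocking the caller indefinitely.
        _, finished, result = wait(call_id, timeout)
        rounds += 1
    return result
```

Whether the caller gives up after a number of rounds is a policy decision that the proposal leaves to the client; ``max_rounds`` is only one possible way to express it.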
272 |
|
|
273 |
Request handling |
|
274 |
++++++++++++++++ |
|
275 |
|
|
276 |
To support the protocol changes described above, the way the node daemon |
|
277 |
handles requests will have to change. Instead of forking and handling |
|
278 |
every connection in a separate process, there should be one child |
|
279 |
process per function call and the master process will handle the |
|
280 |
communication with clients and the function processes using asynchronous |
|
281 |
I/O. |
|
282 |
|
|
283 |
Function processes communicate with the parent process via stdio and |
|
284 |
possibly their exit status. Every function process has a unique |
|
285 |
identifier, though it shouldn't be the process ID only (PIDs can be |
|
286 |
recycled and are prone to race conditions for this use case). The |
|
287 |
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid`` |
|
288 |
is the ``ganeti-noded`` PID, ``cpid`` the child's PID, ``time`` the |
|
289 |
current Unix timestamp with decimal places and ``random`` at least 16 |
|
290 |
random bits. |
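
A minimal sketch of how such an identifier could be built. The function name ``make_function_call_id``, the six decimal places for the timestamp and the hexadecimal encoding of the random part are assumptions; the proposal only fixes the four fields and their order.

```python
import os
import secrets
import time


def make_function_call_id(child_pid, parent_pid=None, random_bits=16):
    """Build an identifier of the form ppid:cpid:time:random.

    The number of decimal places and the hexadecimal encoding of the random
    part are assumptions made for this sketch.
    """
    if parent_pid is None:
        # In ganeti-noded this would be the daemon's own PID.
        parent_pid = os.getpid()
    return "%d:%d:%.6f:%x" % (
        parent_pid,
        child_pid,
        time.time(),                     # Unix timestamp with decimal places
        secrets.randbits(random_bits),   # at least 16 random bits
    )
```

Combining the PIDs with a timestamp and a random value means that a recycled child PID alone does not produce a colliding identifier.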
|
291 |
|
|
292 |
The following operations will be supported: |
|
293 |
|
|
294 |
``StartFunction(fn_name, fn_args, timeout)`` |
|
295 |
Starts a function specified by ``fn_name`` with arguments in |
|
296 |
``fn_args`` and waits up to ``timeout`` seconds for the function |
|
297 |
to finish. Fire-and-forget calls can be made by specifying a timeout |
|
298 |
of 0 seconds (e.g. for powercycling the node). Returns three values: |
|
299 |
function call ID (if not finished), whether the function finished (or |
|
300 |
the timeout expired) and the function's return value (if finished). |
|
301 |
``WaitForFunction(fnc_id, timeout)`` |
|
302 |
Waits up to ``timeout`` seconds for the function call ``fnc_id`` to finish. |
|
303 |
Returns the same values as ``StartFunction``. |
|
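
The two operations can be modelled in a few lines on the node daemon side. The following is a sketch under several assumptions, not the proposed implementation: the ``FunctionCalls`` class, the ``FUNCTIONS`` registry and the ``sleep_echo`` function are hypothetical, the call ID is a simple counter rather than the format proposed above, and blocking waits stand in for the asynchronous I/O the real daemon would use. Function processes report their result as JSON on standard output.

```python
import itertools
import json
import subprocess
import sys

# Hypothetical backend functions, given as Python source so that the sketch
# stays self-contained. Each reads JSON arguments from argv and prints a JSON
# result on stdout.
FUNCTIONS = {
    "sleep_echo": (
        "import json, sys, time\n"
        "delay, value = json.loads(sys.argv[1])\n"
        "time.sleep(delay)\n"
        "print(json.dumps(value))\n"
    ),
}


class FunctionCalls:
    """Tracks one child process per function call."""

    def __init__(self):
        self._running = {}
        self._counter = itertools.count(1)

    def start_function(self, fn_name, fn_args, timeout):
        proc = subprocess.Popen(
            [sys.executable, "-c", FUNCTIONS[fn_name], json.dumps(fn_args)],
            stdout=subprocess.PIPE,
        )
        # Placeholder ID; the proposal suggests ppid:cpid:time:random.
        fnc_id = "call-%d" % next(self._counter)
        self._running[fnc_id] = proc
        return self.wait_for_function(fnc_id, timeout)

    def wait_for_function(self, fnc_id, timeout):
        proc = self._running[fnc_id]
        try:
            # Retrying communicate() after a timeout does not lose output.
            stdout, _ = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired:
            return (fnc_id, False, None)
        del self._running[fnc_id]
        if proc.returncode != 0:
            raise RuntimeError("function process exited with status %d"
                               % proc.returncode)
        return (None, True, json.loads(stdout))
```

A call such as ``FunctionCalls().start_function("sleep_echo", [0, "x"], 10)`` returns once the child has exited, whereas a short timeout on a slow function returns the call ID so that the caller can continue with ``wait_for_function``.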
304 |
|
|
305 |
In the future, ``StartFunction`` could support an additional parameter |
|
306 |
to specify after how long the function process should be aborted. |
|
307 |
|
|
308 |
Simplified timing diagram::

  Master daemon        Node daemon                      Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
                        |                               Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |
|
337 |
|
|
338 |
.. TODO: Convert diagram above to graphviz/dot graphic |
|
339 |
|
|
340 |
On process termination (e.g. after having been sent a ``SIGTERM`` or |
|
341 |
``SIGINT`` signal), ``ganeti-noded`` should send ``SIGTERM`` to all |
|
342 |
function processes and wait for all of them to terminate. |
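
The shutdown behaviour described above could look roughly like the following sketch. The helper names ``terminate_function_processes`` and ``install_shutdown_handlers`` are hypothetical, and the function processes are assumed to be tracked as ``subprocess.Popen`` objects.

```python
import signal
import subprocess
import sys


def terminate_function_processes(procs):
    """Send SIGTERM to all running function processes and wait for them.

    Returns the list of exit statuses, in the same order as ``procs``.
    """
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()  # sends SIGTERM on POSIX systems
    # Wait for every child so that none is left behind as a zombie.
    return [proc.wait() for proc in procs]


def install_shutdown_handlers(get_procs):
    """Terminate all function processes when SIGTERM or SIGINT arrives.

    ``get_procs`` is a hypothetical callable returning the Popen objects
    currently tracked by the daemon. Must be called from the main thread.
    """
    def _handler(signum, frame):
        terminate_function_processes(get_procs())
        sys.exit(0)

    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
```

The sketch waits without an upper bound; a daemon may want to escalate to ``SIGKILL`` after a grace period, which the proposal does not specify.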
|
343 |
|
|
344 |
|
|
345 | 202 |
Inter-cluster instance moves |
346 | 203 |
---------------------------- |
347 | 204 |
|