Revision f0f7293f doc/design-2.2.rst

b/doc/design-2.2.rst
its benefits.

  
Remote procedure call timeouts
------------------------------

  
205
Current state and shortcomings
206
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
207

  
The current RPC protocol used by Ganeti is based on HTTP. Every call
is an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``) that
does not return until the called function has returned. Parameters
and return values are encoded using JSON.
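
As an illustration of the request shape described above, the following
minimal sketch builds the request line and JSON body for one call. The
helper name ``build_rpc_request`` and the argument layout are assumptions
made for this example only; they are not taken from the Ganeti code.

```python
import json


def build_rpc_request(procedure, args):
    """Return (request_line, body) for one RPC call.

    Hypothetical helper: the real argument layout is not specified here.
    """
    request_line = "PUT /%s HTTP/1.0" % procedure
    body = json.dumps(args)  # parameters are JSON-encoded
    return request_line, body


# Example call; the arguments are placeholders, not real hook parameters.
request_line, body = build_rpc_request("hooks_runner", ["example-arg"])
```

The response would carry the JSON-encoded return value once the called
function has returned.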

  
On the server side, ``ganeti-noded`` handles every incoming connection
in a separate process by forking just after accepting the connection.
This process exits after sending the response.

  
There is one major problem with this design: timeouts cannot be used on
a per-request basis. Neither client nor server knows how long a request
will take. Even if requests could be grouped into different
categories (e.g. fast and slow), such a grouping is not reliable.

  
If a node has an issue or the network connection fails while a request
is being handled, the master daemon can wait for a long time for the
connection to time out (e.g. due to the operating system's underlying
TCP keep-alive packets or timeouts). While the settings for keep-alive
packets can be changed using Linux-specific socket options, we prefer to
use application-level timeouts because these cover both the case of a
machine being down and the case of an unresponsive node daemon.

  
Proposed changes
~~~~~~~~~~~~~~~~

  
RPC glossary
++++++++++++

  
Function call ID
  Unique identifier returned by ``ganeti-noded`` after invoking a
  function.
Function process
  Process started by ``ganeti-noded`` to call the actual (backend)
  function.

  
Protocol
++++++++

  
Initially we chose HTTP as our RPC protocol because there were existing
libraries. Unfortunately, these turned out to lack important features
(such as SSL certificate authentication), so we had to write our own.

  
This proposal can easily be implemented using HTTP, though it would
likely be more efficient and less complicated to use the LUXI protocol
already used to communicate between client tools and the Ganeti master
daemon. For now, this proposal should be implemented using HTTP as its
underlying protocol; switching to another protocol can occur at a later
point.

  
The LUXI protocol currently contains two functions, ``WaitForJobChange``
and ``AutoArchiveJobs``, which can take a long time. They both support
a parameter to specify the timeout. This timeout is usually chosen as
roughly half of the socket timeout, guaranteeing a response before the
socket times out. After the specified amount of time,
``AutoArchiveJobs`` returns and reports the number of archived jobs,
while ``WaitForJobChange`` returns and reports a timeout. In both cases,
the functions can be called again.

  
A similar model can be used for the inter-node RPC protocol. In some
sense, the node daemon will implement a light variant of *"node daemon
jobs"*. When the function call is sent, it specifies an initial timeout.
If the function does not finish within this timeout, a response is sent
with a unique identifier, the function call ID. The client can then
choose to wait again, with a new timeout, for the function to finish.
Inter-node RPC calls would no longer block indefinitely and there
would be an implicit ping mechanism.

  
Request handling
++++++++++++++++

  
To support the protocol changes described above, the way the node daemon
handles requests will have to change. Instead of forking and handling
every connection in a separate process, there should be one child
process per function call, and the master process will handle the
communication with clients and the function processes using asynchronous
I/O.
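
The following is a much-reduced sketch of one part of this model:
starting one child process per function call and waiting up to a timeout
for its result on a pipe. It assumes a Unix system, uses a Python child
interpreter as a stand-in for a function process, and assumes the child
writes a single JSON value to stdout. The names
``start_function_process`` and ``wait_for_result`` are hypothetical.

```python
import json
import select
import subprocess
import sys


def start_function_process(source):
    """Start a child interpreter running `source` (stand-in for a
    function process); it must print one JSON value to stdout."""
    return subprocess.Popen([sys.executable, "-c", source],
                            stdout=subprocess.PIPE)


def wait_for_result(proc, timeout):
    """Wait up to `timeout` seconds for output; return (finished, value)."""
    readable, _, _ = select.select([proc.stdout], [], [], timeout)
    if not readable:
        return (False, None)  # timeout expired, child still running
    # Simplification: read until the child closes stdout. A real daemon
    # would read incrementally and multiplex many children and clients.
    data = proc.stdout.read()
    proc.wait()
    return (True, json.loads(data))
```

A caller that gets ``(False, None)`` can simply call ``wait_for_result``
again later, which mirrors the repeated waits in the proposed protocol.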

  
Function processes communicate with the parent process via stdio and
possibly their exit status. Every function process has a unique
identifier, though it shouldn't be the process ID alone (PIDs can be
recycled and are prone to race conditions for this use case). The
proposed format is ``${ppid}:${cpid}:${time}:${random}``, where ``ppid``
is the ``ganeti-noded`` PID, ``cpid`` is the child's PID, ``time`` is
the current Unix timestamp with decimal places and ``random`` consists
of at least 16 random bits.
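
A minimal sketch of how such an identifier could be generated is shown
here. The function name ``make_function_call_id``, the six decimal places
and the hexadecimal encoding of the random part are assumptions; the
proposal only fixes the four colon-separated fields.

```python
import random
import time


def make_function_call_id(ppid, cpid):
    """Build an ID of the form ppid:cpid:time:random.

    ppid is the node daemon's PID, cpid the function process's PID.
    """
    # 16 random bits from the operating system's entropy source
    rnd = random.SystemRandom().getrandbits(16)
    return "%d:%d:%.6f:%04x" % (ppid, cpid, time.time(), rnd)
```

In the parent process this could be called as
``make_function_call_id(os.getpid(), child_pid)`` right after forking,
where ``child_pid`` is the value returned by ``os.fork()``.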

  
The following operations will be supported:

  
``StartFunction(fn_name, fn_args, timeout)``
  Starts the function specified by ``fn_name`` with the arguments in
  ``fn_args`` and waits up to ``timeout`` seconds for the function
  to finish. Fire-and-forget calls can be made by specifying a timeout
  of 0 seconds (e.g. for powercycling the node). Returns three values:
  the function call ID (if the function has not finished), whether the
  function finished (or the timeout expired) and the function's return
  value.
``WaitForFunction(fnc_id, timeout)``
  Waits up to ``timeout`` seconds for the function call to finish. The
  return value is the same as for ``StartFunction``.
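
To illustrate how a client might combine the two operations, the sketch
below polls until the function finishes. ``FakeNode`` is an in-memory
stand-in invented for this example, as are ``call_with_polling`` and its
``max_waits`` parameter; the return tuple follows the three values
listed above, with ``None`` as the function call ID once the call has
finished (an assumption).

```python
class FakeNode:
    """Stand-in for a node daemon: the call finishes after a fixed
    number of unsuccessful polls."""

    def __init__(self, polls_needed, result):
        self._remaining = polls_needed
        self._result = result

    def StartFunction(self, fn_name, fn_args, timeout):
        return self._poll("1:2:3.0:abcd")  # placeholder function call ID

    def WaitForFunction(self, fnc_id, timeout):
        return self._poll(fnc_id)

    def _poll(self, fnc_id):
        if self._remaining <= 0:
            return (None, True, self._result)
        self._remaining -= 1
        return (fnc_id, False, None)


def call_with_polling(node, fn_name, fn_args, timeout, max_waits):
    """Start a function, then wait repeatedly until it finishes."""
    fnc_id, finished, value = node.StartFunction(fn_name, fn_args, timeout)
    waits = 0
    while not finished:
        if waits >= max_waits:
            raise RuntimeError("function call %s did not finish" % fnc_id)
        fnc_id, finished, value = node.WaitForFunction(fnc_id, timeout)
        waits += 1
    return value
```

Because every request returns within ``timeout`` seconds, a node that
stops answering is noticed after one timeout instead of after an
operating-system-level connection timeout.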

  
In the future, ``StartFunction`` could support an additional parameter
specifying how long the function process may run before it is aborted.

  
Simplified timing diagram::

  Master daemon        Node daemon                      Function process
   |
  Call function
  (timeout 10s) -----> Parse request and fork for ----> Start function
                       calling actual function, then     |
                       wait up to 10s for function to    |
                       finish                            |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
  Examine return <----  |                                |
  value and wait                                         |
  again -------------> Wait another 10s for function     |
                        |                                |
                       ...                              ...
                        |                                |
                        |                               Function ends,
                       Get return value and forward <-- process exits
  Process return <---- it to caller
  value and continue
   |

  
.. TODO: Convert diagram above to graphviz/dot graphic

  
When ``ganeti-noded`` itself is asked to terminate (e.g. after having
been sent a ``SIGTERM`` or ``SIGINT`` signal), it should send
``SIGTERM`` to all function processes and wait for all of them to
terminate.
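
The shutdown step could look roughly like the following sketch, which
assumes a Unix system and that the function processes are tracked as
``subprocess.Popen`` objects. The function name
``terminate_function_processes`` is hypothetical.

```python
import signal
import subprocess
import sys


def terminate_function_processes(procs):
    """Send SIGTERM to every running child, then wait for all of them.

    Returns the list of exit statuses in the same order as `procs`.
    """
    for proc in procs:
        if proc.poll() is None:  # still running
            proc.send_signal(signal.SIGTERM)
    return [proc.wait() for proc in procs]
```

A real daemon would probably also want an upper bound on the wait so
that a function process ignoring ``SIGTERM`` cannot block shutdown
forever; that policy is not covered by this proposal.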

  

  
Inter-cluster instance moves
----------------------------

  
