Revision 2f2f1289 doc/design-2.2.rst
b/doc/design-2.2.rst | ||
---|---|---|
11 | 11 |
|
12 | 12 |
.. contents:: :depth: 4 |
13 | 13 |
|
14 |
Detailed design |
|
15 |
=============== |
|
16 |
|
|
17 | 14 |
As for 2.1 we divide the 2.2 design into three areas: |
18 | 15 |
|
19 | 16 |
- core changes, which affect the master daemon/job queue/locking or |
20 | 17 |
all/most logical units |
21 | 18 |
- logical unit/feature changes |
22 |
- external interface changes (eg. command line, os api, hooks, ...) |
|
19 |
- external interface changes (e.g. command line, OS API, hooks, ...) |
|
20 |
|
|
23 | 21 |
|
24 | 22 |
Core changes |
25 |
------------
|
|
23 |
============
|
|
26 | 24 |
|
27 | 25 |
Master Daemon Scaling improvements |
28 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
26 |
----------------------------------
|
|
29 | 27 |
|
30 | 28 |
Current state and shortcomings |
31 |
++++++++++++++++++++++++++++++
|
|
29 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
32 | 30 |
|
33 | 31 |
Currently the Ganeti master daemon is based on four sets of threads: |
34 | 32 |
|
... | ... | |
50 | 48 |
scalability issues: |
51 | 49 |
|
52 | 50 |
Core daemon connection handling |
53 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
51 |
+++++++++++++++++++++++++++++++
|
|
54 | 52 |
|
55 | 53 |
Since the 16 client worker threads handle one connection each, it's very |
56 | 54 |
easy to exhaust them, by just connecting to masterd 16 times and not |
... | ... | |
60 | 58 |
informed that everything is proceeding, and doesn't need to time out. |
61 | 59 |
|
62 | 60 |
Wait for job change |
63 |
^^^^^^^^^^^^^^^^^^^
|
|
61 |
+++++++++++++++++++
|
|
64 | 62 |
|
65 | 63 |
The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client |
66 | 64 |
thread block on its job for a relatively long time. This is another easy |
... | ... | |
69 | 67 |
contention (see below). |
70 | 68 |
|
71 | 69 |
Job Queue lock |
72 |
^^^^^^^^^^^^^^
|
|
70 |
++++++++++++++
|
|
73 | 71 |
|
74 | 72 |
The job queue lock is quite heavily contended, and certain easily |
75 | 73 |
reproducible workloads show that it's very easy to put masterd in |
... | ... | |
120 | 118 |
remote RPCs to complete (starting, finishing, and submitting jobs) |
121 | 119 |
|
122 | 120 |
Proposed changes |
123 |
++++++++++++++++
|
|
121 |
~~~~~~~~~~~~~~~~
|
|
124 | 122 |
|
125 | 123 |
In order to be able to interact with the master daemon even when it's |
126 | 124 |
under heavy load, and to make it simpler to add core functionality |
... | ... | |
135 | 133 |
understand, debug, and scale. |
136 | 134 |
|
137 | 135 |
Connection handling |
138 |
^^^^^^^^^^^^^^^^^^^
|
|
136 |
+++++++++++++++++++
|
|
139 | 137 |
|
140 | 138 |
We'll move the main thread of ganeti-masterd to asyncore, so that it can |
141 | 139 |
share the mainloop code with all other Ganeti daemons. Then all luxi |
... | ... | |
148 | 146 |
thread on the socket. |
149 | 147 |
|
150 | 148 |
Wait for job change |
151 |
^^^^^^^^^^^^^^^^^^^
|
|
149 |
+++++++++++++++++++
|
|
152 | 150 |
|
153 | 151 |
The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be |
154 | 152 |
subscription-based, so that the executing thread doesn't have to be |
... | ... | |
173 | 171 |
them at a maximum rate (lower priority). |
174 | 172 |
|
175 | 173 |
Job Queue lock |
176 |
^^^^^^^^^^^^^^
|
|
174 |
++++++++++++++
|
|
177 | 175 |
|
178 | 176 |
In order to decrease the job queue lock contention, we will change the |
179 | 177 |
code paths in the following ways, initially: |
... | ... | |
202 | 200 |
|
203 | 201 |
|
204 | 202 |
Remote procedure call timeouts |
205 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
203 |
------------------------------
|
|
206 | 204 |
|
207 | 205 |
Current state and shortcomings |
208 |
++++++++++++++++++++++++++++++
|
|
206 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
209 | 207 |
|
210 | 208 |
The current RPC protocol used by Ganeti is based on HTTP. Every request |
211 | 209 |
consists of an HTTP PUT request (e.g. ``PUT /hooks_runner HTTP/1.0``) |
... | ... | |
230 | 228 |
unresponsive node daemon cases. |
231 | 229 |
|
232 | 230 |
Proposed changes |
233 |
++++++++++++++++
|
|
231 |
~~~~~~~~~~~~~~~~
|
|
234 | 232 |
|
235 | 233 |
RPC glossary |
236 |
^^^^^^^^^^^^
|
|
234 |
++++++++++++
|
|
237 | 235 |
|
238 | 236 |
Function call ID |
239 | 237 |
Unique identifier returned by ``ganeti-noded`` after invoking a |
... | ... | |
242 | 240 |
Process started by ``ganeti-noded`` to call actual (backend) function. |
243 | 241 |
|
244 | 242 |
Protocol |
245 |
^^^^^^^^
|
|
243 |
++++++++
|
|
246 | 244 |
|
247 | 245 |
Initially we chose HTTP as our RPC protocol because there were existing |
248 | 246 |
libraries, which, unfortunately, turned out to miss important features |
... | ... | |
273 | 271 |
would be an implicit ping-mechanism. |
274 | 272 |
|
275 | 273 |
Request handling |
276 |
^^^^^^^^^^^^^^^^
|
|
274 |
++++++++++++++++
|
|
277 | 275 |
|
278 | 276 |
To support the protocol changes described above, the way the node daemon |
279 | 277 |
handles requests will have to change. Instead of forking and handling |
... | ... | |
345 | 343 |
|
346 | 344 |
|
347 | 345 |
Inter-cluster instance moves |
348 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
346 |
----------------------------
|
|
349 | 347 |
|
350 | 348 |
Current state and shortcomings |
351 |
++++++++++++++++++++++++++++++
|
|
349 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
352 | 350 |
|
353 | 351 |
With the current design of Ganeti, moving whole instances between |
354 | 352 |
different clusters involves a lot of manual work. There are several ways |
... | ... | |
359 | 357 |
this process in Ganeti 2.2. |
360 | 358 |
|
361 | 359 |
Proposed changes |
362 |
++++++++++++++++
|
|
360 |
~~~~~~~~~~~~~~~~
|
|
363 | 361 |
|
364 | 362 |
Authorization, Authentication and Security |
365 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
363 |
++++++++++++++++++++++++++++++++++++++++++
|
|
366 | 364 |
|
367 | 365 |
Until now, each Ganeti cluster was a self-contained entity and wouldn't |
368 | 366 |
talk to other Ganeti clusters. Nodes within clusters only had to trust |
... | ... | |
424 | 422 |
certificate while providing a client certificate to the server. |
425 | 423 |
|
426 | 424 |
Copying data |
427 |
^^^^^^^^^^^^
|
|
425 |
++++++++++++
|
|
428 | 426 |
|
429 | 427 |
To simplify the implementation, we decided to operate at a block-device |
430 | 428 |
level only, allowing us to easily support non-DRBD instance moves. |
... | ... | |
442 | 440 |
directly, where it'll be written to the new block device directly again. |
443 | 441 |
|
444 | 442 |
Workflow |
445 |
^^^^^^^^
|
|
443 |
++++++++
|
|
446 | 444 |
|
447 | 445 |
#. Third party tells source cluster to shut down instance, asks for the |
448 | 446 |
instance specification and for the public part of an encryption key |
... | ... | |
510 | 508 |
#. Source cluster removes the instance if requested |
511 | 509 |
|
512 | 510 |
Instance move in pseudo code |
513 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
511 |
++++++++++++++++++++++++++++
|
|
514 | 512 |
|
515 | 513 |
.. highlight:: python |
516 | 514 |
|
... | ... | |
651 | 649 |
.. highlight:: text |
652 | 650 |
|
653 | 651 |
Miscellaneous notes |
654 |
^^^^^^^^^^^^^^^^^^^
|
|
652 |
+++++++++++++++++++
|
|
655 | 653 |
|
656 | 654 |
- A very similar system could also be used for instance exports within |
657 | 655 |
the same cluster. Currently OpenSSH is being used, but could be |
... | ... | |
679 | 677 |
|
680 | 678 |
|
681 | 679 |
Privilege separation |
682 |
~~~~~~~~~~~~~~~~~~~~
|
|
680 |
--------------------
|
|
683 | 681 |
|
684 | 682 |
Current state and shortcomings |
685 |
++++++++++++++++++++++++++++++
|
|
683 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
686 | 684 |
|
687 | 685 |
All Ganeti daemons are run under the user root. This is not ideal from a |
688 | 686 |
security perspective as for possible exploitation of any daemon the user |
... | ... | |
694 | 692 |
is in the same group. |
695 | 693 |
|
696 | 694 |
Implementation |
697 |
++++++++++++++
|
|
695 |
~~~~~~~~~~~~~~
|
|
698 | 696 |
|
699 | 697 |
For Ganeti 2.2 the implementation will be focused on the RAPI daemon |
700 | 698 |
only. This involves changes to ``daemons.py`` so it's possible to drop |
... | ... | |
710 | 708 |
|
711 | 709 |
|
712 | 710 |
Feature changes |
713 |
---------------
|
|
711 |
===============
|
|
714 | 712 |
|
715 | 713 |
KVM Security |
716 |
~~~~~~~~~~~~
|
|
714 |
------------
|
|
717 | 715 |
|
718 | 716 |
Current state and shortcomings |
719 |
++++++++++++++++++++++++++++++
|
|
717 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
720 | 718 |
|
721 | 719 |
Currently all kvm processes run as root. Taking ownership of the |
722 | 720 |
hypervisor process, from inside a virtual machine, would mean a full |
... | ... | |
725 | 723 |
option of subverting other basic services on the cluster (e.g. ssh). |
726 | 724 |
|
727 | 725 |
Proposed changes |
728 |
++++++++++++++++
|
|
726 |
~~~~~~~~~~~~~~~~
|
|
729 | 727 |
|
730 | 728 |
We would like to reduce the attack surface available if a |
731 | 729 |
hypervisor is compromised. We can do so by adding different features to |
... | ... | |
734 | 732 |
subvert the node. |
735 | 733 |
|
736 | 734 |
Dropping privileges in kvm to a single user (easy) |
737 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
735 |
++++++++++++++++++++++++++++++++++++++++++++++++++
|
|
738 | 736 |
|
739 | 737 |
By passing the ``-runas`` option to kvm, we can make it drop privileges. |
740 | 738 |
The user can be chosen by a hypervisor parameter, so that each instance |
... | ... | |
761 | 759 |
- read unprotected data on the node filesystem |
762 | 760 |
|
763 | 761 |
Running kvm in a chroot (slightly harder) |
764 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
762 |
+++++++++++++++++++++++++++++++++++++++++
|
|
765 | 763 |
|
766 | 764 |
By passing the ``-chroot`` option to kvm, we can restrict the kvm |
767 | 765 |
process in its own (possibly empty) root directory. We need to set this |
... | ... | |
784 | 782 |
|
785 | 783 |
|
786 | 784 |
Running kvm with a pool of users (slightly harder) |
787 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
785 |
++++++++++++++++++++++++++++++++++++++++++++++++++
|
|
788 | 786 |
|
789 | 787 |
If rather than passing a single user as a hypervisor parameter, we have |
790 | 788 |
a pool of useable ones, we can dynamically choose a free one to use and |
... | ... | |
795 | 793 |
can still be combined with the chroot benefits. |
796 | 794 |
|
797 | 795 |
Running iptables rules to limit network interaction (easy) |
798 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
796 |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|
|
799 | 797 |
|
800 | 798 |
These don't need to be handled by Ganeti, but we can ship examples. If |
801 | 799 |
the users used to run VMs would be blocked from sending some or all |
... | ... | |
808 | 806 |
|
809 | 807 |
|
810 | 808 |
Running kvm inside a container (even harder) |
811 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
809 |
++++++++++++++++++++++++++++++++++++++++++++
|
|
812 | 810 |
|
813 | 811 |
Recent Linux kernels support different process namespaces through |
814 | 812 |
control groups. PIDs, users, filesystems and even network interfaces can |
... | ... | |
820 | 818 |
just rely on iptables. |
821 | 819 |
|
822 | 820 |
Implementation plan |
823 |
+++++++++++++++++++
|
|
821 |
~~~~~~~~~~~~~~~~~~~
|
|
824 | 822 |
|
825 | 823 |
We will first implement dropping privileges for kvm processes as a |
826 | 824 |
single user, and most probably backport it to 2.1. Then we'll ship |
... | ... | |
833 | 831 |
|
834 | 832 |
|
835 | 833 |
External interface changes |
836 |
--------------------------
|
|
834 |
==========================
|
|
837 | 835 |
|
838 | 836 |
|
839 | 837 |
OS API |
840 |
~~~~~~
|
|
838 |
------
|
|
841 | 839 |
|
842 | 840 |
The OS variants implementation in Ganeti 2.1 didn't prove to be useful |
843 | 841 |
enough to alleviate the need to hack around the Ganeti API in order to |
... | ... | |
856 | 854 |
|
857 | 855 |
|
858 | 856 |
OS version |
859 |
++++++++++
|
|
857 |
~~~~~~~~~~
|
|
860 | 858 |
|
861 | 859 |
A new ``os_version`` file will be supported by Ganeti. This file is not |
862 | 860 |
required, but if existing, its contents will be checked for consistency |
... | ... | |
870 | 868 |
intra-cluster migration. |
871 | 869 |
|
872 | 870 |
Parameters |
873 |
++++++++++
|
|
871 |
~~~~~~~~~~
|
|
874 | 872 |
|
875 | 873 |
The interface between Ganeti and the OS scripts will be based on |
876 | 874 |
environment variables, and as such the parameters and their values will |
877 | 875 |
need to be valid in this context. |
878 | 876 |
|
879 | 877 |
Names |
880 |
^^^^^
|
|
878 |
+++++
|
|
881 | 879 |
|
882 | 880 |
The parameter names will be declared in a new file, ``parameters.list``, |
883 | 881 |
together with a one-line documentation (whitespace-separated). Example:: |
... | ... | |
896 | 894 |
parameters which differ in case only. |
897 | 895 |
|
898 | 896 |
Values |
899 |
^^^^^^
|
|
897 |
++++++
|
|
900 | 898 |
|
901 | 899 |
The values of the parameters are, from Ganeti's point of view, |
902 | 900 |
completely freeform. If a given parameter has, from the OS' point of |
... | ... | |
917 | 915 |
|
918 | 916 |
|
919 | 917 |
Environment variables |
920 |
+++++++++++++++++++++
|
|
918 |
^^^^^^^^^^^^^^^^^^^^^
|
|
921 | 919 |
|
922 | 920 |
The parameters will be exposed in the environment upper-case and |
923 | 921 |
prefixed with the string ``OSP_``. For example, a parameter declared in |