Revision 282f38e3 (doc/design-2.3.rst)

Core changes
============

Node Groups
-----------

Current state and shortcomings
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Currently all nodes of a Ganeti cluster are considered part of the
same pool for allocation purposes: DRBD instances, for example, can be
allocated on any two nodes.

This causes a problem in cases where nodes are not all equally
connected to each other. For example, if a cluster is created over two
sets of machines, each connected to its own switch, the internal
bandwidth between machines connected to the same switch might be
higher than the bandwidth of inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be
locked together for inter-node consistency, and won't scale if we
increase the number of nodes to a few hundred.

Proposed changes
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group, the default one. Bigger
clusters, instead, will be able to have more than one group, and each
node will belong to exactly one.

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following
new commands/flags will be introduced::

  gnt-node group-add <group>                 # add a new node group
  gnt-node group-del <group>                 # delete an empty group
  gnt-node group-list                        # list node groups
  gnt-node group-rename <oldname> <newname>  # rename a group
  gnt-node list/info -g <group>              # list only nodes belonging to a group
  gnt-node add -g <group>                    # add a node to a certain group
  gnt-node modify -g <group>                 # move a node to a new group

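As an illustration, an administrator splitting nodes by rack could
then do something like the following (a hypothetical session using the
commands proposed above; group and node names are examples)::

  gnt-node group-add rack1                     # create the new group
  gnt-node add -g rack1 node4.example.com      # add a new node directly to it
  gnt-node modify -g rack1 node3.example.com   # move an existing node into it
  gnt-node list -g rack1                       # verify the group's members
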
Instance level changes
++++++++++++++++++++++

Instances will be able to live in only one group at a time. This is
mostly important for DRBD instances, in which case both their primary
and secondary nodes will need to be in the same group. To support this
we envision the following changes:

- The cluster will have a default group, which will initially be the
  only group in the cluster.
- Instance allocation will happen in the cluster's default group
  (which will be changeable via gnt-cluster modify or RAPI) unless a
  group is explicitly specified in the creation job (with -g or via
  RAPI). The iallocator will only be passed the nodes belonging to
  that group.
- Moving an instance between groups can only happen via an explicit
  operation, which for example in the case of DRBD will work by
  internally performing a replace-disks, a migration, and a second
  replace-disks (see the sketch after this list). It will be possible
  to clean up an interrupted group-move operation.
- Cluster verify will signal an error if an instance has been left
  mid-transition between groups.
- Inter-group instance migration/failover will check that the target
  group will be able to accept the instance network/storage wise, and
  fail otherwise. In the future we may be able to allow some
  parameters to be changed during the move, but in the first version
  we expect an import/export if this is not possible.
- From an allocation point of view, inter-group movements will be
  shown to an iallocator as a new allocation over the target group.
  Only in a future version may we add allocator extensions to decide
  which group the instance should be in. In the meantime we expect
  Ganeti administrators to either spread instances over different
  groups by filling all groups first, or to have their own strategy
  based on the instance needs.

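As a sketch, for a DRBD instance the explicit group move mentioned
above would roughly correspond to the following sequence of existing
per-instance operations (node names are illustrative; node5 and node6
stand for two nodes of the target group)::

  gnt-instance replace-disks -n node5.example.com instance1  # new secondary in target group
  gnt-instance migrate instance1                             # primary moves to target group
  gnt-instance replace-disks -n node6.example.com instance1  # replace old secondary too
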
Cluster/Internal/Config level changes
+++++++++++++++++++++++++++++++++++++

We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act on one group at a time. The default group will be used if
  none is passed. Command line tools will have a way to easily target
  all groups, by generating one job per group.
- Groups will have a human-readable name, but will internally always
  be referenced by a UUID, which will be immutable. For example, the
  cluster object will contain the UUID of the default group, each node
  will contain the UUID of the group it belongs to, etc. This is done
  to simplify referencing while keeping it easy to handle renames and
  movements (see the example after this list). If we see that this
  works well, we'll transition other config objects (instances, nodes)
  to the same model.
- We will evaluate the addition of a new per-group lock, to see
  whether some operations which now require the BGL can be
  transitioned to it.
- Master candidate status will be allowed to be spread among groups.
  For the first version we won't add any restriction over how this is
  done, although in the future we may, for example, have a minimum
  number of master candidates which Ganeti will try to keep in each
  group.

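For example, because group references are kept by UUID, a rename is
purely cosmetic and existing references stay valid (a hypothetical
session using the commands proposed above)::

  gnt-node group-rename rack1 rack-a
  gnt-node list -g rack-a   # same member nodes as before the rename
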
Other work and future changes
+++++++++++++++++++++++++++++

Commands like gnt-cluster command/copyfile will continue to work on
the whole cluster, but it will be possible to target one group only,
by specifying it.

Commands which allow selection of sets of resources (for example
gnt-instance start/stop) will be able to select them by node group as
well.

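For example (hypothetical invocations, assuming a -g group selector
consistent with the flags proposed above)::

  gnt-cluster copyfile /etc/example.conf           # whole cluster, as today
  gnt-cluster copyfile -g rack1 /etc/example.conf  # one group only
  gnt-cluster command -g rack1 uptime              # run a command on one group
  gnt-instance stop -g rack1                       # stop all instances in one group
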
Initially node groups won't be taggable objects, to simplify the first
implementation, but we expect this to be easy to add in a future
version, should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration
diffusion, to allow better master scalability. For example, it could
be possible to change some all-nodes RPCs to contact each group once,
from the master, and make one node in the group perform internal
diffusion. We won't implement this in the first version, but we'll
evaluate it in the future, if we see scalability problems on big
multi-group clusters.

When Ganeti supports more storage models (e.g. SANs, Sheepdog, Ceph)
we expect groups to be the basis for this, allowing for example a
different Sheepdog/Ceph cluster, or a different SAN, to be connected
to each group. In some cases this will mean that inter-group move
operations will necessarily be performed with instance downtime,
unless the hypervisor has block-migrate functionality, and we
implement support for it (this would be theoretically possible today
with KVM, for example).

Job priorities
--------------