Revision 1b9c867c doc/design-2.3.rst
between machines connected to the same switch might be bigger than the
bandwidth for inter-switch connections.

Moreover, some operations inside a cluster require all nodes to be locked
together for inter-node consistency, and won't scale if we increase the
number of nodes to a few hundred.

...
~~~~~~~~~~~~~~~~

With this change we'll divide Ganeti nodes into groups. Nothing will
change for clusters with only one node group. Bigger clusters will be
able to have more than one group, and each node will belong to exactly
one.
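
To make the invariant concrete, here is a minimal sketch (the names
``Cluster``, ``DEFAULT_GROUP`` and so on are illustrative, not Ganeti's
actual data model) of a cluster where every node maps to exactly one
group, and an unconfigured cluster implicitly has only the default group:

```python
# Illustrative sketch only: the names below are hypothetical and do not
# reflect Ganeti's real configuration objects.

DEFAULT_GROUP = "default"

class Cluster:
    def __init__(self):
        # node name -> group name; every node is in exactly one group
        self.node_group = {}

    def add_node(self, node, group=DEFAULT_GROUP):
        self.node_group[node] = group

    def groups(self):
        return set(self.node_group.values())

cluster = Cluster()
cluster.add_node("node1")
cluster.add_node("node2")
# a one-group cluster behaves exactly as before: only the default group
assert cluster.groups() == {"default"}
cluster.add_node("node3", group="rack2")
assert cluster.groups() == {"default", "rack2"}
```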

Node group management
+++++++++++++++++++++

To manage node groups and the nodes belonging to them, the following new
commands and flags will be introduced::

  gnt-node group-add <group> # add a new node group
  gnt-node group-del <group> # delete an empty group
...
- Moving an instance between groups can only happen via an explicit
  operation, which for example in the case of DRBD will work by
  performing internally a replace-disks, a migration, and a second
  replace-disks. It will be possible to clean up an interrupted
  group-move operation.
- Cluster verify will signal an error if an instance has been left
  mid-transition between groups.
- Inter-group instance migration/failover will check that the target
  group will be able to accept the instance network/storage-wise, and
  fail otherwise. In the future we may be able to allow some parameters
  to be changed during the move, but in the first version we expect an
...
We expect the following changes for cluster management:

- Frequent multinode operations, such as os-diagnose or cluster-verify,
  will act on one group at a time. The default group will be used if none
  is passed. Command line tools will have a way to easily target all
  groups, by generating one job per group.
- Groups will have a human-readable name, but will internally always
...
should we see it's useful.

We envision groups as a good place to enhance cluster scalability. In
the future we may want to use them as units for configuration diffusion,
to allow better master scalability. For example, it could be possible
to change some all-nodes RPCs to contact each group once, from the
master, and make one node in the group perform internal diffusion. We
...
- the total node memory and CPU count change very seldom; the total
  node disk space is also slow-changing, but can change at runtime; the
  free memory and free disk will change significantly for some jobs, but
  on a short timescale; in general, these values will be mostly “constant”
  during the lifetime of a job
- we already have a periodic set of jobs that query the node and
  instance state, driven by the :command:`ganeti-watcher` command, and
  we're just discarding the results after acting on them

Given the above, it makes sense to cache the results of node and instance
state (with a focus on the node state) inside the master daemon.

The cache will not be serialised to disk, and will be for the most part
transparent to the outside of the master daemon.
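
As a rough illustration of such an in-memory, non-persistent cache (all
names below are hypothetical, not the actual master daemon
implementation), node state could be kept with a timestamp so that
slow-changing values are reused within a job's lifetime and stale
entries force a fresh RPC:

```python
import time

# Hypothetical sketch of a masterd-internal node-state cache; it lives
# only in memory (never serialised to disk) and is invisible to callers.
class NodeStateCache:
    def __init__(self, max_age=60.0):
        self.max_age = max_age          # seconds a cached entry stays valid
        self._data = {}                 # node name -> (timestamp, state)

    def put(self, node, state, now=None):
        self._data[node] = (now if now is not None else time.time(), state)

    def get(self, node, now=None):
        """Return the cached state, or None if missing or too old."""
        entry = self._data.get(node)
        if entry is None:
            return None
        ts, state = entry
        if (now if now is not None else time.time()) - ts > self.max_age:
            return None                 # stale: caller must re-query the node
        return state

cache = NodeStateCache(max_age=60.0)
cache.put("node1", {"free_memory": 2048}, now=1000.0)
assert cache.get("node1", now=1030.0) == {"free_memory": 2048}
assert cache.get("node1", now=1100.0) is None   # expired
```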
...
consistent). Partial results will not update the cache (see next
paragraph).

Since there will be no way to feed the cache from outside, and we
would like to have a consistent cache view when driven by the watcher,
we'll introduce a new OpCode/LU for the watcher to run, instead of the
current separate opcodes (see below in the watcher section).
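
The "full results only" rule could look roughly like this (a sketch
under assumed names, not the actual LU code): the cached view is
replaced wholesale only when the query returned data for every node it
asked about, so a partial answer never leaves the cache half-updated:

```python
# Hypothetical sketch: the cache view is swapped atomically only when a
# query covered every requested node; partial results (e.g. some nodes
# unreachable) leave the previous consistent view intact.
class ConsistentCache:
    def __init__(self):
        self.view = {}

    def update(self, requested_nodes, results):
        if set(results) != set(requested_nodes):
            return False        # partial result: do not touch the cache
        self.view = dict(results)
        return True

cache = ConsistentCache()
ok = cache.update(["node1", "node2"], {"node1": "up", "node2": "up"})
assert ok and cache.view["node2"] == "up"
# node2 did not answer this time: the old consistent view survives
ok = cache.update(["node1", "node2"], {"node1": "up"})
assert not ok and cache.view["node2"] == "up"
```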
...
allocation on one group from exclusive blocking jobs on other node
groups.

The capacity calculations will also use the cache. This is detailed in
the respective sections.

Watcher operation
...

This method will feed the cluster state (for the complete set of node
groups, or alternatively just a subset) to the iallocator plugin (either
the specified one, or the default if none is specified), and return the
new capacity in the format currently exported by the htools suite and
known as the “tiered specs” (see :manpage:`hspace(1)`).
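
For illustration only (the authoritative format is whatever
:manpage:`hspace(1)` currently emits; the field names below are made
up), a "tiered specs" style answer might be a list of instance specs,
largest first, each with a count of how many such instances still fit,
from which aggregate capacity can be derived:

```python
# Hypothetical "tiered specs" shaped data; the field names are
# illustrative, not the real htools output format.
tiered_specs = [
    {"memory": 8192, "disk": 102400, "vcpus": 4, "count": 2},
    {"memory": 4096, "disk": 51200, "vcpus": 2, "count": 5},
]

def total_allocatable_memory(specs):
    """Sum memory over all instances that would still fit, per tier."""
    return sum(s["memory"] * s["count"] for s in specs)

assert total_allocatable_memory(tiered_specs) == 2 * 8192 + 5 * 4096
```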