1 | d85f01e7 | Iustin Pop | ======================== |
---|---|---|---|
2 | d85f01e7 | Iustin Pop | Resource model changes |
3 | d85f01e7 | Iustin Pop | ======================== |
4 | d85f01e7 | Iustin Pop | |
5 | d85f01e7 | Iustin Pop | |
6 | d85f01e7 | Iustin Pop | Introduction |
7 | d85f01e7 | Iustin Pop | ============ |
8 | d85f01e7 | Iustin Pop | |
9 | d85f01e7 | Iustin Pop | In order to manage virtual machines across the cluster, Ganeti needs to |
10 | d85f01e7 | Iustin Pop | understand the resources present on the nodes, the hardware and software |
11 | d85f01e7 | Iustin Pop | limitations of the nodes, and how much can be allocated safely on each |
12 | d85f01e7 | Iustin Pop | node. Some of these decisions are delegated to IAllocator plugins, for |
13 | d85f01e7 | Iustin Pop | easier site-level customisation. |
14 | d85f01e7 | Iustin Pop | |
15 | d85f01e7 | Iustin Pop | Similarly, the HTools suite has an internal model that simulates the |
16 | d85f01e7 | Iustin Pop | hardware resource changes in response to Ganeti operations, in order to |
17 | d85f01e7 | Iustin Pop | provide both an iallocator plugin and a tool for balancing the |
18 | d85f01e7 | Iustin Pop | cluster. |
19 | d85f01e7 | Iustin Pop | |
20 | d85f01e7 | Iustin Pop | While currently the HTools model is much more advanced than Ganeti's, |
21 | d85f01e7 | Iustin Pop | neither one is flexible enough and both are heavily geared toward a |
22 | d85f01e7 | Iustin Pop | specific Xen model; they fail to work well with (e.g.) KVM or LXC, or |
23 | d85f01e7 | Iustin Pop | with Xen when :term:`tmem` is enabled. Furthermore, the set of metrics |
24 | d85f01e7 | Iustin Pop | contained in the models is limited to historic requirements and fails to |
25 | d85f01e7 | Iustin Pop | account for (e.g.) heterogeneity in the I/O performance of the nodes. |
26 | d85f01e7 | Iustin Pop | |
27 | d85f01e7 | Iustin Pop | Current situation |
28 | d85f01e7 | Iustin Pop | ================= |
29 | d85f01e7 | Iustin Pop | |
30 | d85f01e7 | Iustin Pop | Ganeti |
31 | d85f01e7 | Iustin Pop | ------ |
32 | d85f01e7 | Iustin Pop | |
33 | d85f01e7 | Iustin Pop | At this moment, Ganeti itself doesn't do any static modelling of the |
34 | d85f01e7 | Iustin Pop | cluster resources. It only does some runtime checks: |
35 | d85f01e7 | Iustin Pop | |
36 | d85f01e7 | Iustin Pop | - when creating instances, for the (current) free disk space |
37 | d85f01e7 | Iustin Pop | - when starting instances, for the (current) free memory |
38 | d85f01e7 | Iustin Pop | - during cluster verify, for enough N+1 memory on the secondaries, based |
39 | d85f01e7 | Iustin Pop | on the (current) free memory |
40 | d85f01e7 | Iustin Pop | |
41 | d85f01e7 | Iustin Pop | Basically this model is a pure :term:`SoW` one, and it works well when |
42 | d85f01e7 | Iustin Pop | there are other instances/LVs on the nodes, as it allows Ganeti to deal |
43 | d85f01e7 | Iustin Pop | with ‘orphan’ resource usage, but on the other hand it has many issues, |
44 | d85f01e7 | Iustin Pop | described below. |
45 | d85f01e7 | Iustin Pop | |
46 | d85f01e7 | Iustin Pop | HTools |
47 | d85f01e7 | Iustin Pop | ------ |
48 | d85f01e7 | Iustin Pop | |
49 | d85f01e7 | Iustin Pop | Since HTools does a pure in-memory modelling of the cluster changes as |
50 | d85f01e7 | Iustin Pop | it executes the balancing or allocation steps, it had to introduce a |
51 | d85f01e7 | Iustin Pop | static (:term:`SoR`) cluster model. |
52 | d85f01e7 | Iustin Pop | |
53 | d85f01e7 | Iustin Pop | The model is constructed based on the received node properties from |
54 | d85f01e7 | Iustin Pop | Ganeti (hence it basically is constructed on what Ganeti can export). |
55 | d85f01e7 | Iustin Pop | |
56 | d85f01e7 | Iustin Pop | Disk |
57 | d85f01e7 | Iustin Pop | ~~~~ |
58 | d85f01e7 | Iustin Pop | |
59 | d85f01e7 | Iustin Pop | For disk it consists of just the total (``tdsk``) and the free disk |
60 | d85f01e7 | Iustin Pop | space (``fdsk``); we don't directly track the used disk space. On top of |
61 | d85f01e7 | Iustin Pop | this, we compute and warn if the sum of disk sizes used by instances |
62 | d85f01e7 | Iustin Pop | does not match ``tdsk - fdsk``, but otherwise we do not track this |
63 | d85f01e7 | Iustin Pop | separately. |
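The consistency check described above can be sketched as follows. This is a minimal illustration, not the HTools implementation: the function name ``disk_mismatch`` and the sample sizes are assumptions, and only ``tdsk`` and ``fdsk`` come from the text.

```python
def disk_mismatch(tdsk, fdsk, instance_disks):
    """Return observed used space minus the space the instances account for.

    A result of 0 means the node is consistent; any other value is the
    amount of disk (same unit as the inputs) that the model cannot explain.
    """
    observed_used = tdsk - fdsk
    expected_used = sum(instance_disks)
    return observed_used - expected_used


# Hypothetical node: 1000 GiB total, 700 GiB free, three instance disks.
delta = disk_mismatch(tdsk=1000, fdsk=700, instance_disks=[100, 100, 50])
if delta != 0:
    print(f"warning: {delta} GiB of used disk space is not owned by any instance")
```

A non-zero result would correspond to the warning case mentioned above, for example orphan volumes on the node.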
64 | d85f01e7 | Iustin Pop | |
65 | d85f01e7 | Iustin Pop | Memory |
66 | d85f01e7 | Iustin Pop | ~~~~~~ |
67 | d85f01e7 | Iustin Pop | |
68 | d85f01e7 | Iustin Pop | For memory, the model is more complex and tracks some variables that |
69 | d85f01e7 | Iustin Pop | Ganeti itself doesn't compute. We start from the total (``tmem``), free |
70 | d85f01e7 | Iustin Pop | (``fmem``) and node memory (``nmem``) as supplied by Ganeti, and |
71 | d85f01e7 | Iustin Pop | additionally we track: |
72 | d85f01e7 | Iustin Pop | |
73 | d85f01e7 | Iustin Pop | instance memory (``imem``) |
74 | d85f01e7 | Iustin Pop | the total memory used by primary instances on the node, computed |
75 | d85f01e7 | Iustin Pop | as the sum of instance memory |
76 | d85f01e7 | Iustin Pop | |
77 | d85f01e7 | Iustin Pop | reserved memory (``rmem``) |
78 | d85f01e7 | Iustin Pop | the memory reserved by peer nodes for N+1 redundancy; this memory is |
79 | d85f01e7 | Iustin Pop | tracked per peer-node, and the maximum value out of the peer memory |
80 | d85f01e7 | Iustin Pop | lists is the node's ``rmem``; when not using DRBD, this will be |
81 | d85f01e7 | Iustin Pop | equal to zero |
82 | d85f01e7 | Iustin Pop | |
83 | d85f01e7 | Iustin Pop | unaccounted memory (``xmem``) |
84 | d85f01e7 | Iustin Pop | memory that cannot be accounted for via the Ganeti model; this is |
85 | d85f01e7 | Iustin Pop | computed at startup as:: |
86 | d85f01e7 | Iustin Pop | |
87 | d85f01e7 | Iustin Pop | tmem - imem - nmem - fmem |
88 | d85f01e7 | Iustin Pop | |
89 | d85f01e7 | Iustin Pop | and is presumed to remain constant irrespective of any instance |
90 | d85f01e7 | Iustin Pop | moves |
91 | d85f01e7 | Iustin Pop | |
92 | d85f01e7 | Iustin Pop | available memory (``amem``) |
93 | d85f01e7 | Iustin Pop | this is simply ``fmem - rmem``, so unless we use DRBD, this will be |
94 | d85f01e7 | Iustin Pop | equal to ``fmem`` |
95 | d85f01e7 | Iustin Pop | |
96 | d85f01e7 | Iustin Pop | ``tmem``, ``nmem`` and ``xmem`` are presumed constant during the |
97 | d85f01e7 | Iustin Pop | instance moves, whereas the ``fmem``, ``imem``, ``rmem`` and ``amem`` |
98 | d85f01e7 | Iustin Pop | values are updated according to the executed moves. |
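The derived memory values can be illustrated with a short sketch. The function ``derive_memory`` and the shape of its inputs (a list of primary instance sizes, and a mapping from peer node to the sizes of that peer's instances which use this node as secondary) are assumptions made for the example; the formulas are the ones given above.

```python
def derive_memory(tmem, fmem, nmem, primaries, peers):
    """Compute imem, rmem, xmem and amem for one node (all values in MiB).

    primaries: memory sizes of the instances running on this node
    peers:     {peer_node: [memory sizes of that peer's primary instances
                            that have this node as their secondary]}
    """
    imem = sum(primaries)
    # Only one peer is assumed to fail at a time, so reserve for the worst one.
    rmem = max((sum(mems) for mems in peers.values()), default=0)
    xmem = tmem - imem - nmem - fmem   # computed once, then presumed constant
    amem = fmem - rmem
    return {"imem": imem, "rmem": rmem, "xmem": xmem, "amem": amem}


drbd_node = derive_memory(
    tmem=32768, fmem=20000, nmem=1024,
    primaries=[4096, 4096, 2048],
    peers={"node2": [4096, 2048], "node3": [4096]},
)
print(drbd_node)
```

With no DRBD peers the ``peers`` mapping is empty, ``rmem`` is zero and ``amem`` equals ``fmem``, as stated above.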
99 | d85f01e7 | Iustin Pop | |
100 | d85f01e7 | Iustin Pop | CPU |
101 | d85f01e7 | Iustin Pop | ~~~ |
102 | d85f01e7 | Iustin Pop | |
103 | d85f01e7 | Iustin Pop | The CPU model is different than the disk/memory models, since it's the |
104 | d85f01e7 | Iustin Pop | only one where: |
105 | d85f01e7 | Iustin Pop | |
106 | d85f01e7 | Iustin Pop | #. we do oversubscribe physical CPUs |
107 | d85f01e7 | Iustin Pop | #. and there is no natural limit for the number of VCPUs we can allocate |
108 | d85f01e7 | Iustin Pop | |
109 | d85f01e7 | Iustin Pop | We therefore track the total number of VCPUs used on the node and the |
110 | d85f01e7 | Iustin Pop | number of physical CPUs, and we cap the vcpu-to-cpu ratio in order to |
111 | d85f01e7 | Iustin Pop | make this somewhat more similar to the other resources which are |
112 | d85f01e7 | Iustin Pop | limited. |
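A minimal sketch of such a cap is shown below. The name ``can_add_vcpus`` and the ``max_ratio`` parameter are assumptions for illustration; the text only states that the vcpu-to-cpu ratio is capped.

```python
def can_add_vcpus(vcpus_used, physical_cpus, new_vcpus, max_ratio):
    """Return True if adding new_vcpus keeps the node within max_ratio."""
    return (vcpus_used + new_vcpus) / physical_cpus <= max_ratio


# Hypothetical node: 8 physical CPUs, 28 VCPUs already allocated, cap of 4:1.
print(can_add_vcpus(28, 8, new_vcpus=4, max_ratio=4.0))
print(can_add_vcpus(28, 8, new_vcpus=5, max_ratio=4.0))
```

With the cap in place, CPU behaves like a limited resource in the same way that disk and memory do.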
113 | d85f01e7 | Iustin Pop | |
114 | d85f01e7 | Iustin Pop | Dynamic load |
115 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~ |
116 | d85f01e7 | Iustin Pop | |
117 | d85f01e7 | Iustin Pop | There is also a model that deals with *dynamic load* values in |
118 | d85f01e7 | Iustin Pop | htools. As far as we know, it is not currently used with actual load |
119 | d85f01e7 | Iustin Pop | values, but it is active by default with unitary values for all |
120 | d85f01e7 | Iustin Pop | instances; it currently tracks these metrics: |
121 | d85f01e7 | Iustin Pop | |
122 | d85f01e7 | Iustin Pop | - disk load |
123 | d85f01e7 | Iustin Pop | - memory load |
124 | d85f01e7 | Iustin Pop | - cpu load |
125 | d85f01e7 | Iustin Pop | - network load |
126 | d85f01e7 | Iustin Pop | |
127 | d85f01e7 | Iustin Pop | Even though we do not assign real values to these load values, the fact |
128 | d85f01e7 | Iustin Pop | that we at least sum them means that the algorithm tries to equalise |
129 | d85f01e7 | Iustin Pop | these loads, and especially the network load, which is otherwise not |
130 | d85f01e7 | Iustin Pop | tracked at all. The practical result (due to a combination of these four |
131 | d85f01e7 | Iustin Pop | metrics) is that the number of secondaries will be balanced. |
132 | d85f01e7 | Iustin Pop | |
133 | d85f01e7 | Iustin Pop | Limitations |
134 | d85f01e7 | Iustin Pop | ----------- |
135 | d85f01e7 | Iustin Pop | |
136 | d85f01e7 | Iustin Pop | |
137 | d85f01e7 | Iustin Pop | There are unfortunately many limitations to the current model. |
138 | d85f01e7 | Iustin Pop | |
139 | d85f01e7 | Iustin Pop | Memory |
140 | d85f01e7 | Iustin Pop | ~~~~~~ |
141 | d85f01e7 | Iustin Pop | |
142 | d85f01e7 | Iustin Pop | The memory model doesn't work well in case of KVM. For Xen, the memory |
143 | d85f01e7 | Iustin Pop | for the node (i.e. ``dom0``) can be static or dynamic; we don't support |
144 | d85f01e7 | Iustin Pop | the latter case, but for the former case, the static value is configured |
145 | d85f01e7 | Iustin Pop | in Xen/kernel command line, and can be queried from Xen |
146 | d85f01e7 | Iustin Pop | itself. Therefore, Ganeti can query the hypervisor for the memory used |
147 | d85f01e7 | Iustin Pop | for the node; the same model was adopted for the chroot/KVM/LXC |
148 | d85f01e7 | Iustin Pop | hypervisors, but in these cases there's no natural value for the memory |
149 | d85f01e7 | Iustin Pop | used by the base OS/kernel, and we currently try to compute a value for |
150 | d85f01e7 | Iustin Pop | the node memory based on current consumption. This, being variable, |
151 | d85f01e7 | Iustin Pop | breaks the assumptions in both Ganeti and HTools. |
152 | d85f01e7 | Iustin Pop | |
153 | d85f01e7 | Iustin Pop | This problem also shows for the free memory: if the free memory on the |
154 | d85f01e7 | Iustin Pop | node is not constant (Xen with :term:`tmem` auto-ballooning enabled), or |
155 | d85f01e7 | Iustin Pop | if the node and instance memory are pooled together (Linux-based |
156 | d85f01e7 | Iustin Pop | hypervisors like KVM and LXC), the current value of the free memory is |
157 | d85f01e7 | Iustin Pop | meaningless and cannot be used for instance checks. |
158 | d85f01e7 | Iustin Pop | |
159 | d85f01e7 | Iustin Pop | A separate issue related to the free memory tracking is that since we |
160 | d85f01e7 | Iustin Pop | don't track memory use but rather memory availability, an instance that |
161 | d85f01e7 | Iustin Pop | is temporarily down changes Ganeti's understanding of the memory status of |
162 | d85f01e7 | Iustin Pop | the node. This can lead to problems such as: |
163 | d85f01e7 | Iustin Pop | |
164 | d85f01e7 | Iustin Pop | .. digraph:: "free-mem-issue" |
165 | d85f01e7 | Iustin Pop | |
166 | d85f01e7 | Iustin Pop | node [shape=box]; |
167 | d85f01e7 | Iustin Pop | inst1 [label="instance1"]; |
168 | d85f01e7 | Iustin Pop | inst2 [label="instance2"]; |
169 | d85f01e7 | Iustin Pop | |
170 | d85f01e7 | Iustin Pop | node [shape=note]; |
171 | d85f01e7 | Iustin Pop | nodeA [label="fmem=0"]; |
172 | d85f01e7 | Iustin Pop | nodeB [label="fmem=1"]; |
173 | d85f01e7 | Iustin Pop | nodeC [label="fmem=0"]; |
174 | d85f01e7 | Iustin Pop | |
175 | d85f01e7 | Iustin Pop | node [shape=ellipse, style=filled, fillcolor=green] |
176 | d85f01e7 | Iustin Pop | |
177 | d85f01e7 | Iustin Pop | {rank=same; inst1 inst2} |
178 | d85f01e7 | Iustin Pop | |
179 | d85f01e7 | Iustin Pop | stop [label="crash!", fillcolor=orange]; |
180 | d85f01e7 | Iustin Pop | migrate [label="migrate/ok"]; |
181 | d85f01e7 | Iustin Pop | start [style=filled, fillcolor=red, label="start/fail"]; |
182 | d85f01e7 | Iustin Pop | inst1 -> stop -> start; |
183 | d85f01e7 | Iustin Pop | stop -> migrate -> start [style=invis, weight=0]; |
184 | d85f01e7 | Iustin Pop | inst2 -> migrate; |
185 | d85f01e7 | Iustin Pop | |
186 | d85f01e7 | Iustin Pop | {rank=same; inst1 inst2 nodeA} |
187 | d85f01e7 | Iustin Pop | {rank=same; stop nodeB} |
188 | d85f01e7 | Iustin Pop | {rank=same; migrate nodeC} |
189 | d85f01e7 | Iustin Pop | |
190 | d85f01e7 | Iustin Pop | nodeA -> nodeB -> nodeC [style=invis, weight=1]; |
191 | d85f01e7 | Iustin Pop | |
192 | d85f01e7 | Iustin Pop | The behaviour here is wrong; the migration of *instance2* to the node in |
193 | d85f01e7 | Iustin Pop | question will succeed or fail depending on whether *instance1* is |
194 | d85f01e7 | Iustin Pop | running or not. And for *instance1*, it can lead to cases where, if it |
195 | d85f01e7 | Iustin Pop | crashes, it cannot restart anymore. |
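The sequence in the diagram can be reproduced with a toy model in which admission checks look only at the current free memory. The ``Node`` class and its methods are hypothetical and exist only to make the failure mode concrete; memory is expressed in the same abstract units as the diagram.

```python
class Node:
    """Toy node that admits instances based only on *current* free memory."""

    def __init__(self, total):
        self.total = total
        self.running = {}            # instance name -> memory in use

    @property
    def fmem(self):
        return self.total - sum(self.running.values())

    def start(self, name, mem):
        """Start (or accept a migration of) an instance if fmem allows it."""
        if mem > self.fmem:
            return False
        self.running[name] = mem
        return True

    def stop(self, name):
        """Model a crash or shutdown: the memory is simply released."""
        self.running.pop(name, None)


node = Node(total=1)
node.start("instance1", 1)                 # fmem drops to 0
node.stop("instance1")                     # crash: fmem is 1 again
migrate_ok = node.start("instance2", 1)    # migration is accepted
restart_ok = node.start("instance1", 1)    # restart is now refused
print(migrate_ok, restart_ok, node.fmem)
```

Because nothing records that *instance1* still "owns" its memory while it is down, the migration of *instance2* is accepted and the later restart fails.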
196 | d85f01e7 | Iustin Pop | |
197 | d85f01e7 | Iustin Pop | Finally, not a problem but rather an important missing feature is support |
198 | d85f01e7 | Iustin Pop | for memory over-subscription: both Xen and KVM support memory |
199 | d85f01e7 | Iustin Pop | ballooning, even automatic memory ballooning, for a while now. The |
200 | d85f01e7 | Iustin Pop | entire memory model is based on a fixed memory size for instances, and |
201 | d85f01e7 | Iustin Pop | if memory ballooning is enabled, it will “break” the HTools |
202 | d85f01e7 | Iustin Pop | algorithm. Even the fact that KVM instances do not use all memory from |
203 | d85f01e7 | Iustin Pop | the start creates problems (although less severe, since it will grow and |
204 | d85f01e7 | Iustin Pop | stabilise in the end). |
205 | d85f01e7 | Iustin Pop | |
206 | d85f01e7 | Iustin Pop | Disks |
207 | d85f01e7 | Iustin Pop | ~~~~~ |
208 | d85f01e7 | Iustin Pop | |
209 | d85f01e7 | Iustin Pop | Because we currently track only disk space, if we have a |
210 | d85f01e7 | Iustin Pop | cluster of ``N`` otherwise identical nodes but half of them have 10 |
211 | d85f01e7 | Iustin Pop | drives of size ``X`` and the other half 2 drives of size ``5X``, HTools |
212 | d85f01e7 | Iustin Pop | will consider them exactly the same. However, in the case of mechanical |
213 | d85f01e7 | Iustin Pop | drives at least, the I/O performance will differ significantly based on |
214 | d85f01e7 | Iustin Pop | spindle count, and a “fair” load distribution should take this into |
215 | d85f01e7 | Iustin Pop | account (a similar comment can be made about processor/memory/network |
216 | d85f01e7 | Iustin Pop | speed). |
217 | d85f01e7 | Iustin Pop | |
218 | d85f01e7 | Iustin Pop | Another problem related to the spindle count is the LVM allocation |
219 | d85f01e7 | Iustin Pop | algorithm. Currently, the algorithm always creates (or tries to create) |
220 | d85f01e7 | Iustin Pop | striped volumes, with the stripe count being hard-coded to the |
221 | d85f01e7 | Iustin Pop | ``./configure`` parameter ``--with-lvm-stripecount``. This creates |
222 | d85f01e7 | Iustin Pop | problems like: |
223 | d85f01e7 | Iustin Pop | |
224 | d85f01e7 | Iustin Pop | - when installing from a distribution package, all clusters will be |
225 | d85f01e7 | Iustin Pop | either limited or overloaded due to this fixed value |
226 | d85f01e7 | Iustin Pop | - it is not possible to mix heterogeneous nodes (even in different node |
227 | d85f01e7 | Iustin Pop | groups) and have optimal settings for all nodes |
228 | d85f01e7 | Iustin Pop | - the striping value applies both to LVM/DRBD data volumes (which are on |
229 | d85f01e7 | Iustin Pop | the order of gigabytes to hundreds of gigabytes) and to DRBD metadata |
230 | d85f01e7 | Iustin Pop | volumes (whose size is always fixed at 128MB); when striping such |
231 | d85f01e7 | Iustin Pop | small volumes over many PVs, their size will increase needlessly (and |
232 | d85f01e7 | Iustin Pop | this can confuse HTools' disk computation algorithm) |
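The size increase for small striped volumes can be estimated with a short calculation. This sketch assumes that LVM rounds a striped volume up to a whole number of extents per stripe and that the extent size is 4 MiB (the usual default); both are assumptions about the local LVM setup, and the helper ``striped_lv_size`` is hypothetical.

```python
import math


def striped_lv_size(requested_mib, stripes, extent_mib=4):
    """Estimate the allocated size of a striped LV, in MiB.

    The request is first rounded up to whole extents, then the extent
    count is rounded up to a multiple of the stripe count.
    """
    extents = math.ceil(requested_mib / extent_mib)
    extents = math.ceil(extents / stripes) * stripes
    return extents * extent_mib


for stripes in (1, 2, 3, 5):
    print(stripes, striped_lv_size(128, stripes))
```

Under these assumptions a 128 MiB metadata volume stays at 128 MiB with one or two stripes, but grows when the stripe count does not divide its extent count, which is the kind of discrepancy that can confuse the disk computation.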
233 | d85f01e7 | Iustin Pop | |
234 | d85f01e7 | Iustin Pop | Moreover, allocation is currently based on a ‘most free |
235 | d85f01e7 | Iustin Pop | space’ algorithm. This balances the free space usage on disks, but on |
236 | d85f01e7 | Iustin Pop | the other hand it tends to mix the data and metadata volumes of |
237 | d85f01e7 | Iustin Pop | different instances rather badly. For example, it cannot do the following: |
238 | d85f01e7 | Iustin Pop | |
239 | d85f01e7 | Iustin Pop | - keep DRBD data and metadata volumes on the same drives, in order to |
240 | d85f01e7 | Iustin Pop | reduce exposure to drive failure in a many-drives system |
241 | d85f01e7 | Iustin Pop | - keep DRBD data and metadata volumes on different drives, to reduce |
242 | d85f01e7 | Iustin Pop | performance impact of metadata writes |
243 | d85f01e7 | Iustin Pop | |
244 | d85f01e7 | Iustin Pop | Additionally, while Ganeti supports setting the volume group separately for |
245 | d85f01e7 | Iustin Pop | data and metadata volumes at instance creation, there are no defaults |
246 | d85f01e7 | Iustin Pop | for this setting. |
247 | d85f01e7 | Iustin Pop | |
248 | d85f01e7 | Iustin Pop | Similar to the above stripe count problem (which is about insufficient |
249 | d85f01e7 | Iustin Pop | customisation of Ganeti's behaviour), we have limited |
250 | d85f01e7 | Iustin Pop | pass-through customisation of the various options of our storage |
251 | d85f01e7 | Iustin Pop | backends; while LVM has a system-wide configuration file that can be |
252 | d85f01e7 | Iustin Pop | used to tweak some of its behaviours, for DRBD we don't use the |
253 | d85f01e7 | Iustin Pop | :command:`drbdadm` tool, and instead we call :command:`drbdsetup` |
254 | d85f01e7 | Iustin Pop | directly, with a fixed/restricted set of options; so for example one |
255 | d85f01e7 | Iustin Pop | cannot tweak the buffer sizes. |
256 | d85f01e7 | Iustin Pop | |
257 | d85f01e7 | Iustin Pop | Another current problem is that the support for shared storage in HTools |
258 | d85f01e7 | Iustin Pop | is still limited, but this problem is outside of this design document. |
259 | d85f01e7 | Iustin Pop | |
260 | d85f01e7 | Iustin Pop | Locking |
261 | d85f01e7 | Iustin Pop | ~~~~~~~ |
262 | d85f01e7 | Iustin Pop | |
263 | d85f01e7 | Iustin Pop | A further problem generated by the “current free” model is that during a |
264 | d85f01e7 | Iustin Pop | long operation which affects resource usage (e.g. disk replaces, |
265 | d85f01e7 | Iustin Pop | instance creations) we have to keep the respective objects locked |
266 | d85f01e7 | Iustin Pop | (sometimes even in exclusive mode), since we don't want any concurrent |
267 | d85f01e7 | Iustin Pop | modifications to the *free* values. |
268 | d85f01e7 | Iustin Pop | |
269 | d85f01e7 | Iustin Pop | A classic example of the locking problem is the following: |
270 | d85f01e7 | Iustin Pop | |
271 | d85f01e7 | Iustin Pop | .. digraph:: "iallocator-lock-issues" |
272 | d85f01e7 | Iustin Pop | |
273 | d85f01e7 | Iustin Pop | rankdir=TB; |
274 | d85f01e7 | Iustin Pop | |
275 | d85f01e7 | Iustin Pop | start [style=invis]; |
276 | d85f01e7 | Iustin Pop | node [shape=box,width=2]; |
277 | d85f01e7 | Iustin Pop | job1 [label="add instance\niallocator run\nchoose A,B"]; |
278 | d85f01e7 | Iustin Pop | job1e [label="finish add"]; |
279 | d85f01e7 | Iustin Pop | job2 [label="add instance\niallocator run\nwait locks"]; |
280 | d85f01e7 | Iustin Pop | job2s [label="acquire locks\nchoose C,D"]; |
281 | d85f01e7 | Iustin Pop | job2e [label="finish add"]; |
282 | d85f01e7 | Iustin Pop | |
283 | d85f01e7 | Iustin Pop | job1 -> job1e; |
284 | d85f01e7 | Iustin Pop | job2 -> job2s -> job2e; |
285 | d85f01e7 | Iustin Pop | edge [style=invis,weight=0]; |
286 | d85f01e7 | Iustin Pop | start -> {job1; job2} |
287 | d85f01e7 | Iustin Pop | job1 -> job2; |
288 | d85f01e7 | Iustin Pop | job2 -> job1e; |
289 | d85f01e7 | Iustin Pop | job1e -> job2s [style=dotted,label="release locks"]; |
290 | d85f01e7 | Iustin Pop | |
291 | d85f01e7 | Iustin Pop | In the above example, the second IAllocator run will wait for locks for |
292 | d85f01e7 | Iustin Pop | nodes ``A`` and ``B``, even though in the end the second instance will |
293 | d85f01e7 | Iustin Pop | be placed on another set of nodes (``C`` and ``D``). This wait shouldn't |
294 | d85f01e7 | Iustin Pop | be needed, since right after the first IAllocator run has finished, |
295 | d85f01e7 | Iustin Pop | :command:`hail` knows the status of the cluster after the allocation, |
296 | d85f01e7 | Iustin Pop | and it could answer the question for the second run too; however, Ganeti |
297 | d85f01e7 | Iustin Pop | doesn't have such visibility into the cluster state and thus it is |
298 | d85f01e7 | Iustin Pop | forced to wait with the second job. |
299 | d85f01e7 | Iustin Pop | |
300 | d85f01e7 | Iustin Pop | Similar examples can be made about replace disks (another long-running |
301 | d85f01e7 | Iustin Pop | opcode). |
302 | d85f01e7 | Iustin Pop | |
303 | d85f01e7 | Iustin Pop | .. _label-policies: |
304 | d85f01e7 | Iustin Pop | |
305 | d85f01e7 | Iustin Pop | Policies |
306 | d85f01e7 | Iustin Pop | ~~~~~~~~ |
307 | d85f01e7 | Iustin Pop | |
308 | d85f01e7 | Iustin Pop | For most of the resources, we have metrics defined by policy: e.g. the |
309 | d85f01e7 | Iustin Pop | over-subscription ratio for CPUs, the amount of space to reserve, |
310 | d85f01e7 | Iustin Pop | etc. Furthermore, although Ganeti has no definitions for limits such |
311 | d85f01e7 | Iustin Pop | as minimum/maximum instance size, a real deployment will need to have |
312 | d85f01e7 | Iustin Pop | them, especially in a fully-automated workflow where end-users can |
313 | d85f01e7 | Iustin Pop | request instances via an automated interface (that talks to the cluster |
314 | d85f01e7 | Iustin Pop | via RAPI, LUXI or command line). However, such an automated interface |
315 | d85f01e7 | Iustin Pop | will need to also take into account cluster capacity, and if the |
316 | d85f01e7 | Iustin Pop | :command:`hspace` tool is used for the capacity computation, it needs to |
317 | d85f01e7 | Iustin Pop | be told the maximum instance size; however, it has a built-in minimum |
318 | d85f01e7 | Iustin Pop | instance size which is not customisable. |
319 | d85f01e7 | Iustin Pop | |
320 | d85f01e7 | Iustin Pop | It is clear that this situation leads to duplicate definition of |
321 | d85f01e7 | Iustin Pop | resource policies which makes it hard to easily change per-cluster (or |
322 | d85f01e7 | Iustin Pop | globally) the respective policies, and furthermore it creates |
323 | d85f01e7 | Iustin Pop | inconsistencies if such policies are not enforced at the source (i.e. in |
324 | d85f01e7 | Iustin Pop | Ganeti). |
325 | d85f01e7 | Iustin Pop | |
326 | d85f01e7 | Iustin Pop | Balancing algorithm |
327 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~ |
328 | d85f01e7 | Iustin Pop | |
329 | d85f01e7 | Iustin Pop | The balancing algorithm, as documented in the HTools ``README`` file, |
330 | d85f01e7 | Iustin Pop | tries to minimise the cluster score; this score is based on a set of |
331 | d85f01e7 | Iustin Pop | metrics that describe both exceptional conditions and how spread the |
332 | d85f01e7 | Iustin Pop | instances are across the nodes. In order to achieve this goal, it moves |
333 | d85f01e7 | Iustin Pop | the instances around, with a series of moves of various types: |
334 | d85f01e7 | Iustin Pop | |
335 | d85f01e7 | Iustin Pop | - disk replaces (for DRBD-based instances) |
336 | d85f01e7 | Iustin Pop | - instance failover/migrations (for all types) |
337 | d85f01e7 | Iustin Pop | |
338 | d85f01e7 | Iustin Pop | However, the algorithm only looks at the cluster score, and not at the |
339 | d85f01e7 | Iustin Pop | *“cost”* of the moves. In other words, the following can and will happen |
340 | d85f01e7 | Iustin Pop | on a cluster: |
341 | d85f01e7 | Iustin Pop | |
342 | d85f01e7 | Iustin Pop | .. digraph:: "balancing-cost-issues" |
343 | d85f01e7 | Iustin Pop | |
344 | d85f01e7 | Iustin Pop | rankdir=LR; |
345 | d85f01e7 | Iustin Pop | ranksep=1; |
346 | d85f01e7 | Iustin Pop | |
347 | d85f01e7 | Iustin Pop | start [label="score α", shape=hexagon]; |
348 | d85f01e7 | Iustin Pop | |
349 | d85f01e7 | Iustin Pop | node [shape=box, width=2]; |
350 | d85f01e7 | Iustin Pop | replace1 [label="replace_disks 500G\nscore α-3ε\ncost 3"]; |
351 | d85f01e7 | Iustin Pop | replace2a [label="replace_disks 20G\nscore α-2ε\ncost 2"]; |
352 | d85f01e7 | Iustin Pop | migrate1 [label="migrate\nscore α-ε\ncost 1"]; |
353 | d85f01e7 | Iustin Pop | |
354 | d85f01e7 | Iustin Pop | choose [shape=ellipse,label="choose min(score)=α-3ε\ncost 3"]; |
355 | d85f01e7 | Iustin Pop | |
356 | d85f01e7 | Iustin Pop | start -> {replace1; replace2a; migrate1} -> choose; |
357 | d85f01e7 | Iustin Pop | |
358 | d85f01e7 | Iustin Pop | Even though a migration is much, much cheaper than a disk replace (in |
359 | d85f01e7 | Iustin Pop | terms of network and disk traffic on the cluster), if the disk replace |
360 | d85f01e7 | Iustin Pop | results in a score infinitesimally smaller, then it will be |
361 | d85f01e7 | Iustin Pop | chosen. Similarly, between two disk replaces, one moving e.g. ``500GiB`` |
362 | d85f01e7 | Iustin Pop | and one moving ``20GiB``, the first one will be chosen if it results in |
363 | d85f01e7 | Iustin Pop | a score smaller than the second one. Furthermore, even if the resulting |
364 | d85f01e7 | Iustin Pop | scores are equal, the first computed solution will be kept, whichever it |
365 | d85f01e7 | Iustin Pop | is. |
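The selection rule from the diagram can be sketched in a few lines. ``choose_by_score`` mirrors the behaviour described above; ``choose_with_cost`` and its ``cost_weight`` parameter are purely hypothetical and are shown only to contrast with a rule that would take move cost into account. The numeric values of α and ε are arbitrary.

```python
from collections import namedtuple

Move = namedtuple("Move", "name score cost")


def choose_by_score(moves):
    """Current behaviour: lowest score wins, cost is ignored.

    min() returns the first minimal element, so on a tie the first
    computed solution is kept.
    """
    return min(moves, key=lambda m: m.score)


def choose_with_cost(moves, cost_weight):
    """Hypothetical alternative: penalise each move by its cost."""
    return min(moves, key=lambda m: m.score + cost_weight * m.cost)


alpha, eps = 10.0, 0.001
moves = [
    Move("replace_disks 500G", alpha - 3 * eps, 3),
    Move("replace_disks 20G", alpha - 2 * eps, 2),
    Move("migrate", alpha - eps, 1),
]
print(choose_by_score(moves).name)
print(choose_with_cost(moves, cost_weight=0.01).name)
```

With the score-only rule the 500G disk replace is selected even though its advantage is only a few ε; any non-trivial cost weighting would prefer the migration instead.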
366 | d85f01e7 | Iustin Pop | |
367 | d85f01e7 | Iustin Pop | Fixing this algorithmic problem is doable, but currently Ganeti doesn't |
368 | d85f01e7 | Iustin Pop | export enough information about nodes to make an informed decision; in |
369 | d85f01e7 | Iustin Pop | the above example, if the ``500GiB`` move is between nodes having fast |
370 | d85f01e7 | Iustin Pop | I/O (both disks and network), it makes sense to execute it over a disk |
371 | d85f01e7 | Iustin Pop | replace of ``20GiB`` between nodes with slow I/O, so simply relating to |
372 | d85f01e7 | Iustin Pop | the properties of the move itself is not enough; we need more node |
373 | d85f01e7 | Iustin Pop | information for cost computation. |
374 | d85f01e7 | Iustin Pop | |
375 | d85f01e7 | Iustin Pop | Allocation algorithm |
376 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~ |
377 | d85f01e7 | Iustin Pop | |
378 | d85f01e7 | Iustin Pop | .. note:: This design document will not address this limitation, but it |
379 | d85f01e7 | Iustin Pop | is worth mentioning as it is directly related to the resource model. |
380 | d85f01e7 | Iustin Pop | |
381 | d85f01e7 | Iustin Pop | The current allocation/capacity algorithm works as follows (per |
382 | d85f01e7 | Iustin Pop | node-group):: |
383 | d85f01e7 | Iustin Pop | |
384 | d85f01e7 | Iustin Pop | repeat: |
385 | d85f01e7 | Iustin Pop | allocate instance without failing N+1 |
386 | d85f01e7 | Iustin Pop | |
387 | d85f01e7 | Iustin Pop | This simple algorithm, and its use of the ``N+1`` criterion, has a built-in |
388 | d85f01e7 | Iustin Pop | limit of 1 machine failure in case of DRBD. This means the algorithm |
389 | d85f01e7 | Iustin Pop | guarantees that, if using DRBD storage, there are enough resources to |
390 | d85f01e7 | Iustin Pop | (re)start all affected instances in case of one machine failure. This |
391 | d85f01e7 | Iustin Pop | relates mostly to memory; there is no account for CPU over-subscription |
392 | d85f01e7 | Iustin Pop | (i.e. in case of failure, make sure we can failover while still not |
393 | d85f01e7 | Iustin Pop | going over CPU limits), or for any other resource. |
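The loop above, restricted to memory and DRBD instances, can be sketched as follows. This is a simplified illustration under several assumptions: placement is first-fit over (primary, secondary) pairs rather than score-based as in :command:`hail`, only memory is checked, and the helper names (``make_nodes``, ``n1_ok``, ``try_allocate``, ``capacity``) are hypothetical.

```python
def make_nodes(names, fmem):
    """Build empty nodes: free memory plus per-peer reserved memory."""
    return {name: {"fmem": fmem, "peers": {}} for name in names}


def n1_ok(nodes):
    """N+1 holds if every node can absorb the failover of any single peer."""
    return all(
        node["fmem"] >= max(node["peers"].values(), default=0)
        for node in nodes.values()
    )


def try_allocate(nodes, mem):
    """Place one DRBD instance on the first node pair that keeps N+1 valid."""
    for pri in nodes:
        if nodes[pri]["fmem"] < mem:
            continue
        for sec in nodes:
            if sec == pri:
                continue
            peers = nodes[sec]["peers"]
            # Tentatively apply the allocation.
            nodes[pri]["fmem"] -= mem
            peers[pri] = peers.get(pri, 0) + mem
            if n1_ok(nodes):
                return pri, sec
            # Roll back and try the next pair.
            nodes[pri]["fmem"] += mem
            peers[pri] -= mem
    return None


def capacity(nodes, mem):
    """repeat: allocate instance without failing N+1."""
    count = 0
    while try_allocate(nodes, mem) is not None:
        count += 1
    return count


cluster = make_nodes(["A", "B", "C"], fmem=4)
print(capacity(cluster, mem=1))
```

Note that the check covers exactly one failing node and only memory, which is the built-in limit discussed above; surviving two failures or enforcing CPU limits would need a different ``n1_ok``.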
394 | d85f01e7 | Iustin Pop | |
395 | d85f01e7 | Iustin Pop | In case of shared storage, there's not even the memory guarantee, as the |
396 | d85f01e7 | Iustin Pop | N+1 protection doesn't work for shared storage. |
397 | d85f01e7 | Iustin Pop | |
398 | d85f01e7 | Iustin Pop | If a given cluster administrator wants to survive up to two machine |
399 | d85f01e7 | Iustin Pop | failures, or wants to ensure CPU limits too for DRBD, there is no |
400 | d85f01e7 | Iustin Pop | possibility to configure this in HTools (neither in :command:`hail` nor |
401 | d85f01e7 | Iustin Pop | in :command:`hspace`). Current workarounds include, for example, deducting a |
402 | d85f01e7 | Iustin Pop | certain number of instances from the size computed by :command:`hspace`, |
403 | d85f01e7 | Iustin Pop | but this is a very crude method, and requires that instance creations |
404 | d85f01e7 | Iustin Pop | are limited before Ganeti (otherwise :command:`hail` would allocate |
405 | d85f01e7 | Iustin Pop | until the cluster is full). |
406 | d85f01e7 | Iustin Pop | |
407 | d85f01e7 | Iustin Pop | Proposed architecture |
408 | d85f01e7 | Iustin Pop | ===================== |
409 | d85f01e7 | Iustin Pop | |
410 | d85f01e7 | Iustin Pop | |
411 | d85f01e7 | Iustin Pop | There are two main changes proposed: |
412 | d85f01e7 | Iustin Pop | |
413 | d85f01e7 | Iustin Pop | - changing the resource model from a pure :term:`SoW` to a hybrid |
414 | d85f01e7 | Iustin Pop | :term:`SoR`/:term:`SoW` one, where the :term:`SoR` component is |
415 | d85f01e7 | Iustin Pop | heavily emphasised |
416 | d85f01e7 | Iustin Pop | - extending the resource model to cover additional properties, |
417 | d85f01e7 | Iustin Pop | completing the “holes” in the current coverage |
418 | d85f01e7 | Iustin Pop | |
419 | d85f01e7 | Iustin Pop | The second change is rather straightforward, but will add more |
420 | d85f01e7 | Iustin Pop | complexity in the modelling of the cluster. The first change, however, |
421 | d85f01e7 | Iustin Pop | represents a significant shift from the current model, which Ganeti had |
422 | d85f01e7 | Iustin Pop | from its beginnings. |
423 | d85f01e7 | Iustin Pop | |
424 | d85f01e7 | Iustin Pop | Lock-improved resource model |
425 | d85f01e7 | Iustin Pop | ---------------------------- |
426 | d85f01e7 | Iustin Pop | |
427 | d85f01e7 | Iustin Pop | Hybrid SoR/SoW model |
428 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~ |
429 | d85f01e7 | Iustin Pop | |
430 | d85f01e7 | Iustin Pop | The resources of a node can be characterised in two broad classes: |
431 | d85f01e7 | Iustin Pop | |
432 | d85f01e7 | Iustin Pop | - mostly static resources |
433 | d85f01e7 | Iustin Pop | - dynamically changing resources |
434 | d85f01e7 | Iustin Pop | |
435 | d85f01e7 | Iustin Pop | In the first category, we have things such as total core count, total |
436 | d85f01e7 | Iustin Pop | memory size, total disk size, number of network interfaces etc. In the |
437 | d85f01e7 | Iustin Pop | second category we have things such as free disk space, free memory, CPU |
438 | d85f01e7 | Iustin Pop | load, etc. Note that nowadays we don't have (anymore) fully-static |
439 | d85f01e7 | Iustin Pop | resources: features like CPU and memory hot-plug, online disk replace, |
440 | d85f01e7 | Iustin Pop | etc. mean that theoretically all resources can change (there are some |
441 | d85f01e7 | Iustin Pop | practical limitations, of course). |
442 | d85f01e7 | Iustin Pop | |
443 | d85f01e7 | Iustin Pop | Even though the rate of change of the two resource types is wildly |
444 | d85f01e7 | Iustin Pop | different, right now Ganeti handles both the same. Given that the |
445 | d85f01e7 | Iustin Pop | interval of change of the semi-static ones is much bigger than most |
446 | d85f01e7 | Iustin Pop | Ganeti operations, even more than lengthy sequences of Ganeti jobs, it |
447 | d85f01e7 | Iustin Pop | makes sense to treat them separately. |
448 | d85f01e7 | Iustin Pop | |
449 | d85f01e7 | Iustin Pop | The proposal is then to move the following resources into the |
450 | d85f01e7 | Iustin Pop | configuration and treat the configuration as the authoritative source |
451 | d85f01e7 | Iustin Pop | for them (a :term:`SoR` model): |
452 | d85f01e7 | Iustin Pop | |
453 | d85f01e7 | Iustin Pop | - CPU resources: |
454 | d85f01e7 | Iustin Pop | - total core count |
455 | d85f01e7 | Iustin Pop | - node core usage (*new*) |
456 | d85f01e7 | Iustin Pop | - memory resources: |
457 | d85f01e7 | Iustin Pop | - total memory size |
458 | d85f01e7 | Iustin Pop | - node memory size |
459 | d85f01e7 | Iustin Pop | - hypervisor overhead (*new*) |
460 | d85f01e7 | Iustin Pop | - disk resources: |
461 | d85f01e7 | Iustin Pop | - total disk size |
462 | d85f01e7 | Iustin Pop | - disk overhead (*new*) |
463 | d85f01e7 | Iustin Pop | |
464 | d85f01e7 | Iustin Pop | Since these resources can nevertheless change at run-time, we will need |
465 | d85f01e7 | Iustin Pop | functionality to update the recorded values. |
466 | d85f01e7 | Iustin Pop | |
467 | d85f01e7 | Iustin Pop | Pre-computing dynamic resource values |
468 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
469 | d85f01e7 | Iustin Pop | |
470 | d85f01e7 | Iustin Pop | Remember that the resource model used by HTools models the clusters as |
471 | d85f01e7 | Iustin Pop | obeying the following equations: |
472 | d85f01e7 | Iustin Pop | |
473 | d85f01e7 | Iustin Pop | disk\ :sub:`free` = disk\ :sub:`total` - ∑ disk\ :sub:`instances` |
474 | d85f01e7 | Iustin Pop | |
475 | d85f01e7 | Iustin Pop | mem\ :sub:`free` = mem\ :sub:`total` - ∑ mem\ :sub:`instances` - mem\ |
476 | d85f01e7 | Iustin Pop | :sub:`node` - mem\ :sub:`overhead` |
477 | d85f01e7 | Iustin Pop | |
478 | d85f01e7 | Iustin Pop | As this model worked fine for HTools, we can consider it valid and adopt |
479 | d85f01e7 | Iustin Pop | it in Ganeti. Furthermore, note that all values on the right-hand side |
480 | d85f01e7 | Iustin Pop | now come from the configuration: |
481 | d85f01e7 | Iustin Pop | |
482 | d85f01e7 | Iustin Pop | - the per-instance usage values were already stored in the configuration |
483 | d85f01e7 | Iustin Pop | - the other values will be moved to the configuration per the previous |
484 | d85f01e7 | Iustin Pop | section |
485 | d85f01e7 | Iustin Pop | |
486 | d85f01e7 | Iustin Pop | This means that we can now compute the free values without having to |
487 | d85f01e7 | Iustin Pop | actually live-query the nodes, which brings a significant advantage. |
488 | d85f01e7 | Iustin Pop | |
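As an illustration of the two equations above, the free values can be derived purely from configuration data. The following is a minimal sketch, not Ganeti or HTools code; the dictionary keys (``disk_total``, ``mem_total``, ``mem_node``, ``mem_overhead``, ``disk``, ``mem``) are hypothetical names chosen to mirror the terms of the equations.

```python
def free_resources(node, instances):
    """Compute free disk and memory for one node from configuration data only.

    All key names are assumptions made for this sketch; they mirror the
    terms of the HTools equations, not an actual Ganeti data structure.
    """
    # disk_free = disk_total - sum(disk_instances)
    disk_free = node["disk_total"] - sum(i["disk"] for i in instances)
    # mem_free = mem_total - sum(mem_instances) - mem_node - mem_overhead
    mem_free = (node["mem_total"]
                - sum(i["mem"] for i in instances)
                - node["mem_node"]
                - node["mem_overhead"])
    return {"disk_free": disk_free, "mem_free": mem_free}


# Example values in MiB, invented for illustration.
node = {"disk_total": 1024000, "mem_total": 32768,
        "mem_node": 1024, "mem_overhead": 512}
instances = [{"disk": 20480, "mem": 4096}, {"disk": 51200, "mem": 8192}]
print(free_resources(node, instances))
```

No node is contacted in this computation, which is the advantage the text refers to.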
489 | d85f01e7 | Iustin Pop | There are a couple of caveats to this model though. First, as the |
490 | d85f01e7 | Iustin Pop | run-time state of the instance is no longer taken into consideration, it |
491 | d85f01e7 | Iustin Pop | means that we have to introduce a new *offline* state for an instance |
492 | d85f01e7 | Iustin Pop | (similar to the node one). In this state, the instance's runtime |
493 | d85f01e7 | Iustin Pop | resources (memory and VCPUs) are no longer reserved for it, and can be |
494 | d85f01e7 | Iustin Pop | reused by other instances. Static resources like disk and MAC addresses |
495 | d85f01e7 | Iustin Pop | are still reserved though. Transitioning into and out of this offline |
496 | d85f01e7 | Iustin Pop | state will be more involved than simply stopping/starting the instance |
497 | d85f01e7 | Iustin Pop | (e.g. de-offlining can fail due to missing resources). This complexity |
498 | d85f01e7 | Iustin Pop | is compensated by the increased consistency of what guarantees we have |
499 | d85f01e7 | Iustin Pop | in the stopped state (we always guarantee resource reservation), and the |
500 | d85f01e7 | Iustin Pop | potential for management tools to restrict which users can transition |
501 | d85f01e7 | Iustin Pop | into/out of this state separate from which users can stop/start the |
502 | d85f01e7 | Iustin Pop | instance. |
503 | d85f01e7 | Iustin Pop | |
504 | d85f01e7 | Iustin Pop | Separating per-node resource locks |
505 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
506 | d85f01e7 | Iustin Pop | |
507 | d85f01e7 | Iustin Pop | Many of the current node locks in Ganeti exist in order to guarantee |
508 | d85f01e7 | Iustin Pop | correct resource state computation, whereas others are designed to |
509 | d85f01e7 | Iustin Pop | guarantee reasonable run-time performance of nodes (e.g. by not |
510 | d85f01e7 | Iustin Pop | overloading the I/O subsystem). This is an unfortunate coupling, since |
511 | d85f01e7 | Iustin Pop | it means for example that the following two operations conflict in |
512 | d85f01e7 | Iustin Pop | practice even though they are orthogonal: |
513 | d85f01e7 | Iustin Pop | |
514 | d85f01e7 | Iustin Pop | - replacing an instance's disk on a node |
515 | d85f01e7 | Iustin Pop | - computing node disk/memory free for an IAllocator run |
516 | d85f01e7 | Iustin Pop | |
517 | d85f01e7 | Iustin Pop | This conflict significantly increases the lock contention on a big/busy |
518 | d85f01e7 | Iustin Pop | cluster and is at odds with the goal of increasing the cluster size. |
519 | d85f01e7 | Iustin Pop | |
520 | d85f01e7 | Iustin Pop | The proposal is therefore to add a new level of locking that is only |
521 | d85f01e7 | Iustin Pop | used to prevent concurrent modification to the resource states (either |
522 | d85f01e7 | Iustin Pop | node properties or instance properties) and not for long-term |
523 | d85f01e7 | Iustin Pop | operations: |
524 | d85f01e7 | Iustin Pop | |
525 | d85f01e7 | Iustin Pop | - instance creation needs to acquire and keep this lock until adding the |
526 | d85f01e7 | Iustin Pop | instance to the configuration |
527 | d85f01e7 | Iustin Pop | - instance modification needs to acquire and keep this lock until |
528 | d85f01e7 | Iustin Pop | updating the instance |
529 | d85f01e7 | Iustin Pop | - node property changes will need to acquire this lock for the |
530 | d85f01e7 | Iustin Pop | modification |
531 | d85f01e7 | Iustin Pop | |
532 | d85f01e7 | Iustin Pop | The new lock level will sit before the instance level (right after BGL) |
533 | d85f01e7 | Iustin Pop | and could either be single-valued (like the “Big Ganeti Lock”), in which |
534 | d85f01e7 | Iustin Pop | case we won't be able to modify two nodes at the same time, or per-node, |
535 | d85f01e7 | Iustin Pop | in which case the list of locks at this level needs to be synchronised |
536 | d85f01e7 | Iustin Pop | with the node lock level. To be determined. |
537 | d85f01e7 | Iustin Pop | |
538 | d85f01e7 | Iustin Pop | Lock contention reduction |
539 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~~~~~ |
540 | d85f01e7 | Iustin Pop | |
541 | d85f01e7 | Iustin Pop | Based on the above, the locking contention will be reduced as follows: |
542 | d85f01e7 | Iustin Pop | IAllocator calls will no longer need the ``LEVEL_NODE: ALL_SET`` lock, |
543 | d85f01e7 | Iustin Pop | only the resource lock (in exclusive mode). Hence allocating/computing |
544 | d85f01e7 | Iustin Pop | evacuation targets will no longer conflict for longer than the time to |
545 | d85f01e7 | Iustin Pop | compute the allocation solution. |
546 | d85f01e7 | Iustin Pop | |
547 | d85f01e7 | Iustin Pop | The remaining long-running locks will be the DRBD replace-disks ones |
548 | d85f01e7 | Iustin Pop | (exclusive mode). These can also be removed, or changed into shared |
549 | d85f01e7 | Iustin Pop | locks, but that is a separate design change. |
550 | d85f01e7 | Iustin Pop | |
551 | d85f01e7 | Iustin Pop | .. admonition:: FIXME |
552 | d85f01e7 | Iustin Pop | |
553 | 0469fd96 | Michael Hanselmann | Need to rework instance replace disks. I don't think we need exclusive |
554 | 0469fd96 | Michael Hanselmann | locks for replacing disks: it is safe to stop/start the instance while |
555 | 0469fd96 | Michael Hanselmann | it's doing a replace disks. Only modify would need exclusive, and only |
556 | 0469fd96 | Michael Hanselmann | for transitioning into/out of offline state. |
557 | d85f01e7 | Iustin Pop | |
558 | d85f01e7 | Iustin Pop | Instance memory model |
559 | d85f01e7 | Iustin Pop | --------------------- |
560 | d85f01e7 | Iustin Pop | |
561 | d85f01e7 | Iustin Pop | In order to support ballooning, the instance memory model needs to be |
562 | d85f01e7 | Iustin Pop | changed from a “memory size” one to a “min/max memory size”. This |
563 | d85f01e7 | Iustin Pop | interacts with the new static resource model, however, and thus we need |
564 | d85f01e7 | Iustin Pop | to declare a priori the expected oversubscription ratio on the cluster. |
565 | d85f01e7 | Iustin Pop | |
566 | d85f01e7 | Iustin Pop | The new minimum memory size parameter will be similar to the current |
567 | d85f01e7 | Iustin Pop | memory size; the cluster will guarantee that in all circumstances, all |
568 | d85f01e7 | Iustin Pop | instances will have their minimum memory size available. The maximum |
569 | d85f01e7 | Iustin Pop | memory size will permit burst usage of more memory by instances, with |
570 | d85f01e7 | Iustin Pop | the restriction that the sum of maximum memory usage will not be more |
571 | d85f01e7 | Iustin Pop | than the free memory times the oversubscription factor: |
572 | d85f01e7 | Iustin Pop | |
573 | d85f01e7 | Iustin Pop | ∑ memory\ :sub:`min` ≤ memory\ :sub:`available` |
574 | d85f01e7 | Iustin Pop | |
575 | d85f01e7 | Iustin Pop | ∑ memory\ :sub:`max` ≤ memory\ :sub:`free` * oversubscription_ratio |
576 | d85f01e7 | Iustin Pop | |
577 | d85f01e7 | Iustin Pop | The hypervisor will have the possibility of adjusting the instance's |
578 | d85f01e7 | Iustin Pop | memory size dynamically between these two boundaries. |
579 | d85f01e7 | Iustin Pop | |
580 | d85f01e7 | Iustin Pop | Note that the minimum memory is related to the available memory on the |
581 | d85f01e7 | Iustin Pop | node, whereas the maximum memory is related to the free memory. On |
582 | d85f01e7 | Iustin Pop | DRBD-enabled clusters, this will have the advantage of using the |
583 | d85f01e7 | Iustin Pop | reserved memory for N+1 failover for burst usage, instead of having it |
584 | d85f01e7 | Iustin Pop | completely idle. |
585 | d85f01e7 | Iustin Pop | |
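The two inequalities above can be checked mechanically. The sketch below assumes that the available memory, the free memory and the oversubscription ratio are already known for the node; the function name and the ``min_memory``/``max_memory`` keys are hypothetical and only serve the illustration.

```python
def memory_policy_ok(instances, mem_available, mem_free, oversubscription_ratio):
    """Check both memory constraints for the instances placed on one node.

    The function and key names are assumptions for this sketch.
    """
    sum_min = sum(i["min_memory"] for i in instances)
    sum_max = sum(i["max_memory"] for i in instances)
    # Minimum sizes are guaranteed, so they must fit in the available memory.
    min_ok = sum_min <= mem_available
    # Maximum sizes may burst into free memory, scaled by the ratio.
    max_ok = sum_max <= mem_free * oversubscription_ratio
    return min_ok and max_ok


# Example values in MiB, invented for illustration.
instances = [{"min_memory": 1024, "max_memory": 2048},
             {"min_memory": 2048, "max_memory": 4096}]
print(memory_policy_ok(instances, mem_available=4096, mem_free=6144,
                       oversubscription_ratio=1.5))
```

Here the free memory is larger than the available memory, reflecting the case where memory reserved for N+1 failover can still be used for bursts.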
586 | d85f01e7 | Iustin Pop | .. admonition:: FIXME |
587 | d85f01e7 | Iustin Pop | |
588 | d85f01e7 | Iustin Pop | Need to document how Ganeti forces minimum size at runtime, overriding |
589 | d85f01e7 | Iustin Pop | the hypervisor, in cases of failover/lack of resources. |
590 | d85f01e7 | Iustin Pop | |
591 | d85f01e7 | Iustin Pop | New parameters |
592 | d85f01e7 | Iustin Pop | -------------- |
593 | d85f01e7 | Iustin Pop | |
594 | d85f01e7 | Iustin Pop | Unfortunately the design will add a significant number of new |
595 | d85f01e7 | Iustin Pop | parameters, and change the meaning of some of the current ones. |
596 | d85f01e7 | Iustin Pop | |
597 | d85f01e7 | Iustin Pop | Instance size limits |
598 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~ |
599 | d85f01e7 | Iustin Pop | |
600 | d85f01e7 | Iustin Pop | As described in :ref:`label-policies`, we currently lack a clear |
601 | d85f01e7 | Iustin Pop | definition of the supported instance sizes (minimum, maximum and |
602 | d85f01e7 | Iustin Pop | standard). As such, we will add the following structure to the cluster |
603 | d85f01e7 | Iustin Pop | parameters: |
604 | d85f01e7 | Iustin Pop | |
605 | d85f01e7 | Iustin Pop | - ``min_ispec``, ``max_ispec``: minimum and maximum acceptable instance |
606 | d85f01e7 | Iustin Pop | specs |
607 | d85f01e7 | Iustin Pop | - ``std_ispec``: standard instance size, which will be used for capacity |
608 | d85f01e7 | Iustin Pop | computations and for default parameters on the instance creation |
609 | d85f01e7 | Iustin Pop | request |
610 | d85f01e7 | Iustin Pop | |
611 | d85f01e7 | Iustin Pop | Ganeti will by default reject non-standard instance sizes (lower than |
612 | d85f01e7 | Iustin Pop | ``min_ispec`` or greater than ``max_ispec``), but as usual a ``--force`` |
613 | d85f01e7 | Iustin Pop | option on the command line or in the RAPI request will override these |
614 | d85f01e7 | Iustin Pop | constraints. The ``std_ispec`` structure will be used to fill in missing |
615 | d85f01e7 | Iustin Pop | instance specifications on create. |
616 | d85f01e7 | Iustin Pop | |
617 | d85f01e7 | Iustin Pop | Each of the ispec structures will be a dictionary, since the contents |
618 | d85f01e7 | Iustin Pop | can change over time. Initially, we will define the following variables |
619 | d85f01e7 | Iustin Pop | in these structures: |
620 | d85f01e7 | Iustin Pop | |
621 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
622 | d85f01e7 | Iustin Pop | |Name |Description |Type | |
623 | d85f01e7 | Iustin Pop | +===============+==================================+==============+ |
624 | d85f01e7 | Iustin Pop | |mem_min |Minimum memory size allowed |int | |
625 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
626 | d85f01e7 | Iustin Pop | |mem_max |Maximum allowed memory size |int | |
627 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
628 | d85f01e7 | Iustin Pop | |cpu_count |Allowed vCPU count |int | |
629 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
630 | d85f01e7 | Iustin Pop | |disk_count |Allowed disk count |int | |
631 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
632 | d85f01e7 | Iustin Pop | |disk_size |Allowed disk size |int | |
633 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
634 | d85f01e7 | Iustin Pop | |nic_count |Allowed NIC count |int | |
635 | d85f01e7 | Iustin Pop | +---------------+----------------------------------+--------------+ |
636 | d85f01e7 | Iustin Pop | |
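The behaviour described above (fill in missing values from ``std_ispec``, reject specs outside ``min_ispec``/``max_ispec`` unless forced) could look roughly as follows. This is a sketch under assumptions: the helper names, the per-key range check and all example values are invented for illustration and are not taken from the Ganeti code base.

```python
ISPEC_KEYS = ("mem_min", "mem_max", "cpu_count",
              "disk_count", "disk_size", "nic_count")


def fill_defaults(request, std_ispec):
    """Return the requested spec with missing keys taken from std_ispec."""
    spec = dict(std_ispec)
    spec.update(request)
    return spec


def check_ispec(spec, min_ispec, max_ispec, force=False):
    """Return the keys outside [min, max]; raise unless force is set."""
    violations = [key for key in ISPEC_KEYS
                  if not min_ispec[key] <= spec[key] <= max_ispec[key]]
    if violations and not force:
        raise ValueError("non-standard instance size: %s" % ", ".join(violations))
    return violations


# Example policy, values invented for illustration (sizes in MiB).
min_ispec = {"mem_min": 128, "mem_max": 128, "cpu_count": 1,
             "disk_count": 1, "disk_size": 1024, "nic_count": 0}
max_ispec = {"mem_min": 32768, "mem_max": 32768, "cpu_count": 8,
             "disk_count": 16, "disk_size": 1048576, "nic_count": 8}
std_ispec = {"mem_min": 1024, "mem_max": 2048, "cpu_count": 1,
             "disk_count": 1, "disk_size": 10240, "nic_count": 1}

spec = fill_defaults({"cpu_count": 2}, std_ispec)
print(check_ispec(spec, min_ispec, max_ispec))
```

Using plain dictionaries keeps the sketch in line with the statement that the contents of an ispec can change over time.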
637 | d85f01e7 | Iustin Pop | Inheritance |
638 | d85f01e7 | Iustin Pop | +++++++++++ |
639 | d85f01e7 | Iustin Pop | |
640 | d85f01e7 | Iustin Pop | In a single-group cluster, the above structure is sufficient. However, |
641 | d85f01e7 | Iustin Pop | on a multi-group cluster, it could be that the hardware specifications |
642 | d85f01e7 | Iustin Pop | differ across node groups, and thus the following problem appears: how |
643 | d85f01e7 | Iustin Pop | can Ganeti present unified specifications over RAPI? |
644 | d85f01e7 | Iustin Pop | |
645 | d85f01e7 | Iustin Pop | Since the set of instance specs is only partially ordered (as opposed to |
646 | d85f01e7 | Iustin Pop | the sets of values of individual variables in the spec, which are totally |
647 | d85f01e7 | Iustin Pop | ordered), it follows that we can't present unified specs. As such, the |
648 | d85f01e7 | Iustin Pop | proposed approach is to allow the ``min_ispec`` and ``max_ispec`` to be |
649 | d85f01e7 | Iustin Pop | customised per node-group (and export them as a list of specifications), |
650 | d85f01e7 | Iustin Pop | and a single ``std_ispec`` at cluster level (exported as a single value). |
651 | d85f01e7 | Iustin Pop | |
652 | d85f01e7 | Iustin Pop | |
653 | d85f01e7 | Iustin Pop | Allocation parameters |
654 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~~~ |
655 | d85f01e7 | Iustin Pop | |
656 | d85f01e7 | Iustin Pop | Besides the limits of min/max instance sizes, there are other parameters |
657 | d85f01e7 | Iustin Pop | related to capacity and allocation limits. These mostly address the |
658 | d85f01e7 | Iustin Pop | problems caused by over-allocation. |
659 | d85f01e7 | Iustin Pop | |
660 | d85f01e7 | Iustin Pop | +-----------------+----------+---------------------------+----------+------+ |
661 | d85f01e7 | Iustin Pop | | Name |Level(s) |Description |Current |Type | |
662 | d85f01e7 | Iustin Pop | | | | |value | | |
663 | d85f01e7 | Iustin Pop | +=================+==========+===========================+==========+======+ |
664 | d85f01e7 | Iustin Pop | |vcpu_ratio |cluster, |Maximum ratio of virtual to|64 (only |float | |
665 | d85f01e7 | Iustin Pop | | |node group|physical CPUs |in htools)| | |
666 | d85f01e7 | Iustin Pop | +-----------------+----------+---------------------------+----------+------+ |
667 | d85f01e7 | Iustin Pop | |spindle_ratio |cluster, |Maximum ratio of instances |none |float | |
668 | d85f01e7 | Iustin Pop | | |node group|to spindles; when the I/O | | | |
669 | d85f01e7 | Iustin Pop | | | |model doesn't map directly | | | |
670 | d85f01e7 | Iustin Pop | | | |to spindles, another | | | |
671 | d85f01e7 | Iustin Pop | | | |measure of I/O should be | | | |
672 | d85f01e7 | Iustin Pop | | | |used instead | | | |
673 | d85f01e7 | Iustin Pop | +-----------------+----------+---------------------------+----------+------+ |
674 | d85f01e7 | Iustin Pop | |max_node_failures|cluster, |Cap allocation/capacity so |1 |int | |
675 | d85f01e7 | Iustin Pop | | |node group|that the cluster can |(hardcoded| | |
676 | d85f01e7 | Iustin Pop | | | |survive this many node |in htools)| | |
677 | d85f01e7 | Iustin Pop | | | |failures | | | |
678 | d85f01e7 | Iustin Pop | +-----------------+----------+---------------------------+----------+------+ |
679 | d85f01e7 | Iustin Pop | |
680 | d85f01e7 | Iustin Pop | Since these are used mostly internally (in htools), they will be |
681 | d85f01e7 | Iustin Pop | exported as-is from Ganeti, without explicit grouping by node |
682 | d85f01e7 | Iustin Pop | group. |
683 | d85f01e7 | Iustin Pop | |
684 | d85f01e7 | Iustin Pop | Regarding ``spindle_ratio``, in this context spindles do not necessarily |
685 | d85f01e7 | Iustin Pop | have to mean actual mechanical hard drives; it's rather a measure of |
686 | d85f01e7 | Iustin Pop | I/O performance for internal storage. |
687 | d85f01e7 | Iustin Pop | |
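As a small illustration of how ``vcpu_ratio`` might be applied, the sketch below caps the sum of instance vCPUs at the physical core count times the ratio. It assumes that a node-group value, when set, overrides the cluster value; this override rule, the helper names and the numbers are assumptions made for the example.

```python
def effective_param(name, cluster_params, group_params):
    """Pick the node-group value if present, else the cluster value (assumed rule)."""
    return group_params.get(name, cluster_params[name])


def vcpu_allocation_ok(instance_vcpus, cpu_total, vcpu_ratio):
    """True if the virtual-to-physical CPU ratio stays within vcpu_ratio."""
    return sum(instance_vcpus) <= cpu_total * vcpu_ratio


# Example values, invented for illustration.
cluster_params = {"vcpu_ratio": 64.0, "max_node_failures": 1}
group_params = {"vcpu_ratio": 4.0}

ratio = effective_param("vcpu_ratio", cluster_params, group_params)
print(vcpu_allocation_ok([2, 4, 8], cpu_total=4, vcpu_ratio=ratio))
```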
688 | d85f01e7 | Iustin Pop | Disk parameters |
689 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~ |
690 | d85f01e7 | Iustin Pop | |
691 | 5d40c988 | Andrea Spadaccini | The proposed model for the new disk parameters is a simple free-form one |
692 | 5d40c988 | Andrea Spadaccini | based on dictionaries, indexed per disk template and parameter name. |
693 | 5d40c988 | Andrea Spadaccini | Only the disk template parameters are visible to the user, and those are |
694 | 5d40c988 | Andrea Spadaccini | internally translated to logical disk level parameters. |
695 | 5d40c988 | Andrea Spadaccini | |
696 | 5d40c988 | Andrea Spadaccini | This is a simplification, because each parameter is applied to a whole |
697 | 5d40c988 | Andrea Spadaccini | nested structure and there is no way of fine-tuning each level's |
698 | 5d40c988 | Andrea Spadaccini | parameters, but it is good enough for the current parameter set. This |
699 | 5d40c988 | Andrea Spadaccini | model may need to be expanded, e.g., if support for three-node stacked |
700 | 5d40c988 | Andrea Spadaccini | DRBD setups is added to Ganeti. |
701 | 5d40c988 | Andrea Spadaccini | |
702 | 5d40c988 | Andrea Spadaccini | At JSON level, since the object key has to be a string, the keys can be |
703 | 5d40c988 | Andrea Spadaccini | encoded via a separator (e.g. slash), or by having two dict levels. |
704 | d85f01e7 | Iustin Pop | |
705 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
706 | d85f01e7 | Iustin Pop | |Disk |Name |Description |Current status |Type | |
707 | d85f01e7 | Iustin Pop | |template| | | | | |
708 | d85f01e7 | Iustin Pop | +========+=============+=========================+=====================+======+ |
709 | 5d40c988 | Andrea Spadaccini | |plain |stripes |How many stripes to use |Configured at |int | |
710 | d85f01e7 | Iustin Pop | | | |for newly created (plain)|./configure time, not| | |
711 | d85f01e7 | Iustin Pop | | | |logical volumes |overridable at | | |
712 | d85f01e7 | Iustin Pop | | | | |runtime | | |
713 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
714 | 5d40c988 | Andrea Spadaccini | |drbd |stripes |How many stripes to use |Same as for plain |int | |
715 | d85f01e7 | Iustin Pop | | | |for data volumes | | | |
716 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
717 | 5d40c988 | Andrea Spadaccini | |drbd |metavg |Default volume group for |Same as the main |string| |
718 | d85f01e7 | Iustin Pop | | | |the metadata LVs |volume group, | | |
719 | d85f01e7 | Iustin Pop | | | | |overridable via | | |
720 | d85f01e7 | Iustin Pop | | | | |'metavg' key | | |
721 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
722 | 5d40c988 | Andrea Spadaccini | |drbd |metastripes |How many stripes to use |Same as for lvm |int | |
723 | d85f01e7 | Iustin Pop | | | |for meta volumes |'stripes', suboptimal| | |
724 | d85f01e7 | Iustin Pop | | | | |as the meta LVs are | | |
725 | d85f01e7 | Iustin Pop | | | | |small | | |
726 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
727 | 5d40c988 | Andrea Spadaccini | |drbd |disk_barriers|What kind of barriers to |Either all enabled or|string| |
728 | d85f01e7 | Iustin Pop | | | |*disable* for disks; |all disabled, per | | |
729 | d85f01e7 | Iustin Pop | | | |either "n" or a string |./configure time | | |
730 | d85f01e7 | Iustin Pop | | | |containing a subset of |option | | |
731 | d85f01e7 | Iustin Pop | | | |"bfd" | | | |
732 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
733 | 5d40c988 | Andrea Spadaccini | |drbd |meta_barriers|Whether barriers are |Handled together with|bool | |
734 | d85f01e7 | Iustin Pop | | | |enabled or not for the |disk_barriers | | |
735 | d85f01e7 | Iustin Pop | | | |meta volume | | | |
736 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
737 | 5d40c988 | Andrea Spadaccini | |drbd |resync_rate |The (static) resync rate |Hardcoded in |int | |
738 | d85f01e7 | Iustin Pop | | | |for drbd, when using the |constants.py, not | | |
739 | d85f01e7 | Iustin Pop | | | |static syncer, in MiB/s |changeable via Ganeti| | |
740 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
741 | 5d40c988 | Andrea Spadaccini | |drbd |disk_custom |Free-form string that |Not supported |string| |
742 | d85f01e7 | Iustin Pop | | | |will be appended to the | | | |
743 | d85f01e7 | Iustin Pop | | | |drbdsetup disk command | | | |
744 | d85f01e7 | Iustin Pop | | | |line, for custom options | | | |
745 | d85f01e7 | Iustin Pop | | | |not supported by Ganeti | | | |
746 | d85f01e7 | Iustin Pop | | | |itself | | | |
747 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
748 | 5d40c988 | Andrea Spadaccini | |drbd |net_custom |Free-form string for |Not supported |string| |
749 | d85f01e7 | Iustin Pop | | | |custom net setup options | | | |
750 | d85f01e7 | Iustin Pop | +--------+-------------+-------------------------+---------------------+------+ |
751 | d85f01e7 | Iustin Pop | |
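The two JSON encodings mentioned before the table (a separator within a single key, or two dictionary levels) carry the same information and can be converted into each other. The following sketch shows both; the helper names and the parameter values are made up for the example, and the slash separator is only one possible choice.

```python
import json


def flatten(params, sep="/"):
    """Encode {template: {name: value}} as {"template/name": value}."""
    return {"%s%s%s" % (template, sep, name): value
            for template, names in params.items()
            for name, value in names.items()}


def unflatten(flat, sep="/"):
    """Decode {"template/name": value} back into two dictionary levels."""
    nested = {}
    for key, value in flat.items():
        template, name = key.split(sep, 1)
        nested.setdefault(template, {})[name] = value
    return nested


# Example disk parameters, values invented for illustration.
disk_params = {
    "plain": {"stripes": 2},
    "drbd": {"stripes": 2, "metavg": "xenvg", "resync_rate": 1024},
}

print(json.dumps(flatten(disk_params), sort_keys=True))
print(json.dumps(disk_params, sort_keys=True))
```

The flat form only works as long as the separator never appears in a disk template name, which is one argument for the two-level form.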
752 | 5d40c988 | Andrea Spadaccini | Note that the DRBD parameters might change once Ganeti supports DRBD 8.4, in |
753 | 5d40c988 | Andrea Spadaccini | which the :command:`drbdsetup` syntax has changed significantly. |
754 | 5d40c988 | Andrea Spadaccini | Moreover, new parameters for the dynamic synchronization algorithm will |
755 | 5d40c988 | Andrea Spadaccini | be added for DRBD versions >= 8.3.9. |
756 | d85f01e7 | Iustin Pop | |
757 | d85f01e7 | Iustin Pop | All the above parameters are at cluster and node group level; as in |
758 | d85f01e7 | Iustin Pop | other parts of the code, the intention is that all nodes in a node group |
759 | 5d40c988 | Andrea Spadaccini | should be equal. It will be decided later which node group takes |
760 | 5d40c988 | Andrea Spadaccini | precedence in case of instances split over node groups. |
761 | 5d40c988 | Andrea Spadaccini | |
762 | 5d40c988 | Andrea Spadaccini | .. admonition:: FIXME |
763 | 5d40c988 | Andrea Spadaccini | |
764 | 5d40c988 | Andrea Spadaccini | Add details about when each parameter change takes effect (device |
765 | 5d40c988 | Andrea Spadaccini | creation vs. activation) |
766 | d85f01e7 | Iustin Pop | |
767 | d85f01e7 | Iustin Pop | Node parameters |
768 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~ |
769 | d85f01e7 | Iustin Pop | |
770 | d85f01e7 | Iustin Pop | For the new memory model, we'll add the following parameters, in a |
771 | d85f01e7 | Iustin Pop | dictionary indexed by the hypervisor name (node attribute |
772 | d85f01e7 | Iustin Pop | ``hv_state``). The rationale is that, even though multi-hypervisor |
773 | d85f01e7 | Iustin Pop | clusters are rare, they make sense sometimes, and thus we need to |
774 | d85f01e7 | Iustin Pop | support multiple node states (one per hypervisor). |
775 | d85f01e7 | Iustin Pop | |
776 | d85f01e7 | Iustin Pop | Since usually only one of the multiple hypervisors is the 'main' one |
777 | d85f01e7 | Iustin Pop | (and the others used sparingly), capacity computation will still only |
778 | d85f01e7 | Iustin Pop | use the first hypervisor, and not all of them. Thus we avoid possible |
779 | d85f01e7 | Iustin Pop | inconsistencies. |
780 | d85f01e7 | Iustin Pop | |
781 | d85f01e7 | Iustin Pop | +----------+-----------------------------------+---------------+-------+ |
782 | d85f01e7 | Iustin Pop | |Name |Description |Current state |Type | |
783 | d85f01e7 | Iustin Pop | | | | | | |
784 | d85f01e7 | Iustin Pop | +==========+===================================+===============+=======+ |
785 | d85f01e7 | Iustin Pop | |mem_total |Total node memory, as discovered by|Queried at |int | |
786 | d85f01e7 | Iustin Pop | | |this hypervisor |runtime | | |
787 | d85f01e7 | Iustin Pop | +----------+-----------------------------------+---------------+-------+ |
788 | d85f01e7 | Iustin Pop | |mem_node |Memory used by, or reserved for, |Queried at |int | |
789 | d85f01e7 | Iustin Pop | | |the node itself; note that some |runtime | | |
790 | d85f01e7 | Iustin Pop | | |hypervisors can report this in an | | | |
791 | d85f01e7 | Iustin Pop | | |authoritative way, others not | | | |
792 | d85f01e7 | Iustin Pop | +----------+-----------------------------------+---------------+-------+ |
793 | d85f01e7 | Iustin Pop | |mem_hv |Memory used either by the |Not used, |int | |
794 | d85f01e7 | Iustin Pop | | |hypervisor itself or lost due to |htools computes| | |
795 | d85f01e7 | Iustin Pop | | |instance allocation rounding; |it internally | | |
796 | d85f01e7 | Iustin Pop | | |usually this cannot be precisely | | | |
797 | d85f01e7 | Iustin Pop | | |computed, but only roughly | | | |
798 | d85f01e7 | Iustin Pop | | |estimated | | | |
799 | d85f01e7 | Iustin Pop | +----------+-----------------------------------+---------------+-------+ |
800 | d85f01e7 | Iustin Pop | |cpu_total |Total node cpu (core) count; |Queried at |int | |
801 | d85f01e7 | Iustin Pop | | |usually this can be discovered |runtime | | |
802 | d85f01e7 | Iustin Pop | | |automatically | | | |
803 | d85f01e7 | Iustin Pop | | | | | | |
804 | d85f01e7 | Iustin Pop | | | | | | |
805 | d85f01e7 | Iustin Pop | | | | | | |
806 | d85f01e7 | Iustin Pop | +----------+-----------------------------------+---------------+-------+ |
807 | d85f01e7 | Iustin Pop | |cpu_node |Number of cores reserved for the |Not used at all|int | |
808 | d85f01e7 | Iustin Pop | | |node itself; this can either be | | | |
809 | d85f01e7 | Iustin Pop | | |discovered or set manually. Only | | | |
810 | d85f01e7 | Iustin Pop | | |used for estimating how many VCPUs | | | |
811 | d85f01e7 | Iustin Pop | | |are left for instances | | | |
812 | d85f01e7 | Iustin Pop | | | | | | |
813 | d85f01e7 | Iustin Pop | +----------+-----------------------------------+---------------+-------+ |
814 | d85f01e7 | Iustin Pop | |
815 | d85f01e7 | Iustin Pop | Of the above parameters, only the ``_total`` ones are straightforward. The |
816 | d85f01e7 | Iustin Pop | others sometimes have strange semantics: |
817 | d85f01e7 | Iustin Pop | |
818 | d85f01e7 | Iustin Pop | - Xen can report ``mem_node``, if configured statically (as we |
819 | d85f01e7 | Iustin Pop | recommend); but Linux-based hypervisors (KVM, chroot, LXC) do not, so |
820 | d85f01e7 | Iustin Pop | for them this value needs to be configured statically |
821 | d85f01e7 | Iustin Pop | - ``mem_hv``, representing unaccounted for memory, is not directly |
822 | d85f01e7 | Iustin Pop | computable; on Xen, it can be seen that on an N GB machine, with 1 GB |
823 | d85f01e7 | Iustin Pop | for dom0 and N-2 GB for instances, there are just a few MB left, instead |
824 | d85f01e7 | Iustin Pop | of a full 1 GB of RAM; however, the exact value varies with the total |
825 | d85f01e7 | Iustin Pop | memory size (at least) |
826 | d85f01e7 | Iustin Pop | - ``cpu_node`` only makes sense on Xen (currently), in the case when we |
827 | d85f01e7 | Iustin Pop | restrict dom0; for Linux-based hypervisors, the node itself cannot be |
828 | d85f01e7 | Iustin Pop | easily restricted, so it should be set as an estimate of how "heavy" |
829 | d85f01e7 | Iustin Pop | the node loads will be |
830 | d85f01e7 | Iustin Pop | |
831 | d85f01e7 | Iustin Pop | Since these two values cannot be auto-computed from the node, we need to |
832 | d85f01e7 | Iustin Pop | be able to declare a default at cluster level (debatable how useful they |
833 | d85f01e7 | Iustin Pop | are at node group level); the proposal is to do this via a cluster-level |
834 | d85f01e7 | Iustin Pop | ``hv_state`` dict (per hypervisor). |
835 | d85f01e7 | Iustin Pop | |
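One way to picture the cluster-level ``hv_state`` default is as a per-hypervisor dictionary that node-level values override. The sketch below assumes exactly that precedence (node over cluster); the function name and all values are hypothetical and not taken from the Ganeti code base.

```python
def effective_hv_state(cluster_hv_state, node_hv_state, hypervisor):
    """Merge cluster defaults with node values for one hypervisor.

    Node-level values win over cluster-level defaults (assumed precedence).
    """
    state = dict(cluster_hv_state.get(hypervisor, {}))
    state.update(node_hv_state.get(hypervisor, {}))
    return state


# Example values, invented for illustration (memory in MiB).
cluster_hv_state = {"kvm": {"mem_node": 1024, "mem_hv": 256, "cpu_node": 1}}
node_hv_state = {"kvm": {"mem_total": 65536, "cpu_total": 16, "cpu_node": 2}}

print(effective_hv_state(cluster_hv_state, node_hv_state, "kvm"))
```

In this example the node supplies the discoverable ``_total`` values and overrides ``cpu_node``, while ``mem_node`` and ``mem_hv`` fall back to the cluster defaults.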
836 | d85f01e7 | Iustin Pop | Besides the per-hypervisor attributes, we also have disk attributes, |
837 | d85f01e7 | Iustin Pop | which are queried directly on the node (without hypervisor |
838 | d85f01e7 | Iustin Pop | involvement). They are stored in a separate attribute (``disk_state``), |
839 | d85f01e7 | Iustin Pop | which is indexed per storage type and name; currently this will be just |
840 | d85f01e7 | Iustin Pop | ``LD_LV`` and the volume name as key. |
841 | d85f01e7 | Iustin Pop | |
842 | d85f01e7 | Iustin Pop | +-------------+-------------------------+--------------------+--------+ |
843 | d85f01e7 | Iustin Pop | |Name |Description |Current state |Type | |
844 | d85f01e7 | Iustin Pop | | | | | | |
845 | d85f01e7 | Iustin Pop | +=============+=========================+====================+========+ |
846 | d85f01e7 | Iustin Pop | |disk_total |Total disk size |Queried at runtime |int | |
847 | d85f01e7 | Iustin Pop | | | | | | |
848 | d85f01e7 | Iustin Pop | +-------------+-------------------------+--------------------+--------+ |
849 | d85f01e7 | Iustin Pop | |disk_reserved|Reserved disk size; this |None used in Ganeti;|int | |
850 | d85f01e7 | Iustin Pop | | |is a lower limit on the |htools has a | | |
851 | d85f01e7 | Iustin Pop | | |free space, if such a |parameter for this | | |
852 | d85f01e7 | Iustin Pop | | |limit is desired | | | |
853 | d85f01e7 | Iustin Pop | +-------------+-------------------------+--------------------+--------+ |
854 | d85f01e7 | Iustin Pop | |disk_overhead|Disk that is expected to |None used in Ganeti;|int | |
855 | d85f01e7 | Iustin Pop | | |be used by other volumes |htools detects this | | |
856 | d85f01e7 | Iustin Pop | | |(set via |at runtime | | |
857 | d85f01e7 | Iustin Pop | | |``reserved_lvs``); | | | |
858 | d85f01e7 | Iustin Pop | | |usually should be zero | | | |
859 | d85f01e7 | Iustin Pop | +-------------+-------------------------+--------------------+--------+ |
860 | d85f01e7 | Iustin Pop | |
861 | d85f01e7 | Iustin Pop | |
862 | d85f01e7 | Iustin Pop | Instance parameters |
863 | d85f01e7 | Iustin Pop | ~~~~~~~~~~~~~~~~~~~ |
864 | d85f01e7 | Iustin Pop | |
865 | d85f01e7 | Iustin Pop | New instance parameters, needed especially for supporting the new memory |
866 | d85f01e7 | Iustin Pop | model: |
867 | d85f01e7 | Iustin Pop | |
868 | d85f01e7 | Iustin Pop | +--------------+----------------------------------+-----------------+------+ |
869 | d85f01e7 | Iustin Pop | |Name |Description |Current status |Type | |
870 | d85f01e7 | Iustin Pop | | | | | | |
871 | d85f01e7 | Iustin Pop | +==============+==================================+=================+======+ |
872 | d85f01e7 | Iustin Pop | |offline |Whether the instance is in |Not supported |bool | |
873 | d85f01e7 | Iustin Pop | | |"permanent" offline mode; this is | | | |
874 | d85f01e7 | Iustin Pop | | |stronger than the "admin_down" | | | |
875 | d85f01e7 | Iustin Pop | | |state, and is similar to the node | | | |
876 | d85f01e7 | Iustin Pop | | |offline attribute | | | |
877 | d85f01e7 | Iustin Pop | +--------------+----------------------------------+-----------------+------+ |
878 | d85f01e7 | Iustin Pop | |be/max_memory |The maximum memory the instance is|Nonexistent, but |int | |
879 | d85f01e7 | Iustin Pop | | |allowed |virtually | | |
880 | d85f01e7 | Iustin Pop | | | |identical to | | |
881 | d85f01e7 | Iustin Pop | | | |memory | | |
882 | d85f01e7 | Iustin Pop | +--------------+----------------------------------+-----------------+------+ |
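The relation between the existing ``memory`` parameter and the new ``be/max_memory`` can be illustrated with a small upgrade helper. This is only a sketch: it assumes that backend parameters are held in a plain dictionary and that a missing ``max_memory`` should default to the old ``memory`` value. The function name ``upgrade_beparams`` is hypothetical and not part of Ganeti.

```python
def upgrade_beparams(beparams: dict) -> dict:
    """Return a copy of ``beparams`` with ``max_memory`` filled in.

    Hypothetical helper: the input dictionary is left unmodified.
    """
    upgraded = dict(beparams)
    if "max_memory" not in upgraded and "memory" in upgraded:
        # Assumption: the old "memory" value acts as the upper limit.
        upgraded["max_memory"] = upgraded["memory"]
    return upgraded
```

An explicitly configured ``max_memory`` is kept as-is; only instances that lack it inherit the value from ``memory``.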
883 | d85f01e7 | Iustin Pop | |
884 | d85f01e7 | Iustin Pop | HTools changes |
885 | d85f01e7 | Iustin Pop | -------------- |
886 | d85f01e7 | Iustin Pop | |
887 | d85f01e7 | Iustin Pop | All the new parameters (node, instance, cluster, not so much disk) will |
888 | d85f01e7 | Iustin Pop | need to be taken into account by HTools, both in balancing and in |
889 | d85f01e7 | Iustin Pop | capacity computation. |
890 | d85f01e7 | Iustin Pop | |
891 | d85f01e7 | Iustin Pop | Since Ganeti's cluster model is much enhanced, Ganeti can also |
892 | d85f01e7 | Iustin Pop | export its own reserved/overhead variables, and as such HTools can make |
893 | d85f01e7 | Iustin Pop | fewer "guesses" as to the difference in values. |
894 | d85f01e7 | Iustin Pop | |
895 | d85f01e7 | Iustin Pop | .. admonition:: FIXME |
896 | d85f01e7 | Iustin Pop | |
897 | d85f01e7 | Iustin Pop | The htools changes need to be described in more detail; the model is |
898 | d85f01e7 | Iustin Pop | clear to me, but it still needs to be written down. |
899 | d85f01e7 | Iustin Pop | |
900 | d85f01e7 | Iustin Pop | .. vim: set textwidth=72 : |
901 | d85f01e7 | Iustin Pop | .. Local Variables: |
902 | d85f01e7 | Iustin Pop | .. mode: rst |
903 | d85f01e7 | Iustin Pop | .. fill-column: 72 |
904 | d85f01e7 | Iustin Pop | .. End: |