========================
GlusterFS Ganeti support
========================

This document describes the plan for adding GlusterFS support inside Ganeti.

.. contents:: :depth: 4
.. highlight:: shell-example

Gluster overview
================

Gluster is a "brick translation" service that can turn a number of LVM logical
volumes or disks (so-called "bricks") into a unified "volume" that can be
mounted over the network through FUSE or NFS.

This is a simplified view of the components at play and of how they
interconnect as data flows from the actual disks to the instances. The parts in
grey are available for Ganeti to use and are included for completeness, but they
are not targeted for implementation at this stage.

.. digraph:: "gluster-ganeti-overview"

  graph [ splines=ortho ]
  node [ shape=rect ]

  {
    node [ shape=none ]
    _volume [ label=volume ]

    bricks -> translators -> _volume
    _volume -> network [label=transport]
    network -> instances
  }

  { rank=same; brick1 [ shape=oval ]
               brick2 [ shape=oval ]
               brick3 [ shape=oval ]
               bricks }
  { rank=same; translators distribute }
  { rank=same; volume [ shape=oval ]
               _volume }
  { rank=same; instances instanceA instanceB instanceC instanceD }
  { rank=same; network FUSE NFS QEMUC QEMUD }

  {
    node [ shape=oval ]
    brick1 [ label=brick ]
    brick2 [ label=brick ]
    brick3 [ label=brick ]
  }

  {
    node [ shape=oval ]
    volume
  }

  brick1 -> distribute
  brick2 -> distribute
  brick3 -> distribute -> volume
  volume -> FUSE [ label=<TCP<br/><font color="grey">UDP</font>>
                   color="black:grey" ]

  NFS [ color=grey fontcolor=grey ]
  volume -> NFS [ label="TCP" color=grey fontcolor=grey ]
  NFS -> mountpoint [ color=grey fontcolor=grey ]

  mountpoint [ shape=oval ]

  FUSE -> mountpoint

  instanceA [ label=instances ]
  instanceB [ label=instances ]

  mountpoint -> instanceA
  mountpoint -> instanceB

  mountpoint [ shape=oval ]

  QEMUC [ label=QEMU ]
  QEMUD [ label=QEMU ]

  {
    instanceC [ label=instances ]
    instanceD [ label=instances ]
  }

  volume -> QEMUC [ label=<TCP<br/><font color="grey">UDP</font>>
                    color="black:grey" ]
  volume -> QEMUD [ label=<TCP<br/><font color="grey">UDP</font>>
                    color="black:grey" ]
  QEMUC -> instanceC
  QEMUD -> instanceD

brick:
  The unit of storage in Gluster. Typically a drive or LVM logical volume
  formatted using, for example, XFS.

distribute:
  One of the translators in Gluster; it assigns files to bricks based on the
  hash of their full path inside the volume.

volume:
  A filesystem you can mount on multiple machines; all machines see the same
  directory tree and files.

FUSE/NFS:
  Gluster offers two ways to mount volumes: through FUSE or through a custom
  NFS server that is incompatible with other NFS servers. FUSE is more
  compatible with other services running on the storage nodes; NFS gives better
  performance. For now, FUSE is a priority.

QEMU:
  QEMU 1.3 can use Gluster volumes directly in userspace, without the need to
  mount anything (see the example after this list). Ganeti still needs
  kernelspace access at disk creation and OS install time.

transport:
  FUSE and QEMU allow you to connect using TCP and UDP, whereas NFS only
  supports TCP. Those protocols are called transports in Gluster. For now, TCP
  is a priority.
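
As an illustration of what userspace access means in practice (the host, port,
volume and image names below are placeholders), QEMU can address a disk image on
a Gluster volume directly with a ``gluster://`` URI instead of a path on a
mounted filesystem::

  $ qemu-system-x86_64 \
      -drive file=gluster://node1.example.com:24007/gv0/test.img,format=raw,if=virtio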

It is the administrator's duty to set up the bricks, the translators and thus
the volume as they see fit. Ganeti will take care of connecting the instances to
a given volume.
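
As a rough sketch of that administrator-side setup (host names, brick paths and
the volume name are examples only), a two-brick distributed volume could be
created and started with::

  $ gluster volume create gv0 node1.example.com:/bricks/b0 \
                              node2.example.com:/bricks/b0
  $ gluster volume start gv0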

.. note::

   The gluster mountpoint must be whitelisted by the administrator in
   ``/etc/ganeti/file-storage-paths`` for security reasons in order to allow
   Ganeti to modify the filesystem.
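
With the mount directory proposed later in this document, that whitelisting
amounts to something like the following, run as root on every node::

  $ echo /var/run/ganeti/gluster >> /etc/ganeti/file-storage-paths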

Why not use a ``sharedfile`` disk template?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gluster volumes `can` be used by Ganeti through the generic shared file disk
template. There are a number of reasons why that is probably not a good idea,
however:

* Shared file, being a generic solution, cannot offer userspace access support.
* Even with userspace support, Ganeti still needs kernelspace access in order to
  create disks and install OSes on them. Ganeti can manage the mounting for you
  so that the Gluster servers only have as many connections as necessary.
* Experiments showed that you can't trust ``mount.glusterfs`` to give useful
  return codes or error messages. Ganeti can work around its oddities so
  administrators don't have to.
* The shared file folder scheme (``../{instance.name}/disk{disk.id}``) does not
  work well with Gluster. The ``distribute`` translator distributes files across
  bricks, but directories need to be replicated on `all` bricks. As a result, a
  thousand instances means a thousand folders being replicated on every brick.
  This does not scale well.
* This frees up the shared file disk template to use a different, unsupported
  replication scheme together with Gluster. (Storage pools are the long term
  solution for this, however.)

So, while Gluster essentially `is` a shared file disk template, Ganeti can
provide better support for it than that.

Implementation strategy
=======================

Working with GlusterFS in kernel space essentially boils down to the following
steps (a rough shell sketch follows the list):

1. Ask FUSE to mount the Gluster volume.
2. Check that the mount succeeded.
3. Use files stored in the volume as instance disks, just like sharedfile does.
4. When the instances are spun down, attempt unmounting the volume. If the
   gluster connection is still required, the mountpoint is allowed to remain.
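
In shell terms, with the Gluster server, volume name and mount directory below
being placeholders, steps 1 and 2 roughly correspond to::

  $ mount -t glusterfs node1.example.com:/gv0 /var/run/ganeti/gluster
  $ grep /var/run/ganeti/gluster /proc/mounts    # verify the mount is in place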

However, since it is not strictly necessary for Gluster to mount the volume if
all that's needed is userspace access, it is inappropriate for the Gluster
storage class to inherit from ``FileStorage``. So the implementation should
resort to composition rather than inheritance:

1. Extract the ``FileStorage`` disk-facing logic into a ``FileDeviceHelper``
   class.

   * In order not to further inflate bdev.py, ``FileStorage`` should join its
     helper functions in filestorage.py (thus reducing their visibility), and
     Gluster should be added in its own file, gluster.py. Moving the other
     classes to their own files (like it's been done in ``lib/hypervisor/``) is
     not addressed as part of this design.

2. Use the ``FileDeviceHelper`` class to implement a ``GlusterStorage`` class in
   much the same way.
3. Add Gluster as a disk template that behaves like SharedFile in every way.
4. Provide Ganeti knowledge about what a ``GlusterVolume`` is and how to mount,
   unmount and reference it.

   * Before attempting a mount, we should check whether the volume is already
     mounted. Linux allows mounting partitions multiple times, but then you also
     have to unmount them as many times as you mounted them to actually free the
     resources; this also makes the output of commands such as ``mount`` less
     useful.
   * Every time the device could be released (after instance shutdown, OS
     installation scripts or file creation), a single unmount is attempted. If
     the device is still busy (e.g. from other instances, jobs or open
     administrator shells), the failure is ignored (a sketch follows this list).

5. Modify ``GlusterStorage`` and customize the disk template behavior to fit
   Gluster's needs.
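
A minimal shell sketch of the behaviour described in step 4, using the same
placeholder server, volume and mount directory as before: mount the volume only
if it is not mounted yet, and treat a failed unmount of a busy mountpoint as
acceptable::

  $ grep -q " /var/run/ganeti/gluster " /proc/mounts || \
      mount -t glusterfs node1.example.com:/gv0 /var/run/ganeti/gluster
  $ umount /var/run/ganeti/gluster || true    # still busy means still in use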

Directory structure
~~~~~~~~~~~~~~~~~~~

In order to address the shortcomings of the generic shared file handling of the
instance disk directory structure, Gluster uses a different scheme for
determining a disk's logical id and therefore its path on the file system.

The naming scheme is::

  /ganeti/{instance.uuid}.{disk.id}

...bringing the actual path on a node's file system to::

  /var/run/ganeti/gluster/ganeti/{instance.uuid}.{disk.id}

This means Ganeti only uses one folder on the Gluster volume (allowing other
uses of the Gluster volume in the meantime) and works better with how Gluster
distributes storage over its bricks.

Changes to the storage types system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ganeti has a number of storage types that abstract over disk templates. This
matters mainly in terms of disk space reporting. Gluster support is improved by
a rethinking of how disk templates are assigned to storage types in Ganeti.

This is the summary of the changes:

+--------------+---------+---------+-------------------------------------------+
| Disk         | Current | New     | Does it report storage information to...  |
| template     | storage | storage +-------------+----------------+------------+
|              | type    | type    | ``gnt-node  | ``gnt-node     | iallocator |
|              |         |         | list``      | list-storage`` |            |
+==============+=========+=========+=============+================+============+
| File         | File    | File    | Yes.        | Yes.           | Yes.       |
+--------------+---------+---------+-------------+----------------+------------+
| Shared file  | File    | Shared  | No.         | Yes.           | No.        |
+--------------+---------+ file    |             |                |            |
| Gluster (new)| N/A     | (new)   |             |                |            |
+--------------+---------+---------+-------------+----------------+------------+
| RBD (for     | RBD               | No.         | No.            | No.        |
| reference)   |                   |             |                |            |
+--------------+-------------------+-------------+----------------+------------+

Gluster or Shared File should not, like RBD, report storage information to
``gnt-node list`` or to IAllocators. Regrettably, the simplest way to do so
right now is by claiming that storage reporting for the relevant storage type is
not implemented. An effort was made to claim that the shared storage type did
support disk reporting while refusing to provide any value, but it was not
successful (``hail`` does not support this combination).

To do so without breaking the File disk template, a new storage type must be
added. Like RBD, it does not claim to support disk reporting. However, we can
still make an effort of reporting stats to ``gnt-node list-storage``.

The rationale is simple. For shared file and gluster storage, disk space is not
a function of any one node. If storage types with disk space reporting are used,
``hail`` expects them to give useful numbers for allocation purposes, but a
shared storage system means disk balancing is no longer affected by
node-instance allocation. Moreover, it would be wasteful to mount a Gluster
volume on each node just to run ``statvfs()`` if no machine was actually running
Gluster VMs.

As a result, Gluster support for ``gnt-node list-storage`` is necessarily
limited, and nodes on which Gluster is available but not in use will report
failures. Additionally, running ``gnt-node list`` will give an output like
this::

  Node              DTotal DFree MTotal MNode MFree Pinst Sinst
  node1.example.com      ?     ?   744M  273M  477M     0     0
  node2.example.com      ?     ?   744M  273M  477M     0     0

This is expected and consistent with the behaviour of RBD.

An alternative would have been to report DTotal and DFree as 0 in order to allow
``hail`` to ignore the disk information, but this incorrectly populates the
``gnt-node list`` DTotal and DFree fields with 0s as well.

New configuration switches
~~~~~~~~~~~~~~~~~~~~~~~~~~

Configurable at the cluster and node group level (``gnt-cluster modify``,
``gnt-group modify`` and other commands that support the ``-D`` switch to edit
disk parameters):

``gluster:host``
  The IP address or hostname of the Gluster server to connect to. In the
  default deployment of Gluster, that is any machine that is hosting a brick.

  Default: ``"127.0.0.1"``

``gluster:port``
  The port the Gluster server is listening on.

  Default: ``24007``

``gluster:volume``
  The volume Ganeti should use.

  Default: ``"gv0"``
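
For illustration only (assuming the usual ``-D`` disk parameter syntax, with the
parameter names proposed above and example values), pointing a cluster at a
different Gluster server and volume could then look like::

  $ gnt-cluster modify -D gluster:host=192.0.2.10,volume=gv0,port=24007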

Configurable at the cluster level only (``gnt-cluster init``) and stored in
ssconf for all nodes to read (just like shared file):

``--gluster-dir``
  Where the Gluster volume should be mounted.

  Default: ``/var/run/ganeti/gluster``
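
For example, at cluster creation time (the cluster name is a placeholder and the
directory shown is just the default made explicit)::

  $ gnt-cluster init --gluster-dir=/var/run/ganeti/gluster cluster.example.com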

The default values work if all of the Ganeti nodes also host Gluster bricks.
This is possible, but `not` recommended, as it can cause the host to hardlock
due to deadlocks in kernel memory handling (much in the same way RBD does).

Future work
===========

In no particular order:

* Support the UDP transport.
* Support mounting through NFS.
* Filter ``gnt-node list`` so DTotal and DFree are not shown for RBD and shared
  file disk types, or otherwise report the disk storage values as "-" or some
  other special value to clearly distinguish them from the result of a
  communication failure between nodes.
* Allow configuring the in-volume path Ganeti uses.

.. vim: set textwidth=72 :
.. Local Variables:
.. mode: rst
.. fill-column: 72
.. End: