.. contents:: :depth: 4
.. highlight:: shell-example

Objective
=========

The aim is to let Ganeti support GlusterFS as one of its storage backends.
This involves three tasks:

- Add Gluster as a storage backend.
- Make sure Ganeti VMs can use GlusterFS backends in userspace mode (for
  newer QEMU/KVM versions that have this support) and otherwise, if possible,
  through some kernel-exported block device.
- Make sure Ganeti can configure GlusterFS by itself, by simply joining
  storage space on new nodes to a GlusterFS node pool. Note that this
  may need another design document that explains how it interacts with
  storage pools, and that the node might or might not host VMs as well.

Background
==========

There are two possible ways to implement "GlusterFS Ganeti support". One is to
treat GlusterFS as one of the external storage backends; the other is to
implement GlusterFS inside Ganeti, that is, as a new disk type for Ganeti. The
benefit of the latter is that it would not be opaque but fully supported and
integrated in Ganeti, and it would not need additional infrastructure for
testing/QAing and such. Having it internal, we can also provide a monitoring
agent for it and more visibility into what's going on. For these reasons,
GlusterFS support will be added directly inside Ganeti.

Implementation Plan
===================

Ganeti Side
-----------

To implement an internal storage backend for Ganeti, one has to subclass the
BlockDev class in `ganeti/lib/storage/base.py` with a specific class providing
create, remove and similar operations. These methods should be implemented in
`ganeti/lib/storage/bdev.py`. The real differences between implementing the
backend inside Ganeti and outside it (externally) are how these BlockDev
methods are written and how the backend is wired into Ganeti itself. The
internal implementation is not based on external scripts and integrates with
Ganeti more tightly. The RBD patches may be a good reference here. The steps
for adding a storage backend are as follows:

- Implement the BlockDev interface in bdev.py.
- Add the logic in cmdlib (e.g. migration, verify).
- Add the new storage type name to constants.
- Modify objects.Disk to support the GlusterFS storage type.
- The implementation will be performed similarly to the RBD one (see
  commit 7181fba).

GlusterFS side
--------------

GlusterFS is a distributed file system implemented in user space. Apart from
NFS and CIFS, the way to access a GlusterFS namespace is via the FUSE-based
Gluster native client. This path is less efficient because the data has to
pass through kernel space and then come back to user space. Currently, there
are two specific enhancements:

- A new library called libgfapi is now available as part of GlusterFS; it
  provides POSIX-like C APIs for accessing Gluster volumes. libgfapi support
  is available starting from the GlusterFS 3.4 release.
- QEMU/KVM (starting from QEMU 1.3) has a GlusterFS block driver that uses
  libgfapi, so there is no longer any FUSE overhead when QEMU/KVM works with
  VM images on Gluster volumes.

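For illustration, with such a QEMU build an image on a Gluster volume can be
created and accessed directly through libgfapi, with no mount involved; the
host, volume and image names below are placeholders rather than values
required by this design::

  $ qemu-img create -f qcow2 gluster://%HOST%:24007/%VOLUME%/%IMAGE%.qcow2 10G
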
Proposed implementation
-----------------------

QEMU/KVM includes support for GlusterFS, so Ganeti could support GlusterFS
through QEMU/KVM. However, that would only let QEMU/KVM VMs use GlusterFS
backend storage, not VMs running under other hypervisors such as Xen. Two
parts need to be implemented for supporting GlusterFS inside Ganeti so that it
can serve not only QEMU/KVM VMs but also Xen and others. One part is GlusterFS
for Xen VMs, which is similar to the sharedfile disk template. The other part
is GlusterFS for QEMU/KVM VMs, which is supported by the GlusterFS driver for
QEMU/KVM. After a ``gnt-instance add -t gluster instance.example.com`` command
is executed, the added instance should be checked: if it is a Xen VM, it will
use the GlusterFS sharedfile approach; if it is a QEMU/KVM VM, it will use the
QEMU/KVM + GlusterFS approach. For the first part (GlusterFS for Xen VMs), the
sharedfile disk template is a good reference; for the second part (GlusterFS
for QEMU/KVM VMs), the RBD disk template is a good reference. The first part
will be finished first, and the second part, which builds on the first, will
then be completed.

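For illustration, creating a Gluster-backed instance would then look much like
it does for any other disk template; the OS and node names below are
placeholders::

  $ gnt-instance add -t gluster --disk 0:size=2G -o %OS% -n %NODE% instance.example.com
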
Gluster overview
================

Gluster is a "brick" "translation" service that can turn a number of LVM
logical volumes or disks (so-called "bricks") into a unified "volume" that can
be mounted over the network through FUSE or NFS.

This is a simplified view of what components are at play and how they
interconnect as data flows from the actual disks to the instances. The parts in
grey are available for Ganeti to use and included for completeness but not
targeted for implementation at this stage.

.. digraph:: "gluster-ganeti-overview"

  graph [ spline=ortho ]
  node [ shape=rect ]

  {

    node [ shape=none ]
    _volume [ label=volume ]

    bricks -> translators -> _volume
    _volume -> network [label=transport]
    network -> instances
  }

  { rank=same; brick1 [ shape=oval ]
               brick2 [ shape=oval ]
               brick3 [ shape=oval ]
               bricks }
  { rank=same; translators distribute }
  { rank=same; volume [ shape=oval ]
               _volume }
  { rank=same; instances instanceA instanceB instanceC instanceD }
  { rank=same; network FUSE NFS QEMUC QEMUD }

  {
    node [ shape=oval ]
    brick1 [ label=brick ]
    brick2 [ label=brick ]
    brick3 [ label=brick ]
  }

  {
    node [ shape=oval ]
    volume
  }

  brick1 -> distribute
  brick2 -> distribute
  brick3 -> distribute -> volume
  volume -> FUSE [ label=<TCP<br/><font color="grey">UDP</font>>
                   color="black:grey" ]

  NFS [ color=grey fontcolor=grey ]
  volume -> NFS [ label="TCP" color=grey fontcolor=grey ]
  NFS -> mountpoint [ color=grey fontcolor=grey ]

  mountpoint [ shape=oval ]

  FUSE -> mountpoint

  instanceA [ label=instances ]
  instanceB [ label=instances ]

  mountpoint -> instanceA
  mountpoint -> instanceB

  mountpoint [ shape=oval ]

  QEMUC [ label=QEMU ]
  QEMUD [ label=QEMU ]

  {
    instanceC [ label=instances ]
    instanceD [ label=instances ]
  }

  volume -> QEMUC [ label=<TCP<br/><font color="grey">UDP</font>>
                    color="black:grey" ]
  volume -> QEMUD [ label=<TCP<br/><font color="grey">UDP</font>>
                    color="black:grey" ]
  QEMUC -> instanceC
  QEMUD -> instanceD

brick:
  The unit of storage in Gluster. Typically a drive or LVM logical volume
  formatted using, for example, XFS.

distribute:
  One of the translators in Gluster; it assigns files to bricks based on the
  hash of their full path inside the volume.

volume:
  A filesystem you can mount on multiple machines; all machines see the same
  directory tree and files.

FUSE/NFS:
  Gluster offers two ways to mount volumes: through FUSE or through a custom
  NFS server that is incompatible with other NFS servers. FUSE is more
  compatible with other services running on the storage nodes; NFS gives
  better performance. For now, FUSE is a priority.

QEMU:
  QEMU 1.3 has the ability to use Gluster volumes directly in userspace without
  the need for mounting anything. Ganeti still needs kernelspace access at disk
  creation and OS install time. (An illustrative invocation is shown right
  after this list.)

transport:
  FUSE and QEMU allow you to connect using TCP and UDP, whereas NFS only
  supports TCP. These protocols are called transports in Gluster. For now, TCP
  is a priority.

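As a sketch of what the QEMU userspace access path looks like (the host, volume
and image names are placeholders; ``gluster+tcp`` explicitly selects the TCP
transport)::

  $ qemu-system-x86_64 -drive file=gluster+tcp://%HOST%:24007/%VOLUME%/%IMAGE%,if=virtio
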
It is the administrator's duty to set up the bricks, the translators and thus
the volume as they see fit. Ganeti will take care of connecting the instances to
a given volume.

.. note::

  The gluster mountpoint must be whitelisted by the administrator in
  ``/etc/ganeti/file-storage-paths`` for security reasons in order to allow
  Ganeti to modify the filesystem.

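For example, with the default mount directory proposed later in this document,
the whitelisting could be done as follows (a sketch; adjust the path to the
actual cluster configuration)::

  $ echo /var/run/ganeti/gluster >> /etc/ganeti/file-storage-paths
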
Why not use a ``sharedfile`` disk template?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gluster volumes `can` be used by Ganeti through the generic shared file disk
template. There are a number of reasons why that is probably not a good idea,
however:

* Shared file, being a generic solution, cannot offer userspace access support.
* Even with userspace support, Ganeti still needs kernelspace access in order to
  create disks and install OSes on them. Ganeti can manage the mounting for you
  so that the Gluster servers only have as many connections as necessary.
* Experiments showed that you can't trust ``mount.glusterfs`` to give useful
  return codes or error messages. Ganeti can work around its oddities so
  administrators don't have to.
* The shared file folder scheme (``../{instance.name}/disk{disk.id}``) does not
  work well with Gluster. The ``distribute`` translator distributes files across
  bricks, but directories need to be replicated on `all` bricks. As a result, if
  we have twelve hundred instances, that means twelve hundred folders being
  replicated on all bricks. This does not scale well.
* This frees up the shared file disk template to use a different, unsupported
  replication scheme together with Gluster. (Storage pools are the long term
  solution for this, however.)

So, while Gluster `is` essentially a shared file disk template, Ganeti can
provide better support for it than that.

Implementation strategy
=======================

Working with GlusterFS in kernel space essentially boils down to the following
steps (a shell sketch of the first two is given after the list):

1. Ask FUSE to mount the Gluster volume.
2. Check that the mount succeeded.
3. Use files stored in the volume as instance disks, just like sharedfile does.
4. When the instances are spun down, attempt unmounting the volume. If the
   gluster connection is still required, the mountpoint is allowed to remain.

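In shell terms, the first two steps roughly correspond to the commands below
(host, volume and mount point are placeholders; the actual implementation
performs this from Python and, given the unreliable exit codes of
``mount.glusterfs`` mentioned above, verifies the mount instead of trusting the
return value)::

  $ mkdir -p /var/run/ganeti/gluster
  $ mount -t glusterfs %HOST%:/%VOLUME% /var/run/ganeti/gluster
  $ grep /var/run/ganeti/gluster /proc/mounts
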
However, since mounting the volume is not strictly necessary if all that is
needed is userspace access, it is inappropriate for the Gluster storage class
to inherit from FileStorage. So the implementation should resort to composition
rather than inheritance:

1. Extract the ``FileStorage`` disk-facing logic into a ``FileDeviceHelper``
   class.

   * In order not to further inflate bdev.py, FileStorage should join its
     helper functions in filestorage.py (thus reducing their visibility) and
     Gluster should go in its own file, gluster.py. Moving the other classes
     to their own files (like it's been done in ``lib/hypervisor/``) is not
     addressed as part of this design.

2. Use the ``FileDeviceHelper`` class to implement a ``GlusterStorage`` class in
   much the same way.
3. Add Gluster as a disk template that behaves like SharedFile in every way.
4. Provide Ganeti knowledge about what a ``GlusterVolume`` is and how to mount,
   unmount and reference them.

   * Before attempting a mount, we should check if the volume is not mounted
     already. Linux allows mounting partitions multiple times, but then you
     also have to unmount them as many times as you mounted them to actually
     free the resources; this also makes the output of commands such as
     ``mount`` less useful.
   * Every time the device could be released (after instance shutdown, OS
     installation scripts or file creation), a single unmount is attempted. If
     the device is still busy (e.g. from other instances, jobs or open
     administrator shells), the failure is ignored.

5. Modify ``GlusterStorage`` and customize the disk template behavior to fit
   Gluster's needs.

Directory structure
~~~~~~~~~~~~~~~~~~~

In order to address the shortcomings of the generic shared file handling of the
instance disk directory structure, Gluster uses a different scheme for
determining a disk's logical id and therefore its path on the file system.

The naming scheme is::

    /ganeti/{instance.uuid}.{disk.id}

...bringing the actual path on a node's file system to::

    /var/run/ganeti/gluster/ganeti/{instance.uuid}.{disk.id}

This means Ganeti only uses one folder on the Gluster volume (allowing other
uses of the Gluster volume in the meantime) and works better with how Gluster
distributes storage over its bricks.

Changes to the storage types system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ganeti has a number of storage types that abstract over disk templates. This
matters mainly in terms of disk space reporting. Gluster support is improved by
a rethinking of how disk templates are assigned to storage types in Ganeti.

This is a summary of the changes:

+--------------+---------+---------+-------------------------------------------+
| Disk         | Current | New     | Does it report storage information to...  |
| template     | storage | storage +-------------+----------------+------------+
|              | type    | type    | ``gnt-node  | ``gnt-node     | iallocator |
|              |         |         | list``      | list-storage`` |            |
+==============+=========+=========+=============+================+============+
| File         | File    | File    | Yes.        | Yes.           | Yes.       |
+--------------+---------+---------+-------------+----------------+------------+
| Shared file  | File    | Shared  | No.         | Yes.           | No.        |
+--------------+---------+ file    |             |                |            |
| Gluster (new)| N/A     | (new)   |             |                |            |
+--------------+---------+---------+-------------+----------------+------------+
| RBD (for     | RBD               | No.         | No.            | No.        |
| reference)   |                   |             |                |            |
+--------------+-------------------+-------------+----------------+------------+

Like RBD, neither Gluster nor Shared File should report storage information to
``gnt-node list`` or to IAllocators. Regrettably, the simplest way to do so
right now is by claiming that storage reporting for the relevant storage type
is not implemented. An effort was made to claim that the shared storage type
did support disk reporting while refusing to provide any value, but it was not
successful (``hail`` does not support this combination).

To do so without breaking the File disk template, a new storage type must be
added. Like RBD, it does not claim to support disk reporting. However, we can
still make an effort of reporting stats to ``gnt-node list-storage``.

The rationale is simple. For shared file and gluster storage, disk space is not
a function of any one node. If storage types with disk space reporting are
used, ``hail`` expects them to give useful numbers for allocation purposes, but
a shared storage system means disk balancing is no longer affected by
node-instance allocation. Moreover, it would be wasteful to mount a Gluster
volume on each node just for running statvfs() if no machine was actually
running Gluster VMs.

As a result, Gluster support for ``gnt-node list-storage`` is necessarily
limited and nodes on which Gluster is available but not in use will report
failures. Additionally, running ``gnt-node list`` will give an output like
this::

  Node              DTotal DFree MTotal MNode MFree Pinst Sinst
  node1.example.com      ?     ?   744M  273M  477M     0     0
  node2.example.com      ?     ?   744M  273M  477M     0     0

This is expected and consistent with the behaviour of RBD.

An alternative would have been to report DTotal and DFree as 0 in order to allow
``hail`` to ignore the disk information, but this incorrectly populates the
``gnt-node list`` DTotal and DFree fields with 0s as well.

New configuration switches
~~~~~~~~~~~~~~~~~~~~~~~~~~

Configurable at the cluster and node group level (``gnt-cluster modify``,
``gnt-group modify`` and other commands that support the `-D` switch to edit
disk parameters):

``gluster:host``
  The IP address or hostname of the Gluster server to connect to. In the
  default deployment of Gluster, that is any machine that is hosting a brick.

  Default: ``"127.0.0.1"``

``gluster:port``
  The port the Gluster server is listening on.

  Default: ``24007``

``gluster:volume``
  The volume Ganeti should use.

  Default: ``"gv0"``

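For example, assuming the syntax of the existing ``-D`` disk parameter switch,
pointing a cluster at a different Gluster server and volume could look roughly
like this (host and volume values are placeholders)::

  $ gnt-cluster modify -D gluster:host=%HOST%,volume=%VOLUME%
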
Configurable at the cluster level only (``gnt-cluster init``) and stored in
ssconf for all nodes to read (just like shared file):

``--gluster-dir``
  Where the Gluster volume should be mounted.

  Default: ``/var/run/ganeti/gluster``

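For illustration only (the switch is the one proposed above; the directory and
cluster name are placeholders)::

  $ gnt-cluster init --gluster-dir=/srv/ganeti/gluster %CLUSTERNAME%
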
The default values work if all of the Ganeti nodes also host Gluster bricks.
This is possible, but `not` recommended, as it can cause the host to hard-lock
due to deadlocks in kernel memory (much in the same way RBD does).

Future work
===========

In no particular order:

* Support the UDP transport.
* Support mounting through NFS.
* Filter ``gnt-node list`` so DTotal and DFree are not shown for RBD and shared
  file disk types, or otherwise report the disk storage values as "-" or some
  other special value to clearly distinguish them from the result of a
  communication failure between nodes.
* Allow configuring the in-volume path Ganeti uses.

.. vim: set textwidth=72 :
.. Local Variables:
