=============================
Ganeti shared storage support
=============================

This document describes the changes in Ganeti 2.3+ compared to the
Ganeti 2.3 storage model. It also documents the ExtStorage Interface.

.. contents:: :depth: 4
.. highlight:: shell-example

Objective
=========

The aim is to introduce support for externally mirrored, shared storage.
This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files,
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2).
- Instance images being shared block devices, typically LUNs residing on
  a SAN appliance.

Background
==========

DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for virtual machine hosting, such as SAN and/or NAS,
which provide shared storage without the administrative overhead of DRBD
or the limitation of a 1:1 master-slave setup. Furthermore, new
distributed filesystems such as Ceph are becoming viable alternatives to
expensive storage appliances. Support for both modes of operation, i.e.
shared block storage and shared file storage backends, would make Ganeti
a robust choice for high-availability virtualization clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, meaning that Ganeti does not need
to take care of the mirroring process from one host to another.

Use cases
=========

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.
- Introduction of the External Storage Interface.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.

Refactoring of all code referring to constants.DTS_NET_MIRROR
=============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, the union of constants.DTS_EXT_MIRROR and
  DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories (sketched
below):

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_NET_MIRROR
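
The following minimal sketch illustrates the proposed grouping, using
the post-rename names. The template name strings and the exact
membership of each set are illustrative assumptions, not the actual
contents of lib/constants.py:

.. code-block:: python

  # Illustrative sketch only -- not the actual lib/constants.py contents.
  DT_DRBD8 = "drbd"
  DT_SHARED_FILE = "sharedfile"
  DT_BLOCK = "blockdev"

  # Internally mirrored: Ganeti itself supervises the mirroring (DRBD).
  DTS_INT_MIRROR = frozenset([DT_DRBD8])        # formerly DTS_NET_MIRROR
  # Externally mirrored: the storage layer handles replication/sharing.
  DTS_EXT_MIRROR = frozenset([DT_SHARED_FILE, DT_BLOCK])
  # Mirrored one way or the other.
  DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR

  def CanFailoverOrMigrate(disk_template):
    """Mobility checks go against DTS_MIRRORED."""
    return disk_template in DTS_MIRRORED

  def NeedsDiskSync(disk_template):
    """Syncing actions apply only to internally mirrored templates."""
    return disk_template in DTS_INT_MIRROR
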

Obsolescence of the primary-secondary node model
================================================

The primary-secondary node concept has primarily evolved through the use
of DRBD. In a globally shared storage framework without need for
external sync (e.g. SAN, NAS, etc.), such a notion does not apply for
the following reasons:

1. Access to the storage does not necessarily imply different roles for
   the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes. Thus,
   an instance backed by a SAN LUN, for example, may actually migrate to
   any of the other nodes and not just a pre-designated failover node.

The proposed solution is to use the iallocator framework for run-time
decision making during migration and failover, for instances with disk
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance and
gnt-node will be required to accept a target node and/or iallocator
specification for these operations. Modifications of the iallocator
protocol will be required to address at least the following needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
  availability
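
As an illustration of the consequence for target selection, the sketch
below (reusing the frozensets sketched earlier) shows how failover
candidates could be derived from the disk template. The helper name and
the instance/node attributes are hypothetical, not part of the current
Ganeti code:

.. code-block:: python

  def FailoverCandidates(instance, group_nodes):
    """Hypothetical helper returning valid failover targets.

    With externally mirrored storage, any other node of the instance's
    node group (its mobility domain) is a candidate; with DRBD, only the
    pre-designated secondary node is.
    """
    if instance.disk_template in DTS_EXT_MIRROR:
      return [node for node in group_nodes if node != instance.primary_node]
    elif instance.disk_template in DTS_INT_MIRROR:
      return [instance.secondary_node]
    else:
      return []  # non-mirrored templates cannot fail over
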

Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be
saved.

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.

The remainder of this document deals with shared block storage.

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an additional
disk template. This disk template will not feature any kind of storage
control (provisioning, removal, resizing, etc.), but will instead rely
on the adoption of already-existing block devices (e.g. SAN LUNs, NBD
devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available under the same path on all nodes of the
  node group.

Introduction of the External Storage Interface
==============================================

Overview
--------

To extend the shared block storage template and give Ganeti the ability
to control and manipulate external storage (provisioning, removal,
growing, etc.), we need a more generic approach. The generic method for
supporting external shared storage in Ganeti will be to have an
ExtStorage provider for each external shared storage hardware type. The
ExtStorage provider will be a set of files (executable scripts and text
files) contained inside a directory named after the provider. This
directory must be present on all nodes of a nodegroup (Ganeti does not
replicate it) in order for the provider to be usable (valid) by Ganeti
for this nodegroup. The external shared storage hardware should also be
accessible by all nodes of this nodegroup.

An “ExtStorage provider” will have to provide the following methods:

- Create a disk
- Remove a disk
- Grow a disk
- Attach a disk to a given node
- Detach a disk from a given node
- SetInfo to a disk (add metadata)
- Verify its supported parameters

The proposed ExtStorage interface borrows heavily from the OS interface
and follows a one-script-per-function approach. An ExtStorage provider
is expected to provide the following scripts:

- ``create``
- ``remove``
- ``grow``
- ``attach``
- ``detach``
- ``setinfo``
- ``verify``

All scripts will be called with no arguments and get their input via
environment variables. A common set of variables will be exported for
all commands, and some commands might get extra ones.

``VOL_NAME``
  The name of the volume. This name is unique within Ganeti, which uses
  it to refer to a specific volume inside the external storage.
``VOL_SIZE``
  The volume's size in mebibytes.
``VOL_NEW_SIZE``
  Available only to the `grow` script. It declares the new size of the
  volume after grow (in mebibytes).
``EXTP_name``
  ExtStorage parameter, where `name` is the parameter in upper-case
  (same as the OS interface's ``OSP_*`` parameters).
``VOL_METADATA``
  A string containing metadata to be set for the volume. This is
  exported only to the ``setinfo`` script.

All scripts except `attach` should return 0 on success and non-zero on
error, accompanied by an appropriate error message on stderr. On
success, the `attach` script should print a string on stdout: the block
device's full path, after it has been successfully attached to the host
node. On error it should return non-zero.
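
As an illustration of these conventions, here is a minimal ``attach``
script for a hypothetical provider. The device naming scheme
(``/dev/mapper/<pool>-<volume>``) and the ``EXTP_POOL`` parameter are
assumptions made up for the example; a real provider would map the
volume using its own management tools:

.. code-block:: python

  #!/usr/bin/env python
  # Minimal "attach" script of a hypothetical ExtStorage provider.
  # All input arrives via environment variables set by Ganeti.
  import os
  import sys

  def main():
    vol_name = os.environ["VOL_NAME"]         # exported to every script
    # EXTP_* variables carry provider-specific parameters.
    pool = os.environ.get("EXTP_POOL", "default")

    # Assumed naming convention of this fictional appliance: once mapped,
    # the volume shows up as /dev/mapper/<pool>-<volume> on the node.
    dev_path = "/dev/mapper/%s-%s" % (pool, vol_name)
    if not os.path.exists(dev_path):
      sys.stderr.write("volume %s is not mapped on this node\n" % vol_name)
      return 1

    # On success, print the block device's full path and exit with 0.
    sys.stdout.write(dev_path + "\n")
    return 0

  if __name__ == "__main__":
    sys.exit(main())
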

Implementation
--------------

To support the ExtStorage interface, we will introduce a new disk
template called `ext`. This template will implement the existing Ganeti
disk interface in `lib/bdev.py` (create, remove, attach, assemble,
shutdown, grow, setinfo), and will simultaneously pass control to the
external scripts to actually handle the above actions. The `ext` disk
template will act as a translation layer between the current Ganeti disk
interface and the ExtStorage providers.
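
The translation-layer idea can be summarized by the rough sketch below.
It is not the actual `lib/bdev.py` code; the class name, method
signatures and the provider directory (``/srv/ganeti/extstorage``) are
assumptions used only to show how each disk operation maps onto a
provider script invocation:

.. code-block:: python

  # Rough sketch of the translation layer; not the actual lib/bdev.py code.
  import os
  import subprocess

  ES_DIR = "/srv/ganeti/extstorage"  # assumed provider search directory

  class ExtStorageDisk(object):
    def __init__(self, provider, vol_name, size_mb, ext_params):
      self.provider = provider
      self.base_env = {"VOL_NAME": vol_name, "VOL_SIZE": str(size_mb)}
      # Provider parameters are passed as EXTP_* variables.
      for name, value in ext_params.items():
        self.base_env["EXTP_%s" % name.upper()] = str(value)

    def _RunScript(self, name, extra_env=None):
      env = dict(os.environ)
      env.update(self.base_env)
      if extra_env:
        env.update(extra_env)
      script = os.path.join(ES_DIR, self.provider, name)
      return subprocess.check_output([script], env=env,
                                     universal_newlines=True)

    def Create(self):
      self._RunScript("create")

    def Grow(self, new_size_mb):
      self._RunScript("grow", {"VOL_NEW_SIZE": str(new_size_mb)})

    def Attach(self):
      # The attach script prints the block device path on stdout.
      return self._RunScript("attach").strip()

    def Remove(self):
      self._RunScript("remove")

The remaining operations (``detach``, ``setinfo``, ``verify``) would
follow the same pattern.
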

We will also introduce a new IDISK_PARAM called `IDISK_PROVIDER =
provider`, which will be used at the command line to select the desired
ExtStorage provider. This parameter will be valid only for the `ext`
template, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1

The ExtStorage interface will allow different disks to be created by
different providers, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 \
                            --disk=1:size=1G,provider=sample_provider2 \
                            --disk=2:size=3G,provider=sample_provider1

Finally, the ExtStorage interface will support passing of parameters to
the ExtStorage provider. This will also be done per disk, from the
command line::

 $ gnt-instance add -t ext --disk=0:size=1G,provider=sample_provider1,\
                                            param1=value1,param2=value2

The above parameters will be exported to the ExtStorage provider's
scripts as the environment variables:

- `EXTP_PARAM1 = str(value1)`
- `EXTP_PARAM2 = str(value2)`

We will also introduce a new Ganeti client called `gnt-storage` which
will be used to diagnose ExtStorage providers and show information about
them, similarly to the way `gnt-os diagnose` and `gnt-os info` handle OS
definitions.

ExtStorage Interface support for userspace access
=================================================

Overview
--------

The ExtStorage Interface is extended to cater for ExtStorage providers
that support userspace access. This will allow instances to access
their external storage devices directly, without going through a block
device, avoiding expensive context switches with kernel space and the
potential for deadlocks in low-memory scenarios. The implementation
should be backwards compatible and allow existing ExtStorage providers
to work as is.

Implementation
--------------

Since the implementation should be backwards compatible, we are not
going to add a new script to the set of scripts an ExtStorage provider
should ship with. Instead, the 'attach' script, which is currently
responsible for mapping the block device and returning a valid device
path, should also be responsible for providing the URIs that will be
used by each hypervisor. Even though Ganeti currently allows userspace
access only for the KVM hypervisor, we want the implementation to enable
ExtStorage providers to support more than one hypervisor, for future
compatibility.

More specifically, the 'attach' script will be allowed to return more
than one line. The first line will, as always, contain the block device
path. Each extra line will contain a URI to be used for userspace
access by a specific hypervisor. Each URI should be prefixed with the
hypervisor it corresponds to (e.g. kvm:<uri>). The prefix will be case
insensitive. If the 'attach' script doesn't return any extra lines, we
assume that the ExtStorage provider doesn't support userspace access
(this way we maintain backward compatibility with the existing 'attach'
scripts).

The 'GetUserspaceAccessUri' method of the 'ExtStorageDevice' class will
parse the output of the 'attach' script and, if the provider supports
userspace access for the requested hypervisor, use the corresponding
URI instead of the block device itself.
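
A minimal sketch of that parsing logic follows; it only illustrates the
convention described above (first line: device path, optional
``<hypervisor>:<uri>`` lines) and is not the actual 'ExtStorageDevice'
implementation:

.. code-block:: python

  def _ParseAttachOutput(output, hypervisor):
    """Illustrative parser for the multi-line 'attach' output.

    Returns (device_path, userspace_uri); userspace_uri is None when the
    provider exports no URI for the requested hypervisor.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    dev_path = lines[0]  # the first line is always the block device path
    for line in lines[1:]:
      prefix, sep, uri = line.partition(":")
      # Hypervisor prefixes are matched case-insensitively.
      if sep and prefix.lower() == hypervisor.lower():
        return dev_path, uri
    return dev_path, None

For example, an output of ``/dev/xyz`` followed by ``kvm:rbd:pool/volume``
would hand the ``rbd:pool/volume`` URI to KVM, while any other hypervisor
would fall back to the ``/dev/xyz`` block device.
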

Long-term shared storage goals
==============================

Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage providers (see below), e.g.::

 {
  "nas1": "foostore",
  "nas2": "foostore",
  "cloud1": "barcloud",
 }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

 $ gnt-cluster modify --add-pool nas1 foostore
 $ gnt-cluster modify --remove-pool nas1 # There must be no instances using
                                         # the pool to remove it

Furthermore, the storage pools will be used to indicate the availability
of each pool to different node groups, thus specifying the instances'
“mobility domain”.

The pool in which to put the new instance's disk will be specified on
the command line during `instance add`. This will become possible by
replacing the IDISK_PROVIDER parameter with a new one, called
`IDISK_POOL = pool`. The cmdlib logic will then look at the
cluster-level mapping dictionary to determine the ExtStorage provider
for the given pool.
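
In cmdlib terms this lookup is just a resolution of the pool name
through the cluster-level dictionary; a rough sketch, with hypothetical
attribute and function names, is:

.. code-block:: python

  # Hypothetical sketch of the pool -> provider resolution; the names
  # are illustrative, not existing Ganeti code.
  def ProviderForPool(cluster, pool):
    try:
      return cluster.storage_pools[pool]
    except KeyError:
      raise ValueError("Unknown storage pool '%s'" % pool)

With the example mapping above, resolving pool ``cloud1`` would yield
the ``barcloud`` provider, which is then used exactly as if it had been
given explicitly via a ``provider=...`` disk parameter.
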

gnt-storage
-----------

The ``gnt-storage`` client can be extended to support pool management
(creation/modification/deletion of pools, connection/disconnection of
pools to nodegroups, etc.). It can also be extended to diagnose and
provide information for internal disk templates, such as lvm and drbd.

.. vim: set textwidth=72 :