======================================
Ganeti shared storage support for 2.3+
======================================

This document describes the changes in Ganeti 2.3+ compared to the
Ganeti 2.3 storage model.

.. contents:: :depth: 4
.. highlight:: shell-example

Objective
=========

The aim is to introduce support for externally mirrored, shared storage.
This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files,
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing on
  a SAN appliance.

Background
==========

DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware, at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for virtual machine hosting, such as SAN and/or NAS,
which provide shared storage without the administrative overhead of
DRBD or the limitation of a 1:1 master-slave setup. Furthermore, new
distributed filesystems such as Ceph are becoming viable alternatives
to expensive storage appliances. Support for both modes of operation,
i.e. a shared block storage and a shared file storage backend, would
make Ganeti a robust choice for high-availability virtualization
clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, meaning that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases
=========

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.
- Introduction of an External Storage Interface.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.

Refactoring of all code referring to constants.DTS_NET_MIRROR
==============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and
  DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories:

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_NET_MIRROR (see the sketch below)

================================================
104

    
105
The primary-secondary node concept has primarily evolved through the use
106
of DRBD. In a globally shared storage framework without need for
107
external sync (e.g. SAN, NAS, etc.), such a notion does not apply for the
108
following reasons:
109

    
110
1. Access to the storage does not necessarily imply different roles for
111
   the nodes (e.g. primary vs secondary).
112
2. The same storage is available to potentially more than 2 nodes. Thus,
113
   an instance backed by a SAN LUN for example may actually migrate to
114
   any of the other nodes and not just a pre-designated failover node.
115

    
116
The proposed solution is using the iallocator framework for run-time
117
decision making during migration and failover, for nodes with disk
118
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance and
119
gnt-node will be required to accept target node and/or iallocator
120
specification for these operations. Modifications of the iallocator
121
protocol will be required to address at least the following needs:
122

    
123
- Allocation tools must be able to distinguish between internal and
124
  external storage
125
- Migration/failover decisions must take into account shared storage
126
  availability
127

    
128
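The following rough sketch illustrates the intended run-time decision;
the function and the ``RunIAllocator`` placeholder are assumptions for
illustration, not actual Ganeti code.

.. code-block:: python

  def SelectMigrationTarget(instance, target_node=None, iallocator=None):
    """Pick a target node for failover/migration (illustrative only)."""
    if instance.disk_template in DTS_INT_MIRROR:
      # DRBD-style templates keep the fixed primary/secondary pair.
      return instance.secondary_node
    # Externally mirrored storage: any node that can reach the storage
    # is a valid target.
    if target_node is not None:
      return target_node
    if iallocator is not None:
      # Defer the choice to the iallocator plugin at run time
      # (RunIAllocator is a placeholder for the actual call).
      return RunIAllocator(iallocator, instance)
    raise ValueError("Neither a target node nor an iallocator was given")
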
Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be
saved.

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.

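As an illustration, a disk path on such a shared filesystem could be
derived from the cluster-wide mountpoint as sketched below; the
directory layout and function name are assumptions, not the actual
implementation.

.. code-block:: python

  import os

  # Example mountpoint of the shared filesystem (set at cluster init).
  SHARED_FILE_STORAGE_DIR = "/srv/ganeti/shared-file-storage"

  def SharedFileDiskPath(instance_name, disk_index):
    """Return the path of an instance disk on the shared filesystem."""
    return os.path.join(SHARED_FILE_STORAGE_DIR, instance_name,
                        "disk%d" % disk_index)
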
The remainder of this document deals with shared block storage.

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an
additional disk template. This disk template will not feature any kind
of storage control (provisioning, removal, resizing, etc.), but will
instead rely on the adoption of already-existing block devices (e.g.
SAN LUNs, NBD devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available under the same path on all nodes in the
  node group (see the sketch below).

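A minimal sketch of the adoption check these assumptions imply is given
below; the function name is illustrative and not part of the actual
design.

.. code-block:: python

  import os
  import stat

  def AssembleAdoptedDevice(dev_path):
    """Validate an already-provisioned shared block device on this node."""
    if not os.path.exists(dev_path):
      raise RuntimeError("Adopted device %s not found on this node" %
                         dev_path)
    if not stat.S_ISBLK(os.stat(dev_path).st_mode):
      raise RuntimeError("%s is not a block device" % dev_path)
    return dev_path
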
Introduction of an External Storage Interface
==============================================

Overview
--------

To extend the shared block storage template and give Ganeti the ability
to control and manipulate external storage (provisioning, removal,
growing, etc.), we need a more generic approach. The generic method for
supporting external shared storage in Ganeti will be to have an
ExtStorage provider for each external shared storage hardware type. The
ExtStorage provider will be a set of files (executable scripts and text
files), contained inside a directory which will be named after the
provider. This directory must be present across all nodes of a
nodegroup (Ganeti doesn't replicate it), in order for the provider to
be usable (valid) by Ganeti for this nodegroup. The external shared
storage hardware should also be accessible by all nodes of this
nodegroup.

An “ExtStorage provider” will have to provide the following methods:

- Create a disk
- Remove a disk
- Grow a disk
- Attach a disk to a given node
- Detach a disk from a given node
- Verify its supported parameters

The proposed ExtStorage interface borrows heavily from the OS
interface and follows a one-script-per-function approach. An ExtStorage
provider is expected to provide the following scripts:

- ``create``
- ``remove``
- ``grow``
- ``attach``
- ``detach``
- ``verify``

All scripts will be called with no arguments and get their input via
environment variables. A common set of variables will be exported for
all commands, and some commands might get extra ones.

``VOL_NAME``
  The name of the volume. This name is unique for Ganeti, which uses it
  to refer to a specific volume inside the external storage.
``VOL_SIZE``
  The volume's size in mebibytes.
``VOL_NEW_SIZE``
  Available only to the `grow` script. It declares the new size of the
  volume after grow (in mebibytes).
``EXTP_name``
  ExtStorage parameter, where `name` is the parameter in upper-case
  (same as the OS interface's ``OSP_*`` parameters).

All scripts except `attach` should return 0 on success and non-zero on
error, accompanied by an appropriate error message on stderr. The
`attach` script should print a string on stdout on success: the block
device's full path, after it has been successfully attached to the host
node. On error it should return non-zero.

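To make the contract concrete, below is a minimal sketch of what an
`attach` script could look like, assuming a hypothetical provider whose
volumes show up under ``/dev/disk/by-id/``; a real provider would talk
to its storage appliance here.

.. code-block:: python

  #!/usr/bin/env python
  # Minimal, illustrative `attach` script for an ExtStorage provider.
  import os
  import sys

  def main():
    vol_name = os.environ["VOL_NAME"]           # exported by Ganeti
    dev_path = "/dev/disk/by-id/%s" % vol_name  # provider-specific mapping
    if not os.path.exists(dev_path):
      sys.stderr.write("volume %s not visible on this node\n" % vol_name)
      return 1
    # On success, `attach` must print the device's full path on stdout.
    print(dev_path)
    return 0

  if __name__ == "__main__":
    sys.exit(main())
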
Implementation
--------------

To support the ExtStorage interface, we will introduce a new disk
template called `ext`. This template will implement the existing Ganeti
disk interface in `lib/bdev.py` (create, remove, attach, assemble,
shutdown, grow), and will simultaneously pass control to the external
scripts to actually handle the above actions. The `ext` disk template
will act as a translation layer between the current Ganeti disk
interface and the ExtStorage providers.

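A rough sketch of this translation layer is given below; the class and
helper names are assumptions for illustration only, not the actual
lib/bdev.py code.

.. code-block:: python

  import os
  import subprocess

  class ExtStorageDevice(object):
    """Forwards Ganeti disk operations to an ExtStorage provider."""

    def __init__(self, provider_dir, env):
      self.provider_dir = provider_dir  # directory named after the provider
      self.env = env                    # VOL_NAME, VOL_SIZE, EXTP_*, ...

    def _RunScript(self, name, extra_env=None):
      script = os.path.join(self.provider_dir, name)
      env = dict(self.env, **(extra_env or {}))
      return subprocess.check_output([script], env=env)

    def Create(self):
      self._RunScript("create")

    def Attach(self):
      # The `attach` script prints the block device path on stdout.
      return self._RunScript("attach").strip()

    def Grow(self, new_size_mib):
      self._RunScript("grow", {"VOL_NEW_SIZE": str(new_size_mib)})

    def Remove(self):
      self._RunScript("remove")
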
We will also introduce a new IDISK_PARAM called `IDISK_PROVIDER =
provider`, which will be used at the command line to select the desired
ExtStorage provider. This parameter will be valid only for the `ext`
template, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1

The ExtStorage interface will support creating different disks with
different providers, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 \
                            --disk=1:size=1G,provider=sample_provider2 \
                            --disk=2:size=3G,provider=sample_provider1

Finally, the ExtStorage interface will support passing of parameters to
the ExtStorage provider. This will also be done per disk, from the
command line::

 $ gnt-instance add -t ext --disk=0:size=1G,provider=sample_provider1,\
                                            param1=value1,param2=value2

The above parameters will be exported to the ExtStorage provider's
scripts as the environment variables:

- `EXTP_PARAM1 = str(value1)`
- `EXTP_PARAM2 = str(value2)`

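A small sketch of how such per-disk parameters could be turned into
``EXTP_*`` environment variables follows; the function name is
illustrative.

.. code-block:: python

  def MakeExtStorageEnv(vol_name, vol_size_mib, ext_params):
    """Build the environment passed to an ExtStorage provider script."""
    env = {
      "VOL_NAME": vol_name,
      "VOL_SIZE": str(vol_size_mib),
    }
    for name, value in ext_params.items():
      env["EXTP_%s" % name.upper()] = str(value)
    return env

  # MakeExtStorageEnv("vol1", 1024, {"param1": "value1", "param2": "value2"})
  # -> {"VOL_NAME": "vol1", "VOL_SIZE": "1024",
  #     "EXTP_PARAM1": "value1", "EXTP_PARAM2": "value2"}
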
We will also introduce a new Ganeti client called `gnt-storage`, which
will be used to diagnose ExtStorage providers and show information
about them, similarly to the way `gnt-os diagnose` and `gnt-os info`
handle OS definitions.

Long-term shared storage goals
==============================

Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage providers (see below), e.g.::

 {
  "nas1": "foostore",
  "nas2": "foostore",
  "cloud1": "barcloud",
 }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

 $ gnt-cluster modify --add-pool nas1 foostore
 $ gnt-cluster modify --remove-pool nas1 # There must be no instances using
                                         # the pool to remove it

Furthermore, the storage pools will be used to indicate the
availability of each pool to different node groups, thus specifying the
instances' “mobility domain”.

The pool in which to put the new instance's disk will be defined at the
command line during `instance add`. This will become possible by
replacing the IDISK_PROVIDER parameter with a new one, called
`IDISK_POOL = pool`. The cmdlib logic will then look at the
cluster-level mapping dictionary to determine the ExtStorage provider
for the given pool (see the sketch below).

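A minimal sketch of this pool-to-provider resolution is shown below;
the function name is illustrative.

.. code-block:: python

  def ResolvePoolProvider(cluster_storage_pools, pool):
    """Map a storage pool name to its ExtStorage provider."""
    try:
      return cluster_storage_pools[pool]
    except KeyError:
      raise ValueError("Unknown storage pool '%s'" % pool)

  # ResolvePoolProvider({"nas1": "foostore", "cloud1": "barcloud"}, "nas1")
  # -> "foostore"
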
gnt-storage
-----------

The ``gnt-storage`` client can be extended to support pool management
(creation/modification/deletion of pools, connection/disconnection of
pools to nodegroups, etc.). It can also be extended to diagnose and
provide information for internal disk templates, such as lvm and drbd.

.. vim: set textwidth=72 :