======================================
Ganeti shared storage support for 2.3+
======================================

This document describes the changes in Ganeti 2.3+ compared to the
Ganeti 2.3 storage model.

.. contents:: :depth: 4

Objective
=========

The aim is to introduce support for externally mirrored, shared
storage. This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files,
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing
  on a SAN appliance.

Background
==========
DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for Virtual Machine hosting, such as SAN and/or
NAS, which provide shared storage without the administrative overhead
of DRBD or the limitation of a 1:1 master-slave setup. Furthermore,
new distributed filesystems such as Ceph are becoming viable
alternatives to expensive storage appliances. Support for both modes
of operation, i.e. a shared block storage and a shared file storage
backend, would make Ganeti a robust choice for high-availability
virtualization clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, suggesting that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases
=========
We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping
  at least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally
  mirrored storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.
- Introduction of an interface for communicating with external
  scripts, providing methods for the various stages of a block
  device's and instance's life-cycle. In order to provide storage
  provisioning capabilities for various SAN appliances, external
  helpers in the form of a “storage driver” will possibly be
  introduced as well.

Refactoring of all code referring to constants.DTS_NET_MIRROR
=============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and
  DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories (see the
sketch below):

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_NET_MIRROR
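
For illustration, the new frozensets could look as follows in
lib/constants.py; the disk template constants and their values used
here are assumptions of this sketch rather than final definitions::

 # Disk template constants (names and values assumed here)
 DT_DRBD8 = "drbd"
 DT_SHARED_FILE = "sharedfile"
 DT_BLOCK = "blockdev"

 # Internally mirrored templates (formerly DTS_NET_MIRROR)
 DTS_INT_MIRROR = frozenset([DT_DRBD8])

 # Externally mirrored templates (mirroring handled outside Ganeti)
 DTS_EXT_MIRROR = frozenset([DT_SHARED_FILE, DT_BLOCK])

 # All mirrored templates, e.g. for failover/migration checks
 DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR

A mobility check would then test membership in DTS_MIRRORED, while
sync-related actions would only apply to templates in DTS_INT_MIRROR.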

Obsolescence of the primary-secondary node model
================================================

The primary-secondary node concept has primarily evolved through the
use of DRBD. In a globally shared storage framework without need for
external sync (e.g. SAN, NAS, etc.), such a notion does not apply for
the following reasons:

1. Access to the storage does not necessarily imply different roles
   for the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes.
   Thus, an instance backed by a SAN LUN, for example, may actually
   migrate to any of the other nodes and not just a pre-designated
   failover node.

The proposed solution is to use the iallocator framework for run-time
decision-making during migration and failover, for nodes with disk
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance
and gnt-node will be required to accept target node and/or iallocator
specification for these operations (see the example below).
Modifications of the iallocator protocol will be required to address
at least the following needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
  availability
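
For example, migration of an instance on externally mirrored storage
could then be requested with either an explicit target node or an
iallocator; the command syntax and names below are illustrative only::

 gnt-instance migrate -n node2.example.com instance1
 gnt-instance migrate --iallocator hail instance1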

Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on
a shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD)
being mounted on all nodes under the same path, where instance images
will be saved.

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.
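
For illustration only, such an option might look like the following;
the option name is an assumption, not a final interface::

 gnt-cluster init --shared-file-storage-dir=/srv/ganeti/shared \
   cluster.example.com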

The remainder of this document deals with shared block storage.

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an
additional disk template. This disk template will not feature any kind
of storage control (provisioning, removal, resizing, etc.), but will
instead rely on the adoption of already-existing block devices (e.g.
SAN LUNs, NBD devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available with the same path under all nodes in
  the node group.
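
Under these assumptions, adopting an existing device for a new
instance could look like the following; the template name and adoption
syntax are illustrative assumptions of this sketch::

 gnt-instance add -t blockdev \
   --disk 0:adopt=/dev/disk/by-id/scsi-lun-instance1 \
   -o debootstrap+default -n node1.example.com instance1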

Long-term shared storage goals
==============================
Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage drivers (see below), e.g.::

 {
  "nas1": "foostore",
  "nas2": "foostore",
  "cloud1": "barcloud",
 }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

 gnt-cluster modify --add-pool nas1 foostore
 gnt-cluster modify --remove-pool nas1 # the pool may only be removed
                                       # if no instances are using it

Furthermore, the storage pools will be used to indicate which pools
are available to each node group, thus specifying the instances'
“mobility domain”.

New disk templates will also be necessary to facilitate the use of
external storage. The proposed addition is a whole template namespace
created by prefixing the pool names with a fixed string, e.g. “ext:”,
forming names like “ext:nas1”, “ext:foo”.
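
As a minimal sketch, such a template name could be resolved against
the “storage_pools” dictionary shown above roughly as follows; the
prefix handling here is an assumption for illustration::

 EXT_PREFIX = "ext:"

 def ResolveStoragePool(disk_template, storage_pools):
   """Return (pool, driver) for an "ext:<pool>" disk template."""
   if not disk_template.startswith(EXT_PREFIX):
     raise ValueError("Not an external template: %s" % disk_template)
   pool = disk_template[len(EXT_PREFIX):]
   if pool not in storage_pools:
     raise ValueError("Unknown storage pool: %s" % pool)
   return pool, storage_pools[pool]

 # Example: ResolveStoragePool("ext:nas1", {"nas1": "foostore"})
 # returns ("nas1", "foostore").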

Interface to the external storage drivers
-----------------------------------------

In addition to external storage pools, a new interface will be
introduced to allow external scripts to provision and manipulate
shared storage.

In order to provide storage provisioning and manipulation (e.g.
growing, renaming) capabilities, each instance's disk template can
possibly be associated with an external “storage driver” which, based
on the instance's configuration and tags, will perform all supported
storage operations using auxiliary means (e.g. XML-RPC, ssh, etc.).

A “storage driver” will have to provide the following methods:

- Create a disk
- Remove a disk
- Rename a disk
- Resize a disk
- Attach a disk to a given node
- Detach a disk from a given node

The proposed storage driver architecture borrows heavily from the OS
interface and follows a one-script-per-function approach. A storage
driver is expected to provide the following scripts:

- `create`
- `resize`
- `rename`
- `remove`
- `attach`
- `detach`

These executables will be called once for each disk with no arguments,
and all required information will be passed through environment
variables. The following environment variables will always be present
on each invocation:

- `INSTANCE_NAME`: The instance's name
- `INSTANCE_UUID`: The instance's UUID
- `INSTANCE_TAGS`: The instance's tags
- `DISK_INDEX`: The current disk index.
- `LOGICAL_ID`: The disk's logical id (if existing)
- `POOL`: The storage pool the instance belongs to.

Additional variables may be available in a per-script context (see
below).
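
For illustration, Ganeti could invoke such a script roughly as
follows, passing all required information via the environment (the
helper name is an assumption of this sketch)::

 import subprocess

 def RunStorageDriverScript(script_path, env):
   """Run a storage driver script and return its stdout."""
   # The script gets no arguments; all input is in the environment.
   proc = subprocess.Popen([script_path], env=env,
                           stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE)
   out, err = proc.communicate()
   if proc.returncode != 0:
     raise RuntimeError("Driver script failed: %s" % err.strip())
   return out.strip()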

Of particular importance is the disk's logical ID, which will act as
glue between Ganeti and the external storage drivers; there are two
possible ways of using a disk's logical ID in a storage driver:

1. Simply use it as a unique identifier (e.g. UUID) and keep a
   separate, external database linking it to the actual storage.
2. Encode all useful storage information in the logical ID and have
   the driver decode it at runtime.

All scripts should return 0 on success and non-zero on error,
accompanied by an appropriate error message on stderr. Furthermore,
the following special cases are defined:

1. `create`: In case of success, a string representing the disk's
   logical id must be returned on stdout, which will be saved in the
   instance's configuration and can be later used by the other scripts
   of the same storage driver. The logical id may be based on instance
   name, instance uuid and/or disk index (see the sketch after this
   list).

   Additional environment variables present:
     - `DISK_SIZE`: The requested disk size in MiB

2. `resize`: In case of success, output the new disk size.

   Additional environment variables present:
     - `DISK_SIZE`: The requested disk size in MiB

3. `rename`: On success, a new logical id should be returned, which
   will replace the old one. This script is meant to rename the
   instance's backing store and update the disk's logical ID in case
   one of them is bound to the instance name.

   Additional environment variables present:
     - `NEW_INSTANCE_NAME`: The instance's new name.
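
As an illustration, a minimal `create` script could look like the
sketch below, which encodes the logical id from the instance name and
disk index (all naming and encoding choices here are assumptions, not
part of the proposed interface)::

 #!/usr/bin/env python
 """Sketch of a storage driver `create` script (illustrative only)."""

 import os
 import sys


 def main():
   instance_name = os.environ["INSTANCE_NAME"]
   disk_index = os.environ["DISK_INDEX"]
   disk_size = os.environ["DISK_SIZE"]  # requested size, in MiB

   # Provision the actual storage here (e.g. create a LUN of
   # `disk_size` MiB on the appliance); omitted in this sketch.

   # Build a logical id that later scripts can decode (option 2 of
   # the logical ID usage described above).
   logical_id = "%s.disk%s" % (instance_name, disk_index)

   # On success, the logical id must be the only output on stdout.
   sys.stdout.write(logical_id + "\n")
   return 0


 if __name__ == "__main__":
   sys.exit(main())

On error, such a script would instead print a message to stderr and
exit with a non-zero status, as described above.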


.. vim: set textwidth=72 :