=======================================
Ganeti shared storage support for 2.3+
=======================================

This document describes the changes introduced in Ganeti 2.3+ compared
to the Ganeti 2.3 storage model.

.. contents:: :depth: 4

Objective
=========

The aim is to introduce support for externally mirrored, shared
storage. This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing
  on a SAN appliance.

Background
==========

DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for Virtual Machine hosting, such as SAN and/or
NAS, which provide shared storage without the administrative overhead
of DRBD or the limitation of a 1:1 master-slave setup. Furthermore,
new distributed filesystems such as Ceph are becoming viable
alternatives to expensive storage appliances. Support for both modes
of operation, i.e. a shared block storage and a shared file storage
backend, would make Ganeti a robust choice for high-availability
virtualization clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, meaning that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases
=========

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.
- Introduction of an interface for communicating with external scripts,
  providing methods for the various stages of a block device's and
  instance's life-cycle. In order to provide storage provisioning
  capabilities for various SAN appliances, external helpers in the form
  of a “storage driver” may be introduced as well.

Refactoring of all code referring to constants.DTS_NET_MIRROR
==============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being the union of constants.DTS_EXT_MIRROR
  and DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories:

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_NET_MIRROR

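For illustration, a minimal sketch of how the proposed frozensets and
the resulting check grouping could look; the code and the member disk
templates below are assumptions, not actual Ganeti definitions::

  # Hypothetical excerpt in the spirit of lib/constants.py; the disk
  # template names are illustrative only.
  DT_DRBD8 = "drbd"
  DT_SHAREDFILE = "sharedfile"
  DT_BLOCKDEV = "blockdev"

  DTS_INT_MIRROR = frozenset([DT_DRBD8])
  DTS_EXT_MIRROR = frozenset([DT_SHAREDFILE, DT_BLOCKDEV])
  DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR

  def CanMoveInstance(disk_template):
    """Mobility check: failover/migration needs a mirrored template."""
    return disk_template in DTS_MIRRORED

  def NeedsDiskSync(disk_template):
    """Sync actions apply only to internally mirrored templates."""
    return disk_template in DTS_INT_MIRROR
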
Obsolescence of the primary-secondary node model
================================================

The primary-secondary node concept has primarily evolved through the
use of DRBD. In a globally shared storage framework without need for
external sync (e.g. SAN, NAS, etc.), such a notion does not apply for
the following reasons:

1. Access to the storage does not necessarily imply different roles for
   the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes.
   Thus, an instance backed by a SAN LUN, for example, may actually
   migrate to any of the other nodes and not just a pre-designated
   failover node.

The proposed solution is to use the iallocator framework for run-time
decision making during migration and failover, for nodes with disk
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance
and gnt-node will be required to accept a target node and/or an
iallocator specification for these operations. Modifications of the
iallocator protocol will be required to address at least the following
needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
  availability

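As a rough, hypothetical sketch of the intended decision flow (the
helper names and the iallocator invocation below are illustrative, not
the actual protocol)::

  def PickTargetNode(instance, run_iallocator, iallocator_name=None):
    """Select a failover/migration target node for an instance.

    ``run_iallocator`` stands in for whatever interface invokes the
    allocation tool; it is assumed to return a node name.
    """
    if instance.disk_template in DTS_EXT_MIRROR:
      # Externally mirrored storage: any node that can reach the shared
      # storage is a candidate, so defer the choice to the iallocator.
      return run_iallocator(iallocator_name, instance)
    if instance.disk_template in DTS_INT_MIRROR:
      # Internally mirrored (DRBD): only the pre-designated secondary
      # node is a valid target.
      return instance.secondary_node
    raise ValueError("instance %s has no mirrored storage" % instance.name)
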
Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be
saved.

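Purely as an illustration (the mountpoint and directory layout below
are assumptions, not part of this design), the image path of a
shared-file disk could be composed from the shared mountpoint much like
the existing file disk template does::

  import os

  # Assumed example mountpoint of the shared filesystem.
  SHARED_FILE_STORAGE_DIR = "/srv/ganeti/shared-file-storage"

  def SharedFileDevPath(instance_name, disk_index):
    """Compose the image path of a shared-file disk.

    One directory per instance, one image file per disk.
    """
    return os.path.join(SHARED_FILE_STORAGE_DIR, instance_name,
                        "disk%d" % disk_index)
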
A new cluster initialization option will be added to specify the
mountpoint of the shared filesystem.

The remainder of this document deals with shared block storage.

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an
additional disk template. This disk template will not feature any kind
of storage control (provisioning, removal, resizing, etc.), but will
instead rely on the adoption of already-existing block devices (e.g.
SAN LUNs, NBD devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device is available under the same path on all nodes in the node
  group.

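A minimal sketch of what verifying these assumptions on a node might
look like (purely illustrative, not part of the proposed code)::

  import os
  import stat

  def VerifyAdoptedDevice(dev_path):
    """Check that a device to be adopted exists and is a block device.

    The same check would have to succeed on every node of the node
    group, since the template assumes a consistent device path.
    """
    try:
      st = os.stat(dev_path)
    except OSError:
      raise RuntimeError("device %s not found on this node" % dev_path)
    if not stat.S_ISBLK(st.st_mode):
      raise RuntimeError("%s is not a block device" % dev_path)
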
Long-term shared storage goals
==============================

Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage drivers (see below), e.g.::

  {
    "nas1": "foostore",
    "nas2": "foostore",
    "cloud1": "barcloud",
  }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

  gnt-cluster modify --add-pool nas1 foostore
  gnt-cluster modify --remove-pool nas1  # the pool must not be in use
                                         # by any instance

Furthermore, the storage pools will be used to indicate the
availability of storage to different node groups, thus specifying the
instances' “mobility domain”.

New disk templates will also be necessary to facilitate the use of
external storage. The proposed addition is a whole template namespace
created by prefixing the pool names with a fixed string, e.g. “ext:”,
forming names like “ext:nas1”, “ext:foo”.

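As an illustration of the naming scheme only (the helper below is
hypothetical), resolving such a template name to its pool and driver
could look like::

  EXT_TEMPLATE_PREFIX = "ext:"

  def ResolveExtTemplate(disk_template, storage_pools):
    """Map an "ext:<pool>" disk template to (pool, driver).

    ``storage_pools`` is the proposed cluster-level dictionary, e.g.
    {"nas1": "foostore"}.
    """
    if not disk_template.startswith(EXT_TEMPLATE_PREFIX):
      raise ValueError("not an external template: %s" % disk_template)
    pool = disk_template[len(EXT_TEMPLATE_PREFIX):]
    if pool not in storage_pools:
      raise ValueError("unknown storage pool: %s" % pool)
    return pool, storage_pools[pool]
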
Interface to the external storage drivers
-----------------------------------------

In addition to external storage pools, a new interface will be
introduced to allow external scripts to provision and manipulate shared
storage.

In order to provide storage provisioning and manipulation (e.g.
growing, renaming) capabilities, each instance's disk template may be
associated with an external “storage driver” which, based on the
instance's configuration and tags, will perform all supported storage
operations using auxiliary means (e.g. XML-RPC, ssh, etc.).

A “storage driver” will have to provide the following methods:

- Create a disk
- Remove a disk
- Rename a disk
- Resize a disk
- Attach a disk to a given node
- Detach a disk from a given node

The proposed storage driver architecture borrows heavily from the OS
interface and follows a one-script-per-function approach. A storage
driver is expected to provide the following scripts:

- `create`
- `resize`
- `rename`
- `remove`
- `attach`
- `detach`

These executables will be called once for each disk, with no arguments;
all required information will be passed through environment variables.
The following environment variables will always be present on each
invocation:

- `INSTANCE_NAME`: The instance's name
- `INSTANCE_UUID`: The instance's UUID
- `INSTANCE_TAGS`: The instance's tags
- `DISK_INDEX`: The current disk index
- `LOGICAL_ID`: The disk's logical ID (if it already exists)
- `POOL`: The storage pool the instance belongs to

Additional variables may be available in a per-script context (see
below).

Of particular importance is the disk's logical ID, which will act as
the glue between Ganeti and the external storage drivers. There are two
possible ways of using a disk's logical ID in a storage driver:

1. Simply use it as a unique identifier (e.g. a UUID) and keep a
   separate, external database linking it to the actual storage.
2. Encode all useful storage information in the logical ID and have the
   driver decode it at runtime.

All scripts should return 0 on success and non-zero on error,
accompanied by an appropriate error message on stderr. Furthermore,
the following special cases are defined:

1. `create`: In case of success, a string representing the disk's
   logical ID must be returned on stdout, which will be saved in the
   instance's configuration and can later be used by the other scripts
   of the same storage driver. The logical ID may be based on the
   instance name, the instance UUID and/or the disk index.

   Additional environment variables present:

   - `DISK_SIZE`: The requested disk size in MiB

2. `resize`: In case of success, output the new disk size.

   Additional environment variables present:

   - `DISK_SIZE`: The requested disk size in MiB

3. `rename`: On success, a new logical ID should be returned, which
   will replace the old one. This script is meant to rename the
   instance's backing store and update the disk's logical ID in case
   one of them is bound to the instance name.

   Additional environment variables present:

   - `NEW_INSTANCE_NAME`: The instance's new name.

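As a purely illustrative example of a driver script under this
interface (the driver name, volume naming and provisioning step are
assumptions, not part of the specification), a `create` script could
look like::

  #!/usr/bin/env python
  # Hypothetical `create` script of a storage driver named "foostore".
  import os
  import sys
  import uuid

  def main():
    # Variables defined by the proposed interface.
    instance = os.environ["INSTANCE_NAME"]
    index = os.environ["DISK_INDEX"]
    size_mib = os.environ["DISK_SIZE"]
    pool = os.environ["POOL"]

    # Derive a unique logical ID for the new disk.
    logical_id = "%s-%s-disk%s-%s" % (pool, instance, index, uuid.uuid4())

    # The actual provisioning of ``size_mib`` MiB on the appliance
    # (XML-RPC, ssh, vendor CLI, ...) is elided in this sketch.

    # On success, print the logical ID on stdout so that Ganeti can
    # store it in the instance configuration for the other scripts.
    sys.stdout.write(logical_id + "\n")
    return 0

  if __name__ == "__main__":
    sys.exit(main())
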
.. vim: set textwidth=72 :