======================================
Ganeti shared storage support for 2.3+
======================================

This document describes the changes introduced in Ganeti 2.3+ compared
to the Ganeti 2.3 storage model.

.. contents:: :depth: 4

Objective
=========

The aim is to introduce support for externally mirrored, shared storage.
This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files,
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing
  on a SAN appliance.

Background
==========

DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for Virtual Machine hosting, such as SAN and/or NAS,
which provide shared storage without the administrative overhead of
DRBD or the limitation of a 1:1 master-slave setup. Furthermore, new
distributed filesystems such as Ceph are becoming viable alternatives
to expensive storage appliances. Support for both modes of operation,
i.e. a shared block storage and a shared file storage backend, would
make Ganeti a robust choice for high-availability virtualization
clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, suggesting that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases
=========

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.
- Introduction of an interface for communicating with external scripts,
  providing methods for the various stages of a block device's and
  instance's life-cycle. In order to provide storage provisioning
  capabilities for various SAN appliances, external helpers in the form
  of a “storage driver” may also be introduced.

Refactoring of all code referring to constants.DTS_NET_MIRROR
==============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making (sketched below):

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and
  DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories:

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_NET_MIRROR
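
As an illustration only, the new sets and the corresponding checks
could look roughly as follows (a sketch: the disk template constants
named here are assumptions, and the exact membership of each set is to
be decided during implementation)::

  # lib/constants.py (illustrative sketch, not the final definitions)
  DT_DRBD8 = "drbd"
  DT_SHARED_FILE = "sharedfile"  # assumed name for the new file template
  DT_BLOCK = "blockdev"          # assumed name for the new blockdev template

  # Internally mirrored: Ganeti itself supervises the replication.
  DTS_INT_MIRROR = frozenset([DT_DRBD8])   # the renamed DTS_NET_MIRROR

  # Externally mirrored: the storage layer takes care of mirroring/sharing.
  DTS_EXT_MIRROR = frozenset([DT_SHARED_FILE, DT_BLOCK])

  # Union used for mobility (failover/migration) checks.
  DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR

  def _IsMobile(disk_template):
    # Failover/migration is possible for any mirrored template.
    return disk_template in DTS_MIRRORED

  def _NeedsSync(disk_template):
    # Only internally mirrored storage needs Ganeti-driven syncing.
    return disk_template in DTS_INT_MIRROR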

Obsolescence of the primary-secondary node model
================================================

The primary-secondary node concept has primarily evolved through the
use of DRBD. In a globally shared storage framework without need for
external sync (e.g. SAN, NAS, etc.), such a notion does not apply for
the following reasons:

1. Access to the storage does not necessarily imply different roles
   for the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes. Thus,
   an instance backed by a SAN LUN for example may actually migrate to
   any of the other nodes and not just a pre-designated failover node.

The proposed solution is to use the iallocator framework for run-time
decision making during migration and failover, for nodes with disk
templates in constants.DTS_EXT_MIRROR (a sketch follows at the end of
this section). Modifications to gnt-instance and gnt-node will be
required to accept a target node and/or an iallocator specification
for these operations. Modifications of the iallocator protocol will be
required to address at least the following needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
  availability
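
The following sketch illustrates the kind of target-node selection
this implies; the helper and the way the node group is passed in are
hypothetical, and are only meant to show the decision, not an actual
Ganeti API::

  from ganeti import constants

  def _GetMoveTargets(instance, group_members):
    """Return candidate target nodes for failover/migration (sketch)."""
    if instance.disk_template in constants.DTS_EXT_MIRROR:
      # Externally mirrored storage: any node in the node group (or a
      # node chosen by an iallocator) is a valid target.
      return [node for node in group_members
              if node != instance.primary_node]
    elif instance.disk_template in constants.DTS_INT_MIRROR:
      # DRBD-style storage: only the pre-designated secondary node.
      return list(instance.secondary_nodes)
    else:
      # Non-mirrored storage cannot be moved together with its disks.
      return []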

Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be
saved.

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.
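
As an illustration of how closely this follows the existing file disk
template, the path of a shared-file disk could be derived roughly as
follows (the helper name and the directory layout are assumptions of
this sketch, not settled interfaces)::

  import os

  def _SharedFileDevPath(shared_file_storage_dir, instance_name, disk_index):
    # Same per-instance layout as the plain file template, but rooted
    # at the shared filesystem's mountpoint, which is identical on all
    # nodes.
    return os.path.join(shared_file_storage_dir, instance_name,
                        "disk%d" % disk_index)

  # e.g. _SharedFileDevPath("/srv/ganeti/shared", "inst1.example.com", 0)
  #      -> "/srv/ganeti/shared/inst1.example.com/disk0"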

The remainder of this document deals with shared block storage.

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an
additional disk template. This disk template will not feature any kind
of storage control (provisioning, removal, resizing, etc.), but will
instead rely on the adoption of already-existing block devices (e.g.
SAN LUNs, NBD devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available with the same path under all nodes in
  the node group.
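
Under these assumptions, the per-node work reduces to checking that
the adopted path exists and really is a block device before the
instance is started; a minimal sketch of such a check (the helper name
is illustrative, not an existing Ganeti function)::

  import os
  import stat

  def _CheckAdoptedBlockdev(dev_path):
    """Verify that an adopted device is usable on this node."""
    try:
      st = os.stat(dev_path)
    except OSError:
      raise RuntimeError("adopted device %s not found on this node" %
                         dev_path)
    if not stat.S_ISBLK(st.st_mode):
      raise RuntimeError("%s exists but is not a block device" % dev_path)
    return dev_path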

Long-term shared storage goals
==============================

Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage drivers (see below), e.g.::

  {
    "nas1": "foostore",
    "nas2": "foostore",
    "cloud1": "barcloud",
  }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

  gnt-cluster modify --add-pool nas1 foostore
  gnt-cluster modify --remove-pool nas1  # a pool may only be removed if
                                         # no instances are using it
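
The constraints mentioned above could be enforced along the following
lines (a sketch: the helper names and the way instances reference
their pool are assumptions, not part of the design)::

  def AddStoragePool(storage_pools, pool_name, driver_name, known_drivers):
    # Pool identifiers must be unique and the driver must exist.
    if pool_name in storage_pools:
      raise ValueError("storage pool '%s' already exists" % pool_name)
    if driver_name not in known_drivers:
      raise ValueError("unknown storage driver '%s'" % driver_name)
    storage_pools[pool_name] = driver_name

  def RemoveStoragePool(storage_pools, pool_name, pools_in_use):
    # A pool may only be removed if no instance is still using it.
    if pool_name in pools_in_use:
      raise ValueError("storage pool '%s' is still in use" % pool_name)
    del storage_pools[pool_name]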

Furthermore, the storage pool configuration will be used to indicate
the availability of storage pools to different node groups, thus
specifying the instances' “mobility domain”.

New disk templates will also be necessary to facilitate the use of
external storage. The proposed addition is a whole template namespace
created by prefixing the pool names with a fixed string, e.g. “ext:”,
forming names like “ext:nas1”, “ext:foo”.
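
Resolving such a template name back to a pool and its driver could
look roughly like this (the “ext:” prefix follows the example above;
the helper itself is purely illustrative)::

  EXT_TEMPLATE_PREFIX = "ext:"

  def ResolveExtTemplate(disk_template, storage_pools):
    """Map e.g. 'ext:nas1' to ('nas1', 'foostore')."""
    if not disk_template.startswith(EXT_TEMPLATE_PREFIX):
      raise ValueError("not an external storage template: %s" %
                       disk_template)
    pool = disk_template[len(EXT_TEMPLATE_PREFIX):]
    if pool not in storage_pools:
      raise ValueError("unknown storage pool: %s" % pool)
    return pool, storage_pools[pool]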

Interface to the external storage drivers
-----------------------------------------

In addition to external storage pools, a new interface will be
introduced to allow external scripts to provision and manipulate shared
storage.

In order to provide storage provisioning and manipulation (e.g.
growing, renaming) capabilities, each instance's disk template may be
associated with an external “storage driver” which, based on the
instance's configuration and tags, will perform all supported storage
operations using auxiliary means (e.g. XML-RPC, ssh, etc.).

A “storage driver” will have to provide the following methods:

- Create a disk
- Remove a disk
- Rename a disk
- Resize a disk
- Attach a disk to a given node
- Detach a disk from a given node

The proposed storage driver architecture borrows heavily from the OS
interface and follows a one-script-per-function approach. A storage
driver is expected to provide the following scripts:

- `create`
- `resize`
- `rename`
- `remove`
- `attach`
- `detach`

These executables will be called once for each disk with no arguments,
and all required information will be passed through environment
variables. The following environment variables will always be present
on each invocation:

- `INSTANCE_NAME`: The instance's name
- `INSTANCE_UUID`: The instance's UUID
- `INSTANCE_TAGS`: The instance's tags
- `DISK_INDEX`: The current disk index
- `LOGICAL_ID`: The disk's logical ID (if it exists)
- `POOL`: The storage pool the instance belongs to

Additional variables may be available in a per-script context (see
below).

Of particular importance is the disk's logical ID, which will act as
glue between Ganeti and the external storage drivers; there are two
possible ways of using a disk's logical ID in a storage driver:

1. Simply use it as a unique identifier (e.g. UUID) and keep a
   separate, external database linking it to the actual storage.
2. Encode all useful storage information in the logical ID and have the
   driver decode it at runtime.
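
A small sketch of the second option, encoding the relevant information
directly in the logical ID (the field names and the format are
arbitrary examples)::

  def EncodeLogicalId(pool, volume_name):
    # e.g. EncodeLogicalId("nas1", "vol0") -> "pool=nas1,vol=vol0"
    return "pool=%s,vol=%s" % (pool, volume_name)

  def DecodeLogicalId(logical_id):
    fields = dict(item.split("=", 1) for item in logical_id.split(","))
    return fields["pool"], fields["vol"]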

All scripts should return 0 on success and non-zero on error,
accompanied by an appropriate error message on stderr. Furthermore, the
following special cases are defined:

1. `create`: In case of success, a string representing the disk's
   logical ID must be returned on stdout, which will be saved in the
   instance's configuration and can later be used by the other scripts
   of the same storage driver. The logical ID may be based on the
   instance name, the instance UUID and/or the disk index.

   Additional environment variables present:

   - `DISK_SIZE`: The requested disk size in MiB

2. `resize`: In case of success, output the new disk size.

   Additional environment variables present:

   - `DISK_SIZE`: The requested disk size in MiB

3. `rename`: On success, a new logical ID should be returned, which
   will replace the old one. This script is meant to rename the
   instance's backing store and update the disk's logical ID in case
   one of them is bound to the instance name.

   Additional environment variables present:

   - `NEW_INSTANCE_NAME`: The instance's new name.
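
For illustration, a minimal `create` script honoring these conventions
might look as follows (written in Python here; the logical ID format
is just an example, and a real driver would actually provision storage
on the appliance before printing it)::

  #!/usr/bin/python
  #
  # Example storage driver "create" script (sketch only).

  import os
  import sys


  def main():
    uuid = os.environ["INSTANCE_UUID"]
    index = os.environ["DISK_INDEX"]
    size = os.environ["DISK_SIZE"]   # requested size in MiB
    pool = os.environ["POOL"]

    # A real driver would contact the storage appliance here (XML-RPC,
    # ssh, ...) and provision a volume of `size` MiB in `pool`.

    # On success, print the new logical ID on stdout and exit with 0.
    sys.stdout.write("pool=%s,vol=%s-disk%s\n" % (pool, uuid, index))
    return 0


  if __name__ == "__main__":
    sys.exit(main())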


.. vim: set textwidth=72 :