======================================
Ganeti shared storage support for 2.3+
======================================

This document describes the changes in Ganeti 2.3+ compared to the
Ganeti 2.3 storage model.

.. contents:: :depth: 4
.. highlight:: shell-example

Objective
=========

The aim is to introduce support for externally mirrored, shared storage.
This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing on
  a SAN appliance.

Background
==========

DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware, at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for Virtual Machine hosting, such as SAN and/or NAS,
which provide shared storage without the administrative overhead of
DRBD or the limitation of a 1:1 master-slave setup. Furthermore, new
distributed filesystems such as Ceph are becoming viable alternatives
to expensive storage appliances. Support for both modes of operation,
i.e. a shared block storage and a shared file storage backend, would
make Ganeti a robust choice for high-availability virtualization
clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, meaning that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases
=========

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.
- Introduction of an External Storage Interface.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.

Refactoring of all code referring to constants.DTS_NET_MIRROR
==============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being the union of constants.DTS_EXT_MIRROR
  and DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories, as
illustrated by the sketch below:

- Mobility checks, such as whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_INT_MIRROR (the renamed DTS_NET_MIRROR)
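
The following minimal sketch shows how the proposed frozensets and the
resulting checks could look; the concrete disk template constants
(``DT_DRBD8``, ``DT_SHARED_FILE``, ``DT_BLOCK``) and helper names are
assumed here for illustration only, not the final implementation:

.. code-block:: python

  # Sketch only: hypothetical members of the new frozensets in
  # lib/constants.py; the final membership is an implementation detail.
  DT_DRBD8 = "drbd"
  DT_SHARED_FILE = "sharedfile"
  DT_BLOCK = "blockdev"

  # Internally mirrored templates (the renamed DTS_NET_MIRROR).
  DTS_INT_MIRROR = frozenset([DT_DRBD8])

  # Externally mirrored templates; Ganeti does not drive the mirroring.
  DTS_EXT_MIRROR = frozenset([DT_SHARED_FILE, DT_BLOCK])

  # Every template that allows failover/migration at all.
  DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR

  def CheckMobility(disk_template):
    """Mobility check: failover/migration needs some form of mirroring."""
    return disk_template in DTS_MIRRORED

  def CheckSyncNeeded(disk_template):
    """Syncing actions apply only to internally mirrored templates."""
    return disk_template in DTS_INT_MIRROR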

Obsolescence of the primary-secondary node model
================================================

The primary-secondary node concept has primarily evolved through the
use of DRBD. In a globally shared storage framework without the need
for external sync (e.g. SAN, NAS, etc.), such a notion does not apply
for the following reasons:

1. Access to the storage does not necessarily imply different roles for
   the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes.
   Thus, an instance backed by a SAN LUN, for example, may actually
   migrate to any of the other nodes and not just a pre-designated
   failover node.

The proposed solution is to use the iallocator framework for run-time
decision-making during migration and failover, for nodes with disk
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance
and gnt-node will be required to accept a target node and/or an
iallocator specification for these operations. Modifications of the
iallocator protocol will be required to address at least the following
needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
  availability

Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be
saved.

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.

The remainder of this document deals with shared block storage.
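
As an illustration only (the actual option name and directory layout
are implementation details), the shared file template could derive an
instance disk's path from the cluster-wide mountpoint roughly like
this:

.. code-block:: python

  import os.path

  # Assumed cluster-level setting, e.g. given at cluster init time;
  # the real option name and default path may differ.
  SHARED_FILE_STORAGE_DIR = "/srv/ganeti/shared-file-storage"

  def GetSharedFileDiskPath(instance_name, disk_index):
    """Sketch: place each disk as a regular file under the shared mount.

    Every node sees the same path because the filesystem is mounted at
    the same mountpoint cluster-wide.
    """
    return os.path.join(SHARED_FILE_STORAGE_DIR, instance_name,
                        "disk%d" % disk_index)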

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an
additional disk template. This disk template will not feature any kind
of storage control (provisioning, removal, resizing, etc.), but will
instead rely on the adoption of already-existing block devices (e.g.
SAN LUNs, NBD devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available with the same path under all nodes in
  the node group.
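
Because the template only adopts existing devices, each node mostly has
to verify that the device is really present before handing it to the
hypervisor. A sketch of such a check is given below; the function name
is illustrative and not part of the actual lib/bdev.py API:

.. code-block:: python

  import os
  import stat

  def VerifyAdoptedDevice(dev_path):
    """Sketch: check that an adopted device is usable on this node.

    The device is expected to exist under the same path on every node
    of the node group (e.g. enforced via udev rules).
    """
    try:
      st = os.stat(dev_path)
    except OSError:
      raise RuntimeError("Adopted device %s not found on this node" %
                         dev_path)
    if not stat.S_ISBLK(st.st_mode):
      raise RuntimeError("%s exists but is not a block device" % dev_path)
    return dev_path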

Introduction of an External Storage Interface
==============================================

Overview
--------

To extend the shared block storage template and give Ganeti the ability
to control and manipulate external storage (provisioning, removal,
growing, etc.) we need a more generic approach. The generic method for
supporting external shared storage in Ganeti will be to have an
ExtStorage provider for each external shared storage hardware type. The
ExtStorage provider will be a set of files (executable scripts and text
files), contained inside a directory which will be named after the
provider. This directory must be present across all nodes of a
nodegroup (Ganeti doesn't replicate it) in order for the provider to be
usable (valid) by Ganeti for this nodegroup. The external shared
storage hardware should also be accessible by all nodes of this
nodegroup.

An “ExtStorage provider” will have to provide the following methods:

- Create a disk
- Remove a disk
- Grow a disk
- Attach a disk to a given node
- Detach a disk from a given node
- Verify its supported parameters

The proposed ExtStorage interface borrows heavily from the OS
interface and follows a one-script-per-function approach. An ExtStorage
provider is expected to provide the following scripts:

- ``create``
- ``remove``
- ``grow``
- ``attach``
- ``detach``
- ``verify``

All scripts will be called with no arguments and get their input via
environment variables. A common set of variables will be exported for
all commands, and some commands might have additional ones.

``VOL_NAME``
  The name of the volume. This is unique within Ganeti, which uses it
  to refer to a specific volume inside the external storage.
``VOL_SIZE``
  The volume's size in mebibytes.
``VOL_NEW_SIZE``
  Available only to the `grow` script. It declares the new size of the
  volume after grow (in mebibytes).
``EXTP_name``
  ExtStorage parameter, where `name` is the parameter in upper-case
  (same as the OS interface's ``OSP_*`` parameters).

All scripts except `attach` should return 0 on success and non-zero on
error, accompanied by an appropriate error message on stderr. The
`attach` script should print a string on stdout on success, which is
the block device's full path, after it has been successfully attached
to the host node. On error it should return non-zero.
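
For example, a provider's `attach` script could be as simple as the
following sketch. Python is used here only for brevity (any executable
will do), and the device lookup is a placeholder for provider-specific
logic:

.. code-block:: python

  #!/usr/bin/python
  # Sketch of an ExtStorage "attach" script: read the volume name from
  # the environment, make the volume available on this node and print
  # the resulting block device path on stdout.
  import os
  import sys

  def Main():
    vol_name = os.environ.get("VOL_NAME")
    if not vol_name:
      sys.stderr.write("Missing VOL_NAME in the environment\n")
      return 1
    # Placeholder: a real provider would map/attach the volume here
    # (e.g. map an RBD image, log in to an iSCSI target, ...).
    dev_path = "/dev/disk/by-id/ext-%s" % vol_name
    if not os.path.exists(dev_path):
      sys.stderr.write("Volume %s could not be attached\n" % vol_name)
      return 1
    sys.stdout.write("%s\n" % dev_path)
    return 0

  if __name__ == "__main__":
    sys.exit(Main())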

Implementation
--------------

To support the ExtStorage interface, we will introduce a new disk
template called `ext`. This template will implement the existing Ganeti
disk interface in `lib/bdev.py` (create, remove, attach, assemble,
shutdown, grow), and will simultaneously pass control to the external
scripts to actually handle the above actions. The `ext` disk template
will act as a translation layer between the current Ganeti disk
interface and the ExtStorage providers.
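
In rough terms (the provider search path and helper name below are
assumptions, not the final implementation), the translation layer boils
down to building the environment described above and executing the
provider's script for the requested action:

.. code-block:: python

  import os
  import subprocess

  # Assumed location of the provider directories on every node.
  ES_SEARCH_PATH = "/srv/ganeti/extstorage"

  def RunExtStorageScript(provider, action, vol_name, vol_size_mib,
                          ext_params=None):
    """Sketch: run e.g. the "create" or "attach" script of a provider.

    Returns the script's stdout (used by "attach" to report the device
    path); raises on a non-zero exit status.
    """
    env = dict(os.environ)
    env["VOL_NAME"] = vol_name
    env["VOL_SIZE"] = str(vol_size_mib)
    # Per-disk provider parameters become EXTP_* variables.
    for name, value in (ext_params or {}).items():
      env["EXTP_%s" % name.upper()] = str(value)
    script = os.path.join(ES_SEARCH_PATH, provider, action)
    proc = subprocess.Popen([script], env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
      raise RuntimeError("%s script of provider %s failed: %s" %
                         (action, provider, err.strip()))
    return out.strip()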

We will also introduce a new IDISK_PARAM called `IDISK_PROVIDER =
provider`, which will be used at the command line to select the desired
ExtStorage provider. This parameter will be valid only for the `ext`
template, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1

The ExtStorage interface will allow different disks to be created by
different providers, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 \
    --disk=1:size=1G,provider=sample_provider2 \
    --disk=2:size=3G,provider=sample_provider1

Finally, the ExtStorage interface will support passing of parameters to
the ExtStorage provider. This will also be done per disk, from the
command line::

  $ gnt-instance add -t ext --disk=0:size=1G,provider=sample_provider1,\
    param1=value1,param2=value2

The above parameters will be exported to the ExtStorage provider's
scripts as the environment variables:

- `EXTP_PARAM1 = str(value1)`
- `EXTP_PARAM2 = str(value2)`

We will also introduce a new Ganeti client called `gnt-storage` which
will be used to diagnose ExtStorage providers and show information
about them, similarly to the way `gnt-os diagnose` and `gnt-os info`
handle OS definitions.

Long-term shared storage goals
==============================

Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage providers (see below), e.g.::

  {
    "nas1": "foostore",
    "nas2": "foostore",
    "cloud1": "barcloud",
  }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

  $ gnt-cluster modify --add-pool nas1 foostore
  $ gnt-cluster modify --remove-pool nas1 # There must be no instances
                                          # using the pool to remove it

Furthermore, the availability of storage pools will be indicated per
node group, thus specifying the instances' “mobility domain”.

The pool in which to put the new instance's disk will be specified on
the command line during `instance add`. This will become possible by
replacing the IDISK_PROVIDER parameter with a new one, called
`IDISK_POOL = pool`. The cmdlib logic will then look at the
cluster-level mapping dictionary to determine the ExtStorage provider
for the given pool.
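
A minimal sketch of that lookup, with a hypothetical helper name and
assuming the cluster object simply carries the dictionary shown above,
could be:

.. code-block:: python

  def GetProviderForPool(cluster_storage_pools, pool):
    """Sketch: map an IDISK_POOL value to its ExtStorage provider.

    cluster_storage_pools is the cluster-level "storage_pools"
    dictionary, e.g. {"nas1": "foostore", "cloud1": "barcloud"}.
    """
    try:
      return cluster_storage_pools[pool]
    except KeyError:
      raise ValueError("Unknown storage pool '%s'; known pools: %s" %
                       (pool, ", ".join(sorted(cluster_storage_pools))))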

gnt-storage
-----------

The ``gnt-storage`` client can be extended to support pool management
(creation/modification/deletion of pools, connection/disconnection of
pools to nodegroups, etc.). It can also be extended to diagnose and
provide information for internal disk templates too, such as lvm and
drbd.

.. vim: set textwidth=72 :