=============================
Ganeti shared storage support
=============================

This document describes the changes in Ganeti 2.3+ compared to the
Ganeti 2.3 storage model. It also documents the ExtStorage Interface.

.. contents:: :depth: 4
.. highlight:: shell-example

Objective
=========

The aim is to introduce support for externally mirrored, shared storage.
This includes two distinct disk templates:

- A shared filesystem containing instance disks as regular files,
  typically residing on a networked or cluster filesystem (e.g. NFS,
  AFS, Ceph, OCFS2, etc.).
- Instance images being shared block devices, typically LUNs residing on
  a SAN appliance.

Background
==========

DRBD is currently the only shared storage backend supported by Ganeti.
DRBD offers the advantages of high availability while running on
commodity hardware, at the cost of high network I/O for block-level
synchronization between hosts. DRBD's master-slave model has greatly
influenced Ganeti's design, primarily by introducing the concept of
primary and secondary nodes and thus defining an instance's “mobility
domain”.

Although DRBD has many advantages, many sites choose to use networked
storage appliances for Virtual Machine hosting, such as SAN and/or NAS,
which provide shared storage without the administrative overhead of
DRBD or the limitation of a 1:1 master-slave setup. Furthermore, new
distributed filesystems such as Ceph are becoming viable alternatives to
expensive storage appliances. Support for both modes of operation, i.e.
shared block storage and shared file storage backends, would make Ganeti
a robust choice for high-availability virtualization clusters.

Throughout this document, the term “externally mirrored storage” will
refer to both modes of shared storage, suggesting that Ganeti does not
need to take care of the mirroring process from one host to another.

Use cases
=========

We consider the following use cases:

- A virtualization cluster with FibreChannel shared storage, mapping at
  least one LUN per instance, accessible by the whole cluster.
- A virtualization cluster with instance images stored as files on an
  NFS server.
- A virtualization cluster storing instance images on a Ceph volume.

Design Overview
===============

The design addresses the following procedures:

- Refactoring of all code referring to constants.DTS_NET_MIRROR.
- Obsolescence of the primary-secondary concept for externally mirrored
  storage.
- Introduction of a shared file storage disk template for use with
  networked filesystems.
- Introduction of a shared block device disk template with device
  adoption.
- Introduction of the External Storage Interface.

Additionally, mid- to long-term goals include:

- Support for external “storage pools”.

Refactoring of all code referring to constants.DTS_NET_MIRROR
=============================================================

Currently, all storage-related decision-making depends on a number of
frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
However, constants.DTS_NET_MIRROR is used to signify two different
attributes:

- A storage device that is shared
- A storage device whose mirroring is supervised by Ganeti

We propose the introduction of two new frozensets to ease
decision-making:

- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
- constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and
  DTS_NET_MIRROR.

Additionally, DTS_NET_MIRROR will be renamed to DTS_INT_MIRROR to
reflect the status of the storage as internally mirrored by Ganeti.

Thus, checks could be grouped into the following categories:

- Mobility checks, like whether an instance failover or migration is
  possible, should check against constants.DTS_MIRRORED
- Syncing actions should be performed only for templates in
  constants.DTS_NET_MIRROR
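
As an illustration, the relationship between the proposed frozensets
and the two check categories could look as follows (a minimal sketch;
the concrete template names and helper functions are hypothetical, not
the actual contents of lib/constants.py):

.. code-block:: python

  # Hypothetical template names, for illustration only.
  DTS_INT_MIRROR = frozenset(["drbd8"])             # mirrored by Ganeti
  DTS_EXT_MIRROR = frozenset(["sharedfile", "blockdev", "ext"])
  DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR

  def CanMoveInstance(disk_template):
    """Mobility check: failover/migration needs a mirrored template."""
    return disk_template in DTS_MIRRORED

  def NeedsDiskSync(disk_template):
    """Syncing actions apply only to internally mirrored templates."""
    return disk_template in DTS_INT_MIRROR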

Obsolescence of the primary-secondary node model
================================================

The primary-secondary node concept has primarily evolved through the
use of DRBD. In a globally shared storage framework without the need
for external sync (e.g. SAN, NAS, etc.), such a notion does not apply
for the following reasons:

1. Access to the storage does not necessarily imply different roles for
   the nodes (e.g. primary vs secondary).
2. The same storage is available to potentially more than 2 nodes. Thus,
   an instance backed by a SAN LUN for example may actually migrate to
   any of the other nodes and not just a pre-designated failover node.

The proposed solution is to use the iallocator framework for run-time
decision making during migration and failover, for instances with disk
templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance
and gnt-node will be required to accept target node and/or iallocator
specification for these operations. Modifications of the iallocator
protocol will be required to address at least the following needs:

- Allocation tools must be able to distinguish between internal and
  external storage
- Migration/failover decisions must take into account shared storage
  availability
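
For example, after these modifications a migration or failover of a
shared-storage instance could specify its target explicitly or defer to
an allocator (a sketch of the intended usage; the exact option names
depend on the final implementation)::

  $ gnt-instance migrate -n node3.example.com instance1
  $ gnt-instance failover --iallocator hail instance1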

Introduction of a shared file disk template
===========================================

Basic shared file storage support can be implemented by creating a new
disk template based on the existing FileStorage class, with only minor
modifications in lib/bdev.py. The shared file disk template relies on a
shared filesystem (e.g. NFS, AFS, Ceph, OCFS2 over SAN or DRBD) being
mounted on all nodes under the same path, where instance images will be
saved.

A new cluster initialization option is added to specify the mountpoint
of the shared filesystem.
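
For example, cluster initialization could then specify the mountpoint
directly (a sketch; the option name is illustrative of the proposal)::

  $ gnt-cluster init --shared-file-storage-dir=/srv/ganeti/shared cluster1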

The remainder of this document deals with shared block storage.

Introduction of a shared block device template
==============================================

Basic shared block device support will be implemented with an additional
disk template. This disk template will not feature any kind of storage
control (provisioning, removal, resizing, etc.), but will instead rely
on the adoption of already-existing block devices (e.g. SAN LUNs, NBD
devices, remote iSCSI targets, etc.).

The shared block device template will make the following assumptions:

- The adopted block device has a consistent name across all nodes,
  enforced e.g. via udev rules.
- The device will be available at the same path on all nodes in the
  node group.
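
Under these assumptions, adopting an existing device could look like
this (a sketch of the intended usage; the device path is an example)::

  $ gnt-instance add -t blockdev \
      --disk=0:adopt=/dev/disk/by-id/wwn-0x5000c50015ea71ac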

Introduction of the External Storage Interface
==============================================

Overview
--------

To extend the shared block storage template and give Ganeti the ability
to control and manipulate external storage (provisioning, removal,
growing, etc.), we need a more generic approach. The generic method for
supporting external shared storage in Ganeti will be to have an
ExtStorage provider for each external shared storage hardware type. The
ExtStorage provider will be a set of files (executable scripts and text
files), contained inside a directory which will be named after the
provider. This directory must be present across all nodes of a nodegroup
(Ganeti doesn't replicate it), in order for the provider to be usable
(“valid”) by Ganeti for this nodegroup. The external shared storage
hardware should also be accessible by all nodes of this nodegroup.

An “ExtStorage provider” will have to provide the following methods:

- Create a disk
- Remove a disk
- Grow a disk
- Attach a disk to a given node
- Detach a disk from a given node
- SetInfo to a disk (add metadata)
- Verify its supported parameters

The proposed ExtStorage interface borrows heavily from the OS
interface and follows a one-script-per-function approach. An ExtStorage
provider is expected to provide the following scripts:

- ``create``
- ``remove``
- ``grow``
- ``attach``
- ``detach``
- ``setinfo``
- ``verify``

All scripts will be called with no arguments and get their input via
environment variables. A common set of variables will be exported for
all commands, and some commands might have extra ones.

``VOL_NAME``
  The name of the volume. This name is unique within Ganeti, which
  uses it to refer to a specific volume inside the external storage.
``VOL_SIZE``
  The volume's size in mebibytes.
``VOL_NEW_SIZE``
  Available only to the `grow` script. It declares the new size of the
  volume after grow (in mebibytes).
``EXTP_name``
  ExtStorage parameter, where `name` is the parameter in upper-case
  (same as the OS interface's ``OSP_*`` parameters).
``VOL_METADATA``
  A string containing metadata to be set for the volume. This is
  exported only to the ``setinfo`` script.

All scripts except `attach` should return 0 on success and non-zero on
error, accompanied by an appropriate error message on stderr. On
success, the `attach` script should print to stdout the block device's
full path, after the volume has been successfully attached to the host
node; on error it should return non-zero.
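
As an illustration, a provider's ``attach`` script could look like the
following minimal sketch, assuming a hypothetical provider whose
volumes appear as block devices under ``/dev/sample_provider1/``:

.. code-block:: python

  #!/usr/bin/env python
  # Sketch of an ExtStorage ``attach`` script. The volume name arrives
  # via the VOL_NAME environment variable exported by Ganeti.
  import os
  import stat
  import sys

  def main():
    vol_name = os.environ["VOL_NAME"]
    dev = "/dev/sample_provider1/%s" % vol_name
    try:
      if not stat.S_ISBLK(os.stat(dev).st_mode):
        raise OSError("not a block device")
    except OSError:
      sys.stderr.write("volume %s not available on this node\n" % vol_name)
      return 1
    # On success, print the block device's full path on stdout.
    print(dev)
    return 0

  if __name__ == "__main__":
    sys.exit(main())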

Implementation
--------------

To support the ExtStorage interface, we will introduce a new disk
template called `ext`. This template will implement the existing Ganeti
disk interface in `lib/bdev.py` (create, remove, attach, assemble,
shutdown, grow, setinfo), and will simultaneously pass control to the
external scripts to actually handle the above actions. The `ext` disk
template will act as a translation layer between the current Ganeti disk
interface and the ExtStorage providers.
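
Conceptually, the translation layer builds the script's environment and
invokes the provider's executable for the requested action, e.g. (a
hypothetical sketch, not the actual `lib/bdev.py` code):

.. code-block:: python

  import os
  import subprocess

  def run_provider_script(provider_dir, action, vol_name, vol_size_mib):
    """Run one provider script, passing its input via the environment."""
    env = os.environ.copy()
    env["VOL_NAME"] = vol_name
    env["VOL_SIZE"] = str(vol_size_mib)
    script = os.path.join(provider_dir, action)
    proc = subprocess.Popen([script], env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
      raise RuntimeError("%s failed: %s" % (action, err))
    # Only ``attach`` prints something meaningful: the device's path.
    return out.strip()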

We will also introduce a new IDISK_PARAM called `IDISK_PROVIDER =
provider`, which will be used at the command line to select the desired
ExtStorage provider. This parameter will be valid only for the `ext`
template, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1

The ExtStorage interface will support different disks being created by
different providers, e.g.::

  $ gnt-instance add -t ext --disk=0:size=2G,provider=sample_provider1 \
      --disk=1:size=1G,provider=sample_provider2 \
      --disk=2:size=3G,provider=sample_provider1

Finally, the ExtStorage interface will support passing of parameters to
the ExtStorage provider. This will also be done per disk, from the
command line::

  $ gnt-instance add -t ext --disk=0:size=1G,provider=sample_provider1,\
      param1=value1,param2=value2

The above parameters will be exported to the ExtStorage provider's
scripts as the environment variables:

- `EXTP_PARAM1 = str(value1)`
- `EXTP_PARAM2 = str(value2)`

We will also introduce a new Ganeti client called `gnt-storage`, which
will be used to diagnose ExtStorage providers and show information about
them, similarly to the way `gnt-os diagnose` and `gnt-os info` handle OS
definitions.

Long-term shared storage goals
==============================

Storage pool handling
---------------------

A new cluster configuration attribute will be introduced, named
“storage_pools”, modeled as a dictionary mapping storage pools to
external storage providers (see below), e.g.::

  {
    "nas1": "foostore",
    "nas2": "foostore",
    "cloud1": "barcloud",
  }

Ganeti will not interpret the contents of this dictionary, although it
will provide methods for manipulating them under some basic constraints
(pool identifier uniqueness, driver existence). The manipulation of
storage pools will be performed by implementing new options to the
`gnt-cluster` command::

  $ gnt-cluster modify --add-pool nas1 foostore
  $ gnt-cluster modify --remove-pool nas1 # There must be no instances using
                                          # the pool to remove it

Furthermore, the storage pools will be used to indicate the
availability of storage pools to different node groups, thus specifying
the instances' “mobility domain”.

The pool in which to put the new instance's disk will be defined at
the command line during `instance add`. This will become possible by
replacing the IDISK_PROVIDER parameter with a new one, called
`IDISK_POOL = pool`. The cmdlib logic will then look at the
cluster-level mapping dictionary to determine the ExtStorage provider
for the given pool.
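
For example, disk placement by pool could then look like this (a sketch
of the proposed syntax; the `pool` disk parameter does not exist yet)::

  $ gnt-instance add -t ext --disk=0:size=2G,pool=nas1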

gnt-storage
-----------

The ``gnt-storage`` client can be extended to support pool management
(creation/modification/deletion of pools, connection/disconnection of
pools to nodegroups, etc.). It can also be extended to diagnose and
provide information for internal disk templates, such as lvm and drbd.

.. vim: set textwidth=72 :