Storage guide
=============

Instructions for RADOS cluster deployment and administration

This document describes the basic steps to obtain a working RADOS cluster /
object store installation, to be used as a storage backend for synnefo, and
provides information about its administration.

It begins by providing general information on the RADOS object store, describing
the different nodes in a RADOS cluster, and then moves to the installation and
setup of the distinct software components. Finally, it provides some basic
information about cluster administration and debugging.

RADOS is the object storage component of the Ceph project
(http://ceph.newdream.net). For more documentation, see the official wiki
(http://ceph.newdream.net/wiki) and the official documentation
(http://ceph.newdream.net/docs). Usage information for the userspace tools used
to administer the cluster is also available in the respective manpages.

RADOS Intro
-----------

RADOS is the object storage component of Ceph.

An object, in this context, is a named entity that has:

* name: a sequence of bytes, unique within its container, that is used to
  locate and access the object
* content: a sequence of bytes
* metadata: a mapping from keys to values

RADOS takes care of distributing the objects across the whole storage cluster
and replicating them for fault tolerance.

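As a concrete illustration, the rados userspace tool (introduced later in this
guide) can store, tag and retrieve such an object. A minimal sketch, assuming a
pool named 'mypool' and an object named 'hello' (both arbitrary)::

  # store a local file as an object named 'hello'
  rados mkpool mypool
  rados -p mypool put hello /etc/hostname

  # attach and read back a piece of metadata (an xattr)
  rados -p mypool setxattr hello owner alice
  rados -p mypool getxattr hello owner

  # list the objects in the pool and fetch the object's content
  rados -p mypool ls
  rados -p mypool get hello /tmp/hello.out
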
Node types
----------

Nodes in a RADOS deployment belong to one of the following types:

* Monitor:
  Lightweight daemon (ceph-mon) that provides consensus for distributed
  decision making in a Ceph/RADOS cluster. It is also the initial point of
  contact for new clients, and hands out information about the topology of
  the cluster, such as the osdmap.

  You normally run 3 ceph-mon daemons, on 3 separate physical machines,
  isolated from each other; for example, in different racks or rows. You could
  run just 1 instance, but that means giving up on high availability.

  Any decision requires the majority of the ceph-mon processes to be healthy
  and communicating with each other. For this reason, you never want an even
  number of ceph-mons; there is no unambiguous majority subgroup for an even
  number.

* OSD:
  Storage daemon (ceph-osd) that provides the RADOS service. It uses the
  monitor servers for cluster membership, services object read/write/etc.
  requests from clients, and peers with other ceph-osds for data replication.

  The data model is fairly simple on this level. There are multiple named
  pools, and within each pool there are named objects, in a flat namespace (no
  directories). Each object has both data and metadata.

  By default, three pools are created (data, metadata, rbd).

  The data for an object is a single, potentially big, series of bytes.
  Additionally, the series may be sparse: it may have holes that contain binary
  zeros and take up no actual storage.

  The metadata is an unordered set of key-value pairs. Its semantics are
  completely up to the client.

  Multiple OSDs can run on one node, one for each disk included in the object
  store. This might impose a performance overhead, due to peering/replication.
  Alternatively, disks can be pooled together (either with RAID or with btrfs),
  requiring only one OSD to manage the pool.

  In the case of multiple OSDs, care must be taken to generate a CRUSH map
  which doesn't replicate objects across OSDs on the same host (see the next
  section).

* Clients:
  Clients can access the RADOS cluster either directly, at object granularity,
  using librados and the rados userspace tool, or through librbd and the rbd
  tool, which provide an image / volume abstraction over the object store.

  RBD images are striped over the object store daemons, to provide higher
  throughput, and can be accessed either via the in-kernel Rados Block Device
  (RBD) driver, which maps RBD images to block devices, or directly via Qemu
  and the Qemu-RBD driver.

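On a running cluster, the roles described above can be inspected with the ceph
tool; a brief sketch, using only commands covered in the Administration Notes
later in this guide::

  ceph mon stat   # monitor servers and quorum status
  ceph osd tree   # OSDs and the hosts/racks they belong to
  ceph -s         # overall cluster status summary
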
Replication and Fault tolerance
-------------------------------

The objects in each pool are partitioned in a (per-pool configurable) number
of placement groups (PGs), and each placement group is mapped to a number of
OSDs, according to the (per-pool configurable) replication level, and a
(per-pool configurable) CRUSH map, which defines how objects are replicated
across OSDs.

The CRUSH map is generated with hints from the config file (e.g. hostnames,
racks, etc.), so that the objects are replicated across OSDs in different
'failure domains'. However, in order to be on the safe side, the CRUSH map
should be examined to verify that, for example, PGs are not replicated across
OSDs on the same host, and corrected if needed (see the Admin section).

Information about objects, pools, and PGs is included in the osdmap, which
the clients fetch initially from the monitor servers. Using the osdmap,
clients learn which OSD is the primary for each PG, and therefore know which
OSD to contact when they want to interact with a specific object.

More information about the internals of the replication / fault tolerance /
peering inside the RADOS cluster can be found in the original RADOS paper
(http://dl.acm.org/citation.cfm?id=1374606).

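To see where a particular object ends up, the ceph tool can report its
placement; a hedged sketch (the 'osd map' subcommand may not be available in
older releases, and the pool/object names are arbitrary)::

  # show the PG and the set of OSDs serving object 'hello' in pool 'mypool'
  ceph osd map mypool hello

  # dump the state of all PGs and the OSDs each one is mapped to
  ceph pg dump
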
Journaling
----------

The OSD maintains a journal to help keep all on-disk data in a consistent state
while still keeping write latency low. That is, each OSD normally has a back-end
file system (ideally btrfs) and a journal device or file.

When the journal is enabled, all writes are written both to the journal and to
the file system. This is somewhat similar to ext3's data=journal mode, with a
few differences. There are two basic journaling modes:

* In writeahead mode, every write transaction is written first to the journal.
  Once that is safely on disk, we can ack the write and then apply it to the
  back-end file system. This will work with any file system (with a few
  caveats).

* In parallel mode, every write transaction is written to the journal and the
  file system in parallel. The write is acked when either one safely commits
  (usually the journal). This will only work on btrfs, as it relies on
  btrfs-specific snapshot ioctls to roll back to a consistent state before
  replaying the journal.

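The journal location and size are configured per OSD in ceph.conf; a minimal
sketch, using the options described in the next sections (all paths are
placeholders)::

  [osd]
      osd data = /rados/osd.$id
      ; journal on a dedicated (ideally SSD) device or partition ...
      osd journal = /dev/sdzz
      ; ... or on a plain file, in which case a size (in MB) must be given
      ;osd journal = /rados/osd.$id/journal
      ;osd journal size = 1024
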
Authentication
--------------

Ceph supports cephx secure authentication between the nodes, which makes the
cluster more secure. There are some issues with cephx authentication,
especially with clients (Qemu-RBD), and it complicates the cluster deployment.
Future revisions of this document will include documentation on setting up
fine-grained cephx authentication across the cluster.

RADOS Cluster design and configuration
--------------------------------------

This section proposes and describes a sample cluster configuration.

0. Monitor servers:
   * 3 mon servers on separate 'failure domains' (e.g. racks)
   * Monitor servers are named mon.a, mon.b, mon.c respectively
   * Monitor data stored in /rados/mon.$id (should be created)
   * Monitor servers bind on TCP port 6789, which should not be blocked by a
     firewall
   * Ceph configuration section for monitors:

       [mon]
           mon data = /rados/mon.$id

       [mon.a]
           host = [hostname]
           mon addr = [ip]:6789
       [mon.b]
           host = [hostname]
           mon addr = [ip]:6789
       [mon.c]
           host = [hostname]
           mon addr = [ip]:6789

   * Debugging options which can be included in the monitor configuration:

       [mon]
           ;show monitor messaging traffic
           debug ms = 1
           ;show monitor debug messages
           debug mon = 20
           ;show Paxos debug messages (consensus protocol)
           debug paxos = 20

1. OSD servers:
   * A numeric id is used to name the osds (osd.0, osd.1, ..., osd.n)
   * OSD servers bind on TCP ports 6800+, which should not be blocked by a
     firewall
   * OSD data are stored in /rados/osd.$id (should be created and mounted if
     needed)
   * /rados/osd.$id can be either a directory on the rootfs, or a separate
     partition, on a dedicated fast disk (recommended)

     The upstream recommended filesystem is btrfs. btrfs will use the parallel
     mode for OSD journaling.

     Alternatively, ext4 can be used. ext4 will use the writeahead mode for OSD
     journaling. ext4 itself can also use an external journal device
     (preferably a fast, e.g. SSD, disk). In that case, the filesystem can be
     mounted with the data=journal,commit=9999,noatime,nodiratime options, to
     improve performance (unverified):

       mkfs.ext4 /dev/sdyy
       mke2fs -O journal_dev /dev/sdxx
       tune2fs -O ^has_journal /dev/sdyy
       tune2fs -o journal_data -j -J device=/dev/sdxx /dev/sdyy
       mount /dev/sdyy /rados/osd.$id -o noatime,nodiratime,data=journal,commit=9999

   * The OSD journal can be either on a raw block device, a separate partition,
     or a file.

     A fast disk (SSD) is recommended as a journal device.

     If a file is used, the journal size must also be specified in the
     configuration.

   * Ceph configuration section for OSDs:

       [osd]
           osd data = /rados/osd.$id
           osd journal = /dev/sdzz
           ;if a file is used as a journal
           ;osd journal size = N (in MB)

       [osd.0]
           ;host and rack directives are used to generate a CRUSH map for PG
           ;placement
           host = [hostname]
           rack = [rack]

           ;public addr is the one the clients will use to contact the osd
           public_addr = [public ip]
           ;cluster addr is the one used for osd-to-osd replication/peering etc
           cluster_addr = [cluster ip]

       [osd.1]
       ...

   * Debug options which can be included in the osd configuration:

       [osd]
           ;show OSD messaging traffic
           debug ms = 1
           ;show OSD debug information
           debug osd = 20
           ;show OSD journal debug information
           debug journal = 20
           ;show filestore debug information
           debug filestore = 20
           ;show monitor client debug information
           debug monc = 20

2. Clients:
   * The client configuration only needs the monitor servers' addresses
   * Configuration section for clients:

       [mon.a]
           mon addr = [ip]:6789
       [mon.b]
           mon addr = [ip]:6789
       [mon.c]
           mon addr = [ip]:6789

   * Debug options which can be included in the client configuration:

           ;show client messaging traffic
           debug ms = 1
           ;show RADOS debug information
           debug rados = 20
           ;show objecter debug information
           debug objecter = 20
           ;show filer debug information
           debug filer = 20
           ;show objectcacher debug information
           debug objectcacher = 20

3. Tips:
   * Mount all the filesystems with the noatime,nodiratime options
   * Even without any debug options, RADOS generates lots of logs. Make sure
     the log files are on a fast disk, with little I/O traffic, and that the
     partition is mounted with noatime.

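Putting the pieces above together, a complete /etc/ceph/ceph.conf for this
sample cluster could look roughly like the following sketch; all hostnames,
IP addresses and devices are placeholders and must be adapted::

  [mon]
      mon data = /rados/mon.$id

  [mon.a]
      host = mon-host-a
      mon addr = 10.0.0.1:6789
  [mon.b]
      host = mon-host-b
      mon addr = 10.0.0.2:6789
  [mon.c]
      host = mon-host-c
      mon addr = 10.0.0.3:6789

  [osd]
      osd data = /rados/osd.$id
      osd journal = /dev/sdc1

  [osd.0]
      host = osd-host-0
      rack = rack-1
      public_addr = 10.0.0.10
      cluster_addr = 192.168.0.10

  [osd.1]
      host = osd-host-1
      rack = rack-2
      public_addr = 10.0.0.11
      cluster_addr = 192.168.0.11
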
Installation Process
--------------------

This section describes the installation process of the various software
components in a RADOS cluster.

0. Add the Ceph Debian repository in /etc/apt/sources.list on every node (mon,
   osd, clients)::

     deb http://ceph.newdream.net/debian/ squeeze main
     deb-src http://ceph.newdream.net/debian/ squeeze main

1. Monitor and OSD servers:
   * Install the ceph package
   * Upgrade to an up-to-date kernel (>=3.x)
   * Edit /etc/ceph/ceph.conf to include the mon and osd configuration
     sections, shown previously.
   * Create the corresponding dirs in /rados (mon.$id and osd.$id)
   * (optionally) Format and mount the osd.$id partition in /rados/osd.$id
   * Make sure the journal device specified in the conf exists.
   * (optionally) Make sure everything is mounted with the noatime,nodiratime
     options
   * Make sure monitor and osd servers can freely ssh to each other, using only
     hostnames.
   * Create the object store:
       mkcephfs -a -c /etc/ceph/ceph.conf
   * Start the servers:
       service ceph -a start
   * Verify that the object store is healthy and running:
       ceph health
       ceph -s

2. Clients:
   * Install the ceph-common package
   * Upgrade to an up-to-date kernel (>=3.x)
   * Install linux-headers for the new kernel
   * Check out the latest ceph-client git repo:
       git clone git://github.com/NewDreamNetwork/ceph-client.git
   * Copy the necessary ceph header files to linux-headers:
       cp -r ceph-client/include/linux/ceph/* /usr/src/linux-$(uname -r)/include/linux/ceph/
   * Build the modules:
       cd ~/ceph-client/net/ceph/
       make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) libceph.ko
       cp Module.symvers ../../drivers/block/
       cd ~/ceph-client/drivers/block/
       make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) rbd.ko
   * Optionally, copy rbd.ko and libceph.ko to /lib/modules/
   * Load the modules:
       modprobe rbd

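After the modules are loaded, and provided the client's /etc/ceph/ceph.conf
lists the monitor addresses as described above, a quick sanity check that the
client can reach the object store (a hedged sketch)::

  lsmod | grep -E 'rbd|libceph'   # both modules should be listed
  rados lspools                   # should show at least: data, metadata, rbd
  rados df                        # per-pool usage statistics
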
Administration Notes
--------------------

This section includes some notes on RADOS cluster administration.

0. Starting / Stopping servers:
   * service ceph -a start/stop (affects all the servers in the cluster)
   * service ceph start/stop osd (affects only the osds on the current node)
   * service ceph start/stop mon (affects only the mons on the current node)
   * service ceph start/stop osd.$id/mon.$id (affects only the specified daemon)

   * service ceph cleanlogs/cleanalllogs

1. Stop the cluster cleanly:
     ceph stop

2. Increase the replication level for a given pool:
     ceph osd pool set $poolname size $size

   Note that when increasing the replication level, the replication overhead
   will impact performance.

3. Adjust the number of placement groups per pool:
     ceph osd pool set $poolname pg_num $num

   The default number of PGs per pool is determined by the number of OSDs in
   the cluster, and the replication level of the pool (for 4 OSDs and
   replication size 2, the default value is 8). The default pools
   (data, metadata, rbd) are assigned 256 PGs.

   After the splitting is complete, the number of PGs used for placement
   (pgp_num) must also be changed. Warning: this is not considered safe on PGs
   in use (with objects), and should be changed only when the pool is created,
   and before it is used:
     ceph osd pool set $poolname pgp_num $num

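   As an end-to-end example, a new pool can be created and tuned before any
   objects are written to it (a hedged sketch; the pool name and values are
   arbitrary)::

     rados mkpool mypool
     ceph osd pool set mypool size 3        # replication level
     ceph osd pool set mypool pg_num 128    # number of placement groups
     ceph osd pool set mypool pgp_num 128   # PGs used for placement
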
4. Replacing the journal for osd.$id:
   Edit the osd.$id journal configuration section, then:
     ceph-osd -i osd.$id --mkjournal
     ceph-osd -i osd.$id --osd-journal /path/to/journal

5. Add a new OSD:
   Edit /etc/ceph/ceph.conf to include the new OSD, then:
     ceph mon getmap -o /tmp/monmap
     ceph-osd --mkfs -i osd.$id --monmap /tmp/monmap
     ceph osd setmaxosd [maxosd+1] (use ceph osd getmaxosd to get the current number of OSDs, if needed)
     service ceph start osd.$id

   Generate the CRUSH map to include the new osd in PGs:
     osdmaptool --createsimple [maxosd] --clobber /tmp/osdmap --export-crush /tmp/crush
     ceph osd setcrushmap -i /tmp/crush
   Or edit the CRUSH map by hand:
     ceph osd getcrushmap -o /tmp/crush
     crushtool -d /tmp/crush -o crushmap
     vim crushmap
     crushtool -c crushmap -o /tmp/crush
     ceph osd setcrushmap -i /tmp/crush

6. General ceph tool commands:
   * ceph mon stat (stat mon servers)
   * ceph mon getmap (get the monmap, use monmaptool to edit)
   * ceph osd dump (dump osdmap -> pool info, osd info)
   * ceph osd getmap (get the osdmap -> use osdmaptool to edit)
   * ceph osd lspools
   * ceph osd stat (stat osd servers)
   * ceph osd tree (osd server info)
   * ceph pg dump/stat (show info about PGs)

7. rados userspace tool:

   The rados userspace tool (included in the ceph-common package) uses librados
   to communicate with the object store.

   * rados mkpool [pool]
   * rados rmpool [pool]
   * rados df (show usage per pool)
   * rados lspools (list pools)
   * rados ls -p [pool] (list objects in [pool])
   * rados bench [secs] write|seq -t [concurrent operations]
   * rados import/export <pool> <dir> (import/export a local directory in a rados pool)

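   A short usage sketch (the pool, object and file names are arbitrary
   examples)::

     rados lspools
     rados mkpool backups
     rados -p backups put etc-hosts /etc/hosts
     rados -p backups ls
     rados df                                # per-pool usage, now including 'backups'
     rados -p backups bench 30 write -t 16   # simple 30-second write benchmark
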
8. rbd userspace tool:

   The rbd userspace tool (included in the ceph-common package) uses librbd and
   librados to communicate with the object store.

   * rbd ls -p [pool] (list RBD images in [pool], default pool = rbd)
   * rbd info [image] -p [pool]
   * rbd create [image] --size n (in MB)
   * rbd rm [image]
   * rbd export [image] [path] / rbd import [path] [image]
   * rbd cp/mv [image] [dest]
   * rbd resize [image] --size n (in MB)
   * rbd map [image] (map an RBD image to a block device using the in-kernel RBD driver)
   * rbd unmap /dev/rbdX (unmap an RBD device)
   * rbd showmapped

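   A short usage sketch (the image name and sizes are arbitrary examples)::

     rbd create myimage --size 10240      # 10 GB image in the default 'rbd' pool
     rbd ls -p rbd
     rbd info myimage -p rbd
     rbd resize myimage --size 20480
     rbd rm myimage
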
9. In-kernel RBD driver:

   The in-kernel RBD driver can be used to map and unmap RBD images as block
   devices. Once mapped, they will appear as /dev/rbdX, and a symlink will be
   created in /dev/rbd/[poolname]/[imagename]:[bdev id].

   It also exports a sysfs interface, under /sys/bus/rbd/, which can be used to
   add / remove / list devices, although the rbd map/unmap/showmapped commands
   are preferred.

   The RBD module depends on the net/ceph/libceph module, which implements the
   communication with the object store in the kernel.

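   A hedged sketch of mapping an image and using it as a regular block device
   (the image name, device number and mount point are assumptions)::

     rbd create myimage --size 10240
     rbd map myimage          # the image appears as /dev/rbdX
     rbd showmapped           # find out which /dev/rbdX was assigned
     mkfs.ext4 /dev/rbd0      # assuming the image was mapped to rbd0
     mount /dev/rbd0 /mnt
     umount /mnt
     rbd unmap /dev/rbd0
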
10. Qemu-RBD driver:

    The Qemu-RBD driver can be used directly by Qemu-KVM to access RBD images
    as block devices inside VMs. It currently supports a feature not present in
    the in-kernel RBD driver (writeback_window).

    It can be configured via libvirt, and the configuration looks like this:

    .. code-block:: xml

       <disk type='network' device='disk'>
         <driver name='qemu' type='raw'/>
         <source protocol='rbd' name='[pool]/[image]:rbd_writeback_window=8000000'/>
         <target dev='vda' bus='virtio'/>
       </disk>

    Note: this requires an up-to-date version of libvirt, plus a Qemu/KVM
    version with RBD support, which is not included in Debian.

11. Logging and Debugging:

    For the command-line tools (ceph, rados, rbd), you can specify debug
    options in the form --debug-[component]=n, which will override the options
    in the config file. In order to get any output when using the CLI debug
    options, you must also use --log-to-stderr:

      rados ls -p rbd --log-to-stderr --debug-ms=1 --debug-rados=20

    Ceph log files are located in /var/log/ceph/mon.$id and
    /var/log/ceph/osd.$id.