README.storage -- Instructions for RADOS cluster deployment and administration

This document describes the basic steps to obtain a working RADOS cluster /
object store installation, to be used as a storage backend for synnefo, and
provides information about its administration.

It begins by providing general information on the RADOS object store,
describing the different nodes in a RADOS cluster, and then moves to the
installation and setup of the distinct software components. Finally, it
provides some basic information about cluster administration and debugging.

RADOS is the object storage component of the Ceph project
(http://ceph.newdream.net). For more documentation, see the official wiki
(http://ceph.newdream.net/wiki) and the official documentation
(http://ceph.newdream.net/docs). Usage information for the userspace tools
used to administer the cluster is also available in the respective manpages.


RADOS Intro
===========

RADOS is the object storage component of Ceph.

An object, in this context, means a named entity that has:

* a name: a sequence of bytes, unique within its container, that is used to
  locate and access the object
* content: a sequence of bytes
* metadata: a mapping from keys to values

RADOS takes care of distributing the objects across the whole storage cluster
and replicating them for fault tolerance.
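
As a mental model, the three parts of an object can be sketched in a few
lines of Python. This is purely illustrative; it is not a Ceph API, and all
names here are made up:

```python
from dataclasses import dataclass, field

@dataclass
class RadosObject:
    """Toy model of a RADOS object: name, content, metadata."""
    name: bytes                                    # unique within its pool
    content: bytes = b""                           # a single series of bytes
    metadata: dict = field(default_factory=dict)   # opaque key-value pairs

# A pool is a flat namespace (no directories): name -> object.
pool = {}
obj = RadosObject(name=b"greeting", content=b"hello",
                  metadata={"owner": "synnefo"})
pool[obj.name] = obj
```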


Node types
==========

Nodes in a RADOS deployment belong to one of the following types:

* Monitor:
  Lightweight daemon (ceph-mon) that provides a consensus for distributed
  decision-making in a Ceph/RADOS cluster. It is also the initial point of
  contact for new clients, and will hand out information about the topology
  of the cluster, such as the osdmap.

  You normally run 3 ceph-mon daemons, on 3 separate physical machines,
  isolated from each other; for example, in different racks or rows. You
  could run just 1 instance, but that means giving up on high availability.

  Any decision requires the majority of the ceph-mon processes to be healthy
  and communicating with each other. For this reason, you never want an even
  number of ceph-mons; there is no unambiguous majority subgroup for an even
  number.
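
  The majority arithmetic can be made concrete with a short sketch, which
  shows why an even monitor count buys nothing:

  ```python
  def quorum(n_monitors: int) -> int:
      """Smallest strict majority of n monitors (consensus needs > n/2)."""
      return n_monitors // 2 + 1

  def tolerated_failures(n_monitors: int) -> int:
      """How many monitors may die while a majority remains."""
      return n_monitors - quorum(n_monitors)

  # 4 monitors tolerate no more failures than 3 do:
  assert tolerated_failures(3) == 1
  assert tolerated_failures(4) == 1
  assert tolerated_failures(5) == 2
  ```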

* OSD:
  Storage daemon (ceph-osd) that provides the RADOS service. It uses the
  monitor servers for cluster membership, services object read/write/etc.
  requests from clients, and peers with other ceph-osds for data replication.

  The data model is fairly simple on this level. There are multiple named
  pools, and within each pool there are named objects, in a flat namespace
  (no directories). Each object has both data and metadata.

  By default, three pools are created (data, metadata, rbd).

  The data for an object is a single, potentially big, series of bytes.
  Additionally, the series may be sparse: it may have holes that contain
  binary zeros and take up no actual storage.

  The metadata is an unordered set of key-value pairs. Its semantics are
  completely up to the client.

  Multiple OSDs can run on one node, one for each disk included in the
  object store. This might impose a performance overhead, due to
  peering/replication. Alternatively, disks can be pooled together (either
  with RAID or with btrfs), requiring only one OSD to manage the pool.

  In the case of multiple OSDs, care must be taken to generate a CRUSH map
  which doesn't replicate objects across OSDs on the same host (see the next
  section).
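
  The property to check for is easy to state in code. The sketch below is a
  hypothetical validation helper, not a Ceph tool; it takes a PG-to-OSDs
  mapping and an OSD-to-host mapping and flags same-host replication:

  ```python
  def replicas_on_distinct_hosts(pg_to_osds, osd_host):
      """True iff no placement group keeps two replicas on one host."""
      return all(
          len({osd_host[o] for o in osds}) == len(osds)
          for osds in pg_to_osds.values()
      )

  osd_host = {0: "node1", 1: "node1", 2: "node2", 3: "node3"}
  good = {"pg0": [0, 2], "pg1": [1, 3]}   # replicas on different hosts
  bad  = {"pg0": [0, 1]}                  # both replicas on node1

  print(replicas_on_distinct_hosts(good, osd_host))  # True
  print(replicas_on_distinct_hosts(bad, osd_host))   # False
  ```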

* Clients:
  Clients can access the RADOS cluster either directly, at object
  granularity, by using librados and the rados userspace tool, or by using
  librbd and the rbd tool, which create an image / volume abstraction over
  the object store.

  RBD images are striped over the object store daemons, to provide higher
  throughput, and can be accessed either via the in-kernel Rados Block
  Device (RBD) driver, which maps RBD images to block devices, or directly
  via Qemu, and the Qemu-RBD driver.


Replication and Fault tolerance
===============================

The objects in each pool are partitioned in a (per-pool configurable) number
of placement groups (PGs), and each placement group is mapped to a number of
OSDs, according to the (per-pool configurable) replication level and a
(per-pool configurable) CRUSH map, which defines how objects are replicated
across OSDs.

The CRUSH map is generated with hints from the config file (eg hostnames,
racks etc), so that the objects are replicated across OSDs in different
'failure domains'. However, in order to be on the safe side, the CRUSH map
should be examined to verify that, for example, PGs are not replicated
across OSDs on the same host, and corrected if needed (see the Admin
section).

Information about objects, pools, and PGs is included in the osdmap, which
the clients fetch initially from the monitor servers. Using the osdmap,
clients learn which OSD is the primary for each PG, and therefore know which
OSD to contact when they want to interact with a specific object.
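
The two-step lookup (object -> PG, then PG -> OSDs) can be sketched as
follows. This is a simplification: the real code uses a stable Ceph-specific
hash and the CRUSH algorithm, while here crc32 and a round-robin pick stand
in for them:

```python
import zlib

def object_to_pg(name: bytes, pg_num: int) -> int:
    # Hash the object name and reduce it modulo the pool's pg_num.
    return zlib.crc32(name) % pg_num

def pg_to_osds(pg: int, osds: list, size: int) -> list:
    # Stand-in for CRUSH: deterministically pick `size` distinct OSDs.
    start = pg % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(size)]

pg = object_to_pg(b"myobject", pg_num=8)
acting = pg_to_osds(pg, osds=[0, 1, 2, 3], size=2)
# acting[0] plays the role of the primary OSD the client contacts.
```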

More information about the internals of the replication / fault tolerance /
peering inside the RADOS cluster can be found in the original RADOS paper
(http://dl.acm.org/citation.cfm?id=1374606).


Journaling
==========

The OSD maintains a journal to help keep all on-disk data in a consistent
state while still keeping write latency low. That is, each OSD normally has
a back-end file system (ideally btrfs) and a journal device or file.

When the journal is enabled, all writes are written both to the journal and
to the file system. This is somewhat similar to ext3's data=journal mode,
with a few differences. There are two basic journaling modes:

* In writeahead mode, every write transaction is written first to the
  journal. Once that is safely on disk, the write can be acked and then
  applied to the back-end file system. This will work with any file system
  (with a few caveats).

* In parallel mode, every write transaction is written to the journal and
  the file system in parallel. The write is acked when either one safely
  commits (usually the journal). This will only work on btrfs, as it relies
  on btrfs-specific snapshot ioctls to roll back to a consistent state
  before replaying the journal.
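
The difference in acknowledgement semantics boils down to a one-line rule
per mode, sketched conceptually below (an illustration of the text above,
not OSD code):

```python
def ack_time(journal_commit: float, fs_commit: float, mode: str) -> float:
    """When a write is acknowledged under each journaling mode."""
    if mode == "writeahead":
        # Ack once the journal is on disk; the fs apply happens afterwards.
        return journal_commit
    if mode == "parallel":
        # Ack as soon as either copy safely commits (usually the journal).
        return min(journal_commit, fs_commit)
    raise ValueError(mode)

assert ack_time(journal_commit=1, fs_commit=10, mode="writeahead") == 1
assert ack_time(journal_commit=1, fs_commit=10, mode="parallel") == 1
assert ack_time(journal_commit=5, fs_commit=3, mode="parallel") == 3
```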


Authentication
==============

Ceph supports cephx secure authentication between the nodes, which makes
your cluster more secure. There are some issues with cephx authentication,
especially with clients (Qemu-RBD), and it complicates the cluster
deployment. Future revisions of this document will include documentation on
setting up fine-grained cephx authentication across the cluster.


RADOS Cluster design and configuration
======================================

This section proposes and describes a sample cluster configuration.

0. Monitor servers:
   * 3 mon servers on separate 'failure domains' (eg rack)
   * Monitor servers are named mon.a, mon.b, mon.c respectively
   * Monitor data stored in /rados/mon.$id (should be created)
   * Monitor servers bind on TCP port 6789, which should not be blocked by
     a firewall
   * Ceph configuration section for monitors:
        [mon]
            mon data = /rados/mon.$id

        [mon.a]
            host = [hostname]
            mon addr = [ip]:6789
        [mon.b]
            host = [hostname]
            mon addr = [ip]:6789
        [mon.c]
            host = [hostname]
            mon addr = [ip]:6789

   * Debugging options which can be included in the monitor configuration:
        [mon]
            ;show monitor messaging traffic
            debug ms = 1
            ;show monitor debug messages
            debug mon = 20
            ;show Paxos debug messages (consensus protocol)
            debug paxos = 20

1. OSD servers:
   * A numeric id is used to name the OSDs (osd.0, osd.1, ..., osd.n)
   * OSD servers bind on TCP ports 6800 and up, which should not be blocked
     by a firewall
   * OSD data are stored in /rados/osd.$id (should be created and mounted
     if needed)
   * /rados/osd.$id can be either a directory on the rootfs, or a separate
     partition, on a dedicated fast disk (recommended)

     The upstream-recommended filesystem is btrfs. btrfs will use the
     parallel mode for OSD journaling.

     Alternatively, ext4 can be used. ext4 will use the writeahead mode for
     OSD journaling. ext4 itself can also use an external journal device
     (preferably a fast, eg SSD, disk). In that case, the filesystem can be
     mounted with the data=journal,commit=9999,noatime,nodiratime options
     to improve performance (this claim has not been verified):

        mkfs.ext4 /dev/sdyy
        mke2fs -O journal_dev /dev/sdxx
        tune2fs -O ^has_journal /dev/sdyy
        tune2fs -o journal_data -j -J device=/dev/sdxx /dev/sdyy
        mount /dev/sdyy /rados/osd.$id -o noatime,nodiratime,data=journal,commit=9999

   * The OSD journal can be either on a raw block device, a separate
     partition, or a file.

     A fast disk (SSD) is recommended as a journal device.

     If a file is used, the journal size must also be specified in the
     configuration.
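
     When picking that size, a rule of thumb from later upstream guidance
     (not from this document, so treat it as an assumption) is that the
     journal must absorb the writes that accumulate between filestore
     syncs:

     ```python
     def journal_size_mb(throughput_mb_s: float,
                         sync_interval_s: float = 5,
                         safety: float = 2) -> int:
         """Journal size ~= safety * disk throughput * filestore sync interval."""
         return int(safety * throughput_mb_s * sync_interval_s)

     print(journal_size_mb(100))  # a 100 MB/s OSD disk -> 1000 MB journal
     ```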

   * Ceph configuration section for OSDs:
        [osd]
            osd data = /rados/osd.$id
            osd journal = /dev/sdzz
            ;if a file is used as a journal
            ;osd journal size = N (in MB)

        [osd.0]
            ;host and rack directives are used to generate a CRUSH map for
            ;PG placement
            host = [hostname]
            rack = [rack]

            ;public addr is the one the clients will use to contact the osd
            public_addr = [public ip]
            ;cluster addr is the one used for osd-to-osd replication/peering etc
            cluster_addr = [cluster ip]

        [osd.1]
        ...

   * Debug options which can be included in the osd configuration:
        [osd]
            ;show OSD messaging traffic
            debug ms = 1
            ;show OSD debug information
            debug osd = 20
            ;show OSD journal debug information
            debug journal = 20
            ;show filestore debug information
            debug filestore = 20
            ;show monitor client debug information
            debug monc = 20

2. Clients:
   * Client configuration only needs the monitor servers' addresses
   * Configuration section for clients:
        [mon.a]
            mon addr = [ip]:6789
        [mon.b]
            mon addr = [ip]:6789
        [mon.c]
            mon addr = [ip]:6789
   * Debug options which can be included in the client configuration:
            ;show client messaging traffic
            debug ms = 1
            ;show RADOS debug information
            debug rados = 20
            ;show objecter debug information
            debug objecter = 20
            ;show filer debug information
            debug filer = 20
            ;show objectcacher debug information
            debug objectcacher = 20

3. Tips:
   * Mount all the filesystems with the noatime,nodiratime options
   * Even without any debug options, RADOS generates lots of logs. Make
     sure the log files are on a fast disk with little I/O traffic, and
     that the partition is mounted with noatime.


Installation Process
====================

This section describes the installation process of the various software
components in a RADOS cluster.

0. Add the Ceph Debian repository in /etc/apt/sources.list on every node
   (mon, osd, clients):
        deb http://ceph.newdream.net/debian/ squeeze main
        deb-src http://ceph.newdream.net/debian/ squeeze main

1. Monitor and OSD servers:
   * Install the ceph package
   * Upgrade to an up-to-date kernel (>=3.x)
   * Edit /etc/ceph/ceph.conf to include the mon and osd configuration
     sections shown previously
   * Create the corresponding dirs in /rados (mon.$id and osd.$id)
   * (optionally) Format and mount the osd.$id partition in /rados/osd.$id
   * Make sure the journal device specified in the conf exists
   * (optionally) Make sure everything is mounted with the
     noatime,nodiratime options
   * Make sure monitor and osd servers can freely ssh to each other, using
     only hostnames
   * Create the object store:
        mkcephfs -a -c /etc/ceph/ceph.conf
   * Start the servers:
        service ceph -a start
   * Verify that the object store is healthy and running:
        ceph health
        ceph -s

2. Clients:
   * Install the ceph-common package
   * Upgrade to an up-to-date kernel (>=3.x)
   * Install linux-headers for the new kernel
   * Check out the latest ceph-client git repo:
        git clone git://github.com/NewDreamNetwork/ceph-client.git
   * Copy the necessary ceph header files to linux-headers:
        cp -r ceph-client/include/linux/ceph/* /usr/src/linux-$(uname -r)/include/linux/ceph/
   * Build the modules:
        cd ~/ceph-client/net/ceph/
        make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) libceph.ko
        cp Module.symvers ../../drivers/block/
        cd ~/ceph-client/drivers/block/
        make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) rbd.ko
   * Optionally, copy rbd.ko and libceph.ko to /lib/modules/
   * Load the modules:
        modprobe rbd


Administration Notes
====================

This section includes some notes on RADOS cluster administration.

0. Starting / Stopping servers:
   * service ceph -a start/stop (affects all the servers in the cluster)
   * service ceph start/stop osd (affects only the osds on the current node)
   * service ceph start/stop mon (affects only the mons on the current node)
   * service ceph start/stop osd.$id/mon.$id (affects only the specified
     daemon)

   * service ceph cleanlogs/cleanalllogs

1. Stop the cluster cleanly:
        ceph stop

2. Increase the replication level for a given pool:
        ceph osd pool set $poolname size $size

   Note that when increasing the replication level, the added replication
   overhead will impact performance.

3. Adjust the number of placement groups per pool:
        ceph osd pool set $poolname pg_num $num

   The default number of PGs per pool is determined by the number of OSDs
   in the cluster and the replication level of the pool (for 4 OSDs and
   replication size 2, the default value is 8). The default pools
   (data, metadata, rbd) are assigned 256 PGs.

   After the splitting is complete, the number of PGs used for placement
   (pgp_num) must also be changed. Warning: this is not considered safe on
   PGs in use (with objects), and should be changed only when a PG is
   created, and before it is used:
        ceph osd pool set $poolname pgp_num $num
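
   For choosing $num, a common rule of thumb (an assumption here, not part
   of this document's defaults) aims at on the order of 100 PGs per OSD,
   divided by the replication level and rounded up to a power of two:

   ```python
   def suggested_pg_num(n_osds: int, replication: int,
                        target_pgs_per_osd: int = 100) -> int:
       """Round (n_osds * target) / replication up to a power of two."""
       raw = n_osds * target_pgs_per_osd // replication
       pg_num = 1
       while pg_num < raw:
           pg_num *= 2
       return pg_num

   print(suggested_pg_num(n_osds=4, replication=2))  # 200 -> rounds to 256
   ```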

4. Replace the journal for osd.$id:
   Edit the osd.$id journal configuration section, then:
        ceph-osd -i osd.$id --mkjournal
        ceph-osd -i osd.$id --osd-journal /path/to/journal

5. Add a new OSD:
   Edit /etc/ceph/ceph.conf to include the new OSD, then:
        ceph mon getmap -o /tmp/monmap
        ceph-osd --mkfs -i osd.$id --monmap /tmp/monmap
        ceph osd setmaxosd [maxosd+1] (use ceph osd getmaxosd to get the current number of osds, if needed)
        service ceph start osd.$id

   Generate the CRUSH map to include the new osd in PGs:
        osdmaptool --createsimple [maxosd] --clobber /tmp/osdmap --export-crush /tmp/crush
        ceph osd setcrushmap -i /tmp/crush
   Or edit the CRUSH map by hand:
        ceph osd getcrushmap -o /tmp/crush
        crushtool -d /tmp/crush -o crushmap
        vim crushmap
        crushtool -c crushmap -o /tmp/crush
        ceph osd setcrushmap -i /tmp/crush

6. General ceph tool commands:
   * ceph mon stat (stat mon servers)
   * ceph mon getmap (get the monmap; use monmaptool to edit)
   * ceph osd dump (dump the osdmap -> pool info, osd info)
   * ceph osd getmap (get the osdmap -> use osdmaptool to edit)
   * ceph osd lspools
   * ceph osd stat (stat osd servers)
   * ceph osd tree (osd server info)
   * ceph pg dump/stat (show info about PGs)

7. rados userspace tool:

   The rados userspace tool (included in the ceph-common package) uses
   librados to communicate with the object store.

   * rados mkpool [pool]
   * rados rmpool [pool]
   * rados df (show usage per pool)
   * rados lspools (list pools)
   * rados ls -p [pool] (list objects in [pool])
   * rados bench [secs] write|seq -t [concurrent operations]
   * rados import/export <pool> <dir> (import/export a local directory in a rados pool)

8. rbd userspace tool:

   The rbd userspace tool (included in the ceph-common package) uses librbd
   and librados to communicate with the object store.

   * rbd ls -p [pool] (list RBD images in [pool], default pool = rbd)
   * rbd info [image] -p [pool]
   * rbd create [image] --size n (in MB)
   * rbd rm [image]
   * rbd export/import [dir] [image]
   * rbd cp/mv [image] [dest]
   * rbd resize [image]
   * rbd map [image] (map an RBD image to a block device using the in-kernel RBD driver)
   * rbd unmap /dev/rbdX (unmap an RBD device)
   * rbd showmapped

9. In-kernel RBD driver:

   The in-kernel RBD driver can be used to map and unmap RBD images as
   block devices. Once mapped, they will appear as /dev/rbdX, and a symlink
   will be created in /dev/rbd/[poolname]/[imagename]:[bdev id].

   It also exports a sysfs interface, under /sys/bus/rbd/, which can be
   used to add / remove / list devices, although the rbd
   map/unmap/showmapped commands are preferred.

   The RBD module depends on the net/ceph/libceph module, which implements
   the communication with the object store in the kernel.

10. Qemu-RBD driver:

    The Qemu-RBD driver can be used directly by Qemu-KVM to access RBD
    images as block devices inside VMs. It currently supports a feature
    not present in the in-kernel RBD driver (writeback_window).

    It can be configured via libvirt, and the configuration looks like
    this:

        <disk type='network' device='disk'>
            <driver name='qemu' type='raw'/>
            <source protocol='rbd' name='[pool]/[image]:rbd_writeback_window=8000000'/>
            <target dev='vda' bus='virtio'/>
        </disk>

    Note: it requires an up-to-date version of libvirt, plus a Qemu/KVM
    version which is not included in Debian.

11. Logging and Debugging:

    For the command-line tools (ceph, rados, rbd), you can specify debug
    options in the form --debug-[component]=n, which will override the
    options in the config file. In order to get any output when using the
    cli debug options, you must also use --log-to-stderr:

        rados ls -p rbd --log-to-stderr --debug-ms=1 --debug-rados=20

    Ceph log files are located in /var/log/ceph/mon.$id and
    /var/log/ceph/osd.$id.