Storage guide
=============

Instructions for RADOS cluster deployment and administration

This document describes the basic steps to obtain a working RADOS cluster /
object store installation, to be used as a storage backend for synnefo, and
provides information about its administration.

It begins by providing general information on the RADOS object store, describing
the different nodes in a RADOS cluster, and then moves on to the installation and
setup of the distinct software components. Finally, it provides some basic
information about cluster administration and debugging.

RADOS is the object storage component of the Ceph project
(http://ceph.newdream.net). For more documentation, see the official wiki
(http://ceph.newdream.net/wiki) and the official documentation
(http://ceph.newdream.net/docs). Usage information for the userspace tools used
to administer the cluster is also available in the respective manpages.


RADOS Intro
-----------
RADOS is the object storage component of Ceph.

An object, in this context, means a named entity that has:

 * name: a sequence of bytes, unique within its container, that is used to locate
   and access the object
 * content: a sequence of bytes
 * metadata: a mapping from keys to values

RADOS takes care of distributing the objects across the whole storage cluster
and replicating them for fault tolerance.
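
For illustration, this is roughly how a single object can be stored, inspected
and retrieved with the rados userspace tool (described in more detail later).
The object name and input file are hypothetical; the 'data' pool is one of the
default pools:

    # store a local file as an object named 'hello' in the 'data' pool
    rados -p data put hello /tmp/hello.txt
    # show the object's size and modification time
    rados -p data stat hello
    # read the object back into a local file
    rados -p data get hello /tmp/hello.out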


Node types
----------

Nodes in a RADOS deployment belong in one of the following types:

 * Monitor:
   Lightweight daemon (ceph-mon) that provides a consensus for distributed
   decision-making in a Ceph/RADOS cluster. It is also the initial point of
   contact for new clients, and will hand out information about the topology of
   the cluster, such as the osdmap.

   You normally run 3 ceph-mon daemons, on 3 separate physical machines,
   isolated from each other; for example, in different racks or rows. You could
   run just 1 instance, but that means giving up on high availability.

   Any decision requires the majority of the ceph-mon processes to be healthy
   and communicating with each other. For this reason, you never want an even
   number of ceph-mons; there is no unambiguous majority subgroup for an even
   number.

 * OSD:
   Storage daemon (ceph-osd) that provides the RADOS service. It uses the
   monitor servers for cluster membership, services object read/write/etc.
   requests from clients, and peers with other ceph-osds for data replication.

   The data model is fairly simple on this level. There are multiple named
   pools, and within each pool there are named objects, in a flat namespace (no
   directories). Each object has both data and metadata.

   By default, three pools are created (data, metadata, rbd).

   The data for an object is a single, potentially big, series of bytes.
   Additionally, the series may be sparse: it may have holes that contain binary
   zeros and take up no actual storage.

   The metadata is an unordered set of key-value pairs. Its semantics are
   completely up to the client.

   Multiple OSDs can run on one node, one for each disk included in the object
   store. This might impose a performance overhead, due to peering/replication.
   Alternatively, disks can be pooled together (either with RAID or with btrfs),
   requiring only one OSD to manage the pool.

   In the case of multiple OSDs, care must be taken to generate a CRUSH map
   which doesn't replicate objects across OSDs on the same host (see the next
   section).

 * Clients:
   Clients can access the RADOS cluster either directly, at object granularity,
   by using librados and the rados userspace tool, or by using librbd and the
   rbd tool, which create an image / volume abstraction over the object store.

   RBD images are striped over the object store daemons, to provide higher
   throughput, and can be accessed either via the in-kernel Rados Block Device
   (RBD) driver, which maps RBD images to block devices, or directly via Qemu
   and the Qemu-RBD driver.


Replication and Fault tolerance
-------------------------------

The objects in each pool are partitioned into a (per-pool configurable) number
of placement groups (PGs), and each placement group is mapped to a number of
OSDs, according to the (per-pool configurable) replication level and a
(per-pool configurable) CRUSH map, which defines how objects are replicated
across OSDs.

The CRUSH map is generated with hints from the config file (e.g. hostnames,
racks etc.), so that the objects are replicated across OSDs in different
'failure domains'. However, in order to be on the safe side, the CRUSH map
should be examined to verify that, for example, PGs are not replicated across
OSDs on the same host, and corrected if needed (see the Admin section).

Information about objects, pools, and PGs is included in the osdmap, which
the clients fetch initially from the monitor servers. Using the osdmap,
clients learn which OSD is the primary for each PG, and therefore know which
OSD to contact when they want to interact with a specific object.
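
As a quick way to see this mapping in practice, the ceph tool (see the
Administration Notes below) can dump the osdmap and the PG state; the exact
output format varies between Ceph versions:

    # show pools, replication levels and OSD state from the osdmap
    ceph osd dump
    # show per-PG state, including which OSDs serve each PG
    ceph pg dump
    # look up which PG / OSDs a specific (hypothetical) object maps to
    ceph osd map data hello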

More information about the internals of the replication / fault tolerance /
peering inside the RADOS cluster can be found in the original RADOS paper
(http://dl.acm.org/citation.cfm?id=1374606).


Journaling
----------

The OSD maintains a journal to help keep all on-disk data in a consistent state
while still keeping write latency low. That is, each OSD normally has a back-end
file system (ideally btrfs) and a journal device or file.

When the journal is enabled, all writes are written both to the journal and to
the file system. This is somewhat similar to ext3's data=journal mode, with a
few differences. There are two basic journaling modes:

 * In writeahead mode, every write transaction is written first to the journal.
   Once that is safely on disk, we can ack the write and then apply it to the
   back-end file system. This will work with any file system (with a few
   caveats).

 * In parallel mode, every write transaction is written to the journal and the
   file system in parallel. The write is acked when either one safely commits
   (usually the journal). This will only work on btrfs, as it relies on
   btrfs-specific snapshot ioctls to roll back to a consistent state before
   replaying the journal.
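
A minimal sketch of the journal-related settings in ceph.conf, assuming the
option names used by the Ceph version this guide targets (verify against your
release; the full OSD section is shown later in this document):

    [osd]
        ;journal device (or file) used by each OSD
        osd journal = /dev/sdzz
        ;required if the journal is a file
        ;osd journal size = 1000
        ;force writeahead journaling even on btrfs (parallel mode is the btrfs
        ;default; writeahead is used automatically on other filesystems)
        ;filestore journal writeahead = true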


Authentication
--------------

Ceph supports cephx secure authentication between the nodes, which makes your
cluster more secure. There are some issues with cephx authentication,
especially with clients (Qemu-RBD), and it complicates the cluster deployment.
Future revisions of this document will include documentation on setting up
fine-grained cephx authentication across the cluster.
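
Until then, a minimal (unverified) sketch of globally enabling cephx in
ceph.conf is shown below; the option name and keyring location may differ
between Ceph versions, so check the documentation of your release:

    [global]
        ;enable the cephx authentication protocol cluster-wide
        auth supported = cephx
        ;keyring location (assumed path) used by daemons and clients
        keyring = /etc/ceph/keyring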


RADOS Cluster design and configuration
--------------------------------------

This section proposes and describes a sample cluster configuration.

0. Monitor servers:
    * 3 mon servers on separate 'failure domains' (e.g. racks)
    * Monitor servers are named mon.a, mon.b, mon.c respectively
    * Monitor data is stored in /rados/mon.$id (should be created)
    * Monitor servers bind on TCP port 6789, which should not be blocked by a
      firewall
    * Ceph configuration section for monitors:
        [mon]
            mon data = /rados/mon.$id

        [mon.a]
            host = [hostname]
            mon addr = [ip]:6789
        [mon.b]
            host = [hostname]
            mon addr = [ip]:6789
        [mon.c]
            host = [hostname]
            mon addr = [ip]:6789

    * Debugging options which can be included in the monitor configuration:
        [mon]
            ;show monitor messaging traffic
            debug ms = 1
            ;show monitor debug messages
            debug mon = 20
            ;show Paxos debug messages (consensus protocol)
            debug paxos = 20

1. OSD servers:
    * A numeric id is used to name the OSDs (osd.0, osd.1, ..., osd.n)
    * OSD servers bind on TCP ports 6800 and above, which should not be blocked
      by a firewall
    * OSD data is stored in /rados/osd.$id (should be created and mounted if
      needed)
    * /rados/osd.$id can be either a directory on the rootfs, or a separate
      partition on a dedicated fast disk (recommended)

      The upstream recommended filesystem is btrfs. btrfs will use the parallel
      mode for OSD journaling.

      Alternatively, ext4 can be used. ext4 will use the writeahead mode for OSD
      journaling. ext4 itself can also use an external journal device
      (preferably a fast disk, e.g. an SSD). In that case, the filesystem can be
      mounted with the data=journal,commit=9999,noatime,nodiratime options,
      which should improve performance (this claim has not been verified):

        # create the data filesystem and the external journal device
        mkfs.ext4 /dev/sdyy
        mke2fs -O journal_dev /dev/sdxx
        # drop the internal journal and attach the external one
        tune2fs -O ^has_journal /dev/sdyy
        tune2fs -o journal_data -j -J device=/dev/sdxx /dev/sdyy
        mount /dev/sdyy /rados/osd.$id -o noatime,nodiratime,data=journal,commit=9999

    * The OSD journal can be either on a raw block device, a separate partition,
      or a file.

      A fast disk (SSD) is recommended as a journal device.

      If a file is used, the journal size must also be specified in the
      configuration.

    * Ceph configuration section for OSDs:
        [osd]
            osd data = /rados/osd.$id
            osd journal = /dev/sdzz
            ;if a file is used as a journal
            ;osd journal size = N (in MB)

        [osd.0]
            ;host and rack directives are used to generate a CRUSH map for PG
            ;placement
            host = [hostname]
            rack = [rack]

            ;public addr is the one the clients will use to contact the osd
            public_addr = [public ip]
            ;cluster addr is the one used for osd-to-osd replication/peering etc.
            cluster_addr = [cluster ip]

        [osd.1]
            ...

    * Debug options which can be included in the osd configuration:
        [osd]
            ;show OSD messaging traffic
            debug ms = 1
            ;show OSD debug information
            debug osd = 20
            ;show OSD journal debug information
            debug journal = 20
            ;show filestore debug information
            debug filestore = 20
            ;show monitor client debug information
            debug monc = 20

2. Clients:
    * Client configuration only needs the monitor servers' addresses
    * Configuration section for clients:
        [mon.a]
            mon addr = [ip]:6789
        [mon.b]
            mon addr = [ip]:6789
        [mon.c]
            mon addr = [ip]:6789
    * Debug options which can be included in the client configuration:
            ;show client messaging traffic
            debug ms = 1
            ;show RADOS debug information
            debug rados = 20
            ;show objecter debug information
            debug objecter = 20
            ;show filer debug information
            debug filer = 20
            ;show objectcacher debug information
            debug objectcacher = 20

3. Tips:
    * Mount all the filesystems with the noatime,nodiratime options
    * Even without any debug options, RADOS generates lots of logs. Make sure
      the log files are on a fast disk with little I/O traffic, and that the
      partition is mounted with noatime.
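
    As a hedged example of the mount options mentioned above, an /etc/fstab
    entry for an OSD data partition could look roughly like this (device name,
    mount point and filesystem are assumptions; adjust to your layout):

        # <device>   <mount point>   <fs>   <options>            <dump> <pass>
        /dev/sdb1    /rados/osd.0    ext4   noatime,nodiratime   0      2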


Installation Process
--------------------

This section describes the installation process of the various software
components in a RADOS cluster.

0. Add the Ceph Debian repository in /etc/apt/sources.list on every node (mon,
   osd, clients)::

     deb http://ceph.newdream.net/debian/ squeeze main
     deb-src http://ceph.newdream.net/debian/ squeeze main
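
   After adding the repository, the packages referenced in the following steps
   can be installed with apt; a minimal sketch (package names as used in this
   guide):

     apt-get update
     # on monitor and OSD servers
     apt-get install ceph
     # on client machines
     apt-get install ceph-common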

1. Monitor and OSD servers:
    * Install the ceph package
    * Upgrade to an up-to-date kernel (>=3.x)
    * Edit /etc/ceph/ceph.conf to include the mon and osd configuration
      sections shown previously.
    * Create the corresponding dirs in /rados (mon.$id and osd.$id)
    * (optionally) Format and mount the osd.$id partition in /rados/osd.$id
    * Make sure the journal device specified in the conf exists.
    * (optionally) Make sure everything is mounted with the noatime,nodiratime
      options
    * Make sure monitor and osd servers can freely ssh to each other, using only
      hostnames.
    * Create the object store:
        mkcephfs -a -c /etc/ceph/ceph.conf
    * Start the servers:
        service ceph -a start
    * Verify that the object store is healthy and running:
        ceph health
        ceph -s

2. Clients:
    * Install the ceph-common package
    * Upgrade to an up-to-date kernel (>=3.x)
    * Install linux-headers for the new kernel
    * Check out the latest ceph-client git repo:
        git clone git://github.com/NewDreamNetwork/ceph-client.git
    * Copy the necessary ceph header files to linux-headers:
        cp -r ceph-client/include/linux/ceph/* /usr/src/linux-$(uname -r)/include/linux/ceph/
    * Build the modules:
        cd ~/ceph-client/net/ceph/
        make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) libceph.ko
        cp Module.symvers ../../drivers/block/
        cd ~/ceph-client/drivers/block/
        make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) rbd.ko
    * Optionally, copy rbd.ko and libceph.ko to /lib/modules/
    * Load the modules:
        modprobe rbd
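
   Optionally, verify that the modules were loaded; a quick check (the sysfs
   path is the one mentioned in the Administration Notes below):

        lsmod | grep -E 'rbd|libceph'
        ls /sys/bus/rbd/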


Administration Notes
--------------------

This section includes some notes on RADOS cluster administration.

0. Starting / Stopping servers:
    * service ceph -a start/stop (affects all the servers in the cluster)
    * service ceph start/stop osd (affects only the osds on the current node)
    * service ceph start/stop mon (affects only the mons on the current node)
    * service ceph start/stop osd.$id/mon.$id (affects only the specified daemon)

    * service ceph cleanlogs/cleanalllogs

1. Stop the cluster cleanly:
    ceph stop

2. Increase the replication level for a given pool:
    ceph osd pool set $poolname size $size

   Note that when increasing the replication level, the replication overhead
   will impact performance.

3. Adjust the number of placement groups per pool:
    ceph osd pool set $poolname pg_num $num

   The default number of PGs per pool is determined by the number of OSDs in the
   cluster and the replication level of the pool (for 4 OSDs and replication
   size 2, the default value is 8). The default pools (data, metadata, rbd) are
   assigned 256 PGs.

   After the splitting is complete, the number of PGs used for placement
   (pgp_num) must also be changed. Warning: this is not considered safe on PGs
   in use (with objects), and should be changed only when a pool is created and
   before it is used:
    ceph osd pool set $poolname pgp_num $num
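
   Putting these together, a hedged example of creating a new pool (with a
   hypothetical name) and adjusting its placement groups before any objects are
   written to it:

        # create the pool (a pg_num argument may also be accepted, depending on version)
        ceph osd pool create mypool
        # set the number of PGs and the matching placement number while the pool is empty
        ceph osd pool set mypool pg_num 128
        ceph osd pool set mypool pgp_num 128
        # confirm the pool settings in the osdmap
        ceph osd dump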

4. Replacing the journal for osd.$id:
    Edit the osd.$id journal configuration section, then:
    ceph-osd -i osd.$id --mkjournal
    ceph-osd -i osd.$id --osd-journal /path/to/journal

5. Add a new OSD:
    Edit /etc/ceph/ceph.conf to include the new OSD, then:
    ceph mon getmap -o /tmp/monmap
    ceph-osd --mkfs -i osd.$id --monmap /tmp/monmap
    ceph osd setmaxosd [maxosd+1] (ceph osd getmaxosd to get the number of OSDs if needed)
    service ceph start osd.$id

    Generate the CRUSH map to include the new osd in PGs:
        osdmaptool --createsimple [maxosd] --clobber /tmp/osdmap --export-crush /tmp/crush
        ceph osd setcrushmap -i /tmp/crush
    Or edit the CRUSH map by hand:
        ceph osd getcrushmap -o /tmp/crush
        crushtool -d /tmp/crush -o crushmap
        vim crushmap
        crushtool -c crushmap -o /tmp/crush
        ceph osd setcrushmap -i /tmp/crush
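
   When editing the decompiled CRUSH map by hand, the part that controls how
   replicas are spread is the rule section. As a rough sketch (bucket and rule
   names are assumptions; the exact syntax depends on the Ceph version), a rule
   that places each replica on a different host looks like this:

        rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take root
            step chooseleaf firstn 0 type host
            step emit
        }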

6. General ceph tool commands:
    * ceph mon stat (stat mon servers)
    * ceph mon getmap (get the monmap; use monmaptool to edit)
    * ceph osd dump (dump the osdmap -> pool info, osd info)
    * ceph osd getmap (get the osdmap -> use osdmaptool to edit)
    * ceph osd lspools
    * ceph osd stat (stat osd servers)
    * ceph osd tree (osd server info)
    * ceph pg dump/stat (show info about PGs)

7. rados userspace tool:

   The rados userspace tool (included in the ceph-common package) uses librados
   to communicate with the object store.

    * rados mkpool [pool]
    * rados rmpool [pool]
    * rados df (show usage per pool)
    * rados lspools (list pools)
    * rados ls -p [pool] (list objects in [pool])
    * rados bench [secs] write|seq -t [concurrent operations]
    * rados import/export <pool> <dir> (import/export a local directory into/from a rados pool)
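
   A short, hedged example combining these commands to create a scratch pool
   (hypothetical name), benchmark it and clean it up:

        rados mkpool benchpool
        # 60-second write benchmark with 16 concurrent operations
        rados bench 60 write -t 16 -p benchpool
        # show per-pool usage
        rados df
        rados rmpool benchpool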

8. rbd userspace tool:

   The rbd userspace tool (included in the ceph-common package) uses librbd and
   librados to communicate with the object store.

    * rbd ls -p [pool] (list RBD images in [pool], default pool = rbd)
    * rbd info [image] -p [pool]
    * rbd create [image] --size n (in MB)
    * rbd rm [image]
    * rbd export [image] [file] / rbd import [file] [image]
    * rbd cp/mv [image] [dest]
    * rbd resize [image]
    * rbd map [image] (map an RBD image to a block device using the in-kernel RBD driver)
    * rbd unmap /dev/rbdX (unmap an RBD device)
    * rbd showmapped
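
   A hedged end-to-end example (image name and size are hypothetical; the block
   device node depends on the mapping order, /dev/rbd0 being typical for the
   first mapped image):

        # create a 1 GB image in the default 'rbd' pool
        rbd create testimage --size 1024
        rbd info testimage
        # map it with the in-kernel driver and put a filesystem on it
        rbd map testimage
        rbd showmapped
        mkfs.ext4 /dev/rbd0
        # clean up
        rbd unmap /dev/rbd0
        rbd rm testimage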

9. In-kernel RBD driver:

   The in-kernel RBD driver can be used to map and unmap RBD images as block
   devices. Once mapped, they will appear as /dev/rbdX, and a symlink will be
   created in /dev/rbd/[poolname]/[imagename]:[bdev id].

   It also exports a sysfs interface under /sys/bus/rbd/, which can be used to
   add / remove / list devices, although the rbd map/unmap/showmapped commands
   are preferred.

   The RBD module depends on the net/ceph/libceph module, which implements the
   communication with the object store in the kernel.

10. Qemu-RBD driver:

    The Qemu-RBD driver can be used directly by Qemu-KVM to access RBD images as
    block devices inside VMs. It currently supports a feature not present in the
    in-kernel RBD driver (writeback_window).

    It can be configured via libvirt, and the configuration looks like this:

    .. code-block:: xml

        <disk type='network' device='disk'>
          <driver name='qemu' type='raw'/>
          <source protocol='rbd' name='[pool]/[image]:rbd_writeback_window=8000000'/>
          <target dev='vda' bus='virtio'/>
        </disk>

    Note: this requires an up-to-date version of libvirt, plus a Qemu/KVM
    version which is not included in Debian.
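
    As a usage sketch, the XML above can be saved to a file and attached to an
    existing domain with virsh (domain and file names are hypothetical, and
    cephx authentication is assumed to be disabled):

        virsh attach-device [domain] rbd-disk.xml --persistent

    Alternatively, the <disk> element can be placed directly in the domain's
    libvirt XML definition.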

11. Logging and Debugging:

    For the command-line tools (ceph, rados, rbd), you can specify debug options
    in the form --debug-[component]=n, which will override the options in the
    config file. In order to get any output when using the CLI debug options,
    you must also use --log-to-stderr:

        rados ls -p rbd --log-to-stderr --debug-ms=1 --debug-rados=20

    Ceph log files are located in /var/log/ceph/mon.$id and
    /var/log/ceph/osd.$id.