README.storage -- Instructions for RADOS cluster deployment and administration

This document describes the basic steps to obtain a working RADOS cluster /
object store installation, to be used as a storage backend for synnefo, and
provides information about its administration.

It begins by providing general information on the RADOS object store, describing
the different nodes in a RADOS cluster, and then moves to the installation and
setup of the distinct software components. Finally, it provides some basic
information about cluster administration and debugging.

RADOS is the object storage component of the Ceph project
(http://ceph.newdream.net). For more documentation, see the official wiki
(http://ceph.newdream.net/wiki) and the official documentation
(http://ceph.newdream.net/docs). Usage information for the userspace tools used
to administer the cluster is also available in the respective manpages.

    
RADOS Intro
===========

RADOS is the object storage component of Ceph.

An object, in this context, means a named entity that has:

 * a name: a sequence of bytes, unique within its container, that is used to
   locate and access the object
 * content: a sequence of bytes
 * metadata: a mapping from keys to values

RADOS takes care of distributing the objects across the whole storage cluster
and replicating them for fault tolerance.
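
As a quick illustration of the object model, on a working cluster the rados
userspace tool (described later in this document) can store and retrieve named
objects in a pool; the pool and object names below are just examples:

	rados mkpool mypool
	rados -p mypool put myobject /etc/hostname
	rados -p mypool stat myobject
	rados -p mypool get myobject /tmp/myobject.out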

    
Node types
==========

Nodes in a RADOS deployment belong to one of the following types:

 * Monitor:
   Lightweight daemon (ceph-mon) that provides consensus for distributed
   decision-making in a Ceph/RADOS cluster. It is also the initial point of
   contact for new clients, and will hand out information about the topology of
   the cluster, such as the osdmap.

   You normally run 3 ceph-mon daemons, on 3 separate physical machines,
   isolated from each other; for example, in different racks or rows. You could
   run just 1 instance, but that means giving up on high availability.

   Any decision requires the majority of the ceph-mon processes to be healthy
   and communicating with each other. For this reason, you never want an even
   number of ceph-mons; there is no unambiguous majority subgroup for an even
   number.

 * OSD:
   Storage daemon (ceph-osd) that provides the RADOS service. It uses the
   monitor servers for cluster membership, services object read/write/etc.
   requests from clients, and peers with other ceph-osds for data replication.

   The data model is fairly simple on this level. There are multiple named
   pools, and within each pool there are named objects, in a flat namespace (no
   directories). Each object has both data and metadata.

   By default, three pools are created (data, metadata, rbd).

   The data for an object is a single, potentially big, series of bytes.
   Additionally, the series may be sparse: it may have holes that contain binary
   zeros and take up no actual storage.

   The metadata is an unordered set of key-value pairs. Its semantics are
   completely up to the client.

   Multiple OSDs can run on one node, one for each disk included in the object
   store. This might impose a performance overhead, due to peering/replication.
   Alternatively, disks can be pooled together (either with RAID or with btrfs),
   requiring only one OSD to manage the pool.

   In the case of multiple OSDs, care must be taken to generate a CRUSH map
   which doesn't replicate objects across OSDs on the same host (see the next
   section).

 * Clients:
   Clients can access the RADOS cluster either directly, at object granularity,
   using librados and the rados userspace tool, or through librbd and the rbd
   tool, which provide an image / volume abstraction over the object store (see
   the example below).

   RBD images are striped over the object store daemons, to provide higher
   throughput, and can be accessed either via the in-kernel Rados Block Device
   (RBD) driver, which maps RBD images to block devices, or directly via Qemu
   and the Qemu-RBD driver.
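
As an illustration of the two access paths, the following commands list the
existing pools, the objects in the rbd pool (object granularity via rados), and
the RBD images stored in the same pool (image granularity via rbd):

	rados lspools
	rados ls -p rbd
	rbd ls -p rbd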
   
Replication and Fault tolerance
===============================

The objects in each pool are partitioned into a (per-pool configurable) number
of placement groups (PGs), and each placement group is mapped to a number of
OSDs, according to the (per-pool configurable) replication level and a
(per-pool configurable) CRUSH map, which defines how objects are replicated
across OSDs.

The CRUSH map is generated with hints from the config file (eg hostnames, racks
etc), so that objects are replicated across OSDs in different 'failure
domains'. However, in order to be on the safe side, the CRUSH map should be
examined to verify that, for example, PGs are not replicated across OSDs on the
same host, and corrected if needed (see the Admin section).
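
For example, the CRUSH map currently in use can be fetched and decompiled for
inspection with the following commands (see also the Admin section):

	ceph osd getcrushmap -o /tmp/crush
	crushtool -d /tmp/crush -o /tmp/crush.txt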

    
Information about objects, pools, and PGs is included in the osdmap, which
clients fetch initially from the monitor servers. Using the osdmap, clients
learn which OSD is the primary for each PG, and therefore know which OSD to
contact when they want to interact with a specific object.
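
With a running cluster, the osdmap and the PG-to-OSD mappings can be inspected
with the ceph tool (see the Admin section), for example:

	ceph osd dump
	ceph pg dump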

    
More information about the internals of the replication / fault tolerance /
peering inside the RADOS cluster can be found in the original RADOS paper
(http://dl.acm.org/citation.cfm?id=1374606).


Journaling
==========

The OSD maintains a journal to help keep all on-disk data in a consistent state
while still keeping write latency low. That is, each OSD normally has a back-end
file system (ideally btrfs) and a journal device or file.

When the journal is enabled, all writes are written both to the journal and to
the file system. This is somewhat similar to ext3's data=journal mode, with a
few differences. There are two basic journaling modes:

 * In writeahead mode, every write transaction is written first to the journal.
   Once that is safely on disk, we can ack the write and then apply it to the
   back-end file system. This will work with any file system (with a few
   caveats).

 * In parallel mode, every write transaction is written to the journal and the
   file system in parallel. The write is acked when either one safely commits
   (usually the journal). This will only work on btrfs, as it relies on
   btrfs-specific snapshot ioctls to roll back to a consistent state before
   replaying the journal.
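
The journaling mode is normally chosen automatically, based on the capabilities
of the backing file system. For reference, a minimal journal configuration for
an OSD looks like the following sketch; the explicit mode overrides are an
assumption about the filestore options and may differ between versions:

	[osd]
		osd journal = /dev/sdzz
		;osd journal size = 512 (in MB, required when the journal is a file)
		;filestore journal writeahead = true
		;filestore journal parallel = true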

    
Authentication
==============

Ceph supports cephx secure authentication between the nodes, to make your
cluster more secure. There are some issues with cephx authentication,
especially with clients (Qemu-RBD), and it complicates the cluster deployment.
Future revisions of this document will include documentation on setting up
fine-grained cephx authentication across the cluster.
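
Until then, authentication can be toggled cluster-wide from the configuration
file. The sketch below assumes the 'auth supported' option of the Ceph versions
current at the time of writing; check the ceph.conf manpage of your version:

	[global]
		;auth supported = cephx
		auth supported = none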

    
RADOS Cluster design and configuration
======================================

This section proposes and describes a sample cluster configuration.

0. Monitor servers:
	* 3 mon servers on separate 'failure domains' (eg rack)
	* Monitor servers are named mon.a, mon.b, mon.c respectively
	* Monitor data stored in /rados/mon.$id (should be created)
	* Monitor servers bind on TCP port 6789, which should not be blocked by a
	  firewall
	* Ceph configuration section for monitors:
		[mon]
			mon data = /rados/mon.$id

		[mon.a]
			host = [hostname]
			mon addr = [ip]:6789
		[mon.b]
			host = [hostname]
			mon addr = [ip]:6789
		[mon.c]
			host = [hostname]
			mon addr = [ip]:6789

	* Debugging options which can be included in the monitor configuration:
		[mon]
			;show monitor messaging traffic
			debug ms = 1
			;show monitor debug messages
			debug mon = 20
			;show Paxos debug messages (consensus protocol)
			debug paxos = 20

1. OSD servers:
	* A numeric id is used to name the osds (osd.0, osd.1, ... , osd.n)
	* OSD servers bind on TCP ports 6800 and up, which should not be blocked by
	  a firewall
	* OSD data are stored in /rados/osd.$id (should be created and mounted if
	  needed)
	* /rados/osd.$id can be either a directory on the rootfs, or a separate
	  partition, on a dedicated fast disk (recommended)

	  The upstream recommended filesystem is btrfs. btrfs will use the parallel
	  mode for OSD journaling.

	  Alternatively, ext4 can be used. ext4 will use the writeahead mode for OSD
	  journaling. ext4 itself can also use an external journal device
	  (preferably a fast, eg SSD, disk). In that case, the filesystem can be
	  mounted with the data=journal,commit=9999,noatime,nodiratime options, to
	  improve performance (not verified):

		mkfs.ext4 /dev/sdyy
		mke2fs -O journal_dev /dev/sdxx
		tune2fs -O ^has_journal /dev/sdyy
		tune2fs -o journal_data -j -J device=/dev/sdxx /dev/sdyy
		mount /dev/sdyy /rados/osd.$id -o noatime,nodiratime,data=journal,commit=9999

	* OSD journal can be either on a raw block device, a separate partition, or
	  a file.

	  A fast disk (SSD) is recommended as a journal device.

	  If a file is used, the journal size must also be specified in the
	  configuration.

	* Ceph configuration section for OSDs:
		[osd]
			osd data = /rados/osd.$id
			osd journal = /dev/sdzz
			;if a file is used as a journal
			;osd journal size = N (in MB)

		[osd.0]
			;host and rack directives are used to generate a CRUSH map for PG
			;placement
			host = [hostname]
			rack = [rack]

			;public addr is the one the clients will use to contact the osd
			public_addr = [public ip]
			;cluster addr is the one used for osd-to-osd replication/peering etc
			cluster_addr = [cluster ip]

		[osd.1]
			...

	* Debug options which can be included in the osd configuration:
		[osd]
			;show OSD messaging traffic
			debug ms = 1
			;show OSD debug information
			debug osd = 20
			;show OSD journal debug information
			debug journal = 20
			;show filestore debug information
			debug filestore = 20
			;show monitor client debug information
			debug monc = 20

2. Clients:
	* The client configuration only needs the monitor servers' addresses
	* Configuration section for clients:
		[mon.a]
			mon addr = [ip]:6789
		[mon.b]
			mon addr = [ip]:6789
		[mon.c]
			mon addr = [ip]:6789
	* Debug options which can be included in the client configuration:
			;show client messaging traffic
			debug ms = 1
			;show RADOS debug information
			debug rados = 20
			;show objecter debug information
			debug objecter = 20
			;show filer debug information
			debug filer = 20
			;show objectcacher debug information
			debug objectcacher = 20

3. Tips:
	* Mount all the filesystems with the noatime,nodiratime options
	* Even without any debug options, RADOS generates lots of logs. Make sure
	  the log files are on a fast disk, with little I/O traffic, and that the
	  partition is mounted with noatime.
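
	  As an illustration, an /etc/fstab entry combining the noatime/nodiratime
	  tip with the ext4 journal options from the OSD section might look like
	  this (device name and mount point are examples):

		/dev/sdyy  /rados/osd.0  ext4  noatime,nodiratime,data=journal,commit=9999  0  2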

    
Installation Process
====================

This section describes the installation process of the various software
components in a RADOS cluster.

0. Add the Ceph Debian repository in /etc/apt/sources.list on every node (mon,
   osd, clients):
	deb http://ceph.newdream.net/debian/ squeeze main
	deb-src http://ceph.newdream.net/debian/ squeeze main

1. Monitor and OSD servers:
	* Install the ceph package
	* Upgrade to an up-to-date kernel (>=3.x)
	* Edit /etc/ceph/ceph.conf to include the mon and osd configuration
	  sections, shown previously.
	* Create the corresponding dirs in /rados (mon.$id and osd.$id)
	* (optionally) Format and mount the osd.$id partition in /rados/osd.$id
	* Make sure the journal device specified in the conf exists.
	* (optionally) Make sure everything is mounted with the noatime,nodiratime
	  options
	* Make sure monitor and osd servers can freely ssh to each other, using only
	  hostnames.
	* Create the object store:
		mkcephfs -a -c /etc/ceph/ceph.conf
	* Start the servers:
		service ceph -a start
	* Verify that the object store is healthy and running:
		ceph health
		ceph -s

2. Clients:
	* Install the ceph-common package
	* Upgrade to an up-to-date kernel (>=3.x)
	* Install linux-headers for the new kernel
	* Check out the latest ceph-client git repo:
		git clone git://github.com/NewDreamNetwork/ceph-client.git
	* Copy the necessary ceph header files to linux-headers:
		cp -r ceph-client/include/linux/ceph/* /usr/src/linux-headers-$(uname -r)/include/linux/ceph/
	* Build the modules:
		cd ~/ceph-client/net/ceph/
		make -C /usr/src/linux-headers-3.0.0-2-amd64/  M=$(pwd) libceph.ko
		cp Module.symvers ../../drivers/block/
		cd ~/ceph-client/drivers/block/
		make -C /usr/src/linux-headers-3.0.0-2-amd64/  M=$(pwd) rbd.ko
	* Optionally, copy rbd.ko and libceph.ko to /lib/modules/
	* Load the modules:
		modprobe rbd
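
	To check that the modules are loaded and to inspect their metadata,
	something like the following can be used:

		lsmod | grep -e libceph -e rbd
		modinfo rbd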

    
Administration Notes
====================

This section includes some notes on RADOS cluster administration.

0. Starting / Stopping servers
	* service ceph -a start/stop (affects all the servers in the cluster)
	* service ceph start/stop osd (affects only the osds on the current node)
	* service ceph start/stop mon (affects only the mons on the current node)
	* service ceph start/stop osd.$id/mon.$id (affects only the specified daemon)

	* service ceph cleanlogs/cleanalllogs

1. Stop the cluster cleanly
	ceph stop

2. Increase the replication level for a given pool:
	ceph osd pool set $poolname size $size

   Note that when increasing the replication level, the added replication
   overhead will impact performance.
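
   For example, to keep three replicas of every object in the rbd pool and then
   verify the pool settings:

	ceph osd pool set rbd size 3
	ceph osd dump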

    
3. Adjust the number of placement groups per pool:
	ceph osd pool set $poolname pg_num $num

   The default number of PGs per pool is determined by the number of OSDs in the
   cluster, and by the replication level of the pool (for 4 OSDs and replication
   size 2, the default value is 8). The default pools (data, metadata, rbd) are
   assigned 256 PGs.

   After the splitting is complete, the number of PGs used for placement
   (pgp_num) must also be changed. Warning: this is not considered safe for PGs
   already in use (with objects), and should be changed only when the pool is
   created, and before it is used:
	ceph osd pool set $poolname pgp_num $num
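
   For example, for a newly created (and still empty) pool named 'mypool':

	rados mkpool mypool
	ceph osd pool set mypool pg_num 128
	ceph osd pool set mypool pgp_num 128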

    
4. Replacing the journal for osd.$id:
	Edit the osd.$id journal configuration section
	ceph-osd -i osd.$id --mkjournal
	ceph-osd -i osd.$id --osd-journal /path/to/journal

5. Add a new OSD:
	Edit /etc/ceph/ceph.conf to include the new OSD
	ceph mon getmap -o /tmp/monmap
	ceph-osd --mkfs -i osd.$id --monmap /tmp/monmap
	ceph osd setmaxosd [maxosd+1] (use ceph osd getmaxosd to get the current number of osds, if needed)
	service ceph start osd.$id

	Generate the CRUSH map to include the new osd in PGs:
		osdmaptool --createsimple [maxosd] --clobber /tmp/osdmap --export-crush /tmp/crush
		ceph osd setcrushmap -i /tmp/crush
	Or edit the CRUSH map by hand:
		ceph osd getcrushmap -o /tmp/crush
		crushtool -d /tmp/crush -o crushmap
		vim crushmap
		crushtool -c crushmap -o /tmp/crush
		ceph osd setcrushmap -i /tmp/crush

6. General ceph tool commands:
	* ceph mon stat (stat mon servers)
	* ceph mon getmap (get the monmap, use monmaptool to edit)
	* ceph osd dump (dump the osdmap -> pool info, osd info)
	* ceph osd getmap (get the osdmap -> use osdmaptool to edit)
	* ceph osd lspools
	* ceph osd stat (stat osd servers)
	* ceph osd tree (osd server info)
	* ceph pg dump/stat (show info about PGs)

7. rados userspace tool:

   The rados userspace tool (included in the ceph-common package) uses librados
   to communicate with the object store.

	* rados mkpool [pool]
	* rados rmpool [pool]
	* rados df (show usage per pool)
	* rados lspools (list pools)
	* rados ls -p [pool] (list objects in [pool])
	* rados bench [secs] write|seq -t [concurrent operations]
	* rados import/export <pool> <dir> (import/export a local directory in a rados pool)
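
   For example, a simple 60-second write benchmark with 16 concurrent
   operations against a scratch pool:

	rados mkpool benchpool
	rados -p benchpool bench 60 write -t 16
	rados df
	rados rmpool benchpool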

    
8. rbd userspace tool:

   The rbd userspace tool (included in the ceph-common package) uses librbd and
   librados to communicate with the object store.

	* rbd ls -p [pool] (list RBD images in [pool], default pool = rbd)
	* rbd info [image] -p [pool]
	* rbd create [image] --size n (in MB)
	* rbd rm [image]
	* rbd export/import [dir] [image]
	* rbd cp/mv [image] [dest]
	* rbd resize [image]
	* rbd map [image] (map an RBD image to a block device using the in-kernel RBD driver)
	* rbd unmap /dev/rbdX (unmap an RBD device)
	* rbd showmapped
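
   For example, a typical life cycle of an image in the default rbd pool; the
   image name is an example, and the /dev/rbd0 device node should be confirmed
   with rbd showmapped:

	rbd create myimage --size 1024
	rbd ls
	rbd map myimage
	rbd showmapped
	mkfs.ext4 /dev/rbd0
	rbd unmap /dev/rbd0
	rbd rm myimage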

    
9. In-kernel RBD driver

   The in-kernel RBD driver can be used to map and unmap RBD images as block
   devices. Once mapped, they will appear as /dev/rbdX, and a symlink will be
   created in /dev/rbd/[poolname]/[imagename]:[bdev id].

   It also exports a sysfs interface, under /sys/bus/rbd/, which can be used to
   add / remove / list devices, although the rbd map/unmap/showmapped commands
   are preferred.

   The RBD module depends on the net/ceph/libceph module, which implements the
   communication with the object store in the kernel.

10. Qemu-RBD driver

	The Qemu-RBD driver can be used directly by Qemu-KVM to access RBD images as
	block devices inside VMs. It currently supports a feature not present in the
	in-kernel RBD driver (writeback_window).

	It can be configured via libvirt, and the configuration looks like this:

		<disk type='network' device='disk'>
		  <driver name='qemu' type='raw'/>
		  <source protocol='rbd' name='[pool]/[image]:rbd_writeback_window=8000000'/>
		  <target dev='vda' bus='virtio'/>
		</disk>

	Note: it requires an up-to-date version of libvirt, plus a Qemu/KVM version
	with RBD support, which is not included in Debian.
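
	As an illustration, the <disk> element above can be saved in a file (eg
	rbd-disk.xml) and hot-plugged into a running domain with libvirt:

		virsh attach-device [domain] rbd-disk.xml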

    
11. Logging and Debugging:

	For the command-line tools (ceph, rados, rbd), you can specify debug options
	in the form --debug-[component]=n, which will override the options in the
	config file. In order to get any output when using the cli debug options,
	you must also use --log-to-stderr:

		rados ls -p rbd --log-to-stderr --debug-ms=1 --debug-rados=20

	Ceph log files are located in /var/log/ceph/mon.$id and
	/var/log/ceph/osd.$id.