Storage guide
=============

Instructions for RADOS cluster deployment and administration

This document describes the basic steps to obtain a working RADOS cluster /
object store installation, to be used as a storage backend for synnefo, and
provides information about its administration.

It begins by providing general information on the RADOS object store, describing
the different nodes in a RADOS cluster, and then moves to the installation and
setup of the distinct software components. Finally, it provides some basic
information about the cluster administration and debugging.

RADOS is the object storage component of the Ceph project
(http://ceph.newdream.net). For more documentation, see the official wiki
(http://ceph.newdream.net/wiki), and the official documentation
(http://ceph.newdream.net/docs). Usage information for the userspace tools used
to administer the cluster is also available in the respective manpages.


RADOS Intro
-----------

RADOS is the object storage component of Ceph.

An object, in this context, is a named entity that has:

 * name: a sequence of bytes, unique within its container, that is used to locate
   and access the object
 * content: a sequence of bytes
 * metadata: a mapping from keys to values

RADOS takes care of distributing the objects across the whole storage cluster
and replicating them for fault tolerance.


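As a rough illustration of the object model described above, the following
sketch represents an object as a name, a byte string and a key-value mapping,
stored in a container with a flat namespace. The ``RadosObject`` and
``Container`` classes are hypothetical names used only for illustration; they
are not part of librados or of any Ceph API.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import Dict


    @dataclass
    class RadosObject:
        """Toy model of an object: name, content and metadata."""
        name: bytes
        content: bytes = b""
        metadata: Dict[str, bytes] = field(default_factory=dict)


    class Container:
        """Toy container with a flat namespace (names are unique within it)."""

        def __init__(self, name: str) -> None:
            self.name = name
            self._objects: Dict[bytes, RadosObject] = {}

        def put(self, obj: RadosObject) -> None:
            # Writing under an existing name replaces the previous object.
            self._objects[obj.name] = obj

        def get(self, name: bytes) -> RadosObject:
            # The name alone is enough to locate the object.
            return self._objects[name]

The sketch only captures naming and lookup; distribution and replication are
handled by the cluster and are not modelled here.
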
Node types
----------

Nodes in a RADOS deployment belong to one of the following types:

 * Monitor:
   Lightweight daemon (ceph-mon) that provides consensus for distributed
   decision making in a Ceph/RADOS cluster. It is also the initial point of
   contact for new clients, and hands out information about the topology of
   the cluster, such as the osdmap.

   You normally run 3 ceph-mon daemons, on 3 separate physical machines,
   isolated from each other; for example, in different racks or rows.  You could
   run just 1 instance, but that means giving up on high availability.

   Any decision requires the majority of the ceph-mon processes to be healthy
   and communicating with each other. For this reason, you want an odd number
   of ceph-mons: an even number tolerates no more failures than the next lower
   odd number (4 monitors survive the loss of 1, just like 3), and an even
   split leaves neither half with a majority.

 * OSD:
   Storage daemon (ceph-osd) that provides the RADOS service. It uses the
   monitor servers for cluster membership, services object read/write/etc
   requests from clients, and peers with other ceph-osds for data replication.

   The data model is fairly simple on this level. There are multiple named
   pools, and within each pool there are named objects, in a flat namespace (no
   directories). Each object has both data and metadata.

   By default, three pools are created (data, metadata, rbd).

   The data for an object is a single, potentially big, series of bytes.
   Additionally, the series may be sparse: it may have holes that contain binary
   zeros and take up no actual storage.

   The metadata is an unordered set of key-value pairs. Its semantics are
   completely up to the client.

   Multiple OSDs can run on one node, one for each disk included in the object
   store. This might impose a performance overhead, due to peering/replication.
   Alternatively, disks can be pooled together (either with RAID or with btrfs),
   requiring only one OSD to manage the pool.

   In the case of multiple OSDs per host, care must be taken to generate a CRUSH
   map that doesn't replicate objects across OSDs on the same host (see the next
   section).

 * Clients:
   Clients can access the RADOS cluster either directly, at object
   granularity, by using librados and the rados userspace tool, or by using
   librbd and the rbd tool, which create an image / volume abstraction over
   the object store.

   RBD images are striped over the object store daemons, to provide higher
   throughput, and can be accessed either via the in-kernel Rados Block Device
   (RBD) driver, which maps RBD images to block devices, or directly via Qemu,
   and the Qemu-RBD driver.


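The majority requirement for monitors can be made concrete with a little
arithmetic. The sketch below is only an illustration of strict-majority
counting; the function names are hypothetical and nothing here talks to a
real ceph-mon.

.. code-block:: python

    def quorum_size(num_monitors: int) -> int:
        """Smallest number of monitors that forms a strict majority."""
        if num_monitors < 1:
            raise ValueError("a cluster needs at least one monitor")
        return num_monitors // 2 + 1


    def tolerated_failures(num_monitors: int) -> int:
        """How many monitors can fail while a majority is still reachable."""
        return num_monitors - quorum_size(num_monitors)


    if __name__ == "__main__":
        # Print monitors, quorum size and tolerated failures side by side.
        for n in range(1, 6):
            print(n, quorum_size(n), tolerated_failures(n))

For example, ``tolerated_failures(3)`` and ``tolerated_failures(4)`` both
evaluate to 1, which is why a fourth monitor adds no availability over three.
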
Replication and Fault tolerance
-------------------------------

The objects in each pool are partitioned into a (per-pool configurable) number
of placement groups (PGs), and each placement group is mapped to a number of
OSDs, according to the (per-pool configurable) replication level, and a
(per-pool configurable) CRUSH map, which defines how objects are replicated
across OSDs.

The CRUSH map is generated with hints from the config file (e.g. hostnames, racks
etc.), so that the objects are replicated across OSDs in different 'failure
domains'. However, to be on the safe side, the CRUSH map should be
examined to verify that, for example, PGs are not replicated across OSDs on the
same host, and corrected if needed (see the Administration Notes section).

Information about objects, pools, and PGs is included in the osdmap, which
the clients fetch initially from the monitor servers. Using the osdmap,
clients learn which OSD is the primary for each PG, and therefore know which
OSD to contact when they want to interact with a specific object.

More information about the internals of the replication / fault tolerance /
peering inside the RADOS cluster can be found in the original RADOS paper
(http://dl.acm.org/citation.cfm?id=1374606).


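The two-step mapping described in this section (object to placement group,
placement group to OSDs in different failure domains) can be sketched as
follows. This is a deliberately simplified stand-in and is not the CRUSH
algorithm: it hashes the object name modulo the number of PGs and then picks
at most one OSD per host. The function names and the ``osds_by_host`` layout
are assumptions made for this example.

.. code-block:: python

    import hashlib
    from typing import Dict, List


    def _stable_hash(text: str) -> int:
        """Deterministic hash (the built-in hash() is salted per process)."""
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")


    def object_to_pg(object_name: str, pg_num: int) -> int:
        """Map an object name to a placement group id in [0, pg_num)."""
        return _stable_hash(object_name) % pg_num


    def pg_to_osds(pg_id: int, osds_by_host: Dict[str, List[int]],
                   size: int) -> List[int]:
        """Pick `size` OSDs for a PG, one per host; the first is the primary."""
        if size > len(osds_by_host):
            raise ValueError("not enough hosts for the replication level")
        # Rank hosts pseudo-randomly, but deterministically, for this PG.
        hosts = sorted(osds_by_host, key=lambda h: _stable_hash(f"{pg_id}:{h}"))
        chosen = []
        for host in hosts[:size]:
            osds = osds_by_host[host]  # assumed to be non-empty
            chosen.append(osds[_stable_hash(f"{pg_id}:{host}:osd") % len(osds)])
        return chosen

Because both steps are deterministic, any client holding the same layout
computes the same primary OSD for a given object without asking a central
directory, which is the property the osdmap provides in a real cluster.
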
Journaling
----------

The OSD maintains a journal to help keep all on-disk data in a consistent state
while still keeping write latency low. That is, each OSD normally has a back-end
file system (ideally btrfs) and a journal device or file.

When the journal is enabled, all writes are written both to the journal and to
the file system. This is somewhat similar to ext3's data=journal mode, with a
few differences. There are two basic journaling modes:

 * In writeahead mode, every write transaction is written first to the journal.
   Once that is safely on disk, we can ack the write and then apply it to the
   back-end file system. This will work with any file system (with a few
   caveats).

 * In parallel mode, every write transaction is written to the journal and the
   file system in parallel. The write is acked when either one safely commits
   (usually the journal). This will only work on btrfs, as it relies on
   btrfs-specific snapshot ioctls to roll back to a consistent state before
   replaying the journal.


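The difference between the two modes is mainly the point at which the write
is acknowledged. The toy model below spells out that ordering; it is a
sketch of the description above, not of the ceph-osd implementation, and the
function names and step labels are hypothetical.

.. code-block:: python

    from typing import List


    def journal_mode(filesystem: str) -> str:
        """Journaling mode this guide associates with a back-end file system."""
        return "parallel" if filesystem == "btrfs" else "writeahead"


    def write_steps(mode: str, journal_commits_first: bool = True) -> List[str]:
        """Ordered steps of one write transaction, including the ack."""
        if mode == "writeahead":
            return ["commit to journal", "ack", "apply to file system"]
        if mode == "parallel":
            commits = ["commit to journal", "commit to file system"]
            if not journal_commits_first:
                commits.reverse()
            # Both writes are issued together; the ack follows the first commit.
            return [commits[0], "ack", commits[1]]
        raise ValueError(f"unknown journaling mode: {mode}")

In writeahead mode the ack can never precede the journal commit, whereas in
parallel mode it follows whichever commit finishes first.
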
144 5a0df659 Stratos Psomadakis
Authentication
145 82b5509d Kostas Papadimitriou
--------------
146 5a0df659 Stratos Psomadakis
147 5a0df659 Stratos Psomadakis
Ceph supports cephx secure authentication between the nodes, this to make your
148 5a0df659 Stratos Psomadakis
cluster more secure. There are some issues with the cephx authentication,
149 5a0df659 Stratos Psomadakis
especially with clients (Qemu-RBD), and it complicates the cluster deployment.
150 5a0df659 Stratos Psomadakis
Future revisions of this document will include documentation on setting up
151 5a0df659 Stratos Psomadakis
fine-grained cephx authentication acroos the cluster.
152 5a0df659 Stratos Psomadakis
153 5a0df659 Stratos Psomadakis
RADOS Cluster design and configuration
--------------------------------------

This section proposes and describes a sample cluster configuration.

0. Monitor servers:

	* 3 mon servers on separate 'failure domains' (e.g. rack)
	* Monitor servers are named mon.a, mon.b, mon.c respectively
	* Monitor data is stored in /rados/mon.$id (the directory must be created)
	* Monitor servers bind on TCP port 6789, which should not be blocked by
	  a firewall
	* Ceph configuration section for monitors::

		[mon]
			mon data = /rados/mon.$id

		[mon.a]
			host = [hostname]
			mon addr = [ip]:6789
		[mon.b]
			host = [hostname]
			mon addr = [ip]:6789
		[mon.c]
			host = [hostname]
			mon addr = [ip]:6789

	* Debugging options which can be included in the monitor configuration::

		[mon]
			;show monitor messaging traffic
			debug ms = 1
			;show monitor debug messages
			debug mon = 20
			;show Paxos debug messages (consensus protocol)
			debug paxos = 20

1. OSD servers:

	* A numeric id is used to name the OSDs (osd.0, osd.1, ... , osd.n)
	* OSD servers bind on TCP ports 6800 and up, which should not be blocked by
	  a firewall
	* OSD data is stored in /rados/osd.$id (the directory must be created, and
	  mounted if needed)
	* /rados/osd.$id can be either a directory on the rootfs, or a separate
	  partition, on a dedicated fast disk (recommended)

	  The upstream recommended filesystem is btrfs. btrfs will use the parallel
	  mode for OSD journaling.

	  Alternatively, ext4 can be used. ext4 will use the writeahead mode for OSD
	  journaling. ext4 itself can also use an external journal device
	  (preferably a fast disk, e.g. an SSD). In that case, the filesystem can be
	  mounted with the data=journal,commit=9999,noatime,nodiratime options, which
	  is expected to improve performance (this has not been verified)::

		mkfs.ext4 /dev/sdyy
		mke2fs -O journal_dev /dev/sdxx
		tune2fs -O ^has_journal /dev/sdyy
		tune2fs -o journal_data -j -J device=/dev/sdxx /dev/sdyy
		mount /dev/sdyy /rados/osd.$id -o noatime,nodiratime,data=journal,commit=9999

	* OSD journal can be either on a raw block device, a separate partition, or
	  a file.

	  A fast disk (SSD) is recommended as a journal device.

	  If a file is used, the journal size must also be specified in the
	  configuration.

	* Ceph configuration section for OSDs::

		[osd]
			osd data = /rados/osd.$id
			osd journal = /dev/sdzz
			;if a file is used as a journal
			;osd journal size = N (in MB)

		[osd.0]
			;host and rack directives are used to generate a CRUSH map for PG
			;placement
			host = [hostname]
			rack = [rack]

			;public addr is the one the clients will use to contact the osd
			public_addr = [public ip]
			;cluster addr is the one used for osd-to-osd replication/peering etc
			cluster_addr = [cluster ip]

		[osd.1]
			...

	* Debug options which can be included in the osd configuration::

		[osd]
			;show OSD messaging traffic
			debug ms = 1
			;show OSD debug information
			debug osd = 20
			;show OSD journal debug information
			debug journal = 20
			;show filestore debug information
			debug filestore = 20
			;show monitor client debug information
			debug monc = 20

2. Clients:

	* Client configuration only needs the monitor servers' addresses
	* Configuration section for clients::

		[mon.a]
			mon addr = [ip]:6789
		[mon.b]
			mon addr = [ip]:6789
		[mon.c]
			mon addr = [ip]:6789

	* Debug options which can be included in the client configuration::

			;show client messaging traffic
			debug ms = 1
			;show RADOS debug information
			debug rados = 20
			;show objecter debug information
			debug objecter = 20
			;show filer debug information
			debug filer = 20
			;show objectcacher debug information
			debug objectcacher = 20

3. Tips:

	* Mount all the filesystems with the noatime,nodiratime options
	* Even without any debug options, RADOS generates lots of logs. Make sure
	  the log files are on a fast disk, with little I/O traffic, and the
	  partition is mounted with noatime.


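With many monitors and OSDs, writing the per-daemon sections by hand becomes
repetitive. The sketch below renders the sample layout of this section with
Python's standard ``configparser`` module. It is only an illustration: the
``build_ceph_conf`` helper and the shape of its arguments are assumptions of
this example, the host names and addresses passed in are placeholders, and
the output should be reviewed before it is used as /etc/ceph/ceph.conf.

.. code-block:: python

    import configparser
    import io
    from typing import Dict


    def build_ceph_conf(monitors: Dict[str, Dict[str, str]],
                        osds: Dict[int, Dict[str, str]],
                        osd_journal: str) -> str:
        """Render ceph.conf-style text for the sample layout described above."""
        # Interpolation is disabled so that values such as "$id" stay literal.
        conf = configparser.ConfigParser(interpolation=None)
        conf["mon"] = {"mon data": "/rados/mon.$id"}
        for mon_id, mon in sorted(monitors.items()):
            conf[f"mon.{mon_id}"] = {
                "host": mon["host"],
                "mon addr": f"{mon['ip']}:6789",
            }
        conf["osd"] = {"osd data": "/rados/osd.$id", "osd journal": osd_journal}
        for osd_id, osd in sorted(osds.items()):
            # host and rack are the hints used for CRUSH map generation.
            conf[f"osd.{osd_id}"] = {"host": osd["host"], "rack": osd["rack"]}
        buffer = io.StringIO()
        conf.write(buffer)
        return buffer.getvalue()

A call such as ``build_ceph_conf({"a": {"host": "mon1", "ip": "10.0.0.1"}},
{0: {"host": "osd1", "rack": "rack1"}}, "/dev/sdzz")`` returns text with
``[mon]``, ``[mon.a]``, ``[osd]`` and ``[osd.0]`` sections in that order.
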
Installation Process
--------------------

This section describes the installation process of the various software
components in a RADOS cluster.

0. Add the Ceph Debian repository to /etc/apt/sources.list on every node (mon, osd,
   clients)::

	 deb http://ceph.newdream.net/debian/ squeeze main
	 deb-src http://ceph.newdream.net/debian/ squeeze main

1. Monitor and OSD servers:

	* Install the ceph package
	* Upgrade to an up-to-date kernel (>=3.x)
	* Edit /etc/ceph/ceph.conf to include the mon and osd configuration
	  sections shown previously.
	* Create the corresponding dirs in /rados (mon.$id and osd.$id)
	* (optionally) Format and mount the osd.$id partition in /rados/osd.$id
	* Make sure the journal device specified in the conf exists.
	* (optionally) Make sure everything is mounted with the noatime,nodiratime
	  options
	* Make sure monitor and osd servers can freely ssh to each other, using only
	  hostnames.
	* Create the object store::

		mkcephfs -a -c /etc/ceph/ceph.conf

	* Start the servers::

		service ceph -a start

	* Verify that the object store is healthy and running::

		ceph health
		ceph -s

2. Clients:

	* Install the ceph-common package
	* Upgrade to an up-to-date kernel (>=3.x)
	* Install linux-headers for the new kernel
	* Check out the latest ceph-client git repo::

		git clone git://github.com/NewDreamNetwork/ceph-client.git

	* Copy the necessary ceph header files to linux-headers::

		cp -r ceph-client/include/linux/ceph/* /usr/src/linux-$(uname -r)/include/linux/ceph/

	* Build the modules::

		cd ~/ceph-client/net/ceph/
		make -C /usr/src/linux-headers-3.0.0-2-amd64/  M=$(pwd) libceph.ko
		cp Module.symvers ../../drivers/block/
		cd ~/ceph-client/drivers/block/
		make -C /usr/src/linux-headers-3.0.0-2-amd64/  M=$(pwd) rbd.ko

	* Optionally, copy rbd.ko and libceph.ko to /lib/modules/
	* Load the modules::

		modprobe rbd


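The verification step of the server installation can be scripted. The sketch
below assumes that ``ceph health`` prints a status line beginning with
``HEALTH_OK`` when the cluster is healthy; the helper names, the retry count
and the delay are arbitrary choices made for this example.

.. code-block:: python

    import subprocess
    import time
    from typing import Callable


    def run_ceph_health() -> str:
        """Run `ceph health` and return its output (needs a running cluster)."""
        result = subprocess.run(["ceph", "health"], capture_output=True,
                                text=True, check=True)
        return result.stdout


    def is_healthy(health_output: str) -> bool:
        """True when the status line starts with HEALTH_OK (assumed format)."""
        return health_output.strip().startswith("HEALTH_OK")


    def wait_until_healthy(get_health: Callable[[], str] = run_ceph_health,
                           attempts: int = 10, delay: float = 5.0,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
        """Poll the health status until it is OK or the attempts run out."""
        for attempt in range(attempts):
            if is_healthy(get_health()):
                return True
            if attempt < attempts - 1:
                sleep(delay)
        return False

Passing ``get_health`` and ``sleep`` as parameters keeps the polling logic
testable on a machine without a cluster; on a cluster node, calling
``wait_until_healthy()`` with the defaults polls the real command.
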
Administration Notes
--------------------

This section includes some notes on the RADOS cluster administration.

0. Starting / Stopping servers

	* service ceph -a start/stop (affects all the servers in the cluster)
	* service ceph start/stop osd (affects only the osds in the current node)
	* service ceph start/stop mon (affects only the mons in the current node)
	* service ceph start/stop osd.$id/mon.$id (affects only the specified daemon)

	* service ceph cleanlogs/cleanalllogs

1. Stop the cluster cleanly::

	ceph stop

2. Increase the replication level for a given pool::

	ceph osd pool set $poolname size $size

   Note that increasing the replication level increases the replication
   overhead, which will impact performance.

3. Adjust the number of placement groups per pool::

	ceph osd pool set $poolname pg_num $num

   The default number of PGs per pool is determined by the number of OSDs in the
   cluster, and the replication level of the pool (for 4 OSDs and replication
   size 2, the default value is 8). The default pools (data,metadata,rbd) are
   assigned 256 PGs.

   After the splitting is complete, the number of PGs used for placement
   (pgp_num) must be changed as well. Warning: this is not considered safe on
   PGs in use (with objects), and should be changed only when the PG is created,
   and before being used::

	ceph osd pool set $poolname pgp_num $num

4. Replacing the journal for osd.$id:

   Edit the journal settings in the osd.$id configuration section, then run::

	ceph-osd -i osd.$id --mkjournal
	ceph-osd -i osd.$id --osd-journal /path/to/journal

5. Add a new OSD:

   Edit /etc/ceph/ceph.conf to include the new OSD, then run::

	ceph mon getmap -o /tmp/monmap
	ceph-osd --mkfs -i osd.$id --monmap /tmp/monmap
	ceph osd setmaxosd [maxosd+1]
	service ceph start osd.$id

   (Use ceph osd getmaxosd to get the current number of OSDs, if needed.)

   Generate the CRUSH map to include the new OSD in PGs::

	osdmaptool --createsimple [maxosd] --clobber /tmp/osdmap --export-crush /tmp/crush
	ceph osd setcrushmap -i /tmp/crush

   Or edit the CRUSH map by hand::

	ceph osd getcrushmap -o /tmp/crush
	crushtool -d /tmp/crush -o crushmap
	vim crushmap
	crushtool -c crushmap -o /tmp/crush
	ceph osd setcrushmap -i /tmp/crush

6. General ceph tool commands:

	* ceph mon stat (stat mon servers)
	* ceph mon getmap (get the monmap, use monmaptool to edit)
	* ceph osd dump (dump osdmap -> pool info, osd info)
	* ceph osd getmap (get osdmap -> use osdmaptool to edit)
	* ceph osd lspools
	* ceph osd stat (stat osd servers)
	* ceph osd tree (osd server info)
	* ceph pg dump/stat (show info about PGs)

7. rados userspace tool:

   The rados userspace tool (included in the ceph-common package) uses librados to
   communicate with the object store.

	* rados mkpool [pool]
	* rados rmpool [pool]
	* rados df (show usage per pool)
	* rados lspools (list pools)
	* rados ls -p [pool] (list objects in [pool])
	* rados bench [secs] write|seq -t [concurrent operations]
	* rados import/export <pool> <dir> (import/export a local directory in a rados pool)

8. rbd userspace tool:

   The rbd userspace tool (included in the ceph-common package) uses librbd and
   librados to communicate with the object store.

	* rbd ls -p [pool] (list RBD images in [pool], default pool = rbd)
	* rbd info [image] -p [pool]
	* rbd create [image] --size n (in MB)
	* rbd rm [image]
	* rbd export [image] [path]
	* rbd import [path] [image]
	* rbd cp/mv [image] [dest]
	* rbd resize [image]
	* rbd map [image] (map an RBD image to a block device using the in-kernel RBD driver)
	* rbd unmap /dev/rbdX (unmap an RBD device)
	* rbd showmapped

9. In-kernel RBD driver

   The in-kernel RBD driver can be used to map and unmap RBD images as block
   devices. Once mapped, they will appear as /dev/rbdX, and a symlink will be
   created in /dev/rbd/[poolname]/[imagename]:[bdev id].

   It also exports a sysfs interface, under /sys/bus/rbd/, which can be used to
   add / remove / list devices, although the rbd map/unmap/showmapped commands
   are preferred.

   The RBD module depends on the net/ceph/libceph module, which implements the
   communication with the object store in the kernel.

10. Qemu-RBD driver

    The Qemu-RBD driver can be used directly by Qemu-KVM to access RBD images as
    block devices inside VMs. It currently supports a feature not present in the
    in-kernel RBD driver (writeback_window).

    It can be configured via libvirt, and the configuration looks like this:

    .. code-block:: xml

        <disk type='network' device='disk'>
          <driver name='qemu' type='raw'/>
          <source protocol='rbd' name='[pool]/[image]:rbd_writeback_window=8000000'/>
          <target dev='vda' bus='virtio'/>
        </disk>

    Note: it requires an up-to-date version of libvirt, plus a Qemu/KVM
    version that is not included in Debian.

11. Logging and Debugging:

    For command-line tools (ceph, rados, rbd), you can specify debug options in
    the form of --debug-[component]=n, which will override the options in the
    config file. In order to get any output when using the cli debug options,
    you must also use --log-to-stderr::

        rados ls -p rbd --log-to-stderr --debug-ms=1 --debug-rados=20

    Ceph log files are located in /var/log/ceph/mon.$id and
    /var/log/ceph/osd.$id.
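
When the command-line tools are driven from a script, the debug flags
described in the Logging and Debugging notes can be assembled
programmatically. The helper below is a hypothetical convenience written for
this guide; it only builds the argument list in the documented
``--debug-[component]=n`` form and always adds ``--log-to-stderr``, since no
output is produced without it.

.. code-block:: python

    from typing import Dict, List


    def with_debug_options(command: List[str],
                           levels: Dict[str, int]) -> List[str]:
        """Return `command` extended with --log-to-stderr and debug flags."""
        if not levels:
            # No debug components requested: leave the command untouched.
            return list(command)
        flags = [f"--debug-{component}={level}"
                 for component, level in levels.items()]
        return list(command) + ["--log-to-stderr"] + flags

For instance, ``with_debug_options(["rados", "ls", "-p", "rbd"], {"ms": 1,
"rados": 20})`` yields the same arguments as the ``rados ls`` example in the
Logging and Debugging notes, ready to be passed to ``subprocess.run``.
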