1 | 82b5509d | Kostas Papadimitriou | Storage guide |
---|---|---|---|
2 | 82b5509d | Kostas Papadimitriou | ============= |
3 | 82b5509d | Kostas Papadimitriou | |
4 | 82b5509d | Kostas Papadimitriou | Instructions for RADOS cluster deployment and administration |
5 | 5a0df659 | Stratos Psomadakis | |
6 | 5a0df659 | Stratos Psomadakis | This document describes the basic steps to obtain a working RADOS cluster / |
7 | 5a0df659 | Stratos Psomadakis | object store installation, to be used as a storage backend for synnefo, and |
8 | 5a0df659 | Stratos Psomadakis | provides information about its administration. |
9 | 5a0df659 | Stratos Psomadakis | |
10 | 5a0df659 | Stratos Psomadakis | It begins by providing general information on the RADOS object store describing |
11 | 5a0df659 | Stratos Psomadakis | the different nodes in a RADOS cluster, and then moves to the installation and |
12 | 5a0df659 | Stratos Psomadakis | setup of the distinct software components. Finally, it provides some basic |
13 | 5a0df659 | Stratos Psomadakis | information about the cluster administration and debugging. |
14 | 5a0df659 | Stratos Psomadakis | |
15 | 5a0df659 | Stratos Psomadakis | RADOS is the object storage component of the Ceph project |
16 | 5a0df659 | Stratos Psomadakis | (http://ceph.newdream.net). For more documentation, see the official wiki |
17 | 5a0df659 | Stratos Psomadakis | (http://ceph.newdream.net/wiki), and the official documentation |
18 | 5a0df659 | Stratos Psomadakis | (http://ceph.newdream.net/docs). Usage information for the userspace tools used to |
19 | 5a0df659 | Stratos Psomadakis | administer the cluster is also available in the respective manpages. |
20 | 5a0df659 | Stratos Psomadakis | |
21 | 5a0df659 | Stratos Psomadakis | |
22 | 5a0df659 | Stratos Psomadakis | RADOS Intro |
23 | 82b5509d | Kostas Papadimitriou | ----------- |
24 | 5a0df659 | Stratos Psomadakis | RADOS is the object storage component of Ceph. |
25 | 5a0df659 | Stratos Psomadakis | |
26 | 5a0df659 | Stratos Psomadakis | An object, in this context, means a named entity that has |
27 | 5a0df659 | Stratos Psomadakis | |
28 | 5a0df659 | Stratos Psomadakis | * name: a sequence of bytes, unique within its container, that is used to locate |
29 | 5a0df659 | Stratos Psomadakis | and access the object |
30 | 5a0df659 | Stratos Psomadakis | * content: sequence of bytes |
31 | 5a0df659 | Stratos Psomadakis | * metadata: a mapping from keys to values |
32 | 5a0df659 | Stratos Psomadakis | |
33 | 5a0df659 | Stratos Psomadakis | RADOS takes care of distributing the objects across the whole storage cluster |
34 | 5a0df659 | Stratos Psomadakis | and replicating them for fault tolerance. |
35 | 5a0df659 | Stratos Psomadakis | |
36 | 5a0df659 | Stratos Psomadakis | |
37 | 5a0df659 | Stratos Psomadakis | Node types |
38 | 82b5509d | Kostas Papadimitriou | ---------- |
39 | 5a0df659 | Stratos Psomadakis | |
40 | 5a0df659 | Stratos Psomadakis | Nodes in a RADOS deployment belong to one of the following types: |
41 | 5a0df659 | Stratos Psomadakis | |
42 | 5a0df659 | Stratos Psomadakis | * Monitor: |
43 | 5a0df659 | Stratos Psomadakis | Lightweight daemon (ceph-mon) that provides consensus for distributed |
44 | 5a0df659 | Stratos Psomadakis | decision making in a Ceph/RADOS cluster. It is also the initial point of |
45 | 5a0df659 | Stratos Psomadakis | contact for new clients, and will hand out information about the topology of |
46 | 5a0df659 | Stratos Psomadakis | the cluster, such as the osdmap. |
47 | 5a0df659 | Stratos Psomadakis | |
48 | 5a0df659 | Stratos Psomadakis | You normally run 3 ceph-mon daemons, on 3 separate physical machines, |
49 | 5a0df659 | Stratos Psomadakis | isolated from each other; for example, in different racks or rows. You could |
50 | 5a0df659 | Stratos Psomadakis | run just 1 instance, but that means giving up on high availability. |
51 | 5a0df659 | Stratos Psomadakis | |
52 | 5a0df659 | Stratos Psomadakis | Any decision requires the majority of the ceph-mon processes to be healthy |
53 | 5a0df659 | Stratos Psomadakis | and communicating with each other. For this reason, you never want an even |
54 | 5a0df659 | Stratos Psomadakis | number of ceph-mons; there is no unambiguous majority subgroup for an even |
55 | 5a0df659 | Stratos Psomadakis | number. |
56 | 5a0df659 | Stratos Psomadakis | |
57 | 5a0df659 | Stratos Psomadakis | * OSD: |
58 | 5a0df659 | Stratos Psomadakis | Storage daemon (ceph-osd) that provides the RADOS service. It uses the |
59 | 5a0df659 | Stratos Psomadakis | monitor servers for cluster membership, services object read/write/etc |
60 | 5a0df659 | Stratos Psomadakis | requests from clients, and peers with other ceph-osds for data replication. |
61 | 5a0df659 | Stratos Psomadakis | |
62 | 5a0df659 | Stratos Psomadakis | The data model is fairly simple on this level. There are multiple named |
63 | 5a0df659 | Stratos Psomadakis | pools, and within each pool there are named objects, in a flat namespace (no |
64 | 5a0df659 | Stratos Psomadakis | directories). Each object has both data and metadata. |
65 | 5a0df659 | Stratos Psomadakis | |
66 | 5a0df659 | Stratos Psomadakis | By default, three pools are created (data, metadata, rbd). |
67 | 5a0df659 | Stratos Psomadakis | |
68 | 5a0df659 | Stratos Psomadakis | The data for an object is a single, potentially big, series of bytes. |
69 | 5a0df659 | Stratos Psomadakis | Additionally, the series may be sparse: it may have holes that contain binary |
70 | 5a0df659 | Stratos Psomadakis | zeros and take up no actual storage. |
71 | 5a0df659 | Stratos Psomadakis | |
72 | 5a0df659 | Stratos Psomadakis | The metadata is an unordered set of key-value pairs. Its semantics are |
73 | 5a0df659 | Stratos Psomadakis | completely up to the client. |
74 | 5a0df659 | Stratos Psomadakis | |
75 | 5a0df659 | Stratos Psomadakis | Multiple OSDs can run on one node, one for each disk included in the object |
76 | 5a0df659 | Stratos Psomadakis | store. This might impose a performance overhead, due to peering/replication. |
77 | 5a0df659 | Stratos Psomadakis | Alternatively, disks can be pooled together (either with RAID or with btrfs), |
78 | 5a0df659 | Stratos Psomadakis | requiring only one OSD to manage the pool. |
79 | 5a0df659 | Stratos Psomadakis | |
80 | 5a0df659 | Stratos Psomadakis | In the case of multiple OSDs, care must be taken to generate a CRUSH map |
81 | 5a0df659 | Stratos Psomadakis | that does not replicate objects across OSDs on the same host (see the next |
82 | 5a0df659 | Stratos Psomadakis | section). |
83 | 5a0df659 | Stratos Psomadakis | |
84 | 5a0df659 | Stratos Psomadakis | * Clients: |
85 | 5a0df659 | Stratos Psomadakis | Clients can access the RADOS cluster either directly, at object |
86 | 5a0df659 | Stratos Psomadakis | granularity, by using librados and the rados userspace tool, or by using |
87 | 5a0df659 | Stratos Psomadakis | librbd and the rbd tool, which create an image / volume abstraction over |
88 | 5a0df659 | Stratos Psomadakis | the object store. |
89 | 5a0df659 | Stratos Psomadakis | |
90 | 5a0df659 | Stratos Psomadakis | RBD images are striped over the object store daemons, to provide higher |
91 | 5a0df659 | Stratos Psomadakis | throughput, and can be accessed either via the in-kernel Rados Block Device |
92 | 5a0df659 | Stratos Psomadakis | (RBD) driver, which maps RBD images to block devices, or directly via Qemu, |
93 | 5a0df659 | Stratos Psomadakis | and the Qemu-RBD driver. |
94 | 5a0df659 | Stratos Psomadakis | |
95 | 5a0df659 | Stratos Psomadakis | |
96 | 5a0df659 | Stratos Psomadakis | Replication and Fault tolerance |
97 | 82b5509d | Kostas Papadimitriou | ------------------------------- |
98 | 5a0df659 | Stratos Psomadakis | |
99 | 5a0df659 | Stratos Psomadakis | The objects in each pool are partitioned into a (per-pool configurable) number |
100 | 5a0df659 | Stratos Psomadakis | of placement groups (pgs), and each placement group is mapped to a number of |
101 | 5a0df659 | Stratos Psomadakis | OSDs, according to the (per-pool configurable) replication level, and a |
102 | 5a0df659 | Stratos Psomadakis | (per-pool configurable) CRUSH map, which defines how objects are replicated |
103 | 5a0df659 | Stratos Psomadakis | across OSDs. |
104 | 5a0df659 | Stratos Psomadakis | |
105 | 5a0df659 | Stratos Psomadakis | The CRUSH map is generated with hints from the config file (eg hostnames, racks |
106 | 5a0df659 | Stratos Psomadakis | etc), so that the objects are replicated across OSDs in different 'failure |
107 | 5a0df659 | Stratos Psomadakis | domains'. However, in order to be on the safe side, the CRUSH map should be |
108 | 5a0df659 | Stratos Psomadakis | examined to verify that, for example, PGs are not replicated across OSDs on the |
109 | 5a0df659 | Stratos Psomadakis | same host, and corrected if needed (see the Admin section). |
110 | 5a0df659 | Stratos Psomadakis | |
111 | 5a0df659 | Stratos Psomadakis | Information about objects, pools, and pgs is included in the osdmap, which |
112 | 5a0df659 | Stratos Psomadakis | the clients fetch initially from the monitor servers. Using the osdmap, |
113 | 5a0df659 | Stratos Psomadakis | clients learn which OSD is the primary for each PG, and therefore know which |
114 | 5a0df659 | Stratos Psomadakis | OSD to contact when they want to interact with a specific object. |
115 | 5a0df659 | Stratos Psomadakis | |
116 | 5a0df659 | Stratos Psomadakis | More information about the internals of the replication / fault tolerance / |
117 | 5a0df659 | Stratos Psomadakis | peering inside the RADOS cluster can be found in the original RADOS paper |
118 | 5a0df659 | Stratos Psomadakis | (http://dl.acm.org/citation.cfm?id=1374606). |
119 | 5a0df659 | Stratos Psomadakis | |
120 | 5a0df659 | Stratos Psomadakis | |
121 | 5a0df659 | Stratos Psomadakis | Journaling |
122 | 82b5509d | Kostas Papadimitriou | ----------- |
123 | 5a0df659 | Stratos Psomadakis | |
124 | 5a0df659 | Stratos Psomadakis | The OSD maintains a journal to help keep all on-disk data in a consistent state |
125 | 5a0df659 | Stratos Psomadakis | while still keeping write latency low. That is, each OSD normally has a back-end |
126 | 5a0df659 | Stratos Psomadakis | file system (ideally btrfs) and a journal device or file. |
127 | 5a0df659 | Stratos Psomadakis | |
128 | 5a0df659 | Stratos Psomadakis | When the journal is enabled, all writes are written both to the journal and to |
129 | 5a0df659 | Stratos Psomadakis | the file system. This is somewhat similar to ext3's data=journal mode, with a |
130 | 5a0df659 | Stratos Psomadakis | few differences. There are two basic journaling modes: |
131 | 5a0df659 | Stratos Psomadakis | |
132 | 5a0df659 | Stratos Psomadakis | * In writeahead mode, every write transaction is written first to the journal. |
133 | 5a0df659 | Stratos Psomadakis | Once that is safely on disk, we can ack the write and then apply it to the |
134 | 5a0df659 | Stratos Psomadakis | back-end file system. This will work with any file system (with a few |
135 | 5a0df659 | Stratos Psomadakis | caveats). |
136 | 5a0df659 | Stratos Psomadakis | |
137 | 5a0df659 | Stratos Psomadakis | * In parallel mode, every write transaction is written to the journal and the |
138 | 5a0df659 | Stratos Psomadakis | file system in parallel. The write is acked when either one safely commits |
139 | 5a0df659 | Stratos Psomadakis | (usually the journal). This will only work on btrfs, as it relies on |
140 | 5a0df659 | Stratos Psomadakis | btrfs-specific snapshot ioctls to roll back to a consistent state before |
141 | 5a0df659 | Stratos Psomadakis | replaying the journal. |
142 | 5a0df659 | Stratos Psomadakis | |
143 | 5a0df659 | Stratos Psomadakis | |
144 | 5a0df659 | Stratos Psomadakis | Authentication |
145 | 82b5509d | Kostas Papadimitriou | -------------- |
146 | 5a0df659 | Stratos Psomadakis | |
147 | 5a0df659 | Stratos Psomadakis | Ceph supports cephx authentication between the nodes, which makes the |
148 | 5a0df659 | Stratos Psomadakis | cluster more secure. There are some issues with the cephx authentication, |
149 | 5a0df659 | Stratos Psomadakis | especially with clients (Qemu-RBD), and it complicates the cluster deployment. |
150 | 5a0df659 | Stratos Psomadakis | Future revisions of this document will include documentation on setting up |
151 | 5a0df659 | Stratos Psomadakis | fine-grained cephx authentication across the cluster. |
152 | 5a0df659 | Stratos Psomadakis | |
153 | 5a0df659 | Stratos Psomadakis | |
154 | 5a0df659 | Stratos Psomadakis | RADOS Cluster design and configuration |
155 | 82b5509d | Kostas Papadimitriou | -------------------------------------- |
156 | 5a0df659 | Stratos Psomadakis | |
157 | 5a0df659 | Stratos Psomadakis | This section proposes and describes a sample cluster configuration. |
158 | 5a0df659 | Stratos Psomadakis | |
159 | 5a0df659 | Stratos Psomadakis | 0. Monitor servers: |
160 | 5a0df659 | Stratos Psomadakis | * 3 mon servers on separate 'failure domains' (eg rack) |
161 | 5a0df659 | Stratos Psomadakis | * Monitor servers are named mon.a, mon.b, mon.c respectively |
162 | 5a0df659 | Stratos Psomadakis | * Monitor data stored in /rados/mon.$id (should be created) |
163 | 5a0df659 | Stratos Psomadakis | * Monitor servers bind to TCP port 6789, which should not be blocked by |
164 | 5a0df659 | Stratos Psomadakis | the firewall |
165 | 5a0df659 | Stratos Psomadakis | * Ceph configuration section for monitors: |
166 | 5a0df659 | Stratos Psomadakis | [mon] |
167 | 5a0df659 | Stratos Psomadakis | mon data = /rados/mon.$id |
168 | 5a0df659 | Stratos Psomadakis | |
169 | 5a0df659 | Stratos Psomadakis | [mon.a] |
170 | 5a0df659 | Stratos Psomadakis | host = [hostname] |
171 | 5a0df659 | Stratos Psomadakis | mon addr = [ip]:6789 |
172 | 5a0df659 | Stratos Psomadakis | [mon.b] |
173 | 5a0df659 | Stratos Psomadakis | host = [hostname] |
174 | 5a0df659 | Stratos Psomadakis | mon addr = [ip]:6789 |
175 | 5a0df659 | Stratos Psomadakis | [mon.c] |
176 | 5a0df659 | Stratos Psomadakis | host = [hostname] |
177 | 5a0df659 | Stratos Psomadakis | mon addr = [ip]:6789 |
178 | 5a0df659 | Stratos Psomadakis | |
179 | 5a0df659 | Stratos Psomadakis | * Debugging options which can be included in the monitor configuration: |
180 | 5a0df659 | Stratos Psomadakis | [mon] |
181 | 5a0df659 | Stratos Psomadakis | ;show monitor messaging traffic |
182 | 5a0df659 | Stratos Psomadakis | debug ms = 1 |
183 | 5a0df659 | Stratos Psomadakis | ;show monitor debug messages |
184 | 5a0df659 | Stratos Psomadakis | debug mon = 20 |
185 | 5a0df659 | Stratos Psomadakis | ; show Paxos debug messages (consensus protocol) |
186 | 5a0df659 | Stratos Psomadakis | debug paxos = 20 |
187 | 5a0df659 | Stratos Psomadakis | |
188 | 5a0df659 | Stratos Psomadakis | 1. OSD servers: |
189 | 5a0df659 | Stratos Psomadakis | * A numeric id is used to name the osds (osd.0, osd.1, ... , osd.n) |
190 | 5a0df659 | Stratos Psomadakis | * OSD servers bind to TCP ports 6800 and above, which should not be blocked by |
191 | 5a0df659 | Stratos Psomadakis | the firewall |
192 | 5a0df659 | Stratos Psomadakis | * OSD data are stored in /rados/osd.$id (should be created and mounted if |
193 | 5a0df659 | Stratos Psomadakis | needed) |
194 | 5a0df659 | Stratos Psomadakis | * /rados/osd.$id can be either a directory on the rootfs, or a separate |
195 | 5a0df659 | Stratos Psomadakis | partition, on a dedicated fast disk (recommended) |
196 | 5a0df659 | Stratos Psomadakis | |
197 | 5a0df659 | Stratos Psomadakis | The upstream recommended filesystem is btrfs. btrfs will use the parallel |
198 | 5a0df659 | Stratos Psomadakis | mode for OSD journaling. |
199 | 5a0df659 | Stratos Psomadakis | |
200 | 5a0df659 | Stratos Psomadakis | Alternatively, ext4 can be used. ext4 will use the writeahead mode for OSD |
201 | 5a0df659 | Stratos Psomadakis | journaling. ext4 itself can also use an external journal device |
202 | 5a0df659 | Stratos Psomadakis | (preferably a fast, eg SSD, disk). In that case, the filesystem can be |
203 | 5a0df659 | Stratos Psomadakis | mounted with the data=journal,commit=9999,noatime,nodiratime options, which |
204 | 5a0df659 | Stratos Psomadakis | may improve performance (this claim has not been verified): |
205 | 5a0df659 | Stratos Psomadakis | |
206 | 5a0df659 | Stratos Psomadakis | mkfs.ext4 /dev/sdyy |
207 | 5a0df659 | Stratos Psomadakis | mke2fs -O journal_dev /dev/sdxx |
208 | 5a0df659 | Stratos Psomadakis | tune2fs -O ^has_journal /dev/sdyy |
209 | 5a0df659 | Stratos Psomadakis | tune2fs -o journal_data -j -J device=/dev/sdxx /dev/sdyy |
210 | 5a0df659 | Stratos Psomadakis | mount /dev/sdyy /rados/osd.$id -o noatime,nodiratime,data=journal,commit=9999 |
211 | 5a0df659 | Stratos Psomadakis | |
212 | 5a0df659 | Stratos Psomadakis | * OSD journal can be either on a raw block device, a separate partition, or |
213 | 5a0df659 | Stratos Psomadakis | a file. |
214 | 5a0df659 | Stratos Psomadakis | |
215 | 5a0df659 | Stratos Psomadakis | A fast disk (SSD) is recommended as a journal device. |
216 | 5a0df659 | Stratos Psomadakis | |
217 | 5a0df659 | Stratos Psomadakis | If a file is used, the journal size must also be specified in the |
218 | 5a0df659 | Stratos Psomadakis | configuration. |
219 | 5a0df659 | Stratos Psomadakis | |
220 | 5a0df659 | Stratos Psomadakis | * Ceph configuration section for OSDs: |
221 | 5a0df659 | Stratos Psomadakis | [osd] |
222 | 5a0df659 | Stratos Psomadakis | osd data = /rados/osd.$id |
223 | 5a0df659 | Stratos Psomadakis | osd journal = /dev/sdzz |
224 | 5a0df659 | Stratos Psomadakis | ;if a file is used as a journal |
225 | 5a0df659 | Stratos Psomadakis | ;osd journal size = N (in MB) |
226 | 5a0df659 | Stratos Psomadakis | |
227 | 5a0df659 | Stratos Psomadakis | [osd.0] |
228 | 5a0df659 | Stratos Psomadakis | ;host and rack directives are used to generate a CRUSH map for PG |
229 | 5a0df659 | Stratos Psomadakis | ;placement |
230 | 5a0df659 | Stratos Psomadakis | host = [hostname] |
231 | 5a0df659 | Stratos Psomadakis | rack = [rack] |
232 | 5a0df659 | Stratos Psomadakis | |
233 | 5a0df659 | Stratos Psomadakis | ;public addr is the one the clients will use to contact the osd |
234 | 5a0df659 | Stratos Psomadakis | public_addr = [public ip] |
235 | 5a0df659 | Stratos Psomadakis | ;cluster addr is the one used for osd-to-osd replication/peering etc |
236 | 5a0df659 | Stratos Psomadakis | cluster_addr = [cluster ip] |
237 | 5a0df659 | Stratos Psomadakis | |
238 | 5a0df659 | Stratos Psomadakis | [osd.1] |
239 | 5a0df659 | Stratos Psomadakis | ... |
240 | 5a0df659 | Stratos Psomadakis | |
241 | 5a0df659 | Stratos Psomadakis | * Debug options which can be included in the osd configuration: |
242 | 5a0df659 | Stratos Psomadakis | [osd] |
243 | 5a0df659 | Stratos Psomadakis | ;show OSD messaging traffic |
244 | 5a0df659 | Stratos Psomadakis | debug ms = 1 |
245 | 5a0df659 | Stratos Psomadakis | ;show OSD debug information |
246 | 5a0df659 | Stratos Psomadakis | debug osd = 20 |
247 | 5a0df659 | Stratos Psomadakis | ;show OSD journal debug information |
248 | 5a0df659 | Stratos Psomadakis | debug journal = 20 |
249 | 5a0df659 | Stratos Psomadakis | ;show filestore debug information |
250 | 5a0df659 | Stratos Psomadakis | debug filestore = 20 |
251 | 5a0df659 | Stratos Psomadakis | ;show monitor client debug information |
252 | 5a0df659 | Stratos Psomadakis | debug monc = 20 |
253 | 5a0df659 | Stratos Psomadakis | |
254 | 5a0df659 | Stratos Psomadakis | 2. Clients |
255 | 5a0df659 | Stratos Psomadakis | * Client configuration only needs the monitor servers' addresses |
256 | 5a0df659 | Stratos Psomadakis | * Configuration section for clients: |
257 | 5a0df659 | Stratos Psomadakis | [mon.a] |
258 | 5a0df659 | Stratos Psomadakis | mon addr = [ip]:6789 |
259 | 5a0df659 | Stratos Psomadakis | [mon.b] |
260 | 5a0df659 | Stratos Psomadakis | mon addr = [ip]:6789 |
261 | 5a0df659 | Stratos Psomadakis | [mon.c] |
262 | 5a0df659 | Stratos Psomadakis | mon addr = [ip]:6789 |
263 | 5a0df659 | Stratos Psomadakis | * Debug options which can be included in the client configuration: |
264 | 5a0df659 | Stratos Psomadakis | ;show client messaging traffic |
265 | 5a0df659 | Stratos Psomadakis | debug ms = 1 |
266 | 5a0df659 | Stratos Psomadakis | ;show RADOS debug information |
267 | 5a0df659 | Stratos Psomadakis | debug rados = 20 |
268 | 5a0df659 | Stratos Psomadakis | ;show objecter debug information |
269 | 5a0df659 | Stratos Psomadakis | debug objecter = 20 |
270 | 5a0df659 | Stratos Psomadakis | ;show filer debug information |
271 | 5a0df659 | Stratos Psomadakis | debug filer = 20 |
272 | 5a0df659 | Stratos Psomadakis | ;show objectcacher debug information |
273 | 5a0df659 | Stratos Psomadakis | debug objectcacher = 20 |
274 | 5a0df659 | Stratos Psomadakis | |
275 | 5a0df659 | Stratos Psomadakis | 3. Tips |
276 | 5a0df659 | Stratos Psomadakis | * Mount all the filesystems with noatime,nodiratime options |
277 | 5a0df659 | Stratos Psomadakis | * Even without any debug options, RADOS generates lots of logs. Make sure |
278 | 5a0df659 | Stratos Psomadakis | the log files are on a fast disk, with little I/O traffic, and the |
279 | 5a0df659 | Stratos Psomadakis | partition is mounted with noatime. |
280 | 5a0df659 | Stratos Psomadakis | |
281 | 5a0df659 | Stratos Psomadakis | |
282 | 5a0df659 | Stratos Psomadakis | Installation Process |
283 | 82b5509d | Kostas Papadimitriou | -------------------- |
284 | 5a0df659 | Stratos Psomadakis | |
285 | 5a0df659 | Stratos Psomadakis | This section describes the installation process of the various software |
286 | 5a0df659 | Stratos Psomadakis | components in a RADOS cluster. |
287 | 5a0df659 | Stratos Psomadakis | |
288 | 5a0df659 | Stratos Psomadakis | 0. Add Ceph Debian repository in /etc/apt/sources.list on every node (mon, osd, |
289 | c469ca86 | Kostas Papadimitriou | clients):: |
290 | c469ca86 | Kostas Papadimitriou | |
291 | 5a0df659 | Stratos Psomadakis | deb http://ceph.newdream.net/debian/ squeeze main |
292 | 5a0df659 | Stratos Psomadakis | deb-src http://ceph.newdream.net/debian/ squeeze main |
293 | 5a0df659 | Stratos Psomadakis | |
294 | 5a0df659 | Stratos Psomadakis | 1. Monitor and OSD servers: |
295 | 5a0df659 | Stratos Psomadakis | * Install the ceph package |
296 | 5a0df659 | Stratos Psomadakis | * Upgrade to an up-to-date kernel (>=3.x) |
297 | 5a0df659 | Stratos Psomadakis | * Edit the /etc/ceph/ceph.conf to include the mon and osd configuration |
298 | 5a0df659 | Stratos Psomadakis | sections, shown previously. |
299 | 5a0df659 | Stratos Psomadakis | * Create the corresponding dirs in /rados (mon.$id and osd.$id) |
300 | 5a0df659 | Stratos Psomadakis | * (optionally) Format and mount the osd.$id partition in /rados/osd.$id |
301 | 5a0df659 | Stratos Psomadakis | * Make sure the journal device specified in the conf exists. |
302 | 5a0df659 | Stratos Psomadakis | * (optionally) Make sure everything is mounted with the noatime,nodiratime |
303 | 5a0df659 | Stratos Psomadakis | options |
304 | 5a0df659 | Stratos Psomadakis | * Make sure monitor and osd servers can freely ssh to each other, using only |
305 | 5a0df659 | Stratos Psomadakis | hostnames. |
306 | 5a0df659 | Stratos Psomadakis | * Create the object store: |
307 | 5a0df659 | Stratos Psomadakis | mkcephfs -a -c /etc/ceph/ceph.conf |
308 | 5a0df659 | Stratos Psomadakis | * Start the servers: |
309 | 5a0df659 | Stratos Psomadakis | service ceph -a start |
310 | 5a0df659 | Stratos Psomadakis | * Verify that the object store is healthy, and running: |
311 | 5a0df659 | Stratos Psomadakis | ceph health |
312 | 5a0df659 | Stratos Psomadakis | ceph -s |
313 | 5a0df659 | Stratos Psomadakis | |
314 | 5a0df659 | Stratos Psomadakis | 2. Clients: |
315 | 5a0df659 | Stratos Psomadakis | * Install the ceph-common package |
316 | 5a0df659 | Stratos Psomadakis | * Upgrade to an up-to-date kernel (>=3.x) |
317 | 5a0df659 | Stratos Psomadakis | * Install linux-headers for the new kernel |
318 | 5a0df659 | Stratos Psomadakis | * Check out the latest ceph-client git repo: |
319 | 5a0df659 | Stratos Psomadakis | git clone git://github.com/NewDreamNetwork/ceph-client.git |
320 | 5a0df659 | Stratos Psomadakis | * Copy the necessary ceph header files to linux-headers: |
321 | 5a0df659 | Stratos Psomadakis | cp -r ceph-client/include/linux/ceph/* /usr/src/linux-$(uname -r)/include/linux/ceph/ |
322 | 5a0df659 | Stratos Psomadakis | * Build the modules: |
323 | 5a0df659 | Stratos Psomadakis | cd ~/ceph-client/net/ceph/ |
324 | 5a0df659 | Stratos Psomadakis | make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) libceph.ko |
325 | 5a0df659 | Stratos Psomadakis | cp Module.symvers ../../drivers/block/ |
326 | 5a0df659 | Stratos Psomadakis | cd ~/ceph-client/drivers/block/ |
327 | 5a0df659 | Stratos Psomadakis | make -C /usr/src/linux-headers-3.0.0-2-amd64/ M=$(pwd) rbd.ko |
328 | 5a0df659 | Stratos Psomadakis | * Optionally, copy rbd.ko and libceph.ko to /lib/modules/ |
329 | 5a0df659 | Stratos Psomadakis | * Load the modules: |
330 | 5a0df659 | Stratos Psomadakis | modprobe rbd |
331 | 5a0df659 | Stratos Psomadakis | |
332 | 5a0df659 | Stratos Psomadakis | |
333 | 5a0df659 | Stratos Psomadakis | Administration Notes |
334 | 82b5509d | Kostas Papadimitriou | -------------------- |
335 | 5a0df659 | Stratos Psomadakis | |
336 | 5a0df659 | Stratos Psomadakis | This section includes some notes on the RADOS cluster administration. |
337 | 5a0df659 | Stratos Psomadakis | |
338 | 5a0df659 | Stratos Psomadakis | 0. Starting / Stopping servers |
339 | 5a0df659 | Stratos Psomadakis | * service ceph -a start/stop (affects all the servers in the cluster) |
340 | 5a0df659 | Stratos Psomadakis | * service ceph start/stop osd (affects only the osds in the current node) |
341 | 5a0df659 | Stratos Psomadakis | * service ceph start/stop mon (affects only the mons in the current node) |
342 | 5a0df659 | Stratos Psomadakis | * service ceph start/stop osd.$id/mon.$id (affects only the specified node) |
343 | 5a0df659 | Stratos Psomadakis | |
344 | 5a0df659 | Stratos Psomadakis | * service ceph cleanlogs/cleanalllogs |
345 | 5a0df659 | Stratos Psomadakis | |
346 | 5a0df659 | Stratos Psomadakis | 1. Stop the cluster cleanly |
347 | 5a0df659 | Stratos Psomadakis | ceph stop |
348 | 5a0df659 | Stratos Psomadakis | |
349 | 5a0df659 | Stratos Psomadakis | 2. Increase the replication level for a given pool: |
350 | 5a0df659 | Stratos Psomadakis | ceph osd pool set $poolname size $size |
351 | 5a0df659 | Stratos Psomadakis | |
352 | 5a0df659 | Stratos Psomadakis | Note that increasing the replication level increases the replication |
353 | 5a0df659 | Stratos Psomadakis | overhead, which will impact performance. |
354 | 5a0df659 | Stratos Psomadakis | |
355 | 5a0df659 | Stratos Psomadakis | 3. Adjust the number of placement groups per pool: |
356 | 5a0df659 | Stratos Psomadakis | ceph osd pool set $poolname pg_num $num |
357 | 5a0df659 | Stratos Psomadakis | |
358 | 5a0df659 | Stratos Psomadakis | The default number of pgs per pool is determined by the number of OSDs in the |
359 | 5a0df659 | Stratos Psomadakis | cluster, and the replication level of the pool (for 4 OSDs and replication |
360 | 5a0df659 | Stratos Psomadakis | size 2, the default value is 8). The default pools (data,metadata,rbd) are |
361 | 5a0df659 | Stratos Psomadakis | assigned 256 PGs. |
362 | 5a0df659 | Stratos Psomadakis | |
363 | 5a0df659 | Stratos Psomadakis | After the splitting is complete, the number of PGs in the system must be |
364 | 5a0df659 | Stratos Psomadakis | changed. Warning: this is not considered safe on PGs in use (with objects), |
365 | 5a0df659 | Stratos Psomadakis | and should be changed only when the PG is created, and before being used: |
366 | c469ca86 | Kostas Papadimitriou | ceph osd pool set $poolname pgp_num $num |
367 | 5a0df659 | Stratos Psomadakis | |
368 | 5a0df659 | Stratos Psomadakis | 4. Replacing the journal for osd.$id: |
369 | 5a0df659 | Stratos Psomadakis | Edit the osd.$id journal configuration section |
370 | 5a0df659 | Stratos Psomadakis | ceph-osd -i osd.$id --mkjournal |
371 | 5a0df659 | Stratos Psomadakis | ceph-osd -i osd.$id --osd.journal /path/to/journal |
372 | 5a0df659 | Stratos Psomadakis | |
373 | 5a0df659 | Stratos Psomadakis | 5. Add a new OSD: |
374 | 5a0df659 | Stratos Psomadakis | Edit /etc/ceph/ceph.conf to include the new OSD |
375 | 5a0df659 | Stratos Psomadakis | ceph mon getmap -o /tmp/monmap |
376 | 5a0df659 | Stratos Psomadakis | ceph-osd --mkfs -i osd.$id --monmap /tmp/monmap |
377 | 5a0df659 | Stratos Psomadakis | ceph osd setmaxosd [maxosd+1] (ceph osd getmaxosd to get the num of osd if needed) |
378 | 5a0df659 | Stratos Psomadakis | service ceph start osd.$id |
379 | 5a0df659 | Stratos Psomadakis | |
380 | 5a0df659 | Stratos Psomadakis | Generate the CRUSH map to include the new osd in PGs: |
381 | 5a0df659 | Stratos Psomadakis | osdmaptool --createsimple [maxosd] --clobber /tmp/osdmap --export-crush /tmp/crush |
382 | 5a0df659 | Stratos Psomadakis | ceph osd setcrushmap -i /tmp/crush |
383 | 5a0df659 | Stratos Psomadakis | Or edit the CRUSH map by hand: |
384 | 5a0df659 | Stratos Psomadakis | ceph osd getcrushmap -o /tmp/crush |
385 | 5a0df659 | Stratos Psomadakis | crushtool -d /tmp/crush -o crushmap |
386 | 5a0df659 | Stratos Psomadakis | vim crushmap |
387 | 5a0df659 | Stratos Psomadakis | crushtool -c crushmap -o /tmp/crush |
388 | 5a0df659 | Stratos Psomadakis | ceph osd setcrushmap -i /tmp/crush |
389 | 5a0df659 | Stratos Psomadakis | |
390 | 5a0df659 | Stratos Psomadakis | 6. General ceph tool commands: |
391 | 5a0df659 | Stratos Psomadakis | * ceph mon stat (stat mon servers) |
392 | 5a0df659 | Stratos Psomadakis | * ceph mon getmap (get the monmap, use monmaptool to edit) |
393 | 5a0df659 | Stratos Psomadakis | * ceph osd dump (dump osdmap -> pool info, osd info) |
394 | 5a0df659 | Stratos Psomadakis | * ceph osd getmap (get osdmap -> use osdmaptool to edit) |
395 | 5a0df659 | Stratos Psomadakis | * ceph osd lspools |
396 | 5a0df659 | Stratos Psomadakis | * ceph osd stat (stat osd servers) |
397 | 5a0df659 | Stratos Psomadakis | * ceph osd tree (osd server info) |
398 | 5a0df659 | Stratos Psomadakis | * ceph pg dump/stat (show info about PGs) |
399 | 5a0df659 | Stratos Psomadakis | |
400 | 5a0df659 | Stratos Psomadakis | 7. rados userspace tool: |
401 | 5a0df659 | Stratos Psomadakis | |
402 | 5a0df659 | Stratos Psomadakis | The rados userspace tool (included in ceph-common package), uses librados to |
403 | 5a0df659 | Stratos Psomadakis | communicate with the object store. |
404 | 5a0df659 | Stratos Psomadakis | |
405 | 5a0df659 | Stratos Psomadakis | * rados mkpool [pool] |
406 | 5a0df659 | Stratos Psomadakis | * rados rmpool [pool] |
407 | 5a0df659 | Stratos Psomadakis | * rados df (show usage per pool) |
408 | 5a0df659 | Stratos Psomadakis | * rados lspools (list pools) |
409 | 5a0df659 | Stratos Psomadakis | * rados ls -p [pool] (list objects in [pool]) |
410 | 5a0df659 | Stratos Psomadakis | * rados bench [secs] write|seq -t [concurrent operations] |
411 | 5a0df659 | Stratos Psomadakis | * rados import/export <pool> <dir> (import/export a local directory in a rados pool) |
412 | 5a0df659 | Stratos Psomadakis | |
413 | 5a0df659 | Stratos Psomadakis | 8. rbd userspace tool: |
414 | 5a0df659 | Stratos Psomadakis | |
415 | 5a0df659 | Stratos Psomadakis | The rbd userspace tool (included in the ceph-common package) uses librbd and |
416 | 5a0df659 | Stratos Psomadakis | librados to communicate with the object store. |
417 | 5a0df659 | Stratos Psomadakis | |
418 | 5a0df659 | Stratos Psomadakis | * rbd ls -p [pool] (list RBD images in [pool], default pool = rbd) |
419 | 5a0df659 | Stratos Psomadakis | * rbd info [image] -p [pool] |
420 | 5a0df659 | Stratos Psomadakis | * rbd create [image] --size n (in MB) |
421 | 5a0df659 | Stratos Psomadakis | * rbd rm [image] |
422 | 5a0df659 | Stratos Psomadakis | * rbd export/import [dir] [image] |
423 | 5a0df659 | Stratos Psomadakis | * rbd cp/mv [image] [dest] |
424 | 5a0df659 | Stratos Psomadakis | * rbd resize [image] |
425 | 5a0df659 | Stratos Psomadakis | * rbd map [image] (map an RBD image to a block device using the in-kernel RBD driver) |
426 | 5a0df659 | Stratos Psomadakis | * rbd unmap /dev/rbdx (unmap an RBD device) |
427 | 5a0df659 | Stratos Psomadakis | * rbd showmapped |
428 | 5a0df659 | Stratos Psomadakis | |
429 | 5a0df659 | Stratos Psomadakis | 9. In-kernel RBD driver |
430 | 5a0df659 | Stratos Psomadakis | |
431 | 5a0df659 | Stratos Psomadakis | The in-kernel RBD driver can be used to map and unmap RBD images as block |
432 | 5a0df659 | Stratos Psomadakis | devices. Once mapped, they will appear as /dev/rbdX, and a symlink will be |
433 | 5a0df659 | Stratos Psomadakis | created in /dev/rbd/[poolname]/[imagename]:[bdev id]. |
434 | 5a0df659 | Stratos Psomadakis | |
435 | 5a0df659 | Stratos Psomadakis | It also exports a sysfs interface, under /sys/bus/rbd/ which can be used to |
436 | 5a0df659 | Stratos Psomadakis | add / remove / list devices, although the rbd map/unmap/showmapped commands |
437 | 5a0df659 | Stratos Psomadakis | are preferred. |
438 | 5a0df659 | Stratos Psomadakis | |
439 | 5a0df659 | Stratos Psomadakis | The RBD module depends on the net/ceph/libceph module, which implements the |
440 | 5a0df659 | Stratos Psomadakis | communication with the object store in the kernel. |
441 | 5a0df659 | Stratos Psomadakis | |
442 | 5a0df659 | Stratos Psomadakis | 10. Qemu-RBD driver |
443 | 5a0df659 | Stratos Psomadakis | |
444 | 5a0df659 | Stratos Psomadakis | The Qemu-RBD driver can be used directly by Qemu-KVM to access RBD images as |
445 | 5a0df659 | Stratos Psomadakis | block devices inside VMs. It currently supports a feature not present in the |
446 | 5a0df659 | Stratos Psomadakis | in-kernel RBD driver (writeback_window). |
447 | 5a0df659 | Stratos Psomadakis | |
448 | 5a0df659 | Stratos Psomadakis | It can be configured via libvirt, and the configuration looks like this: |
449 | 5a0df659 | Stratos Psomadakis | |
450 | c469ca86 | Kostas Papadimitriou | .. code-block:: xml |
451 | c469ca86 | Kostas Papadimitriou | |
452 | 5a0df659 | Stratos Psomadakis | <disk type='network' device='disk'> |
453 | 5a0df659 | Stratos Psomadakis | <driver name='qemu' type='raw'/> |
454 | 5a0df659 | Stratos Psomadakis | <source protocol='rbd' name='[pool]/[image]:rbd_writeback_window=8000000'/> |
455 | 5a0df659 | Stratos Psomadakis | <target dev='vda' bus='virtio'/> |
456 | 5a0df659 | Stratos Psomadakis | </disk> |
457 | 5a0df659 | Stratos Psomadakis | |
458 | 5a0df659 | Stratos Psomadakis | Note: it requires an up-to-date version of libvirt, plus a Qemu/KVM |
459 | 5a0df659 | Stratos Psomadakis | version that is not included in Debian. |
460 | 5a0df659 | Stratos Psomadakis | |
461 | 5a0df659 | Stratos Psomadakis | 11. Logging and Debugging: |
462 | 5a0df659 | Stratos Psomadakis | For command-line tools (ceph, rados, rbd), you can specify debug options in |
463 | 5a0df659 | Stratos Psomadakis | the form of --debug-[component]=n, which will override the options in the |
464 | 5a0df659 | Stratos Psomadakis | config file. In order to get any output when using the cli debug options, |
465 | 5a0df659 | Stratos Psomadakis | you must also use --log-to-stderr. |
466 | 5a0df659 | Stratos Psomadakis | |
467 | 5a0df659 | Stratos Psomadakis | rados ls -p rbd --log-to-stderr --debug-ms=1 --debug-rados=20 |
468 | 5a0df659 | Stratos Psomadakis | |
469 | 5a0df659 | Stratos Psomadakis | Ceph log files are located in /var/log/ceph/mon.$id and |
470 | 5a0df659 | Stratos Psomadakis | /var/log/ceph/osd.$id. |
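
As a worked illustration of the monitor majority rule described under "Node types", the following minimal sketch computes how many ceph-mon daemons must be healthy for a decision to be taken, and how many can fail, for a given monitor count. It assumes a plain strict-majority rule; the function names `quorum_size` and `tolerated_failures` are hypothetical helpers written for this example and are not part of any Ceph tool or library.

```python
def quorum_size(num_mons: int) -> int:
    """Return the smallest strict majority of num_mons monitors."""
    if num_mons < 1:
        raise ValueError("a cluster needs at least one monitor")
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """Return how many monitors can fail while a majority stays reachable."""
    return num_mons - quorum_size(num_mons)


if __name__ == "__main__":
    # Compare small monitor counts side by side.
    for n in range(1, 6):
        print(f"{n} mons: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this rule, going from 3 to 4 monitors raises the required majority from 2 to 3 without tolerating any additional failure, which is one way to see why an odd number of ceph-mon daemons is preferred.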