+Ganeti walk-through
+===================
+
+Documents Ganeti version |version|
+
+.. contents::
+
+.. highlight:: text
+
+Introduction
+------------
+
+This document serves as a more example-oriented guide to Ganeti; while
+the administration guide shows a conceptual approach, here you will find
+a step-by-step example to managing instances and the cluster.
+
+Our simulated example cluster will have three machines, named
+``node1``, ``node2``, ``node3``. Note that in real life machines will
+usually have FQDNs but here we use short names for brevity. We use a
+secondary network for replication data, ``192.168.2.0/24``, with nodes
+having the last octet the same as their index. The cluster name will be
+``example-cluster``. All nodes have the same simulated hardware
+configuration, two disks of 750GB, 32GB of memory and 4 CPUs.
+
+On this cluster, we will create up to seven instances, named
+``instance1`` to ``instance7``.
+
+
+Cluster creation
+----------------
+
+Follow the :doc:`install` document and prepare the nodes. Then it's time
+to initialise the cluster::
+
+ node1# gnt-cluster init -s 192.168.2.1 --enabled-hypervisors=xen-pvm example-cluster
+ node1#
+
+The creation was fine. Let's check that the one node we have is
+functioning correctly::
+
+ node1# gnt-node list
+ Node DTotal DFree MTotal MNode MFree Pinst Sinst
+ node1 1.3T 1.3T 32.0G 1.0G 30.5G 0 0
+ node1# gnt-cluster verify
+ Mon Oct 26 02:08:51 2009 * Verifying global settings
+ Mon Oct 26 02:08:51 2009 * Gathering data (1 nodes)
+ Mon Oct 26 02:08:52 2009 * Verifying node status
+ Mon Oct 26 02:08:52 2009 * Verifying instance status
+ Mon Oct 26 02:08:52 2009 * Verifying orphan volumes
+ Mon Oct 26 02:08:52 2009 * Verifying remaining instances
+ Mon Oct 26 02:08:52 2009 * Verifying N+1 Memory redundancy
+ Mon Oct 26 02:08:52 2009 * Other Notes
+ Mon Oct 26 02:08:52 2009 * Hooks Results
+ node1#
+
+Since this proceeded correctly, let's add the other two nodes::
+
+ node1# gnt-node add -s 192.168.2.2 node2
+ -- WARNING --
+ Performing this operation is going to replace the ssh daemon keypair
+ on the target machine (node2) with the ones of the current one
+ and grant full intra-cluster ssh root access to/from it
+
+ The authenticity of host 'node2 (192.168.1.2)' can't be established.
+ RSA key fingerprint is 9f:…
+ Are you sure you want to continue connecting (yes/no)? yes
+ root@node2's password:
+ Mon Oct 26 02:11:54 2009 - INFO: Node will be a master candidate
+ node1# gnt-node add -s 192.168.2.3 node3
+ -- WARNING --
+ Performing this operation is going to replace the ssh daemon keypair
+ on the target machine (node3) with the ones of the current one
+ and grant full intra-cluster ssh root access to/from it
+
+ The authenticity of host 'node3 (192.168.1.3)' can't be established.
+ RSA key fingerprint is 9f:…
+ Are you sure you want to continue connecting (yes/no)? yes
+ root@node3's password:
+ Mon Oct 26 02:11:54 2009 - INFO: Node will be a master candidate
+
+Checking the cluster status again::
+
+ node1# gnt-node list
+ Node DTotal DFree MTotal MNode MFree Pinst Sinst
+ node1 1.3T 1.3T 32.0G 1.0G 30.5G 0 0
+ node2 1.3T 1.3T 32.0G 1.0G 30.5G 0 0
+ node3 1.3T 1.3T 32.0G 1.0G 30.5G 0 0
+ node1# gnt-cluster verify
+ Mon Oct 26 02:15:14 2009 * Verifying global settings
+ Mon Oct 26 02:15:14 2009 * Gathering data (3 nodes)
+ Mon Oct 26 02:15:16 2009 * Verifying node status
+ Mon Oct 26 02:15:16 2009 * Verifying instance status
+ Mon Oct 26 02:15:16 2009 * Verifying orphan volumes
+ Mon Oct 26 02:15:16 2009 * Verifying remaining instances
+ Mon Oct 26 02:15:16 2009 * Verifying N+1 Memory redundancy
+ Mon Oct 26 02:15:16 2009 * Other Notes
+ Mon Oct 26 02:15:16 2009 * Hooks Results
+ node1#
+
+And let's check that we have a valid OS::
+
+ node1# gnt-os list
+ Name
+ debootstrap
+ node1#
+
+Running a burnin
+----------------
+
+Now that the cluster is created, it is time to check that the hardware
+works correctly, that the hypervisor can actually create instances,
+etc. This is done via the *burnin* tool (using the ``debootstrap`` OS),
+as described in the admin guide. Similar output lines are replaced with
+``…`` in the log below::
+
+ node1# /usr/lib/ganeti/tools/burnin -o debootstrap -p instance{1..5}
+ - Testing global parameters
+ - Creating instances
+ * instance instance1
+ on node1, node2
+ * instance instance2
+ on node2, node3
+ …
+ * instance instance5
+ on node2, node3
+ * Submitted job ID(s) 157, 158, 159, 160, 161
+ waiting for job 157 for instance1
+ …
+ waiting for job 161 for instance5
+ - Replacing disks on the same nodes
+ * instance instance1
+ run replace_on_secondary
+ run replace_on_primary
+ …
+ * instance instance5
+ run replace_on_secondary
+ run replace_on_primary
+ * Submitted job ID(s) 162, 163, 164, 165, 166
+ waiting for job 162 for instance1
+ …
+ - Changing the secondary node
+ * instance instance1
+ run replace_new_secondary node3
+ * instance instance2
+ run replace_new_secondary node1
+ …
+ * instance instance5
+ run replace_new_secondary node1
+ * Submitted job ID(s) 167, 168, 169, 170, 171
+ waiting for job 167 for instance1
+ …
+ - Growing disks
+ * instance instance1
+ increase disk/0 by 128 MB
+ …
+ * instance instance5
+ increase disk/0 by 128 MB
+ * Submitted job ID(s) 173, 174, 175, 176, 177
+ waiting for job 173 for instance1
+ …
+ - Failing over instances
+ * instance instance1
+ …
+ * instance instance5
+ * Submitted job ID(s) 179, 180, 181, 182, 183
+ waiting for job 179 for instance1
+ …
+ - Migrating instances
+ * instance instance1
+ migration and migration cleanup
+ …
+ * instance instance5
+ migration and migration cleanup
+ * Submitted job ID(s) 184, 185, 186, 187, 188
+ waiting for job 184 for instance1
+ …
+ - Exporting and re-importing instances
+ * instance instance1
+ export to node node3
+ remove instance
+ import from node3 to node1, node2
+ remove export
+ …
+ * instance instance5
+ export to node node1
+ remove instance
+ import from node1 to node2, node3
+ remove export
+ * Submitted job ID(s) 196, 197, 198, 199, 200
+ waiting for job 196 for instance1
+ …
+ - Reinstalling instances
+ * instance instance1
+ reinstall without passing the OS
+ reinstall specifying the OS
+ …
+ * instance instance5
+ reinstall without passing the OS
+ reinstall specifying the OS
+ * Submitted job ID(s) 203, 204, 205, 206, 207
+ waiting for job 203 for instance1
+ …
+ - Rebooting instances
+ * instance instance1
+ reboot with type 'hard'
+ reboot with type 'soft'
+ reboot with type 'full'
+ …
+ * instance instance5
+ reboot with type 'hard'
+ reboot with type 'soft'
+ reboot with type 'full'
+ * Submitted job ID(s) 208, 209, 210, 211, 212
+ waiting for job 208 for instance1
+ …
+ - Adding and removing disks
+ * instance instance1
+ adding a disk
+ removing last disk
+ …
+ * instance instance5
+ adding a disk
+ removing last disk
+ * Submitted job ID(s) 213, 214, 215, 216, 217
+ waiting for job 213 for instance1
+ …
+ - Adding and removing NICs
+ * instance instance1
+ adding a NIC
+ removing last NIC
+ …
+ * instance instance5
+ adding a NIC
+ removing last NIC
+ * Submitted job ID(s) 218, 219, 220, 221, 222
+ waiting for job 218 for instance1
+ …
+ - Activating/deactivating disks
+ * instance instance1
+ activate disks when online
+ activate disks when offline
+ deactivate disks (when offline)
+ …
+ * instance instance5
+ activate disks when online
+ activate disks when offline
+ deactivate disks (when offline)
+ * Submitted job ID(s) 223, 224, 225, 226, 227
+ waiting for job 223 for instance1
+ …
+ - Stopping and starting instances
+ * instance instance1
+ …
+ * instance instance5
+ * Submitted job ID(s) 230, 231, 232, 233, 234
+ waiting for job 230 for instance1
+ …
+ - Removing instances
+ * instance instance1
+ …
+ * instance instance5
+ * Submitted job ID(s) 235, 236, 237, 238, 239
+ waiting for job 235 for instance1
+ …
+ node1#
+
+The above shows the operations that the burnin exercises. Ideally, the
+burnin run proceeds successfully through all the steps and ends
+cleanly, without throwing errors.
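+
+A quick way to double-check that the burnin left the cluster clean (it
+removes its test instances at the end) is to list the instances again
+and re-run cluster verify; the instance list should be empty and verify
+should report no errors::
+
+ node1# gnt-instance list
+ node1# gnt-cluster verify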
+
+Instance operations
+-------------------
+
+Creation
+++++++++
+
+At this point, Ganeti and the hardware seem to be functioning
+correctly, so we'll follow up with creating the instances manually::
+
+ node1# gnt-instance add -t drbd -o debootstrap -s 256m instance1
+ Mon Oct 26 04:06:52 2009 - INFO: Selected nodes for instance instance1 via iallocator hail: node2, node3
+ Mon Oct 26 04:06:53 2009 * creating instance disks...
+ Mon Oct 26 04:06:57 2009 adding instance instance1 to cluster config
+ Mon Oct 26 04:06:57 2009 - INFO: Waiting for instance instance1 to sync disks.
+ Mon Oct 26 04:06:57 2009 - INFO: - device disk/0: 20.00% done, 4 estimated seconds remaining
+ Mon Oct 26 04:07:01 2009 - INFO: Instance instance1's disks are in sync.
+ Mon Oct 26 04:07:01 2009 creating os for instance instance1 on node node2
+ Mon Oct 26 04:07:01 2009 * running the instance OS create scripts...
+ Mon Oct 26 04:07:14 2009 * starting instance...
+ node1# gnt-instance add -t drbd -o debootstrap -s 256m -n node1:node2 instance2
+ Mon Oct 26 04:11:37 2009 * creating instance disks...
+ Mon Oct 26 04:11:40 2009 adding instance instance2 to cluster config
+ Mon Oct 26 04:11:41 2009 - INFO: Waiting for instance instance2 to sync disks.
+ Mon Oct 26 04:11:41 2009 - INFO: - device disk/0: 35.40% done, 1 estimated seconds remaining
+ Mon Oct 26 04:11:42 2009 - INFO: - device disk/0: 58.50% done, 1 estimated seconds remaining
+ Mon Oct 26 04:11:43 2009 - INFO: - device disk/0: 86.20% done, 0 estimated seconds remaining
+ Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 92.40% done, 0 estimated seconds remaining
+ Mon Oct 26 04:11:44 2009 - INFO: - device disk/0: 97.00% done, 0 estimated seconds remaining
+ Mon Oct 26 04:11:44 2009 - INFO: Instance instance2's disks are in sync.
+ Mon Oct 26 04:11:44 2009 creating os for instance instance2 on node node1
+ Mon Oct 26 04:11:44 2009 * running the instance OS create scripts...
+ Mon Oct 26 04:11:57 2009 * starting instance...
+ node1#
+
+The above shows one instance created via an iallocator script, and one
+created with manual node assignment. The other three instances were
+created in the same way; now it's time to check them::
+
+ node1# gnt-instance list
+ Instance Hypervisor OS Primary_node Status Memory
+ instance1 xen-pvm debootstrap node2 running 128M
+ instance2 xen-pvm debootstrap node1 running 128M
+ instance3 xen-pvm debootstrap node1 running 128M
+ instance4 xen-pvm debootstrap node3 running 128M
+ instance5 xen-pvm debootstrap node2 running 128M
+
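+If you want to double-check which nodes each instance ended up on (for
+example, to see what the iallocator chose versus the manual placement),
+the ``-o`` option of ``gnt-instance list`` selects custom output
+fields. A minimal sketch, assuming the field names ``name``, ``pnode``
+and ``snodes`` (see ``gnt-instance list --help`` for the exact fields
+your version supports)::
+
+ node1# gnt-instance list -o name,pnode,snodes
+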
+Accessing instances
++++++++++++++++++++
+
+Accessing an instance's console is easy::
+
+ node1# gnt-instance console instance2
+ [ 0.000000] Bootdata ok (command line is root=/dev/sda1 ro)
+ [ 0.000000] Linux version 2.6…
+ [ 0.000000] BIOS-provided physical RAM map:
+ [ 0.000000] Xen: 0000000000000000 - 0000000008800000 (usable)
+ [13138176.018071] Built 1 zonelists. Total pages: 34816
+ [13138176.018074] Kernel command line: root=/dev/sda1 ro
+ [13138176.018694] Initializing CPU#0
+ …
+ Checking file systems...fsck 1.41.3 (12-Oct-2008)
+ done.
+ Setting kernel variables (/etc/sysctl.conf)...done.
+ Mounting local filesystems...done.
+ Activating swapfile swap...done.
+ Setting up networking....
+ Configuring network interfaces...done.
+ Setting console screen modes and fonts.
+ INIT: Entering runlevel: 2
+ Starting enhanced syslogd: rsyslogd.
+ Starting periodic command scheduler: crond.
+
+ Debian GNU/Linux 5.0 instance2 tty1
+
+ instance2 login:
+
+At this moment you can log in to the instance and, after configuring
+the network (and doing this on all instances), we can check their
+connectivity::
+
+ node1# fping instance{1..5}
+ instance1 is alive
+ instance2 is alive
+ instance3 is alive
+ instance4 is alive
+ instance5 is alive
+ node1#
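+
+For reference, the in-instance network configuration mentioned above
+depends entirely on the OS image; for the Debian-based ``debootstrap``
+instances used here, a static setup could look like the sketch below
+(the addresses are purely illustrative assumptions, not something
+Ganeti sets up for you)::
+
+ instance1:~# cat /etc/network/interfaces
+ auto lo
+ iface lo inet loopback
+
+ auto eth0
+ iface eth0 inet static
+     address 192.0.2.11
+     netmask 255.255.255.0
+     gateway 192.0.2.1
+ instance1:~# ifup eth0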
+
+Removal
++++++++
+
+Removing unwanted instances is also easy::
+
+ node1# gnt-instance remove instance5
+ This will remove the volumes of the instance instance5 (including
+ mirrors), thus removing all the data of the instance. Continue?
+ y/[n]/?: y
+ node1#
+
+
+Recovering from hardware failures
+---------------------------------
+
+Recovering from node failure
+++++++++++++++++++++++++++++
+
+We are now left with four instances. Assume that at this point, node3,
+which has one primary and one secondary instance, crashes::
+
+ node1# gnt-node info node3
+ Node name: node3
+ primary ip: 172.24.227.1
+ secondary ip: 192.168.2.3
+ master candidate: True
+ drained: False
+ offline: False
+ primary for instances:
+ - instance4
+ secondary for instances:
+ - instance1
+ node1# fping node3
+ node3 is unreachable
+
+At this point, the primary instance of that node (instance4) is down,
+but the secondary instance (instance1) is not affected, except that it
+has lost disk redundancy::
+
+ node1# fping instance{1,4}
+ instance1 is alive
+ instance4 is unreachable
+ node1#
+
+If we try to check the status of instance4 via the instance info
+command, it fails because it tries to contact node3 which is down::
+
+ node1# gnt-instance info instance4
+ Failure: command execution error:
+ Error checking node node3: Connection failed (113: No route to host)
+ node1#
+
+So we need to mark node3 as being *offline*, so that Ganeti won't try
+to talk to it anymore::
+
+ node1# gnt-node modify -O yes -f node3
+ Mon Oct 26 04:34:12 2009 - WARNING: Not enough master candidates (desired 10, new value will be 2)
+ Mon Oct 26 04:34:15 2009 - WARNING: Communication failure to node node3: Connection failed (113: No route to host)
+ Modified node node3
+ - offline -> True
+ - master_candidate -> auto-demotion due to offline
+ node1#
+
+And now we can failover the instance::
+
+ node1# gnt-instance failover instance4
+ Failover will happen to image instance4. This requires a shutdown of
+ the instance. Continue?
+ y/[n]/?: y
+ Mon Oct 26 04:35:34 2009 * checking disk consistency between source and target
+ Failure: command execution error:
+ Disk disk/0 is degraded on target node, aborting failover.
+ node1# gnt-instance failover --ignore-consistency instance4
+ Failover will happen to image instance4. This requires a shutdown of
+ the instance. Continue?
+ y/[n]/?: y
+ Mon Oct 26 04:35:47 2009 * checking disk consistency between source and target
+ Mon Oct 26 04:35:47 2009 * shutting down instance on source node
+ Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown instance instance4 on node node3. Proceeding anyway. Please make sure node node3 is down. Error details: Node is marked offline
+ Mon Oct 26 04:35:47 2009 * deactivating the instance's disks on source node
+ Mon Oct 26 04:35:47 2009 - WARNING: Could not shutdown block device disk/0 on node node3: Node is marked offline
+ Mon Oct 26 04:35:47 2009 * activating the instance's disks on target node
+ Mon Oct 26 04:35:47 2009 - WARNING: Could not prepare block device disk/0 on node node3 (is_primary=False, pass=1): Node is marked offline
+ Mon Oct 26 04:35:48 2009 * starting the instance on the target node
+ node1#
+
+Note that in our first attempt, Ganeti refused to do the failover since
+it wasn't sure what the status of the instance's disks was. Passing the
+``--ignore-consistency`` flag allowed the failover to proceed, and the
+instance list now shows::
+
+ node1# gnt-instance list
+ Instance Hypervisor OS Primary_node Status Memory
+ instance1 xen-pvm debootstrap node2 running 128M
+ instance2 xen-pvm debootstrap node1 running 128M
+ instance3 xen-pvm debootstrap node1 running 128M
+ instance4 xen-pvm debootstrap node1 running 128M
+ node1#
+
+But at this point, both instance1 and instance4 are without disk
+redundancy::
+
+ node1# gnt-instance info instance1
+ Instance name: instance1
+ UUID: 45173e82-d1fa-417c-8758-7d582ab7eef4
+ Serial number: 2
+ Creation time: 2009-10-26 04:06:57
+ Modification time: 2009-10-26 04:07:14
+ State: configured to be up, actual state is up
+ Nodes:
+ - primary: node2
+ - secondaries: node3
+ Operating system: debootstrap
+ Allocated network port: None
+ Hypervisor: xen-pvm
+ - root_path: default (/dev/sda1)
+ - kernel_args: default (ro)
+ - use_bootloader: default (False)
+ - bootloader_args: default ()
+ - bootloader_path: default ()
+ - kernel_path: default (/boot/vmlinuz-2.6-xenU)
+ - initrd_path: default ()
+ Hardware:
+ - VCPUs: 1
+ - memory: 128MiB
+ - NICs:
+ - nic/0: MAC: aa:00:00:78:da:63, IP: None, mode: bridged, link: xen-br0
+ Disks:
+ - disk/0: drbd8, size 256M
+ access mode: rw
+ nodeA: node2, minor=0
+ nodeB: node3, minor=0
+ port: 11035
+ auth key: 8e950e3cec6854b0181fbc3a6058657701f2d458
+ on primary: /dev/drbd0 (147:0) in sync, status *DEGRADED*
+ child devices:
+ - child 0: lvm, size 256M
+ logical_id: xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_data
+ on primary: /dev/xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_data (254:0)
+ - child 1: lvm, size 128M
+ logical_id: xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta
+ on primary: /dev/xenvg/22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta (254:1)
+
+The output is similar for instance4. In order to recover this, we need
+to run the node evacuate command, which will change the instances'
+current secondary node to a new one (in this case, we only have two
+working nodes, so all instances will end up on nodes one and two)::
+
+ node1# gnt-node evacuate -I hail node3
+ Relocate instance(s) 'instance1','instance4' from node
+ node3 using iallocator hail?
+ y/[n]/?: y
+ Mon Oct 26 05:05:39 2009 - INFO: Selected new secondary for instance 'instance1': node1
+ Mon Oct 26 05:05:40 2009 - INFO: Selected new secondary for instance 'instance4': node2
+ Mon Oct 26 05:05:40 2009 Replacing disk(s) 0 for instance1
+ Mon Oct 26 05:05:40 2009 STEP 1/6 Check device existence
+ Mon Oct 26 05:05:40 2009 - INFO: Checking disk/0 on node2
+ Mon Oct 26 05:05:40 2009 - INFO: Checking volume groups
+ Mon Oct 26 05:05:40 2009 STEP 2/6 Check peer consistency
+ Mon Oct 26 05:05:40 2009 - INFO: Checking disk/0 consistency on node node2
+ Mon Oct 26 05:05:40 2009 STEP 3/6 Allocate new storage
+ Mon Oct 26 05:05:40 2009 - INFO: Adding new local storage on node1 for disk/0
+ Mon Oct 26 05:05:41 2009 STEP 4/6 Changing drbd configuration
+ Mon Oct 26 05:05:41 2009 - INFO: activating a new drbd on node1 for disk/0
+ Mon Oct 26 05:05:42 2009 - INFO: Shutting down drbd for disk/0 on old node
+ Mon Oct 26 05:05:42 2009 - WARNING: Failed to shutdown drbd for disk/0 on oldnode: Node is marked offline
+ Mon Oct 26 05:05:42 2009 Hint: Please cleanup this device manually as soon as possible
+ Mon Oct 26 05:05:42 2009 - INFO: Detaching primary drbds from the network (=> standalone)
+ Mon Oct 26 05:05:42 2009 - INFO: Updating instance configuration
+ Mon Oct 26 05:05:45 2009 - INFO: Attaching primary drbds to new secondary (standalone => connected)
+ Mon Oct 26 05:05:46 2009 STEP 5/6 Sync devices
+ Mon Oct 26 05:05:46 2009 - INFO: Waiting for instance instance1 to sync disks.
+ Mon Oct 26 05:05:46 2009 - INFO: - device disk/0: 13.90% done, 7 estimated seconds remaining
+ Mon Oct 26 05:05:53 2009 - INFO: Instance instance1's disks are in sync.
+ Mon Oct 26 05:05:53 2009 STEP 6/6 Removing old storage
+ Mon Oct 26 05:05:53 2009 - INFO: Remove logical volumes for 0
+ Mon Oct 26 05:05:53 2009 - WARNING: Can't remove old LV: Node is marked offline
+ Mon Oct 26 05:05:53 2009 Hint: remove unused LVs manually
+ Mon Oct 26 05:05:53 2009 - WARNING: Can't remove old LV: Node is marked offline
+ Mon Oct 26 05:05:53 2009 Hint: remove unused LVs manually
+ Mon Oct 26 05:05:53 2009 Replacing disk(s) 0 for instance4
+ Mon Oct 26 05:05:53 2009 STEP 1/6 Check device existence
+ Mon Oct 26 05:05:53 2009 - INFO: Checking disk/0 on node1
+ Mon Oct 26 05:05:53 2009 - INFO: Checking volume groups
+ Mon Oct 26 05:05:53 2009 STEP 2/6 Check peer consistency
+ Mon Oct 26 05:05:53 2009 - INFO: Checking disk/0 consistency on node node1
+ Mon Oct 26 05:05:54 2009 STEP 3/6 Allocate new storage
+ Mon Oct 26 05:05:54 2009 - INFO: Adding new local storage on node2 for disk/0
+ Mon Oct 26 05:05:54 2009 STEP 4/6 Changing drbd configuration
+ Mon Oct 26 05:05:54 2009 - INFO: activating a new drbd on node2 for disk/0
+ Mon Oct 26 05:05:55 2009 - INFO: Shutting down drbd for disk/0 on old node
+ Mon Oct 26 05:05:55 2009 - WARNING: Failed to shutdown drbd for disk/0 on oldnode: Node is marked offline
+ Mon Oct 26 05:05:55 2009 Hint: Please cleanup this device manually as soon as possible
+ Mon Oct 26 05:05:55 2009 - INFO: Detaching primary drbds from the network (=> standalone)
+ Mon Oct 26 05:05:55 2009 - INFO: Updating instance configuration
+ Mon Oct 26 05:05:55 2009 - INFO: Attaching primary drbds to new secondary (standalone => connected)
+ Mon Oct 26 05:05:56 2009 STEP 5/6 Sync devices
+ Mon Oct 26 05:05:56 2009 - INFO: Waiting for instance instance4 to sync disks.
+ Mon Oct 26 05:05:56 2009 - INFO: - device disk/0: 12.40% done, 8 estimated seconds remaining
+ Mon Oct 26 05:06:04 2009 - INFO: Instance instance4's disks are in sync.
+ Mon Oct 26 05:06:04 2009 STEP 6/6 Removing old storage
+ Mon Oct 26 05:06:04 2009 - INFO: Remove logical volumes for 0
+ Mon Oct 26 05:06:04 2009 - WARNING: Can't remove old LV: Node is marked offline
+ Mon Oct 26 05:06:04 2009 Hint: remove unused LVs manually
+ Mon Oct 26 05:06:04 2009 - WARNING: Can't remove old LV: Node is marked offline
+ Mon Oct 26 05:06:04 2009 Hint: remove unused LVs manually
+ node1#
+
+And now node3 is completely free of instances and can be repaired::
+
+ node1# gnt-node list
+ Node DTotal DFree MTotal MNode MFree Pinst Sinst
+ node1 1.3T 1.3T 32.0G 1.0G 30.2G 3 1
+ node2 1.3T 1.3T 32.0G 1.0G 30.4G 1 3
+ node3 ? ? ? ? ? 0 0
+
+Re-adding a node to the cluster
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+Let's say node3 has been repaired and is now ready to be
+reused. Re-adding it is simple::
+
+ node1# gnt-node add --readd node3
+ The authenticity of host 'node3 (172.24.227.1)' can't be established.
+ RSA key fingerprint is 9f:2e:5a:2e:e0:bd:00:09:e4:5c:32:f2:27:57:7a:f4.
+ Are you sure you want to continue connecting (yes/no)? yes
+ Mon Oct 26 05:27:39 2009 - INFO: Readding a node, the offline/drained flags were reset
+ Mon Oct 26 05:27:39 2009 - INFO: Node will be a master candidate
+
+And is now working again::
+
+ node1# gnt-node list
+ Node DTotal DFree MTotal MNode MFree Pinst Sinst
+ node1 1.3T 1.3T 32.0G 1.0G 30.2G 3 1
+ node2 1.3T 1.3T 32.0G 1.0G 30.4G 1 3
+ node3 1.3T 1.3T 32.0G 1.0G 30.4G 0 0
+
+.. note:: If you have the ganeti-htools package installed, you can
+ shuffle the instances around to have a better use of the nodes.
+
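+A minimal sketch of such a rebalancing, assuming the ``hbal`` binary
+from ganeti-htools is available on the master node: with ``-L`` it
+talks to the locally running Ganeti daemons and only prints the moves
+it would perform, while adding ``-X`` asks it to execute them. Option
+names can differ between htools versions, so check ``hbal --help``
+first::
+
+ node1# hbal -L
+ node1# hbal -L -X
+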
+Disk failures
++++++++++++++
+
+A disk failure is simpler than a full node failure. First, a single disk
+failure should not cause data loss for any redundant instance; only the
+performance of some instances might be reduced due to more network
+traffic.
+
+Let's take the cluster status in the above listing, and check what volumes
+are in use::
+
+ node1# gnt-node volumes -o phys,instance node2
+ PhysDev Instance
+ /dev/sdb1 instance4
+ /dev/sdb1 instance4
+ /dev/sdb1 instance1
+ /dev/sdb1 instance1
+ /dev/sdb1 instance3
+ /dev/sdb1 instance3
+ /dev/sdb1 instance2
+ /dev/sdb1 instance2
+ node1#
+
+You can see that all instances on node2 have logical volumes on
+``/dev/sdb1``. Let's simulate a disk failure on that disk::
+
+ node1# ssh node2
+ node2# echo offline > /sys/block/sdb/device/state
+ node2# vgs
+ /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
+ /dev/sdb1: read failed after 0 of 4096 at 750153695232: Input/output error
+ /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
+ Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'.
+ Couldn't find all physical volumes for volume group xenvg.
+ /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
+ /dev/sdb1: read failed after 0 of 4096 at 0: Input/output error
+ Couldn't find device with uuid '954bJA-mNL0-7ydj-sdpW-nc2C-ZrCi-zFp91c'.
+ Couldn't find all physical volumes for volume group xenvg.
+ Volume group xenvg not found
+ node2#
+
+At this point, the node's storage is broken; if we examine instance2 we
+get (simplified output shown)::
+
+ node1# gnt-instance info instance2
+ Instance name: instance2
+ State: configured to be up, actual state is up
+ Nodes:
+ - primary: node1
+ - secondaries: node2
+ Disks:
+ - disk/0: drbd8, size 256M
+ on primary: /dev/drbd0 (147:0) in sync, status ok
+ on secondary: /dev/drbd1 (147:1) in sync, status *DEGRADED* *MISSING DISK*
+
+This instance only has its secondary on node2. Let's also verify an
+instance whose primary is node2::
+
+ node1# gnt-instance info instance1
+ Instance name: instance1
+ State: configured to be up, actual state is up
+ Nodes:
+ - primary: node2
+ - secondaries: node1
+ Disks:
+ - disk/0: drbd8, size 256M
+ on primary: /dev/drbd0 (147:0) in sync, status *DEGRADED* *MISSING DISK*
+ on secondary: /dev/drbd3 (147:3) in sync, status ok
+ node1# gnt-instance console instance1
+
+ Debian GNU/Linux 5.0 instance1 tty1
+
+ instance1 login: root
+ Last login: Tue Oct 27 01:24:09 UTC 2009 on tty1
+ instance1:~# date > test
+ instance1:~# sync
+ instance1:~# cat test
+ Tue Oct 27 01:25:20 UTC 2009
+ instance1:~# dmesg|tail
+ [5439785.235448] NET: Registered protocol family 15
+ [5439785.235489] 802.1Q VLAN Support v1.8 Ben Greear <greearb@candelatech.com>
+ [5439785.235495] All bugs added by David S. Miller <davem@redhat.com>
+ [5439785.235517] XENBUS: Device with no driver: device/console/0
+ [5439785.236576] kjournald starting. Commit interval 5 seconds
+ [5439785.236588] EXT3-fs: mounted filesystem with ordered data mode.
+ [5439785.236625] VFS: Mounted root (ext3 filesystem) readonly.
+ [5439785.236663] Freeing unused kernel memory: 172k freed
+ [5439787.533779] EXT3 FS on sda1, internal journal
+ [5440655.065431] eth0: no IPv6 routers present
+ instance1:~#
+
+As you can see, the instance is running fine and doesn't see any disk
+issues. It is now time to fix node2 and re-establish redundancy for the
+involved instances.
+
+.. note:: For Ganeti 2.0 we need to manually fix the volume group on
+ node2 by running ``vgreduce --removemissing xenvg``
+
+::
+
+ node1# gnt-node repair-storage node2 lvm-vg xenvg
+ Mon Oct 26 18:14:03 2009 Repairing storage unit 'xenvg' on node2 ...
+ node1# ssh node2 vgs
+ VG #PV #LV #SN Attr VSize VFree
+ xenvg 1 8 0 wz--n- 673.84G 673.84G
+ node1#
+
+This has removed the 'bad' disk from the volume group, which is now left
+with only one PV. We can now replace the disks for the involved
+instances::
+
+ node1# for i in instance{1..4}; do gnt-instance replace-disks -a $i; done
+ Mon Oct 26 18:15:38 2009 Replacing disk(s) 0 for instance1
+ Mon Oct 26 18:15:38 2009 STEP 1/6 Check device existence
+ Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node1
+ Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 on node2
+ Mon Oct 26 18:15:38 2009 - INFO: Checking volume groups
+ Mon Oct 26 18:15:38 2009 STEP 2/6 Check peer consistency
+ Mon Oct 26 18:15:38 2009 - INFO: Checking disk/0 consistency on node node1
+ Mon Oct 26 18:15:39 2009 STEP 3/6 Allocate new storage
+ Mon Oct 26 18:15:39 2009 - INFO: Adding storage on node2 for disk/0
+ Mon Oct 26 18:15:39 2009 STEP 4/6 Changing drbd configuration
+ Mon Oct 26 18:15:39 2009 - INFO: Detaching disk/0 drbd from local storage
+ Mon Oct 26 18:15:40 2009 - INFO: Renaming the old LVs on the target node
+ Mon Oct 26 18:15:40 2009 - INFO: Renaming the new LVs on the target node
+ Mon Oct 26 18:15:40 2009 - INFO: Adding new mirror component on node2
+ Mon Oct 26 18:15:41 2009 STEP 5/6 Sync devices
+ Mon Oct 26 18:15:41 2009 - INFO: Waiting for instance instance1 to sync disks.
+ Mon Oct 26 18:15:41 2009 - INFO: - device disk/0: 12.40% done, 9 estimated seconds remaining
+ Mon Oct 26 18:15:50 2009 - INFO: Instance instance1's disks are in sync.
+ Mon Oct 26 18:15:50 2009 STEP 6/6 Removing old storage
+ Mon Oct 26 18:15:50 2009 - INFO: Remove logical volumes for disk/0
+ Mon Oct 26 18:15:52 2009 Replacing disk(s) 0 for instance2
+ Mon Oct 26 18:15:52 2009 STEP 1/6 Check device existence
+ …
+ Mon Oct 26 18:16:01 2009 STEP 6/6 Removing old storage
+ Mon Oct 26 18:16:01 2009 - INFO: Remove logical volumes for disk/0
+ Mon Oct 26 18:16:02 2009 Replacing disk(s) 0 for instance3
+ Mon Oct 26 18:16:02 2009 STEP 1/6 Check device existence
+ …
+ Mon Oct 26 18:16:09 2009 STEP 6/6 Removing old storage
+ Mon Oct 26 18:16:09 2009 - INFO: Remove logical volumes for disk/0
+ Mon Oct 26 18:16:10 2009 Replacing disk(s) 0 for instance4
+ Mon Oct 26 18:16:10 2009 STEP 1/6 Check device existence
+ …
+ Mon Oct 26 18:16:18 2009 STEP 6/6 Removing old storage
+ Mon Oct 26 18:16:18 2009 - INFO: Remove logical volumes for disk/0
+ node1#
+
+At this point, all instances should be healthy again.
+
+.. note:: Ganeti 2.0 doesn't have the ``-a`` option to replace-disks, so
+ for it you have to run the loop twice, once over primary instances
+  with argument ``-p`` and once over secondary instances with argument
+ ``-s``, but otherwise the operations are similar::
+
+ node1# gnt-instance replace-disks -p instance1
+ …
+ node1# for i in instance{2..4}; do gnt-instance replace-disks -s $i; done
+
+Common cluster problems
+-----------------------
+
+There are a number of small issues that might appear on a cluster that
+can be solved easily as long as the issue is properly identified. For
+this exercise we will consider the case of node3, which was broken
+previously and re-added to the cluster without reinstallation. Running
+cluster verify on the cluster reports::
+
+ node1# gnt-cluster verify
+ Mon Oct 26 18:30:08 2009 * Verifying global settings
+ Mon Oct 26 18:30:08 2009 * Gathering data (3 nodes)
+ Mon Oct 26 18:30:10 2009 * Verifying node status
+ Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 0 is in use
+ Mon Oct 26 18:30:10 2009 - ERROR: node node3: unallocated drbd minor 1 is in use
+ Mon Oct 26 18:30:10 2009 * Verifying instance status
+ Mon Oct 26 18:30:10 2009 - ERROR: instance instance4: instance should not run on node node3
+ Mon Oct 26 18:30:10 2009 * Verifying orphan volumes
+ Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-791667d07800.disk0_data is unknown
+ Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data is unknown
+ Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta is unknown
+ Mon Oct 26 18:30:10 2009 - ERROR: node node3: volume 22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta is unknown
+ Mon Oct 26 18:30:10 2009 * Verifying remaining instances
+ Mon Oct 26 18:30:10 2009 * Verifying N+1 Memory redundancy
+ Mon Oct 26 18:30:10 2009 * Other Notes
+ Mon Oct 26 18:30:10 2009 * Hooks Results
+ node1#
+
+Instance status
++++++++++++++++
+
+As you can see, *instance4* still has a copy running on node3, because
+we forced the failover when node3 failed. This case is dangerous, as
+the two copies of the instance have the same IP and MAC address,
+wreaking havoc on the network environment and on anyone who tries to
+use it.
+
+Ganeti doesn't directly handle this case. It is recommended to log on
+to node3 and run::
+
+ node3# xm destroy instance4
+
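+Afterwards you can confirm on node3 that the stale copy is indeed gone,
+assuming the classic ``xm`` toolstack used above::
+
+ node3# xm list | grep instance4
+ node3#
+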
+Unallocated DRBD minors
++++++++++++++++++++++++
+
+There are still unallocated DRBD minors on node3. Again, these are not
+handled by Ganeti directly and need to be cleaned up via DRBD commands::
+
+ node3# drbdsetup /dev/drbd0 down
+ node3# drbdsetup /dev/drbd1 down
+ node3#
+
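+To double-check, DRBD's status file should no longer list ``drbd0`` and
+``drbd1`` as configured devices after the two commands above::
+
+ node3# cat /proc/drbd
+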
+Orphan volumes
+++++++++++++++
+
+At this point, the only remaining problem should be the so-called
+*orphan* volumes. These can also appear after an aborted disk
+replacement, or in similar situations where Ganeti was not able to
+recover automatically. Here you need to remove them manually via LVM
+commands::
+
+ node3# lvremove xenvg
+ Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_data"? [y/n]: y
+ Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_data" successfully removed
+ Do you really want to remove active logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta"? [y/n]: y
+ Logical volume "22459cf8-117d-4bea-a1aa-791667d07800.disk0_meta" successfully removed
+ Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data"? [y/n]: y
+ Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_data" successfully removed
+ Do you really want to remove active logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta"? [y/n]: y
+ Logical volume "1aaf4716-e57f-4101-a8d6-03af5da9dc50.disk0_meta" successfully removed
+ node3#
+
+At this point cluster verify shouldn't complain anymore::
+
+ node1# gnt-cluster verify
+ Mon Oct 26 18:37:51 2009 * Verifying global settings
+ Mon Oct 26 18:37:51 2009 * Gathering data (3 nodes)
+ Mon Oct 26 18:37:53 2009 * Verifying node status
+ Mon Oct 26 18:37:53 2009 * Verifying instance status
+ Mon Oct 26 18:37:53 2009 * Verifying orphan volumes
+ Mon Oct 26 18:37:53 2009 * Verifying remaining instances
+ Mon Oct 26 18:37:53 2009 * Verifying N+1 Memory redundancy
+ Mon Oct 26 18:37:53 2009 * Other Notes
+ Mon Oct 26 18:37:53 2009 * Hooks Results
+ node1#
+
+N+1 errors
+++++++++++
+
+Since redundant instances in Ganeti have a primary/secondary model,
+each node needs to keep aside enough memory so that, if one of its peer
+nodes fails, all the instances that have the failed node as primary and
+this node as secondary can be failed over to it. More specifically, if
+instance2 has node1 as primary and node2 as secondary (and node1 and
+node2 do not have any other instances in this layout), then node2 must
+have enough free memory so that if node1 fails, we can failover
+instance2 without any other operations (thus reducing the downtime
+window). Let's increase the memory of the current instances to 4GB, and
+add three new instances, two on node2:node3 with 8GB of RAM and one on
+node1:node2 with 12GB of RAM (numbers chosen so that we run out of
+memory)::
+
+ node1# gnt-instance modify -B memory=4G instance1
+ Modified instance instance1
+ - be/memory -> 4096
+ Please don't forget that these parameters take effect only at the next start of the instance.
+ node1# gnt-instance modify …
+
+ node1# gnt-instance add -t drbd -n node2:node3 -s 512m -B memory=8G -o debootstrap instance5
+ …
+ node1# gnt-instance add -t drbd -n node2:node3 -s 512m -B memory=8G -o debootstrap instance6
+ …
+ node1# gnt-instance add -t drbd -n node1:node2 -s 512m -B memory=12G -o debootstrap instance7
+ node1# gnt-instance reboot --all
+ The reboot will operate on 7 instances.
+ Do you want to continue?
+ Affected instances:
+ instance1
+ instance2
+ instance3
+ instance4
+ instance5
+ instance6
+ instance7
+ y/[n]/?: y
+ Submitted jobs 677, 678, 679, 680, 681, 682, 683
+ Waiting for job 677 for instance1...
+ Waiting for job 678 for instance2...
+ Waiting for job 679 for instance3...
+ Waiting for job 680 for instance4...
+ Waiting for job 681 for instance5...
+ Waiting for job 682 for instance6...
+ Waiting for job 683 for instance7...
+ node1#
+
+We rebooted the instances for the memory changes to take effect. Now
+the cluster looks like this::
+
+ node1# gnt-node list
+ Node DTotal DFree MTotal MNode MFree Pinst Sinst
+ node1 1.3T 1.3T 32.0G 1.0G 6.5G 4 1
+ node2 1.3T 1.3T 32.0G 1.0G 10.5G 3 4
+ node3 1.3T 1.3T 32.0G 1.0G 30.5G 0 2
+ node1# gnt-cluster verify
+ Mon Oct 26 18:59:36 2009 * Verifying global settings
+ Mon Oct 26 18:59:36 2009 * Gathering data (3 nodes)
+ Mon Oct 26 18:59:37 2009 * Verifying node status
+ Mon Oct 26 18:59:37 2009 * Verifying instance status
+ Mon Oct 26 18:59:37 2009 * Verifying orphan volumes
+ Mon Oct 26 18:59:37 2009 * Verifying remaining instances
+ Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
+ Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate failovers should peer node node1 fail
+ Mon Oct 26 18:59:37 2009 * Other Notes
+ Mon Oct 26 18:59:37 2009 * Hooks Results
+ node1#
+
+The cluster verify error above shows that if node1 fails, node2 will
+not have enough memory to accept the failover of all the instances that
+have node1 as primary and node2 as secondary. To solve this, you have a
+number of options:
+
+- try to manually move instances around (but this can become complicated
+ for any non-trivial cluster)
+- try to reduce the memory of some instances to fit the available node
+  memory (see the example after this list)
+- if you have the ganeti-htools package installed, you can run the
+ ``hbal`` tool which will try to compute an automated cluster solution
+ that complies with the N+1 rule
+
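+For the second option, the commands are the ones already used earlier
+in this chapter; for example, shrinking instance7 back to 4GB and
+rebooting it so the change takes effect could look like this::
+
+ node1# gnt-instance modify -B memory=4G instance7
+ node1# gnt-instance reboot instance7
+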
+Network issues
+++++++++++++++
+
+In case a node has problems with the network (usually the secondary
+network, as problems with the primary network will render the node
+unusable for Ganeti commands), it will show up in cluster verify as::
+
+ node1# gnt-cluster verify
+ Mon Oct 26 19:07:19 2009 * Verifying global settings
+ Mon Oct 26 19:07:19 2009 * Gathering data (3 nodes)
+ Mon Oct 26 19:07:23 2009 * Verifying node status
+ Mon Oct 26 19:07:23 2009 - ERROR: node node1: tcp communication with node 'node3': failure using the secondary interface(s)
+ Mon Oct 26 19:07:23 2009 - ERROR: node node2: tcp communication with node 'node3': failure using the secondary interface(s)
+ Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node1': failure using the secondary interface(s)
+ Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node2': failure using the secondary interface(s)
+ Mon Oct 26 19:07:23 2009 - ERROR: node node3: tcp communication with node 'node3': failure using the secondary interface(s)
+ Mon Oct 26 19:07:23 2009 * Verifying instance status
+ Mon Oct 26 19:07:23 2009 * Verifying orphan volumes
+ Mon Oct 26 19:07:23 2009 * Verifying remaining instances
+ Mon Oct 26 19:07:23 2009 * Verifying N+1 Memory redundancy
+ Mon Oct 26 19:07:23 2009 * Other Notes
+ Mon Oct 26 19:07:23 2009 * Hooks Results
+ node1#
+
+This shows that both node1 and node2 have problems contacting node3 over
+the secondary network, and node3 has problems contacting them. From this
+output it can be deduced that since node1 and node2 can communicate
+between themselves, node3 is the one having problems, and you need to
+investigate its network settings/connection.
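+
+One quick way to narrow this down, independently of Ganeti, is to ping
+the secondary addresses directly from the nodes, using the addressing
+scheme from the introduction; the output below is what you would expect
+in the hypothetical failure above::
+
+ node1# fping 192.168.2.2 192.168.2.3
+ 192.168.2.2 is alive
+ 192.168.2.3 is unreachable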
+
+Migration problems
+++++++++++++++++++
+
+Since live migration can sometimes fail and leave the instance in an
+inconsistent state, Ganeti provides a ``--cleanup`` argument to the
+migrate command that does:
+
+- check on which node the instance is actually running (has the
+ command failed before or after the actual migration?)
+- reconfigure the DRBD disks accordingly
+
+It is always safe to run this command as long as the instance has good
+data on its primary node (i.e. not showing as degraded). If so, you can
+simply run::
+
+ node1# gnt-instance migrate --cleanup instance1
+ Instance instance1 will be recovered from a failed migration. Note
+ that the migration procedure (including cleanup) is **experimental**
+ in this version. This might impact the instance if anything goes
+ wrong. Continue?
+ y/[n]/?: y
+ Mon Oct 26 19:13:49 2009 Migrating instance instance1
+ Mon Oct 26 19:13:49 2009 * checking where the instance actually runs (if this hangs, the hypervisor might be in a bad state)
+ Mon Oct 26 19:13:49 2009 * instance confirmed to be running on its primary node (node2)
+ Mon Oct 26 19:13:49 2009 * switching node node1 to secondary mode
+ Mon Oct 26 19:13:50 2009 * wait until resync is done
+ Mon Oct 26 19:13:50 2009 * changing into standalone mode
+ Mon Oct 26 19:13:50 2009 * changing disks into single-master mode
+ Mon Oct 26 19:13:50 2009 * wait until resync is done
+ Mon Oct 26 19:13:51 2009 * done
+ node1#
+
+In-use disks at instance shutdown
++++++++++++++++++++++++++++++++++
+
+If you see something like the following when trying to shutdown or
+deactivate disks for an instance::
+
+ node1# gnt-instance shutdown instance1
+ Mon Oct 26 19:16:23 2009 - WARNING: Could not shutdown block device disk/0 on node node2: drbd0: can't shutdown drbd device: /dev/drbd0: State change failed: (-12) Device is held open by someone\n
+
+It most likely means something is holding open the underlying DRBD
+device. This can be bad if the instance is not running, as it might mean
+that there was concurrent access from both the node and the instance to
+the disks, but not always (e.g. you could only have had the partitions
+activated via ``kpartx``).
+
+To troubleshoot this issue you need to follow standard Linux practices,
+and pay attention to the hypervisor being used:
+
+- check if (in the above example) ``/dev/drbd0`` on node2 is being
+ mounted somewhere (``cat /proc/mounts``)
+- check if the device is not being used by device mapper itself:
+ ``dmsetup ls`` and look for entries of the form ``drbd0pX``, and if so
+  remove them with either ``kpartx -d`` or ``dmsetup remove`` (a
+  combined example follows this list)
+
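+Put together, such a session on node2 for the warning above could look
+like the following sketch (``drbd0`` is taken from the error message;
+the commands only inspect the device and, if needed, remove stale
+partition mappings)::
+
+ node2# grep drbd0 /proc/mounts
+ node2# dmsetup ls | grep drbd0
+ node2# kpartx -d /dev/drbd0
+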
+For Xen, check if it's not using the disks itself::
+
+ node1# xenstore-ls /local/domain/0/backend/vbd|grep -e "domain =" -e physical-device
+ domain = "instance2"
+ physical-device = "93:0"
+ domain = "instance3"
+ physical-device = "93:1"
+ domain = "instance4"
+ physical-device = "93:2"
+ node1#
+
+You can see in the above output that the node exports three disks to
+three instances. The ``physical-device`` key is in major:minor format
+in hexadecimal, and 0x93 represents DRBD's major number (147 in
+decimal). Thus we can see from the above that instance2 has
+``/dev/drbd0``, instance3 ``/dev/drbd1``, and instance4 ``/dev/drbd2``.
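+
+If you want to verify such a mapping by hand, the conversion is simply
+hexadecimal to decimal; for example, for the ``physical-device =
+"93:2"`` entry above, the corresponding device should have major number
+147 and minor number 2, i.e. ``/dev/drbd2``::
+
+ node1# printf "%d:%d\n" 0x93 0x2
+ 147:2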
+
+.. vim: set textwidth=72 :
+.. Local Variables:
+.. mode: rst
+.. fill-column: 72
+.. End: