code.grnet.gr Git - ganeti-local/log

Update NEWS file for the 2.4.2 release

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Use floppy disk and a second CDROM on KVM

Hi all,
this patch will add 3 new KVM parameters and a new option.

New Parameters:
- floppy_image_path = "" -> Specify the floppy image to load as
floppy disk.
- cdrom2_image_path = "" -> Specify a second cdrom image to load on
the system (note: this in not intended to be used as a boot device. To
boot the system from cdrom you must use the "cdrom_image_path"
parameter as always).
- cdrom_disk_type = "" -> it can be one of the kvm supported types as
"ide,scsi,paravirtual,ecc". I introduced this optional parameter to
make possible to specify a different virtual device for cdroms. It is
useful if you want to install a windows system

New option for "boot_device" parameter:
- "floppy": with this value you should be able to boot a KVM
instance from floppy image.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit cc130cc7a60fd5377c032116b0c036ae44639913)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Document the selection of instance kernels

A simple doc patch to document how to configure the kernels for the
instances.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Make root_path an optional hypervisor parameter

This will allow us an easy migration to pv-grub, because a set root_path
confused pv-grub.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Some man page updates

This adds documentation for both the short and long form of many
options (which was inconsistent before: in some cases only the short
form was used, in others only the long form).

Note that the standard this patch adopts is to document both forms as
such:

  {-O|--os-parameters} …

This makes it a bit uglier in complex situations, but the alternatives
considered were not perfect either. Other suggestions (with patches)
welcome.

Additionally, it fixes two doc bugs:

- in gnt-cluster.rst, the --prealloc-wipe-disks section was in the
  middle of a paragraph
- in gnt-instance.rst, a list was not typed correctly, thus it was
  mangled as a single paragraph

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Add 2 new variables to the OS scripts environment

Add INSTANCE_PRIMARY_NODE and INSTANCE_SECONDARY_NODES. These new
values are useful for OS scripts that needs to know the nodes where
the instance lives.. or has lived.

Signed-off-by: Iustin Pop <iustin@google.com>
[iustin@google.com: fixed small issue with SECONDARY_NODES]
Reviewed-by: Iustin Pop <iustin@google.com>

Add --no-wait-for-sync when converting to drbd

Currently, when converting an instance from plain to DRBD, the
instance is blocked during the entire resync period. This patch adds
the --no-wait-for-sync so that the operation finishes as soon as the
DRBD sync has started, without waiting for the entire sync. This makes
the instance available much faster.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Recreate instance disks: allow changing nodes

This patch introduces the option of changing an instance's nodes when
doing the disk recreation. The rationale is that currently if an
instance lives on a node that has gone down and is marked offline,
it's not possible to re-create the disks and reinstall the instance on
a different node without hacking the config file.

Additionally, the LU now locks the instance's nodes (which was not
done before), as we most likely allocate new resources on them.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Rename instance: only show new name when different

It makes not sense to show messages like:
Fri May 6 02:04:01 2011 - INFO: Resolved given name 'instance18' to
'instance18'

So we'll skip the message if the resolved name is identical to the
requested one.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix race condition in LUGroupAssignNodes

The original code would get all node information and their groups
without before acquiring the necessary locks. With this patch the node
information is only retrieved once all locks have been acquired. Groups
are locked optimistically and verified after acquiring the node locks.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Re-wrap and fix formatting issues in gnt-instance.rst

This is mostly rewrapping plus fixing a few small issues in
gnt-instance.rst.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Documentation for the new parameters for KVM

Options added/updated are: cdrom2_image_path, floppy_image_path,
cdrom_disk_type and boot_order.

Signed-off-by: Iustin Pop <iustin@google.com>
[iustin@google.com: small formatting update]
Reviewed-by: Iustin Pop <iustin@google.com>

cmdlib: Fix typo, s/nick/NIC/

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

A small optimisation in cluster verify

This removes (count of instances + count of nodes) lock
acquires/releases.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

A few docstring fixes

At least one generates an epydoc error :)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

luxi: do not handle KeyboardInterrupt

With the current code, it's possible to mistake a ^C for a protocol
error:

node1# gnt-job info 221691
[press ^C]
Unhandled protocol error while talking to the master daemon:
Error while deserializing response:

(and note empty error message).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Handle EPIPE errors while writing to the terminal

This handles EPIPE errors in two places: ToStream (to catch logging
done in GenericMain itself) and in GenericMain (to cover also plain
print statements).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Cluster verify: check for missing bridges

Currently cluster verify doesn't check for bridge information; the
only checks are done at instance create and failover/migrate
time. This means a cluster that seems healthy will fail creation jobs.

This patch implements a simple verification that all nodes (in the
entire cluster, so doesn't work well for multi-group) have all the
required bridges: the default one plus any instance bridge.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

TLReplaceDisks: Use implicit loop for dictionary

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Release unneeded locks while replacing disks

If an iallocator is used, “gnt-instance replace-disks” would acquire the
locks of all nodes (only the allocator will decide which node to use).
Unfortunately the unneeded locks were not released during the operation,
causing unnecessary delays for other jobs.

This patch changes the LU to release unneeded locks and adds assertions.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

locking: Export “list_owned” from lock manager

This is analog to “is_owned” and will be used for assertions.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

gnt-instance: Fix typo in error message

The iallocator parameter is “-I”, not “-i”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

mlock: fail gracefully if libc.so.6 cannot be loaded

This allows noded to continue instead of blowing up if the libc major
number changes.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Allow creating the DRBD metadev in a different VG

This is a simple change to allow specifying a different VG for the
meta device during the creation of instances and addition of disks via
gnt-instance modify.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Make _GenerateDRBD8Branch accept different VG names

This is a small change to make this function take a list of VG names,
instead of a single one.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix WriteFile with unicode data

Unicode is fun, indeed:

>>> len(buffer("abc"))
3
>>> len(buffer(u"abc"))
12

So we can't pass unicode data to buffer(), as the result will be to
write the in-memory (usually UTF-32) representation to disk.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Replace disks: keep the meta device in the same VG

This patch enhances the multi-VG support in replace disks, by keeping
the meta device in the same VG, as opposed to moving it to the data
device VG (note that we don't have a way to create the meta in a
different VG in the first place, but at least we correctly handle a
custom config).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix for multiple VGs - PlainToDrbd and replace-disks

Converting an instance from 'plain' to 'drbd'. The old code would
create the drbd volumes in the default VG and then the renames would
fail. This fix pulls the plain VG names from the existing volumes and
places it into the new disk template.

Running 'replace-disks' has a similar issue with the new disks going
into the wrong VG and then the rename failing.

Their might be a similar issue with 'recreate-disks', but I actually
have no idea what recreate-disks does, so did not look into it.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix potential data-loss in utils.WriteFile

os.write can do incomplete writes, as long as at least some bytes have
been written (like write(2)):

>>> os.write(fd, " " * 1300)
1300
>>> os.write(fd, " " * 1300)
1300
>>> os.write(fd, " " * 1300)
1300
>>> os.write(fd, " " * 1300)
980
>>> os.write(fd, " " * 1300)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
OSError: [Errno 28] No space left on device

Note that incomplete write that only wrote 980 bytes, before the
exception.

To workaround this, we simply iterate until all data is
written. Unittests could be written by using a parameter instead of
hardcoding os.write and checking for incomplete writes.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Improve error messages in cluster verify/OS

A few issues in the clarity of the error messages are fixed:

- "ERROR: node node3: OS API version lenny-image": no preposition
  between the parameter type and the OS name, changed to "for
  lenny-image"

- "API version lenny-image differs from reference node node1: 10, 5
  vs. 10, 20, 5, 15": parameters not sorted in display

- "OS variants list lenny-image differs from reference node node1:
  vs. default, i386": empty sets are not clearly delimited, changed to
  add [] around the sets: "node node1: [] vs. [default, i386]"

- "OS parameters lenny-image differs from reference node node1:
  vs. (u'dhcp', u'Whether to enable (yes) or disable (dhcp)')": ugly
  formatting in the OS parameters list, as we used to just "%s" the
  tuple; now it is "reference node node1: [] vs. [dhcp: Whether to
  enable (yes) or disable (dhcp)]"

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Prevent readding of the master node

This breaks Ganeti in multiple ways. If we don't make the check in
gnt-node itself, then bootstrap.SetupNodeDaemon will restart the
master daemon, making the operation fail:

  node1# gnt-node add --readd node1
  Cannot communicate with the master daemon.
  Is it running and listening for connections?

The check in cmdlib is more of a safety check, as we shouldn't reach
it. If we do (via a bad client), then it will prevent breakage in the
job queue/config handling.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix punctuation in an error message

IIRC we don't use punctuation at the end of error messages.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

cli: Fix wrong argument kind for groups

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Quote filename in gnt-instance.8

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Fix typo in LUGroupAssignNodes

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

gnt-instance info: automatically request locking

Commit dae661a4 added support for controlling the locking, but it
didn't modify the gnt-instance info code, which leads to this command
always showing:

Wed Apr 20 04:10:48 2011 - WARNING: Non-static data requested, locks
need to be acquired

We simply change gnt-instance to request locks whenever we don't use
the static mode.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Document the dependency on OOB for gnt-node power

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix master IP activation in failover with no-voting

Thanks to net.for.hub@gmail.com for reporting this. The logic in
masterd.CheckMasterd did an early return in case of no_voting, hence
skipping the master IP activation. We just change the ifs to not
return but simply continue through the function.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

disk wiping: fix bug in chunk size computation

The current wipe_chunk_size computation is doing min(int_value,
float_value). For small disks (below 10GiB), the actual formula will
result into the float value being chosen. This results into very
interesting behaviour:

Wiping disk 0, offset 102.4, chunk 102.4
Wiping disk 0, offset 204.8, chunk 102.4
…
Wiping disk 0, offset 921.6, chunk 102.4
Wiping disk 0, offset 1024.0, chunk 1.13686837722e-13

Since these are passed to dd via %d, this will result into the call to
dd specifying offset 1024 and count 0, which will fail.

We just need to enforce conversion to int, in order to not get bitten
by floating point rounding errors.

The patch also reorders some logging messages in order to log the
chunk size.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix bug in watcher

If “utils.RunParts” were to raise an exception, a log message was
written and the code continued to run. Due to the exception the
“results” variable would not be defined.

Also change the code to log a backtrace (getting an exception is rather
unlikely and having a backtrace is useful) and update one comment.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Release locks before wiping disks during instance creation

Ganeti 2.3 introduced an optional feature to overwrite an instance's
disks on creation. Unfortunately the code kept all locks while doing the
wipe, slowing down the creation of multiple instances in parallel.

This patch changes the code to wipe the disks only after releasing the
locks.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

utils.WriteFile: Close file before renaming

Issue 154 (http://code.google.com/p/ganeti/issues/detail?id=154)
reported an “Operation not supported” error when writing instance
exports to a mounted CIFS filesystem. Experimentation showed the error
to only occur when using rename(2) on an opened file. Various references
on the web confirmed this observation. Whether or not the problem occurs
can also depend on the CIFS server implementation. In issue 154 it was
Windows 2008 R2.

While not solving all cases, closing the file before renaming helps
alleviating the issue a bit. Unittests are updated.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Fix distcheck

README is not copied to the build tree.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Nicer formatting for group query error

Before this patc the message would look like “Some groups do not exist:
[u'foo', u'bar']”, now it's “Some groups do not exist: foo, bar”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

gnt-instance.8: Fix wrongly formatted title

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Update version in README

Also add a check to Makefile's check-local target.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Merge branch 'stable-2.4' into devel-2.4

* stable-2.4:
  Add error checking and merging for cluster params
  Clarify --force-join parameter message
  Treat empty oob_program param as default
  Fix bug in instance listing with orphan instances
  Fix bug related to log opening failures
  Bump version for 2.4.1 release
  cfgupgrade: Fix critical bug overwriting RAPI users file

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

LUInstanceQueryData: Don't acquire locks unless requested

Until now LUInstanceQueryData always acquired locks for the instance(s)
and nodes involved. In combination with long-running operations this
prevented the use of “gnt-instance info”, even with the “--static”
option. With this patch, locks are only acquired when explicitely
requested in the opcode (like all query operations).

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Increase the lock timeouts before we block-acquire

This has been observed to cause problems on real clusters via the
following mechanism:

- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
  instance
- the watcher starts and submits its query instances opcode which
  wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
  after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
  cannot start until the watcher has finished, even though there's no
  actual operation on that instance

In order to alleviate this problem, we simply increase the max timeout
until lock acquires are sent back to either blocking acquire or
priority increase. The timeout is computed such that we wait ~10 hours
(instead of one) for this to happen, which should be within the
maximum lifetime of a reasonable opcode on a healthy cluster. The
timeout also means that priority increases will happen every half hour.

We also increase the max wait interval to 15 seconds, otherwise we'd
have too many retries with the increased interval.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

daemon.py: move startup log message before prep_fn

Before this, the output in the rapi daemon log was:
2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
at /var/lib/ganeti/rapi/users
2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
startup

Which is confusing, as it might look like the read of the users file
is part of the previous run. This is because we log the 'daemon
startup' message after the prepare_fn, which can log things on its
own.

The patch simply moves the 'daemon startup' message just before
prepare_fn call.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Display the actual memory values in N+1 failures

This changes the display from:
Mon Apr  4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:29:46 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail

To:

Mon Apr  4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr  4 02:32:50 2011   - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail (33536MiB needed,
27910MiB available)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

ssh.VerifyNodeHostname: remove the quiet flag

This is not needed for this function, and can interfere with debugging
of ssh failures.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Add error checking and merging for cluster params

Set the default stderr logging level to WARNING so the relevant output
can be seen.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

RAPI: Document need for Content-type header in requests

This was added to the NEWS file in commit ab221ddf, but never
documented properly.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Fix output for “gnt-job info”

If the result of an opcode was a non-empty dictionary, it
would be impossible to differenciate between input and result:

  Input fields:
    […]
    debug_level: 0
    fields: cluster_name,master_node,volume_group_name
    jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]

Expected output:

  Input fields:
    […]
    debug_level: 0
    fields: cluster_name,master_node,volume_group_name
  Result:
    jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

watcher: Fix misleading usage output

When “ganeti-watcher” is called with an argument, it would hint at
a non-existing “-f” parameter. With this patch the separate usage
string is no longer necessary.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Clarify --force-join parameter message

This isn't only used during cluster merge.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

locking: Fix race condition in lock monitor

In some rare cases it can happen that a lock is re-created very soon
after deletion, while the old instance hasn't been destructed yet. In
such a case the code would detect a duplicate name and raise an
exception.

We have seen at least one case where this happened during the creation
of many instances. It is not exactly clear how it came to be, but it
appears to have occurred while different jobs fought for locks with
short timeouts (in the case of instance creation locks are added at this
stage and removed shortly after if not all locks can be acquired).

The issue is fixed by removing the check for duplicate names. To still
guarantee a stable sort order for the lock information as shown by
“gnt-debug locks”, a registration number is recorded for each lock in
the monitor.

A unittest is included to check for the situation.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

utils: Export NiceSortKey function

The ability to split a string into a list of strings and integers can be
handy elsewhere and is necessary for sorting query results by names.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit f47941f864cf03264d363aebed530480a64e21dd)

Revert "Only merge nodes that are known to not be offline"

This reverts commit 288f240f62dafa8bd8ba7482c8367adbdf6d96c2.

That commit was buggy at various levels:
  - broke ssh access to the second cluster, making cluster-merge
    unusable (unless ssh key were previously setup?)
  - filtered away offline nodes from being added to the cluster config
    (wrong, they should be kept, as offline)
  - broke commit-check

The previous commit makes the code work again with what this commit
tried to achieve.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

cluster-merge: only operate on online nodes

The node list in MergerData is used only to:
- stop ganeti on the nodes
- readd the nodes to the cluster
As such offline nodes should be skipped from it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Only merge nodes that are known to not be offline

Otherwise the readd will fail, breaking the merge.

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Treat empty oob_program param as default

There is currently no way to reset oob_program back to its default from
the cmdline, which causes problems for cluster-merge. This patch means
that the following now works:
gnt-cluster modify --node-parameters oob_program=

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Fix bug in instance listing with orphan instances

Nodes can return unknown instances, so we shouldn't use the name as an
index without checking.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix bug related to log opening failures

If opening the log file fails, then we shouldn't attempt to use that
variable.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Bump version for 2.4.1 release

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

cfgupgrade: Fix critical bug overwriting RAPI users file

The cfgupgrade tool was designed to be idempotent, that means it could
be run several times and still give produce the correct result. Ganeti
2.4 moved the file containing the RAPI users to a separate directory
(…/lib/ganeti/rapi/users). If it exists, cfgupgrade would automatically
move an existing file from …/lib/ganeti/rapi_users and replace it with a
symlink.

Unfortunately one of the checks for this was incorrect and, when run
multiple times, replaces the users file at the new location with a
symlink created during a previous run.

In addition the “--dry-run” parameter to cfgupgrade was not respected.
Unittests are updated for all these cases.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Release 2.4.0

NEWS update and version bump.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Merge branch 'devel-2.3' into devel-2.4

* devel-2.3:
  Fix LUClusterRepairDiskSizes and rpc result usage
  Fix RPC mismatch in blockdev_getsize[s]
  RAPI: fix evacuate node resource

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Small improvement to the ganeti man page

Also specifies the comma-escaping feature.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Merge branch 'devel-2.2' into devel-2.3

* devel-2.2:
Fix LUClusterRepairDiskSizes and rpc result usage
Fix RPC mismatch in blockdev_getsize[s]

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Fix LUClusterRepairDiskSizes and rpc result usage

This LU was introduced before the RPC result conversion from .data to
.payload, and it has managed to keep the old-style usage (how? it's
the only LU that does so). Fix by changing to payload, and add some
extra logging for easier diagnose.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38f4e10b75d0820c361c668c441c7a6980)

Fix RPC mismatch in blockdev_getsize[s]

Commit 92fd2250 added consistency checks in the RPC layer, which broke
the call_blockdev_getsizes RPC call (declared with 's' at the end in
rpc.py, without 's' in the node daemon).

The immediate fix is to correct the rpc function name, the long term
one will be to remove this duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d1546b4f57d5bfeb115573967f7fb558b)

RAPI: fix evacuate node resource

PollJob returns the whole op_results, hence a list of opcode results.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Merge remote branch 'stable-2.4' into devel-2.4

* origin/stable-2.4:
  Fix typo in kvm-ifup script
  NEWS: Replace smartquotes, start lines with uppercase
  Update NEWS and release 2.4.0 rc3
  Fix potential data-loss bug in disk wipe routines

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix typo in kvm-ifup script

Reported-by: Bas Tichelaar <bas@30loops.net>
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

NEWS: Replace smartquotes, start lines with uppercase

- Sphinx converts ASCII quotes ("") to smartquotes (“”) automatically
- Sentences or list items start with an uppercase letter
- Changed description of non-verbose “gnt-* list” output slightly

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Fix LU processor's GetECId

The exception was never actually raised.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>

Update NEWS and release 2.4.0 rc3

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Merge branch 'devel-2.4' into stable-2.4

* devel-2.4:
  1-char comment typo fix
  Expand some acronyms, add to glossary
  query_unittest: Fix argument to set()

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Fix potential data-loss bug in disk wipe routines

For the 2.4 release, we only add the missing RPC calls. However, this
needs to be fixed properly, by preventing usage of mis-configured
disks.

Also add a bit more logging so that it's directly clear on which node
the wipe is being done.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

1-char comment typo fix

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Expand some acronyms, add to glossary

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

query_unittest: Fix argument to set()

Commit e431074f introduced an uncatched bug. This patch fixes this. The
set is expecting a list or iteratable to work on, so it splitted the
provided instance name into a set of characters. This caused the
exp_status never been set and therefore not catched in one assert rule
further below who checks that every status was tested.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix title of query field containing instance name

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Update news and bump version for 2.4.0 rc2

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Merge branch 'devel-2.4' into stable-2.4

* devel-2.4: (23 commits)
  Fix pylint warnings
  Change the list formatting to a 'special' chars
  Add support for merging node groups
  Add option to rename groups on conflict
  Fix minor docstring typo
  Fix HV/OS parameter validation on non-vm nodes
  NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes
  NodeQuery: don't query non-vm_capable nodes
  Remove superfluous redundant requirement
  Don't remove master_candidate flag from merged nodes
  Use a consistent ECID base
  listrunner: convert from getopt to optparse
  listrunner: fix agent usage
  Revert "Disable the cluster-merge tool for the moment"
  Fix cluster-merging by not stopping noded
  Fix error msg for instances on offline nodes
  Minor reordering to match param order
  cluster verify and instance disks on offline nodes
  Cluster verify and N+1 warnings for offline nodes
  Handle gnt-instance shutdown --all for empty clusters
  Use gnt-node add --force-join to add foreign nodes
  Add --force-join option to gnt-node add
  Fix iterating over node groups

Of the above commits present in the devel-2.4 branch, only the “Add
--force-join option to gnt-node add” is a potential issue, but this
has been QA-ed successfully. The other fixes are split in three
groups:

- non-core changes (cluster-merge, listrunner)
- trivial fixes (docstrings, etc.)
- bugs that we want fixed

As such, instead of cherry-picking only individual patches, I propose
that we unify stable and devel 2.4 and make a new RC out of the
result.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix pylint warnings

- 1 80-char line infraction
- 4 changes in how arguments are passed to logging functions
- 3 pylint disable-msg's because cluster-merge needs to access ganeti
config internals

Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

TestRapiInstanceRename use instance name

Currently the QA rename job wrongly passed the whole info dict to the
client.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Change the list formatting to a 'special' chars

And also enable verbose display via the, well, verbose option. Man
page and tests are updated, and the formatting is moved from 4 if
statements to a data structure.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Add support for merging node groups

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Add option to rename groups on conflict

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix minor docstring typo

Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Add QA rapi test for instance reinstall

This tests at least the basic case, unfortunately there is no way to
check all possibilities using the provided rapi client, as that will use
the new method unless the cluster doesn't support it.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

RAPI: remove required parameters for reinstall

Before c744425f354f1bef2d0d7d306e2d00c494d67d2b instance reinstall
accepted the "os" and "nostartup" optional query parameters. With that
commit it was changed to allow "os" "start" and "osparams" via body
rather than encoded in the URL. Unfortunately that commit introduced a
bug, which required the "os" parameter to be passed for body requests,
and at least one of "os" or "nostartup" for query request.

This fix makes sure all parameters are optional again.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Fix HV/OS parameter validation on non-vm nodes

Currently, there is at least one LU that does wrong validation of HV
parameters (against all nodes, LUClusterSetParams). It's possible to
fix this case, but I went and modified the base functions to filter
out non-vm_capable nodes so all callers are protected.

Note: the _CheckOSParams function is never called with all nodes list,
so modifying it shouldn't be needed. However, I think it's safe to do
so (and it shouldn't hurt as an instance's node shouldn't ever lack
the vm_capable bit).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes

Since we don't have the data per design, UNAVAIL is appropriate here,
while NODATA is not.

The patch also adds a comment: if we extend the live fields list to
contain other data in the future, we need to reevaluate this solution.

This should fix issue 143. The listing now shows (node2==ofline,
node3==not vm_capable):

  Node     DTotal     DFree    MTotal     MNode     MFree Pinst Sinst
  node1    698.6G    630.5G     32.0G      1.0G     30.0G     8     7
  node2 (offline) (offline) (offline) (offline) (offline)     9     4
  node3 (unavail) (unavail) (unavail) (unavail) (unavail)     0     0

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

NodeQuery: don't query non-vm_capable nodes

Because non-vm_capable nodes most likely don't have a hypervisor
configured and/or storage, so the call will fail anyway.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix LUClusterRepairDiskSizes and rpc result usage

This LU was introduced before the RPC result conversion from .data to
.payload, and it has managed to keep the old-style usage (how? it's
the only LU that does so). Fix by changing to payload, and add some
extra logging for easier diagnose.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Fix RPC mismatch in blockdev_getsize[s]

Commit 92fd2250 added consistency checks in the RPC layer, which broke
the call_blockdev_getsizes RPC call (declared with 's' at the end in
rpc.py, without 's' in the node daemon).

The immediate fix is to correct the rpc function name, the long term
one will be to remove this duplication.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>