Iustin Pop [Wed, 20 Apr 2011 11:15:23 +0000 (13:15 +0200)]
gnt-instance info: automatically request locking
Commit dae661a4 added support for controlling the locking, but it
didn't modify the gnt-instance info code, which leads to this command
always showing:
Wed Apr 20 04:10:48 2011 - WARNING: Non-static data requested, locks
need to be acquired
We simply change gnt-instance to request locks whenever we don't use
the static mode.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Wed, 20 Apr 2011 09:29:09 +0000 (11:29 +0200)]
Document the dependency on OOB for gnt-node power
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Tue, 19 Apr 2011 16:35:13 +0000 (18:35 +0200)]
Fix master IP activation in failover with no-voting
Thanks to net.for.hub@gmail.com for reporting this. The logic in
masterd.CheckMasterd did an early return in case of no_voting, hence
skipping the master IP activation. We just change the ifs to not
return but simply continue through the function.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Iustin Pop [Tue, 19 Apr 2011 15:31:09 +0000 (17:31 +0200)]
disk wiping: fix bug in chunk size computation
The current wipe_chunk_size computation is doing min(int_value,
float_value). For small disks (below 10GiB), the actual formula will
result in the float value being chosen. This results in very
interesting behaviour:
Wiping disk 0, offset 102.4, chunk 102.4
Wiping disk 0, offset 204.8, chunk 102.4
…
Wiping disk 0, offset 921.6, chunk 102.4
Wiping disk 0, offset 1024.0, chunk 1.13686837722e-13
Since these are passed to dd via %d, this will result in the call to
dd specifying offset 1024 and count 0, which will fail.
We just need to enforce conversion to int, in order to not get bitten
by floating point rounding errors.
The patch also reorders some logging messages in order to log the
chunk size.
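
A minimal sketch of the integer chunk computation described above. The
constant names and values (MAX_WIPE_CHUNK, MIN_WIPE_CHUNK_PERCENT) and the
helper wipe_chunks are assumptions made for illustration and are not taken
from this patch:

```python
# Illustrative sketch only; constants and function name are assumed.
MAX_WIPE_CHUNK = 1024        # MiB, assumed upper bound for one chunk
MIN_WIPE_CHUNK_PERCENT = 10  # assumed percentage of the disk size


def wipe_chunks(size_mib):
    """Yield (offset, chunk) pairs in whole MiB covering the disk."""
    # min(int, float) may return the float; force an int so that
    # offsets and counts stay exact when formatted with %d.
    chunk = int(min(MAX_WIPE_CHUNK,
                    size_mib / 100.0 * MIN_WIPE_CHUNK_PERCENT))
    chunk = max(1, chunk)  # guard against a zero-sized chunk
    offset = 0
    while offset < size_mib:
        this_chunk = min(chunk, size_mib - offset)
        yield offset, this_chunk
        offset += this_chunk
```

With integer arithmetic the last chunk is a small positive integer rather
than a float remainder that %d would truncate to 0.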
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Michael Hanselmann [Tue, 19 Apr 2011 11:38:58 +0000 (13:38 +0200)]
Fix bug in watcher
If “utils.RunParts” were to raise an exception, a log message was
written and the code continued to run. Due to the exception the
“results” variable would not be defined.
Also change the code to log a backtrace (getting an exception is rather
unlikely and having a backtrace is useful) and update one comment.
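
The failure mode can be sketched as follows. The function name
run_watcher_hooks and the run_parts callable are hypothetical stand-ins,
not the watcher's actual code:

```python
import logging


def run_watcher_hooks(run_parts, hooks_dir):
    """Run hooks and return their results, or [] if running them failed."""
    try:
        results = run_parts(hooks_dir)
    except Exception:
        # logging.exception records the message together with a backtrace
        logging.exception("RunParts %s failed", hooks_dir)
        # Return early: "results" was never assigned in this branch
        return []
    return list(results)
```

Returning from the except branch keeps the later code from touching a
variable that was never assigned.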
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Wed, 13 Apr 2011 11:53:55 +0000 (13:53 +0200)]
Release locks before wiping disks during instance creation
Ganeti 2.3 introduced an optional feature to overwrite an instance's
disks on creation. Unfortunately the code kept all locks while doing the
wipe, slowing down the creation of multiple instances in parallel.
This patch changes the code to wipe the disks only after releasing the
locks.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Mon, 11 Apr 2011 13:44:43 +0000 (15:44 +0200)]
utils.WriteFile: Close file before renaming
Issue 154 (http://code.google.com/p/ganeti/issues/detail?id=154)
reported an “Operation not supported” error when writing instance
exports to a mounted CIFS filesystem. Experimentation showed the error
to only occur when using rename(2) on an opened file. Various references
on the web confirmed this observation. Whether or not the problem occurs
can also depend on the CIFS server implementation. In issue 154 it was
Windows 2008 R2.
While not solving all cases, closing the file before renaming helps
alleviate the issue a bit. Unittests are updated.
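
A minimal sketch of the close-before-rename ordering, assuming a plain
temporary-file approach. The helper name write_file_atomically is
hypothetical and this is not the utils.WriteFile implementation:

```python
import os
import tempfile


def write_file_atomically(path, data):
    """Write data to a temporary file, close it, then rename it to path."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(prefix=".new-", dir=dir_name)
    try:
        with os.fdopen(fd, "w") as fobj:
            fobj.write(data)
            fobj.flush()
            os.fsync(fobj.fileno())
        # The file is closed at this point; only now is it renamed.
        os.rename(tmp_path, path)
    except BaseException:
        # Do not leave the temporary file behind on failure
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```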
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Fri, 8 Apr 2011 11:58:48 +0000 (13:58 +0200)]
Fix distcheck
README is not copied to the build tree.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Fri, 8 Apr 2011 11:29:57 +0000 (13:29 +0200)]
Nicer formatting for group query error
Before this patch the message would look like “Some groups do not exist:
[u'foo', u'bar']”, now it's “Some groups do not exist: foo, bar”.
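
The difference comes down to formatting the list itself versus joining its
items. A small sketch, with a hypothetical helper name (on Python 3 the
list representation has no u prefix, but the brackets and quotes remain):

```python
def format_missing_groups(names):
    """Build the error message with a plain comma-separated name list."""
    # "%s" % names would embed the list repr, e.g. ['foo', 'bar']
    return "Some groups do not exist: %s" % ", ".join(names)
```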
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Fri, 8 Apr 2011 11:22:37 +0000 (13:22 +0200)]
gnt-instance.8: Fix wrongly formatted title
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Fri, 8 Apr 2011 10:21:41 +0000 (12:21 +0200)]
Update version in README
Also add a check to Makefile's check-local target.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Thu, 7 Apr 2011 09:44:52 +0000 (11:44 +0200)]
Merge branch 'stable-2.4' into devel-2.4
* stable-2.4:
Add error checking and merging for cluster params
Clarify --force-join parameter message
Treat empty oob_program param as default
Fix bug in instance listing with orphan instances
Fix bug related to log opening failures
Bump version for 2.4.1 release
cfgupgrade: Fix critical bug overwriting RAPI users file
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Wed, 6 Apr 2011 16:32:31 +0000 (18:32 +0200)]
LUInstanceQueryData: Don't acquire locks unless requested
Until now LUInstanceQueryData always acquired locks for the instance(s)
and nodes involved. In combination with long-running operations this
prevented the use of “gnt-instance info”, even with the “--static”
option. With this patch, locks are only acquired when explicitly
requested in the opcode (like all query operations).
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Mon, 4 Apr 2011 13:59:39 +0000 (15:59 +0200)]
Increase the lock timeouts before we block-acquire
This has been observed to cause problems on real clusters via the
following mechanism:
- a long job (e.g. a replace-disks) is keeping an exclusive lock on an
instance
- the watcher starts and submits its query instances opcode which
wants shared locks for all instances
- after about an hour, the watcher job falls back to blocking acquire,
after having acquired all other locks
- any instance opcode that wants an exclusive lock for an instance
cannot start until the watcher has finished, even though there's no
actual operation on that instance
In order to alleviate this problem, we simply increase the max timeout
until lock acquires are sent back to either blocking acquire or
priority increase. The timeout is computed such that we wait ~10 hours
(instead of one) for this to happen, which should be within the
maximum lifetime of a reasonable opcode on a healthy cluster. The
timeout also means that priority increases will happen every half hour.
We also increase the max wait interval to 15 seconds, otherwise we'd
have too many retries with the increased interval.
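
A rough sketch of how such a timeout list could be computed. Only the
~10 hour total and the 15 second maximum wait come from the description
above; the minimum wait, the growth formula and all names are assumptions
for illustration:

```python
# Values marked "assumed" are not taken from the patch.
LOCK_ATTEMPTS_TIMEOUT = 10 * 3600.0  # total wait before blocking acquire
LOCK_ATTEMPTS_MINWAIT = 1.0          # assumed minimum wait per attempt
LOCK_ATTEMPTS_MAXWAIT = 15.0         # maximum wait per attempt


def calculate_lock_attempt_timeouts():
    """Return per-attempt timeouts whose sum reaches the total timeout."""
    result = [LOCK_ATTEMPTS_MINWAIT]
    running_sum = result[0]
    while running_sum < LOCK_ATTEMPTS_TIMEOUT:
        # Assumed growth curve: slowly increasing, then capped
        timeout = (result[-1] * 1.05) ** 1.25
        timeout = max(LOCK_ATTEMPTS_MINWAIT,
                      min(timeout, LOCK_ATTEMPTS_MAXWAIT))
        result.append(timeout)
        running_sum += timeout
    return result
```

Capping each attempt at 15 seconds keeps the number of retries over the
longer total interval in the low thousands.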
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Mon, 4 Apr 2011 10:13:44 +0000 (12:13 +0200)]
daemon.py: move startup log message before prep_fn
Before this, the output in the rapi daemon log was:
2011-04-04 03:09:51,026: ganeti-rapi pid=17447 INFO Reading users file
at /var/lib/ganeti/rapi/users
2011-04-04 03:09:51,027: ganeti-rapi pid=17447 INFO ganeti-rapi daemon
startup
Which is confusing, as it might look like the read of the users file
is part of the previous run. This is because we log the 'daemon
startup' message after the prepare_fn, which can log things on its
own.
The patch simply moves the 'daemon startup' message just before
prepare_fn call.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Mon, 4 Apr 2011 09:33:01 +0000 (11:33 +0200)]
Display the actual memory values in N+1 failures
This changes the display from:
Mon Apr 4 02:29:46 2011 * Verifying N+1 Memory redundancy
Mon Apr 4 02:29:46 2011 - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail
To:
Mon Apr 4 02:32:50 2011 * Verifying N+1 Memory redundancy
Mon Apr 4 02:32:50 2011 - ERROR: node node2: not enough memory to
accomodate instance failovers should node node1 fail (33536MiB needed,
27910MiB available)
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Thu, 31 Mar 2011 16:41:09 +0000 (18:41 +0200)]
ssh.VerifyNodeHostname: remove the quiet flag
This is not needed for this function, and can interfere with debugging
of ssh failures.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Fri, 25 Feb 2011 15:01:38 +0000 (16:01 +0100)]
Add error checking and merging for cluster params
Set the default stderr logging level to WARNING so the relevant output
can be seen.
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Thu, 24 Mar 2011 14:13:12 +0000 (15:13 +0100)]
RAPI: Document need for Content-type header in requests
This was added to the NEWS file in commit ab221ddf, but never
documented properly.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Thu, 24 Mar 2011 11:51:31 +0000 (12:51 +0100)]
Fix output for “gnt-job info”
If the result of an opcode was a non-empty dictionary, it
would be impossible to differentiate between input and result:
Input fields:
[…]
debug_level: 0
fields: cluster_name,master_node,volume_group_name
jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]
Expected output:
Input fields:
[…]
debug_level: 0
fields: cluster_name,master_node,volume_group_name
Result:
jobs: [[True, u'37922'], [True, u'37923'], [True, u'37924']]
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Thu, 17 Mar 2011 16:36:57 +0000 (17:36 +0100)]
watcher: Fix misleading usage output
When “ganeti-watcher” is called with an argument, it would hint at
a non-existing “-f” parameter. With this patch the separate usage
string is no longer necessary.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Thu, 17 Mar 2011 10:05:36 +0000 (11:05 +0100)]
Clarify --force-join parameter message
This isn't only used during cluster merge.
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Michael Hanselmann [Mon, 14 Mar 2011 18:09:28 +0000 (19:09 +0100)]
locking: Fix race condition in lock monitor
In some rare cases it can happen that a lock is re-created very soon
after deletion, while the old instance hasn't been destructed yet. In
such a case the code would detect a duplicate name and raise an
exception.
We have seen at least one case where this happened during the creation
of many instances. It is not exactly clear how it came to be, but it
appears to have occurred while different jobs fought for locks with
short timeouts (in the case of instance creation locks are added at this
stage and removed shortly after if not all locks can be acquired).
The issue is fixed by removing the check for duplicate names. To still
guarantee a stable sort order for the lock information as shown by
“gnt-debug locks”, a registration number is recorded for each lock in
the monitor.
A unittest is included to check for the situation.
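
The idea can be sketched as below. This is a simplified model with
hypothetical names; it keeps strong references in a list and is not the
lock monitor's actual data structure:

```python
import itertools


class LockMonitor:
    """Toy monitor that tolerates duplicate lock names."""

    def __init__(self):
        self._counter = itertools.count(0)
        self._locks = []  # entries are (name, registration number, lock)

    def register(self, name, lock):
        # No duplicate-name check: a re-created lock may briefly coexist
        # with the old object of the same name.
        regnum = next(self._counter)
        self._locks.append((name, regnum, lock))
        return regnum

    def sorted_entries(self):
        # Sorting by (name, registration number) stays stable even when
        # two entries share a name.
        ordered = sorted(self._locks, key=lambda e: (e[0], e[1]))
        return [(name, regnum) for name, regnum, _ in ordered]
```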
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Thu, 24 Feb 2011 18:20:13 +0000 (19:20 +0100)]
utils: Export NiceSortKey function
The ability to split a string into a list of strings and integers can be
handy elsewhere and is necessary for sorting query results by names.
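
A sketch of such a key function, under the assumption that digit runs are
compared numerically and everything else as text. The name nice_sort_key
is illustrative and this is not the exported function's code:

```python
import re

_DIGITS_RE = re.compile(r"(\d+)")


def nice_sort_key(name):
    """Split name into alternating text and integer parts."""
    parts = _DIGITS_RE.split(name)
    # With one capturing group, odd indices hold the digit runs
    return [int(part) if idx % 2 else part
            for idx, part in enumerate(parts)]
```

Used as sorted(names, key=nice_sort_key), this orders "node2" before
"node10".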
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit f47941f864cf03264d363aebed530480a64e21dd)
Guido Trotter [Fri, 11 Mar 2011 12:59:33 +0000 (12:59 +0000)]
Revert "Only merge nodes that are known to not be offline"
This reverts commit 288f240f62dafa8bd8ba7482c8367adbdf6d96c2.
That commit was buggy at various levels:
- broke ssh access to the second cluster, making cluster-merge
unusable (unless ssh keys were previously set up?)
- filtered away offline nodes from being added to the cluster config
(wrong, they should be kept, as offline)
- broke commit-check
The previous commit makes the code work again with what this commit
tried to achieve.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Guido Trotter [Fri, 11 Mar 2011 12:13:36 +0000 (12:13 +0000)]
cluster-merge: only operate on online nodes
The node list in MergerData is used only to:
- stop ganeti on the nodes
- readd the nodes to the cluster
As such offline nodes should be skipped from it.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Thu, 10 Mar 2011 15:48:42 +0000 (16:48 +0100)]
Only merge nodes that are known to not be offline
Otherwise the readd will fail, breaking the merge.
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Thu, 10 Mar 2011 14:19:21 +0000 (15:19 +0100)]
Treat empty oob_program param as default
There is currently no way to reset oob_program back to its default from
the cmdline, which causes problems for cluster-merge. This patch means
that the following now works:
gnt-cluster modify --node-parameters oob_program=
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Iustin Pop [Thu, 10 Mar 2011 11:37:16 +0000 (12:37 +0100)]
Fix bug in instance listing with orphan instances
Nodes can return unknown instances, so we shouldn't use the name as an
index without checking.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Thu, 10 Mar 2011 11:19:17 +0000 (12:19 +0100)]
Fix bug related to log opening failures
If opening the log file fails, then we shouldn't attempt to use that
variable.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Wed, 9 Mar 2011 12:05:16 +0000 (13:05 +0100)]
Bump version for 2.4.1 release
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Tue, 8 Mar 2011 16:20:07 +0000 (17:20 +0100)]
cfgupgrade: Fix critical bug overwriting RAPI users file
The cfgupgrade tool was designed to be idempotent, meaning it can
be run several times and still produce the correct result. Ganeti
2.4 moved the file containing the RAPI users to a separate directory
(…/lib/ganeti/rapi/users). If it exists, cfgupgrade would automatically
move an existing file from …/lib/ganeti/rapi_users and replace it with a
symlink.
Unfortunately one of the checks for this was incorrect and, when
cfgupgrade was run multiple times, it replaced the users file at the new
location with a symlink created during a previous run.
In addition the “--dry-run” parameter to cfgupgrade was not respected.
Unittests are updated for all these cases.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Mon, 7 Mar 2011 11:00:51 +0000 (12:00 +0100)]
Release 2.4.0
NEWS update and version bump.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Mon, 7 Mar 2011 09:50:27 +0000 (10:50 +0100)]
Merge branch 'devel-2.3' into devel-2.4
* devel-2.3:
Fix LUClusterRepairDiskSizes and rpc result usage
Fix RPC mismatch in blockdev_getsize[s]
RAPI: fix evacuate node resource
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Thu, 3 Mar 2011 10:16:39 +0000 (11:16 +0100)]
Small improvement to the ganeti man page
Also specifies the comma-escaping feature.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Fri, 4 Mar 2011 11:36:15 +0000 (12:36 +0100)]
Merge branch 'devel-2.2' into devel-2.3
* devel-2.2:
Fix LUClusterRepairDiskSizes and rpc result usage
Fix RPC mismatch in blockdev_getsize[s]
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Tue, 15 Feb 2011 13:39:44 +0000 (14:39 +0100)]
Fix LUClusterRepairDiskSizes and rpc result usage
This LU was introduced before the RPC result conversion from .data to
.payload, and it has managed to keep the old-style usage (how? it's
the only LU that does so). Fix by changing to payload, and add some
extra logging for easier diagnosis.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit 043beb38f4e10b75d0820c361c668c441c7a6980)
Iustin Pop [Tue, 15 Feb 2011 13:29:08 +0000 (14:29 +0100)]
Fix RPC mismatch in blockdev_getsize[s]
Commit 92fd2250 added consistency checks in the RPC layer, which broke
the call_blockdev_getsizes RPC call (declared with 's' at the end in
rpc.py, without 's' in the node daemon).
The immediate fix is to correct the rpc function name, the long term
one will be to remove this duplication.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
(cherry picked from commit ccfbbd2d1546b4f57d5bfeb115573967f7fb558b)
Iustin Pop [Fri, 4 Mar 2011 10:04:10 +0000 (11:04 +0100)]
RAPI: fix evacuate node resource
PollJob returns the whole op_results, hence a list of opcode results.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Guido Trotter [Wed, 2 Mar 2011 21:36:01 +0000 (13:36 -0800)]
Merge remote branch 'stable-2.4' into devel-2.4
* origin/stable-2.4:
Fix typo in kvm-ifup script
NEWS: Replace smartquotes, start lines with uppercase
Update NEWS and release 2.4.0 rc3
Fix potential data-loss bug in disk wipe routines
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Michael Hanselmann [Tue, 1 Mar 2011 17:32:40 +0000 (18:32 +0100)]
Fix typo in kvm-ifup script
Reported-by: Bas Tichelaar <bas@30loops.net>
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Michael Hanselmann [Mon, 28 Feb 2011 15:26:00 +0000 (16:26 +0100)]
NEWS: Replace smartquotes, start lines with uppercase
- Sphinx converts ASCII quotes ("") to smartquotes (“”) automatically
- Sentences or list items start with an uppercase letter
- Changed description of non-verbose “gnt-* list” output slightly
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Mon, 28 Feb 2011 17:01:43 +0000 (18:01 +0100)]
Fix LU processor's GetECId
The exception was never actually raised.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Adeodato Simo <dato@google.com>
Iustin Pop [Mon, 28 Feb 2011 14:12:14 +0000 (15:12 +0100)]
Update NEWS and release 2.4.0 rc3
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Mon, 28 Feb 2011 13:30:45 +0000 (14:30 +0100)]
Merge branch 'devel-2.4' into stable-2.4
* devel-2.4:
1-char comment typo fix
Expand some acronyms, add to glossary
query_unittest: Fix argument to set()
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Mon, 28 Feb 2011 10:06:14 +0000 (11:06 +0100)]
Fix potential data-loss bug in disk wipe routines
For the 2.4 release, we only add the missing RPC calls. However, this
needs to be fixed properly, by preventing usage of mis-configured
disks.
Also add a bit more logging so that it's directly clear on which node
the wipe is being done.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Stephen Shirley [Fri, 25 Feb 2011 15:02:14 +0000 (16:02 +0100)]
1-char comment typo fix
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Thu, 24 Feb 2011 15:19:07 +0000 (16:19 +0100)]
Expand some acronyms, add to glossary
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
René Nussbaumer [Wed, 23 Feb 2011 13:16:12 +0000 (14:16 +0100)]
query_unittest: Fix argument to set()
Commit e431074f introduced an uncaught bug, which this patch fixes.
set() expects a list or other iterable to work on, so it split the
provided instance name into a set of characters. This caused
exp_status to never be set, which in turn was not caught by an
assertion further below that checks that every status was tested.
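
The pitfall is easy to reproduce. The instance name below is made up for
illustration:

```python
name = "inst1"

# A string is itself iterable, so set() splits it into characters
as_chars = set(name)

# Wrapping the name in a list gives a one-element set
as_names = set([name])
```

A membership test such as name in as_chars then fails, while
name in as_names succeeds.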
Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Michael Hanselmann [Tue, 22 Feb 2011 17:17:57 +0000 (18:17 +0100)]
Fix title of query field containing instance name
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Mon, 21 Feb 2011 10:28:00 +0000 (11:28 +0100)]
Update news and bump version for 2.4.0 rc2
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Mon, 21 Feb 2011 09:36:10 +0000 (10:36 +0100)]
Merge branch 'devel-2.4' into stable-2.4
* devel-2.4: (23 commits)
Fix pylint warnings
Change the list formatting to a 'special' chars
Add support for merging node groups
Add option to rename groups on conflict
Fix minor docstring typo
Fix HV/OS parameter validation on non-vm nodes
NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes
NodeQuery: don't query non-vm_capable nodes
Remove superfluous redundant requirement
Don't remove master_candidate flag from merged nodes
Use a consistent ECID base
listrunner: convert from getopt to optparse
listrunner: fix agent usage
Revert "Disable the cluster-merge tool for the moment"
Fix cluster-merging by not stopping noded
Fix error msg for instances on offline nodes
Minor reordering to match param order
cluster verify and instance disks on offline nodes
Cluster verify and N+1 warnings for offline nodes
Handle gnt-instance shutdown --all for empty clusters
Use gnt-node add --force-join to add foreign nodes
Add --force-join option to gnt-node add
Fix iterating over node groups
Of the above commits present in the devel-2.4 branch, only the “Add
--force-join option to gnt-node add” is a potential issue, but this
has been QA-ed successfully. The other fixes are split in three
groups:
- non-core changes (cluster-merge, listrunner)
- trivial fixes (docstrings, etc.)
- bugs that we want fixed
As such, instead of cherry-picking only individual patches, I propose
that we unify stable and devel 2.4 and make a new RC out of the
result.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Fri, 18 Feb 2011 15:25:59 +0000 (16:25 +0100)]
Fix pylint warnings
- 1 80-char line infraction
- 4 changes in how arguments are passed to logging functions
- 3 pylint disable-msg's because cluster-merge needs to access ganeti
config internals
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Guido Trotter [Fri, 18 Feb 2011 12:52:58 +0000 (12:52 +0000)]
TestRapiInstanceRename use instance name
Currently the QA rename job wrongly passed the whole info dict to the
client.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Fri, 18 Feb 2011 12:51:03 +0000 (13:51 +0100)]
Change the list formatting to a 'special' chars
And also enable verbose display via the, well, verbose option. Man
page and tests are updated, and the formatting is moved from 4 if
statements to a data structure.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Fri, 18 Feb 2011 12:59:46 +0000 (13:59 +0100)]
Add support for merging node groups
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Fri, 18 Feb 2011 12:30:37 +0000 (13:30 +0100)]
Add option to rename groups on conflict
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Thu, 17 Feb 2011 16:00:24 +0000 (17:00 +0100)]
Fix minor docstring typo
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Guido Trotter [Fri, 18 Feb 2011 11:33:09 +0000 (11:33 +0000)]
Add QA rapi test for instance reinstall
This tests at least the basic case, unfortunately there is no way to
check all possibilities using the provided rapi client, as that will use
the new method unless the cluster doesn't support it.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Guido Trotter [Fri, 18 Feb 2011 11:20:01 +0000 (11:20 +0000)]
RAPI: remove required parameters for reinstall
Before c744425f354f1bef2d0d7d306e2d00c494d67d2b instance reinstall
accepted the "os" and "nostartup" optional query parameters. With that
commit it was changed to allow "os", "start" and "osparams" via body
rather than encoded in the URL. Unfortunately that commit introduced a
bug, which required the "os" parameter to be passed for body requests,
and at least one of "os" or "nostartup" for query requests.
This fix makes sure all parameters are optional again.
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Thu, 17 Feb 2011 16:06:59 +0000 (17:06 +0100)]
Fix HV/OS parameter validation on non-vm nodes
Currently, there is at least one LU that does wrong validation of HV
parameters (against all nodes, LUClusterSetParams). It's possible to
fix this case, but I went and modified the base functions to filter
out non-vm_capable nodes so all callers are protected.
Note: the _CheckOSParams function is never called with all nodes list,
so modifying it shouldn't be needed. However, I think it's safe to do
so (and it shouldn't hurt as an instance's node shouldn't ever lack
the vm_capable bit).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Thu, 17 Feb 2011 13:42:57 +0000 (14:42 +0100)]
NodeQuery: mark live fields as UNAVAIL for non-vm_capable nodes
Since we don't have the data per design, UNAVAIL is appropriate here,
while NODATA is not.
The patch also adds a comment: if we extend the live fields list to
contain other data in the future, we need to reevaluate this solution.
This should fix issue 143. The listing now shows (node2==offline,
node3==not vm_capable):
Node  DTotal    DFree     MTotal    MNode     MFree     Pinst Sinst
node1 698.6G    630.5G    32.0G     1.0G      30.0G         8     7
node2 (offline) (offline) (offline) (offline) (offline)     9     4
node3 (unavail) (unavail) (unavail) (unavail) (unavail)     0     0
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Thu, 17 Feb 2011 13:41:29 +0000 (14:41 +0100)]
NodeQuery: don't query non-vm_capable nodes
Non-vm_capable nodes most likely don't have a hypervisor and/or
storage configured, so the call would fail anyway.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Tue, 15 Feb 2011 13:39:44 +0000 (14:39 +0100)]
Fix LUClusterRepairDiskSizes and rpc result usage
This LU was introduced before the RPC result conversion from .data to
.payload, and it has managed to keep the old-style usage (how? it's
the only LU that does so). Fix by changing to payload, and add some
extra logging for easier diagnosis.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Tue, 15 Feb 2011 13:29:08 +0000 (14:29 +0100)]
Fix RPC mismatch in blockdev_getsize[s]
Commit 92fd2250 added consistency checks in the RPC layer, which broke
the call_blockdev_getsizes RPC call (declared with 's' at the end in
rpc.py, without 's' in the node daemon).
The immediate fix is to correct the rpc function name, the long term
one will be to remove this duplication.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Stephen Shirley [Tue, 15 Feb 2011 16:40:54 +0000 (17:40 +0100)]
Remove superfluous redundant requirement
The condition is already covered by the previous requirement.
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Tue, 15 Feb 2011 14:29:03 +0000 (15:29 +0100)]
Don't remove master_candidate flag from merged nodes
Prevents lots of spurious warnings like:
2011-02-10 17:00:22,776: CRITICAL Configuration data is not consistent:
Not enough master candidates: actual 3, target 4
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Tue, 15 Feb 2011 14:06:03 +0000 (15:06 +0100)]
Use a consistent ECID base
ECID was being calculated completely differently in
__MergeNodeGroups() and _MergeConfig()
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Wed, 16 Feb 2011 16:21:03 +0000 (17:21 +0100)]
listrunner: convert from getopt to optparse
The “-A” (use agent) was not documented, and instead of adding manual
listing, I converted it to optparse like the other CLI tools.
Note that I also cleaned up the usage and help texts a bit.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Wed, 16 Feb 2011 12:32:29 +0000 (13:32 +0100)]
listrunner: fix agent usage
By delaying the agent key query until after the fork, we prevent the
problem of simultaneous access to the agent.
Tested that it works against 80 hosts in parallel without error; the
current version breaks already at 20 hosts.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Stephen Shirley [Thu, 10 Feb 2011 16:32:26 +0000 (17:32 +0100)]
Revert "Disable the cluster-merge tool for the moment"
This reverts commit c0711f2cb989facd60430ab18c5b0e59a1f279ac.
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Thu, 10 Feb 2011 10:52:13 +0000 (11:52 +0100)]
Fix cluster-merging by not stopping noded
cli.RunWhileClusterStopped() stops noded on all of the nodes in the
original cluster. This prevents /etc/hosts updates on the master, and
config redistribution doesn't reach the other nodes in the original
cluster. As all we want to do is merge while the master is stopped,
simply stop it and start it again after.
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Thu, 10 Feb 2011 13:55:08 +0000 (14:55 +0100)]
Fix bug in iallocator data structures build
Commit a1cef11c fixed the export of non-vm_capable nodes, but
inadvertently broke offline nodes. The update of the dict only needs to
happen for online nodes, in the 'if' block.
Without this patch, offline nodes keep the data from the last node
that was not offline; end result is that all nodes are considered
online (unless the first node is offline, in which case an error will
be raised).
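
A loose reconstruction of the pattern, using made-up field names and a
hypothetical helper; the real iallocator structures differ:

```python
def build_node_results(node_cfg, live_data):
    """Combine static node data with live data for online nodes only.

    node_cfg is a list of (name, offline) pairs and live_data maps the
    names of online nodes to their free memory; both are assumed shapes.
    """
    results = {}
    for name, offline in node_cfg:
        entry = {"offline": offline}
        if not offline:
            dynamic = {"free_memory": live_data[name]}
            # The update belongs inside the 'if' block. Done after it,
            # "dynamic" would still hold the previous online node's data,
            # or be undefined if the first node is offline.
            entry.update(dynamic)
        results[name] = entry
    return results
```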
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Iustin Pop [Wed, 9 Feb 2011 09:04:39 +0000 (10:04 +0100)]
Fix error msg for instances on offline nodes
Currently, for both primary and secondary offline nodes, we give the
same message:
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: instance lives on offline node(s) node3
- ERROR: instance instance16: instance lives on offline node(s) node3
- ERROR: instance instance17: instance lives on offline node(s) node3
This is confusing, as an offline primary is in a different category
than a secondary. The patch changes the warnings to have different
error messages:
- ERROR: instance instance14: instance has offline secondary node(s) node3
- ERROR: instance instance15: instance has offline secondary node(s) node3
- ERROR: instance instance16: instance lives on offline node node3
- ERROR: instance instance17: instance lives on offline node node3
Thanks to Alexander Schreiber <als@google.com> for reporting this
issue.
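The split can be sketched roughly as follows. This is an illustration under
assumptions, not the cluster-verify code: the helper name and its arguments
are hypothetical, and only the two message texts are taken from the output
above.

```python
def offline_node_errors(instance, primary, secondaries, offline_nodes):
    """Return one message per category of offline node for an instance."""
    errors = []
    if primary in offline_nodes:
        errors.append("instance %s: instance lives on offline node %s"
                      % (instance, primary))
    bad_secondaries = [node for node in secondaries if node in offline_nodes]
    if bad_secondaries:
        errors.append("instance %s: instance has offline secondary node(s) %s"
                      % (instance, ", ".join(bad_secondaries)))
    return errors
```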
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Alexander Schreiber <als@google.com>
Stephen Shirley [Tue, 8 Feb 2011 16:42:18 +0000 (17:42 +0100)]
Minor reordering to match param order
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Tue, 8 Feb 2011 16:07:13 +0000 (17:07 +0100)]
cluster verify and instance disks on offline nodes
Currently, cluster-verify says:
- ERROR: instance instance14: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance14: instance lives on offline node(s) node3
- ERROR: instance instance15: couldn't retrieve status for disk/0 on node3: node offline
- ERROR: instance instance15: instance lives on offline node(s) node3
This is redundant as the “lives on offline node” message should be all we need to
understand the cluster situation.
The patch fixes this and also corrects a very old idiom.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Iustin Pop [Tue, 8 Feb 2011 15:56:23 +0000 (16:56 +0100)]
Cluster verify and N+1 warnings for offline nodes
Currently, cluster verify shows N+1 warnings for offline
nodes having any redundant instances since the memory data that we
have for those nodes is zero, so any instance will trigger the
warning.
As the comment says, we already list secondary instances on offline
nodes, so that warning is enough, and we skip the N+1 one.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Stephen Shirley [Mon, 7 Feb 2011 15:35:34 +0000 (16:35 +0100)]
Handle gnt-instance shutdown --all for empty clusters
The current code gives:
Failure: prerequisites not met for this operation:
error type: wrong_input, error details:
Selection filter does not match any instances
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Tue, 1 Feb 2011 15:59:46 +0000 (16:59 +0100)]
Use gnt-node add --force-join to add foreign nodes
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Tue, 1 Feb 2011 15:59:45 +0000 (16:59 +0100)]
Add --force-join option to gnt-node add
This is needed so cluster-merge can add nodes from other clusters.
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Stephen Shirley [Tue, 1 Feb 2011 16:14:18 +0000 (17:14 +0100)]
Fix iterating over node groups
The current line tries to unpack the dict incorrectly.
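The offending line is not quoted here, so the following is only a generic
sketch of this class of mistake; the `node_groups` contents are made up:

```python
node_groups = {"uuid-1": {"name": "default"}, "uuid-2": {"name": "rack2"}}

# Iterating over a dict yields only its keys, so unpacking each key into
# two names fails (here with ValueError, as the keys are longer strings).
unpack_error = None
try:
    for uuid, group in node_groups:
        pass
except ValueError as err:
    unpack_error = err

# Iterating over items() yields the (key, value) pairs that were intended.
names = sorted(group["name"] for _uuid, group in node_groups.items())
```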
Signed-off-by: Stephen Shirley <diamond@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Iustin Pop [Fri, 4 Feb 2011 09:54:05 +0000 (10:54 +0100)]
Update NEWS file for the 2.4.0 rc1 release
Also bump up the version.
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Iustin Pop [Fri, 4 Feb 2011 09:58:45 +0000 (10:58 +0100)]
Disable the cluster-merge tool for the moment
Hopefully this can be fixed before the final 2.4 release…
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Reviewed-by: Stephen Shirley <diamond@google.com>
Iustin Pop [Thu, 3 Feb 2011 15:19:52 +0000 (16:19 +0100)]
Bump up intra-cluster import connect timeout
Currently, the export timeout is 10 times 20 seconds, but the import
is only 30 seconds. I'm raising this to 60 seconds with two goals in
mind:
- when debugging manually, this allows for easier synchronisation of
the processes
- 60 equals 3 full 20 second intervals, which I think is better
than just one and a half
This change shouldn't make a big difference either way (at most, it
will possibly delay the job in case of failures by half a minute).
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Thu, 3 Feb 2011 13:17:58 +0000 (14:17 +0100)]
Import-export: fix logging of daemon output
In case of failures, the recent daemon output is logged as %r on a
list of unicode strings, which results in the (ugly):
Thu Feb 3 05:13:34 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: [u' DUMP: Date of this level 0 dump: Thu Feb 3 05:13:18 2011', u' DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output', u' DUMP: Label: none', u' DUMP: Writing 10 Kilobyte records', u' DUMP: mapping (Pass I) [regular files]', u' DUMP: mapping (Pass II) [directories]', u' DUMP: estimated 54301 blocks.', u' DUMP: Volume 1 started with block 1 at: Thu Feb 3 05:13:19 2011', u' DUMP: dumping (Pass III) [directories]', u' DUMP: dumping (Pass IV) [regular files]', u'socat: E SSL_write(): Connection reset by peer', u"dd: dd: writing `standard output': Broken pipe", u' DUMP: Broken pipe', u' DUMP: The ENTIRE dump is aborted.'])
This patch joins this list and makes it a non-unicode string, thus
resulting in the more readable (and ~10% shorter):
Thu Feb 3 05:16:04 2011 snapshot/0 failed to send data: Exited with status 1 (recent output: DUMP: Date of this level 0 dump: Thu Feb 3 05:15:58 2011\n DUMP: Dumping /dev/mapper/6369a5f7-1e67-4d0d-a4f0-956b3649c6d7.disk0_data.snap-1 (an unlisted file system) to standard output\n DUMP: Label: none\n DUMP: Writing 10 Kilobyte records\n DUMP: mapping (Pass I) [regular files]\n DUMP: mapping (Pass II) [directories]\n DUMP: estimated 54350 blocks.\n DUMP: Volume 1 started with block 1 at: Thu Feb 3 05:15:59 2011\n DUMP: dumping (Pass III) [directories]\nsocat: E SSL_write(): Connection reset by peer\ndd: dd: writing `standard output': Broken pipe\n DUMP: Broken pipe\n DUMP: The ENTIRE dump is aborted.)
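The difference between the two formats can be reproduced with a short
sketch. It is written for Python 3 (so no u'' prefixes appear), the sample
lines are shortened, and the escaping step is an assumption rather than the
patched code:

```python
recent_output = [
    " DUMP: Label: none",
    "socat: E SSL_write(): Connection reset by peer",
    " DUMP: The ENTIRE dump is aborted.",
]

# Before: %r on the list adds brackets, quotes and separators for each line.
before = "recent output: %r" % (recent_output,)

# After: join the lines into a single string; the newlines are escaped so
# that the message still fits on one log line.
joined = "\n".join(recent_output)
after = "recent output: %s" % joined.replace("\n", "\\n")
```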
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Iustin Pop [Thu, 3 Feb 2011 10:02:20 +0000 (11:02 +0100)]
Fix handling of ^C in the CLI scripts
This adds a message and nice handling of ^C, especially useful for
``gnt-job watch``.
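A rough sketch of this kind of handling, assuming a generic wrapper around
the script's main function; `run_cli`, the message text and the exit code
are hypothetical and not taken from the Ganeti CLI code:

```python
import sys


def run_cli(main_fn, argv):
    """Run a CLI entry point, reporting ^C instead of dumping a traceback."""
    try:
        return main_fn(argv)
    except KeyboardInterrupt:
        # Hypothetical message and exit code, chosen for illustration only.
        sys.stderr.write("Aborted by user (^C).\n")
        return 1
```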
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Michael Hanselmann [Thu, 3 Feb 2011 11:38:25 +0000 (12:38 +0100)]
Merge branch 'devel-2.3' into devel-2.4
* devel-2.3:
backend: Disable compression in export info file
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Thu, 3 Feb 2011 11:25:04 +0000 (12:25 +0100)]
backend: Disable compression in export info file
The new import/export infrastructure in Ganeti 2.2 and up handles
compression differently. It no longer writes compressed files to the
destination. Unfortunately changing this behaviour would be non-trivial,
so in the meantime setting “compression = none” will hopefully avoid
some confusion.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Michael Hanselmann [Tue, 1 Feb 2011 15:31:15 +0000 (16:31 +0100)]
Reopen log files upon SIGHUP in daemons
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Mon, 31 Jan 2011 16:26:55 +0000 (17:26 +0100)]
utils.SetupLogging: Return function to reopen log file
This function can be used from a SIGHUP handler to reopen log files.
Initial, simple unittests are included.
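The idea can be sketched with the standard logging module. This is not the
utils.SetupLogging implementation: `setup_logging`, `install_sighup_handler`
and the log format are illustrative assumptions, and the sketch relies on
FileHandler's private `_open()` helper to reopen the path.

```python
import logging
import signal


def setup_logging(logfile, program):
    """Attach a file handler and return a function that reopens the file."""
    logger = logging.getLogger(program)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    def reopen():
        # Swap the stream while holding the handler lock, so that no record
        # is written to a stream that is being closed.
        handler.acquire()
        try:
            if handler.stream is not None:
                handler.stream.close()
            # _open() is the private helper FileHandler uses to open its file.
            handler.stream = handler._open()
        finally:
            handler.release()

    return reopen


def install_sighup_handler(reopen):
    """Reopen the log file whenever the process receives SIGHUP."""
    signal.signal(signal.SIGHUP, lambda signum, frame: reopen())
```

A daemon would call `install_sighup_handler(setup_logging(path, name))` once
at startup; after rotating the file, sending SIGHUP makes new records go to
a fresh file at the original path.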
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Mon, 31 Jan 2011 16:03:18 +0000 (17:03 +0100)]
utils.SetupLogging: Make program a mandatory argument
It's passed in by most users (daemons, CLI scripts) and for the others
(burnin, watcher) it certainly doesn't hurt, especially when using
syslog.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Mon, 31 Jan 2011 15:46:21 +0000 (16:46 +0100)]
utils.log: Restrict I/O error handling coverage
The I/O error will occur while opening the file, not while adding
and configuring the handler.
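A minimal sketch of the narrowed scope, using the standard logging module;
the function name and the RuntimeError wrapper are assumptions made for
illustration, not the utils.log code:

```python
import logging


def add_file_handler(logger, logfile, formatter):
    """Open the log file, treating only the open itself as an I/O risk."""
    try:
        # FileHandler opens the file in its constructor by default.
        handler = logging.FileHandler(logfile)
    except OSError as err:
        raise RuntimeError("Failed to open log file %s: %s" % (logfile, err))
    # Outside the try block: a failure here would be a programming error and
    # should not be reported as an I/O problem.
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    return handler
```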
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Mon, 31 Jan 2011 15:43:28 +0000 (16:43 +0100)]
utils.log: Split formatter building into separate function
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Mon, 31 Jan 2011 13:58:26 +0000 (14:58 +0100)]
burnin: Trivial code cleanup
- Use constant for exit value
- Configure logging from main function, not from class' “__init__”
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Michael Hanselmann [Mon, 31 Jan 2011 13:54:38 +0000 (14:54 +0100)]
burnin: Reuse existing function for debug value
Instead of using its own, burnin can use cli.SetGenericOpcodeOpts.
Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Stephen Shirley [Tue, 1 Feb 2011 12:07:31 +0000 (13:07 +0100)]
Merge node groups from other cluster
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Stephen Shirley [Mon, 31 Jan 2011 16:07:08 +0000 (17:07 +0100)]
Enforce that new node groups have unique names
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Stephen Shirley [Mon, 31 Jan 2011 16:00:03 +0000 (17:00 +0100)]
Add _UnlockedLookupNodeGroup()
This allows calling of _UnlockedLookupNodeGroup() from within
AddNodeGroup()
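The locked/unlocked split can be sketched as below. The class, its method
names and the plain non-reentrant threading.Lock are assumptions for
illustration; the real configuration code uses its own locking.

```python
import threading


class ConfigSketch:
    """Toy stand-in for a config object guarded by a non-reentrant lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._groups = {}  # maps group UUID -> group name

    def _unlocked_lookup_node_group(self, target):
        """Resolve a name or UUID to a UUID; the caller holds the lock."""
        if target in self._groups:
            return target
        for uuid, name in self._groups.items():
            if name == target:
                return uuid
        raise KeyError("Node group '%s' not found" % target)

    def lookup_node_group(self, target):
        with self._lock:
            return self._unlocked_lookup_node_group(target)

    def add_node_group(self, uuid, name):
        with self._lock:
            # Calling lookup_node_group() here would block forever on the
            # lock that is already held, hence the unlocked variant.
            try:
                self._unlocked_lookup_node_group(name)
            except KeyError:
                self._groups[uuid] = name
            else:
                raise ValueError("Node group '%s' already exists" % name)
```

The same lookup is what lets add_node_group() reject a second group with an
existing name.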
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
Stephen Shirley [Mon, 31 Jan 2011 14:19:48 +0000 (15:19 +0100)]
cluster-merge should refuse to merge own cluster
Also fix type of Merger.cluster_name from list to string. This would
have triggered an error in sshRunner if cluster keys were in use.
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Stephen Shirley [Mon, 31 Jan 2011 13:49:03 +0000 (14:49 +0100)]
Minor grammar fix in QuitGanetiException docstring
Signed-off-by: Stephen Shirley <diamond@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>