code.grnet.gr Git - ganeti-local/log

Bernardo Dal Seno [Wed, 28 Mar 2012 11:42:46 +0000 (13:42 +0200)]

LUOobCommand: acquire BGL in shared mode

Fixed a typo so that LUOobCommand now acquires the BGL in shared mode, as
intended.

Signed-off-by: Bernardo Dal Seno <bdalseno@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

René Nussbaumer [Fri, 23 Mar 2012 11:18:15 +0000 (12:18 +0100)]

LUNodeAdd: Verify version in Prereq

There are other ways to leave the cluster in a broken state besides a
failed version check. However, they are not trivial to fix in 2.5, so
a nicer fix is left for 2.6.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
(cherry picked from commit e2ea8de1663b9a49219f2ea0709653b424384436)

Iustin Pop [Thu, 22 Mar 2012 19:16:20 +0000 (19:16 +0000)]

Fix LV status parsing to accept newer LVM

LVM version 2.02.93 (or at least, sometime after .88) has extended the
lv_attr field with two more flags; we only care about the first digit,
so let's change the "!= 6" check to "< 6".
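
A minimal sketch of the relaxed length check, assuming a hypothetical helper
named parse_lv_attr (this is not Ganeti's actual function, and the sample
attribute strings are only illustrative):

```python
def parse_lv_attr(attr):
    """Validate an lv_attr string, tolerating extra trailing flags.

    Hypothetical helper for illustration only; not Ganeti's actual code.
    """
    # The old check, len(attr) != 6, rejected the longer strings printed
    # by newer LVM versions. Requiring at least six characters accepts both.
    if len(attr) < 6:
        raise ValueError("lv_attr too short: %r" % (attr,))
    return attr


old_style = parse_lv_attr("-wi-ao")     # six flags
new_style = parse_lv_attr("-wi-ao--")   # two additional flags
```

Strings shorter than six characters are still rejected, so truncated output
keeps failing loudly.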

Thanks to Robin H Johnson <robbat2@gentoo.org> for finding this issue.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Thu, 22 Mar 2012 15:26:02 +0000 (16:26 +0100)]

Bump version for 2.5.0~rc6 release

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Wed, 21 Mar 2012 17:11:48 +0000 (18:11 +0100)]

Revert "Stop acquiring BGL for LUXI queries"

This reverts commit 0fa753bad2cf5a0cf88953347e5da3aebbf21956.

Turns out there are more queries acquiring locks than we'd like. This
patch goes to version 2.6 and a separate patch fixes the immediate
issues in LUClusterVerifyConfig.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Bernardo Dal Seno <bdalseno@google.com>

Michael Hanselmann [Wed, 21 Mar 2012 15:59:17 +0000 (16:59 +0100)]

LUClusterVerifyConfig: Share BGL, acquire all locks in shared mode

Instead of acquiring the BGL in exclusive mode (which blocks all other
operations), we acquire all locks for groups, nodes and instances in
shared mode before verifying the configuration.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Bernardo Dal Seno <bdalseno@google.com>

Guido Trotter [Wed, 21 Mar 2012 14:58:05 +0000 (14:58 +0000)]

KVM: don't add -nographic using spice

This fixes issue 222.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Tue, 20 Mar 2012 16:57:12 +0000 (17:57 +0100)]

Stop acquiring BGL for LUXI queries

Short description: This fixes an issue whereby masterd would become
unresponsive on the LUXI socket, leading to client timeouts. While made
worse in 2.5, the underlying issue was already present in 2.4.

Longer description: Until now all LUXI queries would acquire the BGL
(big Ganeti lock) in shared mode. With the exception of OpNodeAdd and
OpNodeRemove, this was also the case for all opcodes before version 2.5.
In 2.5 we split OpClusterVerify into multiple opcodes, one of which
(OpClusterVerifyConfig) now acquires the BGL in exclusive mode. Whether
or not doing so is good is a separate discussion: OpNodeAdd and
OpNodeRemove, as of this writing, still require an exclusive BGL.
OpClusterVerifyConfig is run more often than OpNodeAdd or OpNodeRemove
in normal clusters, which is why we only recognized this issue in 2.5.

What would happen is that once OpClusterVerifyConfig tried to acquire
its exclusive BGL while it was actually held by other opcodes (e.g.
OpInstanceReplaceDisks), the locking code would not grant shared
acquires for the BGL, even when the exclusive acquire is removed from
the queue for a short amount of time after a timeout. This is necessary
to prevent lock starvation.

In this situation further LUXI queries requiring the BGL in shared mode,
e.g. OpClusterQuery, would block and the client would eventually time out.
Over time they would fill the client request workerpool's queue, and at
that point even requests not requiring the BGL would stop working. Once
the long-running operation(s) holding the BGL in shared mode finished,
OpClusterVerifyConfig would get it in exclusive mode and everything would
return to normal. LUXI would recover very soon too.

I'd like to thank Bernardo Dal Seno for his contribution to this bugfix.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Bernardo Dal Seno <bdalseno@google.com>

Iustin Pop [Mon, 19 Mar 2012 09:26:29 +0000 (10:26 +0100)]

Fix type error in LUInstanceChangeGroup

If a specific list of groups has been requested, then the code used
that, without transforming it to a (frozen)set first, which results
in:

unsupported operand type(s) for &: 'list' and 'frozenset'

Trivial fix is to do that in the 'then' branch.
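
The error and the fix can be reproduced with a short sketch. The helper name
target_groups and the group names are assumptions made for illustration; they
are not taken from the LU's code:

```python
def target_groups(requested, all_groups):
    """Intersect the requested groups with the known ones.

    Hypothetical helper; only the list-to-frozenset conversion mirrors
    the fix described above.
    """
    if requested:
        # Convert first: a plain list does not support the & operator.
        wanted = frozenset(requested)
    else:
        wanted = all_groups
    return wanted & all_groups


all_groups = frozenset(["g1", "g2", "g3"])

# Reproduce the original failure with a plain list on the left-hand side.
try:
    ["g1", "g4"] & all_groups
except TypeError as err:
    error_text = str(err)

fixed = target_groups(["g1", "g4"], all_groups)
```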

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Iustin Pop [Sun, 19 Feb 2012 19:58:42 +0000 (20:58 +0100)]

Fix Makefile.am compatibility with automake 1.11.2

Automake 1.11.2 made the following change:

  * Long-standing bugs:
    - Automake now warns about more primary/directory invalid combinations,
      such as "doc_LIBRARIES" or "pkglib_PROGRAMS".

Unfortunately, this breaks our Makefile.am (issue 216) exactly because
we were relying on pkglib_SCRIPTS.

This patch works around this by adding a new myexeclibdir variable
(exec so that it is installed at `install-exec` time, the same as the
pkglibdir), and switches to that.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Michael Hanselmann [Tue, 31 Jan 2012 11:52:12 +0000 (12:52 +0100)]

Fix type check for OpQuery.filter

Just using ht.TListOf as a type check doesn't work correctly. The
function must be called with the expected item type. In this specific
case TListOf was always called with the filter as a value, and the
result of that call evaluated to truth. Since filters can be quite
complex there's no check yet, and therefore just “TList” is used.
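
The pitfall can be sketched with simplified stand-ins for the type checkers.
The functions below only imitate the calling convention described above and
are assumptions, not the real "ht" module:

```python
def TList(val):
    """Check that the value is a list."""
    return isinstance(val, list)


def TString(val):
    """Check that the value is a string."""
    return isinstance(val, str)


def TListOf(item_check):
    """Build a checker for lists whose items all pass item_check."""
    def check(val):
        return isinstance(val, list) and all(item_check(i) for i in val)
    return check


# Misuse: the value is passed where the item type is expected. The call
# returns a function object, which is always truthy, so anything "passes".
broken_result = bool(TListOf("definitely not a list"))

# Correct uses: TList is applied to the value directly, while TListOf is
# first called with the item checker and then applied to the value.
plain_ok = TList(["=", "name", "node1"])
typed_ok = TListOf(TString)(["a", "b"])
typed_bad = TListOf(TString)(["a", 1])
```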

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Iustin Pop [Thu, 26 Jan 2012 16:31:05 +0000 (17:31 +0100)]

Fix explanation of gnt-node evacuate --primaries-only

Furthermore, correct the --help display on evacuate.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Bernardo Dal Seno [Fri, 9 Dec 2011 15:26:15 +0000 (16:26 +0100)]

Makefile.am: fix permissions for Python scripts on install

Some Python scripts in /usr/lib/ganeti/ were getting the wrong permissions
(their 'x' bit was cleared). This patch fixes that behavior.

This patch renames the variable 'dist_tools_PYTHON' to 'python_scripts'.
Some Python scripts were listed in the 'dist_tools_PYTHON' variable, but as
said scripts have no .py extension in their names, Automake treated the scripts
as data files, and hence no 'x' bit. Now the Python scripts are processed
by the rules created for the 'dist_tools_SCRIPTS' variable, and such rules
don't depend on file name extensions.

Signed-off-by: Bernardo Dal Seno <bdalseno@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit cc120286c3d811d48722379d1d6b44d52fda7517)

Bernardo Dal Seno [Thu, 8 Dec 2011 23:35:47 +0000 (00:35 +0100)]

devel/upload: Fix permissions for installed directories

Permissions for the directories created during install depended on the
umask of the user running the script. Now umask is reset inside the script
to remove such dependency.

Signed-off-by: Bernardo Dal Seno <bdalseno@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit 0f7968005178cd29bddd83b55c6e86816011c124)

Michael Hanselmann [Wed, 25 Jan 2012 14:00:29 +0000 (15:00 +0100)]

Fix cluster verification issues on multi-group clusters

This patch attempts to fix a number of issues with “gnt-cluster verify”
in presence of multiple node groups and DRBD8 instances split over nodes
in more than one group.

- Look up instances in a group only by their primary node (otherwise
  split instances would be considered when verifying any of their node's
  groups)
- When gathering additional nodes for LV checks, just compare instance's
  node's groups with the currently verified group instead of comparing
  against the primary node's group
- Exclude nodes in other groups when calculating N+1 errors and checking
  logical volumes

Not directly related, but a small error text is also clarified.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Guido Trotter [Fri, 20 Jan 2012 14:30:56 +0000 (14:30 +0000)]

Migrate: don't check for free memory on cleanup

Cleanup just updates the config with the correct location of the
instance, or informs of its down status, but never starts it. As such
there's no point in checking for enough free memory. Actually this check
could prevent a perfectly safe cleanup operation if a node is busy.

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Michael Hanselmann [Mon, 9 Jan 2012 16:27:10 +0000 (17:27 +0100)]

Bump version to 2.5.0~rc5, update NEWS

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Mon, 9 Jan 2012 16:09:56 +0000 (17:09 +0100)]

Merge branch 'devel-2.4' into stable-2.5

* devel-2.4:
  Add UnescapeAndSplit unittest for multi-escapes
  Fix a bug in command line option parsing code
  ConfigWriter: Fix epydoc error
  LUGroupAssignNodes: Fix node membership corruption
  Ensure unused ports return to the free port pool
  Re-wrap a paragraph to eliminate a sphinx warning

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Guido Trotter [Thu, 5 Jan 2012 16:20:24 +0000 (16:20 +0000)]

KVM: support version reported by 1.0

This of course was working for all the rcs, but broke with 1.0 itself.

In addition:
  - split between running kvm --version and parsing its output
  - unittest parsing for various known --help outputs
  - updated NEWS file
  - happy 2012 wishes
  - the hope to finish this patch before it's time to say happy easter
    :)

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Wed, 21 Dec 2011 16:01:08 +0000 (17:01 +0100)]

jqueue: Fix epylint errors introduced in 37d76f1e4

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Mon, 19 Dec 2011 15:26:55 +0000 (16:26 +0100)]

jqueue: Fix deadlock between job queue and dependency manager

When an opcode is about to be processed its dependencies are
evaluated using “_JobDependencyManager.CheckAndRegister”. Due
to its nature that function requires a lock on the manager's
internal structures. All of this happens while the job queue
lock is held in shared mode (required for the job processor).

When a job has been processed any pending dependencies are re-added
to the job workerpool. Before this patch that would require
the manager's lock and then, for adding the jobs, the job queue
lock. Since this is in reverse order it will lead to deadlocks.
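
The message does not spell out how the ordering was changed. One common way
to avoid this kind of lock inversion is sketched below, under the assumption
that the waiting jobs can be collected under the manager's lock and re-added
after it has been released. All names are hypothetical, and plain mutexes
stand in for the shared job queue lock:

```python
import threading

queue_lock = threading.Lock()     # stands in for the job queue lock
manager_lock = threading.Lock()   # stands in for the manager's lock
pending = ["job2", "job3"]        # jobs waiting on a dependency
requeued = []                     # jobs handed back to the workerpool


def check_and_register(job):
    # Path 1: queue lock first, then the manager's lock.
    with queue_lock:
        with manager_lock:
            pending.append(job)


def notify_waiters():
    # Path 2: collect the waiting jobs under the manager's lock only...
    with manager_lock:
        jobs = list(pending)
        del pending[:]
    # ...and release it before taking the queue lock, so the two locks
    # are never acquired in the reverse order.
    with queue_lock:
        requeued.extend(jobs)


threads = [
    threading.Thread(target=check_and_register, args=("job4",)),
    threading.Thread(target=notify_waiters),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Whichever thread runs first, both finish, and every job ends up either
pending or requeued.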

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Iustin Pop [Wed, 30 Nov 2011 09:33:52 +0000 (10:33 +0100)]

Add UnescapeAndSplit unittest for multi-escapes

This would have caught the bug in the first place. Argh,
hand-generated test cases!

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Nikos Skalkotos [Tue, 29 Nov 2011 12:30:46 +0000 (14:30 +0200)]

Fix a bug in command line option parsing code

Fix bug affecting command line options of "keyval" type. Although
escaping commas with \ is supported, it is not applied to the
input recursively.
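
As an illustration of the intended behaviour, the sketch below splits a
string on a separator while honouring every backslash escape in the input.
It is a simplified stand-in written for this note, not the code touched by
the patch:

```python
def unescape_and_split(text, sep=","):
    """Split text on sep, honouring backslash escapes (illustrative only)."""
    parts = []
    current = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # Keep the next character literally, whatever it is. A lone
            # trailing backslash is kept as a backslash.
            current.append(next(chars, "\\"))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


# Two escaped separators in one field, followed by a real separator.
multi = unescape_and_split("a\\,b\\,c,d")
# An escaped backslash directly before a real separator.
trailing = unescape_and_split("a\\\\,b")
```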

Signed-off-by: Nikos Skalkotos <skalkoto@grnet.gr>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Thu, 24 Nov 2011 12:02:36 +0000 (13:02 +0100)]

ConfigWriter: Fix epydoc error

The parameter is called “mods”, not “modes”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>
(cherry picked from commit 1730d4a1ab56ef36d082b614d3d0ab13f3e14a85)

Michael Hanselmann [Thu, 24 Nov 2011 12:02:36 +0000 (13:02 +0100)]

ConfigWriter: Fix epydoc error

The parameter is called “mods”, not “modes”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>

Michael Hanselmann [Thu, 24 Nov 2011 07:43:04 +0000 (08:43 +0100)]

LUGroupAssignNodes: Fix node membership corruption

Note: This bug only manifests itself in Ganeti 2.5, but since the
problematic code also exists in 2.4, I decided to fix it there.

If a node was assigned to a new group using “gnt-group assign-nodes” the
node object's group would be changed, but not the duplicate member list
in the group object. The latter is an optimization to require fewer
locks for other operations. The per-group member list is only kept in
memory and not written to disk.

Ganeti 2.5 starts to make use of the data kept in the per-group member
list and consequently fails when it is out of date. The following
commands can be used to reproduce the issue in 2.5 (in 2.4 the issue was
confirmed using additional logging):

  $ gnt-group add foo
  $ gnt-group assign-nodes foo $(gnt-node list --no-header -o name)
  $ gnt-cluster verify  # Fails with KeyError

This patch moves the code modifying node and group objects into
“config.ConfigWriter” to do the complete operation under the config
lock, and also to avoid making use of side-effects of modifying objects
without calling “ConfigWriter.Update”. A unittest is included.
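
The idea of updating both views in one locked operation can be sketched as
follows. ToyConfig and its attributes are hypothetical and far simpler than
“config.ConfigWriter”:

```python
import threading


class ToyConfig:
    """Toy config keeping node->group and group->members in sync."""

    def __init__(self):
        self._lock = threading.Lock()
        self.node_group = {}   # node name -> group name
        self.members = {}      # group name -> set of node names

    def add_group(self, group):
        with self._lock:
            self.members.setdefault(group, set())

    def assign_nodes(self, nodes, group):
        # Both structures change under the same lock, so readers never
        # see a node whose group and member list disagree.
        with self._lock:
            for node in nodes:
                old = self.node_group.get(node)
                if old is not None:
                    self.members[old].discard(node)
                self.node_group[node] = group
                self.members[group].add(node)


cfg = ToyConfig()
cfg.add_group("default")
cfg.add_group("foo")
cfg.assign_nodes(["node1", "node2"], "default")
cfg.assign_nodes(["node1", "node2"], "foo")
```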

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
(cherry picked from commit 218f4c3de706aca7e4521d7e1975f517cf5ecb9b)

Michael Hanselmann [Thu, 24 Nov 2011 07:58:56 +0000 (08:58 +0100)]

Fix pylint warning on unreachable code

Commit c50452c3186 added an exception when all instances should be
evacuated off a node, but did so in a way which made pylint complain
about unreachable code.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Wed, 23 Nov 2011 13:01:23 +0000 (14:01 +0100)]

LUNodeEvacuate: Disallow migrating all instances at once

There is a design issue in the iallocator interface which prevents us
from doing this.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Wed, 23 Nov 2011 12:16:14 +0000 (13:16 +0100)]

LUNodeEvacuate: Locking fixes

When evacuating a node, only an assertion without informative text was
used to check if the necessary node locks had been acquired. This was on
top of evaluating the list of nodes without having a node group lock, so
this was changed as well.

Also update some exception messages to include “retry the operation”.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Wed, 23 Nov 2011 07:15:18 +0000 (08:15 +0100)]

Fix error when removing node

ConfigWriter.GetAllInstancesInfo returns a dictionary, not a list.
Removing a node would fail with “too many values to unpack”.
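
The failing code itself is not shown in this message. The sketch below only
reproduces the same class of error, with made-up instance data, by iterating
a dictionary as if it were a list of pairs:

```python
# Made-up data shaped like a name -> info mapping.
instances = {
    "inst1.example.com": {"primary_node": "node1"},
    "inst2.example.com": {"primary_node": "node2"},
}

try:
    # Iterating a dict yields only its keys, so each name string is
    # unpacked character by character into two variables.
    for name, info in instances:
        pass
except ValueError as err:
    error_text = str(err)

# Iterating over items() yields the (name, info) pairs that were intended.
on_node1 = [name for name, info in instances.items()
            if info["primary_node"] == "node1"]
```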

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Iustin Pop [Tue, 15 Nov 2011 17:16:36 +0000 (18:16 +0100)]

htools: rework message display construction

While diagnosing some (unrelated) memory usage in htools, I've
stumbled upon some very bad behaviour in checkData: mapAccum is
non-strict, and so is the tuple we use, so the list of lists of
messages uses a great deal of memory (hundreds of MB for a simulated
cluster with thousands of nodes, all with errors).

The new, explicit reuse of the old message list has a linear memory
behaviour. The only downside is that messages are listed in the
reverse order (which I'll fix on master).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Iustin Pop [Tue, 15 Nov 2011 10:15:34 +0000 (11:15 +0100)]

hbal: handle empty node groups

This patch changes an internal assert (which can only be triggered
when a node group is empty) into properly handling this case (and
returning empty node/instance lists).

While we could handle this in the backend (Cluster.splitNodeGroup)
this would actually mean that we change the behaviour for a cluster
with just two node groups, one of which is empty (where today we
don't require a node group argument).

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Michael Hanselmann [Fri, 11 Nov 2011 17:04:55 +0000 (18:04 +0100)]

Document OpNodeMigrate's result for RAPI

- Commit b7a1c8161 changed the LU to generate jobs
- Mention documented results in NEWS

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Vangelis Koukis [Thu, 27 Oct 2011 17:04:20 +0000 (20:04 +0300)]

Ensure unused ports return to the free port pool

Ensure ports previously allocated by calling ConfigWriter's AllocatePort() are
returned to the pool of free ports when no longer needed:

* Return the network_port of an instance when it is removed
* Return the port used by a DRBD-based disk when it is removed
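
The allocate/release pairing can be sketched with a toy pool. PortPool, its
method names and the port range are assumptions for illustration and do not
reflect ConfigWriter's real interface beyond the AllocatePort() call named
above:

```python
class PortPool:
    """Toy port allocator (hypothetical; not Ganeti's ConfigWriter)."""

    def __init__(self, first, last):
        self._free = set(range(first, last + 1))

    def allocate(self):
        if not self._free:
            raise RuntimeError("no free ports left")
        port = min(self._free)
        self._free.remove(port)
        return port

    def release(self, port):
        # Without this step the ports of removed instances and disks
        # would never become available again.
        self._free.add(port)


pool = PortPool(11000, 11001)
first_port = pool.allocate()
second_port = pool.allocate()
pool.release(first_port)        # e.g. the owning instance was removed
reused_port = pool.allocate()
```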

Signed-off-by: Vangelis Koukis <vkoukis@grnet.gr>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Iustin Pop [Mon, 14 Nov 2011 09:01:06 +0000 (10:01 +0100)]

Re-wrap a paragraph to eliminate a sphinx warning

This just makes sure that the paragraph doesn't contain lines that
start with :, which make Sphinx (1.0.7) complain.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Michael Hanselmann [Mon, 7 Nov 2011 19:42:31 +0000 (20:42 +0100)]

Fail if node/group evacuation can't evacuate instances

If an instance can't be evacuated, only a message would be printed. With
this change the operation always aborts. Newly added unittests check for
this behaviour.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Fri, 4 Nov 2011 07:51:36 +0000 (08:51 +0100)]

LUInstanceRename: Compare name with name

… instead of object with name.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Thu, 3 Nov 2011 15:42:16 +0000 (16:42 +0100)]

LUClusterRepairDiskSizes: Acquire instance locks in exclusive mode

Instances are modified if their disk size doesn't match.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Thu, 27 Oct 2011 16:00:34 +0000 (18:00 +0200)]

Update NEWS for 2.5.0~rc4

I forgot this in the previous patch.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>

Michael Hanselmann [Thu, 27 Oct 2011 13:50:54 +0000 (15:50 +0200)]

Bump version to 2.5.0~rc4

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Thu, 27 Oct 2011 13:44:15 +0000 (15:44 +0200)]

Merge branch 'stable-2.4' into stable-2.5

* stable-2.4:
  Update NEWS and increase to 2.4.5

Conflicts:
  configure.ac: Trivial

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Thu, 27 Oct 2011 13:24:15 +0000 (15:24 +0200)]

jqueue: Allow zero jobs to be submitted at once

If cmdlib.LUNodeMigrate was called for a node without primary instances
it would try to submit an empty list of jobs. This was never visible via
CLI as there we check the list of primary instances first.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

René Nussbaumer [Thu, 27 Oct 2011 12:57:10 +0000 (14:57 +0200)]

Update NEWS and increase to 2.4.5

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Iustin Pop [Wed, 14 Sep 2011 10:44:18 +0000 (12:44 +0200)]

hail: don't select the primary as new secondary

This just adds the primary node of the instance as 'non-allocable'
during the choosing of the new secondary.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
(cherry picked from commit 7073b3a86856bcd8d8a62c0b72f82deaabb8d8f1)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Iustin Pop [Wed, 14 Sep 2011 10:43:38 +0000 (12:43 +0200)]

hail: add an extra safety check in relocate

If we select the primary as new secondary, better to fail than return
wrong data to Ganeti.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
(cherry picked from commit f25508bef4e85032f0468e5a6f0f8930ff154e66)

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Michael Hanselmann [Wed, 26 Oct 2011 06:24:31 +0000 (08:24 +0200)]

Bump version to 2.5.0~rc3

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

René Nussbaumer [Fri, 21 Oct 2011 12:59:50 +0000 (14:59 +0200)]

Merge branch 'devel-2.4' into stable-2.5

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

René Nussbaumer [Wed, 19 Oct 2011 14:51:27 +0000 (16:51 +0200)]

Fix queue archive creation with wrong permissions

On a master failover some of the archive dirs might have wrong
permissions in the non-root model. This is because noded still runs
as root and the job queue is synced through it. This patch fixes
this behaviour by setting the permissions accordingly.

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

René Nussbaumer [Wed, 19 Oct 2011 12:40:58 +0000 (14:40 +0200)]

Ensure permission on the job queue version file

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Wed, 19 Oct 2011 05:43:47 +0000 (07:43 +0200)]

OpGroupVerifyDisks: Fix wrong result type declaration

If an instance actually had a missing disk, the type check would fail.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Tue, 18 Oct 2011 15:27:39 +0000 (17:27 +0200)]

RAPI: Make node evacuation actually work

Commit e1f23243 changed the LU and opcode for node evacuation to receive
a “mode” parameter (among other things). Commit de40437a changed the
RAPI code accordingly, but did so for an earlier version of the first
patch. Obviously this couldn't work, so here's the fix.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Michael Hanselmann [Tue, 18 Oct 2011 14:33:49 +0000 (16:33 +0200)]

Bump version to 2.5.0~rc2

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Michael Hanselmann [Tue, 18 Oct 2011 11:52:23 +0000 (13:52 +0200)]

Merge branch 'devel-2.4' into stable-2.5

* devel-2.4:
  Update NEWS for unreleased 2.4.5

Conflicts:
  NEWS: Trivial

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

Michael Hanselmann [Tue, 18 Oct 2011 11:39:34 +0000 (13:39 +0200)]

Update NEWS for unreleased 2.4.5

I need this for another 2.5 release.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Mon, 17 Oct 2011 13:58:19 +0000 (15:58 +0200)]

RAPI: Fix resource for replacing disks

Commit d1c172deb4f inadvertently changed the
“/2/instances/[instance_name]/replace-disks” resource to use body
parameters. There were no QA tests and the issue wasn't noticed.

This patch re-introduces support for query parameters and adds a QA
test.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>

Michael Hanselmann [Wed, 12 Oct 2011 11:00:19 +0000 (13:00 +0200)]

Merge branch 'devel-2.4' into stable-2.5

* devel-2.4:
  rpc: Disable HTTP client pool and reduce memory consumption
  Fix assertion error on unclean master shutdown

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Wed, 12 Oct 2011 10:37:43 +0000 (12:37 +0200)]

rpc: Disable HTTP client pool and reduce memory consumption

We noticed that “ganeti-masterd” can use large amounts of memory,
especially on large clusters. Measurements showed a single PycURL client
using about 500 kB of heap memory (the actual usage depends on versions,
build options and settings).

The RPC client uses a per-thread HTTP client pool with one client per
node. At this time there are 41 non-main threads (25 for the job queue
and 16 for client requests). This means the HTTP client pools use a lot
of memory (ca. 200 MB for 10 nodes, ca. 1 GB for 50 nodes).
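
The estimate follows from multiplying threads, nodes and the per-client
footprint. A small sketch of that arithmetic, assuming 1 MB = 1000 kB and
exactly one client per node in every thread's pool:

```python
THREADS = 25 + 16    # job queue workers + client request workers
CLIENT_KB = 500      # measured heap use of one PycURL client, in kB


def pool_memory_mb(nodes):
    """Estimated memory used by all per-thread client pools, in MB."""
    return THREADS * nodes * CLIENT_KB / 1000.0


mem_10_nodes = pool_memory_mb(10)   # 41 * 10 * 500 kB
mem_50_nodes = pool_memory_mb(50)   # 41 * 50 * 500 kB
```

This gives 205 MB for 10 nodes and 1025 MB for 50 nodes, matching the
rounded figures above.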

This patch disables the per-thread HTTP client pool. No cleanup of
unused code is done. That will be done in the master branch only.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Fri, 7 Oct 2011 09:58:09 +0000 (11:58 +0200)]

hail: Fix result for node evacuation

According to the iallocator documentation the “node-evacuate” call needs
to return a list of jobs, not a list of lists of jobs.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Tue, 4 Oct 2011 09:29:34 +0000 (11:29 +0200)]

Bump version to 2.5.0~rc1

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Mon, 3 Oct 2011 14:58:22 +0000 (16:58 +0200)]

Fix issue when verifying cluster files

If a cluster has any non-master-candidate nodes, those don't contain all
files (e.g. config.data). With commit aef59ae764dc (March 31st, 2011)
the logic was changed and subsequently verifying a cluster with non-mc
nodes would complain.

This patch fixes this issue by changing the algorithm. It also adds an
additional check for files which shouldn't exist on a machine. A newly
added unittest is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Mon, 3 Oct 2011 10:46:27 +0000 (12:46 +0200)]

Revert "utils.log: Write error messages to stderr"

This reverts commit 34aa8b7c4bb6f5e2e788108e024c9cd70bdb3431. Writing
error messages to stderr would also include backtraces, something we
tried to avoid in the past.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Mon, 3 Oct 2011 10:04:09 +0000 (12:04 +0200)]

Fix adding nodes after commit 64c7b3831dc

Commit 64c7b3831dc changed the RPC call for verifying SSH connections.
Unfortunately this case in adding nodes was missed.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Michael Hanselmann [Fri, 30 Sep 2011 15:48:28 +0000 (17:48 +0200)]

LUClusterVerifyGroup: Spread SSH checks over more nodes

When verifying a group the code would always check SSH to all nodes in
the same group, as well as the first node for every other group. On big
clusters this can cause issues since many nodes will try to connect to
the first node of another group at the same time. This patch changes the
algorithm to choose a different node every time.
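
The selection algorithm itself is not described here. One possible way to
spread the checks is a simple rotation per group, sketched below with a
hypothetical data layout (group name mapped to a list of node names):

```python
import itertools


def make_selector(groups):
    """Return a function picking one node per foreign group, rotating.

    Hypothetical sketch; groups maps group name -> list of node names.
    """
    cycles = {name: itertools.cycle(sorted(nodes))
              for name, nodes in groups.items()}

    def select(own_group):
        # Advance only the cycles of the other groups.
        return {name: next(cyc) for name, cyc in cycles.items()
                if name != own_group}

    return select


select = make_selector({"g1": ["n1", "n2"], "g2": ["n3", "n4"]})
first = select("g1")
second = select("g1")
third = select("g1")
```

Successive calls return a different node of the other group, so no single
node receives all the SSH connections.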

A unittest for the selection algorithm is included.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

Iustin Pop [Fri, 30 Sep 2011 14:35:29 +0000 (16:35 +0200)]

Optimise cli.JobExecutor with many pending jobs

In the case we submit many pending jobs (> 100) to the masterd, the
JobExecutor 'spams' the master daemon with status requests for the
status of all the jobs, even though in the end it will only choose a
single job for polling.

This is far from optimal, because when the master is busy processing
small/fast jobs, this query forces reading all the jobs from
disk. Restricting the 'window' of jobs that we query from the entire
set to a smaller subset makes a huge difference (masterd only, 0s
delay jobs, all jobs to tmpfs thus no I/O involved):

- submitting/waiting for 500 jobs:
  - before: ~21 s
  - after:   ~5 s
- submitting/waiting for 1K jobs:
  - before: ~76 s
  - after:   ~8 s

This is with a batch of 25 jobs. With a batch of 50 jobs, it goes from
8s to 12s. I think that choosing the 'best' job for nice output only
matters with a small number of jobs, and that for more than that
people will not actually watch the jobs. So changing from 'perfect
job' to 'best job in the first 25' should be OK.
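
The windowing idea amounts to polling only a prefix of the pending jobs. A
minimal sketch, with a hypothetical function name and the batch size of 25
mentioned above:

```python
def jobs_to_query(pending_job_ids, window=25):
    """Return the subset of pending jobs whose status is polled.

    Hypothetical helper; only the first `window` jobs are considered
    when choosing the job to wait for.
    """
    return pending_job_ids[:window]


pending_jobs = list(range(1, 1001))   # 1K submitted jobs
queried = jobs_to_query(pending_jobs)
few = jobs_to_query([7, 8, 9])        # fewer jobs than the window
```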

Note that most jobs won't execute as fast as 0 delay, but this is
still a good improvement.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

Michael Hanselmann [Fri, 30 Sep 2011 09:54:20 +0000 (11:54 +0200)]

listrunner: Don't pass arguments if there are none

If no arguments were specified the “exec_args” variable was “None”,
leading to the command being run as “… ./… None”.
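
The failure mode can be sketched with plain string formatting. The command
layout and the function names below are made up for illustration, since the
real command line is elided above:

```python
def build_command_buggy(script, exec_args=None):
    """Format the arguments unconditionally (reproduces the bug)."""
    return "./%s %s" % (script, exec_args)


def build_command(script, exec_args=None):
    """Leave the argument part out entirely when there is none."""
    if exec_args is None:
        return "./%s" % script
    return "./%s %s" % (script, exec_args)


buggy = build_command_buggy("check.sh")
fixed = build_command("check.sh")
with_args = build_command("check.sh", "--verbose")
```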

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

Michael Hanselmann [Fri, 30 Sep 2011 09:29:50 +0000 (11:29 +0200)]

ssh: Quote strings in error message

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

commit | commitdiff | tree

Michael Hanselmann [Fri, 30 Sep 2011 09:28:59 +0000 (11:28 +0200)]

utils.log: Write error messages to stderr

When “gnt-cluster copyfile” failed it would only print “Copy of file …
to node … failed”. A detailed message is written using logging.error.
Writing error messages to stderr can be helpful in figuring out what
went wrong (the messages also go to the log file, but not everyone might
know about it).
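
As a rough sketch of the idea using only the standard logging module
(this is not the utils.log code; setup_logging and its parameters are
names invented for the example), errors can be sent to stderr in
addition to the log file like this:

```python
import logging
import sys


def setup_logging(logfile, name="example", stderr_level=logging.ERROR):
    """Log everything to `logfile` and errors (or worse) to stderr too."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    # All messages go to the log file.
    file_handler = logging.FileHandler(logfile)
    logger.addHandler(file_handler)

    # Only messages at `stderr_level` and above are echoed to stderr.
    stderr_handler = logging.StreamHandler(sys.stderr)
    stderr_handler.setLevel(stderr_level)
    logger.addHandler(stderr_handler)

    return logger
```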

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Iustin Pop [Fri, 30 Sep 2011 08:30:44 +0000 (10:30 +0200)]

Add signal handling doc to hbal man page

Also remove a bug note, since hbal has long been able to execute jobs
directly.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Iustin Pop [Wed, 28 Sep 2011 10:38:12 +0000 (12:38 +0200)]

Fix handling of cluster verify hooks

The change to enforce boolean results for cluster verify group opcode
missed the HooksCallBack, which uses a very ugly 1/0
logic. Furthermore, the logic is wrong, since it unconditionally
resets the verify result to true.

The patch is changed to simply treat hook failures as failures, and do
nothing for offline nodes.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Iustin Pop [Wed, 28 Sep 2011 09:06:06 +0000 (11:06 +0200)]

Redistribute the RAPI certificate

This reverts to the old behaviour in Ganeti 2.4 and before.

Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 22 Sep 2011 10:20:39 +0000 (12:20 +0200)]

QA: Add tests for instance start/stop via RAPI

This would have detected the issue fixed in the previous patch.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 22 Sep 2011 10:19:56 +0000 (12:19 +0200)]

RAPI: Fix wrong check on instance shutdown

Commit 7fa310f6d84 (April 1st, 2011) converted the RAPI resource for
shutting down an instance to FillOpCode. Unfortunately it missed the
fact that the shutdown resource gets its parameters as query arguments.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 8 Sep 2011 11:36:15 +0000 (13:36 +0200)]

baserlib: Accept empty body in FillOpcode

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
(cherry picked from commit c6e1a3eef05674d637570c39f25a799cec7ba187)

Signed-off-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 14 Jul 2011 20:49:34 +0000 (22:49 +0200)]

Fix assertion error on unclean master shutdown

Commit 66bd7445 added an assertion to ensure a finalized job has its
“end_timestamp” attribute set. Unfortunately it didn't cover a case when
the queue is recovering from an unclean master shutdown.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>
Reviewed-by: René Nussbaumer <rn@google.com>
(cherry picked from commit 45df0793c6bc83001aa545fda95c1ad9a35d732f)

commit | commitdiff | tree

Michael Hanselmann [Wed, 31 Aug 2011 13:47:47 +0000 (15:47 +0200)]

Version bump for 2.5.0~beta3

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Michael Hanselmann [Tue, 30 Aug 2011 15:37:54 +0000 (17:37 +0200)]

Makefile: Use $(LN_S) instead of “ln -s”

Some platforms apparently don't support “ln -s”, otherwise Autoconf
wouldn't have AC_PROG_LN_S.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Andrea Spadaccini [Mon, 29 Aug 2011 19:28:41 +0000 (20:28 +0100)]

Fixes to errors/warnings raised by pylint 0.24

Running pylint 0.24.0 revealed 2 errors and 1 warning. Here is how I
fixed them:

* jqueue.py: silenced E1101
* netutils.py: rewrote the list comprehension using extend()
* watcher/__init__.py: fixed a missing format string parameter

These changes are backwards-compatible with pylint 0.21.1.

Signed-off-by: Andrea Spadaccini <spadaccio@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Andrea Spadaccini [Fri, 26 Aug 2011 15:31:12 +0000 (16:31 +0100)]

PEP8 for QA

- Makefile.am: added QA directory to the paths checked by pep8
- qa/: fixed the reported errors
- Makefile.am: also, added qa_group.py to qa_scripts

Signed-off-by: Andrea Spadaccini <spadaccio@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Michael Hanselmann [Tue, 30 Aug 2011 08:47:49 +0000 (10:47 +0200)]

listrunner: Allow passing of arguments to executable

This wasn't possible until now.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

commit | commitdiff | tree

Andrea Spadaccini [Mon, 29 Aug 2011 19:15:15 +0000 (20:15 +0100)]

DeprecationWarning fixes for pylint

In version 0.21, pylint unified all the disable-* (and enable-*)
directives to disable (resp. enable). This leads to a lot of
DeprecationWarning being emitted even if one uses the recommended
version of pylint (0.21.1, as stated in devnotes.rst).

This commit changes all the disable-msg directives to disable.
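
For illustration, the two spellings look like this in a source file.
The class and function are made up for the example; W0212 is pylint's
protected-access message.

```python
class Example(object):
    def __init__(self):
        self._value = 1


def peek(obj):
    # Old spelling, which triggers the deprecation warning:
    #   return obj._value  # pylint: disable-msg=W0212
    # Unified spelling:
    return obj._value  # pylint: disable=W0212
```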

Signed-off-by: Andrea Spadaccini <spadaccio@google.com>
Signed-off-by: Iustin Pop <iustin@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Michael Hanselmann [Mon, 29 Aug 2011 15:34:24 +0000 (17:34 +0200)]

listrunner: Replace str.split with library functions

- str.split("/").pop() should be os.path.basename
- str.split("\n") should be str.splitlines()
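
A short illustration of why the library functions are preferable (the
sample values are made up): the results match for simple paths, while
splitlines() avoids the trailing empty string and also handles other
line endings.

```python
import os.path

path = "/usr/local/bin/tool"
# Both give "tool", but basename states the intent directly.
by_split = path.split("/").pop()
by_basename = os.path.basename(path)

output = "line one\nline two\n"
with_split = output.split("\n")      # ['line one', 'line two', '']
with_splitlines = output.splitlines()  # ['line one', 'line two']
```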

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Tsachy Shacham [Wed, 24 Aug 2011 08:30:22 +0000 (10:30 +0200)]

Minor updates and fixes to CPU pinning design doc

Signed-off-by: Tsachy Shacham <tsachy@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

René Nussbaumer [Fri, 26 Aug 2011 14:50:58 +0000 (16:50 +0200)]

Merge branch 'devel-2.4' into devel-2.5

Conflicts:
NEWS (trivial)
configure.ac (trivial)
daemons/ensure-dirs.in (deleted)

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Michael Hanselmann [Fri, 26 Aug 2011 12:21:35 +0000 (14:21 +0200)]

utils: Fix UnescapeAndSplit parsing bug

If a value passed to UnescapeAndSplit ended with a backslash an
exception would be raised:

$ gnt-instance modify -H mem=x\\ inst1.example.com
[…]
e2 = slist.pop(0)
IndexError: pop from empty list
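
The edge case can be shown with a simplified splitter. This is a sketch
written for this note, not the Ganeti implementation of
UnescapeAndSplit; in particular, keeping a trailing backslash as a
literal character is an assumption about the desired behaviour.

```python
def unescape_and_split(text, sep=","):
    """Split `text` on `sep`, treating backslash as an escape character."""
    parts = []
    current = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            # Take the next character literally. A trailing backslash has
            # nothing to escape, so keep it as-is instead of failing.
            current.append(next(chars, "\\"))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts
```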

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Andrea Spadaccini [Thu, 25 Aug 2011 16:47:07 +0000 (17:47 +0100)]

Delete master IPs from mergee master nodes

Added a step in cluster-merge that removes the cluster IP from the
master node of the mergee clusters.

Signed-off-by: Andrea Spadaccini <spadaccio@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 25 Aug 2011 10:15:10 +0000 (12:15 +0200)]

Use pep8 utility in “make lint”

This utility checks whether the code conforms to PEP8. Some checks had
to be disabled for Ganeti.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 25 Aug 2011 15:57:25 +0000 (17:57 +0200)]

Two more PEP8 fixes

cmdlib: Avoid wrapping using backslash

gnt_group: Avoid ** magic using keyword arguments (the “pep8” tool
doesn't like the inline comment in this case and will complain about
spaces around the “**” operator)

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Michael Hanselmann [Tue, 23 Aug 2011 13:12:44 +0000 (15:12 +0200)]

check-python-code: Give location(s) of lines longer than 80 chars

Until now it would only say that there was a line longer than 80
characters, but not where.
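
The check itself is simple to sketch in Python. The function below is
hypothetical and is not the check-python-code script; it only shows
reporting where the long lines are rather than just that they exist.

```python
def find_long_lines(path, limit=80):
    """Return (line_number, length) for every line longer than `limit`."""
    hits = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            # Do not count the newline itself towards the length.
            length = len(line.rstrip("\n"))
            if length > limit:
                hits.append((number, length))
    return hits
```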

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Michael Hanselmann [Thu, 25 Aug 2011 10:36:56 +0000 (12:36 +0200)]

PEP8 style fixes

Identified using the “pep8” utility.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Guido Trotter [Tue, 23 Aug 2011 12:42:33 +0000 (13:42 +0100)]

Wrap a few long lines

Had to break it as well, today! ;)

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Michael Hanselmann [Tue, 23 Aug 2011 16:07:09 +0000 (18:07 +0200)]

listrunner: Avoid exception if machine is rebooted

Handle exceptions gracefully when trying to read the command's output.

Signed-off-by: Michael Hanselmann <hansmi@google.com>
Reviewed-by: Iustin Pop <iustin@google.com>

commit | commitdiff | tree

Guido Trotter [Tue, 23 Aug 2011 11:15:12 +0000 (12:15 +0100)]

Remove wrong type declaration from option

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Andrea Spadaccini <spadaccio@google.com>

commit | commitdiff | tree

Andrea Spadaccini [Tue, 23 Aug 2011 10:28:42 +0000 (11:28 +0100)]

Fix wrong method name in cluster-merge

Fixed a wrong method name in the last patch.

Signed-off-by: Andrea Spadaccini <spadaccio@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

René Nussbaumer [Tue, 23 Aug 2011 09:42:51 +0000 (11:42 +0200)]

Version bump 2.4.4

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Guido Trotter [Tue, 23 Aug 2011 09:47:08 +0000 (10:47 +0100)]

Fix --skip-stop-instances help message

Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

Andrea Spadaccini [Tue, 23 Aug 2011 08:48:40 +0000 (09:48 +0100)]

cluster-merge: Add the --skip-stop-instances opt

This option checks for running instances on the mergee clusters
instead of stopping them.

Signed-off-by: Andrea Spadaccini <spadaccio@google.com>
Signed-off-by: Guido Trotter <ultrotter@google.com>
Reviewed-by: Guido Trotter <ultrotter@google.com>

commit | commitdiff | tree

René Nussbaumer [Tue, 23 Aug 2011 09:21:36 +0000 (11:21 +0200)]

Update NEWS file

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>

commit | commitdiff | tree

René Nussbaumer [Tue, 23 Aug 2011 09:10:36 +0000 (11:10 +0200)]

Documentation fix for importing with --src-dir option

Signed-off-by: Agata Murawska <agatamurawska@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
(cherry picked from commit b7d7876bd0e9844fab8be28bfa1fd5d563ec7412)

Conflicts:

lib/cmdlib.py (easily fixed)

commit | commitdiff | tree

René Nussbaumer [Tue, 23 Aug 2011 09:04:27 +0000 (11:04 +0200)]

Adding missing test data for commit 7a380ddfc

Signed-off-by: René Nussbaumer <rn@google.com>
Reviewed-by: Michael Hanselmann <hansmi@google.com>
