Revision e0eb13de

b/doc/design-2.0.rst
The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3

Objective
=========
......
Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.

Instance parameters
+++++++++++++++++++

......
E.g. for the drbd shared secrets, we could export these with the
values replaced by an empty string.

Node flags
~~~~~~~~~~

Ganeti 2.0 adds three node flags that change the way nodes are handled
within Ganeti and the related infrastructure (iallocator interaction,
RAPI data export).

*master candidate* flag
+++++++++++++++++++++++

Ganeti 2.0 allows more scalability in operation by introducing
parallelization. However, this exposes a new bottleneck: the
synchronization and replication of the cluster configuration to all
nodes in the cluster.

This breaks scalability, as the speed of the replication decreases
roughly with the number of nodes in the cluster. The goal of the
master candidate flag is to change this O(n) into O(1) with respect to
job and configuration data propagation.

Only nodes having this flag set (let's call this set of nodes the
*candidate pool*) will have jobs and configuration data replicated.

The cluster will have a new parameter (runtime changeable) called
``candidate_pool_size``, which represents the number of candidates the
cluster tries to maintain (preferably automatically).
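
A minimal sketch of how such automatic maintenance could look is given
below; the node objects and attribute names are assumptions used only
for illustration, not the actual Ganeti data structures::

  def maintain_candidate_pool(nodes, candidate_pool_size):
    """Return the names of nodes to promote to master candidate.

    Each node is assumed to expose boolean attributes
    'master_candidate', 'offline' and 'drained' (illustrative names).

    """
    candidates = [n for n in nodes if n.master_candidate]
    missing = candidate_pool_size - len(candidates)
    if missing <= 0:
      return []
    eligible = [n for n in nodes
                if not (n.master_candidate or n.offline or n.drained)]
    return [n.name for n in eligible[:missing]]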

This will impact the cluster operations as follows:

- jobs and config data will be replicated only to a fixed set of nodes
- master fail-over will only be possible to a node in the candidate pool
- cluster verify needs changing to account for these two roles
- external scripts will no longer have access to the configuration
  file (this is not recommended anyway)

The caveats of this change are:

- if all candidates are lost (completely), the cluster configuration is
  lost (but it should be backed up externally to the cluster anyway)

- failed nodes which are candidates must be dealt with properly, so
  that we don't lose too many candidates at the same time; this will be
  reported in cluster verify

- the 'all equal' concept of Ganeti is no longer true

- the partial distribution of config data means that all nodes will
  have to revert to ssconf files for master info (as in 1.2)

Advantages:

- speed on a simulated cluster of 100+ nodes is greatly enhanced, even
  for a simple operation; ``gnt-instance remove`` on a diskless
  instance goes from ~9 seconds to ~2 seconds

- failure of non-candidate nodes will have less impact on the cluster

The default value for the candidate pool size will be set to 10, but
this can be changed at cluster creation and modified any time later.

Testing on simulated big clusters with sequential and parallel jobs
shows that this value (10) is a sweet spot from a performance and load
point of view.

*offline* flag
++++++++++++++

In order to better support the situation in which nodes are offline
(e.g. for repair) without altering the cluster configuration, Ganeti
needs to be told about this state and needs to handle it properly for
such nodes.

This will result in simpler procedures and fewer mistakes when the
number of node failures is high on an absolute scale (either due to a
high failure rate or simply to big clusters).

Nodes having this attribute set will not be contacted for inter-node
RPC calls, will not be master candidates, and will not be able to host
instances as primaries.

Setting this attribute on a node (see the sketch after this list):

- will not be allowed if the node is the master
- will not be allowed if the node has primary instances
- will cause the node to be demoted from the master candidate role (if
  it was), possibly causing another node to be promoted to that role
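
A rough sketch of these checks, under the assumption of simple node and
cluster objects (the names below are illustrative, not the real Ganeti
code), could look like::

  class PrereqError(Exception):
    """Raised when a precondition for the operation is not met."""

  def set_node_offline(node, cluster):
    """Mark a node as offline, enforcing the constraints above."""
    if node.name == cluster.master_node:
      raise PrereqError("refusing to offline the master node")
    if any(inst.primary_node == node.name for inst in cluster.instances):
      raise PrereqError("node still hosts primary instances")
    if node.master_candidate:
      # demote from the candidate role; a later step may promote another
      # node to keep the candidate pool at its configured size
      node.master_candidate = False
    node.offline = True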

This attribute will impact the cluster operations as follows:

- querying these nodes for anything will fail instantly in the RPC
  library, with a specific RPC error (``RpcResult.offline == True``)

- they will be listed in the "Other" section of cluster verify

The code is changed in the following ways:

- RPC calls were converted to skip such nodes (see the sketch after
  this list):

  - RpcRunner-instance-based RPC calls are easy to convert

  - static/classmethod RPC calls are harder to convert, and were left
    alone

- the RPC results were unified so that this new result state (offline)
  can be differentiated

- master voting still queries nodes that are under repair, as we need
  to ensure consistency in case the (wrong) masters have old data, and
  nodes have come back from repairs
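
As an illustration of the first point, a per-node RPC dispatcher could
short-circuit offline nodes along the following lines; ``RpcResult``
here is a simplified stand-in for the unified result class and
``call_fn`` is a hypothetical callable doing the actual network call::

  class RpcResult(object):
    """Simplified, illustrative stand-in for the unified RPC result."""

    def __init__(self, data=None, failed=False, offline=False):
      self.data = data
      self.failed = failed
      self.offline = offline

  def call_nodes(nodes, call_fn):
    """Run an RPC call against a list of nodes, skipping offline ones."""
    results = {}
    for node in nodes:
      if node.offline:
        # fail instantly, without touching the network
        results[node.name] = RpcResult(failed=True, offline=True)
      else:
        results[node.name] = RpcResult(data=call_fn(node))
    return results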

Caveats:

- some operation semantics are less clear (e.g. what to do on instance
  start with an offline secondary?); for now, these will just fail as
  if the flag is not set (but faster)
- a 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get a
  quorum there)

One of the advantages of implementing this flag is that it will allow
future automation tools to automatically put a node into repairs and
recover it from this state, and the code (should/will) handle this
much better than just timing out. So, possible future improvements
(for later versions; sketched after this list):

- the watcher will detect nodes which fail RPC calls, will attempt to
  ssh to them, and on failure will put them offline
- the watcher will try to ssh to and query the offline nodes, and if
  successful will take them off the repair list
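
One possible shape for such a watcher pass is sketched below; the
helper callables (``rpc_ping``, ``ssh_ping``, ``mark_offline``,
``mark_online``) are purely hypothetical placeholders for the checks
described in the list above::

  def watcher_pass(nodes, rpc_ping, ssh_ping, mark_offline, mark_online):
    """One watcher iteration over the cluster nodes (sketch only)."""
    for node in nodes:
      if not node.offline:
        # a node failing both RPC and ssh is put offline
        if not rpc_ping(node) and not ssh_ping(node):
          mark_offline(node)
      else:
        # an offline node answering again can be taken off the repair list
        if ssh_ping(node):
          mark_online(node)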

Alternatives considered: the RPC call model in 2.0 is, by default,
much nicer - errors are logged in the background and job/opcode
execution is clearer - so we could simply not introduce this flag.
However, having this state will make both the codepaths (offline
vs. temporary failure) and the operational model (it's not a node with
errors, but an offline node) clearer.

*drained* flag
++++++++++++++

Due to parallel execution of jobs in Ganeti 2.0, we could have the
following situation:

- gnt-node migrate + failover is run
- gnt-node evacuate is run, which schedules a long-running 6-opcode
  job for the node
- partway through, a new job comes in that runs an iallocator script,
  which finds the above node as empty and a very good candidate
- gnt-node evacuate has finished, but now it has to be run again, to
  clean the above instance(s)

In order to prevent this situation, and to be able to get nodes into a
proper offline status easily, a new *drained* flag was added to the
nodes.

This flag (which actually means "is being, or was, drained, and is
expected to go offline") will prevent allocations on the node, but all
other operations (start/stop instance, query, etc.) will keep working
without any restrictions.
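
For example, an allocator-style selection of target nodes would simply
filter out drained (and offline) nodes before scoring them; the scoring
function below is a hypothetical placeholder::

  def allocation_candidates(nodes, score_fn):
    """Return nodes usable for new allocations, best first (sketch).

    Drained and offline nodes are excluded, even though a drained node
    still accepts normal operations on its existing instances.

    """
    usable = [n for n in nodes if not (n.drained or n.offline)]
    return sorted(usable, key=score_fn, reverse=True)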

Interaction between flags
+++++++++++++++++++++++++

While these flags are implemented as separate flags, they are
mutually exclusive and act together with the master node role as a
single *node status* value. In other words, a node is in only one of
these roles at a given time. The lack of any of these flags denotes a
regular node.
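
Conceptually, the combined status could be derived as in the following
sketch (the status names are illustrative, not the exact strings used
by Ganeti)::

  def node_status(node, master_name):
    """Collapse the mutually-exclusive flags into a single status."""
    if node.name == master_name:
      return "master"
    if node.offline:
      return "offline"
    if node.drained:
      return "drained"
    if node.master_candidate:
      return "master candidate"
    return "regular"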

The current node status is visible in the ``gnt-cluster verify``
output, and the individual flags can be examined via separate fields
in the ``gnt-node list`` output.

These new flags will be exported both in the iallocator input message
and via RAPI; see the respective man pages for the exact names.

Feature changes
---------------

  
