Revision e0eb13de
b/doc/design-2.0.rst

The 2.0 version will constitute a rewrite of the 'core' architecture,
paving the way for additional features in future 2.x versions.

.. contents:: :depth: 3


Objective
=========

...

Node-related parameters are very few, and we will continue using the
same model for these as previously (attributes on the Node object).

There are three new node flags, described in a separate section "node
flags" below.


Instance parameters
+++++++++++++++++++

...

E.g. for the drbd shared secrets, we could export these with the
values replaced by an empty string.
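
As an illustration only, such filtering could look like the sketch
below; the helper and key names are hypothetical, not Ganeti's actual
export code::

  # Hypothetical sketch: blank out sensitive values before exporting
  # cluster parameters.
  SENSITIVE_KEYS = frozenset(["drbd_secret", "shared_secret"])

  def ExportParameters(params):
    """Return a copy of params with sensitive values blanked out."""
    return dict((key, "" if key in SENSITIVE_KEYS else value)
                for key, value in params.items())

  # The drbd shared secret is exported as an empty string:
  # {'drbd_secret': '', 'minor_count': 128}
  print(ExportParameters({"drbd_secret": "s3cr3t", "minor_count": 128}))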

Node flags
~~~~~~~~~~

Ganeti 2.0 adds three node flags that change the way nodes are handled
within Ganeti and the related infrastructure (iallocator interaction,
RAPI data export).

*master candidate* flag
+++++++++++++++++++++++

Ganeti 2.0 improves scalability in operation by introducing
parallelization. However, this exposes a new bottleneck: the
synchronization and replication of the cluster configuration to all
nodes in the cluster.

This hurts scalability, as the speed of the replication decreases
roughly with the number of nodes in the cluster. The goal of the
master candidate flag is to change this O(n) into O(1) with respect to
job and configuration data propagation.

Only nodes having this flag set (let's call this set of nodes the
*candidate pool*) will have jobs and configuration data replicated.

The cluster will have a new parameter (runtime changeable) called
``candidate_pool_size``, which represents the number of candidates the
cluster tries to maintain (preferably automatically).
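
A minimal sketch of such automatic maintenance, assuming a
hypothetical ``Node`` class and a function name that does not exist in
Ganeti::

  class Node(object):
    def __init__(self, name, master_candidate=False):
      self.name = name
      self.master_candidate = master_candidate

  def MaintainCandidatePool(nodes, candidate_pool_size):
    """Promote nodes until the pool reaches the configured size."""
    candidates = [node for node in nodes if node.master_candidate]
    for node in nodes:
      if len(candidates) >= candidate_pool_size:
        break
      if not node.master_candidate:
        # Jobs and configuration data will be replicated to this node.
        node.master_candidate = True
        candidates.append(node)
    return candidates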
|
1009 |
This will impact the cluster operations as follows: |
|
1010 |
|
|
1011 |
- jobs and config data will be replicated only to a fixed set of nodes |
|
1012 |
- master fail-over will only be possible to a node in the candidate pool |
|
1013 |
- cluster verify needs changing to account for these two roles |
|
1014 |
- external scripts will no longer have access to the configuration |
|
1015 |
file (this is not recommended anyway) |
|
1016 |
|
|
1017 |
|
|
1018 |
The caveats of this change are: |
|
1019 |
|
|
1020 |
- if all candidates are lost (completely), cluster configuration is |
|
1021 |
lost (but it should be backed up external to the cluster anyway) |
|
1022 |
|
|
1023 |
- failed nodes which are candidate must be dealt with properly, so |
|
1024 |
that we don't lose too many candidates at the same time; this will be |
|
1025 |
reported in cluster verify |
|
1026 |
|
|
1027 |
- the 'all equal' concept of ganeti is no longer true |
|
1028 |
|
|
1029 |
- the partial distribution of config data means that all nodes will |
|
1030 |
have to revert to ssconf files for master info (as in 1.2) |
|
1031 |
|
|

Advantages:

- speed on a simulated cluster of 100+ nodes is greatly enhanced, even
  for a simple operation; ``gnt-instance remove`` on a diskless
  instance goes from ~9 seconds to ~2 seconds

- node failure of non-candidates will have less impact on the cluster

The default value for the candidate pool size will be set to 10, but
this can be changed at cluster creation and modified at any time
later.

Testing on simulated big clusters with sequential and parallel jobs
shows that this value (10) is a sweet spot from a performance and load
point of view.

*offline* flag
++++++++++++++

In order to better support the situation in which nodes are offline
(e.g. for repair) without altering the cluster configuration, Ganeti
needs to be told about, and to properly handle, this state for nodes.

This will result in simpler procedures, and fewer mistakes, when the
number of node failures is high on an absolute scale (either due to a
high failure rate or simply big clusters).

Nodes having this attribute set will not be contacted for inter-node
RPC calls, will not be master candidates, and will not be able to host
instances as primaries.

Setting this attribute on a node:

- will not be allowed if the node is the master
- will not be allowed if the node has primary instances
- will cause the node to be demoted from the master candidate role (if
  it held that role), possibly causing another node to be promoted to
  it

This attribute will impact the cluster operations as follows:

- querying these nodes for anything will fail instantly in the RPC
  library, with a specific RPC error (``RpcResult.offline == True``);
  see the sketch after this list

- they will be listed in the Other section of cluster verify
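
As an illustration of that caller-side check (the surrounding types
and the RPC call itself are stand-ins, not the actual rpc module)::

  import collections

  Node = collections.namedtuple("Node", ["name", "offline"])

  class RpcResult(object):
    def __init__(self, node, offline=False, payload=None):
      self.node = node
      self.offline = offline  # True: node skipped, no network call made
      self.payload = payload

  def CallNodeInfo(node):
    """Fail instantly for offline nodes instead of timing out."""
    if node.offline:
      return RpcResult(node.name, offline=True)
    # ... the real RPC call would happen here ...
    return RpcResult(node.name, payload={"memory_free": 2048})

  result = CallNodeInfo(Node("node3.example.com", offline=True))
  if result.offline:
    print("%s is offline, skipping" % result.node)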

The code is changed in the following ways:

- RPC calls were converted to skip such nodes:

  - RpcRunner-instance-based RPC calls are easy to convert

  - static/classmethod RPC calls are harder to convert, and were left
    alone

- the RPC results were unified so that this new result state (offline)
  can be differentiated

- master voting still queries in-repair nodes, as we need to ensure
  consistency in case the (wrong) masters have old data, and nodes
  have come back from repairs

Caveats:

- some operation semantics are less clear (e.g. what to do on instance
  start with an offline secondary?); for now, these will just fail as
  if the flag were not set (but faster)

- a 2-node cluster with one node offline needs manual startup of the
  master with a special flag to skip voting (as the master can't get a
  quorum there)

One of the advantages of implementing this flag is that it will allow
future automation tools to automatically put the node into repairs and
recover from this state, and the code should handle this much better
than just timing out. So, possible future improvements (for later
versions), sketched after the list:

- the watcher will detect nodes which fail RPC calls, will attempt to
  ssh to them, and on failure will put them offline

- the watcher will try to ssh to and query the offline nodes; if
  successful, it will take them off the repair list
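
A purely speculative sketch of the first check; none of the helpers
below exist, and the plain ssh probe is only one possible
implementation::

  import subprocess

  def CheckFailedNode(name):
    """Decide whether a node that failed RPC should be set offline."""
    # RPC already failed; fall back to a plain ssh reachability probe.
    ret = subprocess.call(["ssh", "-o", "ConnectTimeout=5",
                           name, "true"])
    return "mark-offline" if ret != 0 else "leave-alone"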

Alternatives considered: the RPC call model in 2.0 is, by default,
much nicer - errors are logged in the background, and job/opcode
execution is clearer - so we could simply not introduce this. However,
having this state will make both the codepaths clearer (offline
vs. temporary failure) and the operational model (it's not a node with
errors, but an offline node).

*drained* flag
++++++++++++++

Due to the parallel execution of jobs in Ganeti 2.0, we could have the
following situation:

- gnt-node migrate + failover is run
- gnt-node evacuate is run, which schedules a long-running 6-opcode
  job for the node
- partway through, a new job comes in that runs an iallocator script,
  which finds the above node empty and a very good candidate
- gnt-node evacuate has finished, but now it has to be run again, to
  clean up the above instance(s)

In order to prevent this situation, and to be able to get nodes into
proper offline status easily, a new *drained* flag was added to nodes.

This flag (which actually means "is being, or was, drained, and is
expected to go offline") will prevent allocations on the node, but
otherwise all other operations (start/stop instance, query, etc.) will
work without any restrictions.
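
Illustration only (with stand-in types): an allocation path honouring
the flag might simply filter its candidate node list, while every
other code path ignores it::

  import collections

  Node = collections.namedtuple("Node", ["name", "offline", "drained"])

  def CanAllocate(node):
    """Drained and offline nodes must never receive new instances."""
    return not (node.offline or node.drained)

  nodes = [Node("node1", False, False), Node("node2", False, True)]
  # Only node1 remains a valid allocation target.
  targets = [node for node in nodes if CanAllocate(node)]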

Interaction between flags
+++++++++++++++++++++++++

While these flags are implemented as separate flags, they are mutually
exclusive, and they act together with the master node role as a single
*node status* value. In other words, a node is only in one of these
roles at a given time. The lack of any of these flags denotes a
regular node.
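
A sketch of collapsing the flags and the master role into one status
value (the function and attribute names are illustrative)::

  import collections

  Node = collections.namedtuple(
    "Node", ["name", "master_candidate", "offline", "drained"])

  def NodeStatus(node, master_name):
    """Map the mutually-exclusive flags to a single status string."""
    if node.name == master_name:
      return "master"
    if node.offline:
      return "offline"
    if node.drained:
      return "drained"
    if node.master_candidate:
      return "master-candidate"
    return "regular"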

The current node status is visible in the ``gnt-cluster verify``
output, and the individual flags can be examined via separate fields
in the ``gnt-node list`` output.

These new flags will be exported in both the iallocator input message
and via RAPI; see the respective man pages for the exact names.

Feature changes
---------------