Revision c3c5dc77 doc/design-2.2.rst

Core changes
------------

Master Daemon Scaling improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Current state and shortcomings
++++++++++++++++++++++++++++++

Currently the Ganeti master daemon is based on four sets of threads:

- The main thread (1 thread) just accepts connections on the master
  socket
- The client worker pool (16 threads) handles those connections, one
  thread per connected socket, parses luxi requests, and sends data
  back to the clients
- The job queue worker pool (25 threads) executes the actual jobs
  submitted by the clients
- The rpc worker pool (10 threads) interacts with the nodes via
  HTTP-based RPC
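
For illustration only, a minimal sketch of this one-thread-per-client
model (not the actual masterd code; the socket path and the request
handling are invented for the example)::

  import socket
  import threading

  def handle_client(conn):
    # Each worker thread is bound to one connected socket and blocks
    # on recv() until the client sends something.
    data = conn.recv(4096)
    # ... parse the luxi request, execute it, send back the reply ...
    conn.close()

  server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
  server.bind("/tmp/example-master.sock")  # hypothetical path
  server.listen(16)

  while True:
    conn, _ = server.accept()
    # One thread per connection; masterd uses a fixed pool of 16 such
    # workers, so 16 idle connections starve every other client.
    threading.Thread(target=handle_client, args=(conn,)).start()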

This means that every masterd currently runs 52 threads to do its job.
Being able to reduce the number of thread sets would make the master's
architecture a lot simpler. Moreover, having fewer threads can help
decrease lock contention, log pollution and memory usage.
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

- Since the 16 client worker threads handle one connection each, it's
  very easy to exhaust them by simply connecting to masterd 16 times
  and not sending any data. While we could perhaps make those pools
  resizable, increasing the number of threads won't help with lock
  contention.
- Some luxi operations (in particular REQ_WAIT_FOR_JOB_CHANGE) make
  the relevant client thread block on its job for a relatively long
  time. This makes it easier to exhaust the 16 client threads.
- The job queue lock is quite heavily contended, and certain easily
  reproducible workloads show that it's very easy to put masterd in
  trouble: for example, running ~15 background instance reinstall jobs
  results in a master daemon that, even without having exhausted the
  client worker threads, can't answer simple job list requests or
  submit more jobs.
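
The first issue can be demonstrated with a trivial client; a sketch,
assuming a hypothetical master socket path::

  import socket

  MASTER_SOCKET = "/var/lib/ganeti/master.sock"  # hypothetical path

  # Open 16 connections and send nothing: every client worker thread
  # is now tied to an idle socket, and further clients hang.
  conns = []
  for _ in range(16):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.connect(MASTER_SOCKET)
    conns.append(s)  # keep the sockets open, but idle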

Proposed changes
++++++++++++++++

In order to be able to interact with the master daemon even when it's
under heavy load, and to make it simpler to add core functionality
(such as an asynchronous rpc client), we propose three successive
levels of changes to the master core architecture.

After making these changes we'll be able to re-evaluate the size of
our thread pools, if we see that we can keep most threads in the
client worker pool idle. In the future we should also investigate
making the rpc client asynchronous as well, so that we can make
masterd a lot smaller in number of threads and memory size, and thus
also easier to understand, debug, and scale.

Connection handling
^^^^^^^^^^^^^^^^^^^

We'll move the main thread of ganeti-masterd to asyncore, so that it
can share the mainloop code with all other Ganeti daemons. Then all
luxi clients will be asyncore clients, and I/O to/from them will be
handled by the master thread asynchronously. Data will be read from
the client sockets as it becomes available and kept in a buffer; when
a complete message is found, it is passed to a client worker thread
for parsing and processing. The client worker thread is responsible
for serializing the reply, which can then be sent asynchronously by
the main thread on the socket.
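
A minimal sketch of this model, using the stdlib ``asyncore`` module
(the message framing and the worker pool interface shown here are
simplified assumptions, not the real luxi protocol code)::

  import asyncore

  class LuxiClientHandler(asyncore.dispatcher):
    """Buffers client I/O so no thread ever blocks on one socket."""

    def __init__(self, sock, workerpool):
      asyncore.dispatcher.__init__(self, sock)
      self._inbuf = ""
      self._outbuf = ""
      self._workerpool = workerpool

    def handle_read(self):
      # Read whatever is available and keep it in a buffer.
      self._inbuf += self.recv(4096)
      while "\3" in self._inbuf:  # assumed message terminator
        msg, self._inbuf = self._inbuf.split("\3", 1)
        # Hand each complete message to a worker thread, which parses
        # and processes it, then queues the serialized reply in
        # self._outbuf for asynchronous sending.
        self._workerpool.AddTask((self, msg))

    def writable(self):
      return bool(self._outbuf)

    def handle_write(self):
      sent = self.send(self._outbuf)
      self._outbuf = self._outbuf[sent:]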

Wait for job change
^^^^^^^^^^^^^^^^^^^

The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be
subscription-based, so that the executing thread doesn't have to block
waiting for the changes to arrive. Threads producing messages (job
queue executors) will make sure that when there is a change, another
thread is woken up to deliver it to the waiting clients. This can be
either a dedicated "wait for job changes" thread or pool, or one of
the client workers, depending on what's easier to implement. In either
case the main asyncore thread will only be involved in pushing the
actual data, and not in fetching/serializing it.
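
A possible shape for the subscription registry (a sketch only; the
class and method names are invented for illustration)::

  import threading

  class JobChangeSubscriptions(object):
    """Tracks which clients are waiting on which jobs."""

    def __init__(self):
      self._lock = threading.Lock()
      self._subscribers = {}  # job_id -> set of client handles

    def Subscribe(self, job_id, client):
      with self._lock:
        self._subscribers.setdefault(job_id, set()).add(client)

    def Notify(self, job_id, update):
      # Called by the job queue executor threads when a job changes;
      # no client thread sits blocked waiting for this call.
      with self._lock:
        clients = list(self._subscribers.get(job_id, ()))
      for client in clients:
        # SendUpdate() is a hypothetical hook handing the serialized
        # update to the main asyncore thread for pushing.
        client.SendUpdate(update)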

Other features to look at when implementing this code are:

- The possibility of not needing the job lock to know which updates to
  push.
- The possibility of signaling clients that are about to time out,
  when no update has been received, that they should keep waiting
  rather than give up (luxi-level keepalive).
- The possibility of deferring updates if they are too frequent,
  providing them at a maximum rate (lower priority).

Job Queue lock
^^^^^^^^^^^^^^

Our tests show that the job queue lock is a point of high contention.
We'll try to decrease its contention, either by more granular locking,
by the use of shared/exclusive locks, or by reducing the size of the
critical sections. This section of the design should be updated with
the proposed changes for the 2.2 release with regard to the job queue.
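
As a rough illustration of the shared/exclusive idea, a minimal
readers-writer lock (fairness and writer starvation are ignored for
brevity; this is not Ganeti's actual locking code)::

  import threading

  class SharedExclusiveLock(object):
    """Job list queries could take this in shared mode, while queue
    mutations take it exclusively, so readers no longer serialize
    behind each other."""

    def __init__(self):
      self._cond = threading.Condition()
      self._readers = 0
      self._writer = False

    def acquire_shared(self):
      with self._cond:
        while self._writer:
          self._cond.wait()
        self._readers += 1

    def release_shared(self):
      with self._cond:
        self._readers -= 1
        if not self._readers:
          self._cond.notifyAll()

    def acquire_exclusive(self):
      with self._cond:
        while self._writer or self._readers:
          self._cond.wait()
        self._writer = True

    def release_exclusive(self):
      with self._cond:
        self._writer = False
        self._cond.notifyAll()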

Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|