Revision 37e1e262 doc/design-2.2.rst
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

Core daemon connection handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since the 16 client worker threads handle one connection each, it's very
easy to exhaust them, by just connecting to masterd 16 times and not
sending any data. While we could perhaps make those pools resizable,
increasing the number of threads won't help with lock contention, nor
with better handling of long running operations, where the client must
be kept informed that everything is proceeding so it doesn't need to
time out.
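
To illustrate the first problem, here is a minimal sketch in Python
(the TCP address is hypothetical and merely stands in for masterd's
actual luxi endpoint) of how idle connections alone can starve a
one-thread-per-connection pool::

  import socket

  # Hypothetical endpoint standing in for the master daemon's socket.
  MASTERD_ADDRESS = ("192.0.2.1", 1234)

  # Each connection pins one of the 16 client worker threads in a
  # blocking read; sixteen idle connections starve every real client.
  idle_connections = []
  for _ in range(16):
      sock = socket.create_connection(MASTERD_ADDRESS)
      idle_connections.append(sock)  # keep it open, never send data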

Wait for job change
^^^^^^^^^^^^^^^^^^^

The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client
thread block on its job for a relatively long time. This is another easy
way to exhaust the 16 client threads, and a place where clients often
time out. Moreover, this operation worsens job queue lock contention
(see below).
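
As a sketch of the server side (``job.condition``, ``job.has_changes``
and ``job.fetch_changes`` are illustrative names, not the actual jqueue
API), the worker thread serving the request sleeps for the whole wait::

  import time

  def wait_for_job_change(job, timeout):
      # The worker thread serving this luxi request is parked here
      # for up to `timeout` seconds, unavailable to any other client.
      deadline = time.time() + timeout
      with job.condition:                  # guards the job's state
          while not job.has_changes():
              remaining = deadline - time.time()
              if remaining <= 0:
                  return None              # client times out and retries
              job.condition.wait(remaining)
          return job.fetch_changes()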

Job Queue lock
^^^^^^^^^^^^^^

The job queue lock is quite heavily contended, and certain easily
reproducible workloads show that it's very easy to put masterd in
trouble: for example, running ~15 background instance reinstall jobs
results in a master daemon that, even without having exhausted the
client worker threads, can't answer simple job list requests or submit
more jobs.

Currently the job queue lock is an exclusive non-fair lock insulating
the following job queue methods (called by the client workers), all of
which serialize on it as sketched after the list:

- AddNode
- RemoveNode
- SubmitJob
- SubmitManyJobs
- WaitForJobChanges
- CancelJob
- ArchiveJob
- AutoArchiveJobs
- QueryJobs
- Shutdown
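
All of these follow the same pattern, roughly as in the sketch below
(``locked_method`` and ``_queue_lock`` are illustrative names, not the
actual jqueue code): every call, even a read-only one, serializes on
the single exclusive lock::

  import functools
  import threading

  _queue_lock = threading.Lock()       # exclusive and non-fair

  def locked_method(fn):
      """Serialize a job queue method on the global queue lock."""
      @functools.wraps(fn)
      def wrapper(*args, **kwargs):
          with _queue_lock:            # even pure reads wait here
              return fn(*args, **kwargs)
      return wrapper

  @locked_method
  def query_jobs(job_ids, fields):
      """Read-only, yet fully serialized with every other operation."""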

Moreover the job queue lock is acquired outside of the job queue in two
other classes:

- jqueue._JobQueueWorker (in RunTask) before executing the opcode, after
  finishing its execution and when handling an exception.
- jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the
  processor (mcpu.Processor) is about to start working on the opcode
  (after acquiring the necessary locks) and when any data is sent back
  via the feedback function.

Of those the major critical points are:

- Submit[Many]Job, QueryJobs and WaitForJobChanges, which can easily
  slow down and block client threads to the point of making the
  respective clients time out.
- The code paths in NotifyStart, Feedback and RunTask, which slow down
  job processing even between different clients and otherwise unrelated
  jobs.

To increase the pain:

- WaitForJobChanges is a bad offender because it's implemented with a
  notified condition that wakes up all waiting threads, which then each
  try to acquire the global lock again.
- Many should-be-fast code paths are slowed down by replicating the
  change to remote nodes, and thus waiting, with the lock held, on
  remote RPCs to complete (starting, finishing, and submitting jobs);
  see the sketch after this list.
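
The second point is the classic anti-pattern of doing network I/O under
a lock; a sketch (``replicate_to_nodes`` and ``job.serialize`` are
placeholders for the real replication RPC and serialization)::

  import threading

  _queue_lock = threading.Lock()

  def replicate_to_nodes(data):
      """Placeholder for the rpc replicating the job file to nodes."""

  def update_job(job, new_status):
      with _queue_lock:                # exclusive queue lock
          job.status = new_status
          serialized = job.serialize()
          # The network round-trips happen with the lock still held:
          # every other queue operation, however trivial, now waits
          # on the slowest remote node.
          replicate_to_nodes(serialized)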

Proposed changes
++++++++++++++++

[...]

Other features to look at when implementing this code are:

- Possibility not to need the job lock to know which updates to push:
  if the thread producing the data pushes a copy of the update for the
  waiting clients, the thread sending it won't need to acquire the lock
  again to fetch the actual data (see the sketch after this list).
- Possibility to signal clients about to time out, when no update has
  been received, not to despair and to keep waiting (luxi level
  keepalive).
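
The first point could be implemented along these lines (a sketch only;
``JobWatcher`` and its methods don't exist in the current code)::

  import queue

  class JobWatcher(object):
      """One client's subscription to a single job's updates."""

      def __init__(self):
          self._updates = queue.Queue()

      def push(self, update_copy):
          # Called by the thread producing the change, which already
          # holds the job lock: it hands over a private copy.
          self._updates.put(update_copy)

      def wait(self, timeout):
          # Called by the thread serving the client: no job queue or
          # job lock is needed, the data is already here.
          try:
              return self._updates.get(timeout=timeout)
          except queue.Empty:
              return None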

[...]

Job Queue lock
^^^^^^^^^^^^^^

In order to decrease the job queue lock contention, we will change the
code paths in the following ways, initially:

- A per-job lock will be introduced. All operations affecting only one
  job (for example feedback, starting/finishing notifications,
  subscribing to or watching a job) will only require the job lock.
  This should be a leaf lock, but if a situation arises in which it
  must be acquired together with the global job queue lock, the global
  one must always be acquired last (for the global section).
- The locks will be converted to a sharedlock. Any read-only operation
  will be able to proceed in parallel.
- During remote update (which already happens per-job) we'll drop the
  job lock level to shared mode, so that activities which only read the
  job (for example job change notifications or QueryJobs calls) will be
  able to proceed in parallel, as sketched after this list.
- The wait for job changes improvements proposed above will be
  implemented.
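
As an example, the remote update path could then look like the
following sketch, assuming a per-job shared/exclusive lock offering
``acquire(shared=...)``, ``downgrade()`` and ``release()`` (the exact
lock API is not settled by this document)::

  def replicate_to_nodes(data):
      """Placeholder for the rpc replicating the job file to nodes."""

  def remote_update(job):
      job.lock.acquire(shared=0)          # exclusive: mutate the job
      try:
          serialized = job.serialize()    # illustrative serialization
          job.lock.downgrade()            # drop to shared mode
          # Readers (job change notifications, QueryJobs) can now
          # proceed in parallel while we wait on the remote rpcs.
          replicate_to_nodes(serialized)
      finally:
          job.lock.release()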

In the future other improvements may include splitting off some of the
work (e.g. replication of a job to remote nodes) to a separate thread
pool or asynchronous thread, not tied to the code path for answering
client requests or the one executing the "real" work. This can be
discussed again after we have used the more granular job queue in
production and tested its benefits.

Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~