Revision 37e1e262

b/doc/design-2.2.rst

Also, with the current architecture, masterd suffers from quite a few
scalability issues:

Core daemon connection handling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since the 16 client worker threads handle one connection each, it's very
easy to exhaust them, by just connecting to masterd 16 times and not
sending any data. While we could perhaps make those pools resizable,
increasing the number of threads won't help with lock contention, nor
with better handling of long-running operations, i.e. making sure the
client is informed that everything is proceeding and doesn't need to
time out.
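
A minimal sketch of the exhaustion scenario described above (the socket
path is an assumption for illustration only, not something specified in
this design)::

  import socket

  MASTER_SOCKET = "/var/run/ganeti/socket/ganeti-master"  # assumed path

  # Open 16 connections to masterd and never send a request: a pool of 16
  # client worker threads doing a blocking read per connection now has no
  # free worker left for legitimate clients.
  idle_clients = []
  for _ in range(16):
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(MASTER_SOCKET)
    idle_clients.append(sock)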

Wait for job change
^^^^^^^^^^^^^^^^^^^

The REQ_WAIT_FOR_JOB_CHANGE luxi operation makes the relevant client
thread block on its job for a relatively long time. This is another easy
way to exhaust the 16 client threads, and a place where clients often
time out; moreover, this operation aggravates the job queue lock
contention (see below).
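
Roughly, the blocking wait ties up a worker thread like this (a
simplified sketch with made-up names, not the actual jqueue API)::

  import threading

  class Job(object):
    def __init__(self):
      self.lock = threading.Lock()            # stands in for the queue lock
      self.change = threading.Condition(self.lock)
      self.status = "queued"

  def WaitForJobChange(job, prev_status, timeout=60.0):
    # The client worker thread sleeps here until the job is notified or
    # the timeout expires; meanwhile it cannot serve any other client.
    job.change.acquire()
    try:
      if job.status == prev_status:
        job.change.wait(timeout)
      return job.status
    finally:
      job.change.release()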

Job Queue lock
^^^^^^^^^^^^^^

The job queue lock is quite heavily contended, and certain easily
reproducible workloads show that it's very easy to put masterd in
trouble: for example, running ~15 background instance reinstall jobs
results in a master daemon that, even without having exhausted the
client worker threads, can't answer simple job list requests or submit
more jobs.

Currently the job queue lock is an exclusive non-fair lock protecting
the following job queue methods (called by the client workers):

  - AddNode
  - RemoveNode
  - SubmitJob
  - SubmitManyJobs
  - WaitForJobChanges
  - CancelJob
  - ArchiveJob
  - AutoArchiveJobs
  - QueryJobs
  - Shutdown
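
Schematically, every one of these entry points serializes on the same
exclusive lock (a simplified sketch, not the actual jqueue.JobQueue
implementation)::

  import threading

  class JobQueue(object):
    def __init__(self):
      self._lock = threading.Lock()   # single exclusive, non-fair lock
      self._jobs = {}

    def SubmitJob(self, ops):
      with self._lock:                # every client worker funnels through here
        job_id = len(self._jobs) + 1
        self._jobs[job_id] = ops
        return job_id

    def QueryJobs(self, job_ids, fields):
      with self._lock:                # even read-only queries take the same lock
        return [self._jobs.get(jid) for jid in job_ids]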

Moreover, the job queue lock is acquired outside of the job queue in two
other classes:

  - jqueue._JobQueueWorker (in RunTask) before executing the opcode, after
    finishing its execution and when handling an exception.
  - jqueue._OpExecCallbacks (in NotifyStart and Feedback) when the
    processor (mcpu.Processor) is about to start working on the opcode
    (after acquiring the necessary locks) and when any data is sent back
    via the feedback function.
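
As an illustration of the second case, the feedback path has roughly this
shape (a hand-written sketch with hypothetical names, not the actual
jqueue code)::

  class OpExecCallbacks(object):
    def __init__(self, queue_lock, job, op):
      self._queue_lock = queue_lock   # the single global job queue lock
      self._job = job
      self._op = op

    def Feedback(self, *args):
      self._queue_lock.acquire()
      try:
        self._op.log.append(args)     # record the message on the opcode
        self._job.Replicate()         # hypothetical: push the change to
                                      # remote nodes while the lock is held
      finally:
        self._queue_lock.release()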

Of those, the major critical points are:

  - Submit[Many]Job, QueryJobs and WaitForJobChanges, which can easily
    slow down and block client threads to the point of making the
    respective clients time out.
  - The code paths in NotifyStart, Feedback, and RunTask, which slow down
    job processing across clients and otherwise unrelated jobs.

To increase the pain:

  - WaitForJobChanges is a bad offender because it's implemented with a
    notified condition which wakes up the waiting threads, which then try
    to acquire the global lock again.
  - Many should-be-fast code paths are slowed down by replicating the
    change to remote nodes, and thus waiting, with the lock held, on
    remote RPCs to complete (starting, finishing, and submitting jobs).
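
The WaitForJobChanges pattern is roughly the classic "notify everyone,
then fight for the lock again" shape (sketch only, with simplified
names)::

  import threading

  queue_lock = threading.Lock()
  job_changed = threading.Condition(queue_lock)

  def WaitForJobChanges(job, prev_info):
    with job_changed:                 # the global lock is needed just to wait
      while job.CalcInfo() == prev_info:
        # Every notification wakes every waiter, for every job, and each
        # of them re-contends for queue_lock before checking its own job.
        job_changed.wait()
      return job.CalcInfo()

  def OnJobUpdate():
    with job_changed:
      job_changed.notify_all()        # wake all waiters, regardless of job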

Proposed changes
++++++++++++++++
......

Other features to look at when implementing this code are:

  - Possibility not to need the job lock to know which updates to push:
    if the thread producing the data pushes a copy of the update for the
    waiting clients, the thread sending it won't need to acquire the
    lock again to fetch the actual data.
  - Possibility to signal clients that are about to time out, when no
    update has been received, not to despair and to keep waiting (luxi
    level keepalive).
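
One possible shape for the first item (hypothetical names, only to
illustrate the idea): the producing thread snapshots the update and hands
it to each waiter's private queue, so the sending thread never needs the
job lock::

  import copy
  import Queue   # Python 2 module name

  class JobWatcher(object):
    def __init__(self):
      self._waiters = []              # one Queue per waiting client thread

    def Subscribe(self):
      waiter = Queue.Queue()
      self._waiters.append(waiter)
      return waiter

    def Publish(self, job_info):
      # Called by the thread that produced the change (and already holds
      # the job lock): push an independent copy, so that the threads
      # sending the update to clients never need the lock themselves.
      snapshot = copy.deepcopy(job_info)
      for waiter in self._waiters:
        waiter.put(snapshot)

A timeout on the waiter's queue would also be a natural place to send the
luxi-level keepalive mentioned in the second item.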
......

Job Queue lock
^^^^^^^^^^^^^^

In order to decrease the job queue lock contention, we will change the
code paths in the following ways, initially:

  - A per-job lock will be introduced. All operations affecting only one
    job (for example feedback, starting/finishing notifications,
    subscribing to or watching a job) will only require the job lock.
    This should be a leaf lock, but if a situation arises in which it
    must be acquired together with the global job queue lock, the global
    one must always be acquired last (for the global section). A rough
    sketch of this locking scheme follows the list.
  - The locks will be converted to shared locks. Any read-only operation
    will be able to proceed in parallel.
  - During remote update (which happens already per-job) we'll drop the
    job lock level to shared mode, so that activities reading the job
    (for example job change notifications or QueryJobs calls) will be
    able to proceed in parallel.
  - The wait for job changes improvements proposed above will be
    implemented.
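
A rough sketch of the intended scheme (assuming a shared/exclusive lock
primitive in the spirit of ``locking.SharedLock``; the names, the
``downgrade`` step and the exact interface are illustrative only)::

  class Job(object):
    def __init__(self, queue, job_id):
      self.id = job_id
      self.log = []
      self.queue = queue
      # Per-job leaf lock: operations touching only this job take it and
      # normally never need the global queue lock.
      self.lock = SharedLock()

    def Feedback(self, message):
      self.lock.acquire(shared=0)   # exclusive: we modify the job
      try:
        self.log.append(message)
        # While replicating to remote nodes, drop to shared mode so that
        # readers (QueryJobs, change notifications) can run in parallel.
        self.lock.downgrade()
        self.queue.ReplicateJob(self)
      finally:
        self.lock.release()

  def ArchiveJob(queue, job):
    # Lock order: per-job lock first, global queue lock last.
    job.lock.acquire(shared=0)
    try:
      queue.lock.acquire(shared=0)
      try:
        queue.RemoveFromIndexUnlocked(job.id)
      finally:
        queue.lock.release()
    finally:
      job.lock.release()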

In the future other improvements may include splitting off some of the
work (e.g. replication of a job to remote nodes) to a separate thread
pool or asynchronous thread, not tied to the code path for answering
client requests or the one executing the "real" work. This can be
discussed again after we have used the more granular job queue in
production and tested its benefits.
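
Purely as a sketch of that idea (not a commitment to any interface), the
replication could be handed to a small dedicated pool so that the request
path only enqueues work::

  import threading
  import Queue

  class ReplicationPool(object):
    def __init__(self, replicate_fn, num_threads=2):
      self._replicate_fn = replicate_fn   # e.g. an RPC wrapper (assumed)
      self._work = Queue.Queue()
      for _ in range(num_threads):
        thread = threading.Thread(target=self._Worker)
        thread.setDaemon(True)
        thread.start()

    def _Worker(self):
      while True:
        job_snapshot = self._work.get()
        # The slow remote RPCs happen here, outside the request path and
        # without holding any job queue lock.
        self._replicate_fn(job_snapshot)

    def ScheduleReplication(self, job_snapshot):
      self._work.put(job_snapshot)        # returns to the caller immediately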

Remote procedure call timeouts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
