Revision c3c5dc77

b/doc/design-2.2.rst
Core changes
------------

  
36
Master Daemon Scaling improvements
37
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
38

  
Current state and shortcomings
++++++++++++++++++++++++++++++

  
Currently the Ganeti master daemon is based on four sets of threads:

  
- The main thread (1 thread) just accepts connections on the master
  socket
- The client worker pool (16 threads) handles those connections,
  one thread per connected socket, parses luxi requests, and sends data
  back to the clients
- The job queue worker pool (25 threads) executes the actual jobs
  submitted by the clients
- The rpc worker pool (10 threads) interacts with the nodes via
  http-based-rpc

  
This means that every masterd currently runs 52 threads to do its job.
Being able to reduce the number of thread sets would make the master's
architecture a lot simpler. Moreover, having fewer threads can help
decrease lock contention, log pollution and memory usage.
Also, with the current architecture, masterd suffers from quite a few
scalability issues:

  
- Since the 16 client worker threads handle one connection each, it's
  very easy to exhaust them, by just connecting to masterd 16 times and
  not sending any data. While we could perhaps make those pools
  resizable, increasing the number of threads won't help with lock
  contention.
- Some luxi operations (in particular REQ_WAIT_FOR_JOB_CHANGE) make the
  relevant client thread block on its job for a relatively long time.
  This makes it easier to exhaust the 16 client threads.
- The job queue lock is quite heavily contended, and certain easily
  reproducible workloads show that it's very easy to put masterd in
  trouble: for example, running ~15 background instance reinstall jobs
  results in a master daemon that, even without having exhausted the
  client worker threads, can't answer simple job list requests or
  submit more jobs.

  
Proposed changes
++++++++++++++++

  
In order to be able to interact with the master daemon even when it's
under heavy load, and to make it simpler to add core functionality
(such as an asynchronous rpc client) we propose three subsequent levels
of changes to the master core architecture.

  
After making these changes we'll be able to re-evaluate the size of our
thread pool, if we see that we can make most threads in the client
worker pool always idle. In the future we should also investigate making
the rpc client asynchronous as well, so that we can make masterd a lot
smaller in number of threads, and memory size, and thus also easier to
understand, debug, and scale.

  
Connection handling
^^^^^^^^^^^^^^^^^^^

  
We'll move the main thread of ganeti-masterd to asyncore, so that it can
share the mainloop code with all other Ganeti daemons. Then all luxi
clients will be asyncore clients, and I/O to/from them will be handled
by the master thread asynchronously. Data will be read from the client
sockets as it becomes available, and kept in a buffer; when a
complete message is found, it's passed to a client worker thread for
parsing and processing. The client worker thread is responsible for
serializing the reply, which can then be sent asynchronously by the main
thread on the socket.
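
The buffering step described in this section could look roughly like the
sketch below. It is only an illustration: the names ``ClientBuffer`` and
``on_readable`` are hypothetical, the single-byte message terminator is
an assumption of the sketch, and the asyncore plumbing itself is left
out so that only the "buffer until complete, then hand off" logic is
shown.

```python
import queue

# Assumption for this sketch: every message ends with one terminator byte.
EOM = b"\x03"


class ClientBuffer:
    """Accumulates raw bytes from one client socket (hypothetical helper)."""

    def __init__(self):
        self._pending = b""

    def feed(self, data):
        """Add newly read bytes; return the messages they completed."""
        self._pending += data
        # Everything before the last terminator is complete; the tail
        # (possibly empty) stays buffered until more data arrives.
        *complete, self._pending = self._pending.split(EOM)
        return complete


def on_readable(buf, data, work_queue, client_id):
    """What the main thread would do when a socket has data: buffer it and
    queue each complete message for a worker, without parsing anything."""
    for message in buf.feed(data):
        work_queue.put((client_id, message))


# Usage: a request arriving in two chunks, followed by a partial one.
work_queue = queue.Queue()
buf = ClientBuffer()
on_readable(buf, b'{"method": "Que', work_queue, "client-1")
on_readable(buf, b'ryJobs"}\x03{"meth', work_queue, "client-1")
```

After the second call the queue holds exactly one complete message for
``client-1``, and the trailing partial request stays in the buffer. The
main thread never parses or blocks; that work is left to whichever
worker picks the item up from the queue.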

  
Wait for job change
^^^^^^^^^^^^^^^^^^^

  
The REQ_WAIT_FOR_JOB_CHANGE luxi request is changed to be
subscription-based, so that the executing thread doesn't have to be
waiting for the changes to arrive. Threads producing messages (job queue
executors) will make sure that when there is a change another thread is
awakened and delivers it to the waiting clients. This can be either a
dedicated "wait for job changes" thread or pool, or one of the client
workers, depending on what's easier to implement. In either case the
main asyncore thread will only be involved in pushing of the actual
data, and not in fetching/serializing it.
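
One possible shape for the subscription mechanism is sketched below,
using the "dedicated thread" variant. Everything here is hypothetical
(the ``JobChangeNotifier`` class, its method names, and the one-shot
subscription semantics are assumptions of the sketch, not a decided
design); it only shows that producers enqueue a change and return,
while a separate thread wakes up and delivers it.

```python
import queue
import threading
from collections import defaultdict


class JobChangeNotifier:
    """Hypothetical registry of clients waiting for job changes."""

    def __init__(self):
        self._lock = threading.Lock()  # protects only the waiter table
        self._waiters = defaultdict(list)
        self._events = queue.Queue()

    def subscribe(self, job_id, deliver):
        """Register a waiting client; returns at once, no thread blocks."""
        with self._lock:
            self._waiters[job_id].append(deliver)

    def publish(self, job_id, change):
        """Called by job queue executors: enqueue the change and return."""
        self._events.put((job_id, change))

    def stop(self):
        """Ask the delivery thread to exit after pending events."""
        self._events.put(None)

    def run(self):
        """Body of the delivery thread: sleeps until something is published."""
        while True:
            event = self._events.get()
            if event is None:
                return
            job_id, change = event
            with self._lock:
                # Assumption: subscriptions are one-shot, like one request.
                callbacks = self._waiters.pop(job_id, [])
            # Deliver outside the lock so a slow client cannot block others.
            for deliver in callbacks:
                deliver(job_id, change)


# Usage: one waiting client, one change produced by an executor.
notifier = JobChangeNotifier()
received = []
notifier.subscribe(42, lambda job_id, change: received.append((job_id, change)))
delivery = threading.Thread(target=notifier.run)
delivery.start()
notifier.publish(42, "running")
notifier.stop()
delivery.join()
```

In a real implementation the callback would serialize the reply and
queue it for the main thread to push onto the socket, so the main thread
still does no fetching or serializing.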

  
Other features to look at when implementing this code are:

  
  - Possibility not to need the job lock to know which updates to push.
  - Possibility to signal clients about to time out, when no update has
    been received, not to despair and to keep waiting (luxi level
    keepalive).
  - Possibility to defer updates if they are too frequent, providing
    them at a maximum rate (lower priority).
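
As an illustration of the last item, deferring updates to a maximum rate
could be done with a small per-client throttle like the one below. This
is a sketch under assumptions: the ``UpdateThrottle`` name and interface
are hypothetical, and it assumes each update is a full snapshot of the
job state, so that a newer deferred update can safely replace an older
one. The clock is injectable only to make the behaviour easy to check.

```python
import time


class UpdateThrottle:
    """Lets through at most one update per `min_interval` seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent = None  # time of the last update let through
        self._deferred = None   # newest update held back, if any

    def offer(self, update):
        """Return the update if it may be sent now; otherwise keep it
        for a later flush() and return None."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._last_sent = now
            self._deferred = None
            return update
        self._deferred = update  # replaces any older deferred update
        return None

    def flush(self):
        """Return the deferred update once the interval has elapsed,
        else None. Meant to be called periodically."""
        if self._deferred is None:
            return None
        now = self._clock()
        if now - self._last_sent < self._min_interval:
            return None
        update, self._deferred = self._deferred, None
        self._last_sent = now
        return update
```

With a one-second interval, three updates offered within the same second
result in the first being sent immediately and only the last one being
sent by the next ``flush()`` after the interval has passed.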

  
Job Queue lock
^^^^^^^^^^^^^^

  
Our tests show that the job queue lock is a point of high contention.
We'll try to decrease its contention, either by more granular locking,
the use of shared/exclusive locks, or reducing the size of the critical
sections. This section of the design should be updated with the proposed
changes for the 2.2 release, with regard to the job queue.
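
To make the shared/exclusive option concrete, here is a minimal sketch
of such a lock built on a condition variable. It is not the proposed
implementation: the ``SharedExclusiveLock`` class below is written only
for this example, and it makes no attempt at fairness, so a steady
stream of shared holders could starve a thread waiting for exclusive
access.

```python
import threading


class SharedExclusiveLock:
    """Many holders in shared mode, or exactly one in exclusive mode."""

    def __init__(self):
        self._cond = threading.Condition()
        self._sharers = 0        # number of current shared holders
        self._exclusive = False  # True while an exclusive holder exists

    def acquire_shared(self):
        with self._cond:
            while self._exclusive:
                self._cond.wait()
            self._sharers += 1

    def release_shared(self):
        with self._cond:
            self._sharers -= 1
            if self._sharers == 0:
                self._cond.notify_all()  # an exclusive waiter may proceed

    def acquire_exclusive(self):
        with self._cond:
            while self._exclusive or self._sharers:
                self._cond.wait()
            self._exclusive = True

    def release_exclusive(self):
        with self._cond:
            self._exclusive = False
            self._cond.notify_all()  # wake both shared and exclusive waiters
```

With a lock like this, read-only operations such as listing jobs could
take the queue lock in shared mode and run concurrently, while
operations that modify the queue would still take it exclusively.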

  
36 135
Remote procedure call timeouts
37 136
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
38 137

  
