OS Scheduling and the Anatomy of a Context Switch

Daniel Ben-Zvi · 34 slides · Jun 22, 2015

About This Presentation

A general overview, with some depth, of how operating systems schedule tasks and what it really costs.


Slide Content

OS Scheduling and the Anatomy of a Context Switch
Daniel Ben-Zvi

Multi-tasking

Scheduling in the modern world
●Pause a running process in the middle of its execution

●Resume a previously paused process that is ready to
execute

●Do this tens of thousands of times per second

But it wasn’t always like this

No Scheduler
Only one process can operate at a time.

A single process can hog the whole system
forever.

Digger sleep function, 1982
sleep(time)
int time;
{
int a,b;

for(a=0; a<time; a++) {
for(b=0; b<100; b++);
}
}
So this is why pressing Turbo would speed it up! :)

(source: http://www.digger.org/digsrc_orig.zip)

And then they said… let's cooperate!

Cooperative scheduler
Each process must “play nice” and yield control
to other processes.

… but a single misbehaving process can still
hog the whole system forever.

(the sad reality of a kibbutz)

Sleep method designed for MacOS 8 (wxWidgets, C++)
void wxThread::Sleep(unsigned long milliseconds)
{
UnsignedWide start, now;

Microseconds(&start);

double mssleep = milliseconds * 1000 ;
double msstart, msnow ;
msstart = (start.hi * 4294967296.0 + start.lo) ;

// Start a busy loop till time has passed
do
{
YieldToAnyThread(); // Pass control to other threads (in the active process context)
Microseconds(&now);
msnow = (now.hi * 4294967296.0 + now.lo) ;
} while( msnow - msstart < mssleep );
}

(source: https://github.com/LuaDist/wxwidgets/blob/master/src/mac/classic/thread.cpp#L539 )

Non-preemptive
A process will eventually misbehave...

Round Robin (Time sharing)
(source: http://www.qnx.com/developers/docs/qnx_4.25_docs/qnx4/sysarch/microkernel.html)
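
Below is a minimal sketch of the idea with made-up names (task_t, TIME_SLICE); it is not from the QNX docs or the slides. Every runnable task gets a fixed time slice, then goes to the back of the line:

#include <stdio.h>

/* Hypothetical illustration of round-robin time sharing: every
 * runnable task runs for a fixed slice, then the next task runs. */
#define NTASKS     3
#define TIME_SLICE 10 /* "ticks" per turn */

typedef struct { const char *name; int remaining; } task_t;

int main(void)
{
    task_t tasks[NTASKS] = { {"A", 25}, {"B", 12}, {"C", 30} };
    int left = NTASKS;

    for (int i = 0; left > 0; i = (i + 1) % NTASKS) {
        task_t *t = &tasks[i];
        if (t->remaining == 0)
            continue;                 /* already finished, skip it */

        int run = t->remaining < TIME_SLICE ? t->remaining : TIME_SLICE;
        t->remaining -= run;          /* "run" the task for one slice */
        printf("task %s ran %d ticks, %d left\n", t->name, run, t->remaining);

        if (t->remaining == 0)
            left--;
    }
    return 0;
}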

But… tasks are different in nature!
●Some tasks require CPU time to complete

●others require I/O (waiting for user input,
network data, disk)

●Some just sleep most of the time

Multilevel feedback queue

And many more..

So, how do you interrupt a running
process in the middle of its execution,
and then resume the work later on?

Use a timer interrupt!
●Ask the processor to interrupt the active process and wake up
the kernel at a fixed interval (in the good old Linux days, this
was defined by CONFIG_HZ):


void do_timer(struct pt_regs *regs) // Linux kernel (some version) interrupt handler
{
jiffies_64++;

update_process_times(user_mode(regs));
update_times();
}
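
You can get a feel for the mechanism from userspace with setitimer(2) and SIGALRM: a periodic signal interrupts whatever the code is doing and runs a handler, loosely like the tick interrupting the running process and entering do_timer(). A rough sketch, illustration only, not the kernel path:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

/* Userspace analogue of a periodic tick: SIGALRM fires every 10ms
 * and interrupts the busy loop, loosely like the CONFIG_HZ tick
 * interrupts the running process. */
static volatile sig_atomic_t ticks;

static void on_tick(int signo)
{
    (void)signo;
    ticks++;                           /* our toy "jiffies_64++" */
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_tick;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval tv = {
        .it_interval = { 0, 10000 },   /* re-arm every 10ms */
        .it_value    = { 0, 10000 },   /* first tick after 10ms */
    };
    setitimer(ITIMER_REAL, &tv, NULL);

    while (ticks < 100)                /* spin for ~1 second of ticks */
        ;                              /* ...and get interrupted by the timer */

    printf("handled %d ticks\n", (int)ticks);
    return 0;
}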

The “context” in this program
bits 16 ; 16 bits real mode
start:
cli ; disable interrupts
mov si, msg ; SI points to RAM address of msg
mov ah, 0x0e ; print char service, with int 0x10
.loop lodsb ; load RAM: AL <- [DS:SI] && SI++
or al, al ; end of string?
jz halt
int 0x10 ; call BIOS print char
jmp .loop ; next char
halt: hlt ; halt
msg: db "Hello, World!", 0

# void swtch(struct context **old, struct context *new);
#
# Save current register context in old
# and then load register context from new.

.globl swtch
swtch:
movl 4(%esp), %eax
movl 8(%esp), %edx

# Save old callee-save registers
pushl %ebp    # Save old process ebp (base pointer) register
pushl %ebx    # Save old process ebx (general) register
pushl %esi    # Save old process esi (source index) register
pushl %edi    # Save old process edi (destination index) register

# Switch stacks
movl %esp, (%eax)
movl %edx, %esp

# Load new callee-save registers
popl %edi     # Load new process edi
popl %esi     # Load new process esi
popl %ebx     # Load new process ebx
popl %ebp     # Load new process ebp
ret

(source: http://samwho.co.uk/blog/2013/06/01/context-switching-on-x86/ )
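
For reference, the post above is walking through xv6's swtch, and the matching struct context looks roughly like this; the field order mirrors the push order (lowest stack address first), and %eip is not pushed explicitly because the caller's `call swtch` already left the return address on the stack:

// Approximate xv6 layout of the saved context. %esp itself isn't a
// field: the struct lives on the stack, and the saved stack pointer
// (written via "movl %esp, (%eax)") points at it.
struct context {
  unsigned int edi;   // pushed last, so at the lowest address
  unsigned int esi;
  unsigned int ebx;
  unsigned int ebp;
  unsigned int eip;   // return address left by "call swtch"
};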

Finding out the context switch rate
[us-east-1c] i-91xxxxxx [root:~]$ vmstat -w 10
procs ---------------memory-------------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3275668 48308 370264 0 0 0 0 12504 8884 1 5 94 0 0
0 0 0 3275492 48308 370264 0 0 0 0 12933 8834 2 5 93 0 0
1 0 0 3275588 48308 370264 0 0 0 0 12490 8825 2 4 94 0 0
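
The system-wide counter behind vmstat's `cs` column is exposed as the `ctxt` line of /proc/stat (cumulative context switches since boot). A small sketch that samples it twice to get a per-second rate:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the cumulative context-switch count from the "ctxt" line of
 * /proc/stat; sampling it twice gives a rate, which is essentially
 * what vmstat's "cs" column reports. */
static long long read_ctxt(void)
{
    FILE *f = fopen("/proc/stat", "r");
    char line[256];
    long long ctxt = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "ctxt %lld", &ctxt) == 1)
            break;
    fclose(f);
    return ctxt;
}

int main(void)
{
    long long before = read_ctxt();
    sleep(1);
    long long after = read_ctxt();
    printf("%lld context switches/sec\n", after - before);
    return 0;
}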

Using /usr/bin/time
[us-east-1c] i-xxxxxxxx [root:~]$ /usr/bin/time -v find / > /dev/null
Command being timed: "find /"
User time (seconds): 0.12
System time (seconds): 0.30
Percent of CPU this job got: 6%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.79
...
Voluntary context switches: 7362
Involuntary context switches: 60
...
File system inputs: 58888

Voluntary vs Involuntary
●Involuntary context switches occur when the
scheduler preempts the process (e.g. its timeslice expired)

●Voluntary context switches occur when the process
yields the CPU on its own (e.g. waiting for I/O, sleeping)
Both counters can be read per process, as sketched below.
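
A minimal sketch using getrusage(2), whose ru_nvcsw and ru_nivcsw fields hold the voluntary and involuntary counts for the calling process (the same numbers /usr/bin/time -v printed above):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Print this process's own voluntary/involuntary context switch
 * counters after a bit of sleeping and a bit of spinning. */
int main(void)
{
    struct rusage ru;

    usleep(100000);                    /* sleeping -> voluntary switch */
    for (volatile long i = 0; i < 100000000; i++)
        ;                              /* spinning -> may be preempted */

    getrusage(RUSAGE_SELF, &ru);
    printf("voluntary:   %ld\n", ru.ru_nvcsw);
    printf("involuntary: %ld\n", ru.ru_nivcsw);
    return 0;
}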

Using pidstat
[us-east-1c] i-xxxxxxxx [root:~]$ pidstat -w -p 1504
Linux 3.13.0-55-generic (ip-10-xx-xx-x8) 06/21/2015 _x86_64_ (2 CPU)

09:24:23 PM UID PID cswch/s nvcswch/s Command
09:24:23 PM 0 1504 531.40 3.45 haproxy

Some benchmarks
●No CPU affinity
○Intel 5150: ~4300ns/context switch
○Intel E5440: ~3600ns/context switch
○Intel E5520: ~4500ns/context switch
○Intel X5550: ~3000ns/context switch
○Intel L5630: ~3000ns/context switch
○Intel E5-2620: ~3000ns/context switch

●With CPU affinity
○Intel 5150: ~1900ns/process context switch, ~1700ns/thread context switch
○Intel E5440: ~1300ns/process context switch, ~1100ns/thread context switch
○Intel E5520: ~1400ns/process context switch, ~1300ns/thread context switch
○Intel X5550: ~1300ns/process context switch, ~1100ns/thread context switch
○Intel L5630: ~1600ns/process context switch, ~1400ns/thread context switch
○Intel E5-2620: ~1600ns/process context switch, ~1300ns/thread context switch
Performance boost: 5150: 66%, E5440: 65-70%, E5520: 50-54%, X5550: 55%, L5630: 45%, E5-2620: 45%.
(source: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html)
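
Numbers like these are typically measured by making two tasks ping-pong over a pipe or futex, so each round trip forces at least two context switches; the linked post does something along these lines. A rough, simplified sketch (not the original benchmark, and the estimate also includes pipe read/write overhead):

#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ping-pong benchmark sketch: parent and child bounce one byte over
 * two pipes, so each round trip forces (at least) two context
 * switches. Divide the elapsed time accordingly for an estimate. */
#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];
    char b = 'x';

    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    if (fork() == 0) {                       /* child: echo everything */
        for (int i = 0; i < ROUNDS; i++) {
            read(p2c[0], &b, 1);
            write(c2p[1], &b, 1);
        }
        _exit(0);
    }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ROUNDS; i++) {       /* parent: ping, wait for pong */
        write(p2c[1], &b, 1);
        read(c2p[0], &b, 1);
    }
    gettimeofday(&t1, NULL);
    wait(NULL);

    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("~%.0f ns per context switch\n", us * 1000.0 / (ROUNDS * 2));
    return 0;
}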

The hidden cost
●TLB flush

●Potentially screw up the L1, L2, L3 CPU caches

●The kernel tries very hard to schedule threads on
the same core/CPU/NUMA node, but sometimes it
can't; pinning with CPU affinity helps (see the
sketch below).
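
Pinning can be done from the shell with taskset, or from code with sched_setaffinity(2). A minimal sketch that pins the calling process to CPU 0:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to CPU 0 so the scheduler keeps it (and its
 * cache/TLB footprint) on one core, the "with CPU affinity" case. */
int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                       /* allow only CPU 0 */

    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 0\n");
    return 0;
}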

Without CPU affinity vs. With CPU affinity
[charts of context switch latency, measured without and with CPU affinity]
(source: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html)

Full context switch
●Save current process context
●TLB flush
●Potentially screw up the L1, L2, L3 CPU caches
●Load next process context
●Can be done in hardware - but is usually
done in software.

So how much is too much?
10k context switches/sec @ 5 microseconds per switch = 50ms of CPU
time per second spent on context switching (5% of a core).

If this were 100k/sec, that would be 500ms: half a core gone just to
switching.

Some formulas
●I/O bound tasks: threads = number of cores * (1 + wait time / service time)
●General/balanced: N = U * Ncores * (1 + wait time / compute time), where U is
the target CPU utilization (the I/O-bound formula above is this with U = 1)
●CPU bound tasks: threads = number of cores + 1 (the same formula with wait
time close to 0, plus one spare thread; see the sketch below)
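
A tiny sketch that plugs numbers into the general form (the helper name and the example figures are made up):

#include <stdio.h>

/* N = U * Ncores * (1 + wait_time / compute_time), the general
 * formula above. U = 1 gives the I/O-bound case; wait_time near 0
 * collapses to about one thread per core, to which the CPU-bound
 * rule in the slide adds one spare thread. */
static int optimal_threads(int ncores, double utilization,
                           double wait_time, double compute_time)
{
    double n = utilization * ncores * (1.0 + wait_time / compute_time);
    return n < 1.0 ? 1 : (int)(n + 0.5);   /* round, at least one thread */
}

int main(void)
{
    /* Made-up example: 8 cores, full utilization target, tasks that
     * wait 50ms on I/O for every 5ms of CPU work -> 8 * (1 + 10) = 88 */
    printf("I/O bound: %d threads\n", optimal_threads(8, 1.0, 50.0, 5.0));
    printf("CPU bound: %d threads\n", optimal_threads(8, 1.0, 0.0, 5.0));
    return 0;
}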


Conclusions
●Scheduling is complicated
●Context switches can be very expensive
●Find the right balance when deciding how many workers
to use
●Use CPU affinity to reduce cost where applicable
●And the most important lesson: don't overload the scheduler
with runnable threads/processes

Thank you
Daniel Ben-Zvi | [email protected]

Further reading
●The Timer Interrupt Handler
●Context Switching on X86
●CPU Scheduling
●How long does it take to make a context switch?
●Calculate the Optimum Number of Threads