Intel® Hyper-Threading Technology


About This Presentation

This is a presentation about Intel Hyper-Threading Technology and the Nehalem microarchitecture; the slides are based on Intel's white paper.


Slide Content

Intel® Hyper-Threading Technology (Mohammad Radpour, Amirali Sharifian)

Outline: Introduction, Traditional Approaches, Hyper-Threading Overview, Hyper-Threading Implementation (Front-End Execution, Out-of-Order Execution), Performance Results, OS Support, Conclusion

Introduction: Hyper-Threading technology makes a single processor appear as two logical processors. It was first implemented in the Prestonia version of the Pentium® 4 Xeon processor, released on 02/25/02.

Traditional Approaches (I): the Internet and telecommunications industries place high demands on processors, and the gains traditional techniques provide are unsatisfactory compared with the cost they incur. Well-known techniques: super-pipelining, branch prediction, super-scalar execution, out-of-order execution, and fast memories (caches).

Traditional Approaches (II): Super-pipelining uses finer pipeline granularity to execute far more instructions per second (higher clock frequencies), but cache misses, interrupts, and branch mispredictions become harder to handle. Instruction-level parallelism (ILP) mainly targets increasing the number of instructions completed per cycle: super-scalar processors have multiple parallel execution units, and out-of-order execution must be verified and retired in program order. Fast memories (caches) use hierarchical units to reduce memory latency, but they are not a complete solution.

Traditional Approaches (III): on the same silicon technology, with speed-ups normalized against the Intel486™ microarchitecture, integer performance has improved five- or six-fold, while die size has grown fifteen-fold (a three-times-higher rate) and power has increased almost eighteen-fold over the same period.

Thread-Level Parallelism: chip multi-processing (CMP) puts two processors on a single die; the processors may share only the on-chip cache, and the cost is still high. Single-processor multi-threading comes in three flavors: time-sliced multi-threading, switch-on-event multi-threading (works well for server applications), and simultaneous multi-threading.

Hyper-Threading Technology

Hyper-Threading (HT) Technology provides a more satisfactory solution: a single physical processor is shared as two logical processors. Each logical processor has its own architecture state, while a single set of execution units is shared between the logical processors (N logical PUs are supported in general). HT delivers its performance gain at only about a 5% die-size penalty, and it allows a single processor to fetch and execute two separate code streams simultaneously.
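
As a minimal illustration of what "two separate code streams" means at the software level, the hedged C++ sketch below (not from the slides; workloads and names are made up) runs two independent threads that an HT-aware OS may schedule onto the two logical processors of one physical core:

```cpp
#include <iostream>
#include <thread>

// Two independent code streams; on an HT-enabled system the OS scheduler
// may place them on the two logical processors of a single physical core.
long sum_squares(int n) {
    long s = 0;
    for (int i = 1; i <= n; ++i) s += static_cast<long>(i) * i;  // integer-heavy stream
    return s;
}

double sum_inverses(int n) {
    double s = 0.0;
    for (int i = 1; i <= n; ++i) s += 1.0 / i;                   // floating-point-heavy stream
    return s;
}

int main() {
    long a = 0;
    double b = 0.0;
    std::thread t1([&] { a = sum_squares(10'000'000); });
    std::thread t2([&] { b = sum_inverses(10'000'000); });
    t1.join();
    t2.join();
    std::cout << a << " " << b << "\n";
}
```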

The first implementation was on the Intel Xeon processor family. Several goals were at the heart of the microarchitecture: minimize the die-area cost of implementing HT; ensure that when one logical processor is stalled, the other logical processor can continue to make forward progress; and allow a processor running only one active software thread to run at the same speed as a processor without HT.

HT Resource Types: replicated resources (flags, registers, time-stamp counter, APIC); shared resources (memory, range registers, data bus); shared or partitioned resources (caches and queues).

HT Pipeline (I)

Execution Pipeline

Execution Pipeline: the queues between the major pipestages of the pipeline are partitioned between the logical processors.

Partitioned Queue Example: partitioning a resource ensures fairness and forward progress for both logical processors. A minimal sketch of such a partitioned queue follows below.
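
The hedged sketch below (not Intel's hardware; the per-thread capacity is an assumed value) shows the idea: each logical processor gets its own half of the queue, so one stalled processor can fill only its own half and never starves the other:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hedged sketch: a queue whose capacity is split in half between the two
// logical processors, so neither can starve the other.
struct PartitionedQueue {
    static constexpr std::size_t kHalf = 8;          // assumed per-thread capacity
    std::array<std::deque<uint32_t>, 2> part;        // one partition per logical CPU

    bool push(int cpu, uint32_t uop) {
        if (part[cpu].size() >= kHalf) return false; // partition full: only this CPU stalls
        part[cpu].push_back(uop);
        return true;
    }

    std::optional<uint32_t> pop(int cpu) {
        if (part[cpu].empty()) return std::nullopt;
        uint32_t uop = part[cpu].front();
        part[cpu].pop_front();
        return uop;
    }
};
```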

HT Pipeline (II)

HT Pipeline (III)

The Execution Trace Cache is an advanced form of L1 instruction cache. It delivers 3 μops per clock to the out-of-order execution logic, and most instructions are fetched from the Trace Cache rather than decoded again. Because it caches the μops of previously decoded instructions, it bypasses the instruction decoder, and recovery from a mispredicted branch is much shorter compared with re-decoding the IA-32 instructions.

Execution Trace Cache (TC) (I): stores decoded instructions, called "micro-operations" or "uops". Access to the TC is arbitrated using two instruction pointers: if both PUs request access, the TC switches between them in the next cycle; otherwise the requesting PU keeps the access, and stalls (stemming from misses) also cause a switch. Entries are tagged with the owning thread, the cache is 8-way set associative with a Least Recently Used (LRU) replacement algorithm, and usage can become unbalanced between the processors because of the shared nature of the TC.
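
The hedged sketch below illustrates the general structure described here: a set-associative cache with LRU replacement and a per-entry thread tag. Set/tag sizes and field layouts are illustrative assumptions, not Intel's actual Trace Cache organization:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hedged sketch of an 8-way set-associative cache with LRU replacement and a
// per-entry logical-processor tag, loosely modelling a shared trace cache.
struct TraceCacheModel {
    struct Entry { uint64_t tag = 0; int owner = -1; bool valid = false; uint64_t last_use = 0; };
    static constexpr std::size_t kSets = 256, kWays = 8;
    std::array<std::array<Entry, kWays>, kSets> sets{};
    uint64_t clock = 0;

    // Returns true on a hit for this logical processor; on a miss, the LRU way
    // is evicted (regardless of owner) and refilled, tagged with the requester.
    bool lookup(uint64_t addr, int cpu) {
        auto& set = sets[(addr >> 6) % kSets];
        uint64_t tag = addr >> 14;
        for (auto& e : set)
            if (e.valid && e.tag == tag && e.owner == cpu) { e.last_use = ++clock; return true; }
        Entry* victim = &set[0];
        for (auto& e : set)
            if (!e.valid || e.last_use < victim->last_use) victim = &e;
        *victim = Entry{tag, cpu, true, ++clock};
        return false;
    }
};
```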

Execution Trace Cache (TC) (II)

Microcode Store ROM (MSROM) (I): complex IA-32 instructions that decode into more than 4 uops are handled by the MSROM. The TC sends a microcode instruction pointer to the MSROM, which is shared by the logical processors but keeps an independent flow for each (two microcode instruction pointers). Access to the MSROM alternates between the logical processors, just as in the TC.

Microcode Store ROM (MSROM) (II) The Microcode ROM controller then fetches the uops needed and returns control to the TC

Translation Lookaside Buffer (TLB) (I): processors work not with physical memory addresses but with virtual addresses. Advantages: more memory can be allocated than physically exists, and only the necessary data needs to be kept resident. Disadvantages: virtual addresses must be translated to physical addresses, the translation table is too large to be stored on chip (it is itself paged), and a translation stage is needed for every memory access, which would be far too slow without hardware support.

Translation Lookaside Buffer (TLB) (II): a small cache memory directly on the processor that stores the translations for a few recently accessed addresses. Up to the Core 2, Intel used a two-level TLB: a Level 1 TLB that is small (16 entries) but very fast and used for loads only, and a Level 2 TLB that handles load misses (256 entries).
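
The hedged sketch below shows the mechanism in miniature: a 16-entry cache of virtual-page to physical-frame translations in front of a much slower page-table lookup. The sizes, the LRU policy, and the `page_table` parameter are illustrative assumptions:

```cpp
#include <cstdint>
#include <cstddef>
#include <list>
#include <unordered_map>

// Hedged sketch of a tiny TLB: recently used virtual-page -> physical-frame
// translations are cached so most accesses avoid the slow page-table walk.
class TinyTLB {
    static constexpr std::size_t kEntries = 16;
    std::unordered_map<uint64_t, uint64_t> map_;   // VPN -> PFN
    std::list<uint64_t> lru_;                      // front = most recently used VPN
public:
    // page_table stands in for the in-memory page tables the hardware would walk.
    uint64_t translate(uint64_t vaddr,
                       const std::unordered_map<uint64_t, uint64_t>& page_table) {
        uint64_t vpn = vaddr >> 12, offset = vaddr & 0xFFF;    // 4 KiB pages
        uint64_t pfn;
        auto it = map_.find(vpn);
        if (it != map_.end()) {
            pfn = it->second;                                  // TLB hit: fast path
            lru_.remove(vpn);
            lru_.push_front(vpn);
        } else {
            pfn = page_table.at(vpn);                          // TLB miss: slow page walk
            if (map_.size() >= kEntries) { map_.erase(lru_.back()); lru_.pop_back(); }
            map_[vpn] = pfn;
            lru_.push_front(vpn);
        }
        return (pfn << 12) | offset;
    }
};
```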

Translation Lookaside Buffer (TLB) (III), Nehalem: the first-level TLBs are separate for data and instructions. The first-level data TLB stores 64 entries for small (4 KB) pages or 32 entries for large (2 MB/4 MB) pages; the first-level instruction TLB stores 128 entries for small pages and 7 for large pages. The second-level TLB is unified (shared between data and instructions) and stores up to 512 entries (small pages only).

Branch Predictors: a branch breaks the parallelism. Branch prediction determines whether or not a branch will be taken and, if so, quickly determines the target address at which to continue execution. Complicated techniques are needed: an array of branch targets, the Branch Target Buffer (BTB), plus an algorithm for predicting the outcome of the next branch. Intel has not provided details on the algorithm used for its new predictors.
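
As a hedged illustration of the kind of structure involved (the textbook scheme, not Intel's undisclosed predictor), the sketch below pairs a small direct-mapped BTB with classic 2-bit saturating counters:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hedged sketch: direct-mapped BTB plus 2-bit saturating counters.
// Table size and indexing are illustrative assumptions.
struct SimplePredictor {
    static constexpr std::size_t kEntries = 512;
    struct Slot { uint64_t tag = 0; uint64_t target = 0; uint8_t counter = 2; bool valid = false; };
    std::array<Slot, kEntries> table{};

    // Predict: taken if counter >= 2; target is the cached address (0 if unknown).
    bool predict(uint64_t pc, uint64_t& target) const {
        const Slot& s = table[pc % kEntries];
        if (s.valid && s.tag == pc) { target = s.target; return s.counter >= 2; }
        target = 0;
        return false;                        // unknown branch: predict not taken
    }

    // Update with the actual outcome once the branch resolves.
    void update(uint64_t pc, bool taken, uint64_t actual_target) {
        Slot& s = table[pc % kEntries];
        if (!s.valid || s.tag != pc) s = Slot{pc, actual_target, 2, true};
        if (taken && s.counter < 3) ++s.counter;
        if (!taken && s.counter > 0) --s.counter;
        s.target = actual_target;
    }
};
```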

ITLB and Branch Prediction (I): if there is a TC miss, instruction bytes must be loaded from the L2 cache and decoded into the TC. The ITLB receives the instruction-delivery request and translates the next instruction-pointer address to a physical address. The ITLBs are duplicated for each logical processor, and the L2 cache arbitrates on a first-come, first-served basis while always reserving at least one request slot for each processor.

ITLB and Branch Prediction (II): branch-prediction structures are either duplicated or shared; if shared, owner tags must be included. The return stack buffer is duplicated: it is a very small structure, and call/return pairs are better predicted for each software thread independently. The branch history buffer is tracked independently for each logical processor, while the large global history array is a shared structure tagged with a logical-processor ID.

ITLB and Branch Prediction (III)

Uop Queue Decouples the Front End from the Out-of-order execution unit

OUT-OF-ORDER EXECUTION ENGINE

Allocator: the out-of-order execution engine handles re-ordering, tracing, and sequencing. The allocator assigns many of the key machine buffers: 126 re-order buffer entries, 128 integer and 128 floating-point registers, and 48 load and 24 store buffer entries. The allocator logic takes uops from the uop queue; resources are shared equally (partitioned) between the processors, and limiting each logical processor's use of the key resources enforces fairness and prevents deadlock. The allocator switches between the two uop queues every clock cycle; if one logical processor is stalled or HALTed, there is no need to alternate between the processors.
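
A hedged sketch of this per-cycle alternation (simplified, with made-up limits and without modelling retirement) might look like:

```cpp
#include <array>
#include <cstdint>
#include <deque>

// Hedged sketch: each cycle the allocator alternates between the two uop
// queues, skipping a logical processor that is idle, stalled, or has hit its
// per-thread buffer limit. Limits are illustrative, not Intel's exact values.
struct AllocatorModel {
    std::array<std::deque<uint32_t>, 2> uop_queue;   // one queue per logical CPU
    std::array<int, 2> rob_in_use{0, 0};             // re-order buffer entries per CPU
    static constexpr int kRobLimitPerCpu = 63;       // half of 126, for fairness
    int next_cpu = 0;                                // round-robin pointer

    void allocate_one_cycle() {
        for (int tries = 0; tries < 2; ++tries) {
            int cpu = next_cpu;
            next_cpu ^= 1;                           // alternate every cycle
            bool stalled = rob_in_use[cpu] >= kRobLimitPerCpu;
            if (uop_queue[cpu].empty() || stalled) continue;   // skip idle/stalled CPU
            uop_queue[cpu].pop_front();              // "allocate" buffers for this uop
            ++rob_in_use[cpu];                       // retirement (not modelled) would free this
            return;
        }
    }
};
```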

Register Rename: the register-rename logic maps the architectural IA-32 registers onto the machine's physical registers; the 8 general-purpose IA-32 registers are expanded onto 128 available physical registers. Each logical processor has its own Register Alias Table (RAT), and the renaming process runs in parallel with the allocator logic. Renamed uops are stored in two different queues, the memory instruction queue (loads/stores) and the general instruction queue (everything else), and both queues are partitioned between the PUs.
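
A hedged sketch of per-thread renaming over a shared physical register file follows; the free-list handling is deliberately simplified (real hardware reclaims registers at retirement), and all sizes beyond 8 architectural and 128 physical registers are assumptions:

```cpp
#include <array>
#include <stdexcept>
#include <vector>

// Hedged sketch: two Register Alias Tables (one per logical CPU) mapping
// 8 architectural registers onto a shared pool of 128 physical registers.
struct RenameModel {
    static constexpr int kArchRegs = 8, kPhysRegs = 128;
    std::array<std::array<int, kArchRegs>, 2> rat;   // rat[cpu][arch] -> phys
    std::vector<int> free_list;

    RenameModel() {
        for (int p = kPhysRegs - 1; p >= 2 * kArchRegs; --p) free_list.push_back(p);
        for (int cpu = 0; cpu < 2; ++cpu)
            for (int a = 0; a < kArchRegs; ++a) rat[cpu][a] = cpu * kArchRegs + a;
    }

    // Rename the destination register of one uop for the given logical CPU.
    int rename_dest(int cpu, int arch_reg) {
        if (free_list.empty()) throw std::runtime_error("no free physical registers (stall)");
        int phys = free_list.back();
        free_list.pop_back();
        rat[cpu][arch_reg] = phys;       // later readers of arch_reg use this physical register
        return phys;
    }

    // Source operands simply read the current mapping for their own thread.
    int rename_src(int cpu, int arch_reg) const { return rat[cpu][arch_reg]; }
};
```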

Instruction Scheduling: the schedulers are at the heart of the out-of-order execution engine. There are five schedulers, with queues of 8 to 12 entries, and collectively they can dispatch up to six uops each clock cycle. A scheduler is oblivious to which logical processor a uop belongs to: it considers only dependent inputs and the availability of execution resources, and it can dispatch uops from different PUs in the same cycle. To provide fairness and prevent deadlock, there is a limit on the number of active entries each PU may have in each scheduler queue.

Execution Units & Retirement: the execution units are likewise oblivious to which logical processor a uop came from. Since source and destination registers were renamed earlier, it is enough to access the physical registers during and after execution. After execution, uops are placed in the re-order buffer, which decouples the execution stage from the retirement stage and is partitioned between the PUs. Uop retirement commits the architecture state in program order; once stores have retired, the store data is written into the L1 data cache immediately.

Memory Subsystem: the memory subsystem is totally oblivious to the logical processors; the schedulers can send load or store uops without regard to PUs, and the memory subsystem handles them as they come. The DTLB translates virtual addresses to physical addresses; it has 64 fully associative entries, each of which can map either a 4 KB or a 4 MB page, and it is shared between the PUs with entries tagged by logical-processor ID. The L1, L2, and L3 caches are also shared; the L1 data cache is virtually addressed and physically tagged. Cache conflicts between the logical processors can degrade performance, while sharing the same data in the cache can increase performance (more memory hits), which is common in server application code.

System Modes (I): there are two modes of operation: single-task (ST) mode, when there is one software thread to execute, and multi-task (MT) mode, when there is more than one. ST-mode is either ST0 or ST1, where the number indicates which PU is active. The HALT instruction is used so that partitioned resources are recombined after the call; the reason is better utilization of resources when only one logical processor is active.

System Modes (II): HALT transitions the processor from MT-mode to ST0- or ST1-mode. In ST0- or ST1-mode, an interrupt sent to the HALTed processor causes a transition back to MT-mode.

Operating System and Applications: with Hyper-Threading, the operating system and application software see the system as having twice the number of processors. OS optimizations: use the HALT instruction when one logical processor is active and the other is not (not using HALT leaves the idle processor spinning in an idle loop that consumes shared resources), and schedule software threads onto logical processors with the shared physical core in mind.
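
A closely related, well-known software optimization (a hedged sketch, not taken from the slides) is to place the PAUSE instruction inside spin-wait loops, so a logical processor that is busy-waiting yields execution resources to its sibling on the same physical core:

```cpp
#include <atomic>
#include <immintrin.h>   // _mm_pause (x86 intrinsic)

// Hedged sketch: a spin-wait that uses PAUSE so a logical processor that is
// busy-waiting gives up execution resources to the other logical processor
// sharing the same physical core.
void spin_wait(const std::atomic<bool>& ready) {
    while (!ready.load(std::memory_order_acquire)) {
        _mm_pause();     // hint: this is a spin loop; be polite to the sibling thread
    }
}
```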

Performance: the measured results show performance increases of 21% and 28%.

OS Support for HT: native HT support in Windows XP Professional Edition, Windows XP Home Edition, and Linux 2.4.x and higher; compatible with HT: Windows 2000 (all versions) and Windows NT 4.0 (limited driver support); no HT support: Windows ME, Windows 98, and earlier versions.
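
As a hedged illustration of how software can check for HT at runtime, the sketch below queries the documented CPUID leaf 1 HTT flag (EDX bit 28) and the logical-processor count (EBX bits 23:16); the `__get_cpuid` helper is a GCC/Clang-specific wrapper and is an assumption about the toolchain:

```cpp
#include <cpuid.h>    // __get_cpuid (GCC/Clang builtin wrapper, x86 only)
#include <cstdio>

// Hedged sketch: query CPUID leaf 1. EDX bit 28 is the HTT flag, and EBX
// bits 23:16 report the number of logical processors per physical package
// when that flag is set.
int main() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        std::puts("CPUID leaf 1 not supported");
        return 1;
    }
    bool htt = (edx >> 28) & 1;
    unsigned logical_per_package = (ebx >> 16) & 0xFF;
    std::printf("HTT flag: %d, logical processors per package: %u\n",
                htt, htt ? logical_per_package : 1u);
    return 0;
}
```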

Conclusion: measured performance on the Xeon showed gains of up to 30% on common server applications. HT is expected to remain viable and become a market standard across segments, from mobile to server processors.

Questions?