10_CH20 - Parallel Processing and Interconnection.pdf

Uploaded by abb3184 · 34 slides · Sep 16, 2025

About This Presentation

These slides explain Parallel Processing and Interconnection for the Computer Organization and Architecture 1 course.


Slide Content

Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved
Computer Organization and Architecture
Designing for Performance
11th Edition
Chapter 20
Parallel Processing

A Taxonomy of Parallel Processor Architectures

Parallel Processing
Multicore processor architectures
A multicore processor is a chip multiprocessor: it combines two or more processor units (called cores) on a single piece of silicon (called a die). In the 2000s the field shifted to multicore design, on-chip networking, parallel programming paradigms, and power reduction, and companies switched to multicore (AMD, Intel, IBM, Sun; all new Apple machines with 2-4 CPUs).
Instruction pipeline
Instruction pipelining is a powerful technique for enhancing performance, but it requires careful design to achieve optimum results with reasonable complexity.
Graphics processing unit (GPU)
A GPU is designed specifically to be optimized for fast (3D) graphics rendering and video processing. GPUs can be found in almost all of today’s workstations, laptops, tablets, and smartphones.

Single-core CPU chip
(Figure: the single core.)

Multicore Processor Architectures

Multi-core architectures
•The new trend in computer architecture is to replicate multiple processor cores on a single die.
•Multi-core CPU chip: the cores fit on a single processor socket
•Also called CMP (Chip Multi-Processor)
(Figure: multi-core CPU chip with Core 1, Core 2, Core 3, Core 4.)

Why multi-core ?
•Difficult to make single-core
clock frequencies even higher
•Deeply pipelined circuits:
–heat problems
–speed of light problems
–difficult design and verification
–large design teams necessary
–server farms need expensive
air-conditioning
•Many new applications are multithreaded
•General trend in computer architecture (shift towards
more parallelism)

Why multi-core?
•Other alternatives?
–Dataflow?
–Vector processors Single Instruction, Multiple Data (SIMD)?
–Integrating DRAM on chip?
–Reconfigurable logic? (general purpose?)
With multiple cores on chip
•What we want:
–N times the performance with N times the cores when we parallelize
an application on N cores
•What we get:
–Amdahl’s Law (serial bottleneck)
–Bottlenecks in the parallel portion
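The serial-bottleneck point can be made concrete with a short Amdahl's Law calculation; the parallel fraction and core counts below are illustrative, not from the slides:

```python
# Amdahl's Law: speedup = 1 / ((1 - p) + p / N), where p is the parallel
# fraction of the program and N is the core count.
def amdahl_speedup(p: float, n_cores: int) -> float:
    return 1.0 / ((1.0 - p) + p / n_cores)

print(round(amdahl_speedup(0.95, 8), 2))     # 5.93: well under the 8x we want
print(round(amdahl_speedup(0.95, 1000), 2))  # 19.63: capped near 1/(1-p) = 20
```

Even with 95% of the work parallelized, the 5% serial portion limits the speedup to 20x no matter how many cores are added.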

Multiprocessor memory types
•Shared memory:
In this model, there is one (large) common shared memory for all processors.
•Distributed memory:
In this model, each processor has its own (small) local memory, and its contents are not replicated anywhere else.
•A CPU core is a hardware component, called the ‘brain’ of a CPU. It is like a small CPU within the bigger CPU. The core can process all the computational jobs independently.
•A CPU thread is a virtual component that handles the tasks of a CPU core, to complete them in an effective manner.

The cache coherence problem
Since we have private caches:
How do we keep the data consistent across caches?
Each core should perceive the memory as a monolithic array, shared by all the cores.
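One common fix is a write-invalidate protocol: a writing core removes the line from every other private cache, so later reads miss and refetch the new value. The slides state only the problem; the `Core` class and the protocol choice below are an illustrative sketch, not from the text:

```python
# Toy write-invalidate coherence sketch over two private caches and one
# shared backing memory (write-through for simplicity).
class Core:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory   # shared backing memory (a dict)
        self.peers = []        # other cores whose caches we must invalidate
        self.cache = {}        # private cache: addr -> value

    def read(self, addr):
        if addr not in self.cache:    # miss: fetch from shared memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        for peer in self.peers:       # invalidate every other copy
            peer.cache.pop(addr, None)
        self.cache[addr] = value
        self.memory[addr] = value     # write-through keeps memory current

memory = {0x10: 1}
c0, c1 = Core("core0", memory), Core("core1", memory)
c0.peers, c1.peers = [c1], [c0]

c1.read(0x10)         # core1 now caches the old value 1
c0.write(0x10, 42)    # core0 writes and invalidates core1's stale copy
print(c1.read(0x10))  # -> 42 (the miss refetches the up-to-date value)
```

Without the invalidation step, core1 would keep returning the stale value 1, which is exactly the inconsistency the coherence problem describes.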

A multi-core processor is a special kind of multiprocessor:
all processors are on the same chip.
•Multi-core processors are MIMD:
Different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data).
•Multi-core is a shared-memory multiprocessor:
All cores share the same memory.

Interaction with the Operating System
•The OS perceives each core as a separate processor
•The OS scheduler maps threads/processes to different cores
•Most major OSes support multi-core today:
Windows, Linux, Mac OS X, …
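As a minimal standard-library illustration of this OS view: Python can report the logical processor count, and the scheduler is free to place the threads below on different cores.

```python
import os
import threading

# The OS exposes each core as a logical processor and schedules threads onto them.
print(f"OS reports {os.cpu_count() or 1} logical processors")

results = [0] * 4

def worker(i: int) -> None:
    results[i] = i * i    # each thread may run on a different core

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 1, 4, 9]
```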

What applications benefit from multi-core?
•Database servers
•Web servers (Web commerce)
•Compilers
•Multimedia applications
•Scientific applications, CAD/CAM
•In general, applications with thread-level parallelism
(Each of these can run on its own core.)

Pipelined processor

Basic Pipelining
Pipelining became a universal technique in 1985
•Overlaps the execution of instructions
•Exploits “Instruction-Level Parallelism (ILP)”
Two main approaches:
Dynamic, hardware-based
•Used in server and desktop processors
•Not used as extensively in Parallel Multiprogrammed Microprocessors (PMP)
Static, compiler-based
•Not as successful outside of scientific applications

Pipelining Technique
•Pipelining is the organizational implementation technique that
has been responsible for the most dramatic increase in
computer performance.
•Exploits instruction-level parallelism by overlapping the
execution of consecutive instructions.

Pipelined vs Unpipelined
1. Working
Pipelined: multiple instructions are overlapped during execution.
Non-pipelined: processes like decoding, fetching, execution, and writing memory are merged into a single unit or a single step.
2. Execution Time
Pipelined: many instructions are executed at the same time, and execution time is comparatively less.
Non-pipelined: only one instruction is executed at a time, and execution time is comparatively high.
3. Dependency on CPU Scheduler
Pipelined: the efficiency of the pipelining system depends upon the effectiveness of the CPU scheduler.
Non-pipelined: the efficiency is not dependent on the CPU scheduler.
4. CPU Cycles Needed
Pipelined: execution is done in fewer CPU cycles.
Non-pipelined: execution requires comparatively more CPU cycles.

Pipelined processor introduction
•Attempt to pipeline our processor using pipeline registers/FIFOs
•Much better latency and throughput!
–Average CPI reduced from 3 to 1!
–Still lots of time spent not doing work. Can we do better?
(Timing diagram: the Fetch, Execute, and Writeback stages of instr. 1 and instr. 2 overlapped over time.)
* We will see soon why pipelining a processor isn’t this simple.
Note we need a memory interface with two concurrent interfaces now (for fetch and execute). Remember instruction and data caches!
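The CPI-from-3-to-1 claim can be checked with toy cycle counts for the three-stage Fetch/Execute/Writeback pipeline (the 1000-instruction program length is an assumption for illustration):

```python
# Cycle counts for an ideal 3-stage pipeline with no stalls.
def cycles_unpipelined(n_instr: int) -> int:
    return 3 * n_instr        # each instruction runs F, E, W serially

def cycles_pipelined(n_instr: int) -> int:
    return 3 + (n_instr - 1)  # 3 cycles to fill, then 1 instruction per cycle

n = 1000
print(cycles_unpipelined(n))  # 3000 cycles -> CPI = 3
print(cycles_pipelined(n))    # 1002 cycles -> CPI approaches 1
```

The fill cost is amortized over the program, so the average CPI tends to 1 as the instruction count grows.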

Building a balanced pipeline
•Must reduce the critical path of Execute
•Writing ALU results to the register file can be moved to “Writeback”
–Most circuitry already exists in the writeback stage
–No instruction uses a memory load and the ALU at the same time
▪RISC!
(Block diagram: PC, Memory Interface, Instruction Decoder, Register File, ALU.)

Building a balanced pipeline
•Divide Execute into multiple stages
–“Decode”
▪Extract bit-encoded values from the instruction word
▪Read the register file
–“Execute”
▪Perform ALU operations
–“Memory”
▪Request memory read/write
•No single critical path which reads and writes the register file in one cycle
Pipeline stages: Fetch, Decode, Execute, Memory, Writeback
Results in a small number of stages with relatively good balance!

Ideally balanced pipeline performance
•Clock cycle: 1/5 of total latency
•Circuits in all stages are always busy with useful work
(Timing diagram: instr. 1, instr. 2, and instr. 3 each flow through Fetch, Decode, Execute, Memory, and Writeback, offset by one cycle.)
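The ideal case generalizes: for k perfectly balanced stages and no hazards, an unpipelined run takes n·k cycles while the pipeline takes k + (n - 1), so the speedup approaches k. A short check (instruction counts are illustrative):

```python
# Ideal k-stage pipeline speedup for n instructions, assuming perfectly
# balanced stages and no hazards.
def pipeline_speedup(n_instr: int, k_stages: int) -> float:
    return (n_instr * k_stages) / (k_stages + n_instr - 1)

print(pipeline_speedup(1_000_000, 5))  # approaches 5 for a long program
print(pipeline_speedup(1, 5))          # 1.0: a single instruction gains nothing
```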

Graphics Processing Unit (GPU)

GPU: Graphics processing unit
•What is a GPU?
–A specialized processor for graphics
–Embarrassingly parallel: lots of read data, calculate, write
–Used to be fixed-function
–Becoming more programmable
•The GPU is the supercomputer in your laptop
•Very basic until about 1999: a specialized device to accelerate display
•Then started changing into a full processor
•2000-…: frontier times to start

Programmer’s view with GPU
(Block diagram: CPU with its memory, connected to a GPU with its own GPU memory (1 GB on our systems); link bandwidths shown: 3 GB/s, 12.8-31.92 GB/s (8 B per transfer), and 141 GB/s.)

CPU vs. GPU
•Different design philosophies
–CPU: Central Processing Unit. A few out-of-order cores
–GPU: Graphics Processing Unit. Many in-order cores

GPU Computing
•Computation is offloaded to the GPU, because the problems run substantially faster on the GPU than on the CPU.
•Three steps
–CPU-GPU data transfer (1)
–GPU kernel execution (2)
–GPU-CPU data transfer (3)
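A toy cost model makes the three-step trade-off visible: offloading only pays when the kernel speedup outweighs the two transfers. All the times below are illustrative assumptions, not measurements:

```python
# Total offload time is the sum of the three steps from the slide.
def gpu_total_time(transfer_in_s: float, kernel_s: float,
                   transfer_out_s: float) -> float:
    return transfer_in_s + kernel_s + transfer_out_s  # steps (1) + (2) + (3)

cpu_s = 1.00                              # assumed time to compute on the CPU
gpu_s = gpu_total_time(0.05, 0.02, 0.05)  # assumed transfer and kernel times
print(f"GPU total {gpu_s:.2f} s, speedup {cpu_s / gpu_s:.1f}x")
```

With these numbers the transfers dominate the kernel time, yet offloading still wins; for a small enough problem the same transfers would make the CPU faster overall.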

GPU vs CPU Parallelism
•GPU is very good at data-parallel computing; CPU is very good at parallel processing.
•GPU has thousands of cores; CPU has fewer than 100 cores.
•GPU has around 40 hyperthreads per core; CPU has around 2 (sometimes a few more) hyperthreads per core.
•GPU has difficulty executing recursive code; CPU has fewer problems with it.

GPU Architectures
•Processing is highly data-parallel
–GPUs are highly multithreaded
–Use thread switching to hide memory latency
▪Less reliance on multi-level caches
–Graphics memory is wide and high-bandwidth
•Trend toward general-purpose GPUs
–Heterogeneous CPU/GPU systems
–CPU for sequential code, GPU for parallel code
•Programming languages/APIs
–DirectX, OpenGL
–C for Graphics (Cg), High Level Shader Language (HLSL)
–Compute Unified Device Architecture (CUDA)

Heterogeneous Computing
Host:
the CPU and its memory
Device:
the GPU and its memory

Alternative Computer Organizations

Pipelining 3 Stages
Assume a 2 ns flip-flop delay
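The arithmetic behind this slide can be sketched as follows. Only the 2 ns flip-flop delay comes from the slide; the per-stage combinational delays are assumed for illustration:

```python
# Clock-period arithmetic for a 3-stage pipeline: the slowest stage plus the
# pipeline-register (flip-flop) overhead sets the clock.
FLOP_DELAY_NS = 2                 # from the slide
stage_delays_ns = [10, 8, 6]      # assumed combinational delay of each stage

cycle_ns = max(stage_delays_ns) + FLOP_DELAY_NS
unpipelined_ns = sum(stage_delays_ns) + FLOP_DELAY_NS

print(cycle_ns)                   # 12 ns clock period
print(unpipelined_ns / cycle_ns)  # steady-state speedup, ~2.17 here
```

Note the speedup is below the stage count 3 for two reasons: the stages are unbalanced (the 12 ns clock is set by the 10 ns stage), and the flip-flop delay is paid once per stage instead of once per instruction.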

Rules for pipeline registers
•Each stage must be independent, so inter-stage registers must hold
–Data values
–Control signals, including
▪Decoded instruction fields
▪MUX controls
▪ALU controls
•Think of the register file as two independent units
–Read file, accessed in ID
–Write file, accessed in WB
•There is no “final” set of registers after WB (WB/IF), because the instruction is finished and all results are recorded in permanent machine state (register file, memory, and PC)

Symmetric Multiprocessor Organization

Copyright
This work is protected by United States copyright laws and is provided solely for the use of instructors in teaching their courses and assessing student learning. Dissemination or sale of any part of this work (including on the World Wide Web) will destroy the integrity of the work and is not permitted. The work and materials from it should never be made available to students except by instructors using the accompanying text in their classes. All recipients of this work are expected to abide by these restrictions and to honor the intended pedagogical purposes and the needs of other instructors who rely on these materials.