Parallel Processors (SIMD)


About This Presentation

A computer architecture presentation on parallel processing and SIMD.


Slide Content

Computer Architecture Parallel Processors (SIMD)

Contents
• Parallel Processors
• Flynn's taxonomy
• What is SIMD?
• Types of Processing
• Scalar Processing
• Vector Processing
• Architecture for Vector Processing
• Vector processors
• Vector Processor Architectures
• Components of Vector Processors
• Advantages of Vector Processing
• Array processors
• Array Processor Classification
• Array Processor Architecture
• Dedicated Memory Organization
• Global Memory Organization
• ILLIAC IV
• ILLIAC IV Architecture
• Supercomputers
• Cray X1
• Multimedia Extension

Parallel Processors
In computers, parallel processing is the processing of program instructions by dividing them among multiple processors, with the objective of running a program in less time. In the earliest computers, only one program ran at a time: a computation-intensive program that took one hour and a tape-copying program that also took one hour would take a total of two hours to run. An early form of parallel processing allowed the interleaved execution of both programs together: the computer would start an I/O operation, and while waiting for it to complete, it would execute the processor-intensive program. The total execution time for the two jobs would be a little over one hour.

Flynn's taxonomy
Flynn's taxonomy is a classification of computer architectures proposed by Michael J. Flynn in 1966. The classification has stuck and is still used as a tool in the design of modern processors and their functionality. The four classifications defined by Flynn are based upon the number of concurrent instruction (or control) streams and data streams available in the architecture:
• Single instruction stream, single data stream (SISD)
• Single instruction stream, multiple data streams (SIMD)
• Multiple instruction streams, single data stream (MISD)
• Multiple instruction streams, multiple data streams (MIMD)
Single instruction, multiple threads (SIMT) is sometimes added as a further category for GPUs, but it is a refinement of SIMD rather than one of Flynn's original four classes.

What is SIMD?
Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data-level parallelism: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.
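Not part of the original slides: a minimal C sketch using x86 SSE intrinsics (a later instruction-set descendant of the same SIMD idea) to show a single instruction operating on four data elements at once. The header, the _mm_* intrinsic names, and the ADDPS instruction they map to are standard for GCC/Clang on x86; the data values are arbitrary.

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics for 128-bit packed floats */

    int main(void)
    {
        /* Four single-precision values packed into each 128-bit register. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     /* lanes, low to high: 1 2 3 4 */
        __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); /* lanes, low to high: 10 20 30 40 */

        /* One SIMD instruction (ADDPS) adds all four lanes simultaneously:
         * the same operation applied to multiple data points. */
        __m128 sum = _mm_add_ps(a, b);

        float out[4];
        _mm_storeu_ps(out, sum);
        printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);  /* 11.0 22.0 33.0 44.0 */
        return 0;
    }

A scalar (SISD) processor would need four separate add instructions to produce the same result.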

Types of Processing
Scalar Processing: a CPU that performs computations on one number or set of data at a time. A scalar processor is known as a "single instruction stream, single data stream" (SISD) CPU.
Vector Processing: a vector processor or array processor is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors.

Architecture for Vector Processing
Two architectures are suitable for vector processing:
• Pipelined vector processors
• Parallel array processors

Pipelined vector processors

Description of Vector Processors
• A vector processor is a CPU that implements an instruction set operating on 1-D arrays, called vectors
• Vectors contain multiple data elements; the number of data elements per vector is typically referred to as the vector length
• Both instructions and data are pipelined to reduce decoding time
• From the slide's diagram: a scalar add (add r3, r1, r2) performs 1 operation, while a vector add (add.vv v3, v1, v2) performs N operations, one per element of the vector length

Vector Processor Architectures
Memory-to-Memory Architecture (Traditional)
• For all vector operations, operands are fetched directly from main memory and routed to the functional unit
• Results are written back to main memory
• Includes early vector machines through the mid 1980s: Advanced Scientific Computer (TI), Cyber 200, and ETA-10
• The major reason for its demise was its large startup time

Memory-to-Memory Architecture

Vector Processor Architectures (cont.)
Register-to-Register Architecture (Modern)
• All vector operations occur between vector registers
• If necessary, operands are fetched from main memory into a set of vector registers by a load-store unit
• Includes all vector machines since the late 1980s: Convex, Cray, Fujitsu, Hitachi, NEC
• SIMD processors are based on this architecture

Register-to-Register Architecture

Components of Vector Processors
• Vector Registers: typically 8-32 vector registers, each holding 64-128 64-bit elements; each contains a vector of double-precision numbers; the register size determines the maximum vector length; each has at least two read ports and one write port
• Vector Functional Units (FUs): fully pipelined, able to start a new operation every cycle; perform arithmetic and logic operations; typically 4-8 different units
• Vector Load-Store Units (LSUs): move vectors between memory and registers
• Scalar Registers: single elements for interconnecting FUs, LSUs, and registers
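As a rough, illustrative calculation (the figures are simply the upper ends of the ranges above, not values stated on the slide): 32 vector registers × 128 elements × 8 bytes per element = 32 KB of architecturally visible vector register state, far more than a typical scalar register file.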

Components of Vector Processors

The Vector Unit
A vector unit consists of a pipelined functional unit, which performs ALU operations on vectors in a pipelined fashion. It also has vector registers, including:
• a set of general-purpose vector registers, each of length s (e.g., s = 128);
• a vector length register VL, which stores the length l (0 ≤ l ≤ s) of the vector(s) currently being processed.
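Not from the slides: a minimal C sketch of strip-mining, the common way the vector length register is used when an application vector is longer than s. The function name vadd and the inner loop standing in for a single hardware vector instruction are illustrative assumptions.

    #define S 128  /* maximum vector length: the register length s */

    /* c[i] = a[i] + b[i] for arbitrary n, processed in chunks of at most S. */
    void vadd(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i += S) {
            int vl = (n - i < S) ? (n - i) : S;  /* value written to the VL register */

            /* On a real vector unit this inner loop is ONE vector add of
             * vl elements, executed by the pipelined functional unit. */
            for (int j = 0; j < vl; j++)
                c[i + j] = a[i + j] + b[i + j];
        }
    }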

Advantages of Vector Processing
• Quick fetch and decode of a single instruction for multiple operations
• The instruction provides a regular source of data, which arrives each cycle and can be processed efficiently in a pipelined fashion
• Easier addressing of main memory
• Elimination of memory wastage
• Simplification of control hazards
• Reduced code size

Array Processors

Array Processors
An array processor is a processor that performs computations on a large array of data. It is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in parallel in lockstep fashion. It is composed of N identical PEs under the control of a single control unit, plus a number of memory modules.

Array Processors
• Array processors also frequently use a form of parallel computation called pipelining, in which an operation is divided into smaller steps that are overlapped across successive operands.
• This can greatly improve performance on certain workloads, mainly numerical simulation.

How Can an Array Processor Help? An Example
Consider the simple task of adding two groups of 10 numbers together. In a normal programming language you might do something like:
• execute this loop 10 times
• read the next instruction and decode it
• fetch this number
• fetch that number
• add them
• put the result here
But to an array processor this task looks like:
• read the instruction and decode it
• fetch these 10 numbers
• fetch those 10 numbers
• add them
• put the results here
A toy sketch of the array-processor view follows this list.
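The sketch below is not from the slides; it is a toy C model of the array-processor view, in which each of ten processing elements holds one element of each operand in its local memory and responds to the same broadcast add. The struct name and data values are illustrative assumptions, and the loop over PEs only models what the hardware would do in a single lockstep step.

    #include <stdio.h>

    #define N_PE 10                /* ten processing elements */

    struct pe { int a, b, c; };    /* the local memory (PMEM) of one PE */

    int main(void)
    {
        struct pe pes[N_PE];

        /* Distribute the two groups of 10 numbers across the PEs. */
        for (int i = 0; i < N_PE; i++) {
            pes[i].a = i + 1;          /* first group:  1, 2, ..., 10   */
            pes[i].b = 10 * (i + 1);   /* second group: 10, 20, ..., 100 */
        }

        /* One broadcast "add" instruction: every PE performs the same
         * operation on its own data. The loop is only a serial model of
         * what the hardware does in a single parallel step. */
        for (int i = 0; i < N_PE; i++)
            pes[i].c = pes[i].a + pes[i].b;

        for (int i = 0; i < N_PE; i++)
            printf("PE%d: %d + %d = %d\n", i, pes[i].a, pes[i].b, pes[i].c);
        return 0;
    }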

Array Processor Classification
Processing element complexity:
• Single-bit processors: Connection Machine (CM-2), 65,536 PEs connected by a hypercube network (by Thinking Machines Corporation)
• Multi-bit processors: ILLIAC IV (64-bit), MasPar MP-1 (32-bit)

Array Processor Classification
• SIMD (Single Instruction, Multiple Data): an array processor with a single-instruction, multiple-data organization. It executes vector instructions by means of multiple functional units responding to a common instruction.
• Attached array processor: an auxiliary processor attached to a general-purpose computer. Its intent is to improve the performance of the host computer in specific numeric calculation tasks.

SIMD-Array Processor Architecture
SIMD has two basic configurations:
a. Array processors using RAM, also known as dedicated memory organization (e.g., ILLIAC IV, CM-2, MP-1)
b. Associative processors using content-addressable memory, also known as global memory organization (e.g., BSP)

Control Unit
• A simple CPU
• Can execute instructions without PE intervention
• Coordinates all PEs
• 64 64-bit registers, D0-D63
• Four 64-bit accumulators, A0-A3
• Operations: integer operations, shifts, Boolean operations, loop control, index memory

Processing Element
A PE consists of an ALU with working registers and a local memory PMEMi, which is used to store distributed data.
• All PEs perform the same function synchronously, under the supervision of the CU, in lockstep fashion.
• Before execution in a PE, the data operated on by the vector instructions should be loaded into its PMEM.
• Data can be loaded into the PMEM from an external source or by the CU.

Processing Element
A PE consists of the following registers:
• 64-bit registers: A (accumulator), B (second operand for binary operations), R (routing, for inter-PE communication), S (status)
• X: index for PMEM (16 bits)
• D: mode (8 bits)
Communication: PMEM is accessible only from the local PE; PEs communicate amongst themselves via R.

Interconnection Network and Host Computer
• Interconnection network: all communication between PEs is done through the interconnection network, which performs all the routing and manipulation functions. The interconnection network is under the control of the CU.
• Host computer: the array processor is attached to a host computer, which performs resource management and supervises peripherals and I/O.

Dedicated Memory Organization (Array Processors Using RAM)
• Here we have a control unit and multiple synchronized PEs.
• The control unit controls all the PEs below it.
• The control unit decodes all the instructions given to it and decides where each decoded instruction should be executed.
• The vector instructions are broadcast to all the PEs.
• This broadcasting achieves spatial parallelism through duplicated PEs.
• The scalar instructions are executed directly inside the CU.

Dedicated Memory Organization

Global Memory Organization
• In this configuration the PEs do not have private memories.
• The memories attached to the PEs are replaced by parallel memory modules shared by all PEs via an alignment network.
• The alignment network does path switching between the PEs and the parallel memories.
• PE-to-PE communication is also via the alignment network.
• The alignment network is controlled by the CU.
• The number of PEs (N) and the number of memory modules (K) may not be equal.
• An alignment network should allow conflict-free access to the shared memories by as many PEs as possible.

Global Memory Organization

Attached Array Processor
• In this configuration the attached array processor has an input/output interface to a common processor and another interface to a local memory.
• The local memory connects to the main memory with the help of a high-speed memory bus.

Performance and Scalability of Array Processors
To compute Y = Σ (i = 1 to N) A(i) * B(i), assume:
• a dedicated memory organization;
• the elements of A and B are properly and perfectly distributed among the processors (the compiler can help here).
Then:
• the product terms are generated in parallel;
• the additions can be performed in log2(N) iterations.
The speedup factor S (assuming that addition and multiplication take the same time) is:
S = (2N - 1) / (1 + log2(N))
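As a worked example (the specific value of N is an illustrative assumption, not from the slides): serially, the computation needs N multiplications plus N - 1 additions, i.e. 2N - 1 operations, while the array processor needs one parallel multiplication step plus log2(N) parallel addition steps (a reduction tree). For N = 1024, S = (2 × 1024 - 1) / (1 + log2(1024)) = 2047 / 11 ≈ 186.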

ILLIAC IV

ILLIAC IV
The ILLIAC IV system was the first real attempt to construct a large-scale parallel machine, and in its time it was the most powerful computing machine in the world. It was designed and constructed by academics and scientists from the University of Illinois and the Burroughs Corporation. A significant amount of software, including sophisticated compilers, was developed for ILLIAC IV, and many researchers were able to develop parallel application software. ILLIAC IV grew from a series of ILLIAC machines. Work on ILLIAC IV began in the 1960s, and the machine became operational in 1972. The original aim was to produce a 1 GFLOPS machine using an SIMD array architecture comprising 256 processors partitioned into four quadrants, each controlled by an independent control unit.

ILLIAC IV Features
• ILLIAC IV (started in the late 1960s; fully operational in 1975) is a typical example of an array processor.
• SIMD computer for array processing.
• Control unit + 64 processing elements.
• 2K words of memory per PE.
• The CU can access all memory.
• PEs can access local memory and communicate with neighbors.
• The CU reads the program and broadcasts instructions to the PEs.

ILLIAC IV Architecture

Supercomputers
Cray Inc. is an American supercomputer manufacturer headquartered in Seattle, Washington. The company's predecessor, Cray Research, Inc. (CRI), was founded in 1972 by computer designer Seymour Cray.
Cray-1: The Cray-1 was a supercomputer designed, manufactured, and marketed by Cray Research. The first Cray-1 system was installed at Los Alamos National Laboratory in 1976, and it went on to become one of the best-known and most successful supercomputers in history.

Cray X1
The Cray X1 is a non-uniform memory access, vector processor supercomputer manufactured and sold by Cray Inc. since 2003. The X1 is often described as the unification of the Cray T90, Cray SV1, and Cray T3E architectures into a single machine: it shares the multistreaming processors, vector caches, and CMOS design of the SV1, the highly scalable distributed-memory design of the T3E, and the high memory bandwidth of the T90. The X1 uses an 800 MHz (1.25 ns) clock cycle and 8-wide vector pipes in MSP mode, offering a peak speed of 12.8 gigaflops per processor. A maximum configuration of 4096 processors, comprising 1024 shared-memory nodes connected in a two-dimensional network across 32 frames, would supply a peak speed of about 50 teraflops.
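A hedged check of the quoted peak numbers (the assumption that each vector pipe completes one multiply-add, i.e. two floating-point operations, per cycle is mine, not from the slides): 0.8 GHz × 8 vector pipes × 2 flops per pipe per cycle = 12.8 gigaflops per MSP, and 12.8 gigaflops × 4096 processors ≈ 52 teraflops, consistent with the roughly 50 teraflops peak quoted above.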

Cray X1
Cray combines several technologies in the X1 machine (2003):
• Multi-streaming vector processing
• Multiple-node architecture

Cray X1 System Functional Diagram
The functional diagram shows the following components:
• Mainframe
• Node interconnection network
• System Port Channels (SPCs), which communicate within nodes
• I/O drawers (IODs)
• Cray Programming Environment Server (CPES)
• Cray Network Subsystem (CNS)
• Storage area network (SAN)
• RAID

Cray  X1 System Functional Diagram

Nodes
• Nodes are housed in hardware modules called node modules.
• Each node contains four multichip modules (MCMs), each with one multistreaming processor (MSP).
• Four SPC I/O ports.
• Routing switches control all memory access.

Node Processors
• Each node consists of four MCMs.
• Each MCM includes one multistreaming processor (MSP).
• Each MSP includes a 2-MB cache.
• A single MSP provides 12.8 GF (gigaflops).
• Each MSP has four internal single-streaming processors (SSPs).
• Each SSP contains both a superscalar processing unit and a two-pipe vector processing unit.
• The four SSPs in an MSP share the 2-MB cache of the MSP.

Cray Computers
Cray-1, Cray-2, Cray-3, Cray-3/SSS, Cray-4, Cray C90, Cray Urika-GD, Cray X1, Cray X2, Cray XC30, Cray XC40, and more.

Multimedia extensions

Multimedia extensions
A multimedia extension is essentially a supplementary processing capability that is supported on recent products. MMX, for example, provides integer operations and defines eight registers, named MM0 through MM7, together with the operations that operate on them.

MMX (instruction set)
MMX is a single instruction, multiple data (SIMD) instruction set designed by Intel, introduced in 1997 with its P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology".

Technical details
MMX defines eight registers, called MM0 through MM7, and operations that operate on them. Each register is 64 bits wide and can hold either a 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once.
Pictured on the slide: a Pentium II processor with MMX technology.
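Not part of the slides: a minimal C sketch using the MMX intrinsics from <mmintrin.h> to show one packed add acting on four 16-bit integers at once. The compile flag and the _mm_* intrinsic names are the standard GCC/Clang ones for x86; treat the example as illustrative rather than as the deck's own material.

    #include <stdio.h>
    #include <string.h>
    #include <mmintrin.h>   /* MMX intrinsics; compile with e.g. gcc -mmmx */

    int main(void)
    {
        /* Pack four 16-bit integers into each 64-bit MMX register (MM0-MM7). */
        __m64 a = _mm_set_pi16(4, 3, 2, 1);      /* lanes, low to high: 1 2 3 4 */
        __m64 b = _mm_set_pi16(40, 30, 20, 10);  /* lanes, low to high: 10 20 30 40 */

        /* One instruction (PADDW) adds all four 16-bit lanes at once. */
        __m64 sum = _mm_add_pi16(a, b);

        short out[4];
        memcpy(out, &sum, sizeof out);   /* copy the packed result out */
        _mm_empty();                     /* clear MMX state before any FPU use */

        printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 11 22 33 44 */
        return 0;
    }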