Introduction: Processing data concurrently is known as parallel processing. Consider a multiprocessor system with 'n' processors; if one processor fails, the system can continue to provide service with the remaining 'n-1' processors. Parallelism is a mode of operation in which a process is split into parts that are executed simultaneously on different processors attached to the same computer.
Goals of Parallelism: It increases computational speed. It increases throughput by allowing two or more ALUs in the CPU to work concurrently. [Throughput - the amount of processing that can be accomplished during a given interval of time.] It improves the performance of the computer for a given clock speed.
Types of Parallelism:
- Instruction-level parallelism
- Thread-level or task-level parallelism
- Bit-level parallelism
- Data-level parallelism
- Transaction-level parallelism
Instruction Level Parallelism: When instructions in a sequence are independent, they can be executed in parallel; this overlap is instruction-level parallelism. Two primary methods of increasing it are increasing the depth of the pipeline and replicating the internal components (multiple issue). A small example contrasting independent and dependent instructions follows the list of techniques below.
Techniques for exploiting instruction-level parallelism:
1. Implementing a multiple-issue processor - static and dynamic
2. Speculation - an approach that guesses the properties of an instruction
3. Recovery mechanisms - exception handling
4. Instruction issue policy:
   - in-order issue with in-order completion
   - in-order issue with out-of-order completion
   - out-of-order issue with out-of-order completion
5. Register renaming
6. Branch prediction
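As an illustration (not from the original notes), the sketch below contrasts a sequence whose operations are independent, and can therefore be overlapped by a pipelined or multiple-issue processor, with a sequence whose operations form a serial dependence chain:

```cpp
// Illustrative only: the two functions differ in how much instruction-level
// parallelism the hardware can exploit.
int independent(int a, int b, int c, int d) {
    int x = a + b;   // these two additions have no data dependence,
    int y = c + d;   // so they can be issued and executed in the same cycle
    return x * y;    // the multiply must wait for both results
}

int dependent(int a, int b, int c, int d) {
    int x = a + b;   // each instruction needs the previous result,
    int y = x + c;   // forming a serial dependence chain with
    int z = y + d;   // very little instruction-level parallelism
    return z;
}
```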
Parallel Processing Challenges: The challenge faced by industry is to create hardware and software that make it easy to write correct parallel programs that execute efficiently in both performance and energy. Challenges: writing parallel programs, scheduling, partitioning the task, and balancing the load between processors.
Amdahl's Law: Amdahl's law is used to calculate the performance gain that can be obtained by improving some portion of a computer.
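In its usual form (standard formula, not quoted from these notes): if a fraction f of the execution time can be enhanced by a factor s, the overall speedup is

```latex
\mathrm{Speedup}_{\text{overall}} = \frac{1}{(1 - f) + \dfrac{f}{s}}
```

For example, if 80% of a program can be parallelized (f = 0.8) across 4 processors (s = 4), the speedup is 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5, not 4, because the serial 20% limits the gain.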
Flynn's Classification: Flynn's classification uses two basic concepts: parallelism in the instruction stream and parallelism in the data stream. Combining them gives 4 possible categories: SISD, SIMD, MISD and MIMD.
SISD (Single Instruction Single Data): A conventional uniprocessor: one instruction stream operates on one data stream, so the machine works through a single job at a time from start to finish.
SIMD (Single Instruction Multiple Data): These machines have multiple processing/execution units and a single control unit, so one instruction is applied to many data elements at the same time. The related programming style is SPMD (Single Program Multiple Data), in which the same program runs on every processor over different data.
MISD (Multiple Instruction Single Data): There are N control and processing units operating over the same data stream; the result of one processor becomes the input of the next processor.
MIMD (Multiple Instruction Multiple Data): Most multiprocessor systems and multiple-computer systems come under this category. An MIMD machine can be viewed as multiple SISD machines (MSISD), each running its own instruction stream on its own data.
Vector Architecture: An efficient form of SIMD. It collects data elements from memory, places them in order into a large set of registers, operates on them sequentially in the registers, and then writes the results back to memory.
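As a sketch (illustrative, not from the notes), the DAXPY loop below is the classic vectorizable kernel: a vector processor, or a compiler that auto-vectorizes the loop, applies the multiply-add to whole groups of elements per instruction instead of one scalar at a time.

```cpp
#include <cstddef>

// Element-wise DAXPY-style loop: y[i] = a * x[i] + y[i].
// On a vector machine (or with SIMD auto-vectorization) the loads,
// multiply-adds and stores operate on groups of elements per instruction.
void daxpy(std::size_t n, double a, const double* x, double* y) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```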
Hardware Multithreading: The instruction stream is divided into several smaller streams, called threads, which can be executed in parallel; in effect the processor exploits parallelism across threads on top of instruction-level parallelism. Some terms:
- Process: characterised by resource ownership and by scheduling/execution
- Process switch
- Thread
- Thread switch
Two methods: 1. Explicit Multithreading 2. Implicit Multithreading
Explicit Multithreading: Explicit multithreading is visible to the application program and to the operating system; threads are created and scheduled explicitly. Implicit Multithreading: Implicit multithreading is not a direct method; threads are extracted implicitly from a sequential program by the hardware or the compiler rather than declared by the programmer.
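A minimal software-side sketch of explicit multithreading, assuming standard C++ threads (each std::thread is visible to the operating system, which can schedule it onto a hardware thread context):

```cpp
#include <iostream>
#include <thread>
#include <vector>

// Each software thread is created explicitly and is visible to the OS
// scheduler, which can map it onto a hardware thread context.
void worker(int id) {
    std::cout << "thread " << id << " running\n";
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back(worker, i);   // spawn four explicit threads
    for (auto& t : threads)
        t.join();                          // wait for all of them to finish
    return 0;
}
```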
Approaches to Explicit Multithreading:
- Single-threaded scalar
- Interleaved or fine-grained multithreading
- Blocked or coarse-grained multithreading
- Simultaneous multithreading (SMT)
- Chip multiprocessing
Multicore Processors and Other Shared Memory Multiprocessors: Multicore architectures are classified into 3 types:
- Type 1 (hyper-threading technology)
- Type 2 (classic multiprocessor)
- Type 3 (multicore system)
Shared Memory Multiprocessor (SMP): An SMP is one that offers the programmer a single physical address space across all processors. Classified as:
- Uniform memory access (UMA) multiprocessor
- Non-uniform memory access (NUMA) multiprocessor
UMA vs NUMA:
1. Definition: UMA stands for Uniform Memory Access; NUMA stands for Non-Uniform Memory Access.
2. Memory controller: UMA has a single memory controller; NUMA has multiple memory controllers.
3. Memory access: UMA memory access is slower; NUMA memory access is faster than UMA.
4. Bandwidth: UMA has limited bandwidth; NUMA has more bandwidth than UMA.
5. Suitability: UMA is used in general-purpose and time-sharing applications; NUMA is used in real-time and time-critical applications.
6. Memory access time: UMA has equal memory access time; NUMA has varying memory access time.
7. Bus types: UMA supports 3 types of buses (single, multiple, crossbar); NUMA supports 2 types of buses (tree, hierarchical).
Graphics Processing Unit (GPU): 1. GPUs vs CPUs: The programming interfaces to the GPU are high-level application programming interfaces (APIs) such as DirectX, OpenGL and NVIDIA's C for graphics (Cg). The CPU is designed for sequential code, while the GPU is designed for highly parallel code.
2. Connection between CPU and GPU: A discrete GPU is typically attached to the CPU over a PCI Express bus, while motherboard GPUs are integrated alongside the CPU on the same board.
3. GPU Architecture:
- SIMD: one instruction operates on multiple data.
- Multithreading: most graphics workloads have this property, since they need to process many objects (pixels, vertices, polygons) simultaneously.
NVIDIA GPU architecture:
- Motherboard GPUs - integrated
- Tesla-based GPUs - 900 MHz, 128 MB DDR3 RAM
CUDA Programming (Compute Unified Device Architecture): CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on its own GPUs (graphics processing units). CUDA enables developers to speed up compute-intensive applications by harnessing the power of GPUs for the parallelizable parts of the computation, in a heterogeneous CPU and GPU system.
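A minimal CUDA sketch of the heterogeneous model: the host (CPU) allocates and copies data, launches a grid of GPU threads, and copies the result back. The name vecAdd and the sizes are illustrative; cudaMalloc, cudaMemcpy and the <<<blocks, threads>>> launch are standard CUDA runtime constructs.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Kernel: each GPU thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                       // device (GPU) buffers
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;     // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);                // expect 3.0
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}
```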
Message-Passing Multiprocessors: With no shared memory space, the alternative way to build a multiprocessor is an explicit message-passing technique. This is done by establishing a communication channel between processors, over which they exchange messages.
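A rough sketch of the idea (illustrative only): two "processors" are modelled here as host threads that share no application data and communicate only through an explicit channel, built from a queue, a mutex and a condition variable. A real message-passing machine would use an interconnect or a library such as MPI instead.

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A tiny one-directional "channel": sender and receiver exchange messages
// through it rather than touching each other's data directly.
struct Channel {
    std::queue<int> q;
    std::mutex m;
    std::condition_variable cv;

    void send(int msg) {
        std::lock_guard<std::mutex> lk(m);
        q.push(msg);
        cv.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });  // block until a message arrives
        int msg = q.front(); q.pop();
        return msg;
    }
};

int main() {
    Channel ch;
    std::thread sender([&] { for (int i = 0; i < 3; ++i) ch.send(i); });
    std::thread receiver([&] {
        for (int i = 0; i < 3; ++i)
            std::cout << "received " << ch.receive() << "\n";
    });
    sender.join();
    receiver.join();
    return 0;
}
```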
Shared Memory Multiprocessor: A shared memory multiprocessor is a computer system composed of multiple independent processors that execute different instruction streams. The processors share a common memory address space and communicate with each other via memory.
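By contrast with the message-passing sketch above, this sketch (again illustrative) shows shared-memory communication: the threads below communicate through an ordinary variable in the single shared address space, with a lock providing the necessary synchronisation.

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int shared_counter = 0;          // lives in the single shared address space
std::mutex counter_lock;         // synchronisation needed for correct sharing

void add_work(int times) {
    for (int i = 0; i < times; ++i) {
        std::lock_guard<std::mutex> lk(counter_lock);
        ++shared_counter;        // communication via ordinary loads and stores
    }
}

int main() {
    std::vector<std::thread> ts;
    for (int p = 0; p < 4; ++p)
        ts.emplace_back(add_work, 1000);
    for (auto& t : ts) t.join();
    std::cout << "counter = " << shared_counter << "\n";  // expect 4000
    return 0;
}
```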
Clusters Clusters are collections of desktop computers or servers connected by local area networks to act as a single large computer.
Warehouse-Scale Computers: The largest form of clusters is called warehouse-scale computers (WSCs). WSCs provide internet services such as Google, Facebook, YouTube and Amazon.
Goals and requirements shared with servers:
- Cost-performance
- Energy efficiency
- Dependability
- Network I/O
- Interactive workloads
Characteristics not shared with servers:
- Ample parallelism
- Operational costs count
- Scale
Questions (Part A):
1. List the four major groups of computers defined by Michael J. Flynn.
2. State Amdahl's law.
3. Define parallel processing.
4. What is speculation?
5. State coarse-grained multithreading.
6. Write a note on SIMD processors.
7. Define VLIW.
8. Compare UMA and NUMA multiprocessors.
9. What is a multicore processor?
Part B:
1. What is hardware multithreading? Compare and contrast fine-grained and coarse-grained multithreading.
2. Discuss instruction-level parallelism in detail.
3. Explain in detail Flynn's classification of parallel hardware.
4. Explain:
   (i) Shared memory multiprocessors. (3)
   (ii) Warehouse-scale computers. (7)
   (iii) Message-passing multiprocessors. (4)
   (iv) Parallel processing challenges. (3)
   (v) Clusters and message-passing systems. (7)
5. Describe GPU architecture in detail.