The Evolution of Computational Power: From Laptops to Supercomputers

About This Presentation

Computational power has evolved from laptops, handling simple personal tasks, to workstations for complex calculations, to clusters of computers working together, and finally to supercomputers, performing quadrillions of calculations per second for science, weather, AI, and space exploration.


Slide Content

The Evolution of Computational Power:
From Laptops to Supercomputers
A. Anandaraman
[email protected]
National Institute of Science Education and Research
(NISER, Bhubaneswar)

Objectives of this lecture
1. Moore's Law & Architectural Evolution
2. Massive Parallelism & Exascale Computing
3. Flynn's Taxonomy
4. HPC Architectures
5. Performance Metrics in HPC
6. HPC Application Domains
7. Live Demo session with an existing HPC system
8. Q&A session

Introduction
Computing has undergone a dramatic transformation – more power in less space, using less energy.

● From Room-Sized Machines to Pocket-Sized Devices.
● Relentless Pursuit of Speed, Efficiency, and Miniaturization.
● Impact on Science, Technology, and Daily Life.

Advancements in Processor, RAM, and Storage Technologies
Processor – from single-core CPUs at ~1 GHz to multi-core, hybrid (P+E cores) designs with >5 GHz boost
- Nanometer scaling (→ 2 nm)
- Massive core counts (256+ in EPYC)
- AI/ML accelerators (NPUs, GPUs, TPUs)
RAM – from DDR1, <1 GB, ~100 MHz to DDR5, LPDDR5X, HBM3, 128 GB+
- Higher bandwidth (up to 8000+ MT/s)
- 3D stacking (HBM/DRAM)
- ECC, low-power & high-density variants
Storage – from HDDs (spinning, slow) to NVMe SSDs, PCIe 5.0, ZNS
- Shift from SATA to NVMe (10× faster)
- U.2 / M.2 / EDSFF form factors

Performance parameters
CPU:
1. Clock Speed (GHz)
2. Core/Thread count
3. IPC (Instructions Per Cycle)
4. Turbo Boost / Maximum Frequency
5. Cache Size (L1, L2, L3)
6. TDP (Thermal Design Power)
7. Fabrication Node (nm)
8. Architecture (e.g., Intel Raptor Lake, AMD Zen 4)
9. FLOPS
GPU:
1. GPU Clock Speed (MHz/GHz)
2. CUDA Cores / Stream Processors
3. Memory Bandwidth (GB/s)
4. VRAM Size (GB)
5. Memory Type
6. TDP (Thermal Design Power)
7. FLOPS (TFLOPS)
8. Ray Tracing Cores / Tensor Cores
9. Architecture
10. Bus Interface (PCIe Gen)
11. Driver and API Support
12. AI Performance (TOPS)
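Many of these GPU parameters can be queried programmatically. A minimal CUDA sketch (assuming a CUDA-capable GPU and the CUDA toolkit; the fields used are standard members of cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                         // query device 0
    printf("Name: %s\n", prop.name);
    printf("SMs: %d\n", prop.multiProcessorCount);             // streaming multiprocessors
    printf("Core clock: %.2f GHz\n", prop.clockRate / 1e6);    // clockRate is reported in kHz
    printf("Global memory: %.1f GB\n", prop.totalGlobalMem / 1e9);
    printf("Memory bus width: %d-bit\n", prop.memoryBusWidth);
    return 0;
}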

Research Cycle (Fourth Paradigm of Science)
The Pillars of Scientific Discovery
A hypothesis is a testable idea; a theory is a tested and confirmed explanation.
1. Observation
2. Hypothesis
3. Computation
4. Experiment

What is Computing?
1. Computing is the process of manipulating data using defined algorithms to solve problems, make decisions, or perform tasks.
2. It is powered by a combination of hardware (physical devices) and software (programs and instructions).
3. From a hardware perspective, two major types of computing processors are:
CPU (Central Processing Unit) – the general-purpose processor.
GPU (Graphics Processing Unit) – originally designed for graphics, now widely used for parallel computation.
4. When the CPU is used to perform computations, it is called CPU-based computation.
5. When the GPU is used to perform computations, especially for tasks involving high parallelism (e.g., deep learning, simulations), it is referred to as GPU-based computation.

Why FLOPS is Crucial in HPC
1. Real-World Phenomena are Continuous – the quantities involved are not inherently integer values.
2. Need for Precision and Range
Precision – floating-point numbers are designed for this.
Range – from the incredibly small (e.g., Planck's constant, atomic distances) to the incredibly large (e.g., the speed of light, astronomical distances, the number of particles in a system).
3. Basis for global rankings such as the TOP500 list (LINPACK benchmark).
4. Most scientific, engineering, and AI problems rely heavily on floating-point operations.

Floating-point arithmetic
1. IEEE 754 Double Precision (binary64)
Sign (S): 1 bit (Bit 63), Exponent (E): 11 bits (Bits 62-52) , Significand (M): 52 bits (Bits 51-0)
2. IEEE 754 Single Precision (binary32)
Sign (S): 1 bit (Bit 31), Exponent (E): 8 bits (Bits 30-23) , Significand (M): 23 bits (Bits 22-0)
3. IEEE 754 Half Precision (binary16)
Sign (S): 1 bit (Bit 15), Exponent (E): 5 bits (Bits 14-10) , Significand (M): 10 bits (Bits 9-0)
4. Quarter Precision (FP8 / binary8 - Not yet a standard IEEE 754 format)
E4M3 (Exponent 4 bits, Mantissa 3 bits) – driven by NVIDIA and other AI hardware vendors
Sign (S): 1 bit, Exponent (E): 4 bits, Significand (M): 3 bits
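For reference, a normalized value in these formats decodes as follows (the standard IEEE 754 relation; the listed biases are the standard ones, with 7 used for FP8 E4M3):

x = (-1)^S \times 1.M \times 2^{E - \text{bias}}, \qquad \text{bias} = 1023\ (\text{binary64}),\ 127\ (\text{binary32}),\ 15\ (\text{binary16}),\ 7\ (\text{E4M3})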

FLOP Calculations
1. Definition of FLOPS/TFLOPS: Floating Point Operations Per Second (1 TFLOPS = 10^12 FLOPS).
2. CPU TFLOPS: number of cores * clock speed * floating-point operations per cycle.
A hypothetical modern CPU: 64 cores * 2.5 GHz * 16 FLOPs/cycle (AVX-512) ≈ 2.5 TFLOPS.
3. GPU TFLOPS is computed the same way. A hypothetical modern GPU: 10,000 cores * 1.5 GHz * 2 FLOPs/core/cycle (FP32) ≈ 30 TFLOPS, a significantly larger figure.
4. It is not just raw FLOPS that matters; it is how well the problem can be parallelized to utilize those FLOPS (see the sketch below).
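A tiny host-side sketch of this peak-FLOPS arithmetic (the core counts, clocks, and FLOPs-per-cycle figures are the hypothetical values from the examples above):

#include <cstdio>

// Theoretical peak = cores * clock (Hz) * floating-point operations per cycle
double peak_tflops(double cores, double ghz, double flops_per_cycle) {
    return cores * ghz * 1e9 * flops_per_cycle / 1e12;
}

int main() {
    printf("CPU peak: %.2f TFLOPS\n", peak_tflops(64, 2.5, 16));    // 2.56 TFLOPS (AVX-512 example)
    printf("GPU peak: %.2f TFLOPS\n", peak_tflops(10000, 1.5, 2));  // 30.00 TFLOPS (FP32 example)
    return 0;
}

Actual sustained performance depends on how well the workload is parallelized and kept fed with data, which is the point of item 4 above.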

FLOPs Calculation Demo
top500.org

Why HPC?
1. Massive Data Volumes
Challenge: Processing petabytes of data
Example: Genomics – DNA sequencing for disease markers & treatments
2. Extreme Computational Complexity
Challenge: Trillions of calculations
Example: Drug Discovery – Simulating molecules & materials
3. Strict Time Constraints
Challenge: Fast, real-time results
Example: Weather Forecasting – Timely climate predictions
4. High Parallelism Needs
Challenge: Simultaneous task execution
Example: Crash Simulation – Multi-physics automotive design
5. Huge Memory & Hardware Demand
Challenge: Requires GPUs, high RAM
Example: AI Training – Large-scale language model development

Why HPC? E.g. Weather Prediction
The atmosphere is modeled by dividing it into 3-dimensional cells.
The calculation for each cell is repeated many times to model the passage of time.
Suppose the whole global atmosphere (5 × 10^8 sq. miles) is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) = 5 × 10^9 cells.

Why HPC? E.g. Weather Prediction
Suppose the whole global atmosphere (5 × 10^8 sq. miles) is divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) = 5 × 10^9 cells.
Suppose each cell's calculation requires 2000 FLOPs (floating point operations). One time step then needs about 10^13 FLOPs.
To forecast the weather over 7 days using 1-minute time steps takes (7 × 24 × 60) × 10^13 = 10080 × 10^13 ≈ 10^17 FLOPs.
A computer operating at 10 GFLOPS (10^10 floating point operations/s) takes about 10^7 seconds, or over 115 days.
To perform the calculation in 50 minutes requires a computer operating at about 34 TFLOPS (34 × 10^12 floating point operations/sec).
(My MacBook Pro is about 0.02 TFLOPS!)
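A small sketch of the same back-of-the-envelope estimate (all inputs are the assumed values from this slide):

#include <cstdio>

int main() {
    double cells          = 5e9;           // 1-mile cells, 10 miles high, whole atmosphere
    double flops_per_cell = 2000;          // assumed cost of updating one cell per time step
    double steps          = 7 * 24 * 60;   // 7 days at 1-minute time steps
    double total_flops    = cells * flops_per_cell * steps;  // ~1e17 FLOPs
    double deadline_s     = 50 * 60;       // want the forecast within 50 minutes

    printf("Total work: %.2e FLOPs\n", total_flops);
    printf("Required rate: %.1f TFLOPS\n", total_flops / deadline_s / 1e12);   // ~34 TFLOPS
    printf("Days on a 10 GFLOPS machine: %.0f\n", total_flops / 1e10 / 86400); // ~117 days, i.e., over 115
    return 0;
}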

MD Simulation
To simulate a bio-molecule of 10,000 atoms, the non-bonded energy term costs about 10^8 operations per time step.
A 1-microsecond simulation needs about 10^9 time steps, i.e., about 10^17 operations in total.
On a 500 MFLOPS machine (5 × 10^8 operations per second) this takes 2 × 10^8 seconds (about 6 years).
(And this assumes the machine sustains its 500 MFLOPS peak.)
A large number of such simulations is needed, and for even larger molecules.

Flynn's Taxonomy

Tightly coupled MIMD (Shared Memory)

Loosely coupled MIMD (Distributed Memory)

The Four Laws of HPC
1. Moore's Law: transistor density doubling roughly every two years, and the impact of its slowdown.
2. Amdahl's Law: limits on parallel speedup set by the serial fraction; central to parallel programming.
3. Gustafson's Law: scaled speedup, highlighting the ability to solve larger problems; contrast with Amdahl's.
4. Dennard Scaling (and its breakdown): power consumption challenges; this is why we need new architectures!

Moore’s Law
Moore’s Law is an observation made by Gordon Moore (co-founder of Intel) in 1965.
It states that:
"The number of transistors on an integrated circuit doubles approximately every two years, leading to an
exponential increase in computing power and a decrease in relative cost."
In simple terms:
1. Computers get faster and cheaper roughly every 2 years.
2. Transistor density (how many fit on a chip) keeps increasing.
3. Performance per dollar improves significantly over time.

Simple Summary:
Computers get faster and cheaper every ~2 years
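Stated as a formula (a standard paraphrase of the observation, not taken from the slide), with N(t) the transistor count at time t in years:

N(t) \approx N(t_0) \cdot 2^{(t - t_0)/2}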

Amdahl’s law
Amdahl’s law states that if P is the proportion of a program that can be made parallel (i.e.,
benefit from parallelization), and (1-P) is the proportion that cannot be parallelized (remains
serial), then the maximum speedup that can be achieved by using N processors is given as
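The formula itself (presumably rendered as an image on the original slide; this is the standard statement):

S(N) = \frac{1}{(1 - P) + \dfrac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}

So even with unlimited processors, the achievable speedup is capped by the serial fraction (1 - P).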

Amdahl’s law (Cont...)

Gustafson’s Law
1. "As problem size increases, the achievable
speedup on a parallel system can scale linearly
with the number of processors."
2. Unlike Amdahl’s Law (which assumes fixed
problem size), Gustafson’s Law reflects real-
world scalability.
3. Larger problems utilize more processors
efficiently.
4. Encourages designing systems for scalable
parallel computing.
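The corresponding scaled-speedup formula (standard form; the slide's own rendering was not captured in the extracted text), where P is the parallel fraction of the scaled workload and N is the number of processors:

S(N) = (1 - P) + P \cdot N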

Gustafson’s Law (Cont..)

Dennard scaling of MOSFET (1974)
Dennard scaling (named after Robert H. Dennard) refers to a key principle from a 1974 IBM research paper, which observed that:

➡️ As transistors get smaller, their power density stays constant, so that:
1. Voltage and current scale down with the size.
2. Switching speed (frequency) increases.
3. Power consumption per transistor stays roughly the same.

Why Dennard Scaling Broke Down (Post-2005)?
1. Voltage couldn’t keep dropping due to threshold voltage limits (leakage current became too
high).
2. Power density began to increase, leading to overheating.
3. Resulted in the end of traditional single-core scaling industry moved to multi-core, low-power

design, and new materials.

Dennard Scaling Relations (using K)
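The relations were presumably shown as a table on the original slide; for a scaling factor \kappa > 1 they are conventionally summarized as follows (a standard reconstruction, not recovered from the extracted text):

\begin{aligned}
&\text{Device dimensions } (L,\ W,\ t_{ox}) &&\rightarrow 1/\kappa\\
&\text{Supply voltage } V,\ \text{current } I &&\rightarrow 1/\kappa\\
&\text{Capacitance } C &&\rightarrow 1/\kappa\\
&\text{Gate delay } \tau \propto CV/I &&\rightarrow 1/\kappa\\
&\text{Frequency } f \propto 1/\tau &&\rightarrow \kappa\\
&\text{Power per transistor } P \propto CV^2 f &&\rightarrow 1/\kappa^2\\
&\text{Power density (power/area)} &&\rightarrow \text{constant}
\end{aligned}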

HPC Architecture
HPC cluster architecture consists of multiple computer servers networked together to form a cluster that offers more performance than a single computer. Its core building blocks are:
1. Compute (CPU/GPU)
2. Network (IB)
3. Storage (PFS)

Core Components of HPC Architecture

HPC Interconnect
Ethernet – from the initial 3 Mbps to 400 Gbps.
Ethernet is clearly the dominant network for mainstream computing needs where a physical connection is required.

For high-bandwidth, low-latency deployments:
InfiniBand – originated in 1999 specifically to address workload requirements that were not adequately addressed by Ethernet.
The InfiniBand protocol stack is considered less burdensome than the TCP protocol required for Ethernet.
TCP or UDP latencies: as low as ~3 microseconds
InfiniBand latencies: below 1 microsecond
RoCE has achieved: ~1.3 microseconds
EDR InfiniBand: ~610 nanoseconds

HPC Interconnect
InfiniBand is designed for scalability – it uses RDMA to reduce CPU overhead.
The InfiniBand protocol stack is considered less burdensome than the TCP protocol required for Ethernet.
This enables InfiniBand to maintain a performance and latency edge in comparison to Ethernet in many high-performance workloads.
“Bandwidth problems can be cured with money. Latency problems are harder because speed of light is fixed— you can’t bribe God” - Anonymous

HPC Storage (PFS)

Advantage & Disadvantage of Parallel File System
Advantage
1. Throughput performance
2. Scalability: Usable by 1000s of clients
3. Lower management costs for huge capacity
Disadvantage
1. Metadata performance low compared to many separate file servers
2. Complexity: Management requires skilled administrators
3. Most PFS require adaptation of clients for new Linux kernel versions

HPC Demo

Benefits of Using GPUs in HPC
1. A CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible, and can execute a few tens of these threads in parallel.
2. A GPU is designed to excel at executing thousands of threads in parallel (amortizing the slower single-thread performance to achieve greater throughput).
Multicore CPUs vs. manycore GPUs

Why GP-GPU in HPC?
1. Massive Parallelism
2. High Throughput
3. Energy Efficiency (FLOPS/Watt)
4. TFLOPs Advantage
5. Memory Bandwidth - GDDR6, HBM2/HBM3
6. Ecosystem & Tools (CUDA)
7. Addressing Architectural Limits (overcoming the challenges of slowing Moore's Law and Dennard Scaling)
Extending the use of GPUs to non-graphics workloads is known as GP-GPU computing.

CPU vs. GPU: Architectural Differences
Integrated GPU – has no dedicated memory; it uses the system memory
Dedicated GPU – has its own dedicated memory

Comparison CPU vs GPU
1. Clock Speed:
CPU – Higher clock speed, GPU – Lower clock speed
2. Cores and Threads
CPU – Few cores but faster, GPU – Many cores but slower
3. Function
CPU – Generalized component that handles main processing functions
GPU – Specialized component for parallel computing
4. Processing
CPU – Designed for serial instruction processing
GPU – Designed for parallel instruction processing
5. Suited for
CPU – General purpose computing applications
GPU – High-performance computing applications

Comparison CPU vs GPU (Cont..)
6. Operational focus
CPU – Low latency; GPU – High throughput (tolerates high per-thread latency)
7. Interaction with other components
CPU – Interacts with more computer components, such as RAM, ROM, the basic input/output system (BIOS), and input/output (I/O) ports.
GPU – Interacts mainly with RAM and the display
8. Versatility
CPU – More versatile (executes numerous kinds of tasks)
GPU – Less versatile (executes a limited set of tasks)
9. API Application
CPU – No API limitations, GPU – Limited APIs
10. Context switch latency
CPU – Switches slowly between multiple threads
GPU – No inter-warp context-switching overhead (warp contexts stay resident in hardware)

NVIDIA A30 GPU Anatomy

NVIDIA GPU Architecture

Graphics processing cluster (GPC)

Texture processing cluster (TPC)
A Graphics Processing Cluster (GPC) contains Texture Processing Clusters (TPCs):
2 SMs per TPC
4 processing blocks per SM

SFU – Special Function Unit
Trigonometric functions: sin(), cos()
Exponential and logarithmic functions: exp(), log()
Reciprocal and reciprocal square root: 1/x, 1/√x
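A minimal CUDA sketch of the kind of work the SFU accelerates (the __sinf, __expf, and rsqrtf intrinsics are standard CUDA device functions; the kernel and variable names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Each thread evaluates a few fast special functions on one element.
__global__ void special_funcs(const float* x, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(x[i]) + __expf(x[i]) + rsqrtf(x[i] + 1.0f);  // fast sin, exp, 1/sqrt
}

int main() {
    const int n = 1024;
    float *x, *out;
    cudaMallocManaged(&x, n * sizeof(float));    // unified memory keeps the sketch short
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = i * 0.01f;
    special_funcs<<<(n + 255) / 256, 256>>>(x, out, n);
    cudaDeviceSynchronize();
    printf("out[10] = %f\n", out[10]);
    cudaFree(x); cudaFree(out);
    return 0;
}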

CUDA Execution Model: Thread vs Block vs Grid

CUDA Parallel Programming Model
Thread vs. Block vs. Grid:
Thread: the smallest unit of execution (an individual calculation).
Block: a group of threads that can communicate and synchronize (runs on a Streaming Multiprocessor, SM).
Grid: a collection of blocks (the entire kernel execution).

Kernel vs. Device:
Kernel: the C++ function executed by many threads on the GPU (the parallel code).
Device: the GPU itself (the hardware that executes kernels).
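A minimal illustration of these concepts using a standard vector-add kernel (a generic sketch, not code from the lecture; names are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the device; each thread handles one element.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index within the grid
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads_per_block = 256;                                            // threads are grouped into blocks
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;  // blocks form the grid
    vec_add<<<blocks_per_grid, threads_per_block>>>(a, b, c, n);            // kernel launch on the device
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}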

HPC: The Grand Challenges

Challenges in High-Performance Computing
1. FLOPs increase faster than memory access speed.
2. Increased clock speeds and density lead to high power consumption.
3. Processors wait for data, limiting performance.
4. Cooling and energy costs limit scaling.
5. Increasing FLOPs without memory bandwidth is inefficient.
6. Need for energy-efficient high-performance computing.

Communication Bottlenecks in Large HPC
1. Latency vs Bandwidth Trade-offs
2. Communication Overhead
3. Efficient Communication
NOTE: Addressing communication bottlenecks is a key challenge in building robust and scalable distributed systems. By understanding the trade-offs and optimizing communication patterns, developers can create high-performance applications that can handle increasing workloads and node counts.
