Memory Systems. Syed Ammal Engineering College. Based on: Carl Hamacher et al., Computer Organization and Embedded Systems (6th Edition).
8.1 Basic Concepts Memory stores instructions and data for processor use. Key metrics: access time, cycle time, bandwidth, capacity, cost, and volatility. Trade-offs: speed vs. cost vs. capacity; designers use a hierarchy to balance them. Memory hierarchy: fastest and smallest at the top (registers), largest and slowest at the bottom (secondary storage). Detailed explanation: Memory systems are designed around cost–capacity–speed trade-offs. Registers are fastest but very limited in number. Main memory (DRAM) offers larger capacity but higher latency. Secondary storage (HDD/SSD/tape) is nonvolatile and orders of magnitude larger but slower. Locality (temporal and spatial) is the key principle exploited by caches and the hierarchy.
8.2 Semiconductor RAM Memories RAM provides random read/write access; typically volatile. Two primary types: Static RAM (SRAM) and Dynamic RAM (DRAM). SRAM uses flip-flops; DRAM uses capacitor-based cells requiring refresh. SRAM cell: fast, uses cross-coupled inverters; commonly used in caches. Detailed explanation: SRAM stores data in bistable latches (six-transistor cells); it is fast and does not require refresh, which makes it suitable for cache memories but relatively expensive and power hungry. DRAM stores bits as charge on capacitors with one transistor per cell; it is denser and cheaper but requires periodic refresh to retain data.
8.2.1 Internal Organization of Memory Chips Memory chips are organized as arrays of storage cells in rows and columns. Address decoding selects row (wordline) and column (bitline) to access a cell. Sense amplifiers detect small signals from DRAM cells; I/O buffers handle data transfer. DRAM internal organization: wordline selects a row, bitline carries data; sense amplifier reads small charge. Detailed explanation: A memory read activates the appropriate wordline; the selected row drives data onto the bitlines where sense amplifiers amplify and latch values. For larger memories, chips are organized into banks and controlled to hide latencies via interleaving and parallelism.
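Illustrative sketch (Python): a minimal, hypothetical example of how a flat cell address could be split into row and column fields for a square cell array; real chips multiplex row and column addresses over shared pins and add bank/rank fields.

```python
# Sketch: splitting a flat cell address into row and column fields for a
# cell array (hypothetical 1K x 1K organization; real chips also multiplex
# row/column addresses and add bank/rank fields).

ROW_BITS = 10          # 1024 rows    -> wordlines
COL_BITS = 10          # 1024 columns -> bitline/column select

def decode(cell_address: int) -> tuple[int, int]:
    """Return (row, column) selected by a flat cell address."""
    column = cell_address & ((1 << COL_BITS) - 1)               # low bits pick the column
    row = (cell_address >> COL_BITS) & ((1 << ROW_BITS) - 1)    # high bits pick the row
    return row, column

# Example: addresses that differ only in their low bits fall in the same row
# (same wordline), which is why row-buffer hits are cheaper than opening a new row.
for addr in (0, 1, 1024, 1025):
    print(addr, decode(addr))
```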
8.2.2 Static Memories (SRAM) SRAM provides very low access times and is typically used for CPU caches and registers. Cell design: cross-coupled inverters (flip-flop), typically 4-6 transistors per cell. Advantages: speed, no refresh. Disadvantages: higher cost and lower density. SRAM cell schematic (symbolic). Detailed explanation: Because SRAM does not require refresh, it offers deterministic, low-latency access suitable for on-chip caches. Its higher power and area per bit make it unsuitable for main memory at large capacities.
8.2.3 Dynamic RAMs (DRAM) DRAM stores data as charge on capacitors; each cell typically uses one transistor + one capacitor. Requires refresh cycles to restore charge, adding overhead and complexity. DRAM is denser and cheaper per bit than SRAM and forms the basis of main memory. DRAM cell (capacitor + transistor). Detailed explanation: DRAM's need for periodic refresh (each row must be refreshed within tens of milliseconds) means parts of memory must be periodically read and rewritten. Memory controllers handle refresh and provide RAS/CAS control signals. Modern DRAM chip families include SDRAM, DDR variants, and various specialized DRAM types.
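Illustrative sketch (Python): a back-of-the-envelope refresh-overhead estimate using assumed, illustrative numbers (a 64 ms retention window, 8192 rows per bank, 350 ns per refresh command); actual values are device specific.

```python
# Rough refresh-overhead estimate with assumed, device-specific numbers.
RETENTION_MS  = 64        # every row must be refreshed within this window (assumed)
ROWS_PER_BANK = 8192      # rows that each need a refresh command (assumed)
T_RFC_NS      = 350       # time one refresh command occupies the bank (assumed)

busy_ns   = ROWS_PER_BANK * T_RFC_NS      # time spent refreshing per retention window
window_ns = RETENTION_MS * 1_000_000
overhead  = busy_ns / window_ns

print(f"refresh overhead ~ {overhead:.2%} of bank time")   # ~4.5% with these numbers
```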
8.2.4 Synchronous DRAMs (SDRAM and DDR) SDRAM synchronizes operations with the system clock and supports pipelined/burst transfers. DDR (Double Data Rate) transfers data on both clock edges to increase throughput. Benefits: higher bandwidth and more predictable timing; used as main memory in modern systems. Detailed explanation: SDRAM allows command pipelining (open, read, write, precharge) and supports burst transfers to amortize command overhead. DDR variants (DDR, DDR2, DDR3, DDR4, DDR5) increase bandwidth through prefetching and signaling improvements while requiring careful timing and voltage control.
8.2.5 Structure of Larger Memories Large memory systems use banks, interleaving, and multiple chips working in parallel. Banking allows one bank to be accessed while others are being precharged or refreshed. Interleaving spreads successive memory addresses across banks to improve throughput. Detailed explanation: To hide DRAM latencies and increase sustained bandwidth, designers organize memory as multiple independent banks. Interleaving maps sequential addresses to different banks so accesses can proceed in parallel, reducing stall time for the CPU.
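Illustrative sketch (Python): low-order interleaving that maps consecutive block addresses to different banks; the 4-bank, 64-byte-block geometry is assumed for the example.

```python
# Sketch: low-order interleaving maps consecutive cache-block addresses to
# different banks so that sequential accesses can overlap (assumed 4 banks,
# 64-byte blocks).

NUM_BANKS  = 4
BLOCK_SIZE = 64

def bank_of(address: int) -> int:
    """Bank selected by an address under low-order (block) interleaving."""
    block_number = address // BLOCK_SIZE
    return block_number % NUM_BANKS

# Consecutive 64-byte blocks cycle through banks 0, 1, 2, 3, 0, ...
for addr in range(0, 8 * BLOCK_SIZE, BLOCK_SIZE):
    print(f"address {addr:4d} -> bank {bank_of(addr)}")
```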
8.3 Read-only Memories (ROM family) ROM types provide nonvolatile storage for firmware and microcode. Variants: ROM (mask-programmed), PROM (one-time programmable), EPROM, EEPROM, Flash. Trade-offs: programmability, erase granularity, write speed/endurance, and cost. Flash memory trade-offs: NAND vs NOR, block-erase vs random-read strengths. Detailed explanation: Mask ROM is programmed during manufacturing and is lowest cost per bit for high-volume code. PROM uses fusible links and is programmable once. EPROM is erasable with UV light; EEPROM and Flash allow electrical erasure and reprogramming. Flash memory's block-erase model makes it ideal for large nonvolatile storage (e.g., SSDs) but requires wear leveling and garbage collection management.
8.3.1 ROM and 8.3.2 PROM ROM: Mask-programmed, permanent contents—used when firmware is fixed. PROM: Programmable after manufacturing using one-time fuses; useful for field programming once. Detailed explanation: Mask ROM offers the cheapest per-bit cost for mass-produced firmware but is inflexible. PROM provides a post-manufacture programming step but cannot be reprogrammed; it is useful for small-volume scenarios or where one-time configuration is sufficient.
8.3.3 EPROM and 8.3.4 EEPROM EPROM: Erasable by UV light, packaged with a quartz window. EEPROM: Electrically erasable and programmable at the byte level; slower than RAM but nonvolatile. Detailed explanation: EPROMs require physical removal and UV exposure for erasure, making updates slow. EEPROM improves flexibility by allowing electrical erasure without package removal; it is used for configuration storage but has slower erase/write performance and limited endurance compared to RAM.
8.3.5 Flash Memory Flash is a widely used nonvolatile memory with block-level erase. Two common architectures: NOR (fast random read) and NAND (higher density, block erase). Used in USB drives, SSDs, embedded storage; requires wear-leveling and error management. Flash memory: NAND for dense storage (SSD), NOR for code storage with direct execute. Detailed explanation: NAND flash achieves high density and low cost per bit by optimizing for sequential throughput and block erase. SSD controllers handle bad-block management, wear-leveling, garbage collection, and error correction to present reliable storage to the system.
8.4 Direct Memory Access (DMA) DMA allows peripherals to transfer data to/from memory without CPU intervention. Techniques: cycle stealing, burst mode, and bus mastering. Offloads CPU and improves I/O throughput; requires arbitration and careful memory/IO coordination. DMA controller interacts directly with main memory and I/O device, reducing CPU overhead. Detailed explanation: A DMA controller can request the system bus to perform block transfers directly between an I/O device and memory. Cycle stealing temporarily suspends CPU access to use bus cycles; burst mode transfers blocks at once for efficiency. Bus masters must be arbitrated to avoid conflicts.
8.5 Memory Hierarchy (design principles) Hierarchy exploits locality: temporal (re-use) and spatial (nearby addresses). Smaller, faster, and more expensive storage levels are placed closer to CPU. Caching and virtual memory leverage hierarchy to present illusion of fast, large memory. Hierarchy balances speed, capacity, and cost. Detailed explanation: By organizing memory into levels, systems provide the illusion of a large, fast memory. Cache holds frequently used data; main memory holds the working set; secondary storage preserves bulk data. Effective hierarchy design increases hit rate and reduces average access latency.
8.6 Cache Memories (overview) Cache stores recently used memory blocks (lines) to reduce average access time. Key parameters: block size, associativity, replacement policy, write policy (write-through/write-back). Multi-level caches (L1, L2, L3) trade off latency vs. capacity and sharing. Detailed explanation: Caches are small, fast memories built from SRAM. Block size determines spatial locality exploitation; associativity reduces conflict misses; replacement policy manages which block to evict. Write-back caches defer writes to main memory, reducing bus traffic at the cost of more complex coherency handling.
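Illustrative sketch (Python): a minimal contrast of the two write policies named above; the cache is reduced to a dictionary of blocks with a dirty flag, and `memory` stands in for the next level of the hierarchy.

```python
# Minimal contrast of write-through vs write-back behaviour on a write.
# The "cache" is just {block_number: [data, dirty]}; memory stands in for the
# next level of the hierarchy.

memory = {}          # next level (main memory)
cache  = {}          # block_number -> [data, dirty]

def write(block: int, data: int, policy: str) -> None:
    if policy == "write-through":
        cache[block] = [data, False]   # cache and memory updated together
        memory[block] = data
    elif policy == "write-back":
        cache[block] = [data, True]    # only the cache is updated; block becomes dirty
    else:
        raise ValueError(policy)

def evict(block: int) -> None:
    """On eviction, a write-back cache must copy a dirty block to memory."""
    data, dirty = cache.pop(block)
    if dirty:
        memory[block] = data

write(7, 42, "write-back")
print("before eviction:", memory.get(7))   # None: memory not yet updated
evict(7)
print("after  eviction:", memory.get(7))   # 42: dirty block written back
```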
8.6.1 Mapping Functions Direct-mapped: each memory block maps to exactly one cache line (fast index, more conflicts). Fully associative: any block can go to any line (flexible, complex hardware). Set-associative: compromise—cache divided into sets, each set holds several lines (n-way). Direct-mapped cache (example layout). Detailed explanation: Direct-mapped caching uses a portion of the address as an index directly to a single cache line, making lookup simple but vulnerable to conflict misses. Set-associative caches use small associative sets to reduce conflicts while keeping reasonable hardware complexity. Fully-associative caches are used for small structures like TLBs.
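Illustrative sketch (Python): a direct-mapped lookup in which the index field selects exactly one line and the stored tag must match for a hit; the 32-bit address, 64-byte block, 64-line geometry is assumed.

```python
# Sketch of a direct-mapped lookup: the index field selects exactly one line,
# and the stored tag must match for a hit (assumed 32-bit addresses,
# 64-byte blocks, 64 lines).

OFFSET_BITS = 6                 # 64-byte blocks
INDEX_BITS  = 6                 # 64 lines
NUM_LINES   = 1 << INDEX_BITS

lines = [None] * NUM_LINES      # each entry holds the tag of the resident block

def split(address: int) -> tuple[int, int, int]:
    offset = address & ((1 << OFFSET_BITS) - 1)
    index  = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def access(address: int) -> bool:
    """Return True on a hit; on a miss, install the block (no data modelled)."""
    tag, index, _ = split(address)
    if lines[index] == tag:
        return True
    lines[index] = tag              # the new block simply overwrites the old one
    return False

# Two addresses 4 KB apart share an index but differ in tag -> they keep conflicting.
print(access(0x0000), access(0x1000), access(0x0000))   # False False False
```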
8.6.1 (cont.) — 2-way Set-Associative Example 2-way set associative caches group lines into sets; two candidate lines per set. Helps reduce conflict misses compared to direct mapping. Trade-off: slightly higher lookup cost and hardware complexity. 2-way set-associative cache (visualized). Detailed explanation: Addresses are typically partitioned into tag, index, and offset fields. The index selects a set; the tag is checked against each way in the set. Hardware may use comparators to check tags in parallel for fast access.
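Illustrative sketch (Python): the same geometry organized as a 2-way set-associative cache (32 sets x 2 ways); the two addresses that conflicted in the direct-mapped sketch can now coexist.

```python
# Sketch of a 2-way set-associative lookup with the same total capacity as the
# direct-mapped example (64-byte blocks, 64 lines -> 32 sets of 2 ways): the
# index selects a set, and both ways' tags are compared (in hardware, in parallel).

OFFSET_BITS = 6
SET_BITS    = 5                 # 32 sets x 2 ways = 64 lines
WAYS        = 2

sets = [[None] * WAYS for _ in range(1 << SET_BITS)]   # stored tags per way

def access(address: int) -> bool:
    index = (address >> OFFSET_BITS) & ((1 << SET_BITS) - 1)
    tag   = address >> (OFFSET_BITS + SET_BITS)
    ways  = sets[index]
    if tag in ways:                     # both tag comparisons, conceptually in parallel
        return True
    victim = ways.index(None) if None in ways else 0   # fill an empty way, else way 0
    ways[victim] = tag
    return False

# The two addresses that conflicted in the direct-mapped cache now coexist.
print(access(0x0000), access(0x1000), access(0x0000))   # False False True
```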
8.6.2 Replacement Algorithms Least Recently Used (LRU): evict the block not used for the longest time (good hit rate, hardware cost). First-In First-Out (FIFO): evict the oldest block (simpler, less optimal). Random: choose a victim at random (simple, sometimes surprisingly effective). Detailed explanation: Replacement policies balance performance and implementation cost. LRU approximations (e.g., pseudo-LRU) are common in hardware for associative caches. The choice affects miss rates, especially in workloads with conflict patterns.
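Illustrative sketch (Python): LRU and FIFO applied to one small fully associative set over a short, made-up reference string; the re-referenced block is kept by LRU but not protected by FIFO.

```python
# Sketch comparing LRU and FIFO miss counts for one 4-entry fully associative set.
from collections import OrderedDict, deque

def lru_trace(refs, capacity=4):
    cache, misses = OrderedDict(), 0
    for block in refs:
        if block in cache:
            cache.move_to_end(block)          # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)     # evict the least recently used block
            cache[block] = True
    return misses

def fifo_trace(refs, capacity=4):
    cache, order, misses = set(), deque(), 0
    for block in refs:
        if block not in cache:
            misses += 1
            if len(cache) == capacity:
                cache.discard(order.popleft())  # evict the oldest insertion
            cache.add(block)
            order.append(block)
    return misses

refs = [1, 2, 3, 4, 1, 5, 1, 2]               # small illustrative reference string
print("LRU misses :", lru_trace(refs))
print("FIFO misses:", fifo_trace(refs))
```

With this reference string the re-used block 1 is retained by LRU but evicted by FIFO, so FIFO incurs one extra miss.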
8.6.3 Examples of Mapping Techniques (address breakdown) Address split: [Tag | Index | Block Offset]. Example: a block size of 32 bytes => 5 offset bits. Index chooses the cache set; tag distinguishes between different memory blocks mapped to the same set. Example mapping calculations illustrate tag/index extraction from a physical address. Detailed explanation: For a 32-bit address and a 4KB direct-mapped cache with 64-byte blocks, there are 6 offset bits, 4096/64 = 64 lines (6 index bits), and 32 - 6 - 6 = 20 tag bits. Working through such examples gives students practice in splitting addresses into fields and mapping them to cache lines.
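Illustrative sketch (Python): a small helper that computes the tag/index/offset widths for the worked example above (direct-mapped cache, byte-addressed 32-bit addresses, power-of-two sizes assumed) and splits a sample address.

```python
# Field widths and address splitting for a direct-mapped cache
# (assumes byte addressing and power-of-two cache and block sizes).

def field_widths(cache_bytes: int, block_bytes: int, addr_bits: int = 32):
    offset_bits = block_bytes.bit_length() - 1        # log2(block size)
    num_lines   = cache_bytes // block_bytes
    index_bits  = num_lines.bit_length() - 1          # log2(number of lines)
    tag_bits    = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

def split(address: int, cache_bytes: int, block_bytes: int, addr_bits: int = 32):
    tag_bits, index_bits, offset_bits = field_widths(cache_bytes, block_bytes, addr_bits)
    offset = address & ((1 << offset_bits) - 1)
    index  = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag    = address >> (offset_bits + index_bits)
    return tag, index, offset

print(field_widths(4 * 1024, 64))        # (20, 6, 6) for the 4 KB / 64 B example
print(split(0x1234_5678, 4 * 1024, 64))  # tag/index/offset of a sample address
```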
8.7 Performance Considerations Important metrics: hit rate (H), miss rate (1−H), hit time, miss penalty. Effective Access Time (EAT) ≈ H*HitTime + (1−H)*MissPenalty, where MissPenalty is the total time to service a miss. Design aims to maximize hit rate and minimize miss penalty (prefetching, larger blocks, multilevel caches). Detailed explanation: Hit time is the cycle cost for a cache hit. The miss penalty includes the time to fetch the block from the lower level and possibly update the cache. Multi-level caches reduce miss penalties by catching misses at closer, faster levels. Prefetching and write buffers are common optimizations.
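Illustrative sketch (Python): the slide's EAT formula as a function, evaluated for a few assumed hit rates with a hypothetical 1-cycle hit time and 100-cycle miss penalty.

```python
# Effective access time with the slide's formula, where miss_penalty is the
# total time to service a miss (cycle counts below are assumed for illustration).

def effective_access_time(hit_rate: float, hit_time: float, miss_penalty: float) -> float:
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_penalty

# Assumed numbers: 1-cycle hit, 100-cycle miss penalty.
for h in (0.90, 0.95, 0.99):
    print(f"H = {h:.2f}  ->  EAT = {effective_access_time(h, 1, 100):.2f} cycles")
```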
8.7.1 Hit Rate and Miss Penalty (example) Hit rate is the fraction of memory accesses found in the cache. Miss penalty comprises the transfer time from lower memory plus additional latency. Improving hit rate has multiplicative benefits on overall system performance. Detailed explanation: Use the EAT formula to reason about performance: a small increase in hit rate can dramatically reduce average memory access time when the miss penalty is large (a main-memory miss penalty in cycles is orders of magnitude higher than an L1 hit time). For example, with a 1-cycle hit time and a 100-cycle miss penalty, raising the hit rate from 0.95 to 0.99 cuts the average access time from about 5.95 cycles to about 1.99 cycles.
8.7.2 Caches on the Processor Chip & 8.7.3 Other Enhancements Modern processors integrate L1 and often L2 on chip, sometimes L3 as well. Enhancements: prefetchers, non-blocking caches, write buffers, and multi-thread-aware caches. On-chip caches reduce latency and support higher clock rates and core counts. Detailed explanation: On-chip caches reduce wire length and latency but compete for die area. Prefetchers predict future accesses and bring data into cache early. Non-blocking caches allow outstanding misses and continue servicing hits while miss is pending.
8.8 Virtual Memory Virtual memory gives each process its own address space, mapped to physical memory by the OS/hardware. Paging divides virtual address space into fixed-size pages mapped to physical frames. Page faults occur when referenced page is not resident and must be loaded from secondary storage. Address translation uses TLBs and page tables to translate virtual addresses to physical addresses. Detailed explanation: A translation lookaside buffer (TLB) caches recent virtual-to-physical translations to accelerate address translation. Page tables may be single-level or multi-level; multi-level tables reduce memory overhead for sparse address spaces. Handling page faults involves OS intervention and disk I/O to load missing pages into physical frames.
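Illustrative sketch (Python): virtual-to-physical translation with a flat page table and a tiny TLB; 4 KB pages are assumed and the page-table contents are made-up values.

```python
# Sketch of virtual-to-physical translation with a flat page table and a tiny
# TLB (assumed 4 KB pages; page-table and frame numbers are made-up values).

PAGE_SIZE   = 4096
OFFSET_BITS = 12

page_table = {0: 7, 1: 3, 2: None}   # virtual page -> physical frame (None = not resident)
tlb        = {}                      # small cache of recent translations

def translate(virtual_address: int) -> int:
    vpn    = virtual_address >> OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)
    if vpn in tlb:                               # TLB hit: no page-table walk needed
        frame = tlb[vpn]
    else:
        frame = page_table.get(vpn)
        if frame is None:                        # page fault: OS would load the page
            raise RuntimeError(f"page fault on VPN {vpn}")
        tlb[vpn] = frame
    return (frame << OFFSET_BITS) | offset

print(hex(translate(0x0000_1234)))   # VPN 1 -> frame 3 -> 0x3234
```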
8.8.1 Address Translation (details) VA split into Tag/Index/Offset for caches; for paging, VA -> (page number, offset). Page tables map page numbers to frame numbers; TLB caches page table entries. Hardware support includes page table base register and translation bits (valid, protection). Detailed explanation: Multilevel page tables break the page number into multiple indices to index successive table levels, reducing memory consumption for sparse spaces. The OS sets up page tables and manages permissions, swapping, and replacement policies for frames.
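Illustrative sketch (Python): splitting a 32-bit virtual address for a two-level page table, assuming an x86-style 10/10/12-bit layout; only the second-level tables that are actually used need to be allocated.

```python
# Sketch of splitting a 32-bit virtual address for a two-level page table
# (assumed x86-style layout: 10-bit level-1 index, 10-bit level-2 index,
# 12-bit page offset).

L1_BITS, L2_BITS, OFFSET_BITS = 10, 10, 12

def split_va(va: int) -> tuple[int, int, int]:
    offset = va & ((1 << OFFSET_BITS) - 1)
    l2     = (va >> OFFSET_BITS) & ((1 << L2_BITS) - 1)               # index into a 2nd-level table
    l1     = (va >> (OFFSET_BITS + L2_BITS)) & ((1 << L1_BITS) - 1)   # index into the top-level table
    return l1, l2, offset

# Only the top-level table plus the 2nd-level tables that are actually used
# need to exist, which is what saves memory for sparse address spaces.
print(split_va(0x0040_3004))   # -> (1, 3, 4)
```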
8.9 Memory Management Requirements Relocation: allow programs to execute regardless of absolute physical address. Protection: prevent unauthorized access between processes. Sharing and fragmentation: support shared pages and minimize internal/external fragmentation. Detailed explanation: Memory management subsystems must provide mechanisms for allocation, protection, and sharing. Techniques such as paging avoid external fragmentation but can suffer from internal fragmentation if page sizes are large for small data sets. OS-level policies decide placement, eviction, and swapping strategies.
8.10 Secondary Storage (overview) Secondary storage provides nonvolatile long-term storage; trade-offs: latency, throughput, and capacity. Common media: magnetic hard disks, optical disks, and magnetic tape. Controllers and device drivers manage I/O scheduling, caching, and error recovery. Secondary storage devices trade off cost and performance against primary memory. Detailed explanation: Secondary storage is orders of magnitude slower than main memory but much larger and persistent. System software uses buffering, caching, and scheduling to manage the performance gap and improve throughput for common workloads.
8.10.1 Magnetic Hard Disks Structure: platters, tracks, sectors, cylinders, heads, and an actuator arm. Performance factors: seek time, rotational latency, and transfer rate. Organizational features: zones, sector formats, and disk scheduling algorithms. Disk components: platter, head, actuator arm; access latency = seek + rotation + transfer. Detailed explanation: Seek time (moving the arm) and rotational latency (waiting for sector) dominate random access performance. Disk controllers and OS schedulers (e.g., elevator algorithms) try to reduce head movement and improve throughput. Zoned bit recording increases density toward outer tracks.
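Illustrative sketch (Python): average random-access time as seek + rotational latency + transfer, using assumed drive parameters (9 ms average seek, 7200 RPM, 150 MB/s sustained transfer).

```python
# Average random-access time for a disk read, using assumed drive parameters
# (values are illustrative, not taken from the text).

AVG_SEEK_MS   = 9.0          # average seek time (assumed)
RPM           = 7200         # spindle speed (assumed)
TRANSFER_MBPS = 150.0        # sustained media transfer rate, MB/s (assumed)

rotational_latency_ms = 0.5 * (60_000 / RPM)                     # half a revolution on average
transfer_ms           = (4096 / (TRANSFER_MBPS * 1e6)) * 1000    # one 4 KB block

total_ms = AVG_SEEK_MS + rotational_latency_ms + transfer_ms
print(f"seek {AVG_SEEK_MS} ms + rotation {rotational_latency_ms:.2f} ms "
      f"+ transfer {transfer_ms:.3f} ms = {total_ms:.2f} ms")
# Seek and rotation dominate, which is why schedulers try to minimise head movement.
```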
8.10.2 Optical Disks Optical media encode data with pits and lands read by laser reflection. Examples: CD, DVD, Blu-ray (increasing densities). Primarily used for distribution and archival storage; random access slower than HDD/SSD. Detailed explanation: Optical disks have advantages in portability and cost for read-only distribution. Rewriteable optical formats exist (e.g., CD-RW, DVD-RW) but have limited endurance and slower access characteristics compared to magnetic and solid-state drives.
8.10.3 Magnetic Tape Systems Tape is low-cost, high-capacity, sequential-access media primarily used for backups and archives. Access patterns are sequential; rewind/seek times are large compared to disk. Often used with automated tape libraries for large-scale archival storage. Detailed explanation: Tape excels at streaming large volumes of data economically. For random access workloads, tape is inefficient; instead, it is used with well-defined restore procedures and indexing to locate archived data.
Summary & Key Takeaways Memory design balances speed, capacity, and cost using a hierarchical approach. SRAM and DRAM target different levels of the hierarchy (caches vs main memory). Caches use mapping, associativity, and replacement policies to improve average access time. Virtual memory and secondary storage provide the illusion of large memory but at higher latency. DMA and device controllers offload work from CPU for efficient I/O.
References Carl Hamacher et al., Computer Organization and Embedded Systems, 6th Edition, McGraw-Hill Higher Education. Slides prepared for Syed Ammal Engineering College - adapted and simplified diagrams. Instructor notes and typical architecture references (for teaching clarity).