Why Memory Organisation Matters
Memory organisation is the way a computer arranges, addresses, and manages its storage so that the CPU can fetch instructions and data fast enough to keep up. Because processors today are far faster than main memory, almost every performance bottleneck in modern systems traces back to how memory is organised. In PU’s BCA 5th-semester Microprocessor and Computer Architecture paper, this chapter is one of the highest-weighted topics — expect at least one long question (15 marks) and two or three short questions every year.
This note covers memory types, the hierarchy from registers to disk, cache mapping with worked numerical examples, virtual memory and paging, replacement algorithms, and memory interfacing, all explained the way PU expects you to write them in the exam.
1. Why we need a memory organisation at all
A modern CPU runs at 3–5 GHz and can issue several instructions per cycle. DRAM, on the other hand, takes roughly 50–100 nanoseconds to respond — that is around 200 CPU cycles for a single memory access. If every load and store actually went to DRAM, the processor would spend more than 95 % of its time waiting. Memory organisation is the engineering response to that gap: use a small amount of expensive, fast memory close to the CPU and progressively larger, cheaper, slower memory further away.
Two principles make this idea work:
- Temporal locality — if a memory location is accessed once, it is likely to be accessed again soon (e.g., a loop variable).
- Spatial locality — if a location is accessed, nearby locations are likely to be accessed soon (e.g., walking through an array).
Every cache, prefetcher, and TLB you will study below is essentially exploiting one or both of these.
2. Memory types
| Type | Volatile? | Speed | Cost / bit | Typical use |
|---|---|---|---|---|
| Register | Yes | ~0.25 ns | Very high | Inside the CPU |
| SRAM | Yes | 1–10 ns | High | L1 / L2 / L3 cache |
| DRAM | Yes | 50–100 ns | Medium | Main memory (RAM stick) |
| ROM | No | ~50 ns | Low | BIOS / firmware |
| Flash (NAND) | No | ~25 µs | Low | SSDs, pen-drives |
| Magnetic disk | No | ~10 ms | Very low | HDDs |
| Magnetic tape | No | seconds | Lowest | Archival backup |
SRAM vs DRAM in one line
SRAM stores each bit in a 6-transistor flip-flop — fast, no refresh needed, but bulky and expensive. DRAM stores each bit as charge on a tiny capacitor — dense and cheap, but the charge leaks, so it must be refreshed every few milliseconds. That is why your laptop has 8 GB of DRAM but only a few megabytes of SRAM cache.
ROM family
ROM, PROM (programmable once), EPROM (erased by UV light), EEPROM (electrically erasable), and Flash (a commercial EEPROM variant with block-level erase). Flash is what your phone storage and SSDs are built from.
3. The memory hierarchy
Picture an upside-down pyramid: small and fast at the top, huge and slow at the bottom.
| Level | Typical access time | Typical capacity |
|---|---|---|
| Registers | ~0.25 ns | Bytes |
| L1 cache | ~1 ns | 32–64 KB |
| L2 cache | ~3 ns | 256 KB – 1 MB |
| L3 cache | ~12 ns | 4 – 32 MB |
| Main memory (DRAM) | ~80 ns | 8 – 32 GB |
| SSD | ~25 µs | 256 GB – 2 TB |
| HDD | ~10 ms | 1 – 16 TB |
The hierarchy is inclusive or exclusive depending on the design. In an inclusive hierarchy (Intel), anything in L1 is also in L2 and L3 — simple but wasteful. In an exclusive hierarchy (AMD historically), a block lives in only one level — more capacity, more complex coherence.
4. Cache memory
Cache sits between the CPU and main memory. When the CPU issues an address, the cache controller checks if the corresponding block is present:
- Hit — block found, served at cache speed.
- Miss — block not found, the controller fetches it from the next level (or DRAM), and usually evicts an existing block to make room.
4.1 Cache mapping techniques
To answer “is this address in the cache, and if so where?”, the controller needs a mapping rule.
| Mapping | Where can block B go? | Hardware | Conflict misses |
|---|---|---|---|
| Direct-mapped | Exactly one line: B mod N | Simplest, one comparator | Highest |
| Fully associative | Any line | Most expensive, N comparators | None |
| n-way set associative | Any of n lines in set B mod (N/n) | n comparators | Moderate |
Real CPUs use set-associative caches almost universally (typical L1 is 8-way).
4.2 Worked example — direct-mapped cache
A processor has a 16 KB direct-mapped cache with 64-byte blocks and uses a 32-bit address. How are bits split between tag, index, and offset?
- Block size = 64 B → offset bits = log2(64) = 6 bits
- Number of lines = 16 KB / 64 B = 256 → index bits = log2(256) = 8 bits
- Tag bits = 32 − 8 − 6 = 18 bits
So address 0x1A2B3C40 splits as: Tag (18 bits) | Index (8 bits) | Offset (6 bits). The controller uses the index to pick line 0xCF, compares the stored tag with 0x68AC, and on a match returns the byte at offset 0.
4.3 Replacement policies
When a set is full and a new block arrives, one resident block must leave.
| Policy | Idea | Pros | Cons |
|---|---|---|---|
| LRU | Evict the line not used for the longest | Best hit rate in practice | Hard to implement for high associativity |
| FIFO | Evict the oldest | Simple counter | Ignores recent activity |
| Random | Pick any line | Cheapest | Surprisingly competitive |
| Pseudo-LRU | Tree-bit approximation of LRU | Cheap and close to LRU | Slightly worse hit rate |
4.4 Write policies
When the CPU writes to a cached block:
- Write-through — write to cache and memory simultaneously. Simple, but main-memory bandwidth is a bottleneck. Often paired with a write buffer to hide latency.
- Write-back — update only the cache, set a dirty bit; write to memory when the block is evicted. Reduces bandwidth dramatically at the cost of more complex coherence.
For a miss on a write, two further options apply:
- Write-allocate — fetch the block into cache, then write (matches write-back well).
- No-write-allocate — write straight to memory, don’t pollute the cache (matches write-through well).
5. Cache performance
5.1 Effective access time
For a single level of cache: T_eff = h × T_cache + (1 − h) × (T_cache + T_memory) where h is the hit rate.
Numerical example. Cache access = 2 ns, memory access = 80 ns, hit rate = 95 %. T_eff = 0.95 × 2 + 0.05 × (2 + 80) = 1.9 + 4.1 = 6.0 ns.
The CPU now sees a memory subsystem that is thirteen times faster than DRAM alone, just by adding a small SRAM cache with a 95 % hit rate. Notice how a drop from 95 % to 90 % nearly doubles average latency — every percentage point of hit rate matters.
5.2 Levers for improving hit rate
- Larger block size — exploits spatial locality, but past a point pulls in unused data and increases miss penalty.
- Larger cache — fewer capacity misses, but slower access and more power.
- Higher associativity — fewer conflict misses, more comparators per access.
- Prefetching — speculatively bring in blocks before they are demanded (hardware stride prefetcher, software
__builtin_prefetch). - Cache-conscious coding — loop tiling, struct-of-arrays layouts, padding to avoid false sharing.
5.3 The three Cs of cache misses
| Miss type | Cause | How to reduce |
|---|---|---|
| Compulsory (cold) | First reference to a block | Prefetching, larger block size |
| Capacity | Working set bigger than cache | Bigger cache, better locality |
| Conflict | Two hot blocks map to same set | Higher associativity, victim cache |
6. Virtual memory
Virtual memory creates the illusion that each process has a private, contiguous address space that can be larger than physical RAM. The operating system, together with the MMU (Memory Management Unit) inside the CPU, translates virtual addresses to physical addresses on every memory access.
6.1 Paging in one paragraph
The virtual address space is divided into fixed-size pages (usually 4 KB). Physical memory is divided into equal-size frames. A per-process page table maps each virtual page to a frame (or marks it as not present). When the CPU produces a virtual address, the MMU splits it into a page number (used to look up the table) and an offset (passed through unchanged).
Worked example. A 32-bit virtual address with 4 KB pages: offset = 12 bits (because 4 KB = 212); page number = 20 bits → up to 220 = 1,048,576 page-table entries per process. If each entry is 4 bytes, a single-level page table costs 4 MB per process. That is why real systems use multi-level page tables (typical x86-64 uses 4 or 5 levels) or inverted page tables (one entry per frame, not per page).
6.2 The TLB
Walking the page table on every access would defeat the purpose of caching. The TLB (Translation Lookaside Buffer) is a small, fully-associative cache of recent virtual-to-physical translations, typically 64 to 1024 entries. A TLB hit completes in 1 cycle; a TLB miss triggers a page-table walk (tens of cycles), and a page fault triggers OS intervention (millions of cycles).
6.3 Segmentation vs paging
| Feature | Segmentation | Paging |
|---|---|---|
| Unit | Variable-size segment | Fixed-size page |
| Address | (segment, offset) | (page, offset) |
| External fragmentation | Yes | No |
| Internal fragmentation | No | Yes (last page may be partial) |
| Programmer view | Logical (code, data, stack) | Transparent |
Modern systems use paging on top of segmentation (x86) or pure paging (most RISC architectures).
7. Page replacement algorithms
When all frames are occupied and a new page must be brought in, the OS chooses a victim.
| Algorithm | Idea | Belady’s anomaly? | Implementation |
|---|---|---|---|
| FIFO | Replace the page loaded earliest | Yes | Queue |
| Optimal (OPT) | Replace the page used farthest in the future | No | Unimplementable (theoretical baseline) |
| LRU | Replace the page not used for the longest | No | Counter or stack per page (costly) |
| Clock (Second-chance) | LRU approximation using a reference bit | No | Single bit per frame, pointer sweep |
| LFU | Replace the least-frequently used | No | Counter — but old pages with high counts stick |
Belady’s anomaly — for FIFO, increasing the number of frames can sometimes increase the number of page faults. This is the classic surprise result and a favourite short-question topic in PU exams.
Worked example — FIFO reference string
Reference string: 7 0 1 2 0 3 0 4 2 3 0 3 2, 3 frames, FIFO.
| Step | Ref | Frames after | Fault? |
|---|---|---|---|
| 1 | 7 | 7 | yes |
| 2 | 0 | 7 0 | yes |
| 3 | 1 | 7 0 1 | yes |
| 4 | 2 | 0 1 2 | yes (evict 7) |
| 5 | 0 | 0 1 2 | no |
| 6 | 3 | 1 2 3 | yes (evict 0) |
| 7 | 0 | 2 3 0 | yes (evict 1) |
| 8 | 4 | 3 0 4 | yes (evict 2) |
| 9 | 2 | 0 4 2 | yes (evict 3) |
| 10 | 3 | 4 2 3 | yes (evict 0) |
| 11 | 0 | 2 3 0 | yes (evict 4) |
| 12 | 3 | 2 3 0 | no |
| 13 | 2 | 2 3 0 | no |
Total page faults = 10.
8. Memory interfacing
The CPU communicates with memory chips through three buses:
- Address bus — one-way, CPU → memory. Width determines the maximum addressable memory (32 lines = 4 GB).
- Data bus — bidirectional. Width determines how many bits move per access (64 bits in most desktops).
- Control bus — carries signals like
READ,WRITE,MEMRQ, clock, and chip-select.
8.1 Address decoding
A memory system usually contains several chips, each handling a slice of the address space. A decoder (74LS138 in 8085 lab work, integrated logic in real systems) reads the high-order address bits and asserts the correct chip’s CS (Chip Select) line, leaving the low-order bits to address inside the chip.
8.2 Memory banks and interleaving
Splitting memory into independently-accessed banks lets multiple accesses overlap. Low-order interleaving distributes consecutive addresses across banks — perfect for sequential access patterns. High-order interleaving keeps consecutive addresses in the same bank — useful when different banks belong to different processes. A 4-way interleaved memory with 40 ns access time can deliver one word every 10 ns under sequential access, even though no single bank is faster than 40 ns.
9. Putting it all together — what the CPU actually does on a load
- Virtual address generated by the instruction.
- TLB lookup. Hit → physical address in 1 cycle. Miss → page-table walk; if the page is not in memory, page fault and OS swap.
- L1 cache lookup with the physical address. Hit → done in 4 cycles.
- L1 miss → L2 lookup (~12 cycles).
- L2 miss → L3 lookup (~40 cycles).
- L3 miss → DRAM access (~200 cycles), block returned and cached at every level on the way up.
A well-tuned program keeps step 3 (L1 hit) as the common case more than 90 % of the time.
10. PU-style practice questions
Short answer (5 marks each)
- Differentiate SRAM and DRAM with at least four points of comparison.
- A 64 KB cache uses 4-way set-associative mapping with 32-byte blocks and a 32-bit address. Determine the tag, index, and offset bits.
- What is Belady’s anomaly? Which replacement algorithm exhibits it?
- Explain the difference between write-through and write-back with one advantage and one drawback of each.
- What is the role of the TLB? Why is its hit rate critical for performance?
Long answer (15 marks each)
- Explain the memory hierarchy of a modern computer with typical sizes and access times. Justify why such a hierarchy outperforms a flat memory of equal total capacity.
- With a worked example, demonstrate FIFO, LRU, and Optimal page replacement on the reference string
1 2 3 4 1 2 5 1 2 3 4 5using three frames. Compare the number of page faults. - Discuss the three Cs of cache misses. For each type, suggest two practical techniques (hardware or software) to reduce it.
- Compare paging and segmentation as memory management techniques, with the advantages, disadvantages, and how modern systems combine both.
- Design the address decoding scheme to interface four 8 KB RAM chips with an 8085 microprocessor so that they occupy contiguous addresses starting from 0x4000.
11. Frequently confused points
“Is cache part of main memory?” — No. Cache is a separate, smaller memory built from SRAM, physically closer to the CPU. Main memory is DRAM on a different bus.
“Is virtual memory the same as swap space?” — No. Virtual memory is the abstraction of a private, large address space. Swap space is the disk area used to store pages that don’t fit in physical RAM. Virtual memory uses swap space, but the two are not synonyms.
“Does increasing cache size always increase hit rate?” — Up to a point, yes — but beyond the working set size, returns diminish quickly, and the larger cache also gets slower per access. A 1 MB L2 with 95 % hit rate can outperform a 4 MB L2 with 96 % hit rate if the L2 access time doubles.
“Why do we need the TLB if we already have caches?” — Caches operate on physical addresses (in most designs), so the address must be translated before the cache can be queried. Without a TLB, every memory access would walk the page table.
12. Exam tips
- Always begin a long answer with a clean definition before diving into details.
- Draw the memory-hierarchy pyramid even if not asked — it earns 1–2 marks for structure.
- For cache problems, show the bit split before computing tag/index/offset values.
- For page-replacement problems, tabulate frame contents step by step and circle the faults.
- Mention real numbers (ns, KB, GB) — examiners reward concreteness over generic prose.
- Link concepts back to locality of reference whenever you can — it is the unifying idea of the entire chapter.
Summary
Memory organisation exists because the CPU is far faster than DRAM. The hierarchy from registers down to disk uses locality of reference to give the CPU the illusion of fast, infinite memory. Caches (with their mapping, replacement, and write policies), virtual memory (with paging and the TLB), and memory interfacing (buses, decoding, interleaving) are the three pillars you must master for the PU BCA examination — and for any real systems-programming work that follows it.