Memory Organisation

Why Memory Organisation Matters

Memory organisation is the way a computer arranges, addresses, and manages its storage so that the CPU can fetch instructions and data fast enough to keep up. Because processors today are far faster than main memory, almost every performance bottleneck in modern systems traces back to how memory is organised. In PU’s BCA 5th-semester Microprocessor and Computer Architecture paper, this chapter is one of the highest-weighted topics — expect at least one long question (15 marks) and two or three short questions every year.

This note covers memory types, the hierarchy from registers to disk, cache mapping with worked numerical examples, virtual memory and paging, replacement algorithms, and memory interfacing, all explained the way PU expects you to write them in the exam.

1. Why we need a memory organisation at all

A modern CPU runs at 3–5 GHz and can issue several instructions per cycle. DRAM, on the other hand, takes roughly 50–100 nanoseconds to respond — that is around 200 CPU cycles for a single memory access. If every load and store actually went to DRAM, the processor would spend more than 95 % of its time waiting. Memory organisation is the engineering response to that gap: use a small amount of expensive, fast memory close to the CPU and progressively larger, cheaper, slower memory further away.

Two principles make this idea work:

Temporal locality — if a memory location is accessed once, it is likely to be accessed again soon (e.g., a loop variable).
Spatial locality — if a location is accessed, nearby locations are likely to be accessed soon (e.g., walking through an array).

Every cache, prefetcher, and TLB you will study below is essentially exploiting one or both of these.

2. Memory types

Type	Volatile?	Speed	Cost / bit	Typical use
Register	Yes	~0.25 ns	Very high	Inside the CPU
SRAM	Yes	1–10 ns	High	L1 / L2 / L3 cache
DRAM	Yes	50–100 ns	Medium	Main memory (RAM stick)
ROM	No	~50 ns	Low	BIOS / firmware
Flash (NAND)	No	~25 µs	Low	SSDs, pen-drives
Magnetic disk	No	~10 ms	Very low	HDDs
Magnetic tape	No	seconds	Lowest	Archival backup

SRAM vs DRAM in one line

SRAM stores each bit in a 6-transistor flip-flop — fast, no refresh needed, but bulky and expensive. DRAM stores each bit as charge on a tiny capacitor — dense and cheap, but the charge leaks, so it must be refreshed every few milliseconds. That is why your laptop has 8 GB of DRAM but only a few megabytes of SRAM cache.

ROM family

ROM, PROM (programmable once), EPROM (erased by UV light), EEPROM (electrically erasable), and Flash (a commercial EEPROM variant with block-level erase). Flash is what your phone storage and SSDs are built from.

3. The memory hierarchy

Picture an upside-down pyramid: small and fast at the top, huge and slow at the bottom.

Level	Typical access time	Typical capacity
Registers	~0.25 ns	Bytes
L1 cache	~1 ns	32–64 KB
L2 cache	~3 ns	256 KB – 1 MB
L3 cache	~12 ns	4 – 32 MB
Main memory (DRAM)	~80 ns	8 – 32 GB
SSD	~25 µs	256 GB – 2 TB
HDD	~10 ms	1 – 16 TB

The hierarchy is inclusive or exclusive depending on the design. In an inclusive hierarchy (Intel), anything in L1 is also in L2 and L3 — simple but wasteful. In an exclusive hierarchy (AMD historically), a block lives in only one level — more capacity, more complex coherence.

4. Cache memory

Cache sits between the CPU and main memory. When the CPU issues an address, the cache controller checks if the corresponding block is present:

Hit — block found, served at cache speed.
Miss — block not found, the controller fetches it from the next level (or DRAM), and usually evicts an existing block to make room.

4.1 Cache mapping techniques

To answer “is this address in the cache, and if so where?”, the controller needs a mapping rule.

Mapping	Where can block B go?	Hardware	Conflict misses
Direct-mapped	Exactly one line: `B mod N`	Simplest, one comparator	Highest
Fully associative	Any line	Most expensive, N comparators	None
n-way set associative	Any of n lines in set `B mod (N/n)`	n comparators	Moderate

Real CPUs use set-associative caches almost universally (typical L1 is 8-way).

4.2 Worked example — direct-mapped cache

A processor has a 16 KB direct-mapped cache with 64-byte blocks and uses a 32-bit address. How are bits split between tag, index, and offset?

Block size = 64 B → offset bits = log₂(64) = 6 bits
Number of lines = 16 KB / 64 B = 256 → index bits = log₂(256) = 8 bits
Tag bits = 32 − 8 − 6 = 18 bits

So address 0x1A2B3C40 splits as: Tag (18 bits) | Index (8 bits) | Offset (6 bits). The controller uses the index to pick line 0xCF, compares the stored tag with 0x68AC, and on a match returns the byte at offset 0.

4.3 Replacement policies

When a set is full and a new block arrives, one resident block must leave.

Policy	Idea	Pros	Cons
LRU	Evict the line not used for the longest	Best hit rate in practice	Hard to implement for high associativity
FIFO	Evict the oldest	Simple counter	Ignores recent activity
Random	Pick any line	Cheapest	Surprisingly competitive
Pseudo-LRU	Tree-bit approximation of LRU	Cheap and close to LRU	Slightly worse hit rate

4.4 Write policies

When the CPU writes to a cached block:

Write-through — write to cache and memory simultaneously. Simple, but main-memory bandwidth is a bottleneck. Often paired with a write buffer to hide latency.
Write-back — update only the cache, set a dirty bit; write to memory when the block is evicted. Reduces bandwidth dramatically at the cost of more complex coherence.

For a miss on a write, two further options apply:

Write-allocate — fetch the block into cache, then write (matches write-back well).
No-write-allocate — write straight to memory, don’t pollute the cache (matches write-through well).

5. Cache performance

5.1 Effective access time

For a single level of cache: T_eff = h × T_cache + (1 − h) × (T_cache + T_memory) where h is the hit rate.

Numerical example. Cache access = 2 ns, memory access = 80 ns, hit rate = 95 %. T_eff = 0.95 × 2 + 0.05 × (2 + 80) = 1.9 + 4.1 = 6.0 ns.

The CPU now sees a memory subsystem that is thirteen times faster than DRAM alone, just by adding a small SRAM cache with a 95 % hit rate. Notice how a drop from 95 % to 90 % nearly doubles average latency — every percentage point of hit rate matters.

5.2 Levers for improving hit rate

Larger block size — exploits spatial locality, but past a point pulls in unused data and increases miss penalty.
Larger cache — fewer capacity misses, but slower access and more power.
Higher associativity — fewer conflict misses, more comparators per access.
Prefetching — speculatively bring in blocks before they are demanded (hardware stride prefetcher, software __builtin_prefetch).
Cache-conscious coding — loop tiling, struct-of-arrays layouts, padding to avoid false sharing.

5.3 The three Cs of cache misses

Miss type	Cause	How to reduce
Compulsory (cold)	First reference to a block	Prefetching, larger block size
Capacity	Working set bigger than cache	Bigger cache, better locality
Conflict	Two hot blocks map to same set	Higher associativity, victim cache

6. Virtual memory

Virtual memory creates the illusion that each process has a private, contiguous address space that can be larger than physical RAM. The operating system, together with the MMU (Memory Management Unit) inside the CPU, translates virtual addresses to physical addresses on every memory access.

6.1 Paging in one paragraph

The virtual address space is divided into fixed-size pages (usually 4 KB). Physical memory is divided into equal-size frames. A per-process page table maps each virtual page to a frame (or marks it as not present). When the CPU produces a virtual address, the MMU splits it into a page number (used to look up the table) and an offset (passed through unchanged).

Worked example. A 32-bit virtual address with 4 KB pages: offset = 12 bits (because 4 KB = 2¹²); page number = 20 bits → up to 2²⁰ = 1,048,576 page-table entries per process. If each entry is 4 bytes, a single-level page table costs 4 MB per process. That is why real systems use multi-level page tables (typical x86-64 uses 4 or 5 levels) or inverted page tables (one entry per frame, not per page).

6.2 The TLB

Walking the page table on every access would defeat the purpose of caching. The TLB (Translation Lookaside Buffer) is a small, fully-associative cache of recent virtual-to-physical translations, typically 64 to 1024 entries. A TLB hit completes in 1 cycle; a TLB miss triggers a page-table walk (tens of cycles), and a page fault triggers OS intervention (millions of cycles).

6.3 Segmentation vs paging

Feature	Segmentation	Paging
Unit	Variable-size segment	Fixed-size page
Address	(segment, offset)	(page, offset)
External fragmentation	Yes	No
Internal fragmentation	No	Yes (last page may be partial)
Programmer view	Logical (code, data, stack)	Transparent

Modern systems use paging on top of segmentation (x86) or pure paging (most RISC architectures).

7. Page replacement algorithms

When all frames are occupied and a new page must be brought in, the OS chooses a victim.

Algorithm	Idea	Belady’s anomaly?	Implementation
FIFO	Replace the page loaded earliest	Yes	Queue
Optimal (OPT)	Replace the page used farthest in the future	No	Unimplementable (theoretical baseline)
LRU	Replace the page not used for the longest	No	Counter or stack per page (costly)
Clock (Second-chance)	LRU approximation using a reference bit	No	Single bit per frame, pointer sweep
LFU	Replace the least-frequently used	No	Counter — but old pages with high counts stick

Belady’s anomaly — for FIFO, increasing the number of frames can sometimes increase the number of page faults. This is the classic surprise result and a favourite short-question topic in PU exams.

Worked example — FIFO reference string

Reference string: 7 0 1 2 0 3 0 4 2 3 0 3 2, 3 frames, FIFO.

Step	Ref	Frames after	Fault?
1	7	7	yes
2	0	7 0	yes
3	1	7 0 1	yes
4	2	0 1 2	yes (evict 7)
5	0	0 1 2	no
6	3	1 2 3	yes (evict 0)
7	0	2 3 0	yes (evict 1)
8	4	3 0 4	yes (evict 2)
9	2	0 4 2	yes (evict 3)
10	3	4 2 3	yes (evict 0)
11	0	2 3 0	yes (evict 4)
12	3	2 3 0	no
13	2	2 3 0	no

Total page faults = 10.

8. Memory interfacing

The CPU communicates with memory chips through three buses:

Address bus — one-way, CPU → memory. Width determines the maximum addressable memory (32 lines = 4 GB).
Data bus — bidirectional. Width determines how many bits move per access (64 bits in most desktops).
Control bus — carries signals like READ, WRITE, MEMRQ, clock, and chip-select.

8.1 Address decoding

A memory system usually contains several chips, each handling a slice of the address space. A decoder (74LS138 in 8085 lab work, integrated logic in real systems) reads the high-order address bits and asserts the correct chip’s CS (Chip Select) line, leaving the low-order bits to address inside the chip.

8.2 Memory banks and interleaving

Splitting memory into independently-accessed banks lets multiple accesses overlap. Low-order interleaving distributes consecutive addresses across banks — perfect for sequential access patterns. High-order interleaving keeps consecutive addresses in the same bank — useful when different banks belong to different processes. A 4-way interleaved memory with 40 ns access time can deliver one word every 10 ns under sequential access, even though no single bank is faster than 40 ns.

9. Putting it all together — what the CPU actually does on a load

Virtual address generated by the instruction.
TLB lookup. Hit → physical address in 1 cycle. Miss → page-table walk; if the page is not in memory, page fault and OS swap.
L1 cache lookup with the physical address. Hit → done in 4 cycles.
L1 miss → L2 lookup (~12 cycles).
L2 miss → L3 lookup (~40 cycles).
L3 miss → DRAM access (~200 cycles), block returned and cached at every level on the way up.

A well-tuned program keeps step 3 (L1 hit) as the common case more than 90 % of the time.

10. PU-style practice questions

Short answer (5 marks each)

Differentiate SRAM and DRAM with at least four points of comparison.
A 64 KB cache uses 4-way set-associative mapping with 32-byte blocks and a 32-bit address. Determine the tag, index, and offset bits.
What is Belady’s anomaly? Which replacement algorithm exhibits it?
Explain the difference between write-through and write-back with one advantage and one drawback of each.
What is the role of the TLB? Why is its hit rate critical for performance?

Long answer (15 marks each)

Explain the memory hierarchy of a modern computer with typical sizes and access times. Justify why such a hierarchy outperforms a flat memory of equal total capacity.
With a worked example, demonstrate FIFO, LRU, and Optimal page replacement on the reference string 1 2 3 4 1 2 5 1 2 3 4 5 using three frames. Compare the number of page faults.
Discuss the three Cs of cache misses. For each type, suggest two practical techniques (hardware or software) to reduce it.
Compare paging and segmentation as memory management techniques, with the advantages, disadvantages, and how modern systems combine both.
Design the address decoding scheme to interface four 8 KB RAM chips with an 8085 microprocessor so that they occupy contiguous addresses starting from 0x4000.

11. Frequently confused points

“Is cache part of main memory?” — No. Cache is a separate, smaller memory built from SRAM, physically closer to the CPU. Main memory is DRAM on a different bus.

“Is virtual memory the same as swap space?” — No. Virtual memory is the abstraction of a private, large address space. Swap space is the disk area used to store pages that don’t fit in physical RAM. Virtual memory uses swap space, but the two are not synonyms.

“Does increasing cache size always increase hit rate?” — Up to a point, yes — but beyond the working set size, returns diminish quickly, and the larger cache also gets slower per access. A 1 MB L2 with 95 % hit rate can outperform a 4 MB L2 with 96 % hit rate if the L2 access time doubles.

“Why do we need the TLB if we already have caches?” — Caches operate on physical addresses (in most designs), so the address must be translated before the cache can be queried. Without a TLB, every memory access would walk the page table.

12. Exam tips

Always begin a long answer with a clean definition before diving into details.
Draw the memory-hierarchy pyramid even if not asked — it earns 1–2 marks for structure.
For cache problems, show the bit split before computing tag/index/offset values.
For page-replacement problems, tabulate frame contents step by step and circle the faults.
Mention real numbers (ns, KB, GB) — examiners reward concreteness over generic prose.
Link concepts back to locality of reference whenever you can — it is the unifying idea of the entire chapter.

Summary

Memory organisation exists because the CPU is far faster than DRAM. The hierarchy from registers down to disk uses locality of reference to give the CPU the illusion of fast, infinite memory. Caches (with their mapping, replacement, and write policies), virtual memory (with paging and the TLB), and memory interfacing (buses, decoding, interleaving) are the three pillars you must master for the PU BCA examination — and for any real systems-programming work that follows it.