Virtual Memory and Cache

1. Lecture 09: Address Translation, Virtual Memory, and TLBs

1.1. Administrative announcements

  • The course forum was working again after a failed automatic update.
  • The submission deadline for milestone 1 was extended to Friday.
  • Milestone 2 was released.
  • TA office hours were scheduled in room E1 5 105.
  • The milestone 2 repository should be used separately from the milestone 1 repository.

1.2. Recap: why virtualize memory addresses?

The processor runs ordinary software that issues memory accesses.

Between the program-visible address and the real physical memory address, there is a hardware-supported translation mechanism.

This mechanism gives the operating system three important capabilities.

1.2.1. Address translation

The processor translates a virtual address into a physical address.

The program uses virtual addresses.

The memory hardware ultimately needs physical addresses.

The OS controls the translation tables, so it can decide where each virtual page is stored in physical memory.

1.2.2. Access tracking

The processor can give feedback to the OS about page usage.

Important tracking bits include:

  • accessed bit;
  • written / dirty bit.

These bits are set by the processor when memory is accessed.

They are later useful for demand paging, page replacement, and caching decisions.

1.2.3. Permission checks

On every memory access, the processor checks whether the access is allowed.

Typical checks include:

  • whether a mapping is present;
  • whether the access comes from user mode or kernel mode;
  • whether the page is writable;
  • whether the page is executable.

If a check fails, the processor raises a page fault exception.

This gives the OS a chance to inspect the situation and decide what to do.

Examples:

  • invalid pointer access: terminate the process;
  • swapped-out page: load the page back and resume;
  • write to copy-on-write page: create a private copy;
  • lazy allocation: allocate the page only when first accessed.

1.3. Paged address translation

1.3.1. Basic idea

Memory is divided into fixed-size blocks called pages.

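A virtual address is split into two bit fields (the + in the formulas below denotes concatenation of bit fields, not arithmetic addition):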

\begin{equation*} \text{virtual address} = \text{virtual page number} + \text{offset}. \end{equation*}

The virtual page number is translated through a page table.

The offset is not translated; it is copied unchanged into the physical address.

So:

\begin{equation*} \text{physical address} = \text{physical page frame number} + \text{same offset}. \end{equation*}
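The split and recombination can be sketched with shifts and masks. This is a minimal illustration assuming 4 KiB pages; the function names and the example frame number are made up for this sketch.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT
OFFSET_MASK = PAGE_SIZE - 1


def split_virtual_address(vaddr):
    """Return (virtual page number, offset) for a virtual address."""
    return vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK


def make_physical_address(frame_number, offset):
    """Concatenate a physical frame number with the untranslated offset."""
    return (frame_number << PAGE_SHIFT) | offset


vpn, offset = split_virtual_address(0x12345ABC)
# Hypothetical translation result: virtual page 0x12345 lives in frame 0x42.
paddr = make_physical_address(0x42, offset)
```

Only the page number passes through the page table; the low 12 bits are carried over unchanged.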

1.3.2. Page table entries

Each page table entry stores at least:

  • the physical page frame number;
  • protection and control bits.

Typical control bits include:

  • present;
  • user / kernel;
  • writable;
  • executable or non-executable;
  • accessed;
  • dirty.

1.3.3. Page table base register

The page table itself is a data structure stored in physical memory.

The processor therefore needs to know where the top-level page table is.

On x86, this is stored in the cr3 register.

The cr3 register contains the physical address of the top-level paging data structure.

The processor then walks the page table automatically during address translation.

1.4. Why not use one linear page table?

1.4.1. Linear page table idea

A simple design would be:

  • put one huge array in physical memory;
  • use the virtual page number as an index into that array;
  • each array element is one page table entry.

This is simple for hardware, because indexing into an array is easy.

1.4.2. Problem: the table is too large

With 4 KiB pages on a 32-bit system:

\[ 2^{32} / 2^{12} = 2^{20} \]

virtual pages are possible.

So the page table needs:

\[ 2^{20} \]

entries.

If each entry is 4 bytes:

\[ 2^{20} \cdot 4 \text{ B} = 4 \text{ MiB} \]

per page table.

This is already large, but still somewhat manageable.

For a 64-bit address space, the situation becomes impractical: with 4 KiB pages there would be \(2^{52}\) virtual pages.

A linear page table would have to cover an enormous virtual address space, even though most of it is unused.
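The size estimate can be repeated for both cases with a few lines of arithmetic. The 8-byte entry size for the 64-bit case is an assumption of this sketch (it matches the x86-64 entry size), and the function name is illustrative.

```python
def linear_page_table_bytes(virtual_address_bits, page_bits, entry_bytes):
    """Size of a flat page table with one entry per virtual page."""
    entries = 1 << (virtual_address_bits - page_bits)
    return entries * entry_bytes


MiB = 1 << 20
PiB = 1 << 50

size_32 = linear_page_table_bytes(32, 12, 4)   # the 32-bit example: 4 MiB
size_64 = linear_page_table_bytes(64, 12, 8)   # assumed 8-byte entries: 32 PiB
```

Under these assumptions a single flat table for a full 64-bit space would need 32 PiB, far more than any machine's physical memory.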

1.4.3. Sparse address spaces

Most virtual address spaces are sparse.

Large parts are unmapped.

Examples:

  • unused gaps between heap and stack;
  • high-address kernel mappings;
  • memory-mapped regions far away from ordinary code/data;
  • guard pages;
  • large empty parts of a 64-bit address space.

A linear page table wastes memory because it must have entries even for unmapped holes.

1.5. Multi-level page tables

1.5.1. Basic idea

A multi-level page table, also called a hierarchical page table or radix page table, breaks the virtual page number into several chunks.

Each chunk indexes one level of the tree.

Only the lower-level tables for actually used virtual address ranges need to exist.

So instead of one huge linear array, the OS builds a tree of page-table pages.

1.5.2. How translation works

Conceptually:

  1. The processor starts from the top-level table address stored in cr3.
  2. It uses the first chunk of the virtual page number to index the top-level table.
  3. That entry gives the physical address of the next-level table.
  4. The processor uses the next chunk to index that table.
  5. This continues until the final page table entry is reached.
  6. The final entry gives the physical page frame number.
  7. The original offset is appended unchanged.

This process is called a page table walk.
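The walk can be modelled in software. The following sketch assumes a two-level layout with 10 + 10 + 12 bits, a present flag in bit 0 of each entry, and a dictionary standing in for physical memory; real hardware performs these steps itself.

```python
PRESENT = 0x1                         # assumed flag layout: bit 0 = present
INDEX_MASK = (1 << 10) - 1            # 10 index bits per level


def walk(phys_mem, root, vaddr):
    """Translate vaddr with a two-level walk; return None if not mapped.

    phys_mem maps the physical address of a table to a list of 1024
    integer entries; each entry is (frame address | flags).
    """
    indices = [(vaddr >> 22) & INDEX_MASK, (vaddr >> 12) & INDEX_MASK]
    table_addr = root
    for index in indices:
        entry = phys_mem[table_addr][index]
        if not entry & PRESENT:
            return None               # hardware would raise a page fault here
        table_addr = entry & ~0xFFF   # strip flag bits, keep the frame address
    return table_addr | (vaddr & 0xFFF)


# Tiny address space: one directory, one page table, one mapped page.
phys_mem = {0x1000: [0] * 1024, 0x2000: [0] * 1024}
phys_mem[0x1000][1] = 0x2000 | PRESENT      # directory entry 1 -> table
phys_mem[0x2000][2] = 0x7000 | PRESENT      # table entry 2 -> frame 0x7000

mapped = walk(phys_mem, 0x1000, (1 << 22) | (2 << 12) | 0x34)
unmapped = walk(phys_mem, 0x1000, 0x0)
```

Each loop iteration depends on the entry read in the previous one, which is why the lookups cannot be issued in parallel.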

1.5.3. Why this saves memory

If a virtual address range is unused, the OS does not allocate the lower-level page table for that range.

The higher-level entry that would point to it is simply marked not present.

Thus page-table memory is proportional to the actually used parts of the virtual address space, not the full address space.
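To make the saving concrete, the sketch below counts page-table pages for a small, sparse 32-bit address space with the 10 + 10 + 12 split. The process layout (16 low pages, 4 high stack pages) is a hypothetical example.

```python
PAGE_SIZE = 4096
ENTRIES_PER_TABLE = 1024             # 10 index bits per level


def two_level_table_bytes(mapped_vpns):
    """Page-table memory needed to map the given virtual page numbers."""
    second_level_tables = {vpn // ENTRIES_PER_TABLE for vpn in mapped_vpns}
    return (1 + len(second_level_tables)) * PAGE_SIZE   # +1 for the directory


# Hypothetical process: 16 pages of code/data low, 4 pages of stack high.
vpns = list(range(0x400, 0x410)) + list(range(0xBFFFC, 0xC0000))
sparse_bytes = two_level_table_bytes(vpns)   # 3 pages = 12 KiB
linear_bytes = (1 << 20) * 4                 # flat table: 4 MiB
```

Three page-table pages are enough here, compared with the 4 MiB a flat table would occupy regardless of how little is mapped.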

1.5.4. Page tables live in physical memory

All page-table pages themselves live in physical memory.

The entries that point to lower-level tables contain physical addresses.

The processor performs these lookups in hardware.

1.5.5. Sharing sub-tables

Because multi-level page tables use indirection, different address spaces can share some lower-level page tables.

A common example is kernel memory.

Most operating systems map the kernel into every process’s address space.

Instead of creating separate kernel page tables for every process, the OS can share the same kernel sub-tables among many processes.

Benefits:

  • saves memory;
  • makes kernel mapping updates easier;
  • avoids updating thousands of process page tables when a shared mapping changes.

1.6. x86 paging examples

1.6.1. x86 32-bit paging

In 32-bit x86 paging with 4 KiB pages, the address is split as:

\begin{equation*} 10 \text{ bits directory} + 10 \text{ bits table} + 12 \text{ bits offset}. \end{equation*}

The 12-bit offset gives:

\[ 2^{12} = 4096 \]

bytes per page.

Each page table entry is 4 bytes.

A page table with \(2^{10} = 1024\) entries therefore has size:

\[ 1024 \cdot 4 = 4096 \text{ bytes}. \]

So both page directories and page tables fit exactly in one 4 KiB page.

This is convenient for the kernel, because physical memory management can allocate page-sized chunks and use them either as data pages or page-table pages.

1.6.2. x86-64 paging

In 64-bit x86 mode, page table entries are 8 bytes.

Keeping page-table pages at 4 KiB means each table contains:

\[ 4096 / 8 = 512 = 2^9 \]

entries.

So each level consumes 9 bits of the virtual address.

A common 4-level x86-64 layout is:

\[ 9 + 9 + 9 + 9 + 12 = 48 \]

bits.

The four table levels are usually called:

  • PML4 (page map level 4);
  • Page Directory Pointer Table (PDPT);
  • Page Directory;
  • Page Table.

The remaining 12 bits are the page offset.

The lecture notes that the names are historically messy and not conceptually important.

The important point is that the same radix-tree idea is repeated over more levels.
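The 9 + 9 + 9 + 9 + 12 split can be written out as a short sketch. The function name and the example address are arbitrary.

```python
def split_x86_64(vaddr):
    """Split a 48-bit virtual address into four 9-bit indices and an offset."""
    offset = vaddr & 0xFFF
    indices = [(vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return indices, offset          # index order: PML4, PDPT, PD, PT


indices, offset = split_x86_64(0x7F1234567ABC)

# Reassembling the fields gives back the original address.
rebuilt = offset
for index, shift in zip(indices, (39, 30, 21, 12)):
    rebuilt |= index << shift
```

Each index is in the range 0..511, matching the 512 entries of one 4 KiB table.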

1.6.3. Why not full 64-bit virtual addresses?

A full 64-bit virtual address space with 4 KiB pages and 9 bits per level would need more levels.

Four levels only cover 48 address bits.

This was a hardware design trade-off.

The designers judged that 48 virtual address bits were enough for existing systems, while more levels would make translation slower and hardware more complicated.

Modern x86 processors optionally support a fifth level, which extends virtual addresses to 57 bits.

1.6.4. Physical and virtual address size are separate

The virtual address size and physical address size do not have to be identical.

Virtual address size is about what addresses software can generate.

Physical address size is about how much physical memory and memory-mapped device space the hardware can address.

Even on x86-64, not all 64 bits are necessarily used for virtual or physical addressing.

1.7. x86 segmentation plus paging

x86 historically combines segmentation and paging.

The translation sequence is:

\[ \text{logical address} \rightarrow \text{segmentation} \rightarrow \text{linear address} \rightarrow \text{paging} \rightarrow \text{physical address}. \]

The x86 terminology is:

  • logical address: segment plus offset;
  • linear address: result after segmentation, still virtual;
  • physical address: result after paging.

In most modern systems, segmentation is configured to do almost nothing:

  • base = 0;
  • limit = full address space.

Then paging does the real memory virtualization.

One remaining practical use of segmentation is thread-local storage, which on x86-64 typically relies on the FS or GS segment base.

Thread-local storage needs different threads to see different memory for certain variables while otherwise sharing the same virtual address space.

This helps explain why segmentation comes before paging: the thread-specific adjustment happens first, and then the result goes through the normal paging translation.
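The segmentation step itself is a limit check followed by an addition. The sketch below is a simplified model; the per-thread base addresses are hypothetical values chosen for illustration.

```python
def to_linear(segment_base, segment_limit, offset):
    """Segmentation step: check the offset against the limit, add the base."""
    if offset > segment_limit:
        raise ValueError("offset outside segment")
    return segment_base + offset


FLAT = (0x0, 0xFFFFFFFF)                     # base = 0, limit = full 32-bit space
flat_linear = to_linear(*FLAT, 0x08048000)   # unchanged by segmentation

# Hypothetical per-thread segments: same offset, different linear addresses.
TLS_OFFSET = 0x10
thread1 = to_linear(0x70000000, 0xFFF, TLS_OFFSET)
thread2 = to_linear(0x70001000, 0xFFF, TLS_OFFSET)
```

With a flat segment the linear address equals the offset, so paging sees exactly the address the program used. With per-thread bases, the same offset lands on different linear addresses, which paging then translates as usual.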

1.8. Advantages and disadvantages of multi-level page tables

1.8.1. Advantages

Multi-level page tables are memory-efficient for sparse address spaces.

They allow fine-grained page-level protection.

They allow sharing of sub-tables, for example kernel mappings shared across processes.

They provide flexibility: any physical page frame can be mapped at any virtual address.

They simplify physical memory allocation when all page-table and data pages are the same size.

1.8.2. Disadvantages

A page table walk is expensive.

With a 4-level page table, one memory access may require:

  1. access PML4;
  2. access PDPT;
  3. access Page Directory;
  4. access Page Table;
  5. access actual data.

So one conceptual memory access can become five dependent memory accesses.

These accesses are dependent because the address of the next level is only known after reading the previous level.

This pointer chasing cannot be easily pipelined away.

Multi-level paging is also expensive in hardware.

Small embedded processors may avoid such complex MMUs because the hardware cost can be larger than the simple processor core itself.

Another disadvantage is page-table storage overhead for very large mappings.

If a program maps terabytes of memory using 4 KiB pages, many page table entries are required even if the mapping is simple and contiguous.

1.9. Super pages / huge pages

1.9.1. Motivation

Small pages are good for fine-grained protection and flexible allocation.

But large contiguous mappings can be inefficient with 4 KiB pages.

For example, a large database may want gigabytes or terabytes of memory.

Using one page table entry per 4 KiB page wastes memory and creates management overhead.

1.9.2. Basic idea

A huge page allows the processor to stop the page-table walk early.

Instead of interpreting an entry as pointing to the next-level page table, the processor interprets it as pointing directly to a large physical memory region.

This skips lower levels of the page table tree.

1.9.3. Page size bit

On x86, entries above the lowest level, such as page directory entries, can contain a page size bit.

If the page size bit is set, the entry maps a huge page instead of pointing to the next-level table.

1.9.4. Alignment and physical contiguity requirements

A huge page must be physically contiguous.

It must also be aligned to its own size.

For example, a 4 MiB huge page must begin at an address that is a multiple of 4 MiB.

This is necessary because the skipped address bits become part of the offset within the huge page.

The processor will not inspect the lower-level page table entries, because they do not exist for that mapping.
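The alignment rule follows directly from how the address is formed. This sketch assumes the 4 MiB case (22 offset bits); the function names and addresses are illustrative.

```python
HUGE_SHIFT = 22                         # 4 MiB page: 10 + 12 offset bits
HUGE_SIZE = 1 << HUGE_SHIFT


def is_aligned(addr, size):
    """True if addr is a multiple of size (size must be a power of two)."""
    return addr & (size - 1) == 0


def translate_huge(frame_base, vaddr):
    """Map vaddr through a 4 MiB page that starts at physical frame_base."""
    if not is_aligned(frame_base, HUGE_SIZE):
        raise ValueError("huge page base must be aligned to its size")
    return frame_base | (vaddr & (HUGE_SIZE - 1))   # low 22 bits are the offset


paddr = translate_huge(0x00C00000, 0x40123456)
```

Because the base is combined with the offset by plain bit concatenation, any nonzero low bits in the base would collide with offset bits, hence the alignment requirement.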

1.9.5. Typical huge page sizes

On 32-bit x86, skipping the page table level gives 4 MiB pages.

On x86-64, common huge page sizes include:

  • 2 MiB pages;
  • 1 GiB pages;
  • possibly larger sizes on newer hardware, depending on supported levels.

The size corresponds to the amount of address space that the skipped lower-level table would have covered.
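The page sizes above can be derived from the number of skipped levels. A small sketch, with an illustrative function name:

```python
def coverage_bytes(offset_bits, index_bits, skipped_levels):
    """Bytes mapped by one entry when `skipped_levels` lower levels are skipped."""
    return 1 << (offset_bits + index_bits * skipped_levels)


MiB, GiB = 1 << 20, 1 << 30
x86_32 = coverage_bytes(12, 10, 1)       # skip the page table: 4 MiB
x86_64_pd = coverage_bytes(12, 9, 1)     # page directory entry: 2 MiB
x86_64_pdpt = coverage_bytes(12, 9, 2)   # PDPT entry: 1 GiB
```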

1.9.6. Example: kernel mapping

The kernel is often mapped into the high part of every address space.

If the kernel occupies a physically contiguous region, the OS may map it with huge pages.

Instead of one lower-level page table containing many repeated entries for kernel pages, a higher-level entry can directly map the whole kernel region as a huge page.

1.9.7. Advantages

Huge pages reduce page table storage overhead.

They reduce page table walk overhead.

They improve TLB coverage because one TLB entry can cover much more memory.

They are good for large mappings such as databases or large memory regions.

1.9.8. Disadvantages

Huge pages complicate physical memory management.

The OS must find a physically contiguous, properly aligned region of the required size.

Memory fragmentation can make this difficult.

Even if the system has enough free memory in total, it may not have a free aligned 1 GiB chunk.

Linux often reserves huge pages at boot time to avoid fragmentation.

This means reserved huge pages may not be available for ordinary 4 KiB allocation.

Huge pages also reduce fine-grained protection and allocation flexibility.

1.10. Efficient address translation: TLB

1.10.1. Problem

Page table walks are too expensive to perform on every memory access.

A 4-level walk would add several dependent memory accesses before the actual memory access.

Since every instruction fetch and every data load/store needs address translation, this overhead would be unacceptable.

1.10.2. TLB idea

A Translation Lookaside Buffer, or TLB, is a cache for address translations.

It caches mappings from:

\[ \text{virtual page number} \rightarrow \text{physical page frame number + protection bits}. \]

On each memory access:

  1. The processor extracts the virtual page number.
  2. It checks the TLB.
  3. If there is a hit, translation is fast.
  4. If there is a miss, the processor performs a page table walk.
  5. The result is inserted into the TLB.

The offset is still copied unchanged.

1.10.3. TLB and protection bits

The TLB does not only store the physical page frame number.

It also stores protection and control information.

This is necessary because permission checks must still happen even when the translation is cached.
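A toy model shows the lookup order and why the cached entry carries permission bits. It is a sketch only: the dictionary `page_table` stands in for a real page table walk, the TLB has unlimited capacity, and a single writable flag represents all protection bits.

```python
PAGE_SHIFT = 12


class TLB:
    """Tiny fully associative TLB model with hit and miss counters."""

    def __init__(self, page_table):
        self.page_table = page_table    # vpn -> (frame, writable); stands in for the walk
        self.entries = {}               # cached copies of page table entries
        self.hits = self.misses = 0

    def translate(self, vaddr, write=False):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & 0xFFF
        if vpn in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            if vpn not in self.page_table:
                raise LookupError("page fault: not present")
            self.entries[vpn] = self.page_table[vpn]
        frame, writable = self.entries[vpn]
        if write and not writable:      # checked on hits as well as misses
            raise PermissionError("page fault: write to read-only page")
        return (frame << PAGE_SHIFT) | offset


tlb = TLB({0x10: (0x80, True), 0x11: (0x81, False)})
first = tlb.translate(0x10004)          # miss, then cached
second = tlb.translate(0x10008)         # hit on the same page
```

The permission check uses the cached copy, so a hit never needs to consult the page table.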

1.10.4. MMU

The Memory Management Unit, or MMU, is the hardware unit responsible for address translation.

It includes mechanisms such as:

  • TLB lookup;
  • page table walks;
  • permission checks;
  • page fault generation.

1.11. TLB consistency

1.11.1. Problem

The TLB caches page table entries.

If the OS changes the page table, the cached TLB entry may become stale.

The OS must ensure that the processor does not keep using obsolete translations.

1.11.2. Inserting a new mapping

If a mapping did not previously exist, it is usually not in the TLB.

Invalid mappings are generally not cached, because an invalid access already causes an expensive exception.

So inserting a new mapping usually needs no immediate TLB action.

The first access will miss in the TLB and then load the new page table entry.

1.11.3. Updating or removing a mapping

If the OS removes a mapping or reduces its permissions, a stale TLB entry would be dangerous.

Examples:

  • a page was writable, but is now read-only;
  • a page was present, but is now unmapped;
  • a page was user-accessible, but is now kernel-only.

The OS must invalidate the TLB entry.

On x86, an instruction such as invlpg can invalidate the mapping for a given virtual address.

1.11.4. Why software invalidation?

The processor cannot easily detect that an arbitrary memory write changed a page table entry.

From the hardware’s perspective, the OS just wrote to some physical address.

Tracking whether that address corresponds to some page table entry would be very complicated.

So the architecture gives the OS explicit invalidation instructions.

The OS is responsible for using them correctly.

1.11.5. TLB invalidation strategy

Usually the OS does not insert the new translation manually.

It simply invalidates the old entry.

On the next access, the processor misses in the TLB, walks the page table, and reloads the correct translation.

This is simpler and avoids requiring the OS to know the internal cache state.
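The effect of a missing invalidation can be shown with two dictionaries. This is a simplified model; `invalidate` plays the role of a single-entry invalidation such as invlpg.

```python
page_table = {0x10: 0x80}      # vpn -> frame; stands in for the real page table
tlb = {}                       # cached translations, filled on demand


def translate(vpn):
    """Return the frame for vpn, consulting the TLB before the page table."""
    if vpn not in tlb:
        tlb[vpn] = page_table[vpn]     # TLB miss: walk and cache the result
    return tlb[vpn]


def invalidate(vpn):
    """Model of a single-entry invalidation."""
    tlb.pop(vpn, None)


before = translate(0x10)        # caches 0x10 -> 0x80
page_table[0x10] = 0x99         # OS changes the mapping
stale = translate(0x10)         # still the old frame: the TLB was not told
invalidate(0x10)
fresh = translate(0x10)         # miss, re-walk, correct frame
```

The OS never writes the new translation into the TLB; removing the old one is enough.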

1.12. TLB behavior on context switch

1.12.1. Basic problem

Different processes can use the same virtual address for different physical pages.

If the CPU switches from process A to process B, old TLB entries from A may be wrong for B.

1.12.2. Simple solution: flush TLB on address space switch

Historically, loading a new value into the page table base register, such as cr3 on x86, flushed the TLB (all entries except those marked global, described next).

This is simple and safe.

1.12.3. Global mappings

Some mappings are identical across address spaces.

The kernel mapping is the main example.

x86 supports a global bit for such mappings.

Global TLB entries are not flushed on ordinary address space switches.

This avoids reloading kernel translations on every context switch.

1.12.4. Tagged TLB / address space identifiers

Modern architectures can tag TLB entries with an address-space identifier.

On x86, this is available in the form of process-context identifiers (PCIDs).

The TLB entry then contains not only a virtual page number but also a tag identifying the address space.

This allows entries from multiple processes to coexist in the TLB.

When switching address spaces, the CPU uses the tag to distinguish entries.

This reduces the cost of context switches.
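A tagged TLB can be modelled by adding the identifier to the lookup key. The class and method names below are illustrative and not tied to any particular architecture.

```python
class TaggedTLB:
    """TLB model whose entries are keyed by (address-space id, vpn)."""

    def __init__(self):
        self.entries = {}
        self.current_asid = 0

    def switch_address_space(self, asid):
        self.current_asid = asid        # no flush: the tag keeps entries apart

    def insert(self, vpn, frame):
        self.entries[(self.current_asid, vpn)] = frame

    def lookup(self, vpn):
        return self.entries.get((self.current_asid, vpn))   # None means miss


tlb = TaggedTLB()
tlb.switch_address_space(1)
tlb.insert(0x10, 0xA0)                  # process A: vpn 0x10 -> frame 0xA0
tlb.switch_address_space(2)
miss_in_b = tlb.lookup(0x10)            # B has no entry for this vpn yet
tlb.insert(0x10, 0xB0)
tlb.switch_address_space(1)
hit_in_a = tlb.lookup(0x10)             # A's entry survived both switches
```

Both processes use virtual page 0x10, yet neither can hit on the other's entry.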

1.13. Multiprocessor TLB shootdown

1.13.1. TLBs are per-core

A TLB must be very fast and close to the core.

Every instruction fetch needs translation.

A load/store instruction may need translation twice:

  • once for the instruction fetch;
  • once for the data access.

A shared TLB across cores would create latency and concurrency problems.

So each core has its own TLB.

1.13.2. Consistency problem

If a process runs on multiple cores, the same page table may have translations cached in multiple per-core TLBs.

If the OS removes or changes a mapping, it must invalidate the entry on every core where it might be cached.

1.13.3. Shootdown

A TLB shootdown is the process of forcing other cores to invalidate TLB entries.

The OS may need to:

  1. identify cores that may have cached the mapping;
  2. send inter-processor interrupts or similar messages;
  3. run the invalidation instruction on those cores;
  4. wait until they acknowledge completion;
  5. only then safely reuse the physical page.

This can be expensive on machines with many cores.
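The shootdown steps can be sketched as follows. This is a single-threaded model: direct method calls stand in for inter-processor interrupts, and the model peeks into each core's TLB to pick targets, whereas a real kernel cannot inspect TLB contents and instead tracks which cores have run the address space.

```python
class Core:
    def __init__(self):
        self.tlb = {}                   # per-core cache: vpn -> frame

    def handle_shootdown(self, vpn):
        """Runs on the target core when the shootdown request arrives."""
        self.tlb.pop(vpn, None)
        return True                     # acknowledgement to the initiator


def unmap_and_shootdown(page_table, cores, vpn):
    """Remove a mapping and return its frame once no core can still use it."""
    frame = page_table.pop(vpn)
    targets = [c for c in cores if vpn in c.tlb]       # cores that may cache it
    acks = [c.handle_shootdown(vpn) for c in targets]  # stands in for IPIs
    assert all(acks)                    # wait for every acknowledgement
    return frame                        # only now is the frame safe to reuse


cores = [Core() for _ in range(4)]
page_table = {0x10: 0x80}
cores[0].tlb[0x10] = cores[2].tlb[0x10] = 0x80
freed = unmap_and_shootdown(page_table, cores, 0x10)
```

The frame is returned only after every target has acknowledged, which mirrors the ordering requirement in the last step.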

1.13.4. Why this matters

Suppose the OS removes a mapping and wants to reuse the physical page for another process.

If some core still has a stale TLB entry, the old process may still access that physical page.

This would break isolation and correctness.

Therefore the OS must be careful before reusing memory.

1.13.5. Optimizations

OS kernels may batch TLB invalidations.

They may delay invalidations if the memory is not immediately needed.

This avoids repeatedly interrupting many cores.

But this makes the OS code more complex.

1.14. TLBs and super pages

1.14.1. Multiple page sizes complicate TLB design

With only one page size, the TLB lookup compares fixed-size virtual page numbers.

With multiple page sizes, a virtual address may need to be compared at different granularities.

This is harder to implement in hardware.

1.14.2. Common solution: separate TLBs

Processors often use separate TLB structures for different page sizes.

For example:

  • one TLB for 4 KiB pages;
  • another TLB for 2 MiB pages;
  • another small TLB for 1 GiB pages.

The processor can check these in parallel.

If the page tables are constructed correctly, each virtual address has exactly one valid mapping and therefore one page size, so at most one of these TLBs can hit.

1.14.3. Benefit of huge pages for TLB reach

A single TLB entry for a 1 GiB page covers much more memory than one entry for a 4 KiB page.

So huge pages can significantly reduce TLB misses for large memory regions.
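The amount of memory a TLB can cover, often called its reach, is the number of entries times the page size. The entry count of 64 below is a hypothetical value for illustration.

```python
KiB, MiB, GiB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps a distinct page."""
    return entries * page_size


ENTRIES = 64                            # hypothetical TLB size
reach_4k = tlb_reach(ENTRIES, 4 * KiB)  # 256 KiB
reach_2m = tlb_reach(ENTRIES, 2 * MiB)  # 128 MiB
reach_1g = tlb_reach(ENTRIES, 1 * GiB)  # 64 GiB
```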

1.15. Page faults

1.15.1. Definition

A page fault is an exception triggered by the CPU when a memory access cannot be completed under the current page table and permissions.

Causes include:

  • non-present page;
  • non-present intermediate page table;
  • write to a read-only page;
  • user-mode access to kernel-only page;
  • instruction fetch from a non-executable page.

1.15.2. Page fault as a mechanism

A page fault is not always fatal.

It is a general mechanism that transfers control to the OS.

The OS can decide whether to:

  • terminate the process;
  • install a missing mapping;
  • load a swapped-out page;
  • handle lazy allocation;
  • handle copy-on-write;
  • report a segmentation fault.

1.15.3. x86 page fault information

On x86, a page fault is exception vector 14.

The processor provides extra information to the OS.

The error code tells why the fault happened.

Examples:

  • non-present page;
  • write access;
  • user-mode access;
  • instruction fetch.

The cr2 register contains the virtual address that caused the fault.

This is extremely useful.

Without cr2, the OS would have to decode the faulting instruction to determine which address was accessed.
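As an illustration, the sketch below decodes the low bits of the x86 page fault error code and combines them with the faulting address. The function name and output format are made up; only bits 0, 1, 2, and 4 are handled.

```python
# Bit positions in the x86 page fault error code.
PRESENT = 1 << 0        # 0: page not present, 1: protection violation
WRITE = 1 << 1          # 0: read access, 1: write access
USER = 1 << 2           # 0: kernel-mode access, 1: user-mode access
INSTR_FETCH = 1 << 4    # 1: fault was caused by an instruction fetch


def describe_fault(error_code, fault_address):
    """Turn an error code and the cr2 value into a readable summary."""
    cause = "protection violation" if error_code & PRESENT else "page not present"
    mode = "user" if error_code & USER else "kernel"
    if error_code & INSTR_FETCH:
        kind = "instruction fetch"
    elif error_code & WRITE:
        kind = "write"
    else:
        kind = "read"
    return f"{mode}-mode {kind} at {fault_address:#x}: {cause}"


summary = describe_fault(0b00110, 0xDEADB000)
```

A handler would use this information together with its own metadata to choose among the outcomes listed in 1.15.2.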

1.15.4. Resuming after a page fault

If the OS fixes the problem, it can resume the program.

The saved instruction pointer still points to the faulting instruction.

The OS restores the process state and returns from the exception.

The instruction is retried.

From the application perspective, it looks like the memory access merely took longer.

1.16. Historic alternative: software-loaded TLB

1.16.1. Basic idea

Some older architectures did not have hardware page table walks.

The hardware only checked the TLB.

If the TLB missed, the processor raised a TLB miss exception.

The OS then:

  1. handled the exception;
  2. looked up the mapping using any data structure it wanted;
  3. inserted the translation into the TLB;
  4. resumed execution.
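The miss handling above can be sketched with a small model. The capacity of two entries, the oldest-first replacement, and the class name are assumptions made for illustration; `os_lookup` represents whatever data structure the OS chooses.

```python
from collections import OrderedDict

TLB_CAPACITY = 2                         # hypothetical, tiny on purpose


class SoftwareTLB:
    def __init__(self, os_lookup):
        self.entries = OrderedDict()     # vpn -> frame, oldest first
        self.os_lookup = os_lookup       # any OS-chosen structure behind a function
        self.miss_exceptions = 0

    def translate(self, vpn):
        if vpn not in self.entries:
            self.miss_exceptions += 1    # hardware would trap to the OS here
            frame = self.os_lookup(vpn)  # OS handler consults its own structure
            if len(self.entries) >= TLB_CAPACITY:
                self.entries.popitem(last=False)   # replace the oldest entry
            self.entries[vpn] = frame    # OS writes the translation into the TLB
        return self.entries[vpn]


mappings = {1: 10, 2: 20, 3: 30}         # here: a plain hash table
tlb = SoftwareTLB(mappings.__getitem__)
frames = [tlb.translate(v) for v in (1, 2, 1, 3, 1)]
```

The hardware side only needs the `entries` lookup; everything behind `os_lookup` is ordinary software.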

1.16.2. Advantage

The OS is not forced to use a hardware-defined page table structure.

It can use any data structure:

  • radix tree;
  • hash table;
  • B-tree;
  • segmentation-like structure;
  • database-specific structure.

This gives great flexibility.

It also simplifies hardware because the processor only needs a TLB, not a hardware page table walker.

1.16.3. Disadvantages

TLB misses become much more expensive.

A hardware page table walk is already expensive, but a software exception handler is even more expensive.

The CPU must trap into the kernel, run instructions, inspect data structures, update the TLB, and return.

There is also a chicken-and-egg problem:

  • to handle a TLB miss, the TLB miss handler code must itself be accessible without causing another TLB miss;
  • the data structures needed by the handler must also be accessible.

This requires careful management.

Modern general-purpose processors usually use hardware-managed page table walks because software-loaded TLBs are too slow for current performance expectations.

2. Lecture 10: Demand Paging and Caching

2.1. Recap: TLB with software invalidation

The previous lecture introduced the TLB as a cache for virtual-to-physical page translations.

For each cached virtual page number, the TLB stores:

  • physical page frame number;
  • protection and control bits.

When the OS changes a page table entry, it must invalidate stale TLB entries.

This is especially important when permissions are reduced or mappings are removed.

In a multicore system, TLB invalidation may require TLB shootdowns across cores.

2.2. Demand paging / swapping

2.2.1. Goal

Demand paging gives applications the illusion of having more main memory than is physically available.

The OS uses physical RAM as a cache for a larger and slower backing store.

Historically, the backing store was a hard drive.

Today it may be:

  • HDD;
  • SSD;
  • remote memory;
  • CXL-attached memory;
  • other disaggregated or tiered memory systems.

2.2.2. High-level idea

The OS keeps only some pages in physical memory.

Pages that are not currently needed can be written to slower storage.

Their page table mappings are removed or marked invalid.

When the application accesses such a page again, the CPU triggers a page fault.

The OS then loads the page back and resumes execution.

2.2.3. Transparency

The application does not explicitly know this is happening.

It just executes ordinary memory loads and stores.

If the OS handles the fault successfully, the program continues.

The only visible difference is performance: the access may take much longer.

2.3. Memory hierarchy

2.3.1. Hardware principle

The larger a memory is, the slower it tends to be.

Reasons include:

  • longer wires;
  • more complex address decoding;
  • physical distance from the CPU;
  • storage technology trade-offs.

2.3.2. Typical hierarchy

A modern system has multiple layers:

  • L1 cache: very small, very fast, per-core;
  • L2 cache: larger, still per-core;
  • L3 cache: larger, shared among cores;
  • DRAM: much larger, slower than cache;
  • SSD: much larger, much slower than DRAM;
  • HDD: larger, much slower than SSD;
  • cloud or remote storage: effectively huge, but much slower.

Approximate orders of magnitude from the lecture:

  • L1: around 1 ns;
  • L2: around 3 ns;
  • L3: around 10 ns;
  • DRAM: around 60–80 ns;
  • SSD: tens of microseconds;
  • HDD: milliseconds;
  • cloud storage: hundreds of milliseconds.

The exact numbers vary, but the important point is the orders of magnitude.

2.3.3. Caching is essential

Because slow storage is so much larger than fast storage, systems rely on caching at every level.

Examples:

  • CPU caches hide DRAM latency;
  • TLB hides page table walk latency;
  • demand paging treats RAM as a cache for disk or slower memory;
  • file systems cache disk contents in RAM.

2.4. Invalid page access and page fault path

When an application accesses a page that is not currently mapped:

  1. the CPU checks the TLB;
  2. the TLB misses;
  3. the CPU performs a page table walk;
  4. the page table entry is invalid or not present;
  5. the CPU triggers a page fault;
  6. control traps into the kernel.

The kernel then decides whether the access is invalid or whether it can be handled.

For demand paging, the kernel may discover:

  • this page belongs to the process;
  • it was evicted earlier;
  • its contents are stored somewhere on disk or other backing store.

Then the kernel can bring it back.

2.5. Evicting a page

2.5.1. When eviction is needed

The OS needs a physical page frame.

If no free physical page is available, it must evict some existing page.

Eviction means taking a physical page away from its current owner so it can be reused.

2.5.2. Eviction steps

A typical eviction sequence:

  1. The OS chooses a victim physical page.
  2. It writes the page contents to backing store, unless this can be avoided.
  3. It removes or invalidates all page table mappings that point to that physical page.
  4. It invalidates relevant TLB entries.
  5. It records where the contents were stored.
  6. The physical page frame becomes available for reuse.
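Steps 2 to 6 can be sketched as one function. Choosing the victim (step 1) is left to the caller, and the contents are always written back, without the clean-page shortcut. All data structures and names are simplified stand-ins: `rmap` lists the mappings of each frame, and each TLB is a dictionary.

```python
def evict(frame, frame_contents, page_tables, tlbs, rmap, swap, swap_index):
    """Evict `frame` following the steps above; return the freed frame.

    rmap[frame] lists the (address_space, vpn) pairs that map the frame.
    """
    slot = len(swap)
    swap.append(frame_contents[frame])           # 2. write contents to backing store
    for space, vpn in rmap.pop(frame):
        del page_tables[space][vpn]              # 3. remove every mapping
        for tlb in tlbs:
            tlb.pop((space, vpn), None)          # 4. invalidate cached translations
        swap_index[(space, vpn)] = slot          # 5. remember where the data went
    del frame_contents[frame]
    return frame                                 # 6. frame is free for reuse


page_tables = {"A": {0x10: 7}, "B": {0x20: 7}}   # frame 7 is shared by A and B
rmap = {7: [("A", 0x10), ("B", 0x20)]}
tlbs = [{("A", 0x10): 7}, {}]                    # two cores, one cached entry
frame_contents = {7: b"data"}
swap, swap_index = [], {}
free = evict(7, frame_contents, page_tables, tlbs, rmap, swap, swap_index)
```

Both mappings of the shared frame are removed before the frame is handed out again.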

2.5.3. Choosing the victim

Functionally, almost any page can be evicted if the OS can later restore it.

But some pages should not be evicted.

Examples:

  • page fault handler code;
  • kernel data needed to handle faults;
  • disk driver code or data needed to read pages back;
  • critical kernel memory.

Otherwise the OS may create a chicken-and-egg problem.

For performance, the victim should ideally be a page that will not be used again soon.

This leads to page replacement algorithms.

2.5.4. Need to unmap everywhere

A physical page may be mapped in multiple virtual addresses or multiple processes.

Before reusing that physical page, the OS must remove all mappings to it.

If one stale mapping remains, a process may still access memory that has been reassigned to someone else.

This would break correctness and security.

2.5.5. Need to invalidate TLBs

Removing mappings from page tables is not enough.

The old translation may still be cached in one or more TLBs.

So the OS must invalidate relevant TLB entries before safely reusing the physical page.

2.6. Handling a demand page fault

2.6.1. Fault handling sequence

When an application accesses an evicted page:

  1. the application issues a memory access;
  2. the CPU raises a page fault;
  3. the OS inspects the faulting address, for example via cr2 on x86;
  4. the OS checks its metadata and finds that this page was evicted;
  5. the OS finds a free physical page frame, possibly by evicting another page;
  6. the OS reads the page contents from backing store;
  7. the OS creates a valid page table entry;
  8. the OS resumes the application.
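The sequence can be modelled end to end in a few lines. This is a sketch under strong simplifications: one address space, no TLB, no dirty tracking, synchronous I/O, and a naive victim choice (the oldest mapping). The class and attribute names are illustrative.

```python
class Pager:
    def __init__(self, num_frames):
        self.free_frames = list(range(num_frames))
        self.page_table = {}     # vpn -> frame (resident pages only)
        self.memory = {}         # frame -> contents
        self.backing = {}        # vpn -> contents of evicted pages
        self.faults = 0

    def access(self, vpn):
        """Return the contents of page vpn, faulting it in if necessary."""
        if vpn not in self.page_table:
            self.handle_fault(vpn)
        return self.memory[self.page_table[vpn]]

    def handle_fault(self, vpn):
        self.faults += 1
        if vpn not in self.backing:
            raise MemoryError("invalid access")      # no metadata for this page
        if not self.free_frames:
            victim = next(iter(self.page_table))     # naive choice: oldest mapping
            frame = self.page_table.pop(victim)
            self.backing[victim] = self.memory.pop(frame)
            self.free_frames.append(frame)
        frame = self.free_frames.pop()
        self.memory[frame] = self.backing.pop(vpn)   # read from backing store
        self.page_table[vpn] = frame                 # install a valid mapping


pager = Pager(num_frames=1)
pager.backing = {1: "page one", 2: "page two"}
a = pager.access(1)
b = pager.access(2)      # evicts page 1
c = pager.access(1)      # faults page 1 back in
```

From the caller's point of view `access` always returns the right contents; only the fault counter reveals the extra work.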

2.6.2. Blocking and scheduling

Reading a page from disk or SSD is slow compared to CPU execution.

The faulting thread usually cannot continue until the page is loaded.

The OS normally blocks that thread and runs something else.

Later, when I/O completes, the device interrupts the CPU.

The OS then marks the page as available and resumes the blocked thread.

2.6.3. Thrashing

Demand paging works well only when there is locality.

If memory is overcommitted too much, the system may enter a vicious loop:

  • access page A;
  • page fault;
  • evict page B;
  • load page A;
  • soon access page B;
  • page fault;
  • evict page A;
  • repeat.

This is called thrashing.

During thrashing, the system spends most of its time moving pages between RAM and backing store instead of doing useful work.

A common user-level symptom is the whole machine becoming extremely slow when too many memory-hungry programs or browser tabs are active.
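The loop above can be reproduced by counting faults for an alternating access pattern. The sketch uses least-recently-used replacement as an assumed policy; with a single frame any policy behaves the same way.

```python
from collections import OrderedDict


def count_faults(accesses, num_frames):
    """Count page faults for an access sequence under LRU replacement."""
    resident = OrderedDict()             # vpn -> None, least recently used first
    faults = 0
    for vpn in accesses:
        if vpn in resident:
            resident.move_to_end(vpn)    # mark as most recently used
            continue
        faults += 1
        if len(resident) >= num_frames:
            resident.popitem(last=False) # evict the least recently used page
        resident[vpn] = None
    return faults


accesses = ["A", "B"] * 50               # alternate between two pages
enough = count_faults(accesses, num_frames=2)    # 2 faults, then all hits
too_few = count_faults(accesses, num_frames=1)   # every access faults
```

Removing a single frame turns 2 faults into 100, which is the cliff-like behaviour that characterises thrashing.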

2.7. Metadata for demand paging

2.7.1. Supplemental page table

When the OS evicts a page, it must remember where the page contents went.

This information may be stored in a supplemental page table or similar per-process metadata.

For each relevant virtual page, the OS may need to know:

  • whether it is resident in memory;
  • whether it was evicted;
  • where its backing store copy is;
  • what permissions it should have;
  • whether it is part of a file mapping;
  • whether it is copy-on-write.

On some architectures, the hardware ignores the remaining bits of a non-present page table entry.

In principle, the OS may store metadata there.

In practice, real systems often use separate data structures because sharing and complex mappings make page table entries alone insufficient.

2.7.2. Page map

Page tables map from virtual pages to physical pages.

But eviction often needs the reverse direction.

Given a physical page, the OS must find where it is mapped.

For this, many OSs maintain a page map or similar structure.

The page map tracks physical memory.

For each physical page, it may record:

  • whether the page is free;
  • who owns it;
  • which virtual addresses map it;
  • reference counts;
  • dirty/accessed status summaries;
  • eviction-related metadata.

This is useful because scanning all page tables to find mappings of a physical page would be too expensive.
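A minimal page map can be sketched as one record per physical frame. The field names and the choice of a set of (address space, virtual page) pairs are illustrative; real kernels use more compact representations.

```python
from dataclasses import dataclass, field


@dataclass
class FrameInfo:
    """Per-physical-frame record; the field names are illustrative."""
    free: bool = True
    mappings: set = field(default_factory=set)    # (address_space, vpn) pairs

    @property
    def refcount(self):
        return len(self.mappings)


page_map = [FrameInfo() for _ in range(8)]        # one record per frame


def map_page(frame, space, vpn):
    page_map[frame].free = False
    page_map[frame].mappings.add((space, vpn))


def unmap_page(frame, space, vpn):
    info = page_map[frame]
    info.mappings.discard((space, vpn))
    info.free = not info.mappings                 # free once nobody maps it


map_page(3, "A", 0x10)
map_page(3, "B", 0x20)                            # shared frame
who_maps_3 = sorted(page_map[3].mappings)
```

Finding every mapping of frame 3 is a single lookup, with no page table scan.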

2.8. Avoiding unnecessary disk writes

2.8.1. Clean and dirty pages

A dirty page is a page that has been modified in memory.

A clean page is a page whose in-memory contents still match the backing store.

If a page is clean, the OS does not need to write it to disk before evicting it.

It can simply discard the in-memory copy and reload from the existing backing store later.

2.8.2. Dirty bit

The processor sets the dirty bit when a page is written.

The OS can clear the dirty bit when a page is loaded into memory.

Later, when evicting the page:

  • if dirty bit = 0, the page was not modified;
  • if dirty bit = 1, the page was modified and must be written back.

This reduces:

  • eviction latency;
  • disk bandwidth;
  • SSD wear;
  • I/O pressure.
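
The eviction decision above can be sketched as follows. The Page class and the dictionary standing in for the backing store are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    vpn: int
    data: bytes
    dirty: bool = False   # stands in for the hardware dirty bit


def evict(page, backing_store):
    """Evict `page`; return True if a writeback was needed."""
    if page.dirty:
        backing_store[page.vpn] = page.data   # modified: must be written back
        page.dirty = False
        return True
    return False                              # clean: just drop the RAM copy
```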

2.8.3. If hardware lacks a dirty bit

The OS can emulate dirty tracking with page permissions.

Technique:

  1. map the page read-only, even if it should eventually be writable;
  2. if the application only reads it, no fault occurs;
  3. on the first write, the CPU raises a page fault;
  4. the OS records that the page is dirty;
  5. the OS changes the mapping to writable;
  6. the program resumes.

This same trick is also useful for copy-on-write.
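
The write-protection technique can be simulated in a few lines. This is only a model: Mapping is a hypothetical software view of one page mapping, and the "fault" is an ordinary function call rather than a CPU exception.

```python
class Mapping:
    """Hypothetical per-page state kept by the OS (not a real PTE layout)."""

    def __init__(self, logically_writable):
        self.logically_writable = logically_writable  # what the process may do
        self.hw_writable = False   # step 1: mapped read-only at first
        self.soft_dirty = False    # dirty state tracked in software
        self.faults = 0


def handle_write_fault(mapping):
    mapping.faults += 1
    if not mapping.logically_writable:
        raise PermissionError("genuine protection violation")
    mapping.soft_dirty = True      # step 4: remember that the page is dirty
    mapping.hw_writable = True     # step 5: later writes no longer fault


def write(mapping):
    """Simulate a store instruction to the page."""
    if not mapping.hw_writable:
        handle_write_fault(mapping)    # step 3: first write traps into the OS
    # step 6: the store itself would complete here
```

Only the first write to each page pays the cost of a fault. To track dirtiness again after a writeback, the OS would map the page read-only once more.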

2.9. Synchronous vs asynchronous eviction

2.9.1. Synchronous eviction

In synchronous eviction, the OS evicts a page only at the moment a free page is needed and none is available.

Advantage:

  • avoids evicting pages unnecessarily early.

Disadvantage:

  • allocation or page fault handling becomes slower, because eviction is on the critical path.

If the OS must write a dirty page to disk before proceeding, the faulting thread waits longer.

2.9.2. Asynchronous eviction

In asynchronous eviction, the OS maintains a pool of free pages.

A background kernel thread periodically checks free memory.

If free memory falls below a threshold, it proactively evicts pages.

Advantages:

  • allocation can be fast because free pages are already available;
  • expensive writeback is moved off the critical path.

Disadvantage:

  • the OS may evict a page too early, even though it is needed again soon.

2.9.3. Low-water and high-water marks

The OS may use thresholds.

For example:

  • if free pages fall below a low-water mark, start reclaiming memory;
  • continue until free pages reach a high-water mark.

In Linux, the kswapd kernel thread performs this kind of background reclaim: it is woken when free memory drops below a low watermark and reclaims until a high watermark is reached.
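
One pass of a background reclaimer using the two thresholds might look like this sketch. The function name, its parameters, and the evict_one callback are hypothetical; this is not how any specific kernel implements reclaim.

```python
def reclaim_if_needed(free_pages, low_mark, high_mark, evict_one):
    """Run one pass of the background reclaimer; return the new free count.

    `evict_one` is a callback that frees one page and returns True,
    or returns False if nothing could be evicted.
    """
    if free_pages >= low_mark:
        return free_pages            # enough free memory: do nothing
    while free_pages < high_mark:    # reclaim up to the high-water mark
        if not evict_one():
            break
        free_pages += 1
    return free_pages
```

Reclaiming up to the high-water mark rather than just back to the low-water mark keeps the reclaimer from being triggered again by the very next allocation.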

2.10. Page replacement policy

2.10.1. The problem

When the OS needs to evict a page, which page should it choose?

Correctness is usually easy: almost any resident page can be evicted, because its contents can be saved and reloaded later.

Performance is hard: the OS wants to evict a page that will not be used soon.

This is a cache replacement problem.

2.10.2. Optimal policy: MIN

The theoretical optimal policy evicts the page whose next use is furthest in the future.

This is often called MIN or Belady’s optimal algorithm.

It is optimal because it keeps pages that will be needed sooner.

But it requires knowledge of the future, so it cannot be implemented directly in a normal OS.
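
MIN can still be run offline on a recorded trace, which makes it a useful baseline for judging real policies. The sketch below assumes the whole reference trace is available as a list. The function names are chosen for this example.

```python
def next_use(trace, page, start):
    """Index of the next reference to `page` at or after `start`, or infinity."""
    for j in range(start, len(trace)):
        if trace[j] == page:
            return j
    return float("inf")


def min_misses(trace, capacity):
    """Count misses of the MIN policy on a fully known trace."""
    resident = set()
    misses = 0
    for i, page in enumerate(trace):
        if page in resident:
            continue
        misses += 1
        if len(resident) >= capacity:
            # Evict the resident page whose next use lies furthest ahead.
            victim = max(resident, key=lambda p: next_use(trace, p, i + 1))
            resident.remove(victim)
        resident.add(page)
    return misses
```

For the trace A, B, C, D, E, A, B, C, D, E with four entries, min_misses reports 6 misses: five compulsory misses plus one for D on the second pass.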

2.10.3. Need heuristics

Real systems use heuristics based on past access patterns.

The assumption is locality:

  • pages used recently are likely to be used again;
  • pages not used for a long time are less likely to be needed soon.

2.11. Access bit and approximating recency

2.11.1. Access bit

The processor sets the accessed bit when a page is read or written.

This gives the OS a small signal about recent usage.

But one bit is limited.

It only says whether the page has been accessed since the bit was last cleared.

2.11.2. Periodic resetting

The OS can periodically clear accessed bits.

Then, if the processor sets the bit again, the OS learns that the page was accessed during that interval.

By repeating this over time, the OS can approximate recency.

2.11.3. Clock algorithm

The clock algorithm is a common approximation.

Conceptually, pages are arranged in a circular list.

A clock hand scans pages.

For each page:

  • if accessed bit = 1:
    • clear accessed bit;
    • give the page a second chance;
  • if accessed bit = 0:
    • consider the page a candidate for eviction.

This avoids evicting pages that were accessed recently.

The OS need not scan all pages at once.

It can scan incrementally in the background.
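
A minimal simulation of the clock hand, assuming the accessed bits are kept in a plain list. In a real system the processor sets them in the page table entries. The class and method names are hypothetical.

```python
class Clock:
    def __init__(self, nframes):
        self.accessed = [False] * nframes   # simulated accessed bits
        self.hand = 0                       # next frame the hand examines

    def touch(self, frame):
        """Simulate the processor setting the accessed bit."""
        self.accessed[frame] = True

    def pick_victim(self):
        """Advance the hand until a frame with accessed bit = 0 is found."""
        while True:
            frame = self.hand
            self.hand = (self.hand + 1) % len(self.accessed)
            if self.accessed[frame]:
                self.accessed[frame] = False   # second chance
            else:
                return frame
```

The loop always terminates: after at most one full sweep every bit has been cleared, so the hand finds a victim on the second pass at the latest.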

2.11.4. N-th chance algorithm

The N-th chance algorithm adds a small counter.

Idea:

  • if a page is accessed, reset its counter to 0;
  • if it is not accessed during a scan, increment the counter;
  • if the counter exceeds \(N\), reclaim the page.

This gives a higher-resolution approximation than a single accessed bit.

Clock and N-th chance can be used synchronously or asynchronously.
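
One sweep of the N-th chance algorithm could be sketched as below. Two assumptions are made here: each page is a dictionary with hypothetical fields accessed and count, and the scan clears the accessed bit when it finds it set, as in the clock algorithm. The reclaim condition follows the wording above (counter exceeds N).

```python
def nth_chance_scan(pages, n):
    """Perform one sweep; return the indices of pages to reclaim."""
    reclaim = []
    for i, page in enumerate(pages):
        if page["accessed"]:
            page["accessed"] = False   # start a new observation interval
            page["count"] = 0
        else:
            page["count"] += 1         # one more sweep without any access
            if page["count"] > n:
                reclaim.append(i)
    return reclaim
```

A larger N makes the policy closer to LRU but means a page must survive more sweeps before it is reclaimed.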

2.12. General caching concepts

2.12.1. Cache abstraction

A cache is a small, fast storage layer that contains a subset of data from a larger, slower storage layer.

It maps some kind of address or key to cached contents.

Examples:

  • CPU data cache;
  • instruction cache;
  • TLB;
  • OS page cache;
  • RAM acting as a cache for the backing store under demand paging;
  • database buffer pool.

2.12.2. Cache lines and pages

CPU caches are organized in cache lines.

Typical cache line sizes include:

  • 32 bytes;
  • 64 bytes;
  • 128 bytes.

Paging uses much larger blocks:

  • 4 KiB pages;
  • 2 MiB huge pages;
  • 1 GiB huge pages.

The same basic caching questions apply at different block sizes.

2.13. CPU cache organization

2.13.1. Direct-mapped cache

A direct-mapped cache maps each address to exactly one cache location, typically by using some bits of the address as an index.

Advantages:

  • cheap hardware;
  • fast lookup;
  • simple implementation.

Disadvantages:

  • many conflicts;
  • if two frequently used addresses map to the same slot, they constantly evict each other.

2.13.2. Fully associative cache

A fully associative cache allows any address to be stored in any cache location.

On lookup, the hardware compares the requested address against all cache entries.

Advantages:

  • very flexible;
  • fewer conflicts.

Disadvantages:

  • expensive hardware;
  • many comparisons;
  • high energy cost;
  • difficult to scale to large caches.

2.13.3. Set-associative cache

Set-associative caching is a compromise.

The address is hashed to a set.

Within that set, the cache is associative.

If the cache is \(n\)-way set associative, each set has \(n\) possible slots.

Typical associativity may be 4-way to 16-way.

Advantages:

  • fewer conflicts than direct-mapped;
  • cheaper than fully associative;
  • practical for hardware.

The same trade-off appears in software cache designs too.
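
To make the mapping concrete, the sketch below splits an address into tag, set index, and line offset. It assumes simple modulo indexing, and the default sizes (64-byte lines, 64 sets) are illustrative values only. Real processors may use different index functions.

```python
def split_address(addr, line_size=64, num_sets=64):
    """Return (tag, set_index, offset) for a byte address."""
    offset = addr % line_size          # byte within the cache line
    line_number = addr // line_size
    set_index = line_number % num_sets # which set the line must go into
    tag = line_number // num_sets      # compared against the ways of that set
    return tag, set_index, offset
```

With these parameters, addresses 0 and 4096 land in the same set with different tags. In a direct-mapped cache (one way per set) they evict each other, while a 4-way set can hold both. A fully associative cache corresponds to num_sets = 1.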

2.14. Why caches work: locality

2.14.1. Temporal locality

Temporal locality means:

\[ \text{Data accessed recently is likely to be accessed again soon.} \]

Examples:

  • loop variables;
  • stack frames;
  • frequently used objects;
  • hot code paths.

2.14.2. Spatial locality

Spatial locality means:

\[ \text{Data near recently accessed data is likely to be accessed soon.} \]

Examples:

  • iterating through an array;
  • sequential instruction execution;
  • scanning memory or files.

2.14.3. Working set

An application’s working set is the set of memory locations it actively uses over some time interval.

The working set size is the amount of memory needed to keep that active set resident.

Caching works well when:

\[ \text{working set size} \le \text{cache size}. \]

If the working set is larger than available cache or physical memory, the system may thrash.

2.14.4. Working set is about active use, not total allocation

A program may allocate terabytes but stream through them one page at a time.

Its working set may be small.

Another program may repeatedly access the same thousand pages.

Its working set is those thousand pages.

The OS wants to allocate physical memory according to useful working set behavior, not merely total virtual allocation.
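
The difference between total allocation and working set can be measured on a reference trace. This sketch assumes the trace is a list of page numbers and defines the working set as the distinct pages in the last `window` references. Both the function name and the choice of window are assumptions.

```python
def working_set_size(trace, window):
    """Number of distinct pages among the last `window` references."""
    if window <= 0:
        raise ValueError("window must be positive")
    return len(set(trace[-window:]))


# A streaming program: 1000 pages in total, each touched 100 times in a row.
streaming = [i // 100 for i in range(100_000)]

# A program that keeps cycling over the same 10 pages.
hot = [i % 10 for i in range(100_000)]
```

Over a window of 150 references, the streaming trace has a working set of 2 pages although it touches 1000 pages in total, while the hot trace needs all 10 of its pages.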

2.15. FIFO page/cache replacement

2.15.1. FIFO idea

FIFO evicts the entry that has been in the cache the longest.

It is simple to implement.

The OS or cache can keep a queue of entries and evict from the front.

2.15.2. Problem with sequential access

FIFO performs badly for sequential scans slightly larger than the cache.

Example reference pattern:

\[ A, B, C, D, E, A, B, C, D, E, \ldots \]

with a cache that holds only four entries.

Every time the next page is needed, FIFO has just evicted it.

In this pattern every access misses, even though four of the five pages fit in the cache.
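
The behavior on this pattern can be checked with a small simulator. The function name fifo_misses is chosen for this example.

```python
from collections import deque


def fifo_misses(trace, capacity):
    """Count misses of a FIFO cache with `capacity` entries."""
    queue = deque()      # insertion order, oldest entry at the left
    resident = set()
    misses = 0
    for page in trace:
        if page in resident:
            continue     # a hit does not change the FIFO order
        misses += 1
        if len(queue) >= capacity:
            resident.remove(queue.popleft())   # evict the oldest entry
        queue.append(page)
        resident.add(page)
    return misses
```

For three passes over A, B, C, D, E, a four-entry cache misses on all 15 accesses, while a five-entry cache misses only on the first 5.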

2.15.3. Locality does not guarantee FIFO performs well

Even if a workload has locality, FIFO can evict pages that are old but still frequently used.

FIFO cares only when the page was inserted.

It ignores whether the page has been used recently.

2.15.4. Belady’s anomaly

With FIFO, a larger cache can sometimes have more misses than a smaller cache.

This is called Belady’s anomaly.

This counterintuitive behavior shows that FIFO is not a robust replacement policy.
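
A well-known reference string exhibits the anomaly. The sketch below is a compact FIFO fault counter. The names fifo_faults and TRACE are chosen for this example.

```python
def fifo_faults(trace, frames):
    """Count page faults under FIFO with `frames` page frames."""
    resident = []                  # oldest page first
    faults = 0
    for page in trace:
        if page not in resident:
            faults += 1
            if len(resident) == frames:
                resident.pop(0)    # evict the page loaded longest ago
            resident.append(page)
    return faults


TRACE = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
```

On this trace, FIFO causes 9 faults with 3 frames but 10 faults with 4 frames.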

2.16. LRU page/cache replacement

2.16.1. LRU idea

Least Recently Used, or LRU, evicts the entry that has gone unused for the longest time.

The intuition is:

\[ \text{If something has not been used for a long time, it is less likely to be needed soon.} \]

LRU uses past access history as a prediction of future access.

2.16.2. Advantages

LRU usually performs much better than FIFO.

It captures temporal locality.

It avoids evicting pages that are old but still actively used.

For many workloads, LRU is a good approximation to the optimal future-looking policy.

2.16.3. Sequential scan problem

LRU is still not perfect.

A repeated sequential scan over a region larger than the cache is hard to handle without knowing the future.

LRU may fill the cache with data that will not be reused soon.

This is why databases and specialized systems sometimes use custom caching policies: they may know their access pattern better than the OS.

2.16.4. Implementation cost

Perfect LRU is expensive.

It would require tracking the exact last access time or order for every cached page.

Eviction would require finding the least recently used page.

In a parallel system, maintaining one exact global LRU order would cause contention and synchronization overhead.
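
An exact LRU cache is easy to write in software, and the sketch shows where the cost comes from: every hit modifies the shared ordering. The class below is a minimal illustration built on Python's OrderedDict, not a model of what the hardware or a kernel does.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry first
        self.misses = 0

    def access(self, key):
        """Reference `key`; return True on a hit, False on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)      # every hit updates the order
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        self.entries[key] = None
        return False
```

In a multi-core system the move_to_end step would need synchronization on every single access, which is one reason exact LRU is rarely used for page replacement.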

2.16.5. Approximate LRU

Real systems usually approximate LRU.

They do not need to evict the exact least recently used page.

It is usually enough to evict a page that has not been used recently.

Clock-style algorithms are common approximate LRU mechanisms.

2.17. Demand paging as an OS mechanism

Address translation gives the OS a powerful mechanism.

The OS can manipulate page tables and use page faults to implement policies behind the application’s back.

Applications see a simple memory abstraction.

The OS can decide:

  • when to allocate physical memory;
  • when to evict pages;
  • where to place pages;
  • how to share pages;
  • when to copy pages;
  • how to map files into memory.

This transparency is one of the main benefits of virtual memory.

2.18. Lazy allocation

2.18.1. Basic idea

When an application asks for memory, the OS does not necessarily allocate physical pages immediately.

Instead, it records that the virtual address range has been promised to the process.

The page table entries remain invalid or point to a special shared page.

When the application first accesses a page, a page fault occurs.

Then the OS allocates a real physical page.
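
The sequence above can be modeled with a toy address space. All names here (AddressSpace, SegmentationFault, touch) are hypothetical, and a bytearray stands in for a physical page.

```python
PAGE_SIZE = 4096


class SegmentationFault(Exception):
    pass


class AddressSpace:
    def __init__(self):
        self.reserved = set()   # virtual pages promised to the process
        self.page_table = {}    # vpn -> bytearray standing in for a frame
        self.faults = 0

    def allocate(self, first_vpn, count):
        """Reserve a range; no physical memory is allocated yet."""
        self.reserved.update(range(first_vpn, first_vpn + count))

    def touch(self, vpn):
        """Simulate the first or a later access to virtual page `vpn`."""
        if vpn not in self.page_table:           # page fault
            self.faults += 1
            if vpn not in self.reserved:
                raise SegmentationFault(vpn)
            self.page_table[vpn] = bytearray(PAGE_SIZE)   # zeroed page
        return self.page_table[vpn]
```

Reserving ten thousand pages costs no physical memory in this model. Frames are created only for the pages that are actually touched.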

2.18.2. Why lazy allocation helps

Many programs request more memory than they immediately use.

Lazy allocation avoids wasting physical memory on pages that may never be touched.

It also makes allocation calls return faster.

If the OS eagerly allocated all physical pages for a large request, it might have to:

  • find many free pages;
  • evict other pages;
  • zero memory;
  • create many page table entries.

Lazy allocation defers this work until it is actually needed.

2.18.3. Zero page optimization

Unix-like kernels typically hand out newly allocated memory zero-filled, so that a process cannot read stale data left behind by another process.

For pages that are only read, the OS can map a shared read-only page filled with zeros.

Many processes can share this same zero page.

If a process writes to it, the CPU raises a page fault.

The OS then allocates a real private zeroed page and makes it writable.

This optimization is usually less important than lazy allocation itself, because programs often write newly allocated memory soon.

2.18.4. Downside of lazy allocation

Lazy allocation can be inefficient if the application immediately touches every page.

Then the program suffers one page fault per page.

For applications that know they need all memory immediately, eager allocation may be better.

Databases are a common example where the application may know its memory access pattern better than the OS.

Modern OS interfaces often provide ways to request more eager allocation or special mapping behavior; on Linux, for example, the MAP_POPULATE flag of mmap() pre-faults the pages of a mapping.

2.19. Copy-on-write

2.19.1. Motivation

Copying large memory regions eagerly can be expensive.

A key example is the Unix fork() system call.

Fork creates a child process that initially has the same address space contents as the parent.

A naive implementation would copy all memory pages.

This is wasteful, especially because many programs call exec() soon after fork(), replacing the address space entirely.

2.19.2. Basic copy-on-write idea

Instead of copying pages immediately:

  1. parent and child initially share the same physical pages;
  2. the pages are mapped read-only in both processes;
  3. if either process only reads, no copy is needed;
  4. when one process writes to a shared page, a page fault occurs;
  5. the OS allocates a new physical page;
  6. the OS copies the old contents into the new page;
  7. the writing process gets a private writable mapping;
  8. the other process keeps using the original page.

Thus copying is delayed until it is actually necessary.
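
The eight steps can be simulated as follows. Frame and Process are hypothetical classes, and the write "fault" is an ordinary branch. The sketch also adds one detail that is not in the list above: if the faulting process is the last one sharing the frame, it is made writable again without copying.

```python
class Frame:
    def __init__(self, data):
        self.data = bytearray(data)
        self.refcount = 1          # number of mappings sharing this frame


class Process:
    def __init__(self):
        self.pages = {}            # vpn -> [frame, writable]

    def fork(self):
        child = Process()
        for vpn, entry in self.pages.items():
            entry[1] = False                        # step 2: parent read-only
            entry[0].refcount += 1                  # step 1: frame is shared
            child.pages[vpn] = [entry[0], False]    # step 2: child read-only
        return child

    def write(self, vpn, offset, value):
        frame, writable = self.pages[vpn]
        if not writable:                            # step 4: write fault
            if frame.refcount > 1:
                frame.refcount -= 1
                frame = Frame(frame.data)           # steps 5-6: allocate, copy
            self.pages[vpn] = [frame, True]         # step 7: private, writable
        frame.data[offset] = value

    def read(self, vpn, offset):
        return self.pages[vpn][0].data[offset]
```

After a fork, reads from either process touch the same frame. Only the first write to a page triggers a copy of that one page.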

2.19.3. Why write-protection is useful

Copy-on-write relies on the same trick used for dirty-bit emulation.

The OS intentionally maps a page read-only.

A write fault is not treated as an illegal access.

Instead, it is interpreted as a signal that the OS must perform the deferred copy.

2.19.4. Benefits

Copy-on-write makes fork() much cheaper.

It avoids copying pages that are never modified.

It is also useful for snapshots and other situations where two views initially share the same data but may diverge later.

2.20. Memory-mapped files

The lecture briefly mentions memory-mapped files as another trick enabled by virtual memory.

The idea is that file contents can be mapped into a process’s address space.

Then ordinary memory loads and stores access file data.

The OS uses page faults to load file pages on demand.

This connects virtual memory with file-system caching.

The same mechanisms appear again in later I/O and file-system lectures.
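
As a small user-level illustration, Python's standard mmap module exposes this mechanism. The helper names patch_file_via_mmap and demo are chosen for this example, and the file must be non-empty for the mapping to succeed.

```python
import mmap
import os
import tempfile


def patch_file_via_mmap(path, offset, new_bytes):
    """Overwrite part of a file through a memory mapping; return the old bytes."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:    # length 0 maps the whole file
            end = offset + len(new_bytes)
            old = bytes(m[offset:end])         # a load from mapped memory
            m[offset:end] = new_bytes          # a store to mapped memory
            m.flush()                          # push changes back to the file
    return old


def demo():
    """Create a temporary file, patch it through mmap, and read it back."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hello world")
        old = patch_file_via_mmap(path, 0, b"HELLO")
        with open(path, "rb") as f:
            return old, f.read()
    finally:
        os.remove(path)
```

No explicit read or write call touches the file contents inside patch_file_via_mmap. The slice operations are ordinary memory accesses, and the OS loads and writes back the underlying pages.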

2.21. Overall summary

Virtual memory is not only about giving each process its own address space.

It is a general mechanism that lets the OS control memory access transparently.

Address translation provides:

  • indirection from virtual to physical memory;
  • permission enforcement;
  • access tracking;
  • page faults as a programmable recovery mechanism.

Multi-level page tables make sparse address spaces practical.

Huge pages reduce overhead for large mappings.

TLBs make translation fast enough in practice.

Demand paging uses page faults and invalid mappings to treat RAM as a cache for slower storage.

Dirty and accessed bits help the OS make paging decisions.

Clock and approximate LRU policies help choose eviction victims.

Lazy allocation and copy-on-write show how page faults can implement useful optimizations while keeping the application interface simple.

Author: Lowtroo

Created on: 2026-05-30 Sat 16:00
