Address Translation

1. Lecture 7 — Scheduling Wrap-Up and Protection

1.1. Part I: Fair Scheduling

1.1.1. Motivation: why fair scheduling?

Scheduling is used to share a serially reusable resource among multiple threads or processes.

The resource can be:

  • CPU time;
  • I/O controllers;
  • network bandwidth;
  • memory pages;
  • or any other virtualized shared resource.

Fair scheduling is not inherently morally “good” or “bad”. It is a mechanism for obtaining desirable system properties such as:

  • avoiding starvation;
  • bounding latency;
  • providing isolation;
  • giving users or processes predictable shares of resources.

The lecture focuses on proportional-share fairness.

If thread \(i\) has weight \(w_i\), then over an interval of length \(\Delta\), its ideal fair share is

\[ \mathrm{share}_i(\Delta) = \Delta \cdot \frac{w_i}{\sum_j w_j}. \]

A simple baseline is to give every thread weight \(1\). More important threads can be given larger weights.

1.1.2. Lottery scheduling

Lottery scheduling is a simple randomized proportional-share policy.

Each thread receives some number of tickets. At each scheduling decision, the scheduler randomly selects one ticket and runs the thread that owns it for one time slice.

If thread \(i\) owns \(w_i\) tickets, then in expectation it receives

\[ \frac{w_i}{\sum_j w_j} \]

of the processor time.

This works well as a conceptual example and may make sense for some resources, such as memory allocation. However, it is not commonly used for CPU scheduling because we have more precise algorithms with better latency and fairness properties.
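
A minimal sketch of the idea, assuming a fixed set of threads and one draw per time slice. The function name `lottery_schedule` and the fixed seed are illustrative choices, not part of the lecture.

```python
import random
from collections import Counter


def lottery_schedule(tickets, num_slices, seed=0):
    """Run `num_slices` lottery draws and count how often each thread wins.

    `tickets` maps a thread name to its ticket count (its weight w_i).
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    names = list(tickets)
    weights = [tickets[name] for name in names]
    wins = Counter()
    for _ in range(num_slices):
        # Drawing one ticket uniformly at random is the same as picking a
        # thread with probability w_i / sum_j w_j.
        winner = rng.choices(names, weights=weights, k=1)[0]
        wins[winner] += 1
    return wins


wins = lottery_schedule({"A": 1, "B": 2}, num_slices=30_000)
share_b = wins["B"] / 30_000  # close to 2/3, but only in expectation
```

Over short intervals the observed shares can deviate noticeably from the ideal ones, which is one reason the randomized policy gives weak latency guarantees.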

1.1.3. STFQ: Start-Time Fair Queuing

  1. Basic idea

    STFQ, or Start-Time Fair Queuing, can be understood as:

    \[ \text{FIFO, but in virtual time.} \]

    The scheduler always picks the thread with the earliest virtual time.

    The key idea is that virtual time advances more slowly for threads with larger weight, because larger-weight threads are entitled to more service.

    After a thread receives actual service \(\Delta\), its virtual time is updated as

    \[ V_i \leftarrow V_i + \Delta \cdot \frac{1}{\text{share of process } i}. \]

    Since

    \[ \text{share of process } i = \frac{w_i}{\sum_j w_j}, \]

    we get

    \[ V_i \leftarrow V_i + \Delta \cdot \frac{\sum_j w_j}{w_i}. \]

    Thus:

    • small weight \(\Rightarrow\) virtual time increases quickly;
    • large weight \(\Rightarrow\) virtual time increases slowly;
    • the scheduler picks the smallest virtual time;
    • therefore large-weight threads are naturally chosen more often.
  2. Example intuition

    Suppose we have two threads:

    \[ w_A = 1,\qquad w_B = 2. \]

    Then their shares are:

    \[ A: \frac{1}{3}, \qquad B: \frac{2}{3}. \]

    With a time slice of \(10\):

    For \(A\):

    \[ V_A \leftarrow V_A + 10 \cdot \frac{3}{1} = V_A + 30. \]

    For \(B\):

    \[ V_B \leftarrow V_B + 10 \cdot \frac{3}{2} = V_B + 15. \]

    So \(A\)’s virtual time grows twice as fast as \(B\)’s. As a result, \(B\) has the smallest virtual time more often and receives approximately twice as much CPU time.

  3. Tie-breaking optimization

    If two threads have equal virtual time, the scheduler does not necessarily need to preempt the currently running thread.

    In practice, avoiding unnecessary preemptions reduces overhead.
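
The STFQ rules above can be sketched as a small simulation. This is a simplified model, assuming that all threads stay runnable (so the total weight is constant) and that ties go to the running thread. The name `stfq_schedule` is hypothetical.

```python
from fractions import Fraction


def stfq_schedule(weights, slice_len, num_slices):
    """Return the thread names in the order STFQ runs them.

    All threads stay runnable, so the total weight is constant.
    """
    total = sum(weights.values())
    vtime = {name: Fraction(0) for name in weights}
    current = None
    order = []
    for _ in range(num_slices):
        earliest = min(vtime.values())
        # Tie-breaking: keep the running thread if it is among the earliest,
        # which avoids an unnecessary preemption.
        if current is None or vtime[current] != earliest:
            current = next(n for n in vtime if vtime[n] == earliest)
        order.append(current)
        # V_i <- V_i + delta * (sum_j w_j) / w_i
        vtime[current] += Fraction(slice_len * total, weights[current])
    return order


order = stfq_schedule({"A": 1, "B": 2}, slice_len=10, num_slices=300)
runs_a, runs_b = order.count("A"), order.count("B")
```

With the weights from the example, `runs_b` comes out at about twice `runs_a`, matching the 1:2 ratio of the weights.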

1.1.4. Linux nice values and weights

Historically, Unix used multi-level feedback queues. A process’s nice value acted as an offset in the priority queues:

  • lower nice value \(\Rightarrow\) higher priority;
  • higher nice value \(\Rightarrow\) lower priority.

Modern Linux uses fair scheduling, so the old nice interface is mapped to weights.

Thus, nice is a legacy interface, but it is still required by POSIX-like systems.
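
As a rough illustration of such a mapping, the sketch below assumes that nice 0 corresponds to weight 1024 and that each nice step scales the weight by a factor of about 1.25. The kernel's actual table contains rounded integers, so real values differ slightly. The function names are hypothetical.

```python
def approx_nice_weight(nice):
    """Approximate scheduler weight for a nice value in [-20, 19].

    Assumption for illustration: weight 1024 at nice 0, and a factor of
    about 1.25 per nice step. This is not the kernel's exact table.
    """
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in [-20, 19]")
    return 1024 / (1.25 ** nice)


def cpu_shares(nice_values):
    """Proportional shares w_i / sum_j w_j for threads with these nice values."""
    weights = [approx_nice_weight(n) for n in nice_values]
    total = sum(weights)
    return [w / total for w in weights]


shares = cpu_shares([0, 1])  # two CPU-bound threads, nice 0 and nice 1
```

Under this approximation, the nice-0 thread receives roughly 55% of the CPU and the nice-1 thread roughly 45%.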

1.1.5. The hard part: handling resuming threads

When a blocked or sleeping thread wakes up, we must decide what virtual time to assign to it.

This is subtle because a bad rule can cause either:

  • unfair CPU monopolization after wake-up;
  • bad wake-up latency;
  • or an exploitable way to erase recent CPU usage.

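The lecture considers four possible rules (Options 1 to 4 in the list that follows). Item 3 of the list is not a rule itself: it describes how STFQ estimates the current virtual time, which Options 3 and 4 need.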

  1. Option 1: do nothing

    On wake-up, keep the thread’s old virtual time unchanged.

    Problem:

    If the thread slept for a long time, its virtual time may be far in the past compared to the threads that continued running.

    Then, when it wakes up, it appears extremely under-served and may monopolize the CPU for a long time until its virtual time catches up.

    This causes large latency spikes for other threads.

    Therefore:

    \[ \text{do nothing} \quad \text{is bad.} \]

  2. Option 2: add the sleep time

    A possible idea is:

    \[ V_i \leftarrow V_i + \text{sleep time}. \]

    The intention is to keep the sleeping thread’s virtual time roughly aligned with the rest of the system.

    Problem:

    This can penalize sleeping or I/O-bound threads.

    If a thread blocks because it is waiting for I/O, that is usually good behavior: it gives up the CPU while it cannot make progress. We should not punish it for doing so.

    If sleeping is punished, programmers may avoid blocking and instead spin in a loop, wasting CPU.

    Therefore:

    \[ V_i \leftarrow V_i + \text{sleep time} \]

    is also bad.

    Strictly speaking, if one wanted to add anything, it should be a virtual sleep-time adjustment, not raw physical sleep time. But even the general idea is problematic.

  3. STFQ’s approximation of current virtual time

    STFQ needs a cheap way to estimate the current virtual time.

    The original STFQ context was packet scheduling in high-speed network switches, where per-packet overhead must be tiny.

    The trick is:

    1. If a thread is currently executing: use the virtual start time of the currently running thread.
    2. If the processor is idle: use the maximum virtual end time of any previously scheduled thread.

    The virtual start time is the virtual time at which the currently running thread started its current time slice.

    The virtual end time is the virtual time after a thread has consumed its allocated service.

    The maximum virtual end time is used to prevent virtual time from going backwards when the system becomes idle and later receives new work.

  4. Option 3: reset to current virtual time

    Another idea is:

    \[ V_i \leftarrow \text{current virtual time}. \]

    This treats a waking thread like a newly arriving thread.

    Problem:

    This allows a thread to forget recent CPU usage.

    A thread can run for a full time slice, block briefly near the end of the slice, wake up almost immediately, and then get its virtual time reset to “now”.

    This can let a low-weight thread receive much more CPU time than its weight allows.

    In the lecture’s example, a thread that should receive only around \(10\%\) of the CPU can end up receiving about \(50\%\).

    Therefore, resetting directly to the current virtual time is also bad.

  5. Option 4: use the maximum rule

    The compromise is:

    \[ V_i \leftarrow \max(V_i, \text{current virtual time}). \]

    This rule means:

    • if the thread slept for a long time and its old virtual time is far in the past, it is moved forward to the current virtual time;
    • if the thread only slept briefly and its old virtual time is still in the future because it recently consumed CPU, that history is preserved;
    • virtual time never goes backwards.

    This avoids both:

    • giving long-sleeping threads an unfair CPU monopoly;
    • allowing briefly-sleeping threads to erase recent CPU usage.

    This is the best of the four simple STFQ wake-up rules.
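
The maximum rule and the cheap estimate of the current virtual time can be sketched as two small functions. The names and the use of `None` for an idle processor are assumptions made for this illustration.

```python
def current_virtual_time(running_start, max_finish_seen):
    """Cheap estimate of 'now' in virtual time.

    running_start: virtual start time of the running thread, or None if
        the processor is idle.
    max_finish_seen: largest virtual end time of any previously scheduled
        thread.
    """
    if running_start is not None:
        return running_start
    return max_finish_seen


def vtime_on_wakeup(old_vtime, now_vtime):
    """Option 4: never move a thread's virtual time backwards on wake-up."""
    return max(old_vtime, now_vtime)


# A long sleeper is pulled forward to 'now' and cannot monopolize the CPU.
long_sleeper = vtime_on_wakeup(old_vtime=100, now_vtime=5000)
# A brief sleeper that just consumed CPU keeps its history.
brief_sleeper = vtime_on_wakeup(old_vtime=5030, now_vtime=5000)
```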

1.1.6. Limitation of STFQ

STFQ is still not ideal.

The key weakness is that it schedules based on virtual start time.

If many threads wake up at the same time, they may all receive the same virtual start time and thus look identical to the scheduler.

Example:

  • 100 threads each have weight \(1\);
  • 1 thread has weight \(100\).

The 100 small threads together should receive \(50\%\) of the CPU.

The single large thread should also receive \(50\%\).

Ideally, the large thread should run roughly every second time slice.

But if all 101 threads wake up at the same time, they may all have the same start time. STFQ may let many low-weight threads each run once before the high-weight thread gets enough service.

This can cause latency spikes of size

\[ \Omega(n) \]

in the number of threads.

So STFQ does not provide a strong worst-case lag bound.

The problem is that STFQ looks only at where a thread starts, not where it would be after receiving the next time slice.
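
The unlucky case can be reproduced with a short sketch. It assumes that ties in virtual time are broken by list order and that the heavy thread happens to come last; the function name is hypothetical.

```python
from fractions import Fraction


def slices_before_heavy_runs(num_light, heavy_weight, slice_len=1):
    """Count the slices that run before the heavy thread is first scheduled.

    Assumption: all threads wake up together with virtual time 0, ties are
    broken by list order, and the heavy thread is last in the list.
    """
    weights = [1] * num_light + [heavy_weight]
    total = sum(weights)
    vtime = [Fraction(0)] * len(weights)
    heavy = len(weights) - 1
    waited = 0
    while True:
        # min() returns the first index with the smallest virtual time.
        i = min(range(len(weights)), key=lambda k: vtime[k])
        if i == heavy:
            return waited
        vtime[i] += Fraction(slice_len * total, weights[i])
        waited += 1


delay = slices_before_heavy_runs(num_light=100, heavy_weight=100)
```

In this model the heavy thread waits for one slice per light thread, so the delay grows linearly with the number of threads.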

1.2. EEVDF: Earliest Eligible Virtual Deadline First

1.2.1. Motivation

EEVDF fixes the STFQ problem by considering not only the virtual start time, but also the projected virtual time after giving a thread another time slice.

EEVDF stands for:

\[ \text{Earliest Eligible Virtual Deadline First}. \]

It is the conceptual basis for Linux’s SCHED_OTHER policy as of Linux 6.6.

Most normal user processes run under SCHED_OTHER.

Linux does not literally implement the clean mathematical algorithm exactly, but EEVDF is the conceptual basis.

1.2.2. Virtual time

In simplified form, EEVDF defines global virtual time as

\[ V(t) = \int_0^t \frac{1}{\sum_{i \in A(\tau)} w_i} \,d\tau, \]

where \(A(\tau)\) is the set of threads that are active at time \(\tau\).

If the total weight is constant, this simplifies to

\[ V(t) = \frac{t}{\sum_i w_i}. \]

If threads enter or leave the system, the total weight changes, so the integral must account for those changes.

1.2.3. Activation time

Let

\[ a_i \]

be the time at which thread \(i\) was activated.

Activation means:

  • created;
  • or resumed after blocking.

1.2.4. Service received

Let

\[ s_i(t_1,t_2) \]

be the amount of CPU service received by thread \(i\) in the interval

\[ [t_1,t_2). \]

1.2.5. Virtual eligibility time

The virtual eligibility time of thread \(i\) is

\begin{equation*} e_i(t) = V(a_i) + \frac{s_i(a_i,t)}{w_i}. \end{equation*}

This represents how far the thread has progressed in virtual service since activation.

The more service a thread receives, the larger \(e_i(t)\) becomes.

For a large-weight thread, the same amount of service increases \(e_i(t)\) more slowly.

1.2.6. Eligibility rule

A thread is eligible for service only if

\[ e_i(t) \le V(t). \]

If

\[ e_i(t) > V(t), \]

then the thread is too far ahead of its fair allocation and should not be considered for scheduling until virtual time catches up.

This prevents a thread from receiving too much service too early.

1.2.7. Virtual deadline

EEVDF also defines a virtual deadline:

\begin{equation*} d_i(t) = e_i(t) + \frac{\text{time-slice length}}{w_i}. \end{equation*}

This asks:

“If we gave this thread one more time slice, where would its virtual time end up?”

That projected endpoint is exactly what STFQ ignored.

1.2.8. Selection rule

EEVDF schedules as follows:

  1. Consider only eligible threads:

    \[ e_i(t) \le V(t). \]

  2. Among eligible threads, choose the thread with the earliest virtual deadline:

    \[ \min_i d_i(t). \]

Thus EEVDF is:

\[ \text{EDF among eligible threads.} \]

In case of ties, avoid preemption if possible.
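
A single scheduling decision under this rule might look as follows. This is a sketch, assuming each thread is described by its current eligibility time and weight; `pick_eevdf` and the example values are hypothetical.

```python
from fractions import Fraction


def pick_eevdf(threads, now_v, slice_len):
    """Pick the eligible thread with the earliest virtual deadline.

    threads: dict mapping a name to (e_i, w_i), the thread's virtual
        eligibility time and weight.
    now_v: current virtual time V(t).
    Returns None if no thread is eligible.
    """
    eligible = {n: ew for n, ew in threads.items() if ew[0] <= now_v}
    if not eligible:
        return None

    def deadline(name):
        e_i, w_i = eligible[name]
        return e_i + Fraction(slice_len, w_i)  # d_i = e_i + slice / w_i

    return min(eligible, key=deadline)


threads = {
    "A": (Fraction(0), 1),     # eligible, deadline 1
    "B": (Fraction(1, 2), 4),  # e_B > V, so B is not eligible yet
    "C": (Fraction(0), 2),     # eligible, deadline 1/2
}
choice = pick_eevdf(threads, now_v=Fraction(1, 4), slice_len=1)
```

Here `B` would have an early deadline, but it is filtered out first because it is ahead of its fair allocation, so `C` is chosen.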

1.2.9. Why EEVDF fixes the STFQ latency spike

Suppose two threads wake up at the same time.

In STFQ, they may have the same virtual start time, so they look identical.

In EEVDF, even if their eligibility times are equal, their virtual deadlines differ because the virtual deadline includes the weight:

\begin{equation*} d_i(t) = e_i(t) + \frac{\text{time-slice length}}{w_i}. \end{equation*}

A larger weight \(w_i\) gives a smaller increment.

So a high-weight thread has an earlier virtual deadline and is chosen earlier.

This prevents the bad case where many low-weight threads all get one time slice before the high-weight thread gets service.
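
The 101-thread scenario can be replayed with a simplified EEVDF model. The sketch assumes that all threads activate at time 0 and never block, so \(V(a_i) = 0\) and the total weight is constant. The heavy thread is deliberately listed last. All names are illustrative.

```python
from fractions import Fraction


def eevdf_schedule(weights, slice_len, num_slices):
    """Simplified EEVDF: all threads activate at t = 0 and never block."""
    total = sum(weights.values())
    service = {name: 0 for name in weights}  # s_i(0, t)
    now_v = Fraction(0)                      # V(t)

    def eligibility(name):                   # e_i(t) = s_i / w_i
        return Fraction(service[name], weights[name])

    def deadline(name):                      # d_i(t) = e_i(t) + slice / w_i
        return eligibility(name) + Fraction(slice_len, weights[name])

    order = []
    for _ in range(num_slices):
        eligible = [n for n in weights if eligibility(n) <= now_v]
        chosen = min(eligible, key=deadline)  # earliest virtual deadline
        order.append(chosen)
        service[chosen] += slice_len
        now_v += Fraction(slice_len, total)   # V advances by delta / sum_j w_j
    return order


weights = {f"light{k}": 1 for k in range(100)}
weights["heavy"] = 100                        # listed last on purpose
order = eevdf_schedule(weights, slice_len=1, num_slices=200)
```

In this model the heavy thread is picked in the very first slice despite being last in the list, and it ends up with half of the 200 slices, while each light thread runs once.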

1.2.10. Ineligible threads

A thread can become ineligible if it has received too much service relative to its fair share.

This happens when

\[ e_i(t) > V(t). \]

The lecture gives a contrived example with weights:

\[ w_A = 1,\qquad w_B = 8,\qquad w_C = 1. \]

Because \(B\) has a very high weight, it is scheduled many times. Eventually its eligibility time can move into the future relative to current virtual time, making it temporarily ineligible.

In small examples this may not visibly affect the schedule, but it is important in larger systems.

1.3. EEVDF and Blocking Threads

1.3.1. Why blocking is complicated

Blocking changes the scheduling state in two ways.

First, the total weight changes:

\[ \sum_j w_j \]

changes when a thread leaves the runnable set.

This affects:

  • every other thread’s fair share;
  • the rate at which virtual time progresses.

Second, when the thread blocks, it may not have received exactly its ideal fair share.

It may have received:

  • exactly its fair share;
  • more than its fair share;
  • less than its fair share.

To reason about this, EEVDF defines lag.

1.3.2. Lag

Lag measures the deviation from ideal fair allocation.

For thread \(i\),

\begin{equation*} \mathrm{lag}_i(t) = (t-a_i) \cdot \frac{w_i}{\sum_j w_j} - s_i(a_i,t). \end{equation*}

Interpretation:

  • \(\mathrm{lag}_i(t)=0\): the thread received exactly its fair share.
  • \(\mathrm{lag}_i(t)>0\): the thread received less than its fair share. It is lagging behind.
  • \(\mathrm{lag}_i(t)<0\): the thread received more than its fair share. It is ahead and should wait.
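
Lag and eligibility are two views of the same quantity. As a short derivation, assume the total weight \(\sum_j w_j\) stays constant while thread \(i\) is active, so that \(V(t) - V(a_i) = (t-a_i) / \sum_j w_j\). Then

\begin{equation*} \mathrm{lag}_i(t) = w_i \bigl(V(t) - V(a_i)\bigr) - s_i(a_i,t) = w_i \Bigl(V(t) - V(a_i) - \frac{s_i(a_i,t)}{w_i}\Bigr) = w_i \bigl(V(t) - e_i(t)\bigr). \end{equation*}

Since \(w_i > 0\), a thread satisfies \(e_i(t) \le V(t)\) exactly when \(\mathrm{lag}_i(t) \ge 0\): under this assumption, the eligible threads are the ones that are not ahead of their fair share.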

1.3.3. Case 1: zero lag

If a thread blocks with zero lag, then it leaves at the perfect time.

No special handling is needed.

The scheduler can simply subtract its weight from the total weight:

\[ \sum_j w_j \leftarrow \sum_j w_j - w_i. \]

1.3.4. Case 2: negative lag

If a thread blocks with negative lag, it has received more than its fair share.

The trick is to delay departure processing.

The thread is blocked at the application level, but the scheduler pretends it has not fully left yet. It remains in a special state such as:

\[ \text{"blocked but not yet departure-processed"}. \]

Its weight is not immediately subtracted from the total weight.

Only when virtual time catches up and its lag becomes zero does the scheduler process its departure.

If the thread wakes up before that point, the scheduler can pretend it never left.

1.3.5. Case 3: positive lag

If a thread blocks with positive lag, it has received less than its fair share.

It is leaving while still being owed service.

The scheduler must redistribute that unconsumed service.

The practical rule described in the lecture is:

  1. Fast-forward virtual time to the point where the blocking thread would have reached zero lag:

    \begin{equation*} V(t) \leftarrow V(t) + \frac{\mathrm{lag}_i(t)} {(\sum_j w_j)-w_i}. \end{equation*}
  2. Update the lag of every other thread \(x\) proportionally:

    \begin{equation*} \mathrm{lag}_x(t) \leftarrow \mathrm{lag}_x(t) + \mathrm{lag}_i(t) \cdot \frac{w_x}{(\sum_j w_j)-w_i}. \end{equation*}

This distributes the unconsumed service among the remaining threads according to their weights.
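
The two update steps can be sketched as one function. The dictionary-based representation and the handling of the case where no other thread remains are assumptions of this sketch, not something the lecture specifies.

```python
from fractions import Fraction


def depart_with_positive_lag(leaving, weights, lags, now_v):
    """Return (new_now_v, new_lags) after `leaving` blocks while owed service."""
    lag_i = Fraction(lags[leaving])
    if lag_i <= 0:
        raise ValueError("this sketch only covers the positive-lag case")
    remaining = sum(weights.values()) - weights[leaving]
    if remaining == 0:
        # Assumption: with no other thread left there is nothing to
        # redistribute, so virtual time is left unchanged.
        return now_v, {}
    # V(t) <- V(t) + lag_i / ((sum_j w_j) - w_i)
    new_v = now_v + lag_i / remaining
    # lag_x <- lag_x + lag_i * w_x / ((sum_j w_j) - w_i)
    new_lags = {
        x: lags[x] + lag_i * weights[x] / remaining
        for x in weights
        if x != leaving
    }
    return new_v, new_lags


weights = {"A": 1, "B": 2, "C": 1}
lags = {"A": Fraction(2), "B": Fraction(-1), "C": Fraction(-1)}  # sums to zero
new_v, new_lags = depart_with_positive_lag("A", weights, lags, Fraction(5))
```

In this example the lags of the remaining threads `B` and `C` still sum to zero after `A` leaves, because the two units that `A` was owed are split 2:1 according to their weights.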

1.3.6. Key invariant: total lag sums to zero

A central invariant is:

\[ \sum_i \mathrm{lag}_i(t)=0. \]

The complicated blocking rules exist to preserve this invariant.

This invariant is important because it allows the algorithm to prove strong fairness properties.

1.3.7. Lag bound

EEVDF guarantees that lag is bounded by a constant, namely one time slice:

\[ \forall i,\forall t: \quad -\text{time slice} < \mathrm{lag}_i(t) < \text{time slice}. \]

This means no thread’s actual allocation deviates from its ideal fair allocation by more than about one time slice.

This is close to the best one can hope for in a time-sliced scheduler.

The time-slice length therefore controls a trade-off:

  • small time slice: better fairness and latency, but higher scheduling overhead;
  • large time slice: lower overhead, but worse fairness granularity.

1.4. Scheduling Summary

1.4.1. Real-time scheduling

Real-time scheduling is appropriate when the system knows exactly what timing objective it wants to achieve.

Common policies:

  • fixed-priority scheduling;
  • EDF, Earliest Deadline First.

POSIX and Linux examples:

  • SCHED_FIFO: fixed-priority scheduling with FIFO tie-breaking within each priority level;
  • SCHED_RR: fixed-priority scheduling with round-robin within each priority level;
  • SCHED_DEADLINE: Linux EDF-style scheduling based on the Constant Bandwidth Server algorithm.

These policies generally require elevated privileges (on Linux, typically root or the CAP_SYS_NICE capability) because badly configured real-time tasks can starve normal system tasks.

1.4.2. General-purpose scheduling

General-purpose workloads, such as a browser with many JavaScript threads, usually do not have clear deadlines or known execution times.

Historically, systems tried to approximate SRPT:

\[ \text{Shortest Remaining Processing Time First}. \]

SRPT is good for response time and naturally favors I/O-bound tasks, but the OS usually does not know the true remaining processing time.

So practical systems use heuristics such as:

  • multi-level feedback queues;
  • fair scheduling;
  • EEVDF-style scheduling.

Examples:

  • old Unix: MLFQ-style scheduling;
  • modern Linux: SCHED_OTHER based conceptually on EEVDF.

Other fair-scheduling algorithms in the literature include:

  • WFQ, Weighted Fair Queuing;
  • VTRR, Virtual-Time Round Robin;
  • GR3, Group Ratio Round Robin;
  • DRF, Dominant Resource Fairness.

2. Lecture 7 — Protection

2.1. Why protection is needed

After scheduling decides which application should run, the OS must still ensure that running multiple applications is safe.

Applications are generally untrusted.

They may be malicious, but more commonly they may simply contain bugs.

Things that can go wrong include:

  • one application reads or writes another application’s memory;
  • one application reads sensitive information from another application;
  • one application modifies OS state;
  • one application modifies shared libraries or system tables;
  • one application consumes too much CPU;
  • one application consumes too much memory or other resources;
  • one application runs forever and never voluntarily gives up control.

The OS must enforce isolation and recover resources even when applications do not cooperate.

2.2. Requirements for safely running multiple applications

The lecture lists several requirements.

2.2.1. Controlled CPU execution

The OS must be able to control which application runs on the processor.

Scheduling is the policy part, but the hardware and kernel mechanisms must make it possible to stop and resume applications safely.

2.2.2. Controlled kernel transitions

Applications need to request services from the OS, for example:

  • reading from a file;
  • writing to a file;
  • allocating memory;
  • creating processes;
  • exiting.

These transitions must happen through controlled entry points.

An application must not be able to jump into arbitrary kernel code.

2.2.3. Controlled memory access

Applications must only access memory that the OS allows them to access.

This includes preventing:

  • application \(A\) from reading or writing application \(B\)’s memory;
  • applications from reading or writing kernel memory.

2.2.4. Flexible memory placement

The OS should be able to place application memory flexibly.

For example, two instances of the same program may both believe that their function main or some global variable lives at the same address.

Behind the scenes, the OS maps these virtual addresses to different physical memory locations.

This requires address translation.

2.2.5. Recoverability and enforcement

The OS must be able to recover resources from applications.

For example:

  • reclaim CPU time through timer interrupts;
  • reclaim memory;
  • kill or suspend misbehaving processes.

The OS cannot rely on applications to voluntarily return resources.

2.3. Processor protection

2.3.1. Kernel mode and user mode

Modern processors support at least two execution modes:

  • kernel mode;
  • user mode.

The kernel runs in kernel mode.

Applications run in user mode.

Kernel mode has full access to:

  • CPU control mechanisms;
  • memory-management structures;
  • hardware devices;
  • privileged instructions.

User mode is restricted.

An application cannot directly execute sensitive instructions or directly change processor state such as the current privilege mode.

2.3.2. Hardware support

The current execution mode is tracked by the processor itself, effectively in an internal processor register.

On a multi-core system, each core has its own current mode.

This must be enforced by hardware because while application code is running, software alone cannot stop it from executing arbitrary instructions.

2.3.3. Privileged operations

Certain operations are only allowed in kernel mode.

Examples include:

  • changing memory-management configuration;
  • changing interrupt tables;
  • accessing device registers;
  • disabling interrupts;
  • changing privileged control registers;
  • directly switching to kernel mode.

If user-mode code tries such an operation, the processor raises an exception and transfers control to the kernel.

2.3.4. Controlled transitions between modes

There are two major directions.

  1. Kernel to user

    The kernel returns to user mode using a controlled return mechanism.

    On x86, this is often modeled as returning from an interrupt, for example using:

    \[ \texttt{iret} \]

    or related architecture-specific instructions.

  2. User to kernel

    User code enters the kernel through:

    • hardware interrupts;
    • software interrupts;
    • exceptions;
    • system calls.

    Examples:

    • timer interrupt;
    • disk interrupt;
    • network interrupt;
    • page fault;
    • division by zero;
    • invalid instruction;
    • explicit system call.

2.4. Exceptions, interrupts, and traps

The terminology can vary.

Some systems distinguish:

  • interrupt: asynchronous event from hardware;
  • exception: synchronous event caused by the current instruction;
  • trap: deliberate transition, such as a system call.

But in practice, they often use similar hardware mechanisms.

The key idea is:

\[ \text{CPU event} \Rightarrow \text{hardware takes control away from user code} \Rightarrow \text{jump to kernel handler}. \]

The kernel then decides what to do.

Possible kernel responses include:

  • kill the process;
  • deliver a signal;
  • handle the event and resume the process;
  • schedule another process;
  • perform I/O completion handling;
  • bring back a missing memory page.

2.5. POSIX signals vs CPU exceptions

A CPU exception is a hardware-level event.

A POSIX signal is a user-visible OS abstraction.

Example:

  1. A process accesses invalid memory.
  2. The CPU raises a page fault exception.
  3. The kernel page-fault handler runs.
  4. The kernel determines that the access is invalid.
  5. The kernel delivers SIGSEGV to the process.
  6. The process may run a signal handler or be terminated.

Thus:

\[ \text{page fault} \neq \text{SIGSEGV} \]

but a page fault may lead to SIGSEGV.

2.6. Kernel stack management

2.6.1. The problem

User programs need full access to general-purpose registers, including the stack pointer.

But when an interrupt or exception occurs, the kernel must immediately run code.

Running code requires a valid stack.

The kernel cannot trust the user stack pointer because:

  • the user program controls it;
  • it may point to invalid memory;
  • it may point to maliciously chosen memory;
  • the fault may have happened because the user stack was invalid.

Therefore the kernel needs a safe stack as soon as it enters kernel mode.

2.6.2. Solution: separate kernel stack

The usual solution is to maintain a separate kernel stack.

When entering the kernel, the processor or low-level kernel entry code switches to this kernel stack.

On x86, the hardware can automatically switch to a kernel stack when transitioning from user mode to kernel mode.

Then the kernel can safely push register contents onto the kernel stack.

2.6.3. Why kernel stacks may be per process or per kernel thread

In simple kernels that never block while running kernel code, one kernel stack per core may be enough.

But realistic kernels such as Linux or Windows may block inside the kernel.

Example:

  1. A user process calls read.
  2. The kernel enters filesystem code.
  3. The disk operation takes time.
  4. The kernel thread blocks.
  5. The scheduler runs another process.

The blocked kernel computation still has a call stack.

Therefore, systems with blocking kernel threads need separate kernel stacks per kernel thread or per process context.

2.7. x86 interrupt handling

2.7.1. Interrupt vectors

x86 supports up to 256 interrupt vectors.

They can be thought of as an array of possible events.

The first 32 vectors are reserved for processor-defined exceptions.

Examples:

  • vector \(0\): divide-by-zero;
  • vector \(13\): general protection fault;
  • vector \(14\): page fault.

The remaining vectors are assigned by the OS for hardware IRQs and software interrupts.

In Pintos:

  • IRQs use vectors 32–47;
  • system calls use vector 48.

Linux traditionally used software interrupt

\[ \texttt{int \$0x80} \]

for system calls on x86, although modern systems usually use the faster dedicated instructions sysenter or syscall.

2.7.2. Interrupt Descriptor Table

The processor uses the Interrupt Descriptor Table, or IDT.

Conceptually, the IDT is an array of handler pointers.

Entry \(i\) tells the processor where to jump when interrupt vector \(i\) occurs.

The IDT is stored in memory.

A special processor register points to it.

The OS initializes this table during boot.

In Pintos, the IDT is set up in threads/interrupt.c.

2.7.3. Hardware part of interrupt handling

When interrupt vector \(i\) occurs, the processor:

  1. looks up entry \(i\) in the IDT;
  2. obtains the handler address;
  3. switches to the kernel stack if needed;
  4. pushes enough state to allow returning later, such as:
    • return instruction pointer;
    • old stack pointer;
    • status flags;
  5. switches into kernel mode;
  6. jumps to the handler.

This is not an ordinary function call, because it may occur asynchronously and it changes privilege level.

2.7.4. Software part of interrupt handling

After the hardware transition, kernel software runs.

The low-level handler usually:

  1. saves general-purpose registers on the kernel stack;
  2. identifies the event;
  3. calls the appropriate high-level handler;
  4. eventually restores registers;
  5. returns from the interrupt if the process should resume.

If the event should terminate the process, the kernel does not return to the original user code.
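
The control flow of the hardware and software parts can be mimicked with a toy model. This is not how real hardware is programmed; the class `ToyCPU`, its fields, and the convention that a handler returns `True` to resume the process are all invented for illustration.

```python
NUM_VECTORS = 256


class ToyCPU:
    """Toy model: just enough state to show the interrupt control flow."""

    def __init__(self):
        self.idt = [None] * NUM_VECTORS  # entry i: handler for vector i
        self.mode = "user"
        self.pc = 0x1000
        self.kernel_stack = []
        self.log = []

    def interrupt(self, vector):
        handler = self.idt[vector]                       # look up the IDT entry
        if handler is None:
            raise RuntimeError(f"no handler for vector {vector}")
        self.kernel_stack.append((self.pc, self.mode))   # save return state
        self.mode = "kernel"                             # enter kernel mode
        resume = handler(self, vector)                   # jump to the handler
        saved_pc, saved_mode = self.kernel_stack.pop()
        if resume:                                       # iret-style return
            self.pc, self.mode = saved_pc, saved_mode
        return resume


def timer_handler(cpu, vector):
    cpu.log.append(("timer", cpu.mode))
    return True    # resume the interrupted process


def page_fault_handler(cpu, vector):
    cpu.log.append(("page fault", cpu.mode))
    return False   # treat the access as invalid: do not return to user code


cpu = ToyCPU()
cpu.idt[32] = timer_handler       # first IRQ vector in the Pintos layout
cpu.idt[14] = page_fault_handler  # page fault
resumed_after_timer = cpu.interrupt(32)
mode_after_timer = cpu.mode
resumed_after_fault = cpu.interrupt(14)
```

Both handlers observe `cpu.mode == "kernel"`, and only the timer interrupt returns to user mode at the saved program counter.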

2.8. System calls

2.8.1. System call as controlled user-to-kernel transition

A system call is an explicit request from a user program to the kernel.

Mechanically, it can be implemented using a software interrupt or a special syscall instruction.

The user program prepares arguments according to the OS ABI.

Then it triggers the kernel entry mechanism.

The kernel examines:

  • which system call number was requested;
  • what arguments were provided;
  • whether the request is valid.

Then the kernel either performs the operation or rejects it.

2.8.2. Argument passing

On 32-bit x86, arguments may be pushed on the user stack.

On 64-bit x86, arguments are often passed in registers.

The exact convention is part of the OS ABI.

Pintos uses a simple convention suitable for the teaching kernel.

2.8.3. System call security: never trust user arguments

System calls cross a trust boundary.

The kernel is trusted.

The application is not.

Therefore every system call argument must be validated.

Example:

ssize_t read(int fd, void *buf, size_t count);

The user provides:

  • fd;
  • buf;
  • count.

Things that can go wrong:

  • fd may be outside the valid file descriptor table;
  • buf may point to kernel memory;
  • buf may point to unmapped memory;
  • buf may be valid at first but become invalid while copying;
  • count may be too large;
  • pointers to structures may contain further invalid pointers.

Pointers are especially dangerous because they are just numbers supplied by user space.

The kernel must not blindly dereference them.
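
A sketch of the up-front checks for the `read` example, written as a model rather than kernel code. The boundary `USER_TOP`, the page size, and the set-based bookkeeping are assumptions for illustration. The sketch does not cover the case where the buffer becomes invalid while the copy is in progress.

```python
USER_TOP = 0xC0000000  # assumed boundary: user addresses lie below this
PAGE_SIZE = 4096       # assumed page size


def validate_read_args(fd, buf, count, open_fds, mapped_pages):
    """Return None if the arguments pass the checks, else a reason string.

    open_fds: set of descriptors the process has open (hypothetical model).
    mapped_pages: set of user page numbers that are currently mapped.
    """
    if fd not in open_fds:
        return "bad file descriptor"
    if count < 0:
        return "negative count"
    # The whole range [buf, buf + count) must stay below the kernel boundary.
    # In C, this check must also guard against buf + count wrapping around.
    if buf < 0 or buf + count > USER_TOP:
        return "buffer outside user space"
    if count > 0:
        first = buf // PAGE_SIZE
        last = (buf + count - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if page not in mapped_pages:
                return "buffer not mapped"
    return None


open_fds = {0, 1, 3}
mapped_pages = {8}  # only the page covering 0x8000..0x8FFF is mapped
ok = validate_read_args(3, 0x8000, 100, open_fds, mapped_pages)
spans_unmapped = validate_read_args(3, 0x8FF0, 100, open_fds, mapped_pages)
```

The second call fails even though `buf` itself is mapped, because the requested range spills into the next, unmapped page.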

2.8.4. Page faults inside kernel code

Validating user pointers becomes subtle with virtual memory.

Even if a pointer lies in the user address range, the page may not currently be resident in memory.

When the kernel tries to copy data into or out of user memory, the kernel itself may trigger a page fault.

This can mean:

  • the access is invalid and the process should be killed;
  • or the page was temporarily swapped out and should be brought back.

Thus system-call implementation and memory-management implementation are closely connected.

3. Lecture 8 — Address Translation

3.1. Administrative notes

The lecture begins with organizational reminders:

  • lecture rooms are as posted on CMS;
  • technical assignment questions should go to the forum rather than individual TAs;
  • an issue with the CI-to-CMS score transfer was being fixed;
  • TA office hours would be announced.

3.2. Review: running multiple untrusted applications

The lecture resumes the topic of safely running multiple applications.

The core protection problem is:

\[ \text{applications are untrusted, but share the same hardware.} \]

The OS must protect:

  • applications from each other;
  • the kernel from applications;
  • the system as a whole from buggy or malicious behavior.

The previous lecture focused mainly on processor protection:

  • kernel mode;
  • user mode;
  • interrupts;
  • system calls;
  • controlled transitions.

Lecture 8 focuses on memory protection and address translation.

3.3. Memory protection problem

Suppose physical memory is laid out like:

\[ \begin{array}{c} \text{Unused} \\ \text{App 2} \\ \text{App 1} \\ \text{OS Kernel} \end{array} \]

Questions:

  • How do we prevent App 1 from accessing kernel memory?
  • How do we prevent App 2 from accessing App 1’s memory?
  • How can multiple instances of the same application use the same addresses without interfering?
  • How can the OS move applications around in memory?
  • How can the OS pretend there is more memory than physically available?

3.4. Base and bounds protection

3.4.1. Physical base and bounds

The simplest idea is to add two privileged registers:

  • base;
  • bound.

On every user-mode memory access, the CPU checks:

\[ \text{base} \le \text{address} < \text{bound}. \]

If the check fails, the CPU raises a memory exception.

In kernel mode, the kernel can access all memory.

During a context switch, the OS changes the base and bound registers to match the next process.
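
The check performed on each access can be modeled in a few lines. This is a software sketch of what the hardware does; the exception type `MemoryFault` and the function name are hypothetical.

```python
class MemoryFault(Exception):
    """Stands in for the memory exception raised by the CPU."""


def check_access(address, base, bound, kernel_mode=False):
    """Return the address if the access is allowed, else raise MemoryFault."""
    if kernel_mode:
        return address  # the kernel can access all memory
    if not (base <= address < bound):
        raise MemoryFault(f"address {address:#x} outside [{base:#x}, {bound:#x})")
    return address


# Registers of the current process; a context switch would overwrite them.
base, bound = 0x4000, 0x8000
allowed = check_access(0x5000, base, bound)
```

An access to `0x9000` with the same registers raises `MemoryFault` in user mode but succeeds with `kernel_mode=True`.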

3.4.2. Advantages

Base and bounds protection is:

  • simple;
  • cheap in hardware;
  • fast;
  • sufficient for basic isolation in simple systems.

This kind of mechanism appears in some embedded systems.

For example, RISC-V has Physical Memory Protection, PMP, which provides configurable protected physical memory regions.

3.4.3. Limitations

The main limitation is lack of flexibility.

The process directly uses physical addresses.

This causes problems:

  • an application must occupy one contiguous physical memory interval;
  • it is hard to grow memory if adjacent physical memory is occupied;
  • if processes exit, physical memory becomes fragmented;
  • the OS cannot transparently move an application without breaking its pointers;
  • multiple instances of the same program cannot naturally use the same addresses;
  • the OS cannot easily pretend that more memory exists than physically available.

Example:

If App 1 terminates and leaves a hole below App 2, the OS may want to move App 2 down to compact memory.

But if App 2 uses physical addresses directly, all its pointers would become wrong.

Therefore we need indirection.

3.5. Address translation and virtual memory

3.5.1. Indirection principle

The lecture quotes David Wheeler:

\[ \text{"Any problem in computer science can be solved with another level of indirection."} \]

Operating systems use this principle heavily.

Address translation adds a layer between:

  • the addresses seen by the application;
  • the physical addresses used by hardware memory.

3.5.2. Virtual addresses and physical addresses

Applications use virtual addresses.

The memory hardware translates virtual addresses into physical addresses.

\[ \text{virtual address} \longrightarrow \text{address translation} \longrightarrow \text{physical address}. \]

The application executes ordinary load/store instructions.

It does not need to know that translation is happening.

This allows the OS to:

  • move memory transparently;
  • isolate applications;
  • map the same virtual address in different processes to different physical addresses;
  • temporarily remove pages and later restore them;
  • implement demand paging and swapping.

3.6. Virtually addressed base and bounds

3.6.1. Mechanism

A first step beyond physical base and bounds is virtual base and bounds.

The process uses virtual addresses starting from zero.

The CPU checks:

\[ \text{if } \text{virt} \ge \text{bound}: \text{ raise exception}. \]

Otherwise it translates by adding the base:

\[ \text{phys} = \text{virt} + \text{base}. \]

So the virtual address range

\[ [0,\text{bound}) \]

is mapped to the physical range

\[ [\text{base}, \text{base}+\text{bound}). \]
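
As a sketch, the check and the addition fit in one small function. The names translate and BoundsFault are hypothetical, chosen only for this example.

```python
class BoundsFault(Exception):
    """Raised when a virtual address is not inside [0, bound)."""


def translate(virt, base, bound):
    # Virtual addresses start at zero, so only the upper limit matters.
    if not (0 <= virt < bound):
        raise BoundsFault(hex(virt))
    # Translation is a single addition.
    return base + virt
```

With base = 0x10000 and bound = 0x2000, virtual address 0x1FFF maps to physical address 0x11FFF, and virtual address 0x2000 raises the fault.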

3.6.2. Advantages

This allows the OS to move a process in physical memory.

If the OS copies the process’s memory to a new physical location and updates base, the process can continue using the same virtual addresses.

The application does not notice.

This is a major improvement over direct physical addressing.
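
The relocation step can be illustrated with a toy model. Everything here is an assumption for illustration: physical memory is a 64-byte bytearray, and the base and bound registers live in a dictionary called regs.

```python
physical = bytearray(64)              # toy physical memory
regs = {"base": 32, "bound": 8}       # process occupies physical [32, 40)
physical[32:40] = b"ABCDEFGH"


def load(virt):
    # Same check-then-add translation as virtual base and bounds.
    if not (0 <= virt < regs["bound"]):
        raise IndexError("virtual address out of bounds")
    return physical[regs["base"] + virt]


before = load(3)

# The OS relocates the process: copy the whole region, then update base.
new_base = 8
physical[new_base:new_base + regs["bound"]] = physical[32:40]
regs["base"] = new_base

after = load(3)                       # same virtual address, same byte
```

The process reads the same byte at virtual address 3 before and after the move, although the byte now lives at a different physical address.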

3.6.3. Disadvantages

The mechanism is still coarse-grained.

Problems remain:

  • each process still needs one physically contiguous region;
  • moving a process requires copying the entire region;
  • allocation and protection happen only at the granularity of the whole region;
  • it is hard to protect parts of the process from itself;
  • fine-grained permissions such as read-only code and non-executable stack are difficult.

For example, modern systems often enforce:

\[ \text{memory should not be both writable and executable}. \]

This helps prevent code-injection attacks.

With one base-and-bounds region, it is hard to make code executable but read-only while making heap and stack writable but non-executable.

3.7. Segmentation

3.7.1. Basic idea

Segmentation generalizes base and bounds by allowing multiple regions.

Each segment has its own:

  • base;
  • bound;
  • access permissions.

A virtual address consists of:

\[ \text{virtual address} = (\text{segment ID}, \text{offset}). \]

The segment ID indexes a segment table.

The offset is checked against the segment’s bound.

If valid, the physical address is:

\[ \text{phys} = \text{segment base} + \text{offset}. \]

3.7.2. Segment table

Each process has a segment table.

Each entry contains:

  • base;
  • bound;
  • access permissions.

Example segments:

  • code segment: readable and executable;
  • data segment: readable and writable;
  • heap segment: readable and writable;
  • stack segment: readable and writable.

The OS configures the segment table.

The application may know the segment IDs, but it cannot modify the segment table.
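
A segment table lookup can be sketched as follows. The segment IDs, bases, and bounds are invented values, and the names Segment, SegmentationFault, and translate are hypothetical.

```python
from dataclasses import dataclass


class SegmentationFault(Exception):
    """Raised on an invalid segment, offset, or access type."""


@dataclass(frozen=True)
class Segment:
    base: int
    bound: int
    perms: str  # subset of "rwx"


# Illustrative table configured by the OS; all values are made up.
segment_table = {
    0: Segment(base=0x1000, bound=0x400, perms="rx"),   # code
    1: Segment(base=0x8000, bound=0x800, perms="rw"),   # data
    2: Segment(base=0xA000, bound=0x2000, perms="rw"),  # heap
    3: Segment(base=0xF000, bound=0x1000, perms="rw"),  # stack
}


def translate(seg_id, offset, access, table):
    seg = table.get(seg_id)
    if seg is None:
        raise SegmentationFault("no such segment")
    # The offset is checked against the segment's bound.
    if not (0 <= offset < seg.bound):
        raise SegmentationFault("offset out of bounds")
    # The access type ("r", "w", or "x") must be permitted.
    if access not in seg.perms:
        raise SegmentationFault("permission denied")
    return seg.base + offset
```

Fetching an instruction at offset 0x10 of segment 0 succeeds, while writing to the same location fails because the code segment is not writable.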

3.7.3. Advantages

Segmentation gives more flexibility than one base-and-bounds pair.

It supports:

  • multiple memory regions per process;
  • different permissions for different regions;
  • moving individual segments independently;
  • sharing code segments between processes;
  • separating code, data, heap, and stack.

It also gives finer-grained protection.

For example:

\[ \text{code}: R+X,\qquad \text{data}: R+W,\qquad \text{stack}: R+W. \]

3.7.4. Disadvantages

Each segment is still physically contiguous.

This causes fragmentation.

If a heap segment contains an unused hole in the middle, the OS cannot simply reclaim that hole unless it can split the segment.

But splitting a segment changes the segment structure visible to the application, which can break pointers.

Thus segmentation improves flexibility but does not solve the fundamental problem of contiguous physical allocation.

Memory management remains difficult because the OS must find contiguous physical intervals of appropriate size.
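
The fragmentation problem can be shown with a small first-fit search. The hole list and the name first_fit are assumptions made up for this example.

```python
def first_fit(holes, size):
    """Return the start of the first hole of at least `size` bytes, or None."""
    for start, length in holes:
        if length >= size:
            return start
    return None


# Free physical memory as (start, length) pairs; values are illustrative.
holes = [(0x0000, 0x300), (0x1000, 0x200), (0x3000, 0x300)]
total_free = sum(length for _, length in holes)

# 0x800 bytes are free in total, but no single hole can hold 0x400 bytes.
placement = first_fit(holes, 0x400)
```

Here placement is None even though total_free is twice the requested size, because a segment needs one contiguous interval.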

3.7.5. Segmentation on x86

32-bit x86 has segmentation built in.

The segment table is called the Global Descriptor Table, or GDT.

x86 has segment registers such as:

  • CS: code segment;
  • DS: data segment;
  • ES, FS, GS: extra data segments;
  • SS: stack segment.

Instruction fetches implicitly use CS.

Normal loads and stores typically use DS.

Stack push/pop operations use SS.

Some instructions can explicitly specify another segment, for example:

\[ \texttt{movl \%eax, \%es:(\%edi)} \]

In Pintos, the GDT appears in userprog/gdt.c.

In modern 64-bit x86, segmentation is mostly disabled or ignored.

Most segment bases are effectively zero and bounds cover the whole address space.

However, FS and GS are still used for special purposes such as thread-local storage.

3.8. Paged address translation

3.8.1. Motivation

Segmentation still requires each segment to be physically contiguous.

Paging solves this by managing memory in small fixed-size chunks.

Typical page size:

\[ 4\text{ KB} = 2^{12}\text{ bytes}. \]

The virtual address space is split into virtual pages.

Physical memory is split into physical page frames.

A page table maps:

\[ \text{virtual page number} \longrightarrow \text{physical page frame number}. \]

3.8.2. Process view

Each process sees a contiguous virtual address space.

But its virtual pages can be mapped to arbitrary physical frames.

Thus:

  • virtual page 0 can map to physical frame 0x600;
  • virtual page 1 can map to physical frame 0xB00;
  • virtual page 2 can map to physical frame 0x800;
  • and so on.

The physical frames do not need to be contiguous.

This is the major advantage of paging.

3.8.3. Virtual address structure

A virtual address is split into:

\[ \text{virtual address} = (\text{virtual page number}, \text{offset}). \]

For 4 KB pages:

\[ 4\text{ KB}=2^{12}. \]

So the lower 12 bits are the page offset.

On a 32-bit system:

\[ 32 - 12 = 20 \]

bits remain for the virtual page number.

The offset is not translated.

Only the page number is translated.

So:

\[ (\text{VPN}, \text{offset}) \longrightarrow (\text{PFN}, \text{offset}). \]
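
A minimal sketch of this translation, assuming 4 KB pages and a page table modelled as a Python dictionary. The frame numbers reuse the example values from the process view, read as hexadecimal. The names translate and PageFault are hypothetical.

```python
PAGE_SHIFT = 12
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class PageFault(Exception):
    """Raised when a virtual page has no mapping."""


# Virtual page number -> physical frame number.
page_table = {0: 0x600, 1: 0xB00, 2: 0x800}


def translate(virt, table):
    vpn = virt >> PAGE_SHIFT        # upper bits: virtual page number
    offset = virt & OFFSET_MASK     # lower 12 bits: copied unchanged
    if vpn not in table:
        raise PageFault(hex(virt))
    return (table[vpn] << PAGE_SHIFT) | offset
```

For example, virtual address 0x1234 lies in virtual page 1 at offset 0x234, so it translates to physical address 0xB00234.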

3.8.4. Why page sizes are powers of two

Page sizes are powers of two so that splitting the address is cheap.

With a 4 KB page:

  • lower 12 bits: offset;
  • upper bits: page number.

No division or modulo is required.

The hardware can use simple bit slicing.
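
The equivalence between arithmetic and bit slicing can be checked directly. This sketch assumes 4 KB pages; the function names are made up.

```python
PAGE_SIZE = 4096
PAGE_SHIFT = PAGE_SIZE.bit_length() - 1   # 12 for a 4 KB page


def split_arithmetic(addr):
    # General approach: one division and one modulo.
    return divmod(addr, PAGE_SIZE)


def split_bits(addr):
    # Power-of-two page size: a shift and a mask are enough.
    return addr >> PAGE_SHIFT, addr & (PAGE_SIZE - 1)
```

Both functions return the same (page number, offset) pair for every non-negative address, but the second one needs no divider.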

3.8.5. Page table

Each process has its own page table.

The page table stores the mapping from virtual pages to physical frames.

On a context switch, the OS tells the processor to use a different page table.

This is analogous to switching the segment table, but more fine-grained.

3.8.6. Page table entry

A page table entry, or PTE, contains:

  • physical page frame number;
  • present/valid bit;
  • write permission;
  • user/supervisor permission;
  • caching control bits;
  • accessed bit;
  • dirty bit;
  • global bit;
  • possibly execute-disable or similar permissions on modern systems.

The exact layout is architecture-specific.

3.9. x86 page table entry

3.9.1. Page frame number

For 32-bit x86 with 4 KB pages:

  • page offset: 12 bits;
  • page frame number: 20 bits.

A 32-bit PTE uses the high bits for the physical frame address and the low bits for control and permission flags.

3.9.2. Present bit

The present bit says whether the mapping is valid.

If

\[ P = 0, \]

then the entry is not present.

If the process accesses that virtual page, the CPU raises a page fault.

If

\[ P = 1, \]

then the entry contains a valid mapping.

3.9.3. Writable bit

The writable bit says whether writes are allowed.

If the page is not writable and user code tries to write, the processor raises an exception.

This enables read-only pages such as code pages.

3.9.4. User/supervisor bit

The user/supervisor bit says whether user-mode code can access the page.

If the bit says supervisor-only, then only kernel-mode code can access it.

This is how the kernel can be mapped in the page table while still being protected from user code.

3.9.5. Caching bits

Some bits control caching behavior.

These are especially important for device memory and low-level system programming.

For normal application memory, the default caching behavior is usually used.

3.9.6. Accessed bit

The accessed bit is set by hardware when the page is read or written.

This gives the OS feedback about which pages have been used.

It is useful for page replacement algorithms.

3.9.7. Dirty bit

The dirty bit is set by hardware when the page is written.

This tells the OS whether the page’s contents differ from the backing copy on disk.

It is important for demand paging and swapping.

For example:

  • clean page: can be discarded and reloaded later;
  • dirty page: must be written back before eviction.
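
The eviction rule can be sketched as follows. The page is modelled as a dictionary and the backing store as another dictionary; both representations and the name evict are assumptions for illustration.

```python
def evict(page, backing_store):
    """Evict a page; return True if a write-back was needed."""
    if page["dirty"]:
        # The in-memory contents are newer than the backing copy.
        backing_store[page["vpn"]] = page["data"]
        page["dirty"] = False
        return True
    # Clean page: the backing copy is already up to date.
    return False
```

A real kernel would also clear the present bit and free the frame; this sketch shows only the write-back decision.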

3.9.8. Global bit

The global bit tells the processor that this mapping remains valid across address-space switches.

This is useful for kernel mappings that are shared across processes.

It is mainly a performance optimization related to translation caching.
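
The flags described in this section can be packed and unpacked with a few masks. The bit positions below follow the 32-bit x86 PTE format for 4 KB pages as I understand it from the Intel manuals; the helper names make_pte and decode_pte are made up for this sketch.

```python
# Flag positions of a 32-bit x86 PTE for a 4 KB page.
PTE_P = 1 << 0      # present
PTE_W = 1 << 1      # writable
PTE_U = 1 << 2      # user-accessible (clear means supervisor-only)
PTE_PWT = 1 << 3    # caching control: write-through
PTE_PCD = 1 << 4    # caching control: cache disable
PTE_A = 1 << 5      # accessed, set by hardware
PTE_D = 1 << 6      # dirty, set by hardware on write
PTE_G = 1 << 8      # global
PTE_FRAME_SHIFT = 12


def make_pte(pfn, flags):
    # High 20 bits: frame number. Low 12 bits: flags.
    return (pfn << PTE_FRAME_SHIFT) | flags


def decode_pte(pte):
    return {
        "pfn": pte >> PTE_FRAME_SHIFT,
        "present": bool(pte & PTE_P),
        "writable": bool(pte & PTE_W),
        "user": bool(pte & PTE_U),
        "accessed": bool(pte & PTE_A),
        "dirty": bool(pte & PTE_D),
        "global": bool(pte & PTE_G),
    }
```

For example, make_pte(0x12345, PTE_P | PTE_W) yields 0x12345003: frame 0x12345, present, writable, supervisor-only.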

3.10. Kernel mappings inside process page tables

3.10.1. Why the kernel is mapped

When virtual memory translation is enabled, all instruction fetches and memory accesses go through the page table.

This includes kernel code.

If a user process triggers an exception or system call, the CPU must jump to kernel code.

Therefore some kernel code must be mapped in the currently active page table.

Otherwise the processor could not even fetch the exception handler.

3.10.2. Why user code cannot access the kernel mapping

Although kernel pages are present in the page table, they are marked supervisor-only.

So:

  • kernel mode can access them;
  • user mode cannot.

If user mode tries to access those pages, the processor raises an exception.

Thus:

\[ \text{mapped} \neq \text{accessible to user mode}. \]
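
The distinction can be made concrete with a simplified permission check. The function name and its parameters are hypothetical, and the model is stricter than real x86 in one respect: it always blocks kernel writes to read-only pages, which on x86 depends on a control register setting.

```python
def access_allowed(present, writable, user, *, user_mode, is_write):
    # An unmapped page is never accessible.
    if not present:
        return False
    # Supervisor-only pages are mapped but closed to user mode.
    if user_mode and not user:
        return False
    # Writes need the writable bit (simplified: applies to both modes).
    if is_write and not writable:
        return False
    return True
```

A kernel page with present = True and user = False passes the check in kernel mode and fails it in user mode, although the mapping is identical.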

3.10.3. Kernel at high virtual addresses

A common layout is:

  • user program at lower virtual addresses;
  • kernel at high virtual addresses.

This is historically common in Unix-like systems and also appears in teaching kernels such as Pintos.

The kernel may be physically located at low memory but virtually mapped high.

3.10.4. Page tables themselves must be accessible to the kernel

Page tables are stored in memory.

The kernel must modify them when:

  • creating a process;
  • allocating memory;
  • freeing memory;
  • handling page faults;
  • swapping pages in or out.

Therefore the kernel needs a way to access the memory containing the page tables.

This creates subtle “chicken and egg” problems:

  • to change mappings, the kernel must write page tables;
  • to write page tables, the page-table memory itself must be mapped.

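OS kernels resolve this by making sure the page-table memory is always reachable through a kernel mapping, for example by keeping physical memory permanently mapped into the kernel's part of the address space.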

3.11. Meltdown caveat

Traditionally, operating systems mapped the kernel into every process’s page table, protected by supervisor-only bits.

This was considered safe because user mode could not architecturally access those pages.

Meltdown showed that speculative execution can sometimes transiently access privileged memory before permission checks fully take effect.

Although the architectural result is discarded, microarchitectural side effects such as cache state can leak information.

As a mitigation, modern operating systems often avoid mapping most of the kernel into user page tables.

Instead, they map only a small trampoline needed to enter the kernel, then switch to a separate kernel page table.

This improves security but adds overhead, because switching page tables is expensive.

For the conceptual model in this lecture, it is still useful to think of the kernel as mapped but supervisor-only.

3.12. Advantages and disadvantages of paging

3.12.1. Advantages

Paging provides:

  • fine-grained allocation;
  • fine-grained protection;
  • non-contiguous physical memory allocation;
  • flexible virtual address spaces;
  • easy sharing of pages;
  • copy-on-write possibilities;
  • demand paging;
  • swapping;
  • memory-mapped files;
  • lazy allocation;
  • page-level permissions.

The OS can map any virtual page to any physical frame.

The OS can also remove a mapping temporarily.

If the process later accesses it, the CPU raises a page fault, and the OS can restore the page.

This is the basis for demand paging.
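
The remove-fault-restore cycle can be simulated in a few lines. This is a toy model under stated assumptions: page contents are stored directly in dictionaries, and the class and method names are invented for illustration.

```python
class PageFault(Exception):
    def __init__(self, vpn):
        super().__init__(vpn)
        self.vpn = vpn


class AddressSpace:
    def __init__(self):
        self.page_table = {}   # vpn -> contents, present pages only
        self.backing = {}      # vpn -> contents saved by the OS
        self.faults = 0

    def evict(self, vpn):
        # OS removes the mapping and keeps the contents elsewhere.
        self.backing[vpn] = self.page_table.pop(vpn)

    def _hardware_read(self, vpn):
        # "Hardware": fault if the page is not present.
        if vpn not in self.page_table:
            raise PageFault(vpn)
        return self.page_table[vpn]

    def read(self, vpn):
        try:
            return self._hardware_read(vpn)
        except PageFault as fault:
            # "OS": restore the page, then retry the access.
            self.faults += 1
            self.page_table[fault.vpn] = self.backing.pop(fault.vpn)
            return self._hardware_read(vpn)
```

After an eviction, the first read of the page takes one fault and returns the original contents; later reads take no fault.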

3.12.2. Disadvantages

Paging adds significant complexity.

Possible problems include:

  • page table management bugs;
  • page fault handling complexity;
  • stale translation caches;
  • page table memory overhead;
  • hardware complexity;
  • performance overhead on every memory access.

Address translation must happen for instruction fetches and data accesses.

Modern processors hide most of this cost by caching recent translations, but a translation that misses the cache still requires extra memory accesses to read the page table.

3.13. Implementing page tables

3.13.1. Linear page table

The simplest implementation is a linear array.

The virtual page number is used as an index into the array.

For a 32-bit address space with 4 KB pages:

\[ \frac{2^{32}}{2^{12}} = 2^{20} \]

virtual pages exist.

If each PTE is 4 bytes, the page table size is:

\[ 2^{20} \cdot 4 = 4\text{ MB}. \]

This is already large, especially for older 32-bit systems, and it is per process.

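For a 64-bit system, a linear table is infeasible.

With 4 KB pages, a full 64-bit address space has \(2^{64} / 2^{12} = 2^{52}\) virtual pages. Even with 4-byte entries the table would occupy \(2^{54}\) bytes, which is the roughly 18 petabytes quoted on the slide. With 8-byte entries it would be twice that.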

Thus linear page tables are too large.
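
The size arithmetic can be reproduced with a short calculation. The function name is made up, and the 64-bit case assumes 4-byte entries over a full 64-bit address space.

```python
def linear_table_bytes(address_bits, page_bits, pte_bytes):
    # One PTE per virtual page, whether the page is used or not.
    num_pages = 1 << (address_bits - page_bits)
    return num_pages * pte_bytes


size_32 = linear_table_bytes(32, 12, 4)   # 2**22 bytes = 4 MB
size_64 = linear_table_bytes(64, 12, 4)   # 2**54 bytes, about 18 * 10**15
```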

3.13.2. Increasing page size

One way to reduce the number of PTEs is to increase the page size.

But this is not a complete solution.

Larger pages reduce page table size but increase internal fragmentation.

They also reduce the granularity of protection and memory management.

Modern systems often support multiple page sizes, but large pages are an optimization, not the basic solution.
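
Internal fragmentation is easy to quantify. The sketch below uses a hypothetical helper, wasted_bytes, and picks 2 MB as an example of a large page size.

```python
def wasted_bytes(alloc_size, page_size):
    # Round the allocation up to a whole number of pages.
    pages = -(-alloc_size // page_size)   # ceiling division
    return pages * page_size - alloc_size


small = wasted_bytes(5000, 4096)               # two 4 KB pages
large = wasted_bytes(5000, 2 * 1024 * 1024)    # one 2 MB page
```

A 5000-byte allocation wastes 3192 bytes with 4 KB pages, but more than 2 million bytes with a single 2 MB page.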

3.13.3. Hashed page tables

Another possibility is a hash table from virtual page number to PTE.

Advantages:

  • size can be proportional to actually used pages;
  • avoids a huge sparse linear array.

Disadvantages:

  • hash collisions;
  • unpredictable lookup cost;
  • dependent memory accesses while resolving collisions;
  • poor cache locality;
  • harder hardware implementation.

Because every memory access may require address translation, unpredictable translation latency slows down every load, store, and instruction fetch.

Thus hashed page tables are not the dominant design on common architectures.

3.13.4. Segmented / radix page tables

The practical solution is a multi-level page table, also called:

  • segmented page table;
  • hierarchical page table;
  • radix tree page table.

Observation:

Most virtual address spaces are sparse.

Large regions are unmapped.

So we should not allocate page table memory for unmapped regions.

A multi-level page table breaks the virtual page number into chunks.

Each chunk indexes one level of the tree.

Only subtrees for actually used virtual address ranges need to exist.

3.13.5. Two-level translation example

In a simplified two-level page table:

  1. The high-order bits of the virtual page number index the top-level table.
  2. The top-level entry points to a second-level table.
  3. The remaining bits index the second-level table.
  4. The second-level entry gives the physical frame number and permissions.

If a whole region is unused, the top-level entry can be invalid, and no second-level table is allocated.

This saves memory for sparse address spaces.
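
The four steps can be sketched with nested Python lists. The 10/10/12 bit split is an assumption borrowed from 32-bit x86, and the class and method names are invented for this example.

```python
L1_BITS, L2_BITS, OFFSET_BITS = 10, 10, 12


class PageFault(Exception):
    pass


class TwoLevelPageTable:
    def __init__(self):
        # Top-level table; None marks an invalid entry.
        self.top = [None] * (1 << L1_BITS)

    @staticmethod
    def _split(virt):
        offset = virt & ((1 << OFFSET_BITS) - 1)
        l2 = (virt >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
        l1 = virt >> (OFFSET_BITS + L2_BITS)
        return l1, l2, offset

    def map(self, virt, pfn):
        l1, l2, _ = self._split(virt)
        # Allocate a second-level table only when first needed.
        if self.top[l1] is None:
            self.top[l1] = [None] * (1 << L2_BITS)
        self.top[l1][l2] = pfn

    def translate(self, virt):
        l1, l2, offset = self._split(virt)
        second = self.top[l1]
        if second is None or second[l2] is None:
            raise PageFault(hex(virt))
        return (second[l2] << OFFSET_BITS) | offset

    def second_level_tables(self):
        return sum(table is not None for table in self.top)
```

Mapping one page near the bottom and one near the top of a 32-bit address space allocates only two second-level tables out of a possible 1024.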

3.13.6. Page table base register

The processor needs to know where the top-level page table is.

It stores the physical address of the top-level page table in a control register.

On x86, this register is CR3.

During a context switch, the OS changes CR3 to point to the new process’s page table.

This switches the active address space.
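
The effect of changing the page table base register can be modelled with a toy CPU. This is only an illustration: the attribute cr3 holds a Python reference to a dictionary rather than a physical address, and the class name and memory contents are made up.

```python
class ToyCPU:
    def __init__(self):
        self.cr3 = None   # stands in for the page table base register

    def load(self, virt, memory):
        vpn, offset = virt >> 12, virt & 0xFFF
        pfn = self.cr3[vpn]               # walk the active page table
        return memory[(pfn << 12) | offset]


# Two processes map virtual page 0 to different physical frames.
page_table_a = {0: 0x10}
page_table_b = {0: 0x20}
memory = {0x10004: "data of A", 0x20004: "data of B"}

cpu = ToyCPU()
cpu.cr3 = page_table_a
seen_by_a = cpu.load(0x4, memory)

cpu.cr3 = page_table_b                    # context switch
seen_by_b = cpu.load(0x4, memory)
```

The same virtual address 0x4 yields different data before and after the switch, because only the active page table changed.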

3.14. x86 paging structures

3.14.1. 32-bit x86 paging

In 32-bit x86 with 4 KB pages, the linear address is split into:

\[ \text{Directory index} \mid \text{Table index} \mid \text{Offset}. \]

Typical split:

  • 10 bits: page directory index;
  • 10 bits: page table index;
  • 12 bits: page offset.

The page directory is pointed to by CR3.

A Page Directory Entry, PDE, points to a page table.

A Page Table Entry, PTE, points to a physical 4 KB page.

3.14.2. 64-bit x86 paging

64-bit x86 uses more levels.

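The slide shows a 4-level structure. The virtual address is split into four table indices followed by the page offset:

  • PML4 index;
  • Page Directory Pointer Table index;
  • Page Directory index;
  • Page Table index;
  • Offset.

The typical split is:

  • 9 bits per level;
  • 12 bits offset.

Together these cover \(4 \cdot 9 + 12 = 48\) bits of virtual address.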

Each level indexes a table of entries.

The final PTE gives the physical page frame.

This multi-level structure keeps page table memory proportional to used portions of the virtual address space.
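
The 9/9/9/9/12 split can be written out with shifts and masks. The function name split_x86_64 is hypothetical, and the sketch ignores the upper 16 address bits.

```python
def split_x86_64(virt):
    """Split a virtual address into four 9-bit indices and a 12-bit offset."""
    offset = virt & 0xFFF
    pt = (virt >> 12) & 0x1FF      # Page Table index
    pd = (virt >> 21) & 0x1FF      # Page Directory index
    pdpt = (virt >> 30) & 0x1FF    # Page Directory Pointer Table index
    pml4 = (virt >> 39) & 0x1FF    # PML4 index
    return pml4, pdpt, pd, pt, offset
```

Each 9-bit index selects one of 512 entries in its table.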

3.15. Big picture

Address translation is one of the central mechanisms of operating systems.

It supports:

  • protection;
  • process isolation;
  • flexible memory placement;
  • efficient physical memory allocation;
  • demand paging;
  • swapping;
  • memory-mapped files;
  • copy-on-write;
  • sharing;
  • kernel/user separation.

The same basic mechanism is reused for many different OS features.

Conceptually:

\[ \text{virtual address} \rightarrow \text{translation structure} \rightarrow \text{physical address}. \]

The OS controls the translation structure.

The hardware performs the translation efficiently on each memory access.

This cooperation between hardware and OS is what makes modern process isolation and virtual memory possible.

Author: Lowtroo

Created on: 2026-05-30 Sat 10:00
