Synchronization 3 & Scheduling

1. Lecture 5: Implementing Semaphores and Entering Scheduling

1.1. Administrative notes

  • The Pintos project has been posted.
  • There are slip days for project milestones.
    • In lecture 5, the professor was still unsure whether the number was two or three.
    • In lecture 6, this was clarified: there are four slip days across all four projects.
  • Slip days are based on the commit timestamp in the branch used for submission.
  • The recommendation is not to rely on them, but to reserve them for emergencies.
  • The TAs offer walkthrough sessions for the first Pintos project and its source code.
    • These are meant especially for environment issues, build problems, editor/container setup problems, and general project-startup confusion.

2. Semaphore Implementation

2.1. Why semaphores need an OS implementation

Earlier lectures treated semaphores as a convenient abstraction:

  • They provide portable synchronization.
  • Threads can wait efficiently.
  • Waiting threads do not have to busy-wait.
  • They can be used for both:
    • mutual exclusion,
    • condition synchronization.

However, hardware does not directly provide this high-level abstraction.

The operating system has to implement semaphores using lower-level mechanisms such as:

  • interrupt control,
  • atomic hardware instructions,
  • scheduler support,
  • wait queues.

The important point is:

A semaphore is not just an integer. It is an integer plus scheduler-visible waiting behavior.

2.2. Semaphores in kernel space and user space

The OS needs semaphores in two places.

2.2.1. Kernel-space semaphores

Inside the kernel, many components need synchronization:

  • device drivers,
  • file systems,
  • system-call implementations,
  • memory-management structures,
  • scheduler data structures,
  • kernel-internal queues and lists.

Inside the kernel, using a semaphore is just a normal function call.

For example:

sema_down(&s);
sema_up(&s);

2.2.2. User-space semaphores

User-space programs also need synchronization.

For example:

  • threads in the same process may share memory,
  • libraries may protect shared state,
  • application code may use locks or semaphores.

The simple model is:

  • user-space code calls a library function;
  • the library function performs a system call;
  • the system call enters the kernel;
  • the kernel semaphore implementation performs the real blocking or waking.

So, in the simplest design:

The user-space semaphore is mostly a proxy for a kernel semaphore.

2.2.3. Why system-call-only semaphores are inefficient

System calls are expensive compared with ordinary user-space operations.

For a binary semaphore used like a lock, most acquisitions in well-designed code are uncontended.

Typical case:

  • nobody else holds the lock;
  • the thread can proceed immediately;
  • no blocking is needed.

Rare case:

  • another thread holds the lock;
  • the current thread must block;
  • the kernel must become involved.

If every lock/unlock operation always enters the kernel, then the common uncontended case pays unnecessary system-call overhead.

2.2.4. Futexes

Linux uses a more advanced design called a futex.

The name means roughly:

fast user-space mutex

The idea:

  • keep the common-case state in user space;
  • avoid a system call if the operation can complete without blocking;
  • call into the kernel only when blocking or waking is actually needed.

Conceptually:

  • the semaphore/lock counter is in user space;
  • the wait queue is still managed by the kernel;
  • the abstraction is split across user space and kernel space.

This course does not go deeply into futexes, but the point is important:

Efficient synchronization tries to avoid kernel entry in the uncontended case.
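
The fast-path/slow-path split can be sketched with C11 atomics. This is only a minimal illustration, not the Linux futex API: kernel_wait() and kernel_wake_one() are hypothetical stand-ins for the system calls and merely count how often the "kernel" would be entered, and the three-state encoding (0, 1, 2) is an assumption chosen for the sketch.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel side. A real implementation would
 * issue a futex-style system call here; these stubs only count calls. */
int kernel_calls = 0;

void kernel_wait(atomic_int *addr, int expected)
{
    (void)addr; (void)expected;
    kernel_calls++;               /* real code: sleep while *addr == expected */
}

void kernel_wake_one(atomic_int *addr)
{
    (void)addr;
    kernel_calls++;               /* real code: wake one sleeper on addr */
}

/* state: 0 = unlocked, 1 = locked without waiters, 2 = locked, waiters possible */
typedef struct {
    atomic_int state;
} fast_lock_t;

void fast_lock(fast_lock_t *l)
{
    int expected = 0;

    /* Fast path: a single atomic compare-and-swap in user space. */
    if (atomic_compare_exchange_strong(&l->state, &expected, 1))
        return;

    /* Slow path: mark the lock as contended and sleep in the kernel. */
    while (atomic_exchange(&l->state, 2) != 0)
        kernel_wait(&l->state, 2);
}

void fast_unlock(fast_lock_t *l)
{
    /* Enter the kernel only if a waiter may be sleeping. */
    if (atomic_exchange(&l->state, 0) == 2)
        kernel_wake_one(&l->state);
}
```

In the uncontended case, fast_lock() followed by fast_unlock() never touches the stubs, which is exactly the property the futex design is after.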

2.3. Semaphore data structure

A semaphore is conceptually:

typedef struct {
    int count;
    wait_queue q;
} Semaphore;

It contains:

  • a counter,
  • a queue of waiting threads/processes.

The two operations are:

  • P(), also called:
    • wait,
    • down,
    • acquire.
  • V(), also called:
    • signal,
    • up,
    • release.

2.3.1. P operation

The semantics of P(s) are:

  1. Wait until the counter is positive.
  2. Atomically decrement the counter.
  3. Proceed.

In abstract form:

P(s):
    wait until s.count > 0
    atomically:
        s.count -= 1

2.3.2. V operation

The semantics of V(s) are:

  1. Increment the counter.
  2. If there are waiting threads, wake one of them.

In abstract form:

V(s):
    atomically:
        s.count += 1
        wake one waiter if needed

2.4. Why implementing semaphores is nontrivial

The counter is just a machine word, but the wait queue is not.

The implementation must atomically manipulate:

  • the counter,
  • the wait queue,
  • the current thread state,
  • the scheduler’s ready/waiting structures.

There is no single hardware instruction that atomically says:

Check the semaphore counter, enqueue the current thread if needed, block the thread, and later wake it safely.

So semaphores must be built from lower-level primitives.

3. Semaphore Implementation on a Uniprocessor

3.1. Atomicity by disabling interrupts

A uniprocessor has only one CPU core, so at most one thread executes kernel code at any instant.

If interrupts are disabled, then the currently executing kernel code will not be preempted by a timer interrupt or device interrupt.

Therefore:

On a uniprocessor, disabling interrupts briefly can make a critical section atomic with respect to thread interleavings.

But this only works if interrupts are disabled for a very short time.

Disabling interrupts for too long is bad because:

  • timer interrupts cannot happen;
  • device interrupts cannot be handled;
  • the system becomes unresponsive;
  • scheduling is delayed;
  • I/O latency increases.

3.2. P operation on a uniprocessor

A simplified implementation:

void P(Semaphore *s) {
    Disable interrupts;

    while (s->count <= 0) {
        set_state(current_thread, WAITING);
        add_to_queue(&s->q, current_thread);

        Enable interrupts;
        schedule();              /* context-switch away */
        Disable interrupts;
    }

    s->count -= 1;

    Enable interrupts;
}

3.2.1. Step-by-step explanation

  1. Disable interrupts.
    • This begins the atomic region.
    • No other thread can interleave on the same core.
  2. Check whether the semaphore count is positive.
    • If yes, the thread can proceed.
    • If no, the thread must block.
  3. If the count is zero or negative:
    • mark the current thread as waiting;
    • add it to the semaphore’s wait queue.
  4. Re-enable interrupts before calling the scheduler.
    • The system must not leave interrupts disabled while another thread runs.
    • Blocking is a long operation, not a short atomic critical section.
  5. Call schedule().
    • The current thread is no longer runnable.
    • The scheduler chooses another ready thread.
  6. When this thread is eventually woken and scheduled again:
    • execution resumes after schedule();
    • interrupts are disabled again;
    • the loop re-checks the semaphore count.
  7. Once the count is positive:
    • decrement it;
    • re-enable interrupts;
    • return.

3.2.2. Why the check is in a loop

The code uses:

while (s->count <= 0) {
    ...
}

rather than:

if (s->count <= 0) {
    ...
}

Reasons:

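  • after waking up, the thread must re-check the condition;
  • multiple waiters may exist;
  • the wakeup does not by itself mean that the thread can safely proceed: between the wakeup and the moment the woken thread actually runs, another thread may call P() and consume the count;
  • looping is the standard pattern for condition synchronization.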

3.3. V operation on a uniprocessor

A simplified implementation:

void V(Semaphore *s) {
    Disable interrupts;

    s->count += 1;

    if (!isEmpty(&s->q)) {
        waiting_thread = RemoveFirst(&s->q);
        set_state(waiting_thread, READY);
        add_to_ready_queue(waiting_thread);
    }

    Enable interrupts;
}

3.3.1. Step-by-step explanation

  1. Disable interrupts.
    • The semaphore state and wait queue must be updated atomically.
  2. Increment the counter.
  3. If there is a waiting thread:
    • remove one thread from the semaphore wait queue;
    • mark it as ready;
    • add it to the scheduler’s ready queue.
  4. Re-enable interrupts.

The exact order of incrementing the counter and waking a thread can vary in implementation, as long as the whole sequence is atomic with respect to other semaphore operations.
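
The interaction between P(), V(), and the re-check loop can be traced with a small single-threaded model. This is a sketch under stated assumptions, not kernel code: threads are represented by plain integer ids, toy_p_attempt() models one pass through the P() loop, and all toy_ names and the fixed-size queue are hypothetical.

```c
#include <stdbool.h>

#define MAX_WAITERS 8

typedef struct {
    int count;
    int queue[MAX_WAITERS];   /* FIFO of waiting thread ids (no overflow check) */
    int head, len;
} toy_sem_t;

/* One pass through the P() loop for thread `tid`.
 * Returns true if the thread got the semaphore, false if it was
 * enqueued and would now go to sleep. */
bool toy_p_attempt(toy_sem_t *s, int tid)
{
    if (s->count <= 0) {
        s->queue[(s->head + s->len) % MAX_WAITERS] = tid;
        s->len++;
        return false;
    }
    s->count -= 1;
    return true;
}

/* V(): increment the count and return the id of the woken thread,
 * or -1 if nobody was waiting. */
int toy_v(toy_sem_t *s)
{
    int woken = -1;

    s->count += 1;
    if (s->len > 0) {
        woken = s->queue[s->head];
        s->head = (s->head + 1) % MAX_WAITERS;
        s->len--;
    }
    return woken;
}
```

A possible trace: thread 1 blocks on an empty semaphore, toy_v() wakes it, but thread 2 calls toy_p_attempt() first and takes the count. When thread 1 finally re-checks, it finds the count at zero and has to wait again, which is why the check sits in a while loop.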

3.4. Pintos semaphore structure

Pintos defines a semaphore roughly as:

struct semaphore {
    unsigned value;       /* Current value. */
    struct list waiters;  /* List of waiting threads. */
};

Important details:

  • value is unsigned, so it cannot be negative.
  • waiters is a list of blocked threads waiting on this semaphore.

3.5. Pintos sema_down

Pintos uses sema_down() for the P/down/wait operation.

Important features:

  • it disables interrupts;
  • it stores the previous interrupt level;
  • it loops while the semaphore value is zero;
  • it pushes the current thread onto the waiters list;
  • it blocks the current thread;
  • after waking, it decrements the value;
  • finally, it restores the old interrupt level.

The relevant idea:

old_level = intr_disable();

while (sema->value == 0) {
    list_push_back(&sema->waiters, &thread_current()->elem);
    thread_block();
}

sema->value--;

intr_set_level(old_level);

3.5.1. Why Pintos stores the old interrupt level

The code does not simply do:

intr_disable();
/* critical section */
intr_enable();

Instead, it does:

old_level = intr_disable();
/* critical section */
intr_set_level(old_level);

Reason:

  • the caller may already have disabled interrupts;
  • if sema_down() blindly enabled interrupts at the end, it would break the caller’s assumption that interrupts remain disabled;
  • restoring the old interrupt level makes the code composable/reentrant with respect to interrupt state.

Example problem:

intr_disable();

/* caller expects this whole region to be atomic */
sema_down(&s);

/* still expects interrupts disabled here */

intr_enable();

If sema_down() unconditionally enabled interrupts, the caller’s atomic section would be broken.
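
The difference between the two styles can be made concrete with a simulated interrupt flag. The sketch below does not use the Pintos functions; sim_intr_enabled and the sim_/helper_ names are hypothetical stand-ins that only model the save-and-restore idea.

```c
#include <stdbool.h>

/* Simulated per-CPU interrupt flag (true = interrupts enabled). */
bool sim_intr_enabled = true;

/* Disable interrupts and return the previous state. */
bool sim_intr_disable(void)
{
    bool old = sim_intr_enabled;
    sim_intr_enabled = false;
    return old;
}

void sim_intr_set_level(bool level)
{
    sim_intr_enabled = level;
}

/* Composable: leaves the interrupt state exactly as it found it. */
void helper_save_restore(void)
{
    bool old = sim_intr_disable();
    /* critical section */
    sim_intr_set_level(old);
}

/* Not composable: always returns with interrupts enabled. */
void helper_blind_enable(void)
{
    sim_intr_disable();
    /* critical section */
    sim_intr_set_level(true);
}
```

If a caller disables interrupts and then calls helper_save_restore(), interrupts are still off afterwards; after helper_blind_enable() they are on, and the caller's atomic region has silently ended.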

3.5.2. Blocking while interrupts are already disabled

The lecture discussed an important subtlety.

Suppose:

  • the caller has disabled interrupts;
  • then calls sema_down();
  • the semaphore value is zero;
  • therefore sema_down() must block.

This is dangerous.

Blocking is a long operation. If the code assumed that interrupts remain disabled across this whole region, that assumption is no longer meaningful once the thread sleeps and other threads run.

In many OS kernels, calling a potentially blocking function while in an atomic/interrupt-disabled context is considered a bug.

Linux, for example, has checks that can detect such misuse and may panic or warn.

The principle:

Do not call potentially sleeping/blocking functions from contexts where sleeping is forbidden.

3.6. Pintos sema_up

Pintos uses sema_up() for the V/up/signal operation.

Conceptually:

old_level = intr_disable();

if (!list_empty(&sema->waiters)) {
    thread_unblock(...);
}

sema->value++;

intr_set_level(old_level);

Important points:

  • sema_up() may be called from an interrupt handler.
  • It does not sleep.
  • It wakes a waiting thread, if any.
  • It increments the semaphore value.
  • The whole operation is protected by disabling interrupts.

4. Semaphore Implementation on Multiprocessors

4.1. Why disabling interrupts is insufficient

On a uniprocessor:

Disabling interrupts prevents interleaving on the only CPU.

On a multiprocessor:

Disabling interrupts only affects the current core.

Other cores may still be running and may still access the same semaphore structure.

Therefore, disabling interrupts does not protect against:

  • concurrent updates by other cores,
  • simultaneous access to the semaphore counter,
  • simultaneous manipulation of the wait queue.

So the uniprocessor semaphore implementation is not correct on a multiprocessor.

4.2. What is needed on a multiprocessor

We need synchronization across cores.

The semaphore state must be protected by a lower-level multiprocessor-safe primitive.

The lecture introduces:

a spin lock protecting the semaphore state.

A multiprocessor semaphore can be represented as:

typedef struct {
    spin_lock_t slock;
    int count;
    wait_queue q;
} Semaphore;

The spin lock protects:

  • count,
  • q,
  • any state transitions associated with the semaphore.

4.3. Multiprocessor P operation

Simplified implementation:

void P(Semaphore *s) {
    Disable interrupts;
    spin_lock(&s->slock);

    while (s->count <= 0) {
        set_state(current_thread, WAITING);
        add_to_queue(&s->q, current_thread);

        spin_unlock(&s->slock);
        Enable interrupts;

        schedule();              /* context-switch away */

        Disable interrupts;
        spin_lock(&s->slock);
    }

    s->count -= 1;

    spin_unlock(&s->slock);
    Enable interrupts;
}

4.3.1. Important rule: release the spin lock before sleeping

The key point is:

Never sleep while holding a spin lock.

If a thread blocks while holding the spin lock:

  • other cores trying to acquire the spin lock will spin;
  • the sleeping thread is not running and cannot release the lock;
  • the system may deadlock or waste huge amounts of CPU time.

So the sequence is:

  1. acquire spin lock;
  2. inspect/update semaphore state;
  3. if blocking is necessary:
    • put current thread on wait queue;
    • release spin lock;
    • enable interrupts;
    • call scheduler;
  4. when resumed:
    • disable interrupts again;
    • reacquire spin lock;
    • re-check the condition.

4.3.2. Why disable interrupts in the multiprocessor semaphore implementation?

In the multiprocessor version, the semaphore operation uses both:

Disable interrupts;
spin_lock(&s->slock);

/* manipulate semaphore count and wait queue */

spin_unlock(&s->slock);
Enable interrupts;

At first this may look strange, because disabling interrupts alone does not make the semaphore operation atomic on a multiprocessor.

The key point is:

Disabling interrupts and acquiring a spin lock solve different problems.

  1. What the spin lock does

    The spin lock protects the semaphore state from other CPU cores.

    The semaphore contains shared state such as:

    • the counter,
    • the wait queue.

    On a multiprocessor, another core can access the same semaphore at the same time.

    So even if the current core disables interrupts, other cores can still run and modify the same memory.

    Therefore, we need:

    spin_lock(&s->slock);
    

    to prevent concurrent access from other cores.

    In short:

    The spin lock provides mutual exclusion across cores.

  2. What disabling interrupts does

    Disabling interrupts protects against a different problem:

    It prevents the current CPU core from being interrupted or preempted while it is holding the spin lock.

    This matters because holding a spin lock should be a very short operation.

    If a thread is interrupted while holding a spin lock, the lock may remain held while the thread is not running.

    That is dangerous.

  3. Bad scenario without disabling interrupts

    Suppose interrupts are not disabled.

    Then the following can happen:

    Core 0:
    
    Thread A:
      acquire spin lock
      start modifying semaphore state
    
    Timer interrupt happens
    
    Scheduler switches from Thread A to Thread B
    
    Thread B:
      tries to acquire the same spin lock
      spins, because Thread A still holds the lock
    

    Now the problem is:

    Thread A holds the spin lock,
    but Thread A is not currently running.
    
    Thread B is running,
    but Thread B cannot make progress because it is spinning on the lock.
    
    So Thread A cannot release the lock,
    and Thread B keeps wasting CPU time.
    

    This is especially bad if Thread B is running on the same core that Thread A was preempted on.

    The lock holder has been switched away, so the spinning thread may prevent the lock holder from running again soon.

  4. Why this is bad for spin locks

    A spin lock assumes that the lock holder will release the lock quickly.

    Spinning is acceptable only if the waiting time is very short.

    But if the lock holder is preempted, then the waiting time may become very long.

    So the system may waste CPU time spinning for a lock that cannot be released until the preempted lock holder runs again.

    In the worst case, this can lead to serious kernel bugs or deadlock-like behavior.

  5. Therefore: disable interrupts while holding the spin lock

    The kernel disables interrupts before acquiring the spin lock so that the current core will not be preempted in the middle of the spin-lock-protected critical section.

    So the intended pattern is:

    Disable interrupts;
    spin_lock(&s->slock);
    
    /* short critical section */
    
    spin_unlock(&s->slock);
    Enable interrupts;
    

    This ensures:

    • other cores cannot enter the same critical section because of the spin lock;
    • the current core will not be interrupted while holding the spin lock;
    • the spin lock is held only for a short, bounded time.

4.3.3. Summary

Disabling interrupts is not enough for multiprocessor atomicity.

It only affects the current core.

The spin lock is needed to protect shared semaphore state across cores.

However, disabling interrupts is still useful because it prevents the current core from being preempted while holding the spin lock.

So the two mechanisms have different jobs:

Mechanism           Purpose
------------------  ------------------------------------------------------------
spin_lock           prevents other cores from accessing the same semaphore state
Disable interrupts  prevents the current core from being interrupted while
                    holding the spin lock

The short version is:

On a multiprocessor, the spin lock protects against other cores; disabling interrupts protects against local preemption while the spin lock is held.
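
The combined pattern can be sketched as a pair of helpers that pair the two mechanisms in the right order. This is an illustrative model only: cpu_intr_enabled simulates the local interrupt flag, the lock is a C11 atomic_flag, and the sim_ names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

bool cpu_intr_enabled = true;      /* simulated local interrupt flag */

typedef struct {
    atomic_flag held;
} sim_spinlock_t;

/* Disable local interrupts first, then spin for the lock.
 * Returns the previous interrupt state so it can be restored later. */
bool sim_lock_irqsave(sim_spinlock_t *l)
{
    bool old = cpu_intr_enabled;

    cpu_intr_enabled = false;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;                          /* busy-wait until the flag was clear */
    return old;
}

/* Release the lock first, then restore the saved interrupt state. */
void sim_unlock_irqrestore(sim_spinlock_t *l, bool old)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
    cpu_intr_enabled = old;
}
```

The ordering matters: interrupts go off before the lock is taken and come back only after it is released, so the lock is never held while the local core can be preempted.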

4.4. Multiprocessor V operation

Simplified implementation:

void V(Semaphore *s) {
    Disable interrupts;
    spin_lock(&s->slock);

    s->count += 1;

    if (!isEmpty(&s->q)) {
        waiting_thread = RemoveFirst(&s->q);
        set_state(waiting_thread, READY);
        add_to_ready_queue(waiting_thread);
    }

    spin_unlock(&s->slock);
    Enable interrupts;
}

There is no need to release and reacquire the spin lock inside V(), because V() does not sleep.

4.5. Why spin locks are acceptable here

Busy-waiting is usually bad.

But spin locks can be acceptable under special kernel conditions:

  • the critical section is very short;
  • the lock is held only briefly;
  • the code does not sleep while holding the spin lock;
  • interrupts are disabled while holding the spin lock, preventing preemption on the same core;
  • contention is expected to be rare.

The worst case for a spin lock is:

The lock holder is preempted while holding the lock, and another core spins waiting for it.

Disabling interrupts while holding the spin lock avoids this scenario: the lock holder cannot be preempted on its own core, so it always runs on to the unlock.

5. Spin Locks

5.1. Hardware support

Spin locks require hardware-provided atomic instructions.

The lecture discusses two families:

  1. Test-and-set.
  2. Load-linked / store-conditional.

5.2. Test-and-set

An atomic test-and-set operation (x86) does roughly:

old = *x;
*x = 1;
return old;

The difference from ordinary code is that the hardware performs the read and the write as one indivisible step.

Meaning:

  • if the old value was 0, the caller successfully changed it to 1;
  • if the old value was 1, someone else already held the lock.

Lock convention:

  • 0 means unlocked;
  • 1 means locked.
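
In portable C, the closest standard counterpart is atomic_flag from <stdatomic.h>. The following sketch wraps it in two helpers; the tas_ function names are made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Returns true if the caller acquired the lock, i.e. the old value
 * was "clear" (0 in the convention above). */
bool tas_try_acquire(atomic_flag *lock)
{
    /* atomic_flag_test_and_set sets the flag and returns its old value. */
    return !atomic_flag_test_and_set(lock);
}

void tas_release(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}
```

A second tas_try_acquire() on a held flag returns false without changing anything, which matches the "old value was 1" case.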

5.3. Test-test-and-set lock

The lecture presents a test-test-and-set lock, often abbreviated TTAS.

typedef volatile unsigned int spin_lock_t;

void spin_lock(spin_lock_t *lock) {
    do {
        while (*lock)
            /* busy-wait */;
    } while (TAS(lock) == 1);

    memory_barrier();
}

void spin_unlock(spin_lock_t *lock) {
    memory_barrier();
    *lock = 0;
}

5.3.1. Outer loop

The outer loop performs the actual atomic acquisition attempt:

} while (TAS(lock) == 1);

This is the part needed for correctness.

If TAS(lock) returns 0, the lock was previously free, and the current thread has acquired it.

If TAS(lock) returns 1, the lock was already held, and the thread must try again.

5.3.2. Inner loop

The inner loop is:

while (*lock)
    ;

This is not needed for correctness.

It is a performance optimization.

Why?

  • TAS is expensive because it is atomic.
  • Atomic operations interact with cache coherence.
  • Repeated atomic operations from many cores create heavy memory-system traffic.
  • A normal read is cheaper than repeatedly executing TAS.

So the algorithm first spins using ordinary reads while the lock is clearly held.

Only when the lock appears free does it attempt an expensive atomic TAS.

5.3.3. Why volatile is used

The lock variable is declared volatile:

typedef volatile unsigned int spin_lock_t;

Reason:

  • the compiler must not assume that *lock remains unchanged just because the current thread did not modify it;
  • another core may change it;
  • without volatile, the compiler might hoist the read out of the loop, producing an infinite loop that never reloads memory.

So volatile helps prevent an invalid compiler optimization in the busy-wait loop.

However, volatile alone is not enough for full synchronization; memory barriers are still needed.

5.3.4. Why does spin_unlock() use memory_barrier()?

In spin_unlock(), the memory barrier is used to enforce the correct ordering between the critical section and the lock release.

void spin_unlock(spin_lock_t *lock) {
    memory_barrier();
    *lock = 0;
}

The intended order is:

1. Finish all reads/writes inside the critical section.
2. Release the lock by setting *lock = 0.

Without the memory barrier, the compiler or the CPU might reorder memory operations so that the lock release becomes visible before the updates inside the critical section become visible to other cores.

That would be wrong.

For example, another core might see that the lock is free, acquire it, and then still observe stale shared data.

So the barrier before unlocking provides release semantics:

All memory effects inside the critical section must be completed and visible before the lock is released.

The barrier is not mainly about making *lock = 0 atomic. A normal aligned word store is usually already atomic. It is about preserving the ordering between the critical-section memory accesses and the unlock operation.
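
For comparison, the same TTAS structure can be written with C11 atomics, where the barriers are expressed as memory orders on the operations themselves. This is a sketch rather than the lecture's code: it assumes acquire ordering on the exchange and release ordering on the unlocking store stand in for the two memory_barrier() calls, and the ttas_ names are hypothetical.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint locked;   /* 0 = unlocked, 1 = locked */
} ttas_lock_t;

void ttas_lock(ttas_lock_t *l)
{
    do {
        /* Inner loop: plain loads only, no atomic read-modify-write. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* Outer loop: the actual atomic acquisition attempt. */
    } while (atomic_exchange_explicit(&l->locked, 1u,
                                      memory_order_acquire) == 1u);
}

void ttas_unlock(ttas_lock_t *l)
{
    /* Release ordering: earlier critical-section accesses become
     * visible before the store that frees the lock. */
    atomic_store_explicit(&l->locked, 0u, memory_order_release);
}
```

Using atomic types also removes the need for volatile, because the compiler may not cache an atomic load across loop iterations.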

5.4. Load-linked / store-conditional

Load-linked/store-conditional (RISC) is another hardware synchronization mechanism.

It consists of two operations:

5.4.1. Load-linked

Load-linked reads a memory location and establishes a logical link between the current core and that memory location.

Conceptually:

old = load_linked(x);

The core says:

I read this location, and I intend to update it if nobody else changes it first.

5.4.2. Store-conditional

Store-conditional writes to the location only if the link is still valid.

success = store_conditional(x, new_value);

If another core wrote to that location after the load-linked, the link is broken and the store-conditional fails.

5.4.3. Emulating test-and-set with LL/SC

The lecture presents the idea that LL/SC can be used to implement test-and-set:

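int TAS(unsigned int *x) {
    unsigned int old_value;

    do {
        old_value = LDL(x);                 /* load-linked: read and set up the link */
    } while (STC(x, 1) == STORE_FAILED);    /* store-conditional: retry if the link broke */

    return old_value;
}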

That means:

  • read the old value with load-linked;
  • try to store 1 with store-conditional;
  • if another core interfered, retry;
  • once the store succeeds, return the old value.

This gives the same semantics as test-and-set.
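
Standard C has no direct load-linked/store-conditional operations, but the same retry structure appears when test-and-set is built from a weak compare-and-swap. The sketch below assumes C11 atomics; atomic_compare_exchange_weak is allowed to fail spuriously, much like a store-conditional whose link was broken. The function name tas_via_cas is hypothetical.

```c
#include <stdatomic.h>

/* Atomically set *x to 1 and return the value it held before. */
unsigned int tas_via_cas(atomic_uint *x)
{
    unsigned int old = atomic_load(x);

    /* On failure, `old` is refreshed with the current value of *x,
     * so the loop simply retries with up-to-date information. */
    while (!atomic_compare_exchange_weak(x, &old, 1u))
        ;

    return old;
}
```

As with the LL/SC version, the loop exits only after a store that was not disturbed by another core, and the value returned is the one that store replaced.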

5.5. Test-and-set vs. LL/SC

The lecture connects this to architecture styles:

  • test-and-set is often associated with more CISC-like interfaces, such as x86;
  • load-linked/store-conditional is often associated with RISC-style architectures.

But for this course, they are treated as alternative ways to build the same kind of lock.

6. Memory Barriers, Compiler Barriers, and Weak Memory

6.1. Why barriers are needed

The spin-lock code includes:

memory_barrier();

This is necessary because modern systems may reorder memory operations.

There are two different sources of reordering:

  1. compiler reordering;
  2. hardware reordering.

6.2. Compiler reordering

The compiler may reorder instructions to optimize performance.

For example, source code order:

lock();
x = shared_data;
unlock();

The programmer expects the read of shared_data to happen after acquiring the lock and before releasing it.

But an optimizing compiler may try to move instructions unless explicitly told not to.

Therefore, synchronization code must include compiler barriers.

6.3. Compiler barrier in Linux

The lecture explains the Linux-style compiler barrier.

It is often implemented as an empty inline assembly block.

Conceptually:

asm volatile("" ::: "memory");

Important points:

  • it emits no actual machine instruction;
  • it tells the compiler not to reorder memory operations across it;
  • the "memory" clobber tells the compiler to assume memory may have changed;
  • the volatile qualifier keeps the compiler from deleting or moving the empty assembly statement, so the insertion point survives optimization.

6.4. Hardware memory barriers

Compiler barriers only constrain the compiler.

They do not necessarily constrain the CPU hardware.

Modern processors may execute memory operations out of program order because of:

  • pipelines,
  • store buffers,
  • load buffers,
  • cache hierarchies,
  • memory-level parallelism,
  • weak memory models.

Therefore, hardware fences are needed.

On x86, examples include:

  • mfence: full memory fence;
  • lfence: load/read fence;
  • sfence: store/write fence.

Linux exposes different kinds of barriers, such as:

  • full memory barrier;
  • read memory barrier;
  • write memory barrier.

6.5. Why memory barriers are placed around critical sections

For locking:

spin_lock(&lock);
/* critical section */
spin_unlock(&lock);

We want:

  • no critical-section memory access to move before the lock acquisition;
  • no critical-section memory access to move after the lock release.

So the lock acquire needs a barrier after acquisition:

TAS(lock);
memory_barrier();
/* critical section starts */

And unlock needs a barrier before release:

/* critical section ends */
memory_barrier();
*lock = 0;

The professor describes this as preventing memory accesses from “bleeding out” of the critical section.

6.6. Memory barriers are also compiler barriers

A hardware memory barrier is usually emitted via inline assembly.

asm volatile("mfence" ::: "memory");

Since this inline assembly also carries the "memory" clobber, it is already a compiler barrier, so a memory barrier written this way prevents compiler reordering as well.

Therefore, in spin-lock code, we often see only a memory barrier, not a separate compiler barrier plus a memory barrier.
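
Portable C11 code can express the same ordering without inline assembly. The sketch below is an illustration using atomic_thread_fence, which constrains both the compiler and the hardware; atomic_signal_fence is the compiler-only counterpart. The publish/try_consume names and the payload/ready variables are made up for this example.

```c
#include <stdatomic.h>

int payload;                    /* ordinary shared data */
atomic_int ready;               /* publication flag, initially 0 */

void publish(int value)
{
    payload = value;
    /* Release fence: the payload write may not move below this point. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Returns 1 and fills *out if the payload has been published, else 0. */
int try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    /* Acquire fence: the payload read may not move above this point. */
    atomic_thread_fence(memory_order_acquire);
    *out = payload;
    return 1;
}
```

The release fence plays the role of the barrier before unlock, and the acquire fence plays the role of the barrier after lock acquisition.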

7. Cache Coherence and the Cost of Sharing

7.1. Sharing memory is expensive

Lecture 6 spends time clarifying that synchronization performance is heavily affected by cache behavior.

The key point:

Atomic instructions are not inherently expensive; they become expensive when they operate on cache lines that are shared across cores.

If a cache line is used by multiple cores:

  • cache coherence has to coordinate ownership;
  • one core may need exclusive access;
  • other cores’ cache copies may need invalidation;
  • messages must travel through the cache/interconnect topology.

7.2. Cache hierarchy cost

The lecture discusses that access cost depends heavily on where the data is:

  • L1 cache: very fast, only a few cycles;
  • L2 cache: slower;
  • L3/last-level cache: slower still;
  • main memory: much slower.

The gap between CPU speed and memory latency is called the memory wall.

7.3. Topology matters

On multicore and multisocket systems:

  • cores may be on the same socket or different sockets;
  • caches may be shared by some cores but not others;
  • accessing data owned by a nearby core may be cheaper;
  • accessing data across sockets may be much more expensive.

Thus, synchronization cost is topology-dependent.

7.4. Atomic operations under contention

When many hardware threads repeatedly perform atomic operations on the same memory location:

  • throughput drops;
  • cache coherence traffic increases;
  • performance scales poorly.

The professor’s high-level takeaway:

Sharing is expensive. To scale, avoid unnecessary sharing.
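
One common way to avoid unnecessary sharing is to give each core its own data on its own cache line. The sketch below shows only the memory layout; the 64-byte line size and the core count are assumptions, the names are hypothetical, and real concurrent code would additionally make the counter atomic or otherwise synchronize the final sum.

```c
#include <stddef.h>

#define CACHE_LINE 64          /* assumed line size; hardware-dependent */
#define NCORES 4               /* illustrative core count */

/* Each counter is aligned to, and therefore padded out to, one full line. */
struct percore_counter {
    _Alignas(CACHE_LINE) unsigned long value;
};

struct percore_counter counters[NCORES];

/* Hot path: each core touches only its own cache line. */
void count_event(int core)
{
    counters[core].value++;
}

/* Cold path: combine the per-core values only when a total is needed. */
unsigned long total_events(void)
{
    unsigned long sum = 0;

    for (int i = 0; i < NCORES; i++)
        sum += counters[i].value;
    return sum;
}
```

Compared with one shared counter, the frequent increments no longer force a cache line to move between cores; only the rare total_events() call reads lines owned by other cores.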

7.5. Lock implementations differ

The lecture mentions that many lock designs exist.

The simple TTAS lock is easy to understand but not always the best-performing lock.

More advanced locks can reduce cache-coherence traffic.

One important example mentioned:

7.5.1. MCS locks

MCS locks are cache-friendly queue locks.

Basic idea:

  • each waiting core spins on a different memory location;
  • waiters form a linked list;
  • this avoids every core repeatedly banging on the same lock variable.

This improves scalability under contention.
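
The queue structure can be sketched with C11 atomics as follows. This is a minimal textbook-style illustration using default (sequentially consistent) atomics, not the code of any particular kernel, and the mcs_ type and function names are chosen for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;   /* successor in the waiter queue */
    atomic_bool locked;                /* this waiter's private spin flag */
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;        /* last waiter, NULL if the lock is free */
} mcs_lock_t;

void mcs_acquire(mcs_lock_t *l, mcs_node_t *me)
{
    atomic_store(&me->next, (mcs_node_t *)NULL);
    atomic_store(&me->locked, true);

    /* Append ourselves to the queue with a single atomic exchange. */
    mcs_node_t *prev = atomic_exchange(&l->tail, me);
    if (prev == NULL)
        return;                        /* queue was empty: lock acquired */

    atomic_store(&prev->next, me);     /* link behind the previous waiter */
    while (atomic_load(&me->locked))
        ;                              /* spin on our own node only */
}

void mcs_release(mcs_lock_t *l, mcs_node_t *me)
{
    mcs_node_t *succ = atomic_load(&me->next);

    if (succ == NULL) {
        mcs_node_t *expected = me;
        /* No visible successor: try to mark the lock free. */
        if (atomic_compare_exchange_strong(&l->tail, &expected,
                                           (mcs_node_t *)NULL))
            return;
        /* A successor is in the middle of enqueueing: wait for its link. */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;
    }
    atomic_store(&succ->locked, false);  /* hand the lock to the successor */
}
```

Each caller supplies its own node, typically on its stack, so every waiter spins on a flag that only its predecessor will write.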

The lecture does not require the details for the course, but mentions MCS locks as an interesting practical algorithm used in real systems, including the Linux kernel.

8. Pintos Locks and Benign Races

8.1. Pintos lock wrapper

Pintos has a lock abstraction built on top of a binary semaphore.

Conceptually, a Pintos lock contains:

  • a binary semaphore;
  • a holder field, mainly for debugging and correctness checks.

The semaphore provides the actual synchronization.

The holder field records which thread currently owns the lock.

8.2. Non-reentrant locks

Pintos locks are non-reentrant.

That means:

A thread that already holds a lock is not allowed to acquire it again.

Why?

If a thread tries to acquire a lock it already holds, it would block waiting for itself to release the lock.

That causes self-deadlock.

So Pintos includes assertions such as:

ASSERT(!lock_held_by_current_thread(lock));

This gives a useful error message instead of silently freezing.

8.3. Synchronization of the holder field

When acquiring the lock:

  1. the thread first acquires the underlying binary semaphore;
  2. then it sets holder = current_thread.

This is safe because once the semaphore is acquired, no other thread can hold the lock at the same time.

When releasing the lock:

  1. the holder is cleared;
  2. then the semaphore is released.

This is also safe because the releasing thread still holds the lock while clearing the holder.
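
The ordering of the two steps can be illustrated with a toy model. This is not the Pintos source: the binary semaphore is reduced to a plain value that never blocks, toy_current is a stand-in for the running thread, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_thread { int id; };

struct toy_thread *toy_current;        /* stand-in for the running thread */

struct toy_lock {
    unsigned value;                    /* binary semaphore: 1 = free, 0 = taken */
    struct toy_thread *holder;         /* owner, used for checks only */
};

bool toy_lock_held_by_current(const struct toy_lock *l)
{
    return l->holder == toy_current;
}

void toy_lock_acquire(struct toy_lock *l)
{
    assert(!toy_lock_held_by_current(l));   /* catch self-deadlock early */
    assert(l->value == 1);                  /* toy model: real code would block */
    l->value = 0;                           /* corresponds to the semaphore down */
    l->holder = toy_current;                /* set only after acquisition */
}

void toy_lock_release(struct toy_lock *l)
{
    assert(toy_lock_held_by_current(l));
    l->holder = NULL;                       /* cleared while still owning the lock */
    l->value = 1;                           /* corresponds to the semaphore up */
}
```

In both functions the holder field is written only while the calling thread owns the semaphore, which is the property the safety argument above relies on.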

8.4. Benign race in lock_held_by_current_thread

The professor discusses a subtle question:

Is checking the holder field itself synchronized?

There can be a race where another thread changes the holder field while some observer reads it.

But in the specific assertion:

lock_held_by_current_thread(lock)

we only care whether the holder equals the current thread.

Suppose thread C is checking whether it holds the lock.

Other threads A and B may acquire/release the lock concurrently.

C may observe:

  • holder is A;
  • holder is B;
  • holder is NULL.

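But C should never observe holder as C unless C actually holds the lock, because only C itself ever stores C into the holder field, and it does so only after acquiring the semaphore.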

So even though there is technically a race, it is benign for this check.

This is called a benign race:

There is a race, but all possible race outcomes are acceptable for the property being checked.

This relies on atomic pointer-sized reads/writes. If reads could produce torn or garbage pointer values, the race would not be benign.

9. Summary of Shared-Memory Synchronization

The synchronization part of the course forms a layered picture.

9.1. Hardware layer

The lowest layer provides:

  • interrupt control;
  • atomic operations;
  • cache coherence;
  • memory barriers/fences.

But hardware primitives are:

  • machine-dependent;
  • low-level;
  • easy to misuse;
  • often cause spinning;
  • not convenient for general programmers.

9.2. Low-level synchronization primitives

On top of hardware, the OS builds:

  • spin locks;
  • semaphores.

Spin locks:

  • are close to hardware;
  • use atomic instructions;
  • busy-wait;
  • are only appropriate for short kernel critical sections.

Semaphores:

  • are higher-level;
  • integrate with the scheduler;
  • allow threads to block instead of spin;
  • provide portable semantics.

9.3. Synchronization patterns

Semaphores support common patterns:

9.3.1. Binary semaphores

Used for mutual exclusion.

Equivalent intuition:

Only one thread may enter the critical section.

9.3.2. Counting semaphores

Used for condition synchronization.

Equivalent intuition:

Count how many events/resources are available.

9.4. Concurrent algorithms built from semaphores

Examples covered earlier:

  • producer-consumer synchronization;
  • reader-writer synchronization.

The important conceptual lesson:

Operating systems build higher-level, portable abstractions by layering them on top of lower-level hardware mechanisms.

10. Scheduling: Introduction

10.1. What is scheduling?

Scheduling is the problem of sharing a serially reusable resource among multiple clients.

Examples of resources:

  • CPU cores;
  • I/O controllers;
  • network interfaces;
  • bandwidth;
  • memory pages;
  • storage devices;
  • train tracks;
  • factory machines.

A schedule determines:

  1. in which order clients get the resource;
  2. for how long each client may use it.

In OS scheduling, the lecture distinguishes:

  • scheduling policy:
    • deciding who should run next;
  • dispatching mechanism:
    • the low-level mechanics of actually switching threads.

Scheduling arises naturally when resources are virtualized and shared.

10.2. Scheduling objectives

Scheduling may optimize many different goals.

10.2.1. Efficiency

Do not waste available capacity.

For CPU scheduling:

  • keep the CPU busy when useful work exists.

For I/O scheduling:

  • keep devices busy when possible.

For power/thermal-aware scheduling:

  • use energy, power, and thermal budget efficiently.

10.2.2. Low overhead

The scheduler itself consumes resources.

Any time spent making scheduling decisions is time not spent running useful work.

So the scheduler should not be too expensive.

10.2.3. Timeliness

Users and systems have timing expectations.

Possible timing goals include:

  • finish as soon as possible;
  • minimize average response time;
  • minimize maximum response time;
  • minimize a response-time percentile, such as the 99th percentile;
  • meet hard deadlines;
  • provide smooth soft-real-time behavior;
  • minimize makespan;
  • satisfy service-level agreements.

  1. Response time

    For a job/thread activation:

    \[ \text{response time} = \text{finish time} - \text{arrival time}. \]

  2. Makespan

    Makespan is the total time from the beginning of a multi-step workload to the end.

    It is similar to end-to-end response time for a whole dependency graph.

10.2.4. Isolation

Isolation means controlling how much of a resource some activity can consume.

Examples:

  • reserve at least 20% CPU for an important service;
  • limit background log processing to at most 10% CPU;
  • prevent one activity from destroying the temporal behavior of another.

10.3. Scheduling problem categories

10.3.1. Preemptive scheduling

A resource is preemptive if the scheduler can interrupt work in progress and resume it later without losing progress.

Example:

  • CPU scheduling.

A running thread can be interrupted by a timer interrupt, saved, and resumed later.

10.3.2. Non-preemptive scheduling

A resource is non-preemptive if interrupting work destroys progress or is impossible.

Examples:

  • sending bits of a network packet;
  • some GPU kernel executions.

If a packet is halfway transmitted, the system cannot pause and resume the packet later. It would have to retransmit.

10.3.3. Limited-preemptive scheduling

Some systems allow preemption only at specific points.

Examples:

  • a TCP stream may be preemptable between packets, but not inside a packet;
  • cooperative user-space threading may check for preemption only at explicit yield points.

10.3.4. Uniprocessor vs. multiprocessor scheduling

Uniprocessor scheduling:

  • one processor;
  • only one thread runs at a time.

Multiprocessor scheduling:

  • multiple cores/processors;
  • several threads may run at once;
  • requires load-balancing decisions.

Potential issue:

One core may have many ready threads while another core is idle.

10.3.5. Types of multiprocessors

  1. Identical multiprocessors

    All cores have the same capabilities and performance.

    It does not matter where a thread runs.

  2. Homogeneous / uniform multiprocessors

    Cores can run the same code but at different speeds.

    Examples:

    • performance cores and efficiency cores;
    • cores running at different frequencies.

    The same binary can run everywhere, but performance differs.

  3. Heterogeneous multiprocessors

    Processors may have different instruction sets or capabilities.

    Examples:

    • CPU plus GPU;
    • CPU plus specialized accelerator.

    You cannot freely move arbitrary code between them.

10.3.6. Additional constraints

Scheduling may also involve:

  • precedence constraints:
    • one job must complete before another starts;
  • mutual exclusion constraints:
    • two jobs cannot run simultaneously;
  • co-scheduling constraints:
    • some jobs must run together, or must not run together.

11. Foundational Scheduling Policies

The lectures focus mainly on preemptive uniprocessor CPU scheduling.

11.1. FIFO / FCFS

FIFO means:

First in, first out.

Also called:

First come, first served.

11.1.1. Policy

  • Run threads in order of arrival.
  • If multiple threads arrive simultaneously, break ties arbitrarily.
  • Once a thread is dispatched, let it run until it completes or blocks.
  • This is non-preemptive.

11.1.2. Why FIFO is simple

FIFO is almost a non-policy:

  • the environment determines the order;
  • the scheduler does not make clever decisions;
  • implementation is just a queue.

11.1.3. Example with equal job lengths

Suppose:

Thread  Arrival  Execution  Finish  Response
A       0        50         50      50
B       0        50         100     100
C       0        50         150     150

Average response time:

\[ \frac{50 + 100 + 150}{3} = 100. \]

11.1.4. Example with lopsided job lengths

Suppose FIFO order is A, B, C:

Thread  Arrival  Execution  Finish  Response
A       0        90         90      90
B       0        50         140     140
C       0        10         150     150

Average response time:

\[ \frac{90 + 140 + 150}{3} = \frac{380}{3} \approx 126.67. \]

Problem:

A long job at the front delays all later jobs.

This is sometimes called a convoy effect.
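
Both FIFO tables above can be reproduced with a small simulation. The following is a minimal sketch; struct job and run_fifo are names made up for illustration, and the array order is assumed to be the arrival order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical job record; times are in abstract units. */
struct job {
    char name;
    int arrival;    /* time at which the job becomes ready */
    int execution;  /* CPU time the job needs */
    int finish;     /* filled in by the simulation */
};

/* Run the jobs non-preemptively in array order (assumed arrival order).
   Returns the average response time, where response = finish - arrival. */
double run_fifo(struct job *jobs, size_t n) {
    assert(n > 0);
    int now = 0;
    double total_response = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (now < jobs[i].arrival)
            now = jobs[i].arrival;   /* CPU idles until the job arrives */
        now += jobs[i].execution;    /* runs to completion, no preemption */
        jobs[i].finish = now;
        total_response += jobs[i].finish - jobs[i].arrival;
    }
    return total_response / (double)n;
}
```

For the lopsided jobs A, B, C this yields finish times 90, 140, 150 and an average of about 126.67; for three jobs of length 50 it yields exactly 100.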

11.2. SJF: Shortest Job First

Shortest Job First tries to improve average response time.

11.2.1. Policy

  • Always choose the ready thread with the shortest required execution time until completion or blocking.
  • If two jobs have equal length, break ties arbitrarily.
  • Once dispatched, the job runs until it completes or blocks.
  • This is non-preemptive.

11.2.2. Optimality

SJF is optimal with respect to average response time in the non-preemptive setting, assuming all jobs are ready at the same time and all job lengths are known.

11.2.3. Example

For the same lopsided jobs:

Thread  Arrival  Execution  Finish  Response
A       0        90         150     150
B       0        50         60      60
C       0        10         10      10

Schedule:

|  C  |  B  |  A  |
0     10    60    150

Average response time:

\[ \frac{10 + 60 + 150}{3} = \frac{220}{3} \approx 73.33. \]

This is much better than FIFO’s \(126.67\).
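
A sketch of the same calculation under SJF, assuming (as in the example) that all jobs are ready at time 0 and that their lengths are known. run_sjf and by_execution are illustrative names; qsort is the C standard library sort, which is not stable, so equal-length jobs end up in arbitrary order, as the policy allows.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical job record; every job is assumed to arrive at time 0. */
struct job {
    char name;
    int execution;  /* known CPU time the job needs */
    int finish;     /* filled in by the simulation */
};

/* Comparison function for qsort: shorter jobs sort first. */
int by_execution(const void *a, const void *b) {
    const struct job *x = a;
    const struct job *y = b;
    return (x->execution > y->execution) - (x->execution < y->execution);
}

/* Non-preemptive SJF: sort by length, then run each job to completion.
   Returns the average response time (equal to the average finish time,
   because all arrivals are at time 0). */
double run_sjf(struct job *jobs, size_t n) {
    assert(n > 0);
    qsort(jobs, n, sizeof jobs[0], by_execution);
    int now = 0;
    double total_response = 0.0;
    for (size_t i = 0; i < n; i++) {
        now += jobs[i].execution;
        jobs[i].finish = now;
        total_response += now;
    }
    return total_response / (double)n;
}
```

For jobs of length 90, 50, and 10 the resulting order is C, B, A with finish times 10, 60, 150.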

11.2.4. Practical problems of SJF

  1. Problem 1: Need to know the future

    SJF requires knowing how long each thread will run before it blocks or completes.

    In general, the OS does not know this.

    For arbitrary programs, even deciding whether a program terminates at all is undecidable (the halting problem), so its exact running time cannot be predicted in general.

    Therefore:

    Exact SJF is not implementable for general-purpose operating systems.

  2. Problem 2: Non-preemptive commitment

    Once SJF starts a job, it does not preempt it.

    If a long job starts and then a very short urgent job arrives, the short job must wait.

  3. Problem 3: Starvation of long jobs

    If short jobs keep arriving, long jobs may never get scheduled.

    Analogy:

    You are in a supermarket with a full cart. You keep letting people with one item go first. If they keep arriving, you may never check out.

  4. Problem 4: A chosen job may never terminate

    If the chosen job runs forever, the scheduler never regains control under non-preemptive SJF.

    For an OS running untrusted user code, this is unacceptable.

12. Round Robin Scheduling

12.1. Motivation

To avoid the non-preemptive problems of FIFO and SJF, the scheduler needs to regain control.

The CPU is a preemptive resource.

The scheduler regains control through timer interrupts.

The OS sets a timer interrupt so that after some time quantum:

  • execution traps into the kernel;
  • the scheduler can choose another thread.

12.2. Policy

Round Robin uses fixed-length time slices.

  • Maintain a list/queue of ready threads.
  • Give each ready thread up to one time quantum.
  • If the thread blocks before the quantum ends, move to the next thread.
  • If the thread is still running at the end of the quantum, preempt it.
  • Cycle through ready threads repeatedly.

This is preemptive scheduling.

12.3. Benefits

Round Robin solves the worst non-preemptive starvation issues.

If a thread runs forever:

  • it only runs for one quantum at a time;
  • other ready threads still get CPU time.

If there is a finite number of ready threads:

  • every thread eventually gets a turn.

This provides a crude form of fairness.

12.4. Remaining problem: many threads

If the number of ready threads is unbounded, then the delay before a thread gets its next turn can also become unbounded.

In practice, systems impose limits:

  • memory limits;
  • process/thread limits;
  • browser worker limits;
  • container or user quotas.

So Round Robin is more robust than SJF but not perfect.

12.5. Example with 10 ms quantum

For jobs:

Thread  Arrival  Execution
A       0        90
B       0        50
C       0        10

With a 10 ms time slice, the schedule alternates:

A B C A B A B A B A B A A A A

The example response times are:

Thread  Finish  Response
A       150     150
B       110     110
C       30      30

Average response time:

\[ \frac{150 + 110 + 30}{3} = \frac{290}{3} \approx 96.67. \]

This is:

  • worse than SJF in the example;
  • better than FIFO in the lopsided example;
  • more robust than non-preemptive SJF.
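
The Round Robin numbers above can be checked with a small simulation. This is a minimal sketch that assumes all jobs arrive at time 0 and never block; struct job, run_rr, and MAX_JOBS are illustrative names, and the ready queue is a fixed-size circular array.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_JOBS 16  /* arbitrary capacity chosen for this sketch */

struct job {
    char name;
    int remaining;  /* CPU time still needed */
    int finish;     /* filled in by the simulation */
};

/* Round Robin with a fixed quantum. Returns the average response time
   (equal to the average finish time, because all arrivals are at time 0). */
double run_rr(struct job *jobs, size_t n, int quantum) {
    assert(n > 0 && n <= MAX_JOBS && quantum > 0);
    size_t queue[MAX_JOBS];          /* circular queue of job indices */
    size_t head = 0, count = n;
    for (size_t i = 0; i < n; i++)
        queue[i] = i;

    int now = 0;
    double total_response = 0.0;
    while (count > 0) {
        size_t i = queue[head];      /* dequeue the job at the front */
        head = (head + 1) % MAX_JOBS;
        count--;

        int slice = jobs[i].remaining < quantum ? jobs[i].remaining : quantum;
        now += slice;
        jobs[i].remaining -= slice;

        if (jobs[i].remaining > 0) { /* preempted: back of the queue */
            queue[(head + count) % MAX_JOBS] = i;
            count++;
        } else {
            jobs[i].finish = now;
            total_response += now;
        }
    }
    return total_response / (double)n;
}
```

With a quantum of 10 and jobs of length 90, 50, and 10, this gives finish times 150 (A), 110 (B), and 30 (C).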

12.6. Round Robin and urgent short jobs

Round Robin may delay a short but urgent job.

Example:

  • thread C is short;
  • but it may still wait behind A and B for its turn.

In real-time systems, some short jobs are very important.

Example:

  • airbag controller;
  • sensor processing;
  • safety-critical control.

For those, fairness is not enough.

We need priorities or deadlines.

13. Fixed-Priority Scheduling

13.1. Motivation

Some work is more important than other work.

Round Robin treats ready threads roughly equally.

But real-time systems need to express that some threads should preempt others.

13.2. Policy

Each thread gets a static priority.

  • The priority is assigned at thread creation/configuration time.
  • The scheduler always chooses the highest-priority ready thread.
  • If the running thread blocks or completes, choose the next-highest-priority ready thread.
  • If a higher-priority thread becomes ready, immediately preempt the current thread.

This is preemptive scheduling.

13.3. Priority-number convention

Different systems use different conventions.

Some systems say:

  • larger number means higher priority.

Other systems say:

  • smaller number means higher priority.

The lecture example uses:

99 is higher priority than 50, and 50 is higher priority than 1.

Always check the system convention.

13.4. Example

Thread  Priority  Arrival  Execution  Finish  Response
A       1         0        90         150     150
B       50        0        50         60      60
C       99        10       10         20      10

Schedule:

|  B  |  C  |  B  |  A  |
0     10    20    60    150

Explanation:

  • At time 0, A and B are ready.
  • B has higher priority than A, so B runs.
  • At time 10, C arrives.
  • C has higher priority than B, so C preempts B.
  • C finishes at time 20.
  • B resumes and finishes at time 60.
  • A runs last.

13.5. Response-time reasoning

For C:

\[ R_C = 10. \]

For B in this example:

\[ R_B = 10 + 50 = 60. \]

B’s response time includes:

  • 10 units lost to preemption by C;
  • 50 units of B’s own execution.
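
The schedule and response times of this example can be reproduced with a unit-step simulation. A minimal sketch, assuming integer time units and the larger-is-higher priority convention; struct task and run_fixed_priority are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    char name;
    int priority;   /* larger number = higher priority, as in the example */
    int arrival;    /* time at which the task becomes ready */
    int remaining;  /* execution time still needed */
    int finish;     /* filled in by the simulation */
};

/* Preemptive fixed-priority scheduling, simulated one time unit at a time.
   In every unit, the highest-priority task that has arrived and still has
   work left gets the CPU, so a newly arrived higher-priority task preempts. */
void run_fixed_priority(struct task *tasks, size_t n) {
    size_t unfinished = 0;
    for (size_t i = 0; i < n; i++) {
        assert(tasks[i].remaining >= 0);
        if (tasks[i].remaining > 0)
            unfinished++;
    }
    for (int now = 0; unfinished > 0; now++) {
        struct task *best = NULL;
        for (size_t i = 0; i < n; i++) {
            struct task *t = &tasks[i];
            if (t->arrival <= now && t->remaining > 0 &&
                (best == NULL || t->priority > best->priority))
                best = t;
        }
        if (best == NULL)
            continue;                /* nothing is ready: the CPU idles */
        best->remaining--;
        if (best->remaining == 0) {
            best->finish = now + 1;
            unfinished--;
        }
    }
}
```

For the three tasks of the example this gives finish times 150 (A), 60 (B), and 20 (C), so C's response time is 20 - 10 = 10.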

13.6. Why fixed priorities are useful for real-time systems

Fixed-priority scheduling can provide bounded response times for high-priority tasks.

If a critical task has a high priority, then when it becomes ready:

  • it preempts lower-priority work;
  • its response time can be bounded by its own execution time plus interference from higher-priority tasks (none at all if it has the highest priority).

This is why fixed-priority scheduling is widely used in real-time systems.

13.7. Starvation under fixed priorities

Lower-priority threads may starve.

If higher-priority threads keep arriving or running, lower-priority threads may never execute.

But in real-time systems, this may be intentional.

The system cares more about important tasks meeting deadlines than about equal treatment.

Therefore, fixed-priority systems require analysis.

13.8. Response Time Analysis

To use fixed-priority scheduling safely, one should analyze maximum response times.

Question:

Can every task meet its deadline under the assigned priorities?

If high-priority tasks consume too much CPU, lower-priority tasks may have unbounded response times.

This analysis becomes more complex when tasks recur periodically or sporadically.

14. EDF: Earliest Deadline First

14.1. Motivation

Fixed priorities are static.

But urgency often changes over time.

A thread may receive new input and get a new deadline.

Example:

  • a network packet arrives and must be processed within some time;
  • a periodic task is released and gets a deadline for this instance;
  • a project milestone appears and must be completed by a specific date.

EDF uses deadlines directly.

14.2. Policy

Each ready job/thread activation has a current absolute deadline.

  • Always run the ready thread with the earliest deadline.
  • If the running thread blocks or completes, switch to the ready thread with the next-earliest deadline.
  • If a thread with an earlier deadline becomes ready, preempt the current thread.
  • Deadlines are typically updated when new work arrives or at periodic releases.

This is preemptive scheduling.

14.3. EDF as dynamic priority scheduling

EDF can be viewed as priority scheduling where:

\[ \text{priority} = \text{urgency determined by deadline}. \]

But unlike fixed priority:

  • the priority changes over time;
  • each job activation may have a different deadline.

Therefore, EDF is a dynamic-priority algorithm.

14.4. Example

Thread  Deadline  Arrival  Execution  Finish  Response
A       110       0        90         100     100
B       200       0        50         150     150
C       25        10       10         20      10

Schedule:

|  A  |  C  |  A  |  B  |
0     10    20    100   150

Explanation:

  • At time 0, A and B are ready.
  • A’s deadline 110 is earlier than B’s deadline 200, so A runs.
  • At time 10, C arrives with deadline 25.
  • C’s deadline is earlier than A’s, so C preempts A.
  • C finishes at time 20.
  • A resumes and finishes at time 100.
  • B runs last.
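
The EDF example can be checked the same way with a unit-step simulation. A minimal sketch, assuming integer time units and one activation per thread; struct job and run_edf are illustrative names.

```c
#include <assert.h>
#include <stddef.h>

struct job {
    char name;
    int deadline;   /* absolute deadline */
    int arrival;    /* time at which the job becomes ready */
    int remaining;  /* execution time still needed */
    int finish;     /* filled in by the simulation */
};

/* Preemptive EDF, simulated one time unit at a time: in every unit the
   ready job with the earliest absolute deadline runs.
   Returns the number of jobs that finished after their deadline. */
int run_edf(struct job *jobs, size_t n) {
    size_t unfinished = 0;
    int missed = 0;
    for (size_t i = 0; i < n; i++) {
        assert(jobs[i].remaining >= 0);
        if (jobs[i].remaining > 0)
            unfinished++;
    }
    for (int now = 0; unfinished > 0; now++) {
        struct job *best = NULL;
        for (size_t i = 0; i < n; i++) {
            struct job *j = &jobs[i];
            if (j->arrival <= now && j->remaining > 0 &&
                (best == NULL || j->deadline < best->deadline))
                best = j;
        }
        if (best == NULL)
            continue;                /* nothing is ready: the CPU idles */
        best->remaining--;
        if (best->remaining == 0) {
            best->finish = now + 1;
            unfinished--;
            if (best->finish > best->deadline)
                missed++;
        }
    }
    return missed;
}
```

For the example, the finish times are 100 (A), 150 (B), and 20 (C), and no deadline is missed.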

14.5. Semantic advantage of deadlines

A fixed priority number has no meaning by itself.

For example, priority 20 could be:

  • high priority if all others are 1 and 2;
  • low priority if many others are 50, 60, 70.

Fixed priorities are not compositional.

Different teams must coordinate priority assignments globally.

A deadline, however, directly expresses an objective:

This work must be done by this time.

This makes deadlines more compositional:

  • one subsystem can assign deadlines based on its own timing requirements;
  • another subsystem can do the same;
  • integration still needs feasibility analysis, but the meaning of each deadline is local and explicit.

14.6. EDF optimality on uniprocessors

EDF is optimal with respect to meeting hard deadlines on preemptive uniprocessors.

Meaning:

If there exists any schedule that meets all deadlines, then EDF will also meet all deadlines.

This is a strong theoretical property.

14.7. EDF under overload

EDF is not optimal under overload.

If it is impossible to meet all deadlines, EDF does not necessarily minimize the number of missed deadlines.

So EDF gives:

If zero misses are possible, EDF gives zero misses.

But it does not guarantee:

If misses are unavoidable, EDF gives the fewest misses.

14.8. EDF on multiprocessors

EDF is not optimal on multiprocessors.

Reason:

  • deadlines express urgency;
  • multiprocessor scheduling also requires reasoning about parallelism and sequentiality.

Example intuition:

If one job cannot be parallelized, giving it ten processors does not make it finish ten times faster.

So EDF’s urgency heuristic is not enough for multiprocessor optimality.

14.9. Practical difficulty: where do deadlines come from?

EDF is elegant if deadlines are known.

But in general-purpose systems, deadlines are often not explicit.

Example:

  • a browser has JavaScript workers;
  • rendering, event handling, networking, and user interaction interact;
  • the user wants a snappy interface;
  • but the OS may not know explicit deadlines for each thread.

Therefore:

The difficult practical question for EDF is often not how to schedule deadlines, but how to obtain meaningful deadlines.

14.10. Real-time scheduling policies in practice

14.10.1. POSIX SCHED_FIFO

Despite the name, SCHED_FIFO is not plain FIFO scheduling.

It is:

fixed-priority scheduling with FIFO tie-breaking within each priority level.

At a given priority level:

  • a thread runs until it blocks, yields, or completes;
  • then the next same-priority thread can run.

14.10.2. POSIX SCHED_RR

SCHED_RR is also fixed-priority scheduling.

It differs from SCHED_FIFO only in tie-breaking:

At each priority level, equal-priority threads are scheduled round-robin.

14.10.3. Linux SCHED_DEADLINE

Linux provides SCHED_DEADLINE.

It is based on EDF plus the Constant Bandwidth Server, abbreviated CBS.

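Each task declares a runtime budget, a relative deadline, and a period; CBS enforces the budget and replenishes it each period, so a task that overruns cannot take CPU time reserved for others.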

14.10.4. Privilege requirement

These real-time policies usually require root privileges or special capabilities (on Linux, typically CAP_SYS_NICE).

Reason:

  • a normal user process should not be able to monopolize the CPU with high real-time priority;
  • misuse could starve important system services;
  • real-time scheduling can break system responsiveness if configured incorrectly.
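
On Linux, requesting one of these policies looks roughly like the sketch below. sched_setscheduler, sched_get_priority_min, and SCHED_FIFO come from <sched.h>; the helper name try_sched_fifo is made up for illustration. Without the required privilege the call is expected to fail, typically with EPERM.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>

/* Hypothetical helper: ask for SCHED_FIFO at the lowest real-time priority
   for the calling process (pid 0 means "the caller").
   Returns 0 on success, or the errno value on failure. */
int try_sched_fifo(void) {
    struct sched_param param = { 0 };
    int lowest = sched_get_priority_min(SCHED_FIFO);
    assert(lowest != -1);
    param.sched_priority = lowest;
    if (sched_setscheduler(0, SCHED_FIFO, &param) == -1)
        return errno;
    return 0;
}
```

Using the lowest real-time priority keeps the experiment as harmless as possible; even that priority outranks all normal time-sharing threads.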

15. CPU-Bound vs. I/O-Bound Behavior

15.1. Not only the CPU matters

A computer system has many resources:

  • CPU;
  • disk;
  • SSD controller;
  • network interface;
  • memory;
  • GPU;
  • other devices.

Overall performance is often determined by the bottleneck resource.

If the system is bottlenecked on disk, adding CPU cores may not help.

15.2. CPU-bound threads

A CPU-bound thread mostly uses the processor.

Example:

while (true) {
    compute_without_IO();
}

If given more CPU time, it makes more progress.

15.3. I/O-bound threads

An I/O-bound thread uses the CPU briefly and then waits for I/O.

Example:

while (true) {
    compute_1ms();
    block_on_IO_for_10ms();
}

It needs short CPU bursts to submit or process I/O requests.

Then it blocks while the device works.

15.4. Interaction under Round Robin

Suppose:

  • thread A is I/O-bound:
    • compute 1 ms;
    • block on I/O for 10 ms.
  • thread B is CPU-bound:
    • compute forever.
  • Round Robin time slice is 100 ms.

A may run for 1 ms and submit an I/O request.

Then B runs.

The I/O finishes after 10 ms, but B still has most of its 100 ms time slice left.

So A is ready but cannot run immediately.

Result:

  • the I/O device sits idle;
  • A cannot submit the next I/O request;
  • device utilization is poor.

In the lecture example, the I/O device is mostly idle, around only 10% utilized.
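
The figure of roughly 10% can be reconstructed with a steady-state estimate. This is a sketch under simplifying assumptions that the notes do not spell out: A is dispatched as soon as B's slice ends, B always uses its full slice, and context switches cost nothing. One cycle then consists of A's 1 ms burst followed by B's 100 ms slice, and the device is busy only for the 10 ms that A's request takes:

\[ \text{I/O utilization} \approx \frac{10}{1 + 100} = \frac{10}{101} \approx 9.9\%. \]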

15.5. Smaller time slice helps but is not a general solution

If the Round Robin time slice is reduced to 10 ms, then A may run more often.

This can improve I/O utilization.
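
As a rough estimate, assume A is dispatched as soon as B's slice ends, B always uses its full slice, and context switches cost nothing. With a 10 ms slice, one cycle is A's 1 ms burst plus B's 10 ms slice, and A's 10 ms I/O request overlaps B's slice completely:

\[ \text{I/O utilization} \approx \frac{10}{1 + 10} = \frac{10}{11} \approx 91\%. \]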

But there is no universal best time-slice length.

Reasons:

  • different devices have different latencies;
  • SSDs, disks, networks, and GPUs differ by orders of magnitude;
  • threads need different amounts of CPU before submitting I/O;
  • workloads change over time.

So:

Choosing a fixed Round Robin quantum cannot generally solve the CPU-bound/I/O-bound interaction problem.

15.6. Desired principle

We want I/O-bound threads to run promptly when they become ready.

Reason:

  • they need only a little CPU;
  • running them quickly can keep I/O devices busy;
  • delaying them wastes non-CPU resources.

CPU-bound threads are less latency-sensitive:

  • they will just compute whenever scheduled;
  • delaying them slightly usually does not idle another device.

Thus:

General-purpose schedulers often try to favor interactive or I/O-bound tasks.

16. SRPT: Shortest Remaining Processing Time First

16.1. Motivation

Shortest Job First was non-preemptive.

SRPT is the preemptive version.

16.2. Policy

SRPT means:

Shortest Remaining Processing Time First.

Policy:

  • always run the ready thread with the least remaining processing time, that is, the least CPU time still needed before it completes or blocks;
  • if a new ready thread has less remaining time than the current one, preempt immediately.

This is preemptive scheduling.

16.3. Relation to SJF

SJF:

  • chooses the shortest job;
  • does not preempt once dispatched.

SRPT:

  • always chooses the job with shortest remaining time;
  • may preempt when a shorter job becomes ready.

16.4. Why SRPT is attractive

SRPT naturally favors short CPU bursts.

Thus, it handles I/O-bound tasks well:

  • an I/O-bound thread wakes up;
  • it likely needs only a short CPU burst;
  • SRPT preempts long CPU-bound work to run it;
  • the I/O-bound thread submits the next I/O request quickly;
  • the I/O device stays busy.

SRPT adapts automatically to device latencies and CPU burst lengths.

16.5. Practical problem

SRPT requires knowing remaining processing time.

Again, knowing this exactly is generally impossible.

The OS cannot solve the halting problem.

So practical schedulers approximate SRPT.

17. Locality Principle

17.1. Basic idea

The lecture introduces the locality principle:

Recent past behavior predicts near-term future behavior.

This is not a theorem about all possible programs.

It is an empirical engineering observation.

Most real threads have stable behavior over short time intervals.

Examples:

  • if a thread recently did many short I/O bursts, it will probably do more soon;
  • if a thread recently used the CPU continuously, it will probably remain CPU-bound in the near future.

17.2. Program phases

Threads can change behavior.

Example phases:

  1. read data from disk;
  2. compute intensively;
  3. write results.

So behavior is not fixed forever.

But within a phase, behavior is often regular.

17.3. Estimating future behavior

The scheduler can monitor:

  • recent CPU usage;
  • recent blocking behavior;
  • recent sleep/wakeup patterns;
  • recent burst lengths.

Then it can estimate:

  • whether the thread is CPU-bound;
  • whether it is I/O-bound;
  • how much CPU it is likely to need soon.

17.4. Exponential aging

A common method is exponential aging.

Idea:

  • combine old estimate with new observation;
  • give more weight to recent observations;
  • let old behavior fade out exponentially.

Generic form:

\begin{equation*} \text{new estimate} = \alpha \cdot \text{old estimate} + (1-\alpha) \cdot \text{new observation}. \end{equation*}

where:

\[ 0 \le \alpha \le 1. \]

Large \(\alpha\):

  • remembers history longer;
  • reacts more slowly.

Small \(\alpha\):

  • reacts quickly;
  • is noisier.
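
A direct transcription of the formula, as a minimal sketch (the function name age_estimate is illustrative):

```c
#include <assert.h>

/* Exponential aging:
   new estimate = alpha * old estimate + (1 - alpha) * new observation.
   alpha close to 1 keeps history longer; alpha close to 0 reacts faster. */
double age_estimate(double old_estimate, double observation, double alpha) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    return alpha * old_estimate + (1.0 - alpha) * observation;
}
```

With \(\alpha = 0.5\), an old estimate of 10 ms, and repeated observations of 2 ms, the estimate moves to 6, then 4, then 3 ms: each step halves the remaining distance to the new behavior.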

18. MLFQ: Multi-Level Feedback Queue

18.1. Motivation

MLFQ approximates SRPT without knowing future execution times.

It uses feedback from past behavior.

18.2. Basic structure

MLFQ has many priority levels.

At each priority level:

  • there is a queue of threads;
  • Round Robin is used within the level.

Different levels have different time slices:

  • high-priority levels use short time slices;
  • low-priority levels use longer time slices.

18.3. Intended placement

I/O-bound or interactive tasks:

  • use CPU briefly;
  • block often;
  • should remain at high priority;
  • get short response times.

CPU-bound tasks:

  • use lots of CPU;
  • rarely block;
  • are gradually demoted;
  • get longer time slices at lower priority;
  • run efficiently without hurting responsiveness too much.

18.4. Adaptive feedback

The scheduler changes priority based on observed behavior.

Rules:

  • if a thread uses a lot of CPU, lower its priority;
  • if a thread uses CPU sparingly or blocks quickly, raise or maintain its priority.

Thus MLFQ implements the idea:

Recent CPU behavior predicts near-future CPU behavior.
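
One simple way to realize this feedback rule is sketched below. The number of levels, the doubling time slices, and the one-level promotion and demotion steps are assumptions chosen for illustration; the notes do not fix these details.

```c
#include <assert.h>

#define NUM_LEVELS 4    /* hypothetical number of priority levels */

struct mlfq_thread {
    int level;          /* 0 = highest priority, NUM_LEVELS - 1 = lowest */
};

/* Hypothetical time slice per level in ms: doubles with each lower level,
   so high-priority levels get short slices and low-priority levels long ones. */
int quantum_for_level(int level) {
    assert(level >= 0 && level < NUM_LEVELS);
    return 10 << level; /* 10, 20, 40, 80 */
}

/* Feedback applied when a thread leaves the CPU.
   used_full_quantum != 0: it ran until preempted, so demote it one level.
   used_full_quantum == 0: it blocked early, so promote it one level. */
void mlfq_feedback(struct mlfq_thread *t, int used_full_quantum) {
    if (used_full_quantum) {
        if (t->level < NUM_LEVELS - 1)
            t->level++;
    } else {
        if (t->level > 0)
            t->level--;
    }
}
```

A CPU-bound thread that keeps exhausting its slice sinks to the lowest level within a few decisions, while a thread that blocks early drifts back up.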

18.5. Also called exponential queues

The lecture notes that this style is also known as exponential queues.

Classic Unix, including the 4.4BSD scheduler, used this general idea for many years.

18.6. Practical challenge: measuring CPU usage precisely

A practical challenge is measuring CPU usage accurately.

Older systems often used timer ticks.

Example:

  • every 10 ms or 100 ms, the timer interrupt fires;
  • the kernel charges the currently running thread for CPU usage.

Problem:

  • a thread that can predict timer ticks may block right before each tick;
  • whichever thread is running when the tick fires gets charged instead;
  • the original thread then wakes up again, with its CPU use largely unaccounted for.

This can trick the scheduler.

The lecture mentions that the Xen hypervisor had a security vulnerability related to this kind of timing/accounting issue.

So precise accounting matters.

18.7. Why MLFQ is no longer the final word

MLFQ is elegant and historically important.

But modern general-purpose systems often favor fair scheduling approaches instead.

The lecture then moves to fairness.

19. Fair Scheduling

19.1. Why fairness matters

Threads do not have feelings, so fairness is not a moral goal.

Fairness is useful because it prevents bad system behavior.

Fairness implies:

  • no starvation;
  • no thread waits too long relative to its share;
  • no single thread can monopolize the resource unfairly;
  • resource allocation has predictable semantics.

Fair scheduling is especially useful for general-purpose workloads where:

  • explicit deadlines are unavailable;
  • fixed priorities are hard to assign;
  • many independent programs compete for CPU time.

19.2. Proportional-share fairness

Suppose:

  • the resource is available for an interval of length \(\Delta\);
  • thread \(i\) has weight \(w_i\);
  • total weight is \(\sum_{j=1}^{n} w_j\).

Then thread \(i\)’s fair share is:

\[ \operatorname{share}_i(\Delta) = \Delta \cdot \frac{w_i}{\sum_{j=1}^{n} w_j}. \]

Interpretation:

  • if all weights are equal, all threads get equal shares;
  • if one thread has twice the weight, it should get twice the resource share.

Example:

If A has weight 1 and B has weight 2, then total weight is:

\[ 1 + 2 = 3. \]

A’s share:

\[ \frac{1}{3}. \]

B’s share:

\[ \frac{2}{3}. \]

So B should receive twice as much CPU time as A.

20. Lottery Scheduling

20.1. Basic idea

Lottery scheduling achieves proportional-share fairness randomly.

20.2. Policy

  • Give each thread some number of tickets.
  • At each scheduling decision, randomly select one ticket.
  • The thread holding the winning ticket runs for one time slice.
  • Repeat.

If a thread has more tickets, it has a higher probability of being selected.

20.3. Expected fairness

If thread \(i\) has \(w_i\) tickets, its probability of winning one lottery is:

\[ \frac{w_i}{\sum_j w_j}. \]

Over time, in expectation, it receives its proportional share.
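
Drawing a winner amounts to walking the ticket counts until the winning ticket number is covered. A minimal sketch, where lottery_winner is a hypothetical helper and the caller supplies the random ticket number:

```c
#include <assert.h>
#include <stddef.h>

/* tickets[i] is the number of tickets held by thread i.
   winning is a ticket number in [0, total tickets).
   Returns the index of the thread that holds the winning ticket. */
size_t lottery_winner(const unsigned *tickets, size_t n, unsigned winning) {
    unsigned total = 0;
    for (size_t k = 0; k < n; k++)
        total += tickets[k];
    assert(winning < total);         /* also guarantees n >= 1 */

    size_t i = 0;
    unsigned covered = tickets[0];   /* tickets covered by threads 0..i */
    while (winning >= covered)
        covered += tickets[++i];
    return i;
}
```

A caller might pass rand() % total as the winning number; that is good enough for illustration, although the modulo introduces a slight bias. With ticket counts 1 and 2, ticket 0 selects the first thread and tickets 1 and 2 select the second, matching the probabilities 1/3 and 2/3.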

20.4. Why it is interesting

Lottery scheduling is conceptually simple.

It gives fairness through randomness rather than complicated deterministic tracking.

20.5. Practical limitations

Lottery scheduling may take time to converge to fairness.

Short-term behavior can be unfair.

A thread may simply be unlucky.

For CPU scheduling, this is often undesirable.

Thus it is mostly interesting as a conceptual technique, not a dominant practical CPU scheduler.

However, randomized allocation ideas may be useful in other resource-management contexts, such as choosing memory pages to evict.

21. STFQ: Start-Time Fair Queuing

21.1. Motivation

Start-Time Fair Queuing gives deterministic proportional fairness.

The lecture describes it as:

Round Robin, but on virtual time.

Or:

FIFO, but ordered by virtual start time instead of real arrival time.

21.2. Core idea

Each thread has a virtual time \(V_i\).

The scheduler:

  • always runs the thread with the earliest virtual time;
  • after the thread receives service, advances its virtual time according to its weight.

A high-weight thread’s virtual time advances more slowly.

Thus it is selected more often.

21.3. Update rule

After thread \(i\) receives service amount \(\Delta\):

\begin{equation*} V_i \leftarrow V_i + \Delta \cdot \frac{\sum_{j=1}^{n} w_j}{w_i}. \end{equation*}

Since:

\[ \operatorname{share}_i = \frac{w_i}{\sum_{j=1}^{n} w_j}, \]

the update can also be written as:

\begin{equation*} V_i \leftarrow V_i + \Delta \cdot \frac{1}{\operatorname{share}_i}. \end{equation*}

Interpretation:

  • small weight \(\Rightarrow\) small share \(\Rightarrow\) virtual time advances quickly;
  • large weight \(\Rightarrow\) large share \(\Rightarrow\) virtual time advances slowly.

The scheduler picks the smallest virtual time, so heavier threads get scheduled more often.

21.4. Example: weights 1 and 2

Suppose:

Thread  Weight
A       1
B       2

Total weight:

\[ 1 + 2 = 3. \]

Time slice:

\[ \Delta = 10. \]

21.4.1. A’s virtual-time increment

\begin{equation*} \Delta \cdot \frac{3}{1} = 10 \cdot 3 = 30. \end{equation*}

So each time A runs for 10 ms:

\[ V_A \leftarrow V_A + 30. \]

21.4.2. B’s virtual-time increment

\begin{equation*} \Delta \cdot \frac{3}{2} = 10 \cdot 1.5 = 15. \end{equation*}

So each time B runs for 10 ms:

\[ V_B \leftarrow V_B + 15. \]

B’s virtual time advances half as fast as A’s, so B gets scheduled more often.

21.5. Example schedule

Starting with:

\[ V_A = 0,\quad V_B = 0. \]

Either A or B can be chosen first. Suppose A is chosen.

After A runs:

\[ V_A = 30,\quad V_B = 0. \]

Now B has smaller virtual time, so B runs.

After B runs:

\[ V_A = 30,\quad V_B = 15. \]

B still has smaller virtual time, so B runs again.

After B runs:

\[ V_A = 30,\quad V_B = 30. \]

Now they tie. Depending on tie-breaking, B may run or A may run.

Over time, the schedule gives:

  • A about one third of CPU time;
  • B about two thirds of CPU time.

This matches their weights.
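
The walk-through above can be checked with a short simulation. This is a sketch, not the lecture's implementation: struct stfq_thread and run_stfq are illustrative names, all threads are assumed to stay ready, and ties are broken in favor of the lower array index.

```c
#include <assert.h>
#include <stddef.h>

struct stfq_thread {
    char name;
    int weight;
    double vtime;   /* virtual time V_i */
    int slices;     /* number of time slices received so far */
};

/* Make `rounds` scheduling decisions with time slice `delta`.
   Each decision picks the thread with the smallest virtual time and then
   advances it by delta * (sum of all weights) / (its own weight). */
void run_stfq(struct stfq_thread *t, size_t n, int rounds, double delta) {
    assert(n > 0);
    int total_weight = 0;
    for (size_t i = 0; i < n; i++) {
        assert(t[i].weight > 0);
        total_weight += t[i].weight;
    }
    for (int r = 0; r < rounds; r++) {
        size_t pick = 0;
        for (size_t i = 1; i < n; i++)
            if (t[i].vtime < t[pick].vtime)
                pick = i;
        t[pick].vtime += delta * total_weight / t[pick].weight;
        t[pick].slices++;
    }
}
```

With weights 1 and 2 and \(\Delta = 10\), nine decisions give A three slices and B six, and both virtual times end at 90.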

21.6. Waking threads and virtual time

A subtle issue:

What virtual time should a thread receive when it wakes up or enters the system?

If it gets virtual time zero, it may unfairly dominate the CPU.

If it keeps an old virtual time from long ago, it may be far behind and get too much service.

If it is set only to current real time, sleeping/I/O-bound threads may be penalized.

The lecture discusses several possible approaches and why naive ones fail.

21.6.1. Bad idea 1: do nothing

If a sleeping thread keeps an old virtual time, then after sleeping for a long time, its virtual time may be far in the past.

When it wakes, it may monopolize the CPU to “catch up.”

21.6.2. Bad idea 2: reset to current scaled real time only

This can penalize I/O-bound sleeping threads too much, because they are treated as if they had been consuming CPU while asleep.

21.6.3. Bad idea 3: account only for sleep time naively

This can create unfairness because it may not account properly for weights and previous service.

21.6.4. Better compromise

Use a rule that prevents virtual time from going backward.

The lecture’s idea:

\[ V_i \leftarrow \max(V_i, \text{current appropriate virtual time}). \]

Meaning:

  • if the thread’s virtual time is already far in the future because it used lots of CPU, it cannot erase that history by sleeping briefly;
  • if the thread has been sleeping and is not ahead, it can rejoin fairly;
  • this avoids both unfair advantage and unfair penalty.

The principle:

A thread should not be able to cheat the scheduler by sleeping briefly, but sleeping should also not be punished as if it consumed CPU.
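
The max rule can be sketched in a few lines of C. This is an illustration under an explicit assumption: the notes only say "current appropriate virtual time", and the sketch takes that reference to be the minimum virtual time among the currently runnable threads. The function names `min_runnable_vtime` and `wake_vtime` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: the reference virtual time is the smallest virtual time
   among the runnable threads. Requires n > 0. */
long min_runnable_vtime(const long *vtimes, size_t n)
{
    assert(n > 0);
    long m = vtimes[0];
    for (size_t i = 1; i < n; i++)
        if (vtimes[i] < m)
            m = vtimes[i];
    return m;
}

/* Virtual time assigned to a thread when it wakes up:
   V_i <- max(V_i, reference). The result is never below the thread's
   old virtual time and never below the reference. */
long wake_vtime(long old_vtime, long reference_vtime)
{
    return old_vtime > reference_vtime ? old_vtime : reference_vtime;
}
```

With runnable threads at virtual times 300 and 315, a long sleeper that stopped at 40 rejoins at 300 and cannot monopolize the CPU to catch up. A thread that ran up to 350 and slept only briefly keeps 350, so sleeping does not erase its history.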

22. Overall Conceptual Arc of Lectures 5 and 6

22.1. From synchronization to scheduling

The course transitions from shared-memory synchronization to CPU scheduling.

Synchronization answered:

How do threads coordinate safely when accessing shared state?

Scheduling asks:

Which thread should run next, and for how long?

The two topics are connected:

  • semaphores integrate with the scheduler;
  • blocking removes a thread from the ready queue;
  • waking adds a thread back to the ready queue;
  • locks and semaphores affect which threads are runnable;
  • priority scheduling interacts with lock design, leading to issues such as priority inversion and priority inheritance.

22.2. Layering lesson from synchronization

The OS builds layers:

  1. hardware atomicity and interrupt control;
  2. spin locks and memory barriers;
  3. semaphores;
  4. binary/counting semaphore patterns;
  5. producer-consumer and reader-writer algorithms.

This is a classic OS design pattern:

Build convenient, portable abstractions on top of machine-dependent mechanisms.

22.3. Scheduling lesson

There is no single perfect scheduler.

Different policies optimize different goals:

| Policy | Main idea | Strength | Weakness |
|--------+-----------+----------+----------|
| FIFO | arrival order | very simple | bad response time under unlucky order |
| SJF | shortest job first | optimal average response time if job lengths known | needs future knowledge; starvation; non-preemptive |
| RR | equal time slices | robust, simple fairness | may delay urgent/short/I/O-bound work |
| FP | static priorities | good for real-time systems | low-priority starvation; needs analysis |
| EDF | earliest deadline | optimal for meeting deadlines on uniprocessor if feasible | needs deadlines; poor under overload; not optimal on multiprocessors |
| SRPT | shortest remaining time | good response and I/O behavior | needs remaining-time prediction |
| MLFQ | feedback-based approximation | adapts to CPU/I/O behavior | accounting can be tricky; can be gamed |
| Lottery | randomized proportional share | simple conceptual fairness | short-term unfairness |
| STFQ | virtual-time fairness | deterministic proportional fairness | needs careful handling of sleeping/waking threads |
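
The FIFO and SJF rows can be compared with a small calculation. The sketch below assumes that all jobs arrive at time 0 and run to completion without preemption, so a job's response time equals its completion time. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sum of completion times when jobs run back to back in array order.
   ASSUMPTION: all jobs arrive at time 0. */
long total_response(const int *len, size_t n)
{
    long clock = 0, total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(len[i] >= 0);
        clock += len[i];   /* job i finishes at this time */
        total += clock;
    }
    return total;
}

/* Ascending comparison of two ints for qsort. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Same sum, but with the jobs run shortest first (SJF).
   Works on a sorted copy, so `len` is not modified. */
long total_response_sjf(const int *len, size_t n)
{
    if (n == 0)
        return 0;
    int *sorted = malloc(n * sizeof *sorted);
    assert(sorted != NULL);
    memcpy(sorted, len, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_int);
    long total = total_response(sorted, n);
    free(sorted);
    return total;
}
```

For job lengths 100, 10, 10 in that arrival order, FIFO gives completion times 100, 110, 120 (average 110), while SJF gives 10, 20, 120 (average 50). The long job finishes at the same time in both cases, but FIFO makes the two short jobs wait behind it.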

22.4. Big takeaways

  1. Semaphores require scheduler integration.
    • Efficient waiting requires blocking, not spinning.
  2. On uniprocessors, disabling interrupts can provide short atomic sections.
    • But this does not work on multiprocessors.
  3. On multiprocessors, spin locks protect short critical sections.
    • But spin locks require atomic hardware instructions and memory barriers.
  4. Memory barriers are necessary because both compilers and CPUs reorder operations.
    • Correct locking must constrain both.
  5. Shared memory synchronization can be expensive.
    • Cache coherence and topology matter.
  6. Scheduling is policy.
    • Dispatching is mechanism.
  7. Different scheduling policies encode different goals.
    • Average response time, fairness, deadline satisfaction, isolation, and throughput can conflict.
  8. Exact optimal policies often require impossible knowledge.
    • SJF/SRPT need future execution times.
    • Predicting exact run times for arbitrary programs would require solving the halting problem, which no kernel can do.
  9. Practical schedulers use approximations.
    • MLFQ uses feedback and temporal locality: a thread's recent behavior is taken as a predictor of its near-future behavior.
    • Fair schedulers use weights and virtual time.
  10. Real systems combine ideas.
    • POSIX real-time scheduling uses fixed priorities.
    • Linux also supports deadline scheduling.
    • General-purpose scheduling often emphasizes fairness and responsiveness.
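
Takeaways 3 and 4 can be made concrete with a minimal spin lock sketch. It uses C11 `<stdatomic.h>` rather than any kernel's own primitives, and the type and function names (`spinlock`, `spin_trylock`, and so on) are hypothetical. The acquire and release memory orders play the role of the memory barriers: they keep both the compiler and the CPU from moving critical-section accesses outside the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin lock built on the C11 atomic_flag type. */
struct spinlock {
    atomic_flag held;
};

void spin_init(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_relaxed);
}

/* Atomic test-and-set. Returns true if the caller got the lock.
   Acquire ordering: later accesses cannot move before this point. */
bool spin_trylock(struct spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is free; only sensible for short
   critical sections. */
void spin_lock(struct spinlock *l)
{
    while (!spin_trylock(l))
        ;  /* spin */
}

/* Release ordering: earlier accesses cannot move after this point. */
void spin_unlock(struct spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A kernel version would typically also disable interrupts or preemption on the local CPU while the lock is held. That part is omitted here.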

Author: Lowtroo

Created on: 2026-05-03 Sun 14:00
