6.852: Notes for lecture on 20 Nov 2003
Victor Luchangco

Pragmatics
==========

In this lecture, we discuss several issues that are important for real
systems.  This isn't a systematic study, but it's important to be aware
of these issues.

Issues
------

Population-awareness
- what about dynamic threads?

Space consumption
- need to allocate memory, which can be expensive
  - need to preallocate memory
- unbounded counters

Highly contended locations
- cache effects
- conflicts lead to wasted work ("helping" can hurt)

Disjoint parallel access

Data layout effects on caching
- spatial locality (good for one processor)
- false sharing (bad for multiprocessors)

NUMA (nonuniform memory access) architectures

Memory consistency models and other sources of reordering
- compilers (!)

CAS is much more expensive than LD/ST (what about membars [memory barriers]?)

Locking vs. nonblocking (various nonblocking conditions)

Local spinning (busy waiting) vs. sleeping

Software engineering
- reducing complexity
- protecting programmers from themselves

Managing concurrency
--------------------

Desiderata
- minimize waiting
- disjoint parallel access
- minimize synchronization operations (these are expensive)
- minimize contention (highly contended access is expensive)
- minimize redundant/wasted work
- fault tolerance
- reduce complexity
- simple algorithms
- modularity
- general techniques
- provable correctness
- performance (both average and worst case)

List-based sets
---------------

We now look at several linked-list-based implementations of sets as
examples of various ways to manage concurrency.

Coarse-grained locking
- one lock protects the entire list
- simple: almost the same as the sequential code (often underappreciated)
- no sharing

Fine-grained locking, #1
- each node has its own lock
- acquire a node's lock before the first access to it; retain it until
  the operation completes
- must use read/write locks (why?)
- upgrade to write locks for nodes to be updated
- need to be careful to avoid deadlock

Fine-grained locking, #2
- "hand-over-hand" locking (see the Java sketch after this section)
- uses mutex locks
- everyone gets stuck behind one slow process

Optimistic synchronization
- figure out what to do without taking locks
- take out locks only "briefly"
- may be possible to avoid taking out locks altogether in some cases

Nonblocking implementation (by Tim Harris, Maged Michael)
- use a "deleted" bit on each node's next pointer (see the Java sketch
  after this section)
- delete a node by setting the bit
- don't change a next pointer whose deleted bit is set
- remove deleted (marked) nodes
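To make "hand-over-hand" locking concrete, here is a minimal Java sketch
of a sorted list set (the class and field names are ours, not from the
lecture; keys are assumed to lie strictly between the integer sentinels).
Each traversal acquires a node's lock before releasing the previous
node's lock, which is why everyone piles up behind one slow process near
the front of the list.

import java.util.concurrent.locks.ReentrantLock;

public class HandOverHandListSet {
    private static final class Node {
        final int key;
        Node next;
        final ReentrantLock lock = new ReentrantLock();
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    private final Node head;   // sentinel; list is sorted by key

    public HandOverHandListSet() {
        head = new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));
    }

    public boolean add(int key) {
        Node pred = head;
        pred.lock.lock();                    // always hold pred's lock...
        Node curr = pred.next;
        curr.lock.lock();                    // ...before acquiring curr's
        try {
            while (curr.key < key) {
                pred.lock.unlock();          // release the trailing lock only
                pred = curr;                 // after the next one is held
                curr = curr.next;
                curr.lock.lock();
            }
            if (curr.key == key) return false;    // already in the set
            pred.next = new Node(key, curr);      // safe: pred and curr locked
            return true;
        } finally {
            curr.lock.unlock();
            pred.lock.unlock();
        }
    }

    public boolean remove(int key) {
        Node pred = head;
        pred.lock.lock();
        Node curr = pred.next;
        curr.lock.lock();
        try {
            while (curr.key < key) {
                pred.lock.unlock();
                pred = curr;
                curr = curr.next;
                curr.lock.lock();
            }
            if (curr.key != key) return false;    // not present
            pred.next = curr.next;                // unlink while both are locked
            return true;
        } finally {
            curr.lock.unlock();
            pred.lock.unlock();
        }
    }
}

Because locks are always acquired in list order there is no deadlock,
but a thread that stalls while holding a lock blocks every operation
that needs to pass its node.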
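The nonblocking implementation's "deleted bit on the next pointer" can
be expressed in Java with AtomicMarkableReference, which packs a mark
bit together with a reference and CASes both in one step.  The sketch
below is a minimal version in the Harris/Michael style (names are ours;
keys are assumed to lie strictly between the integer sentinels): remove
first sets the mark (logical deletion), and find physically unlinks
marked nodes as it traverses.

import java.util.concurrent.atomic.AtomicMarkableReference;

public class LockFreeListSet {
    private static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;   // mark bit = "deleted"
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node head;   // sentinel; list is sorted by key

    public LockFreeListSet() {
        head = new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));
    }

    // Returns (pred, curr) with pred.key < key <= curr.key, physically
    // unlinking any marked (logically deleted) nodes encountered.
    private Node[] find(int key) {
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {                       // curr is deleted:
                    if (!pred.next.compareAndSet(curr, succ, false, false))
                        continue retry;                   // pred changed; restart
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Node[]{pred, curr};
                pred = curr;
                curr = succ;
            }
        }
    }

    public boolean add(int key) {
        while (true) {
            Node[] w = find(key);
            Node pred = w[0], curr = w[1];
            if (curr.key == key) return false;            // already present
            Node node = new Node(key, curr);
            if (pred.next.compareAndSet(curr, node, false, false)) return true;
        }
    }

    public boolean remove(int key) {
        while (true) {
            Node[] w = find(key);
            Node pred = w[0], curr = w[1];
            if (curr.key != key) return false;            // not present
            Node succ = curr.next.getReference();
            // logical deletion: set the mark bit on curr's next pointer;
            // once it is set, no CAS expecting an unmarked pointer succeeds
            if (!curr.next.compareAndSet(succ, succ, false, true)) continue;
            // physical deletion is best effort; find() cleans up otherwise
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }

    public boolean contains(int key) {                    // wait-free
        Node curr = head;
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }
}

Because the mark and the successor pointer live in one CAS-able word,
a marked node's next pointer can never change afterward, which is
exactly the invariant stated in the bullets above.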
Implementing R/W memory over network
------------------------------------

ABD [Attiya, Bar-Noy, Dolev]
- single-writer, multi-reader registers (MWMR is homework)
- tolerates \floor{(n-1)/2} faults (fewer than half the processes fail)
- only stopping failures

WRITER

Variables: val, tag

write(v):
  val <- v
  tag <- tag+1
  send("write", val, tag) to all readers
  wait for acknowledgments from a majority (including self)

receive("read", u)_j:
  send("read-ack", val, tag, u) to j

READER (i)

Variables: val, tag, readtag (UID only)

read_i:
  readtag <- readtag+1                    [why do we need readtag?]
  send("read", readtag) to all other processes
  wait for responses (with value and tag) from a majority
  let t = largest tag received in a response
  if t > tag
    val <- value received with t          [why is this well-defined?]
    tag <- t
  send("propagate", val, tag, readtag) to all other readers
                                          [why not the writer?]
  wait for acks from a majority

receive("write", v, t):
  if t > tag
    val <- v
    tag <- t
  send("write-ack", t) to writer

receive("read", u)_j:
  send("read-ack", val, tag, u) to j

receive("propagate", v, t, u)_j:
  if t > tag
    val <- v
    tag <- t
  send("prop-ack", u) to j

(A toy Java sketch of the tag/majority bookkeeping appears at the end
of these notes.)

What kind of channels do we need?  FIFO?  Reliable?  Nonduplicating?
Byzantine?  What faults does this algorithm tolerate?

Linearize a write with tag t to:
- the point at which a majority of processes have tag >= t
  (several writes may linearize to the same time; what do we do?)

Linearize a read to the later of:
- immediately after the linearization point of the write that it reads
  (based on tag)
  (might be between two writes linearized to the "same time"; how?)
- the invocation of the read (why do we need this?)

Why are these linearization points between invocation and response?
Do they give the correct value?  (What is the abstract value?)

What are the implications of ABD?
- can adapt r/w shared-memory algorithms to an asynchronous network if
  fewer than half the processes fail
- consensus is still impossible in an asynchronous network if even one
  process may fail
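To isolate the tag/majority bookkeeping in ABD, here is a toy,
single-process Java sketch (all class and method names are ours, not
from the lecture).  It deliberately ignores channels, asynchrony, and
failures; it only shows how the writer's increasing tags, the reader's
two phases, and the intersection of any two majorities combine to make
the latest write visible to every later read.

import java.util.ArrayList;
import java.util.List;

public class AbdSketch {
    // One replica's state: the latest (tag, value) pair it has accepted.
    static final class Replica {
        long tag = 0;
        int val = 0;
        synchronized long[] readState() { return new long[]{tag, val}; }
        synchronized void update(long t, int v) {    // adopt only newer tags
            if (t > tag) { tag = t; val = v; }
        }
    }

    final List<Replica> replicas = new ArrayList<>();
    final int majority;
    long writerTag = 0;                              // single writer's tag

    AbdSketch(int n) {
        for (int i = 0; i < n; i++) replicas.add(new Replica());
        majority = n / 2 + 1;
    }

    // WRITE: pick a larger tag, then store (tag, v) at some majority
    // (here, the first floor(n/2)+1 replicas).
    void write(int v) {
        writerTag++;
        for (int i = 0; i < majority; i++) replicas.get(i).update(writerTag, v);
    }

    // READ, phase 1: query some majority (here, deliberately the *last*
    // majority, which must intersect the writer's) and keep the largest tag.
    // READ, phase 2: propagate that (tag, value) to a majority, so that a
    // later read cannot return an older value.
    int read() {
        long maxTag = -1;
        int maxVal = 0;
        for (int i = replicas.size() - majority; i < replicas.size(); i++) {
            long[] s = replicas.get(i).readState();
            if (s[0] > maxTag) { maxTag = s[0]; maxVal = (int) s[1]; }
        }
        for (int i = 0; i < majority; i++) replicas.get(i).update(maxTag, maxVal);
        return maxVal;
    }

    public static void main(String[] args) {
        AbdSketch reg = new AbdSketch(5);            // tolerates \floor{(5-1)/2} = 2 faults
        reg.write(42);
        System.out.println(reg.read());              // prints 42
    }
}

The point of the two quorum choices is that any two majorities of n
replicas intersect, so at least one replica queried by the read already
holds the writer's latest tag; the propagate phase is what lets reads
by different readers be ordered consistently.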