6.852: Notes for lecture on 4 Dec 2003
Victor Luchangco

Computing in the "real world"
=============================
(what we've been discussing for the past few weeks)

Intro
-----
(We're covering material from Chaps 21, 25.2, 23.1-2, 24.1-3)

We adopted asynchronous models because
- most real systems don't guarantee timing properties
- even to the extent they do, variance is usually very high, so waiting
  is impractical

BUT: FLP says consensus in an asynchronous model (either networks or
read/write shared memory) cannot be fault-tolerant
- even if only one process can fail
- even if it can only fail by stopping

This situation is completely different from synchronous networks, where
consensus is possible with any number of stopping failures, and with
fewer than one-third of the processes Byzantine.

Consider the Internet:
- No timing guarantees
- Processes may fail (and not just stopping failures)
- Communication is not reliable (last lecture covered getting reliable
  communication)
==> CONSENSUS IS IMPOSSIBLE
- how can we do any useful work?
- do we do any useful work?

Two options:
1. Strengthen the system
2. Weaken the problem requirements
   - weaken safety (usually considered changing the problem)
   - weaken progress (usually not considered changing the problem)
Usually we do both! (and try to avoid weakening safety)

Examples we've seen:
- strong synchronization primitives for shared-memory systems (CAS, etc.)
- Paxos: ensure safety, "hope things go well" for progress
- wait-free --> lock-free --> obstruction-free
- k-set agreement

Other examples:
- atomic broadcast (strong primitive for networks)
- group communication
- randomization (both weakens the problem and strengthens the system!)
- approximate agreement
- failure detectors
- timing (the main topic of this lecture)

Quickie on Obstruction-freedom
------------------------------
Recall:
  wait-freedom: everyone makes progress (I make progress)
  lock-freedom: someone makes progress
    (if we have finite work to do, then everything will get done)
  obstruction-freedom: I make progress if no one "interferes"
  - weaker than lock-freedom (and wait-freedom)
  - can solve consensus obstruction-free with async r/w shared memory (!)

What about in systems with strong (universal) synchronization primitives?
- we can compute anything lock-free, even population-oblivious
  (for wait-freedom, we need to know the number of processes)

BIG OPEN QUESTION: Can obstruction-free algorithms be more efficient?
- we've thought about this for a year, but no provable gap so far
  (but we did get some nice lock-free algorithms instead)

Failure Detectors
-----------------
Suppose we have an oracle that can tell us which processes have failed.
What can we do?

What guarantees do we get from a failure detector?
- can it make mistakes? false positives? false negatives?
- how quickly will we discover failures?
  (not clear what this means in asynchronous systems)
- original work was on the "weakest failure detector" that enables
  consensus

We assume a *perfect* failure detector (PFD):
- always says exactly which processes have failed
- assume only stopping failures
- does "inform_stopped(j)" when j fails

Then we can convert any synchronous (network) algorithm for use in an
asynchronous system:
- add a "round number" to each message
- wait for a message or a failure notification from each process
  (for each round)

How can we implement a PFD?
- in an asynchronous system, we can't. (Why?)
- use timing info (timeouts)

A failure detector provides modularity: it abstracts away the dependence
on timing.

Implementing a PFD
------------------
Assume reliable communication with bounded-delay (FIFO) channels, and
that processes have access to the "real time" and an upper bound on step
time.
- each process sends regular "I'm alive" messages to the FD
- the FD "suspects" process i if no msg from i for "long enough"

Suppose the channel delay bound is d, and the maximum time between
sending msgs is b. How long is "long enough"? b+d

How do we model this PFD? Assume we have a special state variable "now",
which keeps the current time. Also, for each process i, maintain
last_msg_time[i].

Input:  rcv(i, "I'm alive")
  Eff:  last_msg_time[i] <- now

Output: inform_stopped(j)
  Pre:  now > last_msg_time[j] + b + d

(No precondition for rcv, no effect for inform_stopped.)

- The PFD is not monolithic: the implementation has pieces on each
  processor, so every processor needs to send "I'm alive" msgs to every
  other. Also, there must be separate inform_stopped actions for each
  processor. (This is how it is modeled in the book.)
- Don't worry for now about redundant inform_stopped actions.

How is "now" updated?

Modeling timing-based systems
-----------------------------
"now" can't be updated by normal actions.
Introduce special "time-passage" actions: \nu(t) = t time passes
- like input actions: not "locally controlled" by any component
- but we allow preconditions! (why?)
- time "continuity" axioms
- general timed automata (GTA): explicit "now" variable not required

We need preconditions for upper bounds on time:
- if action a must happen by time T, then \nu(t) is not enabled if
  t > T - now
- the upper bound for a "prevents time from going forward too far"
Does this make sense? How can an action stop time?
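The timeout-based PFD implementation above can be sketched in a few
lines of Python. This is only an illustration: the class and method
names (PFD, heartbeat, suspects) are ours, B and D stand for the bounds
b and d, and the detector is "perfect" only if those bounds really hold.

```python
import time

B, D = 0.2, 0.1   # assumed heartbeat-interval and channel-delay bounds (seconds)

class PFD:
    """One processor's piece of the timeout-based failure detector."""

    def __init__(self, processes):
        now = time.monotonic()
        # last_msg_time[i]: when we last heard "I'm alive" from process i
        self.last_msg_time = {i: now for i in processes}

    def heartbeat(self, i):
        """rcv(i, "I'm alive"):  Eff: last_msg_time[i] <- now"""
        self.last_msg_time[i] = time.monotonic()

    def suspects(self):
        """inform_stopped(j) is enabled when now > last_msg_time[j] + B + D."""
        now = time.monotonic()
        return {j for j, t in self.last_msg_time.items() if now > t + B + D}

fd = PFD({1, 2})
time.sleep(0.2)
fd.heartbeat(1)       # process 1 keeps sending; process 2 has stopped
time.sleep(0.2)
print(fd.suspects())  # process 2 has been silent for ~0.4 s > B + D = 0.3 s
```

If the timing bounds are violated (a slow process or a slow channel),
this detector can suspect a live process, which is exactly why it only
models a *perfect* FD under the stated assumptions.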
We don't model fairness:
- if every action has a time bound, then it must happen "eventually"
  (assuming time doesn't stop)
- we could add fairness back in for actions with no time bound
  (but we don't)

Admissible timed executions/traces:
- those executions/traces in which time doesn't stop
- this is a liveness requirement: it guarantees fairness

Trace: (a_1, t_1), (a_2, t_2), ...
- "action a_i occurs at time t_i"
- a_i is an external action
- the t_i's are monotonically nondecreasing
  (t_{i+1} >= t_i; can t_{i+1} = t_i?)
- can we require the t_i's to go to infinity?

Execution: s_0 a_1 s_1 a_2 s_2 ...
- a_i can be an input, output, internal, or time-passage action
- the sum of the times in time-passage actions must go to infinity
  - so the execution must be infinite
  - but requiring infinitely many time-passage actions is not enough!
    (Zeno executions)

GTA properties:
- powerful: can model any time-bound requirements
  (but not "hybrid" automata)
- we can write automata that have no admissible timed executions
  (these automata are "bad", but we can't always recognize them easily)
(How do we model the d-delay-bounded channels above?)

MMT automata (after Merritt, Modugno, Tuttle):
- a less powerful model
- suitable for "low-level" models
- all MMT automata are "reasonable"
- idea: give bounds on when an action must occur after it becomes enabled
- use tasks to group actions; time bounds apply to an entire task
  (no fairness)
- reset the time bounds when an action occurs or when a task is newly
  enabled

MMT = IOA + boundmap
boundmap: tasks -> (lower, upper)   [lower and upper are time bounds]
- require a finite number of tasks (why?)

MMT automata can't model the d-delay-bounded channels above (if they
are FIFO):
- a message can't be delivered until the previous messages are delivered
(What if we drop the FIFO requirement? What if the set of possible
messages is finite?)
An alternative channel:
- d is a bound on the delay of delivery of the *oldest* message
- a msg may be delayed (k+1)d if k messages are in the channel when it
  is sent
- MMT model: all rcv actions in one task, with bounds (0, d)

Transformation from MMT to GTA:
Add a "now" variable; for each task C, add "first(C)" and "last(C)".
- the next action of C occurs between first(C) and last(C), unless C is
  disabled
For the time-passage action \nu(t):
- increase the "now" variable by t
- not enabled if now + t > last(C) for any task C
For an action a in task C:
- add "now >= first(C)" to the precondition
- "reset" the bounds when a occurs (if C is enabled in the post-state):
  - first(C) <- now + lower(C); last(C) <- now + upper(C)
- also reset the bounds for any task C' that is not enabled in the
  pre-state but is enabled in the post-state
- for any task C' (including C) that is not enabled in the post-state,
  set first(C') to 0 and last(C') to infinity (why?)

Mutual exclusion with time bounds
---------------------------------
What do we get by using timing info?
- consensus can be solved (use a PFD), so we can do (almost) anything
- how efficiently? (relying on a PFD may be expensive, wasteful)

Recall the Burns-Lynch lower bound: with r/w shared memory, we need at
least n registers for n-process mutex.

Fischer's mutex algorithm (from an email to Lamport):
- with timing, we can make do with a single register (owner)
- idea: write your name; wait; if your name was not overwritten, enter
  the critical section
- how long do we have to wait?
- processes must check before writing

State variable: owner, initially null

Code for process i:
try:  while (owner != null) {};
      owner <- i;           % at most time b after seeing owner = null
      wait at least time b;
      if (owner != i) goto try;
      % critical section
exit: owner <- null;

Properties:
- mutual exclusion (why?)
- bounded time for some process to enter once the critical section is
  free (assuming an upper bound on the time for each step)
- bounded exit time
- no fairness (starvation is possible, even likely under heavy
  contention)
- mutex (not just progress) can be violated by incorrect timing
  (in fact, progress is not violated by bad timing)

Can we guarantee safety always, and progress if timing is good?
- with fewer than n registers?
- idea: combine an asynchronous mutex algorithm with Fischer's algorithm
  - the async alg guarantees progress only when only one process is
    trying
  - use Fischer's alg so that only one process is trying (in the async
    alg)
  - a timing failure can violate progress, but the async alg ensures
    mutex
  - if we can detect "stuck" states (when no progress is possible), and
    we can "reset" the state so that it is not stuck, we can overcome
    transient timing failures (similar to Paxos)
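Fischer's algorithm can be simulated with Python threads. This is a
sketch under stated assumptions, not the lecture's model: B plays the
role of the step-time bound b, the sleeps stand in for "wait at least
time b", and all names (fischer_lock, worker, ...) are ours. Mutual
exclusion holds only if each thread's write lands within B of its check,
which real OS scheduling makes very likely here but does not guarantee.

```python
import threading
import time

B = 0.05          # assumed upper bound on check-to-write step time (seconds)
owner = None      # the single shared register
counter = 0       # protected by the lock; used to observe mutual exclusion

def fischer_lock(i):
    global owner
    while True:
        while owner is not None:   # check before writing
            time.sleep(0.001)
        owner = i                  # write own name (within B of the check)
        time.sleep(B)              # wait at least time B
        if owner == i:             # name not overwritten: enter CS
            return
        # otherwise another process overwrote us: retry

def fischer_unlock():
    global owner
    owner = None

def worker(i):
    global counter
    for _ in range(10):
        fischer_lock(i)
        c = counter                # deliberately non-atomic increment:
        counter = c + 1            # a mutex violation could lose updates
        fischer_unlock()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)    # 30 if mutual exclusion held throughout
```

Note how the sketch mirrors the properties above: there is no fairness
(a thread can lose the race repeatedly), and if B understates the true
step time, two threads could both pass the `owner == i` test and
violate mutual exclusion, while progress would survive.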