6.852: Notes for lecture on 4 Dec 2003
Victor Luchangco

Computing in the "real world"
=============================
(what we've been discussing for the past few weeks)

Intro
-----
(We're covering material from Chaps 21, 25.2, 23.1-2, 24.1-3)

We adopted asynchronous models because
- most real systems don't guarantee timing properties
- even to the extent they do, variance is usually very high, so waiting
  is impractical

BUT: FLP says consensus in an asynchronous model (either networks or
read/write shared memory) cannot be fault-tolerant
- even if only one process can fail
- even if it can only fail by stopping

This situation is completely different from synchronous networks, where
consensus is possible with any number of stopping failures, and with
fewer than one-third of the processes Byzantine.

Consider the Internet:
- No timing guarantees
- Processes may fail (and not just stopping failures)
- Communication is not reliable (last lecture covered getting reliable
  communication)
==> CONSENSUS IS IMPOSSIBLE
- how can we do any useful work?
- do we do any useful work?

Two options:
1. Strengthen the system
2. Weaken the problem requirements
   - weaken safety (usually considered changing the problem)
   - weaken progress (usually not considered changing the problem)
Usually we do both! (and try to avoid weakening safety)

Examples we've seen:
- strong synchronization primitives for shared-memory systems (CAS, etc.)
- Paxos: ensure safety, "hope things go well" for progress
- wait-free --> lock-free --> obstruction-free
- k-set agreement

Other examples:
- atomic broadcast (strong primitive for networks)
- group communication
- randomization (both weakens the problem and strengthens the system!)
- approximate agreement
- failure detectors
- timing (the main topic of this lecture)

Quickie on Obstruction-freedom
------------------------------
Recall:
  wait-freedom: everyone makes progress (I make progress)
  lock-freedom: someone makes progress
    (if we have finite work to do, then everything will get done)
  obstruction-freedom: I make progress if no one "interferes"
  - weaker than lock-freedom (and wait-freedom)
  - can solve consensus obstruction-free with async r/w shared memory (!)

What about in systems with strong (universal) synchronization primitives?
- we can compute anything lock-free, even population-oblivious
  (for wait-freedom, we need to know the number of processes)

BIG OPEN QUESTION: Can obstruction-free algorithms be more efficient?
- we've thought about this for a year, but no provable gap so far
  (but we did get some nice lock-free algorithms instead)

Failure Detectors
-----------------
Suppose we have an oracle that can tell us which processes have failed.
What can we do?

What guarantees do we get from a failure detector?
- can it make mistakes? false positives? false negatives?
- how quickly will we discover failures?
  (not clear what this means in asynchronous systems)
- original work was on the "weakest failure detector" that enables
  consensus

We assume a *perfect* failure detector (PFD):
- always says exactly which processes have failed
- assume only stopping failures
- does "inform_stopped(j)" when j fails

Then we can convert any synchronous (network) algorithm for use in an
asynchronous system:
- add a "round number" to each message
- wait for a message or a failure notification from each process
  (for each round)

How can we implement a PFD?
- in an asynchronous system, we can't. (Why?)
- use timing info (timeouts)

A failure detector provides modularity: it abstracts away the dependence
on timing.

Implementing a PFD
------------------
Assume reliable communication with bounded-delay (FIFO) channels, and
that processes have access to the "real time" and an upper bound on step
time.
- each process sends regular "I'm alive" messages to the FD
- the FD "suspects" process i if no msg from i for "long enough"

Suppose the channel delay bound is d, and the maximum time between
sending msgs is b. How long is "long enough"? b+d

How do we model this PFD? Assume we have a special state variable "now",
which keeps the current time. Also, for each process i, maintain
last_msg_time[i].

Input:  rcv(i, "I'm alive")
  Eff:  last_msg_time[i] <- now

Output: inform_stopped(j)
  Pre:  now > last_msg_time[j] + b + d

(No precondition for rcv, no effect for inform_stopped.)

- The PFD is not monolithic: the implementation has pieces on each
  processor, so every processor needs to send "I'm alive" msgs to every
  other. Also, there must be separate inform_stopped actions for each
  processor. (This is how it is modeled in the book.)
- Don't worry for now about redundant inform_stopped actions.

How is "now" updated?

Modeling timing-based systems
-----------------------------
"now" can't be updated by normal actions.
Introduce special "time-passage" actions: \nu(t) = t time passes
- like input actions: not "locally controlled" by any component
- but we allow preconditions! (why?)
- time "continuity" axioms
- general timed automata (GTA): explicit "now" variable not required

We need preconditions for upper bounds on time:
- if action a must happen by time T, then \nu(t) is not enabled if
  t > T - now
- the upper bound for a "prevents time from going forward too far"
Does this make sense? How can an action stop time?
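The timeout-based PFD implementation above can be sketched in a few
lines of Python. This is only an illustration: the class and method
names (PFD, heartbeat, suspects) are ours, B and D stand for the bounds
b and d, and the detector is "perfect" only if those bounds really hold.

```python
import time

B, D = 0.2, 0.1   # assumed heartbeat-interval and channel-delay bounds (seconds)

class PFD:
    """One processor's piece of the timeout-based failure detector."""

    def __init__(self, processes):
        now = time.monotonic()
        # last_msg_time[i]: when we last heard "I'm alive" from process i
        self.last_msg_time = {i: now for i in processes}

    def heartbeat(self, i):
        """rcv(i, "I'm alive"):  Eff: last_msg_time[i] <- now"""
        self.last_msg_time[i] = time.monotonic()

    def suspects(self):
        """inform_stopped(j) is enabled when now > last_msg_time[j] + B + D."""
        now = time.monotonic()
        return {j for j, t in self.last_msg_time.items() if now > t + B + D}

fd = PFD({1, 2})
time.sleep(0.2)
fd.heartbeat(1)       # process 1 keeps sending; process 2 has stopped
time.sleep(0.2)
print(fd.suspects())  # process 2 has been silent for ~0.4 s > B + D = 0.3 s
```

If the timing bounds are violated (a slow process or a slow channel),
this detector can suspect a live process, which is exactly why it only
models a *perfect* FD under the stated assumptions.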
We don't model fairness:
- if every action has a time bound, then it must happen "eventually"
  (assuming time doesn't stop)
- we could add fairness back in for actions with no time bound
  (but we don't)

Admissible timed executions/traces:
- those executions/traces in which time doesn't stop
- this is a liveness requirement: it guarantees fairness

Trace: (a_1, t_1), (a_2, t_2), ...
- "action a_i occurs at time t_i"
- a_i is an external action
- the t_i's are monotonically nondecreasing
  (t_{i+1} >= t_i; can t_{i+1} = t_i?)
- can we require the t_i's to go to infinity?

Execution: s_0 a_1 s_1 a_2 s_2 ...
- a_i can be an input, output, internal, or time-passage action
- the sum of the times in time-passage actions must go to infinity
  - so the execution must be infinite
  - but requiring infinitely many time-passage actions is not enough!
    (Zeno executions)

GTA properties:
- powerful: can model any time-bound requirements
  (but not "hybrid" automata)
- we can write automata that have no admissible timed executions
  (these automata are "bad", but we can't always recognize them easily)
(How do we model the d-delay-bounded channels above?)

MMT automata (after Merritt, Modugno, Tuttle):
- a less powerful model
- suitable for "low-level" models
- all MMT automata are "reasonable"
- idea: give bounds on when an action must occur after it becomes enabled
- use tasks to group actions; time bounds apply to an entire task
  (no fairness)
- reset the time bounds when an action occurs or when a task is newly
  enabled

MMT = IOA + boundmap
boundmap: tasks -> (lower, upper)   [lower and upper are time bounds]
- require a finite number of tasks (why?)

MMT automata can't model the d-delay-bounded channels above (if they
are FIFO):
- a message can't be delivered until the previous messages are delivered
(What if we drop the FIFO requirement? What if the set of possible
messages is finite?)
An alternative channel:
- d is a bound on the delay of delivery of the *oldest* message
- a msg may be delayed (k+1)d if k messages are in the channel when it
  is sent
- MMT model: all rcv actions in one task, with bounds (0, d)

Transformation from MMT to GTA:
Add a "now" variable; for each task C, add "first(C)" and "last(C)".
- the next action of C occurs between first(C) and last(C), unless C is
  disabled
For the time-passage action \nu(t):
- increase the "now" variable by t
- not enabled if now + t > last(C) for any task C
For an action a in task C:
- add "now >= first(C)" to the precondition
- "reset" the bounds when a occurs (if C is enabled in the post-state):
  - first(C) <- now + lower(C); last(C) <- now + upper(C)
- also reset the bounds for any task C' that is not enabled in the
  pre-state but is enabled in the post-state
- for any task C' (including C) that is not enabled in the post-state,
  set first(C') to 0 and last(C') to infinity (why?)

Mutual exclusion with time bounds
---------------------------------
What do we get by using timing info?
- consensus can be solved (use a PFD), so we can do (almost) anything
- how efficiently? (relying on a PFD may be expensive, wasteful)

Recall the Burns-Lynch lower bound: with r/w shared memory, we need at
least n registers for n-process mutex.

Fischer's mutex algorithm (from an email to Lamport):
- with timing, we can make do with a single register (owner)
- idea: write your name; wait; if your name was not overwritten, enter
  the critical section
- how long do we have to wait?
- processes must check before writing

State variable: owner, initially null

Code for process i:
try:  while (owner != null) {};
      owner <- i;           % at most time b after seeing owner = null
      wait at least time b;
      if (owner != i) goto try;
      % critical section
exit: owner <- null;

Properties:
- mutual exclusion (why?)
- bounded time for some process to enter once the critical section is
  free (assuming an upper bound on the time for each step)
- bounded exit time
- no fairness (starvation is possible, even likely under heavy
  contention)
- mutex (not just progress) can be violated by incorrect timing
  (in fact, progress is not violated by bad timing)

Can we guarantee safety always, and progress if timing is good?
- with fewer than n registers?
- idea: combine an asynchronous mutex algorithm with Fischer's algorithm
  - the async alg guarantees progress only when only one process is
    trying
  - use Fischer's alg so that only one process is trying (in the async
    alg)
  - a timing failure can violate progress, but the async alg ensures
    mutex
  - if we can detect "stuck" states (when no progress is possible), and
    we can "reset" the state so that it is not stuck, we can overcome
    transient timing failures (similar to Paxos)
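Fischer's algorithm can be simulated with Python threads. This is a
sketch under stated assumptions, not the lecture's model: B plays the
role of the step-time bound b, the sleeps stand in for "wait at least
time b", and all names (fischer_lock, worker, ...) are ours. Mutual
exclusion holds only if each thread's write lands within B of its check,
which real OS scheduling makes very likely here but does not guarantee.

```python
import threading
import time

B = 0.05          # assumed upper bound on check-to-write step time (seconds)
owner = None      # the single shared register
counter = 0       # protected by the lock; used to observe mutual exclusion

def fischer_lock(i):
    global owner
    while True:
        while owner is not None:   # check before writing
            time.sleep(0.001)
        owner = i                  # write own name (within B of the check)
        time.sleep(B)              # wait at least time B
        if owner == i:             # name not overwritten: enter CS
            return
        # otherwise another process overwrote us: retry

def fischer_unlock():
    global owner
    owner = None

def worker(i):
    global counter
    for _ in range(10):
        fischer_lock(i)
        c = counter                # deliberately non-atomic increment:
        counter = c + 1            # a mutex violation could lose updates
        fischer_unlock()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)    # 30 if mutual exclusion held throughout
```

Note how the sketch mirrors the properties above: there is no fairness
(a thread can lose the race repeatedly), and if B understates the true
step time, two threads could both pass the `owner == i` test and
violate mutual exclusion, while progress would survive.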