6.852: Notes for lecture on 20 Nov 2003
Victor Luchangco

Pragmatics
==========

In this lecture, we discuss several issues that are important for real
systems.  This isn't a systematic study, but it's important to be aware
of these issues.

Issues
------

Population-awareness
- what about dynamic threads?

Space consumption
- need to allocate memory, which can be expensive
  - need to preallocate memory
- unbounded counters

Highly contended locations
- cache effects
- conflicts lead to wasted work ("helping" can hurt)

Disjoint parallel access

Data layout effects on caching
- spatial locality (good for one processor)
- false sharing (bad for multiprocessors)

NUMA (nonuniform memory access) architectures

Memory consistency models and other sources of reordering
- compilers (!)

CAS is much more expensive than LD/ST (what about membars [memory barriers]?)

Locking vs. nonblocking (various nonblocking conditions)

Local spinning (busy waiting) vs. sleeping

Software engineering
- reducing complexity
- protecting programmers from themselves

Managing concurrency
--------------------

Desiderata
- minimize waiting
- disjoint parallel access
- minimize synchronization operations (these are expensive)
- minimize contention (highly contended access is expensive)
- minimize redundant/wasted work
- fault tolerance
- reduce complexity
- simple algorithms
- modularity
- general techniques
- provable correctness
- performance (both average and worst case)

List-based sets
---------------

We now look at several linked-list-based implementations of sets as
examples of various ways to manage concurrency.

Coarse-grained locking
- one lock protects the entire list
- simple: almost the same as the sequential code (often underappreciated)
- no sharing

Fine-grained locking, #1
- each node has its own lock
- acquire a node's lock before the first access to it; retain it until
  the operation completes
- must use read/write locks (why?)
- upgrade to write locks for nodes to be updated
- need to be careful to avoid deadlock

Fine-grained locking, #2
- "hand-over-hand" locking (see the Java sketch after this section)
- uses mutex locks
- everyone gets stuck behind one slow process

Optimistic synchronization
- figure out what to do without taking locks
- take out locks only "briefly"
- may be possible to avoid taking out locks altogether in some cases

Nonblocking implementation (by Tim Harris, Maged Michael)
- use a "deleted" bit on each node's next pointer (see the Java sketch
  after this section)
- delete a node by setting the bit
- don't change a next pointer whose deleted bit is set
- remove deleted (marked) nodes
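To make "hand-over-hand" locking concrete, here is a minimal Java sketch
of a sorted list set (the class and field names are ours, not from the
lecture; keys are assumed to lie strictly between the integer sentinels).
Each traversal acquires a node's lock before releasing the previous
node's lock, which is why everyone piles up behind one slow process near
the front of the list.

import java.util.concurrent.locks.ReentrantLock;

public class HandOverHandListSet {
    private static final class Node {
        final int key;
        Node next;
        final ReentrantLock lock = new ReentrantLock();
        Node(int key, Node next) { this.key = key; this.next = next; }
    }

    private final Node head;   // sentinel; list is sorted by key

    public HandOverHandListSet() {
        head = new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));
    }

    public boolean add(int key) {
        Node pred = head;
        pred.lock.lock();                    // always hold pred's lock...
        Node curr = pred.next;
        curr.lock.lock();                    // ...before acquiring curr's
        try {
            while (curr.key < key) {
                pred.lock.unlock();          // release the trailing lock only
                pred = curr;                 // after the next one is held
                curr = curr.next;
                curr.lock.lock();
            }
            if (curr.key == key) return false;    // already in the set
            pred.next = new Node(key, curr);      // safe: pred and curr locked
            return true;
        } finally {
            curr.lock.unlock();
            pred.lock.unlock();
        }
    }

    public boolean remove(int key) {
        Node pred = head;
        pred.lock.lock();
        Node curr = pred.next;
        curr.lock.lock();
        try {
            while (curr.key < key) {
                pred.lock.unlock();
                pred = curr;
                curr = curr.next;
                curr.lock.lock();
            }
            if (curr.key != key) return false;    // not present
            pred.next = curr.next;                // unlink while both are locked
            return true;
        } finally {
            curr.lock.unlock();
            pred.lock.unlock();
        }
    }
}

Because locks are always acquired in list order there is no deadlock,
but a thread that stalls while holding a lock blocks every operation
that needs to pass its node.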
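The nonblocking implementation's "deleted bit on the next pointer" can
be expressed in Java with AtomicMarkableReference, which packs a mark
bit together with a reference and CASes both in one step.  The sketch
below is a minimal version in the Harris/Michael style (names are ours;
keys are assumed to lie strictly between the integer sentinels): remove
first sets the mark (logical deletion), and find physically unlinks
marked nodes as it traverses.

import java.util.concurrent.atomic.AtomicMarkableReference;

public class LockFreeListSet {
    private static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;   // mark bit = "deleted"
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node head;   // sentinel; list is sorted by key

    public LockFreeListSet() {
        head = new Node(Integer.MIN_VALUE, new Node(Integer.MAX_VALUE, null));
    }

    // Returns (pred, curr) with pred.key < key <= curr.key, physically
    // unlinking any marked (logically deleted) nodes encountered.
    private Node[] find(int key) {
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {                       // curr is deleted:
                    if (!pred.next.compareAndSet(curr, succ, false, false))
                        continue retry;                   // pred changed; restart
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Node[]{pred, curr};
                pred = curr;
                curr = succ;
            }
        }
    }

    public boolean add(int key) {
        while (true) {
            Node[] w = find(key);
            Node pred = w[0], curr = w[1];
            if (curr.key == key) return false;            // already present
            Node node = new Node(key, curr);
            if (pred.next.compareAndSet(curr, node, false, false)) return true;
        }
    }

    public boolean remove(int key) {
        while (true) {
            Node[] w = find(key);
            Node pred = w[0], curr = w[1];
            if (curr.key != key) return false;            // not present
            Node succ = curr.next.getReference();
            // logical deletion: set the mark bit on curr's next pointer;
            // once it is set, no CAS expecting an unmarked pointer succeeds
            if (!curr.next.compareAndSet(succ, succ, false, true)) continue;
            // physical deletion is best effort; find() cleans up otherwise
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }

    public boolean contains(int key) {                    // wait-free
        Node curr = head;
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }
}

Because the mark and the successor pointer live in one CAS-able word,
a marked node's next pointer can never change afterward, which is
exactly the invariant stated in the bullets above.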
Implementing R/W memory over network
------------------------------------

ABD [Attiya, Bar-Noy, Dolev]
- single-writer, multi-reader registers (MWMR is homework)
- tolerates \floor{(n-1)/2} faults (fewer than half the processes fail)
- only stopping failures

WRITER

Variables: val, tag

write(v):
  val <- v
  tag <- tag+1
  send("write", val, tag) to all readers
  wait for acknowledgments from a majority (including self)

receive("read", u)_j:
  send("read-ack", val, tag, u) to j

READER (i)

Variables: val, tag, readtag (UID only)

read_i:
  readtag <- readtag+1                    [why do we need readtag?]
  send("read", readtag) to all other processes
  wait for responses (with value and tag) from a majority
  let t = largest tag received in a response
  if t > tag
    val <- value received with t          [why is this well-defined?]
    tag <- t
  send("propagate", val, tag, readtag) to all other readers
                                          [why not the writer?]
  wait for acks from a majority

receive("write", v, t):
  if t > tag
    val <- v
    tag <- t
  send("write-ack", t) to writer

receive("read", u)_j:
  send("read-ack", val, tag, u) to j

receive("propagate", v, t, u)_j:
  if t > tag
    val <- v
    tag <- t
  send("prop-ack", u) to j

(A toy Java sketch of the tag/majority bookkeeping appears at the end
of these notes.)

What kind of channels do we need?  FIFO?  Reliable?  Nonduplicating?
Byzantine?  What faults does this algorithm tolerate?

Linearize a write with tag t to:
- the point at which a majority of processes have tag >= t
  (several writes may linearize to the same time; what do we do?)

Linearize a read to the later of:
- immediately after the linearization point of the write that it reads
  (based on tag)
  (might be between two writes linearized to the "same time"; how?)
- the invocation of the read (why do we need this?)

Why are these linearization points between invocation and response?
Do they give the correct value?  (What is the abstract value?)

What are the implications of ABD?
- can adapt r/w shared-memory algorithms to an asynchronous network if
  fewer than half the processes fail
- consensus is still impossible in an asynchronous network if even one
  process may fail
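To isolate the tag/majority bookkeeping in ABD, here is a toy,
single-process Java sketch (all class and method names are ours, not
from the lecture).  It deliberately ignores channels, asynchrony, and
failures; it only shows how the writer's increasing tags, the reader's
two phases, and the intersection of any two majorities combine to make
the latest write visible to every later read.

import java.util.ArrayList;
import java.util.List;

public class AbdSketch {
    // One replica's state: the latest (tag, value) pair it has accepted.
    static final class Replica {
        long tag = 0;
        int val = 0;
        synchronized long[] readState() { return new long[]{tag, val}; }
        synchronized void update(long t, int v) {    // adopt only newer tags
            if (t > tag) { tag = t; val = v; }
        }
    }

    final List<Replica> replicas = new ArrayList<>();
    final int majority;
    long writerTag = 0;                              // single writer's tag

    AbdSketch(int n) {
        for (int i = 0; i < n; i++) replicas.add(new Replica());
        majority = n / 2 + 1;
    }

    // WRITE: pick a larger tag, then store (tag, v) at some majority
    // (here, the first floor(n/2)+1 replicas).
    void write(int v) {
        writerTag++;
        for (int i = 0; i < majority; i++) replicas.get(i).update(writerTag, v);
    }

    // READ, phase 1: query some majority (here, deliberately the *last*
    // majority, which must intersect the writer's) and keep the largest tag.
    // READ, phase 2: propagate that (tag, value) to a majority, so that a
    // later read cannot return an older value.
    int read() {
        long maxTag = -1;
        int maxVal = 0;
        for (int i = replicas.size() - majority; i < replicas.size(); i++) {
            long[] s = replicas.get(i).readState();
            if (s[0] > maxTag) { maxTag = s[0]; maxVal = (int) s[1]; }
        }
        for (int i = 0; i < majority; i++) replicas.get(i).update(maxTag, maxVal);
        return maxVal;
    }

    public static void main(String[] args) {
        AbdSketch reg = new AbdSketch(5);            // tolerates \floor{(5-1)/2} = 2 faults
        reg.write(42);
        System.out.println(reg.read());              // prints 42
    }
}

The point of the two quorum choices is that any two majorities of n
replicas intersect, so at least one replica queried by the read already
holds the writer's latest tag; the propagate phase is what lets reads
by different readers be ordered consistently.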