R(ead) C(opy) U(pdate) [email_address]
Agenda
  • What is RCU?
  • Why?
  • RCU Primitives
  • RCU List Operations
  • Sleepable RCU
  • User Level RCU
  • Q&A
What is RCU?
  • Read-copy-update
  • An alternative to rwlock
  • Allows low-overhead, wait-free reads
  • Updates can be expensive: old copies must be kept around while readers still use them
Why RCU?
  • Without a lock, this is broken by compiler optimizations and CPU out-of-order execution: a reader may see the new gp before the fields are initialized

    struct foo {
        int a;
        int b;
        int c;
    };
    struct foo *gp = NULL;

    /* . . . */

    p = kmalloc(sizeof(*p), GFP_KERNEL);
    p->a = 1;
    p->b = 2;
    p->c = 3;
    gp = p;  /* broken: nothing orders this store after the three above */
Why RCU?
  • Mutex: no concurrent readers
  • Spin_lock: ditto
  • Rwlock: allows concurrent readers. The right choice?
Why RCU?
  • rwlock is expensive
  • Even read_lock has more overhead than spin_lock
  • If write_lock is not really rare, rwlock contention is much worse than spin_lock contention
RCU Basics
  • Split each update into a removal phase and a reclamation phase
  • Removal is performed immediately, while reclamation is deferred until all readers active during the removal phase have completed
  • Takes advantage of the fact that writes to single aligned pointers are atomic on modern CPUs
RCU Terminology
  • read-side critical section: code delimited by rcu_read_lock() and rcu_read_unlock(); it MUST NOT sleep
  • quiescent state: any code not within an RCU read-side critical section
  • grace period: any time period during which each thread resides in at least one quiescent state
RCU Terminology
  • More on grace periods: after a full grace period, all pre-existing RCU read-side critical sections have completed
RCU Update Sequence
  1. Remove pointers to a data structure, so that subsequent readers cannot gain a reference to it
  2. Wait for all previous readers to complete their RCU read-side critical sections (that is, wait for a grace period to pass)
  3. At this point no reader can still hold a reference to the data structure, so it may now safely be reclaimed (e.g., in another thread)
When Does a Grace Period Pass?
  • RCU readers are not permitted to block, switch to user-mode execution, or enter the idle loop
  • As soon as a CPU is seen passing through any of these three states, we know that it has exited any previous RCU read-side critical sections
  • If we remove an item from a linked list and then wait until all CPUs have switched context, executed in user mode, or executed in the idle loop, we can safely free that item
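The quiescent-state rule on this slide can be modelled in a few lines of userspace C. This is only a single-threaded sketch of the bookkeeping, not the kernel's implementation: the names qs_count, gp_snapshot, note_quiescent_state(), grace_period_start() and grace_period_elapsed(), and the CPU count, are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define QS_SKETCH_NR_CPUS 4 /* hypothetical CPU count for this sketch */

/* Bumped each time a CPU passes through a quiescent state. */
static unsigned long qs_count[QS_SKETCH_NR_CPUS];

/* Copy of every CPU's counter, taken when a grace period starts. */
struct gp_snapshot {
    unsigned long seen[QS_SKETCH_NR_CPUS];
};

/* Would be called on a context switch, user-mode entry, or idle loop. */
static void note_quiescent_state(int cpu)
{
    assert(cpu >= 0 && cpu < QS_SKETCH_NR_CPUS);
    qs_count[cpu]++;
}

/* Updater: record where every CPU stands right after the removal. */
static void grace_period_start(struct gp_snapshot *snap)
{
    for (int cpu = 0; cpu < QS_SKETCH_NR_CPUS; cpu++)
        snap->seen[cpu] = qs_count[cpu];
}

/* The grace period has passed once every CPU's counter has moved on. */
static bool grace_period_elapsed(const struct gp_snapshot *snap)
{
    for (int cpu = 0; cpu < QS_SKETCH_NR_CPUS; cpu++)
        if (qs_count[cpu] == snap->seen[cpu])
            return false;
    return true;
}
```

An updater would take the snapshot after unlinking the item and free it only once grace_period_elapsed() returns true.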
Core RCU APIs
  • rcu_read_lock()
  • rcu_read_unlock()
  • synchronize_rcu() / call_rcu()
  • rcu_assign_pointer()
  • rcu_dereference()
Wait for Readers
  • synchronize_rcu(): blocks until all ongoing RCU read-side critical sections have completed; it does not wait for readers that start later
  • call_rcu(): registers a callback function and argument, which are invoked after all ongoing RCU read-side critical sections have completed; the caller does not block
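The asynchronous style of call_rcu() can be illustrated with a toy callback queue. This is a userspace sketch only: struct cb_head, toy_call_rcu(), toy_grace_period_end(), struct obj and obj_reclaim() are hypothetical names, and the end of the grace period is signalled by hand rather than detected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

static struct cb_head *pending;                  /* callbacks awaiting a grace period */
static struct cb_head **pending_tail = &pending; /* where the next one is linked in */

/* Like call_rcu(): queue the callback and return without waiting. */
static void toy_call_rcu(struct cb_head *head, void (*func)(struct cb_head *))
{
    assert(head != NULL && func != NULL);
    head->func = func;
    head->next = NULL;
    *pending_tail = head;
    pending_tail = &head->next;
}

/* Run when the grace period ends: invoke callbacks in queue order. */
static int toy_grace_period_end(void)
{
    struct cb_head *list = pending;
    int invoked = 0;

    pending = NULL;
    pending_tail = &pending;
    while (list != NULL) {
        struct cb_head *next = list->next; /* read first: callback may free the node */
        list->func(list);
        list = next;
        invoked++;
    }
    return invoked;
}

/* Example object that embeds the callback head as its first member. */
struct obj {
    struct cb_head rcu;
    int freed;
};

static void obj_reclaim(struct cb_head *head)
{
    /* Valid cast because rcu is the first member of struct obj. */
    struct obj *o = (struct obj *)head;

    o->freed = 1; /* a real callback would free the object here */
}
```

The updater keeps running after toy_call_rcu() returns; reclamation happens later, in whatever context ends the grace period.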
Assign & Retrieve
  • rcu_assign_pointer(): assigns a new value to an RCU-protected pointer
  • rcu_dereference(): fetches an RCU-protected pointer, which is safe to use until rcu_read_unlock()
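Outside the kernel, the ordering these two primitives provide can be approximated with C11 atomics. The sketch below is an analogy, not the kernel macros: publish_foo() and read_foo_sum() are made-up helpers, a release store stands in for rcu_assign_pointer(), and an acquire load is used as a conservative stand-in for rcu_dereference(). Reclamation of old versions is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct foo {
    int a;
    int b;
    int c;
};

/* Shared pointer; NULL until an updater publishes a structure. */
static _Atomic(struct foo *) gp = NULL;

/* Stand-in for rcu_assign_pointer(): the release store keeps the field
 * initialisation ordered before the pointer becomes visible. */
static void publish_foo(int a, int b, int c)
{
    struct foo *p = malloc(sizeof(*p));

    assert(p != NULL);
    p->a = a;
    p->b = b;
    p->c = c;
    atomic_store_explicit(&gp, p, memory_order_release);
}

/* Stand-in for rcu_dereference(): the acquire load guarantees that the
 * reader sees the fields written before the matching release store. */
static int read_foo_sum(void)
{
    struct foo *p = atomic_load_explicit(&gp, memory_order_acquire);

    if (p == NULL)
        return -1; /* nothing published yet */
    return p->a + p->b + p->c;
}
```

A reader therefore sees either NULL or a fully initialised structure, never a half-written one.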
RCU List Insert
  • list_add_rcu()
  • list_add_tail_rcu()
  • list_replace_rcu()
  • Concurrent updaters must be serialized by some lock
Sample Code

    struct foo {
        struct list_head list;  /* embedded list linkage */
        int a;
        int b;
        int c;
    };
    LIST_HEAD(head);
    DEFINE_SPINLOCK(list_lock);  /* serializes updaters */

    /* . . . */

    p = kmalloc(sizeof(*p), GFP_KERNEL);
    p->a = 1;
    p->b = 2;
    p->c = 3;
    spin_lock(&list_lock);
    list_add_rcu(&p->list, &head);  /* publish at the head of the list */
    spin_unlock(&list_lock);
RCU List Traversal
  • list_for_each_entry_rcu()
  • rcu_read_lock() and rcu_read_unlock() must be called around the traversal, but they never spin or block
  • Allows list_add_rcu() to execute concurrently
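Why a reader can walk the list while an insertion is in progress can be shown with a userspace analogue built on C11 atomics. This sketch uses a singly linked list rather than the kernel's list_head, and struct node, sketch_add_head() and sketch_sum() are hypothetical names; updaters are assumed to be serialized by the caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    _Atomic(struct node *) next;
};

static _Atomic(struct node *) list_head_sketch = NULL;

/* Updater side: callers must serialise updaters, e.g. with a lock. */
static void sketch_add_head(int value)
{
    struct node *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->value = value;
    /* Fully initialise the node before it becomes reachable. */
    atomic_init(&n->next,
                atomic_load_explicit(&list_head_sketch, memory_order_relaxed));
    /* The release store publishes the node to concurrent readers. */
    atomic_store_explicit(&list_head_sketch, n, memory_order_release);
}

/* Reader side: takes no lock and never spins or blocks. It sees the
 * list either with or without a node that is being added concurrently. */
static int sketch_sum(void)
{
    int sum = 0;
    struct node *n = atomic_load_explicit(&list_head_sketch,
                                          memory_order_acquire);

    while (n != NULL) {
        sum += n->value;
        n = atomic_load_explicit(&n->next, memory_order_acquire);
    }
    return sum;
}
```

Removal is not covered here: freeing a node safely requires waiting for a grace period first.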
RCU List Removal
  • list_del_rcu() removes an element from the list; updaters must be serialized by some lock
  • But when is it safe to free the element?
  • synchronize_rcu() blocks until all read-side critical sections that began before the call have completed
  • A call_rcu() callback runs after all read-side critical sections that began before the call_rcu() have completed
Sample Code

    spin_lock(&mylock);
    p = search(head, key);
    if (p == NULL) {
        spin_unlock(&mylock);
    } else {
        list_del_rcu(&p->list);  /* removal: new readers cannot find p */
        spin_unlock(&mylock);
        synchronize_rcu();       /* wait for pre-existing readers */
        kfree(p);                /* reclamation */
    }
Sleepable RCU
  • Why? Realtime kernels that require spinlock critical sections to be preemptible also require RCU read-side critical sections to be preemptible
SRCU Implementation Strategy
  • Goal: prevent any given task sleeping in an RCU read-side critical section from blocking an unbounded number of RCU callbacks
  • Refuse to provide asynchronous grace-period interfaces, such as Classic RCU's call_rcu() API
  • Isolate grace-period detection within each subsystem using SRCU
SRCU Grace Period?
  • Grace periods are detected by summing per-CPU counters
  • Readers manipulate only CPU-local counters
  • Two sets of per-CPU counters are kept, so that new readers can use one set while the other drains
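SRCU Data Structure

    struct srcu_struct_array {
        int c[2];   /* one reader count per counter set */
    };

    struct srcu_struct {
        int completed;   /* low bit selects the active counter set */
        struct srcu_struct_array __percpu *per_cpu_ref;
        struct mutex mutex;   /* serializes grace periods */
    };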
Wait for Grace Period
  • synchronize_srcu()
  • Flip the completed counter, so that new readers use the other set of per-CPU counters
  • Wait for the old set's count to drain to zero
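The flip-and-drain idea can be sketched as a small single-threaded model. This is not the kernel's SRCU code: struct toy_srcu and the toy_srcu_*() helpers are hypothetical, the CPU is passed in explicitly, and the memory barriers and the mutex that a real implementation needs are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SRCU_NR_CPUS 2 /* hypothetical CPU count for this sketch */

struct toy_srcu {
    int completed;              /* low bit selects the active counter set */
    int c[TOY_SRCU_NR_CPUS][2]; /* per-CPU reader counts, one per set */
};

/* Reader entry: bump this CPU's count in the current set, return its index. */
static int toy_srcu_read_lock(struct toy_srcu *sp, int cpu)
{
    int idx = sp->completed & 0x1;

    assert(cpu >= 0 && cpu < TOY_SRCU_NR_CPUS);
    sp->c[cpu][idx]++;
    return idx;
}

/* Reader exit: may run on a different CPU than the matching lock did. */
static void toy_srcu_read_unlock(struct toy_srcu *sp, int cpu, int idx)
{
    assert(cpu >= 0 && cpu < TOY_SRCU_NR_CPUS);
    sp->c[cpu][idx]--;
}

/* First half of the wait: flip, and report which set must now drain. */
static int toy_srcu_flip(struct toy_srcu *sp)
{
    int old_idx = sp->completed & 0x1;

    sp->completed++;
    return old_idx;
}

/* Second half: the old readers are gone when the old set sums to zero. */
static bool toy_srcu_drained(const struct toy_srcu *sp, int idx)
{
    int sum = 0;

    for (int cpu = 0; cpu < TOY_SRCU_NR_CPUS; cpu++)
        sum += sp->c[cpu][idx];
    return sum == 0;
}
```

Summing across CPUs matters because a reader that sleeps may unlock on a different CPU, leaving one counter at +1 and another at -1 while the total is still zero.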
SRCU APIs

    int init_srcu_struct(struct srcu_struct *sp);
    void cleanup_srcu_struct(struct srcu_struct *sp);
    int srcu_read_lock(struct srcu_struct *sp) __acquires(sp);
    void srcu_read_unlock(struct srcu_struct *sp, int idx);
    void synchronize_srcu(struct srcu_struct *sp);
    void synchronize_srcu_expedited(struct srcu_struct *sp);
    long srcu_batches_completed(struct srcu_struct *sp);
Userspace RCU
  • Available at http://lttng.org/urcu
  • git clone git://git.lttng.org/userspace-rcu.git
  • Debian: aptitude install liburcu-dev
  • Examples
Q & A
