Os Reindersfinal

Exploiting Parallelism with Multi-core Technologies

Coding with TBB Contest

Problem

Three Approaches for Improvement

Family Tree
[Figure: predecessors of Intel® Threading Building Blocks on a timeline marked 1988, 1995, 2001 and 2006, grouped as Languages, Pragmas and Libraries.]
Cilk: space efficient scheduler, cache-oblivious algorithms
Threaded-C: continuation tasks, task stealing
OpenMP*: fork/join tasks
OpenMP taskqueue: while & recursion
Chare Kernel: small tasks
JSR-166 (FJTask): containers
STL: generic programming
STAPL: recursive ranges
ECMA .NET*: parallel iteration classes
*Other names and brands may be claimed as the property of others

Enter Intel® Threading Building Blocks

Example (just a peek)

SERIAL VERSION

    void ApplyFoo( size_t n, int x ) {
        for( size_t i=0; i<n; ++i )
            Foo( i, x );
    }

PARALLEL VERSION (the way I wish I could write it)

    // "<>" stands for an anonymous function; it is wished-for syntax, not valid C++.
    void ParallelApplyFoo( size_t n, int x ) {
        parallel_for( blocked_range<size_t>(0,n,10),
            <>( const blocked_range<size_t>& range ) {
                for( size_t i=range.begin(); i<range.end(); ++i )
                    Foo( i, x );
            } );
    }

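Parallel Version (as it can be written)

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    // Function object that applies Foo to every index in a subrange.
    class ApplyFoo {
    public:
        int my_x;
        ApplyFoo( int x ) : my_x(x) {}
        void operator()( const blocked_range<size_t>& range ) const {
            for( size_t i=range.begin(); i!=range.end(); ++i )
                Foo( i, my_x );
        }
    };

    void ParallelApplyFoo( size_t n, int x ) {
        // Iterate over [0,n) with a grain size of 10.
        parallel_for( blocked_range<size_t>(0,n,10), ApplyFoo(x) );
    }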
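
The hand-written function object above has the same shape as what a C++11 lambda expression produces, which is close to the "wish" version on the previous slide. The following is a minimal sketch of that equivalence, not TBB code: `IndexRange` and `for_each_chunk` are hypothetical stand-ins for `blocked_range` and `parallel_for` so the example compiles without the library, and the chunks run serially here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for blocked_range<size_t>: a half-open interval [first, last).
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical stand-in for parallel_for: cuts [0, n) into chunks of at most
// `grain` iterations and hands each chunk to `body`. Chunks run serially here;
// a real scheduler would be free to run them on different threads.
template <typename Body>
void for_each_chunk(std::size_t n, std::size_t grain, const Body& body) {
    assert(grain > 0);
    for (std::size_t lo = 0; lo < n; lo += grain) {
        std::size_t hi = (n - lo < grain) ? n : lo + grain;
        body(IndexRange{lo, hi});
    }
}

// Hand-written function object, same shape as the ApplyFoo class on the slide.
class AddX {
public:
    std::vector<int>& out;
    int my_x;
    AddX(std::vector<int>& o, int x) : out(o), my_x(x) {}
    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) out[i] += my_x;
    }
};

// Adds x to every element using the explicit function object.
std::vector<int> with_functor(std::size_t n, int x) {
    std::vector<int> v(n, 0);
    for_each_chunk(n, 10, AddX(v, x));
    return v;
}

// Same computation with the body written in place as a lambda.
std::vector<int> with_lambda(std::size_t n, int x) {
    std::vector<int> v(n, 0);
    for_each_chunk(n, 10, [&v, x](const IndexRange& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) v[i] += x;
    });
    return v;
}
```

Both functions produce the same vector; the lambda only removes the boilerplate of naming a class and storing the captured values by hand.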

Generic Programming

Key Features of Intel Threading Building Blocks

Relaxed Sequential Semantics

Synchronization Primitives: atomic, spin_mutex, spin_rw_mutex, queuing_mutex, queuing_rw_mutex, mutex
Generic Parallel Algorithms: parallel_for, parallel_while, parallel_reduce, pipeline, parallel_sort, parallel_scan
Concurrent Containers: concurrent_hash_map, concurrent_queue, concurrent_vector
Task scheduler
Memory Allocation: cache_aligned_allocator, scalable_allocator

Serial Example

    static void SerialUpdateVelocity() {
        for( int i=1; i<UniverseHeight-1; ++i )
            for( int j=1; j<UniverseWidth-1; ++j )
                V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
    }

Parallel Version

On the slide, color marks the original code (blue), what TBB provides (red), and library boilerplate (black).

    // V, S, T, M, UniverseHeight and UniverseWidth are the globals used by the serial example.
    struct UpdateVelocityBody {                                    // task
        void operator()( const blocked_range<int>& range ) const {
            int end = range.end();
            for( int i=range.begin(); i<end; ++i ) {
                for( int j=1; j<UniverseWidth-1; ++j ) {
                    V[i][j] += (S[i][j] - S[i][j-1] + T[i][j] - T[i-1][j])*M[i];
                }
            }
        }
    };

    void ParallelUpdateVelocity() {
        parallel_for( blocked_range<int>( 1, UniverseHeight-1 ),   // parallel control structure
                      UpdateVelocityBody(),
                      auto_partitioner() );                        // task subdivision handler
    }

Range is Generic

Requirements on a range type R:

    R::R( const R& )                Copy constructor
    R::~R()                         Destructor
    bool R::empty() const           True if range is empty
    bool R::is_divisible() const    True if range can be partitioned
    R::R( R& r, split )             Split r into two subranges
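
To make the five requirements concrete, here is a minimal sketch of a user-defined range that satisfies them, together with a serial driver that splits recursively the way a parallel algorithm could. The `split` tag is declared locally as a hypothetical stand-in for the library's tag type, and `SimpleRange`, `recursive_apply` and `sum_indices` are illustrative names, not library API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the library's split tag, so the sketch compiles without TBB.
struct split {};

// A half-open index range [lo, hi) that meets the five requirements listed above.
class SimpleRange {
public:
    SimpleRange(std::size_t lo, std::size_t hi, std::size_t grain)
        : lo_(lo), hi_(hi), grain_(grain) { assert(grain > 0); }
    SimpleRange(const SimpleRange&) = default;   // copy constructor
    ~SimpleRange() = default;                    // destructor

    // Splitting constructor: r keeps the lower half, the new range takes the upper half.
    SimpleRange(SimpleRange& r, split)
        : lo_(r.lo_ + (r.hi_ - r.lo_) / 2), hi_(r.hi_), grain_(r.grain_) {
        r.hi_ = lo_;
    }

    bool empty() const { return lo_ >= hi_; }
    bool is_divisible() const { return hi_ - lo_ > grain_; }

    std::size_t begin() const { return lo_; }
    std::size_t end() const { return hi_; }

private:
    std::size_t lo_, hi_, grain_;
};

// Serial driver: keep splitting while the range is divisible, then run the body
// on each indivisible piece. A task scheduler could run the two halves in parallel.
template <typename Body>
void recursive_apply(SimpleRange r, const Body& body) {
    if (r.empty()) return;
    if (r.is_divisible()) {
        SimpleRange upper(r, split());
        recursive_apply(r, body);
        recursive_apply(upper, body);
    } else {
        body(r);
    }
}

// Sums the indices 0..n-1 by visiting every piece produced by the splitting.
std::size_t sum_indices(std::size_t n, std::size_t grain) {
    std::size_t total = 0;
    recursive_apply(SimpleRange(0, n, grain), [&total](const SimpleRange& r) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) total += i;
    });
    return total;
}
```

Because the algorithm only relies on these operations, any type that provides them (a 2D block, a tree, a list segment) can play the role of the range.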

parallel_reduce parallel_scan parallel_while parallel_sort pipeline

Parallel pipeline

Parallel pipeline

Incoming items are tagged with sequence numbers.
A parallel stage scales because it can process items in parallel or out of order.
A serial stage processes items one at a time, in order; items wait for their turn in a serial stage.
Sequence numbers are used to recover the order for a later serial stage.
Excessive parallelism is controlled by limiting the total number of items flowing through the pipeline.
Throughput is limited by the throughput of the slowest serial stage.
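
The role of the sequence numbers can be sketched with a small reorder buffer placed in front of a serial stage. This is only an illustration of the idea under simple assumptions (one lock, items numbered from 0); `Resequencer` and `demo_out_of_order` are hypothetical names and do not reflect how the library implements its pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Restores input order in front of a serial stage.
// Hypothetical helper for illustration; not part of the TBB API.
class Resequencer {
public:
    // Called when an earlier (parallel) stage finishes item number `seq`.
    // Runs the serial stage for every item that is now next in line;
    // items that arrive early wait in the buffer for their turn.
    template <typename SerialStage>
    void arrive(std::size_t seq, std::string item, SerialStage&& stage) {
        std::lock_guard<std::mutex> lock(mu_);
        assert(seq >= next_);  // an already-released number must not come back
        waiting_.emplace(seq, std::move(item));
        auto it = waiting_.find(next_);
        while (it != waiting_.end()) {
            stage(it->second);
            waiting_.erase(it);
            ++next_;
            it = waiting_.find(next_);
        }
    }

private:
    std::mutex mu_;
    std::size_t next_ = 0;                        // sequence number the serial stage expects
    std::map<std::size_t, std::string> waiting_;  // items that arrived ahead of their turn
};

// Feeds three items in the order 2, 0, 1 and records the order the serial stage sees.
std::vector<std::string> demo_out_of_order() {
    Resequencer rs;
    std::vector<std::string> seen;
    auto sink = [&seen](const std::string& s) { seen.push_back(s); };
    rs.arrive(2, "c", sink);  // buffered: items 0 and 1 have not arrived yet
    rs.arrive(0, "a", sink);  // releases "a"
    rs.arrive(1, "b", sink);  // releases "b", then the buffered "c"
    return seen;
}
```

The serial stage runs under the lock, so it handles one item at a time, which is also why the slowest serial stage bounds the throughput of the whole pipeline.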

Concurrent Containers, Mutual Exclusion Memory Allocator Task Scheduler

Concurrent Containers

Concurrent interface requirements

    extern std::queue<T> q;
    if( !q.empty() ) {
        // At this instant, another thread might pop the last element.
        item = q.front();
        q.pop();
    }
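
The race in the snippet above comes from testing and popping in separate calls. One way to close the gap is an interface that does both in a single operation; the sketch below assumes a plain mutex around `std::queue`, and `LockedQueue`, `try_pop` and `drain_count` are hypothetical names chosen for this illustration, not the library's container.

```cpp
#include <cassert>
#include <mutex>
#include <queue>
#include <utility>

// Minimal queue whose interface has no gap between "is it empty?" and "pop".
// Hypothetical illustration; a scalable container would avoid one global lock.
template <typename T>
class LockedQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> lock(mu_);
        q_.push(std::move(value));
    }

    // Test-and-pop as one operation: returns false if the queue was empty,
    // otherwise moves the front element into `out` and removes it.
    bool try_pop(T& out) {
        std::lock_guard<std::mutex> lock(mu_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

private:
    std::mutex mu_;
    std::queue<T> q_;
};

// Pushes n items, pops until try_pop reports empty, and returns how many came out.
int drain_count(int n) {
    LockedQueue<int> q;
    for (int i = 0; i < n; ++i) q.push(i);
    int popped = 0;
    int item = 0;
    while (q.try_pop(item)) ++popped;
    assert(!q.try_pop(item));  // still empty; no element can be popped twice
    return popped;
}
```

The walk-through is single-threaded to keep it short, but because the emptiness check and the pop happen under the same lock, several consumers could call `try_pop` at once without ever popping from an empty queue.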

concurrent_vector<T>

Example

    // Append sequence [begin,end) to x in thread-safe way.
    template<typename T>
    void Append( concurrent_vector<T>& x, const T* begin, const T* end ) {
        std::copy( begin, end, x.begin() + x.grow_by(end-begin) );
    }
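
The Append example works because `grow_by` hands back the index where the newly claimed elements start, so each caller copies into its own disjoint slots. A minimal sketch of that reservation idea follows, assuming a fixed-capacity buffer and an atomic counter; `ReservingBuffer` is a hypothetical stand-in and says nothing about how concurrent_vector actually grows its storage.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity buffer that lets callers claim disjoint slot ranges.
// Hypothetical stand-in for illustration; it never reallocates.
class ReservingBuffer {
public:
    explicit ReservingBuffer(std::size_t capacity) : data_(capacity), size_(0) {}

    // Atomically claims n slots and returns the index of the first one.
    // Two concurrent callers always receive non-overlapping ranges.
    std::size_t grow_by(std::size_t n) {
        std::size_t old_size = size_.fetch_add(n);
        assert(old_size + n <= data_.size());  // sketch only: capacity is fixed
        return old_size;
    }

    int* begin() { return data_.data(); }
    std::size_t size() const { return size_.load(); }
    int at(std::size_t i) const { return data_[i]; }

private:
    std::vector<int> data_;
    std::atomic<std::size_t> size_;
};

// Same shape as the Append function on the slide: reserve first, then copy.
void Append(ReservingBuffer& x, const int* begin, const int* end) {
    std::copy(begin, end, x.begin() + x.grow_by(static_cast<std::size_t>(end - begin)));
}

// Appends two sequences and returns the resulting contents.
std::vector<int> append_twice() {
    ReservingBuffer buf(8);
    const int a[] = {1, 2, 3};
    const int b[] = {4, 5};
    Append(buf, a, a + 3);
    Append(buf, b, b + 2);
    std::vector<int> out;
    for (std::size_t i = 0; i < buf.size(); ++i) out.push_back(buf.at(i));
    return out;
}
```

Reserving before copying is the design point: the only shared step is a single atomic addition, and the copies themselves touch separate memory.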

concurrent_queue<T>

concurrent_hash_map<Key,T,HashCompare>

Example: map strings to integers

    struct MyHashCompare {
        static long hash( const char* x ) {
            long h = 0;
            for( const char* s = x; *s; s++ )
                h = (h*157)^*s;
            return h;
        }
        static bool equal( const char* x, const char* y ) {
            return strcmp(x,y)==0;
        }
    };

    typedef concurrent_hash_map<const char*,int,MyHashCompare> StringTable;
    StringTable MyTable;

    void MyUpdateCount( const char* x ) {
        StringTable::accessor a;
        MyTable.insert( a, x );
        a->second += 1;
    }

Multiple threads can insert and update entries concurrently.
The accessor object acts as a smart pointer and a writer lock.
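
The "smart pointer plus writer lock" idea can be sketched in standard C++ as an object that holds a lock for its lifetime and points at one entry. The sketch below is a simplification under a loud assumption: it uses one mutex for the whole table, whereas a concurrent hash map would lock far more finely. `CountTable`, `Accessor` and `count_after_updates` are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Table of counters guarded by a single mutex (simplifying assumption).
class CountTable {
public:
    // Points at one entry and keeps the table locked until destroyed or released.
    class Accessor {
    public:
        int& operator*() const {
            assert(value_ != nullptr);  // must be bound by insert() first
            return *value_;
        }
        // Drops the entry and unlocks early; also happens in the destructor.
        void release() {
            value_ = nullptr;
            if (lock_.owns_lock()) lock_.unlock();
        }

    private:
        friend class CountTable;
        std::unique_lock<std::mutex> lock_;
        int* value_ = nullptr;
    };

    // Finds or creates the entry for `key` and binds `a` to it with the lock held.
    void insert(Accessor& a, const std::string& key) {
        a.release();  // avoid self-deadlock if `a` is reused
        a.lock_ = std::unique_lock<std::mutex>(mu_);
        a.value_ = &table_[key];  // new entries start at 0
    }

    // Reads a counter; must not be called while this thread holds an Accessor.
    int count(const std::string& key) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = table_.find(key);
        return it == table_.end() ? 0 : it->second;
    }

private:
    std::mutex mu_;
    std::map<std::string, int> table_;
};

// Increments the counter for `key` `times` times and returns the final value.
int count_after_updates(const std::string& key, int times) {
    CountTable table;
    for (int i = 0; i < times; ++i) {
        CountTable::Accessor a;
        table.insert(a, key);
        *a += 1;
    }  // each Accessor unlocks when it goes out of scope
    return table.count(key);
}
```

Tying the lock to the accessor's lifetime means the update in the loop body cannot be interleaved with another writer, and there is no separate unlock call to forget.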

Synchronization Primitives

Example: spin_rw_mutex promotion

Scalable Memory Allocator

Task Scheduler

    Problem               TBB Approach
    Oversubscription      One scheduler thread per hardware thread
    "Fair" scheduling     "Greedy" scheduling often wins
    Program complexity    Programmer specifies tasks, not threads
    Load imbalance        Task chunking and work-stealing help balance load

Task Scheduler

    #include "tbb/task_scheduler_init.h"
    using namespace tbb;

    int main() {
        task_scheduler_init init;
        …
        return 0;
    }

Tasking Development tools

Open Source – quick ‘tour’

Coding with TBB Contest

Learn more…
