Multicore: An Introduction
N. Rajagopal,
Systems Software Practice
Nagarajan_rajagopal@yahoo.com
Agenda
• Background
• Drivers for Multicore
• How is SW prepared?
• Challenges of multicore programming
• Functional programming
• Refactoring
• Hybrid approaches
• State of industry
• Summary
• Q&A
Background
• Moore’s Law (transistor density doubles every 18 months)
• But the gap between transistor count and performance is widening
• Between 1993 and 1999, CPU speeds increased 10 times
• The first 1 GHz CPU arrived in 2000. We should have had a 10 GHz CPU by now. It is not there. Where is it?
• Intel’s 3.4 GHz CPU was introduced in 2004. Where is the 4 GHz processor?
Answer: it is unlikely to ever come!
CPU Clock speed increase over years
Source: http://www.cs.utexas.edu/users/cart/publications/isca00.pdf
Gap is increasing!
Source: http://www.embedded.com/columns/technicalinsights/198701652?_requestid=1042869
Slowing signs
• Over the last 30 years, CPU designers achieved performance gains in three ways:
– Clock speed (new processes, materials etc.)
– Execution optimization (doing more per cycle: pipelining, branch prediction, multiple instructions per cycle etc.)
– Cache (putting memory closer to the CPU: 2 MB+ caches are now common)
• These techniques are running out of steam
– Clock speed: heat, physical limits, leakage currents
– Diminishing returns from execution optimization, though cache size still has room to grow
Source: embedded.com
Heat is on!
How is the semiconductor industry responding?
• Create simpler cores and put more of them in a single package
• This is easier for semiconductor vendors than increasing clock speed
• Instead of one 10 GHz processor, have ten 1 GHz cores!
• Each core has its own L1 cache but accesses shared memory outside
• The first such processors came from Intel, and now there are many multicore processors
• Initially, multicore processors were for the server market
• Now they are in desktops and even in embedded products
Some multicore processors..
• Dual-core processors
– One general-purpose core and another specialized core
– Been in the market for some time
– Network processors (Intel IXP)
– Signal processors (TI OMAP processors)
• Intel Quad Core, Atom dual core
– Intel: Nehalem – Core i7 with 4 cores / 8 hardware threads
– AMD: Montreal – 8 cores
• Cavium: 16 cores
• Tilera: 64 cores
• Predictions of hundreds and even thousands of cores in the coming years
• Intel Larrabee rumoured to have 32 cores
Typical multicore processor
Courtesy: http://zone.ni.com/cms/images/devzone/tut/rwlesvfu47926.jpg
How is Software Community prepared?
• “There ain’t no such thing as a free lunch” – R. A. Heinlein
• The software community has enjoyed regular performance gains for decades without doing anything special
• Now the party is coming to an end, and it is a rude shock
• There is an expectation that SW should take over from HW in driving the next generation of performance improvements
• So what is to be done?
What is to be done?
Need to move to parallel programming model
Challenges of parallel programming
• Concurrency
– Multiple threads of execution access the same data
– The accesses need to be synchronized
– Possibilities of race conditions and deadlocks
Challenges of parallel programming – Race conditions
Trivial C statement
b = b + 1;
Assume that two threads run this line of code, where b is a variable shared by the two threads and starts with the value 5.
Possible order of execution
(thread1) load b into some register in thread 1.
(thread2) load b into some register in thread 2.
(thread1) add 1 to thread 1's register, computing 6.
(thread2) add 1 to thread 2's register, computing 6.
(thread1) store the register value (6) to b.
(thread2) store the register value (6) to b.
• We started with 5, then two threads each added one, but the final result is 6 -- not the
expected 7. The problem is that the two threads interfered with each other, causing a wrong
final answer.
• Threads do not execute atomically: a compound operation is not performed all at once. Another thread may interrupt a thread between essentially any two instructions and manipulate some shared resource. (A sketch of the standard fix follows.)
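A minimal pthreads sketch of that fix in C: guard the shared variable with a mutex so the load/add/store sequence runs as a unit (names are illustrative):

#include <pthread.h>
#include <stdio.h>

int b = 5;                                   /* shared by both threads */
pthread_mutex_t b_lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg)
{
    pthread_mutex_lock(&b_lock);   /* only one thread at a time gets past here */
    b = b + 1;                     /* load, add, store now run as one unit */
    pthread_mutex_unlock(&b_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("b = %d\n", b);         /* always prints the expected 7 */
    return 0;
}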
Challenges of parallel programming - Deadlocks
#include <pthread.h>

pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
pthread_t thread1, thread2;

void *function1(void *arg)
{
    pthread_mutex_lock(&lock1);   /* Execution step 1 */
    pthread_mutex_lock(&lock2);   /* Execution step 3: blocks forever - DEADLOCK!!! */
    /* ... */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void *function2(void *arg)
{
    pthread_mutex_lock(&lock2);   /* Execution step 2 */
    pthread_mutex_lock(&lock1);   /* blocks forever: lock1 is held by thread 1 */
    /* ... */
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}

int main(void)
{
    pthread_create(&thread1, NULL, function1, NULL);
    pthread_create(&thread2, NULL, function2, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}
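The standard remedy is to impose a global lock order: every thread acquires lock1 before lock2, so no thread can hold one lock while waiting for the other. A minimal sketch of the corrected second function:

/* Deadlock-free variant: same lock order as function1. */
void *function2_fixed(void *arg)
{
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);
    /* ... */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}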
Challenges of parallel programming
• The challenge of visualization
– Difficult to visualize beyond 4-5 threads, each doing different activities
– The human mind is not used to thinking about or processing data in parallel
– The complexity of thread interplay grows exponentially beyond a few threads
Challenges of parallel programming
• How can one be confident that there are no bugs?
– Race conditions depend on timing
– Changes in the execution time of one thread or another can change the order of access to variables, triggering a previously unnoticed race
– It is entirely possible for code to pass all rounds of testing and work fine for years before a set of conditions like memory speed or process load brings a new bug to life
– Lastly, it may not be possible to reproduce the bug every time (it may occur on one server and not on another with the same environment)
• This leaves an uneasy feeling about the code
• Visualization tools are coming to market, but they are inadequate
What are the options?
Courtesy: http://bigyellowtaxi.files.wordpress.com/2008/06/crossroads.jpg
Option #1: Functional Programming
Option #1: Functional Programming
• The main issue of concurrent programming is parallel access to shared data
• All imperative languages suffer from the same issue
• So can we avoid the whole thing?
• Functional programming languages take a radically different approach to the problem
• Roots in theoretical computer science: the lambda calculus, developed by Alonzo Church at Princeton University in the 1930s, alongside Turing, Gödel and von Neumann
Functional Programming - background
• Based on a mathematical system called a “formal system”
• A set of “axioms” and a “set of rules for operating on them”
• These can be stacked up into complex rules
• Church was interested in issues of “computability”, not programming (computers did not exist then)
• LISP was the first implementation of Church’s lambda calculus on computers
So what is functional programming?
• Functions are used for everything, even the simplest of computations
• Variables are just aliases – they cannot change value (immutable)
• State lives only on the stack
• No concept of global variables
• No “side effects”
• A function cannot modify state inside or outside of itself
• Calling the same function with the same parameters returns identical results every time
• Since there is no mutable state – global or otherwise – there are no race conditions and no deadlocks! (A small illustration follows.)
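To make “no side effects” concrete, here is a tiny C-flavoured illustration (both functions are hypothetical, used only to contrast the two styles):

int counter = 0;                         /* hidden, mutable state */

/* Impure: the result depends on, and changes, state outside the
   function, so two threads calling it concurrently can race. */
int next_id(void) { return ++counter; }

/* Pure: the same arguments always give the same result, so it can
   run on any core, at any time, with no locks. */
int add(int x, int y) { return x + y; }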
Referential Transparency
Source: Functions + Messages + Concurrency = Erlang – presentation by Joe Armstrong
Advantages of FP
• Unit testing
– Check all possible inputs to a function
– Check by passing parameters that represent edge cases
• Debugging
– In imperative programming, there is no certainty that you can reproduce a bug with the same sequence
– The main reason is that program behaviour depends on object state, global variables and external state
– In FP, a bug is a bug, and it surfaces every time
• Concurrency
– No global variables, no race conditions, no locks
– Easy to refactor code, e.g.:
• String s1 = operation1();
• String s2 = operation2();
• String s3 = concat(s1, s2);
– The first and second lines can run on 2 different cores (a pthreads sketch follows below)
– Tools can refactor such code easily
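A minimal pthreads rendering of that refactoring in C (operation1 and operation2 are hypothetical stand-ins for any two independent, side-effect-free computations):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical independent computations that share no state. */
void *operation1(void *out) { strcpy((char *)out, "Hello, "); return NULL; }
void *operation2(void *out) { strcpy((char *)out, "world!"); return NULL; }

int main(void)
{
    char s1[32], s2[32], s3[64];
    pthread_t t1, t2;

    /* Neither operation touches shared state, so they can run on
       two different cores with no locks and no possible race. */
    pthread_create(&t1, NULL, operation1, s1);
    pthread_create(&t2, NULL, operation2, s2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    snprintf(s3, sizeof s3, "%s%s", s1, s2);   /* s3 = concat(s1, s2) */
    printf("%s\n", s3);
    return 0;
}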
Popular environments
• Erlang
– Invented at Ericsson
– Concurrent programming environment
– Message passing
– Used in highly scalable, mission-critical switch products with hot-upgrade support
• Haskell
– A pure functional language
• LISP
– The original parent
Example code fragment
• Erlang
Courtesy: TBD
So, why are programmers not jumping?
• A complete change of mindset is needed!
• The learning curve is steep
• A “university and research” image
• Needs lots of resources (recursion, stack-based state)
• CIOs are not comfortable
• Lack of trained staff
• Only strong believers are adopting
Option #2: Refactoring
Refactoring
• There is just too much existing software
• No one develops software from scratch; it is built on existing code bases/products
• So let us look at tools that help refactor existing code for performance gains, rather than asking people to code in new languages
• This may not utilize all cores to the maximum, but it is an evolutionary path
Refactoring tools for non-functional programmers
• Many tools are coming to market to help refactoring code to
multiple core processors
• Key ones
– RapidMind
– OpenMP
– CUDA
– Ct
– Cilk
– Pervasive software
– And many others
RapidMind
• Startup based in Waterloo, Canada
• Helps parallelize code across multiple cores
• Supports the C++ language
• Can target different processor families: AMD, x86, Cell, nVidia etc.
• Extensive changes needed in existing code
• Plus points:
– Multiple platforms, no lock-in to one architecture
– Thousands of threads
– Works with existing compilers
• Minus points:
– Lock-in to the RapidMind platform!
• Costs:
– $1,500 for the development platform
RapidMind
OpenMP
• A standard for shared-memory multiprocessor programming
• Defined for C/C++/Fortran
• OpenMP = compiler directives + runtime library + environment variables
• Code is annotated with directives
• Threads are created and destroyed automatically
OpenMP example code
int main(int argc, char **argv)
{
    const int N = 100000;
    int i, a[N];

    #pragma omp parallel for    /* split the loop iterations across threads */
    for (i = 0; i < N; i++)
        a[i] = 2 * i;

    return 0;
}
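With GCC, for example, this builds with gcc -fopenmp; without that flag the pragma is simply ignored and the loop runs serially. The thread count can then be set at run time through the OMP_NUM_THREADS environment variable.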
OpenMP
• Plus points:
– Easy to adopt
– Incremental parallelization
– Hides thread semantics
– Scalable to many threads
– Coarse- and fine-grained parallelization
• Minus points:
– Popular mainly in scientific communities
– Needs a separate tool chain
• Products:
– Sun Studio 12
Source: Sun Studio presentation
CUDA
• From nVidia
• Defines language extensions for parallelizing code
• Concept of a “thread block”
• Each thread block has its own shared memory to work on
• Scales to thousands of threads
• Defined for the C/C++ languages
• Plus points:
– Scales very well for scientific and visualization communities (financial markets, computational mechanics, computational biology etc.)
– Well integrated with nVidia processors – no middleware
• Minus points:
– Vendor lock-in to nVidia/CUDA (processor/tool chain)
Is there a middle way?
Option #3: Hybrid environments
• How can one experiment with FP without throwing away all the investments made so far?
• Scala:
• A hybrid language
• Combines Java-style OO with the FP paradigm
• Compiles to bytecode
• Can use Java libraries in Scala code; can inherit from Java classes
• Uses Eclipse tool chains and plugins
• Really married to Java
• F#:
• Experiment with FP safely from the familiar .NET environment
• Use FP where needed, .NET elsewhere
Status of the industry
OS readiness
• Windows XP can support 4 logical cores and 2 physical cores; supports the SMP model
• Windows Vista and Windows 7 are better tuned for multicore
• Fine-tuning of global locks and mutexes, better graphics performance
• Linux 2.6+ has SMP support and hence multicore support
“Embarrassingly” Parallel Programs
• An embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks.
• Examples:
– The Mandelbrot set and other fractal calculations, where each point can be calculated independently
– Rendering of graphics: each pixel may be rendered independently; in computer animation, each frame may be rendered independently
– Large-scale face recognition, comparing thousands of input faces against a similarly large number of stored faces
– Computer simulations comparing many independent scenarios, such as climate models
– Genetic algorithms and other evolutionary computation
– Weather prediction models
– Event simulation and reconstruction in particle physics
• Compilation and build systems: each compilation unit can be built in parallel for a speedup (example: make -j, or dmake in Sun Studio)
• CUDA and OpenMP are very popular and widely used in the scientific and visualization markets
Source: Wikipedia
Server programming
• Servers run OSes like Linux 2.6 that support pthreads and are multithreaded
• The server daemon spawns a pthread for each request (a sketch follows below)
• The OS schedules them on different cores
• Decent scalability is seen for server-side programs on multicore systems
• Examples: web services
• Challenges are IO performance, cache size, and performance tuning of the OS, web server etc.
Source: http://www.edn.com/index.asp?layout=articlePrint&articleID=CA6646279
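A minimal thread-per-request sketch in C (error handling omitted; handle_request is a hypothetical stand-in for the daemon’s real work):

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical request handler; the OS is free to schedule each
   instance on whichever core is idle. */
static void *handle_request(void *arg)
{
    int client = (int)(intptr_t)arg;
    /* ... read the request, write the response ... */
    close(client);
    return NULL;
}

int main(void)
{
    int server = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(server, (struct sockaddr *)&addr, sizeof addr);
    listen(server, 128);

    for (;;) {                      /* one pthread per incoming request */
        int client = accept(server, NULL, NULL);
        pthread_t t;
        pthread_create(&t, NULL, handle_request, (void *)(intptr_t)client);
        pthread_detach(t);          /* thread cleans up after itself */
    }
}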
Server Virtualization
• Virtualization allows different OSes to run on different cores
• One core can run Linux while another runs Windows
• This is quite popular and allows efficient use of the cores
Packet processing applications
• Packet processing lends itself to parallel processing
• Each IP packet is independent
• IP forwarding is largely stateless processing
• Multiple cores can take packets from an incoming queue, process them, and put them on an output queue (a worker-pool sketch follows below)
• Similar applications exist in security
• Solutions from Tilera, Cavium
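A sketch of that pattern with pthreads (the toy queue below stands in for hardware packet queues; a real fast path would use lock-free rings and NIC queues):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Toy "packet" and a mutex/condvar-guarded FIFO input queue. */
typedef struct pkt { struct pkt *next; int id; } pkt_t;

static pkt_t *head, *tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

static void queue_push(pkt_t *p)
{
    pthread_mutex_lock(&qlock);
    p->next = NULL;
    if (tail) tail->next = p; else head = p;
    tail = p;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

static pkt_t *queue_pop(void)
{
    pthread_mutex_lock(&qlock);
    while (!head) pthread_cond_wait(&qcond, &qlock);
    pkt_t *p = head;
    head = p->next;
    if (!head) tail = NULL;
    pthread_mutex_unlock(&qlock);
    return p;
}

/* IP forwarding is largely stateless, so workers contend only on the
   queue itself; each packet is handled independently. */
static void *worker(void *arg)
{
    for (;;) {
        pkt_t *p = queue_pop();
        printf("worker %ld forwarded packet %d\n", (long)(intptr_t)arg, p->id);
        free(p);
    }
    return NULL;
}

int main(void)
{
    enum { NWORKERS = 4 };          /* e.g. one worker per core */
    pthread_t w[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, worker, (void *)(intptr_t)i);

    for (int id = 0; id < 100; id++) {   /* simulate arriving packets */
        pkt_t *p = malloc(sizeof *p);
        p->id = id;
        queue_push(p);
    }
    sleep(1);                            /* let the workers drain the queue */
    return 0;
}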
Packet processing applications
Source: http://www.cisco.com/en/US/prod/collateral/routers/
Highly reliable concurrent systems
• Functional programs are being used here
• Major users:
– Ericsson (GPRS system)
– Jabber
– Twitter
– T-Mobile SMS system
– Nortel VPN gateway product
• Some are trying out hybrids in their systems
– Scala: Twitter
Desktop applications
• Common programs like word processors and spreadsheets are not amenable to parallel processing or threading
• It is unlikely that desktop applications will be rewritten with multiple threads
• So don’t expect much performance gain from running this PPT on a quad-core processor :-)
• But simple things are possible:
– While you download a document, a virus scan can run on another core
– Windows lets you set per-process processor affinity
Embedded Applications
• Embedded applications without an OS are typically hand-crafted
• Limited scope for using multiple cores
• Scope exists for running specialized code on one core and the rest of the code on another
Courtesy: Wind River Inc
Challenges of performance improvement
• Lots of existing code is neither written in a multithreaded way nor amenable to it
• There are other factors beyond the CPU, such as IO and memory. The best results are seen for CPU-intensive jobs, but not necessarily for IO-bound jobs
• There can be bottlenecks in inter-core communication
• Typical applications like CRUD may find it difficult to scale
Sweet-spots of multicore software
• Virtualization
• Server programming
• Packet processing
• Scientific computing, visualization
Summary
• Multicore is here to stay
• The onus is on the SW community to show performance gains
• Unfortunately, lots of existing code is not written to exploit the benefits of multicore
• Many programs are also not amenable to parallel processing
• Options like FP, refactoring and hybrids are being tried out
• There is a long way to go before we can fully tap the performance gains
Q&A
