Multicore: An Introduction
N. Rajagopal,
Systems Software Practice
Nagarajan_rajagopal@yahoo.com
Agenda
• Background
• Drivers for Multicore
• How is SW prepared?
• Challenges of multicore programming
• Functional programming
• Refactoring
• Hybrid approaches
• State of industry
• Summary
• Q&A
Background
• Moore’s Law (transistor density doubles every 18 months)
• But the gap between transistor count and performance is widening
• Between 1993 and 1999, CPU speeds increased 10 times
• The first 1 GHz CPU arrived in 2000. We should have had a 10 GHz CPU by now. It is not there. Where is it?
• Intel’s 3.4 GHz CPU was introduced in 2004. Where is the 4 GHz processor?
Answer: it is unlikely to ever come!
CPU Clock speed increase over years
Source: http://www.cs.utexas.edu/users/cart/publications/isca00.pdf
Gap is increasing!
Source: http://www.embedded.com/columns/technicalinsights/198701652?_requestid=1042869
Slowing signs
• Over the last 30 years, CPU designers achieved performance gains in three ways:
– Clock speed (new processes, materials etc.)
– Execution optimization (doing more per cycle: pipelining, branch prediction, multiple instructions per cycle etc.)
– Cache (putting memory closer to the CPU: 2 MB+ caches are now common)
• These techniques are running out of steam
– Clock speed: heat, physical limits, leakage currents
– Diminishing returns from execution optimization, though cache size still has room to grow
Source: embedded.com
Heat is on!
How is the semiconductor industry responding?
• Create simpler cores and put more of them in a single package
• This is easier for semiconductor vendors than increasing clock speed
• Instead of one 10 GHz processor, have ten 1 GHz cores!
• Each core has its own L1 cache but accesses shared memory outside
• The first such processors came from Intel, and now there are many multicore processors
• Initially, multicore processors were for the server market
• Now they are in desktops and even in embedded products
Some multicore processors..
• Dual-core processors
– One general-purpose core and another specialized core
– Been in the market for some time
– Network processors (Intel IXP)
– Signal processors (TI OMAP processors)
• Intel Quad Core, Atom dual core
– Intel: Nehalem – Core i7 with 4 cores / 8 hardware threads
– AMD: Montreal – 8 cores
• Cavium: 16 cores
• Tilera: 64 cores
• Predictions of hundreds and even thousands of cores in the coming years
• Intel Larrabee rumoured to have 32 cores
Typical multicore processor
Courtesy: http://zone.ni.com/cms/images/devzone/tut/rwlesvfu47926.jpg
How is Software Community prepared?
• “There ain’t no such thing as a free lunch” – R. A. Heinlein
• The software community has enjoyed regular performance gains for decades without doing anything special
• Now the party is coming to an end, and it is a rude shock
• There is an expectation that SW should take over from HW in driving the next generation of performance improvements
• So what is to be done?
What is to be done?
Need to move to parallel programming model
Challenges of parallel programming
• Concurrency
– Multiple threads of execution access the same data
– The accesses need to be synchronized
– Possibilities of race conditions and deadlocks
Challenges of parallel programming – Race conditions
Trivial C statement
b = b + 1;
Assume that two threads run this line of code, where b is a variable shared by the two threads and starts with the value 5.
Possible order of execution
(thread1) load b into some register in thread 1.
(thread2) load b into some register in thread 2.
(thread1) add 1 to thread 1's register, computing 6.
(thread2) add 1 to thread 2's register, computing 6.
(thread1) store the register value (6) to b.
(thread2) store the register value (6) to b.
• We started with 5, then two threads each added one, but the final result is 6 -- not the
expected 7. The problem is that the two threads interfered with each other, causing a wrong
final answer.
• Threads do not execute atomically: a compound operation is not performed all at once. Another thread may interrupt a thread between essentially any two instructions and manipulate some shared resource. (A sketch of the standard fix follows.)
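A minimal pthreads sketch of that fix in C: guard the shared variable with a mutex so the load/add/store sequence runs as a unit (names are illustrative):

#include <pthread.h>
#include <stdio.h>

int b = 5;                                   /* shared by both threads */
pthread_mutex_t b_lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg)
{
    pthread_mutex_lock(&b_lock);   /* only one thread at a time gets past here */
    b = b + 1;                     /* load, add, store now run as one unit */
    pthread_mutex_unlock(&b_lock);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("b = %d\n", b);         /* always prints the expected 7 */
    return 0;
}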
Challenges of parallel programming - Deadlocks
#include <pthread.h>

pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;
pthread_t thread1, thread2;

void *function1(void *arg)
{
    pthread_mutex_lock(&lock1);   /* Execution step 1 */
    pthread_mutex_lock(&lock2);   /* Execution step 3: blocks forever - DEADLOCK!!! */
    /* ... */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void *function2(void *arg)
{
    pthread_mutex_lock(&lock2);   /* Execution step 2 */
    pthread_mutex_lock(&lock1);   /* blocks forever: lock1 is held by thread 1 */
    /* ... */
    pthread_mutex_unlock(&lock1);
    pthread_mutex_unlock(&lock2);
    return NULL;
}

int main(void)
{
    pthread_create(&thread1, NULL, function1, NULL);
    pthread_create(&thread2, NULL, function2, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}
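The standard remedy is to impose a global lock order: every thread acquires lock1 before lock2, so no thread can hold one lock while waiting for the other. A minimal sketch of the corrected second function:

/* Deadlock-free variant: same lock order as function1. */
void *function2_fixed(void *arg)
{
    pthread_mutex_lock(&lock1);
    pthread_mutex_lock(&lock2);
    /* ... */
    pthread_mutex_unlock(&lock2);
    pthread_mutex_unlock(&lock1);
    return NULL;
}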
Challenges of parallel programming
• The challenge of visualization
– Difficult to visualize beyond 4-5 threads, each doing different activities
– The human mind is not used to thinking about or processing data in parallel
– The complexity of thread interplay grows exponentially beyond a few threads
Challenges of parallel programming
• How can one be confident that there are no bugs?
– Race conditions depend on timing
– Changes in the execution time of one thread or another can change the order of access to variables, triggering a previously unnoticed race
– It is entirely possible for code to pass all rounds of testing and work fine for years before a set of conditions like memory speed or process load brings a new bug to life
– Lastly, it may not be possible to reproduce the bug every time (it may occur on one server and not on another with the same environment)
• This leaves an uneasy feeling about the code
• Visualization tools are coming to market, but they are inadequate
What are the options?
Courtesy: http://bigyellowtaxi.files.wordpress.com/2008/06/crossroads.jpg
Option #1: Functional Programming
Option #1: Functional Programming
• The main issue of concurrent programming is parallel access to shared data
• All imperative languages suffer from the same issue
• So can we avoid the whole thing?
• Functional programming languages take a radically different approach to the problem
• Roots in theoretical computer science: the lambda calculus, developed by Alonzo Church at Princeton University in the 1930s, alongside Turing, Gödel and von Neumann
Functional Programming - background
• Based on a mathematical system called a “formal system”
• A set of “axioms” and a “set of rules for operating on them”
• These can be stacked up into complex rules
• Church was interested in issues of “computability”, not programming (computers did not exist then)
• LISP was the first implementation of Church’s lambda calculus on computers
So what is functional programming?
• Functions are used for everything, even the simplest of computations
• Variables are just aliases – they cannot change value (immutable)
• State lives only on the stack
• No concept of global variables
• No “side effects”
• A function cannot modify state inside or outside of itself
• Calling the same function with the same parameters returns identical results every time
• Since there is no mutable state – global or otherwise – there are no race conditions and no deadlocks! (A small illustration follows.)
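To make “no side effects” concrete, here is a tiny C-flavoured illustration (both functions are hypothetical, used only to contrast the two styles):

int counter = 0;                         /* hidden, mutable state */

/* Impure: the result depends on, and changes, state outside the
   function, so two threads calling it concurrently can race. */
int next_id(void) { return ++counter; }

/* Pure: the same arguments always give the same result, so it can
   run on any core, at any time, with no locks. */
int add(int x, int y) { return x + y; }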
Referential Transparency
Source: Functions + Messages + Concurrency = Erlang – presentation by Joe Armstrong
Advantages of FP
• Unit testing
– Check all possible inputs to a function
– Check by passing parameters that represent edge cases
• Debugging
– In imperative programming, there is no certainty that you can reproduce a bug with the same sequence
– The main reason is that program behaviour depends on object state, global variables and external state
– In FP, a bug is a bug, and it surfaces every time
• Concurrency
– No global variables, no race conditions, no locks
– Easy to refactor code, e.g.:
• String s1 = operation1();
• String s2 = operation2();
• String s3 = concat(s1, s2);
– The first and second lines can run on 2 different cores (a pthreads sketch follows below)
– Tools can refactor such code easily
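A minimal pthreads rendering of that refactoring in C (operation1 and operation2 are hypothetical stand-ins for any two independent, side-effect-free computations):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical independent computations that share no state. */
void *operation1(void *out) { strcpy((char *)out, "Hello, "); return NULL; }
void *operation2(void *out) { strcpy((char *)out, "world!"); return NULL; }

int main(void)
{
    char s1[32], s2[32], s3[64];
    pthread_t t1, t2;

    /* Neither operation touches shared state, so they can run on
       two different cores with no locks and no possible race. */
    pthread_create(&t1, NULL, operation1, s1);
    pthread_create(&t2, NULL, operation2, s2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    snprintf(s3, sizeof s3, "%s%s", s1, s2);   /* s3 = concat(s1, s2) */
    printf("%s\n", s3);
    return 0;
}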
Popular environments
• Erlang
– Invented at Ericsson
– Concurrent programming environment
– Message passing
– Used in highly scalable, mission-critical switch products with hot-upgrade support
• Haskell
– A pure functional language
• LISP
– The original parent
Example code fragment
• Erlang
Courtesy: TBD
So, why are programmers not jumping?
• A complete change of mindset is needed!
• The learning curve is steep
• A “university and research” image
• Needs lots of resources (recursion, stack-based state)
• CIOs are not comfortable
• Lack of trained staff
• Only strong believers are adopting
Option #2: Refactoring
Refactoring
• There is just too much existing software
• No one develops software from scratch; it is built on existing code bases/products
• So let us look at tools that help refactor existing code for performance gains, rather than asking people to code in new languages
• This may not utilize all cores to the maximum, but it is an evolutionary path
Refactoring tools for non-functional programmers
• Many tools are coming to market to help refactoring code to
multiple core processors
• Key ones
– RapidMind
– OpenMP
– CUDA
– Ct
– Cilk
– Pervasive software
– And many others
RapidMind
• Startup based in Waterloo, Canada
• Helps parallelize code across multiple cores
• Supports the C++ language
• Can target different processor families: AMD, x86, Cell, nVidia etc.
• Extensive changes needed in existing code
• Plus points:
– Multiple platforms, no lock-in to one architecture
– Thousands of threads
– Works with existing compilers
• Minus points:
– Lock-in to the RapidMind platform!
• Costs:
– $1,500 for the development platform
RapidMind
OpenMP
• A standard for shared-memory multiprocessor programming
• Defined for C/C++/Fortran
• OpenMP = compiler directives + runtime library + environment variables
• Code is annotated with directives
• Threads are created and destroyed automatically
OpenMP example code
int main(int argc, char **argv)
{
    const int N = 100000;
    int i, a[N];

    #pragma omp parallel for    /* split the loop iterations across threads */
    for (i = 0; i < N; i++)
        a[i] = 2 * i;

    return 0;
}
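With GCC, for example, this builds with gcc -fopenmp; without that flag the pragma is simply ignored and the loop runs serially. The thread count can then be set at run time through the OMP_NUM_THREADS environment variable.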
OpenMP
• Plus points:
– Easy to adopt
– Incremental parallelization
– Hides thread semantics
– Scalable to many threads
– Coarse- and fine-grained parallelization
• Minus points:
– Popular mainly in scientific communities
– Needs a separate tool chain
• Products:
– Sun Studio 12
Source: Sun Studio presentation
CUDA
• From nVidia
• Defines language extensions for parallelizing code
• Concept of a “thread block”
• Each thread block has its own shared memory to work on
• Scales to thousands of threads
• Defined for the C/C++ languages
• Plus points:
– Scales very well for scientific and visualization communities (financial markets, computational mechanics, computational biology etc.)
– Well integrated with nVidia processors – no middleware
• Minus points:
– Vendor lock-in to nVidia/CUDA (processor/tool chain)
Is there a middle way?
Option #3: Hybrid environments
• How can one experiment with FP without throwing away all the investments made so far?
• Scala:
• A hybrid language
• Combines Java-style OO with the FP paradigm
• Compiles to bytecode
• Can use Java libraries in Scala code; can inherit from Java classes
• Uses Eclipse tool chains and plugins
• Really married to Java
• F#:
• Experiment with FP safely from the familiar .NET environment
• Use FP where needed, .NET elsewhere
Status of the industry
OS readiness
• Windows XP can support 4 logical cores and 2 physical cores; supports the SMP model
• Windows Vista and Windows 7 are better tuned for multicore
• Fine-tuning of global locks and mutexes, better graphics performance
• Linux 2.6+ has SMP support and hence multicore support
“Embarrassingly” Parallel Programs
• An embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks.
• Examples:
– The Mandelbrot set and other fractal calculations, where each point can be calculated independently
– Rendering of graphics: each pixel may be rendered independently; in computer animation, each frame may be rendered independently
– Large-scale face recognition, comparing thousands of input faces against a similarly large number of stored faces
– Computer simulations comparing many independent scenarios, such as climate models
– Genetic algorithms and other evolutionary computation
– Weather prediction models
– Event simulation and reconstruction in particle physics
• Compilation and build systems: each compilation unit can be built in parallel for a speedup (example: make -j, or dmake in Sun Studio)
• CUDA and OpenMP are very popular and widely used in the scientific and visualization markets
Source: Wikipedia
Server programming
• Servers run OSes like Linux 2.6 that support pthreads and are multithreaded
• The server daemon spawns a pthread for each request (a sketch follows below)
• The OS schedules them on different cores
• Decent scalability is seen for server-side programs on multicore systems
• Examples: web services
• Challenges are IO performance, cache size, and performance tuning of the OS, web server etc.
Source: http://www.edn.com/index.asp?layout=articlePrint&articleID=CA6646279
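A minimal thread-per-request sketch in C (error handling omitted; handle_request is a hypothetical stand-in for the daemon’s real work):

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical request handler; the OS is free to schedule each
   instance on whichever core is idle. */
static void *handle_request(void *arg)
{
    int client = (int)(intptr_t)arg;
    /* ... read the request, write the response ... */
    close(client);
    return NULL;
}

int main(void)
{
    int server = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(server, (struct sockaddr *)&addr, sizeof addr);
    listen(server, 128);

    for (;;) {                      /* one pthread per incoming request */
        int client = accept(server, NULL, NULL);
        pthread_t t;
        pthread_create(&t, NULL, handle_request, (void *)(intptr_t)client);
        pthread_detach(t);          /* thread cleans up after itself */
    }
}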
Server Virtualization
• Virtualization allows different OSes to run on different cores
• One core can run Linux while another runs Windows
• This is quite popular and allows efficient use of the cores
Packet processing applications
• Packet processing lends itself to parallel processing
• Each IP packet is independent
• IP forwarding is largely stateless processing
• Multiple cores can take packets from an incoming queue, process them, and put them on an output queue (a worker-pool sketch follows below)
• Similar applications exist in security
• Solutions from Tilera, Cavium
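A sketch of that pattern with pthreads (the toy queue below stands in for hardware packet queues; a real fast path would use lock-free rings and NIC queues):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Toy "packet" and a mutex/condvar-guarded FIFO input queue. */
typedef struct pkt { struct pkt *next; int id; } pkt_t;

static pkt_t *head, *tail;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

static void queue_push(pkt_t *p)
{
    pthread_mutex_lock(&qlock);
    p->next = NULL;
    if (tail) tail->next = p; else head = p;
    tail = p;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

static pkt_t *queue_pop(void)
{
    pthread_mutex_lock(&qlock);
    while (!head) pthread_cond_wait(&qcond, &qlock);
    pkt_t *p = head;
    head = p->next;
    if (!head) tail = NULL;
    pthread_mutex_unlock(&qlock);
    return p;
}

/* IP forwarding is largely stateless, so workers contend only on the
   queue itself; each packet is handled independently. */
static void *worker(void *arg)
{
    for (;;) {
        pkt_t *p = queue_pop();
        printf("worker %ld forwarded packet %d\n", (long)(intptr_t)arg, p->id);
        free(p);
    }
    return NULL;
}

int main(void)
{
    enum { NWORKERS = 4 };          /* e.g. one worker per core */
    pthread_t w[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&w[i], NULL, worker, (void *)(intptr_t)i);

    for (int id = 0; id < 100; id++) {   /* simulate arriving packets */
        pkt_t *p = malloc(sizeof *p);
        p->id = id;
        queue_push(p);
    }
    sleep(1);                            /* let the workers drain the queue */
    return 0;
}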
Packet processing applications
Source: http://www.cisco.com/en/US/prod/collateral/routers/
Highly reliable concurrent systems
• Functional programs are being used here
• Major users:
– Ericsson (GPRS system)
– Jabber
– Twitter
– T-Mobile SMS system
– Nortel VPN gateway product
• Some are trying out hybrids in their systems
– Scala: Twitter
Desktop applications
• Common programs like word processors and spreadsheets are not amenable to parallel processing or threading
• It is unlikely that desktop applications will be rewritten with multiple threads
• So don’t expect much performance gain from running this PPT on a quad-core processor :-)
• But simple things are possible:
– While you download a document, a virus scan can run on another core
– Windows lets you set per-process processor affinity
Embedded Applications
• Embedded applications without an OS are typically hand-crafted
• Limited scope for using multiple cores
• Scope exists for running specialized code on one core and the rest of the code on another
Courtesy: Wind River Inc
Challenges of performance improvement
• Lots of existing code is neither written in a multithreaded way nor amenable to it
• There are other factors beyond the CPU, such as IO and memory. The best results are seen for CPU-intensive jobs, but not necessarily for IO-bound jobs
• There can be bottlenecks in inter-core communication
• Typical applications like CRUD may find it difficult to scale
Sweet-spots of multicore software
• Virtualization
• Server programming
• Packet processing
• Scientific computing, visualization
Summary
• Multicore is here to stay
• The onus is on the SW community to show performance gains
• Unfortunately, lots of existing code is not written to exploit the benefits of multicore
• Many programs are also not amenable to parallel processing
• Options like FP, refactoring and hybrids are being tried out
• There is a long way to go before we can fully tap the performance gains
Q&A
