Agenda

Windows 8 Apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Windows 8 Apps

New user experience
Touch-friendly
Trust
Battery-power
Fast and fluid
Windows 8 C++ App Options

XAML-based applications
 XAML user interface
 C++ code
DirectX-based applications and games
 DirectX user interface (D2D or D3D)
 C++ code
Hybrid XAML and DirectX applications
 XAML controls mixed with DirectX surfaces
 C++ code
HTML5 + JavaScript applications
 HTML5 user interface
 JS code calling into C++ code
Agenda Checkpoint

Windows 8 apps
  Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Recap of “free” performance

Compilation Unit Optimizations
•   /O2 and friends

Whole Program Optimizations
•   /GL and /LTCG

Profile Guided Optimization
•   /LTCG:PGI and /LTCG:PGO
More “free” boosts

Automatic vectorization
•   Always on in VS2012
•   Uses “vector” instructions where possible in loops

    for (i = 0; i < 1000; i++) {
        A[i] = B[i] + C[i];
    }

    SCALAR (1 operation):    r3 = r1 + r2    add r3, r1, r2
    VECTOR (N operations):   v3 = v1 + v2    vadd v3, v1, v2    (N = vector length)

•   With a vector length of 4, this loop can run in only 250
    iterations, down from 1,000!
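
The 250-iteration figure corresponds to handling four elements per loop trip. A minimal sketch of that idea in plain C++, assuming a vector length of 4 and an element count that is a multiple of 4; the function name `add_four_at_a_time` is hypothetical, and a real vectorizer emits SIMD instructions rather than four scalar adds:

```cpp
#include <cassert>
#include <cstddef>

// Adds B and C into A four elements per loop trip, mirroring the work a
// 4-lane vector add does in one instruction. Returns the number of trips.
// Assumption: n is a multiple of 4 (a real vectorizer also generates a
// scalar remainder loop for leftover elements).
std::size_t add_four_at_a_time(const float* B, const float* C, float* A,
                               std::size_t n) {
    assert(n % 4 == 0);
    std::size_t trips = 0;
    for (std::size_t i = 0; i < n; i += 4) {
        A[i]     = B[i]     + C[i];
        A[i + 1] = B[i + 1] + C[i + 1];
        A[i + 2] = B[i + 2] + C[i + 2];
        A[i + 3] = B[i + 3] + C[i + 3];
        ++trips;
    }
    return trips;
}
```

Called with n = 1000, the loop body runs 250 times while producing the same A as the scalar loop.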
More “free” boosts

Automatic parallelization
•   Uses multiple CPU cores
•   /Qpar compiler switch

    #pragma loop (hint_parallel(4))
    for (i = 0; i < 1000; i++) {
        A[i] = B[i] + C[i];
    }


•   Can run this loop “vectorized”
    and on 4 CPU cores in parallel
Agenda Checkpoint

Windows 8 apps
Free performance boost
  Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Parallel Patterns Library (PPL)

 Part of the C++ Runtime
    No new libraries to link in
    Task parallelism
    Parallel algorithms
    Concurrency-safe containers
    Asynchronous agents
 Abstracts away the notion of threads
    Tasks are computations that may be run in parallel
 Used to express your potential concurrency
    Let the runtime map it to the available concurrency
    Scale from 1 to 256 cores
parallel_for

parallel_for iterates over a range in parallel

#include <ppl.h>

using namespace concurrency;

parallel_for( 0, 1000,
   [] (int i) {
      work(i);
   }
);
parallel_for
 parallel_for(0, 1000, [] (int i) {
     work(i);
 });

     Core 1: work(0…249)        Core 2: work(250…499)
     Core 3: work(500…749)      Core 4: work(750…999)

• Order of iteration is indeterminate.
• Cores may come and go.
• Ranges may be stolen by newly idle cores.
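
The four-way split in the diagram can be imitated in portable C++. This is only a sketch: `split_for` is a hypothetical helper built on std::thread, not a PPL API, and it uses a fixed split with no work stealing or cancellation, both of which the real parallel_for provides.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Runs body(i) for every i in [first, last), divided into `cores`
// contiguous ranges with one std::thread per range.
// Hypothetical helper: fixed ranges only, no work stealing.
template <typename Body>
void split_for(int first, int last, int cores, Body body) {
    assert(cores > 0 && first <= last);
    const int total = last - first;
    const int chunk = total / cores;
    std::vector<std::thread> workers;
    for (int c = 0; c < cores; ++c) {
        // Range c covers [begin, end); the last range absorbs any remainder.
        const int begin = first + c * chunk;
        const int end = (c == cores - 1) ? last : begin + chunk;
        workers.emplace_back([begin, end, &body] {
            for (int i = begin; i < end; ++i) body(i);
        });
    }
    // body stays alive until every worker has finished.
    for (auto& w : workers) w.join();
}
```

With `split_for(0, 1000, 4, work)` the ranges are 0…249, 250…499, 500…749 and 750…999, as in the diagram. The body must be safe to call from several threads at once, which is the same requirement parallel_for places on its lambda.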
parallel_for

parallel_for considerations:
• Designed for unbalanced loop bodies
  •   An idle core can steal a portion of another core’s range of work
  •   Supports cancellation
  •   Early exit in search scenarios


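For fixed-sized loop bodies that don’t need cancellation, use
parallel_for_fixed. In the PPL that ships with VS2012, passing a
static_partitioner to parallel_for serves the same purpose.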
parallel_for_each

parallel_for_each iterates over an STL container in parallel

#include <ppl.h>
#include <vector>

using namespace concurrency;

std::vector<int> v = …;

parallel_for_each(v.begin(), v.end(),
   [] (int i) {
      work(i);
   }
);
parallel_for_each

Works best with containers that support random-access iterators:
 std::vector, std::array, std::deque, concurrency::concurrent_vector, …


Works okay, but with higher overhead on containers that support forward
(or bi-di) iterators:
 std::list, std::map, …
parallel_invoke

• Executes function objects in parallel and waits for them to finish
 #include <ppl.h>
 #include <string>
 #include <iostream>
 using namespace concurrency; using namespace std;

 template <typename T>
 T twice(const T& t) {
    return t + t;
 }

 int main() {
    int n = 54; double d = 5.6; string s = "Hello";
    parallel_invoke(
       [&n] { n = twice(n); },
       [&d] { d = twice(d); },
       [&s] { s = twice(s); }
    );
    cout << n << ' ' << d << ' ' << s << endl;
    return 0;
 }
task<>
•   Used to write asynchronous code
•   task::then lets you create continuations that are executed when the task finishes
•   You need to manage the lifetime of the variables going into a task
    #include <ppltasks.h>
    #include <iostream>
    using namespace concurrency; using namespace std;

    int main()
    {
        auto t = create_task([]() -> int
        {
            return 42;
        });

         t.then([](int result)
         {
             cout << result << endl;
         }).wait();
    }
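
The lifetime point can be illustrated without PPL. The sketch below uses std::async and std::future as a stand-in for create_task and task<>; `start_sum` is a hypothetical helper. The assumption it demonstrates is that data captured by value (here through a shared_ptr) outlives the function that started the work, whereas a local captured by reference would not.

```cpp
#include <future>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Starts an asynchronous sum and returns immediately.
// The data lives in a shared_ptr captured BY VALUE, so it stays alive
// until the work finishes even though this function has returned.
// Hypothetical helper; std::async stands in for concurrency::create_task.
std::future<int> start_sum(std::vector<int> values) {
    auto data = std::make_shared<std::vector<int>>(std::move(values));
    return std::async(std::launch::async, [data] {
        return std::accumulate(data->begin(), data->end(), 0);
    });
    // Capturing the parameter by reference ([&values]) would let the
    // lambda read a vector that is destroyed when start_sum returns.
}
```

A caller would write `int total = start_sum({1, 2, 3}).get();` and the same capture-by-value habit applies to lambdas passed to create_task and then.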
Concurrent Containers

•   Thread-safe, lock-free containers provided:
       concurrent_vector<>
       concurrent_queue<>
       concurrent_unordered_map<>
       concurrent_unordered_multimap<>
       concurrent_unordered_set<>
       concurrent_unordered_multiset<>
•   Functionality resembles equivalent containers provided by the STL
•   Behavior is more limited to allow concurrency. For example:
    •   concurrent_vector can push_back but not insert
    •   concurrent_vector can clear but not pop_back or erase
concurrent_vector<T>

 #include <ppl.h>
 #include <concurrent_vector.h>

 using namespace concurrency;

 concurrent_vector<int> carmVec;

 parallel_for(2, 5000000, [&carmVec](int i) {
     if (is_carmichael(i))
         carmVec.push_back(i);
 });
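
The slide calls is_carmichael without defining it. One possible definition, offered as an assumption rather than the presenter's code, uses Korselt's criterion: n is a Carmichael number when it is composite, squarefree, and p - 1 divides n - 1 for every prime p dividing n. The function only reads its argument, so it is safe to call from the parallel_for lambda.

```cpp
// Korselt's criterion by trial division. Hypothetical helper for the
// slide's example; written for clarity, not speed.
bool is_carmichael(int n) {
    if (n < 3 || n % 2 == 0) return false;  // Carmichael numbers are odd
    int remaining = n;
    int prime_factors = 0;
    for (int p = 3; static_cast<long long>(p) * p <= remaining; p += 2) {
        if (remaining % p != 0) continue;
        remaining /= p;
        if (remaining % p == 0) return false;      // p*p divides n
        if ((n - 1) % (p - 1) != 0) return false;  // Korselt condition
        ++prime_factors;
    }
    if (remaining > 1) {  // one prime factor larger than the loop bound
        if ((n - 1) % (remaining - 1) != 0) return false;
        ++prime_factors;
    }
    // Primes end with exactly one factor; every Carmichael number has
    // at least three distinct prime factors.
    return prime_factors >= 3;
}
```

For example, 561 = 3 × 11 × 17 passes because 2, 10 and 16 all divide 560.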
Agenda Checkpoint

Windows 8 apps
Free performance boost
Squeeze the CPU (PPL)
  Smoke the GPU (C++ AMP)
CPU / GPU Comparison
What is C++ AMP?

Performance & Productivity
  C++ AMP -> C++ Accelerated Massive Parallelism
  C++ AMP is
   •   A programming model for expressing data parallel algorithms
   •   A way to exploit heterogeneous systems using mainstream tools
   •   C++ language extensions and a library


        C++ AMP delivers performance without compromising productivity
What is C++ AMP?

C++ AMP gives you…
  Productivity
   •   Simple programming model
  Portability
   •   Run on hardware from NVIDIA, AMD, Intel and ARM*
   •   Open Specification
  Performance
   •   Power of heterogeneous computing at your hands


Use it to speed up data parallel algorithms
#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
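
For reference, the serial loop shifts every value up by one, which turns the array into readable text. A minimal sketch that collects the characters in a std::string instead of printing them; the function name `shift_and_collect` is hypothetical:

```cpp
#include <string>

// Serial version of the loop that returns the text instead of printing
// it. Hypothetical helper name; the data is the array from the slide.
std::string shift_and_collect() {
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    std::string out;
    for (int idx = 0; idx < 11; idx++) {
        v[idx] += 1;                       // same update as the slide
        out += static_cast<char>(v[idx]);  // 31 + 1 == 32, a space
    }
    return out;  // "Hello world"
}
```

Each element is updated independently of the others, which is what makes the loop a candidate for a data parallel rewrite.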
#include <iostream>
#include <amp.h>                // amp.h: header for C++ AMP library
using namespace concurrency;    // concurrency: namespace for library

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    // array_view: wraps the data to operate on the accelerator.
    // array_view variables captured and associated data copied to
    // accelerator (on demand)
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    // array_view: wraps the data to operate on the accelerator.
    // array_view variables captured and associated data copied to
    // accelerator (on demand)
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    // parallel_for_each: execute the lambda on the accelerator once per thread
    // extent: the parallel loop bounds or computation “shape”
    // index: the thread ID that is running the lambda, used to index into data
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    // restrict(amp): tells the compiler to check that code conforms to
    // C++ subset, and tells compiler to target GPU
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;    // array_view: automatically copied to accelerator if required
    });

    // array_view: automatically copied back to host when and if required
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
C++ AMP
Parallel Debugger
Well known Visual Studio debugging features
 Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
 Tool windows
   Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
New features (for both CPU and GPU)
 Parallel Stacks window, Parallel Watch window
New GPU-specific
 Emulator, GPU Threads window, race detection
concurrency::direct3d_printf, direct3d_errorf, direct3d_abort
Summary

C++ is a great way to create fast and fluid apps for Windows 8
Get the most out of the compiler’s free optimizations
Use PPL for concurrent programming
Use C++ AMP for data parallel algorithms
Thank you!

tarekm@microsoft.com

Blazing Fast Windows 8 Apps using Visual C++
