Blazing Fast Windows 8 Apps using Visual C++
More info on http://www.techdays.be

  1. Agenda
     • Windows 8 Apps
     • Free performance boost
     • Squeeze the CPU (PPL)
     • Smoke the GPU (C++ AMP)
  2. Agenda
     • Windows 8 Apps
     • Free performance boost
     • Squeeze the CPU (PPL)
     • Smoke the GPU (C++ AMP)
  3. Windows 8 Apps
     • New user experience
     • Touch-friendly
     • Trust
     • Battery power
     • Fast and fluid
  4. Windows 8 C++ App Options
     • XAML-based applications
       • XAML user interface
       • C++ code
     • DirectX-based applications and games
       • DirectX user interface (D2D or D3D)
       • C++ code
     • Hybrid XAML and DirectX applications
       • XAML controls mixed with DirectX surfaces
       • C++ code
     • HTML5 + JavaScript applications
       • HTML5 user interface
       • JS code calling into C++ code
  5. Agenda Checkpoint
     • Windows 8 apps
     • Free performance boost
     • Squeeze the CPU (PPL)
     • Smoke the GPU (C++ AMP)
  6. Recap of “free” performance
     • Compilation Unit Optimizations: /O2 and friends
     • Whole Program Optimizations: /GL and /LTCG
     • Profile Guided Optimization: /LTCG:PGI and /LTCG:PGO
  7. More “free” boosts: automatic vectorization
     • Always on in VS2012
     • Uses “vector” instructions where possible in loops

         for (i = 0; i < 1000; i++) {
             A[i] = B[i] + C[i];
         }

       Scalar (1 operation):    add  r3, r1, r2
       Vector (N operations):   vadd v3, v1, v2   (N = vector length)

     • Can run this loop in only 250 iterations, down from 1,000!
  8. More “free” boosts: automatic parallelization
     • Uses multiple CPU cores
     • /Qpar compiler switch

         #pragma loop(hint_parallel(4))
         for (i = 0; i < 1000; i++) {
             A[i] = B[i] + C[i];
         }

     • Can run this loop “vectorized” and on 4 CPU cores in parallel
  9. Agenda Checkpoint
     • Windows 8 apps
     • Free performance boost
     • Squeeze the CPU (PPL)
     • Smoke the GPU (C++ AMP)
  10. Parallel Patterns Library (PPL)
      • Part of the C++ Runtime
        • No new libraries to link in
        • Task parallelism
        • Parallel algorithms
        • Concurrency-safe containers
        • Asynchronous agents
      • Abstracts away the notion of threads
        • Tasks are computations that may be run in parallel
      • Used to express your potential concurrency
        • Let the runtime map it to the available concurrency
        • Scale from 1 to 256 cores
  11. parallel_for
      parallel_for iterates over a range in parallel.

          #include <ppl.h>
          using namespace concurrency;

          parallel_for(0, 1000, [] (int i) {
              work(i);
          });

  12. parallel_for

          parallel_for(0, 1000, [] (int i) {
              work(i);
          });

      Core 1: work(0…249)      Core 2: work(250…499)
      Core 3: work(500…749)    Core 4: work(750…999)

      • Order of iteration is indeterminate.
      • Cores may come and go.
      • Ranges may be stolen by newly idle cores.
  13. parallel_for
      parallel_for considerations:
      • Designed for unbalanced loop bodies
        • An idle core can steal a portion of another core’s range of work
      • Supports cancellation
        • Early exit in search scenarios
      For fixed-sized loop bodies that don’t need cancellation, use parallel_for_fixed.
  14. parallel_for_each
      parallel_for_each iterates over an STL container in parallel.

          #include <ppl.h>
          using namespace concurrency;

          vector<int> v = …;
          parallel_for_each(v.begin(), v.end(), [] (int i) {
              work(i);
          });

  15. parallel_for_each
      • Works best with containers that support random-access iterators: std::vector, std::array, std::deque, concurrency::concurrent_vector, …
      • Works okay, but with higher overhead, on containers that support forward (or bidirectional) iterators: std::list, std::map, …
  16. parallel_invoke
      • Executes function objects in parallel and waits for them to finish

          #include <ppl.h>
          #include <string>
          #include <iostream>
          using namespace concurrency;
          using namespace std;

          template <typename T>
          T twice(const T& t) {
              return t + t;
          }

          int main() {
              int n = 54;
              double d = 5.6;
              string s = "Hello";

              // Each lambda touches a different variable, so the three calls are independent.
              parallel_invoke(
                  [&n] { n = twice(n); },
                  [&d] { d = twice(d); },
                  [&s] { s = twice(s); }
              );

              cout << n << ' ' << d << ' ' << s << endl;
              return 0;
          }
  17. task<>
      • Used to write asynchronous code
      • task::then lets you create continuations that get executed when the task finishes
      • You need to manage the lifetime of the variables going into a task

          #include <ppltasks.h>
          #include <iostream>
          using namespace concurrency;
          using namespace std;

          int main() {
              auto t = create_task([]() -> int {
                  return 42;
              });

              // The continuation receives the task's result; wait() blocks until it has run.
              t.then([](int result) {
                  cout << result << endl;
              }).wait();
          }
  18. Concurrent Containers
      • Thread-safe, lock-free containers provided:
        • concurrent_vector<>
        • concurrent_queue<>
        • concurrent_unordered_map<>
        • concurrent_unordered_multimap<>
        • concurrent_unordered_set<>
        • concurrent_unordered_multiset<>
      • Functionality resembles equivalent containers provided by the STL
      • Behavior is more limited to allow concurrency. For example:
        • concurrent_vector can push_back but not insert
        • concurrent_vector can clear but not pop_back or erase
  19. concurrent_vector<T>

          #include <ppl.h>
          #include <concurrent_vector.h>
          using namespace concurrency;

          concurrent_vector<int> carmVec;
          parallel_for(2, 5000000, [&carmVec](int i) {
              if (is_carmichael(i))
                  carmVec.push_back(i);
          });
  20. Agenda Checkpoint
      • Windows 8 apps
      • Free performance boost
      • Squeeze the CPU (PPL)
      • Smoke the GPU (C++ AMP)
  21. CPU / GPU Comparison
  22. What is C++ AMP? Performance & Productivity
      • C++ AMP -> C++ Accelerated Massive Parallelism
      • C++ AMP is
        • A programming model for expressing data parallel algorithms
        • A way to exploit heterogeneous systems using mainstream tools
        • C++ language extensions and a library
      • C++ AMP delivers performance without compromising productivity
  23. What is C++ AMP? C++ AMP gives you…
      • Productivity
        • Simple programming model
      • Portability
        • Run on hardware from NVIDIA, AMD, Intel and ARM*
        • Open Specification
      • Performance
        • Power of heterogeneous computing at your hands
      Use it to speed up data parallel algorithms
  24. Starting point: a plain CPU loop

          #include <iostream>

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

              for (int idx = 0; idx < 11; idx++)
              {
                  v[idx] += 1;
              }
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(v[i]);
          }

  25. Add the C++ AMP header and namespace

          #include <iostream>
          #include <amp.h>                // amp.h: header for C++ AMP library
          using namespace concurrency;    // concurrency: namespace for library

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

              for (int idx = 0; idx < 11; idx++)
              {
                  v[idx] += 1;
              }
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(v[i]);
          }

  26. Wrap the data in an array_view

          #include <iostream>
          #include <amp.h>
          using namespace concurrency;

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
              array_view<int> av(11, v);

              for (int idx = 0; idx < 11; idx++)
              {
                  v[idx] += 1;
              }
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(v[i]);
          }

      • array_view: wraps the data to operate on the accelerator. array_view variables are captured and the associated data is copied to the accelerator (on demand).
  27. Access the data through the array_view

          #include <iostream>
          #include <amp.h>
          using namespace concurrency;

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
              array_view<int> av(11, v);

              for (int idx = 0; idx < 11; idx++)
              {
                  av[idx] += 1;
              }
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(av[i]);
          }

      • array_view: wraps the data to operate on the accelerator. array_view variables are captured and the associated data is copied to the accelerator (on demand).
  28. Replace the loop with parallel_for_each

          #include <iostream>
          #include <amp.h>
          using namespace concurrency;

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
              array_view<int> av(11, v);

              parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
              {
                  av[idx] += 1;
              });
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(av[i]);
          }

      • parallel_for_each: executes the lambda on the accelerator once per thread
      • extent: the parallel loop bounds or computation “shape”
      • index: the thread ID that is running the lambda, used to index into data
  29. restrict(amp)

          #include <iostream>
          #include <amp.h>
          using namespace concurrency;

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
              array_view<int> av(11, v);

              parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
              {
                  av[idx] += 1;
              });
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(av[i]);
          }

      • restrict(amp): tells the compiler to check that the code conforms to the C++ subset, and tells the compiler to target the GPU
  30. Data movement with array_view

          #include <iostream>
          #include <amp.h>
          using namespace concurrency;

          int main()
          {
              int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
              array_view<int> av(11, v);

              parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
              {
                  av[idx] += 1;     // array_view: automatically copied to accelerator if required
              });
              for (unsigned int i = 0; i < 11; i++)
                  std::cout << static_cast<char>(av[i]);    // array_view: automatically copied back to host when and if required
          }
  31. C++ AMP Parallel Debugger
      • Well-known Visual Studio debugging features
        • Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
        • Tool windows: Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
      • New features (for both CPU and GPU)
        • Parallel Stacks window, Parallel Watch window
      • New GPU-specific
        • Emulator, GPU Threads window, race detection
      • concurrency::direct3d_printf, direct3d_errorf, direct3d_abort
  32. Summary
      • C++ is a great way to create fast and fluid apps for Windows 8
      • Get the most out of the compiler’s free optimizations
      • Use PPL for concurrent programming
      • Use C++ AMP for data parallel algorithms
  33. Thank you!
      tarekm@microsoft.com
