Agenda

Windows 8 Apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Windows 8 Apps

New user experience
Touch-friendly
Trust
Battery-power
Fast and fluid
Windows 8 C++ App Options

XAML-based applications
 XAML user interface
 C++ code
DirectX-based applications and games
 DirectX user interface (D2D or D3D)
 C++ code
Hybrid XAML and DirectX applications
 XAML controls mixed with DirectX surfaces
 C++ code
HTML5 + JavaScript applications
 HTML5 user interface
 JS code calling into C++ code
Agenda Checkpoint

Windows 8 apps
  Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Recap of “free” performance

Compilation Unit Optimizations
• /O2 and friends

Whole Program Optimizations
• /GL and /LTCG

Profile Guided Optimization
• /LTCG:PGI and /LTCG:PGO
More “free” boosts

Automatic vectorization
•   Always on in VS2012
•   Uses “vector” instructions where possible in loops

    for (i = 0; i < 1000; i++) {
        A[i] = B[i] + C[i];
    }

    SCALAR (1 operation):   add r3, r1, r2     r3 = r1 + r2
    VECTOR (N operations):  vadd v3, v1, v2    v3 = v1 + v2, N = vector length

•   With a vector length of 4, this loop can run in only 250
    iterations, down from 1,000!
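
The 250-iteration figure follows from processing four elements per instruction. The sketch below is only a scalar illustration of that idea, not what the compiler actually emits: `add_scalar` and `add_four_wide` are hypothetical names, and the vector length of 4 is an assumption (for example 32-bit floats in a 128-bit register).

```cpp
// Scalar form: one addition per iteration, 1000 iterations.
void add_scalar(float* a, const float* b, const float* c) {
    for (int i = 0; i < 1000; ++i)
        a[i] = b[i] + c[i];
}

// Conceptual vector form: each outer iteration stands for one "vadd"
// that handles 4 independent lanes. Returns the number of outer iterations.
int add_four_wide(float* a, const float* b, const float* c) {
    int iterations = 0;
    for (int i = 0; i < 1000; i += 4) {
        for (int lane = 0; lane < 4; ++lane)
            a[i + lane] = b[i + lane] + c[i + lane];
        ++iterations;
    }
    return iterations;
}
```

Both functions produce the same array; the second one just makes the 4-at-a-time grouping visible.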
More “free” boosts

Automatic parallelization
•   Uses multiple CPU cores
•   /Qpar compiler switch

    #pragma loop (hint_parallel(4))
    for (i = 0; i < 1000; i++) {
        A[i] = B[i] + C[i];
    }


•   Can run this loop “vectorized”
    and on 4 CPU cores in parallel
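
A minimal sketch of the same loop as a complete function, assuming it is built with Visual C++ and /O2 /Qpar. The function name `add_with_parallel_hint` is hypothetical. The pragma is guarded so that other compilers simply see a plain loop.

```cpp
// Adds b and c element-wise into a. Iterations are independent of each
// other, which is what lets the compiler split them across cores.
void add_with_parallel_hint(float* a, const float* b, const float* c, int n) {
#if defined(_MSC_VER)
    // Hint only: ask the auto-parallelizer to use up to 4 threads (needs /Qpar).
    #pragma loop(hint_parallel(4))
#endif
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}
```

The hint does not change the result; whether the loop is actually parallelized remains the compiler's decision.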
Agenda Checkpoint

Windows 8 apps
Free performance boost
  Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)
Parallel Patterns Library (PPL)

 Part of the C++ Runtime
    No new libraries to link in
    Task parallelism
    Parallel algorithms
    Concurrency-safe containers
    Asynchronous agents
 Abstracts away the notion of threads
    Tasks are computations that may be run in parallel
 Used to express your potential concurrency
    Let the runtime map it to the available concurrency
    Scale from 1 to 256 cores
parallel_for

parallel_for iterates over a range in parallel

#include <ppl.h>

using namespace concurrency;

parallel_for( 0, 1000,
   [] (int i) {
      work(i);
   }
);
parallel_for
 parallel_for(0, 1000, [] (int i) {
     work(i);
 });


     Core 1: work(0…249)        Core 2: work(250…499)
     Core 3: work(500…749)      Core 4: work(750…999)

 • Order of iteration is indeterminate.
 • Cores may come and go.
 • Ranges may be stolen by newly idle cores.
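
The initial split in the diagram can be reproduced with simple arithmetic. This is only an illustration of an even split; it is not the PPL's actual partitioning code, and `initial_ranges` is a hypothetical helper.

```cpp
#include <utility>
#include <vector>

// Splits [first, last) into `cores` contiguous half-open ranges of
// near-equal size. Any remainder is spread over the first ranges.
std::vector<std::pair<int, int>> initial_ranges(int first, int last, int cores) {
    std::vector<std::pair<int, int>> ranges;
    const int total = last - first;
    int begin = first;
    for (int c = 0; c < cores; ++c) {
        const int size = total / cores + (c < total % cores ? 1 : 0);
        ranges.emplace_back(begin, begin + size);
        begin += size;
    }
    return ranges;
}
```

For `initial_ranges(0, 1000, 4)` the ranges are [0, 250), [250, 500), [500, 750) and [750, 1000), matching the four cores above. At run time the real ranges shift as idle cores steal work.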
parallel_for

parallel_for considerations:
• Designed for unbalanced loop bodies
  •   An idle core can steal a portion of another core’s range of work
  •   Supports cancellation
  •   Early exit in search scenarios


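For fixed-sized loop bodies that don’t need cancellation, use
parallel_for_fixed (from the ConcRT Extras sample pack, not <ppl.h>) or,
in VS2012, pass a static_partitioner to parallel_for.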
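
For uniform loop bodies, VS2012's `<ppl.h>` also lets you pass a `concurrency::static_partitioner` to `parallel_for`, which divides the range up front instead of relying on range stealing. The sketch below assumes that route rather than `parallel_for_fixed`. `double_all` is a hypothetical function, and the serial branch exists only so the example builds on compilers without PPL.

```cpp
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Doubles every element. Each iteration costs the same, so a fixed
// up-front split of the range is enough.
void double_all(std::vector<int>& data) {
    const int n = static_cast<int>(data.size());
#if defined(_MSC_VER)
    concurrency::parallel_for(0, n,
        [&data](int i) { data[i] *= 2; },
        concurrency::static_partitioner());
#else
    for (int i = 0; i < n; ++i)   // serial fallback, assumption for portability
        data[i] *= 2;
#endif
}
```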
parallel_for_each

parallel_for_each iterates over an STL container in parallel

#include <ppl.h>

using namespace concurrency;

vector<int> v = …;

parallel_for_each(v.begin(), v.end(),
   [] (int i) {
      work(i);
   }
);
parallel_for_each

Works best with containers that support random-access iterators:
 std::vector, std::array, std::deque, concurrency::concurrent_vector, …


Works okay, but with higher overhead, on containers that support only forward
(or bidirectional) iterators:
 std::list, std::map, …
parallel_invoke

• Executes function objects in parallel and waits for them to finish
 #include <ppl.h>
 #include <string>
 #include <iostream>
 using namespace concurrency; using namespace std;

 template <typename T>
 T twice(const T& t) {
    return t + t;
 }

 int main() {
    int n = 54; double d = 5.6; string s = "Hello";
    parallel_invoke(
       [&n] { n = twice(n); },
       [&d] { d = twice(d); },
       [&s] { s = twice(s); }
    );
    cout << n << ' ' << d << ' ' << s << endl;
    return 0;
 }
task<>
• Used to write asynchronous code
• task::then lets you create continuations that get executed when the task finishes
• You need to manage the lifetime of the variables going into a task
    #include <ppltasks.h>
    #include <iostream>
    using namespace concurrency; using namespace std;

    int main()
    {
        auto t = create_task([]() -> int
        {
            return 42;
        });

         t.then([](int result)
         {
             cout << result << endl;
         }).wait();
    }
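
The lifetime point deserves a concrete case: a task may run after the function that created it has returned, so capturing locals by reference can dangle. The sketch below uses `std::function` as a portable stand-in for a task (an assumption made so it builds without PPL), and `make_sum_work` is a hypothetical name. The same lambda could be handed to `create_task`.

```cpp
#include <functional>
#include <memory>
#include <numeric>
#include <vector>

// Returns deferred work that sums a buffer created inside this function.
std::function<int()> make_sum_work() {
    auto data = std::make_shared<std::vector<int>>();
    for (int i = 1; i <= 4; ++i)
        data->push_back(i);

    // Capturing `data` by value copies the shared_ptr, so the vector stays
    // alive for as long as the returned work object exists. Capturing a
    // local vector by reference here would leave a dangling reference.
    return [data] { return std::accumulate(data->begin(), data->end(), 0); };
}
```

Calling `make_sum_work()()` after the function has returned is safe because the lambda owns a share of the buffer.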
Concurrent Containers

• Thread-safe, lock-free containers provided:
    concurrent_vector<>
    concurrent_queue<>
    concurrent_unordered_map<>
    concurrent_unordered_multimap<>
    concurrent_unordered_set<>
    concurrent_unordered_multiset<>
• Functionality resembles equivalent containers provided by the STL
• Behavior is more limited to allow concurrency. For example:
    • concurrent_vector can push_back but not insert
    • concurrent_vector can clear but not pop_back or erase
concurrent_vector<T>

 #include <ppl.h>
 #include <concurrent_vector.h>

 using namespace concurrency;

 concurrent_vector<int> carmVec;

 parallel_for(2, 5000000, [&carmVec](int i) {
     if (is_carmichael(i))
         carmVec.push_back(i);
 });
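
The slide leaves `is_carmichael` undefined. One possible implementation, offered only as a sketch and not as the presenter's code, uses Korselt's criterion: n is a Carmichael number if it is composite, square-free, and p - 1 divides n - 1 for every prime p dividing n.

```cpp
// Hypothetical helper for the example above, using trial division.
bool is_carmichael(int n) {
    if (n < 3 || n % 2 == 0) return false;   // Carmichael numbers are odd
    int m = n;                               // part of n not yet factored
    int prime_factors = 0;
    for (int p = 3; static_cast<long long>(p) * p <= m; p += 2) {
        if (m % p != 0) continue;
        m /= p;
        if (m % p == 0) return false;              // p*p divides n: not square-free
        if ((n - 1) % (p - 1) != 0) return false;  // Korselt's divisibility test
        ++prime_factors;
    }
    if (m > 1) {                             // one prime factor is left over
        if ((n - 1) % (m - 1) != 0) return false;
        ++prime_factors;
    }
    return prime_factors >= 2;               // rejects primes
}
```

Because iteration order is indeterminate, the values pushed into `carmVec` arrive in no particular order. Sort the vector afterwards if order matters.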
Agenda Checkpoint

Windows 8 apps
Free performance boost
Squeeze the CPU (PPL)
  Smoke the GPU (C++ AMP)
CPU / GPU Comparison
What is C++ AMP?

Performance & Productivity
  C++ AMP -> C++ Accelerated Massive Parallelism
  C++ AMP is
   •   Programming model for expressing data parallel algorithms
   •   Exploits heterogeneous systems using mainstream tools
   •   C++ language extensions and library


        C++ AMP delivers performance without compromising productivity
What is C++ AMP?

C++ AMP gives you…
  Productivity
   •   Simple programming model
  Portability
   •   Run on hardware from NVIDIA, AMD, Intel and ARM*
   •   Open Specification
  Performance
   •   Power of heterogeneous computing at your hands


Use it to speed up data parallel algorithms
#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
#include <iostream>
#include <amp.h>                // amp.h: header for C++ AMP library
using namespace concurrency;    // concurrency: namespace for library

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    // array_view: wraps the data to operate on the accelerator. array_view
    // variables captured and associated data copied to accelerator (on demand).
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    // array_view: wraps the data to operate on the accelerator. array_view
    // variables captured and associated data copied to accelerator (on demand).
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    // parallel_for_each: execute the lambda on the accelerator once per thread.
    // extent: the parallel loop bounds or computation “shape”.
    // index: the thread ID that is running the lambda, used to index into data.
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    // restrict(amp): tells the compiler to check that code conforms to the
    // C++ subset, and tells the compiler to target the GPU.
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;   // array_view: automatically copied to accelerator if required
    });

    // array_view: automatically copied back to host when and if required
    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}
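
For reference, here is what the sample computes, written as a plain CPU function that returns the text instead of printing it. This is only a serial check of the arithmetic, with the hypothetical name `decode_serial`. It does not use C++ AMP.

```cpp
#include <string>

// Adds 1 to each of the 11 values and collects them as characters.
std::string decode_serial() {
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};
    std::string out;
    for (int i = 0; i < 11; ++i)
        out += static_cast<char>(v[i] + 1);   // 'G' -> 'H', 31 -> ' ', ...
    return out;
}
```

The result is "Hello world", which is what the C++ AMP version prints after the accelerator has incremented every element.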
C++ AMP Parallel Debugger
Well-known Visual Studio debugging features
 Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
 Tool windows
   Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
New features (for both CPU and GPU)
 Parallel Stacks window, Parallel Watch window
New GPU-specific
 Emulator, GPU Threads window, race detection
 concurrency::direct3d_printf, direct3d_errorf, direct3d_abort
Summary

C++ is a great way to create fast and fluid apps for Windows 8
Get the most out of the compiler’s free optimizations
Use PPL for concurrent programming
Use C++ AMP for data parallel algorithms
Thank you!

tarekm@microsoft.com

Blazing Fast Windows 8 Apps using Visual C++
