Blazing Fast Windows 8 Apps using Visual C++

Agenda

Windows 8 Apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)

Windows 8 Apps

New user experience
Touch-friendly
Trust
Battery-power
Fast and fluid

Windows 8 C++ App Options

XAML-based applications
 ▪ XAML user interface
 ▪ C++ code
DirectX-based applications and games
 ▪ DirectX user interface (D2D or D3D)
 ▪ C++ code
Hybrid XAML and DirectX applications
 ▪ XAML controls mixed with DirectX surfaces
 ▪ C++ code
HTML5 + JavaScript applications
 ▪ HTML5 user interface
 ▪ JS code calling into C++ code

Agenda Checkpoint

Windows 8 apps
Free performance boost
Squeeze the CPU (PPL)
Smoke the GPU (C++ AMP)

Recap of “free” performance

Compilation Unit Optimizations
• /O2 and friends

Whole Program Optimizations
• /GL and /LTCG

Profile Guided Optimization
• /LTCG:PGI and /LTCG:PGO

More “free” boosts

Automatic vectorization
• Always on in VS2012
• Uses “vector” instructions where possible in loops

for (i = 0; i < 1000; i++) {
    A[i] = B[i] + C[i];
}

SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2
VECTOR (N operations): vadd v3, v1, v2 computes v3 = v1 + v2 across the whole vector length

• Can run this loop in only 250 iterations, down from 1,000, when each vector holds 4 elements!
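As a rough illustration of which loops fit this pattern, the sketch below contrasts a loop with independent iterations (like the one on the slide) with a loop that carries a dependency from one iteration to the next. The function names add_arrays and prefix_sum are made up for this example. Whether a given loop is actually vectorized depends on the compiler and target; in VS2012 the /Qvec-report switch can be used to check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each iteration touches only element i, so iterations are independent
// and a vectorizing compiler may process several elements per instruction.
std::vector<float> add_arrays(const std::vector<float>& b,
                              const std::vector<float>& c) {
    assert(b.size() == c.size());
    std::vector<float> a(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) {
        a[i] = b[i] + c[i];
    }
    return a;
}

// Iteration i needs the running total produced by iteration i - 1
// (a loop-carried dependency), so the iterations cannot simply be
// handed to independent vector lanes.
std::vector<float> prefix_sum(const std::vector<float>& b) {
    std::vector<float> a(b.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < b.size(); ++i) {
        running += b[i];
        a[i] = running;
    }
    return a;
}
```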

More “free” boosts

Automatic parallelization
• Uses multiple CPU cores
• /Qpar compiler switch

#pragma loop(hint_parallel(4))
for (i = 0; i < 1000; i++) {
    A[i] = B[i] + C[i];
}

• Can run this loop “vectorized”
and on 4 CPU cores in parallel

Parallel Patterns Library (PPL)

Part of the C++ Runtime
 ▪ No new libraries to link in
 ▪ Task parallelism
 ▪ Parallel algorithms
 ▪ Concurrency-safe containers
 ▪ Asynchronous agents
Abstracts away the notion of threads
 ▪ Tasks are computations that may be run in parallel
Used to express your potential concurrency
 ▪ Let the runtime map it to the available concurrency
 ▪ Scale from 1 to 256 cores

parallel_for

parallel_for iterates over a range in parallel

#include <ppl.h>

using namespace concurrency;

parallel_for(0, 1000,
    [] (int i) {
        work(i);
    }
);

parallel_for
parallel_for(0, 1000, [] (int i) {
    work(i);
});

Core 1: work(0…249)      Core 2: work(250…499)
Core 3: work(500…749)    Core 4: work(750…999)

• Order of iteration is indeterminate.
• Cores may come and go.
• Ranges may be stolen by newly idle cores.

parallel_for

parallel_for considerations:
• Designed for unbalanced loop bodies
• An idle core can steal a portion of another core’s range of work
• Supports cancellation
• Early exit in search scenarios

For fixed-sized loop bodies that don’t need cancellation, use
parallel_for_fixed.
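To make the fixed-partition idea concrete, here is a minimal portable sketch that splits a range into equal chunks and runs one std::thread per chunk. It is not PPL's implementation: split_range and static_parallel_for are names invented for this example, and unlike parallel_for there is no work stealing or cancellation, which is why this shape only suits loop bodies of roughly equal cost.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Split [first, last) into `parts` contiguous chunks of near-equal size.
std::vector<std::pair<int, int>> split_range(int first, int last, int parts) {
    assert(parts > 0 && first <= last);
    std::vector<std::pair<int, int>> chunks;
    const int total = last - first;
    int begin = first;
    for (int p = 0; p < parts; ++p) {
        // Spread any remainder over the leading chunks.
        const int size = total / parts + (p < total % parts ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}

// Run body(i) for every i in [first, last), one thread per chunk.
// A thread that finishes early simply stops; it does not steal work.
void static_parallel_for(int first, int last, int parts,
                         const std::function<void(int)>& body) {
    std::vector<std::thread> workers;
    for (const auto& chunk : split_range(first, last, parts)) {
        workers.emplace_back([chunk, &body] {
            for (int i = chunk.first; i < chunk.second; ++i) {
                body(i);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}
```

With split_range(0, 1000, 4) the chunks are [0, 250), [250, 500), [500, 750) and [750, 1000), matching the four-core picture on the earlier slide.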

parallel_for_each

parallel_for_each iterates over an STL container in parallel

#include <ppl.h>
#include <vector>
using namespace concurrency; using namespace std;

vector<int> v = …;

parallel_for_each(v.begin(), v.end(),
    [] (int i) {
        work(i);
    }
);

parallel_for_each

Works best with containers that support random-access iterators:
 ▪ std::vector, std::array, std::deque, concurrency::concurrent_vector, …

Works okay, but with higher overhead, on containers that support forward (or bidirectional) iterators:
 ▪ std::list, std::map, …

parallel_invoke

• Executes function objects in parallel and waits for them to finish
#include <ppl.h>
#include <string>
#include <iostream>
using namespace concurrency; using namespace std;

template <typename T>
T twice(const T& t) {
    return t + t;
}

int main() {
    int n = 54; double d = 5.6; string s = "Hello";
    parallel_invoke(
        [&n] { n = twice(n); },
        [&d] { d = twice(d); },
        [&s] { s = twice(s); }
    );
    cout << n << ' ' << d << ' ' << s << endl;
    return 0;
}

task<>
• Used to write asynchronous code
• task::then lets you create continuations that get executed when the task finishes
• You need to manage the lifetime of the variables going into a task
#include <ppltasks.h>
#include <iostream>
using namespace concurrency; using namespace std;

int main()
{
    auto t = create_task([]() -> int
    {
        return 42;
    });

    t.then([](int result)
    {
        cout << result << endl;
    }).wait();
}
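The lifetime bullet is easy to get wrong, so here is a small sketch of one common way to handle it: let the lambda own its data through a shared_ptr captured by value. The sketch uses std::async from the standard library rather than PPL so that it builds with any C++11 compiler, and sum_async is a hypothetical helper. The same capture rule applies to lambdas passed to create_task and then.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical helper: start summing 1..n in the background and return
// immediately. The vector is owned by a shared_ptr that the lambda
// captures BY VALUE, so the data stays alive until the asynchronous work
// finishes, even though the local `data` goes out of scope on return.
std::future<long long> sum_async(int n) {
    assert(n >= 0);
    auto data = std::make_shared<std::vector<int>>(n);
    std::iota(data->begin(), data->end(), 1);
    return std::async(std::launch::async, [data] {
        return std::accumulate(data->begin(), data->end(), 0LL);
    });
    // Capturing a local std::vector with [&] instead would leave the
    // lambda holding a dangling reference once sum_async returns.
}
```

A caller would write something like sum_async(100).get() and receive the sum once the background work completes.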

Concurrent Containers

• Thread-safe, lock-free containers provided:
 ▪ concurrent_vector<>
 ▪ concurrent_queue<>
 ▪ concurrent_unordered_map<>
 ▪ concurrent_unordered_multimap<>
 ▪ concurrent_unordered_set<>
 ▪ concurrent_unordered_multiset<>
• Functionality resembles equivalent containers provided by the STL
• Behavior is more limited to allow concurrency. For example:
 ▪ concurrent_vector can push_back but not insert
 ▪ concurrent_vector can clear but not pop_back or erase

concurrent_vector<T>

#include <ppl.h>
#include <concurrent_vector.h>
using namespace concurrency;

concurrent_vector<int> carmVec;

parallel_for(2, 5000000, [&carmVec](int i) {
    if (is_carmichael(i))
        carmVec.push_back(i);
});
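The slide does not show is_carmichael. One possible implementation, offered only as a sketch, uses Korselt's criterion: a composite number n is a Carmichael number exactly when it is square-free and (p - 1) divides (n - 1) for every prime factor p of n. It has no shared state, so it is safe to call from many loop iterations at once.

```cpp
// Assumed implementation of the predicate used on the slide, based on
// Korselt's criterion and plain trial division.
bool is_carmichael(int n) {
    if (n < 3 || n % 2 == 0) return false;   // Carmichael numbers are odd
    int remaining = n;
    int prime_factors = 0;
    for (int p = 3; static_cast<long long>(p) * p <= remaining; p += 2) {
        if (remaining % p != 0) continue;
        remaining /= p;
        if (remaining % p == 0) return false;       // p * p divides n
        if ((n - 1) % (p - 1) != 0) return false;   // Korselt fails for p
        ++prime_factors;
    }
    if (remaining > 1) {                     // one prime factor is left over
        if ((n - 1) % (remaining - 1) != 0) return false;
        ++prime_factors;
    }
    return prime_factors >= 2;               // rejects primes
}
```

For example, 561 = 3 × 11 × 17 passes because 2, 10 and 16 all divide 560. Because the order of iteration in parallel_for is indeterminate, the numbers land in carmVec in no particular order; sort afterwards if order matters.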

What is C++ AMP?

Performance & Productivity
C++ AMP -> C++ Accelerated Massive Parallelism
C++ AMP is
• Programming model for expressing data parallel algorithm
• Exploiting heterogeneous system using mainstream tools
• C++ language extensions and library

C++ AMP delivers performance without compromising productivity

What is C++ AMP?

C++ AMP gives you…
Productivity
• Simple programming model
Portability
• Run on hardware from NVIDIA, AMD, Intel and ARM*
• Open Specification
Performance
• Power of heterogeneous computing at your hands

Use it to speed up data parallel algorithms

#include <iostream>

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>                 // amp.h: header for the C++ AMP library
using namespace concurrency;     // concurrency: namespace for the library

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    // array_view: wraps the data to operate on the accelerator.
    // array_view variables are captured and the associated data is
    // copied to the accelerator (on demand).
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        v[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(v[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    // array_view: wraps the data to operate on the accelerator.
    // array_view variables are captured and the associated data is
    // copied to the accelerator (on demand).
    array_view<int> av(11, v);
    for (int idx = 0; idx < 11; idx++)
    {
        av[idx] += 1;
    }

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    // parallel_for_each: execute the lambda on the accelerator once per thread.
    // extent: the parallel loop bounds or computation “shape”.
    // index: the thread ID that is running the lambda, used to index into data.
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}

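#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    // restrict(amp): tells the compiler to check that the code conforms
    // to the C++ subset, and tells the compiler to target the GPU.
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);
}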

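#include <iostream>
#include <amp.h>
using namespace concurrency;

int main()
{
    int v[11] = {'G', 'd', 'k', 'k', 'n', 31, 'v', 'n', 'q', 'k', 'c'};

    array_view<int> av(11, v);
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp)
    {
        av[idx] += 1;    // array_view: automatically copied to the accelerator if required
    });

    for (unsigned int i = 0; i < 11; i++)
        std::cout << static_cast<char>(av[i]);    // array_view: automatically copied back to the host when and if required
}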
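For checking results on a machine without a C++ AMP capable compiler, the same computation can be expressed as a plain CPU function. This is only a reference sketch: increment_and_decode is a name introduced here, and it assumes an ASCII-compatible character set.

```cpp
#include <string>
#include <vector>

// CPU reference for the kernel above: add 1 to every element, then
// interpret each resulting value as a character.
std::string increment_and_decode(std::vector<int> values) {
    std::string text;
    text.reserve(values.size());
    for (int& x : values) {
        x += 1;
        text.push_back(static_cast<char>(x));
    }
    return text;
}
```

Fed the eleven values from the slide, it returns "Hello world", which is what the C++ AMP version prints.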

C++ AMP
Parallel Debugger
Well-known Visual Studio debugging features
 ▪ Launch (incl. remote), Attach, Break, Stepping, Breakpoints, DataTips
 ▪ Tool windows
 ▪ Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
New features (for both CPU and GPU)
 ▪ Parallel Stacks window, Parallel Watch window
New GPU-specific
 ▪ Emulator, GPU Threads window, race detection
concurrency::direct3d_printf, direct3d_errorf, direct3d_abort

Summary

C++ is a great way to create fast and fluid apps for Windows 8
Get the most out of the compiler’s free optimizations
Use PPL for concurrent programming
Use C++ AMP for data parallel algorithms

Thank you!

tarekm@microsoft.com
