What’s new in Visual C++ 11
Jim Hogg

Program Manager
Visual C++
Microsoft

Agenda
• Why C++?
• Performance : CPUs and GPUs
• Baseline : Single-CPU / Multi-CPU Demo
• Vector CPU Demo
• GPU : C++ AMP Demo
• ISO C++ 11
• ALM (Application Lifecycle Management)

Why C++? : Power & Performance
• power: driver at all scales – on-die, mobile, desktop, datacenter
• size: limits on processor resources – desktop, mobile
• experiences: bigger experiences on smaller hardware; pushing envelope means every cycle matters

“The going word at Facebook is that ‘reasonably written C++ code just runs fast,’ which underscores the enormous effort spent at optimizing PHP and Java code. Paradoxically, C++ code is more difficult to write than in other languages, but efficient code is a lot easier.” – Andrei Alexandrescu

CPU vs. GPU today

CPU
• Low memory bandwidth
• Higher power consumption
• Medium level of parallelism
• Deep execution pipelines
• Random accesses
• Supports general code
• Mainstream programming

GPU
• High memory bandwidth
• Lower power consumption
• High level of parallelism
• Shallow execution pipelines
• Sequential accesses
• Supports data-parallel code
• Niche programming

images source: AMD

NBody Simulation, CPU (novec)

Vector Processors – How they work
SCALAR: ADD RAX, RBX
  RAX     1.10
  RBX   + 1.20
  RAX   = 2.30

for (int i = 0; i < 1000; ++i) a[i] += b[i];

VECTOR: ADDPS XMM1, XMM2
  XMM1    1.10  2.10  3.10  4.10
  XMM2  + 1.20  2.20  3.20  4.20
  XMM1  = 2.30  4.30  6.30  8.30

for (int i = 0; i < 1000; i += 4) a[i : i+3] += b[i : i+3];
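
The slice notation a[i : i+3] is slide shorthand, not C++. A minimal portable sketch of the same idea follows; the helper names add_scalar and add_blocks_of_4 are hypothetical. The second function handles four elements per outer iteration, which is the loop shape a vectorizing compiler can map onto a single four-lane instruction such as ADDPS. It assumes b holds at least as many elements as a.

```cpp
#include <cstddef>
#include <vector>

// Scalar form: one element per iteration.
void add_scalar(std::vector<float>& a, const std::vector<float>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] += b[i];
}

// Strip-mined form: four elements ("lanes") per outer iteration,
// followed by a scalar loop for the leftover elements.
void add_blocks_of_4(std::vector<float>& a, const std::vector<float>& b) {
    const std::size_t n = a.size();
    const std::size_t vec_end = n - (n % 4);   // largest multiple of 4 <= n
    for (std::size_t i = 0; i < vec_end; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            a[i + lane] += b[i + lane];        // a[i : i+3] += b[i : i+3]
        }
    }
    for (std::size_t i = vec_end; i < n; ++i) a[i] += b[i];  // remainder
}
```

Both functions compute the same result; only the iteration structure differs.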

Compiler Enhancements
• Auto-vectorizer
– Automatically vectorizes loops
– Uses SIMD instructions
– ON by default

for (i = 0; i < 1024; i++)
    a[i] = b[i] * c[i];

becomes:

for (i = 0; i < 1024; i += 4)
    a[i:i+3] = b[i:i+3] * c[i:i+3];

• Auto-parallelization
– Reorganizes the loop to run on multiple threads
– /Qpar
– Optional #pragma loop

#pragma loop(hint_parallel(N))
for (i = 0; i < 1024; i++)
    a[i] = b[i] * c[i];

Multi-Core Machines (w/ Vectorization)

NBody Simulation, CPU (Auto Vectorize + Parallelize)

Source Code

int A[20000];
int B[20000];
int C[20000];

for (i=0; i<20000; i++) {
    A[i] = B[i] + C[i];
}

Assembly of Body

$LL3@foo:
    mov ecx, DWORD PTR ?C@@3PAHA[eax*4]
    mov edx, DWORD PTR ?B@@3PAHA[eax*4]
    add ecx, edx
    mov DWORD PTR ?A@@3PAHA[eax*4], ecx
    inc eax
    cmp eax, esi
    jl SHORT $LL3@foo

Dev11 /O2: 400% Speedup!!!

Transformation

int A[20000];
int B[20000];
int C[20000];

for (i=0; i<20000; i+=4) {
    A[i:i+3] = B[i:i+3] + C[i:i+3];
}

Assembly of Body

$LL3@foo:
    movdqu xmm1, XMMWORD PTR ?C@@3PAHA[eax*4]
    movdqu xmm0, XMMWORD PTR ?B@@3PAHA[eax*4]
    paddd xmm1, xmm0
    movdqu XMMWORD PTR ?A@@3PAHA[eax*4], xmm1
    add eax, 4
    cmp eax, ecx
    jl SHORT $LL3@foo

Original loop:

for (k = 1; k <= M; k++) {
    ...  (leading statements, including several ifs and one use of xmb)

    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;

    if (k < M) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }
}

Split into separate loops:

for (k = 1; k <= M; k++) {
    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;
}

for (k = 1; k < M; k++) {
    ic[k] = mpp[k] + tpmi[k];
    if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
    ic[k] += is[k];
    if (ic[k] < -INFTY) ic[k] = -INFTY;
}

The Power of Heterogeneous Computing

• 146X: Interactive visualization of volumetric white matter connectivity
• 36X: Ionic placement for molecular dynamics simulation on GPU
• 19X: Transcoding HD video stream to H.264
• 17X: Simulation in Matlab using .mex file CUDA function
• 100X: Astrophysics N-body simulation
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab: An M-script API for linear Algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

C++ AMP
• Part of Visual C++
• Visual Studio integration
• STL-like library for multidimensional data
• Builds on Direct3D
performance
productivity
portability

Hello World: Array Addition
CPU version:

void AddArrays(int n, int* pA, int* pB, int* pC)
{
    for (int i = 0; i < n; ++i)
    {
        pC[i] = pA[i] + pB[i];
    }
}

C++ AMP version:

#include <amp.h>
using namespace concurrency;

void AddArrays(int* a, int* b, int* c, int N)
{
    array_view<int,1> va(N, a);
    array_view<int,1> vb(N, b);
    array_view<int,1> vc(N, c);

    parallel_for_each(
        va.grid,
        [=](index<1> i) restrict(direct3d)
        {
            va[i] = vb[i] + vc[i];
        }
    );
}

Basic Elements of C++ AMP coding
void AddArrays(int* a, int* b, int* c, int N)
{
    array_view<int,1> va(N, a);
    array_view<int,1> vb(N, b);
    array_view<int,1> vc(N, c);

    parallel_for_each(
        va.grid,
        [=](index<1> i) restrict(direct3d)
        {
            va[i] = vb[i] + vc[i];
        }
    );
}

• array_view: wraps the data to operate on the accelerator
• array_view variables are captured and the associated data copied to the accelerator (on demand)
• parallel_for_each: executes the lambda on the accelerator once per thread
• grid: the number and shape of threads to execute the lambda
• index: the thread ID that is running the lambda, used to index into data
• restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)

Achieving maximum performance gains
• Schedule threads in tiles
– Avoid thread index remapping
– Gain ability to use tile static memory

array_view<int,2> data(8, 6, p_my_data);
parallel_for_each(
    data.grid.tile<2,2>(),
    [=] (tiled_index<2,2> t_idx)… { … });

(Figure: an 8 × 6 grid of threads, rows 0–7 and columns 0–5, tiled two ways: g.tile<4,3>() and g.tile<2,2>())

C++ AMP at a Glance
• restrict(direct3d, cpu)
• parallel_for_each
• class array<T,N>
• class array_view<T,N>
• class index<N>
• class extent<N>, grid<N>
• class accelerator
• class accelerator_view
• tile_static storage class
• class tiled_grid< , , >
• class tiled_index< , , >
• class tile_barrier

Visual Studio/C++ AMP
• Organize
• Edit
• Design
• Build
• Browse
• Debug
• Profile

C++ AMP Parallel Debugger
• Well known Visual Studio debugging features
• Launch, Attach, Break, Stepping, Breakpoints, DataTips
• Tool windows
• Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
• New features (for both CPU and GPU)
• Parallel Stacks window, Parallel Watch window, Barrier
• New GPU-specific
• Emulator, GPU Threads window, race detection

Summary
• Democratization of parallel hardware programmability
• Performance for the mainstream
• High-level abstractions in C++ (not C)
• State-of-the-art Visual Studio IDE
• Hardware abstraction platform

• C++ AMP now published as open specification
• http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf

Modern C++: Clean, Safe and Fast
Then

circle* p = new circle( 42 );
vector<shape*> v = load_shapes();
for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
    if( *i && **i == *p )
        cout << **i << " is a match\n";
}
for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
    delete *i;
}
delete p;

• not exception-safe
• missing try/catch, __try/__finally

Now

auto p = make_shared<circle>( 42 );
vector<shared_ptr<shape>> vw = load_shapes();
for_each( begin(vw), end(vw), [&]( shared_ptr<shape>& s ) {
    if( s && *s == *p )
        cout << *s << " is a match\n";
} );

• auto type deduction
• T* → shared_ptr<T>
• new → make_shared
• for/while/do → std:: algorithms, [&] lambda functions
• no need for "delete": automatic lifetime management, exception-safe
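
A compilable sketch of the "Now" style, under stated assumptions: the slide does not define circle or load_shapes, so the minimal circle type, its operator==, and the load_circles helper below are hypothetical stand-ins. It counts matches instead of printing them so the result is easy to check.

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical minimal shape type; the slide does not define circle.
struct circle {
    int radius;
    explicit circle(int r) : radius(r) {}
};

bool operator==(const circle& a, const circle& b) { return a.radius == b.radius; }

// Hypothetical stand-in for load_shapes(): owns its elements via shared_ptr.
std::vector<std::shared_ptr<circle>> load_circles() {
    std::vector<std::shared_ptr<circle>> v;
    v.push_back(std::make_shared<circle>(7));
    v.push_back(nullptr);                        // the null check below matters
    v.push_back(std::make_shared<circle>(42));
    v.push_back(std::make_shared<circle>(42));
    return v;
}

int count_matches(int radius) {
    auto p = std::make_shared<circle>(radius);   // no raw new
    auto v = load_circles();                     // auto type deduction
    int matches = 0;
    std::for_each(std::begin(v), std::end(v),
                  [&](const std::shared_ptr<circle>& s) {
                      if (s && *s == *p) ++matches;
                  });
    return matches;   // p and every element of v are released automatically
}
```

No delete appears anywhere, and an exception thrown inside the loop would still release every object.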

C++ 11 Language Features in Visual Studio
C++11 Core Language Features         VC10      VC11
rvalue references                    v2.0      v2.1*
auto                                 v1.0      v1.0
decltype                             v1.0      v1.1**
static_assert                        Yes       Yes
trailing return types                Yes       Yes
lambdas                              v1.0      v1.1
nullptr                              Yes       Yes
strongly typed enums                 Partial   Yes
forward declared enums               No        Yes
standard-layout and trivial types    No        Yes
atomics                              No        Yes
strong compare and exchange          No        Yes
bidirectional fences                 No        Yes
data-dependency ordering             No        Yes

rvalue refs
struct Car {
    string make;   // eg "Volvo"
    int when;      // last-serviced – eg 201103 => March 2011
};

void workOnClone(Car c);     // work on a clone of my car – not returned

void inspect(const Car& c);  // inspect, but don't alter, my car

void fix(Car& c);            // fix and return my car

void replace(Car&& c);       // take my car and cannibalize it – I won't be using it again
                             // note that && is not a ref-to-ref (unlike **)
                             // enables "move semantics" and "perfect forwarding"
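
A small sketch of how overload resolution tells an lvalue from an rvalue, reusing the Car struct from the slide. The function names which_overload and take are hypothetical, chosen for illustration only.

```cpp
#include <string>
#include <utility>

struct Car {
    std::string make;   // eg "Volvo"
    int when;           // last-serviced, eg 201103
};

// Chosen for lvalues: the caller keeps the car.
std::string which_overload(const Car&) { return "inspect"; }

// Chosen for rvalues: the caller has given the car up.
std::string which_overload(Car&&) { return "replace"; }

// Move semantics: steals the string buffer instead of copying it.
Car take(Car&& c) {
    Car taken = std::move(c);
    return taken;
}
```

A named variable binds to the const Car& overload. A temporary, or a variable wrapped in std::move, binds to the Car&& overload. After take(std::move(x)), x is still valid but its contents are unspecified.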

auto
int n = 42;
double pi = 3.14159;
auto x = n * pi; // will infer type of x is double

for (std::map<string, vector<double>>::const_iterator iter = m.cbegin(); iter != m.cend(); ++iter)
for (auto iter = m.cbegin(); iter != m.cend(); ++iter)

const auto * p = new MyClass; // “add back” qualifiers to auto’s inferred type
const auto & r = s; // “add back” qualifiers to auto’s inferred type

auto a1 = new auto(42); // infers int*
auto * a2 = new auto(42); // beware: also infers int*

Notes: static type inference!
like C# “var”
may break old code: old auto specifies allocation within current stack frame
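
The deductions above can be checked at compile time. This is a minimal sketch; the function names sum_all and check_deduction are hypothetical.

```cpp
#include <map>
#include <string>
#include <type_traits>
#include <vector>

// auto replaces the long const_iterator type name.
double sum_all(const std::map<std::string, std::vector<double>>& m) {
    double total = 0.0;
    for (auto iter = m.cbegin(); iter != m.cend(); ++iter) {
        for (auto value : iter->second) total += value;
    }
    return total;
}

// Each static_assert confirms a type the compiler inferred statically.
void check_deduction() {
    int n = 42;
    double pi = 3.14159;
    auto x = n * pi;
    static_assert(std::is_same<decltype(x), double>::value, "int * double is double");

    auto a1 = new auto(42);
    auto* a2 = new auto(42);
    static_assert(std::is_same<decltype(a1), int*>::value, "a1 is int*");
    static_assert(std::is_same<decltype(a2), int*>::value, "a2 is also int*");
    delete a1;
    delete a2;
    (void)x;
}
```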

decltype
decltype(new C) c = new C; // c is a C*
// Note: first “new C” is not executed

std::vector<int>::const_iterator iter1; // a long type name

decltype(iter1) iter2; // iter2 has same type as iter1
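
A sketch that demonstrates the "not executed" note: the constructor counter below is a hypothetical addition that stays at zero, because the operand of decltype is never evaluated.

```cpp
#include <type_traits>
#include <vector>

struct C {
    static int constructed;   // counts how many C objects were built
    C() { ++constructed; }
};
int C::constructed = 0;

int decltype_demo() {
    decltype(new C) c = nullptr;   // "new C" is only inspected for its type
    static_assert(std::is_same<decltype(c), C*>::value, "c is a C*");

    std::vector<int>::const_iterator iter1;
    decltype(iter1) iter2;         // same type as iter1
    static_assert(std::is_same<decltype(iter2),
                               std::vector<int>::const_iterator>::value,
                  "iter2 has the type of iter1");
    (void)c; (void)iter1; (void)iter2;
    return C::constructed;         // number of C objects actually created
}
```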

static_assert
pre-processor-time
#if VERSION < 8
#error "Need version 8 or higher"
#endif

run-time
bool done(float g1, float g2, float tol) {
    assert (tol < 1.0e-3);
    ...
}

compile-time
static_assert (FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

template<class T> struct S {
    static_assert(sizeof(T) < sizeof(int), "T is too big");
    static_assert(std::is_unsigned<T>::value, "S needs an unsigned type");
    ...
};
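
A complete version of the compile-time checks, as a sketch. The value 5280 for FeetPerMile and the value member of S are assumptions added to make the fragment self-contained.

```cpp
#include <type_traits>

const int FeetPerMile = 5280;   // assumed value; the slide only shows the check
static_assert(FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

template <class T>
struct S {
    static_assert(sizeof(T) < sizeof(int), "T is too big");
    static_assert(std::is_unsigned<T>::value, "S needs an unsigned type");
    T value;                    // hypothetical member, for illustration
};

S<unsigned char> ok_small = { 200 };   // passes both checks
// S<int> too_big;          // would not compile: "T is too big"
// S<signed char> wrong;    // would not compile: "S needs an unsigned type"
```

The checks fire only when the template is instantiated, so a bad type argument is rejected where it is used.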

Trailing-Return-Type

template<class A, class B> ??? adder(A &a, B &b) { return a + b; } // no!

template<class A, class B> decltype(a + b) adder(A &a, B &b) { return a + b; } // no!

template<class A, class B> auto adder(A &a, B &b) -> decltype(a + b) { return a + b; } // yes!
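
A compilable sketch of the trailing-return-type form. One deliberate change from the slide: the parameters are const references, so literals and temporaries can be passed.

```cpp
#include <string>
#include <type_traits>

// a and b are in scope after the parameter list, so decltype(a + b) is legal here.
template <class A, class B>
auto adder(const A& a, const B& b) -> decltype(a + b) {
    return a + b;
}

// The return type follows the usual arithmetic conversions: int + double is double.
static_assert(std::is_same<decltype(adder(1, 2.5)), double>::value,
              "int + double yields double");
```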

lambdas – functions with no name
[ ] ( ) -> int { return 42; } ; // no arguments
[ ] (int n) -> int { return n * n; } ; // one argument
[ ] (int a, int b) -> int { return a + b; } ; // two arguments

for_each(v.begin(), v.end(), [ ] (int n) { cout << n << " "; }); // one-liner

float f1 = integrate ( golden, 0.0, 1.0 );
float f2 = integrate ( [ ] (float x ) { return x * x + x - 1; }, 0.0, 1.0 );

[ ] { cout << "hi"; } // can omit ( ) if no parameters
// can omit -> return-type if inferable

[ capture-clause ] ( parameter-list ) -> return-type { body } // grammar
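
The slide does not define integrate or golden. The sketch below supplies hypothetical versions: a midpoint-rule integrate taking a std::function, and a golden polynomial x*x - x - 1. It shows a named function and a lambda being passed interchangeably.

```cpp
#include <functional>

// Hypothetical integrate: midpoint rule over [lo, hi] with a fixed step count.
double integrate(const std::function<double(double)>& f,
                 double lo, double hi, int steps = 1000) {
    const double h = (hi - lo) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) sum += f(lo + (i + 0.5) * h);
    return sum * h;
}

// Hypothetical named function; a lambda can be passed in its place.
double golden(double x) { return x * x - x - 1; }

// A lambda can also be invoked immediately after it is defined.
int forty_two() { return [ ]() -> int { return 42; }(); }
```

A call such as integrate([ ](double x) { return x * x + x - 1; }, 0.0, 1.0) needs no separately named function. A capture clause such as [k] lets the body read a local variable k from the enclosing scope.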

Strongly-Typed Enums
Illegal – members must be globally unique
enum Heights {SHORT, TALL}; // ok
enum Widths {BYTE, SHORT, INT, LONG}; // clash

enum members are just integers
enum Colors {RED, GREEN, BLUE};
if (GREEN == 1) cout << "GREEN == 1"; // yes!
enum Parts {ENGINE, BRAKE, CLUTCH};
if (GREEN == BRAKE) cout << "GREEN == BRAKE"; // yes!

Use enum class
enum class Heights {SHORT, TALL};
enum class Widths {BYTE, SHORT, INT, LONG}; // eg: Widths::SHORT
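
A sketch showing that the two SHORT enumerators no longer clash, and that conversions must be explicit. The function bytes_for and its byte counts are hypothetical, for illustration only.

```cpp
enum class Heights { SHORT, TALL };
enum class Widths { BYTE, SHORT, INT, LONG };   // no clash with Heights::SHORT

// Hypothetical helper: enumerators must be qualified with the enum name.
int bytes_for(Widths w) {
    switch (w) {
        case Widths::BYTE:  return 1;
        case Widths::SHORT: return 2;
        case Widths::INT:   return 4;
        case Widths::LONG:  return 8;
    }
    return 0;
}

// int n = Widths::SHORT;                      // would not compile: no implicit conversion
// bool b = (Heights::SHORT == Widths::SHORT); // would not compile: different types
```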

Forward-Declared Enum Classes
enum class Colors : unsigned char; // forward declaration (underlying type must match the definition)

void fun(Colors c); // use

. . .

enum class Colors : unsigned char {RED = 3, GREEN, BLUE = 7};
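
A compilable sketch of the pattern; the function code_of is a hypothetical example of a use that precedes the definition. The forward declaration repeats the underlying type, because every declaration of an enum must agree on it.

```cpp
enum class Colors : unsigned char;   // forward declaration: size is already known

int code_of(Colors c);               // use before the enumerators are visible

enum class Colors : unsigned char { RED = 3, GREEN, BLUE = 7 };

int code_of(Colors c) { return static_cast<int>(c); }

static_assert(sizeof(Colors) == sizeof(unsigned char),
              "the underlying type fixes the size");
```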

nullptr
// the NULL hack:
int* p1 = 0; // value of 0 is ‘special’
int* p2 = 42; // illegal

void f (int n) { cout << n; };
f(0); // works

void f (int* p) { cout << p; };
f(0); // works

void f (int n) { cout << n; }
void f (int* p) { cout << p; };
f(0); // which one?

f(nullptr); // calls f(int*)

decltype(nullptr) == nullptr_t
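
A sketch that answers "which one?" by having each overload report its own name. Returning a string instead of printing is an assumption made so the result can be checked.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

std::string f(int)  { return "f(int)"; }
std::string f(int*) { return "f(int*)"; }

// nullptr has its own type, distinct from every integer type.
static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "decltype(nullptr) is std::nullptr_t");
```

With both overloads visible, f(0) selects f(int) because 0 is an int literal, while f(nullptr) can only select f(int*).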

Memory Model – Scary Terminology
• Dekker’s algorithm
• Double-checked locking
• Weak memory consistency
• Atomics
• Memory fences/barriers
• Volatile
• Sequential consistency
• Acquire/Release semantics
• Axiomatic definition & litmus tests

Dekker’s Algorithm
Process 0                          Process 1
flag[0] := true                    flag[1] := true
while flag[1] = true {             while flag[0] = true {
    if turn ≠ 0 {                      if turn ≠ 1 {
        flag[0] := false                   flag[1] := false
        while turn ≠ 0 { }                 while turn ≠ 1 { }
        flag[0] := true                    flag[1] := true
    }                                  }
}                                  }
// critical section                // critical section
turn := 1                          turn := 0
flag[0] := false                   flag[1] := false
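
A sketch of the same algorithm in C++11, assuming std::atomic with its default sequentially consistent ordering, which is what the algorithm needs to be correct. The names Dekker and run_dekker are hypothetical. With plain bool and int variables instead, the compiler or the hardware could reorder the flag write and the flag read, and both threads could enter the critical section.

```cpp
#include <atomic>
#include <thread>

struct Dekker {
    std::atomic<bool> flag[2];
    std::atomic<int> turn;

    Dekker() : turn(0) { flag[0] = false; flag[1] = false; }

    void lock(int me) {
        const int other = 1 - me;
        flag[me] = true;                      // announce intent
        while (flag[other]) {                 // the other thread also wants in
            if (turn != me) {
                flag[me] = false;             // back off
                while (turn != me) { std::this_thread::yield(); }
                flag[me] = true;              // try again
            }
        }
    }

    void unlock(int me) {
        turn = 1 - me;                        // give priority to the other thread
        flag[me] = false;
    }
};

// Two threads increment a plain counter, guarded only by the Dekker lock.
long run_dekker(int iterations) {
    Dekker d;
    long counter = 0;
    auto work = [&](int me) {
        for (int i = 0; i < iterations; ++i) {
            d.lock(me);
            ++counter;                        // critical section
            d.unlock(me);
        }
    };
    std::thread t0(work, 0);
    std::thread t1(work, 1);
    t0.join();
    t1.join();
    return counter;
}
```

If mutual exclusion holds, run_dekker(n) returns exactly 2 * n.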

(Diagram: two processors, each with its own store buffer, sharing one Lock and Memory)

• Each proc has a FIFO store buffer (SB)
• Reads read from the local SB (read bypassing)
• MFENCE flushes the SB
• A LOCK’d instruction (eg: XCHG) acquires the Lock
• A write in the SB may reach memory at any time the Lock is not held

http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf

C++ Libraries (VS)
• STL
• C++ 11 conformant
• Support for new headers in VS vNext
• <atomic>, <filesystem>, <thread> (others)

• PPL
• Parallel Algorithms
• Task-based programming model
• Agents and Messaging - express dataflow pipelines
• Concurrency-safe containers

ALM (Application Lifecycle Management)

• New ALM features in vNext
– Lightweight Requirements
– Agile Planning Tools
– Stakeholder Feedback
– Context Switching
– Code Review
– Exploratory Testing

• Additional new C++ features
– 2010 features updated
– Architecture Tools: Dependency Diagrams, Architecture Explorer
– Unit Testing: Native Unit Test Framework
– Manage and run tests in VS and Test Manager

Microsoft C++ 2012: participate in C++ development user research and design research with the Microsoft Developer Division.
Sign up online at http://bit.ly/cppdeveloper

