What’s new in Visual C++ 11
Jim Hogg
Program Manager, Visual C++
Microsoft
Agenda
• Why C++?
• Performance : CPUs and GPUs
  • Baseline : Single-CPU / Multi-CPU   Demo
  • Vector CPU                          Demo
  • GPU : C++ AMP                       Demo
• ISO C++ 11
• ALM (Application Lifecycle Management)
Why C++? : Power & Performance
 • power: driver at all scales - on-die, mobile, desktop, datacenter
 • size: limits on processor resources - desktop, mobile
 • experiences: bigger experiences on smaller hardware; pushing envelope means every cycle matters

“The going word at Facebook is that ‘reasonably written C++ code just runs fast,’ which underscores the enormous effort spent at optimizing PHP and Java code. Paradoxically, C++ code is more difficult to write than in other languages, but efficient code is a lot easier.” - Andrei Alexandrescu
CPU vs. GPU today
     CPU                                 GPU

 •   Low memory bandwidth          •     High memory bandwidth
 •   Higher power consumption      •     Lower power consumption
 •   Medium level of parallelism   •     High level of parallelism
 •   Deep execution pipelines      •     Shallow execution pipelines
 •   Random accesses               •     Sequential accesses
 •   Supports general code         •     Supports data-parallel code
 •   Mainstream programming        •     Niche programming

                                       images source: AMD
NBody Simulation, CPU   (novec)
Vector Processors (CPU)
Vector Processors – How they work
SCALAR: ADD RAX, RBX
    RAX    1.10
    RBX    1.20
    RAX    2.30     (result)

    for (int i = 0; i < 1000; ++i) a[i] += b[i];

VECTOR: ADDPS XMM1, XMM2
    XMM1   1.10    2.10    3.10    4.10
    XMM2   1.20    2.20    3.20    4.20
    XMM1   2.30    4.30    6.30    8.30     (result)

    for (int i = 0; i < 1000; i += 4) a[i : i+3] += b[i : i+3];
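The `a[i : i+3]` notation above is slide shorthand, not C++. As a rough sketch of what the vector form does, the loop below is unrolled by hand so that each iteration handles four elements, followed by a scalar clean-up loop for lengths that are not a multiple of four. The function names `add_scalar` and `add_by_fours` are made up for illustration; a real vectorizer would emit a single SIMD instruction such as ADDPS in place of the four separate additions.

```cpp
#include <cstddef>

// Scalar form: one addition per iteration.
inline void add_scalar(float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] += b[i];
}

// Hand-unrolled form that mirrors a 4-wide vector add:
// four "lanes" per iteration, then a scalar loop for the remainder.
inline void add_by_fours(float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    for (; i < n; ++i) a[i] += b[i];   // leftover elements (n % 4 of them)
}
```

Both functions produce the same results; the second only makes the four-at-a-time structure visible.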
Vector Processors (CPU)
Compiler Enhancements
 • Auto-vectorizer
   • Automatically vectorize loops.
   • SIMD instructions.
   • ON by default

     for (i = 0; i < 1024; i++)
         a[i] = b[i] * c[i];

     is compiled as if written:

     for (i = 0; i < 1024; i += 4)
         a[i:i+3] = b[i:i+3] * c[i:i+3];

 • Auto-parallelization
   - Reorganizes the loop to run on multiple threads
   - /Qpar
   - Optional #pragma loop

     #pragma loop(hint_parallel(N))
     for (i = 0; i < 1024; i++)
         a[i] = b[i] * c[i];
Multi-Core Machines (w/ Vectorization)
NBody Simulation, CPU (Auto Vectorize + Parallelize)
Source Code
int A[20000];
int B[20000];
int C[20000];

for (i=0; i<20000; i++) {
   A[i] = B[i] + C[i];
}

Assembly of Body
$LL3@foo:
  mov   ecx, DWORD PTR ?C@@3PAHA[eax*4]
  mov   edx, DWORD PTR ?B@@3PAHA[eax*4]
  add   ecx, edx
  mov   DWORD PTR ?A@@3PAHA[eax*4], ecx
  inc   eax
  cmp   eax, esi
  jl    SHORT $LL3@foo

Dev11 /O2: 400% Speedup!!!

Transformation
int A[20000];
int B[20000];
int C[20000];

for (i=0; i<20000; i+=4) {
   A[i:i+3] = B[i:i+3] + C[i:i+3];
}

Assembly of Body
$LL3@foo:
  movdqu    xmm1, XMMWORD PTR ?C@@3PAHA[eax*4]
  movdqu    xmm0, XMMWORD PTR ?B@@3PAHA[eax*4]
  paddd     xmm1, xmm0
  movdqu    XMMWORD PTR ?A@@3PAHA[eax*4], xmm1
  add       eax, 4
  cmp       eax, ecx
  jl        SHORT $LL3@foo
for (k = 1; k <= M; k++) {
    . . .                       // leading statements (four ifs, one using xmb) are not legible in the slide
}

for (k = 1; k <= M; k++) {
    dc[k] = dc[k-1] + tpdd[k-1];
    if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
    if (dc[k] < -INFTY) dc[k] = -INFTY;
}

for (k = 1; k <= M; k++) {
    if (k < M) {
        ic[k] = mpp[k] + tpmi[k];
        if ((sc = ip[k] + tpii[k]) > ic[k]) ic[k] = sc;
        ic[k] += is[k];
        if (ic[k] < -INFTY) ic[k] = -INFTY;
    }
}
N-Body Simulation (GPU)
The Power of Heterogeneous Computing

 • 146X  Interactive visualization of volumetric white matter connectivity
 • 36X   Ionic placement for molecular dynamics simulation on GPU
 • 19X   Transcoding HD video stream to H.264
 • 17X   Simulation in Matlab using .mex file CUDA function
 • 100X  Astrophysics N-body simulation
 • 149X  Financial simulation of LIBOR model with swaptions
 • 47X   GLAME@lab: An M-script API for linear Algebra operations on GPU
 • 20X   Ultrasound medical imaging for cancer diagnostics
 • 24X   Highly optimized object oriented molecular dynamics
 • 30X   Cmatch exact string matching to find similar proteins and gene sequences
C++ AMP
 •   Part of Visual C++
 •   Visual Studio integration
 •   STL-like library for multidimensional data
 •   Builds on Direct3D
                         performance
                         productivity
                          portability
Hello World: Array Addition

Plain C++:

void AddArrays(int* a, int* b, int* c, int N)
{
    for (int i = 0; i < N; ++i)
    {
        a[i] = b[i] + c[i];
    }
}

C++ AMP:

#include <amp.h>
using namespace concurrency;

void AddArrays(int* a, int* b, int* c, int N)
{
    array_view<int,1> va(N, a);
    array_view<int,1> vb(N, b);
    array_view<int,1> vc(N, c);

    parallel_for_each(
        va.grid,
        [=](index<1> i) restrict(direct3d)
        {
            va[i] = vb[i] + vc[i];
        }
    );
}
Basic Elements of C++ AMP coding

void AddArrays(int* a, int* b, int* c, int N)
{
    array_view<int,1> va(N, a);
    array_view<int,1> vb(N, b);
    array_view<int,1> vc(N, c);

    parallel_for_each(
        va.grid,
        [=](index<1> i) restrict(direct3d)
        {
            va[i] = vb[i] + vc[i];
        }
    );
}

 • array_view: wraps the data to operate on the accelerator
 • parallel_for_each: execute the lambda on the accelerator once per thread
 • grid: the number and shape of threads to execute the lambda
 • index: the thread ID that is running the lambda, used to index into data
 • restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
 • array_view variables captured and associated data copied to accelerator (on demand)
Achieving maximum performance gains
 • Schedule threads in tiles
   • Avoid thread index remapping
   • Gain ability to use tile static memory

   array_view<int,2> data(8, 6, p_my_data);
   parallel_for_each(
       data.grid.tile<2,2>(),
       [=] (tiled_index<2,2> t_idx)… { … });

 [Figure: an 8-row by 6-column grid of threads (rows 0-7, columns 0-5), shown partitioned once with g.tile<4,3>() and once with g.tile<2,2>()]
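To make the tiling arithmetic concrete, here is a small sketch in standard C++ (not C++ AMP) of how a thread's position in the 8 x 6 grid splits into a tile position and a position inside that tile. The types `Index2` and `TiledIndex2` and the helpers `tile_of` and `tile_count` are hypothetical names invented for this illustration; the sketch also assumes the grid dimensions divide evenly by the tile dimensions, as they do for both tilings on the slide.

```cpp
// Hypothetical helper types for illustration; these are not C++ AMP classes.
struct Index2 { int row; int col; };

struct TiledIndex2 {
    Index2 global;  // position in the whole grid
    Index2 tile;    // which tile the thread belongs to
    Index2 local;   // position inside that tile
};

// Splits a global index into tile and local parts for a TileRows x TileCols tiling.
template <int TileRows, int TileCols>
TiledIndex2 tile_of(Index2 global) {
    TiledIndex2 t;
    t.global = global;
    t.tile   = Index2{ global.row / TileRows, global.col / TileCols };
    t.local  = Index2{ global.row % TileRows, global.col % TileCols };
    return t;
}

// Number of tiles, assuming rows and cols are exact multiples of the tile shape.
template <int TileRows, int TileCols>
int tile_count(int rows, int cols) {
    return (rows / TileRows) * (cols / TileCols);
}
```

With the slide's 8 x 6 grid, a 2 x 2 tiling gives 12 tiles and a 4 x 3 tiling gives 4. Threads that share a tile are the ones that can share tile static memory.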
C++ AMP at a Glance
 •   restrict(direct3d, cpu)    •   tile_static storage class
 •   parallel_for_each          •   class tiled_grid< , , >
 •   class array<T,N>           •   class tiled_index< , , >
 •   class array_view<T,N>      •   class tile_barrier
 •   class index<N>
 •   class extent<N>, grid<N>
 •   class accelerator
 •   class accelerator_view
Visual Studio/C++ AMP
 •   Organize
 •   Edit
 •   Design
 •   Build
 •   Browse
 •   Debug
 •   Profile
C++ AMP Parallel Debugger
  • Well known Visual Studio debugging features
    • Launch, Attach, Break, Stepping, Breakpoints, DataTips
    • Toolwindows
      • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory,
        Registers, Locals, Watch, Quick Watch
  • New features (for both CPU and GPU)
    • Parallel Stacks window, Parallel Watch window, Barrier
  • New GPU-specific
    • Emulator, GPU Threads window, race detection
Summary
• Democratization of parallel hardware programmability
  • Performance for the mainstream
  • High-level abstractions in C++ (not C)
  • State-of-the-art Visual Studio IDE
  • Hardware abstraction platform

• C++ AMP now published as open specification
• http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf
Modern C++: Clean, Safe and Fast

Then
circle* p = new circle( 42 );
vector<shape*> v = load_shapes();
for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
  if( *i && **i == *p )
      cout << **i << " is a match\n";
}
for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i )
{
   delete *i;
}
delete p;

 • not exception-safe
 • missing try/catch, __try/__finally

Now
auto p = make_shared<circle>( 42 );
vector<shared_ptr<shape>> vw = load_shapes();
for_each( begin(vw), end(vw), [&]( shared_ptr<shape>& s ) {
    if( s && *s == *p )
        cout << *s << " is a match\n";
} );

 • auto type deduction
 • T* becomes shared_ptr<T>
 • new becomes make_shared
 • for/while/do become std:: algorithms
 • [&] lambda functions
 • no need for "delete"
 • automatic lifetime management
 • exception-safe
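A compilable sketch of the "Now" style follows. The slide does not define `shape`, `circle` or `load_shapes()`, so this version assumes a minimal `circle` struct with an equality operator and a stand-in loader named `load_circles()`; all of these are illustrative assumptions, not the presenter's classes.

```cpp
#include <algorithm>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical stand-in for the slide's shape/circle classes.
struct circle {
    int radius;
    explicit circle(int r) : radius(r) {}
};

inline bool operator==(const circle& a, const circle& b) {
    return a.radius == b.radius;
}

// Hypothetical stand-in for load_shapes(); one entry is deliberately empty.
inline std::vector<std::shared_ptr<circle>> load_circles() {
    std::vector<std::shared_ptr<circle>> v;
    v.push_back(std::make_shared<circle>(7));
    v.push_back(std::shared_ptr<circle>());      // null entry
    v.push_back(std::make_shared<circle>(42));
    v.push_back(std::make_shared<circle>(42));
    return v;
}

// Counts the non-null circles equal to *p. Nothing is deleted by hand:
// every circle is released when its last shared_ptr goes away.
inline int count_matches(const std::vector<std::shared_ptr<circle>>& v,
                         const std::shared_ptr<circle>& p) {
    int matches = 0;
    std::for_each(std::begin(v), std::end(v),
                  [&](const std::shared_ptr<circle>& s) {
                      if (s && *s == *p) ++matches;
                  });
    return matches;
}
```

Because ownership lives in `shared_ptr`, an exception thrown inside the loop would not leak any of the circles, which is the exception-safety point the slide makes.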
C++ 11 Language Features in Visual Studio
    C++11 Core Language Features        VC10      VC11
    rvalue references                    v2.0     v2.1*
    auto                                 v1.0      v1.0
    decltype                             v1.0     v1.1**
    static_assert                        Yes       Yes
    trailing return types                Yes       Yes
    lambdas                              v1.0      v1.1
    nullptr                              Yes       Yes
    strongly typed enums                Partial    Yes
    forward declared enums               No        Yes
    standard-layout and trivial types    No        Yes
    atomics                              No        Yes
    strong compare and exchange          No        Yes
    bidirectional fences                 No        Yes
    data-dependency ordering             No        Yes
rvalue refs
struct Car {
   string make;          // eg “Volvo”
   int when;             // last-serviced – eg 201103 => March 2011
};


void workOnClone(Car c);      // work on a clone of my car - not returned

void inspect(const Car& c);   // inspect, but don't alter, my car

void fix(Car& c);             // fix and return my car

void replace(Car&& c);        // take my car and cannibalize it - I won't be using it again
                              // note that && is not a ref-to-ref (unlike **)
                              // enables "move semantics" and "perfect forwarding"
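The sketch below fills in bodies for two of these declarations to show what "cannibalize" can mean in practice. The function is named `cannibalize` here rather than `replace`, and it takes an extra `scrapyard` parameter; both are assumptions made for the example. The service date assigned in `fix` is likewise arbitrary.

```cpp
#include <string>
#include <utility>
#include <vector>

struct Car {
    std::string make;   // eg "Volvo"
    int when;           // last-serviced, eg 201103 => March 2011
};

// Lvalue reference: the caller keeps the car and sees the change.
inline void fix(Car& c) {
    c.when = 201203;    // arbitrary new service date for illustration
}

// Rvalue reference: the caller has promised not to use the car again,
// so its string can be moved out instead of copied.
inline void cannibalize(Car&& c, std::vector<std::string>& scrapyard) {
    scrapyard.push_back(std::move(c.make));
}
```

A named car must be passed with `std::move(mine)`, which makes the hand-over visible at the call site; a temporary such as `Car{"Saab", 200001}` binds to `Car&&` directly.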
auto
int    n = 42;
double pi = 3.14159;
auto x = n * pi;                            // will infer type of x is double

for (std::map<string, vector<double>>::const_iterator iter = m.cbegin(); iter != m.cend(); ++iter)
for (auto                                         iter = m.cbegin(); iter != m.cend(); ++iter)

const auto * p = new MyClass;              // “add back” qualifiers to auto’s inferred type
const auto & r = s;                        // “add back” qualifiers to auto’s inferred type

auto   a1 = new auto(42);                  // infers int*
auto * a2 = new auto(42);                  // beware: also infers int*


                Notes:   static type inference!
                         like C# “var”
                         may break old code: old auto specifies allocation within current stack frame
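Since `auto` is resolved at compile time, the deductions listed above can be checked with `static_assert` and `std::is_same`. The following sketch does that; the typedef `Table` and the function names `sum_all` and `deduction_checks` are invented for the example.

```cpp
#include <map>
#include <string>
#include <type_traits>
#include <vector>

typedef std::map<std::string, std::vector<double>> Table;

// auto replaces the long const_iterator type name.
inline double sum_all(const Table& m) {
    double total = 0.0;
    for (auto iter = m.cbegin(); iter != m.cend(); ++iter) {
        static_assert(std::is_same<decltype(iter), Table::const_iterator>::value,
                      "auto deduces the full iterator type");
        for (auto value : iter->second) total += value;
    }
    return total;
}

// Each static_assert confirms one deduction at compile time.
inline int deduction_checks() {
    int n = 42;
    double pi = 3.14159;

    auto x = n * pi;
    static_assert(std::is_same<decltype(x), double>::value, "int * double is double");

    const auto& r = n;
    static_assert(std::is_same<decltype(r), const int&>::value, "qualifiers added back");

    auto  a1 = new auto(42);
    auto* a2 = new auto(42);
    static_assert(std::is_same<decltype(a1), int*>::value, "infers int*");
    static_assert(std::is_same<decltype(a2), int*>::value, "also int*");

    (void)x;
    (void)r;
    const int result = *a1 + *a2;
    delete a1;
    delete a2;
    return result;
}
```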
decltype
decltype(new C) c = new C;                // c is a C*
                                          // Note: first “new C” is not executed

std::vector<int>::const_iterator iter1;   // a long type name

decltype(iter1) iter2;                    // iter2 has same type as iter1
static_assert
pre-processor-time
#if VERSION < 8
 #error "Need version 8 or higher"
#endif

run-time
bool done(float g1, float g2, float tol) {
  assert (tol < 1.0e-3);
  . . .
}

compile-time
static_assert (FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

template<class T> struct S {
  static_assert(sizeof(T) < sizeof(int), "T is too big");
  static_assert(std::is_unsigned<T>::value, "S needs an unsigned type");
};
Trailing-Return-Type

template<class A, class B> ??? adder(A &a, B &b) { return a + b; }                       // no!

template<class A, class B> decltype(a + b) adder(A &a, B &b) { return a + b; }           // no!

template<class A, class B> auto adder(A &a, B &b) -> decltype(a + b) { return a + b; }   // yes!
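The second form fails because `a` and `b` are not yet in scope where the return type is written; the trailing form places `decltype(a + b)` after the parameter list, where they are. A compilable sketch follows. It takes the parameters by `const` reference instead of the slide's non-const references, an assumption made so that literals and temporaries can be passed.

```cpp
#include <string>
#include <type_traits>

// Trailing return type: a and b are in scope by the time decltype is evaluated.
template <class A, class B>
auto adder(const A& a, const B& b) -> decltype(a + b) {
    return a + b;
}
```

For example, `adder(1, 2.5)` has return type `double`, and `adder(std::string("ab"), "cd")` returns a `std::string`.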
lambdas – functions with no name
[ ] ( ) -> int { return 42; } ;                                             // no arguments
[ ] (int n) -> int { return n * n; } ;                                      // one argument
[ ] (int a, int b) -> int { return a + b; } ;                               // two arguments

for_each(v.begin(), v.end(), [ ] (int n) { cout << n << " "; });            // one-liner

float f1 = integrate ( golden,                                     0.0, 1.0 );
float f2 = integrate ( [ ] (float x ) { return x * x + x - 1; },   0.0, 1.0 );

[ ] { cout << "hi"; }                                              // can omit ( ) if no parameters
                                                                   // can omit -> return-type if inferable

[ capture-clause ] ( parameter-list ) -> return-type { body }      // grammar
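The examples above all use an empty capture clause. The sketch below adds the two common non-empty forms, `[&]` and `[=]`, and supplies one possible `integrate`. The slide does not show how `integrate` is defined, so the midpoint-rule version here, along with the names `sum_of` and `make_scaler`, is purely an assumption for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical integrate(): midpoint rule over [lo, hi] with a fixed step count.
template <class F>
double integrate(F f, double lo, double hi, int steps = 1000) {
    const double h = (hi - lo) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) sum += f(lo + (i + 0.5) * h);
    return sum * h;
}

// [&] captures total by reference, so the lambda can update it.
inline int sum_of(const std::vector<int>& v) {
    int total = 0;
    std::for_each(v.begin(), v.end(), [&](int n) { total += n; });
    return total;
}

// [=] copies factor into the lambda, so the copy stays valid after this function returns.
inline std::function<int(int)> make_scaler(int factor) {
    return [=](int n) { return n * factor; };
}
```

Capturing by reference in `make_scaler` would leave the returned lambda holding a dangling reference to `factor`, which is why the by-value form is used there.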
Strongly-Typed Enums
Illegal - members must be globally unique
enum Heights {SHORT, TALL};                     // ok
enum Widths {BYTE, SHORT, INT, LONG};           // clash: SHORT is already declared

enum members are just integers
enum Colors {RED, GREEN, BLUE};
if (GREEN == 1) cout << "GREEN == 1";           // yes!
enum Parts {ENGINE, BRAKE, CLUTCH};
if (GREEN == BRAKE) cout << "GREEN == BRAKE";   // yes!

Use enum class
enum class Heights {SHORT, TALL};
enum class Widths {BYTE, SHORT, INT, LONG};     // eg:    Widths::SHORT
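A short sketch of what `enum class` buys: both enumerations can now contain `SHORT`, and neither converts to `int` without a cast. The `: unsigned char` underlying type on `Widths` and the byte sizes returned by the hypothetical `bytes_for` helper are assumptions added for the example.

```cpp
#include <type_traits>

enum class Heights { SHORT, TALL };
enum class Widths : unsigned char { BYTE, SHORT, INT, LONG };   // no clash with Heights::SHORT

// Scoped enumerators do not convert to int implicitly.
static_assert(!std::is_convertible<Heights, int>::value,
              "enum class values need an explicit cast");

// Hypothetical helper: the sizes are illustrative, not taken from the slide.
inline int bytes_for(Widths w) {
    switch (w) {
        case Widths::BYTE:  return 1;
        case Widths::SHORT: return 2;
        case Widths::INT:   return 4;
        case Widths::LONG:  return 8;
    }
    return 0;
}
```

A comparison such as `Heights::SHORT == Widths::SHORT` does not compile, which rules out the `GREEN == BRAKE` accident shown earlier.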
Forward-Declared Enum Classes
enum class Colors : unsigned char;     // forward declaration (underlying type must match the definition)


void fun(Colors c);                    // use

. . .

enum class Colors : unsigned char {RED = 3, GREEN, BLUE = 7};
nullptr
                                 // the NULL hack:
int* p1 = 0;                     // value of 0 is ‘special’
int* p2 = 42;                    // illegal

void f (int n) { cout << n; };
f(0);                            // works

void f (int* p) { cout << p; };
f(0);                           // works

void f (int n) { cout << n; }
void f (int* p) { cout << p; };
f(0);                           // which one?

f(nullptr);                      // calls f(int*)

                                                decltype(nullptr) == nullptr_t
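The "which one?" question can be answered by compiling both overloads together. In this sketch the overloads return a label instead of printing, a change made only so the result is easy to check.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

inline std::string f(int)  { return "f(int)"; }
inline std::string f(int*) { return "f(int*)"; }

// nullptr has its own type, distinct from every integer and pointer type.
static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "decltype(nullptr) is std::nullptr_t");
```

With both overloads visible, `f(0)` selects `f(int)` because the literal `0` is an `int` first, while `f(nullptr)` selects `f(int*)`.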
Memory Model – Scary Terminology
      •   Dekker’s algorithm
      •   Double-checked locking
      •   Weak memory consistency
      •   Atomics
      •   Memory fences/barriers
      •   Volatile
      •   Sequential consistency
      •   Acquire/Release semantics
      •   Axiomatic definition & litmus tests
Dekker’s Algorithm
 flag[0] := true            flag[1] := true
 while flag[1] = true {     while flag[0] = true {
    if turn ≠ 0 {              if turn ≠ 1 {
       flag[0] := false           flag[1] := false
       while turn ≠ 0 { }         while turn ≠ 1 { }
       flag[0] := true            flag[1] := true
    }                          }
 }                          }
 // critical section        // critical section
 turn := 1                  turn := 0
 flag[0] := false           flag[1] := false
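The pseudocode above is only correct if every thread observes the writes to `flag` and `turn` in one agreed order. One way to express that in C++11 is with `std::atomic`, whose default memory order is sequentially consistent. The sketch below is a direct transcription under that assumption; the `DekkerLock` type and its `lock`/`unlock` members are names invented here, and `me` must be 0 or 1.

```cpp
#include <atomic>

// Two-thread mutual exclusion after Dekker; 'me' is 0 or 1.
// Every load and store uses the default memory_order_seq_cst.
struct DekkerLock {
    std::atomic<bool> flag[2];
    std::atomic<int>  turn;

    DekkerLock() : turn(0) {
        flag[0].store(false);
        flag[1].store(false);
    }

    void lock(int me) {
        const int other = 1 - me;
        flag[me].store(true);
        while (flag[other].load()) {
            if (turn.load() != me) {
                flag[me].store(false);            // back off while it is the other's turn
                while (turn.load() != me) { }     // spin
                flag[me].store(true);
            }
        }
    }

    void unlock(int me) {
        turn.store(1 - me);                       // hand the turn to the other thread
        flag[me].store(false);
    }
};
```

With plain `bool` and `int` members instead of atomics, the compiler or the hardware could let the read of the other thread's flag complete before this thread's own flag store becomes visible, allowing both threads into the critical section.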
Each proc has FIFO store buffer
 [Diagram: two Procs, each with its own Store buffer, in front of a shared Memory with a single Lock]
 • Reads read from local SB
 • Read bypassing
 • MFENCE flushes SB
 • LOCK’d instruction acquires Lock (eg: XCHG)
 • Write to SB may reach memory at any time Lock is not held

 http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf
C++ Libraries (VS)
 • STL
   • C++ 11 conformant
   • Support for new headers in VS vNext
     • <atomic>, <filesystem>, <thread> (others)


 • PPL
   • Parallel Algorithms
   • Task-based programming model
   • Agents and Messaging - express dataflow pipelines
   • Concurrency-safe containers
ALM (Application Lifecycle Management)

 • New ALM features in vNext
   • Lightweight Requirements
   • Agile Planning Tools
   • Stakeholder Feedback
   • Context Switching
   • Code Review
   • Exploratory Testing

 • Additional new C++ features
   • 2010 features Updated
   • Architecture Tools
     • Dependency Diagrams
     • Architecture Explorer
   • Unit Testing
     • Native Unit Test Framework
     • Manage and Run tests in VS and Test Manager
Q&A
Microsoft C++ 2012
Participate in C++ development user research
Microsoft Developer Division Design Research
Sign up online at http://bit.ly/cppdeveloper
What&rsquo;s new in Visual C++

  • 1.
    What’s new inVisual C++ 11 Jim Hogg Program Manager Visual C++ Microsoft
  • 2.
    Agenda • Why C++? •Performance : CPUs and GPUs • Baseline : Single-CPU / Multi-CPU Demo • Vector CPU Demo • GPU : C++ AMP Demo • ISO C++ 11 • ALM (Application Lifetime Management)
  • 3.
    Why C++? :Power & Performance power: driver at all “The going word at Facebook is that scales – on-die, mobile, „reasonably written C++ code just desktop, datacenter runs fast,‟ which underscores the size: limits on enormous effort spent at optimizing processor PHP and Java code. Paradoxically, C++ resources code is more difficult to write than in – desktop, mobile other languages, but experiences: bigger experiences on efficient code is a lot easier.” – smaller hardware; Andrei Alexandrescu pushing envelope means every cycle matters
  • 4.
    Agenda • Why C++? •Performance : CPUs and GPUs • Baseline : Single-CPU / Multi-CPU Demo • Vector CPU Demo • GPU : C++ AMP Demo • ISO C++ 11 • ALM (Application Lifetime Management)
  • 5.
    CPU v.s. GPUtoday CPU GPU • Low memory bandwidth • High memory bandwidth • Higher power consumption • Lower power consumption • Medium level of parallelism • High level of parallelism • Deep execution pipelines • Shallow execution pipelines • Random accesses • Sequential accesses • Supports general code • Supports data-parallel code • Mainstream programming • Niche programming images source: AMD
  • 6.
  • 7.
  • 8.
    Vector Processors –How they work RAX 1.10 SCALAR ADD RAX, RBX RBX 1.20 RAX 2.30 for (int i = 0; i < 1000; ++i) a[i] += b[i ] XMM1 1.10 2.10 3.10 4.10 VECTOR ADDPS XMM1, XMM2 1.20 2.20 3.20 4.20 XMM2 XMM1 2.30 4.30 6.30 8.30 for (int i = 0; i < 1000; i += 4) a[i : i+3] += b[i : i+3]
  • 9.
  • 10.
    Compiler Enhancements •Auto-vectorizer • Auto-parallelization • Automatically vectorize loops. – Reorganizes the loop to run • SIMD instructions. on multiple threads • ON by default – /Qpar – Optional #pragma loop for (i = 0; i < 1024; i++) a[i] = b[i] * c[i]; #pragma loop(hint_parallel(N)) for (i = 0; i < 1024; i++) a[i] = b[i] * c[i]; for (i = 0; i < 1024; i += 4) a[i:i+3] = b[i:i+3] * c[i:i+3];
  • 11.
  • 12.
    NBody Simulation, CPU(Auto Vectorize + Parallelize)
  • 13.
    Source Code Assembly of Body int A[20000]; $LL3@foo: mov ecx, DWORD PTR ?C@@3PAHA[eax*4] int B[20000]; mov edx, DWORD PTR ?B@@3PAHA[eax*4] int C[20000]; add ecx, edx mov DWORD PTR ?A@@3PAHA[eax*4], ecx for (i=0; i<20000; i++) { inc eax A[i] = B[i] + C[i]; cmp eax, esi jl SHORT $LL3@foo } Dev11 /O2 400% Speedup!!! Transformation Assembly of Body int A[20000]; $LL3@foo: movdqu xmm1, XMMWORD PTR ?C@@3PAHA[eax*4] int B[20000]; movdqu xmm0, XMMWORD PTR ?B@@3PAHA[eax*4] int C[20000]; paddd xmm1, xmm0 movdqu XMMWORD PTR ?A@@3PAHA[eax*4], xmm1 for (i=0; i<20000; i+=4) { add eax, 4 A[i:i+3] = B[i:i+3] + C[i:i+3]; cmp eax, ecx jl SHORT $LL3@foo }
  • 14.
    for (k =1; k <= M; k++) { if if if xmb if for dc[k] 1; dc[k-1] +k++) { (k = = k <= M; tpdd[k-1]; if ((sc = = dc[k-1] + tpdd[k-1]; dc[k]) dc[k] = sc; dc[k] mc[k-1] + tpmd[k-1]) > if (dc[k] <= -INFTY) dc[k] = -INFTY; dc[k]) dc[k] = sc; if ((sc mc[k-1] + tpmd[k-1]) > if (dc[k] < -INFTY) dc[k] = -INFTY; for if (k < M) { M; k++) { { for (k = 1; k < M; k++) (k = 1; k <= if (k < M) =mpp[k] ++tpmi[k]; ic[k] = { mpp[k] ic[k] tpmi[k]; ic[k] = mpp[k] + tpii[k]) ((sc ip[k] tpmi[k]; ifif((sc ==ip[k] ++tpii[k]) >>ic[k]) ic[k] ==sc; ic[k]) ic[k] sc; ic[k] += = is[k]; + tpii[k]) > ic[k]) ic[k] = sc; ic[k] +=is[k]; if ((sc ip[k] ic[k] += is[k]; ifif(ic[k] <<-INFTY) ic[k] ==-INFTY; (ic[k] -INFTY) ic[k] -INFTY; } if (ic[k] < -INFTY) ic[k] = -INFTY; }} } }
  • 15.
    Agenda • Why C++? •Performance : CPUs and GPUs • Baseline : Single-CPU / Multi-CPU Demo • Vector CPU Demo • GPU : C++ AMP Demo • ISO C++ 11 • ALM (Application Lifetime Management)
  • 16.
  • 17.
    The Power ofHeterogeneous Computing 146X 36X 19X 17X 100X Interactive Ionic placement Transcoding HD Simulation in Astrophysics N- visualization of for molecular video stream to Matlab using .mex body simulation volumetric white dynamics H.264 file CUDA function matter simulation on GPU connectivity 149X 47X 20X 24X 30X sourc Financial Ultrasound Highly optimized e GLAME@lab: An Cmatch exact string simulation of M-script API for medical imaging object oriented matching to find LIBOR model with linear Algebra for cancer molecular similar proteins and swaptions operations on diagnostics dynamics gene sequences GPU
  • 18.
    C++ AMP
    • Part of Visual C++
    • Visual Studio integration
    • STL-like library for multidimensional data
    • Builds on Direct3D
    performance, productivity, portability
  • 19.
    Hello World: Array Addition

    CPU version
        void AddArrays(int n, int * pA, int * pB, int * pC)
        {
            for (int i=0; i<n; i++)
            {
                pC[i] = pA[i] + pB[i];
            }
        }

    C++ AMP version
        #include <amp.h>
        using namespace concurrency;

        void AddArrays(int* a, int* b, int* c, int N)
        {
            array_view<int,1> va(N, a);
            array_view<int,1> vb(N, b);
            array_view<int,1> vc(N, c);
            parallel_for_each(
                va.grid,
                [=](index<1> i) restrict(direct3d)
                {
                    va[i] = vb[i] + vc[i];
                }
            );
        }
  • 20.
    Basic Elements of C++ AMP coding

        void AddArrays(int* a, int* b, int* c, int N)
        {
            array_view<int,1> va(N, a);
            array_view<int,1> vb(N, b);
            array_view<int,1> vc(N, c);
            parallel_for_each(
                va.grid,
                [=](index<1> i) restrict(direct3d)
                {
                    va[i] = vb[i] + vc[i];
                }
            );
        }

    • array_view: wraps the data to operate on the accelerator
    • restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)
    • parallel_for_each: execute the lambda on the accelerator once per thread
    • grid: the number and shape of threads to execute the lambda
    • index: the thread ID that is running the lambda, used to index into data
    • array_view variables captured and associated data copied to accelerator (on demand)
  • 21.
    Achieving maximum performance gains
    • Schedule threads in tiles
    • Avoid thread index remapping
    • Gain ability to use tile static memory

        array_view<int,2> data(8, 6, p_my_data);
        parallel_for_each(
            data.grid.tile<2,2>(),
            [=] (tiled_index<2,2> t_idx)… { … });

    [Figure: the 8 x 6 grid of threads (rows 0-7, columns 0-5) shown tiled two ways, as g.tile<4,3>() and as g.tile<2,2>()]
  • 22.
    C++ AMP at a Glance
    • restrict(direct3d, cpu)
    • parallel_for_each
    • class array<T,N>
    • class array_view<T,N>
    • class index<N>
    • class extent<N>, grid<N>
    • class accelerator
    • class accelerator_view
    • tile_static storage class
    • class tiled_grid< , , >
    • class tiled_index< , , >
    • class tile_barrier
  • 23.
    Visual Studio/C++ AMP
    • Organize
    • Edit
    • Design
    • Build
    • Browse
    • Debug
    • Profile
  • 24.
    C++ AMP Parallel Debugger
    • Well known Visual Studio debugging features
      • Launch, Attach, Break, Stepping, Breakpoints, DataTips
      • Tool windows
        • Processes, Debug Output, Modules, Disassembly, Call Stack, Memory, Registers, Locals, Watch, Quick Watch
    • New features (for both CPU and GPU)
      • Parallel Stacks window, Parallel Watch window, Barrier
    • New GPU-specific
      • Emulator, GPU Threads window, race detection
  • 25.
    Summary
    • Democratization of parallel hardware programmability
      • Performance for the mainstream
      • High-level abstractions in C++ (not C)
      • State-of-the-art Visual Studio IDE
      • Hardware abstraction platform
    • C++ AMP now published as open specification
      • http://download.microsoft.com/download/4/0/E/40EA02D8-23A7-4BD2-AD3A-0BFFFB640F28/CppAMPLanguageAndProgrammingModel.pdf
  • 26.
    Agenda
    • Why C++?
    • Performance : CPUs and GPUs
      • Baseline : Single-CPU / Multi-CPU   Demo
      • Vector CPU                          Demo
      • GPU : C++ AMP                       Demo
    • ISO C++ 11
    • ALM (Application Lifetime Management)
  • 27.
    Modern C++: Clean, Safe and Fast

    Then
        circle* p = new circle( 42 );
        vector<shape*> v = load_shapes();
        for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
            if( *i && **i == *p )
                cout << **i << " is a match\n";
        }
        for( vector<shape*>::iterator i = v.begin(); i != v.end(); ++i ) {
            delete *i;
        }
        delete p;

    Now
        auto p = make_shared<circle>( 42 );
        vector<shared_ptr<shape>> vw = load_shapes();
        for_each( begin(vw), end(vw), [&]( shared_ptr<shape>& s ) {
            if( s && *s == *p )
                cout << *s << " is a match\n";
        } );

    Callouts
    • T* becomes shared_ptr<T>
    • new becomes make_shared
    • auto type deduction
    • for/while/do becomes std:: algorithms with [&] lambda functions
    • Then: not exception-safe; missing try/catch, __try/__finally
    • Now: no need for "delete"; automatic lifetime management; exception-safe
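
    The slide leaves shape, circle, operator== and load_shapes() undefined. A minimal compilable sketch of the "Now" column follows; every type and helper in it is a hypothetical stand-in, and comparing shapes by size is an assumption made only so the example has something to match on:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the slide's shape and circle types.
struct shape {
    virtual ~shape() {}
    virtual int size() const = 0;
};
struct circle : shape {
    explicit circle(int r) : radius(r) {}
    int size() const override { return radius; }
    int radius;
};

// Assumption: two shapes are "equal" when their sizes are equal.
bool operator==(const shape& a, const shape& b) { return a.size() == b.size(); }

// Hypothetical replacement for the slide's load_shapes().
std::vector<std::shared_ptr<shape>> load_shapes() {
    std::vector<std::shared_ptr<shape>> v;
    v.push_back(std::make_shared<circle>(7));
    v.push_back(std::make_shared<circle>(42));
    v.push_back(nullptr);                       // exercises the null check
    v.push_back(std::make_shared<circle>(42));
    return v;
}

// Counts the loaded shapes equal to a circle of the given radius.
// No delete anywhere: every object is released when its last
// shared_ptr goes out of scope, even if an exception is thrown.
int count_matches(int radius) {
    auto p = std::make_shared<circle>(radius);
    auto shapes = load_shapes();
    int matches = 0;
    std::for_each(begin(shapes), end(shapes),
                  [&](const std::shared_ptr<shape>& s) {
                      if (s && *s == *p) ++matches;
                  });
    return matches;
}
```

    Counting instead of printing keeps the sketch easy to check; swapping the lambda body for the slide's cout statement gives the original behaviour.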
  • 28.
    C++ 11 Language Features in Visual Studio

    C++11 Core Language Features           VC10      VC11
    rvalue references                      v2.0      v2.1*
    auto                                   v1.0      v1.0
    decltype                               v1.0      v1.1**
    static_assert                          Yes       Yes
    trailing return types                  Yes       Yes
    lambdas                                v1.0      v1.1
    nullptr                                Yes       Yes
    strongly typed enums                   Partial   Yes
    forward declared enums                 No        Yes
    standard-layout and trivial types      No        Yes
    atomics                                No        Yes
    strong compare and exchange            No        Yes
    bidirectional fences                   No        Yes
    data-dependency ordering               No        Yes
  • 29.
    rvalue refs

    struct Car {
        string make;    // eg "Volvo"
        int when;       // last-serviced, eg 201103 => March 2011
    };

    void workOnClone(Car c);       // work on a clone of my car; not returned
    void inspect(const Car& c);    // inspect, but don't alter, my car
    void fix(Car& c);              // fix and return my car
    void replace(Car&& c);         // take my car and cannibalize it; I won't be using it again

    // note that && is not a ref-to-ref (unlike **)
    // enables "move semantics" and "perfect forwarding"
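
    The four signatures can be exercised in a short sketch. The function bodies, the return types and the extra today parameter are assumptions added for illustration; only the parameter-passing styles come from the slide:

```cpp
#include <cassert>
#include <string>
#include <utility>

struct Car {
    std::string make;   // e.g. "Volvo"
    int when;           // last serviced, e.g. 201103 => March 2011
};

// By value: the callee gets its own copy; the caller's car is untouched.
int workOnClone(Car c) { c.when = 0; return c.when; }

// By const reference: read-only access to the caller's car.
int inspect(const Car& c) { return c.when; }

// By lvalue reference: the callee modifies the caller's car in place.
void fix(Car& c, int today) { c.when = today; }

// By rvalue reference: the caller gives the car up, so its contents
// (here the std::string buffer) can be moved out rather than copied.
Car replace(Car&& c) {
    Car mine = std::move(c);
    return mine;
}

// Hypothetical driver tying the four calls together.
bool demo() {
    Car volvo{"Volvo", 201103};
    workOnClone(volvo);
    bool unchanged = inspect(volvo) == 201103;
    fix(volvo, 201202);
    bool fixed = inspect(volvo) == 201202;
    Car taken = replace(std::move(volvo));   // volvo must not be relied on afterwards
    return unchanged && fixed && taken.make == "Volvo" && taken.when == 201202;
}
```

    After the call to replace, volvo is left in a valid but unspecified state, which is why the sketch only inspects the returned object.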
  • 30.
    auto

    int n = 42;
    double pi = 3.14159;
    auto x = n * pi;    // will infer type of x is double

    for (std::map<string, vector<double>>::const_iterator iter = m.cbegin(); iter != m.cend(); ++iter)
    for (auto iter = m.cbegin(); iter != m.cend(); ++iter)

    const auto * p = new MyClass;    // "add back" qualifiers to auto's inferred type
    const auto & r = s;              // "add back" qualifiers to auto's inferred type
    auto a1 = new auto(42);          // infers int*
    auto * a2 = new auto(42);        // beware: also infers int*

    Notes:
    • static type inference! like C# "var"
    • may break old code: old auto specifies allocation within current stack frame
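
    The deductions claimed on this slide can be checked at compile time. The sketch below is one way to do that; total and deduction_checks are hypothetical names, and std::is_same is used only to state what auto inferred:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <type_traits>
#include <vector>

// Sums every value in the map; auto replaces the long const_iterator type.
double total(const std::map<std::string, std::vector<double>>& m) {
    double sum = 0.0;
    for (auto iter = m.cbegin(); iter != m.cend(); ++iter)
        for (auto v : iter->second) sum += v;
    return sum;
}

// Verifies the slide's claims about what auto deduces.
bool deduction_checks() {
    int n = 42;
    double pi = 3.14159;
    auto x = n * pi;
    static_assert(std::is_same<decltype(x), double>::value, "int * double is double");

    auto a1 = new auto(42);
    auto* a2 = new auto(42);
    static_assert(std::is_same<decltype(a1), int*>::value, "auto deduces int*");
    static_assert(std::is_same<decltype(a2), int*>::value, "auto* also deduces int*");

    bool ok = (*a1 == 42) && (*a2 == 42) && (x > 131.0);
    delete a1;
    delete a2;
    return ok;
}
```

    Because the checks are static_assert, a wrong deduction would stop compilation rather than fail at run time, which matches the slide's note that this is static type inference.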
  • 31.
    decltype

    decltype(new C) c = new C;    // c is a C*
                                  // Note: first "new C" is not executed

    std::vector<int>::const_iterator iter1;    // a long type name
    decltype(iter1) iter2;                     // iter2 has same type as iter1
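
    The note that the first "new C" is not executed can be demonstrated with a constructor counter. In this sketch the counter and the function names are hypothetical additions:

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical class that counts how many times it is constructed.
struct C {
    C() { ++constructed; }
    static int constructed;
};
int C::constructed = 0;

// Returns how many C objects were constructed while naming the type
// of "new C" through decltype. The operand is never evaluated.
int constructions_after_decltype() {
    int before = C::constructed;
    decltype(new C) c = nullptr;
    static_assert(std::is_same<decltype(c), C*>::value, "c is a C*");
    (void)c;
    return C::constructed - before;
}

// decltype copies a long type name from an existing variable.
bool iterator_types_match() {
    std::vector<int>::const_iterator iter1;
    decltype(iter1) iter2;
    (void)iter1;
    (void)iter2;
    return std::is_same<decltype(iter2), std::vector<int>::const_iterator>::value;
}
```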
  • 32.
    static_assert

    pre-processor-time
        #if VERSION < 8
        #error "Need version 8 or higher"
        #endif

    run-time
        bool done(float g1, float g2, float tol) {
            assert (tol < 1.0e-3);
            // ...
        }

    compile-time
        static_assert (FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

        template<class T> struct S {
            static_assert(sizeof(T) < sizeof(int), "T is too big");
            static_assert(std::is_unsigned<T>::value, "S needs an unsigned type");
            // ...
        };
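
    A compilable version of the run-time and compile-time checks might look like the sketch below. The value of FeetPerMile, the value member of S and the body of done are assumptions, since the slide shows only the assertions themselves:

```cpp
#include <cassert>
#include <type_traits>

// Assumed value; the slide only shows the range check.
const int FeetPerMile = 5280;
static_assert(FeetPerMile > 5200 && FeetPerMile < 6100, "FeetPerMile is wrong");

template <class T>
struct S {
    static_assert(sizeof(T) < sizeof(int), "T is too big");
    static_assert(std::is_unsigned<T>::value, "S needs an unsigned type");
    T value;   // hypothetical member so the struct holds something
};

// Run-time check: assert fires only when the function is actually called
// with a bad tolerance (and only in builds without NDEBUG).
bool done(float g1, float g2, float tol) {
    assert(tol < 1.0e-3f);
    float diff = g1 > g2 ? g1 - g2 : g2 - g1;   // hypothetical convergence test
    return diff < tol;
}

S<unsigned char> small_value = { 7 };   // passes both static_asserts
// S<int> too_big;                      // would fail to compile: "T is too big"
```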
  • 33.
    Trailing-Return-Type

    template<class A, class B>
    ??? adder(A &a, B &b) { return a + b; }                        // no!

    template<class A, class B>
    decltype(a + b) adder(A &a, B &b) { return a + b; }            // no! a and b are not yet in scope

    template<class A, class B>
    auto adder(A &a, B &b) -> decltype(a + b) { return a + b; }    // yes!
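
    The working form of adder can be tried with two different operand pairs. The two wrapper functions here are hypothetical; they exist only to supply lvalue arguments, which the slide's reference parameters require:

```cpp
#include <cassert>
#include <string>

// The return type is written after the parameter list, where a and b
// are in scope and decltype(a + b) can therefore be evaluated.
template <class A, class B>
auto adder(A& a, B& b) -> decltype(a + b) { return a + b; }

// int + double yields double.
double add_int_double(int i, double d) { return adder(i, d); }

// std::string + std::string yields std::string.
std::string join(std::string s, std::string t) { return adder(s, t); }
```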
  • 34.
    lambdas – functions with no name

    [ ] ( ) -> int { return 42; } ;                  // no arguments
    [ ] (int n) -> int { return n * n; } ;           // one argument
    [ ] (int a, int b) -> int { return a + b; } ;    // two arguments

    for_each(v.begin(), v.end(), [ ] (int n) { cout << n << " "; });    // one-liner

    float f1 = integrate ( golden, 0.0, 1.0 );
    float f2 = integrate ( [ ] (float x ) { return x * x + x - 1; }, 0.0, 1.0 );

    [ ] { cout << "hi"; }    // can omit ( ) if no parameters
                             // can omit -> return-type if inferable

    [ capture-clause ] ( parameter-list ) -> return-type { body }    // grammar
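
    The slide does not define integrate or golden. As a sketch under stated assumptions, integrate below is a hypothetical midpoint-rule integrator templated on any callable, and golden is assumed to be the same polynomial as the lambda:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical integrate: midpoint rule over [lo, hi] with a fixed step count.
// Being a template, it accepts a function pointer or a lambda alike.
template <class F>
double integrate(F f, double lo, double hi, int steps = 1000) {
    double h = (hi - lo) / steps;
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) sum += f(lo + (i + 0.5) * h);
    return sum * h;
}

// Assumed definition of the slide's named function.
double golden(double x) { return x * x + x - 1; }

// Passing a named function.
double area_named() { return integrate(golden, 0.0, 1.0); }

// Passing the same computation as a lambda, written at the call site.
double area_lambda() {
    return integrate([](double x) { return x * x + x - 1; }, 0.0, 1.0);
}
```

    The exact integral of x*x + x - 1 over [0, 1] is -1/6, so both calls should land very close to -0.1667.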
  • 35.
    Strongly-Typed Enums

    Illegal – members must be globally unique
        enum Heights {SHORT, TALL};              // ok
        enum Widths {BYTE, SHORT, INT, LONG};    // clash

    enum members are just integers
        enum Colors {RED, GREEN, BLUE};
        if (GREEN == 1) cout << "GREEN == 1";            // yes!
        enum Parts {ENGINE, BRAKE, CLUTCH};
        if (GREEN == BRAKE) cout << "GREEN == BRAKE";    // yes!

    Use enum class
        enum class Heights {SHORT, TALL};
        enum class Widths {BYTE, SHORT, INT, LONG};    // eg: Widths::SHORT
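
    With enum class, the two SHORT enumerators coexist because each lives in its own scope. A small sketch (the bytes function and its byte counts are hypothetical):

```cpp
#include <cassert>

enum class Heights { SHORT, TALL };
enum class Widths { BYTE, SHORT, INT, LONG };   // no clash with Heights::SHORT

// Hypothetical mapping from a width to a byte count.
int bytes(Widths w) {
    switch (w) {
        case Widths::BYTE:  return 1;
        case Widths::SHORT: return 2;
        case Widths::INT:   return 4;
        case Widths::LONG:  return 8;
    }
    return 0;
}

// Scoped enumerators do not convert to int implicitly:
//   if (Widths::SHORT == 1) ...               // does not compile
//   if (Widths::SHORT == Heights::SHORT) ...  // does not compile
// An explicit cast is needed to get at the underlying value.
int as_int(Widths w) { return static_cast<int>(w); }
```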
  • 36.
    Forward-Declared Enum Classes

    enum class Colors : unsigned char;    // forward declaration (underlying type must match the definition)
    void fun(Colors c);                   // use
    . . .
    enum class Colors : unsigned char {RED = 3, GREEN, BLUE = 7};
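
    A sketch of the pattern in compilable form follows. Note that the forward declaration repeats the underlying type: a bare "enum class Colors;" implies int and would conflict with a later definition based on unsigned char. The code function is a hypothetical stand-in for the slide's fun:

```cpp
#include <cassert>

enum class Colors : unsigned char;   // forward (opaque) declaration

// A declaration that uses the enum needs only its name and size,
// not its enumerators.
int code(Colors c);

// The full definition can appear later, or in another header.
enum class Colors : unsigned char { RED = 3, GREEN, BLUE = 7 };

int code(Colors c) { return static_cast<int>(c); }
```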
  • 37.
    nullptr

    // the NULL hack:
    int* p1 = 0;     // value of 0 is 'special'
    int* p2 = 42;    // illegal

    void f (int n) { cout << n; };     f(0);    // works
    void f (int* p) { cout << p; };    f(0);    // works

    void f (int n) { cout << n; }
    void f (int* p) { cout << p; };
    f(0);          // which one? (overload resolution picks f(int))
    f(nullptr);    // calls f(int*)

    decltype(nullptr) == nullptr_t
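
    The overload question on this slide can be answered by running it. In the sketch below the overloads return a label instead of printing, which is a change made only so the result can be checked:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Two overloads, as on the slide, but returning a label instead of printing.
std::string f(int)  { return "int"; }
std::string f(int*) { return "pointer"; }

// The literal 0 is an int first and a null pointer constant second,
// so the exact match f(int) wins.
std::string call_with_zero() { return f(0); }

// nullptr has its own type, std::nullptr_t, which converts to any
// pointer type but never to int.
std::string call_with_nullptr() { return f(nullptr); }

static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "decltype(nullptr) is nullptr_t");
```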
  • 38.
    Memory Model – Scary Terminology
    • Dekker's algorithm
    • Double-checked locking
    • Weak memory consistency
    • Atomics
    • Memory fences/barriers
    • Volatile
    • Sequential consistency
    • Acquire/Release semantics
    • Axiomatic definition & litmus tests
  • 39.
    Dekker's Algorithm

    flag[0] := true                    flag[1] := true
    while flag[1] = true {             while flag[0] = true {
        if turn ≠ 0 {                      if turn ≠ 1 {
            flag[0] := false                   flag[1] := false
            while turn ≠ 0 { }                 while turn ≠ 1 { }
            flag[0] := true                    flag[1] := true
        }                                  }
    }                                  }
    // critical section                // critical section
    turn := 1                          turn := 0
    flag[0] := false                   flag[1] := false
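
    The pseudocode above is only correct if every read and write of flag and turn is sequentially consistent. One way to get that in C++11 is std::atomic with its default memory ordering; with plain variables the code would contain data races. The sketch below makes that assumption explicit; lock, unlock, worker, run and the shared counter are hypothetical names, and the yield call in the inner wait loop is an addition that does not change the algorithm:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Shared state for two threads with ids 0 and 1. Static storage means
// both flags start out false. All accesses use the default
// memory_order_seq_cst, which Dekker's algorithm depends on.
std::atomic<bool> flag[2];
std::atomic<int> turn(0);
int shared_counter = 0;   // plain int, protected by the lock below

void lock(int me) {
    int other = 1 - me;
    flag[me].store(true);
    while (flag[other].load()) {
        if (turn.load() != me) {
            flag[me].store(false);           // back off while it is the other's turn
            while (turn.load() != me) { std::this_thread::yield(); }
            flag[me].store(true);
        }
    }
}

void unlock(int me) {
    turn.store(1 - me);
    flag[me].store(false);
}

void worker(int me, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        lock(me);
        ++shared_counter;                    // critical section
        unlock(me);
    }
}

// Runs both threads and returns the final counter value.
int run(int iterations) {
    shared_counter = 0;
    std::thread t0(worker, 0, iterations);
    std::thread t1(worker, 1, iterations);
    t0.join();
    t1.join();
    return shared_counter;
}
```

    If mutual exclusion holds, no increment is lost and run(n) returns 2 * n. Weakening the atomics to relaxed ordering would remove that guarantee, which is the hazard the memory-model slides are leading up to.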
  • 41.
    • Each proc has FIFO store buffer (SB)
    • Reads read from local SB (read bypassing)
    • MFENCE flushes SB
    • LOCK'd instruction acquires Lock (eg: XCHG)
    • Write to SB may reach memory at any time Lock is not held
    [Diagram: two Procs, each with its own Store buffer, above a shared Lock and Memory]
    http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf
  • 42.
    C++ Libraries (VS)
    • STL
      • C++ 11 conformant
      • Support for new headers in VS vNext
        • <atomic>, <filesystem>, <thread> (others)
    • PPL
      • Parallel Algorithms
      • Task-based programming model
      • Agents and Messaging - express dataflow pipelines
      • Concurrency-safe containers
  • 43.
    Agenda
    • Why C++?
    • Performance : CPUs and GPUs
      • Baseline : Single-CPU / Multi-CPU   Demo
      • Vector CPU                          Demo
      • GPU : C++ AMP                       Demo
    • ISO C++ 11
    • ALM (Application Lifetime Management)
  • 45.
    ALM (Application Life Management)
    • New ALM features in vNext
      • Lightweight Requirements
      • Agile Planning Tools
      • Stakeholder Feedback
      • Context Switching
      • Code Review
      • Exploratory Testing
    • Additional new C++ features
      • 2010 features Updated
        • Architecture Tools
          • Dependency Diagrams
          • Architecture Explorer
      • Unit Testing
        • Native Unit Test Framework
        • Manage and Run tests in VS and Test Manager
  • 48.
  • 49.
    MICROSOFT C++ 2012
    PARTICIPATE IN C++ DEVELOPMENT USER RESEARCH
    MICROSOFT DEVELOPER DIVISION DESIGN RESEARCH
    SIGN UP ONLINE AT http://bit.ly/cppdeveloper
  • 50.
    Going further

    Every week, the DevCamps: ALM, Azure, Windows Phone, HTML5, OpenData
        http://msdn.microsoft.com/fr-fr/devcamp
    Downloads, resources and toolkits: meet us on MSDN
        http://msdn.microsoft.com/fr-fr/
    Offers worth knowing
        • 90-day free trial of Windows Azure: www.windowsazure.fr
        • Up to 35% off Visual Studio Pro with the MSDN subscription: www.visualstudio.fr

    Upcoming Dev Camps sessions (all held via Live Meeting)
        10 February 2012    Open Data - Building rich applications with the Open Data protocol
        16 February 2012    Azure series - Building social applications on the Windows Azure platform
        17 February 2012    Understanding the canvas with Galactic and the three.js library
        21 February 2012    Automated code production with CodeFluent Entities
        2 March 2012        Understanding and using the Azure toolkit for Windows Phone 7, iOS and Android
        6 March 2012        Nuget and ALM
        9 March 2012        Kinect - Managing the life of your sensor
        13 March 2012       Sharepoint series - Test automation
        14 March 2012       TFS Health Check - checking the health of your development platform
        15 March 2012       Azure series - Developing for phones, tablets and the cloud with Visual Studio 2010
        16 March 2012       METRO design applications - A thorough teardown of a METRO javascript template
        20 March 2012       LightSwitch lessons learned: optimizing data access, Silverlight integration
        23 March 2012       OAuth - the key to using social networks in your application

Editor's Notes

  • #4 It used to be, a decade or more back in the 1990s, that you did not have to care about performance: it was a free lunch. Not anymore. Things have changed drastically, and performance is king again. If you look around the market, there is a wide spectrum of devices available to consumers. At one end of the spectrum you have mobile devices, then the traditional desktops, and at the other end the data centers behind everything in the cloud. In each of these scenarios power and performance are key, and the language that helps you get more performance from your hardware while keeping the simplicity of a modern language is C++. C is a language close to the hardware, and portable. C++ provides strong abstraction, strong type safety and type-safe generic code: great modeling power with full control of code in memory. C++ is optimized for control and efficiency, and its value proposition is efficient abstraction. Strong abstraction: type-safe OO and generic code for modeling power, without sacrificing control and efficiency. Full control over code and memory: you can always express what you want to do, and you can always control memory and data layout exactly. Pay-as-you-go efficiency: no mandatory overheads; you don't pay for what you don't use.
  • #6 Intel Sandy Bridge (32 nm, the "tock" in Intel's tick-tock cadence, released Jan 2011): AVX (256-bit); up to 8 cores, 16 hardware threads with HT; successor to the Nehalem micro-architecture. Intel Ivy Bridge (22 nm, the "tick"). The past decade has seen a huge increase in digital content, and with it a new class of applications that have to deal with huge amounts of data. The distinguishing feature of such applications is data-level parallelism: the data can be processed in any order. Two major computing platforms are available, the CPU and the GPU. The CPU handles a wide variety of applications, with performance improvements that come at the cost of power. The GPU is built to handle parallelism.
  • #8 A vector processor, or array processor, is a central processing unit (CPU) that implements an instruction set containing instructions that operate on one-dimensional arrays of data called vectors. This is in contrast to a scalar processor, whose instructions operate on single data items. The vast majority of CPUs now include vector units. Today, most commodity CPUs implement architectures that feature instructions for some vector processing on multiple (vectorized) data sets, typically known as SIMD (Single Instruction, Multiple Data). Vectorization, in parallel computing, is a special case of parallelization, in which software programs that, by default, perform one operation at a time on a single thread are modified to perform multiple operations simultaneously, on that same thread. Automatic vectorization is a major research topic in computer science, seeking methods that would allow a compiler to convert scalar programs into vectorized programs without human assistance.
  • #10 Traditional application not using the vector processor.
  • #11 SSE2 is now the default; it used to be x87 (the floating-point stack). This gives a 4% performance gain on SPEC benchmarks. It can cause slight differences in results in the least significant digit. To go back to the old behaviour, use /arch:IA32.
  • #17 No more "free lunch": the term describes relying on next-generation hardware to improve software speed through higher clock speeds and more cache. Heterogeneous computing has been around for years but usage has been restricted to fairly small niches. I'm predicting that we're going to see abrupt and steep growth over the next couple of years. The combination of delivering results for many workloads cheaper, faster, and more power efficiently, coupled with improved programming tools, is going to vault GPGPU programming into being a much more common technique available to everyone.
  • #29 C++11, formerly known as C++0x (pronounced "see plus plus oh ex"), is the name of the most recent iteration of the C++ programming language, replacing C++03, approved by the ISO as of 12 August 2011. The name is derived from the tradition of naming language versions by the date of the specification's publication.
  • #43 parallel_for, parallel_for_each, parallel_sort, parallel_reduce, parallel_transform.
  • #45 ALM, or Application Life Management, describes the whole process involved in building, shipping and maintaining software.
    So, as a Developer, you might start the day by checking out an existing source file from the version-control system. You then edit the file, write some unit tests, and get it working, with the help of the debugger to identify and fix problems. You then submit the change to an automated build system that also runs a batch of regression tests, to make sure that this fix did not break any existing functionality.
    In the afternoon, you may take part in a bug triage: examining all of the bugs reported by extensive tests, by customers running already-shipped versions of the software, or by customers running an early beta of the new software. You review the bugs, mark their relative priorities, and assign them to testers for further diagnosis.
    Project managers run off custom reports to monitor the status of tasks (that is to say, work that has to be done as part of developing the product) and of reported bugs. They worry about how the project is progressing against where it should be according to plan. They worry about the rates at which bugs are coming in, being resolved, and being closed. They look at graphs of bug "burn-down" rates, slippages, and surprises, such as new tasks added to the project.
    Back when software development was more informal, a small project team did all of these activities. But plans might be written on paper and thrown onto a shelf. Bugs might be recorded in a list, in an ASCII text file. New features might be kept in an Excel spreadsheet, updated manually as new bugs were found. A weekly report meant gathering all of the information together from these different sources and manually constructing a custom document, which the management of the project pored over around a big table.
    The whole affair was relaxed, but it worked, because the team was small and, frankly, the volume of code produced was small.
    But software development today is a much more serious (and stressful) undertaking. Teams are large: it's common to have projects with 100 or more Developers working on the same set of source files, half a million lines of C++, for example. And there's pressure to deliver high-quality software fast, to meet the deadlines of a competitive market, so that the company survives and grows.
    So the old tools are inefficient. ALM seeks to solve these problems with a solution that centers around TFS, "Team Foundation Server". This provides a large set of tools that all work smoothly together and solve the problems listed in the orange arc:
    • Managing the requirements that the project must meet.
    • The day-to-day, or hour-to-hour, monitoring and control of the project's progress.
    • Version control: the system that provides a secure store for the project source files and documents. We update it using database transactions, which are all-or-nothing updates. It is backed up every night, or according to whatever policy you choose. We can see how any source file evolved, should we ever need to track down a rogue checkin and back it out. We can create new "branches" to allow development along parallel lines. It discovers, and in most cases resolves, conflicts where two or more Devs change the same file over the course of the same few days.
    • Test case management: the 100s or 1000s of tests that the software must pass to be considered "good". These tests cover unit, functional, performance and stress testing.
    • Build automation: performing full builds of the entire project with different standard options: incremental or full; debug or optimized (what Visual Studio calls "release").
    • Reporting: a vast range of reports, both routine and "exception". Examples would be progress by Devs (they update how many days, or hours, they have worked on each task and whether the time left is still the same as they estimated); bug rates for incoming, resolved and fixed; and "burn-down" rates that predict the date the project will be ready to ship. An exception report might say that performance test number 123 ran more than 5% slower than it did on the last check, with a bug automatically assigned to the owner of that component. The list goes on and on. And these are just the standard reports.
    What I've described so far is all rather abstract. To give a concrete example, let me describe how it works for us, the 3,500 folks working in Microsoft's "Developer Division". We all use TFS every day: developers, testers, program managers, documentation teams. In particular, my group designs and builds the C++ compiler, which in turn is used to build some enormous products inside Microsoft (Windows, SQL Server, Office and so on), as well as millions of applications around the world, written by IT departments within big corporations and by ISVs building their own products and services for sale.