Experiments with C++11


I experimented with several features of C++11: lambdas, range-for, threads, r-value references, etc. The context is a simple geometry processing/graphics example (10 lines) where I improve a mesh smoothing function. The old version took over 14 seconds. After C++11-ification it now takes 2.5 seconds.

Experiments with C++11: Presentation Transcript

  • C++11: A subset of the features explored
  • What is happening?
      • We want
          • the performance of carefully optimized code
          • the convenience of a high level language
          • to use all our cores
  • Example: Laplacian Smoothing. Vertex moves to center of neighbors.
  • Before
  • After 1k iterations
  • Original (14.6 s). It is so C++98.
      void laplacian_smooth0(Manifold& m, float weight, int max_iter)
      {
          for(int iter=0; iter<max_iter; ++iter) {
              VertexAttributeVector<Vec3d> L_attr(m.no_vertices());
              for(VertexIDIterator v = m.vertices_begin(); v != m.vertices_end(); ++v)
                  if(!boundary(m, *v))
                      L_attr[*v] = laplacian(m, *v);
              for(VertexIDIterator v = m.vertices_begin(); v != m.vertices_end(); ++v)
                  if(!boundary(m, *v))
                      m.pos(*v) += weight*L_attr[*v];
          }
      }
  • Range for (14.2 s). Much better to read. Not only is the for loop clear, we did away with '*'. vertices() returns a class which just contains begin and end functions.
      void laplacian_smooth1(Manifold& m, float weight, int max_iter)
      {
          for(int iter=0; iter<max_iter; ++iter) {
              VertexAttributeVector<Vec3d> L_attr(m.no_vertices());
              for(VertexID v : m.vertices())
                  if(!boundary(m, v))
                      L_attr[v] = laplacian(m, v);
              for(VertexID v : m.vertices()) {
                  if(!boundary(m, v))
                      m.pos(v) += weight*L_attr[v];
              }
          }
      }
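The remark that vertices() returns a class which just contains begin and end functions can be illustrated with a small sketch. IdRange, TinyMesh and sum_ids are hypothetical names invented for this example; they are not GEL types, and GEL's real implementation may look different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the object returned by m.vertices():
// range-for only requires begin() and end().
class IdRange {
public:
    typedef std::vector<int>::const_iterator iterator;
    IdRange(iterator first, iterator last) : first_(first), last_(last) {
        assert(first <= last);  // both iterators must come from the same vector
    }
    iterator begin() const { return first_; }
    iterator end() const { return last_; }
private:
    iterator first_, last_;
};

// Hypothetical container that exposes its ids through such a range.
class TinyMesh {
public:
    explicit TinyMesh(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) ids_.push_back(static_cast<int>(i));
    }
    IdRange vertices() const { return IdRange(ids_.begin(), ids_.end()); }
private:
    std::vector<int> ids_;
};

// Sums all ids with range-for; no iterator type and no '*' in the loop.
inline int sum_ids(const TinyMesh& m) {
    int sum = 0;
    for (int v : m.vertices()) sum += v;
    return sum;
}
```

The range object is cheap to return by value because it only holds two iterators.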
  • Optimized (12.4 s). And we only need one loop. We can memory move the vertex positions.
      void laplacian_smooth2(Manifold& m, float weight, int max_iter)
      {
          auto new_pos = m.positions_attribute_vector();
          for(int iter=0; iter<max_iter; ++iter) {
              for(auto v : m.vertices())
                  if(!boundary(m, v))
                      new_pos[v] = weight*laplacian(m, v) + m.pos(v);
              m.positions_attribute_vector() = new_pos;
          }
      }
  • move (12.6 s). Actually, we should move, but ... oh ... now I copy somewhere else.
      void laplacian_smooth3(Manifold& m, float weight, int max_iter)
      {
          for(int iter=0; iter<max_iter; ++iter) {
              auto new_pos = m.positions_attribute_vector();
              for(auto v : m.vertices())
                  if(!boundary(m, v))
                      new_pos[v] = weight*laplacian(m, v) + m.pos(v);
              m.positions_attribute_vector() = move(new_pos);
          }
      }
  • swap (12.1 s). Now we only have two buffers for vertex positions and always read from one and write to the other. Then swap! I think this version is the sweet spot for single threaded code.
      void laplacian_smooth4(Manifold& m, float weight, int max_iter)
      {
          auto new_pos = m.positions_attribute_vector();
          for(int iter=0; iter<max_iter; ++iter) {
              for(auto v : m.vertices())
                  if(!boundary(m, v))
                      new_pos[v] = weight*laplacian(m, v) + m.pos(v);
              swap(m.positions_attribute_vector(), new_pos);
          }
      }
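The two-buffer idea (read from one buffer, write to the other, then swap) can be tried without a mesh library. The sketch below is an assumption-laden simplification: it smooths a 1D polyline stored in a std::vector, the two end points play the role of boundary vertices, and smooth_polyline is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Smooths the interior points of a 1D polyline; the end points are the boundary.
// `pos` plays the role of m.positions_attribute_vector() in the slides.
inline void smooth_polyline(std::vector<double>& pos, double weight, int max_iter) {
    assert(max_iter >= 0);
    std::vector<double> new_pos = pos;  // one copy up front; the end points never change
    for (int iter = 0; iter < max_iter; ++iter) {
        for (std::size_t i = 1; i + 1 < pos.size(); ++i) {
            // Laplacian: average of the two neighbors minus the point itself.
            double lap = 0.5 * (pos[i - 1] + pos[i + 1]) - pos[i];
            new_pos[i] = pos[i] + weight * lap;  // read from pos, write to new_pos
        }
        std::swap(pos, new_pos);  // exchanges the buffers without copying elements
    }
}
```

Because both buffers start out identical and the end points are never written, the swap keeps the boundary values valid in whichever buffer is currently being read.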
  • Lambda variation. Not much more clear. Should be about the same performance...
      void laplacian_smooth4_5(Manifold& m, float weight, int max_iter)
      {
          auto new_pos = m.positions_attribute_vector();
          for(int iter=0; iter<max_iter; ++iter) {
              for_each_vertex(m, [&](VertexID v) {
                  new_pos[v] = weight*laplacian(m, v) + m.pos(v);
              });
              swap(m.positions_attribute_vector(), new_pos);
          }
      }
  • Threads done wrong (∞). For a brief moment I must have thought I was coding to a GPU. First time I timed it, I got 666 times longer run time.
      void laplacian_smooth5(Manifold& m, float weight, int max_iter)
      {
          for(int iter=0; iter<max_iter; ++iter) {
              auto new_pos = m.positions_attribute_vector();
              vector<thread> t_vec;
              // spawns one thread per non-boundary vertex
              for(auto v : m.vertices())
                  if(!boundary(m, v))
                      t_vec.push_back(thread([&](VertexID vid){
                          if(!boundary(m, vid))
                              new_pos[vid] = weight*laplacian(m, vid) + m.pos(vid);
                      }, v));
              for(int i=0; i<t_vec.size(); ++i)
                  t_vec[i].join();
              m.positions_attribute_vector() = move(new_pos);
          }
      }
  • 8 threads (2.5 s). Almost five times performance improvement with four physical cores. Hyperthreading works!! CORES = 8
      inline void laplacian_smooth_vertex(Manifold& m,
                                          vector<VertexID>& vids,
                                          VertexAttributeVector<Vec3d>& new_pos,
                                          float weight)
      {
          for(auto v : vids)
              new_pos[v] = m.pos(v) + weight*laplacian(m, v);
      }

      void laplacian_smooth6(Manifold& m, float weight, int max_iter)
      {
          vector<vector<VertexID>> vertex_ids(CORES);
          auto batch_size = m.no_vertices()/CORES;
          int cnt = 0;
          for_each_vertex(m, [&](VertexID v) {
              if(!boundary(m, v))
                  vertex_ids[(cnt++/batch_size)%CORES].push_back(v);
          });
          vector<thread> t_vec(CORES);
          VertexAttributeVector<Vec3d> new_pos = m.positions_attribute_vector();
          for(int iter=0; iter<max_iter; ++iter) {
              for(int thread_no=0; thread_no<CORES; ++thread_no)
                  t_vec[thread_no] = thread(laplacian_smooth_vertex,
                                            ref(m), ref(vertex_ids[thread_no]),
                                            ref(new_pos), weight);
              for(int thread_no=0; thread_no<CORES; ++thread_no)
                  t_vec[thread_no].join();
              swap(m.positions_attribute_vector(), new_pos);
          }
      }
  • Statistics (run times in seconds)
      Variant      Run 1  Run 2  Run 3  Run 4  Run 5  Median
      Baseline      14.6   14.6   14.5   14.6   14.6    14.6
      Range for     14.4   14.2   14.2   14.2   14.2    14.2
      Copy back     12.4   12.4   12.4   12.4   12.4    12.4
      Move back     12.5   12.5   12.9   12.9   12.6    12.6
      Swap          12.1   12.1   12.1   12.1   12.2    12.1
      2 threads      6.8    6.7    6.8    6.7    6.7     6.7
      4 threads      4.1    4.1    4.1    4.1    4.1     4.1
      8 threads      2.5    2.5    2.5    2.5    2.5     2.5
  • Now make it generic!
  • Generic building blocks:
      typedef vector<vector<VertexID>> VertexIDBatches;

      // #1 Produces a vector of vectors of vertex IDs
      VertexIDBatches batch_vertices(Manifold& m) {
          VertexIDBatches vertex_ids(CORES);
          auto batch_size = m.no_vertices()/CORES;
          int cnt = 0;
          for_each_vertex(m, [&](VertexID v) {
              if(!boundary(m, v))
                  vertex_ids[(cnt++/batch_size)%CORES].push_back(v);
          });
          return vertex_ids;
      }

      // #2 Actually spawns off worker threads
      template<typename T>
      void for_each_vertex_parallel(int no_threads, const VertexIDBatches& batches, T& f) {
          vector<thread> t_vec(no_threads);
          for(auto t : range(0, no_threads))
              t_vec[t] = thread(f, ref(batches[t]));
          for(auto t : range(0, no_threads))
              t_vec[t].join();
      }
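The batch-then-spawn pattern of the two functions above can be exercised without a mesh. The following is a minimal sketch under stated assumptions: plain int indices stand in for VertexID, and make_batches, for_each_batch_parallel and squares_parallel are hypothetical names. Unlike the slide, the functor is passed with std::ref so each thread shares one functor instead of receiving a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

typedef std::vector<std::vector<int>> Batches;

// Splits the ids 0..n-1 into `parts` contiguous batches of nearly equal size.
inline Batches make_batches(int n, int parts) {
    assert(n >= 0 && parts > 0);
    Batches batches(parts);
    for (int i = 0; i < n; ++i)
        batches[static_cast<std::size_t>(i) * parts / n].push_back(i);
    return batches;
}

// Runs f on every batch, one thread per batch, and joins all threads before returning.
template <typename F>
void for_each_batch_parallel(const Batches& batches, F& f) {
    std::vector<std::thread> threads;
    threads.reserve(batches.size());
    for (const auto& batch : batches)
        threads.emplace_back(std::ref(f), std::cref(batch));
    for (auto& t : threads)
        t.join();
}

// Example use: every thread writes a disjoint set of elements, so no locking is needed.
inline std::vector<int> squares_parallel(int n, int parts) {
    std::vector<int> out(n);
    Batches batches = make_batches(n, parts);
    auto f = [&out](const std::vector<int>& ids) {
        for (int i : ids) out[i] = i * i;
    };
    for_each_batch_parallel(batches, f);
    return out;
}
```

As in the smoothing code, correctness relies on the batches being disjoint: no two threads ever write the same element.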
  • Generic version (2.4 s). Slightly faster, much simpler. Note I threw in a range class to get rid of old school for loops.
      void laplacian_smooth7(Manifold& m, float weight, int max_iter)
      {
          auto vertex_ids = batch_vertices(m);
          auto new_pos = m.positions_attribute_vector();
          auto f = [&](const vector<VertexID>& vids) {
              for(VertexID v : vids)
                  new_pos[v] = m.pos(v) + weight*laplacian(m, v);
          };
          for(auto _ : range(0, max_iter)) {
              for_each_vertex_parallel(CORES, vertex_ids, f);
              swap(m.positions_attribute_vector(), new_pos);
          }
      }
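The slides do not show the range class itself. One possible shape, offered only as a guess at what such a helper could look like, is sketched here; int_range and sum_range are hypothetical names and the author's actual class may differ.

```cpp
#include <cassert>

// Hypothetical integer range for range-for: int_range(0, n) yields 0, 1, ..., n-1.
class int_range {
public:
    class iterator {
    public:
        explicit iterator(int v) : value_(v) {}
        int operator*() const { return value_; }
        iterator& operator++() { ++value_; return *this; }
        bool operator!=(const iterator& other) const { return value_ != other.value_; }
    private:
        int value_;
    };
    int_range(int first, int last) : first_(first), last_(last) {
        assert(first <= last);  // an inverted range would never terminate
    }
    iterator begin() const { return iterator(first_); }
    iterator end() const { return iterator(last_); }
private:
    int first_, last_;
};

// Sums the integers in [first, last) using range-for instead of a counting loop.
inline int sum_range(int first, int last) {
    int sum = 0;
    for (auto i : int_range(first, last)) sum += i;
    return sum;
}
```

Range-for only needs operator!=, prefix operator++ and operator* from the iterator, which is why the nested class can stay this small.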
  • async version:
      template<typename T>
      void for_each_vertex_parallel(int no_threads, const VertexIDBatches& batches, T& f) {
          vector<future<void>> f_vec(no_threads);
          for(auto t : range(0, no_threads))
              f_vec[t] = async(launch::async, f, ref(batches[t]));
      }
    thread version (as before):
      template<typename T>
      void for_each_vertex_parallel(int no_threads, const VertexIDBatches& batches, T& f) {
          vector<thread> t_vec(no_threads);
          for(auto t : range(0, no_threads))
              t_vec[t] = thread(f, ref(batches[t]));
          for(auto t : range(0, no_threads))
              t_vec[t].join();
      }
    See, the async version is simpler and the destructor joins! What happens if we ignore the future?! But the async code takes 50% more time than the old code where I join threads explicitly. Not sure why?!
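On the question of ignoring the future: a future returned by std::async with launch::async waits for its task in its destructor, so a discarded future blocks at the end of the statement and the tasks end up running one after another. The sketch below illustrates both uses; parallel_sum and discarded_futures_run_in_order are hypothetical names, and the example does not attempt to explain the 50% slowdown reported above.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Sums each batch in its own async task, then adds up the partial sums.
inline long long parallel_sum(const std::vector<std::vector<int>>& batches) {
    std::vector<std::future<long long>> futures;
    futures.reserve(batches.size());
    for (const auto& batch : batches) {
        const std::vector<int>* b = &batch;
        futures.push_back(std::async(std::launch::async, [b]() -> long long {
            long long s = 0;
            for (int x : *b) s += x;
            return s;
        }));
    }
    long long total = 0;
    for (auto& fut : futures) total += fut.get();  // get() waits for the task's result
    return total;
}

// Each discarded future is destroyed at the end of its statement, and that
// destructor waits for the task, so the two tasks run strictly in sequence.
inline std::vector<int> discarded_futures_run_in_order() {
    std::vector<int> order;
    (void)std::async(std::launch::async, [&order] { order.push_back(1); });
    (void)std::async(std::launch::async, [&order] { order.push_back(2); });
    assert(order.size() == 2);  // both tasks have already finished here
    return order;
}
```

Keeping the futures in a vector, as the async version on the slide does, is what lets the tasks overlap: the waiting happens only when the vector is destroyed at the end of the function.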
  • More C++11 examples
  • Polymorphism with std::function. Maybe more exotic than actually useful, but instructive that polymorphism can be achieved so differently from when using virtual functions.
      class MyClass {
          int c;
      public:
          MyClass(int _c): c(_c) {}
          function<int(int)> fun;
          void set_fun(function<int(int,int)> f) {
              fun = bind1st(f, c);
          }
      };

      int fun1(int c, int x) { return c*x; }
      int fun2(int c, int x) { return x/c; }

      int main(int argc, const char * argv[]) {
          MyClass m1{1}, m2{2};
          m1.set_fun(fun1);
          m2.set_fun(fun2);
          cout << m1.fun(42) << " " << m2.fun(42) << endl;
      }
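bind1st was deprecated in C++11 and removed in C++17, so the same idea is often written with a capturing lambda today. The following sketch is a variation on the slide, not the author's code; Scaler, mul and div_by are hypothetical names.

```cpp
#include <cassert>
#include <functional>

// Hypothetical variant of MyClass: behaviour is chosen per object by storing a callable.
class Scaler {
    int c;
public:
    explicit Scaler(int c_) : c(c_) {}
    std::function<int(int)> fun;
    // Stores f with its first argument fixed to c; a lambda takes the place of bind1st.
    void set_fun(std::function<int(int, int)> f) {
        assert(static_cast<bool>(f));  // f must hold a callable
        int bound = c;
        fun = [f, bound](int x) { return f(bound, x); };
    }
};

inline int mul(int c, int x) { return c * x; }
inline int div_by(int c, int x) { return x / c; }
```

Two objects of the same class can then behave differently, which is the kind of polymorphism the slide demonstrates without virtual functions.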
  • Kinder, gentler member init. We never really liked these long initialization lists and always wondered why we could not just initialize when we declare.
      class VisObj
      {
          std::string file;
          GLGraphics::GLViewController view_ctrl;
          bool create_display_list;
          HMesh::Manifold mani;
          HMesh::Manifold old_mani;

          Harmonics* harmonics;
          GLGraphics::ManifoldRenderer* renderer;
          CGLA::Vec3d bsphere_center;
          float bsphere_radius;
      public:
          VisObj():
              file(""), view_ctrl(WINX,WINY, CGLA::Vec3f(0), 1.0),
              create_display_list(true), harmonics(0) {}
          // ... and so on
  • Kinder, gentler member init. Now, we can! What is up with nullptr?!
      class VisObj
      {
          std::string file = "";
          GLGraphics::GLViewController view_ctrl =
              GLGraphics::GLViewController(WINX,WINY, CGLA::Vec3f(0), 1.0);
          bool create_display_list = true;
          HMesh::Manifold mani;
          HMesh::Manifold old_mani;

          Harmonics* harmonics = nullptr;
          GLGraphics::ManifoldRenderer* renderer = nullptr;
          CGLA::Vec3d bsphere_center;
          float bsphere_radius;
      public:
          VisObj() {}
          // and so on
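As for what is up with nullptr: unlike the literal 0, it has its own type (std::nullptr_t) that converts to pointer types but not to int, which matters for overload resolution. The sketch below shows that together with default member initializers; which and Viewer are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <string>

// Two overloads that the literal 0 and nullptr resolve differently.
inline std::string which(int) { return "int"; }
inline std::string which(const char*) { return "pointer"; }

// Default member initializers apply in every constructor that does not override them.
struct Viewer {
    bool create_display_list = true;
    const char* label = nullptr;
    float radius = 1.0f;
    Viewer() {}
    explicit Viewer(float r) : radius(r) {  // overrides only radius
        assert(r > 0.0f);
    }
};
```

Calling which(0) selects the int overload while which(nullptr) selects the pointer overload, so a null pointer can no longer be mistaken for an integer.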
  • ArithVec changes. Look, I did away with C style arrays.
      template <class T, class V, unsigned int N>
      class ArithVec
      {
      protected:
          /// The actual contents of the vector.
          std::array<T,N> data;
          // ...
  • Look! an initializer list ... hmmm MSVC does not like it.
    Before:
      ArithVec::ArithVec(T _a, T _b, T _c, T _d)
      {
          assert(N==4);
          data[0] = _a;
          data[1] = _b;
          data[2] = _c;
          data[3] = _d;
      }
    After:
      ArithVec::ArithVec(T _a, T _b, T _c, T _d):
          data({_a,_b,_c,_d}) { assert(N==4); }
  • Note: begin() and end() make the code nicer than before.
    Before:
      /// Assignment multiplication with scalar.
      const V& ArithVec::operator *=(T k)
      {
          std::transform(data, &data[N], data,
                         std::bind2nd(std::multiplies<T>(), k));
          return static_cast<const V&>(*this);
      }
    After:
      /// Assignment multiplication with scalar.
      const V& ArithVec::operator *=(T k)
      {
          std::for_each(begin(), end(), [k](T& x){ x*=k; });
          return static_cast<const V&>(*this);
      }
  • Morten: this is actually simpler!
    Before:
      /// Assignment multiplication with scalar.
      const V& ArithVec::operator *=(T k)
      {
          std::transform(data, &data[N], data,
                         std::bind2nd(std::multiplies<T>(), k));
          return static_cast<const V&>(*this);
      }
    After:
      /// Assignment multiplication with scalar.
      const V& ArithVec::operator *=(T k)
      {
          for(auto& x : data) { x*=k; }
          return static_cast<const V&>(*this);
      }
  • Just to use the obvious. This was possible before C++11.
      bool ArithVec::operator==(const V& v) const
      {
          return std::equal(begin(), end(), v.begin());
      }

      bool ArithVec::operator==(const V& v) const
      {
          return std::inner_product(data, &data[N], v.get(), true,
                                    std::logical_and<bool>(), std::equal_to<T>());
      }
  • circulate with functors. Five slides that show what we can do by having circulator functions accepting functors.
      inline int circulate_vertex_ccw(const Manifold& m, VertexID v,
                                      std::function<void(Walker&)> f)
      {
          Walker w = m.walker(v);
          for(; !w.full_circle(); w = w.circulate_vertex_ccw()) f(w);
          return w.no_steps();
      }

      inline int circulate_vertex_ccw(const Manifold& m, VertexID v,
                                      std::function<void(VertexID)> f)
      {
          return circulate_vertex_ccw(m, v, [&](Walker& w){ f(w.vertex()); });
      }
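The circulator-with-functor pattern can be tried out on a plain adjacency list. This sketch assumes a hypothetical Adjacency type in place of Manifold and Walker, and a single circulate function that visits each neighbor once and returns the step count, mirroring how the functions above return no_steps().

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a mesh: adj[v] lists the vertices adjacent to v.
typedef std::vector<std::vector<int>> Adjacency;

// Calls f on every neighbor of v and returns the number of steps taken.
inline int circulate(const Adjacency& adj, int v, const std::function<void(int)>& f) {
    assert(v >= 0 && static_cast<std::size_t>(v) < adj.size());
    int steps = 0;
    for (int n : adj[v]) { f(n); ++steps; }
    return steps;
}

// Valency falls out of the step count when the functor does nothing.
inline int valency(const Adjacency& adj, int v) {
    return circulate(adj, v, [](int) {});
}

// Connectivity test: the lambda accumulates into a captured flag.
inline bool connected(const Adjacency& adj, int v0, int v1) {
    bool c = false;
    circulate(adj, v0, [&](int n) { c |= (n == v1); });
    return c;
}
```

The traversal logic lives in one place, and each query only supplies the small piece of work to do per neighbor.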
  • valency:
      int valency(const Manifold& m, VertexID v)
      {
          // perform full circulation to get valency
          Walker vj = m.walker(v);
          while(!vj.full_circle())
              vj = vj.circulate_vertex_cw();
          return vj.no_steps();
      }

      int valency(const Manifold& m, VertexID v)
      {
          return circulate_vertex_ccw(m, v, [](Walker){});
      }
  • connected:
      bool connected(const Manifold& m, VertexID v0, VertexID v1)
      {
          for(Walker vj = m.walker(v0); !vj.full_circle();
              vj = vj.circulate_vertex_cw()) {
              if(vj.vertex() == v1)
                  return true;
          }
          return false;
      }

      bool connected(const Manifold& m, VertexID v0, VertexID v1)
      {
          bool c = false;
          circulate_vertex_ccw(m, v0, [&](VertexID v){ c |= (v==v1); });
          return c;
      }
  • laplacian:
      inline Vec3d laplacian(const Manifold& m, VertexID v)
      {
          Vec3d p(0);
          int n = circulate_vertex_ccw(m, v, [&](VertexID v){ p += m.pos(v); });
          return p / n - m.pos(v);
      }

      Vec3d laplacian(const Manifold& m, VertexID v)
      {
          Vec3d avg_pos(0);
          int n = 0;
          for(Walker w = m.walker(v); !w.full_circle(); w = w.circulate_vertex_cw()) {
              avg_pos += m.pos(w.vertex());
              ++n;
          }
          return avg_pos / n - m.pos(v);
      }
  • no_edges:
      int no_edges(const Manifold& m, FaceID f)
      {
          return circulate_face_ccw(m, f, [](Walker w){});
      }

      int no_edges(const Manifold& m, FaceID f)
      {
          // perform full circulation to get valency
          Walker w = m.walker(f);
          for(; !w.full_circle(); w = w.circulate_face_cw());
          return w.no_steps();
      }
  • Conclusions
      • Multicore is very important and the C++11 thread library makes concurrency easy. We will rely on the compiler for SIMD optimization!
      • range for is great. Makes code far more clear and we get rid of iterators in many cases
      • move semantics & RVO make clear code faster
      • lambda functions improve on locality ... awesome with the STL algorithms and std::function
      • auto helps us avoid obfuscation with ugly type names
      • uniform initialization and initializer lists also make code concise
  • Discussion
      • A C++11 developer version of GEL has branched off: should we go for built-in parallelism?
      • Hmm - just so you know - there is much more in the C++11 standard. This is just the part I understand so far...
      • Herb Sutter: “We broke all the books!”
      • Yet the learning curve is less daunting than when we first had to do templates.