
NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Published in: Technology, Education

  1. 1. NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System. Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2. *1 Chuo University & JST CREST, *2 Intel Corporation
  2. 2. Outline 1. Background 2. Breadth-first Search (BFS) 3. Proposal: NUMA-optimized parallel BFS 4. Numerical Results 5. Conclusion
  3. 3. Background • Large-scale graphs in various fields – US road network: 24 million vertices & 58 million edges – Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges – Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges – Cyber-security: 15 billion log entries / day • Fast and scalable processing of large graphs by using HPC
  4. 4. Importance of graph processing • Application fields: transportation, social network, cyber-security, bioinformatics • Graph processing leads to an understanding of relationships in each application field: concurrent search (breadth-first search), optimization (single-source shortest path), edge-oriented (maximal independent set) • BFS is an important and fundamental graph processing kernel – Obtains the relationship of distance (hops) as a stand-alone kernel – Used in many algorithms (BC, max. flow, max. independent set) • Problems of fast & scalable BFS computation: low arithmetic intensity & irregular memory accesses • [Figure: Graph500 flow: input parameters (SCALE, edgefactor) → Step 1 graph generation → Step 2 graph construction → Step 3 BFS and validation, 64 iterations → results (TEPS ratio)]
  5. 5. Graph500 Benchmark • Measures computer performance using the TEPS ratio in graph processing such as BFS (breadth-first search) • TEPS ratio = # of traversed edges per second • Input parameters: SCALE and edgefactor (= 16) • Steps: 1. Generation 2. Construction 3. BFS and validation (x 64), reporting the median TEPS • Kronecker graph – 2^SCALE vertices and 2^(SCALE+4) edges – synthetic scale-free network • http://www.graph500.org
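The size and TEPS definitions on this slide reduce to a few lines of arithmetic. The following is a minimal sketch, not the official Graph500 reference code; the function names `kronecker_size`, `teps`, and `median_teps` are hypothetical and chosen only for illustration.

```python
import statistics

def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) for a Graph500-style Kronecker graph."""
    num_vertices = 2 ** scale
    num_edges = edgefactor * num_vertices  # equals 2^(SCALE+4) when edgefactor is 16
    return num_vertices, num_edges

def teps(traversed_edges, seconds):
    """Traversed edges per second for a single BFS run."""
    return traversed_edges / seconds

def median_teps(runs):
    """Median TEPS over (traversed_edges, seconds) pairs, e.g. 64 BFS runs."""
    return statistics.median(teps(edges, secs) for edges, secs in runs)
```

For SCALE 26, `kronecker_size(26)` gives 67,108,864 vertices and 1,073,741,824 edges, which agrees with the "67 million vertices and 1 billion edges" quoted on the results slide.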
  6. 6. Contribution • Efficient hybrid algorithm of BFS [Beamer2011, 2012] – reduces unnecessary edge traversal – 5.1 GTEPS • Our proposal: NUMA-optimized hybrid algorithm (Hybrid BFS + NUMA) – Improves locality of memory access – Library for considering NUMA carefully – Column-wise graph partitioning • Results on 4-way Intel Xeon E5 (64 CPU cores) – Scalable: scales well up to 64 threads – Fast: 11.15 GTEPS, a 2.2x speedup compared with the original hybrid algorithm
  7. 7. Outline 1. Background 2. Breadth-first Search (BFS) 3. NUMA architecture 4. Proposal: NUMA-optimized parallel BFS 5. Numerical Results
  8. 8. Breadth-first Search (BFS) • Obtains the level of each vertex from the source vertex • Level = number of hops away from the source • Input: graph G and source • Output: tree rooted at the source • [Figure: BFS tree with the source and Levels 1 to 3]
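The input/output contract on this slide can be illustrated with a plain sequential BFS. This is a minimal textbook sketch, not the authors' implementation; the adjacency-dictionary representation and the name `bfs_levels` are assumptions made for the example.

```python
from collections import deque

def bfs_levels(adj, source):
    """Return (level, parent) dictionaries for every vertex reachable from source.

    adj maps each vertex to a list of its neighbors (assumed representation).
    """
    level = {source: 0}        # hops away from the source
    parent = {source: source}  # encodes the BFS tree rooted at the source
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in level:  # first visit fixes the level and tree parent
                level[w] = level[v] + 1
                parent[w] = v
                queue.append(w)
    return level, parent
```

The `parent` dictionary is the output tree: following parents from any reachable vertex leads back to the source in exactly `level[v]` hops.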
  9. 9. Hybrid BFS for low-diameter graphs [Beamer2011, 2012] • Efficient for low-diameter graphs – graphs with scale-free and/or small-world properties, such as social networks • Appears at higher ranks in the Graph500 benchmark • Hybrid algorithm – combines the top-down algorithm and the bottom-up algorithm – reduces unnecessary edge traversal • Top-down algorithm (from the frontier at level k to neighbors at level k+1): efficient for a small frontier (frontier < neighbors) • Bottom-up algorithm (from candidate neighbors at level k+1 back to the frontier at level k): efficient for a large frontier (frontier > neighbors) • The hybrid algorithm switches between the two directions
  10. 10. Top-down algorithm • Explores outgoing edges of the frontier queue QF • Appends unvisited vertices to the neighbor queue QN • [Figure: source at Level 0 (QF), Level 1 (QN)]
  11. 11. Top-down algorithm • Explores outgoing edges of the frontier queue QF • Appends unvisited vertices to the neighbor queue QN • [Figure: Level 0 (QF) to Level 1 (QN), then Level 1 (QF) to Level 2 (QN), with unnecessary edge traversal]
  12. 12. Top-down algorithm • Explores outgoing edges of the frontier queue QF • Appends unvisited vertices to the neighbor queue QN • Efficient for a small frontier • Has unnecessary edge traversal for a large frontier • [Figure: Level 0 (QF) to Level 1 (QN), Level 1 (QF) to Level 2 (QN), and Level 2 (QF) to Level 3 (QN), with unnecessary edge traversal]
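One level of the top-down direction can be sketched as follows. This is an illustrative sequential version under assumed data structures (an adjacency dictionary and a `visited` set); the function name `top_down_step` and the `scanned` counter are hypothetical, added to make the unnecessary edge traversals visible.

```python
def top_down_step(adj, frontier, visited):
    """Scan outgoing edges of the frontier queue QF and build the neighbor queue QN.

    Returns (neighbors, scanned), where scanned counts every edge examined,
    including edges that lead to already-visited vertices.
    """
    neighbors = []  # QN
    scanned = 0
    for v in frontier:  # QF
        for w in adj[v]:
            scanned += 1
            if w not in visited:
                visited.add(w)
                neighbors.append(w)
    return neighbors, scanned
```

Every edge leaving the frontier is examined even when its endpoint is already visited, which is why the scanned count grows quickly once the frontier is large.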
  13. 13. Bottom-up algorithm • Explores the frontier queue QF from unvisited vertices • Appends adjacent vertices to the neighbor queue QN • [Figure: source (QF), Level 1 (QN), unvisited vertices, with unnecessary edge traversal]
  14. 14. Bottom-up algorithm • Explores the frontier queue QF from unvisited vertices • Appends adjacent vertices to the neighbor queue QN • [Figure: source (QF) to Level 1 (QN), then Level 1 (QF) to Level 2 (QN), with unvisited vertices and unnecessary edge traversal]
  15. 15. Bottom-up algorithm • Explores the frontier queue QF from unvisited vertices • Appends adjacent vertices to the neighbor queue QN • Efficient for a large frontier • Has unnecessary edge traversal for a small frontier • [Figure: source (QF) to Level 1 (QN), Level 1 (QF) to Level 2 (QN), and Level 2 (QF) to Level 3 (QN), with unvisited vertices and unnecessary edge traversal]
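One level of the bottom-up direction can be sketched in the same style. This is an illustrative sequential version that assumes an undirected graph stored as an adjacency dictionary over vertices `0..num_vertices-1`; the name `bottom_up_step` and the `scanned` counter are hypothetical.

```python
def bottom_up_step(adj, frontier, visited, num_vertices):
    """Let each unvisited vertex search its adjacency list for a frontier vertex.

    Returns (neighbors, scanned). The scan of a vertex stops at the first
    frontier neighbor it finds, which is where the edge savings come from.
    """
    in_frontier = set(frontier)  # QF as a membership structure
    neighbors = []               # QN
    scanned = 0
    for w in range(num_vertices):
        if w in visited:
            continue
        for v in adj[w]:
            scanned += 1
            if v in in_frontier:
                neighbors.append(w)
                break  # one frontier neighbor is enough
    visited.update(neighbors)
    return neighbors, scanned
```

When the frontier is small, most unvisited vertices scan their whole adjacency list without finding a frontier vertex, which matches the slide's remark about unnecessary traversal for a small frontier.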
  16. 16. Hybrid BFS combines Top-down and Bottom-up • Switches between the top-down algorithm (frontier at level k to neighbors at level k+1) and the bottom-up algorithm (candidate neighbors at level k+1 back to the frontier at level k) • Note from the embedded figure: this lazy estimation of candidate neighbors increases the number of edges traversed • Traversed edges of a Kronecker graph (SCALE 26) [1]:
      Level   Top-down (mF)    Bottom-up (mB)   Hybrid min(mF, mB)
      0                   2     2,103,840,895                    2
      1              66,206     1,766,587,029               66,206
      2         346,918,235        52,677,691           52,677,691
      3       1,727,195,615        12,820,854           12,820,854
      4          29,557,400           103,184              103,184
      5              82,357            21,467               21,467
      6                 221            21,240                  227
      Total   2,103,820,036     3,936,072,360           65,689,631
      Ratio         100.00%           187.09%                3.12%
      [1] S. Beamer et al.: Direction-optimizing breadth-first search, SC'12, 2012.
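The totals and ratios in the traversed-edge table can be re-derived from its per-level counts. The sketch below only copies the numbers from the slide and does the arithmetic; the per-level direction shown is the cheaper of the two columns and is an illustration, not the switching rule used by the authors or by Beamer et al. The Hybrid column is taken as given from the slide.

```python
# Per-level traversed edges copied from the slide (Kronecker graph, SCALE 26).
top_down = [2, 66_206, 346_918_235, 1_727_195_615, 29_557_400, 82_357, 221]
bottom_up = [2_103_840_895, 1_766_587_029, 52_677_691, 12_820_854,
             103_184, 21_467, 21_240]
hybrid = [2, 66_206, 52_677_691, 12_820_854, 103_184, 21_467, 227]

# Cheaper direction at each level (illustrative oracle choice).
direction = ["top-down" if f <= b else "bottom-up"
             for f, b in zip(top_down, bottom_up)]

def ratio_percent(edges, baseline):
    """Total traversed edges relative to the baseline total, in percent."""
    return 100.0 * sum(edges) / sum(baseline)

bottom_up_ratio = ratio_percent(bottom_up, top_down)
hybrid_ratio = ratio_percent(hybrid, top_down)
```

With these numbers the cheaper direction is top-down at levels 0 and 1, bottom-up at levels 2 to 5, and top-down again at level 6, so the hybrid search traverses only a small fraction of the edges that pure top-down does.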
  17. 17. Outline 1. Background 2. Breadth-first Search (BFS) 3. Proposal: NUMA-optimized parallel BFS 4. Numerical Results 5. Conclusion
  18. 18. How to speed up the hybrid algorithm? • NUMA architecture – Non-uniform memory access – Each CPU socket has a local RAM – Local RAM is fast and non-local RAM is slow • [Figure: 4-socket Intel Xeon E5 system; each 8-core Xeon E5-4640 socket has processor cores with L1/L2 caches, a shared L3 cache, and a local RAM; sockets are linked by an interconnect]
  19. 19. How to speed up the hybrid algorithm? • NUMA architecture – Non-uniform memory access – Each CPU socket has a local RAM – Local RAM is fast and non-local RAM is slow • Frequent non-local memory accesses on a NUMA architecture: the graph G and the working data (QF, QN, visited flags) are spread across the local memories • [Figure: top-down and bottom-up BFS accessing graph G and working data on the 4-socket Intel Xeon E5 system]
  20. 20. Difficulty of considering NUMA architecture 1. How to distribute the graph and data to each local RAM? • [Figure: graph G split into G0, G1, G2, G3]
  21. 21. Difficulty of considering NUMA architecture 1. How to distribute the graph and data to each local RAM? 2. How to bind each partial graph and its data to each NUMA unit? • [Figure: partial graphs G0 to G3 and data B0 to B3 placed on NUMA units 0 to 3, each consisting of a CPU (CPU0 to CPU3) and its local RAM (RAM0 to RAM3)]
  22. 22. ULIBC: Ubiquity Library for Intelligently Binding Cores 1. NUMACTL (command-line tool, library for C/C++) 2. Intel compiler Thread Affinity Interface (API) 3. ULIBC (our library, for C/C++) – Processor ID: index of logical processor core – Package ID: index of CPU socket – Core ID: index of physical core in each CPU socket • Feature comparison shown on the slide: CPU affinity + local memory binding versus CPU affinity + local memory binding + processor topology (ULIBC) • Processor topology for each CPU core is read from Linux device files (/sys/devices/system/*)
  23. 23. ULIBC: Ubiquity Library for Intelligently Binding Cores 1. NUMACTL (command-line tool, library for C/C++) 2. Intel compiler Thread Affinity Interface (API) 3. ULIBC (our library, for C/C++) – Processor ID: index of logical processor core – Package ID: index of CPU socket – Core ID: index of physical core in each CPU socket • Feature comparison shown on the slide: CPU affinity + local memory binding versus CPU affinity + local memory binding + processor topology (ULIBC) • Processor topology for each CPU core is read from Linux device files (/sys/devices/system/*) • At a parallel region, each thread ID is mapped to a (Processor ID, Package ID, Core ID) triple and bound using the sched_setaffinity system call and the mbind system call
  24. 24. ULIBC: Ubiquity Library for Intelligently Binding Cores 1. NUMACTL (command-line tool, library for C/C++) 2. Intel compiler Thread Affinity Interface (API) 3. ULIBC (our library, for C/C++) – Processor ID: index of logical processor core – Package ID: index of CPU socket – Core ID: index of physical core in each CPU socket • Feature comparison shown on the slide: CPU affinity + local memory binding versus CPU affinity + local memory binding + processor topology (ULIBC) • Processor topology for each CPU core is read from Linux device files (/sys/devices/system/*) • At a parallel region, each thread ID is mapped to a (Processor ID, Package ID, Core ID) triple and bound using the sched_setaffinity system call and the mbind system call • Supports scatter and compact policies (round-robin on CPU sockets) • ULIBC makes it possible to manage NUMA carefully
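The thread-to-core mapping described on these slides can be sketched independently of ULIBC. The code below is not the ULIBC API: `thread_placement` and `pin_to_processor` are hypothetical names, the topology is passed in as `(processor_id, package_id, core_id)` tuples rather than read from `/sys/devices/system/*`, and the reading of "scatter" as round-robin over CPU sockets and "compact" as filling one socket first is an assumption. Pinning uses Python's `os.sched_setaffinity`, which wraps the Linux system call of the same name; memory binding with `mbind` is not covered because the Python standard library has no wrapper for it.

```python
import os

def thread_placement(topology, num_threads, policy="scatter"):
    """Return the processor ID assigned to each thread ID 0..num_threads-1.

    topology: iterable of (processor_id, package_id, core_id) tuples.
    """
    by_package = {}
    for proc, pkg, core in sorted(topology, key=lambda t: (t[1], t[2], t[0])):
        by_package.setdefault(pkg, []).append(proc)
    packages = sorted(by_package)
    if policy == "compact":
        # Fill one CPU socket completely before moving to the next.
        order = [p for pkg in packages for p in by_package[pkg]]
    elif policy == "scatter":
        # Round-robin over CPU sockets.
        depth = max(len(procs) for procs in by_package.values())
        order = [by_package[pkg][i]
                 for i in range(depth)
                 for pkg in packages
                 if i < len(by_package[pkg])]
    else:
        raise ValueError("policy must be 'scatter' or 'compact'")
    if num_threads > len(order):
        raise ValueError("more threads than logical processors")
    return order[:num_threads]

def pin_to_processor(processor_id):
    """Pin the calling process to one logical processor (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {processor_id})
        return True
    return False
```

On a hypothetical machine with two sockets of two cores each, scatter places consecutive threads on alternating sockets, while compact keeps them on the same socket until it is full.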
  25. 25. NUMA-opt.: Column-wise Graph Partitioning • Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds each to local RAM k – Ak is a set of adjacency lists that holds the incoming edges to Vk • Row-wise graph partitioning of the adjacency matrix (A0 to A3): O(m) mostly non-local memory accesses • Column-wise graph partitioning of the adjacency matrix (A0 to A3): O(m) local memory accesses only • [Figure: adjacency matrices with edge (i, j) and vertex subset Vk under both partitionings; frontier at level k and neighbors at level k+1]
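The column-wise partitioning can be sketched as grouping edges by the owner of their destination vertex. This is a minimal illustration that assumes contiguous, equally sized vertex blocks Vk and a plain edge list; the name `partition_columnwise` and the dictionary layout are hypothetical, and no actual memory binding is performed.

```python
def partition_columnwise(num_vertices, edges, num_units):
    """Split directed edges (v, w) so that unit k holds the edges entering Vk.

    Vk is assumed to be the k-th contiguous block of vertex IDs.
    Returns (parts, block) where parts[k] maps a source v to its targets in Vk.
    """
    block = -(-num_vertices // num_units)  # ceiling division
    parts = [{} for _ in range(num_units)]
    for v, w in edges:
        k = w // block  # owner of the destination vertex
        parts[k].setdefault(v, []).append(w)
    return parts, block
```

Because every target stored in `parts[k]` belongs to Vk, a thread working on unit k only needs to update visited flags and queue entries for vertices it owns, which is the locality property the slide attributes to the column-wise layout.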
  26. 26. NUMA-optimized Top-down • Explores outgoing edges of the frontier queue QF • Appends unvisited vertices to the neighbor queue QN • Efficient for a small frontier • Has unnecessary edge traversal for a large frontier • [Figure: Level 0 (frontier QF) to Level 1 (neighbors QN), Level 1 to Level 2, and Level 2 to Level 3, with unnecessary edge traversal]
  27. 27. Details of NUMA-optimized Top-down • Each NUMA unit k explores the outgoing edges in Ak of the frontier queue QF^k • Appends unvisited vertices to its local neighbor queue QN^k • An all-gather of the local queues forms the frontier QF for the next level • [Figure: NUMA units 0 to 3 each process Level 1 (QF^k) to Level 2 (QN^k) locally; the local neighbor queues are all-gathered into the Level 2 frontier QF]
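The per-unit top-down step with an all-gather can be simulated sequentially. This is a sketch under stated assumptions, not the authors' implementation: the units are processed in a plain loop instead of in parallel, `parts[k]` is assumed to map each source vertex to its targets inside Vk (the column-wise layout), and the name `numa_top_down_bfs` is hypothetical.

```python
def numa_top_down_bfs(num_vertices, parts, source):
    """Level-synchronous top-down BFS over column-wise partitions.

    parts[k] maps a source vertex v to the targets w in Vk of its edges.
    Returns a list of levels, with -1 for unreachable vertices.
    """
    level = [-1] * num_vertices
    level[source] = 0
    frontier = [source]  # QF, shared by all units after the all-gather
    depth = 0
    while frontier:
        depth += 1
        local_queues = []
        for a_k in parts:  # one iteration per NUMA unit (parallel in practice)
            qn_k = []      # local neighbor queue QN^k
            for v in frontier:
                for w in a_k.get(v, ()):  # w always belongs to Vk
                    if level[w] == -1:
                        level[w] = depth
                        qn_k.append(w)
            local_queues.append(qn_k)
        # All-gather: concatenate the local queues into the next frontier.
        frontier = [w for qn_k in local_queues for w in qn_k]
    return level
```

Each unit writes levels only for vertices in its own block, so in a parallel version the units would not contend for the same entries, and only the frontier has to be exchanged between levels.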
  28. 28. NUMA-optimized Bottom-up • Explores the frontier queue QF from unvisited vertices • Appends adjacent vertices to the neighbor queue QN • Efficient for a large frontier • Has unnecessary edge traversal for a small frontier • [Figure: source (frontier QF) to Level 1 (neighbors QN), Level 1 to Level 2, and Level 2 to Level 3, with unvisited vertices and unnecessary edge traversal]
  29. 29. Details of NUMA-optimized Bottom-up • Each NUMA unit k explores the frontier queue QF^k from its unvisited vertices • Appends adjacent vertices to its local neighbor queue QN^k • An all-gather of the local queues forms the frontier QF for the next level • [Figure: NUMA units 0 to 3 each hold local queues QF^k and QN^k for Levels 1 and 2; the local queues are all-gathered into the Level 2 frontier QF]
  30. 30. Outline 1. Background 2. Breadth-first Search (BFS) 3. NUMA-optimized parallel BFS 4. Numerical Results 5. Conclusion
  31. 31. Machine specification • 4-way Intel Xeon E5 (8-core Xeon E5-4640 per socket) – CentOS 6.4 (Kernel 2.6.32) – GCC 4.4.7 – 64 logical CPU cores – 4 NUMA units x 16 logical cores • 4-way AMD Opteron 6174 (12 cores per socket) – Fedora 19 (Kernel 3.11.2) – GCC 4.8.1 – 48 CPU cores – 8 NUMA units x 6 cores • [Figure: both systems, showing processor cores with L1/L2 caches, shared L3 cache (Intel), local RAM, and the interconnect]
  32. 32. TEPS ratio varied with problem size • Ours achieves 11.15 GTEPS for a Kronecker graph (SCALE 26: 67 million vertices and 1 billion edges) • Ours is 2.2x faster than the original hybrid algorithm • Peak performance – Hybrid [Beamer2011, 2012]: 5.1 GTEPS on 4-way Intel Xeon E5-8870 (Westmere-EX arch.) – Hybrid + NUMA (this paper): 11.15 GTEPS on 4-way Intel Xeon E5-4640 (SandyBridge-EP arch.) • [Figure: GTEPS (0 to 14) versus SCALE (20 to 29) for Hybrid and Hybrid + NUMA; higher is better]
  33. 33. Strong scaling on Intel/AMD systems • Scales well up to a number of threads equal to the number of cores • 4-way Intel Xeon: 11.15 GTEPS • 4-way AMD Opteron: 6.17 GTEPS • Speedup annotations on the plot: 40 threads: x40; 64 threads: x28 • [Figure: speedup versus number of threads (1 to 64) for ideal, 4-way SandyBridge-EP, and 4-way MagnyCours]
  34. 34. Twitter network (follow-ship network 2009) • 41 million vertices and 1.47 billion edges; an (i, j)-edge connects User i and User j • Our NUMA-optimized BFS on the 4-way Xeon system: 180 ms / BFS, 8.1 GTEPS • Six degrees of separation • Frontier size in BFS with User 21,804,357 as the source:
      Lv      Frontier size   Freq. (%)   Cum. Freq. (%)
      0                   1        0.00             0.00
      1                   7        0.00             0.00
      2               6,188        0.01             0.01
      3             510,515        1.23             1.24
      4          29,526,508       70.89            72.13
      5          11,314,238       27.16            99.29
      6             282,456        0.68            99.97
      7              11,536        0.03           100.00
      8                 673        0.00           100.00
      9                  68        0.00           100.00
      10                 19        0.00           100.00
      11                 10        0.00           100.00
      12                  5        0.00           100.00
      13                  2        0.00           100.00
      14                  2        0.00           100.00
      15                  2        0.00           100.00
      Total      41,652,230      100.00                -
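The frequency columns and the quoted TEPS figure can be re-derived from the numbers on this slide. The sketch below copies the frontier sizes from the table; treating all 1.47 billion edges as traversed in the 180 ms search is an assumption made to reproduce the quoted rate, and the name `frequency_table` is hypothetical.

```python
# Frontier sizes per BFS level, copied from the slide.
frontier_sizes = [1, 7, 6_188, 510_515, 29_526_508, 11_314_238, 282_456,
                  11_536, 673, 68, 19, 10, 5, 2, 2, 2]

def frequency_table(sizes):
    """Return rows of (level, size, freq_percent, cumulative_percent)."""
    total = sum(sizes)
    rows = []
    cumulative = 0
    for lv, size in enumerate(sizes):
        cumulative += size
        rows.append((lv, size, 100.0 * size / total, 100.0 * cumulative / total))
    return rows

rows = frequency_table(frontier_sizes)

# Assumption: every one of the 1.47 billion edges is counted as traversed.
gteps = 1.47e9 / 0.180 / 1e9
```

Levels 4 and 5 together hold about 98% of the reached vertices, and the estimated rate lands just above 8.1 GTEPS, in line with the figure on the slide.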
  35. 35. Graph500 benchmark • Fastest single-node entry on the 4th list (June 2012) • Fastest CPU-based single-node entry on the 6th list (June 2013) • [Figure labels: ours, 4-way Intel Xeon Westmere-EX; ours, 4-way Intel Xeon SandyBridge-EP; 8.2 GTEPS; Rank 26; Rank 57; 11.1 GTEPS; Convey, 4 FPGA + 2 CPU] • http://www.graph500.org
  36. 36. 1st Green Graph500 list (June 2013) • Measures power efficiency using the TEPS/W ratio • Results on various systems such as Android, Linux, and Mac • Small Data category (ours): Rank 1 ASUS tablet TF700T, Rank 2 Intel NUC (Linux), Rank 3 Mac mini • [Figure labels: Android NDK; NVIDIA Tegra3 (4-core); 53.5 MTEPS/W (1.9 GTEPS); 53.8 MTEPS/W (1.1 GTEPS); 64.1 MTEPS/W (150 MTEPS); Intel/AMD arch. with the same source code] • http://green.graph500.org
  37. 37. Conclusion • NUMA-optimized hybrid BFS algorithm (Hybrid + NUMA) – Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA • Numerical results on 4-way Intel Xeon – scales well up to 64 threads (scalable) – achieves 11.15 GTEPS (fast) – 2.2x speedup compared with the original hybrid algorithm • Graph500 & Green Graph500 – Fastest single-node entry in June 2012 – Most power-efficient in June 2013
  38. 38. Future work • Further optimizing the NUMA-optimized BFS – BigData2013 version: 11 GTEPS – Latest version: 26 GTEPS (2.4x faster) • Distributed-memory parallel computation • [Figure: GTEPS (0 to 30) versus SCALE (20 to 29) for the latest version and the BigData2013 version]
