Ling Liu
School of Computer Science, College of Computing

Part II: Distributed Graph Processing
Big Data Trends
• Big Data: Volume, Velocity, Variety
• 1 zettabyte = a trillion gigabytes (10^21 bytes) (CISCO, 2012)
• 500 million tweets per day
• 100 hours of video are uploaded every minute
Why Graphs?
Graphs are everywhere!
• Application domains: social network graphs, road networks, national security, business analytics, biological graphs
• Example figures on the slide: Friendship Graph (Facebook Engineering, 2010); Brain Network (The Journal of Neuroscience, 2011); US Road Network (www.pavementinteractive.org); Web Security Graph (McAfee, 2013); Intelligence Data Model (NSA, 2013)
How Big?
Social scale
• 1 billion vertices, 100 billion edges
• 111 PB adjacency matrix
• 2.92 TB adjacency list, 2.92 TB edge list
• Example: Twitter graph from the Gephi dataset (http://www.gephi.org)
Web scale
• 50 billion vertices, 1 trillion edges
• 271 EB adjacency matrix
• 29.5 TB adjacency list, 29.1 TB edge list
• Examples: Internet graph from the Opte Project (http://www.opte.org/maps); web graph from the SNAP database (http://snap.stanford.edu/data)
Brain scale
• 100 billion vertices, 100 trillion edges
• 1.1 ZB adjacency matrix (given on the source slide as 2.08 mNA bytes, "molar bytes")
• 2.83 PB adjacency list, 2.84 PB edge list
• Example: human connectome (Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011)
Source of the figures: Paul Burkhardt and Chris Waring, "An NSA Big Graph experiment" (NSA-RD-2013-056001v1)
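As a rough sanity check (my own back-of-the-envelope arithmetic, not taken from the slides), the adjacency-matrix figures follow from storing one bit per ordered vertex pair and reading the results with binary prefixes:

\[
\frac{(10^{9})^{2}\,\text{bits}}{8 \cdot 2^{50}} \approx 111\ \text{PB},\qquad
\frac{(5\times 10^{10})^{2}\,\text{bits}}{8 \cdot 2^{60}} \approx 271\ \text{EB},\qquad
\frac{(10^{11})^{2}\,\text{bits}}{8 \cdot 2^{70}} \approx 1.1\ \text{ZB}.
\]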
Big Graph Data Technical Challenges
Huge and growing size
- Requires massive storage capacities
- Graph analytics usually requires much bigger computing and storage resources
Complicated correlations among data entities (vertices)
- Make it hard to parallelize graph processing (hard to partition)
- Most existing big data systems are not designed to handle such complexity
Skewed distribution (i.e., high-degree vertices)
- Makes it hard to ensure load balancing
Parallel Graph Processing: Challenges
• Structure-driven computation
  - Storage and data transfer issues
• Irregular graph structure and computation model
  - Storage and data/computation partitioning issues
  - Partitioning vs. load/resource balancing
Parallel Graph Processing: Opportunities
• Extend existing paradigms
  - Vertex-centric
  - Edge-centric
• Build new frameworks for parallel graph processing
  - Single-machine solutions: GraphLego [ACM HPDC 2015], GraphTwist [VLDB 2015]
  - Distributed approaches: GraphMap [IEEE SC 2015], PathGraph [IEEE SC 2014]
Build New Graph Frameworks: Key Requirements/Challenges
• Less pre-processing
• Low and load-balanced computation
• Low and load-balanced communication
• Low memory footprint
• Scalable with respect to cluster size and graph size
• General graph processing framework for large collections of graph computation algorithms and applications
Graph Operations: Two Distinct Classes
Iterative graph algorithms
• Each execution consists of a set of iterations
• In each iteration, vertex (or edge) values are updated
• All (or most) vertices participate in the execution
• Examples: PageRank, shortest paths (SSSP), connected components
• Systems: Pregel, GraphLab, GraphChi, X-Stream, GraphX, Pregelix
Graph pattern queries
• Subgraph matching problem
• Requires fast query response time
• Explores a small fraction of the entire graph
• Examples: friends-of-friends, triangle patterns
• Systems: RDF-3X, TripleBit, SHAPE
(Publication callouts on the slide: VLDB 2014, IEEE SC 2015)
Distributed Approaches to Parallel Graph Processing
SHAPE: a distributed RDF system with semantic hash partitioning
• Graph pattern queries
• Semantic hash partitioning
• Distributed RDF query processing
• Experiments
GraphMap: scalable iterative graph computations
• Iterative graph computations
• GraphMap approaches
• Experiments
What Are Iterative Graph Algorithms?
• Each execution consists of a set of iterations
• In each iteration, vertex (or edge) values are updated
• All (or most) vertices participate in the operations
• Examples: PageRank, shortest paths (SSSP), connected components
• Systems: Google's Pregel, GraphLab, GraphChi, X-Stream, GraphX, Pregelix
(Figures: SSSP and Connected Components examples; source: AMPLab)
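To make the vertex-centric model behind these systems concrete, here is a minimal single-source shortest paths (SSSP) sketch written against the GraphX Pregel operator. This is my own illustration, not part of the original slides; it assumes a local Spark installation and uses a tiny hard-coded toy graph.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object SsspSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sssp-sketch").setMaster("local[*]"))

    // Toy graph: each tuple is a directed edge (src, dst); every edge gets unit weight (attr = 1).
    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (2L, 4L), (3L, 4L)))
    val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)

    // Vertex value = best known distance from the source; infinity everywhere except the source.
    val sourceId: VertexId = 1L
    val init = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = Pregel(init, Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),   // vertex program: keep the shorter distance
      triplet =>                                        // send a message only along edges that improve the target
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b)                          // merge concurrent messages: take the minimum
    )

    sssp.vertices.collect().sorted.foreach(println)     // (vertexId, shortest distance) pairs
    sc.stop()
  }
}
```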
Why Is Iterative Graph Processing So Difficult?
Huge and growing size of graph data
- Makes it hard to store and handle the data on a single machine
Poor locality (many random accesses)
- Each vertex depends on its neighboring vertices, recursively
Huge size of intermediate data for each iteration
- Requires additional computing and storage resources
Heterogeneous graph algorithms
- Different algorithms have different computation and access patterns
High-degree vertices
- Make it hard to ensure load balancing
The Problems of Current Computation Models
• "Ghost" vertices maintain adjacency structure and replicate remote data
• Too many interactions among partitions
("Ghost" vertices are shown in the accompanying figure)
Iterative Graph Algorithm Example: Connected Components
(Figure source: Apache Flink)
Why Don't We Use MapReduce?
Of course, we can use MapReduce! The first iteration of Connected Components for this graph (four vertices with edges 1-2, 2-3, 2-4, 3-4; each vertex initially labeled with its own ID) would be:

Map: each vertex emits its label to every neighbor (key = neighbor, value = label)
  vertex 1: (2, 1)
  vertex 2: (1, 2), (3, 2), (4, 2)
  vertex 3: (2, 3), (4, 3)
  vertex 4: (2, 4), (3, 4)

Reduce: each vertex takes the minimum of its own label and the received values
  Key | Values  | Min
   1  | 2       | 1
   2  | 1, 3, 4 | 1
   3  | 2, 4    | 2
   4  | 2, 3    | 2
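The same single iteration can be expressed with plain Scala collections purely to show the roles of the map and reduce phases. This is an illustrative sketch, not the slides' code and not an actual Hadoop job.

```scala
object CcIterationSketch {
  def main(args: Array[String]): Unit = {
    // Undirected toy graph from the slide; every vertex starts labeled with its own ID.
    val edges  = Seq((1, 2), (2, 3), (2, 4), (3, 4))
    val labels = Map(1 -> 1, 2 -> 2, 3 -> 3, 4 -> 4)

    // Map phase: every vertex sends its current label to each neighbor (both directions of an edge).
    val messages = edges.flatMap { case (u, v) => Seq((v, labels(u)), (u, labels(v))) }

    // Reduce phase: every vertex keeps the minimum of its own label and all received labels.
    val updated = labels.map { case (v, own) =>
      val received = messages.collect { case (`v`, label) => label }
      v -> (own +: received).min
    }

    // After one iteration: vertex 1 -> 1, vertex 2 -> 1, vertex 3 -> 2, vertex 4 -> 2 (as in the tables above).
    updated.toSeq.sortBy(_._1).foreach(println)
  }
}
```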
Why We Shouldn't Use MapReduce
But ... in a typical MapReduce job, disk I/Os are performed in four places. So 10 iterations mean disk I/Os in 40 places.
(Figure source: http://arasan-blog.blogspot.com/)
Related Work
Distributed memory-based systems
• Messaging-based: Google Pregel, Apache Giraph, Apache Hama
• Vertex mirroring: GraphLab, PowerGraph, GraphX
• Dynamic load balancing: Mizan, GPS
• Graph-centric view: Giraph++
Disk-based systems using a single machine
• Vertex-centric model: GraphChi
• Edge-centric model: X-Stream
• Vertex-edge centric: GraphLego
With external memory
• Out-of-core capabilities (Apache Giraph, Apache Hama, GraphX)
• Not optimized for graph computations
• Users need to configure several parameters
Two Research Directions for Iterative Graph Processing Systems
Disk-based systems on a single machine
• Load a part of the input graph in memory
• Include a set of data structures and techniques to efficiently load graph data from disk
• Examples: GraphChi, X-Stream, ...
• Disadvantages: 1) relatively slow; 2) resource limitations of a single machine
Distributed memory-based systems on a cluster
• Load the whole input graph in memory
• Load all intermediate results and messages in memory
• Examples: Pregel, Giraph, Hama, GraphLab, GraphX, ...
• Disadvantages: 1) very high memory requirement; 2) coordination of distributed machines
Main Features of GraphMap
Develop GraphMap
• A distributed iterative graph computation framework that effectively utilizes secondary storage
• Goal: reduce the memory requirement of iterative graph computations while ensuring competitive (or better) performance
Main contributions
• Clear separation between mutable and read-only data
• Two-level partitioning technique for locality-optimized data placement
• Dynamic access methods based on the workloads of the current iteration
Clear Data Separation
Graph data is split into:
• Vertices and their data (mutable)
• Edges and their data (read-only)
Edge data only needs to be read (never written) in each iteration.
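A rough sketch of what this split could look like in code. The types and names below are my own illustration of the idea, not GraphMap's actual data structures.

```scala
object DataSeparationSketch {
  // Mutable per-vertex state, updated in every iteration (e.g., a PageRank score or SSSP distance).
  final case class VertexState(var value: Double, var active: Boolean)

  def main(args: Array[String]): Unit = {
    val vertexState = scala.collection.mutable.Map(
      1L -> VertexState(value = 1.0, active = true),
      2L -> VertexState(value = 1.0, active = true)
    )

    // Read-only adjacency data: never modified during the computation, so it can stay on disk
    // and be streamed in (read, never written) once per iteration.
    val outEdges: Map[Long, Vector[Long]] = Map(1L -> Vector(2L), 2L -> Vector(1L))

    vertexState(1L).value = 0.5   // only the vertex side is ever written
    println(outEdges(1L))         // the edge side is only read
  }
}
```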
Locality-Based Data Placement on Disk
How can we access disk efficiently in each iteration?
Edge access locality
• All edges (out-edges, in-edges, or bi-edges) of a vertex are accessed together to update its vertex value
→ We place all connected edges of a vertex together on disk
Vertex access locality
• All vertices in a partition are accessed by the same worker (processor) in every iteration
→ We store all vertices in a partition, and their edges, in contiguous disk blocks to utilize sequential disk accesses
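One way to realize this placement on a sorted key-value store such as HBase is to prefix each vertex's row key with its partition ID, so that a partition's vertices (and their edge lists) form one contiguous, sequentially scannable key range. The sketch below is my own illustration of that idea; GraphMap's actual key layout may differ.

```scala
import java.nio.ByteBuffer

object RowKeySketch {
  // Hypothetical row key: [2-byte partition ID][8-byte vertex ID].
  // Row keys in a sorted store are ordered lexicographically, so all vertices of one
  // partition occupy one contiguous key range and can be read with a single sequential scan.
  def rowKey(partitionId: Short, vertexId: Long): Array[Byte] =
    ByteBuffer.allocate(10).putShort(partitionId).putLong(vertexId).array()

  // Illustrative partition assignment: a simple hash of the vertex ID.
  def partitionOf(vertexId: Long, numPartitions: Int): Short =
    (((vertexId % numPartitions) + numPartitions) % numPartitions).toShort

  def main(args: Array[String]): Unit = {
    val key = rowKey(partitionOf(42L, 32), 42L)
    println(key.map(b => f"$b%02x").mkString)   // prints 000a000000000000002a
  }
}
```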
  
Dynamic Access Methods
Workloads vary across iterations and across algorithms.
• Is the current workload larger than the threshold?
  - YES → sequential disk accesses
  - NO → random disk accesses
• The threshold is dynamically configured based on actual access times, for each iteration and for each worker
(Chart: number of active vertices (x1000) over iterations 0-9 for PageRank, CC, and SSSP)
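A compact sketch of how such a per-worker, per-iteration choice could be made. This is my own illustration under simple assumptions (sequential scan cost roughly constant, random access cost roughly linear in the number of active vertices); the real GraphMap policy and calibration may differ.

```scala
object AccessMethodSketch {
  sealed trait AccessMethod
  case object SequentialScan extends AccessMethod   // scan the whole partition from disk
  case object RandomLookups  extends AccessMethod   // point-read only the active vertices

  // Initial guess; refined after every iteration from measured access times.
  private var threshold: Long = 10000L

  def choose(numActiveVertices: Long): AccessMethod =
    if (numActiveVertices > threshold) SequentialScan else RandomLookups

  // After an iteration, estimate the break-even point from measured costs:
  // with a roughly constant full-scan time and a roughly constant per-lookup time,
  // the crossover in active-vertex count is about scanTime / perLookupTime.
  def recalibrate(scanTimeMs: Double, perLookupTimeMs: Double): Unit =
    threshold = math.max(1L, (scanTimeMs / perLookupTimeMs).toLong)

  def main(args: Array[String]): Unit = {
    recalibrate(scanTimeMs = 2000.0, perLookupTimeMs = 0.5)   // threshold becomes 4000
    println(choose(120000))   // SequentialScan (large active set, as in early PageRank iterations)
    println(choose(300))      // RandomLookups  (small active set, as in late SSSP iterations)
  }
}
```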
Experiments
First prototype of GraphMap
• BSP engine and messaging engine: utilize Apache Hama
• Disk storage: utilize Apache HBase (a two-dimensional key-value store)
Settings
• Cluster of 21 machines on Emulab
• 12 GB RAM, Xeon E5530, 500 GB and 250 GB SATA disks per machine
• Connected via a 1 GigE network
• HBase (ver. 0.96) on HDFS of Hadoop (ver. 1.0.4)
• Hama (ver. 0.6.3)
Iterative graph algorithms
• 1) PageRank (10 iterations), 2) SSSP, 3) CC
Execution Time
Analysis
• Hama fails for large graphs with more than 900M edges, while GraphMap still works
• In all the cases, GraphMap is faster (up to 6 times) than Hama, which is the in-memory system
Breakdown of GraphMap Execution Time
(Figures: PageRank, SSSP, and CC on uk-2005)
Analysis
• For PageRank, all iterations have similar results except the first and last
• For SSSP, iterations 5-15 utilize sequential disk accesses based on our dynamic selection
• For CC, random disk accesses are selected from iteration 24
Effects of Dynamic Access Methods
Analysis
• GraphMap chooses the optimal access method in most of the iterations
• Possible further improvement through fine-tuning in iterations 5 and 15
• For cit-Patents, GraphMap always chooses random accesses because only 3.3% of vertices are reachable from the start vertex, so the number of active vertices is always small
(Chart: computation time in seconds per iteration over about 40 iterations, comparing sequential, random, and dynamic access methods)
Comparing GraphMap with Other Systems
• Existing distributed graph systems are all in-memory systems. In addition to Hama, we give a relative comparison with a few other representative systems.
• GraphMap setup: 12 GB DRAM per node on a 21-node cluster, i.e., 252 GB of distributed shared memory; the slide notes the compared systems at 5x DRAM per node.
Graph Datasets
Social LiveJournal (LJ) graph: vertices are members, edges are friendship.
Graph datasets (stored in HDFS):
• cit-Patents (raw size: 268 MB): 3.8M vertices, 16.5M edges
• soc-LiveJournal1 (raw size: 1.1 GB): 4.8M vertices, 69M edges
Our Initial Experience with Spark / GraphX
• Cluster setting
  - 6 machines (1 master & 5 slaves)
• Spark setting
  - Spark shell (i.e., we did NOT implement any Spark application yet)
• Built-in PageRank function of GraphX
  - All 40 cores (= 8 cores x 5 slaves)
  - Portion of memory for RDD storage: 0.52 (by default)
    - If we assign 512 MB to each executor, about 265 MB (roughly 512 MB x 0.52) is dedicated to RDD storage
User-Defined Partitioning (soc-LiveJournal1)
• Implemented two hash partitioning techniques using the GraphX API: 1) by source vertex IDs, 2) by destination vertex IDs
• Times below are per run; parentheses give shuffle size and (used / total) cluster memory; OOM = out of memory

5 GB slave RAM & 40 partitions:
  Partitioning              | Loading                | 1st PageRank (10 iter.)       | 2nd PageRank (10 iter.)       | 3rd PageRank (10 iter.)
  Hashing by src. vertices  | 10s (2.1GB)            | 196s (7.2GB) (12.3 / 13.2GB)  | 156s (7.2GB) (12.1 / 13.2GB)  | OOM
  Hashing by dst. vertices  | 10s (2.1GB)            | 168s (5.8GB) (12.5 / 13.2GB)  | OOM                           |
  EdgePartition2D           | 10s (2.1GB)            | 199s (8GB) (12.1 / 13.2GB)    | OOM                           |
  EdgePartition1D           | 13s (2.1GB, outlier?)  | 188s (6.9GB) (12.2 / 13.2GB)  | 173s (7.2GB) (12.3 / 13.2GB)  | OOM

10 GB slave RAM & 40 partitions:
  Partitioning              | Loading      | 1st PageRank (10 iter.)        | 2nd PageRank (10 iter.)      | 3rd PageRank (10 iter.)
  Hashing by src. vertices  | 10s (2.1GB)  | 189s (13.2GB) (17.5 / 26.1GB)  | 186s (18.2GB) (23 / 26.1GB)  | OOM
  Hashing by dst. vertices  | 11s (2.1GB)  | 223s (15.9GB) (21.1 / 26.1GB)  | OOM                          |
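For reference, a hash-by-source-vertex scheme like the one described above can be written against the public GraphX API roughly as follows. This is a minimal sketch assuming a custom PartitionStrategy implementation is acceptable; the usage lines in the comment are placeholders, not the authors' code.

```scala
import org.apache.spark.graphx._

// Hypothetical partitioner: place each edge by hashing its source vertex ID,
// so that all out-edges of a vertex land in the same edge partition.
object HashBySourceVertex extends PartitionStrategy {
  override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
    (math.abs(src) % numParts).toInt
}

// Usage sketch (assuming `graph` is an already loaded GraphX graph):
//   val repartitioned = graph.partitionBy(HashBySourceVertex, 40)   // 40 partitions, as in the tables above
//   val ranks = repartitioned.staticPageRank(10).vertices           // 10 PageRank iterations
```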
Spark/GraphX Experience and Messaging Cost
• Our initial experience with Spark
  - Spark performs well with large per-node DRAM (>= 68 GB), as reported in the Spark/GraphX paper
  - It does not perform well on clusters whose nodes have smaller DRAM
• Messaging overheads
  - Distributed graph processing systems do not scale as the number of nodes increases, due to the messaging cost among compute nodes in the cluster to synchronize the computation in each iteration round
Summary
GraphMap
• Distributed iterative graph computation framework that effectively utilizes secondary storage
• Clear separation between mutable and read-only data
• Locality-based data placement on disk
• Dynamic access methods based on the workloads of the current iteration
Ongoing research
• Disk and worker colocation to improve disk access performance
• Efficient and lightweight partitioning techniques, incorporating our work on GraphLego for single-PC graph processing [ACM HPDC 2015]
• Comparing with Spark/GraphX on a larger-DRAM cluster
General-Purpose Distributed Graph System
Existing state of the art
• Separate efforts for the two representative graph operations
• Separate efforts for scale-up and scale-out systems
Challenges for developing a general-purpose graph processing system
• Different data access patterns / graph computation models
• Different inter-node communication effects
Possible directions
• Graph summarization techniques
• Lightweight graph partitioning techniques
• Optimized data storage systems and access methods
Visit:
https://sites.google.com/site/gtshape/
https://sites.google.com/site/git_GraphLego/
https://sites.google.com/site/git_GraphTwist/