Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX

2,738 views

Published on

Graph relationships are everywhere. In fact, more often than not, analyzing relationships between points in your datasets lets you extract more business value from your data.

Consider social graphs, or relationships of customers to each other and products they purchase, as two of the most common examples. Now, if you think you have a scalability issue just analyzing points in your datasets, imagine what would happen if you wanted to start analyzing the arbitrary relationships between those data points: the amount of potential processing will increase dramatically, and the kind of algorithms you would typically want to run would change as well.

If your Hadoop batch-oriented approach with MapReduce works reasonably well, for scalable graph processing you have to embrace an in-memory, explorative, and iterative approach. One of the best ways to tame this complexity is known as the Bulk synchronous parallel approach. Its two most widely used implementations are available as Hadoop ecosystem projects: Apache Giraph (used at Facebook), and Apache GraphX (as part of a Spark project).

In this talk we will focus on practical advice on how to get up and running with Apache Giraph and GraphX; start analyzing simple datasets with built­-in algorithms; and finally how to implement your own graph processing applications using the APIs provided by the projects. We will finally compare and contrast the two, and try to lay out some principles of when to use one vs. the other.

Published in: Software
  • Be the first to comment

Introduction into scalable graph analysis with Apache Giraph and Spark GraphX

  1. 1. Scalable graph analysis with Apache Giraph and Spark GraphX Roman Shaposhnik rvs@apache.org @rhatr Director of Open Source, Pivotal Inc.
  2. 2. Introduction into scalable graph analysis with Apache Giraph and Spark GraphX Roman Shaposhnik rvs@apache.org @rhatr Director of Open Source, Pivotal Inc.
  3. 3. Shameless plug #1
  4. 4. Shameless plug #1
  5. 5. Agenda:
  6. 6. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee
  7. 7. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee
  8. 8. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee 2 1
  9. 9. Lets define some terms •  Graph is a G = (V, E), where E VxV •  Directed multigraphs with properties attached to each vertex and edge foo bar fee 2 1 foo bar fee 42 fum
  10. 10. What kind of graphs are we talking about? • Page ranking on Facebook social graph (mid 2013) •  10^9 (billions) vertices •  10^12 (trillion) edges •  10^15 (petabtybe) cold storage data scale •  200 servers •  …all in under 4 minutes!
  11. 11. “On day one Doug created HDFS and MapReduce”
  12. 12. Google papers that started it all • GFS (file system) •  distributed •  replicated •  non-POSIX" • MapReduce (computational framework) •  distributed •  batch-oriented (long jobs; final results) •  data-gravity aware •  designed for “embarrassingly parallel” algorithms
  13. 13. HDFS pools and abstracts direct-attached storage … HDFS MR MR
  14. 14. A Unix analogy § It is as though instead of: $  grep  foo  bar.txt  |  tr  “,”  “  “  |  sort  -­‐u     § We are doing: $  grep  foo  <  bar.txt  >  /tmp/1.txt   $  tr  “,”  “  “    <  /tmp/1.txt  >  /tmp/2.txt   $  sort  –u  <  /tmp/2.txt  
  15. 15. Enter Apache Spark
  16. 16. RAM is the new disk, Disk is the new tape Source: UC Berkeley Spark project (just the image)
  17. 17. RDDs instead of HDFS files, RAM instead of Disk warnings = textFile(…).filter(_.contains(“warning”)) .map(_.split(‘ ‘)(1)) HadoopRDD

×