Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!

The genesis of Hadoop was in analyzing massive amounts of data with a MapReduce framework. SQL-on-Hadoop followed shortly after that, paving the way to the whole schema-on-read notion. Discovering graph relationships in your data is the next logical step. Apache Giraph (modeled on Google’s Pregel) lets you apply the power of the BSP approach to unstructured data. In this talk we will focus on practical advice on how to get up and running with Apache Giraph, start analyzing simple data sets with built-in algorithms, and finally how to implement your own graph processing applications using the APIs provided by the project. We will then dive into how Giraph integrates with the Hadoop ecosystem (Hive, HBase, Accumulo, etc.) and will also provide a whirlwind tour of Giraph architecture.

  1. Apache Giraph: start analyzing graph relationships in your bigdata in 45 minutes (or your money back)!
  2. Who’s this guy?
  3. Roman Shaposhnik
     • ASF junkie
     • VP of Apache Incubator, former VP of Apache Bigtop
     • Hadoop/Sqoop/Giraph committer
     • Contributor across the Hadoop ecosystem
     • Used to be root@Cloudera
     • Used to be a PHB at Yahoo!
     • Used to be a UNIX hacker at Sun Microsystems
  4. Giraph in action (MEAP): http://manning.com/martella/
  5. What’s this all about?
  6. Agenda
     • A brief history of Hadoop-based bigdata management
     • Extracting graph relationships from unstructured data
     • A case for iterative and explorative workloads
     • Bulk Synchronous Parallel to the rescue!
     • Apache Giraph: a Hadoop-based BSP graph analysis framework
     • Giraph application development
     • Demos! Code! Lots of it!
  7. On day one Doug created HDFS/MR
  8. Google papers
     • GFS (file system)
       • distributed
       • replicated
       • non-POSIX
     • MapReduce (computational framework)
       • distributed
       • batch-oriented (long jobs; final results)
       • data-gravity aware
       • designed for “embarrassingly parallel” algorithms
  9. One size doesn’t fit all
     • Key-value approach
       • map is how we get the keys
       • shuffle is how we sort the keys
       • reduce is how we get to see all the values for a key
     • Pipeline approach
       • intermediate results in a pipeline need to be flushed to HDFS
     • A very particular “API” for working with your data
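The key-value contract described on this slide can be sketched in plain Java, with no Hadoop APIs involved (the class and method names below are illustrative, not Hadoop’s):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy word count in the key-value style: map emits pairs, shuffle
// groups and sorts them by key, reduce folds the values for each key.
class KeyValueSketch {
  // "map": emit a (word, 1) pair for every word of every input line
  static List<Map.Entry<String, Integer>> map(List<String> lines) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String line : lines)
      for (String word : line.split("\\s+"))
        pairs.add(Map.entry(word, 1));
    return pairs;
  }

  // "shuffle": sort by key and bring all values for a key together
  static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
    SortedMap<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs)
      grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
    return grouped;
  }

  // "reduce": see all the values for one key at once and fold them
  static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
    SortedMap<String, Integer> counts = new TreeMap<>();
    grouped.forEach((key, values) ->
        counts.put(key, values.stream().mapToInt(Integer::intValue).sum()));
    return counts;
  }

  static SortedMap<String, Integer> wordCount(List<String> lines) {
    return reduce(shuffle(map(lines)));
  }
}
```

The explicit shuffle stage is what buys reduce its guarantee: all values for one key arrive together, already grouped; that guarantee is also exactly what makes the model awkward for iterative graph algorithms.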
  10. It’s not about the size of your data; it’s about what you do with it!
  11. Graph relationships
     • Entities in your data: tuples
       • customer data
       • product data
       • interaction data
     • Connections between entities: graphs
       • social network of my customers
       • clustering of customers vs. products
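As a toy illustration of how graph relationships fall out of flat tuples (plain Java; all names here are hypothetical): connect two customers whenever their interaction tuples mention the same product:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Derive a customer-to-customer graph from flat (customer, product)
// interaction tuples: two customers are connected when they share a product.
class TuplesToGraph {
  // group the tuples: which customers touched each product?
  static Map<String, Set<String>> customersByProduct(List<String[]> purchases) {
    Map<String, Set<String>> byProduct = new HashMap<>();
    for (String[] t : purchases)  // t[0] = customer, t[1] = product
      byProduct.computeIfAbsent(t[1], k -> new TreeSet<>()).add(t[0]);
    return byProduct;
  }

  // turn each shared product into edges between its buyers
  static Map<String, Set<String>> customerGraph(List<String[]> purchases) {
    Map<String, Set<String>> graph = new TreeMap<>();
    for (Set<String> buyers : customersByProduct(purchases).values())
      for (String a : buyers)
        for (String b : buyers)
          if (!a.equals(b))
            graph.computeIfAbsent(a, k -> new TreeSet<>()).add(b);
    return graph;
  }
}
```

The inner double loop is the combinatorial explosion the next slide warns about: a product shared by n customers produces on the order of n² edges.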
  12. Challenges
     • Data is dynamic
       • no way of doing “schema on write”
     • Combinatorial explosion of datasets
       • relationships grow exponentially
     • Algorithms become
       • explorative
       • iterative
  13. Graph databases
     • Plenty available
       • Neo4J, Titan, etc.
     • Benefits
       • tightly integrated systems with few moving parts
       • high performance on known data sets
     • Shortcomings
       • don’t integrate with HDFS
       • combine storage and computational layers
       • a sea of APIs
  14. Enter Apache Giraph
  15. Key insights
     • Keep state in memory for as long as needed
     • Leverage HDFS as a repository for unstructured data
     • Allow for maximum parallelism (shared nothing)
     • Allow for arbitrary communications
     • Leverage BSP approach
  16. Bulk Synchronous Parallel
     (diagram: time flows through alternating phases of local processing and communications, separated by barrier #1, barrier #2, barrier #3)
  17. BSP applied to graphs
     Think like a vertex:
     • I know my local state
     • I know my neighbours
     • I can send messages to vertices
     • I can declare that I am done
     • I can mutate graph topology
  18. Bulk Sequential Processing
     (diagram: each superstep is message delivery followed by individual vertex processing & sending messages; superstep #2 starts once all vertices are “done” with superstep #1)
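The superstep cycle on the last two slides can be simulated in a few lines of plain, single-threaded Java (this is not the Giraph API; the names are mine). Each loop iteration is one superstep: deliver last round’s messages, run every vertex, collect outgoing messages, and hit the barrier. A vertex whose value doesn’t change sends nothing, which plays the role of voting to halt; the run ends when no messages are in flight. The example propagates the maximum vertex id through the graph:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-threaded BSP sketch: every vertex floods the largest id it
// has seen; the computation converges when no messages remain.
class BspSketch {
  static Map<Integer, Integer> maxId(Map<Integer, List<Integer>> edges) {
    Map<Integer, Integer> value = new HashMap<>();        // per-vertex state
    Map<Integer, List<Integer>> inbox = new HashMap<>();  // messages to deliver
    for (Integer v : edges.keySet()) {
      value.put(v, v);
      inbox.put(v, new ArrayList<>());
    }
    boolean first = true;
    while (first || inbox.values().stream().anyMatch(m -> !m.isEmpty())) {
      Map<Integer, List<Integer>> next = new HashMap<>(); // messages for superstep N+1
      for (Integer v : edges.keySet()) next.put(v, new ArrayList<>());
      for (Integer v : edges.keySet()) {                  // "compute" every vertex
        int best = value.get(v);
        for (int m : inbox.get(v)) best = Math.max(best, m);
        if (first || best > value.get(v)) {               // changed: notify neighbours
          value.put(v, best);
          for (Integer nb : edges.get(v)) next.get(nb).add(best);
        }                                                 // unchanged: stay silent ("halt")
      }
      inbox = next;                                       // the barrier: swap queues
      first = false;
    }
    return value;
  }
}
```

The swap of `inbox` for `next` is the barrier in miniature: nothing sent in superstep N is visible until superstep N+1, which is what makes BSP deterministic without locks.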
  19. Giraph “Hello World”

     public class GiraphHelloWorld extends
         BasicComputation<IntWritable, IntWritable, NullWritable, NullWritable> {
       public void compute(Vertex<IntWritable, IntWritable, NullWritable> vertex,
           Iterable<NullWritable> messages) {
         System.out.println("Hello world from the: " + vertex.getId() + " : ");
         for (Edge<IntWritable, NullWritable> e : vertex.getEdges()) {
           System.out.println("  " + e.getTargetVertexId());
         }
         System.out.println("");
       }
     }
  20. Mighty four of Giraph API

     BasicComputation<IntWritable,  // VertexID    -- vertex ref
                      IntWritable,  // VertexData  -- a vertex datum
                      NullWritable, // EdgeData    -- an edge label datum
                      NullWritable> // MessageData -- message payload
  21. On circles and arrows
     • You don’t even need a graph to begin with!
       • Well, ok, you need at least one node
     • Dynamic extraction of relationships
       • EdgeInputFormat
       • VertexInputFormat
     • Full integration with Hadoop ecosystem
       • HBase/Accumulo, Gora, Hive/HCatalog
  22. Anatomy of Giraph run
     (diagram: vertex data flows from HDFS through an InputFormat into mappers, then reducers, then back to HDFS through an OutputFormat)
  23. Anatomy of Giraph run
     (diagram: the same run hosted entirely in mappers or YARN containers, with the InputFormat and OutputFormat on either side)
  24. A vertex view
     (diagram: a vertex holds a VertexID and VertexData, carries EdgeData on each of its edges, and receives MessageData1, MessageData2)
  25. Turning Twitter into Facebook
     (diagram: one-way follower edges between @rhatr, @TheASF and @c0sin become mutual connections)
  26. Ping thy neighbours

     public void compute(Vertex<Text, DoubleWritable, DoubleWritable> vertex,
         Iterable<Text> ms) {
       if (getSuperstep() == 0) {
         sendMessageToAllEdges(vertex, vertex.getId());
       } else {
         for (Text m : ms) {
           if (vertex.getEdgeValue(m) == null) {
             vertex.addEdge(EdgeFactory.create(m, SYNTHETIC_EDGE));
           }
         }
       }
       vertex.voteToHalt();
     }
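Stripped of the Giraph machinery, those two supersteps symmetrize the follower graph: every directed edge gains its reverse. A plain-Java restatement (class and method names are mine, not Giraph’s):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// What "Ping thy neighbours" computes: superstep 0 sends each vertex id
// along its out-edges; superstep 1 adds a reverse edge for every sender
// that is not already a neighbour. Net effect: a symmetric graph.
class Symmetrize {
  static Map<String, Set<String>> run(Map<String, Set<String>> follows) {
    Map<String, Set<String>> out = new TreeMap<>();
    follows.forEach((v, nbs) -> out.put(v, new TreeSet<>(nbs)));
    // the "message" round-trip: each vertex learns who points at it
    follows.forEach((v, nbs) -> {
      for (String nb : nbs)
        out.computeIfAbsent(nb, k -> new TreeSet<>()).add(v);
    });
    return out;
  }
}
```

In Giraph the same result takes two supersteps because a vertex only learns about its in-edges from the messages delivered after the first barrier.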
  27. Demo time!
  28. But I don’t have a cluster!
     • Hadoop in pseudo-distributed mode
       • all Hadoop services on the same host (different JVMs)
     • Hadoop-as-a-Service
       • Amazon’s EMR, etc.
     • Hadoop in local mode
  29. Prerequisites
     • Apache Hadoop 1.2.1
     • Apache Giraph 1.1.0-SNAPSHOT
     • Apache Maven 3.x
     • JDK 7+
  30. Setting things up

     $ curl hadoop.tar.gz | tar xzvf -
     $ git clone git://git.apache.org/giraph.git ; cd giraph
     $ mvn -Phadoop_1 package
     $ tar xzvf *dist*/*.tar.gz
     $ export HADOOP_HOME=/Users/shapor/dist/hadoop-1.2.1
     $ export GIRAPH_HOME=/Users/shapor/dist/
     $ export HADOOP_CONF_DIR=$GIRAPH_HOME/conf
     $ PATH=$HADOOP_HOME/bin:$GIRAPH_HOME/bin:$PATH
  31. Setting project up (maven)

     <dependencies>
       <dependency>
         <groupId>org.apache.giraph</groupId>
         <artifactId>giraph-core</artifactId>
         <version>1.1.0-SNAPSHOT</version>
       </dependency>
       <dependency>
         <groupId>org.apache.hadoop</groupId>
         <artifactId>hadoop-core</artifactId>
         <version>1.2.1</version>
       </dependency>
     </dependencies>
  32. Running it

     $ mvn package
     $ giraph target/*.jar giraph.GiraphHelloWorld \
         -vip src/main/resources/1 \
         -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat \
         -w 1 \
         -ca giraph.SplitMasterWorker=false,giraph.logLevel=error
  33. Testing it

     public void testNumberOfVertices() throws Exception {
       GiraphConfiguration conf = new GiraphConfiguration();
       conf.setComputationClass(GiraphHelloWorld.class);
       conf.setVertexInputFormatClass(
           TextDoubleDoubleAdjacencyListVertexInputFormat.class);
       …
       Iterable<String> results = InternalVertexRunner.run(conf, graphSeed);
       …
     }
  34. Simplified view
     (diagram: vertices 1, 2, 3 hosted in mappers or YARN containers, fed by an InputFormat and drained by an OutputFormat)
  35. Master and master compute
     (diagram: an Active Master running compute() and a Standby Master, coordinated through Zookeeper, overseeing the worker vertices)
  36. Master compute
     • Runs before slave compute()
     • Has a global view
     • A place for aggregator manipulation
  37. Aggregators
     • “Shared variables”
     • Each vertex can push values to an aggregator
     • Master compute has full control
     (diagram: vertices 1, 2, 3 push values into min: and sum: aggregators held by the Active Master)
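A minimal model of what an aggregator is (a hedged sketch, not Giraph’s actual Aggregator interface, though the method names echo it): a shared, commutative fold that vertices push into during a superstep and that the master reads at the barrier:

```java
import java.util.function.BinaryOperator;

// An aggregator as a shared fold: vertices call aggregate() during a
// superstep; master compute reads the combined value at the barrier
// and may reset it for the next superstep. The operator must be
// commutative and associative so that push order cannot matter.
class AggregatorSketch<T> {
  private final BinaryOperator<T> op;
  private final T identity;
  private T acc;

  AggregatorSketch(T identity, BinaryOperator<T> op) {
    this.identity = identity;
    this.op = op;
    this.acc = identity;
  }

  void aggregate(T value) { acc = op.apply(acc, value); } // called by vertices
  T getAggregatedValue()  { return acc; }                 // read by master compute
  void reset()            { acc = identity; }             // master resets between supersteps
}
```

For example, `new AggregatorSketch<>(0, Integer::sum)` models the sum: box on the slide and `new AggregatorSketch<>(Integer.MAX_VALUE, Math::min)` the min: box. Giraph ships ready-made sum and min/max aggregators; the master registers them and controls their values between supersteps.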
  38. Questions?