Apache Giraph

An introduction to Apache Giraph.

    Apache Giraph Presentation Transcript

    • A Distributed Graph-Processing Library Ahmet Emre Aladağ - AGMLab 26.08.2013
    • What is Giraph? ● Library for large-scale graph processing. ● Runs on Apache Hadoop with map jobs. ● Bulk Synchronous Parallel (BSP) model. [Figure: a vertex computation with incoming and outgoing messages]
    • Uses ● PageRank-variant iterative algorithms ● Graph clustering ○ Label propagation ○ Max Clique ○ Triangle Closure ○ Finding related people, groups, interests. ● Shortest-Path ○ Single source, s-t, all to all ● Finding Connected Components
    • Alternatives ● Map-Reduce jobs on Hadoop ○ Not a good fit for graph algorithms: overhead. ● Google Pregel ○ Requires its own infrastructure ○ Not available ○ Master is single point of failure. ● Message Passing Interface (MPI) ○ Not fault-tolerant ○ Too generic
    • How Giraph differs ● You can use a Hadoop cluster, no need for special infrastructure. ● Easy deployment with Amazon EMR ● Dynamic resource management ● Graph oriented API ● Open Source ● Fault Tolerant, no SPOF except Hadoop namenode and jobtracker ● Jython Support
    • Layers
    • Mechanism: Input (InputFormat/Reader) → Computation → Output (OutputFormat/Writer). ● Input sources: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j, etc. ● Output targets: Accumulo, HBase, HCatalog, HDFS, Hive, Neo4j, GraphViz, etc. ● Data formats: adjacency matrix, id-value pairs, JSON.
    • InputFormat ● VertexInputFormat: one vertex per line as id;value, e.g. 1;3.4 2;6.1 3;2.7 ● EdgeInputFormat: one edge per line as source;target, e.g. 1;2 2;3 1;3 [Figure: the resulting three-vertex graph with values 3.4, 6.1, 2.7]
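
    As a rough sketch of how such an id;value vertex file could be read, here is a custom input format built on Giraph's TextVertexInputFormat / TextVertexReaderFromEachLine pattern (the one the bundled example formats use). The class name SemicolonVertexValueInputFormat is made up for this illustration, and base-class signatures can differ between Giraph versions.

        import java.io.IOException;
        import java.util.Collections;
        import org.apache.giraph.edge.Edge;
        import org.apache.giraph.io.formats.TextVertexInputFormat;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;

        /** Sketch only: parses "id;value" lines such as "1;3.4" into vertices without edges. */
        public class SemicolonVertexValueInputFormat
            extends TextVertexInputFormat<LongWritable, DoubleWritable, NullWritable> {

          @Override
          public TextVertexReader createVertexReader(InputSplit split, TaskAttemptContext context) {
            return new SemicolonVertexReader();
          }

          private class SemicolonVertexReader extends TextVertexReaderFromEachLine {
            @Override
            protected LongWritable getId(Text line) throws IOException {
              return new LongWritable(Long.parseLong(line.toString().split(";")[0]));
            }

            @Override
            protected DoubleWritable getValue(Text line) throws IOException {
              return new DoubleWritable(Double.parseDouble(line.toString().split(";")[1]));
            }

            @Override
            protected Iterable<Edge<LongWritable, NullWritable>> getEdges(Text line) {
              // Edges come from a separate EdgeInputFormat, so none are created here.
              return Collections.emptyList();
            }
          }
        }

    The matching source;target edge file would be handled by an EdgeInputFormat in the same spirit.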
    • Computation ● Superstep barriers. ● Send/receive messages to/from neighbors. ● Update value. ● Vote to halt or wake up. [Figure: single-source shortest path example]
    • Shortest-Path Computation Code Note: old API
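
    As a stand-in for the code shown on the slide, here is a minimal sketch in the spirit of Giraph's bundled SimpleShortestPathsComputation example, written against the newer BasicComputation API rather than the old API the slide refers to; treat names and signatures as approximate for your Giraph version, and the hard-coded source id as an assumption of the sketch.

        import java.io.IOException;
        import org.apache.giraph.edge.Edge;
        import org.apache.giraph.graph.BasicComputation;
        import org.apache.giraph.graph.Vertex;
        import org.apache.hadoop.io.DoubleWritable;
        import org.apache.hadoop.io.FloatWritable;
        import org.apache.hadoop.io.LongWritable;

        /** Sketch of single-source shortest paths in the vertex-centric BSP style. */
        public class ShortestPathsSketch extends BasicComputation<
            LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

          private static final long SOURCE_ID = 1;  // source vertex, hard-coded for the sketch

          @Override
          public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                              Iterable<DoubleWritable> messages) throws IOException {
            if (getSuperstep() == 0) {
              vertex.setValue(new DoubleWritable(Double.MAX_VALUE));  // "infinity"
            }
            // Best distance seen so far: 0 at the source, otherwise the smallest message.
            double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
            for (DoubleWritable message : messages) {
              minDist = Math.min(minDist, message.get());
            }
            // A shorter path was found: update the value and notify the neighbors.
            if (minDist < vertex.getValue().get()) {
              vertex.setValue(new DoubleWritable(minDist));
              for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                sendMessage(edge.getTargetVertexId(),
                    new DoubleWritable(minDist + edge.getValue().get()));
              }
            }
            vertex.voteToHalt();  // a message in a later superstep wakes the vertex up again
          }
        }

    In each superstep a vertex keeps the smallest distance it has seen; the job finishes once every vertex has voted to halt and no messages remain in flight.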
    • Ex: Finding the maximum value
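
    A hedged sketch of the maximum-value idea, with assumed type choices: each vertex adopts the largest value it has heard about and forwards it, so the global maximum eventually reaches every vertex.

        import java.io.IOException;
        import org.apache.giraph.graph.BasicComputation;
        import org.apache.giraph.graph.Vertex;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.NullWritable;

        /** Sketch: flood the maximum vertex value through the graph. */
        public class MaxValueSketch extends BasicComputation<
            LongWritable, LongWritable, NullWritable, LongWritable> {

          @Override
          public void compute(Vertex<LongWritable, LongWritable, NullWritable> vertex,
                              Iterable<LongWritable> messages) throws IOException {
            boolean changed = (getSuperstep() == 0);  // everyone announces its value first
            for (LongWritable message : messages) {
              if (message.get() > vertex.getValue().get()) {
                vertex.setValue(new LongWritable(message.get()));
                changed = true;
              }
            }
            if (changed) {
              sendMessageToAllEdges(vertex, vertex.getValue());  // pass the new maximum on
            }
            vertex.voteToHalt();
          }
        }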
    • Aggregators ● Shared variables among the workers. ● Each vertex computation can add/multiply a value into an aggregator. ● Examples: ○ Holding the min/max value among all vertices. ○ Holding the sum of the vertex values. ○ Holding the average of the vertex values. ○ Holding the sum of mean squared errors and the standard deviation. [Figure: aggregating vertex values at iteration k]
    • MasterCompute Class ● The master's compute() always runs before the workers' (like a pre-superstep hook). ○ In the vertex compute(): aggregate vertex values, e.g. a sum of values. ○ In MasterCompute: average = sum / N. ● Aggregators are registered here. ● You can also set aggregator values from here.
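
    A minimal sketch tying the last two slides together: the master registers a sum aggregator, every vertex contributes its value with aggregate(), and the master turns the sum into an average in the next superstep. AverageMasterCompute and the aggregator name "value.sum" are made up for the illustration; the calls themselves (registerAggregator, aggregate, getAggregatedValue) are the 1.x aggregator interfaces.

        import org.apache.giraph.aggregators.DoubleSumAggregator;
        import org.apache.giraph.master.DefaultMasterCompute;
        import org.apache.hadoop.io.DoubleWritable;

        /** Sketch: the master registers and reads a shared sum aggregator. */
        public class AverageMasterCompute extends DefaultMasterCompute {
          public static final String SUM_AGG = "value.sum";  // hypothetical aggregator name

          @Override
          public void initialize() throws InstantiationException, IllegalAccessException {
            registerAggregator(SUM_AGG, DoubleSumAggregator.class);
          }

          @Override
          public void compute() {
            // Runs before the workers in each superstep: read the previous superstep's sum
            // and derive the average as sum / N.
            DoubleWritable sum = getAggregatedValue(SUM_AGG);
            long n = getTotalNumVertices();
            double average = (n > 0) ? sum.get() / n : 0.0;
            // ... the average could be pushed back out through another aggregator ...
          }
        }

    On the worker side, a vertex computation would contribute with aggregate(SUM_AGG, new DoubleWritable(vertex.getValue().get())) and read the result in the following superstep with getAggregatedValue(SUM_AGG).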
    • Worker Context ● Allows for the execution of user code on a per-worker basis. ● There's one WorkerContext per worker. ● Methods for Pre/post superstep/application operations.
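
    As a rough illustration of those per-worker hooks, a skeleton WorkerContext; the four methods below are the ones the slide alludes to, but exact signatures should be checked against your Giraph version.

        import org.apache.giraph.worker.WorkerContext;

        /** Sketch: exactly one instance of this runs on each worker (not per vertex). */
        public class LoggingWorkerContext extends WorkerContext {
          @Override
          public void preApplication() throws InstantiationException, IllegalAccessException {
            // Once per worker, before the first superstep: open connections, warm caches, etc.
          }

          @Override
          public void preSuperstep() {
            // On every worker, before each superstep's vertex computations.
          }

          @Override
          public void postSuperstep() {
            // On every worker, after each superstep's vertex computations.
          }

          @Override
          public void postApplication() {
            // Once per worker, after the last superstep: flush and close resources.
          }
        }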
    • Flexible Edge/Vertex Input ● Read edges/vertices from different sources. ● Multiple input resources
    • Parallel Computing ● More map jobs (workers) = more parallel computing. ● To overcome the slowest-worker problem, multithreading is applied to input/computation/output. ● Linear speedup in CPU-bound applications such as k-means clustering thanks to multithreading. ● Take a set of entire machines and use multithreading to maximize resource utilization.
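
    A small configuration sketch of the per-worker thread knobs this slide refers to. The option keys below (giraph.numComputeThreads, giraph.numInputThreads, giraph.numOutputThreads) are written from memory and should be verified against GiraphConstants for your version; GiraphConfiguration is a Hadoop Configuration, so plain setInt works even where typed setters differ.

        import org.apache.giraph.conf.GiraphConfiguration;

        /** Sketch: raise per-worker parallelism for input, computation, and output. */
        public class ThreadTuningSketch {
          public static void main(String[] args) {
            GiraphConfiguration conf = new GiraphConfiguration();
            conf.setInt("giraph.numComputeThreads", 4);  // vertex computation threads per worker
            conf.setInt("giraph.numInputThreads", 4);    // input loading threads per worker
            conf.setInt("giraph.numOutputThreads", 4);   // output writing threads per worker
            // 'conf' would then be passed to the Giraph job before submission.
          }
        }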
    • Memory Optimization ● Vertices and edges are stored as serialized byte arrays. ● FastUtil-based Java primitive collections are used.
    • Sharded Aggregators ● Each aggregator is randomly assigned to one of the workers. ● The assigned worker is in charge of gathering the values of its aggregators from all workers, performing the aggregation, and distributing the final values to other workers. ● Aggregation responsibilities are balanced across all workers rather than bottlenecked by the master.
    • Performance ● PageRank on 1 trillion edges with 200 commodity machines: 4 minutes/iteration. ● K-means on 1 billion input vectors x 100 features into 10,000 centroids: 10 minutes. ● Linear scalability.
    • Currently ● Version 1.0, on the way to 1.1 ● Changing rapidly: backwards-incompatible changes ● Documentation not mature yet. ● More algorithms to be contributed. ● More data sources to be ported. ● http://giraph.apache.org for more info
    • References ● Giraph: Large-scale graph processing infrastructure on Hadoop, 2011. ● Scaling Apache Giraph to a trillion edges, Avery Ching, Facebook, 2013. ● Scaling Apache Giraph, Nitay Joffe, Facebook, 2013. ● Giraph: http://giraph.apache.org
    • Questions?