Processing Over a Billion Edges
on Apache Giraph
Hadoop Summit 2012



Avery Ching
Software Engineer
6/14/2012
Agenda

1   Motivation and Background

2   Giraph Concepts/API

3   Example Applications

4   Architecture Overview

5   Recent/Future Improvements
What is Apache Giraph?

•  Loose implementation of Google’s Pregel that runs
   as a map-only job on Hadoop

•  “Think like a vertex” that can send messages to any
   other vertex in the graph using the bulk synchronous
   parallel programming model

•  An in-memory scalable system*
 ▪    Will be enhanced with out-of-core messages/vertices to handle
      larger problem sets.
What (social) graphs are we targeting?

•  3/2012 LinkedIn has 161 million users

•  6/2012 Twitter discloses 140 million MAU

•  4/2012 Facebook declares 901 million MAU
Example applications

•  Ranking
 ▪    Popularity, importance, etc.

•  Label Propagation
 ▪    Location, school, gender, etc.

•  Community
 ▪    Groups, interests
Bulk synchronous parallel

•  Supersteps
 ▪    A global epoch followed by a global barrier where components
      do concurrent computation and send messages

•  Point-to-point messages (i.e. vertex to vertex)
 ▪    Sent during a superstep from one component to another and
      then delivered in the following superstep

•  Computation complete when all components
   complete
[Figure: processors alternate between computation and communication within each superstep; a barrier separates consecutive supersteps along the time axis]
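The superstep and barrier model above can be sketched in plain Java, with threads standing in for processors. This is a minimal illustration, not Giraph code: the class name BspSketch, the ring topology, and the max-of-inputs computation are all assumptions made for the example. Each worker reads the messages delivered from the previous superstep, computes, sends to its right-hand neighbour, and then waits at a global barrier where the messages are delivered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CyclicBarrier;

/** Hypothetical sketch: threads play the role of processors in a BSP run. */
final class BspSketch {

  /** Runs the given number of supersteps on a ring of workers; returns final values. */
  static int[] run(int workers, int supersteps) {
    int[] values = new int[workers];
    // inbox holds messages sent in the previous superstep; outbox collects new ones.
    List<ConcurrentLinkedQueue<Integer>> inbox = new ArrayList<>();
    List<ConcurrentLinkedQueue<Integer>> outbox = new ArrayList<>();
    for (int w = 0; w < workers; w++) {
      values[w] = w; // assumed initial value: the worker's own id
      inbox.add(new ConcurrentLinkedQueue<>());
      outbox.add(new ConcurrentLinkedQueue<>());
    }
    // The barrier action runs once per superstep, after every worker has arrived.
    // It delivers messages by moving each outbox into the matching inbox.
    CyclicBarrier barrier = new CyclicBarrier(workers, () -> {
      for (int i = 0; i < workers; i++) {
        inbox.get(i).clear();
        inbox.get(i).addAll(outbox.get(i));
        outbox.get(i).clear();
      }
    });
    Thread[] threads = new Thread[workers];
    for (int w = 0; w < workers; w++) {
      final int id = w;
      threads[w] = new Thread(() -> {
        try {
          for (int step = 0; step < supersteps; step++) {
            // Compute: fold in messages delivered from the previous superstep.
            for (int msg : inbox.get(id)) {
              values[id] = Math.max(values[id], msg);
            }
            // Communicate: send the current value to the right-hand neighbour.
            outbox.get((id + 1) % workers).add(values[id]);
            barrier.await();
          }
        } catch (InterruptedException | BrokenBarrierException e) {
          Thread.currentThread().interrupt();
        }
      });
      threads[w].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for workers", e);
      }
    }
    return values;
  }
}
```

With four workers, BspSketch.run(4, 4) leaves every worker holding 3: a message sent in one superstep only becomes visible in the next, so the largest value needs one superstep per hop around the ring.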
MapReduce -> Giraph
“Think like a vertex”, not a key-value pair!

MapReduce:

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException;
}

Giraph:

public class Vertex<
    I extends WritableComparable,
    V extends Writable,
    E extends Writable,
    M extends Writable> {
  void compute(Iterator<M> msgIterator);
}
Basic Giraph API
  Methods available to compute()

Immediate effect/access:

I getVertexId()
V getVertexValue()
void setVertexValue(V vertexValue)
Iterator<I> iterator()
E getEdgeValue(I targetVertexId)
boolean hasEdge(I targetVertexId)
boolean addEdge(I targetVertexId, E edgeValue)
E removeEdge(I targetVertexId)
void voteToHalt()
boolean isHalted()

Next superstep:

void sendMsg(I id, M msg)
void sendMsgToAllEdges(M msg)
void addVertexRequest(BasicVertex<I, V, E, M> vertex)
void removeVertexRequest(I vertexId)
void addEdgeRequest(I sourceVertexId, Edge<I, E> edge)
void removeEdgeRequest(I sourceVertexId, I destVertexId)
Why not implement Giraph with multiple
MapReduce jobs?
•  Too much disk, no in-memory caching, a superstep
   becomes a job!

[Figure: the input format yields splits 0-3 for the map tasks, which write intermediate files that the reduce tasks read to produce outputs 0 and 1 through the output format]
Giraph is a single Map-only job in
Hadoop
•  Hadoop is purely a resource manager for Giraph; all
   communication is done through Netty-based IPC

[Figure: the vertex input format yields splits 0-3 for the map tasks, which write outputs 0 and 1 directly through the vertex output format]
Maximum vertex value implementation
public class MaxValueVertex extends EdgeListVertex<
    IntWritable, IntWritable, IntWritable, IntWritable> {
  @Override
  public void compute(Iterator<IntWritable> msgIterator) {
    boolean changed = false;
    while (msgIterator.hasNext()) {
      IntWritable msgValue = msgIterator.next();
      if (msgValue.get() > getVertexValue().get()) {
        setVertexValue(msgValue);
        changed = true;
      }
    }
    if (getSuperstep() == 0 || changed) {
      sendMsgToAllEdges(getVertexValue());
    } else {
      voteToHalt();
    }
  }
}
Maximum vertex value


[Figure: three vertices spread over Processor 1 and Processor 2, shown across four supersteps separated by three barriers]

 Vertex values per superstep:
   5   5   5   5
   1   5   5   5
   2   2   5   5
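The propagation in this example can be reproduced with a small single-JVM sketch. It is not the Giraph API: the class name MaxValueSketch is hypothetical, and the chain topology (5 - 1 - 2 with edges in both directions) is an assumption chosen because it matches the values shown above. The loop follows the same rule as compute(): send on superstep 0 or when the value changed, otherwise vote to halt, and wake a halted vertex when a message arrives.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-JVM sketch of the max-value vertex program. */
final class MaxValueSketch {

  /**
   * Runs supersteps until every vertex has halted and no messages are in flight.
   * Updates values in place and returns the number of supersteps executed.
   */
  static int run(int[] values, int[][] outEdges) {
    int n = values.length;
    boolean[] halted = new boolean[n];
    List<List<Integer>> inbox = emptyBoxes(n);
    int superstep = 0;
    while (true) {
      List<List<Integer>> outbox = emptyBoxes(n);
      boolean anyActive = false;
      for (int v = 0; v < n; v++) {
        List<Integer> messages = inbox.get(v);
        if (!messages.isEmpty()) {
          halted[v] = false; // an incoming message reactivates a halted vertex
        }
        if (halted[v]) {
          continue;
        }
        anyActive = true;
        boolean changed = false;
        for (int msg : messages) {
          if (msg > values[v]) {
            values[v] = msg;
            changed = true;
          }
        }
        if (superstep == 0 || changed) {
          for (int target : outEdges[v]) {
            outbox.get(target).add(values[v]); // delivered in the next superstep
          }
        } else {
          halted[v] = true; // vote to halt
        }
      }
      if (!anyActive) {
        return superstep;
      }
      inbox = outbox; // the barrier: messages become visible only now
      superstep++;
    }
  }

  private static List<List<Integer>> emptyBoxes(int n) {
    List<List<Integer>> boxes = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      boxes.add(new ArrayList<>());
    }
    return boxes;
  }
}
```

Calling run(new int[] {5, 1, 2}, new int[][] {{1}, {0, 2}, {1}}) executes four supersteps and leaves every vertex at 5, which lines up with the four columns of values in the figure.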
Page rank implementation
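import java.util.Iterator;
import org.apache.giraph.graph.EdgeListVertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class SimplePageRankVertex extends EdgeListVertex<LongWritable,
    DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Iterator<DoubleWritable> msgIterator) {
    if (getSuperstep() >= 1) {
      // Sum the rank contributions sent by in-neighbors in the previous superstep.
      double sum = 0;
      while (msgIterator.hasNext()) {
        sum += msgIterator.next().get();
      }
      setVertexValue(
          new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum));
    }
    if (getSuperstep() < 30) {
      // Split this vertex's rank evenly across its out-edges.
      long edges = getNumOutEdges();
      sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges));
    } else {
      voteToHalt();
    }
  }
}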
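To see what the vertex program computes without a cluster, the same update rule can be run over an adjacency array in plain Java. This is a sketch under stated assumptions: the class name PageRankSketch is hypothetical, every vertex is assumed to start at 1/N (the slide does not show the initial value), and vertices without out-edges simply send nothing.

```java
import java.util.Arrays;

/** Hypothetical plain-Java sketch of the PageRank supersteps shown above. */
final class PageRankSketch {

  /** outEdges[v] lists the targets of v; returns the rank of each vertex. */
  static double[] run(int[][] outEdges, int maxSupersteps) {
    int n = outEdges.length;
    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n); // assumed initial value
    double[] incoming = new double[n];
    for (int superstep = 0; superstep <= maxSupersteps; superstep++) {
      if (superstep >= 1) {
        // Same rule as compute(): 0.15 / N + 0.85 * (sum of received messages).
        for (int v = 0; v < n; v++) {
          rank[v] = 0.15 / n + 0.85 * incoming[v];
        }
      }
      if (superstep < maxSupersteps) {
        // Each vertex splits its rank evenly across its out-edges.
        double[] next = new double[n];
        for (int v = 0; v < n; v++) {
          for (int target : outEdges[v]) {
            next[target] += rank[v] / outEdges[v].length;
          }
        }
        incoming = next; // delivered in the following superstep
      }
      // When superstep == maxSupersteps the vertices vote to halt and the loop ends.
    }
    return rank;
  }
}
```

On a symmetric graph such as the three-vertex cycle 0 -> 1 -> 2 -> 0, every rank stays at 1/3, which is a convenient sanity check for the update rule.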
Giraph In MapReduce
Giraph components
•  Master – Application coordinator
 ▪    One active master at a time
 ▪    Assigns partition owners to workers prior to each superstep
 ▪    Synchronizes supersteps

•  Worker – Computation & messaging
 ▪    Loads the graph from input splits
 ▪    Does the computation/messaging of its assigned partitions

•  ZooKeeper
 ▪    Maintains global application state
Graph distribution
•  Master graph partitioner
 ▪    Create initial partitions, generate partition owner changes
      between supersteps

•  Worker graph partitioner
 ▪    Determine which partition a vertex belongs to
 ▪    Create/modify the partition stats (can split/merge partitions)

•  Default is hash partitioning (hashCode())
 ▪    Range-based partitioning is also possible on a per-type basis
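Hash partitioning as described above can be sketched in a few lines. The class and method names here are hypothetical and this is not Giraph's partitioner code: the sketch maps a vertex id to a partition with its hashCode(), and then maps partitions to workers in contiguous blocks, an assumption that mirrors a layout where worker 0 owns partitions 0 and 1, worker 1 owns 2 and 3, and so on.

```java
/** Hypothetical sketch of hash partitioning; not Giraph's implementation. */
final class HashPartitionSketch {

  /** Maps a vertex id to a partition in [0, partitionCount). */
  static int partitionFor(Object vertexId, int partitionCount) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(vertexId.hashCode(), partitionCount);
  }

  /** Assigns partitions to workers in contiguous, equally sized blocks. */
  static int workerFor(int partition, int partitionCount, int workerCount) {
    int partitionsPerWorker = partitionCount / workerCount;
    return partition / partitionsPerWorker;
  }
}
```

A range-based scheme would replace partitionFor with a comparison against sorted split points, which is why it depends on the vertex id type.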
Graph distribution example


[Figure: the master distributes eight partitions across four workers; each worker handles load/store, compute, and messages for its partitions and produces stats per partition]

 Worker      Partitions            Stats
 Worker 0    Partition 0, 1        Stats 0, 1
 Worker 1    Partition 2, 3        Stats 2, 3
 Worker 2    Partition 4, 5        Stats 4, 5
 Worker 3    Partition 6, 7        Stats 6, 7
Customizable fault tolerance
•  No single point of failure from Giraph threads
 ▪    With multiple master threads, if the current master dies, a new one automatically
      takes over.
 ▪    If a worker thread dies, the application is rolled back to a previously checkpointed
      superstep. The next superstep begins with the new number of workers.
 ▪    If a ZooKeeper server dies, the application can proceed as long as a quorum remains.

•  Hadoop single points of failure still exist
 ▪    Namenode, jobtracker

 ▪    Restarting manually from a checkpoint is always possible




Master thread fault tolerance
[Figure: three master threads sharing the active master state]

 Before failure of active master 0:  Master 0 “Active”, Master 1 “Spare”, Master 2 “Spare”
 After failure of active master 0:   Master 0 (failed), Master 1 “Active”, Master 2 “Spare”

•  One active master, with spare masters taking over in the event of an active master
   failure

•  All active master state is stored in ZooKeeper so that a spare master can
   immediately step in when an active master fails

•  “Active” master implemented as a queue in ZooKeeper
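The queue idea can be illustrated with an in-memory stand-in. This sketch does not use the ZooKeeper client API; the class MasterQueueSketch and its method names are hypothetical, and an ArrayDeque plays the role of the ordered queue. Whoever sits at the head is the active master, and removing a failed master promotes the next one in line.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical in-memory stand-in for the "active master as a queue" idea. */
final class MasterQueueSketch {
  private final Deque<String> queue = new ArrayDeque<>();

  /** A master joins at the tail of the queue when it starts. */
  void register(String masterId) {
    queue.addLast(masterId);
  }

  /** The master at the head of the queue is the active one; null if none remain. */
  String active() {
    return queue.peekFirst();
  }

  /** A failed master leaves the queue; if it was the head, the next spare takes over. */
  void fail(String masterId) {
    queue.remove(masterId);
  }
}
```

Because the spare reads the same shared state the failed master was writing, taking over only requires becoming the head of the queue.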

Worker thread fault tolerance
[Figure: worker failure timeline]

 Run 1:  Superstep i (no checkpoint) -> Superstep i+1 (checkpoint) -> Superstep i+2 (no checkpoint): worker failure!
 Run 2:  Superstep i+1 (checkpoint) -> Superstep i+2 (no checkpoint) -> Superstep i+3 (checkpoint): worker failure after checkpoint complete!
 Run 3:  Superstep i+3 (no checkpoint) -> Application complete
•  A single worker death fails the superstep

•  Application reverts to the last committed superstep automatically
 ▪    Master detects worker failure during any superstep with a ZooKeeper “health”
      znode
 ▪    Master chooses the last committed superstep and sends a command through
      ZooKeeper for all workers to restart from that superstep
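The choice of restart point can be sketched as a lookup over committed checkpoints. The class CheckpointSketch is hypothetical and keeps its bookkeeping in memory rather than in ZooKeeper; it only illustrates the rule that the application resumes from the latest checkpoint committed at or before the superstep in which the failure was detected.

```java
import java.util.TreeSet;

/** Hypothetical sketch of picking the superstep to restart from after a failure. */
final class CheckpointSketch {
  private final TreeSet<Long> committed = new TreeSet<>();

  /** Records that the checkpoint for this superstep was fully written. */
  void commit(long superstep) {
    committed.add(superstep);
  }

  /** Latest committed checkpoint at or before the failed superstep, or -1 if none. */
  long restartFrom(long failedSuperstep) {
    Long superstep = committed.floor(failedSuperstep);
    return superstep == null ? -1L : superstep;
  }
}
```

With checkpoints committed at supersteps 1 and 3, a failure in superstep 2 restarts from 1, while a failure in superstep 3 after its checkpoint completed restarts from 3. A result of -1 stands for the case with no usable checkpoint.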
Optional features
•  Combiners
 ▪    Similar to Map-Reduce combiners
 ▪    Users implement a combine() method that can reduce the
      number of messages sent and received
 ▪    Run on both the client side (memory, network) and server side
      (memory)

•  Aggregators
 ▪    Similar to MPI aggregation routines (e.g. max, min, sum, etc.)
 ▪    Commutative and associative operations that are performed
      globally
 ▪    Examples include global communication, monitoring, and
      statistics
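Both features rest on operations that are commutative and associative, so partial results can be merged in any order. The sketch below is plain Java rather than the Giraph combiner or aggregator interfaces; the class CombineAggregateSketch and its method names are hypothetical. combineMax collapses all messages bound for the same vertex into one, and aggregateSum reduces per-worker partial sums into a global value.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the combiner and aggregator ideas. */
final class CombineAggregateSketch {

  /** Combiner: keeps a single message (the maximum) per target vertex id. */
  static Map<Long, Integer> combineMax(long[] targets, int[] values) {
    Map<Long, Integer> combined = new LinkedHashMap<>();
    for (int i = 0; i < targets.length; i++) {
      // merge() applies max only when the target already has a pending message.
      combined.merge(targets[i], values[i], Integer::max);
    }
    return combined;
  }

  /** Aggregator: each worker reduces locally, then the partial sums are merged. */
  static long aggregateSum(long[][] valuesPerWorker) {
    long global = 0;
    for (long[] workerValues : valuesPerWorker) {
      long partial = 0;
      for (long value : workerValues) {
        partial += value;
      }
      global += partial;
    }
    return global;
  }
}
```

A max combiner fits the maximum-value program directly, since only the largest incoming message can change a vertex's value.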
Recent Netty IPC implementation
•  Big improvement over the
   Hadoop RPC implementation

•  10-39% overall performance
   improvement

•  Still need more Netty tuning

[Chart: run time in seconds (0 to 300) for Netty and Hadoop RPC, plus % improvement (0 to 50), at 10, 30, and 50 workers]
Recent benchmarks
•  Test cluster of 80 machines
 ▪    Facebook Hadoop (https://github.com/facebook/hadoop-20)
 ▪    72 cores, 64+ GB of memory

•  org.apache.giraph.benchmark.PageRankBenchmark
 ▪    5 supersteps
 ▪    No checkpointing
 ▪    10 edges per vertex
Worker scalability
[Chart: run time in seconds (0 to 3000) versus number of workers (10, 20, 30, 40, 45, 50)]
Edge Scalability
[Chart: run time in seconds (0 to 5000) versus edges in billions (1 to 5)]
Worker / edge scalability
[Chart: run time in seconds (0 to 2000) and edges in billions (0 to 8) versus workers (10, 30, 50); series: Run Time, Workers/Edges]
Apache Giraph has graduated as of
5/2012
•  Incubated for less than a year (entered incubator
   9/12)

•  Committers from HortonWorks, Twitter, LinkedIn,
   Facebook, TrendMicro and various schools (VU
   Amsterdam, TU Berlin, Korea University)

•  Released 0.1 on 2/6/2012; release 0.2 is expected
   within a few months
Future improvements
•  Out-of-core messages/graph
 ▪    Under memory pressure, dump messages/portions of the graph
      to local disk
 ▪    Ability to run applications without having all needed memory

•  Performance improvements
 ▪    Netty is a good step in the right direction, but need to tune
      messaging performance as it takes up a majority of the time
 ▪    Scale back use of ZooKeeper to only be for health registration,
      rather than implementing aggregators and coordination
More future improvements
•  Adding a master#compute() method
 ▪    Arbitrary master computation that sends results to workers prior
      to a superstep to simplify certain algorithms
 ▪    GIRAPH-127

•  Handling skew
 ▪    Some vertices have a large number of edges and we need to
      break them up and handle them differently to provide better
      scalability
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Processing edges on apache giraph

  • 1. Processing Over a Billion Edges on Apache Giraph Hadoop Summit 2012 Avery Ching Software Engineer 6/14/2012
  • 2. Agenda 1 Motivation and Background 2 Giraph Concepts/API 3 Example Applications 4 Architecture Overview 5 Recent/Future Improvements
  • 3. What is Apache Giraph? •  Loose implementation of Google’s Pregel that runs as a map-only job on Hadoop •  “Think like a vertex” that can send messages to any other vertex in the graph using the bulk synchronous parallel programming model •  An in-memory scalable system* ▪  Will be enhanced with out-of-core messages/vertices to handle larger problem sets.
  • 4. What (social) graphs are we targeting? •  3/2012 LinkedIn has 161 million users •  6/2012 Twitter discloses 140 million MAU •  4/2012 Facebook declares 901 million MAU
  • 5. Example applications •  Ranking ▪  Popularity, importance, etc. •  Label Propagation ▪  Location, school, gender, etc. •  Community ▪  Groups, interests
  • 6. Bulk synchronous parallel •  Supersteps ▪  A global epoch followed by a global barrier where components do concurrent computation and send messages •  Point-to-point messages (i.e. vertex to vertex) ▪  Sent during a superstep from one component to another and then delivered in the following superstep •  Computation complete when all components complete
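The delivery rule on this slide (a message sent in one superstep is only readable in the next) can be sketched in plain Java. This is an illustration, not Giraph code: the class `BspChain` and its method are hypothetical names, and the "graph" is just a chain of vertices passing a token to the right.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of BSP message delivery; not the Giraph API.
class BspChain {
    // Vertices 0..n-1 form a chain. Vertex 0 starts with a token and every
    // holder forwards it to its right neighbour. Returns the superstep in
    // which vertex n-1 first holds the token. Requires n >= 1.
    static int superstepsToReach(int n) {
        List<Integer> inbox = new ArrayList<>();      // readable in this superstep
        inbox.add(0);                                 // seed: vertex 0 holds the token
        int superstep = 0;
        while (true) {
            List<Integer> outbox = new ArrayList<>(); // readable in the next superstep
            for (int holder : inbox) {
                if (holder == n - 1) {
                    return superstep;
                }
                outbox.add(holder + 1);               // "send" to the right neighbour
            }
            // Global barrier: sent messages only become visible from here on.
            inbox = outbox;
            superstep++;
        }
    }
}
```

With four vertices, `BspChain.superstepsToReach(4)` returns 3: the token advances exactly one hop per superstep because nothing sent before the barrier is visible until after it.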
  • 7. Computation + Superstep Communication Processors Time Barrier
  • 8. MapReduce -> Giraph “Think like a vertex”, not a key-value pair! MapReduce: public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException; } Giraph: public class Vertex<I extends WritableComparable, V extends Writable, E extends Writable, M extends Writable> { void compute(Iterator<M> msgIterator); }
  • 9. Basic Giraph API Methods available to compute() •  Immediate effect/access: I getVertexId(); V getVertexValue(); void setVertexValue(V vertexValue); Iterator<I> iterator(); E getEdgeValue(I targetVertexId); boolean hasEdge(I targetVertexId); boolean addEdge(I targetVertexId, E edgeValue); E removeEdge(I targetVertexId); void voteToHalt(); boolean isHalted() •  Next superstep: void sendMsg(I id, M msg); void sendMsgToAllEdges(M msg); void addVertexRequest(BasicVertex<I, V, E, M> vertex); void removeVertexRequest(I vertexId); void addEdgeRequest(I sourceVertexId, Edge<I, E> edge); void removeEdgeRequest(I sourceVertexId, I destVertexId)
  • 10. Why not implement Giraph with multiple MapReduce jobs? •  Too much disk, no in-memory caching, a superstep becomes a job! [Diagram: input format (Splits 0-3), map tasks, intermediate files, reduce tasks, output format (Outputs 0-1)]
  • 11. Giraph is a single Map-only job in Hadoop •  Hadoop is purely a resource manager for Giraph, all communication is done through Netty-based IPC [Diagram: vertex input format (Splits 0-3), map tasks, vertex output format (Outputs 0-1)]
  • 12. Maximum vertex value implementation public class MaxValueVertex extends EdgeListVertex< IntWritable, IntWritable, IntWritable, IntWritable> { @Override public void compute(Iterator<IntWritable> msgIterator) { boolean changed = false; while (msgIterator.hasNext()) { IntWritable msgValue = msgIterator.next(); if (msgValue.get() > getVertexValue().get()) { setVertexValue(msgValue); changed = true; } } if (getSuperstep() == 0 || changed) { sendMsgToAllEdges(getVertexValue()); } else { voteToHalt(); } } }
  • 13. Maximum vertex value [Figure: vertex values held by Processor 1 and Processor 2 over time, with a barrier between supersteps; the initial values 5, 1 and 2 all become 5]
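The maximum-value algorithm can be traced without a cluster. The sketch below is a plain-Java simulation that mirrors the logic of the `compute()` method on the previous slide; it does not use the Giraph API. The class `MaxValueSim`, the adjacency-array representation, and the path-shaped test graph are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of the maximum-value algorithm; not Giraph code.
class MaxValueSim {
    // neighbours[v] lists the vertices that v sends to; values is updated in place.
    // Returns the number of supersteps executed before every vertex has halted.
    static int run(int[] values, int[][] neighbours) {
        int n = values.length;
        List<List<Integer>> inbox = emptyMailboxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> next = emptyMailboxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                boolean changed = false;
                for (int msg : inbox.get(v)) {
                    if (msg > values[v]) {
                        values[v] = msg;
                        changed = true;
                    }
                }
                if (superstep == 0 || changed) {
                    for (int target : neighbours[v]) {
                        next.get(target).add(values[v]);
                    }
                    anyActive = true;          // vertex did not vote to halt
                }
                // Otherwise the vertex votes to halt; a later message wakes it up.
            }
            superstep++;
            if (!anyActive) {
                return superstep;              // all halted, no messages in flight
            }
            inbox = next;                      // barrier: deliver messages
        }
    }

    private static List<List<Integer>> emptyMailboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }
}
```

On a three-vertex path with values 5, 1 and 2 (the middle vertex connected to both ends), every vertex ends with the value 5 and the run takes four supersteps, the last of which only confirms that nothing changed.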
  • 14. Page rank implementation public class SimplePageRankVertex extends EdgeListVertex<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> { public void compute(Iterator<DoubleWritable> msgIterator) { if (getSuperstep() >= 1) { double sum = 0; while (msgIterator.hasNext()) { sum += msgIterator.next().get(); } setVertexValue(new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum)); } if (getSuperstep() < 30) { long edges = getNumOutEdges(); sendMsgToAllEdges(new DoubleWritable(getVertexValue().get() / edges)); } else { voteToHalt(); } } }
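The update rule in this vertex, value = 0.15 / N + 0.85 × (sum of incoming messages), can be checked on a tiny graph. The following is a plain-Java sketch, not Giraph code. It assumes every vertex starts at 1/N (the slide does not state the initial value) and that every vertex has at least one out-edge; `PageRankSim` is a hypothetical name.

```java
import java.util.Arrays;

// Plain-Java simulation of the PageRank update shown on the slide; not Giraph code.
class PageRankSim {
    // outEdges[v] lists the targets of v's out-edges (each vertex needs at least one).
    // Runs supersteps 0..maxSupersteps; messages are sent while superstep < maxSupersteps.
    static double[] run(int[][] outEdges, int maxSupersteps) {
        int n = outEdges.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);            // assumed initial value
        double[] incoming = new double[n];
        for (int superstep = 0; superstep <= maxSupersteps; superstep++) {
            if (superstep >= 1) {
                for (int v = 0; v < n; v++) {
                    rank[v] = 0.15 / n + 0.85 * incoming[v];
                }
            }
            if (superstep < maxSupersteps) {
                double[] next = new double[n];
                for (int v = 0; v < n; v++) {
                    double share = rank[v] / outEdges[v].length;
                    for (int target : outEdges[v]) {
                        next[target] += share; // delivered in the next superstep
                    }
                }
                incoming = next;
            }
        }
        return rank;
    }
}
```

On a directed three-cycle each vertex keeps a rank of 1/3, since 0.15/3 + 0.85 × 1/3 = 1/3. More generally, when no vertex lacks out-edges the ranks continue to sum to 1 after every superstep.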
  • 16. Giraph components •  Master – Application coordinator ▪  One active master at a time ▪  Assigns partition owners to workers prior to each superstep ▪  Synchronizes supersteps •  Worker – Computation & messaging ▪  Loads the graph from input splits ▪  Does the computation/messaging of its assigned partitions •  ZooKeeper ▪  Maintains global application state
  • 17. Graph distribution •  Master graph partitioner ▪  Create initial partitions, generate partition owner changes between supersteps •  Worker graph partitioner ▪  Determine which partition a vertex belongs to ▪  Create/modify the partition stats (can split/merge partitions) •  Default is hash partitioning (hashCode()) ▪  Range-based partitioning is also possible on a per-type basis
  • 18. Graph distribution example [Diagram: the master coordinates Workers 0-3; each worker loads/stores, computes, and exchanges messages for two partitions and their stats (Worker 0: Partitions 0-1, Worker 1: Partitions 2-3, Worker 2: Partitions 4-5, Worker 3: Partitions 6-7)]
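A minimal sketch of the two partitioning roles, in plain Java rather than Giraph's partitioner classes. The worker-side function hashes a vertex id into a partition, as the default described above does with `hashCode()`. The master-side function assigns contiguous blocks of partitions to workers, which reproduces the two-partitions-per-worker layout of the example; that assignment policy and all names here are assumptions.

```java
// Plain-Java illustration of hash partitioning; not Giraph's partitioner classes.
class HashPartitioning {
    // Worker side: map a vertex id to a partition using its hashCode().
    static int partitionOf(Object vertexId, int partitionCount) {
        // floorMod keeps the result in [0, partitionCount) even for negative hashes.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    // Master side (assumed policy): give each worker a contiguous block of partitions.
    static int ownerOf(int partition, int partitionCount, int workerCount) {
        return partition * workerCount / partitionCount;
    }
}
```

With 8 partitions and 4 workers, `ownerOf` maps partitions 0-1 to worker 0, 2-3 to worker 1, and so on. Using `Math.floorMod` rather than `%` avoids negative partition numbers when `hashCode()` is negative.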
  • 19. Customizable fault tolerance •  No single point of failure from Giraph threads ▪  With multiple master threads, if the current master dies, a new one will automatically take over. ▪  If a worker thread dies, the application is rolled back to a previously checkpointed superstep. The next superstep will begin with the new number of workers. ▪  If a ZooKeeper server dies, as long as a quorum remains, the application can proceed. •  Hadoop single points of failure still exist ▪  Namenode, jobtracker ▪  Restarting manually from a checkpoint is always possible
  • 20. Master thread fault tolerance [Diagram: before the failure of active master 0, Master 0 is “Active” and holds the active master state while Masters 1 and 2 are “Spare”; after the failure, Master 1 is “Active” and Master 2 remains “Spare”] •  One active master, with spare masters taking over in the event of an active master failure •  All active master state is stored in ZooKeeper so that a spare master can immediately step in when an active master fails •  “Active” master implemented as a queue in ZooKeeper
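The queue-based takeover can be modelled in a few lines. This is an in-memory sketch only: it does not talk to ZooKeeper, and `MasterQueue` and its methods are hypothetical names. The head of the queue plays the active master and everything behind it is a spare.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// In-memory model of active/spare masters as a queue; ZooKeeper is not involved.
class MasterQueue {
    private final Deque<String> queue = new ArrayDeque<>();

    // Masters join at the tail in the order they register.
    void register(String masterName) {
        queue.addLast(masterName);
    }

    // The head of the queue is the active master; returns null if none remain.
    String active() {
        return queue.peekFirst();
    }

    // Dropping a failed master promotes the next spare if the head was removed.
    void fail(String masterName) {
        queue.remove(masterName);
    }
}
```

After registering `master0`, `master1` and `master2`, failing `master0` makes `master1` active, while failing a spare leaves the active master unchanged.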
  • 21. Worker thread fault tolerance [Timeline: Superstep i (no checkpoint), Superstep i+1 (checkpoint), Superstep i+2 (no checkpoint), worker failure; restart at Superstep i+1 (checkpoint), Superstep i+2 (no checkpoint), Superstep i+3 (checkpoint), worker failure after checkpoint complete; restart at Superstep i+3 (no checkpoint), application complete] •  A single worker death fails the superstep •  Application reverts to the last committed superstep automatically ▪  Master detects worker failure during any superstep with a ZooKeeper “health” znode ▪  Master chooses the last committed superstep and sends a command through ZooKeeper for all workers to restart from that superstep
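Choosing the restart point amounts to finding the latest committed checkpoint at or before the failed superstep. The sketch below models only that choice in plain Java. `CheckpointRollback` is a hypothetical name, and returning -1 when nothing has been committed is an assumption of this sketch, not documented Giraph behaviour.

```java
import java.util.TreeSet;

// Models the master's choice of restart superstep; not Giraph code.
class CheckpointRollback {
    private final TreeSet<Long> committed = new TreeSet<>();

    // Record that the checkpoint for this superstep has been fully written.
    void commit(long superstep) {
        committed.add(superstep);
    }

    // Latest committed checkpoint at or before the failed superstep,
    // or -1 if there is none (assumed convention for this sketch).
    long restartFrom(long failedSuperstep) {
        Long latest = committed.floor(failedSuperstep);
        return latest == null ? -1L : latest;
    }
}
```

This matches the two cases on the slide: a failure in superstep i+2 restarts from the checkpoint at i+1, and a failure in superstep i+3 after its checkpoint completed restarts from i+3 itself.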
  • 22. Optional features •  Combiners ▪  Similar to MapReduce combiners ▪  Users implement a combine() method that can reduce the number of messages sent and received ▪  Run on both the client side (memory, network) and server side (memory) •  Aggregators ▪  Similar to MPI aggregation routines (e.g. max, min, sum) ▪  Commutative and associative operations that are performed globally ▪  Examples include global communication, monitoring, and statistics
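Both features rely on the operation being commutative and associative, so partial results can be merged in any order. The following plain-Java sketch shows a sum combiner that collapses all messages bound for the same vertex into one, and a max aggregator over per-worker partial values. It does not use Giraph's combiner or aggregator interfaces; the class name and the `{destination, value}` message encoding are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java illustration of a combiner and an aggregator; not Giraph interfaces.
class CombineAndAggregate {
    // Combiner: each message is {destinationVertexId, value}. Messages for the
    // same destination are summed, so at most one message per destination remains.
    static Map<Long, Long> combineSum(long[][] messages) {
        Map<Long, Long> combined = new HashMap<>();
        for (long[] message : messages) {
            combined.merge(message[0], message[1], Long::sum);
        }
        return combined;
    }

    // Aggregator: reduce the partial values reported by each worker to a global max.
    // Requires at least one partial value.
    static long aggregateMax(long[] perWorkerValues) {
        long max = perWorkerValues[0];
        for (long value : perWorkerValues) {
            max = Math.max(max, value);
        }
        return max;
    }
}
```

Three messages `{1, 10}`, `{2, 5}` and `{1, 7}` combine into two: 17 for vertex 1 and 5 for vertex 2. A sum combiner of this kind suits PageRank, where `compute()` only needs the total of the incoming values.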
  • 23. Recent Netty IPC implementation •  Big improvement over the Hadoop RPC implementation •  10-39% overall performance improvement •  Still need more Netty tuning [Chart: time (seconds) for Netty and Hadoop RPC, and % improvement, vs. workers (10, 30, 50)]
  • 24. Recent benchmarks •  Test cluster of 80 machines ▪  Facebook Hadoop (https://github.com/facebook/hadoop-20) ▪  72 cores, 64+ GB of memory ▪  org.apache.giraph.benchmark.PageRankBenchmark ▪  5 supersteps ▪  No checkpointing ▪  10 edges per vertex
  • 25. Worker scalability [Chart: time (seconds) vs. workers (10, 20, 30, 40, 45, 50)]
  • 26. Edge Scalability [Chart: time (seconds) vs. edges (1 to 5 billion)]
  • 27. Worker / edge scalability [Chart: run time (seconds) and edges (billions) vs. workers (10, 30, 50); series: Run Time, Workers/Edges]
  • 28. Apache Giraph has graduated as of 5/2012 •  Incubated for less than a year (entered incubator 9/12) •  Committers from HortonWorks, Twitter, LinkedIn, Facebook, TrendMicro and various schools (VU Amsterdam, TU Berlin, Korea University) •  Released 0.1 as of 2/6/2012, will release 0.2 within a few months
  • 29. Future improvements •  Out-of-core messages/graph ▪  Under memory pressure, dump messages/portions of the graph to local disk ▪  Ability to run applications without having all needed memory •  Performance improvements ▪  Netty is a good step in the right direction, but need to tune messaging performance as it takes up a majority of the time ▪  Scale back use of ZooKeeper to only be for health registration, rather than implementing aggregators and coordination
  • 30. More future improvements •  Adding a master#compute() method ▪  Arbitrary master computation that sends results to workers prior to a superstep to simplify certain algorithms ▪  GIRAPH-127 •  Handling skew ▪  Some vertices have a large number of edges and we need to break them up and handle them differently to provide better scalability
  • 31. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0