So what is giraph?Giraph is a distributed graph processing framework. It tries to solve a class of iterative problems that hadoop has problem with, such as pagerank.Graph processing is very improtant to linkedinGiraph is designed with master slave architecture and does all its computation in memory. Meaning, it loads the inputs from HDFS once and writes the output back to HDFS only after finishing its business logic processing.Giraph provides a vertex-centric programming model. All algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation.Giraph makes graph algorithm easy to reason about and implement by following the BSP. A bsp computation proceeds in a series of global supersteps. A superstep consists of three components, concurrent computation, communication, and barrier synchronization.
Client is nothing but initiating an application for the user. It just asks the resource manager will you launch my application master. From there, the application master is going to do everything.Resource manager just schedules your task (job tracker sort of activity ask the node managers for the containers with the right heap size, and where we could put this task.Application master is sort of master node for your application and is going to launch, manage the life cycle, communicates with health anything to do with your task. New brain of your application.You may wonder what is the difference between the master and application master? The answer is these two components could be combined as one. However, giraph is implemented this way and my focus is on giraph’s performance.
byte 1.8GBLong/DoubleWritable 2GB----- Meeting Notes (9/3/13 19:49) -----move it upadd GB----- Meeting Notes (9/4/13 11:45) -----move up
Step 2. Make them support Java-based applicationsJava interface to write Giraph applications running on C++ Giraph master/workers----- Meeting Notes (9/3/13 19:49) -----No need to rewrite the whole Giraphmore control over memory management----- Meeting Notes (9/4/13 11:45) -----from prev slides, we found out jvm is the killerthat's why we are thinking if c++ is better candidate, we are asking to overhaul the giraph, we only ask the griaph to pice
Relative new things in the community.There are a few bugs we fixed to make it work
Master – responsible for coordination (load distribution, coordinates synchronization, request checkpoints, collect health status, etc)Worker – responsible for computation within each iteration or superstepZookeeper – responsible for computation state (partition to worker mapping, global state, checkpoints paths, statistics, etc)In the next slide, I’m going to show you how these components work together----- Meeting Notes (9/3/13 19:49) -----add another slide descirbing the computational modelpartition the vertice to distribute the load