GraphFrames
DataFrame-based graphs for Apache® Spark™
Joseph K. Bradley
4/14/2016
GraphFrames: DataFrame-based graphs for Apache® Spark™

5,748 views

These slides support the GraphFrames: DataFrame-based graphs for Apache Spark webinar. In this webinar, the developers of the GraphFrames package will give an overview, a live demo, and a discussion of design decisions and future plans. This talk will be generally accessible, covering major improvements from GraphX and providing resources for getting started. A running example of analyzing flight delays will be used to explain the range of GraphFrame functionality: simple SQL and graph queries, motif finding, and powerful graph algorithms.

Published in: Software
GraphFrames: DataFrame-based graphs for Apache® Spark™

  1. GraphFrames: DataFrame-based graphs for Apache® Spark™. Joseph K. Bradley, 4/14/2016.
  2. About the speaker: Joseph Bradley. Joseph Bradley is a Software Engineer and Apache Spark PMC member working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.
  3. About the moderator: Denny Lee. Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
  4. We are Databricks, the company behind Apache Spark. Founded by the creators of Apache Spark in 2013. Share of Spark code contributed by Databricks in 2014: 75%. Data Value. Created Databricks on top of Spark to make big data simple.
  5. Apache Spark Engine: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX, … Unified engine across diverse workloads & environments. Scale out, fault tolerant. Python, Java, Scala, and R APIs. Standard libraries.
  6. Notable users that presented at Spark Summit 2015 San Francisco. Source: Slide 5 of Spark Community Update.
  7. Outline: GraphFrames overview; GraphFrames vs. GraphX and other libraries; Details for power users; Roadmap and resources.
  8. Outline: GraphFrames overview; GraphFrames vs. GraphX and other libraries; Details for power users; Roadmap and resources.
  9. Graphs: vertices and edges. Example: airports (JFK, IAD, LAX, SFO, SEA, DFW) & flights between them.
     Vertex row: id = "JFK", City = "New York", State = NY
     Edge row: src = "JFK", dst = "SEA", delay = 45, tripID = 1058923
  10. Apache Spark's GraphX library
     Overview
     • General-purpose graph processing library
     • Optimized for fast distributed computing
     • Library of algorithms: PageRank, Connected Components, etc.
     Challenges
     • No Java, Python APIs
     • Lower-level RDD-based API (vs. DataFrames)
     • Cannot use recent Spark optimizations: Catalyst query optimizer, Tungsten memory management
  11. Enter GraphFrames
     Goal: DataFrame-based graphs on Apache Spark
     • Simplify interactive queries
     • Support motif finding for structural pattern search
     • Benefit from DataFrame optimizations
     Collaboration between Databricks, UC Berkeley & MIT. Now with community contributors!
  12. Graphs: vertices and edges. Example: airports (JFK, IAD, LAX, SFO, SEA, DFW) & flights between them.
     Vertex row: id = "JFK", City = "New York", State = NY
     Edge row: src = "JFK", dst = "SEA", delay = 45, tripID = 1058923
  13. GraphFrames
     "vertices" DataFrame
     • 1 vertex per Row
     • id: column with unique ID
     "edges" DataFrame
     • 1 edge per Row
     • src, dst: columns using IDs from vertices.id
     Extra columns store vertex or edge data (a.k.a. attributes or properties).
     Vertices (id, City, State): ("JFK", "New York", NY); ("SEA", "Seattle", WA)
     Edges (src, dst, delay, tripID): ("JFK", "SEA", 45, 1058923); ("DFW", "SFO", -7, 4100224)
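The two-table layout described on this slide can be sketched without Spark. The following is a minimal plain-Python illustration, not the GraphFrames API: `TinyGraph` is a hypothetical name, rows are plain dicts standing in for DataFrame rows, and the DFW and SFO vertex rows are added here (they are not on the slide) so that every edge endpoint resolves. The validation step only illustrates the slide's two invariants (unique `id`, and `src`/`dst` drawn from `vertices.id`); it makes no claim about what the library itself checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyGraph:
    """Hypothetical stand-in for a graph stored as two row collections."""
    vertices: list  # one dict per vertex; must contain "id"
    edges: list     # one dict per edge; must contain "src" and "dst"

    def __post_init__(self):
        ids = [v["id"] for v in self.vertices]
        # Invariant 1: the id column is a unique key.
        if len(ids) != len(set(ids)):
            raise ValueError("vertex ids must be unique")
        # Invariant 2: src and dst use ids from the vertices table.
        known = set(ids)
        for e in self.edges:
            if e["src"] not in known or e["dst"] not in known:
                raise ValueError(f"edge references unknown vertex: {e}")


vertices = [
    {"id": "JFK", "City": "New York", "State": "NY"},
    {"id": "SEA", "City": "Seattle", "State": "WA"},
    # The next two rows are assumptions added so the second edge resolves.
    {"id": "DFW", "City": "Dallas/Fort Worth", "State": "TX"},
    {"id": "SFO", "City": "San Francisco", "State": "CA"},
]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45, "tripID": 1058923},
    {"src": "DFW", "dst": "SFO", "delay": -7, "tripID": 4100224},
]

g = TinyGraph(vertices, edges)
```

Any extra keys in a row (City, State, delay, tripID) play the role of the attribute columns mentioned on the slide.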
  14. Demo: Building a GraphFrame
  15. (No text on this slide.)
  16. Queries: Simple queries; Motif finding; Graph algorithms.
  17. Simple queries
     SQL queries on vertices & edges
     • E.g., what trips are most likely to have significant delays?
     Graph queries
     • Vertex degrees: # edges per vertex (incoming, outgoing, total)
     • Triplets: join vertices and edges to get (src, edge, dst)
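The simple queries listed on this slide (a delay query, vertex degrees, and triplets) can be illustrated in plain Python. This is a conceptual sketch under stated assumptions, not GraphFrames code: the SEA→SFO and SFO→JFK flights and their delays are made up for the example, and dicts stand in for DataFrame rows.

```python
from collections import Counter

vertices = [
    {"id": "JFK", "City": "New York"},
    {"id": "SEA", "City": "Seattle"},
    {"id": "DFW", "City": "Dallas/Fort Worth"},
    {"id": "SFO", "City": "San Francisco"},
]
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},  # from the slides
    {"src": "DFW", "dst": "SFO", "delay": -7},  # from the slides
    {"src": "SEA", "dst": "SFO", "delay": 5},   # made up for illustration
    {"src": "SFO", "dst": "JFK", "delay": 30},  # made up for illustration
]

# Simple query on the edge table: which trip has the largest delay?
worst = max(edges, key=lambda e: e["delay"])

# Vertex degrees: count edges per vertex (outgoing, incoming, total).
out_deg = Counter(e["src"] for e in edges)
in_deg = Counter(e["dst"] for e in edges)
degree = out_deg + in_deg

# Triplets: join each edge with its source and destination vertex rows.
by_id = {v["id"]: v for v in vertices}
triplets = [(by_id[e["src"]], e, by_id[e["dst"]]) for e in edges]
```

Each triplet carries the full source row, the edge row, and the destination row, which is the shape a (src, edge, dst) join produces.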
  18. (No text on this slide.)
  19. Motif finding (graph of IAD, JFK, LAX, SFO, SEA, DFW). Search for structural patterns within a graph.
     val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  20. Motif finding (same graph, with vertices (a) and (b) labeled). Search for structural patterns within a graph.
     val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  21. Motif finding (same graph, with vertices (a), (b), and (c) labeled). Search for structural patterns within a graph.
     val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  22. Motif finding (same graph, with vertices (a), (b), and (c) labeled). Search for structural patterns within a graph.
     val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
  23. Motif finding (same graph, with vertices (a), (b), and (c) labeled). Search for structural patterns within a graph.
     val paths: DataFrame = g.find("(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)")
     Then filter using vertex & edge data.
     paths.filter("e1.delay > 20")
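To make the motif above concrete, here is a plain-Python sketch of what the pattern `(a)-[e1]->(b); (b)-[e2]->(c); !(c)-[]->(a)` asks for: two-hop paths with no direct edge back from `c` to `a`, followed by the `e1.delay > 20` filter. It expresses the pattern as a join of the edge list with itself, in the spirit of the "series of joins" description later in the deck, but it is only an illustration. The function name and all flight data except JFK→SEA and DFW→SFO are assumptions, and the sketch does not require `a`, `b`, and `c` to be distinct.

```python
from collections import defaultdict

edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},  # from the slides
    {"src": "SEA", "dst": "SFO", "delay": 5},   # made up for illustration
    {"src": "SFO", "dst": "JFK", "delay": 30},  # made up for illustration
    {"src": "DFW", "dst": "SFO", "delay": -7},  # from the slides
    {"src": "JFK", "dst": "IAD", "delay": 25},  # made up for illustration
    {"src": "IAD", "dst": "LAX", "delay": 10},  # made up for illustration
]


def find_two_hop_no_return(edges):
    """Return rows {a, e1, b, e2, c} where a->b->c exists and c->a does not."""
    by_src = defaultdict(list)
    for e in edges:
        by_src[e["src"]].append(e)
    pairs = {(e["src"], e["dst"]) for e in edges}

    paths = []
    for e1 in edges:                   # (a)-[e1]->(b)
        for e2 in by_src[e1["dst"]]:   # (b)-[e2]->(c): join on e1.dst == e2.src
            a, b, c = e1["src"], e1["dst"], e2["dst"]
            if (c, a) not in pairs:    # !(c)-[]->(a)
                paths.append({"a": a, "e1": e1, "b": b, "e2": e2, "c": c})
    return paths


paths = find_two_hop_no_return(edges)
# Then filter using edge data, as in paths.filter("e1.delay > 20").
delayed = [p for p in paths if p["e1"]["delay"] > 20]
```

With this data the JFK→SEA→SFO path is excluded because SFO→JFK exists, which is exactly the effect of the negated term in the motif.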
  24. (No text on this slide.)
  25. Graph algorithms
     Find important vertices
     • PageRank
     Find paths between sets of vertices
     • Breadth-first search (BFS)
     • Shortest paths
     Find groups of vertices (components, communities)
     • Connected components
     • Strongly connected components
     • Label Propagation Algorithm (LPA)
     Other
     • Triangle counting
     • SVDPlusPlus
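As one example from this list, breadth-first search between a set of source vertices and a set of target vertices can be sketched in a few lines of standard-library Python. This is a textbook queue-based BFS, not the DataFrame-based implementation the deck refers to; the function name is hypothetical, the flight list is partly made up, and for brevity it returns a single shortest path rather than every one.

```python
from collections import deque

# (src, dst) pairs; only JFK->SEA and DFW->SFO come from the slides.
flights = [
    ("JFK", "SEA"), ("SEA", "SFO"), ("SFO", "JFK"),
    ("DFW", "SFO"), ("JFK", "IAD"), ("IAD", "LAX"),
]


def bfs_path(edges, sources, targets):
    """Return one shortest directed path from any source to any target, or None."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    targets = set(targets)
    parent = {s: None for s in sources}  # doubles as the visited set
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        if node in targets:
            # Walk parent links back to a source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


route = bfs_path(flights, ["DFW"], ["IAD"])
no_route = bfs_path(flights, ["LAX"], ["JFK"])
```

Because vertices are expanded level by level, the first target reached is at minimum hop distance from the source set.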
  26. (No text on this slide.)
  27. Algorithm implementations
     Mostly wrappers for GraphX
     • PageRank
     • Shortest paths
     • Connected components
     • Strongly connected components
     • Label Propagation Algorithm (LPA)
     • SVDPlusPlus
     Some algorithms implemented using DataFrames
     • Breadth-first search
     • Triangle counting
  28. Saving & loading graphs
     Save & load the DataFrames.
     vertices = sqlContext.read.parquet(...)
     edges = sqlContext.read.parquet(...)
     g = GraphFrame(vertices, edges)
     g.vertices.write.parquet(...)
     g.edges.write.parquet(...)
     In the future...
     • SQL data sources for graph formats
  29. APIs: Scala, Java, Python
     API available from all 3 languages
     → First time GraphX functionality has been available to Java & Python users
     2 missing items (WIP)
     • Java-friendliness is currently in alpha.
     • Python does not have aggregateMessages (for implementing your own graph algorithms).
  30. Outline: GraphFrames overview; GraphFrames vs. GraphX and other libraries; Details for power users; Roadmap and resources.
  31. 2 types of graph libraries
     Graph algorithms
     • Standard & custom algorithms
     • Optimized for batch processing
     Graph queries
     • Motif finding
     • Point queries & updates
     GraphFrames: Both algorithms & queries (but not point updates)
  32. GraphFrames vs. GraphX
     • Built on: DataFrames (GraphFrames); RDDs (GraphX)
     • Languages: Scala, Java, Python (GraphFrames); Scala (GraphX)
     • Use cases: Queries & algorithms (GraphFrames); Algorithms (GraphX)
     • Vertex IDs: Any type (in Catalyst) (GraphFrames); Long (GraphX)
     • Vertex/edge attributes: Any number of DataFrame columns (GraphFrames); Any type (VD, ED) (GraphX)
     • Return types: GraphFrame or DataFrame (GraphFrames); Graph[VD, ED], or RDD[Long, VD] (GraphX)
  33. GraphX compatibility
     Simple conversions between GraphFrames & GraphX.
     val g: GraphFrame = ...
     // Convert GraphFrame → GraphX
     val gx: Graph[Row, Row] = g.toGraphX
     // Convert GraphX → GraphFrame
     val g2: GraphFrame = GraphFrame.fromGraphX(gx)
     Vertex & edge attributes are Rows in order to handle non-Long IDs.
     Wrapping existing GraphX code: see the Belief Propagation example: https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/examples/BeliefPropagation.scala
  34. Outline: GraphFrames overview; GraphFrames vs. GraphX and other libraries; Details for power users; Roadmap and resources.
  35. Scalability
     Current status
     • DataFrame-based parts benefit from DataFrame scalability + performance optimizations (Catalyst, Tungsten).
     • GraphX wrappers are as fast as GraphX (+ conversion overhead).
     WIP
     • GraphX has optimizations which are not yet ported to GraphFrames.
     • See next slide…
  36. WIP optimizations
     Join elimination
     • GraphFrame algorithms require lots of joins.
     • Not all joins are necessary.
     Solution:
     • Vertex IDs serve as unique keys.
     • Tracking keys allows Catalyst to eliminate some joins.
     Materialized views
     • Data locality for common use cases
     • Message-passing algorithms often need "triplet view" (src, edge, dst)
     Solution:
     • Materialize specific views
     • Analogous to GraphX's "replicated vertex view"
     For more info & benchmark results, see Ankur Dave's SSE 2016 talk: https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/
  37. Implementing new algorithms
     Method 1: DataFrame & GraphFrame operations
     • Motif finding: series of DataFrame joins
     • Triangle count: DataFrame ops + motif finding
     • BFS: DataFrame joins & filters
     Method 2: Message passing
     • aggregateMessages: same primitive as GraphX
     • Specify messages & aggregation using DataFrame expressions
     • Belief propagation example code
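The message-passing idea behind Method 2 can be sketched in plain Python: for every edge, optionally send a message to its source and/or destination, then combine the messages that arrive at each vertex. This is a conceptual illustration only. The function and parameter names below (`aggregate_messages`, `send_to_src`, `send_to_dst`, `combine`) are hypothetical and use Python callables, whereas the slide says GraphFrames specifies messages and aggregation with DataFrame expressions. Flight data other than JFK→SEA and DFW→SFO is made up.

```python
edges = [
    {"src": "JFK", "dst": "SEA", "delay": 45},  # from the slides
    {"src": "SEA", "dst": "SFO", "delay": 5},   # made up for illustration
    {"src": "SFO", "dst": "JFK", "delay": 30},  # made up for illustration
    {"src": "DFW", "dst": "SFO", "delay": -7},  # from the slides
]


def aggregate_messages(edges, send_to_src=None, send_to_dst=None, combine=sum):
    """Send per-edge messages to endpoints, then combine them per vertex id."""
    inbox = {}
    for edge in edges:
        if send_to_src is not None:
            inbox.setdefault(edge["src"], []).append(send_to_src(edge))
        if send_to_dst is not None:
            inbox.setdefault(edge["dst"], []).append(send_to_dst(edge))
    # Vertices that receive no message do not appear in the result.
    return {vertex_id: combine(msgs) for vertex_id, msgs in inbox.items()}


# Total delay of flights arriving at each airport.
arrival_delay = aggregate_messages(edges, send_to_dst=lambda e: e["delay"])

# Worst delay touching each airport, counting departures and arrivals.
worst_delay = aggregate_messages(
    edges,
    send_to_src=lambda e: e["delay"],
    send_to_dst=lambda e: e["delay"],
    combine=max,
)
```

Iterating this send-then-combine step, feeding each round's result back into the vertex data, is the general shape of algorithms built on this primitive.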
  38. Outline: GraphFrames overview; GraphFrames vs. GraphX and other libraries; Details for power users; Roadmap and resources.
  39. Current status
     Published
     • Open source (Apache 2.0) on GitHub: https://github.com/graphframes/graphframes
     • Spark package: http://spark-packages.org/package/graphframes/graphframes
     Compatible
     • Spark 1.4, 1.5, 1.6
     • Databricks Community Edition
     Documented
     • http://graphframes.github.io/
  40. Roadmap
     • Merge WIP speed optimizations
     • Java API tests & examples
     • Migrate more algorithms to DataFrame-based implementations for greater scalability
     • Get community feedback!
     Contribute
     • Tracking issues on GitHub
     • Thanks to those who have already sent pull requests!
  41. Resources for learning more
     User guide + API docs: http://graphframes.github.io/
     • Quick-start
     • Overview & examples for all algorithms
     • Also available as executable notebooks:
       Scala: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-scala.html
       Python: http://go.databricks.com/hubfs/notebooks/3-GraphFrames-User-Guide-python.html
     Blog posts
     • Intro: https://databricks.com/blog/2016/03/03/introducing-graphframes.html
     • Flight delay analysis: https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-spark-graphframes.html
  42. (No text on this slide.)
  43. Thank you! Thanks to
     • Denny Lee & Bill Chambers (demo)
     • Tim Hunter, Xiangrui Meng, Ankur Dave & others (GraphFrames development)