SlideShare a Scribd company logo

GraphFrames: Graph Queries In Spark SQL

Spark Summit
Spark SummitSpark Summit
GraphFrames: Graph Queries in
Apache Spark SQL
Ankur Dave
UC Berkeley AMPLab
Joint work with Alekh Jindal (Microsoft), Li Erran Li
(Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC
Berkeley), and Matei Zaharia (MIT and Databricks)
+ Graph
Queries
2016
Apache Spark +
GraphFrames
GraphFrames (2016)
+ Graph
Algorithms
2013
Apache Spark +
GraphX
Relational
Queries
2009
Spark
Graph Algorithms vs. Graph Queries
≈
x
PageRank
Alternating Least Squares
Graph Algorithms Graph Queries
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank Graph Query: Wikipedia Collaborators
Editor 1 Editor 2 Article 1 Article 2
⇓
Article 1
Article 2
Editor 1
Editor 2
same day} same day}
Graph Algorithms vs. Graph Queries
Graph Algorithm: PageRank
// Iterate until convergence
wikipedia.pregel(
sendMsg = { e =>
e.sendToDst(e.srcRank * e.weight)
},
mergeMsg = _ + _,
vprog = { (id, oldRank, msgSum) =>
0.15 + 0.85 * msgSum
})
Graph Query: Wikipedia Collaborators
wikipedia.find(
"(u1)-[e11]->(article1);
(u2)-[e21]->(article1);
(u1)-[e12]->(article2);
(u2)-[e22]->(article2)")
.select(
"*",
"e11.date – e21.date".as("d1"),
"e12.date – e22.date".as("d2"))
.sort("d1 + d2".desc).take(10)
Separate Systems
Graph Algorithms Graph Queries
Raw Wikipedia
< / >< / >< / >
XML
Text Table
Edit Graph
Edit Table
Frequent
Collaborators
Problem: Mixed Graph Analysis
Hyperlinks PageRank
Article Text
User Article
Vandalism
Suspects
User User
User Article
Solution: GraphFrames
Graph Algorithms Graph Queries
Spark SQL
GraphFramesAPI
Pattern Query
Optimizer
GraphFrames API
• Unifies graph algorithms, graph queries, and DataFrames
• Available in Scala,Java, and Python
class GraphFrame {
def vertices: DataFrame
def edges: DataFrame
def find(pattern: String): DataFrame
def registerView(pattern: String, df: DataFrame): Unit
def degrees(): DataFrame
def pageRank(): GraphFrame
def connectedComponents(): GraphFrame
...
}
Implementation
Parsed
Pattern
Logical Plan
Materialized
Views
Optimized
Logical Plan
DataFrame
Result
Query String
Graph–Relational
Translation Join Elimination
and Reordering
Spark SQL
View Selection
Graph
Algorithms
GraphX
Graph–Relational Translation
B
D
A
C
Existing
Logical Plan
Output: A,B,C
Src Dst
⋈C=Src
Edge Table
ID Attr
VertexTable
⋈D=ID
Materialized View Selection
GraphX: Triplet view enabled efficient message-passing algorithms
Vertices
B
A
C
D
Edges
A B
A C
B C
C D
A
B
Triplet View
A C
B C
C D
Graph
+
Updated
PageRanks
B
A
C
D
A
Materialized View Selection
GraphFrames: User-defined views enable efficient graph queries
Vertices
B
A
C
D
Edges
A B
A C
B C
C D
A
B
Triplet View
A C
B C
C D
Graph
User-Defined Views
PageRank
Community
Detection
…
Graph Queries
Join Elimination
Src Dst
1 2
1 3
2 3
2 5
Edges
ID Attr
1 A
2 B
3 C
4 D
Vertices
SELECT src, dst
FROM edges INNER JOIN vertices ON src = id;
Unnecessaryjoin
can be eliminated if tables satisfy referential
integrity, simplifying graph–relational
translation:
SELECT src, dst FROM edges;
Join Reordering
A → B B → A
⋈A, B
B → D
C → B⋈B
B → E⋈B
C → D⋈B
C → E⋈C, D
⋈C, E
Example Query
Left-Deep Plan BushyPlan
A → B B → A
⋈A, B
B → D C → B
⋈B
B → E⋈B
⋈B
⋈B, C
User-Defined View
Evaluation
Faster than Neo4j for unanchored patternqueries
0
0.5
1
1.5
2
2.5
GraphFrames Neo4j
Querylatency,s
AnchoredPatternQuery
0
10
20
30
40
50
60
70
80
GraphFrames Neo4j
Querylatency,s
UnanchoredPatternQuery
Triangle query on 1M edge subgraph of web-Google. Each system configured touse a single core.
Evaluation
Approaches performance of GraphX for graph algorithms using Spark SQL
whole-stage code generation
0
1
2
3
4
5
6
7
GraphFrames GraphX Naïve Spark
Per-iterationruntime,s
PageRankPerformance
Per-iteration performance on web-Google, single 8-core machine. Naïve SparkusesScala RDD API.
Evaluation
Registering the right views cangreatlyimprove performance for some queries
Workload: J. Huang, K. Venkatraman, and D.J. Abadi.Query optimization of distributed pattern matching. In ICDE 2014.
Future Work
• Suggest views automatically
• Exploit attribute-based partitioning in optimizer
• Code generationfor single node
Try It Out!
Releasedas a Spark Package at:
https://github.com/graphframes/graphframes
Thanks to Joseph Bradley,Xiangrui Meng,and Timothy Hunter.
ankurd@eecs.berkeley.edu
1 of 20

GraphFrames: Graph Queries In Spark SQL

Download to read offline

Spark Summit talk by Ankur Dave (UC Berkeley)

Spark Summit
Spark SummitSpark Summit

Recommended

Airflow Best Practises & Roadmap to Airflow 2.0 by
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Kaxil Naik
339 views43 slides
Iceberg: a fast table format for S3 by
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
7.5K views30 slides
Flexible and Real-Time Stream Processing with Apache Flink by
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
2.2K views38 slides
Apache Arrow Flight: A New Gold Standard for Data Transport by
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
2.2K views31 slides
Physical Plans in Spark SQL by
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
7K views126 slides
Where is my bottleneck? Performance troubleshooting in Flink by
Where is my bottleneck? Performance troubleshooting in FlinkWhere is my bottleneck? Performance troubleshooting in Flink
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
552 views32 slides
Spark SQL Deep Dive @ Melbourne Spark Meetup by
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
9K views57 slides
The Parquet Format and Performance Optimization Opportunities by
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
8.2K views32 slides

More Related Content

What's hot

Google Cloud and Data Pipeline Patterns by
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline PatternsLynn Langit
6.6K views32 slides
Radical Speed for SQL Queries on Databricks: Photon Under the Hood by
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
1.1K views48 slides
Apache Iceberg - A Table Format for Hige Analytic Datasets by
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
6.6K views28 slides
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... by
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
2K views68 slides
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME by
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMEconfluent
352 views65 slides
Flink vs. Spark by
Flink vs. SparkFlink vs. Spark
Flink vs. SparkSlim Baltagi
69.5K views67 slides
Full Stack Graph in the Cloud by
Full Stack Graph in the CloudFull Stack Graph in the Cloud
Full Stack Graph in the CloudNeo4j
377 views47 slides
Why your Spark job is failing by
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
69.1K views50 slides
OpenTelemetry Introduction by
OpenTelemetry Introduction OpenTelemetry Introduction
OpenTelemetry Introduction DimitrisFinas1
756 views58 slides
Stage Level Scheduling Improving Big Data and AI Integration by
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
850 views38 slides
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud by
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
33.4K views60 slides
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... by
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...Databricks
684 views52 slides
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022 by
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022HostedbyConfluent
1.2K views31 slides
Open Policy Agent by
Open Policy AgentOpen Policy Agent
Open Policy AgentTorin Sandall
7.2K views52 slides
Data ingestion and distribution with apache NiFi by
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiLev Brailovskiy
8.5K views35 slides
Running Apache Spark on Kubernetes: Best Practices and Pitfalls by
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
2.9K views36 slides
Top 5 Mistakes to Avoid When Writing Apache Spark Applications by
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
127.8K views69 slides
Productizing Structured Streaming Jobs by
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming JobsDatabricks
3.2K views63 slides
Processing Large Data with Apache Spark -- HasGeek by
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
12.4K views72 slides
Optimizing Apache Spark SQL Joins by
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
44.9K views24 slides

What's hot (20)

Google Cloud and Data Pipeline Patterns by Lynn Langit
Google Cloud and Data Pipeline PatternsGoogle Cloud and Data Pipeline Patterns
Google Cloud and Data Pipeline Patterns
Lynn Langit6.6K views
Radical Speed for SQL Queries on Databricks: Photon Under the Hood by Databricks
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks1.1K views
Apache Iceberg - A Table Format for Hige Analytic Datasets by Alluxio, Inc.
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.6.6K views
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... by Edureka!
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!2K views
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME by confluent
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LMESet your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
Set your Data in Motion with Confluent & Apache Kafka Tech Talk Series LME
confluent352 views
Flink vs. Spark by Slim Baltagi
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
Slim Baltagi69.5K views
Full Stack Graph in the Cloud by Neo4j
Full Stack Graph in the CloudFull Stack Graph in the Cloud
Full Stack Graph in the Cloud
Neo4j377 views
Why your Spark job is failing by Sandy Ryza
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
Sandy Ryza69.1K views
OpenTelemetry Introduction by DimitrisFinas1
OpenTelemetry Introduction OpenTelemetry Introduction
OpenTelemetry Introduction
DimitrisFinas1756 views
Stage Level Scheduling Improving Big Data and AI Integration by Databricks
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks850 views
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud by Noritaka Sekiyama
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama33.4K views
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... by Databricks
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks684 views
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022 by HostedbyConfluent
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
CDC Stream Processing With Apache Flink With Timo Walther | Current 2022
HostedbyConfluent1.2K views
Data ingestion and distribution with apache NiFi by Lev Brailovskiy
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
Lev Brailovskiy8.5K views
Running Apache Spark on Kubernetes: Best Practices and Pitfalls by Databricks
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks2.9K views
Top 5 Mistakes to Avoid When Writing Apache Spark Applications by Cloudera, Inc.
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Cloudera, Inc.127.8K views
Productizing Structured Streaming Jobs by Databricks
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks3.2K views
Processing Large Data with Apache Spark -- HasGeek by Venkata Naga Ravi
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi12.4K views
Optimizing Apache Spark SQL Joins by Databricks
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks44.9K views

Similar to GraphFrames: Graph Queries In Spark SQL

GraphFrames: Graph Queries in Spark SQL by Ankur Dave by
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveSpark Summit
7K views20 slides
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20) by
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
6.3K views33 slides
What Makes Graph Queries Difficult? by
What Makes Graph Queries Difficult?What Makes Graph Queries Difficult?
What Makes Graph Queries Difficult?Gábor Szárnyas
92 views34 slides
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ... by
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
1.3K views37 slides
3DRepo by
3DRepo3DRepo
3DRepoMongoDB
854 views42 slides
Web-Scale Graph Analytics with Apache® Spark™ by
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
2.7K views59 slides
Synthetic Encoding by
Synthetic EncodingSynthetic Encoding
Synthetic EncodingCheng LI
437 views30 slides
GraphFrames: DataFrame-based graphs for Apache® Spark™ by
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
7.3K views49 slides
Web-Scale Graph Analytics with Apache® Spark™ by
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
1.4K views56 slides
High-Performance Graph Analysis and Modeling by
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
200 views48 slides
DiscoGAN by
DiscoGANDiscoGAN
DiscoGANIl Gu Yi
2.2K views60 slides
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac... by
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
1.7K views102 slides
Neo4j MeetUp - Graph Exploration with MetaExp by
Neo4j MeetUp - Graph Exploration with MetaExpNeo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExpAdrian Ziegler
198 views27 slides
Machine Learning and GraphX by
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
6.1K views46 slides
Mapping Graph Queries to PostgreSQL by
Mapping Graph Queries to PostgreSQLMapping Graph Queries to PostgreSQL
Mapping Graph Queries to PostgreSQLGábor Szárnyas
2.6K views28 slides
Graph technology meetup slides by
Graph technology meetup slidesGraph technology meetup slides
Graph technology meetup slidesSean Mulvehill
216 views36 slides
Outrageous Ideas for Graph Databases by
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesMax De Marzi
1.1K views114 slides
Transformations and actions a visual guide training by
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
6.9K views122 slides
Lecture 3.pdf by
Lecture 3.pdfLecture 3.pdf
Lecture 3.pdfDanielGarca686549
5 views35 slides
Large-scale Recommendation Systems on Just a PC by
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCAapo Kyrölä
6.2K views41 slides

Similar to GraphFrames: Graph Queries In Spark SQL (20)

GraphFrames: Graph Queries in Spark SQL by Ankur Dave by Spark Summit
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
Spark Summit7K views
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20) by Ankur Dave
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
Ankur Dave6.3K views
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ... by Spark Summit
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Spark Summit1.3K views
3DRepo by MongoDB
3DRepo3DRepo
3DRepo
MongoDB854 views
Web-Scale Graph Analytics with Apache® Spark™ by Databricks
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
Databricks2.7K views
Synthetic Encoding by Cheng LI
Synthetic EncodingSynthetic Encoding
Synthetic Encoding
Cheng LI437 views
GraphFrames: DataFrame-based graphs for Apache® Spark™ by Databricks
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
Databricks7.3K views
Web-Scale Graph Analytics with Apache® Spark™ by Databricks
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
Databricks1.4K views
High-Performance Graph Analysis and Modeling by Nesreen K. Ahmed
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
Nesreen K. Ahmed200 views
DiscoGAN by Il Gu Yi
DiscoGANDiscoGAN
DiscoGAN
Il Gu Yi2.2K views
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac... by Databricks
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Databricks1.7K views
Neo4j MeetUp - Graph Exploration with MetaExp by Adrian Ziegler
Neo4j MeetUp - Graph Exploration with MetaExpNeo4j MeetUp - Graph Exploration with MetaExp
Neo4j MeetUp - Graph Exploration with MetaExp
Adrian Ziegler198 views
Machine Learning and GraphX by Andy Petrella
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
Andy Petrella6.1K views
Mapping Graph Queries to PostgreSQL by Gábor Szárnyas
Mapping Graph Queries to PostgreSQLMapping Graph Queries to PostgreSQL
Mapping Graph Queries to PostgreSQL
Gábor Szárnyas2.6K views
Graph technology meetup slides by Sean Mulvehill
Graph technology meetup slidesGraph technology meetup slides
Graph technology meetup slides
Sean Mulvehill216 views
Outrageous Ideas for Graph Databases by Max De Marzi
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph Databases
Max De Marzi1.1K views
Transformations and actions a visual guide training by Spark Summit
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit6.9K views
Large-scale Recommendation Systems on Just a PC by Aapo Kyrölä
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
Aapo Kyrölä6.2K views

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang by
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
6.8K views32 slides
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... by
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
4.3K views34 slides
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu by
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
3.1K views21 slides
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra by
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
3.1K views21 slides
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... by
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
3.4K views56 slides
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... by
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
1.7K views90 slides
Apache Spark and Tensorflow as a Service with Jim Dowling by
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
2.1K views44 slides
Apache Spark and Tensorflow as a Service with Jim Dowling by
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
1.1K views44 slides
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... by
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
2.8K views22 slides
Next CERN Accelerator Logging Service with Jakub Wozniak by
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
2.3K views36 slides
Powering a Startup with Apache Spark with Kevin Kim by
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
787 views30 slides
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra by
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
730 views20 slides
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... by
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
1.1K views26 slides
How Nielsen Utilized Databricks for Large-Scale Research and Development with... by
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
999 views15 slides
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... by
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
2.3K views19 slides
Goal Based Data Production with Sim Simeonov by
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
746 views44 slides
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... by
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
1.5K views32 slides
Getting Ready to Use Redis with Apache Spark with Dvir Volk by
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
3K views43 slides
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... by
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
915 views26 slides
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... by
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
1.7K views34 slides

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang by Spark Summit
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit6.8K views
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... by Spark Summit
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit4.3K views
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu by Spark Summit
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit3.1K views
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra by Spark Summit
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit3.1K views
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... by Spark Summit
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit3.4K views
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... by Spark Summit
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit1.7K views
Apache Spark and Tensorflow as a Service with Jim Dowling by Spark Summit
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit2.1K views
Apache Spark and Tensorflow as a Service with Jim Dowling by Spark Summit
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit1.1K views
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... by Spark Summit
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit2.8K views
Next CERN Accelerator Logging Service with Jakub Wozniak by Spark Summit
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit2.3K views
Powering a Startup with Apache Spark with Kevin Kim by Spark Summit
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit787 views
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra by Spark Summit
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit730 views
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... by Spark Summit
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit1.1K views
How Nielsen Utilized Databricks for Large-Scale Research and Development with... by Spark Summit
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit999 views
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... by Spark Summit
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit2.3K views
Goal Based Data Production with Sim Simeonov by Spark Summit
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit746 views
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... by Spark Summit
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit1.5K views
Getting Ready to Use Redis with Apache Spark with Dvir Volk by Spark Summit
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit3K views
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... by Spark Summit
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit915 views
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... by Spark Summit
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit1.7K views

Recently uploaded

Running PostgreSQL in a Kubernetes cluster: CloudNativePG by
Running PostgreSQL in a Kubernetes cluster: CloudNativePGRunning PostgreSQL in a Kubernetes cluster: CloudNativePG
Running PostgreSQL in a Kubernetes cluster: CloudNativePGNick Ivanov
11 views29 slides
Deep analytics via learning to reason by
Deep analytics via learning to reasonDeep analytics via learning to reason
Deep analytics via learning to reasonDeakin University
6 views57 slides
2312 PACLIC by
2312 PACLIC2312 PACLIC
2312 PACLICWarNik Chow
5 views1 slide
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion by
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionBertram Ludäscher
12 views37 slides
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange by
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS TriangeAnalytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS TriangeRNayak3
5 views6 slides
Customer Data Cleansing Project.pptx by
Customer Data Cleansing Project.pptxCustomer Data Cleansing Project.pptx
Customer Data Cleansing Project.pptxNat O
7 views23 slides
PyData Global 2022 - Things I learned while running neural networks on microc... by
PyData Global 2022 - Things I learned while running neural networks on microc...PyData Global 2022 - Things I learned while running neural networks on microc...
PyData Global 2022 - Things I learned while running neural networks on microc...SARADINDU SENGUPTA
6 views12 slides
Construction Accidents & Injuries by
Construction Accidents & InjuriesConstruction Accidents & Injuries
Construction Accidents & InjuriesBisnar Chase Personal Injury Attorneys
7 views5 slides
AZConf 2023 - Considerations for LLMOps: Running LLMs in production by
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionAZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionSARADINDU SENGUPTA
13 views16 slides
Report on OSINT by
Report on OSINTReport on OSINT
Report on OSINTAyonDebnathCertified
6 views15 slides
Customize GA4 Dashboard and Collections.pdf by
Customize GA4 Dashboard and Collections.pdfCustomize GA4 Dashboard and Collections.pdf
Customize GA4 Dashboard and Collections.pdfGA4 Tutorials
9 views8 slides
Harry Potter & Apache iceberg format by
Harry Potter & Apache iceberg formatHarry Potter & Apache iceberg format
Harry Potter & Apache iceberg formatTaras Fedorov
8 views53 slides
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language... by
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...patiladiti752
9 views15 slides
Introduction to Statistics information.pptx by
Introduction to Statistics information.pptxIntroduction to Statistics information.pptx
Introduction to Statistics information.pptxHarshiHarshika1
6 views22 slides
Business administration Project File.pdf by
Business administration Project File.pdfBusiness administration Project File.pdf
Business administration Project File.pdfKiranPrajapati91
12 views36 slides
AIMS-EREA.pdf by
AIMS-EREA.pdfAIMS-EREA.pdf
AIMS-EREA.pdfSudarson Roy Pratihar
11 views18 slides
Underfunded.pptx by
Underfunded.pptxUnderfunded.pptx
Underfunded.pptxvgarcia19
16 views7 slides
Pydata Global 2023 - How can a learnt model unlearn something by
Pydata Global 2023 - How can a learnt model unlearn somethingPydata Global 2023 - How can a learnt model unlearn something
Pydata Global 2023 - How can a learnt model unlearn somethingSARADINDU SENGUPTA
12 views13 slides
Penetration testing by Burpsuite by
Penetration testing by  BurpsuitePenetration testing by  Burpsuite
Penetration testing by BurpsuiteAyonDebnathCertified
6 views19 slides
GDG Community Day 2023 - Interpretable ML in production by
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in productionSARADINDU SENGUPTA
7 views19 slides

Recently uploaded (20)

Running PostgreSQL in a Kubernetes cluster: CloudNativePG by Nick Ivanov
Running PostgreSQL in a Kubernetes cluster: CloudNativePGRunning PostgreSQL in a Kubernetes cluster: CloudNativePG
Running PostgreSQL in a Kubernetes cluster: CloudNativePG
Nick Ivanov11 views
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion by Bertram Ludäscher
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange by RNayak3
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS TriangeAnalytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange
Analytics Center of Excellence | Data CoE |Analytics CoE| WNS Triange
RNayak35 views
Customer Data Cleansing Project.pptx by Nat O
Customer Data Cleansing Project.pptxCustomer Data Cleansing Project.pptx
Customer Data Cleansing Project.pptx
Nat O7 views
PyData Global 2022 - Things I learned while running neural networks on microc... by SARADINDU SENGUPTA
PyData Global 2022 - Things I learned while running neural networks on microc...PyData Global 2022 - Things I learned while running neural networks on microc...
PyData Global 2022 - Things I learned while running neural networks on microc...
AZConf 2023 - Considerations for LLMOps: Running LLMs in production by SARADINDU SENGUPTA
AZConf 2023 - Considerations for LLMOps: Running LLMs in productionAZConf 2023 - Considerations for LLMOps: Running LLMs in production
AZConf 2023 - Considerations for LLMOps: Running LLMs in production
Customize GA4 Dashboard and Collections.pdf by GA4 Tutorials
Customize GA4 Dashboard and Collections.pdfCustomize GA4 Dashboard and Collections.pdf
Customize GA4 Dashboard and Collections.pdf
GA4 Tutorials9 views
Harry Potter & Apache iceberg format by Taras Fedorov
Harry Potter & Apache iceberg formatHarry Potter & Apache iceberg format
Harry Potter & Apache iceberg format
Taras Fedorov8 views
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language... by patiladiti752
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...
Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language...
patiladiti7529 views
Introduction to Statistics information.pptx by HarshiHarshika1
Introduction to Statistics information.pptxIntroduction to Statistics information.pptx
Introduction to Statistics information.pptx
HarshiHarshika16 views
Business administration Project File.pdf by KiranPrajapati91
Business administration Project File.pdfBusiness administration Project File.pdf
Business administration Project File.pdf
KiranPrajapati9112 views
Underfunded.pptx by vgarcia19
Underfunded.pptxUnderfunded.pptx
Underfunded.pptx
vgarcia1916 views
Pydata Global 2023 - How can a learnt model unlearn something by SARADINDU SENGUPTA
Pydata Global 2023 - How can a learnt model unlearn somethingPydata Global 2023 - How can a learnt model unlearn something
Pydata Global 2023 - How can a learnt model unlearn something
GDG Community Day 2023 - Interpretable ML in production by SARADINDU SENGUPTA
GDG Community Day 2023 - Interpretable ML in productionGDG Community Day 2023 - Interpretable ML in production
GDG Community Day 2023 - Interpretable ML in production

GraphFrames: Graph Queries In Spark SQL

  • 1. GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)
  • 2. + Graph Queries 2016 Apache Spark + GraphFrames GraphFrames (2016) + Graph Algorithms 2013 Apache Spark + GraphX Relational Queries 2009 Spark
  • 3. Graph Algorithms vs. Graph Queries ≈ x PageRank Alternating Least Squares Graph Algorithms Graph Queries
  • 4. Graph Algorithms vs. Graph Queries Graph Algorithm: PageRank Graph Query: Wikipedia Collaborators Editor 1 Editor 2 Article 1 Article 2 ⇓ Article 1 Article 2 Editor 1 Editor 2 same day} same day}
  • 5. Graph Algorithms vs. Graph Queries Graph Algorithm: PageRank // Iterate until convergence wikipedia.pregel( sendMsg = { e => e.sendToDst(e.srcRank * e.weight) }, mergeMsg = _ + _, vprog = { (id, oldRank, msgSum) => 0.15 + 0.85 * msgSum }) Graph Query: Wikipedia Collaborators wikipedia.find( "(u1)-[e11]->(article1); (u2)-[e21]->(article1); (u1)-[e12]->(article2); (u2)-[e22]->(article2)") .select( "*", "e11.date – e21.date".as("d1"), "e12.date – e22.date".as("d2")) .sort("d1 + d2".desc).take(10)
  • 7. Raw Wikipedia < / >< / >< / > XML Text Table Edit Graph Edit Table Frequent Collaborators Problem: Mixed Graph Analysis Hyperlinks PageRank Article Text User Article Vandalism Suspects User User User Article
  • 8. Solution: GraphFrames Graph Algorithms Graph Queries Spark SQL GraphFramesAPI Pattern Query Optimizer
  • 9. GraphFrames API • Unifies graph algorithms, graph queries, and DataFrames • Available in Scala,Java, and Python class GraphFrame { def vertices: DataFrame def edges: DataFrame def find(pattern: String): DataFrame def registerView(pattern: String, df: DataFrame): Unit def degrees(): DataFrame def pageRank(): GraphFrame def connectedComponents(): GraphFrame ... }
  • 10. Implementation Parsed Pattern Logical Plan Materialized Views Optimized Logical Plan DataFrame Result Query String Graph–Relational Translation Join Elimination and Reordering Spark SQL View Selection Graph Algorithms GraphX
  • 11. Graph–Relational Translation B D A C Existing Logical Plan Output: A,B,C Src Dst ⋈C=Src Edge Table ID Attr VertexTable ⋈D=ID
  • 12. Materialized View Selection GraphX: Triplet view enabled efficient message-passing algorithms Vertices B A C D Edges A B A C B C C D A B Triplet View A C B C C D Graph + Updated PageRanks B A C D A
  • 13. Materialized View Selection GraphFrames: User-defined views enable efficient graph queries Vertices B A C D Edges A B A C B C C D A B Triplet View A C B C C D Graph User-Defined Views PageRank Community Detection … Graph Queries
  • 14. Join Elimination Src Dst 1 2 1 3 2 3 2 5 Edges ID Attr 1 A 2 B 3 C 4 D Vertices SELECT src, dst FROM edges INNER JOIN vertices ON src = id; Unnecessaryjoin can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation: SELECT src, dst FROM edges;
  • 15. Join Reordering A → B B → A ⋈A, B B → D C → B⋈B B → E⋈B C → D⋈B C → E⋈C, D ⋈C, E Example Query Left-Deep Plan BushyPlan A → B B → A ⋈A, B B → D C → B ⋈B B → E⋈B ⋈B ⋈B, C User-Defined View
  • 16. Evaluation Faster than Neo4j for unanchored patternqueries 0 0.5 1 1.5 2 2.5 GraphFrames Neo4j Querylatency,s AnchoredPatternQuery 0 10 20 30 40 50 60 70 80 GraphFrames Neo4j Querylatency,s UnanchoredPatternQuery Triangle query on 1M edge subgraph of web-Google. Each system configured touse a single core.
  • 17. Evaluation Approaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation 0 1 2 3 4 5 6 7 GraphFrames GraphX Naïve Spark Per-iterationruntime,s PageRankPerformance Per-iteration performance on web-Google, single 8-core machine. Naïve SparkusesScala RDD API.
  • 18. Evaluation Registering the right views cangreatlyimprove performance for some queries Workload: J. Huang, K. Venkatraman, and D.J. Abadi.Query optimization of distributed pattern matching. In ICDE 2014.
  • 19. Future Work • Suggest views automatically • Exploit attribute-based partitioning in optimizer • Code generationfor single node
  • 20. Try It Out! Releasedas a Spark Package at: https://github.com/graphframes/graphframes Thanks to Joseph Bradley,Xiangrui Meng,and Timothy Hunter. ankurd@eecs.berkeley.edu