While struggling to choose among different computing and machine learning frameworks such as Spark, Dask, Scikit-learn, and TensorFlow for your ETL and machine learning projects, have you thought about unifying them into one ecosystem?
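As a concrete taste of the unified-interface idea, here is a minimal sketch using Fugue's transform(), assuming its documented API; the data and the add_flag function are illustrative, not from the talk:

```python
import pandas as pd
from fugue import transform
from pyspark.sql import SparkSession

# Framework-agnostic logic: plain pandas in, plain pandas out.
def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    df["flag"] = df["value"] > 10
    return df

data = pd.DataFrame({"value": [5, 15, 25]})

# Run locally on pandas (no Spark involved).
local_result = transform(data, add_flag, schema="*,flag:bool")

# Run the identical function distributed on Spark by swapping the engine.
spark = SparkSession.builder.getOrCreate()
spark_result = transform(data, add_flag, schema="*,flag:bool", engine=spark)
spark_result.show()
```

The business logic stays framework-free; only the engine argument decides where it runs.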
Enabling Scalable Data Science Pipeline with MLflow at Thermo Fisher Scientific (Databricks)
Thermo Fisher Scientific has one of the most extensive product portfolios in the industry, ranging from reagents to capital instruments, across customers in biotechnology, pharmaceuticals, academia, and more.
The modern data customer wants data now. Batch workloads are not going anywhere, but at Scribd the future of our data platform requires more and more streaming data sets.
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler (Databricks)
Kubernetes is the most popular container orchestration system, designed natively for the cloud. At Lyft and Cloudera, we have both built next-generation, cloud-native infrastructure based on Kubernetes that supports various distributed workloads.
We want to present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented have been tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
- Why? Custom queries on top of a table: we load the data once and query N times
- Why not Structured Streaming
- Working solution using Redis

Niche 2: Distributed Counters
- Problems with Spark accumulators
- Utilizing Redis hashes as distributed counters (see the sketch after this list)
- Precautions for retries and speculative execution
- Pipelining to improve performance
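As a rough illustration of Niche 2, here is a minimal sketch (not Adobe's code) of using a Redis hash as a distributed counter from Spark executors, with pipelining to cut round trips; the host, key, and column names are placeholders:

```python
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events")  # placeholder input

def count_partition(rows):
    # Pre-aggregate locally so Redis sees one HINCRBY per field, not per row.
    local = {}
    for row in rows:
        local[row["status"]] = local.get(row["status"], 0) + 1
    r = redis.Redis(host="redis-host", port=6379)
    pipe = r.pipeline(transaction=False)  # pipelining: batch commands, one round trip
    for field, n in local.items():
        pipe.hincrby("counters:job-42", field, n)
    pipe.execute()
    # Caveat from the talk: retried or speculative tasks would increment twice;
    # guard with per-(partition, attempt) keys or disable speculation.

df.foreachPartition(count_partition)
print(redis.Redis(host="redis-host").hgetall("counters:job-42"))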
How We Optimize Spark SQL Jobs with Parallel and Sync IO (Databricks)
Although NVMe has become more and more popular in recent years, large numbers of HDDs are still widely used in super-large-scale big data clusters. In an EB-level data platform, IO (including decompression and decoding) contributes a large proportion of Spark jobs' cost. In other words, IO operations are worth optimizing.
At ByteDance, we have implemented a series of IO optimizations to improve performance, including parallel reads and asynchronous shuffle. First, we implemented file-level parallel reads to improve performance when there are many small files. Second, we designed row-group-level parallel reads to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. Besides these, we designed Parquet column families, which split a table into a few column families stored in separate Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%.
In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.
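ByteDance's changes live inside Spark's Parquet reader, but the small-files idea can be sketched in userspace: read many files concurrently instead of sequentially. A hypothetical illustration with PyArrow (paths and pool size are arbitrary):

```python
from concurrent.futures import ThreadPoolExecutor
import pyarrow as pa
import pyarrow.parquet as pq

files = [f"/data/part-{i:05d}.parquet" for i in range(200)]  # many small files

def read_one(path: str) -> pa.Table:
    return pq.read_table(path)  # each call is dominated by IO, so threads overlap well

# File-level parallelism: overlap the per-file IO instead of reading serially.
with ThreadPoolExecutor(max_workers=16) as pool:
    tables = list(pool.map(read_one, files))

combined = pa.concat_tables(tables)
print(combined.num_rows)
```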
Superworkflow of Graph Neural Networks with K8S and Fugue (Databricks)
This document introduces a superworkflow for running Node2Vec on graphs using the Fugue framework on Kubernetes. It describes the Node2Vec algorithm and the steps in the superworkflow, including graph creation and indexing, random walks, Word2Vec preprocessing, and embedding training. The superworkflow provides advantages such as parallelized steps and efficient resource usage through auto-persist and checkpointing. Benchmark results show the superworkflow reduces runtime significantly compared to Spark MLlib, for example reducing a 100M-node graph embedding from 6,800 CPU hours to 100 CPU hours and 16 GPU hours. Open source links for the Node2Vec on Fugue project are also provided.
Managing big data stored on ADLSgen2/Databricks may be challenging. Setting up security, moving or copying the data of Hive tables or their partitions may be very slow, especially when dealing with hundreds of thousands of files.
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... (Databricks)
The convergence of big data technology towards the traditional database domain has become an industry trend. At present, open source big data processing engines such as Apache Spark, Apache Hadoop, and Apache Flink already support SQL interfaces, and SQL usage basically occupies a dominant position. Companies use the above open source software to build their own ETL frameworks and OLAP technology. However, OLTP remains a strong point of traditional databases, one of the main reasons being their support for ACID.
Productionizing Machine Learning with a Microservices Architecture (Databricks)
Deploying machine learning models from training to production requires companies to deal with the complexity of moving workloads through different pipelines and rewriting code from scratch.
A Collaborative Data Science Development Workflow (Databricks)
Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production-worthy, the resulting model is served to a production application through MLflow.
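A minimal sketch of the tracking half of such a workflow, assuming an sklearn-style model; the run, parameter, and metric names are illustrative:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

with mlflow.start_run(run_name="candidate-pipeline"):
    mlflow.log_param("n_estimators", 100)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    mlflow.log_metric("train_mse", mean_squared_error(y, model.predict(X)))
    # Log the artifact so a production application can later load and serve it.
    mlflow.sklearn.log_model(model, "model")
```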
Building a Streaming Microservice Architecture: with Apache Spark Structured ... (Databricks)
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
Performant Streaming in Production: Preventing Common Pitfalls when Productio... (Databricks)
Running a stream in a development environment is relatively easy. However, some topics can cause serious issues in production when they are not addressed properly.
Magnet Shuffle Service: Push-based Shuffle at LinkedIn (Databricks)
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, processes PBs of data and billions of blocks daily in our clusters. With such a rapid increase in Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workload efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
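Magnet was later upstreamed into Apache Spark 3.2 as push-based shuffle; assuming that release line (YARN plus the external shuffle service), enabling it looks roughly like this:

```python
from pyspark.sql import SparkSession

# Sketch only: push-based shuffle requires YARN and the external shuffle service,
# and the config names below come from the Spark 3.2+ upstreaming of Magnet.
spark = (
    SparkSession.builder
    .appName("push-based-shuffle-demo")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.shuffle.push.enabled", "true")
    .getOrCreate()
)
```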
On Improving Broadcast Joins in Apache Spark SQL (Databricks)
- The document discusses improving broadcast joins in Apache Spark SQL, which are more efficient than shuffle joins when the broadcasted data fits in memory.
- Experimenting with increasing the broadcast threshold showed that executor-side broadcasting performs better than driver-side broadcasting by avoiding data shuffling to the driver (see the sketch after this list).
- Comparing the cost models of shuffle joins and broadcast joins showed that shuffle joins perform better with more cores while broadcast joins perform better when the size difference between tables is larger.
- Applying these techniques to joins in Workday HR customer data pipelines showed that increasing the broadcast threshold did not always improve performance due to the presence of self-joins and outer joins.
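A minimal sketch of the two knobs discussed above, the broadcast threshold and an explicit broadcast hint; table and column names are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Raise the auto-broadcast threshold to 100 MB (the default is 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

workers = spark.table("hr_workers")          # large fact table (illustrative)
departments = spark.table("hr_departments")  # small dimension table (illustrative)

# Or force the plan explicitly: broadcast the small side regardless of statistics.
joined = workers.join(broadcast(departments), "dept_id")
joined.explain()
```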
Understanding and Improving Code Generation (Databricks)
Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function.
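To see whole-stage codegen at work, Spark can print the generated Java for a plan; a quick sketch (the mode argument requires Spark 3.0+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled").filter("doubled % 3 = 0")

# Operators marked with '*' in the plan are collapsed into one generated function.
df.explain()
# Print the actual generated Java source for each whole-stage subtree.
df.explain(mode="codegen")
```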
End-to-End Deep Learning with Horovod on Apache Spark (Databricks)
Data processing and deep learning are often split into two pipelines, one for ETL processing, the second for model training. Enabling deep learning frameworks to integrate seamlessly with ETL jobs allows for more streamlined production jobs, with faster iteration between feature engineering and model training.
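A rough sketch of the horovod.spark entry point that enables this, with the training body elided; num_proc and the function body are placeholders:

```python
import horovod.spark

def train():
    # Each Spark task becomes one Horovod worker; a real job would build and
    # fit a Keras/PyTorch model here using Horovod's distributed optimizer.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    return hvd.rank()

# Runs `train` on 4 Spark tasks and gathers the return values on the driver.
ranks = horovod.spark.run(train, num_proc=4)
print(sorted(ranks))
```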
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark (Databricks)
The trade-off between development speed and pipeline maintainability is a constant for data engineers, especially for those in a rapidly evolving organization.
Continuous Processing in Structured Streaming with Jose Torres (Databricks)
This talk will cover the details of Continuous Processing in Structured Streaming and my work implementing the initial version in Spark 2.3, as well as the updates for 2.4. DStreams was Spark's first attempt at streaming, and through DStreams Spark became the first framework to provide both batch and streaming functionality in one unified execution engine.
Streaming execution happens through this "micro-batch" model, in which the underlying execution engine simply runs on batches of data over and over again. DStreams' design tightly couples the user-facing APIs with the execution model, and as a result it was very difficult to accomplish certain tasks important in streaming, e.g. using event time and working with late data, without breaking the user-facing APIs. Structured Streaming was the second (and latest) major streaming effort in Spark. Its design decouples the frontend (user-facing APIs) from the backend (execution), and allows us to change the execution model without any user API change.
However, the (historical) minimum possible latency for any record in DStreams or Structured Streaming was bounded by the amount of time it takes to launch a task. This limitation is a result of the fact that the engine requires us to know both the starting and the ending offset before any tasks are launched. In the worst case, the end-to-end latency is actually closer to the average batch time plus the task launching time. Continuous Processing removes this constraint and allows users to achieve sub-millisecond end-to-end latencies with the new execution engine.
This talk will take a technical deep dive into its capabilities, what it took to implement, and discuss the future developments.
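From the user's side the switch is just a trigger change; a sketch assuming a Kafka-to-Kafka pipeline (hosts, topics, and paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

query = (
    events.selectExpr("key", "value")    # continuous mode supports map-like ops
    .writeStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "events-out")
    .option("checkpointLocation", "/tmp/continuous-ckpt")
    .trigger(continuous="1 second")      # a checkpoint interval, not a batch interval
    .start()
)
```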
Databricks: What We Have Learned by Eating Our Dog Food (Databricks)
"Databricks Unified Analytics Platform (UAP) is a cloud-based service for running all analytics in one place - from highly reliable and performant data pipelines to state-of-the-art Machine Learning. From the original creators of Apache Spark and MLflow, it provides data science and engineering teams ready to use pre-packaged clusters with optimized Apache Spark and various ML frameworks coupled with powerful collaboration capabilities to improve productivity across the ML lifecycle. Yada yada yada... But in addition to being a vendor Databricks is also a user of UAP.
So, what have we learned by eating our own dogfood? Attend a 'from the trenches' report from Suraj Acharya, Director of Engineering responsible for Databricks' in-house data engineering team, on how his team put Databricks technology to use, the lessons they have learned along the way, and best practices for using Databricks for data engineering."
Improving Apache Spark's Reliability with DataSourceV2 (Databricks)
DataSourceV2 is Spark's new API for working with data from tables and streams, but "v2" also includes a set of changes to SQL internals, the addition of a catalog API, and changes to the data frame read and write APIs. This talk will cover the context for those additional changes and how "v2" will make Spark more reliable and predictable for building enterprise data pipelines. This talk will include:
- Problem areas where the current behavior is unpredictable or unreliable
- The new standard SQL write plans (and the related SPIP)
- The new table catalog API and a new Scala API for table DDL operations (and the related SPIP)
- Netflix's use case that motivated these changes
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs (Databricks)
This document summarizes Daniel Galvez's presentation on creating The People's Speech Dataset using Apache Spark and TPUs. The key points are:
1) The dataset aims to provide 86,000 hours of speech data with forced alignments between audio and transcripts in order to be challenging, free to use, and have a commercial license.
2) The conceptual workload is to take hour-long audio files, split them into 15-second segments, and use a pretrained speech recognition model to discover when each word in the transcript was said.
3) Creating the dataset encountered limitations with accelerator-aware scheduling in Spark, memory issues with PySpark UDFs, crashes in TPUs, and the need to reorder data by
Data Distribution and Ordering for Efficient Data Source V2 (Databricks)
This presentation discusses data distribution and ordering in Apache Iceberg's Data Source V2. It explains that proper distribution and ordering of data is important for performance when writing and reading large datasets. The new version introduces an API for connectors to specify their required distribution and ordering, addressing issues in V1 where connectors could apply arbitrary transformations. Supported distribution options include ordered, clustered, and unspecified, and the API supports batch and streaming writes. Future work includes supporting distribution and ordering in table creation and improving partition handling. Proper data distribution and ordering is key to scaling performance in Iceberg.
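Assuming the Iceberg Spark SQL extensions are installed, the requested distribution and ordering can be declared on the table, roughly like this (catalog, table, and column names are placeholders):

```python
# Requires the iceberg-spark runtime jar and
# spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cluster rows by partition, then sort within tasks before writing.
spark.sql("ALTER TABLE demo.db.events WRITE DISTRIBUTED BY PARTITION")
spark.sql("ALTER TABLE demo.db.events WRITE ORDERED BY ts, device_id")
```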
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
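A minimal sketch of the local-testing point: a pytest fixture that spins up a local SparkSession so transformations can be verified without a cluster:

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # local[2] gives two worker threads; no cluster, HDFS, or Hive required.
    return (
        SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
    )

def test_dedup(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "v"])
    assert df.dropDuplicates().count() == 2
```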
Apache Spark has been rapidly gaining steam, both in the headlines and in real-world adoption. Spark was developed in 2009 and open sourced in 2010. Since then, it has grown to become one of the largest open source communities in big data, with over 200 contributors from more than 50 organizations. This open source analytics engine stands out for its ability to process large volumes of data significantly faster than contemporaries such as MapReduce, primarily owing to in-memory storage of data within its own processing framework. That being said, one of the top real-world industry use cases for Apache Spark is its ability to process 'streaming data'.
How the Automation of a Benchmark Framework Keeps Pace with the Dev Cycle at I... (DevOps.com)
The team at InfluxData needed to review benchmark data on InfluxDB during their development cycle to ensure any new changes continue to improve performance. However, using the existing benchmarking framework, InfluxDB-comparison, was manual and time-consuming, and it was not used consistently. To change this, InfluxData asked the team at Bonitoo to enhance the benchmark framework to easily incorporate new use cases, add new versions of the software quickly and easily, and provide benchmark data on a cadence that works for the development cycle (daily, weekly, monthly).
In this webinar the team from Bonitoo will share how they were able to accomplish this as well as build automation into the existing framework. In addition, they will share the benchmark results generated from the framework that highlights how performant a time series database like InfluxDB is compared to the latest versions of products like MongoDB, Cassandra, Elasticsearch, and OpenTSDB.
This document discusses change data capture (CDC) and its components. CDC is an approach that identifies, captures, and delivers changes made to enterprise data sources. It feeds these changes into a central data stream that can be combined with other data sources in real-time. The document outlines Kafka Connect, Debezium, Schema Registry, and Apache Avro which are key parts of the CDC architecture. It also discusses future steps like supporting additional databases and improving deployment, as well as open issues around performance and compatibility with certain databases.
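A rough sketch of wiring up one such source: registering a Debezium MySQL connector (with Avro converters pointed at the Schema Registry) through the Kafka Connect REST API. Hostnames, credentials, and table names are placeholders, and the config keys follow Debezium 1.x naming:

```python
import requests

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",          # topic prefix for captured tables
        "table.include.list": "inventory.customers",
        # Avro + Schema Registry keeps schemas versioned alongside the change stream.
        "key.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url": "http://schema-registry:8081",
        "value.converter.schema.registry.url": "http://schema-registry:8081",
    },
}

resp = requests.post("http://connect:8083/connectors", json=connector)
resp.raise_for_status()
```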
Scaling Monitoring at Databricks from Prometheus to M3 (Libby Schulze)
M3 has been successfully deployed at Databricks to replace their Prometheus monitoring system. Some key lessons learned include monitoring important M3 metrics like memory and disk usage, having automated deployment processes, and planning for capacity needs and spikes in metrics. Updates to M3 have gone smoothly, and future plans include using new M3 features like downsampling and separate namespaces.
Declarative Programming and a Form of SDN (Miya Kohno)
The document discusses declarative programming as it relates to network programmability. It provides examples of declarative versus imperative code and explains key concepts of declarative programming such as the lack of side effects, referential transparency, and idempotence. It also discusses how declarative programming could benefit networking, given its robustness in complex distributed environments, though it may lack universal computational power, and it outlines some declarative approaches being used for network control, orchestration, and automation. OpenDaylight and ETSI NFV architectures are presented as examples combining declarative and imperative approaches.
Concurrency involves accomplishing multiple tasks with shared resources, giving the illusion of parallelism to users. Parallelism accomplishes tasks with dedicated resources, making it most efficient. I/O involves accessing network sockets, disks, files, and system calls, and can be blocking or non-blocking. Computational constructs like processes, threads, mutexes, coroutines, and green threads are used. Amdahl's law and latency versus throughput are discussed. Asynchronous processing uses buffers to separate producers and consumers. Event loops, actors, CSP, and observers are programming abstractions. Various runtimes like the JVM, NodeJS, CPython, Go, BEAM, and GHC are compared for concurrency and parallelism.
Fast federated SQL with Apache Calcite (Chris Baynes)
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
The document discusses upcoming features and changes in Apache Airflow 2.0. Key points include:
1. Scheduler high availability will use an active-active model with row-level locks to allow killing a scheduler without interrupting tasks.
2. DAG serialization will decouple DAG parsing from scheduling to reduce delays, support lazy loading, and enable features like versioning.
3. Performance improvements include optimizing the DAG file processor and using a profiling tool to identify other bottlenecks.
4. The Kubernetes executor will integrate with KEDA for autoscaling and allow customizing pods through templating.
5. The official Helm chart, functional DAGs (sketched below), and smaller usability changes
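For point 5, a minimal sketch of an Airflow 2.0 functional DAG using the TaskFlow decorators, where return values flow between tasks via XCom; the task bodies are placeholders:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2021, 1, 1), catchup=False)
def example_etl():
    @task
    def extract() -> dict:
        return {"rows": 100}            # placeholder payload

    @task
    def load(payload: dict) -> None:
        print(f"loaded {payload['rows']} rows")

    load(extract())                     # the return value is passed via XCom

example_dag = example_etl()
```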
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month (Nicolas Brousse)
TubeMogul grew from a few servers to over two thousand servers, handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with the fast growth, the SRE team had to implement an efficient continuous delivery infrastructure that allowed over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we will cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame challenges.
Improve Monitoring and Observability for Kubernetes with OSS Tools (Nilesh Gule)
Slide deck from the presentation at the KubeDay Singapore event. The session covered the three pillars of observability and how to use Jaeger for distributed tracing, Loki for log aggregation, and Prometheus and Grafana for metrics in a distributed application. An Azure Kubernetes Service (AKS) cluster was used for the live demo.
https://events.linuxfoundation.org/kubeday-singapore/
MongoDB vs ScyllaDB: Tractian's Experience with Real-Time ML (ScyllaDB)
Tractian, an AI-driven industrial monitoring company, recently discovered that their real-time ML environment needed to handle a tenfold increase in data throughput. In this session, JP Voltani (Head of Engineering at Tractian), details why and how they moved to ScyllaDB to scale their data pipeline for this challenge. JP compares ScyllaDB, MongoDB, and PostgreSQL, evaluating their data models, query languages, sharding and replication, and benchmark results. Attendees will gain practical insights into the MongoDB to ScyllaDB migration process, including challenges, lessons learned, and the impact on product performance.
Intro - End to End ML with Kubeflow @ SignalConf 2018 (Holden Karau)
There are many great tools for training machine learning models, ranging from scikit-learn to Apache Spark and TensorFlow. However, many of these systems largely leave open the question of how to use our models outside of the batch world (like in a reactive application). Different options exist for persisting the results and using them for live training, and we will explore the trade-offs of the different formats and their corresponding serving/prediction layers.
Getting Started with PHP on Engine Yard Cloud (Engine Yard)
Topics Covered:
• How to deploy a PHP application to Engine Yard
• How to use Composer to automate dependency management
• The key differences between Orchestra and Engine Yard Cloud
This document discusses using Apache Kafka as a data hub to capture changes from various data sources using change data capture (CDC). It outlines several common CDC patterns, such as using modification dates, database triggers, or log files to identify changes (a sketch of the modification-date pattern follows). It then discusses using Kafka Connect to integrate data sources like MongoDB and PostgreSQL and replicate their changes. The document provides examples of open source CDC connectors and concludes with suggestions for getting involved in the Apache Kafka community.
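Of the patterns listed, the modification-date approach is the simplest to sketch: poll for rows whose timestamp advanced and hand them to a producer. A hypothetical illustration (the table, column, and publish helper are placeholders), with the usual caveat that polling misses hard deletes, which is why log-based tools like Debezium exist:

```python
import time
import psycopg2  # any DB-API driver works; PostgreSQL is just an example

def publish(row):
    # Hypothetical helper, e.g. a KafkaProducer.send() in a real pipeline.
    print("change:", row)

conn = psycopg2.connect("dbname=app user=etl")
last_seen = "1970-01-01 00:00:00"

while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, payload, updated_at FROM orders "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        for row in cur.fetchall():
            publish(row)
            last_seen = str(row[2])   # advance the high-water mark
    time.sleep(5)
```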
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Distributed Real-Time Stream Processing: Why and How (Petr Zapletal)
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also explore theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data have to be processed quickly from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. Stream-based applications, including trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
Modern ETL Pipelines with Change Data Capture (Databricks)
In this talk we'll present how at GetYourGuide we've built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the change data capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT (OpenStack)
Audience: Advanced
About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand.
This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production.
Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in-place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more!
Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited
Bruno Lago is a solutions architect who has been involved with the Catalyst Cloud (New Zealand's first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
Similar to Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
- Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
- Performing data quality validations using libraries built to work with Spark (see the sketch after this list)
- Dynamically generating pipelines that can be abstracted away from users
- Flagging data that doesn't meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
- Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
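As a flavor of the Spark-based validations mentioned above, a minimal hand-rolled check; the dataset, columns, and thresholds are illustrative, and Zillow's platform generalizes this behind a self-service portal:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("listings")   # illustrative dataset

# Expectations: no null ids, and prices within a sane range.
total = df.count()
null_ids = df.filter(F.col("listing_id").isNull()).count()
bad_price = df.filter(~F.col("price").between(0, 10_000_000)).count()

# Fail the pipeline early so bad data never reaches downstream consumers.
if null_ids > 0 or bad_price / max(total, 1) > 0.01:
    raise ValueError(f"quality gate failed: {null_ids} null ids, {bad_price} bad prices")
```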
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, memory, or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator-aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API, and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
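A rough sketch of the PySpark 3.1 stage level scheduling API described above; prepare() and train_fn() are hypothetical placeholders, and running multiple profiles in practice needs dynamic allocation plus a GPU discovery script:

```python
from pyspark import SparkContext
from pyspark.resource import (
    ExecutorResourceRequests, TaskResourceRequests, ResourceProfileBuilder,
)

sc = SparkContext.getOrCreate()

def prepare(x):      # hypothetical featurization step
    return (x, float(x) / 3)

def train_fn(rows):  # hypothetical per-partition training loop
    yield sum(r[1] for r in rows)

# Request GPU-capable executors for the training stage only.
ereqs = ExecutorResourceRequests().cores(4).memory("8g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
gpu_profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

etl = sc.parallelize(range(1_000_000)).map(prepare)        # default resources
training = etl.repartition(4).withResources(gpu_profile)   # this stage switches profiles
results = training.mapPartitions(train_fn).collect()
```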
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
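For reference, a sketch of the converter API the talk describes, assuming Petastorm's published interface; the cache path, DataFrame, and model are placeholders:

```python
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.getOrCreate()

# Intermediate files are materialized once under this cache dir (placeholder path).
spark.conf.set(
    SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///tmp/petastorm_cache"
)

preprocessed_df = spark.table("training_features")   # placeholder Spark DataFrame
converter = make_spark_converter(preprocessed_df)

# A few lines replace the manual Parquet/TFRecord round trip described above.
with converter.make_tf_dataset(batch_size=64) as dataset:
    model.fit(dataset, steps_per_epoch=len(converter) // 64, epochs=3)  # placeholder model
```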
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
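As a toy illustration of how pipelined workflows map onto Ray's compute model (this is not the talk's abstraction; stage_a, stage_b, and raw_data are hypothetical):

import ray

ray.init()

@ray.remote
def fit_transform(stage, data):
    # Each stage exposes the familiar fit/transform contract
    return stage.fit(data).transform(data)

# Stages become Ray tasks; passing an object ref chains them, and
# independent branches run in parallel automatically.
ref_a = fit_transform.remote(stage_a, raw_data)
ref_b = fit_transform.remote(stage_b, ref_a)  # Ray resolves ref_a before the call
result = ray.get(ref_b)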
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not "abelian groups".
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures, and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch, and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
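A minimal sketch of what profiling looks like, assuming whylogs' v1 Python API (the Spark integration follows the same profile-then-inspect pattern at cluster scale):

import pandas as pd
import whylogs as why

df = pd.DataFrame({"amount": [9.5, 12.0, None], "country": ["US", "DE", "US"]})

# Profile the dataframe: a lightweight statistical summary, not a copy of the data
results = why.log(df)
print(results.view().to_pandas())  # per-column counts, null ratios, distributions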
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
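As a toy illustration of rule (ii) above (not Raven's actual implementation), a depth-1 decision tree can be compiled into an equivalent SQL CASE expression so the "model" runs inside the SQL engine:

def stump_to_sql(feature, threshold, left, right):
    # Emit SQL equivalent to a single decision-tree split
    return (f"CASE WHEN {feature} <= {threshold} "
            f"THEN {left} ELSE {right} END")

print(stump_to_sql("sqft", 1500.0, "'low'", "'high'"))
# CASE WHEN sqft <= 1500.0 THEN 'low' ELSE 'high' END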
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data, extending to 3D point cloud data. This growth is further compounded by advances in cloud technologies that make the necessary storage and compute available for such applications. The need for semantically segmented datasets is a key requirement for improving the accuracy of the inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels, such as email and advertisements. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake, and share our experiences, covering:
· What are we storing?
· Multi Source – Multi Channel Problem
· Data Representation and Nested Schema Evolution
· Performance Trade-offs with Various Formats
· Anti-patterns used (String FTW)
· Data Manipulation using UDFs
· Writer Worries and How to Wipe them Away
· Staging Tables FTW (see the sketch after this list)
· Datalake Replication Lag Tracking
· Performance Time!
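For the staging-table pattern above, a hedged sketch of the kind of Delta Lake upsert involved (illustrative only, not Adobe's code; the table path, staging_df, and identity_id key are hypothetical):

from delta.tables import DeltaTable

# Land new profile fragments in a staging table first, then MERGE them
# into the main Delta table keyed on the identity id.
target = DeltaTable.forPath(spark, "/delta/profiles")
(target.alias("t")
    .merge(staging_df.alias("s"), "t.identity_id = s.identity_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())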
Machine Learning CI/CD for Email Attack DetectionDatabricks
Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, the adversarial nature of the problem, and the scale of data. In order to move quickly and adapt to the newest threats, we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack, including joined datasets for hydration, feature extraction code, and detection logic, and to develop and train ML models.
In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.
6. Motivation of Fugue
● A pure abstraction layer
● Unify and simplify core concepts of distributed computing
● Decouple your logic from any specific solution
● Easy to learn and easy to switch
● NOT invasive, NOT obstructive, and NOT exclusive
7. Example: Node2Vec
Apply a walk strategy to a graph to generate a collection of node vectors to be used by embedding algorithms such as Word2Vec
15. Why DAG?
1. X = Run mapper A on a dataframe
2. Map X by mapper B and save
3. Map X by mapper C and save
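A minimal sketch of how steps 1-3 might look in Fugue's Python API (the mappers are hypothetical identity functions; the "# schema: *" comments are Fugue schema hints):

import pandas as pd
from fugue import FugueWorkflow

# schema: *
def mapper_a(df: pd.DataFrame) -> pd.DataFrame:
    return df

# schema: *
def mapper_b(df: pd.DataFrame) -> pd.DataFrame:
    return df

# schema: *
def mapper_c(df: pd.DataFrame) -> pd.DataFrame:
    return df

dag = FugueWorkflow()
x = dag.df([["k1", 0], ["k2", 1]], "k:str,f:int").transform(mapper_a)  # step 1
x.transform(mapper_b).save("/tmp/b.parquet")                           # step 2
x.transform(mapper_c).save("/tmp/c.parquet")                           # step 3
dag.run()  # nothing executes until here, which enables the optimizations below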
16. Optimizations on DAG Execution
● Automatically parallelize independent branches
● Auto persist
● More errors can be captured at “compile” time
● Determinism enables checkpointing, executions can “resume”
18. # Enriched syntax
a := CREATE [["k1",0],["k2",1]] SCHEMA k:str,f:int
# Transformer extension
b := TRANSFORM a USING plus_n PARTITION BY k
# SELECT statement
c := SELECT a.*, b.f2 FROM a JOIN b ON a.k = b.k
# Simplified syntax & multi tasks
SELECT f, f2, 3 AS f3 PERSIST
PRINT
OUTPUT TO "file.parquet"
# Checkpoint
df ?? TRANSFORM b USING expensive_op
OUTPUT c, df USING assert_eq
Fugue SQL
19. Fugue SQL vs Spark SQL
Feature | Fugue SQL | Spark SQL
Workflow level | Yes | No
Cross platform | Yes | No
SELECT statement | Yes | Yes
Other SQL statements | No (can be done in extensions) | Yes
Multiple statements | Yes | Yes (WITH statement)
Spark/Hive UDF (Java/Py) | Yes | Yes
Fugue extensions | Yes | No
Caching/checkpointing | Yes | No
24. ML Library: Node2Vec
● We implemented the distributed Node2Vec algorithm on Fugue
○ Use adjacency lists to represent a graph
○ Distributed Breadth-First Search for random walk
○ Cache critical variables for picking the next step during BFS
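A toy single-machine illustration of the core step that the distributed BFS parallelizes (not the production implementation; the graph is hypothetical):

import random

# Adjacency-list representation of a tiny graph
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}

def random_walk(adj, start, length):
    # Extend the walk one random neighbor at a time
    path = [start]
    while len(path) < length and adj.get(path[-1]):
        path.append(random.choice(adj[path[-1]]))
    return path

print(random_walk(adj, "a", 5))  # e.g. ['a', 'c', 'a', 'b', 'c']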
26. Large Scale Testing
● Graph (10 million vertices, 300 million edges)
○ 2-3 hours with 500 cores and 3 TB memory
● Graph (100 million vertices, 3 billion edges)
○ 6-8 hours with 2,000 cores and 12 TB memory
27. ML Library: Time Series Seasonality
● Forecast seasonality coefficients using Kalman Filter
○ Decent performance on noisy data
○ Simulate special events (holidays, etc.) and anomalies
○ Any interval: hourly, daily, weekly, yearly, etc.
● Handle a very large number of time series with seasonalities
28. Fugue Streaming
● Fugue supports Spark streaming very well
○ Treats batch processing and streaming equivalently
○ Fugue spark-streaming pipelines in production
● Fugue abstracts connectors for streaming
○ Kinesis connectors
○ Confluent Kafka connectors
○ Commonly used streaming APIs
31. Migrated Projects
● Collaborated with multiple product teams to migrate legacy pipelines
○ Large cost and runtime savings
○ Higher testability
○ Shorter development time
● Performance improvement on all migrated projects (by Dec 2019)
○ Average total CPU hours: 74.6% reduction
○ Average total runtime: 83.9% reduction
32. Multi-region Regression
Region-based models to be trained and tuned.
Pipeline | Reliability | Avg Cost/Run | Runtime
Legacy Pipeline | ~80% | ~$630 | 7+ hours
Fugue Pipeline | 99.5% | ~$23 | 30 min
Improvement | - | 95+% reduction | 90+% reduction
33. Time-series Forecasting
Forecast business metrics for better budget planning and decision making
Horizon: weekly, monthly, quarterly
Pipeline | Reliability | Avg Cost/Run | Runtime
Legacy Pipeline | ~70% | ~$70 | 2+ hours
Fugue Pipeline | 99.5% | ~$5 | 10 min
Improvement | - | 90+% reduction | 90+% reduction
34. Summary
▪ Fugue unifies various computing frameworks with uniform interfaces.
▪ Fugue SQL is a novel language for workflows.
▪ K8S + Spark + Fugue is a great combination with high flexibility and efficiency for distributed computing.
▪ The Fugue project will build a unified ecosystem for integrating distributed systems and machine learning.