Spark + AI Summit 2020 Event Overview

Spark Meetup Tokyo #3 Online
Paulo Gutierrez
Solutions Architect at Databricks

SAIS 2020 by the Numbers
• Registration = 67,945
• Attendance = 35,162 (51.8% of registrants)
• Alumni = 6,184
• Training = 8,242
• Free Training = 3,000 (fully booked)
• Paid Training = 5,242

Attendees by Region
• NORAM = 28k
• EMEA = 4k
• APJ = 2k
• LATAM = 1k
• Countries = 125

Attendees by Profile
Role
• Data Engineers = 12k
• Data Scientists = 11k
• Academia = 2k
• Software Engineers = 4k
• Decision Makers = 3k
Status
• Customers = 7k (+200% YoY)
• Community/Prospects = 27k

Media Coverage
• 80+ articles worldwide and counting
• ~50% of articles published outside of the US
• 54 articles in Tier 1 publications worldwide

Introducing Apache Spark™ 3.0:
A 10-Year Retrospective and A Look Ahead

Apache Spark Today: Python
68% of notebook commands on Databricks are in Python

Apache Spark Today: SQL
▪ Exabytes queried/day in SQL on Databricks alone
▪ >90% of Spark API calls run via Spark SQL
▪ TPC-DS benchmark record set using Spark SQL

Apache Spark Today: Streaming
>5 trillion records/day processed on Databricks with Structured Streaming

Apache Spark 3.0
3400+ patches from community
Easy to switch to from Spark 2.x

3.0: SQL Performance Enhancement
Adaptive Query Execution (AQE)
▪ Changes the execution plan at runtime to automatically set the number of reducers and choose join algorithms
▪ Accelerates TPC-DS queries up to 8x
[Chart: TPC-DS 1TB No-Stats, With vs. Without Adaptive Query Execution; Duration (Seconds)]
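To make the runtime re-planning idea concrete, here is a toy Python model of the two decisions AQE is described as making: picking a join algorithm from sizes observed at runtime, and merging small shuffle partitions to set the number of reducers. This is not Spark's implementation. The function names and the 64 MB target are illustrative assumptions; the 10 MB broadcast threshold mirrors the default of Spark's `spark.sql.autoBroadcastJoinThreshold`.

```python
# Toy model of Adaptive Query Execution decisions; not Spark's actual code.
MB = 1024 * 1024
BROADCAST_THRESHOLD = 10 * MB    # mirrors the default autoBroadcastJoinThreshold
TARGET_PARTITION_SIZE = 64 * MB  # illustrative assumption


def choose_join(left_bytes, right_bytes, threshold=BROADCAST_THRESHOLD):
    """Pick a join algorithm from sizes measured after the upstream stage ran."""
    if min(left_bytes, right_bytes) <= threshold:
        return "broadcast_hash_join"
    return "sort_merge_join"


def coalesce_partitions(sizes, target=TARGET_PARTITION_SIZE):
    """Greedily merge adjacent shuffle partitions without exceeding `target`."""
    merged = []
    current = 0
    for size in sizes:
        # Close the current group when adding this partition would overshoot.
        if current and current + size > target:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged
```

With these settings, joining a 5 GB table to a 1 MB table selects the broadcast hash join, and five partitions of 10, 20, 30, 40 and 50 MB coalesce into three reducers.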

3.0: SQL Performance Enhancement
Dynamic Partition Pruning (DPP)
▪ Efficiently broadcast partition information to speed up star-schema join performance
▪ Speeds up 60/102 TPC-DS queries by 2-18x
[Chart: TPC-DS 1TB With vs. Without Dynamic Partition Pruning; Duration (Seconds)]
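The pruning idea can be sketched in a few lines of plain Python, assuming a fact table partitioned by a date key and a small dimension table that is filtered at query time. The table contents and names below are made up for illustration; Spark performs this step inside its optimizer rather than in user code.

```python
# Toy illustration of dynamic partition pruning; table contents are made up.

# Fact table stored as partitions keyed by date; each value is a sale amount.
SALES_BY_DATE = {
    "2020-06-22": [10, 5],
    "2020-06-23": [7],
    "2020-06-24": [3, 4, 8],
    "2020-06-25": [20],
}

# Small dimension table describing each date.
DATES = [
    {"date": "2020-06-22", "on_promo": False},
    {"date": "2020-06-23", "on_promo": False},
    {"date": "2020-06-24", "on_promo": True},
    {"date": "2020-06-25", "on_promo": True},
]


def pruned_join_total(fact_partitions, dim_rows, predicate):
    """Join fact to dimension, reading only partitions the filter can match."""
    # Step 1: evaluate the dimension filter first; its result is small.
    wanted_keys = {row["date"] for row in dim_rows if predicate(row)}
    # Step 2: "broadcast" those keys and skip fact partitions that cannot match.
    scanned = [key for key in fact_partitions if key in wanted_keys]
    # Step 3: only the surviving partitions are read and aggregated.
    total = sum(sum(fact_partitions[key]) for key in scanned)
    return total, scanned


total, scanned = pruned_join_total(SALES_BY_DATE, DATES, lambda row: row["on_promo"])
```

Here only two of the four fact partitions are scanned, which is where the speedup on star-schema joins comes from.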

3.0: SQL Compatibility
ANSI SQL: Run unmodified queries from major SQL engines (language dialect and broader support)
▪ ANSI Reserved Keywords
▪ ANSI Gregorian Calendar
▪ ANSI Store Assignment
▪ ANSI Overflow Checking
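As an illustration of what overflow checking means in practice, the sketch below contrasts 32-bit integer addition that silently wraps around with a checked version that raises an error. It is a plain-Python model with hypothetical function names, not Spark code. In Spark 3.0 the ANSI behavior is opt-in through the `spark.sql.ansi.enabled` setting.

```python
# Plain-Python model of wrapping vs. checked 32-bit addition (hypothetical names).
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def add_wrapping(a, b):
    """32-bit addition that silently wraps around on overflow."""
    return (a + b - INT_MIN) % 2**32 + INT_MIN


def add_checked(a, b):
    """32-bit addition that raises instead of returning a wrapped value."""
    result = a + b
    if not INT_MIN <= result <= INT_MAX:
        raise OverflowError(f"integer overflow: {a} + {b}")
    return result
```

Adding 1 to the largest 32-bit integer wraps to the smallest one in the first function, while the second reports the problem instead of returning a wrong answer.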

3.0: Python & R Performance
▪ Python type hints for Pandas UDFs (replacing the old API)
▪ New Pandas function APIs
▪ Faster Apache Arrow-based calls to Python user code
▪ Vectorized SparkR calls
[Charts: SparkR API Performance and Python Pandas UDF Performance; Time (Seconds)]

3.0: Other Features
▪ Structured Streaming UI
▪ Python Error Messages

Other Apache Spark Ecosystem Projects
▪ Pandas API over Spark
▪ Large-scale genomics
▪ GPU-accelerated data science
▪ Reliable table storage
▪ Scale-out on Spark
▪ Visualization

What is Koalas?
Implementation of Pandas APIs over Spark
▪ Easily port existing data science code
Launched at Spark+AI Summit 2019
Now up to 850,000 downloads per month (1/5th of PySpark!)
import databricks.koalas as ks
df = ks.read_csv(file)   # file: path to a CSV file
df['x'] = df.y * df.z    # derive a new column with pandas-style syntax
df.describe()
df.plot.line(...)

Announcing Koalas 1.0!
▪ Close to 80% API coverage
▪ Faster performance with Spark 3.0 APIs
▪ More support for missing values, in-place updates
▪ Faster distributed index type
pip install koalas to get started!
[Chart: Time (Seconds); labels: 20.17% faster, 26.39% faster]
[Chart: Koalas API Coverage; labels: 77%, 69%, 65%]

3.0: Other Features
▪ Python API Doc

Lakehouse: Unifying data warehouses and data lakes

Data Warehouses
were purpose-built for BI and reporting, however…
▪ No support for video, audio, text
▪ No support for data science, ML
▪ Limited support for streaming
▪ Closed & proprietary formats
Therefore, most data is stored in data lakes & blob stores
[Diagram labels: External Data, Operational Data, ETL, Data Warehouses, BI, Reports]

Data Lakes
could handle all your data for data science and ML, however…
▪ Poor BI support
▪ Complex to set up
▪ Poor performance
▪ Difficult to quality control
▪ Unreliable data swamps
[Diagram labels: Structured, Semi-Structured and Unstructured Data; Data Lake; Data Prep and Validation; ETL; Data Warehouses; Real-Time Database; BI; Reports; Data Science; Machine Learning]

Lakehouse
[Diagram: the Data Warehouse and Data Lake architectures shown side by side, next to a Lakehouse serving Streaming Analytics, BI, Data Science and Machine Learning]

Lakehouse
▪ One platform for every use case: Streaming Analytics, BI, Data Science, Machine Learning
▪ High performance query engine: DELTA ENGINE
▪ Structured transactional layer
▪ Data Lake for all your data

Delta Engine: High performance query engine for data lakes

Delta Engine
▪ Builds on Apache Spark 3.0
▪ Fully API compatible
▪ Accelerates SQL and DataFrame workloads with:
  ▪ Improved query optimizer
  ▪ Native vectorized execution engine
  ▪ Caching
[Diagram labels: SQL, Spark DataFrame, Koalas; Query Optimizer, Native Execution Engine, Caching]

Photon
Native execution engine purpose-built for performance
New execution engine for Delta Engine to accelerate Spark SQL
Built from scratch in C++, for performance:
▪ Vectorization:
  ▪ Data-level parallelism
  ▪ Instruction-level parallelism
▪ Optimized for modern workloads, not just benchmarks:
  ▪ Faster string processing
  ▪ Regex

Data Level Parallelism
col1 + col3: Billion rows processed per core per sec (higher is better)

Instruction Level Parallelism
▪ Pipeline memory access, load multiple memory addresses in parallel
▪ Prefetch to eliminate cache misses
▪ Minimize TLB misses with huge pages

TPC-DS 30TB Queries/Hour (higher is better)
110 vs. 32 queries/hour: 3.3x speedup

Faster String Processing
MBs processed per core per sec, UPPER() function (higher is better)
MBs processed per core per sec, SUBSTRING() function (higher is better)

Faster String Processing - Regex
Millions of rows processed per core per sec, LIKE "%a_c%" (higher is better)

Redash: SQL Visualization on Big Data

Redash helps you make sense of your data
Powerful SQL editor
▪ Browse schema and click-to-insert
▪ Create reusable snippets
▪ Schedule updates and set up alerts

Visualize and share
▪ Build a wide variety of visualizations and gather them into thematic dashboards
▪ Drag & drop and resize any visualization
▪ Share dashboards with your team or with the public

Redash in Action
▪ Write a SQL query against the data to pull out the data we need
▪ Easily turn SQL into a visualization to make the data easier to understand
▪ Build a dashboard the business can use to understand what's going on

Database Integrations
Query all of your SQL, NoSQL, big data, and API data sources

Autologging (updated in 1.8)
One line to record params, metrics and models in popular ML libraries:
mlflow.keras.autolog()
Including specific data versions read when using Delta Lake

Model Schemas (new in 1.9)
Specify input and output data types for models
▪ Input Schema: zipcode: string, sqft: double, distance: double
▪ Output Schema: price: double
Check Compatibility and Validate New Model Versions
[Diagram: log_model(…) produces a Model with an Input Schema and an Output Schema; the compatibility check flags "Incompatible schemas!"]
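A minimal sketch of such a compatibility check, using the example schema from the slide (zipcode, sqft and distance in; price out). It is plain Python with hypothetical helper names and is not MLflow's schema API.

```python
# Plain-Python sketch of a model schema compatibility check (hypothetical helpers).

REGISTERED_INPUT = {"zipcode": "string", "sqft": "double", "distance": "double"}
REGISTERED_OUTPUT = {"price": "double"}


def schema_mismatches(expected, actual):
    """Return human-readable differences between two {column: type} schemas."""
    problems = []
    for name, dtype in expected.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual[name]}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


def is_compatible(new_input, new_output):
    """A new model version is compatible when both schemas match exactly."""
    return not (
        schema_mismatches(REGISTERED_INPUT, new_input)
        or schema_mismatches(REGISTERED_OUTPUT, new_output)
    )
```

A new model version that drops `distance` or changes `zipcode` to a numeric type would be reported as incompatible before it replaces the current version.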

Model Serving on Databricks (new, in preview)
▪ Tracking: Experiment tracking
▪ Model Registry: Model management (Staging, Production, Archived)
▪ Model Serving: Turnkey serving for MLflow models
[Diagram labels: Logged Model, REST Endpoint, Deployment Backends, Data Scientists, Application Engineers, Reports, Applications, ...]

Cloning on Databricks (coming soon)
Recreate exact configuration of an experiment run

Deployments API (coming soon)
Pluggable way to create and manage deployment endpoints in MLflow
Used in 2 new endpoints:
Other integrations being ported:
mlflow deployments create -t gcp -n spam \
    -m models:/SpamScorer/production
mlflow deployments predict -t gcp -n spam \
    -f emails.json
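The "pluggable" part can be pictured as a registry that maps a target name (the `-t` value above) to a backend class. The sketch below shows that pattern in plain Python. Every name in it is hypothetical, and it is not MLflow's plugin interface; the in-memory backend only stands in for a real serving system.

```python
# Hypothetical plugin registry; illustrates the pattern, not MLflow's API.

_TARGETS = {}


def register_target(name):
    """Class decorator that registers a deployment backend under a target name."""
    def decorator(cls):
        _TARGETS[name] = cls
        return cls
    return decorator


@register_target("in-memory")
class InMemoryTarget:
    """Stand-in backend that just remembers which model each endpoint serves."""

    def __init__(self):
        self.endpoints = {}

    def create(self, name, model_uri):
        self.endpoints[name] = model_uri
        return {"name": name, "model_uri": model_uri}

    def predict(self, name, records):
        # A real backend would call the served model; this one echoes metadata.
        model_uri = self.endpoints[name]
        return [{"endpoint": name, "model_uri": model_uri, "input": r} for r in records]


def get_client(target):
    """Look up and instantiate the backend registered for `target`."""
    if target not in _TARGETS:
        raise ValueError(f"no deployment plugin registered for target {target!r}")
    return _TARGETS[target]()
```

A caller would use `get_client("in-memory")`, then `create("spam", "models:/SpamScorer/production")` and `predict("spam", records)`; adding a new backend only requires registering another class.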

The Next Generation
Data Science Workspace

Announcing support for Repository-Level Git Integration for Collaboration and Reproducibility (in preview)

CI/CD based Workflow from Experimentation to Production (in preview)
[Diagram labels: Development / Experimentation; Git / CI/CD Systems (Version, Review, Test); Production Jobs]

Project-Scoped Environment Configuration (coming soon)
Consistent environments across an autoscaling cluster
[Diagram labels: Environment, Autoscaling Workers]

Native Support for Standard Notebook Formats (coming soon)
Seamless transition to and from Jupyter Notebooks
▪ Before (conversion): ipynb, Databricks Format, Databricks Notebooks
▪ After (native support): Databricks Notebooks, ipynb, Jupyter

Collaborative Features for Standard Notebook Formats
▪ Co-Presence
▪ Co-Editing
▪ Comments

Thank you
