Data Science Across Data Sources with Apache Arrow
1. Data Science Across Data Sources with Apache Arrow
The Data Lake Engine
Spark + AI Summit 2020
2. Background
Tomer Shiran
Co-Founder & CPO, Dremio
tomer@dremio.com
Dremio is the Data Lake Engine Company
Powering the cloud data lakes of the world’s leading companies across all industries
Creators of [logo]
Over $100M raised
3. Your Data Lake is Exploding, Yet Your Data Remains Inaccessible
Data lakes are becoming the primary place that data lands
>100% YoY S3 data growth (1)
>50% of data will live on cloud data lake storage by 2025 (2)
But…
Consuming the data is too slow & too difficult
[Diagram: data consumers (SQL) are blocked (X) from data lake storage on S3 or ADLS]
1) Estimate based on historical growth https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/
2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data
4. Data Movement is the Typical Workaround for Data Lake Storage
[Diagram: BI users (SQL) and data scientists sit above data lake storage (ADLS, S3)]
5. Data Movement is the Typical Workaround for Data Lake Storage
[Diagram: BI users (SQL) and data scientists sit above data lake storage (ADLS, S3), separated by a numbered layer of data movement]
(1) Brittle & complex ETL/ELT
6. Data Movement is the Typical Workaround for Data Lake Storage
[Diagram: BI users (SQL) and data scientists sit above data lake storage (ADLS, S3), separated by numbered layers of data movement]
(1) Brittle & complex ETL/ELT
(2) Proprietary & expensive DW/Data Marts
7. Data Movement is the Typical Workaround for Data Lake Storage
[Diagram: BI users (SQL) and data scientists sit above data lake storage (ADLS, S3), separated by numbered layers of data movement]
(1) Brittle & complex ETL/ELT
(2) Proprietary & expensive DW/Data Marts
(3) Proliferating Cubes, BI Extracts, & Aggregation Tables
Decreasing Data Scope & Flexibility
8. Query data lake storage directly with 4-100X performance
[Diagram: BI users (SQL) and data scientists sit above data lake storage (ADLS or S3), alongside the numbered layers of data movement]
(1) Brittle & complex ETL/ELT
(2) Proprietary & expensive DW/Data Marts
(3) Proliferating Cubes, BI Extracts, & Aggregation Tables
Decreasing Data Scope & Flexibility
Powered by [logo]
9. What is Apache Arrow?
Columnar In-Memory Representation
Many Language Bindings
Broad Industry Adoption
[Figure: row-based vs. column-based memory layout]
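The row-based versus column-based contrast on this slide can be illustrated with a minimal sketch. It uses only the Python standard library, not Arrow itself; the sample records and the field names (id, amount, country) are made up for illustration, and array.array only loosely mirrors a contiguous columnar buffer.

```python
from array import array

# Row-based layout: each record keeps all of its fields together.
rows = [
    (1, 10.5, "US"),
    (2, 20.0, "DE"),
    (3, 7.25, "US"),
]

# Column-based layout: each field is stored as its own contiguous sequence.
# array("d") packs doubles back to back, loosely like a columnar buffer.
columns = {
    "id": array("q", (r[0] for r in rows)),
    "amount": array("d", (r[1] for r in rows)),
    "country": [r[2] for r in rows],
}

def total_row_based(records):
    # Visits every record even though only one field is needed.
    return sum(record[1] for record in records)

def total_column_based(cols):
    # Scans a single contiguous column and never touches the others.
    return sum(cols["amount"])

row_total = total_row_based(rows)
col_total = total_column_based(columns)
```

Both functions return the same total; the difference is in which memory they have to walk, which is the property analytic engines exploit in a columnar format.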
11. Apache Arrow Gandiva Improves CPU Efficiency
✓ A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code generation in LLVM
✓ Expressions are compiled to LLVM bytecode (IR), optimized & translated to machine code
✓ Gandiva enables vectorized execution with Intel SIMD instructions
[Diagram: a SQL expression enters the Gandiva compiler (IRBuilder, Optimize, pre-compiled functions (.bc)), which produces a vectorized execution kernel that maps an input Arrow buffer to an output Arrow buffer]
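The idea of compiling an expression once and then running it over whole columns can be sketched as a toy analogue. This is not Gandiva: Python's built-in compile() stands in for LLVM, plain lists stand in for Arrow vectors, no SIMD is involved, and the tuple-based expression tree and every function name below are hypothetical.

```python
import operator

# Expression tree nodes: ("col", name), ("lit", value), or (op, left, right).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def interpret(expr, row):
    """Baseline: walk the tree again for every single row."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "lit":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))

def to_source(expr):
    """Render the tree as a Python expression over per-row column values."""
    kind = expr[0]
    if kind == "col":
        return expr[1]
    if kind == "lit":
        return repr(expr[1])
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"

def column_names(expr):
    """Collect the column names an expression reads."""
    kind = expr[0]
    if kind == "col":
        return {expr[1]}
    if kind == "lit":
        return set()
    return column_names(expr[1]) | column_names(expr[2])

def compile_expr(expr):
    """Generate source once, compile it, and return a whole-column kernel."""
    names = sorted(column_names(expr))
    params = ", ".join(f"{n}_col" for n in names)
    targets = "".join(f"{n}, " for n in names)
    src = f"lambda {params}: [{to_source(expr)} for {targets}in zip({params})]"
    # eval is acceptable only because the tree here is trusted and column
    # names are plain identifiers; never do this with untrusted input.
    fn = eval(compile(src, "<generated>", "eval"))
    return lambda columns: fn(*(columns[n] for n in names))

# price * 2 + tax, evaluated both ways over the same made-up columns.
expr = ("+", ("*", ("col", "price"), ("lit", 2)), ("col", "tax"))
columns = {"price": [1.0, 2.5, 4.0], "tax": [0.5, 0.5, 1.0]}

kernel = compile_expr(expr)
compiled_out = kernel(columns)
interpreted_out = [
    interpret(expr, {"price": p, "tax": t})
    for p, t in zip(columns["price"], columns["tax"])
]
```

The interpreted path pays the tree-walking cost per row, while the compiled kernel pays the generation cost once per expression, which is the trade the slide describes.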
12. 4.5x-90x Faster than Java-based Code Generation
Test | Project time (secs) with Java JIT | Project time (secs) with Gandiva LLVM | Improvement
Sum | 3.805 | 0.558 | 6.8x
Project 5 columns | 8.681 | 1.689 | 5.13x
Project 10 columns | 24.923 | 3.476 | 7.74x
CASE-10 | 4.308 | 0.925 | 4.66x
CASE-100 | 1361 | 15.187 | 89.6x
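The Improvement column is presumably the Java JIT time divided by the Gandiva LLVM time. A short sketch of that arithmetic, using the times exactly as listed on the slide (the variable names are illustrative):

```python
# (test name, seconds with Java JIT, seconds with Gandiva LLVM), as listed on the slide
results = [
    ("Sum", 3.805, 0.558),
    ("Project 5 columns", 8.681, 1.689),
    ("Project 10 columns", 24.923, 3.476),
    ("CASE-10", 4.308, 0.925),
    ("CASE-100", 1361.0, 15.187),
]

# Speedup factor = baseline time / improved time.
speedups = {name: java / gandiva for name, java, gandiva in results}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

For most rows this reproduces the listed improvement to within rounding. For "Project 10 columns" the listed times give roughly 7.2x rather than the 7.74x shown, so one figure in that row may have been transcribed differently; the slide's values are kept as given.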
13. Dremio’s Arrow-based Columnar Cloud Cache (C3) Accelerates I/O
✓ Columnar cloud cache (C3) automatically provides NVMe-level I/O performance when reading from S3/ADLS
✓ Arrow persistence enables granular caching as Arrow buffers in local engine NVMe
✓ Bypass data deserialization and decompression
✓ Enables high-concurrency, low-latency BI workloads on cloud data lake storage
[Diagram: an XL engine and an M engine, each a group of executors with local NVMe running C3 with Apache Arrow persistence, reading from AWS S3]
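The slide does not describe how C3 works internally. As a rough illustration of the general pattern only, here is a generic read-through cache in which a dict stands in for local NVMe and a function stands in for object storage. The class, the fake reader, and the (file, column) key granularity are all hypothetical.

```python
class ReadThroughCache:
    """Generic read-through cache; a dict stands in for local NVMe."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # slow path, e.g. an object-store read
        self._local = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        # Serve from the local copy when present.
        if key in self._local:
            self.hits += 1
            return self._local[key]
        # Otherwise fetch once from the remote store and keep the result.
        self.misses += 1
        data = self._fetch_remote(key)
        self._local[key] = data
        return data

remote_reads = []

def fake_object_store_read(key):
    # Hypothetical stand-in for a read from S3/ADLS; records each call.
    remote_reads.append(key)
    path, column = key
    return f"{path}:{column}".encode()

cache = ReadThroughCache(fake_object_store_read)
key = ("sales/part-0.parquet", "amount")  # hypothetical (file, column) granularity

first = cache.read(key)   # miss: goes to the fake object store
second = cache.read(key)  # hit: served from the local dict
```

Only the first read reaches the remote store. A cache that keeps data in its in-memory form, as the slide describes for Arrow buffers, also skips the decode step on later reads.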
14. The Open Data Platform
[Diagram: a four-layer stack of Client, Compute, Data, and Storage]
Workloads: Interactive SQL & BI | Data science & batch | Occasional SQL | Batch processing
Compute services named: Athena | EMR
File formats: Text | JSON | Parquet | ORC
Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
Storage: AWS S3 | ADLS | HDFS
15. We Need Fast, Industry-Standard Data Exchange
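[Diagram: a four-layer stack of Client, Compute, Data, and Storage, with numbered arrows 1-4 marking data exchange paths]
Workloads: Interactive SQL & BI | Data science & batch | Occasional SQL | Batch processing
Compute services named: Athena | EMR
File formats: Text | JSON | Parquet | ORC
Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
Storage: AWS S3 | ADLS | HDFS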
16. Arrow Flight is an Arrow-based RPC Interface
✓ High-performance wire protocol
✓ Parallel streams of Arrow buffers are transferred
✓ Delivers on the interoperability promise of Apache Arrow
✓ Client-cluster and cluster-cluster communication
[Diagram: parallel Arrow Flight streams delivering a dataframe]
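The "parallel streams" point can be pictured with a standard-library simulation. It does not use Arrow Flight: the endpoint names, the in-memory batches, and fetch_stream are hypothetical stand-ins for what a real client would do with one do_get call per endpoint.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a server's endpoints: each one yields its own
# stream of batches (plain lists here, not Arrow buffers).
ENDPOINTS = {
    "endpoint-0": [[1, 2, 3], [4, 5]],
    "endpoint-1": [[6, 7], [8]],
    "endpoint-2": [[9, 10, 11, 12]],
}

def fetch_stream(endpoint):
    """Hypothetical per-endpoint fetch; drains one stream batch by batch."""
    values = []
    for batch in ENDPOINTS[endpoint]:
        values.extend(batch)
    return values

def fetch_all(endpoints):
    # One worker per endpoint so the streams are read concurrently.
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        # pool.map returns results in input order, whatever the finish order.
        parts = list(pool.map(fetch_stream, endpoints))
    # Stitch the per-endpoint results back into one sequence.
    return [value for part in parts for value in part]

result = fetch_all(sorted(ENDPOINTS))
```

Each stream is drained independently and the pieces are concatenated at the end, so throughput can scale with the number of endpoints instead of being limited by a single connection.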
17. Arrow Flight Python Client
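import pyarrow.flight as flt

sql = "SELECT 1"  # placeholder query (assumption); substitute your own SQL
# FlightClient.connect() is from older pyarrow releases; current ones take the location directly
c = flt.FlightClient(("localhost", 47470))
fd = flt.FlightDescriptor.for_command(sql)  # wrap the SQL text as a Flight command
fi = c.get_flight_info(fd)  # ask the server which endpoints hold the result
ticket = fi.endpoints[0].ticket  # first endpoint only; a result may span several
df = c.do_get(ticket).read_all().to_pandas()  # read the Arrow stream into a pandas DataFrame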
23. Dremio is the Data Lake Engine
A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage
[Diagram: data users (BI users via SQL, data scientists) query the data lake engine, which reads data lake storage (ADLS or S3) and optional external sources]
Accelerate Business
100X BI query speed
4X ad-hoc query speed
0 cubes, extracts, or aggregation tables
Reduce Cost & Risk
10x lower AWS EC2 / Azure VM spend for same performance
0 lock-in, loss of control, and duplication of data
Powered by [logo]