Dremio Confidential: UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Apache Arrow
Columnar In-Memory Analytics
Dremio [NOT TODAY’S TOPIC]
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• VP Product, MapR; Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Founded in June 2015
• Led by experts in Big Data and open source (Apache Parquet, Drill, Pig, Calcite and more)
• Currently in stealth
Introducing Apache Arrow
• New open source project under the Apache Software Foundation
– Top-level project (directly!)
• Introduces new era of Columnar In-Memory Analytics
1. 10-100x speedup & concurrency for most workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Users can utilize any programming language
4. Works with relational and complex data as-is; no ETL required
• 13 major open source Big Data projects are already on board
– A significant % of the world’s data will be processed through Arrow!
Arrow Turbo-Charges Big Data Execution Engines
[Figure: Apache Arrow shown alongside each execution engine (Impala, …)]
Performance Advantage of Columnar In-Memory
[Figure: an Intel CPU running the query below against a traditional memory buffer and an Arrow memory buffer]
SELECT * FROM clickstream WHERE session_id = 1331246351
• Arrow leverages the data parallelism (SIMD) in modern Intel CPUs
• Arrow optimizes CPU prefetching and caching
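The difference between the two buffers in the figure can be sketched in a few lines of Python. This is an illustration of the memory layout only, not Arrow code: the sample rows, the column names other than `session_id`, and the helper functions are hypothetical, and pure Python does not use SIMD. The point is that a columnar scan reads one contiguous array, which is the access pattern that vectorized instructions and CPU prefetching favor.

```python
from array import array

# Hypothetical clickstream rows: (session_id, page_id, duration_ms).
rows = [
    (1331246350, 7, 120),
    (1331246351, 3, 450),
    (1331246352, 9, 80),
    (1331246351, 5, 300),
]

# Row-oriented buffer: all fields of a row sit next to each other.
row_buffer = array("q", [field for row in rows for field in row])

# Column-oriented buffers: each field is its own contiguous array.
session_id = array("q", [r[0] for r in rows])
page_id = array("q", [r[1] for r in rows])
duration_ms = array("q", [r[2] for r in rows])


def filter_rows(buffer, width, target):
    """Scan a row-oriented buffer, striding past the fields the filter does not need."""
    return [i // width for i in range(0, len(buffer), width) if buffer[i] == target]


def filter_column(column, target):
    """Scan a single contiguous column; only the filtered field is read."""
    return [i for i, value in enumerate(column) if value == target]


# WHERE session_id = 1331246351, evaluated against both layouts.
matches_row = filter_rows(row_buffer, 3, 1331246351)
matches_col = filter_column(session_id, 1331246351)
```

Both scans return the same row indices. In the row layout the scan steps over `page_id` and `duration_ms` for every row, while in the columnar layout every value read is a `session_id`.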
Evolution Towards Heterogeneous Data Infrastructure
RDBMS
Hadoop MapReduce
Databases: Cassandra, Elasticsearch, HBase, Kudu, MongoDB, Parquet, Phoenix
Execution Engines: Drill, Ibis, Impala, MapReduce, Pandas, Spark, Storm
Phase 1: Common Scheduler (YARN, Mesos, Kubernetes)
Phase 2: Common Data/Memory (Arrow)
Advantages of a Common Data Layer
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
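The contrast between the two columns can be sketched with Python's standard library. This is a toy under stated assumptions, not Arrow's API: JSON stands in for whatever wire format two systems might use today, and a `memoryview` over an `array` stands in for two systems agreeing on one in-memory format. The variable names are hypothetical.

```python
import json
from array import array

# Producer "system" holds a column of values in one contiguous buffer.
column = array("q", [10, 20, 30, 40])

# Today: each hand-off serializes, copies, and parses the data again.
wire = json.dumps(list(column))
copied = array("q", json.loads(wire))

# With a shared format: the consumer reads the producer's buffer directly.
shared = memoryview(column)

# A producer-side change shows which consumer actually shares memory.
column[0] = 99
```

After the last line, `copied[0]` is still 10 because it is a separate copy, while `shared[0]` is 99 because no copy was ever made. Skipping that copy and parse step is the saving the slide attributes to a common memory format.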
Who’s Behind Apache Arrow?
• The creators and lead developers of 13 major open source Big Data projects
– Employees of Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce, Twitter
• Jacques Nadeau is the PMC Chair (aka VP Apache Arrow)
– Co-founder & CTO of Dremio
Calcite
Cassandra
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
Current Status
• C, C++, Python and Java implementations currently underway
• Will be adopted by Drill, Ibis, Impala, Kudu, Parquet and Spark by EOY
• Additional languages (e.g., R, JavaScript) and projects also expected to adopt Arrow by EOY
Questions?
Jacques Nadeau
Dremio Founder & CTO
VP Apache Arrow
Julien Le Dem
Dremio Architect
VP Apache Parquet
APPENDIX
PMC Members/Committers
Jacques Nadeau (PMC Chair)
Todd Lipcon
Ted Dunning
Michael Stack
P. Taylor Goetz
Reynold Xin
Julian Hyde
Julien Le Dem
James Taylor
Jake Luciani
Parth Chandra
Alex Levenson
Marcel Kornacker
Steven Phillips
Hanifi Gunes
Jason Altekruse
Abdel Hakim Deneche
Wes McKinney
Karthik Ramasamy
David Alves
Seshadri Mahalingam
Ippokratis Pandis

Apache Arrow - An Overview


Editor's Notes

  • #4 This is changing the world! Emphasize that.
  • #5 Trying to turbo-charge all the major technologies that people use today.
  • #6 Explain that columnar on disk has existed for several years; this is columnar in memory. Is this only CPU and cache, or also main memory? BOTH, EVERYTHING. That’s what’s amazing here. Very technical explanation; simplify it. One blue vs 4 blues.
  • #7 Maybe improve the slide – from common scheduling to common data in memory
  • #9 Don’t say it will come in the coming months and years. Years is too far in the future. Everyone has the need today. We’re not offloading the work for them; they are going to do the work. Relationships: good point. Call this a platform?