SlideShare a Scribd company logo
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Next-­‐genera;on	
  	
  
Python	
  Big	
  Data	
  Tools,	
  	
  
powered	
  by	
  Apache	
  Arrow	
  
Wes	
  McKinney	
  @wesmckinn	
  
SF	
  Big	
  Analy;cs	
  Meetup,	
  2016-­‐04-­‐05	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Me	
  
•  Data	
  Science	
  Tools	
  at	
  Cloudera,	
  formerly	
  DataPad	
  CEO/founder	
  
•  Serial	
  creator	
  of	
  structured	
  data	
  tools	
  /	
  user	
  interfaces	
  
•  Wrote	
  bestseller	
  Python	
  for	
  Data	
  Analysis	
  2012	
  
•  Open	
  source	
  projects	
  
• Python	
  {pandas,	
  Ibis,	
  statsmodels}	
  
• Apache	
  {Arrow,	
  Parquet,	
  Kudu	
  (incuba;ng)}	
  
•  Mostly	
  work	
  in	
  Python	
  and	
  Cython/C/C++	
  
	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
In	
  process:	
  
Python	
  for	
  Data	
  Analysis:	
  2nd	
  Edi4on	
  
Coming	
  late	
  2016	
  /	
  early	
  
2017	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Python	
  +	
  Big	
  Data:	
  The	
  State	
  of	
  things	
  
•  See	
  “Python	
  and	
  Apache	
  Hadoop:	
  A	
  State	
  of	
  the	
  Union”	
  from	
  February	
  17	
  
•  Areas	
  where	
  much	
  more	
  work	
  needed	
  
• Binary	
  file	
  format	
  read/write	
  support	
  (e.g.	
  Parquet	
  files)	
  
• File	
  system	
  libraries	
  (HDFS,	
  S3,	
  etc.)	
  
• Client	
  drivers	
  (Spark,	
  Hive,	
  Impala,	
  Kudu)	
  
• Compute	
  system	
  integra;on	
  (Spark,	
  Impala,	
  etc.)	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  
Arrow	
  
Many	
  slides	
  here	
  from	
  my	
  joint	
  talk	
  with	
  Jacques	
  Nadeau,	
  VP	
  Apache	
  Arrow	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Arrow	
  in	
  a	
  Slide	
  
•  New	
  Top-­‐level	
  Apache	
  Sofware	
  Founda;on	
  project	
  
•  Announced	
  Feb	
  17,	
  2016	
  
•  Focused	
  on	
  Columnar	
  In-­‐Memory	
  Analy;cs	
  
1.  10-­‐100x	
  speedup	
  on	
  many	
  workloads	
  
2.  Common	
  data	
  layer	
  enables	
  companies	
  to	
  choose	
  best	
  of	
  
breed	
  systems	
  	
  
3.  Designed	
  to	
  work	
  with	
  any	
  programming	
  language	
  
4.  Support	
  for	
  both	
  rela;onal	
  and	
  complex	
  data	
  as-­‐is	
  
•  Developers	
  from	
  13+	
  major	
  open	
  source	
  projects	
  involved	
  
•  A	
  significant	
  %	
  of	
  the	
  world’s	
  data	
  will	
  be	
  processed	
  through	
  
Arrow!	
  
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  Arrow:	
  What	
  is	
  it?	
  	
  
•  hkp://arrow.apache.org	
  
•  Not	
  a	
  piece	
  of	
  sofware,	
  exactly!	
  
•  A	
  standardized	
  in-­‐memory	
  representa;on	
  for	
  columnar	
  data	
  
•  Enables	
  
• Suitable	
  for	
  implemen;ng	
  high-­‐performance	
  analy;cs	
  in-­‐memory	
  (think	
  like	
  
“pandas	
  internals”)	
  
• Cheap	
  data	
  interchange	
  amongst	
  systems,	
  likle	
  or	
  no	
  serializa;on	
  
• Flexible	
  support	
  for	
  complex	
  JSON-­‐like	
  data	
  
•  Targets:	
  Impala,	
  Kudu,	
  Parquet,	
  Spark	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Focus	
  on	
  CPU	
  Efficiency	
  
1331246660
1331246351
1331244570
1331261196
3/8/2012 2:44PM
3/8/2012 2:38PM
3/8/2012 2:09PM
3/8/2012 6:46PM
99.155.155.225
65.87.165.114
71.10.106.181
76.102.156.138
Row 1
Row 2
Row 3
Row 4
1331246660
1331246351
1331244570
1331261196
3/8/2012 2:44PM
3/8/2012 2:38PM
3/8/2012 2:09PM
3/8/2012 6:46PM
99.155.155.225
65.87.165.114
71.10.106.181
76.102.156.138
session_id
timestamp
source_ip
Traditional
Memory Buffer	
  
Arrow
Memory Buffer	
  
•  Cache	
  Locality	
  
•  Super-­‐scalar	
  &	
  vectorized	
  
opera;on	
  
•  Minimal	
  Structure	
  Overhead	
  
•  Constant	
  value	
  access	
  	
  
•  With	
  minimal	
  structure	
  overhead	
  
•  Operate	
  directly	
  on	
  columnar	
  
compressed	
  data	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
High	
  Performance	
  Sharing	
  &	
  Interchange	
  
Today With Arrow
•  Each system has its own internal
memory format
•  70-80% CPU wasted on serialization
and deserialization
•  Similar functionality implemented in
multiple projects
•  All systems utilize the same memory
format
•  No overhead for cross-system
communication
•  Projects can share functionality (eg,
Parquet-to-Arrow reader)
Pandas Drill
Impala
HBase
KuduCassandra
Parquet
Spark
Arrow Memory
Pandas Drill
Impala
HBase
KuduCassandra
Parquet
Spark
Copy & Convert
Copy & Convert
Copy & Convert
Copy & Convert
Copy & Convert
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Big	
  Data	
  Systems:	
  Poor	
  Python	
  IO	
  performance	
  
h9p://wesmckinney.com/blog/pandas-­‐and-­‐apache-­‐arrow/	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real	
  World	
  Example:	
  Feather	
  File	
  Format	
  for	
  Python	
  
and	
  R	
  
• Problem:	
  fast,	
  language-­‐
agnos;c	
  binary	
  data	
  frame	
  
file	
  format	
  
• Wriken	
  by	
  Wes	
  McKinney	
  
(Python)	
  Hadley	
  Wickham	
  (R)	
  
• Read	
  speeds	
  close	
  to	
  disk	
  IO	
  
performance	
  
Arrow array 0
Arrow array 1
…
Arrow array n
Feather
metadata
Feather file
Apache Arrow
memory
Google
flatbuffers
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real	
  World	
  Example:	
  Feather	
  File	
  Format	
  for	
  Python	
  
and	
  R	
  
library(feather)	
  
	
  	
  
path	
  <-­‐	
  "my_data.feather"	
  
write_feather(df,	
  path)	
  
	
  	
  
df	
  <-­‐	
  read_feather(path)	
  
import	
  feather	
  
	
  	
  
path	
  =	
  'my_data.feather'	
  
	
  	
  
feather.write_dataframe(df,	
  path)	
  
df	
  =	
  feather.read_dataframe(path)	
  
R	
   Python	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  Parquet:	
  Binary	
  columnar	
  storage	
  format	
  
•  I	
  just	
  became	
  a	
  Parquet	
  commiker!	
  
•  github.com/apache/parquet-­‐cpp	
  
•  Python	
  users	
  will	
  soon	
  be	
  able	
  to	
  
read	
  Parquet	
  files	
  via	
  PyArrow	
  
•  parquet-­‐cpp	
  <-­‐>	
  PyArrow	
  <-­‐>	
  
pandas	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Language	
  Bindings	
  
•  Target	
  Languages	
  
• Java	
  (beta)	
  
• CPP	
  (underway)	
  
• Python	
  &	
  Pandas	
  (underway)	
  
• R	
  
• Julia	
  
•  Ini;al	
  Focus	
  
• Read	
  a	
  structure	
  
• Write	
  a	
  structure	
  	
  
• Manage	
  Memory	
  
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
pandas	
  and	
  Arrow	
  in	
  context	
  
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
RPC	
  &	
  IPC:	
  Moving	
  Data	
  Between	
  Systems	
  
RPC	
  
•  Avoid	
  Serializa;on	
  &	
  Deserializa;on	
  
•  Layer	
  TBD:	
  Focused	
  on	
  suppor;ng	
  vectored	
  io	
  
• Scaker/gather	
  reads/writes	
  against	
  socket	
  
IPC	
  
•  Alpha	
  implementa;on	
  	
  using	
  memory	
  mapped	
  files	
  
• Moving	
  data	
  between	
  Python	
  and	
  Drill	
  
•  Working	
  on	
  shared	
  alloca;on	
  approach	
  
• Shared	
  reference	
  coun;ng	
  and	
  well-­‐defined	
  ownership	
  seman;cs	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Execu;ng	
  data	
  science	
  languages	
  in	
  the	
  compute	
  layer	
  
UI
Ibis, SQL, Spark API, …
Compute
Analytic SQL, Spark, MapReduce
Storage
HDFS, Kudu, HBase
Python,
R, Julia, …?
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Real	
  World	
  Example:	
  Python	
  With	
  Spark,	
  Drill,	
  Impala	
  
in partition 0
…
in partition
n - 1
SQL Engine
Python
function
input
Python
function
input
User-supplied
Python code
output
output
out partition 0
…
out partition
n - 1
SQL Engine
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What’s	
  Next	
  
•  Parquet	
  for	
  Python	
  &	
  C++	
  
• Using	
  Arrow	
  as	
  intermediary	
  
•  Available	
  IPC	
  Implementa;on	
  
•  Spark,	
  Drill	
  Integra;on	
  
• Faster	
  UDFs,	
  Storage	
  interfaces	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Apache	
  Arrow	
  in	
  prac;ce	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Get	
  Involved	
  
•  Join	
  the	
  community	
  
• dev@arrow.apache.org	
  
• Slack:	
  hkps://apachearrowslackin.herokuapp.com/	
  
• hkp://arrow.apache.org	
  
• @ApacheArrow	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
Wes	
  McKinney	
  @wesmckinn	
  
Views	
  are	
  my	
  own	
  

More Related Content

What's hot

Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoophadooparchbook
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)Todd Lipcon
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupMike Percy
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applicationshadooparchbook
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architectureMartinStrycek
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemShivaji Dutta
 

What's hot (19)

Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
A brave new world in mutable big data relational storage (Strata NYC 2017)
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Architectural considerations for Hadoop Applications
Architectural considerations for Hadoop ApplicationsArchitectural considerations for Hadoop Applications
Architectural considerations for Hadoop Applications
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Exponea - Kafka and Hadoop as components of architecture
Exponea  - Kafka and Hadoop as components of architectureExponea  - Kafka and Hadoop as components of architecture
Exponea - Kafka and Hadoop as components of architecture
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Introduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystemIntroduction to the Hadoop EcoSystem
Introduction to the Hadoop EcoSystem
 

Viewers also liked

Api Strat Portland 2017 Serverless Extensibility talk
Api Strat Portland 2017 Serverless Extensibility talkApi Strat Portland 2017 Serverless Extensibility talk
Api Strat Portland 2017 Serverless Extensibility talkGlenn Block
 
あなたの開発チームには、チームワークがあふれていますか?
 あなたの開発チームには、チームワークがあふれていますか? あなたの開発チームには、チームワークがあふれていますか?
あなたの開発チームには、チームワークがあふれていますか?Yusuke Amano
 
サイボウズのフロントエンド開発 現在とこれからの挑戦
サイボウズのフロントエンド開発 現在とこれからの挑戦サイボウズのフロントエンド開発 現在とこれからの挑戦
サイボウズのフロントエンド開発 現在とこれからの挑戦Teppei Sato
 
遅いクエリと向き合う仕組み #CybozuMeetup
遅いクエリと向き合う仕組み #CybozuMeetup遅いクエリと向き合う仕組み #CybozuMeetup
遅いクエリと向き合う仕組み #CybozuMeetupS Akai
 
サイボウズのサービスを支えるログ基盤
サイボウズのサービスを支えるログ基盤サイボウズのサービスを支えるログ基盤
サイボウズのサービスを支えるログ基盤Shin'ya Ueoka
 
すべての人にチームワークを サイボウズのアクセシビリティ
すべての人にチームワークを サイボウズのアクセシビリティすべての人にチームワークを サイボウズのアクセシビリティ
すべての人にチームワークを サイボウズのアクセシビリティKobayashi Daisuke
 
WalB: Real-time and Incremental Backup System for Block Devices
WalB: Real-time and Incremental Backup System for Block DevicesWalB: Real-time and Incremental Backup System for Block Devices
WalB: Real-time and Incremental Backup System for Block Devicesuchan_nos
 
3000社の業務データ絞り込みを支える技術
3000社の業務データ絞り込みを支える技術3000社の業務データ絞り込みを支える技術
3000社の業務データ絞り込みを支える技術Ryo Mitoma
 
離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー
離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー
離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ーTeppei Sato
 
Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜
Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜
Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜Jumpei Miyata
 
すべてを自動化せよ! 〜生産性向上チームの挑戦〜
すべてを自動化せよ! 〜生産性向上チームの挑戦〜すべてを自動化せよ! 〜生産性向上チームの挑戦〜
すべてを自動化せよ! 〜生産性向上チームの挑戦〜Jumpei Miyata
 
Kubernetes in 30 minutes (2017/03/10)
Kubernetes in 30 minutes (2017/03/10)Kubernetes in 30 minutes (2017/03/10)
Kubernetes in 30 minutes (2017/03/10)lestrrat
 
プロジェクト管理でkintone
プロジェクト管理でkintoneプロジェクト管理でkintone
プロジェクト管理でkintoneCybozucommunity
 
Kubernetesにまつわるエトセトラ(主に苦労話)
Kubernetesにまつわるエトセトラ(主に苦労話)Kubernetesにまつわるエトセトラ(主に苦労話)
Kubernetesにまつわるエトセトラ(主に苦労話)Works Applications
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processingsscdotopen
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataRyan Bosshart
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with SparkSandy Ryza
 

Viewers also liked (20)

Api Strat Portland 2017 Serverless Extensibility talk
Api Strat Portland 2017 Serverless Extensibility talkApi Strat Portland 2017 Serverless Extensibility talk
Api Strat Portland 2017 Serverless Extensibility talk
 
あなたの開発チームには、チームワークがあふれていますか?
 あなたの開発チームには、チームワークがあふれていますか? あなたの開発チームには、チームワークがあふれていますか?
あなたの開発チームには、チームワークがあふれていますか?
 
サイボウズのフロントエンド開発 現在とこれからの挑戦
サイボウズのフロントエンド開発 現在とこれからの挑戦サイボウズのフロントエンド開発 現在とこれからの挑戦
サイボウズのフロントエンド開発 現在とこれからの挑戦
 
遅いクエリと向き合う仕組み #CybozuMeetup
遅いクエリと向き合う仕組み #CybozuMeetup遅いクエリと向き合う仕組み #CybozuMeetup
遅いクエリと向き合う仕組み #CybozuMeetup
 
サイボウズのサービスを支えるログ基盤
サイボウズのサービスを支えるログ基盤サイボウズのサービスを支えるログ基盤
サイボウズのサービスを支えるログ基盤
 
すべての人にチームワークを サイボウズのアクセシビリティ
すべての人にチームワークを サイボウズのアクセシビリティすべての人にチームワークを サイボウズのアクセシビリティ
すべての人にチームワークを サイボウズのアクセシビリティ
 
形態素解析
形態素解析形態素解析
形態素解析
 
WalB: Real-time and Incremental Backup System for Block Devices
WalB: Real-time and Incremental Backup System for Block DevicesWalB: Real-time and Incremental Backup System for Block Devices
WalB: Real-time and Incremental Backup System for Block Devices
 
3000社の業務データ絞り込みを支える技術
3000社の業務データ絞り込みを支える技術3000社の業務データ絞り込みを支える技術
3000社の業務データ絞り込みを支える技術
 
離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー
離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー
離れた場所でも最高のチームワークを実現する方法 ーサイボウズ開発チームのリモートワーク事例ー
 
Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜
Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜
Jenkins 2.0 最新事情 〜Make Jenkins Great Again〜
 
すべてを自動化せよ! 〜生産性向上チームの挑戦〜
すべてを自動化せよ! 〜生産性向上チームの挑戦〜すべてを自動化せよ! 〜生産性向上チームの挑戦〜
すべてを自動化せよ! 〜生産性向上チームの挑戦〜
 
Kubernetes in 30 minutes (2017/03/10)
Kubernetes in 30 minutes (2017/03/10)Kubernetes in 30 minutes (2017/03/10)
Kubernetes in 30 minutes (2017/03/10)
 
プロジェクト管理でkintone
プロジェクト管理でkintoneプロジェクト管理でkintone
プロジェクト管理でkintone
 
Kubernetesにまつわるエトセトラ(主に苦労話)
Kubernetesにまつわるエトセトラ(主に苦労話)Kubernetesにまつわるエトセトラ(主に苦労話)
Kubernetesにまつわるエトセトラ(主に苦労話)
 
HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016HPE Keynote Hadoop Summit San Jose 2016
HPE Keynote Hadoop Summit San Jose 2016
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
Introducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph ProcessingIntroducing Apache Giraph for Large Scale Graph Processing
Introducing Apache Giraph for Large Scale Graph Processing
 
Kudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast DataKudu - Fast Analytics on Fast Data
Kudu - Fast Analytics on Fast Data
 
Time Series Analysis with Spark
Time Series Analysis with SparkTime Series Analysis with Spark
Time Series Analysis with Spark
 

Similar to Next-generation Python Big Data Tools, powered by Apache Arrow

Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Improving Data Interoperability for Python and R
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and RWork-Bench
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyHakka Labs
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015Cloudera, Inc.
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Julien Le Dem
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Marcel Krcah
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Frank Munz
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 

Similar to Next-generation Python Big Data Tools, powered by Apache Arrow (20)

Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Improving Data Interoperability for Python and R
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and R
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)Hadoop in Practice (SDN Conference, Dec 2014)
Hadoop in Practice (SDN Conference, Dec 2014)
 
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark an...
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 

More from Wes McKinney (16)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 

Recently uploaded

Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeCzechDreamin
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...CzechDreamin
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlPeter Udo Diehl
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2DianaGray10
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekCzechDreamin
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераMark Opanasiuk
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyUXDXConf
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxDavid Michel
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastUXDXConf
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoTAnalytics
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationZilliz
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1DianaGray10
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityScyllaDB
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxAbida Shariff
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty SecureFemke de Vroome
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaCzechDreamin
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101vincent683379
 

Recently uploaded (20)

Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2UiPath Test Automation using UiPath Test Suite series, part 2
UiPath Test Automation using UiPath Test Suite series, part 2
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
Intro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджераIntro in Product Management - Коротко про професію продакт менеджера
Intro in Product Management - Коротко про професію продакт менеджера
 
A Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System StrategyA Business-Centric Approach to Design System Strategy
A Business-Centric Approach to Design System Strategy
 
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptxUnpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
Unpacking Value Delivery - Agile Oxford Meetup - May 2024.pptx
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Designing for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at ComcastDesigning for Hardware Accessibility at Comcast
Designing for Hardware Accessibility at Comcast
 
IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024IoT Analytics Company Presentation May 2024
IoT Analytics Company Presentation May 2024
 
Introduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG EvaluationIntroduction to Open Source RAG and RAG Evaluation
Introduction to Open Source RAG and RAG Evaluation
 
UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1UiPath Test Automation using UiPath Test Suite series, part 1
UiPath Test Automation using UiPath Test Suite series, part 1
 
Optimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through ObservabilityOptimizing NoSQL Performance Through Observability
Optimizing NoSQL Performance Through Observability
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdfIntroduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
Introduction to FDO and How It works Applications _ Richard at FIDO Alliance.pdf
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 

Next-generation Python Big Data Tools, powered by Apache Arrow

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Next-­‐genera;on     Python  Big  Data  Tools,     powered  by  Apache  Arrow   Wes  McKinney  @wesmckinn   SF  Big  Analy;cs  Meetup,  2016-­‐04-­‐05  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Me   •  Data  Science  Tools  at  Cloudera,  formerly  DataPad  CEO/founder   •  Serial  creator  of  structured  data  tools  /  user  interfaces   •  Wrote  bestseller  Python  for  Data  Analysis  2012   •  Open  source  projects   • Python  {pandas,  Ibis,  statsmodels}   • Apache  {Arrow,  Parquet,  Kudu  (incuba;ng)}   •  Mostly  work  in  Python  and  Cython/C/C++    
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   In  process:   Python  for  Data  Analysis:  2nd  Edi4on   Coming  late  2016  /  early   2017  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   Python  +  Big  Data:  The  State  of  things   •  See  “Python  and  Apache  Hadoop:  A  State  of  the  Union”  from  February  17   •  Areas  where  much  more  work  needed   • Binary  file  format  read/write  support  (e.g.  Parquet  files)   • File  system  libraries  (HDFS,  S3,  etc.)   • Client  drivers  (Spark,  Hive,  Impala,  Kudu)   • Compute  system  integra;on  (Spark,  Impala,  etc.)  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Apache   Arrow   Many  slides  here  from  my  joint  talk  with  Jacques  Nadeau,  VP  Apache  Arrow  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   Arrow  in  a  Slide   •  New  Top-­‐level  Apache  Sofware  Founda;on  project   •  Announced  Feb  17,  2016   •  Focused  on  Columnar  In-­‐Memory  Analy;cs   1.  10-­‐100x  speedup  on  many  workloads   2.  Common  data  layer  enables  companies  to  choose  best  of   breed  systems     3.  Designed  to  work  with  any  programming  language   4.  Support  for  both  rela;onal  and  complex  data  as-­‐is   •  Developers  from  13+  major  open  source  projects  involved   •  A  significant  %  of  the  world’s  data  will  be  processed  through   Arrow!   Calcite Cassandra Deeplearning4j Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm R
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   Apache  Arrow:  What  is  it?     •  hkp://arrow.apache.org   •  Not  a  piece  of  sofware,  exactly!   •  A  standardized  in-­‐memory  representa;on  for  columnar  data   •  Enables   • Suitable  for  implemen;ng  high-­‐performance  analy;cs  in-­‐memory  (think  like   “pandas  internals”)   • Cheap  data  interchange  amongst  systems,  likle  or  no  serializa;on   • Flexible  support  for  complex  JSON-­‐like  data   •  Targets:  Impala,  Kudu,  Parquet,  Spark  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Focus  on  CPU  Efficiency   1331246660 1331246351 1331244570 1331261196 3/8/2012 2:44PM 3/8/2012 2:38PM 3/8/2012 2:09PM 3/8/2012 6:46PM 99.155.155.225 65.87.165.114 71.10.106.181 76.102.156.138 Row 1 Row 2 Row 3 Row 4 1331246660 1331246351 1331244570 1331261196 3/8/2012 2:44PM 3/8/2012 2:38PM 3/8/2012 2:09PM 3/8/2012 6:46PM 99.155.155.225 65.87.165.114 71.10.106.181 76.102.156.138 session_id timestamp source_ip Traditional Memory Buffer   Arrow Memory Buffer   •  Cache  Locality   •  Super-­‐scalar  &  vectorized   opera;on   •  Minimal  Structure  Overhead   •  Constant  value  access     •  With  minimal  structure  overhead   •  Operate  directly  on  columnar   compressed  data  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   High  Performance  Sharing  &  Interchange   Today With Arrow •  Each system has its own internal memory format •  70-80% CPU wasted on serialization and deserialization •  Similar functionality implemented in multiple projects •  All systems utilize the same memory format •  No overhead for cross-system communication •  Projects can share functionality (eg, Parquet-to-Arrow reader) Pandas Drill Impala HBase KuduCassandra Parquet Spark Arrow Memory Pandas Drill Impala HBase KuduCassandra Parquet Spark Copy & Convert Copy & Convert Copy & Convert Copy & Convert Copy & Convert
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Big  Data  Systems:  Poor  Python  IO  performance   h9p://wesmckinney.com/blog/pandas-­‐and-­‐apache-­‐arrow/  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Real  World  Example:  Feather  File  Format  for  Python   and  R   • Problem:  fast,  language-­‐ agnos;c  binary  data  frame   file  format   • Wriken  by  Wes  McKinney   (Python)  Hadley  Wickham  (R)   • Read  speeds  close  to  disk  IO   performance   Arrow array 0 Arrow array 1 … Arrow array n Feather metadata Feather file Apache Arrow memory Google flatbuffers
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Real  World  Example:  Feather  File  Format  for  Python   and  R   library(feather)       path  <-­‐  "my_data.feather"   write_feather(df,  path)       df  <-­‐  read_feather(path)   import  feather       path  =  'my_data.feather'       feather.write_dataframe(df,  path)   df  =  feather.read_dataframe(path)   R   Python  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Apache  Parquet:  Binary  columnar  storage  format   •  I  just  became  a  Parquet  commiker!   •  github.com/apache/parquet-­‐cpp   •  Python  users  will  soon  be  able  to   read  Parquet  files  via  PyArrow   •  parquet-­‐cpp  <-­‐>  PyArrow  <-­‐>   pandas  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   Language  Bindings   •  Target  Languages   • Java  (beta)   • CPP  (underway)   • Python  &  Pandas  (underway)   • R   • Julia   •  Ini;al  Focus   • Read  a  structure   • Write  a  structure     • Manage  Memory  
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   pandas  and  Arrow  in  context  
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   RPC  &  IPC:  Moving  Data  Between  Systems   RPC   •  Avoid  Serializa;on  &  Deserializa;on   •  Layer  TBD:  Focused  on  suppor;ng  vectored  io   • Scaker/gather  reads/writes  against  socket   IPC   •  Alpha  implementa;on    using  memory  mapped  files   • Moving  data  between  Python  and  Drill   •  Working  on  shared  alloca;on  approach   • Shared  reference  coun;ng  and  well-­‐defined  ownership  seman;cs  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   Execu;ng  data  science  languages  in  the  compute  layer   UI Ibis, SQL, Spark API, … Compute Analytic SQL, Spark, MapReduce Storage HDFS, Kudu, HBase Python, R, Julia, …?
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   Real  World  Example:  Python  With  Spark,  Drill,  Impala   in partition 0 … in partition n - 1 SQL Engine Python function input Python function input User-supplied Python code output output out partition 0 … out partition n - 1 SQL Engine
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   What’s  Next   •  Parquet  for  Python  &  C++   • Using  Arrow  as  intermediary   •  Available  IPC  Implementa;on   •  Spark,  Drill  Integra;on   • Faster  UDFs,  Storage  interfaces  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   Apache  Arrow  in  prac;ce  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   Get  Involved   •  Join  the  community   • dev@arrow.apache.org   • Slack:  hkps://apachearrowslackin.herokuapp.com/   • hkp://arrow.apache.org   • @ApacheArrow  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   Wes  McKinney  @wesmckinn   Views  are  my  own