SlideShare a Scribd company logo
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ibis:	
  Scaling	
  the	
  Python	
  Data	
  
Experience	
  
Wes	
  McKinney	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  Marcel	
  Kornacker	
  
JusFn	
  Erickson 	
   	
  Silvius	
  Rus	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Wes	
  McKinney	
  
•  A	
  key	
  person	
  in	
  building	
  today’s	
  open	
  source	
  Python	
  data	
  community	
  
•  Creator	
  of	
  pandas,	
  a	
  standard	
  Python	
  data	
  wrangling	
  and	
  analyFcs	
  toolkit	
  used	
  
by	
  data	
  scienFsts	
  
•  Author	
  of	
  best-­‐selling	
  canonical	
  text	
  Python	
  for	
  Data	
  Analysis	
  (2012)	
  
•  Formerly	
  Founder/CEO	
  of	
  DataPad	
  (acquired	
  by	
  Cloudera	
  in	
  2014)	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Python	
  is	
  popular…	
  
•  Python	
  has	
  become	
  a	
  standard	
  language	
  of	
  data	
  science	
  
•  Why	
  is	
  it	
  popular?	
  
• Maximizes	
  producFvity	
  for	
  data	
  engineers	
  and	
  data	
  scienFsts	
  
• Build	
  robust	
  so[ware	
  and	
  do	
  interacFve	
  data	
  analysis	
  with	
  100%	
  Python	
  code	
  	
  
• Easy-­‐to-­‐learn	
  and	
  makes	
  happy	
  and	
  producFve	
  data	
  teams	
  	
  
• Large,	
  diverse	
  open	
  source	
  development	
  community	
  
• Comprehensive	
  libraries:	
  data	
  wrangling,	
  ML,	
  visualizaFon,	
  etc.	
  
•  Main	
  use	
  case:	
  data	
  science	
  &	
  engineering	
  swiss	
  army	
  knife	
  on	
  small-­‐to-­‐medium	
  
size	
  data	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
…but	
  Python	
  does	
  not	
  scale	
  today	
  
•  Python	
  ecosystem	
  confined	
  to	
  single-­‐node	
  analysis	
  
• Great	
  for	
  smaller	
  data	
  sets	
  
• Requires	
  sampling	
  or	
  aggregaFons	
  for	
  larger	
  data	
  
• Distributed	
  tools	
  compromise	
  in	
  various	
  ways	
  
•  ExtracFng	
  samples	
  or	
  aggregaFons	
  for	
  larger	
  data	
  means:	
  
• “Scales”	
  by	
  losing	
  more	
  fidelity	
  
• AddiFonal	
  ETL	
  overhead	
  to	
  extract	
  samples/aggregaFons	
  
• Loss	
  of	
  producFvity	
  with	
  mulFple	
  languages,	
  tools,	
  etc	
  
• Blocks	
  certain	
  analysis	
  and	
  use	
  cases	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ibis:	
  Same	
  Python,	
  now	
  at	
  scale	
  
•  Target	
  user:	
  
• Data	
  scienFsts	
  and	
  data	
  engineers	
  (“Python	
  data	
  users”)	
  
•  Goals:	
  
• Mirrors	
  single-­‐node	
  Python	
  experience	
  
• Scales	
  to	
  any	
  node	
  and	
  data	
  size	
  
• No	
  compromise	
  in	
  funcFonality	
  or	
  usability	
  
• InteracFve	
  experience	
  at	
  naFve	
  hardware	
  speeds	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What’s	
  announced?	
  
•  First	
  public	
  release	
  of	
  Ibis	
  
• hgp://ibis-­‐project.org	
  
•  Beta	
  release	
  to	
  Cloudera	
  Labs	
  
•  InviFng	
  usage	
  and	
  community	
  development	
  
•  Apache-­‐licensed	
  open-­‐source	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ibis’s	
  Vision	
  
•  Uncompromised	
  Python	
  experience	
  
• 100%	
  Python	
  end-­‐to-­‐end	
  user	
  workflows	
  	
  
• Enable	
  integraFon	
  with	
  the	
  exisFng	
  Python	
  data	
  ecosystem	
  (pandas,	
  scikit-­‐
learn,	
  NumPy,	
  etc)	
  
•  InteracFve	
  at	
  big	
  data	
  scale	
  
• Full-­‐fidelity	
  analysis	
  without	
  extracFons	
  
• Scalability	
  for	
  big	
  data	
  
• NaFve	
  hardware	
  speeds	
  for	
  a	
  broad	
  set	
  of	
  use	
  cases	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Advantages	
  of	
  our	
  approach	
  
•  Analyze	
  big	
  data	
  100%	
  in	
  Python,	
  with	
  the	
  same	
  ease	
  as	
  small/medium	
  data	
  on	
  
the	
  local	
  filesystem	
  
•  Full-­‐fidelity	
  data	
  access	
  
•  Familiar	
  Python	
  experience	
  and	
  integraFon	
  with	
  exisFng	
  Python	
  data	
  libraries	
  
•  Provide	
  a	
  means	
  for	
  Python	
  high	
  performance	
  compuFng	
  tools	
  to	
  be	
  leveraged	
  at	
  
Hadoop-­‐scale	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Beta	
  0.3	
  release 	
  	
  
•  High	
  level	
  Python	
  API	
  for	
  describing	
  analyFcs	
  and	
  ETL	
  that	
  can	
  be	
  executed	
  by	
  
Impala	
  
• Familiar	
  API	
  for	
  users	
  of	
  pandas	
  
• Comprehensive	
  coverage	
  of	
  operaFons	
  expressible	
  as	
  relaFonal	
  data	
  flows	
  
•  Integrated	
  tools	
  for	
  managing	
  data	
  in	
  HDFS	
  
•  Simple	
  workflows	
  to	
  query	
  data	
  files	
  in	
  several	
  formats	
  (Parquet,	
  Avro,	
  Text)	
  
•  pandas	
  data	
  interchange	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ibis/Impala	
  Joint	
  Roadmap	
  
•  More	
  natural	
  data	
  modeling	
  
• Complex	
  types	
  support	
  
•  IntegraFon	
  with	
  full	
  Python	
  data	
  ecosystem	
  
• Advanced	
  analyFcs	
  +	
  machine	
  learning	
  
• Enable	
  use	
  of	
  performance	
  compuFng	
  tools	
  
•  User	
  extensibility	
  with	
  naFve	
  performance	
  
• In-­‐memory	
  columnar	
  format	
  
• Python-­‐to-­‐LLVM	
  IR	
  compilaFon	
  
•  Workflow	
  and	
  usability	
  tools	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Benefits	
  of	
  Ibis	
  
•  Maximize	
  developer	
  producFvity	
  
• Mirrors	
  single-­‐node	
  Python	
  experience	
  
• Solve	
  big	
  data	
  problems	
  without	
  leaving	
  Python	
  
• Leverage	
  Python	
  skills,	
  ecosystem,	
  and	
  tools	
  
•  Python	
  as	
  first-­‐class	
  language	
  for	
  Hadoop	
  
• Full-­‐fidelity	
  analysis	
  without	
  extracFons	
  
• Python	
  analysis	
  at	
  any	
  scale	
  
• NaFve	
  hardware	
  speeds	
  for	
  a	
  broad	
  set	
  of	
  use	
  cases	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
wes@cloudera.com	
  

More Related Content

What's hot

Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
PyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
P. Taylor Goetz
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
Thomas W. Dinsmore
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
Thomas W. Dinsmore
 

What's hot (20)

Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Ibis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
PyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
Apache Arrow - An Overview
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 

Viewers also liked

Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
Wes McKinney
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
Wes McKinney
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney
 

Viewers also liked (11)

Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Data Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 

Similar to Ibis: Scaling the Python Data Experience

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Codemotion
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
Hortonworks
 
Deep learning on HDP 2018 Prague
Deep learning on HDP 2018 PragueDeep learning on HDP 2018 Prague
Deep learning on HDP 2018 Prague
Timothy Spann
 
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
ADV Slides: Trends in Streaming Analytics and Message-oriented MiddlewareADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
DATAVERSITY
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
Jason Hubbard
 
Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018
Timothy Spann
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
DataWorks Summit
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
Nick Pentreath
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
neeraj rathore
 
Data mining tools overall
Data mining tools overallData mining tools overall
Data mining tools overall
Mohamed Sharique Vellikan
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
Travis Oliphant
 
OpenPOWER foundation update new executive director and bright open future_i...
OpenPOWER  foundation update  new executive director and bright open future_i...OpenPOWER  foundation update  new executive director and bright open future_i...
OpenPOWER foundation update new executive director and bright open future_i...
Ganesan Narayanasamy
 
OpenFaaS 2019 Project Update
OpenFaaS 2019 Project UpdateOpenFaaS 2019 Project Update
OpenFaaS 2019 Project Update
Alex Ellis
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson
 
Intel Cloud Foundry and OpenStack
Intel Cloud Foundry and OpenStackIntel Cloud Foundry and OpenStack
Intel Cloud Foundry and OpenStack
Silicon Valley Cloud Foundry Meetup
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Krishna Petrochemicals
 

Similar to Ibis: Scaling the Python Data Experience (20)

Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
 
Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
Deep learning on HDP 2018 Prague
Deep learning on HDP 2018 PragueDeep learning on HDP 2018 Prague
Deep learning on HDP 2018 Prague
 
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
ADV Slides: Trends in Streaming Analytics and Message-oriented MiddlewareADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
ADV Slides: Trends in Streaming Analytics and Message-oriented Middleware
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 
Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018Apache Deep Learning 101 - DWS Berlin 2018
Apache Deep Learning 101 - DWS Berlin 2018
 
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
HPCC Systems Engineering Summit: Community Use Case: Because Who Has Time for...
 
Apache deep learning 101
Apache deep learning 101Apache deep learning 101
Apache deep learning 101
 
IBM Developer Model Asset eXchange
IBM Developer Model Asset eXchangeIBM Developer Model Asset eXchange
IBM Developer Model Asset eXchange
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
 
Data mining tools overall
Data mining tools overallData mining tools overall
Data mining tools overall
 
Keynote at Converge 2019
Keynote at Converge 2019Keynote at Converge 2019
Keynote at Converge 2019
 
OpenPOWER foundation update new executive director and bright open future_i...
OpenPOWER  foundation update  new executive director and bright open future_i...OpenPOWER  foundation update  new executive director and bright open future_i...
OpenPOWER foundation update new executive director and bright open future_i...
 
OpenFaaS 2019 Project Update
OpenFaaS 2019 Project UpdateOpenFaaS 2019 Project Update
OpenFaaS 2019 Project Update
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Intel Cloud Foundry and OpenStack
Intel Cloud Foundry and OpenStackIntel Cloud Foundry and OpenStack
Intel Cloud Foundry and OpenStack
 
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)Data mining tools (R , WEKA, RAPID MINER, ORANGE)
Data mining tools (R , WEKA, RAPID MINER, ORANGE)
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 

More from Wes McKinney (16)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 

Recently uploaded

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
DanBrown980551
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
christinelarrosa
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
Mydbops
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
Ajin Abraham
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
Vadym Kazulkin
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
UiPathCommunity
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Pitangent Analytics & Technology Solutions Pvt. Ltd
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
Ivo Velitchkov
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Neo4j
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
Fwdays
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
Fwdays
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 

Recently uploaded (20)

LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...
 
Christine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptxChristine's Product Research Presentation.pptx
Christine's Product Research Presentation.pptx
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Must Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during MigrationMust Know Postgres Extension for DBA and Developer during Migration
Must Know Postgres Extension for DBA and Developer during Migration
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
AppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSFAppSec PNW: Android and iOS Application Security with MobSF
AppSec PNW: Android and iOS Application Security with MobSF
 
High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024High performance Serverless Java on AWS- GoTo Amsterdam 2024
High performance Serverless Java on AWS- GoTo Amsterdam 2024
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Session 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdfSession 1 - Intro to Robotic Process Automation.pdf
Session 1 - Intro to Robotic Process Automation.pdf
 
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
Crafting Excellence: A Comprehensive Guide to iOS Mobile App Development Serv...
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Apps Break Data
Apps Break DataApps Break Data
Apps Break Data
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and BioinformaticiansBiomedical Knowledge Graphs for Data Scientists and Bioinformaticians
Biomedical Knowledge Graphs for Data Scientists and Bioinformaticians
 
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin..."$10 thousand per minute of downtime: architecture, queues, streaming and fin...
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 

Ibis: Scaling the Python Data Experience

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Ibis:  Scaling  the  Python  Data   Experience   Wes  McKinney                    Marcel  Kornacker   JusFn  Erickson    Silvius  Rus  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Wes  McKinney   •  A  key  person  in  building  today’s  open  source  Python  data  community   •  Creator  of  pandas,  a  standard  Python  data  wrangling  and  analyFcs  toolkit  used   by  data  scienFsts   •  Author  of  best-­‐selling  canonical  text  Python  for  Data  Analysis  (2012)   •  Formerly  Founder/CEO  of  DataPad  (acquired  by  Cloudera  in  2014)  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   Python  is  popular…   •  Python  has  become  a  standard  language  of  data  science   •  Why  is  it  popular?   • Maximizes  producFvity  for  data  engineers  and  data  scienFsts   • Build  robust  so[ware  and  do  interacFve  data  analysis  with  100%  Python  code     • Easy-­‐to-­‐learn  and  makes  happy  and  producFve  data  teams     • Large,  diverse  open  source  development  community   • Comprehensive  libraries:  data  wrangling,  ML,  visualizaFon,  etc.   •  Main  use  case:  data  science  &  engineering  swiss  army  knife  on  small-­‐to-­‐medium   size  data  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   …but  Python  does  not  scale  today   •  Python  ecosystem  confined  to  single-­‐node  analysis   • Great  for  smaller  data  sets   • Requires  sampling  or  aggregaFons  for  larger  data   • Distributed  tools  compromise  in  various  ways   •  ExtracFng  samples  or  aggregaFons  for  larger  data  means:   • “Scales”  by  losing  more  fidelity   • AddiFonal  ETL  overhead  to  extract  samples/aggregaFons   • Loss  of  producFvity  with  mulFple  languages,  tools,  etc   • Blocks  certain  analysis  and  use  cases  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Ibis:  Same  Python,  now  at  scale   •  Target  user:   • Data  scienFsts  and  data  engineers  (“Python  data  users”)   •  Goals:   • Mirrors  single-­‐node  Python  experience   • Scales  to  any  node  and  data  size   • No  compromise  in  funcFonality  or  usability   • InteracFve  experience  at  naFve  hardware  speeds  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   What’s  announced?   •  First  public  release  of  Ibis   • hgp://ibis-­‐project.org   •  Beta  release  to  Cloudera  Labs   •  InviFng  usage  and  community  development   •  Apache-­‐licensed  open-­‐source  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   Ibis’s  Vision   •  Uncompromised  Python  experience   • 100%  Python  end-­‐to-­‐end  user  workflows     • Enable  integraFon  with  the  exisFng  Python  data  ecosystem  (pandas,  scikit-­‐ learn,  NumPy,  etc)   •  InteracFve  at  big  data  scale   • Full-­‐fidelity  analysis  without  extracFons   • Scalability  for  big  data   • NaFve  hardware  speeds  for  a  broad  set  of  use  cases  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Advantages  of  our  approach   •  Analyze  big  data  100%  in  Python,  with  the  same  ease  as  small/medium  data  on   the  local  filesystem   •  Full-­‐fidelity  data  access   •  Familiar  Python  experience  and  integraFon  with  exisFng  Python  data  libraries   •  Provide  a  means  for  Python  high  performance  compuFng  tools  to  be  leveraged  at   Hadoop-­‐scale  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Beta  0.3  release     •  High  level  Python  API  for  describing  analyFcs  and  ETL  that  can  be  executed  by   Impala   • Familiar  API  for  users  of  pandas   • Comprehensive  coverage  of  operaFons  expressible  as  relaFonal  data  flows   •  Integrated  tools  for  managing  data  in  HDFS   •  Simple  workflows  to  query  data  files  in  several  formats  (Parquet,  Avro,  Text)   •  pandas  data  interchange  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Ibis/Impala  Joint  Roadmap   •  More  natural  data  modeling   • Complex  types  support   •  IntegraFon  with  full  Python  data  ecosystem   • Advanced  analyFcs  +  machine  learning   • Enable  use  of  performance  compuFng  tools   •  User  extensibility  with  naFve  performance   • In-­‐memory  columnar  format   • Python-­‐to-­‐LLVM  IR  compilaFon   •  Workflow  and  usability  tools  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   Benefits  of  Ibis   •  Maximize  developer  producFvity   • Mirrors  single-­‐node  Python  experience   • Solve  big  data  problems  without  leaving  Python   • Leverage  Python  skills,  ecosystem,  and  tools   •  Python  as  first-­‐class  language  for  Hadoop   • Full-­‐fidelity  analysis  without  extracFons   • Python  analysis  at  any  scale   • NaFve  hardware  speeds  for  a  broad  set  of  use  cases  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   wes@cloudera.com