PyData: The Next Generation
Jan. 10, 2015
State of the union and questions for Python, Big Data, Analytics, and so forth in 2015 onward
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
PyData: The Next Generation
1. PyData: The Next Generation
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15
© Cloudera, Inc. All rights reserved.

2. PyData: Everything’s awesome… or is it?
Wes McKinney @wesmckinn
Data Day Texas 2015 #ddtx15

3. Me
• Data systems, tools, Python guru at Cloudera
• Formerly Founder/CEO of DataPad (visual analytics startup)
• Created pandas in 2008, lead developer until 2013
• Python for Data Analysis, published 10/2012
• O’Reilly’s best-selling data book of 2014
• Pythonista since 2007

4. What’s this about?
• Hopes and fears for the community and ecosystem
• Why do I care?
  • Python is fun!
  • Leverage
  • Accessibility for newbies
  • Community: smart, nice, humble people

5. Python at Cloudera
• Want Cloudera platform users to be successful with Python
• Spark/PySpark part of the Enterprise Data Hub / CDH
• Actively investing in Python tooling
• (p.s. we’re hiring?)
• (p.p.s. we have an Austin office now!)

6. Historical perspective and background
• 20 years of fast numerical computing in Python (Numeric 1995)
• 10 years of NumPy
• PyData becomes a thing in 2012
• Python as a data language goes mainstream
• Job descriptions tell all
• Shift in larger Python community from web towards data
• PyCon 2015 committee reported substantial growth in data-related submissions!

7. How’d this happen?
• Data, data everywhere
• Science! scikit-learn, statsmodels, and friends
• Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
• IPython Notebook
• Learning resources (books, conferences, blogs, etc.)
• Python environment/library management that “just works”

8. Put a Python (interface) on it!
Something no one got fired for, ever.

9. Meanwhile…
• Hadoop and Big Data go mainstream in 2009 onward
  • First Hadoop World: Fall 2009
  • First Strata conference: Spring 2011
• Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
• Solutions built, frameworks developed, companies founded
• Python was generally not a central part of those solutions
• A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)

10. We’re lucky to have lots of nice things
• What a language!
• IPython: interactive computing and collaboration
• Libraries to solve nearly any (non-big data) problem
• Trustworthy (medium) data wrangling, statistics, machine learning
• HPC / GPU / parallel computing frameworks
• FFI tools
• … and much more

11. “If this isn’t nice, what is?”
- Kurt Vonnegut

12. So, what kind of big data?
• Big multidimensional arrays / linear algebra
• Big tables (structured data)
• Big text data (unstructured data)
• Empirically, I personally am mostly interested in big tables

13. What kind of big data problems?
• ETL / Data Wrangling
  • Python has been used here for years with Hadoop Streaming
• BI / Analytics (“things you can do in SQL”)
• Advanced Analytics / Machine Learning
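
As a rough illustration of the Hadoop Streaming style mentioned above, here is a minimal word-count sketch. Streaming jobs exchange tab-separated key/value lines over standard input and output; the function names `mapper` and `reducer`, the sample text, and the use of `sorted` to stand in for the framework's sort/shuffle step are assumptions made for this example.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts for each key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() plays the role of the sort/shuffle phase.
text = ["the quick brown fox", "the lazy dog", "The end"]
shuffled = sorted(mapper(text))
counts = list(reducer(shuffled))
```

In an actual streaming job the two functions would typically live in separate scripts that read `sys.stdin` and print to standard output.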

14. Some ways we are #winning
• Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
• Huge uptake in the financial sector
• Many current and upcoming generations of data scientists learning Python as a first language
• Python in HPC / scientific computing

15. Some ways we are not #winning
• Python still doesn’t have a great “big data story”
• Little venture capital trickling down to Python projects
• Data structures and programming APIs lagging modern realities
• Weak support for emerging data formats
• Many companies with Python big data successes have not open-sourced their work

16. Python in big data workflows in practice
[Diagram. Labels: HDFS; Hadoop-MR; Spark; SQL; Big Data, Many Machines; Small/Medium Data, One Machine; pandas; Viz tools; ML / Stats; More counting / ETL; More insights / reporting; DSLs]

17. Big data storage formats
• JSON and CSV are not a good way to warehouse data
• Apache Avro
  • Compact binary data serialization format
  • RPC framework
• Apache Parquet
  • Efficient columnar data format optimized for HDFS
  • Supports nested and repeated fields, compression, encoding schemes
  • Co-developed by Twitter and Cloudera
  • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
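
To make the row-versus-columnar point concrete, the sketch below compares newline-delimited JSON with a toy columnar layout built from the standard library only. The table contents, the helper names `to_columns` and `encode_columns`, and the byte layout are all invented for illustration; this is not the Parquet or Avro format, just the general idea of packing each column contiguously with a fixed-width or dictionary encoding.

```python
import json
import struct

# Hypothetical sample table: 10,000 small records.
rows = [{"user_id": i, "country": "US" if i % 3 else "DE", "amount": i * 0.5}
        for i in range(10_000)]

# Row-oriented text: every record repeats every field name.
row_bytes = "\n".join(json.dumps(r) for r in rows).encode("utf-8")


def to_columns(records):
    """Pivot a list of dicts into a dict of per-field value lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def encode_columns(columns):
    """Pack each column contiguously (toy layout, not a real file format)."""
    n = len(columns["user_id"])
    dictionary = sorted(set(columns["country"]))
    return {
        "user_id": struct.pack(f"<{n}q", *columns["user_id"]),   # 64-bit ints
        "amount": struct.pack(f"<{n}d", *columns["amount"]),     # doubles
        # Dictionary encoding: one small integer code per string value.
        "country": bytes(dictionary.index(c) for c in columns["country"]),
        "country_dict": dictionary,
    }


encoded = encode_columns(to_columns(rows))
col_bytes = encoded["user_id"] + encoded["amount"] + encoded["country"]

# An aggregate over one column reads only that column's bytes.
total = sum(struct.unpack(f"<{len(rows)}d", encoded["amount"]))
```

Comparing `len(row_bytes)` with `len(col_bytes)` shows the size difference before any compression is applied, and the final line shows why scans over a single column are cheap in a columnar layout.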

18. We’re living in a JVM world
• Scala rapidly taking over big data analytics
• Functional, concise, good for building high level DSLs
• Build nice Scala APIs to clunkier Java frameworks
• JVM legitimately good for concurrent, distributed systems
• Binary interface with Python a major issue

19. Dremel, baby, Dremel…
• VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
• Inspiration for Parquet (cf blog “Dremel made easy with Parquet”)
• Peta-scale analytics directly on nested data
• Google BigQuery said to be an IaaS-ification of Dremel
• Supports SQL variant + new user-defined functions with JavaScript + V8
    SELECT COUNT(c1 > c2) FROM
      (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
              SUM(a.b.p.q.r) WITHIN RECORD AS c2
       FROM T3)
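
The query on this slide aggregates over repeated fields inside each record. A plain-Python sketch of that idea follows, under several assumptions: the nested schema of `T3` is invented, `COUNT(c1 > c2)` is read as "the number of records for which the comparison holds", and an empty sum is treated as 0 rather than NULL. The helpers `leaves`, `sum_within_record`, and `count_c1_gt_c2` are hypothetical names.

```python
def leaves(node, path):
    """Yield every value reachable along a list of field names,
    descending into lists (repeated fields) and skipping missing fields."""
    if isinstance(node, list):
        for item in node:
            yield from leaves(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from leaves(node[path[0]], path[1:])


def sum_within_record(record, dotted_path):
    """Rough analogue of SUM(path) WITHIN RECORD for one record."""
    return sum(leaves(record, dotted_path.split(".")))


def count_c1_gt_c2(records):
    """Count records whose per-record sums satisfy c1 > c2."""
    total = 0
    for record in records:
        c1 = sum_within_record(record, "a.b.c.d")
        c2 = sum_within_record(record, "a.b.p.q.r")
        total += c1 > c2
    return total


# Hypothetical nested data: b, c and q are repeated fields.
T3 = [
    {"a": {"b": [{"c": [{"d": 5}, {"d": 7}], "p": {"q": [{"r": 3}]}},
                 {"c": [{"d": 1}]}]}},
    {"a": {"b": [{"c": [{"d": 2}], "p": {"q": [{"r": 4}, {"r": 9}]}}]}},
]
matches = count_c1_gt_c2(T3)
```

Walking Python dicts and lists this way is convenient but slow, which is part of why dedicated data structures for nested data are attractive.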

20. Cloudera Impala
• Open-source interactive SQL for Hadoop
• Analytical query processor written in C++ with LLVM code generation
• Optimized to scan tables (best as Parquet format) in HDFS
• SQL front-end and query optimizer / planner
• User-defined function API (C++)
• impyla enables Python UDFs to be compiled with Numba to LLVM IR

21. Cloudera Impala (cont’d)
• For high performance big data analytics, Impala could be Python’s best friend
• C++/LLVM backend is lower-level than SQL
• Nested data support is coming

22. Some interesting things in recent times

23. Set point: Hadley Wickham
• R has upped its game with dplyr, tidyr, and other new projects
• New standard for a uniform interface to either in-memory or in-database data processing
• Composable table primitive operations
• Multiple major versions shipped, getting adopted
    80dc69b 2012-10-28 | Initial commit of dplyr [hadley]

    tbl %>%
      filter(c == 'bar') %>%
      group_by(a, b) %>%
      summarise(metric = mean(d - f)) %>%
      arrange(desc(metric))
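
For readers less familiar with R, the dplyr pipeline above can be traced step by step in plain Python. This is only a sketch of the same four operations on an invented table; the sample rows and the `pipeline` function are assumptions, and no particular Python library's API is implied.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical table with the columns used in the dplyr example.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 2.0},
    {"a": "x", "b": 2, "c": "foo", "d": 1.0, "f": 0.5},
    {"a": "y", "b": 1, "c": "bar", "d": 20.0, "f": 5.0},
]


def pipeline(rows):
    # filter(c == 'bar')
    filtered = (r for r in rows if r["c"] == "bar")
    # group_by(a, b), collecting d - f for each group
    groups = defaultdict(list)
    for r in filtered:
        groups[(r["a"], r["b"])].append(r["d"] - r["f"])
    # summarise(metric = mean(d - f))
    summary = [{"a": a, "b": b, "metric": mean(diffs)}
               for (a, b), diffs in groups.items()]
    # arrange(desc(metric))
    return sorted(summary, key=lambda r: r["metric"], reverse=True)


result = pipeline(tbl)
```

The contrast with the five-line R version is the point of the slide: composable table primitives keep the intent visible, whereas the hand-written loop buries it.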

24. Blaze
• Shares some semantics with dplyr
• Uses a generalized datashape protocol
• Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
• Deferred expression API
• Support for piping data between storage systems
• Multiple backends (pandas, SQL, MongoDB, PySpark, …)
• Growing support for out-of-core analytics
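
The combination of a deferred expression API and multiple backends can be sketched in a few lines. The classes and functions below (`Column`, `Compare`, `Filter`, `run_python`, `to_sql`) are hypothetical and are not Blaze's actual API; they only show the general pattern of building an expression object first and deciding later how to execute it.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, other):
        # Comparing a column builds an expression node instead of a bool.
        return Compare(">", self, other)


@dataclass(frozen=True)
class Compare:
    op: str
    column: Column
    value: object


@dataclass(frozen=True)
class Filter:
    table: str
    predicate: Compare


_OPS = {">": operator.gt}


def run_python(expr, tables):
    """Backend 1: evaluate against in-memory lists of dicts."""
    pred = expr.predicate
    test = _OPS[pred.op]
    return [row for row in tables[expr.table]
            if test(row[pred.column.name], pred.value)]


def to_sql(expr):
    """Backend 2: render the same expression as a SQL string."""
    pred = expr.predicate
    return (f"SELECT * FROM {expr.table} "
            f"WHERE {pred.column.name} {pred.op} {pred.value!r}")


expr = Filter("trades", Column("price") > 100)   # nothing is computed yet
tables = {"trades": [{"price": 90}, {"price": 120}]}
in_memory = run_python(expr, tables)
sql_text = to_sql(expr)
```

Because `expr` is just data, the same object can be handed to either backend, which is the property that makes in-memory and in-database processing look uniform to the user.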

25. libdynd
• Led by Mark Wiebe at Continuum Analytics
• Pure C++11 modern reimagining of NumPy
• Python bindings
• Supports variadic data cells and nested types (datashape protocol)
• Development has focused on the data container design over analytics

26. PySpark
• Popularity may exceed official Scala API
• Spark was not exactly designed to be an ideal companion to Python
• General architecture
  • Users build Spark deferred expression graphs in Python
  • User-supplied functions are serialized and broadcast around the cluster
  • Spark plans job and breaks work into tasks executed by Python worker jobs
  • Data is managed / shuffled by the Spark Scala master process
  • Python used largely as a black box to transform input to output
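
The architecture bullets above can be mimicked in a single process. The sketch below is not PySpark: `LazyDataset` and `run_task` are invented names, the "cluster" is a list of in-memory partitions, and the serialization and shipping of user functions is left out. It only shows how transformations can be recorded lazily and then applied per partition as opaque functions.

```python
def double(x):
    return x * 2


def is_big(x):
    return x > 4


class LazyDataset:
    """Records transformations; runs nothing until collect()."""

    def __init__(self, partitions, stages=()):
        self.partitions = partitions
        self.stages = stages

    def map(self, func):
        return LazyDataset(self.partitions, self.stages + (("map", func),))

    def filter(self, func):
        return LazyDataset(self.partitions, self.stages + (("filter", func),))

    def collect(self):
        results = []
        for partition in self.partitions:       # one "task" per partition
            results.extend(run_task(self.stages, partition))
        return results


def run_task(stages, partition):
    """Worker side: stream one partition through the recorded stages.
    Each user function is treated as a black box from input to output."""
    data = iter(partition)
    for kind, func in stages:
        data = map(func, data) if kind == "map" else filter(func, data)
    return list(data)


dataset = LazyDataset([[1, 2, 3], [4, 5, 6]]).map(double).filter(is_big)
collected = dataset.collect()
```

Until `collect()` is called, `dataset` holds only the two partitions and a tuple of pending stages, which is the "deferred expression graph" in miniature.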

27. PySpark: Some more gory details
• Spark master controlled using py4j
• Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
• Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
• Does not natively interface with NumPy (yet)
• But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides
    # pass large object by py4j is very slow and need much memory
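
One common way to marshal Python objects between processes through files or pipes is to write pickled batches with a length prefix. The sketch below illustrates that general technique; the 4-byte big-endian prefix, the batch size, the pickle protocol, and the function names `write_batches` and `read_batches` are illustrative assumptions, not a description of PySpark's actual wire format.

```python
import io
import pickle
import struct


def write_batches(stream, items, batch_size=2):
    """Write pickled batches, each preceded by a 4-byte big-endian length."""
    for start in range(0, len(items), batch_size):
        payload = pickle.dumps(items[start:start + batch_size], protocol=2)
        stream.write(struct.pack("!i", len(payload)))
        stream.write(payload)


def read_batches(stream):
    """Yield the original items back until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack("!i", header)
        yield from pickle.loads(stream.read(length))


# An in-memory buffer stands in for a file or socket between processes.
buffer = io.BytesIO()
write_batches(buffer, [("a", 1), ("b", 2), ("c", 3)])
buffer.seek(0)
roundtrip = list(read_batches(buffer))
```

Every value crosses the boundary as a pickled Python object, so numeric data pays a per-object cost that a fixed-width binary layout would avoid.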

28. Spartan
• http://github.com/spartan-array/spartan
• Python distributed array expression evaluator (“distributed NumPy”)
• Developed by Russell Power & others at NYU
• Uses ZeroMQ and custom RPC implementation

29. Things I think we should do
• Create high fidelity data structures for Dremel-style data
• Get serious about Avro, Parquet, and other new data format standards
• Invest in the Python-Impala-LLVM relationship
• Efficient binary protocols to receive and emit data from Python processes

30. Conclusions
• Python + PyData stack is as strong as ever, and still gaining momentum
• The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
• Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding

31. Thank you
Wes McKinney @wesmckinn
wes@cloudera.com