SlideShare a Scribd company logo
1 of 48
Download to read offline
Practical Medium Data
Analytics with Python
PyData NYC 2013
Practical Medium Data
Analytics with Python
10 Things I Hate
About pandas
PyData NYC 2013
Wes McKinney
@wesmckinn
• Former quant and MIT math dude
• Creator of Pandas project for Python
• Author of
Python for Data Analysis — O’Reilly

• Founder and CEO of DataPad

3

www.datapad.io
•
•

4

> 20k copies since Oct 2012
Bringing many new people
to Python and data analysis
with code

www.datapad.io
• http://datapad.io
Founded in 2013, located in SF
•
In private beta, join us!
•
• Hiring for engineering
www.datapad.io
Why hate on pandas?
7

www.datapad.io
pandas rocks!
So, pandas
• Easy-to-use, fast in-memory data wrangling
and analytics library

• Enabled loads of complex data work to be
done by mere mortals in Python

• Might have kept R from taking over the
world (hehe)

10

www.datapad.io
11

www.datapad.io
pandas, the project

• 170 distinct contributors
• Over 5400 issues and pull requests
on GitHub

•
12

Upcoming 0.13 release

www.datapad.io
But.

• pandas’s broad applicability also a
liability

•
pandas being used in some
•

Only game in town for lot of things
unplanned ways

13

www.datapad.io
Some things to love
• No more structured dtype drudgery!
• Easy IO!
• Data alignment!
• Hierarchical indexing!
• Time series analytics!
14

www.datapad.io
More things to love

• Table reshaping
• Missing data handling
pandas.merge, pandas.concat
•
Expressive groupby machinery
•
15

www.datapad.io
Some pandas use cases

• General data wrangling
• ETL jobs
Business analytics (incl. BI uses)
•
Time series analysis, statistical
•
modeling

16

www.datapad.io
pandas does many things
that are tedious, slow, or
difficult to do correctly
without it
Unfortunately, pandas is
not a database
#1 Slightly too far from
the metal

• DataFrame’s internal structure

intended to make row-oriented ops
fast on numerical data

•
19

Python objects can be used as data,
indices (a feature, not a bug)
www.datapad.io
#2 No support (yet) for
memory maps
• Many analytics ops require a small portion
of the data

• Many ways to “materialize” the full data set
in memory by accident

• Axis indexes wouldn’t necessarily make
sense on out of core data sets

20

www.datapad.io
#2 No support (yet) for
memory maps

• N.B. HDF5/PyTables support is a
partial solution

21

www.datapad.io
#3 No tight database
integration

• Makes it difficult to be a serious tool
in an ETL toolchain on top of some
SQL-ish system

•
22

Inadequacy of pandas/NumPy data
type systems
www.datapad.io
#3 No tight database
integration

• Jobs with heavy SQL-reading are
slow and use tons of memory

•

23

TODO: integrate pandas with ODBC
C API and write out SQL data directly
into NumPy arrays
www.datapad.io
#4 Best-efforts NA
representation

• Inconsistent representation of
missing data

•
NA needs to be a first class citizen in
•
No Boolean or Integer NA values
analytics operations

24

www.datapad.io
#5 RAM management
• Difficult to understand footprint of pandas
object

• Ample data copying throughout library
• Would benefit from being able to compress

data in-memory or shuttle data temporarily
to disk

25

www.datapad.io
#6 Weak support for
categorical data

• Makes pandas not quite a fullyfledged R replacement

•

26

GroupBy and Joins slower than they
could be

www.datapad.io
#7 Complex GroupBy
operations get messy

• Must write custom functions to pass
to .apply(..)

•

27

Easy to run up against DRY
problems and general Python
syntax limitations
www.datapad.io
#8 Appending data slow
and tedious

• DataFrame not intended as a
database table

•

Makes streaming data use a
challenge

• B+ tree tables interesting?
28

www.datapad.io
#9 Limited type system,
column metadata

• Currencies, units
• Time zones
Geographic data
•
Composite data types
•
29

www.datapad.io
#10 No true query
processing layer

•
•
•
•
•
•
30

Filter
Group
Join
Aggregate
Limit/TopK
Sorting

WHERE, HAVING
GROUP BY
JOIN
SUM, MEAN, ...
LIMIT
ORDER BY
www.datapad.io
#11 “Slow”: no multicore /
distributed algos

• Hampered by use of Python data
structures / GIL interactions

•

31

Object internals not designed for
concurrent use

www.datapad.io
Oh no what do we do
Stop believing in the “one
tool to rule them all”
“Real Artists Ship”
- Steve Jobs
www.datapad.io
Focus on results

• I am heavily biased by focus on
business analytics/BI use cases

•

36

Need production-ready software to
ship in relatively short time frame

www.datapad.io
A new project

• In internal development at DataPad
• Code named “badger”
pandas-ish syntax: designed for
•
data processing and analytical
queries

37

www.datapad.io
Badger in a nutshell

•
Compressed columnar binary storage
•
• High perf analytical query processor
• Data preparation/cleaning tools
Consistent data type system

38

www.datapad.io
Badger in a nutshell

•
Immutable array data, little copying
•
• Analytics kernels: written C with no
Time series analytics

dependencies

•
39

Caching of useful intermediates
www.datapad.io
Some benchmarks

• Data set: 2012 Election data (FEC)
5.3 mm records 7 columns
•
• Tools
• pandas
badger
•
• R: data.table
SQL: PostgreSQL, SQLite
•
40

www.datapad.io
Query 1

• Total contributions by candidate
SELECT	
  cand_nm,	
  
	
  	
  	
  	
  	
  	
  	
  sum(contb_receipt_amt)	
  AS	
  total
FROM	
  fec
GROUP	
  BY	
  cand_nm

41

www.datapad.io
Query 1

• Total contributions by candidate
badger	
  (in-­‐memory)	
  :	
  	
  	
  19ms	
  (1x)
badger	
  (from-­‐disk)	
  :	
  	
  131ms	
  (6.9x)
pandas	
  (in-­‐memory)	
  :	
  	
  273ms	
  (14.3x)
R	
  data.table	
  1.8.10:	
  	
  382ms	
  (20x)
PostgreSQL	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  	
  4.7s	
  (247x)
SQLite	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  	
  	
  72s	
  (3800x)

42

www.datapad.io
Query 2
contributions by candidate
• Totalstate
and
SELECT	
  cand_nm,	
  contbr_st,
	
  	
  	
  	
  	
  	
  	
  sum(contb_receipt_amt)	
  AS	
  total
FROM	
  fec
GROUP	
  BY	
  cand_nm,	
  contbr_st

43

www.datapad.io
Query 2

•

Total contributions by candidate and
state

badger	
  (in-­‐memory)	
  :	
  	
  269ms	
  (1x)
badger	
  (from-­‐disk)	
  :	
  	
  391ms	
  (1.5x)
R	
  data.table	
  1.8.10:	
  	
  500ms	
  (1.8x)
pandas	
  (in-­‐memory)	
  :	
  	
  770ms	
  (2.9x)
PostgreSQL	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  5.96s	
  (23x)

44

www.datapad.io
Query 3

• Total contributions by candidate
and state with 2 filter predicates

SELECT	
  cand_nm,
	
  	
  	
  	
  	
  	
  	
  sum(contb_receipt_amt)	
  as	
  total
FROM	
  fec
WHERE	
  contb_receipt_dt	
  BETWEEN
	
  	
  	
  	
  	
  	
  	
  	
  '2012-­‐05-­‐01'	
  and	
  '2012-­‐11-­‐05'
	
  	
  AND	
  contb_receipt_amt	
  BETWEEN	
  
	
  	
  	
  	
  	
  	
  	
  	
  0	
  and	
  2500
GROUP	
  BY	
  cand_nm
45

www.datapad.io
Query 3

• Total contributions by candidate
and state with 2 filter predicates

badger	
  (in-­‐memory)	
  :	
  	
  	
  96ms	
  (1x)
badger	
  (from-­‐disk)	
  :	
  	
  275ms	
  (2.9x)
pandas	
  (in-­‐memory)	
  :	
  	
  946ms	
  (9.8x)
PostgreSQL	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  	
  6.2s	
  (65x)

46

www.datapad.io
Badger, the future

• Distributed in-memory analytics
• Multicore algorithms
• ETL job-building tools
• Open source in some form someday
Looking for algorithms hackers to help
•
47

www.datapad.io
Thank you!

48

www.datapad.io

More Related Content

What's hot

Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...
Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...
Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...Databricks
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1Federico Campoli
 
Big Data Technologies.pdf
Big Data Technologies.pdfBig Data Technologies.pdf
Big Data Technologies.pdfRAHULRAHU8
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...Databricks
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition PruningApache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition PruningAparup Chatterjee
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to GreenplumDave Cramer
 
Hybrid Data Guard to Cloud GEN2 ExaCS.pdf
Hybrid Data Guard to Cloud GEN2 ExaCS.pdfHybrid Data Guard to Cloud GEN2 ExaCS.pdf
Hybrid Data Guard to Cloud GEN2 ExaCS.pdfALI ANWAR, OCP®
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache CassandraDataStax
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesDatabricks
 
RDBMS to Graph
RDBMS to GraphRDBMS to Graph
RDBMS to GraphNeo4j
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremGrisha Weintraub
 
Spark streaming
Spark streamingSpark streaming
Spark streamingWhiteklay
 

What's hot (20)

Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...
Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...
Revolutionizing the Legal Industry with Spark, NLP and Azure Databricks at Cl...
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1
 
Big Data Technologies.pdf
Big Data Technologies.pdfBig Data Technologies.pdf
Big Data Technologies.pdf
 
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets with Jules ...
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition PruningApache Spark 3 Dynamic Partition Pruning
Apache Spark 3 Dynamic Partition Pruning
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
Hybrid Data Guard to Cloud GEN2 ExaCS.pdf
Hybrid Data Guard to Cloud GEN2 ExaCS.pdfHybrid Data Guard to Cloud GEN2 ExaCS.pdf
Hybrid Data Guard to Cloud GEN2 ExaCS.pdf
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Getting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on KubernetesGetting Started with Apache Spark on Kubernetes
Getting Started with Apache Spark on Kubernetes
 
RDBMS to Graph
RDBMS to GraphRDBMS to Graph
RDBMS to Graph
 
Dynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theoremDynamo and BigTable in light of the CAP theorem
Dynamo and BigTable in light of the CAP theorem
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 

Similar to Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Wes McKinney
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGIAF
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDeltares
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)spil-engineering
 
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Rittman Analytics
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Mitul Tiwari
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016Kent Graziano
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemTuri, Inc.
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauMongoDB
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 

Similar to Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013) (20)

The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De Boer
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
 
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData Ecosystem
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 

Recently uploaded

AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityVictorSzoltysek
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingWSO2
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 

Recently uploaded (20)

AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
TrustArc Webinar - Unified Trust Center for Privacy, Security, Compliance, an...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

  • 1. Practical Medium Data Analytics with Python PyData NYC 2013
  • 2. Practical Medium Data Analytics with Python 10 Things I Hate About pandas PyData NYC 2013
  • 3. Wes McKinney @wesmckinn • Former quant and MIT math dude • Creator of Pandas project for Python • Author of Python for Data Analysis — O’Reilly • Founder and CEO of DataPad 3 www.datapad.io
  • 4. • • 4 > 20k copies since Oct 2012 Bringing many new people to Python and data analysis with code www.datapad.io
  • 5. • http://datapad.io Founded in 2013, located in SF • In private beta, join us! • • Hiring for engineering www.datapad.io
  • 6. Why hate on pandas?
  • 9.
  • 10. So, pandas • Easy-to-use, fast in-memory data wrangling and analytics library • Enabled loads of complex data work to be done by mere mortals in Python • Might have kept R from taking over the world (hehe) 10 www.datapad.io
  • 12. pandas, the project • 170 distinct contributors • Over 5400 issues and pull requests on GitHub • 12 Upcoming 0.13 release www.datapad.io
  • 13. But. • pandas’s broad applicability also a liability • pandas being used in some • Only game in town for lot of things unplanned ways 13 www.datapad.io
  • 14. Some things to love • No more structured dtype drudgery! • Easy IO! • Data alignment! • Hierarchical indexing! • Time series analytics! 14 www.datapad.io
  • 15. More things to love • Table reshaping • Missing data handling pandas.merge, pandas.concat • Expressive groupby machinery • 15 www.datapad.io
  • 16. Some pandas use cases • General data wrangling • ETL jobs Business analytics (incl. BI uses) • Time series analysis, statistical • modeling 16 www.datapad.io
  • 17. pandas does many things that are tedious, slow, or difficult to do correctly without it
  • 19. #1 Slightly too far from the metal • DataFrame’s internal structure intended to make row-oriented ops fast on numerical data • 19 Python objects can be used as data, indices (a feature, not a bug) www.datapad.io
  • 20. #2 No support (yet) for memory maps • Many analytics ops require a small portion of the data • Many ways to “materialize” the full data set in memory by accident • Axis indexes wouldn’t necessarily make sense on out of core data sets 20 www.datapad.io
  • 21. #2 No support (yet) for memory maps • N.B. HDF5/PyTables support is a partial solution 21 www.datapad.io
  • 22. #3 No tight database integration • Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system • 22 Inadequacy of pandas/NumPy data type systems www.datapad.io
  • 23. #3 No tight database integration • Jobs with heavy SQL-reading are slow and use tons of memory • 23 TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays www.datapad.io
  • 24. #4 Best-efforts NA representation • Inconsistent representation of missing data • NA needs to be a first class citizen in • No Boolean or Integer NA values analytics operations 24 www.datapad.io
  • 25. #5 RAM management • Difficult to understand footprint of pandas object • Ample data copying throughout library • Would benefit from being able to compress data in-memory or shuttle data temporarily to disk 25 www.datapad.io
  • 26. #6 Weak support for categorical data • Makes pandas not quite a fullyfledged R replacement • 26 GroupBy and Joins slower than they could be www.datapad.io
  • 27. #7 Complex GroupBy operations get messy • Must write custom functions to pass to .apply(..) • 27 Easy to run up against DRY problems and general Python syntax limitations www.datapad.io
  • 28. #8 Appending data slow and tedious • DataFrame not intended as a database table • Makes streaming data use a challenge • B+ tree tables interesting? 28 www.datapad.io
  • 29. #9 Limited type system, column metadata • Currencies, units • Time zones Geographic data • Composite data types • 29 www.datapad.io
  • 30. #10 No true query processing layer • • • • • • 30 Filter Group Join Aggregate Limit/TopK Sorting WHERE, HAVING GROUP BY JOIN SUM, MEAN, ... LIMIT ORDER BY www.datapad.io
  • 31. #11 “Slow”: no multicore / distributed algos • Hampered by use of Python data structures / GIL interactions • 31 Object internals not designed for concurrent use www.datapad.io
  • 32. Oh no what do we do
  • 33. Stop believing in the “one tool to rule them all”
  • 36. Focus on results • I am heavily biased by focus on business analytics/BI use cases • 36 Need production-ready software to ship in relatively short time frame www.datapad.io
  • 37. A new project • In internal development at DataPad • Code named “badger” pandas-ish syntax: designed for • data processing and analytical queries 37 www.datapad.io
  • 38. Badger in a nutshell • Compressed columnar binary storage • • High perf analytical query processor • Data preparation/cleaning tools Consistent data type system 38 www.datapad.io
  • 39. Badger in a nutshell • Immutable array data, little copying • • Analytics kernels: written C with no Time series analytics dependencies • 39 Caching of useful intermediates www.datapad.io
  • 40. Some benchmarks • Data set: 2012 Election data (FEC) 5.3 mm records 7 columns • • Tools • pandas badger • • R: data.table SQL: PostgreSQL, SQLite • 40 www.datapad.io
  • 41. Query 1 • Total contributions by candidate SELECT  cand_nm,                sum(contb_receipt_amt)  AS  total FROM  fec GROUP  BY  cand_nm 41 www.datapad.io
  • 42. Query 1 • Total contributions by candidate badger  (in-­‐memory)  :      19ms  (1x) badger  (from-­‐disk)  :    131ms  (6.9x) pandas  (in-­‐memory)  :    273ms  (14.3x) R  data.table  1.8.10:    382ms  (20x) PostgreSQL                  :      4.7s  (247x) SQLite                          :        72s  (3800x) 42 www.datapad.io
  • 43. Query 2 contributions by candidate • Totalstate and SELECT  cand_nm,  contbr_st,              sum(contb_receipt_amt)  AS  total FROM  fec GROUP  BY  cand_nm,  contbr_st 43 www.datapad.io
  • 44. Query 2 • Total contributions by candidate and state badger  (in-­‐memory)  :    269ms  (1x) badger  (from-­‐disk)  :    391ms  (1.5x) R  data.table  1.8.10:    500ms  (1.8x) pandas  (in-­‐memory)  :    770ms  (2.9x) PostgreSQL                  :    5.96s  (23x) 44 www.datapad.io
  • 45. Query 3 • Total contributions by candidate and state with 2 filter predicates SELECT  cand_nm,              sum(contb_receipt_amt)  as  total FROM  fec WHERE  contb_receipt_dt  BETWEEN                '2012-­‐05-­‐01'  and  '2012-­‐11-­‐05'    AND  contb_receipt_amt  BETWEEN                  0  and  2500 GROUP  BY  cand_nm 45 www.datapad.io
  • 46. Query 3 • Total contributions by candidate and state with 2 filter predicates badger  (in-­‐memory)  :      96ms  (1x) badger  (from-­‐disk)  :    275ms  (2.9x) pandas  (in-­‐memory)  :    946ms  (9.8x) PostgreSQL                  :      6.2s  (65x) 46 www.datapad.io
  • 47. Badger, the future • Distributed in-memory analytics • Multicore algorithms • ETL job-building tools • Open source in some form someday Looking for algorithms hackers to help • 47 www.datapad.io