2. Wes McKinney
@wesmckinn
• Former quant @ AQR (a hedge fund)
• Creator of pandas
• Author of Python for Data Analysis (O’Reilly)
• Founder and CEO of DataPad
www.datapad.io
8. Some Trends
• SQL-on-Hadoop
• Spark / Spark ecosystem
• New life in visual ETL / data prep
• Better data manipulation libraries
• Columnar / analytic databases
9. Crunching data with code
• Python: pandas
• Data frames in Scala, F#, Julia, …
• Spark (Scala/Java)
• R (+ data.table, dplyr)
10. Some Programmatic Tool Problems
• Awkward / slow DB interactions
• In-process memory management
• Reuse of intermediate results
• Execution speed
• Evaluation semantics
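
The points about intermediate results and evaluation semantics can be made concrete with a small sketch. It uses only the Python standard library; the function names `eager_pipeline` and `lazy_pipeline` are made up for illustration and do not come from any of the tools discussed. The eager version materializes a full list at every step, while the lazy version streams values through generators.

```python
def eager_pipeline(values):
    # Eager semantics: each step builds a complete intermediate list in memory.
    doubled = [v * 2 for v in values]
    kept = [v for v in doubled if v % 3 == 0]
    return sum(kept)


def lazy_pipeline(values):
    # Lazy semantics: generators pass one element at a time to the next step,
    # so no intermediate collection is ever held in full.
    doubled = (v * 2 for v in values)
    kept = (v for v in doubled if v % 3 == 0)
    return sum(kept)


values = range(10)
eager_total = eager_pipeline(values)
lazy_total = lazy_pipeline(values)
```

Both functions return the same total. The difference is in peak memory use, and in the fact that the lazy intermediates cannot be reused once consumed, which is the kind of trade-off the bullets above point at.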
11. dplyr (R library)
• By Hadley Wickham and Romain François
• Uniform R API, SQL and in-memory backends
• Describe complex data manipulation using “chaining”
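
As a rough illustration of the “chaining” style, here is a sketch in Python using pandas method chaining rather than dplyr itself. This is an analogue, not dplyr's API, and the `flights` table with its `carrier` and `delay` columns is invented sample data.

```python
import pandas as pd

# Invented sample data for illustration only.
flights = pd.DataFrame({
    "carrier": ["AA", "AA", "UA", "UA", "DL"],
    "delay": [10, 30, 5, 15, 0],
})

# Each step returns a new DataFrame, so the steps read top to bottom
# like a pipeline: filter, group, summarise, sort.
result = (
    flights
    .query("delay > 0")
    .groupby("carrier", as_index=False)
    .agg(mean_delay=("delay", "mean"))
    .sort_values("mean_delay", ascending=False)
    .reset_index(drop=True)
)
```

The chained form avoids naming every intermediate table, which is much of the appeal of the approach.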
13. Apache Spark
• Broad set of primitive data ops
• Distributed in-memory model scales naturally, high performance
• Build complex computation graphs for analytics
• Applications: Shark, GraphX, …
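
The idea of building up a computation before running it can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in written for this example; it is not Spark's API, it runs on a single machine, and it records only a linear chain of operations rather than a general graph.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical; not Spark's API)."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)  # recorded transformations, not yet executed

    def map(self, fn):
        # Record the transformation and return a new dataset; nothing runs yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # The recorded chain is executed only when a result is requested.
        data = iter(self._source)
        for kind, fn in self._ops:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


squares_of_evens = (
    LazyDataset(range(6))
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
)
collected = squares_of_evens.collect()
```

Because the operations are recorded before they run, an engine built this way has the whole plan available and can decide how to execute it.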
14. pandas (Python library)
• Broad traction
• Strong feature: time series analytics
• User-friendly API and community
• Being used in many unexpected ways
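
A minimal sketch of the time series features mentioned above, using a made-up daily price series. The dates and values are invented for illustration.

```python
import pandas as pd

# Invented daily series indexed by date.
index = pd.date_range("2014-01-01", periods=6, freq="D")
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=index)

# Downsample to two-day buckets and average within each bucket.
two_day_mean = prices.resample("2D").mean()

# Three-day moving average; the first two entries are NaN because
# the window is not yet full.
rolling_mean = prices.rolling(window=3).mean()
```

Both operations rely on the date index, so no manual bucketing or window bookkeeping is needed.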
15. badger (DataPad internal)
• A high-performance in-memory analytics engine for DataPad
• Addresses many performance and memory management concerns in pandas
• May become an OSS project someday
18. Analytic databases
• Powering visual analytics tools on big data
• MPP / in-memory execution model
• Compressed columnar storage
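
To show why a columnar layout compresses well, here is a small standard-library sketch. The sample rows are invented, and run-length encoding is used only as one simple example of a column encoding; it is not a claim about what any particular database does.

```python
from itertools import groupby

# Invented row-oriented records: (country, year, amount).
rows = [
    ("US", 2013, 10),
    ("US", 2013, 12),
    ("US", 2014, 7),
    ("DE", 2014, 3),
]

# Column-oriented layout: one list per field instead of one tuple per row.
country, year, amount = (list(col) for col in zip(*rows))


def run_length_encode(column):
    # Collapse consecutive repeated values into (value, run_length) pairs.
    return [(value, len(list(run))) for value, run in groupby(column)]


encoded_country = run_length_encode(country)

# An aggregate over one field touches only that column.
total_amount = sum(amount)
```

Values within a single column tend to repeat, so they encode compactly, and an analytic query that needs one field can skip the others entirely.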
19. Visual data tools
• Visual Analytics/BI gone mainstream
• New Data Prep products
• Drag-and-drop predictive analytics
• Proliferation of vertical SaaS solutions
20. Visual tool challenges
• Tend to be less flexible than code
• Multiple tools to get the job done
• Many still dependent on Excel
• Collaboration, versioning, provenance
21. Collaboration tools
• Discovery and reuse
• Cataloguing insights
• Analytics from ad-hoc to production
• Interesting projects: IPython Notebook, Shiny, Pivotal Chorus