The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)

Published in: Technology


  1. The Last Mile: Challenges and opportunities in data tools (Strata Santa Clara 2014)
  2. Wes McKinney (@wesmckinn)
     • Former quant at AQR (a hedge fund)
     • Creator of pandas
     • Author of Python for Data Analysis (O’Reilly)
     • Founder and CEO of DataPad
     www.datapad.io
  4. • http://datapad.io
     • New web-based visual analytics environment
     • In private beta, join us!
     • Hiring for engineering
  5. Some Problems
     • Statistics and ML
     • ETL
     • Data Visualization
     • Workflows + Collaboration
     • Business Analytics
  6. Data toolchains (diagram labels): Data Acquisition; Data Slinging / Management; ETL; SQL / Tidy Form; Code-based Env; UI-based Env; Analysis
  7. Data toolchains (diagram labels): Data Acquisition; Maybe HDFS; ETL; ETL; Analytic DBMS; Code-based Env; ETL?; UI-based Env
  8. Some Trends
     • SQL-on-Hadoop
     • Spark / Spark ecosystem
     • New life in visual ETL / data prep
     • Better data manipulation libraries
     • Columnar / analytic databases
  9. Crunching data with code
     • Python: pandas
     • Data frames in Scala, F#, Julia, …
     • Spark (Scala/Java)
     • R (+ data.table, dplyr)
  10. Some Programmatic Tool Problems
     • Awkward / slow DB interactions
     • In-process memory management
     • Reuse of intermediate results
     • Execution speed
     • Evaluation semantics
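The "reuse of intermediate results" problem can be illustrated with a small sketch. It uses only the Python standard library; the function names, the `path` argument, and the stand-in rows are all hypothetical and chosen for illustration. Two analyses share one expensive intermediate result, which is computed once and then served from a cache.

```python
from functools import lru_cache

scans = []  # records each time the (pretend) expensive scan actually runs


@lru_cache(maxsize=None)
def daily_totals(path):
    """Aggregate values per day; `path` is a hypothetical dataset identifier."""
    scans.append(path)
    rows = [("mon", 3), ("mon", 4), ("tue", 5)]  # stand-in for data read from `path`
    totals = {}
    for day, value in rows:
        totals[day] = totals.get(day, 0) + value
    # Return an immutable value so the cached result cannot be mutated by callers.
    return tuple(sorted(totals.items()))


def busiest_day(path):
    """First downstream analysis built on the shared intermediate result."""
    return max(daily_totals(path), key=lambda kv: kv[1])[0]


def grand_total(path):
    """Second downstream analysis; reuses the cached aggregate."""
    return sum(value for _, value in daily_totals(path))


top = busiest_day("events.csv")
total = grand_total("events.csv")
print(top, total, len(scans))
```

A process-local cache like this is the simplest option; it does nothing for memory pressure or for sharing results between processes, which are the harder parts of the problem listed above.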
  11. dplyr (R library)
     • By Hadley Wickham and Romain Francois
     • Uniform R API, SQL and in-memory backends
     • Describe complex data manipulation using “chaining”
  12. dplyr (R library)

         final <- crime.by.state %.%
           filter(State=="New York", Year==2005) %.%
           arrange(desc(Count)) %.%
           select(Type.of.Crime, Count) %.%
           mutate(Proportion=Count/sum(Count)) %.%
           group_by(Type.of.Crime) %.%
           summarise(num.types = n(), counts = sum(Count))
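For comparison, a similar chain can be sketched in pandas. The column names mirror the dplyr example, but the table contents are made up here because the slides do not show the data, and the output column is named `num_types` because pandas keyword arguments cannot contain dots. This is an approximation of the chain, not a line-for-line translation.

```python
import pandas as pd

# Hypothetical data; the real crime.by.state table is not shown in the slides.
crime_by_state = pd.DataFrame({
    "State": ["New York", "New York", "New York", "Ohio"],
    "Year": [2005, 2005, 2005, 2005],
    "Type.of.Crime": ["Burglary", "Robbery", "Burglary", "Robbery"],
    "Count": [30, 50, 20, 40],
})

final = (
    crime_by_state
    .query("State == 'New York' and Year == 2005")                # filter(...)
    .sort_values("Count", ascending=False)                         # arrange(desc(Count))
    .loc[:, ["Type.of.Crime", "Count"]]                            # select(...)
    .assign(Proportion=lambda d: d["Count"] / d["Count"].sum())   # mutate(...)
    .groupby("Type.of.Crime")                                      # group_by(...)
    .agg(num_types=("Count", "size"), counts=("Count", "sum"))    # summarise(...)
)
print(final)
```

Each method returns a new object, so the steps read top to bottom in the same order as the `%.%` chain.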
  13. Apache Spark
     • Broad set of primitive data ops
     • Distributed in-memory model scales naturally, high performance
     • Build complex computation graphs for analytics
     • Applications: Shark, GraphX, …
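The idea of building a computation graph before running it can be shown with a toy, single-machine sketch in plain Python. The `LazyDataset` class below is hypothetical and is not the Spark API; it only illustrates that transformations are recorded first and executed when a result is requested.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not the Spark API)."""

    def __init__(self, source, ops=()):
        self._source = source    # any re-iterable collection of records
        self._ops = tuple(ops)   # recorded transformations, not yet executed

    def map(self, fn):
        # Record the step and return a new dataset; nothing is computed here.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        """Execute the recorded steps in order and return a list."""
        records = iter(self._source)
        for kind, fn in self._ops:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return list(records)


calls = []  # tracks when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(square)
ran_before_collect = list(calls)   # still empty: only the graph was built
result = pipeline.collect()        # execution happens here
print(ran_before_collect, result)
```

Because the whole chain is known before anything runs, an engine is free to reorder, fuse, or distribute the steps, which this sketch does not attempt.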
  14. pandas (Python library)
     • Broad traction
     • Strong feature: time series analytics
     • User-friendly API and community
     • Being used in many unexpected ways
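As a small illustration of the time series point, here is a sketch using a made-up series of irregularly spaced readings. The values and timestamps are invented; `resample` and label-based slicing on a `DatetimeIndex` are standard pandas features.

```python
import pandas as pd

# Hypothetical readings at irregular times over two days.
readings = pd.Series(
    [1.0, 2.0, 3.0, 10.0, 20.0],
    index=pd.to_datetime([
        "2014-02-11 09:00", "2014-02-11 12:30", "2014-02-11 18:00",
        "2014-02-12 08:15", "2014-02-12 16:45",
    ]),
)

# Downsample to one value per calendar day.
daily_mean = readings.resample("D").mean()

# Select a time window by label rather than by integer position.
afternoon = readings.loc["2014-02-11 12:00":"2014-02-11 23:59"]

print(daily_mean)
print(afternoon)
```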
  15. badger (DataPad internal)
     • A high performance in-memory analytics engine for DataPad
     • Addresses many performance and memory management concerns in pandas
     • May become an OSS project someday
  16. Standardized machine learning toolkits
     • scikit-learn
     • PMML
     • Mahout
     • Cloudera ML
  17. Enterprise data workflows
     • Apache Crunch
     • Pig
     • Cascading (+ Scalding, Cascalog)
  18. Analytic databases
     • Powering visual analytics tools on big data
     • MPP / in-memory execution model
     • Compressed columnar storage
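One reason columnar storage compresses well is that values in a single column tend to repeat. The sketch below uses run-length encoding as an example; the slides do not say which encodings any particular database uses, and the helper names and sample column are hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def count_equal(runs, target):
    """Count rows equal to `target` directly on the compressed form."""
    return sum(count for value, count in runs if value == target)


states = ["NY", "NY", "NY", "CA", "CA", "NY"]  # made-up column values
encoded = rle_encode(states)
print(encoded)
print(count_equal(encoded, "NY"))
```

The last helper shows why this matters for analytics: some aggregates can be answered from the compressed runs without expanding the column.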
  19. Visual data tools
     • Visual Analytics/BI gone mainstream
     • New Data Prep products
     • Drag-and-drop predictive analytics
     • Proliferation of vertical SaaS solutions
  20. Visual tool challenges
     • Tend to be less flexible than code
     • Multiple tools to get the job done
     • Many still dependent on Excel
     • Collaboration, versioning, provenance
  21. Collaboration tools
     • Discovery and reuse
     • Cataloguing insights
     • Analytics from ad-hoc to production
     • Interesting projects: IPython Notebook, Shiny, Pivotal Chorus
  22. Some ideas
  23. Abstract away the execution model (where possible)
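A minimal sketch of what abstracting the execution model could look like: one backend-neutral query description, run either over in-memory rows or as SQL through the standard library's `sqlite3` module. The `GroupSum` type, both runner functions, and the `orders` table are hypothetical names invented for this illustration, not taken from any tool mentioned in the talk.

```python
import sqlite3
from collections import namedtuple

# Backend-neutral description: sum `value` per `key` for rows with value >= min_value.
GroupSum = namedtuple("GroupSum", ["table", "key", "value", "min_value"])


def run_in_memory(query, tables):
    """Execute the query over a dict of table name -> list of row dicts."""
    totals = {}
    for row in tables[query.table]:
        if row[query.value] >= query.min_value:
            totals[row[query.key]] = totals.get(row[query.key], 0) + row[query.value]
    return sorted(totals.items())


def run_sql(query, conn):
    """Translate the same query to SQL and execute it on a DB-API connection."""
    # Identifiers come from trusted code in this sketch; only the threshold is bound.
    sql = (
        f"SELECT {query.key}, SUM({query.value}) FROM {query.table} "
        f"WHERE {query.value} >= ? GROUP BY {query.key} ORDER BY {query.key}"
    )
    return conn.execute(sql, (query.min_value,)).fetchall()


rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 3},
    {"region": "east", "sales": 7},
    {"region": "west", "sales": 8},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, sales INTEGER)")
conn.executemany("INSERT INTO orders VALUES (:region, :sales)", rows)

query = GroupSum(table="orders", key="region", value="sales", min_value=5)
in_memory_result = run_in_memory(query, {"orders": rows})
sql_result = run_sql(query, conn)
print(in_memory_result, sql_result)
```

The caller writes the query once and picks a backend at run time, which is the same general shape as the "uniform API, SQL and in-memory backends" idea mentioned for dplyr.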
  24. More integrated environments
  25. Enhance collaboration
  26. Thank you!
