Building Better Analytics Workflows (Strata-Hadoop World 2013)

Talk by Wes McKinney (twitter.com/wesmckinn, http://datapad.io) at Strata 2013 NYC

Published in: Technology, Education

  1. Building better analytics workflows: Strata-Hadoop World 2013
  2. Wes McKinney (@wesmckinn)
       • Former quant at AQR (a hedge fund)
       • Creator of the pandas project for Python
       • Author of Python for Data Analysis (O’Reilly)
       • Founder and CEO of DataPad (www.datapad.io)
  3.   • > 20k copies since Oct 2012
       • Bringing many new people to Python and data analysis with code
  4. Why so many learning to program?
       • Increasing data scale
       • More and more data munging/integration
       • Need for Statistics and Predictive Analytics
       • Building complex data visualizations
       • Inadequacy of Excel or other UI-driven data tools
  5. The Analytics Workflow
       • Acquisition
       • Preparation
       • Visualization
       • Analysis
       • Sharing
  6. The Analytics Workflow
  7. What do we care about?
       • Minimize time to answer
       • Ask more questions
       • Reduce friction between tools and processes
       • Team productivity
  8. Data Tools for Humans (TM?)
  9. What can go wrong?
       • Inefficient workflows lead to lower quality analysis
       • Results may not be actionable in a reasonable time-frame
  10. (no text on slide)
  11. Three types of problems
       • Tooling
       • Workflow management
       • Collaboration
  12. Big Notable Data Trends
  13. Data Preparation: an ongoing problem
  14. (no text on slide)
  15. For programmers, luckily it’s not 2005 anymore
       • R: Hadley Wickham’s packages
       • Python: pandas
       • Hadoop: Pig
  16. Data preparation with visual tools
       • Google OpenRefine
       • Google Fusion Tables
       • Microsoft Excel
       • Data Wrangler
  17. Some new startups building data preparation tools
  18. Business Intelligence: essential for doing business
  19. BI macro-trends
       • Self Service BI
       • Visual Discovery
       • SQL on Hadoop
  20. It’s the hey-day for BI startups
  21. Predictive Analytics is getting easier
  22. Some predictive analytics startups
  23. Perils of “data science in a box”
  24. Predictive analytics pitfalls
       • Signal vs. Noise
       • Identify the right patterns
       • Uncertain ROI
  25. Some analytics workflow problems still need work
  26. Friction between tools
  27. Friction between tools: a typical scenario
       • Tableau for visualization
       • SPSS/R for modeling
       • Excel and SQL for data wrangling
  28. Time series analytics
  29. Large scale visualization
  30. Data workflows as dependency graphs? (diagram with nodes A, B, C, D, E, F)
  31. Data workflows as dependency graphs? (CHRONOS)
  32. Iterating on analysis
  33. Versioning and provenance
  34. Leveraging diverse skill sets
       • Within teams, different competencies
       • Work together on a data project: sharing code and data, tracking changes
  35. The elusive GitHub for Data Analysis?
  36. ...Google Docs for Data Analysis?
  37. Make an impact
       • Getting results into the hands of the people who need them
       • Getting models "into production"
  38. Some possible solutions
  39. Build more integrated tool environments
  40. (no text on slide)
  41. Enhance collaboration
  42. Accessible data science ...with training wheels
  43. One more thing
  44.  • http://datapad.io
       • Founded in 2013, located in SF
       • In private beta, join us!
       • Hiring for engineering
  45. Q&A time
  46. Thank you!
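
Slide 30 asks whether data workflows can be treated as dependency graphs. The sketch below is one possible illustration of that idea in Python, not something taken from the talk. The node names A to F come from the slide, but the edges between them are not recoverable from this transcript, so the ones used here are assumptions, as are the helper names `execution_order` and `run_workflow`. It relies only on the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependencies: each key maps to the set of steps it depends on.
# Node names A-F come from slide 30; the edges are assumed for illustration.
DEPENDENCIES = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"C"},
    "F": {"D", "E"},
}


def execution_order(dependencies):
    """Return one valid order in which every step runs after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, tasks):
    """Run each task in dependency order, passing it the results of its inputs."""
    results = {}
    for name in execution_order(dependencies):
        inputs = {dep: results[dep] for dep in sorted(dependencies[name])}
        results[name] = tasks[name](inputs)
    return results


if __name__ == "__main__":
    # Print one valid ordering; several orderings can satisfy the same graph.
    print(execution_order(DEPENDENCIES))
```

Each entry in `tasks` is a callable that receives a dict of its upstream results, so a step such as data preparation could hand its output to both a visualization step and a modeling step. `TopologicalSorter` raises `CycleError` if the assumed graph contains a cycle, which surfaces a mis-specified workflow before any step runs.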