Building Better Analytics Workflows (Strata-Hadoop World 2013)

Wes McKinney (twitter.com/wesmckinn, http://datapad.io) talk from Strata 2013 NYC

Transcript

  • 1. Building better analytics workflows Strata-Hadoop World 2013
  • 2. Wes McKinney @wesmckinn • Former quant @ AQR (a hedge fund) • Creator of the pandas project for Python • Author of Python for Data Analysis (O’Reilly) • Founder and CEO of DataPad • www.datapad.io
  • 3. • > 20k copies since Oct 2012 • Bringing many new people to Python and data analysis with code
  • 4. Why so many learning to program? • Increasing data scale • More and more data munging/integration • Need for Statistics and Predictive Analytics • Building complex data visualizations • Inadequacy of Excel or other UI-driven data tools
  • 5. The Analytics Workflow • Acquisition • Preparation • Visualization • Analysis • Sharing
  • 6. The Analytics Workflow
  • 7. What do we care about? • Minimize time to answer • Ask more questions • Reduce friction between tools and processes • Team productivity
  • 8. Data Tools for Humans (TM?)
  • 9. What can go wrong? • Inefficient workflows lead to lower quality analysis • Results may not be actionable in a reasonable time-frame
  • 10.
  • 11. Three types of problems • Tooling • Workflow management • Collaboration
  • 12. Big Notable Data Trends
  • 13. Data Preparation: an ongoing problem
  • 14.
  • 15. For programmers, luckily it’s not 2005 anymore • R: Hadley Wickham’s packages • Python: pandas • Hadoop: Pig
  • 16. Data preparation with visual tools • Google OpenRefine • Google Fusion Tables • Microsoft Excel • Data Wrangler
  • 17. Some new startups building data preparation tools
  • 18. Business Intelligence: essential for doing business
  • 19. BI macro-trends • Self Service BI • Visual Discovery • SQL on Hadoop
  • 20. It’s the heyday for BI startups
  • 21. Predictive Analytics is getting easier
  • 22. Some predictive analytics startups
  • 23. Perils of “data science in a box”
  • 24. Predictive analytics pitfalls • Signal vs. Noise • Identify the right patterns • Uncertain ROI
  • 25. Some analytics workflow problems still need work
  • 26. Friction between tools
  • 27. Friction between tools: a typical scenario • Tableau for visualization • SPSS/R for modeling • Excel and SQL for data wrangling
  • 28. Time series analytics
  • 29. Large scale visualization
  • 30. Data workflows as dependency graphs? (diagram with nodes A, B, C, D, E, F)
  • 31. Data workflows as dependency graphs? CHRONOS
  • 32. Iterating on analysis
  • 33. Versioning and provenance
  • 34. Leveraging diverse skill sets • Within teams, different competencies • Work together on a data project: sharing code and data, tracking changes
  • 35. The elusive GitHub for Data Analysis?
  • 36. ...Google Docs for Data Analysis?
  • 37. Make an impact • Getting results into the hands of people who need it • Getting models "into production"
  • 38. Some possible solutions
  • 39. Build more integrated tool environments
  • 40.
  • 41. Enhance collaboration
  • 42. Accessible data science ...with training wheels
  • 43. One more thing
  • 44. • http://datapad.io • Founded in 2013, located in SF • In private beta, join us! • Hiring for engineering
  • 45. Q&A time
  • 46. Thank you!
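
Slide 30 asks whether data workflows can be treated as dependency graphs. The following is a minimal sketch of that idea, not something shown in the talk. It uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+). The step names A to F come from the slide, but the edges between them are invented for illustration, because the transcript does not preserve the diagram's arrows. The helper names `run_order` and `downstream` are likewise hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each key maps to the set of steps it depends on.
# The node names A-F appear on slide 30; the edges are assumptions.
DEPENDENCIES = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"C"},
    "F": {"D", "E"},
}


def run_order(dependencies):
    """Return one valid order in which every step follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def downstream(dependencies, changed):
    """Return the steps that must be re-run when `changed` is modified."""
    stale = {changed}
    # Walking in topological order guarantees that a step's dependencies
    # have already been classified before the step itself is examined.
    for step in run_order(dependencies):
        if dependencies[step] & stale:
            stale.add(step)
    return stale


order = run_order(DEPENDENCIES)
print(order)
print(sorted(downstream(DEPENDENCIES, "D")))
```

Under these assumed edges, changing step D only invalidates D and F, so an iteration on one part of the analysis does not force the whole workflow to be recomputed.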