Building Better Analytics Workflows (Strata-Hadoop World 2013)

Wes McKinney (twitter.com/wesmckinn, http://datapad.io) talk from Strata 2013 NYC

Published in Technology, Education

Transcript

  • 1. Building better analytics workflows Strata-Hadoop World 2013
  • 2. Wes McKinney @wesmckinn • Former quant @ AQR (a hedge fund) • Creator of pandas project for Python • Author of Python for Data Analysis (O’Reilly) • Founder and CEO of DataPad • www.datapad.io
  • 3. • > 20k copies since Oct 2012 • Bringing many new people to Python and data analysis with code
  • 4. Why so many learning to program? • Increasing data scale • More and more data munging/integration • Need for Statistics and Predictive Analytics • Building complex data visualizations • Inadequacy of Excel or other UI-driven data tools
  • 5. The Analytics Workflow • Acquisition • Preparation • Visualization • Analysis • Sharing
  • 6. The Analytics Workflow
  • 7. What do we care about? • Minimize time to answer • Ask more questions • Reduce friction between tools and processes • Team productivity
  • 8. Data Tools for Humans (TM?)
  • 9. What can go wrong? • Inefficient workflows lead to lower quality analysis • Results may not be actionable in a reasonable time-frame
  • 10.
  • 11. Three types of problems • Tooling • Workflow management • Collaboration
  • 12. Big Notable Data Trends
  • 13. Data Preparation: an ongoing problem
  • 14.
  • 15. For programmers, luckily it’s not 2005 anymore • R: Hadley Wickham’s packages • Python: pandas • Hadoop: Pig
  • 16. Data preparation with visual tools • Google OpenRefine • Google Fusion Tables • Microsoft Excel • Data Wrangler
  • 17. Some new startups building data preparation tools
  • 18. Business Intelligence: essential for doing business
  • 19. BI macro-trends • Self Service BI • Visual Discovery • SQL on Hadoop
  • 20. It’s the heyday for BI startups
  • 21. Predictive Analytics is getting easier
  • 22. Some predictive analytics startups
  • 23. Perils of “data science in a box”
  • 24. Predictive analytics pitfalls • Signal vs. Noise • Identify the right patterns • Uncertain ROI
  • 25. Some analytics workflow problems still need work
  • 26. Friction between tools
  • 27. Friction between tools: a typical scenario • Tableau for visualization • SPSS/R for modeling • Excel and SQL for data wrangling
  • 28. Time series analytics
  • 29. Large scale visualization
  • 30. Data workflows as dependency graphs? (diagram with nodes A, B, C, D, E, F)
  • 31. Data workflows as dependency graphs? CHRONOS
  • 32. Iterating on analysis
  • 33. Versioning and provenance
  • 34. Leveraging diverse skill sets • Within teams, different competencies • Work together on a data project: sharing code and data, tracking changes
  • 35. The elusive GitHub for Data Analysis?
  • 36. ...Google Docs for Data Analysis?
  • 37. Make an impact • Getting results into the hands of people who need them • Getting models "into production"
  • 38. Some possible solutions
  • 39. Build more integrated tool environments
  • 40.
  • 41. Enhance collaboration
  • 42. Accessible data science ...with training wheels
  • 43. One more thing
  • 44. • http://datapad.io • Founded in 2013, located in SF • In private beta, join us! • Hiring for engineering
  • 45. Q&A time
  • 46. Thank you!
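
Slides 30 and 31 ask whether data workflows can be treated as dependency graphs. The idea can be sketched in a few lines of Python using the standard library's graphlib module (Python 3.9+). This is only an illustration, not anything shown in the talk: the node names A to F come from the slide, but the edges between them did not survive in the transcript, so the dependencies below are hypothetical, as are the helper names execution_order and downstream_of.

```python
from graphlib import TopologicalSorter

# Hypothetical edges: the slide's arrows were lost in extraction, so this
# mapping of step -> set of prerequisite steps is an assumption.
dependencies = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"C"},
    "F": {"D", "E"},
}


def execution_order(deps):
    """Return one valid order in which every step follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


def downstream_of(deps, changed):
    """Return the set of steps to re-run when `changed` is modified."""
    stale = {changed}
    # Walking in topological order guarantees that a step's prerequisites
    # have already been classified before the step itself is inspected.
    for step in execution_order(deps):
        if deps[step] & stale:
            stale.add(step)
    return stale


order = execution_order(dependencies)
print(order)                             # one valid ordering of A..F
print(downstream_of(dependencies, "C"))  # steps affected by a change to C
```

Under these assumed edges, a change to step C invalidates C, D, E and F but leaves A and B alone, which is the kind of bookkeeping that makes iterating on an analysis (slide 32) cheaper than re-running everything.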