Building better analytics workflows
Strata-Hadoop World 2013
Wes McKinney
@wesmckinn
• Former quant @ AQR (a hedge fund)
• Creator of Pandas project for Python
• Author of Python for Data Analysis (O’Reilly)
• Founder and CEO of DataPad

• > 20k copies since Oct 2012
• Bringing many new people to Python and data analysis with code

www.datapad.io
Why so many learning to program?

• Increasing data scale
• More and more data munging/integration
• Need for Statistics and Predictive Analytics
• Building complex data visualizations
• Inadequacy of Excel or other UI-driven data tools

The Analytics Workflow

Acquisition
Preparation
Visualization
Analysis
Sharing

The Analytics Workflow

What do we care about?

• Minimize time to answer
• Ask more questions
• Reduce friction between tools and processes
• Team productivity

Data Tools for Humans (TM?)

What can go wrong?

• Inefficient workflows lead to lower quality analysis
• Results may not be actionable in a reasonable time-frame

Three types of problems

• Tooling
• Workflow management
• Collaboration

Big Notable Data Trends

Data Preparation: an ongoing problem

For programmers, luckily it’s not 2005 anymore

• R: Hadley Wickham’s packages
• Python: pandas
• Hadoop: Pig

Data preparation with visual tools

• Google OpenRefine
• Google Fusion Tables
• Microsoft Excel
• Data Wrangler

Some new startups building data preparation tools

Business Intelligence: essential for doing business

BI macro-trends

• Self Service BI
• Visual Discovery
• SQL on Hadoop

It’s the heyday for BI startups

Predictive Analytics is getting easier

Some predictive analytics startups

Perils of “data science in a box”

Predictive analytics pitfalls

• Signal vs. Noise
• Identify the right patterns
• Uncertain ROI

Some analytics workflow problems still need work

Friction between tools

Friction between tools: a typical scenario

• Tableau for visualization
• SPSS/R for modeling
• Excel and SQL for data wrangling

Time series analytics

Large scale visualization

Data workflows as dependency graphs?

[Diagram: dependency graph with nodes A, B, C, D, E, F]

Data workflows as dependency graphs?

CHRONOS
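
To make the dependency-graph idea concrete, here is a minimal Python sketch. The slides only show nodes labeled A through F, so the edges below are invented purely for illustration, and the helper names `execution_order` and `run_workflow` are hypothetical. It uses the standard library's `graphlib` (Python 3.9+) and does not reflect how Chronos or any particular scheduler works.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each key maps to the steps that must finish first.
# The slide only shows nodes A-F; these edges are assumptions for illustration.
DEPENDENCIES = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"C"},
    "F": {"D", "E"},
}


def execution_order(dependencies):
    """Return one valid order in which the workflow steps can run."""
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, tasks):
    """Run each task after all of its dependencies; return results by step name."""
    results = {}
    for step in execution_order(dependencies):
        # Hand each task the outputs of the steps it depends on.
        inputs = {dep: results[dep] for dep in dependencies[step]}
        results[step] = tasks[step](inputs)
    return results


def make_task(name):
    """Build a toy task that records which upstream steps it consumed."""
    return lambda inputs: f"{name}({', '.join(sorted(inputs))})"


if __name__ == "__main__":
    tasks = {name: make_task(name) for name in DEPENDENCIES}
    for step, output in run_workflow(DEPENDENCIES, tasks).items():
        print(step, "->", output)
```

Expressing a workflow this way means a step can be re-run only when something upstream of it changes, and independent steps (here A and B, or D and E) could in principle run in parallel.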

Iterating on analysis

Versioning and provenance

Leveraging diverse skill sets

• Within teams, different competencies
• Work together on a data project sharing code, data, tracking changes

The elusive GitHub for Data Analysis?

...Google Docs for Data Analysis?

Make an impact

• Getting results into the hands of people who need them
• Getting models "into production"

Some possible solutions

Build more integrated tool environments

Enhance collaboration

Accessible data science ...with training wheels

One more thing

• http://datapad.io
• Founded in 2013, located in SF
• In private beta, join us!
• Hiring for engineering

Q&A time

Thank you!

Building Better Analytics Workflows (Strata-Hadoop World 2013)


Wes McKinney (twitter.com/wesmckinn, http://datapad.io) talk from Strata 2013 NYC

Published in: Technology, Education

Building Better Analytics Workflows (Strata-Hadoop World 2013)

  1. Building better analytics workflows (Strata-Hadoop World 2013)
  2. Wes McKinney, @wesmckinn • Former quant @ AQR (a hedge fund) • Creator of Pandas project for Python • Author of Python for Data Analysis (O’Reilly) • Founder and CEO of DataPad
  3. • > 20k copies since Oct 2012 • Bringing many new people to Python and data analysis with code
  4. Why so many learning to program? • Increasing data scale • More and more data munging/integration • Need for Statistics and Predictive Analytics • Building complex data visualizations • Inadequacy of Excel or other UI-driven data tools
  5. The Analytics Workflow: Acquisition, Preparation, Visualization, Analysis, Sharing
  6. The Analytics Workflow
  7. What do we care about? • Minimize time to answer • Ask more questions • Reduce friction between tools and processes • Team productivity
  8. Data Tools for Humans (TM?)
  9. What can go wrong? • Inefficient workflows lead to lower quality analysis • Results may not be actionable in a reasonable time-frame
  10. (no text)
  11. Three types of problems • Tooling • Workflow management • Collaboration
  12. Big Notable Data Trends
  13. Data Preparation: an ongoing problem
  14. (no text)
  15. For programmers, luckily it’s not 2005 anymore • R: Hadley Wickham’s packages • Python: pandas • Hadoop: Pig
  16. Data preparation with visual tools • Google OpenRefine • Google Fusion Tables • Microsoft Excel • Data Wrangler
  17. Some new startups building data preparation tools
  18. Business Intelligence: essential for doing business
  19. BI macro-trends • Self Service BI • Visual Discovery • SQL on Hadoop
  20. It’s the heyday for BI startups
  21. Predictive Analytics is getting easier
  22. Some predictive analytics startups
  23. Perils of “data science in a box”
  24. Predictive analytics pitfalls • Signal vs. Noise • Identify the right patterns • Uncertain ROI
  25. Some analytics workflow problems still need work
  26. Friction between tools
  27. Friction between tools: a typical scenario • Tableau for visualization • SPSS/R for modeling • Excel and SQL for data wrangling
  28. Time series analytics
  29. Large scale visualization
  30. Data workflows as dependency graphs? [Diagram: dependency graph with nodes A, B, C, D, E, F]
  31. Data workflows as dependency graphs? CHRONOS
  32. Iterating on analysis
  33. Versioning and provenance
  34. Leveraging diverse skill sets • Within teams, different competencies • Work together on a data project sharing code, data, tracking changes
  35. The elusive GitHub for Data Analysis?
  36. ...Google Docs for Data Analysis?
  37. Make an impact • Getting results into the hands of people who need them • Getting models "into production"
  38. Some possible solutions
  39. Build more integrated tool environments
  40. (no text)
  41. Enhance collaboration
  42. Accessible data science ...with training wheels
  43. One more thing
  44. • http://datapad.io • Founded in 2013, located in SF • In private beta, join us! • Hiring for engineering
  45. Q&A time
  46. Thank you!