Building Better Analytics Workflows (Strata-Hadoop World 2013)
 

Wes McKinney (twitter.com/wesmckinn, http://datapad.io) talk from Strata 2013 NYC

    Presentation Transcript

    • Building better analytics workflows Strata-Hadoop World 2013
    • Wes McKinney @wesmckinn • Former quant @ AQR (a hedge fund) • Creator of Pandas project for Python • Author of Python for Data Analysis (O’Reilly) • Founder and CEO of DataPad • www.datapad.io
    • > 20k copies since Oct 2012 • Bringing many new people to Python and data analysis with code
    • Why so many learning to program? • Increasing data scale • More and more data munging/integration • Need for Statistics and Predictive Analytics • Building complex data visualizations • Inadequacy of Excel or other UI-driven data tools
    • The Analytics Workflow • Acquisition • Preparation • Visualization • Analysis • Sharing
    • The Analytics Workflow
    • What do we care about? • Minimize time to answer • Ask more questions • Reduce friction between tools and processes • Team productivity
    • Data Tools for Humans (TM?)
    • What can go wrong? • Inefficient workflows lead to lower quality analysis • Results may not be actionable in a reasonable time-frame
    • Three types of problems • Tooling • Workflow management • Collaboration
    • Big Notable Data Trends
    • Data Preparation: an ongoing problem
    • For programmers, luckily it’s not 2005 anymore • R: Hadley Wickham’s packages • Python: pandas • Hadoop: Pig
    • Data preparation with visual tools • Google OpenRefine • Google Fusion Tables • Microsoft Excel • Data Wrangler
    • Some new startups building data preparation tools
    • Business Intelligence: essential for doing business
    • BI macro-trends • Self Service BI • Visual Discovery • SQL on Hadoop
    • It’s the heyday for BI startups
    • Predictive Analytics is getting easier
    • Some predictive analytics startups
    • Perils of “data science in a box”
    • Predictive analytics pitfalls • Signal vs. Noise • Identify the right patterns • Uncertain ROI
    • Some analytics workflow problems still need work
    • Friction between tools
    • Friction between tools: a typical scenario • Tableau for visualization • SPSS/R for modeling • Excel and SQL for data wrangling
    • Time series analytics
    • Large scale visualization
    • Data workflows as dependency graphs? (Diagram: a dependency graph with nodes A, B, C, D, E, F)
    • Data workflows as dependency graphs? • CHRONOS
    • Iterating on analysis
    • Versioning and provenance
    • Leveraging diverse skill sets • Within teams, different competencies • Work together on a data project: sharing code and data, tracking changes
    • The elusive GitHub for Data Analysis?
    • ...Google Docs for Data Analysis?
    • Make an impact • Getting results into the hands of people who need it • Getting models "into production"
    • Some possible solutions
    • Build more integrated tool environments
    • Enhance collaboration
    • Accessible data science ...with training wheels
    • One more thing
    • http://datapad.io • Founded in 2013, located in SF • In private beta, join us! • Hiring for engineering
    • Q&A time
    • Thank you!
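
    A note on the "Data workflows as dependency graphs?" slides: the idea can be sketched in a few lines of Python. This is only an illustration, not anything shown in the talk. The slide names nodes A through F, but the transcript does not preserve the arrows, so the edges below are an assumption, and the helper names `run_order` and `stale_after` are hypothetical. The sketch uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Each key is a workflow step; its value is the set of steps it depends on.
# ASSUMPTION: this wiring is invented for illustration. The slide only
# shows nodes A-F, and the transcript does not record the edges.
dependencies = {
    "A": set(),        # e.g. a raw data source
    "B": set(),        # e.g. a second raw data source
    "C": {"A", "B"},   # e.g. a join of A and B
    "D": {"C"},
    "E": {"C"},
    "F": {"D", "E"},   # e.g. a final report
}


def run_order(deps):
    """Return one valid execution order: every step after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def stale_after(deps, changed):
    """Return the changed step plus every step downstream of it."""
    stale = {changed}
    # static_order yields dependencies before dependents, so a single pass
    # is enough to propagate staleness through the whole graph.
    for step in TopologicalSorter(deps).static_order():
        if deps[step] & stale:
            stale.add(step)
    return stale


order = run_order(dependencies)
print("run order:", order)
print("rerun after D changes:", sorted(stale_after(dependencies, "D")))
```

    Treating a workflow this way also touches the "Iterating on analysis" slide: when one step changes, only the steps downstream of it need to be rerun, which is what `stale_after` computes under the assumed edges.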