Building Better Analytics Workflows (Strata-Hadoop World 2013)

Wes McKinney (twitter.com/wesmckinn, http://datapad.io) talk from Strata 2013 NYC

Presentation Transcript

  • Building better analytics workflows (Strata-Hadoop World 2013)
  • Wes McKinney (@wesmckinn), www.datapad.io
      • Former quant @ AQR (a hedge fund)
      • Creator of Pandas project for Python
      • Author of Python for Data Analysis (O’Reilly)
      • Founder and CEO of DataPad
  • > 20k copies since Oct 2012
      • Bringing many new people to Python and data analysis with code
  • Why so many learning to program?
      • Increasing data scale
      • More and more data munging/integration
      • Need for Statistics and Predictive Analytics
      • Building complex data visualizations
      • Inadequacy of Excel or other UI-driven data tools
  • The Analytics Workflow
      • Acquisition
      • Preparation
      • Visualization
      • Analysis
      • Sharing
  • What do we care about?
      • Minimize time to answer
      • Ask more questions
      • Reduce friction between tools and processes
      • Team productivity
  • Data Tools for Humans (TM?)
  • What can go wrong?
      • Inefficient workflows lead to lower quality analysis
      • Results may not be actionable in a reasonable time-frame
  • Three types of problems
      • Tooling
      • Workflow management
      • Collaboration
  • Big Notable Data Trends
  • Data Preparation: an ongoing problem
  • For programmers, luckily it’s not 2005 anymore
      • R: Hadley Wickham’s packages
      • Python: pandas
      • Hadoop: Pig
  • Data preparation with visual tools
      • Google OpenRefine
      • Google Fusion Tables
      • Microsoft Excel
      • Data Wrangler
  • Some new startups building data preparation tools
  • Business Intelligence: essential for doing business
  • BI macro-trends
      • Self Service BI
      • Visual Discovery
      • SQL on Hadoop
  • It’s the heyday for BI startups
  • Predictive Analytics is getting easier
  • Some predictive analytics startups
  • Perils of “data science in a box”
  • Predictive analytics pitfalls
      • Signal vs. Noise
      • Identify the right patterns
      • Uncertain ROI
  • Some analytics workflow problems still need work
  • Friction between tools
  • Friction between tools: a typical scenario
      • Tableau for visualization
      • SPSS/R for modeling
      • Excel and SQL for data wrangling
  • Time series analytics
  • Large scale visualization
  • Data workflows as dependency graphs? [Figure: dependency graph with nodes A, B, C, D, E, F]
  • Data workflows as dependency graphs? [Figure: CHRONOS]
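
The dependency-graph idea on the two slides above can be sketched in a few lines of Python. This is only an illustration and was not shown in the talk: the step names A through F come from the slide's figure, but the edges between them are invented here, and the helper names `run_order` and `downstream` are hypothetical. The sketch uses the standard-library `graphlib` module (Python 3.9+) to derive a valid execution order and to work out which steps go stale when one step changes.

```python
from graphlib import TopologicalSorter

# Each key maps a workflow step to the set of steps it depends on.
# Step names A-F are taken from the slide; the edges are assumptions
# made up for this example.
dependencies = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"C"},
    "E": {"C"},
    "F": {"D", "E"},
}


def run_order(deps):
    """Return one valid execution order (dependencies before dependents)."""
    return list(TopologicalSorter(deps).static_order())


def downstream(deps, changed):
    """Return the changed step plus every step that depends on it,
    directly or indirectly, and therefore needs to be re-run."""
    stale = {changed}
    # Walking in topological order guarantees that a step's dependencies
    # have already been classified before the step itself is checked.
    for step in run_order(deps):
        if deps[step] & stale:
            stale.add(step)
    return stale


if __name__ == "__main__":
    print(run_order(dependencies))
    print(sorted(downstream(dependencies, "D")))  # ['D', 'F']
```

With the workflow written down as a graph, a change to one step only forces a re-run of the steps downstream of it; in this made-up graph, changing D leaves A, B, C and E untouched.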
  • Iterating on analysis
  • Versioning and provenance
  • Leveraging diverse skill sets
      • Within teams, different competencies
      • Work together on a data project: sharing code and data, tracking changes
  • The elusive GitHub for Data Analysis?
  • ...Google Docs for Data Analysis?
  • Make an impact
      • Getting results into the hands of people who need them
      • Getting models "into production"
  • Some possible solutions
  • Build more integrated tool environments
  • Enhance collaboration
  • Accessible data science ...with training wheels
  • One more thing
  • http://datapad.io
      • Founded in 2013, located in SF
      • In private beta, join us!
      • Hiring for engineering
  • Q&A time
  • Thank you!