O'Reilly Strata: Distilling Data Exhaust

Talk from the first O'Reilly Strata, Feb 2011. Learn how to leverage data exhaust, the digital byproduct of our online activities, to solve problems and discover insights about the world around you. We will walk through a real-world example that combines several datasets and statistical techniques to discover insights and make predictions about attendees at O'Reilly Strata.

Includes a preview of some of the technology behind LinkedIn Skills, which I launched in a Keynote with DJ Patil the following day.

Video: http://blip.tv/oreilly-promos/distilling-data-exhaust-4780870

O'Reilly Strata: Distilling Data Exhaust Presentation Transcript

  • 1. Distilling Data Exhaust: How to Surface Insights & Build Data Products. Feb 2, 2011. Peter Skomoroch, LinkedIn, @peteskomoroch
  • 2. What is Data Exhaust?
  • 3. What is Data Exhaust? My Delicious Tags
  • 4. What is Data Exhaust? Words I use on Twitter
  • 5. What can you do with it?
      • Data has value
      • I’ll share some lessons I’ve learned about how to extract that value
      • We’ll go through a case study
  • 6. Part 1) 10 Lessons Learned
  • 7. 1) Choose a meaningful problem http://www.flickr.com/photos/aloshbennett/
      • Find pain points
      • Work on stuff that matters
      • Look for underutilized data
  • 8. 2) Find or collect relevant data
      • DataWrangling
      • InfoChimps
      • Pete Warden
      • Factual, SimpleGeo
      • Mechanical Turk
  • 9. 3) Raw is better than processed
      • Normalization could be incorrect
      • Data might be lost or corrupted
      • Good approach: public.resource.org
    http://www.flickr.com/photos/nedraggett/347280918/
  • 10. 4) Guide user input when you can
      • Auto suggest
      • Validate inputs
      • Collect tags, votes
      • Makes data scrubbing easier
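
As a rough illustration of the "auto suggest" and "validate inputs" ideas on this slide, here is a minimal Python sketch. The vocabulary, function names, and matching rules are all assumptions made for the example; the talk does not describe how its own suggestion or validation logic works.

```python
from bisect import bisect_left

# Hypothetical vocabulary of known terms (not taken from the talk).
KNOWN_SKILLS = sorted(["hadoop", "machine learning", "numpy", "pig", "python"])

def suggest(prefix, vocabulary=KNOWN_SKILLS, limit=5):
    """Return up to `limit` entries of the sorted vocabulary starting with `prefix`."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    # Binary search finds the first entry that could match the prefix.
    start = bisect_left(vocabulary, prefix)
    matches = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches

def validate(value, vocabulary=KNOWN_SKILLS):
    """Normalize case and whitespace, then report whether the term is known."""
    normalized = " ".join(value.lower().split())
    return normalized, normalized in vocabulary
```

Steering users toward a fixed vocabulary at input time means fewer spelling variants to reconcile later, which is the "makes data scrubbing easier" point.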
  • 11. 5) Solve easier problems first http://where2conf.com/where2010/public/schedule/detail/12400
  • 12. 6) Build a baseline model quickly
      • Iterate rapidly after baseline is done
      • Measure accuracy on hold out test set
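
One way to read this slide in code: split off a hold-out set first, fit the simplest possible model, and record its accuracy as the number to beat. The sketch below uses a majority-class predictor as the baseline; that choice, the toy data, and all function names are assumptions for illustration, not the model used in the talk.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split off a hold-out test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_baseline(train_rows):
    """Baseline 'model': always predict the most common training label."""
    labels = Counter(label for _, label in train_rows)
    return labels.most_common(1)[0][0]

def accuracy(predicted_label, test_rows):
    """Fraction of hold-out rows whose label equals the constant prediction."""
    hits = sum(1 for _, label in test_rows if label == predicted_label)
    return hits / len(test_rows)

# Made-up toy data: (features, label) pairs.
rows = [({"skills": i}, "data" if i % 4 else "other") for i in range(40)]
train, test = train_test_split(rows)
baseline = fit_majority_baseline(train)
print(baseline, accuracy(baseline, test))
```

Any later model can be scored with the same split, so each iteration is compared against the baseline on data it never saw during training.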
  • 13. 7) Test code on sample data: build logical sample data
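
A small sketch of what "logical sample data" might look like in practice: a handful of hand-built inputs, each chosen to exercise one case, paired with the output we expect. The `normalize_company` function and the sample values are hypothetical and only serve to show the pattern.

```python
def normalize_company(name):
    """Collapse case, whitespace, punctuation, and a few common suffixes."""
    cleaned = " ".join(name.lower().replace(",", " ").replace(".", " ").split())
    for suffix in (" inc", " corp", " llc"):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
    return cleaned

# Hand-built sample: each entry targets one case (plain name, messy
# punctuation and suffix, apostrophe, empty string).
SAMPLE = {
    "LinkedIn": "linkedin",
    "  LinkedIn, Inc. ": "linkedin",
    "O'Reilly Media": "o'reilly media",
    "": "",
}

def run_sample_checks():
    """Run the function over the sample and return the number of cases checked."""
    for raw, expected in SAMPLE.items():
        actual = normalize_company(raw)
        assert actual == expected, (raw, actual, expected)
    return len(SAMPLE)
```

Running checks like these on a tiny sample is much faster than discovering the same bug partway through a large batch job.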
  • 14. 8) Use Continuous Integration
  • 15. 8) Use Continuous Integration https://github.com/matthayes/azkaban
  • 16. 9) Pick the right tool for the job
  • 17. 10) Developer productivity is key
      • Fast iterations: Python, Ruby, Pig
      • Convention over configuration
      • Embrace GitHub, DevOps, & EC2
      • Currently using JRuby & Sinatra
  • 18. SNA Team: sna-projects.com
  • 19. Part 2) Case Study: Strata
  • 20. Conference Insights
      • I’d like to understand the audience at Strata
      • What companies do we work for?
      • What are the top skills at Strata?
      • Do attendees cluster together based on skill?
  • 21. Round up a Data Viz team
  • 22. Use the right tools
      • Data Crunching: Hadoop, Pig
      • Statistical Work: Python, NumPy
      • Visualization: Gephi
  • 23. Find Some Data: Attendees
  • 24. Add LinkedIn data
  • 25. Extract Skills from Profiles What are skills? Extract
  • 26. Build Hadoop Skill Graph Discover Core Talent Graph for “Hadoop” Igor Perisic
  • 27. The Talent Graph
  • 28. We can combine skills with the attendee directory to better understand Strata
  • 29. What are skills @Strata?
  • 30. Extract skills for attendees
  • 31. Top Skills @Strata
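
A top-skills list like the one on this slide could be produced by counting how many attendees list each skill. The sketch below assumes skills have already been extracted into a mapping from attendee to skill names; the attendee IDs and skills are made up, and the talk ran this kind of step at scale with Hadoop and Pig rather than in-memory Python.

```python
from collections import Counter

# Made-up attendee -> skills mapping; real input would come from profiles.
attendee_skills = {
    "attendee_1": ["hadoop", "python", "statistics"],
    "attendee_2": ["python", "visualization"],
    "attendee_3": ["hadoop", "python"],
}

def top_skills(skills_by_attendee, n=10):
    """Count how many attendees list each skill and return the top n."""
    counts = Counter()
    for skills in skills_by_attendee.values():
        counts.update(set(skills))  # count each skill once per attendee
    # Sort by count (descending), then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

Raw counts tend to be dominated by skills that are common everywhere, which is the "information overload" problem the next slides address with relevance measures.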
  • 32. Information Overload
  • 33. Relevance Measures: Jaccard Similarity, TF-IDF
  • 34. Relevant Skills @Strata
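
The two relevance measures named above, Jaccard similarity and TF-IDF, can be sketched in a few lines of Python. The slides do not say exactly how they were applied, so the framing here is an assumption: the conference attendees are treated as one group, and each skill is scored by how often it appears in the group (tf) weighted by how rare it is in a background population (idf, with add-one smoothing). All data is invented.

```python
import math

def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def tf_idf(group_profiles, background_profiles):
    """Score each skill by group frequency times log inverse background frequency."""
    n_background = len(background_profiles)
    scores = {}
    for skill in set().union(*group_profiles):
        tf = sum(skill in p for p in group_profiles) / len(group_profiles)
        df = sum(skill in p for p in background_profiles)
        idf = math.log((1 + n_background) / (1 + df))  # smoothed, avoids /0
        scores[skill] = tf * idf
    return scores

# Invented profiles: each is a set of skill names.
background = [
    {"management"},
    {"management", "excel"},
    {"management", "python"},
    {"hadoop", "python"},
]
group = [{"hadoop", "python"}, {"hadoop", "management"}]

scores = tf_idf(group, background)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy data "management" is as frequent in the group as "python" but far more common in the background, so it ranks lower; that is the effect that separates relevant skills from merely popular ones.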
  • 35. Do attendees cluster together based on skills?
  • 36.
      • Compute similarity of attendees based on skill vector distance
      • Cluster similarities in Gephi
  • 37. More analysis on the way
      • DJ Patil has a session tomorrow
      • We’ll blog about additional Strata insights soon
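
A possible shape for this step: represent each attendee as a sparse skill vector, compute pairwise similarity, and write the pairs above a threshold as an edge list that a graph tool such as Gephi can load. The slide says only "skill vector distance"; cosine similarity, the 0.3 threshold, the attendee names, and the `Source,Target,Weight` column layout are assumptions of this sketch.

```python
import csv
import io
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {skill: weight}."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_edges(vectors, threshold=0.3):
    """Return (source, target, weight) for attendee pairs at or above `threshold`."""
    edges = []
    for name_a, name_b in combinations(sorted(vectors), 2):
        weight = cosine_similarity(vectors[name_a], vectors[name_b])
        if weight >= threshold:
            edges.append((name_a, name_b, round(weight, 4)))
    return edges

def edges_to_csv(edges):
    """Serialize edges as CSV text with Source,Target,Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()

# Invented attendee skill vectors.
vectors = {
    "ann": {"hadoop": 1, "pig": 1},
    "bob": {"hadoop": 1, "python": 1},
    "cat": {"design": 1},
}
edges = similarity_edges(vectors)
print(edges_to_csv(edges))
```

Comparing every pair is quadratic in the number of attendees, which is fine for a conference-sized directory; thresholding keeps the resulting graph sparse enough to lay out and cluster visually.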
  • 38. Questions? Peter Skomoroch, LinkedIn, @peteskomoroch, http://linkedin.com/in/peterskomoroch, Blog: DataWrangling.com
  • 39. Appendix