Distilling Data Exhaust
How to Surface Insights & Build Data Products
Feb 2, 2011
Peter Skomoroch
LinkedIn
@peteskomoroch
What is Data Exhaust?
What is Data Exhaust?
My Delicious Tags
What is Data Exhaust?
Words I use on Twitter
What can you do with it?
•Data has value
•I’ll share some lessons I’ve learned about how to extract that value
•We’ll go through a case study
Part 1) 10 Lessons Learned
1) Choose a meaningful problem
http://www.flickr.com/photos/aloshbennett/
•Find pain points
•Work on stuff that matters
•Look for underutilized data
2) Find or collect relevant data
•DataWrangling
•InfoChimps
•Pete Warden
•Factual, SimpleGeo
•Mechanical Turk
3) Raw is better than processed
•Normalization could be incorrect
•Data might be lost or corrupted
•Good approach: public.resource.org
http://www.flickr.com/photos/nedraggett/347280918/
4) Guide user input when you can
•Auto suggest
•Validate inputs
•Collect tags, votes
•Makes data scrubbing easier
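A minimal sketch of what guided input could look like, assuming a small canonical vocabulary. The `KNOWN_SKILLS` list and both function names are hypothetical and only illustrate the idea: suggesting canonical spellings as the user types and validating what they submit means less scrubbing later.

```python
from typing import Optional

# Hypothetical canonical vocabulary; a real product would load this from its own data.
KNOWN_SKILLS = ["Hadoop", "Pig", "Python", "NumPy", "Gephi", "Ruby"]


def suggest(prefix: str, vocabulary=KNOWN_SKILLS, limit: int = 5) -> list:
    """Return up to `limit` vocabulary entries starting with `prefix` (case-insensitive)."""
    p = prefix.strip().lower()
    if not p:
        return []
    return [term for term in vocabulary if term.lower().startswith(p)][:limit]


def validate(raw: str, vocabulary=KNOWN_SKILLS) -> Optional[str]:
    """Map free-text input to its canonical spelling, or None if it is not recognized."""
    lookup = {term.lower(): term for term in vocabulary}
    return lookup.get(raw.strip().lower())
```

For example, `suggest("p")` offers "Pig" and "Python", and `validate("  hadoop ")` stores the canonical "Hadoop" instead of the raw string.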
5) Solve easier problems first
http://where2conf.com/where2010/public/schedule/detail/12400
6) Build a baseline model quickly
•Iterate rapidly after baseline is done
•Measure accuracy on hold out test set
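One way to get a baseline quickly, sketched under assumptions: the data below is made up, and the "model" is a majority-class predictor, which is a stand-in rather than anything shown in the talk. The point is to have a held-out accuracy number that later iterations have to beat.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out `test_fraction` of them for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = len(rows) - int(len(rows) * test_fraction)
    return rows[:cut], rows[cut:]


def fit_majority_baseline(train):
    """'Train' the simplest possible model: always predict the most common label."""
    most_common_label, _ = Counter(label for _, label in train).most_common(1)[0]
    return lambda features: most_common_label


def accuracy(model, test):
    """Fraction of held-out examples the model labels correctly."""
    if not test:
        return 0.0
    return sum(model(x) == y for x, y in test) / len(test)


# Made-up (features, label) pairs, purely for illustration.
data = [({"n": i}, "spam" if i % 4 == 0 else "ham") for i in range(100)]
train, test = train_test_split(data)
baseline = fit_majority_baseline(train)
print(f"baseline accuracy: {accuracy(baseline, test):.2f}")
```

Any later model can be passed to the same `accuracy` function on the same `test` split, so improvements are measured against a fixed reference.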
7) Test code on sample data
build logical sample data
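A small sketch of the idea, with hypothetical names throughout: hand-build a sample that is small enough to verify by eye but still covers the awkward cases, then check the code against it before running on the full dataset.

```python
from collections import Counter


def count_companies(records):
    """Count attendees per company, ignoring blank names and case/whitespace differences."""
    counts = Counter()
    for record in records:
        company = (record.get("company") or "").strip().lower()
        if company:
            counts[company] += 1
    return counts


# Hand-built sample data covering the cases that tend to break real jobs.
SAMPLE = [
    {"name": "a", "company": "LinkedIn"},
    {"name": "b", "company": " linkedin "},  # stray whitespace, different case
    {"name": "c", "company": ""},            # blank value
    {"name": "d"},                           # missing field
    {"name": "e", "company": "O'Reilly"},
]

# The expected answer is known in advance because the sample was built by hand.
assert count_companies(SAMPLE) == {"linkedin": 2, "o'reilly": 1}
```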
8) Use Continuous Integration
8) Use Continuous Integration
https://github.com/matthayes/azkaban
9) Pick the right tool for the job
10) Developer productivity is key
•Fast Iterations: Python, Ruby, Pig
•Convention over configuration
•Embrace Github, DevOps, & EC2
•Currently using JRuby & Sinatra
SNA Team: sna-projects.com
Part 2) Case Study: Strata
Conference Insights
•I’d like to understand the audience at Strata
•What companies do we work for?
•What are the top skills at Strata?
•Do attendees cluster together based on skill?
Round up a Data Viz team
Use the right tools
•Data Crunching: Hadoop, Pig
•Statistical Work: Python, NumPy
•Visualization: Gephi
Find Some Data: Attendees
Add LinkedIn data
Extract Skills from Profiles
What are skills?
Extract
Build Hadoop Skill Graph
Discover
Core Talent Graph for “Hadoop”
Igor Perisic
The Talent Graph
We can combine skills with the attendee directory to better understand Strata
What are skills @Strata?
Extract skills for attendees
Top Skills @Strata
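Once each attendee has a list of skills, a "top skills" ranking is a frequency count. The sketch below assumes a hypothetical `attendee_skills` mapping with made-up entries; it is not the pipeline used for the slides, which ran on Hadoop and Pig.

```python
from collections import Counter


def top_skills(attendee_skills, n=3):
    """Rank skills by how many attendees list them (each attendee counted once per skill)."""
    counts = Counter()
    for skills in attendee_skills.values():
        counts.update(set(skills))
    # Sort by count descending, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Made-up attendees and skills, for illustration only.
attendee_skills = {
    "attendee_1": ["Hadoop", "Python", "Machine Learning"],
    "attendee_2": ["Hadoop", "Python"],
    "attendee_3": ["Hadoop", "Visualization"],
}

print(top_skills(attendee_skills, n=2))
```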
Information Overload
Relevance Measures
Jaccard Similarity
TFIDF
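Both measures can be sketched in a few lines. The version below is one common formulation, not necessarily the one behind the slides: Jaccard similarity as intersection over union of two sets, and a TF-IDF style score that weights how often a skill appears in a group by how rare it is in a background population. The function names and the add-one smoothing are assumptions.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tfidf_scores(group_docs, background_docs):
    """Score each term by its frequency in the group times its log inverse
    document frequency in the background corpus."""
    tf = Counter(term for doc in group_docs for term in set(doc))
    df = Counter(term for doc in background_docs for term in set(doc))
    n = len(background_docs)
    scores = {}
    for term, count in tf.items():
        # Add-one smoothing avoids division by zero for terms missing from the background.
        idf = math.log((1 + n) / (1 + df[term]))
        scores[term] = count * idf
    return scores


# Made-up data: a skill everyone lists scores zero, a distinctive one stands out.
group = [["Hadoop", "Microsoft Office"], ["Hadoop", "Microsoft Office"]]
background = group + [["Microsoft Office"]] * 8
print(tfidf_scores(group, background))
```

This is why a relevance measure helps with information overload: a skill that nearly everyone lists carries little signal about a particular audience, while a skill that is common in the group and rare elsewhere rises to the top.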
Relevant Skills @Strata
Do attendees cluster together based on skills?
•Compute similarity of attendees based on skill vector distance
•Cluster similarities in Gephi
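The two steps above can be sketched as follows, under stated assumptions: cosine similarity over binary skill vectors stands in for the unspecified "skill vector distance", the attendee data is made up, and the `Source,Target,Weight` header reflects the edge-list layout Gephi's spreadsheet import is commonly used with (treat that as an assumption and check it against your Gephi version).

```python
import csv
import io
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two skill sets treated as binary vectors."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_edges(attendee_skills, threshold=0.3):
    """Return (source, target, weight) for every attendee pair at or above the threshold."""
    edges = []
    for (u, su), (v, sv) in combinations(sorted(attendee_skills.items()), 2):
        weight = cosine(su, sv)
        if weight >= threshold:
            edges.append((u, v, round(weight, 3)))
    return edges


def edges_to_csv(edges):
    """Serialize edges as CSV text with a Source,Target,Weight header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


# Made-up attendees, for illustration only.
attendee_skills = {
    "a": ["Hadoop", "Pig", "Python"],
    "b": ["Hadoop", "Pig", "Java"],
    "c": ["Illustrator", "Design"],
}
print(edges_to_csv(similarity_edges(attendee_skills)))
```

Thresholding keeps the graph sparse, so a layout or clustering step has visible structure to work with instead of a fully connected graph.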
More analysis on the way
•DJ Patil has a session tomorrow
•We’ll blog about additional Strata insights soon
Questions?
Peter Skomoroch
LinkedIn
@peteskomoroch
http://linkedin.com/in/peterskomoroch
Blog: DataWrangling.com
Appendix
O'Reilly Strata: Distilling Data Exhaust

Talk from the first O'Reilly Strata conference, Feb 2011. Learn how to leverage data exhaust, the digital byproduct of our online activities, to solve problems and discover insights about the world around you. We will walk through a real-world example that combines several datasets and statistical techniques to discover insights and make predictions about attendees at O'Reilly Strata.

Includes a preview of some of the technology behind LinkedIn Skills, which I launched in a Keynote with DJ Patil the following day.

Video: http://blip.tv/oreilly-promos/distilling-data-exhaust-4780870

Published in: Technology, Education
O'Reilly Strata: Distilling Data Exhaust

  1. Distilling Data Exhaust How to Surface Insights & Build Data Products Feb 2, 2011 Peter Skomoroch LinkedIn @peteskomoroch
  2. What is Data Exhaust?
  3. What is Data Exhaust? My Delicious Tags
  4. What is Data Exhaust? Words I use on Twitter
  5. What can you do with it? •Data has value •I’ll share some lessons I’ve learned about how to extract that value •We’ll go through a case study
  6. Part 1) 10 Lessons Learned
  7. 1) Choose a meaningful problem http://www.flickr.com/photos/aloshbennett/ •Find pain points •Work on stuff that matters •Look for underutilized data
  8. 2) Find or collect relevant data •DataWrangling •InfoChimps •Pete Warden •Factual, SimpleGeo •Mechanical Turk
  9. 3) Raw is better than processed •Normalization could be incorrect •Data might be lost or corrupted •Good approach: public.resource.org http://www.flickr.com/photos/nedraggett/347280918/
  10. 4) Guide user input when you can •Auto suggest •Validate inputs •Collect tags, votes •Makes data scrubbing easier
  11. 5) Solve easier problems first http://where2conf.com/where2010/public/schedule/detail/12400
  12. 6) Build a baseline model quickly •Iterate rapidly after baseline is done •Measure accuracy on hold out test set
  13. 7) Test code on sample data build logical sample data
  14. 8) Use Continuous Integration
  15. 8) Use Continuous Integration https://github.com/matthayes/azkaban
  16. 9) Pick the right tool for the job
  17. 10) Developer productivity is key •Fast Iterations: Python, Ruby, Pig •Convention over configuration •Embrace Github, DevOps, & EC2 •Currently using JRuby & Sinatra
  18. SNA Team: sna-projects.com
  19. Part 2) Case Study: Strata
  20. Conference Insights •I’d like to understand the audience at Strata •What companies do we work for? •What are the top skills at Strata? •Do attendees cluster together based on skill?
  21. Round up a Data Viz team
  22. Use the right tools •Data Crunching: Hadoop, Pig •Statistical Work: Python, NumPy •Visualization: Gephi
  23. Find Some Data: Attendees
  24. Add LinkedIn data
  25. Extract Skills from Profiles What are skills? Extract
  26. Build Hadoop Skill Graph Discover Core Talent Graph for “Hadoop” Igor Perisic
  27. The Talent Graph
  28. We can combine skills with the attendee directory to better understand Strata
  29. What are skills @Strata?
  30. Extract skills for attendees
  31. Top Skills @Strata
  32. Information Overload
  33. Relevance Measures Jaccard Similarity TFIDF
  34. Relevant Skills @Strata
  35. Do attendees cluster together based on skills?
  36. •Compute similarity of attendees based on skill vector distance •Cluster similarities in Gephi
  37. More analysis on the way •DJ Patil has a session tomorrow •We’ll blog about additional Strata insights soon
  38. Questions? Peter Skomoroch LinkedIn @peteskomoroch http://linkedin.com/in/peterskomoroch Blog: DataWrangling.com
  39. Appendix