Rapid Data Exploration With Hadoop

LinkedIn is the premier professional social network, with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations treat data as a service function, LinkedIn considers data a cornerstone of its product portfolio.

To develop these products rapidly, LinkedIn leverages a number of technologies, including open source software, third-party solutions, and some tools we've had to invent along the way.

This LinkedIn talk at the NYC Hadoop Meetup held 3/18 at ContextWeb focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.

Published in: Technology

  1. Rapid Data Exploration With Hadoop. Peter Skomoroch, Senior Data Scientist, @peteskomoroch
  2. Outline • Overview: LinkedIn Biz, Tech, & Analytics • Rapid Data Exploration 101 - Spatial Analytics Pig Code - Trend detection with Pig & Python - R Streaming Example • Deep Dive: Our Data Analysis Approach • Building Data Products • LinkedIn Data Insights
  3. Connect the world’s professionals to make them more productive and successful
  4. Professional Identity
  5. LinkedIn at a glance • Founded in 2003 • #17 site in the US (Alexa) • 60+ million members • First million members = 477 days • Latest million = 9 days • 500K+ company profiles • 12+ million small business professionals • 1 billion people searches in 2009 • Average age: 41 • Household income $107,000 • 42% are “decision makers”
  6. How International? • More than 50% international (members in over 200 countries & territories) • 13+ million in Europe • 4+ million in India • 3+ million in UK • #13 site in UK (Alexa)
  7. How do we keep the lights on? • Profitable since 2007 • Valued at over $1B at the last funding round • Subscriptions • Ads • Job Postings • Enterprise Client
  8. Hadoop on LinkedIn: 1,400+ members list “Hadoop” on their profile. What other skills do they have? • HBase, Lucene, Solr, MapReduce, Nutch... Where are they? • 36% in Bay Area • 8% in India • 6% in NYC • 4% in Seattle • 4% in Los Angeles. Who do they work for? • 11% Yahoo! • 2% Apache Software Foundation • 1% LinkedIn • 1% Google • 1% Facebook
  9. Hadoop at LinkedIn
  10. Voldemort Data Storage • Compact, compressed, binary data (something like Avro) • Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files • Example member definition: { 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32' … }
  11. Getting Data In • From databases (user data, news, jobs, etc.) • Need a way to get data reliably and periodically • Need tests to verify data • Support for incremental replication • Solution: Transmogrify Driver Program • InputReader: JDBCReader, CSV Reader • Output Writer: JDBCWriter, HDFS writers • From web logs (page views, search, clicks, etc.) • Weblog files are rsynced and loaded into HDFS • Hadoop jobs for data cleaning and transformation
  12. Getting Data Out
  13. Giving Back: Open Source http://sna-projects.com/sna/
  14. Analytics Technologies
  15. We Build Things With Data: Give smart people great tools, enable them to solve problems
  16. Prototyping Culture
  17. How does Hadoop enable rapid data exploration?
  18. Pig for Spatial Analytics
  19. US County HeatMap
  20. Pig for Trend Detection
  21. Python Streaming Script
  22. Sort Output & Display
  23. R Streaming Also Easy (from http://www.stat.uiowa.edu/~luke/classes/295-hpc/)
  24. Let’s Talk Data
  25. Business is recognizing the importance of analytics
  26. What data do we start with?
  27. We can also leverage... • Connection Graph • Company Pages • Recommendations • Talent Match • Address Book Uploads • Web Referrals • Search Logs • 1M+ Twitter Accounts • Profile Views & Activity • Wikipedia Data • Job Postings • Mechanical Turk • LinkedIn Groups • Census, BLS, & Data.gov • LinkedIn Questions • Much more...
  28. How do we think of Analytics? Data Jujitsu
  29. Lots of Medium can be more powerful than Big
  30. Reconstruct Reality from Data Exhaust
  31. Data Scientist Lessons • Follow the data, avoid assumptions • Sanity check the extremes (0, infinity) • Don’t get mired in rare edge cases • Data Jujitsu: solve easier auxiliary problems • Build smaller consistent samples to test code • Establish a baseline model quickly, iterate often • Use the right tool for the job at hand • Iterate quickly with high level languages
  32. Where did the bankers go?
  33. We’re Hiring! http://sna-projects.com/sna/ pskomoro@linkedin.com @peteskomoroch
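
Slide 10 shows an example member definition made of named, typed fields. As a rough illustration only (this is not LinkedIn's storage code), a definition like that could be used to check records before they are written. The names `MEMBER_SCHEMA`, `check_value`, and `validate` are hypothetical, only the four fields visible on the slide are included (the slide elides the rest), and only the `int32` and `string` type names are handled.

```python
from typing import Any, Dict, List

# Hypothetical schema holding only the four fields visible on slide 10.
MEMBER_SCHEMA: Dict[str, str] = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_value(type_name: str, value: Any) -> bool:
    """Return True if value fits the named type (assumed semantics)."""
    if type_name == "int32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and INT32_MIN <= value <= INT32_MAX
        )
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"unsupported type name: {type_name}")


def validate(record: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """Collect human-readable problems; an empty list means the record passes."""
    errors: List[str] = []
    for field, type_name in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not check_value(type_name, record[field]):
            errors.append(f"bad value for {field}: expected {type_name}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors
```

Calling `validate(record, MEMBER_SCHEMA)` on each incoming record and rejecting any with a non-empty error list is one simple way to meet the "need tests to verify data" point on slide 11.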

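Slides 20 to 22 cover trend detection with Pig and a Python streaming script, followed by sorting the output. The transcript does not include the script itself, so the following is only a sketch of the general pattern under stated assumptions: input records are `period<TAB>term` lines, and the trend score is a smoothed ratio of counts between two periods. The record layout, the scoring formula, and the function names `map_lines`, `reduce_counts`, and `trend_scores` are all assumptions, not the method shown in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Map step: turn each 'period<TAB>term' line into (term, period, 1)."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # skip malformed records instead of failing the job
        period, term = parts
        yield term, period, 1


def reduce_counts(
    records: Iterable[Tuple[str, str, int]]
) -> Dict[str, Dict[str, int]]:
    """Reduce step: sum counts per term and period."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for term, period, n in records:
        counts[term][period] += n
    return {term: dict(by_period) for term, by_period in counts.items()}


def trend_scores(
    counts: Dict[str, Dict[str, int]],
    old: str,
    new: str,
    smoothing: float = 1.0,
) -> List[Tuple[str, float]]:
    """Score each term by the smoothed ratio new/old, highest first."""
    scores = []
    for term, by_period in counts.items():
        ratio = (by_period.get(new, 0) + smoothing) / (
            by_period.get(old, 0) + smoothing
        )
        scores.append((term, ratio))
    # Sort by descending score, breaking ties alphabetically by term.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores
```

In an actual Hadoop Streaming job the map and reduce steps would live in separate scripts that read `sys.stdin` and print tab-separated lines; they are kept as plain functions here so the logic can be tried on a small in-memory sample first.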
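
One of the lessons on slide 31 is to build smaller consistent samples to test code. A common way to get a sample that stays the same across runs and across datasets is to hash a stable key and keep the records whose hash falls below a threshold. The sketch below assumes records are keyed by a member ID; the function name `in_sample`, the salt string, and the choice of MD5 are illustrative assumptions rather than anything shown in the talk.

```python
import hashlib


def in_sample(member_id: int, rate: float = 0.01, salt: str = "sample-v1") -> bool:
    """Deterministically decide whether a member belongs to the sample.

    The same member_id, rate, and salt always give the same answer, so
    every dataset keyed by member_id is filtered down to the same members.
    """
    digest = hashlib.md5(f"{salt}:{member_id}".encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rate
```

Because the decision depends only on the key, a member kept at a 1% rate is also kept at a 10% rate with the same salt, which makes it easy to grow the sample once the code works on the small one.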