Rapid Data Exploration With Hadoop

LinkedIn is the premier professional social network, with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is its unique data. While most organizations treat data as a service function, LinkedIn considers data a cornerstone of its product portfolio.

To rapidly develop these products, LinkedIn leverages a number of technologies, including open source, third-party solutions, and some we've had to invent along the way.

This LinkedIn talk at the NYC Hadoop Meetup held 3/18 at ContextWeb focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.

Rapid Data Exploration With Hadoop: Presentation Transcript

  • Rapid Data Exploration With Hadoop • Peter Skomoroch, Senior Data Scientist • @peteskomoroch
  • Outline • Overview: LinkedIn Biz, Tech, & Analytics • Rapid Data Exploration 101 - Spatial Analytics Pig Code - Trend detection with Pig & Python - R Streaming Example • Deep Dive: Our Data Analysis Approach • Building Data Products • LinkedIn Data Insights
  • Connect the world’s professionals to make them more productive and successful
  • Professional Identity
  • LinkedIn at a glance • Founded in 2003 • #17 site in the US (Alexa) • 60+ million members • First million members = 477 days • Latest million = 9 days • 500K+ company profiles • 12+ million small business professionals • 1 billion people searches in 2009 • Average age: 41 • Household income $107,000 • 42% are “decision makers”
  • How International? • More than 50% international (members in over 200 countries & territories) • 13+ million in Europe • 4+ million in India • 3+ million in UK • #13 site in UK (Alexa)
  • How do we keep the lights on? • Profitable since 2007 • Valued at over $1B at the last funding round • Subscriptions • Ads • Job Postings • Enterprise Client
  • Hadoop on LinkedIn • 1,400+ members list “Hadoop” on their profile • What other skills do they have? HBase, Lucene, Solr, MapReduce, Nutch... • Where are they? 36% in Bay Area, 8% in India, 6% in NYC, 4% in Seattle, 4% in Los Angeles • Who do they work for? 11% Yahoo!, 2% Apache Software Foundation, 1% LinkedIn, 1% Google, 1% Facebook
  • Hadoop at LinkedIn
  • Voldemort Data Storage • Compact, compressed, binary data (something like Avro) • Type can be any combination of int, double, float, String, Map, List, etc. => Sequence Files • Example member definition: { 'member_id': 'int32', 'first_name': 'string', 'last_name': 'string', 'age': 'int32' … }
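As a rough illustration of how a typed member definition like the one on this slide can be used, the sketch below checks a record against a schema before it is written. It is not Voldemort's serialization code: `MEMBER_SCHEMA`, `conforms`, and `validate` are hypothetical names, only the four fields shown on the slide are included (the elided ones are left out), and only the `int32` and `string` types are handled.

```python
# Hypothetical schema mirroring only the fields visible on the slide.
MEMBER_SCHEMA = {
    "member_id": "int32",
    "first_name": "string",
    "last_name": "string",
    "age": "int32",
}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def conforms(value, type_name):
    """Return True if value fits the named primitive type."""
    if type_name == "int32":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and INT32_MIN <= value <= INT32_MAX
        )
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"unsupported type name: {type_name}")


def validate(record, schema):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for field, type_name in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not conforms(record[field], type_name):
            problems.append(f"bad type for {field}: expected {type_name}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems
```

Calling `validate(record, MEMBER_SCHEMA)` on each record before serialization would surface type mismatches early, which is the main benefit of declaring types up front.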
  • Getting Data In • From databases (user data, news, jobs, etc.) • Need a way to get data reliably and periodically • Need tests to verify data • Support for incremental replication • Solution: Transmogrify Driver Program • Input readers: JDBCReader, CSV Reader • Output writers: JDBCWriter, HDFS writers • From web logs (page views, search, clicks, etc.) • Web log files are rsynced and loaded into HDFS • Hadoop jobs for data cleaning and transformation
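The reader/writer split and the incremental replication requirement on this slide can be sketched roughly as follows. This is not the Transmogrify driver, whose API the slides do not show: `csv_reader`, `run_driver`, the `id` column, and the watermark scheme are all assumptions made for illustration.

```python
import csv
import io


def csv_reader(text):
    """Yield rows as dicts from CSV text (stand-in for a CSV or JDBC reader)."""
    yield from csv.DictReader(io.StringIO(text))


def run_driver(rows, write, last_seen_id=0):
    """Copy rows newer than last_seen_id to write().

    Returns (new_watermark, rejected_count). Rows whose id is missing or
    not an integer fail the verification step and are skipped.
    """
    watermark = last_seen_id
    rejected = 0
    for row in rows:
        try:
            row_id = int(row["id"])
        except (KeyError, ValueError, TypeError):
            rejected += 1
            continue
        if row_id <= last_seen_id:
            continue  # already replicated in an earlier run
        write(row)
        watermark = max(watermark, row_id)
    return watermark, rejected
```

Persisting the returned watermark between runs is what makes the copy incremental: the next run passes it back as `last_seen_id` and only newer rows are written.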
  • Getting Data Out
  • Giving Back: Open Source http://sna-projects.com/sna/
  • Analytics Technologies
  • We Build Things With Data Give smart people great tools, enable them to solve problems
  • Prototyping Culture
  • How does Hadoop enable rapid data exploration?
  • Pig for Spatial Analytics
  • US County HeatMap
  • Pig for Trend Detection
  • Python Streaming Script
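The script shown on this slide was not captured in the transcript. As a stand-in, here is a minimal sketch of the kind of reducer logic a Hadoop Streaming job could run for trend detection. The input layout (`term<TAB>date<TAB>count`, grouped by term, ISO dates) and the scoring formula are assumptions, not the talk's actual method.

```python
from itertools import groupby


def parse(lines):
    """Parse 'term<TAB>date<TAB>count' lines into (term, date, count) tuples."""
    for line in lines:
        term, date, count = line.rstrip("\n").split("\t")
        yield term, date, int(count)


def trend_scores(lines):
    """Yield 'term<TAB>score' lines, one per term with at least two data points.

    Assumes the input is already grouped by term, as Hadoop Streaming
    delivers it to a reducer. The score is the latest count divided by
    (mean of earlier counts + 1), an illustrative choice only.
    """
    for term, group in groupby(parse(lines), key=lambda rec: rec[0]):
        # ISO-formatted dates sort correctly as plain strings.
        series = sorted((date, count) for _, date, count in group)
        counts = [count for _, count in series]
        if len(counts) < 2:
            continue  # no baseline to compare against
        baseline = sum(counts[:-1]) / (len(counts) - 1)
        score = counts[-1] / (baseline + 1.0)
        yield f"{term}\t{score:.3f}"
```

In an actual streaming job the script would loop over `trend_scores(sys.stdin)` and print each result, since Hadoop Streaming passes records through standard input and output.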
  • Sort Output & Display
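The slide's own code is missing from the transcript, so the following is only a sketch of the sort-and-display step, assuming job output in `term<TAB>score` form. `top_terms` and `render` are hypothetical helper names, and the text bar chart is a placeholder for whatever visualization the talk used.

```python
import heapq


def top_terms(lines, n=10):
    """Return the n highest-scoring (term, score) pairs, best first."""
    scored = []
    for line in lines:
        term, score = line.rstrip("\n").split("\t")
        scored.append((float(score), term))
    return [(term, score) for score, term in heapq.nlargest(n, scored)]


def render(rows, width=20):
    """Return one text line per row with a bar scaled to the top score."""
    if not rows:
        return []
    top = max(score for _, score in rows)
    lines = []
    for term, score in rows:
        bar_len = round(width * score / top) if top > 0 else 0
        lines.append(f"{term:<12} {'#' * bar_len} {score:.2f}")
    return lines
```

Using `heapq.nlargest` avoids sorting the full output when only the top few terms are needed.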
  • R Streaming Also Easy *from http://www.stat.uiowa.edu/~luke/classes/295-hpc/
  • Let’s Talk Data
  • Business is recognizing the importance of analytics
  • What data do we start with?
  • We can also leverage... • Connection Graph • Company Pages • Recommendations • Talent Match • Address Book Uploads • Web Referrals • Search Logs • 1M+ Twitter Accounts • Profile Views & Activity • Wikipedia Data • Job Postings • Mechanical Turk • LinkedIn Groups • Census, BLS, & Data.gov • LinkedIn Questions • Much more...
  • How do we think of Analytics? Data Jujitsu
  • Lots of Medium can be more powerful than Big
  • Reconstruct Reality from Data Exhaust
  • Data Scientist Lessons • Follow the data, avoid assumptions • Sanity check the extremes (0, infinity) • Don’t get mired in rare edge cases • Data Jujitsu: solve easier auxiliary problems • Build smaller consistent samples to test code • Establish a baseline model quickly, iterate often • Use the right tool for the job at hand • Iterate quickly with high level languages
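One way to act on "build smaller consistent samples to test code" is to sample by hashing a stable key, so the same members land in the sample on every run and across every dataset that shares the key. The sketch below assumes a member ID as the key; `in_sample` is a hypothetical helper, and MD5 is used only as a stable hash, not for security.

```python
import hashlib


def in_sample(key, percent=1):
    """Return True if key falls in a stable percent% sample.

    The decision depends only on the key, so repeated runs and different
    datasets keyed the same way select the same records.
    """
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


# Example: keep roughly 10% of member IDs, the same ones every time.
sample = [member_id for member_id in range(10000) if in_sample(member_id, 10)]
```

Because the bucket is computed once per key, a 1% sample is always a subset of the 10% sample, which makes it easy to scale a test up without changing which records were already covered.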
  • Where did the bankers go?
  • We’re Hiring! http://sna-projects.com/sna/ pskomoro@linkedin.com @peteskomoroch