Data infrastructure and Hadoop at LinkedIn

  • As part of LinkedIn, a social media company, we deal with a lot of data and face many of the challenges that come with it. (Sell the LI Hadoop user group.)
  • For us, fundamentally changing the way the world works begins with our mission statement: to connect the world’s professionals and entrepreneurs to make them more productive and successful. This means not only helping people find their dream jobs, but also enabling them to be great at the jobs they’re already in: a platform that lets us become more productive. Talent is THE driving force for success and economic opportunity; that holds true for both individual professionals and the companies they work for. At our core, LinkedIn is in the business of connecting talent with opportunity at massive scale. We are able to do this in an unprecedented way due to the convergence of two unique trends: (1) scalable infrastructure that connects hundreds of millions of people in milliseconds, and (2) extraordinary shifts in online behavior related to the way people represent their identities, build their networks and share information and knowledge. This is fundamentally changing the way we live, play, and, of course, work. And that’s where LinkedIn is focused: on fundamentally transforming the way the world works. These factors enable LinkedIn to connect talent and opportunity.
  • With north of 175 million members, we’re making great strides toward our mission of connecting the world’s professionals to make them more productive and successful. For us this means not only helping people find their dream jobs, but also enabling them to be great at the jobs they’re already in. With terabytes of data flowing through our systems, generated from members’ profiles, their connections and their activity on LinkedIn, we have amassed rich, structured data on one of the most influential, affluent and highly educated audiences on the web. This huge semi-structured data set is updated in real time and is growing at a tremendous pace. We are all very excited about the data opportunity at LinkedIn.
  • The power of LinkedIn’s platform grows exponentially as we continue to (1) add more members, (2) get them to come back more often, and (3) give them more reasons to engage on the site. These three actions drive network effects that form a virtuous cycle on LinkedIn. As membership grows and activity on the platform increases, the quantity and quality of data propagated throughout the network improve, and we use that data to create better and more relevant products and services for our members and customers. We have recommendation solutions for everyone: individuals, recruiters and advertisers. In our view, recommendations are ubiquitous and permeate the whole site. This enables professionals to be more productive. Volume: generally large, several TB and sometimes PB. Variety: 80% of the data is unstructured, growing at 15 times the rate of structured data. Velocity: high. User data (more structured); traffic data (real-time); 3rd-party data (batch, but unstructured). Example.
  • Need for various technologies. One size doesn’t fit all.
  • History: Google paper, Doug Cutting, Yahoo. Storage and computation. Synonymous with big data. Empowering: made a lot of new ideas feasible and spawned a new bunch of startups. Ability to store and process => more data to store. (Maybe 2 slides.) NAS systems, OLAP existed, but were not feasible. Hadoop democratized scalable data processing.
  • We have recommendation solutions for everyone: individuals, recruiters and advertisers. In our view, recommendations are ubiquitous and permeate the whole site. Very visible value addition: the right information to the right user at the right time. Integral to the virality of the network. Problems: computation-intensive algorithms; variety of recommendations; lots of A/B testing required.
  • 50% of job views/applications by members are a direct result of recommendations. Similar results across all recommendations.
  • Aggregations; complex transformations; long-term data storage; load sharing (?)
  • The Hadoop impact: moving ETL jobs to Hadoop has helped make data available for ad hoc queries by data scientists.
  • We have a unique perspective into data. Before the collapse, we saw substantial spikes in user activity for the following 5 companies during major financial events: One hypothesis is that many of the employees left the financial industry. According to the LinkedIn data set, that just isn’t true. (Bank of America acquired Merrill Lynch, and Nomura acquired Lehman Brothers’ franchise in the Asia Pacific region.) Barclays was by far the biggest beneficiary, scooping up 10% of the laid-off talent, followed by Credit Suisse at 1.5% and Citigroup at 1.1%.
  • ENG SLIDE: What is a data scientist? What are the different technologies, big data, challenges and opportunities? Open source: InMaps (hackday projects), full-fledged products.
  • Add images for SQL/MapReduce.
  • Hadoop is, and always will be, optimized for sequential reads and throughput rather than speed of completion.
  • Use abstractions! Pig and Hive.
  • No random reads/writes. Native APIs insufficient.
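
The note about abstractions can be made concrete with a small sketch. The following plain-Python simulation (not Hadoop code, and not taken from the talk) shows the map, shuffle and reduce steps needed for a simple inner join of two hypothetical data sets, member names and job views, keyed by member ID. All function names and sample records are illustrative assumptions. In Pig or Hive the same join is a single JOIN statement, which is the kind of boilerplate those abstractions remove.

```python
from collections import defaultdict


def map_members(record):
    """Map step for the (hypothetical) members data set: tag each value with its source."""
    member_id, name = record
    yield member_id, ("member", name)


def map_views(record):
    """Map step for the (hypothetical) job-views data set."""
    member_id, job_id = record
    yield member_id, ("view", job_id)


def shuffle(pairs):
    """Group all mapped values by key, as the framework's shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Reduce step: pair every member record with every view record for one key."""
    names = [v for tag, v in values if tag == "member"]
    jobs = [v for tag, v in values if tag == "view"]
    for name in names:
        for job in jobs:
            yield key, name, job


def run_join(members, views):
    """Drive the simulated map, shuffle and reduce phases; returns inner-join rows."""
    mapped = []
    for record in members:
        mapped.extend(map_members(record))
    for record in views:
        mapped.extend(map_views(record))
    joined = []
    for key, values in sorted(shuffle(mapped).items()):
        joined.extend(reduce_join(key, values))
    return joined


if __name__ == "__main__":
    sample_members = [(1, "ana"), (2, "bo")]
    sample_views = [(1, "j10"), (1, "j11"), (3, "j12")]
    for row in run_join(sample_members, sample_views):
        print(row)
```

Only member 1 appears in both inputs, so only that member produces joined rows; member 2 (no views) and member 3 (no profile) are dropped, as in any inner join.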

Data infrastructure and Hadoop at LinkedIn Presentation Transcript

  • 1. Big data and Hadoop. September 2012. Hari Shankar Menon, Software engineer, LinkedIn
  • 2. About me: LinkedIn Engineering, Data warehouse team. Previously, Software engineer @Clickable: worked on building the reporting and analytics platform on Hadoop and HBase. Hadoop and open-source enthusiast
  • 3. Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
  • 4. Our mission: Connect the world’s professionals to make them more productive and successful
  • 5. LinkedIn by numbers: 175M+ members; ~2/sec new members joining; >2M company pages; 85% of Fortune 100 companies use LinkedIn to hire**; ~4.2B professional searches in 2011. Chart, LinkedIn Members (Millions): 2004: 2; 2005: 4; 2006: 8; 2007: 17; 2008: 32; 2009: 55; 2010: 90. *as of Nov 4, 2011 **as of June 30, 2011
  • 6. About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
  • 7. What is big data? (Chart from Philip Russom, Research Director, TDWI)
  • 8. Infrastructure technologies: Search technologies (SenseiDB, Zoie, Bobo); Primary data store (Front-end); Document-oriented store; Distributed key-value store; Distributed PubSub messaging; Database change replication
  • 9. Open source: http://data.linkedin.com/opensource
  • 10. About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
  • 11. What is Hadoop; Evolution of Hadoop; Impact
  • 12. Hadoop@LinkedIn: Recommendation systems (generating recommendations, modeling, A/B testing, grandfathering); Data warehouse/ETL (raw data storage, aggregations, heavy lifting); Data sciences (strategic analyses, experimentation sandbox)
  • 13. The Recommendations opportunity: Relevance/Latency; Offline computation; Caching. (Figure labels: Pandora Search for People; Events You May Be Interested In; Groups browse maps)
  • 14. Improving recommendations: Mathematical modeling; A/B Testing; Grandfathering
  • 15. Hadoop in the Data warehouse: Longer retention; Source of truth; Complex transformations; Lower retention; Ad-hoc analysis; Algorithmic computations
  • 16. Hadoop in Data Sciences: Deep dives; Sandbox; Hackday projects
  • 17. Data Insights - 1: Job migration after financial collapse
  • 18. Data Insights - 2
  • 19. Data Insights - 3
  • 20. About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
  • 21. Challenges: (1) User adoption of new technologies; (2) Real-time processing; (3) Graph/Network algorithms; (4) Making data accessible
  • 22. User adoption
  • 23. Real-time processing. Challenges: Random reads/writes; Warm-up time. Solutions: Parts of the problem that can be moved offline?; HBase, Voldemort
  • 24. Map-reduce-incompatible problems: Graph problems; Traditional joins
  • 25. Making data accessible: Hadoop → Tons of data
  • 26. Finally! No silver bullet; Hadoop → Offline processing; Scalability by design
  • 27. www.linkedin.com/in/harisreekumar; www.linkedin.com/company/linkedin/careers
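
As a rough illustration of the "parts of the problem that can be moved offline" idea from slide 23, the sketch below splits the work into a batch step that scans a full log sequentially and a serving step that only does key lookups. It is a minimal sketch under stated assumptions, not the approach described in the talk: the log format, the function `batch_compute_top_jobs` and the class `ReadOnlyStore` are hypothetical, and the in-memory dictionary merely stands in for a key-value store such as Voldemort or HBase.

```python
from collections import Counter, defaultdict


def batch_compute_top_jobs(view_log, top_n=2):
    """Offline step: one sequential pass over the whole log.

    view_log is an iterable of (member_id, job_id) pairs (a hypothetical format).
    Returns a dict mapping each member to their most-viewed job IDs.
    """
    counts = defaultdict(Counter)
    for member_id, job_id in view_log:
        counts[member_id][job_id] += 1
    return {
        member_id: [job_id for job_id, _ in counter.most_common(top_n)]
        for member_id, counter in counts.items()
    }


class ReadOnlyStore:
    """Online step: serves precomputed results by key, with no computation at read time.

    A plain dict is used here as a stand-in for a real key-value store.
    """

    def __init__(self, snapshot):
        # Copy the batch output so later changes to it do not affect serving.
        self._data = dict(snapshot)

    def get(self, key, default=None):
        return self._data.get(key, default)


if __name__ == "__main__":
    log = [(1, "a"), (1, "b"), (1, "a"), (2, "c")]
    store = ReadOnlyStore(batch_compute_top_jobs(log, top_n=1))
    print(store.get(1))      # precomputed result for a known member
    print(store.get(3, []))  # unknown member falls back to the default
```

The design choice is that the expensive scan runs where sequential throughput matters and latency does not, while the request path is reduced to a single lookup.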