Big data and Hadoop. September 2012. Hari Shankar Menon, Software engineer, LinkedIn
About me LinkedIn Engineering        Data warehouse team Previously, Software engineer @Clickable   – Worked on buildin...
Agenda About LinkedIn Data Infrastructure overview Hadoop@LinkedIn Challenges                                 3
Our missionConnect the world’s professionals to make  them more productive and successful                                 ...
LinkedIn by numbers                                 175M+                                            90                   ...
 About LinkedIn Data Infrastructure overview Hadoop@LinkedIn Challenges                                 6
What is big data?* Chart from Philip Russom- Research Director: TDWI
Infrastructure technologies                                            Search technologies Primary data store (Front-end) ...
Open sourcehttp://data.linkedin.com/opensource                                      9
 About LinkedIn Data Infrastructure overview Hadoop@LinkedIn Challenges                                 10
 What is Hadoop Evolution of Hadoop Impact                        11
@ Recommendation systems   –   Generating recommendations   –   Modeling   –   A/B Testing   –   Grandfathering Data war...
The Recommendations opportunity• Relevance/Late                   Pandora Search for People  ncy• Offline  computation    ...
Improving recommendations• Mathematical modeling• A/B Testing• Grandfathering                             14
Hadoop in the Data warehouse         • Longer retention    • Source of truth         • Complex             • Lower retenti...
Hadoop in Data Sciences• Deep dives• Sandbox• Hackday projects                           16
Data Insights - 1            Job migration after financial collapse                                                     17
Data Insights - 2                    18
Data Insights - 3                    19
 About LinkedIn Data Infrastructure overview Hadoop@LinkedIn Challenges                                 20
Challenges1.   User adoption of new technologies2.   Real-time processing3.   Graph/Network algorithms4.   Making data acc...
User adoption                22
Real-time processing• Challenges   • Random reads/writes   • Warm-up time• Solutions   • Parts of the problem that can be ...
Map-reduce-incompatible problems• Graph problems• Traditional joins                                            24
Making data accessible• Hadoop  Tons of data                                25
Finally!No Silver bulletHadoop  Offline processingScalability by design                              26
www.linkedin.com/in/harisreekumarwww.linkedin.com/company/linkedin/careers                                            27
Data infrastructure and Hadoop at LinkedIn

Published in: Technology
  • LinkedIn is a social media company, so we deal with a lot of data and face many of the challenges that come with it. Sell LI. Hadoop user group.
  • For us, fundamentally changing the way the world works begins with our mission statement: to connect the world’s professionals and entrepreneurs to make them more productive and successful. This means not only helping people find their dream jobs, but also enabling them to be great at the jobs they’re already in: a platform that lets us become more productive. Talent is THE driving force for success and economic opportunity; that holds true for both individual professionals and the companies they work for. At our core, LinkedIn is in the business of connecting talent with opportunity at massive scale. We are able to do this in an unprecedented way because of the convergence of two trends: scalable infrastructure that connects hundreds of millions of people in milliseconds, and extraordinary shifts in online behavior related to the way people represent their identities, build their networks, and share information and knowledge. This is fundamentally changing the way we live, play, and, of course, work. That is where LinkedIn is focused: on fundamentally transforming the way the world works. These factors enable LinkedIn to connect talent and opportunity.
  • With north of 175 million members, we’re making great strides toward our mission of connecting the world’s professionals to make them more productive and successful. With terabytes of data flowing through our systems, generated from members’ profiles, their connections, and their activity on LinkedIn, we have amassed rich and structured data on one of the most influential, affluent, and highly educated audiences on the web. This huge semi-structured data set is updated in real time and is growing at a tremendous pace. We are all very excited about the data opportunity at LinkedIn.
  • The power of LinkedIn’s platform grows exponentially as we continue to add more members, get them to come back more often, and give them more reasons to engage on the site. These three actions drive network effects that form a virtuous cycle on LinkedIn. As membership grows and activity on the platform increases, the quantity and quality of data propagated throughout the network improves, and we use that data to create better and more relevant products and services for our members and customers. We have recommendation solutions for everyone: individuals, recruiters, and advertisers. In our view, recommendations are ubiquitous and permeate the whole site, enabling professionals to be more productive. Big data. Volume: generally large, several TB and sometimes PB. Variety: 80% of the data is unstructured, growing at 15 times the rate of structured data. Velocity: high. Examples: user data (more structured), traffic data (real-time), 3rd-party data (batch, but unstructured).
  • Need for various technologies: one size doesn’t fit all.
  • History: Google paper, Doug Cutting, Yahoo. Storage and computation. Synonymous with big data. Empowering: made a lot of new ideas feasible and spawned a new bunch of startups. Ability to store and process => more data to store. May be 2 slides. NAS systems, OLAP: but not feasible. Hadoop democratized scalable data processing.
  • We have recommendation solutions for everyone: individuals, recruiters, and advertisers. In our view, recommendations are ubiquitous and permeate the whole site. Very visible value addition: the right information to the right user at the right time. Integral to the virality of the network. Problems: computation-intensive algorithms, variety of recommendations, lots of A/B testing required.
  • 50% of job views/applications by members are a direct result of recommendations, with similar results across all recommendations.
  • Aggregations, complex transformations, long-term data storage, load sharing (?).
  • The Hadoop impact: moving ETL jobs to Hadoop has helped make data available for ad-hoc queries by data scientists.
  • We have a unique perspective into data. Before the collapse, we saw substantial spikes in user activity for the following 5 companies during major financial events. One hypothesis is that many of the employees left the financial industry. According to the LinkedIn data set, that just isn’t true. (Bank of America acquired Merrill Lynch, and Nomura acquired Lehman Brothers’ franchise in the Asia Pacific region.) Barclays was by far the biggest beneficiary, scooping up 10% of the laid-off talent, followed by Credit Suisse at 1.5% and Citigroup at 1.1%.
  • ENG SLIDE: What is a data scientist? What are the different technologies, big data, challenges, and opportunities? Open source: InMaps (hackday projects), full-fledged products.
  • Add images for SQL/MapReduce.
  • Hadoop is, and will always be, optimized for sequential reads and throughput rather than speed of completion.
  • Use abstractions: Pig and Hive.
  • No random reads/writes. Native APIs insufficient.
  • Data infrastructure and Hadoop at LinkedIn
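
The notes on "traditional joins" and on using abstractions such as Pig and Hive can be illustrated with a small sketch. The following is a minimal, in-memory imitation of a reduce-side join written in plain Python, not Hadoop code and not LinkedIn's implementation. The sample records, field layouts, and function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records; the field layouts are assumptions for illustration.
members = [(1, "Alice"), (2, "Bob")]
job_views = [(1, "job-17"), (1, "job-42"), (2, "job-17")]


def map_phase(members, job_views):
    """Emit (join_key, (source_tag, value)) pairs, as a mapper would."""
    for member_id, name in members:
        yield member_id, ("member", name)
    for member_id, job_id in job_views:
        yield member_id, ("view", job_id)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every member record with every view record that shares the key."""
    names = [value for tag, value in values if tag == "member"]
    jobs = [value for tag, value in values if tag == "view"]
    for name, job in product(names, jobs):
        yield key, name, job


def reduce_side_join(members, job_views):
    """Run map, shuffle, and reduce in sequence and collect the joined rows."""
    groups = shuffle(map_phase(members, job_views))
    rows = []
    for key in sorted(groups):
        rows.extend(reduce_phase(key, groups[key]))
    return rows


if __name__ == "__main__":
    for row in reduce_side_join(members, job_views):
        print(row)
```

Even in this toy form, the join needs source tagging, grouping, and a per-key cross product. In Pig or Hive the same operation is a single join statement, which is the kind of abstraction the notes recommend.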

    1. Big data and Hadoop. September 2012. Hari Shankar Menon, Software engineer, LinkedIn
    2. About me: LinkedIn Engineering, Data warehouse team. Previously, Software engineer @Clickable: worked on building the reporting and analytics platform on Hadoop and HBase. Hadoop and open-source enthusiast.
    3. Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
    4. Our mission: Connect the world’s professionals to make them more productive and successful
    5. LinkedIn by numbers: 175M+ members; ~2/sec new members joining; >2M Company Pages; 85% of Fortune 100 companies use LinkedIn to hire**; ~4.2B professional searches in 2011. LinkedIn members (millions), 2004 to 2010: 2, 4, 8, 17, 32, 55, 90. *as of Nov 4, 2011 **as of June 30, 2011
    6. Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
    7. What is big data? (* Chart from Philip Russom, Research Director, TDWI)
    8. Infrastructure technologies: Search technologies (SenseiDB, Zoie, Bobo); Primary data store (Front-end); Document-oriented store; Distributed key-value store; Distributed PubSub messaging; Database change replication
    9. Open source: http://data.linkedin.com/opensource
    10. Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
    11. What is Hadoop; Evolution of Hadoop; Impact
    12. Hadoop@LinkedIn. Recommendation systems: generating recommendations, modeling, A/B testing, grandfathering. Data warehouse/ETL: raw data storage, aggregations, heavy lifting. Data sciences: strategic analyses, experimentation sandbox.
    13. The Recommendations opportunity: Relevance/Latency; Offline computation; Caching. (Figure labels: Pandora; Search for People; Events You May Be Interested In; Groups; browse maps)
    14. Improving recommendations: Mathematical modeling; A/B Testing; Grandfathering
    15. Hadoop in the Data warehouse: Longer retention; Source of truth; Complex transformations; Lower retention; Ad-hoc analysis; Algorithmic computations
    16. Hadoop in Data Sciences: Deep dives; Sandbox; Hackday projects
    17. Data Insights - 1: Job migration after financial collapse
    18. Data Insights - 2
    19. Data Insights - 3
    20. Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
    21. Challenges: 1. User adoption of new technologies 2. Real-time processing 3. Graph/Network algorithms 4. Making data accessible
    22. User adoption
    23. Real-time processing. Challenges: Random reads/writes; Warm-up time. Solutions: Parts of the problem that can be moved offline?; HBase, Voldemort
    24. Map-reduce-incompatible problems: Graph problems; Traditional joins
    25. Making data accessible. Hadoop: Tons of data
    26. Finally! No silver bullet. Hadoop: Offline processing. Scalability by design.
    27. www.linkedin.com/in/harisreekumar; www.linkedin.com/company/linkedin/careers
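
Slide 23 suggests moving parts of a real-time problem offline and serving the results from a store such as HBase or Voldemort. The sketch below is one hedged way to picture that split, not the approach actually used at LinkedIn. A plain Python dict stands in for the key-value store, and the activity log, the co-view scoring rule, and the function names (`build_recommendations`, `serve`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical activity log of (member_id, viewed_job_id); assumed data for illustration.
view_log = [
    ("m1", "job-a"), ("m1", "job-b"),
    ("m2", "job-a"), ("m2", "job-c"),
    ("m3", "job-a"), ("m3", "job-b"),
]


def build_recommendations(log, top_n=2):
    """Offline batch step: score unseen jobs by how many overlapping members viewed them."""
    jobs_by_member = defaultdict(set)
    for member, job in log:
        jobs_by_member[member].add(job)

    store = {}  # stand-in for a key-value store; a real system would bulk-load the results
    for member, seen in jobs_by_member.items():
        scores = Counter()
        for other, other_jobs in jobs_by_member.items():
            # Only members who share at least one viewed job contribute candidates.
            if other != member and seen & other_jobs:
                for job in other_jobs - seen:
                    scores[job] += 1
        # Sort by score (descending), then job id, so the output is deterministic.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        store[member] = [job for job, _ in ranked[:top_n]]
    return store


def serve(store, member_id):
    """Online step: a single key lookup, with no computation at request time."""
    return store.get(member_id, [])


if __name__ == "__main__":
    store = build_recommendations(view_log)
    print(serve(store, "m1"))
```

The expensive pass over the full log runs as a batch job, which suits a system built for throughput, while the request path is reduced to one read by key.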