Being part of LinkedIn, a social media company, we deal with a lot of data and face a lot of challenges. (Notes: sell LI; Hadoop user group.)
For us, fundamentally changing the way the world works begins with our mission statement: to connect the world’s professionals and entrepreneurs to make them more productive and successful. This means not only helping people find their dream jobs, but also enabling them to be great at the jobs they’re already in. A platform that lets us become more productive.

Talent is THE driving force for success and economic opportunity; that holds true for both individual professionals and the companies they work for. At our core, LinkedIn is in the business of connecting talent with opportunity at massive scale. We are able to do this in an unprecedented way due to the convergence of two unique trends:
• Scalable infrastructure that connects hundreds of millions of people in milliseconds, and
• Extraordinary shifts in online behavior related to the way people represent their identities, build their networks, and share information and knowledge.

This is fundamentally changing the way we live, play, and, of course, work. And that’s where LinkedIn is focused: on fundamentally transforming the way the world works. These factors enable LI to connect talent and opportunity.
With north of 175 million members, we’re making great strides toward our mission of connecting the world’s professionals to make them more productive and successful. For us this means not only helping people find their dream jobs, but also enabling them to be great at the jobs they’re already in.

With terabytes of data flowing through our systems, generated from members’ profiles, their connections, and their activity on LinkedIn, we have amassed rich, structured data on one of the most influential, affluent, and highly educated audiences on the web. This huge semi-structured data set is updated in real time and is growing at a tremendous pace. We are all very excited about the data opportunity at LinkedIn.
The power of LinkedIn’s platform grows exponentially as we continue to:
• Add more members
• Get them to come back more often, and
• Give them more reasons to engage on the site

These three actions drive network effects that form a virtuous cycle on LinkedIn. As membership grows and activity on the platform increases, the quantity and quality of data propagated throughout the network improve, and we use that data to create better and more relevant products and services for our members and customers.

We have recommendation solutions for everyone: individuals, recruiters, and advertisers. In our view, recommendations are ubiquitous and they permeate the whole site. This enables professionals to be more productive.

• Volume: generally large, several TBs, sometimes PBs
• Variety: 80% of the data is unstructured, growing at 15 times the rate of structured data
• Velocity: high velocity

• User data (more structured)
• Traffic data (real-time)
• 3rd-party data (batch data, but unstructured)

Example
Need for various technologies. One size doesn’t fit all.
History: Google paper, Doug Cutting, Yahoo. Storage and computation. Synonymous with big data. Empowering. Made a lot of new ideas feasible and spawned a new bunch of startups. Ability to store and process => more data to store. Maybe 2 slides. NAS systems, OLAP, but not feasible. Hadoop democratized scalable data processing.
We have recommendation solutions for everyone: individuals, recruiters, and advertisers. In our view, recommendations are ubiquitous and they permeate the whole site.
• Very visible value addition: right information to the right user at the right time
• Integral to virality of the network
Problems:
• Computation-intensive algorithms
• Variety of recommendations
• Lots of A/B testing required
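The "lots of A/B testing" note above can be made concrete with a small sketch. One common way to split members between variants is to hash the member ID together with an experiment name, so the same member always sees the same variant. This is an illustration only, not a description of LinkedIn's system; the function name `assign_variant`, the experiment name, and the member IDs are all hypothetical.

```python
import hashlib


def assign_variant(member_id: str, experiment: str, treatment_percent: int = 50) -> str:
    """Deterministically map a member to 'treatment' or 'control'.

    Hypothetical helper: hashing (experiment, member_id) gives a stable
    bucket in [0, 99], so repeated calls return the same variant.
    """
    if not 0 <= treatment_percent <= 100:
        raise ValueError("treatment_percent must be between 0 and 100")
    key = f"{experiment}:{member_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < treatment_percent else "control"


if __name__ == "__main__":
    # Hypothetical member IDs and experiment name.
    for member in ["member-1", "member-2", "member-3"]:
        print(member, assign_variant(member, "jobs-recs-v2"))
```

Including the experiment name in the hash keeps assignments independent across experiments, so a member in the treatment group of one test is not automatically in the treatment group of every other test.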
50% of job views/applications by members are a direct result of recommendations. Similar results across all recommendations.
• Aggregations
• Complex transformations
• Long-term data storage
• Load sharing (?)
The Hadoop impact: moving ETL jobs to Hadoop has helped make data available for ad hoc queries by data scientists.
We have a unique perspective into data. Before the collapse, we saw substantial spikes in user activity for 5 companies during major financial events. One hypothesis is that many of the employees left the financial industry. According to the LinkedIn data set, that just isn’t true. (Bank of America acquired Merrill Lynch, and Nomura acquired Lehman Brothers’ franchise in the Asia Pacific region.) Barclays was by far the biggest beneficiary, scooping up 10% of the laid-off talent, followed by Credit Suisse at 1.5% and Citigroup at 1.1%.
ENG SLIDE. What is a data scientist? What are the different technologies, big data, challenges and opportunities? Open source: InMaps (hackday projects), full-fledged products.
Add images for SQL/MapReduce
Hadoop is, and will always be, optimized for sequential reads and throughput rather than speed of completion.
Use abstractions! Pig and Hive
No random reads/writes. Native APIs insufficient.
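To illustrate the "use abstractions" point, here is a minimal in-memory sketch of what even a simple count-by-key aggregation looks like when written as explicit map, shuffle, and reduce steps. It is plain Python that imitates the MapReduce model and does not use the Hadoop API; the event data and function names are made up for the example.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical page-view events: (member_id, page).
EVENTS = [
    ("m1", "jobs"),
    ("m2", "profile"),
    ("m1", "jobs"),
    ("m3", "jobs"),
    ("m2", "groups"),
]


def map_phase(record):
    """Emit one (page, 1) pair per event."""
    _member_id, page = record
    yield page, 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run the three phases sequentially over an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(EVENTS))  # {'jobs': 3, 'profile': 1, 'groups': 1}
```

In Hive, the same aggregation would be roughly a single statement such as `SELECT page, COUNT(*) FROM events GROUP BY page` (the table name `events` is assumed here), which is the kind of saving the abstractions offer over hand-written jobs.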
Big data and Hadoop. September 2012. Hari Shankar Menon, Software engineer, LinkedIn
About me
• LinkedIn Engineering, Data warehouse team
• Previously, Software engineer @Clickable – worked on building the reporting and analytics platform on Hadoop and HBase
• Hadoop and open-source enthusiast
Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
Our mission: Connect the world’s professionals to make them more productive and successful
LinkedIn by numbers
• 175M+ members
• ~2/sec new members joining
• >2M company pages
• 85% of Fortune 100 companies use LinkedIn to hire**
• ~4.2B professional searches in 2011
[Chart: LinkedIn Members (Millions), 2004 to 2010: 2, 4, 8, 17, 32, 55, 90]
*as of Nov 4, 2011 **as of June 30, 2011
Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
What is big data?*
* Chart from Philip Russom, Research Director, TDWI
Infrastructure technologies
• Primary data store (front-end)
• Document-oriented store
• Distributed key-value store
• Distributed PubSub messaging
• Database change replication
Search technologies
• SenseiDB
• Zoie
• Bobo
Open source: http://data.linkedin.com/opensource
Agenda: About LinkedIn; Data Infrastructure overview; Hadoop@LinkedIn; Challenges
What is Hadoop; Evolution of Hadoop; Impact
Hadoop @ LinkedIn
• Recommendation systems
  – Generating recommendations
  – Modeling
  – A/B testing
  – Grandfathering
• Data warehouse/ETL
  – Raw data storage
  – Aggregations
  – Heavy lifting
• Data sciences
  – Strategic analyses
  – Experimentation sandbox
The Recommendations opportunity
• Relevance/Latency
• Offline computation
• Caching
[Figure labels: Pandora; Search for People; Events You May Be Interested In; Groups; browse maps]
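The offline computation and caching bullets on this slide can be sketched as two steps: a batch job precomputes recommendations for every member, and the serving path only does a cache lookup. The example below is a toy under stated assumptions and not LinkedIn's algorithm. The membership data, the co-membership scoring rule, and the names `compute_offline` and `RecommendationCache` are all hypothetical.

```python
from collections import Counter

# Hypothetical data: the groups each member has joined.
MEMBERSHIPS = {
    "alice": {"hadoop", "python"},
    "bob": {"hadoop", "scala"},
    "carol": {"hadoop", "python", "pig"},
}


def compute_offline(memberships, top_n=2):
    """Batch step: recommend groups joined by members who share a group."""
    recs = {}
    for member, groups in memberships.items():
        counts = Counter()
        for other, other_groups in memberships.items():
            if other != member and groups & other_groups:
                # Count only groups the member has not joined yet.
                counts.update(other_groups - groups)
        # Sort by count (descending), then by name for a stable order.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        recs[member] = [group for group, _ in ranked[:top_n]]
    return recs


class RecommendationCache:
    """Serving step: constant-time lookups over precomputed results."""

    def __init__(self):
        self._store = {}

    def load(self, recs):
        """Replace the cache contents with a fresh batch output."""
        self._store = dict(recs)

    def get(self, member):
        """Return cached recommendations, or an empty list if none exist."""
        return self._store.get(member, [])


if __name__ == "__main__":
    cache = RecommendationCache()
    cache.load(compute_offline(MEMBERSHIPS))
    print(cache.get("alice"))  # ['pig', 'scala']
```

The trade-off is the relevance/latency one named on the slide: lookups are fast because nothing is computed at request time, but results are only as fresh as the last batch run.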