Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering at WWW 2015

May 18, 2015

Editor's Notes

  1. Hi everyone. I am Mitul Tiwari. Today I am going to talk about the Big Data Ecosystem at LinkedIn.
  2. LinkedIn is the largest professional network, with more than 360M members, and it is growing fast, with more than two members joining per second. What is LinkedIn's mission? LinkedIn's mission is to connect the world's professionals and make them more productive and successful. Members can connect with each other and maintain their professional network on LinkedIn.
  3. LinkedIn has a rich recommender ecosystem: connections, news, skills, jobs, companies, groups, search queries, talent, similar profiles, and more.
  4. How do we build these data-driven products? Building them involves three major steps. First, moving production data from the online system to the offline system. Second, processing that data in the offline system using technologies such as Hadoop, Samza, and Spark. Finally, moving the results, the processed data, from the offline system back to the online serving system.
  5. Let's take a concrete data product example: People You May Know (PYMK) at LinkedIn. Production data, such as database data and activity data, is moved to the offline system. The offline system processes this data to generate PYMK recommendations for each member. The output is stored in Voldemort, a key-value store. The production system queries this store to fetch the PYMK recommendations for a given member and serves them online. Any deployed large-scale recommendation system has to deal with scaling challenges at every one of these stages; a sketch of the serving path follows this note.
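The serving side reduces to a single key lookup per page view. A minimal sketch under stated assumptions: `RecommendationStore` below is an illustrative in-memory stand-in, not Voldemort's actual client API.

```python
# Hypothetical sketch of the PYMK serving path: the offline flow writes
# (member_id -> ranked candidate list) into a key-value store, and the
# online service does one lookup per page view. RecommendationStore is
# an illustrative stand-in for a Voldemort client, not its real API.

class RecommendationStore:
    def __init__(self):
        self._data = {}                      # in-memory stand-in

    def put(self, member_id, recommendations):
        self._data[member_id] = recommendations

    def get(self, member_id):
        return self._data.get(member_id, [])

# Offline: bulk-load the batch-computed results.
store = RecommendationStore()
store.put(42, [(137, 0.91), (256, 0.87)])    # (candidate_id, score)

# Online: serve PYMK for a member with a single key lookup.
print(store.get(42))
```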
  6. Let me talk about each of these three steps in more detail, starting with ingress: moving data from the online system to the offline system.
  7. There are various types of data in LinkedIn's online production system. Database data contains member information such as profiles and connections; this is persistent data that the member has provided. Activity data captures member activities, such as which pages a member viewed or which People You May Know results were shown (impressed) to them. Performance and system metrics of the online serving systems are also stored, to monitor the health of those systems. Finally, each online service generates various kinds of log information, for example, which request parameters the People You May Know backend service used while serving results.
  8. The initial solution for data ingress was point-to-point: each production service had many offline clients, and data was transferred directly from a production service to an offline system. Such a solution has many limitations. First, O(N^2) data-integration complexity: each online system could be transferring data to every offline system. Second, it is fragile and easy to break: it is very hard to monitor the correctness of the data flow, and because of the O(N^2) fan-out it can easily overload a service or data pipeline, resulting in delayed or lost data. Finally, it is very hard to standardize, since each point-to-point transfer can invent its own schema.
  9. At LinkedIn we have built a centralized data pipeline instead. This reduces the data-transfer complexity from O(N^2) to O(N), lets us build a more reliable pipeline, and makes the pipeline standardizable: every producer and consumer of a data feed can agree on a single schema.
  10. At LinkedIn we have built an open-source data ingress pipeline called Kafka. Kafka is a publish-subscribe messaging system. Producers of data (such as online serving systems) send data to brokers, and consumers (such as the offline system) read messages from the brokers. Messages are sent to a particular topic; for example, PYMK impressions are sent to a topic such as PYMKImpressionTopic. Each topic is broken into one or more ordered partitions of messages. A minimal producer/consumer sketch follows this note.
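To make the publish/subscribe model concrete, here is a minimal sketch using the third-party kafka-python client. The topic name PYMKImpressionTopic comes from the talk; the broker address and JSON payload are assumptions for illustration.

```python
# Minimal Kafka publish/subscribe sketch using the kafka-python client.
# Broker address and payload shape are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: an online service publishes an impression event.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("PYMKImpressionTopic", {"viewer": 42, "candidate": 137})
producer.flush()

# Consumer side: an offline system reads the topic from the beginning.
consumer = KafkaConsumer(
    "PYMKImpressionTopic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:                 # loops until interrupted
    print(message.partition, message.offset, message.value)
```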
  11. Kafka uses a standardized schema for each topic. We use Avro schemas, which are JSON-like schemas with good serialization and deserialization properties. There is a central repository holding the schema for each topic, and both producers and consumers use the same topic schema (a hypothetical example follows this note). Kafka also simplifies data verification by auditing the number of produced messages against the number of consumed messages, and it facilitates ETL of data into Hadoop by using map-only jobs to load data from the brokers. For more details, check out the IEEE Data Engineering paper on Kafka.
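For concreteness, a hypothetical Avro schema for the impression topic might look like the following; the field names are invented for illustration, since the talk does not show LinkedIn's actual schemas.

```python
# A hypothetical Avro schema for PYMKImpressionTopic, expressed as the
# JSON document both producers and consumers would fetch from the
# central schema repository. Field names are invented for illustration.
PYMK_IMPRESSION_SCHEMA = {
    "type": "record",
    "name": "PYMKImpression",
    "namespace": "com.example.tracking",
    "fields": [
        {"name": "viewerId",    "type": "long"},
        {"name": "candidateId", "type": "long"},
        {"name": "position",    "type": "int"},
        {"name": "timestamp",   "type": "long"},
    ],
}
```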
  12. Once data from production is available in the offline system, we use various technologies such as Hadoop, Samza, and Spark to process it. Let me start with batch processing technologies based on Hadoop.
  13. Hadoop has been very successful at scaling offline computation. It eases distributed programming by providing simple high-level primitives, the map and reduce functions (a toy illustration follows this note). Hadoop scales to very large clusters, and Hadoop MapReduce provides fault-tolerance features such as speculative execution and automatically restarting failed tasks. Many scripting languages, such as Pig, Hive, and Scalding, are built on top of Hadoop to further ease programming.
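As a toy illustration of the two primitives (a single-process simulation, not Hadoop's actual Java API), here is a map/shuffle/reduce pass that counts connections per member:

```python
# Toy single-machine illustration of the map/reduce primitives; Hadoop
# provides the same model plus distribution and fault tolerance.
from collections import defaultdict

edges = [(1, 2), (1, 3), (2, 3)]          # illustrative connection pairs

# Map: emit (key, value) pairs for each input record.
def map_fn(edge):
    a, b = edge
    yield (a, 1)
    yield (b, 1)

# Shuffle: group intermediate values by key.
groups = defaultdict(list)
for edge in edges:
    for key, value in map_fn(edge):
        groups[key].append(value)

# Reduce: aggregate the values for each key.
def reduce_fn(key, values):
    return (key, sum(values))

print([reduce_fn(k, vs) for k, vs in groups.items()])
# [(1, 2), (2, 2), (3, 2)] -> connection count per member
```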
  14. At LinkedIn, Hadoop is used for building data products, feature computation, training machine learning models, business analytics, troubleshooting by analyzing data, and more. We have workflows with hundreds of Hadoop MapReduce jobs, and hundreds of such workflows; daily, we process petabytes of data on Hadoop.
  15. One good signal is common connections: Bob and Carol are likely to know each other if they share a common connection, and as the number of common connections increases, the likelihood that the two people know each other increases.
  16. Here is an example of data processing using Hadoop. For PYMK, an important feature is triangle closing: finding second-degree connections and the number of common connections between two members. The slide shows a Pig script that computes this; a sketch of the same computation follows this note.
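The Pig script itself lives on the slide and is not reproduced in these notes. As a stand-in, here is a minimal Python sketch of the same triangle-closing idea, with an invented toy connection list: self-join the connections on the shared member, then count common neighbors per candidate pair.

```python
# Triangle closing in miniature: find second-degree pairs and count
# their common connections. A toy stand-in for the Pig self-join on
# the slide, using an illustrative undirected connection list.
from collections import defaultdict
from itertools import combinations

connections = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

neighbors = defaultdict(set)
for a, b in connections:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Every pair of one member's neighbors shares that member as a common
# connection (the "closing" edge of the triangle).
common = defaultdict(int)
for member, friends in neighbors.items():
    for a, b in combinations(sorted(friends), 2):
        common[(a, b)] += 1

# Keep only pairs not already connected: these are PYMK candidates.
existing = {tuple(sorted(e)) for e in connections}
candidates = {p: c for p, c in common.items() if p not in existing}
print(candidates)   # {(1, 4): 2} -> 1 and 4 share two common connections
```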
  17. Here is the PYMK production Hadoop workflow in Azkaban, which involves dozens of Hadoop jobs and their dependencies. It looks complicated, but it is easy to manage such workflows using Azkaban.
  18. How to manage
  19. After feature engineering, with features such as triangle closing and organizational overlap scores for schools and companies, we apply a machine learning model to predict the probability of two people knowing each other. We also incorporate user feedback, both explicit and implicit, to refine the connection probability. We use past connections as the positive response variable to train the machine learning model; a small sketch follows this note.
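A minimal sketch of this modeling step, assuming (not from the talk) scikit-learn and invented feature values:

```python
# Sketch of the modeling step: predict P(two members know each other)
# from pairwise features. Features, data, and the use of scikit-learn
# are illustrative assumptions; the talk does not specify the stack.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per candidate pair: [common_connections, same_company, same_school]
X = np.array([[12, 1, 0], [3, 0, 1], [0, 0, 0], [7, 1, 1]])
# Labels: 1 if the pair actually connected later (past connections as
# the positive response variable), else 0.
y = np.array([1, 1, 0, 1])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[5, 1, 0]])[:, 1])   # connection probability
```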
  20. ADMM stands for Alternating Direction Method of Multipliers (Boyd et al. 2011). The basic idea is as follows: ADMM treats large-scale logistic regression model fitting as a convex optimization problem with constraints. While minimizing the user-defined loss function, it enforces the extra constraint that the coefficients from all partitions must be equal. To solve this optimization problem, ADMM uses an iterative process: in each iteration, it partitions the big data into many small partitions and fits an independent logistic regression on each partition. It then aggregates the coefficients from all partitions, learns the consensus coefficients, and sends them back to all partitions for retraining. After 10-20 iterations, it converges to a solution that is theoretically close to what you would have obtained by training on a single machine. A minimal numerical sketch follows this note.
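Here is a minimal single-machine simulation of that loop, following the consensus-ADMM formulation in Boyd et al. (2011). The data, penalty parameter, and iteration count are illustrative; in production the per-partition fits would run as parallel Hadoop tasks.

```python
# Minimal simulation of consensus ADMM for logistic regression
# (Boyd et al. 2011). Data, rho, and iteration count are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0])
X = rng.normal(size=(400, 2))
y = (rng.random(400) < 1 / (1 + np.exp(-(X @ w_true)))).astype(float)
parts = np.array_split(np.arange(400), 4)        # four data partitions

def local_loss(w, Xi, yi, z, u, rho):
    # Per-partition objective: logistic loss plus a quadratic penalty
    # pulling this partition's coefficients toward the consensus z.
    logits = Xi @ w
    nll = np.mean(np.logaddexp(0, -logits) + (1 - yi) * logits)
    return nll + (rho / 2) * np.sum((w - z + u) ** 2)

rho = 1.0
z = np.zeros(2)                                   # consensus coefficients
ws = [np.zeros(2) for _ in parts]                 # local coefficients
us = [np.zeros(2) for _ in parts]                 # scaled dual variables

for _ in range(20):
    # x-update: independent local fits (embarrassingly parallel).
    ws = [minimize(local_loss, w, args=(X[p], y[p], z, u, rho)).x
          for w, u, p in zip(ws, us, parts)]
    # z-update: consensus = average of (local coefficients + duals).
    z = np.mean([w + u for w, u in zip(ws, us)], axis=0)
    # u-update: duals accumulate each partition's disagreement with z.
    us = [u + w - z for w, u in zip(ws, us)]

print(z)   # approaches w_true as the iterations converge
```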
  21. TODO: get comfortable
  22. TODO: get comfortable with this slide
  23. Load one week of data and build an OLAP cube over country and locale as dimensions, computing unique users over the week, unique users for today, and the total number of clicks. A small sketch follows this note.
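A toy version of that cube, assuming (not from the talk) pandas and an invented click-event table:

```python
# Sketch of the cube described above: one week of click events rolled
# up over (country, locale) into weekly uniques, today's uniques, and
# total clicks. Event data and column names are illustrative.
import pandas as pd

events = pd.DataFrame({
    "member":  [1, 2, 1, 3, 2, 1],
    "country": ["US", "US", "IN", "IN", "US", "US"],
    "locale":  ["en", "en", "en", "hi", "en", "en"],
    "day":     ["mon", "mon", "tue", "wed", "sun", "sun"],
})

cube = events.groupby(["country", "locale"]).agg(
    weekly_unique_users=("member", "nunique"),
    total_clicks=("member", "size"),
)
today = (events[events["day"] == "sun"]          # "today" = latest day
         .groupby(["country", "locale"])["member"].nunique()
         .rename("todays_unique_users"))
print(cube.join(today).fillna(0))
```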
  24. TODO: get comfortable
  25. TODO: revise
  26. TODO: get comfortable
  27. Consider what data is necessary to build a particular view of the LinkedIn home page. We provide interesting news via Pulse, timely updates from your connections in the Network Update Stream, potential new connections from People You May Know, advertisements targeted to your background, and much more. Each service publishes its logs to its own Kafka topic, named after the service, i.e. <service>_service_call. There are hundreds of these topics, one per service, and they all share the same Avro schema, which allows them to be analyzed together. This schema includes timing information, who called whom, what was returned, and the specifics of what each particular service call did (a hypothetical record follows this note). Additionally, log4j-style warnings and errors are routed to Kafka in a separate <service>_log_event topic.
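To make the shared schema concrete, here is a hypothetical record from a <service>_service_call topic; the field names and values are invented for illustration, not LinkedIn's actual schema.

```python
# A hypothetical record in a <service>_service_call topic, showing the
# kind of fields the shared Avro schema carries. Names and values are
# invented for illustration.
service_call_event = {
    "treeId":         None,             # filled in once TreeID exists (next slides)
    "service":        "pymk-backend",
    "caller":         "frontend",
    "method":         "getRecommendations",
    "startMillis":    1431907200123,
    "durationMillis": 42,
    "status":         "OK",
}
```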
  28. After a request has been satisfied, the complete record of all the work that went into generating it is scattered across the Kafka logs of every service that participated. These individual logs are great tools for evaluating the performance and correctness of the individual services, and they are carefully monitored by the service owners. But how can we use these individual elements to gain a larger view of the entire chain of calls that created the page? Such a perspective would let us see how the calls interact with each other, identify slow services, and highlight redundant or unnecessary calls.
  29. By creating a unique value, or GUID, for each call at the front end and propagating that value across all subsequent service calls, it is possible to tie them together and define a tree structure of the calls, starting from the front end all the way down to the leaf service events. We call this value the TreeID, and we have built one of the first production Samza workflows at LinkedIn around it: the Call Graph Assembly (CGA) pipeline. All events involved in building the page now carry a TreeID, making it a powerful key on which to join data in new and fascinating ways. The CGA pipeline consists of two Samza jobs: the first repartitions the events coming from the sundry service-call Kafka topics, creating a new key from their TreeIDs, while the second assembles the repartitioned events into trees corresponding to the original front-end requests. This two-stage approach looks quite similar to classic MapReduce, where mappers direct records to the correct reducer and the reducers then aggregate them. We expect this to be a common pattern in Samza jobs, particularly those implementing continuous, stream-based versions of work that was previously done in batch on Hadoop.
  30. (Same notes as the previous slide; the CGA discussion spans two slides. A toy sketch of the two-stage pattern follows.)
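Samza's actual Java API is not shown in these notes, but the two-stage pattern can be simulated in a few lines of Python: stage one re-keys events by TreeID (like mappers routing records), stage two assembles each key's events into a call tree (like reducers aggregating). All event data here is illustrative.

```python
# Toy simulation of the CGA pattern: repartition service-call events by
# TreeID, then assemble each TreeID's events into one call tree.
from collections import defaultdict

events = [
    {"treeId": "g1", "service": "frontend",     "caller": None},
    {"treeId": "g1", "service": "pymk-backend", "caller": "frontend"},
    {"treeId": "g1", "service": "graph-db",     "caller": "pymk-backend"},
    {"treeId": "g2", "service": "frontend",     "caller": None},
]

# Stage 1: repartition -- route each event to its TreeID's partition.
partitions = defaultdict(list)
for event in events:
    partitions[event["treeId"]].append(event)

# Stage 2: assemble -- build a caller -> callees tree per TreeID.
def assemble(tree_events):
    children = defaultdict(list)
    for e in tree_events:
        children[e["caller"]].append(e["service"])
    return dict(children)

for tree_id, tree_events in partitions.items():
    print(tree_id, assemble(tree_events))
# g1 {None: ['frontend'], 'frontend': ['pymk-backend'], 'pymk-backend': ['graph-db']}
# g2 {None: ['frontend']}
```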
  31. That concludes my brief discussion of stream processing using Samza. Next, I am going to talk about iterative processing using Spark.
  32. ADMM example
  33. ADMM example
  34. ADMM example
  35. ADMM example
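Slides 32-35 walk through the ADMM example on Spark, which these notes do not reproduce. As a hedged illustration of why Spark suits such iterative algorithms, here is a toy sketch of the pattern: cache the partitioned data once, then on each iteration broadcast the consensus coefficients, run per-partition fits with mapPartitions, and average on the driver. The local solver is a placeholder (a single gradient step) and the dual-variable bookkeeping from the full ADMM loop is omitted; data and parameters are invented.

```python
# Sketch of the ADMM x-update pattern on Spark: cached partitions,
# broadcast consensus, per-partition fits, driver-side averaging.
# local_fit is a placeholder for the full local solve; the loop is a
# simplified consensus update, not the complete ADMM algorithm.
import numpy as np
from pyspark import SparkContext

def local_fit(records, z):
    # Placeholder local update: one logistic-regression gradient step.
    X = np.array([r[0] for r in records])
    y = np.array([r[1] for r in records])
    return z + 0.1 * (X.T @ (y - 1 / (1 + np.exp(-(X @ z))))) / len(y)

sc = SparkContext("local[4]", "admm-sketch")
rng = np.random.default_rng(0)
data = [(rng.normal(size=2), float(rng.integers(0, 2))) for _ in range(400)]
rdd = sc.parallelize(data, 4).cache()     # cache once, reuse every iteration

z = np.zeros(2)
for _ in range(10):
    zb = sc.broadcast(z)                  # ship consensus to all partitions
    local_ws = rdd.mapPartitions(
        lambda it: [local_fit(list(it), zb.value)]).collect()
    z = np.mean(local_ws, axis=0)         # simplified consensus update
print(z)
```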
  36. TODO: add reference
  37. TODO: get comfortable
  38. TODO: revise - add lessons, opportunities
  39. TODO: revise