Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering at WWW 2015


  1. 1. Big Data Ecosystem at LinkedIn. Big Data Innovators Gathering (BIG 2015) at WWW 2015
  2. 2. LinkedIn: Largest Professional Network 2 360M members 2 new members/sec
  3. 3. Rich Data Driven Products at LinkedIn 3 Similar Profiles Connections News Skill Endorsements
  4. 4. How to build Data Products 4 • Data Ingress • Moving data from online to offline system • Data Processing • Managing offline processes • Data Egress • Moving results from offline to online system
  5. 5. Example Data Product: PYMK 5 • People You May Know (PYMK): recommend members to connect
  6. 6. Outline 6 • Data Ingress • Moving data from online to offline system • Data Processing • Managing offline processes • Data Egress • Moving results from offline to online system
  7. 7. Ingress - types of Data 7 • Database data: member profile, connections, … • Activity data: Page views, Impressions, etc. • Application and System metrics • Service logs
  8. 8. Data Ingress - Point-to-point Pipelines 8 • O(n^2) data integration complexity • Fragile, delayed, lossy • Non-standardized
  9. 9. Data Ingress - Centralized Pipeline 9 • O(n) data integration complexity • More reliable • Standardizable
  10. 10. Data Ingress: Apache Kafka 10 • Publish subscribe messaging • Producers send messages to Brokers • Consumers read messages from Brokers • Messages are sent to a topic • E.g. PeopleYouMayKnowTopic • Each topic is broken into one or more ordered partitions of messages
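     To make the publish/subscribe flow on this slide concrete, here is a minimal sketch using the open source kafka-python client; the broker address, topic name, key, and payload are illustrative assumptions, and LinkedIn's production producers and consumers are Java/Scala clients rather than this code.
      # Hedged sketch with the kafka-python client; broker, topic, and key are hypothetical.
      from kafka import KafkaProducer, KafkaConsumer

      producer = KafkaProducer(bootstrap_servers="broker:9092")
      # messages with the same key hash to the same ordered partition of the topic
      producer.send("PeopleYouMayKnowTopic", key=b"member-42", value=b'{"viewer": 42}')
      producer.flush()

      consumer = KafkaConsumer(
          "PeopleYouMayKnowTopic",
          bootstrap_servers="broker:9092",
          group_id="pymk-etl",             # consumers in a group split the partitions
          auto_offset_reset="earliest",
      )
      for message in consumer:
          print(message.partition, message.offset, message.value)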
  11. 11. Kafka: Data Evolution and Loading 11 • Standardized schema for each topic • Avro • Central repository • Producers/consumers use the same schema • Data verification - audits • ETL to Hadoop • Map-only jobs load data from the brokers Goodhope et al., IEEE Data Eng. 2012
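     As an illustration of the shared-schema idea, the sketch below serializes and deserializes a record with fastavro; the PageViewEvent schema and its fields are hypothetical stand-ins for the schemas kept in the central repository, not LinkedIn's actual registered schemas.
      import io
      from fastavro import parse_schema, schemaless_writer, schemaless_reader

      # Hypothetical topic schema; real schemas live in a central repository
      # shared by producers and consumers.
      schema = parse_schema({
          "type": "record",
          "name": "PageViewEvent",
          "fields": [
              {"name": "memberId", "type": "long"},
              {"name": "pageKey", "type": "string"},
              {"name": "timestamp", "type": "long"},
          ],
      })

      # Producer side: encode the record against the shared schema.
      buf = io.BytesIO()
      schemaless_writer(buf, schema, {"memberId": 42, "pageKey": "pymk", "timestamp": 1428000000})

      # Consumer side (e.g. the Hadoop ETL job): decode with the same schema.
      buf.seek(0)
      print(schemaless_reader(buf, schema))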
  12. 12. Outline 12 • Data Ingress • Moving data from online to offline system • Data Processing • Batch processing using Hadoop, Azkaban, Cubert • Stream processing using Samza • Iterative processing using Spark • Data Egress • Moving results from offline to online system
  13. 13. Data Processing: Hadoop 13 • Ease of programming • High-level Map and Reduce functions • Scalable to very large clusters • Fault tolerant • Speculative execution, auto restart of failed jobs • Scripting languages: Pig, Hive, Scalding
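     For readers new to the model, here is a minimal Hadoop Streaming style mapper and reducer in Python that count connections per member; the tab-separated (source_id, dest_id) input format and the job itself are illustrative assumptions, not LinkedIn's actual code.
      # mapper.py - emit (source_id, 1) for every connection edge read from stdin
      import sys

      for line in sys.stdin:
          source_id, dest_id = line.rstrip("\n").split("\t")
          print(f"{source_id}\t1")

      # reducer.py - Hadoop sorts by key between map and reduce, so counts can be
      # accumulated one member at a time
      import sys

      current_key, count = None, 0
      for line in sys.stdin:
          key, value = line.rstrip("\n").split("\t")
          if key != current_key:
              if current_key is not None:
                  print(f"{current_key}\t{count}")
              current_key, count = key, 0
          count += int(value)
      if current_key is not None:
          print(f"{current_key}\t{count}")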
  14. 14. Data Processing: Hadoop at LinkedIn 14 • Used for data products, feature computation, training models, analytics and reporting, troubleshooting, … • Native MapReduce, Pig, Hive • Workflows with 100s of Hadoop jobs • 100s of workflows • Processing petabytes of data every day
  15. 15. Data Processing Example: PYMK Feature Engineering 15 • How do people know each other? • Triangle closing: Prob(Bob knows Carol) ~ the # of common connections (e.g. Alice is connected to both Bob and Carol)
  16. 16. Data Processing in Hadoop Example 16 How to do PYMK Triangle Closing in Hadoop (Pig):
      -- connections in (source_id, dest_id) format in both directions
      connections = LOAD 'connections' USING PigStorage();
      group_conn = GROUP connections BY source_id;
      pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
      -- second degree pairs (id1, id2): aggregate and count common connections
      common_conn = GROUP pairs BY (id1, id2);
      common_conn = FOREACH common_conn GENERATE FLATTEN(group) as (source_id, dest_id),
                    COUNT(pairs) as common_connections;
      STORE common_conn INTO 'common_conn' USING PigStorage();
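     For readers less familiar with Pig, the same triangle-closing idea is sketched below in plain Python on a toy adjacency list; the data and the filter to second-degree pairs are illustrative only.
      from collections import defaultdict
      from itertools import combinations

      # toy undirected connection graph as an adjacency list
      connections = {
          "alice": {"bob", "carol"},
          "bob": {"alice", "carol", "dave"},
          "carol": {"alice", "bob"},
          "dave": {"bob"},
      }

      common = defaultdict(int)
      for member, friends in connections.items():
          # every pair of this member's connections shares `member` as a common connection
          for a, b in combinations(sorted(friends), 2):
              if b not in connections[a]:      # keep only second-degree (not yet connected) pairs
                  common[(a, b)] += 1

      for (a, b), n in sorted(common.items()):
          print(a, b, n)                       # e.g. alice dave 1 (via bob)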
  17. 17. 17 How to manage Production Hadoop Workflow
  18. 18. Azkaban: Hadoop Workflow management 18 • Configuration • Dependency management • Access control • Scheduling and SLA management • Monitoring, history
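     As a rough illustration of how such a workflow is declared, below is a sketch of classic Azkaban .job property files; the job names, commands, and scripts are made up, not the actual PYMK workflow.
      # triangle_closing.job
      type=command
      command=pig -f triangle_closing.pig

      # score_pymk.job  (runs only after triangle_closing succeeds)
      type=command
      command=pig -f score_pymk.pig
      dependencies=triangle_closing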
  19. 19. Distributed Machine Learning: ML-ease 20 • ADMM Logistic Regression for binary response prediction Agarwal et al. 2014
  20. 20. Limitations of Hadoop: Join and Group By 21
      -- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
      SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman
      GROUP BY A.Product, B.Location
      • Common Hadoop MapReduce/Pig/Hive implementation:
      • MapReduce job 1: load the data, shuffle, and reduce to perform the inner join; store the output
      • MapReduce job 2: load that output, shuffle on the group-by keys, and aggregate on the reducers to generate the final output
  21. 21. Limitations of Triangle Closing Using Hadoop 22 • Large amount of data to shuffle from Mappers to Reducers
      -- connections in (source_id, dest_id) format in both directions
      connections = LOAD 'connections' USING PigStorage();
      group_conn = GROUP connections BY source_id;
      pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) as (id1, id2);
      common_conn = GROUP pairs BY (id1, id2);
      -- shuffling all 2nd degree connections: terabytes of data
      common_conn = FOREACH common_conn GENERATE FLATTEN(group) as (source_id, dest_id),
                    COUNT(pairs) as common_connections;
      STORE common_conn INTO 'common_conn' USING PigStorage();
  22. 22. Cubert 23 • An open source project built for analytics needs • Map-side aggregation • Minimizes intermediate data and shuffling • Fast and scalable primitives for joins and aggregation • Partitions data into blocks • Specialized operators: MeshJoin, Cube • 5-60X faster in experiments • Developer friendly - script-like language Vemuri et al. VLDB 2014
  23. 23. Cubert Design 24 • Language • Scripting language • Physical - write MR programs • Execution • Data movement: Shuffle, Blockgen, Combine, Pivot • Primitives: MeshJoin, Cube • Data blocks: partition of data by cost Vemuri et al. VLDB 2014
  24. 24. Cubert Script: Count Daily/Weekly Stats 25
      JOB "create blocks of the fact table"
        MAP {
          data = LOAD ("$FactTable", $weekAgo, $today) USING AVRO();
        }
        // create blocks of one week of data with a cost function
        BLOCKGEN data BY ROW 1000000 PARTITIONED ON userId;
        STORE data INTO "$output/blocks" USING RUBIX;
      END
      JOB "compute cubes"
        MAP {
          data = LOAD "$output/blocks" USING RUBIX;
          // create a new column 'todayUserId' for today's records only
          data = FROM data GENERATE country, locale, userId, clicks,
                 CASE(timestamp == $today, userId) AS todayUserId;
        }
        // create the three cubes in a single job to count daily/weekly users and clicks
        CUBE data BY country, locale INNER userId
             AGGREGATES COUNT_DISTINCT(userId) as weeklyUniqueUsers,
                        COUNT_DISTINCT(todayUserId) as dailyUniqueUsers,
                        SUM(clicks) as totalClicks;
        STORE data INTO "$output/results" USING AVRO();
      END
      Vemuri et al. VLDB 2014
  25. 25. Cubert Example: Join and Group By 26 Vemuri et al. VLDB 2014
      -- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
      SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman
      GROUP BY A.Product, B.Location
      • Sort A by Product and B by Location
      • Divide A and B into specialized blocks sorted by the group-by keys
      • Load A's blocks in memory and stream B's blocks through the join
      • The group-by can be performed immediately after the join
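     The sketch below shows the general idea in Python, not Cubert's actual MeshJoin operator: hold one relation in memory keyed by the join key, stream the other, and aggregate immediately instead of materializing the joined output. The tiny datasets and the COUNT aggregate are assumptions.
      from collections import defaultdict

      A = [("s1", "laptop"), ("s2", "phone"), ("s1", "phone")]   # (Salesman, Product)
      B = [("s1", "NY"), ("s2", "SF")]                           # (Salesman, Location)

      location_by_salesman = dict(B)          # in-memory side of the join
      counts = defaultdict(int)               # stand-in for SomeAggregate(): COUNT(*)

      for salesman, product in A:             # streamed side
          location = location_by_salesman.get(salesman)
          if location is not None:            # inner join on Salesman
              counts[(product, location)] += 1   # group by (Product, Location) on the fly

      print(dict(counts))   # {('laptop', 'NY'): 1, ('phone', 'SF'): 1, ('phone', 'NY'): 1}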
  26. 26. Cubert Example: Triangle Closing 27 • Divide connections (src, dest) into blocks • Duplicate the connection graph: G1, G2 • Sort G1 edges (src, dest) by src • Sort G2 edges (src, dest) by dest • MeshJoin G1 and G2 such that G1.dest = G2.src • Aggregate by (G1.src, G2.dest) to get the number of common connections • 50% speedup
  27. 27. Cubert Summary 28 Vemuri et al. VLDB 2014 • Built for analytics needs • Faster and scalable: 5-60X • Working well in practice
  28. 28. Outline 29 • Ingress • Moving data from online to offline system • Offline Processing • Batch processing - Hadoop, Azkaban, Cubert • Stream processing - Samza • Iterative processing - Spark • Egress • Moving results from offline to online system
  29. 29. Samza 30 • Samza: streaming computation • Runs on top of a messaging layer like Kafka for input/output • Low latency • Stateful processing through a local store • Many use cases at LinkedIn • Site-speed monitoring • Data standardization
  30. 30. Samza: Site Speed Monitoring 31 • LinkedIn homepage assembled by calling many services • Each service logs through Kafka what went on with a request Id
  31. 31. Samza: Site Speed Monitoring 32 • The complete record of a request is scattered across Kafka logs • Problem: combine these logs to generate a holistic view
  32. 32. Samza: Site Speed Monitoring 33 • Hadoop/MR: join the logs using the request Id - once a day • Too late to troubleshoot any issue • Samza: join the Kafka logs in near real time using the requestId
  33. 33. Samza: Site Speed Monitoring 34 • Samza: join the Kafka logs in near real time using the requestId • Two jobs • Partition the Kafka stream by request Id • Aggregate all the records for a request Id Fernandez et al. CIDR 2015
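     Samza jobs themselves are written in Java/Scala; the sketch below only illustrates the two-stage pattern (repartition by request Id, then aggregate) with plain kafka-python clients. The topic names, field names, and the in-memory dict standing in for Samza's local state store are all assumptions, and the two stages would run as two separate jobs.
      import json
      from collections import defaultdict
      from kafka import KafkaConsumer, KafkaProducer

      # Stage 1: re-key service-call events by request Id so that all events for
      # one request land in the same partition of an intermediate topic.
      producer = KafkaProducer(bootstrap_servers="broker:9092")
      stage1 = KafkaConsumer("service_call_events", bootstrap_servers="broker:9092",
                             group_id="sitespeed-repartition")
      for msg in stage1:
          event = json.loads(msg.value)
          producer.send("events_by_request_id",
                        key=event["requestId"].encode(),   # Kafka hashes the key to pick a partition
                        value=msg.value)

      # Stage 2 (separate job): assemble every event for a request Id into one record
      # (a dict stands in for Samza's local store; real code would window and expire entries).
      trees = defaultdict(list)
      stage2 = KafkaConsumer("events_by_request_id", bootstrap_servers="broker:9092",
                             group_id="sitespeed-assemble")
      for msg in stage2:
          trees[msg.key].append(json.loads(msg.value))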
  34. 34. Outline 35 • Ingress • Moving data from online to offline system • Offline Processing • Batch processing - Hadoop, Azkaban, Cubert • Stream processing - Samza • Iterative processing - Spark • Egress • Moving results from offline to online system
  35. 35. Iterative Processing using Spark 36 • Limitations of MapReduce • What is Spark? • Spark at LinkedIn
  36. 36. Limitations of MapReduce 37 • Iterative computation is slow • Inefficient multi-pass computation • Intermediate data written in distributed file system
  37. 37. Limitations of MapReduce 38 • Interactive computation is slow • Same data is loaded again from distributed file system
  38. 38. Example: ADMM at LinkedIn 39 • Intermediate data is stored in distributed file system - slow Intermediate data in HDFS
  39. 39. SPARK 40 • Extends programming language with a distributed data structure • Resilient Distributed Datasets (RDD) • can be stored in memory • Faster iterative computation • Faster interactive computation • Clean APIs in Python, Scala, Java • SQL, Streaming, Machine learning, graph processing support Matei Zaharia et al. NSDI 2012
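     A minimal PySpark sketch of why in-memory RDDs help iterative computation: the parsed input is cached once and reused on every pass instead of being re-read from the distributed file system. The HDFS path, feature count, and gradient-descent details are illustrative assumptions.
      import numpy as np
      from pyspark import SparkContext

      sc = SparkContext(appName="iterative-sketch")

      # parse once, keep in memory across iterations (assumes 10 features + 1 label per line)
      points = (sc.textFile("hdfs:///data/training_points")
                  .map(lambda line: np.array(line.split(","), dtype=float))
                  .cache())

      w = np.zeros(10)
      for _ in range(20):                      # every pass reuses the cached RDD
          gradient = (points
                      .map(lambda p: p[:-1] * (p[:-1].dot(w) - p[-1]))
                      .reduce(lambda a, b: a + b))
          w -= 0.01 * gradient

      sc.stop()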
  40. 40. Spark at LinkedIn 41 • ADMM on Spark • Intermediate data is stored in memory - faster Intermediate data in memory
  41. 41. Outline 42 • Data Ingress • Moving data from online to offline system • Data Processing • Batch processing - Hadoop, Azkaban, Cubert • Iterative processing - Spark • Stream processing - Samza • Data Egress • Moving results from offline to online system
  42. 42. Data Egress - Key/Value 43 • Key-value store: Voldemort • Based on Amazon’s Dynamo • Distributed • Scalable • Bulk load from Hadoop • Simple to use • store results into ‘url’ using KeyValue(‘member_id’) Sumbaly et al. FAST 2012
  43. 43. Data Egress - Streams 44 • Stream - Kafka • Hadoop job as a Producer • Service acts as Consumer • Simple to use • store data into ‘url’ using Stream(“topic=x“) Goodhope et al., IEEE Data Eng. 2012
  44. 44. Conclusion 45 • Rich primitives for Data Ingress, Processing, Egress • Data Ingress: Kafka, ETL • Data Processing • Batch processing - Hadoop, Cubert • Stream processing - Samza • Iterative processing - Spark • Data Egress: Voldemort, Kafka • Allow data scientists to focus on building data products
  45. 45. Future Opportunities 46 • Models of computation • Efficient Graph processing • Distributed Machine Learning
  46. 46. 47 Acknowledgement Thanks to data team at LinkedIn: data.linkedin.com Contact: mtiwari@linkedin.com @mitultiwari

Editor's Notes

  • Hi Everyone. I am Mitul Tiwari. Today I am going to talk about Big Data Ecosystem at LinkedIn.
  • LinkedIn is the largest professional network with more than 360M members and it’s growing fast with more than 2 members joining per second.

    What’s LinkedIn’s Mission? …
    LinkedIn’s mission is to connect the world’s professionals and make them more productive and successful.
    - Members can connect with each other and maintain their professional network on LinkedIn.
  • A rich recommender ecosystem at LinkedIn: from connections, news, skills, jobs, companies, groups, search queries, talent, similar profiles, ...
  • How do we build these data driven products?
    Building these data products involves three major steps. First, moving production data from the online to the offline system. Second, processing data in the offline system using technologies such as Hadoop, Samza, and Spark. And finally, moving the results or processed data from the offline system to the online serving system.
  • Let’s take a concrete data product example: People You May Know at LinkedIn. Production data such as database data and activity data is moved to the offline system. The offline system processes this data to generate PYMK recommendations for each member. The recommendation output is stored in the key-value store Voldemort. The production system queries this store to get PYMK recommendations for a given member and serves them online.

    Any deployed large-scale recommendation system has to deal with scaling challenges.
  • Let me talk about each of these three steps in more detail starting with Ingress that is moving data from online system to offline system.
  • There are various types of data at LinkedIn in online production system.
    Database data contains various member information such as profiles and connections. This is persistent data that the member has provided.
    Activity data contains various kinds of member activities, such as which pages a member viewed or which People You May Know results were shown (impressed) to the member.
    Performance and system metrics of the online serving systems are also stored to monitor the health of the serving system.
    Finally, each online service generates various kinds of log information, for example, what kind of request parameters were used by the People You May Know backend service while serving the results
  • The initial solution built for data ingress was point-to-point: each production service had many offline clients, and data was transferred directly from a production service to an offline system.
    There are many limitations of such a solution.
    First, O(N^2) data integration complexity. That is, each online system could be transferring data to all the offline systems.
    Second, it is fragile and easy to break. It is very hard to monitor the correctness of the data flow, and because of the O(N^2) complexity it can easily overload a service or data pipeline, resulting in delayed or lost data.
    Finally, this solution is very hard to standardize, and each point-to-point data transfer can come up with its own schema.
  • At LinkedIn we have built a centralized data pipeline.
    This reduces point-to-point data transfer complexity to O(N)
    We could build more reliable data pipeline
    And this data pipeline is standardizable.
  • At LinkedIn we have built an open source data ingress pipeline called Kafka.
    Kafka is a publish subscribe messaging system
    Producers of data (such as online serving systems) send data to brokers.
    Consumers such as offline system can read messages from brokers
    Messages are sent to a particular topic. For example, PYMK impressions are sent to a topic such as PYMKImpressionTopic.
    Each topic is broken into one or more ordered partitions of messages
  • Kafka uses a standardized schema for each topic
    We use Avro schemas, which are JSON-like schemas with efficient serialization and deserialization
    There is a central repository of the schema for each topic
    Both producers and consumers use the same topic schema
    Kafka also simplifies data verification using audits on the number of produced messages and the number of consumed messages
    Kafka also facilitates ETL of data into Hadoop by using map-only jobs to load data from the brokers
    For more details check out this IEEE Data Engineering paper.
  • Once we have data available in offline data processing system from production, we use various technologies such as Hadoop, Samza, and Spark to process this data. Let me start with talking about batch processing technologies based on Hadoop.
  • Hadoop has been very successful at scaling offline computation.
    Hadoop eased distributed programming by providing simple high-level primitives: the Map and Reduce functions
    Hadoop is scalable to very large clusters
    Hadoop MapReduce provides fault-tolerance features such as speculative execution and automatic restart of failed MapReduce tasks
    Many scripting languages such as Pig, Hive, and Scalding are built on top of Hadoop to further ease programming
  • At LinkedIn, Hadoop is used for building data products, feature computation, training machine learning models, business analytics, troubleshooting by analyzing data, etc.
    We have workflows with 100s of Hadoop MapReduce jobs
    And 100s of such workflows
    We process petabytes of data on Hadoop every day
  • One good signal is the number of common connections: Bob and Carol are likely to know each other if they share a common connection.
    As the number of common connections increases, the likelihood of the two people knowing each other increases.
  • Here is an example of data processing using Hadoop.
    For PYMK an important feature is triangle closing that is, finding the second degree connections and the number of common connections between two members
    Here is a PIG script that computes that
    Go through the PIG script
  • Here is the PYMK production Azkaban Hadoop workflow, which involves dozens of hadoop jobs and dependencies
    Looks complicated but it’s trivial to manage such workflows using Azkaban
  • After feature engineering and getting features such as triangle closing, organizational overlap scores for schools and companies, we apply a machine learning model to predict probability of two people knowing each other.
    We also incorporate user feedback, both explicit and implicit, to enhance the connection probability estimate.
    We use past connections as the positive response variable to train our machine learning model.
  • ADMM stands for Alternating Direction Method of Multipliers (Boyd et al. 2011). The basic idea of ADMM is as follows: ADMM treats large-scale logistic regression model fitting as a convex optimization problem with constraints. While minimizing the user-defined loss function, it enforces an extra constraint that the coefficients from all partitions have to be equal. To solve this optimization problem, ADMM uses an iterative process. For each iteration it partitions the big data into many small partitions and fits an independent logistic regression for each partition. Then it aggregates the coefficients collected from all partitions, learns the consensus coefficients, and sends them back to all partitions to retrain. After 10-20 iterations, it ends up with a converged solution that is theoretically close to what would have been obtained by training on a single machine.
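    To make the iteration above concrete, here is a toy single-machine numpy/scipy sketch of consensus ADMM for logistic regression; the synthetic data, the value of rho, and the solver choice are assumptions, and this is not the actual ML-ease implementation.
      import numpy as np
      from scipy.optimize import minimize

      rng = np.random.default_rng(0)
      d, rho, n_partitions = 5, 1.0, 4

      # synthetic partitioned data: each partition holds (X_i, y_i) with labels in {0, 1}
      w_true = rng.normal(size=d)
      data = []
      for _ in range(n_partitions):
          X = rng.normal(size=(200, d))
          y = (1.0 / (1.0 + np.exp(-X @ w_true)) > rng.uniform(size=200)).astype(float)
          data.append((X, y))

      def local_objective(x, X, y, z, u):
          # logistic loss on this partition plus the ADMM proximal (consensus) term
          signed_margin = (2 * y - 1) * (X @ x)
          return np.sum(np.logaddexp(0.0, -signed_margin)) + (rho / 2) * np.sum((x - z + u) ** 2)

      x = np.zeros((n_partitions, d))   # per-partition coefficients
      u = np.zeros((n_partitions, d))   # per-partition dual variables
      z = np.zeros(d)                   # consensus coefficients
      for _ in range(20):               # 10-20 iterations are typically enough
          for i, (X, y) in enumerate(data):   # in the distributed setting this runs in parallel
              x[i] = minimize(local_objective, x[i], args=(X, y, z, u[i])).x
          z = np.mean(x + u, axis=0)          # aggregate: average the local solutions
          u += x - z                          # dual update pushes partitions toward consensus

      print(z)   # consensus coefficients, close to a single-machine fit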
  • Load one week of data and build an OLAP cube over country and locale as dimensions, counting unique users over the week, unique users for today, and the total number of clicks.
  • Consider what data is necessary to build a particular view of the LinkedIn home page. We provide interesting news via Pulse, timely updates from your connections in the Network Update Stream, potential new connections from People You May Know, advertisements targeted to your background, and much, much more.

    Each service publishes its logs to its own specific Kafka topic, which is named after the service, i.e. <service>_service_call. There are hundreds of these topics, one for each service, and they share the same Avro schema, which allows them to be analyzed together. This schema includes timing information, who called whom, what was returned, etc., as well as the specifics of what each particular service call did. Additionally, log4j-style warnings and errors are also routed to Kafka in a separate <service>_log_event topic.
  • After a request has been satisfied, the complete record of all the work that went into generating it is scattered across the Kafka logs for each service that participated. These individual logs are great tools for evaluating the performance and correctness of the individual services themselves, and are carefully monitored by the service owners. But how can we use these individual elements to gain a larger view of the entire chain of calls that created that page? Such a perspective would allow us to see how the calls are interacting with each other, identify slow services or highlight redundant or unnecessary calls.
  • By creating a unique value or GUID for each call at the front end and propagating that value across all subsequent service calls, it's possible to tie them together and define a tree-structure of the calls starting from the front end all the way through to the leaf service events. We call this value the TreeID and have built one of the first production Samza workflows at LinkedIn around it: the Call Graph Assembly (CGA) pipeline. All events involved in building the page now have such a TreeID, making it a powerful key on which to join data in new and fascinating ways.
    The CGA pipeline consists of two Samza jobs: the first repartitions the events coming from the sundry service call Kafka topics, creating a new key from their TreeIDs, while the second job assembles those repartitioned events into trees corresponding to the original calls from the front end request. This two-stage approach looks quite similar to the classic Map-Reduce approach where mappers will direct records to the correct reducer and those reducers then aggregate them together in some fashion. We expect this will be a common pattern in Samza jobs, particularly those that are implementing continuous, stream-based implementations of work that had previously been done in a batch fashion on Hadoop or similar situations.
  • That concludes my brief discussion on Stream processing using Samza. Next I am going to talk about iterative processing using Spark.
