Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering at WWW 2015


1. Big Data Ecosystem at LinkedIn
   BIG 2015 at WWW

2. LinkedIn: Largest Professional Network
   • 360M members, 2 new members/sec

3. Rich Data Driven Products at LinkedIn
   • Similar Profiles
   • Connections
   • News
   • Skill Endorsements

4. How to Build Data Products
   • Data Ingress
     • Moving data from online to offline systems
   • Data Processing
     • Managing offline processes
   • Data Egress
     • Moving results from offline to online systems

5. Example Data Product: PYMK
   • People You May Know (PYMK): recommends members to connect with

6. Outline
   • Data Ingress
     • Moving data from online to offline systems
   • Data Processing
     • Managing offline processes
   • Data Egress
     • Moving results from offline to online systems

7. Ingress: Types of Data
   • Database data: member profiles, connections, …
   • Activity data: page views, impressions, etc.
   • Application and system metrics
   • Service logs

8. Data Ingress: Point-to-Point Pipelines
   • O(n^2) data integration complexity
   • Fragile, delayed, lossy
   • Non-standardized

9. Data Ingress: Centralized Pipeline
   • O(n) data integration complexity
   • More reliable
   • Standardizable

10. Data Ingress: Apache Kafka
   • Publish-subscribe messaging
   • Producers send messages to brokers
   • Consumers read messages from brokers
   • Messages are sent to a topic, e.g. PeopleYouMayKnowTopic
   • Each topic is broken into one or more ordered partitions of messages (see the sketch below)
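A minimal Scala sketch of this produce/consume flow, shown against today's Kafka Java client for illustration (LinkedIn's 2015-era client API differed); the broker address, consumer group, and payload are hypothetical:

    // Produce to and consume from a topic; same-key messages share a partition.
    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import scala.jdk.CollectionConverters._

    object KafkaSketch {
      def main(args: Array[String]): Unit = {
        val pProps = new Properties()
        pProps.put("bootstrap.servers", "broker:9092") // hypothetical broker
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](pProps)
        // Messages with the same key always land in the same ordered partition.
        producer.send(new ProducerRecord("PeopleYouMayKnowTopic", "member-42", "pymk-event"))
        producer.close()

        val cProps = new Properties()
        cProps.put("bootstrap.servers", "broker:9092")
        cProps.put("group.id", "pymk-etl") // hypothetical consumer group
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        val consumer = new KafkaConsumer[String, String](cProps)
        consumer.subscribe(Collections.singletonList("PeopleYouMayKnowTopic"))
        for (r <- consumer.poll(Duration.ofSeconds(1)).asScala)
          println(s"partition=${r.partition} offset=${r.offset} value=${r.value}")
        consumer.close()
      }
    }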
11. Kafka: Data Evolution and Loading
   • Standardized schema for each topic
     • Avro
     • Central schema repository
     • Producers and consumers use the same schema (see the encoding sketch below)
   • Data verification: audits
   • ETL to Hadoop
     • Map-only jobs load data from the brokers
   Goodhope et al., IEEE Data Eng. 2012
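To make the shared-schema idea concrete, here is a hedged sketch of encoding one event with Avro's generic API in Scala; the PageViewEvent schema and its fields are hypothetical stand-ins for a repository-managed topic schema:

    // Parse a schema, build a record against it, and binary-encode it; the
    // resulting bytes are what a Kafka message would carry.
    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.EncoderFactory

    object AvroSketch {
      val schemaJson =
        """{"type":"record","name":"PageViewEvent","fields":[
          |  {"name":"memberId","type":"long"},
          |  {"name":"pageKey","type":"string"},
          |  {"name":"timestamp","type":"long"}
          |]}""".stripMargin

      def main(args: Array[String]): Unit = {
        val schema = new Schema.Parser().parse(schemaJson)
        val record: GenericRecord = new GenericData.Record(schema)
        record.put("memberId", 42L)
        record.put("pageKey", "homepage")
        record.put("timestamp", System.currentTimeMillis())

        val out = new ByteArrayOutputStream()
        val encoder = EncoderFactory.get().binaryEncoder(out, null)
        new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
        encoder.flush()
        println(s"encoded ${out.size()} bytes")
      }
    }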
12. Outline
   • Data Ingress
     • Moving data from online to offline systems
   • Data Processing
     • Batch processing using Hadoop, Azkaban, Cubert
     • Stream processing using Samza
     • Iterative processing using Spark
   • Data Egress
     • Moving results from offline to online systems

13. Data Processing: Hadoop
   • Ease of programming
     • High-level Map and Reduce functions
   • Scales to very large clusters
   • Fault tolerant
     • Speculative execution, automatic restart of failed jobs
   • Scripting languages: Pig, Hive, Scalding (word-count example below)
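As one illustration of those scripting layers, this is the classic word count in Scalding's fields-based Scala API; the input and output paths are hypothetical command-line arguments:

    // flatMap plays the Map role, groupBy/size the Reduce role.
    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))                                  // one tuple per input line
        .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
        .groupBy('word) { _.size }                             // reduce side: count per word
        .write(Tsv(args("output")))
    }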
14. Data Processing: Hadoop at LinkedIn
   • Used for data products, feature computation, model training, analytics and reporting, troubleshooting, …
   • Native MapReduce, Pig, Hive
   • Workflows with 100s of Hadoop jobs
   • 100s of workflows
   • Processing petabytes of data every day
15. Data Processing Example: PYMK Feature Engineering
   • How do people know each other? Triangle closing:
     Prob(Bob knows Carol) ~ the number of common connections between Bob and Carol (e.g., Alice)
16. Data Processing in Hadoop Example: PYMK Triangle Closing in Pig

    -- connections in (source_id, dest_id) format, in both directions
    connections = LOAD 'connections' USING PigStorage();
    group_conn = GROUP connections BY source_id;
    -- generatePair is a UDF that emits every pair of a member's connections
    pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
    -- second-degree pairs (id1, id2): aggregate and count common connections
    common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id),
                  COUNT(pairs) AS common_connections;
    STORE common_conn INTO 'common_conn' USING PigStorage();
17. How to Manage Production Hadoop Workflows?

18. Azkaban: Hadoop Workflow Management
   • Configuration
   • Dependency management (job-file sketch below)
   • Access control
   • Scheduling and SLA management
   • Monitoring, history
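A sketch of what the configuration and dependency bullets look like in practice: in classic Azkaban, each job is a small properties file, and the dependencies key wires jobs into a DAG. The job names, files, and commands here are hypothetical:

    # load-connections.job (hypothetical)
    type=command
    command=hadoop jar kafka-etl.jar --topic ConnectionsTopic

    # triangle-closing.job (hypothetical); runs only after load-connections succeeds
    type=command
    command=pig -f triangle_closing.pig
    dependencies=load-connections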
19. Distributed Machine Learning: ML-ease
   • ADMM logistic regression for binary response prediction (iterates sketched below)
   Agarwal et al. 2014
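The slide names the algorithm only; for reference, these are the standard consensus-ADMM iterates for this setting (a generic textbook formulation, not necessarily ML-ease's exact one; see Agarwal et al. 2014). Data is split across $K$ partitions, $\ell_k$ is the logistic loss on partition $k$, $x_k$ are local weights, $z$ the consensus weights, $u_k$ scaled duals, and $\rho$ the penalty parameter:

    \[
    \begin{aligned}
    x_k^{t+1} &= \operatorname*{arg\,min}_{x}\; \ell_k(x) + \tfrac{\rho}{2}\,\lVert x - z^{t} + u_k^{t} \rVert_2^2 \\
    z^{t+1}   &= \tfrac{1}{K} \textstyle\sum_{k=1}^{K} \bigl( x_k^{t+1} + u_k^{t} \bigr) \\
    u_k^{t+1} &= u_k^{t} + x_k^{t+1} - z^{t+1}
    \end{aligned}
    \]

Each $x_k$ update runs locally on one partition; only $z$ and the $u_k$ move between iterations, which is exactly the intermediate data the later Spark slides keep in memory.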
20. Limitations of Hadoop: Join and Group By
   • Two datasets: A = (Salesman, Product), B = (Salesman, Location)

    SELECT SomeAggregate()
    FROM A INNER JOIN B ON A.Salesman = B.Salesman
    GROUP BY A.Product, B.Location

   • Common Hadoop MapReduce/Pig/Hive implementation:
     • MapReduce job 1: load the data, shuffle, and reduce to perform the inner join; store the output
     • MapReduce job 2: load that output, shuffle on the group-by keys, and aggregate on the reducers to produce the final result
21. Limitations of Triangle Closing Using Hadoop
   • Large amount of data to shuffle from mappers to reducers

    -- connections in (source_id, dest_id) format, in both directions
    connections = LOAD 'connections' USING PigStorage();
    group_conn = GROUP connections BY source_id;
    pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
    -- this GROUP shuffles all 2nd-degree pairs: terabytes of data
    common_conn = GROUP pairs BY (id1, id2);
    common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id),
                  COUNT(pairs) AS common_connections;
    STORE common_conn INTO 'common_conn' USING PigStorage();
22. Cubert
   • An open source project built for analytics needs
   • Map-side aggregation
     • Minimizes intermediate data and shuffling
   • Fast and scalable primitives for joins and aggregation
     • Partitions data into blocks
     • Specialized operators: MeshJoin, Cube
   • 5-60X faster in our experience
   • Developer friendly: script-like
   Vemuri et al. VLDB 2014
23. Cubert Design
   • Language
     • Scripting language
     • Physical: write MR programs
   • Execution
     • Data movement: Shuffle, Blockgen, Combine, Pivot
     • Primitives: MeshJoin, Cube
     • Data blocks: partitioning of data by a cost function
   Vemuri et al. VLDB 2014
24. Cubert Script: Count Daily/Weekly Stats

    JOB "create blocks of the fact table"
        MAP {
            data = LOAD ("$FactTable", $weekAgo, $today) USING AVRO();
        }
        // create blocks of one week of data with a cost function
        BLOCKGEN data BY ROW 1000000 PARTITIONED ON userId;
        STORE data INTO "$output/blocks" USING RUBIX;
    END

    JOB "compute cubes"
        MAP {
            data = LOAD "$output/blocks" USING RUBIX;
            // create a new column 'todayUserId' for today's records only
            data = FROM data GENERATE country, locale, userId, clicks,
                   CASE(timestamp == $today, userId) AS todayUserId;
        }
        // create the three cubes in a single job to count daily/weekly users and clicks
        CUBE data BY country, locale INNER userId AGGREGATES
            COUNT_DISTINCT(userId) AS weeklyUniqueUsers,
            COUNT_DISTINCT(todayUserId) AS dailyUniqueUsers,
            SUM(clicks) AS totalClicks;
        STORE data INTO "$output/results" USING AVRO();
    END

    Vemuri et al. VLDB 2014
25. Cubert Example: Join and Group By
   • Two datasets: A = (Salesman, Product), B = (Salesman, Location)

    SELECT SomeAggregate()
    FROM A INNER JOIN B ON A.Salesman = B.Salesman
    GROUP BY A.Product, B.Location

   • Sort A by Product and B by Location
   • Divide A and B into specialized blocks sorted by the group-by keys
   • Load A's blocks in memory and stream B's blocks to join
   • The group by can be performed immediately after the join (toy sketch below)
   Vemuri et al. VLDB 2014
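A toy Scala sketch of the idea (not Cubert's actual implementation): keep one relation's block in memory, stream the other past it, and aggregate during the join so the joined tuples are never materialized or re-shuffled. The record shapes and the count aggregate are hypothetical:

    object MeshJoinSketch {
      case class A(salesman: String, product: String)
      case class B(salesman: String, location: String)

      def main(args: Array[String]): Unit = {
        val aBlock = Seq(A("ann", "laptop"), A("bob", "phone"), A("ann", "phone"))
        val bStream = Iterator(B("ann", "NYC"), B("bob", "SF"), B("ann", "SF"))

        // Index the in-memory block by the join key.
        val aBySalesman: Map[String, Seq[A]] = aBlock.groupBy(_.salesman)

        // Stream B and aggregate (product, location) counts on the fly.
        val agg = scala.collection.mutable.Map.empty[(String, String), Long]
        for (b <- bStream; a <- aBySalesman.getOrElse(b.salesman, Nil)) {
          val key = (a.product, b.location)
          agg(key) = agg.getOrElse(key, 0L) + 1
        }
        agg.foreach { case ((product, location), n) => println(s"$product, $location -> $n") }
      }
    }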
26. Cubert Example: Triangle Closing
   • Divide connections (src, dest) into blocks
   • Duplicate the connection graph: G1, G2
   • Sort G1 edges (src, dest) by src
   • Sort G2 edges (src, dest) by dest
   • MeshJoin G1 and G2 such that G1.dest = G2.src
   • Aggregate by (G1.src, G2.dest) to get the number of common connections
   • Speedup of 50%
27. Cubert Summary
   • Built for analytics needs
   • Faster and scalable: 5-60X
   • Working well in practice
   Vemuri et al. VLDB 2014

28. Outline
   • Ingress
     • Moving data from online to offline systems
   • Offline Processing
     • Batch processing: Hadoop, Azkaban, Cubert
     • Stream processing: Samza
     • Iterative processing: Spark
   • Egress
     • Moving results from offline to online systems
29. Samza
   • Samza: streaming computation
     • On top of a messaging layer like Kafka for input/output
   • Low latency
   • Stateful processing through a local store
   • Many use cases at LinkedIn
     • Site-speed monitoring
     • Data standardization

30. Samza: Site Speed Monitoring
   • The LinkedIn homepage is assembled by calling many services
   • Each service logs through Kafka what went on with a request Id
31. Samza: Site Speed Monitoring
   • The complete record of a request is scattered across Kafka logs
   • Problem: combine these logs to generate a holistic view
32. Samza: Site Speed Monitoring
   • Hadoop/MR: join the logs using the request Id, once a day
     • Too late to troubleshoot any issue
   • Samza: join the Kafka logs using the requestId in near real time

33. Samza: Site Speed Monitoring
   • Samza: near-real-time join of the Kafka logs using the requestId
   • Two jobs:
     • Partition the Kafka stream by request Id
     • Aggregate all the records for a request Id (task sketch below)
   Fernandez et al. CIDR 2015
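A minimal Scala sketch of the second job as a Samza StreamTask against Samza's classic low-level API. The store name, output topic, and string payloads are hypothetical, and a real job would decide when a request's record set is complete (e.g. via windowing) instead of emitting on every fragment:

    import org.apache.samza.config.Config
    import org.apache.samza.storage.kv.KeyValueStore
    import org.apache.samza.system.{IncomingMessageEnvelope, OutgoingMessageEnvelope, SystemStream}
    import org.apache.samza.task._

    class RequestAssemblerTask extends StreamTask with InitableTask {
      private var store: KeyValueStore[String, String] = _
      private val out = new SystemStream("kafka", "request-timelines") // hypothetical topic

      override def init(config: Config, context: TaskContext): Unit = {
        // Local state, backed by a Kafka changelog for fault tolerance.
        store = context.getStore("request-store").asInstanceOf[KeyValueStore[String, String]]
      }

      override def process(envelope: IncomingMessageEnvelope,
                           collector: MessageCollector,
                           coordinator: TaskCoordinator): Unit = {
        // The upstream job has already partitioned by requestId, so every
        // fragment of one request reaches the same task instance.
        val requestId = envelope.getKey.asInstanceOf[String]
        val fragment = envelope.getMessage.asInstanceOf[String]
        val assembled = Option(store.get(requestId)).map(_ + "\n").getOrElse("") + fragment
        store.put(requestId, assembled)
        collector.send(new OutgoingMessageEnvelope(out, requestId, assembled))
      }
    }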
34. Outline
   • Ingress
     • Moving data from online to offline systems
   • Offline Processing
     • Batch processing: Hadoop, Azkaban, Cubert
     • Stream processing: Samza
     • Iterative processing: Spark
   • Egress
     • Moving results from offline to online systems

35. Iterative Processing using Spark
   • Limitations of MapReduce
   • What is Spark?
   • Spark at LinkedIn

36. Limitations of MapReduce
   • Iterative computation is slow
     • Inefficient multi-pass computation
     • Intermediate data written to the distributed file system

37. Limitations of MapReduce
   • Interactive computation is slow
     • The same data is loaded again from the distributed file system
38. Example: ADMM at LinkedIn
   • Intermediate data is stored in the distributed file system: slow
   [Diagram: ADMM iterations writing intermediate data to HDFS]
39. Spark
   • Extends the programming language with a distributed data structure
     • Resilient Distributed Datasets (RDDs)
     • Can be stored in memory
   • Faster iterative computation (see the sketch below)
   • Faster interactive computation
   • Clean APIs in Python, Scala, Java
   • SQL, streaming, machine learning, and graph processing support
   Matei Zaharia et al. NSDI 2012
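A small Spark (Scala) sketch of why RDDs help iterative jobs: the parsed dataset is cached in memory once, then reused across gradient steps instead of being re-read from HDFS each pass as MapReduce would. The input path and the model (plain gradient descent for one-feature logistic regression) are illustrative, not LinkedIn's ADMM code:

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeLRSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("iterative-lr-sketch"))

        // Each input line: "<label> <feature>", label in {0, 1}. Hypothetical path.
        val points = sc.textFile("hdfs:///data/points")
          .map { line => val t = line.split(" "); (t(0).toDouble, t(1).toDouble) }
          .cache() // kept in memory across all iterations

        val n = points.count()
        var w = 0.0
        for (_ <- 1 to 20) {
          // Average logistic-loss gradient over the cached RDD.
          val grad = points.map { case (y, x) =>
            (1.0 / (1.0 + math.exp(-w * x)) - y) * x
          }.reduce(_ + _) / n
          w -= 0.1 * grad
        }
        println(s"learned weight: $w")
        sc.stop()
      }
    }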
40. Spark at LinkedIn
   • ADMM on Spark
   • Intermediate data is stored in memory: faster
   [Diagram: ADMM iterations keeping intermediate data in memory]

41. Outline
   • Data Ingress
     • Moving data from online to offline systems
   • Data Processing
     • Batch processing: Hadoop, Azkaban, Cubert
     • Iterative processing: Spark
     • Stream processing: Samza
   • Data Egress
     • Moving results from offline to online systems
42. Data Egress: Key/Value
   • Key-value store: Voldemort
     • Based on Amazon's Dynamo
     • Distributed
     • Scalable
     • Bulk load from Hadoop
   • Simple to use (client-side sketch below):
     store results into 'url' using KeyValue('member_id')
   Sumbaly et al. FAST 2012
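On the read side, a hedged sketch of an online service fetching bulk-loaded results through Voldemort's Java client from Scala; the bootstrap URL, store name, and value encoding are hypothetical:

    import voldemort.client.{ClientConfig, SocketStoreClientFactory, StoreClient}

    object VoldemortReadSketch {
      def main(args: Array[String]): Unit = {
        val factory = new SocketStoreClientFactory(
          new ClientConfig().setBootstrapUrls("tcp://voldemort-host:6666")) // hypothetical URL
        // Store maps memberId to a serialized list of PYMK suggestions.
        val client: StoreClient[java.lang.Long, String] = factory.getStoreClient("pymk-results")
        val suggestions = client.getValue(42L) // null if the key is absent
        println(Option(suggestions).getOrElse("no suggestions"))
        factory.close()
      }
    }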
43. Data Egress: Streams
   • Stream: Kafka
     • Hadoop job acts as a Producer
     • Online service acts as a Consumer
   • Simple to use:
     store data into 'url' using Stream("topic=x")
   Goodhope et al., IEEE Data Eng. 2012
44. Conclusion
   • Rich primitives for Data Ingress, Processing, and Egress
     • Data Ingress: Kafka, ETL
     • Data Processing
       • Batch processing: Hadoop, Cubert
       • Stream processing: Samza
       • Iterative processing: Spark
     • Data Egress: Voldemort, Kafka
   • Allows data scientists to focus on building data products
45. Future Opportunities
   • Models of computation
   • Efficient graph processing
   • Distributed machine learning

46. Acknowledgements
   Thanks to the data team at LinkedIn: data.linkedin.com
   Contact: mtiwari@linkedin.com
   @mitultiwari
