How LinkedIn Uses Automic for Big Data Processes

  1. 1. ORGANIZATION NAME©2013 LinkedIn Corporation. All Rights Reserved.
  2. 2. Vijay Aruswamy, Staff Engineer, Big Data Operations, LinkedIn Corporation. https://www.linkedin.com/in/vijayaruswamy
  3. 3. Outline
       • LinkedIn Overview
       • Why Data Is Important for LinkedIn
       • LinkedIn’s Big Data Ecosystem
       • How Automic Tools Are Helping LinkedIn
  4. 4. Our Mission: Connect the world's professionals to make them more productive and successful.
  6. 6. LinkedIn – World's Largest Professional Network
  8. 8. “What gets measured, gets fixed” - David Henke, Former SVP Operations, LinkedIn
  10. 10. A Few Data-Driven Products
       • People You May Know (PYMK)
       • Companies you may be interested in
       • Jobs you may be interested in
       • Groups you may like
       • Who viewed your profile
       • Economic Graph Challenge
  18. 18. Types of Data at LinkedIn: Behavioral Data + Identity Data + Social Data
  19. 19. What Does “Big Data” Mean at LinkedIn? [Figure labels: Analytical Challenges & Complexity (+∞); Data Volume (+∞); Member Data; CRM Data; Web/Behavior Data; Social Media Data]
  20. 20. High-Level Data Flow
  21. 21. Camus and Gobblin
       • Camus: a MapReduce job that loads data from Kafka into HDFS. It can copy data from Kafka into HDFS incrementally. http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline
       • Gobblin: a unified data ingestion system for internal and external data sources. Gobblin uses a worker framework in which each record runs through four stages: extraction, conversion, quality checking, and writing. https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease
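The incremental copy and the four stages described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not Camus or Gobblin code: the `Record` type, the function names, the lowercase/strip conversion, the non-empty quality rule, and the in-memory `sink` list are all hypothetical stand-ins for a Kafka topic and an HDFS writer.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Record:
    """Hypothetical record: an offset in the source plus a text payload."""
    offset: int
    payload: str


def extract(source: Iterable[Record], last_offset: int) -> Iterator[Record]:
    """Stage 1 (extraction): yield only records newer than the stored offset."""
    return (r for r in source if r.offset > last_offset)


def convert(record: Record) -> Record:
    """Stage 2 (conversion): normalize the payload (assumed rule)."""
    return Record(record.offset, record.payload.strip().lower())


def passes_quality(record: Record) -> bool:
    """Stage 3 (quality checking): reject empty payloads (assumed rule)."""
    return bool(record.payload)


def run_pipeline(source: Iterable[Record], last_offset: int, sink: List[Record]) -> int:
    """Run all four stages and return the new offset to checkpoint."""
    new_offset = last_offset
    for record in extract(source, last_offset):
        converted = convert(record)
        if passes_quality(converted):
            sink.append(converted)  # Stage 4 (writing)
        # Advance the checkpoint even for rejected records so they are not re-read.
        new_offset = max(new_offset, record.offset)
    return new_offset
```

Calling `run_pipeline` a second time with the returned offset writes nothing new, which is the incremental behavior the slide attributes to Camus.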
  22. 22. High-Level Data Flow (cont.)
  23. 23. Automic and Azkaban
       • Data-driven scheduling: a process will not execute before its data dependency is satisfied.
       • Typical time-series roll-up hierarchies (hour :: day :: week :: month :: quarter :: year) are handled by Azkaban.
       • Processes should execute only when their input data sets are available.
       • Grouping: organize components and workflows into a common area for maintenance and enhancements.
       • Support for external dependencies.
       • Use of global variables: store commonly used passwords in one place.
       • Throttling: assign jobs to queues, schedule when jobs run throughout the day, and hold jobs under the same flow.
       • Load balancing: assign queues to run on a particular server.
       • Monitoring: graphical Explorer.
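The data-driven scheduling rule on this slide (run a roll-up only when all of its inputs exist) can be sketched as follows. This is a minimal illustration, not the Automic or Azkaban API: the partition naming scheme `YYYY/MM/DD/HH`, the `available` set, and all function names are assumptions made for the example.

```python
from datetime import date, datetime, timedelta
from typing import Callable, List, Set


def hourly_partitions(day: date) -> List[str]:
    """Names of the 24 hourly input partitions for one day (assumed layout)."""
    start = datetime(day.year, day.month, day.day)
    return [(start + timedelta(hours=h)).strftime("%Y/%m/%d/%H") for h in range(24)]


def missing_inputs(day: date, available: Set[str]) -> List[str]:
    """Hourly partitions the daily roll-up still waits for."""
    return [p for p in hourly_partitions(day) if p not in available]


def maybe_run_daily_rollup(day: date, available: Set[str],
                           run: Callable[[date], None]) -> bool:
    """Run the hour-to-day roll-up only if every hourly input is present."""
    if missing_inputs(day, available):
        return False  # dependency not satisfied; the scheduler would retry later
    run(day)
    return True
```

The same check can be stacked for each level of the hierarchy: a weekly roll-up would wait for seven daily outputs, a monthly one for its days, and so on.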
  24. 24. LinkedIn’s Big Data Architecture [Figure labels: Online DBs - Prod DCs; Espresso; Service Metrics; Web Tracking]
  25. 25. LinkedIn’s Application Manager
  26. 26. Types of Jobs Scheduled by Automic
       • External ETL
       • ODS ETL
       • Hadoop ETL
       • Teradata ETL
       • User Input ETL
       • Historical Loads
       • One-time data fixes
  28. 28. Data Volume
       • How many Kafka topics (tracking + service) do we dump on Hadoop?
           – ~900+: tracking: 300 (/data/tracking) + service: 682 (/data/service)
           – Data size per day of the above: ~10 TB
       • How many online DB tables do we have on Hadoop?
           – ~300+ tables (Oracle, Espresso, MySQL)
           – Data size: ~8 TB
       • Capacity of the DWH on Teradata
           – ~186 TB overall with 6-month retention; ~3 TB every day
           – ~340k unique queries/day (248k from users and ~90k from ETL)
       • Capacity of Hadoop
           – Biggest cluster: 5 PB with 2500+ nodes
           – ETL clusters: 3.1 PB with 360+ nodes
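As a quick sanity check on the rounded totals above, the component figures quoted on the slide can be added up. The variable names are illustrative, and the inputs are the slide's own approximate numbers, so the results are only as precise as those.

```python
# Kafka topics landed on Hadoop, as quoted on the slide.
tracking_topics = 300
service_topics = 682
total_topics = tracking_topics + service_topics  # consistent with "~900+"

# Teradata queries per day, as quoted on the slide (approximate).
user_queries = 248_000
etl_queries = 90_000
total_queries = user_queries + etl_queries  # consistent with "~340k"

# Fraction of the daily query load that comes from ETL jobs.
etl_share = etl_queries / total_queries
```

The sums come to 982 topics and 338,000 queries per day, with ETL accounting for a little over a quarter of the queries.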
  29. 29. Q & A
