The Big Data Analytics Ecosystem at LinkedIn



LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting these products is a large ecosystem of systems and processes that deliver data and insights to them in a timely manner.

This talk provides an overview of the components of this ecosystem:

- Hadoop
- Teradata
- Kafka
- Databus
- Camus
- Lumos


Published in: Technology


  1. The Big Data Analytics Ecosystem at LinkedIn. Rajappa Iyer, September 17, 2013
  2. Agenda
      - LinkedIn by the numbers
      - An Overview of Data Driven Products / Insights
      - The Big Data Analytics Ecosystem
        - Storage and Compute Platforms
        - Data Transport Pipelines
        - Data Processing Pipelines
        - Operational Tooling - Metadata
      - Q&A
  3. LinkedIn: The World’s Largest Professional Network
      - 238M+ Members Worldwide
      - 2 new Members Per Second
      - 100M+ Monthly Unique Visitors
      - 3M+ Company Pages
      - Connecting Talent with Opportunity. At scale…
  4. Data Driven Products and Insights
      - Products for Members (Professionals)
      - Products for Enterprises (Companies)
      - Insights (Analysts and Data Scientists)
      - Data, Platforms, Analytics
  5. Products for Members
  6. Products for Enterprises
      - Sell - Sales Navigator
      - Market - Marketing Solutions
      - Hire - Talent Solutions
  7. Examples of Business Insights
  8. Example of Deeper Insight: Job Migration After Financial Collapse
  9. A Simplified Overview of Data Flow (diagram; labels only)
      - Site (Member Facing Products)
      - Kafka, Activity Data
      - Espresso / Voldemort / Oracle, Member Data
      - Databus, Changes
      - Camus, Lumos
      - Hadoop, Teradata
      - External Partner Data, Ingest Utilities, DWH ETL
      - Core Data Set, Derived Data Set
      - Computed Results for Member Facing Products, Enterprise Products
      - Product, Sciences, Enterprise Analytics
  10. Storage and Compute Platforms
      - Most data in Avro format
      - Access via Hive and Pig
      - Most ETL processes run here
      - Specialized batch processing
      - Algorithmic data mining

      LinkedIn Confidential ©2013 All Rights Reserved
  11. Storage and Compute Platforms
      - Integrated Data Warehouse
      - Standard BI Tools
      - Interactive Querying (Low latency)
      - Workload Management
  12. Transport Pipeline - Kafka
      - High-volume, low-latency messaging system
      - Horizontally scalable
      - Automatic load balancing
      - Rewindability
      - Intra-cluster replication
      - Mainly used for log aggregation and queuing
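
The rewindability point on the Kafka slide can be illustrated with a toy in-memory log. This is only a sketch of the idea that consumers track their own offsets against a retained, append-only log; `TopicLog` and `Consumer` are hypothetical names and not part of Kafka's client API.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Toy append-only log for one topic partition (not the Kafka API)."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        """Store a message and return the offset it was written at."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> list:
        """Return up to max_messages starting at offset; the log is not modified."""
        return self.messages[offset:offset + max_messages]


@dataclass
class Consumer:
    """Each consumer keeps its own position, so it can rewind independently."""
    log: TopicLog
    offset: int = 0

    def poll(self, max_messages: int = 10) -> list:
        """Read the next batch and advance this consumer's offset."""
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset: int) -> None:
        """Move back to an earlier offset to re-read retained messages."""
        self.offset = offset


log = TopicLog()
for event in ["page_view", "search", "profile_view"]:
    log.append(event)

consumer = Consumer(log)
first_pass = consumer.poll()    # reads all three events
consumer.rewind(1)              # go back to the second event
second_pass = consumer.poll()   # re-reads from offset 1 onward
```

Because reading never removes messages, a downstream job that fails partway can reset its offset and replay the same data.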
  13. Transport Pipeline - Databus
      - Timeline consistent data change capture
      - Works with Oracle, MySQL, Espresso…
      - Transactional semantics
      - In-order, at least once delivery
      - Low latency
      - Has scaled to 100s of sources
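
In-order, at-least-once delivery means a consumer may see the same change more than once, so it helps if applying a change is idempotent. The sketch below shows one common way to do that, by skipping events at or below the last applied sequence number. The `ChangeEvent` fields and `ReplicaApplier` class are assumptions for illustration, not the Databus client API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    scn: int     # hypothetical monotonically increasing sequence number
    key: str
    value: str


class ReplicaApplier:
    """Applies change events idempotently so redelivered events are harmless."""

    def __init__(self) -> None:
        self.rows = {}
        self.last_applied_scn = 0

    def apply(self, event: ChangeEvent) -> bool:
        """Apply the event unless it was already applied; return True if applied."""
        if event.scn <= self.last_applied_scn:
            return False  # duplicate from an at-least-once redelivery
        self.rows[event.key] = event.value
        self.last_applied_scn = event.scn
        return True


stream = [
    ChangeEvent(1, "member:1", "title=Engineer"),
    ChangeEvent(2, "member:2", "title=Analyst"),
    ChangeEvent(2, "member:2", "title=Analyst"),   # redelivered after a retry
    ChangeEvent(3, "member:1", "title=Manager"),
]

applier = ReplicaApplier()
applied = [applier.apply(event) for event in stream]
```

The duplicate is dropped only because events arrive in order; with out-of-order delivery a single high-water mark would not be enough.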
  14. Processing Pipeline: Camus
      Camus: Framework for ingesting Kafka streams to HDFS
      Diagram labels:
      - Kafka Brokers, Topic Registry (zookeeper), Topics
      - Camus Incremental Pull: Data + offsets, 1 file / run / topic / partition
      - Hadoop: Hourly Data by Topic, Last processed offset by topic, Daily Data by Topic
      - Camus Daily Compaction
      - Camus Audit Job, audit counts, Audit DB
      - Hive Registration
  15. Camus: Features
      - Highly scalable due to adaptive input format
        - Handled 10x increase in data volume without change
      - Restartable with checkpointing
      - Robust auditing support
      - Plays nicely with Hive and Pig
        - Avro format support
        - Hive metastore registration
      - Open source
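
The incremental pull and checkpointing described on the two Camus slides can be sketched as follows. This is a simplified, in-memory illustration rather than Camus itself: `incremental_pull`, the `topic_logs` dictionary standing in for Kafka brokers, and the `checkpoints` dictionary standing in for stored offsets are all hypothetical.

```python
def incremental_pull(topic_logs, checkpoints):
    """Pull only messages past each topic's last processed offset.

    topic_logs:  {topic: list of messages}, a stand-in for Kafka brokers
    checkpoints: {topic: next offset to read}, a stand-in for stored offsets
    Returns (pulled data by topic, updated checkpoints, audit counts by topic).
    """
    pulled, new_checkpoints, audit = {}, dict(checkpoints), {}
    for topic, messages in topic_logs.items():
        start = checkpoints.get(topic, 0)
        batch = messages[start:]             # only data not yet ingested
        pulled[topic] = batch
        new_checkpoints[topic] = start + len(batch)
        audit[topic] = len(batch)            # count to reconcile against the source
    return pulled, new_checkpoints, audit


logs = {"page_views": ["a", "b", "c"], "searches": ["x"]}
run1, ckpt, audit1 = incremental_pull(logs, {})

# New data arrives between runs; the next run resumes from the checkpoint.
logs["page_views"].append("d")
run2, ckpt, audit2 = incremental_pull(logs, ckpt)
```

Persisting the checkpoints only after a batch is safely written is what would make a real job restartable; this sketch skips persistence entirely.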
  16. Processing Pipeline: Lumos
      Lumos: Framework to ingest database data to HDFS
      Diagram labels:
      - PROD Oracle, PROD Espresso
      - Databus, DB Extract, External Data
      - ETL Hadoop Cluster
      - Inc/Full (internal), Staging Data (internal)
      - Lazy Snapshot Materializer, Virtual Snapshot Materializer
      - Metadata, Published Virtual Snapshot
      - Pig/Hive Loaders, DWH processes
  17. Lumos: Features
      - Supports Espresso, Oracle and MySQL as sources
      - Full snapshots and incremental dumps
      - Automatic type translation for most database types
      - Provides LAST UPDATE semantics for data
      - Supports low latency requirements
        - Reader API performs just-in-time compaction
      - Snapshot constructed two ways:
        - On demand compaction for upserts
        - Periodic snapshotting that reflects deletes as well
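
One way to picture LAST UPDATE semantics and compaction is a merge that keeps the most recent row per key. The sketch below is an assumption-laden illustration, not Lumos code: the row fields (`key`, `updated_at`, `value`, `deleted`) are hypothetical, and for brevity it folds deletes into the same pass, whereas the slide attributes delete handling to periodic snapshotting.

```python
def compact(snapshot, increments):
    """Merge incremental rows into a snapshot, keeping the latest row per key.

    Rows are dicts with hypothetical fields: key, updated_at, value, deleted.
    """
    latest = {row["key"]: row for row in snapshot}
    for row in increments:
        current = latest.get(row["key"])
        # Last update wins: replace only if this row is at least as recent.
        if current is None or row["updated_at"] >= current["updated_at"]:
            latest[row["key"]] = row
    # Drop keys whose most recent change is a delete.
    return sorted(
        (r for r in latest.values() if not r.get("deleted", False)),
        key=lambda r: r["key"],
    )


snapshot = [
    {"key": 1, "updated_at": 100, "value": "Engineer"},
    {"key": 2, "updated_at": 100, "value": "Analyst"},
]
increments = [
    {"key": 1, "updated_at": 150, "value": "Manager"},              # update
    {"key": 3, "updated_at": 160, "value": "Designer"},             # insert
    {"key": 2, "updated_at": 170, "value": None, "deleted": True},  # delete
    {"key": 1, "updated_at": 120, "value": "Sr. Engineer"},         # stale, ignored
]

result = compact(snapshot, increments)
```

Running the same merge at read time over a base snapshot plus pending increments is the general shape of just-in-time compaction.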
  18. Operational Support - Metadata
      - ETL pipeline is a complex graph of workflows
        - Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies
      - To manage this, we needed to capture:
        - Process dependencies
        - Data dependencies
        - Process execution history
        - Data load status
        - Data consumption status (watermarks)
  19. Operational Metadata – v1
      - Capture process dependency graph
        - Also capture useful metadata such as process owners
      - Capture stats for each execution of a workflow
        - Time of execution
        - Status
        - Pointer to error logs
      - Has proved quite useful for monitoring critical chains
      - Diagram: Workflow F runs from Start to Stop through Workunits W1 to W5, linked by "on success" and "on failure" edges
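
A minimal sketch of the v1 idea: run work units along "on success" and "on failure" edges and record one stats entry per execution. The function name, the edge layout, and the record fields are hypothetical, and the stored exception text merely stands in for a pointer to error logs.

```python
import time


def run_workflow(units, edges, start):
    """Run work units, following 'success' / 'failure' edges.

    units: {name: callable that returns normally or raises}
    edges: {name: {"success": next_name, "failure": next_name}} (keys optional)
    Returns one stats record per executed unit.
    """
    history, current = [], start
    while current is not None:
        started = time.time()
        try:
            units[current]()
            status, error = "success", None
        except Exception as exc:
            status, error = "failure", repr(exc)
        history.append({
            "unit": current,
            "started_at": started,   # time of execution
            "status": status,
            "error": error,          # stand-in for a pointer to error logs
        })
        # Follow the edge matching the outcome; stop when there is none.
        current = edges.get(current, {}).get(status)
    return history


def ok():
    pass


def fail():
    raise RuntimeError("load failed")


units = {"W1": ok, "W2": fail, "W3": ok, "W4": ok}
edges = {
    "W1": {"success": "W2"},
    "W2": {"success": "W3", "failure": "W4"},  # W4 acts as a failure handler
}

history = run_workflow(units, edges, "W1")
path = [(record["unit"], record["status"]) for record in history]
```

Storing these records per run is what makes it possible to monitor a critical chain after the fact.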
  20. Operational Metadata – v2
      - Diagram: Workflow F consumes two data entities and produces a third (Data Entities D1, D2, D3)
      - For each flow, capture input and output data elements
      - For each execution, capture stats on data element, e.g.
        - Number of records / lines read
        - Number of records / lines written
        - Error counts
        - Last processed record
          - Can be time based or sequence based
          - This can be per flow as more than one flow can consume a data element
  21. Operational Metadata – The Payoff
      - Restartable ETL jobs
        - Process new data since last successful previous run
      - Catch up mode for ETL jobs
        - Single run can consume data from multiple intervals in one batch
        - Next run will resume from correct place
      - Data freshness and availability dashboard
      - Coarse form of data lineage
        - Impact analysis for unfortunately all-too-common changes upstream
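
The restartable and catch-up behavior above follows from tracking a consumption watermark. Here is a small sketch under stated assumptions: `run_etl` is a hypothetical function, intervals are sortable hour strings, and the watermark is the last interval processed successfully.

```python
def run_etl(source_rows, watermark):
    """Process every row newer than the watermark in a single batch.

    source_rows: list of (interval, payload) tuples; intervals are sortable
    watermark:   last interval successfully processed, or None on first run
    Returns (processed payloads, new watermark).
    """
    pending = sorted(
        (row for row in source_rows if watermark is None or row[0] > watermark),
        key=lambda row: row[0],
    )
    if not pending:
        return [], watermark           # nothing new; watermark is unchanged
    processed = [payload for _, payload in pending]
    return processed, pending[-1][0]   # advance to the newest interval consumed


source = [("2013-09-17T01", "h1"), ("2013-09-17T02", "h2")]
batch1, wm = run_etl(source, None)

# Several runs are missed while three more hourly intervals accumulate.
source += [("2013-09-17T03", "h3"), ("2013-09-17T04", "h4"), ("2013-09-17T05", "h5")]
batch2, wm = run_etl(source, wm)      # catch-up: one run consumes all three

batch3, wm = run_etl(source, wm)      # next run resumes from the right place
```

Keeping a separate watermark per consuming flow would let several flows read the same data element at their own pace.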
  22. Putting it all Together
      - Online Data Stores: Espresso, Voldemort
      - Data Transport Pipelines: Kafka, Databus
      - Data Processing Pipelines: Camus, Lumos
      - Offline Storage / Compute: Hadoop, Teradata
      - Analytics Applications
      - Operational Metadata and Tooling
  23. `whoami`
      - Sr. Manager / DWH Architect @ LinkedIn since 2011
      - Prior to that:
        - Director of Engineering at Digg
        - Enterprise Data Architect at eBay
  24. Questions? More at We’re Hiring