Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably



Data is the lifeblood of many LinkedIn products and must be delivered to the appropriate systems in a reliable and timely manner. This talk details a metadata system built at LinkedIn to help manage, at scale, the set of ETL flows responsible for data delivery.



  1. Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably. Rajappa Iyer, Strata Conference, London, November 12, 2013
  2. `whoami`
      Data Infrastructure @ LinkedIn since 2011
      Prior to that:
       – Director of Engineering at Digg
       – Enterprise Data Architect at eBay
      www.linkedin.com/in/rajappaiyer/
  3. Outline of talk
      Background and Context – The Why
      Challenges with Data Delivery – The What
      Metadata to the Rescue – The How
      Q&A
  4. LinkedIn: The World’s Largest Professional Network
     Connecting Talent  Opportunity. At scale…
      259M+ Members Worldwide
      2 new Members Per Second
      100M+ Monthly Unique Visitors
      3M+ Company Pages
  5. Data-Driven Products and Insights
      Products for Members (Professionals)
      Products for Enterprises (Companies)
      Insights (Analysts and Data Scientists)
     All built on Data, Platforms, and Analytics
  6. Products for Members
  7. Products for Enterprises
      Hire – Talent Solutions
      Sell – Sales Navigator
      Market – Marketing Solutions
  8. Examples of Insights
  9. Example of Deeper Insight: Job Migration After the Financial Collapse
  10. Data is critical to LinkedIn’s products. It needs to be delivered in a reliable and timely manner. (LinkedIn Confidential ©2013 All Rights Reserved)
  11. A Simplified Overview of Data Flow (diagram)
      Activity data flows from the site (member-facing products) into Hadoop via Kafka and Camus
      Member data changes flow from Espresso / Voldemort / Oracle via Databus and Lumos
      External partner data arrives through ingest utilities
      Hadoop holds the core and derived data sets; computed results flow back to member-facing products
      DWH ETL loads Teradata for product, sciences, and enterprise analytics
  12. Components of typical ETL jobs
      Ingress / egress of message-oriented data
       – Logs and clickstream data
      Ingress / egress of record-oriented data
       – Database data
      Transformations
       – Select, project, join
       – Aggregations
       – Partitioning
       – Cleansing and data normalization
       – Schema conversions, e.g. nested JSON to relational
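The nested-JSON-to-relational conversion on this slide can be sketched in a few lines. This is a minimal illustration, not LinkedIn's implementation; the function name and the dotted-column convention are assumptions.

```python
# Hypothetical sketch of "nested JSON to relational": flatten a nested
# record into one level of dotted column names. (Lists and schema
# enforcement are out of scope for this sketch.)

def flatten(record, prefix=""):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row

event = {"member": {"id": 42, "locale": "en_US"}, "page": "profile"}
print(flatten(event))
# {'member.id': 42, 'member.locale': 'en_US', 'page': 'profile'}
```

A real pipeline would also carry the target relational schema so that missing fields and type mismatches are caught during cleansing rather than at load time.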
  13. An Example ETL Flow (diagram)
  14. Challenges
      Complex process dependencies
       – Some flows are over 30 levels deep
       – Flows may span multiple platforms (Hadoop, RDBMS etc.)
      Complex data dependencies
       – Multiple flows may consume a data element
       – Multiple data elements feed into a single flow
       – Can be viewed as “data sync barriers”
      Recovery
       – Restartable flows that pick up from the last checkpoint
       – Catch-up mode to compensate for downtime
      Monitoring and alerting
       – Prioritization of “important” flows for ops attention
       – Who do you call when things fail?
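The restart-from-last-checkpoint recovery described above can be sketched as a flow runner that persists the index of the last completed step. This is an illustrative toy, not the actual framework; the checkpoint file name and step model are assumptions.

```python
import json
import os

# Minimal checkpoint/restart sketch: persist the last completed step so a
# failed flow resumes from there instead of re-running from the start.
CHECKPOINT = "flow.ckpt"  # hypothetical location

def load_checkpoint():
    """Return the index of the last completed step, or -1 if none."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_done"]
    return -1

def save_checkpoint(step):
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_done": step}, f)

def run_flow(steps):
    """Run steps in order, skipping any already checkpointed as done."""
    start = load_checkpoint() + 1
    for i in range(start, len(steps)):
        steps[i]()        # may raise; checkpoint stays at last success
        save_checkpoint(i)
```

If a step raises, the checkpoint still points at the last successful step, so the next invocation of `run_flow` resumes exactly there; the slide's "catch-up mode" would repeatedly resume until the flow is current.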
  15. Metadata to the rescue
      What metadata is collected?
       – Process dependencies
       – Data dependencies
       – Execution history and data processing statistics
      How is it used?
       – Drives the ETL framework with lots of functionality:
          Checks for data availability
          Retries and restarts
          Standardized error reporting / alerting
          Prioritized view of business-critical flows
  16. Metadata: Process Dependencies (diagram: workunits W1–W5 linked by on-success / on-failure edges between Start and Stop of Workflow F)
      Capture the process dependency graph
       – Also capture metadata such as process owners, importance, SLA etc.
      Capture stats for each execution of a workflow
       – Time of execution
       – Execution status
       – Pointer to error logs
      Alert on delayed processes
       – Based on execution history
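The on-success / on-failure edges in the workflow graph above can be sketched as a tiny runner that follows one edge or the other and records execution status per workunit, as the slide describes. All names and the data layout are illustrative assumptions, not the real framework's API.

```python
# Illustrative workunit graph walker. Each workunit maps to
# (callable, successor_on_success, successor_on_failure); None ends the flow.

def run_workflow(units, start):
    """Execute workunits from `start`, following success/failure edges.

    Returns the execution history as (workunit, status) pairs, which is
    the kind of per-execution metadata the slide says is captured.
    """
    history = []
    current = start
    while current is not None:
        fn, on_success, on_failure = units[current]
        try:
            fn()
            history.append((current, "success"))
            current = on_success
        except Exception:
            history.append((current, "failure"))
            current = on_failure
    return history

units = {
    "W1": (lambda: None, "W2", None),   # W1 succeeds -> W2
    "W2": (lambda: 1 / 0, None, "W3"),  # W2 fails    -> W3 (cleanup path)
    "W3": (lambda: None, None, None),   # terminal workunit
}
print(run_workflow(units, "W1"))
# [('W1', 'success'), ('W2', 'failure'), ('W3', 'success')]
```

In a production system the history would be written to the metadata store together with timestamps and error-log pointers, so delayed-process alerting can compare against past execution times.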
  17. Metadata: Data Dependencies (diagram: Workflow F consumes Data Entities D1 and D2 and produces Data Entity D3)
      For each flow, capture input and output data elements
      For each flow execution, capture stats on each data element
       – Number of records or messages processed
       – Error counts
      Watermarks
       – Can be time-based or sequence-based
       – Tracked per flow, since more than one flow can consume a data element
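The per-flow watermarks on this slide can be sketched as a small store keyed by (flow, data element), so that two flows consuming the same element advance independently. The class and method names are hypothetical.

```python
# Sketch of per-consumer watermarks: each flow tracks its own high-water
# mark (time- or sequence-based) per data element it consumes.

class Watermarks:
    def __init__(self):
        self._marks = {}  # (flow, data_element) -> watermark value

    def get(self, flow, element):
        return self._marks.get((flow, element), 0)

    def advance(self, flow, element, new_mark):
        """Move a watermark forward; watermarks never regress."""
        if new_mark <= self.get(flow, element):
            raise ValueError("watermark must increase")
        self._marks[(flow, element)] = new_mark

w = Watermarks()
w.advance("FlowF", "D1", 100)  # FlowF has processed D1 up to 100
w.advance("FlowG", "D1", 50)   # FlowG lags on the same element
print(w.get("FlowF", "D1"), w.get("FlowG", "D1"))
# 100 50
```

Keying by (flow, element) rather than by element alone is what lets one slow consumer fall behind without blocking others, while still making its lag visible for monitoring.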
  18. Metadata: Data Elements
      Simple catalog of data elements
       – Name, physical location, owner etc.
      Data elements can have logical names
       – Names resolve to one or more physical entities
       – Logical names can represent useful collections, e.g. data as of a particular interval
      Data element availability can trigger processes
       – E.g., kick off an hourly process when hourly data is complete and available
       – Enables data-driven ETL scheduling
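Logical-name resolution plus availability-triggered scheduling can be sketched as: resolve a logical name to its physical partitions, and fire the flow only when all of them are marked available. Path layout and function names here are invented for illustration.

```python
# Sketch of data-driven scheduling via logical names. A logical name for
# a day of data resolves to 24 hypothetical hourly physical paths.

def resolve(logical_name, hours):
    """Resolve a logical name to the physical partitions it covers."""
    return [f"{logical_name}/{h:02d}" for h in range(hours)]

def ready(available, logical_name, hours):
    """True when every physical partition behind the logical name exists."""
    return all(p in available for p in resolve(logical_name, hours))

# The catalog marks partitions available as upstream flows complete them.
available = {f"clicks/2013-11-12/{h:02d}" for h in range(24)}
print(ready(available, "clicks/2013-11-12", 24))  # True: daily flow can fire
print(ready(available, "clicks/2013-11-13", 24))  # False: next day not ready
```

The point of the indirection is that downstream flows depend on the logical name only; partitions can move or be re-sharded without rewriting every consumer.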
  19. Putting it all together (diagram)
      The ETL framework (scheduler, checkpoint / execution state, retry / resume, name resolver, data checks) sits between the ETL applications and the Metadata Management System
      The metadata system holds execution history, statistics (process and data), data lineage, and data availability status, fed by log parsers
      Dashboards, reports, and alerting / monitoring are driven from the same metadata
  20. Questions? More at data.linkedin.com. Come work on challenging data infrastructure problems: we’re hiring!