
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn

This tech talk by Khai Tran covers LinkedIn's data pipeline, which collects tens of billions of messages per day, and how the team runs a real-time processing system to aggregate that data for metrics monitoring.

Some of the points the talk covers:
- An introduction to LinkedIn's unified metrics platform.
- How LinkedIn set up its big data pipeline using Kafka, HDFS, Apache Calcite, and Apache Samza.
- The concept of nearline storage, and how LinkedIn moved from an offline architecture to a nearline architecture.


Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Currently a staff software engineer at LinkedIn, in charge of the metrics monitoring system. Previously worked at Amazon AWS and Oracle.
- PhD, University of Wisconsin-Madison, with research on database systems.

Published in: Technology

  1. 1. Building Realtime Metrics Platform at LinkedIn ● Khai Tran, Staff Software Engineer
  2. 2. Agenda ● Self introduction ● Data organization at LinkedIn ● Overview of LinkedIn metrics platform ● Moving from offline to nearline ● Under the hood of the nearline architecture ● Nearline production use cases ● Conclusion
  3. 3. About me ● Currently tech lead of the LinkedIn metrics team ● BS from DHBK HN, K46 ● PhD from University of Wisconsin-Madison, on databases. Thesis: “Realizing parallelism in OLTP workloads” ● Internships at Oracle, Microsoft Research, and Google ● First job - Oracle Labs, mostly working on the Oracle query optimizer ● Second job - AWS, in two teams: ○ DynamoDB: on Request Routers, DynamoDB query language ○ Redshift: on query processing, statistics collection
  4. 4. Data organization at LinkedIn ● Data infrastructure - online world ○ Operational databases: Espresso, Venice/Voldemort, graph database… ○ Streaming: Samza, Kafka, Brooklin/Databus ○ Search infrastructure... ● Analytic Platform and Application - offline world ○ Infrastructure: Hadoop ecosystem, Spark, Presto, Gobblin (for ETL), Pinot (OLAP database)... ○ Platform: metrics platform, data warehouses ○ Applications: XLNT (A/B testing), Raptor (visualization), ThirdEye (anomaly detection)... ● Relevance - machine learning related ○ Machine learning infrastructure and platform ○ Feed ranking, search ranking ○ PYMK, JYMI...
  5. 5. Overview of LinkedIn metrics platform
  6. 6. Metrics @ LinkedIn ● Metrics = Measurements over tracking data ● Tracking data: any logged events (web or mobile) ● Crucial for decision making: ○ Experimentation - test everything ○ Reporting - monitor and alert ○ In production, site-facing applications
  7. 7. Example
  8. 8. We provide: ● A trusted repository of metrics ● A self-serve platform for sustainable lifecycle of metrics In production Experimentation Reporting Primary Data Unified Metrics Platform LinkedIn unified metrics platform (UMP)
  9. 9. Growth of UMP Metrics ● 2015: 1100 ● 2016: 4680 ● 2017: 6800 ● Current: 10K+ metrics
  10. 10. Onboarding process (diagram) ● User: Declare (Data, Metadata), Define, Onboard ● User Code: # code LOAD … # data # transformation # code STORE … ● Config: Metrics: A = SUM(A’), B = Unique(id); Downstream: XLNT, Raptor ● UMP: User Code, Platform Generated Code ● Output: To App
  11. 11. Moving from offline to nearline
  12. 12. Offline computation flows ● Hourly jobs, latency: 3-6 hours -> want realtime/nearline ● Diagram labels: HDFS tables, Dali views; Espresso, Oracle, MySQL; User code, User code, ...; Metric union; Dimension augmentation; Cubing, Rollup; Pinot, Presto; Azkaban execution
  13. 13. What we want for nearline flows ● Diagram labels: User code, User code, ...; Metric union; Dimension augmentation; Samza job; Pinot
  14. 14. Latency is not the only requirement Easy to onboard ● Minimum effort to convert existing offline into nearline ● Easy to write user code for new nearline flows Easy to maintain ● Just one version of user code - single source of truth ● Run as a service Latency ● ~5 - 30 mins
  15. 15. Putting things together ● Lambda architecture with a single codebase ● Diagram labels: Metrics definition (code, config); UMP realtime platform (Samza jobs); UMP offline platform (Batch jobs); HDFS; Pinot; Raptor
  16. 16. Current support User code in Pig ● LOAD, STORE ● FILTER, SAMPLE, SPLIT, UNION ● Simple FOREACH ● JOIN - all semantics ● GROUP/COGROUP, DISTINCT ● Record/Array FLATTEN ● Java UDFs, Python UDFs ● Pig Nested FOREACH and sort/limit (in Windows) ● Hive Not yet
  17. 17. Under the hood of the nearline architecture
  18. 18. Pig to Samza through SQL processing ● Apache Calcite: open source framework for building dynamic data management systems. Including: ➢ SQL Parser ➢ Relational algebra APIs ➢ Query planning engine ● We built UMP nearline with: ➢ Pig’s Grunt parser ➢ Calcite relational algebra ➢ Calcite query planning engine
  19. 19. Architecture ... Metric union User code User code Dimension augmentation Calcite relational algebra as an IR convert generate Samza code optimize Samza physical plan Samza configuration Pig to Calcite Calcite to Samza
  20. 20. Pig to Calcite ● User scripts: # code LOAD … LOAD ... COGROUP ... STORE … ● GruntParser -> Pig Logical Plan: COGROUP over LOAD, LOAD ● PigRelConverter -> Calcite relational algebra: PROJECT over FULL OUTER JOIN over AGGREGATE, AGGREGATE over TABLE SCAN, TABLE SCAN
  21. 21. Example
  22. 22. Example
  23. 23. Example INNER JOIN FILTER FILTER PROJECT PROJECT PROJECT TABLE SCAN TABLE SCAN Calcite logical plan
  24. 24. Planning/Optimization ➢ Calcite logical plans: ○ Relational algebra: What to do ➢ Samza physical plans: ○ Samza physical node: How to do it ➢ Calcite Samza planner: ○ Calcite logical plan -> optimized Samza physical plan
  25. 25. Example ● Calcite logical plan: INNER JOIN, FILTER, FILTER, PROJECT, PROJECT, PROJECT, TABLE SCAN, TABLE SCAN ● Calcite Samza planner ● Samza physical plan: Stream-Stream Self Join, Samza Project, Samza Project, Samza Filter, Samza Filter, Samza Project, Input Stream
  26. 26. Code-gen From Samza physical plans: ➢ Generate Samza code for constructing the stream graph using Samza fluent APIs. Mapping: ➢ Samza physical nodes -> corresponding stream APIs: ○ Samza project -> stream.map() ○ Samza filter -> stream.filter() ○ ... ➢ Relational expressions -> lambda functions: ○ Filter expressions -> filter() functions ○ Project expressions -> map() functions ○ ...
  27. 27. Example Stream- Stream Self Join Samza Project 1 Samza Project 2 Samza Filter 1 Samza Filter 2 Samza Project Input Stream
  28. 28. Example Stream- Stream Self Join Samza Project 1 Samza Project 2 Samza Filter 1 Samza Filter 2 Samza Project Input Stream
  29. 29. Example Stream- Stream Self Join Samza Project 1 Samza Project 2 Samza Filter 1 Samza Filter 2 Samza Project Input Stream
  30. 30. Example Stream- Stream Self Join Samza Project 1 Samza Project 2 Samza Filter 1 Samza Filter 2 Samza Project Input Stream
  31. 31. Example Stream- Stream Self Join Samza Project 1 Samza Project 2 Samza Filter 1 Samza Filter 2 Samza Project Input Stream
  32. 32. Example Stream- Stream Self Join Samza Project 1 Samza Project 2 Samza Filter 1 Samza Filter 2 Samza Project Input Stream
  33. 33. Config-gen Stream Stream Join Samza Project Samza Project Samza Filter Samza Filter Samza Project Input Stream # dataset.conf app-src app-def
  34. 34. Nearline production use cases
  35. 35. Top stories picked up by editors
  36. 36. Use case 1 - Feedback to editor
  37. 37. Use case 2 - Recruiter usage statistics
  38. 38. Conclusion
  39. 39. From improved Lambda architecture... ● Lambda architecture with a single codebase ● Diagram labels: Metrics definition (code, config); UMP realtime platform (Samza jobs); UMP offline platform (Batch jobs); HDFS; Pinot; Raptor
  40. 40. … to our bigger picture Pig Latin Calcite relational algebra HiveQL SparkSQL/ RDD Presto SQL Portable UDFs AORA (Author Once, Run Anywhere) architecture
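
The code-gen mapping on slide 26 (Samza project -> stream.map(), Samza filter -> stream.filter(), relational expressions -> lambda functions) can be illustrated with a minimal sketch. This is plain Python over iterators, not Samza code: the `FilterNode` and `ProjectNode` classes, the `build_pipeline` function, and the sample records are all hypothetical names invented for this illustration, and the real platform generates Samza code rather than interpreting a plan at run time.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Union

Record = Dict[str, Any]


@dataclass
class FilterNode:
    """Hypothetical physical filter node: keeps records matching a predicate."""
    predicate: Callable[[Record], bool]


@dataclass
class ProjectNode:
    """Hypothetical physical project node: rewrites each record."""
    expression: Callable[[Record], Record]


PlanNode = Union[FilterNode, ProjectNode]


def build_pipeline(plan: List[PlanNode]) -> Callable[[Iterable[Record]], Iterator[Record]]:
    """Turn a linear physical plan into a function over a stream of records.

    Each node maps to one stream operation, mirroring the slide's mapping:
    filter node -> filter(), project node -> map().
    """
    def run(stream: Iterable[Record]) -> Iterator[Record]:
        for node in plan:
            if isinstance(node, FilterNode):
                stream = filter(node.predicate, stream)
            elif isinstance(node, ProjectNode):
                stream = map(node.expression, stream)
            else:
                raise TypeError(f"unsupported plan node: {node!r}")
        return iter(stream)
    return run


# Assumed sample plan: keep events from one country, then project one field.
plan = [
    FilterNode(lambda r: r["country"] == "vn"),
    ProjectNode(lambda r: {"member_id": r["member_id"]}),
]
events = [
    {"member_id": 1, "country": "vn"},
    {"member_id": 2, "country": "us"},
    {"member_id": 3, "country": "vn"},
]
result = list(build_pipeline(plan)(events))
print(result)
```

The sketch handles only a linear chain of nodes; joins, such as the stream-stream self join in the slides' example, would need a node type with two inputs and are left out here.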
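
Slide 20 shows a Pig COGROUP over two LOADs being converted into a FULL OUTER JOIN over two AGGREGATEs in Calcite relational algebra. The sketch below illustrates why that rewrite preserves the result, assuming each AGGREGATE collects the rows of one input into a bag per key. It is plain Python, not Pig or Calcite code, and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Hashable, List, Tuple

Row = Dict[str, Any]


def aggregate(rows: List[Row], key: str) -> Dict[Hashable, List[Row]]:
    """Group rows by key, collecting each group's rows into a bag (list)."""
    bags: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        bags[row[key]].append(row)
    return dict(bags)


def full_outer_join(
    left: Dict[Hashable, List[Row]],
    right: Dict[Hashable, List[Row]],
) -> List[Tuple[Hashable, List[Row], List[Row]]]:
    """Join two keyed aggregates, keeping keys that appear on either side.

    A key missing on one side gets an empty bag, which matches what
    COGROUP produces for that input.
    """
    keys = sorted(set(left) | set(right))
    return [(k, left.get(k, []), right.get(k, [])) for k in keys]


def cogroup(a: List[Row], b: List[Row], key: str):
    """COGROUP expressed as a full outer join of two per-input aggregates."""
    return full_outer_join(aggregate(a, key), aggregate(b, key))


# Assumed sample inputs standing in for the two LOAD statements.
views = [
    {"id": 1, "page": "feed"},
    {"id": 2, "page": "jobs"},
    {"id": 1, "page": "jobs"},
]
clicks = [
    {"id": 2, "item": "x"},
    {"id": 3, "item": "y"},
]
grouped = cogroup(views, clicks, "id")
for key, view_bag, click_bag in grouped:
    print(key, len(view_bag), len(click_bag))
```

Keys are sorted only to make the output order deterministic in this sketch; neither a COGROUP nor a join guarantees an order.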
