Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

4,490 views

Published on

So, you finally have a data ecosystem with Kafka and Hadoop both deployed and operating correctly at scale. Congratulations. Are you done? Far from it.

As the birthplace of Kafka and an early adopter of Hadoop, LinkedIn has 13 years of combined experience using Kafka and Hadoop at scale to run a data-driven company. Both Kafka and Hadoop are flexible, scalable infrastructure pieces, but using these technologies without a clear idea of what the higher-level data ecosystem should be is perilous. Shirshanka Das and Yael Garten share best practices around data models and formats, choosing the right level of granularity of Kafka topics and Hadoop tables, and moving data efficiently and correctly between Kafka and Hadoop and explore a data abstraction layer, Dali, that can help you to process data seamlessly across Kafka and Hadoop.

Beyond pure technology, Shirshanka and Yael outline the three components of a great data culture and ecosystem and explain how to create maintainable data contracts between data producers and data consumers (like data scientists and data analysts) and how to standardize data effectively in a growing organization to enable (and not slow down) innovation and agility. They then look to the future, envisioning a world where you can successfully deploy a data abstraction of views on Hadoop data, like a data API as a protective and enabling shield. Along the way, Shirshanka and Yael discuss observations on how to enable teams to be good data citizens in producing, consuming, and owning datasets and offer an overview of LinkedIn’s governance model: the tools, process and teams that ensure that its data ecosystem can handle change and sustain #datasciencehappiness.

Published in: Data & Analytics
  • Thirteen Signs the Angels Are With You ♣♣♣ http://t.cn/AiuvUCDd
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Free Magical Healing Soundscape - Attract more abundance... clear negativity... reduce stress... sleep deeper... and bring astonishing miracles into your life... simply by listening for a few minutes each day. Get yours now. ♣♣♣ https://bit.ly/30Ju5r6
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Sex in your area is here: www.bit.ly/sexinarea
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Manifest Absolutely Anything. Discover the Universe's "7 Sacred Signs" that guide the way to unlocking your heart�s greatest desires. Access your free report now! ◆◆◆ http://scamcb.com/manifmagic/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Dating for everyone is here: www.bit.ly/2AJerkH
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

  1. 1. Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Mar 16, 2017 Shirshanka Das, Principal Staff Engineer, LinkedIn Yael Garten, Director of Data Science, LinkedIn @shirshanka, @yaelgarten
  2. 2. The Pursuit of #DataScienceHappiness A original @yaelgarten @shirshanka
  3. 3. Achieve Data Democracy Data Scientists write code Unleash Insights Share Learnings at Strata! Three (Naïve) Steps to #DataScienceHappiness circa 2010
  4. 4. Achieve Data Democracy Data Scientists write code Unleash Insights Share Learnings at Strata! Three (Naïve) Steps to #DataScienceHappiness circa 2010
  5. 5. Achieving Data Democracy “… helping everybody to access and understand data .… breaking down silos… providing access to data when and where it is needed at any given moment.” Collect, flow, store as much data as you can Provide efficient access to data in all its stages of evolution
  6. 6. The forms of data Key-Value++ Message Bus Fast-OLAP Search Graph Crunchable Espresso Venice Pinot Galene Graph DB Document DB DynamoDB Azure Blob, Data Lake Storage
  7. 7. The forms of data At RestIn Motion Espresso Venice Pinot Galene Graph DBDocument DB DynamoDB Azure Blob, Data Lake Storage
  8. 8. The forms of data At RestIn Motion Scale O(10) clusters ~1.7 Trillion messages ~450 TB Scale O(10) clusters ~10K machines ~100 PB
  9. 9. At RestIn Motion SFTPJDBCREST Data Integration Azure Blob, Data Lake Storage
  10. 10. Data Integration: key requirements Source, Sink Diversity Batch + Streaming Data Quality So, we built
  11. 11. SFTP JDBC REST Simplifying Data Integration @LinkedIn Hundreds of TB per day Thousands of datasets ~30 different source systems 80%+ of data ingest Open source @ github.com/linkedin/gobblin Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more… Apache incubation under way SFTP Azure Blob, Data Lake Storage
  12. 12. Query Engines At RestIn Motion Processing Frameworks
  13. 13. Processing Frameworks
  14. 14. Query Engines At RestIn Motion Processing Frameworks
  15. 15. Kafka Hadoop Samza Jobs Pinot minutes hour + Distributed Multi-dimensional OLAP Columnar + indexes No joins Latency: low ms to sub-second Query Engines
  16. 16. Site-facing Apps Reporting dashboards Monitoring Open source. In production @ LinkedIn, Uber
  17. 17. At RestIn Motion Processing Frameworks Data Infra 1.0 for Data Democracy Query Engines 2010 - now
  18. 18. Achieve Data Democracy Data Scientists write code Unleash Insights Share Learnings at Strata
  19. 19. How does LinkedIn build data- driven products? 
 Data Scientist PM Designer Engineer We should enable users to filter connection suggestions by company How much do people utilize existing filter capabilities? Let's see how users send connection invitations today.
  20. 20. Tracking data records user activity InvitationClickEvent()
  21. 21. (powers metrics and data products) InvitationClickEvent() Scale fact:
 ~ 1000 tracking event types, 
 ~ Hundreds of metrics & data products Tracking data records user activity
  22. 22. user engagement tracking data metric scripts production code Tracking Data Lifecycle TransportProduce Consume Member facing data products Business facing decision making
  23. 23. Tracking Data Lifecycle & Teams Product or App teams: PMs, Developers, TestEng Infra teams: Hadoop, Kafka, DWH, ... Data science teams: 
 Analytics, ML Engineers,... user engagement tracking data metric scripts production code Member facing data products Business facing decision making TransportProduce Consume Members Execs
  24. 24. How do we calculate a metric: ProfileViews PageViewEvent Record 1: { "memberId" : 12345, "time" : 1454745292951, "appName" : "LinkedIn", "pageKey" : "profile_page", "trackingInfo" : “Viewee=1214,lnl=f,nd=1,o=1214, ^SP=pId-'pro_stars',rslvd=t,vs=v, vid=1214,ps=EDU|EXP|SKIL| ..." } Metric: 
 ProfileViews = sum(PageViewEvent
 where pagekey = profile_page
 ) PageViewEvent Record 101: { "memberId" : 12345, "time" : 1454745292951, "appName" : "LinkedIn", "pageKey" : "new_profile_page", "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214, ^SP=pId-'pro_stars',rslvd=t,vs=v, vid=1214,ps=EDU|EXP|SKIL| ..." } or new_profile_page Ok but forgot to notifyundesirable
  25. 25. Metrics ecosystem at LinkedIn: 3 yrs ago Operational Challenges for infra teams Diminished Trust due to multiple sources of truth
  26. 26. What was causing unhappiness? 1. No contracts: Downstream scripts broke when upstream changed
 2. "Naming things is hard": different semantics & conventions in various data Events (per team) 
 --> need to email to figure out what is correct and complete logic to use 
 --> inefficient and potentially wrong
 3. Discrepant metric logic: 
 Duplicate tech allowed for duplicate logic allowed for discrepant metric logic So how did we solve this?
  27. 27. Data Modeling Tip Say no to Fragile Formats or Schema-Free Invest in a mature serialization protocol like Avro, Protobuf, Thrift etc for serializing your messages to your persistent stores: Kafka, Hadoop, DBs etc. 1. No contracts 2. Naming things is hard 3. Discrepant metric logic Chose Avro as our format
  28. 28. Sometimes you need a committee Leads from product and infra teams Review each new data model Ensure that it follows our conventions, patterns and best practices across entire data lifecycle 1. No contracts 2. Naming things is hard 3. Discrepant metric logic Data Model Review Committee (DMRC) Tooling to codify conventions “Always be reviewing” Who and What Evolution
  29. 29. Unified Metrics Platform A single source of truth for all business metrics at LinkedIn 1. No contracts 2. Naming things is hard 3. Discrepant metric logic - metrics processing platform as a service - a metrics computation template - a set of tools and process to facilitate metrics life-cycle Central Team, Relevant Stakeholders Sandbox Metric Definition Code Repo Build & Deploy System JobsCore Metrics Job Metric Owner 1. iterate 2. create 4. check in 3. review 5,000 metrics daily
  30. 30. Unified Metrics Platform: Pipeline Metrics Logic Raw Data Pinot UMP Harness Incremental Aggregate Backfill Auto-join Raptor dashboards HDFS Aggregated Data Experiment Analysis Machine Learning Anomaly Detection HDFS Ad-hoc 1. No contracts 2. Naming things is hard 3. Discrepant metric logic Tracking + Database + Other data
  31. 31. Tracking Platform: standardizing production Schema compatibility Time Audit KafkaClient-side Tracking Tracking Frontend Services Tools
  32. 32. Query Engines At RestIn Motion Processing Frameworks Data Infra + Platforms 2.0 Pinot Tracking Platform Unified Metrics Platform (UMP) Production Consumption circa 2015
  33. 33. What was still causing unhappiness? 1. Old bad data sticks around (e.g. old mobile app versions) 2. No clear contract for data production - Producers unaware of consumers concerns 3. Never a good time to pay down this tech debt We started from the bottom.
  34. 34. Product or App teams: PMs, Developers, TestEng Infra teams: Hadoop, Kafka, DWH, ... Data science teams: 
 Analytics, ML Engineers,... user engagement tracking data metric scripts production code Member facing data products Business facing decision making Members Execs 3. Never a good time to pay down this "data" debt #victimsOfTheData —> #DataScienceHappiness 
 via proactively forging our own data destiny. Features are waiting to ship to members... some of this stuff is invisible But what is the cost of not doing it?
  35. 35. The Big Problem Opportunity in 2015 
 Launch a completely rewritten LinkedIn mobile app
  36. 36. PageViewEvent { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" "pageKey" : "profile_page" }, }, "trackingInfo" : { ["Viewee" : "23456"], ... } } We already wanted to move to better data models ProfileViewEvent { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" "pageKey" : "profile_page" }, }, "entityView" : { "viewType" : "profile-view", "viewerId" : “12345”, "vieweeId" : “23456”, }, } viewee_ID
  37. 37. 
 1. Keep the old tracking: a. Cost: producers (try to) replicate it (write bad old code from scratch), b. Save: consumers avoid migrating.
 
 2. Evolve. a. Cost: time on clean data modeling, and on consumer migration to new tracking events, b. Save: pays down data modeling tech debt There were two options:
  38. 38. 
 1. Keep the old tracking: a. Cost: producers (try to) replicate it (write bad old code from scratch), b. Save: consumers avoid migrating.
 
 2. Evolve. a. Cost: time on clean data modeling, and on consumer migration to new tracking events, b. Save: pays down data modeling tech debt How much work would it be? #DataScienceHappiness
  39. 39. We pitched it to our Leadership team Do it! CTOCEO
  40. 40. 2. Clear contract did not exist for data production Producers were unaware of consumers needs, and were "Throwing data over the wall". 
 Albeit avro, Schema adherence != Semantics equivalence user engagement tracking data metric 
 scripts production
 code Member facing
 data products Business facing decision making #victimsOfTheData —> #DataScienceHappiness, via proactive joint requirements definition Own the artifact that feeds the data ecosystem (and data scientists!) Data producers (PM, app developers) Data consumers 
 (DS)
  41. 41. 2a. Ensure dialogue between Producers & Consumers • Awareness: Train about end-to-end data pipeline, data modeling • Instill communication & collaborative ownership process between all: a step-by-step playbook for who & how to develop and own tracking
  42. 42. 2b. Standardized core data entities • Event types and names: Page, Action, Impression • Framework level client side tracking: views, clicks, flows • For all else (custom) - guide when to create a new Event
 Navigation Page View Control Interaction
  43. 43. 2c. Created clear maintainable data production contracts Tracking specification with monitoring and alerting for adherence: 
 clear, visual, consistent contract Need tooling to support culture and process shift - "Always be tooling"
 Tracking specification Tool
  44. 44. 1. Old bad data sticks around PageViewEvent { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" "pageKey" : "profile_page" }, }, "trackingInfo" : { ["vieweeID" : "23456"], ... } } ProfileViewEvent { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" "pageKey" : "profile_page" }, }, "entityView" : { "viewType" : "profile-view", "viewerId" : “12345”, "vieweeId" : “23456”, }, }
  45. 45. How do we handle old and new? PageViewEvent ProfileViewEvent Producers Consumers old new Relevance Analytics
  46. 46. The Big Challenge load “/data/tracking/PageViewEvent” using AvroStorage() (Pig scripts) My Raw Data Our scripts were doing ….
  47. 47. My Raw Data My Data API We need “microservices" for Data
  48. 48. The Database community solved this decades ago... Views!
  49. 49. We built Dali to solve this A Data Access Layer for Linkedin Abstract away underlying physical details to allow users to focus solely on the logical concerns Logical Tables + Views Logical FileSystem
  50. 50. Solving With Views Producers LinkedInProfileView PageViewEvent ProfileViewEvent new old Consumers pagekey== profile 1:1 Relevance Analytics
  51. 51. Views ecosystem 51 Producers Consumers LinkedInProfileView JSAProfileView Job Seeker App (JSA) LinkedIn App UnifiedProfileView
  52. 52. Dali: Implementation Details in Context Dali FileSystem Processing Engine (MR, Spark) Dali Datasets (Tables+Views) Dataflow APIs (MR, Spark, Scalding) Query Layers (Pig, Hive, Spark) Dali CLI Data Catalog Git + Artifactory View Def + UDFs Dataset Owner Data Source Data Sink
  53. 53. From load ‘/data/tracking/PageViewEvent’ using AvroStorage(); To load ‘tracking.UnifiedProfileView’ using DaliStorage(); One small step for a script
  54. 54. A Few Hard Problems Versioning Views and UDFs Mapping to Hive metastore entities Development lifecycle Git as source of truth Gradle for build LinkedIn tooling integration for deployment
  55. 55. State of the world today ~300 views Pretty much all new UMP metrics use Dali data sources ProfileViews MessagesSent Searches InvitationsSent ArticlesRead JobApplications ... At Rest Data Processing Frameworks
  56. 56. Now brewing: Dali on Kafka Can we take the same views and run them seamlessly on Kafka as well? Stream Data Standard streaming API-s - Samza System Consumer - Kafka Consumer
  57. 57. What’s next for Dali? Selective materialization Open source Hive is an implementation detail, not a long term bet
  58. 58. Dali: When are we done dreaming? At RestIn Motion Data Processing Frameworks Dali
  59. 59. Query Engines At RestIn Motion Processing Frameworks Data Infra + Platforms 3.0 Pinot Tracking Platform Unified Metrics Platform (UMP) DaliDr Elephant WhereHows circa 2017
  60. 60. Did we succeed? We just handled another huge rewrite! #DataScienceHappiness
  61. 61. Achieve Data Democracy Data Scientists write code Unleash Insights Share Learnings at Strata Three (Naïve) Steps to #DataScienceHappiness
  62. 62. Basic data infrastructure for data democracy Platforms, Process to standardize produce + consume Evangelize investing
 in #DataScience Happiness Tech + process
 to sustain healthy data ecosystem Our Journey towards #DataScienceHappiness Dali, Dialogue 2015-> Tracking, UMP DMRC 2013 -> Kafka, Hadoop, Gobblin, Pinot 2010 -> 2015 ->
  63. 63. The Pursuit of #DataScienceHappiness A original @yaelgarten @shirshanka Thank You! to be continued…

×