Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem

Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes that enable LinkedIn to roll out future product innovations with minimal downstream impact. Shirshanka and Yael explore the motivations and the building blocks for this reimagined data analytics ecosystem and the technical details of LinkedIn's new client-side tracking infrastructure, its unified reporting platform, and its data virtualization layer on top of Hadoop, and they share lessons learned from data producers and consumers that are participating in this governance model. Along the way, they offer anecdotal evidence from the rollout that validated some of their decisions and is also shaping the future roadmap of these efforts.

Published in: Data & Analytics

  1. Architecting for change: LinkedIn's new data ecosystem. Sept 28, 2016. Shirshanka Das, Principal Staff Engineer, LinkedIn; Yael Garten, Director of Data Science, LinkedIn. @shirshanka, @yaelgarten
  2. Design for change. Expect it. Embrace it.
  3. Product Change, Technology, Culture & Process, Learnings
  4. The Product Change: launch a completely rewritten LinkedIn mobile app
  5. What does this impact? Data driven product
  6. Tracking data records user activity: InvitationClickEvent()
  7. Tracking data records user activity: InvitationClickEvent(). Scale fact: ~1000 tracking event types, ~double-digit TB per day, hundreds of metrics & data products
  8. Tracking Data Lifecycle: Produce, Transport, Consume. Labels: user engagement, tracking data, metric scripts, production code, member facing data products, business facing decision making
  9. Tracking Data Lifecycle & Teams. Produce: Product or App teams (PMs, Developers, TestEng). Transport: Infra teams (Hadoop, Kafka, DWH, ...). Consume: Data teams (Analytics, Relevance Engineers, ...). Labels: user engagement, tracking data, metric scripts, production code, member facing data products, business facing decision making
  10. How do we calculate a metric: ProfileViews
 PageViewEvent Record 1: { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { "vieweeID" : "23456", ... } }
 PageViewEvent Record 1 (with the new page key): { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "new_profile_page" }, "trackingInfo" : { "vieweeID" : "23456", ... } }
 Metric: ProfileViews = sum(PageViewEvent) where pageKey = profile_page or new_profile_page
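The metric on this slide can be sketched in a few lines of Python. This is only an illustration of the counting rule, not LinkedIn's metric code: the function name `profile_views`, the sample records, and the `feed_page` key are assumptions made for the example.

```python
from typing import Iterable, Mapping

# Page keys that count as a profile view, copied from the slide.
PROFILE_PAGE_KEYS = frozenset({"profile_page", "new_profile_page"})


def profile_views(events: Iterable[Mapping]) -> int:
    """Count PageViewEvent records whose header.pageKey is a profile page."""
    return sum(
        1
        for event in events
        if event.get("header", {}).get("pageKey") in PROFILE_PAGE_KEYS
    )


# Illustrative records; "feed_page" is a made-up key for a non-profile page.
sample_events = [
    {"header": {"memberId": 12345, "pageKey": "profile_page"}},
    {"header": {"memberId": 12345, "pageKey": "new_profile_page"}},
    {"header": {"memberId": 12345, "pageKey": "feed_page"}},
]
profile_view_count = profile_views(sample_events)
```

Every new page key has to be added to the set by hand, which is the maintenance cost the slide points at.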
  11. PageViewEvent Record 1: { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { "vieweeID" : "23456", ... } }
 CASE WHEN trackingInfo["profileIds"] ... WHEN trackingInfo["profileid"] ... WHEN trackingInfo["profileId"] ... WHEN trackingInfo["url$profileIds"] ... WHEN trackingInfo["11"] LIKE '%profileIds=%' THEN SUBSTRING(trackingInfo["11"],9,60) WHEN trackingInfo["12"] LIKE '%priceIds=%' THEN SUBSTRING(trackingInfo["12"],9,60) ELSE NULL END AS profile_id
 Evolution as we mature and grow...
 Metric: ProfileViews = sum(PageViewEvent where pageKey = profile_page and memberId != trackingInfo[vieweeID])
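A rough Python rendering of the two pieces of logic on this slide: coalescing the differently spelled profile-id keys, and excluding self-views from the metric. The slide elides what each CASE branch returns, so returning the raw map value is an assumption, as are the function names and the choice to skip records with no `vieweeID` (mirroring SQL's NULL comparison behaviour).

```python
from typing import Mapping, Optional

# Key spellings copied from the CASE expression on the slide. What each
# branch returns is elided there; returning the raw value is an assumption.
PROFILE_ID_KEYS = ("profileIds", "profileid", "profileId", "url$profileIds")


def extract_profile_id(tracking_info: Mapping[str, str]) -> Optional[str]:
    """Return the first profile id found under any known key spelling."""
    for key in PROFILE_ID_KEYS:
        if key in tracking_info:
            return tracking_info[key]
    return None


def is_counted_profile_view(event: Mapping) -> bool:
    """True for a profile page view where the viewer is not the viewee."""
    header = event.get("header", {})
    viewee = event.get("trackingInfo", {}).get("vieweeID")
    if header.get("pageKey") != "profile_page" or viewee is None:
        return False
    # memberId is numeric and vieweeID is a string in the sample record,
    # so both are compared as strings.
    return str(header.get("memberId")) != str(viewee)


other_view = {
    "header": {"memberId": 12345, "pageKey": "profile_page"},
    "trackingInfo": {"vieweeID": "23456"},
}
self_view = {
    "header": {"memberId": 12345, "pageKey": "profile_page"},
    "trackingInfo": {"vieweeID": "12345"},
}
```

Each consumer script that needs a clean profile id or a self-view filter ends up carrying its own copy of rules like these.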
  12. Eventually… unmaintainable
 get_tracking_codes = foreach get_domain_rolled_up generate ..entry_domain_rollup, (
 (tracking_code matches 'eml-ced.*' or tracking_code matches 'eml-b2_content_ecosystem_digest.*' or (referer is not null and (referer matches '.*touch.linkedin.com.*trk=eml-ced.*' or referer matches '.*touch.linkedin.com.*trk=eml-b2_content_ecosystem_digest.*')) ? 'Email - CED' :
 (tracking_code matches 'eml-.*' or (referer is not null and referer matches '.*touch.linkedin.com.*trk=eml-.*') or entry_domain_rollup == 'Email' ? 'Email - Other' :
 (tracking_code == 'hp-feed-article-title-hpm' and entry_domain_rollup == 'Linkedin' ? 'Homepage Pulse Module' :
 ((tracking_code matches 'hp-feed-.*' and entry_domain_rollup == 'Linkedin') or (std_user_interface matches '(phone app|tablet app|phone browser|tablet browser)' and tracking_code == 'v-feed') or (tracking_code == 'Organic Traffic' and entry_domain_rollup == 'Linkedin' and (referer == 'https://www.linkedin.com/nhome' or referer == 'http://www.linkedin.com/nhome')) ? 'Feed' :
 (tracking_code matches 'hb_ntf_MEGAPHONE_.*' and entry_domain_rollup == 'Linkedin' ? 'Desktop Notifications' :
 (tracking_code == 'm_sim2_native_reader_swipe_right' ? 'Push Notification' :
 (tracking_code == 'pulse_dexter_stream_scroll' and entry_domain_rollup == 'Linkedin' ? 'Pulse - Infinite Scroll on Dexter' : --infinite scroll on dexter
 ((tracking_code == 'pulse_dexter_nav_click' or tracking_code == 'pulse-det-nav_art') and entry_domain_rollup == 'Linkedin' ? 'Pulse - Left Rail Click on Dexter' : --left rail click on dexter
 (tracking_code == 'Organic Traffic' and referer is not null and referer matches '.*linkedin.com/pulse/article.*' ? 'Publishing Platform' : 'None Found Yet')))))))))) as entry_point;
 Homepage team, Push Notification team, Email team, Long form post team
  13. We wanted to move to better data models
 PageViewEvent Record 1: { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { "vieweeID" : "23456", ... } }
 LI_ProfileViewEvent Record 1: { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "entityView" : { "viewType" : "profile-view", "viewerId" : "12345", "vieweeId" : "23456" } }
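To make the difference concrete, here is a small sketch of how a consumer might read the new record with named fields instead of digging through a free-form `trackingInfo` map. The `EntityView` class and `parse_entity_view` function are hypothetical names chosen for this example; only the field names and sample values come from the slide.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EntityView:
    """Typed mirror of the entityView block in LI_ProfileViewEvent."""
    view_type: str
    viewer_id: str
    viewee_id: str


def parse_entity_view(record: Mapping) -> EntityView:
    """Read the entityView block; raises KeyError if a field is missing."""
    view = record["entityView"]
    return EntityView(
        view_type=view["viewType"],
        viewer_id=view["viewerId"],
        viewee_id=view["vieweeId"],
    )


# Sample record with the values shown on the slide.
new_record = {
    "header": {"memberId": 12345, "time": 4745292951145, "pageKey": "profile_page"},
    "entityView": {
        "viewType": "profile-view",
        "viewerId": "12345",
        "vieweeId": "23456",
    },
}
parsed = parse_entity_view(new_record)
is_self_view = parsed.viewer_id == parsed.viewee_id
```

With explicit viewer and viewee fields, the self-view check becomes a plain field comparison rather than a lookup across several key spellings.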
  14. Two options:
 1. Keep the old tracking. a. Cost: producers (try to) replicate it (write bad old code from scratch). b. Save: consumers avoid migrating.
 2. Evolve. a. Cost: time on data modeling, and on consumer migration. b. Save: pays down data modeling tech debt.
 How much work would it be?
  15. How much work would it be? Two options:
 1. Keep the old tracking. a. Cost: producers (try to) replicate it (write bad old code from scratch). b. Save: consumers avoid migrating.
 2. Evolve. a. Cost: time on data modeling, and on consumer migration. b. Save: pays down data modeling tech debt.
 Estimated cost to update consumers to new tracking with clean, committee-approved data models: 2000 days. Estimated cost for producers to attempt to replicate old tracking: 5000 days. #AnalyticsHappiness
  16. The Task and Opportunity
 Must do: So we will do the data modeling, and rewrite all the metrics to account for the changes happening upstream… but…
 Extra credit points: How do we make sure that the cost is not this high the next time? How do we handle evolution in a principled way?
  17. Product Change, Technology, Culture & Process, Learnings
  18. Metrics ecosystem at LinkedIn, 3 yrs ago: operational challenges; diminished trust due to multiple sources of truth
  19. Data Stages: Create, Ingest, Process, Serve, Visualize
  20. Data Stages (Create, Ingest, Process, Serve, Visualize): Tracking, Kafka, Espresso, …
  21. Tracking Architecture (Create stage). Components: SDKs in different frameworks (server, client), tracking front-end, monitoring tools. Diagram labels: Client-side Tracking, Tracking Frontend, Kafka, Services, Tools
  22. Data Stages: Create, Ingest, Process, Serve, Visualize
  23. Unified ingestion (Ingest stage): hundreds of TB / day, thousands of datasets, 80+% of data ingest
  24. In production @ LinkedIn, Intel, Swisscom, NerdWallet, PayPal (Ingest stage)
  25. Data Stages (Create, Ingest, Process, Serve, Visualize): Hadoop
  26. Processing engines @ LinkedIn (Process stage)
  27. Data Stages (Create, Ingest, Process, Serve, Visualize): Pinot
  28. Pinot (Serve stage): distributed multi-dimensional OLAP; columnar + indexes; no joins; latency: low ms to sub-second. Diagram labels: Kafka, Hadoop, Samza Jobs, Pinot; minutes, hour +
  29. Site-facing apps, reporting dashboards, monitoring. In production @ LinkedIn, Uber (Serve stage)
  30. Data Stages (Create, Ingest, Process, Serve, Visualize): Hadoop, Pinot, Raptor
  31. Data Stages (Create, Ingest, Process, Serve, Visualize): Unified Metrics Platform (UMP), Hadoop, Pinot, Raptor
  32. Unified Metrics Platform. Diagram labels: Metrics Logic; Raw Data; UMP Harness (Incremental Aggregate, Backfill, Auto-join); HDFS Aggregated Data; Pinot; Raptor dashboards; HDFS; Experiment Analysis; Relevance; ...; Ad-hoc
  33. Data Stages (Create, Ingest, Process, Serve, Visualize): Tracking, Kafka, Espresso, …, Hadoop, Unified Metrics Platform (UMP), Pinot, Raptor
  34. How do we handle old and new? Producers: PageViewEvent (old), ProfileViewEvent (new). Consumers: Relevance, Analytics
  35. The Big Challenge. Our scripts were doing …: load '/data/tracking/PageViewEvent' using AvroStorage() (Pig scripts), against My Raw Data
  36. We need "microservices" for Data: My Raw Data, My Data API
  37. The Database community solved this decades ago... Views!
  38. We had been working on something that could help... A Data Access Layer for LinkedIn: abstract away underlying physical details to allow users to focus solely on the logical concerns. Logical Tables + Views; Logical FileSystem
  39. Solving With Views. Producers: PageViewEvent (old), ProfileViewEvent (new). View: LinkedInProfileView, built from PageViewEvent where pagekey == profile and from ProfileViewEvent 1:1. Consumers: Relevance, Analytics
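The idea behind LinkedInProfileView can be sketched as a function that presents old and new events as one stream of rows. This is a plain-Python illustration, not a Dali view definition: the output row shape (`viewerId`, `vieweeId`, `time`) and the function name are assumptions, and the old-event filter uses the `profile_page` key from the earlier sample record where the slide abbreviates it to `profile`.

```python
from typing import Iterable, Iterator, Mapping

# The slide writes "pagekey == profile"; the sample records use "profile_page".
OLD_PROFILE_PAGE_KEY = "profile_page"


def linkedin_profile_view(
    page_view_events: Iterable[Mapping],
    profile_view_events: Iterable[Mapping],
) -> Iterator[dict]:
    """Yield one unified row per profile view, from old and new sources."""
    # Old source: filter to profile pages and reshape into the unified row.
    for event in page_view_events:
        header = event["header"]
        if header.get("pageKey") == OLD_PROFILE_PAGE_KEY:
            yield {
                "viewerId": str(header["memberId"]),
                "vieweeId": str(event["trackingInfo"]["vieweeID"]),
                "time": header["time"],
            }
    # New source: maps 1:1 onto the unified row.
    for event in profile_view_events:
        view = event["entityView"]
        yield {
            "viewerId": view["viewerId"],
            "vieweeId": view["vieweeId"],
            "time": event["header"]["time"],
        }


old_events = [
    {"header": {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12345, "time": 1454745292952, "pageKey": "feed_page"},
     "trackingInfo": {}},
]
new_events = [
    {"header": {"memberId": 12345, "time": 4745292951145},
     "entityView": {"viewType": "profile-view", "viewerId": "12345", "vieweeId": "23456"}},
]
rows = list(linkedin_profile_view(old_events, new_events))
```

Consumers read only the unified rows, so a producer can switch from the old event to the new one without any consumer script changing.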
  40. Views ecosystem. Producers: LinkedIn App → LinkedInProfileView; Job Seeker App (JSA) → JSAProfileView. Consumers: UnifiedProfileView
  41. Dali: Implementation Details in Context. Diagram labels: Data Catalog + Discovery (DALI); DaliFileSystem Client; Data Source (HDFS); Data Sink (HDFS); Processing Engine (MapReduce, Spark); DALI Datasets (Tables + Views); Query Layers (Hive, Pig, Spark); View Defs + UDFs (Artifactory, Git); Dataflow APIs (MR, Spark, Scalding); DALI CLI
  42. One small step for a script. From: load '/data/tracking/PageViewEvent' using AvroStorage(); To: load 'tracking.UnifiedProfileView' using DaliStorage();
  43. A Few Hard Problems. Versioning: views and UDFs; mapping to Hive metastore entities. Development lifecycle: Git as source of truth; Gradle for build; LinkedIn tooling integration for deployment
  44. Early experiences with Dali views. How we executed: lots of work to get the infra ready; closed beta model; tons of training and education (hand holding) for all; governance body. Feedback from analysts is overwhelmingly positive: (+) much simpler to share and standardize data cleansing code with peers; (+) provides effective insulation to scripts from upstream changes; (-) harder to debug where problems are, due to the additional layer
  45. State of the world today: ~100 producer views; ~200 consumer views; ~30% of UMP metrics use Dali data sources; ~80 unique tracking event data sources. ProfileViews, MessagesSent, Searches, InvitationsSent, ArticlesRead, JobApplications, ...
  46. What's next for Dali? Real-time views on streaming data; selective materialization; Hive is an implementation detail, not a long term bet; open source Data Quality Framework
  47. Product Change, Technology, Culture & Process, Learnings
  48. Infrastructure enables, but culture really preserves
 get_tracking_codes = foreach get_domain_rolled_up generate ..entry_domain_rollup, (
 (tracking_code matches 'eml-ced.*' or tracking_code matches 'eml-b2_content_ecosystem_digest.*' or (referer is not null and (referer matches '.*touch.linkedin.com.*trk=eml-ced.*' or referer matches '.*touch.linkedin.com.*trk=eml-b2_content_ecosystem_digest.*')) ? 'Email - CED' :
 (tracking_code matches 'eml-.*' or (referer is not null and referer matches '.*touch.linkedin.com.*trk=eml-.*') or entry_domain_rollup == 'Email' ? 'Email - Other' :
 (tracking_code == 'hp-feed-article-title-hpm' and entry_domain_rollup == 'Linkedin' ? 'Homepage Pulse Module' :
 ((tracking_code matches 'hp-feed-.*' and entry_domain_rollup == 'Linkedin') or (std_user_interface matches '(phone app|tablet app|phone browser|tablet browser)' and tracking_code == 'v-feed') or (tracking_code == 'Organic Traffic' and entry_domain_rollup == 'Linkedin' and (referer == 'https://www.linkedin.com/nhome' or referer == 'http://www.linkedin.com/nhome')) ? 'Feed' :
 (tracking_code matches 'hb_ntf_MEGAPHONE_.*' and entry_domain_rollup == 'Linkedin' ? 'Desktop Notifications' :
 (tracking_code == 'm_sim2_native_reader_swipe_right' ? 'Push Notification' :
 (tracking_code == 'pulse_dexter_stream_scroll' and entry_domain_rollup == 'Linkedin' ? 'Pulse -
  49. For a great data ecosystem that can handle change:
 1. Standardize core data entities
 2. Create clear maintainable contracts between data producers & consumers
 3. Ensure dialogue between data producers & consumers
  50. 1. Standardize core data entities
 • Event types and names: Page, Action, Impression
 • Framework level client side tracking: views, clicks, flows
 • For all else (custom): guide when to create a new Event or Dali view
 Navigation, Page View, Control Interaction
  51. 2. Create clear maintainable contracts
 1. Tracking specification with monitoring: clear, visual, consistent contract (Tracking specification Tool)
 2. Dali dataset specification with data quality rules
 Need tooling to support culture shift
  52. 3. Ensure dialogue between Producers & Consumers
 • Awareness: train about end-to-end data pipeline, data modeling
 • Instill communication & collaborative ownership process between all: a step-by-step playbook for who & how to develop and own tracking
 PM → Analyst → Engineer → All3 → TestEng → Analyst
 Labels: user engagement, tracking data, metric scripts, production code, member facing data products, business facing decision making
  53. Product Change, Technology, Culture & Process, Learnings
  54. Our Learnings
 Culture and Process
 ● Spend time to identify what needs culture & process, and what needs tools & tech
 ● Big changes can mean big opportunities
 ● Very hard to massively change things like data culture or data tech debt; never a good time to invest in "invisible" behind-the-scenes change
 → Make it non-invisible: try to clarify or size out the cost of NOT doing it
 → Needed strong leaders, and a village
 Tech
 ● Must build tooling to support that culture change, otherwise culture will revert
 ● Work hard to make any new layer as frictionless as possible
 ● Virtual views on Hadoop data can work at scale! (Dali views)
  55. For a great data ecosystem that can handle change:
 1. Standardize core data entities
 2. Create clear maintainable contracts between data producers & consumers
 3. Ensure dialogue between data producers & consumers
 Design for change. Expect it. Embrace it.
  56. Did we succeed? We just handled another huge change! #AnalyticsHappiness
  57. Thank you. @shirshanka, @yaelgarten
