SlideShare a Scribd company logo
1 of 37
Download to read offline
The Evolution of Metadata:
LinkedIn’s Journey
Sept 25, 2019
Shirshanka Das, Principal Staff Engineer, LinkedIn
Mars Lan, Staff Engineer, LinkedIn
@shirshanka
LinkedIn’s Data Ecosystem
What is Metadata?
What datasets do we have in our data warehouse (Hadoop, Teradata)
How do we easily find them
What does their schema look like
What datasets derive from these datasets
"Confused" by Sara Beyer (CC BY-NC-ND 2.0)
? ?
Attempt 1: Crawl Phase
Attempt 1: Crawl Phase
Crawl all catalogs you can
Parse all the logs you can
ETL into an opinionated data model
Build a search index + a lookup store
Build an app to serve this info
This is a useful product!
We gave it a clever name: WhereHows
And we open sourced it in 2016
Attempt 1: A few things we observed
Pull-based integrations
Central team’s burden
Pipeline fragility, freshness
Rigidity of model
Hard to iterate quickly
"Grimaces" by Fouquier (CC BY-NC-ND 2.0)
What is Metadata?
What datasets do we have in our entire data
ecosystem?
Espresso, Kafka, Hadoop, Teradata, Search, Pinot …
20+ data systems
What does their schema look like?
What datasets derive from these datasets?
What business types are contained in these schemas?
Who owns these datasets?
Where are datasets being copied?
…
"[Katsiaryna Lenets] © 123RF.com"
Attempt 2: Walk Phase
Attempt 2: Walk Phase
Separate REST-ful service to support this diversity
Dataset Naming
Generic Schema model for all kinds of data stores
and formats
Scalable integration patterns
Pub-Sub using Kafka
Support REST + Kafka ingest route
We called it TMS: THE Metadata Store :)
We couldn’t open source it, too coupled with our internal business concepts :(
New extensions to the model
Business metadata
Ownership
Attempt 2: A few things we observed
The Good
Teams were now accountable for the quality of their metadata and their custom ETL
Aggregate base metadata from all data platforms, overlay new metadata “aspects” on top easily
The Not So Good
New kind of metadata needed —> get in line behind central TMS team
Source of truth versus “Reflection of truth” debates —> no one wants to take a dependency
Standardized Event as interface: requires data model adapters everywhere, low incentive for producer to
excel at their job
What is Metadata?
"Thinking" by Elvin (CC BY-NC 2.0)
What is Metadata?
"Thinking" by Elvin (CC BY-NC 2.0)
What is Metadata?
"Thinking" by Elvin (CC BY-NC 2.0)
Models,
Features, …
Some observations about the problem
There is value in local sub-graphs, but
the real value lies in the global model
Cannot execute this with a single monolith
data model + service
Different teams care about different sub-
graphs
Central team bottleneck
Micro-services can help?
Back to silo-ed metadata problem
Attempt 3: Run Together
Distributed but collaborative authorship of model
Single ORM-like layer to auto-generate integrations for
CRUD operations
Search queries
Graph traversal
Distributed deployment and ownership of services possible
We’re calling this, the Generalized Metadata Architecture (GMA)
An Example Metadata Graph
DATASET
urn
platform
name
fabric
USER
Urn
firstName
lastName
PROFILE
{
firstName: John
lastName: Doe
Ldap: jdoe
}
GROUP
urn
name
size
OWNERSHIP
{
owners: [{
type: SRE,
user: jdoe
}, …]
}
MEMBERSHIP
{
admin: jdoe,
members: [{
jdoe,
…]
}
Owned By
Has Admin
Has Member
1N
1
1
1
NAspects
Relationships
Entity Anchors
…
SCHEMA
{
fields: [{
type: integer,
name: id
}, …]
}
Metadata Serving
API Endpoints
CRUD DAO
Search DAO
Graph DAO
Graph DB
Search Index
Document Store
Metadata Service
Metadata Indexing
Metadata Service
API Endpoints
CRUD DAO
Search Index
Graph DB
Search
processor
Graph
processor
Metadata
Audit
Event
Putting it all together
GIT Web App Service
CRUD Event
Event Processor
User, Groups
Service Dataset Service
SoT DB SoT DB
Change Event
Event Processor
Change
Event
Hadoop
Data Lake
Analytics
+
Relevance
Search
Design Decision Cheat-Sheet
Generic versus Specific Types:
Support strong-types layered over generic storage, model-first development
Expose strongly-typed REST API + Graph APIs for metadata traversal
Integration Strategy: Crawling versus Pub-Sub versus REST-ful API:
Prefer unified “Pub-Sub + REST-ful” API, build crawlers that publish
Single Source of Truth versus Replicated versus Federated:
For new metadata systems, integrate as ORM layer
For existing metadata systems, use Kafka to adapt
Federation strategy only supports lookup cases, hard to support search and graph well
What about X?
A very brief survey of other similar systems in this space
Hive Metastore: limited to Hadoop Dataset, focused on query planning
Atlas: generic model, missing strongly-typed API, dynamic types supported
Marquez: strongly opinionated model, focused on data pipeline metadata
Ground: generic model, missing strongly-typed API, research prototype
Amundsen: Web-app backed by Hive Metastore, strongly opinionated model
M
onolithic
Services
Metadata Platform by the numbers
Datasets : ~ 5M
Dashboards: ~2K
Features: ~3K
Metrics: ~30K
Schemas: ~100K
People: 10K+
CRUD Events: ~2M / day
Relationship Events: ~3M / day
Powered by Metadata
Metadata Platform
???
A Web App: Data Hub
Search and Discover Data Constructs
Find relationships, lineage, data quality, …
Enrich metadata
Some interesting parallels to the metadata platform in terms of extensibility
How do we add new pages to this app in a multi-team environment?
Data Hub: Search
Data Hub: Browse
Data Hub: a Dataset’s page
Data Hub: Lineage
Data Hub: a Metric’s page
Data Management with Compliance
Metadata Platform
Data Access Layer
Data
Management
(Purge, Export,
…)
Std Frameworks
Operations
Application Code
Physical Data
Data Pipeline Operations
Powered by Metadata
Metadata Platform
Search and
Discovery
beyond just
Datasets
Data Management,
Access
with Compliance
Operational
Monitoring,
Incremental Compute
AI : Model, Feature
reproducibility,
explainability
… and we’re just
getting started
It’s open source!
Open Source : the details!
Alpha release out currently at https://lnkd.in/datahub-alpha
Check it out at: Github project wherehows, branch: datahub (https://lnkd.in/datahub-github)
Capabilities:
Generic Modeling Layer with CRUD on MySQL
Search on Elastic
Graph on Neo4j*
Interfaces:
The DataHub Web App
REST-ful service from model files
Event Processors for metadata events
Metadata Model:
Datasets
People
Integrations
Data Catalogs (Hive, Kafka, JDBC*)
People (LDAP)
* coming real soon
Open sourcing the Data Model
LinkedIn Internal Model Open Source Model
Open Source : what’s coming next!
More Entities: [Jobs, Flows]
More Aspects within existing entities: [e.g. ReplicationPolicy, HiveSpecification]
More Integrations: [Calcite-compatible systems for fine-grain lineage]
App Features
User pages
Social interactions
Get engaged and
contribute to the global model
Build integrations with systems
Thank You!

More Related Content

What's hot

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Amy W. Tang
 
Making Data Scientists Productive in Azure
Making Data Scientists Productive in AzureMaking Data Scientists Productive in Azure
Making Data Scientists Productive in AzureValdas Maksimavičius
 
LinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformLinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformHien Luu
 
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthLessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthHostedbyConfluent
 
Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...DataWorks Summit
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016Carl Steinbach
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Denodo
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Anna Shymchenko
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data PipelineJesus Rodriguez
 
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Preferred Networks
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInSam Shah
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseDataWorks Summit
 
What is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperWhat is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperVasu S
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph PresentationAmy W. Tang
 
Big-Data Server Farm Architecture
Big-Data Server Farm Architecture Big-Data Server Farm Architecture
Big-Data Server Farm Architecture Jordan Chung
 

What's hot (20)

Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn Data Infrastructure at LinkedIn
Data Infrastructure at LinkedIn
 
Making Data Scientists Productive in Azure
Making Data Scientists Productive in AzureMaking Data Scientists Productive in Azure
Making Data Scientists Productive in Azure
 
LinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting PlatformLinkedIn Segmentation & Targeting Platform
LinkedIn Segmentation & Targeting Platform
 
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthLessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
 
Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...Building intelligent applications, experimental ML with Uber’s Data Science W...
Building intelligent applications, experimental ML with Uber’s Data Science W...
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
 
Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»Владимир Слободянюк «DWH & BigData – architecture approaches»
Владимир Слободянюк «DWH & BigData – architecture approaches»
 
Building a Big Data Pipeline
Building a Big Data PipelineBuilding a Big Data Pipeline
Building a Big Data Pipeline
 
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
Jubatus: Realtime deep analytics for BIgData@Rakuten Technology Conference 2012
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedIn
 
Smart data for a predictive bank
Smart data for a predictive bankSmart data for a predictive bank
Smart data for a predictive bank
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
 
What is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | WhitepaperWhat is an Open Data Lake? - Data Sheets | Whitepaper
What is an Open Data Lake? - Data Sheets | Whitepaper
 
LinkedIn Graph Presentation
LinkedIn Graph PresentationLinkedIn Graph Presentation
LinkedIn Graph Presentation
 
DataHub
DataHubDataHub
DataHub
 
Big-Data Server Farm Architecture
Big-Data Server Farm Architecture Big-Data Server Farm Architecture
Big-Data Server Farm Architecture
 

Similar to The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]

Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfkalai75
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera, Inc.
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network
 
Big Data
Big DataBig Data
Big DataNGDATA
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
A Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in ActionA Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in ActionAmazon Web Services
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big DataMrinal Kumar
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachAndre Freitas
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dataconomy Media
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Yael Garten
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBhavya Gulati
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling TechniqueCarmen Sanborn
 

Similar to The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019] (20)

Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Cloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdfCloud and Bid data Dr.VK.pdf
Cloud and Bid data Dr.VK.pdf
 
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your DataCloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
Cloudera Breakfast: Advanced Analytics Part II: Do More With Your Data
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
Big Data
Big DataBig Data
Big Data
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
A Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in ActionA Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in Action
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Cloud Computing & Big Data
Cloud Computing & Big DataCloud Computing & Big Data
Cloud Computing & Big Data
 
The future of Big Data tooling
The future of Big Data toolingThe future of Big Data tooling
The future of Big Data tooling
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
 
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise...
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Big data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edgeBig data analytics: Technology's bleeding edge
Big data analytics: Technology's bleeding edge
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 

More from Shirshanka Das

Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Shirshanka Das
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Shirshanka Das
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemShirshanka Das
 
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Shirshanka Das
 
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Shirshanka Das
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Shirshanka Das
 

More from Shirshanka Das (6)

Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystemStrata 2016 - Architecting for Change: LinkedIn's new data ecosystem
Strata 2016 - Architecting for Change: LinkedIn's new data ecosystem
 
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
 
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015Bigger Faster Easier: LinkedIn Hadoop Summit 2015
Bigger Faster Easier: LinkedIn Hadoop Summit 2015
 
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
Databus: LinkedIn's Change Data Capture Pipeline SOCC 2012
 

Recently uploaded

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]

  • 1. The Evolution of Metadata: LinkedIn’s Journey Sept 25, 2019 Shirshanka Das, Principal Staff Engineer, LinkedIn Mars Lan, Staff Engineer, LinkedIn @shirshanka
  • 3. What is Metadata? What datasets do we have in our data warehouse (Hadoop, Teradata) How do we easily find them What does their schema look like What datasets derive from these datasets "Confused" by Sara Beyer (CC BY-NC-ND 2.0) ? ?
  • 5. Attempt 1: Crawl Phase Crawl all catalogs you can Parse all the logs you can ETL into an opinionated data model Build a search index + a lookup store Build an app to serve this info This is a useful product! We gave it a clever name: WhereHows And we open sourced it in 2016
  • 6. Attempt 1: A few things we observed Pull-based integrations Central team’s burden Pipeline fragility, freshness Rigidity of model Hard to iterate quickly "Grimaces" by Fouquier (CC BY-NC-ND 2.0)
  • 7. What is Metadata? What datasets do we have in our entire data ecosystem? Espresso, Kafka, Hadoop, Teradata, Search, Pinot … 20+ data systems What does their schema look like? What datasets derive from these datasets? What business types are contained in these schemas? Who owns these datasets? Where are datasets being copied? … "[Katsiaryna Lenets] © 123RF.com"
  • 9. Attempt 2: Walk Phase Separate REST-ful service to support this diversity Dataset Naming Generic Schema model for all kinds of data stores and formats Scalable integration patterns Pub-Sub using Kafka Support REST + Kafka ingest route We called it TMS: THE Metadata Store :) We couldn’t open source it, too coupled with our internal business concepts :( New extensions to the model Business metadata Ownership
  • 10. Attempt 2: A few things we observed The Good Teams were now accountable for the quality of their metadata and their custom ETL Aggregate base metadata from all data platforms, overlay new metadata “aspects” on top easily The Not So Good New kind of metadata needed —> get in line behind central TMS team Source of truth versus “Reflection of truth” debates —> no one wants to take a dependency Standardized Event as interface: requires data model adapters everywhere, low incentive for producer to excel at their job
  • 11. What is Metadata? "Thinking" by Elvin (CC BY-NC 2.0)
  • 12. What is Metadata? "Thinking" by Elvin (CC BY-NC 2.0)
  • 13. What is Metadata? "Thinking" by Elvin (CC BY-NC 2.0) Models, Features, …
  • 14. Some observations about the problem There is value in local sub-graphs, but the real value lies in the global model Cannot execute this with a single monolith data model + service Different teams care about different sub- graphs Central team bottleneck Micro-services can help? Back to silo-ed metadata problem
  • 15. Attempt 3: Run Together Distributed but collaborative authorship of model Single ORM-like layer to auto-generate integrations for CRUD operations Search queries Graph traversal Distributed deployment and ownership of services possible We’re calling this, the Generalized Metadata Architecture (GMA)
  • 16. An Example Metadata Graph DATASET urn platform name fabric USER Urn firstName lastName PROFILE { firstName: John lastName: Doe Ldap: jdoe } GROUP urn name size OWNERSHIP { owners: [{ type: SRE, user: jdoe }, …] } MEMBERSHIP { admin: jdoe, members: [{ jdoe, …] } Owned By Has Admin Has Member 1N 1 1 1 NAspects Relationships Entity Anchors … SCHEMA { fields: [{ type: integer, name: id }, …] }
  • 17. Metadata Serving API Endpoints CRUD DAO Search DAO Graph DAO Graph DB Search Index Document Store Metadata Service
  • 18. Metadata Indexing Metadata Service API Endpoints CRUD DAO Search Index Graph DB Search processor Graph processor Metadata Audit Event
  • 19. Putting it all together GIT Web App Service CRUD Event Event Processor User, Groups Service Dataset Service SoT DB SoT DB Change Event Event Processor Change Event Hadoop Data Lake Analytics + Relevance Search
  • 20. Design Decision Cheat-Sheet Generic versus Specific Types: Support strong-types layered over generic storage, model-first development Expose strongly-typed REST API + Graph APIs for metadata traversal Integration Strategy: Crawling versus Pub-Sub versus REST-ful API: Prefer unified “Pub-Sub + REST-ful” API, build crawlers that publish Single Source of Truth versus Replicated versus Federated: For new metadata systems, integrate as ORM layer For existing metadata systems, use Kafka to adapt Federation strategy only supports lookup cases, hard to support search and graph well
  • 21. What about X? A very brief survey of other similar systems in this space Hive Metastore: limited to Hadoop Dataset, focused on query planning Atlas: generic model, missing strongly-typed API, dynamic types supported Marquez: strongly opinionated model, focused on data pipeline metadata Ground: generic model, missing strongly-typed API, research prototype Amundsen: Web-app backed by Hive Metastore, strongly opinionated model M onolithic Services
  • 22. Metadata Platform by the numbers Datasets : ~ 5M Dashboards: ~2K Features: ~3K Metrics: ~30K Schemas: ~100K People: 10K+ CRUD Events: ~2M / day Relationship Events: ~3M / day
  • 24. A Web App: Data Hub Search and Discover Data Constructs Find relationships, lineage, data quality, … Enrich metadata Some interesting parallels to the metadata platform in terms of extensibility How do we add new pages to this app in a multi-team environment?
  • 27. Data Hub: a Dataset’s page
  • 29. Data Hub: a Metric’s page
  • 30. Data Management with Compliance Metadata Platform Data Access Layer Data Management (Purge, Export, …) Std Frameworks Operations Application Code Physical Data
  • 32. Powered by Metadata Metadata Platform Search and Discovery beyond just Datasets Data Management, Access with Compliance Operational Monitoring, Incremental Compute AI : Model, Feature reproducibility, explainability … and we’re just getting started
  • 34. Open Source : the details! Alpha release out currently at https://lnkd.in/datahub-alpha Check it out at: Github project wherehows, branch: datahub (https://lnkd.in/datahub-github) Capabilities: Generic Modeling Layer with CRUD on MySQL Search on Elastic Graph on Neo4j* Interfaces: The DataHub Web App REST-ful service from model files Event Processors for metadata events Metadata Model: Datasets People Integrations Data Catalogs (Hive, Kafka, JDBC*) People (LDAP) * coming real soon
  • 35. Open sourcing the Data Model LinkedIn Internal Model Open Source Model
  • 36. Open Source : what’s coming next! More Entities: [Jobs, Flows] More Aspects within existing entities: [e.g. ReplicationPolicy, HiveSpecification] More Integrations: [Calcite-compatible systems for fine-grain lineage] App Features User pages Social interactions Get engaged and contribute to the global model Build integrations with systems