Building a healthy data ecosystem
around Kafka and Hadoop:
Lessons learned at LinkedIn
Mar 16, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
The Pursuit of #DataScienceHappiness
A original
@yaelgarten @shirshanka
Achieve
Data
Democracy
Data
Scientists
write code
Unleash
Insights
Share Learnings
at Strata!
Three (Naïve) Steps to #DataScienceHappiness
circa 2010
Achieving Data
Democracy
“… helping everybody to access and understand data …
breaking down silos … providing access to data when and where
it is needed at any given moment.”
Collect, flow, store as much data as you can
Provide efficient access to data in all its stages of evolution
The forms of data
Key-Value++ Message Bus Fast-OLAP Search Graph Crunchable
Espresso
Venice
Pinot Galene Graph DB
Document
DB
DynamoDB
Azure
Blob, Data
Lake
Storage
The forms of data
At Rest / In Motion
Espresso
Venice
Pinot
Galene
Graph DB
Document DB
DynamoDB
Azure
Blob, Data
Lake
Storage
The forms of data
In Motion
Scale
O(10) clusters
~1.7 Trillion messages
~450 TB
At Rest
Scale
O(10) clusters
~10K machines
~100 PB
At Rest / In Motion
SFTP, JDBC, REST
Data Integration
Azure
Blob, Data
Lake
Storage
Data Integration: key requirements
Source, Sink
Diversity
Batch
+
Streaming
Data
Quality
So, we built
SFTP
JDBC
REST
Simplifying Data Integration
@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal,
NerdWallet and many more…
Apache incubation under way
SFTP
Azure
Blob, Data
Lake
Storage
Query
Engines
At Rest / In Motion
Processing
Frameworks
Processing
Frameworks
Query
Engines
At Rest / In Motion
Processing
Frameworks
Kafka Hadoop
Samza Jobs
Pinot
minutes
hour +
Distributed Multi-dimensional OLAP
Columnar + indexes
No joins
Latency: low ms to sub-second
Query
Engines
Site-facing apps, reporting dashboards, monitoring
Open source.
In production @
LinkedIn, Uber
At Rest / In Motion
Processing
Frameworks
Data Infra 1.0 for Data Democracy
Query Engines
2010 - now
Achieve
Data
Democracy
Data
Scientists
write code
Unleash
Insights
Share Learnings
at Strata
How
does
LinkedIn
build
data-
driven
products? 

Data Scientist
PM Designer
Engineer
We should enable users
to filter connection
suggestions by company
How much do
people utilize
existing filter
capabilities?
Let's see how
users send
connection
invitations
today.
Tracking data records user activity
InvitationClickEvent()
(powers metrics and data products)
InvitationClickEvent()
Scale fact:
~1000 tracking event types,
hundreds of metrics & data products
Tracking data records user activity
user
engagement
tracking data
metric scripts
production code
Tracking Data Lifecycle
Produce, Transport, Consume
Member facing
data products
Business facing
decision making
Tracking Data Lifecycle & Teams
Product or App teams:
PMs, Developers, TestEng
Infra teams:
Hadoop, Kafka, DWH, ...
Data science teams: 

Analytics, ML Engineers,...
user
engagement
tracking data
metric scripts
production code
Member facing
data products
Business facing
decision making
Produce, Transport, Consume
Members
Execs
How do we calculate a metric: ProfileViews
PageViewEvent
Record 1:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "profile_page",
  "trackingInfo" : "Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}
Metric:
ProfileViews = sum(PageViewEvent where pageKey = profile_page)
PageViewEvent
Record 101:
{
  "memberId" : 12345,
  "time" : 1454745292951,
  "appName" : "LinkedIn",
  "pageKey" : "new_profile_page",
  "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL| ..."
}
or new_profile_page
Ok, but forgot to notify
undesirable
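The fragility described on these slides can be sketched in a few lines. This is a minimal illustration and not LinkedIn's metric script: the `profile_views` helper, the sample records, and the `feed_page` key are hypothetical; only `profile_page` and `new_profile_page` come from the slides.

```python
# Hypothetical sample records; only the two profile pageKey values
# come from the slides.
events = [
    {"memberId": 12345, "pageKey": "profile_page"},
    {"memberId": 12345, "pageKey": "new_profile_page"},
    {"memberId": 67890, "pageKey": "feed_page"},
]


def profile_views(records, page_keys):
    """Count page-view records whose pageKey is one of page_keys."""
    return sum(1 for r in records if r["pageKey"] in page_keys)


# Original metric logic: silently undercounts once a producer renames the page.
before_fix = profile_views(events, {"profile_page"})

# Patched logic: correct only if the consumer is told about the rename.
after_fix = profile_views(events, {"profile_page", "new_profile_page"})

print(before_fix, after_fix)
```

Nothing fails loudly in the first case; the metric just drifts, which is why an unannounced upstream change is so costly downstream.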
Metrics ecosystem at LinkedIn: 3 yrs ago
Operational Challenges for infra teams
Diminished Trust due to multiple sources of truth
What was causing unhappiness?
1. No contracts: downstream scripts broke when upstream changed
2. "Naming things is hard": different semantics & conventions in various data events (per team)
--> need to email to figure out the correct and complete logic to use
--> inefficient and potentially wrong
3. Discrepant metric logic: duplicate tech allowed for duplicate logic, which allowed for discrepant metric logic
So how did we solve this?
Data Modeling Tip
Say no to Fragile Formats or Schema-Free
Invest in a mature serialization protocol like Avro, Protobuf, Thrift etc for serializing
your messages to your persistent stores: Kafka, Hadoop, DBs etc.
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
Chose Avro as our format
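What a schema buys you as a contract can be sketched as below. The schema is written in Avro's JSON schema style, with field names taken from the PageViewEvent example; the `violations` checker is a hand-rolled stand-in used only for illustration, not the Avro library, and it handles just the two primitive types shown.

```python
# Schema in Avro's JSON schema style (a record with named, typed fields).
# The checker below is a hand-rolled illustration, not the Avro library.
PAGE_VIEW_EVENT_SCHEMA = {
    "type": "record",
    "name": "PageViewEvent",
    "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "time", "type": "long"},
        {"name": "appName", "type": "string"},
        {"name": "pageKey", "type": "string"},
    ],
}

# Only the primitive types used above are mapped.
PYTHON_TYPES = {"long": int, "string": str}


def violations(record, schema):
    """Return a list of contract violations (empty if the record conforms)."""
    problems = []
    for field in schema["fields"]:
        name, expected = field["name"], PYTHON_TYPES[field["type"]]
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    return problems


good = {"memberId": 12345, "time": 1454745292951,
        "appName": "LinkedIn", "pageKey": "profile_page"}
# Hypothetical bad record: memberId sent as a string, pageKey dropped.
bad = {"memberId": "12345", "time": 1454745292951, "appName": "LinkedIn"}

print(violations(good, PAGE_VIEW_EVENT_SCHEMA))
print(violations(bad, PAGE_VIEW_EVENT_SCHEMA))
```

With a real serialization library the same mismatch would be rejected when the producer writes the record, rather than discovered later in a downstream script.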
Sometimes you need a committee
Leads from product and infra teams
Review each new data model
Ensure that it follows our conventions,
patterns and best practices across entire
data lifecycle
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
Data Model Review Committee
(DMRC)
Tooling to codify conventions
“Always be reviewing”
Who and What Evolution
Unified Metrics Platform
A single source of truth for all
business metrics at LinkedIn
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
- metrics processing platform as a
service
- a metrics computation template
- a set of tools and process to
facilitate metrics life-cycle
Central Team,
Relevant
Stakeholders
Sandbox
Metric
Definition
Code
Repo
Build
& Deploy
System Jobs
Core Metrics Job
Metric
Owner
1. iterate
2. create
3. review
4. check in
5,000 metrics daily
Unified Metrics Platform: Pipeline
Metrics Logic
Raw
Data
Pinot
UMP Harness
Incremental
Aggregate
Backfill
Auto-join
Raptor
dashboards
HDFS
Aggregated
Data
Experiment
Analysis
Machine
Learning
Anomaly
Detection
HDFS
Ad-hoc
1. No contracts
2. Naming things is hard
3. Discrepant metric logic
Tracking
+ Database
+ Other data
Tracking Platform: standardizing production
Schema compatibility
Time
Audit
Kafka
Client-side Tracking
Tracking
Frontend
Services
Tools
Query
Engines
At Rest / In Motion
Processing
Frameworks
Data Infra + Platforms 2.0
Pinot
Tracking Platform Unified Metrics Platform (UMP)
Production Consumption
circa 2015
What was still causing unhappiness?
1. Old bad data sticks around (e.g. old mobile app versions)
2. No clear contract for data production - producers unaware of consumers' concerns
3. Never a good time to pay down this tech debt
We started from the bottom.
Product or App teams:
PMs, Developers, TestEng
Infra teams:
Hadoop, Kafka, DWH, ...
Data science teams: 

Analytics, ML Engineers,...
user
engagement
tracking data
metric scripts
production code
Member facing
data products
Business facing
decision making
Members
Execs
3. Never a good time to pay down this "data" debt
#victimsOfTheData —> #DataScienceHappiness 

via proactively forging our own data destiny.
Features are waiting to ship to members... some of this stuff is invisible
But what is the cost
of not doing it?
The Big Problem Opportunity in 2015 

Launch a completely rewritten LinkedIn mobile app
PageViewEvent
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : { "string" : "LinkedIn" },
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {
    "Viewee" : "23456",
    ...
  }
}
We already wanted to move to better data models
ProfileViewEvent
{
  "header" : {
    "memberId" : 12345,
    "time" : 4745292951145,
    "appName" : { "string" : "LinkedIn" },
    "pageKey" : "profile_page"
  },
  "entityView" : {
    "viewType" : "profile-view",
    "viewerId" : "12345",
    "vieweeId" : "23456"
  }
}
viewee_ID


There were two options:
1. Keep the old tracking:
   a. Cost: producers (try to) replicate it (write bad old code from scratch),
   b. Save: consumers avoid migrating.
2. Evolve.
   a. Cost: time on clean data modeling, and on consumer migration to new tracking events,
   b. Save: pays down data modeling tech debt
How much work would it be?
#DataScienceHappiness
We pitched it to our Leadership team
Do it!
CTO, CEO
2. Clear contract did not exist for data production
Producers were unaware of consumers' needs, and were "throwing data over the wall".
Even with Avro, schema adherence != semantic equivalence
user engagement
tracking data
metric scripts
production code
Member facing data products
Business facing decision making
#victimsOfTheData —> #DataScienceHappiness, via proactive joint requirements definition
Own the artifact that
feeds the data ecosystem
(and data scientists!)
Data producers
(PM, app developers)
Data consumers (DS)
2a. Ensure dialogue between Producers & Consumers
• Awareness: Train about end-to-end data pipeline, data modeling
• Instill communication & collaborative ownership process between all: a step-by-step
playbook for who & how to develop and own tracking
2b. Standardized core data entities
• Event types and names: Page, Action, Impression
• Framework level client side tracking: views, clicks, flows
• For all else (custom) - guide when to create a new Event

Navigation
Page View
Control Interaction
2c. Created clear maintainable data production contracts
Tracking specification with monitoring and alerting for adherence: clear, visual, consistent contract
Need tooling to support culture and process shift - "Always be tooling"
Tracking specification Tool
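The idea of a tracking specification with adherence monitoring can be sketched as follows. This is an illustration only: the spec layout, the `missing_events` helper, and the event type names are hypothetical and do not describe LinkedIn's actual tool.

```python
# Hypothetical spec: for each page, the event types the producer
# has agreed to emit. Names are illustrative only.
TRACKING_SPEC = {
    "profile_page": {"PageViewEvent", "ControlInteractionEvent"},
}


def missing_events(spec, observed):
    """Return {page: sorted missing event types} for pages that break the spec."""
    gaps = {}
    for page, required in spec.items():
        missing = required - observed.get(page, set())
        if missing:
            gaps[page] = sorted(missing)
    return gaps


# Event types actually seen per page, e.g. over the last monitoring window.
observed = {"profile_page": {"PageViewEvent"}}

alerts = missing_events(TRACKING_SPEC, observed)
print(alerts)
```

A non-empty result is what would feed an alert to the owning producer team, so the contract is enforced continuously rather than by email.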
1. Old bad data sticks around
PageViewEvent
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : { "string" : "LinkedIn" },
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {
    "vieweeID" : "23456",
    ...
  }
}
ProfileViewEvent
{
  "header" : {
    "memberId" : 12345,
    "time" : 4745292951145,
    "appName" : { "string" : "LinkedIn" },
    "pageKey" : "profile_page"
  },
  "entityView" : {
    "viewType" : "profile-view",
    "viewerId" : "12345",
    "vieweeId" : "23456"
  }
}
How do we handle old and new?
PageViewEvent
ProfileViewEvent
Producers Consumers
old
new
Relevance
Analytics
The Big Challenge
load '/data/tracking/PageViewEvent' using AvroStorage()
(Pig scripts)
My Raw Data
Our scripts were doing ….
My Raw Data
My Data API
We need "microservices" for Data
The Database community solved this
decades ago...
Views!
We built Dali to solve this
A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to
focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
Solving
With
Views
Producers
LinkedInProfileView
PageViewEvent
ProfileViewEvent
new
old
Consumers
pagekey==
profile
1:1
Relevance
Analytics
Views
ecosystem
Producers Consumers
LinkedInProfileView
JSAProfileView
Job Seeker App
(JSA)
LinkedIn App
UnifiedProfileView
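How a view hides the old/new split from consumers can be sketched in plain Python. This is not Dali's API: the two functions stand in for view definitions, the record shapes follow the PageViewEvent and ProfileViewEvent examples, and the Job Seeker App rows and the `feed_page` key are hypothetical.

```python
def linkedin_profile_view(page_view_events, profile_view_events):
    """Yield one normalized row per profile view from old and new events."""
    # Old events: keep only profile pages, pull the viewee out of trackingInfo.
    for e in page_view_events:
        if e["header"]["pageKey"] == "profile_page":
            yield {
                "viewerId": str(e["header"]["memberId"]),
                "vieweeId": e["trackingInfo"]["Viewee"],
            }
    # New events: map 1:1 onto the view's columns.
    for e in profile_view_events:
        yield {
            "viewerId": e["entityView"]["viewerId"],
            "vieweeId": e["entityView"]["vieweeId"],
        }


def unified_profile_view(*app_views):
    """Union the per-app views into one logical dataset."""
    for view in app_views:
        yield from view


old = [
    {"header": {"memberId": 12345, "pageKey": "profile_page"},
     "trackingInfo": {"Viewee": "23456"}},
    {"header": {"memberId": 12345, "pageKey": "feed_page"},
     "trackingInfo": {}},
]
new = [
    {"header": {"memberId": 12345, "pageKey": "profile_page"},
     "entityView": {"viewType": "profile-view",
                    "viewerId": "12345", "vieweeId": "23456"}},
]
# Hypothetical rows already normalized by a Job Seeker App view.
jsa_rows = [{"viewerId": "777", "vieweeId": "23456"}]

rows = list(unified_profile_view(linkedin_profile_view(old, new), jsa_rows))
print(len(rows))
```

Consumers read only the unified rows, so producers can retire the old event or add another app without any consumer script changing.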
Dali: Implementation Details in Context
Dali FileSystem
Processing Engine
(MR, Spark)
Dali Datasets (Tables+Views)
Dataflow APIs
(MR, Spark,
Scalding)
Query Layers
(Pig, Hive,
Spark)
Dali CLI
Data Catalog
Git + Artifactory
View Def +
UDFs
Dataset
Owner
Data Source
Data Sink
From
load '/data/tracking/PageViewEvent' using AvroStorage();
To
load 'tracking.UnifiedProfileView' using DaliStorage();
One small step for a script
A Few Hard Problems
Versioning
Views and UDFs
Mapping to Hive metastore entities
Development lifecycle
Git as source of truth
Gradle for build
LinkedIn tooling integration for deployment
State of the world today
~300 views
Pretty much all new UMP metrics use Dali
data sources
ProfileViews
MessagesSent
Searches
InvitationsSent
ArticlesRead
JobApplications
...
At Rest
Data
Processing
Frameworks
Now brewing: Dali on Kafka
Can we take the same
views and run them
seamlessly on Kafka as
well?
Stream Data
Standard streaming APIs
- Samza System Consumer
- Kafka Consumer
What’s next for Dali?
Selective materialization
Open source
Hive is an implementation detail, not a long term bet
Dali: When are we done dreaming?
At Rest / In Motion
Data
Processing
Frameworks
Dali
Query
Engines
At Rest / In Motion
Processing
Frameworks
Data Infra + Platforms 3.0
Pinot
Tracking Platform Unified Metrics Platform (UMP)
Dali, Dr. Elephant, WhereHows
circa 2017
Did we succeed? We just handled another huge rewrite!
#DataScienceHappiness
Achieve
Data
Democracy
Data
Scientists
write code
Unleash
Insights
Share Learnings
at Strata
Three (Naïve) Steps to #DataScienceHappiness
Basic data
infrastructure
for data democracy
Platforms, Process
to standardize
produce + consume
Evangelize
investing

in
#DataScience
Happiness
Tech + process

to sustain
healthy data
ecosystem
Our Journey towards #DataScienceHappiness
Dali,
Dialogue
2015->
Tracking, UMP
DMRC
2013 ->
Kafka, Hadoop,
Gobblin, Pinot
2010 -> 2015 ->
The Pursuit of #DataScienceHappiness
A original
@yaelgarten @shirshanka
Thank You!
to be continued…

More Related Content

What's hot

Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentationTao Feng
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Databricks
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in NetflixDanny Yuan
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Edureka!
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_dataNitin Kumar
 
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphXMaps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphXDatabricks
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 

What's hot (20)

Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Strata sf - Amundsen presentation
Strata sf - Amundsen presentationStrata sf - Amundsen presentation
Strata sf - Amundsen presentation
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
Building a Versatile Analytics Pipeline on Top of Apache Spark with Mikhail C...
 
Elasticsearch in Netflix
Elasticsearch in NetflixElasticsearch in Netflix
Elasticsearch in Netflix
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_data
 
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphXMaps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
Maps and Meaning: Graph-based Entity Resolution in Apache Spark & GraphX
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 

Viewers also liked

Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationZaloni
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
How to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organizationHow to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organizationYael Garten
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Architecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemArchitecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemYael Garten
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbenchRan Wei
 
10のJava9で変わるJava8の嫌なとこ!
10のJava9で変わるJava8の嫌なとこ!10のJava9で変わるJava8の嫌なとこ!
10のJava9で変わるJava8の嫌なとこ!bitter_fox
 
Tips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTanya Altmann
 
Deploying and Managing a Global Blockchain Network
Deploying and Managing a Global Blockchain NetworkDeploying and Managing a Global Blockchain Network
Deploying and Managing a Global Blockchain NetworkDuncan Johnston-Watt
 
CSW2017 Enrico branca What if encrypted communications are not as secure as w...
CSW2017 Enrico branca What if encrypted communications are not as secure as w...CSW2017 Enrico branca What if encrypted communications are not as secure as w...
CSW2017 Enrico branca What if encrypted communications are not as secure as w...CanSecWest
 
Periodización Táctica: Morfociclo Patrón: Manchester United de José Mourinho
Periodización Táctica: Morfociclo Patrón: Manchester United de José MourinhoPeriodización Táctica: Morfociclo Patrón: Manchester United de José Mourinho
Periodización Táctica: Morfociclo Patrón: Manchester United de José MourinhoJuan Manuel Navarrete
 
L'acheteur un nouvel entrepreneur
L'acheteur un nouvel entrepreneurL'acheteur un nouvel entrepreneur
L'acheteur un nouvel entrepreneurFrance Barter
 
IES Triangle Principle
IES Triangle PrincipleIES Triangle Principle
IES Triangle PrincipleHandaru Sakti
 
Prins Amedeo officieel benoemd bij Gutzwiller bank
Prins Amedeo officieel benoemd bij Gutzwiller bankPrins Amedeo officieel benoemd bij Gutzwiller bank
Prins Amedeo officieel benoemd bij Gutzwiller bankThierry Debels
 
CSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_mark
CSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_markCSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_mark
CSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_markCanSecWest
 
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CanSecWest
 
Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Anna Völkl
 
Is Your Business Ready for An ERP System 5 Signs to Look Out For!
Is Your Business Ready for An ERP System 5 Signs to Look Out For!Is Your Business Ready for An ERP System 5 Signs to Look Out For!
Is Your Business Ready for An ERP System 5 Signs to Look Out For!Smith Roy
 
Diagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderDiagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderMJ Cachón Yáñez
 

Viewers also liked (20)

Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma Presentation
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
How to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organizationHow to use your data science team: Becoming a data-driven organization
How to use your data science team: Becoming a data-driven organization
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Architecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystemArchitecting for change: LinkedIn's new data ecosystem
Architecting for change: LinkedIn's new data ecosystem
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
10のJava9で変わるJava8の嫌なとこ!
10のJava9で変わるJava8の嫌なとこ!10のJava9で変わるJava8の嫌なとこ!
10のJava9で変わるJava8の嫌なとこ!
 
Tips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPTTips to Handling Dental Emergencies - #PPT
Tips to Handling Dental Emergencies - #PPT
 
Deploying and Managing a Global Blockchain Network
Deploying and Managing a Global Blockchain NetworkDeploying and Managing a Global Blockchain Network
Deploying and Managing a Global Blockchain Network
 
CSW2017 Enrico branca What if encrypted communications are not as secure as w...
CSW2017 Enrico branca What if encrypted communications are not as secure as w...CSW2017 Enrico branca What if encrypted communications are not as secure as w...
CSW2017 Enrico branca What if encrypted communications are not as secure as w...
 
Examen Visual aplicado a las uniones soldadas (VT1w) (04/17)
Examen Visual aplicado a las uniones soldadas (VT1w) (04/17)Examen Visual aplicado a las uniones soldadas (VT1w) (04/17)
Examen Visual aplicado a las uniones soldadas (VT1w) (04/17)
 
Periodización Táctica: Morfociclo Patrón: Manchester United de José Mourinho
Periodización Táctica: Morfociclo Patrón: Manchester United de José MourinhoPeriodización Táctica: Morfociclo Patrón: Manchester United de José Mourinho
Periodización Táctica: Morfociclo Patrón: Manchester United de José Mourinho
 
L'acheteur un nouvel entrepreneur
L'acheteur un nouvel entrepreneurL'acheteur un nouvel entrepreneur
L'acheteur un nouvel entrepreneur
 
IES Triangle Principle
IES Triangle PrincipleIES Triangle Principle
IES Triangle Principle
 
Prins Amedeo officieel benoemd bij Gutzwiller bank
Prins Amedeo officieel benoemd bij Gutzwiller bankPrins Amedeo officieel benoemd bij Gutzwiller bank
Prins Amedeo officieel benoemd bij Gutzwiller bank
 
CSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_mark
CSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_markCSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_mark
CSW2017 Peng qiu+shefang-zhong win32k -dark_composition_finnal_finnal_rm_mark
 
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
CSW2017 Henry li how to find the vulnerability to bypass the control flow gua...
 
Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017Secure development environment @ Meet Magento Croatia 2017
Secure development environment @ Meet Magento Croatia 2017
 
Is Your Business Ready for An ERP System 5 Signs to Look Out For!
Is Your Business Ready for An ERP System 5 Signs to Look Out For!Is Your Business Ready for An ERP System 5 Signs to Look Out For!
Is Your Business Ready for An ERP System 5 Signs to Look Out For!
 
Diagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounderDiagnóstico SEO Técnico con Herramientas #TheInbounder
Diagnóstico SEO Técnico con Herramientas #TheInbounder
 

Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

  • 1. Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn Mar 16, 2017 Shirshanka Das, Principal Staff Engineer, LinkedIn Yael Garten, Director of Data Science, LinkedIn @shirshanka, @yaelgarten
  • 2. The Pursuit of #DataScienceHappiness A original @yaelgarten @shirshanka
  • 3. Achieve Data Democracy Data Scientists write code Unleash Insights Share Learnings at Strata! Three (Naïve) Steps to #DataScienceHappiness circa 2010
  • 5. Achieving Data Democracy “… helping everybody to access and understand data… breaking down silos… providing access to data when and where it is needed at any given moment.” Collect, flow, store as much data as you can. Provide efficient access to data in all its stages of evolution.
  • 6. The forms of data Key-Value++ Message Bus Fast-OLAP Search Graph Crunchable Espresso Venice Pinot Galene Graph DB Document DB DynamoDB Azure Blob, Data Lake Storage
  • 7. The forms of data At Rest / In Motion Espresso Venice Pinot Galene Graph DB Document DB DynamoDB Azure Blob, Data Lake Storage
  • 8. The forms of data At Rest / In Motion Scale: O(10) clusters, ~1.7 trillion messages, ~450 TB Scale: O(10) clusters, ~10K machines, ~100 PB
  • 9. At Rest / In Motion SFTP JDBC REST Data Integration Azure Blob, Data Lake Storage
  • 10. Data Integration: key requirements Source, Sink Diversity Batch + Streaming Data Quality So, we built
  • 11. SFTP JDBC REST Simplifying Data Integration @LinkedIn Hundreds of TB per day Thousands of datasets ~30 different source systems 80%+ of data ingest Open source @ github.com/linkedin/gobblin Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more… Apache incubation under way SFTP Azure Blob, Data Lake Storage
  • 15. Kafka Hadoop Samza Jobs Pinot minutes hour + Distributed Multi-dimensional OLAP Columnar + indexes No joins Latency: low ms to sub-second Query Engines
  • 16. Site-facing Apps Reporting dashboards Monitoring Open source. In production @ LinkedIn, Uber
  • 17. At Rest / In Motion Processing Frameworks Data Infra 1.0 for Data Democracy Query Engines 2010 - now
  • 19. How does LinkedIn build data-driven products? Data Scientist, PM, Designer, Engineer. "We should enable users to filter connection suggestions by company." "How much do people utilize existing filter capabilities?" "Let's see how users send connection invitations today."
  • 20. Tracking data records user activity InvitationClickEvent()
  • 21. Tracking data records user activity (powers metrics and data products) InvitationClickEvent() Scale fact: ~1000 tracking event types, ~hundreds of metrics & data products
  • 22. Tracking Data Lifecycle Produce Transport Consume user engagement tracking data metric scripts production code Member facing data products Business facing decision making
  • 23. Tracking Data Lifecycle & Teams Product or App teams: PMs, Developers, TestEng Infra teams: Hadoop, Kafka, DWH, ... Data science teams: Analytics, ML Engineers, ... Produce Transport Consume user engagement tracking data metric scripts production code Member facing data products Business facing decision making Members Execs
  • 24. How do we calculate a metric: ProfileViews PageViewEvent Record 1: { "memberId" : 12345, "time" : 1454745292951, "appName" : "LinkedIn", "pageKey" : "profile_page", "trackingInfo" : "Viewee=1214,lnl=f,nd=1,o=1214, ^SP=pId-'pro_stars',rslvd=t,vs=v, vid=1214,ps=EDU|EXP|SKIL| ..." } Metric: ProfileViews = sum(PageViewEvent where pagekey = profile_page or new_profile_page) PageViewEvent Record 101: { "memberId" : 12345, "time" : 1454745292951, "appName" : "LinkedIn", "pageKey" : "new_profile_page", "trackingInfo" : "viewee_id=1214,lnl=f,nd=1,o=1214, ^SP=pId-'pro_stars',rslvd=t,vs=v, vid=1214,ps=EDU|EXP|SKIL| ..." } Slide annotations on the changes: "Ok but forgot to notify", "undesirable"
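The failure mode on this slide can be sketched in a few lines of Python. This is only an illustration, not LinkedIn's metric code: the records are trimmed-down stand-ins for PageViewEvent, and the `feed_page` key and the `profile_views` helper are hypothetical names introduced here.

```python
# Trimmed-down stand-ins for PageViewEvent records; "feed_page" is a made-up key.
events = [
    {"memberId": 12345, "pageKey": "profile_page"},
    {"memberId": 12345, "pageKey": "new_profile_page"},
    {"memberId": 67890, "pageKey": "feed_page"},
]


def profile_views(events, page_keys=("profile_page",)):
    """Count the events whose pageKey is one of the accepted profile page keys."""
    return sum(1 for event in events if event["pageKey"] in page_keys)


# The original metric script only knows the old page key, so it silently undercounts.
naive = profile_views(events)

# After someone notices, the consumer has to patch the filter by hand.
patched = profile_views(events, page_keys=("profile_page", "new_profile_page"))

print(naive, patched)
```

Nothing fails loudly when the producer renames the page key: the metric simply drops, which is why an unannounced upstream change is so costly downstream.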
  • 25. Metrics ecosystem at LinkedIn: 3 yrs ago Operational Challenges for infra teams Diminished Trust due to multiple sources of truth
  • 26. What was causing unhappiness? 1. No contracts: downstream scripts broke when upstream changed. 2. "Naming things is hard": different semantics & conventions in various data Events (per team) --> need to email to figure out what is the correct and complete logic to use --> inefficient and potentially wrong. 3. Discrepant metric logic: duplicate tech allowed for duplicate logic, which in turn allowed for discrepant metric logic. So how did we solve this?
  • 27. Data Modeling Tip Say no to Fragile Formats or Schema-Free. Invest in a mature serialization protocol like Avro, Protobuf, Thrift, etc. for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc. 1. No contracts 2. Naming things is hard 3. Discrepant metric logic Chose Avro as our format
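To make the "contract" idea concrete, here is a minimal sketch of checking records against a declared schema. It is plain Python and deliberately does not use Avro's real API; the `PAGE_VIEW_SCHEMA` dictionary and the `violations` helper are hypothetical names used for illustration only.

```python
# Hypothetical contract: field name -> required Python type (not an Avro schema).
PAGE_VIEW_SCHEMA = {"memberId": int, "time": int, "appName": str, "pageKey": str}


def violations(record, schema):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"memberId": 12345, "time": 1454745292951,
        "appName": "LinkedIn", "pageKey": "profile_page"}
bad = {"memberId": "12345", "time": 1454745292951,
       "pageKey": "profile_page", "page": "x"}

print(violations(good, PAGE_VIEW_SCHEMA))
print(violations(bad, PAGE_VIEW_SCHEMA))
```

A schema-aware serialization format performs this kind of check at write time, so a malformed record is rejected by the producer instead of being discovered later by a broken downstream script.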
  • 28. Sometimes you need a committee Leads from product and infra teams Review each new data model Ensure that it follows our conventions, patterns and best practices across entire data lifecycle 1. No contracts 2. Naming things is hard 3. Discrepant metric logic Data Model Review Committee (DMRC) Tooling to codify conventions “Always be reviewing” Who and What Evolution
  • 29. Unified Metrics Platform A single source of truth for all business metrics at LinkedIn 1. No contracts 2. Naming things is hard 3. Discrepant metric logic - metrics processing platform as a service - a metrics computation template - a set of tools and process to facilitate metrics life-cycle Central Team, Relevant Stakeholders Sandbox Metric Definition Code Repo Build & Deploy System Jobs Core Metrics Job Metric Owner 1. iterate 2. create 3. review 4. check in 5,000 metrics daily
  • 30. Unified Metrics Platform: Pipeline Metrics Logic Raw Data Pinot UMP Harness Incremental Aggregate Backfill Auto-join Raptor dashboards HDFS Aggregated Data Experiment Analysis Machine Learning Anomaly Detection HDFS Ad-hoc 1. No contracts 2. Naming things is hard 3. Discrepant metric logic Tracking + Database + Other data
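The slide names "Incremental", "Aggregate" and "Backfill" as harness stages without further detail. The sketch below shows one common reading of those terms, assuming daily partitions; it is not UMP's actual design, and the function names, dates and data layout are hypothetical.

```python
from collections import Counter


def aggregate_day(events):
    """Aggregate one day's raw events into per-metric counts."""
    return Counter(event["metric"] for event in events)


def run_incremental(raw_by_day, aggregates, days=None):
    """Aggregate only days that have no result yet; pass `days` to force a backfill."""
    if days is None:
        targets = [day for day in raw_by_day if day not in aggregates]
    else:
        targets = list(days)
    for day in targets:
        aggregates[day] = aggregate_day(raw_by_day[day])
    return targets


raw = {
    "2017-03-14": [{"metric": "ProfileViews"}, {"metric": "Searches"}],
    "2017-03-15": [{"metric": "ProfileViews"}],
}
aggregates = {}

first = run_incremental(raw, aggregates)       # processes both days
raw["2017-03-16"] = [{"metric": "Searches"}]
second = run_incremental(raw, aggregates)      # processes only the new day

# Late-arriving data for an old day: recompute just that day (backfill).
raw["2017-03-14"].append({"metric": "ProfileViews"})
third = run_incremental(raw, aggregates, days=["2017-03-14"])
```

Keeping the aggregation logic in one function means the incremental run and the backfill cannot drift apart, which is the same "single source of truth" argument the platform makes for metric definitions.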
  • 31. Tracking Platform: standardizing production Schema compatibility Time Audit Kafka Client-side Tracking Tracking Frontend Services Tools
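"Schema compatibility" can be illustrated with a simplified check in the spirit of Avro's schema resolution: a newer reader schema can decode older data if every added field carries a default and no shared field changes type. The sketch ignores type promotion and nested records, and the schema dictionaries and `can_read` helper are hypothetical, not part of any library.

```python
def can_read(reader, writer):
    """Return the problems that stop `reader` from decoding data written with `writer`."""
    problems = []
    for name, spec in reader.items():
        if name not in writer:
            # A field the old data lacks must have a default to fall back on.
            if "default" not in spec:
                problems.append(f"{name}: added without a default")
        elif writer[name]["type"] != spec["type"]:
            old_type, new_type = writer[name]["type"], spec["type"]
            problems.append(f"{name}: type changed {old_type} -> {new_type}")
    return problems


v1 = {"memberId": {"type": "long"}, "pageKey": {"type": "string"}}
v2_ok = {**v1, "viewType": {"type": "string", "default": "unknown"}}
v2_bad = {
    "memberId": {"type": "string"},
    "pageKey": {"type": "string"},
    "viewType": {"type": "string"},
}

print(can_read(v2_ok, v1))
print(can_read(v2_bad, v1))
```

Running a check like this before a schema change ships turns "downstream scripts broke when upstream changed" into a failed build on the producer side.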
  • 32. Query Engines At Rest / In Motion Processing Frameworks Data Infra + Platforms 2.0 Pinot Tracking Platform Unified Metrics Platform (UMP) Production Consumption circa 2015
  • 33. What was still causing unhappiness? 1. Old bad data sticks around (e.g. old mobile app versions) 2. No clear contract for data production - Producers unaware of consumers' concerns 3. Never a good time to pay down this tech debt We started from the bottom.
  • 34. 3. Never a good time to pay down this "data" debt Product or App teams: PMs, Developers, TestEng Infra teams: Hadoop, Kafka, DWH, ... Data science teams: Analytics, ML Engineers, ... user engagement tracking data metric scripts production code Member facing data products Business facing decision making Members Execs #victimsOfTheData -> #DataScienceHappiness via proactively forging our own data destiny. Features are waiting to ship to members... some of this stuff is invisible. But what is the cost of not doing it?
  • 35. The Big Problem Opportunity in 2015: Launch a completely rewritten LinkedIn mobile app
  • 36. We already wanted to move to better data models PageViewEvent { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { ["Viewee" : "23456"], ... } } ProfileViewEvent { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "entityView" : { "viewType" : "profile-view", "viewerId" : "12345", "vieweeId" : "23456" } } viewee_ID
  • 37. There were two options: 1. Keep the old tracking: a. Cost: producers (try to) replicate it (write bad old code from scratch), b. Save: consumers avoid migrating. 2. Evolve: a. Cost: time on clean data modeling, and on consumer migration to new tracking events, b. Save: pays down data modeling tech debt.
  • 38. How much work would it be? 1. Keep the old tracking: a. Cost: producers (try to) replicate it (write bad old code from scratch), b. Save: consumers avoid migrating. 2. Evolve: a. Cost: time on clean data modeling, and on consumer migration to new tracking events, b. Save: pays down data modeling tech debt. #DataScienceHappiness
  • 39. We pitched it to our Leadership team Do it! CTOCEO
  • 40. 2. Clear contract did not exist for data production Producers were unaware of consumers' needs, and were "Throwing data over the wall". Even with Avro, schema adherence != semantic equivalence. user engagement tracking data metric scripts production code Member facing data products Business facing decision making #victimsOfTheData -> #DataScienceHappiness, via proactive joint requirements definition Own the artifact that feeds the data ecosystem (and data scientists!) Data producers (PM, app developers) Data consumers (DS)
  • 41. 2a. Ensure dialogue between Producers & Consumers • Awareness: Train about end-to-end data pipeline, data modeling • Instill communication & collaborative ownership process between all: a step-by-step playbook for who & how to develop and own tracking
  • 42. 2b. Standardized core data entities • Event types and names: Page, Action, Impression • Framework level client side tracking: views, clicks, flows • For all else (custom) - guide when to create a new Event
 Navigation Page View Control Interaction
  • 43. 2c. Created clear maintainable data production contracts Tracking specification with monitoring and alerting for adherence: clear, visual, consistent contract Need tooling to support culture and process shift - "Always be tooling" Tracking specification Tool
  • 44. 1. Old bad data sticks around PageViewEvent { "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "trackingInfo" : { ["vieweeID" : "23456"], ... } } ProfileViewEvent { "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" }, "pageKey" : "profile_page" }, "entityView" : { "viewType" : "profile-view", "viewerId" : "12345", "vieweeId" : "23456" } }
  • 45. How do we handle old and new? PageViewEvent ProfileViewEvent Producers Consumers old new Relevance Analytics
  • 46. The Big Challenge Our scripts (Pig scripts) were doing … load '/data/tracking/PageViewEvent' using AvroStorage() My Raw Data
  • 47. My Raw Data My Data API We need “microservices" for Data
  • 48. The Database community solved this decades ago... Views!
  • 49. We built Dali to solve this A Data Access Layer for LinkedIn Abstract away underlying physical details to allow users to focus solely on the logical concerns Logical Tables + Views Logical FileSystem
  • 52. Dali: Implementation Details in Context Dali FileSystem Processing Engine (MR, Spark) Dali Datasets (Tables+Views) Dataflow APIs (MR, Spark, Scalding) Query Layers (Pig, Hive, Spark) Dali CLI Data Catalog Git + Artifactory View Def + UDFs Dataset Owner Data Source Data Sink
  • 53. From load '/data/tracking/PageViewEvent' using AvroStorage(); To load 'tracking.UnifiedProfileView' using DaliStorage(); One small step for a script
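What a view such as `tracking.UnifiedProfileView` buys consumers can be sketched as follows. This is a plain-Python analogy, not Dali's implementation (the slides describe Dali views as view definitions plus UDFs); the mapping functions and the output row shape are assumptions made for illustration, with field names taken from the PageViewEvent and ProfileViewEvent examples on the earlier slides.

```python
def from_page_view(event):
    """Map a legacy PageViewEvent to a logical row, or None if it is not a profile view."""
    header = event["header"]
    if header["pageKey"] != "profile_page":
        return None
    return {
        "viewerId": str(header["memberId"]),
        "vieweeId": event["trackingInfo"]["vieweeID"],
    }


def from_profile_view(event):
    """Map a new-style ProfileViewEvent to the same logical row shape."""
    view = event["entityView"]
    return {"viewerId": view["viewerId"], "vieweeId": view["vieweeId"]}


def unified_profile_view(page_view_events, profile_view_events):
    """Logical view: yield rows without exposing which physical event produced them."""
    for event in page_view_events:
        row = from_page_view(event)
        if row is not None:
            yield row
    for event in profile_view_events:
        yield from_profile_view(event)


old = [{"header": {"memberId": 12345, "pageKey": "profile_page"},
        "trackingInfo": {"vieweeID": "23456"}}]
new = [{"header": {"memberId": 12345, "pageKey": "profile_page"},
        "entityView": {"viewType": "profile-view",
                       "viewerId": "12345", "vieweeId": "23456"}}]

rows = list(unified_profile_view(old, new))
```

Consumers depend only on the row shape returned by the view, so producers can retire the old event or introduce another new one by changing the mapping functions, without touching any metric script.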
  • 54. A Few Hard Problems Versioning Views and UDFs Mapping to Hive metastore entities Development lifecycle Git as source of truth Gradle for build LinkedIn tooling integration for deployment
  • 55. State of the world today ~300 views Pretty much all new UMP metrics use Dali data sources ProfileViews MessagesSent Searches InvitationsSent ArticlesRead JobApplications ... At Rest Data Processing Frameworks
  • 56. Now brewing: Dali on Kafka Can we take the same views and run them seamlessly on Kafka as well? Stream Data Standard streaming APIs - Samza System Consumer - Kafka Consumer
  • 57. What’s next for Dali? Selective materialization Open source Hive is an implementation detail, not a long term bet
  • 58. Dali: When are we done dreaming? At Rest / In Motion Data Processing Frameworks Dali
  • 59. Query Engines At Rest / In Motion Processing Frameworks Data Infra + Platforms 3.0 Pinot Tracking Platform Unified Metrics Platform (UMP) Dali Dr Elephant WhereHows circa 2017
  • 60. Did we succeed? We just handled another huge rewrite! #DataScienceHappiness
  • 62. Our Journey towards #DataScienceHappiness Basic data infrastructure for data democracy Platforms, Process to standardize produce + consume Evangelize investing in #DataScienceHappiness Tech + process to sustain healthy data ecosystem Dali, Dialogue 2015 -> Tracking, UMP, DMRC 2013 -> Kafka, Hadoop, Gobblin, Pinot 2010 -> 2015 ->
  • 63. The Pursuit of #DataScienceHappiness A original @yaelgarten @shirshanka Thank You! to be continued…