The Data Driven Network
Kapil Surlaker
Director of Engineering
Powering the Data Driven Network
Kapil Surlaker and Shirshanka Das
Hadoop Summit 2015
How does PYMK work?
Houston, we have a problem
Step 1
Central transport pipeline
Still have a problem
Hadoop Ingest Pipeline
Complexity
Step 2
Central Ingestion Framework
Requirements
Source Diversity
Batch and Streaming
Data Quality
Gobblin Architecture
[Diagram: Source -> Work Units -> Tasks (Extract -> Convert -> Quality -> Write) -> Data Publish]
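The flow on this slide (a source split into work units, each run by a task through extract, convert, quality, and write stages, followed by a publish step) can be sketched roughly as below. This is an illustrative Python sketch, not Gobblin's Java API: `WorkUnit`, `Task`, and `run_job` are hypothetical names, and the rule that a task with a failed quality check publishes nothing is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkUnit:
    """A slice of the source that one task processes independently (hypothetical)."""
    records: List[dict]


@dataclass
class Task:
    """Runs extract -> convert -> quality -> write for a single work unit."""
    work_unit: WorkUnit
    convert: Callable[[dict], dict]
    quality_check: Callable[[dict], bool]
    staged: List[dict] = field(default_factory=list)

    def run(self) -> bool:
        for record in self.work_unit.records:       # extract
            converted = self.convert(record)         # convert
            if not self.quality_check(converted):    # quality
                return False                         # assumed: the whole task fails
            self.staged.append(converted)            # write (to a staging area)
        return True


def run_job(source_records, unit_size, convert, quality_check):
    """Partition the source into work units, run one task per unit, publish successes."""
    units = [
        WorkUnit(source_records[i:i + unit_size])
        for i in range(0, len(source_records), unit_size)
    ]
    published = []
    for task in (Task(unit, convert, quality_check) for unit in units):
        if task.run():
            published.extend(task.staged)            # data publish
    return published
```

Staging output per task and publishing only after the task succeeds keeps partially processed work units out of the published data, which is one plausible reading of the separate "Data Publish" box.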
Taming Source Diversity
REST
SFTP
JDBC
Protocol
Config
Source Extractor
checkpoint
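One way to read this slide: the protocol (REST, SFTP, JDBC) is selected by configuration, and the extractor records a checkpoint so the next run pulls only new data. The sketch below illustrates that idea in Python. All names (`InMemoryExtractor`, `pull`, the `protocol` and `dataset` config keys, the `ts` field) are hypothetical and are not taken from Gobblin.

```python
class InMemoryExtractor:
    """Stand-in for a REST, SFTP, or JDBC extractor (hypothetical)."""

    def __init__(self, rows):
        self.rows = rows

    def extract(self, after):
        # Yield only rows newer than the checkpoint, in timestamp order.
        ordered = sorted(self.rows, key=lambda r: r["ts"])
        return (r for r in ordered if r["ts"] > after)


def pull(config, checkpoints, backends):
    """Pick an extractor by protocol, pull new rows, then advance the checkpoint."""
    extractor = backends[config["protocol"]]
    last = checkpoints.get(config["dataset"], 0)
    rows = list(extractor.extract(after=last))
    if rows:
        checkpoints[config["dataset"]] = max(r["ts"] for r in rows)
    return rows
```

Running `pull` twice against the same source returns each row once, because the second call starts from the stored checkpoint.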
Solving for real-time
Inefficiencies in batch
YARN based
Apache Helix
Continuous
Auto-scaling
YARN
Helix
Executor 1
Executor 2
Executor 3
HDFS
Stream Source
Data Quality
Per record, per task, or per job
Composable quality checkers
Schema compatibility
Audit check
Sensitive fields
Unique key
Policy driven
[Diagram: Record, Writer, Job, Task, Quality Checker, Policy Checker, Fail, Quarantine]
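Composable, policy-driven checkers at the record level could look something like the sketch below. It is a Python illustration under stated assumptions: the `Policy` enum, `has_unique_key`, `no_sensitive_fields`, and `apply_checks` are hypothetical names, and the behavior (a FAIL policy aborts, a QUARANTINE policy sets the record aside) is an interpretation of the "Fail" and "Quarantine" labels on the slide.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Policy(Enum):
    FAIL = "fail"              # assumed: abort on the first violation
    QUARANTINE = "quarantine"  # assumed: set the record aside and continue


Checker = Callable[[dict], bool]


def has_unique_key(seen: set) -> Checker:
    """Passes only the first record seen for each 'id' value."""
    def check(record: dict) -> bool:
        key = record.get("id")
        if key is None or key in seen:
            return False
        seen.add(key)
        return True
    return check


def no_sensitive_fields(blocked: set) -> Checker:
    """Passes records that contain none of the blocked field names."""
    return lambda record: not (set(record) & blocked)


def apply_checks(records, checks: List[Tuple[Checker, Policy]]):
    """Run every checker on every record and route it according to policy."""
    written, quarantined = [], []
    for record in records:
        ok = True
        for checker, policy in checks:
            if not checker(record):
                if policy is Policy.FAIL:
                    raise ValueError(f"quality check failed: {record}")
                ok = False
                break
        (written if ok else quarantined).append(record)
    return written, quarantined
```

Because each checker is just a function paired with a policy, new checks (schema compatibility, audit counts) can be added to the list without changing `apply_checks`.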
Current Activity
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn
Tens of TB per day
Hundreds of datasets
~20 different sources
Gobblin on YARN
Transformation: No one size fits all
Cubert: Converting hours to minutes
http://github.com/linkedin/cubert
Physical language
Block organization
Specialized operators
Got Diversity?
Where is the billings data?
How did it get here?
What data is used to create inferred skills data?
Who owns that flow?
When will the latest profile data show up?
Where is my data?
How did it get here?
….
WhereHows
WhereHows architecture
Lineage
WhereHows: Roadmap
Streaming ecosystem integration
Kafka, Samza
Recommendations for Datasets, Metrics
Exploring Open Source
Real-time. Interactive.
Slice and Dice metrics
Precompute!
Device Geo View
Android US 1
Android IN 1
iOS US 1
Dimension View
Android 2
iOS 1
US 2
IN 1
Android,US 1
iOS,US 1
Android,IN 1
More dimensions!
Device Geo Carrier View
Android US ATT 1
Android IN Reliance 1
iOS US Verizon 1
Dimension View
Android 2
iOS 1
US 2
IN 1
ATT 1
Reliance 1
Verizon 1
Android,US 1
... ...
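The "Dimension View" tables on these slides can be reproduced by counting views for every non-empty combination of dimension values. The following Python sketch does that for the sample rows; the function name `precompute` and the field names `device`, `geo`, `carrier`, and `views` are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations


def precompute(rows, dimensions):
    """Sum views for every non-empty combination of dimension values."""
    cube = Counter()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                # Key on (dimension, value) pairs so that equal values in
                # different dimensions cannot collide.
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row["views"]
    return cube


rows = [
    {"device": "Android", "geo": "US", "carrier": "ATT", "views": 1},
    {"device": "Android", "geo": "IN", "carrier": "Reliance", "views": 1},
    {"device": "iOS", "geo": "US", "carrier": "Verizon", "views": 1},
]

two_dims = precompute(rows, ["device", "geo"])
three_dims = precompute(rows, ["device", "geo", "carrier"])
```

For these three rows, `two_dims` holds 7 entries and `three_dims` holds 19. Each row contributes 2^d - 1 keys for d dimensions, which is why adding dimensions makes full precomputation expensive.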
Challenges
Horizontally scalable
Low latency
Data freshness
Fault tolerance
OLAP features
Introducing Pinot
Key features
SQL-like interface
Columnar storage and indexing
Real-time data load
(S)QL: Filters and Aggs
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y' AND
action = 'stop'
(S)QL: Group By
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y'
GROUP BY action
(S)QL: ORDER BY and LIMIT
SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
entityId = 1000 AND
action = 'start'
ORDER BY creationTime DESC LIMIT 1
Columnar Storage
Forward Index
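The slide names columnar storage and a forward index without giving details. One common layout, shown here only as an assumption and not as Pinot's actual on-disk format, stores each column as a sorted dictionary of distinct values plus a forward index that maps each document (row) number to a dictionary id. The `Column` class and `count_where` helper below are hypothetical.

```python
from bisect import bisect_left


class Column:
    """Dictionary-encoded column: sorted distinct values plus a forward index."""

    def __init__(self, values):
        self.dictionary = sorted(set(values))
        # forward_index[doc] is the dictionary id of that document's value.
        self.forward_index = [bisect_left(self.dictionary, v) for v in values]

    def matching_docs(self, value):
        """Return the set of document numbers whose value equals `value`."""
        pos = bisect_left(self.dictionary, value)
        if pos == len(self.dictionary) or self.dictionary[pos] != value:
            return set()
        return {doc for doc, dict_id in enumerate(self.forward_index) if dict_id == pos}


def count_where(columns, **equals):
    """Evaluate SELECT count(*) ... WHERE col = value AND ... over the columns."""
    num_docs = len(next(iter(columns.values())).forward_index)
    docs = set(range(num_docs))
    for name, value in equals.items():
        docs &= columns[name].matching_docs(value)
    return len(docs)
```

A filter such as `action = 'stop' AND paid = 'y'` then touches only the two columns involved and compares small integer ids rather than full values.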
Pinot Architecture
[Diagram: Queries, Broker, Helix, Real-time, Historical, Kafka, Hadoop, Samza, Raw Data]
Fast but needs a ton of RAM
To pre-compute or not?
Data aware
pre-computation
Pinot@LinkedIn
Site-facing apps
Reporting dashboards
Monitoring
Breaking the cycle
Form hypothesis
Query
Repeat
OR …
Hmm... what's up with Portuguese- and Spanish-speaking countries?
Brazil?
Holidays in Brazil 2015
Pinot Roadmap
Pinot is open source!
github.com/linkedin/pinot
Kapil Surlaker
@kapilsurlaker
Shirshanka Das
@shirshanka
github.com/linkedin/ (gobblin, cubert, pinot)
Thanks!

Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn