Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn

The Data Driven Network
Kapil Surlaker
Director of Engineering
Powering the Data Driven Network
Kapil Surlaker and Shirshanka Das
Hadoop Summit 2015

Step 1
Central transport pipeline

Hadoop Ingest Pipeline
Complexity

Step 2
Central Ingestion Framework

Requirements
Source Diversity
Batch and Streaming
Data Quality

[Figure: a Source is split into Work Units; each Work Unit runs as a Task through Extract, Convert, Quality, and Write stages, followed by Data Publish.]
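
The Extract, Convert, Quality, Write flow per work unit, followed by a publish step, can be sketched roughly as below. This is a minimal illustration and not Gobblin's actual API: `run_task`, `run_job`, and the callback parameters are hypothetical names, and "writing" is modelled as appending to an in-memory staging list.

```python
from typing import Callable, Dict, Iterable, List


def run_task(work_unit: Iterable[Dict],
             convert: Callable[[Dict], Dict],
             quality_ok: Callable[[Dict], bool]) -> List[Dict]:
    """Run Extract -> Convert -> Quality -> Write for one work unit."""
    staged = []
    for record in work_unit:           # Extract: pull records from the work unit
        converted = convert(record)    # Convert: normalize the record
        if quality_ok(converted):      # Quality: keep only records that pass
            staged.append(converted)   # Write: stage the output (in memory here)
    return staged


def run_job(work_units, convert, quality_ok) -> List[Dict]:
    """Run one task per work unit, then publish all staged output together."""
    staged_outputs = [run_task(wu, convert, quality_ok) for wu in work_units]
    # Data Publish: expose staged output only after every task has finished.
    return [record for output in staged_outputs for record in output]


# Hypothetical input: two work units carved out of one source.
work_units = [
    [{"id": 1, "name": " a "}],
    [{"id": 2, "name": "b"}, {"id": None, "name": "c"}],
]
result = run_job(
    work_units,
    convert=lambda r: {**r, "name": r["name"].strip()},
    quality_ok=lambda r: r["id"] is not None,
)
```

Staging first and publishing last is one way to keep partially written output invisible to readers; whether the real framework does exactly this is an assumption of the sketch.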

Taming Source Diversity
REST
SFTP
JDBC
Protocol
Config
Source Extractor
checkpoint
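
One way to picture protocol-specific extractors driven by config and a checkpoint is the sketch below. Everything here is an assumption made for illustration (the `Extractor` interface, the `watermark_column` config key, the in-memory stand-in for a JDBC source); it is not Gobblin's API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List, Tuple


class Extractor(ABC):
    """Hypothetical interface: one implementation per protocol (REST, SFTP, JDBC)."""

    @abstractmethod
    def read(self, config: Dict, since: int) -> Iterator[Dict]:
        """Yield records newer than the checkpoint value `since`."""


class InMemoryExtractor(Extractor):
    """Stand-in for a JDBC-style source, backed by a list of rows."""

    def __init__(self, rows: List[Dict]):
        self.rows = rows

    def read(self, config: Dict, since: int) -> Iterator[Dict]:
        key = config["watermark_column"]
        return (row for row in self.rows if row[key] > since)


def pull(extractor: Extractor, config: Dict, checkpoint: int) -> Tuple[List[Dict], int]:
    """Pull new records and return them with the advanced checkpoint."""
    records = list(extractor.read(config, checkpoint))
    key = config["watermark_column"]
    new_checkpoint = max((r[key] for r in records), default=checkpoint)
    return records, new_checkpoint


rows = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
source = InMemoryExtractor(rows)
config = {"watermark_column": "ts"}

first, checkpoint = pull(source, config, checkpoint=0)
rows.append({"ts": 5})                      # new data arrives at the source
second, checkpoint = pull(source, config, checkpoint)
third, checkpoint = pull(source, config, checkpoint)
```

Because the checkpoint travels with the job rather than the protocol, adding a new source type only means adding another `Extractor` implementation.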

Solving for real-time
Inefficiencies in batch
YARN based
Apache Helix
Continuous
Auto-scaling
[Figure: YARN and Helix; Executors 1 to 3; HDFS; Stream Source.]

Data Quality
Per record, per task, or per job
Composable quality checkers
Schema compatibility
Audit check
Sensitive fields
Unique key
Policy driven
[Figure: Record, Task, and Job quality checkers; Policy Checker; Writer; outcomes Fail and Quarantine.]
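
Composable, policy-driven checks might look something like the following sketch. The checker factories, the `Policy` enum, and `apply_checks` are hypothetical names chosen for illustration; only the ideas (unique key, sensitive fields, fail versus quarantine) come from the slide.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, List, Set, Tuple


class Policy(Enum):
    FAIL = "fail"              # abort the run on violation
    QUARANTINE = "quarantine"  # set the record aside and continue


Checker = Callable[[Dict], bool]


def unique_key_checker(key: str) -> Checker:
    """Pass a record only the first time its key value is seen."""
    seen = set()

    def check(record: Dict) -> bool:
        if record[key] in seen:
            return False
        seen.add(record[key])
        return True

    return check


def no_sensitive_fields(forbidden: Set[str]) -> Checker:
    """Pass a record only if it carries none of the forbidden field names."""
    return lambda record: forbidden.isdisjoint(record)


def apply_checks(records: Iterable[Dict],
                 checks: List[Tuple[Checker, Policy]]):
    """Run every checker on every record and act on the attached policy."""
    good, quarantined = [], []
    for record in records:
        violated = [policy for check, policy in checks if not check(record)]
        if Policy.FAIL in violated:
            raise ValueError(f"quality check failed for record: {record}")
        if violated:
            quarantined.append(record)
        else:
            good.append(record)
    return good, quarantined


checks = [
    (unique_key_checker("id"), Policy.QUARANTINE),
    (no_sensitive_fields({"ssn"}), Policy.FAIL),
]
good, quarantined = apply_checks(
    [{"id": 1, "email": "x"}, {"id": 1}, {"id": 2}], checks
)
```

Pairing each checker with a policy keeps the check logic reusable while letting each dataset decide how strict to be.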

Current Activity
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn
Tens of TB per day
Hundreds of datasets
~20 different sources
Gobblin on YARN

Transformation: No one size fits all

Cubert: Converting hours to minutes
http://github.com/linkedin/cubert
Physical language
Block organization
Specialized operators

Where is the billings data?
How did it get here?
What data is used to create inferred skills data?
Who owns that flow?
When will the latest profile data show up?

Where is my data?
….
WhereHows

WhereHows: Roadmap
Streaming ecosystem integration
Kafka, Samza
Recommendations for Datasets, Metrics
Exploring Open Source

Precompute!
Device   Geo  View
Android  US   1
Android  IN   1
iOS      US   1
Dimension   View
Android     2
iOS         1
US          2
IN          1
Android,US  1
iOS,US      1
Android,IN  1
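
The precomputation shown in the tables above can be reproduced with a short sketch: for every raw row, add its view count to every non-empty combination of its dimension values. The function name and the `(dimension, value)` key layout are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def precompute(rows, dimensions, metric="views"):
    """Aggregate the metric for every non-empty combination of dimensions."""
    cube = Counter()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                # Key by (dimension, value) pairs so values from different
                # dimensions can never collide.
                key = tuple((d, row[d]) for d in dims)
                cube[key] += row[metric]
    return cube


rows = [
    {"device": "Android", "geo": "US", "views": 1},
    {"device": "Android", "geo": "IN", "views": 1},
    {"device": "iOS", "geo": "US", "views": 1},
]
cube = precompute(rows, ["device", "geo"])

android_views = cube[(("device", "Android"),)]
us_views = cube[(("geo", "US"),)]
android_us_views = cube[(("device", "Android"), ("geo", "US"))]
```

A query such as "views for Android" then becomes a single lookup instead of a scan over the raw rows.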

More dimensions!
Device   Geo  Carrier   View
Android  US   ATT       1
Android  IN   Reliance  1
iOS      US   Verizon   1
Dimension   View
Android     2
iOS         1
US          2
IN          1
ATT         1
Reliance    1
Verizon     1
Android,US  1
...         ...
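
To see why adding dimensions makes full precomputation expensive, the sketch below counts the distinct precomputed keys for the two-dimension and three-dimension examples. With n dimensions there are 2^n - 1 non-empty dimension groupings, so the key count grows quickly. The helper name is hypothetical.

```python
from itertools import combinations


def count_precomputed_keys(rows, dimensions):
    """Count distinct aggregate keys across all non-empty dimension groupings."""
    keys = set()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                keys.add(tuple((d, row[d]) for d in dims))
    return len(keys)


rows = [
    {"device": "Android", "geo": "US", "carrier": "ATT"},
    {"device": "Android", "geo": "IN", "carrier": "Reliance"},
    {"device": "iOS", "geo": "US", "carrier": "Verizon"},
]

two_dims = count_precomputed_keys(rows, ["device", "geo"])
three_dims = count_precomputed_keys(rows, ["device", "geo", "carrier"])
groupings_two = 2 ** 2 - 1     # device, geo, device+geo
groupings_three = 2 ** 3 - 1   # seven groupings once carrier is added
```

Even with only three raw rows, one extra dimension takes the precomputed table from 7 keys to 19.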

Challenges
Horizontally scalable
Low latency
Data freshness
Fault tolerance
OLAP features

Key features
SQL-like interface
Columnar storage and indexing
Real-time data load
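
As a rough illustration of how columnar storage, indexing, and real-time load fit together, the toy table below stores each column separately, maintains an inverted index per column, and makes each row queryable as soon as it is appended. This is a teaching sketch under those assumptions, not Pinot's implementation; the class and method names are invented.

```python
from collections import defaultdict


class ColumnarTable:
    """Toy column store with one inverted index per column."""

    def __init__(self, columns):
        self.columns = {c: [] for c in columns}             # column -> values
        self.index = {c: defaultdict(set) for c in columns}  # column -> value -> row ids
        self.num_rows = 0

    def append(self, row):
        """Load one row; it is visible to queries immediately."""
        for c in self.columns:
            self.columns[c].append(row[c])
            self.index[c][row[c]].add(self.num_rows)
        self.num_rows += 1

    def count_where(self, **equals):
        """COUNT(*) with AND-ed equality filters, answered from the indexes."""
        matching = None
        for column, value in equals.items():
            row_ids = self.index[column].get(value, set())
            matching = row_ids if matching is None else matching & row_ids
        return self.num_rows if matching is None else len(matching)


table = ColumnarTable(["entityId", "paid", "action"])
table.append({"entityId": 121011, "paid": "y", "action": "stop"})
table.append({"entityId": 121011, "paid": "n", "action": "start"})
stops_before = table.count_where(entityId=121011, action="stop")

table.append({"entityId": 121011, "paid": "y", "action": "stop"})  # real-time load
stops_after = table.count_where(entityId=121011, action="stop")
```

Intersecting per-column row-id sets answers filtered counts without touching columns that the query does not mention.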

(S)QL: Filters and Aggs
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y' AND
action = 'stop'

(S)QL: Group By
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE 'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y'
GROUP BY action

(S)QL: ORDER BY and LIMIT
SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 1000 AND
action = 'start'
ORDER BY creationTime DESC LIMIT 1

Pinot Architecture
[Figure: Queries enter through the Broker; Helix; Real-time nodes fed from Kafka; Historical nodes fed from Hadoop; Raw Data; Samza.]

Pinot@LinkedIn
Site-facing apps
Reporting dashboards
Monitoring

Breaking the cycle
Form hypothesis
Query
Repeat
OR …

Hmm... what's up with Portuguese- and Spanish-speaking countries?

Pinot Roadmap
Pinot is Open Source!!!
github.com/linkedin/pinot

Thanks!
Kapil Surlaker, @kapilsurlaker
Shirshanka Das, @shirshanka
github.com/linkedin/gobblin
github.com/linkedin/cubert
github.com/linkedin/pinot
