Bigger, Faster, Easier: Building a Real-Time Self Service Data Analytics Ecosystem at LinkedIn
This document discusses technologies for data ingestion, transformation, and analytics. It introduces Gobblin for scalable data ingestion from diverse sources, Cubert for converting data formats, WhereHows for data lineage tracking, and Pinot for real-time analytics. Gobblin provides a framework for extracting, converting, and validating data in parallel tasks. Cubert allows converting data between formats using a domain-specific language. WhereHows tracks lineage metadata to answer questions about where data came from and how it flows. Pinot is a real-time distributed OLAP store for interactive queries on fresh data using a SQL-like interface.
In this document
Introduction to the Data Driven Network concept presented by Kapil Surlaker and Shirshanka Das at Hadoop Summit 2015.
Overview of the PYMK (People You May Know) process and the complexities of the Hadoop Ingest Pipeline.
Step-by-step process of the Central Ingestion Framework focusing on source diversity and data quality.
Addressing inefficiencies in real-time data processing with YARN and Apache Helix, emphasizing data quality.
Current Gobblin architecture used at LinkedIn for data ingestion that manages tens of TBs daily from multiple sources.
Transformation strategies in data processing, introducing Cubert for enhanced efficiency.
Asking questions about data flow and management, addressing data lineage and using WhereHows for tracking. Architecture of WhereHows and its future roadmap, including integration with streaming ecosystems like Kafka.
Exploring real-time and interactive metrics with focus on device geographical views and preprocessing.
Identifying challenges in data management such as scalability and latency; introduction of Pinot.
Details on querying capabilities in Pinot, showcasing SQL-like interfaces and data handling methods.
Explanation of the Pinot architecture merging real-time data with historical queries and pre-computation strategies.
Use cases for Pinot at LinkedIn, future developments, and open-source contributions by Kapil Surlaker.
Data Quality
Per record, per task, or per job
Composable quality checkers
Schema compatibility
Audit check
Sensitive fields
Unique key
Policy driven
[Figure: quality checking flow. Labels: Record, Writer, Task, Job, Quality Checker, Policy Checker, Fail, Quarantine]
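The idea of composable, policy-driven quality checkers can be sketched in a few lines of Python. This is only an illustration of the concept on the slide, not Gobblin's actual API: the names `Checker`, `Policy`, `run_checkers`, and the three checker factories are all hypothetical, and the sketch assumes a failed check either aborts the run (fail) or diverts the record (quarantine).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


class Policy(Enum):
    FAIL = "fail"              # a violation aborts the whole run
    QUARANTINE = "quarantine"  # a violation diverts the record


@dataclass
class Checker:
    """Hypothetical pairing of a check function with the policy applied on failure."""
    name: str
    check: Callable[[Record], bool]
    policy: Policy


def schema_compatible(required: List[str]) -> Callable[[Record], bool]:
    # Passes when every required field is present.
    return lambda rec: all(field in rec for field in required)


def no_sensitive_fields(sensitive: List[str]) -> Callable[[Record], bool]:
    # Passes when none of the sensitive fields is present.
    return lambda rec: not any(field in rec for field in sensitive)


def unique_key(key: str) -> Callable[[Record], bool]:
    # Stateful check: passes only the first time a key value is seen.
    seen = set()

    def check(rec: Record) -> bool:
        value = rec.get(key)
        if value in seen:
            return False
        seen.add(value)
        return True

    return check


def run_checkers(records: List[Record],
                 checkers: List[Checker]) -> Tuple[List[Record], List[Record]]:
    """Apply every checker to every record; return (passed, quarantined)."""
    passed, quarantined = [], []
    for rec in records:
        failed = [c for c in checkers if not c.check(rec)]
        if any(c.policy is Policy.FAIL for c in failed):
            raise ValueError(f"record failed checker(s): {[c.name for c in failed]}")
        if failed:
            quarantined.append(rec)
        else:
            passed.append(rec)
    return passed, quarantined
```

Because each checker is just a function plus a policy, new checks can be added to the list without touching the loop that applies them.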
Current Activity
Open source @ github.com/linkedin/gobblin
In production @ LinkedIn
Tens of TB per day
Hundreds of datasets
~20 different sources
Gobblin on YARN
Where is the billings data?
How did it get here?
What data is used to create inferred skills data?
Who owns that flow?
When will the latest profile data show up?
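Questions like these amount to lookups over a lineage graph. The sketch below shows one way such a graph could be modeled. It is not WhereHows' data model or API: the `Flow` type, the helper functions, and every flow, owner, and source name in the usage example are hypothetical. It assumes each dataset is produced by at most one flow.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Flow:
    """Hypothetical lineage record: a job that reads inputs and writes one output."""
    name: str
    owner: str
    inputs: Tuple[str, ...]
    output: str


def build_producer_index(flows: Iterable[Flow]) -> Dict[str, Flow]:
    # Map each dataset to the flow that produces it.
    return {flow.output: flow for flow in flows}


def upstream_datasets(dataset: str, producers: Dict[str, Flow]) -> Set[str]:
    """Answer "what data is used to create X?" by walking the graph upstream."""
    seen: Set[str] = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        flow = producers.get(current)
        if flow is None:  # a source dataset with no recorded producer
            continue
        for source in flow.inputs:
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


def owner_of(dataset: str, producers: Dict[str, Flow]) -> Optional[str]:
    """Answer "who owns that flow?" for the flow producing the dataset."""
    flow = producers.get(dataset)
    return flow.owner if flow else None
```

With two made-up flows, `Flow("profile_ingest", "team-a", ("profile_db",), "profile")` and `Flow("skills_inference", "team-b", ("profile", "endorsements"), "inferred_skills")`, calling `upstream_datasets("inferred_skills", ...)` returns `profile`, `endorsements`, and `profile_db`, and `owner_of("inferred_skills", ...)` returns `team-b`.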
More dimensions!
Device   Geo  Carrier   View
Android  US   ATT       1
Android  IN   Reliance  1
iOS      US   Verizon   1

Dimension    View
Android      2
iOS          1
US           2
IN           1
ATT          1
Reliance     1
Verizon      1
Android,US   1
...          ...
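The pre-computed view counts above can be reproduced with a small roll-up over every combination of dimensions. The sketch below is an illustration of the idea, not the pipeline used at LinkedIn; the function name `rollup` and the lowercase column names are assumptions, while the three input rows come from the table.

```python
from collections import Counter
from itertools import combinations

DIMENSIONS = ("device", "geo", "carrier")

ROWS = [
    {"device": "Android", "geo": "US", "carrier": "ATT", "views": 1},
    {"device": "Android", "geo": "IN", "carrier": "Reliance", "views": 1},
    {"device": "iOS", "geo": "US", "carrier": "Verizon", "views": 1},
]


def rollup(rows, dimensions):
    """Sum views for every non-empty combination of dimensions.

    Keys are (dimension names, dimension values) pairs so that equal values
    in different dimensions cannot collide.
    """
    totals = Counter()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                values = tuple(row[d] for d in dims)
                totals[(dims, values)] += row["views"]
    return totals


totals = rollup(ROWS, DIMENSIONS)
print(totals[(("device",), ("Android",))])            # 2
print(totals[(("geo",), ("US",))])                    # 2
print(totals[(("device", "geo"), ("Android", "US"))])  # 1
print(len(totals))                                     # 19
```

Three input rows already produce 19 pre-computed entries. Each row contributes one entry per non-empty subset of dimensions (2^d - 1 for d dimensions), which is why adding dimensions makes full pre-computation expensive.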
(S)QL: Filters and Aggs
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y' AND
action = 'stop'
(S)QL: Group By
SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
'day' >= 15949 AND 'day' <= 15963 AND
paid = 'y'
GROUP BY action
(S)QL: ORDER BY and LIMIT
SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND
entityId = 1000 AND
action = 'start'
ORDER BY creationTime DESC LIMIT 1
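To make the three query shapes concrete, the sketch below evaluates their semantics over an in-memory list of events in plain Python. It is not how Pinot executes queries and does not use a Pinot client. The helper names and all sample rows are hypothetical; only the column names and filter constants are taken from the queries, and the ORDER BY example uses a simplified filter.

```python
from collections import Counter
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]
Predicate = Callable[[Event], bool]


def count_where(events: Iterable[Event], predicate: Predicate) -> int:
    """SELECT count(*) ... WHERE <predicate>"""
    return sum(1 for event in events if predicate(event))


def count_group_by(events: Iterable[Event], predicate: Predicate, column: str) -> Counter:
    """SELECT count(*) ... WHERE <predicate> GROUP BY <column>"""
    return Counter(event[column] for event in events if predicate(event))


def top_n_desc(events: Iterable[Event], predicate: Predicate,
               order_column: str, limit: int) -> List[Event]:
    """SELECT * ... WHERE <predicate> ORDER BY <order_column> DESC LIMIT <limit>"""
    matching = [event for event in events if predicate(event)]
    return sorted(matching, key=lambda e: e[order_column], reverse=True)[:limit]


# Hypothetical sample rows, invented for illustration only.
EVENTS = [
    {"entityId": 121011, "day": 15950, "paid": "y", "action": "stop", "creationTime": 100},
    {"entityId": 121011, "day": 15955, "paid": "y", "action": "start", "creationTime": 200},
    {"entityId": 121011, "day": 15960, "paid": "y", "action": "stop", "creationTime": 300},
    {"entityId": 121011, "day": 15970, "paid": "y", "action": "stop", "creationTime": 400},
    {"entityId": 555, "day": 15955, "paid": "n", "action": "start", "creationTime": 500},
]


def paid_in_range(event: Event) -> bool:
    # Shared filter: entity 121011, day within [15949, 15963], paid = 'y'.
    return (event["entityId"] == 121011
            and 15949 <= event["day"] <= 15963
            and event["paid"] == "y")


print(count_where(EVENTS, lambda e: paid_in_range(e) and e["action"] == "stop"))  # 2
print(count_group_by(EVENTS, paid_in_range, "action"))
print(top_n_desc(EVENTS, lambda e: e["entityId"] == 121011 and e["action"] == "start",
                 "creationTime", 1))
```

On the sample rows the group-by returns two `stop` events and one `start` event, and the ORDER BY / LIMIT 1 query returns the single most recent `start` event for entity 121011.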