SRE
Bruno Connelly
#LinkedInWIT
The Past, Present and Future of Big
Data @ LinkedIn
People You May Know
Suja Viswesan
SR ENGINEERING MANAGER, BIG DATA PLATFORM
MEMBERS COMPANIES JOBS SKILLS SCHOOLS KNOWLEDGE
Scale of Processing @ LinkedIn
2.3 Trillion
Messages per Day
0.6 PB in, 2.3 PB out
per Day (compressed)
16 Million
Messages per Second at peaks!
4.6K users
125 TB ingested per day
120 PB of HDFS
224K jobs per day across
13 clusters (9K nodes)
220+ Applications
Most Applications require
Stateful Processing ~
several TBs (overall)
800+ nodes across 9
clusters
Samza
Big Data!
Collect
- Collect User Events from
Across the Globe
- E.g., Page Views, Feed
Impressions, Connections
- Multiple Sources of Data
- Transport Data with Low
Latency
- Scale - 2.3 trillion msgs/day
(~2.5 PB) (PYMK Scale ~10K
msgs/sec)
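A rough sanity check on the scale figures above: dividing the quoted daily volume by the quoted daily message count gives an average message size. This is only a back-of-the-envelope sketch; it assumes decimal petabytes (1 PB = 10**15 bytes) and that the ~2.5 PB figure covers the same 2.3 trillion messages, neither of which the slide states explicitly. The function name is illustrative.

```python
# Back-of-the-envelope estimate from the figures quoted on the slide.
MESSAGES_PER_DAY = 2.3e12  # 2.3 trillion msgs/day (from the slide)
BYTES_PER_DAY = 2.5e15     # ~2.5 PB/day; ASSUMES decimal petabytes (10**15 bytes)


def average_message_bytes(total_bytes: float, total_messages: float) -> float:
    """Return the mean payload size in bytes per message."""
    return total_bytes / total_messages


avg_bytes = average_message_bytes(BYTES_PER_DAY, MESSAGES_PER_DAY)
print(f"average message size: about {avg_bytes:.0f} bytes")
```

Under these assumptions the average message comes out at roughly 1 KB, which suggests the pipeline carries a very large number of small events rather than a few large payloads.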
Process
- Highly Reliable and
Fault-tolerant Processing of
Events
- Offline Batch Processing
- Near-realtime Stream
Processing
- Seamlessly Transport Results
from Offline Processing to
Online Services
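To make "near-realtime stream processing" with state a little more concrete, here is a minimal sketch of a stateful processor that keeps a running count per member and event type as events arrive. It is plain Python for illustration only and does not use the Samza API; the event shape `(member_id, event_type)` and the function name `count_events` are hypothetical, and a real system would keep the state in a durable, fault-tolerant store rather than an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (member_id, event_type). Not a LinkedIn schema.
Event = Tuple[str, str]


def count_events(events: Iterable[Event]) -> Iterator[Tuple[str, str, int]]:
    """Consume events one at a time and emit the updated running count."""
    # Local state keyed by (member_id, event_type); in-memory only in this sketch.
    state: Dict[Tuple[str, str], int] = defaultdict(int)
    for member_id, event_type in events:
        key = (member_id, event_type)
        state[key] += 1
        yield member_id, event_type, state[key]


sample = [("m1", "page_view"), ("m2", "feed_impression"), ("m1", "page_view")]
results = list(count_events(sample))
for member_id, event_type, count in results:
    print(member_id, event_type, count)
```

Each output depends on everything seen so far for that key, which is what distinguishes stateful processing from a stateless per-message transformation.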
Access
- Persist Data Durably
- High availability for Serving
Online Services
- Data should be Searchable
Analytics Infrastructure
Gobblin
Espresso
Data
Sources
3rd Party
Services
Data
Ingestion
Oracle DB
HDFS
Voldemort
Data
Storage
Dataset
Management
Dali
Datasets
Analytics Infrastructure
A/B
testing
Cluster
Management
Compute
Engines
Workflow
Orchestration
Usecases
Relevance
Analytics
Reporting
YARN Azkaban
Analytics Infrastructure Challenges
Computation
Cluster Management
System
Scaling up computation
● Limited shared computation resources
● Efficient computation to cut down cost of jobs
Scaling up cluster management
● Thousands of daily active cluster users
● Hundreds of thousands of daily jobs
● A mix of SLA requirements
Scaling up system
● Tens of thousands of nodes
● Tens of PB of data
The Scaling Pyramid
Our Solutions
Scaling up system
● Federated HDFS
● Dali - Logical Data Access Layer for Hadoop
Scaling up cluster management
● Hadoop OrgQueue
● Elasticity Tuner
Scaling up computation
● Dr. Elephant
● Better computation strategy for handling large datasets
LinkedIn Open Source Projects
Pinot
Dr. Elephant
Cubert
Streaming
Near Realtime
Stream Processing
Data Management Performance Tuning OLAP Storage
Computation Engine Workflow Manager
Samza
Photon - ML
Bruno Connelly
See you at Grace Hopper Celebration!
