The document summarizes an agenda for a Spark Meetup event hosted by LinkedIn. The event includes presentations on LinkedIn's data ecosystem, optimizing Spark performance with Dr. Elephant, and understanding Spark application scalability limits. It also lists the speakers and timing of presentations, breaks for Q&A, and a post-event networking lunch. The document provides context on LinkedIn's infrastructure challenges at scale in areas like federated HDFS, cluster management, and computation strategies for large datasets. It also mentions some of LinkedIn's open source projects.
2. Agenda
10:00 AM Welcome Guests and Speakers
10:15 AM Introduction to the LinkedIn Data Ecosystem by Gaurav Verma from LinkedIn (15 mins)
10:30 AM Dr. Elephant: Achieving Quicker, Easier, and Cost-effective Analytics in Spark by Akshay Rai from LinkedIn (30 mins)
11:00 AM Q&A and Break (30 mins)
11:30 AM Understanding Scalability Limits of Spark Applications by Rohit Karlupia from Qubole (1 hr)
12:30 PM Q&A, Feedback Form, Networking and Distribution of Goodies (30 mins)
1:00 PM Lunch and Networking @ LinkedIn Cafeteria
5. Scale @ LinkedIn
>2 trillion messages per day
>0.5 PB in and 2.3 PB out per day (compressed)
>15 M messages per sec at peaks
>4K users
>100 TB ingested per day
>100 PB of HDFS
>200K jobs per day across >10 clusters (>9000 nodes)
6. Analytics Infrastructure
Data Sources: Oracle DB, Voldemort, Espresso, Kafka, 3rd party services
Data Ingestion: Gobblin
Data Storage: HDFS
Data Access Layer: Dali
Data Processing: Pig, Hive, Spark, MR, Presto
Workflow Scheduler: Azkaban
Analytics Use Cases: Analytics, Relevance, Reporting; XLNT, Raptor, Third Eye
7. LinkedIn Infrastructure Challenges
Scaling up system
• Federated HDFS
• Dali
Scaling up cluster management
• Hadoop OrgQueue
Scaling up computation
• Dr. Elephant
• Better computation strategy for handling larger datasets
Scaling up system
• Tens of thousands of nodes
• Tens of PB of data
Scaling up cluster management
• Thousands of daily active users
• Hundreds of thousands of jobs
Scaling up computation
• Limited shared computation resources