Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl
With more than 700 million monthly active users, Instagram continues to make it easier for people across the globe to join the community, share their experiences, and strengthen connections to their friends and passions. Powering Instagram’s various products requires the use of machine learning, high performance ranking services, and most importantly large amounts of data. At Instagram, we use Apache Spark for several critical production pipelines, including generating labeled training data for our machine learning models. In this session, you’ll learn about how one of Instagram’s largest Spark pipelines has evolved over time in order to process ~300 TB of input and ~90 TB of shuffle data. We’ll discuss the experience of building and managing such a large production pipeline and some tips and tricks we’ve learned along the way to manage Spark at scale. Topics include migrating from RDD to Dataset for better memory efficiency, splitting up long-running pipelines in order to better tune intermediate shuffle data, and dealing with changing data skew over time. Finally, we will also go over some optimizations we have made in order to maintain reliability of this critical data pipeline.
1. LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTION
   Brandon Carl
2. MARCH 15, 2016: "SEE THE MOMENTS YOU CARE ABOUT FIRST"
3. MACHINE LEARNING
4. MACHINE LEARNING LIFECYCLE
   [Cycle diagram: Training Examples, Machine Learning Model, Make Predictions, Measure Outcomes, Ranking Events, Client Events]
5. WHY SPARK?
   • Performance
   • Testability
   • Modularity
   • Serialized Logging
6. SERIALIZED LOGGING
   {
     "id": 123,
     "scores": {
       "modelA": 0.2345,
       "modelB": 0.0012
     },
     "features": {
       "1001": 0.9934,
       "1002": 0.1923
     }
   }
7. SERIALIZED LOGGING
   struct Candidate {
     1: i64 id;
     2: map<string, double> scores;
     3: map<i64, double> features;
   }

   new Candidate()
     .setId(id)
     .setScores(scores)
     .setFeatures(features)
8. CHANGES OVER TIME
9. CHANGES OVER TIME
   • RDD
   • Dataset
   • Training Data Joiner
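The abstract cites memory efficiency as the motive for the RDD-to-Dataset migration. A minimal sketch of what that change looks like; the `Candidate` fields and values here are illustrative, not the pipeline's real schema:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class Candidate(id: Long, score: Double)

object RddVsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-dataset").getOrCreate()
    import spark.implicits._

    // RDD version: every row lives on the JVM heap as a deserialized object.
    val rdd = spark.sparkContext
      .parallelize(Seq(Candidate(123L, 0.2345)))
      .filter(_.score > 0.1)

    // Dataset version: same logic, but rows stay in Tungsten's compact
    // binary format and are only deserialized when the lambda runs.
    val ds: Dataset[Candidate] = Seq(Candidate(123L, 0.2345)).toDS()
      .filter(_.score > 0.1)

    spark.stop()
  }
}
```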
10. TRAINING DATA JOINER
    class MyTrainingDataJoiner(spark: SparkSession) extends TrainingDataJoiner {
      val labels: Map[String, LabelFunction] = ???
    }

    case class Output(id: Long, label_value: Double)
11. MANAGING MASSIVE SCALE
12. MANAGING MASSIVE SCALE - PEOPLE
13. AUTOMATE EVERYTHING
14. SIMPLE INTERFACE
15. SIMPLE INTERFACE
    RankingEvent
      .read("input_table", "2017-10-25")
      .filter(...)
      .map(...)
      .write("output_table", "2017-10-25")
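`RankingEvent` is internal to Instagram, so the following is purely illustrative: a hypothetical sketch of how a thin wrapper over date-partitioned Hive tables could keep day-to-day pipeline code down to a few chained calls. Table and column names (`ds` as the date partition) are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper, not the real RankingEvent API: reads one date
// partition of a table and writes results back, so callers only chain
// read / filter / map / write.
class PartitionedTable(spark: SparkSession) {
  def read(table: String, ds: String): DataFrame =
    spark.table(table).filter(col("ds") === ds)  // "ds" partition column assumed

  def write(df: DataFrame, table: String): Unit =
    df.write.mode("overwrite").insertInto(table)
}
```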
16. MANAGING MASSIVE SCALE - DATA
17. PLAN FOR GROWTH
18. PERSIST TO HDFS
19. PERSIST TO HDFS
    [Diagram, before: two Source Data -> Map/Filter branches feed directly into Join Output]
20. PERSIST TO HDFS
    [Diagram, after: each Source Data -> Map/Filter branch is persisted to a Temporary Table, and the temporary tables feed Join Output]
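The abstract mentions splitting long-running pipelines to tune intermediate shuffle data independently; persisting each branch to HDFS before the join gives every stage its own clean boundary for tuning and retries. A hedged sketch of that shape; the table names and paths are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object PersistThenJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-then-join").getOrCreate()

    // Branch 1: map/filter, then persist the output to a temporary HDFS table
    // so this stage can be sized, tuned, and re-run on its own.
    spark.table("ranking_events")            // illustrative table name
      .filter("score > 0.1")
      .write.mode("overwrite")
      .parquet("hdfs:///tmp/ranking_filtered")  // illustrative path

    // Branch 2 is persisted the same way (omitted). Reading both back means
    // the join starts from materialized data instead of a 90 TB live shuffle
    // chained behind both upstream stages.
    val left  = spark.read.parquet("hdfs:///tmp/ranking_filtered")
    val right = spark.read.parquet("hdfs:///tmp/client_filtered")
    val joined = left.join(right, "id")

    spark.stop()
  }
}
```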
21. KRYO SERIALIZATION
22. KRYO SERIALIZATION
    new SparkConf()
      // Use Kryo instead of default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast on any class that was not explicitly registered,
      // instead of silently falling back to writing full class names.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[...], ...))
23. BIG-O MATTERS
24. BIG-O MATTERS
    final def withName(s: String): Value =
      values
        .find(_.toString == s)
        .getOrElse(throw new NoSuchElementException(...))
25. BIG-O MATTERS
    final def withName(s: String): Value =
      values
        .find(_.toString == s)
        .getOrElse(throw new NoSuchElementException(...))
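The `withName` shown above (from Scala's `Enumeration`) does a linear scan over `values`, so calling it once per record in a hot path costs O(n) per lookup. A minimal sketch of the usual fix, assuming an example enum named `Status`: build the name-to-value index once and look up in O(1).

```scala
object Status extends Enumeration {
  val Active, Inactive, Deleted = Value

  // Built once at class initialization; withName rescans values on every call.
  private val byName: Map[String, Value] =
    values.iterator.map(v => v.toString -> v).toMap

  // O(1) hash lookup instead of withName's O(n) linear scan.
  def fromName(s: String): Value =
    byName.getOrElse(s, throw new NoSuchElementException(s"No value named $s"))
}
```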
26. DATA STRUCTURES MATTER
27. DATA STRUCTURES MATTER
    • AnyRefMap
    • IntMap
    • LongMap
    • fastutil (http://fastutil.di.unimi.it)
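The specialized maps listed above avoid boxing and reduce per-entry overhead versus a generic `HashMap`. A small sketch with the standard-library ones (fastutil plays the same role with primitive-keyed maps outside the stdlib); the feature ids and values echo the logging example earlier and are illustrative:

```scala
import scala.collection.mutable

object SpecializedMaps {
  // Feature vectors keyed by Long ids: LongMap keeps keys unboxed,
  // avoiding a java.lang.Long allocation per entry that a generic
  // HashMap[Long, Double] would incur.
  val features = mutable.LongMap.empty[Double]
  features(1001L) = 0.9934
  features(1002L) = 0.1923

  // AnyRefMap is the faster choice for reference keys such as String.
  val scores = mutable.AnyRefMap.empty[String, Double]
  scores("modelA") = 0.2345
}
```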
28. DATA SKEW MATTERS
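One common mitigation for skewed keys (a standard technique, not necessarily the one this pipeline used) is salting: spread a hot key across N sub-keys, aggregate partially, then combine. Sketched here in plain Scala over collections so the two-stage shape is visible; on Spark the same structure applies to a pair of `reduceByKey` passes:

```scala
import scala.util.Random

object SaltingSketch {
  // Counts per key, with "hot" massively skewed relative to the rest.
  val events: Seq[String] = Seq.fill(10000)("hot") ++ Seq("a", "b", "c")

  val numSalts = 8

  // Stage 1: attach a random salt so a hot key spreads over numSalts
  // buckets; in Spark these buckets would land on different reducers.
  val salted: Map[(String, Int), Long] = events
    .map(k => ((k, Random.nextInt(numSalts)), 1L))
    .groupBy(_._1)
    .map { case (saltedKey, vs) => (saltedKey, vs.map(_._2).sum) }

  // Stage 2: drop the salt and combine the (much smaller) partial sums.
  val counts: Map[String, Long] = salted
    .groupBy { case ((k, _), _) => k }
    .map { case (k, partials) => (k, partials.map(_._2).sum) }
}
```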
29. TEST ON SAMPLED DATA
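Developing against a small fraction of the input before running at full 300 TB scale is cheap with Spark's built-in `sample`. A minimal sketch; the table name and output path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object SampledRun {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sampled-run").getOrCreate()

    // Take ~1% of the input and materialize it once, so iteration on the
    // pipeline logic runs in minutes instead of hours.
    spark.table("ranking_events")                       // illustrative table
      .sample(withReplacement = false, fraction = 0.01)
      .write.mode("overwrite")
      .parquet("hdfs:///tmp/ranking_sample_1pct")       // illustrative path

    spark.stop()
  }
}
```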