
Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl

With more than 700 million monthly active users, Instagram continues to make it easier for people across the globe to join the community, share their experiences, and strengthen connections to their friends and passions. Powering Instagram’s various products requires machine learning, high-performance ranking services, and, most importantly, large amounts of data. At Instagram, we use Apache Spark for several critical production pipelines, including generating labeled training data for our machine learning models. In this session, you’ll learn how one of Instagram’s largest Spark pipelines has evolved over time to process ~300 TB of input and ~90 TB of shuffle data. We’ll discuss the experience of building and managing such a large production pipeline, along with some tips and tricks we’ve learned for managing Spark at scale. Topics include migrating from RDD to Dataset for better memory efficiency, splitting up long-running pipelines to better tune intermediate shuffle data, and dealing with data skew that changes over time. Finally, we will go over some optimizations we have made to maintain the reliability of this critical data pipeline.

Published in: Data & Analytics

  1. LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTION
      Brandon Carl
  2. MARCH 15, 2016
      "SEE THE MOMENTS YOU CARE ABOUT FIRST"
  3. MACHINE LEARNING
  4. MACHINE LEARNING LIFECYCLE
      Diagram labels: Training Examples, Machine Learning Model, Make Predictions, Measure Outcomes, Ranking Events, Client Events
  5. WHY SPARK?
      • Performance
      • Testability
      • Modularity
      • Serialized Logging
  6. SERIALIZED LOGGING
      {
        "id": 123,
        "scores": { "modelA": 0.2345, "modelB": 0.0012 },
        "features": { 1001: 0.9934, 1002: 0.1923 }
      }
  7. SERIALIZED LOGGING
      struct Candidate {
        1: i64 id;
        2: map<string, double> scores;
        3: map<i64, double> features;
      }
      new Candidate()
        .setId(id)
        .setScores(scores)
        .setFeatures(features)
  8. CHANGES OVER TIME
  9. CHANGES OVER TIME
      • RDD
      • Dataset
      • Training Data Joiner
  10. TRAINING DATA JOINER
      class MyTrainingDataJoiner(spark: SparkSession) extends TrainingDataJoiner {
        val labels: Map[String, LabelFunction] = ???
      }
      case class Output(id: Long, label_value: Double)
  11. MANAGING MASSIVE SCALE
  12. MANAGING MASSIVE SCALE - PEOPLE
  13. AUTOMATE EVERYTHING
  14. SIMPLE INTERFACE
  15. SIMPLE INTERFACE
      RankingEvent
        .read('input_table', '2017-10-25')
        .filter(...)
        .map(...)
        .write('output_table', '2017-10-25')
  16. MANAGING MASSIVE SCALE - DATA
  17. PLAN FOR GROWTH
  18. PERSIST TO HDFS
  19. PERSIST TO HDFS
      Source Data → Map/Filter → Join → Output
      Source Data → Map/Filter → Join
  20. PERSIST TO HDFS
      Source Data → Map/Filter → Temporary Table → Join → Output
      Source Data → Map/Filter → Temporary Table → Join
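
Slides 19 and 20 contrast a single job, where both mapped and filtered branches feed the join directly, with a staged job that writes each branch to a temporary table before joining. A minimal sketch of the staged variant follows. It assumes Parquet files and a shared `id` join column; the object name, paths, and filter condition are hypothetical and are not taken from the talk.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PersistStagesSketch {
  // Hypothetical stage: read one source, filter it, write the result to a
  // temporary location, then read it back so the join starts from
  // materialized data instead of recomputing the upstream lineage.
  def stage(spark: SparkSession, sourcePath: String, tmpPath: String): DataFrame = {
    spark.read.parquet(sourcePath)
      .filter("id IS NOT NULL") // placeholder for the real map/filter logic
      .write.mode("overwrite").parquet(tmpPath)
    spark.read.parquet(tmpPath)
  }

  def run(spark: SparkSession): Unit = {
    // Paths below are illustrative only.
    val events = stage(spark, "hdfs:///data/events", "hdfs:///tmp/events_filtered")
    val labels = stage(spark, "hdfs:///data/labels", "hdfs:///tmp/labels_filtered")
    events.join(labels, Seq("id"))
      .write.mode("overwrite").parquet("hdfs:///data/joined")
  }
}
```

With the intermediate results on HDFS, a failure in the join need not rerun the map/filter stages, and each stage can be tuned on its own.
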
  21. KRYO SERIALIZATION
  22. KRYO SERIALIZATION
      new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryo.registrationRequired", "true")
        .registerKryoClasses(Array(classOf[...], ...))
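
To make the configuration on slide 22 concrete, here is a sketch in which the elided class list is stood in for by a hypothetical `Candidate` case class, loosely modeled on the Thrift struct from slide 7. The talk does not say which classes were actually registered, so treat the class list as an assumption.

```scala
import org.apache.spark.SparkConf

// Hypothetical record type, used only to have something to register.
final case class Candidate(id: Long, scores: Map[String, Double], features: Map[Long, Double])

object KryoConfSketch {
  val conf: SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Fail fast on any class that was not registered explicitly.
    .set("spark.kryo.registrationRequired", "true")
    .registerKryoClasses(Array(classOf[Candidate], classOf[Array[Candidate]]))
}
```

With `spark.kryo.registrationRequired` set to `true`, serializing an unregistered class raises an error rather than silently falling back to writing full class names, so a real job would likely need to register more classes than the two shown here.
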
  23. BIG-O MATTERS
  24. BIG-O MATTERS
      final def withName(s: String): Value = values
        .find(_.toString == s)
        .getOrElse(throw new NoSuchElementException(...))
  25. BIG-O MATTERS
      final def withName(s: String): Value = values
        .find(_.toString == s)
        .getOrElse(throw new NoSuchElementException(...))
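
The snippet on slides 24 and 25 appears to be `withName` from `scala.Enumeration`, which scans all values linearly on every call. One possible workaround, sketched below, is to build a name-to-value map once and look names up in it. The `EventType` enumeration and the `fastWithName` helper are hypothetical names chosen for illustration; the talk does not show which fix was used.

```scala
object EventType extends Enumeration {
  val Like, Comment, Share = Value

  // Built once on first use: name -> value, so each lookup is a hash
  // lookup instead of a linear scan over all values.
  private lazy val byName: Map[String, Value] =
    values.iterator.map(v => v.toString -> v).toMap

  // Hypothetical replacement for withName with the same failure behavior.
  def fastWithName(s: String): Value =
    byName.getOrElse(s, throw new NoSuchElementException(s"No value found for '$s'"))
}
```

A call such as `EventType.fastWithName("Comment")` returns the same value as `EventType.withName("Comment")`; the difference only matters when the lookup runs once per record over a very large input.
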
  26. DATA STRUCTURES MATTER
  27. DATA STRUCTURES MATTER
      • AnyRefMap
      • IntMap
      • LongMap
      • fastutil (http://fastutil.di.unimi.it)
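
As an illustration of one of the structures listed on slide 27, the sketch below stores feature values in `scala.collection.mutable.LongMap`, which is specialized for `Long` keys. The object name, the feature ids (borrowed from the example on slide 6), and the `featureOrZero` helper are assumptions made for this example.

```scala
import scala.collection.mutable

object LongMapSketch {
  // Feature id -> value. Keys are stored as primitive longs rather than
  // boxed java.lang.Long objects.
  val features: mutable.LongMap[Double] = mutable.LongMap.empty[Double]
  features(1001L) = 0.9934
  features(1002L) = 0.1923

  // Hypothetical helper: missing features default to 0.0.
  def featureOrZero(id: Long): Double = features.getOrElse(id, 0.0)
}
```

Whether a specialized map pays off depends on the workload, so it is worth measuring against a plain `HashMap` on representative data before switching.
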
  28. DATA SKEW MATTERS
  29. TEST ON SAMPLED DATA
