Migrating Complex Data Aggregations from Hadoop to Spark (Ashish Singh and Puneet Kumar, PubMatic)

Presentation at Spark Summit 2015
  1. Migrating Complex Data Aggregations from Hadoop to Spark
     puneet.kumar@pubmatic.com
     ashish.singh@pubmatic.com
  2. Agenda
     • Aggregation and Scale @PubMatic
     • Problem Statement: Why Spark when we run Hadoop
     • 3 Use Cases
     • Configuration Tuning
     • Challenges & Learnings
  3. Who We Are
     • Marketing automation software company
     • Developed the industry’s first real-time analytics solution
  4. PubMatic Analytics
  5. Data: Scale & Complexity
  6. Challenges on Current Stack
     • Ever-increasing hardware costs
     • Complex data flows
     • Cardinality estimation: estimating billions of distinct users
     • Multiple grouping sets
     • Different flows for real-time and batch analytics
  7. 3 Use Cases
     (Diagram: batch data flow)
     • Cardinality Estimation
     • Multi-Stage Workflows
     • Grouping Sets
  8. Why Spark?
     • Efficient for dependent data flows
     • Memory is getting cheaper (Moore’s Law)
     • Optimized hardware usage
     • Unified stack for real-time & batch
     • Awesome Scala APIs
  9. Case 1: Cardinality Estimation
     (Chart: runtime in seconds vs. input size.) Spark is ~25-30% faster than Hive on MR.
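The conclusions slide mentions HLL (HyperLogLog) mask generation for this use case. As a rough illustration of how an HLL sketch estimates billions of distinct users in fixed memory, here is a minimal pure-Python version (simplified, with hypothetical parameters; the deck's actual implementation on Spark/Hive is not shown):

```python
import hashlib
import math

# Minimal HyperLogLog-style sketch (illustrative only): M registers each keep
# the maximum "rank" (position of the first set bit) seen among the hashes
# routed to them; a harmonic mean over the registers estimates the distinct
# count. Relative error is roughly 1.04 / sqrt(M).
M = 1024   # number of registers (2^P)
P = 10     # bits used to pick a register

def _hash64(value):
    # 64-bit hash of the value's string form
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

def add(registers, value):
    h = _hash64(value)
    idx = h & (M - 1)        # low P bits choose the register
    rest = h >> P
    rank = 1                 # position of the first set bit in the rest
    while rest & 1 == 0 and rank <= 64 - P:
        rank += 1
        rest >>= 1
    registers[idx] = max(registers[idx], rank)

def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:     # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw

registers = [0] * M
for user_id in range(50000):         # 50,000 distinct "users"
    add(registers, user_id)
print(round(estimate(registers)))    # estimated distinct count, close to 50000
```

The point of the technique is that the 1024 registers above occupy a few kilobytes regardless of how many distinct users are counted, and two sketches can be merged with an element-wise max, which is what makes them cheap to aggregate in a distributed job.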
  10. Case 2: Multi-Stage Data Flow
      (Chart: runtime in seconds vs. input size.) Spark is ~85% faster than Hive on MR.
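The large win on multi-stage flows comes from keeping intermediate results in memory instead of materializing each MapReduce stage to HDFS. A plain-Python analogy (hypothetical stage names, not the deck's actual pipeline) of why persisting an intermediate helps:

```python
# Plain-Python analogy for why in-memory caching helps multi-stage flows:
# Hive on MR re-reads/recomputes intermediates per downstream job, while
# Spark can persist one parsed intermediate and reuse it (rdd.cache() /
# dataframe.persist() in real Spark code).
parse_calls = 0

def parse_logs(raw_records):
    """Expensive first stage (stand-in for decompress + parse of raw events)."""
    global parse_calls
    parse_calls += 1
    return [int(r) for r in raw_records]

raw_events = [str(i % 7) for i in range(1000)]

# Without caching: every downstream aggregation re-runs the parse stage.
total = sum(parse_logs(raw_events))
distinct = len(set(parse_logs(raw_events)))

# With caching: parse once, then both aggregations reuse the result.
parsed = parse_logs(raw_events)
total_cached = sum(parsed)
distinct_cached = len(set(parsed))

print(parse_calls)  # 3: two uncached runs + one cached run
assert (total, distinct) == (total_cached, distinct_cached)
```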
  11. Case 3: Grouping Sets
      (Chart: query runtimes in seconds, Spark vs. Hive, at 192 GB, 384 GB, and 768 GB input sizes.) Spark is ~150% faster than Hive on MR.
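GROUPING SETS computes several GROUP BYs over the same input in one query. A small pure-Python sketch (hypothetical ad-log columns, not the deck's schema) of the single-scan idea behind the clause:

```python
from collections import defaultdict

# Hypothetical ad-log rows. A GROUPING SETS query computes the aggregates
# for every listed grouping in a single scan, rather than re-reading the
# input once per GROUP BY.
rows = [
    {"country": "US", "device": "mobile",  "imps": 10},
    {"country": "US", "device": "desktop", "imps": 5},
    {"country": "IN", "device": "mobile",  "imps": 7},
]
grouping_sets = [("country",), ("device",), ("country", "device"), ()]

totals = defaultdict(int)
for row in rows:                               # one pass over the data
    for gs in grouping_sets:
        key = (gs, tuple(row[col] for col in gs))
        totals[key] += row["imps"]

print(totals[(("country",), ("US",))])   # 15  (GROUP BY country, country=US)
print(totals[((), ())])                  # 22  (grand total, the empty set)
```

In SQL this corresponds to something like `GROUP BY country, device GROUPING SETS ((country), (device), (country, device), ())`; the single-scan fan-out per row is why both Hive and Spark can answer all four groupings much faster than four separate queries.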
  12. Challenges Faced
      • Spark on YARN: executors did not use full memory
      • Reading nested Avro schemas until Spark 1.2 was tedious
      • Had to rewrite code to leverage spark-avro with Spark 1.3 (DataFrames)
      • Join and deduplication were slow in Spark vs. Hive
  13. Important Performance Params
      • spark.default.parallelism
      • spark.serializer: Kryo serialization improved the runtime
      • spark.sql.inMemoryColumnarStorage.compressed: set to true for Snappy compression
      • spark.sql.inMemoryColumnarStorage.batchSize: increase to a higher optimum value
      • spark.shuffle.memorySize
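The first four of these can be expressed as a spark-defaults.conf fragment (Spark 1.x property names; the values below are illustrative assumptions, not the deck's actual settings, and should be tuned per cluster):

```
# Illustrative values only - tune per cluster and workload
spark.default.parallelism                      200
spark.serializer                               org.apache.spark.serializer.KryoSerializer
spark.sql.inMemoryColumnarStorage.compressed   true
spark.sql.inMemoryColumnarStorage.batchSize    20000
```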
  14. Memory Based Architecture
      (Diagram: Flows 1-3 share an in-memory distributed store, backed by HDFS and S3.)
  15. Conclusions
      • Spark multi-stage workflows were faster by 85% over Hive on MR
      • Single-stage workflows did not see huge benefits
      • HLL mask generation and heavy jobs finished 20-30% faster
      • Use in-memory distributed storage with Spark for multiple jobs on the same input
      • Overall hardware cost is expected to decrease by ~35% due to Spark usage (more memory, fewer nodes)
  16. THANK YOU!