June 14, 2012

Optimizing MapReduce Job
Performance
Todd Lipcon [@tlipcon]
Introductions

    • Software Engineer at Cloudera since 2009
    • Committer and PMC member on HDFS,
      MapReduce, and HBase
    • Spend lots of time looking at full stack
      performance

    • This talk is to help you develop faster jobs
      – If you want to hear about how we made Hadoop
        faster, see my Hadoop World 2011 talk on
        cloudera.com

                       ©2011 Cloudera, Inc. All Rights Reserved.
Aspects of Performance

    • Algorithmic performance
      – big-O, join strategies, data
        structures, asymptotes
    • Physical performance
      – Hardware (disks, CPUs, etc)
    • Implementation performance
      – Efficiency of code, avoiding extra work
      – Make good use of available physical perf


Performance fundamentals

    • You can’t tune what you don’t
      understand
      – MR’s strength as a framework is its black-box
        nature
      – To get optimal performance, you have to
        understand the internals

    • This presentation: understanding the
      black box

Performance fundamentals (2)

    • You can’t improve what you can’t
      measure
      – Ganglia/Cacti/Cloudera Manager/etc a must
      – Top 4 metrics: CPU, Memory, Disk, Network
      – MR job metrics: slot-seconds, CPU-seconds,
        task wall-clocks, and I/O


    • Before you start: run jobs, gather data


Graphing bottlenecks
    Annotations on the graphs:
    • Most jobs not CPU-bound
    • This job might be CPU-bound in map phase
    • Plenty of free RAM, perhaps can make better use of it?
    • Fairly flat-topped network – bottleneck?

Performance tuning cycle


    Run job → Identify bottleneck → Address bottleneck → (repeat)

    • Identify bottleneck
      – Graphs
      – Job counters
      – Job logs
      – Profiler results
    • Address bottleneck
      – Tune configs
      – Improve code
      – Rethink algos

    • In order to understand these metrics and make
      changes, you need to understand MR internals.

MR from 10,000 feet
     InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat

Map-side sort/spill overview
  • Goal: when complete, map task outputs one sorted file
  • What happens when you call
    OutputCollector.collect()?

    1. MapOutputBuffer: in-memory buffer holds serialized,
       unsorted key-values passed to .collect(K,V) by the map task
    2. Output buffer fills up. Contents sorted, partitioned
       and spilled to disk (one IFile per spill)
    3. Map task finishes. All IFiles merged (map-side merge)
       to a single IFile per task

Zooming further: MapOutputBuffer
   (Hadoop 1.0)




    io.sort.mb (total buffer space) is split into:

    • Metadata buffers: io.sort.record.percent * io.sort.mb
      – kvoffsets: 1 indirect-sort index per record
        (4 bytes/rec)
      – kvindices: (Partition, KOff, VOff) per record
        (12 bytes/rec)
    • kvbuffer: (1 - io.sort.record.percent) * io.sort.mb
      – Raw, serialized (Key, Val) pairs (R bytes/rec)

MapOutputBuffer spill behavior

 • Memory is limited: must spill
     – If either of the kvbuffer or the metadata
       buffers fill up, “spill” to disk
     – In fact, we spill before it’s full (in another
       thread): configure io.sort.spill.percent
 • Performance impact
     – If we spill more than one time, we must re-
       read and re-write all data: 3x the IO!
     – #1 goal for map task optimization: spill once!
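
A rough way to see where the "3x" comes from is to count the local-disk bytes a map task moves for its own output. The sketch below is a simplified model, not Hadoop code: the function name is hypothetical, and it assumes that all spill files are merged in a single pass.

```python
def map_side_disk_io(output_bytes: int, num_spills: int) -> int:
    """Approximate local-disk bytes read plus written for one map task's output.

    Simplified model (an assumption of this sketch): each spill writes its
    share of the output once; with more than one spill, a single merge pass
    reads every spill file back and writes one merged file.
    """
    if num_spills < 1:
        raise ValueError("a map task that produces output spills at least once")
    spill_writes = output_bytes
    if num_spills == 1:
        return spill_writes          # the only spill is already the final file
    merge_reads = output_bytes       # re-read all spilled data
    merge_writes = output_bytes      # re-write it as one merged file
    return spill_writes + merge_reads + merge_writes


block = 128 * 2**20
print(map_side_disk_io(block, 3) / map_side_disk_io(block, 1))  # 3.0
```

Under this model, 2 spills and 3 spills cost the same amount of I/O, so the large step is between one spill and more than one.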

Spill counters on map tasks

 • ratio of Spilled Records vs Map Output
   Records
     – if unequal, then you are doing more than one
       spill
 • FILE: Number of bytes read/written
     – get a sense of I/O amplification due to spilling
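
The counter check above can be written as a small helper. This is a minimal sketch with hypothetical function names and made-up counter values; it assumes no combiner is changing the number of records between collection and spill.

```python
def spill_ratio(spilled_records: int, map_output_records: int) -> float:
    """Ratio of the Spilled Records counter to the Map Output Records counter."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


def spilled_more_than_once(spilled_records: int, map_output_records: int) -> bool:
    """True when records were written to disk more often than they were produced."""
    return spilled_records > map_output_records


# Hypothetical counter values read from a finished map task.
print(spill_ratio(2_684_354, 1_342_177))             # 2.0
print(spilled_more_than_once(2_684_354, 1_342_177))  # True
print(spilled_more_than_once(1_342_177, 1_342_177))  # False
```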




Spill logs on map tasks
 2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
 2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
 2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
 2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
 2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
 2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
 2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
 2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
 2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
 2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2

 • "record full = true" indicates that the metadata buffers
   filled up before the data buffer
 • 3 spills total! Maybe we can do better?
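
Counting spills by eye gets tedious across many tasks. A minimal sketch of scanning a task log for the two messages shown on this slide follows; the function name and the dictionary keys are hypothetical, and the patterns only cover the message wording that appears in the excerpt.

```python
import re

# Lines taken from the log excerpt on this slide.
SAMPLE_LOG = """\
2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
"""

FINISHED_SPILL = re.compile(r"MapTask: Finished spill (\d+)")
RECORD_FULL = re.compile(r"Spilling map output: record full = true")


def summarize_spills(log_text: str) -> dict:
    """Count finished spills and spills triggered by a full metadata buffer."""
    return {
        "spills": len(FINISHED_SPILL.findall(log_text)),
        "record_full_triggers": len(RECORD_FULL.findall(log_text)),
    }


print(summarize_spills(SAMPLE_LOG))  # {'spills': 3, 'record_full_triggers': 2}
```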


Tuning to reduce spills

 • Parameters:
     – io.sort.mb: total buffer space
     – io.sort.record.percent: proportion between
       metadata buffers and key/value data
     – io.sort.spill.percent: threshold at which
       spill is triggered
     – Total map output generated: can you use
       more compact serialization?
 • Optimal settings depend on your data and
   available RAM!

Setting io.sort.record.percent

 • Common mistake: metadata buffers fill up
   way before kvdata buffer
 • Optimal setting:
     – io.sort.record.percent = 16/(16 + R)
     – R = average record size: divide “Map Output
       Bytes” counter by “Map Output Records” counter
 • Default (0.05) is usually too low (optimal for
   ~300-byte records)
 • Hadoop 2.0: this is no longer necessary!
     – see MAPREDUCE-64 for gory details
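
The formula above is easy to evaluate from the two job counters. A minimal sketch, with a hypothetical function name and example counter values chosen for illustration:

```python
METADATA_BYTES_PER_RECORD = 16  # metadata cost per record in the Hadoop 1.0 buffer


def optimal_record_percent(map_output_bytes: int, map_output_records: int) -> float:
    """io.sort.record.percent = 16 / (16 + R), with R the average record size."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    avg_record_bytes = map_output_bytes / map_output_records
    return METADATA_BYTES_PER_RECORD / (METADATA_BYTES_PER_RECORD + avg_record_bytes)


# 100-byte records need far more metadata space than the default allows.
print(round(optimal_record_percent(100_000, 1_000), 3))  # 0.138
# The default of 0.05 corresponds to records of roughly 300 bytes.
print(optimal_record_percent(304_000, 1_000))            # 0.05
```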

Tuning Example (terasort)

 • Map input size = output size
     – 128MB block = 1,342,177 records, each 100
       bytes
     – metadata: 16 * 1342177 = 20.5MB
 • io.sort.mb
    – 128MB data + 20.5MB meta = 148.5MB
 • io.sort.record.percent
    – 16/(16+100)=0.138
 • io.sort.spill.percent = 1.0
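
The arithmetic of this example can be reproduced as a short calculation. The sketch below uses a hypothetical helper name, treats 1MB as 2^20 bytes, and rounds the buffer size up because io.sort.mb is an integer setting.

```python
import math

MIB = 2**20
METADATA_BYTES_PER_RECORD = 16


def size_sort_buffer(block_bytes: int, record_bytes: int):
    """Return (records, metadata MB, io.sort.mb, io.sort.record.percent)."""
    records = block_bytes // record_bytes
    metadata_mb = records * METADATA_BYTES_PER_RECORD / MIB
    total_mb = block_bytes / MIB + metadata_mb
    record_percent = METADATA_BYTES_PER_RECORD / (METADATA_BYTES_PER_RECORD + record_bytes)
    return records, metadata_mb, math.ceil(total_mb), record_percent


records, metadata_mb, io_sort_mb, record_percent = size_sort_buffer(128 * MIB, 100)
print(records)                   # 1342177
print(round(metadata_mb, 1))     # 20.5
print(io_sort_mb)                # 149
print(round(record_percent, 3))  # 0.138
```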

More tips on spill tuning
 • Biggest win is going from 2 spills to 1 spill
     – 3 spills is approximately the same speed as 2 spills
       (same IO amplification)
 • Calculate if it’s even possible, given your heap
   size
     – io.sort.mb has to fit within your Java heap (plus
       whatever RAM your Mapper needs, plus ~30% for
       overhead)
 • Only bother if this is the bottleneck!
     – Look at map task logs: if the merge step at the end is
       taking a fraction of a second, not worth it!
     – Typically most impact on jobs with big shuffle
       (sort/dedup)
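
The "is it even possible" check can be sketched as a quick calculation. This is one reading of the rule of thumb above, not a Hadoop formula: the function names are hypothetical, and applying the ~30% overhead to the sum of the sort buffer and the mapper's own memory is an assumption.

```python
def min_heap_mb(io_sort_mb: float, mapper_mb: float, overhead_fraction: float = 0.30) -> float:
    """Heap needed for the sort buffer plus the mapper's memory, plus overhead."""
    return (io_sort_mb + mapper_mb) * (1.0 + overhead_fraction)


def fits_in_heap(heap_mb: float, io_sort_mb: float, mapper_mb: float) -> bool:
    """True when the configured heap covers the estimate from min_heap_mb."""
    return min_heap_mb(io_sort_mb, mapper_mb) <= heap_mb


# Hypothetical numbers: a 149MB sort buffer and a mapper that needs about 50MB.
print(fits_in_heap(200, 149, 50))  # False
print(fits_in_heap(512, 149, 50))  # True
```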


Reducer fetch tuning

 • Reducers fetch map output via HTTP
 • Tuning parameters:
     – Server side: tasktracker.http.threads
     – Client side:
      mapred.reduce.parallel.copies
 • Turns out this is not so interesting
     – follow the best practices from Hadoop:
       Definitive Guide


Improving fetch bottlenecks

 • Reduce intermediate data
     – Implement a Combiner: less data transfers faster
     – Enable intermediate compression: Snappy is
       easy to enable; trades off some CPU for less
       IO/network
 • Double-check for network issues
     – Frame errors, NICs auto-negotiated to 100mbit,
       etc: one or two slow hosts can bottleneck a job
     – Tell-tale sign: all maps are done, and reducers sit
       in fetch stage for many minutes (look at logs)
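
To illustrate why a Combiner reduces the data that reducers have to fetch, here is a minimal sketch in plain Python rather than the Hadoop API. The function name is hypothetical, and it assumes a sum-style aggregation, which is safe to apply early because addition is associative and commutative.

```python
from collections import Counter


def combine(map_output):
    """Pre-aggregate (key, count) pairs on the map side before the shuffle."""
    totals = Counter()
    for key, count in map_output:
        totals[key] += count
    return sorted(totals.items())


# Hypothetical output of one map task in a word-count style job.
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]
combined = combine(map_output)
print(combined)                         # [('a', 3), ('b', 1)]
print(len(map_output), len(combined))   # 4 2
```

The reducer sees the same totals either way; only the number of records crossing the network changes.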


Reducer merge (Hadoop 1.0)

    • Remote map outputs are fetched via HTTP
      – Fits in RAM? Yes: fetch to RAM (RAMManager)
      – No: fetch to disk (IFiles on local disk)
    1. Data accumulated in RAM is merged to disk files
       (RAM-to-disk merges)
    2. If too many disk files accumulate, they are re-merged
       (disk-to-disk merges)
    3. Segments from RAM and disk are merged (merged iterator)
       into the reducer code (Reduce Task)

Reducer merge triggers
 • RAMManager
     – Total buffer size:
       mapred.job.shuffle.input.buffer.percent
       (default 0.70, percentage of reducer heapsize)
 • Mem-to-disk merge triggers:
     – RAMManager is
       mapred.job.shuffle.merge.percent % full
       (default 0.66)
     – Or mapred.inmem.merge.threshold segments
       accumulated (default 1000)
 • Disk-to-disk merge
     – io.sort.factor on-disk segments pile up (fairly rare)
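
The two mem-to-disk triggers can be expressed as one predicate. The sketch below is a simplified model of the conditions listed above, not the Hadoop implementation: the function name and argument names are hypothetical, and the defaults are the ones quoted on this slide. A threshold of 0 is treated as "no segment-count trigger".

```python
MIB = 2**20


def should_merge_to_disk(bytes_in_ram: int, segments_in_ram: int, heap_bytes: int,
                         input_buffer_percent: float = 0.70,
                         merge_percent: float = 0.66,
                         inmem_merge_threshold: int = 1000) -> bool:
    """True when either in-memory merge trigger has been reached."""
    shuffle_buffer = heap_bytes * input_buffer_percent
    by_size = bytes_in_ram >= shuffle_buffer * merge_percent
    by_count = inmem_merge_threshold > 0 and segments_in_ram >= inmem_merge_threshold
    return by_size or by_count


heap = 1024 * MIB  # hypothetical 1GB reducer heap
print(should_merge_to_disk(500 * MIB, 10, heap))    # True: buffer is over 66% full
print(should_merge_to_disk(100 * MIB, 10, heap))    # False
print(should_merge_to_disk(100 * MIB, 1000, heap))  # True: 1000 segments accumulated
```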



Final merge phase

 • MR assumes that reducer code needs the
   full heap worth of RAM
     – Spills all in-RAM segments before running
       user code to free memory
 • This isn’t true if your reducer is simple
     – eg sort, simple aggregation, etc with no state
 • Configure
     mapred.job.reduce.input.buffer.percent to
     0.70 to keep reducer input data in RAM


Reducer merge counters

 • FILE: number of bytes read/written
     – Ideally close to 0 if you can fit in RAM
 • Spilled records:
     – Ideally close to 0. If significantly more than
       reduce input records, job is hitting a multi-
       pass merge which is quite expensive




Tuning reducer merge

 • Configure
     mapred.job.reduce.input.buffer.percent
   to 0.70 to keep data in RAM if you don’t
   have any state in reducer
 • Experiment with setting
   mapred.inmem.merge.threshold to 0 to
   avoid spills
 • Hadoop 2.0: experiment with
     mapreduce.reduce.merge.memtomem.enabled


Rules of thumb for # maps/reduces

 • Aim for map tasks running 1-3 minutes each
     – Too small: wasted startup overhead, less efficient
       shuffle
     – Too big: not enough parallelism, harder to share
       cluster
 • Reduce task count:
     – Large reduce phase: base on cluster slot count (a
       few GB per reducer)
     – Small reduce phase: fewer reducers will result in
       more efficient shuffle phase


Tuning Java code for MR
 • Follow general Java best practices
     – String parsing and formatting is slow
     – Guard debug statements with isDebugEnabled()
     – StringBuffer.append vs repeated string concatenation
 • For CPU-intensive jobs, make a test
   harness/benchmark outside MR
     – Then use your favorite profiler
 • Check for GC overhead: -XX:+PrintGCDetails
   -verbose:gc
 • Easiest profiler: add -Xprof to
   mapred.child.java.opts, then look at
   stdout task log

Other tips for fast MR code

 • Use the most compact and efficient data
   formats
     – LongWritable is way faster than parsing text
     – BytesWritable instead of Text for SHA1
       hashes/dedup
     – Avro/Thrift/Protobuf for complex data, not JSON!
 • Write a Combiner and RawComparator
 • Enable intermediate compression
   (Snappy/LZO)
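
The size difference between text and binary encodings is easy to demonstrate. The sketch below compares payload sizes only, in plain Python rather than with Hadoop's Writable classes; the length prefixes that Text and BytesWritable add are not modeled, and the sample values are made up.

```python
import hashlib
import struct

value = 1234567890123

# A long stored as decimal text versus a fixed 8-byte big-endian integer.
as_text = str(value).encode("ascii")
as_long = struct.pack(">q", value)
print(len(as_text), len(as_long))  # 13 8

# The binary form round-trips without any string parsing.
print(struct.unpack(">q", as_long)[0] == value)  # True

# A SHA1 hash stored as a hex string versus its raw bytes.
digest = hashlib.sha1(b"example record")
hex_bytes = digest.hexdigest().encode("ascii")
raw_bytes = digest.digest()
print(len(hex_bytes), len(raw_bytes))  # 40 20
```

Halving the key size for a dedup job halves the key bytes that are sorted, spilled, and shuffled.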

Summary

 • Understanding MR internals helps understand
   configurations and tuning
 • Focus your tuning effort on things that are
   bottlenecks, following a scientific approach
 • Don’t forget that you can always just add nodes!
     – Spending 1 month of engineer time to make your job
       20% faster is not worth it if you have a 10 node
       cluster!
 • We’re working on simplifying this where we can,
   but deep understanding will always allow more
   efficient jobs


Questions?

    @tlipcon
todd@cloudera.com

Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Recently uploaded (20)

TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Hadoop Summit 2012 | Optimizing MapReduce Job Performance

  • 1. June 14, 2012 Optimizing MapReduce Job Performance Todd Lipcon [@tlipcon]
  • 2. Introductions • Software Engineer at Cloudera since 2009 • Committer and PMC member on HDFS, MapReduce, and HBase • Spend lots of time looking at full stack performance • This talk is to help you develop faster jobs – If you want to hear about how we made Hadoop faster, see my Hadoop World 2011 talk on cloudera.com 2 ©2011 Cloudera, Inc. All Rights Reserved.
  • 3. Aspects of Performance • Algorithmic performance – big-O, join strategies, data structures, asymptotes • Physical performance – Hardware (disks, CPUs, etc) • Implementation performance – Efficiency of code, avoiding extra work – Make good use of available physical perf
  • 4. Performance fundamentals • You can’t tune what you don’t understand – MR’s strength as a framework is its black-box nature – To get optimal performance, you have to understand the internals • This presentation: understanding the black box
  • 5. Performance fundamentals (2) • You can’t improve what you can’t measure – Ganglia/Cacti/Cloudera Manager/etc a must – Top 4 metrics: CPU, Memory, Disk, Network – MR job metrics: slot-seconds, CPU-seconds, task wall-clocks, and I/O • Before you start: run jobs, gather data
  • 6. Graphing bottlenecks [Figure: cluster monitoring graphs with annotations] • This job might be CPU-bound in map phase • Most jobs not CPU-bound • Plenty of free RAM, perhaps can make better use of it? • Fairly flat-topped network – bottleneck?
  • 7. Performance tuning cycle • Run job → Identify bottleneck (graphs, job counters, job logs, profiler results) → Address bottleneck (tune configs, improve code, rethink algos) → run job again • In order to understand these metrics and make changes, you need to understand MR internals.
  • 8. MR from 10,000 feet • InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 9. MR from 10,000 feet • InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 10. Map-side sort/spill overview • Goal: when complete, map task outputs one sorted file • What happens when you call OutputCollector.collect()? – 1. In-memory buffer (MapOutputBuffer) holds serialized, unsorted key-values passed to .collect(K,V) – 2. Output buffer fills up. Contents sorted, partitioned and spilled to disk as IFiles – 3. Map task finishes. All IFiles merged to single IFile per task (map-side merge)
  • 11. Zooming further: MapOutputBuffer (Hadoop 1.0) • io.sort.mb is the total buffer size, split into: – kvoffsets and kvindices (metadata), io.sort.record.percent * io.sort.mb: 12 bytes/rec for (Partition, KOff, VOff) per record, plus 4 bytes/rec for 1 indirect-sort index per record – kvbuffer, (1-io.sort.record.percent) * io.sort.mb: R bytes/rec of raw, serialized (Key, Val) pairs
  • 12. MapOutputBuffer spill behavior • Memory is limited: must spill – If either of the kvbuffer or the metadata buffers fill up, “spill” to disk – In fact, we spill before it’s full (in another thread): configure io.sort.spill.percent • Performance impact – If we spill more than one time, we must re-read and re-write all data: 3x the IO! – #1 goal for map task optimization: spill once!
  • 13. Spill counters on map tasks • Ratio of Spilled Records vs Map Output Records – if unequal, then you are doing more than one spill • FILE: Number of bytes read/written – get a sense of I/O amplification due to spilling
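The two counter checks on slide 13 can be sketched as small helpers. This is a minimal illustration, not part of the talk: the function names and the sample counter values are hypothetical, and the byte-based indicator assumes map output is uncompressed so that the FILE byte counters and the Map Output Bytes counter are comparable.

```python
def spill_ratio(spilled_records: int, map_output_records: int) -> float:
    """Spilled Records divided by Map Output Records for one map task."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


def io_amplification(file_bytes_read: int, file_bytes_written: int,
                     map_output_bytes: int) -> float:
    """Local-disk bytes moved per byte of map output (rough indicator only)."""
    if map_output_bytes <= 0:
        raise ValueError("map_output_bytes must be positive")
    return (file_bytes_read + file_bytes_written) / map_output_bytes


# Hypothetical counter values, not taken from a real job.
ratio = spill_ratio(spilled_records=2_000_000, map_output_records=1_000_000)
if ratio > 1.0:
    print(f"spill ratio {ratio:.1f}: this task spilled more than once")
else:
    print(f"spill ratio {ratio:.1f}: this task spilled once")
```

A ratio of 1.0 is the target described on slide 12; anything higher means records were written to disk more than once.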
  • 14. Spill logs on map tasks
      2012-06-04 11:52:21,445 INFO MapTask: Spilling map output: record full = true
      2012-06-04 11:52:21,445 INFO MapTask: bufstart = 0; bufend = 60030900; bufvoid = 228117712
      2012-06-04 11:52:21,445 INFO MapTask: kvstart = 0; kvend = 600309; length = 750387
      2012-06-04 11:52:24,320 INFO MapTask: Finished spill 0
      2012-06-04 11:52:26,117 INFO MapTask: Spilling map output: record full = true
      2012-06-04 11:52:26,118 INFO MapTask: bufstart = 60030900; bufend = 120061700; bufvoid = 228117712
      2012-06-04 11:52:26,118 INFO MapTask: kvstart = 600309; kvend = 450230; length = 750387
      2012-06-04 11:52:26,666 INFO MapTask: Starting flush of map output
      2012-06-04 11:52:28,272 INFO MapTask: Finished spill 1
      2012-06-04 11:52:29,105 INFO MapTask: Finished spill 2
    – record full = true: indicates that the metadata buffers filled up before the data buffer
    – 3 spills total! Maybe we can do better?
  • 15. Tuning to reduce spills • Parameters: – io.sort.mb: total buffer space – io.sort.record.percent: proportion between metadata buffers and key/value data – io.sort.spill.percent: threshold at which spill is triggered – Total map output generated: can you use more compact serialization? • Optimal settings depend on your data and available RAM!
  • 16. Setting io.sort.record.percent • Common mistake: metadata buffers fill up way before kvdata buffer • Optimal setting: – io.sort.record.percent = 16/(16 + R) – R = average record size: divide “Map Output Bytes” counter by “Map Output Records” counter • Default (0.05) is usually too low (optimal for ~300-byte records) • Hadoop 2.0: this is no longer necessary! – see MAPREDUCE-64 for gory details
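The formula on slide 16 can be evaluated directly from the two job counters. The sketch below is illustrative only: the function names are hypothetical, and the 16 bytes of metadata per record is the 12 + 4 bytes shown on slide 11.

```python
METADATA_BYTES_PER_RECORD = 16  # 12 + 4 bytes per record, per slide 11


def average_record_size(map_output_bytes: int, map_output_records: int) -> float:
    """R: the Map Output Bytes counter divided by the Map Output Records counter."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return map_output_bytes / map_output_records


def optimal_record_percent(avg_record_bytes: float) -> float:
    """io.sort.record.percent = 16 / (16 + R)."""
    return METADATA_BYTES_PER_RECORD / (METADATA_BYTES_PER_RECORD + avg_record_bytes)


# Hypothetical counters: 1,000,000 records totalling 100,000,000 bytes (R = 100).
r = average_record_size(100_000_000, 1_000_000)
print(f"R = {r:.0f} bytes, io.sort.record.percent = {optimal_record_percent(r):.3f}")
```

Solving 16/(16 + R) = 0.05 gives R = 304 bytes, which is where the slide's remark about the default suiting ~300-byte records comes from.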
  • 17. Tuning Example (terasort) • Map input size = output size – 128MB block = 1,342,177 records, each 100 bytes – metadata: 16 * 1342177 = 20.9MB • io.sort.mb – 128MB data + 20.9MB meta = 148.9MB • io.sort.record.percent – 16/(16+100)=0.138 • io.sort.spill.percent = 1.0
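The terasort arithmetic on slide 17 can be reproduced as a short calculation. This is a sketch under stated assumptions: the helper name and the returned dictionary keys are hypothetical, "MB" is treated as 2^20 bytes throughout, and io.sort.mb is rounded up to a whole number. With those units the metadata comes to about 20.5 MB and the total to about 148.5 MB, slightly below the slide's rounded 20.9MB and 148.9MB.

```python
import math

MIB = 2 ** 20
METADATA_BYTES_PER_RECORD = 16  # per slide 11


def single_spill_plan(map_output_bytes: int, record_bytes: int) -> dict:
    """Size the sort buffer so one map task's whole output fits before spilling."""
    records = map_output_bytes // record_bytes
    metadata_bytes = METADATA_BYTES_PER_RECORD * records
    total_bytes = map_output_bytes + metadata_bytes
    return {
        "records": records,
        "metadata_mb": metadata_bytes / MIB,
        "io_sort_mb": math.ceil(total_bytes / MIB),
        # Share of the buffer used by metadata; matches 16 / (16 + R).
        "io_sort_record_percent": metadata_bytes / total_bytes,
        # Spill only when the buffer is completely full.
        "io_sort_spill_percent": 1.0,
    }


plan = single_spill_plan(map_output_bytes=128 * MIB, record_bytes=100)
for key, value in plan.items():
    print(key, value)
```

Computing the record percentage as metadata bytes over total bytes shows why the 16/(16 + R) rule works: it is simply the fraction of the buffer that the metadata needs when both regions fill at the same time.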
  • 18. More tips on spill tuning • Biggest win is going from 2 spills to 1 spill – 3 spills is approximately the same speed as 2 spills (same IO amplification) • Calculate if it’s even possible, given your heap size – io.sort.mb has to fit within your Java heap (plus whatever RAM your Mapper needs, plus ~30% for overhead) • Only bother if this is the bottleneck! – Look at map task logs: if the merge step at the end is taking a fraction of a second, not worth it! – Typically most impact on jobs with big shuffle (sort/dedup)
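The "calculate if it's even possible" step on slide 18 might be sketched as below. The slide does not give a formula, so this is one assumed reading of it: the heap must hold io.sort.mb plus the Mapper's own memory, with roughly 30% added on top for overhead. The function names and the sample heap sizes are hypothetical.

```python
def max_io_sort_mb(heap_mb: float, mapper_mb: float, overhead: float = 0.30) -> int:
    """Largest io.sort.mb satisfying (io.sort.mb + mapper_mb) * (1 + overhead) <= heap_mb.

    Assumed model, not a formula from the talk.
    """
    usable_mb = heap_mb / (1.0 + overhead) - mapper_mb
    return max(0, int(usable_mb))


def single_spill_feasible(needed_io_sort_mb: int, heap_mb: float, mapper_mb: float) -> bool:
    """True if the buffer needed for a single spill fits in the task heap."""
    return needed_io_sort_mb <= max_io_sort_mb(heap_mb, mapper_mb)


# Hypothetical task heaps; 149 MB is the terasort buffer size from slide 17, rounded up.
for heap_mb in (200, 1024):
    ok = single_spill_feasible(149, heap_mb=heap_mb, mapper_mb=50)
    print(f"heap {heap_mb} MB -> single spill feasible: {ok}")
```

If the check fails, the options from the talk are a larger task heap, more compact serialization, or accepting more than one spill.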
  • 19. MR from 10,000 feet • InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 20. Reducer fetch tuning • Reducers fetch map output via HTTP • Tuning parameters: – Server side: tasktracker.http.threads – Client side: mapred.reduce.parallel.copies • Turns out this is not so interesting – follow the best practices from Hadoop: The Definitive Guide
  • 21. Improving fetch bottlenecks • Reduce intermediate data – Implement a Combiner: less data transfers faster – Enable intermediate compression: Snappy is easy to enable; trades off some CPU for less IO/network • Double-check for network issues – Frame errors, NICs auto-negotiated to 100mbit, etc: one or two slow hosts can bottleneck a job – Tell-tale sign: all maps are done, and reducers sit in fetch stage for many minutes (look at logs)
  • 22. MR from 10,000 feet • InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 23. Reducer merge (Hadoop 1.0) • Remote map outputs are fetched via HTTP – Fits in RAM? Yes: fetch to RAM (RAMManager) – No: fetch to disk (IFiles on local disk) • 1. Data accumulated in RAM is merged to disk files (RAM-to-disk merges) • 2. If too many disk files accumulate, they are re-merged (disk-to-disk merges) • 3. Segments from RAM and disk are merged into the reducer code (merged iterator feeding the Reduce Task)
  • 24. Reducer merge triggers • RAMManager – Total buffer size: mapred.job.shuffle.input.buffer.percent (default 0.70, percentage of reducer heapsize) • Mem-to-disk merge triggers: – RAMManager is mapred.job.shuffle.merge.percent % full (default 0.66) – Or mapred.inmem.merge.threshold segments accumulated (default 1000) • Disk-to-disk merge – io.sort.factor on-disk segments pile up (fairly rare)
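The triggers on slide 24 translate into concrete sizes once a reducer heap is chosen. The sketch below is illustrative: the function names and the 1000 MB heap are hypothetical, while the default values are the ones quoted on the slide.

```python
def shuffle_buffer_mb(reducer_heap_mb: float, input_buffer_percent: float = 0.70) -> float:
    """RAMManager capacity: mapred.job.shuffle.input.buffer.percent of the reducer heap."""
    return reducer_heap_mb * input_buffer_percent


def mem_to_disk_merge_due(in_ram_mb: float, in_ram_segments: int,
                          reducer_heap_mb: float,
                          input_buffer_percent: float = 0.70,
                          merge_percent: float = 0.66,
                          inmem_merge_threshold: int = 1000) -> bool:
    """True once either mem-to-disk trigger from slide 24 is reached."""
    trigger_mb = shuffle_buffer_mb(reducer_heap_mb, input_buffer_percent) * merge_percent
    return in_ram_mb >= trigger_mb or in_ram_segments >= inmem_merge_threshold


# Hypothetical reducer with a 1000 MB heap.
heap_mb = 1000
buffer_mb = shuffle_buffer_mb(heap_mb)
print(f"shuffle buffer: {buffer_mb:.0f} MB")
print(f"size trigger:   {buffer_mb * 0.66:.0f} MB")
print(mem_to_disk_merge_due(in_ram_mb=300, in_ram_segments=200, reducer_heap_mb=heap_mb))
print(mem_to_disk_merge_due(in_ram_mb=300, in_ram_segments=1000, reducer_heap_mb=heap_mb))
```

With the defaults, a merge to disk starts once fetched segments occupy roughly 46% of the reducer heap (0.70 * 0.66), or earlier if many small segments accumulate.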
  • 25. Final merge phase • MR assumes that reducer code needs the full heap worth of RAM – Spills all in-RAM segments before running user code to free memory • This isn’t true if your reducer is simple – e.g. sort, simple aggregation, etc with no state • Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep reducer input data in RAM
  • 26. Reducer merge counters • FILE: number of bytes read/written – Ideally close to 0 if you can fit in RAM • Spilled records: – Ideally close to 0. If significantly more than reduce input records, job is hitting a multi-pass merge which is quite expensive
  • 27. Tuning reducer merge • Configure mapred.job.reduce.input.buffer.percent to 0.70 to keep data in RAM if you don’t have any state in reducer • Experiment with setting mapred.inmem.merge.threshold to 0 to avoid spills • Hadoop 2.0: experiment with mapreduce.reduce.merge.memtomem.enabled
  • 28. Rules of thumb for # maps/reduces • Aim for map tasks running 1-3 minutes each – Too small: wasted startup overhead, less efficient shuffle – Too big: not enough parallelism, harder to share cluster • Reduce task count: – Large reduce phase: base on cluster slot count (a few GB per reducer) – Small reduce phase: fewer reducers will result in more efficient shuffle phase
  • 29. MR from 10,000 feet • InputFormat → Map Task → Sort/Spill → Fetch → Merge → Reduce Task → OutputFormat
  • 30. Tuning Java code for MR • Follow general Java best practices – String parsing and formatting is slow – Guard debug statements with isDebugEnabled() – StringBuffer.append vs repeated string concatenation • For CPU-intensive jobs, make a test harness/benchmark outside MR – Then use your favorite profiler • Check for GC overhead: -XX:+PrintGCDetails -verbose:gc • Easiest profiler: add -Xprof to mapred.child.java.opts – then look at stdout task log
  • 31. Other tips for fast MR code • Use the most compact and efficient data formats – LongWritable is way faster than parsing text – BytesWritable instead of Text for SHA1 hashes/dedup – Avro/Thrift/Protobuf for complex data, not JSON! • Write a Combiner and RawComparator • Enable intermediate compression (Snappy/LZO)
  • 32. Summary • Understanding MR internals helps understand configurations and tuning • Focus your tuning effort on things that are bottlenecks, following a scientific approach • Don’t forget that you can always just add nodes! – Spending 1 month of engineer time to make your job 20% faster is not worth it if you have a 10 node cluster! • We’re working on simplifying this where we can, but deep understanding will always allow more efficient jobs
  • 33. Questions? @tlipcon todd@cloudera.com