Hadoop tuning

Published in: Technology

    1. Your natural partner to develop innovative solutions. Nokia Institute of Technology. Nokia Internal Use Only.
    2. Agenda
       • MapReduce Summarization Patterns
       • MapReduce Coding Best Practices
       • Ctrending MR Performance Evaluation
         • Ctrending MR Execution Summary
         • Code Profiling
         • Profiling Results
         • Code Tuning
         • HBase Configuration Tuning
         • Tuning Results
         • Refactoring Proposal
    3. MapReduce Summarization Patterns
       • Numerical Summarizations
         • General counting of data set records
         • Groups records by a custom key and calculates numerical values per group
         • Known uses: word count, record count, min/max, average/median/standard deviation
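The numerical summarization pattern can be sketched in plain Python. This is an in-memory illustration of the map, group, and reduce steps, not Hadoop code; the function names and sample records are hypothetical.

```python
from collections import defaultdict

def map_record(record):
    """Mapper: emit (group key, numeric value) for one input record."""
    key, value = record
    yield key, value

def reduce_group(key, values):
    """Reducer: compute the numerical summary for one group."""
    values = list(values)
    return key, {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
    }

def run_job(records):
    groups = defaultdict(list)  # stands in for the shuffle and sort phase
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_group(k, v) for k, v in sorted(groups.items()))

# Hypothetical input: (custom key, numeric value) pairs.
summary = run_job([("a", 3), ("b", 10), ("a", 5), ("a", 1)])
```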
    4. MapReduce Summarization Patterns
       • Inverted Index
         • Indexes a large data set by keyword
         • The Mapper emits keyword/ID pairs and the framework handles most of the work
         • May use IdentityReducer
         • Should benefit from a Partitioner for load balancing
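A minimal sketch of the inverted index pattern, again in plain Python rather than Hadoop. The mapper emits keyword/ID pairs and the grouping step (done by the framework in a real job) does the rest; the documents and function names are made up for illustration.

```python
from collections import defaultdict

def map_document(doc_id, text):
    """Mapper: emit one (keyword, document ID) pair per distinct keyword."""
    for keyword in set(text.lower().split()):
        yield keyword, doc_id

def build_index(documents):
    """Group emitted pairs by keyword, as shuffle and sort would."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for keyword, emitted_id in map_document(doc_id, text):
            index[keyword].append(emitted_id)
    # An identity-style reducer only needs to write out each group.
    return {keyword: sorted(ids) for keyword, ids in index.items()}

index = build_index({1: "Hadoop tuning", 2: "HBase tuning tips"})
```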
    5. MapReduce Summarization Patterns
       • Counting with Counters
         • Leverages the MapReduce framework's counters
         • Counters are stored in memory locally on each Mapper, then aggregated by the framework
         • Performs better, but suits only a small number of counters (no more than a few tens)
       • Known uses: counting records, counting a small number of groups, summations
    6. MapReduce Coding Best Practices
       • Define Output Values
         • Create custom Writable classes to use as Mapper output
         • This gives cleaner Mapper code and avoids String parsing on the Reducer side
       • Avoid Local Object Creation
         • The map and reduce methods are invoked in very large loops
         • Creating local objects inside map or reduce allocates a huge number of objects in the Eden space of the JVM heap's Young Generation
         • Reuse global instances to decrease Young GC activity
       • Use Combiners on Counting Summarizations
         • Combiners reduce bandwidth consumption by aggregating locally on each mapper node, before the mapper output goes through the shuffle and sort phase and is made available to reducers
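To make the combiner point concrete, the sketch below counts how many key/value pairs would cross the network with and without local aggregation. It is a plain Python simulation with hypothetical input splits, not Hadoop code.

```python
from collections import Counter

def map_split(lines):
    """Mapper: emit (word, 1) for every word in one input split."""
    for line in lines:
        for word in line.split():
            yield word, 1

def combine(pairs):
    """Combiner: sum counts locally, before anything is shuffled."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

def reduce_all(shuffled):
    """Reducer: sum the counts that arrive for each word."""
    totals = Counter()
    for word, count in shuffled:
        totals[word] += count
    return dict(totals)

# Two hypothetical input splits, each handled by its own mapper.
splits = [["a b a", "a c"], ["b b", "a"]]
without_combiner = [pair for s in splits for pair in map_split(s)]
with_combiner = [pair for s in splits for pair in combine(map_split(s))]
```

Both lists reduce to the same totals, but the combined one carries fewer pairs into the shuffle.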
    7. Ctrending MR Performance Evaluation
       • Ctrending MR Execution Summary
         • Total MR jobs running: 8
         • Average number of tweets processed: 2.2 million
         • Tweets identified as music related: 10.5%
         • Total execution time: 2 hours and 20 minutes
         • Slowest MapReduce jobs:
           • Tweets Counter: 46 minutes
           • Nokia Entity Id Join: 1 hour and 10 minutes
    8. Ctrending MR Code Profiling
       • Mainly applied to the Nokia Id Join Mapper
       • Added MapReduce framework Counters to collect execution time metrics
       • Also used Counters to total the entity IDs found in the Nokia Id Join Mapper
       • Needed to create static fields in the search strategy implementations to collect the execution time metrics
    9. Ctrending MR Profiling Results

       Counter                               Value
       TOTAL_ARTISTS_NMS_FOUND                  77
       TOTAL_ARTISTS_NOT_IN_CACHE            1,904
       TOTAL_CANDIDATES_FORMATTING_TIME     67,873
       TOTAL_HBASE_GET_TIME                262,647
       TOTAL_NORMALIZATION_TIME             22,452
       TOTAL_SEARCH_ARTIST_TIME            611,066
       TOTAL_SEARCH_CALCULATION_TIME         5,605
       TOTAL_SEARCH_NMS_TIME             3,740,552
       TOTAL_SEARCH_TIME                 4,098,270
       TOTAL_SEARCH_TRACK_TIME           3,486,978
       TOTAL_TRACKS_NMS_FOUND                  145
       TOTAL_TRACKS_NOT_IN_CACHE             4,635
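The ratios that matter for tuning can be derived from these counters. The sketch below assumes the values on this slide pair with the counter names in order (the slide text is flattened, so that pairing is a reading of it), and it makes no assumption about the time unit, since only ratios are computed.

```python
# Counter values as read from the profiling slide (pairing is an assumption).
counters = {
    "TOTAL_ARTISTS_NMS_FOUND": 77,
    "TOTAL_ARTISTS_NOT_IN_CACHE": 1_904,
    "TOTAL_SEARCH_NMS_TIME": 3_740_552,
    "TOTAL_SEARCH_TIME": 4_098_270,
    "TOTAL_TRACKS_NMS_FOUND": 145,
    "TOTAL_TRACKS_NOT_IN_CACHE": 4_635,
}

# Share of the total search time spent in NMS search.
nms_time_share = counters["TOTAL_SEARCH_NMS_TIME"] / counters["TOTAL_SEARCH_TIME"]

# Fraction of cache misses that NMS search actually resolved.
artist_hit_rate = (counters["TOTAL_ARTISTS_NMS_FOUND"]
                   / counters["TOTAL_ARTISTS_NOT_IN_CACHE"])
track_hit_rate = (counters["TOTAL_TRACKS_NMS_FOUND"]
                  / counters["TOTAL_TRACKS_NOT_IN_CACHE"])

print(f"NMS share of search time: {nms_time_share:.1%}")
print(f"Artists found via NMS:    {artist_hit_rate:.1%}")
print(f"Tracks found via NMS:     {track_hit_rate:.1%}")
```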
    10. Ctrending MR Code Tuning
        • Tuning Tweets Count MapReduce
          • Applied IntSumReducer as combiner
          • Adjusted the HBase Scan to fetch and copy records in blocks of thousands, to optimize network usage between nodes
          • Also set blockCache to false, as this table is always read sequentially in a single pass
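The benefit of fetching rows in blocks can be estimated with simple arithmetic. HBase's client `Scan` class exposes `setCaching(int)` and `setCacheBlocks(boolean)`, which are the likely knobs behind these two changes, although the slide does not name them. The sketch below is a simplified model that only counts scanner round trips; the row count is illustrative and the function name is ours.

```python
import math

def scanner_round_trips(total_rows, rows_per_fetch):
    """Approximate number of fetches needed to stream total_rows.

    Simplified model: ignores the final empty fetch and region boundaries.
    """
    return math.ceil(total_rows / rows_per_fetch)

# Illustrative table size, borrowed from the average tweet count in the deck.
rows = 2_200_000
one_row_per_fetch = scanner_round_trips(rows, 1)
thousand_rows_per_fetch = scanner_round_trips(rows, 1000)
```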
    11. Ctrending MR Code Tuning
        • Tuning Entity Id Search MapReduce
          • Removed unnecessary split/indexOf calls
          • Removed redundant object creation from the map method
    12. Ctrending MR Code Tuning
        • Tuning Entity Id Search MapReduce
          • Profiling results show that NMS Search is the bottleneck
            • It accounts for more than 90% of all MapReduce execution time
          • They also show that NMS Search does not add enough value
            • It finds only 4% of the artist IDs not in cache
            • It finds only 3% of the track IDs not in cache
          • This drove the decision to remove NMS search by simply referencing the CustomCache ISearchStrategy implementation in the Mapper setup method
    13. HBase Configuration Tuning
        • The Artists and Tracks cache is an inverted index structure stored in HBase tables
        • These tables see a high level of random access to their records (Get operations) while the Entity Id Search MapReduce performs searches on the cache
        • Performance could be improved if the cache table blocks were kept in the RegionServer's memory
        • HBase provides a table-level configuration property that raises the priority of a table's blocks for being kept in the RegionServer's memory
    14. HBase Configuration Tuning
        • Additional configuration is required on the HBase RegionServer so that blocks can stay cached most of the time
          • hbase.regionserver.global.memstore.upperLimit -> maximum % of heap available for writes held in memstores before put operations are actually written to disk files
          • hbase.regionserver.global.memstore.lowerLimit -> minimum % of heap available for writes held in memstores; flush operations free memstore space until this limit is reached
          • hfile.block.cache.size -> % of heap used to store blocks in memory
    15. HBase Configuration Tuning
        • Most Ctrending HBase put operations are done in batch jobs (Twitter Crawler)
        • The music entities cache requires many Get operations while EntityIdSearchMR is executing
        • Simply marking the cache tables as in-memory does not work if there is not enough memory available
        • More memory can be made available to cache table blocks on RegionServers by decreasing the % of heap reserved for memstores and increasing the % reserved for the block cache
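The trade-off between memstore and block cache can be sanity-checked before editing the RegionServer configuration. The sketch below is a hypothetical helper, not an HBase API: the heap size and fractions are made-up examples, and the 0.8 combined ceiling is an assumption based on the free-heap check found in HBase releases we are aware of, so confirm it for your version.

```python
def plan_heap(heap_mb, memstore_upper, memstore_lower, block_cache,
              max_combined=0.8):
    """Return the MB each area may use, or raise if the fractions conflict.

    max_combined is an assumed ceiling for memstore plus block cache.
    """
    if not 0 < memstore_lower <= memstore_upper:
        raise ValueError("lowerLimit must be positive and not above upperLimit")
    if memstore_upper + block_cache > max_combined + 1e-9:
        raise ValueError("memstore + block cache leave too little free heap")
    return {
        "memstore_upper_mb": int(heap_mb * memstore_upper),
        "memstore_lower_mb": int(heap_mb * memstore_lower),
        "block_cache_mb": int(heap_mb * block_cache),
    }

# Hypothetical 4 GB RegionServer heap tuned for a read-heavy cache table:
# less heap for memstores, more for the block cache.
read_heavy = plan_heap(4096, memstore_upper=0.25, memstore_lower=0.2,
                       block_cache=0.5)
```

The three fractions map onto hbase.regionserver.global.memstore.upperLimit, hbase.regionserver.global.memstore.lowerLimit, and hfile.block.cache.size respectively.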
    16. Ctrending MR Tuning Results
        • TweetsCountMR
          • Total execution time before tuning: 46 minutes (average)
          • Total execution time after tuning: 20 minutes (average)
        • EntityIdSearchMR
          • Total execution time before tuning: 1 hour and 10 minutes (average)
          • Total execution time after tuning: 6 minutes (average)
        • CONCLUSION: Do not ever perform HTTP requests in MapReduce jobs again!!!
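The before and after averages translate into the following speedup factors. This is plain arithmetic on the figures reported on this slide; the function name is ours.

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the job runs after tuning."""
    return before_minutes / after_minutes

tweets_count_speedup = speedup(46, 20)       # TweetsCountMR
entity_id_search_speedup = speedup(70, 6)    # EntityIdSearchMR, 1 h 10 min = 70 min

print(f"TweetsCountMR:    {tweets_count_speedup:.1f}x faster")
print(f"EntityIdSearchMR: {entity_id_search_speedup:.1f}x faster")
```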
    17. Refactoring
        • Write a batch process that reads the generated rankings and sends requests to NMS for the music entities whose IDs were not found
          • Better to implement this as a standalone multi-threaded Java process than as a MapReduce job
          • As the input file is small (the filtered rank), Hadoop's default InputFormat implementations will not split it into many Map tasks
          • Unless a custom InputFormat is implemented, a MapReduce job for this will probably take a long time to execute, as it will end up with a single Map task requesting all unknown IDs from NMS
        • Optimize heap usage in the other MRs by avoiding object creation in map methods
        • Enhance code quality (and even performance) by defining OutputValues for the Trending MRs
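The shape of the proposed standalone process can be sketched with a thread pool. The deck proposes Java; this sketch uses Python's `concurrent.futures` to show the same idea. `lookup_entity` is a hypothetical stand-in for the NMS HTTP request (it performs no network call), and the entity names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def lookup_entity(name):
    """Hypothetical stand-in for the NMS request; returns a fake ID."""
    return "id-" + name.lower().replace(" ", "-")

def resolve_missing_ids(names, lookup=lookup_entity, workers=8):
    """Resolve IDs for all names concurrently, outside of MapReduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as the input names.
        return dict(zip(names, pool.map(lookup, names)))

# Entities from the filtered rank whose IDs were not found in the cache.
resolved = resolve_missing_ids(["Artist A", "Track B"])
```

Because the requests are I/O bound, a modest pool of threads on one machine can keep many lookups in flight, which a single Map task would otherwise issue one at a time.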
    18. References
        • Lars George, HBase: The Definitive Guide, O'Reilly
        • Donald Miner and Adam Shook, MapReduce Design Patterns
        • Hadoop official website: http://hadoop.apache.org/
