Hadoop testing workshop - July 2013
 

Presentation from the July 2013 workshop on how to test, monitor, and profile MapReduce jobs

    Presentation Transcript

    • Hadoop Testing Workshop. Ophir Cohen, Data Platform Leader, ophirc@liveperson.com. July 2013
    • Agenda 1. Connection Before Content 2. Testing Fundamentals 3. Unit Tests 4. Integration Tests 5. Try it out 6. Performance 7. Diagnostics
    • Why Testing? 1. Catch bugs early in the development cycle 2. Transparency about the current project status 3. Easier development and refactoring: immediate feedback 4. Pushes developers to deliver better, more stable code 5. Shorter development cycles
    • Why Automatic Testing? It isn't really a question, right?
    • Testing Fundamentals 1. Unit testing - functional verification of each 'unit' (method / class in Java) 2. Integration testing - verifies that the system works as a whole 3. Performance testing - tests the efficiency of the program. Depends on the code AND the cluster architecture 4. Diagnostics - the way to find problems in production. --> 1 + 2 should be done BEFORE production
    • Unit Tests Key Features 1. Simple (up to 10 lines) 2. Isolated (no DB connection, no cluster dependency etc...) 3. Deterministic - PASS or FAIL 4. Automated (of course) Why Unit Tests 1. Prevent regressions 2. Fast - no need for a full MR env 3. Help with refactoring and updates
    • Unit Tests - MR jobs Best Practices 1. Extract the tested code into an isolated method/class 2. Test your pure Java code, not the MR framework 3. Use the same package for tests MRUnit 1. Library for MR unit tests 2. Apache project 3. Supports testing of mappers, reducers and the full job (without a full cluster) 4. Supports counter testing (nice!)
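
The first two best practices can be illustrated with a minimal sketch. It assumes a word-count style job; the class `WordTokenizer` and its `tokenize` method are hypothetical names invented for this example and do not come from the workshop code. The idea is that the mapper only delegates to this helper, so the logic can be tested as plain Java without any Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the tokenizing logic a word-count mapper could delegate to.
// It has no Hadoop imports, so a plain unit test can exercise it without a cluster.
final class WordTokenizer {

    private WordTokenizer() {
    }

    // Splits a line on whitespace, lower-cases each token and drops empty tokens.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        if (line == null) {
            return words;
        }
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }
}
```

A mapper built this way would call `WordTokenizer.tokenize(...)` on each input value and emit one record per returned word, which leaves only thin glue code for MRUnit or integration tests to cover.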
    • Unit Tests - Examples Unit Tests Code Example
    • Integration Tests - background 1. Unit tests test each unit (Mapper/Reducer); integration tests test the units working together 2. Test the integration with the framework 3. Not limited by data volumes
    • Integration Tests - tips and tricks Tips and tricks 1. Use MiniMRCluster / MiniDFSCluster for tests 2. Use Linux 3. Make dev == production 4. Use data sampling: a. Random sampling b. Biased sampling 5. Apache BigTop (never try that) 6. Use Cloudera CDH
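
Random sampling (tip 4a) could be sketched as below. This is one possible approach, reservoir sampling, and not necessarily what the workshop used; the class name `ReservoirSampler` is hypothetical. It picks a fixed-size uniform sample in a single pass, so the full input never has to fit in memory. Biased sampling is not covered here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for building a small test input from a large data set.
final class ReservoirSampler {

    private ReservoirSampler() {
    }

    // Returns up to sampleSize records chosen uniformly at random from the input.
    // Passing a seeded Random keeps the sample reproducible between test runs.
    static <T> List<T> sample(Iterator<T> records, int sampleSize, Random random) {
        if (sampleSize < 0) {
            throw new IllegalArgumentException("sampleSize must not be negative");
        }
        List<T> reservoir = new ArrayList<>(sampleSize);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < sampleSize) {
                // Fill the reservoir with the first sampleSize records.
                reservoir.add(record);
            } else {
                // Pick a slot in [0, seen); the record is kept only if the slot
                // falls inside the reservoir, i.e. with probability sampleSize / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < sampleSize) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }
}
```

Feeding the lines of a production extract through `sample` with a fixed seed gives a small input file that stays the same on every run, which keeps integration tests deterministic.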
    • Let's play a bit 1. Check out the code: git clone https://github.com/ophchu/mapreduce-tutorials.git 2. Make sure you can run the mapper test 3. Complete the MRUnit tests for the reducer and the full job 4. Play with the MiniMRCluster/MiniDFSCluster test
    • Performance Profiling (at a glance...) 1. Profile your code 2. Measure and tune what matters to you 3. Benchmarking: micro and macro 4. Hadoop has a built-in profiler (e.g. using hprof)
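
As a rough illustration of the "micro" side of benchmarking, the sketch below times a single piece of code in isolation. The class `MicroBenchmark` is a hypothetical name, and this is a deliberately crude harness: a dedicated benchmarking tool accounts for JIT compilation and garbage collection far better, so treat the numbers as a first indication only.

```java
// Hypothetical, deliberately simple timing harness for one unit of work.
final class MicroBenchmark {

    private MicroBenchmark() {
    }

    // Runs the task warmupRuns times without timing it (to let the JIT settle),
    // then measuredRuns times with timing, and returns the average nanoseconds per run.
    static long averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / measuredRuns;
    }
}
```

A typical use would wrap the extracted map or reduce logic in a `Runnable` and compare the average before and after a change; macro benchmarks on the cluster remain necessary because they also capture I/O and shuffle costs.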
    • Cluster Performance
      1. Terasort test
         hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.1.2.jar teragen 1000 /user/dataint/terasort/input
         hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.1.2.jar terasort /user/dataint/terasort/input /user/dataint/terasort/output
      2. MRBench - MR benchmarking
         hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 2 -maps 10 -reduces 10 -inputLines 100 -inputType random
      3. NNBench - Name Node benchmarking
      4. TestDFSIO - write and read performance
    • Diagnostics 1. Check the web UI (http://your_server:50030/jobtracker.jsp): a. Nodes: how many up, how many down, check slots b. Jobs: logs, failures, exceptions c. Counters: as expected 2. Configuration: a. Check job conf (job.xml) b. Check env conf (http://your_server:50030/conf) 3. Job history (http://your_server:50030/jobhistory.jsp) 4. Log dirs: a. Job tracker (http://your_server:50030/logs/) b. Task trackers
    • Thanks ● ophchu@gmail.com ● @ophchu