Hadoop Testing
Workshop
Ophir Cohen
Data Platform Leader,
ophirc@liveperson.com
July 2013
Agenda
1. Connection Before Content
2. Testing Fundamental
3. Unit Tests
4. Integration Tests
5. Try it out
6. Performance...
Why Testing
1. Catch bugs early in the developing cycle
2. Transparency of current project status
3. Easy developing / ref...
Why Automatic Testing?
It isn't real question right?
Testing Fundamental
1. Unit testing - functional verification of each 'unit' (method /
class in Java)
2. Integration testi...
Unit Tests
Key Features
1. Simple (up to 10 lines)
2. Isolation (no DB connection, no cluster dependency etc...)
3. Determ...
Unit Tests - MR jobs
Best Practices
1. Extract the tested code into isolated method/class
2. Do not test MR framework but ...
Unit Tests - Examples
Unit Tests Code Example
Integration Tests - background
1. Unit tests test each unit (Mapper/Reducer), integration
test the integrated work
2. Test...
Integration Tests - tips and tricks
Tips and tricks
1. Use MiniMRCluster / MiniDFSCluster for tests
2. Use Linux
3. Make d...
Lets play a bit
1. Checkout the code:
git clone https://github.com/ophchu/mapreduce-tutorials.git
2. Make sure you manage ...
Performance
Profiling (at a glance...)
1. Profile your code
2. Measure and tune what's matters to you
3. Benchmarking: mic...
Cluster Performance
1. Terasort test
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4.
1.2.jar ter...
Diagnostics
1. Check web API (http://your_server:50030/jobtracker.jsp):
a. Nodes: how many up, how many down, check slots
...
Thanks
● ophchu@gmail.com
● @ophchu
Thanks
Upcoming SlideShare
Loading in...5
×

Hadoop testing workshop - july 2013

1,599

Published on

Presentation from the July's 2013 workshop on how to test, monitor and profile map reduce jobs

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,599
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
98
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Hadoop testing workshop - july 2013

  1. 1. Hadoop Testing Workshop Ophir Cohen Data Platform Leader, ophirc@liveperson.com July 2013
  2. 2. Agenda 1. Connection Before Content 2. Testing Fundamental 3. Unit Tests 4. Integration Tests 5. Try it out 6. Performance 7. Diagnostics
  3. 3. Why Testing 1. Catch bugs early in the developing cycle 2. Transparency of current project status 3. Easy developing / refactoring: immediate feedback 4. Push developer to provide better and stable code 5. Decrease developing cycle times
  4. 4. Why Automatic Testing? It isn't real question right?
  5. 5. Testing Fundamental 1. Unit testing - functional verification of each 'unit' (method / class in Java) 2. Integration testing - verifies that the system works as a whole 3. Performance testing - test the efficiency of the program. Deepened by code AND cluster architecture 4. Diagnostic - the way to find problems in production. --> 1 + 2 should be done BEFORE production
  6. 6. Unit Tests Key Features 1. Simple (up to 10 lines) 2. Isolation (no DB connection, no cluster dependency etc...) 3. Deterministics - PASS or FAIL 4. Automated (of course) Why Unit Tests 1. Prevent regression 2. Fast - no need of full MR env 3. Help in refactoring and updates
  7. 7. Unit Tests - MR jobs Best Practices 1. Extract the tested code into isolated method/class 2. Do not test MR framework but pure Java 3. Use the same package for tests MRUnit 1. Lib for MR unit tests 2. Apache project 3. Supports testing of mappers, reducers and full job (without full cluster) 4. Supports counters testing (nice!)
  8. 8. Unit Tests - Examples Unit Tests Code Example
  9. 9. Integration Tests - background 1. Unit tests test each unit (Mapper/Reducer), integration test the integrated work 2. Test the integration with the framework 3. Does not limited by data volumes
  10. 10. Integration Tests - tips and tricks Tips and tricks 1. Use MiniMRCluster / MiniDFSCluster for tests 2. Use Linux 3. Make dev == production 4. Use data sampling: a. Random sampling b. Biased sampling 5. Apache BigTop (never try that) 6. Use Cloudera CDH
  11. 11. Lets play a bit 1. Checkout the code: git clone https://github.com/ophchu/mapreduce-tutorials.git 2. Make sure you manage to run the mapper test 3. Complete the MRUnit tests for the reducer and full job 4. Play with the MiniMRCluster/MiniDFSCluster test
  12. 12. Performance Profiling (at a glance...) 1. Profile your code 2. Measure and tune what's matters to you 3. Benchmarking: micro and macro 4. Hadoop has a built-in profiler (e.g. using hprof)
  13. 13. Cluster Performance 1. Terasort test hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4. 1.2.jar teragen 1000 /user/dataint/terasort/input hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples-2.0.0-mr1-cdh4. 1.2.jar terasort /user/dataint/terasort/input /user/dataint/terasort/output 2. MRBench - MR benchmarking hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-test.jar mrbench -numRuns 2 -maps 10 -reduces 10 -inputLines 100 -inputType random 3. NNBench - Name Node benchmarking 4. TestDFSIO - write and read performance
  14. 14. Diagnostics 1. Check web API (http://your_server:50030/jobtracker.jsp): a. Nodes: how many up, how many down, check slots b. Jobs: logs, failures, exceptions c. Counters: expected 2. Configuration: a. check job conf (job.xml) b. Check env conf (http://your_server:50030/conf) 3. Jobs history (http://your_server:50030/jobhistory.jsp) 4. Log dirs: a. Job tracker (http://your_server:50030/logs/) b. Task trakcers
  15. 15. Thanks ● ophchu@gmail.com ● @ophchu Thanks
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×