Developing Unit Testable Software with Hadoop at Expedia

Unit testing is an established software engineering practice, yet it can be more challenging to apply in the Hadoop domain. This talk examines what testing options are available with some popular frameworks and discusses in detail how we test our Hive and Cascading applications at Hotels.com.

  1. Developing unit-testable software with Hadoop (HUG UK, Jan 13th 2015)
  2. 10^8 keywords, 10^10 web hits
  3. Big data + High stakes = Exciting
  4. Big data + High stakes + Complexity + Change + Unknowns = Fear
  5. Benefits of unit testing
     • Build confidence
     • Enable change
     • Describe behaviour
     • Accelerate development
  6. Tests need to be:
     • Simple and quick to write
     • Simple and quick to run
  7. Hadoop testing challenges
     • Framework modularization issues
     • Heavyweight execution engine
     • Availability of testing utilities
  8. Hadoop testing challenges
     • Framework modularization issues
     • Heavyweight execution engine
     • Availability of testing utilities
  9. Sequential modularization
  10. Modularization via encapsulation
  11. Hive modularity
  12. Cascading modularity
  13. Pig modularity
  14. Crunch modularity
  15. Mapreduce modularity
  16. Hadoop testing challenges
      • Framework modularization issues
      • Heavyweight execution engine
      • Availability of testing utilities
  17. Local execution
      Framework | Local engine
      Hive      | Local-mode
      Cascading | LocalFlowConnector
      Pig       | Local mode
      Crunch    | MemPipeline
  18. Hadoop testing challenges
      • Framework modularization issues
      • Heavyweight execution engine
      • Availability of testing utilities
  19. Helper libraries
      Framework | Library
      Hive      | HiveRunner
      Cascading | cascading-test, Plunger
      Pig       | PigUnit
      Crunch    | MemPipeline
  20. Example: Topic | Subtopic
  21. Hive example with HiveRunner [https://github.com/klarna/HiveRunner]
  22. Hive + HiveRunner: pros
      • Write/test Hive apps in the same environment
      • Seamless UDF development
  23. Hive + HiveRunner: cons
      • Slow execution
      • CSV data is hard to maintain
      • Assertions on CSV strings are brittle
      • Hadoop compatibility issues
  24. Cascading example with Plunger [https://github.com/HotelsDotCom/plunger]
  25. Cascading + Plunger: pros
      • Compact tests, well defined scope
      • Use standard Java tools
      • Fast
  26. Cascading + Plunger: cons
      • Some tools only appear to work
      • False sense of security
  27. Measuring coverage
      • Identify activated branches
      • No tools do this
  28. Conclusions
      • Testing possible with most frameworks
      • Efficacy largely influenced by framework
      • Tooling is immature
  29. We’re hiring
      • Java developers
      • Hadoop developers
  30. Questions?
  31. Attribution
      • https://flic.kr/p/7XQdXm - Chris Campbell - CC BY-NC 2.0
      • https://flic.kr/p/4hpX7j - Andrew_Writer - CC BY-NC-ND 2.0
      • http://bit.ly/1BVe8xH - Prokofiev - CC BY-SA 3.0
      Resources
      • HiveRunner: https://github.com/klarna/HiveRunner
      • Plunger: https://github.com/HotelsDotCom/plunger
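
The slides on modularization via encapsulation, together with the requirement that tests be simple and quick to write and run, suggest keeping business logic in plain Java that can be exercised without any Hadoop execution engine. The sketch below is only an illustration of that idea and is not taken from the talk: the class name `KeywordNormalizer` and its normalization rule are hypothetical, and the checks use plain Java rather than any of the helper libraries listed above.

```java
import java.util.Locale;

/**
 * Hypothetical example (not from the talk): business logic kept free of
 * Hadoop, Hive, and Cascading types so it can be tested in isolation.
 */
public final class KeywordNormalizer {

    private KeywordNormalizer() {
        // Utility class, not meant to be instantiated.
    }

    /** Trims, lower-cases, and collapses internal whitespace runs to one space. */
    public static String normalize(String rawKeyword) {
        if (rawKeyword == null) {
            return "";
        }
        return rawKeyword.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    /** Fails loudly when the actual value differs from the expected one. */
    private static void check(String actual, String expected) {
        if (!actual.equals(expected)) {
            throw new AssertionError(
                "expected '" + expected + "' but got '" + actual + "'");
        }
    }

    public static void main(String[] args) {
        // Checks run in milliseconds because no cluster or local engine is started.
        check(normalize("  Cheap   HOTELS "), "cheap hotels");
        check(normalize("london"), "london");
        check(normalize(null), "");
        System.out.println("all checks passed");
    }
}
```

Under this assumption, a Hive UDF or a Cascading function would be a thin wrapper that delegates to `normalize`, leaving only the wiring to be covered by slower tests that run on a local engine.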
