Hadoop - Simple. Scalable.

Published in: Technology

  1. Hadoop: Simple. Scalable.
  2. @markgunnels, mark@catamorphiclabs.com
  3. Java. Clojure. Ruby. Cloudera Certified.
  4. posscon.org: April 15, 16, and 17
  5. Agenda: Overview. Massively Large Data Sets and the problems therein. Distributed File System. MapReduce. Pig.
  6. Overview
  7. Doug Cutting: Genius
  8. Favorite Hadoop Story: New York Times
  9. 4 Terabytes of Source Articles.
  10. 24 Hours.
  11. 5.5 Terabytes of PDFs.
  12. Did it again.
  13. $240.
  14. Infoporn from Yahoo: 73 hours. 490 TB Shuffling. 280 TB Output. 4000 Nodes. 16 PB Disk Space. 32K Cores. 64 TB RAM.
  15. Hadoop solves...
  16. Analyzing Massively Large Datasets
  17. Two Problems: You have to distribute.
  18. Data Storage: Capacity has increased rapidly beyond read speeds. Datasets won't fit on one disk. Tolerate node failure.
  19. Data Analysis: Combine data from many machines. Tolerate node failure.
  20. How Hadoop solves these problems.
  21. Send Code to Data. Not Data to Code.
  22. Data Storage: HDFS
  23. Name Node. Data Nodes. Master - Slave Relationship.
  24. Shard massive files across multiple machines. MB, GB, and TB.
  25. Tolerant of Node Failure: Files replicated across at least 3 nodes.
  26. HDFS behaves like a normal file system. No true appends yet.
  27. Demonstration.
  28. Data Analysis: MapReduce
  29. Job Tracker. Task Nodes. Master - Slave Relationship.
  30. map
  31. Demonstration.
  32. pmap
  33. Demonstration.
  34. reduce
  35. Demonstration.
  36. (reduce (pmap))
  37. Demonstration.
  38. MapReduce: Java
  39. Nobody likes it. :-)
  40. MapReduce: Ruby. Python. Unix Utilities.
  41. MapReduce: Clojure
  42. Hadoop Ecosystem: Pigkeeper. Hive. Cascading.
  43. Pig
  44. HBase
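
Slides 30 to 36 build up from map to pmap to reduce, ending with (reduce (pmap)), but the live demonstrations are not part of this transcript. The following is a minimal sketch of that progression in Python rather than the Clojure the slides name. It is an illustration under stated assumptions, not the presenter's demo code: the word-count task, the sample lines, and the helper names `count_words`, `add`, and `pmap` are all hypothetical, and `pmap` here is a thread-pool stand-in for Clojure's function of the same name.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(line):
    """Map step (hypothetical example task): one line of text -> its word count."""
    return len(line.split())


def add(total, count):
    """Reduce step: fold one more count into the running total."""
    return total + count


def pmap(fn, items, workers=4):
    """Parallel map: a thread-based stand-in for Clojure's pmap (an assumption,
    not how Hadoop schedules work). Executor.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


# Made-up sample input; each line plays the role of one independent record.
lines = [
    "send code to data",
    "not data to code",
    "simple scalable",
]

mapped = list(map(count_words, lines))   # map: apply the function to each record in turn
parallel = pmap(count_words, lines)      # pmap: same results, computed concurrently
total = reduce(add, parallel, 0)         # (reduce (pmap ...)): combine the partial results
```

Because each call to `count_words` depends only on its own line, the map step can run on any worker in any order, which is the property that lets a framework spread it across many machines. The reduce step is the only place where results from different records meet.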
