A Hadoop Primer

A simple introduction to Hadoop, given as a talk to the Maine Java Users' Group on February 15, 2011.

Published in: Technology


  1. A Hadoop Primer, Feb 2011 (10.20.2005)
  2. http://redmonk.com/public/hadoop.pdf
  3. The Background
  4. October, 2003
  5. December, 2004
  6. Map::Reduce
  7. Job::Map Reduce::Output
  8. Counting Shakespeare
  9. The Birth of Hadoop
  10. (no text)
  11. (no text)
  12. Project Architecture (Source: Running Hadoop On Ubuntu Linux, Michael G. Noll, 8.8.07)
  13. Project Traction
  14. Employment Potential
  15. Hadoop Users
  16. Why Hadoop?
  17. More Machines = More Faster
  18. The reason everyone knows
  19. BIG DATA
  20. “The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data. The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem.” - Bradford Cross, Flightcaster/Woven
  21. The reason not everyone knows
  22. Unstructured Data
  23. What Hadoop Is
  24. Source: http://wiki.apache.org/hadoop/PoweredBy
      • “build Amazon's product search indices”
      • “build the recommender system for behavioral targeting”
      • “ETL style processing and statistics generation”
      • “information extraction & search”
      • “searching and analysis of millions of rental bookings”
      • “we use Hadoop to summarize of users tracking data”
      • “we use Hadoop to store ad serving logs”
      • “the freedom to query the data in an ad-hoc manner”
      • “generating web graphs on 100 nodes”
      • “we use Hadoop for batch-processing large RDF datasets”
      • “facial similarity and recognition across large datasets”
      • “We are using Hadoop and Nutch to crawl Blog posts”
      • “Used for ETL & data analysis on terascale datasets”
  25. What Hadoop Isn't
  26. A relational database killer No Yes
  27. Beyond Hadoop
  28. The Hadoop Ecosystem
  29. What We Use Hadoop For
  30. Crawling Largeish Unstructured Datasets
  31. Like 1.3M StackOverflow Questions
  32. Or 1.7M HackerNews Entries
  33. Or Years of Apache Log Files
  34. How to Get Started
  35. We use Cloudera
  36. Mostly because it's easy
  37. This easy
  38. Or if you prefer
  39. Or maybe this
  40. QUESTIONS
  41. Student? Talk to us
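
The slides titled "Map::Reduce" and "Counting Shakespeare" name the map and reduce phases and a word-count example, but the transcript carries no code for them. The following is a minimal sketch of that pattern in plain Python, not Hadoop API code and not necessarily what the talk showed. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample line are assumptions made for illustration.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's
    shuffle between the map and reduce steps (hypothetical helper)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Sample input is an assumption; any iterable of text lines works.
    sample = ["To be, or not to be, that is the question"]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

On a real cluster the map and reduce functions would run on many machines over splits of the input, with the framework handling the grouping step; here everything runs in one process so that the data flow is easy to follow.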
