
EclipseCon Keynote: Apache Hadoop - An Introduction


Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that evolve around Hadoop and the ecosystem.


  1. Apache Hadoop: an introduction. Todd Lipcon (@tlipcon), Cloudera (@cloudera). March 24, 2011.
  2. Introductions. Software Engineer at Cloudera. Apache Hadoop, HBase, and Thrift committer. Previously: systems programming, operations, large-scale data analysis. I love data and data systems.
  3. Outline. Why should you care? (Intro). What is Hadoop? How does it work? The Hadoop Ecosystem. Use Cases. Experiences as a developer.
  4. Data is the difference. What's data?
  5. Photo by C.C. Chapman (CC BY-NC-ND)
  6. "Every two days we create as much information as we did from the dawn of civilization up until 2003." Eric Schmidt
  7. "I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." Hal Varian (Google's chief economist)
  8. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), … Are you throwing it away because it doesn't 'fit'?
  9. So, what's Hadoop? The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
  10. So, what's Hadoop? The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry
  11. Apache Hadoop is an open-source system to reliably store and process GOBS of data across many commodity computers.
  12. Two Core Components. Store: HDFS, self-healing, high-bandwidth, clustered storage. Process: MapReduce, fault-tolerant distributed processing.
  13. What makes Hadoop special?
  14. Falsehood #1: Machines can be reliable… Image: MadMan the Mighty CC BY-NC-SA
  15. Hadoop separates distributed-system fault-tolerance code from application logic. Fault-tolerance code: systems programmers (unicorns). Application logic: statisticians.
  16. Falsehood #2: Machines deserve identities... Image: Laughing Squid CC BY-NC-SA
  17. Hadoop lets you interact with a cluster, not a bunch of machines. Image: Yahoo! Hadoop cluster [OSCON '07]
  18. Falsehood #3: Your analysis fits on one machine… Image: Matthew J. Stinson CC-BY-NC
  19. Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data; simple SQL queries on >100TB of clickstream data. Hadoop works for both applications!
  20. Hadoop sounds like magic. (Coincidentally, today is Houdini's birthday, though he was not a Hadoop committer.) How is it possible?
  21. A Typical Look... 5-4000 commodity servers (8-core, 24GB RAM, 4-12TB disk, gig-E). 2-level network architecture, 20-40 nodes per rack.
  22. Cluster nodes. Master nodes (1 each): NameNode (metadata server and database), JobTracker (scheduler). Slave nodes (1-4000 each): DataNodes (block storage), TaskTrackers (task execution).
  23. HDFS API
      FileSystem fs = FileSystem.get(conf);
      InputStream in = fs.open(new Path("/foo/bar"));
      OutputStream os = fs.create(new Path("/baz"));
      fs.delete(…), fs.listStatus(…)
  24. HDFS Data Storage. A 158MB file /logs/weblog.txt is split into three blocks (64MB blk_29232, 64MB blk_19231, 30MB blk_329432); the NameNode tracks which DataNodes (DN 1-4) hold each block's replicas.
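The block arithmetic on this slide (158MB = 64MB + 64MB + 30MB) can be checked with a short sketch. This is plain Java, not the HDFS implementation; `blockSizes` is a made-up helper, and 64MB matches the default block size of Hadoop releases of this era.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB default

    // Returns the size of each block a file of the given length occupies:
    // full 64MB blocks, then one final partial block for the remainder.
    static List<Long> blockSizes(long fileLength) {
        List<Long> blocks = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long b = Math.min(BLOCK_SIZE, remaining);
            blocks.add(b);
            remaining -= b;
        }
        return blocks;
    }

    public static void main(String[] args) {
        long fileLength = 158L * 1024 * 1024; // the 158MB file from the slide
        System.out.println(blockSizes(fileLength)); // three blocks: 64MB, 64MB, 30MB
    }
}
```

Note the last block occupies only its actual 30MB, not a full 64MB slot; HDFS blocks are not padded.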
  25. HDFS Write Path
  26. HDFS has split the file into 64MB blocks and stored it on the DataNodes. Now, we want to process that data.
  27. The MapReduce Programming Model
  28. You specify map() and reduce() functions. The framework does the rest.
  29. map(). map: K₁,V₁ → list(K₂,V₂). Input key: byte offset 193284; input value: ' - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326'. Output key: userimage; output value: 2326 bytes. The map function runs on the same node where the data is stored!
  30. Input Format. Wait! HDFS is not a key-value store! An InputFormat interprets bytes as a key and value: the line ' - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326' becomes Key: log offset 193284, Value: the full line of text.
  31. The Shuffle. Each map output is assigned to a "reducer" based on its key. Map output is grouped and sorted by key.
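The "assigned based on its key" step is, by default, a hash. The logic below mirrors Hadoop's default HashPartitioner (the bit-mask keeps negative hash codes from producing a negative partition); the class and method names here are mine.

```java
public class Partition {
    // Mirrors HashPartitioner: same key -> same reducer, every time.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // All map outputs keyed "userimage" land on one reducer,
        // so that reducer sees every value for the key.
        int p = partitionFor("userimage", 4);
        System.out.println(p == partitionFor("userimage", 4)); // true
    }
}
```

Determinism is the point: the grouping guarantee of the shuffle only holds because every mapper computes the same partition for the same key.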
  32. reduce(). reduce: K₂, iter(V₂) → list(K₃,V₃). Key: userimage. Values: 2326 bytes (from map task 0001), 1000 bytes (from map task 0008), 3020 bytes (from map task 0120). The reducer function sums them. Key: userimage, Value: 6346 bytes. TextOutputFormat writes: userimage \t 6346.
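Putting map, shuffle, and reduce together: a toy single-process run of the model in plain Java, standing in for the Hadoop framework. The simplified log format and all names (`ToyMapReduce`, `run`) are illustrative, not Hadoop APIs.

```java
import java.util.*;

public class ToyMapReduce {
    // map: one log line -> (top-level resource, bytes served)
    static Map.Entry<String, Long> map(String line) {
        String[] f = line.split(" ");        // e.g. "GET /userimage/123 200 2326"
        String key = f[1].split("/")[1];     // "userimage"
        return Map.entry(key, Long.parseLong(f[3]));
    }

    // reduce: (key, all values for that key) -> total bytes
    static long reduce(String key, List<Long> values) {
        long sum = 0;
        for (long v : values) sum += v;
        return sum;
    }

    static Map<String, Long> run(List<String> input) {
        // "shuffle": group map outputs by key, sorted (TreeMap)
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : input) {
            Map.Entry<String, Long> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Long> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduce(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        List<String> logs = List.of(
            "GET /userimage/123 200 2326",
            "GET /userimage/456 200 1000",
            "GET /css/site.css 200 512");
        System.out.println(run(logs)); // {css=512, userimage=3326}
    }
}
```

In real Hadoop the map calls run in parallel on the nodes holding the blocks, and the grouping happens across the network; only the two user-supplied functions are the programmer's job.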
  33. Putting it together...
  34. Hadoop is not just MapReduce (NoNoSQL!). The Hive project adds SQL support to Hadoop: HiveQL (a SQL dialect) compiles to a query plan, and the query plan executes as MapReduce jobs.
  35. Hive Example
      CREATE TABLE movie_rating_data (
        userid INT, movieid INT, rating INT, unixtime STRING
      ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
      LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;
      CREATE TABLE average_ratings AS
      SELECT movieid, AVG(rating) FROM movie_rating_data
      GROUP BY movieid;
  36. The Hadoop Ecosystem (diagram of surrounding projects, including a column DB).
  37. Hadoop in the Wild (yes, it's used in production). Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC '09). Facebook: 15TB of new data per day; 1200 machines, 21PB in one cluster. Twitter: >1TB per day, ~120 nodes. Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research).
  38. Use Cases
  39. Product Recommendations. Naïve approach: users who bought toothpaste bought toothbrushes. Hadoop approach: What products did a user browse, hover over, rate, or add to cart (but not buy) in the last 2 months? What are the attributes of the user? What are our margins, promotions, inventory, etc.?
  40. Product Recommendations (cont.). A lot of data! Activity: ~20GB/day × ~60 days = 1.2TB. User data: 2GB. Purchase data: ~5GB. Pre-aggregating loses fidelity for individual users.
  41. Hadoop and Java (the good). Integration, integration, integration! Tooling: IDEs, JCarder, AspectJ, Maven/Ivy. Developer accessibility.
  42. Hadoop and Java (the bad). Java is great for applications; Hadoop is systems programming. JNI is our hammer: compression, security, FS access. A C++ wrapper for setuid task execution.
  43. Hadoop and Java (the ugly). JVM bugs! Garbage collection pauses on 50GB heaps. WORA ("write once, run anywhere") is a giant lie for systems; worst of both worlds?
  44. Ok, fine, what next? Get Hadoop! CDH, Cloudera's Distribution including Apache Hadoop. Try it out! (Locally, VM, or EC2.) Watch free training videos on
  45. Thanks! • @tlipcon • (feedback? yes!) • (hiring? yes!)