Apache Hadoop: an introduction
Todd Lipcon | todd@cloudera.com | @tlipcon @cloudera
March 24, 2011
Introductions
Software Engineer at Cloudera
Apache Hadoop, HBase, Thrift committer
Previously: systems programming, operations, large-scale data analysis
I love data and data systems
Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
The Hadoop Ecosystem
Use Cases
Experiences as a developer
Data is the difference. What's data?
Photo by C.C. Chapman (CC BY-NC-ND): http://www.flickr.com/photos/cc_chapman/3342268874/
“Every two days we create as much information as we did from the dawn of civilization up until 2003.” (Eric Schmidt)
“I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding.” (Hal Varian, Google's chief economist)
Are you throwing away data?
Data comes in many shapes and sizes: relational tuples, log files, semi-structured textual data (e.g., e-mail), …
Are you throwing it away because it doesn't 'fit'?
So, what's Hadoop?
The Little Prince, Antoine de Saint-Exupéry (trans. Irene Testot-Ferry)
Apache Hadoop is an open-source system to reliably store and process GOBS of data across many commodity computers.
Two Core Components
Store: HDFS, self-healing, high-bandwidth clustered storage
Process: Map/Reduce, fault-tolerant distributed processing
What makes Hadoop special?
Falsehood #1: Machines can be reliable…
Image: MadMan the Mighty (CC BY-NC-SA)
Hadoop separates distributed-system fault-tolerance code from application logic.
[Slide graphic: Systems Programmers, Statisticians, Unicorns]
Falsehood #2: Machines deserve identities...
Image: Laughing Squid (CC BY-NC-SA)
Hadoop lets you interact with a cluster, not a bunch of machines.
Image: Yahoo! Hadoop cluster [OSCON '07]
Falsehood #3: Your analysis fits on one machine…
Image: Matthew J. Stinson (CC BY-NC)
Hadoop scales linearly with data size or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image data
Simple SQL queries on >100TB of clickstream data
Hadoop works for both applications!
Hadoop sounds like magic.
Coincidentally, today is Houdini's birthday, though he was not a Hadoop committer.
How is it possible?
A Typical Look...
5-4000 commodity servers (8-core, 24GB RAM, 4-12TB, gig-E)
2-level network architecture: 20-40 nodes per rack
Cluster nodes
Master nodes (1 each):
  NameNode (metadata server and database)
  JobTracker (scheduler)
Slave nodes (1-4000 each):
  DataNodes (block storage)
  TaskTrackers (task execution)
HDFS API
FileSystem fs = FileSystem.get(conf);
InputStream in = fs.open(new Path("/foo/bar"));
OutputStream os = fs.create(new Path("/baz"));
fs.delete(…), fs.listStatus(…)
HDFS Data Storage
/logs/weblog.txt (158MB) is split into blocks: blk_19231 (64MB), blk_29232 (64MB), blk_329432 (30MB).
The NameNode records which blocks make up the file; the blocks themselves are stored, with replicas, across the DataNodes (DN 1-4).
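The block layout on this slide is simple arithmetic: HDFS chops a file into fixed-size blocks, and the last block holds the remainder. A minimal sketch of that split (plain Java, not the HDFS API; the class and method names are illustrative):

```java
public class BlockSplit {
    // Compute the sizes of the blocks for a file of the given length,
    // mimicking HDFS's fixed block size with a short final block.
    static long[] blockSizes(long fileLen, long blockSize) {
        int n = (int) ((fileLen + blockSize - 1) / blockSize); // ceiling division
        long[] sizes = new long[n];
        for (int i = 0; i < n; i++) {
            long remaining = fileLen - (long) i * blockSize;
            sizes[i] = Math.min(blockSize, remaining);
        }
        return sizes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // The slide's 158MB file with a 64MB block size yields 64MB + 64MB + 30MB.
        for (long s : blockSizes(158 * mb, 64 * mb)) {
            System.out.println(s / mb + "MB");
        }
    }
}
```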
HDFS Write Path
• HDFS has split the file into 64MB blocks and stored it on the DataNodes.
• Now, we want to process that data.
The MapReduce Programming Model
You specify map() and reduce() functions. The framework does the rest.
map()
map: K₁,V₁ → list(K₂,V₂)
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”
→
Key: userimage
Value: 2326 bytes
The map function runs on the same node where the data is stored!
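The map step on this slide can be sketched in plain Java (this is not the Hadoop Mapper API, just the per-record logic: pull the first URL path segment and the response size out of one access-log line; the class and method names are illustrative):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class LogMap {
    // One map() call: access-log line in, (path category, response bytes) out.
    static Map.Entry<String, Long> map(String logLine) {
        // The quoted request field looks like: GET /userimage/123 HTTP/1.0
        String request = logLine.split("\"")[1];
        String path = request.split(" ")[1];          // /userimage/123
        String category = path.split("/")[1];         // userimage
        // The response size is the last whitespace-separated field.
        String[] fields = logLine.trim().split("\\s+");
        long bytes = Long.parseLong(fields[fields.length - 1]);
        return new SimpleEntry<>(category, bytes);
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /userimage/123 HTTP/1.0\" 200 2326";
        Map.Entry<String, Long> kv = map(line);
        System.out.println(kv.getKey() + " -> " + kv.getValue()); // userimage -> 2326
    }
}
```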
Input Format
• Wait! HDFS is not a key-value store!
• An InputFormat interprets bytes as a Key and Value
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”
The Shuffle
Each map output is assigned to a “reducer” based on its key.
Map output is grouped and sorted by key.
reduce()
reduce: K₂, iter(V₂) → list(K₃,V₃)
Key: userimage
Value: 2326 bytes (from map task 0001)
Value: 1000 bytes (from map task 0008)
Value: 3020 bytes (from map task 0120)
→ (reducer function)
Key: userimage
Value: 6346 bytes
TextOutputFormat writes: userimage \t 6346
Putting it together...
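The whole pipeline above can be simulated in memory: map over log lines, shuffle (group and sort by key), then reduce (sum the byte counts per key). A self-contained sketch in plain Java (not the Hadoop API; the class name, sample log lines, and byte totals are illustrative):

```java
import java.util.*;

public class MiniMapReduce {
    // map -> shuffle -> reduce, all in memory.
    static SortedMap<String, Long> run(String[] logLines) {
        // map phase: each line -> (first URL path segment, response bytes)
        List<Map.Entry<String, Long>> mapOutput = new ArrayList<>();
        for (String line : logLines) {
            String path = line.split("\"")[1].split(" ")[1];
            String[] fields = line.trim().split("\\s+");
            long bytes = Long.parseLong(fields[fields.length - 1]);
            mapOutput.add(new AbstractMap.SimpleEntry<>(path.split("/")[1], bytes));
        }
        // shuffle: group map output by key (a TreeMap also keeps keys sorted)
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // reduce phase: sum the values for each key
        TreeMap<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] lines = {
            "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /userimage/123 HTTP/1.0\" 200 2326",
            "127.0.0.1 - mary [10/Oct/2000:13:56:01 -0700] \"GET /userimage/456 HTTP/1.0\" 200 1000",
            "127.0.0.1 - joe [10/Oct/2000:13:57:12 -0700] \"GET /profile/9 HTTP/1.0\" 200 512",
        };
        // Print like TextOutputFormat would: key \t value
        for (Map.Entry<String, Long> e : run(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Real Hadoop runs the map calls next to the data on many machines and shuffles over the network; the dataflow, though, is exactly this.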
Hadoop is not just MapReduce (NoNoSQL!)
The Hive project adds SQL support to Hadoop:
HiveQL (a SQL dialect) compiles to a query plan
The query plan executes as MapReduce jobs
Hive Example
CREATE TABLE movie_rating_data (
  userid INT, movieid INT, rating INT,
  unixtime STRING
) ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE;

LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;

CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating) FROM movie_rating_data
GROUP BY movieid;
The Hadoop Ecosystem
[Ecosystem diagram; one component labeled “Column DB”]
Hadoop in the Wild (yes, it's used in production)
Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC '09)
Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster
Twitter: >1TB per day, ~120 nodes
Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research)
Use Cases
Product Recommendations
• Naïve approach: users who bought toothpaste bought toothbrushes.
• Hadoop approach:
  • What products did a user browse, hover over, rate, add to cart (but not buy), etc. in the last 2 months?
  • What are the attributes of the user?
  • What are our margins, promotions, inventory, etc?
Production Recommendations
• A lot of data!
  • Activity: ~20GB/day x ~60 days = 1.2TB
  • User data: 2GB
  • Purchase data: ~5GB
• Pre-aggregating loses fidelity for individual users.
Hadoop and Java (the good)
Integration, integration, integration!
Tooling: IDEs, JCarder, AspectJ, Maven/Ivy
Developer accessibility
Hadoop and Java (the bad)
Java is great for applications. Hadoop is systems programming.
JNI is our hammer: compression, security, FS access
C++ wrapper for setuid task execution
Hadoop and Java (the ugly)
JVM bugs!
Garbage collection pauses on 50GB heaps
WORA is a giant lie for systems: worst of both worlds?
Ok, fine, what next?
Get Hadoop!
  CDH: Cloudera's Distribution including Apache Hadoop
  http://cloudera.com/
  http://hadoop.apache.org/
Try it out! (Locally, in a VM, or on EC2)
Watch free training videos on http://cloudera.com/
Thanks!
• todd@cloudera.com
• @tlipcon
• (feedback? yes!)
• (hiring? yes!)
EclipseCon Keynote: Apache Hadoop - An Introduction
Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that evolve around Hadoop and the ecosystem.
