
Apache Hadoop and HBase


Todd Lipcon
Cloudera

Transcript

  • 1. Apache Hadoop and HBase Todd Lipcon todd@cloudera.com @tlipcon @cloudera Nov 2, 2010
  • 2. Software Engineer at Cloudera; Hadoop contributor, HBase committer. Previously: systems programming, operations, large-scale data analysis. I love data and data systems. Hello! (今日は!)
  • 3. Outline: Why should you care? (Intro); What is Hadoop?; How does it work? Hadoop MapReduce; The Hadoop Ecosystem; Questions.
  • 4. Data is everywhere. Data is important.
  • 5. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist)
  • 6. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), and more. Are you throwing it away because it doesn’t ‘fit’?
  • 7. So, what’s Hadoop?
  • 8. Apache Hadoop is an open-source system to reliably store and process A LOT of information across many commodity computers.
  • 9. Two Core Components. Store: HDFS, self-healing, high-bandwidth clustered storage. Process: Map/Reduce, fault-tolerant distributed processing.
  • 10. What makes Hadoop special?
  • 11. Hadoop separates distributed system fault-tolerance code from application logic. [Slide graphic labels: Systems Programmers, Statisticians, Unicorns]
  • 12. Hadoop lets you interact with a cluster, not a bunch of machines. [Image: Yahoo! Hadoop cluster, OSCON ’07]
  • 13. Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data, or simple SQL-style queries on >100TB of clickstream data. Hadoop works for both applications!
  • 14. A Typical Look: 5-4000 commodity servers (8-core, 24GB RAM, 4-12TB disk, gig-E); 2-level network architecture; 20-40 nodes per rack.
  • 15. Hadoop sounds like magic. How is it possible?
  • 16. Cluster nodes. Master nodes (1 each): NameNode (metadata server and database) and JobTracker (scheduler). Slave nodes (1-4000 each): DataNodes (block storage) and TaskTrackers (task execution).
  • 17. HDFS Data Storage [diagram]: the NameNode maps /logs/weblog.txt (158MB) to blocks blk_29232, blk_19231, and blk_329432 (64MB + 64MB + 30MB), stored across DataNodes DN 1-DN 4.
  • 18. HDFS Write Path
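    A minimal sketch of the write path from the client's side, using the Java FileSystem API (illustrative only; the file name comes from the earlier slide, and the cluster configuration is assumed to be on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml, hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // HDFS client handle
        FSDataOutputStream out = fs.create(new Path("/logs/weblog.txt"));
        out.writeBytes("127.0.0.1 - frank ...\n");  // the client streams bytes; HDFS splits
        out.close();                                // them into 64MB blocks and replicates
        fs.close();                                 // each block across DataNodes
      }
    }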
  • 19. • HDFS has split the file into 64MB blocks and stored it on the DataNodes. • Now, we want to process that data.
  • 20. The MapReduce Programming Model
  • 21. You specify map() and reduce() functions. The framework does the rest.
  • 22. map(): map: (K₁, V₁) → list(K₂, V₂). Input key: byte offset 193284. Input value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”. Output key: userimage. Output value: 2326 bytes. The map function runs on the same node where the data is stored!
  • 23. Input Format. Wait! HDFS is not a key-value store! An InputFormat interprets the stored bytes as keys and values. The log line 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326 becomes Key: byte offset 193284, Value: the full log line.
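    A minimal mapper sketch for the example above (the class name and field positions are ours; the slides show no code, and a real log parser would be more defensive):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LogBytesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      private final Text page = new Text();
      private final LongWritable bytes = new LongWritable();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // "127.0.0.1 - frank [...] "GET /userimage/123 HTTP/1.0" 200 2326"
        String[] fields = line.toString().split(" ");
        if (fields.length < 10) return;          // skip malformed lines
        String[] path = fields[6].split("/");    // "/userimage/123" -> ["", "userimage", "123"]
        if (path.length < 2) return;
        page.set(path[1]);                       // "userimage"
        bytes.set(Long.parseLong(fields[9]));    // response size: 2326
        context.write(page, bytes);              // emit (userimage, 2326)
      }
    }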
  • 24. The Shuffle. Each map output is assigned to a “reducer” based on its key; map output is grouped and sorted by key.
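    By default that assignment is a hash of the key. A sketch of an equivalent partitioner (the class name is ours; the formula mirrors Hadoop's default HashPartitioner behavior):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class KeyHashPartitioner extends Partitioner<Text, LongWritable> {
      @Override
      public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        // mask the sign bit so the modulo result is non-negative;
        // every record with the same key lands on the same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }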
  • 25. reduce(): (K₂, iter(V₂)) → list(K₃, V₃). Input key: userimage. Input values: 2326 bytes (from map task 0001), 1000 bytes (from map task 0008), 3020 bytes (from map task 0120). The reducer function emits Key: userimage, Value: 6346 bytes, which TextOutputFormat writes as “userimage\t6346”.
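    A matching reducer sketch (again hypothetical; it sums the byte counts per key, as the slide does):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BytesSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      private final LongWritable total = new LongWritable();

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
          sum += v.get();                // 2326 + 1000 + 3020
        }
        total.set(sum);
        context.write(key, total);       // emit (userimage, 6346)
      }
    }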
  • 26. Putting it together...
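    A driver wiring the sketches above into a runnable job (class names and paths are placeholders; TextInputFormat and TextOutputFormat are the defaults, so they need no explicit setup):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogBytesJob {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "bytes per page");
        job.setJarByClass(LogBytesJob.class);
        job.setMapperClass(LogBytesMapper.class);
        job.setReducerClass(BytesSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path("/logs/weblog.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/logs/bytes-per-page"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }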
  • 27. Hadoop is not NoSQL (sorry!). The Hive project adds SQL support to Hadoop: HiveQL (a SQL dialect) compiles to a query plan, and the query plan executes as MapReduce jobs.
  • 28. Hive Example:

    CREATE TABLE movie_rating_data (
      userid INT, movieid INT, rating INT, unixtime STRING
    ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;

    CREATE TABLE average_ratings AS
    SELECT movieid, AVG(rating) FROM movie_rating_data GROUP BY movieid;
  • 29. The Hadoop Ecosystem [ecosystem diagram; HBase appears as the column DB]
  • 30. Hadoop in the Wild (yes, it’s used in production). Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC ’09). Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster. Twitter: ~1TB per day, ~80 nodes. Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research).
  • 31. What about real time access? • MapReduce is a batch system • The fastest MR job takes 24 seconds • HDFS just stores bytes, and is append-only • Not about to serve data for your next web site.
  • 32. Apache HBase. HBase is an open source, distributed, sorted map modeled after Google’s BigTable.
  • 33. HBase is built on Hadoop • Hadoop provides: • Fault tolerance • Scalability • Batch processing with MapReduce
  • 34. HDFS + HBase = HDFS + random read/write • HBase uses HDFS for storage • “Log structured merge trees” • Similar to “log structured file systems” • Same storage pattern as Cassandra!
  • 35. A Big Sorted Map:

    Row key | Column key | Timestamp     | Cell
    Row1    | info:aaa   | 1273516197868 | valueA
    Row1    | info:bbb   | 1273871824184 | valueB
    Row1    | info:bbb   | 1273871823022 | oldValueB
    Row1    | info:ccc   | 1273746289103 | valueC
    Row2    | info:hello | 1273878447049 | i_am_a_value
    Row3    | info:      | 1273616297446 | another_value

    Sorted by row key and column key; the timestamp is a long value. The two info:bbb rows are two versions of the same cell.
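    The model can be pictured as nested sorted maps: row key → column key → timestamp (newest first) → cell value. An illustration of that idea in plain Java (our own sketch, not HBase's implementation):

    import java.util.Collections;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class SortedMapModel {
      // row key -> column key -> timestamp (descending) -> cell value
      static NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
          new TreeMap<>();

      static void put(String row, String col, long ts, String val) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(col, c -> new TreeMap<>(Collections.reverseOrder()))
             .put(ts, val);
      }

      public static void main(String[] args) {
        put("Row1", "info:bbb", 1273871824184L, "valueB");
        put("Row1", "info:bbb", 1273871823022L, "oldValueB");
        // the newest version of a cell sorts first under its timestamps:
        System.out.println(table.get("Row1").get("info:bbb").firstEntry().getValue());
        // prints "valueB"
      }
    }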
  • 36. HBase API • get(row) • put(row, map<column, value>) • scan(key range, filter) • increment(row, columns) • … (checkAndPut, delete, etc…) • MapReduce/Hive
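    A sketch of that API in use with the 2010-era Java client (the table and column names are hypothetical, and the table is assumed to already exist with an info column family):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseApiExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable");

        // put(row, map<column, value>)
        Put put = new Put(Bytes.toBytes("Row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("aaa"), Bytes.toBytes("valueA"));
        table.put(put);

        // get(row)
        Result result = table.get(new Get(Bytes.toBytes("Row1")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("info"), Bytes.toBytes("aaa"))));  // "valueA"

        // increment(row, column)
        table.incrementColumnValue(Bytes.toBytes("Row2"),
            Bytes.toBytes("info"), Bytes.toBytes("counter"), 1L);

        table.close();
      }
    }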
  • 37. HBase Architecture
  • 38. HBase in Numbers • Largest cluster: 600 nodes, ~600TB • Most clusters: 5-20 nodes, 100GB-4TB • Writes: 1-3ms, 1k-10k writes/sec per node • Reads: 0-3ms cached, 10-30ms disk • 10-40k reads / second / node from cache • Cell size: 0-3MB preferred
  • 39. HBase compared • Favors Consistency over Availability (but availability is good in practice!) • Great Hadoop integration (very efficient bulk loads, MapReduce analysis) • Ordered range partitions (not hash) • Automatically shards/scales (just turn on more servers) • Sparse column storage (not key-value)
  • 40. HBase in Production • Facebook (product release soon) • StumbleUpon / su.pr • Mozilla (receives crash reports) • … many others
  • 41. Ok, fine, what next? Get Hadoop! Cloudera’s Distribution for Hadoop: http://cloudera.com/ and http://hadoop.apache.org/. Try it out! (Locally, VM, or EC2.) Watch free training videos on http://cloudera.com/ (available in Japanese!).
  • 42. Questions? • todd@cloudera.com • (feedback? yes!) • (hiring? yes!)