Transcript of "Apache Hadoop and HBase"
  1. Apache Hadoop and HBase. Todd Lipcon (todd@cloudera.com, @tlipcon, @cloudera). Nov 2, 2010.
  2. Software Engineer at Cloudera. Hadoop contributor, HBase committer. Previously: systems programming, operations, large scale data analysis. I love data and data systems. Hello! (今日は)
  3. Outline: Why should you care? (Intro); What is Hadoop?; How does it work?; Hadoop MapReduce; The Hadoop Ecosystem; Questions.
  4. Data is everywhere. Data is important.
  5. "I keep saying that the sexy job in the next 10 years will be statisticians, and I'm not kidding." Hal Varian (Google's chief economist)
  6. Are you throwing away data? Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail), and more. Are you throwing it away because it doesn't "fit"?
  7. So, what's Hadoop?
  8. Apache Hadoop is an open-source system to reliably store and process A LOT of information across many commodity computers.
  9. Two Core Components: HDFS (Store): self-healing, high-bandwidth clustered storage. Map/Reduce (Process): fault-tolerant distributed processing.
  10. What makes Hadoop special?
  11. Hadoop separates distributed-system fault-tolerance code from application logic. (Diagram labels: Systems Programmers, Statisticians, Unicorns)
  12. Hadoop lets you interact with a cluster, not a bunch of machines. Image: Yahoo! Hadoop cluster [OSCON '07]
  13. Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example: extensive machine learning on <100GB of image data; simple SQL-style queries on >100TB of clickstream data. Hadoop works for both applications!
  14. A Typical Look... 5-4000 commodity servers (8-core, 24GB RAM, 4-12 TB, gig-E); 2-level network architecture, 20-40 nodes per rack.
  15. Hadoop sounds like magic. How is it possible?
  16. Cluster nodes. Master nodes (1 each): NameNode (metadata server and database), JobTracker (scheduler). Slave nodes (1-4000 each): DataNodes (block storage), TaskTrackers (task execution).
  17. HDFS Data Storage (diagram): the NameNode maps /logs/weblog.txt (158MB) to blocks blk_29232, blk_19231, and blk_329432 (64MB + 64MB + 30MB), stored across DataNodes DN 1-DN 4.
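To make that mapping concrete, here is a minimal sketch of my own (not from the talk) showing how a client can ask the NameNode where a file's blocks live, using Hadoop's standard FileSystem API; the class name ListBlocks is an assumption, and the path is the slide's example:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ListBlocks {
        public static void main(String[] args) throws Exception {
          // Connects to the cluster configured in core-site.xml / hdfs-site.xml.
          FileSystem fs = FileSystem.get(new Configuration());
          FileStatus stat = fs.getFileStatus(new Path("/logs/weblog.txt"));
          // One BlockLocation per block: its offset, length, and the DataNodes
          // holding a replica (e.g. 64MB + 64MB + 30MB for the 158MB file above).
          for (BlockLocation blk : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
            System.out.println("offset=" + blk.getOffset()
                + " len=" + blk.getLength()
                + " hosts=" + java.util.Arrays.toString(blk.getHosts()));
          }
        }
      }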
  18. HDFS Write Path (diagram)
  19. • HDFS has split the file into 64MB blocks and stored it on the DataNodes. • Now, we want to process that data.
  20. The MapReduce Programming Model
  21. You specify map() and reduce() functions. The framework does the rest.
  22. map() map: (K₁,V₁) → list(K₂,V₂). Input Key: byte offset 193284. Input Value: "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326". Output Key: userimage. Output Value: 2326 bytes. The map function runs on the same node where the data is stored! (A mapper sketch follows below.)
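As an illustration only (this code is mine, not the talk's), such a mapper might look like this in Java's org.apache.hadoop.mapreduce API; the class name LogBytesMapper and the field-splitting logic are assumptions:

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Input: (byte offset, log line). Output: (top-level URL path, bytes served).
      public class LogBytesMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable byteOffset, Text logLine, Context context)
            throws IOException, InterruptedException {
          // Common log format, e.g.:
          // 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326
          String[] fields = logLine.toString().split(" ");
          if (fields.length < 10) return;           // skip malformed lines
          String[] path = fields[6].split("/");     // fields[6] = /userimage/123
          if (path.length < 2) return;
          long bytes = Long.parseLong(fields[9]);   // response size in bytes
          context.write(new Text(path[1]), new LongWritable(bytes));
        }
      }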
  23. Input Format • Wait! HDFS is not a Key-Value store! • InputFormat interprets bytes as a Key and Value. The raw line 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326 becomes Key: byte offset 193284, Value: "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326".
  24. The Shuffle. Each map output is assigned to a "reducer" based on its key; map output is grouped and sorted by key. (A partitioner sketch follows below.)
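For the curious, the default key-to-reducer assignment boils down to hashing. A sketch equivalent to Hadoop's stock HashPartitioner, typed here for the log example's Text/LongWritable pairs (the class name is mine):

      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Partitioner;

      // Keys hash to a reducer number in [0, numReduceTasks),
      // so equal keys always land on the same reducer.
      public class HashStylePartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numReduceTasks) {
          return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
      }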
  25. reduce() reduce: (K₂, iter(V₂)) → list(K₃,V₃). Input Key: userimage. Input Values: 2326 bytes (from map task 0001), 1000 bytes (from map task 0008), 3020 bytes (from map task 0120). Reducer output Key: userimage, Value: 6346 bytes, which TextOutputFormat writes as "userimage<TAB>6346". (A reducer sketch follows below.)
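A matching reducer for the example above, again a sketch of my own rather than the talk's code (SumBytesReducer is a name I made up):

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class SumBytesReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable v : values) {
            total += v.get();           // e.g. 2326 + 1000 + 3020 = 6346
          }
          // TextOutputFormat renders this as "userimage<TAB>6346".
          context.write(key, new LongWritable(total));
        }
      }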
  26. Putting it together... (diagram; a driver sketch follows below)
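To tie the pieces together, a minimal driver might look like the following; the job name, the LogBytesMapper/SumBytesReducer classes from the sketches above, and the output path /logs/bytes-out are all illustrative assumptions:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class LogBytesJob {
        public static void main(String[] args) throws Exception {
          Job job = new Job(new Configuration(), "log bytes per category");
          job.setJarByClass(LogBytesJob.class);
          job.setMapperClass(LogBytesMapper.class);
          job.setReducerClass(SumBytesReducer.class);
          job.setInputFormatClass(TextInputFormat.class);   // (byte offset, line)
          job.setOutputFormatClass(TextOutputFormat.class); // tab-separated text
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
          FileInputFormat.addInputPath(job, new Path("/logs/weblog.txt"));
          FileOutputFormat.setOutputPath(job, new Path("/logs/bytes-out"));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }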
  27. Hadoop is not NoSQL (sorry!) The Hive project adds SQL support to Hadoop: HiveQL (a SQL dialect) compiles to a query plan, and the query plan executes as MapReduce jobs.
  28. Hive Example
      CREATE TABLE movie_rating_data (
        userid INT, movieid INT, rating INT, unixtime STRING
      ) ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
        STORED AS TEXTFILE;

      LOAD DATA INPATH '/datasets/movielens' INTO TABLE movie_rating_data;

      CREATE TABLE average_ratings AS
        SELECT movieid, AVG(rating) FROM movie_rating_data GROUP BY movieid;
  29. The Hadoop Ecosystem (diagram; one component is labeled "Column DB")
  30. Hadoop in the Wild (yes, it's used in production) Yahoo! Hadoop clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC '09). Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster. Twitter: ~1TB per day, ~80 nodes. Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research).
  31. What about real-time access? • MapReduce is a batch system • The fastest MR job takes 24 seconds • HDFS just stores bytes, and is append-only • Not about to serve data for your next web site.
  32. Apache HBase. HBase is an open source, distributed, sorted map modeled after Google's BigTable.
  33. HBase is built on Hadoop • Hadoop provides: • Fault tolerance • Scalability • Batch processing with MapReduce
  34. HDFS + HBase = HDFS + random read/write • HBase uses HDFS for storage • "Log-structured merge trees" • Similar to "log-structured file systems" • Same storage pattern as Cassandra! (A toy sketch of the idea follows below.)
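To make the log-structured merge idea concrete, here is a toy in-memory sketch of my own (a simplification, nothing like HBase's real implementation): random writes land in a sorted buffer, full buffers are flushed as immutable sorted "files" (which on HDFS is a purely sequential, append-only write), and reads consult the buffer plus the flushed files from newest to oldest:

      import java.util.ArrayList;
      import java.util.List;
      import java.util.SortedMap;
      import java.util.TreeMap;

      // Toy log-structured merge store (illustrative only).
      public class ToyLsm {
        private TreeMap<String, String> memstore = new TreeMap<String, String>();
        private final List<SortedMap<String, String>> flushedFiles =
            new ArrayList<SortedMap<String, String>>();
        private static final int FLUSH_THRESHOLD = 4; // tiny, for demonstration

        // Random write: an in-memory insert (HBase also appends to a log).
        public void put(String row, String value) {
          memstore.put(row, value);
          if (memstore.size() >= FLUSH_THRESHOLD) {
            // Flush the sorted buffer as an immutable "file": append-only I/O.
            flushedFiles.add(memstore);
            memstore = new TreeMap<String, String>();
          }
        }

        // Random read: memstore first, then flushed files, newest to oldest.
        public String get(String row) {
          if (memstore.containsKey(row)) return memstore.get(row);
          for (int i = flushedFiles.size() - 1; i >= 0; i--) {
            String v = flushedFiles.get(i).get(row);
            if (v != null) return v;
          }
          return null;
        }
      }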
  35. A Big Sorted Map
      Row key | Column key | Timestamp     | Cell
      Row1    | info:aaa   | 1273516197868 | valueA
      Row1    | info:bbb   | 1273871824184 | valueB      (2 versions of this cell)
      Row1    | info:bbb   | 1273871823022 | oldValueB
      Row1    | info:ccc   | 1273746289103 | valueC
      Row2    | info:hello | 1273878447049 | i_am_a_value
      Row3    | info:      | 1273616297446 | another_value
      Sorted by row key and column; the timestamp is a long value.
  36. HBase API • get(row) • put(row, map<column, value>) • scan(key range, filter) • increment(row, columns) • ... (checkAndPut, delete, etc.) • MapReduce/Hive (A client sketch follows below.)
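A sketch of these calls using the HBase client classes of this era (org.apache.hadoop.hbase.client); the table name "webtable", the "info" column family, and the cell values are hypothetical:

      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseApiSketch {
        public static void main(String[] args) throws Exception {
          // "webtable" is a hypothetical table with an "info" column family.
          HTable table = new HTable(HBaseConfiguration.create(), "webtable");

          // put(row, map<column, value>)
          Put put = new Put(Bytes.toBytes("Row1"));
          put.add(Bytes.toBytes("info"), Bytes.toBytes("aaa"), Bytes.toBytes("valueA"));
          table.put(put);

          // get(row)
          Result row = table.get(new Get(Bytes.toBytes("Row1")));
          System.out.println(Bytes.toString(
              row.getValue(Bytes.toBytes("info"), Bytes.toBytes("aaa"))));

          // scan(key range): rows in ["Row1", "Row3"), in sorted order
          ResultScanner scanner = table.getScanner(
              new Scan(Bytes.toBytes("Row1"), Bytes.toBytes("Row3")));
          for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
          }
          scanner.close();

          // increment(row, column): atomic counter update
          table.incrementColumnValue(Bytes.toBytes("Row2"),
              Bytes.toBytes("info"), Bytes.toBytes("counter"), 1L);

          table.close();
        }
      }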
  37. HBase Architecture (diagram)
  38. HBase in Numbers • Largest cluster: 600 nodes, ~600TB • Most clusters: 5-20 nodes, 100GB-4TB • Writes: 1-3ms, 1k-10k writes/sec per node • Reads: 0-3ms cached, 10-30ms disk • 10-40k reads/second/node from cache • Cell size: 0-3MB preferred
  39. HBase compared • Favors Consistency over Availability (but availability is good in practice!) • Great Hadoop integration (very efficient bulk loads, MapReduce analysis) • Ordered range partitions (not hash) • Automatically shards/scales (just turn on more servers) • Sparse column storage (not key-value)
  40. HBase in Production • Facebook (product release soon) • StumbleUpon / su.pr • Mozilla (receives crash reports) • ... many others
  41. Ok, fine, what next? Get Hadoop! Cloudera's Distribution for Hadoop: http://cloudera.com/ Apache Hadoop: http://hadoop.apache.org/ Try it out! (Locally, VM, or EC2) Watch free training videos on http://cloudera.com/ (available in Japanese!)
  42. Questions? • todd@cloudera.com • (feedback? yes!) • (hiring? yes!)
