2. Software Engineer at
Hadoop contributor, HBase committer
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
Hello!
3. Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
Hadoop MapReduce
The Hadoop Ecosystem
Questions
7. “I keep saying that the sexy
job in the next 10 years will be
statisticians, and I'm not
kidding.”
Hal Varian
(Google's chief economist)
8. Are you throwing
away data?
Data comes in many shapes and
sizes: relational tuples, log files,
semistructured textual data (e.g.,
e-mail), and more.
Are you throwing it away because it
doesn't 'fit'?
14. Hadoop lets you interact
with a cluster, not a
bunch of machines.
Image: Yahoo! Hadoop cluster [OSCON '07]
15. Hadoop scales linearly
with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image
data
Simple SQL-style queries on >100TB of
clickstream data
Hadoop works for both applications!
16. A Typical Look...
5-4000 commodity servers
(8-core, 24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
24. map()
map: (K₁, V₁) → list(K₂, V₂)
Input:
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
Output:
Key: userimage
Value: 2326 bytes
The map function runs on the same node where the data
is stored!
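A minimal sketch of the map step above, in plain Python rather than the Hadoop API. The function name `map_log_line` and the regular expression are assumptions made for illustration; the only behavior taken from the slide is that a (byte offset, log line) pair turns into a (resource type, bytes sent) pair.

```python
import re

# Common Log Format: host ident user [timestamp] "method path protocol" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-)$'
)


def map_log_line(offset, line):
    """Map one (byte offset, log line) record to a list of (resource, bytes) pairs.

    The offset is the input key; this particular map function ignores it.
    """
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return []  # skip malformed records instead of failing the whole task
    path = match.group(6)
    size = match.group(9)
    # First path component, e.g. "/userimage/123" -> "userimage"
    resource = path.strip("/").split("/")[0]
    bytes_sent = 0 if size == "-" else int(size)
    return [(resource, bytes_sent)]


line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /userimage/123 HTTP/1.0" 200 2326')
print(map_log_line(193284, line))
```

Returning a list matches the list(K₂, V₂) signature: a map call may emit zero, one, or many output pairs per input record.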
25. Input Format
• Wait! HDFS is not a Key-Value store!
• InputFormat interprets bytes as a Key
and Value
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]
"GET /userimage/123 HTTP/1.0" 200 2326
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
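As a rough illustration of what an InputFormat does for line-oriented text, the sketch below turns a raw byte stream into (byte offset, line) records. It is plain Python, not the Hadoop classes; `read_text_records` is a hypothetical name, and the sketch ignores details a real implementation must handle, such as lines that straddle input split boundaries.

```python
import io


def read_text_records(stream):
    """Yield (byte offset, line text) pairs from a binary stream, one per line."""
    offset = 0
    for raw in stream:
        # Key: where the line starts in the file. Value: the line without its newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


data = b"first line\nsecond line\nthird\n"
records = list(read_text_records(io.BytesIO(data)))
for key, value in records:
    print(key, value)
```

The file itself stays an uninterpreted sequence of bytes; the key and value only exist once the reader has imposed a record structure on it.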
26. The Shuffle
Each map output is assigned to a
“reducer” based on its key
map output is grouped and
sorted by key
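The shuffle can be sketched in a few lines of plain Python. This is a simplified in-memory model, not Hadoop's implementation: the names `partition` and `shuffle` are made up for the example, and CRC32 stands in for the hash function (Hadoop's default partitioner hashes the key and takes it modulo the number of reducers).

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key. CRC32 is an illustrative stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Route each (key, value) pair to a reducer, then group and sort by key.

    Returns one list per reducer, each a sorted list of (key, [values]) pairs.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


pairs = [("userimage", 2326), ("css", 100), ("userimage", 500), ("js", 7)]
for reducer_id, grouped in enumerate(shuffle(pairs, 3)):
    print(reducer_id, grouped)
```

Because the reducer is chosen from the key alone, every value for a given key lands on the same reducer, which is what lets a reduce function see all of them together.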
29. Hadoop is not NoSQL
(sorry!)
Hive project adds SQL
support to Hadoop
HiveQL (SQL dialect)
compiles to a query plan
Query plan executes as
MapReduce jobs
30. Hive Example
CREATE TABLE movie_rating_data (
userid INT, movieid INT, rating INT, unixtime STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/datasets/movielens' INTO TABLE
movie_rating_data;
CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating) AS avg_rating FROM movie_rating_data
GROUP BY movieid;
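To give a feel for how a GROUP BY average could run as a MapReduce job, here is a small Python model. It is not the plan Hive actually generates; the function names and the three sample rows are invented for the example, and the whole job runs in memory.

```python
from collections import defaultdict


def map_rating(line):
    """Map phase: parse one tab-separated row and emit (movieid, rating)."""
    userid, movieid, rating, unixtime = line.split("\t")
    return int(movieid), int(rating)


def average_ratings(lines):
    """Group ratings by movieid (the shuffle), then average each group (the reduce)."""
    grouped = defaultdict(list)
    for line in lines:
        movieid, rating = map_rating(line)
        grouped[movieid].append(rating)
    return {movieid: sum(r) / len(r) for movieid, r in sorted(grouped.items())}


# Hypothetical sample rows: userid, movieid, rating, unixtime
rows = [
    "1\t10\t4\t881250949",
    "2\t10\t2\t881250950",
    "1\t20\t5\t881250951",
]
print(average_ratings(rows))
```

The GROUP BY column plays the role of the map output key, and the aggregate function is what the reducer computes over each key's values.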
32. Hadoop in the Wild
(yes, it's used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC '09)
Facebook: 15TB new data per day;
1200 machines, 21PB in one cluster
Twitter: ~1TB per day, ~80 nodes
Lots of 5-40 node clusters at companies without
petabytes of data (web, retail, finance, telecom,
research)
33. What about real-time
access?
• MapReduce is a batch system
• The fastest MR job takes 24 seconds
• HDFS just stores bytes, and is
append-only
• Not about to serve data for your next
web site.
34. Apache HBase
HBase is an
open source, distributed,
sorted map
modeled after Google's Bigtable
35. HBase is built on
Hadoop
• Hadoop provides:
• Fault tolerance
• Scalability
• Batch processing with MapReduce
36. HDFS + HBase
= HDFS + random read/write
• HBase uses HDFS for storage
• “Log structured merge trees”
• Similar to “log structured file systems”
• Same storage pattern as Cassandra!
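The log-structured merge idea can be sketched as a toy in Python. This is a teaching model under simplifying assumptions, not HBase's storage engine: the class name `TinyLSM` and its tiny flush threshold are invented, segments live in memory instead of in append-only files, and there is no write-ahead log or delete handling.

```python
import bisect


class TinyLSM:
    """Toy log-structured merge store: a mutable memtable plus immutable sorted segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Random writes only touch memory; files are never modified in place.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one new sorted, immutable segment.
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        # Merge all segments into one, keeping the newest value per key.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # memtable full: flushed as segment 1
db.put("a", "3")
db.put("c", "4")   # flushed as segment 2, which shadows the old "a"
print(db.get("a"), len(db.segments))
```

Every write to storage is a sequential write of a whole new segment, which is why this pattern fits on top of a file system that only supports appends.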
37. A Big Sorted Map
Row key   Column key   Timestamp       Cell
Row1      info:aaa     1273516197868   valueA
Row1      info:bbb     1273871824184   valueB
Row1      info:bbb     1273871823022   oldValueB
Row1      info:ccc     1273746289103   valueC
Row2      info:hello   1273878447049   i_am_a_value
Row3      info:        1273616297446   another_value
Sorted by row key and column key
Timestamp is a long value
Row1, info:bbb: 2 versions of this cell
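The table above can be modeled as a single sorted map keyed by (row key, column key, timestamp). The Python sketch below is an in-memory illustration only: `SortedCellMap` and its methods are hypothetical names, not the HBase client API, and storing the timestamp negated is just a trick to make newer versions sort first, as they appear in the table.

```python
import bisect


class SortedCellMap:
    """Cells sorted by (row key, column key, newest timestamp first)."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # same key -> cell value

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        start = bisect.bisect_left(self._keys, (row, column))
        found = []
        for key in self._keys[start:]:
            if key[:2] != (row, column) or len(found) == versions:
                break
            found.append((-key[2], self._values[key]))
        return found

    def scan(self, start_row, stop_row):
        """Return all cells with start_row <= row key < stop_row, in sorted order."""
        lo = bisect.bisect_left(self._keys, (start_row,))
        hi = bisect.bisect_left(self._keys, (stop_row,))
        return [(r, c, -neg_ts, self._values[(r, c, neg_ts)])
                for r, c, neg_ts in self._keys[lo:hi]]


cells = SortedCellMap()
cells.put("Row1", "info:aaa", 1273516197868, "valueA")
cells.put("Row1", "info:bbb", 1273871823022, "oldValueB")
cells.put("Row1", "info:bbb", 1273871824184, "valueB")
cells.put("Row1", "info:ccc", 1273746289103, "valueC")
cells.put("Row2", "info:hello", 1273878447049, "i_am_a_value")
cells.put("Row3", "info:", 1273616297446, "another_value")
print(cells.get("Row1", "info:bbb", versions=2))
```

Because everything is kept in key order, reading one cell and scanning a range of rows are both a binary search followed by a sequential read.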
40. HBase in Numbers
• Largest cluster: 600 nodes, ~600TB
• Most clusters: 5-20 nodes, 100GB-4TB
• Writes: 1-3ms, 1k-10k writes/sec per node
• Reads: 0-3ms cached, 10-30ms disk
• 10-40k reads / second / node from cache
• Cell size: 0-3MB preferred
41. HBase compared
• Favors Consistency over Availability (but
availability is good in practice!)
• Great Hadoop integration (very efficient
bulk loads, MapReduce analysis)
• Ordered range partitions (not hash)
• Automatically shards/scales (just turn on
more servers)
• Sparse column storage (not key-value)
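To illustrate what ordered range partitioning buys over hashing, the sketch below maps row keys to partitions by their sorted start keys. The split points and function names are hypothetical and chosen only for the example; this is not how HBase locates regions internally.

```python
import bisect

# Hypothetical split points: partition i holds keys from its start key
# up to (but not including) the next partition's start key.
REGION_START_KEYS = ["", "g", "p"]


def region_for(row_key):
    """Index of the partition whose key range contains row_key."""
    return bisect.bisect_right(REGION_START_KEYS, row_key) - 1


def regions_for_scan(start, stop):
    """Indexes of the partitions that overlap the key range [start, stop)."""
    hits = []
    for i, region_start in enumerate(REGION_START_KEYS):
        is_last = i + 1 == len(REGION_START_KEYS)
        region_end = None if is_last else REGION_START_KEYS[i + 1]
        if region_start < stop and (region_end is None or start < region_end):
            hits.append(i)
    return hits


print(region_for("apple"), region_for("zebra"))
print(regions_for_scan("h", "k"))
```

With range partitions, a scan over adjacent keys touches only the partitions that cover that range. With hash partitioning, adjacent keys scatter, so the same scan would have to ask every partition.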
42. HBase in Production
• Facebook (product release soon)
• StumbleUpon / su.pr
• Mozilla (receives crash reports)
• … many others
43. Ok, fine, what next?
Get Hadoop!
Cloudera's Distribution for Hadoop
http://cloudera.com/
http://hadoop.apache.org/
Try it out! (Locally, VM, or EC2)
Watch free training videos on
http://cloudera.com/
Available in
Japanese!