Apache Hadoop and HBase

Cloudera, Inc.
Todd Lipcon
todd@cloudera.com
@tlipcon @cloudera
Nov 2, 2010
Software Engineer at Cloudera
Hadoop contributor, HBase committer
Previously: systems programming,
operations, large scale data analysis
I love data and data systems
Hello!
Outline
Why should you care? (Intro)
What is Hadoop?
How does it work?
Hadoop MapReduce
The Hadoop Ecosystem
Questions
Data is
everywhere.
Data is important.
“I keep saying that the sexy
job in the next 10 years will be
statisticians, and I'm not
kidding.”
Hal Varian
(Google's chief economist)
Are you throwing
away data?
Data comes in many shapes and
sizes: relational tuples, log files,
semistructured textual data (e.g.,
e-mail), and more.
Are you throwing it away because it
doesn't 'fit'?
So, what's
Hadoop?
Apache Hadoop is an
open-source system
to reliably store and process
A LOT of information
across many commodity computers.
Two Core
Components
HDFS (Store): Self-healing, high-bandwidth clustered storage.
Map/Reduce (Process): Fault-tolerant distributed processing.
What makes
Hadoop special?
Hadoop separates
distributed system fault-
tolerance code from
application logic.
Systems
Programmers
Statisticians
Unicorns
Hadoop lets you interact
with a cluster, not a
bunch of machines.
Image: Yahoo! Hadoop cluster [OSCON '07]
Hadoop scales linearly
with data size
or analysis complexity.
Data-parallel or compute-parallel. For example:
Extensive machine learning on <100GB of image
data
Simple SQL-style queries on >100TB of
clickstream data
Hadoop works for both applications!
A Typical Look...
5-4000 commodity servers
(8-core, 24GB RAM, 4-12 TB, gig-E)
2-level network architecture
20-40 nodes per rack
Hadoop sounds like
magic.
How is it possible?
Cluster nodes
Master nodes (1 each):
• NameNode (metadata server and database)
• JobTracker (scheduler)
Slave nodes (1-4000 each):
• DataNodes (block storage)
• TaskTrackers (task execution)
HDFS Data Storage
[Figure: the NameNode tracks /logs/weblog.txt (158MB), which is split into three blocks (blk_29232, blk_19231, blk_329432; 64MB, 64MB, 30MB) stored across DataNodes DN 1 to DN 4.]
HDFS Write Path
• HDFS has split the file into
64MB blocks and stored it on
the DataNodes.
• Now, we want to process that
data.
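The block splitting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: `split_into_blocks` is a hypothetical helper, and the 64MB block size and 158MB file size are taken from the slides.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Every block is full-sized except possibly the last one.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks


print(split_into_blocks(158))  # [64, 64, 30]
```

Only the last block is smaller than the block size, which matches the 64MB, 64MB, 30MB layout of the 158MB example file.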
The MapReduce
Programming
Model
You specify map()
and reduce()
functions.
The framework
does the rest.
map()
map: (K₁, V₁) → list(K₂, V₂)
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
Key: userimage
Value: 2326 bytes
The map function runs on the same node as the data
was stored!
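A map function for the log line above might look like the following Python sketch. It is an assumption-laden illustration, not Hadoop's Java API: `map_fn` is a hypothetical name, and the field positions assume the access-log layout shown on the slide.

```python
def map_fn(key, value):
    """key: byte offset of the line (unused here); value: one access-log line."""
    fields = value.split()
    path = fields[6]                 # request path, e.g. "/userimage/123"
    size = int(fields[-1])           # response size in bytes
    category = path.strip("/").split("/")[0]
    return [(category, size)]        # list of (K2, V2) pairs


line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /userimage/123 HTTP/1.0" 200 2326')
print(map_fn(193284, line))  # [('userimage', 2326)]
```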
Input Format
• Wait! HDFS is not a Key-Value store!
• InputFormat interprets bytes as a Key
and Value
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700]
"GET /userimage/123 HTTP/1.0" 200 2326
Key: log offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36
-0700] "GET /userimage/123 HTTP/1.0" 200 2326”
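The idea of turning raw bytes into (key, value) records can be sketched as below. This is a simplified stand-in for a line-oriented input format, written in Python for illustration; `text_input_format` is a hypothetical name and the sketch ignores block boundaries entirely.

```python
def text_input_format(data: bytes):
    """Yield (byte offset, line text) records from a chunk of raw bytes."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The key is where the line starts; the value is the line itself.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


records = list(text_input_format(b"first\nsecond line\nthird"))
print(records)  # [(0, 'first'), (6, 'second line'), (18, 'third')]
```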
The Shuffle
Each map output is assigned to a
“reducer” based on its key
map output is grouped and
sorted by key
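A toy version of the shuffle, assuming hash partitioning on the key: each pair is routed to one reducer, and each reducer sees its keys grouped and sorted. The function names are hypothetical, and `zlib.crc32` is used only as a convenient deterministic hash for the sketch.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer for a key (same key always goes to the same reducer)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group map output by key per reducer, sorted by key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        reducers[partition(key, num_reducers)][key].append(value)
    return [sorted(groups.items()) for groups in reducers]


pairs = [("userimage", 2326), ("css", 10), ("userimage", 1000)]
for reducer_id, groups in enumerate(shuffle(pairs, 2)):
    print(reducer_id, groups)
```

Because the partition depends only on the key, all values for "userimage" end up at a single reducer regardless of which map task produced them.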
reduce()
reduce: (K₂, iter(V₂)) → list(K₃, V₃)
Key: userimage
Value: 2326 bytes (from map task 0001)
Value: 1000 bytes (from map task 0008)
Value: 3020 bytes (from map task 0120)
Key: userimage
Value: 6346 bytes
userimage \t 6346
TextOutputFormat
Reducer function
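The reduce step for this example is a sum over the values for one key, followed by writing a tab-separated line. A minimal Python sketch, with hypothetical function names standing in for the reducer and the text output format:

```python
def reduce_fn(key, values):
    """Sum all byte counts seen for one key."""
    return [(key, sum(values))]


def text_output_format(pairs):
    """Render (key, value) pairs as tab-separated lines."""
    return "".join(f"{key}\t{value}\n" for key, value in pairs)


output = reduce_fn("userimage", iter([2326, 1000, 3020]))
print(output)                             # [('userimage', 6346)]
print(repr(text_output_format(output)))   # 'userimage\t6346\n'
```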
Putting it together...
Hadoop is not NoSQL
(sorry!)
Hive project adds SQL
support to Hadoop
HiveQL (SQL dialect)
compiles to a query plan
Query plan executes as
MapReduce jobs
Hive Example
-- Tab-delimited text table for the MovieLens ratings
CREATE TABLE movie_rating_data (
  userid INT, movieid INT, rating INT, unixtime STRING
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
-- Move the dataset files into the table
LOAD DATA INPATH '/datasets/movielens' INTO TABLE
movie_rating_data;
-- Average rating per movie
CREATE TABLE average_ratings AS
SELECT movieid, AVG(rating) AS avg_rating FROM movie_rating_data
GROUP BY movieid;
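Conceptually, a GROUP BY average like the one above maps onto the map and reduce phases: the map side emits (movieid, rating) and the reduce side averages per movie. The Python sketch below shows only that idea; it is not Hive's actual query plan, and the function names and sample rows are made up for illustration.

```python
from collections import defaultdict


def map_rating(row):
    """Parse one tab-delimited row and emit (movieid, rating)."""
    userid, movieid, rating, unixtime = row.split("\t")
    return int(movieid), int(rating)


def average_ratings(rows):
    """Group ratings by movie (the shuffle), then average them (the reduce)."""
    grouped = defaultdict(list)
    for row in rows:
        movieid, rating = map_rating(row)
        grouped[movieid].append(rating)
    return {m: sum(r) / len(r) for m, r in sorted(grouped.items())}


rows = ["1\t10\t4\t881250949", "2\t10\t2\t881250950", "1\t20\t5\t881250951"]
print(average_ratings(rows))  # {10: 3.0, 20: 5.0}
```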
The Hadoop
Ecosystem
(Column DB)
Hadoop in the Wild
(yes, it's used in production)
Yahoo! Hadoop Clusters: > 82PB, >25k machines
(Eric14, HadoopWorld NYC '09)
Facebook: 15TB new data per day;
1200 machines, 21PB in one cluster
Twitter: ~1TB per day, ~80 nodes
Lots of 5-40 node clusters at companies without
petabytes of data (web, retail, finance, telecom,
research)
What about real time
access?
• MapReduce is a batch system
• The fastest MR job takes 24 seconds
• HDFS just stores bytes, and is append-
only
• Not about to serve data for your next
web site.
Apache HBase
HBase is an
open source, distributed,
sorted map
modeled after Google's BigTable
HBase is built on
Hadoop
• Hadoop provides:
• Fault tolerance
• Scalability
• Batch processing with MapReduce
HDFS + HBase
= HDFS + random read/write
• HBase uses HDFS for storage
• “Log structured merge trees”
• Similar to “log structured file systems”
• Same storage pattern as Cassandra!
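The log-structured merge idea can be sketched as an in-memory buffer that is periodically flushed to immutable sorted files, with reads checking the newest data first. This toy Python class is an assumption-heavy illustration (the name `TinyLSM` and the flush threshold are invented), and it leaves out parts a real system needs, such as a write-ahead log and compaction.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}      # recent writes, held in memory
        self.sstables = []      # flushed sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one immutable, sorted run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):      # newest run wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # triggers a flush
db.put("a", "3")   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))  # 3 2
```

Writes never modify a flushed run in place, which is what lets this pattern sit on top of append-only storage.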
A Big Sorted Map
Row key  Column key  Timestamp      Cell
Row1     info:aaa    1273516197868  valueA
Row1     info:bbb    1273871824184  valueB
Row1     info:bbb    1273871823022  oldValueB
Row1     info:ccc    1273746289103  valueC
Row2     info:hello  1273878447049  i_am_a_value
Row3     info:       1273616297446  another_value
Sorted by row key and column.
Timestamp is a long value.
Row1 / info:bbb has 2 versions of this cell.
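The ordering in the table can be reproduced with a sort key of (row, column, descending timestamp), so the newest version of a cell comes first. A small Python sketch using a few cells from the table; `sort_key` and `latest` are hypothetical helpers, not HBase functions.

```python
cells = [
    ("Row1", "info:bbb", 1273871823022, "oldValueB"),
    ("Row2", "info:hello", 1273878447049, "i_am_a_value"),
    ("Row1", "info:aaa", 1273516197868, "valueA"),
    ("Row1", "info:bbb", 1273871824184, "valueB"),
]


def sort_key(cell):
    row, column, timestamp, _ = cell
    return (row, column, -timestamp)   # newest version sorts first


def latest(cells, row, column):
    """Return the newest value stored for (row, column), or None."""
    for r, c, _, value in sorted(cells, key=sort_key):
        if (r, c) == (row, column):
            return value
    return None


print(latest(cells, "Row1", "info:bbb"))  # valueB
```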
HBase API
• get(row)
• put(row, map<column, value>)
• scan(key range, filter)
• increment(row, columns)
• … (checkAndPut, delete, etc…)
• MapReduce/Hive
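To make the shape of these operations concrete, here is a toy in-memory table in Python. It is not the HBase client API: `ToyTable` and its method signatures are invented for illustration, it keeps a single version per cell, and `increment` takes one column at a time for simplicity.

```python
class ToyTable:
    def __init__(self):
        self.rows = {}   # row key -> {column: value}

    def put(self, row, columns):
        self.rows.setdefault(row, {}).update(columns)

    def get(self, row):
        return dict(self.rows.get(row, {}))

    def scan(self, start_row, stop_row, row_filter=None):
        """Yield rows with start_row <= key < stop_row, in key order."""
        for row in sorted(self.rows):
            if not (start_row <= row < stop_row):
                continue
            if row_filter is None or row_filter(row, self.rows[row]):
                yield row, dict(self.rows[row])

    def increment(self, row, column, amount=1):
        cols = self.rows.setdefault(row, {})
        cols[column] = cols.get(column, 0) + amount
        return cols[column]


table = ToyTable()
table.put("Row1", {"info:aaa": "valueA"})
table.put("Row2", {"info:hello": "i_am_a_value"})
table.put("Row3", {"info:": "another_value"})
print([row for row, _ in table.scan("Row1", "Row3")])  # ['Row1', 'Row2']
```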
HBase Architecture
HBase in Numbers
• Largest cluster: 600 nodes, ~600TB
• Most clusters: 5-20 nodes, 100GB-4TB
• Writes: 1-3ms, 1k-10k writes/sec per node
• Reads: 0-3ms cached, 10-30ms disk
• 10-40k reads / second / node from cache
• Cell size: 0-3MB preferred
HBase compared
• Favors Consistency over Availability (but
availability is good in practice!)
• Great Hadoop integration (very efficient
bulk loads, MapReduce analysis)
• Ordered range partitions (not hash)
• Automatically shards/scales (just turn on
more servers)
• Sparse column storage (not key-value)
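Ordered range partitioning means a row key is located by finding the range whose start key precedes it, which keeps neighbouring keys together and makes range scans cheap. A minimal Python sketch, assuming made-up split points and server names:

```python
import bisect

# Hypothetical split points: each range starts at its key and runs to the next.
region_start_keys = ["", "g", "n", "t"]
region_servers = ["server-a", "server-b", "server-c", "server-d"]


def find_region(row_key):
    """Return the server holding the range that contains row_key."""
    index = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_servers[index]


print(find_region("apple"))   # server-a
print(find_region("grape"))   # server-b
print(find_region("zebra"))   # server-d
```

With hash partitioning, "grape" and "grapefruit" would likely land on different servers; here they fall in the same range.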
HBase in Production
• Facebook (product release soon)
• StumbleUpon / su.pr
• Mozilla (receives crash reports)
• … many others
Ok, fine, what next?
Get Hadoop!
Cloudera's Distribution for Hadoop
http://cloudera.com/
http://hadoop.apache.org/
Try it out! (Locally, VM, or EC2)
Watch free training videos on
http://cloudera.com/
Available in
Japanese!
Questions?
• todd@cloudera.com
• (feedback? yes!)
• (hiring? yes!)