Is hadoop for you

Is Hadoop For You?
Gwen Shapira, Solutions Architect

About Me
• Solutions Architect @ Cloudera
• Making our customers successful
• Formerly:
  • Database consultant @ Pythian
  • Specializing in Exadata, RAC, replication
• Oracle ACED, Oak Table Member
• @gwenshap <- Hadoop tips in 140 characters

Agenda
Answer the question:
Who needs Hadoop?

In more detail…
[Bar chart: % of session spent on each topic, axis from 0% to 50%]
• What's so special about Hadoop
• Basic Hadoop Architecture
• When to Hadoop
• What you need to succeed
• Getting Started

What’s so special about Hadoop?
Technically Speaking

Databases in 1999
1. Buy a really big machine
2. Install an expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another really big machine as a backup

Problems:
• Reliability
• Scalability
• Storage throughput
• Complex Upgrades
• Relational only

Exadata: State of the Art - 2007
1. Storage and compute in one rack
2. Cluster with Infiniband interconnect
3. Balanced architecture
4. Offloading
5. Parallelism
6. Compression

Hadoop
• Distributed File System
• Programming Framework
• Many projects on top
• Open Source
(This means free)

Designed For:
• Reliability
• Parallel Processing
• Scalability
• Flexibility

Reminders:
• Disk does a seek for each I/O operation
• Seeks are expensive (~10ms)
• Big I/Os mean better throughput
• Network is fast inside rack
• Slower between racks
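
A rough worked example of the first three reminders. The 10 ms seek comes from the slide; the 100 MB/s sequential transfer rate is an assumed, illustrative figure, not a number from the talk. Under those assumptions, the sketch shows how much throughput survives when every I/O pays one seek:

```python
SEEK_SECONDS = 0.010          # ~10 ms per seek, as stated on the slide
TRANSFER_MB_PER_SEC = 100.0   # assumed sequential disk throughput (illustrative)


def effective_throughput(io_size_mb: float) -> float:
    """MB/s achieved when every I/O of io_size_mb pays exactly one seek."""
    transfer_seconds = io_size_mb / TRANSFER_MB_PER_SEC
    return io_size_mb / (SEEK_SECONDS + transfer_seconds)


if __name__ == "__main__":
    # 8 KB, 1 MB and 64 MB per I/O operation
    for size_mb in (0.008, 1, 64):
        rate = effective_throughput(size_mb)
        print(f"{size_mb:>7} MB per I/O -> {rate:6.1f} MB/s")
```

With these assumed numbers, small I/Os spend almost all their time seeking, while a 64 MB I/O runs at close to the full sequential rate.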

The File System
• Files are split into 64 MB blocks
• 64 MB!!!
• Distributed
• Replicated
• Write-Once
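
A minimal sketch of what "split, distributed, replicated" could look like. The 64 MB block size is from the slide and 3 is HDFS's default replication factor; the round-robin `place_replicas` function is a hypothetical stand-in, since real HDFS placement is rack-aware and decided by the NameNode.

```python
from typing import Dict, List

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size named on the slide
REPLICATION = 3                 # HDFS default replication factor


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block of a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)   # the last block may be smaller
    return sizes


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int = REPLICATION) -> Dict[int, List[str]]:
    """Hypothetical round-robin placement (not the real HDFS policy)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [datanodes[(block_id + i) % len(datanodes)]
                   for i in range(replication)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
    print(len(blocks), "blocks")
    print(place_replicas(len(blocks), ["DN1", "DN2", "DN3", "DN4"]))
```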

HDFS Architecture
[Diagram: one NameNode and four DataNodes]
• NameNode holds the metadata: paths, filenames, file sizes, block locations, …

HDFS Architecture
[Diagram: one NameNode and four DataNodes]
• DataNodes hold the data: blocks, checksums

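HDFS Write Path
[Diagram: Client, NameNode, and four DataNodes (DN 1 to DN 4) across Rack 1 and Rack 2]
1. Client calls create(“/tmp/myfile”) on the NameNode
2. NameNode replies: write to [DN4,DN3,DN2]
3. Client writes to DN4, passing along the remaining pipeline [DN3,DN2]
4. DN4 forwards to DN3 with [DN2], and DN3 forwards to DN2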
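
The shrinking lists on this slide ([DN4,DN3,DN2], then [DN3,DN2], then [DN2]) suggest a pipeline in which each DataNode stores the block and forwards it to the rest of the list. A toy, in-memory sketch of that idea follows; `pipeline_write` and the `storage` dictionary are hypothetical names for illustration, not HDFS APIs.

```python
from typing import Dict, List


def pipeline_write(block: bytes, pipeline: List[str],
                   storage: Dict[str, List[bytes]]) -> List[str]:
    """Store `block` on the first DataNode, then forward it down the pipeline.

    Returns the acknowledgements in the order they travel back towards
    the client: the last DataNode in the pipeline acknowledges first.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(block)    # this node keeps a replica
    print(f"{head} stored the block, forwarding to {rest}")
    acks = pipeline_write(block, rest, storage)   # downstream acks arrive first
    return acks + [head]


if __name__ == "__main__":
    storage: Dict[str, List[bytes]] = {}
    acks = pipeline_write(b"hello hdfs", ["DN4", "DN3", "DN2"], storage)
    print("acks received in order:", acks)
```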

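HDFS Read Path
[Diagram: Client, NameNode, and four DataNodes (DN 1 to DN 4) across Rack 1 and Rack 2]
1. Client calls open(“/tmp/myfile”,“r”) on the NameNode
2. NameNode replies: read from [DN4,DN3,DN2]
3. Client reads the data from a DataNode in that list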
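
One way to picture the read path: the client walks the replica list it received and takes the first copy that is reachable and passes its checksum. This is a simplified sketch; `read_block`, `fetch` and `ReplicaUnavailable` are hypothetical names, and `zlib.crc32` merely stands in for the checksums that DataNodes store alongside blocks.

```python
import zlib
from typing import Callable, List, Tuple


class ReplicaUnavailable(Exception):
    """Raised by the hypothetical fetch function when a DataNode is down."""


def read_block(replicas: List[str], fetch: Callable[[str], bytes],
               expected_crc: int) -> Tuple[str, bytes]:
    """Return (node, data) from the first replica that is up and uncorrupted."""
    for node in replicas:
        try:
            data = fetch(node)
        except ReplicaUnavailable:
            continue                      # node is down, try the next replica
        if zlib.crc32(data) == expected_crc:
            return node, data             # healthy copy found
        # checksum mismatch: treat the copy as corrupt and keep looking
    raise IOError("no healthy replica found")
```

For example, if DN4 is down and DN3 returns corrupted bytes, the call falls through to DN2.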

Map-Reduce
• Java Framework
• Works on Key-Value pairs
• Map:
  • Operate on every element
  • Filter or transform
  • Code runs where the data is stored
• Shuffle:
  • Redistribution of data
• Reduce:
  • Aggregate or Join
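
The three phases can be imitated in a single process. The sketch below is not the Hadoop Java API; `run_mapreduce` and the log format are assumptions made for illustration. The map step filters and transforms log lines into key-value pairs, the shuffle groups values by key, and the reduce step aggregates each group.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_mapreduce(records: Iterable[str],
                  map_fn: Callable[[str], Iterable[Tuple[str, int]]],
                  reduce_fn: Callable[[str, List[int]], int]) -> Dict[str, int]:
    # Map: operate on every element, emitting zero or more (key, value) pairs
    mapped = [pair for record in records for pair in map_fn(record)]
    # Shuffle: redistribute the pairs so all values for a key end up together
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: aggregate the values of each key
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def server_errors(line: str) -> Iterator[Tuple[str, int]]:
    """Map function: keep only 5xx responses (filter), emit (url, 1) (transform)."""
    _method, url, status = line.split()
    if int(status) >= 500:
        yield url, 1


if __name__ == "__main__":
    log = ["GET /a 200", "GET /b 500", "GET /b 503", "GET /a 500"]
    print(run_mapreduce(log, server_errors, lambda key, values: sum(values)))
```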

MapReduce Architecture
[Diagram: JobTracker and NameNode; DataNodes DN 1 to DN 4 across Rack 1 and Rack 2, each with a TaskTracker (TT 1 to TT 4)]
JobTracker:
• Gateway for users
• Assigns tasks to TaskTrackers
• Tracks job status

[Diagram: JobTracker and NameNode; DataNodes DN 1 to DN 4 across Rack 1 and Rack 2, each with a TaskTracker (TT 1 to TT 4)]
• TaskTrackers execute Map and Reduce tasks assigned by the JobTracker

Word Count Example
Mapper Input:
The cat sat on the mat
The aardvark sat on the sofa
Mapping:
The, 1; cat, 1; sat, 1; on, 1; the, 1; mat, 1
The, 1; aardvark, 1; sat, 1; on, 1; the, 1; sofa, 1
Shuffling:
aardvark, 1
cat, 1
mat, 1
on [1, 1]
sat [1, 1]
sofa, 1
the [1, 1, 1, 1]
Reducing:
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4
Final Result:
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4
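
The word count stages on this slide can be reproduced with a sort-based shuffle in plain Python. This is a single-process sketch, not Hadoop code; the function names are made up for illustration, and lower-casing the words is an assumption made so that "The" and "the" are counted together, as the slide's final result implies.

```python
from itertools import groupby
from operator import itemgetter
from typing import List, Tuple


def map_words(line: str) -> List[Tuple[str, int]]:
    """Mapping: emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Shuffling: sort by key, then collect the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_counts(grouped: List[Tuple[str, List[int]]]) -> List[Tuple[str, int]]:
    """Reducing: sum the list of ones for each word."""
    return [(key, sum(values)) for key, values in grouped]


if __name__ == "__main__":
    lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]
    mapped = [pair for line in lines for pair in map_words(line)]
    for word, count in reduce_counts(shuffle(mapped)):
        print(f"{word}, {count}")
```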

[Diagram: wordcount(<files>) is submitted to the JobTracker; TaskTrackers TT 1 to TT 4 run map tasks M1 to M4 and reduce task R1; the mappers emit pairs such as [cat, 1], [dog, 1], [the, 1], [sat, 1]]

[Diagram: the same wordcount(<files>) job later on; map tasks M5 to M8 run and reduce task R1 outputs [a, 5], [cat, 2], [dog, 1], [the, 4], [mat, 1]]

Compare to Oracle PX
• Mappers -> Producers
• Reducers -> Consumers
• Shuffle -> Re-distribution

In Short
Benefits
• Reliable
• Scalable
• Infinite Flexibility
• Cheap
Challenges
• New skills
• Infinitely Flexible
• Feature-completeness
• Best practices and examples

When to Hadoop?
When Relational Databases Don’t Add Benefits

Non-relational Data
• XML
• Logs
• Geospatial data
• Video

Adding to the Data Warehouse
• ETL
• History
• Some reports
• Rocket Data Science

Toolset for DBAs
• Hive – Turn SQL to Map-Reduce
• Streaming – Map-Reduce in any language
• Pig – Write and Execute execution plans
• Oozie – Coordinate workflows
• Impala – real-time SQL
• HBase – key-value real-time data store
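
To make "Map-Reduce in any language" concrete: Hadoop Streaming feeds input lines to a mapper on standard input, expects tab-separated key and value lines on standard output, and hands the reducer those lines sorted by key. A hedged sketch of a word count in that style follows; the single script with a `map` or `reduce` argument is an illustrative layout, not something taken from the talk.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit 'word<TAB>1' for every word read from the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: `python wc.py map` or `python wc.py reduce`
    stage = mapper if sys.argv[1] == "map" else reducer
    for output_line in stage(sys.stdin):
        print(output_line)
```

The same pair of functions can be tried locally by piping the mapper output through a sort before the reducer, which mimics what the framework does between the two stages.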

Data Model
• Partitions
• Batch processing
• Star Schema
• Materialized Views
• Sort and Compress
• De-normalize
• Tune the data
• Nested data structures

Right Hardware
• If possible – POC with your workload
• Sizing by storage
• You probably need to over-provision
• Machine reliability
• Big Data Appliance is a good start
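
"Sizing by storage" can be turned into back-of-the-envelope arithmetic. In the sketch below only the replication factor of 3 is an HDFS default; the 25% headroom for temporary and intermediate data, the 24 TB of disk per node, and the `nodes_needed` helper itself are assumptions chosen for illustration, so substitute your own figures.

```python
import math


def nodes_needed(raw_data_tb: float, disk_per_node_tb: float,
                 replication: int = 3, headroom: float = 0.25) -> int:
    """Estimate the number of DataNodes needed to hold raw_data_tb of data.

    replication: copies HDFS keeps of every block (3 is the HDFS default).
    headroom:    assumed fraction of each node's disk kept free for
                 temporary and intermediate data.
    """
    stored_tb = raw_data_tb * replication
    usable_per_node_tb = disk_per_node_tb * (1 - headroom)
    return math.ceil(stored_tb / usable_per_node_tb)


if __name__ == "__main__":
    # 100 TB of raw data on nodes with an assumed 24 TB of disk each
    print(nodes_needed(100, 24), "nodes")
```

With these assumed inputs, 100 TB of raw data becomes 300 TB stored, each node contributes 18 TB of usable space, and the estimate rounds up to 17 nodes.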

Non-technical Advice
• Your team will have to learn a lot
• Be ready for a challenge

Why get started?
• Hadoop projects are more visible
• 48% of Hadoop clusters are owned by DWH team
• Big Data == Business pays attention to data
• New skills – from coding to cluster administration
• Interesting projects
• No, you don’t need to learn Java

Beginner Projects
• Install a 5-node Hadoop cluster in AWS
• Load data:
  • Complete works of Shakespeare
  • MovieLens database
• Find the 10 most common words in Shakespeare
• Find the 10 most recommended movies
• Run TPC-H
• Cloudera Data Science Challenge
• Actual use-case:
XML ingestion, ETL process, DWH history
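
Before running the "most common words" project on a cluster, it can help to check the expected logic on a laptop. The sketch below is a single-process stand-in, not a Hadoop job; `top_words` and its simple tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Tuple


def top_words(text: str, n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most common words in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


if __name__ == "__main__":
    sample = "To be, or not to be, that is the question"
    for word, count in top_words(sample, n=3):
        print(word, count)
```

On the cluster, the counting step would run as a word count job and the top-10 selection as a small second pass over its output.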

Need Help?
• I can help:
  • @gwenshap
  • gshapira@cloudera.com
• Hadoop Community:
  • http://community.cloudera.com
  • user@hadoop.apache.org
  • Google group: CDH Users
