Hadoop, Taming Elephants
Presentation Transcript

• Hadoop, Taming Elephants
  JaxLUG, 2013
  Ovidiu Dimulescu
• About @odimulescu
  ‣ Working on the Web since 1997
  ‣ Into startup and engineering cultures
  ‣ Speaker at user groups and code camps
  ‣ Founder and organizer of JaxMUG.com
  ‣ Organizer of the Jax Big Data meetup
• Agenda
  ‣ Background
  ‣ Architecture v1.0 & 2.0
  ‣ Ecosystem
  ‣ Installation
  ‣ Security
  ‣ Monitoring
  ‣ Demo
  ‣ Q & A
• What is Hadoop?
  ‣ Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
  ‣ Created by Doug Cutting (Lucene & Nutch creator)
  ‣ Named after Doug's son's toy elephant
• What is it solving, and how?
  ‣ Processing diverse, large datasets in practical time at low cost
  ‣ Consolidates data in a distributed file system
  ‣ Moves computation to data rather than data to computation
  ‣ Simpler programming model
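The "simpler programming model" referred to above is MapReduce: a map function emits key-value pairs and a reduce function aggregates them per key. A minimal, cluster-free sketch of that model in plain Python (function names `map_phase` and `reduce_phase` are illustrative, not Hadoop API names):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: sort by key, group, and sum the counts per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["Hadoop tames elephants", "elephants remember Hadoop"]
print(dict(reduce_phase(map_phase(lines))))
# {'elephants': 2, 'hadoop': 2, 'remember': 1, 'tames': 1}
```

On a real cluster the framework runs the map phase in parallel near the data blocks and handles the sort/shuffle between phases; the user writes only the two functions.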
• Why does it matter?
  ‣ Volume, Velocity, Variety and Value
  ‣ Datasets do not fit on local HDDs, let alone RAM
  ‣ Scaling up:
    - Is expensive (licensing, hardware, etc.)
    - Has a ceiling (physical, technical, etc.)
• Why does it matter? Data types
  [Chart: ~80% of data is complex (images, video, logs, documents, call records, sensor data, mail archives); ~20% is structured (user profiles, CRM, HR records). Chart source: IDC White Paper]
• Why does it matter?
  ‣ Scanning 10 TB at a sustained transfer rate of 75 MB/s takes ~2 days on 1 node, ~5 hrs on a 10-node cluster
  ‣ Low $/TB for commodity drives
  ‣ Low-end servers are multi-core capable
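The scan-time figures above can be sanity-checked with back-of-the-envelope arithmetic (single sequential reader per node, ignoring replication and coordination overhead):

```python
tb = 10 * 1024**4      # 10 TB in bytes
rate = 75 * 1024**2    # 75 MB/s in bytes per second

single_node_hours = tb / rate / 3600
print(f"1 node:   {single_node_hours:.1f} h (~{single_node_hours / 24:.1f} days)")

ten_node_hours = single_node_hours / 10
print(f"10 nodes: {ten_node_hours:.1f} h")
```

The raw math gives roughly 39 hours on one node and roughly 4 hours on ten; real clusters land nearer the slide's ~2 days and ~5 hrs once scheduling and I/O contention are included.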
• Use cases
  ‣ ETL - Extract, Transform, Load
  ‣ Pattern recognition
  ‣ Recommendation engines
  ‣ Prediction models
  ‣ Log processing
  ‣ Data "sandbox"
    • Who uses it?
    • Who supports it?
• What is Hadoop not?
  ‣ Not a database replacement
  ‣ Not a data warehouse (it complements one)
  ‣ Not for interactive reporting
  ‣ Not a general-purpose storage mechanism
  ‣ Not for problems that are not parallelizable in a shared-nothing fashion
• Architecture - Core Components
  ‣ HDFS: distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster
  ‣ MapReduce: programming model for processing and generating large data sets
• HDFS - Design
  ‣ Files are stored as blocks (64 MB default size)
  ‣ Configurable data replication (3x, Rack Aware*)
  ‣ Fault tolerant, expects HW failures
  ‣ HUGE files, expects streaming access rather than low latency
  ‣ Mostly WORM (write once, read many)
  ‣ Not POSIX compliant
  ‣ Not mountable OOTB*
• HDFS - Architecture
  ‣ Read path: the client asks the NameNode (NN) for a file, the NN returns the DataNodes (DNs) that host its blocks, and the client then reads the data directly from those DNs
  ‣ NameNode - Master: holds filesystem metadata, controls reads/writes to files, manages block replication; single namespace, single block pool
  ‣ DataNode - Slaves: read/write blocks to/from clients, replicate blocks at the master's request, notify the master about block IDs
• HDFS - Fault tolerance
  ‣ DataNode:
    - Uses CRC32 to detect corruption
    - Data is replicated on other nodes (3x)*
  ‣ NameNode:
    - fsimage - last snapshot
    - edits - change log since last snapshot
    - Checkpoint Node
    - Backup NameNode
    - Failover is manual*
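The DataNode's CRC32 corruption check can be illustrated with Python's `zlib`. This is a sketch of the idea only, not HDFS's on-disk checksum format; HDFS stores one checksum per fixed-size chunk of each block (512 bytes by default), which is what lets it pinpoint the damaged region and re-fetch just that block from a replica:

```python
import zlib

CHUNK = 512  # bytes per checksum, mirroring the HDFS default

def checksums(block: bytes):
    # One CRC32 per fixed-size chunk of the block
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

block = b"x" * 2048
stored = checksums(block)

# Simulate a single byte flipped by disk corruption
corrupted = block[:700] + b"y" + block[701:]

bad = [i for i, (a, b) in enumerate(zip(stored, checksums(corrupted))) if a != b]
print("corrupt chunks:", bad)
# corrupt chunks: [1]  (byte 700 falls in the second 512-byte chunk)
```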
• MapReduce - Architecture
  ‣ A client launches a job (configuration, mapper, reducer, input, output) against the JobTracker (JT) API
  ‣ JobTracker - Master:
    - Accepts MR jobs submitted by clients
    - Assigns Map and Reduce tasks to TaskTrackers
    - Monitors tasks and TaskTracker status, re-executes tasks upon failure
    - Speculative execution
  ‣ TaskTracker - Slaves:
    - Run the Map and Reduce tasks received from the JobTracker
    - Manage storage and transmission of intermediate output
• Hadoop - Core Architecture
  [Diagram: the JobTracker coordinates TaskTrackers 1..N, each co-located with a DataNode; the NameNode manages HDFS metadata underneath. * Mini OS: Filesystem & Scheduler]
• Hadoop 2.0 - HDFS Architecture
  ‣ Distributed namespace
  ‣ Multiple block pools
    • Hadoop 2.0 - YARN Architecture
• MapReduce - Clients
  ‣ Java - native: hadoop jar jar_path main_class input_path output_path
  ‣ C++ - Pipes framework: hadoop pipes -input path_in -output path_out -program exec_program
  ‣ Any - Streaming: hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
  ‣ Pig Latin, Hive HQL, C via JNI
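The Streaming client runs any executable that reads lines on stdin and writes tab-separated key-value records on stdout. A minimal word-count mapper/reducer pair in Python could look like the sketch below (a single illustrative script; the file name and the map/reduce argument convention are assumptions, not part of the Streaming contract):

```python
import sys
from collections import defaultdict

def map_lines(lines):
    # Mapper: emit one "word<TAB>1" record per word
    return [f"{word}\t1" for line in lines for word in line.split()]

def reduce_lines(lines):
    # Reducer: Streaming delivers records sorted by key, so summing per word is safe
    counts = defaultdict(int)
    for line in lines:
        word, n = line.rstrip("\n").rsplit("\t", 1)
        counts[word] += int(n)
    return [f"{word}\t{total}" for word, total in counts.items()]

if __name__ == "__main__":
    phase = map_lines if sys.argv[1:2] == ["map"] else reduce_lines
    print("\n".join(phase(sys.stdin)))
```

Saved as, say, `wc.py`, it would be submitted with the Streaming invocation shown above: `-mapper 'python wc.py map' -reducer 'python wc.py reduce'`.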
• Hadoop - Ecosystem
  ‣ Management: ZooKeeper, Chukwa, Ambari, HUE
  ‣ Data Access: Pig, Hive, Flume, Impala, Sqoop
  ‣ Data Processing: MapReduce, Giraph, Hama, Mahout, MPI
  ‣ Storage: HDFS, HBase
• Installation - Platforms
  ‣ Production: Linux (official)
  ‣ Development: Linux, OS X, Windows via Cygwin, *nix
• Installation - Versions
  ‣ Public numbering:
    - 1.0.x - current stable version
    - 1.1.x - current beta version for the 1.x branch
    - 2.x - current alpha version
  ‣ Development numbering:
    - 0.20.x aka 1.x - CDH 3 & HDP 1
    - 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
• Installation - For toying
  ‣ Option 1 - Official project releases: hadoop.apache.org/common/releases.html
  ‣ Option 2 - Demo VM from vendors: Cloudera, Hortonworks, Greenplum, MapR
  ‣ Option 3 - Cloud: Amazon's EMR, Hadoop on Azure
• Installation - For real
  ‣ Vendor distributions: Cloudera CDH, Hortonworks HDP, Greenplum GPHD, MapR M3/M5/M7
  ‣ Hosted solutions: AWS EMR, Hadoop on Azure
  ‣ Use virtualization - VMware Serengeti *
• Security - Simple Mode
  ‣ Use in a trusted environment:
    - Identity comes from the euid of the client process
    - MapReduce tasks run as the TaskTracker user
    - The user that starts the NameNode is the super-user
  ‣ Reasonable protection against accidental misuse
  ‣ Simple to set up
• Security - Secure Mode
  ‣ Kerberos based; use for tight, granular access:
    - Identity comes from a Kerberos principal
    - MapReduce tasks run as the Kerberos principal
  ‣ Use a dedicated MIT KDC
  ‣ Hook it to your primary KDC (AD, etc.)
  ‣ Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
• Monitoring
  ‣ Built-in: JMX, REST; no SNMP support
  ‣ Other: Cloudera Manager (free up to 50 nodes); Ambari - free, for RPM-based systems (RHEL, CentOS)
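The built-in JMX metrics are also exposed over REST: each Hadoop daemon serves its MBeans as JSON from a `/jmx` servlet on its web port (the NameNode web UI listens on port 50070 in Hadoop 1.x). A hedged sketch of pulling and filtering those metrics; the hostname and bean-name fragment below are illustrative:

```python
import json
from urllib.request import urlopen

def fetch_beans(url):
    # Every Hadoop daemon publishes its JMX MBeans as JSON under /jmx
    with urlopen(url) as resp:
        return json.load(resp)["beans"]

def find_bean(beans, name_fragment):
    # Pick out one MBean by a fragment of its JMX object name
    return next((b for b in beans if name_fragment in b.get("name", "")), None)

# Against a running cluster (hostname illustrative):
#   beans = fetch_beans("http://namenode:50070/jmx")
#   fs = find_bean(beans, "FSNamesystem")
#   print(fs["CapacityRemaining"])
```

Cloudera Manager and Ambari build their dashboards on the same JMX data; polling `/jmx` directly is the lightest-weight option for custom checks.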
    • Demo
    • Questions ?
• References
  ‣ Hadoop Operations, by Eric Sammer
  ‣ Hadoop Security, by the Hortonworks Blog
  ‣ HDFS Federation, by Suresh Srinivas
  ‣ Hadoop 2.0 New Features, by VertiCloud Inc
  ‣ MapReduce in Simple Terms, by Saliya Ekanayake
  ‣ Hadoop Architecture, by Phillipe Julio