Introduction to Hadoop - ACCU2010
by Gavin Heavyside
- 3,302 views
Slides from my Introduction to Hadoop session at the ACCU2010 conference
Slides from my Introduction to Hadoop session at the ACCU2010 conference
Accessibility
Categories
Upload Details
Uploaded via SlideShare as Apple Keynote
Usage Rights
© All Rights Reserved
Statistics
- Likes
- 0
- Downloads
- 104
- Comments
- 0
- Embed Views
- Views on SlideShare
- 2,859
- Total Views
- 3,302
A server might have a MTBF of a few years
With thousands you can expect failures at least daily
System needs to be handle HW failures
8-core, 16-24GB ram, 4x1TB disks, gig ethernet
Locking, race conditions. Used for HBase distributed column-oriented database
Chukwa is large-scale log collection & analysis tools
Avro - like protocol buffers - move towards default IPC mechanism in Hadoop
Namenode is point of failure - not highly available
Mention lots of disks - JBOD: don’t stripe, don’t RAID
Rack-aware - splits will be processed local to the HDFS blocks where possible.
Pseudo distributed cluster can run all services on single machine
DN/TT on lots of machines, NN, SNN, JT depend on cluster size.
64MB block size not efficient for lots of small files -> concatenate them
Binary data (protocol buffers, avro)
Custom input/output formats
Talk about partitioning
Reducers have to wait until the map and sort/shuffle has finished
Combiners can run on Tasktracker
Can be single Reducer or many
Forward index: list of words per document
Inverted Index: list of documents containing word
Once you “get” it, you start to see how lots of problems can be parallelised
Map Tiles at Google - Tech Talk on YouTube
Streaming works in any language that can read/write stdin/stdout
Cascalog brand new; only heard about it last night
SPLIT
AVG, COUNT, CONCAT, MIN, MAX, SUM
Custom input/output formats
SQL is a declarative language, you specify an outcome, and the query planner generates a sequence of steps.
Ad-hoc queries using basic SQL
Custom input/output formats
Load Data
Select/Join/Group
Cloudera do commercial support
Other providers on Hadoop homepage
Plenty of hobbyist/small-scale MapReduce frameworks out there