Hadoop - Simple. Scalable.

Presentation Transcript

  • Hadoop - Simple. Scalable.
  • @markgunnels mark@catamorphiclabs.com
  • Java. Clojure. Ruby. Cloudera Certified
  • posscon.org April 15, 16, and 17
  • Agenda: Overview. Massively Large Data Sets and the problems therein. Distributed File System. MapReduce. Pig.
  • Overview
  • Doug Cutting: Genius
  • Favorite Hadoop Story: New York Times
  • 4 Terabytes of Source Articles.
  • 24 Hours.
  • 5.5 Terabytes of PDFs.
  • Did it again.
  • $240.
  • Infoporn from Yahoo: 73 hours. 490 TB shuffling. 280 TB output. 4000 nodes. 16 PB disk space. 32K cores. 64 TB RAM.
  • Hadoop solves...
  • Analyzing Massively Large Datasets
  • Two Problems. You have to distribute.
  • Data Storage: Capacity has grown much faster than read speeds. Datasets won't fit on one disk. Must tolerate node failure.
  • Data Analysis: Combine data from many machines. Must tolerate node failure.
  • How Hadoop solves these problems.
  • Send Code to Data. Not Data to Code.
  • Data Storage: HDFS
  • Name Node. Data Nodes. Master - Slave Relationship.
  • Shard massive files across multiple machines. MB, GB, and TB.
  • Tolerant of Node Failure: files are replicated across 3 nodes by default.
  • HDFS behaves like a normal file system. No true appends yet.
  • Demonstration.
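
The live demonstration is not captured in the transcript. As a rough illustration of the sharding and replication idea from the preceding slides, here is a toy Python sketch. It is not how HDFS actually places blocks: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions made for the example.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Toy model: split a file into fixed-size blocks and assign each
    block to `replication` distinct nodes, round-robin.
    (Hypothetical helper; real HDFS placement is more involved.)"""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def survives(placement, failed_node):
    """True if every block still has a replica on some other node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Assumed cluster of four data nodes and a 300-unit file in 64-unit blocks.
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(file_size=300, block_size=64, nodes=nodes)
```

With three replicas per block, losing any single node in this model still leaves every block readable, which is the property the "Tolerant of Node Failure" slide is after.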
  • Data Analysis: MapReduce
  • Job Tracker. Task Nodes. Master - Slave Relationship.
  • map
  • Demonstration
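
The demonstration is not in the transcript. A minimal Python sketch of `map`, applying one function independently to every element of a collection, might look like this. The sample lines and the `count_words` name are assumptions for illustration.

```python
def count_words(line):
    """Return the number of whitespace-separated words in a line."""
    return len(line.split())


lines = ["send code to data", "not data to code", "simple and scalable"]

# map applies count_words to each line independently, preserving order.
word_counts = list(map(count_words, lines))
```

Because each call depends only on its own input, the calls can in principle run anywhere and in any order.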
  • pmap
  • Demonstration
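
The demonstration is not in the transcript. `pmap` is Clojure's parallel `map`; a comparable Python sketch can be built on `concurrent.futures`. The `pmap` helper below is a hypothetical wrapper written for this example, not a standard Python function, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(line):
    """Return the number of whitespace-separated words in a line."""
    return len(line.split())


def pmap(fn, items, workers=4):
    """Hypothetical parallel map: run fn over items on a thread pool.
    Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


lines = ["send code to data", "not data to code", "simple and scalable"]
parallel_counts = pmap(count_words, lines)
```

Threads are used here to keep the sketch short; for CPU-bound work in CPython, `ProcessPoolExecutor` would be the more likely choice.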
  • reduce
  • Demonstration
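
The demonstration is not in the transcript. A minimal Python sketch of `reduce`, folding a collection down to a single value with a two-argument function, could be the following. The input numbers are assumed sample data.

```python
from functools import reduce
from operator import add

# Per-line word counts, e.g. the output of an earlier map step (assumed data).
word_counts = [4, 4, 3]

# reduce combines the values pairwise: ((0 + 4) + 4) + 3
total_words = reduce(add, word_counts, 0)
```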
  • (reduce (pmap))
  • Demonstration.
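
The demonstration is not in the transcript. Composing a parallel map with a reduce gives the shape of a MapReduce job in miniature. The Python sketch below counts words: each line is mapped to a partial count in parallel, then the partial counts are reduced into one total. All names and the sample input are assumptions; a real Hadoop job also shuffles and sorts between the two phases.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_line(line):
    """Map step: word frequencies for a single line."""
    return Counter(line.lower().split())


def merge(left, right):
    """Reduce step: combine two partial counts."""
    return left + right


lines = ["Send code to data", "Not data to code"]

# Parallel map over the lines, then fold the partial results together.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(count_line, lines))
totals = reduce(merge, partials, Counter())
```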
  • MapReduce: Java
  • Nobody likes it. :-)
  • MapReduce: Ruby. Python. Unix Utilities.
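
Scripting languages and Unix utilities can take part because Hadoop Streaming runs any executable as a mapper or reducer, passing records over standard input and output as tab-separated key/value lines, with the reducer's input sorted by key. The Python sketch below imitates that contract in-process for a word count. The function names and sample text are assumptions, and `sorted` stands in for the framework's shuffle-and-sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each run of identical keys in sorted input."""
    pairs = (record.split("\t", 1) for record in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(value) for _, value in group)}"


text = ["Send code to data", "Not data to code"]

# sorted() plays the role of the shuffle/sort between map and reduce.
shuffled = sorted(mapper(text))
result = list(reducer(shuffled))
```

In a real streaming job the mapper and reducer would be separate scripts reading `sys.stdin` and printing to standard output; keeping them as generator functions here just makes the sketch easy to run in one file.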
  • MapReduce: Clojure
  • Hadoop Ecosystem: Pig. ZooKeeper. Hive. Cascading.
  • Pig
  • HBase