10c introduction

Introduction: MapR and Hadoop
7/6/2012

© 2012 MapR Technologies Introduction 1

Introduction
Agenda
• Hadoop Overview
• MapReduce Overview
• Hadoop Ecosystem
• How is MapR Different?
• Summary


Introduction
Objectives
At the end of this module you will be able to:
• Explain why Hadoop is an important technology for effectively working with
Big Data
• Describe the phases of a MapReduce job
• Identify some of the tools used with Hadoop
• List the similarities and differences between MapR and other Hadoop
distributions


Hadoop Overview


Data is Growing Faster than Moore’s Law

Business Analytics Requires a New Approach

Data volume growing 44x
• 2010: 1.2 zettabytes
• 2020: 35.2 zettabytes
(IDC Digital Universe Study 2011)

Source: IDC Digital Universe Study, sponsored by EMC, May 2010

Before Hadoop
Web crawling to power search engines
• Must be able to handle gigantic data
• Must be fast!
Problem: databases (B-Tree) not so fast, and do not scale
Solution: Sort and Merge
• Eliminate the pesky seek time!
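
The sort-and-merge idea can be sketched as follows: keep the master data sorted by key, sort the batch of updates the same way, then stream both once and write out a new master, with no seeks into the middle of the data. This is a minimal in-memory illustration; the function name `merge_updates` and the `(key, value)` record layout are assumptions made for the example, not something the slides specify.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (key, value); this layout is an assumption for the sketch


def merge_updates(master: Iterable[Record], updates: Iterable[Record]) -> Iterator[Record]:
    """Stream two key-sorted inputs once and yield the updated master.

    Both inputs must already be sorted by key, with unique keys in each.
    An update replaces the master record with the same key, or is
    inserted if the key is new.
    """
    master_it, updates_it = iter(master), iter(updates)
    m = next(master_it, None)
    u = next(updates_it, None)
    while m is not None or u is not None:
        if u is None or (m is not None and m[0] < u[0]):
            yield m                      # no update for this key: copy through
            m = next(master_it, None)
        elif m is None or u[0] < m[0]:
            yield u                      # key not in master: insert
            u = next(updates_it, None)
        else:
            yield u                      # same key: the update wins
            m = next(master_it, None)
            u = next(updates_it, None)


if __name__ == "__main__":
    master = [("a", "1"), ("b", "2"), ("d", "4")]
    updates = sorted([("d", "40"), ("c", "3")])   # sort the batch first
    for record in merge_updates(master, updates):
        print(record)
```

On disk, the same loop would read two sorted files and write a third, so every access is sequential.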


How to Scale?
Big Data has Big Problems
• Petabytes of data
• MTBF on 1000s of nodes is < 1 day
• Something is always broken
• There are limits to scaling Big Iron
• Sequential and random access just don’t scale
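
A rough calculation shows why something is always broken at this scale. The sketch below assumes nodes fail independently at a constant rate, so the cluster-wide MTBF is the per-node MTBF divided by the node count. The three-year per-node figure is an illustrative assumption, not a number from the slides.

```python
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    """Mean time between failures anywhere in the cluster.

    Assumes independent nodes with constant failure rates, so the rates
    add up and the cluster MTBF is the node MTBF divided by the node count.
    """
    return node_mtbf_days / nodes


NODE_MTBF_DAYS = 3 * 365  # assumed: each node fails about once every three years

if __name__ == "__main__":
    for n in (100, 1000, 5000):
        days = cluster_mtbf_days(NODE_MTBF_DAYS, n)
        print(f"{n:>5} nodes: one failure about every {days:.2f} days")
```

Under these assumptions a cluster of a thousand nodes sees a failure roughly daily, and larger clusters see several per day.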


Example: Update 1% of 1TB

• Data consists of 10 billion records, each 100 bytes
• Task: Update 1% of these records


Approach 1: Just Do It

• Each update involves read, modify and write
– t = 1 seek + 2 disk rotations = 20ms
– 1% x 10^10 x 20 ms = 2 mega-seconds = 23 days (552 hours)
• Total time dominated by seek and rotation times


Approach 2: The “Hard” Way

• Copy the entire database 1GB at a time
• Update records sequentially
– t = 2 x 1GB / 100MB/s + 20ms = 20s
– 10^3 x 20s = 20,000s = 5.6 hours
• 100x faster to move 100x more data!
• Moral: Read data sequentially even if you only want 1% of it
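
The arithmetic behind the two approaches can be checked with a short script. The constants are the figures quoted on the slides (decimal units, so 1 GB = 1000 MB); the function and constant names are made up for this sketch.

```python
# Figures from the slides; decimal units (1 GB = 1000 MB).
RECORDS = 10**10            # 10 billion records of 100 bytes = 1 TB
UPDATE_MS = 20              # Approach 1: 1 seek + 2 rotations per update
CHUNKS = 10**3              # Approach 2: 1 TB copied 1 GB at a time
CHUNK_MB = 1000
THROUGHPUT_MB_S = 100       # sequential transfer rate
SEEK_S = 0.020              # one seek per chunk


def random_update_seconds() -> float:
    """Approach 1: seek to each of the 1% of records and rewrite it."""
    updates = RECORDS // 100
    return updates * UPDATE_MS / 1000


def sequential_copy_seconds() -> float:
    """Approach 2: read and write every chunk, updating records in passing."""
    per_chunk = 2 * CHUNK_MB / THROUGHPUT_MB_S + SEEK_S
    return CHUNKS * per_chunk


if __name__ == "__main__":
    slow, fast = random_update_seconds(), sequential_copy_seconds()
    print(f"Approach 1: {slow / 86400:.1f} days")
    print(f"Approach 2: {fast / 3600:.1f} hours")
    print(f"Speedup:    {slow / fast:.0f}x")
```

The per-chunk seek is negligible next to the 20 seconds of transfer, which is why the sequential copy wins even though it touches every byte.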


Introducing Hadoop!
• Now imagine you have thousands of disks on hundreds of
machines with near-linear scaling
– Commodity hardware – thousands of nodes!
– Handles Big Data – Petabytes and more!
– Sequential file access – all spindles at once!
– Sharding – data distributed evenly across cluster
– Reliability – self-healing, self-balancing
– Redundancy – data replicated across multiple hosts and disks
– MapReduce
• Parallel computing framework
• Moves the computation to the data


Hadoop Architecture
• MapReduce: Parallel computing
– Move the computation to the data
– Minimizes network utilization

• Distributed storage layer: Keeping track of data and metadata
– Data is sharded across the cluster

• Cluster management tools
• Applications and tools


What’s Driving Hadoop Adoption?

“Simple algorithms and lots of data
trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems


MapReduce Overview


MapReduce
• A programming model for processing very large data sets
– A framework for processing parallelizable problems across huge datasets using
a large number of nodes
– Brute-force parallel computing paradigm

• Phases
– Map
• Job input is partitioned into “splits”, one per map task

– Shuffle and sort
• Map output is sent to reducer(s) using a hash of the key

– Reduce
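
The shuffle step routes each map output pair to a reducer by hashing its key, so every value for a given key ends up at the same reducer. The sketch below illustrates that idea only. Hadoop's default partitioner is based on the key's Java `hashCode`; CRC32 is used here as a stand-in because Python's built-in `hash()` for strings varies between runs. The function name `partition` is an assumption for the example.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key: a stable hash of the key, modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    map_output: List[Tuple[str, int]] = [
        ("the", 1), ("time", 1), ("has", 1), ("come", 1), ("the", 1),
    ]
    buckets: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, value in map_output:
        buckets[partition(key, num_reducers=2)].append((key, value))
    for reducer, pairs in sorted(buckets.items()):
        print(reducer, pairs)   # both ("the", 1) pairs land in the same bucket
```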


Inside Map-Reduce

Input → Map → Shuffle and sort → Reduce → Output

Input
"The time has come," the Walrus said,
"To talk of many things:
Of shoes--and ships--and sealing-wax
…

Map
the, 1
time, 1
has, 1
come, 1
…

Shuffle and sort
come, [3,2,1]
has, [1,5,2]
the, [1,2,1]
time, [10,1,3]
…

Reduce (output)
come, 6
has, 8
the, 4
time, 14
…
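
The word-count flow in this figure can be imitated in a few lines of single-process Python. This is a sketch of the programming model, not of Hadoop's Java API, and all function names are hypothetical. It emits a 1 per word, so its grouped lists contain only 1s, whereas the figure's grouped values appear to be partial counts from a larger input.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in re.findall(r"[a-z']+", line.lower()):
        yield word, 1


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Shuffle and sort: group values by key, then order the keys."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped))


if __name__ == "__main__":
    poem = [
        '"The time has come," the Walrus said,',
        '"To talk of many things:',
        "Of shoes--and ships--and sealing-wax",
    ]
    for word, count in word_count(poem).items():
        print(word, count)
```

In a real job the map calls run on many nodes in parallel and the grouping happens across the network, but the three functions keep the same shape.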


JobTracker
• Sends out tasks
• Co-locates tasks with data
• Gets data location
• Manages TaskTrackers
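
Co-locating tasks with data can be pictured as a simple preference rule: when a node has a free slot, hand it a split whose data it already stores, and fall back to a remote read only if none exists. The sketch below is a deliberate simplification with hypothetical names (`pick_split`, `locations`); a real scheduler weighs more factors, such as rack locality.

```python
from typing import Dict, List, Optional, Set


def pick_split(free_node: str, pending: List[str],
               locations: Dict[str, Set[str]]) -> Optional[str]:
    """Choose a pending split for a node with a free slot, preferring local data.

    `locations` maps each split to the set of nodes holding a copy of it.
    """
    for split in pending:
        if free_node in locations.get(split, set()):
            return split                     # data-local: no network transfer
    return pending[0] if pending else None   # otherwise accept a remote read


if __name__ == "__main__":
    locations = {"split-0": {"node-a", "node-b"}, "split-1": {"node-c"}}
    pending = ["split-0", "split-1"]
    print(pick_split("node-c", pending, locations))  # local copy available
    print(pick_split("node-d", pending, locations))  # no local copy anywhere
```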


TaskTracker
• Performs tasks (Map, Reduce)
• Slots determine number of concurrent tasks
• Notifies JobTracker of completed tasks
• Heartbeats to the JobTracker
• Each task is a separate Java process
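
Because slots cap how many tasks run at once, the number of rounds ("waves") a job needs follows from simple division. The sketch below assumes every node has the same number of slots and every task takes about the same time; the function name and example figures are illustrative only.

```python
import math


def task_waves(num_tasks: int, nodes: int, slots_per_node: int) -> int:
    """Number of rounds needed when only nodes * slots_per_node tasks run at once."""
    concurrent = nodes * slots_per_node
    return math.ceil(num_tasks / concurrent)


if __name__ == "__main__":
    # Assumed example: 1000 map tasks on 10 nodes with 4 map slots each.
    print(task_waves(num_tasks=1000, nodes=10, slots_per_node=4))
```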


Hadoop Ecosystem


Hadoop Ecosystem
• PIG: It will eat anything
– High level language, set algebra, careful semantics
– Filter, transform, co-group, generate, flatten
– PIG generates and optimizes map-reduce programs
• Hive: Busy as a bee
– High level language, more ad hoc than PIG
– SQL-ish
– Has central meta-data service
– Loves external scripts
• HBase: NoSQL for your cluster
• Mahout: distributed/scalable machine learning algorithms
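
Of the Pig operations listed above, co-group is probably the least familiar: it groups two relations by a shared key while keeping each relation's values in its own bag. The sketch below shows those semantics in plain Python rather than Pig Latin; the function name and sample data are made up for the illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def cogroup(left: Iterable[Tuple[Hashable, object]],
            right: Iterable[Tuple[Hashable, object]]) -> Dict[Hashable, Tuple[List, List]]:
    """Group two (key, value) relations by key, keeping each side's values separate.

    A key present in only one relation gets an empty list for the other side.
    """
    groups: Dict[Hashable, Tuple[List, List]] = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(sorted(groups.items()))


if __name__ == "__main__":
    visits = [("amy", "cnn.com"), ("bob", "bbc.co.uk"), ("amy", "mapr.com")]
    ages = [("amy", 31), ("cat", 25)]
    for user, (sites, user_ages) in cogroup(visits, ages).items():
        print(user, sites, user_ages)
```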


How is MapR Different?


Mostly, It’s Not!

• API-compatible
– Move code over without modifications
– Use the familiar Hadoop Shell
• Supports popular tools and applications
– Hive, Pig, HBase, and Flume if you want it


Very Different Where It Counts
• No single point of failure
• Faster shuffle, faster file creation
• Read/write storage layer
• NFS-mountable
• Management tools: MCS, REST API, CLI
• Data placement, protection, backup
• HA at all layers (Naming, NFS, JobTracker, MCS)


Summary


Questions

