Introduction: MapR and Hadoop
  7/6/2012

© 2012 MapR Technologies   Introduction 1
Introduction
   Agenda
   • Hadoop Overview
   • MapReduce Overview
   • Hadoop Ecosystem
   • How is MapR Different?
   • Summary




Introduction
   Objectives
   At the end of this module you will be able to:
   • Explain why Hadoop is an important technology for effectively working with
     Big Data
   • Describe the phases of a MapReduce job
   • Identify some of the tools used with Hadoop
   • List the similarities and differences between MapR and other Hadoop
     distributions




Hadoop Overview




Data is Growing Faster than Moore’s Law

        Business Analytics Requires a New Approach



        [Chart] Data volume is growing 44x: from 1.2 zettabytes in 2010
        to a projected 35.2 zettabytes in 2020 (IDC Digital Universe
        Study 2011).
Source: IDC Digital Universe Study, sponsored by EMC, May 2010
Before Hadoop
  Web crawling to power search engines
  •    Must be able to handle gigantic data
  •    Must be fast!
  Problem: databases (B-tree based) are not fast enough, and do not scale
  Solution: Sort and Merge
  •    Eliminate the pesky seek time!




How to Scale?
  Big Data has Big Problems
  •   Petabytes of data
  •   MTBF on 1000s of nodes is < 1 day
  •   Something is always broken
  •   There are limits to scaling Big Iron
  •   Sequential and random access just don’t scale




Example: Update 1% of 1TB

      • Data consists of 10 billion records, each 100 bytes
      • Task: Update 1% of these records




Approach 1: Just Do It

      • Each update involves a read, a modify, and a write
       –   t = 1 seek + 2 disk rotations ≈ 20 ms
       –   1% × 10¹⁰ records × 20 ms = 2 mega-seconds ≈ 23 days
      • Total time is dominated by seek and rotation times
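The arithmetic above can be checked in a few lines (a sketch; the 20 ms per update is the slide's seek-plus-rotation assumption, not a measured figure):

```python
# Back-of-envelope check of Approach 1: random updates to 1% of 1 TB.
records = 10_000_000_000            # 10 billion records x 100 bytes = 1 TB
updates = records // 100            # update 1% of the records
ms_per_update = 20                  # 1 seek + 2 disk rotations ~ 20 ms
total_seconds = updates * ms_per_update // 1000
print(total_seconds)                # 2000000 -- i.e. 2 mega-seconds
print(total_seconds // 86_400)      # 23 -- roughly 23 days
```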




Approach 2: The “Hard” Way

      • Copy the entire database 1 GB at a time
      • Update records sequentially
       –   t = 2 × 1 GB / 100 MB/s + 20 ms ≈ 20 s
       –   10³ × 20 s = 20,000 s ≈ 5.6 hours
      • 100x faster, even though it moves 100x more data!
      • Moral: read data sequentially, even if you only want 1% of it
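The same sanity check for the sequential approach (a sketch using the slide's assumed 100 MB/s transfer rate; the single 20 ms seek per chunk is negligible and omitted):

```python
# Back-of-envelope check of Approach 2: stream the whole 1 TB sequentially.
chunk_mb = 1_000                           # work through the data 1 GB at a time
throughput_mb_s = 100                      # 100 MB/s sequential transfer rate
t_chunk = 2 * chunk_mb / throughput_mb_s   # read + write one chunk: 20 s
chunks = 1_000                             # 1 TB / 1 GB
total_seconds = chunks * t_chunk
print(total_seconds)                       # 20000.0 seconds
print(round(total_seconds / 3600, 1))      # 5.6 hours
```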




Introducing Hadoop!
      • Now imagine you have thousands of disks on hundreds of
        machines with near-linear scaling
      –   Commodity hardware – thousands of nodes!
      –   Handles Big Data – Petabytes and more!
      –   Sequential file access – all spindles at once!
      –   Sharding – data distributed evenly across cluster
      –   Reliability – self-healing, self-balancing
      –   Redundancy – data replicated across multiple hosts and disks
      –   MapReduce
          • Parallel computing framework
          • Moves the computation to the data




Hadoop Architecture
   • MapReduce: Parallel computing
           –   Move the computation to the data
           –   Minimizes network utilization

   • Distributed storage layer: Keeping track of data and metadata
           –   Data is sharded across the cluster

   • Cluster management tools
   • Applications and tools




What’s Driving Hadoop Adoption?


        "Simple algorithms and lots of data
            trump complex models"



                                             Halevy, Norvig, and Pereira, Google
                                                         IEEE Intelligent Systems

MapReduce Overview




MapReduce
   •     A programming model for processing very large data sets
       ― A framework for processing parallelizable problems across huge
         datasets using a large number of nodes
       ― A brute-force parallel computing paradigm

   •     Phases
       ― Map
            •    Input is partitioned into “splits”, one map task per split

       ― Shuffle and sort
            •    Map output is routed to the reducer(s) using a hash of the key

       ― Reduce
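A local word-count sketch of the three phases (illustrative Python only; real jobs use the Hadoop Java API or Hadoop Streaming, and these function names are not part of any Hadoop interface):

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (word, 1) pair for every word in the input split
    for line in split:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle and sort: group all values by key; on a real cluster a hash
    # of the key decides which reducer receives each group
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: collapse each group of counts into a single total
    for word, counts in grouped:
        yield (word, sum(counts))

split = ["the time has come", "the walrus said"]
result = dict(reduce_phase(shuffle_and_sort(map_phase(split))))
print(result["the"])   # 2
```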


Inside MapReduce

   Pipeline: Input → Map → Shuffle and sort → Reduce → Output

   Word count example:
     Input:            "The time has come," the Walrus said,
                       "To talk of many things:
                       Of shoes—and ships—and sealing-wax
     Map output:       (the, 1), (time, 1), (has, 1), (come, 1), …
     Shuffle and sort: (come, [3,2,1]), (has, [1,5,2]), (the, [1,2,1]), (time, [10,1,3]), …
     Reduce output:    (come, 6), (has, 8), (the, 4), (time, 14), …



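The shuffle's hash-based routing of map output to reducers can be sketched as follows (a from-scratch illustration of the idea behind Hadoop's default hash partitioner, not its actual code):

```python
import zlib

def partition(key, num_reducers):
    # Route a map-output key to a reducer: the same key always hashes to
    # the same reducer, so all values for one key meet at one reduce task
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for k in ["the", "time", "has", "come"]:
    r = partition(k, 4)
    assert 0 <= r < 4
    assert r == partition(k, 4)   # deterministic routing
```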
JobTracker
   • Sends out tasks
   • Co-locates tasks with data
   • Gets data location
   • Manages TaskTrackers




TaskTracker
   •     Runs tasks (map and reduce)
   •     Slots determine the number of concurrent tasks
   •     Notifies the JobTracker of completed tasks
   •     Sends heartbeats to the JobTracker
   •     Each task runs as a separate Java process




Hadoop Ecosystem




Hadoop Ecosystem
   • Pig: It will eat anything
     –   High-level language; set algebra, careful semantics
     –   Filter, transform, co-group, generate, flatten
     –   Pig generates and optimizes MapReduce programs
   • Hive: Busy as a bee
     –   High-level language, more ad hoc than Pig
     –   SQL-ish
     –   Has a central metadata service
     –   Loves external scripts
   • HBase: NoSQL for your cluster
   • Mahout: distributed/scalable machine learning algorithms


How is MapR Different?




Mostly, It’s Not!

      • API-compatible
       –   Move code over without modification
       –   Use the familiar Hadoop shell
      • Supports popular tools and applications
       –   Hive, Pig, HBase, and Flume if you want it




Very Different Where It Counts
    • No single point of failure
    • Faster shuffle, faster file creation
    • Read/write storage layer
    • NFS-mountable
    • Management tools: MCS, REST API, CLI
    • Data placement, protection, backup
    • HA at all layers (naming, NFS, JobTracker, MCS)




Summary




Questions






Editor's Notes

  • #8 Problem: scaling reliably is hard. What you need is a fault-tolerant store and a fault-tolerant framework: handle hardware faults transparently and efficiently, with high availability, not dependent on any one component. Even on a big cluster some things take days, and even simple things are complicated in a failure-rich environment. Every point is a point where things can fail, and you have to manage that failure. With many computers and many disks, failures are common: with 1000 computers x 10 disks each, we can expect 1 node failure and 10 disk failures per day. Some failures are intermittent or difficult to detect. Computation must succeed, and not run slower, in these conditions.
  • #12 Apache Hadoop is a new paradigm. It scales to thousands of commodity computers and can effectively use all cores and spindles simultaneously; if you buy hardware, you want to maximize its use. It is a new software stack built on a different foundation, not very mature yet, but in use by most Web 2.0 companies and many Fortune 500 firms.
  • #14 The first is “simple algorithms and lots of data trump complex models”. This comes from an IEEE article written by three research directors at Google, titled “The Unreasonable Effectiveness of Data”. It was a reaction to an earlier essay, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, which made the point that simple formulas can explain the complex natural world, the most famous example being E=mc² in physics. Their paper noted that economists were jealous, since they lacked similar models to neatly explain human behavior. But in natural language processing, an area notoriously complex that has been studied for years with many AI attempts to address it, they found that relatively simple approaches on massive data produced stunning results. They cited an example of scene completion: an algorithm removes something from a picture (a car, for instance) and, based on a corpus of thousands of pictures, fills in the missing background. This algorithm did rather poorly until the corpus was increased to millions of photos, at which point the same algorithm performed extremely well. While not a direct example from financial services, it is a great analogy: after all, aren’t you looking for an approach that can fill in the missing pieces of a picture or pattern?