Introduction to Hadoop
Ran Ziv
Who Am I?
Ran Ziv
Current:
• Data Architect at Technology Research Group
Past:
• Architect, Data Platform & Analytics Group Manager, LivePerson
• Analytics Project Manager, Software Industry
• System Analyst, Telco Industry
• Fraud Detection Systems Engineer, Telco Industry
• Data Researcher, Internet Industry
What’s Ahead?
• Solid introduction to Apache Hadoop
• What it is
• Why it’s relevant
• How it works
• The Ecosystem
• No prior experience needed
• Feel free to ask questions
What Is Apache Hadoop?
• Scalable data storage and processing
• Open source Apache project
• Harnesses the power of commodity servers
• Distributed and fault-tolerant
• “Core” Hadoop consists of two main parts
• HDFS (storage)
• MapReduce (processing)
A Large Ecosystem
A Coherent Platform
How Did Apache Hadoop Originate?
• Heavily influenced by Google’s architecture
• Notably, the Google File System and MapReduce papers
• Other Web companies quickly saw the benefits
• Early adoption by Yahoo, Facebook and others
What Is Common Across Hadoop-able Problems?
• Nature of the data
• Complex data
• Multiple data sources
• Lots of it
• Nature of the analysis
• Parallel execution
• Spread data over a cluster of servers and take the computation to the data
Benefits Of Analyzing With Hadoop
• Enables analysis that was previously impossible or impractical
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
Hadoop: How Does It Work?
• Moore’s Law… and not
Disk Capacity and Price
• We’re generating more data than ever before
• Fortunately, the capacity and cost of storage have kept pace
• Capacity has increased while price has decreased
Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates haven’t kept pace with capacity
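• For example (a rough, illustrative calculation): scanning a full 1 TB disk at ~100 MB/s takes about 10,000 seconds, nearly three hours, before any processing happens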
Architecture of a Typical HPC System
You Don’t Just Need Speed…
• The problem is that we have way more data than code
You Need Speed At Scale
Collocated Storage and Processing
• Solution: store and process data on the same nodes
• Data Locality: “Bring the computation to the data”
• Reduces I/O and boosts performance
Introducing HDFS
• Hadoop Distributed File System
• Scalable storage influenced by Google’s file system paper
• It’s not a general-purpose file system
• HDFS is optimized for Hadoop
• Values high throughput much more than low latency
• It’s a user-space Java process
• Primarily accessed via command-line utilities and Java API
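For instance, day-to-day interaction is typically through the shell (the paths below are hypothetical):

    hadoop fs -ls /user/ran                # list a directory
    hadoop fs -put events.log /user/ran/   # copy a local file into HDFS
    hadoop fs -cat /user/ran/events.log    # print a file’s contents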
HDFS is (Mostly) UNIX-Like
• In many ways, HDFS is similar to a Unix file system
• Hierarchical
• Unix-style paths (e.g. /foo/bar/myfile.txt)
• File ownership and permissions
• There are also some major deviations from Unix
• Cannot modify files once written
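Ownership and permissions are managed much as on Unix; a sketch using the slide’s example path (the user and group are placeholders):

    hadoop fs -chmod 644 /foo/bar/myfile.txt      # set permissions
    hadoop fs -chown ran:dev /foo/bar/myfile.txt  # change owner (superuser only)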
HDFS High-Level Architecture
• HDFS follows a master-slave architecture
• There are two essential daemons in HDFS
• Master: NameNode
• Responsible for namespace and metadata
• Namespace: file hierarchy
• Metadata: ownership, permissions, block locations, etc.
• Slave: DataNode
• Responsible for storing actual data blocks
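You can see the NameNode’s view of a file with the fsck utility (the path is hypothetical):

    hadoop fsck /user/ran/events.log -files -blocks -locations

This reports each block of the file and the DataNodes holding its replicas.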
HDFS Blocks
• When a file is added to HDFS, it’s split into blocks
• This is a similar concept to native file systems
• HDFS uses a much larger block size (64 MB by default) for performance
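The block size is configurable; for example, in Hadoop 1.x a single upload could request 128 MB blocks (file and path hypothetical):

    hadoop fs -D dfs.block.size=134217728 -put big.log /data/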
HDFS Replication
• Those blocks are then replicated across machines
• The first block might be replicated to A, C and D
• The next block might be replicated to B, D and E
• The last block might be replicated to A, C and E
HDFS Reliability
• Replication helps to achieve reliability
• Even when a node fails, two copies of the block remain
• These will be re-replicated to other nodes automatically
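The replication factor is typically three by default and can be adjusted per file or per tree (the path is hypothetical):

    hadoop fs -setrep -w 3 /user/ran/events.log   # -w waits until re-replication completes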
MapReduce High-Level Architecture
• Like HDFS, MapReduce has a master-slave architecture
• There are two daemons in “classical” MapReduce
• Master: JobTracker
• Responsible for dividing, scheduling and monitoring work
• Slave: TaskTracker
• Responsible for actual processing
Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
• One function (Map) processes data
• That output is ultimately input to another function (Reduce)
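The analogy can be made literal: a Streaming-style job can be rehearsed locally with an ordinary pipeline, where sort stands in for Hadoop’s shuffle and sort (the script and file names are the hypothetical ones used later):

    cat events.log | ./map.py | sort | ./reduce.py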
The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming
Intermediate Processing
• The Map function’s output is grouped and sorted
• This is the automatic “sort and shuffle” process in Hadoop
The Reduce Function
• Operates on all records in a group
• Often used for sum, average or other aggregate functions
MapReduce Benefits
• Complex details are abstracted away from the developer
• No file I/O
• No networking code
• No synchronization
MapReduce Example in Python
• MapReduce code for Hadoop is typically written in Java
• But it’s possible to use nearly any language with Hadoop Streaming
• I’ll show the log event counter using MapReduce in Python
• It’s very helpful to see the data as well as the code
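A sketch of how such a job might be submitted (the streaming jar’s location varies by Hadoop version, and the paths here are hypothetical):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/ran/events.log \
        -output /user/ran/output \
        -mapper map.py \
        -reducer reduce.py \
        -file map.py -file reduce.py

The -file options ship the (executable) scripts to every node in the cluster.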
Job Input
• Each mapper gets a chunk of the job’s input data to process
• This “chunk” is called an InputSplit
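Since it helps to see the data, here is what one split of a purely hypothetical log file might contain:

    2012-05-01 10:00:01 INFO Service started
    2012-05-01 10:00:04 WARN Low disk space
    2012-05-01 10:00:07 INFO Request handled
    2012-05-01 10:00:09 ERROR Connection refused
    2012-05-01 10:00:12 INFO Request handled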
Python Code for Map Function
• Our map function will parse the event type
• And then output that event (key) and a literal 1 (value)
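The slide’s code isn’t reproduced here; a minimal sketch consistent with that description (and with the hypothetical log format above) might be:

    #!/usr/bin/env python
    # map.py -- emit each record's event type (key) and a literal 1 (value)
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 3:
            continue                # skip malformed lines
        event = fields[2]           # e.g. INFO, WARN, ERROR
        print('%s\t1' % event)      # key <TAB> 1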
Output of Map Function
• The map function produces key/value pairs as output
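Continuing the hypothetical example, the mapper’s output would be:

    INFO    1
    WARN    1
    INFO    1
    ERROR   1
    INFO    1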
Input to Reduce Function
• The Reducer receives a key and all values for that key
• Keys are always passed to reducers in sorted order
• Although it’s not obvious here, values are unordered
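After the shuffle and sort, the reducer in the hypothetical example sees the same pairs, ordered by key:

    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    WARN    1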
Python Code for Reduce Function
• The Reducer first extracts the key and value it was passed
• Then simply adds up the values for each key
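Putting both steps together, a minimal sketch of such a reducer (matching the hypothetical mapper above) might be:

    #!/usr/bin/env python
    # reduce.py -- sum the values for each key; input arrives sorted by key,
    # one "key<TAB>value" pair per line
    import sys

    current_key = None
    total = 0

    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t', 1)   # extract key and value
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, total))  # emit the previous key's sum
            current_key = key
            total = 0
        total += int(value)

    if current_key is not None:
        print('%s\t%d' % (current_key, total))          # emit the last key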
Output of Reduce Function
• The output of this Reduce function is a sum for each level
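For the hypothetical input above, that output would be:

    ERROR   1
    INFO    3
    WARN    1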
Recap of Data Flow
Input Splits Feed the Map Tasks
• Input for the entire job is subdivided into InputSplits
• An InputSplit usually corresponds to a single HDFS block
• Each of these serves as input to a single Map task
Mappers Feed the Shuffle and Sort
• Output of all Mappers is partitioned, merged, and sorted (no code required; Hadoop does this automatically)
Shuffle and Sort Feeds the Reducers
• All values for a given key are then collapsed into a list
• The key and all its values are fed to reducers as input
Each Reducer Has an Output File
• These are stored in HDFS below your output directory
• Use hadoop fs -getmerge to combine them into a local copy
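For example (paths hypothetical):

    hadoop fs -getmerge /user/ran/output results.txt   # merge the part files into one local file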
Apache Hadoop Ecosystem: Overview
• "Core Hadoop" consists of HDFS and MapReduce
• These are the kernel of a much broader platform
• Hadoop has many related projects
• Some help you integrate Hadoop with other systems
• Others help you analyze your data
• Still others, like Oozie, help you use Hadoop more effectively
• Most are open source Apache projects like Hadoop
• Also like Hadoop, they have funny names
Ecosystem: Apache Flume
• Collects and moves streaming event data, such as logs, into HDFS
Ecosystem: Apache Sqoop
• Integrates with any JDBC-compatible database
• Retrieve all tables, a single table, or a portion to store in HDFS
• Can also export data from HDFS back to the database
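A typical import might look like this (the connection string, credentials, and table are hypothetical):

    sqoop import --connect jdbc:mysql://dbhost/sales \
        --username ran --table orders \
        --target-dir /data/orders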
Ecosystem: Apache Hive
• Hive lets you run SQL-like queries on data in HDFS
• It turns these queries into MapReduce jobs that run on your cluster
• Reduces development time
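For instance, the log-level count from the Streaming example becomes a single statement (assuming a logs table has been defined over the data):

    SELECT level, COUNT(*) FROM logs GROUP BY level;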
Ecosystem: Apache Pig
• Apache Pig has a similar purpose to Hive
• It has a high-level language (PigLatin) for data analysis
• Scripts yield MapReduce jobs that run on your cluster
• But Pig’s approach is quite different from Hive’s
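The same count, sketched in PigLatin (the path and schema are hypothetical):

    logs   = LOAD '/data/logs' USING PigStorage(' ')
             AS (d:chararray, t:chararray, level:chararray, msg:chararray);
    by_lvl = GROUP logs BY level;
    counts = FOREACH by_lvl GENERATE group, COUNT(logs);
    DUMP counts;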
Ecosystem: Apache HBase
• NoSQL database built on HDFS
• Low latency and high performance for reads and writes
• Extremely scalable
• Tables can have billions of rows
• And potentially millions of columns
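A quick feel for the data model via the HBase shell (the table, column family, and values are hypothetical):

    create 'events', 'info'                      # a table with one column family
    put 'events', 'row1', 'info:level', 'ERROR'  # write a cell
    get 'events', 'row1'                         # read it back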
When Is Hadoop (Not) a Good Choice?
• Hadoop may be a great choice when
• You need to process non-relational (unstructured) data
• You are processing large amounts of data
• You can run your jobs in batch mode
• Hadoop may not be a great choice when
• You’re processing small amounts of data
• Your algorithms require communication among nodes
• You need very low latency or transactions
• As always, use the best tool for the job
• And know how to integrate it with other systems
Conclusion
• Thanks for your time!
• Questions?