Introduction to Hadoop
Ran Ziv
Who Am I?
Ran Ziv
Current:
• Data Architect at Technology Research Group
Past:
• Architect, Data Platform & Analytics Group Manager, LivePerson
• Analytics Project Manager, Software Industry
• System Analyst, Telco Industry
• Fraud Detection Systems Engineer, Telco Industry
• Data Researcher, Internet Industry
What’s Ahead?
• Solid introduction to Apache Hadoop
• What it is
• Why it’s relevant
• How it works
• The Ecosystem
• No prior experience needed
• Feel free to ask questions
What Is Apache Hadoop?
• Scalable data storage and processing
• Open source Apache project
• Harnesses the power of commodity servers
• Distributed and fault-tolerant
• “Core” Hadoop consists of two main parts
• HDFS (storage)
• MapReduce (processing)
A Large Ecosystem
A Coherent Platform
How Did Apache Hadoop Originate?
• Heavily influenced by Google’s architecture
• Notably, the Google File System and MapReduce papers
• Other Web companies quickly saw the benefits
• Early adoption by Yahoo, Facebook and others
What Is Common Across Hadoop-able Problems?
• Nature of the data
• Complex data
• Multiple data sources
• Lots of it
• Nature of the analysis
• Parallel execution
• Spread data over a cluster of servers and take the computation to the data
Benefits Of Analyzing With Hadoop
• Enables analysis that was previously impossible or impractical
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
Hadoop: How Does It Work?
• Moore’s Law… and not
Disk Capacity and Price
• We’re generating more data than ever before
• Fortunately, the capacity and cost of storage have kept pace
• Capacity has increased while price has decreased
Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates haven’t kept pace with capacity
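• For example (a rough, illustrative calculation): scanning a full 1 TB disk at ~100 MB/s takes about 10,000 seconds, nearly three hours, before any processing happens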
Architecture of a Typical HPC System
You Don’t Just Need Speed…
• The problem is that we have way more data than code
You Need Speed At Scale
Collocated Storage and Processing
• Solution: store and process data on the same nodes
• Data Locality: “Bring the computation to the data”
• Reduces I/O and boosts performance
Introducing HDFS
• Hadoop Distributed File System
• Scalable storage influenced by Google’s file system paper
• It’s not a general-purpose file system
• HDFS is optimized for Hadoop
• Values high throughput much more than low latency
• It’s a user-space Java process
• Primarily accessed via command-line utilities and Java API
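For instance, day-to-day interaction is typically through the shell (the paths below are hypothetical):

    hadoop fs -ls /user/ran                # list a directory
    hadoop fs -put events.log /user/ran/   # copy a local file into HDFS
    hadoop fs -cat /user/ran/events.log    # print a file’s contents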
HDFS is (Mostly) UNIX-Like
• In many ways, HDFS is similar to a Unix file system
• Hierarchical
• Unix-style paths (e.g. /foo/bar/myfile.txt)
• File ownership and permissions
• There are also some major deviations from Unix
• Cannot modify files once written
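Ownership and permissions are managed much as on Unix; a sketch using the slide’s example path (the user and group are placeholders):

    hadoop fs -chmod 644 /foo/bar/myfile.txt      # set permissions
    hadoop fs -chown ran:dev /foo/bar/myfile.txt  # change owner (superuser only)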
HDFS High-Level Architecture
• HDFS follows a master-slave architecture
• There are two essential daemons in HDFS
• Master: NameNode
• Responsible for namespace and metadata
• Namespace: file hierarchy
• Metadata: ownership, permissions, block locations, etc.
• Slave: DataNode
• Responsible for storing actual data blocks
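You can see the NameNode’s view of a file with the fsck utility (the path is hypothetical):

    hadoop fsck /user/ran/events.log -files -blocks -locations

This reports each block of the file and the DataNodes holding its replicas.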
HDFS Blocks
• When a file is added to HDFS, it’s split into blocks
• This is a similar concept to native file systems
• HDFS uses a much larger block size (64 MB by default) for performance
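The block size is configurable; for example, in Hadoop 1.x a single upload could request 128 MB blocks (file and path hypothetical):

    hadoop fs -D dfs.block.size=134217728 -put big.log /data/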
HDFS Replication
• Those blocks are then replicated across machines
• The first block might be replicated to A, C and D
• The next block might be replicated to B, D and E
• The last block might be replicated to A, C and E
HDFS Reliability
• Replication helps to achieve reliability
• Even when a node fails, two copies of the block remain
• These will be re-replicated to other nodes automatically
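The replication factor is typically three by default and can be adjusted per file or per tree (the path is hypothetical):

    hadoop fs -setrep -w 3 /user/ran/events.log   # -w waits until re-replication completes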
MapReduce High-Level Architecture
• Like HDFS, MapReduce has a master-slave architecture
• There are two daemons in “classical” MapReduce
• Master: JobTracker
• Responsible for dividing, scheduling and monitoring work
• Slave: TaskTracker
• Responsible for actual processing
Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
• One function (Map) processes data
• That output is ultimately input to another function (Reduce)
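The analogy can be made literal: a Streaming-style job can be rehearsed locally with an ordinary pipeline, where sort stands in for Hadoop’s shuffle and sort (the script and file names are the hypothetical ones used later):

    cat events.log | ./map.py | sort | ./reduce.py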
The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming
Intermediate Processing
• The Map function’s output is grouped and sorted
• This is the automatic “sort and shuffle” process in Hadoop
The Reduce Function
• Operates on all records in a group
• Often used for sum, average or other aggregate functions
MapReduce Benefits
• Complex details are abstracted away from the developer
• No file I/O
• No networking code
• No synchronization
MapReduce Example in Python
• MapReduce code for Hadoop is typically written in Java
• But it’s possible to use nearly any language with Hadoop Streaming
• I’ll show the log event counter using MapReduce in Python
• It’s very helpful to see the data as well as the code
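A sketch of how such a job might be submitted (the streaming jar’s location varies by Hadoop version, and the paths here are hypothetical):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /user/ran/events.log \
        -output /user/ran/output \
        -mapper map.py \
        -reducer reduce.py \
        -file map.py -file reduce.py

The -file options ship the (executable) scripts to every node in the cluster.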
Job Input
• Each mapper gets a chunk of the job’s input data to process
• This “chunk” is called an InputSplit
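Since it helps to see the data, here is what one split of a purely hypothetical log file might contain:

    2012-05-01 10:00:01 INFO Service started
    2012-05-01 10:00:04 WARN Low disk space
    2012-05-01 10:00:07 INFO Request handled
    2012-05-01 10:00:09 ERROR Connection refused
    2012-05-01 10:00:12 INFO Request handled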
Python Code for Map Function
• Our map function will parse the event type
• And then output that event (key) and a literal 1 (value)
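The slide’s code isn’t reproduced here; a minimal sketch consistent with that description (and with the hypothetical log format above) might be:

    #!/usr/bin/env python
    # map.py -- emit each record's event type (key) and a literal 1 (value)
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 3:
            continue                # skip malformed lines
        event = fields[2]           # e.g. INFO, WARN, ERROR
        print('%s\t1' % event)      # key <TAB> 1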
Output of Map Function
• The map function produces key/value pairs as output
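Continuing the hypothetical example, the mapper’s output would be:

    INFO    1
    WARN    1
    INFO    1
    ERROR   1
    INFO    1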
Input to Reduce Function
• The Reducer receives a key and all values for that key
• Keys are always passed to reducers in sorted order
• Although it’s not obvious here, values are unordered
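After the shuffle and sort, the reducer in the hypothetical example sees the same pairs, ordered by key:

    ERROR   1
    INFO    1
    INFO    1
    INFO    1
    WARN    1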
Python Code for Reduce Function
• The Reducer first extracts the key and value it was passed
• Then simply adds up the values for each key
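Putting both steps together, a minimal sketch of such a reducer (matching the hypothetical mapper above) might be:

    #!/usr/bin/env python
    # reduce.py -- sum the values for each key; input arrives sorted by key,
    # one "key<TAB>value" pair per line
    import sys

    current_key = None
    total = 0

    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t', 1)   # extract key and value
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, total))  # emit the previous key's sum
            current_key = key
            total = 0
        total += int(value)

    if current_key is not None:
        print('%s\t%d' % (current_key, total))          # emit the last key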
Output of Reduce Function
• The output of this Reduce function is a sum for each level
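For the hypothetical input above, that output would be:

    ERROR   1
    INFO    3
    WARN    1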
Recap of Data Flow
Input Splits Feed the Map Tasks
• Input for the entire job is subdivided into InputSplits
• An InputSplit usually corresponds to a single HDFS block
• Each of these serves as input to a single Map task
Mappers Feed the Shuffle and Sort
• Output of all Mappers is partitioned, merged, and sorted (no code required; Hadoop does this automatically)
Shuffle and Sort Feeds the Reducers
• All values for a given key are then collapsed into a list
• The key and all its values are fed to reducers as input
Each Reducer Has an Output File
• These are stored in HDFS below your output directory
• Use hadoop fs -getmerge to combine them into a local copy
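For example (paths hypothetical):

    hadoop fs -getmerge /user/ran/output results.txt   # merge the part files into one local file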
Apache Hadoop Ecosystem: Overview
• "Core Hadoop" consists of HDFS and MapReduce
• These are the kernel of a much broader platform
• Hadoop has many related projects
• Some help you integrate Hadoop with other systems
• Others help you analyze your data
• Still others, like Oozie, help you use Hadoop more effectively
• Most are open source Apache projects like Hadoop
• Also like Hadoop, they have funny names
Ecosystem: Apache Flume
• Collects and moves streaming event data, such as logs, into HDFS
Ecosystem: Apache Sqoop
• Integrates with any JDBC-compatible database
• Retrieve all tables, a single table, or a portion to store in HDFS
• Can also export data from HDFS back to the database
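A typical import might look like this (the connection string, credentials, and table are hypothetical):

    sqoop import --connect jdbc:mysql://dbhost/sales \
        --username ran --table orders \
        --target-dir /data/orders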
Ecosystem: Apache Hive
• Hive lets you run SQL-like queries on data in HDFS
• It turns these queries into MapReduce jobs that run on your cluster
• Reduces development time
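For instance, the log-level count from the Streaming example becomes a single statement (assuming a logs table has been defined over the data):

    SELECT level, COUNT(*) FROM logs GROUP BY level;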
Ecosystem: Apache Pig
• Apache Pig has a similar purpose to Hive
• It has a high-level language (PigLatin) for data analysis
• Scripts yield MapReduce jobs that run on your cluster
• But Pig’s approach is quite different from Hive’s
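The same count, sketched in PigLatin (the path and schema are hypothetical):

    logs   = LOAD '/data/logs' USING PigStorage(' ')
             AS (d:chararray, t:chararray, level:chararray, msg:chararray);
    by_lvl = GROUP logs BY level;
    counts = FOREACH by_lvl GENERATE group, COUNT(logs);
    DUMP counts;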
Ecosystem: Apache HBase
• NoSQL database built on HDFS
• Low latency and high performance for reads and writes
• Extremely scalable
• Tables can have billions of rows
• And potentially millions of columns
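A quick feel for the data model via the HBase shell (the table, column family, and values are hypothetical):

    create 'events', 'info'                      # a table with one column family
    put 'events', 'row1', 'info:level', 'ERROR'  # write a cell
    get 'events', 'row1'                         # read it back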
When Is Hadoop (Not) a Good Choice?
• Hadoop may be a great choice when
• You need to process non-relational (unstructured) data
• You are processing large amounts of data
• You can run your jobs in batch mode
• Hadoop may not be a great choice when
• You’re processing small amounts of data
• Your algorithms require communication among nodes
• You need very low latency or transactions
• As always, use the best tool for the job
• And know how to integrate it with other systems
Conclusion
• Thanks for your time!
• Questions?