Introduction to Hadoop
Ran Ziv
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


Transcript

  1. Introduction to Hadoop
     Ran Ziv
  2. Who Am I?
     Ran Ziv
     Current:
     • Data Architect at Technology Research Group
     Past:
     • Architect, Data Platform & Analytics Group Manager, LivePerson
     • Analytics Project Manager, Software Industry
     • System Analyst, Telco Industry
     • Fraud Detection Systems Engineer, Telco Industry
     • Data Researcher, Internet Industry
  3. What’s Ahead?
     • Solid introduction to Apache Hadoop
       • What it is
       • Why it’s relevant
       • How it works
       • The Ecosystem
     • No prior experience needed
     • Feel free to ask questions
  4. What Is Apache Hadoop?
     • Scalable data storage and processing
     • Open source Apache project
     • Harnesses the power of commodity servers
     • Distributed and fault-tolerant
     • “Core” Hadoop consists of two main parts
       • HDFS (storage)
       • MapReduce (processing)
  5. A Large Ecosystem (diagram)
  6. A Coherent Platform (diagram)
  7. How Did Apache Hadoop Originate?
     • Heavily influenced by Google’s architecture
       • Notably, the Google File System and MapReduce papers
     • Other Web companies quickly saw the benefits
       • Early adoption by Yahoo, Facebook and others
  8. What Is Common Across Hadoop-able Problems?
     • Nature of the data
       • Complex data
       • Multiple data sources
       • Lots of it
     • Nature of the analysis
       • Parallel execution
       • Spread data over a cluster of servers and take the computation to the data
  9. Benefits Of Analyzing With Hadoop
     • Analysis that was previously impossible or impractical becomes feasible
     • Analysis conducted at lower cost
     • Analysis conducted in less time
     • Greater flexibility
     • Linear scalability
  10. Hadoop: How does it work?
      • Moore’s law… and not
  11. Disk Capacity and Price
      • We’re generating more data than ever before
      • Fortunately, the size and cost of storage have kept pace
        • Capacity has increased while price has decreased
  12. Disk Capacity and Performance
      • Disk performance has also increased in the last 15 years
      • Unfortunately, transfer rates haven’t kept pace with capacity
  13. Architecture of a Typical HPC System (diagram)
  14. Architecture of a Typical HPC System (diagram, build-up continued)
  15. Architecture of a Typical HPC System (diagram, build-up continued)
  16. Architecture of a Typical HPC System (diagram, build-up continued)
  17. You Don’t Just Need Speed…
      • The problem is that we have way more data than code
  18. You Need Speed At Scale (diagram)
  19. (untitled diagram slide)
  20. Collocated Storage and Processing
      • Solution: store and process data on the same nodes
      • Data Locality: “Bring the computation to the data”
      • Reduces I/O and boosts performance
  21. Introducing HDFS
      • Hadoop Distributed File System
      • Scalable storage influenced by Google’s file system paper
      • It’s not a general-purpose file system
        • HDFS is optimized for Hadoop
        • Values high throughput much more than low latency
      • It’s a user-space Java process
        • Primarily accessed via command-line utilities and a Java API
  22. HDFS is (Mostly) UNIX-Like
      • In many ways, HDFS is similar to a Unix file system
        • Hierarchical
        • Unix-style paths (e.g. /foo/bar/myfile.txt)
        • File ownership and permissions
      • There are also some major deviations from Unix
        • Files cannot be modified once written
      • A sketch of driving HDFS from Python follows below
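     The deck drives HDFS through command-line utilities. As a minimal,
     hedged sketch (not from the original slides), the same Unix-style
     paths can be listed from Python by shelling out to the hadoop CLI;
     this assumes the hadoop command is on the PATH and reuses the
     example path from the slide:

         # List an HDFS directory by invoking the hadoop CLI.
         # Assumes `hadoop` is on the PATH; /foo/bar is the slide's
         # example path and may not exist on your cluster.
         import subprocess

         result = subprocess.run(
             ["hadoop", "fs", "-ls", "/foo/bar"],
             capture_output=True, text=True, check=True,
         )
         print(result.stdout)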
  23. HDFS High-Level Architecture
      • HDFS follows a master-slave architecture with two essential daemons
      • Master: NameNode
        • Responsible for the namespace (the file hierarchy) and metadata (ownership, permissions, block locations, etc.)
      • Slave: DataNode
        • Responsible for storing the actual data blocks
  24. HDFS Blocks
      • When a file is added to HDFS, it’s split into blocks
      • This is a similar concept to native file systems
      • HDFS uses a much larger block size (64 MB) for performance
        • For example, a 200 MB file is stored as three full 64 MB blocks plus one 8 MB block
  25. HDFS Replication
      • Those blocks are then replicated across machines
      • The first block might be replicated to nodes A, C and D
  26. HDFS Replication
      • The next block might be replicated to nodes B, D and E
  27. HDFS Replication
      • The last block might be replicated to nodes A, C and E
  28. HDFS Reliability
      • Replication helps to achieve reliability
      • Even when a node fails, two copies of each block remain
      • These will be re-replicated to other nodes automatically
  29. (untitled diagram slide)
  30. MapReduce High-Level Architecture
      • Like HDFS, MapReduce has a master-slave architecture
      • There are two daemons in “classical” MapReduce
        • Master: JobTracker
          • Responsible for dividing, scheduling and monitoring work
        • Slave: TaskTracker
          • Responsible for the actual processing
  31. Gentle Introduction to MapReduce
      • MapReduce is conceptually like a UNIX pipeline
      • One function (Map) processes data
      • That output is ultimately the input to another function (Reduce)
      • A plain-Python sketch of this flow follows below
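     To make the pipeline analogy concrete, here is a plain-Python sketch
     of the same map, group-and-sort, reduce flow (illustrative only; this
     is not how Hadoop itself is written or invoked, and the records are
     made up):

         # Conceptual sketch of the MapReduce data flow in plain Python.
         from itertools import groupby
         from operator import itemgetter

         records = ["INFO start", "WARN low disk", "INFO done"]

         # Map: emit one (key, value) pair per record
         mapped = [(rec.split()[0], 1) for rec in records]

         # Shuffle and sort: group the pairs by key (Hadoop does this for you)
         mapped.sort(key=itemgetter(0))

         # Reduce: aggregate all values for each key
         for key, pairs in groupby(mapped, key=itemgetter(0)):
             print(key, sum(value for _, value in pairs))
         # Prints: INFO 2, then WARN 1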
  32. The Map Function
      • Operates on each record individually
      • Typical uses include filtering, parsing, or transforming
  33. Intermediate Processing
      • The Map function’s output is grouped and sorted
      • This is Hadoop’s automatic “shuffle and sort” phase
  34. The Reduce Function
      • Operates on all records in a group
      • Often used for sum, average or other aggregate functions
  35. MapReduce Benefits
      • Complex details are abstracted away from the developer
        • No file I/O
        • No networking code
        • No synchronization
  36. MapReduce Example in Python
      • MapReduce code for Hadoop is typically written in Java
      • But nearly any language can be used via Hadoop Streaming
      • I’ll show the log event counter using MapReduce in Python
      • It’s very helpful to see the data as well as the code
  37. Job Input
      • Each mapper gets a chunk of the job’s input data
      • This “chunk” is called an InputSplit
      • A hypothetical sample of the input follows below
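     The slide’s sample data was not preserved in this transcript. For the
     sketches that follow, assume a hypothetical log format in which the
     event type (INFO, WARN, ERROR) is the third whitespace-separated
     field of each line:

         2012-10-15 09:27:01 INFO Opening database connection
         2012-10-15 09:27:03 WARN Free disk space below 10%
         2012-10-15 09:27:04 INFO Request served
         2012-10-15 09:27:09 ERROR Connection timed out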
  38. Python Code for Map Function
      • Our map function will parse the event type
      • And then output that event (key) and a literal 1 (value)
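     The original mapper code appeared as an image and is not in this
     transcript. Below is a hedged sketch of what a Hadoop Streaming
     mapper for this job might look like, assuming the hypothetical log
     format above; Streaming feeds input records to the script on stdin,
     and the script is wired into a job with the streaming jar's -mapper
     option:

         #!/usr/bin/env python3
         # Hypothetical Streaming mapper for the log event counter.
         # Assumes the event type is the third field of each log line.
         import sys

         for line in sys.stdin:
             fields = line.split()
             if len(fields) >= 3:
                 event = fields[2]        # the event type is the key
                 print(f"{event}\t1")     # emit key and a literal 1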
  39. Output of Map Function
      • The map function produces key/value pairs as output
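     Against the hypothetical input above, the mapper would emit
     tab-separated pairs like these (illustrative):

         INFO	1
         WARN	1
         INFO	1
         ERROR	1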
  40. Input to Reduce Function
      • The Reducer receives a key and all values for that key
      • Keys are always passed to reducers in sorted order
      • Although it’s not obvious here, values are unordered
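     After the shuffle and sort, the same pairs would reach the reducer
     grouped and sorted by key (again illustrative):

         ERROR	1
         INFO	1
         INFO	1
         WARN	1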
  41. Python Code for Reduce Function
      • The Reducer first extracts the key and value it was passed
  42. Python Code for Reduce Function
      • Then simply adds up the values for each key
      • A hedged sketch of such a reducer follows below
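     As with the mapper, the original reducer code is not in this
     transcript. A hedged sketch of a Streaming reducer: because input
     arrives sorted by key, the script only needs to notice when the key
     changes in order to emit a total:

         #!/usr/bin/env python3
         # Hypothetical Streaming reducer for the log event counter.
         # Input lines are tab-separated key/value pairs, sorted by key.
         import sys

         current_key, total = None, 0
         for line in sys.stdin:
             key, value = line.rstrip("\n").split("\t", 1)
             if key != current_key:
                 if current_key is not None:
                     print(f"{current_key}\t{total}")   # finished key
                 current_key, total = key, 0
             total += int(value)
         if current_key is not None:
             print(f"{current_key}\t{total}")           # last key

     Locally, the whole flow can be simulated with a Unix pipeline,
     mirroring slide 31: cat events.log | ./mapper.py | sort | ./reducer.py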
  43. Output of Reduce Function
      • The output of this Reduce function is a sum for each level
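     For the hypothetical sample above, the job’s final output would be:

         ERROR	1
         INFO	2
         WARN	1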
  44. Recap of Data Flow (diagram)
  45. Input Splits Feed the Map Tasks
      • Input for the entire job is subdivided into InputSplits
      • An InputSplit usually corresponds to a single HDFS block
      • Each of these serves as input to a single Map task
  46. Mappers Feed the Shuffle and Sort
      • Output of all Mappers is partitioned, merged, and sorted
        • No code required: Hadoop does this automatically
  47. Shuffle and Sort Feeds the Reducers
      • All values for a given key are then collapsed into a list
      • The key and all its values are fed to reducers as input
  48. Each Reducer Has an Output File
      • These are stored in HDFS below your output directory
      • Use hadoop fs -getmerge to combine them into a local copy
        • For example: hadoop fs -getmerge output results.txt (paths illustrative)
  49. Apache Hadoop Ecosystem: Overview
      • “Core Hadoop” consists of HDFS and MapReduce
        • These are the kernel of a much broader platform
      • Hadoop has many related projects
        • Some help you integrate Hadoop with other systems
        • Others help you analyze your data
        • Still others, like Oozie, help you use Hadoop more effectively
      • Most are open source Apache projects like Hadoop
        • Also like Hadoop, they have funny names
  50. Ecosystem: Apache Flume (diagram)
  51. Ecosystem: Apache Sqoop
      • Integrates with any JDBC-compatible database
      • Retrieve all tables, a single table, or a portion of a table to store in HDFS
        • For example: sqoop import --connect jdbc:mysql://dbhost/sales --table orders (connection string illustrative)
      • Can also export data from HDFS back to the database
  52. Ecosystem: Apache Hive
      • Hive allows you to run SQL-like queries on data in HDFS
      • It turns these into MapReduce jobs that run on your cluster
        • For example, SELECT level, COUNT(*) FROM events GROUP BY level could replace the hand-written job above (table name illustrative)
      • Reduces development time
  53. Ecosystem: Apache Pig
      • Apache Pig has a similar purpose to Hive
      • It has a high-level language (Pig Latin) for data analysis
      • Scripts yield MapReduce jobs that run on your cluster
      • But Pig’s approach is much different from Hive’s
  54. Ecosystem: Apache HBase
      • NoSQL database built on HDFS
      • Low-latency and high-performance for reads and writes
      • Extremely scalable
        • Tables can have billions of rows
        • And potentially millions of columns
      • A hedged Python sketch of HBase access follows below
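     For illustration only (none of this is in the original deck): HBase
     is commonly reached from Python via the third-party happybase client,
     which talks to HBase’s Thrift server. The host, table and column
     names below are made up:

         # Hypothetical low-latency reads and writes via happybase.
         # Assumes an HBase Thrift server on hbase-host and an existing
         # table "events" with a column family "data".
         import happybase

         conn = happybase.Connection("hbase-host")
         table = conn.table("events")

         table.put(b"row-0001", {b"data:level": b"WARN"})  # single-row write
         print(table.row(b"row-0001")[b"data:level"])      # single-row read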
  55. When is Hadoop (Not) a Good Choice
      • Hadoop may be a great choice when
        • You need to process non-relational (unstructured) data
        • You are processing large amounts of data
        • You can run your jobs in batch mode
      • Hadoop may not be a great choice when
        • You’re processing small amounts of data
        • Your algorithms require communication among nodes
        • You need very low latency or transactions
      • As always, use the best tool for the job
        • And know how to integrate it with other systems
  56. Conclusion
      • Thanks for your time!
      • Questions?
