Introduction to Hadoop
Ran Ziv
Introduction to Hadoop Ran Ziv© 2012 1
Who Am I?
Ran Ziv
Current:
Data Architect at Technology Research Group
Past:
Architect, Data Platform & Analytics Group Manager, LivePerson
Analytics Project Manager, Software Industry
System Analyst, Telco Industry
Fraud Detection Systems Engineer, Telco Industry
Data Researcher, Internet Industry
What’s Ahead?
• Solid introduction to Apache Hadoop
• What it is
• Why it’s relevant
• How it works
• The Ecosystem
• No prior experience needed
• Feel free to ask questions
What Is Apache Hadoop?
• Scalable data storage and processing
• Open source Apache project
• Harnesses the power of commodity servers
• Distributed and fault-tolerant
• “Core” Hadoop consists of two main parts
• HDFS (storage)
• MapReduce (processing)
A Large Ecosystem
A Coherent Platform
How Did Apache Hadoop Originate?
• Heavily influenced by Google’s architecture
• Notably, the Google File System and MapReduce papers
• Other Web companies quickly saw the benefits
• Early adoption by Yahoo, Facebook and others
What Is Common Across Hadoop-able Problems?
• Nature of the data
• Complex data
• Multiple data sources
• Lots of it
• Nature of the analysis
• Parallel execution
• Spread data over a cluster of servers and take the computation to the data
Benefits Of Analyzing With Hadoop
• Previously impossible/impractical to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
Hadoop: How does it work?
• Moore’s law… and not
Disk Capacity and Price
• We’re generating more data than ever before
• Fortunately, the size and cost of storage have kept pace
• Capacity has increased while price has decreased
Disk Capacity and Performance
• Disk performance has also increased in the last 15 years
• Unfortunately, transfer rates haven’t kept pace with capacity
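A rough back-of-the-envelope calculation shows why transfer rate, rather than capacity, becomes the bottleneck. The figures used here (a 1 TB disk read at 100 MB/s, and 100 disks read in parallel) are illustrative assumptions, not numbers taken from the slides.

```python
TB = 10**12  # bytes, decimal units (assumption for round numbers)
MB = 10**6

def scan_seconds(data_bytes, rate_bytes_per_s, disks=1):
    """Time to read data_bytes spread evenly over `disks` drives read in parallel."""
    return data_bytes / (rate_bytes_per_s * disks)

# Hypothetical figures: 1 TB of data, 100 MB/s sustained transfer per disk.
one_disk = scan_seconds(1 * TB, 100 * MB)
hundred_disks = scan_seconds(1 * TB, 100 * MB, disks=100)

print(f"1 disk:    {one_disk / 3600:.1f} hours")
print(f"100 disks: {hundred_disks:.0f} seconds")
```

Under these assumptions, spreading the same data over many disks and reading them in parallel cuts the scan time in proportion to the number of disks.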
Architecture of a Typical HPC System
You Don’t Just Need Speed…
• The problem is that we have far more data than code
You Need Speed At Scale
Collocated Storage and Processing
• Solution: store and process data on the same nodes
• Data Locality: “Bring the computation to the data”
• Reduces I/O and boosts performance
Introducing HDFS
• Hadoop Distributed File System
• Scalable storage influenced by Google’s file system paper
• It’s not a general-purpose file system
• HDFS is optimized for Hadoop
• Values high throughput much more than low latency
• It’s a user-space Java process
• Primarily accessed via command-line utilities and Java API
HDFS is (Mostly) UNIX-Like
• In many ways, HDFS is similar to a UNIX file system
• Hierarchical
• Unix-style paths (e.g. /foo/bar/myfile.txt)
• File ownership and permissions
• There are also some major deviations from Unix
• Cannot modify files once written
HDFS High-Level Architecture
• HDFS follows a master-slave architecture
• There are two essential daemons in HDFS
• Master: NameNode
• Responsible for namespace and metadata
• Namespace: file hierarchy
• Metadata: ownership, permissions, block locations, etc.
• Slave: DataNode
• Responsible for storing the actual data blocks
HDFS Blocks
• When a file is added to HDFS, it’s split into blocks
• This is a similar concept to native file systems
• HDFS uses a much larger block size (64 MB by default) for performance
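As a small worked example of the splitting described above, the sketch below computes the block lengths for a file of a given size. The 64 MB figure comes from the slide; the function name and the 200 MB example file are hypothetical.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # block size quoted on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block a file of file_size would occupy."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    # The final block only holds the leftover bytes, not a full block.
    return [block_size] * full_blocks + ([remainder] if remainder else [])

blocks = split_into_blocks(200 * MB)  # hypothetical 200 MB file
print(len(blocks), "blocks:", [b // MB for b in blocks], "MB")
```

A 200 MB file therefore becomes three full 64 MB blocks plus one 8 MB block.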
HDFS Replication
• Those blocks are then replicated across machines
• The first block might be replicated to A, C and D
• The next block might be replicated to B, D and E
• The last block might be replicated to A, C and E
HDFS Reliability
• Replication helps to achieve reliability
• Even when a node fails, two copies of the block remain
• These will be re-replicated to other nodes automatically
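The replication example from the preceding slides can be simulated in a few lines. This is only a sketch: the block-to-node placement is the one given on the slides, but the function names and the "pick the first live node alphabetically" policy are assumptions made for illustration, not how the NameNode actually chooses targets.

```python
# Placement from the slides: each block is stored on three of the nodes A-E.
placement = {
    "block1": {"A", "C", "D"},
    "block2": {"B", "D", "E"},
    "block3": {"A", "C", "E"},
}

def fail_node(placement, node):
    """Return a new placement with every replica on the failed node removed."""
    return {block: holders - {node} for block, holders in placement.items()}

def re_replicate(placement, live_nodes, target=3):
    """Copy under-replicated blocks to live nodes until each has `target` replicas.

    Hypothetical policy: try live nodes in alphabetical order.
    """
    healed = {}
    for block, holders in placement.items():
        holders = set(holders)
        for candidate in sorted(live_nodes):
            if len(holders) >= target:
                break
            holders.add(candidate)  # no effect if the node already holds the block
        healed[block] = holders
    return healed

after_failure = fail_node(placement, "A")
healed = re_replicate(after_failure, ["B", "C", "D", "E"])
for block in sorted(healed):
    print(block, sorted(after_failure[block]), "->", sorted(healed[block]))
```

After node A fails, every block still has at least two replicas, and re-replication brings each block back to three copies on the surviving nodes.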
MapReduce High-Level Architecture
• Like HDFS, MapReduce has a master-slave architecture
• There are two daemons in “classical” MapReduce
• Master: JobTracker
• Responsible for dividing, scheduling and monitoring work
• Slave: TaskTracker
• Responsible for actual processing
Gentle Introduction to MapReduce
• MapReduce is conceptually like a UNIX pipeline
• One function (Map) processes data
• That output is ultimately input to another function (Reduce)
The Map Function
• Operates on each record individually
• Typical uses include filtering, parsing, or transforming
Intermediate Processing
• The Map function’s output is grouped and sorted
• This is the automatic “sort and shuffle” process in Hadoop
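Hadoop performs this step itself, but its effect can be imitated locally to make the idea concrete. The sketch below is a single-process stand-in, not Hadoop's implementation; the function name and the sample pairs are hypothetical.

```python
from collections import defaultdict

def shuffle_and_sort(pairs):
    """Group values by key, then return the groups ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Keys are unique after grouping, so sorting the items orders them by key.
    return sorted(groups.items())

# Hypothetical map output: (event type, 1) pairs in arrival order.
map_output = [("INFO", 1), ("WARN", 1), ("INFO", 1), ("ERROR", 1)]
grouped = shuffle_and_sort(map_output)
for key, values in grouped:
    print(key, values)
```

Each key now appears once, in sorted order, together with all of its values.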
The Reduce Function
• Operates on all records in a group
• Often used for sum, average or other aggregate functions
MapReduce Benefits
• Complex details are abstracted away from the developer
• No file I/O
• No networking code
• No synchronization
MapReduce Example in Python
• MapReduce code for Hadoop is typically written in Java
• But it is possible to use nearly any language with Hadoop Streaming
• I’ll show the log event counter using MapReduce in Python
• It’s very helpful to see the data as well as the code
Job Input
• Each mapper gets a chunk of the job’s input data to process
• This “chunk” is called an InputSplit
Python Code for Map Function
• Our map function will parse the event type
• And then output that event (key) and a literal 1 (value)
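The slide's code did not survive in this transcript, so here is a minimal sketch of what such a map function might look like. It assumes a hypothetical log format in which the event type is the third whitespace-separated field (for example `2012-09-06 22:16:49 INFO Service started`); the function names and sample lines are illustrative, not the original code.

```python
def map_line(line):
    """Return (event_type, 1) for one log line, or None if the line is malformed."""
    fields = line.split()
    if len(fields) < 3:
        return None  # skip lines that do not match the assumed format
    return fields[2].upper(), 1

def run_mapper(lines):
    """Yield tab-separated key/value strings, the default Hadoop Streaming text format."""
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            yield f"{pair[0]}\t{pair[1]}"

# Hypothetical sample input. Under Hadoop Streaming the mapper would instead
# iterate over sys.stdin and print each result to standard output.
sample = [
    "2012-09-06 22:16:49 INFO Service started",
    "2012-09-06 22:16:50 WARN Disk usage high",
    "2012-09-06 22:16:51 INFO Request served",
]
for out in run_mapper(sample):
    print(out)
```

Emitting a literal 1 per record keeps the mapper stateless; all counting is left to the reducer.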
Output of Map Function
• The map function produces key/value pairs as output
Input to Reduce Function
• The Reducer receives a key and all values for that key
• Keys are always passed to reducers in sorted order
• Although it’s not obvious here, values are unordered
Python Code for Reduce Function
• The Reducer first extracts the key and value it was passed
• Then simply adds up the values for each key
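The reducer code from the slides is also missing here, so the following is a sketch under stated assumptions rather than the original. With Hadoop Streaming, a reducer reads one tab-separated key/value pair per line, already sorted by key, and has to detect for itself where one key ends and the next begins; `itertools.groupby` does that below. The function names and sample lines are hypothetical.

```python
from itertools import groupby

def parse(line):
    """Split one 'key<TAB>value' line into (key, int(value))."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, int(value)

def run_reducer(sorted_lines):
    """Yield (key, total) for each run of consecutive lines that share a key."""
    pairs = (parse(line) for line in sorted_lines if line.strip())
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(value for _, value in group)

# Hypothetical reducer input, already sorted by key as Hadoop guarantees.
# Under Hadoop Streaming the reducer would iterate over sys.stdin instead.
sample = ["ERROR\t1", "INFO\t1", "INFO\t1", "WARN\t1"]
totals = list(run_reducer(sample))
for key, total in totals:
    print(f"{key}\t{total}")
```

Because `groupby` only merges adjacent items, this relies on the input arriving sorted by key.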
Output of Reduce Function
• The output of this Reduce function is a sum for each level
Recap of Data Flow
Input Splits Feed the Map Tasks
• Input for the entire job is subdivided into InputSplits
• An InputSplit usually corresponds to a single HDFS block
• Each of these serves as input to a single Map task
Mappers Feed the Shuffle and Sort
• Output of all Mappers is partitioned, merged, and sorted (no code required; Hadoop does this automatically)
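To illustrate the partitioning step, the sketch below assigns each key to a reducer by hashing it and taking the result modulo the number of reducers, which is the general idea behind hash-based partitioning. CRC32 is a stand-in hash chosen here because it is deterministic across Python runs; the function names and sample data are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index for a key; equal keys always get the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_and_sort(pairs, num_reducers):
    """Route each (key, value) pair to a reducer bucket, then sort each bucket by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket, key=lambda kv: kv[0]) for bucket in buckets]

# Hypothetical map output from several mappers, destined for two reducers.
map_output = [("INFO", 1), ("WARN", 1), ("INFO", 1), ("ERROR", 1), ("WARN", 1)]
buckets = partition_and_sort(map_output, num_reducers=2)
for index, bucket in enumerate(buckets):
    print(f"reducer {index}: {bucket}")
```

Since the reducer index depends only on the key, every value for a given key ends up at the same reducer.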
Shuffle and Sort Feeds the Reducers
• All values for a given key are then collapsed into a list
• The key and all its values are fed to reducers as input
Each Reducer Has an Output File
• These are stored in HDFS below your output directory
• Use hadoop fs -getmerge to combine them into a local copy
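To show what combining the per-reducer files amounts to, here is a local stand-in written in Python. It works on an ordinary local directory, not on HDFS, and is not how `hadoop fs -getmerge` is implemented; the `part-*` file names, directory layout, and function name are assumptions for illustration.

```python
import tempfile
from pathlib import Path

def merge_parts(output_dir, merged_path):
    """Concatenate part-* files from output_dir, in name order, into merged_path."""
    parts = sorted(Path(output_dir).glob("part-*"))
    with open(merged_path, "wb") as merged:
        for part in parts:
            merged.write(part.read_bytes())
    return len(parts)

# Demo on a throwaway local directory holding two hypothetical reducer outputs.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    out_dir.mkdir()
    (out_dir / "part-00000").write_text("ERROR\t1\n")
    (out_dir / "part-00001").write_text("INFO\t2\nWARN\t1\n")
    merged_file = Path(tmp) / "merged.txt"
    count = merge_parts(out_dir, merged_file)
    merged_text = merged_file.read_text()

print(count, "files merged")
print(merged_text, end="")
```

Each reducer writes its own file, so the job's full result is the concatenation of all of them.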
Apache Hadoop Ecosystem: Overview
• "Core Hadoop" consists of HDFS and MapReduce
• These are the kernel of a much broader platform
• Hadoop has many related projects
• Some help you integrate Hadoop with other systems
• Others help you analyze your data
• Still others, like Oozie, help you use Hadoop more effectively
• Most are open source Apache projects like Hadoop
• Also like Hadoop, they have funny names
Ecosystem: Apache Flume
Ecosystem: Apache Sqoop
• Integrates with any JDBC-compatible database
• Retrieve all tables, a single table, or a portion to store in HDFS
• Can also export data from HDFS back to the database
Ecosystem: Apache Hive
• Hive allows you to do SQL-like queries on data in HDFS
• It turns this into MapReduce jobs that run on your cluster
• Reduces development time
Ecosystem: Apache Pig
• Apache Pig has a similar purpose to Hive
• It has a high-level language (PigLatin) for data analysis
• Scripts yield MapReduce jobs that run on your cluster
• But Pig’s approach is quite different from Hive’s
Ecosystem: Apache HBase
• NoSQL database built on HDFS
• Low-latency and high-performance for reads and writes
• Extremely scalable
• Tables can have billions of rows
• And potentially millions of columns
When Is Hadoop (Not) a Good Choice?
• Hadoop may be a great choice when
• You need to process non-relational (unstructured) data
• You are processing large amounts of data
• You can run your jobs in batch mode
• Hadoop may not be a great choice when
• You’re processing small amounts of data
• Your algorithms require communication among nodes
• You need very low latency or transactions
• As always, use the best tool for the job
• And know how to integrate it with other systems
Conclusion
• Thanks for your time!
• Questions?
Introduction to Hadoop Ran Ziv© 2012 64

More Related Content

What's hot

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewKonstantin V. Shvachko
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Edureka!
 
A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationSameer Tiwari
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013WANdisco Plc
 

What's hot (20)

Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
HDFS
HDFSHDFS
HDFS
 
6.hive
6.hive6.hive
6.hive
 
Distributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology OverviewDistributed Computing with Apache Hadoop: Technology Overview
Distributed Computing with Apache Hadoop: Technology Overview
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability | Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
 
A Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animationA Basic Introduction to the Hadoop eco system - no animation
A Basic Introduction to the Hadoop eco system - no animation
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 
Big Data and Hadoop - An Introduction
Big Data and Hadoop - An IntroductionBig Data and Hadoop - An Introduction
Big Data and Hadoop - An Introduction
 
02.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 201302.28.13 WANdisco ApacheCon 2013
02.28.13 WANdisco ApacheCon 2013
 

Similar to Introduction to Hadoop

Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big DataAndrew Brust
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 
SpringPeople Introduction to Apache Hadoop
SpringPeople Introduction to Apache HadoopSpringPeople Introduction to Apache Hadoop
SpringPeople Introduction to Apache HadoopSpringPeople
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemZohar Elkayam
 
Oozie & sqoop by pradeep
Oozie & sqoop by pradeepOozie & sqoop by pradeep
Oozie & sqoop by pradeepPradeep Pandey
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop OverviewBrian Enochson
 
Hadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingssHadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingssA1 Trainings
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online TrainingLearntek1
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightTillmann Eitelberg
 
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Alex Gorbachev
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemZohar Elkayam
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Andrew Brust
 

Similar to Introduction to Hadoop (20)

SQL Server 2012 and Big Data
SQL Server 2012 and Big DataSQL Server 2012 and Big Data
SQL Server 2012 and Big Data
 
Microsoft's Big Play for Big Data
Microsoft's Big Play for Big DataMicrosoft's Big Play for Big Data
Microsoft's Big Play for Big Data
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 
SpringPeople Introduction to Apache Hadoop
SpringPeople Introduction to Apache HadoopSpringPeople Introduction to Apache Hadoop
SpringPeople Introduction to Apache Hadoop
 
hadoop overview.pptx
hadoop overview.pptxhadoop overview.pptx
hadoop overview.pptx
 
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop EcosystemThings Every Oracle DBA Needs to Know about the Hadoop Ecosystem
Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem
 
Oozie & sqoop by pradeep
Oozie & sqoop by pradeepOozie & sqoop by pradeep
Oozie & sqoop by pradeep
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
1. Apache HIVE
1. Apache HIVE1. Apache HIVE
1. Apache HIVE
 
Hive
HiveHive
Hive
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Asbury Hadoop Overview
Asbury Hadoop OverviewAsbury Hadoop Overview
Asbury Hadoop Overview
 
Hadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingssHadoop course content @ a1 trainingss
Hadoop course content @ a1 trainingss
 
Big data - Online Training
Big data - Online TrainingBig data - Online Training
Big data - Online Training
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
Hive_Pig.pptx
Hive_Pig.pptxHive_Pig.pptx
Hive_Pig.pptx
 
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
Bridging Oracle Database and Hadoop by Alex Gorbachev, Pythian from Oracle Op...
 
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop EcosystemThings Every Oracle DBA Needs To Know About The Hadoop Ecosystem
Things Every Oracle DBA Needs To Know About The Hadoop Ecosystem
 
Big Data Strategy for the Relational World
Big Data Strategy for the Relational World Big Data Strategy for the Relational World
Big Data Strategy for the Relational World
 

Recently uploaded

My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechProduct School
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)IES VE
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxSatishbabu Gunukula
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FESTBillieHyde
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxNeo4j
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 

Recently uploaded (20)

My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - TechWebinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
Webinar: The Art of Prioritizing Your Product Roadmap by AWS Sr PM - Tech
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptx
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptxGraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
GraphSummit Copenhagen 2024 - Neo4j Vision and Roadmap.pptx
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 update
 

Introduction to Hadoop

  • 1. Introduction to Hadoop Ran Ziv Introduction to Hadoop Ran Ziv© 2012 1
  • 2. Who Am I? Ran Ziv Current: Past: Data Architect at Technology Research Group Architect, Data Platform & Analytics Group Manager, LivePerson Analytics Project Manager, Software Industry System Analyst, Telco Industry Fraud Detection Systems Engineer, Telco Industry Data Researcher, Internet Industry Introduction to Hadoop Ran Ziv© 2012 2
  • 3. What’s Ahead? • Solid introduction to Apache Hadoop • What it is • Why it’s relevant • How it works • The Ecosystem • No prior experience needed • Feel free to ask questions
  • 4. What Is Apache Hadoop? • Scalable data storage and processing • Open source Apache project • Harnesses the power of commodity servers • Distributed and fault-tolerant • “Core” Hadoop consists of two main parts • HDFS (storage) • MapReduce (processing)
  • 5. A Large Ecosystem
  • 6. A Coherent Platform
  • 7. How Did Apache Hadoop Originate? • Heavily influenced by Google’s architecture • Notably, the Google File System and MapReduce papers • Other Web companies quickly saw the benefits • Early adoption by Yahoo, Facebook and others
  • 8. What Is Common Across Hadoop-able Problems? • Nature of the data • Complex data • Multiple data sources • Lots of it • Nature of the analysis • Parallel execution • Spread data over a cluster of servers and take the computation to the data
  • 9. Benefits Of Analyzing With Hadoop • Previously impossible/impractical to do this analysis • Analysis conducted at lower cost • Analysis conducted in less time • Greater flexibility • Linear scalability
  • 10. Hadoop: How does it work? • Moore’s law… and not
  • 11. Disk Capacity and Price • We’re generating more data than ever before • Fortunately, the size and cost of storage have kept pace • Capacity has increased while price has decreased
  • 12. Disk Capacity and Performance • Disk performance has also increased in the last 15 years • Unfortunately, transfer rates haven’t kept pace with capacity
  • 13. Architecture of a Typical HPC System
  • 14. Architecture of a Typical HPC System
  • 15. Architecture of a Typical HPC System
  • 16. Architecture of a Typical HPC System
  • 17. You Don’t Just Need Speed… • The problem is that we have way more data than code
  • 18. You Need Speed At Scale
  • 20. Collocated Storage and Processing • Solution: store and process data on the same nodes • Data Locality: “Bring the computation to the data” • Reduces I/O and boosts performance
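
The data-locality idea on slide 20 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `pick_node` and its arguments are hypothetical names, and a real Hadoop scheduler also weighs rack topology and slot availability, which are ignored here.

```python
# Hypothetical sketch of data-local task placement (not Hadoop's scheduler).
def pick_node(block_replicas, idle_nodes):
    """Choose a worker for a task that reads one block.

    block_replicas: set of node names that store a copy of the block
    idle_nodes: list of node names with a free task slot
    Returns (node, is_local); node is None when nothing is idle.
    """
    for node in idle_nodes:
        if node in block_replicas:
            return node, True        # data-local: the block is read from local disk
    if idle_nodes:
        return idle_nodes[0], False  # fallback: the block must cross the network
    return None, False               # no capacity right now


replicas = {"A", "C", "D"}
print(pick_node(replicas, ["B", "C", "E"]))  # ('C', True)
print(pick_node(replicas, ["B", "E"]))       # ('B', False)
```

The point of the sketch is the ordering of preferences: a node that already holds the data wins, and a remote read is only a fallback.
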
  • 21. Introducing HDFS • Hadoop Distributed File System • Scalable storage influenced by Google’s file system paper • It’s not a general-purpose file system • HDFS is optimized for Hadoop • Values high throughput much more than low latency • It’s a user-space Java process • Primarily accessed via command-line utilities and Java API
  • 22. HDFS is (Mostly) UNIX-Like • In many ways, HDFS is similar to a UNIX file system • Hierarchical • UNIX-style paths (e.g. /foo/bar/myfile.txt) • File ownership and permissions • There are also some major deviations from UNIX • Cannot modify files once written
  • 23. HDFS High-Level Architecture • HDFS follows a master-slave architecture • There are two essential daemons in HDFS • Master: NameNode • Responsible for namespace and metadata • Namespace: file hierarchy • Metadata: ownership, permissions, block locations, etc. • Slave: DataNode • Responsible for storing actual data blocks
  • 24. HDFS Blocks • When a file is added to HDFS, it’s split into blocks • This is a similar concept to native file systems • HDFS uses a much larger block size (64 MB), for performance
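
As a rough worked example of the block arithmetic on slide 24, the sketch below splits a file size into 64 MB blocks. The function name `split_into_blocks` is made up for illustration; only the 64 MB figure comes from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks))                               # 4
print([b // (1024 * 1024) for b in blocks])      # [64, 64, 64, 8]
```

A 200 MB file therefore occupies three full blocks plus one 8 MB block, and each of those blocks is what gets replicated across machines.
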
  • 25. HDFS Replication • Those blocks are then replicated across machines • The first block might be replicated to A, C and D
  • 26. HDFS Replication • The next block might be replicated to B, D and E
  • 27. HDFS Replication • The last block might be replicated to A, C and E
  • 28. HDFS Reliability • Replication helps to achieve reliability • Even when a node fails, two copies of the block remain • These will be re-replicated to other nodes automatically
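
The replication and recovery behaviour on slides 25 to 28 can be mimicked with a toy model. This is a sketch under stated assumptions, not how the NameNode is implemented: `re_replicate` is a hypothetical helper, target nodes are chosen at random, and real HDFS placement also takes racks and disk usage into account.

```python
import random


def re_replicate(block_map, failed_node, live_nodes, replication=3, rng=random):
    """Remove failed_node from every block and restore the replica count.

    block_map: dict mapping block id -> set of node names holding a copy
    live_nodes: nodes that are still available as copy targets
    """
    for holders in block_map.values():
        holders.discard(failed_node)
        missing = replication - len(holders)
        if missing > 0:
            # Only nodes that do not already hold the block are valid targets.
            candidates = sorted(set(live_nodes) - holders - {failed_node})
            holders.update(rng.sample(candidates, min(missing, len(candidates))))
    return block_map


# Placement taken from the slides: three blocks spread over nodes A to E.
block_map = {
    "block-1": {"A", "C", "D"},
    "block-2": {"B", "D", "E"},
    "block-3": {"A", "C", "E"},
}
re_replicate(block_map, failed_node="C", live_nodes=["A", "B", "D", "E"])
for block, holders in sorted(block_map.items()):
    print(block, sorted(holders))
```

After node C fails, blocks 1 and 3 each still have two copies, and the model copies them to one more live node so that every block is back to three replicas.
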
  • 30. MapReduce High-Level Architecture • Like HDFS, MapReduce has a master-slave architecture • There are two daemons in “classical” MapReduce • Master: JobTracker • Responsible for dividing, scheduling and monitoring work • Slave: TaskTracker • Responsible for actual processing
  • 31. Gentle Introduction to MapReduce • MapReduce is conceptually like a UNIX pipeline • One function (Map) processes data • That output is ultimately input to another function (Reduce)
  • 32. The Map Function • Operates on each record individually • Typical uses include filtering, parsing, or transforming
  • 33. Intermediate Processing • The Map function’s output is grouped and sorted • This is the automatic “sort and shuffle” process in Hadoop
  • 34. The Reduce Function • Operates on all records in a group • Often used for sum, average or other aggregate functions
  • 35. MapReduce Benefits • Complex details are abstracted away from the developer • No file I/O • No networking code • No synchronization
  • 36. MapReduce Example in Python • MapReduce code for Hadoop is typically written in Java • But it is possible to use nearly any language with Hadoop Streaming • I’ll show the log event counter using MapReduce in Python • It’s very helpful to see the data as well as the code
  • 37. Job Input • Each mapper gets a chunk of the job’s input data to process • This “chunk” is called an InputSplit
  • 38. Python Code for Map Function • Our map function will parse the event type • And then output that event (key) and a literal 1 (value)
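
The slide's own code is not part of this transcript, so the following is a stand-in sketch of a map function in the style of Hadoop Streaming. The log layout is an assumption made for illustration (date, time, then the event type as the third whitespace-separated field), and `map_line` and `run_mapper` are hypothetical names.

```python
def map_line(line):
    """Return 'EVENT<TAB>1' for one log line, or None if the line is malformed.

    Assumed layout (not from the slides): '<date> <time> <EVENT> <message...>'
    """
    fields = line.split()
    if len(fields) < 3:
        return None
    return f"{fields[2]}\t1"  # key and value separated by a tab


def run_mapper(lines):
    """Apply map_line to each record independently, skipping malformed ones."""
    for line in lines:
        record = map_line(line)
        if record is not None:
            yield record


sample = [
    "2012-09-06 22:16:49 INFO Starting server",
    "2012-09-06 22:16:50 WARN Disk almost full",
    "",
    "2012-09-06 22:16:51 INFO Ready",
]
for record in run_mapper(sample):
    print(record)
```

In an actual streaming job the script would feed `sys.stdin` to `run_mapper` and print each record to standard output; the in-memory sample is used here so the sketch runs on its own.
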
  • 39. Output of Map Function • The map function produces key/value pairs as output
  • 40. Input to Reduce Function • The Reducer receives a key and all values for that key • Keys are always passed to reducers in sorted order • Although it’s not obvious here, values are unordered
  • 41. Python Code for Reduce Function • The Reducer first extracts the key and value it was passed
  • 42. Python Code for Reduce Function • Then simply adds up the values for each key
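
The reduce step described on slides 41 and 42 might look roughly like this in a streaming-style Python script. It is a sketch, not the code from the slides: it assumes tab-separated key/value lines that arrive already sorted by key, and `run_reducer` is a hypothetical name.

```python
def run_reducer(lines):
    """Sum the values for each key in key-sorted, tab-separated lines."""
    current_key, total = None, 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split("\t", 1)   # extract the key and value
        if key != current_key:
            if current_key is not None:
                yield f"{current_key}\t{total}"  # previous key is complete
            current_key, total = key, 0
        total += int(value)                # add up the values for this key
    if current_key is not None:
        yield f"{current_key}\t{total}"    # emit the final key


sorted_map_output = ["ERROR\t1", "INFO\t1", "INFO\t1", "INFO\t1", "WARN\t1"]
for record in run_reducer(sorted_map_output):
    print(record)
```

Because the input is sorted by key, the reducer only needs to remember one running total at a time and can emit a result whenever the key changes.
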
  • 43. Output of Reduce Function • The output of this Reduce function is a sum for each level
  • 44. Recap of Data Flow
  • 45. Input Splits Feed the Map Tasks • Input for the entire job is subdivided into InputSplits • An InputSplit usually corresponds to a single HDFS block • Each of these serves as input to a single Map task
  • 46. Mappers Feed the Shuffle and Sort • Output of all Mappers is partitioned, merged, and sorted (no code required; Hadoop does this automatically)
  • 47. Shuffle and Sort Feeds the Reducers • All values for a given key are then collapsed into a list • The key and all its values are fed to reducers as input
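
To make the recap on slides 45 to 47 concrete, here is a small in-memory imitation of partition, sort, and group. Hadoop does this work itself, so the code is purely explanatory; the function names are hypothetical, and `zlib.crc32` is a stand-in for whatever hash the real partitioner uses.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Assign a key to a reducer; every pair with the same key lands together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Partition (key, value) pairs, sort each partition, collapse values per key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    grouped = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # keys reach each reducer in sorted order
        grouped.append(
            [(key, [v for _, v in group])
             for key, group in groupby(bucket, key=itemgetter(0))]
        )
    return grouped


def reduce_fn(key, values):
    """Aggregate all values for one key."""
    return key, sum(values)


map_output = [("INFO", 1), ("WARN", 1), ("INFO", 1), ("ERROR", 1), ("INFO", 1)]
for reducer_id, groups in enumerate(shuffle_and_sort(map_output, num_reducers=2)):
    for key, values in groups:
        print(reducer_id, reduce_fn(key, values))
```

Which reducer receives which key depends on the hash, but each key goes to exactly one reducer, and that reducer sees the key once with the full list of its values.
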
  • 48. Each Reducer Has an Output File • These are stored in HDFS below your output directory • Use hadoop fs -getmerge to combine them into a local copy
  • 49. Apache Hadoop Ecosystem: Overview • "Core Hadoop" consists of HDFS and MapReduce • These are the kernel of a much broader platform • Hadoop has many related projects • Some help you integrate Hadoop with other systems • Others help you analyze your data • Still others, like Oozie, help you use Hadoop more effectively • Most are open source Apache projects like Hadoop • Also like Hadoop, they have funny names
  • 50. Ecosystem: Apache Flume
  • 51. Ecosystem: Apache Sqoop • Integrates with any JDBC-compatible database • Retrieve all tables, a single table, or a portion to store in HDFS • Can also export data from HDFS back to the database
  • 52. Ecosystem: Apache Hive • Hive allows you to do SQL-like queries on data in HDFS • It turns these queries into MapReduce jobs that run on your cluster • Reduces development time
  • 53. Ecosystem: Apache Pig • Apache Pig has a similar purpose to Hive • It has a high-level language (PigLatin) for data analysis • Scripts yield MapReduce jobs that run on your cluster • But Pig’s approach is quite different from Hive’s
  • 54. Ecosystem: Apache HBase • NoSQL database built on HDFS • Low-latency and high-performance for reads and writes • Extremely scalable • Tables can have billions of rows • And potentially millions of columns
  • 55. When Is Hadoop (Not) a Good Choice? • Hadoop may be a great choice when • You need to process non-relational (unstructured) data • You are processing large amounts of data • You can run your jobs in batch mode • Hadoop may not be a great choice when • You’re processing small amounts of data • Your algorithms require communication among nodes • You need very low latency or transactions • As always, use the best tool for the job • And know how to integrate it with other systems
  • 56. Conclusion • Thanks for your time! • Questions?