© 2013 KMS Technology
AN INTRODUCTION OF
APACHE HADOOP
WHO AM I?
Minh Tran
KMS Technology
Current: Software Architect at KMS Technology
Past: Technical at Yahoo!
Senior Engineer at MobiVi, Sciant, ELCA
Admin at JavaVietnam
OBJECTIVES
• Understand what Apache Hadoop is
• Understand problems Hadoop aims to solve
• Explore Hadoop architecture and its
ecosystem
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
AGENDA – HADOOP OVERVIEW
• Big Data & Challenges
• What is Hadoop?
• Hadoop Benefits
• Which problem can Hadoop solve?
• Hadoop Installation
WHY DO WE HAVE SO MUCH
DATA?
• Every single day
– Twitter processes 340 million messages
– Facebook stores 2.7 billion comments and
“Likes”
– Google processes about 24 petabytes of data
• And every single minute
– More than 200 million e-mails are sent
– Foursquare processes more than 2,000
check-ins
WHERE DOES DATA COME FROM?
• Science: medical imaging, sensor data,
genome sequencing, weather data,
satellite feeds, etc.
• Legacy: Sales data, customer behavior,
product databases, accounting data, etc.
• System Data: Log files, network messages,
Web Analytics, intrusion detection, spam
filters
• (Not all of this maps cleanly to the relational model)
DATA ANALYSIS CHALLENGE
• Huge volumes of data
• Mixed sources result in many different formats
– XML
– CSV
– EDI
– Log files
– Objects
– SQL
– Text
– JSON
– Binary
– etc.
WHAT IS HADOOP?
• Scalable data storage and processing
– Open source Apache project
– Harnesses the power of commodity servers
– Distributed and fault-tolerant
• “Core” Hadoop consists of two main parts
– HDFS (storage)
– MapReduce (processing)
WHO USES HADOOP?
BENEFITS OF ANALYZING WITH
HADOOP
• Enables analysis that was previously
impossible or impractical
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
WHICH PROBLEM CAN
HADOOP SOLVE?
• Nature of the data
– Complex & multiple data sources
– Lots of it
• Nature of the analysis
– Batch processing
– Parallel execution
– Spread data over a cluster of servers and take the computation
to the data
• Common problems solved with Hadoop:
– Customer churn analysis
– Recommendation engine
– PoS transaction analysis
– Threat analysis
– Search quality
– Data “sandbox”
HADOOP INSTALLATION
1. Set up a Linux machine, e.g., Ubuntu
2. Install the latest JDK
3. Download the Hadoop package from
http://hadoop.apache.org/ and install it
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
AGENDA – HADOOP ARCHITECTURE
AT A GLANCE
• Hadoop Distributed File System
• How MapReduce works
COLLOCATED STORAGE
AND PROCESSING
• Because 10,000 hard disks are better than one
• Solution: store and process data on the same nodes
– Data locality: “Bring the computation to the data”
– Reduces I/O and boosts performance
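The data-locality idea can be sketched as a toy scheduler that prefers a node already holding the data. This is a minimal illustration, assuming the scheduler knows where each block's replicas live; the node names, block names, and the `pick_node` function are hypothetical and not part of Hadoop's API.

```python
# Illustrative only: a toy scheduler that prefers nodes already holding the data.
# Block placement and node names are assumptions made for this sketch.
from typing import Dict, List, Set


def pick_node(block: str, replicas: Dict[str, List[str]], free_nodes: Set[str]) -> str:
    """Return a free node holding a replica of `block` if one exists,
    otherwise fall back to any free node (which implies a remote read)."""
    for node in replicas.get(block, []):
        if node in free_nodes:
            return node  # local read: the computation moves to the data
    if not free_nodes:
        raise RuntimeError("no free nodes")
    return sorted(free_nodes)[0]  # remote read: the data must cross the network


replicas = {"A": ["node2", "node3"], "B": ["node1", "node3"]}
print(pick_node("A", replicas, {"node1", "node3"}))  # node3 holds a replica of A
print(pick_node("B", replicas, {"node2"}))           # node2: no local replica is free
```

The fallback branch is what data locality tries to avoid: the task still runs, but its input has to be pulled over the network.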
HARD DISK LATENCY
• Disk seeks are expensive
• Solution: Read lots of data at once to amortize the cost
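A rough worked example of why large reads amortize the seek cost. The 10 ms seek time and 100 MB/s transfer rate are assumed round numbers chosen for illustration, not measurements of any particular disk.

```python
# Illustrative arithmetic; the 10 ms seek time and 100 MB/s transfer rate
# are assumed round numbers, not measurements.
SEEK_SECONDS = 0.010
TRANSFER_MB_PER_SECOND = 100.0


def seek_overhead(read_size_mb: float) -> float:
    """Fraction of total read time spent seeking for one contiguous read."""
    transfer_seconds = read_size_mb / TRANSFER_MB_PER_SECOND
    return SEEK_SECONDS / (SEEK_SECONDS + transfer_seconds)


for size_mb in (0.004, 1.0, 64.0):
    print(f"{size_mb:>8} MB read: {seek_overhead(size_mb):.1%} of time spent seeking")
```

Under these assumptions a 1 MB read spends half of its time seeking, while a 64 MB read spends under 2% of its time seeking.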
HDFS BLOCKS
• When a file is added to HDFS, it’s split into blocks
• This is a similar concept to native file systems
– HDFS uses a much larger block size (64 MB by default in
Hadoop 1.x, 128 MB in later releases) for performance
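A minimal sketch of how a file's size maps onto fixed-size blocks. The 64 MB figure comes from the slide; the `block_ranges` function is hypothetical and only computes offsets, it does not touch HDFS.

```python
# Illustrative only: computes how a file of a given size would be cut into
# fixed-size blocks. The function name is hypothetical.
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, in bytes
MB = 1024 * 1024


def block_ranges(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering the file; the last block may be shorter."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


ranges = block_ranges(200 * MB)  # a 200 MB file
print(len(ranges))               # 4 blocks
print(ranges[-1][1] // MB)       # the last block holds the remaining 8 MB
```

Only the final block is shorter than the block size; a 200 MB file becomes three full 64 MB blocks plus one 8 MB block.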
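Figure: HDFS High Level Architecture
• Components: client application, Hadoop file system client,
NameNode, DataNode 1, DataNode 2, DataNode 3
• NameNode metadata for /tmp/file1.txt
– Block A: DataNode 3, DataNode 2
– Block B: DataNode 1, DataNode 3
– Block C: DataNode 1, DataNode 2, DataNode 3
• Blocks stored on each DataNode
– DataNode 1: C, D, B
– DataNode 2: A, C, D
– DataNode 3: B, A, C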
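The lookup shown in the architecture diagram can be sketched as a toy model: the NameNode holds only metadata (which blocks make up a file and where their replicas live), and the client then reads block data directly from a DataNode. The `ToyNameNode` class and its method name are hypothetical, not Hadoop's API; the block placement is copied from the diagram.

```python
# Illustrative only: a toy model of the NameNode's block map from the diagram.
# Class and method names are hypothetical, not Hadoop's API.
from typing import Dict, List, Tuple


class ToyNameNode:
    def __init__(self) -> None:
        # file path -> ordered list of block IDs
        self.files: Dict[str, List[str]] = {"/tmp/file1.txt": ["A", "B", "C"]}
        # block ID -> DataNodes holding a replica (as drawn in the diagram)
        self.locations: Dict[str, List[str]] = {
            "A": ["DataNode 3", "DataNode 2"],
            "B": ["DataNode 1", "DataNode 3"],
            "C": ["DataNode 1", "DataNode 2", "DataNode 3"],
        }

    def get_block_locations(self, path: str) -> List[Tuple[str, List[str]]]:
        """Return (block, replicas) pairs; the NameNode serves metadata only."""
        return [(block, self.locations[block]) for block in self.files[path]]


namenode = ToyNameNode()
for block, nodes in namenode.get_block_locations("/tmp/file1.txt"):
    # A real client would now read the block bytes directly from one of these nodes.
    print(f"Block {block}: {', '.join(nodes)}")
```

Keeping file contents off the NameNode is the design point: it answers small metadata queries, while the bulk data transfer happens between the client and the DataNodes.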
HOW DOES MAPREDUCE WORK?
ANOTHER EXAMPLE:
BUILDING AN INVERTED INDEX
• Input: a number of text files
• Output: a list of tuples, where each tuple is a word and a list of files
that contain the word
• Input filenames and contents
– doc1.txt: cat sat mat
– doc2.txt: cat sat dog
• Intermediate output (from the Mappers)
– cat, doc1.txt
– sat, doc1.txt
– mat, doc1.txt
– cat, doc2.txt
– sat, doc2.txt
– dog, doc2.txt
• Output filenames and contents (from the Reducers)
– part-r-00000
cat: doc1.txt, doc2.txt
– part-r-00001
sat: doc1.txt, doc2.txt
dog: doc2.txt
– part-r-00002
mat: doc1.txt
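The inverted-index flow above can be simulated in a single process. This is a minimal sketch, assuming whitespace-separated words; the function names are hypothetical, and a real Hadoop job would implement the framework's Mapper and Reducer classes instead of plain functions.

```python
# Illustrative only: simulates map, shuffle, and reduce in a single process.
# Function names are hypothetical; this does not use Hadoop's API.
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(filename: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit one (word, filename) pair per word in the file."""
    for word in contents.split():
        yield word, filename


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group intermediate values by key, as the framework does between phases."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for word, filename in pairs:
        grouped[word].append(filename)
    return grouped


def reduce_phase(word: str, filenames: List[str]) -> Tuple[str, List[str]]:
    """Collapse the filenames for one word into a sorted, de-duplicated list."""
    return word, sorted(set(filenames))


docs = {"doc1.txt": "cat sat mat", "doc2.txt": "cat sat dog"}
intermediate = [pair for name, text in docs.items() for pair in map_phase(name, text)]
index = dict(reduce_phase(word, files) for word, files in shuffle(intermediate).items())
for word in sorted(index):
    print(f"{word}: {', '.join(index[word])}")
```

In a real job the reducers run in parallel and each writes its own `part-r-NNNNN` file, so which word lands in which file depends on how keys are partitioned across reducers.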
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
HADOOP ECOSYSTEM
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
REFERENCES
• Hadoop in Practice – Alex Holmes
• Hadoop Real-World Solutions Cookbook – Jonathan R. Owens, Jon Lentz, Brian Femiano
• Hadoop in Action – Chuck Lam
• Hadoop: The Definitive Guide – Tom White
• MapReduce Design Patterns – Donald Miner, Adam Shook
• An Introduction to Hadoop – Mark Fei
• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-0-23-0/
THANK YOU
