© 2013 KMS Technology
AN INTRODUCTION TO
APACHE HADOOP
WHO AM I?
Minh Tran
KMS Technology
Current: Software Architect at KMS Technology
Past: Technical at Yahoo!
Senior Engineer at MobiVi, Sciant, ELCA
Admin at JavaVietnam
OBJECTIVES
• Understand what Apache Hadoop is
• Understand problems Hadoop aims to solve
• Explore Hadoop architecture and its
ecosystem
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
AGENDA – HADOOP OVERVIEW
• Big Data & Challenges
• What is Hadoop?
• Hadoop Benefits
• Which problem can Hadoop solve?
• Hadoop Installation
WHY DO WE HAVE SO MUCH
DATA?
• Every single day
– Twitter processes 340 million messages
– Facebook stores 2.7 billion comments and
“Likes”
– Google processes about 24 petabytes of data
• And every single minute
– More than 200 million e-mails are sent
– Foursquare processes more than 2,000
check-ins
WHERE DOES DATA COME FROM?
• Science: medical imaging, sensor data,
genome sequencing, weather data,
satellite feeds, etc.
• Legacy: Sales data, customer behavior,
product databases, accounting data, etc.
• System Data: Log files, network messages,
Web Analytics, intrusion detection, spam
filters
• (Not all of this maps cleanly to the relational model)
DATA ANALYSIS CHALLENGE
• Huge volumes of data
• Mixed sources result in many different formats
– XML
– CSV
– EDI
– Log files
– Objects
– SQL
– Text
– JSON
– Binary
– etc.
WHAT IS HADOOP?
• Scalable data storage and processing
– Open source Apache project
– Harnesses the power of commodity servers
– Distributed and fault-tolerant
• “Core” Hadoop consists of two main parts
– HDFS (storage)
– MapReduce (processing)
WHO USES HADOOP?
BENEFITS OF ANALYZING WITH
HADOOP
• Enables analysis that was previously
impossible or impractical
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
• Linear scalability
WHICH PROBLEM CAN
HADOOP SOLVE?
• Nature of the data
– Complex & multiple data sources
– Lots of it
• Nature of the analysis
– Batch processing
– Parallel execution
– Spread data over a cluster of servers and take the computation
to the data
• Common problems Hadoop is used for:
– Customer churn analysis
– Recommendation engine
– PoS transaction analysis
– Threat analysis
– Search quality
– Data “sandbox”
HADOOP INSTALLATION
1. Install a Linux machine, e.g. Ubuntu
2. Install a JDK (a version supported by your Hadoop release)
3. Install the Hadoop package, available for download at
http://hadoop.apache.org/
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
AGENDA – HADOOP ARCHITECTURE
AT A GLANCE
• Hadoop Distributed File System
• How MapReduce works
COLLOCATED STORAGE
AND PROCESSING
• Because 10,000 hard disks are better than one
• Solution: store and process data on the same nodes
– Data locality: “Bring the computation to the data”
– Reduces I/O and boosts performance
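The data-locality idea above can be sketched as a tiny scheduling rule. This is only an illustration: the `pick_node` helper and the node names are hypothetical and are not part of any Hadoop API.

```python
def pick_node(block_replicas, free_nodes):
    """Choose where to run a task that reads one block.

    block_replicas: set of node names that store a replica of the block.
    free_nodes: list of node names that currently have a free task slot.
    Returns (node, is_local); node is None when no slot is free.
    """
    # Prefer a free node that already stores the block (computation goes to the data).
    for node in free_nodes:
        if node in block_replicas:
            return node, True
    # Otherwise fall back to any free node; the block must then be read over the network.
    if free_nodes:
        return free_nodes[0], False
    return None, False


if __name__ == "__main__":
    print(pick_node({"node2", "node3"}, ["node1", "node3"]))
    print(pick_node({"node2"}, ["node1"]))
```

The second return value makes the cost visible: a non-local assignment is still valid, it just pays for a network read.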
HARD DISK LATENCY
• Disk seeks are expensive
• Solution: Read lots of data at once to amortize the cost
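A rough back-of-the-envelope sketch of why large reads amortize the seek cost. The 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not figures from this deck.

```python
# Assumed, illustrative disk characteristics (not taken from the slides).
SEEK_TIME_S = 0.010          # 10 ms per seek
TRANSFER_RATE_MB_S = 100.0   # 100 MB/s sustained transfer


def seek_overhead(read_size_mb):
    """Return the fraction of total time spent seeking for one contiguous read."""
    transfer_time_s = read_size_mb / TRANSFER_RATE_MB_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time_s)


if __name__ == "__main__":
    for size_mb in (0.004, 1, 64):
        print(f"{size_mb} MB read: {seek_overhead(size_mb):.1%} of the time is seeking")
```

Under these assumed numbers, a 4 KB read spends more than 99% of its time seeking, while a 64 MB read spends under 2%.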
HDFS BLOCKS
• When a file is added to HDFS, it’s split into blocks
• This is a similar concept to native file systems
– HDFS uses a much larger block size (64 MB by default in Hadoop 1.x,
128 MB in Hadoop 2.x and later), for performance
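As a small worked example of the splitting described above, the sketch below computes the block sizes for a file, assuming the 64 MB block size quoted on this slide. The `split_into_blocks` helper is hypothetical and only does the arithmetic; it does not talk to HDFS.

```python
BLOCK_SIZE_MB = 64  # block size quoted on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file of the given size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    sizes = [float(block_size_mb)] * full_blocks
    if remainder > 0:
        sizes.append(float(remainder))  # the last block holds only the leftover data
    return sizes


if __name__ == "__main__":
    # A 200 MB file becomes three full blocks plus one 8 MB block.
    print(split_into_blocks(200))
```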
HDFS High Level Architecture (figure)
• Client application, using the Hadoop file system client
• NameNode: /tmp/file1.txt
– Block A: DataNode 3, DataNode 2
– Block B: DataNode 1, DataNode 3
– Block C: DataNode 1, DataNode 2, DataNode 3
• DataNode 1: blocks C, D, B
• DataNode 2: blocks A, C, D
• DataNode 3: blocks B, A, C
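The block-to-DataNode mapping in the HDFS High Level Architecture figure can be modelled as a small lookup table. A minimal sketch: the dictionary layout and the `locate_blocks` helper are hypothetical and do not reflect the real NameNode API.

```python
# File -> ordered list of blocks, and block -> DataNodes holding a replica.
# The values are copied from the figure; the data structures are illustrative only.
FILE_BLOCKS = {"/tmp/file1.txt": ["A", "B", "C"]}
BLOCK_LOCATIONS = {
    "A": ["DataNode 3", "DataNode 2"],
    "B": ["DataNode 1", "DataNode 3"],
    "C": ["DataNode 1", "DataNode 2", "DataNode 3"],
}


def locate_blocks(path):
    """Return (block, datanodes) pairs for a file, in block order."""
    return [(block, BLOCK_LOCATIONS[block]) for block in FILE_BLOCKS[path]]


if __name__ == "__main__":
    for block, nodes in locate_blocks("/tmp/file1.txt"):
        print(f"Block {block}: {', '.join(nodes)}")
```

The point of the sketch is the division of labour: the lookup returns only locations, and the file contents themselves would be read from the DataNodes.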
HOW DOES MAPREDUCE WORK?
ANOTHER EXAMPLE:
BUILDING AN INVERTED INDEX
• Input: a number of text files
• Output: a list of tuples, where each tuple is a word and a list of files
that contain the word
• Input filenames and contents
– doc1.txt: cat sat mat
– doc2.txt: cat sat dog
• Mappers produce the intermediate output
– cat, doc1.txt
– sat, doc1.txt
– mat, doc1.txt
– cat, doc2.txt
– sat, doc2.txt
– dog, doc2.txt
• Reducers produce the output filenames and contents
– part-r-00000: cat: doc1.txt, doc2.txt
– part-r-00001: sat: doc1.txt, doc2.txt; dog: doc2.txt
– part-r-00002: mat: doc1.txt
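The inverted-index flow above can be imitated in a single process. This is a sketch of the map, shuffle and reduce steps in plain Python, not Hadoop's MapReduce API: the function names are hypothetical, and the partitioning of results into separate part-r-* files is not modelled.

```python
from collections import defaultdict


def map_phase(filename, contents):
    """Mapper: emit one (word, filename) pair per word in the file."""
    for word in contents.split():
        yield word, filename


def shuffle(pairs):
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for word, filename in pairs:
        grouped[word].append(filename)
    return grouped


def reduce_phase(word, filenames):
    """Reducer: collapse the filenames for one word into a sorted, de-duplicated list."""
    return word, sorted(set(filenames))


def build_inverted_index(documents):
    """Run map, shuffle and reduce over a {filename: contents} dictionary."""
    pairs = [pair for name, text in documents.items() for pair in map_phase(name, text)]
    return dict(reduce_phase(word, names) for word, names in shuffle(pairs).items())


if __name__ == "__main__":
    docs = {"doc1.txt": "cat sat mat", "doc2.txt": "cat sat dog"}
    for word, files in sorted(build_inverted_index(docs).items()):
        print(f"{word}: {', '.join(files)}")
```

In a real cluster the mappers and reducers would run in parallel on different nodes; here they run sequentially so the data flow is easy to follow.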
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
HADOOP ECOSYSTEM
AGENDA
• Hadoop Overview
• Hadoop Architecture at a glance
• Hadoop Ecosystem
• A demo of using Hadoop
REFERENCES
• Hadoop in Practice – Alex Holmes
• Hadoop Real-World Solutions Cookbook – Jonathan R. Owens, Jon Lentz, Brian Femiano
• Hadoop in Action – Chuck Lam
• Hadoop: The Definitive Guide – Tom White
• MapReduce Design Patterns – Donald Miner, Adam Shook
• An Introduction to Hadoop – Mark Fei
• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
• http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-0-23-0/
THANK YOU

Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 

An Introduction to Apache Hadoop Architecture and Ecosystem

  • 1. © 2013 KMS Technology
  • 3. WHO AM I? Minh Tran KMS Technology Current: Software Architect at KMS Technology Past: Technical at Yahoo! Senior Engineer at MobiVi, Sciant, ELCA Admin at JavaVietnam
  • 4. OBJECTIVES • Understand what Apache Hadoop is • Understand problems Hadoop aims to solve • Explore Hadoop architecture and its ecosystem
  • 5. AGENDA • Hadoop Overview • Hadoop Architecture at a glance • Hadoop Ecosystem • A demo of using Hadoop
  • 6. AGENDA – HADOOP OVERVIEW • Big Data & Challenges • What is Hadoop? • Hadoop Benefits • Which problem can Hadoop solve? • Hadoop Installation
  • 7. WHY DO WE HAVE SO MUCH DATA? • Every single day – Twitter processes 340 million messages – Facebook stores 2.7 billion comments and “Likes” – Google processes about 24 petabytes of data • And every single minute – More than 200 million e-mails are sent – Foursquare processes more than 2,000 check-ins
  • 8. WHERE DOES DATA COME FROM? • Science: medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc. • Legacy: Sales data, customer behavior, product databases, accounting data, etc. • System Data: Log files, network messages, Web Analytics, intrusion detection, spam filters • (Not all of this maps cleanly to the relational model)
  • 9. DATA ANALYSIS CHALLENGE • Huge volumes of data • Mixed sources result in many different formats – XML – CSV – EDI – Log files – Objects – SQL – Text – JSON – Binary – etc.
  • 10. WHAT IS HADOOP? • Scalable data storage and processing – Open source Apache project – Harnesses the power of commodity servers – Distributed and fault-tolerant • “Core” Hadoop consists of two main parts – HDFS (storage) – MapReduce (processing)
  • 12. BENEFITS OF ANALYZING WITH HADOOP • Previously impossible/impractical to do this analysis • Analysis conducted at lower cost • Analysis conducted in less time • Greater flexibility • Linear scalability
  • 13. WHICH PROBLEM CAN HADOOP SOLVE? • Nature of the data – Complex & multiple data sources – Lots of it • Nature of the analysis – Batch processing – Parallel execution – Spread data over a cluster of servers and take the computation to the data • Common Hadoop Problems: – Customer churn analysis – Recommendation engine – PoS transaction analysis – Threat analysis – Search quality – Data “sandbox”
  • 14. HADOOP INSTALLATION 1. Install a Linux machine, e.g. Ubuntu 2. Install the latest JDK 3. Install the Hadoop package, available for download at http://hadoop.apache.org/
  • 15. AGENDA • Hadoop Overview • Hadoop Architecture at a glance • Hadoop Ecosystem • A demo of using Hadoop
  • 16. AGENDA - HADOOP ARCHITECTURE AT A GLANCE • Hadoop Distributed File System • How MapReduce works
  • 17. COLLOCATED STORAGE AND PROCESSING • Because 10,000 hard disks are better than one • Solution: store and process data on the same nodes – Data locality: “Bring the computation to the data” – Reduces I/O and boosts performance
  • 18. HARD DISK LATENCY • Disk seeks are expensive • Solution: Read lots of data at once to amortize the cost
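The amortization argument on the hard disk latency slide can be checked with a little arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s sequential transfer rate. Both figures are illustrative assumptions rather than numbers from the slides, and `seek_overhead` is a hypothetical helper name.

```python
# Illustrative assumptions (not from the slides): rough spinning-disk figures.
SEEK_TIME_S = 0.010          # assumed average seek time: 10 ms
TRANSFER_RATE_MB_S = 100.0   # assumed sequential transfer rate: 100 MB/s


def seek_overhead(read_size_mb: float) -> float:
    """Return the fraction of total read time spent seeking for one read."""
    transfer_time_s = read_size_mb / TRANSFER_RATE_MB_S
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time_s)


if __name__ == "__main__":
    # Compare a 4 KB read, a 1 MB read, and a 64 MB read.
    for size_mb in (0.004, 1, 64):
        print(f"{size_mb:>8} MB read: {seek_overhead(size_mb):.1%} of the time is seeking")
```

Under these assumed figures, a small read spends almost all of its time seeking, while a 64 MB read spends under 2% of its time seeking, which is the effect the slide describes.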
  • 19. HDFS BLOCKS • When a file is added to HDFS, it’s split into blocks • This is a similar concept to native file systems – HDFS uses a much larger block size (64 MB), for performance
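As a rough illustration of the block splitting described on the HDFS blocks slide, the sketch below computes the block sizes for a file, using the 64 MB block size the slide quotes. It is plain arithmetic, not HDFS code, and `split_into_blocks` is a hypothetical helper name.

```python
from typing import List

BLOCK_SIZE_BYTES = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size_bytes: int,
                      block_size: int = BLOCK_SIZE_BYTES) -> List[int]:
    """Return the size in bytes of each block a file of the given size is split into."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the final block holds only the leftover bytes
    return sizes


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(200 * mb)
    print(len(blocks), [size // mb for size in blocks])
```

For example, a 200 MB file is split into three full 64 MB blocks plus one final 8 MB block.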
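  • 20. HDFS High Level Architecture (diagram) • A client application accesses HDFS through the Hadoop file system client • The NameNode records which DataNodes hold each block of /tmp/file1.txt – Block A: DataNode 3, DataNode 2 – Block B: DataNode 1, DataNode 3 – Block C: DataNode 1, DataNode 2, DataNode 3 • Blocks stored on each DataNode – DataNode 1: C, D, B – DataNode 2: A, C, D – DataNode 3: B, A, C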
  • 22. ANOTHER EXAMPLE: BUILDING AN INVERTED INDEX • Input: a number of text files • Output: a list of tuples, where each tuple is a word and a list of files that contain the word • Input filenames and contents – doc1.txt: cat sat mat – doc2.txt: cat sat dog • Intermediate output (from mappers) – cat, doc1.txt – sat, doc1.txt – mat, doc1.txt – cat, doc2.txt – sat, doc2.txt – dog, doc2.txt • Output filenames and contents (from reducers) – part-r-00000: cat: doc1.txt, doc2.txt – part-r-00001: sat: doc1.txt, doc2.txt; dog: doc2.txt – part-r-00002: mat: doc1.txt
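The inverted index example can be simulated in a single Python process to show the map and reduce steps. This is only a sketch of the idea: it does not use Hadoop's MapReduce API, the function names (`map_phase`, `reduce_phase`, `build_inverted_index`) are hypothetical, and the shuffle and the partitioning into separate `part-r-*` output files are collapsed into one in-memory grouping step.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(filename: str, contents: str) -> Iterator[Tuple[str, str]]:
    """Emit one (word, filename) pair per word, like the mappers on the slide."""
    for word in contents.split():
        yield word, filename


def reduce_phase(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group filenames by word, like the reducers on the slide."""
    index = defaultdict(set)
    for word, filename in pairs:
        index[word].add(filename)
    # Sort words and filenames so the result is deterministic.
    return {word: sorted(files) for word, files in sorted(index.items())}


def build_inverted_index(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Run the map step over every document, then reduce the combined pairs."""
    pairs = [pair
             for name, text in documents.items()
             for pair in map_phase(name, text)]
    return reduce_phase(pairs)


if __name__ == "__main__":
    docs = {"doc1.txt": "cat sat mat", "doc2.txt": "cat sat dog"}
    for word, files in build_inverted_index(docs).items():
        print(f"{word}: {', '.join(files)}")
```

With the two input files from the slide, the resulting index maps `cat` and `sat` to both files, `mat` to doc1.txt only, and `dog` to doc2.txt only.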
  • 23. AGENDA • Hadoop Overview • Hadoop Architecture at a glance • Hadoop Ecosystem • A demo of using Hadoop
  • 25. AGENDA • Hadoop Overview • Hadoop Architecture at a glance • Hadoop Ecosystem • A demo of using Hadoop
  • 26. REFERENCES • Hadoop in Practice – Alex Holmes • Hadoop Real-World Solutions Cookbook – Jonathan R. Owens, Jon Lentz, Brian Femiano • Hadoop in Action – Chuck Lam • Hadoop: The Definitive Guide – Tom White • MapReduce Design Patterns – Donald Miner, Adam Shook • An Introduction to Hadoop – Mark Fei • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ • http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-0-23-0/
  • 27. THANK YOU