An Introduction of Apache Hadoop

This slide deck was shared by Mr. Minh Tran, Software Architect at KMS, at the "Java-Trends and Career Opportunities" seminar of the Information Technology Center of HCMC University of Science.

Published in: Technology

Transcript of "An Introduction of Apache Hadoop"

  1. © 2013 KMS Technology
  2. AN INTRODUCTION OF APACHE HADOOP
  3. WHO AM I?
     Minh Tran
     KMS Technology
     Current: Software Architect at KMS Technology
     Past: Technical at Yahoo!
     Senior Engineer at MobiVi, Sciant, ELCA
     Admin at JavaVietnam
  4. OBJECTIVES
     • Understand what Apache Hadoop is
     • Understand the problems Hadoop aims to solve
     • Explore the Hadoop architecture and its ecosystem
  5. AGENDA
     • Hadoop Overview
     • Hadoop Architecture at a glance
     • Hadoop Ecosystem
     • A demo of using Hadoop
  6. AGENDA – HADOOP OVERVIEW
     • Big Data & Challenges
     • What is Hadoop?
     • Hadoop Benefits
     • Which problems can Hadoop solve?
     • Hadoop Installation
  7. WHY DO WE HAVE SO MUCH DATA?
     • Every single day
       – Twitter processes 340 million messages
       – Facebook stores 2.7 billion comments and “Likes”
       – Google processes about 24 petabytes of data
     • And every single minute
       – More than 200 million e-mails are sent
       – Foursquare processes more than 2,000 check-ins
  8. WHERE DOES DATA COME FROM?
     • Science: medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
     • Legacy: sales data, customer behavior, product databases, accounting data, etc.
     • System data: log files, network messages, web analytics, intrusion detection, spam filters
     • (Not all of this maps cleanly to the relational model)
  9. DATA ANALYSIS CHALLENGE
     • Huge volumes of data
     • Mixed sources result in many different formats
       – XML
       – CSV
       – EDI
       – Log files
       – Objects
       – SQL
       – Text
       – JSON
       – Binary
       – etc.
  10. WHAT IS HADOOP?
      • Scalable data storage and processing
        – Open source Apache project
        – Harnesses the power of commodity servers
        – Distributed and fault-tolerant
      • “Core” Hadoop consists of two main parts
        – HDFS (storage)
        – MapReduce (processing)
  11. WHO USES HADOOP?
  12. BENEFITS OF ANALYZING WITH HADOOP
      • Enables analysis that was previously impossible or impractical
      • Analysis conducted at lower cost
      • Analysis conducted in less time
      • Greater flexibility
      • Linear scalability
  13. WHICH PROBLEMS CAN HADOOP SOLVE?
      • Nature of the data
        – Complex, with multiple data sources
        – Lots of it
      • Nature of the analysis
        – Batch processing
        – Parallel execution
        – Spread data over a cluster of servers and take the computation to the data
      • Common Hadoop problems:
        – Customer churn analysis
        – Recommendation engine
        – PoS transaction analysis
        – Threat analysis
        – Search quality
        – Data “sandbox”
  14. HADOOP INSTALLATION
      1. Install a Linux machine, e.g. Ubuntu
      2. Install the latest JDK
      3. Install the Hadoop package, available for download at http://hadoop.apache.org/
  15. AGENDA
      • Hadoop Overview
      • Hadoop Architecture at a glance
      • Hadoop Ecosystem
      • A demo of using Hadoop
  16. AGENDA – HADOOP ARCHITECTURE AT A GLANCE
      • Hadoop Distributed File System
      • How MapReduce works
  17. COLLOCATED STORAGE AND PROCESSING
      • Because 10,000 hard disks are better than one
      • Solution: store and process data on the same nodes
        – Data locality: “Bring the computation to the data”
        – Reduces I/O and boosts performance
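
The data-locality idea can be sketched as a scheduling preference: given the nodes that hold a replica of a block, run the task on one of them if it is free. The function `pick_node` and the node names below are hypothetical illustrations, not Hadoop's actual scheduler API.

```python
from typing import Dict, List, Tuple


def pick_node(block: str,
              replicas: Dict[str, List[str]],
              free_nodes: List[str]) -> Tuple[str, bool]:
    """Choose a node for a task that reads `block`.

    Returns (node, is_local). A free node that already stores the block is
    preferred, so the computation moves to the data; otherwise the first
    free node is used and the block would have to be read over the network.
    """
    if not free_nodes:
        raise ValueError("no free nodes available")
    holders = set(replicas.get(block, []))
    for node in free_nodes:
        if node in holders:
            return node, True
    return free_nodes[0], False


if __name__ == "__main__":
    # Hypothetical replica placement, for illustration only.
    replicas = {"A": ["node2", "node3"]}
    print(pick_node("A", replicas, ["node1", "node3"]))
    print(pick_node("A", replicas, ["node1"]))
```

A real cluster also weighs rack placement and current load; this sketch only shows the "prefer a local replica" rule.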
  18. HARD DISK LATENCY
      • Disk seeks are expensive
      • Solution: read lots of data at once to amortize the cost
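
The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a 10 ms seek and a 100 MB/s sequential transfer rate; both figures are illustrative assumptions, not numbers from the slides.

```python
# Illustrative assumptions (not from the slides): rough spinning-disk figures.
SEEK_TIME_S = 0.010          # 10 ms per seek
TRANSFER_RATE_BPS = 100e6    # 100 MB/s sequential read


def seek_overhead(read_size_bytes: float) -> float:
    """Fraction of total time spent seeking for one seek plus one sequential read."""
    transfer_time = read_size_bytes / TRANSFER_RATE_BPS
    return SEEK_TIME_S / (SEEK_TIME_S + transfer_time)


if __name__ == "__main__":
    sizes = [("4 KB", 4 * 1024), ("1 MB", 1024 ** 2), ("64 MB", 64 * 1024 ** 2)]
    for label, size in sizes:
        print(f"{label:>6}: {seek_overhead(size):.1%} of the time is spent seeking")
```

Under these assumptions a 4 KB read is dominated by the seek, while a 64 MB read spends under 2% of its time seeking.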
  19. HDFS BLOCKS
      • When a file is added to HDFS, it’s split into blocks
      • This is a similar concept to native file systems
        – HDFS uses a much larger block size (64 MB), for performance
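
As a rough illustration of the splitting described on this slide, the sketch below computes the block sizes for a file, using the 64 MB block size quoted above. The function name `split_into_blocks` is hypothetical; it only models the arithmetic, not HDFS itself.

```python
from typing import List

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block for a file of `file_size` bytes.

    Every block is full except possibly the last one, which holds the remainder.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


if __name__ == "__main__":
    mb = 1024 * 1024
    blocks = split_into_blocks(200 * mb)
    print(len(blocks), [size // mb for size in blocks])
```

For a 200 MB file this gives four blocks: three of 64 MB and a final one of 8 MB.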
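  20. [Figure: HDFS High Level Architecture]
      Components: client application, Hadoop filesystem client, NameNode, DataNode 1, DataNode 2, DataNode 3
      • Blocks stored on each DataNode
        – DataNode 1: C, D, B
        – DataNode 2: A, C, D
        – DataNode 3: B, A, C
      • NameNode metadata for /tmp/file1.txt
        – Block A: DataNode 3, DataNode 2
        – Block B: DataNode 1, DataNode 3
        – Block C: DataNode 1, DataNode 2, DataNode 3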
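
The NameNode's role in this diagram can be sketched as a lookup from a file's blocks to the DataNodes that hold them. The data below is one reading of the diagram's labels (DataNode 1 holds C, D, B; DataNode 2 holds A, C, D; DataNode 3 holds B, A, C), and the function names are hypothetical rather than part of any Hadoop API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Blocks stored on each DataNode, as read off the diagram.
DATANODE_CONTENTS: Dict[str, List[str]] = {
    "DataNode 1": ["C", "D", "B"],
    "DataNode 2": ["A", "C", "D"],
    "DataNode 3": ["B", "A", "C"],
}

# Blocks that make up each file, as tracked by the NameNode.
FILE_BLOCKS: Dict[str, List[str]] = {"/tmp/file1.txt": ["A", "B", "C"]}


def block_locations(contents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Invert the per-DataNode listing into a block -> DataNodes map."""
    locations = defaultdict(list)
    for node in sorted(contents):
        for block in contents[node]:
            locations[block].append(node)
    return dict(locations)


def locate_file(path: str) -> List[Tuple[str, List[str]]]:
    """Return each block of `path` together with the DataNodes holding a replica."""
    locations = block_locations(DATANODE_CONTENTS)
    return [(block, locations[block]) for block in FILE_BLOCKS[path]]


if __name__ == "__main__":
    for block, nodes in locate_file("/tmp/file1.txt"):
        print(f"Block {block}: {', '.join(nodes)}")
```

The result matches the NameNode table in the diagram: block A on DataNodes 2 and 3, block B on DataNodes 1 and 3, and block C on all three.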
  21. HOW MAPREDUCE WORKS?
  22. ANOTHER EXAMPLE ABOUT BUILDING INVERTED INDEX
      • Input: a number of text files
      • Output: a list of tuples, where each tuple is a word and a list of files that contain the word
      • Input filenames and contents
        – doc1.txt: cat sat mat
        – doc2.txt: cat sat dog
      • Mappers: intermediate output
        – cat, doc1.txt
        – sat, doc1.txt
        – mat, doc1.txt
        – cat, doc2.txt
        – sat, doc2.txt
        – dog, doc2.txt
      • Reducers: output filenames and contents
        – part-r-00000: cat: doc1.txt, doc2.txt
        – part-r-00001: sat: doc1.txt, doc2.txt; dog: doc2.txt
        – part-r-00002: mat: doc1.txt
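
The inverted-index example can be simulated in a few lines of plain Python. This is a single-process sketch of the map, shuffle, and reduce steps using the two documents from the slide; it does not use the Hadoop API, and all function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(filename: str, contents: str) -> List[Tuple[str, str]]:
    """Mapper: emit one (word, filename) pair per word in the file."""
    return [(word, filename) for word in contents.split()]


def shuffle(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group intermediate pairs by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, filename in pairs:
        grouped[word].append(filename)
    return grouped


def reduce_phase(word: str, filenames: List[str]) -> Tuple[str, List[str]]:
    """Reducer: collapse the filenames for one word into a sorted, de-duplicated list."""
    return word, sorted(set(filenames))


def build_inverted_index(files: Dict[str, str]) -> Dict[str, List[str]]:
    """Run map, shuffle, and reduce in sequence over all input files."""
    intermediate: List[Tuple[str, str]] = []
    for filename, contents in files.items():
        intermediate.extend(map_phase(filename, contents))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, names) for word, names in grouped.items())


if __name__ == "__main__":
    files = {"doc1.txt": "cat sat mat", "doc2.txt": "cat sat dog"}
    for word, docs in sorted(build_inverted_index(files).items()):
        print(f"{word}: {', '.join(docs)}")
```

On a real cluster the reducer outputs are spread across several part-r-* files, as on the slide; here they are collected into one dictionary.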
  23. AGENDA
      • Hadoop Overview
      • Hadoop Architecture at a glance
      • Hadoop Ecosystem
      • A demo of using Hadoop
  24. HADOOP ECOSYSTEM
  25. AGENDA
      • Hadoop Overview
      • Hadoop Architecture at a glance
      • Hadoop Ecosystem
      • A demo of using Hadoop
  26. REFERENCES
      • Hadoop in Practice – Alex Holmes
      • Hadoop Real-World Solutions Cookbook – Jonathan R. Owens, Jon Lentz, Brian Femiano
      • Hadoop in Action – Chuck Lam
      • Hadoop: The Definitive Guide – Tom White
      • MapReduce Design Patterns – Donald Miner, Adam Shook
      • An Introduction to Hadoop – Mark Fei
      • http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
      • http://www.crobak.org/2011/12/getting-started-with-apache-hadoop-0-23-0/
  27. © 2013 KMS Technology
      THANK YOU