Hadoop/MapReduce/HDFS
Team:
Wasnaa AL-Mawee
Praveen Bhat
Class: CS6550
Department of Computer Science
Western Michigan University
Introduction
• We live in the data age
  - Facebook: 1.01 billion daily active users
  - New York Stock Exchange: about 1 terabyte of new trade data per day
  - Internet Archive: stores approximately 2 petabytes
[Figure: sources of data: Enterprise, Social Media, Sensor, Public, Transaction]
Introduction
• Characteristics of data
  - Humongous in volume
  - Structured, semi-structured, and unstructured
  - Growing faster than one can imagine
• We call it Big Data!
[Figure: Big Data: Velocity, Variety, Volume]
What is the problem?
Year   Drive capacity   Transfer speed
1990   1,370 MB         4.4 MB/s
2010   1 TB             100 MB/s
2013   4 TB             146 MB/s
• Reading a full drive takes more time, because capacity has grown much faster than transfer speed.
• Traditional data storage mechanisms are insufficient.
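A rough calculation makes the gap concrete. The Python sketch below divides each drive capacity quoted above by its transfer speed to estimate the time for one full sequential read. It assumes decimal units (1 TB = 1,000,000 MB) and ignores seek time; both simplifications are assumptions, not figures from the slide.

```python
# Estimated time to read an entire drive sequentially, using the slide's figures.
# Assumption: decimal units (1 TB = 1,000,000 MB); seek time is ignored.
DRIVES = {
    1990: (1_370, 4.4),        # (capacity in MB, transfer speed in MB/s)
    2010: (1_000_000, 100.0),
    2013: (4_000_000, 146.0),
}


def full_read_seconds(capacity_mb: float, speed_mb_per_s: float) -> float:
    """Return the seconds needed to stream the whole drive once."""
    return capacity_mb / speed_mb_per_s


if __name__ == "__main__":
    for year, (capacity, speed) in sorted(DRIVES.items()):
        hours = full_read_seconds(capacity, speed) / 3600
        print(f"{year}: {hours:.2f} hours")
```

Under these assumptions the full-read time grows from roughly five minutes in 1990 to several hours in 2013.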
What do we do?
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log,
they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for
more systems of computers.”
—Grace Hopper, Computer Scientist
• Create a cluster of systems
• Store data in clustered systems
• Process data sets independently of one another
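The three steps above can be imitated on a single machine. The following Python sketch is only an analogy, not Hadoop code: it splits a list into partitions (standing in for data stored on different machines), processes each partition independently in a thread pool, and combines the partial results. The function names and the partition count are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, parts):
    """Split records into `parts` slices (one per hypothetical machine)."""
    return [records[i::parts] for i in range(parts)]


def process(chunk):
    """Work on one partition without looking at any other partition."""
    return sum(chunk)


def cluster_sum(records, parts=4):
    """Process every partition independently, then combine partial results."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = list(pool.map(process, partition(records, parts)))
    return sum(partials)


if __name__ == "__main__":
    print(cluster_sum(list(range(1, 101))))  # prints 5050
```

Because no partition depends on another, adding workers does not change the result, only how the work is spread out.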
Hadoop
Hadoop is a framework for running applications on large clusters built of
commodity hardware.
In other words, a reliable shared storage and analysis system.
Hadoop Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Journey of Hadoop
2002: Started by Doug Cutting and Mike Cafarella as part of Nutch, an open source web search engine
2003: Google's distributed file system paper published
2006: Yahoo hired Doug Cutting and supported Hadoop
2008: Yahoo announced that its search index was generated by a 10,000-core Hadoop cluster
2009: Won the minute sort by sorting 500 GB in 59 seconds
2013: More than half of the Fortune 50 use Hadoop
Current projects under Apache Hadoop
• Avro
• Cassandra
• Chukwa
• HBase
• Hive
• Mahout
• Pig
• Spark
• Tez
• ZooKeeper
Hadoop Distributed File System (HDFS)
• A file system that manages storage across a network of machines
• Designed to handle
  - Very large files: terabytes, petabytes
  - Streaming data access: write once, read many times
  - Commodity hardware: commonly available hardware
Namenodes and Datanodes
• Two types of nodes operating in a master-worker pattern
• Namenode
  - Master node
  - Manages the filesystem namespace
  - Maintains metadata for all the files and directories in the tree
• Datanode
  - Workhorses of the file system
  - Store and retrieve blocks when told to by a client or the Namenode
  - Periodically report to the Namenode
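The division of labour between the two node types can be sketched as a toy model in Python. This is not the HDFS implementation: the class names, method names, and in-memory dictionaries are hypothetical, chosen only to show that the master keeps metadata while the workers keep block contents and report what they hold.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Worker: stores block contents keyed by block id."""
    node_id: int
    blocks: dict = field(default_factory=dict)

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def retrieve(self, block_id):
        return self.blocks[block_id]

    def block_report(self):
        """Periodic report: which block ids this node currently holds."""
        return self.node_id, sorted(self.blocks)


@dataclass
class NameNode:
    """Master: holds only metadata, never the block contents."""
    namespace: dict = field(default_factory=dict)   # path -> list of block ids
    locations: dict = field(default_factory=dict)   # block id -> set of node ids

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def receive_report(self, report):
        node_id, block_ids = report
        for block_id in block_ids:
            self.locations.setdefault(block_id, set()).add(node_id)

    def block_locations(self, path):
        return {b: sorted(self.locations.get(b, ())) for b in self.namespace[path]}


if __name__ == "__main__":
    name_node = NameNode()
    name_node.add_file("/results.txt", ["A", "B"])
    dn1, dn5 = DataNode(1), DataNode(5)
    dn1.store("A", b"first")
    dn1.store("B", b"second")
    dn5.store("A", b"first")
    for dn in (dn1, dn5):
        name_node.receive_report(dn.block_report())
    print(name_node.block_locations("/results.txt"))
```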
HDFS Architecture
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
Client reading files from HDFS
[Figure: client read path. The Client asks the Name Node: "Tell me the block locations of results.txt". The Name Node's metadata lists Blk A on DN1, DN5, DN6; Blk B on DN8, DN1, DN2; Blk C on DN5, DN8, DN9, and it returns these lists to the Client. The Data Nodes are grouped behind three switches.]
• Client receives a Data Node list for each block
• Picks the first Data Node for each block
• Reads blocks sequentially
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Client-Read-from-HDFS.PNG
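The read steps above can be sketched as a toy model in Python. The block-to-node table follows the metadata shown in the slide's figure; the block contents, the function names, and the dictionaries standing in for Data Nodes are hypothetical, and nothing here calls a real HDFS API.

```python
# Toy model of the read path; the block-to-node table follows the slide's metadata.
BLOCK_LOCATIONS = {            # Name Node metadata for results.txt
    "A": [1, 5, 6],
    "B": [8, 1, 2],
    "C": [5, 8, 9],
}

# Hypothetical block contents, replicated onto every node listed for the block.
CONTENTS = {"A": b"alpha ", "B": b"beta ", "C": b"gamma"}
DATA_NODES = {node: {} for node in (1, 2, 5, 6, 8, 9)}
for block, nodes in BLOCK_LOCATIONS.items():
    for node in nodes:
        DATA_NODES[node][block] = CONTENTS[block]


def get_block_locations(path):
    """Name Node step: return the ordered Data Node list for each block."""
    return BLOCK_LOCATIONS  # single-file toy: the path is not inspected


def read_file(path):
    """Client step: pick the first Data Node per block and read blocks in order."""
    data = b""
    for block, nodes in get_block_locations(path).items():
        first = nodes[0]
        data += DATA_NODES[first][block]
    return data


if __name__ == "__main__":
    print(read_file("results.txt"))  # prints b'alpha beta gamma'
```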
Writing files to HDFS
[Figure: write path. The Client tells the Name Node: "I want to write blocks A, B, C of file.txt". The Name Node answers: "OK. Write to Data Nodes 1, 5, 6". file.txt consists of Blk A, Blk B, and Blk C; the Data Nodes shown are 1, 5, 6, and N.]
• Client consults Name Node
• Writes block directly to one Data Node
• Data Node replicates block
• Cycle repeats for next block
Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Writing-Files-to-HDFS.PNG
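The write cycle above can be sketched as a toy model in Python. The classes, the round-robin placement, and the extra Data Node 7 are hypothetical simplifications for illustration only; real HDFS placement takes racks into account and none of these names belong to the Hadoop API.

```python
class DataNode:
    """Stores blocks and forwards each one to the next node in the pipeline."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}

    def write(self, block_id, data, pipeline):
        self.blocks[block_id] = data
        if pipeline:                      # replicate to the remaining targets
            pipeline[0].write(block_id, data, pipeline[1:])


class NameNode:
    """Chooses target Data Nodes and records where each block was placed."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.block_map = {}
        self._next = 0

    def allocate(self, path, block_id):
        # Hypothetical round-robin placement; real placement is rack-aware.
        count = min(self.replication, len(self.data_nodes))
        targets = [self.data_nodes[(self._next + i) % len(self.data_nodes)]
                   for i in range(count)]
        self._next += 1
        self.block_map[(path, block_id)] = [t.node_id for t in targets]
        return targets


def write_file(name_node, path, blocks):
    """Client loop: consult the Name Node, write to one Data Node, repeat."""
    for block_id, data in blocks.items():
        targets = name_node.allocate(path, block_id)
        targets[0].write(block_id, data, targets[1:])


if __name__ == "__main__":
    nodes = [DataNode(n) for n in (1, 5, 6, 7)]
    nn = NameNode(nodes)
    write_file(nn, "file.txt", {"A": b"a", "B": b"b", "C": b"c"})
    print(nn.block_map[("file.txt", "A")])  # prints [1, 5, 6]
```

Note that the client sends each block to a single Data Node; the copies on the other two nodes come from the Data Nodes forwarding the block among themselves.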
What is MapReduce?
• MapReduce is a programming model for processing
large data sets with a parallel, distributed algorithm
on a cluster.
• Published in 2004 by Google engineers Jeffrey
Dean and Sanjay Ghemawat.
MapReduce Features
• Large-scale distributed data processing
• Parallel programming.
• Simple but restricted.
• Load Balancing
• Handling machine failure
When should we use MapReduce?
Query
• Indexing and search, such as building an inverted index
• Classification
• Filtering
Analytics
• Sorting and merging
• Frequency distribution
• Summarization and statistics
• SQL-based queries: group by, having, etc.
• Generation of graphics
Others
• Message passing, such as the breadth-first search algorithm
MapReduce Inspiration!
- Read massive data
- Map: Extract data from each record
  map (in_key, in_value) -> (out_key, intermediate_value) list
- Shuffle and Sort
- Reduce: Aggregate, filter, summarize, and transform
  reduce (out_key, intermediate_value list) -> out_value list
- Write the result
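The pipeline above can be imitated in memory with a few lines of Python. This is a single-machine sketch under simplifying assumptions, not Hadoop's implementation: `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical names, and word counting is used only as a convenient mapper and reducer pair.

```python
from collections import defaultdict


def map_words(in_key, in_value):
    """Map: emit (word, 1) for every word in one input record."""
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_counts(out_key, intermediate_values):
    """Reduce: sum all counts collected for one word."""
    return [sum(intermediate_values)]


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle-and-sort, and reduce in memory on one machine."""
    groups = defaultdict(list)
    for in_key, in_value in records:                 # map phase
        for out_key, value in mapper(in_key, in_value):
            groups[out_key].append(value)            # shuffle: group by key
    return {key: reducer(key, groups[key])           # reduce phase
            for key in sorted(groups)}               # sort: keys in order


if __name__ == "__main__":
    docs = [("doc1", "big data big cluster"), ("doc2", "data node")]
    print(run_mapreduce(docs, map_words, reduce_counts))
```

Each reducer call sees one key together with every intermediate value emitted for it, which matches the `reduce (out_key, intermediate_value list)` signature.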
MapReduce Process Architecture
MapReduce Examples
1. Word Counting
2. Inverted indexes
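As an illustration of the second example, an inverted index maps each word to the documents that contain it. The Python sketch below runs the map, group, and reduce steps in memory; the function names and sample documents are hypothetical, and a real job would run the same logic distributed across a cluster.

```python
from collections import defaultdict


def map_postings(doc_id, text):
    """Map: emit (word, doc_id) once for each distinct word in a document."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]


def reduce_postings(word, doc_ids):
    """Reduce: collapse the document ids for one word into a sorted list."""
    return sorted(set(doc_ids))


def build_inverted_index(documents):
    """Group mapper output by word, then reduce each group."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, posting in map_postings(doc_id, text):
            grouped[word].append(posting)
    return {word: reduce_postings(word, ids)
            for word, ids in sorted(grouped.items())}


if __name__ == "__main__":
    docs = {"d1": "hadoop stores data", "d2": "hadoop processes data"}
    print(build_inverted_index(docs))
```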
MapReduce Algorithms
1. MapReduce-based disease propagation detection
2. MapReduce-based trading strategies
3. MapReduce-based graph processing algorithms
Final Note!
• Open source community taking newer and larger steps
– Spark, Ceph, OpenStack
• Need for better processing
– Batch processing + Streaming
• Time to move on from Hadoop?
References
• http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705.
• http://mashable.com/2008/10/15/facebook-10-billion-photos/.
• http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret +Data+Center.aspx,
• http://www.archive.org/about/faqs.php.
• http://www.interactions.org/cms/?pid=1027032.
• Hadoop The Definitive Guide 2nd Edition by Tom White
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003
• http://www.forbes.com/sites/teradata/2015/05/22/the-future-of-hadoop-is-cloudy-with-a-chance-of-growing-ecosystem/
• R. Ranjan and R. Misra, “Epidemic Disease Propagation Detection Algorithm using MapReduce for Realistic Social Contact
Networks,” IEEE Int. Conf. on High Performance Computing and Applications, vol. 2, Bhubaneswar, Dec. 2014, pp. 1-6.
• X. Qin et al., “Optimizing Parameters of algorithm trading strategies using MapReduce,” 9th IEEE Int. Conf. Fuzzy
Systems and Knowledge Discovery, Sichuan, May 2012, pp. 2738-274.
• K. Shirahata, H. Sato, T. Suzumura, and S. Matsuoka, “A Scalable Implementation of a MapReduce-based Graph Processing
Algorithm for Large Scale Heterogeneous Supercomputers,” 13th IEEE/ACM Int. Symp. on Cluster, Cloud, and Grid
Computing, Delft, May 2013, pp. 277-284.
• G. Yang, “The Application of MapReduce in the Cloud Computing,” 2nd IEEE Int. Symp. on Intelligence Information
Processing and Trusted, Hubei, Oct. 2011, pp. 154-156.
• C. Goncalves, L. Assuncao, and J. C. Cunha, “Data Analytics in the Cloud with Flexible MapReduce Workflows,” 4th IEEE Int.
Conf. on Cloud Computing Technology and Science, Taipei, Dec. 2012, pp. 427-434.
• Count Frequencies of Words in Document. Last accessed Nov. 15, 2015. Available
on: http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf.
• Link Elevation. Last accessed Nov. 15, 2015. Available on: http://www.slideshare.net/ChicagoHUG/mr.
• Inverted indexes. Last accessed Nov. 15, 2015. Available on:
http://blog.cloudera.com/wp-content/uploads/2010/01/InvertedIndex.pdf.

Hadoop/MapReduce/HDFS

  • 1. Hadoop/MapReduce/HDFS Team: Wasnaa AL-Mawee Praveen Bhat Class: CS6550 Department of Computer Science Western Michigan University
  • 2. • We live in the data age  Facebook - 1.01b daily active users  New York Stock Exchange – 1 terabyte of new trade/day  Internet Archive stores appr. 2 petabytes Introduction Data Enterprise Social Media Sensor PublicTransaction
  • 3. • Characteristics of data  Humongous.  Structured, Semi-structured, and unstructured  Growing beyond one can imagine. • We call it Big Data! Introduction Velocity Variety Volume Big Data
  • 4. What is the problem Storage Drive capacity 1990 1370MB 2010 1 terabyte 2013 4 terabyte Transfer Speed 1990 4.4 MB/s 2010 100MB/s 2013 146MB/s • Require more time to read data from disk. • Traditional data storage mechanism insufficient
  • 5. What do we do ? “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” —Grace Hopper, Computer Scientist • Create a cluster of systems • Store data in clustered systems • Process data sets independent of one another
  • 6. Hadoop Hadoop is a framework for running applications on large cluster built of commodity hardware. In other words, A reliable shared storage and analysis system. Hadoop Modules • Hadoop Common • Hadoop Distributed File System(HDFS) • Hadoop Yarn • Hadoop MapReduce
  • 7. Journey of Hadoop 2002 Started by Dough Cutting and Mike Cafarella as a text search library 2003 Google’s distributed file system paper published Yahoo hired Dough, Supported Hadoop 2006 2008 Yahoo announced that its search index was generated by 10,000-core Hadoop cluster 2009 Won the minute sort by sorting 500 GB in 59 seconds ! 2013 More than half of the Fortune 50 use Hadoop
  • 8. Current projects under Apache Hadoop • Avro • Cassandra • Chukwa • HBase • Hive • Mahout • Pig • Spark • Tez • ZooKeeper
  • 9. Hadoop Distributed File System (HDFS) • A file system that manages storage across a network of machines • Built to handle: very large files (terabytes, petabytes); streaming data access (write once, read many times); commodity hardware (commonly available hardware)
  • 10. Namenodes and Datanodes • Two types of nodes operating in a master-worker pattern • Namenode: master node; manages the filesystem namespace; maintains metadata for all the files and directories in the tree • Datanode: workhorses of the file system; store and retrieve blocks when told to by a client or the Namenode; periodically report to the Namenode
  • 12. Client reading files from HDFS. Client to Name Node: “Tell me the block locations of results.txt.” Name Node metadata: results.txt = Blk A: DN1, DN5, DN6; Blk B: DN8, DN1, DN2; Blk C: DN5, DN8, DN9. • Client receives a Data Node list for each block • Picks the first Data Node for each block • Reads blocks sequentially. Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Client-Read-from-HDFS.PNG
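The read steps on this slide can be sketched as a small in-memory toy. This is not the HDFS client API: the dictionaries, the function names, and the block contents are hypothetical stand-ins, and only the block-to-Data-Node mapping is taken from the slide's metadata.

```python
# Toy model of the read path on the slide; all names here are illustrative only.

# Name Node metadata from the slide: file -> ordered list of (block, Data Node list)
NAMENODE_METADATA = {
    "results.txt": [
        ("A", ["DN1", "DN5", "DN6"]),
        ("B", ["DN8", "DN1", "DN2"]),
        ("C", ["DN5", "DN8", "DN9"]),
    ]
}

# Hypothetical block contents, replicated onto every Data Node listed for the block
BLOCK_CONTENTS = {"A": "alpha ", "B": "beta ", "C": "gamma"}
DATANODES = {}
for block_id, nodes in NAMENODE_METADATA["results.txt"]:
    for node in nodes:
        DATANODES.setdefault(node, {})[block_id] = BLOCK_CONTENTS[block_id]


def get_block_locations(filename):
    """Name Node side: return the Data Node list for each block of the file."""
    return NAMENODE_METADATA[filename]


def read_file(filename):
    """Client side: pick the first Data Node for each block, read blocks in order."""
    data = []
    nodes_used = []
    for block_id, nodes in get_block_locations(filename):
        first = nodes[0]
        data.append(DATANODES[first][block_id])
        nodes_used.append(first)
    return "".join(data), nodes_used


if __name__ == "__main__":
    content, used = read_file("results.txt")
    print(content, used)
```

Note that the Name Node only hands out locations in this sketch; the block data itself comes from the Data Nodes, which matches the division of labour described on the slides.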
  • 13. Writing files to HDFS. Client to Name Node: “I want to write blocks A, B, C of file.txt.” Name Node to client: “OK. Write to Data Nodes 1, 5, 6.” • Client consults the Name Node • Writes the block directly to one Data Node • Data Node replicates the block • Cycle repeats for the next block. Source: http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Writing-Files-to-HDFS.PNG
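The write cycle on this slide can be sketched in the same toy style. The classes and method names below are hypothetical and are not the HDFS API. As a simplifying assumption, the Name Node here always answers with Data Nodes 1, 5, and 6, as in the slide's example; a real placement policy would vary the targets.

```python
# Toy model of the write path on the slide; all names are illustrative only.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> data stored on this node

    def write_block(self, block_id, data, downstream):
        """Store the block, then replicate it to the next node in the list."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write_block(block_id, data, downstream[1:])


class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes  # name -> DataNode
        self.metadata = {}          # filename -> list of (block id, node names)

    def allocate_block(self, filename, block_id):
        """Answer the client with target Data Nodes for one block.
        Assumption: fixed targets DN1, DN5, DN6, as in the slide's example."""
        targets = ["DN1", "DN5", "DN6"]
        self.metadata.setdefault(filename, []).append((block_id, targets))
        return [self.datanodes[name] for name in targets]


def write_file(namenode, filename, blocks):
    """Client side: for each block, consult the Name Node, write to the first
    Data Node, and let the Data Nodes replicate among themselves."""
    for block_id, data in blocks:
        pipeline = namenode.allocate_block(filename, block_id)
        pipeline[0].write_block(block_id, data, pipeline[1:])


if __name__ == "__main__":
    nodes = {name: DataNode(name) for name in ["DN1", "DN5", "DN6", "DN9"]}
    namenode = NameNode(nodes)
    write_file(namenode, "file.txt", [("A", "aaa"), ("B", "bbb"), ("C", "ccc")])
    for name, node in nodes.items():
        print(name, sorted(node.blocks))
```

The client sends each block only once; the copies on the other two Data Nodes come from node-to-node forwarding, which is what "Data Node replicates block" refers to in this sketch.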
  • 14. What is MapReduce? • MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. • Published in 2004 by Google engineers Jeffrey Dean and Sanjay Ghemawat.
  • 15. MapReduce Features • Large-scale distributed data processing • Parallel programming • Simple but restricted • Load balancing • Handling machine failure
  • 16. When should we use MapReduce? Query: • Indexing and search, such as an inverted index • Classification • Filtering. Analytics: • Sorting and merging • Frequency distribution • Summarization and statistics • SQL-based queries: group by, having, etc. • Generation of graphics. Others: • Message passing, such as the breadth-first search algorithm
  • 17. MapReduce Inspiration! - Read massive data - Map: extract data from each record: map (in_key, in_value) → (out_key, intermediate_value) list - Shuffle and sort - Reduce: aggregate, filter, summarize, and transform: reduce (out_key, intermediate_value list) → out_value list - Write the result
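The map, shuffle-and-sort, and reduce steps on this slide can be sketched as a single-process word count. This is an in-memory illustration of the programming model, not Hadoop's MapReduce API; the function names and the sample records are assumptions made for the example.

```python
# Single-process sketch of the map -> shuffle/sort -> reduce flow (word count).
from collections import defaultdict


def map_fn(in_key, in_value):
    """map (in_key, in_value) -> list of (out_key, intermediate_value).
    Here: emit (word, 1) for every word in a line of text."""
    return [(word, 1) for word in in_value.lower().split()]


def shuffle_and_sort(pairs):
    """Group intermediate values by key and sort by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(out_key, intermediate_values):
    """reduce (out_key, intermediate_value list) -> out_value list.
    Here: sum the counts for one word."""
    return [sum(intermediate_values)]


def run_mapreduce(records):
    """Run the three phases over an iterable of (key, value) input records."""
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))
    return {key: reduce_fn(key, values)
            for key, values in shuffle_and_sort(intermediate)}


if __name__ == "__main__":
    # Hypothetical input: (line number, line text)
    records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
    print(run_mapreduce(records))
```

On a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network; the sketch keeps only the data flow between the phases.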
  • 21. MapReduce Algorithms 1. MapReduce-based disease propagation detection 2. MapReduce-based trading strategies 3. MapReduce-based graph processing
  • 22. Final Note • The open source community is taking newer and larger steps: Spark, Ceph, OpenStack • Need for better processing: batch processing plus streaming • Time to move on from Hadoop?
  • 23. References • http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705 • http://mashable.com/2008/10/15/facebook-10-billion-photos/ • http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx • http://www.archive.org/about/faqs.php • http://www.interactions.org/cms/?pid=1027032 • Tom White, Hadoop: The Definitive Guide, 2nd Edition • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, “The Google File System,” October 2003 • http://www.forbes.com/sites/teradata/2015/05/22/the-future-of-hadoop-is-cloudy-with-a-chance-of-growing-ecosystem/ • R. Ranjan and R. Misra, “Epidemic Disease Propagation Detection Algorithm using MapReduce for Realistic Social Contact Networks,” IEEE Int. Conf. on High Performance Computing and Applications, vol. 2, Bhubaneswar, Dec. 2014, pp. 1-6 • X. Qin et al., “Optimizing Parameters of algorithm trading strategies using MapReduce,” 9th IEEE Int. Conf. Fuzzy Systems and Knowledge Discovery, Sichuan, May 2012, pp. 2738-274 • K. Shirahata, H. Sato, T. Suzumura, and S. Matsuoka, “A Scalable Implementation of a MapReduce-based Graph Processing Algorithm for Large Scale Heterogeneous Supercomputers,” 13th IEEE/ACM Int. Symp. on Cluster, Cloud, and Grid Computing, Delft, May 2013, pp. 277-284 • G. Yang, “The Application of MapReduce in the Cloud Computing,” 2nd IEEE Int. Symp. on Intelligence Information Processing and Trusted, Hubei, Oct. 2011, pp. 154-156 • C. Goncalves, L. Assuncao, and J. C. Cunha, “Data Analytics in the Cloud with Flexible MapReduce Workflows,” 4th IEEE Int. Conf. on Cloud Computing Technology and Science, Taipei, Dec. 2012, pp. 427-434 • Count Frequencies of Words in Document. Last accessed Nov. 15, 2015. Available at: http://hci.stanford.edu/courses/cs448g/a2/files/map_reduce_tutorial.pdf • Link Elevation. Last accessed Nov. 15, 2015. Available at: http://www.slideshare.net/ChicagoHUG/mr • Inverted indexes. Last accessed Nov. 15, 2015. Available at: http://blog.cloudera.com/wp-content/uploads/2010/01/InvertedIndex.pdf