A new way to store and analyze data




      Presented By :: Harsha Jain
         CSE – IV Year Student

www.powerpointpresentationon.blogspot.com
Topics Covered
•   What is Hadoop?
•   Why, Where, When?
•   Benefits of Hadoop
•   How Hadoop Works?
•   Hadoop Architecture
•   Hadoop Common
•   HDFS
•   Hadoop MapReduce
•   Installation & Execution
•   Demo of installation
•   Hadoop Community




What is Hadoop?
• Hadoop was created by Douglas Reed Cutting, who named Hadoop after
  his child’s stuffed elephant, to support the Lucene and Nutch
  search engine projects.
• Open-source project administered by the Apache Software Foundation.
• Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called
MapReduce.
• Hadoop runs large-scale, high-performance processing jobs in spite
  of system changes or failures.




Hadoop, Why?
 • Need to process 100TB datasets
 • On 1 node:
– scanning @ 50MB/s = 23 days
 • On 1000 node cluster:
– scanning @ 50MB/s = 33 min
 • Need an efficient, reliable, and usable framework
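The arithmetic behind these figures can be checked directly. A quick sketch, taking 1 TB as 10^12 bytes and 1 MB as 10^6 bytes:

```python
TB, MB = 10**12, 10**6

data = 100 * TB          # dataset size
rate = 50 * MB           # scan rate per node, in bytes/second

one_node_days = data / rate / 86400            # 86400 seconds in a day
cluster_minutes = data / (rate * 1000) / 60    # 1000 nodes, result in minutes

print(f"1 node: {one_node_days:.0f} days")         # ~23 days
print(f"1000 nodes: {cluster_minutes:.0f} min")    # ~33 min
```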




Where and When Hadoop

Where
• Batch data processing, not real-time / user facing (e.g.
  Document Analysis and Indexing, Web Graphs and Crawling)
• Highly parallel, data-intensive distributed applications
• Very large production deployments (GRID)

When
• Processing lots of unstructured data
• When your processing can easily be made parallel
• When running batch jobs is acceptable
• When you have access to lots of cheap hardware




Benefits of Hadoop
• Hadoop is designed to run on cheap commodity
  hardware
• It automatically handles data replication and node
  failure
• It does the hard work – you can focus on processing
  data
• Cost-saving, efficient, and reliable data
  processing




How Hadoop Works
• Hadoop implements a computational paradigm named
  Map/Reduce, where the application is divided into many small
  fragments of work, each of which may be executed or re-executed
  on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that
  stores data on the compute nodes, providing very high aggregate
  bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so
  that node failures are automatically handled by the framework.
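The fragment-of-work idea can be illustrated with a small single-machine sketch. Names like `process_fragment` and `run_with_retries` are illustrative, not Hadoop APIs: the driver splits the input into fragments, runs each one, and simply re-executes any fragment that fails, just as the framework reschedules a task when a node dies.

```python
attempts = {}

def process_fragment(idx, fragment):
    """Stand-in for one small unit of work (a map task in Hadoop)."""
    attempts[idx] = attempts.get(idx, 0) + 1
    if idx % 3 == 0 and attempts[idx] == 1:   # simulate a node failure on the first try
        raise RuntimeError(f"node running fragment {idx} failed")
    return sum(fragment)                      # the actual "work": summing numbers

def run_with_retries(fragments, max_attempts=3):
    """Execute every fragment, re-executing failed ones as the framework would."""
    results = []
    for idx, frag in enumerate(fragments):
        for attempt in range(max_attempts):
            try:
                results.append(process_fragment(idx, frag))
                break
            except RuntimeError:
                continue                      # reschedule, possibly on another node
    return results

data = list(range(100))
fragments = [data[i:i + 10] for i in range(0, 100, 10)]  # 10 small fragments of work
print(run_with_retries(fragments))
```

Fragments 0, 3, 6, and 9 fail once and are transparently retried; the caller only sees the complete set of results.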




Hadoop Architecture
       The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

Hadoop consists of:
 • Hadoop Common*: the common utilities that support the other
   Hadoop subprojects.
 • HDFS*: a distributed file system that provides high-throughput
   access to application data.
 • MapReduce*: a software framework for distributed processing of
   large data sets on compute clusters.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed
File System (HDFS), which stores files across the storage nodes in a Hadoop cluster.
Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

* This presentation focuses primarily on Hadoop architecture and the related
subprojects.




Data Flow

[Diagram: Web servers and Scribe servers write logs to network storage; the
Hadoop cluster processes them, and the results are loaded into Oracle RAC
and MySQL.]
Hadoop Common
• Hadoop Common is a set of utilities that
  support the other Hadoop subprojects.
  Hadoop Common includes FileSystem,
  RPC, and serialization libraries.




HDFS
• Hadoop Distributed File System (HDFS) is
  the primary storage system used by
  Hadoop applications.
• HDFS creates multiple replicas of data
  blocks and distributes them on compute
  nodes throughout a cluster to enable
  reliable, extremely rapid computations.
• Replication and data locality are what make
  this reliability and speed possible
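A toy placement routine makes the replication idea concrete. This round-robin scheme is purely illustrative; real HDFS placement is rack-aware and more sophisticated:

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    assert len(datanodes) >= replication, "need at least as many nodes as replicas"
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

nodes = ["node1", "node2", "node3", "node4"]
for block, replicas in place_replicas(5, nodes).items():
    print(f"block {block} -> {replicas}")
```

Every block ends up on three distinct nodes, so any single node can fail without losing data, and a computation can be scheduled on whichever replica is closest.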


HDFS Architecture

[HDFS architecture diagram omitted]
Hadoop MapReduce
 • The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
 • Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
 • Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
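The generic-framework-plus-pluggable-user-code pattern can be sketched in a few lines of single-process Python. The function names are illustrative, not the Hadoop API:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Single-process model of the input | map | shuffle | reduce | output pipeline."""
    # map: each input record yields zero or more (key, value) pairs
    intermediate = [pair for record in inputs for pair in mapper(record)]
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # reduce: one reducer call per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# pluggable user code: count log lines mentioning "error",
# mirroring the grep | sort | uniq -c shell pipeline
mapper = lambda line: [(line, 1)] if "error" in line else []
reducer = lambda key, values: sum(values)

logs = ["error: disk", "ok", "error: disk", "error: net"]
print(map_reduce(logs, mapper, reducer))  # {'error: disk': 2, 'error: net': 1}
```

The framework owns splitting, grouping, and scheduling; the user supplies only the mapper and reducer.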



MapReduce Implementation
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to
   disk (R regions)
5. Intermediate data read &
   sort
6. Reduce tasks
7. Return




MapReduce Cluster Implementation

[Diagram: input files are divided into splits (split 0–split 4); M map tasks
produce intermediate files; R reduce tasks consume them and write the output
files (Output 0, Output 1).]

• Several map or reduce tasks can run on a single computer
• Each intermediate file is divided into R partitions by the partitioning
  function
• Each reduce task corresponds to one partition
Examples of MapReduce
                       Word Count

• Read text files and count how often words
  occur.
  o   The input is text files
  o   The output is a text file
        each line: word, tab, count
• Map: Produce pairs of (word, count)
• Reduce: For each word, sum up the
  counts.
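The word-count example above can be sketched end to end in Python. `wc_map` and `wc_reduce` are illustrative names, and the shuffle step is simulated with a dictionary:

```python
from collections import defaultdict

def wc_map(line):
    # Map: emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    # Reduce: sum all the counts emitted for this word
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

groups = defaultdict(list)          # shuffle: group emitted pairs by word
for line in lines:
    for word, one in wc_map(line):
        groups[word].append(one)

for word in sorted(groups):         # output: each line is word, tab, count
    print(f"{word}\t{wc_reduce(word, groups[word])}")
```

For the three sample lines this prints "the" with count 3, "fox" with count 2, and the remaining words with count 1, in the word-tab-count format described above.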


Let's Go…

Installation:
• Requirements: Linux, Java 1.6, sshd, rsync
• Configure SSH for password-free authentication
• Unpack the Hadoop distribution
• Edit a few configuration files
• Format the DFS on the name node
• Start all the daemon processes

Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with relevant args
• Monitor tasks via the Web interface (optional)
• Examine the output when the job is complete




Demo Video for installation

[Installation demo video omitted]
Hadoop Community

Hadoop Users
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM

Major Contributors
• Apache
• Cloudera
• Yahoo
References
• Apache Hadoop! (http://hadoop.apache.org )
• Hadoop on Wikipedia
  (http://en.wikipedia.org/wiki/Hadoop)
• Free Search by Doug Cutting
  (http://cutting.wordpress.com )
• Hadoop and Distributed Computing at Yahoo!
  (http://developer.yahoo.com/hadoop )
• Cloudera - Apache Hadoop for the Enterprise
  (http://www.cloudera.com )





Editor's Notes

  1. This is the architecture of our backend data warehousing system. This system provides important information on the usage of our website, including, but not limited to, the number of page views of each page and the number of active users in each country. We generate 3 TB of compressed log data every day. All of this data is stored and processed by the Hadoop cluster, which consists of over 600 machines. The summary of the log data is then copied to Oracle and MySQL databases to make it easy for people to access.