Hadoop and MapReduce

Hemanth Kumar Mantri, Graduate Teaching Assistant
WHAT STARTS HERE CHANGES THE WORLD




           Hadoop and MapReduce
Hemanth Kumar Mantri
  Graduate Student
     UT-Austin



   November 9th 2011




                 Agenda
•   What is Hadoop?
•   Where is MapReduce used?
•   HDFS and MapReduce
•   Amazon Web Services
•   MapReduce demo on Hadoop




            What is Hadoop?
• Inspired by the Google File System (GFS) and
  Google's MapReduce.
• Supports data-intensive distributed
  applications.
• Scales to thousands of nodes and petabytes of data.
• Open-source Apache project
• Implemented in Java
• Yahoo! is the largest contributor




Typical Hadoop Cluster!




Who Uses Hadoop?




                    Who Uses Hadoop?
•   At Google (its own MapReduce implementation, which inspired Hadoop):
     – Index construction for Google Search
     – Popular Passages in Google Books
     – Article clustering for Google News

•   At Yahoo!:
     – “Web map” powering Yahoo! Search
     – Spam detection for Yahoo! Mail
     – More than 100,000 CPUs in more than 36,000 computers

•   At Facebook:
     – Reporting/analytics and machine learning
          • Data mining, spam detection
     – Storage engine for logs
     – 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage




Facebook Lexicon




                           Yelp!
• Uses Amazon S3 to store daily logs and photos
   – Generates around 100 GB of logs per day
• Uses Amazon Elastic MapReduce for:
   –   People Who Viewed This Also Viewed
   –   Review highlights
   –   Autocomplete as you type in search
   –   Search spelling suggestions
   –   Top searches
   –   Ads
• Runs approximately 200 Elastic MapReduce jobs per day,
  processing 3 TB of data.




          Hadoop Components
• Distributed file system (HDFS)
  – Single namespace for the entire cluster
  – Design closely follows GFS
  – Replicates each block 3x (the default) for fault tolerance

• MapReduce framework
  – Executes user jobs specified as “map” and
    “reduce” functions
  – Manages work distribution and fault tolerance
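
To make the “map” and “reduce” split concrete, here is a minimal single-process sketch of a word-count job in Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names chosen for illustration, and the grouping step inside `run_job` only stands in for the shuffle that the real framework performs across machines.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all the counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Single-process stand-in for the framework (illustrative only).

    Runs every map call, groups the emitted values by key (the
    "shuffle"), then runs one reduce call per distinct key.
    """
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

The user only writes the two small functions; everything in `run_job` (splitting input, grouping by key, retrying failed tasks) is what the framework takes over on a real cluster.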




Hadoop Architecture




The Big Picture




                         Using the HDFS
• hadoop dfs
   –   [-ls <path>]
   –   [-du <path>]
   –   [-cp <src> <dst>]
   –   [-rm <path>]
   –   [-put <localsrc> <dst>]
   –   [-copyFromLocal <localsrc> <dst>]
   –   [-moveFromLocal <localsrc> <dst>]
   –   [-get [-crc] <src> <localdst>]
   –   [-cat <src>]
   –   [-copyToLocal [-crc] <src> <localdst>]
   –   [-moveToLocal [-crc] <src> <localdst>]
   –   [-mkdir <path>]
   –   [-touchz <path>]
   –   [-test -[ezd] <path>]
   –   [-stat [format] <path>]
   –   [-help [cmd]]
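
As a rough illustration of how these subcommands combine, the Python sketch below assembles the command lines for a small upload-and-read round trip. The helper names (`dfs_command`, `round_trip`, `run`) and the example paths are assumptions made for this sketch; only the subcommand names come from the list above. Nothing is executed unless `run` is called on a machine where the `hadoop` launcher is on the PATH.

```python
import subprocess

# Subcommands copied from the slide; anything else is rejected.
KNOWN_OPS = {
    "ls", "du", "cp", "rm", "put", "copyFromLocal", "moveFromLocal",
    "get", "cat", "copyToLocal", "moveToLocal", "mkdir", "touchz",
    "test", "stat", "help",
}


def dfs_command(op, *args):
    """Build the argument list for one `hadoop dfs -<op> ...` call."""
    if op not in KNOWN_OPS:
        raise ValueError("unknown dfs subcommand: " + op)
    return ["hadoop", "dfs", "-" + op, *args]


def round_trip(local_file, hdfs_dir):
    """Commands to create a directory, upload a file, list it, and print it."""
    name = local_file.rsplit("/", 1)[-1]
    return [
        dfs_command("mkdir", hdfs_dir),
        dfs_command("put", local_file, hdfs_dir),
        dfs_command("ls", hdfs_dir),
        dfs_command("cat", hdfs_dir + "/" + name),
    ]


def run(commands):
    """Execute the commands in order (requires a working Hadoop install)."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical paths; print the commands instead of running them.
    for cmd in round_trip("logs/day1.txt", "/user/demo"):
        print(" ".join(cmd))
```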




AWS and Cloud




           Amazon Web Services
• Collection of services, billed pay-as-you-go
   – S3 (Simple Storage Service)
       Storage in the cloud ($0.140/GB/month)
       Key-value store (a big HashMap!)
   – EC2 (Elastic Compute Cloud)
       Compute in the cloud ($0.085 to $2.60 per compute hour)
   – Elastic MapReduce
       Runs Hadoop jobs on EC2 using data stored in S3
   – Email service
   – ... and many more
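
The pay-per-use rates quoted on this slide lend themselves to a quick back-of-the-envelope estimate. The Python sketch below uses only those quoted figures (prices as of this talk, not current ones); the function names and the example workload of 100 GB stored and 10 nodes for 2 hours are assumptions for illustration, and data-transfer or request charges are ignored.

```python
# Rates as quoted on the slide (late 2011); not current AWS pricing.
S3_USD_PER_GB_MONTH = 0.140
EC2_USD_PER_HOUR_MIN = 0.085
EC2_USD_PER_HOUR_MAX = 2.6


def s3_monthly_cost(gigabytes):
    """Storage cost in USD for keeping `gigabytes` in S3 for one month."""
    return gigabytes * S3_USD_PER_GB_MONTH


def ec2_cost(nodes, hours, usd_per_hour):
    """Compute cost in USD for `nodes` instances running `hours` each."""
    return nodes * hours * usd_per_hour


if __name__ == "__main__":
    # Hypothetical workload: 100 GB stored, 10-node cluster for 2 hours.
    print("S3 storage:   $%.2f per month" % s3_monthly_cost(100))
    print("EC2 (small):  $%.2f" % ec2_cost(10, 2, EC2_USD_PER_HOUR_MIN))
    print("EC2 (large):  $%.2f" % ec2_cost(10, 2, EC2_USD_PER_HOUR_MAX))
```

The spread between the cheapest and the most expensive instance type dominates the bill, which is why choosing the instance size matters more than the storage cost for short jobs.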




       MapReduce on an EC2 Cluster
• Create an AWS account and get the keys for authentication
• Go to src/contrib/ec2 in the Hadoop directory
• Launch a cluster on EC2
   – % bin/hadoop-ec2 launch-cluster <cluster-name> <#nodes>
• Log in to the cluster (here <cluster-name> is test-cluster)
   – % bin/hadoop-ec2 login test-cluster
• Start the computation
   – # cd /usr/local/hadoop-*
   – # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• Terminate the cluster after use! Running instances keep accruing charges.
   – % bin/hadoop-ec2 terminate-cluster test-cluster
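
The `pi 10 10000000` job is a convenient smoke test because it needs no input data. As far as the bundled example is documented, the two arguments are the number of map tasks and the number of sample points per map, and the estimate comes from a quasi-Monte Carlo method. The Python sketch below mimics that decomposition in a single process; it is not the Hadoop implementation, and `halton`, `map_task`, `reduce_task`, and `estimate_pi` are hypothetical names for this illustration.

```python
import math


def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence for `base`."""
    result, fraction = 0.0, 1.0
    while index > 0:
        fraction /= base
        result += fraction * (index % base)
        index //= base
    return result


def map_task(task_id, n_samples):
    """Map: count how many of this task's points fall inside the circle.

    Each task works on its own slice of the point sequence, so tasks
    are independent and could run on different machines.
    """
    inside = 0
    for i in range(n_samples):
        idx = task_id * n_samples + i + 1
        x, y = halton(idx, 2), halton(idx, 3)
        # Circle of radius 0.5 centred in the unit square.
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            inside += 1
    return inside, n_samples


def reduce_task(partials):
    """Reduce: sum the per-task counts and turn the ratio into an estimate."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(n_maps, n_samples):
    return reduce_task([map_task(t, n_samples) for t in range(n_maps)])


if __name__ == "__main__":
    # Far fewer samples than the cluster run, to keep this quick locally.
    estimate = estimate_pi(10, 1000)
    print("estimate:", estimate, "error:", abs(estimate - math.pi))
```

The area of the circle divided by the area of the square is pi/4, so four times the fraction of points inside the circle approximates pi; more maps or more samples per map tighten the estimate.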




                References
• Hadoop Project Page:
  – http://hadoop.apache.org/
• Amazon Web Services:
  – http://aws.amazon.com/




Thank You!